utf8

UTF-8 in Lua

 
 
View on GitHub
Download as .tar.gz
Download as .zip

local utf8 = require'utf8'

Low-level module for working with UTF-8 encoded strings. Byte indices are i's, char indices are ci's, and "char" means unicode codepoint. Invalid characters are counted as 1-byte chars so they don't get lost. Validate/sanitize beforehand as needed.

utf8.byte_indices(s) -> iterator<i, valid> iterate the chars in a string, returning the byte index followed by a valid flag1, for each char.
utf8.byte_indices_reverse(s) -> iterator<i> iterate the chars in a string in reverse, returning the byte index of each char.
utf8.next(s, last_i) -> i, isvalid | nil byte index of the next char after the char at byte index i, followed by a valid flag***. nil if out of range.
utf8.prev(s, i) -> prev_i byte index of the prev. char before the char at byte index i, or nil if i is out of range.
utf8.byte_index(s, target_ci) -> i byte index given char index. nil if the index is out of range.
utf8.char_index(s, target_i) -> ci char index given byte index. nil if the index is out of range.
utf8.len(s) -> len number of chars in string
utf8.sub(s, start_ci[, end_ci]) -> s sub based on char indices (also, start_ci must be >= 1 and end_ci, if given, can't be negative)
utf8.contains(s, i, sub) -> true|false check if a string contains a substring at byte index i
utf8.count(s, sub) -> n count the number of occurences of a substring in a string
utf8.isvalid(s, i) -> true|false check if there's a valid utf8 codepoint at byte index i
utf8.valid_byte_indices(s) -> iterator<i> iterate valid chars, returning the byte index where each char starts.
utf8.next_valid(s, last_i) -> i | nil byte index of the next valid utf8 char after the char at byte index i. nil if out of range. invalid chars are skipped.
utf8.validate(s) assert that a string only contains valid utf8 characters; raise an error if that's not the case

Extending

At the heart of the module is the utf8.next function, which you can redefine for different semantics. In particular, you can reassign utf8.next to utf8.next_valid to change the behavior of the entire module to skip on invalid indices. Preferably you would not do that directly on the module table returned by require, but make a new module instead:

my_utf8.lua:

local glue = require'glue'
local utf8 = require'utf8'
return glue.merge({utf8.next = utf8.next_valid}, utf8)

  1. when iterating, validation only performed on the first byte. to fully validate that a utf8 codepoint is within the accepted range, use utf8.isvalid().