Unicode line breaking

local ub = require'libunibreak'

An FFI binding to libunibreak, a C library implementing the Unicode Line Breaking Algorithm (UAX #14) and the word breaking rules from Unicode Text Segmentation (UAX #29).

Line breaking

ub.linebreaks_utf8(s[,size[,lang]]) -> line_breaks
ub.linebreaks_utf16(s[,size[,lang]]) -> line_breaks
ub.linebreaks_utf32(s[,size[,lang]]) -> line_breaks

The returned line_breaks is a 0-based array of flags, one for each code unit of the input string (that is, one per byte for UTF-8):

0 Break is mandatory.
1 Break is allowed.
2 No break is possible.
3 A UTF-8/16 sequence is unfinished.
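
As an illustration of how the flags above might be consumed, here is a minimal sketch of a greedy line wrapper for UTF-8 input. The `wrap` helper is hypothetical and not part of the binding. It assumes that a flag describes the boundary after its byte, and it measures width in bytes rather than in display columns, which is a deliberate simplification.

```lua
-- Flag values as listed above.
local MUSTBREAK, ALLOWBREAK = 0, 1

-- Hypothetical helper (not part of the binding): greedy wrap of a UTF-8
-- string. `breaks` is any 0-based array with one flag per byte, so the
-- flag for Lua byte position i (1-based) is breaks[i-1].
-- Assumption: a flag describes the boundary AFTER its byte.
-- `width` is counted in bytes, not display columns.
local function wrap(s, breaks, width)
  local lines = {}
  local start = 1   -- first byte of the current line
  local last_ok     -- latest allowed break seen on the current line
  for i = 1, #s do
    local flag = breaks[i-1]
    if flag == MUSTBREAK then
      -- Mandatory break: always end the line here.
      lines[#lines+1] = s:sub(start, i)
      start, last_ok = i + 1, nil
    else
      if flag == ALLOWBREAK then last_ok = i end
      -- Line is full: cut at the latest allowed break, if there is one.
      if i - start + 1 >= width and last_ok then
        lines[#lines+1] = s:sub(start, last_ok)
        start, last_ok = last_ok + 1, nil
      end
    end
  end
  -- Keep any trailing text that was not closed by a mandatory break.
  if start <= #s then lines[#lines+1] = s:sub(start) end
  return lines
end
```

With the binding loaded, the array returned by `ub.linebreaks_utf8(s)` could be passed as `breaks`, since it is indexed from 0 in the same way. A line with no allowed break before `width` is left to overflow until the next break opportunity.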

Word breaking

ub.wordbreaks_utf8(s[,size[,lang]]) -> word_breaks
ub.wordbreaks_utf16(s[,size[,lang]]) -> word_breaks
ub.wordbreaks_utf32(s[,size[,lang]]) -> word_breaks

The returned word_breaks is a 0-based array of flags, one for each code unit of the input string (that is, one per byte for UTF-8):

0 Break is allowed.
1 No break is allowed.
2 A UTF-8/16 sequence is unfinished.
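
The word break flags can be turned into segments in much the same way. The sketch below uses a hypothetical `segments` helper (not part of the binding) and again assumes that a flag describes the boundary after its byte.

```lua
-- Flag value as listed above.
local WORDBREAK_BREAK = 0

-- Hypothetical helper (not part of the binding): split a UTF-8 string
-- into segments, cutting after every byte flagged as a word break.
-- `breaks` is any 0-based array with one flag per byte.
-- Assumption: a flag describes the boundary AFTER its byte.
local function segments(s, breaks)
  local out, start = {}, 1
  for i = 1, #s do
    if breaks[i-1] == WORDBREAK_BREAK then
      out[#out+1] = s:sub(start, i)
      start = i + 1
    end
  end
  -- Keep any trailing text that did not end on a break.
  if start <= #s then out[#out+1] = s:sub(start) end
  return out
end
```

The segments include whitespace and punctuation runs as separate entries, so a caller that wants only words would need to filter the result.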

Unicode helpers

ub.chars_utf8(s) -> iter() -> i, codepoint
ub.chars_utf16(s) -> iter() -> i, codepoint
ub.chars_utf32(s) -> iter() -> i, codepoint

Iterate over the codepoints of the string.

ub.len_utf8(s[,size]) -> len
ub.len_utf16(s[,size]) -> len
ub.len_utf32(s[,size]) -> len

Get the number of codepoints in the string.
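
To make the difference between byte length and codepoint count concrete, here is a pure-Lua sketch of counting codepoints in a UTF-8 string. The `count_codepoints` function is a hypothetical illustration, not the library's implementation, and it assumes the input is valid UTF-8.

```lua
-- Hypothetical illustration (not the library's implementation):
-- count codepoints in a valid UTF-8 string by counting every byte
-- that is not a continuation byte (continuation bytes are 0x80..0xBF).
local function count_codepoints(s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 0x80 or b >= 0xC0 then n = n + 1 end
  end
  return n
end

-- "h\195\169llo" is "héllo" written with decimal byte escapes:
-- 6 bytes, 5 codepoints.
local s = "h\195\169llo"
print(#s, count_codepoints(s))
```

For well-formed input, `ub.len_utf8(s)` would be expected to agree with this count, while `#s` gives the length in bytes.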

