Divide text into spans that each define a single grapheme, and additionally return the cell length of the whole string. The returned spans cover every index in the string, with no gaps. Some graphemes may have a cell length of zero. This can occur for nonsense strings like two zero width joiners, or for control codes that don't contribute to the grapheme size.
split_graphemes(text: str, unicode_version: str = "auto")
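The spans are documented as `(start, end, cell_length)` tuples that cover every index of the string with no gaps. A small helper can make that contract concrete. The sketch below is illustrative only: `spans_cover_text` is a hypothetical name, not part of the library, and the example spans are written by hand from the documented behavior rather than captured from a real call.

```python
def spans_cover_text(text: str, spans: list[tuple[int, int, int]]) -> bool:
    """Return True if `spans` cover every index of `text` in order, with no gaps or overlaps."""
    expected_start = 0
    for start, end, cell_length in spans:
        # Each span must begin where the previous one ended and must not be empty.
        if start != expected_start or end <= start:
            return False
        # A cell length may be zero, but never negative.
        if cell_length < 0:
            return False
        expected_start = end
    # The final span must end exactly at the end of the string.
    return expected_start == len(text)


# Hand-written example (an assumption, not library output):
# two zero width joiners represented as a single span of cell length zero.
text = "\u200d\u200d"
spans = [(0, 2, 0)]
print(spans_cover_text(text, spans))
```

A check like this may be useful in tests for code that consumes the spans, since it would catch a span list that skips or repeats a string index.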
def split_graphemes(
    text: str, unicode_version: str = "auto"
) -> "tuple[list[CellSpan], int]":
    """Divide text into spans that define a single grapheme, and additionally return the cell length of the whole string.

    The returned spans will cover every index in the string, with no gaps. It is possible for some graphemes to have a cell length of zero.
    This can occur for nonsense strings like two zero width joiners, or for control codes that don't contribute to the grapheme size.

    Args:
        text: String to split.
        unicode_version: Unicode version, `"auto"` to auto detect, `"latest"` for the latest Unicode version.

    Returns:
        A tuple of a list of *spans* and the cell length of the entire string. Each span is a tuple
        of three values, (<START>, <END>, <CELL LENGTH>), where START and END are string indices,
        and CELL LENGTH is the cell length of the single grapheme.
    """

    cell_table = load_cell_table(unicode_version)
    codepoint_count = len(text)
    index = 0
    last_measured_character: str | None = None

    total_width = 0
    spans: list[tuple[int, int, int]] = []
    SPECIAL = {"\u200d", "\ufe0f"}
    while index < codepoint_count:
        if (character := text[index]) in SPECIAL:
            if not spans:
                # ZWJ or variation selector at the beginning of the string doesn't really make sense.
                # But handle it, we must.
                spans.append((index, index := index + 1, 0))
                continue
            if character == "\u200d":
                # zero width joiner
                # The condition handles the case where a ZWJ is at the end of the string, and has nothing to join
                index += 2 if index < (codepoint_count - 1) else 1
                start, _end, cell_length = spans[-1]
                spans[-1] = (start, index, cell_length)
            else:
                # variation selector 16
                index += 1
                if last_measured_character:
                    start, _end, cell_length = spans[-1]
                    if last_measured_character in cell_table.narrow_to_wide:
                        last_measured_character = None
                        cell_length += 1
                        total_width += 1
                    spans[-1] = (start, index, cell_length)
                else:
                    # No previous character to change the size of.
                    # Shouldn't occur in practice.
                    # But handle it, we must.
                    start, _end, cell_length = spans[-1]
                    spans[-1] = (start, index, cell_length)
            continue

        if character_width := get_character_cell_size(character, unicode_version):