yokome.features.symbol_stream
This package deals with symbol streams.
Symbol streams are a data structure for representing text. Unlike strings, however, they are based on Python generators, which makes it easy to manipulate text with state machines while always retaining information on how the text looked before manipulation.
A symbol stream itself is an iterator over symbols. A symbol is a tuple that contains an integer encoding a Unicode character as its first element. If a manipulator wishes to replace multiple characters with another one, it yields a tuple containing an integer encoding the new character in the first position and all the symbols that were replaced in the other positions, in the order of their appearance in the original symbol stream. This way, a new symbol’s lineage is stored and the original symbol stream can be restored at any time using expand().
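The tuple layout described above can be illustrated with a minimal sketch. The names `to_symbols` and `merge_ellipsis` are hypothetical and not part of the package; the code only assumes the symbol layout given in the text (character code first, replaced symbols afterwards).

```python
def to_symbols(text):
    """Hypothetical helper: yield one 1-tuple per character of the text."""
    for char in text:
        yield (ord(char),)


def merge_ellipsis(symbol_stream):
    """Hypothetical manipulator: replace three consecutive dots with '…'.

    The new symbol carries the three replaced symbols, in their original
    order, after the new character code.
    """
    pending = []  # dots seen so far that may still become an ellipsis
    for symbol in symbol_stream:
        if symbol[0] == ord('.'):
            pending.append(symbol)
            if len(pending) == 3:
                yield (ord('…'), *pending)
                pending = []
        else:
            # Fewer than three dots in a row: pass them through unchanged.
            yield from pending
            pending = []
            yield symbol
    yield from pending


symbols = list(merge_ellipsis(to_symbols('a...b')))
# symbols == [(97,), (8230, (46,), (46,), (46,)), (98,)]
```

The middle symbol encodes the new character '…' (8230) and still lists the three '.' symbols (46) it replaced.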
The character code None encodes the empty string. Thus, in the case that multiple symbols are to be added to the new stream, the special symbol (None,) may be added as a lineage indicator. Furthermore, yielding a symbol with None in the first position and old symbols in the other ones effectively deletes those symbols.
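As an illustration, here is one possible reading of these two rules. The manipulators `drop_spaces` and `expand_sharp_s` are hypothetical names, and attaching (None,) to the second inserted symbol is an assumption about how the lineage indicator is meant to be used.

```python
def drop_spaces(symbol_stream):
    """Hypothetical manipulator: delete spaces but keep their lineage."""
    for symbol in symbol_stream:
        if symbol[0] == ord(' '):
            # None in the first position: the old symbol is deleted.
            yield (None, symbol)
        else:
            yield symbol


def expand_sharp_s(symbol_stream):
    """Hypothetical manipulator: replace 'ß' with the two symbols 's', 's'."""
    for symbol in symbol_stream:
        if symbol[0] == ord('ß'):
            # The first 's' inherits the original symbol ...
            yield (ord('s'), symbol)
            # ... the second one has no original of its own (assumption:
            # this is what the (None,) lineage indicator expresses).
            yield (ord('s'), (None,))
        else:
            yield symbol


original = [(ord(char),) for char in 'a ß']
result = list(expand_sharp_s(drop_spaces(original)))
# result == [(97,), (None, (32,)), (115, (223,)), (115, (None,))]

# Rendering the result skips symbols whose character code is None.
text = ''.join(chr(symbol[0]) for symbol in result if symbol[0] is not None)
# text == 'ass'
```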
Because symbol streams are built on top of generators, manipulators can be composed into a pipeline without excessive storage requirements: upstream manipulators are asked for later parts of a text only when downstream manipulators require them.
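The lazy behavior can be observed with a small sketch built on plain generators. The helpers `to_symbols` and `tap` are hypothetical; `tap` is a pass-through stage that only records when a symbol reaches it.

```python
def to_symbols(text):
    """Hypothetical helper: yield one 1-tuple per character of the text."""
    for char in text:
        yield (ord(char),)


def tap(symbol_stream, log, name):
    """Hypothetical pass-through stage that records each symbol it handles."""
    for symbol in symbol_stream:
        log.append((name, chr(symbol[0])))
        yield symbol


log = []
pipeline = tap(tap(to_symbols('ab'), log, 'upstream'), log, 'downstream')
log_before = list(log)       # [] : building the pipeline consumes nothing

first = next(pipeline)       # request a single symbol from the last stage
log_after_first = list(log)  # [('upstream', 'a'), ('downstream', 'a')]

rest = list(pipeline)        # drain the remainder of the stream
```

After the first request, only 'a' has passed through either stage; 'b' is read from the upstream stage only once the downstream stage asks for it.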
For conversion between symbol streams and strings, see to_symbol_stream() and to_text().
-
yokome.features.symbol_stream.
ASCII_LETTER_RANGES
= ((65, 90), (97, 122))
The inclusive character code ranges of the ASCII letters (A-Z and a-z).
-
exception
yokome.features.symbol_stream.
BracketingError
(bracketing_structure, *args, **kwargs)
An exception indicating an incorrect bracketing structure.
May indicate mismatched brackets or an unbalanced bracketing structure.
- Parameters
bracketing_structure – A symbol stream containing the brackets found in an analyzed document.
-
yokome.features.symbol_stream.
ascii_case_fold
(symbol_stream)
Lowercase all uppercase ASCII letters in the stream.
- Parameters
symbol_stream – A stream over symbols.
- Returns
A symbol stream like the input symbol stream, with all uppercase ASCII letters replaced by their lowercase counterparts.
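A minimal sketch of how such a manipulator could look, not the package's actual implementation. The name `ascii_case_fold_sketch` is hypothetical, and keeping the replaced symbol as lineage is an assumption based on the package description.

```python
UPPERCASE_ASCII = (65, 90)  # character codes of 'A' through 'Z'
CASE_OFFSET = 32            # distance from an uppercase ASCII letter to its lowercase form


def ascii_case_fold_sketch(symbol_stream):
    """Hypothetical sketch: lowercase uppercase ASCII letters, keeping lineage."""
    for symbol in symbol_stream:
        code = symbol[0]
        if code is not None and UPPERCASE_ASCII[0] <= code <= UPPERCASE_ASCII[1]:
            # Assumption: the original symbol is stored as lineage.
            yield (code + CASE_OFFSET, symbol)
        else:
            # Everything else, including non-ASCII letters, passes through.
            yield symbol


folded = list(ascii_case_fold_sketch([(ord('H'),), (ord('i'),), (ord('É'),)]))
# folded == [(104, (72,)), (105,), (201,)]
```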
-
yokome.features.symbol_stream.
ascii_fold
(symbol_stream)
Replace certain Unicode characters with the ASCII character sequence that most closely transcribes them.
- Parameters
symbol_stream – A stream over symbols.
- Returns
A symbol stream like the input symbol stream, with characters from most Latin ranges of Unicode replaced by the sequence that is the closest ASCII character transcription.
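The source does not say how the transcription is chosen. The sketch below substitutes a different, well-known technique, Unicode NFKD decomposition from the standard library, to show what the resulting symbols might look like. The name `ascii_fold_sketch` is hypothetical, the lineage handling is an assumption, and characters such as 'ß' that have no decomposition are left unchanged here, which may differ from the package's behavior.

```python
import unicodedata


def ascii_fold_sketch(symbol_stream):
    """Hypothetical sketch: fold characters to ASCII via NFKD decomposition."""
    for symbol in symbol_stream:
        code = symbol[0]
        if code is None or code < 128:
            yield symbol  # already ASCII, or an empty-string symbol
            continue
        decomposed = unicodedata.normalize('NFKD', chr(code))
        ascii_chars = [char for char in decomposed if ord(char) < 128]
        if not ascii_chars:
            yield symbol  # no ASCII transcription found: leave unchanged
            continue
        # Assumption: the first new symbol carries the replaced symbol,
        # any further ones carry the (None,) lineage indicator.
        yield (ord(ascii_chars[0]), symbol)
        for char in ascii_chars[1:]:
            yield (ord(char), (None,))


folded = list(ascii_fold_sketch([(ord('é'),), (ord('ﬁ'),), (ord('日'),)]))
# folded == [(101, (233,)), (102, (64257,)), (105, (None,)), (26085,)]
```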
-
yokome.features.symbol_stream.
enumerate_alternatives
(sentence) → Iterator
Generate all sentence alternatives in sequence.
- Parameters
sentence – A sentence, split into tokens.
- Returns
An iterable over iterables over tokens, one for each token in the original sentence.
-
yokome.features.symbol_stream.
expand
(symbol_stream)
Restore the original symbol stream from a manipulated one.
- Parameters
symbol_stream – A stream over symbols.
- Returns
A stream over the symbols from which the input symbol stream was created.
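A minimal sketch of the restoring step, assuming the symbol layout from the package description. The name `expand_sketch` is hypothetical, and dropping pure (None,) lineage indicators is an assumption, since they have no counterpart in the original stream.

```python
def expand_sketch(symbol_stream):
    """Hypothetical sketch: recursively unfold symbols into their originals."""
    for symbol in symbol_stream:
        if symbol == (None,):
            # Assumption: a bare lineage indicator has no original symbol.
            continue
        if len(symbol) == 1:
            yield symbol  # an original symbol without lineage
        else:
            # Positions 1.. hold the replaced symbols, which may themselves
            # be the result of earlier manipulations.
            yield from expand_sketch(symbol[1:])


manipulated = [
    (97,),                            # 'a', untouched
    (8230, (46,), (46,), (46,)),      # '…' that replaced '...'
    (None, (32,)),                    # deleted ' '
    (115, (223,)), (115, (None,)),    # 'ss' that replaced 'ß'
]
restored = list(expand_sketch(manipulated))
# restored == [(97,), (46,), (46,), (46,), (32,), (223,)]
```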
-
yokome.features.symbol_stream.
in_ranges
(char, ranges)
Determine whether the given character is in one of several ranges.
- Parameters
char (int) – An integer encoding a Unicode character.
ranges – A sequence of pairs, where the first element of each pair is the start Unicode character code (inclusive) and the second element of each pair is the end Unicode character code (inclusive) of a range.
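The check described here can be sketched as follows. The name `in_ranges_sketch` is hypothetical, and returning a plain boolean is an assumption, since the entry does not document a return value. The constant repeats the value documented for ASCII_LETTER_RANGES.

```python
ASCII_LETTER_RANGES = ((65, 90), (97, 122))


def in_ranges_sketch(char, ranges):
    """Hypothetical sketch: test a character code against inclusive ranges."""
    return any(start <= char <= end for start, end in ranges)


is_letter = in_ranges_sketch(ord('q'), ASCII_LETTER_RANGES)    # True
is_bracket = in_ranges_sketch(ord('['), ASCII_LETTER_RANGES)   # False: 91 lies between the ranges
```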
-
yokome.features.symbol_stream.
sample_alternatives
(sentence, n, seed) → Iterator
From all sentence alternatives, sample n instances uniformly.
- Parameters
sentence – A sentence, split into tokens.
n – The number of samples to generate.
seed – Random seed used for the random number generator. For non-seeded behavior, use None.
- Returns
An iterable over iterables over tokens, one for each token in the original sentence.
-
yokome.features.symbol_stream.
to_symbol_stream
(text)
Convert a string into a symbol stream.
- Parameters
text (str) – The string to be converted.
- Returns
A stream of symbols containing exactly the characters from the specified text as symbols.
-
yokome.features.symbol_stream.
to_text
(symbol_stream)
Convert a symbol stream into a string.
- Parameters
symbol_stream – A stream over symbols.
- Returns
The string that corresponds to the symbol stream, with all lineage symbols omitted.
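The two conversions can be sketched as a round trip. The names `to_symbol_stream_sketch` and `to_text_sketch` are hypothetical stand-ins, and skipping symbols whose character code is None is an assumption that follows from None encoding the empty string.

```python
def to_symbol_stream_sketch(text):
    """Hypothetical sketch: one lineage-free symbol per character."""
    for char in text:
        yield (ord(char),)


def to_text_sketch(symbol_stream):
    """Hypothetical sketch: join the character codes, ignoring lineage."""
    return ''.join(
        chr(symbol[0])
        for symbol in symbol_stream
        if symbol[0] is not None  # None encodes the empty string
    )


round_trip = to_text_sketch(to_symbol_stream_sketch('日本語 text'))
# round_trip == '日本語 text'

manipulated = to_text_sketch([(104, (72,)), (None, (32,)), (105,)])
# manipulated == 'hi'
```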
-
yokome.features.symbol_stream.
validate_brackets
(symbol_stream, brackets) → Iterator
Validate the stream’s bracketing structure.
Yield the symbols from the symbol stream verbatim while checking for unbalanced and mismatched brackets. Raise BracketingError after yielding every symbol in an invalid input.
- Parameters
symbol_stream – A stream over symbols.
brackets – A dictionary where the keys are the chars for the opening brackets and their values are the corresponding closing brackets.
- Returns
A stream over the same symbols as the input.
- Raises
BracketingError – If brackets in the symbol stream are unbalanced or mismatched.
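The documented behavior (pass every symbol through, then raise at the end if the input was invalid) can be sketched with a stack. The names `BracketingErrorSketch` and `validate_brackets_sketch` are hypothetical, and the sketch assumes that the keys and values of `brackets` are one-character strings.

```python
class BracketingErrorSketch(Exception):
    """Hypothetical stand-in for BracketingError."""

    def __init__(self, bracketing_structure, *args):
        super().__init__(*args)
        self.bracketing_structure = bracketing_structure


def validate_brackets_sketch(symbol_stream, brackets):
    """Hypothetical sketch: yield symbols verbatim, raise afterwards if invalid."""
    closing_for = {ord(o): ord(c) for o, c in brackets.items()}
    closing = set(closing_for.values())
    expected = []  # stack of closing codes that are still awaited
    seen = []      # all bracket symbols found, in order
    valid = True
    for symbol in symbol_stream:
        code = symbol[0]
        if code in closing_for:
            seen.append(symbol)
            expected.append(closing_for[code])
        elif code in closing:
            seen.append(symbol)
            if expected and expected[-1] == code:
                expected.pop()
            else:
                valid = False  # mismatched or unopened closing bracket
        yield symbol
    if not valid or expected:
        raise BracketingErrorSketch(seen)


pairs = {'(': ')', '[': ']'}
balanced = list(validate_brackets_sketch([(40,), (97,), (41,)], pairs))

yielded = []
error = None
try:
    for symbol in validate_brackets_sketch([(40,), (93,)], pairs):  # "(]"
        yielded.append(symbol)
except BracketingErrorSketch as exc:
    error = exc
```

In the invalid case, both symbols are yielded before the exception is raised, so downstream consumers still see the complete input.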