yokome.features.symbol_stream

This package deals with symbol streams.

Symbol streams are a data structure for representing text. Unlike strings, they are based on Python generators, which makes it easy to manipulate text with state machines while always retaining information on how the text looked before manipulation.

A symbol stream itself is an iterator over symbols. A symbol is a tuple that contains an integer encoding a Unicode character as its first element. If a manipulator wishes to replace multiple characters with another one, it yields a tuple containing an integer encoding the new character in the first position and all the symbols that were replaced in the other positions, in the order of their appearance in the original symbol stream. This way, a new symbol’s lineage is stored and the original symbol stream can be restored at any time using expand().

The character code None encodes the empty string. Thus, when multiple symbols are to be added to the new stream, the special symbol (None,) may be added as a lineage indicator. Furthermore, yielding a symbol with None in the first position and old symbols in the remaining positions effectively deletes those symbols.
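The symbol layout described above can be illustrated with a minimal, self-contained sketch. It does not import this package: the helpers to_symbols, collapse_crlf, drop_spaces and render are hypothetical stand-ins written only for illustration, and the package's own functions may behave differently in detail.

```python
def to_symbols(text):
    """Yield one lineage-free symbol per character (hypothetical helper)."""
    for char in text:
        yield (ord(char),)


def collapse_crlf(symbols):
    """Replace each CR LF pair by one LF symbol that records both originals."""
    pending = None  # a CR symbol waiting to see what follows it
    for symbol in symbols:
        if pending is not None:
            if symbol[0] == 0x0A:
                # New character first, replaced symbols after it, in order.
                yield (0x0A, pending, symbol)
                pending = None
                continue
            yield pending
            pending = None
        if symbol[0] == 0x0D:
            pending = symbol
        else:
            yield symbol
    if pending is not None:
        yield pending


def drop_spaces(symbols):
    """Delete spaces: None as the character code, the old symbol as lineage."""
    for symbol in symbols:
        if symbol[0] == 0x20:
            yield (None, symbol)
        else:
            yield symbol


def render(symbols):
    """Join the character codes into a string, skipping None (empty string)."""
    return "".join(chr(symbol[0]) for symbol in symbols if symbol[0] is not None)


stream = list(drop_spaces(collapse_crlf(to_symbols("a b\r\nc"))))
text = render(stream)
```

In this sketch, the deleted space survives as (None, (32,)) and the collapsed line break as (10, (13,), (10,)), so the original characters remain recoverable from the manipulated stream.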

Because symbol streams are built on top of generators, manipulators can be composed into a pipeline without excessive storage requirements: an upstream manipulator requests later parts of the text only when a downstream manipulator needs them.
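The lazy behavior of such a pipeline can be observed with a small experiment. The following sketch is independent of this package: source and lowercase are hypothetical generators, and recording a one-to-one replacement as lineage is an assumption made for illustration.

```python
from itertools import islice

consumed = []  # characters the upstream generator has actually produced


def source(text):
    """Hypothetical upstream stage that logs every character it emits."""
    for char in text:
        consumed.append(char)
        yield (ord(char),)


def lowercase(symbols):
    """Hypothetical downstream stage: fold A-Z to a-z, keeping lineage."""
    for symbol in symbols:
        code = symbol[0]
        if code is not None and 65 <= code <= 90:
            yield (code + 32, symbol)
        else:
            yield symbol


pipeline = lowercase(source("HELLO WORLD"))

# Request only three symbols from the end of the pipeline.
first_three = list(islice(pipeline, 3))
```

After this runs, consumed holds only the three characters that were requested downstream; the rest of the text has not been touched by the upstream stage.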

For conversion between symbol streams and strings, see to_symbol_stream() and to_text().

yokome.features.symbol_stream.ASCII_LETTER_RANGES = ((65, 90), (97, 122))

The ranges of ASCII letters.

exception yokome.features.symbol_stream.BracketingError(bracketing_structure, *args, **kwargs)

An exception indicating an incorrect bracketing structure.

May indicate mismatched brackets or an unbalanced bracketing structure.

Parameters

bracketing_structure – A symbol stream containing the brackets found in an analyzed document.

yokome.features.symbol_stream.ascii_case_fold(symbol_stream)

Lowercase all symbols in the stream.

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with all uppercase ASCII letters replaced by their lowercase counterparts.

yokome.features.symbol_stream.ascii_fold(symbol_stream)

Turn certain Unicode characters into the sequence that is their closest ASCII character transcription.

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with characters from most Latin ranges of Unicode replaced by the sequence that is the closest ASCII character transcription.

yokome.features.symbol_stream.enumerate_alternatives(sentence) → Iterator

Generate all sentence alternatives in sequence.

Parameters

sentence – A sentence, split into tokens.

Returns

An iterable over iterables over tokens, one for each token in the original sentence.

yokome.features.symbol_stream.expand(symbol_stream)

Restore the original symbol stream from a manipulated one.

Parameters

symbol_stream – A stream over symbols.

Returns

A stream over the symbols from which the input symbol stream was created.
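One way to picture the restoration is as a recursive walk over each symbol's lineage. The sketch below is not the package's implementation: expand_sketch is a hypothetical name, and it assumes that lineage is unwound through all levels and that lineage-free symbols, including (None,), are passed through unchanged.

```python
def expand_sketch(symbols):
    """Yield the lineage-free symbols a manipulated stream was built from."""
    for symbol in symbols:
        if len(symbol) == 1:
            # No lineage recorded: the symbol is already an original one.
            yield symbol
        else:
            # Everything after the character code is lineage; unwind it.
            yield from expand_sketch(symbol[1:])


manipulated = [(97,), (None, (32,)), (98,), (10, (13,), (10,)), (99,)]
restored = list(expand_sketch(manipulated))
```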

yokome.features.symbol_stream.in_ranges(char, ranges)

Determine whether the given character is in one of several ranges.

Parameters
  • char (int) – An integer encoding a Unicode character.

  • ranges – A sequence of pairs, where the first element of each pair is the start Unicode character code (inclusive) and the second element of each pair is the end Unicode character code (inclusive) of a range.
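A check with these semantics could look like the following sketch. The name in_ranges_sketch is hypothetical and the body is an assumption based only on the parameter description above; the constant mirrors the documented value of ASCII_LETTER_RANGES.

```python
# Same value as documented for ASCII_LETTER_RANGES: A-Z and a-z.
ASCII_LETTER_RANGES = ((65, 90), (97, 122))


def in_ranges_sketch(char, ranges):
    """Return True if the character code lies in any inclusive range."""
    return any(start <= char <= end for start, end in ranges)


is_letter_upper = in_ranges_sketch(ord("A"), ASCII_LETTER_RANGES)
is_letter_bracket = in_ranges_sketch(ord("["), ASCII_LETTER_RANGES)
```

Both ends of each pair count as part of the range, so code 90 ("Z") matches while code 91 ("[") does not.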

yokome.features.symbol_stream.sample_alternatives(sentence, n, seed) → Iterator

From all sentence alternatives, sample n instances uniformly.

Parameters
  • sentence – A sentence, split into tokens.

  • n – The number of samples to generate.

  • seed – Random seed used for the random number generator. For non-seeded behavior, use None.

Returns

An iterable over iterables over tokens, one for each token in the original sentence.

yokome.features.symbol_stream.to_symbol_stream(text)

Convert a string into a symbol stream.

Parameters

text (str) – The string to be converted.

Returns

A stream of symbols containing exactly the characters from the specified text as symbols.

See also

to_text()

yokome.features.symbol_stream.to_text(symbol_stream)

Convert a symbol stream into a string.

Parameters

symbol_stream – A stream over symbols.

Returns

The string that corresponds to the symbol stream, with all lineage symbols omitted.

yokome.features.symbol_stream.validate_brackets(symbol_stream, brackets) → Iterator

Validate the stream’s bracketing structure.

Yield the symbols from the symbol stream verbatim while checking for unbalanced and mismatched brackets. Raise BracketingError after yielding every symbol in an invalid input.

Parameters
  • symbol_stream – A stream over symbols.

  • brackets – A dictionary where the keys are the chars for the opening brackets and their values are the corresponding closing brackets.

Returns

A stream over the same symbols as the input.

Raises

BracketingError – If brackets in the symbol stream are unbalanced or mismatched.
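The pass-through-then-raise behavior can be sketched with a stack of expected closing brackets. This is not the package's code: validate_brackets_sketch and BracketingErrorSketch are hypothetical names, and the sketch assumes that the keys and values of brackets are one-character strings and that no character serves as both an opening and a closing bracket.

```python
class BracketingErrorSketch(Exception):
    """Hypothetical stand-in for BracketingError."""

    def __init__(self, bracketing_structure, *args):
        super().__init__(*args)
        self.bracketing_structure = bracketing_structure


def validate_brackets_sketch(symbols, brackets):
    """Yield every symbol unchanged, then raise if the brackets were invalid."""
    closing = set(brackets.values())
    stack = []    # closing brackets that are still expected
    found = []    # every bracket symbol seen, in order
    valid = True
    for symbol in symbols:
        code = symbol[0]
        char = chr(code) if code is not None else None
        if char in brackets:
            stack.append(brackets[char])
            found.append(symbol)
        elif char in closing:
            found.append(symbol)
            if stack and stack[-1] == char:
                stack.pop()
            else:
                valid = False  # mismatched or unexpected closing bracket
        yield symbol
    # Only after the whole input has been passed on is the verdict given.
    if stack or not valid:
        raise BracketingErrorSketch(found)


pairs = {"(": ")", "[": "]"}
seen = []
raised = False
try:
    for symbol in validate_brackets_sketch(((ord(c),) for c in "(a]"), pairs):
        seen.append(symbol)
except BracketingErrorSketch as error:
    raised = True
    structure = error.bracketing_structure
```

Because the error is raised only once the input is exhausted, a consumer receives every symbol of an invalid document before the exception surfaces.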