yokome.features.jpn

This package makes extensive use of symbol streams. To understand how this data structure is defined, see symbol_stream.

yokome.features.jpn.chasen_loader(filename)

Loads a file from the JEITA corpus and yields symbols from it.

Parameters

filename (str) – The filename of the document to load.

Returns

A symbol stream that encodes the text from the loaded document.

yokome.features.jpn.combining_voice_mark_fold(symbol_stream)

Normalize words with combining voice marks.

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with combining voice/semi-voice marks combined with their preceding voicable/semi-voicable symbols to form voiced/semi-voiced symbols. Voice/semi-voice marks that do not follow a voicable/semi-voicable symbol are replaced by KATAKANA-HIRAGANA VOICED SOUND MARK (U+309B) / KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK (U+309C).
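The folding rule described above can be sketched on a plain string. This is only an illustration: it assumes `str` input rather than a symbol stream, the name `fold_voice_marks` is hypothetical, and Unicode NFC composition stands in for whatever lookup the package itself uses.

```python
import unicodedata

# Combining marks and the spacing marks that replace them when they
# cannot be attached to the preceding character.
COMBINING_TO_SPACING = {"\u3099": "\u309b", "\u309a": "\u309c"}


def fold_voice_marks(text: str) -> str:
    """Hypothetical helper: compose combining (semi-)voice marks."""
    out = []
    for ch in text:
        if ch in COMBINING_TO_SPACING and out:
            # NFC yields a single character only if the pair is composable.
            composed = unicodedata.normalize("NFC", out[-1] + ch)
            if len(composed) == 1:
                out[-1] = composed
                continue
        # Stray marks become spacing marks; everything else is kept.
        out.append(COMBINING_TO_SPACING.get(ch, ch))
    return "".join(out)
```

For instance, `fold_voice_marks("か\u3099")` gives "が", while a mark after "あ" is turned into U+309B because "あ" has no voiced form.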

yokome.features.jpn.content_sentences(symbol_streams)

Filter out non-content symbol streams.

Parameters

symbol_streams – An iterable over symbol streams.

Returns

An iterable over all symbol streams that contain content symbols.

yokome.features.jpn.fullwidth_fold(symbol_stream)

Turn the ASCII space, the Latin letters in ASCII, and the halfwidth forms into their fullwidth counterparts.

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with halfwidth characters replaced by their fullwidth counterparts.
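A minimal sketch of this kind of folding on a plain string follows. It assumes `str` input instead of a symbol stream, the name `to_fullwidth` is hypothetical, and only the halfwidth punctuation and katakana range U+FF61 to U+FF9F is covered; the package may handle a different set of halfwidth forms.

```python
import unicodedata


def to_fullwidth(text: str) -> str:
    """Hypothetical helper: widen spaces, ASCII letters and halfwidth kana."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == " ":
            out.append("\u3000")          # IDEOGRAPHIC SPACE
        elif ch.isascii() and ch.isalpha():
            out.append(chr(cp + 0xFEE0))  # ASCII letter -> fullwidth letter
        elif 0xFF61 <= cp <= 0xFF9F:
            # Halfwidth CJK punctuation and katakana: use the
            # compatibility mapping to the regular-width character.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

Digits and ASCII punctuation are left untouched in this sketch, since the description above names only the space, the Latin letters and the halfwidth forms.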

yokome.features.jpn.hiragana_to_katakana(phrase: str) → str

Convert hiragana to katakana.

Does not handle the use of prolonged sound marks.

Parameters

phrase (str) – The phrase in which to replace all hiragana characters by katakana characters.
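The conversion can be approximated with a fixed code point offset, since the hiragana letters U+3041 to U+3096 sit exactly 0x60 below their katakana counterparts. The sketch below is an assumption about the approach, not the package's code, and the name `hira_to_kata` is hypothetical.

```python
def hira_to_kata(phrase: str) -> str:
    """Hypothetical helper: shift hiragana letters into the katakana block."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in phrase
    )
```

Long vowels are converted letter by letter, so "おかあさん" becomes "オカアサン" and no prolonged sound mark (ー) is introduced.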

yokome.features.jpn.is_content_sentence(symbol_stream)

Detect whether the symbol stream contains content symbols.

Parameters

symbol_stream – A stream over symbols.

Returns

True if the symbol stream contains content symbols, else False.

yokome.features.jpn.is_reading(phrase: str) → bool

Determine whether the specified phrase is a reading representation.

Reading representations contain only characters from the hiragana, hiragana phonetic extensions, katakana and katakana phonetic extensions Unicode blocks, as well as the wave dash (U+301c) and the fullwidth tilde (U+ff5e). The unused code points U+3040, U+3097 and U+3098 are excluded.

Phrases that consist only of the wave dash, the fullwidth tilde, or the katakana middle dot (U+30fb) are not considered readings. In JMdict, they are used as the headlines for descriptive entries for these forms of punctuation.

Parameters

phrase (str) – The phrase to test.

Returns

True if the specified phrase is a reading representation, False otherwise.
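A regular-expression sketch of such a test is given below. It rests on several assumptions: only the Hiragana, Katakana and Katakana Phonetic Extensions blocks are covered, code points that Unicode leaves unassigned in the hiragana block are skipped, the empty string is rejected, and "consist only of" is read as "made up entirely of these punctuation characters". The name `looks_like_reading` is hypothetical.

```python
import re

# Kana blocks (without unassigned hiragana code points), plus the
# wave dash (U+301C) and the fullwidth tilde (U+FF5E).
_KANA = re.compile(
    r"[\u3041-\u3096\u3099-\u309f\u30a0-\u30ff\u31f0-\u31ff\u301c\uff5e]+"
)
# Phrases made up solely of wave dash, fullwidth tilde or middle dot.
_PUNCTUATION_ONLY = re.compile(r"[\u301c\uff5e\u30fb]+")


def looks_like_reading(phrase: str) -> bool:
    """Hypothetical helper: kana-only phrase that is not mere punctuation."""
    return (
        _KANA.fullmatch(phrase) is not None
        and _PUNCTUATION_ONLY.fullmatch(phrase) is None
    )
```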

yokome.features.jpn.iteration_fold(symbol_stream)

Normalize words with iteration marks.

Replace each kana/kanji iteration mark with the characters it stands for.

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with iteration characters replaced by the characters that they stand for.
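The replacement of iteration marks can be illustrated on a plain string. The sketch assumes that each mark repeats exactly one preceding character (so multi-character repetitions are out of scope), uses Unicode normalization to add or remove the voice mark, and uses the hypothetical name `expand_iteration_marks`; the package's own rules may be richer.

```python
import unicodedata


def _unvoiced(ch: str) -> str:
    # The first character of the canonical decomposition is the base kana.
    return unicodedata.normalize("NFD", ch)[0]


def _voiced(ch: str) -> str:
    composed = unicodedata.normalize("NFC", _unvoiced(ch) + "\u3099")
    # Fall back to the unchanged character if no voiced form exists.
    return composed if len(composed) == 1 else ch


def expand_iteration_marks(text: str) -> str:
    """Hypothetical helper: replace 々, ゝ/ヽ and ゞ/ヾ by what they repeat."""
    out = []
    for ch in text:
        if out and ch == "\u3005":              # 々 repeats the character
            out.append(out[-1])
        elif out and ch in ("\u309d", "\u30fd"):  # ゝ, ヽ: unvoiced repeat
            out.append(_unvoiced(out[-1]))
        elif out and ch in ("\u309e", "\u30fe"):  # ゞ, ヾ: voiced repeat
            out.append(_voiced(out[-1]))
        else:
            out.append(ch)
    return "".join(out)
```

A mark at the very start of the input has nothing to repeat and is kept as it is.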

yokome.features.jpn.longest_common_prefix_len(a, b)

Determine the length of the longest common prefix of two strings.

Parameters
  • a (str) – The first string.

  • b (str) – The second string.

Returns

The length of the longest common prefix of both strings.
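An equivalent computation, written as a self-contained sketch under the hypothetical name `common_prefix_len`:

```python
def common_prefix_len(a: str, b: str) -> int:
    """Count leading positions at which both strings agree."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length
```

For the inflected forms "たべる" and "たべた" this yields 2, the length of the shared stem "たべ".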

yokome.features.jpn.match_reading(splits)

Match graphic and phonetic word representations and lemma.

Discern the notations ‘\ ’ for space and ‘\’ for backslash (with ‘ ’ as field separator) in JUMAN++ output.

Parameters

splits (list[str]) – The sections for word token (graphic), word token (phonetic), and lemma, split on ‘ ’ from a joint string representation with ‘ ’ as separator. The input may contain more than three elements.

Returns

A triple consisting of the graphic word token, the phonetic word token, and the lemma.

yokome.features.jpn.parse_jumanpp_output(output)

Parse JUMAN++ tokenizer output format.

The output has one token per line, with space-separated annotations. Regular tokens have twelve annotations. Lines that describe an alternative for a preceding regular token have the same twelve annotations, preceded by an additional ‘@ ’ at the beginning of the line.

Processing starts from the end of the line, since the first three annotation types are ambiguous: Spaces are denoted as ‘\ ’, while backslashes are denoted by ‘\’ only, resulting in conflicting interpretations for ‘\ ’ as “space”, and “backslash” + “end of annotation”, respectively.

Furthermore, ‘"’ is not escaped or enclosed in single quotation marks, while the last annotation, if present, is always enclosed in double quotation marks. Thus, manual line splitting is necessary, and cannot be done via shlex.

The remaining annotation types are a fixed set of keywords, with odd and even annotations encoding the same information, once in string form and once as a numerical ID.

Parameters

output (str) – The raw output of JUMAN++.

Returns

An iterable over tuples of candidates, each candidate being one of the possible tokens for its token position in the iterable. A candidate is a dictionary of the form described in to_dict().
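The right-to-left strategy described for this parser can be sketched as follows. Everything here is illustrative: the name `split_annotations` is hypothetical, the sample lines in the usage note are hand-written rather than captured JUMAN++ output, and the sketch assumes that the final annotation is either `NIL` or a double-quoted string without inner double quotation marks.

```python
import re

# Final annotation: the keyword NIL or a double-quoted string
# (which may itself contain spaces).
_LAST_ANNOTATION = re.compile(r' (NIL|"[^"]*")$')


def split_annotations(line: str):
    """Hypothetical helper: peel annotations off the end of a line.

    Returns the still-ambiguous head (graphic form, reading and lemma),
    the eight keyword/ID annotations, and the final annotation.
    """
    match = _LAST_ANNOTATION.search(line)
    if match is None:
        raise ValueError("unexpected final annotation")
    # Split off exactly eight fields from the right; the remainder may
    # contain escaped spaces and is left for a later matching step.
    parts = match.string[:match.start()].rsplit(" ", 8)
    if len(parts) != 9:
        raise ValueError("too few annotations")
    return parts[0], parts[1:], match.group(1)
```

With the hand-written line `食べ たべ 食べる 動詞 2 * 0 母音動詞 1 基本連用形 8 "代表表記:食べる/たべる"`, the head is `食べ たべ 食べる`. For a token that is itself a space, the head still contains the escaped spaces, which is why the first three annotations need separate treatment (see match_reading()).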

yokome.features.jpn.repetition_contraction(symbol_stream)

Contract representations of repetition symbols in the input stream.

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with repetition symbols contracted to one symbol only.

yokome.features.jpn.segmenter(symbol_stream, whitespace_marks_end_of_paragraph=False)

Accept a stream of symbols and yield symbol streams for each sentence.

This function works most reliably with balanced bracketing.

Parameters
  • symbol_stream – A stream over symbols.

  • whitespace_marks_end_of_paragraph (bool) – Whether whitespace marks the end of a paragraph in the symbol stream.

Returns

An iterable over symbol streams, each corresponding to a sentence.
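Why balanced bracketing matters can be seen in a simplified segmenter that only ends a sentence outside of brackets. This is a sketch under assumptions: it works on a plain string, the bracket and terminator sets are chosen for illustration, the name `split_sentences` is hypothetical, and paragraph handling is left out. The package's actual rules are not documented here.

```python
OPENING = "「『（"
CLOSING = "」』）"
TERMINATORS = "。！？"


def split_sentences(text: str) -> list:
    """Hypothetical helper: split at terminators outside of brackets."""
    sentences, current, depth = [], [], 0
    for ch in text:
        current.append(ch)
        if ch in OPENING:
            depth += 1
        elif ch in CLOSING:
            depth = max(depth - 1, 0)
        elif ch in TERMINATORS and depth == 0:
            sentences.append("".join(current))
            current = []
    if current:  # trailing text without a terminator
        sentences.append("".join(current))
    return sentences
```

An unclosed opening bracket keeps `depth` above zero for the rest of the input, so no further sentence boundary is found; this is the kind of degradation that unbalanced bracketing causes in such a scheme.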

yokome.features.jpn.semivoice(char: int) → int

Return the semi-voiced version of char.

Parameters

char (int) – An unvoiced Unicode character to semi-voice.

Returns

The Unicode character that is the semi-voiced version of char.

yokome.features.jpn.stream_tokenizer(symbol_stream, partially_annotated=False)

Tokenize a symbol stream using JUMAN++, in a synchronous fashion.

Parameters
  • symbol_stream – The symbol stream to tokenize.

  • partially_annotated (bool) – Whether the input is partially annotated.

Returns

An iterable over tuples of candidates, each candidate being one of the possible tokens for its token position in the iterable. A candidate is a dictionary of the form described in to_dict().

yokome.features.jpn.strip(symbol_streams)

Remove leading and trailing whitespace from a symbol stream.

The definition of whitespace is language-dependent and refers to Japanese conventions here.

As is generally the case with symbol streams, all removed whitespace can be restored using symbol_stream.expand().

Parameters

symbol_stream – A stream over symbols.

Returns

A symbol stream like the input symbol stream, with leading and trailing whitespace replaced by None-symbols.

yokome.features.jpn.to_dict(token)

Turn an array of JUMAN++-style token annotations into a dictionary.

Parameters

token – An array version of a line of JUMAN++ output. It either has twelve elements and is the first candidate for a token, or it has thirteen elements and is a later candidate for a token. In the latter case the first element is '@'.

Returns

A dictionary describing the token candidate corresponding to the input. It has the following form:

{
  'surface_form': {'graphic': ..., 'phonetic': ...},
  'base_form': {'graphic': ..., 'phonetic': ...},
  'lemma': {'graphic': ..., 'phonetic': ...},
  'pos': <list of POS tags>,
  'inflection': <list of POS/inflection tags>
}

The surface form is the inflected form as it was found in the text, along with its reading in katakana. The base form is the uninflected form. Both its graphic representation and its reading may differ from those of the lemma when a lexeme has several graphic variants. The lemma is the canonical form for both graphic representation and reading, intended to be unique for all variants of a lexeme.

yokome.features.jpn.to_morae(symbol_stream)

Group morae in a symbol stream.

A mora is a subunit of a syllable that may consist of multiple characters. For Japanese, it is the logical unit of counting sounds of speech. A Japanese syllable typically consists of one mora or two morae where the second mora prolongs the first. A Japanese mora consists of a regular kana letter or a kana letter and an ensuing glide sound, e.g. “ち”, “ゆ” or “ちゅ” (but not “ちゅう”).

Parameters

symbol_stream – A stream over symbols.

Returns

A list of morae, each consisting of its symbols.
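The grouping can be sketched on a plain string by attaching small glide kana to the preceding letter. The sketch assumes `str` input instead of a symbol stream, treats only small ya/yu/yo as glides, and uses the hypothetical name `group_morae`.

```python
# Small kana that form a glide with the preceding letter.
GLIDES = set("ゃゅょャュョ")


def group_morae(text: str) -> list:
    """Hypothetical helper: group kana letters with following glides."""
    morae = []
    for ch in text:
        if ch in GLIDES and morae:
            morae[-1] += ch   # e.g. "ち" + "ゅ" -> "ちゅ"
        else:
            morae.append(ch)
    return morae
```

"ちゅう" is thus split into the two morae "ちゅ" and "う".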

yokome.features.jpn.tokenize_async(text, partially_annotated=False)

Tokenize a text using JUMAN++, in an asynchronous fashion.

Waiting for the tokenization result is performed asynchronously, but the token candidates are yielded in a blocking fashion, i.e. every coroutine building on this tokenizer has access to all resulting tokens without interference from other coroutines.

Parameters
  • text (str) – The text to tokenize.

  • partially_annotated (bool) – Whether the input is partially annotated.

Returns

An iterable over tuples of candidates, each candidate being one of the possible tokens for its token position in the iterable. A candidate is a dictionary of the form described in to_dict().

yokome.features.jpn.tokenize_stream_async(symbol_stream, partially_annotated=False)

Tokenize a symbol stream using JUMAN++, in an asynchronous fashion.

Waiting for the tokenization result is performed asynchronously, but the token candidates are yielded in a blocking fashion, i.e. every coroutine building on this tokenizer has access to all resulting tokens without interference from other coroutines.

Parameters
  • symbol_stream – The symbol stream to tokenize.

  • partially_annotated (bool) – Whether the input is partially annotated.

Returns

An iterable over tuples of candidates, each candidate being one of the possible tokens for its token position in the iterable. A candidate is a dictionary of the form described in to_dict().

yokome.features.jpn.tokenizer(text, partially_annotated=False)

Tokenize a text using JUMAN++, in a synchronous fashion.

Parameters
  • text (str) – The text to tokenize.

  • partially_annotated (bool) – Whether the input is partially annotated.

Returns

An iterable over tuples of candidates, each candidate being one of the possible tokens for its token position in the iterable. A candidate is a dictionary of the form described in to_dict().

yokome.features.jpn.unsemivoice(char: int) → int

Return the unvoiced version of char.

Parameters

char (int) – A semi-voiced Unicode character to unvoice.

Returns

The Unicode character that is the unvoiced version of char.

yokome.features.jpn.unvoice(char: int) → int

Return the unvoiced version of char.

Parameters

char (int) – A voiced Unicode character to unvoice.

Returns

The Unicode character that is the unvoiced version of char.

yokome.features.jpn.voice(char: int) → int

Return the voiced version of char.

Parameters

char (int) – An unvoiced Unicode character to voice.

Returns

The Unicode character that is the voiced version of char.
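The four conversions voice(), unvoice(), semivoice() and unsemivoice() operate on code points. One way to reproduce their effect is through Unicode canonical (de)composition with the combining marks U+3099 and U+309A. This is a sketch of that idea and not necessarily how the package implements it; the helper names are hypothetical, and raising ValueError for characters without a matching form is an assumption.

```python
import unicodedata

VOICE_MARK = "\u3099"      # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
SEMIVOICE_MARK = "\u309a"  # COMBINING ... SEMI-VOICED SOUND MARK


def _add_mark(char: int, mark: str) -> int:
    composed = unicodedata.normalize("NFC", chr(char) + mark)
    if len(composed) != 1:
        raise ValueError("no precomposed form for U+%04X" % char)
    return ord(composed)


def _remove_mark(char: int, mark: str) -> int:
    decomposed = unicodedata.normalize("NFD", chr(char))
    if len(decomposed) != 2 or decomposed[1] != mark:
        raise ValueError("U+%04X does not carry the mark" % char)
    return ord(decomposed[0])


def voice_cp(char: int) -> int:
    return _add_mark(char, VOICE_MARK)


def semivoice_cp(char: int) -> int:
    return _add_mark(char, SEMIVOICE_MARK)


def unvoice_cp(char: int) -> int:
    return _remove_mark(char, VOICE_MARK)


def unsemivoice_cp(char: int) -> int:
    return _remove_mark(char, SEMIVOICE_MARK)
```

As with the documented functions, arguments and results are integers, so a call looks like `chr(voice_cp(ord("か")))`, which gives "が".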