yokome.language

class yokome.language._lang.Language(code, name, *, loader, tokenizer, extractor, parallel_extractor)

Resources of a specific language.

Stores information about the language and provides methods for text analysis that are tailored to that language.

static by_code(code)

Look up a language by its unique identifier.
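The lookup can be pictured as a registry keyed by language code. The following is a minimal sketch and not the yokome implementation: the class MiniLanguage, its _registry dictionary, the KeyError on unknown codes, and the toy loader and tokenizer are all assumptions made for illustration. It also shows the pattern used by the properties documented here, where a property returns a function that the caller then invokes.

```python
class MiniLanguage:
    """Hypothetical stand-in for Language; not the yokome implementation."""

    _registry = {}  # maps language code -> MiniLanguage instance

    def __init__(self, code, name, *, loader, tokenizer):
        self._code = code
        self._name = name
        self._loader = loader
        self._tokenizer = tokenizer
        # Register the instance under its code so by_code() can find it.
        MiniLanguage._registry[code] = self

    @staticmethod
    def by_code(code):
        """Return the language registered under code (KeyError if unknown)."""
        return MiniLanguage._registry[code]

    @property
    def code(self):
        return self._code

    @property
    def load(self):
        # The property returns the function itself; the caller invokes it.
        return self._loader

    @property
    def tokenize(self):
        return self._tokenizer


# Toy callables, assumptions for illustration only.
demo = MiniLanguage(
    'demo', 'Demo language',
    loader=lambda n_samples=None: ('a b', 'c d')[:n_samples],
    tokenizer=lambda sentence: tuple(sentence.split()),
)

lang = MiniLanguage.by_code('demo')
first_sentence = lang.load(1)[0]
tokens = lang.tokenize(first_sentence)
```

Because load and tokenize are properties, lang.load(1) first retrieves the stored function and then calls it with the argument 1.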

property code

The unique identifier of this language.

This is usually the ISO 639-3 language code of this language.

property extract

Function to turn an iterable of tokens into language model input.

Differs from extract_parallel() only for character-level extracts.

Parameters

tokens – An iterable of tokens (see tokenize() for the token representation).

Returns

An iterable of token identifiers that is understood by the language model.

property extract_parallel

Function to turn an iterable of tokens into language model input.

Differs from extract() only for character-level extracts.

Parameters

tokens – An iterable of tokens (see tokenize() for the token representation).

Returns

An iterable of token identifiers that is understood by the language model.
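As a rough illustration of what an extractor might do, the sketch below maps each token to an integer identifier through a vocabulary. The helper make_extractor, the choice to key the vocabulary on the graphic lemma, and the reserved identifier 0 for unknown tokens are all assumptions; the source does not specify how yokome derives identifiers or how the character-level variants differ.

```python
def make_extractor(vocabulary, unknown_id=0):
    """Build a function that maps tokens to integer identifiers.

    Hypothetical helper: looks tokens up by their graphic lemma.
    """
    def extract(tokens):
        for token in tokens:
            # Tokens missing from the vocabulary fall back to unknown_id.
            yield vocabulary.get(token['lemma']['graphic'], unknown_id)
    return extract


# Toy vocabulary; real identifiers would come from the language model.
vocabulary = {'猫': 1, '食べる': 2}
extract = make_extractor(vocabulary)

# Only the 'lemma' field is filled in here; the other token fields
# are omitted for brevity.
tokens = [
    {'lemma': {'graphic': '猫', 'phonetic': 'ネコ'}},
    {'lemma': {'graphic': '魚', 'phonetic': 'サカナ'}},
    {'lemma': {'graphic': '食べる', 'phonetic': 'タベル'}},
]
ids = list(extract(tokens))
```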

property load

Function to load corpus sentences in this language.

The order of sentences is randomized. The randomization is independent of the number of samples requested and is consistent between calls that request the same number of samples.

Does not necessarily load the sentences themselves, but may provide IDs if tokenize(), extract() and extract_parallel() can handle this format.

Parameters

n_samples (int) – The number of sample sentences to load. If None, load all samples.

Returns

A tuple of sentences or sentence IDs.
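One way to obtain an ordering with these properties is to shuffle the whole corpus with a fixed seed and then take a prefix. This is a sketch of that idea under stated assumptions, not the yokome loader: CORPUS is a toy list of sentence IDs, and the seed parameter is hypothetical.

```python
import random

# Toy corpus of sentence IDs (assumption for illustration).
CORPUS = tuple('sentence-%d' % i for i in range(10))


def load(n_samples=None, seed=0):
    """Return up to n_samples sentence IDs in a fixed pseudo-random order."""
    order = list(CORPUS)
    # A dedicated, seeded generator gives the same permutation on every call
    # and does not depend on n_samples.
    random.Random(seed).shuffle(order)
    # Slicing with None keeps the whole sequence, so None loads all samples.
    return tuple(order[:n_samples])


small = load(3)
large = load(5)
everything = load()
```

With this design, a smaller request is always a prefix of a larger one, so repeated calls with the same n_samples return the same tuple.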

property tokenize

Function to tokenize a sentence in this language.

Parameters

sentence – A sentence or sentence ID.

Returns

A tuple of tuples of tokens. A token is represented as a dictionary of the following form:

{
  'surface_form': {'graphic': ..., 'phonetic': ...},
  'base_form': {'graphic': ..., 'phonetic': ...},
  'lemma': {'graphic': ..., 'phonetic': ...},
  'pos': <list of POS tags as strings>,
  'inflection': <list of POS/inflection tags>
}

“Surface form” refers to the graphic variant used in an original document and its pronunciation. “Base form” refers to a lemmatized version of the surface form. “Lemma” refers to a normalized version of the base form. (In Japanese, for example, there is a single lemma for multiple graphic variants of the base form that mean the same thing.)

The POS and inflection lists are meant to be read by a features.tree.TemplateTree.
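To make the token representation concrete, here is a hand-written token dictionary and a loop over a result shaped as a tuple of tuples of tokens. All values are made up for illustration: the readings and especially the 'pos' and 'inflection' tags are hypothetical and do not reflect the tag set yokome emits. The source does not say what the inner tuples group, so the sketch only iterates over both levels.

```python
# A single token in the documented dictionary form (values are illustrative).
token = {
    'surface_form': {'graphic': '食べ', 'phonetic': 'タベ'},
    'base_form': {'graphic': '食べる', 'phonetic': 'タベル'},
    'lemma': {'graphic': '食べる', 'phonetic': 'タベル'},
    'pos': ['verb'],                 # hypothetical POS tag
    'inflection': ['continuative'],  # hypothetical inflection tag
}

# The documented return shape: a tuple of tuples of tokens.
result = ((token,),)

# Collect the graphic surface form of every token across both levels.
surface_forms = [
    t['surface_form']['graphic']
    for group in result
    for t in group
]
```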

yokome.language._jpn.JPN = <Language jpn>

Japanese language with methods that load sentences from the corpus.

yokome.language._jpn.JPN_UNSEEN = <Language jpn_unseen>

Japanese language with methods that work on unknown text.