yokome.language
-
class
yokome.language._lang.
Language
(code, name, *, loader, tokenizer, extractor, parallel_extractor)
Resources of a specific language.
Stores information about the language and provides methods for text analysis that are tailored to that language.
-
static
by_code
(code)
Look up a language by its unique identifier.
-
property
code
The unique identifier of this language.
This is usually the ISO 639-3 language code of this language.
-
property
extract
Function that turns an iterable of tokens into language model input.
Differs from extract_parallel() only for character-level extracts.
- Parameters
tokens – An iterable of tokens (see tokenize() for the token representation).
- Returns
An iterable of token identifiers that the language model understands.
-
property
extract_parallel
Function that turns an iterable of tokens into language model input.
Differs from extract() only for character-level extracts.
- Parameters
tokens – An iterable of tokens (see tokenize() for the token representation).
- Returns
An iterable of token identifiers that the language model understands.
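The contract shared by extract and extract_parallel (tokens in, token identifiers out) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the helper names build_vocabulary and extract_sketch are hypothetical, identifiers are assumed to be integers with 0 reserved for unknown tokens, and the lookup key is assumed to be the lemma's graphic form. It is not yokome's extractor. The tokens carry only the 'lemma' field for brevity.

```python
def build_vocabulary(tokens):
    """Assign an integer ID to every distinct lemma (hypothetical helper)."""
    vocab = {}
    for token in tokens:
        key = token['lemma']['graphic']
        # IDs start at 1; 0 is reserved for unknown tokens (an assumption).
        vocab.setdefault(key, len(vocab) + 1)
    return vocab


def extract_sketch(tokens, vocab, unknown_id=0):
    """Lazily map tokens to identifiers, yielding unknown_id for unseen lemmas."""
    for token in tokens:
        yield vocab.get(token['lemma']['graphic'], unknown_id)


# Illustrative tokens; the values are made up for this sketch.
tokens = [
    {'lemma': {'graphic': '猫', 'phonetic': 'ネコ'}},
    {'lemma': {'graphic': '犬', 'phonetic': 'イヌ'}},
    {'lemma': {'graphic': '猫', 'phonetic': 'ネコ'}},
]
vocab = build_vocabulary(tokens)
ids = list(extract_sketch(tokens, vocab))

unseen = [{'lemma': {'graphic': '鳥', 'phonetic': 'トリ'}}]
unseen_ids = list(extract_sketch(unseen, vocab))
```

Because the sketch is a generator, it accepts any iterable of tokens and returns an iterable of identifiers, matching the shape of the documented parameter and return value.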
-
property
load
Function that loads corpus sentences in this language.
The order of sentences is randomized. The randomization is independent of the number of samples requested and is consistent between calls that request the same number of samples.
Does not necessarily load the sentences themselves; it may provide sentence IDs instead if tokenize(), extract() and extract_parallel() can handle this format.
- Parameters
n_samples (int) – The number of sample sentences to load. If None, load all samples.
- Returns
A tuple of sentences or sentence IDs.
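One way to obtain a randomized order that does not depend on n_samples and repeats between calls is to shuffle the whole corpus with a fixed seed and then take a prefix. The sketch below assumes that approach; the names CORPUS and load_sketch and the seed parameter are hypothetical, and yokome's loader may work differently.

```python
import random

# Hypothetical stand-in corpus; yokome's real corpus is not used here.
CORPUS = tuple('sentence-%d' % i for i in range(10))


def load_sketch(n_samples=None, *, seed=0):
    """Return a reproducibly shuffled tuple of sentences.

    The full corpus is shuffled with a fixed seed before truncation, so
    the order does not depend on n_samples and repeats between calls.
    """
    order = list(CORPUS)
    random.Random(seed).shuffle(order)
    if n_samples is None:
        return tuple(order)
    return tuple(order[:n_samples])


first_three = load_sketch(3)
first_five = load_sketch(5)
everything = load_sketch()
```

With this design, a smaller request is always a prefix of a larger one, so experiments that grow the sample size keep reusing the sentences they have already seen.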
-
property
tokenize
Function that tokenizes a sentence in this language.
- Parameters
sentence – A sentence or sentence ID.
- Returns
A tuple of tuples of tokens. A token is represented as a dictionary of the following form:
    {
        'surface_form': {'graphic': ..., 'phonetic': ...},
        'base_form': {'graphic': ..., 'phonetic': ...},
        'lemma': {'graphic': ..., 'phonetic': ...},
        'pos': <list of POS tags as strings>,
        'inflection': <list of POS/inflection tags>
    }
“Surface form” refers to the graphic variant used in an original document and its pronunciation. “Base form” refers to a lemmatized version of the surface form. “Lemma” refers to a normalized version of the base form. (In Japanese, for example, there is a single lemma for multiple graphical variants of the base form that mean the same thing.)
The POS and inflection lists are meant to be read by a features.tree.TemplateTree.
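As a hedged illustration of the token representation, the sketch below builds one token by hand and reads its fields. All values are made up for this example: the tag strings, the katakana phonetic forms, and the helper graphic_forms are assumptions, not output of yokome's tokenizer.

```python
# Illustrative token; every value here is a made-up placeholder.
token = {
    'surface_form': {'graphic': '食べ', 'phonetic': 'タベ'},
    'base_form': {'graphic': '食べる', 'phonetic': 'タベル'},
    'lemma': {'graphic': '食べる', 'phonetic': 'タベル'},
    'pos': ['verb'],                        # placeholder POS tag
    'inflection': ['verb', 'continuative'], # placeholder inflection tags
}


def graphic_forms(token):
    """Return the graphic variants of surface form, base form and lemma."""
    return tuple(token[key]['graphic']
                 for key in ('surface_form', 'base_form', 'lemma'))


# tokenize() is documented to return a tuple of tuples of tokens,
# so consumers iterate over two levels of nesting.
tokenized = ((token,),)
lemmas = [t['lemma']['graphic'] for group in tokenized for t in group]
```

The nested iteration in the last line reflects the documented return shape; what each inner tuple groups together is not stated in this reference, so the sketch makes no assumption about it.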
-
yokome.language._jpn.
JPN
= <Language jpn>
Japanese language with methods that load sentences from the corpus.
-
yokome.language._jpn.
JPN_UNSEEN
= <Language jpn_unseen>
Japanese language with methods that work on unknown text.