yokome.models.wsd
-
yokome.models.wsd.disambiguate(tokens, i)
Disambiguate the token at i in the tokenized sentence tokens.
- Parameters
tokens – A sentence, split into its tokens.
i (int) – The position of the token of interest in tokens.
- Returns
A list of data on lexemes, ranked by their overall suitability to describe the meaning of the token at i, with their connotations in turn associated with their suitability. Each element is a dictionary of the following form:
{
    'entry_id': <ID of the lexeme in the dictionary>,
    'headwords': <list of lemmas for the lexeme>,
    'discriminator': <int for lexemes with the same main headword>,
    'roles': [
        {
            'poss': <POS tag list for the role>,
            'connotations': [
                {
                    'sense_id': <the ID of the connotation within the lexeme>,
                    'glosses': ((<gloss_type>, <gloss>), ...),
                    'score': <connotation score>
                },
                ...
            ]
        },
        ...
    ],
    'score': <overall lexeme score>
}
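As a minimal sketch of how a caller might consume this return value, the following picks the highest-scoring lexeme and, within it, the highest-scoring connotation. The helper name best_reading and the sample data are hypothetical and only mirror the documented structure; they are not output from yokome. The sketch also assumes that a higher score means a more suitable reading.

```python
def best_reading(lexemes):
    """Return (entry_id, sense_id, glosses) of the best-scoring connotation
    of the best-scoring lexeme, or None if there is nothing to choose from.

    Assumption: higher 'score' values mean higher suitability.
    """
    if not lexemes:
        return None
    lexeme = max(lexemes, key=lambda entry: entry['score'])
    # Connotations are nested per role, so flatten them first.
    connotations = [
        connotation
        for role in lexeme['roles']
        for connotation in role['connotations']
    ]
    if not connotations:
        return None
    best = max(connotations, key=lambda c: c['score'])
    return lexeme['entry_id'], best['sense_id'], best['glosses']


# Hypothetical sample in the documented shape (all values are placeholders).
sample = [
    {
        'entry_id': 1001,
        'headwords': [{'graphic': 'A', 'phonetic': 'a'}],
        'discriminator': 1,
        'roles': [
            {
                'poss': ['noun'],
                'connotations': [
                    {'sense_id': 1,
                     'glosses': (('gloss_type', 'first gloss'),),
                     'score': 0.3},
                    {'sense_id': 2,
                     'glosses': (('gloss_type', 'second gloss'),),
                     'score': 0.9},
                ],
            },
        ],
        'score': 0.8,
    },
    {
        'entry_id': 1002,
        'headwords': [{'graphic': 'A', 'phonetic': 'a'}],
        'discriminator': 2,
        'roles': [],
        'score': 0.1,
    },
]

reading = best_reading(sample)
```

Here reading refers to sense 2 of entry 1001, because that lexeme has the highest overall score and sense 2 has the highest connotation score within it.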
-
yokome.models.wsd.has_statistics(conn, language_code, lemma)
See if lemma can be found in both the dictionary and the corpus.
- Parameters
conn – The database connection for the dictionary and statistics.
language_code (str) – ISO 639-3 language code of the language of interest.
lemma – A dictionary that contains the keys graphic and phonetic.
- Returns
True if the lemma can be found in both the dictionary and the corpus, False otherwise.
-
yokome.models.wsd.lexeme_lemma_count(conn, language_code, lemma)
Estimate the number of occurrences of a lexeme-lemma combination.
Get the number of occurrences of a lemma in the background corpus and estimate its contribution to the number of occurrences of one of its corresponding lexemes by dividing it by the number of lexemes for which it is listed as a headword.
- Parameters
conn – The database connection for the dictionary and statistics.
language_code – ISO 639-3 language code of the language of interest.
lemma – A dictionary that contains the keys graphic and phonetic.
- Returns
The number of estimated occurrences of the specified lemma with one of its lexemes, assuming equal distribution of the lemma among its lexemes.
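The estimate described above amounts to a single division. The sketch below shows that arithmetic on plain numbers; the function name estimate_lexeme_lemma_count is hypothetical, and it takes the two counts directly instead of reading them from a database. How the library treats a lemma that is a headword of no lexeme is not documented here, so raising an error in that case is an assumption.

```python
def estimate_lexeme_lemma_count(lemma_count, n_lexemes):
    """Distribute a lemma's corpus count evenly among its lexemes.

    lemma_count: occurrences of the lemma in the background corpus.
    n_lexemes: number of dictionary entries listing the lemma as a headword.
    """
    if n_lexemes <= 0:
        # Assumption: the documented behaviour for this case is unknown.
        raise ValueError('lemma is not a headword of any lexeme')
    return lemma_count / n_lexemes


# A lemma seen 120 times that is a headword of 3 lexemes contributes
# an estimated 40 occurrences to each of them.
estimate = estimate_lexeme_lemma_count(120, 3)
```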
-
yokome.models.wsd.n_lexemes_for_lemma(conn, language_code, lemma) → int
Get the number of dictionary entries with lemma as headword.
- Parameters
conn – The database connection for the dictionary.
language_code (str) – ISO 639-3 language code of the language of interest.
lemma – A dictionary that contains the keys graphic and phonetic.
- Returns
The number of entries in the dictionary that have the specified lemma as one of their headwords.
-
yokome.models.wsd.retrieve_substitute_lexemes(entry_id, n_senses, sense)
Search for substitute lexemes with senses similar to the specified one.
- Parameters
entry_id (int) – The ID of the lexeme of interest in the dictionary.
n_senses (int) – The total number of senses of the lexeme with ID entry_id.
sense – A list of gloss descriptions of the form (gloss_type, gloss).
- Returns
A list of dictionaries of the form
{
    'entry_id': <entry ID of lexeme>,
    'lemmas': <list of headwords of lexeme>,
    'pos': <tree of POS-tags>,
    'glosses': <list of glosses>,
    'ir_score': <information retrieval score>
}
of the connotations (i.e. lexeme-sense combinations) that most closely resemble sense in terms of their glosses. All connotations that pertain to the lexeme with the ID entry_id are excluded.
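The exclusion rule can be illustrated on an already retrieved list of hits. This is only a sketch of the post-filtering step: the helper filter_substitutes and the sample hits are hypothetical, the retrieval itself is not shown, and ordering the result by descending ir_score is an assumption, since the documentation does not state an order.

```python
def filter_substitutes(hits, entry_id):
    """Drop hits that belong to the lexeme of interest and rank the rest.

    Assumption: a higher 'ir_score' means a closer match.
    """
    kept = [hit for hit in hits if hit['entry_id'] != entry_id]
    return sorted(kept, key=lambda hit: hit['ir_score'], reverse=True)


# Hypothetical hits in the documented shape (all values are placeholders;
# 'pos' is left as None because the POS tree type is not modelled here).
hits = [
    {'entry_id': 7, 'lemmas': ['x'], 'pos': None,
     'glosses': ['gloss a'], 'ir_score': 2.5},
    {'entry_id': 3, 'lemmas': ['y'], 'pos': None,
     'glosses': ['gloss b'], 'ir_score': 4.0},
    {'entry_id': 9, 'lemmas': ['z'], 'pos': None,
     'glosses': ['gloss c'], 'ir_score': 3.1},
]

# Entry 3 is the lexeme being disambiguated, so its own senses are removed.
substitutes = filter_substitutes(hits, entry_id=3)
```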
-
yokome.models.wsd.score_connotation(tokens, i, sense_prior, pos_tree, role_pos_score, conn, substitute_lexemes, TOTAL_LEMMAS)
Score the substitution of the token at i with substitute_lexemes.
- Parameters
tokens – A sentence, split into tokens.
i – The position of the token of interest in tokens.
sense_prior – The prior probability of the sense given the lexeme.
pos_tree (yokome.features.tree.TemplateTree) – The POS tree of the role to which the connotation belongs.
role_pos_score – The summed scores of the matches between pos_tree and each POS tree that pertains to a candidate lemma of the token of interest, where that candidate lemma is a headword of the lexeme to which the connotation belongs.
conn – The database connection for the dictionary and statistics.
substitute_lexemes – A list of dicts that describe lexemes that could be used as substitutes for the token of interest.
TOTAL_LEMMAS – The total number of lemmas (i.e. of tokens) in the corpus.
- Returns
A score for the suitability of the connotation that suggested substitute_lexemes to describe the meaning of the token at i.
-
yokome.models.wsd.score_lexeme(tokens, i, pos_trees, conn, lexeme, TOTAL_LEMMAS)
Disambiguate the token at i using the connotations of lexeme.
- Parameters
tokens – A sentence, split into its tokens.
i – The position of the token of interest in tokens.
pos_trees (list[yokome.features.tree.TemplateTree]) – A list of POS trees of the token at i, each pertaining to one of the candidate lemmas of this token, restricted to those tokens that are headwords of lexeme.
conn – The database connection for the dictionary and statistics.
lexeme (yokome.features.dictionary.Lexeme) – An entry in the dictionary that possibly describes the meaning of the token at i.
TOTAL_LEMMAS – The total number of lemmas (i.e. of tokens) in the corpus.
- Returns
A dictionary of data on the lexeme, together with the lexeme's overall suitability to describe the meaning of the token at i. Each connotation in turn is associated with its suitability. The dictionary is of the following form:
{
    'entry_id': <ID of the lexeme in the dictionary>,
    'headwords': <list of lemmas for the lexeme>,
    'discriminator': <int for lexemes with the same main headword>,
    'roles': [
        {
            'poss': <POS tag list for the role>,
            'connotations': [
                {
                    'sense_id': <the ID of the connotation within the lexeme>,
                    'glosses': ((<gloss_type>, <gloss>), ...),
                    'score': <connotation score>
                },
                ...
            ]
        },
        ...
    ],
    'score': <overall lexeme score>
}
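Because connotations are nested under roles, it can be convenient to flatten a single lexeme dictionary into rows for display or inspection. The sketch below does this for one dictionary of the documented form; the helper connotation_rows, the POS tags, and the sample values are hypothetical, and sorting by descending score assumes that higher scores indicate better suitability.

```python
def connotation_rows(lexeme_data):
    """Flatten roles into (poss, sense_id, score) rows, highest score first."""
    rows = [
        (tuple(role['poss']), connotation['sense_id'], connotation['score'])
        for role in lexeme_data['roles']
        for connotation in role['connotations']
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical lexeme data in the documented shape (placeholder values).
lexeme_data = {
    'entry_id': 42,
    'headwords': [{'graphic': 'A', 'phonetic': 'a'}],
    'discriminator': 1,
    'roles': [
        {
            'poss': ['noun'],
            'connotations': [
                {'sense_id': 1, 'glosses': (('gloss_type', 'g1'),),
                 'score': 0.2},
                {'sense_id': 2, 'glosses': (('gloss_type', 'g2'),),
                 'score': 0.7},
            ],
        },
        {
            'poss': ['verb'],
            'connotations': [
                {'sense_id': 3, 'glosses': (('gloss_type', 'g3'),),
                 'score': 0.5},
            ],
        },
    ],
    'score': 0.6,
}

rows = connotation_rows(lexeme_data)
```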
-
yokome.models.wsd.total_lemmas(conn, language_code)
Get the number of lemmas (i.e. of tokens) in the background corpus.
- Parameters
conn – The database connection for statistics.
language_code – ISO 639-3 language code of the language of interest.
- Returns
The total number of lemmas in the background corpus.