yokome.models.wsd

yokome.models.wsd.disambiguate(tokens, i)

Disambiguate the token at i in the tokenized sentence tokens.

Parameters
  • tokens – A sentence, split into its tokens.

  • i (int) – The position of the token of interest in tokens.

Returns

A list of data on lexemes, ranked by their overall suitability to describe the meaning of the token at i. Each connotation of a lexeme is in turn associated with its own suitability score. Each element is a dictionary of the following form:

{
  'entry_id': <ID of the lexeme in the dictionary>,
  'headwords': <list of lemmas for the lexeme>,
  'discriminator': <int for lexemes with the same main headword>,
  'roles': [
    {
      'poss': <POS tag list for the role>,
      'connotations': [
        {
          'sense_id': <the ID of the connotation within the lexeme>,
          'glosses': ((<gloss_type>, <gloss>), ...),
          'score': <connotation score>
        },
        ...
      ]
    },
    ...
  ],
  'score': <overall lexeme score>
}
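The nested result can be walked role by role to get a flat view of all scored connotations. The following is a minimal sketch that only relies on the dictionary layout documented above. The helper name flatten_connotations, the sample entries, the gloss type 'equivalent', and the assumption that a higher score means a better fit are all illustrative and not taken from yokome itself.

```python
def flatten_connotations(results):
    """Yield one tuple per connotation found in a disambiguation result.

    Each tuple is (entry_id, poss, sense_id, glosses, score).
    """
    for lexeme in results:
        for role in lexeme['roles']:
            for connotation in role['connotations']:
                yield (lexeme['entry_id'],
                       role['poss'],
                       connotation['sense_id'],
                       connotation['glosses'],
                       connotation['score'])


# Hypothetical data shaped like the documented return value.
results = [
    {
        'entry_id': 1001,
        'headwords': [{'graphic': '橋', 'phonetic': 'はし'}],
        'discriminator': 1,
        'roles': [
            {
                'poss': ['noun'],
                'connotations': [
                    {'sense_id': 1,
                     'glosses': (('equivalent', 'bridge'),),
                     'score': 0.6},
                    {'sense_id': 2,
                     'glosses': (('equivalent', 'pons'),),
                     'score': 0.1},
                ],
            },
        ],
        'score': 0.7,
    },
    {
        'entry_id': 1002,
        'headwords': [{'graphic': '箸', 'phonetic': 'はし'}],
        'discriminator': 1,
        'roles': [
            {
                'poss': ['noun'],
                'connotations': [
                    {'sense_id': 1,
                     'glosses': (('equivalent', 'chopsticks'),),
                     'score': 0.2},
                ],
            },
        ],
        'score': 0.2,
    },
]

rows = list(flatten_connotations(results))
# Assumption: a higher connotation score means a more suitable reading.
best = max(rows, key=lambda row: row[-1])
```

Here best refers to sense 1 of entry 1001. With real output, results would come from disambiguate(tokens, i) instead of the hand-written list.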

yokome.models.wsd.has_statistics(conn, language_code, lemma)

See if lemma can be found in both the dictionary and the corpus.

Parameters
  • conn – The database connection for the dictionary and statistics.

  • language_code (str) – ISO 639-3 language code of the language of interest.

  • lemma – A dictionary that contains the keys graphic and phonetic.

Returns

True if the lemma can be found in both the dictionary and the corpus, False otherwise.
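The check amounts to a conjunction of two lookups. The sketch below replaces the database connection with two in-memory structures, so it illustrates the logic only and is not the actual query. The name has_statistics_sketch, the containers dictionary_lemmas and corpus_counts, and the reading of "found in the corpus" as "counted at least once" are assumptions.

```python
def has_statistics_sketch(dictionary_lemmas, corpus_counts, lemma):
    """Return True if lemma is known to both the dictionary and the corpus.

    dictionary_lemmas: set of (graphic, phonetic) pairs (hypothetical stand-in
        for the dictionary tables).
    corpus_counts: dict mapping (graphic, phonetic) to an occurrence count
        (hypothetical stand-in for the statistics tables).
    lemma: dict with the keys 'graphic' and 'phonetic'.
    """
    key = (lemma['graphic'], lemma['phonetic'])
    in_dictionary = key in dictionary_lemmas
    # Assumption: a lemma counts as found in the corpus if it occurs at least once.
    in_corpus = corpus_counts.get(key, 0) > 0
    return in_dictionary and in_corpus


dictionary_lemmas = {('橋', 'はし'), ('箸', 'はし')}
corpus_counts = {('橋', 'はし'): 42}

bridge = {'graphic': '橋', 'phonetic': 'はし'}
chopsticks = {'graphic': '箸', 'phonetic': 'はし'}

bridge_ok = has_statistics_sketch(dictionary_lemmas, corpus_counts, bridge)
chopsticks_ok = has_statistics_sketch(dictionary_lemmas, corpus_counts, chopsticks)
```

In this toy data, bridge_ok is True, while chopsticks_ok is False because that lemma is in the dictionary but has no corpus count.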

yokome.models.wsd.lexeme_lemma_count(conn, language_code, lemma)

Estimate the number of occurrences of a lexeme-lemma combination.

Get the number of occurrences of a lemma in the background corpus and estimate how many of them belong to one of its corresponding lexemes by dividing that count by the number of lexemes for which the lemma is listed as a headword.

Parameters
  • conn – The database connection for the dictionary and statistics.

  • language_code – ISO 639-3 language code of the language of interest.

  • lemma – A dictionary that contains the keys graphic and phonetic.

Returns

The estimated number of occurrences of the specified lemma with one of its lexemes, assuming the occurrences of the lemma are distributed equally among its lexemes.
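The estimate is a plain division, which a small worked example makes concrete. This sketch takes the two quantities as numbers instead of reading them from conn. The function name estimate_lexeme_lemma_count and the error raised for a lemma without lexemes are assumptions, not documented behaviour.

```python
def estimate_lexeme_lemma_count(lemma_count, n_lexemes):
    """Split the corpus count of a lemma evenly among its lexemes.

    lemma_count: occurrences of the lemma in the background corpus.
    n_lexemes: number of dictionary entries listing the lemma as a headword.
    """
    if n_lexemes <= 0:
        # Assumption: the real function's behaviour in this case is not documented.
        raise ValueError('lemma is not a headword of any lexeme')
    return lemma_count / n_lexemes


# A lemma seen 120 times that is a headword of 3 lexemes
# contributes an estimated 40 occurrences to each of them.
estimate = estimate_lexeme_lemma_count(120, 3)
```

In the module itself, the divisor presumably corresponds to the value returned by n_lexemes_for_lemma.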

yokome.models.wsd.n_lexemes_for_lemma(conn, language_code, lemma) → int

Get the number of dictionary entries with lemma as headword.

Parameters
  • conn – The database connection for the dictionary.

  • language_code (str) – ISO 639-3 language code of the language of interest.

  • lemma – A dictionary that contains the keys graphic and phonetic.

Returns

The number of entries in the dictionary that have the specified lemma as one of their headwords.

yokome.models.wsd.retrieve_substitute_lexemes(entry_id, n_senses, sense)

Search for substitute lexemes with senses similar to the specified one.

Parameters
  • entry_id (int) – The ID of the lexeme of interest in the dictionary.

  • n_senses (int) – The total number of senses of the lexeme with ID entry_id.

  • sense – A list of gloss descriptions of the form (gloss_type, gloss).

Returns

A list of dictionaries of the form

{'entry_id': <entry ID of lexeme>,
 'lemmas': <list of headwords of lexeme>,
 'pos': <tree of POS-tags>,
 'glosses': <list of glosses>,
 'ir_score': <information retrieval score>}

of the connotations (i.e. lexeme-sense combinations) that most closely resemble sense in terms of their glosses. All connotations that pertain to the lexeme with the ID entry_id are excluded.
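The exclusion rule and the ranking by retrieval score can be illustrated on already retrieved candidates. The sketch below is not the retrieval itself. The name filter_substitutes, the limit parameter, the sample candidates, the placeholder values for 'pos', and the assumption that a higher ir_score means a closer match are all hypothetical.

```python
def filter_substitutes(candidates, entry_id, limit=None):
    """Drop candidates of the lexeme itself and rank the rest by ir_score.

    candidates: list of dicts with at least the keys 'entry_id' and 'ir_score'.
    entry_id: ID of the lexeme whose own connotations must be excluded.
    limit: optional maximum number of substitutes to keep.
    """
    kept = [c for c in candidates if c['entry_id'] != entry_id]
    # Assumption: a higher information retrieval score means a closer match.
    kept.sort(key=lambda c: c['ir_score'], reverse=True)
    return kept if limit is None else kept[:limit]


# Hypothetical candidates shaped like the documented return value.
candidates = [
    {'entry_id': 1001, 'lemmas': ['橋'], 'pos': None,
     'glosses': ['bridge'], 'ir_score': 9.0},
    {'entry_id': 2001, 'lemmas': ['橋梁'], 'pos': None,
     'glosses': ['bridge'], 'ir_score': 7.5},
    {'entry_id': 2002, 'lemmas': ['陸橋'], 'pos': None,
     'glosses': ['overpass', 'viaduct'], 'ir_score': 3.2},
]

substitutes = filter_substitutes(candidates, entry_id=1001, limit=2)
substitute_ids = [s['entry_id'] for s in substitutes]
```

Entry 1001 is removed because it is the lexeme of interest, leaving entries 2001 and 2002 in descending order of ir_score.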

yokome.models.wsd.score_connotation(tokens, i, sense_prior, pos_tree, role_pos_score, conn, substitute_lexemes, TOTAL_LEMMAS)

Score the substitution of the token at i with substitute_lexemes.

Parameters
  • tokens – A sentence, split into tokens.

  • i – The position of the token of interest in tokens.

  • sense_prior – The prior probability of the sense given the lexeme.

  • pos_tree (yokome.features.tree.TemplateTree) – The POS tree of the role to which the connotation belongs.

  • role_pos_score – The summed scores of the matches between pos_tree and each POS tree that pertains to a candidate lemma of the token of interest, counting only candidate lemmas that are headwords of the lexeme to which the connotation belongs.

  • conn – The database connection for the dictionary and statistics.

  • substitute_lexemes – A list of dicts that describe lexemes that could be used as substitutes for the token of interest.

  • TOTAL_LEMMAS – The total number of lemmas (i.e. of tokens) in the corpus.

Returns

A score for the suitability of the connotation that suggested substitute_lexemes to describe the meaning of the token at i.

yokome.models.wsd.score_lexeme(tokens, i, pos_trees, conn, lexeme, TOTAL_LEMMAS)

Disambiguate the token at i using the connotations of lexeme.

Parameters
  • tokens – A sentence, split into its tokens.

  • i – The position of the token of interest in tokens.

  • pos_trees (list[yokome.features.tree.TemplateTree]) – A list of POS trees of the token at i, each pertaining to one of the candidate lemmas of this token, restricted to those lemmas that are headwords of lexeme.

  • conn – The database connection for the dictionary and statistics.

  • lexeme (yokome.features.dictionary.Lexeme) – An entry in the dictionary that possibly describes the meaning of the token at i.

  • TOTAL_LEMMAS – The total number of lemmas (i.e. of tokens) in the corpus.

Returns

A dictionary of data on the lexeme, together with the lexeme's overall suitability to describe the meaning of the token at i. Each connotation in turn is associated with its suitability. The dictionary is of the following form:

{
  'entry_id': <ID of the lexeme in the dictionary>,
  'headwords': <list of lemmas for the lexeme>,
  'discriminator': <int for lexemes with the same main headword>,
  'roles': [
    {
      'poss': <POS tag list for the role>,
      'connotations': [
        {
          'sense_id': <the ID of the connotation within the lexeme>,
          'glosses': ((<gloss_type>, <gloss>), ...),
          'score': <connotation score>
        },
        ...
      ]
    },
    ...
  ],
  'score': <overall lexeme score>
}

yokome.models.wsd.total_lemmas(conn, language_code)

Get the number of lemmas (i.e. of tokens) in the background corpus.

Parameters
  • conn – The database connection for statistics.

  • language_code – ISO 639-3 language code of the language of interest.

Returns

The total number of lemmas in the background corpus.