yokome.features.corpus

yokome.features.corpus.generate_graphic_character_vocabulary(conn, min_coverage)

Generate a vocabulary of characters from graphic representations of lemmas with the specified minimal corpus coverage.

This is the smallest vocabulary of the most frequent characters such that these characters together cover at least a proportion of min_coverage of the corpus.

Parameters
  • conn – Database connection for statistics.

  • min_coverage (float) – The minimal proportion of the corpus that the vocabulary must cover.

Returns

A dictionary mapping each character from the graphic representations of lemmas to its frequency rank.
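The selection rule described above can be sketched independently of the database. The following is a minimal illustration, not the library's implementation: it assumes the character counts are already available as a plain dictionary, and the helper name smallest_covering_vocabulary, the sample counts, and the choice to start ranks at 1 are all assumptions of this sketch.

```python
from collections import Counter


def smallest_covering_vocabulary(counts, min_coverage):
    """Map the most frequent items to ranks until min_coverage is reached.

    Ranks start at 1 here, which is an assumption of this sketch.
    """
    total = sum(counts.values())
    vocabulary = {}
    covered = 0
    # most_common() sorts by descending count; ties keep insertion order.
    ranked = Counter(counts).most_common()
    for rank, (item, count) in enumerate(ranked, start=1):
        # Stop as soon as the items chosen so far cover enough of the corpus.
        if total == 0 or covered / total >= min_coverage:
            break
        vocabulary[item] = rank
        covered += count
    return vocabulary


# Hypothetical character counts, for illustration only.
char_counts = {"日": 50, "本": 30, "語": 15, "猫": 5}
vocab = smallest_covering_vocabulary(char_counts, 0.75)
print(vocab)
```

With these sample counts, the two most frequent characters already account for 80 of 100 occurrences, so a min_coverage of 0.75 yields a two-entry vocabulary. How the library breaks ties between equally frequent characters is not stated in this reference; the sketch simply keeps insertion order.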

yokome.features.corpus.generate_lemma_vocabulary(conn, min_coverage)

Generate a vocabulary of lemmas with the specified minimal corpus coverage.

This is the smallest vocabulary of the most frequent lemmas such that these lemmas together cover at least a proportion of min_coverage of the corpus.

Parameters
  • conn – Database connection for statistics.

  • min_coverage (float) – The minimal proportion of the corpus that the vocabulary must cover.

Returns

A dictionary mapping each lemma to its frequency rank.

yokome.features.corpus.generate_phonetic_character_vocabulary(conn, min_coverage)

Generate a vocabulary of characters from phonetic representations of lemmas with the specified minimal corpus coverage.

This is the smallest vocabulary of the most frequent characters such that these characters together cover at least a proportion of min_coverage of the corpus.

Parameters
  • conn – Database connection for statistics.

  • min_coverage (float) – The minimal proportion of the corpus that the vocabulary must cover.

Returns

A dictionary mapping each character from the phonetic representations of lemmas to its frequency rank.

yokome.features.corpus.generate_vocabulary_from(language, sentences, min_coverage)

Generate a vocabulary with the specified minimal sentence coverage.

This is the smallest vocabulary of the most frequent tokens such that these tokens together cover at least a proportion of min_coverage of the sentences. The tokens are determined by the tokenize method of the language.

Parameters
  • language (yokome.language.Language) – The language of interest.

  • sentences – A sequence of sentences, in a form such that each sentence can be tokenized with the tokenize method of the language.

  • min_coverage (float) – The minimal proportion of the sentences' tokens that the vocabulary must cover.

Returns

A dictionary mapping each token to its frequency rank with respect to the sentences.
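To make the role of the tokenize method concrete, here is a small sketch of the counting step followed by the coverage cut-off. It does not use yokome.language.Language: the WhitespaceLanguage class is a hypothetical stand-in that only provides tokenize, and vocabulary_from_sentences, the sample sentences, and the 1-based ranks are likewise assumptions of this illustration.

```python
from collections import Counter


class WhitespaceLanguage:
    """Hypothetical stand-in for a language object; only tokenize() is assumed."""

    def tokenize(self, sentence):
        return sentence.split()


def vocabulary_from_sentences(language, sentences, min_coverage):
    # Count every token produced by the language's tokenizer.
    counts = Counter(
        token for sentence in sentences for token in language.tokenize(sentence)
    )
    total = sum(counts.values())
    vocabulary = {}
    covered = 0
    # Walk the tokens from most to least frequent until enough is covered.
    for rank, (token, count) in enumerate(counts.most_common(), start=1):
        if covered >= min_coverage * total:
            break
        vocabulary[token] = rank  # ranks start at 1 (assumption)
        covered += count
    return vocabulary


sentences = ["a b a", "a c", "b a"]
token_vocab = vocabulary_from_sentences(WhitespaceLanguage(), sentences, 0.8)
print(token_vocab)
```

In this toy input, "a" occurs four times and "b" twice out of seven tokens, so the two together pass the 0.8 threshold and "c" is left out.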

yokome.features.corpus.lemma_coverage(conn, graphic, phonetic) → int

Provide a measure of difficulty/infrequency for the specified lemma.

Here, a type’s corpus coverage is defined as the proportion of tokens in a corpus that are instances of types that are at least as frequent as the type of interest.

Parameters
  • conn – Database connection for statistics.

  • graphic (str) – Graphic lemma variant of interest.

  • phonetic (str) – Phonetic lemma variant of interest.

Returns

The proportion of tokens in the background corpus that are instances of types that are at least as frequent as the type of interest.
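The coverage definition above can be worked through on a small table of counts. This sketch is not the library's query against the statistics database: the function name type_coverage, the (graphic, phonetic) tuple keys, and the counts themselves are hypothetical and serve only to illustrate the definition.

```python
def type_coverage(counts, type_of_interest):
    """Proportion of tokens whose type is at least as frequent as the given type."""
    threshold = counts[type_of_interest]
    total = sum(counts.values())
    # Types with the same count as the type of interest are included.
    covered = sum(count for count in counts.values() if count >= threshold)
    return covered / total


# Hypothetical lemma counts keyed by (graphic, phonetic) pairs.
lemma_counts = {
    ("日", "ひ"): 60,
    ("本", "ほん"): 30,
    ("猫", "ねこ"): 5,
    ("犬", "いぬ"): 5,
}

frequent = type_coverage(lemma_counts, ("日", "ひ"))
rare = type_coverage(lemma_counts, ("猫", "ねこ"))
print(frequent, rare)
```

Under this definition, the most frequent lemma gets the lowest value (here 60 of 100 tokens) and the rarest lemmas get a value of 1, which is why the measure grows with difficulty or infrequency.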

yokome.features.corpus.load_sentence(DATABASE, language, file, sequence_id)

Load a sentence from the database.

Parameters
  • DATABASE (str) – The database file.

  • language (str) – ISO 639-3 language code of the language of interest.

  • file (str) – ID of the corpus document from which the sentence stems.

  • sequence_id (int) – The number of the sentence in the document, 1-indexed.

Returns

A string if the sentence contains only stop-character content (especially whitespace); a tokenized sentence otherwise.