yokome.data.jpn.corpus

Provides methods to access the Japanese corpus.

The corpus used is the JEITA Public Morphologically Tagged Corpus (in ChaSen format). All data is split into the following data sets:

  • Reserve (rsv) set: Not for direct use in this project, but held back to test whether the model creation process has overfit to all other sets.

  • Test (tst) set: For a final estimation of the quality of the best model built in this project.

  • Development (dev) set: All data that goes into training a model in this project.

    • Validation (vld) set: In a k-fold cross-validation process, the set on which to measure the quality of the model trained on the evaluation and training sets.

    • Evaluation (evl) set: In a k-fold cross-validation process, the set with which to monitor the quality of the model during training, especially to allow for early stopping.

    • Training (trn) set: In a k-fold cross-validation process, the set on which to train the model.

The JEITA Aozora and Genpaku corpora are split independently, as they contain different language content: The documents in the Aozora corpus were originally written in Japanese, while the documents in the Genpaku corpus stem from sources in other languages.
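
The roles of the vld, evl and trn sets within the development set can be illustrated with a minimal sketch. It is not the module's implementation: the function name kfold_roles, the strided partitioning and the rule "fold i validates, fold i+1 evaluates" are all assumptions made for illustration.

```python
def kfold_roles(items, k, fold):
    """Split items into (vld, evl, trn) lists for one cross-validation fold.

    Hypothetical helper: the real module may partition the development
    files differently.
    """
    if k < 3:
        raise ValueError('k must be at least 3 to fill all three roles')
    if not 0 <= fold < k:
        raise ValueError('fold must be in range(k)')
    # Partition the items into k disjoint parts by striding.
    parts = [items[i::k] for i in range(k)]
    vld_index = fold
    evl_index = (fold + 1) % k
    vld = parts[vld_index]
    evl = parts[evl_index]
    # Every remaining part is used for training.
    trn = [item
           for i, part in enumerate(parts)
           if i not in (vld_index, evl_index)
           for item in part]
    return vld, evl, trn


vld, evl, trn = kfold_roles(list(range(10)), k=5, fold=0)
```

Under these assumptions, each fold index yields three disjoint sets that together cover the whole development set. Splitting the Aozora and Genpaku corpora independently would then amount to calling such a helper once per corpus.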

yokome.data.jpn.corpus.DATABASE = '/home/jbetz/Documents/Courses/UHH_MSc_Informatik/Master_Project/yokome/data/processed/data.db'

The database file location.

yokome.data.jpn.corpus.dev_files(corpus_dir)

Get the filenames of the corpus documents for development.

Parameters

corpus_dir (str) – The root directory of the corpus.

yokome.data.jpn.corpus.load_dev_sentence_ids(n_samples=None)

Load the identifiers of sentences from the development files of the Japanese corpus.

The order of identifiers is randomized. The randomization does not depend on the number of samples requested, and calls requesting the same number of samples return the identifiers in the same order.

Parameters

n_samples (int) – The number of sample identifiers to load. If None, load all identifiers.

Returns

A tuple of sentence identifiers, each of the form (<file name>, <sentence number>).
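
One way to obtain a randomized order that stays consistent between calls is to shuffle with a fixed seed and then truncate. The sketch below assumes exactly that; the function sample_ids, its seed parameter and the example file names are hypothetical and do not describe how load_dev_sentence_ids is actually implemented.

```python
import random


def sample_ids(sentence_ids, n_samples=None, seed=0):
    """Return sentence identifiers in a reproducible random order.

    Hypothetical stand-in for illustration only.
    """
    # Sort first so that the result does not depend on the input order.
    ids = sorted(sentence_ids)
    # A private Random instance with a fixed seed always produces the
    # same permutation and leaves the global random state untouched.
    random.Random(seed).shuffle(ids)
    if n_samples is None:
        return tuple(ids)
    # Truncating after shuffling keeps the order independent of n_samples.
    return tuple(ids[:n_samples])


# Identifiers of the form (<file name>, <sentence number>); names are made up.
all_ids = [('a.txt', i) for i in range(3)] + [('b.txt', i) for i in range(3)]
first_three = sample_ids(all_ids, 3)
```

With this design, a request for fewer samples returns a prefix of the full shuffled sequence, so repeated calls with the same arguments agree.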

yokome.data.jpn.corpus.rsv_files(corpus_dir)

Get the filenames of the reserved corpus documents.

Parameters

corpus_dir (str) – The root directory of the corpus.

yokome.data.jpn.corpus.tst_files(corpus_dir)

Get the filenames of the corpus documents for testing.

Parameters

corpus_dir (str) – The root directory of the corpus.

yokome.data.jpn.corpus.validate_file(f)

Determine whether the file has high data quality.

Filter out documents with archaic writing styles, excessive foreign content or improper bracketing structures.

Returns

True if the file passes all tests, False otherwise.
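
As an illustration of one of these criteria, a check for improper bracketing structures could look like the sketch below. The helper brackets_balanced and the chosen set of bracket pairs are assumptions; the actual tests performed by validate_file are not documented here and may differ.

```python
# Assumed bracket pairs; the real check may consider a different set.
PAIRS = {'「': '」', '『': '』', '（': '）', '(': ')'}


def brackets_balanced(text, pairs=PAIRS):
    """Return True if all brackets in text are properly nested and closed.

    Hypothetical helper for illustration only.
    """
    closers = set(pairs.values())
    stack = []
    for char in text:
        if char in pairs:
            # Remember which closing bracket is expected next.
            stack.append(pairs[char])
        elif char in closers:
            # A closing bracket must match the most recent opening one.
            if not stack or stack.pop() != char:
                return False
    # Any bracket left on the stack was never closed.
    return not stack


ok = brackets_balanced('「こんにちは」')
```

A file-level test under this assumption would return False as soon as one sentence fails the check.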