yokome.models.cross_validation

yokome.models.cross_validation.cross_validate(seed_dir, language, n_samples, n_splits, evl_size, max_epochs, batch_size, max_generalization_loss, min_coverage, hyperparams, seed=None, verbose=False, dashboard_port=6006)

Perform k-fold cross-validation for the given language and hyperparameters.

The process is designed to resume with minimal additional effort after a crash, so it can be stopped and picked up again later.

TensorBoard is served during each training run.

Parameters
  • seed_dir (str) – Where to store model data for this seed. If cross-validation is performed for multiple seeds, multiple seed directories are needed.

  • language (yokome.language.Language) – The language to train on.

  • n_samples (int) – The number of sample sentences to load.

  • n_splits (int) – The number k of folds.

  • evl_size (float) – The fraction of evaluation samples relative to the non-validation part of all samples.

  • max_epochs (int) – The maximum number of epochs to train for. The actual number of epochs may be less if the training process stops early.

  • batch_size (int) – The number of sentences to estimate the probability for in parallel.

  • max_generalization_loss (float) – The maximum generalization loss up to which training still continues.

  • min_coverage – The fraction of the corpus that must be covered by the smallest vocabulary of most frequent words used to encode incoming data.

  • hyperparams – The model parameters used in this pass of cross-validation.

  • seed (int) – The seed used for the pseudo-random number generator that generates the seeds for the models to be trained.

  • verbose (bool) – Whether to print progress indication.

  • dashboard_port (int) – The port on which to serve TensorBoard.
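The role of min_coverage can be illustrated with a minimal sketch. It assumes that coverage is measured as the share of word occurrences in the corpus that fall inside the vocabulary; the helper name minimal_vocabulary and the toy corpus are hypothetical and not part of yokome.

```python
from collections import Counter


def minimal_vocabulary(sentences, min_coverage):
    """Return the shortest list of most frequent words that covers at least
    `min_coverage` of all word occurrences.

    Hypothetical helper for illustration only; not part of yokome.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values())
    vocabulary = []
    covered = 0
    # most_common() yields words in order of decreasing frequency.
    for word, count in counts.most_common():
        if covered >= min_coverage * total:
            break
        vocabulary.append(word)
        covered += count
    return vocabulary


# Toy corpus: "a" occurs 4 times, "b" twice, "c" and "d" once each.
corpus = [("a", "b", "a"), ("a", "c", "b"), ("a", "d")]
vocab = minimal_vocabulary(corpus, min_coverage=0.75)
print(vocab)
```

A higher min_coverage yields a larger vocabulary, so fewer incoming words fall outside it.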

Returns

The average loss over all folds.
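One way to make such a process resumable is to persist each fold's result under seed_dir and skip folds that already have one. The following is a minimal sketch of that idea, not yokome's actual implementation: the train_fold callback and the fold_<i>.json file layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def resumable_average_loss(seed_dir, n_splits, train_fold):
    """Average the loss over all folds, skipping folds already completed.

    Hypothetical sketch: `train_fold` and the `fold_<i>.json` layout are
    assumptions for illustration, not yokome's storage format.
    """
    os.makedirs(seed_dir, exist_ok=True)
    losses = []
    for fold in range(n_splits):
        result_path = os.path.join(seed_dir, "fold_%d.json" % fold)
        if os.path.exists(result_path):
            # This fold finished in an earlier run: reuse its stored loss.
            with open(result_path) as f:
                loss = json.load(f)["loss"]
        else:
            loss = train_fold(fold)
            # Write to a temporary file first, then rename it, so that a
            # crash cannot leave a half-written result behind.
            tmp_path = result_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"loss": loss}, f)
            os.replace(tmp_path, result_path)
        losses.append(loss)
    return sum(losses) / len(losses)


calls = []


def fake_train_fold(fold):
    """Stand-in for a real training run; records which folds were trained."""
    calls.append(fold)
    return float(fold + 1)


with tempfile.TemporaryDirectory() as seed_dir:
    first = resumable_average_loss(seed_dir, 3, fake_train_fold)
    # The second call finds all three results on disk and trains nothing.
    second = resumable_average_loss(seed_dir, 3, fake_train_fold)
```

Because results are keyed by directory, each seed needs its own seed_dir, which matches the requirement stated for that parameter.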

yokome.models.cross_validation.kfold(language, n_samples=None, n_splits=5, evl_size=0.25)

Create splits of corpus sentences to be used in cross-validation.

The sentences are loaded using the language's load method. The splits are performed randomly, and differently for different numbers of samples.

Parameters
  • language (yokome.language.Language) – The language to train on.

  • n_samples (int) – The number of sample sentences to load.

  • n_splits (int) – The number k of folds.

  • evl_size (float) – The fraction of evaluation samples relative to the non-validation part of all samples.

Returns

An iterable of triples of tuples of sentences. Each triple consists of the training, evaluation and validation splits, respectively.
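The shape of the returned triples, and how evl_size relates to the non-validation part, can be sketched in plain Python. This is an illustrative stand-in under simple assumptions (a fixed shuffle and interleaved validation folds), not yokome's implementation; kfold_sketch and its seed parameter are hypothetical.

```python
import random


def kfold_sketch(sentences, n_splits=5, evl_size=0.25, seed=0):
    """Yield (training, evaluation, validation) triples of sentence tuples.

    Illustrative stand-in only; the shuffling and fold assignment used by
    yokome may differ.
    """
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    for k in range(n_splits):
        # Every n_splits-th sentence, starting at offset k, is held out
        # for validation in fold k.
        validation = tuple(sentences[k::n_splits])
        rest = [s for i, s in enumerate(sentences) if i % n_splits != k]
        # evl_size is a fraction of the non-validation part only.
        n_evl = int(len(rest) * evl_size)
        evaluation = tuple(rest[:n_evl])
        training = tuple(rest[n_evl:])
        yield training, evaluation, validation


samples = ["sentence %d" % i for i in range(20)]
folds = list(kfold_sketch(samples, n_splits=5, evl_size=0.25))
for training, evaluation, validation in folds:
    print(len(training), len(evaluation), len(validation))
```

With 20 sentences and 5 folds, each fold holds out 4 sentences for validation; a quarter of the remaining 16 go to evaluation and the other 12 to training.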