A toolkit for neural sequence-to-sequence transduction
Sockeye can read the raw data at training time in two sentence-parallel files via the
--target command-line options.
You can also prepare the data ahead of time and dump it to disk as MXNet NDArrays.
This basically eliminates the data loading time when running training (since three passes over the raw data are required), and also reduces memory consumption, since prepared data is also placed into random shards (which have one million lines each, by default).
To run data preparation, you can use the following command:
> python -m sockeye.prepare_data usage: prepare_data.py [-h] --source SOURCE [--source-factors SOURCE_FACTORS [SOURCE_FACTORS ...]] --target TARGET [--source-vocab SOURCE_VOCAB] [--target-vocab TARGET_VOCAB] [--source-factor-vocabs SOURCE_FACTOR_VOCABS [SOURCE_FACTOR_VOCABS ...]] [--shared-vocab] [--num-words NUM_WORDS] [--word-min-count WORD_MIN_COUNT] [--pad-vocab-to-multiple-of PAD_VOCAB_TO_MULTIPLE_OF] [--no-bucketing] [--bucket-width BUCKET_WIDTH] [--max-seq-len MAX_SEQ_LEN] [--num-samples-per-shard NUM_SAMPLES_PER_SHARD] [--min-num-shards MIN_NUM_SHARDS] [--seed SEED] --output OUTPUT prepare_data.py: error: the following arguments are required: --source/-s, --target/-t, --output/-o
The main arguments are the required ones above (
--output to specify the directory to write the prepared data to).
Some other important ones are:
--shared-vocab: to produce a shared vocabulary between the source and target sides of the corpora.
--num-samples-per-shard: to control the shard size.
At training time (see next section), you then specify
--prepared-data instead of
Training is carried out by the
sockeye.train module. Basic usage is given by
> python -m sockeye.train usage: train.py [-h] --source SOURCE --target TARGET --validation-source VALIDATION_SOURCE --validation-target VALIDATION_TARGET --output OUTPUT [...]
Training requires 5 arguments:
--target: give the training data files. Gzipped files are supported, provided that their filenames end with .gz.
--validation-target: give the validation data files, gzip supported as above.
--output: gives the output directory where the intermediate and final results will be written to. Intermediate directories will be created if needed. Logging will be written to
<model_dir>/logas well as being echoed on the console.
For a complete list of supported options use the
All input files files should be UTF-8 encoded, tokenized with standard whitespaces. Each line should contain a single sentence and the source and target files should have the same number of lines. Vocabularies will automatically be created from the training data and vocabulary coverage on the validation set during initialization will be reported.
Training is governed by the concept of “checkpoints”, rather than epochs.
You can specify the checkpoint interval in terms of updates/batches with
Training performs early-stopping to prevent overfitting, i.e., training is stopped once a defined evaluation metric computed on the held-out validation data does not improve for a number of checkpoints given by the parameter
You can specify a maximum number of updates/batches using
Perplexity is the default metric to be considered for early-stopping, but you
can also choose to optimize accuracy or BLEU using the
argument. In case of optimizing with respect to BLEU, you will need to specify
--monitor-bleu. For efficiency reasons, sockeye spawns a sub-processes after each
checkpoint to decode the validation data and compute BLEU. This may introduce
some delay in the reporting of results, i.e. there may be checkpoints with no
BLEU results reported or with results corresponding to older checkpoints. This
is expected behaviour and sockeye internally keeps track of the results in the
Note that evaluation metrics for training data and held-out validation data are
written in a tab-separated file called
At each checkpoint, the internal state of the training process is stored to
disk. If the training is interrupted (e.g. due to a hardware failure), you can
start sockeye again, with the same parameters as for the initial call, and
training will resume from the last checkpoint. Note that this is different to
--params argument. This argument is used only to initialize the
training with pre-computed values for the parameters of the model, but the
parameters of the optimizer and other parts of the system are initialized from
Sockeye can write all evaluation metrics in a Tensorboard compatible format. This way you can monitor the training progress in the browser. To enable this feature, install the MXNet-compatible interface, mxboard:
> pip install mxboard
For visualization, you still need the official tensorboard release (i.e.
pip install tensorboard).
Start tensorboard and point it to the model directory (or any parent directory):
> tensorboard --logdir model_dir
By default, training is carried out on the first GPU device of your machine.
You can specify alternative GPU devices with the
--device-ids option, with
which you can also activate multi-GPU training (see below). If
--device-ids -1, sockeye will try to find a free GPU on your machine and block
until one is available. The locking mechanism is based on files and therefore assumes all processes are running
on the same machine with the same file system.
If this is not the case there is a chance that two processes will be using the same GPU and you run out of GPU memory.
If you do not have or do not want to use a GPU, specify
In this case a drop in performance is expected.
Training can be carried out on multiple GPUs by either specifying multiple GPU device ids:
--device-ids 0 1 2 3, or specifying the number GPUs required:
--device-ids -n attempts to acquire
n GPUs through
the locking mechanism described above.
This will train using Data Parallelism.
MXNet will divide the data in each batch and send it to the different devices.
Note that you should increase the batch size: for
k GPUs use
Also note that this will likely linearly increase your throughput in terms of sentences/second, but not necessarily
increase the model’s convergence speed.
A common technique for improving model performance is to average the weights for the last checkpoints. This can be done as follows:
> python -m sockeye.average <model_dir> -o <model_dir>/model.best.avg.params
Sockeye supports source factors, which are described in:
Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers.
Factors are enabled with two flags:
--source-factors argument takes one or more files that are token-parallel to the source.
This means that each line has the exact same number of whitespace-delimited tokens as the source file (
For example, if you have the following line of a source sentence:
the boy ate the waff@@ le .
You need a corresponding feature line of the following form:
O O O O B E O
(Same number of tokens).
This flag can also be supplied to the data preparation step.
Each source factor has its own vocabulary and learned embedding.
The source factors can be combined with the word embeddings in two ways: concatenation and summing.
For concatenation (
--source-factors-combine concat, the default), you need to specifiy the embedding sizes for each factor.
This is done with
--source-factors-num-embed X1 X2 ....
Since these embeddings concatenated to those of the word embeddings, the total source embedding size will be the sum of the word embeddings and all source factor embeddings.
You can also sum the embeddings (
In this case, you do not need to specify
--source-factors-num-embed, since they are automatically all set to the size of the word embeddings (
You then also have to apply factors for the source side at inference time.
Sockeye supports an auxiliary training objective that predicts length ratio (|reference|/|input|) or the reference length for each input,
that can be enabled by setting
--length-task, respectively, to
ratio or to
--length-task-layers to set the number of layers in the prediction MLP.
The weight of the loss in the global training objective is controlled with
--length-task-weight (standard cross-entropy loss has weight 1.0).
During inference the predictions can be used to reward longer translations by enabling