Although the quality of machine translation systems is nowadays remarkably good, it is sometimes important to specialize the MT output to the specifics of a given domain. Such customizations may include, among others, preferring certain word translations over others or adapting the style of the text. In this tutorial, we show two methods for performing domain adaptation of a general translation system using Sockeye.
We assume you already have a trained Sockeye model, for example the one trained in the WMT tutorial. We also assume that you have two training sets: one composed of general or out-of-domain (OOD) data, and one composed of in-domain (ID) data to which you want to adapt your system. Note that both datasets need to be pre-processed in the same way.
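For example, if the OOD corpus was segmented with subword-nmt as in the WMT tutorial, the same BPE model has to be applied to the ID data. A minimal sketch, assuming the codes learned on the OOD data are stored in a file `bpe.codes` and the raw ID files are `data/id.train.src` / `data/id.train.trg` (all of these names are placeholders for your actual files):

```bash
# Sketch: apply the BPE model learned on the OOD data to the in-domain corpus.
# "bpe.codes" and the file names are placeholders; if you learned separate codes
# per language, use the matching codes file for each side.
subword-nmt apply-bpe -c bpe.codes < data/id.train.src > data/id.train.src.bpe
subword-nmt apply-bpe -c bpe.codes < data/id.train.trg > data/id.train.trg.bpe
```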
First, you must be careful to prepare the in-domain training data using the same vocabulary as the out-of-domain data.
Assuming your prepared OOD data resides in `ood_data`:
```bash
python -m sockeye.prepare_data \
    -s data/id.train.src.bpe \
    -t data/id.train.trg.bpe \
    -o id_data \
    --source-vocab ood_data/vocab.src.0.json \
    --target-vocab ood_data/vocab.trg.0.json
```
Note: If your in-domain data is small, you may skip this step and add the corresponding arguments to the `sockeye.train` calls.
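As a sketch of that variant (using the continuation-training call shown below as a template; verify argument spellings with `python -m sockeye.train --help`), you would pass the BPE'd ID files and the OOD vocabularies directly:

```bash
# Sketch: skip sockeye.prepare_data and point sockeye.train at the BPE'd
# in-domain files directly, reusing the OOD vocabularies so that the model's
# parameters stay compatible. The output directory name is just an example.
python -m sockeye.train \
    --config ood/args.yaml \
    -s data/id.train.src.bpe \
    -t data/id.train.trg.bpe \
    --source-vocab ood_data/vocab.src.0.json \
    --target-vocab ood_data/vocab.trg.0.json \
    -vs data/id.dev.src.bpe \
    -vt data/id.dev.trg.bpe \
    --params ood/params.best \
    -o id_continuation_raw
```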
The first method fine-tunes the trained model: you start a second training run on the in-domain data, initializing the model with the parameters obtained on the out-of-domain data. Thus you “continue training” on the data you are more interested in. Freitag and Al-Onaizan (2016) showed that this straightforward technique can achieve good results.
When training a model, you can load an existing set of parameters with Sockeye's `--params` argument, pointing it to an already trained model. Assuming the trained model resides in `ood`, a possible invocation could be:
```bash
python -m sockeye.train \
    --config ood/args.yaml \
    -d id_data \
    -vs data/id.dev.src.bpe \
    -vt data/id.dev.trg.bpe \
    --params ood/params.best \
    -o id_continuation
```
Depending on the size of your in-domain training data, you may want to adjust the parameters of the learning algorithm (learning rate, decay, etc.) and perhaps the checkpoint interval.
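For instance (a sketch only; flag names can differ between Sockeye versions, so check `python -m sockeye.train --help` — in particular, the checkpoint flag is called `--checkpoint-frequency` in older versions), the continuation run could use a smaller initial learning rate and more frequent checkpoints:

```bash
# Sketch: lower initial learning rate and more frequent checkpoints for the
# typically small in-domain data. Verify the exact flag names with --help.
python -m sockeye.train \
    --config ood/args.yaml \
    -d id_data \
    -vs data/id.dev.src.bpe \
    -vt data/id.dev.trg.bpe \
    --params ood/params.best \
    --initial-learning-rate 0.0001 \
    --checkpoint-interval 1000 \
    -o id_continuation
```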
Learning Hidden Unit Contribution (LHUC) is a method proposed for NMT adaptation by Vilar (2018), where the output of each hidden unit in the network is extended with an additional multiplicative parameter. This parameter can strengthen or dampen the output of the corresponding unit.
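Schematically (a sketch following the usual LHUC formulation, not necessarily Sockeye's exact implementation), each hidden activation $h_i$ is rescaled by a bounded factor controlled by a new parameter $\rho_i$ introduced for the adaptation:

$$\hat{h}_i = a_i \cdot h_i, \qquad a_i = 2\,\sigma(\rho_i) \in (0, 2).$$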
The usage is very similar to the call shown above, but you have to specify an additional `--lhuc` argument. This argument accepts a space-separated list of components to which to apply the LHUC units (`encoder`, `decoder` or `state_init`), or you can specify `all` to add them to all supported components:
```bash
python -m sockeye.train \
    --config ood/args.yaml \
    -d id_data \
    -vs data/id.dev.src.bpe \
    -vt data/id.dev.trg.bpe \
    --params ood/params.best \
    --lhuc all \
    -o id_lhuc
```
Again, it may be beneficial to adjust the learning parameters for the adaptation run.
Markus Freitag and Yaser Al-Onaizan. 2016. Fast Domain Adaptation for Neural Machine Translation. arXiv e-prints.

David Vilar. 2018. Learning Hidden Unit Contribution for Adapting Neural Machine Translation Models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers).