In this tutorial we will train a multilingual Sockeye model that can translate between several language pairs, including ones that we did not have training data for (this is called zero-shot translation).
Please note: this tutorial assumes that you are familiar with the introductory tutorials on copying sequences and training a standard WMT model.
There are several ways to train a multilingual translation system. This tutorial follows the approach described in Johnson et al. (2016).
In a nutshell, instead of training on a plain source and target sentence pair like this:
Wieder@@ aufnahme der Sitzungs@@ periode
Re@@ sumption of the session
we prefix the source sentence with a special token to indicate the desired target language:
<2en> Wieder@@ aufnahme der Sitzungs@@ periode
We prefix the target sentence in the same way, using the tag of the source language:
<2de> Re@@ sumption of the session
As a result, a trained model also learns to produce a language tag as the first token of its output.
Make sure to create a new Python virtual environment and activate it:
virtualenv -p python3 sockeye3
source sockeye3/bin/activate
Then install Sockeye itself (pin the version if you need to match this tutorial exactly), as well as several libraries for preprocessing, monitoring and evaluation:
# install Sockeye
pip install sockeye
# install plotting and monitoring tools
pip install matplotlib tensorboard
# install BPE library
pip install subword-nmt
# install sacrebleu for evaluation
pip install sacrebleu
# install Moses scripts for preprocessing
mkdir -p tools
git clone https://github.com/bricksdont/moses-scripts tools/moses-scripts
# install library to download Google drive files
pip install gdown
# download helper scripts
wget https://raw.githubusercontent.com/awslabs/sockeye/main/docs/tutorials/multilingual/prepare-iwslt17-multilingual.sh -P tools
wget https://raw.githubusercontent.com/awslabs/sockeye/main/docs/tutorials/multilingual/add_tag_to_lines.py -P tools
wget https://raw.githubusercontent.com/awslabs/sockeye/main/docs/tutorials/multilingual/remove_tag_from_translations.py -P tools
We will use data provided by the IWSLT 2017 multilingual shared task.
We limit ourselves to the training data of just three languages (DE, EN and IT), but in principle you could include many more language pairs, for instance NL and RO, which are also part of this IWSLT data set.
The preprocessing consists of the following steps: downloading and extracting the data, normalizing and tokenizing the text, learning and applying BPE, and adding target language tags.
Run the following script to obtain the IWSLT17 data in a convenient format; the code is adapted from the Fairseq example for preparing IWSLT17 data.
bash tools/prepare-iwslt17-multilingual.sh
After executing this script, all original files will be in iwslt_orig and the extracted text files will be in data.
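To quickly verify the download and extraction (a simple check, not part of the preparation script itself), list a few of the extracted text files:
# show a sample of the extracted files
ls data | head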
For the remaining preprocessing steps, we define a few shell variables for paths, training directions and test directions:
MOSES=tools/moses-scripts/scripts
DATA=data
TRAIN_PAIRS=(
"de en"
"en de"
"it en"
"en it"
)
TRAIN_SOURCES=(
"de"
"it"
)
TEST_PAIRS=(
"de en"
"en de"
"it en"
"en it"
"de it"
"it de"
)
We first create symlinks for the reverse training directions, i.e. EN-DE and EN-IT:
for SRC in "${TRAIN_SOURCES[@]}"; do
TGT=en  # the training data always pairs a language with English
for LANG in "${SRC}" "${TGT}"; do
for corpus in train valid; do
ln -s $corpus.${SRC}-${TGT}.${LANG} $DATA/$corpus.${TGT}-${SRC}.${LANG}
done
done
done
We then normalize and tokenize all texts:
for PAIR in "${TRAIN_PAIRS[@]}"; do
PAIR=($PAIR)
SRC=${PAIR[0]}
TGT=${PAIR[1]}
for LANG in "${SRC}" "${TGT}"; do
for corpus in train valid; do
cat "$DATA/${corpus}.${SRC}-${TGT}.${LANG}" | perl $MOSES/tokenizer/normalize-punctuation.perl | perl $MOSES/tokenizer/tokenizer.perl -a -q -l $LANG > "$DATA/${corpus}.${SRC}-${TGT}.tok.${LANG}"
done
done
done
for PAIR in "${TEST_PAIRS[@]}"; do
PAIR=($PAIR)
SRC=${PAIR[0]}
TGT=${PAIR[1]}
for LANG in "${SRC}" "${TGT}"; do
cat "$DATA/test.${SRC}-${TGT}.${LANG}" | perl $MOSES/tokenizer/normalize-punctuation.perl | perl $MOSES/tokenizer/tokenizer.perl -a -q -l $LANG > "$DATA/test.${SRC}-${TGT}.tok.${LANG}"
done
done
On tokenized text, we learn a BPE model as follows:
cat $DATA/train.*.tok.* > train.tmp
subword-nmt learn-joint-bpe-and-vocab -i train.tmp \
--write-vocabulary bpe.vocab \
--total-symbols --symbols 32000 -o bpe.codes
rm train.tmp
This will create a joint source and target BPE vocabulary. Next, we apply the Byte Pair Encoding to our training, development and test data:
for PAIR in "${TRAIN_PAIRS[@]}"; do
PAIR=($PAIR)
SRC=${PAIR[0]}
TGT=${PAIR[1]}
for LANG in "${SRC}" "${TGT}"; do
for corpus in train valid; do
subword-nmt apply-bpe -c bpe.codes --vocabulary bpe.vocab --vocabulary-threshold 50 < "$DATA/${corpus}.${SRC}-${TGT}.tok.${LANG}" > "$DATA/${corpus}.${SRC}-${TGT}.bpe.${LANG}"
done
done
done
for PAIR in "${TEST_PAIRS[@]}"; do
PAIR=($PAIR)
SRC=${PAIR[0]}
TGT=${PAIR[1]}
for LANG in "${SRC}" "${TGT}"; do
subword-nmt apply-bpe -c bpe.codes --vocabulary bpe.vocab --vocabulary-threshold 50 < "$DATA/test.${SRC}-${TGT}.tok.${LANG}" > "$DATA/test.${SRC}-${TGT}.bpe.${LANG}"
done
done
We also need to prefix the source sentences with a special tag indicating the target language, and the target sentences with a tag indicating the source language:
for PAIR in "${TRAIN_PAIRS[@]}"; do
PAIR=($PAIR)
SRC=${PAIR[0]}
TGT=${PAIR[1]}
for corpus in train valid; do
cat $DATA/$corpus.${SRC}-${TGT}.bpe.${SRC} | python tools/add_tag_to_lines.py --tag "<2${TGT}>" > $DATA/$corpus.${SRC}-${TGT}.tag.${SRC}
cat $DATA/$corpus.${SRC}-${TGT}.bpe.${TGT} | python tools/add_tag_to_lines.py --tag "<2${SRC}>" > $DATA/$corpus.${SRC}-${TGT}.tag.${TGT}
done
done
for PAIR in "${TEST_PAIRS[@]}"; do
PAIR=($PAIR)
SRC=${PAIR[0]}
TGT=${PAIR[1]}
cat $DATA/test.${SRC}-${TGT}.bpe.${SRC} | python tools/add_tag_to_lines.py --tag "<2${TGT}>" > $DATA/test.${SRC}-${TGT}.tag.${SRC}
cat $DATA/test.${SRC}-${TGT}.bpe.${TGT} | python tools/add_tag_to_lines.py --tag "<2${SRC}>" > $DATA/test.${SRC}-${TGT}.tag.${TGT}
done
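The helper script add_tag_to_lines.py is used here as a simple stdin-to-stdout filter. For intuition, its effect is roughly equivalent to prepending the tag and a space to every line; a minimal sed sketch (an illustration only, not the actual helper script) would be:
# hypothetical equivalent of the tagging step for a single file:
# prepend "<2it> " to every line of the BPE-encoded English side of the EN-IT training data
sed 's/^/<2it> /' $DATA/train.en-it.bpe.en | head -n 1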
Concatenate all individual files to obtain final training and development files:
for corpus in train valid; do
touch $DATA/$corpus.tag.src
touch $DATA/$corpus.tag.trg
# be specific here, to be safe
cat $DATA/$corpus.de-en.tag.de $DATA/$corpus.en-de.tag.en $DATA/$corpus.it-en.tag.it $DATA/$corpus.en-it.tag.en > $DATA/$corpus.tag.src
cat $DATA/$corpus.de-en.tag.en $DATA/$corpus.en-de.tag.de $DATA/$corpus.it-en.tag.en $DATA/$corpus.en-it.tag.it > $DATA/$corpus.tag.trg
done
As our test data, we need both the raw text and the preprocessed, tagged version: the tagged files serve as input for translation, while the raw text is used for evaluation, since we compute detokenized BLEU.
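For example, for the German-Italian zero-shot direction (file names follow the naming scheme used above), the two relevant files are:
# tagged, BPE-encoded German input for translation: $DATA/test.de-it.tag.de
# raw Italian reference used later by sacreBLEU:    $DATA/test.de-it.it
ls $DATA/test.de-it.*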
As a sanity check, compute the number of lines in all files:
wc -l $DATA/*
Sanity checks to perform at this point: corresponding source and target files should have the same number of lines, and every line of a tagged source file should start with a target language tag such as <2en>, <2de> or <2it>.
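The line-count check can be automated with a short loop (a minimal sketch, assuming the concatenated file names created above):
# verify that source and target sides of the final files are still parallel
for corpus in train valid; do
src_lines=$(wc -l < $DATA/$corpus.tag.src)
trg_lines=$(wc -l < $DATA/$corpus.tag.trg)
echo "$corpus: src=$src_lines trg=$trg_lines"
[ "$src_lines" -eq "$trg_lines" ] || echo "WARNING: $corpus source and target line counts differ"
done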
Before we start training, we prepare the training data by splitting it into shards and serializing it in matrix format:
python -m sockeye.prepare_data \
-s $DATA/train.tag.src \
-t $DATA/train.tag.trg \
-o train_data \
--shared-vocab
We can now kick off the training process:
python -m sockeye.train -d train_data \
-vs $DATA/valid.tag.src \
-vt $DATA/valid.tag.trg \
--shared-vocab \
--weight-tying-type src_trg_softmax \
--device-ids 0 \
--decode-and-evaluate-device-id 0 \
-o iwslt_model
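Since we installed tensorboard earlier, you can monitor training progress while the model trains. Assuming Sockeye writes its TensorBoard event files somewhere under the model directory (check iwslt_model for the exact location), pointing TensorBoard at the model directory is enough, since it searches subdirectories recursively:
# start TensorBoard and open http://localhost:6006 in a browser
tensorboard --logdir iwslt_model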
An interesting outcome of multilingual training is that a trained model is (to some extent) capable of translating between language pairs that it has not seen training examples for.
To test the zero-shot condition, we translate not only the trained directions, but also from German to Italian and vice versa. Both of those pairs are unknown to the model.
Let’s first try this for a single sentence in German. Remember to preprocess input text in exactly the same way as the training data.
echo "Was für ein schöner Tag!" | \
perl $MOSES/tokenizer/normalize-punctuation.perl | \
perl $MOSES/tokenizer/tokenizer.perl -a -q -l de | \
subword-nmt apply-bpe -c bpe.codes --vocabulary bpe.vocab --vocabulary-threshold 50 | \
python tools/add_tag_to_lines.py --tag "<2it>" | \
python -m sockeye.translate \
-m iwslt_model \
--beam-size 10 \
--length-penalty-alpha 1.0 \
--device-ids 1
If you trained your model for at least several hours, the output should be similar to:
<2en> Era un bel giorno !
This is a reasonable translation! Note that a well-trained model always generates a special language tag as the first token; in this case it is <2en>, since Italian data was always paired with English data in our training set.
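If you want just the translation without the leading tag, you can append the removal script we downloaded earlier (the same filter used in the post-processing below) to the pipeline, or run it on the saved output:
# strip the leading language tag from a translated line
echo "<2en> Era un bel giorno !" | python tools/remove_tag_from_translations.py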
Now let’s translate all of our test sets to evaluate performance in all translation directions:
mkdir -p translations
for TEST_PAIR in "${TEST_PAIRS[@]}"; do
TEST_PAIR=($TEST_PAIR)
SRC=${TEST_PAIR[0]}
TGT=${TEST_PAIR[1]}
python -m sockeye.translate \
-i $DATA/test.${SRC}-${TGT}.tag.${SRC} \
-o translations/test.${SRC}-${TGT}.tag.${TGT} \
-m iwslt_model \
--beam-size 10 \
--length-penalty-alpha 1.0 \
--device-ids 0 \
--batch-size 64
done
Next we post-process the translations: first removing the special language tag, then reversing the BPE, then detokenizing:
for TEST_PAIR in "${TEST_PAIRS[@]}"; do
TEST_PAIR=($TEST_PAIR)
SRC=${TEST_PAIR[0]}
TGT=${TEST_PAIR[1]}
# remove language tag
cat translations/test.${SRC}-${TGT}.tag.${TGT} | \
python tools/remove_tag_from_translations.py --verbose \
> translations/test.${SRC}-${TGT}.bpe.${TGT}
# remove BPE encoding
cat translations/test.${SRC}-${TGT}.bpe.${TGT} | sed -r 's/@@( |$)//g' > translations/test.${SRC}-${TGT}.tok.${TGT}
# remove tokenization
cat translations/test.${SRC}-${TGT}.tok.${TGT} | perl $MOSES/tokenizer/detokenizer.perl -l "${TGT}" > translations/test.${SRC}-${TGT}.${TGT}
done
Finally, we compute BLEU scores for all translation directions, including the two zero-shot directions, with sacreBLEU:
for TEST_PAIR in "${TEST_PAIRS[@]}"; do
TEST_PAIR=($TEST_PAIR)
SRC=${TEST_PAIR[0]}
TGT=${TEST_PAIR[1]}
echo "translations/test.${SRC}-${TGT}.${TGT}"
cat translations/test.${SRC}-${TGT}.${TGT} | sacrebleu $DATA/test.${SRC}-${TGT}.${TGT}
done
In this tutorial you trained a multilingual Sockeye model that can translate between several languages, including zero-shot pairs that did not occur in the training data.
You now know how to modify the training data to include special target language tags and how to translate and evaluate zero-shot directions.