MTMA22 Day 4: Benchmarking and Applications


Now that the package is mostly solid, I spent the day benchmarking the code. I previously developed a small microbenchmark to test native vs. pybind speeds to ensure there are no major regressions, but I also wanted to check against other existing libraries. Because we’re using Helsinki NLP’s OPUS MT models to test the Marian inference, we thought we would use the checkpoints of the same models exported for HuggingFace Transformers. I’m not exactly sure how the checkpoints were made compatible (likely model surgery, similar to how HF exports fairseq checkpoints), but it mostly doesn’t matter.

I revisited my original benchmarking script and added a similar script for HF’s native implementation. The first issue I discovered is that neither HF PretrainedModels nor Pipelines support multi-GPU inference. I’ve been benchmarking Marian and Pymarian with 4 GPUs, and while we could easily drop down to a single GPU, that’s not a particularly interesting comparison. I found this lovely library called parallelformers, which mostly transparently wraps HF models and supports multi-GPU inference with AMP. Now I can test with 4 devices, as with the Marian models. Another stumbling block was the difference in batching. Marian forms batches internally, while HF more or less requires batching up front. For such a small test set of 1.4k examples it likely won’t contribute much noise, but I picked the largest batch size I could for HF without encountering an OOM. The script is below:

#!/usr/bin/env python3

import sys
import itertools

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from parallelformers import parallelize

def translate(model, tokenizer, batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, num_beams=4)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

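def batch(s, batch_size=1600):
    # Yield successive lists of up to batch_size items from the iterable s.
    it = iter(s)
    while True:
        chunk = list(itertools.islice(it, batch_size))
        if not chunk:
            # Input exhausted: stop the generator.
            break
        yield chunk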

if __name__ == "__main__":
    model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    parallelize(model, num_gpus=4, fp16=True)

    # 128 / 4 devices = 32 sents/GPU
    for b in batch(map(str.strip, sys.stdin), 128):
        print("\n".join(translate(model, tokenizer, b)))
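
The script above batches lines in the order they arrive, so a single long sentence can force a lot of padding on its whole batch. Marian's maxi-batch preloads a number of batches and sorts sentences by length before splitting them up, so one way to narrow the batching gap would be to do something similar on the HF side. Below is a minimal sketch under that assumption; `translate_sorted` and its `translate_batch` argument are hypothetical names, not part of either library, and whitespace token count is only a rough proxy for subword length.

```python
from typing import Callable, List


def translate_sorted(
    lines: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 128,
) -> List[str]:
    """Translate lines in length-sorted batches, returning results in input order."""
    # Sort indices by whitespace token count so each batch holds similar lengths.
    order = sorted(range(len(lines)), key=lambda i: len(lines[i].split()))
    results = [""] * len(lines)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        outputs = translate_batch([lines[i] for i in idx])
        # Scatter each output back to the position of its source sentence.
        for i, out in zip(idx, outputs):
            results[i] = out
    return results
```

With the script above, `translate_batch` could be something like `lambda b: translate(model, tokenizer, b)`. Note that this needs the whole input in memory, which is fine for a 1.4k-line test set.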

Running this several times on 4 NVIDIA TITAN Xps showed that the Marian code was on average ~6-6.5x faster than HuggingFace’s models (computed with fp16 precision). I also noticed that the translations were not identical between HF and Marian, though the translations from HF were typically adequate. Again, since I don’t know how the checkpoints were exported, it’s not easy to figure out what’s going on, but it doesn’t matter much.
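
To put a rough number on how often the two systems disagree, the outputs can be compared line by line. This is only a sketch: `diff_translations` is a hypothetical helper, it assumes both outputs have one translation per line in the same order, and an exact string match says nothing about translation quality.

```python
from typing import Iterable, List, Tuple


def diff_translations(a: Iterable[str], b: Iterable[str]) -> Tuple[float, List[int]]:
    """Return the exact-match rate and the 0-based indices of lines that differ."""
    # zip stops at the shorter input, so unequal lengths are silently truncated.
    pairs = list(zip(a, b))
    mismatched = [i for i, (x, y) in enumerate(pairs) if x.strip() != y.strip()]
    rate = 1.0 - len(mismatched) / len(pairs) if pairs else 1.0
    return rate, mismatched
```

Passing two open file handles (one per system's output) works, since files iterate line by line; the returned indices point at the sentences worth inspecting by hand.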

The next thing I wanted to do was create a tiny demo application for Pymarian. I have used streamlit in the past and I like it pretty well for demo purposes, so I decided to go with that. Marcin also wrote a PyQt fat client app and Matt Post wrote a tiny web service to mimic the MT API at Microsoft for development, but I won’t discuss these applications in more detail.

I wanted my demo app to support relatively few operations: load a Marian model using Pymarian, load an equivalent model in HuggingFace, provide a text box for some arbitrary source text (with an arbitrary upper limit of 4k chars), and offer a translate button. I wanted to display the Marian and HF translations side-by-side with some coarse timing info (this is a very unscientific way to compare, but it’s all super rough). Luckily streamlit is super easy to use, so I got something working in minutes. The whole of the app is below:

import time

import pymarian

import streamlit as st
import sentencepiece as spm

from transformers import pipeline

st.title("Hello, PyMarian!")

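# Note: as written, each helper below builds a fresh object on every call, so
# model loading is included in the timings. Wrapping the helpers in Streamlit's
# resource caching (st.cache_resource in recent releases) would avoid that.
def get_src_spm():
    return spm.SentencePieceProcessor('/home/marcinjd/MTMA/source.spm')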

def get_tgt_spm():
    return spm.SentencePieceProcessor('/home/marcinjd/MTMA/target.spm')

def get_model():
    return pymarian.Translator('--config /home/marcinjd/MTMA/decoder.yml -b4 --mini-batch 16 --quiet --maxi-batch 100 -d 0 1 2 3')

def get_hf_pipeline():
    return pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de", device=7)

def translate_pymarian(text):
    encoded_src = " ".join(get_src_spm().EncodeAsPieces(text))
    translation_tok = get_model().translate(encoded_src)
    return get_tgt_spm().Decode(translation_tok.split())

def translate_hf(text):
    return "\n".join(e["translation_text"] for e in get_hf_pipeline()(text))

text = st.text_area("Add text to translate...", max_chars=4096)

if st.button("Translate"):
    col1, col2 = st.columns(2)

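    with col1:
        # Pymarian translation with coarse wall-clock timing.
        t0 = time.time()
        translation = translate_pymarian(text)
        translation_time = time.time() - t0
        st.write(translation)
        st.write(f"Took {translation_time:.3f} seconds")
    with col2:
        # HuggingFace translation with coarse wall-clock timing.
        t0 = time.time()
        translation = translate_hf(text)
        translation_time = time.time() - t0
        st.write(translation)
        st.write(f"Took {translation_time:.3f} seconds")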
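
Both columns repeat the same start/stop timing pattern. A small context manager could factor that out; the sketch below uses a hypothetical `timed` helper (not part of streamlit or Pymarian) and `time.perf_counter`, which is better suited to measuring intervals than `time.time`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(report):
    """Time the enclosed block and pass a formatted message to report."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the block raises, so a failed call still shows its duration.
        report(f"Took {time.perf_counter() - t0:.3f} seconds")
```

In the app, this would be used as `with timed(st.write): translation = translate_pymarian(text)` inside each column.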

The result was a very functional and nice-looking demo application. I’m pretty happy with the progress. :-)