🐵 The new version of GluonNLP, an NLP package for MXNet, features a BERT Base comparable with the original BERT Large, specialized BERT versions, new models (ERNIE, GPT-2, ESIM, etc.), and more datasets.
🤖 OpenAI released a bigger version of their GPT-2 language model. They also discuss lessons from coordinating with the research community on publication norms.
🌍 Monolingual models are not the only ones getting more powerful; cross-lingual models are too. New pretrained cross-lingual language models that outperform multilingual BERT are now available in 100 languages.
🤖 If you’re using spaCy and have been waiting to incorporate pretrained models in your applications, then look no further than spacy-pytorch-transformers. It allows you to use models like BERT in spaCy by interfacing with Hugging Face’s PyTorch implementations. The library also aligns the transformer features with spaCy’s linguistic tokenization, so you can apply the features to the actual words, instead of just wordpieces.
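To make the alignment idea concrete, here is a minimal pure-Python sketch. It is not the library's actual implementation: it assumes BERT-style wordpieces where a `##` prefix marks a continuation, and it assumes mean pooling as the way to combine wordpiece vectors into one vector per word. The function names and the toy feature vectors are hypothetical.

```python
from typing import List, Sequence


def align_wordpieces(wordpieces: Sequence[str]) -> List[List[int]]:
    """Group wordpiece indices by word, assuming '##' marks a continuation piece."""
    groups: List[List[int]] = []
    for i, piece in enumerate(wordpieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)  # continuation: attach to the current word
        else:
            groups.append([i])  # a new word starts here
    return groups


def pool_features(
    features: Sequence[Sequence[float]], groups: List[List[int]]
) -> List[List[float]]:
    """Average the wordpiece vectors in each group to get one vector per word."""
    pooled = []
    for idxs in groups:
        dim = len(features[idxs[0]])
        pooled.append(
            [sum(features[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        )
    return pooled


# Toy input: three wordpieces covering two words, with made-up 2-d features.
wordpieces = ["token", "##ization", "matters"]
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]

groups = align_wordpieces(wordpieces)
word_vectors = pool_features(features, groups)
```

With word-level vectors in hand, downstream components can work on linguistic tokens rather than on subword fragments.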
📝 Speaking of tokenization, wordpieces are great but can be quite slow, particularly when they are learned on very large corpora. YouTokenToMe is an unsupervised text tokenizer that implements byte pair encoding but is much faster (up to 90x) in training and tokenization than both fastBPE
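For readers unfamiliar with byte pair encoding, the following is a deliberately naive sketch of the basic algorithm: repeatedly merge the most frequent adjacent symbol pair. It is not YouTokenToMe's API or implementation, and it is the kind of slow approach that optimized tokenizers improve on. The function names and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus (hypothetical word counts) chosen so that no merge step has a tie.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges = learn_bpe(corpus, num_merges=3)
segmented = apply_bpe("hugs", merges)
```

Each training step rescans the whole vocabulary, so the cost grows quickly with corpus size and number of merges.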
💎 If you are working with spaCy and legal documents, then Blackstone is for you. It is a spaCy model and library for processing long-form, unstructured legal text. As far as I’m aware, it is the first open-source model trained for use on legal text, so it should be a great starting point if you’re working in this area.
🏊‍♂️ If you generally want to get more out of your data, then you should take a look at the new version of Snorkel, the state-of-the-art toolkit for programmatically building and managing training datasets. It introduces a unified, modular framework that should allow you to manage your training data and leverage weak supervision a lot more easily.
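As a rough illustration of weak supervision, the sketch below writes a few heuristic labeling functions and combines their votes. It is plain Python and not Snorkel's API; it also swaps in a simple majority vote where a toolkit such as Snorkel would typically learn how much to trust each function. The spam/ham task, the label constants, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Heuristic: messages with a URL are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Heuristic: messages mentioning a prize are likely spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    """Heuristic: very short messages are likely legitimate."""
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_link,
    lf_mentions_prize,
    lf_short_message,
]


def majority_vote(text: str, lfs: List[Callable[[str], int]]) -> int:
    """Combine labeling-function votes; abstain when there are no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels
    return ranked[0][0]


texts = [
    "Claim your prize at http://example.com now",
    "thanks, see you",
    "Meeting moved to Thursday afternoon, please confirm",
]
weak_labels = [majority_vote(t, LABELING_FUNCTIONS) for t in texts]
```

The resulting noisy labels can then be used to train a model on the examples where at least one heuristic fired.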
📖 Textual data becomes a first-class citizen in TensorFlow 2.0 with TensorFlow Text. Text is a collection of text-related classes and ops, including preprocessing ops that run directly as part of the TensorFlow graph.