NLP + Speech. Does natural language processing (NLP) include speech? By definition, natural language takes many forms, spoken as well as written. In practice, most current work in the NLP community (here: work published at *ACL conferences) focuses on processing written language rather than the sound waves of a person’s speech. There are some notable exceptions, such as work on spoken language understanding (which focuses on processing speech transcripts) and the use of phoneme representations for cross-lingual transfer (Tsvetkov et al., 2016; Peters et al., 2017). More recently, a paper on vocalizing silent speech won the Best Paper award at EMNLP 2020, while architectures popular in NLP, such as the Transformer, are also used to process speech (Baevski et al., 2020). By and large, however, most work that focuses on speech is published at separate conferences such as ICASSP and Interspeech.
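To make the last point concrete, here is a minimal sketch of running a pretrained wav2vec 2.0 model (Baevski et al., 2020) for English speech recognition with the Hugging Face transformers library. The checkpoint name, the placeholder audio file, and the 16 kHz sampling rate are assumptions of this example rather than details from the work cited above.

```python
# Minimal sketch: transcribing an utterance with a pretrained wav2vec 2.0 model
# via the Hugging Face transformers library. The checkpoint and the 16 kHz
# sampling rate are assumptions of this example, not details from the text above.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load an utterance (placeholder file name) and resample to 16 kHz.
speech, sr = librosa.load("utterance.wav", sr=16_000)

# Turn the raw waveform into model inputs and decode greedily via CTC.
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```

The same Transformer-based speech encoder can also be fine-tuned for downstream tasks other than transcription, which is part of why such models blur the line between speech processing and NLP.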
Voice as an interface. At the same time, the importance of voice as an interface is growing. As more people come online in countries where a keyboard is not the default way to interact with a device, voice is likely to become the preferred mode of interaction. For ML, this means that voice data will become more important and easier to obtain in real-world applications.
Unwritten languages. Furthermore, many languages do not have a strong written tradition: they may not have an orthography or may not use one consistently. For such languages, systems that cut out the middleman and process speech directly, without mapping it to text, may be the only way to serve the needs of their speakers (Scharenborg et al., 2019). A recent line of such work focuses on end-to-end speech translation (Jia et al., 2019).
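As a rough illustration of what "direct" processing of speech can look like, the following PyTorch sketch wires an encoder over log-mel spectrogram frames straight into an autoregressive text decoder that produces target-language tokens, with no intermediate source-language transcript. It is a conceptual toy, not the architecture of Jia et al. (2019); all layer sizes and the vocabulary size are illustrative, and positional encodings are omitted for brevity.

```python
# Conceptual sketch of direct speech-to-text translation (not Jia et al., 2019):
# source-language audio features in, target-language tokens out, no transcript.
import torch
import torch.nn as nn

class DirectSpeechTranslator(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=8000):
        super().__init__()
        # Convolutional subsampling of the spectrogram in time.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        # Positional encodings omitted to keep the sketch short.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, target_tokens):
        # mels: (batch, time, n_mels); target_tokens: (batch, tgt_len)
        x = self.subsample(mels.transpose(1, 2)).transpose(1, 2)
        tgt = self.embed(target_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(x, tgt, tgt_mask=causal)
        return self.out(hidden)  # (batch, tgt_len, vocab_size)

model = DirectSpeechTranslator()
logits = model(torch.randn(2, 300, 80), torch.randint(0, 8000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 8000])
```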
Speech-based SLU systems. As AI is applied to more challenging, real-world settings, it will become more common to combine speech processing and spoken language understanding. For instance, similar to how neural models were initially applied to hand-crafted textual features and later to raw text, spoken language understanding systems will become more powerful once they receive raw audio as additional context, which will allow them to model prosody, among other things. Prosody conveys information about the emotional state of the speaker, their use of irony or sarcasm, and so on, and is thus important for affective computing systems. For a deep dive into how prosody can be incorporated into spoken language understanding models, check out Trang Tran’s PhD thesis from 2020.
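As a small illustration of the kind of signal prosody adds, the sketch below extracts a pitch contour and frame-level energy from an utterance with librosa; such framewise features could be concatenated with lexical representations inside an SLU model. This is only an illustrative recipe, not the method of the thesis above, and the audio file name is a placeholder.

```python
# Illustrative sketch: two simple prosodic signals (pitch and energy) extracted
# with librosa. Not the method of Tran (2020); the file name is a placeholder.
import numpy as np
import librosa

speech, sr = librosa.load("utterance.wav", sr=16_000)

# Fundamental frequency (pitch) contour via probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    speech, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Frame-level energy (root-mean-square amplitude).
energy = librosa.feature.rms(y=speech)[0]

# Stack into a (frames, 2) prosody matrix, filling unvoiced frames (NaN f0) with 0.
n = min(len(f0), len(energy))
prosody = np.stack([np.nan_to_num(f0[:n]), energy[:n]], axis=1)
print(prosody.shape)
```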
Multi-modal representation learning. Bridging different modalities will also become more important for learning models that can be applied to the real world, which is inherently multi-modal. For example, when analyzing videos, it is important to incorporate information from the visual, audio, and speech streams (Alayrac et al., 2020).
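One common way to bridge such modalities is to train per-modality encoders whose outputs are pulled into a shared embedding space with a contrastive objective. The sketch below shows a simplified pairwise InfoNCE-style loss over video, audio, and text embeddings; it is in the spirit of this line of self-supervised multimodal work but is not the actual training setup of Alayrac et al. (2020), and the encoders themselves are stubbed out with random tensors.

```python
# Simplified sketch of aligning video, audio, and text embeddings with pairwise
# contrastive losses (not the training setup of Alayrac et al., 2020).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss: matching pairs lie on the batch diagonal."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for outputs of (hypothetical) per-modality encoders,
# already projected to a shared embedding dimension.
batch, dim = 8, 128
video_emb = torch.randn(batch, dim)
audio_emb = torch.randn(batch, dim)
text_emb = torch.randn(batch, dim)

# Sum pairwise losses so all three modalities are pulled into one space.
loss = (info_nce(video_emb, audio_emb) +
        info_nce(video_emb, text_emb) +
        info_nce(audio_emb, text_emb))
print(loss.item())
```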
Beyond text and speech. While this section mainly focused on bridging written and spoken language, other modalities are equally relevant. For instance, language can be expressed visually as sign language (have a look at this amazing overview by Amit Moryossef) or conveyed via extralinguistic information. Language can also be tactile, as in Braille or tactile signing, and researchers have even used olfactory signals to ground semantics (Kiela et al., 2015).