🌍⏳ Do LMs Represent Space and Time?
In this post, we’ll take a closer look at the question “Do LMs represent space and time?”, inspired by a recent paper. We’ll review how spatial and temporal information is encoded in LMs and what this means for practical applications, as well as related aspects such as the encoding of fine-grained spatial relations and of space and time across cultures.
Language Models Represent Space and Time (Gurnee & Tegmark, Oct ‘23)
In a recent paper, Gurnee and Tegmark show that LLMs (Llama-2 models specifically) learn linear representations of space and time. What does this mean exactly?
The general setup looks like this:
The authors process the names of places and historical figures with Llama-2. They create their own dataset, sourced from Wikipedia, for this.
They then take the hidden state of the last token of the entity (for each layer) as a representation of the entity name.
Finally, they train a linear probe (a single linear layer) on the representation to predict the entity’s coordinates (latitude and longitude) or year of death.1
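To make this setup concrete, here is a minimal sketch of such a linear probe. Note that the data, dimensions, and ridge penalty are all invented for illustration—in the actual study, the features would be last-token hidden states from a Llama-2 layer and the targets real coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: X plays the role of last-token hidden states of place
# names from one model layer, Y their (latitude, longitude). We plant a
# linear structure so the probe has something to recover.
d_model, n_places = 64, 500
true_W = rng.normal(size=(d_model, 2))
X = rng.normal(size=(n_places, d_model))               # "hidden states"
Y = X @ true_W + 0.1 * rng.normal(size=(n_places, 2))  # "coordinates"

# Linear probe = ridge regression from hidden state to coordinates.
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(d_model), X.T @ Y)

# R^2 of the probe's predictions (on the toy training data here;
# the paper evaluates on held-out entities).
pred = X @ W
r2 = 1 - ((Y - pred) ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()
print(f"probe R^2: {r2:.3f}")
```

If the representation encodes coordinates linearly, the probe achieves high R²; a low score would suggest the information is absent or encoded non-linearly.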
They find that spatial and temporal information in Llama-2 can indeed be recovered with a linear probe, that larger models encode this information better, and that representations in the upper layers (from the middle layer to the last) achieve the highest accuracy. In other words, models learn a representation of places that—after a linear transformation—is more or less consistent with their location on a map.
That’s a pretty cool finding and visualization. However, how surprising is it that current LLMs encode a map-like representation of places? On the whole, not very.
Spatial Relations in word2vec
Alexander Doria already pointed out on Twitter that this is not a new observation and that geographic relationships were already encoded by much older models. The classic example is the word analogy task for word embeddings, which identifies relations in the embedding space using simple vector offsets such as
Paris - France + Italy = ? where the answer (i.e., the nearest neighbor in the embedding space) is expected to be Rome.2 In the embedding literature, models generally performed very well at encoding such semantic relations.
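As a toy illustration of the vector-offset method—with hand-crafted three-dimensional embeddings standing in for real word2vec vectors, which behave similarly:

```python
import numpy as np

# Hand-crafted toy embeddings in which the capital-of relation is a
# (roughly) constant offset along the second dimension.
emb = {
    "France": np.array([1.0, 0.0, 0.2]),
    "Paris":  np.array([1.0, 1.0, 0.2]),
    "Italy":  np.array([0.0, 0.0, 0.9]),
    "Rome":   np.array([0.0, 1.0, 0.9]),
    "Berlin": np.array([0.5, 1.0, 0.1]),
}

def analogy(a, b, c):
    """Return the word closest to b - a + c, excluding the query words."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("France", "Paris", "Italy"))  # -> Rome
```

In real embedding spaces the offset is only approximate, so the answer is taken as the nearest neighbor of the target vector rather than an exact match.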
These well-known relationships are between countries and their capitals. What about more fine-grained information such as the spatial coordinates investigated by Gurnee and Tegmark?
Analyzing Geographic Knowledge
There is an existing thread of research that has focused on injecting and analyzing geographic knowledge in models. Hovy and Purschke (2018) learn continuous representations of German cities using doc2vec. More recently, Hofmann et al. (2022) adapt pre-trained BERT models in different languages with geographic knowledge by predicting geolocations on geo-labeled data while Faisal and Anastasopoulos (2022) probe the geographic knowledge in GPT-2, mGPT, and BLOOM. In these studies, models do well at predicting place coordinates and produce map-like representations.
There is other work that focuses specifically on the task of geolocation, where even simple one-hidden-layer MLPs can do well (Rahimi et al., 2017). In light of this prior work, it is unsurprising that the latest LLMs encode spatial information. In addition, encoding spatial information does not seem to be a property that emerges only with sufficient model size. To determine whether recent LLMs are actually more spatially aware than their predecessors, it is thus important to compare them against prior models on established tasks such as user geolocation.
Overall, studies such as the one by Gurnee and Tegmark are crucial to get a better understanding of LLMs. However, rather than focusing solely on work on LLMs, these studies would benefit from being aware of and leveraging prior work as a source of baselines, evaluation datasets, and methods.
LLMs as Geographic Information Systems
As LLMs capture a surprising amount of geographic information, they may be useful for a range of other geography-related applications such as a geographic information system (GIS). A GIS is a computer system that stores, checks, analyzes, and displays geographic data—something that could be emulated by an LLM. Li and Ning (May 2023) show through GIS case studies how LLMs can be used, for instance, to identify the population living close to hazardous waste facilities and to map their distribution, among other applications.
Beyond accurately encoding and reasoning with geographic data, using LLMs as a GIS thus also requires them to interface with auxiliary tools such as data readers, calculators, code execution, and visualization, which I covered in a previous newsletter:
Encoding Fine-grained Spatial Relations
LLMs encode spatial information on a macro level—related to cities and places—but what about fine-grained spatial relations such as whether something is behind, next to, to the left of, or above something else? Are these also encoded in a consistent manner?
Prior work in this area (Ramalho et al., 2018) learned to encode spatial relations from natural language by generating images from textual descriptions with a VAE. Recently, Ji and Gao (July 2023) evaluated GPT-2 and BERT on their ability to encode geometric attributes, achieving up to 73% accuracy on spatial relations. For the largest LLMs, I am only aware of case studies that show awareness of certain spatial relations, such as Bubeck et al. (2023), so there is potential for more work in this area.
Regarding the encoding of time, it is important to look beyond synthetic tasks and to practical applications for evaluation. Given that the world we live in is constantly changing, it is critical to ensure that models reflect up-to-date information about the world. Prior work has used language modeling (Lazaridou et al., 2021) and question answering (Zhang & Choi, 2021) for model evaluation.
More recently, Tan et al. (ACL 2023) introduced a new temporal reasoning QA benchmark that assesses models on three levels of temporal reasoning: 1) relations between different times; 2) relations between times and events; and 3) relations between different events. In particular, time-event and event-event reasoning remains challenging even for the latest LLMs.
Encoding Space and Time Across Cultures
The way spatial and temporal information is expressed differs across languages and cultures. In Swahili, time is based on sunset and sunrise rather than a.m. and p.m. For example, 11.30 am in standard time is 5.30 in the morning in Swahili time. For a recent paper (Hlavnova & Ruder, ACL 2023), we evaluated LLMs on different types of reasoning across languages and found that they did much worse on languages with different time expressions such as Swahili. Similarly, understanding of time expressions can also be evaluated based on models’ ability to ground time expressions, i.e., to map culture-specific time expressions such as “morning” in English or “manhã” in Portuguese to specific hours in the day (Shwartz, 2022).
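The hour shift described above can be sketched in a few lines. The function name is my own, and the fixed six-hour offset is a simplification that ignores seasonal variation in sunrise:

```python
def to_swahili_hour(hour_24):
    """Convert an hour on the 24-hour clock to a Swahili-time hour.

    Swahili time counts hours from sunrise/sunset (roughly 6:00/18:00
    near the equator), so the hour is shifted by six relative to the
    standard clock.
    """
    h = (hour_24 - 6) % 12
    return 12 if h == 0 else h

# 11:30 a.m. standard time -> 5:30 in the morning in Swahili time.
print(to_swahili_hour(11))  # -> 5
```

A model that has only seen a.m./p.m. conventions during training has no reason to perform this shift, which is one plausible source of the degraded performance on Swahili.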
For spatial information, datasets such as MarVL (Liu et al., 2021) and Crossmodal-3600 (Thapliyal et al., 2022) can be used to investigate models’ visual perception across cultures—but I’m not aware of any datasets that enable an analysis of cross-cultural encoding of spatial information.
I hope you found this short review of space and time representations in language models interesting. Did I miss any interesting work in this space? What are your favorite observations and insights about how LLMs encode information? Let me know in the comments.
Training a linear classifier on the hidden representations of a model (without further fine-tuning) is standard methodology and has been used extensively to analyze BERT (Rogers et al., 2020). For more information on probing pre-trained models, check out these slides by Mohit Iyyer (Intro to NLP Spring 2023; based on slides from Tu Vu).