A search for dread returned red paintings. What that reveals about language, meaning, and cultural heritage infrastructure.
The finding
During testing of our semantic search pilot, I searched the collection using the Norwegian word “redsel” (dread, fear).
The results came back full of red things. Red paintings. Red textiles. Red objects.
It took a moment. Look at the word: redsel. The first three letters are r-e-d. The machine had seen those letters, recognised the English colour, and found red things. It did not read “redsel” as fear. It read it as red with noise attached.
What I was actually testing
The search that led me there was “å ha hjertet i halsen” (heart in your throat). If the system understood Norwegian idiom, it should have returned images of dread. Munch’s The Scream first of all.
It returned nothing relevant. I swapped embedding models to compare: models from Google, OpenAI, and Voyage. Same result across all of them.
My first assumption was that the painting was simply not described correctly. Wrong. When I ran The Scream through our vision-language model, the description came back full of the right words: anxiety, nervousness, unease. The image was understood. The description was correct.
The failure was not in the image. It was not in the metadata. It was in the bridge between a Norwegian query and a Norwegian description. That bridge did not exist.
Why this happens
These models are trained on text that is overwhelmingly English. Norwegian is present, but thinly. And before any word becomes meaning, it is split into tokens, using a subword vocabulary that was also fitted mostly to English text.
The model has rarely seen “redsel” as a whole unit. It splits the word, and among the fragments sits “red”, which it has seen millions of times. When the thin Norwegian signal around that fragment is not strong enough to pull the result toward fear, the English colour wins.
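To make the mechanism concrete, here is a toy sketch of how a subword split can surface "red" inside "redsel". Everything in it is an assumption for illustration: the vocabulary is invented, and the greedy longest-match rule is a simplification, not the tokenizer of any model we tested. I have not verified how those models actually split the word.

```python
# Hypothetical vocabulary, invented for this example. Real tokenizers learn
# tens of thousands of pieces from their (mostly English) training text.
TOY_VOCAB = {"red", "sel", "fear", "r", "e", "d", "s", "l", "f", "a"}


def greedy_tokenize(word, vocab):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No known piece starts here: emit the single character as-is.
            pieces.append(word[start])
            start += 1
    return pieces


print(greedy_tokenize("redsel", TOY_VOCAB))  # ['red', 'sel']
print(greedy_tokenize("fear", TOY_VOCAB))    # ['fear']
```

In this toy setup the English word survives as one unit, while the Norwegian word arrives as two fragments, one of which is a very common English word in its own right.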
Underrepresentation is usually discussed as a matter of degree. Norwegian is captured “a little less well” than English. More data, gap closed. That framing makes it sound manageable.
But this is not a difference of degree. It is a difference of kind. When the Norwegian signal runs too thin, the model does not degrade gracefully. It falls back on English surface forms. It reads Norwegian as badly spelled English, and returns results that are confident and wrong.
What goes missing
The failure did not start with vocabulary. The model will correctly translate “redsel” to “fear” if asked directly. What failed was the associative layer: the path from a phrase about the body (“heart in your throat”) to a feeling, from the feeling to Munch, from Munch to the particular Norwegian dark.
That path is not made of words. It is cultural. It is the way a people tied a sensation to an image to a mood over a long shared history. That associative layer is exactly what a national museum holds. And it is exactly the layer that stays invisible to a model trained on someone else’s texts.
What this means for the collection
The museum’s project right now is access. Digitise, publish, make searchable, so anyone can find their way in.
But if the search cannot follow Norwegian meaning, the access is incomplete. The works are scanned and online. They are out of reach for anyone who thinks and searches in Norwegian.
We digitised the collection and made it slightly foreign to its own language.
Where we go from here
The finding does not point toward replacing foundation models. The vision model’s description of The Scream was already good. The problem is one specific link in the chain: where a Norwegian query should meet Norwegian meaning.
The question is where in the pipeline you place Norwegian understanding, and who builds it.
Options we are evaluating:
- Query rewriting before the embedding step, to bridge Norwegian input to richer Norwegian vocabulary
- Norwegian-language description artifacts generated and stored alongside each object, so the embedding space reflects Norwegian cultural meaning
- Fine-tuning or adapting the embedding model on Norwegian cultural heritage text
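The first option can be sketched in a few lines. This is a minimal illustration, not our implementation: the expansion table, its entries, and the function name `rewrite_query` are all hypothetical, and in practice the related terms would come from curators or a Norwegian language model rather than a hand-written dictionary.

```python
# Hypothetical expansion table. The Norwegian terms mirror the words the
# vision-language model used for The Scream (anxiety, nervousness, unease).
EXPANSIONS = {
    "å ha hjertet i halsen": ["redsel", "angst", "nervøsitet", "uro"],
    "redsel": ["frykt", "angst", "uro"],
}


def rewrite_query(query, expansions):
    """Append related Norwegian terms to a query before it is embedded."""
    key = query.strip().lower()
    extra = expansions.get(key, [])
    if not extra:
        # Unknown query: pass it through unchanged.
        return query
    # Keep the original wording first so literal matches still count.
    return " ".join([key] + extra)


print(rewrite_query("Å ha hjertet i halsen", EXPANSIONS))
# å ha hjertet i halsen redsel angst nervøsitet uro
print(rewrite_query("maleri", EXPANSIONS))
# maleri
```

The rewritten string, not the raw query, would then be passed to whichever embedding model is in use. The design choice is to leave the model untouched and give it more Norwegian signal to work with.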
A Norwegian-language description layer is, on closer inspection, a form of curation frozen into the embedding space: human understanding of the collection, placed one layer deeper in the infrastructure.
The accident that found it
I would never have searched for “redsel” in a planned test. I would have searched for “angst” or “fear”, gotten a half-decent result, and moved on. It was the throwaway search that cracked the system open and showed how it actually reads Norwegian.
That is worth noting for anyone building this kind of infrastructure. The failures that look like noise often carry the diagnosis.
This work is part of the semantic search pilot at Nasjonalmuseet.