(By Tord Nilsen, Senior Advisor National Museum Norway)
In our Semantic Search prototype, we have experienced some issues where the OpenAI GPT-4 Vision API sometimes will misinterpret images with low resolution.
To avoid higher API costs than necessary, we have been sending images with a resolution of 256×256 pixels, but have observed that the API sometimes misinterprets the content or details of the image. This happens especially on blurry or pixelated images or drawings.
The recommended image resolution for the OpenAI GPT-4 Vision API is 512×512 pixels. It is also possible to set a detail parameter (low or high) which controls the resolution in which the model views the image. The default is auto, which means that if the image is of low resolution, it will be converted to 512×512 pixels, and processed using 65 tokens; else it sets to high and uses 130 tokens.
So how does the resolution affect the interpretation of the image?
The image description for the above image, when analysed from a 256×256 pixel source, is this:
The image titled “Mor, hvorfor holder du meg så fast? VII” by artist Arne Bendik Sjur is a fine art piece depicting a warm and intimate scene. The artwork is a fine example of cold needle on paper technique, often producing rich textures and fine lines, which can be felt throughout the piece. At the heart of the image is a child, their face alight with a serene and contented expression. The child’s eyes are rendered with attention to detail, twinkling with a sense of joy and security. The cheeks are rosy, suggesting the flush of warmth, and the lips curve into a subtle smile. The child is hugging a small dog close to their chest. The dog appears relaxed, nestled within the child’s embrace, its fur suggested through fine, delicate lines that offer a sense of softness. The child’s hands gently wrap around the dog, delicate fingers showing a careful but firm grip, indicative of both affection and a desire to protect. The use of monochromatic tones here suggest the use of black ink, providing a strong contrast between the figures and the background. The background is filled with crosshatched shading, which may symbolize the enclosing warmth or perhaps the confines of a room or a safe space the child and dog inhabit. The composition is centered within the paper, with a generous amount of whitespace surrounding the central image, which draws the viewer’s focus directly to the tender act depicted. Overall, the image encapsulates a moment frozen in time, one that speaks to universal themes of love, protection, and the simple joys of companionship between a child and their pet.
Obviously, this is not a warm and intimate scene, with “a child hugging a small dog”.
By increasing the resolution to 512×512 we get this description:
…The image is rather dark, conveying a sense of intimacy or possibly eeriness depending on interpretation. At the center of the composition, there is a figure that appears to be a woman or young girl with very distinct, yet simple facial features exaggerated in size, such as the large, round eyes that appear luminous and slightly gazing upward. Her mouth is small and smiles subtly. Her hair is dark and appears to be pulled back from her face. The woman is embracing a creature or an object that resembles a small animal, possibly a monkey or a bear, with notable details such as eyes and a defined snout, and it’s cradled closely to her chest. The creature’s expression is ambiguous, possibly content or somber. Despite the use of minimal color, the image is rich with textural details achieved through the etching technique; the cross-hatching and different pressures of the tool create a tactile feel that suggests the softness of the figure’s hair and the fur of the creature she holds….
Now the API understands that it might be “rather dark”, “embracing a creature”. Not a dog, but possibly a monkey or a bear.
So what if we try 1024×1024 pixels?
…In the image, we see a central figure that appears to be a young woman with a wide, unsettling smile, gazing directly at the viewer. Her eyes are enlarged, which contributes to an eerie expression. She is hunched over, her arms wrapped around an object—a human skull—holding it closely to her chest. The pose and embrace could be interpreted as maternal or nurturing, despite the macabre nature of the object she’s cradling…
The skull is depicted with a fair degree of realism; its hollow eye sockets and teeth are carefully etched into the metal plate to capture the shadow and depth of the human bone structure.
We now have a much more correct description of this image.
So what do we do next?
For the production database we will probably re-index all images with at least 512×512 pixels resolution. We will also extend the functionality of the feedback system to make it easy for users to report images that have misinterpreted or inaccurate descriptions.
We could also consider sophisticating the image description retrieval process, by first sending the image with low resolution (e.g. 512×512 pixels), and if the result contains ‘sketch’, ‘blurry’ or other keywords, we can resend the image to the API in a higher resolution.