Using a VLM for Categorisation of Images (Part 2)

Last month I produced a proof-of-concept zero-shot categoriser for images using OpenAI's GPT-4 vision-language model (VLM) capabilities. However, this had the disadvantage that each image had to be base64 encoded and sent to OpenAI for processing, with both cost and privacy implications. It would be nice to be able to perform this task locally.

Last month I attempted to run MobileVLM and TinyGPT-V locally, but was thwarted by Linux-only dependencies. However, this month I found a different approach using LLaVA running on Ollama.

LLaVA

LLaVA is an open source VLM available in 7 billion, 13 billion and 34 billion parameter variants. It pairs the Vicuna LLM with a visual encoder for combined visual and language processing. Its relatively small size makes it practical to run locally on standard desktop hardware.

Another option might have been BakLLaVA; however, I decided to test out LLaVA since it seemed better supported and documented.

Ollama

Ollama is a very useful tool which allows you to run LLMs and VLMs locally on your machine. It lets you access the models you download either through the command line or via a locally hosted endpoint, found at http://localhost:11434 by default. Additionally, there is now an Ollama Python library, installed with `pip install ollama`, which abstracts away the endpoint calls and makes interacting with your local models even easier.
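For example, a minimal sketch of hitting the locally hosted endpoint directly with the requests library might look like this (the prompt is just a placeholder):

```python
# A minimal sketch of calling the local Ollama endpoint directly.
# Assumes Ollama is running and the model has already been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "Describe a sunset in one sentence.",  # placeholder prompt
        "stream": False,  # return a single JSON object rather than a stream
    },
)
print(response.json()["response"])
```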

Conveniently, Ollama now supports LLaVA and BakLLaVA, although some other promising models such as CogVLM and Qwen-VL are not yet supported.

Setting up Ollama

There are excellent tutorials on getting Ollama set up, like this one and this one. However, in short:

  • Go to the Ollama website and download Ollama
  • Run Ollama; on Windows this will give you a PowerShell prompt
  • Entering `ollama run llava:7b` will download, install, and run the smallest version of LLaVA
  • Once LLaVA is running on Ollama, it can be addressed through the PowerShell window, the locally hosted endpoint, or the Python library (a quick sanity check is sketched below).
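Once those steps are done, a short check with the Python library confirms the model is reachable (the prompt here is just a placeholder):

```python
# A quick sanity check, assuming `ollama run llava:7b` has already pulled the
# model and the ollama Python library is installed (pip install ollama).
import ollama

# Send a trivial prompt and print the reply to confirm the model responds
reply = ollama.chat(
    model="llava:7b",
    messages=[{"role": "user", "content": "Reply with the single word 'ready'."}],
)
print(reply["message"]["content"])
```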

Modifications to Proof of Concept Categoriser

Modifying my previous proof-of-concept categoriser was relatively simple. I replaced the call to OpenAI with a call to the LLaVA model on my local Ollama server. The Ollama library makes it easy to feed images to the model directly from their file path, so I removed the base64 encoding step I previously used for the images. Additionally, I found that the ollama.chat command returns the results as a Python object, so I was also able to remove the code which decoded the returned JSON and instead process the response more directly.
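A minimal sketch of the amended call is shown below; the prompt, category lists and image path are illustrative placeholders rather than the exact ones used in my code:

```python
# An illustrative sketch of the amended classification call.
# The prompt, categories and image path are placeholders.
import ollama

PROMPT = (
    "Classify this image and reply only with a pipe separated tuple of the form "
    "Style|Subject|Description. Style must be one of: Photo, Painting, Drawing. "
    "Subject must be one of: People, Animals, Landscape, Object."
)

def classify(image_path: str) -> str:
    response = ollama.chat(
        model="llava:7b",
        messages=[{
            "role": "user",
            "content": PROMPT,
            "images": [image_path],  # the library accepts a file path directly
        }],
    )
    # The response behaves like a Python dict, so no manual JSON decoding is needed
    return response["message"]["content"].strip()

print(classify("example.jpg"))
```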

The amended code using Ollama and LLaVA can be found on Github.

Performance and Limitations

We can compare last month's OpenAI-based classifier to the Ollama-LLaVA based classifier on several factors.

Speed: On my system, images processed at 1600-1800 images per hour using LLaVA:7b, roughly twice as fast as via OpenAI. This figure will obviously depend on local hardware in a way that processing via OpenAI does not, but even on a modest system, speeds should be competitive.

Price: The only cost is electricity, so the Ollama-LLaVA approach wins on price hands down, making it practical for categorising large volumes of images.

Accuracy: Ollama-LLaVA was less accurate than OpenAI, achieving accurate labelling in about 80% to 90% of cases. In particular, it had a couple of interesting failure modes. These were not so much failures of image recognition as failures of task understanding.

One failure mode was that, rather than returning a pipe-separated tuple of the form “Style|Subject|Description”, the model simply returned the exact string “Style|Subject|Description”.

Another failure mode was to hallucinate additional categories for style or subject which had not been specified as permissible, for instance “Sports and Recreation”.

[Figure: examples of erroneous classification are highlighted, including repeating the input prompt and hallucinating categories]

I discovered that submitting the same image with the same prompt can produce a successful classification on one run and an unsuccessful one on another, so identifying and rerunning failed images might be a reasonable way to improve performance.
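Since both failure modes are easy to spot from the returned string, one possible approach (the permitted category lists and retry limit below are illustrative, not the ones from my actual prompt) is to validate each response and simply resubmit failures:

```python
# An illustrative sketch of validating LLaVA's output and retrying failures.
# The permitted categories and retry limit are placeholders.
PERMITTED_STYLES = {"Photo", "Painting", "Drawing"}
PERMITTED_SUBJECTS = {"People", "Animals", "Landscape", "Object"}

def is_valid(label: str) -> bool:
    parts = label.split("|")
    if len(parts) != 3:
        return False                      # not a pipe separated tuple
    style, subject, description = (p.strip() for p in parts)
    if style == "Style" or subject == "Subject":
        return False                      # the model just echoed the template
    # Reject hallucinated categories that were never offered
    return style in PERMITTED_STYLES and subject in PERMITTED_SUBJECTS

def classify_with_retry(image_path: str, attempts: int = 3) -> str | None:
    for _ in range(attempts):
        label = classify(image_path)      # classify() as sketched earlier
        if is_valid(label):
            return label
    return None                           # flag for manual review or a bigger model
```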

OpenAI has a clear accuracy advantage over LLaVA:7b, but this is hardly surprising given its larger size. Nevertheless, for casual and non-critical tasks even the smallest LLaVA model appears to be a decent zero-shot classifier which can assist considerably with labelling tasks.

Privacy: Here the local approach wins hands down. If you are processing sensitive images, use Ollama and a local VLM.

Conclusion

Overall, the Ollama-LLaVA approach wins on most metrics. While its smallest model is significantly less accurate than OpenAI, LLaVA's speed, ease of local use and price competitiveness make it the more practical approach for on-device classification of large numbers of images. Furthermore, using a larger version of the model, or adding some basic parsing to identify hallucinations or inappropriate responses and either mark them as such or refer them to a more capable model, should ensure that bad classifications are minimised while costs are kept low.

 
