Categorisation of images has long been possible with a variety of visual machine learning models. However, in almost all cases the categories have to be trained in advance on a labelled dataset. It would be useful to have a model that can categorise without any prior training. Such a model could be used, for instance, to build a database that categorises the images found in a folder structure, allowing more intuitive retrieval of desired images.

Vision Language Models
One possible way to approach this problem is to use a Vision-Language Model (VLM) to perform zero-shot categorisation. I decided to investigate the practicality of classifying images at scale with such a VLM, attempting to return an image style, an image category and a short description for each image. Through careful construction of a prompt I found that OpenAI’s gpt-4-vision-preview model could perform all three of these tasks in a single pass with good reliability: it returned a pipe-separated triple 100% of the time, classified image style and category with 100% accuracy, and produced what I would subjectively judge to be good descriptions in all test cases. Sample code and the prompt can be found on my GitHub.
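To give a feel for what is involved, the sketch below shows the general shape of such a call using the OpenAI Python client. The prompt wording here is a simplified stand-in rather than the exact prompt I used; the full code and prompt are on my GitHub.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Simplified stand-in prompt asking for a pipe-separated triple.
PROMPT = (
    "Look at the image and reply with exactly one line in the form "
    "style|category|description, where style is e.g. photo, illustration or diagram, "
    "category is a one- or two-word subject label, and description is one short sentence."
)

def categorise(image_path: str) -> tuple[str, str, str]:
    # Encode the local image as base64 so it can be sent inline.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # Parse the pipe-separated triple returned by the model.
    style, category, description = response.choices[0].message.content.split("|", 2)
    return style.strip(), category.strip(), description.strip()
```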
For my test samples, the cost was approximately £4-5 per thousand images, with a processing rate of roughly 800-1,000 images per hour. This certainly seems practical and useful for many applications.
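Using the midpoints of those rough figures, a quick back-of-the-envelope estimate for a library of a given size might look like this:

```python
# Rough feasibility estimate using the midpoints of the figures observed above:
# ~£4.50 per 1,000 images and ~900 images per hour.
COST_PER_IMAGE_GBP = 4.50 / 1000
IMAGES_PER_HOUR = 900

def estimate(num_images: int) -> None:
    cost = num_images * COST_PER_IMAGE_GBP
    hours = num_images / IMAGES_PER_HOUR
    print(f"{num_images} images: ~£{cost:.2f}, ~{hours:.1f} hours")

estimate(10_000)  # e.g. ~£45.00, ~11.1 hours
```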
Local Implementation of VLM
Given the cost and confidentiality implications of sending images to an online VLM, it would be nice to run categorisation using a local VLM. The issue here is that many of these models are very large and cannot be run locally on most people’s systems. However, a couple of promising compact VLMs have recently become available: MobileVLM and TinyGPT-V. Code for both is available on GitHub.
I decided to investigate MobileVLM, since its installation process seemed less involved. The installation instructions are good, but one of the libraries in its requirements.txt requires a prior PyTorch installation; you can pick the appropriate PyTorch build for your machine from the PyTorch website. You also need the CUDA compiler (nvcc) as well as the CUDA runtime. These can be downloaded from NVIDIA; be sure to use the version that matches your torch.version.cuda. The version I needed was 11.8, found here.
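A quick way to check which CUDA toolkit version your PyTorch build expects is to ask PyTorch itself:

```python
import torch

# The CUDA toolkit you install (for nvcc) should match the CUDA version
# your PyTorch build was compiled against, e.g. 11.8.
print("PyTorch version:       ", torch.__version__)
print("Built against CUDA:    ", torch.version.cuda)
print("CUDA device available: ", torch.cuda.is_available())
```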
Once you have installed MobileVLM as per its instructions and downloaded one of the models, you can run its example code. I decided to download version 2 of the smallest model, MobileVLM_V2-1.7B. However, attempting to run this model ran into issues: some dependencies turned out to be extremely intricate or limited to particular operating systems. I came across three issues in particular.
- It would appear that flash-attn is incompatible with my version of CUDA
- MobileVLM requires the bitsandbytes library, but the standard version of this is not Windows compatible and workarounds are temperamental
- For the deepspeed library I got an error message “unable to install flash-attn because it would not build and deepspeed because it is windows only”, but on the other hand other libraries appeared to be Linux only, including some of deepspeed’s dependencies!
Overall these dependency difficulties prevented me from installing MobileVLM at this stage. Attempts to install TinyGPT-V were also frustrated by an extremely long dependency list, including triton, which is Linux only, and some highly specific CUDA-related libraries.
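In hindsight, a small pre-flight check along the following lines can flag most of these platform problems before committing to a long install. The package list reflects the dependencies I ran into rather than anything exhaustive.

```python
import importlib.util
import platform

# Packages that caused trouble in my attempts: deepspeed and triton target Linux,
# bitsandbytes needs workarounds on Windows, and flash-attn (import name flash_attn)
# must build against a compatible CUDA toolkit.
LINUX_LEANING = ["deepspeed", "triton", "bitsandbytes", "flash_attn"]

if platform.system() != "Linux":
    print(f"Running on {platform.system()}: expect build problems with "
          f"{', '.join(LINUX_LEANING)}")

# Report which of the troublesome packages are already importable.
for pkg in LINUX_LEANING:
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg:12s} installed: {found}")
```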
Conclusion
The basic idea of using a VLM to assist with classification of image libraries appears to be practical. OpenAI’s gpt-4-vision-preview handled a zero-shot implementation successfully without needing anything beyond prompt engineering. As mentioned above, sample code and the prompt can be found on my GitHub.
I have not yet been able to test a local VLM for categorisation, though in principle it should be possible given a sufficiently capable model. Hopefully a more user-friendly compact VLM will be available soon, at which point I can revisit a local implementation.