AI startup Hugging Face has released an open-source family of compact visual language models named SmolVLM. With two billion parameters, it’s built for on-device inference, which the company claims outperforms similar models with comparable GPU RAM usage and token throughputs. Three models are available at launch: SmolVLM-Base, which offers downstream fine-tuning; SmolVLM-Synthetic, a fine-tuned variant on synthetic data; and SmolVLM Instruct, a fine-tuned instruction variant.
What Is SmolVLM?
SmolVLM closely follows the architecture used with another Hugging Face visual language model, Idefics3. However, there are a few areas where SmolVLM deviates: Instead of using Llama 3.1 8B as its language backbone, Hugging Face opted for SmolLM2 1.7B, which it introduced in early November. The company also “aggressively” compressed the patched visual information 9x compared to the 4x with Idefics3. Other differences include:
- We use patches of 384*384, instead of 364×364, because 384 is divisible by 3, which is necessary for our pixel shuffle strategy to work
- For this, we change the vision backbone to use shape-optimized SigLIP with patches of 384×384 pixels and inner patches of 14×14
“This year has seen a boom in multimodal AI with many large vision language models released. The trends were to initially scale up compute, later scale up the data diversity by generating synthetic data with large models, and, recently, scale down to make these models more efficient,” writes Hugging Face researchers Andres Marafioti, Merve Noyan, Miquel Farre, Elie Bakouch, and Pedro Cuenca in a blog post. “Small open models allow local deployment to browser or edge devices, cut inference costs, and enable user customization. Some notable examples of these models include PaliGemma 3B, moondream2, and Qwen2VL.”
SmolVLM can understand and process images and text. It can also help with various tasks such as image captioning (what’s happening in an image), visual question answering (responding to questions based on what’s in an image), and storytelling (generating a story based on a series of images). However, Hugging Face states it cannot create images.
The company also warns against misuse, cautioning that SmolVLM cannot evaluate or score individuals, assist with critical automated decision-making, generate unreliable factual content, or be used for malicious purposes such as spam generation, disinformation campaigns, harassment, or unauthorized surveillance.
Here’s an example provided by the company:
When prompted, “I’m planning a visit to this temple, give me travel tips,” Hugging Face states this was the response provided by SmolVLM:
The Grand Palace is the most famous landmark in Bangkok and it’s worth visiting for its stunning architecture and history. Here are some of our favorite things about the palace:
The Main Building: Located at the center of the complex, the main building features intricate tile work with traditional Thai motifs that cover every inch of the structure. This includes patterns such as dragons, mythical creatures, flowers, and geometric shapes. The roofline has an elaborate design featuring multiple layers of tiles adorned with gold accents.
Towers & Spires: There are several towers within the Grand Palace including the Phra Nakhon Si Ayutthaya National Museum Tower which houses exhibits on Thai culture and heritage. Another notable tower is the Wat Yai Chai Mongkol Tower where you can find beautiful murals depicting scenes from Buddhist mythology.
Architecture Style: The architectural style combines elements of Hindu and Khmer styles seen commonly throughout Southeast Asia. You’ll notice influences like stupas (dome-shaped structures), lotus flower carvings, and ornate details typical of these regions’ religious buildings.
Visitor Experience: While exploring the Grand Palace, visitors should be aware of certain rules due to security measures taken by authorities. These include no photography inside certain areas or using flashbulbs outside. Additionally, there may be queues during peak tourist seasons so plan accordingly if you’re traveling here.
Travel Tips For Visiting The Grand Palace:
- Best Time To Visit: The best time to visit would likely be early morning before sunrise when temperatures drop significantly making walking around easier.
- Transportation: Public transportation options exist but they might not provide direct access to the Grand Palace itself; consider taking a taxi instead.
- Food Options: There aren’t many dining facilities available near the Grand Palace yet local street food vendors do offer delicious meals nearby.
- Clothing Guidelines: While not mandatory, wearing traditional Thai clothing (sarongs) could add cultural immersion experience. However, shorts/trousers are allowed provided they don’t go below knee level.
- Safety Measures: Visitors must follow signs indicating restricted zones marked “No Photography” among others. Always carry identification documents just in case.
By following these guidelines, your trip will undoubtedly enhance both your understanding of Thailand’s rich historical tapestry and personal enjoyment!
Hugging Face highlights that its newest visual language model has the “best memory usage” among all the incumbent models in transformers. “This allows it to run efficiently on-device such as a laptop,” the company explains.
SmolVLM’s model checkpoints, VLM datasets, training recipes, and tools have all been released under the Apache 2.0 license. Developers interested in playing with the model can use this available demo.
Featured Image: An AI-generated illustration of a vision-language model with a futuristic digital brain and interconnected nodes. Image credit: Dream Machine
Leave a Reply
You must be logged in to post a comment.