Ai2’s Molmo Shows Open Source Can Match, and Even Beat, Closed Multimodal Models


The common belief is that only companies like Google, OpenAI, and Anthropic, with unlimited financial resources and top researchers, have the ability to create cutting-edge foundation models. However, as one of these companies has famously stated, they “have no moat” – and Ai2 has proven this with the release of Molmo, a multimodal AI model that rivals the best in the industry while also being small, free, and truly open source.

To clarify, Molmo (multimodal open language model) is a visual understanding engine, not a full-service chatbot like ChatGPT. It does not have an API, is not suited for enterprise integration, and does not search the web. Think of it as the part of those models that can see an image, comprehend it, and provide descriptions or answer questions about it.

Molmo comes in 72B-, 7B-, and 1B-parameter variants and, like other multimodal models, is capable of identifying and answering questions about almost any everyday situation or object. For example, it can answer questions like “How do you work this coffee maker?” or “How many dogs in this picture have their tongues out?” or “Which options on this menu are vegan?” or “What are the variables in this diagram?” These are tasks we have seen demonstrated with varying levels of success and latency over the years.

What sets Molmo apart is not necessarily its capabilities (which you can see in the demo below or test here), but how it achieves them. Visual understanding is a broad domain, ranging from counting sheep in a field to guessing a person’s emotional state to summarizing a menu. As such, it is difficult to describe and test quantitatively. However, as Ai2 CEO Ali Farhadi explained at a demo event at the organization’s Seattle headquarters, it is possible to show that two models are similar in their capabilities.

“One thing that we are demonstrating today is that open is equal to closed,” he said. “And small is now equal to big.” (He clarified that he meant “equivalency” rather than “identity,” which is an important distinction for some.)

In the world of AI development, the mantra has always been “bigger is better”: more training data, more parameters in the resulting model, and more computing power to create and operate them. However, there comes a point where it is simply not feasible to make models any bigger – either there is not enough data, or the costs and time required for computing become too high. In these cases, it is necessary to make do with what you have, or even better, do more with less.

Farhadi explained that Molmo, while performing on par with models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, is only about one-tenth of their size (according to best estimates) – and it approaches that level of capability even in a variant that is a tenth of that size again.

“There are many different benchmarks that people use to evaluate models. I don’t like this approach scientifically, but I had to provide a number for people,” he said. “Our largest model, the 72B, is actually a small model, and it outperforms many larger models.”
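Because Molmo’s weights are openly released, anyone can try this kind of visual question answering themselves. Below is a minimal sketch in Python that follows the usage pattern published on the Molmo 7B model card on Hugging Face at release (the allenai/Molmo-7B-D-0924 checkpoint). Note that processor.process and generate_from_batch come from the model’s own remote code (hence trust_remote_code=True) rather than the standard transformers API, and the image URL is an arbitrary placeholder.

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

MODEL_ID = "allenai/Molmo-7B-D-0924"  # the 1B and 72B checkpoints follow the same pattern

# Load the processor and model; trust_remote_code pulls in Molmo's custom classes
processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Fetch an image and pair it with a question about it (placeholder URL)
image = Image.open(requests.get("https://example.com/dogs.jpg", stream=True).raw)
inputs = processor.process(
    images=[image],
    text="How many dogs in this picture have their tongues out?",
)

# Move tensors to the model's device and add a batch dimension
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate an answer, stopping at the end-of-text token
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))

Swapping in the 1B or 72B variants should only require changing the checkpoint name, hardware permitting.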

Read More @ techcrunch.com 
