Vision Language Models

Regular price €76.99
Keywords: Vision-Language Models (VLMs), Multimodal AI, Computer Vision, Natural Language Processing, Deep Learning, Visual Question Answering, Image Captioning, Zero-shot Learning, Transfer Learning, Neural Networks, Attention Mechanisms

Product details

  • ISBN 9798341624047
  • Publication Date: 30 Jun 2026
  • Publisher: O'Reilly Media
  • Publication City/Country: US
  • Product Form: Paperback

Vision language models (VLMs) combine computer vision and natural language processing to create powerful systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using the most up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), and others. It is written by leading researchers and practitioners Merve Noyan, Miquel Farre, Andres Marafioti, and Orr Zohar. From image captioning and document understanding to advanced zero-shot inference and retrieval-augmented generation (RAG), this book covers the full VLM application and development lifecycle.

Designed for ML engineers, data scientists, and developers, this guide distills cutting-edge VLM research into practical techniques. Readers will learn how to prepare datasets, select the right architectures, fine-tune and deploy models, and apply them to real-world tasks across a range of industries.

  • Explore core model architectures and alignment techniques
  • Train and fine-tune VLMs with Hugging Face, PyTorch, and others
  • Deploy models for applications like image search and captioning
  • Implement advanced inference strategies, from zero-shot to agentic systems
  • Build scalable VLM systems ready for production use

Merve Noyan is a machine learning engineer on the ML advocacy engineering team at Hugging Face. She builds tools that help people build with vision language models across the Hugging Face ecosystem (transformers, TRL, smolagents). Previously, she worked at several companies building natural language understanding solutions for information retrieval and conversational agents.

Andres Marafioti holds a PhD in applied machine learning, with a focus on multimodal generative methods. Previously a senior ML engineer at Unity, he played a key role in bringing multimodal ML-based products from concept to market adoption. Now at Hugging Face, Andres leads research in multimodal and memory-efficient models, including the development of SmolVLM, a state-of-the-art vision-language model. He has co-authored several impactful papers in the VLM space, such as "Building and Better Understanding Vision-Language Models."

Miquel Farre is a video technology expert with over 15 years of experience and more than 60 patents in machine learning and information science. His career began at the Fraunhofer Institute, where he designed advanced video codecs, and continued at Nagravision, where he developed video streaming security modules. Transitioning to video understanding, Miquel joined Disney to architect the enterprise content metadata platform, leading machine learning initiatives across Pixar, Marvel, Lucasfilm, ABC, and ESPN. He then moved to YouTube, driving search monetization before expanding his focus to lead monetization for the platform's Home and Watch Next surfaces. Before joining Studio Jadu, he worked at Hugging Face on video multimodal large language models and founded Arbro AI to build automated farming solutions.

Orr Zohar is a PhD candidate in SVL at Stanford University, advised by Professor Serena Yeung-Levy and supported by the Knight-Hennessy Scholarship. His research centers on large multimodal models, particularly in video understanding, with a focus on self-training methodologies and agentic design. Orr has co-developed approaches such as Video-STaR, a self-training method for video instruction tuning, and VideoAgent, an agent-based framework for long-form video comprehension. He also led the Apollo project, a comprehensive study of video understanding in large multimodal models that produced the Apollo family of models, which set new benchmarks in the field.
