Vision Language Models

Name: Vision Language Models
Brand: O'Reilly Media
SKU: 9798341624047
Price: 76.99 EUR
Availability: OutOfStock

Merve Noyan | Andres Marafioti | Miquel Farre | Orr Zohar

Ships in 14-28 days. 14-day return policy.

Vision-Language Models (VLMs), Multimodal AI, Computer Vision, Natural Language Processing, Deep Learning, Visual Question Answering, Image Captioning, Zero-shot Learning, Transfer Learning, Neural Networks, Attention Mechanisms

Product details

ISBN: 9798341624047
Publication Date: 30 Jun 2026
Publisher: O'Reilly Media
Publication City/Country: US
Product Form: Paperback

Vision language models (VLMs) combine computer vision and natural language processing to create systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using an up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), and others, written by researchers and practitioners Merve Noyan, Miquel Farre, Andres Marafioti, and Orr Zohar. From image captioning and document understanding to advanced zero-shot inference and retrieval-augmented generation (RAG), this book covers the full VLM application and development lifecycle.

Designed for ML engineers, data scientists, and developers, this guide distills cutting-edge VLM research into practical techniques. Readers will learn how to prepare datasets, select the right architectures, fine-tune and deploy models, and apply them to real-world tasks across a range of industries.

Explore core model architectures and alignment techniques
Train and fine-tune VLMs with Hugging Face, PyTorch, and others
Deploy models for applications like image search and captioning
Implement advanced inference strategies, from zero-shot to agentic systems
Build scalable VLM systems ready for production use

Merve Noyan is a machine learning engineer working in the ML advocacy engineering team at Hugging Face. She builds tools to enable people to build with vision language models across the Hugging Face ecosystem (transformers, TRL, smolagents). Previously she worked for different companies building natural language understanding based solutions on information retrieval and conversational agents.

Andres Marafioti holds a PhD in applied machine learning, with a focus on multimodal generative methods. Previously a senior ML engineer at Unity, he played a key role in bringing multimodal ML-based products from concept to market adoption. Now at Hugging Face, Andres leads cutting-edge research in multimodal and memory-efficient models, leading the development of SmolVLM, a state-of-the-art vision-language model. He has co-authored several impactful papers in the VLM space, such as "Building and Better Understanding Vision-Language Models."

Miquel Farre is a video technology expert with over 15 years of experience and more than 60 patents in machine learning and information science. His career began at the Fraunhofer Institute, where he designed advanced video codecs, and Nagravision, where he developed video streaming security modules. Transitioning to video understanding, Miquel joined Disney to architect the enterprise content metadata platform, leading machine learning initiatives across Pixar, Marvel, Lucasfilm, ABC, and ESPN. He then moved to YouTube, driving search monetization before expanding his focus to lead monetization for the platform's Home and Watch Next surfaces. Before joining Studio Jadu, he worked at Hugging Face on video multimodal large language models and founded Arbro AI to build automated farming solutions.

Orr Zohar is a PhD candidate in SVL at Stanford University, advised by Professor Serena Yeung-Levy and supported by the Knight-Hennessy Scholarship. His research centers on large multimodal models, particularly in video understanding, with a focus on self-training methodologies and agentic design. Orr has co-developed innovative approaches such as Video-STaR, a self-training method for video instruction tuning, and VideoAgent, an agent-based framework for long-form video comprehension. Notably, he led the Apollo project, a comprehensive study exploring video understanding in large multimodal models, resulting in the creation of the Apollo family of models that set new benchmarks in the field.
