Google has introduced a new family of PaliGemma vision-language models, offering scalable performance, long captioning, and support for specialized tasks.
PaliGemma 2 was announced December 5, nearly seven months after the initial version launched as the first vision-language model in the Gemma family. Building on Gemma 2, PaliGemma 2 models can see, understand, and interact with visual input, according to Google.
PaliGemma 2 makes it easier for developers to add sophisticated vision-language features to apps, Google said, including richer captioning abilities such as identifying emotions and actions in images. Scalable performance in PaliGemma 2 means the model can be tuned to a given task by choosing among multiple model sizes (3B, 10B, and 28B parameters) and input resolutions (224px, 448px, and 896px). Long captioning in PaliGemma 2 generates detailed, contextually relevant captions for images, going beyond simple object identification to describe actions, emotions, and the overall narrative of the scene, Google said.
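As a rough illustration of how a developer might try one of these checkpoints, the sketch below assumes the Hugging Face Transformers integration used by the original PaliGemma (PaliGemmaForConditionalGeneration and AutoProcessor); the model ID, image filename, and "caption en" task prompt are illustrative assumptions, not confirmed details from Google's announcement.

```python
# Minimal captioning sketch, assuming a Transformers-style PaliGemma 2 checkpoint.
# The checkpoint name and prompt format below are assumptions for illustration.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"  # hypothetical 3B, 448px variant
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("scene.jpg")            # any local image
prompt = "caption en"                      # assumed captioning task prefix

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```

In this sketch, picking a larger parameter count or higher input resolution is simply a matter of pointing model_id at a different checkpoint, which is the trade-off Google describes between task performance and compute cost.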