Wednesday, December 18, 2024

Google Unveils PaliGemma 2 Vision-Language Models for Advanced Task Transfer


Google has announced the launch of PaliGemma 2, a family of vision-language models (VLMs) based on the Gemma 2 architecture, building on its predecessor, PaliGemma, with broader task applicability.

The upgrade includes three model sizes (3B, 10B, and 28B parameters) and three input resolutions (224×224, 448×448, and 896×896 pixels), designed to optimise transfer learning across diverse domains.

According to Google, the models were trained in three stages using Cloud TPU infrastructure to handle multimodal datasets spanning captioning, optical character recognition (OCR), and radiography report generation. These open-weight models facilitate fine-tuning across more than 30 transfer tasks, improving state-of-the-art results in fields such as molecular structure recognition, optical music score transcription, and table structure analysis.
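For readers who want to try the open weights, the snippet below is a minimal inference sketch, assuming the Hugging Face Transformers integration of PaliGemma and an illustrative checkpoint name (google/paligemma2-3b-pt-224); exact model ids and prompts should be verified against the official release.

```python
# Minimal inference sketch for PaliGemma 2, assuming the Hugging Face
# Transformers integration and an illustrative (assumed) checkpoint id.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"   # assumed checkpoint name; verify on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("example.jpg")          # any RGB image on disk
prompt = "caption en"                      # PaliGemma-style task prefix

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the tokens generated after the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True))
```

Fine-tuning for a specific transfer task typically starts from the same pretrained checkpoint with a task-specific prompt prefix and labelled examples, which is how the downstream results below are obtained.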

In their paper, the researchers explain, “We observed that increasing the image resolution and model size significantly impacts transfer performance, especially for document and visual-text recognition tasks.” The models achieved state-of-the-art accuracy on datasets such as HierText for OCR and GrandStaff for music score transcription.

The fine-tuning capabilities of PaliGemma 2 allow it to address applications beyond traditional benchmarks. The researchers noted that while increasing compute resources yields better results for most tasks, certain specialised applications benefit more from either higher resolution or larger model size, depending on task complexity.

PaliGemma 2 also emphasises accessibility, with models designed to run in low-precision formats for on-device inference. The researchers highlight: “Quantization of models for CPU-only environments retains nearly equivalent quality, making it suitable for broader deployments.”
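As an illustration of low-precision deployment, the sketch below loads the model with 4-bit weights using the bitsandbytes integration in Hugging Face Transformers. This is a general reduced-precision path, not necessarily the CPU-only quantization setup described in the paper, and the checkpoint id is again an assumption.

```python
# Hedged sketch of low-precision loading via bitsandbytes 4-bit quantization.
# Illustrates reduced-precision inference in general; the paper's CPU-only
# quantization setup may use a different toolchain. Checkpoint id is assumed.
import torch
from transformers import (AutoProcessor, BitsAndBytesConfig,
                          PaliGemmaForConditionalGeneration)

model_id = "google/paligemma2-3b-pt-224"    # assumed checkpoint name
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
print(f"Loaded {model_id} with 4-bit weights.")
```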

Google DeepMind has introduced Genie 2, a large-scale foundation world model capable of generating diverse playable 3D environments. Genie 2 transforms a single image into interactive virtual worlds that can be explored by humans or AI using standard keyboard and mouse controls, facilitating the development of embodied AI agents.

Additionally, Google DeepMind has launched GenCast, an AI model that enhances weather predictions by providing faster and more accurate forecasts up to 15 days in advance, while also quantifying forecast uncertainty and the risk of extreme weather.

Google has also unveiled its experimental AI model, Gemini-Exp-1121, positioned as a competitor to OpenAI’s GPT-4o. The company is gearing up to release Google Gemini 2, which is expected to compete with OpenAI’s forthcoming model, o1.

