Google’s DataGemma AI is a statistics wizard



Google is expanding its AI model family while addressing some of the biggest issues in the domain. Today, the company debuted DataGemma, a pair of open-source, instruction-tuned models that take a step toward mitigating the challenge of hallucinations – the tendency of large language models (LLMs) to provide inaccurate answers – on queries revolving around statistical data.

Available on Hugging Face for academic and research use, both new models build on the existing Gemma family of open models and use extensive real-world data from the Google-created Data Commons platform to ground their answers. The public platform provides an open knowledge graph with over 240 billion data points sourced from trusted organizations across economic, scientific, health and other sectors.
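For a sense of what those lookups involve, here is a minimal sketch using the platform's Python client (the `datacommons` package). The place and variable identifiers below are real Data Commons IDs, but the call shown reflects one version of the client and may differ in yours.

```python
# A minimal sketch of querying the Data Commons knowledge graph with
# its Python client (pip install datacommons). "geoId/06" is the Data
# Commons ID for California; "Count_Person" is its population variable.
import datacommons as dc

# Fetch the most recent population observation for California.
# (Depending on the client version, an API key may need to be set.)
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"California population: {population}")
```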

The models use two distinct approaches to enhance their factual accuracy in response to user questions. Both methods proved fairly effective in tests covering a diverse set of queries.

The answer to factual hallucinations 

LLMs have been the breakthrough in technology we all needed. Even though these models are just a few years old, they are already powering a range of applications, from code generation to customer support, and saving enterprises precious time and resources. However, even after all the progress, the models' tendency to hallucinate when dealing with questions around numerical and statistical data, or other timely facts, continues to be a problem. 

“Researchers have identified several causes for these phenomena, including the fundamentally probabilistic nature of LLM generations and the lack of sufficient factual coverage in training data,” Google researchers wrote in a paper published today.

Even traditional grounding approaches have not been very effective for statistical queries, which can involve a range of logical, arithmetic or comparison operations. Compounding the problem, public statistical data is distributed across a wide range of schemas and formats and requires considerable background context to interpret correctly. 

To address these gaps, Google researchers tapped Data Commons, one of the largest unified repositories of normalized public statistical data, and used two distinct approaches to interface it with the Gemma family of language models — essentially fine-tuning them into the new DataGemma models.

The first approach, called Retrieval Interleaved Generation, or RIG, enhances factual accuracy by checking the model's original generation against relevant statistics stored in Data Commons. To do this, the fine-tuned LLM produces a natural language query describing each statistical value it generates. A multi-model post-processing pipeline then converts each query into a structured data query, runs it against Data Commons to retrieve the relevant statistical answer, and uses the result to verify or correct the LLM's generation, with relevant citations.
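To make the flow concrete, here is a minimal, hypothetical sketch of RIG's interleave-and-verify loop in Python. The `[DC(...) -> ...]` marker format and all of the helper functions are illustrative stand-ins, not DataGemma's actual output format or tooling; only the control flow mirrors the description above.

```python
import re

# Hypothetical stand-ins for the real components: a fine-tuned model
# call and a Data Commons lookup.
def generate(question: str) -> str:
    # The fine-tuned LLM interleaves a natural language Data Commons
    # query alongside each statistic it produces.
    return ("In 2020, California's unemployment rate was "
            "[DC(unemployment rate in California in 2020) -> 10.2%].")

def fetch_from_data_commons(nl_query: str) -> tuple[str, str]:
    # In the real pipeline, post-processing converts the natural
    # language query into a structured one and runs it against the
    # knowledge graph. Hard-coded here purely for illustration.
    return "10.1%", "datacommons.org"

def rig_answer(question: str) -> str:
    draft = generate(question)

    def verify(match: re.Match) -> str:
        nl_query, llm_value = match.group(1), match.group(2)
        true_value, source = fetch_from_data_commons(nl_query)
        # Keep the retrieved value (with a citation) when available;
        # otherwise fall back to the model's own number.
        return f"{true_value} (source: {source})" if true_value else llm_value

    return re.sub(r"\[DC\((.+?)\) -> (.+?)\]", verify, draft)

print(rig_answer("What was California's unemployment rate in 2020?"))
# -> In 2020, California's unemployment rate was 10.1% (source: datacommons.org).
```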

While RIG builds on Toolformer, a known tool-use technique, the second approach, RAG, is the same retrieval-augmented generation many companies already use to help models incorporate relevant information beyond their training data.

In this case, the fine-tuned Gemma model uses the original statistical question to extract relevant variables and produce a natural language query for Data Commons. The query is then run against the database to fetch relevant statistics and tables. The extracted values, along with the original user query, are then used to prompt a long-context LLM, in this case Gemini 1.5 Pro, to generate the final answer with a high level of accuracy. 
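In code, the RAG path looks something like the sketch below. Again, all three helpers are hypothetical placeholders for the fine-tuned Gemma query generator, the Data Commons API and the Gemini 1.5 Pro call, not real interfaces.

```python
# A minimal sketch of DataGemma's RAG flow under those assumptions.
def gemma_generate_queries(question: str) -> list[str]:
    # The fine-tuned Gemma model extracts the relevant variables and
    # rewrites the question as natural language Data Commons queries.
    return ["unemployment rate in California in 2020"]

def data_commons_lookup(nl_query: str) -> str:
    # Runs the query against Data Commons; returns stats or tables.
    return f"{nl_query}: 10.1% (source: datacommons.org)"

def gemini_long_context(prompt: str) -> str:
    # A long-context LLM (Gemini 1.5 Pro in DataGemma's case) composes
    # the final answer from the retrieved statistics.
    return "Illustrative grounded answer."

def rag_answer(question: str) -> str:
    queries = gemma_generate_queries(question)
    tables = [data_commons_lookup(q) for q in queries]
    # Prompt the long-context model with the retrieved data plus the
    # original question so the answer stays grounded in the statistics.
    prompt = ("Using only the statistics below, answer the question.\n\n"
              + "\n\n".join(tables)
              + f"\n\nQuestion: {question}")
    return gemini_long_context(prompt)

print(rag_answer("What was California's unemployment rate in 2020?"))
```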

Significant improvements in early tests

When tested on a hand-produced set of 101 queries, DataGemma variants fine-tuned with RIG improved the factuality of baseline models from 5-17% to about 58%. 

With RAG, the results were a little less impressive – but still better than baseline models.

DataGemma models were able to answer 24-29% of the queries with statistical responses from Data Commons. In those responses, the LLM cited the numbers accurately 99% of the time, but it drew incorrect inferences from them in 6% to 20% of cases.

That said, it is clear that both RIG and RAG can prove effective in improving the accuracy of models handling statistical queries, especially those tied to research and decision-making. They both have different strengths and weaknesses, with RIG being faster but less detailed (as it retrieves individual statistics and verifies them) and RAG providing more comprehensive data but being constrained by data availability and the need for large context-handling capabilities.

Google hopes the public release of DataGemma with RIG and RAG will push further research into both approaches and open a way to build stronger, better-grounded models.

“Our research is ongoing, and we’re committed to refining these methodologies further as we scale up this work, subject it to rigorous testing, and ultimately integrate this enhanced functionality into both Gemma and Gemini models, initially through a phased, limited-access approach,” the company said in a blog post today.
