Why nutrition data quality will make or break AI in food

Artificial intelligence gets the headlines in food innovation. The real bottleneck is quieter, and it sits underneath every model: the quality of the data it learns from. A model is only ever as good as what you feed it, and in food that input is a mess.

Garbage in, garbage out

The oldest rule in computing still decides who wins. Whether you are running a large language model or a specialised machine-learning pipeline, the system depends on structured, accurate, standardised inputs. In food that means nutritional composition, bioactive compound concentrations, how processing changes nutrient stability, and the regulatory frameworks that govern health claims. Get the inputs wrong and the most sophisticated model on the market will hand you confident, well-formatted nonsense.

Why food is a hard data problem

Most of this information is scattered across academic papers, government databases, supplier specifications, and proprietary archives, and almost none of it agrees. Take a single ingredient. One dataset lists turmeric's curcumin content in milligrams per 100 grams. Another reports it as a percentage, on a different moisture basis. A third ignores processing entirely, even though drying and heating change the answer. Merge those sources and you do not get a richer picture. You get a patchwork of contradictions that quietly poisons every prediction downstream.
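
To make the turmeric example concrete, here is a minimal sketch of what reconciling those sources might involve: converting each reported value to a single unit and a single moisture basis before anything is merged. The `Measurement` type, the `to_mg_per_100g_dry` function, and the numbers are all hypothetical, chosen for illustration rather than taken from any real dataset. The sketch also assumes each source actually states its moisture content, which in practice is often the missing piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One reported concentration (hypothetical schema, for illustration)."""
    value: float
    unit: str            # "mg/100g" or "percent" (w/w)
    moisture_pct: float  # moisture of the sample as measured, 0 <= x < 100


def to_mg_per_100g_dry(m: Measurement) -> float:
    """Convert a reported value to mg per 100 g of dry matter."""
    if m.unit == "mg/100g":
        as_measured = m.value
    elif m.unit == "percent":
        # 1 % w/w = 1 g per 100 g = 1000 mg per 100 g
        as_measured = m.value * 1000.0
    else:
        raise ValueError(f"unsupported unit: {m.unit!r}")

    if not 0.0 <= m.moisture_pct < 100.0:
        raise ValueError("moisture_pct must be in [0, 100)")

    # Rescale from "as measured" to dry basis.
    dry_fraction = 1.0 - m.moisture_pct / 100.0
    return as_measured / dry_fraction


# Made-up values: two sources that look different but describe the same sample.
source_a = Measurement(value=3000.0, unit="mg/100g", moisture_pct=10.0)
source_b = Measurement(value=3.0, unit="percent", moisture_pct=10.0)

print(round(to_mg_per_100g_dry(source_a), 1))
print(round(to_mg_per_100g_dry(source_b), 1))
```

Once both values share a unit and a basis, a real disagreement between sources becomes visible as a difference in the numbers rather than hiding behind the notation. Processing state (dried, heated, raw) would need its own field, and this sketch does not attempt to model it.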

What bad data actually costs

This is not an academic concern. Poor data produces false positives: ingredients a model rates as promising that fail the moment they meet a real trial. It produces missed opportunities: compounds overlooked because they were documented badly or filed under the wrong name. And it produces regulatory dead ends: recommendations that cannot be approved in the target market because the compliance data was never there. In every case the weak link is not the algorithm. It is the input.

The foundation-model paradox

The arrival of powerful general models has made this worse, not better, in one specific way. Foundation models are extraordinary at language and pattern recognition, and they are confidently wrong in exactly the places food science is most fragile: precise concentrations, dose-response relationships, processing effects, and jurisdiction-specific rules. Retrieval-augmented generation helps, but retrieval is only as good as the corpus behind it. Point a brilliant model at a contradictory library and you industrialise the contradictions. The model does not fix the data problem. It amplifies whatever you already had.

The GLP-1 moment raises the stakes

The timing matters. The rise of GLP-1 medications has reset what people expect from food, with sudden and serious demand for protein, fibre, satiety, and metabolic health. Personalised nutrition has moved from novelty to expectation, and both consumers and regulators are scrutinising health claims harder than ever. The companies trying to ride this with guesswork and trend-chasing are about to learn that the market now rewards evidence, and evidence is a data problem before it is anything else.

How the serious players fix it

None of the fixes is glamorous, which is exactly why most teams skip them. The serious players do the work anyway. They normalise units, methods, and reference standards across sources. They build ontologies and knowledge graphs so machines understand how compounds, foods, and outcomes actually relate. They mine the literature with retrieval systems that turn scattered papers into structured, traceable records. They validate continuously with labs and suppliers, and they version every value and track its provenance, so a number can always be traced to where it came from. Increasingly they also build their own evaluations: benchmarks that test whether a model's nutrition claims hold up, rather than assuming they do.
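
As a rough illustration of the versioning and provenance point, here is one way a single nutrient value could carry its own history. Every name here (`Provenance`, `ValueVersion`, `NutrientRecord`) and every value is a hypothetical assumption made for the sketch. It is not a description of any particular team's system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class Provenance:
    """Where a value came from (hypothetical fields)."""
    source: str      # e.g. a paper, database, or supplier spec identifier
    method: str      # analytical method reported by the source
    retrieved: date  # when the value was pulled into the dataset


@dataclass(frozen=True)
class ValueVersion:
    value: float
    unit: str
    provenance: Provenance


@dataclass
class NutrientRecord:
    """Append-only history of one compound in one food."""
    food: str
    compound: str
    history: List[ValueVersion] = field(default_factory=list)

    def update(self, value: float, unit: str, provenance: Provenance) -> None:
        # Never overwrite: each change is a new version with its own source.
        self.history.append(ValueVersion(value, unit, provenance))

    @property
    def current(self) -> ValueVersion:
        if not self.history:
            raise LookupError("no value recorded yet")
        return self.history[-1]


# Made-up example values and source names.
record = NutrientRecord(food="turmeric, dried", compound="curcumin")
record.update(3000.0, "mg/100g",
              Provenance("supplier-spec-A", "HPLC", date(2024, 1, 15)))
record.update(3100.0, "mg/100g",
              Provenance("lab-validation-B", "HPLC", date(2024, 6, 2)))

print(record.current.value, record.current.provenance.source)
print(len(record.history), "versions on file")
```

The design choice that matters is that updates append rather than overwrite. The current value is simply the latest version, and the question "where did this number come from?" always has an answer.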

The real moat is the data

Here is the part founders and investors should sit with. In a world where everyone can rent the same models, the model is not the moat. The proprietary, validated, well-structured nutrition data underneath it is. Two teams using the same off-the-shelf model will produce wildly different results depending on what they feed it, and the one with the cleaner, deeper, better-governed data will win quietly and repeatedly. That advantage compounds, because good data infrastructure makes every additional dataset cheaper to add.

This is the bet behind what I am building at Alchemyst. The interesting work in AI for food is not a flashier model. It is the discipline of turning a fragmented, contradictory body of nutrition science into something a machine can reason over with confidence, and doing it rigorously enough that the output survives a lab, a regulator, and a shelf.

Bottom line

AI in food will not be won by the team with the cleverest algorithm. It will be won by the team that took the data seriously when no one else wanted to. Get the data wrong and nothing else matters. Get it right and you unlock genuine precision in how we design what we eat.