Summary
Large language models with web search produce variable outputs: the same prompt, submitted repeatedly to the same system on the same day, yields different responses each time. The brands these systems name and the sources they cite shift from one run to the next. This variability poses a practical problem for anyone monitoring these systems: when a brand's visibility moves from 40% to 35% week over week, it is difficult to know whether something real has changed (a model update, new content, a reindex) or whether the difference reflects sampling noise alone.
This study addresses a single question: how many times must a prompt be run before a genuine change can be distinguished from random variation? We submitted 7,200 queries in a single day — six product-research prompts spanning three industries, on both ChatGPT and Gemini — and measured the magnitude of sampling noise directly, rather than estimating it from a theoretical formula.
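One way to measure sampling noise directly is to resample from a pool of runs collected under identical conditions. The sketch below illustrates that idea and is not necessarily the procedure used in the study. The function name `empirical_threshold`, its parameters, and the 0/1 encoding of each run (1 if the brand was mentioned, 0 otherwise) are assumptions made for this example.

```python
import random


def empirical_threshold(outcomes, batch_size, n_resamples=2000,
                        quantile=0.95, seed=0):
    """Estimate the detection threshold from same-condition runs.

    outcomes   : list of 0/1 flags, one per run (hypothetical encoding).
    batch_size : number of runs in each of the two compared batches.
    Returns the difference in mention rate that random splits of the
    pool exceed only about (1 - quantile) of the time.
    """
    if len(outcomes) < 2 * batch_size:
        raise ValueError("need at least 2 * batch_size runs in the pool")

    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Draw two disjoint batches; nothing real differs between them,
        # so any gap in their mention rates is pure sampling noise.
        drawn = rng.sample(outcomes, 2 * batch_size)
        rate_a = sum(drawn[:batch_size]) / batch_size
        rate_b = sum(drawn[batch_size:]) / batch_size
        diffs.append(abs(rate_a - rate_b))

    diffs.sort()
    index = min(len(diffs) - 1, int(quantile * len(diffs)))
    return diffs[index]


if __name__ == "__main__":
    # Hypothetical pool: 600 runs in which the brand appears half the time.
    pool = [1] * 300 + [0] * 300
    print(empirical_threshold(pool, batch_size=100))
```

Because both batches come from the same pool, the returned value reflects noise only. An observed week-over-week difference smaller than this value would not, on its own, indicate a real change.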
The result is a detection threshold: the smallest change, at a given number of runs, that is unlikely (less than a 5% probability) to arise from noise alone. Three findings stand out. First, the threshold declines slowly as runs are added: halving it requires quadrupling the number of runs. Second, the threshold depends far less on the engine than on the brand itself: a brand mentioned roughly half the time is the hardest to track, requiring a swing of about 14 percentage points at 100 runs to confirm. Third, for rarely cited sources the noise exceeds the signal: 83% of long-tail domains remain undetectable even at 600 runs.
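The shape of these findings can be reproduced with the textbook binomial approximation for comparing two proportions. This is a simplified sketch: the study measured noise empirically rather than from a formula, so the numbers below are illustrative only. The function names and the assumption that each run is an independent draw with a fixed mention rate `p` are introduced here for the example.

```python
import math

Z_95 = 1.96  # two-sided 5% critical value of the standard normal


def detection_threshold(p, n, z=Z_95):
    """Approximate smallest detectable difference between two batches.

    Assumes each batch has n independent runs and the brand is
    mentioned with probability p in every run (binomial model).
    """
    return z * math.sqrt(2 * p * (1 - p) / n)


def runs_needed(p, delta, z=Z_95):
    """Approximate runs per batch for a change of size delta to clear
    the threshold under the same binomial model."""
    return math.ceil(2 * p * (1 - p) * (z / delta) ** 2)


if __name__ == "__main__":
    # A brand mentioned half the time, 100 runs per batch: about 0.14,
    # i.e. roughly 14 percentage points.
    print(round(detection_threshold(0.5, 100), 3))
    # Quadrupling the runs halves the threshold.
    print(round(detection_threshold(0.5, 400), 3))
    # Runs per batch needed to confirm a 5-point move near a 37.5% rate.
    print(runs_needed(0.375, 0.05))
```

Under this model the threshold scales with the square root of `p * (1 - p) / n`, which is why it peaks for brands mentioned about half the time and falls only with the square root of the number of runs. By the same arithmetic, a move from 40% to 35% needs on the order of several hundred runs per batch before it clears the threshold.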
The implication is direct: any comparison of two batches of LLM outputs should report its sample size and treat differences below the threshold as noise rather than evidence of change.
Acknowledgments
With thanks to Zachary Bergman, whose work on the experimental pipeline and analysis made this research possible.
Download
Download the full report here.

