Writing the new GEO playbook: LLM scrapers get the real answers

Demand is growing for the ability to collect and analyse the outputs of Generative AI (GenAI) tools like ChatGPT and Perplexity. These tools, which use Large Language Models (LLMs), are increasingly being used as alternatives to traditional internet search engines. For this reason, professionals working with search engine optimisation (SEO) and its new incarnation, generative engine optimisation (GEO), are keen to understand what sources LLMs draw from and how they present topics relevant to particular brands and industries.

API endpoints provided by LLM companies are one means to gather this information. Scraping tools that can access LLM-generated responses are an emerging alternative. These scrapers benefit from geographic targeting and an ability to reflect what real users see, meaning they can give greater precision and real-world accuracy than the API endpoints offered by the likes of OpenAI and Perplexity themselves. As the shift from SEO to GEO gathers pace, it is important for marketers to understand how these scraping tools work and what features to look for.

Users turn to GenAI as an alternative to traditional search engines

AI is rapidly transforming web search, with users turning to GenAI tools for answers. These tools can rapidly compile clear, concise responses that are generated based on their training data and live information retrieval from the web. This saves users the need to click through multiple pages and read long texts in order to find the answer they need.

This shift is already having a measurable impact. Gartner predicts a 25% drop in volume on traditional search engines like Google and Bing by 2026. Apple has reported that Google searches in its Safari browser declined for the first time in 22 years. And Google execs have reportedly predicted an "inevitable decline" in Google search engine traffic. In response, Google has introduced AI-generated summaries in its Search experience – combining traditional search results with GenAI-powered answers.

Traditional SEO practices not suited for GenAI

For more than a decade, marketers, content creators, and SEO agencies have been devising ways to optimise online content for search engines using keywords, alt text, structured headings, and other SEO practices. Now, they must also optimise their content for LLMs, which requires more than a straightforward continuation of old tactics.

LLMs are non-deterministic – the same prompt will not always produce the same output. Additionally, they draw answers from both training data and live web search, which increases the unpredictability of their outputs. Furthermore, while SEO best practice is built on decades of insight into how search engines crawl, rank, and display content, GEO is a new and evolving discipline with limited historical data or established best practices to rely on.

Understanding LLM outputs

To adapt to the new realities, marketers and GEO professionals are now analysing how LLMs actually present brand and industry-related information in their outputs. By monitoring how often and in what context a brand name appears in responses for targeted keywords, they can evaluate the brand's visibility and reputation in AI-powered search.
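As a sketch of what this monitoring can look like in practice, the snippet below counts how often a brand appears across a set of scraped responses and captures the surrounding context of each mention. The sample responses and the brand name are invented for illustration, and a real pipeline would run over thousands of scraped answers rather than three strings.

```python
import re

def brand_mentions(responses, brand):
    """Count brand mentions across scraped LLM responses and capture
    the text surrounding each mention for context analysis."""
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    mentions = 0
    contexts = []
    for text in responses:
        for match in pattern.finditer(text):
            mentions += 1
            start = max(0, match.start() - 40)
            contexts.append(text[start:match.end() + 40].strip())
    # Share of responses in which the brand appears at least once
    visibility = sum(1 for t in responses if pattern.search(t)) / len(responses)
    return {"mentions": mentions, "visibility": visibility, "contexts": contexts}

# Illustrative sample of scraped responses (invented for this sketch)
sample = [
    "For running shoes, Acme and Zenith are frequently recommended.",
    "Zenith offers the best cushioning in this price range.",
    "Budget options include several lesser-known brands.",
]
report = brand_mentions(sample, "Zenith")
```

Run over real scraped data, the `visibility` figure gives a simple share-of-voice metric for AI-powered search, while the captured contexts show whether the brand appears as a recommendation, a comparison point, or a caveat.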

But this is just the first step in formulating full-fledged GEO strategies. Professionals also need additional data points, such as information on top-ranking LLM responses in a particular niche or field. Analysing such data will help them figure out how to match the format and approach of the content most favoured by LLMs.

Underpinning all of this work is the understanding that these tools are programmatic. That means there will be patterns in terms of which sources are frequently chosen for particular queries and how particular information is displayed and referenced. However, uncovering these patterns requires a lot of data.
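One simple pattern to surface is which domains LLM answers cite most often for a set of queries. The sketch below tallies cited domains per query; the input shape (a list of source URLs per query) and the sample URLs are assumptions for illustration, as real scraper output formats vary by provider.

```python
from collections import Counter

def top_sources(scraped_results):
    """Tally which domains LLM answers cite most often across queries.
    `scraped_results` maps each query to the list of source URLs its
    response referenced (shape assumed for this sketch)."""
    counts = Counter()
    for urlsls in scraped_results.values():
        for url in urlsls:
            # Crude domain extraction, adequate for well-formed URLs
            domain = url.split("//")[-1].split("/")[0]
            counts[domain] += 1
    return counts.most_common()

sample = {
    "best crm software": ["https://example.com/crm", "https://reviews.example.org/top10"],
    "crm pricing": ["https://example.com/pricing"],
}
ranking = top_sources(sample)
```

A ranking like this, built from enough scraped queries, is exactly the kind of pattern the paragraph above describes: it shows which sources a given model keeps returning to for a given niche.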

API endpoints from LLMs not matching real user outputs

A method for acquiring LLM output data already exists. Brands and agencies can access the responses of LLMs via API endpoints. The likes of OpenAI, Perplexity, and Anthropic offer paid packages that enable companies to view generated responses from LLMs for particular queries and prompts. Companies simply purchase credits, submit prompts, and then receive answers via the API.

However, these API endpoints often deliver responses that do not match the real-world outputs users are receiving. This is because LLM APIs are configured with specific parameters that guide output generation. These parameters – temperature, for example – can determine whether the model favours safer, more probable responses or aims for more creative outputs, which can also increase inaccuracies. Crucially, the parameters set for the API may not match those applied for real-world users, leading to different outputs.

Additionally, APIs cannot by themselves mimic requests from various specific locations. In real-world usage, however, LLMs can use a user's location to refine their output. For SEO and GEO, then, the value of API-generated outputs may be limited, especially for specific types of queries such as geographically targeted ones.

Scraping LLM output

Hence, the emergence of web scrapers that target LLM responses. Users send requests via an API that includes their queries and the source they would like to scrape – for example, ChatGPT, Perplexity, or Google's AI Mode. These scrapers, which are themselves AI-powered, extract LLM-generated responses, package them into structured data (e.g., JSON), and deliver them to the user.
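In code, that request-and-response flow might look like the sketch below. The endpoint, field names, and response shape are all assumptions made for illustration – each scraping provider defines its own API – and the "scraper output" here is a simulated string standing in for a live response.

```python
import json

def build_request(prompt, source, geo=None):
    """Assemble the JSON body for a hypothetical LLM-scraping API call:
    the prompt to submit and the GenAI tool to scrape it from."""
    payload = {"source": source, "prompt": prompt}
    if geo:
        payload["geo_location"] = geo  # e.g. country-level targeting
    return json.dumps(payload)

def parse_response(raw):
    """Unpack the structured JSON a scraper might return: the generated
    answer plus any sources it cited."""
    data = json.loads(raw)
    return data["answer"], data.get("sources", [])

req = build_request("best project management tools", "perplexity", geo="DE")

# Simulated scraper output, standing in for a live API response
raw = json.dumps({
    "answer": "Popular options include ToolA and ToolB.",
    "sources": ["https://example.com/review"],
})
answer, sources = parse_response(raw)
```

The structured return value is the point: instead of screenshotting chat sessions, analysts get answer text and cited sources as machine-readable fields they can aggregate.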

The benefits of scraping LLMs

LLM scrapers provide SEO and GEO experts with the same responses that actual users get when making the same queries. Thus, compared to APIs, their overarching advantage is that they provide data reflecting the actual user experience, unconstrained by API parameters.

Additionally, the most versatile LLM scraping platforms allow for geographic targeting, which provides data on how LLM responses are affected by the user's location. Such platforms can also be a single source of data from multiple major models, from ChatGPT to Google AI Mode and beyond.

Thus, such multifaceted LLM-scraping tools can be a convenient way to acquire data from very varied generative search experiences. For SEO and GEO experts, this means an opportunity to uncover patterns and factors across various models, geographic locations, and circumstances.

Meanwhile, data and AI companies can use the diverse data scraped from real-world LLM responses to enrich their training datasets and fine-tune AI models. For example, a machine learning team could extract a range of responses from LLMs based on domain-specific prompts. These responses could form a dataset for training a custom AI assistant, with the benefit that this assistant would know current language usage, tone, and up-to-date information.
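As a minimal sketch of that dataset-building step, the snippet below converts scraped prompt-and-response pairs into JSONL records in the chat-style format commonly used for fine-tuning. The record shape and the sample pairs are assumptions for illustration; the format expected by any given training pipeline will differ.

```python
import json

def to_jsonl(pairs):
    """Convert scraped (prompt, response) pairs into JSONL lines in a
    chat-style fine-tuning format (shape assumed for this sketch)."""
    lines = []
    for prompt, response in pairs:
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Invented domain-specific pairs standing in for scraped LLM output
scraped = [
    ("What is term insurance?", "Term insurance covers a fixed period..."),
    ("How are premiums set?", "Premiums reflect age, health, and term length..."),
]
dataset = to_jsonl(scraped)
```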

Effective LLM scraping conditions

While LLM scraping solves crucial shortcomings of API endpoints, it also has its own set of challenges. Building an LLM scraper in-house is an expensive and complicated endeavour, requiring specialised technical knowledge. And not every LLM scraper one builds or finds in the market will be effective. There are certain conditions one should meet to expect workable results from LLM output scraping.

A large proxy pool

Scraping tools rely on proxies to send requests, so LLM scrapers backed by a large proxy pool have a significant advantage. With more proxies, they can achieve higher success rates, broader geographic coverage, and greater resilience against IP blocks and CAPTCHAs. Machine learning can also be used to automate proxy rotation, further decreasing the risk of IP blocks and scraping hiccups.
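The rotation logic itself can be sketched simply: cycle through the pool and quarantine proxies that keep failing. The class below is a minimal illustration of that idea, with invented example proxy addresses; production systems layer on health checks, geo-selection, and the machine-learning-driven rotation mentioned above.

```python
import itertools

class ProxyRotator:
    """Minimal round-robin proxy rotation with failure quarantine,
    sketching the kind of logic a pool-backed scraper automates."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        # Skip proxies that have been blocked too often
        for _ in range(len(self.failures)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError("proxy pool exhausted")

    def report_failure(self, proxy):
        self.failures[proxy] += 1

# Example pool with placeholder (documentation-range) addresses
pool = ProxyRotator(["203.0.113.1:8080", "203.0.113.2:8080"])
first, second = pool.next_proxy(), pool.next_proxy()
```

The larger the pool behind a rotator like this, the longer any single IP can rest between requests – which is why pool size translates directly into success rates.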

Prompting at scale

The ability to prompt at scale is also important. Companies looking to assemble comprehensive datasets need the ability to submit thousands of prompts or URLs in a single request, and to extract high volumes of data quickly and efficiently.
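In practice, prompting at scale usually means chunking a large prompt list into request-sized batches. The helper below shows the idea; the batch size cap of 100 is an assumption for illustration, as real APIs set their own limits.

```python
def batch_prompts(prompts, batch_size=100):
    """Split a large prompt list into request-sized batches, since
    scraping APIs typically cap how many prompts one call may carry
    (the cap used here is an assumption)."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

# 250 invented prompts, e.g. price-point variations on one query
prompts = [f"best laptops under {p} euros" for p in range(500, 3000, 10)]
batches = batch_prompts(prompts, batch_size=100)
```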

Enabled for different modes

Tools like Perplexity and ChatGPT have web search modes and shopping assistant features, although these must be enabled by users and included in the user's package. Upon receiving a prompt, the tool decides whether generating a good answer requires web search or shopping assistance. Search modes enable marketers to compare outputs generated from training data with those incorporating web search.

Furthermore, search modes provide not only the final output, but also links to all of the sources that were referenced, which is crucial for GEO. Meanwhile, shopping assistant features present product recommendations, and, in the case of Perplexity in the US, can even be used to make purchases. To provide full value for SEO and GEO professionals, LLM scrapers should have the ability to extract search mode and shopping assistant output.
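Those cited sources are only useful if the scraper can pull them out reliably. The snippet below extracts source URLs from a search-mode answer; the markdown-link citation format and the sample answer are assumptions for illustration, since many scrapers instead return sources as a separate structured field.

```python
import re

def extract_citations(answer_text):
    """Pull cited source URLs out of a search-mode answer that embeds
    citations as markdown links (format assumed for this sketch)."""
    return re.findall(r"\((https?://[^)\s]+)\)", answer_text)

# Invented search-mode answer with inline citations
answer = (
    "ToolA leads most comparisons ([source](https://example.com/review)) "
    "while ToolB is praised for pricing ([source](https://blog.example.org/post))."
)
links = extract_citations(answer)
```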

Structured output and parsing

One key component in a web scraping pipeline is data parsing – structuring the scraped data into a usable format. The tool you use must reliably produce parsed datasets, since analysts can only start drawing insights once the data is in that form.
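The sketch below shows what that final parsing step can amount to: flattening nested scraper output into flat rows ready for a spreadsheet or analytics tool. The input shape (query, answer, sources) is an assumption carried over from earlier examples.

```python
import csv
import io

def to_rows(parsed_results):
    """Flatten parsed scraper output into (query, answer, source) rows;
    the input shape is assumed for this sketch."""
    rows = [("query", "answer", "source")]
    for item in parsed_results:
        # One row per cited source; an empty source cell if none cited
        for src in item["sources"] or [""]:
            rows.append((item["query"], item["answer"], src))
    return rows

def to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

parsed = [{"query": "best crm", "answer": "ToolA is popular.",
           "sources": ["https://example.com/a", "https://example.com/b"]}]
table = to_csv(to_rows(parsed))
```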

Keeping pace with GenAI

Two years of GenAI have turned online marketing on its head, as tried and tested SEO practices have lost their efficacy. In the transition to GEO, the solution is to gather data. After all, LLM tools are programmatic, and there should be patterns underpinning their outputs – patterns that data can reveal.

LLM output scrapers offer an alternative to API endpoints for gathering this data, thanks to their geographic targeting and ability to capture precisely what end users are seeing when they use GenAI tools. Then, just as they did in response to the original emergence of search engines, marketers will discover new practices to ensure their content is optimised for LLMs.
