Sunday, December 22, 2024

Top 10 most scraped websites revealed: Google, Amazon, Tripadvisor top list


A new report has identified Google, Amazon, and Tripadvisor as the top three most targeted websites for data scraping through APIs.

The report comes from web data collection platform Smartproxy, which analysed millions of unique data requests made by users of its scraping API solutions over the past year. The firm has not released the number of users who scraped any particular target, saying these figures are treated as internal data only.

According to Smartproxy’s report, Google is the most scraped web platform, with users largely collecting data for SEO analysis, keyword research, content optimisation, and market trend tracking.

Extracting data from a website is usually done using an API transfer or an automatic ‘data scraping’ program that harvests content and data from the website over the Internet.

While the latter has, in some cases, proven controversial, particularly when used to train LLMs for chatbots such as ChatGPT, Smartproxy’s report is based on data gathered from its API users.

According to the firm, the results show that in 2023 and Q1 2024, search engines accounted for over 42% of all scraping API requests.

The most scraped Google categories are Search, Travel, Shopping (search, product, pricing), Images, and Ads.

Fellow search engine Bing, from Microsoft, attracts 1.2 billion visitors monthly and has reached sixth place on Smartproxy’s most scraped list.

eCommerce

eCommerce giants are another prime target for API data scraping. Over 18% of all requests were attributed to online shopping platforms, and over half of the sites in the top ten list are online retailers. Amazon ranked second overall.

Users scrape eCommerce websites to gather product data for price comparison, market analysis, trend tracking, and competitive intelligence.

Publicly available web data allows companies to optimise their pricing strategies, identify top-selling products, and analyse peak shopping hours to better position their offerings.

According to Smartproxy CEO Vytautas Savickas, businesses tend to ramp up scraping efforts during peak shopping times like Black Friday, Halloween, and Christmas to capture the value of data generated by the rush of online shoppers seeking discounts and special offers.

“This data is crucial for gaining comprehensive insights into consumer behaviour, identifying popular product categories, and spotting emerging market trends,” he says.

“By analysing the rich datasets collected during these periods, companies can fine-tune their marketing strategies, optimise inventory levels, and tailor their offerings to better meet consumer demand,” he adds.

Other popular retail sites regularly scraped for information include Walmart, which comes fourth on the list and accounts for 25% of all online grocery sales in the US.

According to Smartproxy, online businesses collect and analyse real-time data from the market leader to adjust their pricing, research bestselling products in various locations, and compare their product attributes with those of the major sellers on Walmart’s marketplace.

Online auction site eBay, ranked seventh on the list, was also described by the web scraping solutions provider as a “treasure trove of valuable data” for eCommerce businesses.

Other retailers with data that users thought was worth mining include Shopify and the popular Southeast Asian eCommerce platform Lazada, which rank eighth and ninth, respectively.

Other categories

With a wealth of insights behind every review, Tripadvisor was the sole travel company to make the list of most mined sites, coming third overall.

Businesses leverage real-time data collected from Tripadvisor to improve their services, optimise pricing, and analyse competition.

According to Smartproxy, premier US online real estate agent Zillow came tenth, with house-selling websites accounting for over 3% of all scraping API requests.

In addition to house hunters, businesses are leveraging real estate data to monitor competitor listings, adjust pricing strategies, analyse user reviews, and identify areas for improvement.

Strangely absent from the top ten were social media sites such as Facebook or Instagram, even though social media emerged as the second most popular category for web scraping, with over 27% of all scraping API requests.

Similarly, Quora and Reddit were not included in the list, despite community forums accounting for 7% of all API requests.

This may be because firms often seek a multiplatform scraping solution to collect real-time data from sites, eliminating the need for a custom-built scraping API for each social media platform.

Another reason may be that some social media sites and forums do not offer APIs (only 1% of websites currently do), causing users to resort to alternative means of scraping the data.

Businesses use social media-extracted data for various purposes, including monitoring brand sentiment, analysing market trends, and conducting competitor research.

Real-time data for AI training

According to Smartproxy, new industries are also adopting web scraping for real-time data to improve predictive models and NLP technologies.

What is Predictive Modelling?

Predictive modelling, also known as predictive analytics, involves creating AI models that recognise patterns in historical data to forecast future events and outcomes.

This enables businesses to make more data-backed decisions and strategic plans.
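
As a rough illustration of the idea, and not something taken from Smartproxy’s report, the sketch below fits a straight-line trend to a short series of historical values and extends it one step ahead. The function names and the sample weekly figures are hypothetical, and real predictive models are usually far more sophisticated than a linear trend.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * x by ordinary least squares,
    where x is the position of each value in the series (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two historical values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Extend the fitted trend line steps_ahead positions past the last value."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical weekly prices for one product, oldest first.
weekly_prices = [10.0, 12.0, 14.0, 16.0]
next_week_estimate = forecast(weekly_prices)
print(next_week_estimate)
```

Here the historical data rises by the same amount each week, so the forecast simply continues that pattern. With noisier data, the same code returns the best-fitting straight line rather than an exact continuation.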

What is Natural Language Processing?

Natural language processing (NLP) is a branch of AI that focuses on the interaction between computers and human language. It underpins conversational AI applications.

For example, a customer support chatbot can be trained using recent user interactions and feedback scraped from review sites, improving its ability to handle inquiries and provide helpful responses.

According to Savickas, the trend of scraping search engines highlights the critical need for real-time search data across various sectors. That includes the AI field, where data plays a crucial role in training AI models, optimising NLP systems, and helping AI agents scrape web pages efficiently.

