Global Sensor Network vs Web Scraping
Web scraping is a method commonly used to determine which SaaS applications a particular website runs. Many products require a unique tag or snippet to be added to a site's code before the software will work.
By looking for these code snippets across millions of domains, companies like BuiltWith and SimilarTech can determine the products installed on a given website.
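The snippet-matching approach above can be sketched as follows. This is a minimal illustration, not any vendor's actual pipeline: the signature strings and the `detect_products` helper are hypothetical stand-ins, and real crawlers match far richer fingerprints (script URLs, cookies, HTTP headers, DNS records) across millions of pages.

```python
# Hypothetical product signatures: a substring that appears in a page's
# source when the vendor's tag has been installed.
SIGNATURES = {
    "Google Analytics": "www.googletagmanager.com/gtag/js",
    "HubSpot": "js.hs-scripts.com",
    "Intercom": "widget.intercom.io",
}

def detect_products(html: str) -> set[str]:
    """Return the products whose known tag appears in the page source."""
    return {name for name, marker in SIGNATURES.items() if marker in html}

page = '<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXX"></script>'
print(detect_products(page))  # {'Google Analytics'}
```

Note what the sketch can and cannot say: it reports that a tag is present, nothing more, which is exactly the binary limitation discussed below.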
There are many challenges with this methodology; let’s cover four big ones below:
Scraping is binary.
Web crawling may tell users what service HGInsights.com uses for hosting, but it can’t tell them how much is being spent or where that product is being used.
Larger businesses use multiple hosting providers to power their websites, apps, APIs and more, so understanding this relationship is extremely important.
Scraping leads to false positives.
Adding code to a website is easy, and there is minimal performance cost to leaving it there.
Many sites carry code that is no longer active, or code that was added but never utilized, so a detected snippet offers no real insight into which SaaS applications a company is actually using.
Scraping is domain-based.
Take a company like Nike, which operates hundreds of domains around the world. Web scrapers will treat each of those domains as a distinct entity, inflating the count of deployments and giving a false sense of usage and breadth.
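The inflation problem can be made concrete with a small sketch. The domain list and ownership map here are hypothetical examples used only to show the arithmetic: a naive scraper counts one "company" per domain, while entity resolution collapses domains to their owner first.

```python
# Hypothetical domain-to-owner map; real entity resolution would draw on
# WHOIS data, corporate hierarchies, and manual curation.
DOMAIN_OWNERS = {
    "nike.com": "Nike",
    "nike.co.uk": "Nike",
    "converse.com": "Nike",
    "hginsights.com": "HG Insights",
}

# Domains where a scraper detected some product tag.
detections = ["nike.com", "nike.co.uk", "converse.com", "hginsights.com"]

naive_count = len(detections)                      # 4 apparent deployments
entities = {DOMAIN_OWNERS[d] for d in detections}  # 2 actual companies
print(naive_count, sorted(entities))  # 4 ['HG Insights', 'Nike']
```

Without the ownership map, the scraper reports twice as many customers as actually exist, which is the false sense of usage and breadth described above.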
Scraping misses many products.
Web scraping is limited to products installed on websites. Many providers, like Google Cloud, Amazon Web Services, and Neustar, require no code snippets because they operate behind the web server, leaving no trace in a page's source.
To identify these products and how much a business is spending, you need to approach the problem from a whole new angle.