The common practice of “scraping” a website’s publicly available data has come under legal attack. A landmark court decision (HiQ Labs v. LinkedIn) recently concluded that scraping is lawful, but LinkedIn stated that “the case is far from over.”
As someone who has personally relied on scraping in my academic research and in the companies I’ve founded, I want to speak up in favor of the court’s decision and invite you to join the discussion.
Web scraping is the process of extracting data from websites. Web search engines “crawl” the web, moving from one website to another and scraping each one to retrieve and index its contents. Those contents can be created material (text, images, or video), which is often subject to copyright, or they may consist of facts (e.g., the price of a product or the author list of an article), which cannot be copyrighted.
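To make "scraping facts" concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not a description of how any particular search engine or company does it: the sample page, the class names `product-name` and `price`, and the helper `extract_facts` are all hypothetical, and a real scraper would fetch pages over the network and cope with far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical product page; real pages use different markup.
SAMPLE_HTML = """
<html><body>
  <h1 class="product-name">Example Widget</h1>
  <span class="price">$19.99</span>
</body></html>
"""


class FactExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class name of the element being read, if any
        self.facts = {}

    def handle_starttag(self, tag, attrs):
        # Remember the class name when we enter an element we care about.
        css_class = dict(attrs).get("class")
        if css_class in self.wanted:
            self.current = css_class

    def handle_data(self, data):
        # Store the first non-empty text found inside a wanted element.
        if self.current and data.strip():
            self.facts[self.current] = data.strip()
            self.current = None


def extract_facts(html, wanted=("product-name", "price")):
    """Return a dict mapping each wanted class name to its text (assumed layout)."""
    parser = FactExtractor(wanted)
    parser.feed(html)
    return parser.facts


facts = extract_facts(SAMPLE_HTML)
print(facts)
```

The output is a small dictionary of facts (a product name and a price) rather than the page's creative content, which is the distinction that matters for the rest of this post.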
In this post, I focus on the practice of scraping facts, which often benefits the “information have-nots” at the cost of major corporations such as LinkedIn, Amazon, and others that collect and aggregate data.
Consider the case of comparison shopping, which enables people to easily compare different prices for the same product across multiple vendors.
In 1996, I co-founded Netbot, the first company to offer online comparison shopping to consumers. In later startups, my colleagues and I extended the idea to airfares (at Farecast), electronics products (at Decide.com), and more. We helped consumers get the best price and figure out the best time to buy a product, and we busted myths about the value of Black Friday discounts, helping to level the playing field for consumers.
All of these startups (and many others) rely intrinsically on web scraping to obtain key product and price information. Scraping is thus a boon to consumers, who can compare product prices side by side, and that comparison in turn incentivizes more expensive vendors to offer more competitive prices.
Web scraping is also good for research. For example, in their Nature paper, Nicholas J. DeVito, Georgia C. Richards, and Peter Inglesby explain how they rely on scraping to analyze coroners’ reports to prevent future deaths.
At the Allen Institute for AI (AI2), one of our flagship projects, Semantic Scholar, is built on the ability to scrape for information about academic papers. Working from the hypothesis that the cure for cancers may lie buried within millions of research papers, we set out to develop a dynamic repository of academic content to help researchers stay up-to-date with scientific literature.
Scraping also promotes transparency and accountability. It democratizes data that can be used for myriad analyses. Journalists, for example, have used scraping as a tool in groundbreaking investigations into adoption scandals, surveillance networks, and illegal gun sales.
Today, we often engage with websites that rely on scraping, most notably Google, which is why the HiQ Labs v. LinkedIn decision is so important.
LinkedIn claims HiQ’s accessing of member data threatens its members’ privacy, but this is merely a fig leaf; what it really boils down to is data, access, and profit.
The Ninth Circuit Court’s opinion concludes that “LinkedIn’s own actions undercut its argument that users have an expectation of privacy in public profiles. LinkedIn’s ‘Recruiter’ product enables recruiters to ‘follow’ prospects, get ‘alert[ed] when prospects make changes to their profiles,’ and ‘use those [alerts] as signals to reach out at just the right moment,’ without the prospect’s knowledge…”
It goes on to say that “LinkedIn has explored ways to capitalize on the vast amounts of data contained in LinkedIn profiles by marketing new products. In June 2017, LinkedIn’s Chief Executive Officer (“CEO”), Jeff Weiner, appearing on CBS, explained that LinkedIn hoped to ‘leverage all this extraordinary data we’ve been able to collect by virtue of having 500 million people join the site.’”
Despite agreeing with the court’s ruling in this case, I do have concerns about certain use cases for scraped data. For example, HiQ Labs claims to “provide a crystal ball that helps … determine skills gaps or turnover risks months ahead of time…” The company’s Keeper product analyzes attrition risk.
It’s not a huge leap to see the potential for bias to creep into its analysis and disproportionately affect specific groups. We’ve already seen this play out in recidivism predictions, hiring decisions, financial decisions, and many other areas.
Data is a key currency in our society, and scraping makes that currency accessible to everyone. It’s up to us to decide how it is used.
Not everyone is in favor of web scraping; critics often cite privacy as the main argument against it. There is some validity to this concern.
The privacy concern is not an insignificant one, but the data in question was publicly available at the time of scraping, and the pros of web scraping as a practice outweigh the cons.
Overall, the benefits of scraping for research, for commercial competition, and for the public outweigh its costs. The courts should affirm their support for this common practice and defend it from legal challenges.