May 31, 2022

Problem of Sequential Numbers and "Screen Scraping"

Holland & Knight IP/Decode Blog
Jacob W. S. Schneider

When I was in high school, the seniors would pull a prank in the late spring after college acceptance letters came in. It usually involved animals. The class above us released crickets in the lunchroom; another class set up chicken pens (with chickens) in the courtyard. There was a legend that one class in the 1970s had released four pigs in the hallways. Each pig was painted with a number: 1, 2, 3 and 5. The joke was that the school would spend all day searching for Pig No. 4 after catching the others.

This legend is almost definitely not true: 1) There are not many pigs in or around my Massachusetts hometown, which strictly regulates livestock; 2) The story was sometimes told with lobsters, which, even though lobsters are more plentiful in Massachusetts, makes even less sense (how do you paint a number on a lobster?); and 3) The story was told to me by a 14-year-old. (There was also no pool on the roof of the school.) All that said, the story is a fun one because people would assume that there was a Pig No. 4 lumbering around campus, searching for a loose Dunkin' Donut or a revolution to start.

And the reason people would assume Pig No. 4 existed is that we are biased to believe numbers occur sequentially, without gaps. That assumption, however, can have disastrous consequences for a client's software, particularly when trying to prevent "screen scraping" – where an application visits a website to copy its contents – and the best design decision is often to avoid sequential numbers at all costs.

Why Sequential Numbers Occur in Software

Sequential numbers occur in software because of the assumption above, but also because counting sequentially maintains the uniqueness of each number. If software counts data records (or pigs) sequentially with whole numbers, then it can guarantee that each data record has a unique index. That index (e.g., 437) can later be used to search for and retrieve the record (and only that record).

This feature occurs frequently in database design, where tables of data usually have a unique index field per row. For example, MySQL has a built-in AUTO_INCREMENT attribute that counts sequentially to generate unique IDs, and a database table of famous pigs may look like this:

Unique ID    Animal Type    Name
1            Pig            Snowball
2            Pig            Napoleon
3            Pig            Wilbur
4            Pig            Mercy Watson
5            Pig            Peppa
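
For readers who want to see the mechanics, here is a minimal sketch of how such a table might be created and populated from PHP, assuming a local MySQL server and hypothetical credentials, table and column names (the real application may differ):

    <?php
    // Hypothetical connection details; adjust host, database, user and password.
    $db = new PDO('mysql:host=localhost;dbname=farm', 'user', 'secret');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // AUTO_INCREMENT hands out sequential whole numbers (1, 2, 3, ...) as rows are inserted.
    $db->exec('CREATE TABLE IF NOT EXISTS famous_pigs (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        animal_type VARCHAR(32) NOT NULL,
        name VARCHAR(64) NOT NULL
    )');

    $insert = $db->prepare('INSERT INTO famous_pigs (animal_type, name) VALUES (?, ?)');
    foreach (['Snowball', 'Napoleon', 'Wilbur', 'Mercy Watson', 'Peppa'] as $name) {
        $insert->execute(['Pig', $name]);   // IDs 1 through 5 are assigned automatically
    }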

Why Sequential Numbering May Be a Problem

Sequential numbering may be fine if the data is never exposed to others, but trouble arises when your client has built a web application that dynamically generates a webpage for each of the famous pigs above. Web applications often take input in the form of variables nestled in their URLs. For example, visiting https://www.google.com/search?q=pigs will perform a Google search for pigs because the URL has the variable "q" with the value "pigs." Switching out "pigs" for another word will modify the search criteria. With each page load, Google is dynamically generating a set of search results based on q's value.

I wrote a simple web application with a similar URL structure (sequentialfindapig.php?id=1), where "id" is our variable and "1" is its value. Load that URL and the application will access the famous pig database and return Snowball's name. Count up from 1, change the value and see what other pigs you can find. If you sequentially try all "Unique ID" values from 1 through 5, you will have accessed all of the data in our famous pig database.
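
The application's actual code is not reproduced here, but a lookup script along the lines of sequentialfindapig.php might look roughly like this sketch, assuming the famous pig table above (connection details, table and column names are placeholders):

    <?php
    // Sketch of a lookup page like sequentialfindapig.php?id=1 (details are assumed).
    $db = new PDO('mysql:host=localhost;dbname=farm', 'user', 'secret');

    // Read the "id" variable from the URL's query string.
    $id = (int) ($_GET['id'] ?? 0);

    // A prepared statement looks up the single record whose sequential Unique ID matches.
    $stmt = $db->prepare('SELECT name FROM famous_pigs WHERE id = ?');
    $stmt->execute([$id]);
    $name = $stmt->fetchColumn();

    echo $name !== false ? $name : 'No pig found.';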

That may be no problem for the famous pig database, but allowing the public to access a company's data simply by counting can have disastrous consequences. In the wake of the Jan. 6, 2021, Capitol attack, hackers discovered that the social media app Parler sequentially numbered its posts. Exploiting that vulnerability, those hackers sequentially accessed and downloaded a copy of each and every Parler post ever submitted, consisting of 56 terabytes of data that included geotagged photos and videos. Effectively, the hackers had made a copy of an entire social media platform as it existed just after many of its users posted allegedly incriminating information.
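
The mechanics of such an enumeration attack are trivial. A hedged sketch, using the hypothetical pig lookup page from above and an assumed example.com host, shows how simply counting copies everything the application exposes:

    <?php
    // Sketch of sequential enumeration against the hypothetical lookup page above.
    // Counting up through the IDs retrieves every record the application exposes.
    for ($id = 1; $id <= 5; $id++) {
        $page = file_get_contents('https://example.com/sequentialfindapig.php?id=' . $id);
        if ($page !== false) {
            file_put_contents("pig_$id.txt", $page);   // save a local copy of each page
        }
    }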

This type of attack can also enable screen scraping. In the famous pigs example, loading five sequential pages would allow someone to copy the entire database. In a more problematic example, if a site publishes original poetry on sequentially numbered pages, then a screen scraping application can flip through them and copy each work. With all poems in hand, the hacker could republish the works elsewhere. Copying and republishing the expressive poetic works would constitute copyright infringement. See 17 U.S.C. §§ 106, 501. If screen scraping involves accessing computers or data without authorization, then the activity can also run afoul of the Computer Fraud and Abuse Act. See 18 U.S.C. § 1030.

Solving the Sequential Numbering Problem

Software developers avoid the sequential numbering problem in several ways. First, they can simply avoid counting sequentially and force hackers to waste endless time looking for "Pig No. 4." The Universally Unique Identifier (UUID) specification describes a 128-bit ID that is statistically unique, meaning it is very unlikely to be duplicated. You would need to generate 2.71 quintillion UUIDs before there would be a 50 percent chance that any two are the same. (Note: This is the same probability exercise as the Birthday Paradox, which shows that there is a roughly 50 percent likelihood that two of 23 randomly selected people share the same birthdate.) Twitter has developed its own in-house unique ID generator called Snowflake that generates nonconsecutive unique identifiers for each tweet, direct message, user, etc.
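
As a sketch of the first approach, a random (version 4) UUID can be generated in PHP from 16 cryptographically random bytes; this is one common technique rather than the only one:

    <?php
    // Sketch: generate a random version 4 UUID from 16 cryptographically random bytes.
    function uuid_v4(): string
    {
        $bytes = random_bytes(16);
        $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);   // set the version field to 4
        $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);   // set the variant bits
        return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
    }

    echo uuid_v4();   // e.g., 9b2f6a1e-3c4d-4f8a-b1c2-7d9e0a5b6c3d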

To demonstrate how nonsequential numbers can protect data, I built a modified version of the famous pig web application that uses UUIDs (UUIDfindapig.php?id=X). (If you can guess the ID for "Napoleon," then please email me because you have conquered statistics.)

Second, software developers can simply avoid exposing, through web applications, data that they do not want scraped and copied. If that data needs to be exposed to the public, consider a rate controller that slows access for heavy users (who might be scraping) or detection methods to identify and block scrapers.
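
As one illustration of a rate controller, here is a minimal sketch assuming a single-server PHP application, a hypothetical limit of 60 requests per minute per IP address and a temporary counter file; production systems typically rely on a shared cache or a dedicated gateway instead:

    <?php
    // Sketch: crude per-IP rate limit of 60 requests per minute (threshold is assumed).
    $ip     = $_SERVER['REMOTE_ADDR'] ?? 'unknown';
    $bucket = sys_get_temp_dir() . '/rate_' . md5($ip) . '_' . intdiv(time(), 60);

    $count = is_file($bucket) ? (int) file_get_contents($bucket) : 0;
    if ($count >= 60) {
        http_response_code(429);            // 429 Too Many Requests
        exit('Slow down.');
    }
    file_put_contents($bucket, $count + 1);

    // ... normal page handling continues here ...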

Lawyers can protect against screen scraping by prohibiting it in their client site's terms of use. This makes scraping expressly prohibited by the site and removes all doubt as to whether the information on the website is fair game for scraping and copying. An example:

Users shall not seek to obtain access to any services, website content, materials, accounts, or information through scraping, hacking, data harvesting, data mining, or through other means not intentionally made available to you through the platform.
