10 Million Data Requests: How a Times Team Tracked Covid
The project began 18 months ago as a simple concept: Count every known U.S. case at the time. When the virus grew exponentially, so did the efforts to document it.
Times Insider explains who we are and what we do, and delivers behind-the-scenes insights into how our journalism comes together.
As of this morning, programs written by New York Times developers have made more than 10 million requests for Covid-19 data from websites around the world. The data we’re collecting are daily snapshots of the virus’s ebb and flow, including for every U.S. state and thousands of U.S. counties, cities and ZIP codes.
You may have seen slices of this data in the daily maps and graphics we publish at The Times. Taken together, these pages, which have involved more than 100 journalists and engineers from across the organization, are the most-viewed collection in the history of nytimes.com and a key component of the package of Covid reporting that won The Times the 2021 Pulitzer Prize for public service.
The Times’s coronavirus tracking project was one of several efforts that helped fill the gap in the public’s understanding of the pandemic left by the lack of a coordinated governmental response. Johns Hopkins University’s Coronavirus Resource Center collected both domestic and international case data. And the Covid Tracking Project at The Atlantic marshaled an army of volunteers to collect U.S. state data, in addition to testing, demographics and health care facility data.
At The Times, our work began with a single spreadsheet.
In late January 2020, Monica Davey, an editor on the National desk, asked Mitch Smith, a correspondent based in Chicago, to start gathering information about every individual U.S. case of Covid-19. One row per case, meticulously reported based on public announcements and entered by hand, with details like age, location, gender and condition.
By mid-March, the virus’s explosive growth proved too much for our workflow. The spreadsheet grew so large it became unresponsive, and reporters did not have enough time to manually report and enter data from the ever-growing list of U.S. states and counties we needed to track.
Around this time, many domestic health departments began rolling out Covid-19 reporting efforts and websites to inform their constituents of local spread. The federal government, meanwhile, faced early challenges in providing a single, reliable data set.
The available local data were all over the map, literally and figuratively. Formatting and methodology varied widely from place to place.
Within The Times, a newsroom-based group of software developers was quickly tasked with building tools to augment as much of the data acquisition work as possible. The two of us — Tiff is a newsroom developer, and Josh is a graphics editor — would end up shaping that growing team.
On March 10, 2020, the day before the World Health Organization declared the virus a pandemic, newsroom developers wrote the first lines of code for our custom tools that enabled journalists to edit and approve our collected data.
On March 16, the core application largely worked, but we needed help scraping many more sources. To tackle this colossal project, we recruited developers from across the company, many with no newsroom experience, to pitch in temporarily to write scrapers.
By the end of April, we were programmatically collecting figures from all 50 states and nearly 200 counties. But the pandemic and our database both seemed to be expanding exponentially.
Also, a few notable sites changed several times in just a couple of weeks, which meant we had to repeatedly rewrite our code. Our newsroom engineers adapted by streamlining our custom tools — while they were in daily use.
As many as 50 people beyond the scraping team have been actively involved in the day-to-day management and verification of the data we collect. Some data is still entered by hand, and all of it is manually verified by reporters and researchers, a seven-day-a-week operation. Reporting rigor and subject-matter fluency were essential parts of all our roles, from reporters to data reviewers to engineers.
In addition to publishing data to The Times’s website, we made our data set publicly available on GitHub in late March 2020 for anyone’s use.
As vaccinations curb the virus’s toll across the country — overall, 33.5 million cases have been reported — a number of health departments and other sources are updating their data less often. Conversely, the federal Centers for Disease Control and Prevention has expanded its reporting to include comprehensive figures that had been only partly available in 2020.
All of that means that some of our own custom data collection can be shut down. Since April 2021, the number of our programmatic sources has dropped nearly 44 percent.
Our goal is to get down to about 100 active scrapers by late summer or early fall, mainly for tracking potential hot spots.
The dream, of course, is to conclude our efforts as the virus’s threat substantially subsides.