
New: You Can Now Search the Full Text of 3 Million Nonprofit Tax Records for Free


On Thursday, we launched a new feature for our Nonprofit Explorer database: The ability to search the full text of nearly 3 million electronically filed nonprofit tax filings sent to the IRS since 2011.

Nonprofit Explorer already lets researchers, reporters and the general public search for tax information from more than 1.8 million nonprofit organizations in the United States, as well as search for the names of key employees and directors of organizations.

Now, users of our free database can dig deep and search for text that appears anywhere in a nonprofit’s tax records, as long as those records were filed digitally — which according to the IRS covers about two-thirds of nonprofit tax filings in recent years.

How can this be useful to you? For one, this feature lets you find organizations that gave grants to other nonprofits. Any nonprofit that gives grants to another must list those grants on its tax forms — meaning that you can research a nonprofit’s funding by using our search. A search for “ProPublica,” for example, will bring up dozens of foundations that have given us grants to fund our reporting (as well as a few filings that reference Nonprofit Explorer itself).

Just another example: When private foundations have investments or ownership interest in for-profit companies, they have to list those on their tax filings as well. If you want to research which foundations have investments in a company like ExxonMobil, for example, you can simply search for the company name and check which organizations list it as an investment.

The possibilities are nearly limitless. You can search for the names or addresses of independent contractors that made more than $100,000 from a nonprofit, or for keywords in mission statements or descriptions of accomplishments. You can even use advanced search operators, so, for instance, you can find any filing that mentions either “The New York Times,” “nytimes” or “nytimes.com” in one search.

The new feature contains every electronically filed Form 990, 990-PF and 990-EZ released by the IRS from 2011 to date. That’s nearly 3 million filings. The search does not include forms filed on paper.

So please, give this search a spin. If you write a story using information from this search, or you come across bugs or problems, drop us a line! We’re excited to see what you all do with this new superpower.



“Your Default Position Should Be Skepticism” and Other Advice for Data Journalists From Hadley Wickham


So you want to explore the world through data. But how do you actually *do* it?

Hadley Wickham is a leading developer of open source tools for data science and works as the chief scientist at RStudio. We talked with him about interrogating data, what stories might be hiding in the gaps and how bears can really mess things up. What follows is a transcript of our talk, edited for clarity and length.

ProPublica: You’ve talked about the way data visualization can help the process of exploratory data analysis. How would you say this applies to data journalism?

Wickham: I’m not sure whether I should have the answers or you should have the answers! I think the question is: How much of data journalism is reporting the data that you have versus finding the data that you don’t have ... but you should have ... or want to have ... that would tell the really interesting story.

I help teach a data science class at Stanford, and I was just looking through this dataset on emergency room visits in the United States. There is a sample of every emergency visit from like 2013 to 2017 ... and then there’s this really short narrative, a one-sentence description of what caused the accident.

I think that’s a fascinating dataset because there are so many stories in it. I look at the dataset every year, and each time I try and pull out a little different story. This year, I decided to look at knife-related injuries, and there are massive spikes on Memorial Day, Fourth of July, Thanksgiving, Christmas Day and New Year’s.

As a generalist you want to turn that into a story, and there are so many questions you can ask. That kind of exploration is really a warmup. If you’re more of an investigative data journalist, you’re also looking for the data that isn’t there. You’ve got to force yourself to think, well, what should I be seeing that I’m not?

ProPublica: What’s a tip for someone who thinks that they have found something that isn’t there? What’s the next step that you take when you have that intuition?

Wickham: This is one of the things I learned from going to NICAR, and it’s completely unnatural to me: picking up the phone and talking to someone. Which I would never do. There is no situation in my life in which I would ever do that unless it’s a life-threatening emergency.

But, I think that’s when you need to just start talking to people. I remember one little anecdote. I was helping a biology student analyze their field work data, and I was looking at where they collected data over time.

And one year they had no data for a given field. And so I go talk to them. And I was like: “Well, why is that? This is really weird.”

And they’re like, well, there was a bear in the field that year. And so we couldn’t collect any data.

But kind of an interesting story, right?

ProPublica: What advice would you have for editors who are managing or collaborating with highly technical people in a journalism environment but who may not share the same skill set? How can they be effective?

Wickham: Learn a little bit of R and basic data analysis skills. You don’t have to be an expert; you don’t have to work with particularly large datasets. It’s a matter of finding something in your own life that’s interesting that you want to dig into.

One [recent example]: I noticed on the account from my yoga class, there was a page that has every single yoga class that I had ever taken.

And so I thought it would be kind of fun to take a look at that. See how things change over time. Everyone has little things like that. You’ve got a Google Sheet of information about your neighbors, or your baby, or your cat, or whatever. Just find something in life where you have data that you’re interested in. Just so you’ve got that little bit of visceral experience of working with data.

The other challenge is: When you’re really good at something, you make it look easy. And then people who don’t know so much are like: “Wow, that looks really easy. It must have taken you 30 minutes to scrape those 15,000 Excel spreadsheets of varying different formats.”

It sounds a little weird, but it’s like juggling. If you’re really, really, really good at juggling, you just make it look easy, and people are like: “Oh well. That’s easy. I can juggle eight balls at a time.” And so jugglers deliberately build mistakes into their acts. I’m not saying that’s a good idea for data science, but you’ve taken this very hard problem, broken it down into several pieces, made the whole thing look easy. How do you also convey that this is something you had to spend a huge amount of time on? It looks easy now, because I’ve spent so much time on it, not because it was a simple problem.

Data cleaning is hard because it always takes longer than you expect. And it’s really, really difficult to predict in advance where the problems are going to lie. At the same time, that’s where you get the value and can do stuff that no one has done before. The easy, clean dataset has already been analyzed to death. If you want something that’s unique and really interesting, you’ve got to dig for it.

ProPublica: During that data cleaning process, is that where the journalist comes out? When you’re cleaning up the data but you’re also getting to know it better and you’re figuring out the questions and the gaps?

Wickham: Yeah, absolutely. That’s one of the things that really irritates me. I think it’s easy to go from “data cleaning” to “Well, you’ve got a data cleaning problem, you should hire a data janitor to take care of it.” And it’s not this “janitorial” thing. Actually cleaning your data is when you’re getting to know it intimately. That’s not something you can hand off to someone else. It’s an absolutely critical part of the data science process.

ProPublica: The perennial question. What makes R an effective environment for data analysis and visualization? What does it offer over other tool sets and platforms?

Wickham: I think you have basically four options. You’ve got R and Python. You’ve got JavaScript, or you’ve got something point and click, which obviously encompasses a very, very large number of tools.

The first question you should ask yourself is: Do I want to use something point and clicky, or do I want to use a programming language? It basically comes down to how much time do you spend? Like, if you’re doing data analysis every day, the time it takes to learn a programming language pays off pretty quickly because you can automate more and more of what you do.

And so then, if you decided you wanted to use a programming language, you’ve got the choice of doing R or Python or JavaScript. If you want to create really amazing visualizations, I think JavaScript is a place to do it, but I can’t imagine doing data cleaning in JavaScript.

So, I think the main competitors are R and Python for all data science work. Obviously, I am tremendously biased because I really love R. Python is awesome, too. But I think the reason that you can start with R is because in R you can learn how to do data science and then you can learn how to program, whereas in Python you’ve got to learn programming and data science simultaneously.

R is kind of a bit of a weird creature as a programming language, but one of the advantages is that you can get some basic templates that you copy and paste. You don’t have to learn what a function is, exactly. You don’t have to learn any programming language jargon. You can just kind of dive in. Whereas with Python you’re gonna learn a little bit more that’s just programming.

ProPublica: It’s true. I’ve tried to make some plots in Python and it was not pretty.

Wickham: Every team I talked to, there are people using R, and there are people using Python, and it’s really important to help those people work together. It’s not a war or a competition. People use different tools for different purposes. I think that’s very important, and one project to that end is this thing called Apache Arrow, which Wes [McKinney] has been working on through a new organization called Ursa.

Basically, the idea of Apache Arrow is to just sit down and really think, “What is the best way to store data-science-type data in memory?” Let’s figure that out. And then once we’ve figured it out, let’s build a bunch of shared infrastructure. So Python can store the data in the same way. R can store the data in the same way. Java can store the data in the same way. And then you can see, and mostly use, the same data in any programming language. So you’re not shuffling it back and forth all the time.
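To make the idea concrete, here is a rough sketch in Python using the pyarrow bindings; the column names, values and file name are purely illustrative, and R’s arrow package can read the same file.

```python
import pyarrow as pa
import pyarrow.feather as feather

# Build an Arrow table in memory. The columnar layout is the one Arrow
# specifies for every language binding, not something Python-specific.
table = pa.table({
    "org": ["A", "B", "C"],
    "grants": [10, 20, 30],
})

# Write it out in Arrow's Feather (IPC) format...
feather.write_feather(table, "example.feather")

# ...and any Arrow implementation -- R's arrow::read_feather(), for example --
# can read the same file without converting it row by row.
print(feather.read_feather("example.feather"))  # returns a pandas DataFrame
```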

ProPublica: Do you think journalists risk making erroneous assumptions about the accuracy of data or drawing inappropriate conclusions, such as mistaking correlation for causation?

Wickham: One of the challenges of data is that if you can quantify something precisely, people interpret it as being more “truthy.” If you’ve got five decimal places of accuracy, people are more likely to just kind of “believe it” instead of questioning it. A lot of people forget that pretty much every dataset is collected by a person, or there are many people involved. And if you ignore that, your conclusions are going to possibly be fantastically wrong.

I was judging a data science poster competition, and one of the posters was about food safety and food inspection reports. And I … and this probably says something profound about me ... but I immediately think: “Are there inspectors who are taking bribes, and if there were, how would you spot that from the data?”

You shouldn’t trust the data until you’ve proven that it is trustworthy. Until you’ve got another independent way of backing it up, or you’ve asked the same question three different ways and you get the same answer three different times. Then you should feel like the data is trustworthy. But until you’ve understood the process by which the data has been collected and gathered ... I think you should be very skeptical. Your default position should be skepticism.

ProPublica: That’s a good fit for us.


Making Sense of Messy Data


I used to work as a sound mixer on film sets, noticing any hums and beeps that would make an actor’s performance useless after a long day’s work. I could take care of the noisiness in the moment, before it became an issue for postproduction editors.

Now as a data analyst, I only get to notice the distracting hums and beeps in the data afterward. I usually get no say in what questions are asked to generate the datasets I work with; answers to surveys or administrative forms are already complete.

To add to that challenge, when building a national dataset across several states, chances are there will be dissonance in how the data is collected from state to state, making it even more complicated to draw meaning from compiled datasets.


The Associated Press recently added a comprehensive dataset on medical marijuana registry programs across the U.S. to the ProPublica Data Store. Since a national dataset did not exist, we collected the data from each state through records requests, program reports and department documents.

One question we sought to answer with that data: why people wanted a medical marijuana card in the first place.

The answers came in many different formats, in some cases with a single response question, in others with a multiple response question. It’s the difference between “check one” and “check all.”

When someone answers a single response question, they are choosing what they think is the most important and relevant answer. This may be an accurate assessment of the situation — or an oversimplified take on the question.

When someone is given the chance to choose one or more responses, they are choosing all they think is relevant and important, and in no particular order. If you have four response choices, you may have to split the data into up to 16 separate groups to cover each combination. Or you may be given a summary table with the results for each option without any information on how they combine.
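As a quick illustration of why that multiplies the work, here is a small sketch in Python with pandas, using a made-up table in which each qualifying condition is its own check-all-that-apply column:

```python
import pandas as pd

# Made-up patient-level answers to a "check all that apply" question:
# one boolean column per qualifying condition.
df = pd.DataFrame({
    "ptsd":         [True,  True,  False, False, True],
    "chronic_pain": [True,  False, True,  False, False],
    "cancer":       [False, False, True,  False, False],
    "epilepsy":     [False, False, False, True,  False],
})

# Four checkboxes means 2**4 = 16 possible combinations of answers.
combinations = df.value_counts()
print(combinations)

# A summary table, by contrast, reports only one total per option
# and loses how the answers overlap.
print(df.sum())
```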

In the medical marijuana data, some states have 10 or more qualifying conditions — from cancer and epilepsy to nausea and post-traumatic stress disorder. Of the 16 states where data on qualifying condition is available, 13 allow for multiple responses. And of those, three states even shifted from collecting single to multiple responses over the years.

This makes it nearly impossible to compare across states when given only summary tables.

So, what can we do?

One tip is to compare states that have similar types of questionnaires — single response with single response, multiple with multiple. We used this approach for clarification when looking into the numbers for patients reporting PTSD as a qualifying condition. We found that half of all patients in New Mexico use medical marijuana to treat PTSD, and the numbers do not seem to be inflated by the method of data collection. New Mexico asks for a single qualifying condition, yet the proportion of people reporting PTSD as their main ailment is two to three times higher than in states where patients could report multiple conditions.

Using data from the 13 states that allow multiple responses, we found that when states expand their medical markets to include PTSD, registry numbers ramp up and the proportion of patients reporting PTSD increases at a quick pace. The data didn’t enable us to get one single clean statistic, but it still made it possible for us to better understand how people used medical marijuana.

Get the data (with a description of the caveats you’ll need to keep in mind when working with it) for your own analysis here.


Working Together Better: Our Guide to Collaborative Data Journalism


Today we’re launching a guidebook on how newsrooms can collaborate around large datasets.

Since our founding 11 years ago, ProPublica has made collaboration one of the central aspects of its journalism. We partner with local and national outlets across the country in many different ways — including working with us to report stories, sharing data and republishing our work. That’s because we understand that by working together, we can do more powerful journalism, reach wider audiences and have more impact.


In the last several years, we’ve taken on enormous collaborations, working with hundreds of journalists at a time. It started in 2016 with Electionland, a project to monitor voting problems in real time during the presidential election. That project brought together more than 1,000 journalists and students across the country. Then we launched Documenting Hate in 2017, a collaborative investigation that included more than 170 newsrooms reporting on hate crimes and bias incidents. We did Electionland again in 2018, which involved around 120 newsrooms.

In order to make each of these projects work, we developed software that allows hundreds of people to access and work with a shared pool of data. That information included datasets acquired via reporting as well as story tips sent to us by thousands of readers across the country. We’ve also developed hard-won expertise in how to manage these types of large-scale projects.

Thanks to a grant from the Google News Initiative, we’ve created the Collaborative Data Journalism Guide, which we’re launching today. We’re also developing an open-source version of our software, which will be ready this fall (sign up here for updates).

Our guidebook covers:

  • Types of newsroom collaborations and how to start them
  • How a collaboration around crowdsourced data works
  • Questions to consider before starting a crowdsourced collaboration
  • Ways to collaborate around a shared dataset
  • How to set up and manage workflows in data collaborations

The guidebook represents the lessons we’ve learned over the years, but we know it isn’t the only way to do things, so we made the guidebook itself collaborative: We’ve made it easy for others to send us input and additions. Anybody with a GitHub account can send us ideas for changes or even add their own findings and experiences (and if you don’t have a GitHub account, you can do the same by contacting me via email).

We hope our guide will inspire journalists to try out collaborations, even if it’s just one or two partners.

Access the guidebook here.


Making Collaborative Data Projects Easier: Our New Tool, Collaborate, Is Here


On Wednesday, we’re launching a beta test of a new software tool. It’s called Collaborate, and it makes it possible for multiple newsrooms to work together on data projects.

Collaborations are a major part of ProPublica’s approach to journalism, and in the past few years we’ve run several large-scale collaborative projects, including Electionland and Documenting Hate. Along the way, we’ve created software to manage and share the large pools of data used by our hundreds of newsroom partners. As part of a Google News Initiative grant this year, we’ve beefed up that software and made it open source so that anybody can use it.

Collaborate allows newsrooms to work together around any large shared dataset, especially crowdsourced data. In addition to CSV files and spreadsheets, Collaborate supports live connections to Google Sheets and Forms as well as Screendoor, meaning that updates made to your project in those external data sources will be reflected in Collaborate, too. For example, if you’re collecting tips through Google Forms, any new incoming tips will appear in Collaborate as they come in through your form.

Once you’ve added the data to Collaborate, users can:

  • Create users and restrict access to specific projects;
  • Assign “leads” to other reporters or newsrooms;
  • Track progress and keep notes on each data point;
  • Create a contact log with tipsters;
  • Assign labels to individual data points;
  • Redact names;
  • Sort, filter and export the data.

Collaborate is free and open source. We’ve designed it to be easy to set up for most people, even those without a tech background. That said, the project is in beta, and we’re continuing to resolve bugs.

If you are tech savvy, you can find the code for Collaborate on GitHub, and you’re welcome to fork the code to make your own changes. (We also invite users to submit bugs on GitHub.)

This new software is part of our efforts to make it easier for newsrooms to work together; last month, we published a guide to data collaborations, which shares our experiences and best practices we’ve learned through working on some of the largest collaborations in news.

Starting this month, we’ll provide virtual trainings about how to use Collaborate and how to plan and launch crowd-powered projects around shared datasets. We hope newsrooms will find the tool useful, and we welcome your feedback.

Get started here.


Building a Database From Scratch: Behind the Scenes With Documenting Hate Partners


For nearly three years, ProPublica’s Documenting Hate project has given newsrooms around the country access to a database of personal reports sent to us by readers about hate crimes and bias incidents. We’ve brought aboard more than 180 newsrooms, and some have followed up on these reports — verifying them, uncovering patterns and telling the stories of victims and witnesses. Some partners have done significant data journalism projects of their own to augment what they found in the shared dataset.

The latest such project comes from News12 in Westchester County, New York. Reporter Tara Rosenblum joined the Documenting Hate project after a spate of hate incidents in her coverage area. Last month, News12 aired her five-part series about hate crimes in the Hudson Valley and a half-hour special covering hate in the tri-state area. The station also published a public database of hate incidents going back a decade. It was the result of two years of work.

Rosenblum and her team built the database by requesting records from every police department in their coverage area, following up on tips from Documenting Hate and collecting clips about hate incidents the news network was already reporting on. Getting records was a laborious process, particularly from small agencies, some of which accept requests only by fax. “It was definitely torturous, but a labor of love project,” Rosenblum said.


She also expanded the scope of the project beyond her local newsroom and brought in News12 reporters from the network’s bureaus in Connecticut, New Jersey, Long Island, the Bronx and Brooklyn. The local newsrooms used Rosenblum’s investigation as their model, examining hate incidents since 2016. In all, six News12 reporters in three states documented around 2,300 hate incidents.

“We knew that this was one of those cases, the more the merrier,” Rosenblum said of collaborating with other newsrooms. “Why not flex our investigative muscle and get everyone working on this at the same time so we can really get a regional look?”

After the series aired, Rosenblum heard from a number of lawmakers — some who said they’d experienced discrimination — as well as students and schools. The special also aired on national and international networks, garnering responses from other states and countries. “A lot of what I heard is people being really grateful that we were shining the light on this,” she said.

Catherine Rentz, a reporter at The Baltimore Sun, wanted to investigate hate incidents in her area after learning how the Maryland State Police tracks hate crimes. (Since this story was reported, Rentz has left the Sun to pursue freelance projects.) Maryland has been collecting hate crime data since the 1980s, so there was much to explore, Rentz said. Her reporting was also sparked by the May 2017 homicide of Richard Collins III, a second lieutenant in the Army who was days away from his college graduation. He was stabbed to death at the University of Maryland in what may have been a racially motivated attack; the suspect will be tried in December.

Rentz began her hate crimes investigation the summer after the killing, and she worked on it on and off for a year, she said. She sent public records requests to the Maryland State Police, city police departments and the state judiciary, and she built a public database of hate crimes and bias incidents reported to police in Maryland from 2016-17, including narratives of the incidents. She also worked with Documenting Hate to look into Maryland-based reports in our database.

To collect the data, she set up a spreadsheet and entered each case by hand, since the state police records were in PDF files and she wasn’t able to easily extract data from them. She faced a number of other challenges. For instance, many agencies redacted victims’ names, making it hard to use the data to find potential sources to interview. And when she did find names, some victims didn’t want to talk about what happened to them.

“I completely understood that, and I didn’t want to do any more damage than had already been done,” she said.

In the course of her investigation, Rentz discovered that some agencies did collect reports of potential bias crimes but weren’t reporting them to the state police, so the data wasn’t being counted. She also looked at prosecutions; in 2017, there were nearly 400 bias crimes reported to police, but only three hate crime convictions.

Following the Sun’s hate crime reporting, the state police held several trainings with local police and reminded agencies that they’re required by law to turn in their bias crime reports on specific deadlines. In April, the governor signed three new bills into law on hate crimes.

Last year, Reveal investigated hate incidents that involved the invocation of President Donald Trump’s name or policies. They published a longform story and produced a radio show. Reporter Will Carless built a database using reports from the Documenting Hate project and news clips. He worked his way through a color-coded spreadsheet of hundreds of entries to verify reports and find sources to highlight in the story. After the investigation published, Carless says he received emails from readers who said similar incidents had happened to them; others thanked him for connecting the dots and gathering data on previously disparate stories. He also said a few academics told him they were going to include the story in their courses that involve hate speech.

And this year, HuffPost created a database for a forthcoming story examining hate incidents in which the perpetrator used the phrase “go back to your country” or “go back to” a specific country. Its database combined tips submitted to the Documenting Hate project with news clips culled from the LexisNexis database, social media reports and police reports gathered by ProPublica. The investigation is slated to publish this fall.

“The thing I want to stress for this project is that this type of hatred or bigotry or white nationalism is kind of ubiquitous and foundational to American society,” said HuffPost reporter Christopher Mathias. “It’s very much a common thing that people who aren’t white in this country experience on a regular basis. There’s no better way to show that than to create a database of many, many incidents like this across the country.”

Like News12, HuffPost opened the project up to its newsroom colleagues, bringing in reporters from HuffPost bureaus in the United Kingdom and Canada. After HuffPost published a questionnaire to collect more stories from readers, its U.K. and Canadian colleagues set up their own crowdsourcing forms to collect stories. (Documenting Hate is a U.S.-based project, and our form is limited to the U.S.) Their plan is to publish stories using the tips they collect when HuffPost’s U.S. newsroom publishes its investigation.

Want to create your own database of hate crimes? Here are some tips about how to get started.

1. Get hate crimes data from your local law enforcement agency.

We have a searchable app where you can see the data we received directly from police departments, as well as the numbers that agencies sent the FBI. (The federal data is deeply flawed, as we’ve found in our reporting.) We have partial 2017 data for some agencies.

If we don’t have data from your police department, you can replicate our records request.

Also, some states, like Maryland and California, release statewide hate crime data reports, so find out if top-line data is publicly available.

Some things to keep in mind:

More than half of hate crime victims don’t report to the police at all. And the police don’t always do a good job handling these crimes.

That’s because police officers don’t always receive adequate training about how to investigate or track hate crimes. Still, training alone doesn’t ensure these crimes are handled properly. Some police mismark hate crimes or don’t know how to fill out forms to properly track these crimes. Some victims believe officers don’t take them seriously; in some cases, victims say police even refuse to fill out a report.

2. Put together a list of known incidents using media reports and crimes tracked by nonprofit organizations.

It’s a good idea to search for clips of suspected hate crimes during the time period in question to compare them to police data. You can use tools like Google News, LexisNexis, Newspapers.com and others.

You can also consult organizations that track incidents and add them to your list of known crimes. They can give you a sense of how police respond to hate crimes against these groups. Here are some examples.

  • CAIR (Muslim community)
  • ADL (Jewish community)
  • SAALT (South Asian community)
  • AAI (Arab community)
  • AVP (LGBTQ community)
  • MALDEF (Latino community)
  • NAACP (black community)
  • NCTE (trans community)
  • HRC (LGBTQ community)
3. Review the police records carefully, and request incident reports to get the full picture.

Once you receive data from the police department, compare it with your list of known hate crimes from media and nonprofit reports. That will be especially useful if the police claim to have no hate crimes in the time period. Ask about any discrepancies.

You can also check to see if the department’s data matches what it sent the FBI. If the department’s numbers don’t match what they sent the feds, ask why.
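If the data arrives as spreadsheets, a few lines of pandas can surface those discrepancies. This is only a sketch; the file names and column names are placeholders for whatever your records request and the FBI download actually contain.

```python
import pandas as pd

# Placeholder files: yearly hate crime counts from your records request and
# the counts the same agency reported to the FBI.
local = pd.read_csv("local_agency_counts.csv")  # columns: year, local_count
fbi = pd.read_csv("fbi_reported_counts.csv")    # columns: year, fbi_count

compare = local.merge(fbi, on="year", how="outer")
compare["difference"] = compare["local_count"].fillna(0) - compare["fbi_count"].fillna(0)

# Years where the two sources disagree are the ones to ask the department about.
print(compare[compare["difference"] != 0])
```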

The best way to get a deeper look at the data is to get narratives. Ask for a police report or talk to the public information officer to get the narrative from the incident report.

Then review the data and incident reports for potential mismarked crimes. Take a look at the types of bias listed for each crime. We found that reports of anti-heterosexual bias crimes were almost always mismarked, either as different types of bias crimes or crimes that weren’t hate crimes at all.

Also check the quantity of each bias type. Is there a large number of a specific bias crime that may not fit with the area’s demographics? We’ve encountered cases in which police marked incidents as having anti-Native American bias in their forms or computer systems because they thought they were selecting “none” or “not applicable.”

Next, check the crime types. We’ve also seen that certain crime types are unlikely to involve a bias motivation but are sometimes erroneously marked as hate crimes; examples include drug charges, suicide, drug overdose and hospice death. Request incident reports, and follow up with police to ask about cases that don’t appear to be bias crimes. Police have often told us that mismarking happens as a result of human error, and that officials will sometimes rectify the errors found.
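Here is a sketch of that kind of screening in pandas, again with placeholder file and column names; the list of offense types comes from the examples above.

```python
import pandas as pd

# Placeholder incident-level data with one row per reported hate crime.
incidents = pd.read_csv("hate_crime_incidents.csv")  # columns: bias_type, offense

# Tally bias types; a count that looks out of line with local demographics
# (such as a surge of anti-Native American entries) deserves a follow-up call.
print(incidents["bias_type"].value_counts())

# Flag offense types that rarely involve a bias motivation -- candidates for
# mismarked records and incident-report requests.
unlikely = {"drug charge", "suicide", "drug overdose", "hospice death"}
print(incidents[incidents["offense"].str.lower().isin(unlikely)])
```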


I Spent Three Years Running a Collaboration Across Newsrooms. Here’s What I Learned.


ProPublica’s Documenting Hate collaboration comes to a close next month after nearly three years. It brought together hundreds of newsrooms around the country to cover hate crimes and bias incidents.

The project started because we wanted to gather as much data as we could, to find untold stories and to fill in gaps in woefully inadequate federal data collection on hate crimes. Our approach included asking people to tell us their stories of experiencing or witnessing hate crimes and bias incidents.

As a relatively small newsroom, we knew we couldn’t do it alone. We’d have to work with partners, lots of them, to reach the biggest possible audience. So we published a tip form in English and Spanish, and recruited newsrooms around the country to share it with their readers.


We ended up working with more than 180 partners to report stories based on the leads we collected and the data we gathered. Partnering with national, local and ethnic media, we were able to investigate individual hate incidents and patterns in how hate manifested itself on a national scale. (While the collaboration between newsrooms is coming to an end, ProPublica will continue covering hate crimes and hate groups.)

Our partners reported on kids getting harassed in school, middle schoolers forming a human swastika, hate crime convictions, Ivy League vandalism, hate incidents at Walmarts and the phrase “go back to your country,” to name just a few. Since the project began in 2017, we received more than 6,000 submissions, gathered hundreds of public records on hate crimes and published more than 230 stories.

Projects like Documenting Hate are part of the growing phenomenon of collaborative data journalism, which involves many newsrooms working together around a single, shared data source.

If you’re working on such a collaboration or considering starting one, I’ve written a detailed guidebook to collaborative data projects, which is also available in Spanish and Portuguese. But as the project winds down, I wanted to share some broader lessons we’ve learned about managing large-scale collaborations:

Overshare information. Find as many opportunities as possible to explain how the project works, the resources available and how to access them. Journalists are busy and are constantly deluged with information, so using any excuse to remind them of what they need to know benefits everyone involved. I used introductory calls, onboarding materials, training documents and webinars as a way to do this.

Prepare for turnover. More than 500 journalists joined Documenting Hate over its nearly three-year run. But more than 170 left the newsrooms with which they were associated at the beginning of their participation in the project, either because they got a new job, were laid off, left journalism or their company went under. Sometimes journalists would warn me they were leaving, but most of the time I had to figure it out from email bounces. Sadly, it was rare that reporters changing jobs would rejoin.

Be understanding about the news cycle. Intense news cycles, whether it’s hurricanes or political crises, mean that reporters are not only going to get pulled away from the project but from their daily work, too. Days with breaking news may mean trainings or calls need to be rescheduled and publication dates bumped back. It’s important to be flexible on scheduling and timelines.

Adapt to the realities of the beat. It’s not uncommon for crime victims, especially hate-crime victims, to be reluctant to go on the record or even speak to journalists. Their cases are difficult to report out and verify. So like in a lot of beats, a promising lead doesn’t guarantee an achievable story. Crowdsourced data made the odds even longer in many cases, since we didn’t receive tips for every partner. That’s why it’s important to set expectations and offer context and guidance about the beat from the outset.

Expand your offerings. Given the aforementioned challenges, it’s a good idea to diversify potential story sources. We made a log of hate-crime-related public records requests at ProPublica for our reporting, and we made those records available to partners. We also offered a weekly newsletter with news clips and new reports/data from external sources, monthly webinars and guidance on investigating hate crimes.

Be flexible on communication strategies. Even though Slack can be useful for quick communication, especially among large groups, not everyone likes to use it or knows how. Email is what I’ve used most consistently, but reporters’ inboxes tend to pile up, and sometimes calling is easiest. Some journalists are heavy WhatsApp users, and I get through to them fastest there. Holding webinars and trainings is helpful to get some virtual face time, and sending event invites is another way you can get someone’s attention amid a crowded inbox. It’s useful to get a sense of the methods to which people are most responsive.

Celebrate success stories. There is a huge amount of work that doesn’t end up seeing the light of day, so I make an effort to signal-boost work that gets produced. I’ve highlighted big stories that ProPublica and our partners have done to show other partners how they can do similar work or localize national stories. Amplifying these stories by sharing on social and in newsletters, as well as featuring them in webinars, can help inspire more great work.

Be diligent about tracking progress. Our database software has a built-in tracking system for submissions, but I separately track stories produced from the project, news clips and interviews that mention the project, as well as impact from reporting. I keep on top of stories partners are working on, and I also use Google Alerts, internal PR emails and daily clip searches.

Evaluate your work. I’m surveying current and past Documenting Hate participants to get feedback and gauge how participants felt about working with us. I’m also going to write a post-mortem on the project to leave behind a record of the lessons we learned.


Meet the Baconator


As a member of the team responsible for keeping ProPublica’s website online, I sometimes wished our site were static. Static sites have a simpler configuration with fewer moving parts between the requester and the requested webpage. All else being equal, a static site can handle more traffic than a dynamic one, and it is more stable and performant. However, there is a reason most sites today, including ProPublica’s, are dynamically generated.

In dynamic sites, the structure of a webpage — which includes items such as titles, bylines, article bodies, etc. — is abstracted into a template, and the specific data for each page is stored in a database. When requested by a web browser or other end client, a server-side language can then dynamically generate many different webpages with the same structure but different content. This is how frameworks like Ruby on Rails and Django, as well as content management systems like WordPress, work.
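In miniature, the idea looks something like the following sketch, which uses Python’s built-in string.Template rather than any real CMS or framework; the field names and sample content are invented.

```python
from string import Template

# The shared page structure lives in a template...
ARTICLE = Template(
    "<article><h1>$title</h1><p class='byline'>By $byline</p>$body</article>"
)

def render_article(row):
    """Combine the template with one article's content to produce HTML."""
    return ARTICLE.substitute(title=row["title"], byline=row["byline"], body=row["body"])

# ...and the content for each page comes from a database. In a real CMS this
# dict would be the result of a database query.
print(render_article({
    "title": "Example headline",
    "byline": "Jane Reporter",
    "body": "<p>Article text goes here.</p>",
}))
```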

That dynamism comes at a cost. Instead of just HTML files and a simple web server, a dynamic site needs a database to hold its content. And while a server for a static site responds to incoming requests by simply fetching existing HTML files, a dynamic site’s server has the additional job of generating those files from scripts, templates and a database. With moderately high levels of traffic, this can become resource intensive and, consequently, expensive.


This is where caching comes into play. At its most basic, caching is the act of saving a copy of the output of a process. For example, your web browser caches images and scripts of sites you visit so subsequent visits to the same page will load much faster. By using locally cached assets, the web browser avoids the slow, resource-intensive process of downloading them again.

Caching is also employed by dynamic sites in the webpage generation process: at the database layer for caching the results of queries; in the content management system for caching partial or whole webpages; and by using a “reverse proxy,” which sits between the internet and a web server to cache entire webpages. (A proxy server can be used as an intermediary for requests originating from a client, like a browser. A reverse proxy server is used as an intermediary for traffic to and from a server.)

However, even with these caching layers, the demands of a dynamically generated site can prove high.

This was the case two years ago, shortly after we migrated ProPublica’s website to a new content management system. Our new CMS allowed for a better experience both for members of our production team, who create and update our articles, and for our designers and developers, who craft the end-user experience of the site. However, those improvements came at a cost. More complex pages, or pages requested very frequently, could tax the site to the point of making it crash. As a workaround we began saving fully rendered copies of resource-intensive pages and rerouting traffic to them. Everything else was still served by our CMS.

As we built tools to support this, our team was also having conversations about improving platform performance and stability. We kept coming back to the idea of using a static site generator. As the name suggests, a static site generator does for an entire site what our workaround did for resource-intensive pages. That is, it generates and saves a copy of each page. It can be thought of as a kind of cache, saving our servers the work of responding to requests in real time. It also provides security benefits, reducing a website’s attack surface by minimizing how much users interact directly with potentially vulnerable server-side scripts.

In 2018, we brought the idea to a digital agency, Happy Cog, and began to workshop solutions. Because performance was important to us, they proposed that we use distributed serverless technologies like Cloudflare Workers or AWS Lambda@Edge to create a new kind of caching layer in front of our site. Over the coming months, we designed and implemented that caching layer, which we affectionately refer to as “The Baconator.” (Developers often refer to generating a static page as “baking a page out.” So naturally, the tool we created to do this programmatically for the entire site took on the moniker “The Baconator.”) While the tool isn’t exactly a static site generator, it has given us many of the benefits of one, while allowing us to retain the production and development workflows we love in our CMS.

How Does It Work?

There are five core components:

  • Cache Data Store: A place to store cached pages. This can be a file system, database or in-memory data store like Redis or Memcached, etc.
  • Source of Truth (or Origin): A CMS, web framework or “thing which makes webpages” to start with. In other words, the original source of the content we’ll be caching.
  • Reverse Proxy: A lightweight web server to receive and respond to incoming requests. There are a number of lightweight but powerful tools that can play this role, such as AWS Lambda or Cloudflare Workers. However, the same can be achieved with Apache or Nginx and some light scripting.
  • Queue: A queue to hold pending requests for cache regeneration. This could be as simple as a table in a database.
  • Queue Worker: A daemon to process pending queue requests. Here again, “serverless” technologies, like Google Cloud, could be employed. However, a simple script on a cron could do the trick as well.
How Do the Components Interact?

When a resource (like a webpage) is requested, the reverse proxy receives the request and checks the cache data store. If a cached copy of that resource exists, the proxy notes its expiration, or time to live (TTL), and serves the cached copy. It then checks that TTL: if the cache has not yet expired, the copy is considered valid and nothing else is done; if it has expired, the reverse proxy adds a request to the queue for that resource’s cache to be updated.

Meanwhile, the queue worker is constantly checking the queue. As requests come into the queue, it generates the webpage from the origin and updates the corresponding cache in the data store.

And finally, at the origin, anytime a page is created or edited, the cache data store is amended or updated.

(Diagram: How the elements in the Baconator interact.)
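Here is a minimal, single-process sketch of that interaction in Python. It isn’t the Baconator’s actual code: the real system uses a shared cache data store and serverless workers, and the article doesn’t describe what happens on a cold cache miss, so that branch is a simplification here.

```python
import time
from queue import Queue

CACHE = {}             # path -> (html, expires_at); stands in for Redis, a database, etc.
REGEN_QUEUE = Queue()  # pending cache-regeneration requests
TTL = 300              # seconds a cached page is considered fresh

def render_from_origin(path):
    """Stand-in for the origin (the CMS) generating a page."""
    return f"<html><body>Rendered {path} at {time.time():.0f}</body></html>"

def reverse_proxy(path):
    """Serve from cache; queue a refresh instead of rendering in the request path."""
    entry = CACHE.get(path)
    if entry is None:
        # Cold miss (simplified): render once so the request can be answered.
        html = render_from_origin(path)
        CACHE[path] = (html, time.time() + TTL)
        return html
    html, expires_at = entry
    if time.time() > expires_at:
        REGEN_QUEUE.put(path)  # stale: serve the old copy now, refresh later
    return html

def queue_worker():
    """Process pending requests, re-rendering pages from the origin."""
    while not REGEN_QUEUE.empty():
        path = REGEN_QUEUE.get()
        CACHE[path] = (render_from_origin(path), time.time() + TTL)

# The first request populates the cache; later requests are served from it.
print(reverse_proxy("/article/example"))
print(reverse_proxy("/article/example"))
queue_worker()
```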

For our team, the chief benefit of this system is the separation between our origin and web servers. Where previously the servers that housed our CMS (the origin) also responded to a percentage of incoming requests from the internet, now the two functions are completely separate. Our origin servers are only tasked with creating and updating content, and the reverse proxy is our web server that focuses solely on responding to requests. As a consequence, our origin servers could be offline and completely inaccessible, but our site would remain available, served by the reverse proxy from the content in our cache. In this scenario, we would be unable to update or create new pages, but our site would stay live. Moreover, because the web server simply retrieves and serves resources, and does not generate them, the site can handle more traffic and is more stable and performant.

Another important reason for moving to this caching system was to ease the burden on our origin servers. However, it should be noted that even with this caching layer it is possible to overload origin servers with too much traffic, though it’s far less likely. Remember, the reverse proxy will add expired pages to the queue, so if the cache TTLs are too short the queue will grow. And if the queue worker is configured to be too aggressive, the origin servers could be inundated with more traffic than they can handle. Conversely, if the queue worker does not run frequently enough, the queue will back up, and stale pages will remain in cache and be served to end users for longer than desired.

The key to this system (as with any caching system) is proper configuration of TTLs: long enough so that the queue stays relatively low and the origin servers are not overwhelmed, but short enough to limit the time stale content is in cache. This will likely be different for different kinds of content (e.g., listing pages that change more frequently may need shorter TTLs than article pages). In our implementation, this has been the biggest challenge with moving to this system. It’s taken some time to get this right, and we continue to tweak our configurations to find the right balance.
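For example, that per-content-type tuning might be expressed as something like the following; the path prefixes and numbers here are invented, not our production values.

```python
# Invented per-section TTLs, in seconds: listing pages that change often
# expire sooner than mostly static article pages.
TTL_BY_PREFIX = {
    "/": 60,           # homepage
    "/series/": 300,   # listing pages
    "/article/": 3600, # article pages
}

def ttl_for(path, default=300):
    """Return the TTL of the longest matching path prefix."""
    matching = [p for p in TTL_BY_PREFIX if path.startswith(p)]
    return TTL_BY_PREFIX[max(matching, key=len)] if matching else default

assert ttl_for("/article/some-story") == 3600
assert ttl_for("/series/documenting-hate") == 300
assert ttl_for("/about") == 60
```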

For those interested in this kind of caching system, we’ve built a simple open-source version that you can run on your own computer. You can use it to explore the ideas outlined above.



She Photographed Police Abuse at a 2014 BLM March, Then Watched the Image Go Viral During the Capitol Riot


In the midst of Wednesday’s assault on the U.S. Capitol, Twitter user @thejuliacarter gave voice to the outrage many felt at the stark difference between what appeared to be the accommodating treatment of the rioters by the Capitol Police and the brutal treatment of peaceful protesters by armor-clad officers in cities like Ferguson, Missouri, and Minneapolis in the past few years.

The tweet included two photos. On the left was a photograph of a Black man being tackled by riot police. It was made by photojournalist Natalie Keyssar at a peaceful Black Lives Matter protest in Ferguson in 2014. On the right was a photograph of several white men, one of whom is carrying a Confederate battle flag, roaming freely around the Capitol. It was made by New York Times staff photographer Erin Schaff. Schaff later told a harrowing story of being trapped in the Capitol as rioters broke in.

The tweet soon went viral, with 188,000 retweets as of Jan. 8.

“There are two Americas.” pic.twitter.com/CptXMNRLYw — Julia Carter (@thejuliacarter), January 6, 2021

I’ve been friends with Keyssar for years, and she has done work for ProPublica. I asked for some of her insights about having played a small role in helping articulate America’s reaction to the riots and about revisiting the photo she took years ago in a new context. What follows is a conversation we had over email, lightly edited for clarity and length.

You took the photo on the left when you were covering the protests in response to the Ferguson grand jury’s decision not to prosecute Officer Darren Wilson for the killing of Michael Brown, a Black 18-year-old. What was the scene and moment like where you shot the photo?

This was six years ago now, but what stands out in my head really distinctly was that as this peaceful group of demonstrators marched through the surrounding blocks near the [St. Louis Rams] stadium led by Bishop Derrick Robinson and several other important community leaders, I could see that it was about to get ugly. The police were being extremely aggressive, which they often were at these demonstrations, but they seemed particularly angry about a protest outside of a football game. The level of tension concerned me. I’ve covered a lot of protests and I feel like I have a good sense of when things are about to get violent, and I started warning some parents with small children that I thought the police were going to attack soon. I use that word, attack, deliberately because this was not a situation where a group of militant protesters are pushing against a police line. We are talking about peaceful families chanting and marching on public streets.

Eventually the police pushed the relatively small group of remaining protesters into a public park across the street from the stadium. Robinson was speaking into the megaphone, and the police just started charging people and tackling them to the ground. I remember one of them hit me while photographing as maybe four police in riot gear threw a teenage girl, maybe 90 pounds, to the ground. She was screaming. Then they went for the pastor. He was not struggling in this image, just trying to keep his balance. He broke no law. He did not fight them. It was a profoundly disturbing scene.

I know that, like many photojournalists, you’re concerned about the ways in which photographs can oversimplify and how they can objectify Black suffering, so I’m curious how you felt when you saw the juxtaposition of your photo next to the rioters who stormed the Capitol.

When I saw this juxtaposition pop up on my timeline I was kind of shocked because this moment in St. Louis is burned into my memory, and I was still processing this attack on the Capitol. My response was that this pairing represents the rage and sadness I feel, seeing a white mob storm the Capitol with very little resistance from law enforcement, in a country where people of color are frequently subject to harsh punishment and extrajudicial killing for no transgression at all.

The problem with still images is that they are, almost by nature, simplifications. And the history of photography is inextricably woven together with the history of exploitation, colonialism and the aestheticization of the suffering of Black, brown and poor people. I try very hard with my work to never fall into repeating these errors, although as a white photographer I am certain that I have, and I am constantly working to learn and do better. I think one of my most important metrics of whether an image is the objectification of suffering, or the necessary documentation of injustice, is that the perpetrators are represented, that the harm is contextualized, that the image serves to document wrongdoing by those in power, and I hope that it is true that this image does that.

I think my image functions as an important record of wrongdoing and undeniable proof of police abuse toward Black activists. And this juxtaposition serves to enhance that message. I also saw today that Pastor Robinson posted this image pairing on his own Facebook page, which indicated that he approves of this usage, which is very important to me. This image has popped up in other comparative contexts, but this is by far the most visible this photo has ever been.

(Photo caption: Robinson has become a notable leader among the Ferguson, Missouri, protesters. Credit: Natalie Keyssar)

Your name isn’t on the Ferguson image in the tweet, but it’s not hard to figure out that you’re the photographer. Have you gotten any reactions to this, other than from your colleagues?

The vast majority of the reactions I've seen have just been people retweeting and amplifying the diptych and the message it conveys. Other than a few inquiries from within the industry, the vast majority of people interacting with these images are far more interested in the disparity they convey and the events of Wednesday than they are in me as the photographer.

I’ve also seen comments about how this pairing is “cherry-picked” or taken out of context, a commentary which I disagree with vehemently, but it’s interesting to see the way people see what aligns with their beliefs.

There’s an inherent reduction of context on social media. There are no two ways about this; and I hope this image particularly does have a lot of symbols embedded in it. I saw the entire scene [in St. Louis] unfold, and this attack seemed completely unjustified. This is a peacefully protesting man of the cloth. Not that being a clergyman automatically conveys blanket innocence, but who is safe if not this person?

Because of the racist history of this country, you could find thousands of other images of unimpeachable behavior being punished violently, but I was struck that this photo had been chosen by whoever made this diptych, because its full context has so much specific relevance in comparison to the events of Wednesday.

So far this image has been favorited 461,000 times, and retweeted 188,000 times, on Twitter. Are you surprised by that? How does that virality, and the emotion that likely fuels it, change the way you think about the impact of your work?

Seeing a picture go viral is almost like watching it become property of the collective consciousness; it takes on its own meaning and power. I was thinking yesterday about the words “my picture” and what that meant because in many ways although I am the author, this is really Bishop Robinson’s picture if it’s anyone’s. It’s very complicated, the concept of authorship when you’re documenting people’s lives during very difficult moments.

I also think about the power of virality to transport something outside of the somewhat limited world of journalism and those who consume it to much farther reaching audiences. I mean this pairing and the words “There are two Americas” is really almost a meme right? Except it is not funny in any way. Usually memes are the language of comedy and incredibly widely consumed, and this is a powerful and necessary conduit for information. Journalism and art are often consumed by a relatively small audience. Though social media has its drawbacks, it enables these products and concepts to reach a wider audience.

The meme-ification of an image like this can bring attention to a breaking historical event to people who aren’t necessarily paying attention to news and analysis. They might see this pairing, and the four words, and come to some interaction with the same concepts. It’s putting the events of yesterday in an easy to share context.

How has the recontextualization of your image, taken so long ago, changed its meaning for you?

When I think back on when I made this image, it was just a few months after the Black Lives Matter movement became part of the national conversation after the killing of Brown. Six years later, seeing this image in this context, you can see that little if anything has changed in terms of the brutal treatment of Black and brown people at the hands of the police.

One of the stories that has been somewhat drowned out by Wednesday’s events is that the police [in Kenosha, Wisconsin] who shot Jacob Blake and paralyzed him were also not charged with anything. Meanwhile, as we know, there has been a massive rise in the visibility and activity of radical right-wing white supremacist extremists. One hopes, when documenting movements for social justice, that one is covering a process toward progress, that six years later this image would be more shocking because maybe some reforms would be in place and this scene would not be just one of the many like it that we’ve seen this year. But on the contrary, it’s part of a growing canon of photographs of these kinds of abuses. Seeing this picture paired with Erin Schaff’s image from yesterday raises questions for me about what the next image in the sequence, years from now, will be. Will we be looking back on these times talking about how dark a chapter in our history this was? Or will we be shaking our heads, much as we are now, saying if we only knew then what these next years would bring?

read more...

New: View an Organization’s Employees and Officers on Nonprofit Explorer

On Friday, we updated our Nonprofit Explorer database in two big ways. First, we’ve added the ability to view key employees and officers right on an organization’s page. Second, we’ve updated and extracted a ton of fresh data beyond our normal tax filing updates, adding millions of new employee records and tens of thousands of new audits.

Now, on an organization’s page on Nonprofit Explorer, you’ll notice a new section below each entity’s financial information for each fiscal year. In that section, you’ll find up to 25 key employees and officers of the organization, along with each person’s role and compensation.

This new feature provides detail beyond the executive compensation numbers reported in the financial summaries. When looking at universities and their athletic associations, for example, you’ll be able to quickly see that football coaches pull in some of the largest paychecks: Kirby Smart at the University of Georgia made $6.7 million in fiscal year 2019, while Dan Mullen at the University of Florida received $6 million in compensation in the same period. And you’ll find $18 million in compensation to Bobby Petrino, former head football coach at the University of Louisville, in the 2019 fiscal year. The organization’s filing explains that $13.1 million of that total was payment for early termination of Petrino’s contract. (Petrino is now head coach at Missouri State.)

Employees listed on a recent filing from the University of Georgia Athletic Association, whose football coach makes one of the highest coaching salaries among nonprofit universities.

While we’ve had a people search feature since 2018, allowing users to find anyone listed on electronically filed tax returns as an employee or board member, this is the first time we’ve surfaced that information in an easy-to-use way.

Now, that information is available right on an organization’s main page. For full financial details — including benefits and compensation from related organizations — you’ll still need to read the organization’s full 990 filing. But this feature offers a more convenient way to view information on the more than 24 million key employees and board members in our database.

While the majority of tax-exempt organizations that file tax forms do so electronically, some still file on paper. (Small nonprofits do not need to file a 990.) You won’t see employee and officer data for organizations that submit paper filings, as we’re only able to extract this information from electronic filings. However, since a 2019 law mandates e-filing for all organizations in fiscal years ending after June 30, 2020, we’ll be able to provide employee data for an increasingly large proportion of organizations once newer filings come in.

While we update 990 filings monthly, we also provide other data about nonprofits. The employee records we’re surfacing are now updated for all filings processed through early May 2021, which adds nearly 8 million new records. We’ve also added more than 33,000 audits of nonprofit organizations from the Federal Audit Clearinghouse. These documents provide additional insight into the financial management and oversight at nonprofit organizations that spend $750,000 or more in federal grant money in a given fiscal year.

The recent updates expand the information already available on Nonprofit Explorer, which puts nearly a decade of financial information about tax-exempt organizations across the U.S. at your fingertips — whether you’re a journalist backgrounding a source, a potential donor researching charitable organizations or a researcher compiling data.

We make this public information available and easy to access so that anyone can use it — and we want to hear what you’re using it for. Have you used the compensation data, or other nonprofit data, to report a story? Drop us a line and tell us about it!

read more...

Look Up Nursing Home Staff COVID-19 Vaccination Rates

On Thursday, ProPublica added staff COVID-19 vaccination data to the Nursing Home Inspect project.

The virus has killed more than 150,000 nursing home residents and staff since the beginning of the pandemic. Experts say that staff vaccination is a key part of protecting residents from outbreaks in their homes, but thousands of workers remain unvaccinated despite a federal COVID-19 vaccination mandate for health care employees. Some of those unvaccinated workers are claiming medical exemptions, which doctors say should be rare.

Nursing Home Inspect already lets the public, researchers and reporters search deficiency reports and other data across more than 15,000 nursing homes in the United States. Now, users can quickly compare staff COVID-19 vaccination and booster rates across states and between nursing homes.

Each state page allows users to sort homes by vaccination rate, making it easy to identify homes in your state with very low or very high vaccination rates. For each nursing home, a chart allows users to see how the home compares with both state and national averages.

Additionally, we have removed the COVID-19 case and death count data from the database because the figures were reported cumulatively and do not provide an accurate picture of recent outbreaks.

If you write a story using this new information, or you come across bugs or problems, please let us know!

read more...

Visualizing Toxic Air

ProPublica is a nonprofit newsroom that investigates abuses of power. Sign up to receive our biggest stories as soon as they’re published.

This story was co-published with Investigative Reporters and Editors. It will appear in an upcoming special issue of The IRE Journal focused on pollution.

In November 2021, ProPublica published a series of immersive investigative stories about a statistical cancer-risk model created by the Environmental Protection Agency. Our reporting showed that although the model revealed increased cancer risk in communities all over the country, the agency did little to stop the toxic air emissions that were causing the increased risk — or even to inform affected communities.

Discovering the Information Gap

Building the project required that we develop a thorough understanding of a complex statistical model, ground-truth the sometimes unreliable data that had been self-reported by polluters, solve technical challenges associated with massive data sets and interview people who lived and worked near dangerous pollution.

The project builds on a series we worked on alongside The Times-Picayune and The Advocate of New Orleans in 2019. That project was about the residents of “Cancer Alley,” a region of southeast Louisiana home to many refineries and chemical plants. While residents had long complained that they were being sickened by industrial smokestacks, many regulators and corporate spokespeople argued the air was safe to breathe.

Companies that emit industrial pollution have long been required to report their emissions to the EPA, which makes the data public in an online database called the Toxics Release Inventory. But our reporting in Louisiana found that the TRI data is not precise enough to show the fine-grained degrees of risk in industrial areas, which left the people living closest to facilities unsure about their safety.

When we began researching how we might obtain data that would enable us to quantify pollution levels and cancer risk at a finer scale, we found out that the EPA had actually created its own high-precision model called the Risk-Screening Environmental Indicators Model, or RSEI, which was capable of doing just that.

The trouble was, the EPA published the results of the RSEI model in an interface that makes it very difficult to understand where the pollution travels and how serious the associated cancer risk is.

RSEI uses the emissions estimates that industrial companies submit to the agency each year, along with weather data and facility-specific information, to estimate concentrations of cancer-causing chemicals in half-mile-wide squares of land across the country. Using the powerful tool, we found that we could estimate how the emissions from, say, a plastics plant could be elevating the cancer risk near an elementary school several blocks away.

After we published a visual story about the dangerous concentrations of carcinogenic air blanketing neighborhoods in southeast Louisiana, we recognized the need for a deeper, national analysis. So we embarked on a two-year endeavor to identify toxic hot spots and to build an interactive map residents could use to look up the estimated cancer risks at any address in the country.

Taking the Investigation National

Expanding our original Cancer Alley analysis to cover the entire country presented an enormous data challenge. The EPA organizes RSEI output by splitting the country into 810-by-810-meter grid cells. For each cell, there are rows for the concentrations of every chemical attributed to each facility.

There are around 29 million of these grid cells nationwide and more than 1.4 billion rows of data for a single year. Even using the largest database instance available on Amazon Web Services, it took up to a week to run queries on the data. Often, our queries took days simply to fail. It was a long, demotivating slog.

That’s when some colleagues told us about Google BigQuery, a Google Cloud product that lets you run SQL-style queries on very large data sets. Using BigQuery, code that once took a week to run finished in minutes.

Because of this dramatic speedup, we were also able to expand our ambitions. Averaging five years of data would make our analysis much more robust, since it would smooth out any facility that happened to have a particularly bad or good year in our observation data set. And because our analysis was meant to calculate incremental lifetime cancer risk, taking a five-year average instead of a one-year snapshot would yield a much more accurate estimate.

Loading in five years of RSEI data increased the size of the database from about 1.4 billion rows to about 7 billion rows. Yet BigQuery happily crunched through it.
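
To give a sense of the shape of that computation, here is a minimal sketch of a five-year averaging query run through BigQuery’s Python client. It is not ProPublica’s actual pipeline; the dataset, table and column names (toxmap.rsei_concentrations, grid_cell_id, chemical, concentration, submission_year) are hypothetical placeholders.

# A hedged sketch, not the project's actual code: average five years of
# RSEI concentration rows per grid cell and chemical using BigQuery.
# The table and column names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # relies on default Google Cloud credentials

QUERY = """
SELECT
  grid_cell_id,
  chemical,
  AVG(concentration) AS avg_concentration_2014_2018
FROM `toxmap.rsei_concentrations`
WHERE submission_year BETWEEN 2014 AND 2018
GROUP BY grid_cell_id, chemical
"""

# BigQuery scans the multibillion-row table server-side and returns only
# the aggregated rows to the client as a pandas DataFrame.
five_year_avg = client.query(QUERY).to_dataframe()
print(five_year_avg.head())

Because the heavy lifting happens on Google’s servers, the client only ever handles the comparatively small aggregated result.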

When our code finished running, we had detected more than 1,000 toxic hot spots — some the size of a single grid cell, some encompassing entire cities or regions. We were also able to determine which facilities were responsible for the highest average cancer risks within a given radius surrounding them.
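
As a rough illustration of that radius calculation, the sketch below averages grid-cell cancer risk around each facility using a haversine distance. The DataFrame columns (facility_id, lat, lon, cancer_risk) and the 5-kilometer radius are assumptions made for the example; the published methodology differs in its details.

# A simplified sketch of ranking facilities by the average cancer risk in
# the grid cells surrounding them. Columns and radius are illustrative.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def rank_facilities(facilities: pd.DataFrame, cells: pd.DataFrame,
                    radius_km: float = 5.0) -> pd.DataFrame:
    """Average the cancer risk of grid cells within radius_km of each facility."""
    rows = []
    for facility in facilities.itertuples():
        dist = haversine_km(facility.lat, facility.lon,
                            cells["lat"].to_numpy(), cells["lon"].to_numpy())
        nearby_risk = cells.loc[dist <= radius_km, "cancer_risk"]
        rows.append({"facility_id": facility.facility_id,
                     "avg_nearby_risk": nearby_risk.mean()})
    return pd.DataFrame(rows).sort_values("avg_nearby_risk", ascending=False)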

This led us to some shocking initial findings, one of which did not stand up to scrutiny once we started reporting it out.

Questioning Assumptions

Our colleague Ava Kofman started pursuing an initial finding that appeared to indicate that Boeing was responsible for substantially increasing cancer risk over the city of Portland, Oregon. But her interviews and comparisons with state databases showed that the company had actually misreported its data to the EPA, and that faulty data had shown up as a massive overestimate of risk in the RSEI model underlying our analysis. Boeing subsequently fixed the problem and sent amended data to the agency.

Ava’s finding led us to stop what we were doing and rethink our assumptions. We created a large-scale, systematized fact-checking process. We reached out to each of the top 200 facilities (ranked by the level of nearby cancer risk) to ask them if their emissions reporting was accurate — and if not, whether they would resubmit 2014-18 data to the EPA. Of the 109 companies that responded to us, 71% confirmed that their reported emissions were correct, and 29% noted errors, which we asked them to correct. We then worked with RSEI experts to adjust the output of the model to reflect the chemical concentrations the companies provided to us directly.

Finding Stories in the Data

Once we completed the nationwide interactive map, we had a trove of potential stories before us. Some of the hot spots we identified, like Cancer Alley and the Houston Ship Channel, were infamous. Others, like the cloud of toxic ethylene oxide covering a large swath of Laredo, Texas, were not previously known — even to residents breathing the contaminated air.

Seven more ProPublica reporters joined the effort. They fanned out to report on conditions on the ground in some of the nation’s most toxic industrial areas and to investigate the state and local policy decisions driving the high emissions rates there.

Early on, we were interested in understanding which communities were most affected by the toxic pollution. Since RSEI data is available at the census-tract level, we were able to join our cancer risk estimates to demographic information. This analysis estimated that predominantly Black census tracts experience more than double the level of toxic industrial air pollution that majority-white tracts do.
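
A minimal sketch of that tract-level comparison, assuming hypothetical input files and columns (tract_geoid, avg_added_cancer_risk, pct_black, pct_white), might look like this:

# A hedged sketch: join tract-level risk estimates to demographic shares
# and compare majority-Black tracts with majority-white tracts.
# File names and column names are hypothetical placeholders.
import pandas as pd

risk = pd.read_csv("tract_risk.csv")          # tract_geoid, avg_added_cancer_risk
demo = pd.read_csv("tract_demographics.csv")  # tract_geoid, pct_black, pct_white

tracts = risk.merge(demo, on="tract_geoid", how="inner")

def classify(row):
    # Label each tract by whether one group makes up more than half its population.
    if row["pct_black"] > 50:
        return "majority Black"
    if row["pct_white"] > 50:
        return "majority white"
    return "other"

tracts["group"] = tracts.apply(classify, axis=1)

# Compare average added cancer risk across the groups.
print(tracts.groupby("group")["avg_added_cancer_risk"].mean())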

We were also curious about which companies were the primary drivers of the toxic pollution. We mapped the facility ownership profiles of the nation’s dominant chemical companies. We then computed the number of RSEI grid cells in which each company independently elevates cancer risk above various EPA risk thresholds. We published the results of this analysis in the first story of our “Sacrifice Zones” series.
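
The per-company grid-cell count is conceptually a filter and a group-by; a hedged sketch is below. The input file, its columns (grid_cell_id, parent_company, company_risk) and the single 1-in-100,000 threshold are illustrative assumptions; the published analysis used several EPA risk thresholds.

# A sketch of counting, for each parent company, the grid cells where that
# company's attributable cancer risk alone exceeds a threshold.
# The input file, columns and threshold are hypothetical placeholders.
import pandas as pd

THRESHOLD = 1e-5  # an illustrative additional lifetime cancer risk of 1 in 100,000

cell_risk = pd.read_csv("cell_company_risk.csv")  # grid_cell_id, parent_company, company_risk

cells_over_threshold = (
    cell_risk[cell_risk["company_risk"] > THRESHOLD]
    .groupby("parent_company")["grid_cell_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(cells_over_threshold.head(10))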

After our stories and interactive news application launched, the EPA announced a raft of targeted actions and specific reforms, including stepped-up air monitoring and scrutiny of industrial polluters. In February 2022, three Democratic U.S. representatives introduced a $500 million bill that would require the EPA to create a pilot program for air monitoring in communities overburdened with pollution.

Explore the interactive map and the stories that came out of it at propublica.org/toxmap.
