Learn Data, Design and Code for Journalism. Apply for ProPublica’s 2017 Data Institute!

ProPublica is proud to open applications for our 2017 Data Institute, a free 11-day intensive workshop on how to use data, design and programming for journalism. The workshop will be from June 7 to June 21 in ProPublica’s New York offices. The deadline to apply is March 31. Apply here.

This year we’re excited to be partnering with The Ida B. Wells Society for Investigative Reporting, an organization dedicated to increasing and retaining reporters and editors of color in the field of investigative reporting through trainings, partnerships and mentoring. ProPublica will award one spot in our Class of 2017 to an Ida B. Wells Society member. If you are a member of the Ida B. Wells Society and are interested in applying, we’ll post a link to the application soon.

Geared towards journalists and journalism students, this workshop will cover everything from finding and analyzing data, to using colors and typography for better storytelling, to scraping a website using code. By the end of the Institute, students will have created an interactive data project from beginning to end, with help and guidance from some of the best designer/developer/data journalists in the world.

ProPublica’s News Apps team has worked on colleges that saddle poor students with debt, doctors who take money from drug companies, how much limbs are worth in different states, and even investigative space journalism. The workshop will cover step-by-step how ProPublica brainstorms, reports, designs and builds these types of interactive graphics and data-driven news applications.

One of the reasons we’re so excited about this workshop is because it is another step ProPublica is taking to increase the diversity of its own newsroom and beyond. That means training and empowering journalists from a broad array of social, ethnic and economic backgrounds. We are particularly dedicated to helping people from communities that have long been underrepresented not only in journalism but particularly in investigative and data journalism, including African Americans, Latinos, other people of color, women, LGBTQ people and people with disabilities.

The workshop is completely free to attend and ProPublica will cover travel and lodging, as well as breakfast and lunch during workshop days. Additionally, to make the Data Institute accessible to people for whom it would still be economically out of reach, ProPublica is offering a limited number of $1,000 stipends. Requests for stipends are part of the application.

The Data Institute is made possible by a grant from the John S. and James L. Knight Foundation.

So don’t wait, apply now! Or email this description to someone you think should apply.

If you have any questions, email data.institute@propublica.org. We’ll be checking daily and can’t wait to hear from you.

Apply!

Introducing the Vital Signs API

The Vital Signs web app, launched today, brings together the abundance of health reporting we’ve done over the years to provide patients with easy access to information that can help them manage the quality of their care.

Behind the app is our Vital Signs API, which also launches publicly today.

An API — essentially a way for computer programs to exchange data with each other — isn’t the kind of thing you might typically think of as journalism. But behind its creation is the same reporting that powers our in-depth investigative stories and interactive online databases. The API brings together facts and insights from years of work by our staff, and its aim is the same as all our work: impact.

There are a few key moments when the information included in Vital Signs is most important: when patients are meeting with their doctors, when patients are choosing doctors, and when other health professionals are choosing which providers to work with. Building a commercial API allows us to make sure our reporting is available to users at all of these moments, by making it easier to see our data in the apps and tools that they’re already using.

We used the API to power our own web app, which focuses on helping patients have important, potentially difficult conversations with the providers they see. We’ve highlighted issues they might want to discuss and provided context for those decisions. We built it specifically for mobile devices so that it’s easy for patients to bring this data into appointments and other health care settings. You can read more about that here.

There are plenty of consumer review sites and doctor-finder tools that already exist to help patients find new doctors. Review sites, insurance companies and other apps give patients information about where doctors are located and what insurance they take, post reviews from other patients, provide star ratings and more. Releasing our data as an API makes it easier for these sites to include the Vital Signs data, as well, helping users make more informed choices about their health care.

An API also makes our data available to hospitals, insurance companies and health care agencies that want to link independent data and analysis with internal systems to assess and compare the performance of providers. When our reporting appears in these kinds of systems, it can help ensure that doctors and nurses who provide lower-cost and higher-quality care are recognized by insurers and employers. It can also help institutions identify and address problems with providers.

If you’re one of these types of users, we’d like to hear from you. The Vital Signs API is a commercial product; for now, it is available only to participants in a closed beta program. Over the coming months, we’ll be learning from them what endpoints, features and data would make the API even more useful. Some of our early users are health care agencies, consumer review sites, mobile app makers and insurers.

We’ll be welcoming a limited number of additional participants into the beta program this spring. If you’re interested in learning more about the API, you can view documentation and sign up for the beta user waitlist here. We’ll roll out the full, commercially available version later this year.

We’ll also be hosting a hackathon in Chicago in May. To apply, submit your information through the hackathon application form. If you’re interested in partnering with us on the hackathon (or supporting it as a sponsor), email me at celeste.lecompte@propublica.org.

Congress API Update

When we took over projects from the Sunlight Foundation last year, we inherited an Application Programming Interface, or API, that overlapped with one of our own.

Sunlight’s Congress API and ProPublica’s Congress API are similar enough that we decided to try to merge them together rather than run them separately, and to do so in a way that makes as few users change their code as possible.

Today we’ve got an update on our progress.

Users of the ProPublica Congress API can now access additional fields in responses for Members, Bills, Votes and Nominations. We’ve updated our documentation to provide examples of those responses. These aren’t new responses but existing ones that now include some new attributes brought over from the Sunlight API. Details on those fields are here.
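If you haven’t used the API before, requests are plain HTTPS calls authenticated with an API key header. Here’s a minimal sketch in Python that fetches one of the updated Member responses; the member ID and the printed fields are illustrative examples, not a full list of the new attributes.

import requests

API_KEY = "PROPUBLICA CONGRESS API KEY"  # replace with your own key
url = "https://api.propublica.org/congress/v1/members/D000216.json"
resp = requests.get(url, headers={"X-API-Key": API_KEY})
member = resp.json()["results"][0]
# print a couple of fields; the full response includes the newly merged attributes
print(member.get("first_name"), member.get("last_name"))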

We plan to fold in Sunlight fields and responses for Committees, Hearings, Floor Updates and Amendments, though that work isn’t finished yet.

As part of the merger process, we also need to deprecate a few features: For instance, the combined API will drop a few older attributes, including the unique identifiers for lawmakers that were used by the now-defunct Thomas website. We’ve also removed Facebook numeric IDs (not account names) for lawmakers, owing to changes in the Facebook API. Details on which fields we’re not supporting are here, organized by specific object (Member, Bill, etc.).

We are also deprecating the Sunlight API’s feature that let developers filter on fields within requests. The ProPublica Congress API has a more traditional RESTful style that has only a few query-string options.

Another feature we need to deprecate is the geographic lookup for members and districts. You likely already have moved off of this, as the lookups have been out of date since before we took over the API. They reflect the early part of the previous Congress, before Virginia, North Carolina and Florida approved new congressional maps for House districts.

As an alternative, we are encouraging developers to use the Google Civic Information API, which is what we use in our Represent news application. If you wish to run your own geographic lookup service, you can use the Pentagon codebase that Sunlight originally developed or a more recently updated project called Represent Boundaries.

With that exception, users of the existing Sunlight Congress API will see no change during this time, and we’ll continue to update the data in it.

Once we’ve completed the process of merging in all of the Sunlight responses and fields, we’ll release version 2 of the ProPublica Congress API.

We’ll be adding some brand-new responses to the ProPublica Congress API, too. Among them are public statements by members of Congress, which we’ve already started showing on Represent. In addition to being able to search the titles of these statements, Represent users can browse them on member pages or by popular subjects like “repealing and replacing the Affordable Care Act.”

We welcome your input and feedback at apihelp@propublica.org.

New in the Congress API: Congressional Statements and More

We’ve got a few updates on our Congress API to tell you about.

First, today we’re announcing that the congressional statements appearing in our Represent congressional database are now available via our Congress API. You can see examples of the API requests and responses on our new documentation site.

The statements, which are pulled directly from official House and Senate websites, are available by date, by member and in reverse chronological order. You also can search the titles of statements by keyword or retrieve statements by subject. Subjects are assigned by a ProPublica journalist.
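As a rough sketch of what a request looks like, here’s how you might pull the most recent statements in Python; the exact endpoint path (assumed here to be statements/latest.json) and the response fields are spelled out in the documentation.

import requests

API_KEY = "PROPUBLICA CONGRESS API KEY"  # replace with your own key
url = "https://api.propublica.org/congress/v1/statements/latest.json"  # assumed path; see the docs
resp = requests.get(url, headers={"X-API-Key": API_KEY})
for statement in resp.json().get("results", []):
    print(statement.get("date"), statement.get("name"), statement.get("title"))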

Several times a day, we check RSS feeds for members who have them and load any new items into our database. For sites that don’t have RSS we resort to screen-scraping using the Statement Ruby gem. This is an imperfect solution because members of the House and Senate change the structure of their websites more often than you might think, and we have to play catch up when they do. If you see that we’re missing member statements for more than a few days, please email us at apihelp@propublica.org.

The Los Angeles Times has already used the API to build a page showing where members of the Senate stood on the firing of former FBI Director James Comey.

You can sign up for an API key here.

In addition to the new statement endpoints, we’ve added data from the Sunlight Congress API into the main ProPublica Congress API, as we continue to merge the two. For the bill endpoints, we’ve added recently enacted and recently vetoed bills. We’ve also updated endpoints for members, votes and amendments.

Next on the roadmap are House and Senate floor updates, more committee responses and the ability to search the full text of bills. Those will be coming early this summer, so stay tuned. If you run into any issues with the Congress API, you can create an issue on GitHub or email us at apihelp@propublica.org.

Finally, a bonus announcement: Demokratia, a civic technology firm that is using the Congress API for a project, has built a .NET client for testing its endpoints and released it to the public. You can find it here.

We’d love to hear from you if you’re building something with the API, too. Email derek.willis@propublica.org.

New in Our Congress API: Bill Subjects, Personal Explanations and Sunsetting Sunlight

We've got a few updates to the ProPublica Congress API to tell you about.

First, and most importantly, if you're a user of the Sunlight Congress API, you need to change your code to point to the ProPublica Congress API soon. The Sunlight API URLs will no longer work after Aug. 31, 2017. Switching is easy: get a ProPublica Congress API key and update your code, and anything you built will keep running after the switchover. We'll be publishing a guide that helps with the transition by identifying common endpoints between the two APIs.

As a reminder, we are no longer supporting the legislative boundary service that the Sunlight API supported. If your code used the Sunlight API to find legislators using a postal address, we're recommending that you switch to the Google Civic Information API for that task.

Since our last update to the ProPublica Congress API, we've made several fixes and added new responses. These include two new endpoints:

The floor updates endpoint contains descriptions of legislative activity that are updated in near real-time throughout the day. This endpoint exists for both the House and the Senate.

The House and Senate committee hearings endpoint details recent and upcoming meetings.

Both of these are similar to what the Sunlight Congress API has now.

We've also created two new responses for bill subjects, which are assigned by the Congressional Research Service for most bills. Users can now retrieve recent bills for a given subject. For example, it's now possible to retrieve recently updated bills with the subject of “taxation,” like so (using Python):

Code:

import requests
r = requests.get("https://api.propublica.org/congress/v1/bills/subjects/taxation.json",
                 headers={'X-API-Key': 'PROPUBLICA CONGRESS API KEY'})  # replace with your own key
print(r.json()['results'][0])

Result:

{'latest_major_action': 'Referred to the House Committee on Ways and Means.',
 'committees': 'House Ways and Means Committee',
 'title': 'Layoff Prevention Extension Act of 2016',
 'sponsor_uri': 'https://api.propublica.org/congress/v1/members/D000216.json',
 'number': 'H.R.5408',
 'introduced_date': '2016-06-08',
 'bill_uri': 'https://api.propublica.org/congress/v1/114/bills/hr5408.json',
 'latest_major_action_date': '2016-06-08',
 'sponsor_id': 'D000216',
 'cosponsors': 6}

Because our data spans many years and indexing systems, there are more than 6,000 potential subjects. Since 2009 the CRS has used about 1,000 subjects; more information about those subjects and a list of terms is available at Congress.gov. To help users find the subjects they are interested in we've provided a keyword search endpoint.
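Here’s a sketch of what a subject keyword search might look like; the endpoint path and the “query” parameter name are assumptions based on the documentation, so check the docs for the exact form.

import requests

API_KEY = "PROPUBLICA CONGRESS API KEY"  # replace with your own key
url = "https://api.propublica.org/congress/v1/bills/subjects/search.json"  # assumed path
resp = requests.get(url, params={"query": "tax"}, headers={"X-API-Key": API_KEY})
print(resp.json().get("results", []))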

We hope that search-by-subject will allow developers to make it easier for users to find and track legislation on topics they are interested in.

We've also added a pair of responses for “personal explanations,” which are statements inserted into the Congressional Record explaining why lawmakers missed votes and what their votes would have been (the statements are merely explanations and have no effect on the vote itself).

A database we launched last year called Personal Explanations listed these statements for every lawmaker, although they are used most often by members of the House of Representatives. That interactive database app will be folded into Represent, our main congressional database.

The explanations in the Congress API are exactly as filed in the Congressional Record, with two additions: Included is a categorization of the explanation and, if the explanation concerns only a single vote, a link to the API URL for that vote. The new responses will be for recently filed explanations and explanations for a given lawmaker. Here's an example of what that data looks like:

{
  member: "D000598",
  api_uri: "https://api.propublica.org/congress/v1/members/D000598.json",
  name: "Susan A. Davis",
  state: "CA",
  year: 2017,
  roll_call: 284,
  party: "D",
  date: "2017-06-06",
  url: "https://www.congress.gov/congressional-record/2017/06/06/extensions-of-remarks-section/article/E768-3",
  text: "Mr. Speaker, on Thursday, May 25, 2017, my vote was not recorded due to a technical error. I intended to vote YES on H.R. 1761 the Protecting Against Child Exploitation Act.",
  category: "Vote not recorded",
  parsed: true,
  vote_api_uri: "https://api.propublica.org/congress/v1/115/house/sessions/1/votes/284.json"
}

Finally, we've added two attributes to member list responses: They now have lawmakers' dates of birth and, for House members, a geoid attribute based on a format used by the Census Bureau that can help developers display congressional data on a map. For bill actions, we've changed the individual action response to use a date, not a datetime. Bills can now be filtered by two additional types: “enacted” and “vetoed.”

One last change worth mentioning: For all JSON responses, we're now displaying integer, boolean and null values as native data types rather than as strings. XML responses remain unchanged.

The next steps for the ProPublica Congress API are to make the full text of bills searchable and to add more attributes to some bill responses. With that, we'll have merged as much of the functionality of the Sunlight Congress API as we can, and we'll be ready to shut it down.

We encourage users to report errors or to request features on GitHub or by emailing apihelp@propublica.org.

Authenticating Email Using DKIM and ARC, or How We Analyzed the Kasowitz Emails

It has become a common scenario: A reporter gets a newsworthy email forwarded out of the blue. But is the email legit? It turns out there are a few technical tools you can use to check on an email, in tandem with the traditional ones like calling for confirmation. I used some of these techniques last week to help authenticate some emails forwarded to my colleague Justin Elliott. Those emails were sent by Marc Kasowitz, one of President Trump’s personal attorneys.

This post is a very brief introduction to the tools I used and that you can use when you need to authenticate an email message.

There’s a cryptographic technique that can tell us if an email message that you or your source has received matches what was sent. It comes in two similar flavors. One’s called “DomainKeys Identified Mail,” or DKIM, and the other is “Authenticated Received Chain,” or ARC. You can use them to authenticate emails that come in over the transom. It takes a tiny bit of command-line work and maybe a little coaxing of your source, but it can offer you a mathematical guarantee that the email you have on your screen is identical to the one that the source received, with no possibility of intermediate tampering.

To understand it, we need to do a little bit of e-spelunking into how email and cryptography work.

An email message has two parts: the body, which is the text of the message, and the headers, which are kind of like the outside of an envelope on a piece of snail mail. The headers include stuff you are familiar with, like To, From and the Subject lines. But they also include a lot of other, more obscure fields that aren’t shown in Gmail or Outlook. For instance, one of the fields contains what is essentially a tracking log for the email, recording the path it took from the sender’s email service to the service hosting your own email.

The obscure header we’re interested in is called the DKIM Signature. It’s kind of like the shipper’s packing list. The DKIM Signature field contains two things: First, a set of instructions for making a summary of the email, mushing up some of the headers and the message itself, and, second, a version of that summary — technically, a “hash” — that’s cryptographically signed by the sending server.

It’s meant to give the receiving server the ability to see if the contents of the email changed in transit, the digital equivalent of detecting whether the mailman steamed open the envelope and modified the contents of a letter. We can put it to good use as journalists by creating our own version of the hash and then decrypting the one made by the sending server. If the hash we create from those instructions matches the decrypted one from the message exactly, we have mathematical proof that our email is the same as the one that was sent/received.

The inverse isn’t true. That is, if the hash we create isn’t the same as the hash in the DKIM Signature field, it doesn’t necessarily mean the messages are completely different or that the message was tampered with. Some email servers are a little wonky and make little changes to an email — adding or removing spaces at the end or something like that. Even a tiny change will totally throw off the cryptographic comparison, and such changes aren’t at all uncommon. So if the hashes don’t match, it’s possible this means the email was tampered with, but you can’t draw that conclusion from a DKIM hash mismatch alone.

There are other reasons why verification might fail on a genuine email. Older emails, in particular, are more likely to not validate because the public key used to decrypt the summary of the email might have been changed. (Remember — DKIM is meant to be used when the email was received right away, not months or years later.)

So that’s DKIM. Now for ARC. ARC is similar to DKIM, but instead of being used by the sending server, it’s used by intermediaries in the email process, like listservs or servers that receive email. Many emails that arrive into Gmail are signed by Google, but this is a new development — the ARC protocol isn’t even formally approved yet.

Some emails will have both DKIM signatures and ARC signatures. Some will have only one. For instance, the email our source received only had an ARC signature, put there by Google when it arrived in Gmail. It didn’t have a DKIM signature, because the email server used by the sender’s law firm doesn’t include them. And some have neither; both of these systems are slathered on top of the original email system like sunscreen — and, also like sunscreen, some people don’t use them.

What DKIM and ARC Prove (and What They Can’t Prove)

A validated DKIM signature guarantees that you have the same email that was sent; a validated ARC signature guarantees that you have the same email that was received by the receiving server. In practice, this was perfect for us, because we needed to know for sure that the email that was forwarded to us was exactly the one our source originally received.

Just because a message you’ve been forwarded matches what was sent or received, that doesn’t mean it’s completely authentic. DKIM and ARC can’t tell you whether the sender’s server was hacked or misconfigured.

And neither technique guarantees that the sender is who they say they are. It’s theoretically possible for me to create my own email server that pretends to be hillaryclinton.com. There’s a system called Sender Policy Framework (SPF) that validates whether a sending server is really allowed to send email on behalf of a given domain. Read up on that if you think that scenario is a possibility.

DKIM and ARC also can’t confirm that the person who typed the email was the person whose name is on the account, instead of somebody else with access to it. (An email that the sender “signed” with a different kind of encryption tool like PGP or S/MIME would have cryptographic proof that it came from a computer belonging to the sender, but that’s beyond the scope of this post. The emails we were analyzing didn’t have a PGP signature, and most people don’t use these tools.)

How to Check DKIM and ARC

You’ll need a little bit of command-line knowledge and to have Python installed on your computer. I’ll assume you have both, and I’ll also assume that you’re dealing with an email forwarded by a source you are still in contact with.

You’re also going to need an original copy of the email. A forwarded version won’t work at all — the headers we care about are stripped out. That means that the emails that Donald Trump Jr. tweeted can’t be verified using these techniques (but then, I suppose, he authenticated them for us).

You’ll need to find out the service your source uses to receive email — Gmail, Outlook, Yahoo, etc. — and then find their instructions on how to forward a message as an attachment. I’m including the Gmail ones here.

Here’s how you or your source can get the original message in Gmail:

  1. Open the message.

  2. Find the little down arrow next to the reply button and click it.

  3. Click the Show Original button.

  4. In the new tab that opens, you’ll see the source of the message — including all the headers that Gmail hides from you by default. Your source should click Download Original, then email the downloaded message to you as an attachment.

Now that you have the email you want to authenticate as an attachment, you’ll need two Python libraries.

dnspython. We’ll use this to fetch the decryption key that’s used to guarantee that the two summaries match. This library grabs the key from the DNS system (it’s not included in the message, which makes it harder to spoof). You should be able to install this using pip or easy_install.

dkimpy. This is a Python library for authenticating DKIM and ARC signatures. You can grab it at https://launchpad.net/dkimpy.

Once these are installed:

  1. Using the command line, go to the directory where you unzipped dkimpy. On Mac and Linux, that’s probably something like cd ~/Downloads/dkimpy-0.6.2.

  2. Make sure you know where the email message (sent as attachment) is. It might also be in Downloads — so let’s assume the path is ~/Downloads/original_msg.txt. The file extension might be .eml or .msg instead of .txt; that’s fine, too.

  3. Execute the signature validation tool, providing the original message as an argument. That’s going to look something like one of these two commands:

     python dkimverify.py < original_msg.txt

     python arcverify.py < original_msg.txt

  4. Interpret the results. If the command comes back saying arc verification: cv=b'pass' success (for ARC) or signature ok (for DKIM) then we know the message is the same as sent or received, as the case may be. If the response is “signature verification failed” or “Message is not ARC signed,” we don’t know if the email’s been tampered with or not. (Seriously — you can’t conclude that it has been tampered with. You just don’t know.)
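If you’d rather check a signature from a script than from the shell, dkimpy can also be used as a library. Here’s a minimal sketch for the DKIM case (ARC checks go through the arcverify.py script shown above); treat it as an illustration rather than a complete verification tool.

import dkim

# read the original message, exactly as downloaded, in binary mode
with open("original_msg.txt", "rb") as f:
    message = f.read()

if dkim.verify(message):
    print("signature ok")  # the message matches what the sending server signed
else:
    print("verification failed")  # inconclusive -- see the caveats above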

I hope you find these tools useful. Seeing as how the phrase “email scandal” could refer to any number of different political brouhahas over the past two years, it’s clear that email verification is a process that we’re all going to have to get more familiar with.

How (and Why) We’re Collecting Cook County Jail Data

At ProPublica Illinois, we’ve just restarted a data collection project to get new information about what happens to inmates at one of the country’s largest and most notorious jails.

Cook County Jail has been the subject of national attention and repeated reform efforts since its earliest days. Al Capone famously had “VIP accommodations” there in 1931, with homemade meals and a large cell in the hospital ward that he shared only with his bodyguard. Other prisoners have been more poorly accommodated: In the 1970s, the warden was fired for allegedly beating inmates with his own hands, and the facility was placed under federal oversight in 1974. During the 1980s, the federal government forced the jail to release inmates because of overcrowding. In 2008, the Department of Justice found systematic violation of inmates’ 8th Amendment rights and once again pushed for reforms.

These days, the jail, which has just recently been taken out of the federal oversight program, is under new management. Tom Dart, the charismatic and media-savvy sheriff of Cook County, oversees the facility. Dart has argued publicly for reducing the population and improving conditions at the jail. He’s also called the facility a de facto mental hospital, and said inmates should be considered more like patients, even hiring a clinical psychologist to run the jail.

Efforts to study the jail’s problems date back decades. A 1923 report on the jail by the Chicago Community Trust says, “Indifference of the public to jail conditions is responsible for Chicago’s jail being forty years behind the time.”

The promises to fix it go back just as far. The same 1923 report continues, “But at last the scientific method which has revolutionized our hospitals and asylums is making inroads in our prisons, and Chicago will doubtless join in the process.”

Patterns in the data about the inmate population could shed light on the inner workings of the jail, and help answer urgent questions, such as: How long are inmates locked up? How many court dates do they have? What are the most common charges? Are there disparities in the way inmates are housed or disciplined?

Such detailed data about the inmate population has been difficult to obtain, even though it is a matter of public record. The Sheriff’s Department never answered my FOIA request in 2012 when I worked for the Chicago Tribune.

Around the same time, I started a project at FreeGeek Chicago to teach basic coding skills to Chicagoans through working on data journalism projects. Our crew of aspiring coders and pros wrote code that scraped data from the web that we couldn’t get any other way. Our biggest scraping project was the Cook County Jail website.

Over the years, the project lost momentum. I moved on and out of Chicago and the group dispersed. I turned off the scraper, which had broken for inexplicable reasons, last August.

I moved back home to Chicago earlier this month and found the data situation has improved a little. The Chicago Data Cooperative, a coalition of local newsrooms and civic-data organizations, is starting to get detailed inmate data via Freedom of Information requests. But there’s even more information to get.

So for my first project at ProPublica Illinois, I’m bringing back the Cook County Jail scraper. Wilberto Morales, one of the original developers, and I are rebuilding the scraper from scratch to be faster, more accurate and more useful to others interested in jail data. The scraper tracks inmates’ court dates over time and when and where they are moved within the jail complex, among other facts.

Our project complements the work of the Data Cooperative. Their efforts enable the public to understand the flow of people caught up in the system from arrest to conviction. What we’re adding will extend that understanding to what happens during inmates’ time in jail. It’s not clear yet if we’ll be able to track an individual inmate from the Data Cooperative’s data into ours. There’s no publicly available, stable and universal identifier for people arrested in Cook County.

The old scraper ran from June 5, 2014, until July 24, 2016. The new scraper has been running consistently since June 20, 2017. It is nearly feature-complete and robust, writing raw CSVs with all inmates found in the jail on a given day.

Wilberto will lead the effort to develop scripts and tools to take the daily CSVs and load them into a relational database.

We plan to analyze the data with tools such as Jupyter and R and use the data for reporting.

A manifest of daily snapshot files (for more information about those, read on) is available at https://s3.amazonaws.com/cookcountyjail.il.propublica.org/manifest.csv

How Our Scraper Works, a High-Level Overview

The scraper harvests the original inmate pages from the Cook County Jail website, mirrors those pages and processes them to create daily snapshots of the jail population. Each record in the daily snapshots data represents a single inmate on a single day.

The daily snapshots are anonymized. Names are stripped out, date of birth is converted to age at booking, and a one-way hash is generated from name, birth date and other personal details, so researchers can study recidivism. The snapshot data also contains the booking ID, booking date, race, gender, height, weight, housing location, charges, bail amount, next court date and next court location.
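As an illustration of that anonymization step, here’s roughly how such a one-way hash can be built; the exact fields and hashing details in our pipeline may differ.

import hashlib

def inmate_hash(name, birth_date, extra=""):
    # join personal details and hash them so the ID is stable but not reversible
    raw = "|".join([name.strip().upper(), birth_date, extra])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

print(inmate_hash("DOE, JOHN", "1980-01-01"))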

We don’t make the mirrored inmate pages public, to avoid misuse of personal data for things like predatory mugshot or background check websites.

How Our Scraper Works, the Nerdy Parts

The new scraper code is available on Github. It’s written in Python 3 and uses the Scrapy library for scraping.

Data Architecture

When we built our first version of the scraper in 2012, we could use the web interface to search for all inmates whose last name started with a given letter. Our code took advantage of this to collect the universe of inmates in the data management system, simply by running 26 searches and stashing the results.

Later, the Sheriff's Department tightened the possible search inputs and added a CAPTCHA. However, we were still able to access individual inmate pages via their Booking ID. This identifier follows a simple and predictable pattern: YYYY-MMDDXXX where XXX is a zero-padded number corresponding to the order that the inmate arrived that day. For example, an inmate with Booking ID “2017-0502016” would be the 16th inmate booked on May 2, 2017. When an inmate leaves the jail, the URL with that Booking ID starts returning a 500 HTTP status code.

The old scraper scanned the inmate locator and harvested URLs by checking all of the inmate URLs it already knew about and then incrementing the Booking ID until the server returned a 500 response. The new scraper works much the same way, though we’ve added some failsafes in case our scraper misses one or more days.
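A simplified sketch of that incrementing logic looks something like this; the URL is a placeholder, not the real jail locator, and the production version lives in the Scrapy spider with retries and failsafes.

import requests

def new_booking_ids(year, month, day, start_seq=1):
    """Yield booking IDs for one day until the server returns a 500."""
    seq = start_seq
    while True:
        booking_id = f"{year}-{month:02d}{day:02d}{seq:03d}"
        url = f"https://jail.example.com/inmate/{booking_id}"  # placeholder URL
        if requests.get(url).status_code == 500:
            break  # inmate has left, or no more bookings that day
        yield booking_id
        seq += 1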

The new scraper can also use older data to seed scrapes. This reduces the number of requests we need to send and gives us the ability to compare newer records to older ones, even if our data set has missing days.

Scraping With Scrapy

We’ve migrated from a hodgepodge of Python libraries and scripts to Scrapy. Scrapy’s architecture makes scraping remarkably fast, and it includes safeguards to avoid overwhelming the servers we’re scraping.

Most of the processing is handled by inmate_spider.py. Spiders are perhaps the most fundamental elements that Scrapy helps you create. A spider generates URLs for scraping, follows links and parses HTML into structured data.

Scrapy also has a way to create data models, which it calls “Items.” Items are roughly analogous to Django models, but I found Scrapy’s system underdeveloped and difficult to test. It was never clear to me if Items should be used to store raw data and to process data during serialization or if they were basically fancy dicts that I should put clean data into.

Instead, I used a pattern I learned from Norbert Winklareth, one of the collaborators on the original scraper. I wrote about the technique in detail for NPR last year. Essentially, you create an object class that takes a raw HTML string in its constructor. The data model object then exposes parsed and calculated fields suitable for storage.
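In skeletal form, the pattern looks something like this; the field names and XPath expressions are placeholders, not the real Inmate model.

from lxml import html

class Inmate:
    """Wraps a raw inmate page and exposes parsed fields as properties."""

    def __init__(self, raw_html):
        self._doc = html.fromstring(raw_html)

    @property
    def booking_id(self):
        # placeholder selector; the real page structure differs
        return self._doc.xpath('string(//span[@id="booking-id"])').strip()

    @property
    def bail_amount(self):
        return self._doc.xpath('string(//span[@id="bail-amount"])').strip()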

Despite several of its systems being a bit clumsy, Scrapy really shines due to its performance. Our original scraper worked sequentially and could harvest pages for the approximately 10,000 inmates under jail supervision in about six hours, though sometimes it took longer. Improvements we made to the scraper got this down to a couple hours. But in my tests, Scrapy was able to scrape 10,000 URLs in less than 30 minutes.

We follow the golden rule at ProPublica when we’re web scraping: “Do unto other people’s servers as you’d have them do unto yours.” Scrapy’s “autothrottle” system will back off if the server starts to lag, though we haven’t seen any effect so far on the server we’re scraping.

Scrapy’s speed gains are remarkable. It’s possible that these are due in part to increases in bandwidth, server capacity and in web caching at the Cook County Jail’s site, but in any event, it’s now possible to scrape the data multiple times every day for even higher accuracy.

Pytest

I also started using a test framework for this project that I hadn’t used before.

I’ve mostly used Nose, Unittest and occasionally Doctests for testing in Python. But people seem to like Pytest (including several of the original jail scraper developers) and the output is very nice, so I tried it this time around.

Pytest is pretty slick! You don’t have to write any boilerplate code, so it’s easy to start writing tests quickly. What I found particularly useful is the ability to parameterize tests over multiple inputs.

Take this abbreviated code sample:

testdata = (
    (get_inmate('2015-0904292'), {'bail_amount': '50,000.00'}),
    (get_inmate('2017-0608010'), {'bail_amount': '*NO BOND*'}),
    (get_inmate('2017-0611015'), {'bail_amount': '25,000'}),
)

@pytest.mark.parametrize("inmate,expected", testdata)
def test_bail_amount(inmate, expected):
    assert inmate.bail_amount == expected['bail_amount']

In the testdata variable assignment, the get_inmate function loads an Inmate model instance from sample data, and each instance is paired with expected values based on direct observation of the scraped pages. Then, by using the @pytest.mark.parametrize(...) decorator and passing it the testdata variable, the test function is run for all the defined values.

There might be a more effective way to do this with Pytest fixtures. Even so, this is a significant improvement over using metaclasses and other fancy Python techniques to parameterize tests as I did here. Those techniques yield practically unreadable test code, even if they do manage to provide good test coverage for real-world scenarios.

In the future, we hope to use the Moto library to mock out the complex S3 interactions used by the scraper.

How You Can Contribute

We welcome collaborators! Check out the contributing section of the project README for the latest information about contributing. You can check out the issue queue, fork the project, make your contributions and submit a pull request on Github!

And if you’re not a coder but you notice something in our approach to the data that we could be doing better, don’t be shy about submitting an issue.

Keep an Eye On Your State’s Congressional Delegation

If you’re a user of Represent, our congressional news app, or a developer who uses our Congress API, we’ve got some new features to tell you about.

On Represent, we’ve added new pages for every state’s delegation (here’s Arizona) and redesigned bill category pages, like legislation about environmental protection, to provide more useful information. You also can search the full text of bills by keyword or phrase.

That same full-text search is available in the API. We’ve also added more details to bill and member responses to the API.

Let’s say you’re a reporter in Kentucky covering health care. Your representatives have been at the center of the recent health care debate. Represent already makes it easy to see what lawmakers such as Mitch McConnell, Rand Paul or Andy Barr are individually saying about the effort to repeal and replace the Affordable Care Act, but we’re now making it easier to keep track of all of the members of your state’s delegation in one place.

You can see your state’s current congressional members and a stream of their activities from the past two weeks. The stream shows members’ statements, any activity on a piece of legislation they’ve sponsored, and articles written about them in local and national publications, courtesy of Google News. It is filterable by the member’s party and the type of activity.

To see what McConnell is saying about the Republican health care bill, what WLKY Louisville is publishing about John Yarmuth, or what’s happening with Paul’s latest legislation, you can just check out the Kentucky state delegation page. This makes it easier for local and state reporters to track congressional activity relevant to their audience, and for voters to keep tabs on their representatives.

It can be easy to miss important things that happen on Capitol Hill if they aren’t covered widely in the press. While the House’s effort to pass the American Health Care Act garnered headlines, Congress has also been hard at work voting on other health care-related bills. It was easy for somebody interested in health to miss the House recently passing the Protecting Access to Care Act, which was just as partisan a vote as the AHCA.

To make it easier to track all bills in specific topic areas, we’re launching an update to our bill category pages. The update makes it easier to see where important bills are in the passage process by separating bills that have been signed into law from those that have been recently voted on. We’ve updated the visualizations to more easily compare vote margins, and added more information about sponsors and cosponsors.

Take a look at the Economics and Public Finance page to see how budget proposal bills are faring, or the Government Operations and Politics page to find resolutions urging inquiry into President Trump’s records and finances.

We’ve also updated the recent bill actions feed to include filters for the party, state and chamber of Congress of the bill’s sponsor, making it easier for readers and reporters to track bills important to their interests.

If you want to dive deeper into health-related bills to focus on those mentioning “pre-existing conditions,” Represent has you covered, too. Although such conditions are an important part of health insurance proposals, pre-existing conditions don’t always make it into the titles or summaries of bills. Now you can search the full text of every bill since 2013 for anything you’re interested in — like “preexisting.”

You’ll find a variety of bills from the current session of Congress, from a Democratic resolution “Urging the President to faithfully carry out the Affordable Care Act” to a Republican’s “Guaranteed Health Coverage for Pre-Existing Conditions Act of 2017” (which would take effect “upon repeal of the Patient Protection and Affordable Care Act”), along with a variety of bills about flood insurance.

Or if it’s lunchtime and your tastes are more light-hearted, you can simply search “pizza” — and you’ll find the Common Sense Nutrition Disclosure Act of 2017, a bill that aims to change nutrition labeling requirements for “standard menu items” — like pizza — that “come in different flavors, varieties, or combinations, but which are listed as a single menu item.”

The same full-text bill search is now in the Congress API, too, along with a number of beefed-up responses for bills and members. The details about individual bills now contain the complete summary, if available, and we’ve added additional details about sponsors to bill list responses. Our subject responses now include values indicating whether a given subject has bills or statements associated with it. Finally, we’ve replaced empty strings with null values where appropriate. You can see a more detailed changelog here.
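A quick sketch of the full-text search from the API side, assuming the bills search endpoint takes a “query” parameter as described in the documentation:

import requests

API_KEY = "PROPUBLICA CONGRESS API KEY"  # replace with your own key
url = "https://api.propublica.org/congress/v1/bills/search.json"
resp = requests.get(url, params={"query": "preexisting"}, headers={"X-API-Key": API_KEY})
results = resp.json().get("results", [])
for bill in (results[0].get("bills", []) if results else []):
    print(bill.get("number"), bill.get("title"))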

A reminder to users of the Sunlight Congress API that it will be shut down after Aug. 31, 2017. We’re encouraging Sunlight users to switch to the ProPublica Congress API. Bill text search was one of the last major features to add to the ProPublica Congress API, and once we finish API responses for amendments and upcoming bills, the process of merging features from the Sunlight API into our own will be complete. We’ll have more updates to the ProPublica Congress API in the next month.


Nonprofit Explorer Update: Full Text of 1.9 Million Records

We have updated our Nonprofit Explorer news application, adding raw data from more than 1.9 million electronically filed Form 990 documents dating back to 2010. This new trove includes the full text of more than 132,000 forms for which we did not previously have complete data.

In addition to making the machine-readable XML files available to download, we are publishing the full text of many of these documents as human-readable web pages. These appear similar to the PDFs that have appeared on Nonprofit Explorer in the past, but their text can be copied and pasted, and they are easier to browse and analyze.

You can find the XML and HTML of e-filed returns by clicking the buttons labeled “Full Text” and “Raw XML,” which appear on a nonprofit organization’s page under each year for which the data is available.

The release of the XML documents was made possible thanks to a 2015 lawsuit brought by Public.Resource.Org, a nonprofit organization that makes government documents available to the public. The suit compelled the IRS to fulfill Freedom of Information Act requests for electronically filed Form 990 documents in “Modernized e-File” XML format. The IRS started sharing the XML versions of e-filed forms as a public dataset starting in 2016.

For several years, Public.Resource.Org and its founder, Carl Malamud, have helped ProPublica acquire the page-image versions of Form 990 documents from the IRS. These documents make up the bulk of Nonprofit Explorer.

Malamud sees the release of XML data as a huge improvement.

“XML data is machine-processable,” Malamud wrote in an email to ProPublica. “You can instantly access the value of any specific field in a Form 990 (such as CEO compensation) from a computer program.”

To explain the comparative advantage of XML over a page image, Malamud offered an analogy. The raw XML data is like a spreadsheet, from which you can extract data easily. As for a page image, it’s as if “you make a printout of the spreadsheet, take a picture on your cellphone of the printout, and post the picture on Instagram.”

“Releasing the e-file data instead is vastly superior and will make the Form 990 a much more useful tool.”
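To make the spreadsheet analogy concrete, here’s a toy sketch of pulling a single value out of an e-filed return with Python’s standard library; the element name is illustrative, since the actual schema varies by form version and year.

import xml.etree.ElementTree as ET

ns = {"efile": "http://www.irs.gov/efile"}      # namespace used by IRS e-file XML
tree = ET.parse("example_990.xml")              # a raw XML return downloaded from Nonprofit Explorer
element = tree.find(".//efile:TotalRevenueAmt", ns)  # illustrative field name
if element is not None:
    print("Total revenue:", element.text)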

While the XML files provide the most complete and useful data possible for e-filed Form 990 documents, they’re formatted for computer programs to understand, not humans. So the IRS provides stylesheets that a programmer can use to make the documents look more like the paper forms that make up a Form 990 tax return. We adapted open-source code based on those IRS stylesheets to make cosmetic transformations for Form 990 documents from 2013 and later.

Most nonprofits file their tax documents electronically. However, there are still thousands of nonprofit organizations that file them on paper. We will continue to provide PDF versions of these documents in order to make sure we’re providing information for as many nonprofit organizations as possible.

Our work on the XML-based data is just beginning. In the coming months, we will continue to improve Nonprofit Explorer and the Nonprofit Explorer API, providing users with new ways to explore and analyze tax-exempt organizations.

Bulk Downloads of Congressional Data Now Available

Using the ProPublica Congress API, developers can access details on each of the thousands of bills introduced in every two-year session. But they used to have to download those details one bill at a time, and be able to write API calls in software code. Now you can download information on all of the bills introduced in each session in a single file, thanks to the bulk bill data set we’re announcing today.

You can get this data for free starting right now from the ProPublica Data Store. A data dictionary and an example file are available here.

Twice a day, we generate a single zip file containing metadata for every bill introduced in the current congress, including who sponsors and cosponsors the bill, actions taken by committees, votes on the floor and a summary of what the bill would do. So every time you download the bulk bill data from the 115th Congress, you’ll have the complete, up-to-date data set.
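Here’s a sketch of how you might work with one of the bulk files once you’ve downloaded it; the filename is hypothetical and the field names follow the @unitedstates bill format, so check the data dictionary for the exact layout.

import json
import zipfile

with zipfile.ZipFile("bills-115.zip") as archive:   # hypothetical local filename
    for name in archive.namelist():
        if not name.endswith(".json"):
            continue
        bill = json.loads(archive.read(name))
        print(bill.get("bill_id"), bill.get("official_title"))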

You can also download archives of bill data for past congresses, going back to 1973 — when current House Speaker Paul Ryan was only 3 years old and one of the bills debated was the now-familiar ERISA, a law governing employer-sponsored retirement plans.

To produce the files we’re using the same codebase that powers our Congress API and the work of sites like GovTrack.us: the @unitedstates project. That open-source effort is a significant part of our congressional data and relies on volunteers. The bulk bill downloads replace the files previously available on that site and on the Sunlight Foundation’s site.

Bulk downloads are useful for developers and journalists who need the entire set of legislation but want to avoid gathering it one bill at a time. Bill files are available in JSON and XML formats.

What can you do with this data? Anything you want, but we hope it’ll be useful to researchers, journalists and any other citizen trying to better understand our country’s legislature. You might explore how Congress’ focus on various issues has shifted, the roles of committees in passing — or delaying — legislation or, on the lighter side, the rise of sometimes-implausible acronyms in bill names.

A note for users of our Congress API interested in using the bulk download data: The bulk files don’t match the API bill endpoint exactly but contain a subset of the fields available. A data dictionary of the fields and an example are available here.

Finally, a reminder for users of the Sunlight Congress API: We will shut down that service Sept. 30, 2017. After that date, the API will no longer respond to requests. We encourage users to migrate to the ProPublica Congress API.

Get an Inside Look at the Department of Defense’s Struggle to Fix Pollution at More Than 39,000 Sites

For much of the past two years I’ve been digging into a vast, $70 billion environmental cleanup program run by the U.S. Department of Defense that tracks tens of thousands of polluted sites across the United States. In some places, old missiles and munitions were left buried beneath school grounds. In others, former test sites for chemical weapons have been repurposed for day care centers and housing developments. The oldest, dating to World War I, have faded into history, making it difficult to keep track of the pollution that was left behind.

(You can purchase a cleaned, condensed version of Bombs in Your Backyard from the ProPublica Data Store. Or download the entire database for free.)

For nearly 45 years, the Pentagon kept its program — the Defense Environmental Restoration Program — out of the spotlight, and most of these sites have never been scrutinized by the public. However, the agency has meticulously tracked its own efforts, recording them in a detailed internal database. We were the first to see it, ever. Now we’re sharing it with you.

The dataset includes details on more than 39,000 unique sites across more than 5,000 present and former military locations in every U.S. state and territory. The sites are literally in almost everyone’s backyard. And so while a few of the most notorious cleanup spots may be well-known, the tools included here can help local news outlets, the public or anyone else dive deep into the details of hidden threats that have never before seen light.

The detail is extraordinary: Contaminants — and sometimes their concentrations in both soil and water — are listed. So is the amount of money spent over decades to deal with the problems, and the budget estimated into the future to finish it. You can find managers responsible for incremental decision-making, or plot the coordinates of specific sites used for dumping or chemical spills. There are even data fields filled with comments — the wisecracks of Pentagon staffers over the years characterizing the enormity or seriousness of their tasks. And much, much more.

We used some of the data from the government’s database to plot these sites on a map of the United States, and to drill into each site with details on the contamination to be found there, including adding additional location and cost data from other sources. You may want to use the data in other ways: perhaps to focus on single sites in much more detail, charting cleanup progress and failures, funding endeavors and political turning points. For that, you can download the entire database, just as we received it from the Department of Defense.

And of course, I’d love to hear about it if you use the data for a project of your own.

New in the Congress API: Lobbying Registrations and More

Lobbying is a daily event in Washington. It’s a complex stream of activity involving lawmakers, interest groups and individuals who want to influence federal policy. Today we’re releasing a way for developers to tap into that stream to build software applications.

We’re adding new responses to the ProPublica Congress API that allow developers to programmatically access the lobbying data we started publishing last year, including the individuals and organizations who are registered to lobby the federal government, who they’re seeking to influence, and how much they’re being paid.

The data comes from the Clerk of the House of Representatives, which, along with the Secretary of the Senate, collects forms filed by lobbyists. Although both chambers have filings, the data from the House covers both Congress and the executive branch.

The House data is contained in XML files that we load daily into our database, and while we don’t offer every search variation that the House clerk’s site does, the API does give you three different ways to view filings, which we document here.

First, developers can access the most recent filings, 20 at a time, in reverse chronological order. Second, developers can search lobbying filings by keyword or phrase, using the names of lobbyists, clients and issues. For example, you can search for “Facebook” and see lobbying filings where the social media giant is the client and where it is mentioned in the description of the lobbying work. An example of the latter is a 2018 filing from D&D Strategies that references a meeting with Senate committee staff before Facebook CEO Mark Zuckerberg appeared in front of Congress. Finally, developers can see an individual lobbying filing based on its unique ID in the data.
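As a rough sketch, a keyword search looks something like the request below; the endpoint path and “query” parameter are assumptions drawn from the documentation, which has the exact details.

import requests

API_KEY = "PROPUBLICA CONGRESS API KEY"  # replace with your own key
url = "https://api.propublica.org/congress/v1/lobbying/search.json"  # assumed path
resp = requests.get(url, params={"query": "Facebook"}, headers={"X-API-Key": API_KEY})
results = resp.json().get("results", [])
print(results[0] if results else "no filings found")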

We’ve also made some other changes to the API, including adding a response that returns congressional press releases that mention a specific bill. When we add a new press release to our database, our software scans the text to find any references to bill numbers. Since some statements only refer to bill titles, we don’t catch every mention, but many bills, like a recent one on prison reform, have statements associated with them.

You can check out the full list of changes and fixes in the API’s changelog.

To start using the Congress API, sign up for an API key and we’ll get you on your way. If you have any questions or issues with the API, you can raise them on GitHub or by emailing apihelp@propublica.org.

How ProPublica Illinois Uses GNU Make to Load 1.4GB of Data Every Day

I avoided using GNU Make in my data journalism work for a long time, partly because the documentation was so obtuse that I couldn’t see how Make, one of many extract-transform-load (ETL) processes, could help my day-to-day data reporting. But this year, to build The Money Game, I needed to load 1.4GB of Illinois political contribution and spending data every day, and the ETL process was taking hours, so I gave Make another chance.

Now the same process takes less than 30 minutes.

Here’s how it all works, but if you want to skip directly to the code, we’ve open-sourced it here.

Fundamentally, Make lets you say:

  • File X depends on a transformation applied to file Y
  • If file X doesn’t exist, apply that transformation to file Y and make file X

This “start with file Y to get file X” pattern is a daily reality of data journalism, and using Make to load political contribution and spending data was a great use case. The data is fairly large, accessed via a slow FTP server, has a quirky format, has just enough integrity issues to keep things interesting, and needs to be compatible with a legacy codebase. To tackle it, I needed to start from the beginning.

Overview

The financial disclosure data we’re using is from the Illinois State Board of Elections, but the Illinois Sunshine project had released open source code (no longer available) to handle the ETL process and fundraising calculations. Using their code, the ETL process took about two hours to run on robust hardware and over five hours on our servers, where it would sometimes fail for reasons I never quite understood. I needed it to work better and work faster.

The process looks like this:

  • Download data files via FTP from Illinois State Board Of Elections.
  • Clean the data using Python to resolve integrity issues and create clean versions of the data files.
  • Load the clean data into PostgreSQL using its highly efficient but finicky “\copy” command.
  • Transform the data in the database to clean up column names and provide more immediately useful forms of the data using “raw” and “public” PostgreSQL schemas and materialized views (essentially persistently cached versions of standard SQL views).

The cleaning step must happen before any data is loaded into the database, so we can take advantage of PostgreSQL’s efficient import tools. If a single row has a string in a column where it’s expecting an integer, the whole operation fails.
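As a rough illustration of that pre-import cleaning step (a simplified sketch, not the project’s actual processor, and the column name is hypothetical), the idea is to stream rows one at a time and drop or fix anything the import would choke on:

import csv
import sys

def clean_rows(infile, outfile, integer_columns=("committee_id",)):
    """Stream rows from a tab-delimited source file, coerce integer columns,
    and skip rows that would make the PostgreSQL import reject the whole file.
    (Simplified sketch; "committee_id" is a hypothetical column name.)"""
    reader = csv.DictReader(infile, delimiter="\t")
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        try:
            for column in integer_columns:
                row[column] = int(row[column]) if row[column].strip() else ""
        except ValueError:
            continue  # a stray string in an integer column: drop the row
        writer.writerow(row)

if __name__ == "__main__":
    clean_rows(sys.stdin, sys.stdout)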

GNU Make is well-suited to this task. Make’s model is built around describing the output files your ETL process should produce and the operations required to go from a set of original source files to a set of output files.

As with any ETL process, the goal is to preserve your original data, keep operations atomic and provide a simple and repeatable process that can be run over and over.

Let’s examine a few of the steps.

Download and Pre-import Cleaning

Take a look at this snippet, which could be a standalone Makefile:

data/download/%.txt :
    aria2c -x5 -q -d data/download --ftp-user="$(ILCAMPAIGNCASH_FTP_USER)" --ftp-passwd="$(ILCAMPAIGNCASH_FTP_PASSWD)" ftp://ftp.elections.il.gov/CampDisclDataFiles/$*.txt

data/processed/%.csv : data/download/%.txt
    python processors/clean_isboe_tsv.py $< $* > $@

This snippet first downloads a file via FTP and then uses Python to process it. For example, if “Expenditures.txt” is one of my source data files, I can run make data/processed/Expenditures.csv to download and process the expenditure data.

There are two things to note here.

The first is that we use Aria2 to handle FTP duties. Earlier versions of the script used other FTP clients that were either slow as molasses or painful to use. After some trial and error, I found Aria2 did the job better than lftp (which is fast but fussy) or good old ftp (which is both slow and fussy). I also found some incantations that took download times from roughly an hour to less than 20 minutes.

Second, the cleaning step is crucial for this dataset. It uses a simple class-based Python validation scheme you can see here. The important thing to note is that while Python is pretty slow generally, Python 3 is fast enough for this. And as long as you are only processing row-by-row without any objects accumulating in memory or doing any extra disk writes, performance is fine, even on low-resource machines like the servers in ProPublica’s cluster, and there aren’t any unexpected quirks.

Loading

Make is built around file inputs and outputs. But what happens if our data is both in files and database tables? Here are a few valuable tricks I learned for integrating database tables into Makefiles:

One SQL file per table / transform: Make loves both files and simple mappings, so I created individual files with the schema definitions for each table or any other atomic table-level operation. The table names match the SQL filenames, the SQL filenames match the source data filenames. You can see them here.

Use exit code magic to make tables look like files to Make: Hannah Cushman and Forrest Gregg from DataMade introduced me to this trick on Twitter. Make can be fooled into treating tables like files if you prefix table level commands with commands that emit appropriate exit codes. If a table exists, emit a successful code. If it doesn’t, emit an error.

Beyond that, loading consists solely of the highly efficient PostgreSQL \copy command. While the COPY command is even more efficient, it doesn’t play nicely with Amazon RDS. Even if ProPublica moved to a different database provider, I’d continue to use \copy for portability unless eking out a little more performance was mission-critical.
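If you are scripting the load from Python rather than psql, the client-side equivalent is COPY ... FROM STDIN, which streams the file from your machine the same way \copy does and therefore also plays nicely with RDS. A minimal sketch follows; the connection string and table name are placeholders, and this isn’t the loader we actually run:

import psycopg2

# Client-side bulk load, equivalent to psql's \copy (illustrative sketch only;
# the connection string and table name are placeholders).
conn = psycopg2.connect("dbname=ilcampaigncash")
with conn, conn.cursor() as cur, open("data/processed/Expenditures.csv") as f:
    cur.copy_expert(
        "COPY raw.expenditures FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )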

There’s one last curveball: The loading step imports data to a PostgreSQL schema called raw so that we can cleanly transform the data further. Postgres schemas provide a useful way of segmenting data within a single database — instead of a single namespace with tables like raw_contributions and clean_contributions, you can keep things simple and clear with an almost folder-like structure of raw.contributions and public.contributions.

Post-import Transformations

The Illinois Sunshine code also renames columns and slightly reshapes the data for usability and performance reasons. Column aliasing is useful for end users and the intermediate tables are required for compatibility with the legacy code.

In this case, the loader imports into a schema called raw that is as close to the source data as humanly possible.

The data is then transformed by creating materialized views of the raw tables that rename columns and handle some light post-processing. This is enough for our purposes, but more elaborate transformations could be applied without sacrificing clarity or obscuring the source data. Here’s a snippet of one of these view definitions:

CREATE MATERIALIZED VIEW d2_reports AS
  SELECT
    id as id,
    committeeid as committee_id,
    fileddocid as filed_doc_id,
    begfundsavail as beginning_funds_avail,
    indivcontribi as individual_itemized_contrib,
    indivcontribni as individual_non_itemized_contrib,
    xferini as transfer_in_itemized,
    xferinni as transfer_in_non_itemized,
    -- ...
  FROM raw.d2totals
  WITH DATA;

These transformations are very simple, but simply using more readable column names is a big improvement for end-users.

As with table schema definitions, there is a file for each table that describes the transformed view. We use materialized views, which, again, are essentially persistently cached versions of standard SQL views, because storage is cheap and they are faster than traditional SQL views.

A Note About Security

You’ll notice we use environment variables that are expanded inline when the commands are run. That’s useful for debugging and helps with portability. But it’s not a good idea if you think log files or terminal output could be compromised or people who shouldn’t know these secrets have access to logs or shared systems. For more security, you could use a system like the PostgreSQL pgconf file and remove the environment variable references.

Makefiles for the Win

My only prior experience with Make was in a computational math course 15 years ago, where it was a frustrating and poorly explained footnote. The combination of obtuse documentation, my bad experience in school and an already reliable framework kept me away. Plus, my shell scripts and Python Fabric/Invoke code were doing a fine job building reliable data processing pipelines based on the same principles for the smaller, quick turnaround projects I was doing.

But after trying Make for this project, I was more than impressed with the results. It’s concise and expressive. It enforces atomic operations, but rewards them with dead simple ways to handle partial builds, which is a big deal during development when you really don’t want to be repeating expensive operations to test individual components. Combined with PostgreSQL’s speedy import tools, schemas, and materialized views, I was able to load the data in a fraction of the time. And just as important, the performance of the new process is less sensitive to varying system resources.

If you’re itching to get started with Make, here are a few additional resources:

In the end, the best build/processing system is any system that never alters source data, clearly shows transformations, uses version control and can be easily run over and over. Grunt, Gulp, Rake, Make, Invoke … you have options. As long as you like what you use and use it religiously, your work will benefit.


Download Chicago’s Parking Ticket Data Yourself


ProPublica Illinois has been reporting all year on how ticketing in Chicago is pushing tens of thousands of drivers into debt and hitting black and low-income motorists the hardest. Last month, as part of a collaboration with WBEZ, we reported on how a city decision to raise the cost of citations for not having a required vehicle sticker has led to more debt — and not much more revenue.

We were able to tell these stories, in part, because we obtained the city of Chicago’s internal database for tracking parking and vehicle compliance tickets through a Freedom of Information request jointly filed by both news organizations. The records start in 2007, and they show you details on when and where police officers, parking enforcement aides, private contractors and others have issued millions of tickets for everything from overstaying parking meters to broken headlights. The database contains nearly 28.3 million tickets. Altogether, Chicago drivers still owe a collective $1 billion for these tickets, including late penalties and collections fees.

Now you can download the data yourself; we’ve even made it easier to import. We’ve anonymized the license plates to protect the privacy of drivers. As we get more records, we’ll update the data.

We’ve found a number of stories hidden in this data, including the one about city sticker tickets, but we’re confident there are more. If you see something interesting, email us. Or if you use the data for a project of your own — journalistic or otherwise — tell us. We’d love to know.


Shedding Some Light on Dark Money Political Donors


On Wednesday we added details to our FEC Itemizer database on nearly $763 million in contributions to the political nonprofit organizations — also known as 501(c)(4) groups — that have spent the most money on federal elections during the past eight years. The data is courtesy of Issue One, a nonpartisan, nonprofit advocacy organization that is dedicated to political reform and government ethics.

These contributions often are called “dark money” because political nonprofits are not required to disclose their donors and can spend money supporting or opposing political candidates. By using government records and other publicly available sources, Issue One has compiled the most comprehensive accounting of such contributions to date.

To compile the data, Issue One identified the 15 political nonprofits that reported spending the most money in federal elections since the Supreme Court decision in Citizens United v. FEC in early 2010. It then found contributions using corporate filings, nonprofit reports and documents from the Internal Revenue Service, Department of Labor and Federal Election Commission. One of the top-spending political nonprofits, the National Association of Realtors, is almost entirely funded by its membership and has no records in this data.

For each contribution, you can see the source document detailing the transaction in FEC Itemizer.

The recipients are a who’s who of national political groups: Americans for Prosperity, the National Rifle Association Institute for Legislative Action, the U.S. Chamber of Commerce and Planned Parenthood Action Fund Inc. account for more than half of the $763 million in contributions in the data. There’s also American Encore, formerly the Center to Protect Patient Rights, one of the main conduits for the conservative financial network created by Charles and David Koch.

The largest donor is the Freedom Partners Chamber of Commerce, a Koch-organized business association that has contributed at least $181 million to the leading political nonprofits. Other donors include the Susan Thompson Buffett Foundation, which has given at least $25 million to the Planned Parenthood Action Fund, and major labor unions like the American Federation of State, County and Municipal Employees, or AFSCME, which has given at least $2.8 million to Democratic political nonprofit organizations.

Also among the donors are major corporations like Dow Chemical (mostly giving to the U.S. Chamber of Commerce), gun manufacturers (to the NRA), 501(c)(3) charities and individuals.

You can read Issue One’s report on its work as well as its methodology for discovering the contribution records. Because many of the sources are documents that are filed annually, this data won’t be updated the same way that FEC Itemizer is for campaign finance filings, but it represents the most comprehensive collection of dark money contributions to date.



The Election DataBot: Now Even Easier


We launched the Election DataBot in 2016 with the idea that it would help reporters, researchers and concerned citizens more easily find and tell some of the thousand stories in every political campaign. Now we’re making it even easier.

Just as before, the DataBot is a continuously updating feed of campaign data, including campaign finance filings, changes in race ratings and deleted tweets. You can watch the data come in in real time or sign up to be notified by email when there’s new data about races you care about.

DataBot’s new homepage dashboard of campaign activity now includes easy-to-understand summaries so that users can quickly see where races are heating up. We’ve added a nationwide map that shows you where a variety of campaign activity is occurring every week.

For example, the map shows that both leading candidates in Iowa’s 1st District saw spikes in Google searches in the week ending on Sept. 16 (we track data from Monday to Sunday). The Cook Political Report, which rates House and Senate races, changed its rating of that race from “Tossup” to “Lean Democratic” on Sept. 6.

When super PACs spend a lot of money in a House or Senate race, you’ll see it on the map. When Google search traffic spikes for a candidate, that’ll show up, too. We’re also tracking statements by incumbent members of Congress and news stories indexed by Google News. So when you get an email alerting you to new activity (you did sign up for alerts, right?), you can see at a glance the level of activity in the race.

The new homepage also allows you to look back in time to see how campaign activity has changed during the past 15 weeks, and whether what you’re seeing this week is really different from what came before. We’ve also added a way to focus on the races rated the most competitive by the Cook Political Report.

In order to highlight the most important activity, we weighted activity by type. Independent expenditures — where party committees and outside interest groups are choosing to spend their money — count twice as much as other types of activity.
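As a rough sketch of what that weighting amounts to (only the double weight for independent expenditures comes from the description above; the activity types and other weights are illustrative assumptions):

# Illustrative sketch of weighted activity scoring; the type names and most
# weights are assumptions, not the DataBot's actual configuration.
WEIGHTS = {
    "independent_expenditure": 2,
    "campaign_finance_filing": 1,
    "race_rating_change": 1,
    "google_search_spike": 1,
    "candidate_statement": 1,
    "news_story": 1,
}

def activity_score(events):
    """Sum weighted activity for one race in one week; `events` is a list of type strings."""
    return sum(WEIGHTS.get(event, 1) for event in events)

print(activity_score(["independent_expenditure", "campaign_finance_filing", "news_story"]))  # 4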

Instead of state-level presidential election forecasts, we now are tracking changes to FiveThirtyEight’s “classic” forecast for each House and Senate contest. We’ve also added candidate statements for more than 500 campaigns whose websites produce a feed of their content.

The homepage map is just the first step in a more useful experience for DataBot users. We’ll be adding other layers of summary data, including details on social media activity, to the homepage, and additional ways to see how races have changed based on the activity feeds.

We’ll also be working to make the individual firehose item descriptions more useful; for example, saying whether a campaign finance filing has the most money raised or spent for that candidate compared with other reports.

We’d love to hear from you about ways to make Election DataBot more useful as Nov. 6 approaches.


New Partnership Will Help Us Hold Facebook and Campaigns Accountable


We launched a new collaboration on Monday that will make it even easier to be part of our Facebook Political Ad Collector project.

In case you don’t know, the Political Ad Collector is a project to gather targeted political advertising on Facebook through a browser extension installed by thousands of users across the country. Those users, whose data is gathered completely anonymously, help us build a database of micro-targeted political ads that help us hold Facebook and campaigns accountable.

On Monday, Mozilla, maker of the Firefox web browser, is launching the Firefox Election Bundle, a special election-oriented version of the browser. It comes pre-installed with ProPublica’s Facebook Political Ad Collector and with an extension Mozilla created called Facebook Container.

The Facebook Container, according to Mozilla, helps users control what data Facebook collects about their browsing habits when they visit sites other than Facebook.

People who choose to download the Firefox Election Bundle will automatically begin participating in the Facebook Political Ad Collector project and will also benefit from the extra privacy controls that come with the Facebook Container project. The regular version of Firefox is, of course, still available.

Think of it as turning the tables. Instead of Facebook watching you, you can maintain control over what Facebook can see while helping keep an eye on Facebook’s ads.

You can download the Firefox Election Bundle here.

If you use Firefox and already have the Facebook Political Ad Collector installed, you can install Mozilla’s Facebook Container add-on here.

If you want to find out more about the Facebook Political Ad Collector project, you can read this story or browse the ads we’ve already collected.


Chasing Leads and Herding Cats: Shaping a New Role in the Newsroom


In this ever-changing industry, new roles are emerging that redefine how we do journalism: audience engagement director, social newsgathering reporter, Snapchat video producer. At ProPublica, I’ve been part of developing a new role for our newsroom. My title is partner manager, and I lead a large-scale collaboration: Documenting Hate, an investigative project to track and report on hate crimes and bias incidents.

ProPublica regularly collects large amounts of information that we can’t process by ourselves, including documents gathered in our reporting, tips solicited by our engagement journalists, and data published in our news applications.

Since the beginning, we’ve seen collaboration as a key way to make sure that all of this reporting material can be used to fulfill our mission: to make an impact in the real world. Collaboration has been a fundamental part of ProPublica’s journalism model. We make our stories available to republish for free through Creative Commons and usually co-publish or co-report stories with other news outlets. When it comes to large data sets, we often offer up our findings to journalists or the public to enable new reporting. It’s a way of spreading the wealth, so to speak. Collaborations are typically a core responsibility of each editor in the newsroom, but some of our projects have large-scale collaborations at their center, and they require dedicated and sustained attention.

My role emerged after Electionland 2016, one of the largest-ever journalism collaborations, which many ProPublica staff members pitched in to organize. While the project was a journalistic success, its editors learned a key lesson about the need for somebody to own the relationship with partner newsrooms. In short, we came to think that the collaboration itself was something that needed editing, including recruiting partners, making sure they saw the reporting tips they needed to see, and tracking what partners were publishing. It also reinforced the need for a more strategic tip-sharing approach after the success of large engagement projects, like Lost Mothers and Agent Orange, which garnered thousands of leads — and more stories than we had time to tell.

That’s how my role was born. Soon after the 2016 election, ProPublica launched Documenting Hate. Hiring a partner manager was the first priority. We also hired a partner manager to work on Electionland 2018, which will cover this year’s midterm elections.

Our newsroom isn’t alone in dedicating resources to this type of role. Other investigative organizations, such as Reveal from the Center for Investigative Reporting and the International Consortium of Investigative Journalists, staffed up to support their collaborations. Heather Bryant — who founded Project Facet, which helps newsrooms work together — told me there are at least 10 others who manage long-term collaborations at newsrooms across the country, from Alaska, to Texas, to Pennsylvania.

What I Do

My job is a hybrid of roles: reporter, editor, researcher, social media producer, recruiter, trainer and project manager.

I recruited our coalition of newsrooms, and I vet and onboard partners. To date, we have more than 150 national and local newsrooms signed on to the project, plus nearly 20 college newspapers. I speak to a contact at each newsroom before they join, and then I provide them with the materials they need to work on the project. I’ve written training materials and conduct online training sessions so new partners can get started more quickly.

The core of this project is a shared database of tips about hate incidents that we source from the public. For large collaborations like Documenting Hate and Electionland, our developer Ken Schwencke builds these private central repositories, which are connected directly to our tip submission form. We use Screendoor, a form-building service, to host the tip form.

In large-scale collaborations, we invite media organizations to be part of the newsgathering process. For Documenting Hate, we ask partners to embed this tip submission form to help us gather story leads. That way, we can harness the power of different audiences around the country, from Los Angeles Times readers, to Minnesota Public Radio listeners, to Univision viewers. At ProPublica, we try to talk about the project as much as we can in the media and at conferences to spread the word to both potential tipsters and partners.

The tips we gather are available to participating journalists — helping them to do their job and produce stories they might otherwise not have found. ProPublica and our partners have reported more than 160 stories, including pieces about hate in schools, on public transportation and on the road, in the workplace, and at places of worship, and incidents involving the president’s name and policies, to name just a few. Plus, each authenticated tip acts as a stepping stone for other partners to build on their reporting.

At ProPublica, we’ve been gathering lots of public records from police on hate crimes to do our own reporting and sharing those records with partners, too. Any time we produce an investigation in-house, I share the information we have available so reporters can republish or localize the story.

As partner manager, I’m a human resource to share knowledge. I’ve built expertise in the hate beat and serve as a kind of research desk for our network, pointing reporters to sources and experts. I host a webinar or training once a month to help reporters understand the project or to build this beat, and I send out a weekly internal newsletter.

Another part of my job is being an air-traffic controller, sending out incoming tips to reporters who might be interested and making sure that multiple people aren’t working on the same tip at the same time. This is especially important in a project like ours; given the sensitivity of the subject, we don’t want to scare off tipsters by having multiple reporters reach out at once. I pitch story ideas based on patterns I’ve identified to journalists who might want to dig further. I’m constantly researching leads to share with our network and with specific journalists working on relevant stories.

And I’m also a signal booster: When partners publish reporting on hate, we share their work on our social channels to make sure these stories get as big an audience as possible. We keep track of all of the stories that were reported with sourcing from the project to make them available in one place.

The Challenges

While the Documenting Hate project has produced some incredible work, this is not an easy job.

Many journalists are eager to work with ProPublica, but not always with each other; it can be a process to get buy-in from editors to collaborate with a network of newsrooms, especially at large ones where there are layers of hierarchy. Some reporters agree to join but don’t make it all the way through onboarding, which involves several steps that may require help from others in their newsrooms. Some explore the database and don’t see anything they want to follow up on right away, and then lose interest. And occasionally journalists are so overwhelmed with their day-to-day work that I rarely hear back from them after they’ve joined.

Turnover and layoffs, which are depressingly common in our industry, mean having to find and onboard new contacts in partner newsrooms, or relying on bounce-back emails to figure out who’s left. It also means that sometimes engaged reporters move into positions at new companies where they don’t cover hate, leaving a gap in their old newsrooms. A relentless news cycle doesn’t help, either. For example, after the 2017 violence in Charlottesville, Virginia, caused a renewed surge in interest in the hate beat, a series of deadly hurricanes hit, drawing a number of reporters onto the natural disaster beat for a time.

And because of the sensitivity of the incidents, tipsters sometimes refuse to talk after they’ve written in, which can be discouraging for reporters. Getting a story may mean following up on a dozen tips rather than just one or two. Luckily, since we’ve received thousands of tips and hundreds of records, active participants in our coalition have found plenty of material to work on.

The Future of Partnerships

While collaborations aren’t always easy, I believe projects like Documenting Hate are likely to be an important part of the future of journalism. Pooling resources and dividing and conquering on reporting can help save time and money, which are in increasingly short supply.

Some partnerships are the fruit of necessity, linking small newsrooms in the same region or state, like Coast Alaska, or creating stronger ties between affiliates within a large network, like NPR. I think there’s huge potential for more local collaborations, especially with shrinking budgets and personnel. Other partnerships emerge out of opportunity, like the Panama Papers investigation, which was made possible by a massive document leak. If more newsrooms resisted the urge for exclusivity — a concept that matters far more to journalists than to the public — more partnerships could be built around data troves and leaks.

Another area of potential is to band together to request and share public records or to pool funding for more expensive requests; these costs can prevent smaller newsrooms from doing larger investigations. I also think there’s a ton of opportunity to collaborate on specific topics and beats to share knowledge, best practices and reporting.

With new partnerships comes the need for someone at the helm, navigating the ship. While many newsrooms’ finances are shrinking, any collaborative project can have a coordinator role baked into the budget. An ideal collaborations manager is a journalist who understands the day-to-day challenges of newsrooms, is fanatical about project management, is capable of sourcing and shaping stories, and can track the reach and impact of work that’s produced.

We all benefit when we work together — helping us reach wider audiences, do deeper reporting and better serve the public with our journalism.


Want to Start a Collaborative Journalism Project? We’re Building Tools to Help.


Today we’re announcing new tools, documentation and training to help news organizations collaborate on data journalism projects.

Newsrooms, long known for being cutthroat competitors, have been increasingly open to the idea of working with one another, especially on complex investigative stories. But even as interest in collaboration grows, many journalists don’t know where to begin or how to run a sane, productive partnership. And there aren’t many good tools available to help them work together. That’s where our project comes in.

We’ll be sharing some of the software we built, and the lessons we learned, while creating our Documenting Hate project, which tracks hate crimes and bias-motivated harassment in the U.S.

The idea to launch Documenting Hate came shortly after Election Day 2016, in response to a widely reported uptick in hate incidents. Because data collection on hate crimes and incidents is so inadequate, we decided to ask people across the country to tell us their stories about experiencing or witnessing them. Thousands of people responded. To cover as many of their stories as we could, we organized a collaborative effort with local and national newsrooms, which eventually included more than 160 of them.

We’ll be building out and open-sourcing the tools we created to do Documenting Hate, as well as our Electionland project, and writing a detailed how-to guide that will let any newsroom do crowd-powered data investigations on any topic.

Even newsrooms without dedicated developers will be able to launch a basic shared investigation, including gathering tips from the public through a web-based form and funneling those tips into a central database that journalists can use to find stories and sources. Newsrooms with developers will be able to extend the tools to enable collaboration around any data sets.

We’ll also provide virtual trainings about how to use the tools and how to plan and launch crowd-powered projects around shared data sets.

This work will be a partnership with the Google News Initiative, which is providing financial support.

Launched in January 2017, ProPublica’s Documenting Hate project is a collaborative investigation of hate crimes and bias incidents in the United States. The Documenting Hate coalition is made up of more than 160 newsrooms and several journalism schools that collect tips from the public and records from police to report on hate. Together we’ve produced close to 200 stories. That work will continue in 2019.

We’re already hard at work writing a how-to guide on collaborative, crowd-powered data projects. We’ll be talking about it at the 2019 NICAR conference in Newport Beach, California, in March. We are also hiring a contract developer to work on this; read the job description and apply here.

The first release of the complete tools and playbook will be available this summer, and online trainings will take place in the second half of the year.

There are a thousand different ways to collaborate around shared data sets. We want to hear from you about what would be useful in our tool, and we’re interested in hearing from newsrooms that might be interested in testing our tools. Sign up for updates here.


The Ticket Trap: Front to Back


Millions of motorists in Chicago have gotten a parking ticket. So when we built The Ticket Trap — an interactive news application that lets people explore ticketing patterns across the city — we knew that we’d be building something that shines a spotlight on an issue that affects people from all walks of life.

But we had a more specific story we needed to tell.

At ProPublica Illinois, we’d been reporting on Chicago’s aggressive parking and vehicle compliance ticket system for months. Our stories revealed a system that disproportionately punishes black and low-income residents and generates millions of dollars every year for the city by pushing massive debt onto Chicago’s poorest residents — even sending thousands into bankruptcy.

So when we thought about building an interactive database that allows the public, for the first time, to see all 54 million tickets issued over the last two decades, we wanted to make sure users understood the findings of the overall project. That’s why we centered the user experience around the disparities in the system, such as which wards have the most ticket debt and which have been hit hardest because residents can’t pay.

The Ticket Trap is a way for users to see lots of different patterns in tickets and to see how their wards fit into the bigger picture. It also gives civically active folks tools for talking about the issue of fines imposed by the city and helps them hold their elected officials accountable for how the city imposes debt.

The project also gave us an opportunity to try a bunch of technical approaches that could help a small organization like ours develop sustainable news apps. Although we’re part of the larger ProPublica, I’m the only developer in the Illinois office, so I want to make careful choices that will help keep our “maintenance debt” — the amount of time future-me will need to spend keeping old projects up and running — low.

Managing and minimizing maintenance debt is particularly important to small organizations that hope to do ambitious digital work with limited resources. If you’re at a small organization, or are just looking to solve similar problems, read on: These tools might help you, too.

In addition to lowering maintenance debt, I also wanted the pages to load quickly for our readers and to cost us as little as possible to serve. So I decided to eliminate, as much as possible, having executable code running on a server just to load pages that rarely change. That decision required us to solve some problems.

We used a JAMstack approach: a static front end, with microservices handling the dynamic features.

The learning curve for these technologies is steep (don’t worry if you don’t know what it all means yet). And while there are lots of good resources to learn the components, it can still be challenging to put them all together.

So let’s start with how we designed the news app before descending into the nerdy lower decks of technical choices.

Design Choices

The Ticket Trap focuses on wards, Chicago’s primary political divisions and the most relevant administrative geography. Aldermen don’t legislate much, but they have more power over ticketing, fines, punishments and debt collection policies than anyone except the mayor.

We designed the homepage as an animated, sortable list that highlights the wards, instead of a table or citywide map. Our hope was to encourage users to make more nuanced comparisons among wards and to integrate our analysis and reporting more easily into the experience.

The top of the interface provides a way to select different topics and then learn about what they mean and their implications before seeing how the wards compare. If you click on “What Happens if You Don’t Pay,” you’ll learn that unpaid tickets can trigger late penalties, but they can also lead to license suspensions and vehicle impoundments. Even though many people from vulnerable communities are affected by tickets in Chicago, they’re not always familiar with the jargon, which puts them at a disadvantage when trying to defend themselves. Empowering them by explaining some basic concepts and terms was an important goal for us.

Below the explanation of terms, we display some small cards that show you the location of each ward, the alderman who represents it, its demographic makeup and information about the selected topic. The cards are designed to be easy to “skim and dive” and to make visual comparisons. You can also sort the cards based on what you’d like to know.

We included some code in our pages to help us track how many people used different features. About 50 percent of visitors selected a new category at least once and 27 percent sorted once they were in a category. We’d like to increase those numbers, but it’s in line with engagement patterns we saw for our Stuck Kids interactive graphic and better than we did on the interactive map in The Bad Bet, so I consider it a good start.

For more ward-specific information, readers can also click through to a page dedicated to their ward. We show much of the same information as the cards but allow you to home in on exactly how your ward ranks in every category. We also added some more detail, such as a map showing where every ticket in your ward has been issued.

We decided against showing trends over time on ward pages because the overall trend in the number of tickets issued is too big and complex a subject to capture in simple forms like line charts. As interesting as that may have been, it would have been outside the journalistic goals of highlighting systemic injustices.

For example, here’s the trend over time for tickets in the 42nd Ward (downtown Chicago). It’s not very revealing. Is there an upward trend? Maybe a little. But the chart says little about the overall effect of tickets on people’s lives, which is what we were really after.

On the other hand, the distributions of seizures/suspensions and bankruptcy are very revealing and show clear groupings and large variance, so each detail page includes visualizations of these variables.

Looking forward, there’s more we can do with these by layering on more demographic information and adding visual emphasis.

One last point about the design of these pages: I’m not a “natural” designer and look to colleagues and folks around the industry for inspiration and help. I made a map of some of those influences to show how many people I learned from as I worked on the design elements:

These include ProPublica news applications developer Lena Groeger’s work on Miseducation, as well as NPR’s Book Concierge, first designed by Danny DeBelius and most recently by Alice Goldfarb. I worked on both and picked up some design ideas along the way. Helga Salinas, then an engagement reporting fellow at ProPublica Illinois, helped frame the design problems and provided feedback that was crucial to the entire concept of the site.

Technical Architecture

The Ticket Trap is the first news app at ProPublica to take this approach to mixing “baked out” pages with dynamic features like search. It’s powered by a static site generator (GatsbyJS), a query layer (Hasura), a database (Postgres with PostGIS) and microservices (Serverless and Lambda).

Let’s break that down:

  • Front-end and site generator: GatsbyJS builds a site by querying for data and providing it to templates built in React that handle all display-layer logic, both the user interface and content.
  • Deployment and development tools: A simple Grunt-based command line interface for deploying and administrative tasks.
  • Data management: All data analysis and processing is done in Postgres. Using GNU Make, the database can be rebuilt at any time. The Makefile also builds map tiles and uploads them to Mapbox. Hasura provides a GraphQL wrapper around Postgres so that GatsbyJS can query it, and GraphQL is just a query language for APIs.
  • Search and dynamic services: Search is handled by a simple AWS Lambda function managed with Serverless that ferries simple queries to an RDS database.

It’s all a bit buzzword-heavy and trendy-sounding when you say it fast. The learning curve can be steep, and there’s been a persistent and sometimes persuasive argument that the complexity of modern Javascript toolchains and frameworks like React are overkill for small teams.

We should be skeptical of the tech du jour. But this mix of technologies is the real deal, with serious implications for how we do our work. In my view, once I put all the pieces together, there was significantly less complexity than with the MVC-style frameworks typically used for news apps.

Front End and Site Generator

GatsbyJS provides data to templates (built as React components) that contain both UI logic and content.

The key difference here from frameworks like Rails is that instead of splitting up templates and the UI (the classic “change template.html then update app.js” pattern), GatsbyJS bundles them together using React components. In this model, you factor your code into small components that bundle data and interactivity together. For example, all the logic and interface for the address search is in a component called AddressSearch. This component can be dropped into the code anywhere we want to show an address search using an HTML-like syntax (<AddressSearch />) or even used in other projects.

We’ll skip over what I did here, which is best summed up by this Tweet:

lol pic.twitter.com/UCpQK131J6— Thomas Wilburn (@thomaswilburn) January 16, 2019

There are better ways to learn React than my subpar code.

GatsbyJS also gives us a uniform system for querying our data, no matter where it comes from. In the spirit of working backward, look at this simplified query snippet from the site’s homepage, which provides access to data about each ward’s demographics, ticketing summary data, responsive images with locator maps for each ward, site configuration and editable snippets of text from a Google spreadsheet.

export const query = graphql`
  query PageQuery {
    configYaml {
      slug
      title
      description
    }
    allImageSharp {
      edges {
        node {
          fluid(maxWidth: 400) {
            ...GatsbyImageSharpFluid
          }
        }
      }
    }
    allGoogleSheetSortbuttonsRow {
      edges {
        node {
          slug
          label
          description
        }
      }
    }
    iltickets {
      citywideyearly_aggregate {
        aggregate {
          sum {
            current_amount_due
            ticket_count
            total_payments
          }
        }
      }
      wards {
        ward
        wardDemographics {
          white_pct
          black_pct
          asian_pct
          latino_pct
        }
        wardMeta {
          alderman
          address
          city
          state
          zipcode
          ward_phone
          email
        }
        wardTopFiveViolations {
          violation_description
          ticket_count
          avg_per_ticket
        }
        wardTotals {
          current_amount_due
          current_amount_due_rank
          ticket_count
          ticket_count_rank
          dismissed_ticket_count
          dismissed_ticket_count_rank
          dismissed_ticket_count_pct
          dismissed_ticket_count_pct_rank
          # …
        }
      }
    }
  }
`

Seems like a lot, and maybe it is. But it’s also powerful, because it’s the precise shape of the JSON that will be available to our template, and it draws on a variety of data sources: a YAML config file kept under version control (configYaml), images from the filesystem processed for responsiveness (allImageSharp), edited copy from Google Sheets (allGoogleSheetSortbuttonsRow) and ticket data from PostgreSQL (iltickets).

And data access in your template becomes very easy. Look at this snippet:

iltickets {
  wards {
    ward
    wardDemographics {
      white_pct
      black_pct
      asian_pct
      latino_pct
    }
  }
}

In our React component, accessing this data looks like:

{data.iltickets.wards.map((ward, i) => (
  <p>Ward {ward.ward} is {ward.wardDemographics.latino_pct}% Latino.</p>
))}

Every other data source works exactly the same way. The simplicity and consistency help keep templates clean and clear to read.

Behind the scenes, Hasura, a GraphQL wrapper for Postgres, is stitching together relational database tables and serializing them as JSON to pull in the ticket data.

Data Management

Hasura

Hasura occupies a small role in this project, but without it, the project would be substantially more difficult. It’s the glue that lets us build a static site out of a large database, and it allows us to query our Postgres database with simple JSON-esque queries using GraphQL. Here’s how it works.

Let’s say I have a table called “wards” with a one-to-many relationship to a table called “ward_yearly_totals”. Assuming I’ve set up the correct foreign key relationships in Postgres, a query from Hasura would look something like:

wards {
  ward
  alderman
  wardYearlyTotals {
    year
    ticket_count
  }
}

On the back end, Hasura knows how to generate the appropriate join and turn it into JSON.

This process was also critical in working out the data structure. I was struggling with it, but then I realized I just needed to work backward. Because GraphQL queries are declarative, I simply wrote queries that described the way I wanted the data to be structured for the front end and worked backward to create the relational database structures to fulfill those queries.

Hasura can do all sorts of neat things, but even the most simple use case — serializing JSON out of a Postgres database — is quite compelling for daily data journalism work.

Data Loading

GNU Make powers the data loading and processing workflow. I’ve written about this before if you want to learn how to do this yourself.

There’s a Python script (with tests) that handles cleaning up unescaped quotes and a few other quirks of the source data. We also use the highly efficient Postgres COPY command to load the data.

The only other notable wrinkle is that our source data is split up by year. That gives us a nice way to parallelize the process and to load partial data during development to speed things up.

At the top of the Makefile, we have these years:

PARKINGYEARS = 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018

Loading four years’ worth of data, processed in parallel across four processor cores, looks like this:

PARKINGYEARS="2015 2016 2017 2018" make -j 4 parking

Make, powerful as it is for filesystem-based workflows and light database work, has been more than a bit fussy when working so extensively with a database. Dependencies are hard to track without hacks, which means not all steps can be run without remembering and running prior steps. Future iterations of this project would benefit from either more clever Makefile tricks or a different tool.

However, being able to recreate the database quickly and reliably was a central tenet of this project, and the Makefile did just that.

Analysis and Processing for Display

To analyze the data and deliver it to the front end, we wrote a ticket loader (open sourced here) to use SQL queries to generate a series of interlinked views of the data. These techniques, which I learned from Joe Germuska when we worked together at the Chicago Tribune, are a very powerful way of managing a giant data set like the 54 million rows of parking ticket data used in The Ticket Trap.

The fundamental trick to the database structure is to take the enormous database of tickets and crunch it down into smaller tables that aggregate combinations of variables, then run all analysis against those tables.

Let’s take a look at an example. The query below groups by year and ward, along with several other key variables such as violation code. By grouping this way, we can easily ask questions like, “How many parking meter tickets were issued in the 3rd Ward in 2005?” Here’s what the summary query looks like:

create materialized view wardsyearly as
  select
    w.ward,
    p.violation_code,
    p.ticket_queue,
    p.hearing_disposition,
    p.year,
    p.unit_description,
    p.notice_level,
    count(ticket_number) as ticket_count,
    sum(p.total_payments) as total_payments,
    sum(p.current_amount_due) as current_amount_due,
    sum(p.fine_level1_amount) as fine_level1_amount
  from wards2015 w
  join blocks b on b.ward = w.ward
  join geocodes g on b.address = g.geocoded_address
  join parking p on p.address = g.address
  where g.geocode_accuracy > 0.7
    and g.geocoded_city = 'Chicago'
    and (
      g.geocode_accuracy_type = 'range_interpolation'
      or g.geocode_accuracy_type = 'rooftop'
      or g.geocode_accuracy_type = 'intersection'
      or g.geocode_accuracy_type = 'point'
      or g.geocode_accuracy_type = 'ohare'
    )
  group by
    w.ward, p.year, p.notice_level, p.unit_description,
    p.hearing_disposition, p.ticket_queue, p.violation_code;

The view created by this query has one row for each combination of ward, year, violation code, ticket status queue, hearing disposition, issuing unit and notice level, along with aggregated ticket counts and dollar totals.

This is very easy to query and reason about, and significantly faster than querying the full parking data set.

Let’s say we want to know how many tickets were issued by the Chicago Police Department in the 1st Ward between 2013 and 2017:

select sum(ticket_count) as cpd_tickets
from wardsyearly
where ward = '1'
  and year >= 2013
  and year <= 2017
  and unit_description = 'CPD'

The answer is 64,124 tickets. This query took 119 milliseconds on my system when I ran it, while a query to obtain the equivalent data from the raw parking records takes minutes rather than fractions of a second.

The Database as the “Single Source of Truth”

I promised myself when I started this project that all calculations and analysis would be done with SQL and only SQL. That way, if there's a problem with the data in the front end, there's only one place to look, and if there's a number displayed in the front end, the only transformation it undergoes is formatting. There were moments when I wondered if this was crazy, but it has turned out to be perhaps my best choice in this project.

With common table expressions (CTE), part of most SQL environments, I was able to do powerful things with a clear, if verbose, syntax. For example, we rank and bucket every ward by every key metric in the data. Without CTEs, this would be a task best accomplished with some kind of script with gnarly for-loops or impenetrable map/reduce functions. With CTEs, we can use impenetrable SQL instead! But at least our workflow is declarative and ensures any display of the data can and should contain no additional data processing.

Here’s an example of a CTE that ranks wards on a couple of variables using the intermediate summary view from above. Our real queries are significantly more complex, but the fundamental concepts are the same:

with year_bounds as (
  select 2013 as min_year, 2017 as max_year
),
wards_toplevel as (
  select
    ward,
    sum(ticket_count) as ticket_count,
    sum(total_payments) as total_payments
  from wardsyearly, year_bounds
  where (year >= min_year and year <= max_year)
  group by ward
)
select
  ward,
  ticket_count,
  dense_rank() over (order by ticket_count desc) as ticket_count_rank,
  total_payments,
  dense_rank() over (order by total_payments desc) as total_payments_rank
from wards_toplevel;

Geocoding

Geocoding the data — turning handwritten or typed addresses into latitude and longitude coordinates — was a critical step in our process. The ticket data is fundamentally geographic and spatial. Where a ticket is issued is of utmost importance for analysis. Because the input addresses can be unreliable, the address data associated with tickets was exceptionally messy. Geocoding this data was a six-month, iterative process.

An important technique we use to clean up the data is very simple. We “normalize” the addresses to the block level by turning street numbers like “1432 N. Damen” into “1400 N. Damen.” This gives us fewer addresses to geocode, which made it easier to repeatedly geocode some or all of the addresses. The technique doesn’t improve the data quality itself, but it makes the data significantly easier to work with.
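Here is a minimal sketch of that normalization step (illustrative only; the real cleanup handles many more edge cases than this):

import re

def normalize_to_block(address):
    """Round a street number down to its hundred block: '1432 N. Damen' -> '1400 N. Damen'.
    (Simplified sketch, not the project's actual cleanup code.)"""
    match = re.match(r'^(\d+)\s+(.+)$', address.strip())
    if not match:
        return address
    number, street = match.groups()
    block = (int(number) // 100) * 100
    return f"{block} {street}"

print(normalize_to_block("1432 N. Damen"))  # 1400 N. Damen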

Ultimately, we used Geocodio and were quite happy with it. Google's geocoder is still the best we've used, but Geocodio is close and has a more flexible license that allowed us to store, display and distribute the data, including in our Data Store.

We found that the underlying data was hard to manually correct because many of the errors were because of addresses that were truly ambiguous. Instead, we simply accepted that many addresses were going to cause problems. We omitted addresses that Geocodio wasn't confident about or couldn't pinpoint with enough accuracy. We then sampled and tested the data to find the true error rate.

About 12 percent of addresses couldn’t be used. Of the remaining addresses, sampling showed them to be about 94 percent accurate. The best we could do was make the most conservative estimates and try to communicate and disclose this clearly in our methodology.
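The sampling itself is straightforward; a sketch of the idea (the row list and the manual-review function here are hypothetical stand-ins):

import random

def estimate_accuracy(geocoded_rows, manually_verify, sample_size=500):
    # Draw a random sample, have a human check each geocode, and use the share
    # judged correct as the estimated accuracy of the full data set.
    # (Illustrative sketch; `manually_verify` is a hypothetical review step.)
    sample = random.sample(geocoded_rows, sample_size)
    correct = sum(1 for row in sample if manually_verify(row))
    return correct / sample_size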

To improve accuracy, we worked with Matt Chapman, a local civic hacker, who had geocoded the addresses without normalization using another service called SmartyStreets. We shared data sets and cross-validated our results. SmartyStreets’ accuracy was very close to Geocodio’s. I attempted to see if there was a way to use results from both services. Each service did well and struggled with different types of address problems, so I wanted to know if combining them would increase the overall accuracy. In the end, my preliminary experiments revealed this would be technically challenging with negligible improvement.

Deployment and Development Tools

The rig uses some simple shell commands to handle deployment and building the database. For example:

make all
make db
grunt publish
grunt unpublish
grunt publish --target=production

Dynamic Search With Microservices

Because we were building a site with static pages and no server runtime, we had to solve the problem of offering a truly dynamic search feature. We needed to provide a way for people to type in an address and find out which ward that address is in. Lots of people don’t know their own wards or aldermen. But even when they do, there’s a decent chance they wouldn’t know the ward for a ticket they received elsewhere in the city.

To allow searching without needing to spin up any new services, we used Mapbox's autocomplete geocoder; AWS Lambda, to provide a tiny API; our Amazon Aurora database; and Serverless to manage the connection.

Mapbox provides suggested addresses, and when the user clicks on one, we dispatch a request to the back-end service with the latitude and longitude, which are then run through a simple point-in-polygon query to determine the ward.

It’s simple. We have a serverless.yml config file that looks like this:

service: il-tickets-query

plugins:
  - serverless-python-requirements
  - serverless-dotenv-plugin

custom:
  pythonRequirements:
    dockerizePip: non-linux
    zip: true

provider:
  name: aws
  runtime: python3.6
  stage: ${opt:stage,'dev'}
  environment:
    ILTICKETS_DB_URL: ${env:ILTICKETS_DB_URL}
  vpc:
    securityGroupIds:
      - sg-XXXXX
    subnetIds:
      - subnet-YYYYY

package:
  exclude:
    - node_modules/**

functions:
  ward:
    handler: handler.ward
    events:
      - http:
          method: get
          cors: true
          path: ward
          request:
            parameters:
              querystrings:
                lat: true
                lng: true

Then we have a handler.py file to execute the query:

try:
    import unzip_requirements
except ImportError:
    pass

import json
import logging
import numbers
import os

import records

log = logging.getLogger()
log.setLevel(logging.DEBUG)

DB_URL = os.getenv('ILTICKETS_DB_URL')

def ward(event, context):
    qs = event["queryStringParameters"]
    db = records.Database(DB_URL)
    rows = db.query("""
        select ward
        from wards2015
        where st_within(st_setsrid(ST_GeomFromText('POINT(:lng :lat)'), 3857), wkb_geometry)
    """, lat=float(qs['lat']), lng=float(qs['lng']))

    wards = [row['ward'] for row in rows]

    if len(wards):
        response = {
            "statusCode": 200,
            "body": json.dumps({"ward": wards[0]}),
            "headers": {
                "Access-Control-Allow-Origin": "projects.propublica.org",
            }
        }
    else:
        response = {
            "statusCode": 404,
            "body": "No ward found",
        }

    return response

That’s all there is to it. There are plenty of ways it could be improved, such as making the cross-origin resource sharing policies configurable based on the deployment stage. We’ll also be adding API versioning soon to make it easier to maintain different site versions.

Minimizing Costs, Maximizing Productivity

The cost savings of this approach can be significant.

Using Amazon Lambda cost pennies per month (or less), while running even the smallest servers on Amazon’s Elastic Compute Cloud service usually costs quite a bit more. The thousands of requests and tens of thousands of milliseconds of computing time used by the app in this example are, by themselves, well within Amazon’s free tier. Serving static assets from Amazon’s S3 service also costs only pennies per month.

Hosting costs are a small part of the puzzle, of course — developer time is far more costly, and although this system may take longer up front, I think the trade-off is worth it because of the decreased maintenance burden. The time a developer will not have to spend maintaining a Rails server is time that he or she can spend reporting or writing new code.

For The Ticket Trap app, I only need to worry about a single, highly trusted and reliable service (our database) rather than a virtual server that needs monitoring and could experience trouble.

But where this system really shines is in its increased resiliency. When using traditional frameworks like Rails or Django, functionality like search and delivering client code are tightly coupled. So if the dynamic functionality breaks, the whole site will likely go down with it. In this model, even if AWS Lambda were to experience problems (which would likely be part of a major, internet-wide event), the user experience would be degraded because search wouldn’t work, but we wouldn’t have a completely broken app. Decoupling the most popular and engaging site features from an important but less-used feature minimizes the risks in case of technical difficulties.

If you’re interested in trying this approach, but don’t know where to begin, identify what problem you’d like to spend less time on, especially after your project is launched. If running databases and dynamic services is hard or costly for you or your team, try playing with Serverless and AWS Lambda or a similar provider supported by Serverless. If loading and checking your data in multiple places always slows you down, try writing a fast SQL-based loader. If your front-end code is always chaotic by the end of a development cycle, look into implementing the reactive pattern provided by tools like React, Svelte, Angular, Vue or Ractive. I learned each part of this stack one at a time, always driven by need.

