
Reporting Recipe: How to Investigate Health Professionals


Today we're launching two guides to help researchers, journalists and citizens check the license and disciplinary records of medical professionals in every state. One is for doctors, the other for nurses.

State boards are responsible for investigating alleged wrongdoing, incompetence and mistakes made by doctors and nurses. When they find problems, boards can revoke a license—stopping a doctor or nurse from practicing—or impose lesser discipline. But all too often, the details of their findings remain inaccessible to the public.

There is no free and open national database that allows the public to scrutinize the qualifications of doctors and nurses. The Federation of State Medical Boards, a trade group for state medical boards, has a lookup tool for doctors, but it costs nearly $10 per search. "The fee helps defray a portion of the significant costs involved with gathering, verifying and maintaining licensure, disciplinary and medical specialty information," said Drew Carlson, the federation's director of communications, in an email. The National Council of State Boards of Nursing, a trade group for state nursing boards, has a national search tool to verify nurses' licenses, but not all states participate — and often the documents explaining any misconduct are unavailable.

Every state licensing board in the U.S. provides the public with basic license verification for medical professionals registered in their states, including whether or not practitioners have been disciplined. Most publish the legal documents online, giving the public a detailed description of any violations and the subsequent enforcement actions.

For example, Arizona's medical and nursing boards publicly list every enforcement action, along with documents that reveal the details of the wrongdoing.

However, 16 medical and nursing boards lag behind, withholding key details about misdeeds and mistakes, including the legal documents.

Dr. Sidney M. Wolfe, the founder and former director of the Health Research Group at the consumer watchdog organization Public Citizen, says that state boards should be more transparent with disciplinary information.

"The ability for people to get information on their doctors and nurses in this country depends on what state you live in," said Wolfe, whose organization published a yearly index ranking state medical boards by the number of disciplinary actions per physician. "That kind of information is too important to be left to the whims of the state."

The states that do not post the records online require patients to pay fees or file public records requests to find out the details about their doctors' and nurses' histories.

For many patients, that hurdle is too high. And in the U.S., where medical negligence is the third-leading cause of death, finding out this information could make a life-changing difference.

Medical and nursing boards reached by ProPublica cited two reasons for withholding disciplinary details from their websites: money and privacy.

Wyoming's State Board of Nursing posts all public records of nurses' wrongdoing, but the state medical board does not.

"Our data system is antiquated and we're going through a long-term transition," said Kevin Bohnenblust, executive director of the Wyoming Board of Medicine. The medical board lists a one-sentence summary of disciplinary actions, but provides no further detail. "Because we're so small, we are doing it in bite-size pieces. We've wanted to get it right rather than have a broken product."

Some medical and nursing boards rely on fees from public requests for disciplinary documents to cover their budgets.

"That is our way of obtaining monies," said Frances Carrillo, a special projects officer at the Mississippi State Board of Medical Licensure. "We don't take money from state coffers—we take money from fees and fines and licensure fees. This is our way of covering our budget to operate."

Other boards say that although medical and nursing disciplinary records are public records, they should not necessarily be publicized.

"We don't publicize that information for every nurse. It is public record but we just don't put it out there probably because of privacy issues," said Angela Rice, an administrative coordinator at the Louisiana Board of Nursing.

For our complete, state-by-state guide to public records of medical licensing board documents, and for more information on investigating doctors and nurses, check out our guides: How To Investigate Nurses and How To Investigate Doctors.


What I Learned From My Fellowship at ProPublica


This story was co-published with Data Journalism China.

Like a lot of my classmates at Columbia, I had been following the work of ProPublica's news applications desk since I started learning data journalism. So when I received the call from ProPublica's news application editor, Scott Klein, telling me I was going to be their summer Google Journalism Fellow for 2014, I could not believe that I was going to work with his team.

Though my fellowship ended at the end of last summer, I stayed on in a different fellowship. Working here these last seven months has allowed me to better understand how to build news applications and the methodologies behind them. Here are a few of the things I've learned.

Data Collection and Data Analysis

ProPublica's news application team produces at least 12 large-scale interactive applications a year. In most cases, data collection and data analysis are the most time-consuming parts of a project.

Web scraping is one of the major data sources for ProPublica's projects. Scraping and cleaning data can take a while, but it can be fun when it works. Take Dollars for Docs, for example. It takes a news developer here more than half a year to scrape and check all the data and to update the application. This nerd blog post explains in detail how ProPublica extracted data from a variety of not-so-friendly sources.
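The details vary from source to source, but the core scraping pattern is always the same: fetch a page, parse the HTML, pull out the fields you need and save them in a structured format. Here is a minimal, generic sketch in Python using requests and BeautifulSoup; the URL and column names are hypothetical placeholders, not the actual Dollars for Docs sources or code.

```python
# A generic scraping sketch (illustrative only): fetch an HTML page,
# parse it, and write one table out as CSV. The URL and column layout
# are hypothetical placeholders, not the Dollars for Docs sources.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/payments.html"  # placeholder source

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

with open("payments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doctor", "company", "amount"])  # assumed columns
    writer.writerows(rows)
```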

The FOIA (Freedom of Information Act) request is another major data source for ProPublica. ProPublica, like other investigative newsrooms, has a complex relationship with FOIA requests. It is one of the main ways to get data that the government is reluctant to disclose. But the process can be long and painful. Every federal agency has slightly different rules, and each state can have very different laws about what's required to be released under FOIA, and under what conditions. Not every state is like Louisiana, which requires government agencies to get back to FOIA requesters within three days. FOIA requests in general take months to process. If you hear happy cheers from a corner of the newsroom, it is very possible that a FOIA package just arrived. Receiving the parcel does not mean that you can roll up your sleeves and start your analysis. On good days, FOIA responses come in a structured format like Excel or CSV. On bad days, they come as scanned PDFs or sometimes even on paper. Sometimes government agencies make excuses: the database is too big to export; your request is too complicated and it is going to take three or more years to process; we need to hire someone to process your request, which will cost you $3,000; and so on. Under these circumstances, the reporter has to negotiate (or even fight) for the dataset.

Other than web scraping and FOIA requests, ProPublica also collects its own data. For China's Memory Hole,  ProPublica built a database and collected images deleted from Weibo, China's version of Twitter, and effectively showed what topics got images deleted the most.

Reporters here believe that statistics should play a bigger role in storytelling. Everyone gets a copy of "Numbers in the Newsroom," a classic guide by Sarah Cohen. R is the primary data analysis tool at ProPublica, but people here also use Ruby and Python for data cleaning and analysis. Pro tips: Don't edit a dataset by hand. Always write a script to wrangle it, so that you not only avoid manual mistakes but can also replicate exactly what you did when the original dataset is updated. Even better, build a Rails app to clean and analyze the dataset, even when you are not building an interactive project.
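To make that advice concrete, here is a minimal sketch of a scripted cleanup in Python with pandas, one of the tools mentioned above. The file and column names are hypothetical; the point is that rerunning the script on an updated source file repeats exactly the same steps.

```python
# A minimal, repeatable cleaning script (hypothetical file and columns).
# Rerun it whenever the source data is updated; never edit the CSV by hand.
import pandas as pd

raw = pd.read_csv("payments_raw.csv")

cleaned = (
    raw.rename(columns=str.lower)
       .dropna(subset=["doctor_name"])          # drop rows missing a name
       .assign(
           doctor_name=lambda df: df["doctor_name"].str.strip().str.title(),
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
       )
       .drop_duplicates()
)

cleaned.to_csv("payments_clean.csv", index=False)
print(cleaned.groupby("company")["amount"].sum().sort_values(ascending=False).head())
```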

News Application and Interactive Graphic Design

I read ProPublica's Nerd Guides before I started here, and I highly recommend them. They help you avoid common mistakes in news application and data visualization design: stay away from pie charts unless you are showing the relationship between one part and one whole; scatterplots are good for showing the correlation between two variables, and line charts are for continuous variables; avoid 3-D charts and donut charts at all costs. They also include a general guide for how to structure a news application, a coding manifesto and a data bulletproofing guide.

News applications here follow a structure that helps readers understand the story. Every news application is designed with a far view and a near view. The far view provides context for the data, telling the story at the highest level. The near view personalizes the story, enabling people to place themselves in it and to see why they should care. People can look up their city, their school district, their doctors, their health plan and so on. More detailed guidance on news application design can also be found in the Nerd Guides.

But good design requires considerations that go beyond basic principles. Almost every news application or interactive graphic here at ProPublica goes through a design-demo-redesign cycle. Sometimes the whole team gathers in the conference room to brainstorm about a news application and come up with better ideas for data visualization or UI design. That is where I learned the most about the nuances that separate more effective data representations from less effective ones.

When I started at ProPublica, putting together a bunch of code and seeing it work was so exciting that I never wanted to show people my "dirty laundry." As I started working with other developers in the newsroom, I soon found out that I had to. They kindly pointed out my bad coding habits and told me how to change them: follow each language's naming conventions; indent your code to make it easier to read and debug; clearly define the scope of variables instead of using globals everywhere; create functions and classes when you find yourself repeating code; try a prototype-based programming style; avoid z-index, but if you have to use it, don't go crazy and set it to 10000. In a newsroom, deadlines take priority over coding style, but getting into good habits clearly helped. Good habits are crucial in making sure that the code works efficiently and that it is easy to collaborate with another developer.

Venturing Into Uncharted Territory

Another bonus of working at ProPublica is watching people experiment with new technologies. Satellite imagery processing and taking pictures with balloons don't happen every day in this newsroom, but when they do, it's really exciting. This blog post explains how ProPublica used satellite and aerial imagery to tell the story of Louisiana's land loss and land gain.

Losing Ground is a huge news application with rich content. To make sure that users will be able to navigate through the app without missing the most important information, ProPublica conducted its very first formal user test on the news application.  ProPublica recruited five users randomly from Twitter and had a 30-minute one-on-one test with each of them.  Through a shared screen, the designer was able to see if users followed the path they had in mind. The users played with the app at will and told the designer which part looked confusing. The user test went great and the designers were able to redesign parts of the app according to findings in the user test.

For smaller projects on a tight deadline, ProPublica asks for help within the newsroom. Sometimes the news application developer asks reporters from the other side of the newsroom to sit down and play with the app. Sometimes the developer asks visitors in the newsroom to provide suggestions. The idea is to get someone who is totally unfamiliar with the subject to comment on whether the news app conveyed the information loud and clear.  These semi-user tests are not perfect substitutes for the real ones. It is still on ProPublica's wish list to conduct real user tests for as many news apps as possible.

The last few months have been an exciting and rewarding adventure for me. I got hands-on experience collecting raw data, designing visualizations and building Rails apps. If you are excited about the ideas above and want to explore them yourself, you should apply to ProPublica's Google Journalism Fellowship.

Form 990 Documents Return to Nonprofit Explorer


Today we released an update to our Nonprofit Explorer database and API. It includes updated organization profiles and — more significantly — has re-enabled links to Form 990 document PDFs. Those links had been disabled after they were removed by their source, Public.Resource.org, during an ongoing dispute with the IRS over privacy issues last year.

Late last month, Carl Malamud, founder of Public Resource, sent the entire existing trove of documents to ProPublica and to the Internet Archive in order to re-establish free, public access to the data. The document set — over 8 million filings, weighing in at over 8.4 terabytes — comprises all Form 990s digitized by the IRS between 2002 and November 2014.

Our update coincides with a judgment in Public Resource’s favor in a lawsuit against the IRS over access to “e-filed” tax returns. Under the order, the IRS is compelled to provide the structured electronic data of e-filed tax returns (as “Modernized e-File” XML), in addition to the Form 990 document images it had previously made available. Though the lawsuit covers only nine e-filed tax returns, which the IRS originally declined to provide in XML format, Malamud said that the judgment “finally establishes that the public has a right to access e-file data and substantially supports a stronger interpretation of E-FOIA.” The IRS has not yet responded to a request for comment.

With help from Malamud, ProPublica plans to keep Nonprofit Explorer, as well as the Internet Archive, up to date with Form 990 document images until public e-file data supersedes them. The process is involved, and we hope to have a reliable workflow in place within the next few months.

One Year, 2,000+ Downloads: Here’s How Our Data Store Is Doing


ProPublica’s Data Store opened for business nearly a year ago. Our two goals when we launched were to centralize and help people make better sense of the free datasets we make available, and to generate some income to help defray the costs of cleaning and analyzing our data. We’d like to share some news about how things have gone so far.

Since we launched on Feb. 26, 2014, we’ve seen more than 2,000 downloads of free datasets. The income the Data Store generates has become a significant part of our modest annual earned income.

We now have three times as many datasets as when we started. More than half are free to download.

Just in the past few months we’ve added some new datasets as part of launching our data-driven stories and interactives:

  • Raw data about political ad spending during the 2012 presidential election, generated by the community around our Free the Files project (a free download)
  • Medicare’s Part D prescribing data for 2012, including data for patients 65 and older only (a free download)
  • Cleaned Part D prescribing data, which we used in our Prescriber Checkup web application (for purchase)
  • Curated credit rating agency comments regarding tobacco bonds, from Cezary Podkul’s story on the subject (free download)
  • Cleaned and joined data that compares 2014 and 2015 Affordable Care Act plans (initially for purchase, now free)
  • Curated and cleaned data on bank bailouts after the 2008 financial crisis (for purchase)
  • A full accounting, available nowhere else, of school desegregation orders, both court-ordered and voluntary, gathered by Nikole Hannah-Jones for our school segregation stories (free download)
  • Data from U.S. school districts reporting incidents of restraining and secluding students from Heather Vogell and Annie Waldman’s stories (free download)

The data has proven popular, especially among researchers and our fellow journalists. We’re excited to expand our offerings even more this year.

Have any Data Store questions, comments or suggestions? E-mail us at data@propublica.org.

Antebellum Data Journalism: Or, How Big Data Busted Abe Lincoln


This story was prepared for the March 2014 conference, "Big Data Future," at Ohio State's Moritz College of Law, and will be published in I/S: A Journal of Law and Policy for the Information Society, 10:2 (2015). For more information, see http://bigdatafuture.org.

It’s easy to think of data journalism as a modern invention. With all the hype, a casual reader might assume that it was invented sometime during the 2012 presidential campaign. Better-informed observers can push the start date back a few decades, noting with self-satisfaction that Philip Meyer did his pioneering work during the Detroit riots in the late 1960s. Some go back even further, archly telling the tale of Election Night 1952, when a UNIVAC computer used its thousands of vacuum tubes to predict the presidential election within four electoral votes.

But all of these estimates are wrong – in fact, they’re off by centuries. The real history of data journalism pre-dates newspapers, and traces the history of news itself. The earliest regularly published periodicals of the 17th century, little more than letters home from correspondents hired by international merchants to report on the business details and the court gossip of faraway cities, were data-rich reports.

Early 18th century newspapers were also rich with data. If it were ever in doubt that the unavoidable facts of human existence are death and taxes, early newspapers published tables of property tax liens and of mortality and its causes. Commodity prices and the contents of arriving ships — cargo and visiting dignitaries — were a regular and prominent feature of newspapers throughout the 18th and 19th centuries.

Beyond business figures and population statistics, data was used in a wide variety of contexts. The very first issue of the Manchester Guardian on May 5, 1821 contains on the last of its four pages a large table showing that the real number of students in church schools far exceeded the estimates of the student population made by proponents of education reform.

Data was also used, as it is today, as both the input to and the output of investigative exposés. This is the story of one such investigative story, and of its author, New York Tribune editor Horace Greeley. It’s a remarkable tale, and one with important lessons for “big data” journalism today.

Though he’s no longer a household name, Horace Greeley was one of the most important public figures of the 19th century. His Tribune had a circulation larger than any paper in the city except for cross-town rival James Gordon Bennett’s New York Herald. More than 286,000 copies of the Tribune’s daily, weekly and semi-weekly editions were sold in the city and across the country by 1860, which by its own reckoning made it the largest-circulation newspaper in the U.S. Ralph Waldo Emerson observed, “Greeley does the thinking for the whole West at $2 per year for his paper.”

Greeley himself was a popular public speaker and a hugely influential national figure. He was a fascinating, frustrating, contradictory man. He was a leading abolitionist whose support for the Civil War was limited at best, yet his abolitionist writing in the Tribune made the paper the target of an angry mob during the Draft Riots in 1863. He was a vegetarian and a utopian socialist who published Karl Marx in the Tribune, but believed fervently in manifest destiny and America’s western expansion. He was a New York icon who thought the city was a terrible influence on working people and encouraged them to “Go West” to escape it. Though he was one of the founders of the Republican Party, his relationship with Abraham Lincoln was strained, and he ran for president in 1872 on what amounted to the Democratic ticket, losing big and dying broken-hearted before the Electoral College could meet to certify Grant’s election.

Long before his presidential campaign, and for decades, Greeley and his paper held sway with hundreds of thousands of everyday Americans. But if he was a celebrity with the people, he was far less successful convincing political elites to sponsor his entry into political office. His moralism and mercurial nature seem to have been a steady annoyance to powerful figures like New York’s William Seward and Whig (and later Republican) party boss Thurlow Weed.

Historian Richard Kluger noted of the relationship,

“[Greeley] was more useful to [Seward and Weed] than they ever proved to him. As the eloquent editor of a rising newspaper that reached, through its weekly edition, throughout the Empire State, Greeley was a lively fish on the hook, to be fed enough line to thrash about picturesquely until reeled in tightly during campaign season.”

It was perhaps out of a desire to shut Greeley up — and yet also a recognition of the care necessary when dealing with a man, as The Nation put it, “with a newspaper at his back” — that the Whigs nominated Greeley to fill a temporary vacancy in the House for the second session of the 30th Congress in 1848. The session would last only three months, and Greeley’s Congressional career would end when the term did. But what Greeley did with his time was remarkable.

By the middle of the 1800s, Congressmen’s compensation for travel to and from their districts had been an unsuccessful but simmering reform target for years. The law provided for a 40-cent per-mile mileage reimbursement, and computed the distance “by the usually travelled route.” After taking his seat, Greeley got a look at the schedule listing every congressman’s mileage and was shocked by the sums. To Greeley, the disbursements were a wasteful relic of an earlier time, when travel to and from the far-flung reaches of the United States would have been a costly, bruising affair. The 40-cent mileage had been calculated decades earlier to match a pre-1816 congressman’s pay rate of $8 a day, assuming he could travel a mere 20 miles per day. However, thanks to steamships and the increasing prevalence of trains, travelers could go far faster than that.

Greeley saw it as an outrageous waste of the taxpayer’s money, and deployed his newspaper to correct that wrong. “If the route usually travelled from California to Washington is around Cape Horn — or the Members from that embryo State shall choose to think it is — they will each be entitled to charge some $12,000 Mileage per session accordingly.”
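The arithmetic behind these figures is simple enough to check. A back-of-envelope sketch in Python: the 40-cent rate follows from $8 a day at 20 miles a day, and billing a Cape Horn voyage of roughly 15,000 miles each way (the distance is my assumption) at that rate lands on the $12,000 per session Greeley cites.

```python
# Back-of-envelope check on the mileage figures. The Cape Horn distance
# is an assumption; the pay rate and daily mileage come from the article.
PAY_PER_DAY = 8.00        # pre-1816 congressional pay, dollars per day
MILES_PER_DAY = 20        # assumed travel per day in the pre-rail era

rate = PAY_PER_DAY / MILES_PER_DAY
print(rate)               # 0.4 dollars per mile

CAPE_HORN_MILES = 15_000  # rough one-way distance, California to Washington by sea
print(2 * CAPE_HORN_MILES * rate)  # 12000.0 -- the "$12,000 Mileage per session"
```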

Rather than simply opining against it, he conceived and published a data-journalism project that, in form if not in execution, would be very much at home in a newsroom today. He asked one of his reporters, Douglas Howard, a former postal clerk, to use a U.S. Post Office book of mail routes to calculate the shortest path from each congressman’s district to the Capitol, and compared those distances with each congressman’s mileage reimbursements. On Dec. 22, 1848, with Greeley now simultaneously its editor and a brand new congressman from New York, the Tribune published a story and a table in two columns of agate type. The table listed each congressman by name with the mileage he received, the mileage the postal route would have granted him and the difference in cost between them. “Let no man jump at the conclusion that this excess has been charged and received contrary to law,” wrote Greeley in the accompanying text. “The fact is otherwise. The members are all honorable men — if any irrelevant infidel should doubt it, we can silence him by referring to the prefix to their names in the newspapers.”

It wasn’t his colleagues Greeley inveighed against, but rather, he claimed, the system. “We assume that each has charged precisely what the law allows him and thereupon we press home the question — ‘Ought not THAT LAW to be amended?’”

Among the accused stood Abraham Lincoln, in his only term as congressman. Lincoln’s travel from faraway Springfield, Illinois, made him the recipient of some $677 in excess mileage — more than $18,700 today — among the House’s worst. Besides Lincoln, Greeley’s findings included a list of historical legends, among them both of Lincoln’s vice presidents — Hannibal Hamlin, who took only an extra $64.80 to go between Washington and Maine, and Andrew Johnson, who got $122.40 extra to get to the Capitol and back from Tennessee. Daniel Webster received $72 extra for travel to and from the Senate from Massachusetts. John C. Calhoun and Jefferson Davis were recipients of an extra $313.60 and $736.80, respectively, for round-trip travel from South Carolina and Mississippi. The excesses tracked roughly according to distance from Washington. Isaac Morse, a Democrat from Louisiana whose journey comprised some 1,200 miles by postal route, received 2,600 miles in mileage from the House. A helpful if imprecise note, I assume written by Greeley, offered: “Only 409 miles less than to London.”

Congressional Mileage Map

This map shows the "excess" mileage paid to each member of the 30th Congress, according to Greeley's story. Although the law that stood in 1848 specified only that the mileage would be paid by "the usually traveled route," Greeley argued that the postal route from each member's district to the Capitol ought to be the standard by which mileage was paid.

Abraham Lincoln, in his only term in Congress, was paid $677 more than he should have been, according to Greeley. That's about $18,700 in today's dollars.

Higher excess mileage charged →

Notes: Mileage figures are based on original report in the Dec. 22, 1848 New York Tribune. We corrected arithmetic errors and rounded numbers to the nearest dollar. Sources: The New York Tribune, 30th Congressional District shapes by Jeffrey B. Lewis, Brandon DeVine, and Lincoln Pritcher with Kenneth C. Martis, U.C.L.A.

News Travels Relatively Fast

It took about five days for the story to travel from New York to the rest of the country. One particularly laudatory Greeley biographer reported that “the effect of [the mileage exposé] upon the town was immediate and immense. It flew upon the wings of the country press, and became, in a few days, the talk of the nation.” On Dec. 27, the story broke loose in the House. The Congressional Globe recorded the eruption on the floor. Ohio Democratic Rep. William Sawyer ($281.60 in excess charges) raised a point of order, accusing Greeley of “a species of demagoguism of which he could never consent to be guilty while he occupied a seat on this floor, or while he made any pretensions to stand as an honorable man among his constituents.”

A heated exchange followed with nearly all speakers standing against Greeley, led by Sawyer and Thomas J. Turner, D.-Ill. ($998.40). Most of the charges, according to Turner, were “absolutely false,” and Greeley

“had either been actuated by the low, groveling, base, and malignant desire to represent the Congress of the nation in a false and unenviable light before the country and the world, or that he had been actuated by motives still more base — by the desire of acquiring an ephemeral notoriety, by blazoning forth to the world what the writer attempted to show was fraud. The whole article abounded in gross errors and willfully false statements, and was evidently prompted by motives as base, unprincipled, and corrupt as ever actuated an individual in wielding his pen for the public press.”

While the conversation was rich with florid dudgeon, some of the arguments against Greeley appeared more substantive. Turner pointed out that the Postmaster General had stopped using the postal route book Greeley used to compute mileage “in consequence of incorrectness.” Greeley countered that the article acknowledged this — though I found no such passage in the Tribune. Others noted that the Mileage Committee independently determined mileage for each member, based both on evidence provided by the member and on its own research, and that members themselves didn’t “charge” anything.

To Greeley, this was all beside the point. He defended his story on the floor, pointing out that he didn’t charge members with anything fraudulent or illegal nor did he “object to any gentleman’s taking that course if he saw fit; but was that the route upon which the mileage ought to be computed?”

Greeley’s own mileage is not listed in the table, but he separately told the House that he’d found that his own mileage was overestimated by some $4 – which would match the mileage paid to his predecessor – and that he’d corrected the matter with the House Sergeant-at-Arms. If opinions among his House colleagues ranged from annoyed to apoplectic, opinion among America’s newspapers seems to have been largely supportive of Greeley. "The election of Mr. Greeley to the House seems likely to produce good," ran an editorial in the New York Evening Post the next week. "He has already rendered the people an important service by exposing the fraudulent manner of calculating and paying mileage." The Eastern Carolina Republican damned Greeley with faint praise, saying that he "had hit upon a practical reform for once in his life."

Greeley had “set down the excess to their honor,” added the Sandusky (Ohio) Clarion. “This was not altogether a judicious move, for Mr. Greeley, as a member of this House, especially considering how extravagantly nice some of these bloated crib-suckers are about honor.”

“I had expected that it would kick up some dust,” Greeley later wrote in his autobiography, “but my expectations were far outrun.” He called the affair the “mileage swindle,” and labeled the members “wounded pigeons” and their excuses a “shabby dodge.”

A few weeks into the scandal, he wrote in the Tribune:

“Members who have taken long Mileage generally had nothing to do with settling the distance; while the Committee say they applied to the members generally, and failing a response, did the best they could. That old rascal Nobody is again at his capers! He ought to be indicted.”

Lessons Learned

Though it’s 166 years old and largely forgotten, Greeley’s mileage story has resonance — and lessons — for data journalists today:

First, open records are important for journalists, and they’re absolutely essential for data journalists. Greeley was able to use his status as a sitting congressman to get access to the data for the story, “certifying that it was wanted as the basis of action in the House.” But a law granting access to government documents wasn’t put in place until the Freedom of Information Act was signed almost 100 years after Greeley’s death. Notably, Congress has exempted itself from FOIA. While it isn’t perfect, journalists and researchers today can count on getting data from the government much more easily than they could in Greeley’s day.

Second, data journalists must be cautious about the powerful stories raw data can tell on its own. Greeley might have known he was being provocative by publishing the names as he did, and his protestations that “there was no imputation in the article upon any member, that he had made illegal charges” seem a bit implausible. Indeed, the story Greeley wrote accompanying the long table insists that the target of his investigation was the outdated law and not any particular congressman. But that’s neither how it was taken in the House nor in the country.

Then, as now, raw data isn’t raw. It comes with biases and reflects the choices made about the methods used to create and analyze it. It can also tell its own story and mislead people into inferring things that the facts don’t support. As journalists, we must understand and make conscious, fair choices about what we’re doing when we put names next to numbers. And we must at all points give context — not just in an attached story, but located near the data itself. Greeley made an argument in the form of a statistical table and people across the country — even sophisticated newspaper opinion writers — concluded that the Congress was on the take. The numbers can speak for themselves, but it isn’t always clear what they’re saying.

Also, it’s just as important for data journalists to confirm their stories with actual humans as it is for traditional reporters to do so. Telephones hadn’t been invented yet when Greeley published the story, but it doesn’t seem as if Greeley tried contacting the Committee on Mileage to make sure his methodology was sound. Critics on the floor of the House revealed flaws in Greeley’s story that would have been devastating in today’s environment of instantaneous social-network media criticism. Greeley should also have reached out to Congressmen he singled out to give them a chance to respond pre-publication. “In case the design of the writer had been to act fairly in the matter,” asked Rep. Sawyer, “why he had not taken the trouble to ascertain the facts?”

The table printed in the Tribune is rife with misspelled names, arithmetic errors, a missing entry and what must have been typographic errors introduced when typesetting the complex columns of numbers. Greeley and his coauthor published a series of corrections and clarifications over the next few months. Howard later called the errors inevitable “in a computation involving over half a million of figures, and executed in a very brief space of time.” But with modern computing supporting us, data journalists today have a far higher bar for accuracy. Bulletproofing is a critical part of the editorial process of any data story and it must never be skipped.

All that said, Greeley’s work had its intended effect. The House continued to grouse about the story but passed a bill that session by a vote of 158 to 16 to change the computation of mileage to “the shortest continuous mail route” -- though, Greeley later wrote, with “a distinct understanding that the Senate would kill it.” In his autobiography Greeley reported that Congress later lowered the per-mile rate to twenty cents, and though the “usually travelled route” language remained, he conjectured that the spread of the railroads shortened that route to something comparatively reasonable.

It is perhaps a fitting coda to this story that, although transportation has gotten faster and easier than Greeley could have imagined, congressional mileage calculations remain, though in quite different form. Unlike in 1848, when members of Congress were personally paid the mileage payments, district travel funds are now part of each member’s overall expense budget. They’re calculated using a per-mile rate that increases with proximity to D.C. The highest rate, which would have applied to Greeley’s Manhattan district, is 96 cents, more than double the rate in Greeley’s day.

On Repeat: How to Use Loops to Explain Anything


This article was co-published with Source.

As visual journalists, we have a central responsibility to inform people, and we do it in lots of ways. One technique I think we have only just begun to use for visual explanations is the visual loop. Specifically, the animated GIF.

Animated gifs: so much potential.

We might overlook the GIF because we associate it with certain unserious things like memes, absurd mashups, and celebrity bloopers (or celebrity llamas).

Typical reaction gif.

A most masterful mashup.

But while these GIFs may be awesome and hilarious, looped images have a much richer past and lots of potential to help us explain complex concepts to readers.

GIFs in the Past

Loops actually have quite a long history — and it started long before the 1990s, when the Graphics Interchange Format first included support for animation (by swapping multiple images in a sequence to form a rudimentary video). As delightful as they are, animated GIFs were not the first visual loops:

Starting in the early 1800s, we saw a variety of mechanical devices that used rows of images printed on strips or disks of paper to create the illusion of motion. These were the first animation devices.

The phenakistoscope was basically a spinning disk of images that you viewed through a narrow slit, which tricked your brain into seeing a sequence of images instead of a continuous blur.

The praxinoscope used mirrors to achieve the same result — reflected pictures on the inside of a spinning cylinder appeared as a moving picture.

The Praxinoscope

The zoetrope combined these concepts, with a spinning cylinder and narrow slits you'd look through to see the "moving" image.

The Zoetrope

There's also the poor man's version, the flip book, which was actually invented around the same time as these other contraptions.

Whether it was a spinning disk or a series of mirrors, the end result was actually pretty similar to the GIFs we know today — looping images that play cognitive tricks to allow us to see motion.

The technique of the phenakistoscope and other devices hasn't disappeared entirely, even if it's been replaced by more sophisticated, digital forms. It showed up on the Boston T a few years ago. Ads for the movie Coraline were printed as a series of posters inside the subway tunnel, which passengers saw as an animated movie as the train sped by.

A promotional loop for the movie Coraline.

Riders of the subway in Brooklyn see an animation on the wall as the train approaches and leaves the DeKalb Ave station.

But back to the 1800s. In 1872, photographer Eadweard Muybridge took the idea of looped images even further and created a system to actually project them. Some say this was the first movie projector.

If you look closely at that image being projected, it might remind you of something else that comes up often in the context of data visualization.

This is Horse in Motion, a piece by the same Eadweard Muybridge who invented the movie projector. He created this series of images to settle an argument about whether a horse gets all four feet off the ground at the same time (it does). But this piece is also a classic example of what we call "small multiples." Small multiples are a chart form made up of sequences of small images. They're useful because they put every bit of information in front of us at the same time. If we saw a single video of this horse in motion, it might be harder to see differences because we can't hold everything in our working memories. Small multiples let us offload our memory onto the page.
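If you want to build one yourself, a small-multiples layout is just a grid of little panels that share a scale, with one panel per frame. Here is a minimal matplotlib sketch in Python; the frame image files are hypothetical placeholders.

```python
# A minimal small-multiples sketch: one panel per frame, all visible at once.
# The frame image files are hypothetical placeholders.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

frame_files = [f"horse_{i:02d}.png" for i in range(12)]  # assumed file names

fig, axes = plt.subplots(2, 6, figsize=(12, 4))
for ax, path in zip(axes.flat, frame_files):
    ax.imshow(mpimg.imread(path))
    ax.set_title(path, fontsize=8)
    ax.axis("off")          # no ticks; the images carry the information

fig.tight_layout()
fig.savefig("small_multiples.png", dpi=150)
```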

But what if we turn that series of photos into a single, looping GIF?

Horse in Motion: GIF edition

Now here’s the Horse in Motion, GIF edition. Note that rather than repeating this image over and over in space, as small multiples do, we’ve just repeated it over and over again in time — forever. That GIFs repeat forever is important — in a sense GIFs introduce onto the web the notion of the infinite. Where does it start? When will it stop? When do you leave? Maybe you should stay!
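In code, the difference from small multiples is simply that the frames are stacked in time rather than laid out in space. Here is a minimal sketch using the Pillow library in Python; the frame files are again hypothetical placeholders, and loop=0 is the setting that makes the GIF repeat forever.

```python
# Assemble a sequence of frames into an infinitely looping GIF.
# The frame files are hypothetical placeholders; loop=0 means "repeat forever".
from PIL import Image

frame_files = [f"horse_{i:02d}.png" for i in range(12)]
frames = [Image.open(path).convert("P") for path in frame_files]

frames[0].save(
    "horse_in_motion.gif",
    save_all=True,
    append_images=frames[1:],
    duration=80,   # milliseconds per frame
    loop=0,        # 0 = loop forever
)
```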

And I think it is this repetition, this infinite looping, that makes GIFs such a powerful tool.

So let’s turn to why loops are useful to us, as journalists.

Why are loops useful? Some examples:

To demonstrate why GIFs might be useful to us, let's take a look at examples of loops that exist already. Many of these GIFs help explain a process.

Explain a Process

Here's a GIF that explains how a cheetah runs.

Or how a lock works, as well as how to pick one.

Or how a sewing machine works.

Quite a lot of these "how stuff works" GIFs fall into the food category—pretzels, pop tarts, ice cream sandwiches, pasta, dough.


Then we’ve got how chains are made, or paper clips.

More complex ones that show us the inner workings of how creatures breathe.

Or how a fan rotates.

Then some processes that you've probably never even thought of, like how to make a globe. (One could consider this the opposite problem of map projection.)

Or how Superman gets that perfect cape flutter.

Or what's really inside Big Bird.

Or how to draw a bus stop sign like a pro.

Explain a Concept

GIFs can also help us explain more abstract things.

Like the Pythagorean theorem or sorting algorithms.

How zip codes actually work (it's probably not what you think, just ask Jeff Larson)

Or common passwords.

It's as bad as you thought.

Or how positioning in CSS works.

If only this gif had existed years ago...

Or how to build better bar charts, pie charts and tables.

Tufte, in a gif.

Show Probability

GIFs can also help us show probability and chance in more intuitive ways.

Here's a looping graphic that accompanied a New York Times article about the jobs report. On the left-hand side is a hypothetical chart of what the unemployment rate could be, and on the right is what the jobs report would look like given that rate. So even if job growth were totally flat, the jobs report could look like it’s going up or down (or some combination).

Using loops to explain uncertainty in the jobs report.

Here's another example where pressing a button gives you a different possible outcome of who will win the Senate. The more you spin, the more you see how the probability shifts.

Can't stop spinning...

Why are they effective? Some theories.

Those are just a few examples of journalistic or explanatory GIFs in the wild. But why do they work? Why is repetition so great? What is it that actually happens when an image is looped? Let's take a look at some theories.

Exposure

The mere-exposure effect says we tend to like things we've been exposed to before.

And this tendency to like familiar things begins at a pretty young age—just think of children's stories. Kids love to listen to the same story over and over and over! Things that are familiar are comfortable and predictable, and our tendency to like repetition is how we begin as children to learn to recognize patterns, to pick up new vocabulary, and to make predictions about the future.

It’s also why hit songs are a thing. The first time we hear a song, we might think that it’s just OK. But then we hear it at the gym, at the grocery store, and suddenly we're singing along to it in the car.

Repetition is pretty central to music in general—both internally within songs (the chorus repeats at least a few times) and then externally any time you listen to a song you’ve heard before. Sometimes on repeat.

So glad I'm not the only one.

Besides children's stories, which most of us grow out of anyway, there are very few areas in life where we're absolutely fine with repeating the same thing over and over. If someone told you a joke for the 18th time, it'd likely be very annoying. But if you heard your favorite song for the 18th time, it would probably be just as great! So if anything can help us unravel the mysterious power of loops, it should be music.

Imagination and Expectation

One reason why psychologists think music is so powerful is that we become imagined participants in a song. When you hear a few notes of a song you know, you're already imagining what's coming next—your mind is unconsciously singing along.

In a way this is why the little bouncing ball on old Disney sing-along videos works so well—we can imagine and prepare to sing what comes next. And every time our imagination is right, every time our expectations about what comes next are confirmed, we get a big rush!

We are always one step ahead of the lyrics in a song.

I think there’s some parallel in this type of expectation to visual repetition. Even if we know what comes next, we are somehow surprised over and over again, whether it's Edward Norton waking up or James Stewart and Grace Kelly shifting their gazes.

These will never get old.

Some GIF enthusiasts at the Smithsonian library clearly have had some fun adding little elements of surprise to old images and illustrations. In this example they've managed to turn the common flying squirrel into...a legit flying squirrel!

Or making illustrated whales actually swim, or just having a little fun with skeletons, or making a part of the page come alive.


Thank you, Smithsonian.

There is also something expectational in this story, which is told as a text message conversation. You have to keep clicking to read the next text and see how the story unfolds. It even comes complete with the little dots that indicate someone is typing (while you're anxiously waiting to see her reply). It's very hard to stop clicking again and again to see the next text, the next text.

The mystery unfolds via text.

Shifting Attention

Another theory as to why repetition is so effective is that it allows us to shift our focus each time.

Each time we listen to a song, we're not necessarily listening to it the same way — we shift our attention from one aspect to another. First perhaps we listen to the melody, then to a guitar riff, then to some particularly interesting section of lyrics. In this way we're never really focusing on the same part of a song each time. We hear different aspects of the sound on each new listen.

Our attention shifts between aspects of sound.

This happens in language too. "Semantic satiation" is the term for what happens when you repeat a word over and over again and suddenly the word stops having any meaning at all. Repetition effectively makes you stop focusing on what the word means, and instead focus on what the word sounds like. So repetition can really open up new elements of sound, not accessible on first hearing.

The implications of "semantic satiation."

A similar phenomenon happens in the visual realm. When I see a GIF repeat over and over, I can first focus on one part, then another, then another. I start to notice new things. Remember those spot the difference games, in which you have to detect all the visual discrepancies between two pictures? Notice that it’s a lot easier to see the differences when the two images loop back and forth than when they are just side by side. Looping makes us notice differences, because our attention can shift around.

This bodes well for GIFs that show changes over time because by repeating these images with lots of moving parts, we can notice new things each time.

So here's a look at how U.S. territories got added over the years, how Boston was filled in with landfill, how the New York subway system grew over time, how space junk has accumulated over the past decades, the ebb and flow of the seasons year after year, the baby boom, even how the alphabet has evolved from 900 B.C. to the present.

Here’s a GIF that shows the urban sprawl of Walmart. NPR brilliantly used GIFs for the mobile version of this graphic and small multiples for the desktop version. Both versions use repetition (one in space and the other in time), but are perfectly suited to the platform of the user.

NPR swaps gifs for small multiples in the mobile version.

Memorization

I can’t mention repetition and music and not talk about earworms. These are the bits of a song that work their way into our subconscious and then suddenly we can’t get them out of our heads. One theory about earworms says it's our brains trying to work out a melody or song lyrics. We repeat and repeat and repeat so that we can remember the last few words, and when we figure them out (or when we listen to the song again) the earworm disappears.

Disney's "It's a Small World" is a notorious earworm.

But consider this. When we repeat catchy tunes in our minds, we also repeat the lyrics. It’s much easier for us to remember song lyrics than it is to memorize other sorts of things such as speeches. Songs become almost like a "hook" for us to hang words on. In addition to the rhythm, you have the melody: the tune, the ups and downs, and the pitch that the words accompany. This provides a powerful set of cues that help you remember the words, much more effectively than random monotone stretches of speech.

And educators have taken full advantage of this fact to attach some pretty useful words to melodies; to encode information in lyrics. Think of the ABCs. Or the days of the week in Spanish, or the Fifty Nifty United States, or the presidents.

The alphabet song has the same tune as "Twinkle Twinkle Little Star."

We've already harnessed the power of musical loops for learning, so why not harness that same power of repetition in visual ways to teach and inform? If a tune provides a hook on which to hang words, then the question becomes: could visuals provide a hook on which to hang information?

In fact, they already have! Since ancient Greece people have used a mental trick called a memory palace to associate information with images. A memory palace is basically an imagined building in your mind's eye in which you "place" various objects, the crazier the better, that you want to remember. Our brains have a much easier time remembering places and images than words or numbers, and so by associating the two you can memorize all sorts of things. After repeatedly walking through this “palace,” people can recall the most intricate details and trivia years later.

So we know images can help us remember things. I think we can also use these same visual “hooks” for another purpose: to provide instructions. If you need to help someone memorize a sequence of steps and then repeat them later, instructional GIFs are perfect!

So that could be how to dance or tie a knot.

How to tie a bow tie (this is sort of a combination small-multiples-GIF-loop), or how to make assorted baked goods, how to moonwalk (apparently everyone is doing it the wrong way, these GIFs show you how not to), how to golf, or even how to sign various internet slang.

Here's an instructional loop that the New York Public Library put together to help explain how to use its new crowdsourced building inspector tool. It's the first thing you see on the homepage, and it walks you through exactly what you're supposed to do.

NYPL walks you through how to use Building Inspector.

What if public health officials used instructional loops to teach people how to use a defibrillator or how to perform CPR? If we'd had a GIF when we were learning these things — one that showed us exactly how to do them — we might remember them better later on.

Wouldn't this be amazing as a GIF?

Because in addition to providing visual cues for memorizing, loops also show us exactly what order to do things in. This is obviously important when it comes to, say, using a defibrillator. Order is something we pick up on without even really knowing it, kind of like how after listening to a playlist over and over I know exactly what song comes next, even if I didn't try to memorize the order intentionally.

Now, GIFs won’t replace medical training, but I think they could be pretty useful for this sort of instruction. And as far as I can tell, the current state of public health GIFs is pretty grim, because the only thing that turned up when I searched for “public health GIFs” was this little collection. So, progress is possible!

The sorry state of public health gifs.

Transformation

Finally, another reason loops are so powerful is that they can transform something mundane or average into something completely different. To illustrate this, let me tell you the story of Diana Deutsch.

Diana Deutsch is a psychologist at U.C. San Diego who studies how people perceive music and pitch, and how that perception is affected by all sorts of things: whether you're right- or left-handed, where you grew up, even your expectations or beliefs about what you're about to hear.

Typically, Deutsch conducts these experiments by having people listen to tapes that she records and edits herself. So one night she was editing a bit of audio, and left the tape running on a loop while she went to the kitchen to make tea. After a little while she started wondering — what is that singing? She realized it was her own voice, on repeat, which she had mistaken for a song.

So, this seems really strange. We know the difference between speaking and singing, right? Well, let's see for ourselves what happens when we listen to what Diana did.


The music in Deutsch's recording

If you are like most people, at some point the phrase “sometimes behave so strangely” started sounding like a melody. And just in case you think you're the only one, see how the exact same thing happened to these children.

What is especially insane about this speech-to-song illusion is that you can never unhear it — you will ALWAYS, from now on, hear "sometimes behave so strangely" as a song.

Now, this powerful illusion tells us that repetition is so deeply rooted in music that we can turn words into music merely by repeating them in a loop!

And… if we can repeat a few words over and over to become music, what happens when we repeat other things over and over?

Applying Loops More Widely

A year or two ago, Giphy co-founder Alex Chung and video artist Paul Pfeiffer asked the same question. They began by thinking about a loop as a function. If you put sound through that loop function, you would get music. If you put an image or video through it, you would get a GIF. But then they took this premise to its logical extreme. What if you put thought through it? What if we applied this loop to ourselves?

The loop function.

Their answer to the question "what does repetitive thought look like?" was hypnosis. If you think about it, this idea makes some sense in the contexts of loops and repetition. The classic icons for hypnosis are typically a mesmerizing clock swaying back and forth, or psychedelic spirals that zoom forever.

But hypnosis also contains the concept of a meditative state. Saying the same thing over and over, repeating mantras, chanting Om: these are all the loop function applied to human thought. And what's more, Chung and Pfeiffer thought, the process of hypnosis is basically the process of gaining access to your unconscious mind. With your brain in this hypnotic, suggestible state, it's possible to erase or override non-productive patterns and replace them with useful ones.

This idea also seems to make sense when we think about repeated images in other contexts. Exposure therapy is a behavioral therapy technique that's all about repeatedly exposing a patient to feared objects or contexts in order to overcome those fears. The idea of desensitization, that repeated exposure to a particular image or particular types of images could make you less sensitive to them, depends on this notion that repetition can be transformative — it can fundamentally change you.

So while Chung and Pfeiffer were thinking about how looped images could help reprogram our minds, they came up with a crazy new startup idea, which they called GIPHNOSIS: using GIFs to reprogram yourself.

GIPHNOSIS

Specifically, they thought that by using GIFs as screen savers they could transform people's moods in all kinds of ways. So whether it was desensitizing you to particularly ghastly horrors or boosting your mood by showing you adorably coordinated cats, GIFs, they thought, had immense power.


Mood-altering gifs were the main idea behind GIPHNOSIS.

Even if GIPHNOSIS wasn't exactly a successful startup, it’s fair to say that visual loops can be incredibly transformative. Whether they are used for good or for bad, they are powerful tools, and we’ve already seen some of the ways they can be used.

GIFs in the Future

But I am pretty confident that there are many more ways to use GIFs for journalism. And while I'm not sure what sorts of forms GIFs will take in the future, I urge you to think of ways to bring loops into the world of storytelling on the web in a purposeful, insightful, or just plain humorous way. Because who knows what sorts of impossible or magical or transformative experiences we can create — all with the power of loops.

Introducing FEC Itemizer: A Tool to Research Federal Election Spending


The Federal Election Commission has long made filings that show federal candidates’ fundraising and spending available to the public. But it’s not always easy to find the latest information, such as who’s donated money and where it’s being spent. Today we’re releasing FEC Itemizer to make this work easier for journalists, researchers and citizens.

The interactive database provides a simple way to browse individual contributions and expenditures reported by federal political committees shortly after they are submitted.

Most committees file reports on a regular schedule – monthly or quarterly. With so many committees active in the election, that means new or amended filings are published every day. The data in the reports can be rich, showing who donates money to the committee and where the committees spend their money.

If you know the name of a committee that you’re looking for, NextGen Climate Action for example, you can search for it and then see filings for a particular two-year election cycle. Or you can browse filings by date. Looking at filings is easy, too. You can sort the itemized records by amount, date or name, and if you find something worth remembering, you can get a permalink to each record.

FEC Itemizer relies on The New York Times Campaign Finance API, which checks for new filings every 15 minutes. These filings aren’t considered official by the FEC, which publishes official data every Monday morning, but they are the first look at detailed financial information from federal political committees. Not all committees are included: Senate candidate and party committees still file their reports on paper. But if you’re looking for presidential committees, national and state parties, or super PACs, FEC Itemizer will have them.

If you’re looking for presidential activity, campaigns (like Hillary Clinton’s) and their supporting super PACs (like Right to Rise USA, which backs Jeb Bush) have filed reports covering the first six months of the year. Most committees involved in the presidential race will next file in October, but independent committees airing television ads or paying for canvassers need to file on a more frequent basis.

FEC Itemizer started life as a personal project that I worked on with Aaron Bycoffe, who now works at FiveThirtyEight. This is its first release for a wider audience.

If you have any questions or see anything amiss, let us know in the comments below or email me at derek.willis@propublica.org.

The Stories of Everyday Lives, Hidden in Reams of Data

This article was written for the Knight Foundation website, and is also published there. The foundation is running a Knight News Challenge on Data, which will award $3 million to innovative ideas about making “data work for individuals and communities.” Winners will be announced in January 2016.


“A story, if it’s working, is always an answer to the question, ‘How should I live my life?’”

–Ira Glass

Humans are natural storytellers. Journalists simply do it for a living.

That goes for data journalists, too. Just like our newsroom colleagues who write traditional narrative journalism, we’re telling stories to readers even when it looks like we’re just presenting data.

Take, for instance, Debt by Degrees, a project ProPublica launched this month. Built by my colleagues Sisi Wei and Annie Waldman, it uses recently released U.S. Dept. of Education data on student debt. It’s built upon deep, ambitious data – more than 1,700 fields for each of about 7,800 colleges and universities in the U.S. The data’s creators at the White House were thoughtful stewards on a mission to help America’s young people make better choices when it comes to picking a college.

“Many existing college rankings reward schools for spending more money and rejecting more students,” said President Obama in his radio address announcing the new data, “at a time when America needs our colleges to focus on affordability and supporting all students who enroll.”

Here’s the gist of it: One of the reasons colleges are considered a public good, and why many are tax-exempt charities, is the economic benefit they confer to their students. College graduates have a far lower unemployment rate and earn a lot more money than people who never went to college. College degrees are, as the president said, “the surest ticket to the middle class.”

But while some schools are excellent at providing a first-class education to poor kids, many schools could be doing a better job at helping them avoid big student loan debt. That’s the story we tried to tell in our interactive database. Or rather, that’s the story we helped our readers tell for themselves, using the data on the schools they know best.

In Debt by Degrees, readers can look up virtually any school in the country and see things like the discount that it gives its poorest students and the amount of debt the poorest students take on to go there. The results range from the mundane to the shocking: elite institutions with deep pockets that admit too few poor kids; community colleges dealing with chaos and a student population struggling to remain in school at all; religious schools with a mandate to teach the poor that find it difficult to make ends meet, let alone provide an education to those who can’t afford it.

As reporters, it’s our nature to talk with the people we’re writing about, and it’s our professional responsibility to make sure we’ve heard from people who know things better than we do. So we talked to experts and college administrators, who told us that the school’s endowment mattered a lot when it came to how much aid they were able to give. Harvard, with its enormous endowment, makes sure its poorest students are not saddled with much debt at all, though it admits far fewer of them than other schools its size. While this isn’t the only factor that goes into how much a school can help, we decided it was important enough to include in our interactive database to help readers understand the “why” question.

The data we got from the Department of Education didn’t include endowment numbers, but luckily we found a list of the schools with the largest endowments and were able to match the data sets.

We made some other decisions – inverting one of the new data points that measures “repayment rate” of student debt into “nonrepayment rate” so that it would make more sense for people comparing it to “default rate,” the old, less accurate measure.

In designing the interactive database, we carefully structured each page so that the most important data points are at the top. As readers scroll down the page they’re taken step by step through a chronology – finances during school, after school and years later. You might recognize this as the format for many news stories; our pages start with an inverted pyramid “lede” followed by a narrative that runs chronologically. We inherit our techniques from the same traditions as do our narrative colleagues.

Of course, none of this work would have been possible if the government hadn’t taken the time and spent the resources to provide it. They faced pretty big obstacles to doing so; the data was collected as part of a project that was scaled back after intense pressure from colleges and universities.

We believe strongly that people have the capacity to understand complex, large-scale data, if they’re given the chance, and if they’re given a little help finding out why it matters to them.

That’s where the real opportunity lies – at the intersections between open data and storytelling. Hopefully, that’s where you come in. The Knight News Challenge on Data asks the same question, in a way, that we ask every day on my team: How might we make data work for individuals and communities? Our job as a society isn’t done when we’ve made the data available, though that’s an absolutely worthy and courageous – not to mention mandatory – endeavor. It’s also our responsibility to help people understand it, with all of its complexities and flaws. In short, to help people use data to help answer that question: How should I live my life?


A More Secure and Anonymous ProPublica Using Tor Hidden Services

Update, January 15: Our configuration has been updated; the walkthrough now notes that you can use Unix sockets for HiddenServicePort.


There’s a new way to browse our website more securely and anonymously. To do it, you’ll need a bit of software called the Tor Browser. Once you’ve got it installed, copy and paste this URL into the running Tor browser: http://www.propub3r6espa33w.onion/

This is called a “Tor hidden service.” Tor is a network of internet relays (and a web browser that uses the network) that protects your privacy by hiding your browsing habits from your internet service provider, and hiding your IP address from the websites you visit. A Tor hidden service is a special type of website that can only be visited with Tor, masking your digital trail as much as possible. (Disclosure: Outside of my work at ProPublica, I’m also the developer of Onion Browser, an unofficial Tor Browser for iOS.)

We launched this in part because we do a lot of reporting, writing, and coding about issues like media censorship, digital privacy and surveillance, and breaches of private medical information. Readers use our interactive databases to see data that reveals a lot about themselves, such as whether their doctor receives payments from drug companies. Our readers should never need to worry that somebody else is watching what they’re doing on our site. So we made our site available as a Tor hidden service to give readers a way to browse our site while leaving behind less of a digital trail.

We actually launched it quietly as an experiment last year, shortly after publishing Inside the Firewall, an interactive news application about online media censorship in China. While we’re not aware of any countries currently blocking access to ProPublica, I was curious to see what we could do to improve access to readers if that ever happens.

While using our Tor hidden service greatly increases your privacy, it’s important to note that it is, for the most part, the same website people see on the regular Internet. Like all websites, ours contains embedded multimedia and code from external services like Google Analytics, Facebook “Like” buttons, etc. – important tools that help us engage with our audience and quantify how well we’re doing. (Our privacy policy outlines some of the things we use.) While we are still thinking through how to handle these things in our hidden service, the Tor Browser does obscure the identifying metadata that these external services can see, like your IP address, your location, and details about the computer you are using. And if you want to maximize your anonymity by blocking those external services, it’s easy to do yourself in the Tor Browser by increasing the “security level” to “high.”

About Tor & Hidden Services

A Tor hidden service (sometimes called an “onion site” or an “onion service”) has a special domain name that ends in .onion – like propub3r6espa33w.onion – that you can only connect to using Tor. These special websites and services use strong encryption (even if the URL doesn’t start with https), mask metadata like the IP address of the user, and even mask the address of the site they’re visiting.

Collectively, sites like these are often referred to as being part of the “dark web” though the term is contentious in the Tor developer community, thanks to its association with sites like Silk Road, an illicit online drug market that was seized by the FBI in 2013. But regardless of how it’s misused, the dark web has legitimate and even critical utility in keeping the Internet safe and private:

  • ProPublica and several other journalism and human rights organizations use SecureDrop to allow sources and whistleblowers to safely transmit sensitive files.

  • The email hosting service Riseup uses onion services to allow users to access their email ultra-securely.

  • The chat program Ricochet uses onion services under the hood to allow users to securely chat with each other without relying on any central servers.

  • Facebook launched an onion site in 2014 to improve access to Facebook over Tor, with an eye toward privacy-conscious users and those in countries where Facebook is blocked.

How is a Hidden Service Different?

You are probably already used to using a secure browser when browsing many sites, especially when banking or shopping. Your web browser lets you know when a site uses “HTTPS” by displaying a lock in the address bar. How is a Tor hidden service different, and why is it more secure than HTTPS?

Browsing Normally

When you’re on a site that uses HTTPS encryption, the connection between your web browser and the site is secure, but important metadata is still visible to servers that can observe your connection, like your ISP or the wifi router you use at a coffee shop. Those can know, for instance, what sites you visit, and can see any unencrypted images and scripts that get loaded in your browser.

Browsing Normal Sites Using Tor

Tor provides some anonymity by relaying traffic through different servers as you browse. This makes it seem to a web server that you are coming from somewhere else. Tor picks three random relays and routes your traffic through each. No relay gets the “whole picture” (that you are visiting propublica.org), because Tor encrypts your connection three times before sending it out: the first layer can only be decoded by the first relay, the second layer can only be decoded by the second, and so on.

Think of this way of layering encryption through relays as like the layers of an onion. Hence the name “Tor,” which was originally an acronym for “the onion router.” Though the longer name has fallen out of use, the onion metaphor is still pretty common when discussing Tor and software that uses it.

When you’re browsing using a Tor browser, your ISP only knows you are using Tor, not what sites you’re visiting or what you’re doing, even when you’re connecting to a non-HTTPS site. The first relay knows your actual IP address and ISP, and knows the address of the second relay. The second relay knows about the first relay and third relay but can’t decrypt your data to see what you’re doing. The third relay (or “exit relay”) knows about the second relay and the site you are going to, and it can see any unencrypted data that you’re browsing. It’s possible for the sites you visit to know that you’re using Tor because the list of exit nodes is openly known, but they have no way of knowing your real IP address.

Although the exit relay that sends your Tor connection to the normal internet does not know your IP address (since your connection was forwarded to it by another relay), it has access to your metadata, like which sites you are visiting, and unencrypted data because it needs to know how to pass your request on to a desired website.
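
To make the layering concrete, here is a toy Python sketch (using the third-party cryptography package) of a message being wrapped in one encryption layer per relay and then peeled one layer at a time. It illustrates only the layering idea described above, not Tor's actual circuit cryptography:

# A toy illustration of onion-style layered encryption, assuming the
# third-party "cryptography" package is installed. This is NOT Tor's
# real protocol; it only demonstrates the layering idea.
from cryptography.fernet import Fernet

relay_keys = [Fernet.generate_key() for _ in range(3)]  # entry, middle, exit

message = b"GET / HTTP/1.1"
wrapped = message
for key in reversed(relay_keys):          # wrap for the exit relay first, entry relay last
    wrapped = Fernet(key).encrypt(wrapped)

for i, key in enumerate(relay_keys, start=1):
    wrapped = Fernet(key).decrypt(wrapped)  # each relay peels exactly one layer
    print("relay", i, "peeled a layer")

assert wrapped == message                 # only after the last layer is the request visible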

Browsing Onion Sites Using Tor

An onion site uses the encrypted and anonymous Tor connection from your computer all the way to the websites you visit. Just as before, Tor picks three random relays, but in this case, a copy of Tor we’re running also picks three random relays and the relays meet in the middle.

Your ISP knows you are using Tor. As before, the first relay knows that you are a Tor user, knows your IP address and ISP, and knows the address of the second relay. The chain of relays, which know only the connections before and after them, continues as before, except now there are six of them.

As in normal Tor use, none of the relays between a user and the website sees the “whole picture.” But the onion site connection never has to leave the Tor network and touch the normal Internet, which is where metadata gets exposed. To a relay, both a user and our website look like normal Tor clients, and no relay knows any more than that.

More technical detail, such as how the two chains know how to meet, can be found here and in the Tor design paper.

How to Run Your Own Hidden Service

If you run a website and want to run a Tor hidden service, here’s how. I’ll assume from here on out that you’ve got a fair bit of technical knowledge and understand how to run a web server, use the command line and edit configuration files. The Windows command-line “expert” version of Tor is a bit finicky and I don’t have a lot of experience using it, so for now, these instructions will be Mac OS X and Linux-specific. (Are you a Windows Tor expert and interested in helping me write Windows-specific sections of these docs? Please get in touch!)

The following instructions will help you set up a demonstration onion site on your own computer. For a production site, there are a few other things you’ll want to consider that I’ll discuss toward the end.

Step 1: Install Tor

First, you’ll want to install a command-line version of Tor.

The easiest way to do this is to use your package manager to install Tor — Homebrew on Mac, apt-get or yum or whichever manager you use on Linux. The invocation is usually something like brew install tor or apt-get install tor.

If you’re on Mac OS X, you will be prompted to optionally run several commands to have launchd start tor at login, after installing. You can skip this for now.

Step 2: Configure a Tor Hidden Service

Once installed, edit Tor’s configuration file. For OS X Homebrew, you’ll want to create the file at /usr/local/etc/tor/torrc. For Linux, you’ll generally find that the configuration already exists at /etc/tor/torrc. Open that in your code editor of choice.

Add two lines like this:

HiddenServiceDir /tmp/test-onion-config
HiddenServicePort 80 127.0.0.1:3000
  • HiddenServiceDir: The directory containing test-onion-config needs to exist and needs to be owned by the user running Tor. On Mac OS X, Homebrew will install & launch Tor as your current user, so using /tmp/ is fine (since this is just a test demonstration). If you wish to configure an onion site in OS X that won’t disappear when rebooting, you can use something in your home directory, like /Users/<your_username>/Code/my-onionsite-config. On a Linux machine, you can use something like /var/run/tor/test-onion-config; in Ubuntu, /var/run/tor is already owned by the debian-tor user that runs the Tor daemon.

  • HiddenServicePort: This routes the inbound port at the xxxxxxxxxxxxxxxx.onion to an IP address and port of your choice. (It should be an IP address and it’s recommended that this route to the local machine. It can also be a Unix socket on the local filesystem.) In the HiddenServicePort 80 127.0.0.1:3000 example, a user accessing your http://xxxxxxxxxxxxxxxx.onion/ (implied port 80) would be served by the web app running at port 3000 on your computer.

    Unless your underlying website uses some authentication, a Tor hidden service configured like this will be readable by anybody who knows the address. So if you don’t want to test this with a web app you already have on your computer, you can create a simple test by doing:

    $ mkdir /tmp/test-onion-content
    $ cd /tmp/test-onion-content
    $ python -m SimpleHTTPServer 3000   # Python 2; on Python 3, use: python3 -m http.server 3000
    

    This will serve the contents of /tmp/test-onion-content at 127.0.0.1:3000, and also at the onion site address being configured.

(You can check out the Tor manual for more information about torrc config lines.)

Step 3: Access the Tor hidden service

If you aren’t running your test app on port 3000 (or whichever you chose) yet, do that now. Then start (or restart) Tor:

On Mac OS X, just run tor in a terminal window. If you previously installed Tor with Homebrew and followed the steps to copy plist files to have launchd start tor at login, you can run the following to restart it:

$ launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.tor.plist
$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.tor.plist

Depending on your flavor of Linux, you’ll need to do one of the following (or something similar):

$ sudo service tor restart
# or
$ sudo systemctl restart tor.service

If all went well, Tor should be running. (If running on the terminal, you’ll see Bootstrapped 100%: Done at some point. Otherwise, you can usually see the status by looking at the Tor log file — /var/log/tor/log, depending on your flavor of Linux.)

Now you’ll find that the directory named in HiddenServiceDir (i.e., /var/run/tor/test-onion-config or /tmp/test-onion-config) has been created.

Inside that directory, you’ll see two files: hostname and private_key.

If you open the hostname file, you will see that it contains an .onion address. If you open Tor Browser and try to visit it, you should now see your website.
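
For example, with the test configuration above, you can print the generated address from a terminal:

$ cat /tmp/test-onion-config/hostname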

More Tor Hidden Services

The torrc file can contain more than one hidden service, and hidden services can also operate on several ports. For example, Facebook’s facebookcorewwwi.onion listens on both HTTP (port 80) and HTTPS (port 443). In cases like this, a torrc file will look something like this:

HiddenServiceDir /var/run/tor/main-onion-site
HiddenServicePort 80 127.0.0.1:80
HiddenServicePort 3000 127.0.0.1:3000

HiddenServiceDir /var/run/tor/other-onion-site
HiddenServicePort 80 127.0.0.1:9000

In this case, the “main” site will serve two ports: http://xxxxxxxxxxxxxxxx.onion/ and http://xxxxxxxxxxxxxxxx.onion:3000/ (routing to what is running on ports 80 and 3000 locally). The “other” site will be available at http://yyyyyyyyyyyyyyyy.onion/ (routing to what is being served at port 9000 locally).

A little-known secret is that you can also use subdomains with onion sites: the web server that listens to connections from the HiddenServicePort just needs to respond to the hostname. This works because recent versions of Tor will handle a connection to www.xxxxxxxxxxxxxxxx.onion as a connection to xxxxxxxxxxxxxxxx.onion, and your browser will state the subdomain it wants as part of the request inside that connection. You can see an example onion site subdomain configuration here.

Custom Hidden Service Names

You may have noticed that we didn’t configure the onion name that served our example site. Given a HiddenServiceDir without a private_key file inside, Tor will randomly generate a private_key and hostname. The 16 characters of the hostname before .onion are actually derived from this key, which allows Tor to confirm that it is connected to the right hidden service.

There are a few tools that allow you to generate a private_key in advance to get a predictable name: Shallot and Scallion are two popular options. Given an existing torrc with an already-configured HiddenServiceDir, you can delete the existing hostname file, drop in the new private_key file, and restart Tor to use your new onion domain.

It’s debatable whether this is a good idea, since it may train users to look for the prefix and ignore the rest of the domain name. For example, an evildoer can generate a lot of propubxxxxxxxxxx.onion domains — how do you know you’re at the right one? Facebook works around this issue by having an SSL certificate for their hidden service to provide a strong signal to users that they’re at the correct onion site. We’re working on adding this to our onion site, too.

When in doubt, a user should try to confirm an onion site’s domain by corroborating it at a variety of sources. (You can find a GPG-signed file confirming our onion addresses, here.)

Running in Production

There are a few extra things to think about when running a Tor hidden service in production:

  • While you can run your hidden service on your laptop or workstation, you’ll likely want to use an always-on machine to act as a server. Otherwise, the hidden service goes offline when your computer does.

  • Your HiddenServiceDir should be relatively well-protected. If someone else can see your private_key, they can impersonate your hidden service. On a Linux machine, this is done by making sure that only the user running tor (debian-tor on Ubuntu) can access this directory.

  • The target of your HiddenServicePort should preferably be on the same machine as the web server. While you can map this to any IP address, terminating this connection locally reduces the chances of leaking metadata.

  • You may want to consider where your hidden service is installed. To avoid leaking traffic metadata as much as possible, you can choose to put the hidden service on the same machine as your website so that no Tor traffic has to leave the machine to access the website. Our hidden service is currently hosted on a machine located at the ProPublica offices (and not at our website hosting provider); this is mostly to help us debug issues, but also has the benefit of keeping full control of the machine hosting the hidden service and related log files. In terms of traffic metadata, this mixes encrypted HTTPS traffic from the hidden service with encrypted HTTPS traffic from our own use of the website. I think that’s an acceptable tradeoff (versus leaving hidden service logs available to our web host), but we may re-examine this in the future.

  • If you want to mirror an existing website, it’s worth taking stock of the assets and resources that get loaded on your pages. We’ve made an effort to provide onion services for several subdomains that we use to serve our own assets. But news organizations also publish items that use external media — audio clips, videos, social media posts — to strengthen a story, and we use analytics to measure and understand our audiences. Having these external resources has ramifications (since some of the traffic no longer uses a hidden service and relies on using Tor to access the resource’s normal site) and it’s worth considering this issue on your own site. (As mentioned near the top of this post, there are features in Tor Browser that allow a user to block many of these resources.)

  • Current versions of Tor don’t provide any way of load-balancing large amounts of traffic. So even if you host your content with a production web server such as Apache or nginx, the hidden service endpoint is a single point of failure that can’t currently be scaled up. But Tor developers are working on a way of fixing this in an upcoming version.

  • SSL certificates are more difficult to acquire for Tor hidden services than normal domains. They can be issued, but must be extended validation (“EV” or “green bar”) certificates and undergo more thorough verification than proving ownership of a normal domain name. (We plan to go through this process, and we’ll update this post as we do so.)

Our Hidden Service Mirror

Putting together all of the above, you can get something like our current hidden service.

We use local Unix sockets (instead of an ip:port) for the local connection between Tor and nginx. Other than that, there’s nothing too special about our torrc, which you can find here.
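
For reference, a HiddenServicePort line that targets a Unix socket looks something like this; the directory and socket path here are placeholders, not our actual configuration:

HiddenServiceDir /var/run/tor/my-onion-site
HiddenServicePort 80 unix:/var/run/tor/my-onion-site.sock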

The hidden service running ProPublica’s site at propub3r6espa33w.onion speaks to an instance of nginx that handles routing our subdomains and some processing — such as rewriting “www.propublica.org” links to instead use the onion domain — before proxying on to our normal web server. (In Ubuntu, you can install a version of nginx that contains the extra rewrite modules by installing the nginx-extras package.) You can see this configuration here.

You might notice that our hidden service does experimentally listen to HTTPS connections, but we’re currently using a self-signed key for that, which can cause a combination of browser errors and assets not loading if you try to visit our onion site that way. We’re working on getting a valid SSL certificate for our hidden service, and that should hopefully be fixed sometime soon.

If you have any concerns or feedback about this tutorial or the configuration I’ve shared, please get in touch. (My PGP key is 0x6E0E9923 and you can get it here, here, on Keybase and on most keyservers.)

Learn Data, Design and Code for Journalism. Apply for ProPublica’s Summer Data Institute. It’s Free!

ProPublica is proud to announce its first-ever Summer Data Institute, a free 10-day intensive workshop on how to use data, design and programming for journalism. The workshop will be from June 1st to June 15th in ProPublica's New York offices. The deadline to apply is March 31st. Apply here.

Geared towards journalists and journalism students, this workshop will cover everything from finding and analyzing data, to using colors and typography for better storytelling, to scraping a website using code. By the end of the Institute, students will have created an interactive data project from beginning to end, with help and guidance from some of the best designer/developer/data journalists in the world.

ProPublica's News Apps team has worked on everything from colleges that saddle poor students with debt and doctors who take money from drug companies to how much limbs are worth in different states and even investigative space journalism. The workshop will cover, step by step, how ProPublica brainstorms, reports, designs and builds these types of interactive graphics and data-driven news applications.

One of the reasons we're so excited about this workshop is because it is another step ProPublica is taking to increase the diversity of its own newsroom and beyond. That means training and empowering journalists from a broad array of social, ethnic, and economic backgrounds. We are particularly dedicated to helping people from communities that have long been underrepresented not only in journalism but particularly in investigative and data journalism, including African Americans, Latinos, other people of color, women, LGBTQ people, and people with disabilities.

The workshop is completely free to attend and ProPublica will provide lodging and cover roundtrip travel costs to New York City, as well as local travel costs to and from our offices. Additionally, to make the Summer Data Institute accessible to people for whom it would still be economically out of reach, ProPublica is offering a limited number of stipends, up to $1,000. Requests for stipends are part of the application.

The Summer Data Institute is made possible by a grant from the John S. and James L. Knight Foundation.

So don't wait, apply now! Or email this description to someone who you think should apply.

If you have any questions, email data.institute@propublica.org.

Apply!

Meet the New ProPublica Campaign Finance API, Same as the Old API

Beginning today, ProPublica is launching a Campaign Finance API to help researchers, journalists and software developers cover election fundraising and expenditures.

An API, or Application Programming Interface, is a language that two programs can use to communicate and trade data. Programmers can use it to access data from a website or Internet service more easily.

We’re assuming responsibility for an API that was previously published by The New York Times. If you used The Times’ Campaign Finance API, your code will continue to work for a short time, but you should migrate immediately. Keep reading for details on how.

New users can sign up for a free API key by emailing apihelp@propublica.org. The API provides information on committees and candidates that file records with the Federal Election Commission, with an emphasis on those committees that file electronically (almost every federal committee except most Senate campaigns).

The API is updated with new electronic filings every 15 minutes and with summary data published by the FEC once a day.

The ProPublica Campaign Finance API powers our FEC Itemizer database, which non-programmers can use to find and browse electronic campaign finance filings as soon as they’re filed. Taking over the API will help us add new features to this project.

The API includes information about electronic filings, which are submitted to the FEC on nearly every day of the year. The API provides details about specific types of filings, filings for a specific date and a summary of financial information in each filing.

The API does not include itemized contribution records except in some specific circumstances. If you’re looking to search for contributors, the FEC and the Center for Responsive Politics make bulk data available, and the FEC’s new beta site also has an individual search.

The FEC has provided this data in bulk for decades, and has recently launched a beta API of its own that includes candidates, committees, filings and some itemized transactions. There is some overlap between the FEC’s API and ProPublica’s: both let users search for candidates or committees and retrieve summary financial information, and each offers information that the other does not.

One big difference is timeliness: the FEC API is updated nightly, while ours will be updated throughout each day. For many users of campaign finance data, that distinction may not be a big deal, but on filing days, when thousands of filings are submitted to the FEC, timeliness can matter a lot. Another is the source data: the FEC considers electronic filings to be “unofficial” in the sense that data from them is then brought into agency databases before being published as bulk data. The FEC API publishes data only from those official tables, while the ProPublica API has data from both the official tables and the raw electronic filings. As the FEC API develops, we may move to incorporate aspects of it into the ProPublica API, or to remove clearly duplicative offerings.

If you are a current user of The New York Times Campaign Finance API, we want to make this transition as easy as possible. The Times’ API will be shut down in a few days, so we encourage you to sign up for a new API key to use the ProPublica API. Requests will look similar, but not identical, to the previous ones. For example, to search for committees with the word “tomorrow” in the name, The Times API call would have been:

http://api.nytimes.com/svc/elections/us/v3/finances/2016/committees/search.json?query=TOMORROW&api-key=NYT_CAMPAIGN_FINANCE_API_KEY

The URL structure for that request using the ProPublica Campaign Finance API is:

https://api.propublica.org/campaign-finance/v1/2016/committees/search.json?query=TOMORROW

There are a couple of important changes to requests: First, the version is “v1” rather than the previous “v3” since this is the first iteration of the API under ProPublica. Second, the API key isn’t passed as a query string; instead it is sent as a header with the request. From a command line interface using curl, it would look like this:

curl “https://api.propublica.org/campaign-finance/v1/2016/committees/search.json?query=TOMORROW” -H “X-API-Key: PROPUBLICA_API_KEY”
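
The same request can also be made from a script. Here is a minimal sketch in Python, assuming the third-party requests package; the key value is a placeholder:

# Minimal sketch of a ProPublica Campaign Finance API request from Python,
# assuming the third-party "requests" package. The key below is a placeholder.
import requests

API_KEY = "PROPUBLICA_API_KEY"  # use the key issued to you
url = "https://api.propublica.org/campaign-finance/v1/2016/committees/search.json"

response = requests.get(
    url,
    params={"query": "TOMORROW"},
    headers={"X-API-Key": API_KEY},  # the key is sent as a header, not a query string
)
response.raise_for_status()
print(response.json())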

The API will continue to return JSON and XML formats for requests, and callbacks will be supported for JSON responses. Documentation of all available requests is here.

We’re grateful to Chase Davis of The Times for his help in making it possible for ProPublica to acquire and host the API. We’re planning on adding some new data points and enhancing others, particularly where we can add value through custom calculations or categorization. If you have questions or comments about the transition, or the API going forward, please don’t hesitate to let us know. We’d also love to hear your ideas and requests, either in the comments below or at apihelp@propublica.org.

Upgrading FEC Itemizer for the 2016 Campaign

We’ve upgraded our FEC Itemizer interactive database to make it faster, particularly when displaying individual contribution and expenditure records.

It’s now integrated directly into the database behind the Campaign Finance API we launched last month. Before that change, FEC Itemizer parsed each electronic filing on the fly (using Fech, a Ruby library built for that task). For filings with a small number of transactions, things didn’t speed up much, but filings with hundreds or thousands of records will now load noticeably more quickly.

For instance, the FEC Itemizer compares the summaries of a filing that amends an earlier one much more efficiently than it used to. Previously, it loaded two electronic filings into memory and then compared their summary information. Now, because we store each filing’s summary data in a database, we can quickly retrieve just that information to make the comparison, even if the filings contain a lot of activity.
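
As a rough sketch of that idea in Python (the function and field names here are made up, not our actual schema), comparing two stored summaries reduces to a field-by-field subtraction:

def summary_diff(original, amendment):
    """Return the change in each financial field between two filing summaries."""
    fields = set(original) | set(amendment)
    return {f: amendment.get(f, 0) - original.get(f, 0) for f in fields}

# e.g. summary_diff({"total_receipts": 1000.0}, {"total_receipts": 1250.0})
# returns {"total_receipts": 250.0}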

We’re also being smarter about not serving up fresh pages when the data itself hasn’t changed since the last time it was requested.

Some filings are still so large that database queries on them are slow. For instance, ActBlue, which helps Democratic candidates raise money online by serving as a conduit for donations, recently posted a filing with 5.3 million contribution records. For ginormous filings like that, we’ve turned off displaying individual records from filings for now, and will be adding bulk per-filing downloads by Feb. 19, one day before the next filing deadline.

We’ve begun adding new features to FEC Itemizer, too. Now when you visit a committee’s page, you’ll get three new fields: summary totals for that committee for the election cycle, plus the last date covered by those figures. As the presidential primaries continue, we’ll be adding pages to track independent spending and other aspects of the 2016 campaign.

Since the launch of the Campaign Finance API, we’ve issued keys to more than 75 users (you can get yours by emailing apihelp@propublica.org) and we have plans to add new responses and data, too. And we’d love to hear your suggestions and requests, either by email at apihelp@propublica.org or in the comments below.

How We Made Hell and High Water

$
0
0

Our interactive story, "Hell and High Water," includes a map with seven animated simulations depicting a large hurricane hitting the Houston-Galveston region. Four of the scenarios depict a hurricane hitting with existing storm protection infrastructure in place; three envision storm protection projects proposed by a number of universities in Texas.

Five of the simulations were developed by teams at the University of Texas at Austin, and the Severe Storm Prediction, Education and Evacuation from Disasters (SSPEED) Center at Rice University.

  1. A re-creation of Hurricane Ike, a real storm that hit Texas in September 2008.
  2. Storm P7, a simulation that envisions Hurricane Ike making landfall near San Luis Pass, about 30 miles southwest of where it actually landed. According to Dr. Phil Bedient at Rice, the chances of the Storm P7 scenario occurring in any given year are about one in 100.
  3. Storm P7+15 (a.k.a. "Mighty Ike"), which follows the same path as the P7 storm, but with 15 percent higher wind speeds. Bedient told ProPublica and The Texas Tribune that the chances of the Storm P7+15 scenario occurring in any given year are one in 350.
  4. Mid-Bay Scenario, which simulates the P7+15 storm but includes a gate across the middle of Galveston Bay that would prevent storm surge from reaching Clear Lake and the Houston Ship Channel. The Mid-Bay scenario was proposed by the SSPEED Center at Rice University.
  5. Coastal Spine, which simulates the P7+15 storm but includes a proposed 17-foot wall along Galveston Island and the Bolivar Peninsula, and a gate across Bolivar Roads, which would prevent storm surge from entering Galveston Bay. The Coastal Spine is a version of a proposal by Dr. William Merrell at Texas A&M-Galveston.

Two simulations are based on synthetic storms developed by the Federal Emergency Management Agency as part of the RiskMAP flood mapping study and modeled by researchers at Jackson State University. Synthetic storms, as opposed to the P7 storms that were based on Hurricane Ike, never appeared in nature, but are derived from averaging hundreds of actual storms that have hit the Texas coast.

  1. Storm 36 is a synthetic storm that, like some other scenarios, makes landfall near San Luis Pass. According to JSU researchers, the chances that Storm 36 will occur in any given year are 1 in 500.
  2. Coastal Spine (extended) simulates Storm 36 as if it were protected by a 17-foot-tall structure similar to the Coastal Spine, except it extends as far as Sabine Pass to the Northeast and as far southwest as Freeport. The "extended" Coastal Spine is the current version of the Coastal Spine/Ike Dike proposal by Dr. William Merrell at Texas A&M Galveston.

Researchers provided all of the storm data to ProPublica and the Texas Tribune in formats used by the ADCIRC system. Run on powerful supercomputers at UT-Austin and the U.S. Army Engineer Research and Development Center's Coastal and Hydraulics Laboratory (ERDC), ADCIRC takes a variety of storm inputs and is able to derive storm surge, wind strength and wind vectors with high precision, even over local areas. ProPublica and the Texas Tribune worked primarily with three ADCIRC file formats: a fort.14 grid file, a fort.63 water elevation time series file, and a fort.74 wind vector time series file.

The fort.14 grid is a highly accurate 3-D height map of the Texas coast. The grid provided to ProPublica and the Texas Tribune by Dr. Jennifer Proft at UT-Austin was created in 2008 and contains about 3.6 million points. The fort.63 and fort.74 files provided by UT are hourly snapshots during simulated storms starting on Sept. 11, 2008 at 1 p.m. UTC and continuing for 72 hours.

The grid files provided to us by Jackson State University that depict Storm 36 and Storm 36 protected by the "coastal spine" structure are each roughly 6.6 million points (though, within our areas of interest, the JSU and UT grids have about the same number of points). The Storm 36 time series files start on the fictitious date of Aug. 2, 2041 at 10:30 a.m. and continue for 96 hours in half-hour time steps. The Storm 36 (with Coastal Spine) time series files begin on the fictitious date of Aug. 1, 2041 at 1:30 a.m. and continue for 94 hours in half-hour time steps.

The ADCIRC grids are what are called "unstructured grids," meaning the resolution varies at different locations in the mesh. The length of the lines connecting grid nodes varied from about 50 meters to about 1,000 meters within our areas of interest. Because of this variability, the resolution of any given triangle ranges from about 1,250 square meters to about 45,000 square meters.

Variation in grid resolution in roughly our areas of interest (Jackson State University)

Additionally, the water surface elevations in the fort.63 files have a margin of error of about 1-2 feet, according to Bruce Ebersole at Jackson State University.

To display all of these storms on the same timeline, we synchronized the landfall of all the storms around Hurricane Ike's landfall on Sept. 13, 2008 at 7 a.m. UTC. To account for the varying lengths of the time series files among the storms we wished to show on our timeline, we synchronized our storm simulations to begin 16 hours before landfall and continue until 24 hours after landfall. Because the storms provided to us by Jackson State University are saved in half-hour time steps and the UT storms are saved in one-hour time steps, we removed every other snapshot from the Jackson State storms before processing them for our interactive graphic.

Our application focused on an initial 160km-by-120km area of interest surrounding the Houston/Galveston area and smaller areas covering the Houston Ship Channel (32km by 15km), Clear Lake (23km by 20km) and Galveston (9km by 6km).

Processing ADCIRC into PNGs

In order to process the ADCIRC files for display on the web, we had to compress the massive size of the initial datasets provided to us. To do that, we developed a processing pipeline that broke the grids and time series down first into our areas of interest, and then encoded averaged data into the red, green, blue and alpha channel pixels of PNG images for fast delivery to the browser.

First, we used an ADCIRC Fortran utility program called resultscope to "crop" the fort.14, fort.63 and fort.74 files to all of our areas of interest. This cut our "background" area, for example, down to a grid of about 650,000 points. We took these resulting files and encoded them into a series of data-encoded images. For each area (the "background," the Houston Ship Channel, Clear Lake and Galveston), we created a 512x512 pixel PNG image and "packed" the height data into the RGBA values of the image using OpenGL and Ruby. In these "image databases," we encoded the green value as the height of the grid at a pixel location, the blue value as the height after the decimal, and the red byte as a flag indicating whether the point was above or below sea level (NAVD88).
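
As a simplified Python sketch of that packing scheme (the production pipeline used OpenGL and Ruby, and the exact scaling may differ), a single elevation value can round-trip through the red, green and blue bytes like this:

def pack_height(height_m):
    """Pack one elevation (meters, NAVD88) into R, G, B bytes: R flags
    below-sea-level points, G holds the whole meters, B the fraction."""
    below = height_m < 0
    magnitude = abs(height_m)
    r = 255 if below else 0
    g = min(int(magnitude), 255)
    b = int(round((magnitude - int(magnitude)) * 255))
    return r, g, b

def unpack_height(r, g, b):
    """Recover the approximate elevation from the packed bytes."""
    magnitude = g + b / 255.0
    return -magnitude if r == 255 else magnitude

# pack_height(2.4) -> (0, 2, 102); unpack_height(0, 2, 102) is roughly 2.4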

In order to format this data efficiently for transport over the Internet and manipulation in a web browser, we also "packed" each hour of the time series into PNG images, with the red and alpha values storing the wind x and y vectors respectively, the green value containing the water height at that location and the blue value containing the height of the water after the decimal. To further optimize the images for delivery over the web, we composited the time series images into sets of 1024x1024 images, each containing four 512x512 images. Each storm, therefore, would be a single 512x512 grid image and ten 1024x1024 time series images, comprising 40 hours of the storm. In the application, we have seven sets of those images (one for each storm) for every area of interest. The entire app comprises 28 sets, or 308 images total. (If we had not collected the time series into quadrant images, there would have been 1,148 images altogether.)
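
The quadrant compositing step can be sketched with Pillow in a few lines; the file names and layout here are illustrative, not our production pipeline:

# Composite four 512x512 hourly images into one 1024x1024 quadrant image,
# assuming the third-party Pillow package. File names are hypothetical.
from PIL import Image

TILE = 512
hour_images = ["hour_00.png", "hour_01.png", "hour_02.png", "hour_03.png"]

quad = Image.new("RGBA", (2 * TILE, 2 * TILE))
for (x, y), path in zip([(0, 0), (TILE, 0), (0, TILE), (TILE, TILE)], hour_images):
    quad.paste(Image.open(path), (x, y))   # one 512x512 hour per quadrant

quad.save("hours_00-03.png")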

Because every storm was compressed into the same 512x512 resolution, the accuracy of each area of interest varies. For the background, the accuracy is averaged over 72,072 square meters; for the Houston Ship Channel view, 1,932 square meters; for Clear Lake, 1,803 square meters; and for Galveston, 231 square meters.

Since the grid resolution varies at different points around the mesh, even though every pixel of our "image databases" is averaged over the same area, the underlying data may be lower resolution:

Resolutions of elevation in the real world, the ADCIRC grid, and our PNG "image databases" (Illustration: Sarah Way for ProPublica)

Our interactive graphic uses WebGL to decode these images in the browser and display the animations to the reader. In the graphic, the base image shown to the reader is a pan-sharpened true-color Landsat image of the Texas Coast. Using WebGL, we then mixed blue (rgba(18, 71, 94, 1)) into the satellite image in areas where the depth of the grid was below sea level or the surge exceeded the underlying elevation over the course of the time series.

Because it wouldn't be a hurricane without wind, we created a particle system to show the wind direction and speed. We worried that showing large amounts of particles on the visualization would slow the animation to a crawl, but we found a WebGL extension called instanced arrays that makes copies of just one instance of our particle and efficiently draws it thousands of times. The particles are randomly placed small rectangle meshes, and we rotate each rectangle based on the wind data in each storm's underlying "image database." For every frame we also update the rectangle's position based on the wind velocity at that point, which we store in a WebGL texture. Each wind particle only lives for about a second before it fades out and resets to its original position. The wind speed in the visualization is 20 times as fast as in the underlying computer model because even hurricane-force wind speeds, when viewed from what is essentially space, would be imperceptible.

We presented other metadata about the storms, such as the storm tracks and the proposed and existing storm protection measures, on the map as vector data. Texas A&M Galveston provided the current "extended" Coastal Spine scenario to ProPublica and the Texas Tribune as a low-resolution raster image, which we converted into a vector file. Although it is therefore imperfect, it is our best estimate of what that conception of the Coastal Spine would look like.

Rice University, via UT-Austin, provided the other proposed and existing barriers to us as a shapefile. UT-Austin provided the storm tracks for Ike, P7, and P7+15 as a shapefile and we converted them into GeoJSON. Jackson State University provided the storm track for Storm 36 as a "TROP file." We synchronized it to the "Ike" timeline and converted it to GeoJSON to present on the map.

The storage tanks presented in the Houston Ship Channel view come from a database created by Dr. Hanadi Rifai and others at the University of Houston. To create that dataset, University of Houston researchers scrutinized 2008 aerial images to classify the tanks. The storage tank data is therefore current as of 2008.

Several other features in the interactive graphic use the same PNG "image database" we created to present the storms, including the address-lookup function and the surge numbers at Galveston Strand, Johnson Space Center and ExxonMobil Chemical. However, this process is done server-side, because we discovered that querying pixel data in the browser using the getImageData method in canvas returns incorrect results for images with alpha transparency.
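
A server-side lookup against one of these image databases can be sketched with Pillow; the file name and pixel coordinates below are hypothetical, and the unpacking mirrors the packing scheme sketched earlier:

# Read one pixel from a packed "image database" and recover the height,
# assuming the third-party Pillow package. File name and pixel are hypothetical.
from PIL import Image

grid = Image.open("background-grid.png").convert("RGBA")
r, g, b, a = grid.getpixel((256, 300))      # pixel for some geocoded address

height_m = g + b / 255.0                    # green: whole meters, blue: fraction
if r == 255:
    print("point is below sea level, depth ~", round(height_m, 2), "m")
else:
    print("elevation ~", round(height_m, 2), "m")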

The timeline graph at the top of the page is also generated by querying this same PNG "image databases" for each storm and area of interest at a given point. The point depicted in the timeline graph changes based on the area of interest: The overview as well as the Clear Lake area of interest depicts a point near the Kemah Boardwalk; the Galveston AOI is a point on the Galveston Strand; and the Houston Ship Channel view is a point on the ExxonMobil Chemical facility near 5000 Bayway Drive, Baytown, Texas.

We used Google's address geocoder. The accuracy of these numbers is the same as the resolution of the areas of interest noted above.

We would like to extend our thanks to the teams at Rice, UT-Austin, Texas A&M Galveston and Jackson State University for helping us understand and present this data, especially Jennifer Proft at UT and Bruce Ebersole at Jackson State University, who have spent many hours patiently guiding us through the intricacies of storm data.

Following the Money is Now Easier with FEC Itemizer

PACs, super PACs and nonprofit committees have spent at least $283 million influencing the presidential race so far. Today we’re announcing changes to our FEC Itemizer database that will help you stay informed about who is spending that money and where they are spending it.

FEC Itemizer now shows more information about outside spending, not only in the presidential race but in House and Senate contests, too. It also provides more detail about committees’ activities, such as a page with summary totals for independent expenditures, along with subtotals for spending in each state.

You can also get a view of where a committee is making independent expenditures, with a color-coded map showing the levels of spending. In addition, each contest within a state (the presidential race has both state-level elections and a national one) has its own link that lists independent spending and the top-spending committees.

For the presidential race, we’ve also added pages that summarize and list independent expenditures that support or oppose a specific candidate (for example, those supporting Hillary Clinton or opposing Ted Cruz).

Along with these new features, we’ve also added:

  • A page with super-PAC filings browsable by date.
  • Summary totals for committee pages.
  • The ability to browse a committee’s filings by a specific period; you can now quickly see whether what a committee raised in a reporting period was more or less than it has raised during the same period in previous years.
  • The ability to see itemized records from electioneering communications filings that don’t specifically advocate the election or defeat of a candidate.
  • A better summary of changes between an original filing and a later amendment, including a combined difference across all financial categories.
  • Bulk downloads of itemized records for two committees: ActBlue and Bernie 2016. ActBlue, a Democratic conduit committee, routinely lists tens of thousands of donations and expenditures in each report, while Bernie 2016 also lists thousands of contribution records. Instead of displaying those records in the browser, FEC Itemizer now provides a zipped CSV file of them for each filing.
  • Navigational “breadcrumbs” from every page so you know where you are.

With months to go before November’s election, we’re not done with FEC Itemizer, and we’d love to hear your suggestions for it and for the ProPublica Campaign Finance API.

Infographics in the Time of Cholera

This story originally appeared in “Malofiej 24,” published by the Spanish Chapter of the Society for News Design (SNDE).

“It is a singular truth that the mere shadowy image of a building is likely to have a longer term of existence than the piled brick and mortar of a building. Should posterity know where the proud structure stood, it will be indebted for its knowledge to the woodcut.”

—attributed to Nathaniel Hawthorne, 1836, quoted in “Low Life” by Luc Sante.

It’s a simple line chart, the kind you can make using Excel in about a minute, but for its time it might as well have been from another planet.

On Saturday, Sept. 29, 1849, The New York Tribune published on its front page a line chart tracking the deaths in New York City from the cholera epidemic that summer. It used techniques that would become common decades later but were, at the time, at the bleeding edge of visual data journalism. And, until now, it was forgotten.

The chart is a snapshot of the state of the art of data visualization in news at that moment, and is full of clues that help reveal parts of the hidden history of visual journalism.

Mid–19th century New York was a crowded and filthy place, with most of its half million residents packed like sardines into lower Manhattan. Sanitation was inadequate and street cleaning funds were controlled by corrupt city officials.

Medical understanding was also primitive, and disease outbreaks were widespread. Cholera hit big cities worldwide with terrifying regularity. In New York, an 1832 outbreak killed 3,515 of the city’s then 250,000 residents. Doctors only had a vague idea of what caused the disease. It would take five more years for John Snow to make his famous map of the London cholera outbreak around the Broad Street pump, and another 30 years before Robert Koch linked cholera with the Vibrio cholerae bacterium and medicine began to prevent and halt epidemics.

(From The New York Daily Tribune, September 29, 1849.)

If New Yorkers were still in the dark about what caused the disease, they were all too aware of how fast it killed. The time from first symptoms to death could be 24 hours or less.

The 1849 epidemic arrived by ship from Europe after infected passengers escaped a dockside quarantine.

Naturally, a new cholera outbreak was big news, and the Tribune was one of the city’s biggest newspapers. Every day, it competed fiercely with other papers for readers, including its archrivals the Sun and the Herald. Gangs of children — “newsboys” — hawked the papers, called collectively the “penny press,” on the city’s streets.

My understanding of how this graphic came to be in the paper remains skeletal. I haven’t found any correspondence that mentions it among the papers of the Tribune’s editors. It seems likely that publishing an exotic illustration like this chart would have been a gambit to sell more papers to a populace with an interest in understanding the spread of the disease.

But there is evidence that the reasons to publish it were more personal. The disease hit the Tribune’s famous editor, Horace Greeley, at home. His son Arthur, called “Pickie,” contracted cholera and died in the middle of that summer at age five, “being ill only from one [a.m.] to five o’clock [p.m.].”

Although graphical displays of data were not unheard of in scholarly and engineering books, they were exceedingly rare in U.S. newspapers of that time. Maps became common during the U.S. Civil War in the 1860s, but in the 1840s, illustrations of any kind (let alone data visualization) were rare.

I’ve spent the past few years studying the history of data visualization in news. I have seen hundreds of examples of data displays in antique newspapers. Few have struck me as much as the cholera chart has. It is rare almost to the point of anachronism. There are a few reasons why.

First, the technology to reproduce such things was still primitive. Newspapers were typeset by hand, letter by letter. Illustrations and line art had to be carved by hand into small wooden blocks. Bigger illustrations were made by bolting several blocks together (if you look at the cholera chart closely, you can see some of the seams between the blocks). The process was laborious and hard to pull off on the daily deadline of a newspaper. It took great skill and there were only a handful of craftsmen who could do the work. Few if any newspapers had on-staff engravers, so it’s likely the Tribune would have had to bring in somebody with rare skills who could command a high fee.

The work also had to be done quickly. The chart includes data from a week before it was published.

A bigger obstacle than the production difficulties was the Tribune’s readers themselves. It is unlikely that everyday New Yorkers in the 1840s would have been familiar with statistical graphics, and they likely wouldn’t have had any idea how to read or interpret them.

The editors must have worried about this too. Attached to the graphic is a 300-word annotation explaining how to read a line chart, including basics like Cartesian coordinates (“Each half-inch along the bottom line represents a week”), axis labels (“The dates are placed under each”), and that the slope of the line represented the change between data points (“The zig-zag lines, which join the ends of these lines, show, by their upward or downward slopes, whether the deaths during those weeks have increased or decreased, rapidly or slowly.”).

At either end of the annotation are interesting clues. At the beginning, the editors credit the creator of the graphic, or at least the person who suggested it to them: “We are indebted to Professor Gillespie of Union College.” Gillespie was a professor of civil engineering and mathematics at the college in Schenectady, New York. Gillespie’s papers are not collected anywhere, and I cannot find any mention of him in the various Tribune archives, so it’s unclear what role he played. Bylines were not yet conventional in newspapers, but it seems likely to me that he was the creator of the drawing from which the engraving was made. As a scholar, Gillespie would have had access to William Playfair’s books and been familiar with statistical graphics.

John Snow's famous map of cholera cases during an epidemic in London, showing clustering around the Broad Street pump, 1854. (Courtesy Wikimedia Commons.)

The annotation ends with one of the chart’s most interesting details, a hint at a scientific understanding that in 1849 was still just out of reach:

If the average temperature, moisture, electrical state, etc. during these weeks were represented in the same manner, and added to this diagram, their comparison would show at a glance whether there has been any connection between them.

The Tribune graphic is a snapshot in miniature of a standard yet to come, revealing a growing sense of awareness of the potential for well-designed information displays to help people understand and solve problems. And it has important lessons to teach us more than 166 years later.

The first, perhaps, is that the history of data journalism is far older than many people realize. Newspapers have been publishing data on commodity prices, cargo shipments and births and deaths since their very beginning. Data visualization in newspapers wasn’t common until technical advances later in the 19th century made it easier, but even as far back as the 1840s newspapers were experimenting with it.

Polar area charts plotting mean temperature vs. mortality in London, 1840-50, by William Farr, from his study on the mortality of cholera in England.

As visual journalists, we ought to feel a kinship with the unknown engraver of this cholera chart. We may think we inherit our craft from people like Snow and Playfair, but they were essentially scholars, with motivations, resources and limitations far different than ours as journalists. In reality, we stand on the shoulders of the people who toiled over maps and charts like our cholera chart, with imperfect data, on deadline, for newspapers.

But the most important lesson is that there is no such thing as an innately intuitive graphic. None of us is born literate in reading visualizations. But a mid–19th century newspaper reader, like any of us, was born with the potential to understand one. The Tribune of 1849 couldn’t assume its readers would understand a line chart, but its editors clearly thought it was worth trying.

The implications of this are significant. First, we must always keep our readers in mind. They have the potential to understand our graphics, but we must never assume a graphic is intuitive on its own, needing no explanation or direction. Visual and narrative clues about how to read our graphics are mandatory.

On the other hand, this also means we are not trapped into using simple forms. There is no pure set of visual types that conform to human nature and are thus intrinsically better than the others. We are free to experiment. If we keep our readers in mind, we can pursue new forms that delight them and help shed light on complex subjects in ways that have never been tried before. For the right story, our readers will put in the time necessary to understand even strange new graphical forms — as our line chart must have seemed in 1849 — if we only take the time to help them do so.


8 Tips on Getting a Newsroom Data Team Started


This story was co-published with Nieman Lab.

Thursday morning, a post by journalist Miguel Paz on a data-journalism email list asked for tips for starting a data-journalism team. I had many thoughts on the subject, and nothing to read on the subway, so I ended up writing this long response on my smartphone:

  1. It’s okay to start small. A player-coach plus one developer can build incredible things — TPM’s election night coverage in 2012 was innovative and nationally competitive and built by two people. The LA Times news apps team was two people for a long time.

  2. Help the organization build on successes. Work hard and create great projects. Go viral. Get covered. Win awards. Smart leaders will see your team as having an impact that grows faster than linearly when you add people. Let them know that there’s more where that came from if they invest further. My team is about 20 percent of the staff, but its projects account for about half the traffic. We bring in earned revenue and grants. We bring home prestigious journalism awards. Make sure your sales pitch to your bosses for growing your team includes all of those metrics.

  3. Recruit generalists. There are a bunch of skills needed in building news apps, but at the most abstract level they fall into three buckets: Code, Design, and Journalism. Recruit people who have at least two of those skills and be willing to teach them the third. The easiest by far to teach is Code. You want journalists whose creativity expresses itself as interactive graphics and databases.

    It is getting far easier to recruit. In our last fellowship round, we had hundreds of applicants, about a dozen of whom would have been great. We had 700 applications to our Summer Data Institute. The applicants are out there!

  4. Treat your news-app developers as authors. Each person should be the creative owner of a project. Do not split the work up into functional specialties. Here’s why:

    In order to make a great interactive news project, its creator needs to have had their hands deep inside the data from the beginning, where they will start understanding the possible stories it tells. They need to build the server code to support their ultimate vision without slow negotiation or the friction of brain-to-brain communication, and they need to design the presentation because they’re the person who understands the material and the visual story possibilities best.

    Think about this; it’s no different from other news desks. If you wanted to, you could split up your story work into cross-functional teams based on the tools they use: The Telephone-Assisted Reporter spends all day talking to all the sources for every story, a Designated Reader focuses on all the reading for the newsroom, and a Microsoft Word Specialist jumps around writing narratives. In fact, for breaking news you probably split the work up roughly this way. To make this the rule for every project would be preposterous, yet it’s how most news app teams operate. The work is split between back-end developers, front-end developers, designers, etc.

    Instead, expect and demand that each person on the team be responsible for all the work. That is not to say everybody will be equally good at everything, of course. Skill sharing is crucial. The stronger coders will help those who are just learning, the seasoned designers will help those who need design pointers, the great FOIA writers can show the newbies how to do it, and so on.

  5. Demand journalism. Treat the team like a news desk and expect them to be journalists in everything they do. Members of the team should come to edit meetings, call sources and get bylines on their work. They should be like co-reporters in projects that are collaborations with a more traditional reporter — never “the data monkey.” When collaborating with another reporter, the data team member should go out on interviews, travel, develop sources, etc. And the traditional journalist should learn the basics of data so they too can work directly with the data.

    Make sure your team is counted as part of the newsroom when your company lists its numbers publicly. Your newsroom chiefs should not see hiring in your department as being in tension with the core work of the place.

  6. Edit them. The team’s leader should be an editor, not IT. That leader’s boss should be an editor, all the way up to the executive editors. Just like any other desk, the editor helps each person on the team pick projects, get resources, stay focused, write good copy, and craft a user experience that makes sense and tells the story that readers understand and that is supported by the facts.

  7. Don’t cross the streams. A news apps team should not do ANY platform work. Keep them focused on building journalistic projects. They shouldn’t build your site templates, code your social media buttons, or create your marketing emails. Juggling platform work and journalism work is like juggling a bowling ball and a golf ball. They are completely different kinds of software development: Platforms are a container for other people’s work; the development cycle is slow and never done; your customers are largely your own colleagues. News apps are themselves the output of journalism; their dev cycle happens on deadline; apps aren’t directly connected to each other, and are built only to serve readers. Context-shifting between these two modes is very hard. And worst of all, because platform work has no fixed end date, it tends to consume other work. The ground is littered with teams who thought they could build a site section and then come right back — and never did.

  8. Pick great bosses. My bosses (Paul Steiger, Steve Engelberg, Dick Tofel and Robin Fields) believed in me when I started putting the ideas together for my department, even when I didn’t believe in myself or know precisely where I was going. You want to feel like somebody with institutional authority is placing a bet on you, not doing you a favor. There will be plenty of compromising on your way to building the team the right way, but if you’re working at cross-purposes with your bosses, you are likely to run out of energy before they do.

A New Way to Keep an Eye on Who Represents You in Congress


Today ProPublica is launching a new interactive database that will help you keep track of the officials who represent you in Congress.

The project is the continuation of two projects I worked on at The New York Times — the first is the Inside Congress database, which we are taking over at ProPublica starting today.

But we also have big plans for it. While the original interactive database at The Times focused on bills and votes, our new project adds pages for each elected official, where you can find their latest votes, legislation they support and statistics about their voting. As we move forward we want to add much more data to help you understand how your elected officials represent you, the incentives that drive them and the issues they care about.

In that way, it is also a continuation of another project I worked on at the Times. In late 2008, The New York Times launched an app called Represent that connected city residents with the officials who represented them at the local, state and federal levels. It was an experiment in trying to make it easier to keep track of what elected officials were doing.

Because ProPublica is rekindling that effort, we’re calling the new project Represent.

The new Represent will help you track members, votes and bills in the House of Representatives and Senate. We’re also launching a Congress API, or Application Programming Interface, so developers can get data about what Congress is doing, too.

Represent will show details of votes and bills and provide a way for you to follow the activities of your elected representatives and understand how they fit into the broader world of American politics. For example, we’ll show you how often a member of the House or Senate votes against a majority of her party colleagues, or the kinds of bills each lawmaker sponsors and cosponsors. We have pages detailing every vote, every bill and every member, with details about each. On the homepage we’ll display significant votes in the House and Senate.

As with our Campaign Finance API, we are also taking over the congressional API that The New York Times started in 2009, with the goal of expanding it and making it even more useful for newsrooms and other users. If you used the previous Congress API published by the New York Times, your code will continue to work for a short time, but you should migrate soon. New users can sign up for a free API key by emailing apihelp@propublica.org.
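
If you’re a developer who wants to try the API before diving into the documentation, here is a minimal sketch in Python of what a request might look like. The base URL, the members endpoint path, the X-API-Key header and the response field names below are assumptions based on how APIs of this kind are typically structured, not details from this announcement, so check the documentation that comes with your key before relying on them.

```python
# Hypothetical sketch of calling the Congress API with the requests library.
# The base URL, endpoint path, header name and field names are assumptions;
# consult the documentation sent with your API key for the real details.
import requests

API_KEY = "your-api-key-here"  # requested by emailing apihelp@propublica.org
BASE = "https://api.propublica.org/congress/v1"  # assumed base URL

def get_members(congress=114, chamber="senate"):
    """Fetch the member list for one chamber of a given Congress."""
    url = "{}/{}/{}/members.json".format(BASE, congress, chamber)
    resp = requests.get(url, headers={"X-API-Key": API_KEY})
    resp.raise_for_status()
    return resp.json()["results"][0]["members"]

if __name__ == "__main__":
    for member in get_members()[:5]:
        # Print a few illustrative fields; .get() avoids KeyErrors if the
        # actual response uses different field names.
        print(member.get("last_name"), member.get("party"),
              member.get("votes_with_party_pct"))
```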

Our focus is on the current Congress, the 114th, which lasts until the end of 2016, but we have data going back to 1995 and earlier for votes and members. We are taking advantage of new legislative bulk data produced by the Library of Congress and the Government Printing Office to make the process of updating the data more consistent and less reliant on scraping congressional sites, too.
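
For those curious what working with that bulk data looks like, here’s a rough sketch of downloading and inspecting a single bill-status XML file in Python. The URL pattern, file naming and element names are assumptions for illustration only; the Library of Congress and GPO have reorganized their bulk-data offerings over time, so confirm the current location and schema before building on this.

```python
# Rough sketch of inspecting one bill-status record from the legislative bulk
# data instead of scraping congressional sites. The URL below is a
# hypothetical example of the bulk-data layout, not a documented endpoint.
import requests
import xml.etree.ElementTree as ET

URL = ("https://www.gpo.gov/fdsys/bulkdata/BILLSTATUS/114/hr/"
       "BILLSTATUS-114hr1.xml")  # assumed path for H.R. 1, 114th Congress

def summarize_bill_status(url):
    """Print the top-level elements the bill-status record contains."""
    resp = requests.get(url)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    bill = root.find("bill")  # element name assumed from the BILLSTATUS schema
    node = bill if bill is not None else root
    # List the child element tags (titles, sponsors, actions, ...) so you can
    # see what the feed offers before writing a real parser.
    print(sorted({child.tag for child in node}))

if __name__ == "__main__":
    summarize_bill_status(URL)
```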

This isn’t the only congressional data site out there, and our goal is to send visitors to other sites that offer valuable features. That’s why we’re linking to individual lawmaker and bill pages on GovTrack and C-SPAN, for example. Like GovTrack, our news app will provide some calculated metrics that visitors can use to help learn more about their representatives. We also have vote cartograms that show not only how each lawmaker voted but the relative clout of delegations.

That’s where you come in: What kinds of congressional information would make it easier to hold Congress accountable for its actions (or inaction)? Would more comparisons between lawmakers’ votes and legislative proposals be helpful? We’re currently showing recent bills by subject, but are there other ways of organizing information about bills that would be useful? What do you want to know about the activities of Congress?

Please let us know — either in the comments below or by sending me an email at derek.willis@propublica.org.

Presenting Hell and High Water VR


In March, ProPublica and the Texas Tribune published Hell and High Water, an interactive story that raised an alarm about Houston’s vulnerability to coastal storms. Today, a team at the University of Southern California is launching a virtual-reality experience based on those stories. It’s called Hell and High Water VR.

The Houston Ship Channel is one of the country's biggest petrochemical refining centers. It’s also home to storage tanks that contain billions of gallons of oil and toxic chemicals. The ProPublica/Tribune investigation drew on cutting-edge research and supercomputer-generated storm models to simulate a storm scientists say has about a 1 in 350 chance of hitting the channel in any given year.

Inspired by this story and the research, JOVRNALISM, a hackathon-style class at the USC Annenberg School for Communication and Journalism, set out to create an immersive VR experience based on the project.

Graduate and undergraduate USC students, led by Professor Robert Hernandez, traveled to Houston during Spring Break 2016 to do original reporting based on the Hell and High Water investigation. They developed new immersive storytelling techniques to illustrate portions of the investigation they felt were ideally suited for virtual reality.

After months of work, the students have produced Hell and High Water VR. There are many ways for you to experience it:

  • If you’ve got an iPhone, we recommend downloading the JOVRNALISM app (an Android version is coming soon). If you have a VR headset like Google Cardboard you can use it, but it’s not required.
  • You can also see it on your phone by going to YouTube. (This link should launch the YouTube app on your phone.)
  • You can see the project on your computer using any recent browser through the video playlist below. (Your browser must have WebGL enabled to get the 360 experience.)

Sunlight Labs Takeover Update


Last month we took over five projects that were created by The Sunlight Foundation. Here’s an update on where those projects stand now and our future plans for them.

Sunlight Labs shuttered on Nov. 11, a federal holiday and, as you may recall, part of a busy week for everybody in news. Sunlight staffers Kat Duffy and Bill Hunt worked tirelessly to find a good home for the Labs projects. They had a hard deadline to get everything placed: On Nov. 11, Sunlight Labs’ servers were to be turned off and the staff laid off.

We’ve spent the past few weeks understanding the code and data that Kat and Bill sent us and making a plan for migrating everything to ProPublica URLs, designs and systems. Our plan is to get them all up and running first and then make any fixes, changes or improvements, including integration with existing ProPublica projects.

Three of the five projects are currently online, while the other two are not. Most of the projects are connected to Congress in some way, which means time is of the essence: The next House and Senate will convene on Jan. 3, 2017, and we want to have these projects as ready as possible by that date.

Here are specific updates on each, including some volunteer opportunities for coders looking to chip in:

Sunlight Congress API: The API is currently running at its original address and supports all of the calls it did before the move. What’s missing right now is the ability to sign up for an API key, so we’ve turned off the requirement to have one for the moment. If you are an existing API user, don’t change anything. If you are a new user, you can use the API right now without a key, but we will require keys once we have a solution in place. We will be moving the API to a propublica.org address, but none of the other parts of the URL will change, and we do not anticipate removing any existing responses.
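
To make the “no key needed for the moment” point concrete, here is a small Python sketch of what a keyless request might look like during the transition. The base URL is the address Sunlight historically used, and the /legislators/locate endpoint and its parameters come from Sunlight’s old documentation rather than from this post, so treat them as assumptions and swap in the propublica.org address once the move happens.

```python
# Illustrative sketch of a keyless call to the Sunlight Congress API during
# the transition. The base URL and endpoint are assumptions drawn from
# Sunlight's historical documentation; adjust them once the API moves.
import requests

BASE = "https://congress.api.sunlightfoundation.com"  # assumed original address

def legislators_for_zip(zip_code):
    """Return members of Congress whose districts overlap a ZIP code."""
    resp = requests.get(BASE + "/legislators/locate", params={"zip": zip_code})
    resp.raise_for_status()
    return resp.json().get("results", [])

if __name__ == "__main__":
    for leg in legislators_for_zip("10013"):
        print(leg.get("chamber"), leg.get("last_name"), leg.get("party"))
```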

Much of the legislative data that powers the API comes from the United States project. We’ll be contributing to the efforts there and we invite users of the API to get involved, too, by contributing code for new or existing sources of information, or by updating information about members of Congress.

ProPublica has its own Congress API (which started life as The New York Times Congress API), but Sunlight’s is more robust and has many more users, so our plan is to merge the ProPublica Congress API into the Sunlight API so that we can offer a single service. The process will begin this month and continue into early 2017. Once we’ve completed that transition, we’ll have a single Congress API. If you are interested in helping us test things out during the transition, email us at apihelp@propublica.org.

Politwoops: This popular service that tracks deleted tweets by politicians and elected officials is running at a new URL, and we expect to move it again soon (don’t worry, we’ll redirect links). The only other changes involve the design of the site and the addition and removal of politicians. We’ll also look to integrate Politwoops data into Represent, our congressional news application.

House Expenditure Reports: We relaunched this project as part of Represent at the end of November and will continue to maintain it going forward. It’s also available as a free download in the ProPublica Data Store. We’ve updated the data to include office expenses from the third quarter of 2016, and we’ll be adding new features to Represent based on the data. We hope to have a public search interface in addition to providing data downloads.

House Staff Directory: We are working on integrating data on House staffers into Represent, and expect that it will be completed in mid-December. As with the expenditure reports, we’ll continue to update the data quarterly.

Capitol Words: This project that analyzes the words spoken by lawmakers has both an API and a public web application. It is not running yet, and our goal is to have it back up in time for the beginning of the new Congress in January at the same URL, and then to integrate parts of it into Represent. We will continue to support the API, though the URL will eventually change. We’re particularly interested in hearing from Capitol Words API users on how they use the service and what changes they might make. Email us at apihelp@propublica.org.

Although these applications will look different than they did a month ago, we remain committed to preserving their utility and to using them to make ProPublica’s own news applications better.

ProPublica’s on IFTTT


One of the Sunlight Foundation projects that ProPublica adopted late last year is a service that you can use to be notified when some key things happen in Washington — for instance, when President Trump signs a bill into law, or when a bill is introduced that covers something you’re interested in.

These notifications are part of a service called IFTTT (which stands for IF This, Then That), which lets non-coders create small “applets” that connect web services and devices. So, for example, you can record your fitness tracker’s daily data in a Google spreadsheet or make a connected light bulb blink on when you get an email from a certain person.

The applets first built by Sunlight and powered by the databases that ProPublica took over in November are now part of ProPublica’s IFTTT channel. You can create applets based on five different real-world events:

  • President Trump signs a bill into law.
  • A new bill is introduced that matches a search term you provide.
  • The House or Senate schedules a bill for consideration.
  • A new lawmaker representing you enters Congress.
  • A member of Congress has a birthday.

Now that ProPublica has its own IFTTT channel, we’d like to hear from you about what other applets we should provide. For instance, we’ve got campaign finance data that isn’t yet hooked up to any IFTTT applets. What interesting things could we build with that? Let us know in the comments or on Twitter at @propubnerds.
