
Meet our New OpenNews Fellow


Knight-Mozilla OpenNews announced five new fellows for 2014 today at the Mozilla Festival in London.

We’re excited to announce that Brian Jacobs will be ProPublica’s OpenNews fellow in 2014.

Brian is an interactive developer and designer who specializes in civic-minded visualizations and interactive projects, especially ones involving maps and geographic data. Most recently, he worked in Singapore at MIT’s SENSEable City Lab, and before that he was at Azavea, a GIS development shop and ProPublica collaborator based in Philadelphia.

At ProPublica, Brian will work alongside the other developers on the News Apps desk, building interactive projects that help people find their own stories in large data sets. Our plan is to get him to work immediately pursuing great accountability stories, and we’re especially eager to employ Brian’s expertise in telling stories with maps and location.

Our current OpenNews Fellow, Mike Tigas, continues to have a big impact at ProPublica and in the news-coding world beyond. In addition to his terrific work on our Nonprofit Explorer, Mike is a core developer on Tabula, which has made the impossible possible by turning text tables inside PDFs into structured data useful for data journalism (if that sounds like no big deal, trust me, it’s an amazing advancement). Mike’s got more projects in the works, so stay tuned.

OpenNews continues to play a critical role in supporting a community of practice among people who do journalism by writing code. We’re thrilled and grateful to be part of the OpenNews Fellowship again this year.


Data-Driven Journalism’s Secrets


Read this article in Spanish.

I’ve spent the last few weeks in the U.S. on a Douglas Tweedale Memorial Fellowship with the International Center for Journalists, talking to some American newsrooms about how they approach data-driven journalism. Here’s a bit about what I’ve learned.

The best way to start doing data-driven journalism is simply to start. When you’re just getting started, you really have nothing to lose. With every mistake, you gain experience and knowledge for your personal growth and to improve the quality of the journalism you are practicing.

It’s easy to say, simply, “I am not good at math,” and decide that data-driven journalism isn’t for you. Well, I wasn’t good at math either until I decided to tear down that barrier, and found that learning is better and easier by applying theory to real-life projects.

In 2008, before I even knew there was such a thing as “data-driven journalism,” I proposed a small, simple analysis of a data set of international tourists visiting Costa Rica. Along the way, I started to learn Excel formulas, to calculate variations and to analyze totals by year, season, quarter and so on.

Gradually, my skill at correlating data began to increase, as did the complexity of my projects. These days I’m working in the Data Unit of La Nación in Costa Rica.

Here are three things you can start today to increase your data-driven journalism skills:

First, talk to developers and engineers about how to approach a data analysis project – which software, formulas and methodologies to use, given the goals of your project.

Second, stick with it! Don’t give up too easily when learning to use a new technique or software like Open Refine, Tableau, Tabula or others to clean, analyze or visualize data.

Finally, make sure to share what you’re learning with others. Very often the questions people ask will reveal challenges you hadn’t thought of, motivate you to search for the right answers, increase your knowledge and encourage you to try different approaches.

What I want to say is: If you want to do data-driven journalism, go ahead and start. Good ways to start learning include online courses, books and tutorials.

If you live in Latin America, you can take advantage of projects like Chicas Poderosas (“Powerful Girls”), which promotes the development of data-driven journalism skills through workshops that connect journalists, developers, designers, animators and storytellers and get them to work together on storytelling projects.

I also recommend global initiatives like Hacks & Hackers, which hosts meetups in many countries in and outside Latin America.

You must also commit to never stop learning. Even after you have developed advanced skills and a deep understanding of the techniques, tools and methodologies of analysis and visualization, there will always be a bigger challenge ahead – bigger datasets, new software to test, new techniques to try and different approaches to generate participation from people for whom your story is important.

Why Data-Driven Journalism Matters

You may ask, “why does data-driven journalism matter to me?” A good summary comes from Sisi Wei, a data journalist at ProPublica, who I talked to while I was on a two-week assignment in that newsroom.

“It allows me to scrutinize and examine information better than ever before. For example, when I was in college, I thought that you interview experts, believe in what they say, then quote them in your articles. But why not take the data results of their research and analyze it yourself? I can check what they are saying. If I analyze the data before I do an interview, I can ask questions about the anomalies that I found or questions on procedures that I understand better than before.”

I know just what she means – it’s very useful and satisfying to be able to scrutinize and examine information in new and better ways.

Every good story starts with an idea, a question or an observation, said Sarah Cohen, who has been editor of computer-assisted reporting at The New York Times for about a year.

“And then, we look for data or documents that help us to extend the impact of these observations,” she added. We spoke in the Times’s employee cafeteria earlier this week.

That’s how one particular Pulitzer Prize-winning series of stories, reported by Sarah and her colleagues at The Washington Post, was born.

The series exposed the District of Columbia's culpability in the death of 229 children placed in protective care between 1993 and 2000.

“I feel that those stories have more impact than traditional investigations, because they show that this is not an isolated incident, that this was not just a one-time problem,” concluded Cohen.

Another of the great benefits of data-driven journalism is how it improves the quality of journalism and the engagement with audiences through visualization and interactive databases.

In Argentina, La Nación in Buenos Aires understands this benefit very well. “Putting data in contact with citizens through news apps or interactive visualizations is the way to activate citizen participation in the mobile and on-demand era,” said Angelica “Momi” Peralta, La Nación’s Multimedia and Interactive Development Manager.

Argentina doesn’t have a Freedom of Information Act. Its government open-data portals launched only recently, so the data journalists at La Nación build their own datasets from scratch by scraping PDFs, among other techniques. They also share the data they develop in open formats so anyone interested can get access to it.

“We believe that sharing data is a must. Media must be in front this time, wake up, make useful data sets ‘famous’ and available for others to reuse. Bring data to life, because darkness and corruption is killing people in places like ours, and data that you open comes to light, so once it reaches the hands of citizens (through mobile or visualizations), it will be more difficult for those who want to make it disappears,” Peralta told me via email.

The Idea and the Process

In my experience, succeeding at data-driven journalism is a matter of patience, lots of team work and refusing to give up. Here’s one story from my newsroom about how a project came together.

Last March, during a meeting with my team, we came up with the idea to investigate the recycling practices of households in the 81 regions of Costa Rica, and to see how local governments are supporting these efforts. Immediately, I began to dig into the issue by reading up on the relevant laws and regulations, as well as academic studies on the subject. I also created two databases.

The first set was built by extracting data from the 2011 Costa Rican Census. For the very first time, that Census asked whether or not households were separating plastic, paper, aluminum and glass from ordinary trash.

I assembled the second dataset by calling and requesting information from all 81 local governments.

I knew from the census that 40 percent of Costa Rican households were separating recyclable waste, but beyond that, what I wanted to know was: In what regions are those recycling practices the most extensive? What do people do with the stuff they are separating? Do the local governments handle the recycled waste properly, or is the effort in vain? How many tons are being collected every month, and how much of it is being recycled?

The Census had answers to the first two questions, but not the last two. Those questions could only be answered by asking the local governments. The second database was gathered from that reporting. I took to the streets to confirm my findings, talking with experts, government officials, associations and companies in the communities.

This is an important step, advises the Times’s Cohen. “Data-driven journalists need to spend some time on the street seeing how the data works in the three-dimensional world. The same happens with reporters that work on street. They need to spend some time with the data to see how it is represented. Any record is actually part of something that is happening. And without any of those perspectives, I think you can lose a lot.”

In the end, with the help of some of my colleagues, we mashed the two databases together so we could figure out the exact number and location of households splitting recyclables from the ordinary trash, and where the local government wasn’t actually picking them up in a separated way and instead just mixing recycling bags with garbage.

We now also had a data set of the towns with the highest rates of recycling, and which recycling efforts were really being supported by their local governments.

In parallel to the data analysis, we developed an interactive database that allowed each reader to interact and find their own local information, and so to “tell their own story” using the data.

The Secret Is: Don’t Keep Secrets

Finally, if you are going to start in data-driven journalism, you should know a big secret: Do not keep secrets from the team you’re working with. Share all your findings, drafts and data.

Any idea that is not shared with others is destined to die because the oxygen supply that a data-driven project needs to live depends on how much you share it and how much you nurture it with others’ feedback.

Don’t keep information to yourself. Share all your data and findings at the very beginning. Keep your notes in a place others can access, like a wiki or shared network drive. From the very start, engage developers, designers and multimedia experts with all aspects of your story. They’ll enrich your perspectives and boost the quality of your questions.

You may also find new sources (both data and human) that perhaps you wouldn’t have otherwise thought of, as well as new tools and methods to extract and analyze data. It will help you do your job better.

Talk as much as you can about your idea, even when it is only a project idea. Keep talking when it turns into a project and when it’s a draft and when you’re really far into the project.

And discard the notion that journalists and engineers and graphic designers should work separately. In a newsroom everybody can be doing the work of journalism.

On that point, I like a phrase that I heard during my stay here at ProPublica. It was said by Scott Klein, Senior Editor of News Applications. “Forget about saying to developers: OK, here's the data, work with that. My part is done.” Your project can only be successful if it is really done as a team.

I have confirmed this myself at La Nación (Costa Rica) during the development of a database that reveals the identity of more than 100,000 offshore entities in tax havens. It was a global project led by the International Consortium of Investigative Journalists, in which La Nación (Costa Rica) joined with other newsrooms in different parts of the world and worked together to generate an investigation with global impact.

It would not have been possible if the newspaper hadn’t assembled a multidisciplinary team that was willing to communicate and work together in whatever way we needed to.

My next challenge will be trying to learn the basics of software development. I know there are opposing views on whether or not a journalist should learn to code, but during the past two weeks sitting in the ProPublica newsroom, I have seen why it is important for journalists to have at least basic programming skills.

If you are working with large volumes of data, even basic coding will help you, if only to communicate better with your engineers, get the most out of your data and analyze it in the best way possible. And understanding how code works will help you to know the most suitable visualization or interactive technique to deploy.

Again, I know it seems daunting, but the most important advice I have for journalists is to at least try data-driven journalism, and as you master it, keep learning.

Hassel Fallas is a data journalist at La Nación in Costa Rica. She’s visiting ProPublica as a 2013 Douglas Tweedale Fellow from the International Center for Journalists and as ProPublica’s October 2013 P5 Resident.

How We Analyzed FEMA’s Risk Maps


Today we published a story and interactive news application revealing why the flood risk maps in effect across New York and New Jersey predicted Sandy’s flooding so inaccurately. Instead of the latest technology available, which would have painted a far more accurate picture of the risks for homeowners and flood planners, FEMA’s maps relied on a patchwork of technologies, some dating to the 1980s.

For our project, we did an analysis of a few geographical data sets. In order to rank how well each county’s maps predicted Sandy's flooding, we compared the area of Sandy's storm surge as measured by the FEMA Modeling Task Force (specifically, the February 14, 2013 update) to the area of storm hazard zones in the effective maps in coastal New York and New Jersey counties.

To make the calculation, we only considered land in zones starting in A or V. These are areas in which FEMA requires mandatory flood insurance. We wrote software to compare the areas of the two maps using the C++ OGR API. The software calculated the amount of overlap between the geographical areas — how well the maps predicted where Sandy would flood.
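As a rough illustration of that comparison (the actual analysis was written in C++ against the OGR API; the function and inputs below are hypothetical), the overlap measure can be thought of as the share of the surge area that fell inside the mapped A and V zones:

// Hypothetical sketch: score a county by the share of Sandy's surge area
// that fell inside the A and V zones on its effective flood map. The areas
// themselves would come from the geometric intersection computed with OGR.
function overlapScore(surgeArea, surgeInsideZonesArea) {
  // 1.0 means the surge was entirely inside mapped high-risk zones.
  return surgeInsideZonesArea / surgeArea;
}

// e.g., if 6.2 of 10 square miles of surge fell inside A/V zones:
overlapScore(10, 6.2); // => 0.62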

There are some important caveats to note when comparing predictive flood risk maps and actual flood inundation maps. Floods like the ones that came ashore with Superstorm Sandy do not hit everywhere equally. Although FEMA estimates that flood zones starting with an A or a V carry a one percent risk of flooding in any given year, there were only certain places in which Sandy was actually a “100 year storm.” In some places, Sandy was a rarer flood and in other places, a more common flood. This explains some, but not all, of the variation in accuracy.

Also, a FEMA official told ProPublica that the accuracy of the inundation maps may vary from location to location, but overall it represents a “very accurate overall depiction of the extent of flooding from Sandy.”

Nevertheless, we believe, as do experts we spoke with, that the differences in the accuracy of the flood risk maps -- look, for instance, at the gap between New York’s adjoining Queens and Nassau counties -- were striking. Areas with newer maps using newer technology predicted the flood extents far more accurately overall.

To find buildings that the maps updated in 2007 left out, ProPublica analyzed data from several New York City agencies, Preliminary Work Maps FEMA released in June and FEMA’s Sandy structure damage assessments. We counted 9,503 buildings damaged during Sandy that are included in FEMA's 2015 preliminary risk maps, still in the works, but not in the 2007 risk maps, which were in effect when the storm hit. Of those buildings, 398 were built or altered in or after 2007 -- when more accurate flood maps, had they been ready, could have alerted owners and builders to the dangers of building in those areas. This is the subset we used to find affected homeowners to interview for our story.

We chose buildings that fell into the following categories according to the US Army Corps of Engineers and FEMA: “affected,” “major,” “minor” and “destroyed” (see their methodology). To find the subset of these buildings that were built or altered in or after 2007, we restricted our query to buildings the city's assessment roll specified had “year built” or “year altered” values of 2007 or higher. We wrote software using the C++ OGR API to find these buildings.
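The two filters described above can be sketched in a few lines. The field names here are invented for illustration and are not the actual columns in the city's assessment roll or FEMA's damage file; `buildings` stands for the joined records from those sources.

// Hedged sketch of the two filters described above; field names are hypothetical.
var damageCategories = ['affected', 'minor', 'major', 'destroyed'];

// Buildings damaged by Sandy that appear in the preliminary risk maps
// but not in the 2007 effective maps.
var newlyMapped = buildings.filter(function(b) {
  return damageCategories.indexOf(b.damageCategory) !== -1 &&
         b.inPreliminaryZone && !b.in2007Zone;
});

// The subset built or altered in or after 2007, per the assessment roll.
var builtOrAlteredSince2007 = newlyMapped.filter(function(b) {
  return b.yearBuilt >= 2007 || b.yearAltered >= 2007;
});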

We’ve made the scripts and the data they rely on available to download. We welcome further analysis and research. If you see any problems or ways we could have done our analysis better, please let us know by emailing data@propublica.org.

How We Calculated Injury Rates for Temp and Non-Temp Workers


Related Story: Temporary Work, Lasting Harm

Summary of Findings

An analysis of data from workers’ compensation claims in California, Florida, Massachusetts, Minnesota and Oregon over a five-year period found that the incidence of workplace injuries among temporary workers was between 36 percent and 72 percent higher than among non-temporary workers.

When workers were grouped by occupation, this gap widened significantly for workers in certain blue-collar, more-dangerous occupations and narrowed for workers in less dangerous occupations.

Temporary workers also are disproportionately clustered in high-risk occupations, our research found. Temporary workers were 68 percent more likely than non-temporary workers to be working in the 20 percent of occupations with the highest injury rate as measured by the U.S. Bureau of Labor Statistics.

Introduction

The safety of temporary workers on the job has become an issue of growing concern in the public health community. Such workers, recruited by temp agencies for jobs in factories, warehouses, offices and other worksites on a daily basis, make up a larger share of the American labor market than ever before, according to the most recent government jobs report in November 2013. There are now 2.78 million workers in the temporary help services industry. The American Staffing Association, the industry’s trade group, says that some 13 million people, nearly 1 in 10 workers, found a job through a staffing agency in 2012.

The director of the federal Occupational Safety and Health Administration has said that he is alarmed by the number of temp workers being killed on the first day on the job. Earlier this year, OSHA launched an initiative to raise awareness about the dangers temp workers face, as well as the responsibilities of temp agencies and the companies that use temp workers. But attempts to improve policies protecting temp workers have been limited by a lack of basic data, such as whether temp workers get injured more than regular workers, what types of injuries they suffer and whether certain occupations are of special concern.

The main resource for workplace safety data is the Bureau of Labor Statistics’ (BLS) Survey of Occupational Injuries and Illnesses. Worksites are required to keep a log of injuries. Every year, BLS economists collect data from those logs from 200,000 establishments to estimate workplace injuries and illnesses nationwide.

It’s impossible to compare the injury rates of temp workers to those of regular workers using this survey, for two reasons. First, companies compiling the records are not required to indicate whether a temp or regular employee was harmed by an accident. Second, BLS surveys have found that many worksite employers are not aware they are supposed to include injuries suffered by temps in their logs, meaning that a large number of temp worker injuries likely go uncounted.

Recently, public health researchers have begun to discuss whether state workers’ compensation data could be used to monitor injuries among temp and other contingent workers left out of the BLS survey. A 2010 study of Washington state workers’ compensation claims found that temp agency workers had higher rates of injuries; the rates were twice as high in the construction and manufacturing sectors. Washington State’s workers’ comp system is unusual in that (1) the state fund is the only player in the insurance market, (2) it uses a coding system that identifies temp workers in specific classifications, such as “temporary staffing services - warehousing operations,” and (3) the state collects data on the number of hours people work, making it possible to calculate injury rates. Most states collect payroll data but not the number of employees or hours worked, meaning that to calculate injury rates, one would need to obtain the necessary employment data from an outside source.

In 2001, University of Minnesota researchers did such a study. The analysis compared workers’ comp costs and claims frequency among regular full-time workers, part-time workers and temporary and leased workers. To calculate claims frequency rates, researchers used outside data from the Census Bureau’s Current Population Survey. The researchers found that both cost and claims frequency were many times higher for temp and leased workers.

Methodology

How we got the data

ProPublica set out to compare the rate of workers’ compensation claims of temp workers and regular workers in as many states as possible. Reporters contacted workers’ comp system administrators and ratings bureaus in 25 states.

Using workers’ comp records to track temp worker injuries nationwide was not possible. In many states, such as New Jersey, claims are considered confidential. In many others, such as New York, there is no way to distinguish temp workers from regular workers because the state does not collect industry information or it is rarely reported on claims. Other states were problematic because of the way claims are reported to the state. In Texas, for example, employers aren’t required to carry workers’ comp insurance. Those employers are still required to report injuries to the state, but auditors have found that many fail to do so. In Illinois, about half of the claims are filed on paper and never entered into a computer system.

Ultimately, ProPublica obtained claims databases from California (2008-12), Florida (2008-12), Massachusetts (2008-12) and Oregon (2007-12) and aggregate claims data for Minnesota (2007-11). The data comes from the first report of injury (FROI) and subsequent report of injury (SROI) forms that employers and claims administrators are required to file with the state administrative office. It is important to note that data is not comparable between states because every state has different rules for what is considered a reportable claim. In California, the standard is any injury resulting in more than a full day of time off or requiring medical treatment beyond first aid. In Florida, the threshold is seven days off. In Massachusetts, lost-time claims are defined as more than five days of lost wages. In Minnesota and Oregon, the standard is more than three days away from work. There are also differences in the labor markets, especially the market for temporary workers, in each state.

Identifying Temp Workers

Temporary workers were identified using the employers’ North American Industry Classification System (NAICS) codes. Under this system, which is used by most federal agencies, “temporary help services” is a separate industry, identified with the code 561320. California also allows claims to be filed with the older Standard Industrial Classification (SIC) codes. For the temp help industry, ProPublica was able to convert these codes to the NAICS system.

Unlike other states, Massachusetts’ database did not contain industry codes but did contain employer names. Because Massachusetts requires all temp agencies to register with the state, ProPublica was able to match several years of the agency registry to the claims database to identify temp agencies. In addition, ProPublica searched the workers’ comp database for keywords, such as “staffing,” “personnel,” and “labor,” and then researched the companies to identify temp agencies that were missing from the registry.

The Analysis

ProPublica analyzed the workers’ comp data in three ways.

1. ProPublica calculated total claims for temp agency workers and non-temp workers and calculated a claims rate using employment data from the Quarterly Census of Employment and Wages (QCEW). This BLS census counts all employees in the United States by industry and geography. Every quarter, every business is required to report to its state, for unemployment insurance tax purposes, how many employees it had on the payroll. The QCEW is considered the most reliable source of employment data for this analysis because, like workers’ comp data, companies are required to report it to the state for insurance purposes.

Combining the QCEW counts with counts of worker injuries from the workers’ compensation data obtained in California, Florida, Massachusetts, Minnesota, and Oregon, we were able to construct incidence rate ratios, also called risk ratios, by dividing the rate of injuries for temporary workers by that of non-temporary workers. We assessed the statistical significance of the risk ratios by calculating 95% confidence intervals.
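For readers who want to reproduce the arithmetic, here is a minimal sketch of the rate ratio and the standard large-sample (log-method) confidence interval. It is consistent with the published intervals in Figure 1 below, but it is not ProPublica's actual analysis code.

// Rate ratio and 95% CI for injured/total counts in two groups (log method).
function incidenceRateRatio(tempInjured, tempTotal, nonTempInjured, nonTempTotal) {
  var irr = (tempInjured / tempTotal) / (nonTempInjured / nonTempTotal);
  // Standard error of log(IRR) for two independent proportions.
  var se = Math.sqrt(1 / tempInjured - 1 / tempTotal +
                     1 / nonTempInjured - 1 / nonTempTotal);
  return {
    irr: irr,
    lower: Math.exp(Math.log(irr) - 1.96 * se),
    upper: Math.exp(Math.log(irr) + 1.96 * se)
  };
}

// California, from Figure 1 below; totals are injured plus non-injured workers.
incidenceRateRatio(51227, 51227 + 203383, 2007337, 2007337 + 12551306);
// => { irr: ~1.46, lower: ~1.45, upper: ~1.47 }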

In Florida and Oregon, the workers’ compensation data was rich enough that we were also able to identify the occupations of injured workers, and thus also construct incidence rate ratios for temporary versus non-temporary workers in various occupations.

Figure 1: Workplace Injury Incidence Rates and Risk Ratios by State and Occupation

State | Occupation | Temp Injured | Temp Non-injured | Non-temp Injured | Non-temp Non-injured | IRR | 95% CI Min | 95% CI Max | Not Significant
California | Total | 51,227 | 203,383 | 2,007,337 | 12,551,306 | 1.46 | 1.45 | 1.47 |
Florida | Total | 6,233 | 105,267 | 267,486 | 6,919,928 | 1.50 | 1.47 | 1.54 |
Florida | Construction | 772 | 7,008 | 3,832 | 239,608 | 6.30 | 5.85 | 6.79 |
Florida | Production | 312 | 22,718 | 2,536 | 252,904 | 1.36 | 1.21 | 1.53 |
Florida | Transportation/Logistics | 657 | 27,383 | 6,568 | 389,222 | 1.41 | 1.30 | 1.53 |
Florida | Office | 150 | 37,500 | 2,966 | 1,283,704 | 1.73 | 1.47 | 2.03 |
Massachusetts | Total | 3,128 | 44,644 | 150,883 | 2,993,880 | 1.36 | 1.32 | 1.41 |
Minnesota | Total | 3,188 | 43,210 | 102,393 | 2,470,801 | 1.72 | 1.67 | 1.79 |
Oregon | Total | 3,545 | 26,275 | 115,787 | 1,505,527 | 1.66 | 1.61 | 1.72 |
Oregon | Construction | 69 | 1,501 | 1,378 | 54,212 | 1.77 | 1.40 | 2.25 |
Oregon | Production | 176 | 8,684 | 2,001 | 93,049 | 0.94 | 0.81 | 1.10 | *
Oregon | Transportation/Logistics | 184 | 4,066 | 2,862 | 111,288 | 1.73 | 1.49 | 2.00 |
Oregon | Office | 25 | 6,725 | 831 | 249,489 | 1.12 | 0.75 | 1.66 | *

(* = not statistically significant)

The workers’ compensation data also classifies injuries by type, for example ‘struck by or against object’ and ‘amputation.’ It was also possible to calculate risk ratios for just workers with the same type of injury. (See Appendix A.)

2. It is important to consider occupation when analyzing temp worker injuries because the composition of the temp industry is very different from the labor market as a whole. For example, temp workers are over-represented in manufacturing, warehouse and office occupations and under-represented in sales and restaurant jobs.

To assess the greater occupational risk faced by temporary workers, we classified each of the 618 BLS Broad Occupational Categories into quintiles – five ranked groups – by their injury incidence rate from the BLS Survey of Occupational Injuries and Illnesses. We then grouped all temporary and non-temporary workers into these occupational danger categories and calculated the relative incidence of the two types of worker in each category. This allowed us to compare the number of temps in each category with the number we would expect to see if temp workers were distributed across occupational risk levels in the same way as non-temp workers. We found that temps and non-temps were relatively equally distributed in the two least dangerous categories, but non-temp workers were much more concentrated in the middle, while temp workers were disproportionately represented in the two most dangerous categories.

Figure 2: Rate of temporary and non-temporary workers in occupations ranked by injury rate

Occupational Danger Category | Injuries per 10,000 Workers | Temporary | Non-Temporary | Incidence Rate Ratio | 95% CI Lower | 95% CI Upper
1 | x < 18.44 | 406,950 | 22,388,800 | 0.7323047 | 0.730205 | 0.7344104
2 | 18.44 <= x < 42.40 | 505,170 | 18,432,850 | 1.104146 | 1.101347 | 1.106953
3 | 42.40 <= x < 87.50 | 202,680 | 27,665,830 | 0.2951541 | 0.293914 | 0.2963993
4 | 87.50 <= x < 157.38 | 658,190 | 21,055,050 | 1.259437 | 1.256724 | 1.262157
5 | 157.38 <= x | 1,107,560 | 26,510,640 | 1.683173 | 1.680652 | 1.685697
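A minimal sketch of the concentration calculation behind Figure 2, using the counts in the table above: the share of temp workers in a danger quintile divided by the share of non-temp workers in that quintile. It reproduces the published ratios but is not the original analysis code.

// Relative concentration of temps vs. non-temps in one danger quintile.
function concentrationRatio(tempInQuintile, tempTotal, nonTempInQuintile, nonTempTotal) {
  return (tempInQuintile / tempTotal) / (nonTempInQuintile / nonTempTotal);
}

var tempTotal = 406950 + 505170 + 202680 + 658190 + 1107560;            // 2,880,550
var nonTempTotal = 22388800 + 18432850 + 27665830 + 21055050 + 26510640; // 116,053,170

concentrationRatio(1107560, tempTotal, 26510640, nonTempTotal); // => ~1.68 (quintile 5)
concentrationRatio(406950, tempTotal, 22388800, nonTempTotal);  // => ~0.73 (quintile 1)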

While several states had occupation data, only one state, Oregon, had detailed Standard Occupational Classification (SOC) codes that could be matched to BLS employment data. In Florida, ProPublica coded the text occupation fields, first using the NIOSH Industry & Occupation Computerized Coding System (NIOCCS), an automated coding program released by NIOSH (the National Institute for Occupational Safety and Health) in December 2012. ProPublica then coded the remainder of the fields manually, using the Census 2010 Occupation Index, the BLS SOC index, and the National Council on Compensation Insurance (NCCI) Scopes Manual. Any occupation that was unfamiliar was coded using the job duties most commonly listed for it in online job postings.

Employment data for the QCEW program, which was used in the overall analysis, does not include data on occupation. So ProPublica used 2012 research estimates published in May 2013 by the BLS Occupational Employment Statistics (OES) program. This is an annual survey of nonfarm establishments. The estimates include data from six semi-annual survey panels over a three-year period, covering 1.2 million establishments.

Because of the small size of the survey sample, the OES data does not go down to the specificity of the 5-digit industry level for temporary help services (56132); so ProPublica had to use the broader 4-digit level for the employment services industry group (5613). That group also includes two other industries: employment placement agencies, i.e. recruiting firms, and professional employer organizations (PEOs), which are human resources outsourcing firms which assume an employer’s responsibilities for tax and insurance purposes and then lease the employees back to the company that supervises them.

3. ProPublica also wanted to consider whether differences between temporary and non-temporary workers might be causing the injury gap to be overstated.  For example, are younger workers more likely to be temporary and also more likely to be injured?

To assess this possibility, ProPublica conducted a logistic regression analysis of 117,274 2010 and 2011 workers’ compensation claims from Florida, and demographic information about workers in Florida from the American Community Survey. We obtained microdata from IPUMS, which enabled us to construct cell counts for each of the combinations of variables in our model. To obtain counts of uninjured workers in each cell, we subtracted the corresponding counts of injured workers. (For a table of the data used in this model, see Appendix B.)

The estimates of the number of workers with various combinations of characteristics in the American Community Survey differ somewhat from the Quarterly Census of Employment and Wages data used above. So we first ran a regression to determine the increased odds of temporary worker injury without controlling for worker characteristics. This regression found that temporary workers had 3.8-fold higher odds of being injured.

Then, including age, sex and occupation information, we found that the odds increased to over 4-fold higher, suggesting that comparing more similar groups of workers actually increases the gap in odds between temporary workers and non-temporary workers. Thus, concerns that worker characteristics would negate the increased odds of injury for temporary workers appear unfounded. To the contrary, controlling for worker characteristics actually increased the ‘temp effect.’

Figure 3: Logistic Regression Model

Variables in the Equation
Variable (Step 1*) | B | S.E. | Wald | df | Sig. | Exp(B)
Temp | 1.398 | .021 | 4629.873 | 1 | .000 | 4.047
Age 16-24 | -.508 | .013 | 1515.357 | 1 | .000 | .602
Age 25-34 | -.162 | .010 | 282.089 | 1 | .000 | .850
Age 45-54 | .162 | .008 | 366.060 | 1 | .000 | 1.176
Over 55 | .091 | .009 | 101.277 | 1 | .000 | 1.095
Male | -.070 | .007 | 101.530 | 1 | .000 | .933
Dangerous job | 1.344 | .007 | 39348.975 | 1 | .000 | 3.834
Constant | -4.839 | .008 | 385307.699 | 1 | .000 | .008

* Variable(s) entered on step 1: Temp, Age 16-24, Age 25-34, Age 45-54, Over 55, Male, Dangerous job.
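A note on reading the model: the Exp(B) column is simply the exponentiated coefficient, i.e., the multiplicative change in the odds of injury associated with each variable, which is where the "over 4-fold" figure above comes from.

// Exp(B) is e raised to the coefficient: the multiplicative change in the
// odds of injury associated with that variable, holding the others fixed.
Math.exp(1.398); // => ~4.05: temp workers' increased odds, the "over 4-fold" figure
Math.exp(1.344); // => ~3.83: the odds multiplier for working in a dangerous job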

Figure 4: Logistic Regression Model Coefficients - Bootstrap Confidence Intervals

Bootstrap for Variables in the Equation
Variable | B | Bias | Std. Error | Sig. (2-tailed) | 95% CI Lower | 95% CI Upper
Temp | 1.398 | -.010 | .024 | .001 | 1.334 | 1.432
Age 16-24 | -.508 | .000 | .013 | .001 | -.534 | -.482
Age 25-34 | -.162 | -.001 | .010 | .001 | -.183 | -.145
Age 45-54 | .162 | .000 | .008 | .001 | .146 | .179
Over 55 | .091 | .000 | .009 | .001 | .073 | .109
Male | -.070 | .000 | .007 | .001 | -.083 | -.057
Dangerous job | 1.344 | -.001 | .006 | .001 | 1.331 | 1.356
Constant | -4.839 | .001 | .008 | .001 | -4.854 | -4.822

* Unless otherwise noted, bootstrap results are based on 1000 bootstrap samples

Because worker characteristics did not significantly affect the regression model, we simplified our analysis by calculating a stratified risk ratio for workers in blue-collar jobs. When we calculated that ratio for temps and non-temps in Florida, we found that temps were six times more likely to be injured.

Strengths and Limitations

While other studies have examined workers’ compensation claims in a single state to assess increased workplace injury risk for temporary workers, this study has found a consistently large and significant result across a diverse array of states, including two of the largest.

The main limitations are the result of the use of workers’ compensation data for this analysis. In many states the data is not publicly available at all. Where data is available, there is considerable variation in collection and reporting methods between states.

Workers' compensation claims, particularly the first reports of injury (FROIs), are an imperfect record of injuries. Some workers file false claims. Some employers suppress legitimate claims. As with any data set where records are filed by multiple people, some claims administrators provide more accurate and complete information than others. To limit these imperfections, we tried to use only accepted claims wherever possible. Such is the limitation of public administrative claims data collected by state governments. Future researchers could seek more complete and detailed claims data from private insurance companies.

In each of the states where we were able to obtain data, there were significant difficulties in using it for this type of analysis. In particular, it is important to have a reasonably accurate way of identifying temporary workers, but this is made difficult by confidentiality rules that prohibit release of employer names, and a lack of standardization of occupational coding and text descriptions.

While we were able to control for age, sex and occupation in Florida, it would also be interesting to control for other variables like race and job tenure, which could impact the results. Unfortunately, job tenure and race are rarely included in the states’ workers’ compensation records.

Temporary workers also appear to face barriers to filing workers’ comp claims. In general, temporary workers are less educated, far less likely to be represented by a union and far more likely to have limited English proficiency. In addition, temp workers may be disproportionately drawn from men and women who lack immigration status. While we can’t estimate this effect precisely, it could be contributing to a significant undercount of temp worker injuries in this data.

Given the promise of this and other analyses, we hope that they will serve as impetus for regulators or others to start collecting standardized and comprehensive data on this important issue affecting an increasing number of workers, many of whom labor under limited protection.

Appendix A: Temporary Worker versus Non Temporary Worker Risk Ratios by Type of Injury

Injury Type | State | Temp | Total Temps | Non-Temp | Total Non-Temps | Risk Ratio | 95% CI Lower | 95% CI Upper
Amputations | Florida | 48 | 111,500 | 983 | 7,187,414 | 3.15 | 2.36 | 4.21
Amputations | California | 108 | 254,610 | 1,999 | 14,558,643 | 3.09 | 2.55 | 3.75
Amputations | Oregon | 40 | 29,820 | 700 | 1,621,314 | 3.11 | 2.26 | 4.27
Amputations | Massachusetts | 23 | 47,772 | 519 | 3,144,763 | 2.92 | 1.92 | 4.43
Caught In | Florida | 365 | 111,500 | 9,628 | 7,187,414 | 2.44 | 2.20 | 2.71
Caught In | California | 2,454 | 254,610 | 57,895 | 14,558,643 | 2.42 | 2.33 | 2.52
Caught In | Oregon | 275 | 29,820 | 4,116 | 1,621,314 | 3.63 | 3.22 | 4.10
Struck By | Florida | 950 | 111,500 | 30,952 | 7,187,414 | 1.98 | 1.86 | 2.11
Struck By | California | 7,424 | 254,610 | 259,614 | 14,558,643 | 1.64 | 1.60 | 1.67
Struck By | Oregon | 690 | 29,820 | 15,598 | 1,621,314 | 2.41 | 2.23 | 2.59
Heat Related | Florida | 8 | 111,500 | 183 | 7,187,414 | 2.82 | 1.39 | 5.72
Heat Related | California | 66 | 254,610 | 1,796 | 14,558,643 | 2.10 | 1.64 | 2.69

Appendix B: Count of Florida Workers by Characteristics

Status | Worker Type | Age Group | Dangerous Job | Sex | Non-Adjusted Count
Injured | Not Temp | Over 55 | 0 | F | 8,163
Injured | Not Temp | Over 55 | 0 | M | 3,813
Injured | Not Temp | Over 55 | 1 | F | 2,504
Injured | Not Temp | Over 55 | 1 | M | 9,358
Injured | Not Temp | 16 to 24 | 0 | F | 2,199
Injured | Not Temp | 16 to 24 | 0 | M | 1,816
Injured | Not Temp | 16 to 24 | 1 | F | 435
Injured | Not Temp | 16 to 24 | 1 | M | 3,064
Injured | Not Temp | 45 to 54 | 0 | F | 8,818
Injured | Not Temp | 45 to 54 | 0 | M | 4,181
Injured | Not Temp | 45 to 54 | 1 | F | 3,956
Injured | Not Temp | 45 to 54 | 1 | M | 14,163
Injured | Not Temp | 35 to 44 | 0 | F | 5,748
Injured | Not Temp | 35 to 44 | 0 | M | 3,675
Injured | Not Temp | 35 to 44 | 1 | F | 2,800
Injured | Not Temp | 35 to 44 | 1 | M | 12,867
Injured | Not Temp | 25 to 34 | 0 | F | 4,116
Injured | Not Temp | 25 to 34 | 0 | M | 3,268
Injured | Not Temp | 25 to 34 | 1 | F | 1,575
Injured | Not Temp | 25 to 34 | 1 | M | 9,404
Injured | Temp | Over 55 | 0 | F | 62
Injured | Temp | Over 55 | 0 | M | 31
Injured | Temp | Over 55 | 1 | F | 44
Injured | Temp | Over 55 | 1 | M | 221
Injured | Temp | 16 to 24 | 0 | F | 44
Injured | Temp | 16 to 24 | 0 | M | 28
Injured | Temp | 16 to 24 | 1 | F | 29
Injured | Temp | 16 to 24 | 1 | M | 156
Injured | Temp | 45 to 54 | 0 | F | 82
Injured | Temp | 45 to 54 | 0 | M | 63
Injured | Temp | 45 to 54 | 1 | F | 77
Injured | Temp | 45 to 54 | 1 | M | 525
Injured | Temp | 35 to 44 | 0 | F | 70
Injured | Temp | 35 to 44 | 0 | M | 47
Injured | Temp | 35 to 44 | 1 | F | 80
Injured | Temp | 35 to 44 | 1 | M | 483
Injured | Temp | 25 to 34 | 0 | F | 54
Injured | Temp | 25 to 34 | 0 | M | 62
Injured | Temp | 25 to 34 | 1 | F | 50
Injured | Temp | 25 to 34 | 1 | M | 398
Not Injured | Not Temp | Over 55 | 0 | F | 713,566
Not Injured | Not Temp | Over 55 | 0 | M | 553,805
Not Injured | Not Temp | Over 55 | 1 | F | 89,292
Not Injured | Not Temp | Over 55 | 1 | M | 325,671
Not Injured | Not Temp | 16 to 24 | 0 | F | 464,904
Not Injured | Not Temp | 16 to 24 | 0 | M | 310,034
Not Injured | Not Temp | 16 to 24 | 1 | F | 35,282
Not Injured | Not Temp | 16 to 24 | 1 | M | 196,531
Not Injured | Not Temp | 45 to 54 | 0 | F | 830,571
Not Injured | Not Temp | 45 to 54 | 0 | M | 567,128
Not Injured | Not Temp | 45 to 54 | 1 | F | 116,001
Not Injured | Not Temp | 45 to 54 | 1 | M | 428,358
Not Injured | Not Temp | 35 to 44 | 0 | F | 770,164
Not Injured | Not Temp | 35 to 44 | 0 | M | 560,539
Not Injured | Not Temp | 35 to 44 | 1 | F | 103,524
Not Injured | Not Temp | 35 to 44 | 1 | M | 413,341
Not Injured | Not Temp | 25 to 34 | 0 | F | 709,647
Not Injured | Not Temp | 25 to 34 | 0 | M | 497,074
Not Injured | Not Temp | 25 to 34 | 1 | F | 70,092
Not Injured | Not Temp | 25 to 34 | 1 | M | 364,232
Not Injured | Temp | Over 55 | 0 | F | 6,571
Not Injured | Temp | Over 55 | 0 | M | 2,278
Not Injured | Temp | Over 55 | 1 | F | 411
Not Injured | Temp | Over 55 | 1 | M | 970
Not Injured | Temp | 16 to 24 | 0 | F | 2,344
Not Injured | Temp | 16 to 24 | 0 | M | 638
Not Injured | Temp | 16 to 24 | 1 | F | 179
Not Injured | Temp | 16 to 24 | 1 | M | 1,687
Not Injured | Temp | 45 to 54 | 0 | F | 7,642
Not Injured | Temp | 45 to 54 | 0 | M | 2,599
Not Injured | Temp | 45 to 54 | 1 | F | 649
Not Injured | Temp | 45 to 54 | 1 | M | 2,395
Not Injured | Temp | 35 to 44 | 0 | F | 6,635
Not Injured | Temp | 35 to 44 | 0 | M | 3,919
Not Injured | Temp | 35 to 44 | 1 | F | 755
Not Injured | Temp | 35 to 44 | 1 | M | 2,026
Not Injured | Temp | 25 to 34 | 0 | F | 5,885
Not Injured | Temp | 25 to 34 | 0 | M | 2,818
Not Injured | Temp | 25 to 34 | 1 | F | 395
Not Injured | Temp | 25 to 34 | 1 | M | 1,607

Appendix C: Occupations With Injury Rate Z-Score and Dangerous Job Flag

SOC Code | Occupation | Z-Score | Dangerous Job
23 | Legal occupations | -1.044254 | 0
15 | Computer and mathematical science occupations | -1.0144028 | 0
13 | Business and financial operations occupations | -0.9670347 | 0
17 | Architecture and engineering occupations | -0.9642812 | 0
11 | Management occupations | -0.9254674 | 0
19 | Life, physical, and social science occupations | -0.8743228 | 0
27 | Arts, design, entertainment, sports, and media occupations | -0.6809308 | 0
43 | Office and administrative support occupations | -0.6062071 | 0
25 | Education, training, and library occupations | -0.5976757 | 0
39 | Personal care and service occupations | -0.5373449 | 0
29 | Healthcare practitioner and technical occupations | -0.4347778 | 0
21 | Community and social service occupations | -0.4303198 | 0
41 | Sales and related occupations | -0.4185365 | 0
35 | Food preparation and serving related occupations | 0.1736597 | 0
31 | Healthcare support occupations | 0.2963881 | 0
37 | Building and grounds cleaning and maintenance occupations | 0.6167752 | 1
51 | Production occupations | 0.8409043 | 1
47 | Construction and extraction occupations | 0.8749958 | 1
33 | Protective service occupations | 1.5029233 | 1
49 | Installation, maintenance, and repair occupations | 1.5560121 | 1
45 | Farming, fishing, and forestry occupations | 1.577786 | 1
53 | Transportation and material moving occupations | 2.0561111 | 1

Calculating ‘The Price of an Internship’


Today we launched an interactive news application called “The Price of an Internship.” We hope it will become a comprehensive source for data on internship programs at American universities, built by individual volunteers as well as journalists working in student newsrooms.

The initial launch includes data on internships at journalism programs at 20 universities, provided to ProPublica by students at these schools as part of a #ProjectIntern crowdsourcing effort.

For the initial launch, we asked volunteers to collect data on internship courses in journalism departments at their school, by calling both the university and department internship coordinators. The data they collected included whether an internship was required for graduation, how the school helped students find internships, whether it allowed paid internships and the minimum number of hours students must work to receive credit.

While we performed extensive, manual verification on the data set that accompanied the app’s launch, the app will rely for most of its life on crowdsourced data. That means that at times the data may need updating or correction, and we may be missing some data. We welcome your help. If you see something that needs to be changed or added, e-mail Blair Hickman.

Calculating Costs

Comparing the policy requirements was fairly straightforward. However, we also needed a way to calculate the tuition cost of an internship consistently across different schools. To do this, we looked at the minimum and maximum number of credits a student could earn for a single term of an internship course, as well as the tuition cost-per-credit at that university.

Our volunteers helped us find the minimum and maximum number of credits for an internship course. This varies across universities. For example, at the University of Southern California, journalism students earn one credit for completing JOUR090X, according to media relations spokesperson Gretchen Parker. At Western Washington University, journalism students can earn a maximum of six credits for completing JOUR 430, according to Peggy Watt, chair of the Journalism Department.

Information on credits earned gave us half of our equation.

For the second half of our equation, we needed to calculate the tuition cost per credit at each school.

For tuition cost information, we turned to the Common Data Set, a standard form that most universities use to report data about their institution. In particular, Part G (“Annual Expenses”) of the CDS includes tuition and credit-hour data, which allows us to compare one school’s tuition costs to another’s.

Our calculations make the assumption that the dollar value of an internship credit is equal to that of a credit of normal classwork. In situations where CDS data is not available, we use values reported on a school’s website or values given to us by an internship coordinator or other university official.

Our calculations differ a bit based on how the schools compute tuition pricing. The schools we’ve collected data on so far use one of two models: per-credit-hour and per-term (or similar).

Schools with Per-Credit-Hour Tuition Pricing

Schools that use per-credit-hour tuition pricing generally report a per-credit-hour dollar amount in Section G6 of the CDS. If this is the case, we calculate the tuition cost of an internship by multiplying this value by the range of credits or credit-hours a student can earn for one term of an internship course.

Public institutions provide separate G6 values for in-state and out-of-state students. For private institutions, the price is the same. The calculation takes one of three forms:

in_state = (G6 Public, In-state out-of-district) * (Credit value for listed internship course)

out_of_state = (G6 Public, Out-of-state) * (Credit value for listed internship course)

private = (G6 Private Institutions) * (Credit value for listed internship course)

Some institutions and courses have policies allowing internships to be taken for no tuition cost, including zero-credit and transcript notations. In these cases, we display the low-end of the tuition cost range as $0.

Schools With Per-term Tuition or Similar Pricing

While some schools charge per-credit-hour, other institutions may have alternative methods of charging tuition. For example, Western Washington University charges a flat cost if a student is taking between 10 and 18 credits.

Schools report “typical” tuition costs for a full-time undergraduate student for the full school year in section G1 of the Common Data Set. For schools in this category, we normalize these values to create an “estimated” cost per term based on the CDS definition of a "full school year" — 2 terms if using a "semester" schedule, 3 terms if using a “trimester” or “quarter” schedule, 1 full-year term if using a "4-1-4" schedule or similar plan.

We take this derived cost per term and then divide it by the maximum number of credits a full-time student may take per term (reported in Section G2 of the CDS). This gives us a synthetic estimate of the cost-per-credit that a full-time student at the institution would expect to pay if they stretched a per-term pricing structure to allow a maximum number of courses. This may underestimate, but shouldn’t overestimate, the theoretical price of the internship.

For public institutions, the CDS gives separate G1 values for in-state and out-of-state students. For private institutions, the price is the same. The calculation takes one of three forms:

in_state = (G1 Public, In-state out-of-district; tuition only; Undergraduate) / (# of terms per year) / (G2 maximum credits a full-time student may take) * (Credit value for listed internship course)

out_of_state = (G1 Public, out-of-state; tuition only; Undergraduate) / (# of terms per year) / (G2 maximum credits a full-time student may take) * (Credit value for listed internship course)

private = (G1 Private; tuition only; Undergraduate) / (# of terms per year) / (G2 maximum credits a full-time student may take) * (Credit value for listed internship course)
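As a worked illustration of the per-term arithmetic above (the numbers are invented, not data for any particular school; the per-credit-hour case is simply the G6 rate multiplied by the internship credits):

// Hypothetical sketch of the per-term calculation described above.
function perTermInternshipCost(annualTuition, termsPerYear, maxCreditsPerTerm, internshipCredits) {
  // Derive a synthetic cost per credit from the CDS G1 and G2 values,
  // then scale it by the credits earned for the internship course.
  var costPerCredit = annualTuition / termsPerYear / maxCreditsPerTerm;
  return costPerCredit * internshipCredits;
}

// e.g., a semester school charging $9,000 a year, capping full-time students
// at 18 credits per term, for a 6-credit internship course:
perTermInternshipCost(9000, 2, 18, 6); // => 1500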

When a student can earn a range of credit hours for an internship course, we show the raw tuition costs as a range, calculated from the minimum and maximum credits the student could receive for the internship.

Some institutions and courses have policies allowing internships to be taken at no cost in certain circumstances. In these cases, we display the low-end of the tuition range as $0.

Ratings and Reviews

The app also lets students rate and review their internships. As with such features all around the Internet, these represent only the experiences of those students who have left feedback, and may not accurately represent all students who have taken an internship through the department’s internship programs.

How We Made the 3-D New York City Flood Map


Earlier this year we published a story and an interactive graphic about the evolving Federal Emergency Management Agency flood maps in New York City in the year after Hurricane Sandy.

FEMA had advisory maps in the works when Sandy hit. The agency rushed them out in the days afterward as a first sketch for those looking to rebuild.

Our story found that while the maps continued to be revised over the course of a year, homeowners had little guidance on how much their home’s value — as well as its required elevation — was changing as they struggled to rebuild after the storm. To complicate matters, Congress had recently passed legislation which threatened to dramatically raise flood insurance premiums for those remapped into high-risk flood zones.

In the midst of all of this, New York City Mayor Michael Bloomberg announced an ambitious $20 billion plan to protect the city from storms, a plan with at least a $4.5 billion funding gap and no clear timeline.

With these advisory maps as a guide, and knowing there would be another revision in the coming months, we wanted to create a visualization that would show readers the impact Sandy had, how much impact a potential flood could have, and how the measures laid out in Bloomberg’s plan, if implemented, might protect the city.

We were inspired by graphics like this Times-Picayune 3-D map of the New Orleans levee system which shows how bowl-like that city is, as well as the U.S. Army Corps of Engineers' scale model of the San Francisco Bay. Mapping in three dimensions helps readers see ground elevation and building height in a much more intuitive way than a traditional flat map, and one which matches their mental model of the city.

We set out to find the right technology to render our map in a browser. Software like Maya would allow us to make an animated motion graphic of the map, which would have been beautiful, but we wanted to let readers explore it and find locations that are important to them. So even though it only works in the newest browsers, we decided to use WebGL, which is a 3-D graphics library and a kind of experimental bridge between JavaScript and OpenGL. It lets web developers talk right to the user's graphics card.

Aside from creating what we believe is one of the first maps of its kind on the web, we also persuaded New York City to release, for the first time, its 26 gigabyte 2010 digital elevation model, which is now available on the NYC DataMine.

The Data

To make our 3-D maps we needed accurate geographic data. We needed the GIS files for two different iterations of the flood-risk zones — the 2007 zones (mostly based on 1983 data) and the new 2013 advisory ones. We also needed the footprints for every building in New York City (there are more than a million), including their heights as well as the amount of damage each building sustained during Sandy. We also needed elevation data, showing how high the land is all over the city, as well as a base layer of things like streets and parks.

Some of that data was easy to get. FEMA will ship you free shapefiles for every flood insurance map in the country through its Map Service Center, and they post the new New York City flood zone files on a regional site.

Getting the building data wasn't so easy.

At the time we were making the map, the city’s building footprints shapefile did not include building heights, so we needed to join it to two other databases — the Property Assessment Roll and the Property Address Directory file in order to get number of stories (which we used as a proxy for height). Since then, the city has updated the buildings footprint file with a field containing the actual height for each building.

The last step was associating FEMA's estimate of damage level to each building. Since FEMA's data is stored as a collection of points, we needed to do a spatial query to find which of the building footprints intersected with FEMA's points.

With all of the data in hand, we were able to make a new shapefile combining building footprints, height (which we approximated by setting a story to ten feet high), and level of damage.

To make a 3-D map, you need information on the height of the ground. But finding a good dataset of ground elevations in a crowded city like New York is difficult. In 2010, the City University of New York mapped the topography of New York City using a technology called lidar in order to find rooftops that would be good locations for solar installations. Thankfully, we were able to persuade New York City’s Department of Information Technology and Telecommunications to give us the dataset. The department also posted the data on the NYC DataMine for anyone to download.

Inventing a Format

Once we assembled the data, we needed to convert it into a form WebGL could understand. Shapefiles store polygons as a simple array of points, but WebGL doesn't handle polygons, only triangles, so we needed to transform our polygons into collections of triangles. In order to do so, we implemented an “ear clipping” algorithm that turns each polygon into a set of triangles covering the same area.

We then packed these triangles into a buffer of binary data in a format we called a ".jeff" file. Jeff files are very simple binary files. They consist of three length-prefixed buffers: a 32-bit float array of the vertices in a particular shape, a 32-bit integer array of triangle offsets, and metadata. The record layout looks like this:

length of vertices | vertex x1 | vertex y1 | vertex x2 | vertex y2 | ...
length of triangles | index of the first point of the triangle | index of the second point of the triangle | index of the third point of the triangle | ...
length of metadata | JSON-encoded metadata
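For concreteness, here is a minimal Node.js sketch of what a writer for this record layout might look like. The function and field names are hypothetical, it assumes the reading browser shares the writer's (little-endian) byte order and ASCII-safe JSON metadata, and Node's built-in Buffer here is unrelated to the browser-side propublica.utils.Buffer reader shown below.

// Hypothetical writer for a stream of .jeff records.
var fs = require('fs');

function int32(n) {
  var b = Buffer.alloc(4);
  b.writeInt32LE(n, 0);
  return b;
}

// `shapes` is an array of { points: Float32Array, triangles: Int32Array, meta: Object }.
function writeJeff(path, shapes) {
  var chunks = [];
  shapes.forEach(function(shape) {
    var metaBytes = Buffer.from(JSON.stringify(shape.meta));
    // Length-prefixed vertices, then triangle indices, then metadata.
    chunks.push(int32(shape.points.length),
                Buffer.from(shape.points.buffer, shape.points.byteOffset, shape.points.byteLength));
    chunks.push(int32(shape.triangles.length),
                Buffer.from(shape.triangles.buffer, shape.triangles.byteOffset, shape.triangles.byteLength));
    chunks.push(int32(metaBytes.length), metaBytes);
  });
  fs.writeFileSync(path, Buffer.concat(chunks));
}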

It turns out to be simple and fast for browsers that support WebGL to read this binary format in JavaScript, because they already need to implement fast binary data buffers and typed arrays. In order to get and read binary data from the server, you create an XMLHttpRequest with a responseType of arraybuffer. The data sent back will be in an ArrayBuffer object, which is simply an array of bytes. After the request completes we parse the .jeff files with the readJeff function of this class:

// A small wrapper around an ArrayBuffer that keeps track of a read offset.
var Buffer = propublica.utils.Buffer = function(buf) {
  this.offset = 0;
  this.buf = buf;
};

// Read `amt` bytes from the current offset as a typed array of the given
// type, advancing the offset.
Buffer.prototype.read = function(amt, type) {
  var ret = new type(this.buf.slice(this.offset, this.offset + amt));
  this.offset += amt;
  return ret;
};

// Read a single 32-bit integer (used for the length prefixes).
Buffer.prototype.readI32 = function() {
  return this.read(Int32Array.BYTES_PER_ELEMENT, Int32Array)[0];
};

// Read `amt` 32-bit integers (the triangle indices).
Buffer.prototype.readI32array = function(amt) {
  return this.read(amt * Int32Array.BYTES_PER_ELEMENT, Int32Array);
};

// Read `amt` 32-bit floats (the vertex coordinates).
Buffer.prototype.readF32array = function(amt) {
  return this.read(amt * Float32Array.BYTES_PER_ELEMENT, Float32Array);
};

// Read `amt` bytes of JSON-encoded metadata and parse it.
Buffer.prototype.readStr = function(amt) {
  return JSON.parse(String.fromCharCode.apply(null, this.read(amt * Uint8Array.BYTES_PER_ELEMENT, Uint8Array)));
};

// True while there are still unread records in the buffer.
Buffer.prototype.more = function() {
  return this.offset < this.buf.byteLength;
};

// Walk the buffer record by record, handing each shape's points, triangles
// and metadata to the callback.
Buffer.prototype.readJeff = function(cb) {
  while (this.more()) {
    var points    = this.readF32array(this.readI32());
    var triangles = this.readI32array(this.readI32());
    var meta      = this.readStr(this.readI32());
    cb(points, triangles, meta);
  }
};
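Here is a sketch of how that parser might be fed in the browser, following the XMLHttpRequest approach described above; the URL is illustrative.

// Fetch a .jeff file as an ArrayBuffer and walk its records.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/data/coney-island.jeff', true);
xhr.responseType = 'arraybuffer';
xhr.onload = function() {
  new propublica.utils.Buffer(xhr.response).readJeff(function(points, triangles, meta) {
    // points: Float32Array of x, y pairs; triangles: Int32Array of vertex
    // indices in groups of three; meta: the parsed JSON metadata (e.g. height).
    console.log(meta, points.length / 2, 'vertices', triangles.length / 3, 'triangles');
  });
};
xhr.send();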

You'll notice that we are only sending x and y coordinates over the wire. Along with the triangles and point arrays we send to the client, we also send a bit of metadata that defines the height for the shape (either flood zones or building footprints) as a whole. Once the data arrives on the client a web worker extrudes the buildings into the 3-D shapes to display in the browser. The final view shows a lot of data. For example, Coney Island alone has almost 200,000 triangles.

A Neighborhood Bakery

In order to slice up the map into neighborhoods, we wrote a script to iterate through all of our files and clip them to the same bounds, so they would stack up on the map like a layer cake. New zones, old zones, the city's boundary, all five categories of damaged buildings, coastline and street map data needed to be clipped, reprojected and turned into .jeff files at once. With 11 layers and seven neighborhoods, we baked out 77 files every time we tweaked the map. Because Postgres's PostGIS extension is the best way to dice shapes into squares, our script created temporary tables for each layer of each neighborhood, and ran them through a query like so:

SELECT *,
  ST_Intersection(ST_Transform(the_geom, $NJ_PROJ),
  ST_SetSRID(
    ST_MakeBox2D(
      ST_Point($ENVELOPE[0], $ENVELOPE[1]),
      ST_POINT($ENVELOPE[2], $ENVELOPE[3])
    ), $NJ_PROJ)
  )
AS
  clipped_geom
FROM
  $TMP_TABLE
WHERE
  ST_Intersects(
    ST_Transform(the_geom, $NJ_PROJ), ST_SetSRID(
      ST_MakeBox2D(ST_POINT($ENVELOPE[0], $ENVELOPE[1]), ST_POINT($ENVELOPE[2], $ENVELOPE[3])
    ), $NJ_PROJ)
  )
AND
  ST_Geometrytype(
    ST_Intersection(
      ST_Transform(the_geom, $NJ_PROJ), ST_SetSRID(
        ST_MakeBox2D(ST_Point($ENVELOPE[0], $ENVELOPE[1]), ST_POINT($ENVELOPE[2], $ENVELOPE[3])
      ), $NJ_PROJ)
    )
  ) != 'ST_GeometryCollection';

In the above, $NJ_PROJ is EPSG:32011, a New Jersey state plane projection that was best for coastal New York City, and $ENVELOPE[0]..$ENVELOPE[3] is the bounds of each neighborhood.

The query grabs all the geographical data inside a box of coordinates. We took the result of that query and used the pgsql2shp tool to create a shapefile of each one — 77 in all. We then piped each of those through our script that baked out .jeff files. When that was done, we had 353 files, including all of the additional files that come along with the .shp format. In order to speed up the process, we used a Ruby gem called Parallel to run the tasks over our iMacs’ eight cores. In order to make sure each temporary table didn't stomp on the feet of its parallel task, our baker script created unique random table names for each shape for each neighborhood and dropped the table after it finished baking.

For the building shapes, we needed to use our elevation raster to record the ground elevation at each building's centroid. Fortunately, GDAL has a command-line tool that makes that trivial. For any raster GDAL can read, you can query the values encoded within the image with:

gdallocationinfo raster.tif [x] [y] -geoloc

We issued that command while the baker was running and stored the result in the metadata we sent to the browser.

Making the Map

In order to display all this data on the web we relied on a fairly new web standard called WebGL. WebGL is an extension to the <canvas> tag that is supported by certain browsers and allows JavaScript to access OpenGL-like APIs to process 3-D information directly on a computer's graphics card. The API is very complex, so we used lightgl.js, which provides a very nice API that stays closer to raw WebGL than something like three.js.

To organize things a bit, we created individual objects we called Scenes for each of the neighborhoods we wanted to show on the map. Each scene had what we called Shapes for Buildings, Flood, Flood Zones, Terrain, and Earth.

For the Buildings and Flood Zones shapes, we used the binary data format described above to build a 3-D representation that we uploaded to the graphics card.

But for the Terrain and Flood scenes, the elevation of the earth and the storm surge extent, we sent a specially encoded image to the browser that contained height information encoded as a network-format float in each pixel's red, green, blue and alpha values, using a little tool we wrote to encode images this way.
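On the client, that encoding is unpacked in reverse: read a pixel's four bytes and reinterpret them as one 32-bit float. In our map the shaders do this on the GPU, but a plain JavaScript version of the idea (the helper name is ours) looks like this:

// Hypothetical helper: reinterpret the r, g, b, a bytes of pixel i
// (from a Uint8Array such as canvas getImageData().data) as a single
// big-endian ("network format") 32-bit float.
function decodeHeight(pixels, i) {
  var view = new DataView(new ArrayBuffer(4));
  view.setUint8(0, pixels[i]);     // red
  view.setUint8(1, pixels[i + 1]); // green
  view.setUint8(2, pixels[i + 2]); // blue
  view.setUint8(3, pixels[i + 3]); // alpha
  return view.getFloat32(0);       // big-endian by default
}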

In WebGL, you don't actually manipulate 3-D models as a whole; instead, you upload to the graphics card small programs called “shaders” that operate in parallel on the 3-D data you've previously sent to the graphics card. We implemented both kinds of shaders: vertex and fragment. When a browser couldn't compile one of our shaders (for instance, not all browsers and video cards support reading textures in vertex shaders), we redirected to a fallback version of the map we called “lo-fi.”

After the geometric data was processed in the graphics pipeline, we did a bit of post processing to antialias harsh lines, and we added a bit of shading to make the buildings stand out from one another using a technique called Screen Space Ambient Occlusion. You can play with the settings to our shaders by visiting the maps in debug mode.

In order to make the little flags for landmarks like Nathan's Hot Dogs and the Cyclone, we built up an object of New York City BIN numbers and their descriptions for interesting buildings. For these BINs, we added bounds to the metadata section of the buildings .jeff file. Once we had those bounds we could use lightgl's project method to attach an HTML element (the flag) to the DOM near where the building was shown in canvasland. Whenever the user moved the map, we would reproject the flags so they would move along with the underlying map.
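Here is roughly what that reprojection step looks like. The project() call stands in for lightgl's gluProject-style helper, and the rest of the names are ours:

// Hypothetical sketch: convert a building's anchor point to screen
// coordinates and move its HTML flag there. project() stands in for
// lightgl's gluProject-style helper; flag is an absolutely positioned <div>.
function repositionFlag(flag, anchor, modelview, projection, viewport) {
  var screen = project(anchor.x, anchor.y, anchor.z, modelview, projection, viewport);
  flag.style.left = screen.x + 'px';
  flag.style.top  = (viewport[3] - screen.y) + 'px'; // flip y: GL is bottom-up, CSS is top-down
}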

Maintaining State

The user interface for the maps is pretty complicated; we have seven different areas each with three different views. Originally we had set up an ad hoc way of tracking state through a lot of if statements, but when this became unwieldy we wrote a small state machine implementation we called the KeyMaster.

State Machines are one of the best tools in a programmer's toolbox. As their name implies, they maintain state and they work by defining transitions between those states. So for example, KeyMaster could define these states:

KeyMaster.add('warn').from('green').to('yellow');
KeyMaster.add('panic').from('yellow').to('red');
KeyMaster.add('calm').from('red').to('yellow');
KeyMaster.add('clear').from('yellow').to('green');
KeyMaster.initial('green');

and transition between states like this:

KeyMaster.warn();  // >> 'yellow'
KeyMaster.panic(); // >> 'red'
KeyMaster.calm();  // >> 'yellow'
KeyMaster.clear(); // >> 'green'

KeyMaster also has events that fire “before” and “after” transitions, which we used to clean up and remove the current state.
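The hook API is internal to our KeyMaster, so the names below are illustrative rather than the real method signatures, but the idea looks something like this:

// Illustrative only: the real KeyMaster hook names may differ.
// Tear down the outgoing state before the transition...
KeyMaster.before('panic', function() {
  hideYellowWarning();   // hypothetical cleanup for the 'yellow' state
});
// ...and build up the new one after it completes.
KeyMaster.after('panic', function() {
  showRedAlert();        // hypothetical setup for the 'red' state
});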

The "Lo-Fi" Version

For users that aren't using WebGL-capable browsers, we also made a "lo-fi" version of the map, which is simply a series of images for each neighborhood. To generate the images, we wrote a little tool to automatically take and save snapshots of whatever the current map view was. This was easy thanks to the Canvas API's toDataURL method. In the end, this version looked exactly like the main view, except you couldn't change the zoom or angle. We left the snapshotter in the secret debug version of the app -- you can save your own snapshots of the current map view by hitting the "take snapshot" button in the bottom left corner.
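The snapshotter itself only needs a few lines. A minimal sketch (the element IDs and filename are made up, and a WebGL canvas has to be captured right after a frame is drawn or created with preserveDrawingBuffer set to true):

// Minimal snapshot sketch: serialize the current canvas frame to a PNG
// data URI and trigger a download. Element IDs here are hypothetical.
document.getElementById('take-snapshot').addEventListener('click', function() {
  var canvas = document.getElementById('map-canvas');
  var link = document.createElement('a');
  link.href = canvas.toDataURL('image/png');
  link.download = 'snapshot.png';
  link.click();
});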

Our 3-D map of New York City is a taste of what we’ll be able to build once all browsers support it, and we think it helped us tell a story in a way that a 2-D map wouldn't have done as well. If you end up building a 3-D news graphic, let us know!

How to Send Us Files More Securely


Today we’re launching a new system to help sources send us messages and files more securely than they can via things like email and FTP. You can read more about the system at our secure information page at securedrop.propublica.org.

The system uses SecureDrop, an open-source tool developed for The New Yorker by Kevin Poulsen and the late Aaron Swartz. The software is now maintained by the Freedom of the Press Foundation, which worked with us to set it up at ProPublica. To help protect our sources' identities, it is only accessible using the Tor system. We do not record your IP address or information about your browser, computer or operating system.

The design of the SecureDrop system builds upon well-tested security technologies like PGP encryption and Tor routing, and techniques such as air gapping. It represents the best method we know of for sharing information electronically. However, no system is perfectly secure, and sources wishing to send us material using the system should be aware of the specific risks they face, including the security of their own equipment and networks.

Technical Details

The information page at securedrop.propublica.org contains instructions on how to send us material through the system. The information page itself is on a server that is not connected to the SecureDrop server. But as an important part of the system, it needs to follow good security practices, so we’ve configured it according to security recommendations by the Freedom of the Press Foundation. We enforce HTTPS connections to that domain by using the “Strict-Transport-Security” HTTP header. We prevent external content and browser frames from accessing that page, to ensure that the information you see there isn’t tampered with. It does not store any access logs or create any persistent cookies. You can verify that server’s settings yourself by visiting SecurityHeaders.com and the Qualys SSL Labs tester.

We’ve documented the nginx configuration file we use on our information page and are publishing it today. Other sites may use our example config for similar small websites requiring HTTPS and high browser security.

For verification, the SHA1 fingerprint of the "securedrop.propublica.org" server SSL certificate is 33:03:99:09:7E:D3:83:E4:AC:48:54:E4:89:19:2D:47:68:61:7A:B5 and the Tor "hidden service" address of the ProPublica SecureDrop server is http://qzpl6f4fyx4pxzdu.onion/ . Publishing this address in several places makes it more difficult for an attacker to secretly change the link to their own "hidden service" address without someone noticing the change.

Technical users interested in finding out more about SecureDrop can visit their website and browse their GitHub repository.

How to Make a News App in Two Days, as Told by Six People Who Tried It for the First Time


This post was co-published with Source.

As part of the orientation week for the 2014 class of Knight-Mozilla OpenNews Fellows, fellow nerd-cuber Mike Tigas and I led a hackathon at Mozilla’s headquarters in San Francisco, the goal of which was to build a news application in two days. This is the story of that hackathon and the app we created, told mostly from the perspective of the Fellows who participated.

The six Fellows, who will be embedded for a year in newsrooms around the world, come from a variety of backgrounds. Most had some familiarity with web development. Few had worked in news before. They had also never previously worked together. After the two-day event, the eight of us had finished the rough draft of a demonstration news app based on a data set on tire safety ratings from the National Highway Traffic Safety Administration. Using the app, readers can look up their car’s tires to see how their temperature, traction and treadwear ratings compare to other brands.

While the Fellows had never built a news app, they quickly picked up on the process. By the end of the first day of working as a group, we had almost finished cleaning the data set and importing it into a SQLite database and were ready to delve into user interface work. By the end of the second day, the app was fully searchable and was looking great.

So, how do you make a news app, and how did we make a simple one in two days?

On the surface, you build a news app a lot like you build other kinds of web apps. They’re usually built around a relational database, using a framework like Django or Rails. They allow users to drive some kind of interaction (and sometimes input stuff that stays resident in the app) and they’re usually accessed on a variety of browsers and platforms.

One key difference: While most web apps are created to be containers for data, news apps are both the container and the data. The developer who makes the app is usually deeply involved in analyzing and preparing the data, and every app is closely tied to a particular data set.

At ProPublica we consciously design news applications to let readers tell themselves stories. Our tools include words and pictures and also interaction design and information architecture. News apps help users (really readers) find their own stories inside big national trends. They can help spur real-world impact by creating a more personal connection with the material than even the most perfectly chosen anecdote in a narrative story can.

We went through each of these four steps to build Tire Tracker.

  1. Understanding and acquiring data
  2. Cleaning and bulletproofing data
  3. Importing data into a database
  4. Designing and building the public-facing app

Understanding and Acquiring Data

As I said, most news apps enter the world with some or all of the data already in place — typically data that journalists have cleaned and analyzed. In fact, most of the time spent making a news app is usually dedicated to hand-crafting the data set so that it’s cleaned and highly accurate. This is not unlike the process of reporting a story — writing a story typically takes a lot less time than reporting it does.

There are many ways for journalists to acquire and prepare data sets for a news app. One of the easiest ways to obtain data is to find a dataset in a public repository such as data.gov, or get it through an API such as the New York Times Campaign Finance API. In some cases, the data arrives clean, complete and ready to be used, and you’ll be making a news app in no time.

However, acquiring data is not usually that easy. If the data you want is not in a public repository, you can request it from a government agency or company, sometimes through a Freedom of Information request. If the data is available as web pages but not as downloadable, structured data, you can try scraping it. If it’s a PDF, you can try using a tool like Tabula to transform it into a format good for pulling into your database.

Most data sets, in our experience, come with quality problems. Really big, complex data sets often come with really big, complex problems. Because our hackathon was only two days and we didn’t want to spend those two days cleaning data, we picked a data set we knew to be relatively small and clean: NHTSA’s Tire Quality Dataset.

We found a version of the data from 2009 on the data.gov portal, but wanted to work with more recent data, so we called NHTSA, which provided us with an updated version as a CSV file. NHTSA also has tire complaint data, which we downloaded from their website.

Fellow Gabriela Rodriguez worked on researching and obtaining data for the app:

One of the things I worked on was researching possible data sets related to tires and their ratings. It would have been interesting to have data on tire prices and relate them to these ratings, but we couldn't find anything online that was free to use. We also looked into cars — their prices and the tires that came installed on them. Nothing. It’s only logical that this data would be published somewhere, but we couldn’t find it. Collecting some of this data ourselves would have taken more than a day, so we worked with what we had: Tires, ratings, complaints and recalls.

Because this was a demonstration app built in a two-day hackathon and not a full-fledged news app, we relied on publicly available information about the data set and a few brief conversations with NHTSA and a tire expert to understand the data, and didn’t do as much reporting as we might normally have done. NHTSA’s data set was small, clean and easy to work with and understand. It got complex when we tried to join the data to other NHTSA datasets like complaints and recalls. The agency did not have unique IDs for each tire, so we wrote algorithms in an attempt to join them by brand name and tire model name.

Cleaning and Bulletproofing

Before you import your data, you need to clean and bulletproof it.

Even if at first glance your data looks good, you never know what problems lurk within until you examine it more closely. There are lots of reasons data can be dirty — maybe whoever assembled it ran up against the limits of some software program, which simply chopped off rows, or maybe they were “helped” by autocorrect, or maybe the software they used to send you the data eliminated leading zeros from ZIP Codes it thought were supposed to be integers. Maybe some values were left blank when they should be “meaningful nulls.”
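The leading-zero problem, at least, has a one-line fix once you spot it. A quick sketch:

// Restore leading zeros that a spreadsheet stripped from ZIP codes,
// e.g. 2116 -> "02116". A generic sketch, not tied to any particular dataset.
function fixZip(zip) {
  return ('00000' + String(zip)).slice(-5);
}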

Effectively cleaning data requires a solid understanding of what each column means. Sometimes this means reading the material the agency publishes about the data. Often this means calling somebody at the agency and asking a barrage of very nerdy questions.

Fellows Harlo Holmes and Ben Chartoff worked on cleaning data.

Chartoff worked on making the grading data more usable, especially the complicated “size” column:

I spent most of my time in the bowels of the data, cleaning and parsing the "tire size" field. It turns out a tire's size can be broken down into component parts — diameter, load capacity, cross sectional width, etc. This means that we could break a single size down into eight separate columns, each representing a different value. That's great — it leads to more specificity in the data — but the tire size field in our source data needed a lot of cleaning. While many of the entries in the size field were clean and complete (something like "P225/50R16 91S" is a clean single tire size), many were incomplete or irregular. A size field might just list "16", for example, or "P225 50-60". After spending a while with the data, and on a few tire websites, I was able to parse out what these entries meant. The "16" refers to a 16 inch diameter, with the rest of the fields unknown. It's included in the tire size at the end of the first string, e.g. P225/50R16. P225 50-60, on the other hand refers to three different sets of tires: P225/50?, P225/55?, and P225/60? where the ?'s represent unknown fields. I ended up writing a series of regular expressions to parse sizes in different formats, breaking each entry down into anywhere from one to eight component parts which were each stored separately in the final database.
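To give a flavor of that parsing, here is a small sketch that pulls the components out of a clean size string like "P225/50R16 91S". The pattern and field names are ours, not Ben's actual code, and the messy cases he describes needed many more patterns and special cases:

// Illustrative sketch: parse a well-formed tire size such as "P225/50R16 91S"
// into its component parts. Incomplete entries like "16" or "P225 50-60"
// required additional patterns.
function parseTireSize(size) {
  var m = size.match(/^([A-Z]+)?(\d{3})\/(\d{2})([A-Z])(\d{2})(?:\s+(\d{2,3})([A-Z]))?$/);
  if (!m) return null;
  return {
    type: m[1] || null,   // e.g. "P" for passenger
    width: +m[2],         // cross-sectional width in millimeters
    aspectRatio: +m[3],   // sidewall height as a percentage of width
    construction: m[4],   // e.g. "R" for radial
    diameter: +m[5],      // wheel diameter in inches
    loadIndex: m[6] ? +m[6] : null,
    speedRating: m[7] || null
  };
}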

Holmes worked on joining tire rating data to complaint data, which was tricky because there was no common key between them. She implemented algorithms to find similarities between brand names and tire model names to guess which complaints went with which tires:

I wrote a simple script that fixed these inconsistencies by evaluating the fuzzy (string) distance between the ideal labels in our first data set and the messier labels in our incident report sets. In my initial implementation, I was able to associate the messy labels with their neater counterparts with almost 90 percent accuracy. (I didn't do any real, official benchmarking. It was a hackathon — who has the time?!)

This initial success proved that using fuzzy distances to standardize entity labels was the best way to go. However, certain specific qualities about our data set complicated the algorithm a bit. For example, some manufacturers have multiple lines of a particular product (like “Firestone GTX” and “Firestone GTA”) and so our algorithm had to be adjusted slightly to further scrutinize any entry that appeared to be part of a line of products made by the same manufacturer. To tackle this, I wrote another algorithm that parsed out different versions of a product where appropriate. Once this second layer of scrutiny was applied to our algorithm, the accuracy jumped significantly, and we eliminated all false positive matches.
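Harlo's script is her own; as a generic illustration of the fuzzy-distance idea, a normalized edit-distance check looks something like this:

// Generic illustration, not Harlo's script: Levenshtein edit distance,
// normalized by length so near-identical labels score as close matches
// while unrelated labels do not.
function editDistance(a, b) {
  a = a.toLowerCase(); b = b.toLowerCase();
  var d = [];
  for (var i = 0; i <= a.length; i++) { d[i] = [i]; }
  for (var j = 0; j <= b.length; j++) { d[0][j] = j; }
  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                  // deletion
        d[i][j - 1] + 1,                                  // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return d[a.length][b.length];
}

function similarity(a, b) {
  var maxLen = Math.max(a.length, b.length) || 1;
  return 1 - editDistance(a, b) / maxLen; // 1.0 means an exact match
}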

If this had been a full-bore news app, we would have taken a few weeks to spot check and optimize Harlo’s spectacular matching work, but seeing as how this was a two-day hackathon, we decided not to publish the complaint data. The chance of misleading readers through even one false positive wasn’t worth the risk.

For a complete guide on bulletproofing data, see Jennifer LaFleur’s excellent guide.

Importing Data into a Database

After you’ve obtained, understood and cleaned the data, you’ll create a database schema based on your dataset in your framework of choice, and perhaps write an importer script to actually get the data into it. We usually use Rake to write tasks that will import data from CSV, JSON or XML into our database. This makes it easy to recreate the database in case we need to delete it and start again. Our Rake tasks typically lean on the built-in database routines in Rails. If you don't use Rails, your framework will have a vernacular way to import data.

Designing and Building Your App

We design our apps with a central metaphor of a “far” view and a “near” view. The “far” view is usually the front page of an app. It shows maximums and minimums, clusters and outliers, correlations and geographic trends. Its job is to give a reader context and then guide him or her into the inside of the app either via search or by browsing links.

Fellow Aurelia Moser worked on the front page “far” view for the tire app, a grid of how the top-selling tire brands are rated:

As part of the data visualization team, I was meant to tackle some graphical representation of tire grades according to the top brands in the industry. The objective here was to illustrate at a glance what the official tire grade distribution was for the top tire manufacturers. Our initial approach was going to be to create a chart view of the data, but because there were more than 253 brands and tire lines it seemed like the chart might be overwhelming and illegible. Taking 'top-ten' brand data from Tire Review, I built a little matrix in D3 to illustrate grade information by Tireline or Make (y-axis) and Brand (x-axis).

The “near view” is the page on an app that most closely relates to the reader. It could represent the reader’s school, his doctor, a hospital, etc. It lets readers see how the data relates to them personally.

Fellow Brian Jacobs worked on the design and user interface for the near-view pages, which, in this case, let readers drill down to a specific brand and its tires:

I tried a top-level comparative view, showing tire brands at a glance, sorted by averaged tire ratings, and re-sortable by the other quality ratings. Each brand would also show sales volume information if possible. You would then be able to dig deeper into a particular brand, where a tire "report card" view would display, showing a more granular visualization of the quality rating distribution. This would display above a full list of tire models with their respective ratings. So, users would be able to explore brands from top down, and also go directly to their model of choice with a search tool.

It took some time to realize, but it turned out that some weaknesses in the dataset prevented any responsible inclusion of summary or aggregate data. Omitting specific top-level statistics is unfortunate, as it greatly reduces the ability of a general-audience user to glean quick information from the app without having a specific tire in mind, and essentially eliminates our ability to highlight any patterns.

My colleague Mike Tigas built the search feature for the app:

I focused on implementing a search feature on top of our dataset. I used Tire, a Ruby client for ElasticSearch, because I've used it on previous projects and the library provides simple integration into ActiveRecord models. (ElasticSearch was chosen over a normal SQL-based full-text search since we wanted to provide a one-box fuzzy search over several of the text fields in our dataset.) Amusingly, a lot of my time was spent on code issues related to using a software library named "Tire" in the same app as a Ruby model named "Tire". (We later renamed our model to "Tyre", internally.)

And Fellow Marcos Vanetta did a little of everything:

I worked on cleaning the data and importing it into the database (for that I used mainly Ruby and LibreOffice). I also worked on the Rails app with Al and Mike. I helped Harlo with the normalization of some parameters and translating some Python scripts into Ruby. I also participated in mixing the visualization (Aurelia's baby) with the Rails app and worked on some minor JavaScript tricks with Gaba.

Once you’ve loaded your data, designed and built an interface and spot-checked your app, you’ll want to deploy it. We use Amazon EC2 to host our apps. A year or so ago we published a guide that goes into all the nerdy details on how we host our work — as well as a second guide that explores other ways to host an app.

Our tire quality application was, of course, more of an exercise in learning how to write stories with software than a full-blown investigation. We didn’t delve deeply into tire brands or do complex statistics we might do if we were more seriously analyzing the data. Nevertheless, the six 2014 OpenNews fellows started getting familiar with our approach to projects like these, and we can’t wait to see what else they come up with over the course of their newsroom year.

Read more lessons from the ProPublica/OpenNews "popup news apps team" at Source.


How to Start Learning to Program


A journalist I follow on Twitter recently asked me the best way to start to learn how to program. It’s a question I get asked a lot and, although it has been said before, here’s my advice for learning how to program:

The most important thing, by far, is to find a project you are dedicated to completing. Pick something that will disappoint you or your employer if you don't finish it. There are two reasons for this. First, you'll learn best if you attach new knowledge to old. Seeing code through the lens of a problem you know how to solve is an invaluable way to understand it and remember it. Second, it will give you the momentum to scale the steeper parts of the learning curve. You need to be more afraid of missing your deadline than you are of programming!

Next, pick a language. Go to a bookstore and flip through the first chapter of both a Python book and a Ruby book, or browse the web for introductions to the languages. There are other languages but these two are excellent. Pick whichever language delights you more. They're quite different. Python I find more to the liking of perfectionists, while Ruby is attractive to more right-brained, artistic folks. Of course there are plenty of artists who use Python and neat-freaks who use Ruby. Don't anguish over the decision! If you can't decide, simply flip a coin. You really can't go wrong with either. And they're like Romance languages: If you're good at one you'll be able to learn the other pretty quickly.

Update: JavaScript is pretty great, too.

Having picked a language, you should learn however is best for you. If you like videos, pay for good video/screencast training. If you learn best with books, buy a book. If you like being in a class, do that. My colleague Lena Groeger wrote an excellent guide to the resources on the web for code learners. Bookmark it.

And then: Do your project and don't stop until you're done. And then do another. Just write a lot of code. It’s worthwhile to keep going, even when projects don't push you to learn parts of the language you don't already know. Repetition will help you retain knowledge and you’ll do everything a little better each time. Even very accomplished developers do this. Programming knowledge is perishable and won't stick unless you keep using it.

Get a GitHub account and learn how to use git (start with Al Shaw’s guide) so you can rewind to earlier versions of your code when you go down blind alleys. Try to read lots of other people's code if you can't figure out how to do something, or just to see other approaches to the same problems.

You'll learn best when you break things and have to fix them. Python and Ruby both give great error messages. Google them and you'll see how somebody else fixed the same bug. Don't just copy and paste the solution, but try to understand what was broken and how the fix works. If you Google your error and nobody else seems to have gotten it, you’ve probably made a basic error like a misspelling or a missing comma.

Soon you'll find yourself knowing intuitively what different errors mean and how to fix them. When you go to add a feature to your software and realize you know exactly how to do it, you’ve become a coder.

A Conceptual Model for Interactive Databases in News


On Sunday, March 2, Knight-Mozilla OpenNews, the Newseum and Pop Up Archive hosted a one-day conference focused on solving a fairly new problem: How to preserve the new breed of complex interactive projects that are becoming more prevalent in news. While print newspapers are relatively well preserved, we as an industry do a poor job of preserving interactive databases and online data visualizations, and they are in danger of being lost to history.

Inside newsrooms, these interactive databases are sometimes called “news applications” — but don’t be confused. They’re interactive databases published on the web, not something you buy on your smartphone. Think Dollars for Docs, not Flipboard.

We were among the few dozen people who attended the meeting. Preserving interactive databases isn’t as easy as storing a digital copy. They’re far more complex than a printed newspaper, with technical requirements and external dependencies that make preservation anything but straightforward.

The conference split into small groups to start with some basic questions. One group tackled best practices. Another talked about how external dependencies like the Google Maps API could be handled, while another asked who might be willing to pay for archiving efforts, and how to make it inexpensive so cash-strapped newsrooms can do it. Our group, consisting of Elaine Ayo, Mohammed Haddad, Tyler Fisher, Jacob Harris, Scott Klein, Roger Macdonald, Mike Tigas and Marcos Vanetta was tasked with answering the very basic question, “What is a news app and what is one made of?” Our goal was to define the components of a news app to better facilitate the conversation around what is worth preserving, what needs to be virtualized, and what it might take to archive one.

The conceptual model we took as an inspiration was the Open Systems Interconnection Model, usually called “The OSI Model,” one of the frameworks that makes the low-level networking bits of the Internet work without a lot of coordination. We attempted to come up with a way to describe news apps using OSI-like “layers,” with infrastructure at the bottom and audience at the top. Like the OSI Model, we conceived each layer as talking exclusively to the layers above and below it. But the metaphor broke down. We found that too many parts of news applications worth preserving — the code we write, the processes we define — talk to lots of layers at once.

So we ditched the layers idea and started to think about interdependent, non-hierarchical categories. We defined six of them, each with artifacts, attributes and preservation requirements.

Our draft model includes six categories:

The Code Category includes the software that runs in production as well as the software used to acquire, parse and analyze the data and any libraries the newsroom wrote for its own use.

The Data Category might better be called the Input Category. It includes data (raw and cleaned), metadata and data structure artifacts like the data dictionaries, reporting material and more.

The Story Category, also called the Output Category includes the narrative stories that went along with the app, APIs published with the app, multimedia, UX, visual design, information architecture, annotations and documentation.

The Infrastructure Category, which is something to be simulated more than preserved, includes the Internet itself (bandwidth), web browsers, web servers, operating systems, programming languages and frameworks, external display APIs, vendor libraries and dependencies, and database management systems.

The Process Category includes code documentation, code history (git), data transformation diaries, data diaries and documentation, documents describing the cultural context, story edits, data sources like FOIA letters and general writing about process (e.g., a nerd blog).

The Response Category includes user comments, site metrics, awards won, user behavior metrics, logs, media coverage, tweets and other social media mentions, as well as real-world impact.

Every category has actors, or people who perform tasks on that category. When archiving, ask who those actors were and what decisions they made. Save versions of each artifact to show how something transformed over time. And of course, provide documentation for all of these things.

Perhaps this all seems laborious or trivial, but knowing exactly what goes into and comes out of a news application is fundamental to understanding how to preserve one.

The model is not so much a way to think about how to build news apps — though it certainly does strongly imply something about how they’re built — as it is a way to understand them as human-made objects, and how to break them down in order to preserve them. Separating “code” from “infrastructure” and “data” is not all that helpful when building, but preserving each category requires separate and intentional efforts, different skills and technologies.

Throughout the day, we continued to return to Adrian Holovaty’s Chicago Crime, a groundbreaking news application that is now lost to the world. What do we want to know in 2014 about that app? What would we want to know in 2034? It’s not just the code that Adrian wrote or the map itself, though his reverse engineering of the Google Maps Flash API was one of its great innovations when it first came out. We want to know about his process. We want to know the infrastructure on which he built the app (indeed, making his use of Google Maps even more impressive). We want to know about how it was designed, how the user interactions worked. We want to know the impact it had and who responded to it.

With a defined model of news applications, it becomes clear that archiving a news app is about more than just making sure the app still exists on the web. Things like oral histories and screencasts will likely be required to tell future news developers and historians how this kind of journalism came to be and why we made the decisions we made.

This is just a first stab at the model. Our draft is the result of a few hours’ effort and we’ve posted it to GitHub. We hold no monopoly on the idea. Feel free to fork it and send us pull requests and to open issues to give us better ideas.

Eventually, we hope this model can serve as a document for understanding how to preserve a news application, and to start a conversation about how to tackle each part. Preserving our work for future generations is crucial. Just as we can look at a New York Herald issue from the middle of the Civil War, even though the Herald itself and everybody associated with it has long since died, we hope that future news nerds can look at the work we do long after we’re gone. The challenge is great, but monumentally important.

What Heartbleed Means for Newsrooms


This post was co-published with Source.

Heartbleed is a security vulnerability that affects recent versions of OpenSSL, a popular software library that provides encryption. It is used in a wide variety of software in common use in news organizations, including the widely used open-source web servers Apache and Nginx (Microsoft's IIS web server is not affected). Think of SSL/TLS as the "S" in HTTPS.

Heartbleed is named after the part of SSL/TLS that it attacks -- a "heartbeat," which is the signal that client software sends to an Internet server to keep a network connection alive. The attack causes the web server to inadvertently send back a lot of extra data along with its heartbeat signal. This data consists of recent information from a server's RAM, which could include a server's private SSL encryption keys, user cookies and even user passwords.

The leak of SSL private keys is particularly alarming, since it might let an adversary decrypt all of a site’s encrypted communication, including any encrypted packets the adversary’s been storing. (Decryption of stored packets can be prevented by using forward secrecy, which is used in some newer versions of the SSL/TLS protocol but is not yet widely in use.)

Does This Affect Me?

If your websites have SSL enabled (when users log in, for example), or if you use VPN software to secure your network, or if you run your own mail servers, your newsroom might be affected by Heartbleed.

Heartbleed can affect anything that uses OpenSSL version 1.0.1 or greater. This includes most open-source web servers (Apache, nginx, lighttpd), and can include email servers, instant message services (ejabberd, etc.), and VPN servers (OpenVPN). Privacy software like Tor and SecureDrop was also vulnerable, and both projects have since released updates. Many popular server operating systems are affected and have released patches that fix the bug, including Linux distributions like Ubuntu, Debian, Fedora, Red Hat Enterprise and Arch Linux.

It’s important to note that OpenSSL is a library that is dynamically linked to other software. The bug will affect any software that uses it. You can see what version of OpenSSL your operating system has by running this command:

$ openssl version
OpenSSL 1.0.1g 7 Apr 2014

If you get a version between 1.0.1 and 1.0.1f, you may be vulnerable. Some Linux distributions include a hotfix for this bug while keeping the same version number, so you should double-check the operating system's website for more information.

Software using other SSL/TLS libraries, like GnuTLS, PolarSSL and Mozilla NSS are not affected. This also includes some proprietary software, such as Microsoft’s Exchange and IIS, which use their own implementation of SSL/TLS. But be aware that some network hardware solutions — such as routers, firewalls or VPN appliances — may contain OpenSSL. When in doubt, you should check with your vendor or run your own tests.

One of the security researchers who discovered the bug has compiled more information on it — and affected products — at heartbleed.com. The Carnegie Mellon University Computer Emergency Response Team (CERT) has also compiled a partial list of affected vendors.

What Should I Do?

You can test your web server on this website or with this command-line tester. You'll need to check every domain you have an SSL certificate for -- be especially mindful if your organization has a separate "login" domain name apart from the main site. And you can test non-web services like e-mail by including the SSL port number.

If you use one of the Linux operating systems mentioned above, check its announcement for affected versions and update instructions.

Remember that you may need to restart the services running so that the update is properly enabled -- i.e., "sudo service nginx restart" or the similar command your Linux distribution uses to restart a daemon.

If your infrastructure relies on Amazon AWS, Heroku or other cloud services, you should also check with that service provider for other information that may affect you.

Updating software is not the only thing you have to do. Because the bug might have leaked your server’s SSL private key, you should also revoke and re-generate your SSL certificates. (The instructions vary from SSL provider to SSL provider, but you'll need to go through most of the steps to generate and install the new key as if you are starting from scratch.) This is because an attacker with your private key can easily decrypt all encrypted traffic to and from your server. Revoking the old certificate tells the internet that the old key has been compromised and that any websites using it might be fake.

You should alert your users to change their passwords once you have fixed Heartbleed on your servers.

Of course, you should also tell your newsroom that they should be careful about logging into affected servers and should change their passwords just about everywhere, once the websites have been patched. This is especially important on publicly available services they use for work, including your CMS's admin screens, comment moderation services, social media, etc. A list of popular websites affected as of 2 p.m. Eastern Time on Tuesday was posted by Donnie Berkholz, a software developer and analyst.

Everyone in your newsroom should also double-check that their web browsers check for certificate revocation. This ensures that a connection to an HTTPS server isn't using a certificate that may have been compromised in the attack -- a sign that you may be talking to a fake website.

  • Chrome: Preferences -> Show advanced settings… -> HTTPS/SSL -> “Check for server certificate revocation”.

  • Firefox: Preferences -> Advanced -> Certificates -> Validation -> “Use the Online Certificate Status Protocol (OCSP) to confirm the current validity of certificates”.

  • Safari on Mac OS X: Open Keychain Access (in Applications -> Utilities) -> Preferences -> Certificates -> Set both “Online Certificate Status Protocol (OCSP)” and “Certificate Revocation List (CRL)” to “Require if certificate indicates”. Set “Priority” to “Require both”.

  • Internet Explorer: Tools -> Internet Options -> Advanced. Check “Check for publisher’s certificate revocation” and “Check for server certificate revocation”.

Further Reading

Introducing Landline and Stateline: Two Tools For Quick Vector Maps in your Browser


This post was co-published with Source.

Today we're releasing code to make it easier for newsrooms to produce maps quickly. Landline is an open source JavaScript library for turning GeoJSON data into browser-based SVG maps. It comes with Stateline, which builds on Landline to create U.S. state and county choropleth maps with very little code out-of-the-box.

We finished the project and wrote documentation as part of last week’s Knight-Mozilla OpenNews Code Convening program (thanks to GitHub's Jessica Lord for helping out on short notice). We've been using variations of Landline at ProPublica for some time, for example on the front page of our nursing homes app and this Voting Rights Act explainer.

Both should work in nearly every commonly supported browser. The full documentation is over at GitHub, but in a nutshell:

Landline

Creating a map with Landline is as easy as passing it some GeoJSON.

var map = new LandLine(YOUR_GEOJSON).all();

Landline does the work of translating GeoJSON to SVG, so the level of complexity of your map is really up to you. The resulting SVG string can be sent right to the browser (for modern browsers) or proxied through Raphael.js for universal support.

With Landline, you bring your own projection (based on how the GeoJSON was processed), but you can also pass in your own projection function if you'd like to reproject the map on the fly.

Landline has no dependencies other than Underscore.js.

Stateline

Creating a state map with Stateline is as easy as requiring the packaged state or county JSON file, specifying a container element and whether you’d like a state or county map:

var stateMap = new Landline.Stateline(container, "states");

You can also pass your own data to the map by joining on two-digit or five-digit FIPS codes associated with states and counties. This is ideal for presenting census data (such as the income map on our demo page), which comes with FIPS-based geoids.

The state and county maps packaged for Stateline are in the Albers projection for the continental U.S., and respective state plane projections for Alaska and Hawaii.

Stateline requires Underscore, jQuery and Raphaël.

Although we're releasing Landline and Stateline to the public today, they're still very much works in progress. One thing we’re considering is moving the library to TopoJSON, a lighter-weight format for sending geographic data over the wire (as it stands, the prepackaged county file clocks in at around 800 kilobytes, which is a bit big for most web applications). We’d also like to build up our library of Stateline packages for maps of counties within individual states, and for other countries. Want to help out? Send us a pull request!

The Sisterhood of the Traveling Plants


National Geographic's Future of Food Hackathon took place last weekend in Washington, D.C., as a part of their “Future of Food” project. ProPublica’s Eric Sagara, Mike Tigas and Sisi Wei were part of a team with WNYC’s Noah Veltman and Tim Wong from The Washington Post. We also received expert help and advice from National Geographic's Dennis Dimick and Maggie Zackowitz.

Hackathon participants were given access to data from the U.N.’s Food and Agricultural Organization and asked to create software to explore (or even solve) problems related to the world food supply.

Our team’s final project, FareTrade, was awarded Best in Show. FareTrade is a mobile web app that allows users to see how far, on average, items on their grocery list traveled to reach them. We express this in terms of “food miles.” For all of the 103 types of food in the app, we also show the percentage of each crop produced domestically and the percentage imported from the top foreign producers.

You can see all the projects at http://futurefood.hackdash.org/.

How We Calculated Food Miles

Countries in the FAO dataset report how much they produce of a given crop each year, as well as how much of that crop they import and export from every other country in the database. In calculating food miles, we determined the average expected mileage that a particular crop traveled to reach the U.S.

As an example, let’s imagine that in 2011:

  • Ecuador exported 40 metric tons of bananas to the U.S.
  • Honduras exported 60 metric tons of bananas to the U.S.
  • The U.S. didn’t grow any bananas.

From this data we can calculate the fraction of bananas in the U.S. that came from each source and weigh the distances accordingly.

foodMiles = (0.4 * distance between US and Ecuador) + (0.6 * distance between US and Honduras)
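In code, that weighted average is only a few lines. The tonnages and distances below are the made-up banana example, not real FAO figures:

// Weighted-average "food miles" for one crop: each source contributes its
// distance to the U.S. in proportion to its share of total supply.
function foodMiles(sources) {
  var total = sources.reduce(function(sum, s) { return sum + s.tons; }, 0);
  return sources.reduce(function(miles, s) {
    return miles + (s.tons / total) * s.distance;
  }, 0);
}

// Hypothetical distances, for illustration only.
foodMiles([
  { country: 'Ecuador',  tons: 40, distance: 2800 },
  { country: 'Honduras', tons: 60, distance: 1800 }
]);
// => 0.4 * 2800 + 0.6 * 1800 = 2200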

Caveats

Hackathons are very short sprints and as such they limit the scope of the projects they produce. FareTrade is a bit unfinished and more a proof of concept than a fully formed work of data journalism. There are more than a few caveats you should know about when you use it, including:

  • The data is aggregated by country, so we're only computing distance from centroid to centroid. This can mean very inaccurate distances for a big country like the U.S. We assume a domestically grown artichoke traveled 0 miles when it might have traveled 2,500 miles from California to New York, and we assume that a piece of ginger exported from China traveled from the middle of China to Kansas, when it might have only traveled from coast to coast. To make this more meaningful we’d have to know more about the proportion of food produced in specific areas of an exporting country, and consumed in specific areas of the importing country.

  • The data is aggregated by year, so it doesn’t account for the seasonality, which is obviously crucial. In the U.S., berries may be domestically sourced in the summer and then shipped from the Southern Hemisphere when they’re out of season. In cases like this, the time of year that you choose to eat the food makes a huge difference in the expected food mileage, but we average that out into one number for the year. We can determine what foods are in season in particular months and make recommendations accordingly, but incorporating that into our mileage calculations would require data on consumption or imports by month instead of by year, data which is not currently available (to our knowledge).

  • We limited our app to a subset of all the things the U.N. tracks, because the different data sources classify commodities differently. Some of them are consistent, but others are not, so you might find two similar-but-not-identical categories like "bacon and ham" and "pork products" in two different datasets. These mismatches tend to occur more with livestock products than with crops.

  • The math doesn’t always add up. How many avocados did the US import from Mexico last year? It depends whom you ask. The FAO's trade matrix data gives us these two different records:

    • The U.S. imported 318,938 metric tons of avocados from Mexico in 2011.
    • Mexico exported 269,600 metric tons of avocados to the U.S. in 2011.

    There are lots of cases like this, where the numbers are fairly close but not the same; this is presumably because the U.N. gets data reported by national agencies, and the two sides of the exchange report it differently. For our purposes, when conflicting reports existed, we took the mean of the quantities.

  • We are assuming that the percentage of a crop that gets eaten is the same for each food item in our app, when it may vary considerably between categories. Some crops are more likely than others to end up as animal feed, biofuel, or something else entirely.

Alternative Names

We ended up calling our project “FareTrade,” but we came up with other names that are too good not to share. Here are some of the potential project names that didn’t make the cut:

  • The Sisterhood of the Traveling Plants
  • TransPlants
  • Seasoned Travelers
  • Forklift
  • Final Destination
  • Banantastic Voyage
  • Around the World in Eaty Days
  • Secretary of Plate
  • Eatinerary
  • Foodyssey
  • Fruit Commute

ProPublica News Applications Desk Receives Data Journalism Award


ProPublica’s work has been recognized with one of the eight prizes in the Global Editors Network’s Data Journalism Awards, announced today at the GEN Summit in Barcelona.

In addition to winning the “Juror’s Choice” category, ProPublica had 13 projects named as finalists in the awards, more than any other newsroom.

To learn more about ProPublica’s data journalism, see our new video highlighting some of our most popular projects.

Congratulations to all of today’s winners.

The Road to Health is Paved With Good Data


I think I'm a decent arbiter of people's appreciation of data. I worked at IRE's data library as a grad student and I've attended four consecutive NICAR conferences. At ProPublica, I work with complex data sets every day. I help run our data store, so I can see how excited data-savvy reporters can get when working with great data sets. So you'll forgive me if I viewed attending Health Datapalooza with a small bit of skepticism. Surely, I thought, a bunch of healthcare nerds could never match the enthusiasm and bordering-on-obsessiveness of news nerds when it comes to data.

My assumption was quite off-the-mark. Health Datapalooza, despite (or maybe because of) its ridiculous name, was incredibly awesome and useful. I found many open-data compatriots among the 2,500 attendees – and very few of them were journalists. Instead, they were patient advocates, doctors and nurses, health policy wonks, insurance adjusters and app designers. The conference sessions taught me about upcoming data releases and new statistical analysis techniques; even better, I left with a crazy number of story ideas.

In a session on statistical correlation, Dr. Sujata Bhatia, a biomedical engineer at Harvard, brought up a fascinating issue: The Affordable Care Act will add millions of new people to insurance rolls and thereby change the make-up of the patient population that forms the baseline of many healthcare studies. Will those studies' conclusions need to be revisited? Dr. Bhatia and others are still attempting to find the answer.

I also found out that Fitbits and Jawbones are already outdated – health-data capturing will soon involve tiny microchips that stick to your skin like a band-aid. I watched some super-smart app developers at work during the "Code-a-Palooza" live judging. I even learned that Florence Nightingale was one of the first people to design a health data visualization.

Palooza speakers (they really missed out on an opportunity by not calling them 'headliners') included surgeon/journalist Atul Gawande, U.S. Chief Technology Officer Todd Park, U.K. Secretary of Health Jeremy Hunt and former Secretary of Health and Human Services Kathleen Sebelius in the last few days of her job. Hunt's keynote was especially fascinating – he discussed the U.K.'s efforts to reduce patient harm in hospitals, a story ProPublica's been following for some time.

Like any conference, the real value was in the networking. I was introduced to Amy Gleason, who came up with an app to help manage care of her chronically ill daughter. I finally met the ebullient Fred Trotter (I'm convinced he's the only rival to my colleague Charlie Ornstein when it comes to a passion for health data). I also got to meet some folks from Centers for Medicare and Medicaid Services and ResDAC in person – I'm usually bugging them for help over the phone or via e-mail.

And we even won a "Health Data Liberators" award.

For more takeaways from Health Datapalooza, I recommend this blog post from MedCity News.


Why We Removed the Form 990 PDFs From Nonprofit Explorer


Late last month, we removed links to download Form 990 document PDFs from our Nonprofit Explorer interactive database. These files had been hosted by Public.Resource.Org, a nonprofit organization dedicated to making public documents such as building codes and nonprofit filings freely available, until the organization took them offline pending resolution of a dispute with the IRS.

In removing the documents, Carl Malamud, the founder of Public.Resource.Org, cited disagreements with the Internal Revenue Service regarding the price the IRS charges for digital copies of the documents ($2,910 per year, according to Malamud), the IRS's refusal to publish raw data for electronically filed ("e-filed") tax returns, and legal and privacy issues stemming from the IRS's failure to redact Social Security numbers erroneously included in Form 990 filings.

According to The Chronicle of Philanthropy, Public.Resource.Org is also moving forward with a lawsuit against the IRS, charging that the IRS is required, under the Freedom of Information Act, to publish machine-readable copies of nonprofit filings. The suit states that e-filed tax returns are already stored in such a format at the IRS, but the IRS contends such files are excluded from disclosure requirements. A federal judge denied an IRS request to dismiss the case last month.

Announcing Raster Support for Simple Tiles


At ProPublica, we love maps. For the last few years we've been creating maps for projects like our investigations into gerrymandering and FEMA's emergency maps. We've even dipped our toes into WebGL with our Hurricane Sandy flood visualization.

Many of these mapping projects rely on a software framework we wrote called Simple Tiles. To support the work we're doing for a project launching this year that relies on satellite imagery, we've added raster-data support to the newest version of Simple Tiles.

"Raster data" refers to large images that are tied to coordinates on the earth. If you've ever used the satellite layer in Google Maps, or gazed at Mapbox Satellite, you've seen raster data.

We've been playing with this version of Simple Tiles internally for a couple of months now, and we're pretty happy with the results.

Rendering raster layers with Simpler Tiles — the Ruby bindings to Simple Tiles — only takes a small amount of code:

# Set up the tile url to capture x, y, z coordinates for slippy tile generation
get '/tiles/:x/:y/:z.png' do
  # Let the browser know we are sending a png
  content_type 'image/png'
  # Create a Map object
  map = SimplerTiles::Map.new do |m|
    m.slippy params[:x].to_i, params[:y].to_i, params[:z].to_i
    m.raster_layer "path/to/raster.tif"
  end
  map.to_png
end

You can read through the new code over on Github. If you want to play around with the newest version read the installation instructions in the documentation. Please let us know in the issue tracker if you have problems, and help us improve the library by forking it.

A Big Article About Wee Things


This article was co-published with Source.


The scene is total chaos: a woman and all her purse's contents in midair as she trips over a child's toy, a man hastily trying to gather his spilled laundry, a screaming child weaving through the crowd. Somewhere, in the midst of it all, is the person you've been looking for: wearing a red and white striped shirt, black-rimmed glasses and a lopsided cap. There he is! There's Waldo.

Many of us have fond memories of Waldo. But while he looms large in our imagination, our childhood searches for Waldo typically stayed pretty small – Waldo is a tiny person in the middle of lots of other tiny things.

And that's what this post is about: wee things. Specifically, the wee things that we see as part of graphics, maps, visualizations (wee things in space) as well as the wee things we experience as part of interactions, navigation, and usability (wee things in time). This means everything from sequences of small graphics that help us make comparisons, to tiny locator maps that help orient us within a larger graphic, to navigation icons that give hints about how we should make our way around a page.

Waldo, and the eternal search for him, can actually tell us quite a lot about design. In many ways, Waldo is a great example of what NOT to do when using wee things in your own work. So with Waldo as our anti-hero, let's take a look at how people read and interpret small visual forms, why tiny details can be hugely useful, and what principles we can apply to make all these little images and moments work for us as designers.

Wee Things In Space

Probably the most immediate definition of wee things is things that are physically small: little things on a page. We see these all the time in news graphics, and we're probably familiar with some of their forms: small multiples, sparklines, icons, etc. I'll go into more detail about all of these.

These visual forms work because they serve as extensions of our mind – they are cognitive tools that complement our own mental abilities. They do this by recording information for us to make use of later, lending a hand to our (pretty terrible) working memories, helping us search and discover and recognize. We'll take a look at one task in particular they are great at: letting us make comparisons.

Make Comparisons

Tiny sequences of graphics, also known as small multiples, are great ways to help our brains compare. They are so successful because we don’t have to rely on working memory – every bit of information is in front of us at the same time. This means that we can easily see changes, patterns or differences.

Here are a bunch of examples of small multiples in the wild – maps and planets, first lady hairstyles and telegraph signals, food trucks, fashion color trends and dressing appropriately for different climates, the distribution of deaths in the 1870s and, last but not least, Bill Murray’s hats.

The reason we can make comparisons so easily is that these small multiples take advantage of the built-in capabilities of our visual system. Specifically, something called preattentive processing.

Preattentive Processing

Technically, preattentive processing refers to "cognitive operations that can be performed prior to focusing attention on any particular region of an image." But basically, it means the stuff you notice right away. Our brains aren't like scanners; they throw away most of what the eyes see. But they are good at perceiving simple visual features like color or shape or size, and in fact they do it amazingly fast without any conscious effort. That means we can tell immediately when a part of an image is different from the rest, without really having to focus on it. Our minds are really good at spotting one or two differences when everything else is the same.

Like in this example, you can easily spot Waldo.

It's easy to spot Waldo surrounded by his arch nemesis, Odlaw.

Or in this example, right? This is not hard.

But obviously, this is not where Waldo typically hangs out! In the scenes where he usually hides, I can't find him at a glance, and I need to spend a lot of time and real effort. Our experiences with Waldo are usually something more like this:

A few years ago a group of MIT researchers actually studied what happened when people saw a scene like this. They used eye tracking devices to look at how people go about finding Waldo. This is what happens on a typical search:

Those jagged lines are the path of the eye as it hunts for Waldo (ending in success at the red dot). As you can probably tell, the process is far from straightforward.

So, Waldo basically thwarts our preattentive processing. Instead, he forces us to use our attentive processing: the conscious, slow, sequential process that lets us focus on one thing at a time. This "spotlight" of attention might be good at letting us concentrate, but it makes us pretty bad at spotting Waldo. We have such a hard time seeing him (despite the fact that he's right there in plain sight) because he doesn't stand out in any clear way: not in color, not in size, not in orientation.

And obviously there are reasons why Waldo doesn't stand out – he is purposely hard to find because his surroundings were constructed to hide him. But in general, we don't want our graphics or data visualizations to be this much work to understand. If someone had to search that hard to read the information in a chart or graphic, it would be a failure. So if you're in the business of displaying data, avoid the Waldo strategy.

Instead, take advantage of those preattentive features! In this graphic from ProPublica, the use of color draws your eyes to specific lines of red text. The grey lines are government claims about the successes and results of the CIA's drone program. The red lines are the official statements that it doesn't exist (or rather, that the government can neither confirm nor deny). The graphic uses the preattentive feature of color to call out these contradictions.

This graphic from the New York Times uses size to highlight how much longer movie credits have gotten over the years.

Whether it's size, color, contrast, etc., it helps to have the point you want to get across encoded in one of these preattentive features.

Now the last two graphics I showed were also members of a special category of wee things I like to refer to as TinyText. This is exactly what it sounds like (you make the font size really small). Tiny words can help call attention to differences over time, like this piece comparing editions of the Origin of Species.

Or they can help us explore the terms used in Hong Kong policy addresses for the last few years, or see the rise and fall of Fortune 500 company rankings.
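
The comparison work behind pieces like the Origin of Species one starts with computing the differences themselves. As a rough sketch (not the project's actual code), Python's standard difflib module can find the word-level edits between two made-up versions of a sentence:

import difflib

old = "survival of the fittest in the struggle for life".split()
new = "survival of the most fit in the struggle for existence".split()

# Walk the edit operations; anything that isn't "equal" is a change worth highlighting
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
    if tag != "equal":
        print(tag, old[i1:i2], "->", new[j1:j2])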

Show a Process

Small things can also help us show a step-by-step process. Many examples of this sort of thing involve tiny people, like the following graphics that show figure skating spins, triple axels and toe loops, aerial skiing, and snowboarding tricks.

But process graphics are not just useful for the Olympics! Here you see the process of all the planes taking off from LA airport in one day. Or the phases of the moon.

Or what happens to wee chickens before they end up at a supermarket near you.

These process graphics seem to show up a lot in dance moves – for those of you who've been waiting to learn Thriller, now there's a chart for that. Or for that matter, any of these other moves.

Just as wee things are used in dance, they are also used in art. You might remember the wonderful Ed Emberley, who taught us to draw silly monsters and animals using combinations of tiny circles, squares and lines.

We’ve also got process graphics like this one, of the many steps to creating a Japanese woodblock print, or even how to make an origami elephant.

Orientation

Wee things can also be used to orient someone, to give them a bird’s eye view.

One of my favorite features of the text editor Sublime Text is its minimap, which shows you a zoomed-out, smaller version of the page you're working on. There's also a mini locator map in this Washington Post graphic, which stays fixed on the left side of a regular-sized map and scrolls down as you move along 14th Street.

Same idea in this example from WNYC, which lets you see exactly where you are on Fielder Ave as you move horizontally through the photographs of houses damaged by Hurricane Sandy.

Or this little map that moves along an enormously long satellite image of the Tigris and Euphrates Rivers (special points to the wee red square which changes shape as you resize your browser window).

Mini locator maps can give you more context for a story, like in this example from Grantland, where the route of a long dog sled race in Alaska is filled in as you scroll – or this vertical version of the same thing, following a drive from St. Petersburg to Moscow.

Convey Meaning

Another benefit of wee things is that they pack a punch. A tiny graphic can say a hell of a lot without taking up too much room.

In some cases, that happens by swapping tiny things in for words. The book The Information is full of little inline pictures, like little arrows that convey intonation in speech, or a sequence of little dots to show an early idea for Morse code. Galileo stuck tiny pictures of Saturn into his writing, using them as just another piece of a sentence.

Another old example is a version of Euclid's Elements by the mathematician Oliver Byrne, who illustrated the whole thing with geometric shapes right in line with the text (I present to you: the wee hypotenuse).

Then there are the mini graphics that many of you are probably familiar with: sparklines. These are tiny charts that show variation (usually over time). Now you can even put them in your tweets.
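
As a tiny illustration of the idea (the tweet-sized variety in particular), here is a sketch, with made-up numbers, that maps a list of values onto Unicode block characters:

# Map each value onto one of eight Unicode block characters
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    return "".join(BLOCKS[int((v - lo) * (len(BLOCKS) - 1) // span)] for v in values)

print(sparkline([2, 5, 3, 8, 6, 9, 4, 7]))  # prints ▁▄▂▇▅█▃▆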

Another familiar example: the icon. We use icons and symbols all the time to convey a lot of pretty crucial information like what’s down the road, which door to go in, how to play and pause and rewind. On the web these are especially prevalent. We see social and navigation icons everywhere on the web, and they tell us a lot about how to find your way around a website or where to search or even how to like something. The Noun Project is a big repository full of icons you can use for free, kinda.

Graphical fonts are also icons, so here's an example of a font called StateFace, which makes it easy to embed a wee state into a graphic or table.

But keep in mind that while some icons carry lots of meaning, they don't all carry the same amount. Icons work because they are recognizable, and if people are not quite sure what they mean, they lose their power. New icons may take time to become commonplace.

But some icons are more equal than others. A study of search icons found that a lone magnifying glass is often not enough to convey "search", especially when it doesn’t look exactly like the magnifying glass we are used to seeing. Instead, it helps to give people a few more clues, like placing the icon on the top right hand side of the page, or placing it within something that looks like a text field.

On a similar note, not everyone knows that the hamburger icon stands for "menu" – it is definitely not as universal as a peace sign. Another analysis found that making the icon look more like a button or including the word "menu" made people more likely to understand it. So make sure that if your icon is new or unclear, you add extra information.

And of course, this doesn't mean you can't make up your own icons, as long as you provide context. We may never have seen a man riding a fish, but in this case it (obviously!) represents a fish hatchery employee.

Differentiate

One more benefit of wee things is that they can help us differentiate individual elements. We often see graphics with lots of small circles, and these imply lots of small individual items. We can easily guess that they are separate elements, and we might try to hover or click. The little dots make it clear that these are distinct.

Wee Things In Time

Now I'm going to broaden the scope of wee things a little bit. So far we've been talking a lot about physically small wee things: small things in space. But there is also a different, more abstract idea of wee things, and that's small things in time. These are the tiny moments, the interactions that we spend only a second or two on, but that can make a huge difference to our experiences and our understanding. Another word for these wee things in time (I didn't even have to make it up) is microinteractions.

Microinteractions

What are microinteractions? You can think of these as small "contained moments." Changing a setting, logging in, favoriting or liking something, giving a rating, importing or exporting, searching, sorting, syncing, formatting, saving – the list goes on and on.

There are a couple of great resources on tiny details and microinteractions that I'll be stealing examples from, including the book Microinteractions by Dan Saffer, the blog Little Big Details, and Jack Moffett's Design A Day tumblr.

These interactions might seem trivial at first. Who cares about a hover state on a button, or a confirmation box, or a minor color change? But when it comes to designing details, Charles Eames said it best: "The details are not the details. They MAKE the design." So let's look at what wee things can do to help make our interactions better.

Give Hints

Small, strategically placed hints can help direct someone's attention to what they are supposed to do, to what will happen once they do it, and whether or not they actually did it successfully. But keep in mind that some hints are better than others. You can't just say "guy in red striped shirt" and have someone immediately find Waldo. It’s important to give real, helpful hints.

Affordances: What can I do here?

Affordances in the physical world

Hints can take a number of forms, and one is by taking advantage of affordances. In the physical realm "affordances" refer to the attributes of an object that make it do what it does. A wheel affords rolling, a light switch affords flipping, etc.

In the online world, we depend on perceived affordances, like buttons that look like buttons and links that look like links. Much of this is convention – there is nothing about blue underlined text that necessarily means a link, but these conventions have become established over time, and we use them to give people a clue as to what they can do.

Affordances in the online world: buttons look like buttons, links look like links

Here's an example of a pretty obvious hint. I see a blinking purple circle and think – ah! I can click here. However, I still have no idea what will happen when I click that arrow. I can guess, but it's probably better for everyone if we're not clicking around the internet trying to guess at what will happen next.

Feedforward: What will happen when I do it?

And that's where feedforward comes in. It's kind of like feedback, but it happens before the action. It gives me a clue not only of what I can do, but what will happen once I do it – so I know what will happen BEFORE I perform the action. This means that people can perform that action with confidence.

Here you can see that a black strip to the side of the menu comes out when I hover. It's a small hint, but it's just enough for me to guess that when I click on that button, the black strip will turn into a sidebar. And that's exactly what happens. Again, feedforward is not just "that I can click" but actually "what will my clicking accomplish."

Feedback: Did I do it?

Finally, feedback is another kind of hint. Feedback tells you the result of your action, whether it was a success or a failure. This is important! Did I sort this thing correctly? Did I select the right state or search for the right terms or go to the next screen in a sequence? Every action should be acknowledged. The best part about feedback is that it can later turn into another hint about what will happen next.

So let's look at some hints in the wild.

Here is an example of a nice way to provide feedback – a series of dots along the bottom of the page. As you scroll, they highlight one by one. The left-to-right sequence also hints at the direction I need to swipe to move through the interactive. "Fewer Helmets, More Deaths" at the New York Times is another example of this hint style.

Speaking of swiping, here are a couple of hints that I need to swipe to access whatever comes next, from a NY Times article, the Duolingo app, and my iPhone. As you can probably tell, these are early days for mobile swiping instructions – designers haven't quite made up their minds about the direction of the arrow, the direction of the swipe, or even whether "slide" is actually the better word.

Swipe (or slide) to begin on an NY Times article, Duolingo app, and iPhone

On a similar note, a lot of newfangled article templates these days want us to scroll down. So often we see something like this: a big fat arrow pointing down with the word "View," or a pair of arrows with the words "Scroll to Read."

Somehow these instructions always remind me of the Push/Pull signs on a door. They might help a little bit, but usually if it needs a sign, the door is not designed very well.

So can we design interaction so that labels and instructions are not necessary? In the case of scrolling on the web or on a phone, I think it's very possible, and in fact has already been done successfully.

Here's an example of an article page that has no arrow or button or "Scroll" or "View" instruction. Yet I immediately know that there is content underneath, because I see a small part of it. Even if I change the window size, this website always shows me a little sliver of content, which tells me there is more to read – without any arrow.

The wee sliver of an app icon gives you a hint to swipe left.

This can work for mobile too. When I go to share an image on my iPhone, I get a bunch of app icons that show me my options. I can see four of them in full, but I also know that there are more to the right. How? I can tell instantly because I see a portion of the next app icon peeking out from the right side. The need for a "See More" label has been eliminated by design (and a wee sliver).

Don Norman in The Design of Everyday Things says that "When a device as simple as a door has to come with an instruction manual — even a one-word manual — then it is a failure." Now, we obviously shouldn't strive to eliminate all instructions and labels, because often these small words are quite necessary. But we should at least think about designing alternatives. And usually, less manual is a good thing.

Bring Data Forward

Wee things can also help us present data up front instead of forcing a user to go seek it out. In some cases, bringing the data forward actually eliminates the need to interact at all. For example, Chrome shows an icon of a speaker if there is music playing in one of your tabs, so you don't have to frantically click through and try to find out where that music is coming from. On an iPhone, the calendar app shows you the date without you having to open it up to find out. The clock app shows you the time.

Chrome's speaker icon (L) and iPhone apps that reveal data on the outside (R)

Note that the weather app does NOT do this, for no good reason. For all the weather app cares, it is ALWAYS partly cloudy. But it doesn't have to be like this! A small tweak to the weather app would show you exactly what you typically go into the weather app to find out – the current weather.

Chrome shows you all the instances of a term you've searched for on the page, in the little tick marks on the right side, bringing that information forward to you.

Chrome highlights instances of a search term along the right-hand side.

Now I know it's 5:40pm office time.

This website, rather than just telling me the office hours, includes what time it is at the office, right now. This is a relatively simple achievement (computers are pretty good at telling time) but it saves me the trouble of calculating time zone differences. In many cases, it's possible to take advantage of basic information, like the time, to help give people the information they really need.
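
That kind of wee helpfulness is cheap to build. Here is a minimal sketch (the office location is just an assumption for illustration; Python's standard zoneinfo handles the time-zone math):

from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Show visitors the office's local time instead of making them convert it themselves
office_now = datetime.now(ZoneInfo("America/New_York"))
print(f"It's {office_now:%I:%M %p} at our office right now.")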

In a similar vein, Pixelmator uses the eyedropper tool to show me the color I've selected, and Amazon shows me the number of items in my cart. In both of these cases information that could have been hidden behind yet one more interaction or step is brought to the top.

Pixelmator's eyedropper tool and Amazon's shopping cart.

The humble coffee grinder.

Prevent Errors

Microinteractions can also help us prevent errors. As an intro to this, let me tell you about my favorite machine in the world: the coffee grinder. With a coffee grinder, it is literally impossible to hurt yourself because it only works when it's closed (unlike, say, a blender).

And preventing screw ups is really good! Store shelves are filled with products in which this is not the case. One very painful example is the instant soup cup, which can easily tip over and lead to scalding burns (not to mention hungry emergency room patients).

In one of my favorite collections of wee things (and this is an actual picture in an actual scientific paper) we see the exact angle at which an instant soup container will topple over. The taller and narrower the cup, the smaller the angle (and the more likely the accident). So why are so many soup containers tall and narrow? Soup cup designers could prevent this error by making them very flat and wide, but for whatever reason, they have chosen not to.

In the online realm, preventing errors means designing microinteractions that make it impossible to mess up. For example, in this smart microinteraction, I am stopped from accidentally being born in the future.

Gmail knows that I used the word "attached," so it prevents me from sending my email without the attachment (or at least gives me a heads-up beforehand). But the second dialog is an example of what not to do. In the options below, which one will cancel my payment?

Gmail catching potential errors (L), anonymous website introducing confusion (R)

Not so easy, right? Whenever possible, prevent mistakes. Be clear.
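
The Gmail attachment catch above boils down to a very small heuristic. Here is a rough sketch of the idea (not Gmail's actual logic, and the word list is only a guess):

# Warn before sending if the text mentions an attachment but nothing is attached
ATTACHMENT_WORDS = ("attached", "attachment", "attaching")

def should_warn(body, attachments):
    mentions = any(word in body.lower() for word in ATTACHMENT_WORDS)
    return mentions and not attachments

if should_warn("The report is attached, let me know what you think.", attachments=[]):
    print("Did you mean to attach a file? [Send anyway] [Go back]")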

Surprise

While many wee interactions are invisible – our experiences so seamless that we don't even notice them getting in the way – some of the best interactions we do notice. Microinteractions can add little moments of surprise to an otherwise mundane task.

This can take the form of clever little transitions, like this example from Victory Journal where the menu icon transforms into an X to become the close button.

Or it could just be darn helpful, acknowledging that memory is limited and forms are boring. Ever start filling out an online form and immediately forget what you were supposed to type? In this example from Campfire, the instruction label stays visible; it just moves out of my way.

Surprising microinteractions can also take the form of little notes to your users that are funny or have some personality. If you're uploading something to Dropbox that will take a long time, it tells you to go grab a Snickers. And if your text gets too long, Google Voice just stops counting the characters and says "Really?".

Dropbox being friendly (L), Google Voice poking fun (R)

The Vimeo cancel/dismiss button is labeled "I Hate Change." All these things take into account that you're designing for real people who have a sense of humor.

And then there are just tiny interactions that are fun, like shaking your phone to get a snow globe effect on your photographs.

Or pressing this spinner over and over again to get a new prediction about who’ll win the Senate.

These are not always explicitly about conveying information, but they're tiny moments in which we can make an emotional connection with the user. Delight is something we may overlook sometimes or not bother spending time on, but it can really make a difference. If this tour of wee things has taught us anything, it's that it's worth trying to make those connections, worth spending a little more time and effort on the details – even if they are very small.

How to Ask Programming Questions

Already picked a programming language and started learning how to code? Inevitably, you'll need to ask for some coding help. Even the best, most experienced developers do. Whether you're chatting with an instructor, emailing a listserv like NICAR, posting on Stack Overflow, or even just tweeting out a question, there are ways to make sure your question gets answered. This guide is aimed at journalists, but it can apply to anyone.

Here are some guidelines:

1. Do some research first

Before you can find a solution, you need to know how to describe the problem you’re having. Read the error message out loud so you understand what’s going on. Even if you don’t understand all the words, make sure you read the whole error at least once.

Many times, someone else has had the same question as you, and asked that question on the Internet. If you have an error message, Google the exact message. Or just describe the problem you’re having and search. There’s a great post from KnightLab on how to get good answers from Google about code.

If you’re having problems with a specific project hosted on Github, checking out their issues page (here’s an example) can also be helpful. Developers frequently use Github issues as a way to log known problems, or speak to the people using their project.

If you need help with a specific programming language, there are great journalism-specific Google Groups that can help you, such as PyJournos and RubyJournos.

Try any of the solutions you find to see if they work for you.

If nobody's posted an error message like yours, chances are good that you've made some type of basic error (we all do it), like forgetting a semicolon, or not installing a program you need. Step through your code one last time, maybe reading everything out loud, just to make sure you've got everything right.

If solutions aren’t working, keep track of what you’ve tried, as well as what happened when you tried them. When you post your question, you’ll want to explain what you’ve already done.

2. Be specific

When asking your question, be specific. This means your post should include:

  • The tools you’re using, including version numbers and your operating system version.
  • What you’re trying to accomplish – what did you expect to happen if everything had worked?
  • The code you’re using (or just the relevant parts)
  • Any error messages you got
  • What you’ve tried already, and what happened

Let's look at an example email or post asking for programming help:

I’m trying to scrape the Boone County inmate roster using Python, and the requests and BeautifulSoup packages. So far, I’m only trying to get requests and BeautifulSoup working.

Here’s my code:

import requests
import csv
from BeautifulSoup import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
print soup

The error I’m getting says:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    from BeautifulSoup import BeautifulSoup
ImportError: cannot import name BeautifulSoup

I first checked the spelling and it is correctly imported. So, I then tried Googling for “ImportError: cannot import name BeautifulSoup”. All the help out there says that I need to install it, but I thought I did. I tried again to make sure and the following happened:

pip install BeautifulSoup
Requirement already satisfied (use --upgrade to upgrade): BeautifulSoup in /Users/jacquelinekazil/Projects/envs/scraping-class/lib/python2.7/site-packages
Cleaning up...

I can’t figure out why this is happening. Please help!

Note: This is an actual example. Bonus points if you can identify what the issue is.

3. Repeat

When people start responding with solutions, try them. If the proposed solution doesn’t work, do not respond simply with “That didn’t work.” Instead:

  • Post what your updated code looks like
  • Post the new error message
  • Post anything else you were inspired to try since you posted your question

If a proposed solution does work, make sure you respond so others with the same problem will know that the solution worked. Remember to thank the person for helping you!

4. Document and share

Be sure to pay it forward. When you've worked your issue through, document and share everything – your problem, the things that you tried, and the solution that worked. One day, somebody with the same problem you had will be helped enormously by this.

5. Other guides

If you want to read more on this topic:

Reporting From the Youngest Land in the World

This story was co-published with Source.

On October 13, we stepped off a boat in the middle of the Mississippi River Delta onto brand new land. The ground, about six months old, was a bit squishy but it held firm under our boots. It was put there by engineers working on a quixotic project to save Southeast Louisiana, which is sinking into the Gulf of Mexico at a rate of about a football field every hour.

We had covered this area before, as part of our interactive story “Losing Ground.” We came here on a brisk but warm morning to bear witness and, especially, to take photos.

For our previous project, we were able to use freely available imagery from NASA’s Landsat 8 satellite to document the sinking of the delta. Those photos have a resolution of up to 15 meters to a pixel — ideal for big geographic areas like southeast Louisiana’s iconic “boot.”

But for our follow-up piece about the humongous patchwork of coastal restoration efforts including Louisiana’s Coastal Master Plan, we wanted to zoom into much smaller areas, highlighting a few of the projects that state, local and federal agencies have put in place to reverse some of the damage. Many of those projects would appear as tiny specks in Landsat’s sweeping view of the coastline. We needed a better aerial vantage point to show the progress of projects that in many cases were relatively small — some only a couple of thousand feet across.

Thankfully, the U.S. Geological Survey, through its EarthExplorer website, offers a vast cache of high-resolution digital images. These "orthoimages" are images taken from airplanes that have later been georeferenced — warped to fit onto a map — and stitched together into much larger images by the agency. In addition to true-color photographs, the orthoimages include an infrared band, which is essential for this kind of work: it lets you easily and accurately isolate land and water — a key distinction in delta regions where water can look like land because it is so heavy with sediment.

The authors try their hand at amateur orthophotography on board Earl Armstrong’s airboat near the West Bay diversion project at the mouth of the Mississippi River. (Edmund D. Fountain for ProPublica/The Lens)

But the USGS does not fly orthophotography missions often. The most recent pass in the areas our stories covered was in 2012. With engineers building land as fast as they can, using backhoes and slurry pipelines, a lot can change in a year. What we needed were high-resolution aerial photographs of these areas right now. In remote sensing terms, what we needed was the holy grail: high temporal, spatial and spectral resolution. Like the old adage about fast, cheap and good, you don’t often get all three.

So we turned to a group called Public Lab, a nonprofit organization whose mission is to empower citizens to collect data about their environment with inexpensive or handmade tools. Public Lab has designed kits to collect aerial imagery with inexpensive point-and-shoot digital cameras hoisted on balloons and kites using rigs made with 3-D printers off of freely available plans. Scott Eustis, a coastal wetland specialist with the Gulf Restoration Network and volunteer organizer for Public Lab’s Gulf Coast chapter, offered to take us to two of the restoration sites we were focusing on for “Louisiana’s Moon Shot” in order to collect imagery with their DIY rigs.

The first project we mapped was a marsh creation project near Lake Hermitage, a village in Plaquemines Parish about 30 miles south of New Orleans that has dwindled to a seasonal fishing camp. In 2006, $37.9 million was approved under a 1990 law called the Coastal Wetlands Planning, Protection and Restoration Act to build 447 acres of land here over 20 years, in the form of two big sections of marsh near the eponymous lake. It's a feat of engineering that sounds implausible. Huge pipelines from the nearby Mississippi would pump sediment into areas of water enclosed by man-made earthen berms. Then huge backhoes would push and smooth out the sediment, like pizza dough, until it filled the entire enclosed areas. Vegetation would be planted to anchor the land in place, preventing further erosion.

The creation of a third, 104-acre, $13 million section of marsh was approved under the Natural Resource Damage Assessment, a pool of Deepwater Horizon oil spill injury funds. That phase (barring some extra planting of vegetation) was completed in January.

As far as we knew, there were no aerial images of the completed project. We later found some, but more on that later.

The Kite Rig: A kite carries a 3D-printed chassis holding two cameras synched together, one hacked to accept infrared light. (Al Shaw/ProPublica)

So on a blustery morning in early October, with winds reaching 20-30 miles per hour, we set out from the fishing village on a 19-foot catamaran captained by Bob Marshall, a reporter with The Lens, and docked against what may have been the youngest land in the world.

In addition to sharing two Pulitzer Prizes and being one of the reporters on our "Losing Ground" project, Bob is also a lifelong resident of Southeast Louisiana. His 35 years of reporting on its wetlands issues are informed by a lifetime of hunting and fishing in the area's marshes and swamps.

We did a few passes of the Hermitage marsh project with different camera rigs, with varying degrees of success. Public Lab uses both kites and balloons to collect images, but balloons – which can go much higher than kites – are nearly impossible to control in high winds, pitching backward and forward so much that they take useless photographs and are at an increased risk of crashing.

Luckily, GRN’s Eustis brought along a kite, which can handle strong winds. But that’s when we started running into trouble.

For our photos to be most useful for the project, we needed high spectral resolution, which for our purposes meant both visible and infrared light. Public Lab, when designing their kits, figured out that most cheap point-and-shoot cameras can be hacked to capture infrared light by removing physical filters blocking it. To create a multispectral image, Public Lab designed a rig that holds two cameras synced together: one capturing visible light, another hacked to capture infrared light. The images can later be combined in Photoshop to create a multispectral image.

Our need for infrared images ultimately sank the Lake Hermitage shoot. The kite, weighed down by the 25-ounce rig, was only able to climb to an altitude of about 100 feet, producing images that were too close to the ground to be georeferenced, and not high enough to capture the full area we needed for our interactive project.

The Infragram Rig: A lightweight Mobius ActionCam, hacked to accept only infrared light, is attached to the kite with a plastic bottle top and a kite tail. (Al Shaw/ProPublica)

We did take a second kite pass of Hermitage using a lighter rig — an ingenious device Public Lab calls the "Infragram," made up of a plastic soda-bottle top, a long kite tail and a tiny Mobius ActionCam modified to capture only infrared light. The Infragram, weighing about an ounce, soared above the landscape. However, this flight, which we launched from Bob's boat as he navigated the narrow channel between the two islands, only captured a narrow band around the edges of the project.

Ultimately, to show the complete Hermitage project, we turned to DigitalGlobe, an outfit that sells very high resolution images. It turned out they had a recent enough photograph of the entire Lake Hermitage project, including infrared, taken on October 20.

The next day, we set out to capture images of the West Bay Diversion, a 2003 project at the very end of the Mississippi River near what is called the Bird’s Foot Delta. The diversion is a strategic cut in the river levees designed to shunt sediment out into the bay where it is supposed to fall out of the stream and start building land. But for six years after it was built, the diversion didn’t work. Sediment moved too fast to build up where planners needed it to. New land started emerging after the Army Corps of Engineers solved the problem by building a series of small artificial islands in the bay starting in 2009.

Georeferenced mosaics of our two passes through the Lake Hermitage marsh creation project, overlaid on existing USGS NAIP imagery. Because the images covered so little of our area of interest, we didn’t use them in the final project. (ProPublica/The Lens/Public Lab/USGS)

The winds were more favorable that day, so we were optimistic we would be able to launch the balloon rig — a six-foot-wide red weather balloon attached to 3,000 feet of string.

The Balloon Rig: Gulf Restoration Network’s Scott Eustis launches the six-foot weather balloon over West Bay. The balloon will carry the dual-camera chassis about 2,000 feet above the delta. (Edmund D. Fountain for ProPublica/The Lens)

We arrived at the Cypress Cove marina, where Louisiana Highway 23 comes to an end after stretching down the length of the west bank of the Mississippi River. There we met Earl Armstrong, a cattle rancher from nearby Venice, La., who offered to take us in his airboat through the bayous to the diversion.

Armstrong is an advocate for the diversion. We profiled him in our “Losing Ground” project, and he was instrumental in the Army Corps’ decision to build the islands to slow down the sediment.

We climbed aboard, and donned earmuffs as his airboat roared to life. We weaved through the bayous around Venice, crossing Grand Pass, one of the wide streams that branches off from the end of the Mississippi to the gulf, until we approached the diversion. We could see enormous tanker ships gliding up the river beyond the levee.

We were extremely lucky to have Armstrong and his airboat as our guides; any other vessel would have run aground against the mud lumps and young Delta land studded with green.

We launched the balloon twice, unspooling about 2,000 feet of string as Armstrong piloted the airboat around a half-square mile triangle where sediment was accumulating near the diversion. It sure seemed to us like the state’s plan was working — the amount of new land and vegetation was striking.

The two cameras attached to the rig were synched to automatically take photos every five seconds. By the time we disembarked at Cypress Cove, we’d taken hundreds of images.

The hardest part of the project came when we got back to the office. Not only would we have to georeference a tiny triangle of some of the most unstable land in the world, we would have to match up the visible spectrum and infrared images in order to create our false color land/water classification maps.

We started by looking at both Google Earth images of the area of interest and 2013 images from the National Agriculture Imagery Program, or NAIP (another high-resolution, true-color-only dataset available from the USGS). Using these, we were able to match up the biggest, most persistent pieces of land. However, much had changed in the year since the NAIP image was taken. Using coordinates recorded on photos taken with iPhones while on the boat, we were able to track where the boat was when we launched the balloons. We were also able to use images of Bob's boat, stashed nearby, as a ground control point. When we matched up a few images near the boat marker, the others fell into place like jigsaw pieces.
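
Those iPhone coordinates come out of the photos' EXIF data as degrees, minutes and seconds plus a hemisphere reference, while mapping tools want decimal degrees. Here is a small conversion sketch (the sample values are made up, not our actual control points):

# Convert EXIF-style GPS coordinates (degrees/minutes/seconds plus hemisphere)
# into the decimal degrees a mapping tool expects
def dms_to_decimal(degrees, minutes, seconds, ref):
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    return -decimal if ref in ("S", "W") else decimal

lat = dms_to_decimal(29, 15, 30.0, "N")
lon = dms_to_decimal(89, 21, 45.0, "W")
print(lat, lon)  # roughly 29.2583, -89.3625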

We used Photoshop’s warp-and-distort and scale transformations to stretch the images according to underlying features. We then added the infrared images and warped them to match the visible light images. With accurate alignment, we could then use the infrared data to create a land/water mask to accentuate the differences. Since we were building the entire image over the 2013 NAIP image, we were able to use the Geographic Imager plugin for Photoshop to convert our new mosaic into a GeoTIFF.
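
We did this step in Photoshop, but the underlying land/water logic is simple enough to sketch in code: water absorbs near-infrared light, so low NIR values read as water. Here is a rough equivalent using the rasterio library, assuming a four-band GeoTIFF with near-infrared as band 4; the filenames and threshold are placeholders, not our actual values:

import rasterio

# Read the near-infrared band from a georeferenced mosaic (the band order is an assumption)
with rasterio.open("west_bay_mosaic.tif") as src:
    nir = src.read(4).astype("float32")
    profile = src.profile

THRESHOLD = 60  # tune by eye against areas you know are land and water
land_mask = (nir > THRESHOLD).astype("uint8")

# Write the one-band land/water mask as a new GeoTIFF with the same georeferencing
profile.update(count=1, dtype="uint8")
with rasterio.open("west_bay_land_water.tif", "w", **profile) as dst:
    dst.write(land_mask, 1)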

The resulting false-color mosaic we made by georeferencing images from both the true-color and hacked infrared cameras taken on the balloon rig. If you look closely, you can see Bob’s boat docked against the land mass in the top right of the image. (ProPublica/The Lens/Public Lab)

Getting high temporal, spatial and spectral resolution isn't easy to do if you don't have deep pockets. A single 10-square-mile multispectral georeferenced image from DigitalGlobe taken recently was expensive. We purchased two of these, including one for Bay Denesse, another area we wanted to feature in "Louisiana's Moon Shot." Public Lab's DIY approach is admirable, and a promising work in progress — as are we as aerial photographers.

If we were to do it again, we might allot more time to return to Lake Hermitage on a less windy day, and we might use slightly more expensive cameras, perhaps with embedded GPS units and gyroscopes, to minimize processing time. That being said, the ability to control the entire mapping process from image acquisition all the way through to a final web map was priceless. So was our ability to witness for ourselves the delta being remade, and to meet the people with the biggest stake in the restoration's success.

Many thanks to Scott Eustis and Jordan Macha from the Gulf Restoration Network, Shannon Dosemagen, Stevie Lewis and Stewart Long from Public Lab for arranging the trip, providing the tools, training and cartography consulting, Knight-Mozilla OpenNews for helping pay some of The Lens’ and Public Lab’s costs, Earl Armstrong and Bob Marshall for indulging us as armchair orthophotographers, if only for a week.
