Data USA

Data USA is a collaboration between Deloitte, Macro Connections at the MIT Media Lab, and Datawheel. According to its About page, it is “the most comprehensive website and visualization engine of public US Government data.” The data is pulled from sources such as the American Community Survey, the Bureau of Economic Analysis, and the Bureau of Labor Statistics, and the visualizations are powered by D3plus, an open source visualization engine.

Are secondary data users research parasites?

Even though NIH and NSF both have data sharing requirements, there is clearly some resistance to sharing. The best example is an editorial in the New England Journal of Medicine that characterizes secondary data users as “research parasites.”

A rebuttal comes from a Science editorial with the title #IAmAResearchParasite.

Data Sharing
Dan L. Longo and Jeffrey Drazen | N Engl J Med
January 21, 2016

#IAmAResearchParasite
Marcia McNutt | Science
March 4, 2016

Demographic and Economic Profiles of the Super Tuesday States

In advance of Super Tuesday, the U.S. Census Bureau released demographic and economic profiles of the 12 states holding primaries and caucuses.

Hall of Justice

The Sunlight Foundation has created a project called Hall of Justice, which gathers publicly available criminal justice datasets and research.

While not comprehensive, Hall of Justice contains nearly 10,000 datasets and research documents from all 50 states, the District of Columbia, U.S. territories and the federal government. The data was collected between September 2014 and October 2015. We have tagged datasets so that users can search across the inventory for broad topics, ranging from death in custody to domestic violence to prison population. The inventory incorporates government as well as academic data.

A New FiveThirtyEight Podcast

In addition to their weekly podcast on data, What’s the Point?, as well as their sports podcast, Hot Takedown, FiveThirtyEight has launched an election podcast called, appropriately enough, FiveThirtyEight Elections.

Kaggle Datasets

Kaggle, a platform for predictive modelling and analytics competitions, introduced a section for users to download and analyze public data.

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Kaggle Datasets has four core components:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility to the previous work that’s been created on the data
  • Conversation: forums and comments for discussing the nuances of the data

Current datasets include U.S. Baby Names, 2013 American Community Survey, May 2015 Reddit Comments, U.S. Department of Education: College Scorecard, and Ocean Ship Logbooks (1750-1850).

U.S. Census Bureau Open Source

The U.S. Census Bureau is committing to an open source policy. Its mission “is to serve as the leading source of quality data about the nation’s people and economy. We honor privacy, protect confidentiality, share our expertise globally, and conduct our work openly. Where possible, the US Census Bureau will actively participate in open source projects aimed at increasing value to the public through our data dissemination efforts.”

Read a current list of the open source projects here.

Playing with Mortality Visualizations

Nathan Yau of Flowing Data has been doing some interesting (and beautiful) visualizations of when and how people die. First was Years You Have Left to Live, Probably. Next was Causes of Death. And today he posted How You Will Die.

Classifying Countries by Income

The World Bank has released a new working paper by Neil Fantom and Umar Serajuddin reviewing the World Bank’s classification of countries by income.


The World Bank has used an income classification to group countries for analytical purposes for many years. Since the present income classification was first introduced 25 years ago there has been significant change in the global economic landscape. As real incomes have risen, the number of countries in the low income group has fallen to 31, while the number of high income countries has risen to 80. As countries have transitioned to middle income status, more people are living below the World Bank’s international extreme poverty line in middle income countries than in low income countries. These changes in the world economy, along with a rapid increase in the user base of World Bank data, suggest that a review of the income classification is needed. A key consideration is the views of users, and this paper finds opinions to be mixed: some critics argue the thresholds are dated and set too low; others find merit in continuing to have a fixed benchmark to assess progress over time. On balance, there is still value in the current approach, based on gross national income per capita, to classifying countries into different groups. However, the paper proposes adjustments to the methodology that is used to keep the value of the thresholds for each income group constant over time. Several proposals for changing the current thresholds are also presented, which it is hoped will inform further discussion and any decision to adopt a new approach.

Read a summary of the findings.
Download the PDF.
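
To make the idea of threshold-based classification concrete, here is a minimal Python sketch. The threshold values are round placeholders chosen for illustration, not the World Bank's official cutoffs, and the function names and the inflation factor are likewise hypothetical. It only shows the general mechanics: group a country by gross national income per capita, and scale the thresholds by a price factor so they stay constant in real terms.

```python
from bisect import bisect_left

# Income groups, ordered from lowest to highest.
GROUPS = [
    "low income",
    "lower middle income",
    "upper middle income",
    "high income",
]

# ASSUMPTION: placeholder upper bounds (US$ GNI per capita) for the first
# three groups. These are NOT the official World Bank thresholds.
ILLUSTRATIVE_THRESHOLDS = [1000, 4000, 12000]


def classify(gni_per_capita, thresholds=ILLUSTRATIVE_THRESHOLDS):
    """Return the income group for a given GNI per capita.

    A value exactly equal to a threshold falls in the lower group.
    """
    return GROUPS[bisect_left(thresholds, gni_per_capita)]


def adjust_thresholds(thresholds, price_factor):
    """Scale thresholds by a price factor to hold them constant in real terms.

    ASSUMPTION: a single multiplicative factor stands in for whatever
    deflator a real methodology would use.
    """
    return [t * price_factor for t in thresholds]


if __name__ == "__main__":
    for income in (800, 1000, 3500, 12001):
        print(income, "->", classify(income))
    # Thresholds after a hypothetical 2% rise in prices.
    print(adjust_thresholds(ILLUSTRATIVE_THRESHOLDS, 1.02))
```

Which deflator to apply in the adjustment step is exactly the kind of methodological choice the paper discusses; the sketch leaves it as a single input.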

How Americans Spend Their Day

Nathan Yau of Flowing Data created a beautiful visualization of how Americans spend an average day.

More specifically, I tabulated transition probabilities for one activity to the other, such as from work to traveling, for every minute of the day. That provided 1,440 transition matrices, which let me model a day as a time-varying Markov chain.
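
As a rough illustration of what a time-varying Markov chain looks like in code, the Python sketch below simulates one person's day with a separate transition matrix for each of the 1,440 minutes. The activity list and all probabilities are made-up toy values, not Yau's tabulated data, and the function names are hypothetical.

```python
import random

MINUTES_PER_DAY = 1440

# ASSUMPTION: a tiny, made-up set of activities; the real data has many more.
ACTIVITIES = ["sleep", "work", "travel", "leisure"]


def toy_matrix(minute):
    """Build a toy transition matrix for one minute of the day.

    Row i holds the probabilities of moving from activity i to each activity.
    At night every activity drifts toward "sleep"; during the day people
    mostly keep doing what they are doing. Values are illustrative only.
    """
    n = len(ACTIVITIES)
    night = minute < 6 * 60 or minute >= 22 * 60
    matrix = []
    for i in range(n):
        row = [0.01] * n
        favoured = 0 if night else i  # index 0 is "sleep"
        row[favoured] = 1 - 0.01 * (n - 1)
        matrix.append(row)
    return matrix


def simulate_day(matrices, start, rng):
    """Walk through the day, applying the matrix for each minute in turn.

    Returns the starting state followed by one state per transition, so the
    result has len(matrices) + 1 entries.
    """
    state = start
    path = [state]
    for matrix in matrices:
        # Sample the next activity from the current activity's row.
        state = rng.choices(range(len(matrix)), weights=matrix[state])[0]
        path.append(state)
    return path


if __name__ == "__main__":
    matrices = [toy_matrix(m) for m in range(MINUTES_PER_DAY)]
    path = simulate_day(matrices, start=0, rng=random.Random(42))
    # Show the simulated activity at the top of every third hour.
    for hour in range(0, 24, 3):
        print(f"{hour:02d}:00", ACTIVITIES[path[hour * 60]])
```

Unlike an ordinary Markov chain, the transition probabilities here depend on the time of day as well as the current activity, which is why one matrix per minute is needed.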