Kaggle, a platform for predictive modelling and analytics competitions, introduced a section for users to download and analyze public data.
At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.
Kaggle Datasets has four core components:
- Access: simple, consistent access to the data with clear licensing
- Analysis: a way to explore the data without downloading it
- Results: visibility to the previous work that’s been created on the data
- Conversation: forums and comments for discussing the nuances of the data
Current datasets include U.S. Baby Names, 2013 American Community Survey, May 2015 Reddit Comments, U.S. Department of Education: College Scorecard, and Ocean Ship Logbooks (1750-1850).
The U.S. Census Bureau is committing to an open source policy. Their mission “is to serve as the leading source of quality data about the nation’s people and economy. We honor privacy, protect confidentiality, share our expertise globally, and conduct our work openly. Where possible, the US Census Bureau will actively participate in open source projects aimed at increasing value to the public through our data dissemination efforts.”
Read a current list of the open source projects here.
H/T Flowing Data
Nathan Yau of Flowing Data has been doing some interesting (and beautiful) visualizations of when and how people die. First was Years You Have Left to Live, Probably. Next was Causes of Death. And today he posted How You Will Die.
The World Bank has released a new working paper by Neil Fantom and Umar Serajuddin reviewing the World Bank’s classification of countries by income.
The World Bank has used an income classification to group countries for analytical purposes for many years. Since the present income classification was first introduced 25 years ago there has been significant change in the global economic landscape. As real incomes have risen, the number of countries in the low income group has fallen to 31, while the number of high income countries has risen to 80. As countries have transitioned to middle income status, more people are living below the World Bank’s international extreme poverty line in middle income countries than in low income countries. These changes in the world economy, along with a rapid increase in the user base of World Bank data, suggest that a review of the income classification is needed. A key consideration is the views of users, and this paper finds opinions to be mixed: some critics argue the thresholds are dated and set too low; others find merit in continuing to have a fixed benchmark to assess progress over time. On balance, there is still value in the current approach, based on gross national income per capita, to classifying countries into different groups. However, the paper proposes adjustments to the methodology that is used to keep the value of the thresholds for each income group constant over time. Several proposals for changing the current thresholds are also presented, which it is hoped will inform further discussion and any decision to adopt a new approach.
Read a summary of the findings.
Download the PDF.
Nathan Yau of Flowing Data created a beautiful visualization of how Americans spend an average day.
More specifically, I tabulated transition probabilities for one activity to the other, such as from work to traveling, for every minute of the day. That provided 1,440 transition matrices, which let me model a day as a time-varying Markov chain.
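To make the idea concrete, here is a minimal sketch of a time-varying Markov chain in Python. It is not Yau's code: the function names, the integer encoding of activities, the wrap-around from the last minute to the first, and the rule that an unobserved state stays put are all assumptions made for illustration. Each diary is taken to be a list of activity codes, one per minute.

```python
import random


def estimate_transitions(diaries, n_states, n_minutes):
    """Build one transition matrix per minute from activity diaries.

    diaries: list of days, each a list of n_minutes integer activity codes.
    Returns matrices[t][i][j], the estimated probability of moving from
    activity i at minute t to activity j at the next minute.
    """
    counts = [[[0] * n_states for _ in range(n_states)] for _ in range(n_minutes)]
    for day in diaries:
        for t in range(n_minutes):
            # Assumption: the last minute wraps around to the first minute
            # of the same diary, so there is one matrix per minute.
            nxt = day[(t + 1) % n_minutes]
            counts[t][day[t]][nxt] += 1

    matrices = []
    for t in range(n_minutes):
        matrix = []
        for i in range(n_states):
            total = sum(counts[t][i])
            if total == 0:
                # Assumption: a state never observed at this minute stays put.
                row = [1.0 if j == i else 0.0 for j in range(n_states)]
            else:
                row = [c / total for c in counts[t][i]]
            matrix.append(row)
        matrices.append(matrix)
    return matrices


def simulate_day(matrices, start_state, rng):
    """Walk through one simulated day, using minute t's matrix for step t."""
    state = start_state
    path = [state]
    for t in range(len(matrices) - 1):
        row = matrices[t][state]
        state = rng.choices(range(len(row)), weights=row)[0]
        path.append(state)
    return path


# Toy data: two activities (0 and 1) and a four-"minute" day.
toy_diaries = [[0, 0, 1, 1], [0, 1, 1, 0]]
toy_matrices = estimate_transitions(toy_diaries, n_states=2, n_minutes=4)
toy_path = simulate_day(toy_matrices, start_state=0, rng=random.Random(0))
```

With real diary data, `n_minutes` would be 1,440 and each simulated path would represent one person's day, which is roughly what the visualization animates.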
The drop in birth rates from 2007 through 2013 has been well documented. However, it is also important to examine total rates of pregnancy and other pregnancy outcomes (abortion and fetal loss) to provide a comprehensive picture of current reproductive trends. This NCHS Health E-Stat uses data from 2010 to update a previous NCHS report on pregnancy rates. Data on pregnancy outcomes by age and race and Hispanic origin are presented.
2010 Pregnancy Rates Among U.S. Women
Sally C. Curtin, Joyce Abma [NCHS] and Kathryn Kost [Guttmacher Institute]
html | pdf
Monday’s Supreme Court case centered on data. The case, Evenwel v. Abbott, argues that representation in Texas legislative districts ought to be based on eligible voters rather than total population. Currently, most states use total population for redistricting, and those counts come from the decennial census. The decennial census does not have a citizenship question, but the replacement for the census long form, the American Community Survey (ACS), does.
The former directors of the Census Bureau filed an amicus brief against the idea of using the eligible voter population (e.g., citizens 18+ years of age). A group of applied demographers also filed an amicus brief, noting that this was quite possible using the ACS. Note that Sonia Sotomayor does not think the ACS is adequate, but that is because she misunderstands the data:
As is typical with cases involving data and social science research, there are lots of supplementary links:
The Washington Post [10 or so opinions from the Opinion | In Theory section]
‘One Person One Vote’: A Primer
Washington Post | Opinion : In Theory
[10 or so opinions and comments]
Argument preview: How to measure “one person, one vote”
Lyle Denniston | SCOTUSblog
December 1, 2015
The Threat to Representation for Children and Non-Citizens: An Analysis of the Potential Impact of Evenwel v. Abbott on Redistricting
Andrew Beveridge | Social Explorer
December 2, 2015
Supreme Court is skeptical of challenge to Texas district lines
Maria Recio | Sacramento Bee
December 8, 2015
This is the source of the Sotomayor quote
“. . . Dueling Affirmative Action Empiricism” [this is actually from Fisher v. Texas, but is included here as evidence of the Supreme Court using social science research]
The FiveThirtyEight blog has a podcast called What’s the Point: “A show about our data age. Each week, Jody Avirgan brings you stories and interviews on how data is changing our lives.”
The most recent episode is about polling and religion in America.
The Pew Research Center released a report detailing the unique challenges of surveying Latinos in the United States.
As the U.S. Hispanic population grows, reaching nearly 57 million in 2015 and making up 18% of the nation’s population, it is becoming increasingly important to represent Hispanics in surveys of the U.S. population and to understand their opinions and behavior. But surveying Hispanics is complicated for many reasons – language barriers, sampling issues and cultural differences – that are the subject of a growing field of inquiry. This report explores some of the unique challenges currently facing survey researchers in reaching Hispanics and offers considerations on how to meet those challenges based on the research literature and our experiences in fielding the Pew Research Center’s National Survey of Latinos.
Download the full report (PDF)
The U.S. Census Bureau released a new interactive visualization which shows how race and ethnicity categories have changed since the first census.
From the Random Samplings blog post:
Over the years, the U.S. Census Bureau has collected information on race and ethnicity. The census form has always reflected changes in society, and shifts have occurred in the way the Census Bureau classifies race and ethnicity. Historically, the changes have been influenced by social, political and economic factors including emancipation, immigration and civil rights. Today, the Census Bureau collects race and ethnic data according to U.S. Office of Management and Budget guidelines, and these data are based on self-identification.
H/T: Data Detectives