Archive for the 'Data' Category

Formatting Data in R

Nathan Yau of Flowing Data recently published a tutorial on loading data and basic formatting in R. The tutorial covers loading data from CSV files, subsetting data frames, editing data to make it easier to manage and merging multiple datasets.

Lynchings in America


[click here for link to NYT graphic]

The Equal Justice Initiative has documented nearly 4,000 lynchings in the South between 1877 and 1950. This resource is potentially useful for examining out-migration of blacks, particularly men, from the South during this era. It could also be useful for explaining current race-based inequalities, including incarceration.

Lynching in America: Confronting the Legacy of Racial Terror
Summary Report | Equal Justice Initiative
February 2015

Supplement: Lynchings by County [pdf only]

Note that the graphic by the New York Times has a time-dimension in it. I am awaiting the full report from the Equal Justice Initiative to see what additional detail is available in it.

Press Coverage [scroll down]

Data Demise: ACS 3-year product

The Census Bureau has released its last 3-year ACS product with the 2011-2013 release. This is a cost-cutting move, although the Census Bureau might argue that it never meant for there to be a 3-year product in the first place.

The Census Bureau is not cutting back on data collection – it is eliminating the tabular release of the 3-year data (geographic areas of 20,000+). The 1-year data are for geographies of 65,000+ and the 5-year data have no population limits. These will continue to be released.

The microdata products share the same release types: 1-year, 3-year, and 5-year. These all share the same geographic limit (PUMAs), but the 3-year and 5-year products are not just concatenations of the 1-year files. They have been re-weighted, and income-denominated items are inflated to the last year of the period (e.g., 2013). [See explanatory note from IPUMS].
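The inflation step mentioned above can be sketched as follows. This is only an illustration of the idea, not the Census Bureau's or IPUMS's procedure: the adjustment factors, the record layout, and the function name are hypothetical, and real factors should come from the product documentation.

```python
# Hypothetical adjustment factors: multiply a dollar amount reported in a
# given survey year by its factor to express it in 2013 dollars.
# These values are made up for illustration only.
ADJUST_TO_2013 = {2011: 1.05, 2012: 1.02, 2013: 1.00}


def to_final_year_dollars(amount, year, factors=ADJUST_TO_2013):
    """Express a dollar amount from `year` in final-year dollars."""
    return round(amount * factors[year], 2)


# Three hypothetical respondents, one per survey year of a 3-year file.
records = [
    {"year": 2011, "income": 40000},
    {"year": 2012, "income": 40000},
    {"year": 2013, "income": 40000},
]

adjusted = [to_final_year_dollars(r["income"], r["year"]) for r in records]
print(adjusted)
```

The point of the sketch is that the same nominal income reported in different years ends up as different values in a multi-year file, which is one reason the multi-year files are not simple concatenations of the 1-year files.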

The ACS 3-year Demographic Estimates are History
Brendan Buff | APDU Blog post
Feb 3, 2015

Census Bureau Statement on American Community Survey 3-Year Statistical Product
Stanford University Libraries | Ron Nakao’s Blog

Not any more: NY vs FL


The above cartoon is from the Florida Sun Sentinel back in early 2014, when New York had just held on to its ranking as the third largest state. With the release of the most recent population estimates, Florida has now edged out New York.

Florida Passes New York to Become the Nation’s Third Most Populous State, Census Bureau Reports
December 23, 2014

We’ve updated our Apportionment Calculator. See which states are projected to lose/gain seats in 2020 based on the 2014 results.

And, no. North Dakota is not gaining a seat, even as it is the fastest growing state.
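For readers curious how seats are assigned, here is a minimal sketch of the method of equal proportions, the method used to apportion the U.S. House. The post does not say how the calculator linked above is implemented, so treat this as an assumption about the general approach; the state names and populations below are hypothetical.

```python
import heapq
from math import sqrt


def apportion(populations, seats):
    """Assign seats by the method of equal proportions.

    Every state starts with one seat. Each remaining seat goes to the state
    with the highest priority value, population / sqrt(n * (n + 1)), where
    n is the number of seats that state currently holds.
    """
    if seats < len(populations):
        raise ValueError("need at least one seat per state")
    result = {state: 1 for state in populations}
    # heapq is a min-heap, so priorities are stored as negative numbers.
    heap = [(-pop / sqrt(1 * 2), state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(seats - len(populations)):
        _, state = heapq.heappop(heap)
        result[state] += 1
        n = result[state]
        heapq.heappush(heap, (-populations[state] / sqrt(n * (n + 1)), state))
    return result


# Hypothetical populations, not real estimates.
example = apportion({"A": 1000, "B": 500, "C": 100}, seats=10)
print(example)
```

Because priority values depend on the population of every state, a fast-growing small state can still fall short of the threshold for an additional seat.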

Tools: Data as Text

Most of the familiar statistical packages social scientists work with are not well-equipped for analysis of text. Python is one tool often used with text data.

Here is a series of Python tutorials posted on Neal Caren’s GitHub site. Notice the wide prevalence of code sharing, which is typical of the community working in this field.

You can follow his tutorials on Python or take a Coursera course by a UM professor in February. Another option is the Coursera Data Science specialization offered via Johns Hopkins. This set of courses skips Python but includes a snapshot of the variety of concentrations in this field.

Learning Python for Social Scientists [list curated by Neal Caren]
Programming for Everybody (Python) [University of Michigan via Coursera]
Data Science Specialization [Johns Hopkins via Coursera]

Here’s a rendering of that specialization from a student in the Data Toolbox course:

data science dependencies
Source: Uri Grodzinski

More fun with names

Who knew that the name Violet was such a good example of a bi-modal distribution?


This was drawn from a very fun post:

How to Tell Someone’s Age When All You Know is Her Name
Nate Silver and Allison McCann | FiveThirtyEight blog
May 29, 2014

We had a previous post on fun with the Social Security names database.

This age of names example is a great applied demography exercise – calculating the median age of names. For that you’ll need a link to the full names database and cohort life tables:

Beyond the Top 1000 Names
Cohort Life Tables for the Social Security Areas by Calendar Year
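The exercise can be sketched roughly like this: multiply births in each year by the share of that cohort still alive in the reference year, then take the median age of the estimated living name holders. The function name and all numbers below are hypothetical; real inputs would come from the names database and the cohort life tables linked above.

```python
def median_age_of_name(births_by_year, survival_to_ref, ref_year):
    """Estimate the median age of living people with a given name.

    births_by_year: {birth_year: number of babies given the name}
    survival_to_ref: {birth_year: share of that cohort alive in ref_year}
    """
    living = {
        year: births_by_year[year] * survival_to_ref[year]
        for year in births_by_year
    }
    total = sum(living.values())
    if total == 0:
        raise ValueError("no estimated living name holders")
    cumulative = 0.0
    # Walk from the youngest cohort to the oldest until half are counted.
    for year in sorted(living, reverse=True):
        cumulative += living[year]
        if cumulative >= total / 2:
            return ref_year - year


# Hypothetical bi-modal name: popular around 1920 and again around 2005.
births = {1920: 5000, 1950: 500, 2005: 1000}
survival = {1920: 0.20, 1950: 0.85, 2005: 0.99}
age = median_age_of_name(births, survival, ref_year=2014)
print(age)
```

This sketch uses single birth years and one survival share per cohort; a fuller version would use sex-specific life tables, as the names data are reported by sex.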

Here’s also a nice link to some Big Data exercises via Python. There is a lot of code sharing in this GitHub repository.

Counting Same-Sex Couples

Both the Pew Research Center and the FiveThirtyEight blog have written about the trouble the U.S. Census Bureau has counting same-sex couples.

Pew’s story (which came out in September) discusses the way gender reporting on the census confounds the data.

The story in FiveThirtyEight reports on how the Census Bureau is working to make its questions gather more accurate data.

The Antidote for “Anecdata”: A Little Science Can Separate Data Privacy Facts from Folklore

The Antidote for “Anecdata”: A Little Science Can Separate Data Privacy Facts from Folklore
Daniel Barth-Jones | Info/Law Blog [Harvard]
November 21, 2014
This is a great piece that shows again that most of the publicity about re-identification in data is overblown:

The 11 in 173 million risk demonstrated for this celebrity ride re-identification (or 1 in 15,743,614) is truly infinitesimal. To put this in perspective, this risk is over 1,000 times smaller than one’s lifetime risk of being hit by lightning. With proper de-identification applied and the cryptographic hash problem fixed in any future data releases, this spooky specter of celebrity cyber-stalking using TLC taxi data is likely to vanish as soon as one turns on the lights.

This blog post is in reaction to the release of NYC taxi medallion data, which were improperly anonymized. A previous blog post described the data.

Here is the piece that sensationalizes the possibility of re-identification, based on famous people who ride cabs.
Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset
Anthony Tockar | Neustar Blog
September 15, 2014

Big Data: NYC Taxi Cab Trips

This is a big data resource, and more. Check out the reaction to the bad anonymization here.

20GB of uncompressed data comprising more than 173 million individual trips. Each trip record includes the pickup and dropoff location and time, anonymized hack licence number and medallion number (i.e. the taxi’s unique id number, 3F38, in my photo above), and other metadata.

Before the link to the data, here’s an analysis based on similar data:
Why New Yorkers Can’t Find a Taxi When It Rains
Eric Jaffe | City Lab Blog
October 20, 2014
Provides a nice synopsis of some research using taxi cab rides. Read it for the links to the formal research papers.

New York City Taxi Cab Trips [in small chunks]

FOILing NYC’s Taxi Trip Data
Chris Whong | personal website of an Urbanist, Mapmaker, Data Junkie
March 18, 2014
A synopsis of how he got the data via a FOIL request, plus a link to the data on rides/fares as single files instead of the chunked version above.

And the story about how the taxicab medallion IDs were improperly anonymized:

Poorly anonymized logs reveal NYC cab drivers’ detailed whereabouts
Dan Goodin | ars technica
June 23, 2014

On Taxis and Rainbows: Lessons from NYC’s improperly anonymized taxi logs
Vijay Pandurangan | Medium blog

Getting A More Accurate Count Of Arab Americans

The PSC Infoblog has reported on this earlier here, but this is still of interest.

The U.S. Census Is Trying To Get A More Accurate Count Of Arab Americans
Ben Casselman | Blog
November 24, 2014

Note that this article mentions that the Census Bureau did a special tabulation for Homeland Security to provide counts of Arab populations by geography (place and zip code).

Some Arabs have expressed reluctance to identify themselves on a government form, especially after the Census Bureau shared detailed data on the Arab-American population with the Department of Homeland Security in the early 2000s

The “detailed data” referenced above were public use tables from American FactFinder. Here’s the original FOIA request from the Electronic Privacy Information Center:

FOIA request: Department of Homeland Security Obtained Data on Arab Americans From Census Bureau [Source: EPIC]

Here is the example for Places, drawn from DP-2. Here’s the example for Zip Codes, drawn from Tables PCT16 and PCT17.