Archive for the 'Methodology' Category

Backcasting Native Hawaiian Population

The Pew Research Center Fact Tank examines findings by David Swanson, who uses 1910 and 1920 Census data to estimate the population of Hawaii in 1778, the year Capt. James Cook arrived.

In this case, Swanson took a detailed look at the 1910 and 1920 U.S. Census’s Native Hawaiian counts, tracking the survival rate of each five-year age group from one census to the next. For example, he looked at how many children who were newborns to age 4 in 1910 were counted as 10- to 14-year-olds in 1920, then did the same for each successive age group. For each group, he created a “reverse cohort change ratio,” which he used to go back in time and estimate the size of each age group for each decade until he got to 1770.
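
To make the idea concrete, here is a minimal Python sketch of a reverse cohort change ratio. The counts are made up for illustration and are not Swanson's data; the function names are hypothetical. It assumes five-year age groups and a ten-year gap between censuses, so each cohort moves up two age groups. It also drops the two oldest age groups at every step, which a real backcast would have to handle separately.

```python
# Hypothetical counts by five-year age group (index 0 = ages 0-4, 1 = ages 5-9, ...).
# Illustrative numbers only, not Swanson's data.
pop_1910 = [100, 90, 80, 70, 60, 50]
pop_1920 = [95, 88, 85, 78, 70, 60]

def reverse_ccr(earlier, later, shift=2):
    """Ratio of each age group in the earlier census to the same cohort,
    now `shift` age groups older, in the later census."""
    return [earlier[i] / later[i + shift] for i in range(len(earlier) - shift)]

def backcast_one_decade(pop, ratios, shift=2):
    """Estimate the age groups one decade before `pop` by applying the ratios."""
    return [ratios[i] * pop[i + shift] for i in range(len(ratios))]

ratios = reverse_ccr(pop_1910, pop_1920)

# Apply the 1910-1920 ratios to the 1910 counts to estimate 1900.
pop_1900 = backcast_one_decade(pop_1910, ratios)

# Sanity check: applying the ratios to 1920 recovers the 1910 counts they came from.
recovered_1910 = backcast_one_decade(pop_1920, ratios)
```

Repeating the last step decade by decade is what carries the estimate further back in time. The key assumption is that the ratios observed between 1910 and 1920 also held in earlier decades.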

The article also reports on the growth of the Native Hawaiian population since the 1980s.

App vs. Web for Surveys

The Pew Research Center has been experimenting with mobile apps for “signal-contingent experience sampling” to gather data about how Americans use their smartphones. They have just released a report examining the possibilities of this method:

This report utilizes a form of survey known as “signal-contingent experience sampling” to gather data about how Americans use their smartphones on a day-to-day basis. Respondents were asked to complete two surveys per day for one week (using either a mobile app they had installed on their phone or by completing a web survey) and describe how they had used their phone in the hour prior to taking the survey. This report examines whether this type of intensive data collection is possible with a probability-based panel and to understand the differences in participation and responses when using a smartphone app as opposed to a web browser for this type of study.

PAA President Ruggles wants you to write a letter

This is an excellent summary of the consequences of the demise of the 3-year ACS tabular products. Please follow through and contact the relevant government officials:

ACS 3-Year Summary Products: Please take action to save the ACS 3-year data products
Steve Ruggles | PAA President and Director of the Minnesota Population Center
March 4, 2015

Another take on gentrification

Gentrification in America Report
Mike Maciag | Governing
February 2015
This resource is city-specific and provides both counts and maps of gentrified census tracts for the 50 largest cities. To be eligible to gentrify, a census tract's median household income and median home value both had to be in the bottom 40th percentile of all tracts within its metro area at the beginning of the decade. The gentrified tracts then recorded increases in the top third percentile for both measures when compared to all other tracts in the metro area.
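
Here is a rough Python sketch of that two-step rule, taking the description above literally. The tracts and dollar amounts are made up, the function names are hypothetical, and measuring the "increase" as a percent change is an assumption; Governing's full methodology may use additional criteria.

```python
def percentile_rank(values, x):
    """Share of values less than or equal to x (between 0 and 1)."""
    return sum(v <= x for v in values) / len(values)

def pct_change(start, end):
    return (end - start) / start

def classify_tracts(tracts):
    """Label each tract in one metro area using the two-step rule."""
    incomes = [t["income_start"] for t in tracts.values()]
    values = [t["value_start"] for t in tracts.values()]
    inc_growth = {k: pct_change(t["income_start"], t["income_end"]) for k, t in tracts.items()}
    val_growth = {k: pct_change(t["value_start"], t["value_end"]) for k, t in tracts.items()}
    labels = {}
    for name, t in tracts.items():
        # Step 1: both measures start in the bottom 40th percentile.
        eligible = (percentile_rank(incomes, t["income_start"]) <= 0.40
                    and percentile_rank(values, t["value_start"]) <= 0.40)
        # Step 2: growth in both measures ranks in the top third of the metro area.
        top_third = (percentile_rank(list(inc_growth.values()), inc_growth[name]) > 2 / 3
                     and percentile_rank(list(val_growth.values()), val_growth[name]) > 2 / 3)
        if not eligible:
            labels[name] = "not eligible"
        elif top_third:
            labels[name] = "gentrified"
        else:
            labels[name] = "eligible, not gentrified"
    return labels

# Hypothetical metro area with six tracts (dollar amounts are made up).
tracts = {
    "A": {"income_start": 30000, "income_end": 60000, "value_start": 100000, "value_end": 220000},
    "B": {"income_start": 35000, "income_end": 35500, "value_start": 110000, "value_end": 115000},
    "C": {"income_start": 50000, "income_end": 52000, "value_start": 200000, "value_end": 210000},
    "D": {"income_start": 60000, "income_end": 63000, "value_start": 250000, "value_end": 260000},
    "E": {"income_start": 80000, "income_end": 82000, "value_start": 400000, "value_end": 410000},
    "F": {"income_start": 90000, "income_end": 120000, "value_start": 500000, "value_end": 800000},
}
labels = classify_tracts(tracts)
```

In this toy example tract A starts poor and grows fast, so it counts as gentrified. Tract F grows fast too, but it started out wealthy, so it was never eligible.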


And more broadly, this resource has a special issue on gentrification:

The G-Word: A Special Series on Gentrification
The titles in this series are:
Do Cities Need Kids?
The Neighborhood Has Gentrified, But Where’s the Grocery Store?
Just Green Enough
Gentrification’s Not So Black and White After All
The Downsides of a Neighborhood ‘Turnaround’
Some Cities Are Spurring the End of Sprawl
Keeping Cities from Becoming “Child-Free Zones”
From Vacant to Vibrant: Cincinnati’s Urban Transformation
Can Cities Change the Face of Biking?

The Hedonometer Index

This is an index of happiness created from tweets. The index provides a daily score, which can be toggled to exclude weekends, Mondays, etc.

Hedonometer Index

This is an excellent resource because the creators of this happiness index describe how the index is calculated, list the words used in it, provide an API, and link to articles based on the index. It is a valuable resource even if you do not care about happiness, as it provides a template for many other uses of data from Twitter.

Instructions [Documentation of index via video or written – click on links]
Words [Words used in index, ranks, etc.]
Blog [The Computational Story Lab. . . mostly related to happiness]
Press [press coverage]
Papers [refereed papers by research team]
Talks [maybe you need a clip for a lecture]
API [lots of examples]

Move over Index of Consumer Sentiment/Expectations?

I ran across this in the Wall Street Journal (slide 58 of 93):

Can happiness from tweets reduce drawdowns from selling VIX?

Selling VIX futures has been profitable historically. However, the strategy can be subject to drawdowns, when there is risk aversion . . . . Using the Hedonometer index as an input, we have created a Happiness Sentiment Index (HSI), which can be used to proxy market risk sentiment. . . .

HSI index

See next post for more on the Hedonometer Index.

Data Demise: ACS 3-year product

The Census Bureau has released its last 3-year ACS product with the 2011-2013 release. This is a cost-cutting move, although the Census Bureau might argue that it never meant for there to be a 3-year product in the first place.

The Census Bureau is not cutting back on data collection – it is eliminating the tabular release of the 3-year data (geographic areas of 20,000+). The 1-year data are for geographies of 65,000+ and the 5-year data have no population limits. These will continue to be released.

The microdata products share the same release types: 1-year, 3-year, and 5-year. These all share the same geographic limit (PUMAs), but the 3-year and 5-year products are not just concatenations of the 1-year files. They have been re-weighted and income-denominated items are inflated to the last year (e.g., 2013). [See explanatory note from IPUMS].
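
As a small illustration of what "inflated to the last year" means, here is a Python sketch that converts incomes reported in different survey years into final-year dollars. The adjustment factors and the function name are hypothetical; real multi-year files use the adjustment factors published with the data, not these values.

```python
# Hypothetical factors that convert each survey year's dollars to 2013 dollars.
# Made-up values for illustration; use the factors published with the data.
FACTOR_TO_2013 = {2011: 1.04, 2012: 1.02, 2013: 1.00}

# Three hypothetical respondents who each reported $50,000, in different years.
records = [
    {"year": 2011, "income": 50000},
    {"year": 2012, "income": 50000},
    {"year": 2013, "income": 50000},
]

def to_final_year_dollars(record, factors):
    """Scale a reported income by the factor for the year it was collected."""
    return round(record["income"] * factors[record["year"]])

adjusted = [to_final_year_dollars(r, FACTOR_TO_2013) for r in records]
```

The same nominal income is worth more in final-year dollars the earlier it was reported, which is one reason a pooled file is not simply a stack of the 1-year files.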

The ACS 3-year Demographic Estimates are History
Brendan Buff | APDU Blog post
Feb 3, 2015

Census Bureau Statement on American Community Survey 3-Year Statistical Product
Stanford University Libraries | Ron Nakao’s Blog

Tools: Data as Text

Most of the familiar statistical packages social scientists work with are not well-equipped for analysis of text. Python is one tool often used with text data.

Here is a series of Python tutorials posted on Neal Caren’s GitHub site. Notice how prevalent code sharing is. That is a feature of many of the folks who work in this field.

You can follow his tutorials on Python or take a Coursera course by a UM professor in February. Another option is the Coursera Data Science specialization offered via Johns Hopkins. This set of courses skips Python but includes a snapshot of the variety of concentrations in this field.

Learning Python for Social Scientists [list curated by Neal Caren]
Programming for Everybody (Python) [University of Michigan via Coursera]
Data Science Specialization [Johns Hopkins via Coursera]

Here’s a rendering of that specialization from a student in the Data Toolbox course:

data science dependencies
Source: Uri Grodzinski

More fun with names

Who knew that the name Violet was such a good example of a bi-modal distribution?


This was drawn from a very fun post:

How to Tell Someone’s Age When All You Know is Her Name
Nate Silver and Allison McCann | FiveThirtyEight blog
May 29, 2014

We had a previous post on fun with the Social Security names database.

This age of names example is a great applied demography exercise – calculating the median age of names. For that you’ll need a link to the full names database and cohort life tables:

Beyond the Top 1000 Names
Cohort Life Tables for the Social Security Areas by Calendar Year
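
Here is a minimal Python sketch of that exercise. The birth counts and survival probabilities are made up; in practice the births would come from the names database and the survival probabilities from the cohort life tables. The function name is hypothetical, and the result is coarse: it returns the age of the birth cohort that contains the median living person.

```python
def median_age_of_living(births, survival, as_of_year):
    """births:   {birth_year: number of babies given the name}
    survival: {birth_year: probability of being alive in as_of_year}
    Returns the age of the birth cohort containing the median living person."""
    alive = {year: births[year] * survival[year] for year in births}
    half = sum(alive.values()) / 2
    running = 0.0
    # Walk from the youngest cohort to the oldest until half the living are covered.
    for year in sorted(alive, reverse=True):
        running += alive[year]
        if running >= half:
            return as_of_year - year

# Hypothetical bimodal name: popular around 1920 and again around 2010.
births = {1920: 1000, 1950: 300, 1980: 100, 2010: 300}
survival = {1920: 0.05, 1950: 0.80, 1980: 0.97, 2010: 0.99}

median_age = median_age_of_living(births, survival, as_of_year=2014)

# For comparison: the answer if mortality is ignored and everyone is assumed alive.
everyone_alive = {year: 1.0 for year in births}
naive_median_age = median_age_of_living(births, everyone_alive, as_of_year=2014)
```

With these made-up numbers the median living person with the name is 34, while ignoring mortality would put the median at 94. That gap is why the life tables are needed.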

Here’s also a nice link to some Big Data exercises via Python. There is a lot of code sharing in this GitHub repository.

The Antidote for “Anecdata”: A Little Science Can Separate Data Privacy Facts from Folklore

The Antidote for “Anecdata”: A Little Science Can Separate Data Privacy Facts from Folklore
Daniel Barth-Jones | Info/Law Blog [Harvard]
November 21, 2014
This is a great piece that shows again that most of the publicity about re-identification in data is overblown:

The 11 in 173 million risk demonstrated for this celebrity ride re-identification (or 1 in 15,743,614) is truly infinitesimal. To put this in perspective, this risk is over 1,000 times smaller than one’s lifetime risk of being hit by lightning. With proper de-identification applied and the cryptographic hash problem fixed in any future data releases, this spooky specter of celebrity cyber-stalking using TLC taxi data is likely to vanish as soon as one turns on the lights.

This blog post is in reaction to the release of NYC taxi medallion data, which were improperly anonymized. A previous blog post described the data.

Here is the piece that sensationalizes the possibility of re-identification, based on famous people who ride cabs.
Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset
Anthony Tockar | Neustar Blog
September 15, 2014