Archive for the 'Methodology' Category

Big data and the death of polls

The rumors of the death of polls might be greatly exaggerated. Recent coverage of a Twitter-based study ignores the weak effects reported in the original paper. ["For instance, being an incumbent predicts almost a 50,000 vote contribution to the Republican margin in their statistical model, whereas receiving 100 percent (all!) of tweet-mentions gets you only 155 votes."] Yet one of the paper’s authors even goes so far as to say, “In the future, you will not need a polling organization to understand how your elected representative will fare at the ballot box. Instead, all you will need is an app on your phone.”
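To put the two quoted effect sizes side by side, here is a minimal arithmetic sketch. It uses only the figures quoted above and treats "almost 50,000" as exactly 50,000, which is an assumption made for illustration; the variable names are ours, not the paper's.

```python
# Figures as quoted in the post; "almost 50,000" is rounded to 50,000 here (an assumption).
incumbency_votes = 50_000   # approximate predicted contribution of incumbency to the margin
all_tweets_votes = 155      # predicted contribution of receiving 100% of tweet-mentions

# How many times larger is the incumbency effect than the maximum tweet-share effect?
ratio = incumbency_votes / all_tweets_votes
print(f"Incumbency is worth roughly {ratio:.0f} times the maximum tweet-share effect")
```

On these numbers, incumbency outweighs even a complete monopoly on tweet-mentions by a factor in the hundreds.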

How Twitter can predict an election
Fabio Rojas | Opinions, Washington Post
August 11, 2013

Original Paper
More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior
J. DiGrazia, K. McKelvey, J. Bollen and F. Rojas | SSRN
February 21, 2013

How Twitter can Predict Elections: A Rebuttal
Rob Santos | Washington Post
August 16, 2013

Can Twitter Predict Elections? Not so Fast
Mark Blumenthal & Ariel Edwards-Levy | Huffington Post
August 16, 2013

Let’s Calm Down about Twitter Being Able to Predict Elections, Guys
Jason Linkins | Huffington Post
August 14, 2013

Popular Press
How Twitter can help predict an election – in one eye-catching study
Sean Sullivan | Washington Post
August 14, 2013
Want to figure out who is going to win a congressional race? Find out which candidate received the lion’s share of tweets in the lead-up to Election Day.


Some high-profile misses are also illustrative of the challenge of using tweets to reliably project elections. Anthony Weiner’s nearly 250,000 mentions on twitter (according to are unlikely to revive his downward spiral in the New York mayoral race – current front-runner Bill de Blasio has received barely 10,000 mentions in the same period. And while then-Rep. Ron Paul (R-Tex.) received wide recognition on Twitter during the 2012 Republican presidential primaries, he failed to win a single contest.

A New Study Says Twitter Can Predict US Elections
Robinson Meyer | The Atlantic
August 13, 2013

[Unrelated Visualization of Tweets before the 2010 Election]

The Perils of Administrative Censuses

Some who are against a mandatory census argue that the government already has this information and is wasting money re-collecting data. Of course, not all information on individuals is tied to their residence and the census needs to know the location of the population for reapportionment purposes. Others who are against the census are also against big government so probably are not in favor of administrative records as a data collection device.

The following is the record of the German administrative census as compared to a population count. Some of the sources are US-based research from the Census Bureau, which is looking to use administrative records to supplement its address-based census.

Germany Counts Heads and Finds 1.5 Million Fewer Residents Than It Expected
Press Release | Statistisches Bundesamt [German Federal Statistical Office]
May 21, 2013

Lessons from the German Census
D’Vera Cohn | Fact Tank: Pew Research Center
June 20, 2013

When the results of the 2011 German census were announced recently, they included an embarrassing error – at least in the demographics world. It showed the German population was 1.5 million people short of what the government had expected. The news dealt a blow to Germany’s reputation for efficient record-keeping, and it’s also relevant to how the next U.S. Census is conducted.

2010 Census Administrative Records Use for Coverage Problems Evaluation Report
Dave Sheppard | Census Bureau
March 18, 2013

2020 Census: Local Administrative Records and Their Use in the Challenge Program and Decennial

February 21, 2013
Highlights | Full Report

And Now for Something a Little Different. . .
Bob Groves | Director’s Blog: Census Bureau
June 27, 2012

Toward a Vision: Official Statistics and Big Data
C. Capps and T. Wright | AmStatNews
August 1, 2013
This piece even references Herman Hollerith:

The Census has a long history of innovation. Herman Hollerith invented the punch card for the 1890 Census; the first civilian computer was used for the 1950 Census. The first official sample survey was used by the Census Bureau to measure unemployment in 1937. Some of the basic technology for GIS was developed in the Dual Independent Map Encoding/Graphic Base Files efforts for the 1970 Census and TIGER for the 1990 Census.

Each of these innovations was done to reduce escalating cost and to preserve official statistical integrity. For these same reasons, the Census Bureau will continue to explore the possibility of using the explosion of Big Data to reduce cost, reduce reporting burden, and increase the effectiveness of national statistical estimation.

These benefits will accrue only if the Census Bureau can continue to preserve individual and corporate confidentiality, working to earn and preserve the public’s trust.

Cautionary Tale about Big Data Sampling

Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose
F. Morstatter, J. Pfeffer, H. Liu, K. Carley
June 2013
[Abstract] [Paper]

These authors compare metrics based on the data one gets from Twitter’s free API with the same metrics computed on the full universe (Firehose) and on samples drawn from the Firehose. As an added bonus, there is an excellent supply of references in this emerging field of big data/real-time data.
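The general idea of checking a sample against the full universe can be sketched on synthetic data. This is not the paper's method or data: the hashtags, the popularity weights, the 1% simple random sample, and the top-10 overlap measure are all assumptions chosen for illustration, and the real Streaming API need not behave like a simple random sample.

```python
import random
from collections import Counter


def top_k(items, k):
    """Return the set of the k most frequent items."""
    return {item for item, _ in Counter(items).most_common(k)}


def top_k_overlap(full, sample, k=10):
    """Share of the full data's top-k items that also appear in the sample's top-k."""
    return len(top_k(full, k) & top_k(sample, k)) / k


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic "firehose": 200 hypothetical hashtags with long-tailed popularity.
tags = [f"tag{i}" for i in range(200)]
weights = [1 / (i + 1) for i in range(200)]
firehose = rng.choices(tags, weights=weights, k=100_000)

# A simple random 1% sample of the synthetic firehose (an assumption, see above).
sample = rng.sample(firehose, k=1_000)

overlap = top_k_overlap(firehose, sample, k=10)
print(f"Top-10 overlap between full data and 1% sample: {overlap:.0%}")
```

The same comparison could be repeated with a deliberately biased sample to see how quickly the top-10 list drifts away from the full data.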

Imagining a Census Survey Without a Mandate

This is an update to a May 17th post on challenges to the American Community Survey’s mandatory response status via a House bill [H.R. 931] introduced by Ted Poe (R-TX):

Imagining a Census Survey Without a Mandate
Carl Bialik | Wall Street Journal (Blog post)
June 5, 2013
This piece mentions two former ISR researchers: Leslie Kish’s role in the move away from a decennial census to the ACS and Bob Groves’ comments on the currency of the ACS data. However, it mostly focuses on the statistical issues that a voluntary ACS would introduce.

Census Gets Questions on Mandatory Queries
Carl Bialik | Wall Street Journal
March 30, 2012
Old article, but the issues are the same.

The Census’s 21st-Century Challenges
Carl Bialik | Wall Street Journal (Blog)
July 30, 2010
This piece talks about Canada’s foray into a voluntary census, which we’ve also covered. A good source for quotes about response bias.

Lessons from North of the Border

Why a Voluntary ACS Could Wipe Some States off of the Map
Terri Ann Lowenthal | The Census Project Blog
May 17, 2013

This is a great recap of the disaster Canada has on its hands with its voluntary National Household Survey. It is relevant for the US because Congressional Republicans want to allow people to ‘just say no’ to all or part of the American Community Survey. She also reminds readers of the history of the marriage question in the US Census, including the possible deletion of the “times married” question.

The PSC-Info blog has several links to recent ACS/Census funding news:

ACS to drop “Number of Times Married” question

“it’s an Alice in Wonderland moment” or “GOP Census Bill would Eliminate America’s Economic Indicators”

The Census Reform Act of 2013

The ACS Faces More Battles

SENATE: The Census Bureau has already written the reports; read them.

Nerd Alert: Dictionary of Numbers

For those of you who try to incorporate quantitative reasoning in your teaching, here’s a nice resource:

Dictionary of numbers: putting numbers in human terms
This is a Google Chrome extension that tries to make sense of numbers you encounter on the web by giving you a description of that number in human terms. Because “8 million people” means nothing, but “population of New York City” means everything.

And, here’s a blog post about it from the nerd-friendly xkcd site – a webcomic of romance, sarcasm, math, and language:

Dictionary of Numbers
May 15, 2013
Opening paragraph:

I don’t like large numbers without context. Phrases like “they called for a $21 billion budget cut” or “the probe will travel 60 billion miles” or “a 150,000-ton ship ran aground” don’t mean very much to me on their own. Is that a large ship? Does 60 billion miles take you outside the Solar System? How much is $21 billion compared to the overall budget?

Measuring Marriage & Divorce among Same-Sex Couples

For Gays, Breaking Up Is Hard to Do – or Measure
Carl Bialik | Wall Street Journal [print column]
May 3, 2013
This article touches on the personal and on the aggregate. The personal stories are of couples unable to get a divorce because they live in states that do not recognize same-sex marriages. On the aggregate side, states have not modified divorce forms to collect data on same-sex couples.

Same-Sex Divorce Stats Lag
Carl Bialik | Wall Street Journal [blog]
May 3, 2013
This version provides links to sources of marriage and divorce statistics. European countries do collect data on these events, but so far do not have enough dissolutions to calculate robust rates. An NIH-funded study is following a cohort of couples who were married in Vermont.

Decennial Census Data on Same Sex Couples
Census Bureau
May 2013
The Census Bureau has a website with links to technical papers, data, etc. on same-sex couples from 1990+ as measured by this agency.

Census Bureau: Flaws in Same-Sex Couple Data
D’Vera Cohn | Pew: Social and Demographic Trends
September 27, 2011
The Census Bureau announced today that more than one-in-four same-sex couples counted in the 2010 Census was likely an opposite-sex couple, and identified a confusing questionnaire as a likely culprit. The bureau released a new set of “preferred” same-sex counts, including its first tally ever of same-sex spouses counted in the census.

How Accurate Are Counts of Same-Sex Couples?
D’Vera Cohn | Pew: Social and Demographic Trends
August 25, 2011
This is a nice brief on the obstacles to accuracy in measuring same-sex couples in census data. And, it illustrates the efforts that the Census Bureau makes in measuring concepts in an era of rapid social change.

Canada’s “NSF” Problem

House Republicans are trying to implement serious changes to the evaluation and funding of NSF science [here and here].

Canada is perhaps a bit further down this road. Here’s the latest on the decision to fund research that has industry applications rather than basic science.

When science goes silent
Jonathan Gatehouse | Maclean’s
May 3, 2013
This article touches on the shift in funding from basic science to applied science, but it is more in line with an earlier post on the muzzling of environmental scientists.

National Research Council move shifts feds’ science role
Canadian Press | CBC News
May 7, 2013
‘Job-neutral’ restructuring to make agency streamlined, efficient and functional, president says

The Harper government is telling the National Research Council to focus more on practical, commercial science and less on fundamental science that may not have obvious business applications.

The government says the council traditionally was a supporter of business, but has wandered from that in recent years — and will now get back to working on practical applications for industries.

Some folks disagree with this shift:

In a statement, the executive director of the Canadian Association of University Teachers said the government is “killing the goose that laid the golden egg.”

“By transforming the NRC into a “business-driven, industry-relevant” organization, you are denying its ability to support basic research,” said Jim Turk.

“At the same time, you are cutting support to basic research in the universities.”

And is this part of the Tory ‘war on science’? [more coverage on this]

NDP science critic Kennedy Stewart called the shift in direction for the NRC “short-sighted” and said it could actually hurt economic growth in the long run, because it scales back the kind of fundamental research that can lead to scientific breakthroughs.

Research Council to focus on commercially viable projects, rather than science for science’s sake
Jessica Hume | Sun News
May 7, 2013
Two quotes say it all:

The government of Canada believes there is a place for curiosity-driven, fundamental scientific research, but the National Research Council is not that place.

“Scientific discovery is not valuable unless it has commercial value,” John McDougall, president of the NRC, said in announcing the shift in the NRC’s research focus away from discovery science solely to research the government deems “commercially viable”.

Nature: Replication, replication, replication

This Nature collection compiles replication articles from several issues of the journal. They highlight the importance of replication and open data for science. However, some of the examples might apply more to medicine or biology than to population science. Lest readers think that this issue doesn’t apply to demographers, here’s a tweet from Justin Wolfers advertising a piece in Bloomberg Business on the importance of replication for the field of economics. His motivation is the recent dust-up due to an error in a famous paper by Reinhart and Rogoff [See PSC-Info], but the discussion is much broader than that example.


[Link to Stevenson/Wolfers Replication article]

No research paper can ever be considered to be the final word, and the replication and corroboration of research results is key to the scientific process. In studying complex entities, especially animals and human beings, the complexity of the system and of the techniques can all too easily lead to results that seem robust in the lab, and valid to editors and referees of journals, but which do not stand the test of further studies. Nature has published a series of articles about the worrying extent to which research results have been found wanting in this respect. The editors of Nature and the Nature life sciences research journals have also taken substantive steps to put our own houses in order, in improving the transparency and robustness of what we publish. Journals, research laboratories and institutions and funders all have an interest in tackling issues of irreproducibility. We hope that the articles contained in this collection will help.

Reducing our irreproducibility
(April 25, 2013)

Further confirmation needed
A new mechanism for independently replicating research findings is one of several changes required to improve the quality of the biomedical literature.
Nature Biotechnology 30, 806
(September 10, 2012)

Error Prone
Biologists must realize the pitfalls of work on massive amounts of data.
Nature 487, 406
(July 26, 2012)

Must Try Harder
Too many sloppy mistakes are creeping into scientific papers. Lab heads must look more rigorously at the data — and at themselves.
Nature 483, 509
(March 29, 2012)


Independent labs to verify high-profile papers
Monya Baker
Nature News
(August 14, 2012)

Power Failure: Why small sample size undermines the reliability of neuroscience
Katherine S. Button, John P. A. Ioannidis et al.
Nature Reviews Neuroscience 14, 365-376
(April 15, 2013)

Replication studies: Bad copy
Ed Yong
Nature 485, 298-300
(May 17, 2012)

Reliability of ‘new drug target’ claims called into question
Asher Mullard
Nature Reviews Drug Discovery 10, 643-644
(September 2011)


If a job is worth doing, it is worth doing twice
Jonathan F. Russell
Nature 496, 7
(April 4, 2013)

Methods: Face up to false positives
Daniel MacArthur
Nature 487, 427-429
(July 26, 2012)

Drug development: Raise standards for preclinical cancer research
C. Glenn Begley & Lee M. Ellis
Nature 483, 531-533
(March 29, 2012)

Believe it or not: how much can we rely on published data on potential drug targets?
Florian Prinz, Thomas Schlange & Khusru Asadullah
Nature Reviews Drug Discovery 10, 712
(September 2011)

Tackling the widespread and critical impact of batch effects in high-throughput data
Jeffrey T. Leek, Robert B. Scharpf et al.
Nature Reviews Genetics 11, 733-739
(October 2010)


Research methods: know when your numbers are significant
David L. Vaux
Nature 492, 180-181
(December 13, 2012)

A call for transparent reporting to optimize the predictive value of preclinical research
Story C. Landis, Susan G. Amara et al.
Nature 490, 187-191
(October 11, 2012)

Next-generation sequencing data interpretation: enhancing reproducibility and accessibility
Anton Nekrutenko & James Taylor
Nature Reviews Genetics 13, 667-672
(September 2012)

The case for open computer programs
Darrel C. Ince, Leslie Hatton & John Graham-Cumming
Nature 482, 485-488
(February 23, 2012)

Reuse of public genome-wide gene expression data
Johan Rung & Alvis Brazma
Nature Reviews Genetics 14, 89-99
(February 2013)

Research from The Data Privacy Lab

Respondent re-identification is a big worry for data projects that want to share their data. Some recent cases illustrate that it can occur, and is occurring, with genetic data. But sometimes the case is overstated. Here is an illustration with a case that hit the press with great fanfare.

First, the fun stuff. See if you are unique. The following link has you type in your gender, exact date of birth, and your 5-digit zip code. The latter two do not meet HIPAA guidelines:

Next are several links. The first group is the press coverage of the re-identification (Forbes, MIT Technology Review, and The Scientist), followed by the researchers’ version of the story (Sweeney). Next is a rebuttal, which reminds readers that administrative matches, e.g., against voter registration records, are not as ubiquitous as some claim. There is also a link to an article by Barth-Jones in which he discusses the famous case of the re-identification of Governor William Weld, which led to many of the HIPAA rules.

Harvard Professor Re-Identifies Anonymous Volunteers In DNA Study
Adam Tanner | Forbes
April 24, 2013

Participants in Personal Genome Project Identified by Privacy Experts
MIT Technology Review
May 1, 2013

“Anonymous” Genomes Identified
Dan Cossins | The Scientist
May 3, 2013

Identifying Participants in the Personal Genome Project by Name
Latanya Sweeney, Akua Abu, Julia Winn | Data Privacy Lab

Reporting Fail: The Reidentification of Personal Genome Project Participants
Jane Yakowitz Bambauer | Info/Law [Harvard Law Blogs]
May 1, 2013

The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now
Daniel C. Barth-Jones | Social Science Research Network (SSRN)
June 4, 2012
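For readers who want to try the "are you unique?" idea from this post on their own data, here is a minimal sketch. The records are made up for illustration (not real people), the field names are hypothetical, and coarsening date of birth to birth year is just one example of reducing detail, not a statement of what any regulation requires.

```python
from collections import Counter


def share_unique(records, key):
    """Fraction of records whose key(record) value occurs exactly once."""
    combos = [key(r) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(combos)


# Made-up records for illustration only; field names are hypothetical.
records = [
    {"gender": "F", "dob": "1970-03-14", "zip": "48104"},
    {"gender": "F", "dob": "1970-09-01", "zip": "48104"},
    {"gender": "M", "dob": "1982-11-02", "zip": "48104"},
    {"gender": "M", "dob": "1982-11-02", "zip": "48104"},
    {"gender": "F", "dob": "1991-07-30", "zip": "48197"},
]

# Uniqueness on gender + exact date of birth + 5-digit ZIP.
exact = share_unique(records, lambda r: (r["gender"], r["dob"], r["zip"]))

# Same check after coarsening date of birth to birth year (first four characters).
by_year = share_unique(records, lambda r: (r["gender"], r["dob"][:4], r["zip"]))

print(f"Unique on exact DOB: {exact:.0%}; unique on birth year: {by_year:.0%}")
```

In this toy data, fewer records remain unique once the date of birth is coarsened, which is the intuition behind releasing less detailed identifiers.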