I. Purpose

The purpose of this program is to rectangularize hierarchical data. The program allows one to select variables and to restrict the file to cases meeting user-specified criteria.
The following items are covered in this document:

| Comments: | Blank lines and lines beginning with an asterisk (*) in column one are treated as comments. After the keyword parameters are read, only lines with upper- or lowercase level designators (H/h and P/p for the U.S. Public Use Microdata Samples, PUMS) are accepted as input. All other lines are ignored and may be used for comments and file documentation. |
|---|---|
| File Name: | Full paths and filenames may be specified with an overall length up to 1024 characters per file. |
| Items: | Level designations such as Housing and Person may be specified either as selection criteria or as items to be extracted for output. Items are specified with the appropriate level designator (as defined for the data set---H or P in the case of the PUMS) in the first field, followed by an item name, a field location specification, and a field length specification. The name is arbitrary, but use of codebook names is recommended for purposes of documentation. Item names may be up to 16 characters in length. No spacers (blanks, commas, tabs, etc.) are allowed in the item name. All entries are free format and need to be separated by a spacer. Blanks, commas, and tabs are acceptable spacers. Comments may be appended to a line by following the last entry on the line with at least one space and using an asterisk (*) as the first character of the comment. If an item is to be used as a criterion for case selection, two fields are appended: the low value and high value, inclusive, of a range of values to be selected. These values must be specified exactly as they appear in the codebook, with the number of characters exactly equal to the field size for the item. If only a low value is given, the high value is assumed to be the same. The order in which items for selection and extraction are specified is not critical, nor is the order of household and person items. The program orders them within the hierarchy of the levels: in the case of the PUMS, housing selection, household extraction, person selection, and person extraction items. |
| AND/OR: | Selection criteria are ANDed over all specified items unless the same item name is given for consecutive selection items. An implicit OR is done using the selection criteria specified for a sequence of two or more entries with the same name. Note that the actual item locations do not need to be the same if an OR is desired over two or more items---only the name must be the same. Thus, ORs may be done across items within a record if a common name is used for the items. ORs can also be done across records. The Boolean capabilities are primitive, but allow most basic selections to be made without difficulty. |
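As a rough illustration, the selection rules above (OR within a shared item name, AND across distinct names, character-by-character comparison) can be modeled in a few lines of Python. This is a sketch only; `selected` is a hypothetical helper, not part of Extract:

```python
# Sketch of Extract's selection logic: criteria sharing an item name
# are ORed together; distinct names are ANDed. Comparisons are made
# on characters, exactly as Extract performs them.
def selected(record, criteria):
    """record: one fixed-width data line.
    criteria: (name, start, length, low, high) tuples; start is 1-based."""
    by_name = {}
    for name, start, length, low, high in criteria:
        value = record[start - 1:start - 1 + length]
        by_name.setdefault(name, []).append(low <= value <= high)
    # OR within each name, AND across names
    return all(any(results) for results in by_name.values())

# Select Sex == "0" AND Age between "20" and "64".
criteria = [("Sex", 7, 1, "0", "0"),
            ("Age", 8, 2, "20", "64")]
print(selected("xxxxxx025", criteria))   # Sex "0", Age "25" -> True
```

Giving the same name to entries at different locations makes the OR span several variables, which is exactly the pseudo-variable technique shown in section VI.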
Study #      Description

Census Enumerations
81           Census of Population and Housing (PUS), 1900: 1/760 sample
140          Census of Population and Housing (PUS), 1910: 1/250 sample
454          Census of Population and Housing (PUS), 1910: Oversample of Black-headed households
58           Census of Population and Housing (PUS), 1940: 1% sample
62           Census of Population and Housing (PUS), 1950: 1% sample
83           Census of Population and Housing (PUS), 1960: .1% sample
486          Census of Population and Housing (PUS), 1960: 1% sample
409          Census of Population and Housing (PUS), 1970: .1% sample - 5% questionnaire (state file)
410          Census of Population and Housing (PUS), 1970: .1% sample - 15% questionnaire (state file)
446          Census of Population and Housing (PUS), 1970: 1% sample - 5% questionnaire (state file)
447          Census of Population and Housing (PUS), 1970: 1% sample - 15% questionnaire (state file)
632          Census of Population and Housing (PUS), 1970: 1% sample - 5% questionnaire (county group file)
631          Census of Population and Housing (PUS), 1970: 1% sample - 15% questionnaire (county group file)
633          Census of Population and Housing (PUS), 1970: 1% sample - 15% questionnaire (neighborhood file)
800          Census of Population and Housing (PUS), 1970: 1% sample - 5% questionnaire (neighborhood file)
37           Census of Population and Housing (PUMS), 1980: 5% (A sample)
336          Census of Population and Housing (PUMS), 1980: 1% (B sample)
390          Census of Population and Housing (PUMS), 1980: .1% (A sample)
440          Census of Population and Housing (PUMS), 1990: 5% (A sample)
441          Census of Population and Housing (PUMS), 1990: 1% (B sample)
442          Census of Population and Housing (PUMS), 1990: .1% (A sample)
622          Census of Population and Housing (PUMS), 1990: 3% (O sample) [elderly]
670          Census of Population and Housing (PUMS), 1990: 8% (A+O sample) [elderly]
252          Census of Population and Housing (PUMS): Puerto Rico, 5% (A sample)
43           Census of Population and Housing (PUMS), 1990: Puerto Rico 5% (A sample)
45           Census of Population and Housing (PUMS), 1990: Puerto Rico 1% (B sample)

306          Current Population Survey (CPS), March 1968
307          Current Population Survey (CPS), March 1969
135          Current Population Survey (CPS), March 1970
353          Current Population Survey (CPS), March 1971
308          Current Population Survey (CPS), March 1972
309          Current Population Survey (CPS), March 1973
128          Current Population Survey (CPS), March 1974
129          Current Population Survey (CPS), March 1975
130          Current Population Survey (CPS), March 1976
131          Current Population Survey (CPS), March 1977
132          Current Population Survey (CPS), March 1978
133          Current Population Survey (CPS), March 1979
115          Current Population Survey (CPS), March 1980
234          Current Population Survey (CPS), March 1981
116          Current Population Survey (CPS), March 1982
117          Current Population Survey (CPS), March 1983
223          Current Population Survey (CPS), March 1984
103          Current Population Survey (CPS), March 1985
235          Current Population Survey (CPS), March 1986
159          Current Population Survey (CPS), March 1987
295          Current Population Survey (CPS), March 1988
330          Current Population Survey (CPS), March 1989
342          Current Population Survey (CPS), March 1990
401          Current Population Survey (CPS), March 1991
443          Current Population Survey (CPS), March 1992
483          Current Population Survey (CPS), March 1993
577          Current Population Survey (CPS), March 1994
700          Current Population Survey (CPS), March 1995
742          Current Population Survey (CPS), March 1996
802          Current Population Survey (CPS), March 1997
813          Current Population Survey (CPS), March 1998
863          Current Population Survey (CPS), March 1999
968          Current Population Survey (CPS), March 2000

Integrated Public Use Microdata Samples (IPUMS), United States: 1850-1990
685-ip18501  IPUMS 1850 Sample
685-ip18601  IPUMS 1860 Sample
685-ip18701  IPUMS 1870 Sample
685-ip18801  IPUMS 1880 Sample
685-ip19001  IPUMS 1900 Sample
685-ip19101  IPUMS 1910 Unweighted Sample
685-ip19103  IPUMS 1910 Hispanic oversample
685-ip19201  IPUMS 1920 Sample
685-ip19401  IPUMS 1940 Sample
685-ip19501  IPUMS 1950 Sample
685-ip19601  IPUMS 1960 Sample
685-ip19701  IPUMS 1970 State sample (5% questionnaire)
685-ip19702  IPUMS 1970 State sample (15% questionnaire)
685-ip19703  IPUMS 1970 County group sample (5% questionnaire)
685-ip19704  IPUMS 1970 County group sample (15% questionnaire)
685-ip19705  IPUMS 1970 Neighborhood sample (5% questionnaire)
685-ip19706  IPUMS 1970 Neighborhood sample (15% questionnaire)
685-ip19802  IPUMS 1980 Metro B Sample
685-ip19803  IPUMS 1980 Urban/rural C Sample
685-ip19902  IPUMS 1990 1% Sample
685-ip19903  IPUMS 1990 3% Elderly Sample
685-ip19904  IPUMS 1990 Flat Unweighted State Sample

Information on the location of the data, the record length, and where and how to identify the levels of hierarchy is stored in a file that the program accesses. To use Extract with data sets other than those listed above, see the Data Archive staff.
IV. Command File (data and output definitions)
The user can provide all the necessary information on the command line. However, for user-documentation purposes, the preferred method is to include the required parameters in the set-up file. The keyword parameters are:

| Data: | This parameter indicates the data from which the extract is drawn. One may use the Data Archive study number or an abbreviated description. The former is more reliable, as the user-supplied abbreviated description must exactly match the abbreviated description in the "data description file" located in /usr/local/lib/utilities/extract.datasets.lst Data: 440 or Data: PUMS 1990 5% |
|---|---|
| For: | This parameter defines the subfiles one is using: states, in the case of the recent census files and the IPUMS 1980 and 1990 5% state samples. If this parameter is excluded, the default is the entire sample. One can only restrict the extract to subfiles with the 1% and 5% 1980 PUMS files; the 1%, 5%, 3%, and 8% 1990 PUMS files; and the 1980 and 1990 5% IPUMS files. States are identified by their FIPS code (see any census documentation for the relevant state FIPS codes) or by their postal abbreviation. For: 26, 39, 11 selects the states of Michigan, Ohio, and the District of Columbia; For: mi, oh, dc gives the same results, as does For: michigan ohio districtofcolumbia |
| Output: | This parameter designates the location of the output file. Output: /usr/shared/male90.dat The following abbreviations allow one to get state-specific files: Output: /usr/shared/st*.dat Output: /usr/shared/st!.dat Output: /usr/shared/st#.dat The *, !, or # is replaced by the state designation: mi26, mi, or 26, respectively. |
| Log: or Codebook: | Designates the location of the codebook and log files. The codebook describes the column locations of the variables, while the log file provides a record of the total number of records input and output as well as a breakdown of the number of records according to level of hierarchy. In the previous version of Extract the codebook and log files were written separately. One may designate the location of this file with either "Log:" or "Codebook:". Log: /usr/shared/male90.log Codebook: /usr/shared/male90.log |
| Test: | Controls the number of records read. If the data are in subfiles (state, year, province, etc.), the number of records read is per subfile. Test: 1000 |
| Delimiter: | Defines the spacing mechanism between the numbers in the output. The default spacing is "space." Other commonly used spacers are "tab" and "comma." If one wants to eliminate spacing between variables (to conserve space), the delimiter should be "none." Delimiter: Comma |
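The state-specific Output: placeholders described above amount to a simple substitution. The following Python fragment models that behavior; `state_file` is a hypothetical name used here for illustration, not an Extract command:

```python
# Expand Extract's output-file placeholders for one state:
#   *  ->  postal abbreviation + FIPS code (e.g. mi26)
#   !  ->  postal abbreviation only      (e.g. mi)
#   #  ->  FIPS code only                (e.g. 26)
def state_file(template, fips, postal):
    return (template.replace("*", postal + fips)
                    .replace("!", postal)
                    .replace("#", fips))

print(state_file("/usr/shared/st*.dat", "26", "mi"))   # /usr/shared/stmi26.dat
```

One output file is produced per state listed in the For: parameter, each with its placeholder expanded this way.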
V. Command File (Variable and Case Selection)
The command file has several parts. The previous section described the data and output parameters. This section describes the variable and case selection parameters. The program requires four fields (type of record, name of variable, location of the variable, length of the variable) to define variables. The fields have to be separated by at least one spacer, which can be a blank (most typical), a comma, or a tab.

Examples

The following examples show set-up files and the command lines used to invoke Extract. Two set-up files and command lines are provided for each example. The first version, (a), has the keyword parameters within the set-up file, while the second version, (b), has the keyword parameters on the command line.
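The four required fields, the free-format spacer rule, and the optional low/high range can be sketched as a small parser. This is illustrative only; `parse_line` is a hypothetical helper, not part of Extract:

```python
# Parse one free-format Extract variable line, e.g. "P Age 8 2 20 64".
def parse_line(line):
    """Split on blanks, commas, or tabs; drop a trailing '* ...' comment;
    if only a low value is given, the high value defaults to it."""
    fields = line.replace(",", " ").replace("\t", " ").split()
    for i, f in enumerate(fields):
        if f.startswith("*"):          # appended comment: discard the rest
            fields = fields[:i]
            break
    level, name = fields[0], fields[1]
    start, length = int(fields[2]), int(fields[3])
    low = fields[4] if len(fields) > 4 else None
    high = fields[5] if len(fields) > 5 else low
    return level, name, start, length, low, high

print(parse_line("P Age 8 2 20 64"))   # ('P', 'Age', 8, 2, '20', '64')
```

Note that the low/high values stay as strings: Extract compares characters, not numbers, so they must keep their codebook widths.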
Example 1:
Select a sample of males between 20 and 64 who are
employed in selected manufacturing industries. Use the 1/1000 file from
the 1980 PUMS.
Example 1a
Command line: extract male80.a
Set-up file: male80.a
Data: 390
Output: /usr/shared/male80-a.dat
Log: /usr/shared/male80-a.log
H State 4 2
H FamIncom 112 5
P Sex 7 1 0 0
P Age 8 2
P Age 8 2 20 64
P Marital 11 1
P Race 12 2
P Grade 40 2
P FinGrade 42 2
P Labor 81 1
P Industry 87 3
P Industry 87 3 132 150
P Industry 87 3 180 192
P Industry 87 3 351 370
Example 1b
Command line: extract male80.b Data: 390 Output: /usr/shared/male80-b.dat Log: /usr/shared/male80-b.log
Set-up file: male80.b
H State 4 2
H FamIncom 112 5
P Sex 7 1 0 0
P Age 8 2
P Age 8 2 20 64
P Marital 11 1
P Race 12 2
P Grade 40 2
P FinGrade 42 2
P Labor 81 1
P Industry 87 3
P Industry 87 3 132 150
P Industry 87 3 180 192
P Industry 87 3 351 370
Example 2:
Create a test file that selects the first
1000 records from each state using the 1990 5% PUMS.
Example 2a
Command line: extract test.pgm.a
Set-up file: test.pgm.a
Data: 440
Output: test.dat
Log: test.log
Test: 1000
H State 11 2
P Sex 11 1
P Race 12 3
P Age 15 2

Example 2b
Command line: extract test.pgm.b Data: 440 Output: test.dat Log: test.log Test: 1000
Set-up file: test.pgm.b
H State 11 2 P Sex 11 1 P Race 12 3 P Age 15 2
Example 3:
Using the 5% file from the 1990 PUMS, restrict the sample to California and New York. The delimiter is a comma. Separate files will be written for each state.
Example 3a
Command line: extract state.pgm.a
Set-up file: state.pgm.a
Data: 440
For: ca,ny
Output: st*.dat
Log: all.cdb
Delimiter: comma
H State 11 2
P Sex 11 1
P Race 12 3
P Age 15 2
Example 3b
Command line: extract state.pgm.b Data: 440 For: ca,ny Output: st*.dat Log: all.cdb Delimiter: comma
Set-up file: state.pgm.b
H State 11 2
P Sex 11 1
P Race 12 3
P Age 15 2
VI. Command File - "AND" and "OR"
Extract allows one to make "AND" selections and "OR" selections. Three examples are provided, covering (a) "AND" selections across several variables, (b) "OR" selections within one variable, and (c) "OR" selections across several variables. All examples are based on the 5% 1990 Public Use sample.

(a) "AND" selections across several variables
To make "AND" selections, you make the usual Extract entries: level of hierarchy, variable name, starting position, and length for all the variables you need to include in the extract.

H State 11 2
H PUMA 13 5
P Sex 11 1
P Race 12 3
P Age 15 2
P Pwgt1 18 4
P REarn 127 6

Next, add the selection criteria. For illustrative purposes, the sample will be restricted to black males between 25 and 54 years of age. If you don't want or need the selection variable in the data set, add the low and high values to the original variable entry. For instance, to the "Sex" entry, add "0" and "0" as fields 5 and 6. Likewise, for the "Race" entry, add "002" and "002" as fields 5 and 6. (In this example, you don't need the variables sex and race in the data set, as they will be constants---all cases will be "0" on sex and all cases will be "002" on race.)
P Sex 11 1 0 0
P Race 12 3 002 002

If you want the "selection" variable to be in the data, enter a new line under the original variable entry with the additional selection criteria as fields 5 and 6.
P Age 15 2
P Age 15 2 25 54

To select state, PUMA, age, earnings, and person weight for black males, 25 to 54:
H State 11 2
H PUMA 13 5
P Sex 11 1 0 0
P Race 12 3 002 002
P Age 15 2
P Age 15 2 25 54
P REarn 127 6
(b) "OR" selection within one variable
With this type of "OR" selection, you select cases if the value on a variable is equal to one of several non-contiguous values. For illustrative purposes, I will examine labor force status for a 1/500 sample of older white women. To draw a 1/500 sample, I need to select 4 of the 100 subsample values [5/100 * 4/100 = 1/500]. For statistical purposes, these four subsample values should not be contiguous (page 4-6 in the technical documentation). In this illustration, a case will be selected if the value of the household subsample variable is 02, 32, 62, or 92. To select the variables of interest for a 1/500 sample of white women:

H Subsampl 27 2
H Subsampl 27 2 02 02
H Subsampl 27 2 32 32
H Subsampl 27 2 62 62
H Subsampl 27 2 92 92
P Sex 11 1 1 1
P Race 12 3 001 001
P PWgt1 18 4
P RLabor 91 1
P Hours 93 2
P Yrwrk 114 1

You are probably not going to be interested in retaining the subsample variable. Thus, you could delete the first "Subsampl" entry, which has the effect of selecting on this variable but not retaining it in the data:
H State 11 2
H Subsampl 27 2 02 02
H Subsampl 27 2 32 32
H Subsampl 27 2 62 62
H Subsampl 27 2 92 92
P Sex 11 1 1 1
P Race 12 3 001 001
P PWgt1 18 4
P RLabor 91 1
P Hours 93 2
P Yrwrk 114 1

(c) "OR" selection across variables

In this type of selection you want to select a case if it has a value or a set of values on one variable, or if it has a value or a set of values on another variable. For example, you might be interested in drawing a Mexican sample. A respondent will be considered Mexican if any one of the following is true:
H State 11 2
P Sex 11 1
P Age 15 2
P PWgt1 18 4
P Hisp 38 3
P POB 44 3
P Anc1 53 3
P Anc2 56 3
P Lang2 68 3
P REarn 127 6

Next, you make a pseudo-variable that has the values you want to select on. The column locations and widths of the pseudo-variable must match the column locations and widths of the "real" variables of interest. As you want to select on Hisp, POB, Anc1, Anc2, and Lang2, you need to make a pseudo-variable, Mex, with the following entries:
P Mex 38 3 001 001
P Mex 38 3 210 220
P Mex 44 3 315 315
P Mex 53 3 210 218
P Mex 56 3 210 218
P Mex 68 3 625 625

Note that the column and length fields correspond to the values found for Hisp, POB, Anc1, Anc2, and Lang2. The final command file should look as follows:
H State 11 2
P Sex 11 1
P Age 15 2
P PWgt1 18 4
P Hisp 38 3
P POB 44 3
P Anc1 53 3
P Anc2 56 3
P Lang2 68 3
P Mex 38 3 001 001
P Mex 38 3 210 220
P Mex 44 3 315 315
P Mex 53 3 210 218
P Mex 56 3 210 218
P Mex 68 3 625 625
P REarn 127 6

Extract was written to facilitate the use of hierarchical data by allowing for easy and rapid rectangularization of raw data files. While Extract is easy to use, one can get incorrect results. The purpose of this note is to give some suggestions on how to use Extract efficiently, as well as descriptions of, and solutions to, some of the more common user and/or system errors. Please feel free to continue asking for help if you have problems with Extract.
A. Use Extract Efficiently
(1) Before doing an extract, run a test job and look at the results.
(2) Determine how large your file will be before running the actual extract.
This will allow you to determine whether there is enough space, whether you should perhaps sample rather than take all cases, etc. For example, if I have a test extract called women90.dat, created from the 1990 PUMS 1/1000 file, I can make a very close estimate of how large the actual extract will be using either the 5% or 1% files. To determine the file size:
ls -l women90.dat
-rw-r--r-- 1 lisan sys 6399764 Apr 13 09:30 women90.dat
If you are using the 5% data for the actual job, multiply the size of women90.dat (6399764) by 50 to determine how large the actual data file would be (319988200 or ~320mb); if you are using the 1% data for the actual job, multiply the size of women90.dat by 10 to determine how large the actual data file would be (~64mb).
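The scaling above is plain arithmetic; for instance, in Python:

```python
# Scale a test extract made from the 1/1000 file up to the full samples.
test_bytes = 6399764               # size of women90.dat, from ls -l

five_pct_bytes = test_bytes * 50   # 1/1000 -> 5% is a factor of 50
one_pct_bytes = test_bytes * 10    # 1/1000 -> 1% is a factor of 10

print(five_pct_bytes)   # 319988200 (~320mb)
print(one_pct_bytes)    # 63997640 (~64mb)
```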
(3) Don't overload the system
Extract must be queued if run on the large workstations (malthus, graunt, ariel, or meca). Please see the Computing Guide for instructions on queueing. One does not have to queue jobs when using the smaller workstations (east, west, south, lotka, pareto). However, these workstations should not be used if the input file has more than 1,000,000 records. Thus, the smaller workstations can handle CPS files, 1/1000 files, or selected states from the 1% or 5% files. Users must be logged on during Extract unless the job was submitted to a queue. This is another reason not to use the smaller (queueless) workstations for large jobs (e.g., the complete 1990 5% PUMS), as this would require one to stay logged on for several hours.

B. Common Mistakes Using Extract

1. Extract results in no cases

The most common cause of this is that the column width of the selection values does not match the column width of the selection variable.
If the width of a variable is three, the width of the selection values must be three. Extract makes character comparisons, not arithmetic comparisons. Thus,
54 is not the same as 054
The following is an example from the 1970 PUMS. Many users have made
this error.
INCORRECT
P Age 9 3
P Age 9 3 25 54
CORRECT
P Age 9 3
P Age 9 3 025 054
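The following Python fragment illustrates why the widths must match; it is an illustration of the character-comparison rule, not of Extract itself:

```python
# Character comparison: field widths matter.
print("54" == "054")             # False - different strings
print(int("54") == int("054"))   # True  - numerically equal
# Correctly padded range check (the 1970 PUMS Age example):
print("025" <= "030" <= "054")   # True
# Unpadded range values never match a zero-padded 3-character field:
print("25" <= "025" <= "54")     # False
```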
A less obvious error can occur if some of the selection values are
in error, but others are correct. In this situation, you will get cases,
just not the correct number. An example of this problem can be shown
with the "subsampl" variable in the 1990 PUMS. If you are making a 1/10
sample, you might do the following:
INCORRECT

H State 11 2
H Subsampl 27 2 2 2
H Subsampl 27 2 12 12
H Subsampl 27 2 22 22
H Subsampl 27 2 32 32
H Subsampl 27 2 42 42
H Subsampl 27 2 52 52
H Subsampl 27 2 62 62
H Subsampl 27 2 72 72
H Subsampl 27 2 82 82
H Subsampl 27 2 92 92
P Sex 11 1
P Race 12 3

The above results in a 9/100 sample, not a 1/10 sample, because the first comparison will never be true: 2 is not the same as 02. Thus, you need to do the following:
CORRECT

H State 11 2
H Subsampl 27 2 02 02
H Subsampl 27 2 12 12
H Subsampl 27 2 22 22
H Subsampl 27 2 32 32
H Subsampl 27 2 42 42
H Subsampl 27 2 52 52
H Subsampl 27 2 62 62
H Subsampl 27 2 72 72
H Subsampl 27 2 82 82
H Subsampl 27 2 92 92
P Sex 11 1
P Race 12 3
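Zero-padded subsample values like these can be generated mechanically, which avoids the padding mistake entirely (a Python sketch):

```python
# Every 10th subsample value starting at 2, zero-padded to the
# two-character width of the Subsampl field.
values = ["%02d" % n for n in range(2, 100, 10)]
print(values)   # ['02', '12', '22', '32', '42', '52', '62', '72', '82', '92']
```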
2. Not enough space to write files
/usr/pops/bin/xtract: test.dat: 0402-011 Cannot create the specified file.
/usr/pops/bin/xtract: test.log: 0402-011 Cannot create the specified file.

The user is trying to write to a location without enough room (probably the user home directory or IFS user space). Most Extract jobs should be run in the system-wide /usr/shared space or the workstation-specific shared space (graunt: /usr/gshared; ariel: /usr/ashared; meca: /usr/mecashared). Before running the job, make sure there is enough space available:
On local disk type: df .
On IFS space type: fs lq
Once the job is done and has been checked, compress the data and move the
data and documentation to your personal workspace.
3. Number of cases in the data file does not match the number of cases in the log file

The log file was added to the codebook file so that users could verify the input and output files. (Early on we had problems losing our IFS connection, so that not all records would be read in for a state. We corrected this problem by putting the data on a local drive, but retained the log file, which should still be used as a check to see if results look reasonable.)
The most likely reason that the number of cases in the data file is less than the number of cases in the log file is that you are trying to create a file that is over one gigabyte (1G) in size. There is a system limit of 1G per file. Extract will continue to write the data (but it will be written to memory rather than disk); when Extract is finished, these data will be lost. This problem can be solved by breaking the extract into two jobs. For instance, select males in one job and females in the second job. One might also consider sampling to reduce the size of the file; it is fairly difficult to manipulate extremely large files in common statistical packages.

4. Need to kill the extract

If the job has been queued, one needs to kill the queued job. To do so, first get the job number using the command checkq:
checkq

Job Num  User     Job Name  Submitted      Started        Priority
-------  -------  --------  -------------  -------------  --------
There are no jobs for the first available machine.
There are no jobs for the first available RS6000.
There is 1 job for graunt.
104582   lisan    job       7/16 13:20:51  7/16 13:20:51  Normal
There is 1 job for malthus.
10761    sallard  spss.txt  7/16 13:39:49  7/16 13:39:49  Normal
There are no jobs for ariel.
There are no jobs for meca.
To kill the job, use the killq command:
killq

Job Num  Host     Host List  Job Name  Start Time
-------  -------  ---------  --------  ----------
104582   graunt   graunt     job       7/16 13:39:49
Enter the job to kill. To use an argument to kill (e.g., kill -KILL), enter the argument followed by the job number (e.g., -KILL 908675).
Kill which job number? 104582

If the job was run outside of the queue (on a small workstation), one needs to know the job number. When the user first invokes Extract, a job number is given:
extract cps90.pr &
[1] 21897
However, if one didn't write the number down, it can be recovered with:

busy | grep username | grep extract | grep -v grep

For example:

busy | grep lisan | grep extract | grep -v grep
lisan 21897 0.0 0.0 268 408 pts/3 S 13:12:50 0:00 sh /usr/local/bin/extract cps90.pr

To kill this job, type:

kill -9 21897
[1] Killed extract cps90.pr