




Converting Data Formats

2. How do I convert my data into the correct format?


Today we will cover two basic techniques for loading data:
        a) When and how to use StatTransfer
        b) When and how to use program-specific definition/syntax files

First, you must know what kind of data you have: data with labels and definitions already included, or data with labels and definitions in a separate file or code book. If this is not obvious to you (i.e., you did not type the data into Excel yourself), look at the data source's code book or web page for help. If it is still unclear, ask a StatLab consultant to help you.

Side Note: Recognizing Delimited Data from Raw (Undelimited) Data

Raw data is a line or a number of lines of data strung together with no clear internal dividers. For example, an observation might appear as follows:

afghan130199040000000199920000000
Delimited data uses internal markers such as periods, spaces, or tabs to clearly separate values. Three versions of the above observation follow:
afghan.130.1990.400..1999.200..
afghan 130 1990 400 00000 1999 200 00000
afghan 130    1990    400    .   1999    200   .
As you can see, interpreting the raw data without a code book can be difficult because numbers can easily merge together. In the raw data it is not clear whether 1990 refers to the year 1990, or whether perhaps 19 is a separate variable and 99 is the year. This is why raw data generally requires definition files, either provided with the data or built from the code book.
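
To make the role of the code book concrete, here is a minimal Python sketch that cuts the raw sample line above into fields. The variable names and column widths are hypothetical, chosen only so that they reproduce the space-delimited version of the same observation; a real code book would supply the actual names and positions.

```python
# Hypothetical code book: (variable name, width in characters).
# These names and widths are assumptions chosen to match the sample line.
CODEBOOK = [
    ("country", 6),
    ("code", 3),
    ("start_year", 4),
    ("value_1", 3),
    ("filler_1", 5),
    ("end_year", 4),
    ("value_2", 3),
    ("filler_2", 5),
]


def parse_fixed_width(line, codebook):
    """Cut one raw record into named fields using the code book widths."""
    record = {}
    position = 0
    for name, width in codebook:
        record[name] = line[position:position + width]
        position += width
    return record


raw = "afghan130199040000000199920000000"
fields = parse_fixed_width(raw, CODEBOOK)
print(fields)

# The space-delimited version needs no code book to find the boundaries.
delimited = "afghan 130 1990 400 00000 1999 200 00000"
print(delimited.split())
```

With a different set of widths the same 33 characters would split into different values, which is exactly why the definition file matters.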


StatTransfer
Specific help for StatTransfer can be found on the StatLab StatTransfer Help page.

Using Program-Specific Definition/Syntax Files

Frequently, data resources such as ICPSR will offer data in a raw data format with accompanying definition or syntax files which can be used to format the data.

Here is a raw file.


While formats and syntax files may vary, there are some general rules of thumb:
1. Download both the syntax file and the data file onto the c: drive. (After you have finished manipulating the data, you can save the files to your y: drive or other storage.)
2. The StatLab computers may not correctly recognize all zip naming conventions. For example, when downloading from ICPSR, it is necessary to rename zipped files, changing the extension .gz to .gzip.
3. It is generally easiest to first create your dataset in the program preferred by the data archive. You can always convert the data to the format of your preferred software later using StatTransfer.
4. Syntax files are not always updated as software versions change. Frequently, these files will need to be "tweaked" in order to run correctly. General rules are to remove any extraneous material from the files (such as *** notes ***), work through the files piecemeal (first import the data, then run the label commands, etc.), and check every period and space. A single space can matter. Finally, it helps to be patient. It usually takes a couple of tries to get the syntax file to work properly.
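
The cleanup described in rule 4 can also be done outside the statistics package. The following Python sketch removes note lines from the text of a syntax file. It assumes, purely for illustration, that each note sits on a single line that both starts and ends with an asterisk; real syntax files may format their notes differently, so keep a copy of the original file. The command lines in the sample are placeholders, not real syntax.

```python
def strip_asterisk_notes(syntax_text):
    """Drop every line that both starts and ends with an asterisk."""
    kept = []
    for line in syntax_text.splitlines():
        stripped = line.strip()
        is_note = (
            len(stripped) >= 2
            and stripped.startswith("*")
            and stripped.endswith("*")
        )
        if not is_note:
            kept.append(line)
    return "\n".join(kept)


# Placeholder text standing in for a downloaded syntax file.
sample = (
    "*** hypothetical note ***\n"
    "FIRST COMMAND.\n"
    "*** another note ***\n"
    "SECOND COMMAND."
)
cleaned = strip_asterisk_notes(sample)
print(cleaned)
```

Lines that only begin with an asterisk are left alone, so ordinary comments that end in a period survive this pass.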

As an example, we will practice downloading an ICPSR file and opening it in SPSS. First, we search for the file using StatCat. Searching for "civil war," we find the Correlates of War Project. StatCat provides basic information on the data, the data format, and the location [click on Statlab Server on the Holdings Available line]. We see that the data can be found through ICPSR. (Note that the first time you enter, you will have to provide your Yale email address.) The ICPSR page provides not only the data but also other useful information. For our purposes, note in particular the File Information page, which gives specific file format information that may be useful in correcting syntax file problems (especially the LRECL number and the records per case).

  1. Move the files of interest from the server folder onto the computer hard drive. The C:\temp folder is a good place.
  2. Open SPSS (under the Start Menu). Under the File menu select Open>Syntax and browse for the syntax file in the c:\temp\9905 folder. In this case we select WWCW.SPS. (SPSS syntax files can be identified by the .sps extension. Data files end in .por or .sav.)
  3. At this point, we will run the syntax file. Hopefully it works. If not...
  4. Clear all extraneous material, which is identified by being enclosed in asterisks. You should probably read this material later, however.
  5. Open the "Using ASCII Data Files with SPSS" help file and copy the text in section 4. Paste it at the top of the current syntax file.
  6. We will need to replace several portions of this pasted text. Be careful not to disturb spaces or punctuation.
  7. Check that the next-to-last line ends in a period and that the last line ends in "execute." (note the final period).
  8. Save
  9. To run, highlight all the text and click on the run arrow.
  10. If you followed my directions, you probably failed to load your data. I did the first time. In this case, you will see that there is an additional, unnecessary slash between the RECORDS line and the next line. There need be only one. Remove the extra slash (it doesn't matter which one), change the numbers of your file name, and run again. Also make sure that there are no unnecessary spaces.
  11. Congratulations, this only took you two attempts. Frustratingly, it can often take 5, 6, or even 7 tries to get it right.
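
When a syntax file keeps failing, it can help to check the raw data file against the LRECL number and the records per case listed on the File Information page. The Python sketch below shows one way to do that. The sample lines and the values 33 and 1 are made up for illustration. It treats any line whose length differs from LRECL as suspect, which is an assumption: some files trim trailing blanks, so a shorter line is not always an error.

```python
def check_records(lines, lrecl, records_per_case):
    """Return the 1-based numbers of lines whose length differs from
    lrecl, and whether the line count is a whole number of cases."""
    bad = [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.rstrip("\r\n")) != lrecl
    ]
    whole_cases = len(lines) % records_per_case == 0
    return bad, whole_cases


# Made-up sample: the second line is shorter than the stated length.
sample_lines = [
    "afghan130199040000000199920000000\n",
    "afghan1301990400\n",
]
bad, whole = check_records(sample_lines, lrecl=33, records_per_case=1)
print(bad, whole)  # prints: [2] True
```

For a real file, the lines could be read with open(path).readlines(), using the values from the File Information page in place of the made-up ones.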