Converting Data Formats
2. How do I convert my data into the correct format?
Today we will cover two basic techniques for loading data:
a) When and how to use StatTransfer
b) When and how to use program-specific definition/syntax files
To begin, you must know what kind of data you have: data with labels and definitions already included, or data with labels and definitions in a separate file or codebook. If this is not obvious to you (i.e., you have not hand-typed the data into Excel yourself), look at the data source's codebook or webpage for help. If it is still unclear, ask a StatLab consultant to help you.
Side Note: Recognizing Delimited Data from Raw (Undelimited) Data
Raw Data is a line or a number of lines of data strung together with
no clear internal dividers. For example, an observation might appear as
follows:
afghan130199040000000199920000000
Delimited Data uses internal markers such as periods, spaces, or tabs to clearly separate data. Three versions of the above observation follow:
afghan.130.1990.400..1999.200..
afghan 130 1990 400 00000 1999 200 00000
afghan 130 1990 400 . 1999 200 .
As you can see, interpreting the raw data without a codebook can be difficult, since numbers can easily merge together. In the raw data it is not clear whether 1990 refers to the year 1990, or whether 19 is a separate variable and 90 is the year. This is why raw data generally requires the use of definition files, either provided or built from the codebook.
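The difference between the two kinds of data can be sketched in a few lines of Python. This is only an illustration: the column positions and the variable names in LAYOUT are hypothetical stand-ins for what a codebook would supply, not the actual definitions for this dataset.

```python
# Hypothetical layout, as a codebook might give it:
# (variable name, start column, end column), 0-based, end exclusive.
LAYOUT = [
    ("country", 0, 6),
    ("code", 6, 9),
    ("year1", 9, 13),
    ("value1", 13, 16),
    ("extra1", 16, 21),
    ("year2", 21, 25),
    ("value2", 25, 28),
    ("extra2", 28, 33),
]


def parse_fixed(line, layout):
    """Cut a raw (undelimited) line into fields using column positions."""
    return {name: line[start:end] for name, start, end in layout}


def parse_delimited(line, sep=None):
    """Split a delimited line; sep=None splits on any run of whitespace."""
    return line.split(sep)


raw = "afghan130199040000000199920000000"
spaced = "afghan 130 1990 400 00000 1999 200 00000"

fields = parse_fixed(raw, LAYOUT)
print(fields["year1"])          # the field cut from columns 9-12
print(parse_delimited(spaced))  # no layout needed for delimited data
```

The raw line can only be read with the layout in hand, while the delimited line splits on its own. That is the role the definition or syntax file plays for raw data.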
- Use StatTransfer when data is in delimited (tab, space, period, or other) ASCII or text form. Delimited data uses specific internal notation to separate strings of data.
- Use StatTransfer when data is in Excel. In this case, first make sure to remove any special placeholders for missing observations. Commonly used placeholders are single or double periods, N/A, or -99 and other negative numbers.
- Use StatTransfer when data is in another statistical program's format, such as SAS, SPSS, or Stata. StatTransfer can transfer all formats available at the StatLab and many more.
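Removing missing-value placeholders by hand is error prone in a large sheet. As a rough sketch, assuming the spreadsheet has first been saved as CSV, the cleanup could be scripted as below. The PLACEHOLDERS set and the function name are illustrative assumptions; adjust the set to whatever codes your own data uses.

```python
import csv
import io

# Assumed placeholder codes, taken from the examples in the text.
# Other negative numbers would have to be added here case by case.
PLACEHOLDERS = {".", "..", "N/A", "-99"}


def blank_missing(rows, placeholders=PLACEHOLDERS):
    """Return a copy of rows with placeholder cells replaced by empty strings."""
    return [
        ["" if cell.strip() in placeholders else cell for cell in row]
        for row in rows
    ]


# A small in-memory example standing in for an exported sheet.
sample = "country,year,value\nafghan,1990,N/A\nafghan,1999,-99\n"
rows = list(csv.reader(io.StringIO(sample)))
cleaned = blank_missing(rows)

out = io.StringIO()
csv.writer(out).writerows(cleaned)
print(out.getvalue())
```

Note that a blanket rule such as -99 is only safe if -99 can never be a real value in any column, so check the codebook before applying it.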
Using Program-Specific Definition/Syntax Files
Frequently, data resources such as ICPSR will offer data in a raw data format with accompanying definition or syntax files which can be used to format the data.
While formats and syntax files may vary, there are some general rules
of thumb:
1. Download both the syntax file and the data file on to the c: drive.
(After you have finished manipulating the data you can then save the
files on to your y: drive or alternative memory storage.)
2. The StatLab computers may not correctly recognize all zip naming
conventions. For example, when downloading from ICPSR, it is necessary
to rename zipped files with the extension .gz to .gzip.
3. It is generally easiest to create your dataset first in the program
preferred by the data archive. You can always convert the data to
the format of your preferred software later using StatTransfer.
4. Syntax files are not always updated as software versions change.
Frequently, these files will need to be "tweaked" in order to run
correctly. General rules are to remove any extraneous material from the
files (such as *** notes ***), work through the files piecemeal (first
import the data, then run the label commands, etc.), and check every
period and space. A single space can matter. Above all, it helps to be
patient. It usually takes a couple of tries to get the syntax file to
work properly.
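The "remove extraneous material" rule can be sketched as a small Python helper. It rests on an assumption that may not hold for every archive: that a note occupies a whole line which both begins and ends with three asterisks. The function name is hypothetical, and notes formatted differently would need a different test.

```python
def strip_note_lines(syntax_text):
    """Drop whole lines of the form '*** ... ***' and keep everything else.

    Assumption: notes are single lines that start and end with '***'.
    """
    kept = []
    for line in syntax_text.splitlines():
        stripped = line.strip()
        is_note = stripped.startswith("***") and stripped.endswith("***")
        if not is_note:
            kept.append(line)
    return "\n".join(kept)


example = (
    "*** Note from the archive ***\n"
    "DATA LIST FILE=cwar1 FIXED RECORDS=3\n"
    "*** End of note ***"
)
print(strip_note_lines(example))
```

Work on a copy of the syntax file, so the original notes remain available to read later.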
As an example, we will practice downloading an ICPSR file and opening it
in SPSS. First we will search for the file using StatCat. Searching
for "civil war" we find the Correlates of War Project. StatCat provides
basic information on the data, data format, and location [click on
Statlab Server on the Holdings Available line]. We see that the data can
be found through ICPSR. (Note that the first time you enter you will
have to provide your Yale email address.) The ICPSR page provides not
only the data but other useful information. For our purposes, note in
particular the File Information page, which provides specific file
format information that may be useful in correcting possible syntax
file problems (especially note the LRECL number and records per case).
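Before editing a syntax file, it can help to confirm that the raw data file agrees with the LRECL and records-per-case figures from the File Information page. The following Python sketch is one way to do that; the function name is hypothetical, and it assumes that lines may be shorter than LRECL (for example, if trailing blanks were trimmed) but should never be longer.

```python
def check_raw_file(lines, lrecl, records_per_case):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\r\n"))
        if length > lrecl:
            problems.append(
                f"line {number}: length {length} exceeds LRECL {lrecl}"
            )
    if len(lines) % records_per_case != 0:
        problems.append(
            f"{len(lines)} lines is not a multiple of "
            f"{records_per_case} records per case"
        )
    return problems


# In-memory stand-in for a raw data file: one case made of three records.
sample_lines = ["a" * 75 + "\n", "b" * 75 + "\n", "c" * 60 + "\n"]
print(check_raw_file(sample_lines, lrecl=75, records_per_case=3))
```

For a real file, the lines could be read with `open(path).readlines()`, where `path` is the location of the .dat file.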
- Move the files of interest from the server folder on to the computer hard drive. The C:\temp folder is a good place.
- Open SPSS (under the Start Menu). Under the File menu select Open>Syntax and browse for the syntax file in the c:\temp\9905 folder. In this case we select WWCW.SPS (SPSS syntax files can be identified by the .sps extension; data files end in .por or .sav).
- At this point, we will run the syntax file. Hopefully it works.
If not...
- Clear all extraneous material, identified by enclosure in asterisks. You should probably read this material later, however.
- Open the "Using ASCII Data Files with SPSS" help file and copy the text in section 4. Paste at top of current syntax file.
- We will need to replace several portions of this pasted text. Be careful not to disturb spaces or punctuation.
- After the words FILE HANDLE, replace data1 with a new file name of your choice (Ex: cwar1). Each time you run the syntax file this name will need to be changed so it is useful to always include a number at the end.
- Similarly, after DATA LIST FILE= , replace data1 with the same file name as above (Ex: cwar1).
- Replace datafilepath with the file path and name of the raw data which you will be formatting. (Hint: ICPSR raw data files almost always end in .dat) Ex: 'c:\temp\9905\WWCW.dat'
- Replace the nnn following LRECL= with the Logical Record Length of your data file. Sometimes this will already be in the syntax file; other times you will need to find it in the codebook or on the webpage where you downloaded the data. In ICPSR, this information is located on the "Description - File Information" page. (Ex: 75)
- Replace the RECORDS number. Like the LRECL, this number may be available in the text below or can be found in the codebook, etc.
- Now remove any text below which appears to replicate the text we just entered. (Example: remove the old "DATA LIST" line, noting the addition of FIXED before the RECORD).
- Final Text will look as follows:
FILE HANDLE cwar1 NAME='C:\TEMP\9905\WWCW.DAT' LRECL=75 .
DATA LIST FILE=cwar1 FIXED RECORDS=3/
- Check that the last but one line ends in a period and the last line ends in "execute." (note the final period)
- Save
- To run, highlight all the text and click on the run arrow.
- If you followed my directions, you probably failed to load your
data. I did the first time. In this case, you will see that there is an
additional, unnecessary / between RECORDS and the next line. There
need be only one. Remove the extra slash (it doesn't matter which one),
change the number in your file name, and run again. Also make sure
that there are no unnecessary spaces.
- Congratulations, this only took you two attempts. Frustratingly,
it can often take 5, 6, or even 7 tries to get it right.
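The header edits in the steps above are mechanical enough to sketch in Python. The helper names below are hypothetical. The sketch also assumes that the remaining syntax already begins its next line with a slash, so the generated DATA LIST line leaves the trailing slash off.

```python
import re


def build_header(handle, data_path, lrecl, records):
    """Build the FILE HANDLE and DATA LIST lines described in the steps above."""
    return (
        f"FILE HANDLE {handle} NAME='{data_path}' LRECL={lrecl} .\n"
        f"DATA LIST FILE={handle} FIXED RECORDS={records}"
    )


def has_doubled_slash(syntax_text):
    """True if a line ending in '/' is directly followed by one starting with '/'."""
    lines = [ln.strip() for ln in syntax_text.splitlines() if ln.strip()]
    return any(a.endswith("/") and b.startswith("/") for a, b in zip(lines, lines[1:]))


def next_handle(handle):
    """Increase the trailing number of a handle name, e.g. cwar1 -> cwar2."""
    match = re.fullmatch(r"(.*?)(\d+)", handle)
    if match is None:
        return handle + "1"
    return match.group(1) + str(int(match.group(2)) + 1)


print(build_header("cwar1", r"C:\TEMP\9905\WWCW.DAT", 75, 3))
print(next_handle("cwar1"))
```

Running `has_doubled_slash` over the full syntax text before each attempt catches the doubled slash described above, and `next_handle` supplies the fresh file name needed for the next run.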
