- Home
- About Us
- Workshops & Seminars
- Software Help
- Software Access
- Spatial Analysis / GIS
- External Resources
3550 Rackham Building University of Michigan Ann Arbor, MI 48109-1070
more contact info
Introduction
The purpose of this paper is to give you some background and helpful hints for data entry for Statistical Packages that may allow you to avoid some common pitfalls, and to get you on your way. There are many different statistical packages that are available on the PC, Mac, Mainframe and UNIX workstation platforms. We cannot possibly give information that will be valid for every one of these programs, but there are some common elements that we believe will be helpful regardless of the statistical program or programs that you use. The points made in this handout should be valid for SPSS, SAS, and SYSTAT.
It is important to think about how you would like to analyze and use your data before you begin to enter it. It is wise to talk to someone who is familiar with the statistical program that you wish to use in advance, if possible, and see what advice they can give about special problems or capabilities that you should take into consideration.
You may be entering the data yourself, or you may hire someone else to do that task. If you do the work yourself, you will need to have available to you a computer program that is capable of data entry. This can be done with a number of different types of programs. This handout will discuss Excel as a data entry tool, but that is not meant to endorse a particular product, rather it is simply to help you understand a program that is widely available at the University of Michigan and that is supported on campus. You may choose any other program for data entry that you are comfortable with and that will get the job done. This handout will try to describe the requirements for statistical data entry so that you can make the choice of a data entry program that is appropriate for you. If you hire someone else to enter the data for you, you need to make the task as easy, painless and efficient for them as possible. Planning is important here, and spending some time at the planning stages will pay off later when it comes time to analyze your data.
What Data to Enter
When you begin your data entry task, the first question that you may face is what data should be entered. It is important to consider in advance how the data will be used before you make decisions on what to include and what not to include. You may have data from a single questionnaire, or you may have data from several questionnaires, lab tests, x-ray results, home diaries and other information, all for the same individual. The types and complexity of data that are used in statistical packages are almost endless. Here are some helpful points that you should consider...
What is a case and what is a record?
In this document, a case is defined as the unit of analysis for your study. For example, a case might be a participant in a survey, or an animal in an experiment, or a family, or a hospital, etc. A record is a line within a data file. Sometimes a case may have more than one record, or more than one type of record. For example, a participant may have an xray record, a questionnaire record, and a physical exam record. These records will all have to be linked in order to analyze them properly.
ID:
Each record that you enter needs to have some form of unique identification that applies to its case, and its case only. An id number serves this purpose. This number may be one that you assign (for example a number that you write on the corner of each questionnaire as it comes in), or it may be one that is available from other sources (for example social security number, or a hospital id code). You may have several pieces of information that together form a unique identifier (for example, census tract, household number, and individual id). If you do have several identifiers for each case, be sure that you enter each identifier for each record for that case. Id numbers allow you to check data later if problems or inconsistencies arise. The id number also allows you to match/merge different files that are available for the same individual or experimental unit.
When in doubt, put it in:
If you have a small amount of data that you wish to enter, it is not difficult to enter all of it. However, for more complex or larger studies, it sometimes becomes problematic which pieces of information to enter. If you have the resources (time, money, storage space) it is usually a good rule of thumb to include everything, or at least everything that you might possibly want from the data. It is relatively easy to add another piece of information to the data entry process as it is being carried out, but becomes much more difficult and costly to go back and enter more data for every case after the data have already been entered. If you find that you have entered more information than you need, it is easy to ignore that part of the data, or simply save it for analysis at a later time. Of course, entering "everything" can become too cumbersome and expensive, and that is when good planning and foresight will help you to include the important items.
Enter data in its simplest form
Statistical programs are designed to summarize, tabulate, graph and otherwise display and manage data. If you summarize the data before entering it to your statistical program, you are wasting your time in doing the calculations, and may introduce mistakes in the process. You may also find that you would like the unsummarized data in order to look at some aspect of it. It is much easier to combine data to derive summary measures than it is to pull it apart.
Enter multiple variables for multiple response data
Usually, a question or response has only one value, but in some questionnaires, respondents are asked to "Check all that apply" or give a list of items. In this case, we recommend that you enter separate information for each list item. So, for example, you may have variables that are Q1A, Q1B, Q1C, and so on. Each of these items can be entered as a "1" if the person chose that answer, and as a "0" if they did not. These items can then be tabulated to get counts of how many people responded with "A" to question 1, etc.
Data Layout
In general, data are laid out in a rectangular array. The rows represent cases, and the columns represent variables. A sample data layout is shown below:
ID GROUP NAME SEX SYSTOLIC DIASTOLIC 1 1 Tony M 128 86 2 1 Phyllis F 135 92 3 2 Rhonda F 102 61 4 2 Jason M 143 105 5 2 Maya F 115 68
The intersection of a row and a column is called a cell. The general rule of thumb is that each cell should contain only 1 piece of information. For example, blood pressure is entered as 2 values, Systolic and Diastolic. If you later wish to calculate the ratio of systolic to diastolic blood pressure, that can be easily done by the stat package. Do NOT enter both blood pressure values in one cell (say 128/86), because the stat package will not be able to read that value, and will not know how to interpret it. The stat package can then easily and accurately calculate any ratio or other function of values that you wish.
Data From Outside Sources
Data from outside sources, that may have already been entered into a flat (or text) file, may be in many formats. Data may be in fixed column format, where each variable is stored within particular columns (as shown in the previous example). This is easy to read into most statistical packages, such as SPSS and SAS, and is probably the best in terms of being non-ambiguous when reading the data into a stat package. Another format that is commonly used is list format; that is, each variable value is followed by at least one blank or other delimiter, such as a comma. List format data may appear "ragged" when you view it with an editor, since different cases may take up different line lengths. The major caution to use with list format data is that each variable must be represented by something. That is, if a measurement is missing, a missing value code must be present in the data. The missing value code might be a number such as 99 or 999, or might be a special character, such as a period. Data may also be stored in binary format, or in exponential format. These are only a few of the more common methods of storing raw data that might be read into statistical packages. Be sure to find out as much as possible about the format of the raw data that you are getting from the outside source before you attempt to read it into the statistical package that you will be using. Information such as how many lines of data there are for each case is critical.
Special Cases
Time Series Data:
There are 2 basic arrangements for time series data: vertical and horizontal. An example of vertically arranged data is shown below:
ID TIME VALUE
1 10 102
1 15 105
1 20 110
2 10 107
2 15 118
2 20 123
3 10 78
3 15 83
3 20 102
The same data arranged vertically are shown below:
ID T10 T15 T20
1 102 105 110
2 107 118 123
3 78 83 102
Both SAS and SPSS can accomodate switches between these types of data layouts, but if you can plan to enter the data in the way you will want to use them, you can save some hassle later on. It is much more difficult in SYSTAT to go between these types of data layout.
Both Aggregated and Individual Data:
You may have different types of information for the same subjects. For example, there may be data for census tracts and information for households, as well as information for individuals within households. We recommend entering these 3 types of data into 3 separate files, and then later merging them using match/merge commands in the statistical package that you use. Check on how difficult or easy the match/merge operation will be in your statistical package before you make a decision on this.
Types of Data
There are 3 main types of data that we will discuss here. They are: numeric, character (or alphabetic) and date. The general rule is not to mix variable types in the same cell.
Numeric:
Numeric data consist of data that are all numbers. The values can be positive integers (as from questionnaire responses), negative values (as from experimental results), and/or values that have decimals in them. Some examples of numeric data are:
128.03
-.0025
22000
If you have numeric data, stick to that type of data for every case, or the statistical package will not know what to do with the non-numeric data.
Don't mix data types:
Systolic
128
104
?
na
Use a different variable if you want to add comments:
Systolic Comment
128
104
999 Meter Broke
999 na
Character:
Character data can include information that contains letters alone, or a combination of letters and numbers. These data will not be able to be used in descriptive statistics, such as means and standard deviations, but will typically be used in tabulations, or cross-tabulations, where the number of cases in a category are enumerated. Some programs allow character data to be used as the grouping variable for Anova models. Examples of character data include first name, last name, street address, state and zip code (the zip code has numbers as values, but they are not intrinsically numeric, and would not be used with descriptive statistics, such as mean and standard deviation).
SAS allows character variables to be up to 200 characters in length, SPSS allows 256 characters, and Systat allows only 12 characters. Since most stat procedures will only print out the first 6 to 20 characters, it is wise to consider making your character variables short, and to include easily distinguishable values in the first 6 or 8 characters of the code.
Date:
Date values are handled differently in each statistical package. SPSS allows dates to be entered, with values separated by slash, hyphen or period (01/01/92, 01-01-92, 01.01.92). The year in SPSS can be either 2 or 4 values. If 2 values are used (eg: 9 2), then the century is assumed to be 1900, if you wish to enter dates for a different century, give the year with 4 numbers (eg: 1892). SAS has many different formats that are allowed for dates. The slash and hyphen notations (01/01/92, 01-01-92) are allowed, as in SPSS. For more details on SAS date formats, check the SAS Language Guide, Reference Manual. Systat does not have date handling facilities at this time.
When statistical packages process date values, the date that you enter is changed into a numeric value that represents the amount of time from a given reference date to the date you entered. This allows dates to be subtracted, as for example, when you have a questionnaire date and a birth date for a subject, and you subtract the birth date from the questionnaire date in order to get the subject's age at the date of interview. As a simple baseline for entering date values, we recommend that you use either the slash or the hyphen method for entering the dates for both SAS and SPSS, and enter month and day values as 2 digit values, as shown in the examples above. For other statistical packages, check the program to see how they handle dates.
Missing Data:
It is important that missing data be handled in a consistent way. SAS will interpret a period '.' as a valid missing data value for numeric data. If a period '.' is used as the missing value code for character data, it will later need to be changed to the SAS internal code for missing values in character data, which is a blank ' '. SPSS will generate an error code for a period '.' , so it is wise to use a given numeric value for SPSS, and then to instruct the program to note that the value is actually missing.
For either SAS or SPSS, if you are using numeric codes to indicate missing values, be sure that the codes you choose are sufficiently different from the valid data codes so that they will be noticeable when you look at descriptive statistics. For example, if a question has valid responses ranging from 1 to 5, missing value codes could be 8 and 9, where 8 indicates not applicable, and 9 indicates missing. Then when the values of the question are tabulated, the numbers of not applicable responses and missing responses, as well as the valid codes can be counted. If you were to get the mean of the question, its value would not be valid, since the missing codes would be included. You must then let the program that you are using know which values are missing for that particular variable.
Data Entry Programs
Data may be entered in any number of different computer programs. You may decide to use a database program, such as dbase or Foxbase, or you may use a spreadsheet, such as Lotus or Excel, or you may use a text processing program, such as Word or Word Perfect. This is not meant to be an exhaustive list of the programs that are available. However, there are some common elements that can be discussed.
Each of these programs has a special way it saves its files that makes them readable to that program, and not to others. This special type of file is saved in what we call a program format. So, Word Perfect saves files in Word Perfect format, Excel saves files in Excel format, etc. Some statistical programs on some platforms will read data that is in the format from other programs (for example, SAS PC and SPSS PC will both read dbase files directly). However, it is best to check to see if the statistical program that you are using is compatible with files created by the program you are using to enter data. Be sure to check with someone who is knowledgeable about versions of both the stat program and the program that you are using for data entry.
Each of these programs has a method for saving plain text (or unformatted or ascii) files. In general, this is the type of file that you want to save when you are planning to use the data with a statistical program. There will not be a compatibility problem between the ascii file and the statistical program, and an ascii file can, in general, be read by any statistical program.
There are also programs available commercially that make data transfers between different programs. One such program is Dbmscopy, another is Data Junction. If you decide to use such a program to go between your data entry program and your statistical analysis program, check with someone knowledgeable in the use of the data transfer program before you begin. There may be some issues about different versions of files, and compatibility that you should know about before you begin.
Text Processing Programs
If you enter your data using a text processing program, you are probably safest if you enter the data in columns, with the rightmost values aligned (right justified). This requires planning how much space will be needed for each variable in advance. If your data have many variables, you may not want to enter them with a word processing program, since you will have a hard time keeping track of where you are when you are entering the data, and the pages will not allow you to get as wide as you might need in order to fit all of your variables for a given case on a single line. Be sure that you hit return (hard carriage return) at the end of the line for each case! Do not allow the program to wrap your typing around to the next line. Don't use tabs, or set margins or insert page breaks when you are entering the data. These special characters will only confuse the statistical program you are using.
After entering your data, save them to a file. Be sure to save the data as a plain text (unformatted or ascii) file. Do not use the regular file saving mechanism, since the program will save the file in its own format, which will include many characters that can't be read by a statistical program.
Check to be sure that the saved file is in ascii format, and doesn't have any formatting characters from the text processing program by using the command
type
in DOS. It is more difficult to check whether the file has been saved properly if you use a Mac. One way you can try to do this, is to try to open the file in Teachtext, and see how it looks. If it has lots of weird looking symbols in it, there is a problem.
Database Programs
Database programs such as Dbase are designed for entering data and for linking records. We won't talk much about these programs, except to say that you may want to find out if your statistical program can import these files directly. Compatibility varies, depending on the version of the statistical program that you are using, and the version of the data base program. You are always safe if you save the data base file as a text or ascii file. Then the statistical program can be given instructions to read in the data from the text file that you have created.
You can check that your data has been saved correctly in DOS by using the 'type' command. The equivalent command in unix is 'cat'. On MTS, a listing of a raw data file can be obtained with the command 'list'.
Spreadsheet Programs
There are many spreadsheet programs that are available, and they are popular for data entry for statistical programs. Lotus and Excel are two examples of spreadsheet programs that can be used for this purpose, although there are many more. The advantages of using a spreadsheet program include a readily visible and easily changed file that allows you to print, edit and rearrange values. Some of the disadvantages are that spreadsheets allow much more flexibility in the values that can be entered in a given cell than do statistical programs. This makes it tempting to put more than one value in a cell, or to include notes or comments within numeric variable cells. This will cause problems in the statistical program, as discussed earlier.
Statistical programs are not spreadsheets, and will not accept many of the kinds of data that spreadsheets use. For example, you may have a spreadsheet in which you enter a series of questionnaire responses for each subject, and you may form totals, or leave blanks in your spreadsheet in order to make it more readable. These will give the statistical program problems. The rule of thumb is to enter only data that are appropriate for individual cases in your spreadsheet, and not to calculate summary values. Allow your statistical program to do that. Keep your spreadsheet simple, and only enter one type of data within a given spreadsheet. Do not add blank lines or put variable names at intervals in the spreadsheet. These will all confuse the statistical program, and will not facilitate the transfer between the two programs.
You usually will want to put the names of the variables in the first row of the spreadsheet. Variable names should follow a few simple rules. Keep the variable names to 8 characters or less, start the variable name with an alphabetic character (not a number), and make subsequent characters be either letters or numbers or underscores '_'. These rules work for SPSS, Systat and SAS. If you are planning to use a different statistical package, check the variable name rules before you begin.
You may want to enter certain information about each variable in its column heading. For example, you may want to add a row to your spreadsheet that contains the missing value codes for a given variable, or something like: 0=male, 1=female. Do not save these first rows with your data when you are saving it to transfer it to the statistical program. It will only confuse the statistical program.
Make your id variable the first column that you enter. This facilitates finding errors and verifying data at a later time. If you have more variables than will fit in one spreadsheet file, start another file, being sure that the id variable is included for each record in the new file, and then match/merge them using your statistical package.
Some spreadsheets allow you to see the first column (for example id number) and first row (for example, variable name), at all times as you move through the spreadsheet. This makes data entry much easier, since you will always know which variable you are entering, and which case you are working on. The technique for always viewing the top row in Excel is to go to the upper right hand corner, on the status bar, and, with your mouse, grab the small black line that is just above the arrow. When you grab the black line, your mouse will change its appearance, and will sprout two tiny arrows, one above a double line, and one below. You can then drag the top part of the first row (or rows) down so that they will be visible at all times in the spreadsheet. Similarly, for making the first column always visible, go to the lower left corner of your spreadsheet and grab the small black line with your mouse. This can be used to move the display so that the first column is always visible at any place in the spreadsheet.
When saving the data from your spreadsheet, you need to save it with a special format. For Excel, we recommend the csv (comma separated values) format. SAS, SPSS and Systat can all read this type of format and create a data set using it. Do NOT use text format with Excel, since that actually saves a file that is tab delimited (has tabs between individual values), and this will confuse your statistical program.
It is probably safest to save 2 versions of your spreadsheet file. First, save the file in regular format (as you usually do) and keep several copies of it in a safe place. Second, save the file in csv (or other ascii) format, and give it a different name to distinguish it from the normal format file. Then, if you make an error in saving the file with the ascii format, your original data will still be safe in the normal format file.
Checking your data
After your data have been entered, you should check them carefully. If you find any errors, correct them using the program that you used to enter the data, and then save the file again. Once you have gotten your data into the statistical program that you will be using for your analysis, you will want to get descriptive statistics on all of your variables and frequency tabulations of some variables so that you can check for any errors.
Transferring your data
Once you have saved your data, you will need to get it to the platform that you are using for your statistical analysis. There are many file transfer programs that are available, among them kermit, procomm, pctie, ftp and fetch. The method that you use will depend on the software that is available to you, and on other considerations, such as the size of the file that you wish to transfer, and so on. The people at 4-HELP are experienced in data transfer, and can help you to move your data from one platform to another.