 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
The Center For Statistical Consultation and Research
3550 Rackham Building
University of Michigan
Ann Arbor, MI 48109-1070
cscar@umich.edu
|
|
|
. |
 |
| |
Preparing Data: Data Entry
|
|
|
|
Introduction
The purpose of this paper is to give you some background
and helpful hints for data entry for Statistical Packages
that may allow you to avoid some common pitfalls, and to
get you on your way. There are many different statistical
packages that are available on the PC, Mac, Mainframe and
UNIX workstation platforms. We cannot possibly give
information that will be valid for every one of these
programs, but there are some common elements that we
believe will be helpful regardless of the statistical
program or programs that you use. The points made in this
handout should be valid for SPSS, SAS, and SYSTAT.
It is important to think about how you
would like to analyze and use your data before you begin
to enter it. It is wise to talk to someone who is
familiar with the statistical program that you wish to
use in advance, if possible, and see what advice they can
give about special problems or capabilities that you
should take into consideration.
You may be entering the data yourself, or
you may hire someone else to do that task. If you do the
work yourself, you will need to have available to you a
computer program that is capable of data entry. This can
be done with a number of different types of programs.
This handout will discuss Excel as a data entry tool, but
that is not meant to endorse a particular product, rather
it is simply to help you understand a program that is
widely available at the University of Michigan and that
is supported on campus. You may choose any other program
for data entry that you are comfortable with and that
will get the job done. This handout will try to describe
the requirements for statistical data entry so that you
can make the choice of a data entry program that is
appropriate for you. If you hire someone else to enter
the data for you, you need to make the task as easy,
painless and efficient for them as possible. Planning is
important here, and spending some time at the planning
stages will pay off later when it comes time to analyze
your data. |
| |
What Data to Enter
When you begin your data entry task, the
first question that you may face is what data should be
entered. It is important to consider in advance how the
data will be used before you make decisions on what to
include and what not to include. You may have data from a
single questionnaire, or you may have data from several
questionnaires, lab tests, x-ray results, home diaries
and other information, all for the same individual. The
types and complexity of data that are used in statistical
packages are almost endless. Here are some helpful points
that you should consider...
| | What is a case and what is a
record?
In this document, a case is defined as
the unit of analysis for your study. For example, a case
might be a participant in a survey, or an animal in an
experiment, or a family, or a hospital, etc. A record is
a line within a data file. Sometimes a case may have more
than one record, or more than one type of record. For
example, a participant may have an xray record, a
questionnaire record, and a physical exam record. These
records will all have to be linked in order to analyze
them properly.
| | ID:
Each record that you enter needs to have
some form of unique identification that applies to its
case, and its case only. An id number serves this
purpose. This number may be one that you assign (for
example a number that you write on the corner of each
questionnaire as it comes in), or it may be one that is
available from other sources (for example social security
number, or a hospital id code). You may have several
pieces of information that together form a unique
identifier (for example, census tract, household number,
and individual id). If you do have several identifiers
for each case, be sure that you enter each identifier for
each record for that case. Id numbers allow you to check
data later if problems or inconsistencies arise. The id
number also allows you to match/merge different files
that are available for the same individual or
experimental unit.
| | When in doubt, put it in:
If you have a small amount of data that
you wish to enter, it is not difficult to enter all of
it. However, for more complex or larger studies, it
sometimes becomes problematic which pieces of information
to enter. If you have the resources (time, money, storage
space) it is usually a good rule of thumb to include
everything, or at least everything that you might
possibly want from the data. It is relatively easy to add
another piece of information to the data entry process as
it is being carried out, but becomes much more difficult
and costly to go back and enter more data for every case
after the data have already been entered. If you find
that you have entered more information than you need, it
is easy to ignore that part of the data, or simply save
it for analysis at a later time. Of course, entering
"everything" can become too cumbersome and
expensive, and that is when good planning and foresight
will help you to include the important items.
|
| |
Enter data in its
simplest form:
Statistical programs are designed to
summarize, tabulate, graph and otherwise display and
manage data. If you summarize the data before entering it
to your statistical program, you are wasting your time in
doing the calculations, and may introduce mistakes in the
process. You may also find that you would like the
unsummarized data in order to look at some aspect of it.
It is much easier to combine data to derive summary
measures than it is to pull it apart.
|
| |
Enter multiple variables
for multiple response data:
Usually, a question or response has only
one value, but in some questionnaires, respondents are
asked to "Check all that apply" or give a list
of items. In this case, we recommend that you enter
separate information for each list item. So, for example,
you may have variables that are Q1A, Q1B, Q1C, and so on.
Each of these items can be entered as a "1" if
the person chose that answer, and as a "0" if
they did not. These items can then be tabulated to get
counts of how many people responded with "A" to
question 1, etc.
|
| |
Data Layout:
In general, data are laid out in a
rectangular array. The rows represent cases, and the
columns represent variables. A sample data layout is
shown below:
ID GROUP NAME SEX SYSTOLIC DIASTOLIC
1 1 Tony M 128 86
2 1 Phyllis F 135 92
3 2 Rhonda F 102 61
4 2 Jason M 143 105
5 2 Maya F 115 68
The intersection of a row
and a column is called a cell. The general rule of thumb
is that each cell should contain only 1 piece of
information. For example, blood pressure is entered as 2
values, Systolic and Diastolic. If you later wish to
calculate the ratio of systolic to diastolic blood
pressure, that can be easily done by the stat package. Do
NOT enter both blood pressure values in one cell (say
128/86), because the stat package will not be able to
read that value, and will not know how to interpret it.
The stat package can then easily and accurately calculate
any ratio or other function of values that you wish.
|
| |
Data From Outside
Sources:
Data from outside sources, that may have already been
entered into a flat (or text) file, may be in many
formats. Data may be in fixed column format, where each
variable is stored within particular columns (as shown in
the previous example). This is easy to read into most
statistical packages, such as SPSS and SAS, and is
probably the best in terms of being non-ambiguous when
reading the data into a stat package. Another format that
is commonly used is list format; that is, each variable
value is followed by at least one blank or other
delimiter, such as a comma. List format data may appear
"ragged" when you view it with an editor, since
different cases may take up different line lengths. The
major caution to use with list format data is that each
variable must be represented by something. That is, if a
measurement is missing, a missing value code must be
present in the data. The missing value code might be a
number such as 99 or 999, or might be a special
character, such as a period. Data may also be stored in
binary format, or in exponential format. These are only a
few of the more common methods of storing raw data that
might be read into statistical packages. Be sure to find
out as much as possible about the format of the raw data
that you are getting from the outside source before you
attempt to read it into the statistical package that you
will be using. Information such as how many lines of data
there are for each case is critical.
|
| |
Special Cases:
| | Time Series Data:
There are 2 basic arrangements for time
series data: vertical and horizontal. An example of
vertically arranged data is shown below:
ID TIME VALUE
1 10 102
1 15 105
1 20 110
2 10 107
2 15 118
2 20 123
3 10 78
3 15 83
3 20 102
The same data arranged
vertically are shown below:
ID T10 T15 T20
1 102 105 110
2 107 118 123
3 78 83 102
Both SAS and SPSS can
accomodate switches between these types of data layouts,
but if you can plan to enter the data in the way you will
want to use them, you can save some hassle later on. It
is much more difficult in SYSTAT to go between these
types of data layout.
| | Both Aggregated and Individual
Data:
You may have different types of
information for the same subjects. For example, there may
be data for census tracts and information for households,
as well as information for individuals within households.
We recommend entering these 3 types of data into 3
separate files, and then later merging them using
match/merge commands in the statistical package that you
use. Check on how difficult or easy the match/merge
operation will be in your statistical package before you
make a decision on this.
|
| |
Types of Data:
There are 3 main types of data that we
will discuss here. They are: numeric, character (or
alphabetic) and date. The general rule is not to mix
variable types in the same cell.
| | Numeric:
Numeric data consist of data that are all
numbers. The values can be positive integers (as from
questionnaire responses), negative values (as from
experimental results), and/or values that have decimals
in them. Some examples of numeric data are:
If you have numeric data, stick to that type of data for
every case, or the statistical package will not know what
to do with the non-numeric data.
| | Don't mix data types:
| Use a different variable if you
want to add comments:
Systolic Comment
128
104
999 Meter Broke
999 na
| | Character:
Character data can include information
that contains letters alone, or a combination of letters
and numbers. These data will not be able to be used in
descriptive statistics, such as means and standard
deviations, but will typically be used in tabulation s,
or crosstabulations, where the number of cases in a
category are enumerated. Some programs allow character
data to be used as the grouping variable for Anova
models. Examples of character data include first name,
last name, street address, state and zip code (the zip
code has numbers as values, but they are not
intrinsically numeric, and would not be used with
descriptive statistics, such as mean and standard
deviation).
SAS allows character variables to be up to 200 characters
in length, SPSS allows 256 characters, and Systat allows
only 12 characters. Since most stat procedures will only
print out the first 6 to 20 characters, it is wise to
consider making your charact er variables short, and to
include easily distinguishable values in the first 6 or 8
characters of the code.
| | Date:
Date values are handled differently in
each statistical package. SPSS allows dates to be
entered, with values separated by slash, hyphen or period
(01/01/92, 01-01-92, 01.01.92). The year in SPSS can be
either 2 or 4 values. If 2 values are used (eg: 9 2),
then the century is assumed to be 1900, if you wish to
enter dates for a different century, give the year with 4
numbers (eg: 1892). SAS has many different formats that
are allowed for dates. The slash and hyphen notations
(01/01/92, 01-01-92) are allowed, as in SPSS. For more
details on SAS date formats, check the SAS Language
Guide, Reference Manual. Systat does not have date
handling facilities at this time.
When statistical packages process date values, the date
that you enter is changed into a numeric value that
represents the amount of time from a given reference date
to the date you entered. This allows dates to be
subtracted, as for example, when you ha ve a
questionnaire date and a birth date for a subject, and
you subtract the birth date from the questionnaire date
in order to get the subject's age at the date of
interview. As a simple baseline for entering date values,
we recommend that you use eithe r the slash or the hyphen
method for entering the dates for both SAS and SPSS, and
enter month and day values as 2 digit values, as shown in
the examples above. For other statistical packages, check
the program to see how they handle dates.
| | Missing Data:
It is important that missing data be
handled in a consistent way. SAS will interpret a period
'.' as a valid missing data value for numeric data. If a
period '.' is used as the missing value code for
character data, it will later need to be changed to t he
SAS internal code for missing values in character data,
which is a blank ' '. SPSS will generate an error code
for a period '.' , so it is wise to use a given numeric
value for SPSS, and then to instruct the program to note
that the value is actually missing.
For either SAS or SPSS, if you are using numeric codes to
indicate missing values, be sure that the codes you
choose are sufficiently different from the valid data
codes so that they will be noticeable when you look at
descriptive statistics. For example, if a question has
valid responses ranging from 1 to 5, missing value codes
could be 8 and 9, where 8 indicates not applicable, and 9
indicates missing. Then when the values of the question
are tabulated, the numbers of not applicable responses
and missing responses, as well as the valid codes can be
counted. If you were to get the mean of the question, its
value would not be valid, since the missing codes would
be included. You must then let the program that you are
using know which values are missing for that particular
variable.
|
| |
| | Data Entry Programs:
Data may be entered in any number of different computer
programs. You may decide to use a database program, such
as dbase or Foxbase, or you may use a spreadsheet, such
as Lotus or Excel, or you may use a text processing
program, such as Word or Word Per fect. This is not meant
to be an exhaustive list of the programs that are
available. However, there are some common elements that
can be discussed.
Each of these programs has a special way it saves its
files that makes them readable to that program, and not
to others. This special type of file is saved in what we
call a program format. So, Word Perfect saves files in
Word Perfect format, Excel saves files in Excel format,
etc. Some statistical programs on some platforms will
read data that is in the format from other programs (for
example, SAS PC and SPSS PC will both read dbase files
directly). However, it is best to check to see if the
statistical program that you are using is compatible with
files created by the program you are using to enter data.
Be sure to check with someone who is knowledgeable about
versions of both the stat program and the program that
you are using for data entry.
Each of these programs has a method for saving plain text
(or unformatted or ascii) files. In general, this is the
type of file that you want to save when you are planning
to use the data with a statistical program. There will
not be a compatibility problem between the ascii file and
the statistical program, and an ascii file can, in
general, be read by any statistical program.
There are also programs available commercially that make
data transfers between different programs. One such
program is Dbmscopy, another is Data Junction. If you
decide to use such a program to go between your data
entry program and your statistical analysis program,
check with someone knowledgeable in the use of the data
transfer program before you begin. There may be some
issues about different versions of files, and
compatibility that you should know about before you
begin.
|
| |
| | Text Processing Programs:
If you enter your data using a text processing program,
you are probably safest if you enter the data in columns,
with the rightmost values aligned (right justified). This
requires planning how much space will be needed for each
variable in advance. If your data have many variables,
you may not want to enter them with a word processing
program, since you will have a hard time keeping track of
where you are when you are entering the data, and the
pages will not allow you to get as wide as you might need
in order to fit all of your variables for a given case on
a single line. Be sure that you hit return (hard carriage
return) at the end of the line for each case! Do not
allow the program to wrap your typing around to the next
line. Don't use tabs, or set margins or insert page
breaks when you are entering the data. These special
characters will only confuse the statistical program you
are using.
After entering your data, save them to a file. Be sure to
save the data as a plain text (unformatted or ascii)
file. Do not use the regular file saving mechanism, since
the program will save the file in its own format, which
will include many characters that can't be read by a
statistical program.
Check to be sure that the saved file is in ascii format,
and doesn't have any formatting characters from the text
processing program by using the command
in DOS. It is more difficult to check whether the file
has been saved properly if you use a Mac. One way you can
try to do this, is to try to open the file in Teachtext,
and see how it looks. If it has lots of weird looking
symbols in it, there is a problem.
|
| |
| | Data Base Programs:
Data base programs such as Dbase are designed for
entering data and for linking records. We won't talk much
about these programs, except to say that you may want to
find out if your statistical program can import these
files directly. Compatibility varies, depending on the
version of the statistical program that you are using,
and the version of the data base program. You are always
safe if you save the data base file as a text or ascii
file. Then the statistical program can be given
instructions to read in the data from the text file that
you have created.
You can check that your data has been saved correctly in
DOS by using the type command. The equivalent command in
unix is cat . On MTS, a listing of a raw data file can be
obtained with the command
|
| | Spreadsheet Programs:
There are many spreadsheet programs that are available,
and they are popular for data entry for statistical
programs. Lotus and Excel are two examples of spreadsheet
programs that can be used for this purpose, although
there are many more. The advantages of using a
spreadsheet program include a readily visible and easily
changed file that allows you to print, edit and rearrange
values. Some of the disadvantages are that spreadsheets
allow much more flexibility in the values that can be
entered in a given cell than do statistical programs.
This makes it tempting to put more than one value in a
cell, or to include notes or comments within numeric
variable cells. This will cause problems in the
statistical program, as discussed earlier.
Statistical programs are not spreadsheets, and will not
accept many of the kinds of data that spreadsheets use.
For example, you may have a spreadsheet in which you
enter a series of questionnaire reponses for each
subject, and you may form totals, or leave blanks in your
spreadsheet in order to make it more readable. These will
give the statistical program problems. The rule of thumb
is to enter only data that are appropriate for individual
cases in your spreadsheet, and not to calculate summary
values. Allow your statistical program to do that. Keep
your spreadsheet simple, and only enter one type of data
within a given spreadsheet. Do not add blank lines or put
variable names at intervals in the spreadsheet. These
will all confuse the statistical program, and will not
facilitate the transfer between the two programs.
You usually will want to put the names of the variables
in the first row of the spreadsheet. Variable names
should follow a few simple rules. Keep the variable names
to 8 characters or less, start the variable name with an
alphabetic character (not a number), and make subsequent
characters be either letters or numbers or underscores
'_'. These rules work for SPSS, Systat and SAS. If you
are planning to use a different statistical package,
check the variable name rules before you begin.
You may want to enter certain information about each
variable in its column heading. For example, you may want
to add a row to your spreadsheet that contains the
missing value codes for a given variable, or something
like: 0=male, 1=female. Do not save these first rows with
your data when you are saving it to transfer it to the
statistical program. It will only confuse the statistical
program.
Make your id variable the first column that you enter.
This facilitates finding errors and verifying data at a
later time. If you have more variables than will fit in
one spreadsheet file, start another file, being sure that
the id variable is included for each record in the new
file, and then match/merge them using your statistical
package.
Some spreadsheets allow you to see the first column (for
example id number) and first row (for example, variable
name), at all times as you move through the spreadsheet.
This makes data entry much easier, since you will always
know which variable you are entering, and which case you
are working on. The technique for always viewing the top
row in Excel is to go to the upper right hand corner, on
the status bar, and, with your mouse, grab the small
black line that is just above the arrow. When you grab
the black line, your mouse will change its appearance,
and will sprout two tiny arrows, one above a double line,
and one below. You can then drag the top part of the
first row (or rows) down so that they will be visible at
all times in the spreadsheet. Similarly, for making the
first column always visible, go to the lower left corner
of your spreadsheet and grab the small black line with
your mouse. This can be used to move the display so that
the first column is always visible at any place in the
spreadsheet.
When saving the data from your spreadsheet, you need to
save it with a special format. For Excel, we recommend
the csv (comma separated values) format. SAS, SPSS and
Systat can all read this type of format and create a data
set using it. Do NOT use text format with Excel, since
that actually saves a file that is tab delimited (has
tabs between individual values), and this will confuse
your statistical program.
It is probably safest to save 2 versions of your
spreadsheet file. First, save the file in regular format
(as you usually do) and keep several copies of it in a
safe place. Second, save the file in csv (or other ascii)
format, and give it a different name to distinguish it
from the normal format file. Then, if you make an error
in saving the file with the ascii format, your original
data will still be safe in the normal format file.
|
| |
| | Checking your data:
After your data have been entered, you should check them
carefully. If you find any errors, correct them using the
program that you used to enter the data, and then save
the file again. Once you have gotten your data into the
statistical program that you will be using for your
analysis, you will want to get descriptive statistics on
all of your variables and frequency tabulations of some
variables so that you can check for any errors.
|
| |
| | Transferring your data:
Once you have saved your data, you will need to get it to
the platform that you are using for your statistical
analysis. There are many file transfer programs that are
available, among them kermit, procomm, pctie, ftp and
fetch. The method that you use will depend on the
software that is available to you, and on other
considerations, such as the size of the file that you
wish to transfer, and so on. The people at 4-HELP are
experienced in data transfer, and can help you to move
your data from one platform to another.
|
|
| |
|
CSCAR Home | About Us | Location | Workshops & Seminars | Software Help |
External Resources | Spatial Analysis GIS | Contact Us | Search
|
| |
|