 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
The Center For Statistical Consultation and Research
3550 Rackham Building
University of Michigan
Ann Arbor, MI 48109-1070
cscar@umich.edu
|
|
|
. |
 |
| |
Importing Data: ASCII To SAS
|
|
Introduction
A SAS data set is created by using a DATA STEP. The
data step has 3 essential parts: the DATA STATEMENT, the
INFILE (or CARDS) STATEMENT, and the INPUT STATEMENT.
The DATA STATEMENT names the SAS data set to be
created. SAS data set names can have from 1 to 8
characters, must start with a letter or underscore (_),
and the initial character can be followed by any
combination of letters, numbers and underscores.
The INFILE STATEMENT indicates the raw data file to be
read. If the raw data are in the same directory from
which sas is initiated, the file name can be given
without any path, but if the raw data are in another
subdirectory, a path name must be given. The CARDS
STATEMENT is used when raw data are read from the command
file. Raw data are included in the command file, and at
the end, a single semicolon is included on a line by
itself to indicate the end of the raw data.
The INPUT STATEMENT tells SAS the names of the
variables to be read in, and what form the raw data will
be. SAS variable names follow the same rules as data set
names.
Data transformations and recodes or other programming
statements will come after the DATA STATEMENT, the INFILE
STATEMENT, and the INPUT STATEMENT, but before the CARDS
STATEMENT.
If your data are in a flat (ascii, line, text,
unformatted) file, ie: the data are readable by humans,
you may read the data and create a SAS data set with a
Data step. If the data are lined up in specific columns,
you may read the data with column style input
or with formatted input. If the data are
separated by a delimiter, such as blank spaces or commas,
you may use list style input.
Reading raw data with column style input
Reading raw data with column style input: For column
style input, the raw data must be lined up in specific
columns. Each variable name is given, followed by the
columns in which it is found. Character values are
indicated by a $ sign.
|
data marflt;
infile 'marflt.dat' firstobs=2 obs=424;
input flight $ 1-4 dest $ 18-20 passngrs 34-36;
run;
proc print data=marflt;
proc means data=marflt;
run;
|
|
The infile 'marflt.dat' will need a fully qualified
path name (eg: '~jansmart/rawdata/marflt.dat') if the
file is not in the current subdirectory. The SAS data set
MARFLT will be temporary, and will not be saved once the
SAS session is completed.
Note the use of the firstobs and obs options in the
infile statement. The firstobs option means that the
first line that will be read from the raw data file
MARFLT.DAT will be line 2. This may be especially helpful
if there is information on the first line of a
descriptive nature that you wish to skip when reading in
the file. The obs option is actually more like a
'lastobs' option. It indicates to SAS the last
observation that will be read in. In this case, there
will be a total of 423 observations read in, starting at
line 2, and continuing through line 424.
Reading raw data with formatted input
To read data with formatted input, first indicate the
column in which to begin reading with an @ sign (e.g.
@46), then give the variable name, and then indicate the
format of the variable with a format of the form w.d for
numeric variables, where w is the width of the entire
variable, and d indicates the number of places after the
decimal. Note that explicit decimals in the data will
override a decimal specification given in the input
format. The @ can be used to move around to different
columns in the raw data, so that portions of the line may
be skipped, if desired.
|
data marflt2;
infile 'marflt.dat' ;
input flight $3. @46 capacity 3. @52 freight 8.2;
proc print data=marflt2;
where capacity > 150;
title 'printout of data for flights';
title2 'with capacity of more than 150 passengers';
proc means data=marflt2;
title 'descriptive statistics for mydata.marflt2';
run;
|
|
Note the use of the where statement in the proc print
shown above, which controls the cases that are printed.
The where statement can be used to select cases for
processing by any procedure. Note: You may mix column and
formatted style input in the same input statement.
Reading in Raw Data with List Style Input
If your raw data are delimited by some character, such
as blanks, or commas, and they are not necessarily lined
up in columns, you may use list style input for reading
the data into a SAS data set. It is best to have some
place holder for missing values if you are using list
style input to prevent mistakes in reading the data.
Missing values are typically indicated by a period, or
numeric values may be used for missing values. If numeric
values such as 99 or 999 are used, be sure that the
values you choose to indicate missing responses are not
possible as valid values.
List style input is given by simply listing the
variable names, with character variables being followed
by a $. An example of reading in raw data with list style
input is shown below:
|
data medexam;
length lname $10;
input lname $ id sbp age;
cards;
Smith 1028 105 .
Williams 133725 86 49
Brun 182906 108 56
;
|
|
Note that in the above example, the raw data are not
contained in another file, but are included right in the
command file. This is indicated with a cards statement,
and the raw data then follow. No semicolons are included
in the raw data if this method of reading in values is
used, and the end of the data is signalled by a semicolon
(;) on a line by itself. The length statement is used to
indicate that LNAME has a maximum length of 10
characters. This is useful so that SAS will not truncate
character variables, since SAS automatically assigns a
length for a character variable that is equal to the
first value that it encounters.
List Input with the dlm Option
Any delimiters may be used to separate variable values
in a raw data file for list style input (e.g. commas or
slashes), but the default is blank spaces. The dlm option
tells SAS the type of delimiter that your raw data uses,
as shown below. By default, multiple occurrences of a
delimiter (e.g. multiple blanks), will be read as one
delimiter.
|
data class;
infile 'class.dat' dlm=',' ;
input lname $ sex $ age height sbp;
proc chart data=class;
vbar height sbp;
title 'vertical bar charts for selected variables';
run;
|
|
The dlm option in the example above indicates that the
delimiters are commas for this input file.
List Input with the dsd Option
If you had data that were entered with commas
delimiting the values, and you wished multiple commas to
indicate missing values, you could do so by using the dsd
option on the infile statement. For example, suppose your
raw data were stored in a file called medexam.dat, and
that it appeared as shown below:
|
1002,54,,,,,280
1003,37,f,67,129,118,142
1004,29,m,72,,186,,
1004,33,m,59,143,102,155
|
|
In this case, the commas with nothing between them
indicate missing values. If this file were read into SAS
with the following commands, the data that resulted would
be in error, as shown in the data step and printout
below:
|
data medexam2;
infile 'medexam.dat' dlm=',';
input id age sex $ height sbp dbp weight;
proc print data=medexam2;
title 'printout of data';
title2 'with multiple commas causing';
title3 'errors in reading data';
run;
printout of data
with multiple commas causing
errors in reading data
OBS ID AGE SEX HEIGHT SBP DBP WEIGHT
1 1001 54 280 1002 37 . 67
2 1003 29 m 72 186 1004 33
|
|
A look at the partial log for this problem indicates
that there were problems in reading the raw data:
|
1 The SAS System
NOTE: Copyright(c) 1989 by SAS Institute Inc., Cary, NC USA.
NOTE: SAS (r) Proprietary Software Release 6.09 TS027
Licensed to UNIVERSITY OF MICHIGAN, ICPSR, Site 0009403007.
NOTE: AUTOEXEC processing completed.
1 data medexam2;
2 infile 'medexam.dat' dlm=',';
3 input id age sex $ height sbp dbp weight;
4
5 options linesize=72;
6
7
NOTE: The infile 'medexam.dat' is:
File Name=/afs/umich.edu/user/k/w/kwelch/510/medexam.dat,
Owner Name=kwelch,Group Name=staff,
Access Permission=rw-r--r--,
File Size (bytes)=87
NOTE: Invalid data for DBP in line 2 9-9.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----
6--
2 1002,37,f,67,129,118,142 24
ID=1001 AGE=54 SEX=280 HEIGHT=1002 SBP=37 DBP=. WEIGHT=67
_ERROR_=1
_N_=1
NOTE: 5 records were read from the infile 'medexam.dat'.
The minimum record length was 0.
The maximum record length was 24.
NOTE: SAS went to a new line when INPUT statement reached past
the end
of a line.
NOTE: The data set WORK.MEDEXAM2 has 2 observations and 7
variables.
|
|
The following command file, with the dsd option, would
read the data correctly:
|
data medexam2;
infile 'medexam.dat' dlm=',' dsd ;
input id age sex $ height sbp dbp weight;
proc print data=medexam2;
title 'printout of data';
title2 'with multiple commas read as missing';
title3 'error is fixed';
run;
printout of data
with multiple commas read as missing
error is fixed
OBS ID AGE SEX HEIGHT SBP DBP WEIGHT
1 1001 54 . . . 280
2 1002 37 f 67 129 118 142
3 1003 29 m 72 . 186 .
4 1004 33 m 59 143 102 155
|
|
The Missover Option with List Input
Sometimes you will think there are a certain number of
cases in your data set, but notice that you have fewer
than you expected when you check your log. (Often, you
will have exactly half the number of cases that you
expect). This can be especially troublesome when there
are missing values scattered throughout your data. One
reason that this may happen is that in list style input,
SAS simply keeps reading until it finds enough values to
fill in all of the variables that you had specified in
your input statement, so it will skip to the next line
and continue reading, unless you tell it not to do that.
The missover option indicates that SAS should fill out a
given case by only using the information on a single
line. An example is shown below:
|
data medexam2;
infile 'medexam.dat' dlm=',' missover ;
input id age sex $ height sbp dbp weight;
proc print data=medexam2;
title 'printout of data';
title2 'with multiple commas read as missing';
run;
|
|
Note that this will still cause data to be read
incorrectly within a line, because the dsd option is not
used here, but the correct number of cases will be read.
Cautions
Before reading in raw data to SAS or any other
statistical program, it is always wise to check the data
by using an editor, or in Unix by using the more
[filename] command. Know how many cases you expect to
have in the data set, and then CHECK YOUR LOG to be sure
that the correct number of cases have been read.
It is always good practice to print a few cases of
your data so that you can check it after it has been read
in. If you wish to print only a few cases, you can use
the firstobs and obs data set options.
After reading in your data, and checking that the
number of cases is correct, it is always a good idea to
get descriptive statistics on numeric variables, and
perhaps frequencies of character variables, so that you
can check the minimum, maximum and mean values to see if
they make sense. If you notice a problem, don't continue.
Stop there and check it out!
The sas log that is created with batch jobs is
necessary to help track down problems. If you have
problems, SAVE YOUR LOG, and SAVE THE COMMAND FILE. These
will help to track down the problem. It is not helpful to
keep an approximate copy of the file, because something
may have changed, which changes the entire nature of the
problem. One way to make this convenient is to create a
subdirectory with the problem files (SAS command file,
log file and output file) in it. Then the person who is
helping you will be much better able to identify and
solve the problem.
|
|
| |
|
CSCAR Home | About Us | Location | Workshops & Seminars | Software Help |
External Resources | Spatial Analysis GIS | Contact Us | Search
|
| |
|