-- -- CSCAR  
-- --   center for statistical consultation and research  
School of Social Work
--
about us
about us
location
workshops
software help
external resources
spatial
jobs
contact
search
--
The Center For Statistical Consultation and Research
3550 Rackham Building
University of Michigan
Ann Arbor, MI 48109-1070
cscar@umich.edu
.
Software Help
 
Importing Data: ASCII To SAS
Introduction

A SAS data set is created by using a DATA STEP. The data step has 3 essential parts: the DATA STATEMENT, the INFILE (or CARDS) STATEMENT, and the INPUT STATEMENT.

The DATA STATEMENT names the SAS data set to be created. SAS data set names can have from 1 to 8 characters, must start with a letter or underscore (_), and the initial character can be followed by any combination of letters, numbers and underscores.

The INFILE STATEMENT indicates the raw data file to be read. If the raw data are in the same directory from which sas is initiated, the file name can be given without any path, but if the raw data are in another subdirectory, a path name must be given. The CARDS STATEMENT is used when raw data are read from the command file. Raw data are included in the command file, and at the end, a single semicolon is included on a line by itself to indicate the end of the raw data.

The INPUT STATEMENT tells SAS the names of the variables to be read in, and what form the raw data will be. SAS variable names follow the same rules as data set names.

Data transformations and recodes or other programming statements will come after the DATA STATEMENT, the INFILE STATEMENT, and the INPUT STATEMENT, but before the CARDS STATEMENT.

If your data are in a flat (ascii, line, text, unformatted) file, ie: the data are readable by humans, you may read the data and create a SAS data set with a Data step. If the data are lined up in specific columns, you may read the data with column style input or with formatted input. If the data are separated by a delimiter, such as blank spaces or commas, you may use list style input.


Reading raw data with column style input

Reading raw data with column style input: For column style input, the raw data must be lined up in specific columns. Each variable name is given, followed by the columns in which it is found. Character values are indicated by a $ sign.

data marflt;
	infile 'marflt.dat' firstobs=2  obs=424;
	input flight $ 1-4 dest $ 18-20 passngrs 34-36;
	run;

	proc print data=marflt;

	proc means data=marflt;
	run;
		

The infile 'marflt.dat' will need a fully qualified path name (eg: '~jansmart/rawdata/marflt.dat') if the file is not in the current subdirectory. The SAS data set MARFLT will be temporary, and will not be saved once the SAS session is completed.

Note the use of the firstobs and obs options in the infile statement. The firstobs option means that the first line that will be read from the raw data file MARFLT.DAT will be line 2. This may be especially helpful if there is information on the first line of a descriptive nature that you wish to skip when reading in the file. The obs option is actually more like a 'lastobs' option. It indicates to SAS the last observation that will be read in. In this case, there will be a total of 423 observations read in, starting at line 2, and continuing through line 424.


Reading raw data with formatted input

To read data with formatted input, first indicate the column in which to begin reading with an @ sign (e.g. @46), then give the variable name, and then indicate the format of the variable with a format of the form w.d for numeric variables, where w is the width of the entire variable, and d indicates the number of places after the decimal. Note that explicit decimals in the data will override a decimal specification given in the input format. The @ can be used to move around to different columns in the raw data, so that portions of the line may be skipped, if desired.

data marflt2;
	infile 'marflt.dat' ;
	input flight $3. @46 capacity 3. @52 freight 8.2;

	proc print data=marflt2;
	where capacity > 150;
title 'printout of data for flights';
title2 'with capacity of more than 150 passengers';

proc means data=marflt2;
title 'descriptive statistics for mydata.marflt2';
run;
		

Note the use of the where statement in the proc print shown above, which controls the cases that are printed. The where statement can be used to select cases for processing by any procedure. Note: You may mix column and formatted style input in the same input statement.


Reading in Raw Data with List Style Input

If your raw data are delimited by some character, such as blanks, or commas, and they are not necessarily lined up in columns, you may use list style input for reading the data into a SAS data set. It is best to have some place holder for missing values if you are using list style input to prevent mistakes in reading the data. Missing values are typically indicated by a period, or numeric values may be used for missing values. If numeric values such as 99 or 999 are used, be sure that the values you choose to indicate missing responses are not possible as valid values.

List style input is given by simply listing the variable names, with character variables being followed by a $. An example of reading in raw data with list style input is shown below:

 data medexam;
		 length lname $10;
		 input lname $ id sbp age;
		 cards;
 Smith  1028  105 .
 Williams 133725  86     49
 Brun    182906    108            56
		 ;

Note that in the above example, the raw data are not contained in another file, but are included right in the command file. This is indicated with a cards statement, and the raw data then follow. No semicolons are included in the raw data if this method of reading in values is used, and the end of the data is signalled by a semicolon (;) on a line by itself. The length statement is used to indicate that LNAME has a maximum length of 10 characters. This is useful so that SAS will not truncate character variables, since SAS automatically assigns a length for a character variable that is equal to the first value that it encounters.


List Input with the dlm Option

Any delimiters may be used to separate variable values in a raw data file for list style input (e.g. commas or slashes), but the default is blank spaces. The dlm option tells SAS the type of delimiter that your raw data uses, as shown below. By default, multiple occurrences of a delimiter (e.g. multiple blanks), will be read as one delimiter.

data class;
	 infile 'class.dat' dlm=',' ;
	 input lname $ sex $ age height sbp;

proc chart data=class;
	   vbar height sbp;
	   title 'vertical bar charts for selected variables';
run;

The dlm option in the example above indicates that the delimiters are commas for this input file.


List Input with the dsd Option

If you had data that were entered with commas delimiting the values, and you wished multiple commas to indicate missing values, you could do so by using the dsd option on the infile statement. For example, suppose your raw data were stored in a file called medexam.dat, and that it appeared as shown below:

1002,54,,,,,280
1003,37,f,67,129,118,142
1004,29,m,72,,186,,
1004,33,m,59,143,102,155

In this case, the commas with nothing between them indicate missing values. If this file were read into SAS with the following commands, the data that resulted would be in error, as shown in the data step and printout below:

data medexam2;
	infile 'medexam.dat' dlm=',';
	input id age sex $ height sbp dbp weight;
proc print data=medexam2;
	title 'printout of data';
	title2 'with multiple commas causing';
	title3 'errors in reading data';
run;

printout of data
with multiple commas causing
errors in reading data

OBS     ID     AGE    SEX    HEIGHT    SBP     DBP    WEIGHT

1     1001     54    280     1002      37       .      67
2     1003     29    m         72     186    1004      33


A look at the partial log for this problem indicates that there were problems in reading the raw data:

1 The SAS System
NOTE: Copyright(c) 1989 by SAS Institute Inc., Cary, NC USA.
NOTE: SAS (r) Proprietary Software Release 6.09  TS027
	  Licensed to UNIVERSITY OF MICHIGAN, ICPSR, Site 0009403007.

NOTE: AUTOEXEC processing completed.

1          data medexam2;
2          		infile 'medexam.dat' dlm=',';
3          		input id age sex $ height sbp dbp weight;
4
5          options linesize=72;
6
7

NOTE: The infile 'medexam.dat' is:
	  File Name=/afs/umich.edu/user/k/w/kwelch/510/medexam.dat,
	  Owner Name=kwelch,Group Name=staff,
	  Access Permission=rw-r--r--,
	  File Size (bytes)=87

NOTE: Invalid data for DBP in line 2 9-9.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----
6--
2         1002,37,f,67,129,118,142 24
ID=1001 AGE=54 SEX=280 HEIGHT=1002 SBP=37 DBP=. WEIGHT=67
_ERROR_=1
_N_=1
NOTE: 5 records were read from the infile 'medexam.dat'.
	  The minimum record length was 0.
	  The maximum record length was 24.
NOTE: SAS went to a new line when INPUT statement reached past
the end
	  of a line.
NOTE: The data set WORK.MEDEXAM2 has 2 observations and 7
variables.

The following command file, with the dsd option, would read the data correctly:

data medexam2;
infile 'medexam.dat' dlm=',' dsd ;
input id age sex $ height sbp dbp weight;
proc print data=medexam2;
title 'printout of data';
title2 'with multiple commas read as missing';
title3 'error is fixed';
run;

printout of data
with multiple commas read as missing
error is fixed

OBS     ID     AGE    SEX    HEIGHT    SBP    DBP    WEIGHT

1     1001     54              .        .      .      280
2     1002     37     f       67      129    118      142
3     1003     29     m       72        .    186        .
4     1004     33     m       59      143    102      155
	   



The Missover Option with List Input

Sometimes you will think there are a certain number of cases in your data set, but notice that you have fewer than you expected when you check your log. (Often, you will have exactly half the number of cases that you expect). This can be especially troublesome when there are missing values scattered throughout your data. One reason that this may happen is that in list style input, SAS simply keeps reading until it finds enough values to fill in all of the variables that you had specified in your input statement, so it will skip to the next line and continue reading, unless you tell it not to do that. The missover option indicates that SAS should fill out a given case by only using the information on a single line. An example is shown below:

data medexam2;
	infile 'medexam.dat' dlm=',' missover ;
	input id age sex $ height sbp dbp weight;
 proc print data=medexam2;
	title 'printout of data';
	title2 'with multiple commas read as missing';
	run;

Note that this will still cause data to be read incorrectly within a line, because the dsd option is not used here, but the correct number of cases will be read.


Cautions

Before reading in raw data to SAS or any other statistical program, it is always wise to check the data by using an editor, or in Unix by using the more [filename] command. Know how many cases you expect to have in the data set, and then CHECK YOUR LOG to be sure that the correct number of cases have been read.

It is always good practice to print a few cases of your data so that you can check it after it has been read in. If you wish to print only a few cases, you can use the firstobs and obs data set options.

After reading in your data, and checking that the number of cases is correct, it is always a good idea to get descriptive statistics on numeric variables, and perhaps frequencies of character variables, so that you can check the minimum, maximum and mean values to see if they make sense. If you notice a problem, don't continue. Stop there and check it out!

The sas log that is created with batch jobs is necessary to help track down problems. If you have problems, SAVE YOUR LOG, and SAVE THE COMMAND FILE. These will help to track down the problem. It is not helpful to keep an approximate copy of the file, because something may have changed, which changes the entire nature of the problem. One way to make this convenient is to create a subdirectory with the problem files (SAS command file, log file and output file) in it. Then the person who is helping you will be much better able to identify and solve the problem.

 
CSCAR Home | About Us | Location | Workshops & Seminars | Software Help | External Resources | Spatial Analysis GIS | Contact Us | Search
 
--
 
Copyright © 1998 - 2001 The Regents of the University of Michigan, Ann Arbor