-------------------------------------------------------------------------
                                                     Dallas, TX  6-Sep-91

This is a list of over 100,000 English words transcribed
orthographically. I obtained it from The Interociter bulletin board in
Dallas (214/258-1832). The original read.me file said that the list came
from Public Brand Software.

The original list contained 146,440 words, but I discovered that there
were thousands of duplicate words. I re-sorted the list and removed the
duplicates using the Unix utility uniq. The total number of words is now
109,582. I have repackaged the list into four files (the original was
five):

    File          Bytes    Words   Range
    ----------   -------   ------  -----
    words1.lst    315376    29839  A-D
    words2.lst    242484    23101  E-K
    words3.lst    325716    30439  L-R
    words4.lst    270759    26203  S-Z
    ----------   -------   ------
    Total        1154335   109582

This word list includes inflected forms, such as plural nouns and the
-s, -ed and -ing forms of verbs. Thus the number of lexical stems
represented in the list is considerably smaller than the total number of
words.

Evan Antworth
Academic Computing Department
Summer Institute of Linguistics
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.
Internet: evan@sil.org
UUCP:     ...!uunet!convex!txsil!evan
phone:    214/709-2418
fax:      214/709-3387
-------------------------------------------------------------------------

Contents of LEXICON.ZIP:

 Length  Method     Size  Ratio    Date    Time   Name
 ------  --------  -----  -----  --------  -----  ----
 179250  Deflated  39792   78%   01-01-80  00:14  LEX
   1152  Deflated    582   50%   05-28-86  15:50  KEYS.ORG
   1181  Deflated    414   65%   01-26-91  22:58  KEYS
  17153  Deflated   4006   77%   05-29-91  08:00  SAMPLE.DIC
   7433  Deflated   2350   69%   05-18-93  16:42  FINDVERB.PAS
   4195  Deflated   4024    5%   05-18-93  16:45  FINDVERB.EXE
 ------            -----   ---                    -------
 210364 bytes      51168   76%                    6 files

----------------------------------------------------------------------
From FINDVERB.PAS:

(****************************************************************************)
(* - LEX courtesy of Dave Keiras, University of Michigan                    *)
(*                                                                          *)
(* Format of LEX:                                                           *)
(* -------------------------------                                          *)
(* Column  Data............................                                 *)
(* 1       part of speech (see key below)                                   *)
(* 2       past participle flag                                             *)
(* 3       negative flag                                                    *)
(* 4       to be flag                                                       *)
(* 5       verb + ing flag                                                  *)
(* 6       aux flag                                                         *)
(* 7-26    word (in conventional lower case spelling)                       *)
(*                                                                          *)
(* Herewith a sample...                                                     *)
(*  ┌──────────────────────┬──────────────────────┬──────────────────────┐  *)
(*  │ B00000a              │ V00000abet           │ J00000abnormal       │  *)
(*  │ V00000abandon        │ V00000abets          │ J00000abnormally     │  *)
(*  │ V10000abandoned      │ V10000abetted        │ p00000aboard         │  *)
(*  │ V00010abandoning     │ V00010abetting       │ P00000about          │  *)
(*  │ V00000abandons       │ N00000abettor        │ p00000above          │  *)
(*  │ V00000abbreviate     │ N00000abettors       │ N00000abrasion       │  *)
(*  │ V10000abbreviated    │ N00000abilities      │ N00000abrasions      │  *)
(*  │ V00000abbreviates    │ N00000ability        │ N00000abrasive       │  *)
(*  │ V00010abbreviating   │ J00000able           │ N00000abrasives      │  *)
(*  │ N00000abbreviation   │ J00000abler          │ N00000absence        │  *)
(*  │ N00000abbreviations  │ J00000ablest         │ N00000absences       │  *)
(*  └──────────────────────┴──────────────────────┴──────────────────────┘  *)
(*                                                                          *)
(* Parts of Speech Key                                                      *)
(* ---------------------                                                    *)
(* (V) verb            (q) possessive pronoun/pronoun (e.g. "his")          *)
(* (J) adjective       (j) adjective/verb    (A) adverb                     *)
(* (a) adverb/verb     (B) article           (C) conjunction                *)
(* (I) interjection    (N) noun              (d) noun/adjective             *)
(* (n) noun/verb       (P) preposition       (O) pronoun                    *)
(* (o) adjective/pronoun (e.g. this, that)                                  *)
(* (Q) possessive pronoun (e.g. my, her, your, our, their, its)             *)
(* (X) don't care category                                                  *)
(*     --added X category for "not" so it will not be an adverb             *)
(* (p) preposition/adverb (e.g. aboard, above, around, as, before, below    *)
(*                                                                          *)
(****************************************************************************)

program findverb;

{  The LEX dictionary was compiled by Dave Keiras, of the University of
   Michigan, who has graciously made it available in the public domain.
   The grammatical markings in LEX (and the parsing theory they imply)
   are due to him; this program, on the other hand, is mine.

   This is a simple filter program operating on the LEX file.  This
   version finds verbs (i.e., lines marked with "V" in the first
   column), writes a file containing each line it finds, and doesn't
   touch the LEX file at all, so it's safe to use.
   It is so trivial to modify this to find anything you want that,
   rather than add several hundred lines of bullet-proofing and user
   interface, I offer this commented source code file for you to fool
   around with.  It is written in TurboPascal, Version 5.5, but will
   probably compile on everything from version 3 through 7 with only
   minor modifications, since it really doesn't do anything
   particularly complicated.

   Enjoy.
                              -John Lawler (jlawler@umich.edu)
                               University of Michigan              }

--------------------------------------------------------------------

 Length   Method     Size   Ratio    Date    Time   CRC-32    Name
 -------  --------  ------  -----  --------  -----  --------  ----
 1089117  Deflated  266003   76%   11-08-93  22:00  996a7f73  spanish.lex

A simple list of about 90,000 Spanish words. Courtesy of Dave Eddington,
of Middle Tennessee State University.

The ASCII codes used are as follows:

    *   = beginning of a word
    #   = end of a word
    V\  = a vowel with an acute accent
    n~  = n with a tilde over it
    u$  = u with umlaut

--------------------------------------------------------------------

The file "achumawi.zip" is an archive of several Shoebox databases of
Achumawi linguistic data.

Please note that the material in the Shoebox databases compressed in the
achumawi.zip file is copyright 1993, 1994 Bruce Nevin. The analysis is
incomplete work in progress. Use for any purpose requires written
permission from Bruce Nevin, 49 Sumner Street, Gloucester, MA
01930-1546, bn@lightstream.com.

This file is in ZIP compression format, and thus must be downloaded in
BINARY mode. Issue the BINARY command at the FTP prompt before using the
GET command to copy the files to your system, as follows:

    ftp> binary
    ftp> get achumawi.zip

If you need a program to uncompress the archive file on your DOS PC,
download "zip.exe" next:

    ftp> get zip.exe

If you're using a Mac, download the utility "unzip.cpt.hqx" in TEXT
mode:

    ftp> text
    ftp> get unzip.cpt.hqx

This will need to be de-BinHexed (Fetch can do this on the fly) and
de-compressed itself.
The de-archiver for Mac files can be downloaded (also in TEXT mode)
thus:

    ftp> get dearchiver

To quit FTP, issue the QUIT command:

    ftp> quit

--------------------------------------------------------------------
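
A note on reading LEX outside of Turbo Pascal: the column layout given
in the FINDVERB.PAS header (one part-of-speech character, five flag
columns, then the word in columns 7-26) can be read with a few lines of
Python. What follows is only a sketch of that layout, not a port of
FINDVERB.PAS. The names LexEntry, parse_lex_line and find_verbs are made
up for illustration, and the sketch assumes that a flag column holds "1"
when set, which is what the sample box suggests but the header does not
state outright.

```python
from typing import Iterable, Iterator, NamedTuple


class LexEntry(NamedTuple):
    """One LEX record, split by the column layout in the FINDVERB.PAS header."""
    pos: str               # column 1: part-of-speech code (see the key)
    past_participle: bool  # column 2
    negative: bool         # column 3
    to_be: bool            # column 4
    verb_ing: bool         # column 5
    aux: bool              # column 6
    word: str              # columns 7-26


def parse_lex_line(line: str) -> LexEntry:
    """Split one fixed-column LEX line into its fields."""
    record = line.rstrip("\r\n")
    if len(record) < 7:
        raise ValueError(f"LEX record too short: {record!r}")
    # Assumption: a flag is set when its column contains "1".
    flags = [ch == "1" for ch in record[1:6]]
    # Columns 7-26 hold the word; strip any padding spaces.
    word = record[6:26].strip()
    return LexEntry(record[0], *flags, word)


def find_verbs(lines: Iterable[str]) -> Iterator[LexEntry]:
    """Yield only the records marked "V" in the first column."""
    for line in lines:
        if line.startswith("V"):
            yield parse_lex_line(line)
```

Any iterable of lines will do as input to find_verbs, including an open
file object for LEX. Changing the test on the first column to another
code from the parts-of-speech key selects a different category, which is
the same kind of modification the FINDVERB.PAS comments invite.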