FORTH STRINGS TABLE OF CONTENTS: 1. OVERVIEW 2. SYNTAX 3. PARAMETERS/DATA STACK CONVENTIONS 4. INTERNAL REPRESENTATION/STRUCTURE OF STRINGS 5. DATA ABSTRACTION 6. MAJOR STRING FUNCTIONS 6.0. A NOTE ON THE SYMBOLOGY USED HEREIN 6.1. STRING LITERALS 6.2. STRING DEFINITION 6.3. STRING CONSTANTS 6.4. STRING VARIABLES 6.5. STRING REFERENCE 6.6. BASIC STRING MANIPULATION 6.7. STRING/CHARACTER FETCHES/STORES 6.8. STRING INSERTIONS 6.9. STRING DELETIONS 6.10. STRING REPLACEMENTS 6.11. STRING ROTATIONS 6.12. STRING COMPARISONS 6.13. STRING PATTERN MATCHING 6.14. STRING SET OPERATIONS 6.15. STRING TRANSLATION 6.16. STRING ENCODING/DECODING 7. AREAS NOT COVERED 7.1. INPUT/OUTPUT 7.2. STRING STACK 7.3. PARSING STRINGS 8. THE FORTH SOURCE FILE 9. THE VALIDATION FILE 10. FORTH-83 DEVIATIONS 11. INTERNALS 11.1. INTERNAL STRING STRUCTURE 11.2. VISIBILITY 11.3. ERROR CHECKING 11.4. MEMORY OPERATORS 11.5. THE ISSUE OF STANDARDS 11.6. NULL STRINGS 11.7. CONVERSION PROBLEMS 12. GLOSSARY 1. OVERVIEW This file discusses the concept of string manipulation primitives within the Forth language. The formulation attempts to be as general as possible in both problem approach and solution, that is, in general, no a priori assumptions are made about what is "best" nor about any imposed constraints, and in general, I strive for the most complete, minimal solution. 2. SYNTAX Following the general Forth philosophy of making word names short yet meaningful, the convention of using the dollar sign "$" to represent string operations is used throughout. All string words begin with an initial "$" with two exceptions. Additionally, where possible, other Forth naming conventions are "blended in". For example, the Forth word chosen to copy a string is merely "$$!" (pronounced "s-s-store"). Which means to copy/store (!) one string ($) into another string ($). In a like fashion, the word for string concatenation is "$$+" and the word for typing a string is "$.". This seems, to me, more meaningful (in the context of Forth) than, say, the words "STRCAT" or "STRPRINT". As a caveat, I would like to add that I am NOT an advocate of BASIC and the use of the "$" prefix I do not regard as giving the string operators presented here a "BASIC-like" flavor. In the Forth source code provided (and in the this documentation as well) a "pronunciation guide" is given as a suggestion for pronouncing the string manipulation primitives provided herein. It can be found (in double quotes) to the far right of the block on the line which begins the colon definition. Two successive dollar signs "$$" as a word prefix indicate that two string references are needed (e.g., the concatenation word $$+ previously given). In general there is no restriction on the two string references so long as they are validly defined strings. The two references may, in fact, be the same. For example, if the string variable "STV" contains the value "Hello there!", then the code: "STV STV $$+" would give STV the value "Hello there!Hello there!". (Of course the "insert string in string" word ($$INS) and the "replace string in string" word ($$REP) - to be introduced later - should not contain the same reference for both strings.) Whenever a string primitive begins with "$$" the first string is always understood to be the "source" string and the second string is always the "target" string. So, using the concatenation example again, the code: "STV1 STV2 $$+" would add/concatenate STV1 to the end of STV2. A prefix of "$C" indicates that a string and a character reference are needed/returned. For example, the string operator "$C!" (pronounced "s-c-store") takes a character, a string, and an index (which starts at 0) into the string and stores the character in the string at the designated index. Using the previous variable STV, then the code: "65 STV 1 $C!" would give STV the value "HAllo there!Hello there!". A single "$" indicates that only a single string reference is needed. For example, the string operator "$NULL" (pronounced "string-null") sets a given string to null. Hopefully the pattern and logic behind the syntax chosen for Forth string operators will become more apparent as additional examples are given. 3. PARAMETERS/DATA STACK CONVENTIONS The method selected to pass parameters to/from the string operators provided herein follows a regular, consistent form. Operands (e.g., strings and characters) always precede operators (e.g., indices, lengths, and counts). The general form of parameter passing is: {operands} {operators} where {operands} = {addr,length|char|string} {string} and {operators} = {index} {length} {number} {status} {flag} The enclosing brackets (i.e., "{" and "}") indicate the optional occurrence of a parameter and the parameters themselves are: addr = a machine address, char = an 8 bit ASCII character, string = an abstract string reference which may be further qualified if needed (e.g., "string1"), index = an index into a string (0..n), length = the length of the (sub)string in question, number = a number or character count, status = a set of values indicating possible outcomes, and flag = a TRUE/FALSE flag (a returned value, designated as "t | f"). Finally, the general Forth usage of "consuming" a data stack item is followed. 4. INTERNAL REPRESENTATION/STRUCTURE OF STRINGS As a start, I feel one should clearly separate the actual internal representation of Forth strings from their external appearance. That is, the string data type (just as any other data type) should be abstract. The Forth programmer should be given a rich enough set of string primitives so that he/she need not have to be concerned with the internal representation. The implementation of the string manipulation capability itself must, however, be closely concerned with the internal representation/structure. In general, two major classes of representation for strings have tended to evolve: counted strings and (null) terminated strings. Terminated strings (also called ASCIIZ strings) are simply strings which are terminated by a special symbol (generally the ASCII zero, or nul character). This is the representation chosen, for example, by the C language. Counted strings are strings which reserve some initial portion of the data structure itself to contain the following physical string length. There are, of course, advantages to both string types. The chief advantage of the terminated string type is that the string may be arbitrarily long. The chief advantage of the counted string is that it is fast (i.e., many manipulations on strings require knowing the string length and this can be determined immediately in the case of counted strings). A disadvantage with the terminated string is that the special terminating symbol cannot be used within the body of the string and a disadvantage with the counted string is that there is always some fixed (system imposed) limitation on the maximum string size. All in all, I feel the counted string representation is the superior choice for Forth. In Forth, speed is always a consideration and the problem of having a fixed string size limitation is resolved to a large extent by ensuring that the implementation treat strings as an abstract data type. As a consequence, the string implementation given here uses counted strings. I did not however select the current Forth usage (e.g., as reflected in the word COUNT) of representing a counted string with a preceding count byte. My feeling was that in our (current) era of 32-bit micros, a single byte limiting the string length to 255 characters is simply too restrictive. I have, rather, used an initial 16-bit word to represent the current string length. Note also that this internal representation can be easily changed without, in general, requiring application program modifications since the string is implemented as an abstract data type. Let me make the emphatic point that the string operators defined here are *totally* independent of implementation considerations. That is, there is nothing in the definition of the string primitives provided which, in any way, requires any assumptions about the underlying string structure. This point may seem minor, but it is an extremely important one. 5. DATA ABSTRACTION Having just discussed some of the considerations involved in the choice of the internal representation/structure of strings, I would now ask that the reader (from this point forward) ignore these considerations entirely! The reason is that such considerations should only be a factor for implementation (i.e., a vendor concern or a concern if you implement a string package yourself). The string type, from the programmer's perspective, should be an abstract data type. The power and advantage of data abstraction, in general, is well known, so I shall not go into further detail here. The key issue is, rather, how one reflects the abstraction at the programming level. A natural choice would seem that of implementing strings as an address/pointer type. That is, any reference to a string merely leaves an address upon the data stack. This address is meaningless (per se) to the programmer, but is the "hook" used by all of the underlying string primitives to perform string operations. Of course, this implies, that the set of string primitives must be sufficiently powerful not to require the direct programming access of the abstract data type. Not only does the use of a pointer type keep the underlying implementation hidden, but it (as well) allows string references to be treated as single data elements (since the FORTH-83 standard defines the address type to be the same as an unsigned number which is represented as a single stack element). Thus, if two string references are on TOS they may be swapped via "SWAP" rather than having to introduce a new word such as "$SWAP". I should add that I do not agree with any number of the particulars of the FORTH-83 definition in this regard, but (at least for this subcase) it serves the purpose well. Although the definition and implementation of string types as abstract data structures provides immeasurable benefit in the ability to write clean, clear code using strings - it provides another major benefit as well. Portability. That is, if strings (or any other data type) are abstract, then they are prohibited (at the programming/application level) from assuming anything about the underlying system. The underlying details must, of course, be handled by the Forth system - but, they more properly belong at this "hidden" level anyway. Since the code/application can assume nothing about the underlying implementation, then - by definition - the code is portable. The only "wrinkle" in the above logic is that, unfortunately, some sort of "interface" may be needed between strings and some other vendor supplied package/subsystem (e.g., I/O) which is non-portable. To this end, an "import" and "export" capability has been added which "translates" between (to/from) the internal/abstract string representation and an external/memory sequential one (which seems to be a common format across vendors for the data content of strings). Note that (ideally) this import/export capability should only be temporary. That is, once other extensions to the Forth language have been defined which achieve true portability/abstraction, then the need for importing and exporting will go away. 6. MAJOR STRING FUNCTIONS The major string functions proposed are discussed next. In general they fall into the categories of: - String literals - String constants - String variables - String reference - Basic string manipulation - String/character fetches/stores - String insertions - String deletions - String replacements - String rotations - String comparisons - String pattern matching - String set operators - String translation - String encoding/decoding Admittedly the taxonomy is arbitrary but it should, hopefully, represent a good presentation framework and it should, at least, be exhaustive. All reasonable string functions I can think of are either represented within the primitives given or capable of being built from these primitives. 6.0. A NOTE ON THE SYMBOLOGY USED HEREIN In discussing the major string functions to follow, excerpts from the code in "FSTRINGS.SCR" will be used. Section 3 (i.e., PARAMETERS/DATA STACK CONVENTIONS) explains the symbology used here. 6.1. STRING LITERALS As a way of introducing the string data type into a program, a string literal capability must be provided. Since a string may be any arbitrary sequence of characters, some way of delimiting the string must be found which does not interfere with the capability to define an arbitrary sequence of characters. The string literal (interpretative) is introduced via: $LIT "This is a string" It's action is to read the input stream until a non-blank character (i.e., a "delimiter") is encountered and to return the read string (until a subsequent delimiter is encountered) as an abstract string reference upon the stack. The delimiters are not included as part of the returned string. However, embedded delimiters may be included in the string by simply using two consecutive occurrences of the delimiter. The definition of $LIT (as it occurs within the code) is: \ : $LIT ( -- string ) \ "s-lit" \ Two string literal primitives are provided - "$LIT" for \ interpretation and "[$LIT]" for colon definitions - otherwise \ their actions are the same. \ The string in the input stream following $LIT (or [$LIT]) is \ delimited by the first following non-blank character. The \ string is defined by all characters immediately following the \ delimiting character until a subsequent delimiter is \ encountered in the input stream. The delimiter, itself, may \ also be included in the string by using two successive \ delimiters to represent a single delimiter. For example: \ $LIT "The double quote character is """ Within a colon definition "[$LIT]" at run time is equivalent to $LIT interpretatively. Thus the word "X", defined as: : X [$LIT] ' This is a string' ; leaves the string " This is a string" as a reference upon TOS when executed, and is equivalent to the interpretative sequence: $LIT | This is a string| The use of "$LIT" leaves the abstract string data type on TOS. This data type is used in the remainder of the string primitives provided herein. The above use of [$LIT] within a compiled word cannot be used when run time interactive input is needed. For this reason, two primitives are provided which read typed keyboard characters directly into a string until a character (i.e., ASCII decimal value 13) is encountered. The two primitives are: : $KEY ( string -- ) \ "s-key" \ Reads keyboard characters into "string" until . and : $$KEY ( string -- ) \ "s-s-key" \ Same as $KEY except input echoed back to console. The string data type may be compiled directly into the current dictionary via the use of the "$," word once it is placed upon the stack via the $LIT word. Alternatively, the ",$" word may be used to read a string from the Forth input stream and place it into the current dictionary. Thus the two segments of code: $LIT "This is a string" $, and ,$ *This is a string* are equivalent. The source code description of the above discussed words follows: : $, ( string -- ) \ "s-comma" \ Compiles a string into the dictionary. : ,$ ( -- ) \ "comma-s" \ Compiles the following word string into the dictionary. A point to keep in mind with $LIT is that only one string literal may be defined concurrently. That is, $LIT returns a "standard" string reference address, so if, for example, "$LIT 'hello' $LIT 'there'" is typed in, then the two string references left on TOS both "point to" the string "there". In general, this type of situation should not occur. 6.2. STRING DEFINITION Once a string literal capability is defined, then string data types may be defined. The following subsections discusses the two possible cases of string constants and string variables. 6.3. STRING CONSTANTS No provision is made for implementing string constants. The primary reason is that since the string is implemented as an abstract data type, then a string constant is no different in reference from a string variable. Thus the only meaning of a string constant is that assignments to string constants would be caught at compile (or run) time and the corresponding compilation (or run) aborted. This would have the negative effect of introducing additional complexity in the package with a possible run time speed penalty as well. Since the author felt this direction to be counter to his perceived "philosophy" for Forth, the capability was not implemented. In a like fashion, no error checking has been provided either. The user can, of course, modify this package as he/she desires to introduce a constant and/or error checking capability if so desired. In some Forth systems I've seen, it appears that the only difference between a string constant and a string variable is that the former allows initialization (i.e., preassigning a value) and the latter does not. In such systems a reference to a string constant or a string variable *both* leave identical stack data types. Thus in these systems there is nothing (save the programmer's own good judgment) to prevent the run time alteration of string constants. (Note that this is entirely different from the use of standard Forth scalar constants and variables where the former leave a value and the latter leave an address on the stack.) Such usage of the terms "constant" and "variable" seem, to the author at least, to be artificial, contrived, and misleading. That is, a string constant, in such systems, is not a constant but, rather, an initialized variable - and this is merely a minor syntactical distinction. The implementation given here allows allocating a string variable in either an initialized or a NULL state with equal facility and avoids the use of string constants altogether, since (as previously mentioned) this would invariably extract a performance penalty. For example, if a string constant were provided, then the appropriate definition might be something like: $LIT "This is a string constant" $CON STR-CON However this same capability is provided with an initialized variable via: CREATE STR-CON ,$ "This is a string constant" I fail to see much difference between the two nor any advantage to the use of a "constant". 6.4. STRING VARIABLES String variables are "CREATE"d just as are regular Forth variables. This is due to the implementation of the abstract string data type as a pointer/address. For example, to create the string variable SV1 with the initial value: " Hello world!" - one could code: $LIT "Hello world!" CREATE SV1 $, or CREATE SV1 ,$ "Hello world!" or CREATE SV1 12 $ALLOT $LIT "Hello world!" SV1 $$! As shown, if a string variable is needed, but its initial contents are not known, then the variable (plus the dictionary space required) can be defined via the CREATE and $ALLOT words. For example, to reserve a string variable of 97 characters, one could code: CREATE SV1 97 $ALLOT or 97 CREATE SV1 $ALLOT The use of $ALLOT not only reserves the necessary dictionary space but initializes the string to NULL (i.e., zero length) as well. This is important since a null string is different from an unintialized string (and the string package provided here incorporates null strings). All strings are initialized with this package. The source code definition of $ALLOT is: : $ALLOT ( number -- ) \ "s-allot" \ Reserves "number" characters in the dictionary for a string \ and sets the string to NULL. Although initialization is provided, error checking is not. This means that it is the responsibility of the Forth programmer to ensure that the dynamic (i.e., run time) size of strings never exceed their initially (implicitly or explicitly) allocated sizes. (Of course, error checking can be added if so desired, at a performance cost.) 6.5. STRING REFERENCE Once strings are defined as constants or immediately available as literals, then they may be referenced. "Referencing", in this sense, means an immediate operation with a string without altering its value. Four basic reference operators are provided: 1) determining the length of a string, 2) printing a string, 3) "importing" a string from another package (e.g., I/O), and 4) "exporting" a string to another package. String length is determined via: : $LEN ( string -- length ) \ "s-length" \ Returns the length of string. Printing a string is accomplished via: : $. ( string -- ) \ "s-dot" \ Prints a string. Their actions should be self-explanatory. Since almost all Forth packages of which I am aware treat the data portion of a string via contiguous memory locations and since it may be necessary to "send/export" a string to some other vendor provided Forth package (e.g., I/O) and/or to "receive/import" a string from some other vendor provided package - an import/export ability is provided. Note that exporting and importing strings are necessary (but hopefully temporary) evils. That is, once the "string data type" has been established within the framework of the Forth standard and once all packages recognize (and act appropriately upon) this data type - then the export/import artifact will no longer be needed. The definitions of import and export are: : $IMPORT ( addr length string -- ) \ "s-import" \ Imports a string from addr, length. and : $EXPORT ( string addr -- ) \ "s-export" \ Exports a string to addr. Note that is is the user's responsibility when using $EXPORT to ensure that enough contiguous memory is available/reserved at "addr" to hold the data portion of the string. 6.6. BASIC STRING MANIPULATION The use of the definition "basic string manipulation" (as was the previous case of string reference) is somewhat arbitrary. What I am trying to capture here are those essential actions necessary for *any* string package. Two basic string manipulation operators are provided; they are: : $NULL ( string -- ) \ "s-null" \ Forces a string to NULL. and : $$+ ( string1 string2 -- ) \ "s-s-plus" \ Adds (concatenates) string1 onto string2. Again, the actions should be self-explanatory. 6.7. STRING/CHARACTER FETCHES/STORES Following the Forth fetch/store convention, a number of operators are provided which perform fetches/stores across strings/characters. They are: : $C! ( char string index -- ) \ "s-c-store" \ Stores "char" in "string" at position "index". : $C@ ( string index -- char ) \ "s-c-fetch" \ Fetches "char" from position "index" in "string". : $$! ( string1 string2 -- ) \ "s-s-store" \ Stores string1 in string2. Original contents of string2 are \ lost. and : $$@ ( string1 string2 index length -- ) \ "s-s-fetch" \ "string2" is built/fetched using the "string1" substring \ starting at position "index" for "length" characters. The $C@ and $LEN primitives provide, essentially, the capability to "examine" any string in terms of size/content, and the $C! primitive provides the primary method of altering a string. The $$@ operator is, in effect, the "substring" operator used in other languages. When using $$@, the original contents of string2 are lost. 6.8. STRING INSERTIONS Often it is necessary to dynamically "add" information into a string; the string insertion operators provide this capability, and they are: : $CINS ( char string index -- ) \ "s-c-ins" \ Inserts "char" in "string" with "char" at position "index". \ Remaining characters, if any, are moved right. and : $$INS ( string1 string2 index -- ) \ "s-s-ins" \ Inserts "string1" into "string2" starting at position "index" \ of "string2". Remaining characters, if any, are moved right. When a character, or string, is inserted into another string, the original character(s) - if any - beginning at the point of insertion are right shifted past the inserted character(s). Note that: "string1 string2 DUP $LEN $$INS" is equivalent to: "string1 string2 $$+" The latter usage is to be preferred since it is simpler to understand and faster in execution. 6.9. STRING DELETIONS The converse of string insertion (just presented) is string deletion. One general purpose and several special purpose string deletion primitives are provided. The general purpose string deletion primitive is: : $DEL ( string index number -- ) \ "s-del" \ Deletes "number" characters from "string" starting at \ position "index". Although any of the special purpose string deletion primitives presented next can be built using $DEL, $C@ and $LEN, they are provided based on both the perceived frequency of use and in order to standardize naming conventions. They are: : $|TRIM ( string number -- ) \ "s-left-trim" \ Deletes "number" characters from the left/start of "string". : $TRIM| ( string number -- ) \ "s-trim-right" \ Deletes "number" characters from the right/end of "string". : $|SPACES ( string -- ) \ "s-left-spaces" \ Trims leading spaces from string. and : $SPACES| ( string -- ) \ "s-spaces-right" \ Trims trailing spaces from string. 6.10. STRING REPLACEMENTS Having now considered string insertions and deletions, string replacements are now addressed. A single, general purpose string replacement primitive is provided; it is: : $$REP ( string1 string2 index -- ) \ "s-s-rep" \ Replaces current substring in "string2" with "string1" \ starting at position "index" of "string2". 6.11. STRING ROTATIONS The ability to circularly rotate strings left/right is also provided via four primitives. The single shift string primitives are: : $ROT ( string -- ) \ "s-right-rote" \ Rotates a string right one character. The multiple shift string primitives are: : $<>ROT ( string number -- ) \ "s-many-right-rote" \ Rotates "string" right "number" characters. Note that neither $<>ROT are necessary since $<>ROT can be similarly defined. The rationale behind providing $<>ROT was simply that of execution time efficiency. 6.12. STRING COMPARISONS The determination of string ordinality (i.e., the relative lexicographic order of two strings) can be satisfied with a single, all purpose, string comparator (i.e., "$$COMPARE") which compares two strings and returns -1, 0, or +1 depending on whether the first string is less than, equal to, or greater than the second string, respectively. This implementation treats the null string as having the lowest possible ordinality (and there is valid mathematical reason for doing so). Further, the ordinality of a string is based upon the ASCII collating sequence. Although the "$$COMPARE" primitive is adequate to resolve all questions concerning string ordinality, from a coding/programming perspective, it is often more expedient to ask a specific question (such as: Is string1 less than string2?). For this reason, and also from the standpoint of standardizing string comparator terminology, a number of additional, special purpose string comparators are introduced (e.g., $$=, $$<, $$<=, etc.). As might be expected, these comparators are all simple words invoking the $$COMPARE comparator. The single general purpose string comparator is: : $$COMPARE ( string1 string2 -- status ) \ "s-s-compare" \ Returns -1, 0, or +1 depending on whether string1 is \ lexicographically less than, equal to, or greater than \ string2, respectively. And the six special purpose string comparators are: : $$= ( string1 string2 -- t | f ) \ "s-s-equal" \ Returns -1 if string1 = string2, else returns 0. : $$< ( string1 string2 -- t | f ) \ "s-s-less-than" \ Returns -1 if string1 < string2, else returns 0. : $$<= ( string1 string2 -- t | f ) \ "s-s-less-than-or-equal" \ Returns -1 if string1 <= string2, else returns 0. : $$> ( string1 string2 -- t | f ) \ "s-s-greater-than" \ Returns -1 if string1 > string2, else returns 0. : $$>= ( string1 string2 -- t | f) \ "s-s-greater-than-or-equal" \ Returns -1 if string1 >= string2, else returns 0. and : $$<> ( string1 string2 -- t | f ) \ "s-s-not-equal" \ Returns -1 if string1 <> string2, else returns 0. 6.13. STRING PATTERN MATCHING The ability to locate arbitrary characters and substrings within a string is provided. The two primitives for locating an arbitrary character within a string are: : $CFIND ( char string -- index | -1 ) \ "s-c-find" \ Searches for leftmost occurrence of "char" in "string". \ Returns "index" if found, else returns -1. and : $CFIND< ( char string -- index | -1 ) \ "s-c-find-back" \ Searches for rightmost occurrence of "char" in "string". \ Returns "index" if found, else returns -1. The primitive which searches for a substring (to include the searched string itself) within a string is: : $$FIND ( string1 string2 index length -- index | -1 ) \ "s-s-find" \ Searches for the first occurrence of the string1 substring \ starting at "index" for "length" characters in string2. \ Returns index if found, else returns -1. Note that "index" and "length" identify the substring within "string2" to be used for matching purposes. 6.14. STRING SET OPERATIONS Two set theoretic string primitives are provided. In both cases a character is considered a set element and a string is considered a set. The two set theoretic string primitives are: For single element set membership: : $CMEM ( char string -- t | f ) \ "s-c-mem" \ Returns -1 if "char" is in "string", else returns 0. Note that the $CMEM primitive is sufficient (along with other primitives introduced here such as $NULL, $C@, $$+, etc.) to build any necessary higher order set theoretic string operators. For multiple element set membership: : $$VER ( string1 string2 -- index | -1 ) \ "s-s-ver" \ Verifies that string2 contains only those characters in \ string1 by returning a -1. Otherwise the index of the first \ character in string2 not contained in string1 is returned. The $$VER (verify) operator is borrowed from PL/I and it is more powerful, and practical, that might at first be expected. As an example if we define: CREATE NUMBERS ,$ "0123456789" Then any string can be tested to see if it contains only numbers with the word: \ Returns -1 if "string" contains only the ASCII characters 0..9, \ else returns 0. : NUMBERS? ( string -- flag ) NUMBERS SWAP $$VER 0< ; 6.15. STRING TRANSLATION Two straightforward string translation primitives are provided based, primarily, on frequency of use. They translate lower-case characters in a string to upper-case, and vice-versa. They are: : $>UPPER ( string -- ) \ "s-to-upper" \ Converts any lower-case characters in "string" to upper-case. and : $>LOWER ( string -- ) \ "s-to-lower" \ Converts any upper-case characters in "string" to lower-case. 6.16. STRING ENCODING/DECODING String encoding (i.e., translating a number to its ASCII string representation) and string decoding (i.e., translating an ASCII string number representation to a number) are handled by three primitives which are modeled after the currently defined FORTH-83 word set which performs double number conversion (i.e., "<#", "#", "#>", etc.). String encoding, or decoding, is initiated via the use of the "$CONVERT" word. Its definition is: : $CONVERT ( string index -- ) \ "s-convert" \ Defines a string and index within the string to be used for \ subsequent string-to-number or number-to-string conversions. An initial call to $CONVERT, in effect, defines a specific string and a specific place/index within that string to be used for subsequent encoding or decoding. String encoding is accomplished via successive calls to the word "$N". Its definition is: \ : $>N ( -- number | -1 ) \ "s-to-n" \ Converts the character at the string/index position defined \ by the last call to $CONVERT to a number if possible. An \ error (i.e., -1) is returned if the character is non-numeric \ (i.e., does not lie between 0..BASE-1). The index position \ established in the call to $CONVERT is always incremented \ after a call to $>N. The fact that $>N always increments the index position is important since this allows $>N to be used to "search" a string for the first valid numeric character (and then proceed with the decode operations). The primitives given here are quite simple (more so than the "<#", "#>", and associated word set) and it is assumed that the user will tailor the "higher levels" of string encoding/decoding to his/her taste. 7. AREAS NOT COVERED Since this paper tries to concentrate exclusively on Forth strings, a number of related topics were purposely excluded. In all cases the primary reason for exclusion was that the topic area would necessarily introduce other major considerations which would lay well beyond the area of strings and thereby cause a considerable loss of focus. 7.1. INPUT/OUTPUT Input/output, per se, has not been addressed in this file. The reasoning is straightforward. Forth, itself, provides minimal input/output facilities. The Forth machine (upon which the Forth language itself is based) is a virtual machine with an on-line console (for interactive input/output) and a block file system for everything else. This obviously falls far short of an input/output system in the traditional sense. Correspondingly, the string functions given will work with the console and the block file (i.e., the Forth input stream). No attempt has been made to extend this concept. The reasoning is simply that input/output considerations for Forth are a separate issue, of which, string structures are a subcomponent (e.g., any binary read capability could easily input Forth strings - along with any other internal Forth data type for a given system). In truth, as long as string primitives are provided to read string literals from a standard Forth block file (as they are in this implementation) - then this, in itself, handles the issue of portability. So, in essence, there is no need to augment string primitives beyond this point (at least from the perspective of portability). Note: this is not to say that many Forth systems do not have excellent input/output facilities (well beyond the FORTH-83 standard) - it is rather to say that this is an independent subject and more properly treated as input/output instead of strings. 7.2. STRING STACK The idea and use of a string stack will also be considered outside the bounds of this discussion. This, in no way, indicates that I feel the idea of a string stack is bad (I think it is an excellent idea). It is rather because it introduces additional issues and the line has to be drawn somewhere. For example, if a string stack were introduced then string operations could proceed independently of data stack operations with no confusion between the two. This is excellent except that it completely changes the semantics/effect of the same code when string and data stack operations are intermixed. For example, if "SR1" is a string reference, then the code: "3 SR1 7 + $." Could be expected to produce entirely different results depending on whether or not a string stack were implemented. Of course this same logic applies if a floating stack (etc.) is introduced. The point, however, is that this is really more of a stack (and semantics) issue than one of a string manipulation issue. I have long argued for multiple stacks in Forth (and even *God forbid* for a type stack), but the arguments have fallen upon deaf ears. I'm getting very tired of arguing at this point. 7.3. PARSING STRINGS Most C language implementations (for example) provide a healthy serving of string parsing functions. This is nice except that often underlying assumptions must be made (e.g., what characters constitute a "whitespace") which may not apply in all/most situations - and, of course, if you don't need it - why have it? I can think of no parsing functions which cannot be built upon the primitives provided here if needed - so I've left them out. Additionally, since the string package provided herein is primarily targeted toward writing application code (rather than, for example, parsing Forth code) - no provisions have been made for parsing functions as they relate to the Forth language itself. 8. THE FORTH SOURCE FILE The file "FSTRINGS.SCR" contains all of the Forth string primitives discussed in this document. 9. THE VALIDATION FILE The file "FSTEST.SCR" contains a set of validation tests to ensure that the string primitives provided in FSTRINGS.SCR function properly. Although not exhaustive, the validation tests should give reasonable confidence that this package will work on your particular Forth system. (The validation package will also attempt to isolate malfunctions down to a specific primitive if possible.) 10. FORTH-83 DEVIATIONS As far as I am able to determine FSTRINGS.SCR is all kosher FORTH-83 Standard. I use words from the double number extension word set (i.e., 2DUP, 2DROP, 2SWAP, etc.). Also I use the BRANCH word from the system extension word set in [$LIT]. I would suspect any reasonable FORTH-83 implementation would provide these words (along with the "\" word which is also used). I have tested the system (although not extensively) under both MasterFORTH and Laxen & Perry's F83 - no problems. I would be glad to assist anyone who encounters difficulty in getting this system to run - but, remember, the whole idea is just to provide working FORTH-83 code to demonstrate the functions themselves so that they can then be rewritten consistent with the underlying Forth system and extended to your taste. 11. INTERNALS Some of the key points which may be of assistance in understanding the code/logic in file FSTRINGS.SCR follows: 11.1. INTERNAL STRING STRUCTURE Although nothing in the use of the primitives provided here makes any assumptions about the underlying internal string structure used, this information is required for the actual implementation itself. The internal representation selected for strings in this implementation was that of counted strings. The initial string *word* contains the count (which may be 0 or null). Thus the limit on the working size of a string word is 0..32,767 bytes due to the use of signed number arithmetic. The reserve scratch area "_BUFFER" has been allocated at 1024+2 bytes. The only words which reference _BUFFER are $LIT, $<>ROT. Other than this everything is done "core-to-core" via CMOVEs. _BUFFER's initial allocation (and use) should be thought out carefully when implementing this system since it represents a (potentially) large permanent allocation of space. For example, it may be more expedient to temporarily "grab" unused dictionary space for this purpose and avoid any permanent loss of space. How one does this is dependent upon the particular Forth system being used. Another point to keep in mind is that only one _BUFFER is provided. Thus one cannot have two string literals simultaneously defined. (Also, for example, a call to $<>ROT will translate the "number" to the modulus of the length of the string given, but that's hardly editing!) 11.4. MEMORY OPERATORS The string primitives provided herein do most of their real "work" through the underlying "memory" primitives (i.e, CMOVE, CMOVE>, COMPARE, CFIND, CFIND<, and -optionally- SAME?). For this reason alone it would be advisable to code these memory operators in assembler. (Still another reason is that these memory operators should also be useful independent of this string package.) The "interface" operators (i.e., _CF+, _LEN, and _$>AL) provide the connection/interface between the high-level string operators and the low-level memory operators. They also make it easier to modify this package for stack widths of other than 16 bits and/or allow using other internal representations than the one selected here. Since the use of the interface operators is pervasive as well, it is recommended that they be coded in assembler also (in addition they are very short words). So, in general, the basic modus operandi of these string operators is to call upon the interface operators; to perform the appropriate "stack juggling"; and to call upon the appropriate memory operators. In the file FSTRINGS.SCR, block 1 is the load block for a full FORTH-83 implementation. Block 2 is the load block for a MasterFORTH system (i.e., the words: COMPARE, CFIND, CFIND<, SAME?, _CF+, _LEN, and _$>AL) are written in assembler for a MasterFORTH system (80XX chip). Even if you don't have a MasterFORTH system (in fact does *anyone* besides me!?), I would strongly recommend you examine the assembler implementation of these words and rewrite them (in assembler) for your own particular Forth system (they really make a rather dramatic difference in performance). The words are documented and the underlying algorithmic approach should be valid regardless of the system you're running. As an aid the understanding the MasterFORTH implementation for the 80XX series chip (so that you may convert these words): - The MasterFORTH assembler is based upon the Laxen/Perry F83 public domain model with the following major differences: - The mode #) is simply ); - The mode S#) has disappeared; - Local labels are available for unconditional jumps; - REPZ and REPNZ are REPE and REPNE instead; - JZ, JNZ, JC, and JNC are added; - The string instructions use BYTE and WORD rather than AL and AX. - The MasterFORTH implementation uses the following machine resources: - The four segment register must be maintained at the same value (i.e., CS = DS = SS = ES); - The following three registers are used internally by MasterFORTH and must be restored prior to returning to Forth: - SI -- the Forth instruction pointer (IP) - SP -- the Forth parameter stack pointer - BP -- the Forth return stack pointer. The definitions of the memory operators are: : COMPARE ( a1 a2 n -- status ) \ Compares a1-a2, (a1+1)-(a2+1), ... (a1+n-1)-(a2+n-1) \ as required returning -1, 0, or +1 depending on lexicographic \ order. If n=0, then 0 (i.e., =) is returned. : SAME? ( a1 a2 n -- t | f ) \ Compares a1-a2, (a1+1)-(a2+1), ... (a1+n-1)-(a2+n-1) \ as required returning -1 if all "n" bytes match, else returns \ 0. Note that this operator is not strictly required since one \ may define SAME? as \ : SAME? COMPARE 0= ; \ It is included primarily for reasons of speed. \ Memory character/byte FIND GTH 08/22/87 : CFIND ( c a1 n -- a2 | 0 ) \ "c-find" \ Searches a1, a1+1, ..., a1+n-1 for c returning first \ address of match or 0 if no match. NOTE: n=0 returns 0. : CFIND< ( c a1 n -- a2 | 0 ) \ "c-find-back" \ Same as CFIND except search is high-to-low memory. \ NOTE: Memory search begins at address: a1+n-1. The words "CFIND" and "CFIND<" were selected to match the current FORTH-83 words "CMOVE" and "CMOVE>". Also note that no attempt was made to code a general-purpose assembler word which searches memory attempting to locate a substring. The primary reason for this is that this operation, although frequently used, is far more complex (when done efficiently) than is generally realized. The reader is referred to the excellent article "Searching for Strings with Boyer-Moore" by Richard Wiggins and Paul Walberg in the November, 1986 (Volume 3, Number 11) edition of Computer Language (pp. 28-42) just to see how harry things can get! 11.5. THE ISSUE OF STANDARDS The string operators provided herein will hopefully be of benefit to some of you in string processing applications. The true test, however, of the utility of any programming language extension is its proven utility over time. There is thus no good measure at present of what string operators, as provided in FSTRINGS, are needed, are defined well, or are missing. It is more reasonable to expect that FSTRINGS will serve as a catalyst (or perhaps seed) to encourage members of the Forth community to investigate the fundamental nature of and need for string operations in Forth. In addition to the issue of functionality addressed above, there are as well the matters of stack/data conventions and naming conventions. All of these issues need to be throughly investigated and an informal standard developed and used for an extended period before any sort of string standards proposal is warranted (at least in this author's opinion). This has been a rather long-winded way of stating that I am in no way proposing FSTRINGS as a standards extension. The development of FSTRINGS did, however, provide reasonably good assurance that some more fundamental, underlying operators (upon which any contiguous memory string package could easily be built) could be good candidates as standards extensions. These are just the "memory operators" earlier mentioned. Thus I would like to propose that the words: COMPARE, CFIND, CFIND<, and -optionally- SAME? be considered as standards extensions to augment the CMOVE and CMOVE> words. 11.6. NULL STRINGS This package allows the use of null strings. You may wish to ignore them altogether, but I would caution against this. The null string processing introduces very little overhead for the capability gained. Null string handling is greatly simplified by the fact that all of the underlying memory operators are defined to accept null (or zero) counts. For example, in the null case: CMOVE and CMOVE> do not move anything; COMPARE returns 0 (i.e., "equal"); CFIND and CFIND< return 0 (i.e., "not found"); and -optionally- SAME? returns -1 (i.e., "equal"). Thus the majority of the string primitives herein can be written without concern for the special case of null (or zero length) strings. For those primitives which must explicitly test for the null string case, this action is always performed initially. If the null string case applies, then the return stack is set appropriately and the primitive is immediately EXITed. In all cases, this amounts to a single line of (commented) code at the beginning of the word definition. 11.7. CONVERSION PROBLEMS As mentioned, this system has been tested with both MasterFORTH and F83. Jerry Shifrin was kind enough to allow me/us to test this package under LMI's Forth as well. The primary difficulty encountered was the underlying implementation of the "BRANCH" word. Both MasterFORTH and F83 use absolute addresses for BRANCH. LMI seems to use a relative offset instead. BRANCH is only used in the definition of [$LIT] as follows: : [$LIT] ( -- string ) \ "bracket-s-lit" \ Compiled string literal COMPILE BRANCH HERE 0 , \ forward branch around string HERE ,$ HERE ROT ! [COMPILE] LITERAL ; IMMEDIATE It may be necessary to modify [$LIT] prior to the "!" in order to adjust the "address/offset" which follows BRANCH in the compiled word as necessary. It may also be necessary to perform "aligns" as well. Alternatively, if your Forth system supports the ">MARK" and ">RESOLVE" words from the system extension word set, then it may be more expedient to simply recode [$LIT] as: : [$LIT] ( -- string ) \ "bracket-s-lit" \ Compiled string literal COMPILE BRANCH >MARK HERE ,$ SWAP >RESOLVE [COMPILE] LITERAL ; IMMEDIATE Finally, it may be necessary to modify the "_CF+" and "_$>AL" words to use some value other than "2+" if your Forth system's word size is other than 16 bits. 12. GLOSSARY A short glossary of all string primitives follows: : COMPARE ( a1 a2 n -- status ) : SAME? ( a1 a2 n -- t | f ) : CFIND ( c a1 n -- a2 | 0 ) \ "c-find" : CFIND< ( c a1 n -- a2 | 0 ) \ "c-find-back" : $LIT ( -- string ) \ "s-lit" : $, ( string -- ) \ "s-comma" : ,$ ( -- ) \ "comma-s" : [$LIT] ( -- string ) \ "bracket-s-lit" : $KEY ( string -- ) \ "s-key" : $$KEY ( string -- ) \ "s-s-key" : $ALLOT ( number -- ) \ "s-allot" : $IMPORT ( addr length string -- ) \ "s-import" : $EXPORT ( string addr -- ) \ "s-export" : $LEN ( string -- length ) \ "s-length" : $. ( string -- ) \ "s-dot" : $NULL ( string -- ) \ "s-null" : $$+ ( string1 string2 -- ) \ "s-s-plus" : $C! ( char string index -- ) \ "s-c-store" : $C@ ( string index -- char ) \ "s-c-fetch" : $$! ( string1 string2 -- ) \ "s-s-store" : $$@ ( string1 string2 index length --) \ "s-s-fetch" : $CINS ( char string index -- ) \ "s-c-ins" : $$INS ( string1 string2 index -- ) \ "s-s-ins" : $|TRIM ( string number -- ) \ "s-left-trim" : $TRIM| ( string number -- ) \ "s-trim-right" : $|SPACES ( string -- ) \ "s-left-spaces" : $SPACES| ( string -- ) \ "s-spaces-right" : $DEL ( string index number -- ) \ "s-del" : $$REP ( string1 string2 index -- ) \ "s-s-rep" : $ROT ( string -- ) \ "s-right-rote" : $<>ROT ( string number -- ) \ "s-many-right-rote" : $$COMPARE ( string1 string2 -- status ) \ "s-s-compare" : $$= ( string1 string2 -- t | f ) \ "s-s-equal" : $$< ( string1 string2 -- t | f ) \ "s-s-less-than" : $$<= ( string1 string2 -- t | f ) \ "s-s-less-than-or-equal" : $$> ( string1 string2 -- t | f ) \ "s-s-greater-than" : $$>= ( string1 string2 -- t | f) \ "s-s-greater-than-or-equal" : $$<> ( string1 string2 -- t | f ) \ "s-s-not-equal" : $CFIND ( char string -- index | -1 ) \ "s-c-find" : $CFIND< ( char string -- index | -1 ) \ "s-c-find-back" : $$FIND ( string1 string2 index length -- index | -1 ) \ "s-s-find" : $CMEM ( char string -- t | f ) \ "s-c-mem" : $$VER ( string1 string2 -- index | -1 ) \ "s-s-ver" : $>UPPER ( string -- ) \ "s-to-upper" : $>LOWER ( string -- ) \ "s-to-lower" : $CONVERT ( string index -- ) \ "s-convert" : $N ( -- number | -1 ) \ "s-to-n"