

Regular Expression Package

The regular expression procedures support Unicode in two senses: they treat characters as basic 16-bit logical units, and case-insensitive matching works for all Unicode characters, including those beyond the ASCII range.

The regular expression engine is based on the one developed by Henry Spencer. We have modified the original source to support Unicode characters.

Before using the functions and variables in the Regex Package, make the following call to load the necessary library dynamically.

(use 'regex)

Regular Expression

Regular expressions, as defined in POSIX 1003.2, come in two forms: modern regular expressions and obsolete regular expressions. Obsolete regular expressions (roughly those of ed; 1003.2 basic regular expressions) exist mostly for backward compatibility with some old programs. KSM-Scheme uses modern regular expressions (roughly those of egrep; 1003.2 calls these extended regular expressions) with the additional functionality of back references.

A regular expression is one or more non-empty branches, separated by |. It matches anything that matches one of the branches.

"one|two|three" ==> matches "one", "two", or "three"

A branch is one or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.
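The alternation and concatenation rules can be tried with the procedures documented under Regex Procedures. The following is a minimal sketch, assuming the package has been loaded with (use 'regex); the results in the comments are what the rules above predict, not captured output.

```scheme
(use 'regex)

;; Two branches, each a concatenation of two single-character pieces.
(define re (regex:regcomp "ab|cd"))
(regex:regexec re "xxcdxx")   ; ==> #t  (the second branch matches "cd")
(regex:regexec re "acbd")     ; ==> #f  (neither "ab" nor "cd" occurs)
```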

A piece is an atom possibly followed by a single *, +, ?, or bound. An atom followed by * matches a sequence of 0 or more matches of the atom. An atom followed by + matches a sequence of 1 or more matches of the atom. An atom followed by ? matches a sequence of 0 or 1 matches of the atom.

"a*" ==> matches "", "a", "aa", "aaa", ...
"a+" ==> matches "a", "aa", "aaa", ...
"a?" ==> matches "" or "a"

A bound is { followed by a decimal integer, possibly followed by , possibly followed by another decimal integer, always followed by }. The integers must lie between 0 and 255 inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.

"a{3}"   ==> matches "aaa"
"a{3,}"  ==> matches "aaa", "aaaa", "aaaaa", ...
"a{3,4}" ==> matches "aaa" or "aaaa"
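A bound can also be exercised through regex:regexec and regex:subexpr (both documented under Regex Procedures). This sketch assumes the package is loaded with (use 'regex); the commented results follow from the bound rules above and the longest-match rule described later in this section.

```scheme
(use 'regex)

(define re (regex:regcomp "a{2,3}"))
(regex:regexec re "caaaad")   ; ==> #t
(regex:subexpr re 0)          ; ==> "aaa"  (at most three repetitions are taken)
(regex:regexec re "cad")      ; ==> #f     (fewer than two repetitions)
```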

An atom is a regular expression enclosed in () (matching a match for the regular expression), an empty set of () (matching the null string), a bracket expression (see below), . (matching any single character), ^ (matching the null string at the beginning of a line), $ (matching the null string at the end of a line), a back reference (see below), a \ followed by one of the characters ^.[$()|*+?{\ (matching that character taken as an ordinary character), a \ followed by any other character (matching that character taken as an ordinary character, as if the \ had not been present), or a single character with no other significance (matching that character). A { followed by a character other than a digit is an ordinary character, not the beginning of a bound. It is illegal to end a regular expression with \.

A bracket expression is a list of characters enclosed in []. It normally matches any single character from the list (but see below). If the list begins with ^, it matches any single character (but see below) not from the rest of the list. If two characters in the list are separated by -, this is shorthand for the full range of characters between those two (inclusive), e.g. [0-9] matches any decimal digit. It should be noted that ordering of characters is equivalent to that of Unicode code values. It is illegal for two ranges to share an endpoint, e.g. a-c-e.

"[a-z]"  ==> matches any lowercase letter from 'a' through 'z'
"[^0-9]" ==> matches all characters except '0' through '9'

To include a literal ] in the list, make it the first character (following a possible ^). To include a literal -, make it the first or last character, or the second endpoint of a range. To use a literal - as the first endpoint of a range, enclose it in [. and .] to make it a character-name element (see below). With the exception of these and some combinations using [ (see next paragraphs), all other special characters, including \, lose their special significance within a bracket expression.

"[]]"       ==> matches ']'
"[^]]"      ==> matches characters except ']'
"[-]"       ==> matches '-'
"[^-a]"     ==> matches characters except '-' and 'a'
"[^]-~]"    ==> matches characters except ']' through '~'
"[+--]"     ==> matches characters from '+' through '-'
"[abc-]"    ==> matches 'a', 'b', 'c', or '-'
"[[.-.]-~]" ==> matches characters from '-' through '~'
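The placement rules for literal ] and - can be combined in one bracket expression. This is a small sketch, assuming the package is loaded with (use 'regex); the commented results are derived from the rules above rather than from a recorded session.

```scheme
(use 'regex)

;; "]" placed first and "-" placed last are both taken literally.
(define re (regex:regcomp "[]-]+"))
(regex:regexec re "ab]-]cd")  ; ==> #t
(regex:subexpr re 0)          ; ==> "]-]"
```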

Within a bracket expression, a character-name element enclosed in [. and .] stands for the character that is represented by the name. For example, [.backspace.] stands for #\U{0008}. For a list of acceptable names of characters, see section Character Names.

Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of all characters belonging to that class. Standard character class names are:
alnum digit punct
alpha graph space
blank lower upper
cntrl print xdigit

These stand for the character classes defined in ctype(3). A character class may not be used as an endpoint of a range.
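Character classes combine with other bracket-expression elements and with repetition. The sketch below assumes the package is loaded with (use 'regex) and that alpha and alnum classify ASCII letters and digits as in ctype(3); the commented results follow from that assumption.

```scheme
(use 'regex)

;; A letter followed by any number of letters or digits.
(define re (regex:regcomp "[[:alpha:]][[:alnum:]]*"))
(regex:regexec re "42 x1y2 z")  ; ==> #t
(regex:subexpr re 0)            ; ==> "x1y2"  (earliest match, extended as far as possible)
```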

There are two special cases of bracket expressions: the bracket expression [[:<:]] and [[:>:]] match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an alnum character (as defined by ctype(3)) or an underscore.
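The word-boundary expressions are typically used to match a whole word rather than a substring of a longer word. A minimal sketch, assuming the package is loaded with (use 'regex); the commented results follow from the definition of a word given above.

```scheme
(use 'regex)

(define re (regex:regcomp "[[:<:]]cat[[:>:]]"))
(regex:regexec re "a cat sat")    ; ==> #t  ("cat" stands alone as a word)
(regex:regexec re "concatenate")  ; ==> #f  ("cat" is surrounded by word characters)
```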

In the event that a regular expression could match more than one substring of a given string, the regular expression matches the one starting earliest in the string. If the regular expression could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the regular expression taking priority over ones starting later. Match lengths are measured in characters. A null string is considered longer than no match at all. For example, bb* matches the three middle characters of abbbc, (wee|week)(knights|nights) matches all ten characters of weeknights, when (.*).* is matched against abc the parenthesized subexpression matches all three characters, and when (a*)* is matched against bc both the whole regular expression and the parenthesized subexpression match the null string.
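Two of the cases described in this paragraph can be reproduced with regex:regexec and regex:subexpr. This sketch assumes the package is loaded with (use 'regex); the commented results restate what the paragraph says should happen.

```scheme
(use 'regex)

;; Earliest start, then longest match.
(define re (regex:regcomp "bb*"))
(regex:regexec re "abbbc")    ; ==> #t
(regex:subexpr re 0)          ; ==> "bbb"

;; The earlier subexpression takes as much as it can.
(define re (regex:regcomp "(.*).*"))
(regex:regexec re "abc")      ; ==> #t
(regex:subexpr re 1)          ; ==> "abc"
```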

If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g. x becomes [xX]. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, so that (e.g.) [x] becomes [xX] and [^x] becomes [^xX].
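The effect on a negated bracket expression is the least obvious of these cases. A small sketch, assuming the package is loaded with (use 'regex) and using the :ignore-case key documented under regex:regcomp; the commented results follow from the rule that [^x] becomes [^xX].

```scheme
(use 'regex)

(define re (regex:regcomp "[^x]" :ignore-case #t))
(regex:regexec re "X")    ; ==> #f  ("X" is excluded along with "x")
(regex:regexec re "xy")   ; ==> #t  ("y" is outside the excluded set)
```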

A back reference atom that is represented by a \ followed by a non-zero decimal digit d matches the same sequence of characters matched by the d-th parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right), so that (e.g.) ([bc])\1 matches bb or cc but not bc.
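The back reference example above can be written as follows. This sketch assumes the package is loaded with (use 'regex). The backslash is doubled in the source string so that the reader passes a single \ to the regular expression, as explained under regex:regcomp.

```scheme
(use 'regex)

;; "\\1" is read as the two characters \1, a reference to the first group.
(define re (regex:regcomp "([bc])\\1"))
(regex:regexec re "bb")   ; ==> #t
(regex:regexec re "cc")   ; ==> #t
(regex:regexec re "bc")   ; ==> #f  (the second character differs from the group)
```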

NOTE: This description of regular expression is based on the documentation included in the regular expression library of Henry Spencer.

Regex Procedures

Function: regex:regcomp pattern &key (ignore-case #f) (newline #f) (no-special #f)

pattern should be a string that specifies a regular expression. Returns the regular expression object specified by pattern.

If ignore-case is not false, it specifies a matching that ignores upper/lower case distinctions.

If newline is not false, it specifies newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning in either regular expressions or strings. With this flag, [^...] bracket expressions and . never match newline, ^ matches the empty string immediately after a newline, and $ matches the empty string immediately before a newline.

If no-special is not false, it specifies a matching with meanings of all special characters turned off. All characters are thus considered ordinary, so the regular expression is a literal string.

This procedure treats Unicode characters as basic 16-bit logical units. Case-insensitive matching covers all Unicode characters, not only ASCII.

(define re (regex:regcomp "[a-z]+.scm"))
(regex:regexec re "abc.scm") ==> #t
(regex:regexec re "Abc.scm") ==> #f

(define re (regex:regcomp "[a-z]+.scm" :ignore-case #t))
(regex:regexec re "Abc.scm") ==> #t

(define re (regex:regcomp "[^0-9]+"))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0)           ==> "abcd\nefg"
(define re (regex:regcomp "[^0-9]+" :newline #t))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0)           ==> "abcd"

(define re (regex:regcomp ".+"))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0)           ==> "abcd\nefg"
(define re (regex:regcomp ".+" :newline #t))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0)           ==> "abcd"

(define re (regex:regcomp "^hello$"))
(regex:regexec re "aaa\nhello\nbbb")  ==> #f
(define re (regex:regcomp "^hello$" :newline #t))
(regex:regexec re "aaa\nhello\nbbb")  ==> #t

(define re (regex:regcomp ".+" :no-special #t))
(regex:regexec re "abcd")             ==> #f
(regex:regexec re ".+")               ==> #t

In KSM-Scheme, several escape sequences in a string are processed when the string is read from the program source. They are

"\""      ==> "  (a string composed of a doublequote character)
"\\"      ==> \
"\n"      ==> newline
"\r"      ==> carriage return
"\t"      ==> tab
"\U{XXXX}" ==> character corresponding to Unicode 
              code value XXXX (hexadecimal)

Consequently, a string "\n" is converted to a string with one character (newline character) while it is read, and (regex:regcomp "[\n]") returns a regular expression object that matches a string containing a newline character.

A backslash in a string followed by other characters loses its special meaning and represents a backslash character by itself. Consequently, (regex:regcomp "\$"), as well as (regex:regcomp "\\$"), returns a regular expression object that matches a string containing a character '$'.
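Because of this reader-level processing, a regular expression escape is most safely written with a doubled backslash. The sketch below assumes the package is loaded with (use 'regex); it contrasts a literal period with the unescaped . atom, and the commented results follow from the escape rules above.

```scheme
(use 'regex)

;; "\\." reaches the regex engine as \. and matches only a period.
(define re (regex:regcomp "[a-z]+\\.scm"))
(regex:regexec re "abc.scm")  ; ==> #t
(regex:regexec re "abcxscm")  ; ==> #f  (no literal "." before "scm")

;; An unescaped "." matches any single character, including "x".
(define re (regex:regcomp "[a-z]+.scm"))
(regex:regexec re "abcxscm")  ; ==> #t
```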

@source{regex/regex.c} @use{regex}

Function: regex:regexec regex string &key (not-bol #f) (not-eol #f)

Matches a regular expression object regex against string, and returns #t if matching succeeds or #f if matching fails.

If not-bol is not false, the first character of string is not treated as the beginning of a line, so the ^ anchor does not match before it. This does not affect the behavior of newlines when regex was compiled with :newline #t.

If not-eol is not false, the end of string is not treated as the end of a line, so the $ anchor does not match there. This does not affect the behavior of newlines when regex was compiled with :newline #t.

This procedure treats Unicode characters as basic 16-bit logical units.

(define re (regex:regcomp "ab+c"))
(regex:regexec re "abbbbbbbc")   ==> #t
(regex:regexec re "ac")          ==> #f

(define re (regex:regcomp "^hello"))
(regex:regexec re "hello")          ==> #t
(regex:regexec re "hello" :not-bol #t) ==> #f

(define re (regex:regcomp "hello$"))
(regex:regexec re "hello")          ==> #t
(regex:regexec re "hello" :not-eol #t) ==> #f

@source{regex/regex.c} @use{regex}

Function: regex:subexpr regex index
Returns a string consisting of the characters matched by the index-th parenthesized subexpression in regex. regex must be a regular expression object on which a call to regex:regexec has succeeded.

An index of 1 indicates the first subexpression, an index of 2 indicates the second subexpression, and so on. An index of 0 indicates the whole substring that was matched by regex.

(define re (regex:regcomp "(a+)(b+)"))
(regex:regexec re "aaabbccc")  ==> #t
(regex:subexpr re 0)           ==> "aaabb"
(regex:subexpr re 1)           ==> "aaa"
(regex:subexpr re 2)           ==> "bb"
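Since regex:subexpr is only meaningful after a successful regex:regexec, the two calls are naturally wrapped together. The helper below is a hypothetical convenience, not part of the regex package; it assumes the package is loaded with (use 'regex) and that the standard Scheme forms define, let and if are available.

```scheme
(use 'regex)

;; first-match is a hypothetical helper: it returns the whole matched
;; substring, or #f when the pattern does not match anywhere in str.
(define (first-match pattern str)
  (let ((re (regex:regcomp pattern)))
    (if (regex:regexec re str)
        (regex:subexpr re 0)
        #f)))

(first-match "[0-9]+" "abc123def456")  ; ==> "123"
(first-match "[0-9]+" "abcdef")        ; ==> #f
```

Checking the result of regex:regexec first avoids calling regex:subexpr on an object that has no successful match recorded.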

Function: regex:split regex string

Splits string at the delimiters that match regex. Returns a list of the resulting substrings.

(regex:split "abcdefg" (regex:regcomp "d")) 
                   ==> ("abc" "efg")
(regex:split "123k234h345" (regex:regcomp "[a-z]")) 
                   ==> ("123" "234" "345")
(regex:split "123abc234" (regex:regcomp "[a-z]+")) 
                   ==> ("123" "234")

@source{regex/regex.c} @use{regex}

Function: regex:replace regex string generator &key (not-bol #f) (not-eol #f) (count -1)

Returns a copy of string in which each substring that matches regex is replaced by the string that generator returns for that match. regex should be a regular expression object. generator should be a function that takes one argument, the regular expression object holding the current match, and returns a string.

not-bol and not-eol arguments have the same meanings as in regex:regexec.

count specifies the maximum number of replacements to be made. If count is -1, all matching substrings are replaced.

(regex:replace (regex:regcomp ":")
               "abc:def:ghi"
               (lambda (re) " "))
  ==> "abc def ghi"

(regex:replace (regex:regcomp "([a-z])([a-z])")
               "123ab234cd345ef"
               (lambda (re)
                 (string-append (regex:subexpr re 2)
                                (regex:subexpr re 1))))
  ==> "123ba234dc345fe"

(regex:replace (regex:regcomp "([0-9]+)X([0-9]+)")
               "123X789A234X777B333X222" 
               (lambda (reg) 
                  (string-append 
                    (regex:subexpr reg 2) 
                    "X" 
                    (regex:subexpr reg 1))) 
               :count 2)
  ==> "789X123A777X234B333X222"
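The generator can also reuse the whole match through index 0. This is a minimal sketch, assuming the package is loaded with (use 'regex); it wraps only the first run of digits because :count is 1, and the commented result follows from the description above.

```scheme
(use 'regex)

(regex:replace (regex:regcomp "[0-9]+")
               "a1b22c333"
               (lambda (re)
                 ;; Index 0 is the whole substring matched by the regex.
                 (string-append "<" (regex:subexpr re 0) ">"))
               :count 1)
;; ==> "a<1>b22c333"
```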

Character Names

NUL '\0'
SOH '\001'
STX '\002'
ETX '\003'
EOT '\004'
ENQ '\005'
ACK '\006'
BEL '\007'
alert '\007'
BS '\010'
backspace '\b'
HT '\011'
tab '\t'
LF '\012'
newline '\n'
VT '\013'
vertical-tab '\v'
FF '\014'
form-feed '\f'
CR '\015'
carriage-return '\r'
SO '\016'
SI '\017'
DLE '\020'
DC1 '\021'
DC2 '\022'
DC3 '\023'
DC4 '\024'
NAK '\025'
SYN '\026'
ETB '\027'
CAN '\030'
EM '\031'
SUB '\032'
ESC '\033'
IS4 '\034'
FS '\034'
IS3 '\035'
GS '\035'
IS2 '\036'
RS '\036'
IS1 '\037'
US '\037'
space ' '
exclamation-mark '!'
quotation-mark '"'
number-sign '#'
dollar-sign '$'
percent-sign '%'
ampersand '&'
apostrophe '\''
left-parenthesis '('
right-parenthesis ')'
asterisk '*'
plus-sign '+'
comma ','
hyphen '-'
hyphen-minus '-'
period '.'
full-stop '.'
slash '/'
solidus '/'
zero '0'
one '1'
two '2'
three '3'
four '4'
five '5'
six '6'
seven '7'
eight '8'
nine '9'
colon ':'
semicolon ';'
less-than-sign '<'
equals-sign '='
greater-than-sign '>'
question-mark '?'
commercial-at '@'
left-square-bracket '['
backslash '\\'
reverse-solidus '\\'
right-square-bracket ']'
circumflex '^'
circumflex-accent '^'
underscore '_'
low-line '_'
grave-accent '`'
left-brace '{'
left-curly-bracket '{'
vertical-line '|'
right-brace '}'
right-curly-bracket '}'
tilde '~'
DEL '\177'

