The procedures that handle regular expressions support Unicode in the sense that they treat characters as basic 16-bit logical units and support case-insensitive matching for all Unicode characters, including those beyond the ASCII range.
The regular expression engine is based on the one developed by Henry Spencer. We have modified the original source to support Unicode characters.
Before using the functions and variables in the Regex Package, make the following function call to dynamically load the necessary library.
(use 'regex)
Regular expressions, as defined in POSIX 1003.2, come in two forms:
modern regular expressions and obsolete regular expressions. Obsolete
regular expressions (roughly those of ed; 1003.2 basic
regular expressions) mostly exist for backward compatibility in some old
programs. KSM-Scheme uses modern regular expressions (roughly those of
egrep; 1003.2 calls these extended regular expressions)
with the additional functionality of back references.
A regular expression is one or more non-empty branches, separated
by |. It matches anything that matches one of the branches.
"one|two|three" ==> matches "one", "two", or "three"
A branch is one or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.
A piece is an atom possibly followed by a single *,
+, ?, or bound. An atom followed by *
matches a sequence of 0 or more matches of the atom. An atom followed by
+ matches a sequence of 1 or more matches of the atom. An atom
followed by ? matches a sequence of 0 or 1 matches of the atom.
"a*" ==> matches "", "a", "aa", "aaa", ...
"a+" ==> matches "a", "aa", "aaa", ...
"a?" ==> matches "" or "a"
A bound is { followed by a decimal integer, possibly followed by
, possibly followed by another decimal integer, always followed
by }. The integers must lie between 0 and 255 inclusive, and if
there are two of them, the first may not exceed the second. An atom
followed by a bound containing one integer i and no comma matches
a sequence of exactly i matches of the atom. An atom followed by
a bound containing one integer i and a comma matches a sequence
of i or more matches of the atom. An atom followed by a bound
containing two integers i and j matches a sequence of
i through j (inclusive) matches of the atom.
"a{3}" ==> matches "aaa"
"a{3,}" ==> matches "aaa", "aaaa", "aaaaa", ...
"a{3,4}" ==> matches "aaa" or "aaaa"
An atom is a regular expression enclosed in () (matching a match
for the regular expression), an empty set of () (matching the
null string), a bracket expression (see below), .
(matching any single character), ^ (matching the null string at
the beginning of a line), $ (matching the null string at the end
of a line), a back reference (see below), a \ followed by
one of the characters ^.[$()|*+?{\ (matching that character
taken as an ordinary character), a \ followed by any other
character (matching that character taken as an ordinary character, as if
the \ had not been present), or a single character with no other
significance (matching that character). A { followed by a
character other than a digit is an ordinary character, not the beginning
of a bound. It is illegal to end a regular expression with \.
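The atom rules above can be tried out with a minimal sketch like the following. It uses only regex:regcomp and regex:regexec as documented in this section; the sample strings are hypothetical, and the expected results in the comments are derived from the rules above rather than from a test run.

```scheme
(use 'regex)

;; "." is an atom that matches any single character.
(define re (regex:regcomp "a.c"))
(regex:regexec re "abc")    ; ==> #t
(regex:regexec re "ac")     ; ==> #f

;; "\\." in program source is read as \. which matches a literal period.
(define re (regex:regcomp "a\\.c"))
(regex:regexec re "a.c")    ; ==> #t
(regex:regexec re "abc")    ; ==> #f

;; A { followed by a non-digit is an ordinary character, not a bound.
(define re (regex:regcomp "a{b"))
(regex:regexec re "a{b")    ; ==> #t
```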
A bracket expression is a list of characters enclosed in []. It
normally matches any single character from the list (but see below). If
the list begins with ^, it matches any single character (but see
below) not from the rest of the list. If two characters in the
list are separated by -, this is shorthand for the full
range of characters between those two (inclusive),
e.g. [0-9] matches any decimal digit. Note that characters are
ordered by their Unicode code values. It
is illegal for two ranges to share an endpoint, e.g. a-c-e.
"[a-z]" ==> matches any lowercase letter from 'a' through 'z'
"[^0-9]" ==> matches any character except '0' through '9'
To include a literal ] in the list, make it the first character
(following a possible ^). To include a literal -, make it
the first or last character, or the second endpoint of a range. To use a
literal - as the first endpoint of a range, enclose it in
[. and .] to make it a character-name element (see
below). With the exception of these and some combinations using [
(see next paragraphs), all other special characters, including \,
lose their special significance within a bracket expression.
"[]]" ==> matches ']'
"[^]]" ==> matches characters except ']'
"[-]" ==> matches '-'
"[^-a]" ==> matches characters except '-' and 'a'
"[^]-~]" ==> matches characters except ']' through '~'
"[!--]" ==> matches characters from '!' through '-'
"[abc-]" ==> matches 'a', 'b', 'c', or '-'
"[[.-.]-~]" ==> matches characters from '-' through '~'
Within a bracket expression, a character-name element enclosed in
[. and .] stands for the character that is represented by
the name. For example, [.backspace.] stands for
#\U{0008}. For a list of acceptable names of characters, see
section Character Names.
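As an illustration of character-name elements, here is a minimal sketch that uses the names tab and hyphen from the Character Names table. The sample strings are hypothetical, and the expected results are derived from the rules above rather than from a test run.

```scheme
(use 'regex)

;; [.tab.] stands for the tab character inside a bracket expression.
(define re (regex:regcomp "a[[.tab.]]b"))
(regex:regexec re "a\tb")      ; ==> #t
(regex:regexec re "a b")       ; ==> #f

;; [.hyphen.] lets '-' serve as the first endpoint of a range.
;; The range runs from '-' (U+002D) through '9' (U+0039).
(define re (regex:regcomp "^[[.hyphen.]-9]+$"))
(regex:regexec re "-./0129")   ; ==> #t
(regex:regexec re "+1")        ; ==> #f  ('+' is U+002B, below the range)
```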
Within a bracket expression, the name of a character class
enclosed in [: and :] stands for the list of all
characters belonging to that class. Standard character class names are:
| alnum | digit | punct |
| alpha | graph | space |
| blank | lower | upper |
| cntrl | xdigit |
These stand for the character classes defined in ctype(3). A
character class may not be used as an endpoint of a range.
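A minimal sketch of character classes in use follows. It relies only on the class names listed above and on the procedures documented in this section; the sample strings are hypothetical, and the expected results are derived from the rules above rather than from a test run.

```scheme
(use 'regex)

;; A letter followed by any number of letters or digits.
(define re (regex:regcomp "^[[:alpha:]][[:alnum:]]*$"))
(regex:regexec re "x25")               ; ==> #t
(regex:regexec re "25x")               ; ==> #f

;; The first run of decimal digits in a string.
(define re (regex:regcomp "[[:digit:]]+"))
(regex:regexec re "order 66 shipped")  ; ==> #t
(regex:subexpr re 0)                   ; ==> "66"

;; A class can be negated together with the rest of the list.
(define re (regex:regcomp "[^[:space:]]+"))
(regex:regexec re "  hello world")     ; ==> #t
(regex:subexpr re 0)                   ; ==> "hello"
```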
There are two special cases of bracket expressions: the bracket
expressions [[:<:]] and [[:>:]] match the null string at
the beginning and end of a word, respectively. A word is defined as a
sequence of word characters which is neither preceded nor followed by
word characters. A word character is an alnum character (as
defined by ctype(3)) or an underscore.
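The word-boundary expressions can be sketched as follows. The sample strings are hypothetical, and the expected results are derived from the definition of a word given above rather than from a test run.

```scheme
(use 'regex)

;; Match "cat" only when it stands alone as a word.
(define re (regex:regcomp "[[:<:]]cat[[:>:]]"))
(regex:regexec re "the cat sat")   ; ==> #t
(regex:regexec re "concatenate")   ; ==> #f
;; An underscore is a word character, so "cat" does not end a word here.
(regex:regexec re "cat_food")      ; ==> #f
```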
In the event that a regular expression could match more than one
substring of a given string, the regular expression matches the one
starting earliest in the string. If the regular expression could match
more than one substring starting at that point, it matches the
longest. Subexpressions also match the longest possible substrings,
subject to the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the regular expression taking
priority over ones starting later. Match lengths are measured in
characters. A null string is considered longer than no match at all.
For example, bb* matches the three middle characters of
abbbc, and (wee|week)(knights|nights) matches all ten
characters of weeknights. When (.*).* is matched against
abc, the parenthesized subexpression matches all three characters,
and when (a*)* is matched against bc, both the whole
regular expression and the parenthesized subexpression match the null
string.
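The examples in the preceding paragraph can be written as calls to the procedures documented in this section. This is a sketch only: the expected results restate what the text above says and have not been checked against a running system.

```scheme
(use 'regex)

;; Earliest start, then longest match.
(define re (regex:regcomp "bb*"))
(regex:regexec re "abbbc")        ; ==> #t
(regex:subexpr re 0)              ; ==> "bbb"

;; The whole match is as long as possible.
(define re (regex:regcomp "(wee|week)(knights|nights)"))
(regex:regexec re "weeknights")   ; ==> #t
(regex:subexpr re 0)              ; ==> "weeknights"

;; An earlier subexpression takes priority over what follows it.
(define re (regex:regcomp "(.*).*"))
(regex:regexec re "abc")          ; ==> #t
(regex:subexpr re 1)              ; ==> "abc"

;; A null match still counts as a match.
(define re (regex:regcomp "(a*)*"))
(regex:regexec re "bc")           ; ==> #t
```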
If case-independent matching is specified, the effect is much as if all
case distinctions had vanished from the alphabet. When an alphabetic
character that exists in multiple cases appears as an ordinary character outside a
bracket expression, it is effectively transformed into a bracket
expression containing both cases, e.g. x becomes
[xX]. When it appears inside a bracket expression, all case
counterparts of it are added to the bracket expression, so that (e.g.)
[x] becomes [xX] and [^x] becomes [^xX].
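A minimal sketch of case-independent matching, using the :ignore-case keyword of regex:regcomp described in this section. The sample strings are hypothetical, and the expected results are derived from the transformation rules above rather than from a test run.

```scheme
(use 'regex)

;; "x" behaves like [xX]; [yz] behaves like [yYzZ].
(define re (regex:regcomp "^x[yz]$" :ignore-case #t))
(regex:regexec re "xy")   ; ==> #t
(regex:regexec re "XZ")   ; ==> #t
(regex:regexec re "Xa")   ; ==> #f

;; [^x] behaves like [^xX], so neither case of x is accepted.
(define re (regex:regcomp "^[^x]$" :ignore-case #t))
(regex:regexec re "X")    ; ==> #f
(regex:regexec re "y")    ; ==> #t
```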
A back reference atom that is represented by a \ followed
by a non-zero decimal digit d matches the same sequence of
characters matched by the d-th parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right), so that (e.g.) ([bc])\1 matches bb or
cc but not bc.
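Back references can be sketched as follows. The backslash is doubled in program source so that the regular expression receives \1. The second pattern and its input are hypothetical, and the expected results are derived from the rules above rather than from a test run.

```scheme
(use 'regex)

;; The example from the text: the same character twice.
(define re (regex:regcomp "([bc])\\1"))
(regex:regexec re "bb")              ; ==> #t
(regex:regexec re "cc")              ; ==> #t
(regex:regexec re "bc")              ; ==> #f

;; A run of lowercase letters repeated after a single space.
(define re (regex:regcomp "([a-z]+) \\1"))
(regex:regexec re "it is is fine")   ; ==> #t
(regex:subexpr re 0)                 ; ==> "is is"
(regex:subexpr re 1)                 ; ==> "is"
```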
NOTE: This description of regular expressions is based on the documentation included in Henry Spencer's regular expression library.
regex:regcomp compiles pattern, which should be a string that specifies a regular expression, and returns the regular expression object specified by pattern.
If ignore-case is not false, it specifies a matching that ignores
upper/lower case distinctions.
If newline is not false, it specifies newline-sensitive
matching. By default, newline is a completely ordinary character with no
special meaning in either regular expressions or strings. With this
flag, [^...] bracket expressions and . never match
newline. ^ matches the empty string immediately after a
newline. $ matches the empty string immediately before a
newline.
If no-special is not false, it specifies matching with the meanings
of all special characters turned off. All characters are thus considered
ordinary, so the regular expression is a literal string.
This procedure supports Unicode characters as basic 16-bit logical units. Case-insensitive matching also covers Unicode characters beyond the ASCII range.
(define re (regex:regcomp "^[a-z]+.scm"))
(regex:regexec re "abc.scm") ==> #t
(regex:regexec re "Abc.scm") ==> #f
(define re (regex:regcomp "^[a-z]+.scm" :ignore-case #t))
(regex:regexec re "Abc.scm") ==> #t
(define re (regex:regcomp "[^0-9]+"))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0) ==> "abcd\nefg"
(define re (regex:regcomp "[^0-9]+" :newline #t))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0) ==> "abcd"
(define re (regex:regcomp ".+"))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0) ==> "abcd\nefg"
(define re (regex:regcomp ".+" :newline #t))
(regex:regexec re "abcd\nefg") ==> #t
(regex:subexpr re 0) ==> "abcd"
(define re (regex:regcomp "^hello$"))
(regex:regexec re "aaa\nhello\nbbb") ==> #f
(define re (regex:regcomp "^hello$" :newline #t))
(regex:regexec re "aaa\nhello\nbbb") ==> #t
(define re (regex:regcomp ".+" :no-special #t))
(regex:regexec re "abcd") ==> #f
(regex:regexec re ".+") ==> #t
In KSM-Scheme, several escape sequences in a string are processed when they are read from the program source. They are:
"\"" ==> " (a string composed of a doublequote character)
"\\" ==> \
"\n" ==> newline
"\r" ==> carriage return
"\t" ==> tab
"\U{XXXX}" ==> character corresponding to Unicode
code value XXXX (hexadecimal)
Consequently, a string "\n" is converted to a string with one
character (newline character) while it is read, and (regex:regcomp
"[\n]") returns a regular expression object that matches a string
containing a newline character.
A backslash in a string followed by any other character loses its special
meaning and represents a backslash character by itself. Consequently,
(regex:regcomp "\$"), as well as (regex:regcomp "\\$"),
returns a regular expression object that matches a string containing a
character '$'.
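Because the string reader and the regular expression engine each interpret backslashes, it can help to see the two layers side by side. The following is a minimal sketch with hypothetical sample strings; the expected results are derived from the escape rules above rather than from a test run.

```scheme
(use 'regex)

;; To match a literal backslash the engine needs \\ ,
;; which is written "\\\\" in program source.
(define re (regex:regcomp "\\\\"))
(regex:regexec re "a\\b")       ; ==> #t  (the string read is a\b)
(regex:regexec re "ab")         ; ==> #f

;; "\\$" reaches the engine as \$ and matches a literal dollar sign.
(define re (regex:regcomp "\\$"))
(regex:regexec re "cost: $5")   ; ==> #t
(regex:regexec re "cost: 5")    ; ==> #f

;; "\t" is already a tab character by the time regex:regcomp sees it.
(define re (regex:regcomp "a\tb"))
(regex:regexec re "a\tb")       ; ==> #t
```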
@source{regex/regex.c} @use{regex}
regex:regexec matches the regular expression object regex against string,
and returns #t if matching succeeds or #f if matching
fails.
If not-bol is not false, the first character of string is
not the beginning of a line, so the ^ anchor should not match
before it. This does not affect the behavior of newlines when regex
was created with :newline #t.
If not-eol is not false, the end of string is not the end of
a line, so the $ anchor should not match before it. This does not
affect the behavior of newlines when regex was compiled with
:newline #t.
This procedure supports Unicode characters as basic 16-bit logical units.
(define re (regex:regcomp "ab+c"))
(regex:regexec re "abbbbbbbc") ==> #t
(regex:regexec re "ac") ==> #f
(define re (regex:regcomp "^hello"))
(regex:regexec re "hello") ==> #t
(regex:regexec re "hello" :not-bol #t) ==> #f
(define re (regex:regcomp "hello$"))
(regex:regexec re "hello") ==> #t
(regex:regexec re "hello" :not-eol #t) ==> #f
@source{regex/regex.c} @use{regex}
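regex:subexpr returns the substring matched by the index-th subexpression of regex. It should be called only after regex:regexec has been
successfully conducted.
An index of 1 indicates the first subexpression, an index of 2 indicates the second subexpression, and so on. An index of 0 indicates the whole substring that was matched by regex.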
(define re (regex:regcomp "(a+)(b+)"))
(regex:regexec re "aaabbccc") ==> #t
(regex:subexpr re 0) ==> "aaabb"
(regex:subexpr re 1) ==> "aaa"
(regex:subexpr re 2) ==> "bb"
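Since regex:subexpr is meaningful only after a successful match, a common pattern is to test the result of regex:regexec first. The sketch below assumes a hypothetical "name=number" input format and a hypothetical helper named parse-setting; the expected results are derived from the descriptions above rather than from a test run.

```scheme
(use 'regex)

;; Hypothetical helper: split "name=number" into its two parts,
;; or return #f when the input does not have that shape.
(define setting-re (regex:regcomp "^([a-z]+)=([0-9]+)$"))

(define (parse-setting str)
  (if (regex:regexec setting-re str)
      (list (regex:subexpr setting-re 1)
            (regex:subexpr setting-re 2))
      #f))

(parse-setting "width=640")   ; ==> ("width" "640")
(parse-setting "width:640")   ; ==> #f
```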
regex:split splits string at the delimiters that match pattern and returns a list of the resulting substrings.
(regex:split "abcdefg" (regex:regcomp "d"))
==> ("abc" "efg")
(regex:split "123k234h345" (regex:regcomp "[a-z]"))
==> ("123" "234" "345")
(regex:split "123abc234" (regex:regcomp "[a-z]+"))
==> ("123" "234")
@source{regex/regex.c} @use{regex}
regex:replace returns a string created from string by replacing each substring that matches regex with the string returned by generator. regex should be a regular expression object. generator should be a function that takes one argument, the regular expression object, and returns the replacement string.
The not-bol and not-eol arguments have the same meanings as in
regex:regexec.
count specifies the number of replacements to be made. If
count is -1, all matching substrings are replaced.
(regex:replace (regex:regcomp ":")
               "abc:def:ghi"
               (lambda (re) " "))
==> "abc def ghi"
(regex:replace (regex:regcomp "([a-z])([a-z])")
               "123ab234cd345ef"
               (lambda (re)
                 (string-append (regex:subexpr re 2)
                                (regex:subexpr re 1))))
==> "123ba234dc345fe"
(regex:replace (regex:regcomp "([0-9]+)X([0-9]+)")
               "123X789A234X777B333X222"
               (lambda (reg)
                 (string-append
                  (regex:subexpr reg 2)
                  "X"
                  (regex:subexpr reg 1)))
               :count 2)
==> "789X123A777X234B333X222"
| NUL | '\0' |
| SOH | '\001' |
| STX | '\002' |
| ETX | '\003' |
| EOT | '\004' |
| ENQ | '\005' |
| ACK | '\006' |
| BEL | '\007' |
| alert | '\007' |
| BS | '\010' |
| backspace | '\b' |
| HT | '\011' |
| tab | '\t' |
| LF | '\012' |
| newline | '\n' |
| VT | '\013' |
| vertical-tab | '\v' |
| FF | '\014' |
| form-feed | '\f' |
| CR | '\015' |
| carriage-return | '\r' |
| SO | '\016' |
| SI | '\017' |
| DLE | '\020' |
| DC1 | '\021' |
| DC2 | '\022' |
| DC3 | '\023' |
| DC4 | '\024' |
| NAK | '\025' |
| SYN | '\026' |
| ETB | '\027' |
| CAN | '\030' |
| EM | '\031' |
| SUB | '\032' |
| ESC | '\033' |
| IS4 | '\034' |
| FS | '\034' |
| IS3 | '\035' |
| GS | '\035' |
| IS2 | '\036' |
| RS | '\036' |
| IS1 | '\037' |
| US | '\037' |
| space | ' ' |
| exclamation-mark | '!' |
| quotation-mark | '"' |
| number-sign | '#' |
| dollar-sign | '$' |
| percent-sign | '%' |
| ampersand | '&' |
| apostrophe | '\'' |
| left-parenthesis | '(' |
| right-parenthesis | ')' |
| asterisk | '*' |
| plus-sign | '+' |
| comma | ',' |
| hyphen | '-' |
| hyphen-minus | '-' |
| period | '.' |
| full-stop | '.' |
| slash | '/' |
| solidus | '/' |
| zero | '0' |
| one | '1' |
| two | '2' |
| three | '3' |
| four | '4' |
| five | '5' |
| six | '6' |
| seven | '7' |
| eight | '8' |
| nine | '9' |
| colon | ':' |
| semicolon | ';' |
| less-than-sign | '<' |
| equals-sign | '=' |
| greater-than-sign | '>' |
| question-mark | '?' |
| commercial-at | '@' |
| left-square-bracket | '[' |
| backslash | '\\' |
| reverse-solidus | '\\' |
| right-square-bracket | ']' |
| circumflex | '^' |
| circumflex-accent | '^' |
| underscore | '_' |
| low-line | '_' |
| grave-accent | '`' |
| left-brace | '{' |
| left-curly-bracket | '{' |
| vertical-line | '|' |
| right-brace | '}' |
| right-curly-bracket | '}' |
| tilde | '~' |
| DEL | '\177' |