Nre Package Commands
NAME
nrematch - Match a regular expression against a string
SYNOPSIS
package require nre ?3.0?
nrematch ?switches? exp string ?matchVar? ?subMatchVar subMatchVar ...?
nrematch -eval ?switches? exp string ?matchVar matchScript? ?subMatchVar subMatchScript subMatchVar subMatchScript ...?
nrematch -list|-flatten|-split ?switches? exp string matchVar
nrematch -eval -list|-flatten|-split ?switches? exp string matchVar matchScript
DESCRIPTION
Determines whether the regular expression exp matches part or
all of string. Returns the number of times exp matched. The number of matches
can be greater than 1 if the -all or -split switches are used.
If additional arguments are specified after string then they
are treated as the names of variables in which to return
information about which part(s) of string matched exp.
MatchVar will be set to the range of string that
matched all of exp. The first subMatchVar will contain
the characters in string that matched the leftmost parenthesized
subexpression within exp, the next subMatchVar will
contain the characters that matched the next parenthesized
subexpression to the right in exp, and so on.
Instead of using the standard regular expression package, nrematch uses
the package described in this man page.
If there are more subMatchVar's than parenthesized
subexpressions within exp, or if a particular subexpression
in exp doesn't match the string (e.g. because it was in a
portion of the expression that wasn't matched), then the corresponding
subMatchVar will be set to ``-1 -1'' if -indices is used, to ``0.0 0.0'' if -textidx is used, or to an empty string otherwise.
The exception to this is the -prune switch. Use it if you do
not want empty items added to the match variables.
If the -list, -flatten, or -split switches are
used then matchVar is required and subMatchVar is not allowed.
If the -eval switch is used
then each matchVar and subMatchVar has an associated script
that will be executed when a match is found.
Any matchVar, subMatchVar, matchScript,
and subMatchScript can be an empty string if you want
nrematch to ignore it.
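As a minimal sketch of the basic calling form described above (assuming the nre package is installed; the input string and variable names are invented for illustration), the following captures the three fields of a date:

```tcl
package require nre 3.0

# Hypothetical input used only for illustration.
set input "released 1997-03-15"

# whole receives the text matched by the entire expression; year,
# month, and day receive the three parenthesized subexpressions,
# taken from left to right.
set count [nrematch {([0-9]+)-([0-9]+)-([0-9]+)} $input whole year month day]

# count is the number of matches, which is 1 here because neither
# -all nor -split was given.
puts "$count: $whole -> $year / $month / $day"
```

The pattern is wrapped in braces so that Tcl does not treat ``[0-9]'' as a command substitution.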
If the initial arguments to nrematch start with - then
they are treated as switches. The following switches are
currently supported:
-
-nocase
-
Causes upper-case characters in string to be treated as
lower case during the matching process.
-
-indices
-
Changes what is stored in the subMatchVars.
Instead of storing the matching characters from string,
each variable
will contain a list of two decimal strings giving the indices
in string of the first and last characters in the matching
range of characters.
-
-textidx
-
Changes what is stored in the subMatchVars.
Instead of storing the matching characters from string,
each variable
will contain a list of two Tk text widget indices that specify the
matching range of characters in string.
The first points to the first character in the range and
the second points to the position after the last character in the range.
A text widget index is of the form line.char that
indicates the char'th character on line line.
Lines are numbered from 1. Within a line, characters are numbered from 0.
-
-all
-
Instead of returning after a single match, all ranges in string
that match exp are found. Returns the number of matches found.
The matchVar and subMatchVars are set to an empty list and as each
match is found an element is appended to the var's list.
If the -indices switch is used then two elements are appended
to each list for each match found.
-
-split
-
Implies -all and -flatten. In addition the text matched
by the entire exp is not appended to matchVar. Instead the
text that preceded the match is appended, followed by any captured
subexpressions. Finally, when exp fails to match, any remaining
unmatched text from string is appended to matchVar.
Note that this switch is used by the nresplit command.
-
-limit num
-
Limits the number of matches found by -all or -split to num.
num must be an integer.
If the limit is reached then nrematch acts as if no more matches exist.
-
-start num
-
Start matching input at the offset num.
This switch will not change the result index values;
those are still computed from the start of string.
-
-end num
-
Act as if the input string was only of length num.
In combination with -start this can save a call to string range.
-
-prune
-
Normally an empty element will be added to the match result if
a subexpression did not match at all. The -prune switch
changes this behavior so that if a subexpression did not match at
all nothing will be added to the result for it.
Note that pruning is only done for -all and -split.
-
-list
-
Causes all of the matched strings to be put in the required matchVar
as a list. The command detects the number of captured subexpressions in
exp and adds that number of additional elements to the matchVar
list.
In the case of the -all switch each element added to matchVar
is itself a list whose size is the number of captured subexpressions plus one.
In the case of the -split switch an additional element will be added
to matchVar for each match if the number of captured subexpressions
is greater than 0. That additional element will be a list which will have
an element for each captured subexpression.
-
-flatten
-
Like -list except a sublist will not be used. Instead the captured
subexpressions will be appended directly to matchVar.
-
-eval
-
Causes matchScript and any subMatchScripts to be evaluated for
each match found. Before doing any evaluations matchVar and any
subMatchVars are set with the normal value they would be without
the -eval switch.
If you don't care about a particular part of the match
then use an empty string for that match variable.
The scripts are evaluated from left to right. The script can be
an empty string if no evaluation is desired for a particular match variable.
Since the match variables are all set before script evaluation they can
all be accessed from any of the scripts.
If a subexpression did not match at all, including an empty string,
then its corresponding subMatchScript will not be evaluated.
If an evaluated script executes break or return then no more
scripts will be evaluated. In the case of -all or -split
nrematch will act as if the current match did not happen and that
no more matches exist.
If continue is executed and -all or -split is used
then nrematch will act as if the current match did not happen and
will try to find another match.
-
-tryagain varname
-
If exp could have matched if string had had additional
text on its end then varname will be set to the offset at which
additional matches should be attempted once additional input is appended
to string. It will be set to -1 if trying again with the current
input could not yield a match.
This switch should only be used if the input is being read from a stream
which may have additional input.
-
--
-
Marks the end of switches. The argument following this one will
be treated as exp even if it starts with a -.
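A few of the switches above can be sketched together as follows. This is an illustration only: it assumes the nre package is installed, and the input strings and variable names are made up.

```tcl
package require nre 3.0

# Hypothetical input used only for illustration.
set input "a1 b22 c333"

# -all: every match is appended to the list held in nums, and the
# return value is the number of matches found.
set n [nrematch -all {[0-9]+} $input nums]
puts "$n matches: $nums"

# -limit: act as if no more matches exist after the first two.
nrematch -all -limit 2 {[0-9]+} $input firstTwo
puts "limited: $firstTwo"

# -indices: the subMatchVar receives the indices of the first and
# last characters of the captured text. The empty string in the
# matchVar position tells nrematch to ignore that variable.
nrematch -indices {([0-9]+)} $input {} range
puts "first number spans: $range"

# -split: the text preceding each match, and the text left over
# after the last match, are collected in parts.
nrematch -split {,} "x,y,z" parts
puts "split: $parts"
```

Going by the switch descriptions, nums should hold the three numbers, firstTwo the first two of them, range the pair ``1 1'', and parts the three fields x, y, and z.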
REGULAR EXPRESSIONS
Regular expressions are implemented using Henry Spencer's package
(thanks, Henry!),
and much of the description of regular expressions below is copied verbatim
from his manual entry.
A regular expression is zero or more branches, separated by ``|''.
It matches anything that matches one of the branches.
A branch is zero or more pieces, concatenated.
It matches a match for the first, followed by a match for the second, etc.
A piece is an atom possibly followed by ``*'', ``+'', ``?'', or ``{x,y}'', which in turn might be followed by a ``?''.
A ``*'' matches a sequence of 0 or more matches of the atom.
A ``+'' matches a sequence of 1 or more matches of the atom.
A ``?'' matches a sequence of 0 or 1 matches of the atom.
A ``{x}'' matches a sequence of x matches of the atom.
A ``{x,}'' matches a sequence of x or more matches of the atom.
A ``{x,y}'' matches a sequence of at least x and at most y matches of
the atom.
By default a piece will match as long a sequence as
possible. However if the piece constructs described above have a ``?''
after them then the piece will match as short a sequence as possible.
Note that the ``{x,y}'' repetition construct is only recognized if
the p flag is set.
An atom is a regular expression in parentheses
(matching a match for the regular expression), a range (see below),
``.'' (matching any single character),
``^'' (matching the null string at the beginning of the input string),
``$'' (matching the null string at the end of the input string),
a ``\'' followed by a single character (matching that character, or matching something special if the p flag is used;
see the FLAGS section for details), or a single character with no other significance (matching that character).
A range is a sequence of characters enclosed in ``[]''.
It normally matches any single character from the sequence.
If the sequence begins with ``^'',
it matches any single character not from the rest of the sequence.
If two characters in the sequence are separated by ``-'', this is shorthand
for the full list of ASCII characters between them
(e.g. ``[0-9]'' matches any decimal digit).
To include a literal ``]'' in the sequence, make it the first character
(following a possible ``^'').
To include a literal ``-'', make it the first or last character.
A range can also contain POSIX character classes.
They represent a sequence of characters just as two characters
separated by ``-'' do. However the sequence is determined using
the functions from the C runtime library and the current locale.
The following POSIX character classes are supported:
-
[:alnum:]
-
Alphabetic and numeric characters. Defined by isalnum().
-
[:alpha:]
-
Alphabetic characters. Defined by isalpha().
-
[:cntrl:]
-
Control characters. Defined by iscntrl().
-
[:digit:]
-
Digit characters. Defined by isdigit().
-
[:graph:]
-
Printable characters excluding a space. Defined by isgraph().
-
[:lower:]
-
Lowercase alphabetic characters even if the i flag is set.
Defined by islower().
-
[:print:]
-
Printable characters including a space. Defined by isprint().
-
[:punct:]
-
Punctuation characters. Defined by ispunct().
-
[:space:]
-
Whitespace characters. Defined by isspace().
-
[:upper:]
-
Uppercase alphabetic characters even if the i flag is set.
Defined by isupper().
-
[:xdigit:]
-
Characters allowed in a hexadecimal number. Defined by isxdigit().
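A character class is written inside a range, so the class brackets nest within the range brackets. A minimal sketch, assuming the nre package is installed and the default C locale (the sample input is invented):

```tcl
package require nre 3.0

# Match an identifier: a letter or underscore followed by any number
# of letters, digits, or underscores. The outer [] is the range and
# the inner [:alpha:] and [:alnum:] are the POSIX classes.
set count [nrematch {[[:alpha:]_][[:alnum:]_]*} "42 total_count = 7" ident]
puts "$count: $ident"
```

The leading digits are skipped because the first range accepts only a letter or an underscore, so the earliest possible match starts at the identifier.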
A parentheses atom in which the character immediately after the ``(''
is a ``?'' is a special construct with one of the following meanings:
-
``(?:''regexp``)'' is a shy group. It groups like
``()'' but does not capture the text for backreferences as ``()'' does.
It matches if regexp matches.
-
``(?=''regexp``)'' is a non-capturing zero-width positive lookahead
assertion. It matches if regexp matches.
The matched text is not consumed.
-
``(?!''regexp``)'' is a non-capturing zero-width negative lookahead
assertion. It matches if regexp does not match.
-
``(?#''any text``)'' is a comment. The entire atom is treated as an
empty string.
-
``(?ipxm)'' is used to set flags. Any combination of the flag characters
``ipxm'' is allowed. The entire atom is treated as an empty string.
See the FLAGS section for a description of each flag.
-
``(?|''range``)'' is an alternate syntax for a character range.
Its benefit is that it does not use the Tcl special characters ``[]'' to
enclose the range.
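The grouping, lookahead, and comment constructs above can be sketched as follows, assuming the nre package is installed. The input strings and variable names are invented for illustration.

```tcl
package require nre 3.0

# (?:...) groups without capturing, so the number is the first
# (and only) captured subexpression.
nrematch {(?:ab)+([0-9]+)} "ababab42" whole num
puts "whole=$whole num=$num"

# (?=...) requires "bar" to follow "foo" but does not consume it,
# so only "foo" is stored in the match variable.
nrematch {foo(?=bar)} "foobar" ahead
puts "ahead=$ahead"

# (?!...) rejects "foo" when "bar" follows, so nothing matches here
# and the return value is 0.
set count [nrematch {foo(?!bar)} "foobar"]
puts "count=$count"

# (?#...) is a comment and is treated as an empty string.
nrematch {foo(?#a note for the reader)bar} "foobar" commented
puts "commented=$commented"
```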
FLAGS
Flags can be set using a ``(?''flag-char``)'' atom. Some commands that
use regular expressions have options that set some of these same
flags. For example the -nocase option sets the i flag. The advantage
of having the flags in the regular expression itself is that they can
then be used by any command without the need to add new command
switches. It is best to set the flags at the very beginning of the
regular expression; however they apply to the entire regular expression
no matter where they appear.
The i flag causes case to be ignored when alphabetic characters are
compared.
The m flag enables multi-line mode. The ``^'' atom is changed to match
at the beginning of the string or the beginning of any line in the
string. The ``$'' atom is changed to match at the end of the string or
the end of any line in the string. The ``.'' atom is changed to match
any character except ``\n''.
The x flag causes white space in the regular expression to be ignored
and removed during compilation. To include literal white space as an atom
to be matched, precede it with a backslash ``\''. Whitespace is only ignored
between atoms, pieces, branches, and regular expressions.
It is not ignored in ranges or in any other complex atom.
The white space includes comments where a comment starts with a ``#''
and continues to the end of the line.
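The x flag makes it possible to lay a pattern out over several lines with a comment on each part. A minimal sketch, assuming the nre package is installed (the version string is an invented input):

```tcl
package require nre 3.0

# With (?x) the line breaks, indentation, and # comments inside the
# pattern are ignored. The dot is written as a range so that it is
# matched literally.
set re {(?x)
    ([0-9]+)    # major version
    [.]         # literal dot
    ([0-9]+)    # minor version
}
set count [nrematch $re "nre 3.0" whole major minor]
puts "$count: $whole major=$major minor=$minor"
```

Storing the pattern in a braced string keeps Tcl from interpreting the brackets and leaves the comments intact for the regular expression compiler.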
The q flag enables quick compile mode.
This turns off expensive optimizations that tend to slow down compilation of the
regular expression. This can speed up compile time but may slow down
match time. Some regular expressions are optimized by the writer and
doing optimization is a waste of time. Another reason to use this flag
is if most of your time is spent compiling. This can happen if the
regular expression needs to be compiled each time and the text being
matched against is small. This flag can also be used to work around bugs
in the optimizer.
The p flag enables extra escape sequences and constructs to be
recognized. See the BACKWARDS COMPATIBILITY section for why these
constructs are not enabled by default. The following are enabled:
-
\w
-
Match a "word" character.
Same as ``[_[:alnum:]]''.
-
\W
-
Match a non-word character.
Same as ``[^_[:alnum:]]''.
-
\s
-
Match a whitespace character.
Same as ``[[:space:]]''.
-
\S
-
Match a non-whitespace character.
Same as ``[^[:space:]]''.
-
\d
-
Match a digit character.
Same as ``[[:digit:]]''.
-
\D
-
Match a non-digit character.
Same as ``[^[:digit:]]''.
-
\b
-
Zero-width assertion matches a word boundary.
Current character matches \w and previous character matches \W or
current character matches \W and previous character matches \w.
The position before the first character in the string and after the
last character match \W.
-
\B
-
Zero-width assertion matches a non-word boundary.
Current character matches \w and previous character matches \w or
current character matches \W and previous character matches \W.
The position before the first character in the string and after the
last character match \W.
-
\<
-
Zero-width assertion matches start of word.
Current character matches \w and previous character matches \W.
The position before the first character in the string and after the
last character match \W.
-
\>
-
Zero-width assertion matches end of word.
Current character matches \W and previous character matches \w.
The position before the first character in the string and after the
last character match \W.
-
\A
-
Zero-width assertion matches only at beginning of string even
if the m flag is set.
-
\Z
-
Zero-width assertion matches only at end of string even if the m flag is set.
-
\G
-
Zero-width assertion matches only where previous -all match left off.
-
\Q
-
Quote mode. All characters following are treated as literal text
until a \E or the end of the regular expression.
-
\E
-
End quote mode.
-
\num
-
Backreference to the num'th captured substring. The value of num
must not be greater than the number of captured substrings to the left
of the backreference. The text from the backreference is inserted
into the regular expression and is always treated as literal text.
-
\meta
-
If the ``\'' is followed by a regular expression meta character then
the meta character is treated as literal text. The meta characters
are: ``\*+?()|[]{}^$''. If ``\'' is followed by anything else
the regexp compiler will raise an error.
-
{x,y}
-
This piece construct is a repetition operator and is described above
in the piece paragraph.
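Some of the p flag constructs can be sketched as follows, assuming the nre package is installed. The input strings and variable names are invented for illustration.

```tcl
package require nre 3.0

# \d and the {x} repetition count are only recognized because the
# pattern starts with (?p).
nrematch {(?p)(\d{4})-(\d{2})} "built 1997-03" whole year month
puts "year=$year month=$month"

# \< and \> anchor the match to the start and end of a word, and \1
# refers back to the first captured word, so this finds a word that
# is immediately repeated.
nrematch {(?p)\<(\w+) \1\>} "it was was here" pair word
puts "doubled word: $word"
```

Without the leading ``(?p)'' these sequences keep their older meaning, as explained in the BACKWARDS COMPATIBILITY section.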
CHOOSING AMONG ALTERNATIVE MATCHES
In general there may be more than one way to match a regular expression
to an input string. For example, consider the command
nrematch (a*)b* aabaaabb x y
Considering only the rules given so far, x and y could
end up with the values aabb and aa, aaab and aaa,
ab and a, or any of several other combinations.
To resolve this potential ambiguity nrematch chooses among
alternatives using the following rules, applied in decreasing
order of priority:
-
If a regular expression could match two different parts of an input string
then it will match the one that begins earliest.
-
If a regular expression contains | operators then the leftmost
matching sub-expression is chosen.
-
In *, +, ?, and {x,y} constructs,
longer matches are chosen in preference to shorter ones.
These operators are often called greedy because they
match the longest possible string that allows the entire regular
expression to match.
In *?, +?, ??, and {x,y}? constructs,
shorter matches are chosen in preference to longer ones.
These operators are often called lazy because they match
the shortest possible string that allows the entire regular expression
to match.
-
In sequences of expression components the components are considered
from left to right.
In the example from above, (a*)b* matches aab: the (a*)
portion of the pattern is matched first and it consumes the leading
aa; then the b* portion of the pattern consumes the
next b. Or, consider the following example:
nrematch (ab|a)(b*)c abc x y z
After this command x will be abc, y will be
ab, and z will be an empty string.
Rule 4 specifies that (ab|a) gets first shot at the input
string and Rule 2 specifies that the ab sub-expression
is checked before the a sub-expression.
Thus the b has already been claimed before the (b*)
component is checked and (b*) must match an empty string.
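The difference between greedy and lazy pieces, together with the alternation example above, can be sketched as follows. This assumes the nre package is installed; the markup string is an invented input.

```tcl
package require nre 3.0

# Hypothetical input used only for illustration.
set input "<b>bold</b>"

# Greedy: .* takes as much text as it can while still allowing the
# final > to match, so the match runs to the last > in the string.
nrematch {<.*>} $input greedy
puts "greedy: $greedy"

# Lazy: .*? takes as little text as it can, so the match stops at
# the first > in the string.
nrematch {<.*?>} $input lazy
puts "lazy:   $lazy"

# The alternation example from the text: the ab branch is tried
# before the a branch, so (b*) is left with an empty string.
nrematch {(ab|a)(b*)c} abc x y z
puts "x=$x y=$y z=<$z>"
```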
LIMITS
The maximum number of capturing subexpressions ``()'' in a single
regular expression is 255. This limit does not apply to the
non-capturing ``(?:)''.
A compiled regular expression is limited in size to 32678 bytes. If
during compilation it is discovered that the regular expression
requires more memory then the operation will fail with the error:
``regexp too big''.
The counts in the repetition construct ``{x,y}'' must be greater than
or equal to zero and less than or equal to 255.
The maximum number of unique ranges in a regular expression is 64.
BACKWARDS COMPATIBILITY
Regular expressions from previous releases of Tcl should behave
exactly the same. The following new constructs:
(?...), *?, +?, and ??
would have caused compilation errors in older regular expressions, so they are
always recognized in new regular expressions.
All the other new constructs would have meant something else in older
regular expressions. So they always have the old meaning unless you
turn on one of the new flags. For example you need to start a regular
expression with (?p) if you want to use the new ``\'' sequences or
the ``{x,y}'' repetition construct.
PERFORMANCE INFORMATION
The first time a regular expression is used it is compiled into
a Tcl object. The next time that object needs to be used as a
regular expression the compilation step will not be needed if the
object still exists and is still a regular expression.
So if the regular expression is a constant string:
nrematch {abc|def|zeq} $str
then the first time the above command is executed the string
constant object is converted to a regular expression object
and will remain so giving a performance boost.
However if the regular expression string is not constant:
nrematch "$W1|$W2|$W3" $str
then the string object will need to be recreated each time the
above command executes.
If instead you stored the regular expression string into a variable
then the regular expression object would remain and not need to
be recreated each time:
set re "$W1|$W2|$W3"
proc foo {str} {
    # re is compiled once and reused on every call; str is passed in
    global re
    nrematch $re $str
}
If it is a complex regular expression used in more than one place
this can be a win in both time and space.
It is best to use (?i) instead of -nocase if you can
because then the text of the regular expression object describes
its state.
If you do not need the matchVar or a subMatchVar then you
can set that argument to an empty string ``{}''.
This tells nrematch to not bother setting a variable to that
particular captured subexpression.
BINARY CLEAN
The new regular expression compiler and matcher are binary clean.
This means that it is ok for the regular expression and the string
being matched to contain binary data including null bytes.
EXAMPLES
To match a C comment:
nrematch {/\*.*?\*/} $str
To match a number if not followed by a period:
nrematch {[0-9]+(?![.])} $str
To match a number if followed by something other than a period:
nrematch {[0-9]+(?=[^.])} $str
To match an item that contains only letters, but not all uppercase:
nrematch {^(?![A-Z]*$)[a-zA-Z]*$} $str
To see if a string contains both 'this' and 'that':
nrematch {^(?=.*?this)(?=.*?that)} $str
KEYWORDS
match, nre, regular expression, string
Last change: 3.0
[ nre3.0 ]
Copyright © 1997 Darrel Schneider.