archie/release/base/help/english/regex/=
2024-05-28 17:59:32 +02:00

342 lines
8.2 KiB
Plaintext

archie uses ed(1) regular expressions in a number of
commands.
A regular expression, on the one hand, is a string like
any other; a sequence of characters. On the other
hand, special characters within the string have certain
functions which make regular expressions useful when
trying to match portions of other strings. In the fol-
lowing discussion and examples, a string containing a
regular expression will be called the ``pattern'', and
the string against which it is to be matched is called
the ``reference string''.
Regular expressions allow one to search for "all strings
ending with the letters 'ize'" or "all strings beginning
with a number between 1 and 3 and ending in a comma".
In order to accomplish this, regular expressions co-opt
the use of some characters to have special meaning.
They also provide for these characters to lose their
special meaning if the user so desires. The rules for
regular expresssion are
c Any character c matches itself unless it has been
assigned other special meaning as listed below. Most
special characters can be escaped (made to lose its
special meaning), by placing the character '\' in front
of it. This doesn't apply to '{' which is non-special
until it is escaped. Thus although '*' normally has
special meaning the string '\*' matches itself.
Example:
The pattern
acdef
matches
s83acdeffff or acdefsecs or acdefsecs
but not
accdef or aacde1f
That is it will any string that contains ``acdef'' any-
where in the reference string.
Example:
Normally the characters '*' and '$' are special,
but the pattern
a\*bse\$
acts as above. That is any reference string containing
``*abse$'' as a substring will be flagged as a match.
. A period matches any character except the newline
character. This is known as the wildcard character.
Example:
The pattern
....
will match any 4 characters in the reference string,
except a newline character.
^ If `^' appears at the begining of the pattern then it
is said to ``anchor'' the match to the beginning of the
line. That is, the reference string must start with the
pattern following the `^'. If this character appears
anywhere else other than at the beginning of the line,
then it is no longer considered special, and matches
itself as any non-special character would. Similarly if
it starts a string but is escaped, it matches itself.
Example:
The pattern
^efghi
Will match
efghi or efghijlk
but not
abcefghi
That is the pattern will match only those reference
strings starting with ``efghi''. Just containing the
substring is not sufficient.
$ Occurring at the end of the pattern, this character
``anchors'' the pattern to the end of the line (refer-
ence string). A '$' occurring anywhere else in the pat-
tern is regarded as a non-special. Similarly if it is
at the end of the pattern but is escaped, it is non-
special.
Example:
The pattern
efghi$
Will match
efghi or abcdefghi
but not
efghijkl
That is the pattern will match only those reference
strings ending with ``efghi''. Just containing the sub-
string is not sufficient.
\< This sequence in the pattern causes the one character
regular expression following it only to match something
at the beginning of a word: the beginning of a line or
just before a letter, digit or underline character, or
just after a charcter which is not one of these.
Example:
The pattern
\<abc
would match the last 'abc' in the reference string
@hijabc#+abc
but not the first since the first 'abc' did not start
on a ``word'' boundary.
\> Constrains the one-character regular expression fol-
lowing it to be at the end of a ``word'' as defined
above.
[string]
One or more characters within square brackets. This
pattern matches any single character within the brack-
ets. The caret, '^', has a special meaning if it is the
first character in the series: the pattern will match
any character other than one in the list.
Example:
The pattern
[^abc]
Will match any character except 'a', 'b' or 'c'.
To match a right bracket, ']', in the list it must be
put first:
[]ab01]
For a caret, '^', in the list it can appear anywhere
but first.
In
[ab^01]
the caret loses its special meaning.
The '-' character is special within square brackets. It
is interpreted as a range of characters (in the ASCII
character set) and will match any single character
within that range. '[a-z]' matches any lower case
letter. The '-' can be made non special by placing it
first or last within the square brackets.
The characters '$', '*' and '.' are not special within
square brackets.
Example:
The pattern
[ab01]
matches a single occurence of a character from the set
'a', 'b', '0', '1'.
Example:
The pattern
[^ab01]
will match any single character other than 'a', 'b',
'0', '1'.
Example :
The pattern
[a0-9b]
which matches one of 'a', 'b' or a digit between 0 and
9 inclusive.
Example :
The pattern
[^a0-9b.$]
means any single character not 'a', 'b' '.' , '$' or a
digit between 0 and 9 inclusive.
* An asterisk following a regular expression in the pat-
tern has the effect of matching zero or more
occurrences of that expression.
Example:
The pattern
a*
means zero or more occurrences of the character 'a'.
Example:
The pattern
[A-Z]*
means zero or more occurrences of the upper case alpha-
bet.
\{m\}
\{m,\}
\{m,n\}
A one-character regular expression followed by one of
the three of these constructions causes a range of
occurrences of that regular expression to be matched.
If it is followed by \{m\} where m is a non-negative
integer between 0 and 255 (inclusive), then exactly m
occurrences of that regular expression are matched. If
followed by \{m,\}, then at least m occurrences are
matched. Finally, if it is followed by \{m,n\} (where
n is a non-negative integer between 0 and 255 and where
n > m), then between m and n occurrences of the expres-
sion are matched.
Example:
The pattern
ab\{3\}
would match any substring in the reference string of an
'a' followed by exactly 3 'b's.
Example:
The pattern
ab\{3,\}
would match any substring in the reference string of an
'a' followed by at least 3 'b's.
Example:
The pattern
ab\{3,5\}
would match any substring in the reference string of an
'a' followed by at least 3 but at most 5 'b's.
Common Problems with Regular Expression
(1) When matching a substring it is not necessary to use
the wildcard character to match the part of the refer-
ence string preceeding and following the substring.
Example:
The pattern
abcd
will match any reference string containing this pat-
tern. It is not necessary to use
.*abcd.*
as the pattern.
(2) In order to constrain a pattern to the entire reference
pattern, use the the construction:
^pattern$
(3) The easiest way to obtain case insensitivity in a regu-
lar expression is to use the '[]' operator. For exam-
ple, a pattern to match the word ``hello'' regarless of
the case of the letters would be:
[Hh][Ee][Ll][Ll][Oo]