Regular Expressions reference
Introduction
Regular Expressions (RegEx) is a syntax for specifying patterns of text to search and replace, which can be used for renaming files via the Regular Expressions renaming rule.
Special metacharacters allow you to specify, for instance, that a particular string you are looking for occurs at the beginning or end of a line, or contains N recurrences of a certain character. Metacharacters, such as $ . ^ { [ ( | ) * + ? \ are interpreted according to their individual meaning, instead of finding a literal match for them.
In this document, patterns are shown in orange color. The subject text which is checked against a pattern for a possible match is shown in bold black. Parts of the subject text are color-coded to provide a clue as to why a certain part matches (green color), or does not match (red color).
Simple matches
When the search string does not contain any metacharacters, the RegEx processor works like "normal" search. It tries to find an exact copy of the search string. This is also known as a literal match.
If you want to find a literal match for a metacharacter, put a backslash \ before it. The \ character is called an escape character, because it lets the metacharacter escape from its special duty and act as a normal character. Its combination with a metacharacter is called an escape sequence.
For example, the metacharacter ^ matches the beginning of a string, but the pattern \^ matches the ^ character literally. Similarly, the pattern \\ matches the \ character literally.
| Pattern | Matches | Remarks |
|---|---|---|
| foobar | foobar | This pattern does not contain any metacharacters, so all characters are matched literally. |
| \^FooBarPtr | ^FooBarPtr | The \^ escape sequence searches for the character ^ literally. |
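The difference between an anchor and an escaped literal can be checked with a short sketch. It uses Python's `re` module purely as an illustration. The RegEx processor described in this document is a different engine, so treat the snippet as an assumption about similar behaviour rather than a description of that engine.

```python
import re

subject = "^FooBarPtr"

# Unescaped, ^ is an anchor: "FooBarPtr" would have to start at position 0,
# but the subject starts with a caret, so nothing matches.
anchored = re.search(r"^FooBarPtr", subject)

# Escaped, \^ is a literal caret, so the subject matches as written.
literal = re.search(r"\^FooBarPtr", subject)

# re.escape() inserts the backslashes when the search text is arbitrary.
escaped_pattern = re.escape("a.b")

print(anchored)          # None
print(literal.group(0))  # ^FooBarPtr
print(escaped_pattern)   # a\.b
```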
Escape sequences
We have already seen one use of an escape sequence: matching a metacharacter literally.
Certain other escape sequences stand for special characters, as listed below.
| Pattern | Remarks |
|---|---|
| \xnn | Character represented by the hex code nn |
| \x{nnnn} | Two-byte character with hex code nnnn (Unicode) |
| \t | Tab (HT/TAB), same as \x09 (Hex 09) |
| \n | New line (NL), same as \x0a (Hex 0a) |
| \r | Carriage return (CR), same as \x0d (Hex 0d) |
| \f | Form feed (FF), same as \x0c (Hex 0c) |
| foo\x20bar | Matches foo bar (note the space in the middle), but does not match foobar |
| \tfoobar | Matches foobar preceded by a tab (the tab is needed for the match) |
Note that the tab, new line, carriage return, and form feed are all known as white space characters, because they are normally not visible on display, but the RegEx processor can distinguish between them.
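The last two rows of the table can be tried out with a small sketch in Python's `re` module, used here only for illustration. Note that Python does not accept the `\x{nnnn}` form, so only `\xnn` and `\t` are shown.

```python
import re

# \x20 is the space character, so a space is required between foo and bar.
with_space = re.search(r"foo\x20bar", "foo bar") is not None
without_space = re.search(r"foo\x20bar", "foobar") is not None

# \t requires a real tab character directly before foobar.
with_tab = re.search(r"\tfoobar", "\tfoobar") is not None
without_tab = re.search(r"\tfoobar", "foobar") is not None

print(with_space, without_space)  # True False
print(with_tab, without_tab)      # True False
```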
Character classes
A character class is a list of characters surrounded by square brackets [ and ], which will match any one (and only one) character from the list.
Note that:
- The characters are not separated with a comma or a space.
- If you repeat any character in the list, it is considered only once (duplicates are ignored).
- A hyphen character - is used to indicate a range of characters.
| Pattern | Remarks |
|---|---|
| [abcdef] | Matches a, b, c, d, e, or f (only one character), but no other characters |
| [c-m] | Matches any one (and only one) of the small alphabetical characters, from c to m |
| [G-J] | Matches any one (and only one) of the capital alphabetical characters from G to J |
| [a-zA-Z] | Matches any one (and only one) of the alphabetical characters (capital or small) |
| [5-8] | Matches any one (and only one) of numerical characters from 5 to 8 |
| [\n-\x1F] | Matches any one (and only one) of characters with their ordinal value in range from #10 (\n) to #31 (\x1F), which in ASCII character table correspond to some non-printable characters. Note the use of escape sequences inside of this example. |
There are some special conditions:
- If you do not want any of the characters in the specified class, then place ^ at the very beginning of the list, which means "none of the characters listed in this class".
- If you want [ or ] itself to be a member of a class, put it at the start or end of the list, or use an escape sequence by putting \ before it.
| Pattern | Remarks |
|---|---|
| [-az] | Matches a, z, and - (since - is at the beginning of the pattern, the escape sequence is not needed) |
| [a\-z] | Matches a, z, and - (since - is not at the beginning/end of the pattern, the escape sequence is needed) |
| [^0-9] | Matches any non-digit character |
| []-a] | Matches any character from ] to a (since ] is at the beginning of the pattern, the escape sequence is not needed) |
| foob[aeiou]r | Matches foobar and foober, but not foobbr, foobcr, etc. |
| foob[^aeiou]r | Matches foobbr, foobcr, etc., but not foobar, foober, etc. |
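A few of the character classes above can be verified with the following sketch. It uses Python's `re` module as a stand-in, and the helper name `matches_fully` is made up for this example.

```python
import re

def matches_fully(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

vowel_ok = matches_fully(r"foob[aeiou]r", "foobar")        # a is in the class
vowel_bad = matches_fully(r"foob[aeiou]r", "foobbr")       # b is not in the class
negated_ok = matches_fully(r"foob[^aeiou]r", "foobbr")     # b is allowed by the negated class

# A hyphen at the very start of the class is a literal hyphen, not a range.
hyphen_members = re.findall(r"[-az]", "a-z b")

# A negated digit range keeps everything that is not 0-9.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(vowel_ok, vowel_bad, negated_ok)  # True False True
print(hyphen_members)                   # ['a', '-', 'z']
print(non_digits)                       # ['a', 'b']
```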
Predefined classes
Some of the character classes are used so often that RegEx has predefined escape sequences to represent them.
| Pattern | Remarks |
|---|---|
| \w | an alphanumeric character, including an underscore _ character |
| \W | a non-alphanumeric character |
| \d | a numeric character |
| \D | a non-numeric character |
| \s | any space (same as the [ \t\n\r\f] character class) |
| \S | a non space |
| . | any character in line (the symbol is just a dot) |
Notice that the capitalized letters act as negatives, for example, \W is the negation of \w.
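The predefined classes can be explored with a short sketch in Python's `re` module. This is an illustration under assumptions: Python's classes are Unicode-aware by default, so the `re.ASCII` flag is used here to stay close to the plain alphanumeric meaning given in the table. The sample text is invented.

```python
import re

text = "file_01 v2.txt"

words = re.findall(r"\w+", text, flags=re.ASCII)    # runs of letters, digits and _
digits = re.findall(r"\d", text, flags=re.ASCII)    # single digits
non_word = re.findall(r"\W", text, flags=re.ASCII)  # everything \w rejects
spaces = re.findall(r"\s", text, flags=re.ASCII)    # white space only

print(words)     # ['file_01', 'v2', 'txt']
print(digits)    # ['0', '1', '2']
print(non_word)  # [' ', '.']
print(spaces)    # [' ']
```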
Word and text boundaries
A word boundary \b matches a position between a word character \w and a non-word character \W. For the purpose of a word boundary position, the start and end of text will be treated as non-word characters \W. These markers are commonly used for matching patterns as whole words, while ignoring occurrences within words.
| Pattern | Remarks |
|---|---|
| \b | word boundary |
| \B | not word boundary |
| \A | start of text (^ is an alternative) |
| \Z | end of text ($ is an alternative) |
For example, \bhis\b will search for a whole word his, but will ignore this, history or whistle.
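The whole-word behaviour of \b can be sketched as follows, using Python's `re` module as an illustrative stand-in and an invented sample sentence.

```python
import re

text = "this is his history, not a whistle"

# Without boundaries, "his" is also found inside this, history and whistle.
anywhere = re.findall(r"his", text)

# With \b on both sides, only the standalone word is found.
whole_words = [m.start() for m in re.finditer(r"\bhis\b", text)]

print(len(anywhere))  # 4
print(whole_words)    # [8]
```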
Iterators
Iterators (quantifiers) are metacharacters that specify how many times the preceding expression has to repeat. For example, they can be used to find a numeric sequence that is 3 to 5 digits long.
Iterators can be Greedy or Non-Greedy. Greedy means the expression grabs as much matching text as possible. In contrast, the non-greedy expression tries to match as little as possible.
All iterators are greedy by default. Adding ? (question mark) at the end of an iterator makes it non-greedy.
For example:
- when b+ (a greedy expression) is applied to string abbbc, it matches bbb (as many as possible),
- but when b+? (a non-greedy expression) is applied to abbbc, it matches only b (as few as possible).
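The two cases above can be reproduced with a minimal sketch in Python's `re` module, which follows the same greedy and non-greedy convention.

```python
import re

subject = "abbbc"

greedy = re.search(r"b+", subject).group(0)       # takes as many b as possible
non_greedy = re.search(r"b+?", subject).group(0)  # stops after the first b

print(greedy)      # bbb
print(non_greedy)  # b
```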
| Pattern | Meaning | Greedy | Remarks |
|---|---|---|---|
| * | zero or more | Yes | equivalent to {0,} |
| + | one or more | Yes | equivalent to {1,} |
| ? | zero or one | Yes | equivalent to {0,1} |
| {n} | exactly n times | Yes | |
| {n,} | at least n times | Yes | |
| {n,m} | at least n but not more than m times | Yes | |
| *? | zero or more | No | equivalent to {0,}? |
| +? | one or more | No | equivalent to {1,}? |
| ?? | zero or one | No | equivalent to {0,1}? |
| {n}? | exactly n times | No | |
| {n,}? | at least n times | No | |
| {n,m}? | at least n but not more than m times | No | |
Let us see some examples:
| Pattern | Remarks |
|---|---|
| foob.*r | matches foobar, foobxyz123r and foobr |
| foob.+r | matches foobar, foobxyz123r but not foobr |
| foob.?r | matches foobar, foobbr and foobr but not foobxyz123r |
| fooba{2}r | matches foobaar |
| fooba{2,}r | matches foobaar, foobaaar, foobaaaar but not foobar |
| fooba{2,3}r | matches foobaar, foobaaar but not foobaaaar or foobar |
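The examples in this table can be checked mechanically. The sketch below uses Python's `re.fullmatch`, which requires the whole candidate to match. That is a stricter test than a search inside a longer text, and is an assumption made here to keep the comparison simple.

```python
import re

candidates = ["foobr", "foobar", "foobaar", "foobaaar", "foobaaaar"]
patterns = [
    r"foob.*r", r"foob.+r", r"foob.?r",
    r"fooba{2}r", r"fooba{2,}r", r"fooba{2,3}r",
]

# For each pattern, keep the candidates that match from start to end.
results = {
    pattern: [text for text in candidates if re.fullmatch(pattern, text)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

For instance, `results[r"fooba{2,3}r"]` contains only foobaar and foobaaar, in line with the last row of the table.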
Alternatives
A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter | is used to separate the alternatives.
For example, fee|fie|foe will match fee, fie, or foe in the target text.
In complex expressions it may be difficult to identify where each alternative starts and ends. This is why it is a common practice to group alternatives into parentheses, to make them easier to understand.
For example, fee|fie|foe can be written as (fee|fie|foe) or as f(e|i|o)e.
Alternatives are tested for a match in the order in which they appear, from left to right, iterating until a full expression match is found or all alternatives have failed to match. For example, when matching foo|foot against barefoot, only the foo part will match, because foo is tested first and matches successfully.
For example, foo(bar|foo) matches foobar or foofoo.
Also remember that alternatives cannot be used inside a character class (square brackets), because | is interpreted literally within []. That means that [fee|fie|foe] is the same as [feio|], where repeated characters are treated as duplicates and ignored.
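The left-to-right ordering and the literal meaning of | inside square brackets can both be observed in a short sketch. Python's `re` module is used as an illustration, and the sample texts other than barefoot are invented.

```python
import re

# The first alternative that lets the whole pattern match wins.
first_wins = re.search(r"foo|foot", "barefoot").group(0)
longer_first = re.search(r"foot|foo", "barefoot").group(0)

# Grouped alternatives: only the middle letter varies.
grouped = [m.group(0) for m in re.finditer(r"f(e|i|o)e", "fee fie foe fum")]

# Inside a character class, | is just another member of the list.
in_class = re.findall(r"[fee|fie|foe]", "f|x")

print(first_wins)    # foo
print(longer_first)  # foot
print(grouped)       # ['fee', 'fie', 'foe']
print(in_class)      # ['f', '|']
```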
Subexpressions
Parts of a pattern can be enclosed in round brackets (), just like using brackets in a mathematical formula such as a+(b+c). Each part that is enclosed in brackets is called a "subexpression".
Subexpressions can provide clarity in expressions, form a functional group that is treated as a single unit (as in the mathematical formula a(b+c)), and can be referenced in both the expression itself and in the replacement.
Let's see some examples:
- (fee|fie|foe) matches fee, fie, or foe.
- (foobar){2,3} matches foobarfoobar or foobarfoobarfoobar, because the iterator applies to the whole subexpression.
- foobar{2,3} matches foobarr or foobarrr, because the iterator applies only to the preceding character r.
- foob([0-9]|a+)r matches foob0r, foob1r, foobar, foobaar, foobaaar, etc.
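How brackets change what an iterator applies to can be seen in the following sketch. It uses Python's `re.fullmatch` as an illustrative stand-in, and the helper name `full` is made up for this example.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# With brackets, {2,3} repeats the whole group foobar.
group_repeated = full(r"(foobar){2,3}", "foobarfoobar")
group_single_r = full(r"(foobar){2,3}", "foobarr")

# Without brackets, {2,3} repeats only the final r.
char_repeated = full(r"foobar{2,3}", "foobarr")
char_whole = full(r"foobar{2,3}", "foobarfoobar")

# Alternatives inside a group: one digit, or one or more a.
digit_branch = full(r"foob([0-9]|a+)r", "foob7r")
letter_branch = full(r"foob([0-9]|a+)r", "foobaaar")
empty_branch = full(r"foob([0-9]|a+)r", "foobr")

print(group_repeated, group_single_r)             # True False
print(char_repeated, char_whole)                  # True False
print(digit_branch, letter_branch, empty_branch)  # True True False
```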
Backreferences
Backreferences allow you to reference individual subexpressions, enabling complex repeating patterns.
When the RegEx engine finds text that matches the whole pattern, it also identifies the text matched by each subexpression and numbers these parts. The text that matches the full expression takes the number 0, and the text matching the nth subexpression takes the number n.
Each subexpression is identified by its index (order of appearance) in the pattern and can be referenced using a backslash \ followed by the subexpression index, e.g. \1, \2, \3, and so on.
Let's see some examples:
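| Pattern | Remarks |
|---|---|
| (.)\1+ | Matches any character that is repeated at least twice, e.g. aaaa and cc. |
| (.+)\1+ | Matches any sequence of characters that is repeated at least twice, e.g. abab and 123123. |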
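Both patterns can be tried with a minimal sketch in Python's `re` module, which uses the same \1 notation inside a pattern. The sample texts are invented for illustration.

```python
import re

# (.)\1+ : one captured character followed by one or more copies of it.
repeated_chars = [m.group(0) for m in re.finditer(r"(.)\1+", "aaaa b cc")]

# (.+)\1+ : a captured sequence followed by one or more copies of it.
unit_abab = re.fullmatch(r"(.+)\1+", "abab").group(1)
unit_digits = re.fullmatch(r"(.+)\1+", "123123").group(1)
no_repeat = re.fullmatch(r"(.+)\1+", "abc")

print(repeated_chars)  # ['aaaa', 'cc']
print(unit_abab)       # ab
print(unit_digits)     # 123
print(no_repeat)       # None
```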
Substitutions
Subexpressions can also be referenced in the replacement pattern, allowing you to assemble the output using individual subexpression matches.
Each subexpression is identified by its index (order of appearance) in the full expression and can be referenced using a dollar sign $ followed by the subexpression index, e.g. $1, $2, $3, and so on. The full expression match can be referenced using $0, which is always available, even when the pattern contains no round brackets.
The replacement text is typically a combination of the text that matched the subexpressions and some new text. Parts of the pattern that are not enclosed in () are not subexpressions, so they cannot be referenced individually in the replacement.
Let's see some examples of replacement with substitutions:
| Find | Replace | Description |
|---|---|---|
| ( | $2, $1 | Switch two words around and put a comma between them. |
| \b(\d{2})-(\d{2})-(\d{4})\b | $3-$2-$1 | Find dates in dd-mm-yyyy format and reverse them into yyyy-mm-dd format. |
Note that the last example is not a robust approach for handling dates, because \d matches any digit in 0-9 range. This means that sequences like "00-00-0000" and "99-99-9999" will also match this pattern, but do not represent a valid date.
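The substitution idea can be sketched in Python, with two loudly stated assumptions. First, Python's `re.sub` writes references in the replacement as \1 or \g<1> rather than $1, so the replacement strings below are translations, not the syntax described in this document. Second, the word-swapping pattern `(\w+) (\w+)` and all sample inputs are hypothetical choices made for this illustration.

```python
import re

# Swap two words and put a comma between them (hypothetical pattern and input).
swapped = re.sub(r"(\w+) (\w+)", r"\2, \1", "John Smith")

# Reverse dd-mm-yyyy into yyyy-mm-dd using the date pattern from the table.
date_pattern = r"\b(\d{2})-(\d{2})-(\d{4})\b"
iso_date = re.sub(date_pattern, r"\3-\2-\1", "Taken on 25-10-2007")

# The pattern does not validate dates: impossible values are rewritten too.
not_a_date = re.sub(date_pattern, r"\3-\2-\1", "99-99-9999")

# \g<0> is Python's way to refer to the full match.
bracketed = re.sub(r"\d+", r"[\g<0>]", "img 42")

print(swapped)     # Smith, John
print(iso_date)    # Taken on 2007-10-25
print(not_a_date)  # 9999-99-99
print(bracketed)   # img [42]
```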
Text case adjustments
Subexpressions can also be used to adjust the text case (upper case, lower case) of certain patterns or fragments, which otherwise cannot be easily achieved with generic text case manipulation rules.
The following flags can be combined with subexpression references in the replace pattern:
| Flag | Description |
|---|---|
| \L | Convert all characters to lowercase. |
| \l | Convert only the first character to lowercase. |
| \U | Convert all characters to uppercase. |
| \u | Convert only the first character to uppercase. |
These flags can be used together with the backreferences in the replace pattern to adjust the case of text inserted by backreferences.
For example, we can do the following manipulations:
| Input | Find | Replace | Result |
|---|---|---|---|
| | (.+) (.+) | $1 $2 | |
| | (.+) (.+) | \U$1 $2 | |
| | (.+) (.+) | $1 \L$2 | |
| | (.+) (.+) | \u$1 \L$2 | |
Note: Case manipulation features were added in v5.72.4 Beta. This feature is less common and may not exist in other RegEx implementations.
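Python's `re` module is one of the implementations that has no \U, \L, \u or \l flags in the replacement. A comparable effect can be approximated with a replacement function, as in the sketch below. This is an emulation under assumptions, not the mechanism described in this document: the function name and the sample input are hypothetical, and it imitates only the combination \u$1 \L$2.

```python
import re

def upper_first_lower_second(match):
    """Imitate the replace pattern \\u$1 \\L$2 for a two-group match."""
    first, second = match.group(1), match.group(2)
    # \u: only the first character of group 1 becomes uppercase.
    # \L: every character of group 2 becomes lowercase.
    return first[:1].upper() + first[1:] + " " + second.lower()

result = re.sub(r"(.+) (.+)", upper_first_lower_second, "hello WORLD")

print(result)  # Hello world
```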
Limitations for binary data
A known limitation of the RegEx processor when working with binary data is that the input text is not searched beyond the first occurrence of a NULL character (\x00). This does not affect file names because there are simply no NULL characters in them, but may affect parsing of binary content of files, for example, when working with the Pascal Script rule.
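The practical effect of this limitation can be pictured with a small sketch. Python's `re` module does search past NULL bytes, so the hypothetical helper `searchable_part` below cuts the data at the first NULL to imitate what a processor with this limitation would see. The sample bytes are invented.

```python
import re

def searchable_part(data: bytes) -> bytes:
    """Return the bytes before the first NULL (the whole input if there is none)."""
    return data.split(b"\x00", 1)[0]

data = b"header\x00payload"

# Searching the full data finds text located after the NULL byte.
found_in_full = re.search(rb"payload", data) is not None

# Searching only the part before the NULL misses it.
found_in_prefix = re.search(rb"payload", searchable_part(data)) is not None

print(searchable_part(data))  # b'header'
print(found_in_full)          # True
print(found_in_prefix)        # False
```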
Useful references
- Regular-Expressions.info – Excellent site devoted to regular expressions. It is nicely structured, with many easy to understand examples.
- TRegExpr – Regular expressions library for Delphi and Free Pascal. For syntax and API documentation see regex.sorokin.engineer.
- FPC RegEx packages – Regular expressions libraries included in Free Pascal.