Regular Expressions reference
Introduction
Regular Expressions (RegEx) is a syntax for specifying patterns of text to search and replace, which can be used for renaming files via the Regular Expressions renaming rule.
Special metacharacters allow you to specify, for instance, that a particular string you are looking for occurs at the beginning or end of a line, or contains N recurrences of a certain character. Metacharacters, such as $ . ^ { [ ( | ) * + ? \ are interpreted according to their individual meaning, instead of finding a literal match for them.
In this document, patterns are shown in orange color. The subject text which is checked against a pattern for a possible match is shown in bold black. Parts of the subject text are color-coded to provide a clue as to why a certain part matches (green color), or does not match (red color).
Simple matches
When the search string does not contain any metacharacters, the RegEx processor works like "normal" search. It tries to find an exact copy of the search string. This is also known as a literal match.
If you want to find a literal match for a metacharacter, put a backslash \ before it. The \ character is called an escape character, because it lets the metacharacter escape from its special duty and act as a normal character. Its combination with a metacharacter is called an escape sequence.
For example, the metacharacter ^ matches the beginning of a string, but the escape sequence \^ matches the ^ character literally. Similarly, the pattern \\ matches the \ character literally.
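The difference between a metacharacter and its escaped form can be sketched as follows. This is a minimal illustration that assumes Python's `re` module, which may differ in details from the engine described here; the sample strings are made up for the example.

```python
import re

text = "2^10 = 1024"

# Unescaped ^ is an anchor: it matches the empty position at the start of the text.
anchor = re.search(r"^", text)

# Escaped \^ matches the caret character itself.
literal = re.search(r"\^", text)

print(anchor.start(), repr(anchor.group()))
print(literal.start(), repr(literal.group()))

# A literal backslash is matched by the pattern \\ (written here as a raw string).
backslash = re.search(r"\\", r"C:\temp")
print(backslash.start())
```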
Escape sequences
We have already seen one use of an escape sequence above.
Specific escape sequences are interpreted as special conditions, as listed below.
Note that the tab, new line, carriage return, and form feed are all known as white space characters, because they are normally not visible on display, but the RegEx processor can distinguish between them.
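As a rough illustration of how a RegEx processor tells white space characters apart, the sketch below assumes Python's `re` module and uses the common \t, \r and \n escape sequences; the sample text is invented.

```python
import re

# A tab, then a carriage return + new line pair, then an ordinary space.
text = "name\tvalue\r\nnext line"

# Each escape sequence matches only its own white space character.
tabs = [m.start() for m in re.finditer(r"\t", text)]
carriage_returns = [m.start() for m in re.finditer(r"\r", text)]
new_lines = [m.start() for m in re.finditer(r"\n", text)]

print(tabs, carriage_returns, new_lines)
```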
Character classes
A character class is a list of characters surrounded by square brackets [ and ], which will match any one (and only one) character from the list.
Note that:
The - character is used to indicate a range of characters.
There are some special conditions:
^ at the very beginning of the list means "none of the characters listed in this class".
If you want [ or ] itself to be a member of the class, put \ before it.
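The rules for character classes can be tried out with a short sketch. It assumes Python's `re` module, and the word list is made up for the example.

```python
import re

words = "bag beg big bog bug b-g b]g"

# A plain list: exactly one character out of a, e, i, o, u.
any_vowel = re.findall(r"b[aeiou]g", words)

# A range: one character from a through e.
range_a_to_e = re.findall(r"b[a-e]g", words)

# A negated class: one character that is NOT a vowel.
not_vowel = re.findall(r"b[^aeiou]g", words)

# An escaped ] inside the class is treated as a literal member.
literal_bracket = re.findall(r"b[\]]g", words)

print(any_vowel, range_a_to_e, not_vowel, literal_bracket)
```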
Predefined classes
Some of the character classes are used so often that RegEx has predefined escape sequences to represent them.
\w
an alphanumeric character, including the _ character
\W
a non-alphanumeric character
\d
a numeric character
\D
a non-numeric character
\s
any space (same as the [ \t\n\r\f] character class)
\S
a non space
.
any character within a line (the symbol is just a dot)
Notice that the capitalized letters act as negatives; for example, \W matches exactly the characters that \w does not.
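The predefined classes can be compared side by side in a small sketch. It assumes Python's `re` module, where \w also covers non-English letters by default (the sample name is plain ASCII, so that does not matter here); the file name is hypothetical.

```python
import re

name = "file_01 (copy).txt"

# \w+ collects runs of alphanumeric characters and underscores.
word_runs = re.findall(r"\w+", name)

# \d matches single digits; \W and \s match what is left over.
digits = re.findall(r"\d", name)
non_word = re.findall(r"\W", name)
spaces = re.findall(r"\s", name)

# The dot stands for any single character within the line.
dot_match = re.findall(r"c.py", name)

print(word_runs, digits, non_word, spaces, dot_match)
```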
Word and text boundaries
A word boundary \b matches a position between a word character \w and a non-word character \W. For the purpose of a word boundary position, the start and end of text will be treated as non-word characters \W. These markers are commonly used for matching patterns as whole words, while ignoring occurrences within words.
For example, \bhis\b will search for a whole word his, but will ignore this, history or whistle.
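The whole-word behaviour described above can be checked with a minimal sketch, assuming Python's `re` module; the sample sentence is invented.

```python
import re

text = "his history, this whistle, and his"

# With word boundaries, only the standalone word is found.
whole_word_positions = [m.start() for m in re.finditer(r"\bhis\b", text)]

# Without them, every occurrence inside longer words is found as well.
any_occurrence = re.findall(r"his", text)

print(whole_word_positions, len(any_occurrence))
```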
Iterators
Iterators (quantifiers) are metacharacters that specify how many times the preceding expression has to repeat. For example, \d{3,5} finds a numeric sequence that is 3 to 5 digits long.
Iterators can be greedy or non-greedy. Greedy means the expression grabs as much matching text as possible. In contrast, a non-greedy expression tries to match as little as possible.
All iterators are greedy by default. Adding ? (question mark) at the end of an iterator makes it non-greedy.
For example:
Let us see some examples:
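The following sketch contrasts greedy and non-greedy iterators and shows a counted repetition. It assumes Python's `re` module, which uses the same ? suffix for non-greedy matching; the sample strings are made up.

```python
import re

markup = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole text.
greedy = re.findall(r"<.+>", markup)

# Non-greedy: .+? stops at the first possible >, giving one match per tag.
non_greedy = re.findall(r"<.+?>", markup)

# Counted repetition: whole numbers that are 3 to 5 digits long.
numbers = re.findall(r"\b\d{3,5}\b", "12 123 12345 1234567")

print(greedy, non_greedy, numbers)
```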
Alternatives
A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter | is used to separate the alternatives.
For example, fee|fie|foe will match with fee, fie, or foe in the target string.
With several alternatives, it can be difficult to see where each one starts and ends. This is why it is common practice to enclose alternatives in parentheses.
For example, fee|fie|foe can be written as f(e|i|o)e.
Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen. For example, when matching foo|foot against barefoot, only the foo part will match, because that is the first alternative tried, and it successfully matches the target string. (This is important when you are capturing matched text using parentheses.)
Also remember that alternatives cannot be used inside a character class (square brackets), because | is interpreted as a literal within []. That means [fee|fie|foe] is the same as [feio|]. (The other characters are treated as duplicates and ignored.)
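The points about alternatives can be verified with a short sketch, assuming Python's `re` module, which also tries alternatives from left to right; the sample strings are taken from the text or invented.

```python
import re

text = "fee fie foe fum"

# Both spellings of the pattern find the same three words.
plain = re.findall(r"fee|fie|foe", text)
grouped = [m.group() for m in re.finditer(r"f(e|i|o)e", text)]

# Order matters: the first alternative that lets the pattern match wins.
foo_first = re.search(r"foo|foot", "barefoot").group()
foot_first = re.search(r"foot|foo", "barefoot").group()

# Inside square brackets, | is just another literal character.
in_class = re.findall(r"[fee|fie|foe]", "f|x")

print(plain, grouped, foo_first, foot_first, in_class)
```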
Subexpressions
Parts of any RegEx pattern can be enclosed in parentheses (), just like using parentheses in a mathematical formula. Each part that is enclosed in parentheses is called a "subexpression".
The parentheses serve two main purposes:
Let us see some examples:
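As a hedged illustration, the sketch below shows two common uses of subexpressions: applying an iterator to a group of characters, and capturing parts of the match. It assumes Python's `re` module; the file name is hypothetical.

```python
import re

# Without parentheses, + applies only to the preceding character b.
single_char_repeat = re.search(r"ab+", "xxabbbab").group()

# With parentheses, + applies to the whole subexpression ab.
group_repeat = re.search(r"(ab)+", "xxababab").group()

# Parentheses also capture the text matched by each subexpression.
date = re.search(r"(\d{4})-(\d{2})", "report 2024-05.pdf")
year, month = date.groups()

print(single_char_repeat, group_repeat, year, month)
```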
Backreferences
You must have told (or heard) jokes like this one:
"Two guys walk into a bar. The first guy says.... Then the second guy replies....".
Then you are already familiar with backreferences!
A "backreference" is a numbered reference to a previously mentioned thing.
RegEx also has backreferences. Let us understand how backreferences are defined in RegEx.
The RegEx engine tries to find text that matches the whole RegEx pattern. If a matching text is found, the RegEx engine identifies the matching text for each of the subexpressions in the pattern.
At this stage, the RegEx engine gives numbers to these matching parts:
Now we use those numbers to refer to the entire pattern and/or subexpressions. (That is why these numbers are called "backreferences".)
The backreference to the nth subexpression is written as \n.
The backreferences can be used to compose the RegEx pattern itself, as shown below:
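A backreference inside a pattern can be sketched as follows, assuming Python's `re` module, which also writes the nth backreference as \n; the sample sentences are invented.

```python
import re

# (\w+) captures a word; \1 then requires the very same word to follow.
repeated = re.search(r"\b(\w+) \1\b", "it is is done")
repeated_text = repeated.group()
repeated_word = repeated.group(1)

# The closing quote must be the same character as the opening quote.
quoted = re.search(r"(['\"])(.*?)\1", "say 'hi' now")
quoted_text = quoted.group(2)

print(repeated_text, repeated_word, quoted_text)
```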
Substitution of text using backreference
The backreferences are also used in find-and-replace operations, to re-assemble new text from old.
The replacement text is typically a combination of:
Note that the RegEx pattern may have some parts that are not enclosed in (). (In other words, it may have parts that are not subexpressions.) Such parts are not used in the replacement text.
Here are some "find-and-replace" examples:
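A find-and-replace with backreferences might look like the sketch below. It assumes Python's `re.sub`, where the replacement text refers to subexpressions as \1, \2 and so on; the replacement syntax varies between engines, and the file names are hypothetical.

```python
import re

old_name = "2024-05-17 report.txt"

# Three subexpressions capture year, month and day. The hyphens between them
# are not captured, so they do not appear in the result unless retyped.
new_name = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3.\2.\1", old_name)

# Swap two captured words and drop the comma that was not captured.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Smith, John")

print(new_name)
print(swapped)
```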
Upper case and lower case manipulations
Backreferences can also be used to adjust the case of certain patterns or fragments, which cannot be easily achieved with generic case manipulation rules.
\L
Convert all characters to lowercase.
\l
Convert only the first character to lowercase (that's a lower case L).
\U
Convert all characters to uppercase.
\u
Convert only the first character to uppercase.
These flags can be used together with the backreferences in the replace pattern to adjust the case of text inserted by backreferences.
For example, we can do the following manipulations:
Note: Case manipulation features were added in v5.72.4 Beta. This feature is less common and may not exist in other RegEx engines.
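Python's `re` module is one of the engines without these case flags, so the sketch below only approximates their effect by using a replacement function instead. The helper names `capitalize_words` and `upper_extension` are made up for this illustration and are not part of any library.

```python
import re

def capitalize_words(text):
    # Roughly what \u applied to the first captured letter would do:
    # uppercase the first character of each word, keep the rest unchanged.
    return re.sub(r"\b(\w)(\w*)",
                  lambda m: m.group(1).upper() + m.group(2),
                  text)

def upper_extension(name):
    # Roughly what \U applied to a captured extension would do:
    # convert every character of the captured part to uppercase.
    return re.sub(r"\.(\w+)$",
                  lambda m: "." + m.group(1).upper(),
                  name)

capitalized = capitalize_words("my summer holiday")
upper_ext = upper_extension("photo.jpg")

print(capitalized)
print(upper_ext)
```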
Limitations for binary data
One known limitation of the RegEx engine when working with binary data is that the input string is not searched beyond the first occurrence of a NULL character (\x00). This does not affect file names, because they cannot contain NULL characters, but it may affect parsing of the binary content of files, for example when working in Pascal Script.