Regular Expressions reference

Introduction

Regular Expressions (RegEx) is a syntax for specifying patterns of text to search for and replace. It can be used for renaming files via the Regular Expressions renaming rule.

Special metacharacters allow you to specify, for instance, that a particular string you are looking for occurs at the beginning or end of a line, or contains N repetitions of a certain character. Metacharacters, such as $ . ^ { [ ( | ) * + ? \, are interpreted according to their special meaning instead of being matched literally.

In this document, the subject text which is checked against a pattern for a possible match is shown in bold. Parts of the subject text may be color-coded to provide a clue as to why a certain part matches (green color), or does not match (red color).

Simple matches

When the search string does not contain any metacharacters, the RegEx processor works like a normal search. It tries to find an exact copy of the search string. This is also known as a literal match.

If you want to find a literal match for a metacharacter, put a backslash \ before it. The \ character is called an escape character, because it lets the metacharacter escape from its special duty and act as a normal character. Its combination with a metacharacter is called an escape sequence.

For example, the metacharacter ^ matches the beginning of the string, but the pattern \^ matches the ^ character literally. Similarly, the pattern \\ matches the \ character literally.

Pattern	Matches	Remarks
`foobar`	foobar	This pattern does not contain any metacharacters, so all characters are matched literally.
`\^FooBar`	^FooBar	The `\^` escape sequence matches the `^` character literally.
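
As a rough illustration of literal matches and escaping, here is a minimal sketch using Python's re module. Python is only a stand-in here (an assumption, not the engine behind the renaming rule), and the sample subject strings are made up.

```python
import re

# Python's re module stands in for the renaming rule's RegEx processor.
literal = re.search(r"foobar", "my foobar file")

# Unescaped, ^ anchors the match to the start of the text, so the pattern
# cannot match a ^ character that sits in the middle of the subject.
unescaped = re.search(r"^FooBar", "x^FooBar")

# Escaped, \^ is an ordinary character and is found wherever it occurs.
escaped = re.search(r"\^FooBar", "x^FooBar")

# \\ matches a single literal backslash.
backslash = re.search(r"\\", r"C:\temp")

print(literal.group(), unescaped, escaped.group(), backslash.group())
```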

Escape sequences

We already saw one use of an escape sequence above, where it makes a metacharacter match literally.

Specific escape sequences are interpreted as special conditions, as listed below.

Pattern	Remarks
`\xnn`	Character represented by the hex code nn
`\x{nnnn}`	Two-byte character with the hex code nnnn (Unicode)
`\t`	Tab (HT/TAB), same as \x09 (Hex 09)
`\n`	New line (NL), same as \x0a (Hex 0a)
`\r`	Carriage return (CR), same as \x0d (Hex 0d)
`\f`	Form feed (FF), same as \x0c (Hex 0c)
`foo\x20bar`	Matches foo bar (note the space in the middle), but does not match foobar
`\tfoobar`	Matches foobar preceded by a tab (the tab is needed for the match)

Note that the tab, new line, carriage return, and form feed are all known as white space characters, because they are normally not visible on display, but the RegEx processor can distinguish between them.
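
The last two rows of the table can be checked with a short sketch. It uses Python's re module as a stand-in (an assumption). Python accepts `\xnn`, `\t`, `\n`, `\r` and `\f` in patterns, but it does not accept the `\x{nnnn}` form, so that form is left out.

```python
import re

# \x20 is the hex code of the space character.
with_space = re.fullmatch(r"foo\x20bar", "foo bar")
without_space = re.fullmatch(r"foo\x20bar", "foobar")

# \t in the pattern requires a real tab character in the subject text.
with_tab = re.fullmatch(r"\tfoobar", "\tfoobar")
missing_tab = re.fullmatch(r"\tfoobar", "foobar")

# \t and \x09 describe the same character.
same_char = re.fullmatch(r"\x09", "\t")

print(bool(with_space), bool(without_space), bool(with_tab),
      bool(missing_tab), bool(same_char))
```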

Character classes

A character class is a list of characters surrounded by square brackets [ and ], which will match any one (and only one) character from the list.

Note that:

The characters are not separated with a comma or a space.
If you repeat any character in the list, it is considered only once (duplicates are ignored).
A hyphen character - is used to indicate a range of characters.

Pattern	Remarks
`[abcdef]`	Matches a, b, c, d, e, or f (only one character), but no other characters
`[c-m]`	Matches any one (and only one) of the lowercase letters from c to m
`[G-J]`	Matches any one (and only one) of the uppercase letters from G to J
`[a-zA-Z]`	Matches any one (and only one) letter (uppercase or lowercase)
`[5-8]`	Matches any one (and only one) of the digits from 5 to 8
`[\x00-\x1F]`	Matches any one (and only one) of the characters with an ordinal value from #0 (`\x00`) to #31 (`\x1F`). These are non-printable control characters in ASCII.

There are some special conditions:

If you do not want any of the characters in the specified class, then place ^ at the very beginning of the list, which means "none of the characters listed in this class".
If you want [, ] or - itself to be a member of a class, put it at the start of the list (a - may also go at the end), or use an escape sequence by putting \ before it.

Pattern	Remarks
`[-az]`	Matches a, z, and -. Because the `-` is at the beginning of the list, no escape sequence is needed.
`[a\-z]`	Matches a, z, and -. Because the `-` is not at the beginning or end of the list, the escape sequence is needed.
`[^0-9]`	Matches any non-digit character.
`[]-a]`	Matches any character from ] to a. Because the `]` is at the beginning of the list, no escape sequence is needed.
`foob[aeiou]r`	Matches foobar and foober, but not foobbr, foobcr, etc.
`foob[^aeiou]r`	Matches foobbr, foobcr, etc., but not foobar, foober, etc.
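
A few of the rows above can be tried out with the following sketch. It assumes Python's re module behaves like the RegEx processor for these simple classes. The helper `matches` and the sample strings are hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

words = ("foobar", "foober", "foobbr", "foobcr")
vowel = [w for w in words if matches(r"foob[aeiou]r", w)]
non_vowel = [w for w in words if matches(r"foob[^aeiou]r", w)]

# A hyphen at the start of the list, or an escaped hyphen, is a literal.
hyphen_first = re.findall(r"[-az]", "a-mz")
hyphen_escaped = re.findall(r"[a\-z]", "a-mz")

# An unescaped hyphen between two characters defines a range instead.
hyphen_range = re.findall(r"[a-z]", "a-mz")

# A leading ^ negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(vowel, non_vowel, hyphen_first, hyphen_escaped, hyphen_range, non_digits)
```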

Predefined classes

Some of the character classes are used so often that RegEx has predefined escape sequences to represent them.

Pattern	Description
`\w`	Alphanumeric character, including an underscore `_` character.
`\W`	Non-alphanumeric character.
`\d`	Numeric character.
`\D`	Non-numeric character.
`\s`	White space character, same as `[ \t\n\r\f]` character class.
`\S`	Any character, excluding white space characters.
`.`	Any character.

Notice that the capitalized classes act as negatives, for example, \W has the inverse meaning of \w.
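
The predefined classes can be explored with a small sketch like this one. Python's re module is assumed as a stand-in. The `re.ASCII` flag is used so that `\w`, `\d` and `\s` stay limited to ASCII, and Python's `\s` also includes the vertical tab, so the exact sets may differ from the RegEx processor. The sample text is made up.

```python
import re

text = "file_01 v2.txt"

# Lowercase classes and their capitalized negatives split the text cleanly.
word_chars = re.findall(r"\w", text, flags=re.ASCII)
non_word = re.findall(r"\W", text, flags=re.ASCII)
digits = re.findall(r"\d", text, flags=re.ASCII)
non_digit_runs = re.findall(r"\D+", text, flags=re.ASCII)
spaces = re.findall(r"\s", text, flags=re.ASCII)
non_space_runs = re.findall(r"\S+", text, flags=re.ASCII)

print("".join(word_chars), non_word, digits)
print(non_digit_runs, spaces, non_space_runs)
```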

Word and text boundaries

A word boundary \b matches a position between a word character \w and a non-word character \W. For the purpose of a word boundary position, the start and end of text are treated like a non-word character \W. These markers are commonly used for matching patterns as whole words, while ignoring occurrences within words.

Pattern	Description
`\b`	Word boundary.
`\B`	Not word boundary.
`\A` or `^`	Start of text
`\Z` or `$`	End of text

For example, \bhis\b will match a whole word his, but will not match this, history or whistle.
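
The whole-word example can be sketched as follows, assuming Python's re module treats `\b` the same way for plain ASCII words. The sentence used for the position check is made up.

```python
import re

pattern = re.compile(r"\bhis\b")
candidates = ["his", "this", "history", "whistle"]

# Only the standalone word satisfies both boundaries.
whole_word = [w for w in candidates if pattern.search(w)]

# Without the boundaries, every candidate contains "his".
substring = [w for w in candidates if re.search(r"his", w)]

# In running text, only the standalone word is reported.
spans = [m.span() for m in pattern.finditer("this is his hat")]

print(whole_word, substring, spans)
```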

Iterators

Iterators (quantifiers) are metacharacters that specify how many times the preceding expression has to repeat. For example, they can be used to find a numeric sequence that is 3 to 5 digits long.

Iterators can be greedy or non-greedy. Greedy means the expression grabs as much matching text as possible. In contrast, the non-greedy expression tries to match as little as possible.

All iterators are greedy by default. Adding ? (question mark) at the end of an iterator makes it non-greedy.

For example:

when b+ (a greedy expression) is applied to string abbbc, it matches bbb (as many as possible),
but when b+? (a non-greedy expression) is applied to abbbc, it matches b (as few as possible).
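
The difference is easy to see in a short sketch. Python's re module is assumed as a stand-in. The quoted-text example is an added illustration and is not taken from this reference.

```python
import re

# Greedy: take as many b characters as possible.
greedy = re.search(r"b+", "abbbc").group()

# Non-greedy: stop after the first b that satisfies the pattern.
lazy = re.search(r"b+?", "abbbc").group()

# A typical practical consequence: matching quoted pieces of text.
text = '"one" and "two"'
greedy_quotes = re.findall(r'".+"', text)    # runs to the last quote
lazy_quotes = re.findall(r'".+?"', text)     # stops at the nearest quote

print(greedy, lazy, greedy_quotes, lazy_quotes)
```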

Iterator	Description	Greedy?	Alternative
`*`	zero or more	Yes	`{0,}`
`+`	one or more	Yes	`{1,}`
`?`	zero or one	Yes	`{0,1}`
`{n}`	exactly n times	Yes
`{n,}`	at least n times	Yes
`{n,m}`	at least n but not more than m times	Yes
`*?`	zero or more	No	`{0,}?`
`+?`	one or more	No	`{1,}?`
`??`	zero or one	No	`{0,1}?`
`{n}?`	exactly n times	No
`{n,}?`	at least n times	No
`{n,m}?`	at least n but not more than m times	No

Let's see some examples:

Pattern	Remarks
`foob.*r`	matches foobar, foobxyz123r and foobr
`foob.+r`	matches foobar, foobxyz123r but not foobr
`foob.?r`	matches foobar, foobbr and foobr but not foobxyz123r
`fooba{2}r`	matches foobaar
`fooba{2,}r`	matches foobaar, foobaaar, foobaaaar but not foobar
`fooba{2,3}r`	matches foobaar, foobaaar but not foobaaaar or foobar
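
The rows above can be verified mechanically with a sketch like the one below. It assumes Python's re module as a stand-in. The helper `matching` is hypothetical and tests whole strings only.

```python
import re

def matching(pattern: str, candidates: list[str]) -> list[str]:
    """Hypothetical helper: candidates that match the pattern as a whole."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

names = ["foobr", "foobar", "foobbr", "foobaar",
         "foobaaar", "foobaaaar", "foobxyz123r"]

star = matching(r"foob.*r", names)        # zero or more characters
plus = matching(r"foob.+r", names)        # at least one character
optional = matching(r"foob.?r", names)    # at most one character
exactly_two = matching(r"fooba{2}r", names)
two_or_more = matching(r"fooba{2,}r", names)
two_to_three = matching(r"fooba{2,3}r", names)

# Whole numbers that are 3 to 5 digits long.
digit_runs = re.findall(r"\b\d{3,5}\b", "12 123 12345 123456")

print(plus, optional, exactly_two, two_or_more, two_to_three, digit_runs)
```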

Alternatives

A regular expression can have multiple alternative characters or subexpressions. The metacharacter | (vertical pipe) is used to separate the alternatives. For example, fee|fie|foe will match fee, fie, and foe in the subject text.

It is a common practice to group alternatives into parentheses, to make it easier to identify and distinguish them. For example, fee|fie|foe can be written as (fee|fie|foe) or as f(e|i|o)e.

Alternatives are tested for a match in the order in which they appear, from left to right, iterating until a full expression match is found or all alternatives have failed to match. For example, when matching foo|foot against barefoot, the first alternative foo will match first.

Pattern	Remarks
`foo(bar|foo)`	Matches foobar or foofoo.

Also remember that alternatives cannot be used inside a character class (square brackets), because | is interpreted literally within []. That means that [fee|fie|foe] is the same as [feio|], where repeated characters are treated as duplicates and ignored.
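
The ordering rule and the character class pitfall can both be observed in a small sketch. Python's re module is assumed as a stand-in, and its left-to-right handling of alternatives matches the description above.

```python
import re

# The first alternative that allows a match wins, even if a later one
# would have matched more text.
first_wins = re.search(r"foo|foot", "barefoot").group()
reordered = re.search(r"foot|foo", "barefoot").group()

all_three = re.findall(r"fee|fie|foe", "fee fie foe fum")

grouped = [w for w in ("foobar", "foofoo", "foobaz")
           if re.fullmatch(r"foo(bar|foo)", w)]

# Inside square brackets, | is just another character in the list.
class_members = sorted(set(re.findall(r"[fee|fie|foe]", "fee|fie|foe")))

print(first_wins, reordered, all_three, grouped, class_members)
```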

Subexpressions

Parts of a pattern can be enclosed in round brackets (), just like using brackets in a mathematical formula a+(b+c). Each part that is enclosed in brackets is called a subexpression. Subexpressions can provide clarity in complex expressions, and can be referenced in both the expression itself and in the replacement pattern.

Let's see some examples:

Pattern	Remarks
`(fee|fie|foe)bar`	Subexpression `fee|fie|foe` is clearly separated from the remaining expression.
`(foobar){2,3}`	Matches foobar if repeated 2 or 3 times, i.e. foobarfoobar and foobarfoobarfoobar.
`foobar{2,3}`	Matches fooba followed by the character r repeated 2 or 3 times, i.e. foobarr and foobarrr.
`foob([0-9]|a+)r`	Matches foob0r, foob1r, foobar, foobaar, foobaaaar, etc.
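
The effect of the brackets on an iterator is worth trying out. The following sketch assumes Python's re module as a stand-in. The helper `matching` is hypothetical and tests whole strings only.

```python
import re

def matching(pattern: str, candidates: list[str]) -> list[str]:
    """Hypothetical helper: candidates that match the pattern as a whole."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# With brackets, the iterator repeats the whole subexpression.
repeated_group = matching(
    r"(foobar){2,3}",
    ["foobar", "foobarfoobar", "foobarfoobarfoobar", "foobarr"],
)

# Without brackets, the iterator applies only to the last character.
repeated_char = matching(
    r"foobar{2,3}",
    ["foobar", "foobarr", "foobarrr", "foobarfoobar"],
)

# Alternatives inside a subexpression: one digit, or one or more a's.
mixed = matching(
    r"foob([0-9]|a+)r",
    ["foob0r", "foobar", "foobaaaar", "foobr", "foob12r"],
)

print(repeated_group, repeated_char, mixed)
```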

Backreferences

Backreferences allow you to reference individual subexpressions, enabling complex repeating patterns.

Each subexpression is identified by its index (order of appearance) in the full expression and can be referenced using a backslash \ followed by the subexpression index, e.g. \1, \2, \3, and so on.

Let's see some examples:

Pattern	Remarks
`(.)\1+`	Matches any character that is repeated at least twice, e.g. aaaa and cc.
`(.+)\1+`	Matches any sequence of characters that is repeated at least twice, e.g. aaaa, cc, abababab, 123123.
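
A short sketch shows what each backreference captures. Python's re module is assumed as a stand-in, and it uses the same `\1` notation inside patterns. The doubled-word pattern is an added illustration and is not taken from this reference.

```python
import re

# (.) captures one character; \1+ requires that same character again.
run = re.search(r"(.)\1+", "xaaaay")

# (.+) captures a sequence; \1+ requires the same sequence to repeat.
sequence = re.search(r"(.+)\1+", "abababab")

# No character is immediately repeated here, so there is no match.
no_repeat = re.search(r"(.)\1+", "abc")

# Practical use: find a word that was typed twice in a row.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")

print(run.group(), sequence.group(1), no_repeat, doubled.group())
```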

Substitutions

Subexpressions can also be referenced in the replacement pattern, allowing you to assemble the output using individual subexpression matches.

Each subexpression is identified by its index (order of appearance) in the full expression and can be referenced using a dollar sign $ followed by the subexpression index, e.g. $1, $2, $3, and so on. The full expression match can be referenced using $0.

Let's see some examples of replacement with substitutions:

Find	Replace	Description
`(\w+) (\w+)`	`$2, $1`	Switch two words around and put a comma between them. For example, "John Smith" becomes "Smith, John".
`\b(\d{2})-(\d{2})-(\d{4})\b`	`$3-$2-$1`	Find dates in `dd-mm-yyyy` format and reverse them into `yyyy-mm-dd` format. For example, "25-10-2007" becomes "2007-10-25".

Note that the last example is not a robust approach for handling dates, because \d matches any digit in the 0-9 range. This means that sequences like "00-00-0000" and "99-99-9999" will also match this pattern, even though they do not represent valid dates.
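
The two replacements can be reproduced in a sketch. Python's re module is assumed as a stand-in. Python writes references in the replacement as `\g<n>` rather than `$n`, so the hypothetical helper `to_python_template` converts between the two. It is a simplification that ignores literal backslashes in the replacement text. The `STRICT` pattern is an added illustration of one way to narrow the digit ranges.

```python
import re

def to_python_template(replacement: str) -> str:
    """Hypothetical helper: rewrite $n references as Python's \\g<n> form."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)

swapped = re.sub(r"(\w+) (\w+)", to_python_template("$2, $1"), "John Smith")

DATE = r"\b(\d{2})-(\d{2})-(\d{4})\b"
iso = re.sub(DATE, to_python_template("$3-$2-$1"), "25-10-2007")

# The loose pattern happily rewrites something that is not a date.
bogus = re.sub(DATE, to_python_template("$3-$2-$1"), "99-99-9999")

# $0 refers to the full expression match.
wrapped = re.sub(r"\d+", to_python_template("[$0]"), "a 42")

# Narrower ranges (day 01-31, month 01-12) reject the worst cases,
# but still accept impossible dates such as 31-02-2007.
STRICT = r"\b(0[1-9]|[12]\d|3[01])-(0[1-9]|1[0-2])-(\d{4})\b"
untouched = re.sub(STRICT, to_python_template("$3-$2-$1"), "99-99-9999")

print(swapped, iso, bogus, wrapped, untouched)
```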

Text case adjustments

Subexpressions can also be used to adjust the text case (upper case, lower case) of certain patterns, which otherwise cannot be easily achieved with generic text case manipulation rules.

The following flags can be combined with subexpression references in the replace pattern:

Flag	Description
`\L`	Convert all characters to lowercase.
`\l`	Convert only the first character to lowercase.
`\U`	Convert all characters to uppercase.
`\u`	Convert only the first character to uppercase.

For example, we can do the following manipulations:

Input	Find	Replace	Result
hello WORLD	`(.+) (.+)`	`$1 $2`	hello WORLD
hello WORLD	`(.+) (.+)`	`\U$1 $2`	HELLO WORLD
hello WORLD	`(.+) (.+)`	`$1 \L$2`	hello world
hello WORLD	`(.+) (.+)`	`\u$1 \L$2`	Hello world

Note: Case manipulation features were added in v5.72.4 Beta. This feature is less common and may not exist in other RegEx implementations.
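
For comparison, Python's re module has no `\L`, `\l`, `\U` or `\u` flags in replacement text, so the rows of the table have to be emulated with replacement functions. The sketch below is only an emulation under that assumption. The function names are hypothetical, and it does not model how far a flag's effect extends in the RegEx processor.

```python
import re

FIND = r"(.+) (.+)"
SOURCE = "hello WORLD"

def upper_first_group(m: re.Match) -> str:
    # Emulates the replace pattern \U$1 $2 for this input.
    return m.group(1).upper() + " " + m.group(2)

def lower_second_group(m: re.Match) -> str:
    # Emulates the replace pattern $1 \L$2.
    return m.group(1) + " " + m.group(2).lower()

def title_then_lower(m: re.Match) -> str:
    # Emulates \u$1 \L$2: only the first character of $1 is uppercased.
    first = m.group(1)
    return first[:1].upper() + first[1:] + " " + m.group(2).lower()

unchanged = re.sub(FIND, r"\g<1> \g<2>", SOURCE)
shouted = re.sub(FIND, upper_first_group, SOURCE)
lowered = re.sub(FIND, lower_second_group, SOURCE)
sentence = re.sub(FIND, title_then_lower, SOURCE)

print(unchanged, shouted, lowered, sentence, sep=" | ")
```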

Limitations for binary data

A known limitation of the RegEx processor affects binary data: the input text is not searched beyond the first occurrence of a NULL character (\x00). This does not affect file names, because they cannot contain NULL characters, but it may affect parsing of the binary content of files, for example, when working with the Pascal Script rule.

Useful references

Regular-Expressions.info – Excellent site devoted to regular expressions. It is nicely structured, with many easy to understand examples.
TRegExpr – Regular expressions library for Delphi and Free Pascal. For syntax and API documentation see regex.sorokin.engineer.
FPC RegEx packages – Regular expressions libraries included in Free Pascal.