Difference between revisions of "ReNamer:Regular Expressions"

From den4b Wiki
(binary NULL limitation note, and cleanup)
(replacing <span color> with <span class>, replacing (0x201C and 0x201D) quotes with normal (0x22))
Line 1: Line 1:
 
{{Cleanup|
 
{{Cleanup|
* Replace all span color highlights with custom '''hl-*''' classes found in [[MediaWiki:Common.css]].
+
* Replace all span color highlights with custom '''hl-*''' classes found in [[MediaWiki:Common.css]].(krtek: probably already done)
* Replace all '''“”''' (0x201C and 0x201D) quotes with normal '''""''' (0x22).
+
* Replace all '''""''' (0x201C and 0x201D) quotes with normal '''""''' (0x22). (krtek: probably already done)
 +
* Check for italicized " (quotes) and make them normal. In particular, some closing quotes were italicized or bolded.
 
* }}
 
* }}
  
Line 9: Line 10:
 
*The specified text must occur ''n'' times, etc.
 
*The specified text must occur ''n'' times, etc.
  
RegEx expressions use <span style="color: darkorange; font-weight: bold;">'''$ . ^ { [ ( | ) * +&nbsp;? \ '''</span> characters (called ''metacharacters'') in various combinations to specify all these conditions. The search engine ''interprets'' these metacharacters, rather than finding a literal match for them.  
+
RegEx expressions use <span class="hl-orange">'''$ . ^ { [ ( | ) * +&nbsp;? \ '''</span> characters (called ''metacharacters'') in various combinations to specify all these conditions. The search engine ''interprets'' these metacharacters, rather than finding a literal match for them.  
  
The RegEx search expression is called a '''''RegEx pattern”''''', because a single expression can match a large number of text strings that have the specified ''pattern.'' For example, the RegEx pattern <span style="color: darkorange; font-weight: bold;">'''b.t'''</span> matches '''bat''', '''bet''', '''bit''', '''bot''' and '''but, '''etc.  
+
The RegEx search expression is called a "'''''RegEx pattern"''''', because a single expression can match a large number of text strings that have the specified ''pattern.'' For example, the RegEx pattern <span class="hl-orange">'''b.t'''</span> matches '''bat''', '''bet''', '''bit''', '''bot''' and '''but, '''etc.  
  
Remember that RegEx strings are case-sensitive (The words<span style="color: darkorange; font-weight: bold;">''' cat'''</span>, <span style="color: darkorange; font-weight: bold;">''' CAT'''</span>,<span style="color: darkorange; font-weight: bold;">''' cAt'''</span>, <span style="color: darkorange; font-weight: bold;">''' Cat'''</span>, <span style="color: darkorange; font-weight: bold;">''' caT'''</span>, <span style="color: darkorange; font-weight: bold;">''' cAT'''</span>, <span style="color: darkorange; font-weight: bold;">''' CAt '''</span>and <span style="color: darkorange; font-weight: bold;">'''CaT '''</span>are not equivalent).  
+
Remember that RegEx strings are case-sensitive (The words<span class="hl-orange">''' cat'''</span>, <span class="hl-orange">''' CAT'''</span>,<span class="hl-orange">''' cAt'''</span>, <span class="hl-orange">''' Cat'''</span>, <span class="hl-orange">''' caT'''</span>, <span class="hl-orange">''' cAT'''</span>, <span class="hl-orange">''' CAt '''</span>and <span class="hl-orange">'''CaT '''</span>are not equivalent).  
  
Also, note that even the digits (<span style="color: darkorange; font-weight: bold;">0</span>-<span style="color: darkorange; font-weight: bold;">9</span>) are “numeric characters” for RegEx.  
+
Also, note that even the digits (<span class="hl-orange">0</span>-<span class="hl-orange">9</span>) are "numeric characters" for RegEx.  
  
In this section, the RegEx expressions (patterns) are shown in <span style="color: darkorange; font-weight: bold;">'''bold orange'''</span>. The target strings (which are compared with the RegEx expression for a possible match) are shown in '''bold black. '''A part of the target text is color-coded to provide a clue as to why a certain part matches ('''<span style="color: teal; font-weight: bold;">green</span>''' color), or does <u>not</u> match ('''<span style="color: red; font-weight: bold;">red</span>''' color)  
+
In this section, the RegEx expressions (patterns) are shown in <span class="hl-orange">'''bold orange'''</span>. The target strings (which are compared with the RegEx expression for a possible match) are shown in '''bold black. '''A part of the target text is color-coded to provide a clue as to why a certain part matches ('''<span class="hl-teal">green</span>''' color), or does <u>not</u> match ('''<span class="hl-red">red</span>''' color)  
  
 
=== Simple (literal) matches  ===
 
=== Simple (literal) matches  ===
  
When the search string does not contain any metacharacters, the RegEx engine works like a “normal” search: it tries to find an exact copy of the search string. This is also known as a “literal match”.  
+
When the search string does not contain any metacharacters, the RegEx engine works like a "normal" search: it tries to find an exact copy of the search string. This is also known as a "literal match".  
  
If you want to find a literal match for a metacharacter, put a backslash '''\''' ''before'' it. (The '''<span style="color: darkorange; font-weight: bold;">\</span>''' character is called ''escape character”, because ''it lets the metacharacter escape from its special duty, and lets it act as a normal character. Its combination with a metacharacter is called ''escape sequence”'').  
+
If you want to find a literal match for a metacharacter, put a backslash '''\''' ''before'' it. (The '''<span class="hl-orange">\</span>''' character is called "''escape character''", because it lets the metacharacter escape from its special duty, and lets it act as a normal character. Its combination with a metacharacter is called "''escape sequence''").  
  
For example, metacharacter '''<span style="color: darkorange; font-weight: bold;">^</span>''' matches the beginning of string, but '''<span style="color: darkorange; font-weight: bold;">\^</span>''' matches the character '''<span style="color: teal; font-weight: bold;">^</span>'''.  
+
For example, metacharacter '''<span class="hl-orange">^</span>''' matches the beginning of string, but '''<span class="hl-orange">\^</span>''' matches the character '''<span class="hl-teal">^</span>'''.  
  
Note that the RegEx pattern '''<span style="color: darkorange; font-weight: bold;">\\</span>''' matches the character '''<span style="color: teal; font-weight: bold;">\</span>'''.  
+
Note that the RegEx pattern '''<span class="hl-orange">\\</span>''' matches the character '''<span class="hl-teal">\</span>'''.  
  
 
{| class="prettytable"
 
{| class="prettytable"
Line 35: Line 36:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foobar</span>'''</center>  
+
| <center>'''<span class="hl-orange">foobar</span>'''</center>  
| <center>'''<span style="color: teal; font-weight: bold;">foobar</span>'''</center>  
+
| <center>'''<span class="hl-teal">foobar</span>'''</center>  
 
| This RegEx pattern does not contain any metacharacters; so all characters are matched literally.
 
| This RegEx pattern does not contain any metacharacters; so all characters are matched literally.
 
|-
 
|-
| '''<span style="color: darkorange; font-weight: bold;">\^FooBarPtr</span>'''  
+
| '''<span class="hl-orange">\^FooBarPtr</span>'''  
| <center>'''<span style="color: teal; font-weight: bold;">^FooBarPtr</span>'''</center>  
+
| <center>'''<span class="hl-teal">^FooBarPtr</span>'''</center>  
| The '''<span style="color: darkorange; font-weight: bold;">\^</span>''' escape sequence searches for the character '''^''' ''literally'' .
+
| The '''<span class="hl-orange">\^</span>''' escape sequence searches for the character '''^''' ''literally'' .
 
|}
 
|}
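
The literal-match and escape rules above can be tried out with a short sketch. It uses Python's <code>re</code> module purely for illustration; ReNamer's own RegEx engine may differ in details, and the sample strings are made up for this example.

```python
import re

# Without metacharacters, the pattern is matched literally.
plain = re.search(r"foobar", "my foobar file")

# A backslash makes the metacharacter ^ act as an ordinary character.
literal_caret = re.search(r"\^FooBarPtr", "x = ^FooBarPtr;")
matched_text = literal_caret.group(0)

# Unescaped, ^ anchors the pattern to the start of the text instead,
# so nothing is found here.
anchored = re.search(r"^FooBarPtr", "x = ^FooBarPtr;")

# The pattern \\ matches a single backslash character.
backslash = re.search(r"\\", r"C:\temp").group(0)
```

Here <code>matched_text</code> holds the text '''^FooBarPtr''', while <code>anchored</code> is <code>None</code> because the target does not start with '''FooBarPtr'''.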
  
Line 55: Line 56:
 
| <center>'''matches-'''</center>
 
| <center>'''matches-'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\xnn</span>'''</center>  
+
| <center>'''<span class="hl-orange">\xnn</span>'''</center>  
 
| Character represented by the hex code ''nn''
 
| Character represented by the hex code ''nn''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\x{nnnn}</span>'''</center>  
+
| <center>'''<span class="hl-orange">\x{nnnn}</span>'''</center>  
 
| two bytes char with hex code nnnn (unicode)
 
| two bytes char with hex code nnnn (unicode)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\t</span>'''</center>  
+
| <center>'''<span class="hl-orange">\t</span>'''</center>  
 
| tab (HT/TAB), same as \x09 (Hex 09)
 
| tab (HT/TAB), same as \x09 (Hex 09)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\n</span>'''</center>  
+
| <center>'''<span class="hl-orange">\n</span>'''</center>  
 
| new line (NL), same as \x0a (Hex 0a)
 
| new line (NL), same as \x0a (Hex 0a)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\r</span>'''</center>  
+
| <center>'''<span class="hl-orange">\r</span>'''</center>  
 
| carriage return (CR), same as \x0d (Hex 0d)
 
| carriage return (CR), same as \x0d (Hex 0d)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\f</span>'''</center>  
+
| <center>'''<span class="hl-orange">\f</span>'''</center>  
 
| form feed (FF), same as \x0c (Hex 0c)
 
| form feed (FF), same as \x0c (Hex 0c)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foo\x20bar</span>'''</center>  
+
| <center>'''<span class="hl-orange">foo\x20bar</span>'''</center>  
| matches '''<span style="color: teal; font-weight: bold;">foo bar</span>''' (note the space in the middle), but does ''not'' match '''foobar'''
+
| matches '''<span class="hl-teal">foo bar</span>''' (note the space in the middle), but does ''not'' match '''foobar'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\tfoobar</span>'''</center>  
+
| <center>'''<span class="hl-orange">\tfoobar</span>'''</center>  
| matches '''<span style="color: teal; font-weight: bold;">foobar</span>''' preceded by a tab (the tab is needed for the match)
+
| matches '''<span class="hl-teal">foobar</span>''' preceded by a tab (the tab is needed for the match)
 
|}
 
|}
  
Note that the tab, new line, carriage return, and form feed are known as “white spaces”. But RegEx can distinguish between them. This allows you to make high-precision searches.  
+
Note that the tab, new line, carriage return, and form feed are known as "white spaces". But RegEx can distinguish between them. This allows you to make high-precision searches.  
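
As a rough illustration of how these escape sequences tell individual white-space characters apart, here is a minimal sketch in Python's <code>re</code> module (an assumption made for demonstration only; ReNamer's engine is not Python, and Python does not support the <code>\x{nnnn}</code> form, so it is left out).

```python
import re

# \x20 is the space character, so the pattern needs a space in the target.
with_space = re.search(r"foo\x20bar", "foo bar")
without_space = re.search(r"foo\x20bar", "foobar")

# \t requires a real tab in front of foobar; a plain space is not enough.
tab_before = re.search(r"\tfoobar", "\tfoobar")
space_before = re.search(r"\tfoobar", " foobar")

# Each white-space character can be searched for individually.
sample = "a\r\nb\tc"
newlines = re.findall(r"\n", sample)
tabs_and_returns = re.findall(r"[\t\r]", sample)
```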
  
 
=== Character Classes  ===
 
=== Character Classes  ===
  
A character class is a list of characters in square brackets '''<span style="color: darkorange; font-weight: bold;">[]</span>''', which will match any one (and ''only one'') character from the list.  
+
A character class is a list of characters in square brackets '''<span class="hl-orange">[]</span>''', which will match any one (and ''only one'') character from the list.  
  
 
Note that-  
 
Note that-  
Line 90: Line 91:
 
*The characters are not separated with a comma or a space.  
 
*The characters are not separated with a comma or a space.  
 
*If you repeat any character in the list, it is considered only once (duplicates are ignored).  
 
*If you repeat any character in the list, it is considered only once (duplicates are ignored).  
*A hyphen '''<span style="color: darkorange; font-weight: bold;">-</span>''' is used to indicate range of characters.
+
*A hyphen '''<span class="hl-orange">-</span>''' is used to indicate range of characters.
  
 
{| class="prettytable"
 
{| class="prettytable"
Line 97: Line 98:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[abdef]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[abdef]</span>'''</center>  
| Matches '''<span style="color: teal; font-weight: bold;">a</span>''', '''<span style="color: teal; font-weight: bold;">b</span>''', '''<span style="color: teal; font-weight: bold;">d</span>''', '''<span style="color: teal; font-weight: bold;">e</span>''', or '''<span style="color: teal; font-weight: bold;">f</span>''' (only ''one'' character), but no other characters
+
| Matches '''<span class="hl-teal">a</span>''', '''<span class="hl-teal">b</span>''', '''<span class="hl-teal">d</span>''', '''<span class="hl-teal">e</span>''', or '''<span class="hl-teal">f</span>''' (only ''one'' character), but no other characters
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[c-m]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[c-m]</span>'''</center>  
| Matches any one (and only one) of the small alphabetical characters, from '''<span style="color: teal; font-weight: bold;">c</span>''' to '''<span style="color: teal; font-weight: bold;">m</span>'''
+
| Matches any one (and only one) of the small alphabetical characters, from '''<span class="hl-teal">c</span>''' to '''<span class="hl-teal">m</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[G-J]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[G-J]</span>'''</center>  
| Matches any one (and only one) of the capital alphabetical characters from '''<span style="color: teal; font-weight: bold;">G</span>''' to '''<span style="color: teal; font-weight: bold;">J</span>'''
+
| Matches any one (and only one) of the capital alphabetical characters from '''<span class="hl-teal">G</span>''' to '''<span class="hl-teal">J</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[a-zA-Z]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[a-zA-Z]</span>'''</center>  
 
| Matches any one (and only one) of the alphabetical characters (capital or small)
 
| Matches any one (and only one) of the alphabetical characters (capital or small)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[5-8]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[5-8]</span>'''</center>  
| Matches any one (and only one) of numerical characters from '''<span style="color: teal; font-weight: bold;">5</span>''' to '''<span style="color: teal; font-weight: bold;">8</span>'''
+
| Matches any one (and only one) of numerical characters from '''<span class="hl-teal">5</span>''' to '''<span class="hl-teal">8</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[\n-\x0D]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[\n-\x0D]</span>'''</center>  
 
|  
 
|  
Matches any one (and only one) of '''<span style="color: teal; font-weight: bold;">#10</span>''','''''<i><span style="color: teal; font-weight: bold;"> </span></i>'''<span style="color: teal; font-weight: bold;">#11 </span>''','''''<span style="color: teal; font-weight: bold;"> #12 </span>''or '''''<span style="color: teal; font-weight: bold;">#13</span>'''<br>(Note the use of [[ReNamer:Regular Expressions#Simple_.28literal.29_matches|escape sequence]] inside a class)  
+
Matches any one (and only one) of '''<span class="hl-teal">#10</span>''','''''<i><span class="hl-teal"> </span></i>'''<span class="hl-teal">#11 </span>''','''''<span class="hl-teal"> #12 </span>''or '''''<span class="hl-teal">#13</span>'''<br>(Note the use of [[ReNamer:Regular Expressions#Simple_.28literal.29_matches|escape sequence]] inside a class)  
  
 
|}
 
|}
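
The character classes in this table can be checked with a small sketch. Python's <code>re</code> module stands in for ReNamer's engine here, which is an assumption; the test strings are invented for the example.

```python
import re

# A class matches exactly one character out of the listed ones.
listed = re.findall(r"[abdef]", "abcdefg")

# A hyphen between two characters denotes a range.
lower_range = re.findall(r"[c-m]", "abcmnz")
upper_range = re.findall(r"[G-J]", "FGHIJK")
digit_range = re.findall(r"[5-8]", "0123456789")

# Escape sequences can serve as range endpoints: character codes 10 to 13.
control_range = re.findall(r"[\n-\x0D]", "a\n\x0b\x0c\rb")
```

Each call returns one list entry per matched character, which shows that a class never consumes more than a single character at a time.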
Line 120: Line 121:
 
There are some special conditions:  
 
There are some special conditions:  
  
*If you do not want any of the characters in the specified class, then place '''<span style="color: darkorange; font-weight: bold;">^</span>''' at the very beginning of the list (RegEx interprets that as “none of the characters listed in this class”).  
+
*If you do not want any of the characters in the specified class, then place '''<span class="hl-orange">^</span>''' at the very beginning of the list (RegEx interprets that as "none of the characters listed in this class").  
*If you want '''<span style="color: darkorange; font-weight: bold;">[</span>''' or '''<span style="color: darkorange; font-weight: bold;">]</span>''' itself to be a member of a class, put it at the start or end of the list, or create a [[ReNamer:Regular Expressions#Simple_.28literal.29_matches|escape sequence]] (by putting '''<span style="color: darkorange; font-weight: bold;">\</span>''' before it).
+
*If you want '''<span class="hl-orange">[</span>''' or '''<span class="hl-orange">]</span>''' itself to be a member of a class, put it at the start or end of the list, or create a [[ReNamer:Regular Expressions#Simple_.28literal.29_matches|escape sequence]] (by putting '''<span class="hl-orange">\</span>''' before it).
  
 
{| class="prettytable" style="width: 539px; height: 206px;"
 
{| class="prettytable" style="width: 539px; height: 206px;"
Line 128: Line 129:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[-az]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[-az]</span>'''</center>  
| matches '''<span style="color: teal; font-weight: bold;">a</span>'','''''<i><span style="color: teal; font-weight: bold;">z</span>''',<span style="font-weight: bold;"> </span>and '''<span style="color: teal; font-weight: bold;">-</span></i>'''<br>'''(since '''<span style="color: darkorange; font-weight: bold;">-</span>''' is put at the beginning, the escape sequence is not needed)
+
| matches '''<span class="hl-teal">a</span>'','''''<i><span class="hl-teal">z</span>''', and '''<span class="hl-teal">-</span></i>'''<br>'''(since '''<span class="hl-orange">-</span>''' is put at the beginning, the escape sequence is not needed)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[a\-z]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[a\-z]</span>'''</center>  
| matches '''<span style="color: teal; font-weight: bold;">a</span>'','''''<i><span style="color: teal; font-weight: bold;">z</span>''',<span style="font-weight: bold;"> </span>and '''<span style="color: teal; font-weight: bold;">-</span></i>'''&nbsp;''' <br>(since '''<span style="color: darkorange; font-weight: bold;">-</span>''' is ''not'' at the beginning/end, the escape sequence ''is'' needed)
+
| matches '''<span class="hl-teal">a</span>'','''''<i><span class="hl-teal">z</span>''', and '''<span class="hl-teal">-</span></i>'''&nbsp;''' <br>(since '''<span class="hl-orange">-</span>''' is ''not'' at the beginning/end, the escape sequence ''is'' needed)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[^0-9]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[^0-9]</span>'''</center>  
 
| matches any ''non-digit'' character
 
| matches any ''non-digit'' character
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">[]-a]</span>'''</center>  
+
| <center>'''<span class="hl-orange">[]-a]</span>'''</center>  
| matches any character from '''<span style="color: teal; font-weight: bold;">]</span>''' to '''<span style="color: teal; font-weight: bold;">a</span>'''. <br>(since '''<span style="color: darkorange; font-weight: bold;">]</span> '''is at the beginning, the escape sequence ''is'' ''not'' needed)
+
| matches any character from '''<span class="hl-teal">]</span>''' to '''<span class="hl-teal">a</span>'''. <br>(since '''<span class="hl-orange">]</span> '''is at the beginning, the escape sequence ''is'' ''not'' needed)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foob[aeiou]r</span>'''</center>  
+
| <center>'''<span class="hl-orange">foob[aeiou]r</span>'''</center>  
| Matches with '''foob'''''<span style="color: teal; font-weight: bold;">a</span>'''''r''', '''foob'''''<span style="color: teal; font-weight: bold;">e</span>'''''r,''' etc. but not '''foob'''''<span style="color: red; font-weight: bold;">b</span>'''''r''', '''foob'''''<span style="color: red; font-weight: bold;">c</span>'''''r''', etc.
+
| Matches with '''foob'''''<span class="hl-teal">a</span>'''''r''', '''foob'''''<span class="hl-teal">e</span>'''''r,''' etc. but not '''foob'''''<span class="hl-red">b</span>'''''r''', '''foob'''''<span class="hl-red">c</span>'''''r''', etc.
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foob[^aeiou]r</span>'''</center>  
+
| <center>'''<span class="hl-orange">foob[^aeiou]r</span>'''</center>  
| Matches with '''foob'''''<span style="color: teal; font-weight: bold;">b</span>'''''r''', '''foob'''''<span style="color: teal; font-weight: bold;">c</span>'''''r''' etc. but not '''foob'''''<span style="color: red; font-weight: bold;">a</span>'''''r''', '''foob'''''<span style="color: red; font-weight: bold;">e</span>'''''r''', etc.
+
| Matches with '''foob'''''<span class="hl-teal">b</span>'''''r''', '''foob'''''<span class="hl-teal">c</span>'''''r''' etc. but not '''foob'''''<span class="hl-red">a</span>'''''r''', '''foob'''''<span class="hl-red">e</span>'''''r''', etc.
 
|}
 
|}
  
@@@ The <span style="color: darkorange; font-weight: bold;">]-a</span>example would need a clarification as to what is the natural sequence of characters, and where is a superset of all possible characters described?
+
@@@ The "<span class="hl-orange">]-a</span>" example would need a clarification as to what is the natural sequence of characters, and where is a superset of all possible characters described?
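
A minimal sketch of the special cases above, again using Python's <code>re</code> module as a stand-in. In Python, a range follows the numeric character codes, so '''<nowiki>]-a</nowiki>''' covers the codes 0x5D to 0x61; that ReNamer's engine orders characters the same way is an assumption, not something this page confirms.

```python
import re

# A hyphen at the start of the class, or an escaped hyphen, is literal.
leading_hyphen = re.findall(r"[-az]", "a-z b")
escaped_hyphen = re.findall(r"[a\-z]", "a-z b")

# ^ at the very beginning negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# ] placed first is literal; ]-a is then a range in character-code order.
bracket_range = re.findall(r"[]-a]", "[\\]^_`ab")

# Classes combine freely with literal text.
vowel_ok = re.fullmatch(r"foob[aeiou]r", "foobar")
vowel_bad = re.fullmatch(r"foob[aeiou]r", "foobbr")
consonant_ok = re.fullmatch(r"foob[^aeiou]r", "foobbr")
```

With these inputs, <code>bracket_range</code> contains only the five characters from '''<nowiki>]</nowiki>''' up to '''a'''; the '''<nowiki>[</nowiki>''' and '''\''' in the target have lower codes and '''b''' has a higher one, so they are skipped.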
  
 
=== Predefined Classes  ===
 
=== Predefined Classes  ===
Line 158: Line 159:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\w</span>'''</center>  
+
| <center>'''<span class="hl-orange">\w</span>'''</center>  
 
| an alphanumeric character, including an ''underscore'' ('''_''')
 
| an alphanumeric character, including an ''underscore'' ('''_''')
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\W</span>'''</center>  
+
| <center>'''<span class="hl-orange">\W</span>'''</center>  
 
| a non-alphanumeric character
 
| a non-alphanumeric character
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\d</span>'''</center>  
+
| <center>'''<span class="hl-orange">\d</span>'''</center>  
 
| a numeric character
 
| a numeric character
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\D</span>'''</center>  
+
| <center>'''<span class="hl-orange">\D</span>'''</center>  
 
| a non-numeric character
 
| a non-numeric character
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\s</span>'''</center>  
+
| <center>'''<span class="hl-orange">\s</span>'''</center>  
| any space (same as the '''<span style="color: darkorange; font-weight: bold;">[ \t\n\r\f]</span>''' class)
+
| any space (same as the '''<span class="hl-orange">[ \t\n\r\f]</span>''' class)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\S</span>'''</center>  
+
| <center>'''<span class="hl-orange">\S</span>'''</center>  
 
| a non space
 
| a non space
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">.</span>'''</center>  
+
| <center>'''<span class="hl-orange">.</span>'''</center>  
 
| any character in line (the symbol is just a dot)
 
| any character in line (the symbol is just a dot)
 
|}
 
|}
  
Notice that the capitalized letter is used to negate (for example, compare '''<span style="color: darkorange; font-weight: bold;">\w</span>''' with '''<span style="color: darkorange; font-weight: bold;">\W</span>''')  
+
Notice that the capitalized letter is used to negate (for example, compare '''<span class="hl-orange">\w</span>''' with '''<span class="hl-orange">\W</span>''')  
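
The predefined classes can be tried out on a made-up file name, as in the following sketch. It relies on Python's <code>re</code> module, which is an assumption for illustration; Python's '''\s''' and '''\w''' also accept some characters beyond those listed in the table (for example other Unicode spaces and letters), so results on unusual input may differ from ReNamer's.

```python
import re

sample = "file_01 (copy).txt"  # hypothetical file name

words = re.findall(r"\w+", sample)      # runs of letters, digits and _
non_words = re.findall(r"\W", sample)   # everything \w rejects
digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", sample)

# The dot matches any character within a line, but not the line break.
dot_chars = re.findall(r".", "a\nb")
```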
  
 
=== Word/Text Boundaries  ===
 
=== Word/Text Boundaries  ===
  
A word boundary (<span style="color: darkorange; font-weight: bold;">\b</span>) is a spot between two characters that has a <span style="color: darkorange; font-weight: bold;">\w</span> on one side of it and a <span style="color: darkorange; font-weight: bold;">\W</span> on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a <span style="color: darkorange; font-weight: bold;">\W</span>.  
+
A word boundary (<span class="hl-orange">\b</span>) is a spot between two characters that has a <span class="hl-orange">\w</span> on one side of it and a <span class="hl-orange">\W</span> on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a <span class="hl-orange">\W</span>.  
  
 
{| class="prettytable"
 
{| class="prettytable"
Line 191: Line 192:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\b</span>'''</center>  
+
| <center>'''<span class="hl-orange">\b</span>'''</center>  
 
| word boundary
 
| word boundary
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\B</span>'''</center>  
+
| <center>'''<span class="hl-orange">\B</span>'''</center>  
 
| not word boundary
 
| not word boundary
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\A</span>'''</center>  
+
| <center>'''<span class="hl-orange">\A</span>'''</center>  
| start of text ('''<span style="color: darkorange; font-weight: bold;">^</span>''' is an alternative)
+
| start of text ('''<span class="hl-orange">^</span>''' is an alternative)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">\Z</span>'''</center>  
+
| <center>'''<span class="hl-orange">\Z</span>'''</center>  
| end of text ('''<span style="color: darkorange; font-weight: bold;">$</span>''' is an alternative)
+
| end of text ('''<span class="hl-orange">$</span>''' is an alternative)
 
|}
 
|}
  
These markers are combined with the search string to specify where exactly you want the search string to be. For example, '''<span style="color: darkorange; font-weight: bold;">\bhis\b</span> '''will search for a whole word '''his''', but will ignore '''<span style="color: red; font-weight: bold;">t</span>his''', '''his<span style="color: red; font-weight: bold;">tory</span> '''or '''<span style="color: red; font-weight: bold;">w</span>his<span style="color: red; font-weight: bold;">tle</span>'''.  
+
These markers are combined with the search string to specify where exactly you want the search string to be. For example, '''<span class="hl-orange">\bhis\b</span> '''will search for a whole word '''his''', but will ignore '''<span class="hl-red">t</span>his''', '''his<span class="hl-red">tory</span> '''or '''<span class="hl-red">w</span>his<span class="hl-red">tle</span>'''.  
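
The whole-word search described above can be sketched as follows, with Python's <code>re</code> module standing in for ReNamer's engine (an assumption); the word list mirrors the examples in the text.

```python
import re

whole_word = re.compile(r"\bhis\b")

candidates = ["his", "this", "history", "whistle"]
# Keep only the candidates in which "his" stands as a whole word.
accepted = [word for word in candidates if whole_word.search(word)]

# Inside a longer text, only the free-standing word is found.
found = re.findall(r"\bhis\b", "this is his history")

# \A and \Z tie the pattern to the start and the end of the text.
at_start = re.search(r"\Ahis", "his hat")
at_end = re.search(r"hat\Z", "his hat")
not_at_start = re.search(r"\Ahat", "his hat")
```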
  
 
=== Iterators (Quantifiers)  ===
 
=== Iterators (Quantifiers)  ===
Line 210: Line 211:
 
Iterators (quantifiers) are metacharacters that specify how many times the ''preceding'' expression has to repeat. A typical example is to find a 3-to-5 digit number.  
 
Iterators (quantifiers) are metacharacters that specify how many times the ''preceding'' expression has to repeat. A typical example is to find a 3-to-5 digit number.  
  
RegEx newbies often place the iterators ''before'' the character that needs to repeat. Just remember that RegEx syntax is the exact opposite of the usual English syntax. So, instead of ''four dogs'', we would have to say ''dogs four'', RegEx-style.  
+
RegEx newbies often place the iterators ''before'' the character that needs to repeat. Just remember that RegEx syntax is the exact opposite of the usual English syntax. So, instead of "''four dogs''", we would have to say "''dogs four''", RegEx-style.  
  
 
Iterators can be 'Greedy' or 'Non-Greedy'. Greedy means the expression grabs as ''much'' matching text as possible. In contrast, the non-greedy expression tries to match as ''little'' as possible.  
 
Iterators can be 'Greedy' or 'Non-Greedy'. Greedy means the expression grabs as ''much'' matching text as possible. In contrast, the non-greedy expression tries to match as ''little'' as possible.  
Line 216: Line 217:
 
For example,  
 
For example,  
  
*when '''<span style="color: darkorange; font-weight: bold;">b+</span>''' (a greedy expression) is applied to string '''abbbbc''', it returns '''bbbb''',  
+
*when '''<span class="hl-orange">b+</span>''' (a greedy expression) is applied to string '''abbbbc''', it returns '''bbbb''',  
*but when '''<span style="color: darkorange; font-weight: bold;">b+?</span> '''(a non-greedy expression) is applied to '''abbbbc''', it returns only '''b'''.
+
*but when '''<span class="hl-orange">b+?</span> '''(a non-greedy expression) is applied to '''abbbbc''', it returns only '''b'''.
  
Note that a '''<span style="color: darkorange; font-weight: bold;">?</span> '''attached to a greedy expression makes it non-greedy.  
+
Note that a '''<span class="hl-orange">?</span> '''attached to a greedy expression makes it non-greedy.  
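
The difference between greedy and non-greedy iterators can be seen in a short sketch. Python's <code>re</code> module is used as a stand-in for ReNamer's engine (an assumption), and the angle-bracket sample is invented for this example.

```python
import re

# Greedy: b+ takes as many b characters as it can.
greedy = re.search(r"b+", "abbbbc").group(0)

# Non-greedy: b+? stops as soon as the pattern is satisfied.
lazy = re.search(r"b+?", "abbbbc").group(0)

# The same contrast with .* between two delimiters.
greedy_span = re.search(r"<.*>", "<a><b>").group(0)
lazy_span = re.search(r"<.*?>", "<a><b>").group(0)

# A 3-to-5 digit number: the iterator follows the expression it repeats.
numbers = re.findall(r"\b\d{3,5}\b", "12 123 12345 123456")
```

The last pattern uses word boundaries so that the 3-to-5 digit run is not taken from the middle of a longer number.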
  
 
{| class="prettytable"
 
{| class="prettytable"
Line 228: Line 229:
 
! <center>Remarks</center>
 
! <center>Remarks</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">*</span>'''</center>  
+
| <center>'''<span class="hl-orange">*</span>'''</center>  
 
| zero or more  
 
| zero or more  
 
| <center>Yes</center>  
 
| <center>Yes</center>  
| equivalent to '''<span style="color: darkorange; font-weight: bold;">{0,}</span>'''
+
| equivalent to '''<span class="hl-orange">{0,}</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">+</span>'''</center>  
+
| <center>'''<span class="hl-orange">+</span>'''</center>  
 
| one or more  
 
| one or more  
 
| <center>Yes</center>  
 
| <center>Yes</center>  
| equivalent to '''<span style="color: darkorange; font-weight: bold;">{1,}</span>'''
+
| equivalent to '''<span class="hl-orange">{1,}</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">?</span>'''</center>  
+
| <center>'''<span class="hl-orange">?</span>'''</center>  
 
| zero or one  
 
| zero or one  
 
| <center>Yes</center>  
 
| <center>Yes</center>  
| equivalent to '''<span style="color: darkorange; font-weight: bold;">{0,1}</span>'''
+
| equivalent to '''<span class="hl-orange">{0,1}</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">{n}</span>'''</center>  
+
| <center>'''<span class="hl-orange">{n}</span>'''</center>  
 
| exactly ''n'' times  
 
| exactly ''n'' times  
 
| <center>Yes</center>  
 
| <center>Yes</center>  
 
|  
 
|  
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">{n,}</span>'''</center>  
+
| <center>'''<span class="hl-orange">{n,}</span>'''</center>  
 
| at least ''n'' times  
 
| at least ''n'' times  
 
| <center>Yes</center>  
 
| <center>Yes</center>  
 
|  
 
|  
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">{n,m}</span>'''</center>  
+
| <center>'''<span class="hl-orange">{n,m}</span>'''</center>  
 
| at least ''n'' but not more than ''m'' times  
 
| at least ''n'' but not more than ''m'' times  
 
| <center>Yes</center>  
 
| <center>Yes</center>  
 
|  
 
|  
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">*?</span>'''</center>  
+
| <center>'''<span class="hl-orange">*?</span>'''</center>  
 
| zero or more  
 
| zero or more  
 
| <center>No</center>  
 
| <center>No</center>  
| equivalent to '''<span style="color: darkorange; font-weight: bold;">{0,}?</span>'''
+
| equivalent to '''<span class="hl-orange">{0,}?</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">+?</span>'''</center>  
+
| <center>'''<span class="hl-orange">+?</span>'''</center>  
 
| one or more  
 
| one or more  
 
| <center>No</center>  
 
| <center>No</center>  
| equivalent to '''<span style="color: darkorange; font-weight: bold;">{1,}?</span>'''
+
| equivalent to '''<span class="hl-orange">{1,}?</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">??</span>'''</center>  
+
| <center>'''<span class="hl-orange">??</span>'''</center>  
 
| zero or one  
 
| zero or one  
 
| <center>No</center>  
 
| <center>No</center>  
| equivalent to '''<span style="color: darkorange; font-weight: bold;">{0,1}?</span>'''
+
| equivalent to '''<span class="hl-orange">{0,1}?</span>'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">{n}?</span>'''</center>  
+
| <center>'''<span class="hl-orange">{n}?</span>'''</center>  
 
| exactly ''n'' times  
 
| exactly ''n'' times  
 
| <center>No</center>  
 
| <center>No</center>  
 
|  
 
|  
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">{n,}?</span>'''</center>  
+
| <center>'''<span class="hl-orange">{n,}?</span>'''</center>  
 
| at least ''n ''times  
 
| at least ''n ''times  
 
| <center>No</center>  
 
| <center>No</center>  
 
|  
 
|  
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">{n,m}?</span>'''</center>  
+
| <center>'''<span class="hl-orange">{n,m}?</span>'''</center>  
 
| at least ''n'' but not more than ''m'' times  
 
| at least ''n'' but not more than ''m'' times  
 
| <center>No</center>  
 
| <center>No</center>  
Line 296: Line 297:
 
! <center>Remarks</center>
 
! <center>Remarks</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foob.*r</span>'''</center>  
+
| <center>'''<span class="hl-orange">foob.*r</span>'''</center>  
| matches '''foob<span style="color: teal; font-weight: bold;">a</span>r''', '''foob<span style="color: teal; font-weight: bold;">alkjdflkj9</span>r''' and '''foobr'''
+
| matches '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">alkjdflkj9</span>r''' and '''foobr'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foob.+r</span>'''</center>  
+
| <center>'''<span class="hl-orange">foob.+r</span>'''</center>  
| matches '''foob<span style="color: teal; font-weight: bold;">a</span>r''', '''foob<span style="color: teal; font-weight: bold;">alkjdflkj9</span>r''' but not '''foobr'''
+
| matches '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">alkjdflkj9</span>r''' but not '''foobr'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foob.?r</span>'''</center>  
+
| <center>'''<span class="hl-orange">foob.?r</span>'''</center>  
| matches '''foob<span style="color: teal; font-weight: bold;">a</span>r,''' '''foob<span style="color: teal; font-weight: bold;">b</span>r''' and '''foobr''' but not '''foob<span style="color: red; font-weight: bold;">alkj9</span>r'''
+
| matches '''foob<span class="hl-teal">a</span>r,''' '''foob<span class="hl-teal">b</span>r''' and '''foobr''' but not '''foob<span class="hl-red">alkj9</span>r'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">fooba{2}r</span>'''</center>  
+
| <center>'''<span class="hl-orange">fooba{2}r</span>'''</center>  
| matches '''foob<span style="color: teal; font-weight: bold;">aa</span>r'''
+
| matches '''foob<span class="hl-teal">aa</span>r'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">fooba{2,}r</span>'''</center>  
+
| <center>'''<span class="hl-orange">fooba{2,}r</span>'''</center>  
| matches '''foob<span style="color: teal; font-weight: bold;">aa</span>r,''' '''foob<span style="color: teal; font-weight: bold;">aaa</span>r''', '''foob<span style="color: teal; font-weight: bold;">aaaa</span>r''' etc. but not '''foob<span style="color: red; font-weight: bold;">a</span>r'''
+
| matches '''foob<span class="hl-teal">aa</span>r,''' '''foob<span class="hl-teal">aaa</span>r''', '''foob<span class="hl-teal">aaaa</span>r''' etc. but not '''foob<span class="hl-red">a</span>r'''
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">fooba{2,3}r</span>'''</center>  
+
| <center>'''<span class="hl-orange">fooba{2,3}r</span>'''</center>  
| matches '''foob<span style="color: teal; font-weight: bold;">aa</span>r''', or '''foob<span style="color: teal; font-weight: bold;">aaa</span>r''' but not '''foob<span style="color: red; font-weight: bold;">aaaa</span>r''' or '''foob<span style="color: red; font-weight: bold;">a</span>r'''
+
| matches '''foob<span class="hl-teal">aa</span>r''', or '''foob<span class="hl-teal">aaa</span>r''' but not '''foob<span class="hl-red">aaaa</span>r''' or '''foob<span class="hl-red">a</span>r'''
 
|}
 
|}
  
 
=== Alternatives  ===
 
=== Alternatives  ===
  
A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter '''<span style="color: darkorange; font-weight: bold;">|</span>''' is used to separate the alternatives.  
+
A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter '''<span class="hl-orange">|</span>''' is used to separate the alternatives.  
  
For example, '''<span style="color: darkorange; font-weight: bold;">fee|fie|foe </span>'''will match with '''fee''', '''fie''', or '''foe''' in the target string.  
+
For example, '''<span class="hl-orange">fee|fie|foe </span>'''will match with '''fee''', '''fie''', or '''foe''' in the target string.  
  
 
It is difficult to understand where each alternative starts and ends. This is why it is a common practice to include alternatives in parentheses, to make it easier to understand.  
 
It is difficult to understand where each alternative starts and ends. This is why it is a common practice to include alternatives in parentheses, to make it easier to understand.  
  
For example, '''<span style="color: darkorange; font-weight: bold;">fee|fie|foe </span>'''can be written as '''<span style="color: darkorange; font-weight: bold;">f(e|i|o)e</span>''', to make it easier to understand.  
+
For example, '''<span class="hl-orange">fee|fie|foe </span>'''can be written as '''<span class="hl-orange">f(e|i|o)e</span>''', to make it easier to understand.  
  
Alternatives are tried from left to right, so the first alternative found for which the entire expression matches, is the one that is chosen. For example, when matching '''<span style="color: darkorange; font-weight: bold;">foo|foot</span>''' against '''barefoot,''' only the '''foo '''part will match, because that is the first alternative tried, and it successfully matches the target string. (This is important when you are capturing matched text using parentheses.)  
+
Alternatives are tried from left to right, so the first alternative found for which the entire expression matches, is the one that is chosen. For example, when matching '''<span class="hl-orange">foo|foot</span>''' against '''barefoot,''' only the '''foo '''part will match, because that is the first alternative tried, and it successfully matches the target string. (This is important when you are capturing matched text using parentheses.)  
  
 
{| class="prettytable"
 
{| class="prettytable"
Line 332: Line 333:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| '''<span style="color: darkorange; font-weight: bold;">foo(bar&#124;foo)</span>'''  
+
| '''<span class="hl-orange">foo(bar&#124;foo)</span>'''  
 
| matches '''foobar''' or '''foofoo'''
 
| matches '''foobar''' or '''foofoo'''
 
|}
 
|}
  
Also remember that alternatives cannot be used inside a character class (square brackets), because '''<span style="color: darkorange; font-weight: bold;">|</span>''' is interpreted as a literal within '''<span style="color: darkorange; font-weight: bold;">[]</span>'''. That means '''<span style="color: darkorange; font-weight: bold;">[fee|fie|foe]</span>''' is same as '''<span style="color: darkorange; font-weight: bold;">[feio|]</span>'''. (The other characters are treated as duplicates, and ignored).
+
Also remember that alternatives cannot be used inside a character class (square brackets), because '''<span class="hl-orange">|</span>''' is interpreted as a literal within '''<span class="hl-orange">[]</span>'''. That means '''<span class="hl-orange">[fee|fie|foe]</span>''' is same as '''<span class="hl-orange">[feio|]</span>'''. (The other characters are treated as duplicates, and ignored).
  
 
=== Subexpressions  ===
 
=== Subexpressions  ===
  
Parts of any RegEx pattern can be enclosed in brackets <span style="color: darkorange; font-weight: bold;">()</span>, just like using brackets in a mathematics formula. Each part that is enclosed in brackets is called a ''subexpression”.''  
+
Parts of any RegEx pattern can be enclosed in brackets <span class="hl-orange">()</span>, just like using brackets in a mathematics formula. Each part that is enclosed in brackets is called a "''subexpression".''  
  
 
The brackets serve two main purposes:  
 
The brackets serve two main purposes:  
Line 354: Line 355:
 
| <center>'''Remarks'''</center>
 
| <center>'''Remarks'''</center>
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">(fee)&#124;(fie)&#124;(foe)</span>'''</center>  
+
| <center>'''<span class="hl-orange">(fee)&#124;(fie)&#124;(foe)</span>'''</center>  
| Much better readability than the equivalent RegEx pattern '''<span style="color: darkorange; font-weight: bold;">fee&#124;fie&#124;foe</span>'''.
+
| Much better readability than the equivalent RegEx pattern '''<span class="hl-orange">fee&#124;fie&#124;foe</span>'''.
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">(foobar){2,3}</span>'''</center>  
+
| <center>'''<span class="hl-orange">(foobar){2,3}</span>'''</center>  
 
| Matches with the entire enclosed string '''foobar '''repeated 2 or 3 times'''.'''  
 
| Matches with the entire enclosed string '''foobar '''repeated 2 or 3 times'''.'''  
 
(i.e., matches with '''foobarfoobar '''or '''foobarfoobarfoobar''')<br>(The iterator acts on the entire subexpression. Compare with the example below!)  
 
(i.e., matches with '''foobarfoobar '''or '''foobarfoobarfoobar''')<br>(The iterator acts on the entire subexpression. Compare with the example below!)  
  
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">foobar{2,3}</span>'''</center>  
+
| <center>'''<span class="hl-orange">foobar{2,3}</span>'''</center>  
| Matches with '''fooba''' followed by the character '''r '''repeated 2 or 3 times'''.<br>'''(i.e., matches with fooba<span style="color: teal; font-weight: bold;">rr</span> or fooba<span style="color: teal; font-weight: bold;">rrr</span>) (The iterator acts only on the last character.)
+
| Matches with '''fooba''' followed by the character '''r '''repeated 2 or 3 times'''.<br>'''(i.e., matches with fooba<span class="hl-teal">rr</span> or fooba<span class="hl-teal">rrr</span>) (The iterator acts only on the last character.)
 
|-
 
|-
| '''<span style="color: darkorange; font-weight: bold;">foob([0-9]&#124;a+)r</span>'''  
+
| '''<span class="hl-orange">foob([0-9]&#124;a+)r</span>'''  
| matches only the character '''foob<span style="color: teal; font-weight: bold;">0</span>r''', '''foob<span style="color: teal; font-weight: bold;">1</span>r ''', '''foob<span style="color: teal; font-weight: bold;">a</span>r''', '''foob<span style="color: teal; font-weight: bold;">aa</span>r''', '''foob<span style="color: teal; font-weight: bold;">aaaa</span>r''', etc. <br>(The subexpression is evaluated first.)
+
| matches only the character '''foob<span class="hl-teal">0</span>r''', '''foob<span class="hl-teal">1</span>r ''', '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">aa</span>r''', '''foob<span class="hl-teal">aaaa</span>r''', etc. <br>(The subexpression is evaluated first.)
 
|}
 
|}
  
Line 373: Line 374:
 
You must have told (or heard-) jokes like this one:  
 
You must have told (or heard-) jokes like this one:  
  
“Two guys walk in a bar. The '''''first guy''''' says.... Then the '''''second guy''''' replies.....  
+
"Two guys walk in a bar. The '''''first guy''''' says.... Then the '''''second guy''''' replies....".  
  
 
Then you are already familiar with ''backreferences''!  
 
Then you are already familiar with ''backreferences''!  
  
A ''“backreference”'' is a ''numbered reference ''to a previously mentioned thing.  
+
A ''"backreference"'' is a ''numbered reference ''to a previously mentioned thing.  
  
 
RegEx also has backreferences. Let us understand how backreferences are defined in RegEx.  
 
RegEx also has backreferences. Let us understand how backreferences are defined in RegEx.  
Line 388: Line 389:
 
*The text matching any subexpression is given a number based on the position of that subexpression inside the pattern. In other words, text matching the ''n''th subexpression will take the number 'n'.
 
*The text matching any subexpression is given a number based on the position of that subexpression inside the pattern. In other words, text matching the ''n''th subexpression will take the number 'n'.
  
Now we use those numbers to refer to the entire pattern and/or subexpressions. (That is why these numbers are called '''“backreference”'''.)  
+
Now we use those numbers to refer to the entire pattern and/or subexpressions. (That is why these numbers are called '''"backreference"'''.)  
  
The backreference to the ''n''<sup>th</sup> subexpression is written as '''<span style="color: darkorange; font-weight: bold;">\n</span>'''.  
+
The backreference to the ''n''<sup>th</sup> subexpression is written as '''<span class="hl-orange">\n</span>'''.  
  
 
The backreferences can be used to compose the RegEx pattern itself, as shown below:  
 
The backreferences can be used to compose the RegEx pattern itself, as shown below:  
Line 396: Line 397:
 
{| class="prettytable"
 
{| class="prettytable"
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">(.)\1+</span>'''</center>  
+
| <center>'''<span class="hl-orange">(.)\1+</span>'''</center>  
 
| matches '''aaaa '''and '''cc '''(any single character that is repeated twice or more)
 
| matches '''aaaa '''and '''cc '''(any single character that is repeated twice or more)
 
|-
 
|-
| <center>'''<span style="color: darkorange; font-weight: bold;">(.+)\1+</span>'''</center>  
+
| <center>'''<span class="hl-orange">(.+)\1+</span>'''</center>  
| matches '''<span style="color: blue; font-weight: bold;">aa</span><span style="color: deeppink; font-weight: bold;">aa</span>''', '''<span style="color: blue; font-weight: bold;">c</span><span style="color: deeppink; font-weight: bold;">c</span>''', '''<span style="color: blue; font-weight: bold;">ab</span><span style="color: deeppink; font-weight: bold;">ab</span><span style="color: blue; font-weight: bold;">ab</span><span style="color: deeppink; font-weight: bold;">ab</span>''', '''<span style="color: blue; font-weight: bold;">123</span><span style="color: deeppink; font-weight: bold;">123</span>'''  
+
| matches '''<span class="hl-blue">aa</span><span class="hl-pink">aa</span>''', '''<span class="hl-blue">c</span><span class="hl-pink">c</span>''', '''<span class="hl-blue">ab</span><span class="hl-pink">ab</span><span class="hl-blue">ab</span><span class="hl-pink">ab</span>''', '''<span class="hl-blue">123</span><span class="hl-pink">123</span>'''  
 
(a set of one or more characters, repeated twice or more)  
 
(a set of one or more characters, repeated twice or more)  
  
(The character-sets are alternately colored '''<span style="color: blue; font-weight: bold;">blue</span>''' and '''<span style="color: deeppink; font-weight: bold;">pink</span> '''for easy identification. Observe how a RegEx pattern can match quite different text! )  
+
(The character-sets are alternately colored '''<span class="hl-blue">blue</span>''' and '''<span class="hl-pink">pink</span> '''for easy identification. Observe how a RegEx pattern can match quite different text! )  
  
 
|}
 
|}
Line 411: Line 412:
 
The backreferences are also used in ''find-and-replace'' operations, to re-assemble new text from old.  
 
The backreferences are also used in ''find-and-replace'' operations, to re-assemble new text from old.  
  
*The expressions '''<span style="color: darkorange; font-weight: bold;">\1</span>''' through '''<span style="color: darkorange; font-weight: bold;">\9</span>''' serve as backreferences to the subexpressions found in the RegEx pattern. The expression '''<span style="color: darkorange; font-weight: bold;">\0</span> '''is used to represent the text that matches the whole RegEx pattern. These are used in the "find" part of the operation.  
+
*The expressions '''<span class="hl-orange">\1</span>''' through '''<span class="hl-orange">\9</span>''' serve as backreferences to the subexpressions found in the RegEx pattern. The expression '''<span class="hl-orange">\0</span> '''is used to represent the text that matches the whole RegEx pattern. These are used in the "find" part of the operation.  
*The expressions '''<span style="color: darkorange; font-weight: bold;">$1</span>''' through '''<span style="color: darkorange; font-weight: bold;">$9</span>''' represent the actual text that matches the ''respective'' subexpressions.These are used in the "replace" part of the operation.
+
*The expressions '''<span class="hl-orange">$1</span>''' through '''<span class="hl-orange">$9</span>''' represent the actual text that matches the ''respective'' subexpressions.These are used in the "replace" part of the operation.
  
 
The replacement text is typically a combination of-  
 
The replacement text is typically a combination of-  

Revision as of 10:43, 30 August 2009

This article needs to be cleaned up!
  • Replace all span color highlights with custom hl-* classes found in MediaWiki:Common.css. (krtek: probably already done)
  • Replace all “” (0x201C and 0x201D) quotes with normal "" (0x22). (krtek: probably already done)
  • Check for italicized " (quotes) and make them normal. Ending quotes in particular got italicized or bolded.

Regular Expressions (RegEx) allow you to use precise search conditions, such as:

  • Your search string must be located at the beginning (or at the end) of a line,
  • The specified text must occur n times, etc.

RegEx expressions use $ . ^ { [ ( | ) * + ? \ characters (called metacharacters) in various combinations to specify all these conditions. The search engine interprets these metacharacters, rather than finding a literal match for them.

The RegEx search expression is called a "RegEx pattern", because a single expression can match many different pieces of text that share the specified pattern. For example, the RegEx pattern b.t matches bat, bet, bit, bot, but, etc.

Remember that RegEx strings are case-sensitive (The words cat, CAT, cAt, Cat, caT, cAT, CAt and CaT are not equivalent).

Also, note that even the digits (0-9) are "numeric characters" for RegEx.

In this section, the RegEx expressions (patterns) are shown in bold orange. The target strings (which are compared with the RegEx expression for a possible match) are shown in bold black. A part of the target text is color-coded to provide a clue as to why a certain part matches (green color) or does not match (red color).

Simple (literal) matches

When the search string does not contain any metacharacters, the RegEx engine works like a "normal" search: it tries to find an exact copy of the search string. (This is also known as a "literal match".)

If you want to find a literal match for a metacharacter, put a backslash \ before it. (The \ character is called "escape character", because it lets the metacharacter escape from its special duty, and lets it act as a normal character. Its combination with a metacharacter is called "escape sequence").

For example, metacharacter ^ matches the beginning of string, but \^ matches the character ^.

Note that the RegEx pattern \\ matches the character \.

RegEx pattern
Matches-
Remarks
foobar
foobar
This RegEx pattern does not contain any metacharacters; so all characters are matched literally.
\^FooBarPtr
^FooBarPtr
The \^ escape sequence searches for the character ^ literally.
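The literal-match and escaping rules above can be tried out with a short sketch. It uses Python's `re` module as a stand-in, which is an assumption: ReNamer uses its own RegEx engine, and the two agree on these basics but are not identical. The sample strings are made up for illustration.

```python
import re

# A pattern without metacharacters behaves like a plain substring search.
literal_hit = re.search(r"foobar", "xx foobar yy")

# Escaping ^ makes it match the caret character itself.
caret_hit = re.search(r"\^FooBarPtr", "int ^FooBarPtr;")

# The pattern \\ matches a single backslash.
backslash_hit = re.search(r"\\", r"C:\temp")

# re.escape() inserts the backslashes when a whole string is meant literally.
escaped = re.escape("a+b")

print(literal_hit.group(0))
print(caret_hit.group(0))
print(backslash_hit.group(0))
print(escaped)
```

Escaping by hand is fine for one or two characters; a helper such as `re.escape` is safer when the literal text comes from somewhere else, such as an existing file name.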

Escape sequences

We already saw one use of escape sequence (above).

Specific escape sequences are interpreted as special conditions, as listed below.

RegEx pattern
matches-
\xnn
Character represented by the hex code nn
\x{nnnn}
two-byte character with hex code nnnn (Unicode)
\t
tab (HT/TAB), same as \x09 (Hex 09)
\n
new line (NL), same as \x0a (Hex 0a)
\r
carriage return (CR), same as \x0d (Hex 0d)
\f
form feed (FF), same as \x0c (Hex 0c)
foo\x20bar
matches foo bar (note the space in the middle), but does not match foobar
\tfoobar
matches foobar preceded by a tab (the tab is needed for the match)

Note that the tab, new line, carriage return, and form feed are known as "white spaces". But RegEx can distinguish between them. This allows you to make high-precision searches.
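The two example rows of the table above can be checked with a minimal sketch. Python's `re` module is assumed here as a stand-in for ReNamer's engine; it spells the Unicode escape differently from \x{nnnn}, so only \xnn, \t and \n are exercised.

```python
import re

# \x20 is the hex code of a space, so a space is required between foo and bar.
with_space = re.search(r"foo\x20bar", "foo bar") is not None
without_space = re.search(r"foo\x20bar", "foobar") is not None

# \t requires a real tab character in front of foobar; spaces do not count.
with_tab = re.search(r"\tfoobar", "\tfoobar") is not None
with_spaces = re.search(r"\tfoobar", "    foobar") is not None

# \n and \x0a denote the same character.
same_newline = re.fullmatch(r"\x0a", "\n") is not None

print(with_space, without_space, with_tab, with_spaces, same_newline)
```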

Character Classes

A character class is a list of characters in square brackets [], which will match any one (and only one) character from the list.

Note that:

  • The characters are not separated with a comma or a space.
  • If you repeat any character in the list, it is considered only once (duplicates are ignored).
  • A hyphen - is used to indicate range of characters.
RegEx Pattern
Remarks
[abdef]
Matches a, b, d, e, or f (only one character), but no other characters
[c-m]
Matches any one (and only one) of the small alphabetical characters, from c to m
[G-J]
Matches any one (and only one) of the capital alphabetical characters from G to J
[a-zA-Z]
Matches any one (and only one) of the alphabetical characters (capital or small)
[5-8]
Matches any one (and only one) of numerical characters from 5 to 8
[\n-\x0D]

Matches any one (and only one) of #10, #11, #12 or #13
(Note the use of escape sequence inside a class)

There are some special conditions:

  • If you do not want any of the characters in the specified class, then place ^ at the very beginning of the list (RegEx interprets that as "none of the characters listed in this class").
  • If you want [ or ] itself to be a member of a class, put it at the start or end of the list, or create an escape sequence (by putting \ before it).
RegEx Pattern
Remarks
[-az]
matches a, z, and -
(since - is placed at the beginning, the escape sequence is not needed)
[a\-z]
matches a, z, and -
(since - is not at the beginning or end, the escape sequence is needed)
[^0-9]
matches any non-digit character
[]-a]
matches any character from ] to a.
(since ] is at the beginning, the escape sequence is not needed)
foob[aeiou]r
Matches with foobar, foober, etc. but not foobbr, foobcr, etc.
foob[^aeiou]r
Matches with foobbr, foobcr etc. but not foobar, foober, etc.
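The class rules above can be illustrated with a small sketch. It assumes Python's `re` module rather than ReNamer's own engine. In Python a range such as ]-a follows character-code order (] is 0x5D and a is 0x61); whether ReNamer orders characters the same way is not confirmed here.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

words = ["foobar", "foober", "foobbr", "foobcr"]

# A class matches exactly one character from the list.
vowel_words = [w for w in words if matches(r"foob[aeiou]r", w)]

# A leading ^ negates the class.
non_vowel_words = [w for w in words if matches(r"foob[^aeiou]r", w)]

# A hyphen at the start is literal; in the middle it must be escaped.
hyphen_first = re.findall(r"[-az]", "a-m-z")
hyphen_escaped = re.findall(r"[a\-z]", "a-m-z")

# A ] placed first is literal, so this is the range from ] to a.
bracket_range = re.findall(r"[]-a]", "]^_`ab[")

print(vowel_words, non_vowel_words)
print(hyphen_first, hyphen_escaped, bracket_range)
```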

@@@ The "]-a" example would need a clarification as to what is the natural sequence of characters, and where is a superset of all possible characters described?

Predefined Classes

Some of the character classes are used so often that RegEx has predefined escape sequences to represent them.

RegEx Pattern
Remarks
\w
an alphanumeric character, including an underscore (_)
\W
a non-alphanumeric character
\d
a numeric character
\D
a non-numeric character
\s
any space (same as the [ \t\n\r\f] class)
\S
a non space
.
any character in line (the symbol is just a dot)

Notice that the capitalized letter is used to negate (for example, compare \w with \W)
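A short sketch of the predefined classes, again assuming Python's `re` module as a stand-in for ReNamer's engine. The sample file name is made up. Python treats \w and \d as Unicode-aware for text strings, which makes no difference for the ASCII sample used here.

```python
import re

sample = "file_01 (v2).txt"

# \w covers letters, digits and the underscore.
word_runs = re.findall(r"\w+", sample)

# \d covers digits only.
digits = re.findall(r"\d", sample)

# \W is the negation of \w.
non_word = re.findall(r"\W", sample)

# \s covers space, tab, new line, carriage return and form feed.
spaces = re.findall(r"\s", "a b\tc\n")

# The dot matches any character within the line, but not the line break.
dot_chars = re.findall(r".", "ab\n")

print(word_runs, digits, non_word, spaces, dot_chars)
```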

Word/Text Boundaries

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

RegEx Pattern
Remarks
\b
word boundary
\B
not word boundary
\A
start of text (^ is an alternative)
\Z
end of text ($ is an alternative)

These markers are combined with the search string to specify where exactly you want the search string to be. For example, \bhis\b will search for a whole word his, but will ignore this, history or whistle.
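The whole-word search described above can be sketched as follows, assuming Python's `re` module as a stand-in for ReNamer's engine. The sample sentence is made up for illustration.

```python
import re

text = "his history: this is his whistle"

# Without boundaries, "his" is also found inside longer words.
anywhere = len(re.findall(r"his", text))

# \b on both sides keeps only the standalone word.
whole_word_positions = [m.start() for m in re.finditer(r"\bhis\b", text)]

# \B on both sides keeps only occurrences buried inside a word.
inside_word_positions = [m.start() for m in re.finditer(r"\Bhis\B", text)]

print(anywhere, whole_word_positions, inside_word_positions)
```

Only the occurrence inside whistle has word characters on both sides; this ends at a boundary and history starts at one, so \Bhis\B skips both.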

Iterators (Quantifiers)

Iterators (quantifiers) are metacharacters that specify how many times the preceding expression has to repeat. A typical example is to find a 3-to-5 digit number.

RegEx newbies often place the iterators before the character that needs to repeat. Just remember that RegEx syntax is the exact opposite of the usual English syntax: the iterator comes after the expression it applies to. So, instead of "four dogs", we would have to say "dogs four", RegEx-style.

Iterators can be 'Greedy' or 'Non-Greedy'. Greedy means the expression grabs as much matching text as possible. In contrast, the non-greedy expression tries to match as little as possible.

For example,

  • when b+ (a greedy expression) is applied to string abbbbc, it returns bbbb,
  • but when b+? (a non-greedy expression) is applied to abbbbc, it returns only b.

Note that a ? attached to a greedy expression makes it non-greedy.
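The difference between greedy and non-greedy matching can be seen in a minimal sketch, assuming Python's `re` module as a stand-in for ReNamer's engine. The bracketed sample string is made up for illustration.

```python
import re

# Greedy: b+ grabs as many b characters as it can.
greedy = re.search(r"b+", "abbbbc").group(0)

# Non-greedy: b+? stops as soon as the pattern is satisfied.
lazy = re.search(r"b+?", "abbbbc").group(0)

# The same effect on a more practical pattern.
tags = "[a] x [b]"
greedy_tag = re.search(r"\[.*\]", tags).group(0)
lazy_tag = re.search(r"\[.*?\]", tags).group(0)

print(greedy, lazy)
print(greedy_tag, lazy_tag)
```

The greedy bracket pattern runs from the first [ to the last ], which is rarely what is wanted when a name contains several bracketed parts.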

RegEx pattern
Remarks
Greedy?
Remarks
*
zero or more
Yes
equivalent to {0,}
+
one or more
Yes
equivalent to {1,}
?
zero or one
Yes
equivalent to {0,1}
{n}
exactly n times
Yes
{n,}
at least n times
Yes
{n,m}
at least n but not more than m times
Yes
*?
zero or more
No
equivalent to {0,}?
+?
one or more
No
equivalent to {1,}?
??
zero or one
No
equivalent to {0,1}?
{n}?
exactly n times
No
{n,}?
at least n times
No
{n,m}?
at least n but not more than m times
No

Let us see some examples:

RegEx pattern
Remarks
foob.*r
matches foobar, foobalkjdflkj9r and foobr
foob.+r
matches foobar, foobalkjdflkj9r but not foobr
foob.?r
matches foobar, foobbr and foobr but not foobalkj9r
fooba{2}r
matches foobaar
fooba{2,}r
matches foobaar, foobaaar, foobaaaar etc. but not foobar
fooba{2,3}r
matches foobaar, or foobaaar but not foobaaaar or foobar
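The example table above can be reproduced with a small sketch. It assumes Python's `re` module as a stand-in for ReNamer's engine, and it uses `re.fullmatch` so that each candidate must match from start to end.

```python
import re

candidates = ["foobr", "foobar", "foobaar", "foobaaar", "foobaaaar", "foobalkj9r"]

def accepted(pattern):
    """Return the candidates that the pattern matches completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = accepted(r"foob.*r")             # zero or more characters
plus = accepted(r"foob.+r")             # at least one character
optional = accepted(r"foob.?r")         # at most one character
exactly_two = accepted(r"fooba{2}r")
two_or_more = accepted(r"fooba{2,}r")
two_or_three = accepted(r"fooba{2,3}r")

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", exactly_two), ("{2,}", two_or_more),
                     ("{2,3}", two_or_three)]:
    print(name, result)
```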

Alternatives

A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter | is used to separate the alternatives.

For example, fee|fie|foe will match with fee, fie, or foe in the target string.

It is difficult to understand where each alternative starts and ends. This is why it is a common practice to include alternatives in parentheses, to make it easier to understand.

For example, fee|fie|foe can be written as f(e|i|o)e, to make it easier to understand.

Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen. For example, when matching foo|foot against barefoot, only the foo part will match, because that is the first alternative tried, and it successfully matches the target string. (This is important when you are capturing matched text using parentheses.)

RegEx Pattern
Remarks
foo(bar|foo)
matches foobar or foofoo

Also remember that alternatives cannot be used inside a character class (square brackets), because | is interpreted as a literal within []. That means [fee|fie|foe] is the same as [feio|]. (The other characters are treated as duplicates, and ignored.)
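The points made in this section about alternatives can be checked with a short sketch, assuming Python's `re` module as a stand-in for ReNamer's engine. The word fue is a made-up non-matching sample.

```python
import re

# Alternatives are tried left to right, so foo wins over foot.
first_wins = re.search(r"foo|foot", "barefoot").group(0)

# Putting the longer alternative first changes the result.
longer_first = re.search(r"foot|foo", "barefoot").group(0)

# Parentheses limit the alternation to the middle letter.
grouped = [w for w in ["fee", "fie", "foe", "fue"]
           if re.fullmatch(r"f(e|i|o)e", w)]

# Inside square brackets | is just another character in the list.
pipe_is_literal = re.fullmatch(r"[fee|fie|foe]", "|") is not None
class_matches_word = re.fullmatch(r"[fee|fie|foe]", "fee") is not None

print(first_wins, longer_first, grouped)
print(pipe_is_literal, class_matches_word)
```

When one alternative is a prefix of another, listing the longer one first is a simple way to make sure it gets a chance to match.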

Subexpressions

Parts of any RegEx pattern can be enclosed in brackets (), just like using brackets in a mathematics formula. Each part that is enclosed in brackets is called a "subexpression".

The brackets serve two main purposes:

  • Better readability, as in the mathematical formula a+(b+c).
  • Make a functional group, as in the mathematical formula a(b+c). This group is evaluated first.

Let us see some examples:

RegEx Pattern
Remarks
(fee)|(fie)|(foe)
Much better readability than the equivalent RegEx pattern fee|fie|foe.
(foobar){2,3}
Matches with the entire enclosed string foobar repeated 2 or 3 times.

(i.e., matches with foobarfoobar or foobarfoobarfoobar)
(The iterator acts on the entire subexpression. Compare with the example below!)

foobar{2,3}
Matches with fooba followed by the character r repeated 2 or 3 times.
(i.e., matches with foobarr or foobarrr) (The iterator acts only on the last character.)
foob([0-9]|a+)r
matches foob0r, foob1r, foobar, foobaar, foobaaaar, etc.
(The subexpression is evaluated first.)
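The contrast between an iterator applied to a subexpression and one applied to a single character can be sketched as below, assuming Python's `re` module as a stand-in for ReNamer's engine. The non-matching samples are made up for illustration.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# The iterator applies to the whole group.
group_twice = full(r"(foobar){2,3}", "foobarfoobar")
group_vs_rr = full(r"(foobar){2,3}", "foobarr")

# Without brackets the iterator applies only to the final r.
char_rr = full(r"foobar{2,3}", "foobarr")
char_vs_twice = full(r"foobar{2,3}", "foobarfoobar")

# A group holding alternatives: one digit, or one or more a characters.
samples = ["foob0r", "foobaaar", "foob0ar", "foobr"]
mixed = [s for s in samples if full(r"foob([0-9]|a+)r", s)]

print(group_twice, group_vs_rr, char_rr, char_vs_twice, mixed)
```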

Backreferences

You must have told (or heard) jokes like this one:

"Two guys walk into a bar. The first guy says.... Then the second guy replies....".

Then you are already familiar with backreferences!

A "backreference" is a numbered reference to a previously mentioned thing.

RegEx also has backreferences. Let us understand how backreferences are defined in RegEx.

The RegEx engine tries to find text that matches the whole RegEx pattern. If a matching text is found, the RegEx engine identifies the matching text for each of the subexpressions in the pattern.

At this stage, the RegEx engine gives numbers to these matching parts:

  • The text that matches the entire RegEx expression takes the number '0'.
  • The text matching any subexpression is given a number based on the position of that subexpression inside the pattern. In other words, text matching the nth subexpression will take the number 'n'.

Now we use those numbers to refer to the entire pattern and/or subexpressions. (That is why these numbers are called "backreferences".)

The backreference to the nth subexpression is written as \n.

The backreferences can be used to compose the RegEx pattern itself, as shown below:

(.)\1+
matches aaaa and cc (any single character that is repeated twice or more)
(.+)\1+
matches aaaa, cc, abababab, 123123

(a set of one or more characters, repeated twice or more)

(The character-sets are alternately colored blue and pink for easy identification. Observe how a RegEx pattern can match quite different text! )
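The two patterns above can be tried with a minimal sketch, assuming Python's `re` module as a stand-in for ReNamer's engine. Python happens to use the same \1 notation inside a pattern. The input strings other than those in the table are made up.

```python
import re

# (.)\1+ finds runs of one repeated character.
runs = [m.group(0) for m in re.finditer(r"(.)\1+", "aaaa b cc d")]

# (.+)\1+ finds a block of characters repeated back to back.
def repeated_block(text):
    """Return the repeated block captured by group 1, or None if there is none."""
    m = re.fullmatch(r"(.+)\1+", text)
    return m.group(1) if m else None

block_abab = repeated_block("abababab")
block_123 = repeated_block("123123")
block_none = repeated_block("abcabd")

print(runs, block_abab, block_123, block_none)
```

Because .+ is greedy, the engine settles on the longest block that still leaves room for at least one repetition, which is why abababab is captured as abab followed by abab rather than four copies of ab.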

Substitution of text using backreference

The backreferences are also used in find-and-replace operations, to re-assemble new text from old.

  • The expressions \1 through \9 serve as backreferences to the subexpressions found in the RegEx pattern. The expression \0 is used to represent the text that matches the whole RegEx pattern. These are used in the "find" part of the operation.
  • The expressions $1 through $9 represent the actual text that matches the respective subexpressions. These are used in the "replace" part of the operation.

The replacement text is typically a combination of:

  • The text that matched the subexpressions, and
  • Some new text.

Note that the RegEx pattern may have some parts that are not enclosed in (). (In other words, it may have parts that are not subexpressions.) Such parts are not used in the replacement text.

Here are some "find-and-replace" examples:

Expression Replace Description
(.*) (.*)
$2, $1
Switch two words around and put a comma after the resulting first word. Example: if input string is "John Smith", then output will be "Smith, John".

Notice that the replacement text also has additional literal text in the middle (comma and space).

\b(\d{2})-(\d{2})-(\d{4})\b
$3-$2-$1
Find date sequences in dd-mm-yyyy format and reverse them into yyyy-mm-dd format.
(e.g. 25-10-2007 is converted to 2007-10-25).

Note: This is not a very robust example, because \d can represent any digit in range of 0-9. That means sequences like 99-99-9999 also will match this pattern, resulting in a problem. This in fact shows that you need to be careful with RegEx patterns!

\[.*?\]
(empty)
Remove the contents of the [...] (square brackets), and the brackets too.
(Replace with nothing means deleting.)
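The find-and-replace examples above can be sketched in Python. This is an assumption-laden stand-in: Python's `re.sub` writes a reference in the replacement text as \1 where ReNamer writes $1, and the engines are not identical. The stricter date pattern and the file name Track[live].mp3 are illustrative additions, not part of the original examples.

```python
import re

# Swap two words and add a comma ($2, $1 in ReNamer notation).
swapped = re.sub(r"(.*) (.*)", r"\2, \1", "John Smith")

# Reverse dd-mm-yyyy into yyyy-mm-dd ($3-$2-$1 in ReNamer notation).
loose_date = r"\b(\d{2})-(\d{2})-(\d{4})\b"
reversed_date = re.sub(loose_date, r"\3-\2-\1", "25-10-2007")

# The loose pattern also rewrites impossible dates.
reversed_bogus = re.sub(loose_date, r"\3-\2-\1", "99-99-9999")

# A stricter sketch: day 01-31 and month 01-12 (it still accepts 31-02).
strict_date = r"\b(0[1-9]|[12]\d|3[01])-(0[1-9]|1[0-2])-(\d{4})\b"
strict_bogus = re.sub(strict_date, r"\3-\2-\1", "99-99-9999")

# Replacing with nothing deletes the bracketed part.
stripped = re.sub(r"\[.*?\]", "", "Track[live].mp3")

print(swapped, reversed_date, reversed_bogus, strict_bogus, stripped)
```

Tightening the digit ranges is one way to act on the warning above about \d; full calendar validation is usually better done outside the pattern.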

Limitations for binary data

One known limitation of the RegEx engine when working with binary data is that the input string is not searched beyond the first occurrence of the NULL character (\x00). This does not affect file names, because they cannot contain NULL characters, but it may affect parsing of the binary content of files, for example when working in Pascal Script.

External links