Regular Expressions in Python


Python supports regular expressions. Regular expressions are a puzzle in themselves, but very useful.

As the old joke goes: some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Regular expressions (also known as rule expressions) are a small, highly specialized language used to describe the structure (rules) of strings; they are executed by a matching engine.


Python Regular Expression

In Python, regular expression matching is available directly through the built-in re module. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.

A regular expression is a special string pattern used to match a set of strings. The pattern is built from ordinary characters and metacharacters according to fixed composition rules.

Role

Usually used to retrieve and replace text that fits a certain pattern (rule).

  • Data validation (testing patterns within a string): testing whether the input string conforms to certain rules, is allowed to be entered, etc. For example, you can test the input string to see if there is a phone number pattern, credit card number pattern, IP address pattern, etc.; verify the legitimacy of the email address, date of birth, etc.

  • Manipulate text: Use regular expressions to identify specific text in a document, delete that text completely or replace it with other text.

  • Extract substrings from strings based on pattern matching: find specific text within a document, or within an input field.
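
As a rough illustration of these three roles, the sketch below validates, replaces, and extracts text. The patterns and sample strings are made up for this example; in particular, the IP pattern only checks the general shape of an address, not that each number is in range.

```python
import re

# 1. Validation: does the whole input look like four dot-separated numbers?
ip_shape = re.compile(r"\d{1,3}(\.\d{1,3}){3}")
looks_like_ip = ip_shape.fullmatch("192.168.0.1") is not None
not_an_ip = ip_shape.fullmatch("192.168.0") is not None

# 2. Manipulation: mask every digit in a piece of text.
masked = re.sub(r"\d", "#", "card 1234")

# 3. Extraction: pull out every date-like substring.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", "born 1990-05-17, hired 2015-09-01")

print(looks_like_ip, not_an_ip, masked, dates)
```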

Basic grammar (rules)

Background notes:

  • Strings are one of the most involved data structures in programming, and the need to operate on strings is ubiquitous.
  • The match (operation) object of a regular expression is a string, not another type of content.
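  • The pattern and the string to be searched can be either Unicode strings (str) or 8-bit strings (bytes). However, a Unicode string cannot be matched against a bytes pattern, and vice versa. That is, the two types must be the same.
  • The backslash plague: regular expressions use the backslash character \ as an escape character, which collides with Python's own use of the backslash in string literals. Writing patterns as raw strings (r"...") avoids having to double every backslash.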
  • Regular expressions can be linked together to form new regular expressions. For example, if A and B are both regular expressions, then AB is also a regular expression. That is, if a string p matches A and a string q matches B, then the string pq matches AB.
  • Greedy vs. non-greedy quantifiers: quantifiers in Python are greedy by default, always trying to match as many characters as possible; non-greedy quantifiers do the opposite, always trying to match as few characters as possible. Example: the regular expression ab* finds abbb when applied to abbb, while the non-greedy version ab*? finds only a.
  • Matching modes: regular expressions provide several matching modes (flags), such as ignoring case and multi-line matching, which are introduced together with the factory method re.compile(pattern[, flags]) of the Pattern class.
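
A few of the points in this list can be checked directly. The following is a minimal sketch; the sample strings and variable names are chosen only for illustration.

```python
import re

# Greedy: b* consumes as many "b" characters as possible.
greedy = re.match(r"ab*", "abbb").group()
# Non-greedy: b*? consumes as few as possible (here, none).
lazy = re.match(r"ab*?", "abbb").group()

# Concatenation: "ab" matches A and "12" matches B, so "ab12" matches AB.
A, B = r"[a-z]+", r"\d+"
joined_matches = re.fullmatch(A + B, "ab12") is not None

# A str pattern cannot be applied to a bytes object (and vice versa).
try:
    re.search(r"\d", b"123")
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

print(greedy, lazy, joined_matches, mixed_types_ok)
```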

Python Regular Expressions - Practice: re Module Explained

Python implements regular expressions via the re module. The general step in using re is to first compile the string form of the regular expression into a Pattern instance; then use the Pattern instance to process the text and get a match result (a Match instance); and finally use the Match instance to get information and do other operations.
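
The three steps just described can be sketched as follows, using a trivial pattern and input chosen for illustration.

```python
import re

# Step 1: compile the string form of the regex into a Pattern instance.
pattern = re.compile(r"hello")

# Step 2: use the Pattern to process text; the result is a Match or None.
match = pattern.match("hello world!")

# Step 3: use the Match instance to get information about the match.
matched_text = match.group() if match is not None else None
print(matched_text)
```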

1, The re module defines several functions, constants, and an exception.

In Python 3.6 and above, the Flag constant is now an instance of RegexFlag (which is a subclass of enum.IntFlag).

1.1, re.compile(pattern[, flags=0])

It is a factory method of the Pattern class that compiles regular expression patterns in string form into regular expression objects (Pattern objects) that can be matched using match(), search(), and other methods of regular expression objects (described in the next section).

The second parameter, flags, selects the matching mode. Several flags can be combined with the bitwise OR operator |, such as re.I | re.M. Alternatively, the mode can be specified inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').

Name Description
re.A (re.ASCII) Makes \w, \W, \b, \B, \d, \D, \s, \S perform ASCII-only matching instead of full Unicode matching.
re.DEBUG Displays debugging information about the compiled expression.
re.I (re.IGNORECASE) Performs case-insensitive matching. For example, the expression [A-Z] will also match lowercase letters.
re.L (re.LOCALE) Makes the predefined character classes \w, \W, \b, \B depend on the current locale. PS: in Python 3.6 and above, re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII.
re.M (re.MULTILINE) Multi-line mode; ^ and $ also match at the start and end of each line.
re.S (re.DOTALL) Dot-matches-all mode; . matches any character, including a newline.
re.X (re.VERBOSE) Verbose mode. A regular expression in this mode can span multiple lines, ignores whitespace, and can contain comments.
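
The equivalence between passing flags to re.compile() and writing inline flags in the pattern can be checked with a short sketch. The sample text is an assumption made for this example.

```python
import re

text = "first line\nPYTHON rocks"

# Flags passed as an argument, combined with the bitwise OR operator.
with_flags = re.compile(r"^python", re.I | re.M)
# The same flags written inline at the start of the pattern.
inline = re.compile(r"(?im)^python")

found_a = with_flags.search(text).group()
found_b = inline.search(text).group()

# Without re.M, ^ matches only at the very start of the string.
without_multiline = re.search(r"^python", text, re.I)

print(found_a, found_b, without_multiline)
```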

With respect to re.VERBOSE, the following two regular expressions are equivalent.

a = re.compile(r"""\d + # the integral part
                   \.    # the decimal point
                   \d * # some fractional digits """, re.X)
b = re.compile(r"\d+\.\d*")

The re module provides numerous module-level functions that perform regular expression operations. Each of them can be replaced by the corresponding method of a Pattern instance; the only benefit of the module-level form is saving one line of re.compile() code, at the cost of not being able to reuse the compiled Pattern object. These functions are presented together with the instance methods of the Pattern class. The compile-then-match steps described above can be abbreviated to:

m = re.match(r'hello', 'hello world!')
print(m.group())

The re module also provides the function escape(string), which prefixes each regular expression metacharacter in string (such as *, +, ?) with an escape character and returns the result. This is useful when the text to match contains many metacharacters.

1.2, re.search(pattern, string, flags=0)

Scan the string to find the first position where the regular expression pattern produces a match and return the corresponding match. If no position in the string matches the pattern, then None is returned; note that this is different from finding a zero-length match at a point in the string.

1.3, re.match(pattern, string, flags=0)

If 0 or more characters at the beginning of a string match the regular expression pattern, the corresponding match is returned. Returns None if the string does not match the pattern; note that this is different from a 0-length match.

PS: Even in MULTILINE mode, re.match() will match only at the beginning of the string, not at the beginning of each line.

If you want to find a match anywhere in the string, use search() instead.
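
A small sketch of this difference, using a two-line sample string chosen for illustration:

```python
import re

text = "alpha\nbeta"

# re.match() only tries at the start of the string, even with re.M.
at_line_start = re.match(r"beta", text, re.M)

# re.search() with ^ and re.M can match at the start of any line.
found = re.search(r"^beta", text, re.M)

print(at_line_start, found.span())
```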

1.4, re.fullmatch(pattern, string, flags=0)

Added in Python 3.4.

If the entire string matches the regular expression pattern, the corresponding match is returned.

If the string doesn't match the pattern, it returns None; note that this is also different from a 0-length match.
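
The sketch below contrasts match() and fullmatch() and shows why None is different from a zero-length match. The inputs are illustrative only.

```python
import re

# match() succeeds as long as the beginning of the string matches.
prefix = re.match(r"\d+", "123abc")

# fullmatch() requires the entire string to match.
whole = re.fullmatch(r"\d+", "123abc")

# A zero-length match is still a Match object, not None.
empty = re.fullmatch(r"\d*", "")

print(prefix.group(), whole, repr(empty.group()))
```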

1.5, re.split(pattern, string, maxsplit=0, flags=0)

Split the string by the occurrences of the regular expression (the pattern acts as the separator). If capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list. The parameter maxsplit is the maximum number of splits and defaults to 0, meaning no limit; if it is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. The flags parameter was added in Python 3.1.

>>> import re
>>> re.split(r"\W+", "Words,words,words.")  # split on runs of non-word characters; the trailing "." also counts, so the list ends with an empty string
['Words', 'words', 'words', '']
>>> re.split(r"(\W+)", "Words,words,words.")  # with a capturing group, the separators also appear in the list
['Words', ',', 'words', ',', 'words', '.', '']
>>> re.split(r"\W+", "Words,words,words.", 1)  # split at most once
['Words', 'words,words.']
>>> re.split("[a-f]+", "0a3B9", flags=re.IGNORECASE)  # split on the letters a-f, ignoring case, which leaves the digits
['0', '3', '9']

If the separator contains a capturing group and it matches at the very start of the string, the result begins with an empty string. The same applies to the end of the string.

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

This way, the separator component always finds the same relative index in the results list.

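PS: Before Python 3.7, split() never split a string on an empty pattern match. For example, even though 'x*' can also match zero 'x' characters, the result below is not ['', 'a', 'b', 'c', '']; the string is split only where an 'x' is actually matched.

>>> re.split('x*', 'axbc')
['a', 'bc']

In Python 3.5 and 3.6, a pattern that can only match an empty string raises ValueError. Since Python 3.7, split() also splits on empty matches, so both examples in this note behave differently on current versions.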

>>> re.split("^$", "foo\n\nbar\n", flags=re.M)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...

ValueError: split() requires a non-empty pattern match.

1.6, re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of the pattern in the string as a list. The string is scanned from left to right, and matches are returned in the order found. If the pattern contains exactly one group, a list of that group's matches is returned; if it contains multiple groups, a list of tuples is returned. Empty matches are included in the result.

>>> re.findall(r'^|\w+', 'two words')
['', 'wo', 'words']

PS: In Python 3.6 and earlier, due to a limitation of the implementation, the character following an empty match is not included in the next match, which is why there is no "t" in the result above. This was changed in Python 3.7, where a non-empty match can start immediately after a previous empty match.

The regular expression r'^|\w+' matches either the empty string at the beginning of the string or a run of one or more word characters. The alternatives separated by | are tried from left to right; once the left alternative matches, the right one is skipped.
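
How the return type of findall() depends on the number of groups can be seen in a short sketch. The sample text and patterns are assumptions made for this example.

```python
import re

text = "a@x b@y"

# No groups: a list of the full matches.
no_groups = re.findall(r"\w+@\w+", text)

# One group: a list of what that group matched.
one_group = re.findall(r"(\w+)@\w+", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)@(\w+)", text)

print(no_groups, one_group, two_groups)
```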

1.7, re.finditer(pattern, string, flags=0)

Returns an iterator yielding match objects over all non-overlapping matches of the pattern in the string. The string is scanned from left to right, and matches are returned in the order found. Empty matches are included in the result. See also findall().
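
Because finditer() yields match objects rather than strings, positions are available as well as text. A minimal sketch with an illustrative input:

```python
import re

text = "one 22 three 4444"

# Each item produced by finditer() is a Match object.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(found)
```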

1.8, re.sub(pattern, repl, string, count=0, flags=0)

sub is short for substitute.

It uses regular expressions to perform more powerful substitutions than the plain string method str.replace().

pattern, the regular expression to search for, can be either a string or a Pattern object; if the pattern is not found, the original string is returned unchanged.

repl, can be a string or a function.

If it is a string, any backslash escapes in it are processed: \n is converted to a single newline character, \r to a carriage return, and so on. Unknown escapes of non-letter characters, such as \&, are left alone, and a backreference such as \6 is replaced by the substring matched by group 6 in the pattern.

PS: An unknown escape consisting of \ and an ASCII letter is an error.

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):', r'static PyObject*\npy_\1(void)\n{', 'def myfunc():')

'static PyObject*\npy_myfunc(void)\n{'

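If repl is a function, it is called for every non-overlapping occurrence of the pattern. The function takes a single match object as its argument and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'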

count is the maximum number of pattern occurrences to replace; the default, 0, replaces all occurrences.

1.9, re.subn(pattern, repl, string, count=0, flags=0)

Performs the same operation as sub(), but returns a tuple (new_string, number_of_subs_made) containing the new string and the number of replacements made.
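
A short sketch of subn(), including the effect of count. The input string is illustrative.

```python
import re

messy = "a  b \t c"

# Collapse every run of whitespace to one space and count the replacements.
cleaned, n_all = re.subn(r"\s+", " ", messy)

# With count=1, only the first run is replaced.
partly, n_one = re.subn(r"\s+", " ", messy, count=1)

print(repr(cleaned), n_all, repr(partly), n_one)
```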

1.10, re.escape(pattern)

Escapes special characters in pattern. In Python 3.6 and earlier, every character except ASCII letters, digits, and '_' is escaped; since Python 3.7, only characters that can have a special meaning in a regular expression are escaped, so newer versions print fewer backslashes than the Python 3.6 output shown below. This is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters.

>>> import string
>>> print(re.escape('python.exe'))
python\.exe
>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
>>> print('[%s]+' % re.escape(legal_chars))
[abcdefghijklmnopqrstuvwxyz0123456789\!\#\$\%\&\'\*\+\-\.\^_\`\|\~\:]+

>>> operators = ['+', '-', '*', '/', '**']
>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
\/|\-|\+|\*\*|\*

Do not use this function for the replacement string in sub() and subn(); there, only backslashes should be escaped. Example:

>>> digits_re = r'\d+'
>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))

/usr/sbin/sendmail - \d+ errors, \d+ warnings
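
To show why re.escape() matters when searching for literal text, here is a minimal sketch. The literal and the sample sentence are made up for illustration.

```python
import re

literal = "1+1=2?"
sentence = "is 1+1=2? yes"

# Used as-is, "+" and "?" act as quantifiers, so the literal is not found.
unescaped = re.search(literal, sentence)

# After escaping, every metacharacter is matched literally.
escaped = re.search(re.escape(literal), sentence)

print(unescaped, escaped.group())
```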

1.11, re.purge()

Clears the regular expression cache.

1.12, exception re.error(msg, pattern=None, pos=None)

An exception raised when a string passed to one of the functions is not a valid regular expression, or when other errors occur during compilation or matching.

The properties of the re.error() instance are: msg, pattern, pos, lineno, colno.
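
A sketch of catching re.error and reading some of these attributes; the invalid pattern is chosen for illustration, and the exact wording of msg may vary between Python versions.

```python
import re

bad_pattern = r"(unclosed"

try:
    re.compile(bad_pattern)
    caught = None
except re.error as exc:
    # msg is the unformatted message; pattern and pos locate the problem.
    caught = (exc.msg, exc.pattern, exc.pos)

print(caught)
```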

2, Regular expression objects (Pattern)

The regular expression object supports the following methods and properties.

2.1, regex.search(string[, pos[, endpos]])

Scan the string to find the first position where the regular expression produces a match and return the corresponding match object. If no position in the string matches the pattern, None is returned; note that this is different from finding a zero-length match at some point in the string.

The optional parameter pos gives the index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string: the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search starts.

The optional parameter endpos limits how far the string will be searched; it is as if the string were endpos characters long, so only the characters from pos to endpos - 1 are searched for a match. If endpos is less than pos, no match is found. Otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

>>> pattern = re.compile("d")
>>> pattern.search("dog") # Match at index 0
<_sre.SRE_Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
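
The difference between pos and slicing, and the truncating effect of endpos, can be sketched as follows. The patterns and strings are assumptions made for this example.

```python
import re

starts_with_b = re.compile(r"^b")

# With pos=1 the search starts at "b", but ^ still refers to index 0.
with_pos = starts_with_b.search("ab", 1)
# Slicing creates a new string whose real beginning is "b".
with_slice = starts_with_b.search("ab"[1:])

# endpos behaves as if the string ended there, so $ can match at endpos.
trailing_digits = re.compile(r"\d+$")
with_endpos = trailing_digits.search("123abc", 0, 3)
truncated = trailing_digits.search("123abc"[:3], 0)

print(with_pos, with_slice.group(), with_endpos.group(), truncated.group())
```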

2.2, regex.match(string[, pos[, endpos]])

If zero or more characters at the beginning of the string match this regular expression, the corresponding match object is returned. If the string does not match, None is returned.

Optional parameters pos, endpos refer to regex.search().

>>> pattern = re.compile("o")
>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
<_sre.SRE_Match object; span=(1, 2), match='o'>

search() vs. match()

re.match() checks only the beginning of a string.

re.search() checks any position of the string.

>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>

A regular expression starting with '^' in search() will strictly match to the beginning of the string.

>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>

2.3, regex.fullmatch(string[, pos[, endpos]]) This is new to Python 3.4.

If the entire string matches the regular expression, the corresponding match object is returned; otherwise, None is returned.

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre") # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
<_sre.SRE_Match object; span=(1, 3), match='og'>

2.4, regex.split(string, maxsplit=0)

Identical to re.split(), but uses the compiled pattern.

2.5, regex.findall(string[, pos[, endpos]])

Similar to re.findall(), but uses the compiled pattern. It also accepts the optional pos and endpos parameters, which limit the search region as they do for search().

2.6, regex.finditer(string[, pos[, endpos]])

Similar to re.finditer(), but uses the compiled pattern. It also accepts the optional pos and endpos parameters, which limit the search region as they do for search().

2.7, regex.sub(repl, string, count=0)

Identical to re.sub(), but uses the compiled pattern.

2.8, regex.subn(repl, string, count=0)

The same as re.subn() when the compiled pattern is used.

2.9, regex.flags

The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.

2.10, regex.groups

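The number of capturing groups in the pattern.

2.11, regex.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.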

2.12, regex.pattern

The pattern string of the compiled RE object.
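
The attributes of a compiled pattern can be inspected directly. The date-like pattern below is a made-up example.

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", re.I)

group_count = date_re.groups             # number of capturing groups
name_to_index = dict(date_re.groupindex) # named groups only
source = date_re.pattern                 # the original pattern string
ignores_case = bool(date_re.flags & re.I)

print(group_count, name_to_index, source, ignores_case)
```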

3, Match objects

The Match object supports the following methods and properties.

3.1, match.expand(template)

Returns the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

Python 3.5 changes: groups that are not matched to will be replaced with empty strings.
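
A minimal sketch of expand() with both numeric and named backreferences; the group names first and last are chosen only for this example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric backreferences.
by_number = m.expand(r"\2, \1")
# Named backreferences.
by_name = m.expand(r"\g<last>, \g<first>")

print(by_number, by_name)
```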

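3.2, match.group([group1, ...])

Returns one or more subgroups of the match. With a single argument the result is a single string; with multiple arguments it is a tuple with one item per argument. Without arguments, group1 defaults to 0 and the whole match is returned.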

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'
>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
>>> m.group(1) # Returns only the last match.
'c3'

3.3, match.__getitem__(g): This is new to Python 3.6.

It is identical to m.group(g) and allows easier access to an individual group from a match.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m[0] # The entire match
'Isaac Newton'
>>> m[1] # The first parenthesized subgroup.
'Isaac'
>>> m[2] # The second parenthesized subgroup.
'Newton'

3.4, match.groups(default=None)

Returns a tuple containing all the subgroups of the match. The default argument is used for groups that did not participate in the match; it defaults to None.

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups() # Second group defaults to None.
('24', None)
>>> m.groups('0') # Now, the second group defaults to '0'.
('24', '0')

3.5, match.groupdict(default=None)

Returns a dictionary containing all named subgroups that match.

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

3.6, match.start([group]) and match.end([group])

Returns an index of the beginning and end of a substring matched by the group.

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]

'tony@tiger.net'

3.7, match.span([group])

For a match m, returns the 2-tuple (m.start(group), m.end(group)).

3.8, match.pos and match.endpos

The values of pos and endpos that were passed to the search() or match() method of the regex object.

3.9, match.lastindex

The integer index of the last matched capturing group, or None if no group was matched at all.

3.10, match.lastgroup

The name of the last matched capturing group, or None if the group did not have a name or no group was matched at all.

3.11, match.re

The regular expression object whose match() or search() method produced this match instance.

3.12, match.string

Get the string passed to the match() or search() method.
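
The match attributes from 3.7 through 3.12 can be seen together in one sketch. The pattern, the group name word, and the input are assumptions made for illustration.

```python
import re

word_then_digits = re.compile(r"(?P<word>[a-z]+)(\d+)?")
m = word_then_digits.search("## abc", 3)   # start searching at index 3

whole_span = m.span()        # span of the entire match
missing_span = m.span(2)     # group 2 did not participate in the match
limits = (m.pos, m.endpos)   # search limits that were in effect
last = (m.lastindex, m.lastgroup)
same_pattern = m.re is word_then_digits
subject = m.string

print(whole_span, missing_span, limits, last, same_pattern, subject)
```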

4, Official regular expression example

Checking a poker hand for a pair.

Suppose a player holds 5 playing cards, each represented by a single character, so a hand is a string of 5 characters. Here, a is the ace; k is the king; q is the queen; j is the jack; t is the 10; and 2 through 9 represent the cards with those values.

First, use a helper function to elegantly display the match.

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Secondly, check that the cards in your hand are valid.

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q")) # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e")) # Invalid.
>>> displaymatch(valid.match("akt")) # Invalid.
>>> displaymatch(valid.match("727ak")) # Valid.
"<Match: '727ak', groups=()>"

Then, check your hand for pairs of cards.

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak")) # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak")) # No pairs.
>>> displaymatch(pair.match("354aa")) # Pair of aces.
"<Match: '354aa', groups=('a',)>"

Finally, find out what the pair is.

>>> pair.match("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'
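
One way to avoid the AttributeError shown above is to check for None before calling group(). The helper name pair_rank below is hypothetical and not part of the official example.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_rank(hand):
    """Return the rank of a pair in hand, or None if there is no pair."""
    m = pair.match(hand)
    # match() returns None when nothing matches, so guard before group().
    return m.group(1) if m is not None else None

ranks = [pair_rank(hand) for hand in ("717ak", "718ak", "354aa")]
print(ranks)
```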