Python Regex - RegEx Functions | Metacharacters | Special Sequences

A regular expression is a sequence of characters with a highly specialized syntax that we can use to find or match other characters or groups of characters. Regular expressions, or regexes, are widely used in the UNIX world.

Import the re Module

The re module in Python gives full support for Perl-style regular expressions. The re module raises the re.error exception whenever an error occurs while compiling or using a regular expression.
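As a small illustration of the exception mentioned above, the following sketch tries to compile a deliberately malformed pattern and catches re.error. The pattern (an unclosed group) is an assumption chosen only for demonstration.

```python
import re

# "(" opens a group that is never closed, so the pattern is invalid.
bad_pattern = "("

try:
    re.compile(bad_pattern)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("Invalid pattern:", exc)
```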

We'll go over crucial functions utilized to deal with regular expressions.

But first, a minor point: some characters have a special meaning when used in a regular expression. These are called metacharacters.

Most letters and characters simply match themselves. For example, the regular expression 'check' will match exactly the string 'check'. (A case-insensitive mode can be enabled, allowing this RE to match 'Check' or 'CHECK' as well.)

There are some exceptions to this general rule; certain symbols are special metacharacters that don't match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other parts of the RE by repeating them or changing their meaning.
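A minimal sketch of literal matching and the case-insensitive mode described above. The sample strings are hypothetical; re.IGNORECASE is the standard flag for case-insensitive matching.

```python
import re

# A plain pattern matches itself, character for character.
exact = re.search("check", "please check this")

# Without a flag, matching is case-sensitive.
no_flag = re.search("check", "PLEASE CHECK THIS")

# re.IGNORECASE lets the same pattern match regardless of case.
with_flag = re.search("check", "PLEASE CHECK THIS", re.IGNORECASE)

print(exact.group())      # check
print(no_flag)            # None
print(with_flag.group())  # CHECK
```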

Metacharacters or Special Characters

As the name suggests, there are some characters with special meanings:

Character    Meaning
.            Dot - It matches any single character except the newline character.
^            Caret - It matches the pattern at the start of the string. (Starts with)
$            Dollar - It matches at the end of the string, or just before a newline at the end of the string. (Ends with)
*            Asterisk - It matches zero or more occurrences of the preceding pattern.
+            Plus - It matches one or more occurrences of the preceding pattern.
?            Question mark - It matches zero or one occurrence of the preceding pattern.
{}           Curly braces - It matches exactly the specified number of occurrences of the preceding pattern.
[]           Brackets - It defines a set of characters to match.
|            Pipe - It matches either of the two patterns it separates.
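The metacharacters listed above can be tried out with re.findall(). The following is only a sketch: every pattern and sample string is made up for illustration, and the expected result is stored next to each one.

```python
import re

# Each entry: (pattern, text, expected result of re.findall)
samples = [
    (r"c.t", "cat cot ct", ["cat", "cot"]),           # . any single character
    (r"^Hello", "Hello world", ["Hello"]),            # ^ start of string
    (r"world$", "Hello world", ["world"]),            # $ end of string
    (r"ab*", "a ab abbb", ["a", "ab", "abbb"]),       # * zero or more
    (r"ab+", "a ab abbb", ["ab", "abbb"]),            # + one or more
    (r"colou?r", "color colour", ["color", "colour"]),  # ? zero or one
    (r"\d{3}", "12 345 6789", ["345", "678"]),        # {n} exactly n times
    (r"[aeiou]", "regex", ["e", "e"]),                # [] character set
    (r"cat|dog", "cat and dog", ["cat", "dog"]),      # | either pattern
]

results = [re.findall(pattern, text) for pattern, text, _ in samples]

for (pattern, text, _), found in zip(samples, results):
    print(f"{pattern!r:12} on {text!r:16} -> {found}")
```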

Special Sequences:

Matching different sets of characters is the first thing regular expressions can do that is not already possible with string methods. If that were their only additional capability, however, regexes would not be much of an improvement. Another capability is specifying that portions of the RE must be repeated a certain number of times.

The first metacharacter we'll examine for repetition is *. Rather than matching the literal character '*', * specifies that the preceding character can be matched zero or more times, instead of exactly once.

ba*t, for example, matches 'bt' (zero 'a' characters), 'bat' (one 'a' character), 'baaat' (three 'a' characters), and so on.

Repetitions such as * are greedy: the matching engine tries to repeat the RE as many times as possible. If later portions of the pattern fail to match, the engine backs up and retries with fewer repetitions.
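The behaviour described above can be checked with a short sketch. The word list and the 'aaa' string are hypothetical; the non-greedy form *? is added here only as a contrast and is not discussed in the text above.

```python
import re

# ba*t: 'b', then zero or more 'a', then 't'.
words = ["bt", "bat", "baaat", "bot"]
matched = [w for w in words if re.fullmatch(r"ba*t", w)]
print(matched)  # ['bt', 'bat', 'baaat']

# Greedy: a* first takes all three 'a's, then gives one back so that
# the final literal 'a' in the pattern can still match.
greedy = re.match(r"a*a", "aaa")
print(greedy.group())  # aaa

# For contrast, the non-greedy form a*? takes as few as possible.
lazy = re.match(r"a*?a", "aaa")
print(lazy.group())  # a
```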

Special Sequences consist of '\' followed by a character listed below. Each character has a different meaning.

Character    Meaning
\d           It matches any digit; for ASCII text this is equivalent to [0-9].
\D           It matches any non-digit character; for ASCII text this is equivalent to [^0-9].
\s           It matches any whitespace character; for ASCII text this is equivalent to [ \t\n\r\f\v].
\S           It matches any character except whitespace; for ASCII text this is equivalent to [^ \t\n\r\f\v].
\w           It matches any word character (letters, digits and the underscore); for ASCII text this is equivalent to [a-zA-Z0-9_].
\W           It matches any character that is not a word character; for ASCII text this is equivalent to [^a-zA-Z0-9_].
\A           It matches the defined pattern only at the start of the string.
\b           r"\bxt" - It matches the pattern at the beginning of a word in a string. r"xt\b" - It matches the pattern at the end of a word in a string.
\B           This is the opposite of \b: it matches only where there is no word boundary.
\Z           It matches the defined pattern only at the end of the string.
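A brief sketch of the special sequences in use. The two sample strings are assumptions made up for illustration.

```python
import re

text = "Room 42 is free_today\tat 9am"

digits = re.findall(r"\d+", text)        # runs of digits
non_digits = re.findall(r"\D+", "a1b2")  # runs of non-digits
spaces = re.findall(r"\s", text)         # single whitespace characters
words = re.findall(r"\w+", text)         # runs of word characters

# Anchors and word boundaries, on a second sample.
phrase = "tomorrow into town"
at_start = re.search(r"\Atomorrow", phrase)  # only at the very start
at_end = re.search(r"town\Z", phrase)        # only at the very end
word_start = re.findall(r"\bto\w*", phrase)  # words beginning with "to"
word_end = re.findall(r"\w*to\b", phrase)    # words ending with "to"
inside = re.findall(r"\Bto", phrase)         # "to" not at a word start

print(digits, non_digits, words)
print(word_start, word_end, inside)
```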

RegEx Functions:

  • compile - It is used to turn a regular expression pattern into a regular expression object that can then be used in a number of ways to match patterns in a string.
  • search - It is used to find the first occurrence of a regex pattern in a given string.
  • match - It starts matching the pattern at the beginning of the string.
  • fullmatch - It is used to match the whole string with a regex pattern.
  • split - It is used to split the string by the occurrences of the regex pattern.
  • findall - It is used to find all non-overlapping patterns in a string. It returns a list of matched patterns.
  • finditer - It returns an iterator that yields match objects.
  • sub - It returns a string after substituting occurrences of the pattern with the replacement (all of them by default).
  • subn - It works the same as 'sub', but returns a tuple (new_string, num_of_substitutions).
  • escape - It is used to escape special characters in a pattern.
  • purge - It is used to clear the regex expression cache.

1. re.compile(pattern, flags=0)

It is used to create a regular expression object that can be used to match patterns in a string.

Example:

Output:

Match Object: 

The sequence

re_obj = re.compile(pattern)
result = re_obj.search(string)

is equivalent to:

result = re.search(pattern, string)

Note - When the same regular expression is used several times, the re.compile() version of the program is more efficient, because the pattern is compiled once and the resulting object is reused.
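A minimal sketch of reusing a compiled pattern. The pattern, flag and sample lines are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern: the word "python", in any letter case.
word_re = re.compile(r"\bpython\b", re.IGNORECASE)

lines = ["Learn Python today", "No snakes here", "python is fun"]

# The compiled object is reused for every line.
hits = [line for line in lines if word_re.search(line)]
print(hits)  # ['Learn Python today', 'python is fun']

# The module-level call with the same pattern and flag gives the same result.
same = [line for line in lines
        if re.search(r"\bpython\b", line, re.IGNORECASE)]
```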

2. re.match(pattern, string, flags=0)

  • It starts matching the pattern from the beginning of the string.
  • Returns a match object if any match is found with information like start, end, span, etc.
  • Returns None if no match is found.

Parameters

  • pattern: This is the regular expression to be matched.
  • string: This is the string that will be compared to the pattern, starting from the beginning of the string.
  • flags: Optional modifiers. Multiple flags can be combined using bitwise OR (|).

Example:

Output:


Span: (0, 5)
Start: 0
End: 5
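One way to obtain span, start and end values like those above is sketched below. The string 'Hello world' and the pattern are assumptions, not taken from the original listing.

```python
import re

text = "Hello world"  # assumed sample string
m = re.match(r"Hello", text)

if m:
    print("Span:", m.span())    # Span: (0, 5)
    print("Start:", m.start())  # Start: 0
    print("End:", m.end())      # End: 5

# The same function does not find a word that occurs later in the string.
late = re.match(r"world", text)
print(late)  # None
```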

Another example of the implementation of the re.match() method in Python.

  • In the patterns ".w*" and ".w*?", the dot matches any single character and "w*" matches zero or more occurrences of the literal letter "w" (the trailing "?" makes the repetition non-greedy). This differs from "\w*", which matches a run of word characters.
  • The for loop is used in this Python re.match() illustration to check for a match against every element in the list of words.

CODE:

Output:

There isn't any match!!
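A sketch of how the pattern fragments ".w*" and ".w*?" behave under re.match(). The sample strings are hypothetical, and this is not a reconstruction of the original listing.

```python
import re

# '.' takes one character, then w* takes every following 'w'.
greedy = re.match(r".w*", "awww!")
print(greedy.group())  # awww

# The non-greedy form stops as early as possible.
lazy = re.match(r".w*?", "awww!")
print(lazy.group())  # a

# With a backslash, \w* matches a run of word characters instead.
word = re.match(r"\w*", "awww!")
print(word.group())  # awww

# After 'L' and zero 'w' characters the pattern needs a space,
# but the next character is 'e', so match() returns None.
no_match = re.match(r".w* ", "Learn Python")
if no_match is None:
    print("There isn't any match!!")
```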

3. re.search(pattern, string, flags=0)

The re.search() function looks for the first occurrence of a regular expression pattern and returns it. Unlike Python's re.match(), it scans the whole of the supplied string rather than only its beginning. If the pattern is matched, re.search() returns a match object; otherwise, it returns None.

To use the search() function, we must first import the Python re module. The pattern and the string to be searched are then passed to the re.search() call.

Here is the description of the parameters -

pattern: This is the regular expression to be matched.

string: This is the string that will be searched for the pattern anywhere within it.

flags: Optional modifiers such as re.IGNORECASE or re.MULTILINE. Multiple flags can be combined using bitwise OR (|).

Code

Output:

search object group :   Python through tutorials on javatpoint
search object group 1 :  on
search object group 2 :  javatpoint
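One pattern that yields the three groups shown above is sketched below. The pattern, the flags and the full input sentence are assumptions; only the output is given in the text.

```python
import re

line = "Learn Python through tutorials on javatpoint"

# Assumed pattern: a space, a greedy run, then two space-separated groups.
# Backtracking leaves the last two words in groups 1 and 2.
search_object = re.search(r" .*t? (.*t?) (.*t?)", line, re.M | re.I)

if search_object:
    print("search object group : ", search_object.group())
    print("search object group 1 : ", search_object.group(1))
    print("search object group 2 : ", search_object.group(2))
else:
    print("There isn't any match!!")
```

Note that the whole match begins with the space after 'Learn', because search() takes the leftmost position at which the pattern can match.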

4. re.sub(pattern, repl, string, count=0, flags=0)

  • It substitutes matches of the pattern in the string with 'repl'.
  • pattern - The regex pattern to be matched.
  • repl - Stands for "replacement"; it is what replaces each match of the pattern in the string.
  • count - Controls the maximum number of substitutions. The default of 0 replaces all occurrences.

Example 1:

Output:

Original text: I like Javatpoint!
Substituted text:  I love Javatpoint!

In the above example, the sub-function replaces the 'like' with 'love'.
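A minimal sketch that produces the output above, assuming the pattern is simply the literal word 'like':

```python
import re

text = "I like Javatpoint!"

# Replace every occurrence of 'like' with 'love'.
new_text = re.sub(r"like", "love", text)

print("Original text:", text)
print("Substituted text: ", new_text)
```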

Example 2 - Substituting 3 occurrences of a pattern.

Output:

Original text: I like Javatpoint! I also like tutorials!
Substituted text: I Like Javatpoint! I aLso Like tutorials!

Here, the first three occurrences of 'l' are substituted with 'L'.
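A sketch of the call behind this example, assuming the count argument is set to 3:

```python
import re

text = "I like Javatpoint! I also like tutorials!"

# count=3 limits the replacement to the first three matches of 'l';
# the 'l' in 'tutorials' is left untouched.
new_text = re.sub(r"l", "L", text, count=3)

print("Original text:", text)
print("Substituted text:", new_text)
```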

5. re.subn(pattern, repl, string, count=0, flags=0)

  • subn works the same way as the sub function.
  • It returns a tuple (new_string, num_of_substitutions) instead of just the new string.

Example:

Output:

Original text: I like Javatpoint! I also like tutorials!
Substituted text: ('I Like Javatpoint! I aLso Like tutorials!', 3)

In the above program, the subn function replaces the first three occurrences of 'l' with 'L' in the string.
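The same replacement written with subn, as a sketch assuming count=3:

```python
import re

text = "I like Javatpoint! I also like tutorials!"

# subn returns both the new string and the number of substitutions made.
result = re.subn(r"l", "L", text, count=3)
new_text, num_substitutions = result

print("Original text:", text)
print("Substituted text:", result)
```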

6. re.fullmatch(pattern, string, flags=0)

  • It matches the whole string with the pattern.
  • Returns a corresponding match object.
  • Returns None in case no match is found.
  • On the other hand, the search() function will only search the first occurrence that matches the pattern.

Example:

Output:

None

In the above program, only 'Hello world' matches the pattern completely; 'Hello' on its own does not, so None is returned for it.
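A sketch of this behaviour, assuming the pattern is the literal text 'Hello world':

```python
import re

pattern = r"Hello world"  # assumed pattern

full = re.fullmatch(pattern, "Hello world")  # whole string matches
partial = re.fullmatch(pattern, "Hello")     # only part of the pattern is present

print(full is not None)  # True
print(partial)           # None

# By contrast, search() succeeds when the pattern occurs anywhere,
# while fullmatch() with the same pattern fails on the longer string.
found = re.search(r"Hello", "Hello world")
not_full = re.fullmatch(r"Hello", "Hello world")
```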

Q. When to use re.findall()?

Ans. Suppose we have a line of text and want to get all of the occurrences of a pattern in it. In that case we use Python's re.findall() function, which searches the entire content provided to it and returns every non-overlapping match as a list.
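A short sketch of re.findall(), using a made-up line of text and a made-up pattern for phone-like numbers:

```python
import re

text = "Call 555-1234 or 555-9876 before 5pm"

# Every non-overlapping match is returned as a string in a list.
numbers = re.findall(r"\d{3}-\d{4}", text)
print(numbers)  # ['555-1234', '555-9876']

# If the pattern has groups, findall returns tuples of the groups instead.
pairs = re.findall(r"(\d{3})-(\d{4})", text)
print(pairs)  # [('555', '1234'), ('555', '9876')]
```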

7. re.finditer(pattern, string, flags=0)

  • Returns an iterator that yields all non-overlapping matches of pattern in a string.
  • String is scanned from left to right.
  • Matches are returned in the order in which they are found.
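A minimal sketch of iterating over matches with re.finditer(). The text and pattern are hypothetical.

```python
import re

text = "cat bat rat"

# Each item yielded by finditer is a match object.
found = []
for m in re.finditer(r"[cbr]at", text):
    print(m.group(), m.span())
    found.append((m.group(), m.span()))
```

Unlike findall(), which returns only the matched strings, each match object here also carries position information such as span().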


8. re.split(pattern, string, maxsplit=0, flags=0)

  • It splits the string by the occurrences of the pattern.
  • If maxsplit is zero (the default), all possible splits are made.
  • If maxsplit is one, the string is split only at the first occurrence of the pattern, and the remainder of the string is returned as the final element of the list.

Example:

Output:

When maxsplit = 0, result: ['Learn', 'Python', 'through', 'tutorials', 'on', 'javatpoint']
When maxsplit = 1, result = ['Learn', 'Python through tutorials on javatpoint']
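A sketch that gives the results above, assuming the string is split on whitespace:

```python
import re

text = "Learn Python through tutorials on javatpoint"

# maxsplit=0 (the default): split at every match of the pattern.
all_parts = re.split(r"\s+", text)

# maxsplit=1: split once, keep the rest of the string as the last element.
one_split = re.split(r"\s+", text, maxsplit=1)

print("When maxsplit = 0, result:", all_parts)
print("When maxsplit = 1, result =", one_split)
```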

9. re.escape(pattern)

  • It escapes the special characters in the pattern.
  • The escape function becomes important when the string to be matched contains regular expression metacharacters.

Example:

Output:

Result: https://www\.javatpoint\.com/

The escape function escapes the metacharacter '.' in the pattern. This is useful when we want to treat metacharacters as regular characters and match the actual characters themselves.
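A sketch of why escaping matters, using the URL from the output above. The lookalike string is made up for the comparison.

```python
import re

url = "https://www.javatpoint.com/"
escaped = re.escape(url)
print("Result:", escaped)

# Used unescaped as a pattern, each '.' matches any character,
# so a lookalike string is accepted as well.
lookalike = "https://wwwXjavatpointYcom/"
loose = re.fullmatch(url, lookalike)

# The escaped pattern only accepts literal dots.
strict = re.fullmatch(escaped, lookalike)
exact = re.fullmatch(escaped, url)

print(loose is not None)  # True
print(strict)             # None
print(exact is not None)  # True
```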

10. re.purge()

  • The purge function takes no arguments; it simply clears the regular expression cache.

Example:


  • First, pattern1 and pattern2 are used to search for matches in the string '123abc'.
  • Then the cache is cleared using re.purge().
  • pattern1 and pattern2 are used again to search for matches in the string '456def'.
  • Because the regular expression cache has been cleared, the regular expressions are recompiled, and the search in '456def' is performed with new regular expression objects.
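The steps above can be sketched as follows. The two patterns are assumptions (digits and lowercase letters), passed as strings so that the module-level cache is used.

```python
import re

pattern1 = r"\d+"     # assumed: one or more digits
pattern2 = r"[a-z]+"  # assumed: one or more lowercase letters

# First use: the patterns are compiled and stored in the module cache.
before = (re.search(pattern1, "123abc").group(),
          re.search(pattern2, "123abc").group())

# Clear the regular expression cache.
re.purge()

# Second use: the patterns are compiled again before searching.
after = (re.search(pattern1, "456def").group(),
         re.search(pattern2, "456def").group())

print(before)  # ('123', 'abc')
print(after)   # ('456', 'def')
```

Clearing the cache does not change any results; it only affects whether a previously compiled pattern can be reused internally.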

Matching Versus Searching - re.match() vs. re.search()

Python has two primary regular expression functions: match and search. The match function looks for a match only where the string starts, whereas the search function looks for a match everywhere in the string.

CODE:

Output:

There isn't any match!!
Search object group :  

The match function checks whether the string starts with 'through', and the search function checks whether 'through' occurs anywhere in the string.
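A sketch of the comparison, assuming the sample sentence used elsewhere in this article:

```python
import re

line = "Learn Python through tutorials on javatpoint"

# match() only looks at the beginning of the string.
match_object = re.match(r"through", line)
if match_object:
    print("Match object group : ", match_object.group())
else:
    print("There isn't any match!!")

# search() scans the whole string.
search_object = re.search(r"through", line)
if search_object:
    print("Search object group : ", search_object.group())
else:
    print("There isn't any match!!")
```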

CONCLUSION

The re module in Python supports regular expressions. Regular expressions are an advanced tool for text processing and pattern matching. Using the re module, we can find patterns in text strings and split or replace text based on those patterns, among other things.

However, using the re module isn't always a good idea. If we're only searching for a fixed string or a single character class, and not using any re features such as the IGNORECASE flag, the full power of regular expressions is not needed. Strings offer several methods for working with fixed strings, and they're generally much faster than the larger, more generalized regular expression engine, because the implementation is a single small C loop that has been optimized for the job.
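For fixed strings, the built-in string methods are often enough. The following sketch uses a made-up sentence and compares one result with its regex equivalent.

```python
import re

text = "Learn Python through tutorials"

has_word = "Python" in text                 # substring test
starts = text.startswith("Learn")           # prefix test
position = text.find("through")             # index of first occurrence
replaced = text.replace("Python", "regex")  # fixed-string replacement

# The regex version finds the same position, but with more machinery.
regex_position = re.search(r"through", text).start()

print(has_word, starts, position, replaced)
```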





