Regular Expressions with Python: Look ahead and Look behind

These two important tools trip me up every time I run into a situation that needs them, because I can never remember how they work. Blame my memory.
So it's time I cemented this idea somewhere, and there is no better place than a blog page.

LOOK AHEAD

LOOK AHEAD means: I want to see what lies ahead first, and then decide whether to do something now.


Syntax of regex pattern:  pattern1(?=pattern2)


Where pattern1 is the pattern for the part that we ACTUALLY want to capture, and pattern2 is the pattern that MUST be found immediately after it. Pattern2 is only checked; it does not become part of the match. Logically speaking,

IF PATTERN2 is FOUND, then print/get/capture/show PATTERN1


Example: Let there be a string "Hello World"

Aim: I want to find 'Hello' ...ONLY IF it's followed by ' World'



>>> import re
>>> re.search(r'\w+(?= World)', 'Hello World').group()

Result : 'Hello'

Explanation:

  •  The \w+ is Pattern1
  •  The ' World' (with the leading space) is Pattern2
  •  r'\w+(?= World)' means: find anything which fits \w+ IF it is immediately followed by ' World'

Rules:
  • Pattern2, together with the ?= marker, needs to be in parentheses.
  • If pattern2 itself contains a literal parenthesis, it needs to be escaped or bracketed, i.e. \( or [(] and \) or [)]
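
A minimal sketch of what the rules above mean in practice, using only the standard re module. The strings 'Hello Mars' and 'print(x)' are made up for illustration; they are not from the example above.

```python
import re

# The lookahead is checked but not consumed: the match stops before ' World'.
m = re.search(r'\w+(?= World)', 'Hello World')
print(m.group(), m.span())

# When the text ahead is something else, there is no match at all.
no_match = re.search(r'\w+(?= World)', 'Hello Mars')
print(no_match)

# A literal parenthesis inside the lookahead is escaped here with a backslash.
paren = re.search(r'\w+(?=\()', 'print(x)')
print(paren.group())
```

The span shows that only 'Hello' is part of the match, even though ' World' had to be present for it to succeed.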

LOOK BEHIND


It's just the opposite of the above. It means: I want to see what lies behind me, and then decide whether to do something.



Syntax of regex pattern:  (?<=pattern1)pattern2


Where pattern2 is the pattern for the part that we ACTUALLY want to capture, and pattern1 is the pattern that MUST be found immediately before it.


IF PATTERN1 is FOUND, then print/get/capture/show PATTERN2


Example : We will take the same example. 'Hello World'

Aim : I want to find 'World' ONLY if it is preceded by 'Hello ' (The space here also counts)

>>> re.search(r'(?<=Hello )\w+', 'Hello World').group()

Result : 'World'

Rules: 
  • Pattern1, together with the ?<= marker, needs to be in parentheses.
  • If pattern1 itself contains a literal parenthesis, it needs to be escaped or bracketed, i.e. \( or [(] and \) or [)]
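
One more thing worth keeping in mind with look behind: Python's re module only accepts lookbehind patterns of a fixed width, so something like 'Hello +' is rejected when the pattern is compiled. A small sketch of both points, using only the standard re module (the variable names are just for illustration):

```python
import re

# The text behind is checked but not included in the match.
m = re.search(r'(?<=Hello )\w+', 'Hello World')
print(m.group(), m.span())

# A variable-width lookbehind (note the '+') fails at compile time.
try:
    re.compile(r'(?<=Hello +)\w+')
    fixed_width_error = None
except re.error as exc:
    fixed_width_error = str(exc)
print(fixed_width_error)
```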


LOOK AHEAD & LOOK BEHIND COMBINED

Match something only if it is surrounded by the things we want.



Syntax of regex pattern: (?<=pattern_behind)pattern_middle(?=pattern_ahead)


Where pattern_middle is the pattern for the part that we ACTUALLY want to capture. Pattern_behind and Pattern_ahead are patterns which need to be found as MANDATORY.


IF PATTERN_AHEAD AND IF PATTERN_BEHIND are BOTH found, consume PATTERN_MIDDLE


Example: We will take a new example, 'Hello My World'

Aim: I want to find any word that occurs in between Hello & World


>>> re.search(r'(?<=Hello )\w+(?= World)', 'Hello My World').group()

Result : 'My'
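
The same pattern also works with re.findall when the string has more than one candidate. A short sketch, where the longer sample strings are invented for illustration:

```python
import re

pattern = r'(?<=Hello )\w+(?= World)'

# Every word that sits between 'Hello ' and ' World' is returned.
words = re.findall(pattern, 'Hello My World, Hello Big World')
print(words)

# If either side is missing, nothing matches.
missing_ahead = re.search(pattern, 'Hello My Friend')
print(missing_ahead)
```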

More examples:


Problem : Replace every run of special symbols that comes between two alphanumeric characters with a single space.
Target string: 'This$#is% Matrix#  %!'


>>> re.sub(r'(?<=[a-zA-Z0-9])[$#@%^\s]+(?=[a-zA-Z0-9])', ' ', 'This$#is% Matrix#  %!')

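Explanation: For an alphanumeric character, we use [a-zA-Z0-9].

To find the special characters between them, we use a combo of look ahead and look behind. Symbols that are not followed by an alphanumeric character, like the trailing '#  %!', are left untouched.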
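
Running the substitution on the target string shows which runs of symbols get replaced. The variable names below are only for illustration:

```python
import re

target = 'This$#is% Matrix#  %!'
pattern = r'(?<=[a-zA-Z0-9])[$#@%^\s]+(?=[a-zA-Z0-9])'

# '$#' and '% ' sit between alphanumeric characters, so each becomes one space.
# The trailing '#  %!' has no alphanumeric character after it, so it stays.
cleaned = re.sub(pattern, ' ', target)
print(cleaned)  # This is Matrix#  %!
```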
