How to extract Dutch zip code with regex (Python) - python

Assume I have the following list:
[4486AE Capelle aan de Ijsel, 4706TR Amsterdam]
I would like to extract the zip code for each element.
The desired output is:
[4486AE, 4706TR]
I tried to find a regular expression for Dutch zip codes in Python. However, I only found a JavaScript expression. This is what I tried so far:
import re
test = '4706TR Amsterdam'
match = re.search(r"/^(?:NL-)?(\d{4})\s*([A-Z]{2})$/i", test)
print(match)
This gives me an empty result. Here is where I got the expression from: https://rgxdb.com/r/4W9GV8AC
Does anyone have an idea how to solve this? Other SO posts do not cover a Python expression for Dutch zip codes.

The pattern that you tried, /^(?:NL-)?(\d{4})\s*([A-Z]{2})$/i, uses JavaScript regex literal notation.
The leading and trailing / are the pattern delimiters, ^ and $ are the anchors to assert the start and end of the string and the /i flag is for a case insensitive match.
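As a quick check of the point above, this minimal sketch shows that Python's re module treats the JavaScript delimiters as literal characters, so the pasted pattern cannot match. The variable names and the second test string '4706 tr' are made up for illustration.

```python
import re

test = '4706TR Amsterdam'

# Pasted JavaScript literal: "/" and "/i" are ordinary characters to Python,
# and "^" cannot match once a literal "/" has been consumed.
js_style = re.search(r"/^(?:NL-)?(\d{4})\s*([A-Z]{2})$/i", test)
print(js_style)  # None

# Same pattern in Python notation: no delimiters, flag passed as an argument.
py_pattern = re.compile(r"^(?:NL-)?(\d{4})\s*([A-Z]{2})$", re.IGNORECASE)
print(py_pattern.search(test))  # None: "$" fails because " Amsterdam" follows
print(py_pattern.search('4706 tr').groups())  # ('4706', 'tr')
```

Even in Python notation the anchored pattern only works when the zip code is the entire string.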
In Python, to get the match in your question, you can match 4 digits followed by 2 uppercase chars A-Z between word boundaries \b to prevent partial matches. The anchors ^ and $ do not work here because the zip code is not the whole string.
Case insensitive matching can be done using re.IGNORECASE.
re.search returns None when there is no match, so first check that it returned a match object and then use .group() to get the matched text.
import re
test = '4706TR Amsterdam'
match = re.search(r"\b\d{4}[A-Z]{2}\b", test, re.IGNORECASE)
if match:
    print(match.group())
Output
4706TR
If you want to match an optional NL- part, the pattern can be:
\b(?:NL-)?\d{4}[A-Z]{2}\b
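Applied to the list from the question, the pattern above could be used roughly like this. This is a sketch, not the only way to do it: the names ZIP_RE and extract_zip are made up for illustration, and the list is written with the quotes that the question left out.

```python
import re

# Pattern from the answer above, compiled once and reused.
ZIP_RE = re.compile(r"\b(?:NL-)?\d{4}[A-Z]{2}\b", re.IGNORECASE)

addresses = ["4486AE Capelle aan de Ijsel", "4706TR Amsterdam"]

def extract_zip(address):
    """Return the first zip code found in address, or None if there is none."""
    match = ZIP_RE.search(address)
    return match.group() if match else None

zip_codes = [extract_zip(a) for a in addresses]
print(zip_codes)  # ['4486AE', '4706TR']
```

Returning None for entries without a zip code keeps the output list aligned with the input list; filter the None values out if you only want the matches.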

Related

Python matching dashes using Regular Expressions

I am currently new to Regular Expressions and would appreciate if someone can guide me through this.
import re
some = "I cannot take this B01234-56-K-9870 to the house of cards"
I have the above string and am trying to extract the string with dashes (B01234-56-K-9870) using a Python regular expression. I have the following code so far:
regex = r'\w+-\w+-\w+-\w+'
match = re.search(regex, some)
print(match.group()) #returns B01234-56-K-9870
Is there any simpler way to extract the dash pattern using regular expression? For now, I do not care about the order or anything. I just wanted it to extract string with dashes.
Try the following regex (as shortened by The fourth bird),
\w+-\S+
Original regex: (?=\w+-)\S+
Explanation:
\w+- matches 1 or more word characters followed by a -
\S+ matches non-space characters
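A short sketch of the shortened pattern in use. The second input string and the stricter alternative \w+(?:-\w+)+ are not from the answer; they are added here only to show that \S+ also picks up trailing punctuation.

```python
import re

some = "I cannot take this B01234-56-K-9870 to the house of cards"

match = re.search(r"\w+-\S+", some)
if match:
    print(match.group())  # B01234-56-K-9870

# Made-up input: \S+ keeps going until whitespace, so the comma is included.
sample = "ids: A-1 and B-22-C, done"
loose = re.findall(r"\w+-\S+", sample)
print(loose)  # ['A-1', 'B-22-C,']

# Assumed stricter variant: word characters joined by one or more dashes.
strict = re.findall(r"\w+(?:-\w+)+", sample)
print(strict)  # ['A-1', 'B-22-C']
```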

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/. I am aware that the demo tests \author and not \maketitle.
In python, I try the following:
import re
# A string that spans several lines needs triple quotes
text = str(r'''
\author{
\small
}
\maketitle
''')
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect that this has something to do with my pattern and that the SRE engine fails on it.
EDIT: Is there something wrong with my regex? Can it be improved to actually work? Still I get the same results on my machine.
EDIT 2: Can this be fixed somehow so that the pattern supports an optional "followed by" part, using ?: groups or ?= look-aheads, so that one can capture both?
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
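In Python, the quoted behaviour looks like the sketch below. Note that a group which did not take part in the match returns None rather than an empty string. The second half is a hypothetical adaptation to the question, not the asker's final pattern: it assumes a command name followed by an optional {...} argument with at most one level of nested braces, and the name command is made up.

```python
import re

# The example from the quoted text: group 1 is optional.
pattern = re.compile(r"Set(Value)?")
print(pattern.search("Set").group(1))       # None (group did not participate)
print(pattern.search("SetValue").group(1))  # Value

# Hypothetical adaptation: command name, then an optional braced argument.
# Each alternative inside the repeated group consumes at least one character,
# which avoids a nested-quantifier construct like (?:[^{}]*|...)*.
command = re.compile(r"\\(author|maketitle)\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\})?")

text = r"""
\author{
\small
}
\maketitle
"""

found = [(m.group(1), m.group(2)) for m in command.finditer(text)]
print(found)  # [('author', '\n\\small\n'), ('maketitle', None)]
```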

Why isn't this regex matching a string with percentage symbol?

I have a file which has the following input :
xa%1bc
ba%1bc
.
.
and so on. I want to use match and regex to identify the lines which have a%1b in them.
I am using
import re
p1 = re.compile(r'\ba%1b\b', flags=re.I)
if re.match(p1, linefromfile):
    continue
It doesn't seem to detect the line with %1. What is the issue? Thanks
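match only searches for the pattern at the beginning of the string. If you want to find out whether a string contains a pattern, use search instead. You also need to drop the word boundaries \b: in xa%1bc the a is preceded by x and the b is followed by c, so there is no word boundary at either position and the pattern cannot match: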
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
import re
if re.search(r"a%1b", "xa%1bc"):
    print("hello")
# hello
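To make the difference concrete, here is a small side-by-side sketch. The list lines stands in for the file contents and the third entry is made up; the variable names are illustrative only.

```python
import re

lines = ["xa%1bc", "ba%1bc", "nothing here"]
pattern = re.compile(r"a%1b", flags=re.I)

# match() only tries position 0, search() scans the whole string.
results = [(line, bool(pattern.match(line)), bool(pattern.search(line)))
           for line in lines]
for row in results:
    print(row)

# With the word boundaries from the question, even search() finds nothing,
# because "a" follows "x" and "b" precedes "c" (no boundary on either side).
with_boundaries = re.compile(r"\ba%1b\b", flags=re.I)
print(with_boundaries.search("xa%1bc"))  # None
```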
You can try
if 'a%1b' in linefromfile:
OR
if you need regex
if re.search('a%1b', linefromfile):

regular expression match issue in Python

For an input string, I want to match text which starts with {(P) and ends with (P)}, and I just want to capture the part in the middle. Can one regular expression do this?
For example, for the following input string, I want to retrieve the hello world part. I am using Python 2.7.
python {(P)hello world(P)} java
You can try {\(P\)(.*)\(P\)}, and use parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches unicode characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68
You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
See the demo.
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns can be less efficient because the engine has to test the rest of the pattern at each step, so if you know that there will be no ( characters in the match, it may be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
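The patterns from both answers can be compared on an input with more than one tagged segment, which is where greedy and lazy matching differ. This is a sketch written for Python 3 (the question uses 2.7); the input string is made up, and the braces are escaped here as a precaution, which does not change what the patterns match.

```python
import re

# Made-up input with two tagged segments.
s = "{(P)a(P)} x {(P)b(P)}"

greedy = re.findall(r"\{\(P\)(.*)\(P\)\}", s)
lazy = re.findall(r"\{\(P\)(.*?)\(P\)\}", s)
lookaround = re.findall(r"(?<=\{\(P\)).*?(?=\(P\)\})", s)
negated = re.findall(r"(?<=\{\(P\))[^(]*(?=\(P\)\})", s)

print(greedy)      # ['a(P)} x {(P)b']
print(lazy)        # ['a', 'b']
print(lookaround)  # ['a', 'b']
print(negated)     # ['a', 'b']
```

With a single tagged segment, as in the question, all four give the same result.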
You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
Like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
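The leading \s* is optional, so you can remove it, and the capture groups can go as well. With capture groups in the pattern, findall returns the contents of the groups instead of the whole match, which is why the original did not produce a plain list of identifiers.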
If you want to ensure that the match isn't preceded by a digit, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
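A small comparison of the patterns discussed above, as a sketch. The input strings and variable names are made up for illustration.

```python
import re

line = "x1 = foo + bar2 * 42"

# Pattern from the question: two capture groups, so findall returns tuples
# of (group 1, last repetition of group 2) rather than whole matches.
with_groups = re.compile(r"\s*([a-zA-Z]([a-zA-Z]|\d)*)")
grouped = with_groups.findall(line)
print(grouped)  # [('x1', '1'), ('foo', 'o'), ('bar2', '2')]

# Pattern from the answer: no groups, so findall returns the identifiers.
plain = re.compile(r"[a-zA-Z][a-zA-Z\d]*")
identifiers = plain.findall(line)
print(identifiers)  # ['x1', 'foo', 'bar2']

# The lookbehind variant rejects letters that directly follow a digit.
guarded = re.compile(r"(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*")
print(plain.findall("9abc def"))    # ['abc', 'def']
print(guarded.findall("9abc def"))  # ['def']
```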
