Python matching dashes using Regular Expressions - python

I am new to regular expressions and would appreciate it if someone could guide me through this.
import re
some = "I cannot take this B01234-56-K-9870 to the house of cards"
I have the above string and am trying to extract the token with dashes (B01234-56-K-9870) using a Python regular expression. I have the following code so far:
regex = r'\w+-\w+-\w+-\w+'
match = re.search(regex, some)
print(match.group()) #returns B01234-56-K-9870
Is there any simpler way to extract the dash pattern using regular expression? For now, I do not care about the order or anything. I just wanted it to extract string with dashes.

Try the following regex (as shortened by The fourth bird),
\w+-\S+
Original regex: (?=\w+-)\S+
Explanation:
\w+- matches one or more word characters followed by a -
\S+ matches one or more non-whitespace characters
Regex demo!
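As a quick check, the sketch below runs both patterns on the string from the question. The last sample sentence (with a trailing full stop) is an assumption added for illustration; it shows that \S+ keeps any punctuation attached to the token.

```python
import re

some = "I cannot take this B01234-56-K-9870 to the house of cards"

# Shortened pattern: word characters, a dash, then the rest of the token.
short = re.search(r"\w+-\S+", some)
# Original pattern: the lookahead requires "word chars + dash" where the match starts.
original = re.search(r"(?=\w+-)\S+", some)

print(short.group())
print(original.group())

# Hypothetical input: \S+ also swallows the trailing full stop.
with_dot = re.search(r"\w+-\S+", "Ship B01234-56-K-9870. Today")
print(with_dot.group())
```

If trailing punctuation matters for your data, the stricter \w+(?:-\w+)+ may be a better fit, since it only allows word characters between the dashes.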

Related

How to extract Dutch zip code with regex (Python)

Assume I have the following list:
[4486AE Capelle aan de Ijsel, 4706TR Amsterdam]
I would like to extract the zip code for each element.
The desired output is:
[4486AE, 4706TR]
I tried to find a regular expression for Dutch zip codes in Python. However, I only found a JavaScript expression. This is what I tried so far:
import re
test = '4706TR Amsterdam'
match = re.search(r"/^(?:NL-)?(\d{4})\s*([A-Z]{2})$/i", test)
print(match)
This gives me an empty result. Here is where I got the expression from: https://rgxdb.com/r/4W9GV8AC
Anyone has an idea how to solve this? Other SO posts do not focus on Python expression for Dutch zip codes.
The pattern that you tried, /^(?:NL-)?(\d{4})\s*([A-Z]{2})$/i, uses JavaScript notation.
The leading and trailing / are the pattern delimiters, ^ and $ are anchors that assert the start and end of the string, and the /i flag makes the match case insensitive. Python's re module treats the slashes and the trailing i as literal characters, which is why nothing matches.
In Python, to get the match in your question, you can match 4 digits followed by 2 characters A-Z between word boundaries \b to prevent partial matches. The anchors are dropped because the zip code is not the only content of the string.
Case-insensitive matching can be done with re.IGNORECASE.
re.search returns None when there is no match, so first check the result and only then call .group() to get the match.
import re
test = '4706TR Amsterdam'
match = re.search(r"\b\d{4}[A-Z]{2}\b", test, re.IGNORECASE)
if match:
    print(match.group())
Output
4706TR
See a Python demo
If you want to match an optional NL- part, the pattern can be:
\b(?:NL-)?\d{4}[A-Z]{2}\b
Regex demo
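To cover the list from the question, the same pattern can be applied to each element. This is a minimal sketch: the names ZIP_RE and extract_zip_codes are hypothetical, and the capture group around the digits and letters is an assumption added here so that an optional NL- prefix is left out of the result.

```python
import re

# Optional "NL-" prefix, then 4 digits and 2 letters, bounded by \b.
ZIP_RE = re.compile(r"\b(?:NL-)?(\d{4}[A-Z]{2})\b", re.IGNORECASE)

def extract_zip_codes(addresses):
    """Return the first zip code found in each address, skipping misses."""
    codes = []
    for address in addresses:
        match = ZIP_RE.search(address)
        if match:  # re.search returns None when nothing matches
            codes.append(match.group(1))
    return codes

addresses = ["4486AE Capelle aan de Ijsel", "4706TR Amsterdam"]
print(extract_zip_codes(addresses))
```

Addresses without a zip code are silently skipped here; depending on your data, you may prefer to append None instead so the output stays aligned with the input.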

Why can't I scoop out some ID's of some strings using regex?

I'm trying to scoop out some IDs from some strings. The portion I would like to grab from each string is between id- and ?. The latter is not always present, so I wish to make the ? sign optional. I know I can achieve the same using string manipulation, but I wish to do it using regex.
I've tried with:
import re
content = """
id-HTRY098WE
id-KNGT371WE?witkl
id-ZXV555NQE?phnu
eh-VCBG075LK
"""
for item in re.findall(r'id-(.*)\??',content):
    print(item)
Output it yields:
HTRY098WE
KNGT371WE?witkl
ZXV555NQE?phnu
Expected output:
HTRY098WE
KNGT371WE
ZXV555NQE
How can I scrape ID's out of some strings?
You could use a capturing group with a negated character class that matches any character except a question mark or a whitespace character.
The pattern that you tried first matches until the end of the line using .* (the dot does not match a newline by default). Then, at the end of the line, it tries to match an optional question mark \??. This succeeds because the question mark is optional, so the group captures the whole rest of the line for the first 3 examples.
id-([^?\s]+)
Regex demo | Python demo
For example
import re
content = """
id-HTRY098WE
id-KNGT371WE?witkl
id-ZXV555NQE?phnu
eh-VCBG075LK
"""
for item in re.findall(r'id-([^?\s]+)',content):
    print(item)
Result
HTRY098WE
KNGT371WE
ZXV555NQE
Or match only alphanumerics:
id-([A-Z0-9]+)
Regex demo
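To see the difference side by side, the sketch below runs the original greedy pattern and the two suggested patterns against the same content. The dictionary labels are only illustrative names.

```python
import re

content = """
id-HTRY098WE
id-KNGT371WE?witkl
id-ZXV555NQE?phnu
eh-VCBG075LK
"""

patterns = {
    "greedy (original)": r"id-(.*)\??",   # .* runs to the end of the line
    "negated class": r"id-([^?\s]+)",     # stops at ? or whitespace
    "alphanumerics": r"id-([A-Z0-9]+)",   # stops at the first other character
}

results = {name: re.findall(p, content) for name, p in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
```

The alphanumerics variant assumes the IDs only ever contain uppercase letters and digits, as in the sample data.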

Regex match single characters between strings

I have a string with some markup which I'm trying to parse, generally formatted like this.
'[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
I want to match the asterisks within the [list] tags so I can re.sub them as [**] but I'm having trouble forming an expression to grab them. So far, I have:
match = re.compile('\[list\].+?\[/list\]', re.DOTALL)
This gets everything within the list, but I can't figure out a way to narrow it down to the asterisks alone. Any advice would be massively appreciated.
You may use re.sub with a lambda as the replacement argument. The match object is passed to the lambda, which then applies a plain .replace('*','**') to the matched text.
Here is the sample code:
import re
s = '[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
match = re.compile(r'\[list].+?\[/list]', re.DOTALL)
print(match.sub(lambda m: m.group().replace('*', '**'), s))
# => [*]
# [list][**][**][/list][*]text[list][**][/list]
See the IDEONE demo
Note that a ] outside of a character class does not have to be escaped in Python re regex.
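A slightly stricter variant, sketched below, replaces only whole [*] tokens rather than every asterisk, in case list items can contain a literal *. The names LIST_BLOCK and double_bullets are hypothetical, and the second input string is an assumed example.

```python
import re

# Non-greedy match of one [list]...[/list] block; DOTALL lets . cross newlines.
LIST_BLOCK = re.compile(r"\[list].+?\[/list]", re.DOTALL)

def double_bullets(match):
    """Rewrite [*] as [**] inside one matched [list]...[/list] block."""
    return match.group().replace("[*]", "[**]")

s = "[*]\r\n[list][*][*][/list][*]text[list][*][/list]"
result = LIST_BLOCK.sub(double_bullets, s)
print(repr(result))

# Assumed input: a stray asterisk inside the list text is left alone.
stray = LIST_BLOCK.sub(double_bullets, "[list]a*b[*][/list]")
print(stray)
```

Using a named function instead of a lambda makes no functional difference; it only gives the callback a docstring and a name.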

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
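Note the r before the quote to obtain a raw string; otherwise you need to escape all backslashes.
Since the leading \s* is optional, you can remove it. The capture groups can go as well: when a pattern contains groups, findall returns the group contents (as tuples) instead of the full matches, which is why you don't get a plain list of identifiers.
If you want to ensure that the match isn't preceded by a digit or a letter, you can write it like this with a negative lookbehind (?<!...):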
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
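The effect of the capture groups on findall can be checked with a short sketch. The sample inputs ("x1 = foo + bar42" and "9abc def") are assumptions chosen for illustration.

```python
import re

line = "x1 = foo + bar42"

# With capture groups, findall returns one tuple of groups per match.
grouped = re.findall(r"\s*([a-zA-Z]([a-zA-Z]|\d)*)", line)
# Without groups, findall returns the matched text itself.
plain = re.findall(r"[a-zA-Z][a-zA-Z\d]*", line)

# The lookbehind variant skips letters that directly follow a digit or letter.
guarded = re.compile(r"(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*")

print(grouped)
print(plain)
print(guarded.findall("9abc def"))
```

In the grouped result, the second element of each tuple is only the last character matched by the repeated inner group, which is rarely what a lexer wants.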

Python regex multiline replacement

I searched existing questions but they do not seem to answer this specific question.
I have the following python program
import re
description = """\
before
{cs:id=841398|rep=myrepo}: after
"""
pattern = re.compile(r"(.*)\{cs:id=(.*)\|rep=(.*)\}(.*)")
and I need to replace the regex in the description to look like the below but I can't get the pattern and replacement syntax right
description="""\
before
841398 : after
"""
The crucible.app.com:9090 is a constant that I have beforehand so I basically need to substitute the pattern with my replacement.
Can someone show me what is the best python regex find and replace syntax for this?
There is no need for the first and last (.*) in your pattern. To write back captured groups in the replacement string, use \1 and \2 inside a raw string; without the r prefix, "\1" is the control character \x01 rather than a group reference:
description = re.sub(pattern, r"\1", description)
By the way, another way to improve your pattern (in both performance and robustness) is to make the inner repetitions more explicit so that they cannot accidentally go past the | or }:
pattern = re.compile(r"\{cs:id=([^|]*)\|rep=([^}]*)\}")
You can also use named groups:
pattern = re.compile(r"\{cs:id=(?P<id>[^|]*)\|rep=(?P<rep>[^}]*)\}")
And then in the replacement string:
r"\g<id>"
Use re.sub / RegexObject.sub:
>>> pattern = re.compile(r"{cs:id=(.*?)\|rep=(.*?)}")
>>> description = pattern.sub(r'\1', description)
>>> print(description)
before
841398: after
\1, \2 refer to matched group 1, 2.
I modified the regular expression slightly.
No need to escape {, }.
Removed capturing group before, after {..}.
Used non-greedy match: .*?
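Putting the named-group pattern together with a raw replacement string gives the following minimal sketch. It only reproduces the id substitution from the answers; the variable name result is an assumption.

```python
import re

description = """\
before
{cs:id=841398|rep=myrepo}: after
"""

# Negated classes keep each group from running past | or }.
pattern = re.compile(r"\{cs:id=(?P<id>[^|]*)\|rep=(?P<rep>[^}]*)\}")

# Raw replacement string: \g<id> inserts the text captured by the "id" group.
result = pattern.sub(r"\g<id>", description)
print(result)

# Without the r prefix, "\1" is the single control character \x01.
assert "\1" == "\x01"
```

The rep group is captured but unused here; it is available as \g<rep> if the replacement needs it.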
