Excluding Matches With Regular Expressions - python

Hi I have a pattern that I can identify with python regex but I want to return everything but the match
with this function I can extract the year of the string
def stripyear(string):
p_year = re.compile('[2][0][0-9][0-9][0-1][0-9][0-3][0-9]')
ym = p_year.findall(string)
return ''.join(ym)
so stripyear('UTMSIRGAS200020150409')will return '20150409' but i want it to return
'UTMSIRGAS2000'
i.e. the opposite
I tried this patter '\b(?!([2][0][0-9][0-9][0-1][0-9][0-3][0-9])\b)\w+' and it works if there was a space before the patter in here but python does not like it.
thanks for any help.

Use re.sub to replace with empty string:
re.sub("[2][0][0-9][0-9][0-1][0-9][0-3][0-9]", "", input)

Related

How can I add a string inside a string?

The problem is simple, I'm given a random string and a random pattern and I'm told to get all the posible combinations of that pattern that occur in the string and mark then with [target] and [endtarget] at the beggining and end.
For example:
given the following text: "XuyZB8we4"
and the following pattern: "XYZAB"
The expected output would be: "[target]X[endtarget]uy[target]ZB[endtarget]8we4".
I already got the part that identifies all the words, but I can't find a way of placing the [target] and [endtarget] strings after and before the pattern (called in the code match).
import re
def tagger(text, search):
place_s = "[target]"
place_f = "[endtarget]"
pattern = re.compile(rf"[{search}]+")
matches = pattern.finditer(text)
for match in matches:
print(match)
return test_string
test_string = "alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z "
pattern = "XYZAB"
print(tagger(test_string, pattern))
I also tried the for with the sub method, but I couldn't get it to work.
for match in matches:
re.sub(match.group(0), place_s + match.group(0) + place_f, text)
return text
re.sub allows you to pass backreferences to matched groups within your pattern. so you do need to enclose your pattern in parentheses, or create a named group, and then it will replace all matches in the entire string at once with your desired replacements:
In [10]: re.sub(r'([XYZAB]+)', r'[target]\1[endtarget]', test_string)
Out[10]: 'alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] '
With this approach, re.finditer is not not needed at all.

Regex to capture string if other string present within brackets

I am trying to create a Python regex to capture a file name, but only if the text "external=true" appears within the square brackets after the alleged file name.
I believe I am nearly there, but am missing a specific use-case. Essentially, I want to capture the text between qrcode: and the first [, but only if the text external=true appears between the two square brackets.
I have created the regex qrcode:([^:].*?)\[.*?external=true.*?\], which does not work for the second line below: it incorrectly returns vcard3.txt and does not return vcard4.txt.
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
https://regex101.com/r/bh3IMb/3
As an alternative you can use
qrcode:([\w\.]+)(?=\[[\w\=,]*external=true[^\]]*)
See the regex demo.
Python demo:
import re
regex = re.compile(r"qrcode:([\w\.]+)(?=\[[\w\=,]*external=true[^\]]*)")
sample = """
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
"""
print(regex.findall(sample))
Output:
['vcard1.txt', 'vcard4.txt', 'vcard5.txt']
Using positive look-ahead (for qrcode:) and positive look-behind (for [*external=true with lazy matching to capture the smallest of such groups.
Regex101 explanation: https://regex101.com/r/bOezIm/1
A complete python example:
import re
pattern = r"(?<=qrcode:)[^:]*?(?=\[[^\]]*?external=true)"
string = """
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
"""
print(re.findall(pattern, string))

Python regex: excluding square brackets and the text inside

I am trying to write a regex that excludes square brackets and the text inside them.
My sample text looks like this: 'WordA, WordB, WordC, [WordD]'
I want to match each text item in the string except '[WordD]'. I've tried using a negative lookahead, something like... [A-Z][A-Za-z]+(?!\[[A-Z]+\]) but doing so is still matching the text inside the brackets.
Is negative lookahead the best approach? If so, where am I going wrong?
Rather than a regex, you might consider splitting by commas and then filtering by whether the word starts with [:
output = [word for word in str.split(', ') if word[0] != '[']
If you use a regex, you can match either the beginning of the string, or lookbehind for a space:
re.findall(r'(?:^|(?<= ))[A-Z][A-Za-z]+', str)
Or you could negative lookahead for ] at the end, after a word boundary:
output = re.findall(r'[A-Z][A-Za-z]+\b(?!\])', str)
This can be as simple as
(\w+),
Regex Demo
Retrieve value of Group 1 for desired result.
I'm guessing that maybe you were trying to write some expression similar to:
[A-Z][a-z]*[A-Z](?=,|$)
or,
[A-Z][a-z]+[A-Z](?=,|$)
Test
import re
regex = r"[A-Z][a-z]*[A-Z](?=,|$)"
string = """
WordA, WordB, WordC, [WordD]
WordA, WordB, WordC, [WordD], WordE
"""
print(re.findall(regex, string))
Output
['WordA', 'WordB', 'WordC', 'WordA', 'WordB', 'WordC', 'WordE']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Regex in Python - Substring with single "re.sub" call

I am looking into the Regex function in Python.
As part of this, I am trying to extract a substring from a string.
For instance, assume I have the string:
<place of birth="Stockholm">
Is there a way to extract Stockholm with a single regex call?
So far, I have:
location_info = "<place of birth="Stockholm">"
#Remove before
location_name1 = re.sub(r"<place of birth=\"", r"", location_info)
#location_name1 --> Stockholm">
#Remove after
location_name2 = re.sub(r"\">", r"", location_name1)
#location_name2 --> Stockholm
Any advice on how to extract the string Stockholm, without using two "re.sub" calls is highly appreciated.
Sure, you can match the beginning up to the double quotes, and match and capture all the characters other than double quotes after that:
import re
p = re.compile(r'<place of birth="([^"]*)')
location_info = "<place of birth=\"Stockholm\">"
match = p.search(location_info)
if match:
print(match.group(1))
See IDEONE demo
The <place of birth=" is matched as a literal, and ([^"]*) is a capture group 1 matching 0 or more characters other than ". The value is accessed with .group(1).
Here is a REGEX demo.
print re.sub(r'^[^"]*"|"[^"]*$',"",location_info)
This should do it for you.See demo.
https://regex101.com/r/vV1wW6/30#python
Is there a specific reason why you are removing the rest of the string, instead of selecting the part you want with something like
location_info = "<place of birth="Stockholm">"
location_info = re.search('<.*="(.*)".*>', location_info, re.IGNORECASE).group(1)
this code tested under python 3.6
test = '<place of birth="Stockholm">'
resp = re.sub(r'.*="(\w+)">',r'\1',test)
print (resp)
Stockholm

Allowing escape sequences in my regular expression

I'm trying to create a regular expression which finds occurences of $VAR or ${VAR}. If something like \$VAR or \${VAR} was given, it would not match. If it were given something like \\$VAR or \\${VAR} or any multiple of 2 \'s, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?
The second groupings of double slashes are written as a redundant character class [\\\\]*, matching one or more backslashes, but should be a repeating group ((?:\\\\)*) matching one or more sets of two backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)
To write a regular expression that finds $ unless it is escaped using E unless it in turn is also escaped EE:
import re
values = dict(BLOB='some value')
def repl(m):
return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using E instead of '\\\\' exposes your embed language without thinking about backslashes in Python string literals and regular expression patterns.

Categories

Resources