Python Regular Expression Named Capture Groups - python

I'm learning regular expressions, specifically named capture groups.
Having an issue where I'm not able to figure out how to write an if/else statement for my function findVul().
Basically how the code works or should work is that findVul() goes through data1 and data2, which has been added to the list myDATA.
If the regex finds a match for the entire named group, then it should print out the results. It currently works perfectly.
CODE:
import re
data1 = '''
dwadawa231d .2 vulnerabilities discovered dasdfadfad .One vulnerability discovered 123e2121d21 .12 vulnerabilities discovered sgwegew342 dawdwadasf
2r3232r32ee
'''
data2 = ''' d21d21 .2 vul discovered adqdwdawd .One vulnerability disc d12d21d .two vulnerabilities discovered 2e1e21d1d f21f21
'''
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.finditer(pattern, data)
    for x in match:
        print(x.group())
myDATA = [data1,data2]
count_data = 1
for x in myDATA:
    print('\n--->Reading data{0}\n'.format(count_data))
    count_data+=1
    findVul(x)
OUTPUT:
--->Reading data1
2 vulnerabilities discovered
One vulnerability discovered
12 vulnerabilities discovered
--->Reading data2
Now I want to add an if/else statement to check if there are any matches for the entire named group.
I tried something like this, but it doesn't seem to be working.
CODE:
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.finditer(pattern, data)
    if len(list(match)) != 0:
        print('\nVulnerabilities Found!\n')
        for x in match:
            print(x.group())
    else:
        print('No Vulnerabilities Found!\n')
OUTPUT:
--->Reading data1
Vulnerabilities Found!
--->Reading data2
No Vulnerabilities Found!
As you can see it does not print the vulnerabilities that should be in data1.
Could someone please explain the correct way to do this and why my logic is wrong.
Thanks so much :) !!

The problem is that re.finditer() returns an iterator that is evaluated when you do the len(list(match)) != 0 test; when you iterate over it again in the for-loop, it is already exhausted and there are no items left. The simple fix is just to add a match = list(match) line after the finditer() call.
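A minimal sketch of that fix, assuming the same pattern as in the question. The name find_vul and the returned list are illustrative additions, not part of the original code:

```python
import re

PATTERN = re.compile(
    r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')

def find_vul(data):
    # Materialise the iterator once so it can be tested and then looped over.
    matches = list(PATTERN.finditer(data))
    if matches:
        print('\nVulnerabilities Found!\n')
        for m in matches:
            print(m.group('VUL'))
    else:
        print('No Vulnerabilities Found!\n')
    # Returned only to make the sketch easy to check (an assumption).
    return [m.group('VUL') for m in matches]
```

Because the matches are stored in a list, testing the list for emptiness does not consume anything, and the for-loop still sees every match.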

I did some more research after @AdamKG's response.
I wanted to utilize the re.findall() function.
re.findall() will return a list of all matched substrings. In my case I have capture groups inside of my named capture group. This will return a list with tuples.
For example the following regex with data1:
pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
match = re.findall(pattern, data)
Will return a list with tuples:
[('2 vulnerabilities discovered', '2', 'vulnerabilities'),
 ('One vulnerability discovered', 'One', 'vulnerability'),
 ('12 vulnerabilities discovered', '12', 'vulnerabilities')]
My Final Code for findVul():
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.findall(pattern, data)
    if len(match) != 0:
        print('Vulnerabilities Found!\n')
        for x in match:
            # x[0] is the text matched by the outer named group VUL
            print('--> {0}'.format(x[0]))
    else:
        print('No Vulnerability Found!\n')
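If the tuples are not needed, one possible variation is to make the inner groups non-capturing with (?:...). With exactly one capturing group left, re.findall() returns a plain list of strings. This is only a sketch; the helper name find_vul_strings and the sample text are made up for illustration:

```python
import re

# (?:...) groups do not capture, so VUL is the only capturing group.
pattern = re.compile(
    r'(?P<VUL>(?:\d{1,2}|One)\s+(?:vulnerabilities|vulnerability)\s+discovered)')

def find_vul_strings(data):
    # With a single capturing group, findall returns that group's text
    # for each match instead of a tuple.
    return pattern.findall(data)

sample = 'a .2 vulnerabilities discovered b .One vulnerability discovered'
print(find_vul_strings(sample))
```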

Related

Find values using regex (includes brackets)

It's my first time with regex and I have some issues that I hope you can help me with. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last set of brackets, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description:. This was managed with the following code: [\n\r].*description:\s*([^\n\r]*) The output gives me the result with quotes, "9710", but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex, which would allow me to construct a pattern that retrieves the data in brackets only after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)

python if/else list comprehension

I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
    match = my_regex.search(value)
    if match:
        temp.append(value.replace(match.group(1), ''))
    else:
        temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1),'') if (match) else value for
value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
value.replace(match.group(1), '') if match else value
for value, match in (
(value, my_regex.search(value))
for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value
result = [fix_string(value) for value in my_dataframe[my_col]]
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1),'') if my_regex.search(value) else value for value in my_dataframe[my_col]]
# ^ ^
Note that there is no place for us to define match inside the comprehension, and as a result we have to call my_regex.search(value) twice. This is of course inefficient.
As a result, stick to the for-loop!
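The double call described above is unavoidable on Python versions before 3.8. From Python 3.8 onwards, an assignment expression (the := operator) can bind the match inside the comprehension. A rough sketch, where my_regex and values are hypothetical stand-ins for the asker's regex and my_dataframe[my_col]:

```python
import re

# Hypothetical stand-ins for my_regex and my_dataframe[my_col].
my_regex = re.compile(r'(\s*\(.*?\))')
values = ['alpha (draft)', 'beta', 'gamma (v2) final']

result = [
    # The condition runs first, so match is already bound when the
    # left-hand branch is evaluated; search is called once per value.
    value.replace(match.group(1), '') if (match := my_regex.search(value)) else value
    for value in values
]
print(result)
```

Whether this reads better than the plain for-loop is a matter of taste; the loop or the small helper function is still arguably the clearer choice.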
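Use a regular expression with a subgroup that matches a three-letter word ending in "he" (such as "the" or "she"), followed by whitespace and a word that contains "el" with at least one character on either side (such as "well" or "fell"). The pattern also requires one word before the subgroup and one word after it, and the {1} quantifier applies the subgroup exactly once.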
paragraph="""either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""
import re
sentences=paragraph.split(".")
pattern=r"\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp=[]
for sentence in sentences:
    result=re.findall(pattern,sentence)
    for item in result:
        # item[0] is the outer group, e.g. "the well"
        temp.append("".join(item[0]).replace(' ',''))
print(temp)
output:
['thewell', 'shefell', 'theshelves', 'shefell']

Python3 replace tags based on condition of the type of tag

I want all the tags in a text that look like <Car:1234|Bob Alice> or <Bus:5678|Nelson Mandela> to be replaced with <a my-inner-type="CR:1234">Bob Alice</a> and <a my-inner-type="BS:5678">Nelson Mandela</a> respectively. So basically, depending on the type (TypeA or TypeB), I want to replace the tag accordingly in a text string using Python 3 and regex.
I tried doing the following in python but not sure if that's the right approach to go forward:
import re
def my_replace():
    re.sub(r'\<(.*?)\>', replace_function, data)
With the above, I am trying to do a regex of the< > tag and every tag I find, I pass that to a function called replace_function to split the text between the tag and determine if it is a TypeA or a TypeB and compute the stuff and return the replacement tag dynamically. I am not even sure if this is even possible using the re.sub but any leads would help. Thank you.
Examples:
<Car:1234|Bob Alice> becomes <a my-inner-type="CR:1234">Bob Alice</a>
<Bus:5678|Nelson Mandela> becomes <a my-inner-type="BS:5678">Nelson Mandela</a>
This is perfectly possible with re.sub, and you're on the right track with using a replacement function (which is designed to allow dynamic replacements). See below for an example that works with the examples you give - probably have to modify to suit your use case depending on what other data is present in the text (ie. other tags you need to ignore)
import re
def replace_function(m):
    # note: to not modify the text (ie if you want to ignore this tag),
    # simply do (return the entire original match):
    #     return m.group(0)
    inner = m.group(1)
    t, name = inner.split('|')
    # process type here - the following will only work if types always follow
    # the pattern given in the question
    typename = t[4:]
    # EDIT: based on your edits, you will probably need more processing here
    # eg:
    if t.split(':')[0] == 'Car':
        typename = 'CR'
    # etc
    return '<a my-inner-type="{}">{}</a>'.format(typename, name)
def my_replace(data):
    return re.sub(r'\<(.*?)\>', replace_function, data)
# let's just test it
data = 'I want all the tags in a text that look like <TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela> to be replaced with'
print(my_replace(data))
Warning: if this text is actually full html, regex matching will not be reliable - use an html processor like beautifulsoup. ;)
Probably an extension to @swalladge's answer, but here we take advantage of a dictionary, assuming we know the mapping. (You could replace the dictionary with a custom mapping function.)
import re
d={'TypeA':'A',
   'TypeB':'B',
   'Car':'CR',
   'Bus':'BS'}
def repl(m):
    return '<a my-inner-type="'+d[m.group(1)]+m.group(2)+'">'+m.group(3)+'</a>'
s='<TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>',repl,s))
print()
s='<Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>',repl,s))
OUTPUT
<a my-inner-type="A:1234">Bob Alice</a> or <a my-inner-type="B:5678">Nelson Mandela</a>
<a my-inner-type="BS:1234">Bob Alice</a> or <a my-inner-type="CR:5678">Nelson Mandela</a>
regex
We capture what we need in 3 groups and refer to them through the match object. The three parenthesized groups in the regex are:
<(.*?)(:\d+)\|(.*?)>
We use these 3 groups in our repl function to return the right string.
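The same idea can also be written with named groups, which keeps the replacement function readable. The following is a sketch under assumptions: the mapping TYPE_CODES is taken from the question's two examples, the helper name replace_tags is made up, and tags with an unknown type are simply left unchanged:

```python
import re

# Assumed mapping from tag type to short code, following the question's examples.
TYPE_CODES = {'Car': 'CR', 'Bus': 'BS'}

TAG = re.compile(r'<(?P<type>\w+):(?P<id>\d+)\|(?P<name>[^<>]+)>')

def repl(m):
    code = TYPE_CODES.get(m.group('type'))
    if code is None:
        # Unknown type: return the whole original tag untouched.
        return m.group(0)
    return '<a my-inner-type="{}:{}">{}</a>'.format(
        code, m.group('id'), m.group('name'))

def replace_tags(text):
    return TAG.sub(repl, text)

print(replace_tags('<Car:1234|Bob Alice> or <Bus:5678|Nelson Mandela>'))
```

Using dict.get() instead of indexing avoids a KeyError when the text contains a tag type that is not in the mapping.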
Sorry this isn't a complete answer but I'm falling asleep at the computer, but this is the regex that'll match either of the strings you provided, (<Type)(\w:)(\d+\|)(\w+\s\w+>). Check out https://pythex.org/ for testing your regex stuff.
Try with:
import re
def get_tag(match):
    base = '<a my-inner-type="{}">{}</a>'
    inner_type = match.group(1).upper()
    my_inner_type = '{}{}:{}'.format(inner_type[0], inner_type[-1], match.group(2))
    return base.format(my_inner_type, match.group(3))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Bus:1234|Bob Alice>'))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Car:5678|Nelson Mandela>'))
This code will work if you have it in the form <Type:num|name>:
def replaceupdate(tag):
    replace = ''
    t = ''
    i = 1
    ident = ''
    name = ''
    typex = ''
    # read the type, including the trailing ':'
    while t != ':':
        typex += tag[i]
        t = tag[i]
        i += 1
    t = ''
    # read the identifier up to (not including) '|'
    while t != '|':
        if tag[i] == '|':
            break
        ident += tag[i]
        t = tag[i]
        i += 1
    t = ''
    i += 1
    # read the name up to (not including) the closing '>'
    while t != '>':
        if tag[i] == '>':
            break
        name += tag[i]
        t = tag[i]
        i += 1
    replace = '<a my-inner-type="{}{}">{}</a>'.format(typex, ident, name)
    return replace
I know it does not use regex and it has to split the text some other way, but this is the main bulk.

Match unique patterns in string - Python

I have a list of strings called txtFreeForm:
['Add roth Sweep non vested money after 5 years of termination',
 'Add roth in-plan to the 401k plan.']
I need to check if only 'Add roth' exists in the sentence. To do that I used this:
for each_line in txtFreeForm:
    match = re.search('add roth',each_line.lower())
    if match is not None:
        print(each_line)
But this obviously returns both the strings in my list, as both contain 'add roth'. Is there a way to exclusively search for 'Add roth' in a sentence? I have a bunch of these patterns to search for in strings.
Thanks for your help!
Can you fix this problem by checking the length of the string? I'm not an experienced Python programmer, but here is how I think it should work:
for each_line in txtFreeForm:
    match = re.search('add roth',each_line.lower())
    # compare the length of the current line, not of the whole list
    if (match is not None) and (len(each_line) == len("Add Roth")):
        print(each_line)
Basically, if the text is in the string, AND the length of the string is exactly equal to the length of the string "Add Roth", then it must ONLY contain "Add Roth".
I hope this was helpful.
EDIT:
I misunderstood what you were asking. You want to print out sentences that contain "Add Roth", but not sentences that contain "Add Roth in plan". Is this correct?
How about this code?
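for each_line in txtFreeForm:
    match_AR = re.search('add roth',each_line.lower())
    # [- ] accepts both "in plan" and the hyphenated "in-plan"
    match_ARIP = re.search('add roth in[- ]plan',each_line.lower())
    # re.search returns a match object or None, never True
    if (match_AR is not None) and (match_ARIP is None):
        print(each_line)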
This seems like it should fix the problem. You can exclude any strings (like "in plan") by searching for them too and adding them to the comparison.
You're close :) Give this a shot:
for each_line in txtFreeForm:
    match = re.search('add roth (?!in[-]plan)',each_line.lower())
    if match is not None:
        print(each_line[match.end():])
EDIT:
Ahhh I misread... you have a LOT of these. This calls for some more aggressive magic.
import re
# reduce lives in functools on Python 3
from functools import partial, reduce
txtFreeForm = ['Add roth Sweep non vested money after 5 years of termination',
               'Add roth in-plan to the 401k plan.']
def roths(rows):
    # yield each row containing "add roth" together with the text after it
    for row in rows:
        match = re.search(r'add roth\s*', row.lower())
        if match:
            yield row, row[match.end():]
def filter_pattern(pattern):
    return partial(lazy_filter_out, pattern)
# not used below
def lazy_filter(pattern):
    return partial(lazy_filter, pattern)
def lazy_filter_out(pattern, rows):
    # drop rows whose remaining text starts with the unwanted pattern
    for row, rest in rows:
        if not re.match(pattern, rest):
            yield row, rest
def magical_transducer(bad_words, nice_rows):
    # list(...) is needed on Python 3, where map returns an iterator
    magical_sentences = reduce(lambda x, y: y(x), [roths] + list(map(filter_pattern, bad_words)), nice_rows)
    for row, _ in magical_sentences:
        yield row
def main():
    magic = magical_transducer(['in[-]plan'], txtFreeForm)
    print(list(magic))
if __name__ == '__main__':
    main()
To explain a bit about what's happening here: you mentioned you have a LOT of these words to process. The traditional way you might compare two groups of items is with nested for-loops. So,
results = []
for word in words:
    for pattern in patterns:
        data = do_something(word, pattern)
        results.append(data)
        for item in data:
            for thing in item:
                # and so on...
                # and so forth...
                pass
I'm using a few different techniques to attempt to achieve a "flatter" implementation and avoid the nested loops. I'll do my best to describe them.
**Function compositions**
# You will often see patterns that look like this:
x = foo(a)
y = bar(x)
z = baz(y)
# You may also see patterns that look like this:
z = baz(bar(foo(a)))
# an alternative way to do this is to use a functional composition
# (on Python 3, reduce must be imported: from functools import reduce)
# the technique works like this:
z = reduce(lambda x, y: y(x), [foo, bar, baz], a)
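To make the composition idea concrete, here is a small runnable sketch. The helper name compose_apply and the three toy functions are hypothetical stand-ins for foo, bar and baz:

```python
from functools import reduce  # reduce is not a builtin on Python 3

def compose_apply(functions, value):
    # Feed value through each function in order: f3(f2(f1(value))).
    return reduce(lambda acc, fn: fn(acc), functions, value)

# Hypothetical stand-ins for foo, bar and baz.
def foo(x):
    return x + 1

def bar(x):
    return x * 2

def baz(x):
    return x - 3

print(compose_apply([foo, bar, baz], 5))  # same value as baz(bar(foo(5)))
```

The list of functions can be built at runtime, which is what lets the answer above add one filter per unwanted pattern without writing another nested loop.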

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work that well for:
u"\u2019[a-z]": re.escape turns the [a-z] part into \\[a\\-z\\], which doesn't match.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')
def lookup(match):
    return replacements[match.lastgroup]
text = pattern.sub(lookup, text_1)
The problem here is actually the escaping, this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
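One caveat: \u2019[a-z]? removes at most one letter after the apostrophe, so "I\u2019ve" would become "Ie" rather than "I". A possible variation, sketched here for Python 3 with the question's sample text, uses [a-z]* so that multi-letter suffixes are removed as well:

```python
import re

text_1 = ('I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. '
          'That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.')

# [a-z]* also removes multi-letter suffixes such as "ve" after the apostrophe.
remove = ('\u201c', '\u201d', '\u2019[a-z]*', '\u2013', '\u2018')
pattern = re.compile('|'.join(remove))

text = pattern.sub('', text_1)
print(text)
```

Note that the space before "wrote" is kept, since only the quotation marks and apostrophe suffixes are deleted.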
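The simplest way is this regex (it applies only if the text contains literal backslash sequences rather than the decoded characters):
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)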
