I am trying to parse strings that look like this.
<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2880</Checks>
</Objective>
<Objective Type="Uptime">
<Goal/>
<Actual/>
<Compliant/>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>
I want to use regex to find the position of 'Description' and then get the string between the quotes, so I want 'Get Metadata'. Then I want to find the position of 'From' and get the string between the quotes, so I want '2019-01-16 00:00'. Finally, I want to find the position of 'Thru' and get the string between the quotes, so I want '2019-01-16 23:59'. How can I do this with 3 separate regex commands and parse the result into 3 separate strings? TIA.
You can do this with 1 regex pattern
import re
# string holds the XML text from the question
pattern = re.compile('Description="(.*)" From="(.*)" Thru="(.*)" obj')
for founds in re.findall(pattern=pattern, string=string):
    desc, frm, thru = founds
    print(desc)
    print(frm)
    print(thru)
# output
# Get Metadata
# 2019-01-16 00:00
# 2019-01-16 23:59
Or you can do the same step with different patterns
pattern_desc = re.compile('Description="(.*)" From')
pattern_frm = re.compile('From="(.*)" Thru')
pattern_thru = re.compile('Thru="(.*)" obj')
re.findall(pattern_desc, string)
# output: ['Get Metadata']
re.findall(pattern_frm, string)
# output: ['2019-01-16 00:00']
re.findall(pattern_thru, string)
# output: ['2019-01-16 23:59']
This regex should give you the content of Description. The others are similar, but the date values also contain '-' and ':', so their character class needs to allow those too:
'Description="([\w\s]+)" From'
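As a rough sketch of how the three character-class patterns might look side by side: the `header` string below is just the opening tag from the question, and the `[\w\s:-]` class for the dates is my assumption about what those values can contain.

```python
import re

# Opening tag copied from the question
header = ('<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" '
          'Description="Get Metadata" From="2019-01-16 00:00" '
          'Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">')

# Word characters and whitespace are enough for the description
desc = re.search(r'Description="([\w\s]+)" From', header)
# The dates also contain '-' and ':', so the class is widened (assumption)
frm = re.search(r'From="([\w\s:-]+)" Thru', header)
thru = re.search(r'Thru="([\w\s:-]+)"', header)

for m in (desc, frm, thru):
    print(m.group(1) if m else None)
```

A negated class such as `[^"]*` avoids having to list the allowed characters at all.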
I put together a little working example with a regex to get the data you are looking for.
import re
long_string = '''
<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2880</Checks>
</Objective>
<Objective Type="Uptime">
<Goal/>
<Actual/>
<Compliant/>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>
'''
match = re.search('Description=\"(.+?)\" From=\"(.+?)\" Thru=\"(.+?)\"', long_string)
if match:
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
It gives this output:
Get Metadata
2019-01-16 00:00
2019-01-16 23:59
Hope this helps.
The three regexes you need to capture those values are:
Description="([^"]*)"
From="([^"]*)"
Thru="([^"]*)"
You can generate these dynamically in a function and reuse it to find the value of any attribute. Try this Python code demo:
import re
def getValue(text, key):
    # Capture everything between the quotes that follow key=
    m = re.search(re.escape(key) + '="([^"]*)"', text)
    if m:
        return m.group(1)
s = '''<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2880</Checks>
</Objective>
<Objective Type="Uptime">
<Goal/>
<Actual/>
<Compliant/>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>'''
print('Description: ' + getValue(s,'Description'))
print('From: ' + getValue(s,'From'))
print('Thru: ' + getValue(s,'Thru'))
Prints,
Description: Get Metadata
From: 2019-01-16 00:00
Thru: 2019-01-16 23:59
In pure Python, it should be something like this:
xml = '<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,"><Objective Type="Availability"><Goal>99.99</Goal><Actual>100.00</Actual><Compliant>Yes</Compliant><Errors>0</Errors><Checks>2880</Checks></Objective><Objective Type="Uptime"><Goal/><Actual/><Compliant/><Errors>0</Errors><Checks>0</Checks></Objective>'
report = xml.split('>')[0]
description = report.split("Description=\"")[1].split("\" From=\"")[0]
from_ = report.split("From=\"")[1].split("\" Thru=\"")[0]
thru = report.split("Thru=\"")[1].split("\" obj_device=\"")[0]
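Since the input is XML, an XML parser is another option. This is only a sketch: it assumes the real document is well-formed, meaning it ends with a closing </Report> tag, which the snippet in the question does not show. The shortened `xml_text` below is made up from the question's data for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the full document closes the root element with </Report>
xml_text = '''<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
</Objective>
</Report>'''

root = ET.fromstring(xml_text)

# Attributes of the root element are read by name; a missing one gives None
description = root.get("Description")
from_ = root.get("From")
thru = root.get("Thru")

print(description, from_, thru, sep="\n")
```

This does not depend on the order of the attributes, which the split-based and single-regex versions do.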
Related
I am new to Python and have been searching for a way to replace a series of patterns using regex, but none of the methods I found have worked for me. Here are some of my patterns and the code I am using:
regexes = {
    r'\s(\(|\[)(.*?)Mix(.*?)(\)|\])/i' : r"",
    r'\s(\(|\[)(.*?)Version(.*?)(\)|\])/i' : r"",
    r'\s(\(|\[)(.*?)Remix(.*?)(\)|\])/i' : r"",
    r'\s(\(|\[)(.*?)Extended(.*?)(\)|\])/i' : r"",
    r'\s\(remix\)/i' : r"",
    r'\s\(original\)/i' : r"",
    r'\s\(intro\)/i' : r"",
}
def multi_replace(dict, text):
    for key, value in dict.items():
        text = re.sub(key, value, text)
    return text
filename = "Testing (Intro)"
name = multi_replace(regexes, filename)
print(name)
I am a DJ and I am pulling filenames from directories of music I own. I belong to many record pools, and they sometimes label their songs as follows:
SomeGuy - Song Name Here (Intro)
SomeGirl - Song Name Here (Remix)
SomeGirl - Song Name Here (Extended Version)
SomeGuy - Song Name Here (12" Mix Vocal)
and so on...
My regex above works in PHP, where it removes all the values like (Intro), (Remix), (Extended Version), etc., so the output is:
SomeGuy - Song Name Here
SomeGirl - Song Name Here
SomeGirl - Song Name Here
SomeGuy - Song Name Here
and so on...
Python has no /i suffix for patterns; in your code it is treated as literal text to match. To ignore case, pass re.I (or re.IGNORECASE) as a flag instead.
Try this code:
import re
regexes = {
    r'\s(\(|\[)(.*?)Mix(.*?)(\)|\])' : r"",
    r'\s(\(|\[)(.*?)Version(.*?)(\)|\])' : r"",
    r'\s(\(|\[)(.*?)Remix(.*?)(\)|\])' : r"",
    r'\s(\(|\[)(.*?)Extended(.*?)(\)|\])' : r"",
    r'\s\(remix\)' : r"",
    r'\s\(original\)' : r"",
    r'\s\(intro\)' : r"",
}
def multi_replace(replacements, text):
    # replacements maps each pattern to its replacement string
    for key, value in replacements.items():
        p = re.compile(key, re.I)  # re.I makes the match case-insensitive
        text = p.sub(value, text)
    return text
filename = "Testing (Intro)"
name = multi_replace(regexes, filename)
print(name)
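As a possible variation, the separate patterns could be folded into one compiled, case-insensitive pattern. This is a sketch rather than a drop-in replacement: `KEYWORDS`, `TAG_RE` and `strip_tags` are names I made up, and the keyword list is taken from the patterns in the question ("remix" is covered by "mix").

```python
import re

# Keywords taken from the question's patterns; extend as needed (assumption)
KEYWORDS = ("mix", "version", "extended", "original", "intro")

# Whitespace, an opening ( or [, any text without a closing bracket that
# contains one of the keywords, then the closing ) or ]
TAG_RE = re.compile(
    r"\s[\(\[][^\)\]]*(?:" + "|".join(KEYWORDS) + r")[^\)\]]*[\)\]]",
    re.IGNORECASE,
)

def strip_tags(filename):
    """Remove bracketed tags such as (Intro) or [Extended Version]."""
    return TAG_RE.sub("", filename)

names = [
    "SomeGuy - Song Name Here (Intro)",
    "SomeGirl - Song Name Here (Remix)",
    "SomeGirl - Song Name Here (Extended Version)",
    'SomeGuy - Song Name Here (12" Mix Vocal)',
]
for n in names:
    print(strip_tags(n))
```

Compiling once also avoids rebuilding each pattern on every call.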
I am trying to create a function that will return a string from the text based on these conditions:
If 'recurring payment authorized on' is in the string, get the first word after 'on'.
If 'recurring payment' is in the string, get everything before it.
Currently I have written the following:
# will be used in an apply statement for a column in dataframe
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on")+1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None
However this code looks awkward and is not robust. I want to use regex to match those strings.
Here are some examples of what this function should return:
recurring payment authorized on usps abc should return usps
usps recurring payment abc should return usps
Any help writing a regex for this function will be appreciated. The input string will only contain text; there will be no digits or special characters.
Using Regex with lookahead and lookbehind pattern matching
import re
def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
    m = pattern_on.search(x)
    if m:
        return m.group(1)  # group(1) leaves out the trailing whitespace
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
Output
In text: recurring payment authorized on usps abc
Found: usps
In text: usps recurring payment abc
Found: usps
In text: recurring payment authorized on att xxx xxx
Found: att
In text: recurring payment authorized on 25.05.1980 xxx xxx
Found: 25.05.1980
In text: att recurring payment xxxxx
Found: att
In text: 12.14.14. att recurring payment xxxxx
Found: 12.14.14. att
Explanation
Lookahead and Lookbehind pattern matching
Regex Lookbehind
(?<=foo) Lookbehind Asserts that what immediately precedes the current
position in the string is foo
So in pattern: r'(?<= authorized on )(.*?)(\s+)'
foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace
So the above causes (.*?) to capture all characters after " authorized on " until the first whitespace character.
Regex Lookahead
(?=foo) Lookahead Asserts that what immediately follows the current position in the string is foo
So with: r'^(.*?)\s(?=recurring payment)'
foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space
Thus, (.*?) will match all characters from beginning of string until we get whitespace followed by "recurring payment"
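To see the two assertions in isolation, here is a small sketch. The patterns are simplified versions of the ones above (using \S+ and a lazy .*?), so treat them as an illustration and not as the exact patterns from the answer.

```python
import re

text = "recurring payment authorized on usps abc"

# Lookbehind: the match starts right after " authorized on ",
# but that prefix is not part of the match itself
after = re.search(r'(?<= authorized on )\S+', text)
print(after.group(0))

text2 = "12.14.14. att recurring payment xxxxx"

# Lookahead: the match stops where whitespace plus "recurring payment"
# follows, without consuming it
before = re.search(r'^.*?(?=\srecurring payment)', text2)
print(before.group(0))
```

Because lookarounds consume nothing, group(0) is already the wanted text and no trimming is needed.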
Better Performance
Worth doing since you're applying this to a DataFrame, which may have lots of rows.
Take the pattern compilation out of the parser and place it at module level (33% reduction in time).
def parser(x):
    # Use predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        return m.group(1)  # group(1) leaves out the trailing whitespace
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None
# Define patterns to search
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
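One way to check the difference on your own machine is a quick timeit comparison. This is a sketch with made-up function names and a simplified pattern; the numbers will vary, and since Python's re module caches recently compiled patterns, the gap mostly reflects the cost of the cache lookup rather than a full recompile.

```python
import re
import timeit

# Compiled once at import time
PATTERN_ON = re.compile(r'(?<= authorized on )(\S+)')

def parse_recompiling(text):
    # Calls re.compile on every invocation (served from re's internal cache)
    pattern = re.compile(r'(?<= authorized on )(\S+)')
    m = pattern.search(text)
    return m.group(1) if m else None

def parse_precompiled(text):
    # Reuses the module-level pattern
    m = PATTERN_ON.search(text)
    return m.group(1) if m else None

sample = "recurring payment authorized on usps abc"
for fn in (parse_recompiling, parse_precompiled):
    seconds = timeit.timeit(lambda: fn(sample), number=10000)
    print("{}: {:.4f}s".format(fn.__name__, seconds))
```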
I am not sure that this level of complexity requires RegEx.
Hoping that RegEx is not a strict requirement for you here's a solution not using it:
examples = [
'stuff more stuff recurring payment authorized on ExampleA useless data',
'other useless data ExampleB recurring payment',
'Not matching phrase payment example authorized'
]
def extract_data(phrase):
    result = None
    if "recurring payment authorized on" in phrase:
        result = phrase.split("recurring payment authorized on")[1].split()[0]
    elif "recurring payment" in phrase:
        result = phrase.split("recurring payment")[0]
    return result
for example in examples:
    print(extract_data(example))
Output
ExampleA
other useless data ExampleB
None
Not sure if this is any faster, but Python's regex syntax has conditionals:
If authorized on is present then
match the next substring of non-space characters else
match everything that occurs before recurring
Note that the result will be in capturing group 2 or 3 depending on which matched.
import re
def xstr(s):
    if s is None:
        return ''
    return str(s)
def parser(x):
    # Patterns to search
    pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")
    m = pattern.search(x)
    if m:
        # Only one of groups 2 and 3 is set; the other is None
        return xstr(m.group(2)) + xstr(m.group(3))
    return None
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
You can do this with a single regex and without explicit lookahead or lookbehind.
Please let me know if this works, and how it performs against @DarryIG's solution.
import re
from collections import namedtuple
ResultA = namedtuple('ResultA', ['s'])
ResultB = namedtuple('ResultB', ['s'])
RX = re.compile(r'((?P<init>.*) )?recurring payment ((authorized on (?P<authorized>\S+))|(?P<rest>.*))')
def parser(x):
    '''https://stackoverflow.com/questions/59600852/use-regex-to-match-multiple-words-in-sequence
    >>> parser('recurring payment authorized on usps abc')
    ResultB(s='usps')
    >>> parser('usps recurring payment abc')
    ResultA(s='usps')
    >>> parser('recurring payment authorized on att xxx xxx')
    ResultB(s='att')
    >>> parser('recurring payment authorized on 25.05.1980 xxx xxx')
    ResultB(s='25.05.1980')
    >>> parser('att recurring payment xxxxx')
    ResultA(s='att')
    >>> parser('12.14.14. att recurring payment xxxxx')
    ResultA(s='12.14.14. att')
    '''
    m = RX.match(x)
    if m is None:
        return None  # consider ValueError
    recurring = m.groupdict()['init'] or m.groupdict()['rest']
    authorized = m.groupdict()['authorized']
    if (recurring is None) == (authorized is None):
        raise ValueError('invalid input')
    if recurring is not None:
        return ResultA(recurring)
    else:
        return ResultB(authorized)
I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
So I want to separate this paragraph based on the variable names using python. "Evaluation Note", "Date","ID","Contact","Name","Address","Home Phone","Additional Information" and "Author" are the variable names. I think using regex seems nice but I don't have a lot of experience in regex.
Here is what I am trying to do:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+)
ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (? P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019
ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But it doesn't find any matches.
You can probably generate that regex on the fly, so long as the order of the params is fixed.
Here is my try at it; it does the job. The actual regex it is shooting for is something like Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*), and so on.
import re
test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']
def find(keys, string):
    keys = [(key, key.replace(' ', '_')) for key in keys]  # spaces aren't valid group names
    pattern = ''.join([f'{re.escape(key)}(?P<{name}>.*)' for key, name in keys])  # generate the actual regex
    for groups in re.findall(pattern, string):
        for item in groups:
            yield item.strip(':').strip()  # clean up the result
for value in find(keys, test_str):
    print(value)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
You can use search to get locations of variables and parse text accordingly. You can customize it easily.
import re
# text holds the paragraph from the question as a single string
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
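This will output the following (the code prints each label and its value on separate lines; they are shown joined here):
Evaluation Note: Suspected abuse by own mother.
Date 3/13/2019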
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
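The repeated search-and-slice steps above could also be driven from the list of keys. The sketch below is one possible way, using re.split with a capturing group; `split_by_keys` is a hypothetical helper name, and it assumes that none of the keys also appears as a whole word inside a value.

```python
import re

text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A "
        "Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive "
        "Home Phone: 353 273 400 Additional Information: Please tell me when the mother "
        "arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker")

keys = ["Evaluation Note", "Date", "ID", "Contact", "Name", "Address",
        "Home Phone", "Additional Information", "Author"]

def split_by_keys(text, keys):
    """Return a dict mapping each key found in text to the text that follows it."""
    # Longest keys first, so a longer key wins over a shorter one it contains
    alternation = "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))
    # \b keeps a key from matching inside a longer word; the colon is optional
    pattern = re.compile(r"\b(" + alternation + r")\b:?")
    # With one capturing group, split returns [prefix, key1, value1, key2, value2, ...]
    parts = pattern.split(text)
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}

fields = split_by_keys(text, keys)
for key, value in fields.items():
    print("{}: {}".format(key, value))
```

Unlike the positional approach, this does not require the keys to appear in a fixed order.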
I have this paragraph in a variable
"Information about this scan : abc version : 5.2.5 pqr version : 201 403061815 hello kdshfldfs;dfkfjljcsdlc sljc lsjclsj csjclks cscjsld"
I want to fetch 'abc version' and 'pqr version'.
How can I achieve that?
You can do it as follows:
string = "Your Paragraph String"
tokens = string.split()  # split on whitespace into a list of words
abc_version = tokens[tokens.index('abc') + 3]
and then either
pqr_version = tokens[tokens.index('pqr') + 3]  # this will give 201
or
pqr_version = ' '.join(tokens[tokens.index('pqr') + 3:tokens.index('pqr') + 5])  # this will give 201 403061815
Please specify where your pqr version string starts and ends.
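A regex version of the same idea might look like the sketch below. `get_version` is a hypothetical helper, and it assumes a version is the first run of non-space characters after "<name> version :"; if the pqr version is meant to include the second number, the capture group would need to change.

```python
import re

text = ("Information about this scan : abc version : 5.2.5 pqr version : 201 403061815 "
        "hello kdshfldfs;dfkfjljcsdlc sljc lsjclsj csjclks cscjsld")

def get_version(name, text):
    """Return the first token after '<name> version :', or None if absent."""
    m = re.search(re.escape(name) + r"\s+version\s*:\s*(\S+)", text)
    return m.group(1) if m else None

abc_version = get_version("abc", text)
pqr_version = get_version("pqr", text)
print(abc_version)
print(pqr_version)
```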
What I am trying to match is something like this:
public FUNCTION_NAME
FUNCTION_NAME proc near
......
FUNCTION_NAME endp
FUNCTION_NAME can be :
version_etc
version_etc_arn
version_etc_ar
and my pattern is:
pattern = "public\s+" + func_name + "[\s\S]*" + func_name + "\s+endp"
and match with:
match = re.findall(pattern, content)
Currently I find that if func_name equals version_etc, the pattern matches all of version_etc_arn, version_etc_ar and version_etc,
which means if the pattern is :
"public\s+" + "version_etc" + "[\s\S]*" + "version_etc" + "\s+endp"
then it will match:
public version_etc_arn
version_etc_arn proc near
......
version_etc_arn endp
public version_etc_ar
version_etc_ar proc near
......
version_etc_ar endp
public version_etc
version_etc proc near
......
version_etc endp
And I am trying to just match:
public version_etc
version_etc proc near
......
version_etc endp
Am I doing something wrong? Could anyone give me some help?
Thank you!
[\s\S]* matches 0-or-more of anything, including the _arn that you are trying to exclude. Thus, you need to require a whitespace after func_name:
pattern = r"(?sm)public\s+{f}\s.*^{f}\s+endp".format(f=func_name)
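Here is a runnable sketch of that idea on the sample from the question. The `find_proc` helper, the re.escape call, the leading ^ before public, and the lazy .*? are my additions and not part of the pattern above; the lazy quantifier stops at the first matching endp line instead of the last.

```python
import re

content = """public version_etc_arn
version_etc_arn proc near
......
version_etc_arn endp
public version_etc_ar
version_etc_ar proc near
......
version_etc_ar endp
public version_etc
version_etc proc near
......
version_etc endp
"""

def find_proc(func_name, content):
    """Return the blocks whose name is exactly func_name."""
    name = re.escape(func_name)
    # (?s): . also matches newlines; (?m): ^ matches at the start of each line.
    # The \s after the name rejects longer names such as version_etc_arn.
    pattern = r"(?sm)^public\s+{f}\s.*?^{f}\s+endp".format(f=name)
    return re.findall(pattern, content)

matches = find_proc("version_etc", content)
for block in matches:
    print(block)
```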