Use regex to match multiple words in sequence - python

I am trying to create a function that returns a string from the text based on these conditions:
If 'recurring payment authorized on' is in the string, return the first word after 'on'
If 'recurring payment' is in the string, return everything before it
Currently I have written the following:
# will be used in an apply statement for a column in a dataframe
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on") + 1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None
However this code looks awkward and is not robust. I want to use regex to match those strings.
Here are some examples of what this function should return:
recurring payment authorized on usps abc should return usps
usps recurring payment abc should return usps
Any help on writing regex for this function will be appreciated. The input string will only contain text; there will be no numerical and special characters

Using Regex with lookahead and lookbehind pattern matching
import re
def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
    # Case 1: the first word after " authorized on "
    m = pattern_on.search(x)
    if m:
        return m.group(1)  # group(1) excludes the trailing whitespace
    # Case 2: everything before "recurring payment"
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
Output
In text: recurring payment authorized on usps abc
Found: usps
In text: usps recurring payment abc
Found: usps
In text: recurring payment authorized on att xxx xxx
Found: att
In text: recurring payment authorized on 25.05.1980 xxx xxx
Found: 25.05.1980
In text: att recurring payment xxxxx
Found: att
In text: 12.14.14. att recurring payment xxxxx
Found: 12.14.14. att
Explanation
Lookahead and Lookbehind pattern matching
Regex Lookbehind
(?<=foo) Lookbehind: asserts that what immediately precedes the current position in the string is foo
So in pattern: r'(?<= authorized on )(.*?)(\s+)'
foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace
So the above causes (.*?) to capture all characters after " authorized on " until the first whitespace character.
Regex Lookahead
(?=foo) Lookahead Asserts that what immediately follows the current position in the string is foo
So with: r'^(.*?)\s(?=recurring payment)'
foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space
Thus, (.*?) will match all characters from beginning of string until we get whitespace followed by "recurring payment"
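To see what the lookarounds do and do not consume, the two patterns from this answer can be run on their own. This is only a small sketch: it shows that group(0) still carries the trailing whitespace while group(1) does not. The first_word_after_on pattern at the end is a hypothetical variant, not part of the original answer, added on the assumption that the name may also be the last word of the string.

```python
import re

pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

m_on = pattern_on.search("recurring payment authorized on usps abc")
# The lookbehind text is not part of the match, but the whitespace
# matched by (\s+) is, so group(0) and group(1) differ.
on_whole, on_word = m_on.group(0), m_on.group(1)

m_recur = pattern_recur.search("usps recurring payment abc")
# The lookahead text is not consumed; the \s just before it is.
recur_whole, recur_prefix = m_recur.group(0), m_recur.group(1)

# Hypothetical variant: \S+ needs no trailing whitespace, so it also
# works when the name is the last word of the string.
first_word_after_on = re.compile(r'(?<= authorized on )\S+')
at_end = "recurring payment authorized on usps"
original_at_end = pattern_on.search(at_end)
variant_at_end = first_word_after_on.search(at_end).group(0)

print(repr(on_whole), repr(on_word))          # 'usps ' 'usps'
print(repr(recur_whole), repr(recur_prefix))  # 'usps ' 'usps'
print(original_at_end, repr(variant_at_end))  # None 'usps'
```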
Better Performance
Desirable since you're applying this to a DataFrame column, which may have many rows.
Take the pattern compilation out of the parser and place it at module level (33% reduction in time).
def parser(x):
    # Use predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        return m.group(1)
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None
# Define patterns to search
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]

I am not sure that this level of complexity requires RegEx.
Assuming RegEx is not a strict requirement for you, here's a solution that doesn't use it:
examples = [
    'stuff more stuff recurring payment authorized on ExampleA useless data',
    'other useless data ExampleB recurring payment',
    'Not matching phrase payment example authorized'
]
def extract_data(phrase):
    result = None
    if "recurring payment authorized on" in phrase:
        result = phrase.split("recurring payment authorized on")[1].split()[0]
    elif "recurring payment" in phrase:
        result = phrase.split("recurring payment")[0]
    return result
for example in examples:
    print(extract_data(example))
Output
ExampleA
other useless data ExampleB
None

Not sure if this is any faster, but Python's regex syntax supports conditionals:
If authorized on is present, then
match the next substring of non-space characters; else
match everything that occurs before recurring.
Note that the result will be in capturing group 2 or 3, depending on which branch matched.
import re
def xstr(s):
    # Turn an unmatched group (None) into an empty string
    if s is None:
        return ''
    return str(s)
def parser(x):
    # Pattern to search: group 1 decides which branch of the conditional is used
    pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")
    m = pattern.search(x)
    if m:
        return xstr(m.group(2)) + xstr(m.group(3))
    return None
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
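To make the "group 2 or 3" remark concrete, here is a small sketch that prints which groups are populated for each branch of the conditional. It reuses the pattern from this answer; the name CONDITIONAL and the two sample strings (taken from the question) are just illustrative choices.

```python
import re

# (?(1)yes|no): use the "yes" branch if group 1 took part in the match,
# otherwise use the "no" branch.
CONDITIONAL = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")

authorized_groups = CONDITIONAL.search(
    "recurring payment authorized on usps abc").groups()
prefix_groups = CONDITIONAL.search("usps recurring payment abc").groups()

# "yes" branch: group 1 and group 2 are set, group 3 is None
print(authorized_groups)  # ('authorized on ', 'usps', None)
# "no" branch: only group 3 is set
print(prefix_groups)      # (None, None, 'usps')
```

Since exactly one of groups 2 and 3 is set on a match, `m.group(2) or m.group(3)` would be another way to pick the result.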

You can do this with a single regex and without explicit lookahead or lookbehind.
Please let me know if this works, and how it performs against @DarryIG's solution.
import re
from collections import namedtuple
ResultA = namedtuple('ResultA', ['s'])
ResultB = namedtuple('ResultB', ['s'])
# Raw string, so that \S reaches the regex engine unchanged
RX = re.compile(r'((?P<init>.*) )?recurring payment ((authorized on (?P<authorized>\S+))|(?P<rest>.*))')
def parser(x):
    '''https://stackoverflow.com/questions/59600852/use-regex-to-match-multiple-words-in-sequence
    >>> parser('recurring payment authorized on usps abc')
    ResultB(s='usps')
    >>> parser('usps recurring payment abc')
    ResultA(s='usps')
    >>> parser('recurring payment authorized on att xxx xxx')
    ResultB(s='att')
    >>> parser('recurring payment authorized on 25.05.1980 xxx xxx')
    ResultB(s='25.05.1980')
    >>> parser('att recurring payment xxxxx')
    ResultA(s='att')
    >>> parser('12.14.14. att recurring payment xxxxx')
    ResultA(s='12.14.14. att')
    '''
    m = RX.match(x)
    if m is None:
        return None  # consider ValueError
    recurring = m.groupdict()['init'] or m.groupdict()['rest']
    authorized = m.groupdict()['authorized']
    # Exactly one of the two cases must have matched
    if (recurring is None) == (authorized is None):
        raise ValueError('invalid input')
    if recurring is not None:
        return ResultA(recurring)
    else:
        return ResultB(authorized)

Related

Find a money amount specified in a email and print it next to a str

I've been trying to write a script that uses the Gmail API to detect new transaction emails in my Gmail account. It now detects all the emails from my bank that relate to a transfer sent to me, but it prints the whole email (which contains the amount, written with a "$", at a certain point in the text). I want the script to detect the money amount and then print a string with it, for example:
print("Transaction Received", " $", moneyAmount, "!")
The Messages Filter part of the script
# Messages Filter
message_count = 50
for message in messages[:message_count]:
    msg = service.users().messages().get(userId='me', id=message['id']).execute()
    email = (msg['snippet'])
    if "Datos de la transferencia que recibiste Monto $" in email:
        service.users().messages().modify(userId='me', id=message['id'], body={'removeLabelIds': ['UNREAD']}).execute()
        print(f'{email}\n')
I can't figure out how to make it detect the money amount that appears inside the bank's email.
I tried to match a pattern consisting of one digit from 1 to 9, followed by a ".", followed by three digits from 001 to 999, and then print the whole value, but it didn't work, so I'm out of ideas.
Here's my attempt:
# Messages Filter
message_count = 50
for message in messages[:message_count]:
    msg = service.users().messages().get(userId='me', id=message['id']).execute()
    email = (msg['snippet'])
    if "Datos de la transferencia que recibiste Monto $" in email:
        value = re.findall(r'$\d{1}.\d{3}', email)
        service.users().messages().modify(userId='me', id=message['id'], body={'removeLabelIds': ['UNREAD']}).execute()
        print(f'{email}\n')
        print(f'{value}\n')
The amount of money is displayed like this in the email: [screenshot not included]
And this is taken directly from the email: [screenshot not included]
If someone with an idea could help me, I would be very grateful.
P.S.:Btw if you see something that you don't understand because of the language, it's because it's in Spanish, but it's mostly in strings so it shouldn't be a problem.
I had to replace the pattern I was passing to re.findall, and it ended up like this:
re.findall(r'\$\d+(?:.\d{3})+(?:,\d+)?', email)
Instead of this:
re.findall(r'$\d{1}.\d{3}', email)
Thanks to @Grismar for the answer.
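A minimal sketch of why the first attempt found nothing and the replacement works. The snippet text below is made up for illustration (only the "Datos de la transferencia que recibiste Monto $" prefix comes from the question), and escaping the dot as `\.` is a small tightening of the pattern above rather than what was originally posted.

```python
import re

# Hypothetical email snippet; the text after the amount is invented.
snippet = "Datos de la transferencia que recibiste Monto $12.345 Fecha 01/01/2024"

# An unescaped $ is an end-of-string anchor, so digits can never follow it.
broken = re.findall(r'$\d{1}.\d{3}', snippet)

# \$ matches a literal dollar sign; \. matches a literal thousands separator;
# the optional last group allows a decimal part such as ",50".
amount_pattern = re.compile(r'\$\d+(?:\.\d{3})+(?:,\d+)?')
fixed = amount_pattern.findall(snippet)
with_decimals = amount_pattern.findall("Monto $12.345.678,50")

print(broken)         # []
print(fixed)          # ['$12.345']
print(with_decimals)  # ['$12.345.678,50']

if fixed:
    print(f"Transaction Received {fixed[0]}!")
```

Note that this pattern requires at least one thousands group, so an amount such as "$500" would not match; whether that matters depends on how the bank formats small amounts.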

Validating string for possible space returns error in Django form even though there's no space

I have two fields in my Django form where I want the user to insert exactly one word. So, to have some validation in the view, I created the following validator (the idea is to search for spaces):
Views.py
[..]
if ' ' in poller_choice_one or poller_choice_two:
    raise ValidationError(('Limit your choice to one word'), code='invalid')
else:
    [..]
To make it more robust I added the strip option to the FormFields to be validated:
# Poller Choices
poller_choice_one = forms.CharField(label='Choice one', max_length=20, strip=True)
poller_choice_two = forms.CharField(label='Choice two', max_length=20, strip=True)
I tried a bunch of inputs, from single digits to single characters, and it always raises the validation error.
Your if statement is interpreted as:
if (' ' in poller_choice_one) or (poller_choice_two):
    # …
It first checks whether there is a space in poller_choice_one. If that is not the case, it evaluates the truthiness of poller_choice_two. A string is truthy if it contains at least one character. So the if condition is True as soon as there is a space in poller_choice_one, or poller_choice_two has at least one character.
You should therefore rewrite the condition to:
if ' ' in poller_choice_one or ' ' in poller_choice_two:
    raise ValidationError(('Limit your choice to one word'), code='invalid')
# …
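The difference between the two conditions can be checked outside Django with plain strings. This is just a sketch; the helper names has_space_buggy and has_space_fixed are made up for the illustration.

```python
def has_space_buggy(choice_one, choice_two):
    # Parsed as: (' ' in choice_one) or (choice_two)
    # so any non-empty choice_two makes the result truthy.
    return bool(' ' in choice_one or choice_two)


def has_space_fixed(choice_one, choice_two):
    # Each operand of `or` is a complete membership test.
    return ' ' in choice_one or ' ' in choice_two


print(has_space_buggy("yes", "no"))        # True, although there is no space
print(has_space_fixed("yes", "no"))        # False
print(has_space_fixed("yes", "not sure"))  # True
```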
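Using regex you could implement this check. Note that re.search is needed here: re.match only looks at the start of the string, so it would never find a space in the middle.
import re
def contains_space(s0, s1):
    # Truthy if either string contains a space anywhere
    return re.search(" ", s0) or re.search(" ", s1)
Example:
poller_choice_one = "hi"
poller_choice_two = "how are you?"
if contains_space(poller_choice_one, poller_choice_two):
    raise ValidationError(('Limit your choice to one word'), code='invalid')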

Separating text/text processing using regex

I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
So I want to separate this paragraph based on the variable names using python. "Evaluation Note", "Date","ID","Contact","Name","Address","Home Phone","Additional Information" and "Author" are the variable names. I think using regex seems nice but I don't have a lot of experience in regex.
Here is what I am trying to do:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+)
ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (? P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019
ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But doesn't find any patterns.
You can probably generate that regex on the fly, as long as the order of the params is fixed.
Here's my try at it; it does the job. The actual regex it is aiming for is something like Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*), and so on.
import re
test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']
def find(keys, string):
    keys = [(key, key.replace(' ', '_')) for key in keys]  # spaces aren't valid group names
    pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys])  # generate the actual regex
    for match in re.findall(pattern, string):  # search the argument, not the global test_str
        for item in match:
            yield item.strip(':').strip()  # clean up the result
for value in find(keys, test_str):
    print(value)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
You can use search to get the locations of the variable names and slice the text accordingly. You can customize it easily. Here text is assumed to hold the paragraph from the question.
import re
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
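This will output the following (the code prints each label and its value on separate lines, and prints "Date" without a colon; they are joined here for readability):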
Evaluation Note: Suspected abuse by own mother.
Date: 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
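The per-key searches above can also be folded into a single re.split call. The following is a sketch, not a drop-in replacement: it assumes that none of the key names occurs as a whole word inside a value, and the helper name split_fields is made up for the example.

```python
import re

text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 "
        "ID: #N/A Contact: Not Specified Name: Cecilia Valore "
        "Address: 189 West Moncler Drive Home Phone: 353 273 400 "
        "Additional Information: Please tell me when the mother arrives, "
        "we will have a meeting with her next Monday, 3/17/2019 "
        "Author: social worker")

keys = ["Evaluation Note", "Date", "ID", "Contact", "Name", "Address",
        "Home Phone", "Additional Information", "Author"]


def split_fields(text, keys):
    # Longest keys first, so a longer key wins over a shorter one
    # that starts at the same position.
    alternation = "|".join(
        re.escape(k) for k in sorted(keys, key=len, reverse=True))
    # The capturing group makes re.split keep the key names in the result;
    # the optional colon and the following whitespace are dropped.
    splitter = re.compile(r"\b(" + alternation + r")\b:?\s*")
    parts = splitter.split(text)
    # parts = [text before first key, key1, value1, key2, value2, ...]
    names, values = parts[1::2], parts[2::2]
    return {name: value.strip() for name, value in zip(names, values)}


fields = split_fields(text, keys)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Unlike the generated pattern with fixed key order, this version does not depend on the order in which the keys appear, at the cost of the whole-word assumption stated above.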

How to retrieve well formatted JSON from AWS Lambda using Python

I have a function in AWS Lambda that connects to the Twitter API and returns the tweets which match a specific search query I provide via the event. A simplified version of the function is below. There are a few helper functions I use, such as get_secret to manage API keys and process_tweet, which limits what data gets sent back and does things like converting the created-at date to a string. The net result is that I should get back a list of dictionaries.
def lambda_handler(event, context):
    twitter_secret = get_secret("twitter")
    auth = tweepy.OAuthHandler(twitter_secret['api-key'],
                               twitter_secret['api-secret'])
    auth.set_access_token(twitter_secret['access-key'],
                          twitter_secret['access-secret'])
    api = tweepy.API(auth)
    cursor = tweepy.Cursor(api.search,
                           q=event['search'],
                           include_entities=True,
                           tweet_mode='extended',
                           lang='en')
    tweets = list(cursor.items())
    tweets = [process_tweet(t) for t in tweets if not t.retweeted]
    return json.dumps({"tweets": tweets})
From my desktop then, I have code which invokes the lambda function.
aws_lambda = boto3.client('lambda', region_name="us-east-1")
payload = {"search": "paint%20protection%20film filter:safe"}
lambda_response = aws_lambda.invoke(FunctionName="twitter-searcher",
                                    InvocationType="RequestResponse",
                                    Payload=json.dumps(payload))
results = lambda_response['Payload'].read()
tweets = results.decode('utf-8')
The problem is that somewhere between calling json.dumps on the output in Lambda and reading the payload in Python, the data has gotten garbled. For example, a line break which should be \n becomes \\\\n, all of the double quotes are stored as \\", and Unicode characters are all prefixed by \\. In other words, everything that was escaped arrives in Python on my desktop with the escaping character itself escaped. Consider this element of the list that was returned (with manual formatting).
'{\\"userid\\": 190764134,
\\"username\\": \\"CapitalGMC\\",
\\"created\\": \\"2018-09-02 15:00:00\\",
\\"tweetid\\": 1036267504673337344,
\\"text\\": \\"Protect your vehicle\'s paint! Find out how on this week\'s blog.
\\\\ud83d\\\\udc47\\\\n\\\\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW\\"}'
I can use regex to fix some problems (\\" and \\\\n) but the Unicode is tricky because even if I match it, how do I replace it with a properly escaped character? When I do this in R, using the aws.lambda package, everything is fine, no weird escaped escapes.
What am I doing wrong on my desktop with the response from AWS Lambda that's garbling the data?
Update
The process tweet function is below. It literally just pulls out the bits I care to keep, formats the datetime object to be a string and returns a dictionary.
def process_tweet(tweet):
    bundle = {
        "userid": tweet.user.id,
        "username": tweet.user.screen_name,
        "created": str(tweet.created_at),
        "tweetid": tweet.id,
        "text": tweet.full_text
    }
    return bundle
Just for reference, in R the code looks like this.
payload = list(search="paint%20protection%20film filter:safe")
results = aws.lambda::invoke_function("twitter-searcher"
,payload = jsonlite::toJSON(payload
,auto_unbox=TRUE)
,type = "RequestResponse"
,key = creds$key
,secret = creds$secret
,session_token = creds$session_token
,region = creds$region)
tweets = jsonlite::fromJSON(results)
str(tweets)
#> 'data.frame': 133 obs. of 5 variables:
#> $ userid : num 2231994854 407106716 33553091 7778772 782310 ...
#> $ username: chr "adaniel_080213" "Prestige_AdamL" "exclusivedetail" "tedhu" ...
#> $ created : chr "2018-09-12 14:07:09" "2018-09-12 11:31:56" "2018-09-12 10:46:55" "2018-09-12 07:27:49" ...
#> $ tweetid : num 1039878080968323072 1039839019989983232 1039827690151444480 1039777586975526912 1039699310382931968 ...
#> $ text : chr "I liked a #YouTube video https://url/97sRShN4pM Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film" "Another #Corvette #ZO6 full body clearbra wrap completed using #xpeltech ultimate plus PPF ... Paint protection"| __truncated__ "We recently protected this Tesla Model 3 with Paint Protection Film and Ceramic Coating.#teslamodel3 #charlotte"| __truncated__ "Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film https://url/AD1cl5dNX3" ...
tweets[131,]
#> userid username created tweetid
#> 131 190764134 CapitalGMC 2018-09-02 15:00:00 1036267504673337344
#> text
#> 131 Protect your vehicle's paint! Find out how on this week's blog.👇\n\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW
In your lambda function you should return a response object with a JSON object in the response body.
# Lambda Function
def get_json(event, context):
    """Retrieve JSON from server."""
    # Business Logic Goes Here.
    response = {
        "statusCode": 200,
        "headers": {},
        "body": json.dumps({
            "message": "This is the message in a JSON object."
        })
    }
    return response
Don't use json.dumps()
I had a similar issue, and when I just returned "body": content instead of "body": json.dumps(content) I could easily access and manipulate my data. Before that, I got that weird form that looks like JSON, but it's not.
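The escaped escapes in the question are what a doubly encoded JSON document looks like: the handler returns a string produced by json.dumps, and the return value is then serialized to JSON a second time on its way back to the caller. The sketch below reproduces that without AWS. The simulate_invoke helper is hypothetical and only stands in for the serialization step, on the assumption that the Python runtime JSON-encodes whatever the handler returns.

```python
import json

TWEETS = {"tweets": [{"username": "CapitalGMC",
                      "text": "Protect your vehicle's paint!\U0001F447\n\nhttps://url/"}]}


def handler_dumps(event, context):
    # Returns a str, which will be JSON-encoded again by the caller below.
    return json.dumps(TWEETS)


def handler_plain(event, context):
    # Returns the dict itself, so it is encoded only once.
    return TWEETS


def simulate_invoke(handler):
    # Hypothetical stand-in for the service: serialize the handler's
    # return value to JSON and hand back bytes, like the Payload stream.
    return json.dumps(handler({}, None)).encode("utf-8")


double_payload = simulate_invoke(handler_dumps).decode("utf-8")
once = json.loads(double_payload)  # still a str containing JSON
twice = json.loads(once)           # now the original dict

single_payload = simulate_invoke(handler_plain).decode("utf-8")
plain = json.loads(single_payload)  # a dict after one decode

print(type(once).__name__, type(twice).__name__, type(plain).__name__)
```

So there are two ways out under that assumption: return the dict from the handler instead of a json.dumps string, or keep the handler as is and call json.loads twice on the client.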

Why my regex can not work in Python?

What I am trying to match is something like this:
public FUNCTION_NAME
FUNCTION_NAME proc near
......
FUNCTION_NAME endp
FUNCTION_NAME can be :
version_etc
version_etc_arn
version_etc_ar
and my pattern is:
pattern = "public\s+" + func_name + "[\s\S]*" + func_name + "\s+endp"
and match with:
match = re.findall(pattern, content)
So currently I find that if func_name equals version_etc, it will match
all of version_etc_arn, version_etc_ar and version_etc,
which means if the pattern is:
"public\s+" + "version_etc" + "[\s\S]*" + "version_etc" + "\s+endp"
then it will match:
public version_etc_arn
version_etc_arn proc near
......
version_etc_arn endp
public version_etc_ar
version_etc_ar proc near
......
version_etc_ar endp
public version_etc
version_etc proc near
......
version_etc endp
And I am trying to just match:
public version_etc
version_etc proc near
......
version_etc endp
Am I wrong? Could anyone give me some help?
Thank you!
[\s\S]* matches 0-or-more of anything, including the _arn that you are trying to exclude. Thus, you need to require a whitespace after func_name:
pattern = r"(?sm)public\s+{f}\s.*^{f}\s+endp".format(f=func_name)
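Here is a small runnable sketch of that pattern against the three procedures from the question. Two details are additions rather than part of the answer above: re.escape is applied to the function name in case it contains regex metacharacters, and `.*?` is used so the match stops at the first matching endp line. The helper name find_proc is made up.

```python
import re

content = """public version_etc_arn
version_etc_arn proc near
    ......
version_etc_arn endp
public version_etc_ar
version_etc_ar proc near
    ......
version_etc_ar endp
public version_etc
version_etc proc near
    ......
version_etc endp
"""


def find_proc(func_name, content):
    f = re.escape(func_name)
    # (?s): . also matches newlines; (?m): ^ matches at each line start.
    # The \s right after the name is what rules out version_etc_arn
    # when looking for version_etc.
    pattern = r"(?sm)^public\s+{f}\s.*?^{f}\s+endp".format(f=f)
    return re.findall(pattern, content)


matches = find_proc("version_etc", content)
print(len(matches))
print(matches[0])
```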
