Regex pattern matching unexpected value - python

I am using the following Python regex code to analyze values from the To field of an email:
import re
PATTERN = re.compile(r'''((?:[^(;|,)"']|"[^"]*"|'[^']*')+)''')
list = PATTERN.split(raw)[1::2]
The list should contain the name and address of each recipient, split on either "," or ";" as the separator. Separators inside quotes are to be ignored, because they are part of the name, often: "Last Name, First Name"
Most of the time this works well; however, in the following case I am getting unexpected behaviour:
"Some Name | Company Name" <name#example.com>
In this case it splits on the "|" character, even though the pattern selects the name and address as a whole when I check it on regex tester websites. What am I doing wrong?
Example input would be:
"Some Name | Company Name" <name1#example.com>, "Some Other Name | Company Name" <name2#example.com>, "Last Name, First Name" <name3#example.com>

This is not a direct answer to your question, but it addresses the problem you seem to be solving, so it may still be helpful:
To parse emails I always make extensive use of Python's email library.
In your case you could use something like this:
from email.utils import getaddresses
from email import message_from_string
msg = message_from_string(str_with_msg_source)
tos = msg.get_all('to', [])
ccs = msg.get_all('cc', [])
resent_tos = msg.get_all('resent-to', [])
resent_ccs = msg.get_all('resent-cc', [])
all_recipients = getaddresses(tos + ccs + resent_tos + resent_ccs)
for (name, address) in all_recipients:
    pass  # do some postprocessing on name or address if necessary
This has always reliably split names and addresses in mail headers for me.
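If only the raw To value is available rather than the whole message source, getaddresses can also be applied to that string directly. A minimal sketch, using the example input from the question and assuming the "#" in the sample addresses stands for "@":

```python
from email.utils import getaddresses

# Example To value from the question; "@" is assumed where the post shows "#".
raw = ('"Some Name | Company Name" <name1@example.com>, '
       '"Some Other Name | Company Name" <name2@example.com>, '
       '"Last Name, First Name" <name3@example.com>')

# getaddresses expects a list of header values and returns (name, address) pairs.
recipients = getaddresses([raw])

for name, address in recipients:
    print(name, '->', address)
```

The comma and the "|" inside the quoted names stay part of the name, so no custom splitting is needed for this input.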

You can split the text with a much simpler regex that uses lookarounds.
r'(?<=>)\s*,\s*(?=")'
Regex Explanation
\s*,\s* matches , which is surrounded by zero or more spaces (\s*)
(?<=>) Look behind assertion. Checks if the , is preceded by a >
(?=") Look ahead assertion. Checks if the , is followed by a "
Test
>>> re.split(r'(?<=>)\s*,\s*(?=")', string)
['"Some Name | Company Name" <name1#example.com>', '"Some Other Name | Company Name" <name2#example.com>', '"Last Name, First Name" <name3#example.com>']
Corrections
Case 1 In the above example, we used a single delimiter ,. If you wish to split on more than one delimiter, you can use a character class
r'(?<=>)\s*[,;]\s*(?=")'
[,;] Character class, matches , or ;
Case 2 As mentioned in comments, if the address part is missing, all we need to do is to add " to the look behind
Example
>>> string = '"Some Other Name | Company Name" <name2#example.com>, "Some Name, Nothing", "Last Name, First Name" <name3#example.com>'
>>> re.split(r'(?<=(?:>|"))\s*[,;]\s*(?=")', string)
['"Some Other Name | Company Name" <name2#example.com>', '"Some Name, Nothing"', '"Last Name, First Name" <name3#example.com>']
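As for the original pattern: inside a character class, "(", "|" and ")" are literal characters rather than grouping or alternation, so [^(;|,)"'] also excludes "|". The quoted example from the question is still consumed whole by the "[^"]*" alternative, so the sketch below assumes (a guess, not something the question confirms) that the value reached the regex without its quotes. ADJUSTED is a hypothetical variant of the pattern, and "@" is assumed where the post shows "#":

```python
import re

# Pattern from the question: "(", "|" and ")" inside [...] are literal characters.
ORIGINAL = re.compile(r'''((?:[^(;|,)"']|"[^"]*"|'[^']*')+)''')
# Hypothetical variant that excludes only the real separators and the quotes.
ADJUSTED = re.compile(r'''((?:[^;,"']|"[^"]*"|'[^']*')+)''')

# Assumed inputs: the same recipient with and without quotes around the name.
unquoted = 'Some Name | Company Name <name@example.com>'
quoted = '"Some Name | Company Name" <name@example.com>'

original_unquoted = ORIGINAL.split(unquoted)[1::2]
adjusted_unquoted = ADJUSTED.split(unquoted)[1::2]
original_quoted = ORIGINAL.split(quoted)[1::2]

print(original_unquoted)  # split in two at the unquoted "|"
print(adjusted_unquoted)  # kept as a single recipient
print(original_quoted)    # kept as a single recipient
```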

Related

Pandas Python Regular Expression Assistance

I wasn't sure what to call this title, feel free to edit it if you think there is a better name.
What I am trying to do is find cases that match certain search criteria.
Specifically, I am trying to find tags that contain the word "where". Once I have identified those, I am trying to find cases where the word "SqlCommand" also appears within that same tag.
Let's say I have a dataframe that looks like this:
search_criteria = ['where']
df4
Q R
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example" type="SqlCommand">select id, name, from table where criteria = '5'</property><sentence>dave hates stuff>
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example">select id, name, from table where criteria = '5'</properties><sentence>dave hates stuff>
I am trying to return this:
Q R
0 file.sql <properties>version = "2", description = "example">select id, name, from table</properties>
This record should get returned because it contains both "where" and "sqlcommand".
Here is my current process:
regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<[^<]*?' + 'where' + '[^>]*?>)', re.IGNORECASE)
sql_command_regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<property[^<]*?' + 'sqlcommand' + '[^>]*?<\/property>)', re.IGNORECASE)
if not regex_stuff.empty: #if one of the search criteria is found
    if not sql_command_regex_stuff.empty: #check to see if the phrase "sqlcommand" is found anywhere as well
        (insert rest of code)
This does not return anything.
What am I doing wrong?
Edit #1:
It seems like I need to do something at the end, to make the regex look something like this:
<property[^<]*?SqlCommand[^(<\/property>)]*
This doesn't work yet, but I feel like it is a step in the right direction.
You could just filter with str.contains:
df[(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
0 file.sql <sentence>dave likes stuff</sentence><properti...
or use ~ to return the opposite: rows that do not contain both 'where' and 'sqlcommand'
df[~(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
1 file.sql <sentence>dave likes stuff</sentence><properti...
First of all, you need well-formed XML and valid SQL content, so you should
make the following corrections:
As the opening tag is <properties>, the closing tag must also be
</properties>, not </property>.
version, description and type are attributes (the > closing the
opening tag comes after them), so after properties there
should be a space, not >.
Remove , after version="2".
Remove , after name.
Remove ( before <properties and ) after </properties>.
To find the required rows, use str.contains as the filtering
expression.
Below you have an example program:
import pandas as pd
import re
df4 = pd.DataFrame({
'Q' : 'file.sql',
'R' : [
'<s>dave</s><properties type="SqlCommand">select id, name '
'from table where criteria=\'5\'</properties><s>dave</s>',
'<s>dave</s><properties>select id, name from table '
'where criteria=\'6\'</properties><s>dave</s>',
'<s>mike</s><properties type="SqlCommand">drop table "Xyz"'
'</properties><s>mike</s>' ]})
df5 = df4[df4.R.str.contains(
'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
flags=re.IGNORECASE)]
print(df5)
Note that the regex takes care of the proper sequence of
strings:
First match <properties.
Then a sequence of chars other than < and > ([^<>]+?),
so we are still within the just opened XML tag.
Then match sqlcommand (ignoring case).
Then another sequence of chars other than < and >
([^<>]+?).
Then >, closing the tag.
Then another sequence of chars other than < and >
([^<>]+?).
And finally where (also ignoring case).
An attempt to check for sqlcommand and where in two separate
regexes is wrong, as these words can be at other locations,
which do not meet your requirement.
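The difference between the two approaches can be sketched without pandas. The snippet below uses only the re module; the function names and the two sample strings are made up for illustration:

```python
import re

# Same pattern as in the example program above.
COMBINED = re.compile(
    r'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
    re.IGNORECASE)

def separate_checks(text):
    """Two independent searches: the words may be anywhere in the text."""
    has_where = re.search('where', text, re.IGNORECASE) is not None
    has_sqlcommand = re.search('sqlcommand', text, re.IGNORECASE) is not None
    return has_where and has_sqlcommand

def combined_check(text):
    """One search: 'where' must follow a <properties ...sqlcommand...> tag."""
    return COMBINED.search(text) is not None

# Hypothetical rows: in the second one, 'where' sits in an unrelated tag.
wanted = ('<s>dave</s><properties type="SqlCommand">select id, name '
          'from table where criteria=\'5\'</properties>')
misleading = ('<s>where is dave</s><properties type="SqlCommand">'
              'drop table "Xyz"</properties>')

print(separate_checks(wanted), combined_check(wanted))
print(separate_checks(misleading), combined_check(misleading))
```

Both checks accept the first string, but only the separate checks accept the second one, which is the false positive described above.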

Execute only if string contains a ','?

I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file, and the column (annoyingly) contains the name, degree, and area of practice):
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
def parse_ieca_gc(s):
    ########################## HANDLE NAME ELEMENT ###############################
    degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
    degrees_list = []
    # separate area of practice from name and degree and bind this to var 'area'
    split_area_nmdeg = s['name'].split(',')
    area = split_area_nmdeg.pop() # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list, that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
    print 'split area nmdeg'
    print area
    print split_area_nmdeg
    # Split the name and deg by spaces. If there's a deg, it will match with one of elements and will be stored deg list. The deg is removed name_deg list and all that's left is the name.
    split_name_deg = re.split('\s',split_area_nmdeg[0])
    for word in split_name_deg:
        for deg in degrees:
            if deg == word:
                degrees_list.append(split_name_deg.pop())
    name = ' '.join(split_name_deg)
    # area of practice
    category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
if re.match(...) is not None:
Instead of looking for a boolean, use that. re.match returns a match object on success and None on failure.
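Both suggestions can be compared side by side. A small sketch, with has_comma as a made-up helper name and two of the sample rows from the question:

```python
import re

def has_comma(text):
    # re.search returns a match object or None, so compare against None
    # (or rely on truthiness) to get a boolean.
    return re.search(r',', text) is not None

rows = ['Sam da Man J.D.,CEP', 'Argle Bargle Sr. MA']

for row in rows:
    # The plain "in" test gives the same answer without any regex.
    print(row, has_comma(row), ',' in row)
```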
You are already searching for a comma. Just use the results of that search:
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 1:
    print "Your old code goes here"
else:
    print "Your new code goes here"
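Another option is str.partition, which always returns three parts, so the same code path handles rows with and without a comma. A minimal sketch; split_area is a hypothetical helper and returning None for a missing area is an assumption about what the caller wants:

```python
def split_area(raw_name):
    """Split 'name degree,area' into (name_and_degree, area or None)."""
    # partition returns (before, separator, after); the separator is ''
    # when no comma is present.
    name_and_degree, comma, area = raw_name.partition(',')
    return name_and_degree.strip(), (area.strip() if comma else None)

print(split_area('Sam da Man J.D.,CEP'))
print(split_area('Argle Bargle Sr. MA'))
```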

python regex get first part of an email address

I am quite new to Python and regex and I was wondering how to extract the first part of an email address, up to the domain name. So for example if:
s='xjhgjg876896#domain.com'
I would like the regex result to be (taking into account all "sorts" of email ids i.e including numbers etc..):
xjhgjg876896
I get the idea of regex - as in I know I need to scan till "#" and then store the result - but I am unsure how to implement this in python.
Thanks for your time.
You should just use the split method of strings:
s.split("#")[0]
As others have pointed out, the better solution is to use split.
If you're really keen on using regex then this should work:
import re
regexStr = r'^([^#]+)#[^#]+$'
emailStr = 'foo#bar.baz'
matchobj = re.search(regexStr, emailStr)
if matchobj is not None:
    print matchobj.group(1)
else:
    print "Did not match"
and it prints out
foo
NOTE: This is going to work only with email strings of SOMEONE#SOMETHING.TLD. If you want to match emails of type NAME<SOMEONE#SOMETHING.TLD>, you need to adjust the regex.
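One way to cover the NAME<SOMEONE@SOMETHING.TLD> form without changing the regex is to let email.utils.parseaddr strip the display name first. A sketch under the assumption that the "#" in this thread stands for "@"; local_part is a made-up helper name:

```python
import re
from email.utils import parseaddr

# Same regex as above, written with "@".
LOCAL_PART = re.compile(r'^([^@]+)@[^@]+$')

def local_part(value):
    """Return the part before '@', or None if no address is found."""
    # parseaddr splits 'Name <addr>' into (name, addr).
    _name, address = parseaddr(value)
    match = LOCAL_PART.search(address)
    return match.group(1) if match else None

print(local_part('foo@bar.baz'))
print(local_part('John Smith <john.smith@example.org>'))
```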
You shouldn't use a regex or split.
local, at, domain = 'john.smith#example.org'.rpartition('#')
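The reason rpartition is safer shows up with a quoted local part that itself contains the separator. A small sketch, assuming "@" where this thread shows "#":

```python
# A quoted local part may legally contain "@" characters.
address = '"@@@@@"@example.com'

# rpartition splits at the LAST "@", which is the one before the domain.
local, at, domain = address.rpartition('@')

# split("@")[0] stops at the FIRST "@" and truncates the local part.
naive_local = address.split('@')[0]

print(local, domain)
print(naive_local)
```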
You have to use a proper RFC 5322 parser.
"#####"#example.com is a valid email address, and semantically its local part ("#####") is different from its username (#####).
As of Python 3.6, you can use email.headerregistry:
from email.headerregistry import Address
s='xjhgjg876896#domain.com'
Address(addr_spec=s).username # => 'xjhgjg876896'
#!/usr/bin/python3.6
def email_splitter(email):
    username = email.split('#')[0]
    domain = email.split('#')[1]
    domain_name = domain.split('.')[0]
    domain_type = domain.split('.')[1]
    print('Username : ', username)
    print('Domain : ', domain_name)
    print('Type : ', domain_type)
email_splitter('foo.goo#bar.com')
Output :
Username : foo.goo
Domain : bar
Type : com
Here is another way, using the index method.
s='xjhgjg876896#domain.com'
# Now let's find the location of the "#" sign
index = s.index("#")
# Next let's get the string from the beginning up to the location of the "#" sign.
s_id = s[:index]
print(s_id)
And the output is
xjhgjg876896
You need to install the package first:
pip install email_split
from email_split import email_split
email = email_split("ssss#ggh.com")
print(email.domain)
print(email.local)
The following should help you do it:
fromAddr = message.get('From').split('#')[1].rstrip('>')
fromAddr = fromAddr.split(' ')[0]
Good answers have already been given, but I want to add mine anyway.
If I have an email john#gmail.com, I want to get just "john".
If I have an email john.joe#gmail.com, I also want to get just "john".
So this is what I did:
name = recipient.split("#")[0]
name = name.split(".")[0]
print name
cheers
You can also try to use email_split.
from email_split import email_split
email = email_split('xjhgjg876896#domain.com')
email.local # xjhgjg876896
email.domain # domain.com
You can find more here https://pypi.org/project/email_split/ . Good luck :)
The following will return the continuous text before #
re.findall(r'(\S+)#', s)
You can find all the words in the email and then return the first word.
import re
def returnUserName(email):
    return re.findall(r"\w*", email)[0]
print(returnUserName("johns123.ss#google.com")) #Output is - johns123
print(returnUserName('xjhgjg876896#domain.com')) #Output is - xjhgjg876896

Method for parsing text Cc field of email header?

I have the plain text of a Cc header field that looks like so:
friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>
Are there any battle tested modules for parsing this properly?
(bonus if it's in python! the email module just returns the raw text without any methods for splitting it, AFAIK)
(also a bonus if it splits name and address into two fields)
There are a bunch of functions available in the standard Python library, but I think you're looking for
email.utils.parseaddr() or email.utils.getaddresses()
>>> addresses = 'friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>'
>>> email.utils.getaddresses([addresses])
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'), ('Smith, Jane', 'jane.smith#uconn.edu')]
I haven't used it myself, but it looks to me like you could use the csv package quite easily to parse the data.
The code below is completely unnecessary. I wrote it before realising that you could pass getaddresses() a list containing a single string containing multiple addresses.
I haven't had a chance to look at the specifications for addresses in email headers, but based on the string you provided, this code should do the job splitting it into a list, making sure to ignore commas if they are within quotes (and therefore part of a name).
from email.utils import getaddresses
addrstring = ',friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>,'
def addrparser(addrstring):
    addrlist = ['']
    quoted = False
    # ignore comma at beginning or end
    addrstring = addrstring.strip(',')
    for char in addrstring:
        if char == '"':
            # toggle quoted mode
            quoted = not quoted
            addrlist[-1] += char
        # a comma outside of quotes means a new address
        elif char == ',' and not quoted:
            addrlist.append('')
        # anything else is the next letter of the current address
        else:
            addrlist[-1] += char
    return getaddresses(addrlist)
print addrparser(addrstring)
Gives:
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'),
('Smith, Jane', 'jane.smith#uconn.edu')]
I'd be interested to see how other people would go about this problem!
Convert a single string containing multiple email addresses (with names) into a dictionary.
import email.utils
emailstring = 'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>'
Split the string by comma:
email_list = emailstring.split(',')
Build a dictionary with the name as key and the email address as value:
email_dict = dict(map(lambda x: email.utils.parseaddr(x), email_list))
The result looks like this:
{'John Smith': 'john.smith#email.com', 'Friends': 'friend#email.com', 'Smith': 'jane.smith#uconn.edu'}
Note:
If the same name appears with different email addresses, only one record is kept, because dictionary keys are unique. For example, in
'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>, Friends <friend_co#email.com>'
"Friends" appears twice, so only the last of its addresses ends up in the dictionary.
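A variation that keeps every address for a repeated name, and that also copes with commas inside quoted names, is to let getaddresses do the splitting and collect the addresses per name. A sketch only: addresses_by_name is a made-up helper, and "@" is assumed where this thread shows "#":

```python
from collections import defaultdict
from email.utils import getaddresses

def addresses_by_name(header_value):
    """Map each display name to the list of all its addresses."""
    grouped = defaultdict(list)
    # getaddresses handles quoted names, so "Smith, Jane" is not split.
    for name, address in getaddresses([header_value]):
        grouped[name].append(address)
    return dict(grouped)

emailstring = ('Friends <friend@email.com>, "Smith, Jane" <jane.smith@uconn.edu>, '
               'Friends <friend_co@email.com>')
print(addresses_by_name(emailstring))
```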

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)) # strip the trailing spaces captured by [\w\s]+
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. Note that \w already covers letters, digits and the underscore; if there can be dashes or other punctuation here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
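Putting the pieces together, the captured groups can be written straight to CSV with the standard csv module. The sketch below is one possible arrangement, not the only one: it uses named groups with a looser (.*) pattern, leaves out the geocoding columns, and writes to an in-memory buffer; TITLE and parse_title are made-up names:

```python
import csv
import io
import re

# Named groups make the three parts easier to refer to later.
TITLE = re.compile(r'(?P<band>.*) \((?P<venue>.*) (?P<date>\d+/\d+)\)')

def parse_title(title):
    """Return a dict with band, venue and date, or None if no match."""
    match = TITLE.match(title)
    return match.groupdict() if match else None

# The two sample titles from the question, with the trailing ")" kept.
titles = ['randy travis (Billy Bobs 3/21)',
          'Michael Schenker Group (House of Blues Dallas 3/26)']

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['band', 'venue', 'date'])
for title in titles:
    parsed = parse_title(title)
    if parsed:
        writer.writerow([parsed['band'], parsed['venue'], parsed['date']])

print(buf.getvalue())
```

csv.writer also takes care of quoting if a band or venue name ever contains a comma.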
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # join() needs strings, so convert the coordinates first
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
