I have the plain text of a Cc header field that looks like so:
friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>
Are there any battle-tested modules for parsing this properly?
(bonus if it's in python! the email module just returns the raw text without any methods for splitting it, AFAIK)
(also bonus if it splits name and address into two fields)
There are a bunch of functions available in the standard Python library's email.utils module, but I think you're looking for
email.utils.parseaddr() or email.utils.getaddresses()
>>> import email.utils
>>> addresses = 'friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>'
>>> email.utils.getaddresses([addresses])
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'), ('Smith, Jane', 'jane.smith#uconn.edu')]
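If you start from a raw message rather than the bare header text, a minimal sketch along these lines may help. It assumes addresses written with a regular @ at the placeholder domain example.com (not the values from the question); get_all() collects every Cc header so that getaddresses() can split them into (name, address) pairs.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical raw message; the addresses are placeholders.
raw = (
    'Cc: friend@example.com, John Smith <john.smith@example.com>, '
    '"Smith, Jane" <jane.smith@example.com>\n'
    '\n'
    'body text\n'
)

msg = message_from_string(raw)

# get_all() returns a list with one string per Cc header (or the default).
cc_values = msg.get_all('Cc', [])

# getaddresses() accepts that list and returns (name, address) tuples.
pairs = getaddresses(cc_values)
names = [name for name, _ in pairs]
addrs = [addr for _, addr in pairs]
```

The name and address then live in separate lists (or stay paired in `pairs`), which covers the "two fields" part of the question.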
I haven't used it myself, but it looks to me like you could use the csv package quite easily to parse the data.
The code below is completely unnecessary. I wrote it before realising that you could pass getaddresses() a list containing a single string containing multiple addresses.
I haven't had a chance to look at the specifications for addresses in email headers, but based on the string you provided, this code should do the job of splitting it into a list, making sure to ignore commas if they are within quotes (and therefore part of a name).
from email.utils import getaddresses
addrstring = ',friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>,'
def addrparser(addrstring):
    addrlist = ['']
    quoted = False
    # ignore comma at beginning or end
    addrstring = addrstring.strip(',')
    for char in addrstring:
        if char == '"':
            # toggle quoted mode
            quoted = not quoted
            addrlist[-1] += char
        # a comma outside of quotes means a new address
        elif char == ',' and not quoted:
            addrlist.append('')
        # anything else is the next letter of the current address
        else:
            addrlist[-1] += char
    return getaddresses(addrlist)
print(addrparser(addrstring))
Gives:
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'),
('Smith, Jane', 'jane.smith#uconn.edu')]
I'd be interested to see how other people would go about this problem!
To convert a string holding several e-mail addresses (with names) into a dictionary:
import email.utils
emailstring = 'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>'
Split the string on commas:
email_list = emailstring.split(',')
Build a dictionary with the name as key and the address as value:
email_dict = dict(email.utils.parseaddr(x) for x in email_list)
The result looks like this:
{'John Smith': 'john.smith#email.com', 'Friends': 'friend#email.com', 'Smith': 'jane.smith#uconn.edu'}
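A variation, sketched under the assumption that you would rather keep every address for a name: let getaddresses() do the splitting (so a quoted comma such as "Smith, Jane" survives) and collect the addresses in a list per name. The function name group_by_name and the example.com addresses are made up for illustration.

```python
from collections import defaultdict
from email.utils import getaddresses


def group_by_name(header_value):
    """Map each display name to the list of addresses seen for it."""
    grouped = defaultdict(list)
    # getaddresses() splits on commas outside quotes and angle brackets.
    for name, addr in getaddresses([header_value]):
        grouped[name].append(addr)
    return dict(grouped)


# Placeholder input with a repeated name and a quoted comma.
header = ('Friends <friend@example.com>, John Smith <john.smith@example.com>, '
          '"Smith, Jane" <jane.smith@example.com>, Friends <friend_co@example.com>')
grouped = group_by_name(header)
```

Here `grouped['Friends']` holds both addresses instead of only the last one.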
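Note on the split(',') version:
If the same name appears with different e-mail addresses, only the last one is kept, because dictionary keys are unique. For example, in
'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>, Friends <friend_co#email.com>'
"Friends" appears twice, so friend#email.com is overwritten by friend_co#email.com. Splitting on commas also breaks quoted names that themselves contain a comma, such as "Smith, Jane".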
I have a text and I have got a task in python with reading module:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have any ideas about how it can be improved? (It raises an error after some words; I guess the error happens because one of the "Mr." occurrences is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
Counter(m)  # requires: from collections import Counter
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
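Putting those pieces together, a minimal sketch might look like the following. The sample text is made up, and the pattern is a slight variation on the one above (an assumption on my part): it uses \s+ between words so that a "Mr." at the end of a line still picks up the name on the next line, and it captures only the name so no slicing is needed.

```python
import re
from collections import Counter

# "Mr." followed by one or more capitalised words; \s+ also matches a newline.
MR_NAME = re.compile(r'Mr\.((?:\s+[A-Z][a-z]+)+)')


def count_mr_names(text):
    """Count each name that follows 'Mr.' in the text."""
    # Collapse any whitespace (including line breaks) inside a name.
    names = [' '.join(match.split()) for match in MR_NAME.findall(text)]
    return Counter(names)


# Made-up sample; note the line break right after the last "Mr."
sample = ("When did Mr. Churchill tell Mr. James Brown about the fish? "
          "Later Mr.\nChurchill left.")
counts = count_mr_names(sample)
```

A Counter is a dict subclass, so `dict(counts)` gives the plain dictionary the task asks for. A capitalised word that directly follows a name (for example at the start of a new sentence without punctuation) would be swallowed into the name, so the pattern may need tuning for the real text.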
This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question
I have an API feeding into a Django many-to-many model field in my program. The names of the individuals within my database are stored as a separate first name and last name. However, the API sends a bulk list of names as a single string, like so: "Jones, Bob Smith, Jason Donald, Mic" (last name, comma, space, first name, space, next last name, etc.)
How would I separate this string in a way that would allow me to filter and add a particular user to the many-to-many field?
Thanks!!
This answer excludes the case where a first name or last name contains a space (this case is much more complicated, as you will have a word with a space on its left AND on its right).
You need to replace the -comma-space- by something without a space (because you also have a space between two different names).
string = "Jones, Bob Smith, Jason Donald, Mic"
names = []
for name in string.replace(', ', ',').split(' '):
    name = name.split(',')
    last_name = name[0]
    first_name = name[1]
    names.append((last_name, first_name))
names
Output:
[('Jones', 'Bob'), ('Smith', 'Jason'), ('Donald', 'Mic')]
You can use regex:
import re
s = "Jones, Bob Smith, Jason Donald, Mic"
re.findall(r'(\S+), (\S+)', s)
# [('Jones', 'Bob'), ('Smith', 'Jason'), ('Donald', 'Mic')]
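To get from those tuples to something you can filter with, one option is to turn each pair into a dictionary of keyword arguments. This is only a sketch: the keys last_name and first_name are assumptions about how the model fields are named, and split_names is a made-up helper.

```python
import re

# Same pattern as above: "<last>, <first>" with no spaces inside a name.
PAIR = re.compile(r'(\S+), (\S+)')


def split_names(bulk):
    """Turn 'Last, First Last, First ...' into a list of lookup dicts."""
    return [
        {'last_name': last, 'first_name': first}
        for last, first in PAIR.findall(bulk)
    ]


people = split_names("Jones, Bob Smith, Jason Donald, Mic")
```

Each dictionary could then be unpacked into a queryset lookup, for example filter(**person) on whatever model holds the individuals, before adding the match to the many-to-many field.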
I have a file email.txt which has email addresses as follows:
James, Brian < brian.james#abc.com>; Attar, Daniel < Daniel.Attar#abc.com>; Alex, James < james.alex#abc.com>; Trendy, Elizabeth < elizabeth.trendy#abc.com>; jones, Gary < Gary.Jones#abc.com>; bones, byron < byron.bones#abc.com>;
I want to write the email addresses into a .csv file in one column like this:
brian.james#abc.com
daniel.attar#abc.com
...
byron.bones#abc.com
I wrote a Python script as follows which does this:
fn1 = "email.txt"
f1 = open(fn1, "r")
f1r1 = f1.readlines()
f1r2 = [i.strip() for i in f1r1]
f1r3 = [i.split(";") for i in f1r2]
s1 = f1r3[0]
a = open("ef.csv", "w")
for i in s1:
    j = i.split("<")
    a.write(j[1].strip(">")+"\n")
a.close()
Is there a better, more efficient or more elegant way to write this?
You could consider reading the contents of the text file as a single string and then using re to extract the emails from that string.
In this case, it looks like your email format is fairly specific, so the regex below is specific too. Realize, though, that a regex capable of finding any RFC 5322-compliant email address (the "official standard" for email address formats) is several hundred characters long. For more on that see How to Find or Validate an Email Address from Jan Goyvaerts.
Anyway...
import re
with open('email.txt', 'r') as file:
    # Produces a single string, `emails`
    emails = file.read().replace('\n', '')
regex = re.compile(r'\S+\.\S+#abc\.com')
for email in regex.findall(emails):
    print(email)
# brian.james#abc.com
# Daniel.Attar#abc.com
# james.alex#abc.com
# elizabeth.trendy#abc.com
# Gary.Jones#abc.com
# byron.bones#abc.com
Regex walkthrough: this regex assumes each email takes a pretty specific form: something<dot>somethingelse<at>abc.com.
\S+ is 1 or more non-whitespace characters
\. is a literal period (backslashing a metacharacter)
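If the domain is not always the same, a looser sketch is to take whatever sits between the angle brackets and write it out with the csv module. The helper names are made up, the sample is a shortened copy of the line from the question, and an in-memory buffer stands in for the real ef.csv file; the addresses are lowercased because the desired output in the question is lowercase.

```python
import csv
import io
import re

# Anything between "<" and ">", ignoring the spaces around it.
ADDRESS_IN_BRACKETS = re.compile(r'<\s*([^<>\s]+)\s*>')


def extract_addresses(text):
    """Return the bracketed addresses, lowercased, in order of appearance."""
    return [addr.lower() for addr in ADDRESS_IN_BRACKETS.findall(text)]


def write_column(addresses, handle):
    """Write one address per row to an open text file or buffer."""
    writer = csv.writer(handle, lineterminator='\n')
    for addr in addresses:
        writer.writerow([addr])


sample = ("James, Brian < brian.james#abc.com>; "
          "Attar, Daniel < Daniel.Attar#abc.com>; "
          "bones, byron < byron.bones#abc.com>;")
addresses = extract_addresses(sample)

buffer = io.StringIO()  # stand-in for open("ef.csv", "w", newline="")
write_column(addresses, buffer)
```

Because the pattern only looks at the brackets, the trailing ";" and the name part in front of each address are ignored.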
I am working on a research project and I have a list of about ~200 names and 6 email addresses. The requirement is to map every one of those emails to a single email address following this requirement:
"Names starting with A, B, C, D, E will map to email1. F, G, H, I, J will map to email2" and so on and so forth.
Now I'm trying to think of a way to map those names to the specific email in a fashion of "if name starts with A-E then email1", rather than iterating through all the names and checking the starting letter of each name. Is there a way to accomplish this? I'm thinking RegEx might help, but not sure exactly how (possibly something along the lines of ^[a-eA-E]?)
The re module has an undocumented Scanner class which can be used to attach an arbitrary function call to regex patterns. When the Scanner.scan method is called, the supplied text is matched against each regex pattern, and the associated function is called when a match is found. The scan method ends when the remaining text matches none of the patterns.
import re
def make_email(i):
    def email(scanner, token):
        print('{t}: Send to email{i}'.format(t=token, i=i))
    return email
scanner = re.Scanner(
    [(pat, make_email(i))  # 2
     for i, pat in enumerate((r"^[a-e]\w+", r"^[f-j]\w+"))]  # 1
    + [(r"\s+", None)],
    flags=re.IGNORECASE|re.MULTILINE)
scanner.scan("""\
Albert
Barry
Carrie
David
Erin
Franklin
Geoff
Harold
Isadore
Jay""")
prints
Albert: Send to email0
Barry: Send to email0
Carrie: Send to email0
David: Send to email0
Erin: Send to email0
Franklin: Send to email1
Geoff: Send to email1
Harold: Send to email1
Isadore: Send to email1
Jay: Send to email1
1. You can add more regex patterns here.
2. The Scanner class is initialized with a list of 2-tuples. Each 2-tuple consists of a regex pattern and the associated callback function.
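The callbacks do not have to print. As far as I can tell from the re source, scan() returns a pair: the list of non-None values returned by the callbacks, and the unmatched remainder of the text. A small sketch under that assumption (the tag helper and the sample names are made up):

```python
import re


def tag(label):
    """Build a callback that pairs each matched token with a label."""
    def action(scanner, token):
        return (token, label)
    return action


scanner = re.Scanner(
    [(r"[a-e]\w*", tag("email0")),
     (r"[f-j]\w*", tag("email1")),
     (r"\s+", None)],          # None means: match and discard
    flags=re.IGNORECASE)

# "Zed" matches no pattern, so scanning stops there.
results, remainder = scanner.scan("Albert Geoff erin Zed")
```

Anything left in `remainder` is the part of the input that no pattern covered, which is a convenient way to spot names that fall outside the configured ranges.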
The simple and straightforward solution is to create a simple dictionary with regexes as keys, and loop over those.
import re
mappings = { r'^[a-e]': "email0", r'^[f-j]': "email1" }
# names is assumed to be your list of names
for name in names:
    for regex in mappings:
        if re.match(regex, name, flags=re.IGNORECASE):
            print("%s: send to %s" % (name, mappings[regex]))
            break
    else:
        print("%s: no match" % name)
If you do this on an industrial scale, you would probably want to precompile the regexes with re.compile() but for a quick and dirty solution, this gets the job done.
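A precompiled version could look roughly like this. It is a sketch rather than a drop-in replacement: RULES and route are made-up names, and the email0/email1 labels are carried over from the mapping above.

```python
import re

# Compile each pattern once; match() already anchors at the start of the string.
RULES = [
    (re.compile(r'[a-e]', re.IGNORECASE), 'email0'),
    (re.compile(r'[f-j]', re.IGNORECASE), 'email1'),
]


def route(name):
    """Return the label of the first rule that matches, or None."""
    for pattern, target in RULES:
        if pattern.match(name):
            return target
    return None


routed = {name: route(name) for name in ('Albert', 'geoff', 'Zed')}
```

A list of tuples is used instead of a dictionary so the order in which the rules are tried is explicit.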
You only need to know the first letter in each name, and map it to an email address. You don't need a regex for that.
def address(name):
    addresses = ['foo#bar.com', 'spam#eggs.org', ... ]
    i = 'abcdefghijklmnopqrstuvwxyz'.find(name[0].lower()) // 5
    return addresses[i]
Then you want to iterate over the names.
for name in names: print(name, address(name))
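One thing to watch with the index arithmetic: find() returns -1 for a character that is not a letter, and -1 // 5 is -1, which silently selects the last address. A sketch that builds the lookup table once and returns None for such names instead (the six placeholder addresses and the helper name are assumptions):

```python
import string


def build_letter_map(addresses, group_size=5):
    """Map each lowercase letter to an address, five letters per address."""
    letter_map = {}
    for index, letter in enumerate(string.ascii_lowercase):
        # Clamp so leftover letters fall into the last bucket.
        bucket = min(index // group_size, len(addresses) - 1)
        letter_map[letter] = addresses[bucket]
    return letter_map


# Placeholder addresses standing in for the six real ones.
addresses = ['email1', 'email2', 'email3', 'email4', 'email5', 'email6']
letter_map = build_letter_map(addresses)


def address_for(name):
    """Look up the address for a name, or None if it has no usable initial."""
    return letter_map.get(name[:1].lower())


assigned = [address_for(n) for n in ('Albert', 'franklin', 'Zoe', '9lives', '')]
```

With six addresses, A-E go to the first, F-J to the second, and Z ends up alone in the sixth bucket.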
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudo-Python, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, call group() on the info object (the first two groups keep their trailing space, so strip it):
print(info.group(1))  # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
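A variation on the same idea, offered as a sketch: named groups make the result self-describing, and lazy quantifiers avoid both the trailing spaces and the case where a two-digit month loses its first digit to the venue group. The pattern and the parse_title helper are my own, not part of the feed or any library, and titles that contain extra parentheses would still need special handling.

```python
import re

# band, then "(venue date", with the closing parenthesis optional.
TITLE = re.compile(
    r'^(?P<band>.+?)\s*\((?P<venue>.+?)\s+(?P<date>\d+/\d+)\)?$'
)


def parse_title(title):
    """Return a dict with band, venue and date, or None if nothing matches."""
    match = TITLE.match(title)
    return match.groupdict() if match else None


with_paren = parse_title('Michael Schenker Group (House of Blues Dallas 12/26)')
without_paren = parse_title('randy travis (Billy Bobs 3/21')
no_match = parse_title('not a concert title')
```

Returning None for titles that do not fit lets the calling loop skip malformed entries instead of raising.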
The Python introduction to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
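To make the point concrete, here is a tiny sketch using a title like the ones in the question. repr() simply returns another string, the original text wrapped in quotes, which is what the interactive window displays.

```python
title = "randy travis (Billy Bobs 3/21)"

# Slicing drops the final character; the result is still a plain string.
trimmed = title[0:-1]

# repr() returns a new string that includes the surrounding quotes.
shown = repr(trimmed)

# A string containing an apostrophe is shown with double quotes instead.
apostrophe = repr("it's not its")
```

So the regex should be applied to the title itself (or the slice), not to its repr().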
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
from geopy import geocoders  # from GeoPy
import feedparser  # from www.feedparser.org
us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        # band is the name the user selected
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
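If a venue name can contain a comma, joining the fields by hand produces a broken file. A hedged alternative is the csv module, which quotes such fields for you. In this sketch the rows are made-up sample data (the second venue is invented to show the quoting) and an in-memory buffer stands in for the output file.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise (band, venue, date, lat, long) rows with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows; the coordinates are the dummy values from the question.
rows = [
    ("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678),
    ("Some Band", "Club, The", "3/26", 4321.8765, 4321.8765),
]
result = rows_to_csv(rows)
```

The writer also converts the numeric latitude and longitude to text, so no explicit str() calls are needed.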