How to eliminate email formatting in received email? - python

I am practicing sending emails with Google App Engine with Python. This code checks to see if message.sender is in the database:
class ReceiveEmail(InboundMailHandler):
    def receive(self, message):
        querySender = User.all()
        querySender.filter("userEmail =", message.sender)
        senderInDatabase = None
        for match in querySender:
            senderInDatabase = match.userEmail
This works in the development server because I send the email as "az#example.com" and message.sender="az#example.com"
But I realized that in the production server emails come formatted as "az <az#example.com>" and my code fails because now message.sender="az <az#example.com>" but the email in the database is simply "az#example.com".
I searched for how to do this with regex and it is possible but I was wondering if I can do this with Python lists? Or, what do you think is the best way to achieve this result? I need to take just the email address from the message.sender.
App Engine documentation acknowledges the formatting but I could not find a specific way to select the email address only.
Thanks!
EDIT2 (re: Forest answer)
@Forest:
parseaddr() appears to be simple enough:
>>> e = "az <az#example.com>"
>>> parsed = parseaddr(e)
>>> parsed
('az', 'az#example.com')
>>> parsed[1]
'az#example.com'
>>>
But this still does not cover the other type of formatting that you mention: user#example.com (Full Name)
>>> e2 = "<az#example.com> az"
>>> parsed2 = parseaddr(e2)
>>> parsed2
('', 'az#example.com')
>>>
Is there really a formatting where full name comes after the email?
EDIT (re: Adam Bernier answer)
My try about how the regex works (probably not correct):
r # raw string
< # first limit character
( # what is inside () is matched
[ # indicates a set of characters
^ # start of string
> # start with this and go backward?
] # end set of characters
+ # repeat the match
) # end group
> # end limit character

Rather than storing the entire contents of a To: or From: header field as an opaque string, why don't you parse incoming email and store email address separately from full name? See email.utils.parseaddr(). This way you don't have to use complicated, slow pattern matching when you want to look up an address. You can always reassemble the fields using formataddr().
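A minimal sketch of that suggestion, assuming the sender string uses a normal @ sign; the helper name sender_address is made up for illustration and is not part of App Engine or the standard library:

```python
from email.utils import parseaddr

def sender_address(raw_sender):
    """Return only the address part of a From-style header value."""
    # parseaddr returns a (display_name, address) tuple; the display name
    # is ignored here because only the address is used for the lookup.
    _display_name, address = parseaddr(raw_sender)
    return address

print(sender_address("az <az@example.com>"))
print(sender_address("az@example.com (Full Name)"))
```

The result could then be passed to the datastore filter in place of message.sender.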

If you want to use regex try something like this:
>>> import re
>>> email_string = "az <az#example.com>"
>>> re.findall(r'<([^>]+)>', email_string)
['az#example.com']
Note that the above regex handles multiple addresses...
>>> email_string2 = "az <az#example.com>, bz <bz#example.com>"
>>> re.findall(r'<([^>]+)>', email_string2)
['az#example.com', 'bz#example.com']
but this simpler regex doesn't:
>>> re.findall(r'<(.*)>', email_string2)
['az#example.com>, bz <bz#example.com'] # matches too much
Using slices—which I think you meant to say instead of "lists"—seems more convoluted, e.g.:
>>> email_string[email_string.find('<')+1:-1]
'az#example.com'
and if multiple:
>>> email_strings = email_string2.split(',')
>>> for s in email_strings:
...     s[s.find('<')+1:-1]
...
'az#example.com'
'bz#example.com'
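For readers wondering how the pattern <([^>]+)> works, here is the same regex spelled out with re.VERBOSE so each piece carries its own comment. This is only an annotated sketch; the name ANGLE_ADDR is made up, and the sample addresses assume a normal @ sign:

```python
import re

ANGLE_ADDR = re.compile(r"""
    <          # a literal opening angle bracket
    (          # start of the capturing group, the part findall returns
      [^>]+    # one or more characters that are NOT a closing bracket
    )          # end of the capturing group
    >          # a literal closing angle bracket
""", re.VERBOSE)

print(ANGLE_ADDR.findall("az <az@example.com>, bz <bz@example.com>"))
```

Inside square brackets a leading ^ negates the set rather than anchoring to the start of the string, which is why the match stops at the first closing bracket instead of running on to the last one.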

Related

How to remove duplicated results of regular expression (re) in Python

There is a string:
str = 'Please Contact Prof. Zheng Zhao: Zheng.Z#xxx.com for details, or our HR: john.will#xxx.com'
I wanted to parse all of the email in that string, so I set:
p = r'[\w\.]+#[\w\.]+'
re.findall(p, str)
And the result was:
['zheng.z#xxx.com', 'Zheng.Z#xxx.com', 'john.will#xxx.com']
Apparently, the first and the second are duplicated. How do we prevent this from happening?
You can remove duplicates using a set. A set is like an unordered list which can't contain duplicates. In this case, you don't care about case, so making the results lowercase will let you properly check for duplicates.
import re
s = 'Please Contact Prof. Zheng Zhao: Zheng.Z#xxx.com for details, or our HR: john.will#xxx.com'
p = r'[\w\.]+#[\w\.]+'
list(set(result.lower() for result in re.findall(p, s)))
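Because a set does not keep the order in which the addresses appeared, a variant that does keep it may be useful. The sketch below assumes Python 3.7+ (where dict preserves insertion order) and addresses written with a normal @ sign; unique_emails is a made-up helper name:

```python
import re

def unique_emails(text):
    """Return the lowercased addresses found in text, first occurrence first."""
    pattern = r'[\w.]+@[\w.]+'
    matches = (m.lower() for m in re.findall(pattern, text))
    # dict.fromkeys drops later duplicates while preserving insertion order.
    return list(dict.fromkeys(matches))

print(unique_emails('Write to Zheng.Z@xxx.com or zheng.z@xxx.com, or HR: john.will@xxx.com'))
```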

Scrapy regexp for sitemap_follow

if you have a sitemap.xml containing:
abc.com/sitemap-1.xml
abc.com/sitemap-2.xml
abc.com/image-sitemap.xml
how do i write sitemap_follow to read only the sitemap-xxx sitemaps and not image-sitemap.xml?
I tried with
^sitemap
with no luck. What should I do? negate "image"? How?
Edit:
Scrapy code:
self._follow = [regex(x) for x in self.sitemap_follow]
and
if any(x.search(loc) for x in self._follow):
The regex is applied to the whole url. The only way I see a solution without modifying Scrapy is to have a Scraper just for abc.com and add it to the regex OR just add the / to the regex
To answer your question naively and directly I offer this code. In other words, I can match each of the items in the sitemap index file using the regex ^.*$.
>>> import re
>>> sitemap_index_file_content = [
...     'abc.com/sitemap-1.xml',
...     'abc.com/sitemap-2.xml',
...     'abc.com/image-sitemap.xml'
... ]
>>> for s in sitemap_index_file_content:
...     m = re.match(r'^.*$', s)
...     if m:
...         m.group()
...
'abc.com/sitemap-1.xml'
'abc.com/sitemap-2.xml'
'abc.com/image-sitemap.xml'
This implies that you would set sitemap_follow in the following way, since the spiders documentation says that this variable expects to receive a list.
>>> sitemap_follow = ['^.*$']
But then the same page of documentation says, 'By default, all sitemaps are followed.' Thus, this would appear to be entirely unnecessary.
I wonder what you are trying to do.
EDIT: In response to a comment. You might be able to do this using what is called a 'negative lookbehind assertion'; in this case that's the (?<!image-). My reservation about this is that you need to be able to scan over stuff like abc.com at the beginnings of the URLs, which could present quite fascinating challenges.
>>> for s in sitemap_index_file_content:
...     m = re.match(r'[^\/]*\/(?<!image-)sitemap.*', s)
...     if m:
...         m.group()
...
'abc.com/sitemap-1.xml'
'abc.com/sitemap-2.xml'
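Since the question's edit shows Scrapy applying x.search(loc) to the whole URL, a pattern that includes the slash may be enough without any lookbehind. The sketch below only uses the re module to check the idea; the pattern itself is a suggestion, not something taken from the Scrapy documentation:

```python
import re

# Suggested pattern: the last path segment must start with "sitemap-".
follow_pattern = re.compile(r"/sitemap-[^/]*\.xml$")

urls = [
    'abc.com/sitemap-1.xml',
    'abc.com/sitemap-2.xml',
    'abc.com/image-sitemap.xml',
]

# search() scans the whole URL, mirroring the x.search(loc) call quoted above.
followed = [u for u in urls if follow_pattern.search(u)]
print(followed)
```

If that holds up, the spider attribute would presumably be set as sitemap_follow = [r"/sitemap-[^/]*\.xml$"].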
One option for skipping URLs is to override sitemap_filter() in your spider class:
import logging
import re
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "scraperapi_sitemap"
    # Your current code goes here ...

    def sitemap_filter(self, entries):
        """This method can be used to filter sitemap entries by their
        attributes, for example, you can filter locs with lastmod greater
        than a given date (see docs) or skip entries by complex regex.
        """
        image_url_pattern = '.*/image-.*'
        for entry in entries:
            result = re.match(image_url_pattern, entry['loc'])
            if result:
                logging.info("Skipping " + str(entry))
            else:
                yield entry

Properly mapping first names to emails?

I am working on a research project and I have a list of about 200 names and 6 email addresses. The requirement is to map every one of those names to a single email address following this rule:
"Names starting with A, B, C, D, E will map to email1. F, G, H, I, J will map to email2" and so on and so forth.
Now I'm trying to think of a way to map those names to the specific email in a fashion of "if name starts with A-E then email1", rather than iterating through all the names and checking the starting letter of each name. Is there a way to accomplish this? I'm thinking RegEx might help, but I'm not sure exactly how (possibly something along the lines of ^[a-eA-E]?)
The re module has an undocumented Scanner class which can be used to attach an arbitrary function call to regex patterns. When the Scanner.scan method is called, the supplied text is matched against each regex pattern, and the associated function is called when a match is found. The scan method ends when the remaining text matches none of the patterns.
import re

def make_email(i):
    def email(scanner, token):
        print('{t}: Send to email{i}'.format(t=token, i=i))
    return email

scanner = re.Scanner(
    [(pat, make_email(i))  # 2
     for i, pat in enumerate((r"^[a-e]\w+", r"^[f-j]\w+"))]  # 1
    + [(r"\s+", None)],
    flags=re.IGNORECASE|re.MULTILINE)

scanner.scan("""\
Albert
Barry
Carrie
David
Erin
Franklin
Geoff
Harold
Isadore
Jay""")
prints
Albert: Send to email0
Barry: Send to email0
Carrie: Send to email0
David: Send to email0
Erin: Send to email0
Franklin: Send to email1
Geoff: Send to email1
Harold: Send to email1
Isadore: Send to email1
Jay: Send to email1
Notes on the numbered comments in the code:
1. You can add more regex patterns here.
2. The Scanner class is initialized with a list of 2-tuples. Each 2-tuple consists of a regex pattern and the associated callback function.
The simple and straightforward solution is to create a simple dictionary with regexes as keys, and loop over those.
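import re
names = ["Albert", "Franklin", "Zed"]  # example input; use your own list
mappings = { r'^[a-e]': "email0", r'^[f-j]': "email1" }
for name in names:
    for regex in mappings:
        if re.match(regex, name, flags=re.IGNORECASE):
            print("%s: send to %s" % (name, mappings[regex]))
            break
    else:
        # for/else: runs only when no pattern matched (no break happened)
        print("%s: no match" % name)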
If you do this on an industrial scale, you would probably want to precompile the regexes with re.compile() but for a quick and dirty solution, this gets the job done.
You only need to know the first letter in each name, and map it to an email address. You don't need a regex for that.
def address(name):
    addresses = ['foo#bar.com', 'spam#eggs.org', ... ]
    i = 'abcdefghijklmnopqrstuvwxyz'.find(name[0].lower()) // 5
    return addresses[i]
Then you want to iterate over the names.
for name in names: print(name, address(name))
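A self-contained sketch of the same first-letter idea, for anyone who wants to run it directly. The six addresses are hypothetical placeholders, email_for is a made-up name, and the grouping assumes blocks of five letters (A-E, F-J, and so on, with Z alone in the sixth group):

```python
import string

# Hypothetical placeholder addresses, one per block of five letters.
ADDRESSES = ["email%d@example.com" % n for n in range(1, 7)]

def email_for(name):
    """Pick an address from the first letter of name (A-E -> first, F-J -> second, ...)."""
    first = name.strip()[0].lower()
    # index() raises ValueError for a non-letter, unlike find(), which would
    # return -1 and silently select the last address.
    position = string.ascii_lowercase.index(first)
    return ADDRESSES[position // 5]

for name in ["Albert", "Franklin", "Zoe"]:
    print(name, email_for(name))
```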

python regex get first part of an email address

I am quite new to Python and regex and I was wondering how to extract the first part of an email address, up to the domain name. So for example if:
s='xjhgjg876896#domain.com'
I would like the regex result to be (taking into account all "sorts" of email ids i.e including numbers etc..):
xjhgjg876896
I get the idea of regex - as in I know I need to scan till "#" and then store the result - but I am unsure how to implement this in python.
Thanks for your time.
You should just use the split method of strings:
s.split("#")[0]
As others have pointed out, the better solution is to use split.
If you're really keen on using regex then this should work:
import re
regexStr = r'^([^#]+)#[^#]+$'
emailStr = 'foo#bar.baz'
matchobj = re.search(regexStr, emailStr)
if matchobj is not None:
    print(matchobj.group(1))
else:
    print("Did not match")
and it prints out
foo
NOTE: This is going to work only with email strings of SOMEONE#SOMETHING.TLD. If you want to match emails of type NAME<SOMEONE#SOMETHING.TLD>, you need to adjust the regex.
You shouldn't use a regex or split.
local, at, domain = 'john.smith#example.org'.rpartition('#')
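A short sketch of why rpartition is a reasonable choice here, assuming a normal @ separator: it splits on the last @, so a quoted local part that itself contains @ stays in one piece. The function name local_part is made up for illustration:

```python
def local_part(address):
    """Return everything before the last '@' in address."""
    local, at, _domain = address.rpartition("@")
    if not at:
        # rpartition returns ('', '', address) when the separator is missing.
        raise ValueError("no '@' in %r" % address)
    return local

print(local_part("john.smith@example.org"))
print(local_part('"a@b"@example.com'))
```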
You have to use a proper RFC 5322 parser.
"#####"#example.com is a valid email address, and semantically localpart("#####") is different from its username(#####)
As of python3.6, you can use email.headerregistry:
from email.headerregistry import Address
s='xjhgjg876896#domain.com'
Address(addr_spec=s).username # => 'xjhgjg876896'
#!/usr/bin/python3.6
def email_splitter(email):
    username = email.split('#')[0]
    domain = email.split('#')[1]
    domain_name = domain.split('.')[0]
    domain_type = domain.split('.')[1]
    print('Username : ', username)
    print('Domain : ', domain_name)
    print('Type : ', domain_type)

email_splitter('foo.goo#bar.com')
Output :
Username : foo.goo
Domain : bar
Type : com
Here is another way, using the index method.
s='xjhgjg876896#domain.com'
# Now let's find the location of the "#" sign
index = s.index("#")
# Next let's get the string from the beginning up to the location of the "#" sign.
s_id = s[:index]
print(s_id)
And the output is
xjhgjg876896
need to install package
pip install email_split
from email_split import email_split
email = email_split("ssss#ggh.com")
print(email.domain)
print(email.local)
Below should help you do it :
fromAddr = message.get('From').split('#')[1].rstrip('>')
fromAddr = fromAddr.split(' ')[0]
Good answers have already been given, but I want to add mine anyway.
If I have the email john#gmail.com, I want to get just "john".
If I have the email john.joe#gmail.com, I also want to get just "john".
So this is what I did:
name = recipient.split("#")[0]
name = name.split(".")[0]
print(name)
cheers
You can also try to use email_split.
from email_split import email_split
email = email_split('xjhgjg876896#domain.com')
email.local # xjhgjg876896
email.domain # domain.com
You can find more here https://pypi.org/project/email_split/ . Good luck :)
The following will return the continuous text before #
re.findall(r'(\S+)#', s)
You can find all the words in the email and then return the first word.
import re
def returnUserName(email):
    return re.findall(r"\w*", email)[0]
print(returnUserName("johns123.ss#google.com")) #Output is - johns123
print(returnUserName('xjhgjg876896#domain.com')) #Output is - xjhgjg876896

Python string interpretation and parse

I'm trying to learn how to interpret and parse a string in python. I want to make a "string command" (don't know if is the right expression). But to explain better I will take an example: I want a command like in SQL, where there is a string with keywords that will make a process do what is asking for. Like this: cursor.execute("UPDATE Cars SET Price=? WHERE Id=?", (50000, 1)). But I want to create a format for my project like this (it is not necessary to be with sql): mydef("U={Cars[Price=50000], Id=1}")
Syntax table: <command>={<table>[<value name>=<value (int/str/float/bool)>], <id>=<value to id>}
Where command is: U=update, C=create, S=select, I=insert, D=delete
Well, I really want to learn how can I do it in Python. If is possible.
Pyparsing is a simple pure-Python, small-footprint, liberally-licensed module for creating parsers like the one you describe. Here are a couple of presentations I gave at PyCon'06 (updated for the Texas Python UnConference, 2008), one an intro to pyparsing itself, and one a demo of using pyparsing for parsing and executing a simple command language (a text adventure game).
Intro to Pyparsing - http://www.ptmcg.com/geo/python/confs/TxUnconf2008Pyparsing.html
A Simple Adventure Game Command Parser - http://www.ptmcg.com/geo/python/confs/pyCon2006_pres2.html
Both presentations are written using S5, so if you mouse into the lower right hand corner, you'll see << and >> buttons, a Ø button to see the entire presentation as a single printable web page, and a combo box to jump to a particular page.
You can find out more about pyparsing at http://pyparsing.wikispaces.com.
Just to be clear, are you aware that Python2.5+ includes sqlite?
import sqlite3
conn = sqlite3.connect("dbname.db")  # the filename must be a string
curs = conn.cursor()
curs.execute("""CREATE TABLE Cars (UID INTEGER PRIMARY KEY, \
    "Id" VARCHAR(42), \
    "Price" VARCHAR(42))""")
curs.execute("UPDATE Cars SET Price=? WHERE Id=?", (50000, 1))
Edit to add: I didn't actually test this; you'll at least need an insert statement to make this work.
I did this code, I don't know if this will work. Just want the opinion.
>>> s = '<command>={<table>[<value name>=<value>], <id>=<value id>}'
>>> s1 = s.split('=', 1)
>>> s2 = s1[1].split(',', 1)
>>> s2 = s1[1].replace('{', '').replace('}', '').split(',', 1)
>>> s3 = s2[0].replace(']', '').split('[')
>>> s4 = s3[1].split('=')
>>> s1
['<command>', '{<table>[<value name>=<value>], <id>=<value id>}']
>>> s2
['<table>[<value name>=<value>]', ' <id>=<value id>']
>>> s3
['<table>', '<value name>=<value>']
>>> s4
['<value name>', '<value>']
>>> s5 = s2[1].split('=')
to split the entire command and get the args:
<command>={<table>[<value name>=<value>],<id>=<value id>}
["<command>", "{<table>[<value name>=<value>],<id>=<value id>}"]
["<table>[<value name>=<value>]", "<id>=<value id>"]
["<table>", "<value name>=<value>"]
["<value name>", "<value>"]
["<id>", "<value id>"]
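For a format this small, a single regular expression with named groups may be easier to maintain than chained split() calls. The sketch below is only one possible reading of the syntax table in the question: it assumes exactly one [name=value] pair, keeps all values as strings, and the names parse_command and COMMANDS are made up for illustration:

```python
import re

# Command letters as listed in the question.
COMMANDS = {"U": "update", "C": "create", "S": "select", "I": "insert", "D": "delete"}

COMMAND_RE = re.compile(
    r"^(?P<command>[UCSID])=\{"                   # command letter, then ={
    r"(?P<table>\w+)"                             # table name
    r"\[(?P<field>\w+)=(?P<value>[^\]]*)\],\s*"   # [<value name>=<value>],
    r"(?P<id_name>\w+)=(?P<id_value>[^}]*)\}$"    # <id>=<value to id>}
)

def parse_command(text):
    """Split a command string into its named parts, or raise ValueError."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        raise ValueError("not a valid command: %r" % text)
    parts = match.groupdict()
    parts["command"] = COMMANDS[parts["command"]]
    return parts

print(parse_command("U={Cars[Price=50000], Id=1}"))
```

Anything more elaborate, such as several fields or nested expressions, is probably better served by a real parser like the pyparsing approach mentioned above.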
