Django or Python: manipulate email addresses and reason about domains

I want to be able to parse email addresses to isolate the domain part, and test if an email address is part of a given domain.
The email module doesn't, as far as I can tell, do that. Is there anything worth using to do this other than the usual string handling and regex routines?
Note: I know how to deal with python strings. I don't need basic recipes, although awesome recipes are welcome.
The problem here is essentially that email addresses have the format (schematically) userpart@sub\.domain\.[sld]+\.tld.
Stripping the part before the @ is easy; the hard part is parsing the domain to work out which parts are subdomains on a larger organisation's domain, rather than generic second-level (or, I guess, even higher order) public domains.
Imagine parsing user@mail.organisation.co.uk to find that the organisation's domain name is organisation.co.uk and so be able to match both mail.organisation.co.uk and finance.organisation.co.uk as subdomains of organisation.co.uk.
There are basically two possible (non-dns-based) approaches: build a finite automaton that knows about all generic slds and their relation to the tld (including popular 'fake' slds like uk.com), or try to guess, based on the knowledge that there must be a tld, and assuming that if there are three (or more) elements, the second-level domain is generic if it has fewer than three/four characters. The relative drawbacks of each approach should be obvious.
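To make the second option concrete, here is a rough sketch of that guessing heuristic (purely illustrative: the three-character threshold is an assumption, and it misclassifies suffixes like uk.com):

def guess_registered_domain(hostname, max_sld_len=3):
    # Naive guess: treat the label before the TLD as a generic public
    # SLD (like 'co' in 'co.uk') whenever it is short.
    labels = hostname.lower().split('.')
    if len(labels) >= 3 and len(labels[-2]) <= max_sld_len:
        return '.'.join(labels[-3:])   # e.g. 'organisation.co.uk'
    return '.'.join(labels[-2:])       # e.g. 'organisation.com'

>>> guess_registered_domain('mail.organisation.co.uk')
'organisation.co.uk'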
The alternative is to look through DNS entries to work out what is a registered domain, which has its own drawbacks.
In any case, I would rather piggyback on the work of others.

As per @dm03514's comment, there is a python library that does exactly this: tldextract:
>>> import tldextract
>>> tldextract.extract('foo@bar.baz.org.uk')
ExtractResult(subdomain='bar', domain='baz', tld='org.uk')
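Given that, the domain-membership test is short. A minimal sketch (note that newer tldextract releases call the last field suffix rather than tld, and also offer a registered_domain convenience attribute):

import tldextract

def in_organisation_domain(address, org_domain):
    ext = tldextract.extract(address.rsplit('@', 1)[-1])
    registered = '.'.join([ext.domain, ext.suffix])  # e.g. 'organisation.co.uk'
    return registered == org_domain

>>> in_organisation_domain('user@mail.organisation.co.uk', 'organisation.co.uk')
True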

With this simple script, we replace @ with @. so that the domain is delimited and endswith won't match a longer domain that merely ends with the same text.
def address_in_domain(address, domain):
    return address.replace('@', '@.').endswith('.' + domain)


if __name__ == '__main__':
    addresses = [
        'user1@domain.com',
        'user1@anotherdomain.com',
        'user2@org.domain.com',
    ]
    print filter(lambda address: address_in_domain(address, 'domain.com'), addresses)
    # Prints: ['user1@domain.com', 'user2@org.domain.com']

Related

Is there a builtin library in Python that can parse out the domain part (if any) of an email address?

I know that I can use email.utils.parseaddr to parse out an email address properly, even a tricksy one:
>>> parseaddr('Bad Horse <bad.horse@example(no one expects the @-ish inquisition!).com')
('Bad Horse (no one expects the @-ish inquisition!)', 'bad.horse@example.com')
(So good, Python! So good!)
However, that's just a string:
>>> type(parseaddr('Bad Horse <bad.horse@example(no one expects the @-ish inquisition!).com')[-1])
<class 'str'>
In the typical case I can just do .rsplit('@', maxsplit=1)[-1] to get the domain. But what if I'm just sending local mail without a domain?
>>> parseaddr('Wayne <wayne>')[-1].rsplit('@', maxsplit=1)[-1]
'wayne'
That's not quite what I want - I'd prefer maybe None or 'localhost'.
Does anything like that come in Python's included batteries?
I haven't been able to find anything yet, so my current approach is to make a slight adjustment:
try:
    domain = parseaddr('Wayne <wayne>')[-1].rsplit('@', maxsplit=1)[1]
except IndexError:
    # There was no '@' in the email address
    domain = None  # or 'localhost'
In the absence of a better way, this works and gets me what I need.
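An exception-free variant is str.rpartition, which returns an empty separator instead of raising; a minimal sketch (the function name and the default are my own choices):

from email.utils import parseaddr

def domain_of(raw, default=None):  # default could be 'localhost'
    addr = parseaddr(raw)[-1]
    local, sep, domain = addr.rpartition('@')
    # rpartition yields ('', '', 'wayne') when there is no '@',
    # so an empty sep means a local address
    return domain if sep else default

>>> domain_of('Wayne <wayne>') is None
True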

Isolate TLD from FQDN using regex

I am attempting to isolate TLDs from giant lists of FQDNs using regex, without importing third-party modules, and am trying to determine if there is a more elegant way of doing this. My way works but is a bit cumbersome for my liking.
Sample code:
import re

domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']
temp = []
for domain in domains:
    temp.append(re.findall(r'\.[a-z0-9]+', domain, re.I))

tlds = []
for item in temp:
    for tld in item:
        tlds.append(tld)
It is inconvenient that re.findall returns a list, as it makes the iteration one level deeper than desired, but I am unsure how to get around this.
The "quick fix" is either to take the last item in each array:
split('.', domain)[-1]
Or, if you really don't care about the first matches, then don't capture them at all:
re.find('\.[a-z0-9]+$', domain, re.I)
(Note the use of $ to match the end of string.)
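And if you do want every label but without the nested loops, a single comprehension flattens the result; a small sketch:

import re

domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']
tlds = [tld for domain in domains
            for tld in re.findall(r'\.[a-z0-9]+', domain, re.I)]
# ['.sample1', '.com', '.sample2', '.org', '.sample3', '.biz']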
HOWEVER, note that it's impossible to solve this problem properly with regex. For example, how can you know that the TLD for google.co.uk is co.uk, and not just uk?
The only full solution to this problem, unfortunately, is to use a library that implements the public suffix list - which is basically just a very long (manually updated) list of all TLDs. For example, in python: https://pypi.python.org/pypi/publicsuffix/
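Its usage looks roughly like this (a sketch from memory, worth checking against the package docs; note that despite its name, get_public_suffix returns the registrable domain):

>>> from publicsuffix import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.get_public_suffix('www.google.co.uk')
'google.co.uk'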

python UUID based on email

How do I generate a UUID based on email ids?
I have read the docs.
I prefer to use the UUID module.
Without knowing exactly what the namespace thing is about, I'd try this:
>>> import uuid
>>> mail = "foo@bar.example"
>>> uuid.uuid5(uuid.NAMESPACE_URL, mail)
UUID('45348e31-1ca5-57f3-ad95-cb80bf6ad145')
If all you need is a unique hash you can also use the hashlib module.
>>> import hashlib
>>> m = hashlib.sha1()
>>> m.update(mail.encode('utf-8'))  # update() wants bytes on Python 3
>>> m.hexdigest()
'edb13b9a276142c6dcb93534a21f497fec4b93f8'
You need to generate a version 3 UUID (uuid3) or a version 5 UUID (uuid5) to solve your problem.
A version 3 UUID can be created using the DNS namespace:
>>> import uuid
>>> uuid.NAMESPACE_DNS
UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
>>> uuid.uuid3(uuid.NAMESPACE_DNS, 'YOUR EMAIL ID')
UUID('3d813cbb-47fb-32ba-91df-831e1593ac29')
UUID5 can be generated similarly, and you can also use uuid.NAMESPACE_URL as the namespace for either.
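For instance, a minimal uuid5 sketch (the mailto: prefix is just one way to turn an address into a URL):

import uuid

u = uuid.uuid5(uuid.NAMESPACE_URL, 'mailto:foo@bar.example')
# Deterministic: the same namespace and name always yield the same UUID.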
As others have told you, you have to use uuid3 or uuid5. (Which one, it doesn't really matter if you don't care about cryptography. I'll use uuid3 in this example.) Now you have to decide on a namespace.
DNS doesn't make sense, since it only accepts FQDNs, which an email address surely is not. X.500 can theoretically be used if you are in LDAP, but it's still more complicated than necessary. The OID tree, as far as I know, doesn't have an arc for emails - and rightly so, since they are trying to build a permanent registry, and email addresses are not really permanent.
So, that leaves URIs. Are email addresses URIs? Fortunately, yes. (Formally, NAMESPACE_URL is for URLs only, but fortunately, mailto: addresses are URLs, too.) URIs have a syntax described in this Wikipedia article. So you have to find a scheme, and then fit your data into it. IANA gives you a list of schemes, where you can find "mailto" listed as a "Permanent" scheme for "Electronic mail address". Seems like exactly what we want.
You also get the link to the RFC, in this case RFC 6068, which tells you how exactly you should format your email address. The possible problem is that you speak about an "email id", which could mean just the "local-part" of it (the "username", as it's usually called). Of course, that won't do, since it isn't globally unique.
[The only way you could make it work is to somehow restrict the namespace to your mail server. You can do it with MX records and DNS, but much simpler is to just code the domain into the whole email address.]
def email_uuid(email_id, domain='your.domain.example.com'):
    from uuid import uuid3, NAMESPACE_URL
    if '@' not in email_id:
        email_id += '@' + domain
    return uuid3(NAMESPACE_URL, 'mailto:' + email_id)
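Usage: a bare id is qualified with the default domain before hashing, so these two calls agree (addresses hypothetical):

>>> email_uuid('alice') == email_uuid('alice@your.domain.example.com')
True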

Python: Matching & Stripping port number from socket data

I have data coming in to a python server via a socket. Within this data is the string '<port>80</port>', or whichever port is being used.
I wish to extract the port number into a variable. The data coming in is not XML; I just used the tag approach to identify data for future XML use if needed. I do not wish to use an XML python library, but simply use something like regexp and strings.
What would you recommend is the best way to match and strip this data?
I am currently using this code with no luck:
p = re.compile('<port>\w</port>')
m = p.search(data)
print m
Thank you :)
Regex can't parse XML and shouldn't be used to parse fake XML. You should do one of
Use a serialization method that is nicer to work with to start with, such as JSON or an ini file with the ConfigParser module.
Really use XML and not something that just sort of looks like XML, and really parse it with something like lxml.etree (see the sketch after this list).
Just store the number in a file if this is the entirety of your configuration. This solution isn't really easier than just using JSON or something, but it's better than the current one.
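To make the XML option concrete: if the payload really were well-formed XML, lxml.etree or the stdlib's xml.etree.ElementTree would do; a sketch (the <config> wrapper is hypothetical):

import xml.etree.ElementTree as ET

data = '<config><port>80</port></config>'  # hypothetical well-formed payload
port = int(ET.fromstring(data).findtext('port'))
print(port)  # 80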
Implementing a bad solution now for future needs that you have no way of defining or accurately predicting is always a bad approach. You will be kept busy enough trying to write and maintain software now that there is no good reason to try to satisfy unknown future needs. I have never seen a case where "I'll put this in for later" has led to less headache later on, especially when I put it in by doing something completely wrong. YAGNI!
As to what's wrong with your snippet other than using an entirely wrong approach: \w matches only a single word character, so '<port>\w</port>' can never match a multi-digit port like 80; you would need \w+ (or better, \d+).
Though Mike Graham is correct that using regex for XML is not 'recommended', the following will work (I have defined searchType as 'd' for numerals):
import re

data = '<port>80</port>'  # the incoming socket data
searchStr = 'port'
searchType = 'd'

if searchType == 'd':
    retPattern = r'(<%s>)(\d+)(</%s>)'
else:
    retPattern = r'(<%s>)(.+?)(</%s>)'

searchPattern = re.compile(retPattern % (searchStr, searchStr))
found = searchPattern.search(data)
retVal = found.group(2)  # '80'
(note the complete lack of error checking, that is left as an exercise for the user)

Sanitising user input using Python

What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?
Here is a snippet that will remove all tags not on the whitelist, and all tag attributes not on the attributes whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja vascript:alert('hi')"> or <a href="ja&#x09;vascript:alert('hi')">, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split()  # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val)  # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val)  # Calculate the absolute url
                tag.attrs.append((attr, val))
    return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.
Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
Here's how I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format) at it after performing Markdown to HTML conversion using python-markdown2, and it seems to have held up OK.
The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.
The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML Entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any html input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z, A-Z, 0-9 for example) is going to let something through.
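A concrete sketch with the stdlib (html.escape on Python 3; the cgi.escape mentioned below is the older spelling):

import html

user_input = '<script>alert(1)</script>'
print(html.escape(user_input))
# &lt;script&gt;alert(1)&lt;/script&gt;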
SQL injection, contrary to other opinion, is still possible if you are just building out a query string. For example, if you just concatenate an incoming parameter onto a query string, you will have SQL injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
This is not to say that filtering isn't still a best practice, but in terms of SQL injection and XSS, you will be far more protected if you religiously use parameterized queries and HTML entity encoding.
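A minimal parameterized-query sketch with the stdlib's sqlite3 (table and column are hypothetical; other DB-API drivers use different placeholder styles, e.g. %s for MySQLdb):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (name TEXT)')
name = "Robert'); DROP TABLE users;--"  # hostile input
# The driver binds the value; it is never spliced into the SQL text.
conn.execute('INSERT INTO users (name) VALUES (?)', (name,))
print(conn.execute('SELECT name FROM users').fetchone()[0])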
Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc.) properly escape the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitization there either.
I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).
Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever), I usually tried to keep that language in a non-intersecting set with HTML, so I could still store it suitably escaped (after validating for syntax errors) and parse it to HTML when displaying, without having to worry about the data the user put in there interfering with my HTML.
See also Escaping HTML
If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
To sanitize a string input which you want to store in the database (for example a customer name), you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection, which can happen when you assemble an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")
