Validate Email Header in Python - python

I have a RegEx for validating email addresses, but I'm really looking to validate a whole From header. Any of these would be valid:
name@domain.com
<name@domain.com>
My Name <name@domain.com>
Is there anything out there that would validate these as valid from headers? I'm going to look in the smtp library :)

I couldn't get the posted response to work, so I've been working on this and finally got this one, which seems to work so far. I'm sure it'll miss/catch something, but it's working for now.
[a-zA-Z0-9+_\-\.\ ]*[ ]*<?[a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z][-.0-9a-zA-Z]*\.[a-zA-Z]+>?

Be aware that there are plenty of other valid cases of e-mail addresses beyond what you've posted.
See here for a recipe that may help. Also read this for a great discussion of parsing email addresses with a regex. There are any number of good regexes in there that will match the uses you're looking for, imho :-)
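If a regex turns out to be too brittle, Python's standard library can do the header parsing for you. Here is a minimal sketch, assuming the addresses use a normal @ separator: email.utils.parseaddr splits a From value into display name and address, and the helper looks_like_from_header (a name made up for this example) adds a deliberately loose sanity check on top. It is not full RFC validation, just a starting point.

```python
from email.utils import parseaddr


def looks_like_from_header(value):
    """Return the bare address if `value` parses as a From header, else None.

    Hypothetical helper: the check is intentionally loose (one '@',
    a non-empty local part, and a dot somewhere in the domain).
    """
    _name, addr = parseaddr(value)
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or "." not in domain:
        return None
    return addr


for header in ("name@domain.com", "<name@domain.com>", "My Name <name@domain.com>"):
    print(header, "->", looks_like_from_header(header))
```

All three forms from the question come back as the same bare address, which is usually what you want to store anyway.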

Related

Py3 imaplib: get only immediate body (no reply) of email [duplicate]

There are two pre-existing questions on the site.
One for Python, one for Java.
Java How to remove the quoted text from an email and only show the new text
Python Reliable way to only get the email text, excluding previous emails
I want to be able to do pretty much exactly the same (in PHP). I've created a mail proxy, where two people can have a correspondence together by emailing a unique email address.
The problem I am finding, however, is that when a person receives the email and hits reply, I am struggling to accurately capture the text that he has written and discard the quoted text from previous correspondence.
I'm trying to find a solution that will work for both HTML emails and Plaintext email, because I am sending both.
I also have the ability, if it helps, to insert some <*****RESPOND ABOVE HERE*******> tag in the emails if necessary, meaning that I can discard everything below.
What would you recommend I do? Always add that tag to the HTML copy and the plaintext copy then grab everything above it?
I would still then be left with the scenario of knowing how each mail client creates the response. Because for example Gmail would do this:
On Wed, Nov 2, 2011 at 10:34 AM, Message Platform <35227817-7cfa-46af-a190-390fa8d64a23@dev.example.com> wrote:
## In replies all text above this line is added to your message conversation ##
Any suggestions or recommendations of best practices?
Or should I just grab the 50 most popular mail clients and start creating a custom regex for each? And then, for each of these clients, also handle a bazillion different locale settings, since I'm guessing the user's locale will also influence what is added.
Or should I just always remove the preceding line if it contains a date, etc.?
Unfortunately, you're in for a world of hurt if you want to try to clean up emails meticulously (removing everything that's not part of the actual reply email itself). The ideal way would be to, as you suggest, write up regex for each popular email client/service, but that's a pretty ridiculous amount of work, and I recommend being lazy and dumb about it.
Interestingly enough, even Facebook engineers have trouble with this problem, and Google has a patent on a method for "Detecting quoted text".
There are three solutions you might find acceptable:
Leave It Alone
The first solution is to just leave everything in the message. Most email clients do this, and nobody seems to complain. Of course, online message systems (like Facebook's 'Messages') look pretty odd if they have inception-style replies. One sneaky way to make this work okay is to render the message with any quoted lines collapsed, and include a little link to 'expand quoted text'.
Separate the Reply from the Older Message
The second solution, as you mention, is to put a delineating message at the top of your messages, like --------- please reply above this line ----------, and then strip that line and anything below when processing the replies. Many systems do this, and it's not the worst thing in the world... but it does make your email look more 'automated' and less personal (in my opinion).
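In Python, the marker approach might look roughly like the sketch below. The marker text and the function name text_above_marker are assumptions for illustration only. The check is a substring match on each line, because the replying client will often prefix the marker line with "> " or wrap it in other noise.

```python
REPLY_MARKER = "please reply above this line"  # hypothetical marker text


def text_above_marker(body, marker=REPLY_MARKER):
    """Return everything above the first line containing the marker.

    If the marker is missing, the body is returned unchanged (apart
    from trailing whitespace) so the caller can fall back to another
    strategy.
    """
    lines = body.splitlines()
    for index, line in enumerate(lines):
        if marker.lower() in line.lower():
            return "\n".join(lines[:index]).rstrip()
    return body.rstrip()


reply = (
    "Sounds good.\n"
    "\n"
    "On Wed, Bob wrote:\n"
    "> --------- please reply above this line ----------\n"
    "> Old text\n"
)
print(text_above_marker(reply))
```

Note that the "On Wed, Bob wrote:" attribution line is still left behind, which is the same per-client problem discussed in the question.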
Strip Out Quoted Text
The last solution is to simply strip out any new line beginning with a >, which is, presumably, a quoted line from the reply email. Most email clients use this method of indicating quoted text. Here's some regex (in PHP) that would do just that:
$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);
There are some problems using this simpler method:
Many email clients also allow people to quote earlier emails inline, and preface those quoted lines with > as well, so you'll be stripping out those deliberate quotes too.
Usually, there's a line above the quoted email with something like On [date], [person] said. This line is hard to remove, because it's not formatted the same among different email clients, and it may be one or two lines above the quoted text you removed. I've implemented this detection method, with moderate success, in my PHP IMAP library.
Of course, testing is key, and the tradeoffs might be worth it for your particular system. YMMV.
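For anyone doing this from Python rather than PHP, the same pattern carries over to the re module more or less as-is. A rough sketch (strip_quoted is just an illustrative name), with the same limitations as listed above:

```python
import re

# Same pattern as the PHP version: an optional "Something ...:" line
# directly above a run of lines that start with ">".
QUOTED_BLOCK = re.compile(r"(^\w.+:\n)?(^>.*(\n|$))+", re.MULTILINE | re.IGNORECASE)


def strip_quoted(body):
    """Remove '>'-quoted blocks (and an attribution line right above them)."""
    return QUOTED_BLOCK.sub("", body)


message = (
    "Thanks, see you then.\n"
    "\n"
    "On Wed, Nov 2, 2011, Bob wrote:\n"
    "> Are you coming?\n"
    "> Let me know.\n"
)
print(strip_quoted(message).strip())
```

The attribution line is only caught when it ends with a colon and sits directly above the quoted block, so expect misses with clients that format it differently.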
There are many libraries out there that can help you extract the reply/signature from a message:
Ruby: https://github.com/github/email_reply_parser
Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
JavaScript: https://github.com/turt2live/node-email-reply-parser
Java: https://github.com/Driftt/EmailReplyParser
PHP: https://github.com/willdurand/EmailReplyParser
I've also read that Mailgun has a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: https://www.mailgun.com/blog/handle-incoming-emails-like-a-pro-mailgun-api-2-0/
Hope this helps!
Possibly helpful: quotequail is a Python library that helps identify quoted text in emails
Afaik, (standard) emails should quote the whole text by adding a ">" in front of every line, which you could strip by using strstr(). Otherwise, did you try to port that Java example to PHP? It's nothing more than regex.
Even sites like GitHub and Facebook have this problem.
Just an idea: You have the text which was originally sent, so you can look for it and remove it and additional surrounding noise from the reply. It is not trivial, because additional line breaks, HTML elements, ">" characters are added by the mail client application.
The regex is definitely better if it works, because it is simple and it perfectly cuts the original text, but if you find that it frequently does not work then this can be a fallback method.
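A very rough Python sketch of that fallback, assuming you still have the plain-text body you originally sent. The name drop_known_lines is made up, and the normalisation (stripping leading ">" and whitespace) is only a guess at the noise a client adds. HTML replies would need their tags removed first.

```python
def drop_known_lines(reply, original):
    """Drop lines of `reply` that also appear in `original`.

    Lines are compared after stripping leading '>' markers and
    surrounding whitespace. Blank lines are always kept.
    """
    def norm(line):
        return line.lstrip("> \t").strip()

    known = {norm(line) for line in original.splitlines() if norm(line)}
    kept = [line for line in reply.splitlines() if norm(line) not in known]
    return "\n".join(kept).strip()


original = "Are you coming?\nLet me know."
reply = "Yes!\n\n> Are you coming?\n> Let me know.\n"
print(drop_known_lines(reply, original))
```

Anything the client adds that was never in the original (the "On ... wrote:" line, for instance) survives this pass, so it works best combined with one of the regex approaches.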
I agree that the quoted text in a reply is just text, so there's no fully accurate way to extract it. Still, you can use a regex replace like this:
$filteringMessage = preg_replace('/.*\n\n((^>+\s{1}.*$)+\n?)+/mi', '', $message);
Test
https://regex101.com/r/xO8nI1/2

Trying to isolate a link from IRC messages

I have an IRC bot I'm working on, and one of the features I would like it to have is to take any link a person posts and use BeautifulSoup to parse that page. Now, I have the bot working, getting the messages people post, etc. But how would I pull a link from the IRC message? Say someone says this:
Person: Check out http://www.site.com, it's cool!
How would I take the link out and assign it to a variable for later use, without pulling the other parts of the message?
I think it's something to do with regexes, but I'm not sure.
You will indeed need to use regular expressions.
There's a decent article at Daring Fireball with a regular expression for matching URLs and somewhat of a description of what it's doing.
You can look at how Django does it here.
Finally, Python's regular expression documentation may also be useful.
You are on the right track to finish this. You gave yourself the answer with the last sentence of your question. You will use a regular expression with a capture group to get the URL, and from there you can fetch and parse the page that the user mentioned in the IRC channel.
This site may be of some use for you: http://www.regular-expressions.info/
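To make that concrete, here is a minimal sketch in Python. The pattern is deliberately simple (it is not the Daring Fireball or Django one) and find_urls is a made-up helper name. It grabs anything starting with http:// or https:// up to the next whitespace, then trims punctuation that was probably part of the sentence rather than the link.

```python
import re

# Deliberately simple: scheme, then everything up to whitespace or a
# character that cannot appear unescaped in a URL.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def find_urls(message):
    """Return all http(s) URLs in `message`, minus trailing punctuation."""
    return [m.group(0).rstrip(".,;:!?)'\"") for m in URL_RE.finditer(message)]


links = find_urls("Person: Check out http://www.site.com, it's cool!")
print(links)
```

The result is a list, so a message with several links is handled too, and links[0] can be handed to whatever fetches the page for BeautifulSoup.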

I am looking to parse a large amount of emails from a public dataset, specifically emails from the late 90's or early 2000's with Python

So basically I want to parse the Enron public emails data-set and I am uncertain about email formatting and types back in the day. I am unfamiliar with MIME types and those other formatting details. So I want to know if all emails have the same first couple lines and last couple lines.
I essentially want to get rid of everything but the body of the emails. So I would also like to know whether it would be easier (not knowing what I know) to use the C method of parsing line by line, or to essentially try to clean up all the emails to leave just what I need. I don't care too much about white space, but I am also not very skilled at regex or lexical parsing, so if anyone has good references for refreshing regex, or can break down the few rules that I'll probably need, that would be great.
Wow, that's a whole lot of "...I don't knows..." with zero info about your objective. About the best advice I can offer is you read RFC-822. http://www.faqs.org/rfcs/rfc822.html
You're going to have to bone up on regex parsing if you want to extract any meaningful info from the emails. I'd suggest the O'Reilly book on regex, or reading over http://www.regular-expressions.info/
If you have more targeted questions, it's possible SO could help you then.
Good luck
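For what it's worth, if the goal is only "give me the body", Python's standard email package already understands the RFC 822 header/body split and MIME, so hand-written line parsing may not be needed. A minimal sketch, assuming each message is available as a string (body_of is an illustrative name, and the sample message is invented):

```python
from email import message_from_string, policy


def body_of(raw):
    """Return the plain-text body of a raw RFC 822 message, or ''."""
    msg = message_from_string(raw, policy=policy.default)
    # For multipart mail this picks the text/plain part; for simple
    # messages it is the message itself.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Test\n"
    "\n"
    "Hello there.\n"
    "Bye.\n"
)
print(body_of(raw))
```

Headers end at the first blank line, which is why the first few lines of every message look alike. The last few lines are not standardised at all (signatures and quoted replies are just body text).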

Any way to detect mistyped urls in python?

My python program involves going to a user-supplied url and then doing stuff on the page. Ideally, mistyped urls would be recognized and pop up an error. But if they have the right syntax and just don't point anywhere, then either an ISP error page or an ad site is loaded instead.
For example:
"http://washingtonn.edu" --> http://search5.comcast.com/?cat=dnsr&con=dsqcy&url=washingtonn.edu
"http://www.amazdon.com/" --> http://www.amazdon.com/
Is there any way to detect these without knowing all the possible pages? The second one might be pretty hard because it's an actual site, but I'd be happy with catching the first.
Thanks!
Unless I am misunderstanding your question, what you ask for is impossible, doesn't make sense, or is far far from trivial.
If you think about it, other than a 404 error, where you detect that a page does not exist, there is no way of knowing whether a page that does exist is "good" or "bad", as this is subjective. It might be possible to apply some general rules, but you can't cover all the possibilities.
The only way would be something like what Google does with its suggestions, but this would imply a huge database with a list of website popularity, and testing every input for proximity. That is far beyond trivial and probably not necessary.
For handling 404 statuses in Python you could use a library like httplib (http.client in Python 3).
Good luck!
You can check the HTTP status code of your requests. Probably most interesting for you is the 404 - Not Found status. In the second case, you are right: if the response is a web page, you can't know whether it is what the user wanted or a typo.
What you're talking about is heuristics, and it's actually a very complex topic. You could keep a list of common websites and common misspellings: if something does not resolve or comes back as an error (e.g., a 404 HTTP response), check the input against the list and pick the "closest" answer (this is a whole algorithm in and of itself). It wouldn't be too reliable though, because a misspelled website may indeed resolve correctly (although to an unintended domain).
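As a toy illustration of the "closest answer" idea, the standard library's difflib can rank candidates by similarity. The host list, the 0.8 cutoff, and the name suggest_host are all assumptions for this sketch. A real list would have to be far larger.

```python
import difflib

KNOWN_HOSTS = ["washington.edu", "amazon.com", "python.org"]  # hypothetical list


def suggest_host(host, known=KNOWN_HOSTS):
    """Return the most similar known host, or None if nothing is close."""
    matches = difflib.get_close_matches(host.lower(), known, n=1, cutoff=0.8)
    return matches[0] if matches else None


for typed in ("washingtonn.edu", "amazdon.com"):
    print(typed, "->", suggest_host(typed))
```

This only helps for sites that are on the list, which is exactly the reliability problem described above.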
A really simple solution, if you're very concerned about misspelled URLs, is to just ask for the URL twice.
You could use a regex to check for a valid url, and also use httplib to check for the response codes and require a 200 to continue.
HTTPConnection.getresponse() returns a response object whose status attribute will be 200 if the URL is valid and the request succeeded.
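Putting the status check together with the ISP-redirect case from the question, a sketch using urllib (which follows redirects and wraps http.client) could look like this. Both function names are made up, and treating "ended up on a different host" as suspicious is only a heuristic, since plenty of legitimate sites redirect across hosts.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def check_url(url, timeout=10):
    """Fetch `url`; return (status, final_url), or (None, None) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, url
    except urllib.error.URLError:
        # DNS failure, refused connection, and so on.
        return None, None


def landed_elsewhere(requested_url, final_url):
    """True when the final host differs from the requested one (ignoring 'www.')."""
    def host(u):
        h = (urlsplit(u).hostname or "").lower()
        return h[4:] if h.startswith("www.") else h

    return host(requested_url) != host(final_url)
```

With the first example from the question, landed_elsewhere("http://washingtonn.edu", final_url) would flag the Comcast search page, while the amazdon.com case still slips through because it is a real site.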

two questions (RFC822, login info) about sending email via python

1 -
In my email-sending script, I store spaced-out emails in a string, then I use ", ".join(to.split()). However, it looks like the script only sends to the 1st email - is it something to do with RFC822 format? If so, how can I fix this?
2 -
I feel a bit edgy having my password visible in my script. Is there a way to retrieve this info from cookies or saved passwords from Firefox?
Thanks in advance!
Use ', '.join() for the list in the To: or Cc: header, but the headers are only for show. What determines where the mail actually goes is the RCPT envelope. Assuming you're using smtplib, that's the second argument:
connection.sendmail(senderaddress, to.split(), mailtext)
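A slightly fuller sketch of the same point, assuming smtplib and a space-separated string of addresses as in the question. The function names, the host argument, and the example addresses are placeholders. The joined string goes in the To: header for display, while the list is what actually goes into the envelope.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, to, subject, body):
    """Return (message, recipient_list) for a space-separated `to` string."""
    recipients = to.split()
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)  # header: for show only
    msg["Subject"] = subject
    msg.set_content(body)
    return msg, recipients


def send(host, sender, to, subject, body):
    """Send via the SMTP server at `host` (placeholder; no authentication)."""
    msg, recipients = build_message(sender, to, subject, body)
    with smtplib.SMTP(host) as connection:
        # The second argument is the envelope: one RCPT per list item.
        connection.sendmail(sender, recipients, msg.as_string())
```

Passing the joined string instead of the list as the second argument is the usual reason only the first address receives the mail.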
2: it's possible, but far from straightforward. Browsers don't want external programs looking at their security-sensitive stored data.
For the second part of your question, you could take a look at the netrc module (http://docs.python.org/library/netrc.html).
This isn't much better than having the password in the script, but it does allow the script to be readable by anyone using the computer, while the password stays in a file in your home directory that is only readable by you.
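Reading the credentials back is only a few lines. In this sketch smtp_credentials is a made-up helper, and the host name in the comment is a placeholder for whatever server your script talks to.

```python
import netrc


def smtp_credentials(host, path=None):
    """Return (login, password) for `host` from a netrc file.

    With path=None the default ~/.netrc is used. The file would contain
    a line such as:
        machine smtp.example.com login me password secret
    """
    entry = netrc.netrc(path).authenticators(host)
    if entry is None:
        raise LookupError("no netrc entry for " + host)
    login, _account, password = entry
    return login, password
```

The returned pair can then be passed to connection.login() before sending, so the password never appears in the script itself.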
