Download POP3 headers from a certain date (Python)

I'm trying to write POP3 and IMAP clients in Python using available libraries, which will download email headers (and subsequently entire email bodies) from various servers and save them in a MongoDB database. The problem I'm facing is that this client downloads emails in addition to a user's regular email client. So, assuming that a user might or might not leave emails on the server when downloading with their mail client, I'd like to fetch the headers but only collect them from a certain date onward, to avoid grabbing entire mailboxes every time I fetch the headers.
As far as I can see the POP3 list call will get me all messages on the server, even those I probably already downloaded. IMAP doesn't have this problem.
How do email clients handle this situation when dealing with POP3 servers?

Outlook logs in to a POP3 server and issues the STAT, LIST and UIDL commands; then if it decides the user has no new messages it logs out. I have observed Outlook doing this when tracing network traffic between a client and my DBMail POP3 server. I have seen Outlook fail to detect new messages on a POP3 server using this method. Thunderbird behaves similarly but I have never seen it fail to detect new messages.
Issue the LIST and UIDL commands to the server after logging in. LIST gives you an index number (the message's linear position in the mailbox) and the size of each message. UIDL gives you the same index number and a unique ID string for each message, which the server typically computes as a hash.
For each user you can store the size and hash value given by LIST and UIDL. If you see the same size and hash value, assume it is the same message. When a given message no longer appears in this list, assume it has been deleted and clear it from your local memory.
For complete purity, remember the relative positions of the size/hash pairs in the message list, so that you can support the possibility that they may repeat. (My guess on Outlook's new message detection failure is that sometimes these values do repeat, at least for DBMail, but Outlook remembers them even after they are deleted, and forever considers them not new. If it were me, I would try to avoid this behavior.)
Footnote: Remember that the headers are part of the message. Do not trust anything in the header for this reason: dates, senders, even server hand-off information can be easily faked and cannot be assumed unique.
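The LIST/UIDL bookkeeping described above could be sketched roughly as follows with the standard poplib module. This is a minimal sketch, not a definitive implementation: the function names (parse_listing, diff_mailbox, fetch_new_headers) and the idea of keeping the seen pairs in a plain Python set are assumptions made for illustration, and it ignores the relative-position refinement mentioned above.

```python
import poplib


def parse_listing(lines):
    """Turn LIST or UIDL response lines (e.g. b"1 abc") into {index: value}."""
    result = {}
    for line in lines:
        index, value = line.decode("ascii", errors="replace").split(None, 1)
        result[int(index)] = value.strip()
    return result


def diff_mailbox(list_lines, uidl_lines, seen):
    """Compare the server listing against previously seen (size, uid) pairs.

    Returns (new, current): `new` maps message index -> (size, uid) for
    messages not seen before; `current` is the set to store for next time,
    so pairs that vanished from the server are forgotten.
    """
    sizes = parse_listing(list_lines)
    uids = parse_listing(uidl_lines)
    current = {(sizes[i], uids[i]) for i in uids if i in sizes}
    new = {i: (sizes[i], uids[i]) for i in uids
           if i in sizes and (sizes[i], uids[i]) not in seen}
    return new, current


def fetch_new_headers(host, user, password, seen):
    """Download headers only for unseen messages (hypothetical helper)."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        new, current = diff_mailbox(conn.list()[1], conn.uidl()[1], seen)
        # TOP n 0 returns the headers and zero lines of the body.
        headers = {i: b"\r\n".join(conn.top(i, 0)[1]) for i in new}
    finally:
        conn.quit()
    return headers, current
```

The caller would persist `current` per user (in MongoDB, for example) and pass it back in as `seen` on the next run.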

Related

Recover email list from a mailbox using Python

Is there any possibility or any library to log in to a given mail and recover a list of messages for a given sender?
I mean the situation in which I provide an e-mail address, based on this address, all messages in the inbox are filtered, and I am returned to the list of e-mails or the user's last message.
I use flask-mail to send emails, but I don't think it is possible to recover the list of messages.

You should check the standard mailbox library. It provides functionalities to read mailboxes stored on disk using the most popular mailbox file formats (Maildir, mbox, MH, Babyl, and MMDF at the time of this writing).
Be warned: nowadays, for performance reasons, many mail clients use embedded database engines to store emails. SQLite is a popular choice, so you can also try the sqlite3 library.
Finally, you will also find exotic file formats like Mork. For those, you will have to write your own parser or search PyPI to see if someone has already done the work for you.
As a personal note, if your email client allows changing its storage backend, you may consider switching to a well-known text-based storage format for your emails. It definitely helps in case of disaster recovery.
As an example, I am using Thunderbird and set it up to use the mbox file format. So I can iterate over the message of my Junk folder that way:
>>> path = '~/.thunderbird/4tuag540.default/ImapMail/ssl0.ovh-1.net/INBOX.sbd/Junk'
>>> from mailbox import mbox
>>> junk = mbox(path)
>>> for message in junk:
...     # Print the "From" header:
...     print(message['From'])
...
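To get back to the original question (messages from a given sender), the same mailbox module can be combined with email.utils.parseaddr. The following is only a sketch: the function name messages_from is made up for illustration, and it assumes the mail is available locally as an mbox file.

```python
import mailbox
from email.utils import parseaddr


def messages_from(mbox_path, sender):
    """Return the messages in an mbox file whose From address matches `sender`."""
    wanted = sender.lower()
    # create=False: raise an error instead of creating an empty mailbox
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg for msg in box
                if parseaddr(str(msg.get("From", "")))[1].lower() == wanted]
    finally:
        box.close()
```

The returned list is in file order, so the last element is the most recently stored message from that sender, provided the client appends new mail to the end of the file.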

Fetch mails via POP3, but keep them on the server

I'd like to fetch mails from a server, but I also want to control when to delete them.
Is there a way to do this?
I know this setting is very usual in mail clients, but it seems this option is not well supported by POPv3 specification and/or server implementations.
(I'm using python, but I'm ok with other languages/libraries, Python's poplib seems very simplistic)

Most POP3 clients may delete successfully retrieved messages automatically, but that's a feature of the client itself, not the protocol. POPv3 supports four basic operations during the transaction phase of a session:
Listing all available messages in the mailbox. (LIST)
Retrieving a specific message (RETR)
Flagging a message for deletion (DELE)
Clearing all deletion flags (RSET)
After the client ends the session with the QUIT command, any messages still flagged for deletion are deleted during the update phase. Note, though, that the RETR command (based on my reading of RFC 1939) does not flag a message for deletion; that needs to be done explicitly with the DELE command.
Note, however, that a particular POP3 server may have a policy of deleting retrieved messages, whether or not the client requested they be deleted. Whether such a server provides an operation to bypass that is beyond the scope of the protocol. (A discussion of this point is mentioned in section 8 of the RFC, but is not part of the protocol itself.)
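With poplib this translates fairly directly: call retr() and simply never call dele(). A minimal sketch follows; the helper names download_all and fetch_and_keep are made up for illustration, and it assumes the server has no delete-on-retrieve policy of its own.

```python
import poplib


def download_all(conn, delete=False):
    """Retrieve every message from a logged-in POP3 connection.

    Messages are only flagged for deletion when `delete` is True; RETR on
    its own leaves them on the server (subject to server policy).
    """
    count, _total_size = conn.stat()
    messages = []
    for number in range(1, count + 1):
        _response, lines, _octets = conn.retr(number)
        messages.append(b"\r\n".join(lines))
        if delete:
            conn.dele(number)
    return messages


def fetch_and_keep(host, user, password):
    """Log in, download everything, and leave the mailbox untouched."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        return download_all(conn, delete=False)
    finally:
        conn.quit()  # QUIT enters the update phase; nothing was flagged
```

To delete later, open a new session and call dele() on the messages you have decided to drop, ideally after matching them by their UIDL value, since message numbers are only valid within one session.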

POP3 by design downloads and removes mail from a server after it's successfully fetched. If you don't want that, then use the IMAP protocol instead. That protocol has support to allow you to delete mail at your leisure as opposed to when it's synced to your machine.

Parsing emails as soon as they are received

I have users sending emails with some text I need to extract. Each user's email is mapped to a single mailbox. I'm currently using a cron job that polls the mailbox (postfix) every 5 minutes, checks for new messages, and sends them to a queue where I have workers parse them. I have two main questions:
Is there a way I can parse the email as soon as it's received instead of polling the server? Also, how could I implement this to be scalable? For example, if there are 50 incoming messages per second.
I'm programmatically writing each user's email address to point to a mailbox in the postfix configuration file. Would it be better to create a catch-all account, so I don't have to write each email address? However, I know catch-all accounts are more susceptible to spam.

Use a pipe alias to catch the email, then use Celery to dump it into a message queue for processing.

Yes, this can be done quite easily. All you need to do is configure postfix to forward email to a script instead of to a mailbox. It does not really have to be a catch-all; you can configure postfix to forward specific addresses to a script. The script can be written in any language. I have written such a script in PHP a couple of times. Another possibility for a very busy server, such as one receiving 50 emails per second, is to write your own filter server, then configure postfix to pass each message to your filter.
To forward email to a script, put a line like this in the aliases file; the path must point to the script:
someaccount: "|/usr/local/bin/emailParser.php"
To forward emails to a filter, it has to be configured in master.cf, which is a little more difficult.
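Since the script can be written in any language, here is a rough Python equivalent of such a pipe target. It is a sketch under assumptions: the function names and the choice of fields to extract are illustrative, and the step that hands the result to a queue is left as a placeholder.

```python
import sys
from email import message_from_binary_file, policy


def parse_piped_message(stream):
    """Parse one raw email from a binary stream and pull out a few fields."""
    msg = message_from_binary_file(stream, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "text": body.get_content() if body is not None else "",
    }


def main():
    # Postfix writes the whole message to the script's standard input.
    fields = parse_piped_message(sys.stdin.buffer)
    # Hand `fields` to a queue or worker here; printing is a placeholder.
    print(fields["from"], fields["subject"])
    return 0
```

Saved as an executable script with `sys.exit(main())` as its last line, this would be used as the pipe target in the alias instead of the PHP file.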

I would recommend using Procmail for this. It is specifically designed to process your incoming mail and you can pass all mail with a certain property to your app.
http://www.procmail.org/
The spam problem with catch-all addresses can usually be solved quite easily by monitoring all mail on the machine. If multiple addresses receive the same mail, then there's a high probability that it's spam.

In Django, I want to insert a database record by sending myself an email?

I'm looking into a possible feature for my little to-do application... I like the idea that I can send an email to a particular email address, containing a to-do task I need to complete, and this will be read by my web application and be put in the database... So, when I come to log into my application, the to-do task I emailed will be there as an entry in the app.
Is this possible? I have a slice with SliceHost (basically a dedicated server) so I have total control on what to install etc. I'm using Python/Django/MySQL for this.
Any ideas on what steps to take to make this happen?

If I were to implement this, I'd use a scheduler and a job to be scheduled.
That job would connect to the mail server (be it POP3 or IMAP) and parse the unread messages (or messages unread by the job). Based on that I would insert that record.
You'd get two types of records that way: a list of mail message IDs which have been processed (so you don't reprocess mails) and a list of tasks.
The disadvantage is that it takes some time before you see the task, as the job only executes every X minutes or seconds.
If that is not good enough I'd go for a permanent IMAP connection, but you'd have to implement more error handling; you don't just retry automatically every X minutes.
Googling for Django +scheduler will get you started.
Also have a look at this StackOverflow thread, no need to reinvent the wheel :)
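The core of such a job, with its two kinds of records, might look like the sketch below. It is deliberately independent of Django and of the mail protocol: the name run_job is hypothetical, the messages are assumed to be already fetched and parsed, and using the Message-ID header as the identifier and the Subject as the task text are assumptions (a POP3 UIDL value or IMAP UID would work in the same place).

```python
def run_job(messages, processed_ids, tasks):
    """Process one polling pass.

    `messages` is any iterable of parsed emails, `processed_ids` is the set
    of message IDs handled so far, and `tasks` is the task list.
    Both collections are updated in place; the newly added tasks are returned.
    """
    added = []
    for msg in messages:
        raw_id = msg.get("Message-ID")
        if raw_id is None:
            continue  # no usable identifier, skip rather than guess
        msg_id = str(raw_id)
        if msg_id in processed_ids:
            continue  # already handled on an earlier run
        task = str(msg.get("Subject", "")).strip()
        if task:
            tasks.append(task)
            added.append(task)
        processed_ids.add(msg_id)
    return added
```

In a real application the set and the list would be two database tables, and the scheduler would call the job every X minutes.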

I needed the exact same thing. I use the Lamson project (which is written in Python) to transform email, forward email based on rules to my www.evernote.com and thinking rock www.trgtd.com.au accounts, update firewall web filtering rules, update allow/deny lists for my spam filter, read and write databases, etc.
I like to think of it as email server automation and email application development.
www.lamsonproject.org
Troy

One way that I've solved this in the past was using qmail's .qmail files (docs).
Basically you set up qmail and point your email address (for ease of use, let's assume proc@whatever.com is your email address) to your home directory. In that directory you set up a .qmail-proc file to handle the mail.
This allows you to use a full-fledged SMTP server on your server, including spam filtering, forwarding, aliases, all that fun stuff. You can then pipe the data from an email into an application. In your case, I would suggest making a management command in Django to process the email (I'll call it proc_email). Thus your .qmail-proc may look like:
/var/spool/mail/proc
| /www/django/myproject/manage.py proc_email
This stores a copy of the email in /var/spool/mail/proc, then passes the email to the script in the second line. The email itself is passed to proc_email via sys.stdin. Simply read the email from there, and store it through your Django Models.
If you need to process email for different addresses later, you can also set up aliases which point to your home directory, and use .qmail-<username> files for each alias. This allows you to pass other flags (such as the username for each alias) to proc_email if needed.
I should note that this isn't the simplest solution, but it can scale, and is pretty darn bullet proof.

I would not focus on Django for this.
I would create a mail server to catch these emails. Use http://docs.python.org/library/smtpd.html.
I would then use just the Django ORM to update the database based on the emails received.

Deleting the most recently received email via Python script?

I use Gmail and an application that notifies me if I've received a new email, showing its title in a tooltip (GmailNotifier with Miranda-IM). Most of the emails I receive are ones I don't want to read, and it's annoying having to log in to Gmail on a slow connection just to delete said email. I believe the plugin is closed source.
I've been (unsuccessfully) trying to write a script that will log in and delete the 'top' email (the one most recently received). However, this is not as easy as I thought it would be.
I first tried using imaplib, but discovered that it doesn't contain any of the methods I hoped it would. It's a bit like the dbapi spec, containing only minimal functionality in case the IMAP spec is changed. I then tried reading the IMAP RFC (rfc3501). Halfway through it, I realized I didn't want to write an entire mail client, so I decided to try using pop3 instead.
poplib is also minimal but seemingly has what I need. However pop3 doesn't appear to sort the messages in any order I'm familiar with. I have to either call top() or retr() on every single email to read the headers if I want to see the date received.
I could probably iterate through every single message header, searching for the most recent date, but that's ugly. I want to avoid parsing my entire mailbox if possible. I also don't want to 'pop' the mailbox and download any other messages.
It's been 6 hours now and I feel no closer to a solution than when I started. Am I overlooking something simple? Is there another library I could try? (I found a 'chilkat' one, but it's bloated to hell, and I was hoping to do this with the standard library)

import poplib

# Connect to the server over SSL
mailserver = poplib.POP3_SSL('pop.gmail.com')
mailserver.user('recent:YOURUSERNAME')  # use 'recent mode'
mailserver.pass_('YOURPASSWORD')  # consider not storing in plaintext!

# The newest email has the highest message number
numMessages = len(mailserver.list()[1])

# Confirm this is the right one; can comment these out later
newestEmail = mailserver.retr(numMessages)
print(b'\n'.join(newestEmail[1]).decode('utf-8', errors='replace'))

# Most servers will not delete until you quit
mailserver.dele(numMessages)
mailserver.quit()
I worked with poplib recently, writing a very primitive email client. I tested this with my email server (not Gmail) on some test emails and it seemed to work correctly. I'd suggest sending yourself a few dummy emails to test it out first.
Caveats:
Make sure you are using 'recent mode': http://mail.google.com/support/bin/answer.py?answer=47948
Make sure your Gmail account has POP3 enabled: Gmail > Settings > Forwarding and POP/IMAP > "Enable POP for all mail"
Hope this helps, it should be enough to get you going!
