Way to avoid downloading a file twice via IMAP - Python

I'm writing a script in Python that saves attachments from Gmail, only from unseen emails. To save on bandwidth I want to make sure that every file only gets downloaded once.
-I can't check the folder where I save them, because the file could already have been removed, in which case it shouldn't be downloaded again. (The script accesses the Inbox read-only, so it doesn't mark the email as read. As soon as the script runs again it will download the same attachments again, until the email gets marked read via another channel.)
-Right now I save the filename to a sqlite database, but there are two problems: I haven't figured out how to check the database for the filename the next time I run the script, and there's also a chance that somewhere down the line an attachment arrives with the same filename, which then wouldn't get downloaded.
What's a safe and scalable way to make sure I don't download the files more than once?

There are several open source projects in Python that already perform this task very well. Why don't you take a look at OfflineIMAP's and getmail's source code? Also, if you're just trying to back up your Gmail account, I suggest you use one of those rather than rolling your own...

You could save not only the filename to the database but also, for example, the Date: header of the mail (or any combination of headers that you are sure uniquely identifies a mail).

You could fetch the headers for the message, and use the message's Date and/or Message-Id header value to construct a "unique id prefix" for all of the attachments in that message. Then create a key of the form [unique_id]_[filename], check if that key exists in your database or filesystem. If not, download all attachments for that message, and save each with the modified unique id key.
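A minimal sketch of that approach, assuming Python 3's imaplib/email/sqlite3; the credentials, database filename, and table layout are made up for illustration, and attachments are saved into the current directory for brevity:
import email
import imaplib
import sqlite3

# Hypothetical schema: one row per attachment already downloaded.
db = sqlite3.connect('attachments.db')
db.execute('CREATE TABLE IF NOT EXISTS downloaded (key TEXT PRIMARY KEY)')

con = imaplib.IMAP4_SSL('imap.gmail.com')
con.login('user@gmail.com', 'password')   # placeholder credentials
con.select('INBOX', readonly=True)        # read-only: don't mark messages as seen

resp, items = con.search(None, 'UNSEEN')
for num in items[0].split():
    resp, data = con.fetch(num, '(RFC822)')
    msg = email.message_from_bytes(data[0][1])
    unique_id = msg.get('Message-Id', '').strip()
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue
        key = '%s_%s' % (unique_id, filename)
        # Skip attachments we've already saved once.
        if db.execute('SELECT 1 FROM downloaded WHERE key = ?', (key,)).fetchone():
            continue
        with open(filename, 'wb') as f:
            f.write(part.get_payload(decode=True))
        db.execute('INSERT INTO downloaded (key) VALUES (?)', (key,))
        db.commit()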

Related

Python / Email processing: How to get the date an attachment was created?

I have a Python program that will read an Outlook inbox using these Python libraries:
1. IMAPClient
2. email
I want to know if it is possible to get the date the email attachment was created.
I don't see anything in the email headers that stands out. I can get the date an email was sent (or forwarded), but it is the forwarded-email case that prompts this question.
I want to get the date of the attachment inside the email. If anyone has done this, and has a full working code snippet to share, it would be greatly appreciated.
I have done several searches, looked carefully through the email headers, looked at the documentation for the two libraries I am using (IMAPClient and email), and see nothing that stands out that would lead to a solution.
Some file formats include this, most don't, and some may. For example, the EXIF data in some JPEG files includes it. To read that you'll need an EXIF library, not an IMAP library. Microsoft Word files include a creation date (IIRC that's mandatory for the format), but again you'll need a type-specific library.
IMAP is merely the channel through which you download the data you want to examine.
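For the JPEG case, a small sketch assuming a reasonably recent Pillow and an attachment already saved to disk (the path is a placeholder):
from PIL import Image

img = Image.open('attachment.jpg')   # placeholder path
exif = img.getexif()
# DateTime (tag 306) lives in the main IFD; DateTimeOriginal (36867)
# lives in the Exif sub-IFD (0x8769).
print('DateTime:', exif.get(306))
print('DateTimeOriginal:', exif.get_ifd(0x8769).get(36867))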

How can I remove all the emails not in the Important or Sent folders?

There are many more emails in my All mailbox than in the Important and Sent mailboxes. I want to remove all the mails which are not in the Important or Sent mailbox.
I cannot do either of the following:
1) delete all the emails in the All mailbox (when I delete all the emails in the All mailbox, all the emails in the Important and Sent mailboxes are deleted at the same time), or
2) copy emails from the Important and Sent mailboxes first.
How can I write code to accomplish this?
The problem can take another form:
how can I make a copy of the emails in my Gmail mailbox "[Gmail]/&kc 2JgQ-" into the local directory g:\mygmail?
There are 5 emails in my Gmail inbox. I save all of them into g:\mygmails, naming them 0th.myemail, 1th.myemail, 2th.myemail, 3th.myemail, 4th.myemail, with the following code. Now how can I read them with Thunderbird or some other email client? I don't want to write my own code to read them.
import email, imaplib

att_path = "g:\\mygmails\\"
user = "xxxx"
password = "yyyy"

con = imaplib.IMAP4_SSL('imap.gmail.com')
con.login(user, password)
con.select('INBOX')

# SEARCH returns a single space-separated string of message numbers.
resp, items = con.search(None, "ALL")
items = items[0].split()

for idx, num in enumerate(items):
    # Fetch the complete raw RFC 822 message.
    resp, data = con.fetch(num, "(RFC822)")
    raw = data[0][1]
    with open(att_path + str(idx) + "th" + ".myemail", 'wb') as fp:
        fp.write(raw)
After doing some digging around on Google, I found a GitHub repository that provides a module for doing just this. It is not very well documented, but the source code is very easy to read, so that isn't a significant loss at all.
In terms of using this module, you can load in each email with the specified labels and mark them for being saved, then go through all the emails and delete the ones that have not been marked.
I don't currently see a natural way to mark the emails on the remote server, so you may have to implement something where you record the emails as strings and store them in a set.
If you have any questions still, just post a comment to this answer and I can elaborate more.
For example, if you wanted to copy the contents of a particular mailbox into a Python data structure, you could do so like this:
# Global variables.
username, password, mailboxname = '', '', '[Gmail]/&kc 2JgQ-'

# Set up.
import gmail
g = gmail.Gmail()
g.login(username, password)

# Actual code: fetch every message in the mailbox.
emails = []
for email in g.mailbox(mailboxname).mail():
    emails.append(email.fetch())

# Tear down.
g.logout()
So, assuming that you adjust the global variables accordingly, you now have a Python list (in the variable emails) of all the emails in mailboxname for the Gmail account username. Once you have this, you can easily do something like saving them to a file or files.
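If you'd rather stay with the standard library, here is a sketch of the mark-and-sweep idea described above using imaplib. The mailbox names are assumptions (localized accounts name them differently; check con.list()), and what \Deleted plus expunge actually does in Gmail depends on the account's IMAP settings, so test on a throwaway account first:
import imaplib

con = imaplib.IMAP4_SSL('imap.gmail.com')
con.login('user@gmail.com', 'password')   # placeholder credentials

def message_ids(mailbox):
    # Collect the Message-Id header of every mail in a mailbox.
    con.select(mailbox, readonly=True)
    resp, items = con.search(None, 'ALL')
    ids = set()
    for num in items[0].split():
        resp, data = con.fetch(num, '(BODY.PEEK[HEADER.FIELDS (MESSAGE-ID)])')
        ids.add(data[0][1].strip())
    return ids

# "Mark": remember everything that lives in Important or Sent.
keep = message_ids('"[Gmail]/Important"') | message_ids('"[Gmail]/Sent Mail"')

# "Sweep": flag everything in All Mail that wasn't marked.
con.select('"[Gmail]/All Mail"')
resp, items = con.search(None, 'ALL')
for num in items[0].split():
    resp, data = con.fetch(num, '(BODY.PEEK[HEADER.FIELDS (MESSAGE-ID)])')
    if data[0][1].strip() not in keep:
        con.store(num, '+FLAGS', '\\Deleted')
con.expunge()
con.logout()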
If you like Windows PowerShell, I have a solution that can be reused with little effort and customized to your needs. You can set up a mail user agent to use the Web Access API and automate this task. In my examples, good old PowerShell (as we already know, the task automation and configuration management framework from Microsoft), with its headless IE capabilities, can run as a daemon and communicate with us only if the preconditions are true.
To be more precise, if you have to log in and use firewall Web Access APIs, the implementation is almost the same. So we kill two birds with one stone: every morning you'll be behind the wall and know your mail content. Here you can see a sample solution.

Sync local file with HTTP server location (in Python)

I have an HTTP server which hosts a large file, and Python clients (GUI apps) which download it.
I want the clients to download the file only when needed, but have an up-to-date file on each run.
My thought was that each client would request the file on each run using the If-Modified-Since HTTP header, set to the modification time of the existing file, if any. Can someone suggest how to do that in Python?
Can someone suggest an alternative, easy, way to achieve my goal?
You can use the ETag header (a hash of your file: md5sum, sha256, etc.) to check whether two files differ, instead of relying on the Last-Modified date.
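A minimal sketch of such an ETag-based conditional GET, assuming the server actually sends an ETag, the requests library, and a placeholder URL; the If-Modified-Since variant from the question works the same way, just with the local file's mtime in an If-Modified-Since header:
import os
import requests

url = 'http://example.com/largefile.bin'   # placeholder
local, etag_file = 'largefile.bin', 'largefile.etag'

headers = {}
if os.path.exists(local) and os.path.exists(etag_file):
    # Send the ETag we saved last time; the server answers
    # 304 Not Modified if the file hasn't changed.
    headers['If-None-Match'] = open(etag_file).read()

r = requests.get(url, headers=headers)
if r.status_code == 304:
    print('Local copy is up to date.')
else:
    with open(local, 'wb') as f:
        f.write(r.content)
    if 'ETag' in r.headers:
        with open(etag_file, 'w') as f:
            f.write(r.headers['ETag'])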
I'm assuming some things right now, BUT..
One solution would be to have a separate HTTP endpoint on the server (check.php) which returns a hash/checksum of each file you're hosting. If a checksum differs from that of the local file, the client downloads the file. This means that if the content of the file on the server changes, the client will notice the change, since the checksum will differ.
Do an MD5 hash of the file contents, put it in a database or something, and check against it before downloading anything.
Your solution would work too, but it requires the server to actually include a Last-Modified date in the response headers for the GET request (some server software doesn't do this).
I'd suggest setting up a database that looks something like:
[ID] [File_name] [File_hash]
0001 moo.txt asd124kJKJhj124kjh12j
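A sketch of that check, assuming the server exposes each file's hash at a hypothetical check.php endpoint like the one suggested above (URLs and filenames are placeholders):
import hashlib
import requests

url = 'http://example.com/largefile.bin'                        # placeholder
hash_url = 'http://example.com/check.php?file=largefile.bin'    # hypothetical endpoint
local = 'largefile.bin'

def md5_of(path):
    # Hash the file in chunks so large files don't exhaust memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

remote_hash = requests.get(hash_url).text.strip()
try:
    local_hash = md5_of(local)
except FileNotFoundError:
    local_hash = None

# Download only when the checksums differ (or there is no local copy).
if local_hash != remote_hash:
    with open(local, 'wb') as f:
        f.write(requests.get(url).content)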
It seems to me the easiest solution is hosting the file in Mercurial and using the Mercurial API to find the file's hash, downloading the file if the hash has changed.
Calculating the hash can be done as in the answer to this question; for downloading the file, urllib will be enough.

Parsing All Mail from Mailbox File in Python

Maybe I'm going about this the wrong way, but I want to parse a single "catch all" email inbox via Python. I see the email module and I can make it parse an individual email, but what I want to do is open (for example) /var/spool/mail/catchall and parse all of the individual messages inside it. Opening that file and running the parser over it treats the whole thing as one giant email. How would I break it into individual messages?
Alternatively: is this A Bad Idea, given that I'm going to want to delete the messages when I'm done with them? I'm trying this route instead of POP/IMAP only because the server support isn't available right now.
You'll want to use the mailbox module to actually go through the mailbox.
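For an mbox-format spool like /var/spool/mail/catchall, a minimal sketch that also covers deleting messages once they are processed:
import mailbox

mbox = mailbox.mbox('/var/spool/mail/catchall')
mbox.lock()   # the MTA may be appending while we read
try:
    for key in list(mbox.keys()):   # snapshot keys so we can remove while looping
        msg = mbox[key]             # each item is a separate email message
        print(msg['From'], msg['Subject'])
        mbox.remove(key)            # delete once processed
    mbox.flush()
finally:
    mbox.unlock()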

What's a Django/Python solution for providing a one-time url for people to download files?

I'm looking for a way to sell someone a card at an event that will have a unique code they can use later to download a file (mp3, pdf, etc.) only one time, while masking the true file location so a savvy person won't be able to download the file more than once. It would be nice to host the file on Amazon S3 to save on bandwidth where our server is co-located.
My thought for the codes would be to pre-generate the unique codes, print them on the cards, and store them in a database that also has a field recording the number of times the file was downloaded. This way we could set how many download attempts we allow the user.
The part I need direction on is how to hide/mask the original file location so people can't steal that URL and then download the file as many times as they want. I've done Google searches and I'm either not searching with the right keywords or there aren't many libraries or snippets out there for this type of thing.
I'm guessing that I might be able to rig something up using django.views.static.serve to act as a sort of proxy between the actual file and the user downloading it. The only drawback to this method, I would think, is that I would need to use our actual web server and wouldn't be able to store the file on Amazon S3.
Any suggestions or thoughts are greatly appreciated.
Neat idea. However, I would warn against the single-download method, because there is no guarantee that their first download attempt will be successful. Perhaps use a time-expiration method instead?
But it is certainly possible to do this with Django. Here is an outline of the basic approach:
Set up a django url for serving these files
Use a GET parameter which is a unique string to identify which file to get.
Keep a database table which has a FileField for the file to download. This table maps the unique strings to the location of the file on the file system.
To serve the file as a download, set the response headers in the view like this:
(path is the location of the file to serve)
with open(path, 'rb') as f:
    response = HttpResponse(f.read())
response['Content-Type'] = 'application/octet-stream'
response['Content-Disposition'] = 'attachment; filename="%s"' % 'insert_filename_here'
return response
Since we are using this Django page to serve the file, the user cannot find out the original file location.
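A minimal models sketch of the lookup table described in the outline (the model and field names are made up):
from django.db import models

class DownloadCode(models.Model):
    # The unique string printed on the card, passed as the GET parameter.
    code = models.CharField(max_length=40, unique=True)
    # Maps the code to the file's location on the file system.
    file = models.FileField(upload_to='protected/')
    # Track attempts so the allowed number of downloads can be enforced.
    downloads_used = models.PositiveIntegerField(default=0)
    max_downloads = models.PositiveIntegerField(default=1)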
You can just use something simple such as mod_xsendfile. This functionality is also available in other popular webservers such as lighttpd or nginx.
It works like this: when enabled, your application (e.g. a trivial PHP script) can send a special response header, causing the webserver to serve a static file.
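With mod_xsendfile enabled in Apache, a Django view along these lines would do it (the file path and name are placeholders; nginx uses an X-Accel-Redirect header instead):
from django.http import HttpResponse

def download(request):
    response = HttpResponse()
    # Apache intercepts this header and streams the file itself,
    # so the real location never reaches the client.
    response['X-Sendfile'] = '/srv/protected/example.mp3'   # placeholder path
    response['Content-Disposition'] = 'attachment; filename="example.mp3"'
    return response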
If you want it to work with S3 you will need to handle each and every request this way, meaning the traffic will go through your site, from there to AWS, back to your site and back to the client. Does S3 support symbolic links / aliases? If so you might just redirect a valid user to one of the symbolic URLs and delete that symlink after a couple of hours.
