Trouble isolating emails when downloading via Python script - python

I have a script that fetches emails from my account, downloads the attachments, creates some HTML for an email blast program, and then zips everything into an archive. This works well when only one email is present in the inbox; however, the script hangs when multiple emails exist. I suspect the section of the script that zips the files is not looping correctly. What I am trying to accomplish is one zip file per email: 3 emails in the inbox = 3 separate zip files. I've done my best to reduce my code for readability while still maintaining the core structure. Could anyone point me in the right direction here? Thanks!
Code:
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]
    mail = email.message_from_string(email_body)
    for part in mail.walk():
        if part.get_content_type() == 'text/plain':
            content = part.get_payload()
            #do something/define variables from email contents
    if mail.get_content_maintype() != 'multipart':
        continue
    for part in mail.walk():
        if part.get_content_maintype() == 'multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        filename = part.get_filename()
        counter = 1
        if not filename:
            filename = 'part-%03d%s' % (counter, 'bin')
            counter += 1
        att_path = os.path.join(detach_dir, filename)
        if not os.path.isfile(att_path) :
            fp = open(att_path, 'wb')
            fp.write(part.get_payload(decode=True))
            fp.close()
        path = 'C:\directory'
        os.chdir(path)
        for file in os.listdir('.'):
            #download attachments
            htmlFile = str(token)+'.html'
            htmlCode = ('<html>HTML goes here</html>')
            htmlData = open(os.path.join('C:\directory', htmlFile), 'w+')
            htmlData.write(htmlCode)
            print htmlFile+' Complete'
            htmlData.close()
            allFiles = [f for f in os.listdir('.')]
            for file in allFiles:
                archive = zipfile.ZipFile(token+'.zip', mode='a')
                archive.write(file)
                archive.close()
                os.unlink(file)
UPDATE
Here is a link to the complete code: http://ideone.com/WEXv9P

There seems to be a mistake here:
counter = 1
if not filename:
    filename = 'part-%03d%s' % (counter, 'bin')
    counter += 1
counter is reset to 1 on every pass through this loop, so every unnamed part gets the same filename. You probably want to define it before the second
for part in mail.walk():
EDIT:
Okay, so I think the problem is in the last part of the code:
allFiles = [f for f in os.listdir('.')]
for file in allFiles:
    archive = zipfile.ZipFile(token+'.zip', mode='a')
    archive.write(file)
    archive.close()
    os.unlink(file)
This will create a zip file for each part of the email.
I think what you want to do is indent this out a level and change it to something more like this:
allFiles = [f for f in os.listdir(detach_dir) if not f.endswith(".zip")]
for file in allFiles:
    archive = zipfile.ZipFile(token+'.zip', mode='a')
    archive.write(file)
    archive.close()
    os.unlink(file)
That way it won't add existing zip files to the new archive or delete them.
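To make the "one zip per email" idea concrete, here is a minimal sketch of the zipping step as a standalone function that would be called once per email, after that email's attachments and HTML file have been written. The function name zip_and_remove and its parameters are hypothetical (they are not in the original script), and it opens the archive with mode "w" on the assumption that each email gets a fresh archive name.

```python
import os
import zipfile


def zip_and_remove(directory, archive_name):
    """Move every non-zip file in `directory` into one archive (hypothetical helper)."""
    archive_path = os.path.join(directory, archive_name)
    # Collect the names first so the new archive is never listed as an input.
    names = [
        f for f in os.listdir(directory)
        if not f.endswith(".zip") and os.path.isfile(os.path.join(directory, f))
    ]
    # Open the archive once per email instead of once per file.
    with zipfile.ZipFile(archive_path, mode="w") as archive:
        for name in names:
            # arcname keeps the directory prefix out of the archive.
            archive.write(os.path.join(directory, name), arcname=name)
    # Delete the originals only after the archive has been closed.
    for name in names:
        os.unlink(os.path.join(directory, name))
    return archive_path
```

Inside the email loop this would be called as something like zip_and_remove(detach_dir, str(token) + '.zip'), at the same indentation level as the second for part in mail.walk(): loop rather than inside it.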

Related

Python: editing a series of txt files via loop

With Python I'm attempting to edit a series of text files to insert a series of strings. I can do so successfully with a single txt file. Here's my working code that appends messages before and after the main body within the txt file:
filenames = ['text_0.txt']
with open("text_0.txt", "w") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            header1 = "Message 1:"
            lines = "\n\n\n\n"
            header2 = "Message 2:"
            contents = header1 + infile.read() + lines + header2
            outfile.write(contents)
I'm seeking some assistance in structuring a script to iteratively make the same edits to a series of similar txt files in the directory. There are 20 or so txt files, all structured and named the same way: text_0.txt, text_1.txt, text_2.txt, and so on. Any assistance is greatly appreciated.
To loop through a folder of text files, you can do it like this:
import os
YOURDIRECTORY = "TextFilesAreHere"  # the folder that holds your text files
for file in os.listdir(YOURDIRECTORY):
    filename = os.fsdecode(file)
    with open(os.path.join(YOURDIRECTORY, filename), "r") as infile:
        pass  # do what you want with the file
If you already know the file naming then you can simply loop:
filenames = [f'text_{index}.txt' for index in range(21)]  # adjust the range to the number of files
header1 = "Message 1:"
lines = "\n\n\n\n"
header2 = "Message 2:"
for file_name in filenames:
    # read first: opening the same file with "w" truncates it
    with open(file_name) as infile:
        body = infile.read()
    with open(file_name, "w") as outfile:
        outfile.write(header1 + body + lines + header2)
Or loop the directory like:
import os
for filename in os.listdir(directory):
    pass  # do something, like check the filename in list
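The two approaches can be combined by matching files by pattern instead of building the names by hand. Below is a minimal sketch using pathlib; the function name wrap_text_files and its default arguments are assumptions chosen to mirror the question (text_*.txt, "Message 1:", "Message 2:"), not something from the original posts.

```python
from pathlib import Path


def wrap_text_files(directory, pattern="text_*.txt",
                    header="Message 1:", footer="Message 2:", gap="\n\n\n\n"):
    """Rewrite each matching file as header + original body + gap + footer."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        # Read the whole file before writing, because writing truncates it.
        body = path.read_text()
        path.write_text(header + body + gap + footer)
        changed.append(path.name)
    return changed
```

Note that running it twice wraps the files twice, so it is probably worth trying on a copy of the folder first.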

Get PDF attachments using Python

I admit that I am new to Python.
We have to process PDF files with attachments or annotated attachments. I am trying to extract attachments from a PDF file using PyPDF2 library.
The only (!) example found on GitHub contains the following code:
import PyPDF2
def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    # VK
    print (catalog)
    #
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
And the call is:
rootdir = "C:/Users/***.pdf" # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
I am getting a KeyError: '/EmbeddedFiles'
A print of the catalog indeed does not contain EmbeddedFiles:
{'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}
This particular PDF contains 9 attachments. How can I get them?
This is too long for a comment. I have not personally tested this code, which looks very similar to the outline in your question, but I am adding it here for others to test. It is the subject of a pull request (https://github.com/mstamy2/PyPDF2/pull/440), and the full updated sequence is described by Kevin M Loeffler in https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/
Viewable at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py
Download as
https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py
It always helps if you can provide an example input of the type you have problems with, so that others can adapt the extraction routine to suit.
In response to someone getting an error:
"I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist so trying to access it throws an error."
"Something I would try is to put everything after the ‘catalog’ line in the get_attachments method in a try-catch."
Unfortunately there are many pending pull requests that have not been merged into PyPDF2 (https://github.com/mstamy2/PyPDF2/pulls). Some of them may be relevant to this and other shortcomings, so it is worth checking whether any of them help.
For one pending example of a try/except that you might be able to adapt for other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3
Keywords associated with embedded files, apart from /Type/EmbeddedFiles, include /Type /Filespec and /Subtype /FileAttachment. Note that the pairs may not always be separated by spaces, so perhaps see whether those can be interrogated for the attachments.
Again on that last point, the example searches for /EmbeddedFiles, indexed in the plural, whilst any individual entry is identified in the singular.
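The KeyError in the question comes from indexing a key that is simply absent, so a guarded lookup is one way to apply the try/except advice. The sketch below uses plain dicts as stand-ins for the PDF catalog; the helper name lookup and the sample data are hypothetical, and real PyPDF2 objects may be indirect references that need resolving before they can be indexed.

```python
def lookup(mapping, *keys, default=None):
    """Follow `keys` through nested mappings; return `default` if any key is missing."""
    current = mapping
    for key in keys:
        try:
            current = current[key]
        except (KeyError, TypeError):
            return default
    return current


# Plain dicts standing in for a PDF catalog (hypothetical data, not read from a file).
with_files = {"/Names": {"/EmbeddedFiles": {"/Names": ["a.txt", "<filespec>"]}}}
without_files = {"/Names": {"/Dests": {}}}

print(lookup(with_files, "/Names", "/EmbeddedFiles", "/Names"))
print(lookup(without_files, "/Names", "/EmbeddedFiles", "/Names", default=[]))
```

With a lookup like this, a PDF without an /EmbeddedFiles entry yields an empty list instead of raising, and the caller can go on to check page annotations.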
This can be improved but it was tested to work (using PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I am yet to compare the output with our internal classification.
Produces a semicolon separated file that can be imported into Excel.
import fitz # = PyMuPDF
import os
outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"
print ("filepath;","encrypted;","pages;", "embedded;","attachments;","annotations;","portfolio", file = outfile)
enc=pages=count=names=annots=collection=''
for subdir, dirs, files in os.walk(folder):
    for file in files:
        #print (os.path.join(subdir, file))
        filepath = subdir + os.sep + file
        if filepath.endswith(".pdf"):
            #print (filepath, file = outfile)
            try:
                doc = fitz.open(filepath)
                enc = doc.is_encrypted
                #print("Encrypted? ", enc, file = outfile)
                pages = doc.page_count
                #print("Number of pages: ", pages, file = outfile)
                count = doc.embfile_count()
                #print("Number of embedded files:", count, file = outfile) # shows number of embedded files
                names = doc.embfile_names()
                #print("Embedded files:", str(names), file = outfile)
                #if count > 0:
                #    for emb in names:
                #        print(doc.embfile_info(emb), file = outfile)
                annots = doc.has_annots()
                #print("Has annots?", annots, file = outfile)
                links = doc.has_links()
                #print("Has links?", links, file = outfile)
                trailer = doc.pdf_trailer()
                #print("Trailer: ", trailer, file = outfile)
                xreflen = doc.xref_length() # length of objects table
                for xref in range(1, xreflen): # skip item 0!
                    #print("", file = outfile)
                    #print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file = outfile)
                    #print(doc.xref_object(i, compressed=False), file = outfile)
                    if "Collection" in doc.xref_object(xref, compressed=False):
                        #print ("Portfolio", file = outfile)
                        collection ='True'
                        break
                    else: collection="False"
                    #print(doc.xref_object(xref, compressed=False), file = outfile)
            except:
                #print ("Not a valid PDF", file = outfile)
                enc=pages=count=names=annots=collection="Not a valid PDF"
            print(filepath,";", enc,";",pages, ";",count, ";",names, ";",annots, ";",collection, file = outfile )
outfile.close()
I was also running into the same problem with several pdfs that I have. I was able to make these changes to the referenced code that got it to work for me:
import PyPDF2
def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.
    :return: dictionary of filenames and bytestrings
    """
    attachments = {}
    #First, get those that are pdf attachments
    catalog = reader.trailer["/Root"]
    if "/EmbeddedFiles" in catalog["/Names"]:
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
        for f in fileNames:
            if isinstance(f, str):
                name = f
                dataIndex = fileNames.index(f) + 1
                fDict = fileNames[dataIndex].getObject()
                fData = fDict['/EF']['/F'].getData()
                attachments[name] = fData
    #Next, go through all pages and all annotations to those pages
    #to find any attached files
    for pagenum in range(0, reader.getNumPages()):
        page_object = reader.getPage(pagenum)
        if "/Annots" in page_object:
            for annot in page_object['/Annots']:
                annotobj = annot.getObject()
                if annotobj['/Subtype'] == '/FileAttachment':
                    fileobj = annotobj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].getData()
    return attachments
handler = open(filename, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
I know it is a late reply, but I only started looking into this yesterday. I have used the PyMuPDF library to extract the embedded files. Here is my code:
import os
import fitz
def get_embedded_pdfs(input_pdf_path, output_path=None):
    if not output_path :
        output_path = input_pdf_path.split(".")[0] + "_embeded_files/"
    # create the output folder if it does not exist yet
    os.makedirs(output_path, exist_ok=True)
    doc = fitz.open(input_pdf_path)
    item_name_dict = {}
    for each_item in doc.embfile_names():
        item_name_dict[each_item] = doc.embfile_info(each_item)["filename"]
    for item_name, file_name in item_name_dict.items():
        out_pdf = os.path.join(output_path, file_name)
        ## get embeded_file in bytes
        ## (embfile_get is the newer name of embeddedFileGet, matching the other embfile_* calls)
        fData = doc.embfile_get(item_name)
        ## save embeded file
        with open(out_pdf, 'wb') as outfile:
            outfile.write(fData)
disclaimer: I am the author of borb (the library used in this answer)
borb is an open-source, pure Python PDF library. It abstracts away most of the unpleasantness of dealing with PDF (such as having to deal with dictionaries and having to know PDF-syntax and structure).
There is a huge repository of examples, containing a section on dealing with embedded files, which you can find here.
I'll repeat the relevant example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)
    # check whether we have read a Document
    assert doc is not None
    # retrieve all embedded files and their bytes
    for k, v in doc.get_embedded_files().items():
        # display the file name, and the size
        print("%s, %d bytes" % (k, len(v)))
if __name__ == "__main__":
    main()
After the Document has been read, you can simply ask it for a dict mapping the filenames onto the bytes.

IMAP how to save **all** attachments

I am trying to save Excel file attachments from my inbox to a directory. My code appears to run fine, because I see the printouts, but the attachments won't save in the directory. Is there something missing in my code that prevents the files from being saved?
import email, getpass, imaplib, os, sys
detach_dir = r'\directory link'
user = "test"
pwd = "test"
sender_email = "example@example.com"
m = imaplib.IMAP4_SSL("outlook.office365.com")
m.login(user,pwd)
m.select('"INBOX/somestuff"')
print("ok")
resp, items = m.search(None, 'FROM', '"%s"' % sender_email)
items = items[0].split()
print("ok")
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1].decode('utf-8')
    mail = email.message_from_string(email_body)
    print("ok")
    if mail.get_content_maintype() != 'multipart':
        continue
    subject = ""
    if mail["subject"] is not None:
        subject = mail["subject"]
    print ("["+mail["From"]+"] :" + subject)
    for part in mail.walk():
        if part.get_content_maintype() == 'multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        filename = part.get_filename()
        counter = 1
        if not filename:
            filename = 'part-%03d%s' % (counter, 'bin')
            counter += 1
        att_path = os.path.join(detach_dir, filename)
        if not os.path.isfile(att_path) :
            fp = open(att_path, 'wb')
            fp.write(part.get_payload(decode=True))
            fp.close()
This code saves just one of the attachments in the subfolder, but I want to save all attachments to the directory:
detach_dir = r'directory link'
m = imaplib.IMAP4_SSL("outlook.office365.com")
m.login('user','pass')
m.select('"INBOX/subfolder"')
resp, items = m.search(None, 'All')
items = items[0].split()
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    filename = part.get_filename()
    print(filename)
    att_path = os.path.join(detach_dir, filename)
    fp = open(att_path, 'wb')
    fp.write(part.get_payload(decode=True))
    fp.close()
print('check folder')
Question: This code saves just one of the attachments ... but I am looking to get all attachments
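Use iter_attachments():
import email
import email.policy
resp, items = imap.search(None, "(UNSEEN)")
for n, num in enumerate(items[0].split(), 1):
    resp, data = imap.fetch(num, '(RFC822)')
    data01 = data[0][1]
    # fetch returns bytes; policy=email.policy.default yields an EmailMessage,
    # which is the class that provides iter_attachments()
    msg_obj = email.message_from_bytes(data01, policy=email.policy.default)
    for part in msg_obj.iter_attachments():
        filename = part.get_filename()
        print(filename)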
iter_attachments()
Return an iterator over all of the immediate sub-parts of the message that are not candidate “body” parts. That is, skip the first occurrence of each of text/plain, text/html, multipart/related, or multipart/alternative (unless they are explicitly marked as attachments via Content-Disposition: attachment), and return all remaining parts.
Used modules and classes:
class imaplib.IMAP4
class email.message.EmailMessage
Here’s an example of how to unpack a MIME message, using email.message.walk(), into a directory of files:
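A minimal sketch of that unpacking step, written around iter_attachments() rather than walk(). The function name save_attachments is hypothetical, and the demo message is built locally so the sketch runs without a mail server. For a message fetched over IMAP, this assumes it was parsed with email.message_from_bytes(..., policy=email.policy.default) so that it is an EmailMessage.

```python
import os
from email.message import EmailMessage


def save_attachments(msg, target_dir):
    """Write every attachment of `msg` into `target_dir`; return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for counter, part in enumerate(msg.iter_attachments(), 1):
        payload = part.get_payload(decode=True)
        if payload is None:
            # Nested multipart attachments have no single payload; skip them here.
            continue
        # basename() drops any directory components a sender put in the name.
        filename = os.path.basename(part.get_filename() or "")
        if not filename:
            filename = "part-%03d.bin" % counter
        path = os.path.join(target_dir, filename)
        with open(path, "wb") as fp:
            fp.write(payload)
        saved.append(path)
    return saved


# Demo message with two attachments (hypothetical data).
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("body text")
demo.add_attachment(b"first", maintype="application",
                    subtype="octet-stream", filename="a.bin")
demo.add_attachment(b"second", maintype="application",
                    subtype="octet-stream", filename="b.bin")
```

Because the counter comes from enumerate over the attachments, unnamed parts get distinct fallback names instead of all becoming part-001.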

To copy the attached file in an email.

I have been able to figure out how to get the name of the attached file in an email, but I am stuck after that. I tried os.path.join, which just joins the path I want to download to with the filename. Please suggest something. Thanks.
mail = imaplib.IMAP4_SSL('outlook.office365.com',993)
mail.login("UN", "PW")
mail.select("Inbox")
typ, msgs = mail.search(None, '(SUBJECT "qwerty")')
msgs = msgs[0].split()
for emailid in msgs:
    resp, data = mail.fetch(emailid, "(RFC822)")
    email_body = data[0][1]
    m = email.message_from_bytes(email_body)
    if m.get_content_maintype() != 'multipart':
        continue
    for part in m.walk():
        if part.get_content_maintype() == 'multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        filename = part.get_filename()
        print(filename)
Following the sample from this link, you can set the path when calling open. Note that a raw string (prefixed with r) cannot end with a backslash, so join the folder and the filename with os.path.join:
fp = open(os.path.join(r'c:\tmp\folder', filename), 'wb')
fp.write(part.get_payload(decode=True))
fp.close()
print('%s saved!' % filename)

Extract content from a file with mime multipart

I have a file that contains a TIFF image and an XML document in a multipart MIME document.
I would like to extract the image from this file.
How can I do that?
I have this code, but it takes extremely long on a big file (for example 30 MB), so it is not usable.
f=open("content_file.txt","rb")
msg = email.message_from_file(f)
j=0
image=False
for i in msg.walk():
    if i.is_multipart():
        #print "MULTIPART: "
        continue
    if i.get_content_maintype() == 'text':
        j=j+1
        continue
    if i.get_content_maintype() == 'image':
        image=True
        j=j+1
        pl = i.get_payload(decode=True)
        localFile = open("map.out.tiff", 'wb')
        localFile.write(pl)
        continue
f.close()
if (image==False):
    sys.exit(0);
Thank you so much.
Solved:
def extract_mime_part_matching(stream, mimetype):
    """Return the first element in a multipart MIME message on stream
    matching mimetype."""
    msg = mimetools.Message(stream)
    msgtype = msg.gettype()
    params = msg.getplist()
    data = StringIO.StringIO()
    if msgtype[:10] == "multipart/":
        file = multifile.MultiFile(stream)
        file.push(msg.getparam("boundary"))
        while file.next():
            submsg = mimetools.Message(file)
            try:
                data = StringIO.StringIO()
                mimetools.decode(file, data, submsg.getencoding())
            except ValueError:
                continue
            if submsg.gettype() == mimetype:
                break
        file.pop()
    return data.getvalue()
From:
http://docs.python.org/release/2.6.6/library/multifile.html
Thank you for the support.
It is not quite clear to me why your code hangs. The indentation looks a bit off, and opened files are not properly closed. You may also be running low on memory.
This version works fine for me:
import email
import mimetypes
with open('email.txt') as fp:
    message = email.message_from_file(fp)
for i, part in enumerate(message.walk()):
    if part.get_content_maintype() == 'image':
        filename = part.get_filename()
        if not filename:
            ext = mimetypes.guess_extension(part.get_content_type())
            filename = 'image-%02d%s' % (i, ext or '.tiff')
        with open(filename, 'wb') as fp:
            fp.write(part.get_payload(decode=True))
(Partly taken from http://docs.python.org/library/email-examples.html#email-examples)
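The mimetools and multifile modules used in the "Solved" snippet exist only in Python 2. For Python 3, here is a minimal sketch of the same idea (return the first part matching a MIME type) using the email package. The function name first_part_matching is hypothetical, and the demo message is built in memory as a stand-in for content_file.txt.

```python
import email
from email.message import EmailMessage


def first_part_matching(raw_bytes, mimetype):
    """Return the decoded payload of the first non-multipart part whose
    content type equals `mimetype`, or None if there is no such part."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_type() == mimetype:
            # decode=True undoes the base64/quoted-printable transfer encoding.
            return part.get_payload(decode=True)
    return None


# Build a small multipart message: an XML text part plus a fake TIFF attachment.
demo = EmailMessage()
demo.set_content("<doc/>", subtype="xml")
demo.add_attachment(b"II*\x00fake-tiff", maintype="image",
                    subtype="tiff", filename="map.tiff")
raw = demo.as_bytes()
```

For a file on disk, the bytes would come from reading it in binary mode, for example open("content_file.txt", "rb").read(), and the returned payload can then be written to a .tiff file.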
