Sending multiple .CSV files after zipping them and without saving to disk - python

I am trying to retrieve files from the internet, save them as dataframe (without saving the files on computer), do some
calculation, save resulting dataframes in the required format as csv files, zip them and send them back to the orginal source(see below my code).
This works perfectly fine but as I have to send hundred of files and was wondering if there is a way of
saving the dataframe to the csv file and then zipping it on the go (i.e. without saving it on computer)? Below is my code;
file_names=['A','B','C']
hdll_url = 'http://XYZ'
saving_folder_url = 'C:\Users\Desktop\Testing_Folder'
general = []
for i in file_names:
url = ('http://')
data1 = requests.get(url)
Z1 = zipfile.ZipFile(StringIO.StringIO(data1.content))
x=pd.read_csv(Z1.open('%s.tsv'%i), sep = '\t', names=["Datetime","Power"])
x['Datetime'] = pd.to_datetime(x['Datetime'])
x = x.set_index("Datetime")
x = x.resample('H', how= 'sum')
general.append(x)
ABC = pd.DataFrame(general[0]['Power'] + general[1]['Power'] + general[3]['Power'] * 11.363)
ABC.to_csv('%s\ABC.tsv'%saving_folder_url,
sep='\t',header=False,date_format='%Y-%m-%dT%H:%M:%S.%f')
ABC_z1 = zipfile.ZipFile('%s\ABC.zip'%saving_folder_url,mode='w')
ABC_z1.write("%s\ABC.tsv"%saving_folder_url)
ABC_z1.close()
ABC_files1 = {'file': open("%s\ABC.zip"%saving_folder_url,'rb')}
ABC_r1 = requests.put(hdll_url, files=ABC_files1, headers = {'content-type':'application/zip',
'Content-Disposition': 'attachment;filename=ABC.zip'
} )
XYZ = pd.DataFrame(general[0]['Power'] + general[1]['Power']*100 )
XYZ.to_csv('%s\XYZ.tsv'%saving_folder_url,
sep='\t',header=False,date_format='%Y-%m-%dT%H:%M:%S.%f')
XYZ_z1 = zipfile.ZipFile('%s\XYZ.zip'%saving_folder_url,mode='w')
XYZ_z1.write("%s\XYZ.tsv"%saving_folder_url)
XYZ_z1.close()
XYZ_files1 = {'file': open("%s\XYZ.zip"%saving_folder_url,'rb')}
XYZ_r1 = requests.put(hdll_url, files=XYZ_files1, headers = {'content-type':'application/zip',
'Content-Disposition': 'attachment;filename=XYZ.zip'
} )
P.S: I am only allowed to send files in .zip format.

Have a look at BytesIO():
memory_file = BytesIO()
with zipfile.ZipFile(memory_file, 'w') as zf:
zi = zipfile.ZipInfo('your_filename.txt')
zi.date_time = time.localtime(time.time())[:6]
zi.compress_type = zipfile.ZIP_DEFLATED
zf.writestr(zi, 'your file content goes here')
memory_file.seek(0)
Then hand it to the put method:
requests.put(<your_url>, data=memory_file, header=<your_headers>)
Or something along the lines of that. I didn't test the code, but the first part taken from a flask app, which uses send_file() to hand the file over to the client.

Related

Get PDF attachments using Python

I admit that I am new to Python.
We have to process PDF files with attachments or annotated attachments. I am trying to extract attachments from a PDF file using PyPDF2 library.
The only (!) example found on GitHub contains the following code:
import PyPDF2
def getAttachments(reader):
catalog = reader.trailer["/Root"]
# VK
print (catalog)
#
fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
And the call is:
rootdir = "C:/Users/***.pdf" # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
I am getting a KeyError: '/EmbeddedFiles'
A print of the catalog indeed does not contain EmbeddedFiles:
{'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}
This particular PDF contains 9 attachments. How can I get them?
Too Long for comments, and I have not tested personally this code, which looks very similar to your outline in the question, however I am adding here for others to test. It is the subject of a Pull Request https://github.com/mstamy2/PyPDF2/pull/440 and here is the full updated sequence as described by Kevin M Loeffler in https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/
Viewable at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py
Download as
https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py
It always helps if you can provide an example input of the type you have problems with so that others can adapt the extraction routine to suit.
In response to getting an error
"I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist so trying to access it throws an error."
"Something I would try is to put everything after the ‘catalog’ line in the get_attachments method in a try-catch."
Unfortunately there are many pending pull requests not included into PyPDF2 https://github.com/mstamy2/PyPDF2/pulls and others may also be relevant or needed to aid with this and other shortcomings. Thus you need to see if any of those may also help.
For one pending example of a try catch that you might be able to include / and adapt for other use cases see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3
Associated keywords for imbedded files apart from /Type/EmbeddedFiles include /Type /Filespec & /Subtype /FileAttachment note the pairs may not always have spaces so perhaps see if those can be interrogated for the attachments
Again on that last point the example searches for /EmbeddedFiles as indexed in the plural whilst any individual entry itself is identified as singular
This can be improved but it was tested to work (using PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I am yet to compare the output with our internal classification.
Produces a semicolon separated file that can be imported into Excel.
import fitz # = PyMuPDF
import os
outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"
print ("filepath;","encrypted;","pages;", "embedded;","attachments;","annotations;","portfolio", file = outfile)
enc=pages=count=names=annots=collection=''
for subdir, dirs, files in os.walk(folder):
for file in files:
#print (os.path.join(subdir, file))
filepath = subdir + os.sep + file
if filepath.endswith(".pdf"):
#print (filepath, file = outfile)
try:
doc = fitz.open(filepath)
enc = doc.is_encrypted
#print("Encrypted? ", enc, file = outfile)
pages = doc.page_count
#print("Number of pages: ", pages, file = outfile)
count = doc.embfile_count()
#print("Number of embedded files:", count, file = outfile) # shows number of embedded files
names = doc.embfile_names()
#print("Embedded files:", str(names), file = outfile)
#if count > 0:
# for emb in names:
# print(doc.embfile_info(emb), file = outfile)
annots = doc.has_annots()
#print("Has annots?", annots, file = outfile)
links = doc.has_links()
#print("Has links?", links, file = outfile)
trailer = doc.pdf_trailer()
#print("Trailer: ", trailer, file = outfile)
xreflen = doc.xref_length() # length of objects table
for xref in range(1, xreflen): # skip item 0!
#print("", file = outfile)
#print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file = outfile)
#print(doc.xref_object(i, compressed=False), file = outfile)
if "Collection" in doc.xref_object(xref, compressed=False):
#print ("Portfolio", file = outfile)
collection ='True'
break
else: collection="False"
#print(doc.xref_object(xref, compressed=False), file = outfile)
except:
#print ("Not a valid PDF", file = outfile)
enc=pages=count=names=annots=collection="Not a valid PDF"
print(filepath,";", enc,";",pages, ";",count, ";",names, ";",annots, ";",collection, file = outfile )
outfile.close()
I was also running into the same problem with several pdfs that I have. I was able to make these changes to the referenced code that got it to work for me:
import PyPDF2
def getAttachments(reader):
"""
Retrieves the file attachments of the PDF as a dictionary of file names
and the file data as a bytestring.
:return: dictionary of filenames and bytestrings
"""
attachments = {}
#First, get those that are pdf attachments
catalog = reader.trailer["/Root"]
if "/EmbeddedFiles" in catalog["/Names"]:
fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
for f in fileNames:
if isinstance(f, str):
name = f
dataIndex = fileNames.index(f) + 1
fDict = fileNames[dataIndex].getObject()
fData = fDict['/EF']['/F'].getData()
attachments[name] = fData
#Next, go through all pages and all annotations to those pages
#to find any attached files
for pagenum in range(0, reader.getNumPages()):
page_object = reader.getPage(pagenum)
if "/Annots" in page_object:
for annot in page_object['/Annots']:
annotobj = annot.getObject()
if annotobj['/Subtype'] == '/FileAttachment':
fileobj = annotobj["/FS"]
attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].getData()
return attachments
handler = open(filename, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
for fName, fData in dictionary.items():
with open(fName, 'wb') as outfile:
outfile.write(fData)
I know it is a late reply, but i only started looking into this yesterday. I have used the PyMuPdf library to extract the embedded files. here is my code:
import os
import fitz
def get_embedded_pdfs(input_pdf_path, output_path=None):
input_path = "/".join(input_pdf_path.split('/')[:-1])
if not output_path :
output_path = input_pdf_path.split(".")[0] + "_embeded_files/"
if output_path not in os.listdir(input_path):
os.mkdir(output_path)
doc = fitz.open(input_pdf_path)
item_name_dict = {}
for each_item in doc.embfile_names():
item_name_dict[each_item] = doc.embfile_info(each_item)["filename"]
for item_name, file_name in item_name_dict.items():
out_pdf = output_path + file_name
## get embeded_file in bytes
fData = doc.embeddedFileGet(item_name)
## save embeded file
with open(out_pdf, 'wb') as outfile:
outfile.write(fData)
disclaimer: I am the author of borb (the library used in this answer)
borb is an open-source, pure Python PDF library. It abstracts away most of the unpleasantness of dealing with PDF (such as having to deal with dictionaries and having to know PDF-syntax and structure).
There is a huge repository of examples, containing a section on dealing with embedded files, which you can find here.
I'll repeat the relevant example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
def main():
# read the Document
doc: typing.Optional[Document] = None
with open("output.pdf", "rb") as pdf_file_handle:
doc = PDF.loads(pdf_file_handle)
# check whether we have read a Document
assert doc is not None
# retrieve all embedded files and their bytes
for k, v in doc.get_embedded_files().items():
# display the file name, and the size
print("%s, %d bytes" % (k, len(v)))
if __name__ == "__main__":
main()
After the Document has been read, you can simply ask it for a dict mapping the filenames unto the bytes.

UTF-8 is not saving txt file correctly - Python

My django application reads a file, and saves some report in other txt file. Everything works fine except my language letters. I mean, encoding="utf-8" can not read some letters. Here are my codes and example of report file:
views.py:
def result(request):
#region variables
# Gets the last uploaded document
last_uploaded = OriginalDocument.objects.latest('id')
# Opens the las uploaded document
original = open(str(last_uploaded.document), 'r')
# Reads the last uploaded document after opening it
original_words = original.read().lower().split()
# Shows up number of WORDS in document
words_count = len(original_words)
# Opens the original document, reads it, and returns number of characters
open_original = open(str(last_uploaded.document), "r")
read_original = open_original.read()
characters_count = len(read_original)
# Makes report about result
report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
"-" + str(last_uploaded.document_title) + "-5.txt", 'w', encoding="utf-8")
report_twenties = open("static/report_documents/" + str(last_uploaded.student_name) +
"-" + str(last_uploaded.document_title) + "-20.txt", 'w', encoding="utf-8")
# Path to the documents with which original doc is comparing
path = 'static/other_documents/doc*.txt'
files = glob.glob(path)
#endregion
rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)
context = {
...
}
return render(request, 'result.html', context)
report file:
['universitetindé™', 'té™hsili', 'alä±ram.', 'mé™n'] was found in static/other_documents\doc1.txt.
...
It's because of Windows OS. Try your application on MacOS it will work without any problem)

Extract email attachments and retain modification/creation date?

I'm trying to extract files from emails via IMAP using Python 3.7 (on Windows, fyi) and each of my attempts shows extracted files with Modification & Creation Date = time of extraction (which is incorrect).
As full email applications have the ability to preserve that information, it must me stored somewhere. I also gave working with structs a try, thinking the information may be stored in binary, but had no luck.
import email
from email.header import decode_header
import imaplib
import os
SERVER = None
OUT_DIR = '/var/out'
IMP_SRV = 'mail.domain.tld'
IMP_USR = 'user#domain.tld'
IMP_PWD = 'hunter2'
def login_mail():
global SERVER
SERVER = imaplib.IMAP4_SSL(IMP_SRV)
SERVER.login(IMP_USR, IMP_PWD)
def get_mail(folder='INBOX'):
mails = []
_, data = SERVER.uid('SEARCH', 'ALL')
uids = data[0].split()
for uid in uids:
_, s = SERVER.uid('FETCH', uid, '(RFC822)')
mail = email.message_from_bytes(s[0][1])
mails.append(mail)
return mails
def parse_attachments(mail):
for part in mail.walk():
if part.get_content_type() == 'application/octet-stream':
filename = get_filename(part)
output = os.path.join(OUT_DIR, filename)
with open(output, 'wb') as f:
f.write(part.get_payload(decode=True))
def get_filename(part):
filename = part.get_filename()
binary = part.get_payload(decode=True)
if decode_header(filename)[0][1] is not None:
filename = decode_header(filename)[0][0].decode(decode_header(filename)[0][1])
filename = os.path.basename(filename)
return filename
Can anyone tell me what I'm doing wrong and if it's somehow possible?
After getting said information it could be possible to modify the timestamps utilizing How do I change the file creation date of a Windows file?.
I was able to extract the creation-date and modification-date from the content-disposition header. Setting the file modified date is simple too.
attachment_creation_date = attachment.get_param('creation-date', None, 'content-disposition')
attachment_modification_date = attachment.get_param('modification-date', None, 'content-disposition')
Here's a more complete example that shows how to read these parameters if present:
def process_email_attachments(msg, output_directory):
for attachment in msg.iter_attachments():
try:
output_filename = attachment.get_filename()
except AttributeError:
print("Couldn't get attachment filename. Skipping.")
continue
# If no attachments are found, skip this file
if output_filename:
attachment_creation_date = attachment.get_param('creation-date', None, 'content-disposition')
attachment_modification_date = attachment.get_param('modification-date', None, 'content-disposition')
try:
output_file_full_path = os.path.join(output_directory, output_filename)
with open(output_file_full_path, "wb") as of:
payload = attachment.get_payload(decode=True)
of.write(payload)
if attachment_modification_date is not None:
attachment_modification_datetime = email.utils.parsedate_to_datetime(attachment_modification_date)
set_file_last_modified(output_file_full_path, attachment_modification_datetime)
except TypeError:
print("Couldn't get payload for %s" % output_filename)
def set_file_last_modified(file_path, dt):
dt_epoch = dt.timestamp()
os.utime(file_path, (dt_epoch, dt_epoch))
The second part of your question is how to set the file created date. This is platform dependent. There is already a separate question with answers demonstrating how to set the creation date on a Windows file: How do I change the file creation date of a Windows file?

Error opening photos stored in zip file in Django

I'm going to create a zip file from some of the image files stored on my server.
I've used the following function to do this:
def create_zip_file(user, examination):
from lms.models import StudentAnswer
f = BytesIO()
zip = zipfile.ZipFile(f, 'w')
this_student_answer = StudentAnswer.objects.filter(student_id=user.id, exam=examination)
for answer in this_student_answer:
if answer.answer_file:
answer_file_full_path = answer.answer_file.path
fdir, fname = os.path.split(answer_file_full_path)
zip.writestr(fname, answer_file_full_path)
zip.close() # Close
zip_file_name = "student-answers_"+ str(examination.id)+"_" + str(user.id) + "_" + date=datetime.datetime.now().strftime("%Y-%m-%d-%H-%M") + '.zip'
response = HttpResponse(f.getvalue(), content_type="application/x-zip-compressed")
response['Content-Disposition'] = 'attachment; filename=%s' % zip_file_name
return response
Everything is fine and all photos are made in zip file but there is only one problem.
The problem is that the photos won't open and this error will appear in Windows:
Its look like we don't support this file format.
What is wrong with my codes?
To append data from file you have to use
write(filename)
Using writestr(filename) you add only string from variable filename but not from file.

Using urllib.urlretrieve to download files over HTTP not working

I'm still working on my mp3 downloader but now I'm having trouble with the files being downloaded. I have two versions of the part that's tripping me up. The first gives me a proper file but causes an error. The second gives me a file that is way too small but no error. I've tried opening the file in binary mode but that didn't help. I'm pretty new to doing any work with html so any help would be apprecitaed.
import urllib
import urllib2
def milk():
SongList = []
SongStrings = []
SongNames = []
earmilk = urllib.urlopen("http://www.earmilk.com/category/pop")
reader = earmilk.read()
#gets the position of the playlist
PlaylistPos = reader.find("var newPlaylistTracks = ")
#finds the number of songs in the playlist
NumberSongs = reader[reader.find("var newPlaylistIds = " ): PlaylistPos].count(",") + 1
initPos = PlaylistPos
#goes though the playlist and records the html address and name of the song
for song in range(0, NumberSongs):
songPos = reader[initPos:].find("http:") + initPos
namePos = reader[songPos:].find("name") + songPos
namePos += reader[namePos:].find(">")
nameEndPos = reader[namePos:].find("<") + namePos
SongStrings.append(reader[songPos: reader[songPos:].find('"') + songPos])
SongNames.append(reader[namePos + 1: nameEndPos])
initPos = nameEndPos
for correction in range(0, NumberSongs):
SongStrings[correction] = SongStrings[correction].replace('\\/', "/")
#downloading songs
fileName = ''.join([a.isalnum() and a or '_' for a in SongNames[0]])
fileName = fileName.replace("_", " ") + ".mp3"
# This version writes a file that can be played but gives an error saying: "TypeError: expected a character buffer object"
## songDL = open(fileName, "wb")
## songDL.write(urllib.urlretrieve(SongStrings[0], fileName))
# This version creates the file but it cannot be played (file size is much smaller than it should be)
## url = urllib.urlretrieve(SongStrings[0], fileName)
## url = str(url)
## songDL = open(fileName, "wb")
## songDL.write(url)
songDL.close()
earmilk.close()
Re-read the documentation for urllib.urlretrieve:
Return a tuple (filename, headers) where filename is the local file
name under which the object can be found, and headers is whatever the
info() method of the object returned by urlopen() returned (for a
remote object, possibly cached).
You appear to be expecting it to return the bytes of the file itself. The point of urlretrieve is that it handles writing to a file for you, and returns the filename it was written to (which will generally be the same thing as your second argument to the function if you provided one).

Categories

Resources