Converting Binary Data to PDF in Python/Django

I'm working on upgrading a legacy system and have come across a table full of .pdf files saved as binary data. I have dumped the table into a csv file and am trying to write a script which will take each row and recreate the files that were uploaded in the first place so that I can upload the files to S3.
I have tried this:
new_file = open(file_name, "wb")
doc = doc.encode('utf-8')
new_file.write(doc)
new_file.close()
where file_name is the saved file name, and doc is the binary data stored as a string in the database.
But all it gives me is a bunk pdf file with the binary data in it.
Here is what the stored data looks like (just the first bit; it's way too big to copy and paste in full).
0x255044462D312E340A25E2E3CFD30D0A312030206F626A0A3C3C200A2F43726561746F72202843616E6F6E2069522D4144562043353034352020504446290A2F4372656174696F6E446174652028443A32303133303432393133303830342D303527303027290A2F50726F647563657220285C3337365C3337375C303030415C303030645C3030306F5C303030625C303030655C303030205C303030505C303030445C303030465C303030205C303030535C303030635C3030305C0A615C3030306E5C303030205C3030304C5C303030695C303030625C303030725C303030615C303030725C303030795C303030205C303030315C3030302E5C303030305C303030655C3030305C0A205C303030665C3030306F5C303030725C303030205C303030435C303030615C3030306E5C3030306F5C3030306E5C303030205C303030695C3030306D5C303030615C303030675C3030305C0A655C303030525C303030555C3030304E5C3030304E5C303030455C303030525C3030305C303030290A3E3E200A656E646F626A0A322030206F626A0A3C3C200A2F5061676573203320302052200A2F54797065202F436174616C6F67200A2F4F7574707574496E74656E747320313120302052200A2F4D6574616461746120313220302052200A3E3E200A656E646F626A0A342030206F626A0A3C3C202F54797065202F
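The dump above is a hex string, not raw bytes: the 0x prefix aside, 25 50 44 46 are the ASCII codes for %PDF. A minimal sketch of the likely fix, assuming every stored value follows that format, is to hex-decode before writing in binary mode:

```python
# Truncated stand-in for the stored column value (hex string with an "0x" prefix)
doc = "0x255044462D312E34"

# Drop the "0x" prefix and decode the hex pairs into real bytes
raw = bytes.fromhex(doc[2:])

# raw now begins with b"%PDF-1.4"; write it as-is in binary mode, no .encode():
# with open(file_name, "wb") as new_file:
#     new_file.write(raw)
```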

Related

How to merge pdf files using python without storing them into the local directory

I have some pdf files which are uploaded on a remote server. I have URL for each file and we can download these PDF files by visiting those URLs.
My question is,
I want to merge all the pdf files into a single file (but without storing these files in a local directory). How can I do that (with the python module 'PyPDF2')?
Please move to pypdf. It's essentially the same as PyPDF2, but the development will continue there (I'm the maintainer of both projects).
Your question is answered in the docs:
https://pypdf.readthedocs.io/en/latest/user/streaming-data.html
Instead of writing to a file, you write to an io.BytesIO stream:
from io import BytesIO
# e.g. writer = PdfWriter()
# ... do what you want to do with the PDFs
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    data = bytes_stream.read()  # that is now the "bytes" representation
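The seek-before-read step matters: after writer.write() the stream position sits at the end of the buffer, so reading without rewinding returns nothing. A stdlib-only sketch of the pattern (FakeWriter below is a stand-in for pypdf's PdfWriter, not the real API):

```python
from io import BytesIO

class FakeWriter:
    """Stand-in for pypdf's PdfWriter -- it just needs a write(stream) method."""
    def write(self, stream):
        stream.write(b"%PDF-1.4 merged document bytes")

writer = FakeWriter()
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)  # the writer fills the in-memory stream
    bytes_stream.seek(0)        # rewind, or the read below returns b""
    data = bytes_stream.read()  # the merged PDF as bytes, never touching disk
```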

Unzip a file and save its content into a database

I am building a website using Django where the user can upload a .zip file. I do not know how many sub folders the file has or which type of files it contains.
I want to:
1) Unzip the file
2) Get all the files in the unzipped directory (which might contain nested sub folders)
3) Save these files (the content, not the path) into the database.
I managed to unzip the file and to output the files path.
However this is not exactly what I want, because I do not care about the file path but the file itself.
In addition, since I am saving the unzipped files into media/documents, if different users upload different zips and all of them are unzipped, the media/documents folder would become huge and it would be impossible to know who uploaded what.
Unzipping the .zip file
myFile = request.FILES.get('my_uploads')
with ZipFile(myFile, 'r') as zipObj:
    zipObj.extractall('media/documents/')
Getting path of file in subfolders
x = [i[2] for i in os.walk('media/documents/')]
file_names = []
for t in x:
    for f in t:
        file_names.append(f)
views.py # It is not perfect, it is just an idea. I am just debugging.
def homeupload(request):
    if request.method == "POST":
        my_entity = Uploading()
        # my_entity.my_uploads = request.FILES["my_uploads"]
        myFile = request.FILES.get('my_uploads')
        with ZipFile(myFile, 'r') as zipObj:
            zipObj.extractall('media/documents/')
        x = [i[2] for i in os.walk('media/documents/')]
        file_names = []
        for t in x:
            for f in t:
                file_names.append(f)
        my_entity.save()
You really don't have to clutter up your filesystem when using a ZipFile, as it contains methods that allow you to read the files stored in the zip, directly to memory, and then you can save those objects to a database.
Specifically, we can use .infolist() or .namelist() to get a list of all the files in the zip, and .read() to actually get their contents:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item) for item in zipObj.namelist()]
Now file_objects is a list of bytes objects containing the content of all the files. I didn't bother saving the names or file paths because you said that was unnecessary, but that can be done too. To see what else is available, check out what actually gets returned from .infolist().
If you want to save these bytes objects to your database, it's usually possible if your database can support it (most can). If you however wanted to get these files as plain text and not bytes, you just have to convert them first with something like .decode:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]
Notice that we didn't save any files to the filesystem at any point, so there's nothing to worry about regarding user-uploaded files cluttering up your system. However, the database storage size on your disk will grow accordingly.
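A self-contained sketch of that in-memory approach, keeping the file names alongside the contents via .infolist() (the zip here is built in memory to stand in for the uploaded request.FILES object):

```python
import io
import zipfile

# Build a small zip in memory as a stand-in for the uploaded file
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('docs/a.txt', 'hello')
    zf.writestr('docs/sub/b.txt', 'world')
buf.seek(0)

# Read every entry's content straight from the archive -- nothing hits the disk
with zipfile.ZipFile(buf, 'r') as zipObj:
    files = {info.filename: zipObj.read(info) for info in zipObj.infolist()}
```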

How do I upload an .xlsx file to an FTP without creating a local file?

I am writing a script to pull xml files from an FTP, turn them into an .xlsx file, and re-upload to a different directory on the same FTP. I want to create the .xlsx file within my script instead of copying the xml data into a template and uploading my local file.
I tried creating a filename for the .xlsx doc, but I realize that I need to save it before I can upload it to the FTP. My question is: would it be better to create a temporary folder on the server the script runs on and empty the folder out afterwards, or is there a way to upload the doc without saving it anywhere (preferred)? I will be running the script on a Windows server.
ftps.cwd(ftpExcelDir)
wbFilename = str(orderID + '.xlsx')
savedFile = saving the file somewhere  # this is the part I'm having trouble with
ftps.storline('STOR ' + wbFilename, savedFile)
With the following code, I can get the .xlsx files to save to the FTP, but I receive an invalid extension/corrupt file error from Excel:
ftps.cwd(ftpExcelDir)
wbFilename = str(orderID + '.xlsx')
inMemoryWB = io.BytesIO()
wb.save(inMemoryWB)
ftps.storbinary('STOR ' + wbFilename, inMemoryWB)
The ftp functions take file objects... but those don't strictly speaking need to be files. Python has BytesIO and StringIO objects which act like files, but are backed by memory. See: https://stackoverflow.com/a/44672691/8833934
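One likely cause of the corrupt-file error in the snippet above: after wb.save(inMemoryWB), the stream position is at the end of the buffer, so storbinary reads zero bytes. A minimal sketch of the fix (the write below stands in for openpyxl's wb.save):

```python
import io

stream = io.BytesIO()
stream.write(b"fake xlsx bytes")  # stands in for wb.save(stream)

# Without this rewind, ftps.storbinary('STOR ...', stream) uploads an empty file
stream.seek(0)
payload = stream.read()
```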

reading in multiple text file extensions .pdf, .txt and .htm

I have a folder where I want to read all the text files from and put them into a corpus, however I am only able to do it with .txt files. How can I expand the code below to read in .pdf, .htm and .txt files?
corpus_raw = u""
for file_name in file_names:
    with codecs.open(file_name, "r", "utf-8") as file_name:
        corpus_raw += file_name.read()
print("Document is {0} characters long".format(len(corpus_raw)))
print()
For example:
with open('/data/text_file.txt', "r", encoding="utf-8") as f:
    print(f.read())
This reads in the data so I can view it in a notebook.
with open('/data/text_file.pdf', "r", encoding="utf-8") as f:
    print(f.read())
This reads nothing.
There are two types of files: binary files and plain-text files.
HTML files are plain-text, human-readable files which you can edit by hand, but PDF files are binary files (with embedded text) for which you'll need special programs to edit them.
If you want to read from pdf or html, it's possible. I wasn't sure if you meant to extract the text, or to extract the source code, so I'll provide explanations to both.
Extracting Text
Extracting text can be done easily for html files. Using webbrowser, you can open your file in the browser, and then use urllib for extracting text. For more info, refer to the answers here: Extracting text from HTML file using Python
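If you'd rather avoid the browser/urllib route, the standard library's html.parser can pull the text out of an HTML document directly; a minimal sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextExtractor()
parser.feed("<html><body><p>Hello <b>world</b></p></body></html>")
text = "".join(parser.chunks)  # "Hello world"
```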
For pdf files, you can use a python module called PyPDF2. Download it using pip:
$ pip install PyPDF2
and get started.
Here is an example of a simple program I found on the internet:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Extracting Source Code
Extracting source code is best done using python's open function as you did above.
For html files, you can just do what you did with text files, or more simply:
file = open("c:\\path\\to\\file")
print(file.read())
For pdf files, you do pretty much the same, but you must open the file in binary mode, since the content is not plain text. For more info, visit the sites in the More Info section.
file = open("c:\\path\\to\\file.pdf", "rb")  # 'rb' opens the file in binary (read) mode
data = file.read()  # raw bytes; you'll need a library like PyPDF2 to extract readable text
file.close()
More Info
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
https://www.programiz.com/python-programming/methods/built-in/open
This should work for htm/html files with no problem - they are basically just text files. Above, I only see that reading in .pdf has failed - was there a problem with .htm?
Also, reading in a .pdf may be much more difficult/involved than you think. A pdf contains a lot more information than just plaintext, and cannot be meaningfully edited in, say, notepad. As an example of what I mean, here's a small sample of what I got when I opened a .pdf in notepad:
%PDF-1.7
%âãÏÓ
1758 0 obj
<</Filter/FlateDecode/First 401/Length 908/N 51/Type/ObjStm>>stream
hÞ”ØQk\7à¿2ÍK,i4
Cã(Á”¾•–öâ.Ýn‚w]òó3rm˜Ÿ =ÄÜÝèΑ®?ÉÍ…e¦ê?Å/2e¥ÂJÙˆ+SÉT«ù7$"T„ZËT”´ù2£®L~©¯fÊ©±É–iÌ(¦ÄF¹&OðÑ’Œ|hnžU}Žñ¾®ûDOÉæCÄç'¿IF¸/±Å¿”±/ÿ!¾›Ú˜Æ>¤ùeiêóuÚ3õ®äU̘Է’Ìhì´$!_Êœ3©o­úaÇÖÅÏç·rGòuê‡Gé¾é>Žà›ì¾õä›ò£Õì›ðѵx¨ùQXÇ3ð'åC=ªJÃ6óç:¯Öý—ZòóúI¹ù…Ÿ3—ñ$<Éw‘èÍ›«›/dz/¸z¿¿?Ço'ÑoW¿îÆõX矮¯}Ý»ítþ#?~ö¥ç_ü”×éÓÕÇíÛyü6Ç÷·»û͇åòøé÷ýù°ýôöá´?n§}8ž·Ãa·ÿÜ>ßÞo‡ý¿§Wat£õ…Ñ~ûÏ[ýQÌÍß»¯çížRŽI
$L’ù¤“úËI%Ã$OâTHb˜dóI5&$(éé´SI“€ˆE”-&Š("4&E”=$1ÁPDYa1 ˆ`(‚çEä“€†"x^DŽÁ#C</"ÇŽ` ¢B</"ÇŽ¨#D…"x^DŽQˆ
EÔ±#*Q¡ˆº "vD"*QDÄŽ¨#„#uADì"Š¨"bG!P„Ì‹(±#ˆ(BæE”ØD!ó"Jì"!ó"JìˆD4(BæE”Ø
ˆhPD[;¢
Šh"bG4 ¢AmADìˆD(ÑDÄŽP B¡ˆ¶ "v„
E輎¡#„B:/‚cG(¡P„΋àØ
Dt(BçEpìˆDt(BçEpìˆDt(¢/ˆˆÑˆEô±#:Ñ¡ˆ¾ "vD"Šè"bGaPD_;€ƒ"l^Da#„A6/¢ÆŽ0  ›QcG1Þ¡¨y5–DN eA6¢Ö‹¬‚² ‹ç#O…ÉEzQ•ð›ª´#£]„¡wU ¿¬J:ô"ñPüŸÑçSÿ(íÃñ¯íÛÿA?û°§7¿8ìBÀawü‡nww›ßû]€ %“xw
endstream
endobj
1759 0 obj
<</Filter/FlateDecode/First 1907/Length 3450/N 200/Type/ObjStm>>stream
There are, however, options. I would suggest reading the page at https://www.geeksforgeeks.org/working-with-pdf-files-in-python/ as a starting point.

Exporting zipped folder only with csv content

1. I am using Oracle and the idea is to use a python script to export tables as a zipped folder containing a csv file which holds my data.
2. Additionally: is it possible to save this data in the csv per column? For example, I have 4 columns and all of them are stored in 1 column in the csv (see this image).
This is my script:
import os
import cx_Oracle
import csv
import zipfile
connection = cx_Oracle.connect('dbconnection')
SQL = "SELECT * FROM AIR_CONDITIONS_YEARLY_MVIEW ORDER BY TIME"
filename = "sample.csv"
cursor = connection.cursor()
cursor.execute(SQL)
with open(filename, 'w', newline='') as output:
    writer = csv.writer(output)
    writer.writerow([i[0] for i in cursor.description])
    writer.writerows(cursor.fetchall())
air_zip = zipfile.ZipFile("sample.zip", 'w')
air_zip.write(filename, compress_type=zipfile.ZIP_DEFLATED)
cursor.close()
connection.close()
air_zip.close()
The code exports both the csv and the zipped folder (with the proper csv file inside) separately, and I want to export only the zipped folder!
Right now both sample.zip (containing sample.csv, as expected) and a standalone sample.csv are generated at the same time.
There is a list of problems:
The .csv file is not properly formatted (a row is seen as a single record (string) instead of a sequence of records):
Looking (blindly) at the code and tools (csv.writer, cx_Oracle) documentation, it seems correct.
When noticing that the file was opened with Excel, I remembered that at some point I had a similar issue. A quick search yielded [SuperUser]: How to get Excel to interpret the comma as a default delimiter in CSV files? And this was the culprit (the .csv file looks fine in an online viewer / editor).
The code "exporting" both .csv and .zip files (I don't know what export means; I assume generate, meaning that after running the code, both files are present):
The way to get around this is to delete the .csv file after it has been archived into the .zip file. Translated into code, that means adding at the end of the current script snippet:
os.unlink(filename)
As a final observation (if one wants to be pedantic), the lines that close the cursor and the database connection could be moved just after the with block or before the air_zip creation (there's no point keeping them open while archiving).
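Putting the answer together, here is a sketch of the export flow with the cursor results stubbed out (in the real script, header and rows come from cx_Oracle's cursor.description and cursor.fetchall()):

```python
import csv
import os
import zipfile

# Stand-ins for cursor.description column names / cursor.fetchall() rows
header = ["TIME", "VALUE"]
rows = [("2020-01-01", 21.5), ("2020-01-02", 22.0)]

filename = "sample.csv"
with open(filename, "w", newline="") as output:  # 'w' + newline='' for csv on Windows
    writer = csv.writer(output)
    writer.writerow(header)
    writer.writerows(rows)

with zipfile.ZipFile("sample.zip", "w") as air_zip:
    air_zip.write(filename, compress_type=zipfile.ZIP_DEFLATED)

os.unlink(filename)  # keep only the zip, as the question asked
```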
