Storing and retrieving a PDF from Python's Mongoengine

I recently learned that the PDF files and images I uploaded to my Heroku website were removed whenever I updated the website. Because of this, I have been trying to store my PDFs in my MongoDB database using Mongoengine (with Flask and Python), and then retrieve them and save them to the static folder (I was able to do this successfully with my images), but with no luck.
Below is the relevant code for my Mongoengine class:
class Article(Document):
    uploaded_content = FileField()         # Field for storing PDF
    uploaded_content_name = StringField()  # File name for PDF
The relevant code for my Flask route that is trying to store the PDF:
data = Article()
if request.files['uploaded-article']:
    data.uploaded_content = request.files['uploaded-article']
    # uploaded_content_name given a random name below, and stored in
    # the database
And then here is my code that tries to retrieve the PDF from mongoengine, and save it to my blog folder:
articles = Article.objects()
for art in articles:
    path = os.path.join(app.config['BLOG_FOLDER'], art.uploaded_content_name)
    if not os.path.isfile(path):
        f = open(art.uploaded_content.read(), 'wb')  # This line gives the error
        f.save(os.path.join(app.config['BLOG_FOLDER'] + art.uploaded_content_name), "PDF")
The error occurs on the line where I try to open the PDF file I stored in my database. I have tried many different approaches and have gotten various errors; one of them is:
No such file or directory: b''. I can confirm that if I read() the database object, it's just an empty byte string.
I have also tried changing my Flask route to the code below, storing the PDF opened from Flask's request object. However, this gave me the error ValueError: embedded null byte when I tried to open it. At least this time the read() method gave me a really long byte string.
data = Article()
if request.files['uploaded-article']:
    # Store the PDF in the blog folder
    article_pdf = request.files['uploaded-article']
    article_pdf.save(os.path.join(app.config['BLOG_FOLDER'], article_pdf_filename))
    # Open the PDF just stored in the blog folder
    with open(os.path.join(app.config['BLOG_FOLDER'], article_pdf_filename), 'rb') as f:
        # Store the opened PDF in the database
        data.uploaded_content.put(f)
    # uploaded_content_name given a random name below, and stored in
    # the database
Another thing I tried was opening the PDF file via a BytesIO object, but that resulted in the same embedded null byte error as above.
Are there any suggestions for how I can properly store and retrieve my PDF from my Mongoengine database? My apologies for the complexity of my question; if needed, I can add more details. If there are any alternative ways to store my PDFs so they do not get lost on Heroku, I would accept that as a valid solution as well.

For future reference, it looks like this was not working because I did not set the content type correctly when putting the PDF in. My original code for saving the PDF to the data.uploaded_content field was:
data.uploaded_content.put(f)
However, I needed to define the mimetype correctly:
data.uploaded_content.put(f, content_type='application/pdf')
With this change it then worked, and I was able to successfully store the PDF in Mongoengine. As for saving the PDF to a folder after it was successfully uploaded, I used the following code:
if art.uploaded_content_name:
    extension = art.uploaded_content_name.rsplit('.', 1)[1].lower()
    path = os.path.join(app.config['BLOG_FOLDER'], art.uploaded_content_name)
    if not os.path.isfile(path):
        pdf = art.uploaded_content.read()
        with open(path, 'wb') as f:
            f.write(pdf)
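For completeness, here is a minimal sketch of the full upload route with the fix applied. The route name and the use of uuid for the random file name are my assumptions; only the field names and the content_type argument come from the code above:

import uuid
from flask import request

@app.route('/upload', methods=['POST'])
def upload_article():
    data = Article()
    pdf = request.files.get('uploaded-article')
    if pdf:
        # Hypothetical random name; the original code generates one similarly
        data.uploaded_content_name = uuid.uuid4().hex + '.pdf'
        # The crucial part: set the mimetype when putting the file in
        data.uploaded_content.put(pdf, content_type='application/pdf')
        data.save()
    return 'ok'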

Related

Create a partial PDF from bytes in Python

I have a PDF file somewhere. This PDF is being sent to the destination in equally sized chunks of bytes (apart from the last chunk).
Let's say this PDF file is being read in like this in Python:
with open(filename, 'rb') as file:
    chunk = file.read(3000)
    while chunk:
        # the sending method here
        await asyncio.sleep(0.5)
        chunk = file.read(3000)
The question is:
Can I construct a partial PDF file in the destination, while the leftover part of the document is being sent?
I tried it with pypdfium2 / PyPDF2, but they throw errors until the whole PDF file has arrived:
full_pdf = b''

def process(self, message):
    self.full_pdf += message
    partial = io.BytesIO(self.full_pdf)
    try:
        pdf = pypdfium2.PdfDocument(partial)
        print(len(pdf))
    except Exception as e:
        print("error", e)
Basically, I'd like to get the pages of the document even if it isn't the whole document yet.
It's not possible to stream a PDF and do anything useful with it before the whole file is present.
According to the PDF 1.7 standard, the structure is:
- A one-line header identifying the version of the PDF specification to which the file conforms
- A body containing the objects that make up the document contained in the file
- A cross-reference table containing information about the indirect objects in the file
- A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
The problem is that the x-ref table / trailer is at the end.
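As a quick illustration (a minimal sketch, assuming a complete PDF of more than a few hundred bytes on disk), the cross-reference offset only shows up in the last few bytes of the file, so a receiver holding only a prefix has nothing to resolve object references with:

import os

with open('input.pdf', 'rb') as f:
    f.seek(-256, os.SEEK_END)  # read only the tail of the file
    tail = f.read()

# 'startxref' points at the cross-reference table and sits near EOF
print(b'startxref' in tail)  # True for a complete file, False for a truncated prefix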
PDF Linearization: "fast web view"
The above part is true for arbitrary PDFs. However, it's possible to create so-called "linearized PDF files" (also called "fast web view"). Those files re-order the internal structure of PDF files to make them streamable.
At the moment, pypdf==3.4.0 does not support PDF linearization.
pikepdf claims to support that:
import pikepdf  # pip install pikepdf

with pikepdf.open("input.pdf") as pdf:
    pdf.save("out.pdf", linearize=True)
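pikepdf can also report whether a file is already linearized; a short sketch, assuming check_linearization() behaves as in recent pikepdf releases:

import pikepdf

with pikepdf.open("out.pdf") as pdf:
    # Prints a report and indicates whether the file is linearized
    print(pdf.check_linearization())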

PDF File dedupe issue with same content, but generated at different time periods from a docx

I'm working on a PDF file dedupe project and have analyzed many libraries in Python that read files, generate a hash value, and then compare it with the next file to detect duplicates (similar to the logic below, or using Python's filecmp lib). But the issue I found with this logic is that if a PDF is generated from a source DOCX (Save to PDF), the outputs are not considered duplicates even when the content is exactly the same. Why does this happen? Is there any other approach that reads the actual content and creates a unique hash value based on it?
def calculate_hash_val(path, blocks=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that metadata, including the creation time, gets saved into the file. It is invisible when viewing the PDF, but it will make the hash different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
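If stripping metadata isn't practical, another option is to hash the extracted text instead of the raw bytes. A minimal sketch, assuming pypdf is installed; note that text extraction itself can differ between PDF producers, so this is no silver bullet:

import hashlib
from pypdf import PdfReader  # pip install pypdf

def content_hash(path):
    # Hash only the extracted page text, ignoring metadata such as timestamps
    hasher = hashlib.md5()
    for page in PdfReader(path).pages:
        hasher.update(page.extract_text().encode('utf-8'))
    return hasher.hexdigest()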

Uploading a file via paperclip through an external script

I'm trying to create a Rails app that is a CMS for a client. The app currently has a Document class that uploads the document with Paperclip.
Separately, we're running a Python script that accesses the database, gets a bunch of information for a given event, creates a proposal Word document, and uploads it to the database under the correct event.
This all works, but the app does not recognize the document. How do I make a Python script that will correctly upload the document so that Paperclip knows what's going on?
Here is my Paperclip controller:
def new
  @event = Event.find(params[:event_id])
  @document = Document.new
end

def create
  @event = Event.find(params[:event_id])
  @document = @event.documents.new(document_params)
  if @document.save
    redirect_to event_path(@event)
  end
end

private

def document_params
  params.require(:document).permit(:event_id, :data, :title)
end
Model
validates :title, presence: true
has_attached_file :data
validates_attachment_content_type :data, :content_type => ["application/pdf", "application/msword"]
Here is the Python code:
f = open(propStr, 'rb')
binary = psycopg2.Binary(f.read())
self.cur.execute("INSERT INTO documents (event_id, title, data_file_name, data_content_type) VALUES (%d, 'Proposal.doc', %s, 'application/msword');" % (self.eventData[0], binary))
self.con.commit()
You should probably use Ruby to script this since it can load in any model information or other classes you need.
But assuming your requirements dictate the use of Python, be aware that Paperclip does not store the documents in your database tables, only the files' metadata. The actual file is stored in your file system in the /public dir by default (it could also be S3, etc., depending on your configuration). I would make sure you are actually saving the file to the correct anticipated directory. The default path according to the docs is:
:rails_root/public/system/:class/:attachment/:id_partition/:style/:filename
so you will have to make another SQL query to retrieve the id of your new record. I don't believe PDFs have a :style attribute, since you don't use ImageMagick to resize them, so build a path that looks something like this:
/public/system/documents/data/000/000/123/my_file.pdf
and save it from your python script.
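A rough sketch of what that could look like on the Python side, under Paperclip's default storage; the Rails root, record id, and filename are hypothetical placeholders:

import os
import shutil

RAILS_ROOT = '/var/www/myapp'  # hypothetical
doc_id = 123                   # fetched with a second SQL query after the INSERT
filename = 'Proposal.doc'

# Paperclip's :id_partition interpolation: zero-pad the id to 9 digits
# and split it into three directory levels, e.g. 000/000/123
padded = '%09d' % doc_id
partition = os.path.join(padded[0:3], padded[3:6], padded[6:9])

dest_dir = os.path.join(RAILS_ROOT, 'public', 'system', 'documents', 'data', partition)
os.makedirs(dest_dir, exist_ok=True)
shutil.copy(filename, os.path.join(dest_dir, filename))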

send data from blobstore as email attachment in GAE

Why isn't the code below working? The email is received, and the file comes through with the correct filename (it's a .png file). But when I try to open the file, it doesn't open correctly (Windows Gallery reports that it can't open this photo or video and that the file may be unsupported, damaged or corrupted).
When I download the file using a subclass of blobstore_handlers.BlobstoreDownloadHandler (basically the exact handler from the GAE docs), and the same blob key, everything works fine and Windows reads the image.
One more bit of info - the binary files from the download and the email appear very similar, but have a slightly different length.
Anyone got any ideas on how I can get email attachments sent correctly from the GAE blobstore? There are similar questions on S/O suggesting other people have had this issue, but there don't appear to be any conclusions.
from google.appengine.api import mail
from google.appengine.ext import blobstore

def send_forum_post_notification():
    blob_reader = blobstore.BlobReader('my_blobstore_key')
    blob_info = blobstore.BlobInfo.get('my_blobstore_key')
    value = blob_reader.read()
    mail.send_mail(
        sender='my.email@address.com',
        to='my.email@address.com',
        subject='this is the subject',
        body='hi',
        reply_to='my.email@address.com',
        attachments=[(blob_info.filename, value)]
    )

send_forum_post_notification()
I do not understand why you use a tuple for the attachment. I use:
message = mail.EmailMessage(sender = ......
message.attachments = [blob_info.filename,blob_reader.read()]
I found that this code doesn't work on dev_appserver but does work when pushed to production.
I ran into a similar problem using the blobstore on a Python Google App Engine application. My application handles PDF files instead of images, but I was also seeing a "the file may be unsupported, damaged or corrupted" error using code similar to your code shown above.
Try approaching the problem this way: Call open() on the BlobInfo object before reading the binary stream. Replace this line:
value = blob_reader.read()
... with these two lines:
bstream = blob_info.open()
value = bstream.read()
Then you can remove this line, too:
blob_reader = blobstore.BlobReader('my_blobstore_key')
... since bstream above will be of type BlobReader.
Relevant documentation from Google is located here:
https://cloud.google.com/appengine/docs/python/blobstore/blobinfoclass#BlobInfo_filename
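Putting that together, the sending function would look something like this (a sketch of the suggested fix, reusing the placeholder key and addresses from the question):

from google.appengine.api import mail
from google.appengine.ext import blobstore

def send_forum_post_notification():
    blob_info = blobstore.BlobInfo.get('my_blobstore_key')
    value = blob_info.open().read()  # BlobInfo.open() returns a BlobReader
    mail.send_mail(
        sender='my.email@address.com',
        to='my.email@address.com',
        subject='this is the subject',
        body='hi',
        reply_to='my.email@address.com',
        attachments=[(blob_info.filename, value)]
    )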

Saving binary data into file on model via django storages boto s3

I'm pulling back a PDF from the EchoSign API, which gives me the bytes of a file.
I'm trying to take those bytes and save them into a boto S3-backed FileField. I'm not having much luck.
This was the closest I got, but it errors on saving speaker, and the PDF, although written to S3, appears to be corrupt.
Here, speaker is an instance of my model and fileData is the 'bytes' string returned from the EchoSign API:
afile = speaker.the_file = S3BotoStorageFile(filename, "wb", S3BotoStorage())
afile.write(fileData)
afile.close()
speaker.save()
I'm closer!
content = ContentFile(fileData)
speaker.profile_file.save(filename, content)
speaker.save()
It turns out the FileField is already an S3BotoStorage, and you can create a new file by passing the raw data in like that. What I don't know is how to make it binary (I'm assuming it's not). My file keeps coming up corrupted, despite having a good amount of data in it.
For reference here is the response from echosign:
https://secure.echosign.com/static/apiv14/sampleXml/getDocuments-response.xml
I'm essentially grabbing the bytes and passing them to ContentFile as fileData. Maybe I need to base64-decode first. Going to try that!
Update
That worked!
It seems I have to ask the question here before I figure out the answer. Sigh. So the final code looks something like this:
content = ContentFile(base64.b64decode(fileData))
speaker.profile_file.save(filename, content)
speaker.save()
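A quick sanity check before saving, in case the API ever returns something unexpected (this only inspects the header, so treat it as a rough guard):

import base64

raw = base64.b64decode(fileData)
# Every valid PDF starts with the magic bytes %PDF-
assert raw[:5] == b'%PDF-', 'decoded payload does not look like a PDF'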
