Create a partial pdf from bytes in python - python

I have a PDF file somewhere. This PDF is being sent to the destination in chunks of equal size (apart from the last chunk).
Let's say this pdf file is being read in like this in python:
with open(filename, 'rb') as file:
    chunk = file.read(3000)
    while chunk:
        # the sending method here
        await asyncio.sleep(0.5)
        chunk = file.read(3000)
The question is:
Can I construct a partial PDF file at the destination while the rest of the document is still being sent?
I tried it with pypdfium2 / PyPDF2, but they throw errors until the whole PDF file has arrived:
full_pdf = b''
def process(self, message):
    self.full_pdf += message
    partial = io.BytesIO(self.full_pdf)
    try:
        pdf = pypdfium2.PdfDocument(partial)
        print(len(pdf))
    except Exception as e:
        print("error", e)
Basically, I'd like to get the pages of the document even if the whole document has not arrived yet.

It's not possible to stream a PDF and do anything useful with it before the whole file is present.
According to the PDF 1.7 standard, the structure is:
A one-line header identifying the version of the PDF specification to which the file conforms
A body containing the objects that make up the document contained in the file
A cross-reference table containing information about the indirect objects in the file
A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
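The problem is that the cross-reference (x-ref) table and the trailer are at the end of the file, so a parser cannot locate the objects in the body until the last bytes have arrived.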
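As a rough illustration of that point, the receiver could hold off parsing until the end of the accumulated buffer looks like a trailer. This is only a heuristic sketch: `has_pdf_trailer` and `tail_size` are hypothetical names, and files with incremental updates contain several `%%EOF` markers, so a positive result does not prove the transfer is complete. Sending the total length up front is more reliable.

```python
def has_pdf_trailer(buffer: bytes, tail_size: int = 1024) -> bool:
    """Heuristic: True if the end of the buffer looks like a PDF trailer."""
    tail = buffer[-tail_size:]
    eof_pos = tail.rfind(b"%%EOF")
    if eof_pos == -1:
        return False
    # 'startxref' must appear before the final '%%EOF' marker.
    return tail.rfind(b"startxref", 0, eof_pos) != -1


# Simulate three incoming chunks of a tiny, made-up PDF-like byte stream.
chunks = [
    b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n",
    b"xref\n0 1\ntrailer\n<<>>\n",
    b"startxref\n29\n%%EOF\n",
]
received = b""
states = []
for chunk in chunks:
    received += chunk
    states.append(has_pdf_trailer(received))
```

Only once the check passes would it make sense to hand the buffer to pypdfium2 or a similar library.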
PDF Linearization: "fast web view"
The above part is true for arbitrary PDFs. However, it's possible to create so-called "linearized PDF files" (also called "fast web view"). Those files re-order the internal structure of PDF files to make them streamable.
At the moment, pypdf==3.4.0 does not support PDF linearization.
pikepdf claims to support that:
import pikepdf  # pip install pikepdf
with pikepdf.open("input.pdf") as pdf:
    pdf.save("out.pdf", linearize=True)
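On the receiving side, a quick way to guess whether the incoming file was linearized is to look for the linearization dictionary near the start of the stream. The sketch below assumes the `/Linearized` key appears within the first 1024 bytes, as the PDF specification requires for linearized files; `looks_linearized` is a hypothetical helper. Whether a given library can then render pages from a partial linearized file is a separate question that this check does not answer.

```python
def looks_linearized(head: bytes) -> bool:
    """Heuristic: True if the first 1024 bytes mention a /Linearized dictionary."""
    return b"/Linearized" in head[:1024]


# Made-up byte strings standing in for the first chunk of two different files.
linearized_head = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain_head = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
```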

Related

PDF File dedupe issue with same content, but generated at different time periods from a docx

I am working on a PDF file dedupe project and have analyzed many Python libraries that read files, generate a hash value for each, and compare it with the next file to detect duplicates, similar to the logic below or to Python's filecmp library. The issue I found with this logic is that if PDFs are generated from the same source DOCX (Save to PDF) at different times, the outputs are not considered duplicates, even though the content is exactly the same. Why does this happen? Is there any other logic that reads the content and creates a unique hash value based on the actual content?
import hashlib
def calculate_hash_val(path, blocks=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        # Read in fixed-size blocks so large files are not loaded into memory at once.
        data = file.read(blocks)
        while data:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that you save metadata to the file including the time of creation. It is invisible in the PDF, but that will make the hash different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
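An alternative to stripping metadata is to hash the extracted text rather than the raw bytes. The sketch below is not from the original answer. It assumes the page texts have already been extracted with a PDF library (for example pypdf's `page.extract_text()`), and `content_hash` is a hypothetical helper. It ignores whitespace differences, so two exports of the same DOCX should hash the same as long as the extracted text matches.

```python
import hashlib
import re


def content_hash(page_texts):
    """Hash the text of a document, ignoring whitespace differences."""
    hasher = hashlib.sha256()
    for text in page_texts:
        # Collapse runs of whitespace so layout differences do not matter.
        normalized = re.sub(r"\s+", " ", text).strip()
        hasher.update(normalized.encode("utf-8"))
        hasher.update(b"\x0c")  # page separator so page boundaries still count
    return hasher.hexdigest()


# Stand-ins for text extracted from three PDFs (assumed, not real data).
doc_a = ["Hello  world\n", "Second page"]
doc_b = ["Hello world", "Second   page\n"]
doc_c = ["Hello world", "Different page"]
```

Here `doc_a` and `doc_b` produce the same digest while `doc_c` does not. Note that this compares text only; images and formatting are ignored.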

How do I modify a document in a stream without changing the original document

Following the tutorial provided by ASPOSE I am able to save a document to a stream:
# Read only access is enough for Aspose.words to load a document.
stream = io.FileIO(docs_base.my_dir + "Document.docx")
doc = aw.Document(stream)
# You can close the stream now, it is no longer needed because the document is in memory.
stream.close()
# ... do something with the document.
# Convert the document to a different format and save to stream.
dstStream = io.FileIO(docs_base.my_dir + "Document.docx", "wb")
doc.save(dstStream, aw.SaveFormat.PDF)
dstStream.close()
However, when I do something with the document, the DOCX document is modified. How can I save the changes only to the PDF output?
Aspose.Words does not modify the original document upon processing. The document is fully read into the memory, so the original file is not used by Aspose.Words after opening it.
The DOCX file is modified in your case because you save the output document with the same file name as an input document, so the original document is replaced.
You should modify your code like this:
# Read only access is enough for Aspose.words to load a document.
stream = io.FileIO(docs_base.my_dir + "Document.docx")
doc = aw.Document(stream)
# You can close the stream now, it is no longer needed because the document is in memory.
stream.close()
# ... do something with the document.
# Convert the document to a different format and save to stream.
dstStream = io.FileIO(docs_base.my_dir + "Document.pdf", "wb")
doc.save(dstStream, aw.SaveFormat.PDF)
dstStream.close()
Or even simpler, if you work with files:
doc = aw.Document(docs_base.my_dir + "Document.docx")
# ... do something with the document.
doc.save(docs_base.my_dir + "Document.pdf")
Aspose.Words automatically detects the save format by the output file extension when you save to file.
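To avoid accidentally reusing the input file name, the output path can be derived from the input path. This is a small standard-library sketch, not part of the Aspose.Words API; `pdf_output_path` is a hypothetical helper.

```python
from pathlib import Path


def pdf_output_path(input_path):
    """Return the input path with a .pdf extension, refusing to overwrite the input."""
    src = Path(input_path)
    dst = src.with_suffix(".pdf")
    if dst == src:
        raise ValueError(f"output would overwrite the input file: {src}")
    return dst


out = pdf_output_path("docs/Document.docx")
```

The resulting path could then be passed as a string to `doc.save(...)`.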

Storing and retrieving a PDF from Python's Mongoengine

I recently learned that the PDF files and images I uploaded to my Heroku website were removed whenever I updated the website. Due to this, I have been trying to store my PDFs in my MongoDB database using Mongoengine (with Flask and Python), and then retrieving them and storing them in the static folder (I was able to successfully do this with my images), with no luck.
Below is the relevant code for my Mongoengine class:
class Article(Document):
    uploaded_content = FileField()  # Field for storing PDF
    uploaded_content_name = StringField()  # File name for PDF
The relevant code for my Flask route that is trying to store the PDF:
data = Article()
if request.files['uploaded-article']:
    data.uploaded_content = request.files['uploaded-article']
    # uploaded_content_name given random name below, and stored in
    # database
And then here is my code that tries to retrieve the PDF from mongoengine, and save it to my blog folder:
articles = Article.objects()
for art in articles:
    path = os.path.join(app.config['BLOG_FOLDER'], art.uploaded_content_name)
    if not os.path.isfile(path):
        f = open(art.uploaded_content.read(), 'wb')  # This line gives the error
        f.save(os.path.join(app.config['BLOG_FOLDER'] + art.uploaded_content_name), "PDF")
The line that gives me the error is when I try to open the PDF file I stored in my database. I have tried many different ways and have gotten various errors, but one I get is:
No such file or directory: b''. I can confirm that if I read() the database object, it's just an empty byte string.
I have also tried changing my Flask route to the code below, which stores the opened PDF from Flask's request object. This gave me the error ValueError: embedded null byte when I tried to open it, although the read() method at least returned a really long byte string.
data = Article()
if request.files['uploaded-article']:
    # store the PDF in the blog folder
    article_pdf = request.files['uploaded-article']
    article_pdf.save(os.path.join(app.config['BLOG_FOLDER'], article_pdf_filename))
    # Open the PDF just stored in the blog folder
    with open(os.path.join(app.config['BLOG_FOLDER'], article_pdf_filename), 'rb') as f:
        # Store the opened PDF in the database
        data.uploaded_content.put(f)
        f.close()
    # uploaded_content_name given random name below, and stored in
    # database
Another random thing I tried was trying to open the PDF file using the BytesIO data structure, but it resulted in the same error above of an embedded null byte.
Are there any suggestions for how I can properly store and retrieve my PDF from my mongoengine database? My apologies for the complexity of my question - however, if needed I can add more details. If there are any alternative ways of storing my PDFs so they do not get lost on Heroku, I would take that as a valid solution as well.
As a reference for the future, it looks like this was not working because I did not set the content type correctly when putting the pdf in. My original code when saving the PDF to the data.uploaded_content field was:
data.uploaded_content.put(f)
However, I needed to define the mimetype correctly:
data.uploaded_content.put(f, content_type='application/pdf')
With this change it then worked, and I was able to successfully store the PDF in mongoengine. As far as storing the PDF to a folder after it was successfully uploaded, I used the following code:
if art.uploaded_content_name:
    extension = art.uploaded_content_name.rsplit('.', 1)[1].lower()
    path = os.path.join(app.config['BLOG_FOLDER'], art.uploaded_content_name)
    if not os.path.isfile(path):
        pdf = art.uploaded_content.read()
        with open(os.path.join(app.config['BLOG_FOLDER'], art.uploaded_content_name), 'wb') as f:
            f.write(pdf)
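It is also worth noting why the first retrieval attempt failed: open(art.uploaded_content.read(), 'wb') passes the file's content as the file name, which is consistent with both the No such file or directory: b'' and the embedded null byte errors. The working pattern (open a path, then write the bytes) can be sketched without Flask or mongoengine as follows; `save_bytes_if_missing` is a hypothetical helper and the byte string stands in for the result of `art.uploaded_content.read()`.

```python
import os
import tempfile


def save_bytes_if_missing(path, data):
    """Write data to path unless the file already exists; return True if written."""
    if os.path.isfile(path):
        return False
    with open(path, "wb") as f:  # the path goes to open(), the bytes go to write()
        f.write(data)
    return True


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "article.pdf")
    first = save_bytes_if_missing(target, b"%PDF-1.4 example bytes")
    second = save_bytes_if_missing(target, b"ignored")
    with open(target, "rb") as f:
        stored = f.read()
```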

Is it possible to generate PDF with StreamingHttpResponse as it's possible to do so with CSV for large dataset?

I have a large dataset that I have to generate CSV and PDF for. With CSV, I use this guide: https://docs.djangoproject.com/en/3.1/howto/outputting-csv/
import csv
from django.http import StreamingHttpResponse
class Echo:
    """An object that implements just the write method of the file-like
    interface.
    """
    def write(self, value):
        """Write the value by returning it, instead of storing in a buffer."""
        return value
def some_streaming_csv_view(request):
    """A view that streams a large CSV file."""
    # Generate a sequence of rows. The range is based on the maximum number of
    # rows that can be handled by a single sheet in most spreadsheet
    # applications.
    rows = (["Row {}".format(idx), str(idx)] for idx in range(65536))
    pseudo_buffer = Echo()
    writer = csv.writer(pseudo_buffer)
    response = StreamingHttpResponse((writer.writerow(row) for row in rows),
                                     content_type="text/csv")
    response['Content-Disposition'] = 'attachment; filename="somefilename.csv"'
    return response
It works great. However, I can't find anything equivalent for PDF. Is it possible? I use render_to_pdf together with a template to generate the PDF.
Think of CSV as a fruit salad. You can slice bananas in a big pot, add some grapefruits, some pineapple, ... and then split the whole into individual portions that you bring together to the table (this is: you generate your CSV file, and then you send it to the client). But you could also make individual portions directly: Cut some slices of a banana in a small bowl, add some grapefruits, some pineapple, ... bring this small bowl to the table, and repeat the process for other individual portions (this is: you generate your CSV file and send it part by part to the client as you generate it).
Well if CSV is a fruit salad, then PDF is a cake. You have to mix all your ingredients and put it in the oven. This means you can't bring a slice of the cake to the table until you have baked the whole cake. Likewise, you can't start sending your PDF file to the client until it's entirely generated.
So, to answer your question, this (response = StreamingHttpResponse((writer.writerow(row) for row in rows), content_type="text/csv")) can't be done for PDF.
However, once your file is generated, you can stream it to the client using FileResponse as mentioned in other answers.
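To make the distinction concrete, here is a minimal sketch of streaming a PDF that has already been fully generated. It uses only the standard library: `iter_chunks` is a hypothetical helper, and the byte string is a placeholder rather than a valid PDF.

```python
import io


def iter_chunks(fileobj, chunk_size=8192):
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Placeholder for a PDF that was completely written to a buffer beforehand.
finished_pdf = io.BytesIO(b"%PDF-1.4\n" + b"x" * 20000 + b"\n%%EOF\n")
chunks = list(iter_chunks(finished_pdf))
```

In Django, a generator like this could be passed to StreamingHttpResponse with content_type="application/pdf". FileResponse, which is a subclass of StreamingHttpResponse, handles file-like objects for you.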
If your issue is that the generation of the PDF takes too much time (and might trigger a timeout error for instance), here are some things to consider:
Try to optimize the speed of your generation algorithm
Generate the file in the background before the client requests it and store it in your storage system. You might want to use a cronjob or celery to trigger the generation of the PDF without blocking the HTTP request.
Use websockets to send the file to the client as soon as it is ready to be downloaded (see django-channels)
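The "generate in the background" idea can be sketched with the standard library alone. This is an illustration under assumptions, not a production setup: `build_pdf_bytes`, `start_job` and `job_result` are hypothetical names, the PDF bytes are a placeholder, and in a real deployment a task queue such as celery would take the place of the in-process executor.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
jobs = {}


def build_pdf_bytes(rows):
    """Stand-in for the slow PDF generation step (hypothetical)."""
    return b"%PDF-placeholder " + str(len(rows)).encode("ascii")


def start_job(job_id, rows):
    """Kick off generation without blocking the caller."""
    jobs[job_id] = executor.submit(build_pdf_bytes, rows)


def job_result(job_id):
    """Return the finished bytes, or None if the job is still running."""
    future = jobs[job_id]
    return future.result() if future.done() else None


start_job("report-1", list(range(1000)))
jobs["report-1"].result()  # wait here only so the example is deterministic
ready = job_result("report-1")
executor.shutdown()
```

A view would call something like `start_job` on the first request and serve the stored file on a later request once `job_result` returns data.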
Have you tried FileResponse?
Something like this should work, it is basically what you can find in the Django doc:
import io
from django.http import FileResponse
from reportlab.pdfgen import canvas
def stream_pdf(request):
    buffer = io.BytesIO()
    p = canvas.Canvas(buffer)
    p.drawString(10, 10, "Hello world.")
    p.showPage()
    p.save()
    # Rewind to the start of the buffer so FileResponse reads the whole PDF.
    buffer.seek(0)
    return FileResponse(buffer, as_attachment=True, filename='helloworld.pdf')
I had a similar situation: I was able to "generate and stream download" files of CSV, JSON and XML types, and I wanted to do the same with Excel (xlsx) files.
Unfortunately, I couldn't do that. But along the way I found a few things:
CSV, JSON and XML are text files with a simple representation. PDF, Excel and similar files, however, are built with specific formatting and metadata.
The binary data of PDF and similar documents is written to the io buffer only when we call specific methods [the showPage() and save() methods of reportlab (source: Django docs)].
If we inspect the file stream, PDF and Excel require specialized applications (e.g. a PDF reader or a browser) to view/read the data, whereas CSV and JSON need only a simple text editor.
So I conclude that "on-the-fly generation of a file with stream download" (I'm not sure what the correct technical term is) is not possible for all file types, only for a few text-oriented ones.
Note: This is my limited experience, which may be wrong.
The page you linked to itself links to a page on creating and sending PDF files dynamically using reportlab:
import io
from django.http import FileResponse
from reportlab.pdfgen import canvas
def some_view(request):
    # Create a file-like buffer to receive PDF data.
    buffer = io.BytesIO()
    # Create the PDF object, using the buffer as its "file."
    p = canvas.Canvas(buffer)
    # Draw things on the PDF. Here's where the PDF generation happens.
    # See the ReportLab documentation for the full list of functionality.
    p.drawString(100, 100, "Hello world.")
    # Close the PDF object cleanly, and we're done.
    p.showPage()
    p.save()
    # FileResponse sets the Content-Disposition header so that browsers
    # present the option to save the file.
    buffer.seek(0)
    return FileResponse(buffer, as_attachment=True, filename='hello.pdf')
Here's a link to the reportlab API documentation. It's kind of lengthy and stored in a single-page PDF that is annoying to navigate, but it should get you up and running and able to format the PDFs as you want.

Parsing YAML out of a Markdown file

I am working with some legacy code that I have inherited (ie, many of these design decisions were not mine).
The code takes a directory organized into subdirectories with markdown files, and compiles them into one large markdown file (using Markdown-PP: https://github.com/jreese/markdown-pp). Then it converts this file into HTML (using pandoc: https://pandoc.org/), and finally into a PDF (using wkhtmltopdf: https://wkhtmltopdf.org/).
The problem that I am running into is that many of the original markdown files have YAML metadata headers. When stitched together by Markdown-PP, the large markdown ends up with numerous YAML metadata blocks interspersed throughout. Most of this metadata is lost when converting into HTML because of the way pandoc processes YAML (many of the headers use the same key names, and pandoc combines the separate YAML headers and only preserves the first value of the corresponding key).
I originally had no YAML appearing in the HTML, but was able to change this by correctly modifying the HTML template for pandoc. But I only get the first value for each corresponding key. It was not clear if there was a way around this in pandoc, so I instead looked into trying to process the YAML into HTML before the pandoc step. I have tried parsing the YAML in the combined markdown using PyYAML (yaml.load_all()) but only get the first YAML block to appear.
An example of a YAML block:
---
author: foo
size_minimum: 100
time_req_minutes: 120
# and so on
---
The issue is that each of the 20+ modules in the final document has this associated metadata.
To try to parse the YAML, I was using code borrowed from this post: Is it possible to use PyYAML to read a text file written with a "YAML front matter" block inside?
with a few modifications.
import yaml
import sys
def get_yaml(f):
    pointer = f.tell()
    if f.readline() != '---\n':
        f.seek(pointer)
        return ''
    readline = iter(f.readline, '')
    readline = iter(readline.__next__, '---\n')  # underscores needed for Python3?
    return ''.join(readline)
# Remove sys.argv, not sure what it was doing
with open(filepath, encoding='UTF-8') as f:
    # Load all to get all the YAML documents; the Loader option is required for
    # the most recent PyYAML, and list() because it originally returned a generator object.
    config = list(yaml.load_all(get_yaml(f), Loader=yaml.SafeLoader))
    text = f.read()
    print("TEXT from", f)
    #print(text)
    print("CONFIG from", f)
    print(config)
But even this only resulted in the first YAML block being read and output.
I would like to be able to parse the YAML from the large markdown file and replace it, in the correct place, with the corresponding HTML. I am just not sure whether these (or any) packages are capable of doing so. It may be that I just need to manually change the YAML to HTML in the original Markdown files (time intensive, but I could probably already be done with it if I had started that way).
What about this library: https://github.com/eyeseast/python-frontmatter
It parses both the front-matter and the Markdown in the file, placing the Markdown part in the content attribute of the resulting object.
Works with both front-matter containing and front-matterless (is there such a word?) files.
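For the combined file, where YAML blocks are scattered throughout rather than only at the top, one option is to locate every '---'-delimited block first and parse each one separately. The sketch below uses only the standard library and returns the raw block text; each block could then be passed to yaml.safe_load. It assumes that '---' on its own line is never used as a Markdown horizontal rule in the combined file, and `extract_yaml_blocks` is a hypothetical helper, not part of python-frontmatter or PyYAML.

```python
import re

# A block starts at a line containing only '---' and ends at the next such line.
FRONT_MATTER = re.compile(r"^---[ \t]*\n(.*?)^---[ \t]*$", re.MULTILINE | re.DOTALL)


def extract_yaml_blocks(markdown_text):
    """Return the raw text of every '---'-delimited block, in document order."""
    return [match.group(1) for match in FRONT_MATTER.finditer(markdown_text)]


# Made-up combined Markdown with two metadata blocks.
sample = (
    "---\nauthor: foo\nsize_minimum: 100\n---\n"
    "# Module 1\n\nSome text.\n\n"
    "---\nauthor: bar\nsize_minimum: 200\n---\n"
    "# Module 2\n"
)
blocks = extract_yaml_blocks(sample)
```

Using `FRONT_MATTER.sub(...)` with a replacement function would be one way to swap each block for generated HTML in place.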
