Convert SVG to PDF (svglib + reportlab not good enough) - python

I'm creating some SVGs in batches and need to convert those to a PDF document for printing. I've been trying to use svglib and its svg2rlg method but I've just discovered that it's absolutely appalling at preserving the vector graphics in my document. It can barely position text correctly.
My dynamically-generated SVG is well formed and I've tested svglib on the raw input to make sure it's not a problem I'm introducing.
So what are my options past svglib and ReportLab? It either has to be free or very cheap as we're already out of budget on the project this is part of. We can't afford the 1k/year fee for ReportLab Plus.
I'm using Python but at this stage, I'm happy as long as it runs on our Ubuntu server.
Edit: Tested Prince. Better but it's still ignoring half the document.

I use inkscape for this. In your django view do like:
from subprocess import Popen
x = Popen(['/usr/bin/inkscape', your_svg_input, \
'--export-pdf=%s' % your_pdf_output])
try:
waitForResponse(x)
except OSError, e:
return False
def waitForResponse(x):
out, err = x.communicate()
if x.returncode < 0:
r = "Popen returncode: " + str(x.returncode)
raise OSError(r)
You may need to pass as parameters to inkscape all the font files you refer to in your .svg, so keep that in mind if your text does not appear correctly on the .pdf output.

CairoSVG is the one I am using:
import cairosvg
cairosvg.svg2pdf(url='image.svg', write_to='image.pdf')

rst2pdf uses reportlab for generating PDFs. It can use inkscape and pdfrw for reading PDFs.
pdfrw itself has some examples that show reading PDFs and using reportlab to output.
Addressing the comment by Martin below (I can edit this answer, but do not have the reputation to comment on a comment on it...):
reportlab knows nothing about SVG files. Some tools, like svg2rlg, attempt to recreate an SVG image into a PDF by drawing them into the reportlab canvas. But you can do this a different way with pdfrw -- if you can use another tool to convert the SVG file into a PDF image, then pdfrw can take that converted PDF, and add it as a form XObject into the PDF that you are generating with reportlab. As far as reportlab is concerned, it is really no different than placing a JPEG image.
Some tools will do terrible things to your SVG files (rasterizing them, for example). In my experience, inkscape usually does a pretty good job, and leaves them in a vector format. You can even do this headless, e.g. "inkscape my.svg -A my.pdf".
The entire reason I wrote pdfrw in the first place was for this exact use-case -- being able to reuse vector images in new PDFs created by reportlab.

Just to let you know and for the future issue, I find a solution for this problem:
# I only install svg2rlg, not svglib (svg2rlg is inside svglib as well)
import svg2rlg
# Import of the canvas
from reportlab.pdfgen import canvas
# Import of the renderer (image part)
from reportlab.graphics import renderPDF
rlg = svg2rlg.svg2rlg("your_img.svg")
c = canvas.Canvas("example.pdf")
c.setTitle("my_title_we_dont_care")
# Generation of the first page
# You have a last option on this function,
# about the boundary but you can leave it as default.
renderPDF.draw(rlg, c, 80, 740 - rlg.height)
renderPDF.draw(rlg, c, 60, 540 - rlg.height)
c.showPage()
# Generation of the second page
renderPDF.draw(rlg, c, 50, 740 - rlg.height)
c.showPage()
# Save
c.save()
Enjoy a bit with the position (80, 740 - h), it is only the position.
If the code doesn't work, you can look at in the render's reportlab library.
You have a function in reportlab to create directly a pdf from your image:
renderPDF.drawToFile(rlg, "example.pdf", "title")
You can open it and read it. It is not very complicated. This code come from this function.

Related

Extracting Powerpoint background images using python-pptx

I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation:
for shape in slide.shapes
if 'Picture' in shape.name:
pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into a OpenCV object similar to Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not adverse to other packages if it's not possible with that package.
After a fair bit of work I discovered how to do this -- i.e., you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. Powerpoint -- which, as it turns out, is an archive that can be unzipped with 7zip -- keeps its backgrounds disassembled in the ppt/media (pics), ppt/slideLayouts and ppt/slideMasters (text, formatting), and they are only put together by the Powerpoint renderer. This means that to extract the backgrounds as displayed, you basically need to run Powerpoint and take pics of the slides after removing text/pictures/etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
from glob import glob
layouts = glob(os.path.join(extr_dir, 'ppt\slideLayouts\*.xml'))
masters = glob(os.path.join(extr_dir, 'ppt\slideMasters\*.xml'))
files = layouts + masters
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
with open(file) as f:
data = f.read()
bs_data = BeautifulSoup(data, "xml")
bs_a_t = bs_data.find_all('a:t')
for a_t in bs_a_t:
text_list.append(str(a_t.contents[0]))
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.

How to open optimized GIF without bugs?

So this GIF looks perfectly fine before opening:
But, when opened using Pillow using
imageObject = Image.open(path.join(petGifs, f"{pokemonName}.gif"))
it bugs out, adding various boxes that have colors similar to that of the source image. This is an example frame, but almost every frame is different, and it's in different spots depending on the GIF:
The only thing, that has worked to fix this, is ezgif's unoptimize option (found in their optimize page). But, I'd need to do that on each GIF, and there's a lot of them.
I need either a way to bulk unoptimize, or a new way to open the GIF in Python (currently using Pillow), that will handle this.
At least for extracting proper single frames there might be a solution.
The disposal method for all frames (except the first) is set to 2, which is "restore to background color".
Diving through Pillow's source code, you'll find the according line where the disposal method 2 is considered, and, in the following, you'll find:
# by convention, attempt to use transparency first
color = (
frame_transparency
if frame_transparency is not None
else self.info.get("background", 0)
)
self.dispose = Image.core.fill("P", dispose_size, color)
If you check the faulty frames, you'll notice that this dark green color of the unwanted boxes is located at position 0 of the palette. So, it seems, the wrong color is picked for the disposal, because – I don't know why, yet – the above else case is picked instead of using the transparency information – which would be there!
So, let's just override the possibly faulty stuff:
from PIL import Image, ImageSequence
# Open GIF
gif = Image.open('223vK.gif')
# Initialize list of extracted frames
frames = []
for frame in ImageSequence.Iterator(gif):
# If dispose is set, and color is set to 0, use transparency information
if frame.dispose is not None and frame.dispose[0] == 0:
frame.dispose = Image.core.fill('P', frame.dispose.size,
frame.info['transparency'])
# Convert frame to RGBA
frames.append(frame.convert('RGBA'))
# Visualization overhead
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
for i, f in enumerate(frames, start=1):
plt.subplot(8, 8, i), plt.imshow(f), plt.axis('off')
plt.tight_layout(), plt.show()
The extracted frames look like this:
That seems fine to me.
If, by chance, the transparency information is actually set to 0, no harm should be done here, since we (re)set with the still correct transparency information.
I don't know, if (re)saving to GIF will work, since frames are now in RGBA mode, and saving to GIF from there is tricky as well.
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.19041-SP0
Python: 3.9.1
PyCharm: 2021.1.3
Matplotlib: 3.4.2
Pillow: 8.3.1
----------------------------------------
You can try to use:
from PIL import Image, ImageSequence
im = Image.open(f"{pokemonName}.gif")
index = 1
for frame in ImageSequence.Iterator(im):
frame.save("frame%d.png" % index)
index += 1
I've found a solution that I like for unoptimizing gifs which might be of use to you.
It uses the gifsicle library, which is a command line tool for working with gifs. Crucially, gifsicle lets you unoptimize gifs like yours (I think the specific name of the optimization in your gif is "cumulative layers").
Once you install it with your package manager of choice, you can either call it within your code via Python's subprocess library, or use it yourself from the command line.
You specifically mentioned a way to bulk unoptimize, and you can do that very easily with gifsicle via something like:
gifsicle -U -b *.gif
This will overwrite every gif in the working directory with an unoptimized version simultaneously. If you want to keep optimized copies make backups. See the manual page for more info about how to use gifsicle.
Once the gif is unoptimized python should be able to open it normally.

Python: converting an xml file to an image

I am looking to convert a xml file to an image (ideally a png file) using a python script. I have not found much from my online research. I am trying to use PIL. From this post on StackOverflow I was able to find this code:
from PIL import Image
import ImageFont, ImageDraw
image = Image.new("RGBA", (288,432), (255,255,255))
usr_font = ImageFont.truetype("resources/HelveticaNeueLight.ttf", 25)
d_usr = ImageDraw.Draw(image)
d_usr = d_usr.text((105,280), "MYTEXT",(0,0,0), font=usr_font)
But I do not quite understand what's happening. I tried to replace "MYTEXT" with the actual xml file content and it did not work.
I am basically looking for any solution (ideally using PIL, but it can be another module for python). I came close using imgkit:
import imgkit
imgkit.from_file('example_IN.xml','example_OUT.png')
which returns a png file. The resolution of the image is terrible though, and it lies within a very large white rectangle. I may be missing something. I know you can modify options for imgkit, but I have no idea what modifications to bring, even after checking the documentation. Any help would be deeply appreciated.
Thank you so much!
Best regards.
I had a go in pyvips:
#!/usr/bin/env python3
import sys
import pyvips
from xml.sax.saxutils import escape
# load first arg as a string
txt = open(sys.argv[1], "r").read()
# pyvips allows pango markup in strings -- you can write stuff like
# text("hello <i>sailor!</i>")
# so we need to escape < > & in the text file
txt = escape(txt)
img = pyvips.Image.text(txt)
# save to second arg
img.write_to_file(sys.argv[2])
You can run it like this:
./txt2img.py vari.ws x.png
To make this:
It's pretty quick -- that took 300ms to run on this modest laptop.
The text method has a lot of options if you want higher res, to change the alignment, wrap lines at some limit, change the font, etc. etc.
https://libvips.github.io/libvips/API/current/libvips-create.html#vips-text
The solution suggested above by jcuppit using pyvips definitely works and is quick. I found another solution to make my previous code above work using imgkit (it is slower, I am giving it here just for reference): the resolution of the output image was bad. If this happens, width and height can be changed in the options (this is an easy fix I had missed):
import imgkit
options = {
'width' : 600,
'height' : 600
}
imgkit.from_file('example_IN.xml','example_OUT.png', options=options)
And that will convert a xml file into a png file as well.

How to create a PDF document with differing page sizes in reportlab, python

Is it possible to create a PDF document with differing page sizes in reportlab?
I would like to create a document where the first page has a different size then the other pages. Can anyone help?
Yes, this should be possible, since PDF supports this, it's just a question of how to make it happen in ReportLab. I've never done this, but the following should work:
c = reportlab.pdfgen.canvas.Canvas("test.pdf")
# draw some stuff on c
c.showPage()
c.setPageSize((700, 500)) #some page size, given as a tuple in points
# draw some more stuff on c
c.showPage()
c.save()
And your document should now have two pages, one with a default size page and one with a page of size 700 pt by 500 pt.
If you are using PLATYPUS you should be able to achieve the same sort of thing, but will probably require getting fancy in a BaseDocTemplate subclass to handle changing page sizes, since I'm pretty sure the PageTemplate machinery doesn't already support this since each PageTemplate is mainly a way of changing the way frames are laid out on each page. But it is technically possible, it just isn't documented and you'll probably have to spend some time reading and understanding how PLATYPUS works internally.
my use case was to create a big table inside pdf. as table was big it was getting cropped on both sides. this is how to create a pdf with custom size.i am using platypus from reportlab.
from reportlab.platypus import SimpleDocTemplate
from reportlab.lib.units import mm, inch
pagesize = (20 * inch, 10 * inch) # 20 inch width and 10 inch height.
doc = SimpleDocTemplate('sample.pdf', pagesize=pagesize)
data = [['sarath', 'indiana', 'usa'],
['jose', 'indiana', 'shhs']]
table = Table(data)
elems = []
elems.append(table)
doc.build(elems)
one downside of this technique is the size is same for all pages.But will help people looking to create pdf with custom size (same for all pages)

How to get the diff of two PDF files using Python?

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?
What do you mean by "difference"? A difference in the text of the PDF or some layout change (e.g. an embedded graphic was resized). The first is easy to detect, the second is almost impossible to get (PDF is an VERY complicated file format, that offers endless file formatting capabilities).
If you want to get the text diff, just run a pdf to text utility on the two PDFs and then use Python's built-in diff library to get the difference of the converted texts.
This question deals with pdf to text conversion in python: Python module for converting PDF to text.
The reliability of this method depends on the PDF Generators you are using. If you use e.g. Adobe Acrobat and some Ghostscript-based PDF-Creator to make two PDFs from the SAME word document, you might still get a diff although the source document was identical.
This is because there are dozens of ways to encode the information of the source document to a PDF and each converter uses a different approach. Often the pdf to text converter can't figure out the correct text flow, especially with complex layouts or tables.
I do not know your use case, but for regression tests of script which generates pdf using reportlab, I do diff pdfs by
Converting each page to an image using ghostsript
Diffing each page against page image of standard pdf, using PIL
e.g
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)
imDiff = ImageChops.difference(im1, im2)
This works in my case for flagging any changes introduced due to code changes.
Met the same question on my encrypted pdf unittest, neither pdfminer nor pyPdf works well for me.
Here are two commands (pdftocairo, pdftotext) work perfect on my test. (Ubuntu Install: apt-get install poppler-utils)
You can get pdf content by:
from subprocess import Popen, PIPE
def get_formatted_content(pdf_content):
cmd = 'pdftocairo -pdf - -' # you can replace "pdftocairo -pdf" with "pdftotext" if you want to get diff info
ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
stdout, stderr = ps.communicate(input=pdf_content)
if ps.returncode != 0:
raise OSError(ps.returncode, cmd, stderr)
return stdout
Seems pdftocairo can redraw pdf files, pdftotext can extract all text.
And then you can compare two pdf files:
c1 = get_formatted_content(open('f1.pdf').read())
c2 = get_formatted_content(open('f2.pdf').read())
print(cmp(c1, c2)) # for binary compare
# import difflib
# print(list(difflib.unified_diff(c1, c2))) # for text compare
Even though this question is quite old, my guess is that I can contribute to the topic.
We have several applications generating tons of PDFs. One of these apps is written in Python and recently I wanted to write integration tests to check if the PDF generation was working correctly.
Testing PDF generation is HARD, because the specs for PDF files are very complicated and non-deterministic. Two PDFs, generated with the same exact input data, will generate different files, so direct file comparison is discarded.
The solution: we have to go with testing the way they look like (because THAT should be deterministic!).
In our case, the PDFs are being generated with the reportlab package, but this doesn't matter from the test perspective, we just need a filename or the PDF blob (bytes) from the generator. We also need an expectation file containing a "good" PDF to compare with the one coming from the generator.
The PDFs are converted to images and then compared. This can be done in multiple ways, but we decided to use ImageMagick, because it is extremely versatile and very mature, with bindings for almost every programming language out there. For Python 3, the bindings are offered by the Wand package.
The test looks something like the following. Specific details of our implementation were removed and the example was simplified:
import os
from unittest import TestCase
from wand.image import Image
from app.generators.pdf import PdfGenerator
DIR = os.path.dirname(__file__)
class PdfGeneratorTest(TestCase):
def test_generated_pdf_should_match_expectation(self):
# `pdf` is the blob of the generated PDF
# If using reportlab, this is what you get calling `getpdfdata()`
# on a Canvas instance, after all the drawing is complete
pdf = PdfGenerator().generate()
# PDFs are vectorial, so we need to set a resolution when
# converting to an image
actual_img = Image(blob=pdf, resolution=150)
filename = os.path.join(DIR, 'expected.pdf')
# Make sure to use the same resolution as above
with Image(filename=filename, resolution=150) as expected:
diff = actual.compare(expected, metric='root_mean_square')
self.assertLess(diff[1], 0.01)
The 0.01 is as low as we can tolerate small differences. Considering that diff[1] varies from 0 to 1 using the root_mean_square metric, we are here accepting a difference up to 1% on all channels, comparing with the sample expected file.
Check this out, it can be useful: pypdf

Categories

Resources