Boxes are displayed instead of text in pdfkit - Python3 - python

I am trying to convert an HTML file to pdf using pdfkit python library.
I followed the documentation from here.
Currently, I am trying to convert plain texts to PDF instead of whole html document. Everything is working fine but instead of text, I am seeing boxes in the generated PDF.
This is my code.
import pdfkit
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf/wkhtmltox/bin/wkhtmltopdf')
content = 'This is a paragraph which I am trying to convert to pdf.'
pdfkit.from_string(content,'test.pdf',configuration=config)
This is the output.
Instead of the text 'This is a paragraph which I am trying to convert to pdf.', converted PDF contains boxes.
Any help is appreciated.
Thank you :)

Unable to reproduce the issue with Python 2.7 on Ubuntu 16.04 and it works fine on the specs mentioned. From my understanding this problem is from your Operating System not having the font or encoding in which the file is being generated by the pdfkit.
Maybe try doing this:
import pdfkit
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf/wkhtmltox/bin/wkhtmltopdf')
content = 'This is a paragraph which I am trying to convert to pdf.'
options = {
'encoding':'utf-8',
}
pdfkit.from_string(content,'test.pdf',configuration=config, options=options)
The options to modify pdf can be added as dictionary and assigned to options argument in from_string functions. The list of options can be found here.

This issue is referred here Include custom fonts in AWS Lambda
if you are using pdfkit on lambda you will have to setup ENV variables as
"FONT_CONFIG_PATH": '/opt/fonts/'
"FONTCONFIG_FILE": '/opt/fonts/fonts.conf'
if this problem is in the local environment a fresh installation of wkhtmltopdf must resolve this

Related

Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook

I am trying to allow user to upload MS Word file and then I run a certain function that takes a string as input argument. I am uploading Word file through FileUpload however I am getting a coded object. I am unable to decode using byte UTF-8 and using upload.value or upload.data just returns coded text
Any ideas how I can extract content from uploaded Word File?
> upload = widgets.FileUpload()
> upload
#I select the file I want to upload
> upload.value #Returns coded text
> upload.data #Returns coded text
> #Previously upload['content'] worked, but I read this no longer works in IPYWidgets 8.0
Modern ms-word files (.docx) are actually zip-files.
The text (but not the page headers) are actually inside an XML document called word/document.xml in the zip-file.
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.
>>> import docx
>>> gkzDoc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
... fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs. Not e.g. the text from tables.
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is actually a dictionary, see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed over different version of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat un-elegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
Tweaking with #Roland Smith great suggestions, following code finally worked:
import io
import docx
from docx import Document
upload = widgets.FileUpload()
upload
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
print (p.text)

Python library for dynamic documents

I want to write a script that generates reports for each team in my unit where each report uses the same template, but where the numbers specific to each team is used for each report. The report should be in a format like .pdf that non-programmers know how to open and read. This is in many ways similar to rmarkdown for R, but the reports I want to generate are based on data from code already written in python.
The solution I am looking for does not need to export directly to pdf. It can export to markdown and then I know how to convert. I do not need any fancier formatting than what markdown provides. It does not need to be markdown, but I know how to do everything else in markdown, if I only find a way to dynamically populate numbers and text in a markdown template from python code.
What I need is something that is similar to the code block below, but on a bigger scale and instead of printing output on screen this would saved to a file (.md or .pdf) that can then be shared with each team.
user = {'name':'John Doe', 'email':'jd#example.com'}
print('Name is {}, and email is {}'.format(user["name"], user["email"]))
So the desired functionality heavily influenced by my previous experience using rmarkdown would look something like the code block below, where the the template is a string or a file read as a string, with placeholders that will be populated from variables (or Dicts or objects) from the python code. Then the output can be saved and shared with the teams.
user = {'name':'John Doe', 'email':'jd#example.com'}
template = 'Name is `user["name"]`, and email is `user["email"]`'
output = render(template, user)
When trying to find a rmarkdown equivalent in python, I have found a lot of pointers to Jupyter Notebook which I am familiar with, and very much like, but it is not what I am looking for, as the point is not to share the code, only a rendered output.
Since this question was up-voted I want to answer my own question, as I found a solution that was perfect for me. In the end I shared these reports in a repo, so I write the reports in markdown and do not convert them to PDF. The reason I still think this is an answer to my original quesiton is that this works similar to creating markdown in Rmarkdown which was the core of my question, and markdown can easily be converted to PDF.
I solved this by using a library for backend generated HTML pages. I happened to use jinja2 but there are many other options.
First you need a template file in markdown. Let say this is template.md:
## Overview
**Name:** {{repo.name}}<br>
**URL:** {{repo.url}}
| Branch name | Days since last edit |
|---|---|
{% for branch in repo.branches %}
|{{branch[0]]}}|{{branch[1]}}|
{% endfor %}
And then you have use this in your python script:
from jinja2 import Template
import codecs
#create an dict will all data that will be populate the template
repo = {}
repo.name = 'training-kit'
repo.url = 'https://github.com/github/training-kit'
repo.branches = [
['master',15],
['dev',2]
]
#render the template
with open('template.md', 'r') as file:
template = Template(file.read(),trim_blocks=True)
rendered_file = template.render(repo=repo)
#output the file
output_file = codecs.open("report.md", "w", "utf-8")
output_file.write(rendered_file)
output_file.close()
If you are OK with your dynamic doc being in markdown you are done and the report is written to report.py. If you want PDF you can use pandoc to convert.
I would strongly recommend to install and use the pyFPDF Library, that enables you to write and export PDF files directly from python. The Library was ported from php and offers the same functionality as it's php-variant.
1.) Clone and install pyFPDF
Git-Bash:
git clone https://github.com/reingart/pyfpdf.git
cd pyfpdf
python setup.py install
2.) After successfull installation, you can use python code similar as if you'd work with fpdf in php like:
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt="Hello", border=0)
pdf.output('myTest.pdf', 'F')
For more Information, take a look at:
https://pypi.org/project/fpdf/
To work with pyFPDF clone repo from: https://github.com/reingart/pyfpdf
pyFPDF Documentation:
https://pyfpdf.readthedocs.io/en/latest/Tutorial/index.html

How to display <IPython.core.display.HTML object>?

I try to run the below codes but I have a problem in showing the results.
also, I use pycharm IDE.
from fastai.text import *
data = pd.read_csv("data_elonmusk.csv", encoding='latin1')
data.head()
data = (TextList.from_df(data, cols='Tweet')
.split_by_rand_pct(0.1)
.label_for_lm()
.databunch(bs=48))
data.show_batch()
The output while I run the line "data.show_batch()" is:
IPython.core.display.HTML object
If you don't want to work within a Jupyter Notebook you can save data as an HTML file and open it in a browser.
with open("data.html", "w") as file:
file.write(data)
You can only render HTML in a browser and not in a Python console/editor environment.
Hence it works in Jupiter notebook, Jupyter Lab, etc.
At best you call .data to see HTML, but again it will not render.
I solved my problem by running the codes on Jupiter Notebook.
You could add this code after data.show_batch():
plt.show()
Another option besides writing it do a file is to use an HTML parser in Python to programatically edit the HTML. The most commonly used tool in Python is beautifulsoup. You can install it via
pip install beautifulsoup4
Then in your program you could do
from bs4 import BeautifulSoup
html_string = data.show_batch().data
soup = BeautifulSoup(html_string)
# do some manipulation to the parsed HTML object
# then do whatever else you want with the object
Just use the data component of HTML Object.
with open("data.html", "w") as file:
file.write(data.data)

Python-Tika returning "None" content for PDF's, but works with TIFF's

I have a PDF that i'm trying to get Tika to parse. The PDF is not OCR. Tesseract is installed on my machine.
I used ImageMagik to convert the file.tiff to file.pdf, so the tiff file I am parsing is a direct conversion from the PDF.
Tika parses the TIFF with no problems, but returns "None" content for the PDF. What gives? I'm using Tika 1.14.1, tesseract 3.03, leptonica-1.70
Here is the code...
from tika import parser
# This works
print(parser.from_file('/from/file.tiff', 'http://localhost:9998/tika'))
# This returns "None" for content
print(parser.from_file('/from/file.pdf', 'http://localhost:9998/tika'))
So, after some feedback from the Chris Mattman (who was wonderful, and very helpful!), I sorted out the issue.
His response:
Since Tika Python acts as a thin client to the REST server, you just
need to make sure the REST server is started with a classpath
configuration that sets the right flags for TesseractOCR, see here:
http://wiki.apache.org/tika/TikaOCR
While I had read this before, the issue did not click for me until later and some further reading. TesseractOCR does not natively support OCR conversion of PDF's - therefore, Tika doesn't either as Tika relies on Tesseract's support of PDF conversion (and further, neither does tika-python)
My solution:
I combined subprocess, ImageMagick (CLI) and Tika to work together in python to first convert the PDF to a TIFF, and then allow Tika/Tesseract to perform an OCR conversion on the file.
Notes:
This process is very slow for large PDF's
Requires: tika-python, tesseract, imagemagick
The code:
from tika import parser
import subprocess
import os
def ConvertPDFToOCR(file):
meta = parser.from_file(fil, 'http://localhost:9998/tika')
# Check if parsed content is NoneType and handle accordingly.
if "content" in meta and meta['content'] is None:
# Run ImageMagick via subprocess (command line)
params = ['convert', '-density', '300', u, '-depth', '8', '-strip', '-background', 'white', '-alpha', 'off', 'temp.tiff']
subprocess.check_call(params)
# Run Tika again on new temp.tiff file
meta = parser.from_file('temp.tiff', 'http://localhost:9998/tika')
# Delete the temporary file
os.remove('temp.tiff')
return meta['content']
You can enable the X-Tika-PDFextractInlineImages': 'true' and directly extract text from images in the pdfs. No need for conversion. Took a while to figure out but works perfectly.
from tika import parser
headers = {
'X-Tika-PDFextractInlineImages': 'true',
}
parsed = parser.from_file("Citi.pdf",serverEndpoint='http://localhost:9998/rmeta/text',headers=headers)
print(parsed['content'])

How to read and print the contents of a ttf file?

Is there any way that I can open, read and write a ttf file?
Example:
with open('xyz.ttf') as f:
content = f.readline()
print(content)
A bit more:
If I open a .ttf (font) file with windows font viewer we see the following image
From this I like to extract following lines as text, with proper style.
What is exactly inside this file with *.ttf extension. I think you need to add more details of the input and output. If you reffering to a font type database you must first find a module/package to open and read it, since *.ttf isn't a normal text file.
Read the given links and install the required packages first:
https://pypi.python.org/pypi/FontTools
Then, as suggested:
from fontTools.ttLib import TTFont
font = TTFont('/path/to/font.ttf')
print(font)
<fontTools.ttLib.TTFont object at 0x10c34ed50>
If you need help with something else trying putting the input and expected output.
Other links:
http://www.starrhorne.com/2012/01/18/how-to-extract-font-names-from-ttf-files-using-python-and-our-old-friend-the-command-line.html
Here is a another useful python script:
https://gist.github.com/pklaus/dce37521579513c574d0

Categories

Resources