I installed pdfkit using pip (pip install pdfkit).
Then I installed wkhtmltopdf from here.
But when I try to run the following code:
import pdfkit
config = pdfkit.configuration(wkhtmltopdf='C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
pdfkit.from_url('https://www.google.com', 'C:\\Users\\Χρήστος\\Desktop\\out-test.pdf', configuration=config)
An error occurs:
Traceback (most recent call last):
File "create_pdf.py", line 4, in <module>
pdfkit.from_url('https://www.google.com', 'C:\\Users\\Χρήστος\\Desktop\\out-test.pdf', configuration=config)
File "C:\Users\Χρήστος\AppData\Local\Programs\Python\Python38\lib\site-packages\pdfkit\api.py", line 26, in from_url
return r.to_pdf(output_path)
File "C:\Users\Χρήστος\AppData\Local\Programs\Python\Python38\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
QPainter::begin(): Returned false============================] 100%
Error: Unable to write to destination
Exit with code 1, due to unknown error.
Any advice would be useful.
I also ran the code with administrator privileges.
Edit: From wkhtmltopdf site:
How do I use it? Download a precompiled binary or build from source
Create your HTML document that you want to turn into a PDF (or image)
Run your HTML document through the tool. For example, if I really like
the treatment Google has done to their logo today and want to capture
it forever as a PDF:
wkhtmltopdf http://google.com google.pdf
So I tried this:
cd C:\Program Files\wkhtmltopdf\bin
wkhtmltopdf http://google.com google.pdf
It didn't work, but when I ran it with administrator privileges the PDF was created.
So the pdfkit module must open wkhtmltopdf with admin privileges.
There's an issue on GitHub; it seems you have to specify the full path (e.g. 'C://foo/bar').
Probably because the script doesn't have permission to write in the folder where it lives:
C:\Users\Χρήστος\AppData\Local\Programs\Python\Python38\lib\site-packages\pdfkit\pdfkit.py
You gave it a relative path, so at runtime it will try to create a file:
C:\Users\Χρήστος\AppData\Local\Programs\Python\Python38\lib\site-packages\pdfkit\out.pdf
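One way to rule out the write-permission problem is to resolve the output location to an absolute path and confirm the target folder is writable before handing it to pdfkit. The sketch below is only an illustration: writable_output_path and url_to_pdf are hypothetical helper names (not part of pdfkit), the home directory is an assumed default, and the wkhtmltopdf path has to match the local install.

```python
import os


def writable_output_path(filename, folder=None):
    """Return an absolute path for filename inside a folder we can write to.

    Hypothetical helper: defaults to the user's home directory when no
    folder is given, and raises if the folder is missing or not writable.
    """
    folder = os.path.abspath(folder or os.path.expanduser("~"))
    if not os.path.isdir(folder):
        raise FileNotFoundError("Output folder does not exist: " + folder)
    if not os.access(folder, os.W_OK):
        raise PermissionError("Cannot write to: " + folder)
    return os.path.join(folder, filename)


def url_to_pdf(url, filename, wkhtmltopdf_path, folder=None):
    """Render url to a PDF at an absolute, writable location."""
    import pdfkit  # imported lazily so the path helper works without pdfkit

    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    output = writable_output_path(filename, folder)
    pdfkit.from_url(url, output, configuration=config)
    return output
```

If writable_output_path raises, the problem is with the destination folder rather than with pdfkit or wkhtmltopdf.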
Related
I was trying to use pdfplumber library in python (ver. 3.10.6) to convert some pdf pages to images but pdfplumber to_image() method throws the following error:
>>> import pdfplumber
>>> myDOc = pdfplumber.open("CV.pdf")
>>> myImg = myDOc.pages[0].to_image(resolution=300)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\jjjku\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdfplumber\page.py", line 381, in to_image
return PageImage(self, **kwargs)
File "C:\Users\jjjku\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdfplumber\display.py", line 93, in __init__
self.original = get_page_image(
File "C:\Users\jjjku\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdfplumber\display.py", line 54, in get_page_image
with WandImage(
File "C:\Users\jjjku\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\wand\image.py", line 9306, in __init__
wand = library.NewMagickWand()
OSError: exception: access violation writing 0x0000000000000008
Initially I tried to use this method from PyCharm, but this error occurred. I then assumed that something might be wrong with the PyCharm configuration, so I tried the same from cmd, and the result is above (the same error as in PyCharm). I suspect that there is something wrong with my Image Wand or Ghostscript configuration. I have Windows 10 on my computer. I tried some ideas from the net but without results.
Does anyone have any idea what can be the cause of this error and how to make it work?
So, following the advice, I reported it as a bug on pdfplumber project GitHub and I got a response that this is related to some problems with Wand library dependencies. The issue was resolved after I installed the following package:
pip install -U wand
In PyCharm I added the Wand package to the Python Interpreter packages (make sure to have version 0.6.10; with 0.6.9 the error still occurred, from what I observed).
I am using this repository to deploy tesseract as a lambda layer: https://github.com/bweigel/aws-lambda-tesseract-layer
The deployment works well, and other pytesseract functions such as image_to_string and image_to_data also work well without any hiccups.
But, when I try to use image_to_pdf_or_hocr like this:
pdf = pytesseract.image_to_pdf_or_hocr(f'/tmp/{file_name}/{page.number}.png', extension='pdf')
it does not work and throws an error like:
Traceback (most recent call last):
File "/var/task/helpers/ocr_helper.py", line 36, in save_searchable_pdf
f'/tmp/{file_name}/{page.number}.png', extension='pdf')
File "/var/task/pytesseract/pytesseract.py", line 432, in image_to_pdf_or_hocr
return run_and_get_output(*args)
File "/var/task/pytesseract/pytesseract.py", line 289, in run_and_get_output
with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_6_hu78b0.pdf'
It says that the file tess_6_hu78b0.pdf does not exist. What does this mean? I have no file with tess_6_hu78b0 name to begin with.
The path that I am passing to the image_to_pdf_or_hocr function is 100% correct and an image is present there. I have confirmed this, and the same thing works on my local machine.
I have tried:
I found somewhere that I needed to install libtesseract-dev too. Hence, I modified my dockerfile as:
FROM lambci/lambda:build-python3.6
RUN sudo apt install tesseract-ocr
RUN sudo apt install libtesseract-dev
but unfortunately this too did not work.
After 18 hours of hard work, I was finally able to figure it out.
It turns out that https://github.com/bweigel/aws-lambda-tesseract-layer is not bundled with all the necessary files for pytesseract.image_to_pdf_or_hocr() to run.
So what I did was build leptonica and tesseract from source and generate the following:
configs folder
tessconfigs folder
pdf.ttf file
These required files are available here:
https://github.com/prameshbajra/tessdata
Inside https://github.com/bweigel/aws-lambda-tesseract-layer, under the ready-to-use folder, there is a directory named amazonlinux-1, and inside it there is a folder named tesseract/share/tessdata. All you need to do is paste the files listed above into this directory.
Just download this repo and replace the tessdata folder.
Note: This tessdata was built with tesseract 4.1.1
I hope this helps future readers.
Happy coding.
Thanks to Benjamin Genz (#bweigel) for publishing this repo. You made our lives easier.
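To check whether a tessdata folder contains the pieces listed above before deploying the layer, a small check like the following might help. It is a sketch under assumptions: missing_pdf_assets is a hypothetical helper, and the required names are simply the ones from this answer (the configs and tessconfigs folders plus the PDF font file, which Tesseract ships as pdf.ttf).

```python
import os

# Names taken from the answer above; pdf.ttf is the font file Tesseract
# looks for in tessdata when producing PDF output.
REQUIRED_DIRS = ("configs", "tessconfigs")
REQUIRED_FILES = ("pdf.ttf",)


def missing_pdf_assets(tessdata_dir):
    """Return the required entries that are absent from tessdata_dir."""
    missing = []
    for name in REQUIRED_DIRS:
        if not os.path.isdir(os.path.join(tessdata_dir, name)):
            missing.append(name)
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(tessdata_dir, name)):
            missing.append(name)
    return missing
```

An empty list means all three entries are present; anything else names what still has to be copied in.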
Adding this config argument fixed it for me, inspired by this solution :)
pytesseract.image_to_pdf_or_hocr("Image.png", extension="pdf", config = " -c tessedit_create_pdf=1")
Hi, I am trying to use the Python library pytesseract to extract text from an image.
Here is the code:
from PIL import Image
from pytesseract import image_to_string
print(image_to_string(Image.open(r'D:\new_folder\img.png')))
But I got the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 161, in image_to_string
config=config)
File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 94, in run_tesseract
stderr=subprocess.PIPE)
File "C:\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
I have not found a specific solution to this. Can anyone help me with what to do? Is there anything more to download, and if so, where can I download it from?
Thanks in advance :)
I had the same trouble and quickly found the solution after reading this post:
OSError: [Errno 2] No such file or directory using pytesser
It just needs to be adapted to Windows. Replace the following code:
tesseract_cmd = 'tesseract'
with:
tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
(need double \\ to escape first \ in the string)
You're getting the exception because subprocess isn't able to find the binaries (the tesseract executable).
The installation is a 3 step process:
1. Download/install system-level libs/binaries:
Help for various operating systems is available here. For macOS you can install it directly using brew.
Install Google Tesseract OCR (additional info how to install the
engine on Linux, Mac OSX and Windows). You must be able to invoke the
tesseract command as tesseract. If this isn’t the case, for example
because tesseract isn’t in your PATH, you will have to change the
“tesseract_cmd” variable at the top of tesseract.py. Under
Debian/Ubuntu you can use the package tesseract-ocr. For Mac OS users,
please install homebrew package tesseract.
For Windows:
An installer for the old version 3.02 is available for Windows from
our download page. This includes the English training data. If you
want to use another language, download the appropriate training data,
unpack it using 7-zip, and copy the .traineddata file into the
'tessdata' directory, probably C:\Program Files\Tesseract-OCR\tessdata.
To access tesseract-OCR from any location you may have to add the
directory where the tesseract-OCR binaries are located to the Path
variables, probably C:\Program Files\Tesseract-OCR.
You can download the .exe from here.
2. Install the Python package
pip install pytesseract
3. Finally, you need to have the tesseract binary in your PATH.
Or, you can set it at run-time:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = '<path-to-tesseract-bin>'
For Windows:
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
The above line will make it work temporarily; for a permanent solution, add the directory containing tesseract.exe to the PATH, such as PATH=%PATH%;"C:\Program Files (x86)\Tesseract-OCR".
Besides that, make sure that the TESSDATA_PREFIX Windows environment variable is set to the directory containing the tessdata directory. For example:
TESSDATA_PREFIX=C:\Program Files (x86)\Tesseract-OCR
i.e. tessdata location is: C:\Program Files (x86)\Tesseract-OCR\tessdata
Your example:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
print(pytesseract.image_to_string(Image.open(r'D:\new_folder\img.png')))
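Rather than hard-coding one location, the lookup described above (PATH first, then a known install folder) could be sketched as follows. find_tesseract is a hypothetical helper, and the candidate folders are only assumed defaults that should be adjusted to the actual machine.

```python
import os
import shutil

# Assumed install locations; adjust to match the actual machine.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return a usable tesseract path, or None if nothing is found."""
    # Prefer whatever is already reachable through PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the first candidate file that exists.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

When the result is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling image_to_string.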
You need the Tesseract OCR engine ("Tesseract.exe") installed on your machine. If the path is not configured on your machine, provide the complete path in pytesseract.py (tesseract.py).
README
Install Google Tesseract OCR (additional info how to install the engine on Linux, Mac OSX and Windows). You must be able to invoke the tesseract command as tesseract. If this isn't the case, for example because tesseract isn't in your PATH, you will have to change the "tesseract_cmd" variable at the top of tesseract.py. Under Debian/Ubuntu you can use the package tesseract-ocr. For Mac OS users, please install homebrew package tesseract.
Another thread
I have also faced the same problem regarding pytesseract.
I would suggest working in a Linux environment to avoid such errors.
Run the following commands in Linux:
pip install pytesseract
sudo apt-get update
sudo apt-get install tesseract-ocr
Hope this will do the work.
I have been using WinPython 2.2.5 with Python 2.7 and it works nicely. The problem I have is when I want to install additional libraries from the https://pypi.python.org repository.
For example I tried to install pdfminer which is in following link: https://pypi.python.org/pypi/pdfminer/
I have read that I can use pip install which is in the following path on my computer:
C:\WinPython-32bit-2.7.6.3\python-2.7.6\Scripts
In that directory I have saved the tar.gz file of pdfminer, and from the Windows command prompt in the aforementioned path I typed:
pip install pdfminer(version number).tar.gz
It seems to work fine, because there are no error messages, but when I open WinPython and type the following in the command shell:
pdf2txt
to see if it works, I get the following error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'pdf2txt' is not defined
What am I doing wrong?
According to the documentation, "PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py." So, instead of trying to run pdf2txt.py by importing it, you need to run it as it shows in the example in the documentation, like this:
$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
where output.html is the file that is created from the mined text, and samples/naacl06-shinyama.pdf is the PDF you want to mine.
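If the tool has to be started from inside Python rather than from the command prompt, it can be launched as a separate process instead of being typed as a name. This is only a sketch: pdf2txt_command and run_pdf2txt are hypothetical helpers, and script_path is assumed to point at wherever pip placed pdf2txt.py (for example the Scripts folder).

```python
import subprocess
import sys


def pdf2txt_command(script_path, pdf_path, output_path):
    """Build the argument list for running pdf2txt.py with this interpreter."""
    return [sys.executable, script_path, "-o", output_path, pdf_path]


def run_pdf2txt(script_path, pdf_path, output_path):
    """Run pdf2txt.py as a separate process; raises if it exits non-zero."""
    subprocess.check_call(pdf2txt_command(script_path, pdf_path, output_path))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.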
I tried installing the pdfkit Python API on my Windows 8 machine. I'm getting issues related to the path.
Traceback (most recent call last):
File "C:\Python27\pdfcre", line 13, in <module>
pdfkit.from_url('http://google.com', 'out.pdf')
File "C:\Python27\lib\site-packages\pdfkit\api.py", line 22, in from_url
configuration=configuration)
File "C:\Python27\lib\site-packages\pdfkit\pdfkit.py", line 38, in __init__
self.configuration = (Configuration() if configuration is None
File "C:\Python27\lib\site-packages\pdfkit\configuration.py", line 27, in __init__
'https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf' % self.wkhtmltopdf)
IOError: No wkhtmltopdf executable found: ""
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
Has anybody installed Python PDFKit on a Windows machine? How do I resolve this error?
My sample code:
import pdfkit
import os
config = pdfkit.configuration(wkhtmltopdf='C:\\Python27\\wkhtmltopdf\bin\\wkhtmltopdf.exe')
pdfkit.from_url('http://google.com', 'out.pdf')
The following should work without needing to modify the windows environment variables:
import pdfkit
path_wkhtmltopdf = r'C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
pdfkit.from_url("http://google.com", "out.pdf", configuration=config)
Assuming the path is correct of course (e.g. in my case it is r'C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe').
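The raw string also matters for a second reason: in the question's sample path, the single backslash before bin is read by Python as the escape \b (a backspace character), so the path no longer names the real folder. A minimal illustration, reusing the path from the question:

```python
# A normal string turns "\b" into a backspace character (0x08),
# so the folder name "bin" is silently damaged.
broken = 'C:\\Python27\\wkhtmltopdf\bin\\wkhtmltopdf.exe'

# A raw string keeps every backslash exactly as typed.
raw = r'C:\Python27\wkhtmltopdf\bin\wkhtmltopdf.exe'

# Doubling every backslash gives the same result as the raw string.
escaped = 'C:\\Python27\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'

print('\x08' in broken)  # True
print(raw == escaped)    # True
```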
Please install wkhtmltopdf using:
sudo apt install -y wkhtmltopdf
For a Windows machine, install it from the link below: http://wkhtmltopdf.org/downloads.html
You also need to add the wkhtmltopdf path to the environment variables.
IOError: 'No wkhtmltopdf executable found'
Make sure that you have wkhtmltopdf in your $PATH or set via custom configuration. Running where wkhtmltopdf on Windows or which wkhtmltopdf on Linux should return the actual path to the binary.
Adding this configuration line worked for me:
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_string(html, 'MyPDF.pdf', configuration=config)
From github
It seems you need to pass configuration=config as an argument.
I am learning Python today, and I met the same problem. I then set the Windows environment variables and everything is OK.
I added the install path of wkhtmltopdf to the PATH. For example, "D:\developAssistTools\wkhtmltopdf\bin;" is my install path of wkhtmltopdf; after adding it to the PATH, everything is OK.
import pdfkit
pdfkit.from_url("http://google.com", "out.pdf")
Finally, I found an out.pdf.
import pdfkit
path_wkthmltopdf = rb'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
pdfkit.from_url("http://google.com", "rajul-url.pdf", configuration=config)
pdfkit.from_file("output.xml","rajul-pdf.pdf", configuration=config)
The above code block is working perfectly fine for me. Please note that the file to be converted is in the same directory where the PDF file is created.
Ran into the same problem on a Mac. For some reason, it worked after uninstalling the pip installation and reinstalling wkhtmltopdf using brew:
pip uninstall wkhtmltopdf
and use brew
brew install Caskroom/cask/wkhtmltopdf
You need to set:
pdfkit.from_url('http://google.com', 'out.pdf',configuration=config)
I found that on a Windows platform the path needed to be a binary string; try:
path_wkthmltopdf = rb'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
pdfkit.from_url(url=urlpath, output_path=pdffilepath,configuration=config)
import pdfkit


def urltopdf(url, pdffile):
    '''
    input
    - url : target url
    - pdffile : target pdf file name
    '''
    path_wkthmltopdf = 'D:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
    # Pass the configuration explicitly so pdfkit can find wkhtmltopdf.
    pdfkit.from_url(url, pdffile, configuration=config)


urltopdf('http://www.google.com', 'pdf/google.pdf')
very good solution!
thanks everyone!
When I tried all of the above methods, I was still facing a permission error, as I don't have admin rights on my workstation.
If that's the case for you too, then when you install wkhtmltopdf.exe, make sure the destination folder for installation is in your Python site-packages folder, or add the directory to sys.path.
Normally it gets installed in the Program Files folder.
I changed the installation directory and this works for me:
import pdfkit
pdfkit.from_url("http://google.com", "out.pdf")
[For Ubuntu/Debian]
First run: sudo apt-get update --fix-missing
Then: sudo apt-get install -y wkhtmltopdf
Hope this solves your problem.
For Windows, try to use the complete path C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe
There is no need to write the wkhtmltopdf path into the code. Just define an environment variable for it, and it works.
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
For me this code is working.