pdfkit- Warning: Blocked access to file


I am getting an error (Blocked access to file) when converting HTML to PDF with the pdfkit library while using a local image in my HTML file.
How can I use local images in my HTML file?

I faced the same problem. I solved it by adding "enable-local-file-access" option to pdfkit.from_file().
options = {
    "enable-local-file-access": None
}
pdfkit.from_file(html_file_name, pdf_file_name, options=options)

pdfkit is a Python wrapper for wkhtmltopdf. It seems to have inherited the default behaviour of recent wkhtmltopdf versions, which block local file access unless told otherwise.
However, since pdfkit allows you to specify any of the original wkhtmltopdf options, you should be able to resolve this problem by passing the enable-local-file-access option.
Following the example on the pdfkit site, that would probably look something like this:
options = {
    "enable-local-file-access": ""
}
pdfkit.from_string(html, output_path=False, options=options)
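The two answers above can be folded into a small helper. This is only a minimal sketch: the names `local_access_options` and `html_file_to_pdf` are hypothetical and introduced here for illustration, while the option key and `pdfkit.from_file` come from the answers. It assumes pdfkit and a wkhtmltopdf binary are installed, and pdfkit is imported only when a conversion is actually requested.

```python
def local_access_options(extra=None):
    """Return pdfkit options that let wkhtmltopdf read local files."""
    # The answers above pass None or "" as the value, i.e. a flag with no argument.
    options = {"enable-local-file-access": None}
    if extra:
        options.update(extra)  # merge any further wkhtmltopdf options
    return options


def html_file_to_pdf(html_file_name, pdf_file_name, extra=None):
    """Convert a local HTML file to PDF (hypothetical helper)."""
    import pdfkit  # lazy import: needs pdfkit and wkhtmltopdf installed

    return pdfkit.from_file(
        html_file_name, pdf_file_name, options=local_access_options(extra)
    )
```

Usage would look like `html_file_to_pdf("page.html", "page.pdf")`, where both file names are placeholders.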

Related

python how to use tika with existing jar file without downloading again

I'm using Tika and I realized that each time the jar file is downloaded and placed in Temp folder
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.
The problem is that the jar file size is around 60MB, which takes some time to download.
This is the code I'm using :
from tika import parser
def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']
The only workaround I found is this :
1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx
2 - Using tika.TikaClientOnly = True
3 - Replacing parser.from_file(path) with parser.from_file(path, '/path/to/server')
But I don't want to run the jar file manually. It would be better if I could use Python to run the jar file automatically and set up tika with it without redownloading.

To resolve this problem, set the TIKA_SERVER_JAR environment variable to the path of the folder that contains the tika server jar file:
TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'

If you don't want to add an environment variable, you can change the directory in which tika looks for the tika-server.jar file with the code below.
from tika import tika
tika.TikaJarPath = r'TIKA_SERVER_PATH'
In TIKA_SERVER_PATH, the jar file must be named tika-server.jar (the name must not include the version), and the .md5 file must be there too. If the .md5 file does not match tika-server.jar, this method doesn't work: tika will delete your file and download the default version.

Here is what worked for me:
import os
os.environ['TIKA_SERVER_JAR'] = "<path_to_jar_and_md5>/tika-server.jar"
os.environ['TIKA_PATH'] = "<path_to_jar_and_md5_again>"
These are read when the library is imported, so import the parser afterwards, and reimport it if you change them.

After trying almost everything and debugging the tika.py library code, I found that you must set both of these variables for this hack to work.
TIKA_SERVER_JAR="/path_to_tika_server/tika-server.jar"
TIKA_PATH="/path_to_tika_server"
You also need to provide a .md5 signature file, because since Tika version 1.18 no .md5 file is published (a sha512 signature is provided instead, see https://archive.apache.org/dist/tika/). So you need to trick the library into accepting your downloaded file.
Or someone could just patch the Python library :)

I am wondering how to get the .md5 file for tika-server.jar, since no .md5 file is provided and a sha512 signature is provided instead.
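One way to produce the missing .md5 file is to compute the digest locally with `hashlib`. This is a sketch rather than the library's documented procedure: it assumes, going by the answers above, that tika-python compares the jar's MD5 hex digest with the contents of `tika-server.jar.md5` in the same folder. The function name `prepare_local_tika` is hypothetical; the environment variable names are the ones quoted in the answers.

```python
import hashlib
import os


def prepare_local_tika(jar_dir, jar_name="tika-server.jar"):
    """Write <jar>.md5 next to the jar and point the tika env vars at it."""
    jar_path = os.path.join(jar_dir, jar_name)

    # Hash in chunks so a ~60 MB jar is not loaded into memory at once.
    digest = hashlib.md5()
    with open(jar_path, "rb") as jar:
        for chunk in iter(lambda: jar.read(1 << 20), b""):
            digest.update(chunk)

    # Assumption: the file should hold the bare hex digest, no trailing newline.
    with open(jar_path + ".md5", "w") as md5_file:
        md5_file.write(digest.hexdigest())

    # Set these before importing tika, since they are read at import time.
    os.environ["TIKA_SERVER_JAR"] = jar_path
    os.environ["TIKA_PATH"] = jar_dir
    return digest.hexdigest()
```

Call it once before `from tika import parser`, passing the folder that holds your downloaded jar.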

pdfkit changes href from relative to absolute paths on conversion

I'm using pdfkit to convert HTML files that contain links with href attributes.
Inside the HTML, the hrefs are written as relative paths, e.g.:
PIC
When I convert this to PDF, the hrefs seem to be automatically rewritten to absolute paths (C:/Users/...).
Why does pdfkit change the href?

Wkhtmltopdf, which pdfkit relies on, converts relative links to absolute links by default.
This can be stopped by using the command line tool with a special flag:
wkhtmltopdf --keep-relative-links src destination
Or by telling pdfkit to apply this option:
import sys
import pdfkit
# path_wkthmltopdf must hold the full path to the wkhtmltopdf executable
def convert_to_pdf(path):
    try:
        # run the conversion and write the result to a file
        config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
        options = {
            '--keep-relative-links': ''
        }
        pdfkit.from_url(path+'.htm', path+'.pdf', configuration=config, options=options)
    except Exception as why:
        # report the error
        sys.stderr.write('Pdf Conversion Error: {}\n'.format(why))
        raise

Usually, when you create a PDF from an HTML file, the PDF will be opened in another location (for example on another computer after it has been sent by mail). For the links to resolve correctly there, the full path is needed.
Of course, this only works if the other computer can access that path. Paths on C: will only work on the local machine and not from other PCs.

Python Webbrowser Opening URLs with Chrome instead of IE

I've been attempting to create a function that iterates over inputs from a text file that contains URLs, using the webbrowser package. It works fine when I create an empty list to which URLs are literally appended, as in:
import webbrowser
urls = []
urls.append(url1)
urls.append(url2)
def webbrowsing(urls):
    for url in urls:
        webbrowser.open(url)
where url1 and url2 are any valid URLs. And webbrowser.open() opens the URLs in Chrome and it is really good.
However, when I try and do the same thing with inputs from a text file of URLs, webbrowser opens the URLs from the file in Internet Explorer. I gave it a try using webbrowser.get(), explicitly directing it to use Chrome, but that didn't work.
I am not very sure why it does not open the URLs in Chrome, when almost everything seems the same as when the list is used as mentioned above. Chrome is set as my default web browser, and I rarely use the IE. I'd really appreciate any tips on that issue.

How do you define the 'webbrowser' object? I use something like this with Selenium:
from selenium import webdriver
driver = webdriver.Chrome(driverPath) # driverPath contains the path to the 'chromedriver.exe' file
driver.get(url)
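The question does not show how the file is read, so the following is only a guess at what may differ between the two cases: lines read from a text file keep their trailing newline, and an entry without a scheme might not be recognised as a web address. The sketch below strips each line, skips blanks, and prepends a scheme where one is missing before calling `webbrowser.open`. The names `load_urls` and `open_all` and the `https://` default are assumptions for illustration, not something the question or answer confirms.

```python
import webbrowser


def load_urls(path, default_scheme="https://"):
    """Read one URL per line, dropping blanks and surrounding whitespace."""
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()  # removes the trailing newline
            if not url:
                continue
            if "://" not in url:
                url = default_scheme + url  # assumption: bare hosts mean https
            urls.append(url)
    return urls


def open_all(path):
    """Open every URL listed in the file with the default browser."""
    for url in load_urls(path):
        webbrowser.open(url)
```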

Python Selenium: Firefox set_preference to overwrite files on download?

I am using these Firefox preference setting for selenium in Python 2.7:
ff_profile = webdriver.FirefoxProfile(profile_dir)
ff_profile.set_preference("browser.download.folderList", 2)
ff_profile.set_preference("browser.download.manager.showWhenStarting", False)
ff_profile.set_preference("browser.download.dir", dl_dir)
ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
With Selenium, I want to download the same file repeatedly and overwrite it each time, keeping the same filename, without having to confirm the download.
With the settings above, it downloads without asking for a location, but every download creates a duplicate named filename (1).ext, filename (2).ext, etc. on macOS.
I'm guessing there might not be a setting to allow overwriting from within Firefox, to prevent accidents(?).
(In that case, I suppose the solution would be to handle the overwriting on the disk with other Python modules; another topic).

This is outside Selenium's scope and is handled by the operating system.
Judging by the context of this and your previous question, you know (or can determine from the link text) the filename beforehand. If that is the case, remove the existing file before hitting the "download" link:
import os
filename = "All-tradable-ETFs-ETCs-and-ETNs.xlsx" # or extract it dynamically from the link
filepath = os.path.join(dl_dir, filename)
if os.path.exists(filepath):
    os.remove(filepath)
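The same idea can be wrapped in a reusable helper to call before each download. This is a small sketch; `clear_previous_download` is a hypothetical name, and it attempts the removal directly instead of checking for existence first, so a file that disappears in between does not raise.

```python
import os


def clear_previous_download(dl_dir, filename):
    """Delete dl_dir/filename if present; return True when a file was removed."""
    filepath = os.path.join(dl_dir, filename)
    try:
        os.remove(filepath)
        return True
    except FileNotFoundError:
        # Nothing to overwrite, so the next download keeps its plain name.
        return False
```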

Naming a file when downloading with Selenium Webdriver

I see that you can set where to download a file to through Webdriver, as follows:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/csv")
browser = webdriver.Firefox(firefox_profile=fp)
But, I was wondering if there is a similar way to give the file a name when it is downloaded? Preferably, probably not something that is associated with the profile, as I will be downloading ~6000 files through one browser instance, and do not want to have to reinitiate the driver for each download.

I would suggest a slightly unusual approach: do not download files with Selenium if you can avoid it.
I mean get the file URL and use the urllib library to download the file and save it to disk 'manually'. The issue is that Selenium doesn't have a tool to handle Windows dialogs, such as the 'save as' dialog. I'm not sure, but I doubt that it can handle any OS dialogs at all; please correct me if I'm wrong. :)
Here's a tiny example:
import urllib # Python 2; on Python 3 the function lives in urllib.request
urllib.urlretrieve("http://www.yourhost.com/yourfile.ext", "your-file-name.ext")
The only job left for us is to make sure that we handle all the urllib exceptions. Please see http://docs.python.org/2/library/urllib.html#urllib.urlretrieve for more info.
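For Python 3, where `urlretrieve` lives in `urllib.request`, the same approach with the exception handling mentioned above might look like the sketch below. The wrapper name `download_as` is hypothetical, and returning None on failure is just one possible way to report errors.

```python
import urllib.error
import urllib.request


def download_as(url, target_name):
    """Save url under target_name; return the path, or None on failure."""
    try:
        path, _headers = urllib.request.urlretrieve(url, target_name)
        return path
    except (urllib.error.URLError, OSError) as why:
        # URLError covers network and HTTP failures, OSError covers disk errors.
        print("Download failed: {}".format(why))
        return None
```

Because the target name is an argument, each of the ~6000 files can be named at download time without touching the browser profile.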

I do not know if there is a pure Selenium handler for this, but here is what I have done when I needed to do something with the downloaded file.
Set up a loop that polls your download directory for the latest file that does not have a .part extension (this extension indicates a partial download and would occasionally trip things up if not accounted for). Put a timer on the loop so that you don't loop forever if a timeout or other error keeps the download from completing. I used the output of the ls -t <dirname> command in Linux (my old code uses the commands module, which is deprecated, so I won't show it here :) ) and got the first file by using
# result = output of ls -t
result = result.split('\n')[1].split(' ')[-1]
If the while loop exits successfully, the topmost file in the directory will be your file, which you can then modify using os.rename (or anything else you like).
Probably not the answer you were looking for, but hopefully it points you in the right direction.
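A portable variation of that polling loop, using `os.listdir` instead of `ls -t`, is sketched below. It differs from the description above in one respect: rather than taking the topmost file, it looks for a file that was not present before the download started, which seems safer when thousands of files land in one directory. The name `wait_for_new_file` and its parameters are hypothetical, and skipping names that still have a matching .part file is an assumption about how the browser writes downloads.

```python
import os
import time


def wait_for_new_file(directory, before, timeout=60.0, poll_interval=0.5):
    """Return the path of a finished file not listed in `before`, else None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        finished = []
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if name in before or name.endswith(".part"):
                continue
            # Skip placeholders whose partial download is still in progress.
            if os.path.exists(path + ".part") or not os.path.isfile(path):
                continue
            finished.append(path)
        if finished:
            # Newest first, mirroring what `ls -t` would list at the top.
            return max(finished, key=os.path.getmtime)
        time.sleep(poll_interval)
    return None  # timed out: the download did not complete
```

Take `before = set(os.listdir(directory))` just before clicking the download link, then pass the returned path to os.rename.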

Solution with code as suggested by the selected answer. Rename the file after each one is downloaded.
import os
os.chdir(SAVE_TO_DIRECTORY)
files = filter(os.path.isfile, os.listdir(SAVE_TO_DIRECTORY))
files = [os.path.join(SAVE_TO_DIRECTORY, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
newest_file = files[-1]
os.rename(newest_file, docName + ".pdf")
This answer was posted as an edit to the question naming a file when downloading with Selenium Webdriver by the OP user1253952 under CC BY-SA 3.0.
