pdfkit changes href from relative to absolute paths on conversion

I'm using pdfkit to convert HTML files that contain links with href attributes.
Inside the HTML, the hrefs are written as relative paths, e.g.:
PIC
When I convert this to PDF, the hrefs seem to be automatically rewritten to absolute paths (C:/Users/...).
Why does pdfkit change the hrefs?

wkhtmltopdf, which pdfkit relies on, converts relative links to absolute links by default.
To keep them relative, pass the --keep-relative-links flag to the command line tool:
wkhtmltopdf --keep-relative-links src destination
Or tell pdfkit to apply this option:
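import sys

import pdfkit

# path_wkthmltopdf must hold the path to the wkhtmltopdf executable
def convert_to_pdf(path):
    try:
        # run the conversion and write the result to a file
        config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
        options = {
            # flag without a value: keep links as written in the HTML
            '--keep-relative-links': ''
        }
        pdfkit.from_url(path + '.htm', path + '.pdf', configuration=config, options=options)
    except Exception as why:
        # report the error, then re-raise it for the caller
        sys.stderr.write('Pdf Conversion Error: {}\n'.format(why))
        raise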

A PDF created from an HTML file is usually opened somewhere other than where it was generated (for example on another computer after being sent by mail). A relative link would no longer resolve there, so the full path is needed to reference the target correctly.
Of course, this only works if the other computer can access that path. Paths on C: only work on the machine that created the PDF, not on other PCs.
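
To illustrate what "rewritten to absolute" means, here is a minimal sketch of resolving a relative href against the location of the HTML file. It uses Python's urllib.parse.urljoin and is not wkhtmltopdf's actual code. The base location and the href are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin

# Hypothetical location of the HTML file being converted (an assumption)
base = "file:///C:/Users/me/docs/report.htm"

# A relative href as it might appear in the HTML (also an assumption)
relative_href = "images/pic.png"

# Resolve the relative href against the directory of the base document
absolute_href = urljoin(base, relative_href)
print(absolute_href)  # file:///C:/Users/me/docs/images/pic.png
```

The resulting link only makes sense on a machine where that C: path exists, which is the limitation described above.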

Related

Python tarfile.extract func not extracting content of directory

I'm trying to extract a directory from a tar file using Python, but some or all of the files inside that directory are missing after extraction. Only the path gets extracted (i.e., I get the folder home inside /tmp/myfolder, but it is empty).
The code is as follows:
for tar in tarfiles:
    mytar = tarfile.open(tar)
    for file in mytar:
        if file.name == "myfile":
            mytar.extract('home', '/tmp/myfolder')

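I found a fix. By default, extract only extracts the entry it is given, so for a directory I get just the folder itself. I can get the contents with extractall, where members is a helper function (taken from the referenced answer, not part of tarfile) that yields the members to extract:
tar.extractall(members=members(tar))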
Reference:
https://stackoverflow.com/a/43094365/20223973
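
As a rough sketch of that approach: the helper below selects every member at or below a directory and passes the selection to extractall. The names members_under and extract_directory are made up for this example and are not the code from the referenced answer.

```python
import tarfile


def members_under(tar, prefix):
    """Yield the archive members at or below the directory `prefix`."""
    prefix = prefix.rstrip("/")
    for member in tar.getmembers():
        if member.name == prefix or member.name.startswith(prefix + "/"):
            yield member


def extract_directory(archive_path, prefix, destination):
    """Extract the directory `prefix`, including its contents, into `destination`."""
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=destination, members=members_under(tar, prefix))
```

For the question above, the call would be something like extract_directory(tar, 'home', '/tmp/myfolder') for each archive in tarfiles.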

pdfkit- Warning: Blocked access to file

I am getting a warning (Blocked access to file) during HTML-to-PDF conversion with the pdfkit library when my HTML file uses a local image.
How can I use local images in my HTML file?

I faced the same problem. I solved it by adding the "enable-local-file-access" option to pdfkit.from_file().
options = {
    "enable-local-file-access": None
}
pdfkit.from_file(html_file_name, pdf_file_name, options=options)

pdfkit is a Python wrapper for wkhtmltopdf. It seems to have inherited the default behaviour of recent wkhtmltopdf versions, which now block local file access unless told otherwise.
However, since pdfkit lets you specify any of the original wkhtmltopdf options, you should be able to resolve this problem by passing the enable-local-file-access option.
Following the example on the pdfkit site, that would probably look something like this:
options = {
    "enable-local-file-access": ""
}
# output_path=False makes from_string return the PDF instead of writing a file
pdfkit.from_string(html, output_path=False, options=options)

Python Selenium: Firefox set_preference to overwrite files on download?

I am using these Firefox preference settings for Selenium in Python 2.7:
ff_profile = webdriver.FirefoxProfile(profile_dir)
ff_profile.set_preference("browser.download.folderList", 2)
ff_profile.set_preference("browser.download.manager.showWhenStarting", False)
ff_profile.set_preference("browser.download.dir", dl_dir)
ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
With Selenium, I want to download the same file repeatedly and overwrite it each time, keeping the same filename, without having to confirm the download.
With the settings above, it downloads without asking for a location, but every download creates a duplicate named filename (1).ext, filename (2).ext, etc. on macOS.
I'm guessing Firefox might not have a setting that allows overwriting, perhaps to prevent accidents.
(In that case, I suppose the solution would be to handle the overwriting on disk with other Python modules, which is another topic.)

This is outside Selenium's scope and is handled by the operating system.
Judging by the context of this and your previous question, you know (or can determine from the link text) the filename beforehand. If this is really the case, before hitting the "download" link, make sure you remove the existing file:
import os
filename = "All-tradable-ETFs-ETCs-and-ETNs.xlsx"  # or extract it dynamically from the link
filepath = os.path.join(dl_dir, filename)
if os.path.exists(filepath):
    os.remove(filepath)
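
The same idea can be wrapped in a small helper to call before each download. This is only a sketch: remove_existing_download is a made-up name, and it assumes Python 3 (FileNotFoundError does not exist in Python 2.7, where the os.path.exists check above would be used instead).

```python
import os


def remove_existing_download(download_dir, filename):
    """Delete download_dir/filename if present; return True if a file was removed."""
    filepath = os.path.join(download_dir, filename)
    try:
        os.remove(filepath)
    except FileNotFoundError:
        # nothing to overwrite, so the next download keeps the plain filename
        return False
    return True
```

Calling remove_existing_download(dl_dir, filename) right before clicking the download link means Firefox never finds an existing file to number around.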

I can use 'zipme' to download my source code from GAE, but I cannot download files from the parent folder

I followed this article: zipme
and downloaded my files successfully. Now I want to download other files too, for example the parent folder,
so I changed this:
dirname=os.path.dirname
folder = dirname(__file__)
to
dirname=os.path.dirname
folder = dirname(dirname(__file__))
but I get this error:
Firefox can't find the file
Why?
Thanks

You get the error because something fails in the script, so it does not return a valid ZIP file in the response.
The most probable reason is that your zipme.py is in the root of your application. If you try to get the parent folder of the root folder (the root being what dirname(__file__) returns), it fails because there is no parent folder (or at least none accessible to your code).
As far as I can see, there is no reason to make this change, because the original dirname(__file__) should already ZIP all of your application's files.
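
For reference, zipping everything below a folder can be sketched with the standard library as follows. This is not the zipme code; zip_folder is a made-up name, and the sketch only shows why pointing it at the application root already covers all files in the application.

```python
import io
import os
import zipfile


def zip_folder(folder):
    """Return the bytes of a ZIP archive holding every file below `folder`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # store paths relative to `folder`, not absolute paths
                archive.write(full_path, os.path.relpath(full_path, folder))
    return buffer.getvalue()
```

With folder = os.path.dirname(__file__) and the script in the application root, the walk already reaches every subfolder, so going one level up adds nothing the code can read.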

Resolving a relative path from py:match in a genshi template

<py:match path="foo">
<?python
import os
href = select('#href').render()
SOMEWHERE = ... # what file contained the foo tag?
path = os.path.abspath(os.path.join(os.path.dirname(SOMEWHERE), href))
f = open(path, 'r')
# (do something interesting with f)
?>
</py:match>
...
<foo href="../path/relative/to/this/template/abcd.xyz"/>
What should go in place of SOMEWHERE above? I want that href attribute to be relative to the file with the foo tag in it, like href attributes on other tags.
Alternatively, what file contained the py:match block? This is less good because it may be in a different directory from the file with the foo tag.
Even less good: I could supply the path of the file I'm rendering as a context argument from outside Genshi, but that might be in a different directory from both of the above.

You need to make sure that the driver program (i.e., the Python program that parses the input file) runs in the directory of the file containing the foo tag. Otherwise, you need to pass the relative path (i.e., how to get from the directory in which the driver runs to the directory of the file being read) as a context argument to your Python code and add it to the os.path.join call.
With this setup (using Genshi 0.6 installed on Mac OS X 10.6.3 via the Fink package genshi-py26), os.getcwd() returns the directory of the file containing the foo tag.
For such complicated path constructs I also strongly recommend normalizing with path = os.path.normpath(path), since you probably do not want ".." segments to leak into your resulting HTML code.
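
A minimal sketch of the context-argument variant, assuming the driver passes the path of the file containing the foo tag into the template context. The function name resolve_href and the parameter template_path are made up for this example and are not part of Genshi.

```python
import os


def resolve_href(template_path, href):
    """Resolve href relative to the directory of template_path and normalize it."""
    base_dir = os.path.dirname(template_path)
    # normpath collapses ".." segments so they do not leak into the output
    return os.path.normpath(os.path.join(base_dir, href))
```

Inside the py:match block, the path line would then become path = resolve_href(template_path, href), with template_path supplied by the driver.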
