Webpage text-only into PDF with Python PDFkit

I am trying to convert a large section of a website into a PDF. I can convert one page with pdfkit, but I do not want the images on the website, just the text. Is there a way to do this with pdfkit? I have been searching Google for the last half hour looking for a solution, but can only find information about getting pictures, not excluding them.
Thank you for your help!

This information can be found in the documentation for "wkhtmltopdf".
This tool has a --no-images option.
The PyPI page for pdfkit explains how to set options when using the Python package.
So this is what you are looking for:
import pdfkit

# Any wkhtmltopdf flag can be passed through the options dict;
# 'no-images' maps to wkhtmltopdf's --no-images flag, so only the text is rendered.
options = {
    'no-images': True
}

pdfkit.from_url('https://www.google.com/', 'out.pdf', options=options)

Related

ReportLab: edit a certain page after creating several pages

I want to edit a certain page while creating a PDF with ReportLab only. (There are some solutions with PyPDF2, but I want to use ReportLab only - if it is possible).
Description of what I am doing / trying to do:
I am creating a PDF file which gives the reader a good overview of certain data. I get the data from a server. I know the data structure, but from PDF to PDF it varies how much data I get. That's why some PDFs will be 20 pages long and some can be 50+ pages.
After getting all the data and creating a beautiful PDF (this part is done by now), I want to go back to page 2 of this PDF and create my own, very individual table of contents.
But I can't find anywhere how to edit a certain page after creating several new pages.
What I've done so far to try to solve my problem:
I read the documentation
I checked stackoverflow
I checked git-repos
Help would be really appreciated. In case it is not possible to edit a certain page after other pages have been added with ReportLab, I will consider using PyPDF2 instead. I have little to no experience with PyPDF2 so far, so if you have some good links you can send me, I'd be very thankful.
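In case the PyPDF2 fallback becomes necessary, here is a minimal sketch (assuming the PyPDF2 3.x API and placeholder filenames) of building the report body and the table of contents as two separate ReportLab outputs and then splicing the TOC in as page 2:
from PyPDF2 import PdfReader, PdfWriter

# report_body.pdf is the finished multi-page report, toc_page.pdf the one-page TOC
body = PdfReader('report_body.pdf')
toc = PdfReader('toc_page.pdf')
writer = PdfWriter()
writer.add_page(body.pages[0])        # keep the cover page
for page in toc.pages:                # insert the TOC as page 2
    writer.add_page(page)
for page in body.pages[1:]:           # then the rest of the report
    writer.add_page(page)
with open('report_with_toc.pdf', 'wb') as f:
    writer.write(f)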

Grabbing an article from a pdf file - Python

I have more than 5000 PDF files with at least 15 pages each and 20 pages at most. I used PyPDF2 to find out which of the 5000 PDF files contain the keyword I am looking for, and on which page.
Now I have the following data:
I was wondering if there is a way for me to get the specific article on the specific page using this data. I now know which filenames to check and which pages.
Thanks a lot.
There is a library called tika. It can extract the text from a single page. You can split your PDF so that only the page in question remains, then use:
from tika import parser
parsed_page = parser.from_file('sample.pdf')
print(parsed_page['content'])
NOTE: This library requires Java to be installed on the system
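As a rough sketch of the splitting step, assuming PyPDF2 (which the question already uses) to isolate the page before handing the single-page file to tika; the filename and page index are placeholders:
from PyPDF2 import PdfReader, PdfWriter
from tika import parser

def extract_page_text(pdf_path, page_index):
    # Copy one page into a temporary single-page PDF
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])   # zero-based page index
    with open('single_page.pdf', 'wb') as f:
        writer.write(f)
    # Let tika pull the text out of that one page
    parsed = parser.from_file('single_page.pdf')
    return parsed['content']

print(extract_page_text('sample.pdf', 3))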

Is it possible to automatically download the latest PDFs from a website?

For example, I want to download the latest WHO PDF on COVID-19. I'm really not sure how to do this.
If you type in 'who covid19 pdf' on Google, the PDF and its link will come up.
I noticed that the links branch off from the main WHO domain name - maybe this can help?
Does anyone know how I can go about this?
From Python's standard library, use the urllib package, specifically the urlretrieve function in urllib.request. A succinct example is recreated below from this reference.
import urllib.request

# Download the PDF at the given URL and save it locally as file.pdf
urllib.request.urlretrieve("http://randomsite.com/file.pdf", "file.pdf")
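If the listing page has to be scraped first, one possible sketch (with a placeholder URL, assuming the reports are exposed as ordinary <a href="...pdf"> links) collects the PDF links with the standard library and downloads the first one:
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class PdfLinkCollector(HTMLParser):
    # Collect every href ending in .pdf from the page
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith('.pdf'):
                    self.links.append(value)

listing_url = 'https://www.example.org/reports'   # placeholder listing page
html = urllib.request.urlopen(listing_url).read().decode('utf-8', 'ignore')
collector = PdfLinkCollector()
collector.feed(html)
if collector.links:
    # Assumes the first link on the page is the most recent report
    urllib.request.urlretrieve(urljoin(listing_url, collector.links[0]), 'latest.pdf')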

Get a result page from a research in google -> to PDF

Here is what I'm trying to do: through a Python script, I would like to get the first 5 pages of results of a Google search and save them as PDF files in a folder.
What do you suggest ?
(1) I start by parsing the HTML pages one by one and then find a tool to convert them into PDF?
(2) I find a way to directly do all the steps in one go through a module I don't know yet?
Thank you very much in advance for your insights!
Use the standard Python library to download the file(s). Then you can use http://www.xhtml2pdf.com/ to convert the pages to PDF.
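As a rough sketch of that combination (the result-page URL is a placeholder and xhtml2pdf is assumed to be installed):
import urllib.request
from xhtml2pdf import pisa

# Download one search-result page and convert its HTML to a PDF file
html = urllib.request.urlopen('https://example.com/results?page=1').read().decode('utf-8', 'ignore')
with open('results_page_1.pdf', 'wb') as out:
    status = pisa.CreatePDF(html, dest=out)
if status.err:
    print('conversion reported errors')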
Note: Most web pages use a lot of JavaScript to do all kinds of magic. So for many pages, only a full-blown web browser will get you nice/useful results. If you run into this problem, then there is no pure Python solution. Try phantomjs as explained here:
phantomjs rasterize.js 'http://en.wikipedia.org/w/index.php?title=Jakarta&printable=yes' jakarta.pdf
PS: I found these solutions by googling for "python convert html to pdf". You should try it once in a while.

Python Google Web Crawler

I am working on a project that needs to do a search on the internet (e.g. Stack Overflow) and retrieve all relevant results (URL, text, image paths) from the search into an XML file. I am building it with Python. Does anyone have any suggestions as to how I should approach this problem? I don't want to scan the entire web, just the top relevant results (Stack Overflow, 10/08/2013, Python, as an example).
For Stack Overflow you can use the API directly.
For example:
https://api.stackexchange.com/2.1/questions?fromdate=1381190400&todate=1381276800&order=desc&sort=activity&tagged=python&site=stackoverflow
See https://api.stackexchange.com/docs/questions#fromdate=2013-10-08&todate=2013-10-09&order=desc&sort=activity&tagged=python&filter=default&site=stackoverflow
Note that you can't make more than 30 requests a second; see http://api.stackexchange.com/docs/throttle
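Here is a minimal sketch of pulling questions from that endpoint and writing a few fields per question into an XML file (assuming the requests package; the date range and tag come from the example URL above):
import requests
import xml.etree.ElementTree as ET

resp = requests.get(
    'https://api.stackexchange.com/2.1/questions',
    params={
        'fromdate': 1381190400, 'todate': 1381276800,
        'order': 'desc', 'sort': 'activity',
        'tagged': 'python', 'site': 'stackoverflow',
    },
)
root = ET.Element('results')
for item in resp.json().get('items', []):
    question = ET.SubElement(root, 'question')
    ET.SubElement(question, 'title').text = item.get('title', '')
    ET.SubElement(question, 'link').text = item.get('link', '')
ET.ElementTree(root).write('results.xml', encoding='utf-8', xml_declaration=True)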
It sounds like you could use BeautifulSoup. Also check out this thread, it sounds like it's what you need: Creating an XML document with BeautifulSoup (Stack Overflow).
As for downloading and using BeautifulSoup, the site is here
It's pretty simple to use.
Hope this helps.
