Grabbing an article from a pdf file - Python

I have more than 5000 PDF files, each between 15 and 20 pages long. I used PyPDF2 to find out which of the 5000 files contain the keyword I am looking for, and on which page.
Now I have the following data:
I was wondering if there is a way for me to get the specific article on the specific page using this data. I know now which filenames to check and which page.
Thanks a lot.

There is a library called tika that can extract the text. Split your PDF so that only the page in question remains, then you can use:
from tika import parser
parsed_page = parser.from_file('sample.pdf')
print(parsed_page['content'])
NOTE: This library requires Java to be installed on the system.
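The splitting step mentioned in the answer could be sketched as follows. This is only a minimal sketch, assuming the pypdf package (the successor to PyPDF2) is installed and that the page numbers in your data are 1-based. The helper names single_page_name and extract_page, and the output naming scheme, are hypothetical.

```python
from pathlib import Path


def single_page_name(pdf_path, page_number):
    """Build an output name such as 'report_p7.pdf' (hypothetical naming scheme)."""
    path = Path(pdf_path)
    return str(path.with_name(f"{path.stem}_p{page_number}.pdf"))


def extract_page(pdf_path, page_number):
    """Write the given 1-based page of pdf_path to its own single-page PDF."""
    # Imported here so the naming helper works even without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    # pypdf indexes pages from 0, so shift the 1-based page number.
    writer.add_page(reader.pages[page_number - 1])
    out_path = single_page_name(pdf_path, page_number)
    with open(out_path, "wb") as handle:
        writer.write(handle)
    return out_path
```

The single-page file returned by extract_page can then be passed to parser.from_file in place of 'sample.pdf'.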

Related

Webpage text-only into PDF with Python PDFkit

I am trying to convert a large section of a website into a PDF. I can convert one page with pdfkit, but I do not want the images on the website, just the text. Is there a way to do this with pdfkit? I have been searching Google for the last half hour looking for a solution, but can only find information about getting pictures, not excluding them.
Thank you for your help!

This information can be found in the documentation for "wkhtmltopdf".
This tool has a --no-images option.
The PyPI page for pdfkit explains how to set options when using the Python package.
So this is what you are looking for:
import pdfkit
options = {
    'no-images': True
}
pdfkit.from_url('https://www.google.com/', 'out.pdf', options=options)

Split pdf in more than one page with pikepdf in python

I need to split a PDF file into groups of pages, with the group size specified by the user. For example, I have a PDF with 20 pages, and I want to split it into groups of 5 pages. The output would be 4 PDFs of 5 pages each.
I read the pikepdf documentation, and it only shows how to split a file into single pages, so I would have 20 single-page PDFs.
from pikepdf import Pdf
pdf = Pdf.open('../tests/resources/fourpages.pdf')
for n, page in enumerate(pdf.pages):
    dst = Pdf.new()
    dst.pages.append(page)
    dst.save(f'{n:02d}.pdf')
This is the code from the pikepdf documentation. I made it work, but as I said, the output is just single-page PDFs. I tried changing it a bit with a nested while loop, but it didn't work.
I think it's weird that it doesn't allow splitting into groups of more than one page. Maybe there is something obvious that I'm not seeing.
I thought about splitting it into single pages and then merging them again by the desired number of pages, but that doesn't seem optimal.
For now I'm not allowed to use any library other than pikepdf. Is there a way to achieve this?
Thanks in advance.
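One possible approach, using only pikepdf, is to slice the page list and copy each slice into a new document. The following is a minimal sketch, assuming that pdf.pages supports slicing and that dst.pages.extend accepts the resulting pages. The function names chunk_ranges and split_pdf are hypothetical, and the output naming simply follows the documentation snippet above.

```python
def chunk_ranges(total_pages, group_size):
    """Return (start, stop) index pairs covering total_pages in groups."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        (start, min(start + group_size, total_pages))
        for start in range(0, total_pages, group_size)
    ]


def split_pdf(src_path, group_size):
    """Write consecutive groups of group_size pages to numbered PDF files."""
    # Imported here so chunk_ranges can be used without pikepdf installed.
    from pikepdf import Pdf

    written = []
    with Pdf.open(src_path) as pdf:
        ranges = chunk_ranges(len(pdf.pages), group_size)
        for n, (start, stop) in enumerate(ranges):
            dst = Pdf.new()
            # Copy one group of pages into the new document.
            dst.pages.extend(pdf.pages[start:stop])
            out_name = f"{n:02d}.pdf"
            dst.save(out_name)
            written.append(out_name)
    return written
```

With a 20-page file and group_size=5, this would write 00.pdf through 03.pdf. If the page count is not a multiple of the group size, the last file just holds the remaining pages.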

ReportLab: edit a certain page after creating several pages

I want to edit a certain page while creating a PDF with ReportLab only. (There are some solutions with PyPDF2, but I want to use ReportLab only - if it is possible).
Description of what I am doing / trying to do:
I am creating a PDF file that gives the reader a good overview of certain data. I get the data from a server. I know the data structure, but the amount of data I get varies from PDF to PDF. That's why some PDFs will be 20 pages long, while others can be 50 pages or more.
After getting all the data and creating a beautiful PDF (this work is done by now), I want to go back to page 2 of this PDF and create my own, very individual table of contents.
But I can't find anywhere how to edit a certain page after creating several new pages.
What I've done so far to try to solve my problem:
I read the documentation
I checked stackoverflow
I checked git-repos
Help would be really appreciated. If it is not possible to edit a certain page with ReportLab after other pages have been added, I am thinking about using PyPDF2 instead. I have little to no experience with PyPDF2 so far, so if you have some good links you can send me, I'd be very thankful.

Python text extraction does not work on some pdfs

I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF.
My code looks like this :
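# Python 2 code (print statements, urllib2, StringIO module)
import PyPDF2
from StringIO import StringIO
from urllib2 import Request, urlopen
url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()
fileInput = StringIO(f)
pdf = PyPDF2.PdfFileReader(fileInput)
print pdf.getNumPages()
print pdf.getDocumentInfo()
print pdf.getPage(1).extractText()  # pages are zero-indexed, so this is the second page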
I am able to successfully extract text for the first link. But if I use the same program for the second PDF, I do not get any text. The page count and document info do show up.
I tried extracting text with pdfminer from the terminal and was able to extract text from the second PDF.
Any idea what is wrong with the PDF, or is this a limitation of the library I am using?

If you read the comments in the pyPDF documentation you'll see that it's written right there that this functionality will not work well for some PDF files; in other words, you're looking at a restriction of the library.
Looking at the two PDF files, I can't see anything wrong with the files themselves. But...
The first file contains fully embedded fonts
The second file contains subsetted fonts
This means that the second file is more difficult to extract text from and the library probably doesn't support that properly. Just for reference I did a text extraction with callas pdfToolbox (caution, I'm affiliated with this tool) which uses the Acrobat text extraction and the text is properly extracted for both files (confirming that it's not the PDF files that are the problem).

Get a result page from a research in google -> to PDF

Here is what I'm trying to do: through a Python script, I would like to get the first 5 pages of results of a Google search and save them as PDF files in a folder.
What do you suggest?
(1) Should I start by parsing the HTML pages one by one and then find a tool to convert them into PDF?
(2) Or is there a way to do all the steps directly in one go, through a module I don't know yet?
Thank you very much in advance for your insights!

Use the standard Python library to download the file(s). Then you can use http://www.xhtml2pdf.com/ to convert the pages to PDF.
Note: Most web pages use a lot of JavaScript to do all kinds of magic. So for many pages, only a full-blown web browser will get you nice/useful results. If you run into this problem, then there is no pure Python solution. Try phantomjs, for example:
phantomjs rasterize.js 'http://en.wikipedia.org/w/index.php?title=Jakarta&printable=yes' jakarta.pdf
PS: I found these solutions by googling for "python convert html to pdf". You should try it once in a while.
