Python - PyPDF2 misses large chunk of text. Any alternative on Windows?

I tried to parse a PDF file with PyPDF2, but I only retrieve about 10% of the text. For the remaining 90%, PyPDF2 brings back only newlines... a bit frustrating.
Do you know of any alternatives for Python on Windows? I've heard of pdftotext, but it seems I can't install it because my computer does not run Linux.
Any ideas?
import PyPDF2
filename = 'Doc.pdf'
pdf_file = PyPDF2.PdfFileReader(open(filename, 'rb'))
print(pdf_file.getPage(0).extractText())

Try PyMuPDF. The following example simply prints out the text it finds. The library also allows you to get the position of the text if that would help you.
#!python3.6
import json
import fitz # http://pymupdf.readthedocs.io/en/latest/
pdf = fitz.open('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')
for page_index in range(pdf.pageCount):
    text = json.loads(pdf.getPageText(page_index, output='json'))
    for block in text['blocks']:
        if 'lines' not in block:
            # Skip blocks without text
            continue
        for line in block['lines']:
            for span in line['spans']:
                print(span['text'].encode('utf-8'))
pdf.close()

Related

PyPDF2 and PyPDF4 fail to extract text from the PDF

import PyPDF4 as p2
pdffile = open("XXXX.pdf","rb")
pdfread=p2.PdfFileReader(pdffile)
print(pdfread.getNumPages())
pageinfo=pdfread.getPage(0)
print(pageinfo.extractText())
When I run the above, the getNumPages() call successfully returns the correct value, i.e. the number of pages in the PDF; however, the extractText() call returns a page's worth of blank data. I've tried both PyPDF2 and PyPDF4 and ran the code in both the Python terminal and Sublime Text, and in both cases I received a blank page instead of the actual text.
Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations.
pypdf has received a lot of updates in December 2022. Especially the text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())

How can I remove every other page of a pdf with python?

I downloaded a pdf where every other page is blank and I'd like to remove the blank pages. I could do this manually in a pdf tool (Adobe Acrobat, Preview.app, PDFPen, etc.) but since it's several hundred pages I'd like to do something more automated. Is there a way to do this in python?
One way is to use PyPDF4, so in your terminal first do
pip install pypdf4
Then create a .py script file similar to this:
# pdf_strip_every_other_page.py
from PyPDF4 import PdfFileReader, PdfFileWriter
number_of_pages = 500
output_writer = PdfFileWriter()
with open("/path/to/original.pdf", "rb") as inputfile:
    pdfOne = PdfFileReader(inputfile)
    for i in range(0, number_of_pages):
        if i % 2 == 0:
            page = pdfOne.getPage(i)
            output_writer.addPage(page)
    with open("/path/to/output.pdf", "wb") as outfile:
        output_writer.write(outfile)
Note: you'll need to change the paths to what's appropriate for your scenario.
Obviously this script is rather crude and could be improved, but I wanted to share it for anyone else wanting a quick way to deal with this scenario.

is there a method to read pdf files line by line?

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read, left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast but does not give gaps in the elements. Results are jumbled.
Pdfminer - PDFPageInterpreter() method with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information') # split up the document into pages by string that always appears on the
# top of each page
for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page worth of data from the pdf
    reformated = info.replace("\n", "&")  # I replace the new lines with "&" to make it more readable
    print("Page: ", i)  # print page number
    print(reformated, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
I did this for .docx with this code, where txt is the .docx text. Hope this helps (link).
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)

how to get a content from a pdf file and store it in a txt file

import pyPdf
f= open('jayabal_appt.pdf','rb')
pdfl = pyPdf.PdfFileReader(f)
content=""
for i in range(0, 1):
    content += pdfl.getPage(i).extractText() + "\n"
outpu = open('b.txt','wb')
outpu.write(content)
f.close()
outpu.close()
This is not getting the content from a PDF file and storing it in a txt file... What is the mistake in this code?
A simple example from the author suggests doing this (you don't seem to be using file):
from pyPdf import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(file("jayabal_appt.pdf", "rb"))
Then you can do the following:
output.addPage(input1.getPage(0))
And sure, use a for loop for it, but the author doesn't suggest using extractText.
Just check out the website; the example is rather straightforward: https://pypi.org/project/pypdf/
However,
pyPdf is no longer maintained, so I don't recommend using it. The author suggests checking out PyPDF2 instead.
A simple Google search also suggests that you should try pdftotext or pdfminer. There are plenty of examples out there.
Good luck.

Extracting titles from PDF files?

I want to write a script to rename downloaded papers with their titles automatically. I'm wondering if there is any library or trick I can make use of? The PDFs are all generated by TeX and should have some 'formal' structure.
You could try to use pyPdf and this example.
for example:
from pyPdf import PdfFileWriter, PdfFileReader
def get_pdf_title(pdf_file_path):
    with open(pdf_file_path, 'rb') as f:  # open in binary mode
        pdf_reader = PdfFileReader(f)
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('/home/user/Desktop/my.pdf')
Assuming all these papers are from arXiv, you could instead extract the arXiv id (I'd guess that searching for "arXiv:" in the PDF's text would consistently reveal the id as the first hit).
Once you have the arXiv reference number (and have done a pip install arxiv), you can get the title using
import arxiv

paper_ref = '1501.00730'
arxiv.query(id_list=[paper_ref])[0].title
I would probably start with perl (seeing as it's always the first thing I reach for). There are several modules for handling PDFs. If you have a consistent structure, you could use regex to snag the titles.
You can try using iText with Jython
