PyPDF2 and PyPDF4 fail to extract text from the PDF - python

import PyPDF4 as p2

pdffile = open("XXXX.pdf", "rb")
pdfread = p2.PdfFileReader(pdffile)
print(pdfread.getNumPages())   # correctly prints the number of pages
pageinfo = pdfread.getPage(0)
print(pageinfo.extractText())  # prints blank data instead of the text
While running the above, the 4th line of code successfully returns the correct value, i.e. the number of pages in the PDF; however, the 6th line (text extraction) returns one page's worth of blank data. I've tried both PyPDF2 and PyPDF4, and ran the code in both the Python terminal and Sublime Text; in both cases I received a blank page instead of the actual text.

Use pypdf instead of PyPDF2/PyPDF3/PyPDF4; you will need to apply the migration steps. pypdf received a lot of updates in December 2022, especially to text extraction.
To give you a minimal, full example for text extraction:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
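For reference, the main renames in the PyPDF2-to-pypdf migration look roughly like this (a sketch based on the published migration guide; double-check the pypdf docs for your exact version):

```python
# Old PyPDF2 1.x name on the left, modern pypdf equivalent on the right.
# This mapping is illustrative only; consult the pypdf migration guide.
OLD_TO_NEW = {
    "PdfFileReader": "PdfReader",
    "PdfFileWriter": "PdfWriter",
    "reader.getNumPages()": "len(reader.pages)",
    "reader.getPage(i)": "reader.pages[i]",
    "page.extractText()": "page.extract_text()",
    "writer.addPage(page)": "writer.add_page(page)",
}

for old, new in OLD_TO_NEW.items():
    print(f"{old:25} -> {new}")
```

Applying these renames to the question's snippet turns `pdfread.getPage(0).extractText()` into `reader.pages[0].extract_text()`.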

Related

How can I split a PDF file by its pages into PDF files of a smaller size with PyPDF2?

I wrote simple Python code that takes a PDF, goes over its pages using PyPDF2, and saves each page as a new PDF file.
See the page-save function here:
from PyPDF2 import PdfReader, PdfWriter

def save_pdf_page(file_name, page_index):
    reader = PdfReader(file_name)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])
    writer.remove_links()
    with open(f"output_page{page_index}.pdf", "wb") as fh:
        writer.write(fh)
Surprisingly, each page is about the same size as the original PDF file.
Using remove_links (taken from here) didn't reduce the page size.
I found a similar question here, saying this may be caused by PyPDF output files being uncompressed.
Is there a way, using PyPDF or any other Python lib, to make each page relatively small, as expected?
You are running into this issue: https://github.com/py-pdf/PyPDF2/issues/449
Essentially, there are two problems:
Every page might need a resource that is shared, e.g. font information.
PyPDF2 might not realize that some pages don't need it.
Removing links might help. Additionally, you might want to follow the docs to reduce file size:
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
writer = PdfWriter()

for page_num in [2, 3]:
    page = reader.pages[page_num]
    # This is CPU intensive! It ZIPs the contents of the page
    page.compress_content_streams()
    writer.add_page(page)

with open("separate.pdf", "wb") as fh:
    writer.remove_links()
    writer.write(fh)
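To see whether the compression actually paid off, compare the input and output sizes with `os.path.getsize` (which returns bytes). A tiny, hypothetical formatting helper:

```python
def human_size(num_bytes):
    """Format a byte count for quick before/after size comparisons."""
    n = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"

# e.g. compare human_size(os.path.getsize("test.pdf")) against
# human_size(os.path.getsize("separate.pdf")) after writing.
print(human_size(2048))  # 2.0 KB
```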

is there a method to read pdf files line by line?

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read, left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast, but it does not preserve gaps between elements; results are jumbled.
Pdfminer - the PDFPageInterpreter() method with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser

# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print that path

file_data = parser.from_file(file_path)  # parse data from the file
text = file_data['content']              # get the file's text content
# split the document into pages by a string that always appears at the top of each page
by_page = text.split('... Information')

for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page's worth of data from the pdf
    reformated = info.replace("\n", "&")  # replace the newlines with "&" to make it more readable
    print("Page: ", i)  # print the page number
    print(reformated, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the PDF to be read left to right. Also, a pure-Python solution would be a bonus: I don't want my end users to be forced to install Java (I think the tika and tabula-py methods depend on Java).
I did this for .docx with the following code, where txt is the .docx text. Hope this helps (link):
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
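For the original reading-order problem, one pure-Python approach: if your extractor can report word coordinates (PyMuPDF and pdfminer both can), you can re-sort the spans top-to-bottom and then left-to-right yourself. A minimal sketch, using hypothetical (x, y, text) tuples as input:

```python
def reading_order(spans, line_tol=5):
    """Sort (x, y, text) spans into reading order: top-to-bottom, then
    left-to-right. Spans whose y values differ by at most line_tol are
    treated as belonging to the same visual line."""
    ordered = sorted(spans, key=lambda s: (s[1], s[0]))
    lines, current, last_y = [], [], None
    for x, y, text in ordered:
        if last_y is not None and abs(y - last_y) > line_tol:
            # y jumped: the previous line is complete, emit it left-to-right
            current.sort(key=lambda s: s[0])
            lines.append(" ".join(t for _, t in current))
            current = []
        current.append((x, text))
        last_y = y
    if current:
        current.sort(key=lambda s: s[0])
        lines.append(" ".join(t for _, t in current))
    return lines

print(reading_order([(10, 0, "Left"), (200, 2, "Right"), (10, 40, "Next line")]))
# → ['Left Right', 'Next line']
```

The line_tol threshold is a guess you would tune per document; multi-column layouts need an extra pass that first splits spans by column x-ranges.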

Split specific pages of PDF and save it with Python

I am trying to split a (single) 20-page PDF file into five separate PDF files: the 1st PDF contains pages 1-3, the 2nd contains only page 4, the 3rd contains pages 5 to 10, the 4th contains pages 11-17, and the 5th contains pages 18-20. I need working code in Python. The code below splits the entire PDF file into single pages, but I want grouped pages.
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("input.pdf", "rb"))

for i in range(inputpdf.numPages):
    j = i + 1
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("page%s.pdf" % j, "wb") as outputStream:
        output.write(outputStream)
For me it looks like a task for pdfrw. Using this example from GitHub, I wrote the following example code:
from pdfrw import PdfReader, PdfWriter

pages = PdfReader('inputfile.pdf').pages
parts = [(3, 6), (7, 10)]

for part in parts:
    outdata = PdfWriter(f'pages_{part[0]}_{part[1]}.pdf')
    for pagenum in range(*part):
        outdata.addpage(pages[pagenum - 1])
    outdata.write()
This creates two files, pages_3_6.pdf and pages_7_10.pdf, each with 3 pages, i.e. pages 3, 4, 5 and 7, 8, 9. Note the pagenum-1 in the code: that -1 is needed because PDF page numbering starts at 1 rather than 0. I also used so-called f-strings to build the names of the output files. In my opinion it is a slick method, but it is not available in Python 2 and I am not sure if it is available in all Python 3 versions (I tested my code on 3.6.7), so you might use the old formatting method instead if you wish.
Remember to alter filenames and ranges accordingly to your needs.
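Since the off-by-one is what usually trips people up, here is the 1-based-to-0-based mapping for the exact groups in the question, as a small pure-Python sketch (the helper name is mine, not from any library):

```python
def page_indices(first, last):
    """Map a 1-based inclusive page range to the 0-based indices
    expected by getPage() / reader.pages[]."""
    return list(range(first - 1, last))

# The five groups from the question: 1-3, 4, 5-10, 11-17, 18-20.
groups = [(1, 3), (4, 4), (5, 10), (11, 17), (18, 20)]
for first, last in groups:
    print(f"pages {first}-{last} -> indices {page_indices(first, last)}")
```

Each resulting index list can then be fed to a writer loop such as the PyPDF2 one in the question, one output file per group.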
If you have Python 3, you can use tika, according to the following answer here:
How to extract text from a PDF file?
How to extract specific pages (or split specific pages) from a PDF file and save those pages as a separate PDF using Python.
# pip install PyPDF2  # to install the module/package
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')

pdf = PdfFileReader(pdf_file_path)
pages = [0, 2, 4]  # pages 1, 3, 5
pdfWriter = PdfFileWriter()

for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))

# the with statement closes the file for us; no explicit f.close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
CREDIT :
How to extract PDF pages and save as a separate PDF file using Python

Python - PyPDF2 misses large chunk of text. Any alternative on Windows?

I tried to parse a PDF file with PyPDF2, but I only retrieve about 10% of the text. For the remaining 90%, PyPDF2 brings back only newlines... a bit frustrating.
Would you know of any alternatives in Python running on Windows? I've heard of pdftotext, but it seems that I can't install it because my computer does not run Linux.
Any idea?
import PyPDF2
filename = 'Doc.pdf'
pdf_file = PyPDF2.PdfFileReader(open(filename, 'rb'))
print(pdf_file.getPage(0).extractText())
Try PyMuPDF. The following example simply prints out the text it finds. The library also allows you to get the position of the text if that would help you.
#!python3.6
import json
import fitz  # http://pymupdf.readthedocs.io/en/latest/

pdf = fitz.open('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')

for page_index in range(pdf.pageCount):
    text = json.loads(pdf.getPageText(page_index, output='json'))
    for block in text['blocks']:
        if 'lines' not in block:
            # Skip blocks without text
            continue
        for line in block['lines']:
            for span in line['spans']:
                print(span['text'].encode('utf-8'))

pdf.close()

Copying file contents to clipboard and pasting into a plain text file automatically in python

All I am trying to accomplish with this little script I wrote is to parse data from a PDF file.
However, I seem to have run into an issue with Python; more specifically, the PyPDF2 module is not able to read the text from a PDF file. The data printed out is all fuzzy and basically not readable. However, when I open up the PDF file that I am trying to read, I can simply click, drag, and Ctrl+C to copy the contents, and when I then paste them into a plain txt document, it works flawlessly. The data is readable when I go through this process of copying and pasting manually.
So what I'm trying to do is mimic that exact step, but automate it instead of having me go through all the pages within the PDF file performing the above steps.
Or, if there are any suggestions as to what else I can do to achieve this, I would greatly appreciate it. I have tried converting the PDF file into .docx and plain text files, but the contents of the file had their formatting completely rearranged.
import PyPDF2

pdfFileObj = open('sjsuclassdata.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(4)
print(pageObj.extractText())
EDIT
Essentially, what I'm trying to do now is simply write a script that would perform the following actions:
1.) Read the PDF file
2.) Copy the contents of a whole page (Ctrl+A, Ctrl+C)
3.) Paste the contents of the whole page into a plain text file (Ctrl+V)
4.) Repeat until the end of the PDF file
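The per-page copy-and-paste loop above can be mimicked with whatever extractor works on your file (PyPDF2's extractText(), or an alternative library); the glue itself is library-agnostic. A sketch, with the extractor left as a plain iterable of page strings:

```python
def pages_to_text(pages, out_path=None):
    """Join extracted page texts with form feeds (the plain-text analogue
    of a page break); optionally write the result to a file."""
    text = "\f".join(pages)
    if out_path is not None:
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(text)
    return text

# e.g. (hypothetical, assuming a PyPDF2 reader object named `reader`):
# pages_to_text((reader.getPage(i).extractText() for i in range(reader.numPages)),
#               "output.txt")
print(repr(pages_to_text(["page one", "page two"])))
```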
I would give slate a try:
import slate

output_prefix = 'foobar'
file_ext = 'txt'

with open('example.pdf', 'rb') as f:
    doc = slate.PDF(f)

for page_number, page in enumerate(doc):
    with open('%s_%s.%s' % (output_prefix, page_number, file_ext), 'w+') as out:
        out.write(page)
