PDFPage does not exist in Python PDFMiner library - python

So I pip-installed pdfminer3k for Python 3.6. I was trying to follow some examples for opening PDF files and converting them to text, and they all require a PDFPage import, which does not exist for me. Is there a workaround for this? I tried copying a PDFPage.py from online and saving it to the directory where Python searches for pdfminer, but I just got "ImportError: cannot import name PDFObjectNotFound".
Thanks!

Ah. I guess PDFPage is not available in the pdfminer3k package for Python 3.6. Following the example from "How to read pdf file using pdfminer3k?" solved my issue!

import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
def extract_text_from_pdf(pdf_path):
    '''
    Iterator: extract the plain text from pdf-files with pdfminer3k
    pdf_path: path to pdf-file to be extracted
    return: iterator of string of extracted text (by page)
    '''
    # pdfminer.six-version can be found at:
    # https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')
        for page in doc.get_pages(): # pdfminer.six: PDFPage.get_pages(fh, caching=True, check_extractable=True):
            rsrcmgr = PDFResourceManager()
            fake_file_handle = io.StringIO()
            device = TextConverter(rsrcmgr, fake_file_handle, laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            yield text
            # close open handles
            device.close()
            fake_file_handle.close()
maxPages = 1
for i, t in enumerate(extract_text_from_pdf(fPath)):
    if i<maxPages:
        print(f"Page {i}:\n{t}")
    else:
        print(f"Page {i} skipped!")
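Before copying an example, it can help to check which API the installed package exposes. The following is a minimal sketch and not part of the original answer; the helper name has_pdfpage_module is hypothetical. It assumes that pdfminer.six-style releases ship a pdfminer.pdfpage module while pdfminer3k does not, which is consistent with the import problem described above.

```python
import importlib.util


def has_pdfpage_module():
    """Return True if the installed pdfminer package provides pdfminer.pdfpage."""
    try:
        return importlib.util.find_spec("pdfminer.pdfpage") is not None
    except ModuleNotFoundError:
        # Raised when no package named "pdfminer" is installed at all.
        return False


if has_pdfpage_module():
    print("pdfminer.pdfpage found: PDFPage.get_pages() examples should apply")
else:
    print("pdfminer.pdfpage missing: use doc.get_pages() or install pdfminer.six")
```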

Related

How to extract only specific text from PDF file using python

How to extract some of the specific text only from PDF files using python and store the output data into particular columns of Excel.
Here is the sample input PDF file (File.pdf)
Link to the full PDF file File.pdf
We need to extract the value of Invoice Number, Due Date and Total Due from the whole PDF file.
Script i have used so far:
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())
But not getting the specific output value from the PDF file .
If you want to find the data your way (with pdfminer), you can search for a pattern to extract the data, as in the following (the only new part is the regex at the end, based on your given data):
from io import StringIO
import re
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open('testfile.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
finding = re.search(r"INV-\d+\n\d+\n.+\n.+\n\$\d+\.\d+", output_string.getvalue())
invoice_no, order_no, _, due_date, total_due = finding.group(0).split("\n")
print(invoice_no, order_no, due_date, total_due)
If you want to store the data in excel, you may have to be more specific (or open a new question) or look on these pages:
Writing to an Excel spreadsheet
https://www.geeksforgeeks.org/writing-excel-sheet-using-python/
https://xlsxwriter.readthedocs.io/
PS: the other answer looks like a good solution, you only have to filter the data
EDIT:
Second solution. Here I use another package, PyPDF2, because it returns the data in a different order (maybe this is possible with PDFMiner, too). If the text before the values is always the same, you can find the data like this:
import re
import PyPDF2
def parse_pdf() -> list:
    with open("testfile.pdf", "rb") as file:
        fr = PyPDF2.PdfFileReader(file)
        data = fr.getPage(0).extractText()
    regex_invoice_no = re.compile(r"Invoice Number\s*(INV-\d+)")
    regex_order_no = re.compile(r"Order Number(\d+)")
    regex_invoice_date = re.compile(r"Invoice Date(\S+ \d{1,2}, \d{4})")
    regex_due_date = re.compile(r"Due Date(\S+ \d{1,2}, \d{4})")
    regex_total_due = re.compile(r"Total Due(\$\d+\.\d{1,2})")
    invoice_no = re.search(regex_invoice_no, data).group(1)
    order_no = re.search(regex_order_no, data).group(1)
    invoice_date = re.search(regex_invoice_date, data).group(1)
    due_date = re.search(regex_due_date, data).group(1)
    total_due = re.search(regex_total_due, data).group(1)
    return [invoice_no, due_date, total_due]
if __name__ == '__main__':
    print(parse_pdf())
You may have to change the regexes, because they are based only on the given example. Note that re.search returns None when a pattern does not match, so calling .group(1) on the result raises an AttributeError; you have to guard each regex, for example with try: except per regex ;)
If this does not answer your question, you have to provide more information/example pdfs.
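As a rough illustration of guarding each regex, here is a minimal sketch that returns None for a field that is not found instead of raising. The helper name extract_fields and the sample string are made up for this example, and the optional \s* after each label is an added assumption; the real text extracted from your PDF may need different patterns.

```python
import re

# Patterns follow the answer above; \s* after each label is an added assumption.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice Number\s*(INV-\d+)"),
    "due_date": re.compile(r"Due Date\s*(\S+ \d{1,2}, \d{4})"),
    "total_due": re.compile(r"Total Due\s*(\$\d+\.\d{1,2})"),
}


def extract_fields(text):
    """Return a dict mapping field name to the matched value, or None if absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # pattern.search returns None on no match, so check before .group(1).
        result[name] = match.group(1) if match else None
    return result


# Made-up sample text in which "Total Due" is missing.
sample = "Invoice NumberINV-3337Order Number12345Due DateJanuary 31, 2016"
print(extract_fields(sample))
```

Checking the match object directly avoids one try/except block per field while keeping the same per-field behavior.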
You can extract the data using tabula, and then use that data to create an Excel file with Python:
import tabula
pdf_path = "./Downloads/folder/myfile.pdf"
output = "./Downloads/folder/test.csv"
tabula.convert_into(pdf_path, output, output_format="csv", stream=True)
Excel file creation:
https://www.geeksforgeeks.org/python-create-and-write-on-excel-file-using-xlsxwriter-module/
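If a CSV file that Excel can open is good enough, the extracted values can be written with the standard library alone. This is only a sketch under that assumption: the function name write_invoice_rows, the column names, and the row values are hypothetical placeholders for whatever your extraction step produces.

```python
import csv
import io


def write_invoice_rows(rows, out_file):
    """Write dict rows to out_file as CSV, one column per field."""
    fieldnames = ["invoice_no", "due_date", "total_due"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical values standing in for the regex results.
rows = [
    {"invoice_no": "INV-0001", "due_date": "January 31, 2016", "total_due": "$93.50"},
]

# An in-memory buffer keeps the example self-contained; to write a real file,
# use open("invoices.csv", "w", newline="") instead.
buffer = io.StringIO()
write_invoice_rows(rows, buffer)
print(buffer.getvalue())
```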

Can't make my script print output in the desired format

I'm trying to extract a certain portion of text from a PDF file. I've used the PyPDF2 library to do that. However, when I execute the script below, the content I wish to grab is printed in the console awkwardly.
I've written so far:
import io
import PyPDF2
import requests
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(0).extractText()
print(contents)
Output I'm having:
ACCESSHEALTHCTConnecticutAllPayersClaimsDatabaseDATASUBMISSIONGUIDE
December5,2013
Version1.2(withclarifications)
Output I wish to grab like:
ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)
This is an issue with PyPDF2; the reason is that PyPDF2 doesn't read the newline character. Alternatively, you can use pdftotext.
It is simple and clean: you can loop over the pages or extract a single page.
import io
import requests
import pdftotext
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
f = io.BytesIO(res.content)
pdf = pdftotext.PDF(f)
print(pdf[0])
# Iterate over all the pages
# for page in pdf:
#     print(page)
I would suggest PDFMiner if installing other packages causes a dependency issue.
You can install it for Python 3.7 with pip install pdfminer.six; I've already tested it and it works on my Python 3.7.
The code for getting page 0 is as follows:
import io
import requests
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
fp = io.BytesIO(res.content)
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
page_no = 0
for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
    if pageNumber == page_no:
        interpreter.process_page(page)
        data = retstr.getvalue()
print(data.strip())
Outputs
ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)
The good thing about PDFMiner is that it reads your pages directly and it focuses entirely on getting and analyzing text data.

Iterate through .PDFs and convert them to .txt using PDFMiner

I'm trying to merge two different things I've been able to accomplish independently. Unfortunately the PDFMiner docs are just not useful at all.
I have a folder that has hundreds of PDFs in it, named "[0-9].pdf", in no particular order, and I don't care to sort them. I just need a way to go through them and convert them to text.
Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from one PDF successfully.
Some of this post: batch process text to csv using python - was useful in determining how to open a folder full of PDFs and work with them.
Now, I just don't know how I can combine them to one-by-one open a PDF, convert it to a text object, save that to a text file with the same original-filename.txt, and then move onto the next PDF in the directory.
Here's my code:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob
directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))
resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)
password = ""
maxPages = 0
caching = True
pageNums=set()
for one_pdf in pdfFiles:
    print("Processing file: " + str(one_pdf))
    fp = file(one_pdf, 'rb')
    for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = returnString.getvalue()
    filenameString = str(one_pdf) + ".txt"
    text_file = open(filenameString, "w")
    text_file.write(text)
    text_file.close()
    fp.close()
device.close()
returnString.close()
I get no compilation errors, but my code doesn't do anything.
Thanks for your help!
Just answering my own question with the solution idea from @LaurentLAPORTE that worked.
Set directory to an absolute path using os like this: os.path.abspath("../Documents/003/"). And then it'll work.
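For Python 3, the whole loop could be sketched roughly as follows. This is not the asker's code: the names convert_folder and pdfminer_extract are hypothetical, and it assumes pdfminer.six is installed for the real extraction. Each file is extracted independently, which also avoids the question's shared StringIO accumulating text from earlier files. The demo at the end uses a stand-in extractor so it runs without real PDFs.

```python
import tempfile
from pathlib import Path


def pdfminer_extract(pdf_path):
    """Extract text with pdfminer.six (assumed to be installed)."""
    # Imported lazily so the rest of the sketch runs without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_folder(directory, extract=pdfminer_extract):
    """Write <name>.txt next to every <name>.pdf in directory; return the paths written."""
    folder = Path(directory).resolve()  # absolute path, as in the fix above
    written = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written


# Demo with a stand-in extractor and an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1.pdf").write_bytes(b"")
    results = convert_folder(tmp, extract=lambda p: "text of " + p.name)
    written_names = [p.name for p in results]
print(written_names)
```

For real use, call convert_folder("../Documents/003/") and let it fall back to the default extractor.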

How to get the same result as copy and pasting pdf to text using python?

When I copy and paste a pdf document into a text file using ctrl+a, ctrl+c, ctrl+v I get a result like
this:
but when I use pdfminer with the code below i get this:
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
*....*
def scrub(self):
    text = self.convert(self.inFile)
    with open(self.WBOutputFile, "w") as WBOut:
        WBOut.write(text)
#code from Tim Arnold at https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167
def convert(self, fname):
    pagenums = set()
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text
*....*
The code takes several seconds longer than doing it manually but I want to automate this pdf to text process because I have a lot of documents. Is there a way to get similar result (in terms of speed and formatting) similarly to copy and paste? I am using chrome as my pdf viewer, sublime text as my text editor, and windows 8 as my OS.
I am using the PDF from http://www.supremecourt.gov/oral_arguments/argument_transcripts/14-8349_n648.pdf
Try setting char_margin in the LAParams to 50,
i.e.
laparams=LAParams()
laparams.char_margin = float(50)
converter = TextConverter(manager, output, laparams=laparams)
interpreter = PDFPageInterpreter(manager, converter)
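With pdfminer.six, the same layout setting can probably be passed through the high-level API instead of wiring up the converter by hand. This is a sketch assuming pdfminer.six is installed; the function name extract_with_char_margin is hypothetical, and 50 is simply the value suggested above, not a tuned default.

```python
def extract_with_char_margin(pdf_path, char_margin=50.0):
    """Extract text from pdf_path using a wider char_margin for layout analysis."""
    # Imported lazily so this definition loads even where pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # A larger char_margin lets characters that are further apart
    # still be grouped into the same line.
    laparams = LAParams(char_margin=char_margin)
    return extract_text(pdf_path, laparams=laparams)
```

Usage would look like text = extract_with_char_margin("transcript.pdf"), where the file name is a placeholder.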

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.
It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this.
As it is, I'm just looking at source-code to see if I can figure it out.
Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
PDFMiner's structure changed recently, so this should work for extracting text from the PDF files.
Edit: Still working as of June 7, 2018. Verified in Python 3.x.
Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
This works in May 2020 using PDFminer six in Python3.
Installing the package
$ pip install pdfminer.six
Importing the package
from pdfminer.high_level import extract_text
Using a PDF saved on disk
text = extract_text('report.pdf')
Or alternatively:
with open('report.pdf','rb') as f:
    text = extract_text(f)
Using PDF already in memory
If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:
import io
import requests
response = requests.get(url)  # url: address of the PDF to download
text = extract_text(io.BytesIO(response.content))
Performance and Reliability compared with PyPDF2
PDFminer.six works more reliably than PyPDF2 (which fails with certain types of PDFs), in particular with PDF version 1.7.
However, text extraction with PDFminer.six is significantly slower than with PyPDF2, by a factor of about 6.
I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10 page PDF and got the following results:
PDFminer.six: 2.88 sec
PyPDF2: 0.45 sec
pdfminer.six also has a huge footprint: it requires pycryptodome, which needs GCC and other things installed, pushing a minimal-install Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.
Update (2022-08-04): According to Martin Thoma, PyPDF2 has improved a lot in the past 2 years, so do give it a try as well. Here's his benchmark
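A comparison like the one above could be reproduced roughly as follows. This is a sketch, not the original benchmark script: the helper name time_extraction is hypothetical, and fake_extract is a stand-in workload that you would replace with a call into the library being measured.

```python
import timeit


def time_extraction(extract, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` single calls of extract()."""
    return min(timeit.repeat(extract, number=1, repeat=repeats))


def fake_extract():
    # Stand-in workload; replace with e.g. lambda: extract_text("report.pdf").
    return "".join(str(i) for i in range(10_000))


best = time_extraction(fake_extract)
print(f"best of 3: {best:.4f} s")
```

Taking the minimum over several runs reduces noise from other processes, which matters when the measured difference is only a few seconds.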
Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
Full disclosure, I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for python 3.
Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.
(All the examples assume your PDF file is called example.pdf)
Commandline
If you want to extract text just once you can use the commandline tool pdf2txt.py:
$ pdf2txt.py example.pdf
High-level api
If you want to extract text (properties) with Python, you can use the high-level api. This approach is the go-to solution if you want to programmatically extract information from a PDF.
from pdfminer.high_level import extract_text, extract_pages
# Extract text from a pdf.
text = extract_text('example.pdf')
# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
Composable api
There is also a composable api that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend this when you need to customize some component.
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())
Similar question and answer here. I'll try to keep them in sync.
this code is tested with pdfminer for python 3 (pdfminer-20191125)
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal
def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines
I realize that this is an old question.
For anyone trying to use pdfminer, you should switch to pdfminer.six which is the currently maintained version.
