Extract text from PDF

Extract text from PDF - python

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?
Thanks.

A PDF does not contain tabular data unless it contains structured content. Some tools include heuristics that try to guess the data structure and put it back. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/

$ pdftotext -layout thingwithtablesinit.pdf
will produce a text file thingwithtablesinit.txt with the tables right.
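To drive this from Python, as the question asks, one option is to call pdftotext through subprocess. The sketch below is a minimal example, assuming the pdftotext binary is on the PATH; the helper names build_pdftotext_cmd and pdf_to_text are made up for illustration. The -x, -y, -W and -H options are the crop-area options documented for Poppler's pdftotext (an assumption about your build), and they are one way of "specifying positions".

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, crop=None):
    """Return the argument list for a `pdftotext -layout` call.

    crop is an optional (x, y, width, height) tuple describing the region
    to extract; it is passed through the -x/-y/-W/-H options.
    """
    cmd = ["pdftotext", "-layout"]
    if crop is not None:
        x, y, width, height = crop
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path, crop=None):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, crop), check=True)
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.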

I had a similar problem and ended up using XPDF from http://www.foolabs.com/xpdf/
One of its utilities is pdftotext, but I guess it all comes down to how the PDF was produced.

As explained in other answers, extracting text from a PDF is not a straightforward task. However, there are Python libraries such as pdfminer (pdfminer3k for Python 3) that are reasonably efficient.
The code snippet below shows a Python class that can be instantiated to extract text from a PDF. This will work in most cases.
(source - https://gist.github.com/vinovator/a46341c77273760aa2bb)
# Python 2.7.6
# PdfAdapter.py
""" Reusable library to extract text from pdf file
Uses pdfminer library; For Python 3.x use pdfminer3k module
Below links have useful information on components of the program
https://euske.github.io/pdfminer/programming.html
http://denis.papathanasiou.org/posts/2010.08.04.post.html
"""
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# from pdfminer.pdfdevice import PDFDevice
# To raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
import logging
__doc__ = "Reusable library to extract text from pdf file"
__name__ = "pdfAdapter"
""" Basic logging config
"""
log = logging.getLogger(__name__)
log.addHandler(logging.NullHandler())
class pdf_text_extractor:
    """ Modules overview:
     - PDFParser: fetches data from pdf file
     - PDFDocument: stores data parsed by PDFParser
     - PDFPageInterpreter: processes page contents from PDFDocument
     - PDFDevice: translates processed information from PDFPageInterpreter
       to whatever you need
     - PDFResourceManager: Stores shared resources such as fonts or images
       used by both PDFPageInterpreter and PDFDevice
     - LAParams: Parameters for the layout analyzer, which returns an
       LTPage object for each page in the PDF document
     - PDFPageAggregator: Extends the device to a page aggregator to get
       LT object elements
    """

    def __init__(self, pdf_file_path, password=""):
        """ Class initialization block.
        pdf_file_path - Full path of pdf including name
        password - If not passed, assumed to be empty
        """
        self.pdf_file_path = pdf_file_path
        self.password = password

    def getText(self):
        """ Algorithm:
        1) Transfer information from PDF file to PDF document object using parser
        2) Open the PDF file
        3) Parse the file using PDFParser object
        4) Assign the parsed content to PDFDocument object
        5) Now the information in this PDFDocument object has to be processed.
        For this we need PDFPageInterpreter, PDFDevice and PDFResourceManager
        6) Finally process the file page by page
        """
        # Open and read the pdf file in binary mode
        with open(self.pdf_file_path, "rb") as fp:
            # Create parser object to parse the pdf content
            parser = PDFParser(fp)
            # Store the parsed content in PDFDocument object
            document = PDFDocument(parser, self.password)
            # Check if document is extractable, if not abort
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed
            # Create PDFResourceManager object that stores shared resources
            # such as fonts or images
            rsrcmgr = PDFResourceManager()
            # set parameters for analysis
            laparams = LAParams()
            # Create a PDFDevice object which translates interpreted
            # information into desired format
            # Device to connect to resource manager to store shared resources
            # device = PDFDevice(rsrcmgr)
            # Extend the device to a page aggregator to get LT object elements
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create interpreter object to process content from PDFDocument
            # Interpreter needs to be connected to resource manager for shared
            # resources and device
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # Initialize the text
            extracted_text = ""
            # Ok now that we have everything to process a pdf document,
            # lets process it page by page
            for page in PDFPage.create_pages(document):
                # As the interpreter processes the page stored in PDFDocument
                # object
                interpreter.process_page(page)
                # The device renders the layout from interpreter
                layout = device.get_result()
                # Out of the many LT objects within layout, we are interested
                # in LTTextBox and LTTextLine
                for lt_obj in layout:
                    if (isinstance(lt_obj, LTTextBox) or
                            isinstance(lt_obj, LTTextLine)):
                        extracted_text += lt_obj.get_text()
        return extracted_text.encode("utf-8")
Note - There are other libraries such as PyPDF2 which are good at transforming a PDF, such as merging PDF pages, splitting or cropping specific pages out of PDF etc.
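Since the question is about keeping tabulated data aligned, the positions that pdfminer attaches to each layout object can be used to rebuild rows. The following is a minimal sketch and not part of the gist above: it assumes you have already collected (x0, y0, text) tuples, for example from lt_obj.bbox and lt_obj.get_text(), and the function name group_into_rows and the tolerance value are illustrative choices.

```python
def group_into_rows(boxes, y_tolerance=2.0):
    """Group (x0, y0, text) tuples into rows of text.

    A box joins the current row when its y0 is within y_tolerance of the
    first box of that row. PDF y coordinates grow upwards, so rows are
    emitted from the highest y to the lowest.
    """
    rows = []
    # Sort top-to-bottom first, then left-to-right.
    for x0, y0, text in sorted(boxes, key=lambda b: (-b[1], b[0])):
        if rows and abs(rows[-1]["y"] - y0) <= y_tolerance:
            rows[-1]["cells"].append((x0, text.strip()))
        else:
            rows.append({"y": y0, "cells": [(x0, text.strip())]})
    # Within a row, order the cells by x position and keep only the text.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


boxes = [
    (200.0, 700.2, "Price\n"),
    (72.0, 700.0, "Item\n"),
    (72.0, 680.0, "Apple\n"),
    (200.0, 679.5, "1.20\n"),
]
print(group_into_rows(boxes))
# [['Item', 'Price'], ['Apple', '1.20']]
```

This is only a heuristic: real tables with wrapped cells or uneven baselines may need a larger tolerance or column detection on top.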

Related

How to extract text boxes from a pdf and convert them to image

I'm trying to crop the text boxes out of a PDF as images; this would be very useful for gathering training data for one of my models. Here's a sample PDF:
https://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf ; for example, I would like to get the first text box in it as an image (JPG or whatever).
What I have tried so far is the code below, but I'm open to solving this another way, so if you have a different approach, that's fine too.
This code is a modified version of a solution (the first answer) that I found in "How to extract text and text coordinates from a PDF file?" (that covers only Part I of my code). Part II is what I tried myself, and it hasn't worked so far. I also tried reading the image with PyMuPDF, but that didn't change anything (I won't post that attempt since the post is long enough).
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
# pdf path
pdf_path ="example - Git From Bottom Up.pdf"
# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# here is where i stored the data
boxes_data = []
page_sizes = []
def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),"endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),"text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)
# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)
Part II of the code, getting the cropped box in image format.
# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()
#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
#print this does not match with the text, means there's an error
print(text)
As you can see, the coordinates of the box do not match the image. Maybe pdf2image is doing some trick with the image size, but I specified the size of the image correctly, so I don't know.
Any solutions or suggestions are more than welcome.
Thanks in advance.

I've checked the coordinates of the first two boxes from the first part of your code, and they more or less fit the text on the page.
But are you aware that the zero point in a PDF is placed in the bottom-left corner? Maybe this is the cause of the problem.
Unfortunately I didn't manage to test the second part of the code; pdf2image gives me an error.
But I'm almost sure that PIL.Image has its zero point in the top-left corner, unlike PDF. You can convert pdf_Y to pil_Y with the formula:
pil_Y = page_height - pdf_Y
The page height in your case is 792 pt, and you can get the page height from the script as well.
Update
Nevertheless, after a couple of hours spent installing all the modules (that was the hardest part!), I got your script to work to some extent.
Basically I was right: the coordinates were inverted, y => h - y, because PIL and PDF place the zero point differently.
And there was another thing. pdf2image renders images at 200 dpi by default (this can be changed with the dpi argument of convert_from_path). PDF measures everything in points (1 pt = 1/72 inch). So if you want to use PDF sizes on the rendered image, you need to scale them this way: x => x * 200 / 72.
Here is the fixed code:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
from pathlib import Path # it's just my favorite way to handle files
# pdf path
# pdf_path ="test.pdf"
pdf_path = Path.cwd()/"Git From Bottom Up.pdf"
# PART 1: GET LTBOXES COORDINATES IN THE IMAGE ----------------------
# Open a PDF file.
fp = open(pdf_path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# here is where i stored the data
boxes_data = []
page_sizes = []
def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {
                "startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),
                "endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),
                "text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)
# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)
# PART 2: NOW GET PAGE TO IMAGE -------------------------------------
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path)[0] # without 'size=...'
#show first page with the right size (at least the one that pdfminer says)
# firstpage_image.show()
firstpage_image.save("firstpage.png")
# the magic numbers
dpi = 200/72
vertical_shift = 5 # I don't know, but it's need to shift a bit
page_height = int(firstpage_size["height"] * dpi)
# loop through boxes (we'll process only first page for now)
for i, _ in enumerate(boxes_data):
    # current box data
    startX, startY, endX, endY, text = boxes_data[i].values()
    # correction PDF --> PIL
    startY = page_height - int(startY * dpi) - vertical_shift
    endY = page_height - int(endY * dpi) - vertical_shift
    startX = int(startX * dpi)
    endX = int(endX * dpi)
    # after the flip the larger PDF y is the smaller image y, so swap them
    startY, endY = endY, startY
    # turn image to array
    image_array = np.array(firstpage_image)
    # get cropped box
    box = image_array[startY:endY,startX:endX,:]
    convert2pil_image = PIL.Image.fromarray(box)
    #show cropped box image
    # convert2pil_image.show()
    png = "crop_" + str(i) + ".png"
    convert2pil_image.save(png)
    # print the text that should appear in the cropped image
    print(text)
The code is almost the same as yours. I just added the coordinate correction and saved PNG files rather than showing them.
Output:
Gi from the bottom up
Wed, Dec 9
by John Wiegley
In my pursuit to understand Git, it’s been helpful for me to understand it from the bottom
up — rather than look at it only in terms of its high-level commands. And since Git is so beauti-
fully simple when viewed this way, I thought others might be interested to read what I’ve found,
and perhaps avoid the pain I went through nding it.
I used Git version 1.5.4.5 for each of the examples found in this document.
1. License
2. Introduction
3. Repository: Directory content tracking
Introducing the blob
Blobs are stored in trees
How trees are made
e beauty of commits
A commit by any other name…
Branching and the power of rebase
4. e Index: Meet the middle man
Taking the index farther
5. To reset, or not to reset
Doing a mixed reset
Doing a so reset
Doing a hard reset
6. Last links in the chain: Stashing and the reog
7. Conclusion
8. Further reading
2
3
5
6
7
8
10
12
15
20
22
24
24
24
25
27
30
31
Of course, the fixed code is more of a prototype. Not for sale.
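The coordinate correction described in this answer can be isolated into one function, which makes it easier to check by hand. This is a minimal sketch under the same assumptions as the answer (bounding boxes in PDF points with the origin at the bottom-left, a page image rendered at 200 dpi); the name pdf_bbox_to_pixel_box is hypothetical, and the empirical vertical_shift from the script is deliberately left out.

```python
def pdf_bbox_to_pixel_box(bbox, page_height_pt, dpi=200):
    """Convert a PDF bbox (x0, y0, x1, y1) in points, origin bottom-left,
    to a pixel box (left, top, right, bottom) with the origin top-left."""

    def to_px(points):
        # 72 points per inch; multiply first to keep the arithmetic exact
        # for values that divide evenly.
        return int(points * dpi / 72)

    x0, y0, x1, y1 = bbox
    # Flip the y axis: the top edge of the box is the larger PDF y value.
    return to_px(x0), to_px(page_height_pt - y1), to_px(x1), to_px(page_height_pt - y0)


# A box 1 inch wide and half an inch tall, whose top edge is 1 inch below
# the top of a 792 pt high page.
print(pdf_bbox_to_pixel_box((72, 684, 144, 720), page_height_pt=792))
# (200, 200, 400, 300)
```

The returned tuple is in the (left, upper, right, lower) order that PIL's Image.crop expects, so it can be passed to crop directly instead of slicing a NumPy array.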

Loop script to extract multiple PDFs to text files using Python PDFMiner

Grateful for your help. I found this sample script to extract a PDF to a text file:
https://gist.github.com/vinovator/c78c2cb63d62fdd9fb67
This works, and it is probably the most accurate extraction I've found. I would like to edit it to loop through multiple PDFs and write them to multiple text files, all with the same name as the PDF they were created from. I'm struggling to do so and keep either only writing one text file, or overwriting the PDFs I'm trying to extract from. Anyone able just to help me with a loop that will loop through all PDFs in a single folder and extract them to individual text files of the same name as the PDF?
Thanks in advance for your help!
import os
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Import this to raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
base_path = "C://some_folder"
my_file = os.path.join(base_path + "/" + "test_pdf.pdf")
log_file = os.path.join(base_path + "/" + "pdf_log.txt")
password = ""
extracted_text = ""
# Open and read the pdf file in binary mode
fp = open(my_file, "rb")
# Create parser object to parse the pdf content
parser = PDFParser(fp)
# Store the parsed content in PDFDocument object
document = PDFDocument(parser, password)
# Check if document is extractable, if not abort
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create PDFResourceManager object that stores shared resources such as fonts or images
rsrcmgr = PDFResourceManager()
# set parameters for analysis
laparams = LAParams()
# Create a PDFDevice object which translates interpreted information into desired format
# Device needs to be connected to resource manager to store shared resources
# device = PDFDevice(rsrcmgr)
# Extract the decive to page aggregator to get LT object elements
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create interpreter object to process page content from PDFDocument
# Interpreter needs to be connected to resource manager for shared resources and device
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Ok now that we have everything to process a pdf document, lets process it page by page
for page in PDFPage.create_pages(document):
    # As the interpreter processes the page stored in PDFDocument object
    interpreter.process_page(page)
    # The device renders the layout from interpreter
    layout = device.get_result()
    # Out of the many LT objects within layout, we are interested in LTTextBox and LTTextLine
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            extracted_text += lt_obj.get_text()
#close the pdf file
fp.close()
# print (extracted_text.encode("utf-8"))
with open(log_file, "wb") as my_log:
    my_log.write(extracted_text.encode("utf-8"))
print("Done !!")

Assuming you have the following directory structure:
script.py
pdfs
├─a.pdf
├─b.pdf
└─c.pdf
txts
Where script.py is your Python script, pdfs is a folder containing your PDF documents, and txts is an empty folder where the extracted text files should go.
We can use pathlib.Path.glob to discover the paths of all PDF documents in a given directory. We iterate over the paths, and for each path we open the corresponding PDF document, parse it, extract the text and save the text in a text document (with the same name) in the txts folder.
def main():
    from pathlib import Path
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.converter import PDFPageAggregator
    for path in Path("pdfs").glob("*.pdf"):
        with path.open("rb") as file:
            parser = PDFParser(file)
            document = PDFDocument(parser, "")
            if not document.is_extractable:
                continue
            manager = PDFResourceManager()
            params = LAParams()
            device = PDFPageAggregator(manager, laparams=params)
            interpreter = PDFPageInterpreter(manager, device)
            text = ""
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
                for obj in device.get_result():
                    if isinstance(obj, LTTextBox) or isinstance(obj, LTTextLine):
                        text += obj.get_text()
        with open("txts/{}.txt".format(path.stem), "w") as file:
            file.write(text)
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
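The output-naming part of the loop can be tried out without any PDFs to parse. This is a minimal sketch, assuming the same pdfs and txts folder names as above; the helper names output_path_for and plan_outputs are illustrative only.

```python
from pathlib import Path


def output_path_for(pdf_path, out_dir):
    """Return out_dir/<name>.txt for a given PDF path."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def plan_outputs(pdf_dir, out_dir):
    """Map every *.pdf in pdf_dir to its target text file, sorted by name."""
    pdf_paths = sorted(Path(pdf_dir).glob("*.pdf"))
    return {pdf_path: output_path_for(pdf_path, out_dir) for pdf_path in pdf_paths}


print(output_path_for("pdfs/a.pdf", "txts").name)
# a.txt
```

Because each output path is derived from the stem of its own input, two PDFs never write to the same text file, and the PDFs themselves are never opened for writing.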

The script author specifies the input and output files at the start with two parameters: my_file and log_file
You can convert the script to a function that takes these as inputs and performs the extraction, then loop this function multiple times.
# import statements as in the original script
base_path = "C://some_folder"
# Define a pair of tuples with your file names
my_files = ("pdf1.pdf","pdf2.pdf")
log_files = ("log1.txt","log2.txt")
# This is called a list comprehension, it takes each of the
# files listed above and generates the complete file path
my_files = [os.path.join(base_path,x) for x in my_files]
log_files = [os.path.join(base_path,x) for x in log_files]
# Function to extract the file
def extract(my_file,log_file):
    # code to perform the file extraction as in the original script
    pass  # placeholder so the function definition is valid
# loop through the file names,
# as we have two lists, use a range of indices instead of for name in my_files
for i in range(len(my_files)):
    extract(my_files[i],log_files[i])
You should also check the documentation for os.path.join as your usage is not best practice (it may break when switching operating systems).
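As a sketch of what that might look like: pass the folder and the file name to os.path.join as separate arguments and let it pick the separator, and pair the two lists with zip rather than indexing. The file names are the placeholder ones from above, and build_pairs is a made-up helper name.

```python
import os


def build_pairs(base_path, pdf_names, txt_names):
    """Return (pdf_path, txt_path) pairs with OS-appropriate separators."""
    pdf_paths = [os.path.join(base_path, name) for name in pdf_names]
    txt_paths = [os.path.join(base_path, name) for name in txt_names]
    # zip pairs the first PDF with the first text file, and so on.
    return list(zip(pdf_paths, txt_paths))


pairs = build_pairs("some_folder", ("pdf1.pdf", "pdf2.pdf"), ("log1.txt", "log2.txt"))
for pdf_path, txt_path in pairs:
    print(pdf_path, "->", txt_path)
```

Each pair can then be handed to the extract function in place of the indexed loop.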

How to parse PDF text into sentences

I'm wondering how to parse PDF text into sentences. I've found a great variety of solutions here, but quite frankly I either do not understand them or they do not solve the problem.
The texts I'm trying to parse are IMF reports, and I found that the best library to use is pdfminer. The ultimate goal is to perform sentiment analysis on the reports.
link to text: https://www.imf.org/en/Publications/WEO/Issues/2019/03/28/world-economic-outlook-april-2019
The biggest problem I've encountered is the diverse layout and filtering out parts of it, such as the front page, table of contents, graphs, etc. The second problem is special characters, and characters that can't be read properly and end up as apostrophes.
Here is what I've got and what I have tried:
Parse_PDF(file) is used to read the PDF.
text_to_sentence() is supposed to convert the text into a list of sentences, but doesn't.
The other two solutions I found here, for the purpose of reading the PDF, but I haven't found them to work properly on the text, as explained above. What am I missing here, and what can be done about this?
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import PyPDF2 as pdf
import nltk
def Parse_PDF(file):
    filename = '/Users/andreas/Desktop/Python/Portfolio/Text_analasis/IMF_Publications/{}.pdf'.format(file)
    Myfile = open(filename, mode='rb') #Opens PDF
    pdfReader = pdf.PdfFileReader(Myfile) #Reads file
    parsedpageCt = pdfReader.numPages #Get's number of pages
    count = 0
    text = ""
    #The while loop will read each page
    while count < parsedpageCt:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText() #Extracts the text
    report = text.lower()
    return text
def text_to_sentence():
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    data = Parse_PDF('text')
    return '\n-----\n'.join(tokenizer.tokenize(data))
fp = open('/Users/andreas/Desktop/Python/Portfolio/Text_analasis/IMF_Publications/text.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)
for page in pages:
    print('Processing next page...')
    interpreter.process_page(page)
    layout = device.get_result()
    print(layout)
    #for lobj in layout:
        #if isinstance(lobj, LTTextBox):
            #x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            #print('At %r is text: %s' % ((x, y), text))
fp = open('/Users/andreas/Desktop/Python/Portfolio/Text_analasis/IMF_Publications/text.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
def parse_obj(lt_objs):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)
# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)

how to execute a python script from within a python script [duplicate]

This question already has answers here:
Using a Python subprocess call to invoke a Python script
(10 answers)
Closed 4 years ago.
I need to call the pdfminer top level python script from my python code:
Here is the link to pdfminer documentation:
https://github.com/pdfminer/pdfminer.six
The readme file shows how to call it from terminal os prompt as follows:
pdf2txt.py samples/simple1.pdf
Here, the pdf2txt.py is installed in the global space by the pip command:
pip install pdfminer.six
I would like to call this from my python code, which is in the project root directory:
my_main.py (in the project root directory)
for pdf_file_name in input_file_list:
    # somehow call pdf2txt.py with pdf_file_name as argument
    # and write out the text file in the output_txt directory
How can I do that?

I think you need to import it in your code and follow the examples in the docs:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
password = ""  # empty string if the PDF is not password-protected
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
I don't see any point of using shell given you are doing something usual.

I would suggest two ways to do this!
Use os
import os
os.system("pdf2txt.py samples/simple1.pdf")
use subprocess
import subprocess
subprocess.call("pdf2txt.py samples/simple1.pdf", shell=True)
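Applied to the loop in the question, the subprocess route might look like the sketch below. It assumes pdf2txt.py is on the PATH and directly executable (as after pip install pdfminer.six on a Unix-like system) and uses its -o option to name the output file; the function names and the output_txt default are illustrative. Passing the arguments as a list avoids shell=True and the quoting problems that come with file names containing spaces.

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_file_name, output_dir="output_txt"):
    """Build the pdf2txt.py argument list for one PDF."""
    out_file = Path(output_dir) / (Path(pdf_file_name).stem + ".txt")
    return ["pdf2txt.py", "-o", str(out_file), str(pdf_file_name)]


def convert_all(input_file_list, output_dir="output_txt"):
    """Run pdf2txt.py once per input file; stops on the first failure."""
    Path(output_dir).mkdir(exist_ok=True)
    for pdf_file_name in input_file_list:
        subprocess.run(pdf2txt_command(pdf_file_name, output_dir), check=True)
```

With check=True a non-zero exit status raises subprocess.CalledProcessError, so a failed conversion is not silently skipped.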

PDFMiner - export pages as List of Strings

I'm looking to export text from pdf as a list of strings where the list is the whole document and strings are the pages of the PDF. I'm using PDFMiner for this task but it is very complicated and I'm on a tight deadline.
So far I've gotten the code to extract the full pdf as string but I need it in the form of list of strings.
my code is as follows
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
f = file('./PDF/' + file_name, 'rb')
data = []
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.get_pages(pdf):
    interpreter.process_page(page)
    data = retstr.getvalue()
print data
help please.

The issue with your current script is that StringIO.getvalue always returns a single string, and this string contains all the data written so far. Moreover, with each page, you're overwriting the variable data where you're storing it.
One fix is to store the position of the StringIO stream before each page is written, and then read from this position to the end of the stream:
# A list for each page's text
pages_text = []
for page in PDFPage.get_pages(pdf):
    # Get (and store) the "cursor" position of stream before reading from PDF
    # On the first page, this will be zero
    read_position = retstr.tell()
    # Read PDF page, write text into stream
    interpreter.process_page(page)
    # Move the "cursor" to the position stored
    retstr.seek(read_position, 0)
    # Read the text (from the "cursor" to the end)
    page_text = retstr.read()
    # Add this page's text to a convenient list
    pages_text.append(page_text)
Think of StringIO as a text document. You need to manage the cursor position as text is added and store the newly-added text one page at a time. Here, we're storing text in a list.
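The cursor bookkeeping can be seen in isolation with a plain io.StringIO, no PDF required. In this minimal sketch the write calls stand in for interpreter.process_page(page); the sample strings and the name collect_chunks are made up for the demonstration.

```python
import io


def collect_chunks(stream, writers):
    """Call each writer on the stream and return what each one added."""
    chunks = []
    for write in writers:
        start = stream.tell()         # remember where this chunk begins
        write(stream)                 # stands in for process_page(page)
        stream.seek(start)            # jump back to the start of the chunk
        chunks.append(stream.read())  # read to the end; cursor is at the end again
    return chunks


stream = io.StringIO()
writers = [
    lambda s: s.write("page one text\n"),
    lambda s: s.write("page two text\n"),
]
print(collect_chunks(stream, writers))
# ['page one text\n', 'page two text\n']
```

After each read the cursor sits at the end of the stream again, so the next write appends rather than overwrites.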
