This question already has answers here:
Using a Python subprocess call to invoke a Python script
(10 answers)
Closed 4 years ago.
I need to call the pdfminer top level python script from my python code:
Here is the link to pdfminer documentation:
https://github.com/pdfminer/pdfminer.six
The README shows how to call it from a terminal prompt as follows:
pdf2txt.py samples/simple1.pdf
Here, pdf2txt.py is installed globally by the pip command:
pip install pdfminer.six
I would like to call this from my python code, which is in the project root directory:
my_main.py (in the project root directory)
for pdf_file_name in input_file_list:
    # somehow call pdf2txt.py with pdf_file_name as argument
    # and write out the text file in the output_txt directory
How can I do that?
I think you need to import it in your code and follow the examples in the docs:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (an empty string if the PDF is not encrypted).
password = ""
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
I don't see any point in going through the shell, given that this is a routine task for the library.
I would suggest two ways to do this:
Use os:
import os
os.system("pdf2txt.py samples/simple1.pdf")
Use subprocess:
import subprocess
subprocess.call("pdf2txt.py samples/simple1.pdf", shell=True)
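Passing the command as a list avoids shell quoting problems when file names contain spaces. Below is a minimal sketch of the loop from the question. It assumes pdf2txt.py is on the PATH and accepts -o for the output file, as the pdfminer.six script does. The helper names build_command and convert_all are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the pdf2txt.py argument list for one PDF (hypothetical helper)."""
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    # Assumption: pdf2txt.py is on the PATH and takes -o <output file>.
    return ["pdf2txt.py", "-o", str(out_path), str(pdf_path)]


def convert_all(pdf_paths, out_dir="output_txt"):
    """Run pdf2txt.py once per input file, writing into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in pdf_paths:
        # check=True raises CalledProcessError if the script fails.
        subprocess.run(build_command(pdf_path, out_dir), check=True)
```

On Windows the script may need to be launched through the interpreter, for example by prefixing the list with sys.executable and the full path to pdf2txt.py.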
Grateful for your help. I found this sample script to extract a PDF to a text file:
https://gist.github.com/vinovator/c78c2cb63d62fdd9fb67
This works, and it is probably the most accurate extraction I've found. I would like to edit it to loop through multiple PDFs and write each one to its own text file, named after the PDF it was created from. I'm struggling to do so: I keep either writing only one text file or overwriting the PDFs I'm trying to extract from. Can anyone help me with a loop that goes through all PDFs in a single folder and extracts each to an individual text file with the same name as the PDF?
Thanks in advance for your help!
import os
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Import this to raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
base_path = "C://some_folder"
my_file = os.path.join(base_path + "/" + "test_pdf.pdf")
log_file = os.path.join(base_path + "/" + "pdf_log.txt")
password = ""
extracted_text = ""
# Open and read the pdf file in binary mode
fp = open(my_file, "rb")
# Create parser object to parse the pdf content
parser = PDFParser(fp)
# Store the parsed content in PDFDocument object
document = PDFDocument(parser, password)
# Check if document is extractable, if not abort
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create PDFResourceManager object that stores shared resources such as fonts or images
rsrcmgr = PDFResourceManager()
# set parameters for analysis
laparams = LAParams()
# Create a PDFDevice object which translates interpreted information into desired format
# Device needs to be connected to resource manager to store shared resources
# device = PDFDevice(rsrcmgr)
# Use a page aggregator as the device to get LT object elements
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create interpreter object to process page content from PDFDocument
# Interpreter needs to be connected to resource manager for shared resources and device
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Ok now that we have everything to process a pdf document, lets process it page by page
for page in PDFPage.create_pages(document):
    # As the interpreter processes the page stored in PDFDocument object
    interpreter.process_page(page)
    # The device renders the layout from interpreter
    layout = device.get_result()
    # Out of the many LT objects within layout, we are interested in LTTextBox and LTTextLine
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            extracted_text += lt_obj.get_text()
# Close the pdf file
fp.close()
# print (extracted_text.encode("utf-8"))
with open(log_file, "wb") as my_log:
    my_log.write(extracted_text.encode("utf-8"))
print("Done !!")
Assuming you have the following directory structure:
script.py
pdfs
├─a.pdf
├─b.pdf
└─c.pdf
txts
Where script.py is your Python script, pdfs is a folder containing your PDF documents, and txts is an empty folder where the extracted text files should go.
We can use pathlib.Path.glob to discover the paths of all PDF documents in a given directory. We iterate over the paths, and for each path we open the corresponding PDF document, parse it, extract the text and save the text in a text document (with the same name) in the txts folder.
def main():
    from pathlib import Path
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.converter import PDFPageAggregator
    for path in Path("pdfs").glob("*.pdf"):
        with path.open("rb") as file:
            parser = PDFParser(file)
            document = PDFDocument(parser, "")
            # Skip documents that do not allow text extraction
            if not document.is_extractable:
                continue
            manager = PDFResourceManager()
            params = LAParams()
            device = PDFPageAggregator(manager, laparams=params)
            interpreter = PDFPageInterpreter(manager, device)
            text = ""
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
                for obj in device.get_result():
                    if isinstance(obj, LTTextBox) or isinstance(obj, LTTextLine):
                        text += obj.get_text()
        # Write the text with an explicit encoding so non-ASCII characters survive
        with open("txts/{}.txt".format(path.stem), "w", encoding="utf-8") as file:
            file.write(text)
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
The script author specifies the input and output files at the start with two parameters: my_file and log_file
You can convert the script to a function that takes these as inputs and performs the extraction, then loop this function multiple times.
# import statements as in the original script
base_path = "C://some_folder"
# Define a pair of tuples with your file names
my_files = ("pdf1.pdf","pdf2.pdf")
log_files = ("log1.txt","log2.txt")
# These are list comprehensions; they take each of the
# files listed above and generate the complete file path
my_files = [os.path.join(base_path,x) for x in my_files]
log_files = [os.path.join(base_path,x) for x in log_files]
# Function to extract the file
def extract(my_file,log_file):
    # code to perform the file extraction as in the original script
    pass  # placeholder so the function body is valid Python
# loop through the file names;
# as we have two lists, zip pairs each PDF with its log file
for my_file, log_file in zip(my_files, log_files):
    extract(my_file, log_file)
You should also check the documentation for os.path.join as your usage is not best practice (it may break when switching operating systems).
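As a rough illustration of the os.path.join point: pass each path component as a separate argument and let the function choose the separator. The sketch below also derives each output name from the PDF name, so the two lists cannot drift apart. txt_path_for is a hypothetical helper, and the folder and file names are placeholders.

```python
import os


def txt_path_for(pdf_path, out_dir):
    """Return out_dir/<pdf name>.txt for a given PDF path (hypothetical helper)."""
    stem, _ext = os.path.splitext(os.path.basename(pdf_path))
    # Separate arguments let os.path.join insert the right separator.
    return os.path.join(out_dir, stem + ".txt")


base_path = "some_folder"  # placeholder folder name
pdf_names = ("pdf1.pdf", "pdf2.pdf")
pdf_paths = [os.path.join(base_path, name) for name in pdf_names]
txt_paths = [txt_path_for(path, base_path) for path in pdf_paths]
```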
I need to parse a PDF file in order to extract just the first few lines of text. I have looked at different Python packages for the job, but without any luck.
I have tried:
PDFminer, PDFminer.six and PDFminer3k, which appear to be overly complex for such a simple job; I was unable to find a simple working example
slate, which failed to install; it worked with a fix from a thread, but then gave an error when I tried it, maybe because I was using the wrong PDFminer, and I can't figure out which one to use
PyPDF2 and PyPDF3, but these gave garbage as described here
tika, which gave various terminal error messages and was very slow
pdftotext, which failed to install
pdf2text, which failed at "import pdf2text", and when changed to "pdftotext" failed to import with "ImportError: cannot import name 'Extractor'" even though pip list shows that "Extractor" is installed
Usually I find that installed Python packages work amazingly well, but parsing PDF to text appears to be a jungle, as the myriad of tools also indicates.
Any suggestion of how to do simple parsing of a PDF file to text in Python?
PyPDF2 example added
An example of PyPDF2 is:
import PyPDF2
pdfFileObj = open('file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj_0 = pdfReader.getPage(0)
print(pageObj_0.extractText())
Which returns garbage such as:
$%$%&%&$'(' ˜!)"*+#
Based on pdfminer, I was able to extract the bare necessity from the pdf2txt.py script (provided with pdfminer) into a function:
import io
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
def pdf_to_text(path):
    with open(path, 'rb') as fp:
        rsrcmgr = PDFResourceManager()
        outfp = io.StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, outfp, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    text = outfp.getvalue()
    return text
@EquipDev your solution actually works quite nicely for me, though the output is tab-delimited rather than space-delimited. I would make one change to the last line:
return text.replace('\t', ' ')  # replace tabs with spaces
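If only the first few lines are needed, the returned string can be trimmed afterwards. Below is a small sketch, assuming the text has already been extracted into a string. first_lines is a hypothetical helper, and skipping blank lines is a choice rather than a requirement.

```python
def first_lines(text, count=5):
    """Return the first `count` non-empty lines of `text` (hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    return non_empty[:count]


# Made-up sample text standing in for the extracted PDF content.
sample = "Title\n\nAuthor Name\n  Abstract  \nBody text\n"
print(first_lines(sample, 3))
```

Extraction still processes the whole document. PDFPage.get_pages accepts a maxpages argument that can limit how many pages are read.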
This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 10 months ago.
I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so that I can easily record monthly spending. Right now I am focusing on just extracting the text from the PDF file, but I don't know how to do so.
What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?
I have tried using PyPDF2, but every time I try to extract text from any page using extractText(), it returns empty strings. I have tried installing textract, but I get errors, I think because I need more libraries.
from PyPDF2 import PdfReader
reader = PdfReader("January2019.pdf")
page = reader.pages[0]
print(page.extract_text())
This prints empty strings when it should be printing the contents of the page.
Edit: This question was asked for a very old PyPDF2 version. New versions of PyPDF2 have improved text extraction a lot.
I tried many methods, including PyPDF2 and Tika, but they all failed. I finally found the pdfplumber module, which works for me; you can try it too.
Hope this will be helpful to you.
import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
Using tika worked for me!
from tika import parser
rawText = parser.from_file('January2019.pdf')
rawList = rawText['content'].splitlines()
This made it really easy to split the bank statement into a list of separate lines.
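A possible next step for the bank-statement use case is to drop blank lines and surrounding whitespace before further processing. The sketch below assumes the 'content' value may be None when no text could be extracted, so that case is handled explicitly. clean_lines is a hypothetical helper and the sample text is made up.

```python
def clean_lines(content):
    """Split extracted text into stripped, non-empty lines (hypothetical helper)."""
    if content is None:  # assumption: no text could be extracted
        return []
    return [line.strip() for line in content.splitlines() if line.strip()]


# Stand-in for rawText['content']; the values are made up for illustration.
raw_content = "\n\nStatement January 2019\n 01/02  Coffee  3.50 \n\n"
print(clean_lines(raw_content))
```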
If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf and use it like this:
import fitz
def get_text(filepath: str) -> str:
    with fitz.open(filepath) as doc:
        text = ""
        for page in doc:
            # get_text() is the current method name; older PyMuPDF releases called it getText()
            text += page.get_text().strip()
        return text
PyPDF2 is highly unreliable for extracting text from PDFs, as pointed out here too.
It says:
While PyPDF2 has .extractText(), which can be used on its page objects
(not shown in this example), it does not work very well. Some PDFs
will return text and some will return an empty string. When you want
to extract text from a PDF, you should check out the PDFMiner project
instead. PDFMiner is much more robust and was specifically designed
for extracting text from PDFs.
You could instead install and use pdfminer (pdfminer.six is the maintained fork) using
pip install pdfminer.six
or you can use another open source utility named pdftotext, from xpdfreader. Instructions for using the utility are given on its page.
You can download the command line tools from here
and run the pdftotext.exe utility using subprocess. A detailed explanation of subprocess is given here
PyPDF2 does not read the whole PDF correctly. Try the pdftotext package instead:
import pdftotext
# Load the PDF; the file must be opened in binary mode
with open("January2019.pdf", "rb") as pdf_file:
    pdf = pdftotext.PDF(pdf_file)
# Iterate over all the pages
for page in pdf:
    print(page)
Here is an alternative solution in Windows 10, Python 3.8
Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text
    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()
    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()
                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)
                return retstr.getvalue()
if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
import pdftables_api
import os
c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"
for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path,file), file+'.xlsx')
Go to https://pdftables.com to get an API key.
CSV, format=csv
XML, format=xml
HTML, format=html
XLSX, format=xlsx-single, format=xlsx-multiple
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
I think this code will be exactly what you are looking for:
import glob
import pdfplumber
for filename in glob.glob("*.pdf"):
    output_file = filename.replace('.pdf', '.txt')
    # Context managers close both files even if extraction fails
    with pdfplumber.open(filename) as pdf, open(output_file, "a+", encoding="utf-8") as fx2:
        # Iterate over the pages that exist instead of probing 10000 indices
        for page in pdf.pages:
            # extract_text() may return None for a page without text
            text = page.extract_text() or ""
            print(text)
            fx2.write(text)
Try this:
In a terminal, run: pip install PyPDF2
import PyPDF2
reader = PyPDF2.PdfReader("mypdf.pdf")
for page in reader.pages:
    print(page.extract_text())
I'm trying to convert a lot of Visio files from .vsd to .html, but each file has a lot of pages, so I need to convert all pages to a single .html file.
Using the Python code below, I'm able to convert to PDF, but what I really need is HTML. I noticed I can use win32com.client.Dispatch("SaveAsWeb.VisSaveAsWeb"), but how do I use it? Any ideas?
import sys
import win32com.client
from os.path import abspath
f = abspath(sys.argv[1])
visio = win32com.client.Dispatch("Visio.InvisibleApp")
doc = visio.Documents.Open(f)
doc.ExportAsFixedFormat(1, '{}.pdf'.format(f), 0, 0)
visio.Quit()
exit(0)
Visio cannot do that. You cannot "convert all pages into a single HTML file". You'll have a "root" file and a folder of "supporting" files.
VisSaveAsWeb is pretty well documented, no need to guess:
https://msdn.microsoft.com/en-us/vba/visio-vba/articles/vissaveasweb-object-visio-save-as-web
-- update
With Python, it turned out not to be that trivial to deal with SaveAsWeb. It seems to default to a custom (non-dispatch) interface. I don't think it's possible to deal with this using the win32com library, but it seems to work with comtypes (the comtypes library builds the client from the type library, i.e. it also supports "custom" interfaces):
import sys
import comtypes
from comtypes import client
from os.path import abspath
f = abspath(sys.argv[1])
visio = comtypes.client.CreateObject("Visio.InvisibleApp")
doc = visio.Documents.Open(f)
comtypes.client.GetModule("{}\\SAVASWEB.DLL".format(visio.Path))
saveAsWeb = visio.SaveAsWebObject.QueryInterface(comtypes.gen.VisSAW.IVisSaveAsWeb)
webPageSettings = saveAsWeb.WebPageSettings.QueryInterface(comtypes.gen.VisSAW.IVisWebPageSettings)
webPageSettings.TargetPath = "{}.html".format(f)
webPageSettings.QuietMode = True
saveAsWeb.AttachToVisioDoc(doc)
saveAsWeb.CreatePages()
visio.Quit()
exit(0)
Other than that, you can try the "command line" interface:
http://visualsignals.typepad.co.uk/vislog/2010/03/automating-visios-save-as-web-output.html
import sys
import win32com.client
from os.path import abspath
f = abspath(sys.argv[1])
visio = win32com.client.Dispatch("Visio.InvisibleApp")
doc = visio.Documents.Open(f)
visio.Addons("SaveAsWeb").Run("/quiet=True /target={}.htm".format(f))
visio.Quit()
exit(0)
Other than that you could give a try to my visio svg-export :)
I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?
Thanks.
A PDF does not contain tabular data unless it contains structured content. Some tools include heuristics that try to guess the data structure and put it back. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
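To illustrate the kind of heuristic such tools use, here is a toy sketch that rebuilds rows from positioned text fragments. The input format, a list of (x, y, text) tuples, and the tolerance value are assumptions made for the example. Real libraries expose coordinates in their own ways.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of text ordered left to right.

    A fragment joins the current row when its y coordinate is within
    y_tolerance of the first fragment placed in that row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF coordinates usually grow upwards, so sort top of page first.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Order the cells of each row by their x coordinate.
    return [[text for _x, text in sorted(cells)] for _y, cells in rows]


# Made-up fragments: two rows with slightly uneven baselines.
fragments = [
    (50, 700, "Date"), (150, 700.5, "Amount"),
    (50, 680, "01/02"), (150, 679.2, "3.50"),
]
print(group_into_rows(fragments))
```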
$ pdftotext -layout thingwithtablesinit.pdf
will produce a text file thingwithtablesinit.txt with the tables right.
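The same command can be driven from Python. Below is a minimal sketch, assuming the pdftotext command-line tool is installed and on the PATH. Passing - as the output file sends the text to standard output, and -enc UTF-8 requests UTF-8 output. The function names are illustrative.

```python
import shutil
import subprocess


def layout_command(pdf_path):
    """Argument list for pdftotext with layout preservation."""
    # "-" as the output file asks pdftotext to write to stdout.
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"]


def pdf_to_layout_text(pdf_path):
    """Return the text of pdf_path with the physical layout kept."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        layout_command(pdf_path), check=True, capture_output=True
    )
    # Replace any bytes that cannot be decoded rather than failing.
    return result.stdout.decode("utf-8", errors="replace")
```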
I had a similar problem and ended up using XPDF from http://www.foolabs.com/xpdf/
One of the utilities is PDFtoText, but I guess it all comes down to how the PDF was produced.
As explained in other answers, extracting text from a PDF is not a straightforward task. However, there are certain Python libraries, such as pdfminer (pdfminer3k for Python 3), that are reasonably efficient.
The code snippet below shows a Python class which can be instantiated to extract text from a PDF. This will work in most cases.
(source - https://gist.github.com/vinovator/a46341c77273760aa2bb)
# Python 2.7.6
# PdfAdapter.py
""" Reusable library to extract text from pdf file
Uses pdfminer library; For Python 3.x use pdfminer3k module
Below links have useful information on components of the program
https://euske.github.io/pdfminer/programming.html
http://denis.papathanasiou.org/posts/2010.08.04.post.html
"""
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# from pdfminer.pdfdevice import PDFDevice
# To raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
import logging
__doc__ = "Reusable library to extract text from pdf file"
__name__ = "pdfAdapter"
""" Basic logging config
"""
log = logging.getLogger(__name__)
log.addHandler(logging.NullHandler())
class pdf_text_extractor:
    """ Modules overview:
     - PDFParser: fetches data from pdf file
     - PDFDocument: stores data parsed by PDFParser
     - PDFPageInterpreter: processes page contents from PDFDocument
     - PDFDevice: translates processed information from PDFPageInterpreter
        to whatever you need
     - PDFResourceManager: Stores shared resources such as fonts or images
        used by both PDFPageInterpreter and PDFDevice
     - LAParams: A layout analyzer returns a LTPage object for each page in
        the PDF document
     - PDFPageAggregator: Device that aggregates each page so that its LT
        object elements can be read
    """
    def __init__(self, pdf_file_path, password=""):
        """ Class initialization block.
        pdf_file_path - Full path of pdf including name
        password - If not passed, assumed as none
        """
        self.pdf_file_path = pdf_file_path
        self.password = password
    def getText(self):
        """ Algorithm:
        1) Transfer information from PDF file to PDF document object using parser
        2) Open the PDF file
        3) Parse the file using PDFParser object
        4) Assign the parsed content to PDFDocument object
        5) Now the information in this PDFDocument object has to be processed.
        For this we need PDFPageInterpreter, PDFDevice and PDFResourceManager
        6) Finally process the file page by page
        """
        # Open and read the pdf file in binary mode
        with open(self.pdf_file_path, "rb") as fp:
            # Create parser object to parse the pdf content
            parser = PDFParser(fp)
            # Store the parsed content in PDFDocument object
            document = PDFDocument(parser, self.password)
            # Check if document is extractable, if not abort
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed
            # Create PDFResourceManager object that stores shared resources
            # such as fonts or images
            rsrcmgr = PDFResourceManager()
            # set parameters for analysis
            laparams = LAParams()
            # Create a PDFDevice object which translates interpreted
            # information into desired format
            # Device to connect to resource manager to store shared resources
            # device = PDFDevice(rsrcmgr)
            # Use a page aggregator as the device to get LT object elements
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create interpreter object to process content from PDFDocument
            # Interpreter needs to be connected to resource manager for shared
            # resources and device
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # Initialize the text
            extracted_text = ""
            # Ok now that we have everything to process a pdf document,
            # lets process it page by page
            for page in PDFPage.create_pages(document):
                # As the interpreter processes the page stored in PDFDocument
                # object
                interpreter.process_page(page)
                # The device renders the layout from interpreter
                layout = device.get_result()
                # Out of the many LT objects within layout, we are interested
                # in LTTextBox and LTTextLine
                for lt_obj in layout:
                    if (isinstance(lt_obj, LTTextBox) or
                            isinstance(lt_obj, LTTextLine)):
                        extracted_text += lt_obj.get_text()
        return extracted_text.encode("utf-8")
Note - There are other libraries such as PyPDF2 which are good at transforming a PDF, such as merging PDF pages, splitting or cropping specific pages out of PDF etc.