I'm using PDFMiner6 with Python 3.5. It's far better than PyPDF2 (slower, but more accurate and doesn't spit out a bunch of letters that are not separated by spaces). I tried to parse this document:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2963791/
(You can download the PDF free from the NIH website).
I used this code (it's part of a larger spider, but the rest of the code is not relevant to this question):
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
class PDFMiner6(object):
    def __init__(self):
        pass
    def PdfFileReader(self, fp):
        text = []
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
            output = retstr.getvalue()
            text.append(output)
        fp.close()
        device.close()
        retstr.close()
        return text
It parses the first page perfectly, then stops. The rest of the document is not parsed.
I tested the same document using PyPDF2. It parses the entire document but outputs garbage without any spaces (hence I switched over to PDFMiner6). So I'm sure the problem is not that the document cannot be read, but that something is wrong with the code that parses it. What is wrong?
EDIT: I went ahead and tested it on different PDF files with varying results: it parses some completely, whereas for others it stops after the first page. This is frustrating, as PDFMiner6 is a better parser than PyPDF2.
Could anybody help?
Make sure that the PDF is being opened by a PDF viewer and not a web browser. I had the same issue and this is how I fixed it.
It looks like pdfminer treats a PDF that is opened by a web browser as a single page, so make sure it is opened by a PDF viewer so that pdfminer recognizes that the PDF has more than one page.
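Independent of how the file was obtained, one detail of the question's approach is worth checking: `retstr.getvalue()` returns everything written to the buffer so far, so if it is called once per page, each entry holds the cumulative text rather than one page. Below is a minimal sketch of collecting per-page text from a shared buffer. `split_per_page` is a hypothetical helper name, and `process_page` stands in for `interpreter.process_page`; the demo at the bottom uses plain strings instead of real PDF pages.

```python
import io


def split_per_page(pages, process_page, buffer):
    """Return one string per page from a buffer that accumulates text.

    pages        -- any iterable of page objects (e.g. PDFPage.get_pages(fp))
    process_page -- callable that writes a page's text into `buffer`
                    (e.g. interpreter.process_page)
    buffer       -- the io.StringIO that was handed to TextConverter
    """
    texts = []
    for page in pages:
        start = buffer.tell()        # end of the text written so far
        process_page(page)           # appends this page's text
        buffer.seek(start)
        texts.append(buffer.read())  # only what this page added
    return texts


if __name__ == "__main__":
    # Stand-in for pdfminer: each "page" is just the text it would emit.
    buf = io.StringIO()
    print(split_per_page(["page one\n", "page two\n"], buf.write, buf))
```

If the returned list has a single entry, the page iterator itself yielded only one page, which narrows the problem down to the file rather than the buffer handling.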
Related
I am trying to translate PDF files using a translation API and output the result as PDF while keeping the format the same. My approach is to convert the PDF to a Word document, translate that file, and then convert it back to PDF. The problem is that there is no efficient way to convert the PDF to Word. I am trying to write my own program, but the PDFs come in many formats, so I guess it will take some effort to handle them all. So my question is: is there any efficient way to translate these PDFs without losing the format, or is there any efficient way to convert them to docx? I am using Python as the programming language.
Probably not.
PDFs aren't meant to be machine readable or editable, really; they describe formatted, laid-out, printable pages.
You can use pdfminer instead of an API to extract the text. Here is an example:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
I'm trying to merge two different things I've been able to accomplish independently. Unfortunately the PDFMiner docs are just not useful at all.
I have a folder that has hundreds of PDFs in it, named "[0-9].pdf", in no particular order, and I don't care to sort them. I just need a way to go through them and convert them to text.
Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from one PDF successfully.
Some of this post: batch process text to csv using python - was useful in determining how to open a folder full of PDFs and work with them.
Now, I just don't know how I can combine them to one-by-one open a PDF, convert it to a text object, save that to a text file with the same original-filename.txt, and then move onto the next PDF in the directory.
Here's my code:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob
directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))
resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)
password = ""
maxPages = 0
caching = True
pageNums=set()
for one_pdf in pdfFiles:
    print("Processing file: " + str(one_pdf))
    fp = file(one_pdf, 'rb')
    for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = returnString.getvalue()
    filenameString = str(one_pdf) + ".txt"
    text_file = open(filenameString, "w")
    text_file.write(text)
    text_file.close()
    fp.close()
device.close()
returnString.close()
I get no compilation errors, but my code doesn't do anything.
Thanks for your help!
Just answering my own question with the solution idea from @LaurentLAPORTE, which worked.
Set directory to an absolute path using os like this: os.path.abspath("../Documents/003/"). And then it'll work.
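For the batch task described in the question, a minimal sketch follows. It assumes pdfminer.six is installed and uses its `extract_text` helper instead of the lower-level classes in the question; `convert_folder` and its `extract` parameter are hypothetical names introduced here for illustration. Each PDF gets its own extraction call, so text from one file cannot carry over into the next, as it can when a single `StringIO` is shared across files.

```python
from pathlib import Path


def convert_folder(directory, extract=None):
    """Write <name>.txt next to every <name>.pdf in `directory`.

    `extract` is any callable mapping a PDF path to its text. By default
    pdfminer.six's extract_text is used (assumes pdfminer.six is installed).
    Returns the list of text files written.
    """
    if extract is None:
        from pdfminer.high_level import extract_text as extract

    folder = Path(directory).resolve()    # absolute path, as the answer suggests
    written = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        text = extract(str(pdf_path))     # fresh extraction for each file
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling `convert_folder("../Documents/003/")` would then produce `1.txt` for `1.pdf`, and so on. If the returned list is empty, the glob matched nothing, which is the situation the absolute path is meant to rule out.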
I am quite a newbie to python, scrapy and web scraping, so my question might seem naive. Apologies for that upfront.
I want to extract data from a PDF file using scrapy. There are several questions on Stack Overflow on this subject, which I looked up, and I copied the following code from one of the answers. However, I am unable to see any output. I used the print function directly in the code to see the output and tried writing the return value to an Excel file, but that is not showing any output either. I am not getting any error.
The code I am using is as follows:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 2
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    strval = retstr.getvalue()
    print strval
    fp.close()
    device.close()
    retstr.close()
    return strval
Can anyone tell me where I am going wrong?
Thanks!
Tuhina
I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck but I can't really get the data from the tables.
This is what one of the tables looks like:
As you can see, some columns are marked with an 'x'. I'm trying to turn this table into a list of objects.
This is the code so far, I'm using pdfminer now.
# pdfminer test
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, PDFPageAggregator
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
from pdfminer.image import ImageWriter
from cStringIO import StringIO
import sys
import os
def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()
    records = []
    i = 1
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        # process page
        interpreter.process_page(page)
        # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
        lines = retstr.getvalue().splitlines()
        idx = containsSubString(lines, 'Tool')
        lines = lines[idx+1:]
        idx = containsSubString(lines, "1 The 'All'")
        lines = lines[:idx]
        for line in lines:
            records.append(line)
        i += 1
    fp.close()
    device.close()
    retstr.close()
    return records
def containsSubString(list, substring):
    # find a substring in a list item
    for i, s in enumerate(list):
        if substring in s:
            return i
    return -1
# process pdf
fn = '../test1.pdf'
ft = 'test.txt'
text = pdfToText(fn)
outFile = open(ft, 'w')
for i in range(0, len(text)):
    outFile.write(text[i])
outFile.close()
That produces a text file and it gets all of the text but, the x's don't have the spacing preserved. The output looks like this:
The x's are just single spaced in the text document
Right now, I'm just producing text output but my goal is to produce an html document with the data from the tables. I've been searching for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the results I'm looking for.
EDIT: There will be multiple PDFs like this that I need to get the table data from. The headers will be the same for all PDFs (as far as I know).
I figured it out; I was going in the wrong direction. What I did was create PNGs of each table in the PDF, and now I'm processing the images using OpenCV and Python.
Give Tabula a try and, if it works, use the tabula-extractor library (written in Ruby) to programmatically extract the data.
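If staying with pdfminer is preferable, one option is to keep the horizontal position of each text fragment instead of the flattened text, and use it to decide which column an 'x' belongs to. The sketch below shows only that bucketing step on plain `(x0, text)` pairs. With pdfminer, such pairs could come from layout objects such as `LTTextLine`, whose `x0` attribute holds the left edge. The helper name `to_row` and the column boundaries are made-up values for illustration, not taken from the PDF in the question.

```python
from bisect import bisect_right


def to_row(fragments, boundaries, n_columns):
    """Place text fragments into table columns by their left x coordinate.

    fragments  -- iterable of (x0, text) pairs belonging to one table row
    boundaries -- sorted x positions where columns 1, 2, ... begin
                  (so len(boundaries) == n_columns - 1)
    """
    row = [""] * n_columns
    for x0, text in fragments:
        col = bisect_right(boundaries, x0)  # count of boundaries at or left of x0
        row[col] = (row[col] + " " + text).strip()
    return row


if __name__ == "__main__":
    # Hypothetical layout: a tool name followed by three checkbox columns.
    boundaries = [200.0, 260.0, 320.0]
    fragments = [(72.0, "Hammer"), (265.5, "x")]
    print(to_row(fragments, boundaries, 4))  # ['Hammer', '', 'x', '']
```

The boundaries would have to be measured once from the header row, which fits the case where all PDFs share the same headers.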
I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.
It looks like PDFMiner updated its API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this.
As it is, I'm just looking at the source code to see if I can figure it out.
Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
PDFMiner's structure changed recently, so this should work for extracting text from the PDF files.
Edit: Still working as of June 7th, 2018. Verified with Python 3.x.
Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
This works in May 2020 using PDFminer six in Python3.
Installing the package
$ pip install pdfminer.six
Importing the package
from pdfminer.high_level import extract_text
Using a PDF saved on disk
text = extract_text('report.pdf')
Or alternatively:
with open('report.pdf', 'rb') as f:
    text = extract_text(f)
Using PDF already in memory
If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:
import io
import requests
response = requests.get(url)
text = extract_text(io.BytesIO(response.content))
Performance and Reliability compared with PyPDF2
PDFminer.six works more reliably than PyPDF2 (which fails with certain types of PDFs), in particular with PDF version 1.7.
However, text extraction with PDFminer.six is significantly slower than with PyPDF2, by a factor of about 6.
I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10 page PDF and got the following results:
PDFminer.six: 2.88 sec
PyPDF2: 0.45 sec
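A timing harness along these lines can be used to run the same kind of comparison on other files. This is a sketch, not the script used for the numbers above; `time_extraction` is a hypothetical helper, and the extractor is passed in as a callable so that any setup can be kept outside the timed call.

```python
import timeit


def time_extraction(extract, source, repeats=3):
    """Return the best wall-clock time in seconds of one extract(source) call."""
    timer = timeit.Timer(lambda: extract(source))
    # number=1: each measurement runs the extraction exactly once.
    return min(timer.repeat(repeat=repeats, number=1))


if __name__ == "__main__":
    # Stand-in workload; swap in a real extractor such as
    # pdfminer.high_level.extract_text to time an actual PDF.
    print(time_extraction(lambda s: s.upper(), "dummy text" * 1000))
```

Taking the minimum over a few repeats reduces noise from other processes, which matters for runs that last only a fraction of a second.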
pdfminer.six also has a huge footprint: it requires pycryptodome, which needs GCC and other things installed, pushing a minimal-install Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.
Update (2022-08-04): According to Martin Thoma, PyPDF2 has improved a lot in the past 2 years, so do give it a try as well. Here's his benchmark
Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.
Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.
(All the examples assume your PDF file is called example.pdf)
Commandline
If you want to extract text just once you can use the commandline tool pdf2txt.py:
$ pdf2txt.py example.pdf
High-level api
If you want to extract text (properties) with Python, you can use the high-level api. This approach is the go-to solution if you want to programmatically extract information from a PDF.
from pdfminer.high_level import extract_pages, extract_text
# Extract text from a pdf.
text = extract_text('example.pdf')
# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
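The objects returned by `extract_pages` can be walked to get the text page by page. Here is a minimal sketch, assuming pdfminer.six is installed; `page_texts` and `extract_text_per_page` are hypothetical helper names. Elements are selected by the presence of a `get_text` method (text boxes have one; figures, lines and images do not) rather than by an `isinstance` check, which also lets the first helper be tried out without a PDF.

```python
def page_texts(pages):
    """Yield the text of each page layout, one string per page."""
    for page_layout in pages:
        yield "".join(
            element.get_text()
            for element in page_layout
            if hasattr(element, "get_text")  # text boxes; skips figures and lines
        )


def extract_text_per_page(path):
    """Return a list with one text string per page of the PDF at `path`."""
    from pdfminer.high_level import extract_pages  # requires pdfminer.six
    return list(page_texts(extract_pages(path)))
```

With this, `len(extract_text_per_page('example.pdf'))` gives the number of pages that were actually parsed, which is a quick check when output seems to stop early.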
Composable api
There is also a composable api that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend this when you need to customize some component.
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())
Similar question and answer here. I'll try to keep them in sync.
This code is tested with pdfminer for Python 3 (pdfminer-20191125):
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal
def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines
I realize that this is an old question.
For anyone trying to use pdfminer, you should switch to pdfminer.six which is the currently maintained version.