This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 10 months ago.
I am trying to extract text from a PDF file using Python. My main goal is I am trying to create a program that reads a bank statement and extracts its text to update an excel file to easily record monthly spendings. Right now I am focusing just extracting the text from the pdf file but I don't know how to do so.
What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?
I have tried using PyPDF2 but everytime I try to extract text from any page using extractText(), it returns empty strings. I have tried installing textract but I get errors because I need more libraries I think.
from PyPDF2 import PdfReader
reader = PdfReader("January2019.pdf")
page = reader.pages[0]
print(page.extract_text())
This prints empty strings when it should be printing the contents of the page
edit: This question was asked for a very old PyPDF2 version. New versions of PyPDF2 have improved text extraction a lot
I have tried many methods but failed, include PyPDF2 and Tika. I finally found the module pdfplumber that is work for me, you also can try it.
Hope this will be helpful to you.
import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
Using tika worked for me!
from tika import parser
rawText = parser.from_file('January2019.pdf')
rawList = rawText['content'].splitlines()
This made it really easy to extract separate each line in the bank statement into a list.
If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf and use it like this:
import fitz
def get_text(filepath: str) -> str:
with fitz.open(filepath) as doc:
text = ""
for page in doc:
text += page.getText().strip()
return text
PyPDF2 is highly unreliable for extracting text from pdf . as pointed out here too.
it says :
While PyPDF2 has .extractText(), which can be used on its page objects
(not shown in this example), it does not work very well. Some PDFs
will return text and some will return an empty string. When you want
to extract text from a PDF, you should check out the PDFMiner project
instead. PDFMiner is much more robust and was specifically designed
for extracting text from PDFs.
You could instead install and use pdfminer using
pip install pdfminer
or you can use another open source utility named pdftotext by xpdfreader. instructions to use the utility is given on the page.
you can download the command line tools from here
and could use the pdftotext.exe utility using subprocess .detailed explanation for using subprocess is given here
PyPDF2 does not read whole pdf correctly. You must use this code.
import pdftotext
pdfFileObj = open("January2019.pdf", 'rb')
pdf = pdftotext.PDF(pdfFileObj)
# Iterate over all the pages
for page in pdf:
print(page)
Here is an alternative solution in Windows 10, Python 3.8
Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
import pdftables_api
import os
c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"
for file in os.listdir(file_path):
if file.endswith(".pdf"):
c.xlsx(os.path.join(file_path,file), file+'.xlsx')
Go to https://pdftables.com to get an API key.
CSV, format=csv
XML, format=xml
HTML, format=html
XLSX, format=xlsx-single, format=xlsx-multiple
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
while True:
viewer.render()
pdf_markdown += viewer.canvas.text_content
plain_text += "".join(viewer.canvas.strings)
viewer.next()
except PageDoesNotExist:
pass
I think this code will be exactly what you are looking for:
import requests, time, datetime, os, threading, sys, configparser
import glob
import pdfplumber
for filename in glob.glob("*.pdf"):
pdf = pdfplumber.open(filename)
OutputFile = filename.replace('.pdf','.txt')
fx2=open(OutputFile, "a+")
for i in range(0,10000,1):
try:
page = pdf.pages[i]
text = page.extract_text()
print(text)
fx2.write(text)
except Exception as e:
print(e)
fx2.close()
pdf.close()
Try this:
in terminal execute command: pip install PyPDF2
import PyPDF2
reader = PyPDF2.PdfReader("mypdf.pdf")
for page in reader.pages:
print(page.extract_text())
Related
import PyPDF2
from PyDF2 import PdfFileReader, PdfFileWriter
file_path="sample.pdf"
pdf = PdfFileReader(file_path)
with open("sample.pdf", "w") as f:'
for page_num in range(pdf.numPages):
pageObj = pdf.getPage(page_num)
try:
txt = pageObj.extractText()
txt = DocumentInformation.author
except:
pass
else:
f.write(txt)
f.close()
Error Received:
ModuleNotFoundError: No module named 'PyPDF2'
Writing my first ever script where I want to scan in a PDF then extract the text and write it to a txt file. I was trying to use pyPDF2 but I'm not sure how to use it in a script like this.
EDIT: I had success importing the os & sys like so.
import os
import sys
There are multiple issues:
from PyDF2 import ...: A typo. You meant PyPDF2 instead of PyDF2
PdfFileWriter was imported, but never used (side-note: It's PdfReader and PdfWriter in the latest version of PyPDF2)
with open("sample.pdf", "w") as f:': A syntax error
Lacking indentation of the next lines
Side-note: Did you know that you can simply write for page in pdf.pages?
DocumentInformation.author is wrong. I guess you meant pdf.metadata.author
You overwrite the txt variable - I don't understand why you don't use it before you re-assign it.
Maybe this is what you want:
from PyPDF2 import PdfReader
def get_text(pdf_file_path: str) -> str:
text = ""
reader = PdfReader(pdf_file_path)
for page in reader.pages:
text += page.extract_text()
return text
text = get_text("example.pdf")
with open("example.txt", "w") as f:
f.write(text)
Installation issues
In case you have installation issues, maybe the docs on installing PyPDF2 can help you?
If you execute your script in the console as python your_script_name.py you might want to check the output of
python -c "import PyPDF2; print(PyPDF2.__version__)"
That should show your PyPDF2 version. If it doesn't, it the Python environment you're using doesn't have PyPDF2 installed. Please note that your system might have arbitrary many Python environments.
Need to parse a PDF file in order to extract just the first initial lines of text, and have looked for different Python packages to do the job, but without any luck.
Having tried:
PDFminer, PDFminer.six and PDFminer3k, which appears to be overly complex for the simple job, and I was unable to find a simple working example
slate, got error in installation, though worked with fix from thread, but got error when trying; maybe using wrong PDFminer, but can't figure which to use
PyPDF2 and PyPDF3 but these gave garbage as described here
tika, that gave different terminal error messages and was very slow
pdftotext failed to install
pdf2text failed at "import pdf2text", and when changed to "pdftotext" failed to import with "ImportError: cannot import name 'Extractor'" even through pip list shows that "Extractor" is installed
Usually I find that installed Python packages work amazingly well, but parsing PDF to text appears to be a jungle, which the myriad of tools also indicates.
Any suggestion of how to do simple parsing of a PDF file to text in Python?
PyPDF2 example added
An example of PyPDF2 is:
import PyPDF2
pdfFileObj = open('file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj_0 = pdfReader.getPage(0)
print(pageObj_0.extractText())
Which returns garbage as:
$%$%&%&$'(' ˜!)"*+#
Based on pdfminer, I was able to extract the bare necessity from the pdf2txt.py script (provided with pdfminer) into a function:
import io
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
def pdf_to_text(path):
with open(path, 'rb') as fp:
rsrcmgr = PDFResourceManager()
outfp = io.StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, outfp, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(fp):
interpreter.process_page(page)
text = outfp.getvalue()
return text
#EquipDev your solution actually works quite nicely for me, though it is tab delimited rather than space. I would make one change to the last line:
return text.replace('\t', ' ') #replace tabs with spaces
I'm trying very hard to find the way to convert a PDF file to a .docx file with Python.
I have seen other posts related with this, but none of them seem to work correctly in my case.
I'm using specifically
import os
import subprocess
for top, dirs, files in os.walk('/my/pdf/folder'):
for filename in files:
if filename.endswith('.pdf'):
abspath = os.path.join(top, filename)
subprocess.call('lowriter --invisible --convert-to doc "{}"'
.format(abspath), shell=True)
This gives me Output[1], but then, I can't find any .docx document in my folder.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!
I am not aware of a way to convert a pdf file into a Word file using libreoffice.
However, you can convert from a pdf to a html and then convert the html to a docx.
Firstly, get the commands running on the command line. (The following is on Linux. So you may have to fill in path names to the soffice binary and use a full path for the input file on your OS)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code
I use this for multiple files
####
from pdf2docx import Converter
import os
# # # dir_path for input reading and output files & a for loop # # #
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'
for file in os.listdir(path_input):
cv = Converter(path_input+file)
cv.convert(path_output+file+'.docx', start=0, end=None)
cv.close()
print(file)
Below code worked for me.
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 1
pdfdoc = 'NewDoc.pdf'
todocx = 'NewDoc.docx'
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16) # file format for docx
wb1.Close()
word.Quit()
My approach does not follow the same methodology of using subsystems. However this one does the job of reading through all the pages of a PDF document and moving them to a docx file. Note: It only works with text; images and other objects are usually ignored.
#Description: This python script will allow you to fetch text information from a pdf file
#import libraries
import PyPDF2
import os
import docx
mydoc = docx.Document() # document type
pdfFileObj = open('pdf/filename.pdf', 'rb') # pdffile loction
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) # define pdf reader object
# Loop through all the pages
for pageNum in range(1, pdfReader.numPages):
pageObj = pdfReader.getPage(pageNum)
pdfContent = pageObj.extractText() #extracts the content from the page.
print(pdfContent) # print statement to test output in the terminal. codeline optional.
mydoc.add_paragraph(pdfContent) # this adds the content to the word document
mydoc.save("pdf/filename.docx") # Give a name to your output file.
I have successfully done this with pdf2docx :
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)
I'm trying to merge two different things I've been able to accomplish independently. Unfortunately the PDFMiner docs are just not useful at all.
I have a folder that has hundred of PDFs, named: "[0-9].pdf", in it, in no particular order and I don't care to sort them. I just need a way to go through them and convert them to text.
Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from one PDF successfully.
Some of this post: batch process text to csv using python - was useful in determining how to open a folder full of PDFs and work with them.
Now, I just don't know how I can combine them to one-by-one open a PDF, convert it to a text object, save that to a text file with the same original-filename.txt, and then move onto the next PDF in the directory.
Here's my code:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob
directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))
resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)
password = ""
maxPages = 0
caching = True
pageNums=set()
for one_pdf in pdfFiles:
print("Processing file: " + str(one_pdf))
fp = file(one_pdf, 'rb')
for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = returnString.getvalue()
filenameString = str(one_pdf) + ".txt"
text_file = open(filenameString, "w")
text_file.write(text)
text_file.close()
fp.close()
device.close()
returnString.close()
I get no compilation errors, but my code doesn't do anything.
Thanks for your help!
Just answering my own question with the solution idea from #LaurentLAPORTE that worked.
Set directory to an absolute path using os like this: os.path.abspath("../Documents/003/"). And then it'll work.
I want to add pdf watermark using pypdf lib, code below:
def add_wm(pdf_in, pdf_out):
wm_file = open("watermark.pdf", "rb")
pdf_wm = PdfFileReader(wm_file)
pdf_output = PdfFileWriter()
input_stream = open(pdf_in, "rb")
pdf_input = PdfFileReader(input_stream)
pageNum = pdf_input.getNumPages()
#print pageNum
for i in range(pageNum):
page = pdf_input.getPage(i)
page.mergePage(pdf_wm.getPage(0)) # !! here is fail if has chinese character
page.compressContentStreams()
pdf_output.addPage(page)
output_stream = open(pdf_out, "wb")
pdf_output.write(output_stream)
output_stream.close()
input_stream.close()
wm_file.close()
return True
The issue is if page = pdf_input.getPage(i) page has Chinese characters, page.mergePage will be raise exception and cause failure. How do I work around this?
The Python library pdfrw also supports watermarking. If it does not work for your particular PDF, please email it to me (address at github) and I will investigate -- I am the pdfrw author.
I had the same problem when i was watermarking with PyPdf2 1.25.1 .
pdfrw as Patrick suggested isn't working for my PDF (works for word documents exported as pdf but not for scanned documents i guess).
Updating to the newest version of PyPDF2 (for me this is 1.26.0) fixed this bug.
For more information see PyPDF2 issue #176