Extracting text from a PDF file in Python

I am trying to extract text from a PDF file I usually have to deal with at work, so that I can automate it.
When using PyPDF2, it works for my CV, for instance, but not for my work document. The problem is that the extracted text comes out like this: "Helloworldthisisthetext". I then tried using .join(" "), but that did not work.
I read that this is a known problem with PyPDF2 - it seems to depend on the way the PDF was built.
Does anyone know another approach for extracting the text, which I can then use for further steps?
Thank you in advance

I can suggest trying another tool - pdfreader. You can extract both plain strings and "PDF markdown" (decoded text strings plus operators). The "PDF markdown" can be parsed as regular text (with regular expressions, for example).
Below is a code sample for walking the pages and extracting PDF content for further parsing.
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
try:
    while True:
        viewer.render()
        pdf_markdown = viewer.canvas.text_content
        result = my_text_parser(pdf_markdown)
        # The one below will probably be the same as PyPDF2 returns
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass

...

def my_text_parser(text):
    """ Code your parser here """
    ...
The pdf_markdown variable contains all the text, including PDF commands (positioning, display): all strings come in parentheses followed by a Tj or TJ operator.
For more on PDF text operators, see PDF 1.7 sec. 9.4 Text Objects.
You can parse it with regular expressions, for example.
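As a rough illustration of that regex-based parsing (my own sketch, not part of the answer): for simple PDFs that draw text as plain parenthesised strings, you can pull the shown strings out of a page's text_content like this; hex strings and escaped parentheses would need extra handling.
import re

def extract_shown_strings(markdown):
    # Naive sketch: collect every literal (...) string from a page's markdown.
    return re.findall(r"\((.*?)\)", markdown)

# e.g. extract_shown_strings(viewer.canvas.text_content)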

I had a similar requirement at work for which I used PyMuPDF. They also have a collection of recipes which cover typical scenarios of text extraction.
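For reference, a minimal PyMuPDF sketch (my own, not from the answer) that usually preserves the spaces between words; "your_file.pdf" is a placeholder:
import fitz  # PyMuPDF

doc = fitz.open("your_file.pdf")
text = ""
for page in doc:
    text += page.get_text()  # plain-text extraction; other modes such as "blocks" exist
print(text)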

Related

How can I extract the outline of a PDF using Python?

I wish to read the outline of a .pdf format paper. The expected output is a list of section titles like ['abstract', 'Introduction', ...]. The section titles can be identified by the following characteristics: 1) bold and larger font size, 2) all nouns starting with capital letters, and 3) appearing immediately after a line break \n.
The solutions I have tried include:
pypdf2 with reader.outline
reader = PyPDF2.PdfReader('path/to/my/pdf')
print(reader.outline)
pymupdf with doc.get_toc()
doc = fitz.open('path/to/my/pdf')
toc = doc.get_toc()
However, both give me an empty list.
I am currently using the re library to extract the section titles, but the results include additional information such as references and table contents.
import re
re.findall(r'(\[turnpage\]|\n)([A-Z][^.]*?)(\n[A-Z0-9][^\s]*?)', text)
For a clearer understanding of the results produced by the code, please refer to this link
If reader.outline by pypdf gives an empty result, there is no outline specified as metadata.
There can still be an outline specified as normal text. However, detecting / parsing that would require custom work on your side. You can use the text extraction as a basis:
https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
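Beyond plain extract_text(), pypdf also accepts a visitor_text callback that receives the font dictionary and font size for each piece of text, which you could use as a starting point for detecting headings. A rough sketch (my own, not from the answer; the 14-point threshold is an arbitrary assumption to tune):
from pypdf import PdfReader

reader = PdfReader("example.pdf")
candidates = []

def visitor(text, cm, tm, font_dict, font_size):
    # keep text drawn in a larger-than-body font as a possible section title
    if font_size and font_size > 14 and text.strip():
        candidates.append(text.strip())

for page in reader.pages:
    page.extract_text(visitor_text=visitor)

print(candidates)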

how to extract only main text with pdfplumber and ignore image text and tables?

I am trying to parse any non-scanned PDF and extract only the text, without tables and their captions or pictures and their captions: just the main text of the PDF, if such text exists. I tried pdfplumber.
When trying this piece of code, it extracts all the text, including tables and their captions.
import pdfplumber

with pdfplumber.open("somePDFname.pdf") as pdf:
    for pdf_page in pdf.pages:
        single_page_text = pdf_page.extract_text()
        print(single_page_text)
I saw this solution - How to ignore table and its content while extracting text from pdf - but if I understood it correctly, it is specific to a particular table, so it did not work for me, as I don't know the dimensions of the tables/images I'm scanning.
I also read this issue in pdfplumber (https://github.com/jsvine/pdfplumber/issues/242).
I also saw this solution (https://stackoverflow.com/questions/66293939/how-i-can-extract-only-text-without-tables-inside-a-pdf-file-using-pdfplumber), but I would rather keep using pdfplumber for later parsing.
Is there a more general solution to the problem?
Hello, you can apply a filter to a page before extracting its text, for example keeping only characters drawn in a bold font:
clean_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
You can also test the font size in the filter. To see which properties each character carries:
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
The snippet above prints the metadata of a single character (font name, size, position, and so on), so you can check, page by page, what is available to filter on.
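Putting the two together, a minimal sketch (my own, not from the answer; the "Bold" font-name test is an assumption about how your document distinguishes the text you want):
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    for page in pdf.pages:
        # keep only characters drawn in a bold font; adjust the test for your document
        filtered = page.filter(
            lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"]
        )
        print(filtered.extract_text())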

Extract Numbers from a certain location in PDF files

I'm trying to write a script to extract numbers from the "Total Deviation" graph in PDF files that look like this. The reason I am trying to extract the information from the location of the graph, rather than parsing the whole file and filtering it, is that pdfminer exports the numbers in various and unpredictable patterns (I used this script). Sometimes it extracts whole rows together and sometimes it extracts columns, so I want to find a way to extract the numbers from various files in a consistent manner. Any suggestions would be much appreciated!
Try pdfreader. You can extract the text as "PDF markdown" and then parse it with regular expressions, for example:
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        viewer.next()
except PageDoesNotExist:
    pass

data = my_total_deviation_parser(pdf_markdown)

is there a method to read pdf files line by line?

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read, left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast, but it does not preserve the gaps between elements, so the results are jumbled.
Pdfminer - the PDFPageInterpreter() method with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages (a pure-Python sketch of this route appears after the code below).
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information') # split up the document into pages by string that always appears on the
# top of each page
for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page worth of data from the pdf
    reformated = info.replace("\n", "&")  # I replace the new lines with "&" to make it more readable
    print("Page: ", i)  # print page number
    print(reformated, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
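For reference, a minimal pure-Python sketch of the pdfminer route mentioned above (my own, not from the original post), using pdfminer.six's high-level API; the LAParams values are defaults you would tune for your layout:
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# LAParams controls how glyphs and lines are grouped; the defaults often give a
# sensible top-to-bottom, left-to-right reading order for boxes and columns.
text = extract_text("your_file.pdf", laparams=LAParams(line_margin=0.5, boxes_flow=0.5))
print(text)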
I did this for .docx with this code, where txt is the text of the .docx. Hope this helps.
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)

pypdf not extracting tables from pdf

I am using pypdf to extract text from PDF files. The problem is that the tables in the PDF files are not extracted. I have also tried using pdfminer, but I am having the same issue.
The problem is that tables in PDFs are generally made up of absolutely positioned lines and characters, and it is non-trivial to convert this into a sensible table representation.
In Python, PDFMiner is probably your best bet. It gives you a tree structure of layout objects, but you will have to do the table interpreting yourself by looking at the positions of lines (LTLine) and text boxes (LTTextBox). There's a little bit of documentation here.
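As a rough sketch of what walking that layout tree looks like with pdfminer.six (my own example, not from the answer; grouping the boxes into table cells is still up to you):
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTLine, LTRect, LTTextBox

for page_layout in extract_pages("example.pdf"):
    for element in page_layout:
        if isinstance(element, (LTLine, LTRect)):
            # ruling lines and rectangles: candidates for table borders
            print("line/rect", element.bbox)
        elif isinstance(element, LTTextBox):
            # positioned text: compare bounding boxes against the lines to group cells
            print("text", element.bbox, element.get_text().strip())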
Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you could do something like the following:
import urllib2
import xml.etree.ElementTree as ET

# Make request to PDFX
pdfdata = open('example.pdf', 'rb').read()
request = urllib2.Request('http://pdfx.cs.man.ac.uk', pdfdata, headers={'Content-Type': 'application/pdf'})
response = urllib2.urlopen(request).read()

# Parse the response
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))
