Unable to separate the passages, as no separation character is being displayed - python

I am trying to implement passage retrieval on PDF files. For easy navigation, I want each result to include the page number and the passage it belongs to (mostly the passage number), like below:
query: "some query was asked"
results: "one result was displayed"
file_name: "name of file"
source: Page_no-2, passage_no:3
I have a couple of PDF files where the passages can be separated based on recognizable patterns. However, I am facing a challenge with some PDF files where no proper pattern can be found.
When I open the PDF in Chrome there are line gaps between the passages, but when I read it with fitz (PyMuPDF), no line gap is shown, and every line and every passage is separated by just one "\n".
I tried pdfminer, pdftotext, and other libraries, but no luck.
My code:
import fitz
from more_itertools import *
doc = fitz.open('IT_past.pdf')
single_doc = doc.load_page(0) # put here the page number
text=single_doc.get_text('text', sort=True)
text
Result:
screenshot of the page
- Full PDF
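One approach worth trying is to ignore the newline characters and use the vertical positions of the text lines instead: a passage break shows up as a larger-than-usual gap between the bottom of one line and the top of the next. The sketch below assumes each line is already available as a `(y0, y1, text)` tuple sorted top to bottom (for instance built from the line bounding boxes that PyMuPDF reports in `page.get_text("dict")`). The helper name `split_passages` and the `gap_factor` threshold are hypothetical and would need tuning per document; the sample coordinates are made up.

```python
from statistics import median


def split_passages(lines, gap_factor=1.5):
    """Group (y0, y1, text) lines into passages using vertical gaps.

    `lines` must be sorted top to bottom. A new passage starts when the gap
    to the previous line exceeds `gap_factor` times the median positive gap.
    """
    if not lines:
        return []
    # Vertical distance between the bottom of one line and the top of the next.
    gaps = [max(cur[0] - prev[1], 0.0) for prev, cur in zip(lines, lines[1:])]
    positive = [g for g in gaps if g > 0]
    typical = median(positive) if positive else 0.0
    passages = [[lines[0][2]]]
    for gap, (_, _, text) in zip(gaps, lines[1:]):
        if typical and gap > gap_factor * typical:
            passages.append([text])  # unusually large gap: start a new passage
        else:
            passages[-1].append(text)
    return [" ".join(p) for p in passages]


# Made-up coordinates: three tightly spaced lines, a bigger gap, two more lines.
sample = [
    (0, 10, "a"), (12, 22, "b"), (24, 34, "c"),
    (46, 56, "d"), (58, 68, "e"),
]
for number, passage in enumerate(split_passages(sample), start=1):
    print(f"passage_no:{number} -> {passage}")
```

Numbering the returned passages per page gives the "Page_no, passage_no" source information described in the question.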

Related

How can I extract the outline of a PDF using Python?

I wish to read the outline of a paper in .pdf format. The expected output is a list of section titles like ['abstract', 'Introduction', ...]. The section titles can be identified by the following characteristics: 1) bold and larger font size, 2) all nouns starting with capital letters, and 3) appearing immediately after a line break \n.
The solutions I have tried include:
PyPDF2 with reader.outline
import PyPDF2
reader = PyPDF2.PdfReader('path/to/my/pdf')
print(reader.outline)
PyMuPDF with doc.get_toc()
import fitz
doc = fitz.open('path/to/my/pdf')
toc = doc.get_toc()
However, both give me an empty list.
I am currently using the re library to extract the section titles, but the results include additional information such as references and table contents.
import re
re.findall(r'(\[turnpage\]|\n)([A-Z][^.]*?)(\n[A-Z0-9][^\s]*?)', text)
For a clearer understanding of the results produced by the code, please refer to this link
If reader.outline by pypdf gives an empty result, there is no outline specified as metadata.
There can still be an outline specified as normal text. However, detecting / parsing that would require custom work on your side. You can use the text extraction as a basis:
https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
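As an illustration of that custom work, one option is to filter on font information rather than on a regular expression. The sketch below is written against the nested dictionary that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans carrying `size`, `flags` and `text`). The helper name `candidate_headings` and the `min_size` cut-off are assumptions, and bit 4 of `flags` (value 16) is taken to mean bold. The input here is a hand-made stand-in so the snippet runs without a PDF.

```python
BOLD_FLAG = 1 << 4  # bit 4 of a span's "flags" marks bold text in PyMuPDF


def candidate_headings(page_dict, min_size):
    """Return the text of spans that are bold and at least `min_size` points."""
    headings = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                text = span["text"].strip()
                is_bold = bool(span["flags"] & BOLD_FLAG)
                if text and is_bold and span["size"] >= min_size:
                    headings.append(text)
    return headings


# Hand-made stand-in for page.get_text("dict"); sizes and flags are invented.
fake_page = {
    "blocks": [
        {"lines": [{"spans": [{"size": 14, "flags": 16, "text": "Abstract"}]}]},
        {"lines": [{"spans": [{"size": 10, "flags": 0, "text": "We study ..."}]}]},
        {"type": 1},  # an image block without text lines
        {"lines": [{"spans": [{"size": 14, "flags": 20, "text": "Introduction "}]}]},
        {"lines": [{"spans": [{"size": 10, "flags": 16, "text": "bold body text"}]}]},
    ]
}
print(candidate_headings(fake_page, min_size=12))
```

A reasonable `min_size` could be chosen by looking at the most common span size on the page (the body text) and requiring headings to be larger than that.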

Search keywords from multiple Excel columns/rows in multiple PDF files

I am new to the Python world and I am struggling to build a solution. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) is in a corresponding PDF sent during the day.
So, on one side, I have several Excel rows in a sheet with the mandatory information (corresponding to the info on each transaction), and on the other side, I have a folder with several PDFs.
I am trying to extract the data of each PDF so that the workflow can check whether the information for each row in my Excel file is in a single PDF. I checked some questions raised here and tried to apply some of the solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build partial code that extracts the PDF data and looks for the keywords:
import os
from glob import glob
import re
# PyPDF2 < 3.0 API; newer releases use PdfReader and extract_text() instead
from PyPDF2 import PdfFileReader
def search_page(pattern, page):
    # yield every keyword match found in the text of one page
    yield from pattern.findall(page.extractText())
def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)
searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
#inspired by a solution on stackoverflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code: the search tool wants a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df) # I wanted to check that the code was selecting the right columns only, as some other columns have unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. Also, I don't know how to require that ALL the keywords of the list (one row in Excel) are found, as it is mandatory to have all the information of a transaction in the same PDF. When it finds all the info, it should return "ok row 1" or something like that, then do the check for the second row, etc. (and report an error if it doesn't find all the information).
P.S.: Originally, I only wanted to extract the data with Python code and add it to an Alteryx workflow, but the Python tool of Alteryx doesn't accept some packages in my company.
I would be very thankful for any help!
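One way to structure the row-by-row check is sketched below. It assumes the PDF text has already been extracted into a dictionary mapping file name to text, and that each Excel row has been turned into a list of keyword strings (for example with `df.astype(str).values.tolist()`). The function name `check_rows` and the sample data are hypothetical.

```python
import re


def check_rows(rows, pdf_texts):
    """For each row, find the first PDF that contains ALL of its keywords.

    rows: list of keyword lists, one list per Excel row.
    pdf_texts: dict mapping a PDF file name to its extracted text.
    Returns a list of (row_number, file_name or None).
    """
    results = []
    for row_number, keywords in enumerate(rows, start=1):
        # re.escape stops characters such as "." from acting as regex syntax;
        # the lookarounds require the keyword to stand alone, not inside a word.
        patterns = [
            re.compile(r"(?<!\w)%s(?!\w)" % re.escape(keyword))
            for keyword in keywords
        ]
        match = next(
            (
                name
                for name, text in pdf_texts.items()
                if all(p.search(text) for p in patterns)
            ),
            None,
        )
        results.append((row_number, match))
    return results


# Invented sample data standing in for the Excel rows and the extracted PDF text.
rows = [["TX-1001", "ACME Ltd", "250.00"], ["TX-1002", "Globex", "99.50"]]
pdf_texts = {
    "a.pdf": "Transaction TX-1001 for ACME Ltd, amount 250.00 EUR",
    "b.pdf": "Transaction TX-1002 for Initech, amount 99.50 EUR",
}
for row_number, name in check_rows(rows, pdf_texts):
    print(f"ok row {row_number} ({name})" if name else f"error row {row_number}")
```

Compiling one pattern per keyword, rather than joining them with `|`, is what makes the "all keywords in the same PDF" requirement checkable: an alternation only tells you that at least one keyword matched.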

Is there a method to read PDF files line by line?

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast, but it does not give gaps between the elements. Results are jumbled.
pdfminer - the PDFPageInterpreter() method with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula-py - only gives me the first page. Maybe I'm not looping over it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print that path
file_data = parser.from_file(file_path)  # Parse data from file
text = file_data['content']  # Get files text content
# split up the document into pages by string that always appears on the top of each page
by_page = text.split('... Information')
for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page worth of data from the pdf
    reformated = info.replace("\n", "&")  # I replace the new lines with "&" to make it more readable
    print("Page: ", i)  # print page number
    print(reformated, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
I did this for .docx files with the code below, where txt is the text of the .docx. Hope this helps.
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
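For the left-to-right, top-to-bottom reading order asked about in the question, a pure-Python route is to sort word bounding boxes yourself. The sketch below assumes tuples whose first five items are `(x0, y0, x1, y1, word)`, which is the layout PyMuPDF's `page.get_text("words")` uses (PyMuPDF itself is not pure Python, but it needs no Java). The function name `words_to_lines` and the `y_tolerance` value are hypothetical, and the sample boxes are invented so the snippet runs without a PDF.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Rebuild visual lines from word boxes: top to bottom, then left to right."""
    lines = []  # each entry: [reference_y, [(x0, word), ...]]
    for x0, y0, _x1, _y1, word, *_ in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(y0 - lines[-1][0]) <= y_tolerance:
            # Same visual row as the previous word, even if in another column.
            lines[-1][1].append((x0, word))
        else:
            lines.append([y0, [(x0, word)]])
    return [" ".join(w for _, w in sorted(items)) for _, items in lines]


# Two columns whose rows differ slightly in height, as extracted boxes often do.
sample_words = [
    (50, 100, 90, 110, "Name:"),
    (50, 120, 90, 130, "Alice"),
    (300, 101, 340, 111, "Phone:"),
    (300, 120, 360, 130, "555-0100"),
]
for line in words_to_lines(sample_words):
    print(line)
```

The tolerance decides how much vertical jitter still counts as the same row; too large a value merges adjacent lines, too small a value splits one row across columns.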

pdfminer doesn't extract data from filled-out pdf form

I'm trying to use pdfminer to extract the filled-out contents in a pdf form. The instructions for accessing the pdf are:
Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
Click "Create Report" next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15))
Click "Your request for a financial report is ready"
To extract the contents in blue, I copied code from this post:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
filename = 'FRY15_1073757_20160630.PDF'
fp = open(filename, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
# only the immediate children of the AcroForm Fields array are visited here
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print('{0}: {1}'.format(name, value))
This didn't extract the data fields as expected -- nothing was printed. I tried the same code on another pdf and it worked so I suspect the failure might have to do with the security setting of the first pdf, which is shown below
For the second pdf on which the code worked, the security setting shows "Allowed" for all the actions. I also tried using pdfminer's pdf2txt.py functionality (see here) but the filled-out data in the fields in the original pdf form (which is what I want) was not in the converted text file; only the "flat" non-fillable part of the pdf was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the pdf to a text file, the fillable part was in the converted text file. This is what I've been doing to get around the failed code.
Any idea how I can extract data directly from the pdf form? Thanks.
I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.
Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.
While this expectation is often fulfilled, it actually only represents a special case: form fields are arranged in a tree structure with that Fields array as the root element, e.g. in the case of your sample document there is a large tree:
Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
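A rough Python sketch of that descent is given below. It relies only on the PDF form dictionary keys `T` (partial field name), `V` (value) and `Kids` (child fields). The function name `walk_fields` is hypothetical, and the `resolve` parameter is where pdfminer's `resolve1` would be passed. It has not been run against the sample document; the demo uses plain dictionaries in place of pdfminer objects.

```python
def walk_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Yield (full_name, value) for every node in a form field tree that has a value."""
    for ref in fields:
        field = resolve(ref)
        partial = field.get("T")
        if isinstance(partial, bytes):
            # pdfminer.six hands back PDF strings as bytes; this is a rough decode.
            partial = partial.decode("latin-1")
        # Fully qualified names are the partial names joined with periods.
        name = f"{prefix}.{partial}" if prefix and partial else (partial or prefix)
        if "V" in field:
            yield name, field["V"]
        # Descend into children instead of stopping at the top level.
        kids = resolve(field.get("Kids", []))
        yield from walk_fields(kids, resolve, name)


# Invented stand-in for a resolved field tree.
tree = [
    {"T": "form1", "Kids": [
        {"T": "page1", "Kids": [
            {"T": "bank_name", "V": "Example Bank"},
            {"T": "total", "V": "123"},
        ]},
    ]},
    {"T": "signature"},  # a field that has not been filled in
]
for name, value in walk_fields(tree):
    print(f"{name}: {value}")
```

With the question's code, the call would presumably look like `walk_fields(resolve1(doc.catalog['AcroForm'])['Fields'], resolve=resolve1)`.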

Converting a PDF file consisting of tables into a text document containing tables in Python

I have a PDF file that consists of general tables containing names, addresses, phone numbers and fax numbers. What I want is:
1) Read this file, get the content of each row, and put it in a database,
i.e. get the name from the corresponding name column of the PDF file and store it in the database, and so on with address, phone, etc.
The main problem is that whenever I read the PDF file and convert it into a text file (as I don't know any other way to use the data directly without converting it to a text file first), the text output is completely messed up: the format and spacing are not preserved. Please suggest a new way to do this, or what can be changed in the following code:
import pyPdf
def getPDFContent(path):
    f = open("C:\\Doctor's Data\\delhi\\hospital_delhi1.txt", "w")
    content = ""
    text = ""
    s = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content = pdf.getPage(i).extractText() + "\n\n"
        text += content
        tokens = content.split("Fax")
        print(len(tokens))
        for t in tokens:
            print(t)  # general check
            print(s)
    f.close()
    return text
getPDFContent("C:\\Doctor's Data\\delhi\\hospital_delhi1.pdf")
My output (messed up) is:
S.NONAME OF THE HOSPITAL/CLINIC ADDRESS OF THE HOSPITAL/CLINIC PHONE NO. FAX NOLIST OF HOSPITALS AT DELHI59Walia Nursing HomeG.60, Laxmi Nagar, Shakarpur, DelhiDr.A.S.Dave - 2224858560Metro Heart InstituteSector A, Faridabad
:226358961Ayushman HospitalSector-XII, Dwarka, New Delhi42811114/15/16/18
: 28081723, 4553700163Mohan Eye Institute11-B, Ganga Ram Hospital Marg, New Delhi-6064Shroff Eye CentreKasturba Gandhi Marg, New DelhiReimbursement on CGHS rates without credit basis65Rockland HospitalB-33-34, Qutab Institutional Area, New Delhi66National Heart Institute49, Community Centre, East of Kailash
Have a look at some already existing python packages:
pdftables - (Github)
pdftable
