Search keywords from multiple Excel columns/rows in multiple PDF files - python

I am new to Python and I am struggling to build a solution. The goal is to check that some mandatory information (a set of keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) appears in a corresponding PDF sent during the day.
So, on one side, I have several rows in an Excel sheet with the mandatory information (the details of each transaction), and on the other side, I have a folder with several PDFs.
I am trying to extract the text of each PDF so that the workflow can check whether the information for each row of my Excel file is in a single PDF. I looked at some questions raised here and tried to apply their solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build partial code that extracts the PDF text and looks for the keywords:
import os
from glob import glob
import re
from PyPDF2 import PdfFileReader
def search_page(pattern, page):
    yield from pattern.findall(page.extractText())
def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)
searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
#inspired by a solution on stackoverflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code: the search tool wants a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df) #I wanted to check that the code was selecting the right columns only, as some other columns have unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. I also don't know how to require ALL the keywords of the list (one Excel row) to be found, since all the information for a transaction must be in the same PDF. When all the information is found, the script should return "ok row 1" or something similar and move on to the second row, and so on (and report an error if it doesn't find all the information).
P.S.: Originally, I only wanted to extract the data with Python code and add it to an Alteryx workflow, but the Python tool in Alteryx doesn't accept some packages at my company.
I would be very thankful for any help!
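One way the "all keywords of a row in the same PDF" check could look is sketched below. It is only an illustration under assumptions: `row_is_in_text`, `check_rows`, and the sample data are hypothetical names, the rows are assumed to be already converted to lists of strings, and the text of each PDF is assumed to be already extracted into one string per file.

```python
import re

def row_is_in_text(keywords, text):
    """Return True only if every keyword occurs in text as a whole word."""
    return all(
        re.search(r"(?<!\w)" + re.escape(str(kw)) + r"(?!\w)", text)
        for kw in keywords
    )

def check_rows(rows, pdf_texts):
    """Map each 1-based row number to the first PDF holding all its keywords, or None."""
    results = {}
    for number, keywords in enumerate(rows, start=1):
        results[number] = next(
            (name for name, text in pdf_texts.items()
             if row_is_in_text(keywords, text)),
            None,
        )
    return results

# Hypothetical sample data standing in for the real inputs:
#   rows      could come from df.astype(str).values.tolist()
#   pdf_texts could map each path to the concatenated page text of that PDF
rows = [["INV-001", "ACME", "1200.50"], ["INV-002", "Globex", "75.00"]]
pdf_texts = {
    "a.pdf": "Invoice INV-001 issued to ACME for 1200.50 EUR",
    "b.pdf": "Invoice INV-002 issued to Initech for 75.00 EUR",
}

results = check_rows(rows, pdf_texts)
for number, name in results.items():
    print(f"ok row {number} ({name})" if name else f"error row {number}")
```

Using `all(...)` per PDF, rather than one alternation pattern over every keyword, is what enforces that a single file must contain the complete row.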

Related

How can I extract unformatted, table-like text from PDFs using python?

I have a scenario where I have PDFs with a letterhead and a table-like body of text. I have tried using pdfminer, but I'm struggling to figure out how to approach my problem.
An example of the format for one of my PDFs
Specifically, pdfminer reads the data starting from the letterhead up until the table header. It then reads the table header in a row-like fashion from left to right. From there it's just beyond messy.
Here is the Python code to convert the PDF to text:
from pdfminer.high_level import extract_text
text = extract_text('./quote2.pdf')
print(text)
# the with-block closes the file, so the text is fully written to disk
with open("results2.txt", "w") as f:
    f.write(text)
And here is a snippet of what the output looks like:
... letter head info
ITEM #
DESCRIPTION
561347
55 PCs-792.00 LB
6061-T651 PLATE AMS 4027
4 S/C 6" SQUARE
CUTTING PLATE SAW ALUM
PACKAGING SKIDDING
SHIP VIA : OUR TRUCK
Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
The idea was to use regex to extract the relevant numbers. As you can see, it read the first 2 records for the columns ITEM and DESCRIPTION, but from there it starts back up from the letterhead, and it's even messier below.
Is there perhaps a way to separate the letterhead from the rest of the body as a starting step? I am very new to python and not sure how to get what I want. Help much appreciated!
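For the letterhead part of the output shown above, one possible starting point is to pair each run of label lines (lines ending in ":") with the same number of value lines that follow it. This is a rough sketch that assumes the extracted text keeps the layout of the snippet, with labels first and values afterwards in the same order. `pair_labels_with_values` is a hypothetical helper, not part of pdfminer.

```python
def pair_labels_with_values(text):
    """Pair runs of 'LABEL:' lines with the value lines that directly follow them."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {}
    i = 0
    while i < len(lines):
        if not lines[i].endswith(":"):
            i += 1
            continue
        # collect a run of consecutive label lines
        labels = []
        while i < len(lines) and lines[i].endswith(":"):
            labels.append(lines[i].rstrip(":").strip())
            i += 1
        # take as many following non-label lines as there were labels
        values = []
        while (i < len(lines) and len(values) < len(labels)
               and not lines[i].endswith(":")):
            values.append(lines[i])
            i += 1
        # only keep complete runs; a label with no value is skipped
        if len(values) == len(labels):
            fields.update(zip(labels, values))
    return fields

sample = """Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
"""
fields = pair_labels_with_values(sample)
print(fields)
```

Lines that belong to the table body do not end in ":" and are ignored here, so they would need separate handling.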

How do I generate multiple docx files using python's docxtpl package while preserving docx formatting?

I am working on a process to automate generation of offer letters for candidates. The candidate information is in Excel and contains standard information needed for offer letter generation such as candidate name, date of joining, location, job title, CTC etc.
Is there a way to generate multiple offer letters (output file name _.docx) while preserving the formatting of the docx template?
Using Stack Overflow's help, I was able to use the python-docx package to generate multiple offer letters. This approach, however, strips all the formatting from the offer letter.
import os
from pandas import *
import datetime
from docxtpl import DocxTemplate
doc = DocxTemplate("\\template\\offer_letter_template.docx")
xls = ExcelFile("\\data\\candidate_data.xlsx")
df = xls.parse(xls.sheet_names[0])
print (df.to_json(orient='records'))
Output:
[{"offer_letter_date":"July 27, 2019","candidate_name":"John Wick","candidate_email":"john.wick#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":283000},{"offer_letter_date":"July 17, 2019","candidate_name":"Jane Doe","candidate_email":"jane.doe#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":290000}]
context = df.to_json(orient='records')
doc.render(context)
I am struggling to create a loop around context so that each candidate's information is saved in its own file rather than in a single file. Can someone please help?
Jinja2 for word templating was really helpful but I could not replicate it with a loop.

It is possible to create multiple docx files. Unfortunately, the docxtpl documentation does not mention that once you load the template, replacements are done in place, which prevents any further context replacements.
A workaround would be to reopen the template at every iteration.
Something like:
# to_dict gives a list of dicts; to_json would give one string that cannot be indexed per candidate
records = df.to_dict(orient='records')
for record in records:
    doc = DocxTemplate("\\template\\offer_letter_template.docx")
    doc.render(record)
    doc.save("docs-folder\\%s.docx" % record['candidate_name'])
^Might need some revision, but you get the point.

Is there a method to read pdf files line by line?

I have a pdf file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the pdf file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast but does not give gaps in the elements. Results are jumbled.
Pdfminer - PDFPageInterpreter() method with LAParams. This works well but is slow. At least 2 seconds per page and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information') # split up the document into pages by a string that always appears
# at the top of each page
for i in range(1,len(by_page)): # loop page by page
    info = by_page[i] # get one page worth of data from the pdf
    reformated = info.replace("\n", "&") # I replace the new lines with "&" to make it more readable
    print("Page: ",i) # print page number
    print(reformated,"\n\n") # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
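The left-to-right, top-to-bottom reading order described above can be sketched in pure Python, provided the text fragments come with page coordinates. That is an assumption here: a layout-aware extractor would have to supply them (pdfminer's layout objects, for instance, expose a bounding box). `fragments_to_lines` and `y_tolerance` are hypothetical names for illustration.

```python
def fragments_to_lines(fragments, y_tolerance=3.0):
    """Group (x, y, text) fragments into lines read top-to-bottom, left-to-right.

    Assumes PDF-style coordinates where y grows upward, so a larger y
    means the fragment sits higher on the page.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # close enough vertically: same visual line
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # within each line, order the fragments by their x position
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]

# Two columns whose fragments arrive out of order
fragments = [
    (300, 700.0, "Column B, line 1"),
    (50, 700.5, "Column A, line 1"),
    (50, 680.0, "Column A, line 2"),
    (300, 679.0, "Column B, line 2"),
]
ordered = fragments_to_lines(fragments)
for line in ordered:
    print(line)
```

The tolerance decides how much vertical jitter still counts as one line; a suitable value depends on the font size of the document.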

I did this for .docx with the code below, where txt is the text of the .docx. Hope this helps.
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)

Appending pdf files based on multiple values in a dictionary key (or csv) results in too many pages

I am trying to generate pdf files based on the county they fall in. If there is more than one pdf file per county, then I need to append the files into a single file based on the county key. I can't seem to get the maps to append based on key. The final maps generated seem random and often have way too many files appended. I am pretty sure I am not grouping them correctly. I have read that multiple values in a key can result in them showing up multiple times. Can someone please clue me in on how to access each value per key separately, one time only? Obviously I am not understanding something crucial.
My code:
import csv, os
import shutil
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
merged_file = PdfFileMerger()
counties = {'County4': ['C:\\maps\\map2.pdf', 'C:\\maps\\map3.pdf', 'C:\\maps\\map4.pdf'], 'County1': ['C:\\maps\\map1.pdf', 'C:\\maps\\map2.pdf'], 'County3': ['C:\\maps\\map3.pdf'], 'County2': ['C:\\maps\\map1.pdf', 'C:\\maps\\map3.pdf']}
for k, v in counties.items():
    newPdfFile = 'C:\\maps\\JoinedMaps\\' + k + '.pdf'
    if len(v) > 1:
        for filename in v:
            merged_file.append(PdfFileReader(filename,'rb'))
        merged_file.write(newPdfFile)
    else:
        for filename in v:
            shutil.copyfile(filename, newPdfFile)
I get four maps outputted (which is correct) but the number of "pages" (appended files) in some of these files is wildly off. As far as I can tell there is no rhyme or reason as to how these pages are appended. County4 pdf has 3 pages (correct), County1 pdf has 8 pages instead of 2, County3 pdf has 1 page (correct) and County2 has 15 pages instead of 2.
EDIT:
It turns out pyPDF2 does not like iterating through and creating files using the concept of group-by. I imagine it has something to do with how it stores memory. The result is an increasingly large number of pages as you iterate through the key values. I spent days thinking it was my coding. Good to know it wasn't, I guess, but I am surprised this piece of information is not easier to find on the internet.
My solution was to use arcpy, which doesn't help most users reading this, sorry to say.
For those looking at my solution, my csv file looked like this:
County1 C:\maps\map1.pdf
County1 C:\maps\map2.pdf
County2 C:\maps\map1.pdf
County2 C:\maps\map3.pdf
County3 C:\maps\map3.pdf
County4 C:\maps\map2.pdf
County4 C:\maps\map3.pdf
County4 C:\maps\map4.pdf
and my resulting pdf files looked like this:
County-County1 (2 pages - Map1 and Map2)
County-County2 (2 pages - Map1 and Map3)
County-County3 (1 page - Map3)
County-County4 (3 pages - Map2, Map3, and Map4)
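One thing visible in the original loop is that `merged_file` is created once, before the loop, so every file appended for an earlier county is still inside the merger when a later county is written. A possible fix, sketched here under assumptions, is to create a fresh merger for each key. `merge_by_key` and `RecordingMerger` are hypothetical names; `RecordingMerger` is only a stand-in that records calls so the example runs without any PDF files. With PyPDF2, the factory passed in would presumably be `PdfFileMerger`, which provides `append`, `write`, and `close`.

```python
def merge_by_key(groups, make_merger, output_path):
    """Write one merged file per key, using a fresh merger object for each key."""
    for key, paths in groups.items():
        merger = make_merger()  # new object per key, so no pages carry over
        for path in paths:
            merger.append(path)
        merger.write(output_path(key))
        merger.close()

class RecordingMerger:
    """Stand-in for a real PDF merger; it only records what it was asked to do."""
    outputs = {}

    def __init__(self):
        self.pages = []

    def append(self, path):
        self.pages.append(path)

    def write(self, target):
        RecordingMerger.outputs[target] = list(self.pages)

    def close(self):
        pass

counties = {
    "County4": ["map2.pdf", "map3.pdf", "map4.pdf"],
    "County1": ["map1.pdf", "map2.pdf"],
    "County3": ["map3.pdf"],
}
merge_by_key(counties, RecordingMerger, lambda key: "County-" + key + ".pdf")
for target, pages in RecordingMerger.outputs.items():
    print(target, pages)
```

Each output then contains only the files listed under its own key, whatever order the dictionary is iterated in.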

My data started out as a csv file, and the code below references this instead of the dictionaries (which were generated from the csv file) that I used in the above example, but you should be able to glean what I did from the code below. I basically scrapped the dictionary idea and went with reading the csv file line by line and then appending using arcpy. pyPDF2 does NOT merge correctly when trying to output multiple files based on a key. Three days of my life I can't get back.
import csv
import arcpy
from arcpy import env
import shutil, os, glob
# clear out files from destination directory
files = glob.glob(r'C:\maps\JoinedMaps\*')
for f in files:
    os.remove(f)
# open csv file
f = open(r"C:\maps\Maps.csv", "r+")
ff = csv.reader(f)
# set variable to establish previous row of csv file (for comparison)
pre_line = next(ff)
# Iterate through csv file
for cur_line in ff:
    # new file name and location based on value in column (county name)
    newPdfFile = (r'C:\maps\JoinedMaps\County-' + cur_line[0] +'.pdf')
    # establish pdf files to be appended
    joinFile = pre_line[1]
    appendFile = cur_line[1]
    # If columns in both rows match
    if pre_line[0] == cur_line[0]: # <-- compare first column
        # If destination file already exists, append file referenced in current row
        if os.path.exists(newPdfFile):
            tempPdfDoc = arcpy.mapping.PDFDocumentOpen(newPdfFile)
            tempPdfDoc.appendPages(appendFile)
        # Otherwise create destination and append files referenced in both the previous and current row
        else:
            tempPdfDoc = arcpy.mapping.PDFDocumentCreate(newPdfFile)
            tempPdfDoc.appendPages(joinFile)
            tempPdfDoc.appendPages(appendFile)
        # save and delete temp file
        tempPdfDoc.saveAndClose()
        del tempPdfDoc
    else:
        # if no match, do not merge, just copy
        shutil.copyfile(appendFile,newPdfFile)
    # reset variable
    pre_line = cur_line

Get formula from Excel cell with python xlrd

I have to port an algorithm from an Excel sheet to python code but I have to reverse engineer the algorithm from the Excel file.
The Excel sheet is quite complicated: it contains many cells with formulas that refer to other cells (which can themselves contain a formula or a constant).
My idea is to analyze the sheet with a python script, building a sort of table of dependencies between cells, that is:
A1 depends on B4,C5,E7 formula: "=sqrt(B4)+C5*E7"
A2 depends on B5,C6 formula: "=sin(B5)*C6"
...
The xlrd python module allows reading an XLS workbook, but at the moment I can only access the value of a cell, not the formula.
For example, with the following code I can simply get the value of a cell:
import xlrd
#open the .xls file
xlsname="test.xls"
book = xlrd.open_workbook(xlsname)
#build a dictionary of the names->sheets of the book
sd={}
for s in book.sheets():
    sd[s.name]=s
#obtain Sheet "Foglio 1" from sheet names dictionary
sheet=sd["Foglio 1"]
#print value of the cell J141
print sheet.cell(142,9)
However, there seems to be no way to get the formula from the Cell object returned by the .cell(...) method.
The documentation says that it is possible to get a string version of the formula (in English, because no information about function name translation is stored in the Excel file). It mentions formulas (expressions) in the Name and Operand classes, but I cannot understand how to get instances of these classes from the Cell class instance that must contain them.
Could you suggest a code snippet that gets the formula text from a cell?

[Dis]claimer: I'm the author/maintainer of xlrd.
The documentation references to formula text are about "name" formulas; read the section "Named references, constants, formulas, and macros" near the start of the docs. These formulas are associated sheet-wide or book-wide to a name; they are not associated with individual cells. Examples: PI maps to =22/7, SALES maps to =Mktng!$A$2:$Z$99. The name-formula decompiler was written to support inspection of the simpler and/or commonly found usages of defined names.
Formulas in general are of several kinds: cell, shared, and array (all associated with a cell, directly or indirectly), name, data validation, and conditional formatting.
Decompiling general formulas from bytecode to text is a "work-in-progress", slowly. Note that supposing it were available, you would then need to parse the text formula to extract the cell references. Parsing Excel formulas correctly is not an easy job; as with HTML, using regexes looks easy but doesn't work. It would be better to extract the references directly from the formula bytecode.
Also note that cell-based formulas can refer to names, and name formulas can refer both to cells and to other names. So it would be necessary to extract both cell and name references from both cell-based and name formulas. It may be useful to you to have info on shared formulas available; otherwise having parsed the following:
B2 =A2
B3 =A3+B2
B4 =A4+B3
B5 =A5+B4
...
B60 =A60+B59
you would need to deduce the similarity between the B3:B60 formulas yourself.
In any case, none of the above is likely to be available any time soon -- xlrd priorities lie elsewhere.

Update: I have gone and implemented a little library to do exactly what you describe: extracting the cells & dependencies from an Excel spreadsheet and converting them to python code. Code is on github, patches welcome :)
Just to add that you can always interact with excel using win32com (not very fast but it works). This does allow you to get the formula. A tutorial can be found here [cached copy] and details can be found in this chapter [cached copy].
Essentially you just do:
app.ActiveWorkbook.ActiveSheet.Cells(r,c).Formula
As for building a table of cell dependencies, a tricky thing is parsing the excel expressions. If I remember correctly the Trace code you mentioned does not always do this correctly. The best I have seen is the algorithm by E. W. Bachtal, of which a python implementation is available which works well.

So I know this is a very old post, but I found a decent way of getting the formulas from all the sheets in a workbook as well as having the newly created workbook retain all the formatting.
First step is to save a copy of your .xlsx file as .xls
-- Use the .xls as the filename in the code below
Using Python 2.7
from lxml import etree
from StringIO import StringIO
import xlsxwriter
import subprocess
from xlrd import open_workbook
from xlutils.copy import copy
from xlsxwriter.utility import xl_cell_to_rowcol
import os
file_name = '<YOUR-FILE-HERE>'
dir_path = os.path.dirname(os.path.realpath(file_name))
subprocess.call(["unzip",str(file_name+"x"),"-d","file_xml"])
xml_sheet_names = dict()
with open_workbook(file_name,formatting_info=True) as rb:
    wb = copy(rb)
    workbook_names_list = rb.sheet_names()
    for i,name in enumerate(workbook_names_list):
        xml_sheet_names[name] = "sheet"+str(i+1)
sheet_formulas = dict()
for i, k in enumerate(workbook_names_list):
    xmlFile = os.path.join(dir_path,"file_xml/xl/worksheets/{}.xml".format(xml_sheet_names[k]))
    with open(xmlFile) as f:
        xml = f.read()
    tree = etree.parse(StringIO(xml))
    context = etree.iterparse(StringIO(xml))
    sheet_formulas[k] = dict()
    for _, elem in context:
        if elem.tag.split("}")[1]=='f':
            cell_key = elem.getparent().get(key="r")
            cell_formula = elem.text
            sheet_formulas[k][cell_key] = str("="+cell_formula)
sheet_formulas
Structure of Dictionary 'sheet_formulas'
{'Worksheet_Name': {'A1_cell_reference':'cell_formula'}}
Example results:
{u'CY16': {'A1': '=Data!B5',
'B1': '=Data!B1',
'B10': '=IFERROR(Data!B12,"")',
'B11': '=IFERROR(SUM(B9:B10),"")',
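The same idea (an .xlsx file is a zip archive whose worksheet XML stores each formula in an `f` element) can also be sketched with only the Python 3 standard library, without calling `unzip`. This is a minimal sketch, not the code above: `extract_formulas` is a hypothetical helper, the results are keyed by the worksheet part name inside the archive rather than by the sheet's display name, and formula elements with no text (such as dependents of a shared formula) are skipped.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def extract_formulas(xlsx):
    """Return {worksheet part name: {cell reference: formula}} for an .xlsx path or file object."""
    formulas = {}
    with zipfile.ZipFile(xlsx) as archive:
        for part in archive.namelist():
            if not re.fullmatch(r"xl/worksheets/[^/]+\.xml", part):
                continue
            root = ET.fromstring(archive.read(part))
            cells = {}
            for cell in root.iter(MAIN_NS + "c"):
                f = cell.find(MAIN_NS + "f")
                if f is not None and f.text:
                    cells[cell.get("r")] = "=" + f.text
            formulas[part] = cells
    return formulas

# Build a tiny in-memory archive that mimics one worksheet part
sheet_xml = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheetData><row r="1">'
    '<c r="A1"><f>Data!B5</f><v>3</v></c>'
    '<c r="B1"><v>7</v></c>'
    '</row></sheetData></worksheet>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", sheet_xml)
buffer.seek(0)

formulas = extract_formulas(buffer)
print(formulas)
```

Mapping part names back to sheet names would additionally require reading the workbook's own XML parts, which this sketch leaves out.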

It seems that it is currently impossible to do what you want with xlrd. You can have a look at this post for a detailed description of why it is so difficult to implement the functionality you need.
Note that the development team does a great job of providing support in the python-excel Google group.

I know this post is a little late but there's one suggestion that hasn't been covered here. Cut all the entries from the worksheet and paste using paste special (OpenOffice). This will convert the formulas to numbers so there's no need for additional programming and this is a reasonable solution for small workbooks.

Yes! With win32com it works for me.
import win32com.client
Excel = win32com.client.Dispatch("Excel.Application")
# python -m pip install pywin32
file=r'path Excel file'
wb = Excel.Workbooks.Open(file)
sheet = wb.ActiveSheet
#Get value
val = sheet.Cells(1,1).value
# Get Formula
sheet.Cells(6,2).Formula
