Tabula does not recognize table - python

I have a simple python program that takes in a pdf (with a table) and saves the data into a csv file using tabula:
import tabula
if __name__ == '__main__':
    path = input('Filename: ')
    pathSegments = path.split('/')
    folder = ''
    i = 0
    while i < len(pathSegments)-1:
        folder += '/' + pathSegments[i]
        i += 1
    name = pathSegments[len(pathSegments)-1].split('.')[0]
    dest = folder + '/' + name + '.csv'
    print(dest)
    tabula.convert_into(path, dest, pages = "all", output_format = "csv")
I tried multiple different pdfs, for example one with the following picture:
The result, however, is always an empty CSV file; tabula does not seem to recognize the tables.

Tabula isn't perfect at picking up tables. I would look into adding a template to give tabula more guidance. These templates could be generated dynamically depending on different features of the document. See the function tabula.read_pdf_with_template as documented here: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf_with_template.
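As a rough sketch of what generating a template dynamically could look like: the helpers below (build_template, write_template and extract_with_template are hypothetical names, not part of tabula-py) write a JSON template and pass it to tabula.read_pdf_with_template. The field names are modelled on templates exported from the Tabula desktop app, and the region coordinates (in PDF points) are assumptions you would replace with values measured from your own document.

```python
import json


def build_template(regions, method="lattice"):
    """Build Tabula-style template entries.

    `regions` is an iterable of (page, top, left, bottom, right) tuples in
    PDF points. The key layout mirrors templates exported by the Tabula app
    (an assumption to check against a template exported from your install).
    """
    template = []
    for page, top, left, bottom, right in regions:
        template.append({
            "page": page,
            "extraction_method": method,  # "lattice", "stream" or "guess"
            "x1": left,
            "x2": right,
            "y1": top,
            "y2": bottom,
            "width": right - left,
            "height": bottom - top,
        })
    return template


def write_template(regions, template_path, method="lattice"):
    """Serialise the template to a JSON file that tabula can read."""
    with open(template_path, "w", encoding="utf-8") as handle:
        json.dump(build_template(regions, method), handle)


def extract_with_template(pdf_path, template_path):
    """Run tabula with the template (needs tabula-py and a Java runtime)."""
    import tabula  # imported lazily so the helpers above work without it
    return tabula.read_pdf_with_template(pdf_path, template_path)
```

Typical use would be write_template([(1, 100, 50, 400, 550)], 'table.json') followed by extract_with_template(path, 'table.json'), which should return a list of DataFrames that can then be saved with to_csv.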

Related

Mail Merge With Python

I have been trying to write a Python script to mail merge labels. It needs to look into a folder, open an Excel document, merge the document, and print it as a PDF. All the rows in each Excel file are part of the same document and I'd like them to be printed together. I've written a script that opens a Word template and pulls in the Excel file to populate the mail merge, but when I print it:
The printed copy only shows the merge fields, not the information from the workbook.
It only prints the first page; some of the files I use to make labels would produce more than one page.
I've included the code that I have as well as pictures of what I'm currently getting and what I need the end product to look like.
If anyone can help me with this, you would be a lifesaver.
What I need:
What I'm getting:
from os import listdir
import win32com.client as win32
import pathlib
import os
import pandas as pd
pd.options.mode.chained_assignment = None
working_directory = os.getcwd()
path = pathlib.Path().resolve()
inputPath = os.path.join(str(path), 'Output')
outputPath = os.path.join(str(path), 'OutputPDF')
inputs = listdir(inputPath)
wordApp = win32.Dispatch('Word.Application')
wordApp.Visible = True
sourceDoc = wordApp.Documents.Open(os.path.join(working_directory, 'labelTemplate.docx'))
mail_merge = sourceDoc.MailMerge
for x in inputs[1:]:
    mail_merge.OpenDataSource(inputPath + '/'+ x)
    print (x)
    y = x.replace('.xlsx', '')
    z = y.replace('output_','')
    print (z)
    mail_merge = wordApp.ActiveDocument
    mail_merge.ExportAsFixedFormat(os.path.join(outputPath, z), 17)  # 17 = wdExportFormatPDF

Difficulty trying to export query results as a CSV, uploaded to SharePoint (PySpark)

I am trying to run a query, with the result saved as a CSV that is uploaded to a SharePoint folder. This is within Databricks via Pyspark.
My code below is close to doing this, but the final line is not functioning correctly - the file generated in SharePoint does not contain any data, though the dataframe does.
I'm new to Python and Databricks; if anyone can provide some guidance on how to correct that final line, I'd really appreciate it!
from shareplum import Site, Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query)
pandasdf = df.toPandas()
folder.upload_file(pandasdf.to_csv(FileName, encoding = 'utf-8'), FileName)
Sure my code is still garbage, but it does work. I needed to convert the dataframe into a variable containing CSV formatted data prior to uploading it to SharePoint; effectively I was trying to skip a step before. Last two lines were updated:
from shareplum import Site, Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
csv_data = spark.sql(Query).toPandas().to_csv(header=True, index=False, encoding='utf-8')
folder.upload_file(csv_data, FileName)
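To illustrate why the original last line uploaded an empty file, here is a small sketch using a made-up DataFrame (the column names and the temporary path are only for demonstration). When DataFrame.to_csv is given a path it writes to disk and returns None, so None is what gets handed to the upload call; when called without a path it returns the CSV text itself.

```python
import os
import tempfile

import pandas as pd

# Illustrative data only; in the question this comes from spark.sql(...).toPandas()
frame = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Data_Export.csv")
    # With a path argument, to_csv writes the file and returns None
    written = frame.to_csv(path, index=False)

# Without a path argument, to_csv returns the CSV content as a string
csv_text = frame.to_csv(index=False)

print(written)
print(csv_text)
```

The string in csv_text is the kind of value the working version passes to folder.upload_file.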

When I save the docx file in python, data gets corrupted

I am able to edit and save txt files without a problem, but when I save a docx file, the data gets corrupted, as in the image here: image of error. Any suggestions for saving docx properly? Thanks.
def save(self,MainWindow):
    if self.docURL == "" or self.docURL.split(".")[-1] == "txt":
        text = self.textEdit_3.toPlainText()
        file = open(self.docURL[:-4]+"_EDITED.txt","w+")
        file.write(text)
        file.close()
    elif self.docURL == "" or self.docURL.split(".")[-1] == "docx":
        text = self.textEdit_3.toPlainText()
        file = open(self.docURL[:-4] + "_EDITED.docx", "w+")
        file.write(text)
        file.close()
.docx files are much more than text - they are actually a collection of XML files with a very specific format. To read/write them easily, you need the python-docx module. To adapt your code:
from docx import Document
...
    elif self.docURL == "" or self.docURL.split(".")[-1] == "docx":
        text = self.textEdit_3.toPlainText()
        doc = Document()
        paragraph = doc.add_paragraph(text)
        # ".docx" is five characters long, so strip five to drop the extension
        doc.save(self.docURL[:-5] + "_EDITED.docx")
and you're all set. There is much more you can do regarding text formatting, inserting images and shapes, creating tables, etc., but this will get you going. I linked to the docs above.

Extract a table from a locally saved HTML file

I have a series of HTML files stored in a local folder ("destination folder"). These HTML files all contain a number of tables. What I'm looking to do is to locate the tables I'm interested in thanks to keywords, grab these tables in their entirety, paste them to a text file and save this file to the same local folder ("destination folder").
This is what I have for now:
import re
from bs4 import BeautifulSoup
filename = open('filename.txt', 'r')
soup = BeautifulSoup(filename,"lxml")
data = []
for keyword in keywords.split(','):
    try:
        u=1
        txtfile = destinationFolder + ticker +'_'+ companyname[:10]+ '_'+item[1]+'_'+item[3]+'_'+keyword+u+'.txt'
        mots = soup.find_all(string=re.compile(keyword))
        for mot in mots:
            for row in mot.find("table").find_all("tr"):
                data = [cell.get_text(strip=True) for cell in row.find_all("td")]
                data = data.get_string()
                with open(txtfile,'wb') as t:
                    t.write(data)
                txtfile.close()
                u=u+1
    except:
        pass
filename.close()
Not sure what's happening in the background but I don't get my txt file in the end like I'm supposed to. The process doesn't fail. It runs its course till the end but the txt file is nowhere to be found in my local folder when it's done. I'm sure I'm looking in the correct folder. The same path is used elsewhere in my code and works fine.
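One likely reason nothing is written and nothing fails: the bare except swallows every error, including the TypeError raised by concatenating the string keyword with the integer u when the file name is built. Below is a minimal sketch of the intended flow, assuming the goal is one text file per matching table. The function names, the tab-separated output and the file naming scheme are illustrative choices, not the asker's; find_parent is used to walk up from the matched text to its enclosing table.

```python
import os
import re

from bs4 import BeautifulSoup


def tables_matching(html, keyword):
    """Return the text of each table containing `keyword`, one string per table."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    results = []
    for node in soup.find_all(string=re.compile(re.escape(keyword))):
        table = node.find_parent("table")  # search upward from the matched text
        if table is None or id(table) in seen:
            continue  # keyword outside any table, or table already collected
        seen.add(id(table))
        rows = []
        for tr in table.find_all("tr"):
            cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
            rows.append("\t".join(cells))
        results.append("\n".join(rows))
    return results


def save_tables(tables, folder, prefix):
    """Write each table to <prefix>_<n>.txt in `folder`; return the paths written."""
    paths = []
    for n, text in enumerate(tables, start=1):
        path = os.path.join(folder, "%s_%d.txt" % (prefix, n))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    return paths
```

Leaving out the blanket try/except while debugging means the first real error shows up as a traceback instead of a silently missing file.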

How do I extract data from a doc/docx file using Python

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and save it in an XML file.
Reading up on python-docx did not help, as it only seems to allow one to write into word documents, rather than read.
To present my task exactly (or how I chose to approach my task): I would like to search for a key word or phrase in the document (the document contains tables) and extract text data from the table where the key word/phrase is found.
Anybody have any ideas?
The docx is a zip file containing an XML of the document. You can open the zip, read the document and parse data using ElementTree.
The advantage of this technique is that you don't need any extra python libraries installed.
import zipfile
import xml.etree.ElementTree
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'
with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))
for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # Join the text runs of the cell; an empty run has text None
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
See my stackoverflow answer to How to read contents of an Table in MS-Word file Using Python? for more details and references.
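Building on the same zip-and-ElementTree idea, here is a sketch of the keyword search the question asks for: it returns only the tables in which some cell contains the keyword. The function name tables_containing and the tiny in-memory document built by make_demo_docx are made up for illustration; a real .docx path can be passed in the same way.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def tables_containing(docx_file, keyword):
    """Return every table (as a list of rows of cell strings) mentioning `keyword`."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    matches = []
    for table in root.iter(W + 'tbl'):
        rows = []
        for row in table.iter(W + 'tr'):
            rows.append([
                ''.join(t.text or '' for t in cell.iter(W + 't'))
                for cell in row.iter(W + 'tc')
            ])
        if any(keyword in text for cells in rows for text in cells):
            matches.append(rows)
    return matches


def make_demo_docx():
    """Build a minimal stand-in for a .docx in memory (demonstration only)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:tbl><w:tr>'
        '<w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc>'
        '<w:tc><w:p><w:r><w:t>Total</w:t></w:r></w:p></w:tc>'
        '</w:tr></w:tbl></w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer


print(tables_containing(make_demo_docx(), 'Total'))
```

Note that a keyword split across several formatting runs inside one cell is still found, because the runs are joined before the comparison.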
In answer to a comment below,
Images are not as clear cut to extract. I created an empty docx and inserted one image into it. I then opened the docx file as a zip archive (using 7zip) and looked at document.xml. All the image information is stored as attributes in the XML, not as element text like the document text is. So you need to find the tag you are interested in and pull out the information that you are looking for.
For example adding to the script above:
IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'
for image in tree.iter(IMAGE):
    print(image.attrib)
outputs:
{'id': '1', 'name': 'Picture 1'}
I'm no expert at the openxml format but I hope this helps.
I do note that the zip file contains a directory called media which contains a file called image1.jpeg that contains a renamed copy of my embedded image. You can look around in the docx zipfile to investigate what is available.
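Since the embedded images sit in that media directory, one way to get at them is to copy those archive members out with zipfile. This is a sketch, assuming the images live under word/media/ as observed above; list_media and extract_media are illustrative names.

```python
import zipfile

MEDIA_PREFIX = 'word/media/'


def list_media(docx_file):
    """Return the archive names of the files stored under word/media/."""
    with zipfile.ZipFile(docx_file) as archive:
        return [name for name in archive.namelist()
                if name.startswith(MEDIA_PREFIX) and not name.endswith('/')]


def extract_media(docx_file, out_dir):
    """Extract the media files into `out_dir`, keeping the word/media/ layout."""
    with zipfile.ZipFile(docx_file) as archive:
        names = [name for name in archive.namelist()
                 if name.startswith(MEDIA_PREFIX) and not name.endswith('/')]
        archive.extractall(out_dir, members=names)
    return names
```

For example, extract_media('report.docx', 'images') would leave the pictures in images/word/media/ (the file name here is only a placeholder).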
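To search in a document with python-docx (the legacy mikemaccana version linked below, whose opendocx/search functions are not part of current python-docx releases):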
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
You also have a function to get the text of a document:
https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)
Using https://github.com/mikemaccana/python-docx
It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside a table. It's a bit tricky to get the data (the last 2 characters from every entry have to be omitted), but otherwise, it's a ten minute code.
If anyone needs additional details, please say so in the comments.
A simpler library with image extraction capability.
pip install docx2txt
Then use below code to read docx file.
import docx2txt
text = docx2txt.process("file.docx")
Extracting text from doc/docx file using python
import os
import docx2txt
from win32com import client as wc
def extract_text_from_docx(path):
    temp = docx2txt.process(path)
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    final_text = ' '.join(text)
    return final_text
def extract_text_from_doc(doc_path):
    # Convert the .doc to .docx through Word, then reuse the docx extractor
    joinedPath = os.path.join(root_path, save_file_name)
    w = wc.Dispatch('Word.Application')
    doc = w.Documents.Open(doc_path)
    doc.SaveAs(joinedPath, 16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    w.Quit()
    text = extract_text_from_docx(joinedPath)
    return text
def extract_text(file_path, extension):
    text = ''
    if extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif extension == '.doc':
        text = extract_text_from_doc(file_path)
    return text
file_path = #file_path with doc/docx file
root_path = #file_path where the doc downloaded
save_file_name = "Final2_text_docx.docx"
# Derive the extension from the path so extract_text can dispatch on it
extension = os.path.splitext(file_path)[1].lower()
final_text = extract_text(file_path, extension)
print(final_text)
