I am trying to parse a word (.docx) for tables, then copy these tables over to excel using xlsxwriter.
This is my code:
from docx.api import Document
import xlsxwriter

document = Document('/Users/xxx/Documents/xxx/Clauses Sample - Copy v1 - for merge.docx')
tables = document.tables

wb = xlsxwriter.Workbook('C:/Users/xxx/Documents/xxx/test clause retrieval.xlsx')
Sheet1 = wb.add_worksheet("Compliance")
index_row = 0

print(len(tables))

for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
        #print (data)
        #big_data.append(data)
        Sheet1.write(index_row, 0, str(row_data))
        index_row = index_row + 1
        print(row_data)

wb.close()
This is my desired output:
However, here is my actual output:
I am aware that my current output produces a list of strings instead.
Is there any way that I can get my desired output using xlsxwriter? Any help is greatly appreciated.
I would use the pandas package instead of xlsxwriter, as follows:
from docx.api import Document
import pandas as pd

document = Document("D:/tmp/test.docx")
tables = document.tables

# DataFrame.append was removed in recent pandas versions, so collect the rows
# in a list and build the frame once
rows = []
for table in document.tables:
    for row in table.rows:
        rows.append([cell.text for cell in row.cells])

df = pd.DataFrame(rows, columns=["Column1", "Column2"])
df.to_excel("D:/tmp/test.xlsx")
print(df)
Which prints the following and writes it into the Excel file:
>>>
  Column1 Column2
0   Hello    TEST
1     Est    Ting
2      Gg      ff
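If the output still has to go through xlsxwriter specifically, pandas can also use it as the Excel engine. A minimal sketch reusing the df built above (the sheet name here is just an example):

import pandas as pd

# write the same frame through the xlsxwriter engine instead of the default one
with pd.ExcelWriter("D:/tmp/test.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Compliance", index=False)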
This is the portion of my code update that allowed me to get the output I want:
for row in block.rows:
    for x, cell in enumerate(row.cells):
        print(cell.text)
        Sheet1.write(index_row, x, cell.text)
    index_row += 1
Output:
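For reference, here is a self-contained sketch of that fix; the paths and the sheet name are placeholders, and block is simply each table from document.tables:

from docx.api import Document
import xlsxwriter

document = Document('input.docx')        # placeholder input path
wb = xlsxwriter.Workbook('output.xlsx')  # placeholder output path
Sheet1 = wb.add_worksheet("Compliance")

index_row = 0
for block in document.tables:
    for row in block.rows:
        for x, cell in enumerate(row.cells):
            # one worksheet column per table column, one worksheet row per table row
            Sheet1.write(index_row, x, cell.text)
        index_row += 1

wb.close()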
I am trying to do the following tasks:
Open a DOCX file using python-docx library
Count the # of tables in the DOCX: table_count = len(document.tables)
The function read_docx_table() extracts a table and creates a dataframe from it.
My objective here is as follows:
Extract ALL tables from the DOCX
Find the table that is empty
Delete the empty table
Save the DOCX
My code is as follows:
import pandas as pd
from docx import Document
import numpy as np

document = Document('tmp.docx')
table_count = len(document.tables)
table_num = table_count
print(f"Number of tables in the Document is: {table_count}")
nheader = 1
i = 0

def read_docx_table(document, table_num=1, nheader=1):
    table = document.tables[table_num-1]
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)
    if nheader == 1:
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    elif nheader == 2:
        outside_col, inside_col = df.iloc[0], df.iloc[1]
        h_index = pd.MultiIndex.from_tuples(list(zip(outside_col, inside_col)))
        df = pd.DataFrame(data, columns=h_index).drop(df.index[[0,1]]).reset_index(drop=True)
    elif nheader > 2:
        print("More than two headers. Not Supported in Current version.")
        df = pd.DataFrame()
    return df

def Delete_table(table):
    print(f" We are deleting table now. Table index is {table}")
    print(f"Type of index before casting is {type(table)}")
    index = int(table)
    print(f"Type of index is {type(index)}")
    try:
        print("Deleting started...")
        document.tables[index]._element.getparent().remove(document.tables[index]._element)
    except Exception as e:
        print(e)

while (i < table_count):
    print(f"Dataframe table number is {i} ")
    df = read_docx_table(document, i, nheader)
    df = df.replace('', np.nan)
    print(df)
    if (df.dropna().empty):
        print(f'Empty DataFrame. Table Index = {i}')
        print('Deleting Empty table...')
        #Delete_table(i)
        try:
            document.tables[i]._element.getparent().remove(document.tables[i]._element)
            print("Table deleted...")
        except Exception as e:
            print(e)
    else:
        print("DF is not empty...")
        print(df.size)
    i += 1

document.save('OUT.docx')
My INPUT docx has 3 tables:
But, my CODE gives me the following Output:
It is keeping the empty table and deleting the non-empty table.
Is there something I am missing? I suspect my logic for checking whether a table is empty, if (df.dropna().empty):
The df.dropna().empty check treats a table as empty when every non-header row contains at least one blank cell. Is that the intent? If so, it seems okay to me.
Two points:
The docx library does not necessarily return the tables in the order they exist in the document.
When you delete a table with your code, you immediately skip the next table returned (which, per the point above, might not be the next one in the document) because you increment your counter after deleting; see the short sketch after this list. The result is that not all of the tables get processed.
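A short sketch of that index shift (assuming some demo.docx with at least two tables):

from docx import Document

doc = Document('demo.docx')   # hypothetical file with at least two tables
count_before = len(doc.tables)

first = doc.tables[0]
first._element.getparent().remove(first._element)

# the table that used to be doc.tables[1] is now doc.tables[0] and the count shrank,
# so bumping a loop index right after a delete would skip that table
assert len(doc.tables) == count_before - 1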
As I worked through the logic, I did some rearrangements to understand it. I think you might also have been getting some exceptions raised when indexing into the tables after deleting? I included my edits below - hopefully they help.
import pandas as pd
from docx import Document
import numpy as np

def read_docx_table(document, table_num, header_rows):
    table = document.tables[table_num]
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)
    if header_rows == 1:
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    elif header_rows == 2:
        outside_col, inside_col = df.iloc[0], df.iloc[1]
        h_index = pd.MultiIndex.from_tuples(list(zip(outside_col, inside_col)))
        df = pd.DataFrame(data, columns=h_index).drop(df.index[[0, 1]]).reset_index(drop=True)
    else:  # header_rows not 1 or 2
        print("More than two headers. Not Supported in Current version.")
        df = pd.DataFrame()
    return df

def table_is_empty(document, table_num, header_rows):
    df = read_docx_table(document, table_num, header_rows)
    df = df.replace('', np.nan)
    return df.dropna().empty

def delete_table(document, table_num):
    document.tables[table_num]._element.getparent().remove(document.tables[table_num]._element)

document = Document('tmp.docx')
header_rows = 1
table_count = len(document.tables)
print(f"Number of tables in the input document is: {table_count}")

table_num = 0
while table_num < table_count:
    if table_is_empty(document, table_num, header_rows):
        delete_table(document, table_num)
        table_count = len(document.tables)
    else:
        table_num += 1

print(f"Number of tables in the output document is: {table_count}")
document.save('OUT.docx')
I have a .docx file (attached screenshot) with a table. I need to convert it into a .csv table. I am using python-docx for this with the code below.
My code is below. Everything works fine except the last column (G), which is a merged cell. My code ignores G1 and only reports column G2 (screenshot attached). How can I edit the code so that the .csv file has both the G1 and G2 columns?
Thanks!
import glob
import os
import pandas as pd
from docx.api import Document

files = glob.glob('*.docx')

for name in glob.glob('*.docx'):
    document = Document(name)
    #document = Document('f.docx')
    table = document.tables[0]

    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
        #data
        #print (data)

    df = pd.DataFrame(data)
    print(os.path.splitext(name)[0])
    df.to_csv(os.path.splitext(name)[0] + '.csv')
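One possible cause is that a merged header cell is returned once for every grid column it spans, so the header row contains the same text twice and dict(zip(keys, text)) silently keeps only one of the duplicated columns. A minimal sketch that keeps the rows positional and de-duplicates the header names (this is an assumption about the table layout, not a confirmed fix):

import glob
import os
import pandas as pd
from docx.api import Document

for name in glob.glob('*.docx'):
    document = Document(name)
    table = document.tables[0]

    rows = [[cell.text for cell in row.cells] for row in table.rows]
    header, body = rows[0], rows[1:]

    # make repeated header names unique, e.g. "G", "G" -> "G", "G_2"
    seen = {}
    unique_header = []
    for h in header:
        seen[h] = seen.get(h, 0) + 1
        unique_header.append(h if seen[h] == 1 else f"{h}_{seen[h]}")

    df = pd.DataFrame(body, columns=unique_header)
    df.to_csv(os.path.splitext(name)[0] + '.csv', index=False)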
Below is my code:
v_excel = []

for root, dirs, files in os.walk(paths):
    for t in files:
        if t.endswith('.xlsx'):
            df = pd.read_excel(os.path.join(paths, t), header=None, index_col=False)
            v_excel.append(df)

conc = pd.concat(v_excel, axis=1, ignore_index=True)
conc output:
#after appending two excel files I can successfully concat the files and put each one in
#a separate column
column1    column2
data1      data1
data2      data2
data3      data3
data3      data4
#column1 is from excel file 1 and column2 is from excel file 2
How do I do this for docx files as I did for Excel?
if t.endswith('.docx'):
    #for c,z in enumerate(t):
    v_doc.append(Document(t))  # <----- how to put this in a df and concat per
                               #        docx file as I have done with excel?
docx contains:
#the docx files contain dummy text
#docx1 contains:
data1
data2
data3
data4
#docx2 contains:
data5
data6
data7
data8
I want to save the content of the docx files to columns of an Excel file: docx 1 content to column 1 and docx 2 content to column 2 of the same Excel file.
I hope I get some response. Thank you in advance.
Solution #1: Aggregating multiple .docx documents into a single output docx document.
If you want to copy the text and style from a collection of docx documents into a single output docx, you can use the python-docx module.
from docx import Document
import os

master = Document()

for f in os.listdir('.'):
    if f.endswith('.docx'):
        doc = Document(f)
        for p in doc.paragraphs:
            out_para = master.add_paragraph()
            for run in p.runs:
                output_run = out_para.add_run(run.text)
                # copy style from old to new
                output_run.bold = run.bold
                output_run.italic = run.italic
                output_run.underline = run.underline
                output_run.font.color.rgb = run.font.color.rgb
                output_run.style.name = run.style.name

master.save('out.docx')
Solution #2: Aggregating table content from multiple .docx documents into a single output Excel document.
In your comments, you want to create an excel sheet from a set of word documents with tables of text.
Here is Python code to copy cells in tables of Word documents to a target Excel document.
import pandas as pd
from docx import Document
import os

df = None

for f in os.listdir('data'):
    if f.endswith('.docx'):
        doc = Document(os.path.join('data', f))
        for table in doc.tables:
            for row in table.rows:
                data = []
                for cell in row.cells:
                    data.append(cell.text)
                if df is None:
                    df = pd.DataFrame(columns=list(range(1, len(data)+1)))
                # DataFrame.append was removed in recent pandas; use concat instead
                df = pd.concat([df, pd.DataFrame([data], columns=df.columns)],
                               ignore_index=True)

df.to_excel("output.xlsx")
Solution #3: Aggregating custom table content from multiple .docx documents into a single output Excel document with a 2-column table.
In your particular sample data, the tables have either 3 or 9 columns, so the text of the extra columns needs to be concatenated into a single value if you want to keep 2 columns in the output.
df = None

for f in os.listdir('data'):
    if f.endswith('.docx'):
        doc = Document(os.path.join('data', f))
        # iterate over all the tables
        for table in doc.tables:
            for row in table.rows:
                cells = row.cells
                if len(cells) > 1:
                    col1 = cells[0].text
                    # check if first column is not empty
                    if col1:
                        # concatenate text of the remaining cells into a single value
                        text = ''
                        for i in range(1, len(cells)):
                            if len(text) != 0:
                                text += ' '
                            text += cells[i].text
                        data = [cells[0].text, text]
                        if df is None:
                            df = pd.DataFrame(columns=['column1', 'column2'])
                        # DataFrame.append was removed in recent pandas; use concat instead
                        df = pd.concat([df, pd.DataFrame([data], columns=df.columns)],
                                       ignore_index=True)

# save output
df.to_excel("output.xlsx")
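If the goal is specifically one Excel column per document, as the question describes, a minimal sketch of that mapping (assuming the paths variable from the question and that only paragraph text matters) could look like this:

import os
import pandas as pd
from docx import Document

columns = {}
for t in sorted(os.listdir(paths)):
    if t.endswith('.docx'):
        doc = Document(os.path.join(paths, t))
        # one list of non-empty paragraph texts per document
        columns[t] = [p.text for p in doc.paragraphs if p.text.strip()]

# each document becomes its own column; shorter columns are padded with NaN
df = pd.concat([pd.Series(texts, name=name) for name, texts in columns.items()], axis=1)
df.to_excel("combined.xlsx", index=False)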
You can use docxcompose to concatenate docx files in Python; see the package description on its official PyPI page for more details.
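A minimal sketch of that approach (the file names are placeholders):

from docx import Document
from docxcompose.composer import Composer

master = Document("first.docx")          # placeholder file names
composer = Composer(master)
composer.append(Document("second.docx"))
composer.save("combined.docx")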
I need to extract the same table out of multiple docx report documents.
In the list 'targets_in_dir' I have stored all the file names with paths in the format
'C:\directory\subdirectory\filename1.docx'
The code below perfectly grabs the table out of the document and correctly allocates the keys to the columns.
import pandas as pd
import docx
from docx.api import Document

document = Document(targets_in_dir[1])
table = document.tables[2]

data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)

df = pd.DataFrame(data)
df['report'] = targets_in_dir[1]
print(targets_in_dir[1])
My question: for tracking purposes I want to add a column to the final df that records, for each row, the file name (with path) the row was pulled from. I tried to do it with the line
df['report'] = targets_in_dir[1]
but strangely it only adds the data from 'data_1' instead of the filename and path!
report                                      data_1
C:\directory\subdirectory\filename1.docx   Cumarin
C:\directory\subdirectory\filename1.docx   Piperacin
Meanwhile I found a solution myself with the following line of code; I just wrap the value in str():
df['report'] = str(targets_in_dir[1])
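To extend this to every report in targets_in_dir, a minimal sketch (reusing the imports above and assuming each document has the same table at index 2) could be:

frames = []
for path in targets_in_dir:
    document = Document(path)
    table = document.tables[2]
    rows = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df['report'] = str(path)  # tag every row with its source file
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)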
I have a series of tables I would like to write to the same worksheet. The only other post similar to this is here. I also looked here but didn't see a solution.
I was hoping for something similar to SAS ODS output, which can send proc freq results to an Excel file. My thought was to turn the table results into new data frames and then stack the output results onto a worksheet.
pd.value_counts(df['name'])
df.groupby('name').aggregate({'Id': lambda x: x.unique()})
If I know the number of rows corresponding to the table, I should ideally know the appropriate range of cells to write to.
I am using:
import xlsxwriter
workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()
tableone = pd.value_counts(df['name'])
tabletwo = df.groupby('name').aggregate({'Id': lambda x: x.unique()})
worksheet.write('B2:C15', tableone)
worksheet.write('D2:E15', tabletwo)
workbook.close()
EDIT: Include view of tableone
TableOne:
Name | Freq
A    | 5
B    | 1
C    | 6
D    | 11
import xlsxwriter

workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()

tableone = pd.value_counts(df['name'])
tabletwo = df.groupby('name').aggregate({'Id': lambda x: x.unique()})

col, row = 1, 1  # this is cell B2
for value in tableone:
    if col == 16:
        row += 1
        col = 1
    worksheet.write(row, col, value)
    col += 1

col, row = 3, 1  # this is cell D2
for value in tabletwo['Id']:  # iterate the values, not the column labels
    if col == 16:
        row += 1
        col = 1
    # unique() returns an array, so convert it to text before writing
    worksheet.write(row, col, str(value))
    col += 1

workbook.close()  # close the workbook so the file is actually written
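An alternative, if pandas is already in play, is to let to_excel place each table on the shared sheet via startrow/startcol with xlsxwriter as the engine. A sketch only; the column names, sheet name, and cell positions are assumptions:

import pandas as pd

tableone = df['name'].value_counts()
# join the unique Ids into one string so each cell holds a type Excel can store
tabletwo = df.groupby('name').aggregate({'Id': lambda x: ', '.join(map(str, x.unique()))})

with pd.ExcelWriter('demo.xlsx', engine='xlsxwriter') as writer:
    tableone.to_frame('Freq').to_excel(writer, sheet_name='Sheet1', startrow=1, startcol=1)  # around B2
    tabletwo.to_excel(writer, sheet_name='Sheet1', startrow=1, startcol=4)                   # around E2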