Python: Parsing Word document Table and Save CSV file

I would like to save some tables from a Word document to a CSV or Excel file; either format is fine.
I tried readlines(), but it doesn't work, and I don't know what else to try.
The tables in the Word document look like this:
Name Age Gender
Alex 12 F
Willy 14 M
.
.
.
However, I would like to save the whole table as a single row in the CSV or Excel file, like this:
Alex 12 F Willy 14 M ....
import win32com.client  # the client submodule has to be imported explicitly
word = win32com.client.Dispatch('Word.Application')
f = word.Documents.Open('C:/3.doc')

Have a look at www.ironpython.com: IronPython runs on .NET, so it has access to all the libraries for working with Microsoft products.
For your case, read this short tutorial on converting a .doc file to a .txt file. It should be very useful:
http://www.ironpython.info/index.php/Converting_a_Word_document_to_Text
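As an alternative that stays with the win32com approach from the question, here is a minimal sketch. It assumes Windows with Word installed, that the table has no merged cells, and that the header row should be dropped; the helper names (clean_cell_text, flatten_rows, read_word_table, write_single_row) are made up for illustration. The sample rows stand in for what read_word_table would return for the table in the question.

```python
import csv
import io


def clean_cell_text(raw):
    """Strip the end-of-cell marker (carriage return + bell) Word appends to cell text."""
    return raw.rstrip("\r\x07").strip()


def flatten_rows(rows, skip_header=True):
    """Join all table rows into one flat list, optionally dropping the header row."""
    body = rows[1:] if skip_header else rows
    return [cell for row in body for cell in row]


def read_word_table(path, table_index=1):
    """Read one table from a Word file via COM (Windows with Word installed only)."""
    import win32com.client  # imported lazily so the helpers above work anywhere

    word = win32com.client.Dispatch("Word.Application")
    doc = word.Documents.Open(path)
    try:
        table = doc.Tables(table_index)  # Word collections are 1-based
        return [
            [clean_cell_text(table.Cell(r, c).Range.Text)
             for c in range(1, table.Columns.Count + 1)]
            for r in range(1, table.Rows.Count + 1)
        ]
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()


def write_single_row(cells, out):
    """Write the flattened cells as a single CSV row."""
    csv.writer(out).writerow(cells)


# Stand-in for the result of read_word_table('C:/3.doc'), using the question's data.
rows = [["Name", "Age", "Gender"], ["Alex", "12", "F"], ["Willy", "14", "M"]]
buffer = io.StringIO()
write_single_row(flatten_rows(rows), buffer)
print(buffer.getvalue().strip())  # Alex,12,F,Willy,14,M
```

To write a real file instead of the in-memory buffer, open it with open('out.csv', 'w', newline='') and pass that file object to write_single_row.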

Related

Search keywords from multiple Excel columns/rows in multiple PDF files

I am new to Python and am struggling to build a solution. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) appears in a corresponding PDF sent during the day.
So, on one side, I have several rows in an Excel sheet with the mandatory information (one row per transaction), and on the other side, I have a folder with several PDFs.
I am trying to extract the text of each PDF so that the workflow can check whether all the information in a given Excel row appears in a single PDF. I looked at some questions raised here and tried to apply their solutions to my problem, but I haven't managed to get a fully working solution.
I have been able to build partial code that extracts the PDF text and looks for the keywords:
import os
import re
from glob import glob
# PdfFileReader and extractText are the older PyPDF2 names; newer releases use PdfReader and extract_text
from PyPDF2 import PdfFileReader

def search_page(pattern, page):
    # Yield every keyword match found in the text of one page.
    yield from pattern.findall(page.extractText())

def search_document(pattern, path):
    # Yield the matches from every page of the PDF at `path`.
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)

searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
# inspired by a solution on stackoverflow used to count the occurrences of keywords
Also, I think using pandas to build the keyword list should work, but I can't plug it into my previous code: the search expects a string, not a list.
import pandas as pd
df = pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df)  # I wanted to check that the code selects only the right columns, as some other columns hold unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. I also don't know how to require that ALL the keywords of the list (one Excel row) are found, since all the information of a transaction must be in the same PDF. When it finds all the info, it should return "ok row 1" or something like that, then do the check for the second row, and so on (and report an error if it doesn't find all the information).
P.S.: Originally, I only wanted to extract the data with Python code and add it to an Alteryx workflow, but the Python tool in Alteryx doesn't accept some packages at my company.
I would be very thankful for any help!
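One way the per-row check described above could look is sketched below. It works on plain strings so it can be tried without any PDF or Excel file; the function names, the sample rows, and the sample PDF texts are all hypothetical. A row counts as "ok" only if a single PDF text contains every keyword of that row.

```python
import re


def row_keywords(row):
    """Turn one Excel row (any iterable of cell values) into a list of keyword strings."""
    return [str(value).strip() for value in row if str(value).strip()]


def contains_all(keywords, text):
    """Return True only if every keyword occurs as a whole word/phrase in the text."""
    return all(
        re.search(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", text)
        for keyword in keywords
    )


def check_rows(rows, pdf_texts):
    """Map each row number to 'ok' if a single PDF holds all its keywords, else 'error'."""
    results = {}
    for number, row in enumerate(rows, start=1):
        keywords = row_keywords(row)
        found = any(contains_all(keywords, text) for text in pdf_texts.values())
        results[number] = "ok" if found else "error"
    return results


# Hypothetical stand-ins: with pandas, `rows` could come from df.values.tolist(),
# and `pdf_texts` from joining the page text of each PDF in the folder.
rows = [["TX-001", "ACME Ltd", 1500], ["TX-002", "Globex", 320]]
pdf_texts = {
    "a.pdf": "Transaction TX-001 for ACME Ltd, amount 1500 EUR",
    "b.pdf": "Transaction TX-002 for Initech, amount 320 EUR",
}
print(check_rows(rows, pdf_texts))  # {1: 'ok', 2: 'error'}
```

Empty Excel cells arrive from pandas as NaN, so they would need to be dropped or filled before the rows are passed in; how numbers are formatted in the PDF text (for example 1,500 versus 1500) also has to match the keyword exactly in this sketch.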

How can I extract unformatted, table-like text from PDFs using Python?

I have PDFs with a letterhead and a table-like body of text. I have tried using pdfminer, but I'm struggling to figure out how to approach my problem.
An example of the format of one of my PDFs:
Specifically, pdfminer reads the data starting from the letterhead up to the table header. It then reads the table header row-wise, from left to right. From there the output is just a mess.
Here is the Python code that converts the PDF to text:
from pdfminer.high_level import extract_text

text = extract_text('./quote2.pdf')
print(text)
# Write the extracted text to a file; the with-block closes it when done.
with open("results2.txt", "w") as f:
    f.write(text)
And here is a snippet of what the output looks like:
... letter head info
ITEM #
DESCRIPTION
561347
55 PCs-792.00 LB
6061-T651 PLATE AMS 4027
4 S/C 6" SQUARE
CUTTING PLATE SAW ALUM
PACKAGING SKIDDING
SHIP VIA : OUR TRUCK
Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
The idea was to use regex to extract the relevant numbers. As you can see, it read the first 2 records for the ITEM and DESCRIPTION columns, but then it starts over from the letterhead, and it gets even messier further down.
Is there perhaps a way to separate the letterhead from the rest of the body as a starting step? I'm very new to Python and not sure how to get what I want. Any help is much appreciated!
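The output snippet above suggests one pattern that can be exploited without regex: the letterhead labels (lines ending in a colon) come out as a block, followed by their values in the same order. The sketch below pairs them up under that assumption; pair_labels_with_values is a made-up helper, and it will mispair if a value is missing or the order differs in other PDFs.

```python
def pair_labels_with_values(text):
    """Pair each run of 'LABEL:' lines with the same number of value lines after it."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    pairs = {}
    i = 0
    while i < len(lines):
        # Collect a run of consecutive label lines.
        labels = []
        while i < len(lines) and lines[i].endswith(":"):
            labels.append(lines[i][:-1].strip())
            i += 1
        if not labels:
            i += 1  # not a label line, skip it
            continue
        # Take up to the same number of following non-label lines as values.
        values = []
        while (i < len(lines) and len(values) < len(labels)
               and not lines[i].endswith(":")):
            values.append(lines[i])
            i += 1
        pairs.update(zip(labels, values))
    return pairs


# The letterhead part of the output shown in the question.
sample = """Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
"""
print(pair_labels_with_values(sample))
```

Lines consumed this way belong to the letterhead, so whatever remains is a candidate for the table body. This does not solve the interleaved table rows themselves.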

Translating a large CSV file (Flickr8k_text dataset) to Nepali in Python

I've been working on an image captioning project in Nepali. For the dataset, I tried to translate all the English captions of the Flickr8k dataset to Nepali. For this I'm using the Python translate tool as follows:
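import pandas as pd
from translate import Translator

dataset = pd.read_csv('/content/gdrive/My Drive/out.csv', delimiter='\t')
# drop() returns a new DataFrame, so the result has to be assigned back
dataset = dataset.drop('Unnamed: 0', axis=1)

def trans(x):
    translator = Translator(to_lang="ne")
    return translator.translate(x)

dataset['caption'] = dataset['caption'].apply(trans)
print('done')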
But it only translated 130 rows of captions to Nepali; every remaining caption came back as:
MYMEMORY WARNING: YOU USED ALL AVAILABLE FREE TRANSLATIONS FOR TODAY. NEXT AVAILABLE IN 23 HOURS 24 MINUTES 38 SECONDSVISIT TO TRANSLATE MORE
Is there any way to translate all the text at once?
I've tried googletrans too, but it also fails because of too many requests to the API.
Note: the dataset contains 40458 rows of English sentences in the caption column.
It would be a great help if there were a way to translate all the text. Thanks in advance :)

Okay, I figured it out myself. Import your CSV file into Google Sheets,
add a column whose header is the name of the target language, and use the formula =googletranslate(cell_with_text, "source_language", "target_language")
Example: =googletranslate(A2,"en","ne"). Then grab the corner of the cell, where the mouse pointer turns into a + sign, and drag it all the way down. Bingo: all the text in the column is translated at once.
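If the translation has to stay in Python, one way to live with a daily quota is to cache finished translations and stop as soon as the quota warning comes back, so the next run resumes where the last one ended. This is only a sketch: translate_with_cache and fake_translate are hypothetical names, the fake translator merely imitates a quota, and detecting the quota by the "MYMEMORY WARNING" prefix is an assumption based on the message quoted in the question.

```python
def translate_with_cache(captions, translate, cache, quota_marker="MYMEMORY WARNING"):
    """Translate captions, reusing cached results and stopping once the quota is hit.

    Returns True if every caption is now in the cache, False if the quota ran out.
    """
    for caption in captions:
        if caption in cache:
            continue  # already translated in an earlier run
        result = translate(caption)
        if result.startswith(quota_marker):
            return False  # quota exhausted; keep the cache and retry later
        cache[caption] = result
    return True


# Hypothetical translator that allows only two calls, to imitate a daily quota.
calls = {"count": 0}


def fake_translate(text):
    calls["count"] += 1
    if calls["count"] > 2:
        return "MYMEMORY WARNING: YOU USED ALL AVAILABLE FREE TRANSLATIONS FOR TODAY."
    return text.upper()


cache = {}
captions = ["a dog runs", "a child smiles", "a dog runs", "two men talk"]
finished = translate_with_cache(captions, fake_translate, cache)
print(finished, cache)
```

In a real run, the cache dictionary could be saved to disk (for example with json.dump) after each session and loaded again before the next one. Repeated captions are only sent once.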

Copying tables, images, graphs, etc. from one docx to a new docx - Python

pydocx
Is there a way to copy pictures, tables, etc. from one docx to a new docx using Python? I am using python-docx to read a docx, perform some operations on the text paragraph by paragraph, and copy the result to a new docx, but in the process every table and picture is lost. It seems the code does not read them at all. I want the pictures, graphs, columns, etc. to stay in place. Is that possible? Please help me with this.
import docx

doc = docx.Document('demo.docx')
doc1 = docx.Document()
for paragraph in doc.paragraphs:
    d = paragraph.text
    some_op = d.upper()  # taking .upper as an example but doing something else here
    doc1.add_paragraph(some_op)  # some_op is already a str, so it has no .text attribute
doc1.save('Paragraphs.docx')
The newly created Paragraphs.docx is missing the images, tables, etc. that were in the original.

You are using doc.paragraphs, which returns only the paragraphs in the document.
To access tables and images you need to use the Table and InlineShape objects (available as doc.tables and doc.inline_shapes). You can find them here in the official documentation:
Tables
Shapes and images
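As a rough illustration of the table part, the sketch below reads the text of each table through doc.tables and writes it into new tables in the target document. The helper names are made up, only cell text is copied (no styling, images, or graphs), and the copied tables are appended at the end rather than at their original position. The stand-in object at the bottom just mimics the attributes used, so the reading helper can be tried without a docx file.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Return the text of a python-docx Table as a list of rows of strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def copy_tables(source_path, target_path):
    """Copy the text of every table in one docx into a new docx (formatting is not kept)."""
    import docx  # python-docx; imported lazily so table_to_rows can be tried without it

    source = docx.Document(source_path)
    target = docx.Document()
    for table in source.tables:
        rows = table_to_rows(table)
        if not rows:
            continue
        width = max(len(row) for row in rows)
        new_table = target.add_table(rows=len(rows), cols=width)
        for r, row in enumerate(rows):
            for c, text in enumerate(row):
                new_table.cell(r, c).text = text
    target.save(target_path)


# Stand-in object mimicking the .rows / .cells / .text attributes, for a quick check.
fake_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[SimpleNamespace(text="Name"), SimpleNamespace(text="Age")]),
    SimpleNamespace(cells=[SimpleNamespace(text="Alex"), SimpleNamespace(text="12")]),
])
print(table_to_rows(fake_table))  # [['Name', 'Age'], ['Alex', '12']]
```

A call such as copy_tables('demo.docx', 'Tables.docx') would then produce a document containing only the table text.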

Excel showing empty cells when importing file created with csv module

I have a .csv with rows that are something like:
"Review",Clean Review
"The hotel _was, re3ally good",the hotel was really good
but when I open it with Excel 2013, the cells with the double quotes show up empty on the sheet, and I can only see the text in the formula bar. Can anyone tell me why this is happening?
The Excel sheet looks roughly like this:
| |Clean Review|
| |the hotel wa|
I have opened the .csv in Notepad, and there don't seem to be any hidden characters causing this behavior.
I used Python's csv module to create the .csv.
It happens especially if you drag to widen the first column.

The problem is in your CSV file. There is a line feed (LF, ASCII 10) after the w in Review, so Review is actually on a separate line of A1 (visible if you increase the height of row 1).
Here is an analysis of the characters at the beginning of your actual CSV, showing each character and its ASCII code. Note the 10 after the w:
" 34
R 82
e 101
v 118
i 105
e 101
w 119
(LF) 10
" 34

Try
="Review",Clean Review
="The hotel _was, re3ally good",the hotel was really good
This forces Excel to treat the value as text rather than something else.
I haven't tried it, but in theory it should work.
Please check this and more solutions here
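The hidden line break identified in the first answer can also be found and removed in Python before the file is written. The following is a small sketch under that assumption; char_codes and clean_cell are made-up helper names, and the sample row imitates the header from the question with a trailing line feed.

```python
import csv
import io


def char_codes(text):
    """Return (character, code point) pairs, making hidden characters visible."""
    return [(ch, ord(ch)) for ch in text]


def clean_cell(value):
    """Collapse any run of whitespace, including CR and LF, into a single space."""
    return " ".join(value.split())


raw_row = ["Review\n", "Clean Review"]
print(char_codes(raw_row[0])[-1])  # ('\n', 10)

# Write the cleaned row; an in-memory buffer stands in for the real file here.
buffer = io.StringIO()
csv.writer(buffer).writerow([clean_cell(cell) for cell in raw_row])
print(repr(buffer.getvalue()))  # 'Review,Clean Review\r\n'
```

Because the cleaned value no longer contains a line break, the csv module has no reason to quote it. When writing to a real file, the csv documentation recommends opening it with newline='' so that no extra line endings are added.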
