Generating organized excel spreadsheet from a word document

I have a Microsoft Word document that we want to transfer to Excel. Every sentence needs to be separated and then pasted into the next appropriate cell in Excel. Each sentence also needs to be classified as a heading, a requirement, or informational.
Here is what the typical Word format looks like:
2.3.4 Lightening Transient Response
The device shall meet spec 24532. Voltage must resemble figure.
Figure 1.
which translates to
<numbering> <Heading>
<Requirements/information>
In Excel that is almost exactly how I would like the document to look, except that the second requirement sentence should be in the row just below the previous requirement sentence.
2.3.4 | Lightening Transient Response | Heading
      | The device shall meet spec 24532. | Requirement
      | Voltage must resemble figure. | Requirement
      | Figure 1. | Informational
I have attempted this project in Python using the openpyxl and docx modules. I have code that reads sentences out of the Word document and code that analyzes each sentence. I'm retrieving runs from paragraphs. The problem is that not all sentences come back, because of how the Word document is formatted; typically I only get the headings. The heading numbers are not stored in runs. The requirements underneath the headings are stored in tables. I have written some code to go into the tables and extract the text from cells, which is one way to get the requirements, but that snippet is giving me problems (it returns the same sentence three times in a row).
I'm looking for other possible ways to do this. I'm thinking of a format switch: XML has been mentioned, and converting to PDF and using one of Python's PDF modules may also be possible.
Any thoughts or advice would be greatly appreciated.
-Chris

XML is going to be harder, not easier. You're closer than you seem to think. I recommend attacking each problem separately until you crack it.
The sentence-three-times problem in the table is caused by merged cells. python-docx treats a table as an underlying grid of x rows and y columns. If two side-by-side cells are merged, you get the same result for both of those grid positions. You can detect this by comparing the two cells for equality, roughly like "if this_cell == last_cell, skip this cell".
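A minimal sketch of that skip-the-repeat check, written in plain Python so it runs without a document. The helper name `skip_repeats` and its `key` parameter are made up for illustration and are not part of python-docx. With python-docx you would pass something like `row.cells`; comparing on the private `_tc` attribute (`key=lambda c: c._tc`) is an assumption about how merged cells share their underlying element, so verify it against your installed version.

```python
def skip_repeats(cells, key=lambda cell: cell):
    """Yield each cell, skipping any cell that matches the one just before it."""
    sentinel = object()
    last = sentinel
    for cell in cells:
        current = key(cell)
        if last is not sentinel and current == last:
            continue  # same merged cell reported again for the next grid position
        last = current
        yield cell


# Stand-in for the text of four grid positions, three of which are one merged cell.
row_texts = ["The device shall meet spec 24532."] * 3 + ["Figure 1."]
unique_texts = list(skip_repeats(row_texts))
print(unique_texts)
```

Comparing on text alone, as in this demo, would also drop two genuinely separate neighbouring cells that happen to hold the same sentence, which is why comparing the cell objects themselves is the safer choice.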
There's no way around the heading problem. Heading numbers only exist inside a running instance of Word; they are generated at display (or print) time. To get those you need to use the same rules to generate your own numbers. So you'd need to keep track of the number of headings you've passed through etc. and form your own dot-separated numbering.
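One way to keep track of the headings you have passed is a list of counters, one per level. This is only a sketch: it assumes you can already work out each heading's level (for example from a style name such as "Heading 2") and that the document uses plain decimal numbering starting at 1. The function name `number_headings` is illustrative, and Word's real numbering definitions can differ.

```python
def number_headings(levels):
    """Return dot-separated numbers for a sequence of heading levels (1 = top level)."""
    counters = []
    numbers = []
    for level in levels:
        # Pad with zeros if the document jumps to a deeper level than seen so far.
        while len(counters) < level:
            counters.append(0)
        # A heading at this level resets all deeper counters.
        del counters[level:]
        counters[level - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers


numbers = number_headings([1, 2, 2, 3, 1, 2])
print(numbers)
```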

Why are you using Python for this? Just use VBA, since you are working with Excel and Word.
Something like this should get you pretty close to where you want to be. It may need some tweaking...
Sub Demo()
    Dim wdApp As Word.Application
    Set wdApp = Word.Application
    Dim wdDoc As Word.Document
    Set wdDoc = wdApp.ActiveDocument
    ' Copy the whole Word document and paste it into the active sheet
    wdDoc.Range.Copy
    ActiveSheet.Paste Destination:=ActiveSheet.Range("A1")
    With ActiveSheet
        .Paste Destination:=Range("A" & .Cells.SpecialCells(xlCellTypeLastCell).Row + 1)
    End With
    Set myRange = Range("A1:A100")
    For i = 1 To myRange.Rows.Count
        ' Split a cell in two where a second sentence starts with "Voltage"
        If InStr(myRange.Cells(i, "A").Value, "Voltage") > 0 Then
            myRange.Cells(i, "A").Offset(1, 0).Select
            ActiveCell.EntireRow.Insert
            ActiveCell.Offset(-1, 0).Select
            position1 = InStr(1, ActiveCell.Value, "Voltage")
            myRange.Cells(i + 1, "A").Value = Mid(ActiveCell.Value, position1, 99)
            ActiveCell.Value = Left(ActiveCell.Value, position1 - 2)
            i = i + 2
        End If
    Next i
End Sub
So, copy the text from your Word doc, which should be open and active, and you're good to go. There are other ways to do this too.

Related

Cannot identify string as part of set

I am trying to automatically detect the starting point for parsing some data into a list - specifically I want to start collecting only integers. I have a lot of data to go through and the files can be .txt or .csv usually starting with several rows of information on the data collection itself.
I am using a function to check if the data contains only integers / delimiters like so:
def check_str(line):
    # Characters allowed in a data row: digits, signs, exponent markers and delimiters
    allow = set("0123456789\t eE-+.,;")
    return set(line).issubset(allow)
Now, this works for test strings like:
test = '435.76 568.21'
But it doesn't work on the actual data from the .txt file:
341.09 75.97
341.46 75.97
341.84 75.97
342.22 17.28
342.59 8.91
[Image: output of check_str on the actual data]
I'm not really sure what the problem is, I have included tab and white space in my set so I cannot understand why it doesn't work.

The problem was that each row contains '\r\n' at the end:
i.e.
'1.301237305e+03,4.503491255e-06\r\n'
Adding '\r' and '\n' to the allowed set fixed the issue.
Thanks to Andrej Kesely for some insight.
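For reference, here is a sketch of a variant that strips the line ending before checking instead of adding '\r' and '\n' to the set. This is a swapped-in alternative, not the exact fix described above, and the header line in the demo is invented test data.

```python
ALLOWED = set("0123456789\t eE-+.,;")


def check_str(line):
    """Return True if the line holds only numeric characters and delimiters."""
    # Strip trailing CR/LF so Windows and Unix line endings are both accepted.
    return set(line.rstrip("\r\n")).issubset(ALLOWED)


result_data = check_str("1.301237305e+03,4.503491255e-06\r\n")
result_header = check_str("Wavelength (nm)\tIntensity\r\n")  # made-up header row
print(result_data, result_header)
```

Note that a blank line also passes this check, because the empty set is a subset of anything, so you may want to skip empty lines separately.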

python Win32 Excel is cell a range

I am writing a bit of Python code to automate the manipulation of Excel spreadsheets. The idea is to use spreadsheet templates to create daily reports. Saw this idea working several years ago using Perl. Anyway.
Here are the simple rules:
Sheets within the workbook are processed in the order they appear.
Within the sheets, cells are processed left to right, then top to bottom.
There are names defined which are single cell ranges, can contain static values or the results of queries. Cells can contain comments which contain SQL queries to run. ...
Here is the problem, as I process the cells I need to check if the cell has an attached comment and if the cell has a name. I am able to handle processing the attached cell comments. But I can not figure out how to determine if a cell is within a named range. In my case the single cell within the range.
I saw a posting that suggested this would work:
cellName = ws.ActiveCell.Name.Name
No luck.
Does anybody have any idea how to do this?
I am so close but no cigar.
Thanks for your attention to this matter.
KD

What you may consider doing is first building a list of all addresses of names in the worksheet, and checking the address of each cell against the list to see if it's named.
In VBA, you obtain the names collection (all the names in a workbook) this way:
Set ns = ActiveWorkbook.Names
You can determine if the names are pointed toward part of the current sheet, and a single cell, this way:
shname = ActiveSheet.Name
ReDim SheetNamedCellAddresses(1 To ns.Count) As String
i = 1
For Each n In ns
    If Split(n.Value, "!")(0) = "=" & shname And InStr(n.Value, ":") = 0 Then
        ' The name Value is something like "=Sheet1!A1"
        ' If there is no colon, it is a single cell, not a range of cells
        SheetNamedCellAddresses(i) = Split(n.Value, "=")(1) ' Add the address to your array, remove the "="
        i = i + 1
    End If
Next
So now you have a string array containing the addresses of all the named cells in your current sheet. Move that array into a python list and you are good to go.
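On the Python side, the same filtering can be sketched as a dictionary lookup. The function name `single_cell_names` and the sample names are hypothetical. How you collect the `(name, refers_to)` pairs through pywin32 is not shown and is untested here. The sketch also assumes references of the form `=Sheet1!$B$2`. Sheet names containing spaces are quoted in such references and would need extra handling.

```python
def single_cell_names(names, sheet_name):
    """Map cell address -> defined name for names that point at one cell on sheet_name.

    `names` is an iterable of (name, refers_to) pairs, e.g. ("Total", "=Sheet1!$B$2").
    """
    lookup = {}
    prefix = "=" + sheet_name + "!"
    for name, refers_to in names:
        # No colon means a single cell rather than a multi-cell range.
        if refers_to.startswith(prefix) and ":" not in refers_to:
            address = refers_to[len(prefix):].replace("$", "")
            lookup[address] = name
    return lookup


names = [
    ("Total", "=Sheet1!$B$2"),
    ("Block", "=Sheet1!$A$1:$C$3"),
    ("Other", "=Sheet2!$B$2"),
]
lookup = single_cell_names(names, "Sheet1")
print(lookup.get("B2"), lookup.get("A1"))
```

Checking a cell then becomes a dictionary lookup on its address, with no exception handling per cell.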

OK so it errors out if the cell does NOT have a range name. If the cell has a range name the following bit of code returns the name: Great success!!
ws.Cells(r, c).Activate()
cell = xlApp.ActiveCell  # separate variable so the column index c is not overwritten
cellName = cell.Name.Name
If there is no name associated with the cell, an exception is tossed.
So even in VBA you would have to wrap this bit of code in exception code. Sounds expensive to me to use exception processing for this call.
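For completeness, the exception wrapper could look like the sketch below. The helper name `cell_name_or_none` is made up, and catching the broad `Exception` is a simplification: with pywin32 the error is expected to be a COM error, which you could catch more narrowly.

```python
def cell_name_or_none(cell):
    """Return the defined name of a cell object, or None if the lookup raises."""
    try:
        return cell.Name.Name
    except Exception:  # assumption: pywin32 raises a COM error when the cell has no name
        return None
```

Calling it as `cell_name_or_none(xlApp.ActiveCell)` keeps the try/except out of the main cell loop.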

how can i change the size of a table in word using python (pywin32)

ms word table with python
I am working with Python on Word tables. I am generating tables, but all of them are set to AutoFit to window.
Is it possible to change them to AutoFit to contents?
I tried something like this:
table = location.Tables.Add(location,len(df)+1,len(df.columns))
table.AutoFit(AutoFitBehavior.AutoFitToContents)
but it keeps raising errors.

You want to change your table creation to use this:
# Add two ones after your columns
table = location.Tables.Add(location,len(df)+1,len(df.columns),1,1)
Information about why you need those variables can be read here:
http://msdn.microsoft.com/en-us/library/office/ff845710(v=office.15).aspx
But basically, the default behavior is to disable Cell Autofitting and Use Table Autofit to Window. The first "1" enables Cell Autofitting. From the link I posted above, the DefaultTableBehavior can either be wdWord8TableBehavior (Autofit disabled --default) or wdWord9TableBehavior (Autofit enabled). The number comes from opening up Word's VBA editor and typing in the Immediate Window:
?Word.wdWord9TableBehavior
Next, from the link, we see another option called AutoFitBehavior. This is defined as:
Sets the AutoFit rules for how Word sizes tables. Can be one of the WdAutoFitBehavior constants.
So now we have another term to look up. In the VBA editor's Immediate window again type:
?Word.wdAutoFitBehavior.
After the last dot, the possible options should appear. These will be:
wdAutoFitContent
wdAutoFitFixed
wdAutoFitWindow
AutoFitContent looks to be the option we want, so let's finish up that previous line with:
?Word.wdAutoFitBehavior.wdAutoFitContent
The result will be a "1".
Now you may ask why we have to go through all this trouble to find the numerical values of the constants. In my experience using pywin32 with Excel, most of the time you can't get the built-in values from their names, but passing the numerical value works just the same.
One more reason your code may be failing is that the table object may not have a method called "AutoFit".
I'm using Word 2007, and Table has the function, AutoFitBehavior.
So change:
table.AutoFit(AutoFitBehavior.AutoFitToContents)
To:
table.AutoFitBehavior(1)
# The 1 is the value of wdAutoFitContent
Hope I got it right, and this helps you out.
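Putting the two changes together, a sketch might look like this. The function name `add_autofit_table` and the constant names are illustrative. `location` is assumed to be a Word Range object obtained through pywin32, and the code has not been run against a live Word instance here.

```python
# Numeric values of the Word constants discussed above.
WD_WORD9_TABLE_BEHAVIOR = 1  # wdWord9TableBehavior: AutoFit enabled
WD_AUTOFIT_CONTENT = 1       # wdAutoFitContent: size the table to its contents


def add_autofit_table(location, n_rows, n_cols):
    """Add a table at `location` and ask Word to fit it to its contents."""
    table = location.Tables.Add(
        location, n_rows, n_cols, WD_WORD9_TABLE_BEHAVIOR, WD_AUTOFIT_CONTENT
    )
    # Setting the behavior again on the table itself is redundant but harmless.
    table.AutoFitBehavior(WD_AUTOFIT_CONTENT)
    return table
```

With the DataFrame from the question, the call would be `add_autofit_table(location, len(df) + 1, len(df.columns))`.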

String count - identify key words and dismissing compound words

I have created a program for an assignment which reads a txt file and returns key words. My program returns the key words, but there's one issue with one of the words, 'data'. I should only get 6 results for this but I am getting 7. The reason, I assume, is that there is a compound word present in the text, 'data - analytics'. The program seems to be picking this up and counting it in the final result. Is there anything I could insert into the end of my code to dismiss this?
import string
text = open('news1.txt').read()+open ('news2.txt').read()
print 'data:', string.count(text, 'data')

It's hard to be sure without seeing your actual input files, but there's one obvious possibility:
news1.txt:
data data data dat
news2.txt:
a data data data
There are only 6 instances of the word "data" in the files. But if you concatenate the files, you get this:
data data data data data data data
… and you will count 7 instead of 6.
And it's perfectly plausible that your teacher gave you files that look like that in order to catch exactly this kind of bug. Edge cases that don't come up very often in the wild, and that you didn't think to test for, are exactly the kinds of things that cost you months of frustration—trying to drag repro information out of users, debugging the program, etc. It's a good lesson to learn early on in your programming life.
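If that is what is happening, counting each file separately avoids the false match. Below is a sketch in Python 3 (the question uses Python 2). The helper name `count_word` is made up, and the two strings stand in for the file contents from the example above.

```python
import re


def count_word(texts, word):
    """Count whole-word occurrences of `word`, treating each text separately."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(len(pattern.findall(text)) for text in texts)


news1 = "data data data dat"
news2 = "a data data data"
separate = count_word([news1, news2], "data")
joined = count_word([news1 + news2], "data")
print(separate, joined)
```

Note that `\b` treats a hyphen as a word boundary, so "data" inside "data-analytics" would still be counted; excluding compounds needs an extra rule.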

openpyxl please do not assume text as a number when importing

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:
I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I say yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned/offered to convert it.
So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.
How can I prevent this from happening? I do not want text that looks like "numbers stored as text" to be converted to numbers. It is text unless I say so.
So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).
I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk
So, any suggestions?

If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:
diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@
data_type = element.get('t', 'n')
if data_type == Cell.TYPE_STRING:
value = string_table.get(int(value))
-
- ws.cell(coordinate).value = value
+ ws.cell(coordinate).set_value_explicit(value=value,
+ data_type=Cell.TYPE_STRING)
+ else:
+ ws.cell(coordinate).value = value
# to avoid memory exhaustion, clear the item after use
element.clear()
Cell.value is a property; assigning to it calls Cell._set_value, which in turn calls Cell.bind_value, which according to the method's doc does this: "Given a value, infer type and display options". Since the types of the values are in the XML file, those should be used (here I only do that for strings) instead of doing something 'smart'.
As you can see from the code, the test whether it is a string was already there.
