Using python to autofill a word.docx from excel file

I'm about halfway through Automate the Boring Stuff with Python textbook and video tutorials, however I have a big project at work where I need to autopopulate 60 Chemical Purchase Review documents that we can't seem to find. Rather than fill them out individually, I'd like to use what I've learned so far. I've had to jump ahead in chapters, but I can't seem to figure out how to get past the last line of code.
Basically, I have an excel spreadsheet with four columns of information I need to be input into certain areas on the word document form template.
I have "AAAA, BBBB..." in the Word doc as placeholders to be found and replaced.
import openpyxl,os,docx,re
os.chdir(r'C:\Users\MYUSERNAME\OneDrive\Documents\Programming\ChemInv')
wb = openpyxl.load_workbook('cheminv.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
doc = docx.Document('ChemPurchaseForm_.docx')
fillObj = ('AAAA','BBBB','CCCC','DDDD')
for a in range(1,61):
    for b in range(1,5):
        fill = sheet.cell(row=a,column=b).value
        for x in range(len(fillObj)):
            inputRegex = re.compile(fillObj[x])
            inputRegex.sub(fill,doc)
    doc.save('ChemPurcaseForm_' + fill + '.docx')
I'm getting this error:
Traceback (most recent call last):
  File "C:/Users/MYUSERNAME/OneDrive/Documents/Programming/ChemInv/autofill.py", line 15, in <module>
    inputRegex.sub(fill,doc)
TypeError: expected string or bytes-like object
I'm assuming that either the "fill" variable or "doc" variable are not binary or string values?
Thank you in advance for help!

To debug this, you'll need to figure out which of the values are not binary or string values. A convenient way is to begin adding print statements for each value. For instance, you might try
print(fill)
print(doc)
print(type(fill))
print(type(doc))
I don't know exactly how the docx module works, but two hypotheses occur to me:
1. doc is not the appropriate type for the sub function; you'll have to cast the object to something different, or access it a different way if that's the case.
2. fill is None. That's easier to fix, it means you're not reading the Excel document properly.
Reading the docx documentation, I lean towards 1, since it doesn't look like it's a byte or string object, or a byte or string-compatible object, and so the sub method won't be able to properly operate on it; if that's correct, read the python-docx docs for more details that might help you figure out what you need to do. I'd explore what properties exist on your document, it seems there are some for directly accessing the text.
Good luck!
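
As a rough sketch of what the first hypothesis implies: re.sub works on strings, so the replacement has to be applied to each paragraph's text rather than to the Document object itself. The helper names fill_text and fill_document are made up for illustration, and "Acetone" is invented sample data. The only python-docx features assumed are doc.paragraphs and the read/write paragraph.text property. Assigning paragraph.text replaces the paragraph's runs, so character formatting in that paragraph is lost, and text inside tables is not reached through doc.paragraphs.

```python
from types import SimpleNamespace

PLACEHOLDERS = ("AAAA", "BBBB", "CCCC", "DDDD")

def fill_text(text, values):
    """Return text with each placeholder replaced by its value."""
    for placeholder, value in zip(PLACEHOLDERS, values):
        # Cell values may be numbers or None, so convert before replacing.
        text = text.replace(placeholder, "" if value is None else str(value))
    return text

def fill_document(doc, values):
    """Rewrite, in place, every paragraph that contains a placeholder."""
    for paragraph in doc.paragraphs:
        if any(p in paragraph.text for p in PLACEHOLDERS):
            paragraph.text = fill_text(paragraph.text, values)

# Stand-in object so the sketch runs without a Word file; with python-docx
# this would be docx.Document('ChemPurchaseForm_.docx').
doc = SimpleNamespace(paragraphs=[SimpleNamespace(text="Chemical: AAAA, Qty: BBBB")])
fill_document(doc, ["Acetone", 4, None, None])
print(doc.paragraphs[0].text)  # Chemical: Acetone, Qty: 4
```

In the loop from the question, the template would also need to be reloaded for every row, because the placeholders are gone once the first row has been filled in.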

Related

Creating a dictionary from a csv file in Python by using own function

I got completely lost in figuring out this problem below. Here is the question:
country_population_data.csv
how the csv looks like
extract only the country name and its population from the csv. file (e.g., 'China', 14442161070)
create an empty dictionary named pop_dict. Then read the country_population_data.csv file, as a list of lines.
for each line of the records, extract a tuple of country name and population, then store it into the empty dictionary.
*requirement: create own function and use it
The answer should look like:
{'China': 14442161070, 'India': 13934090380, ...
My first approach was making a function to extract the required items from the csv file as a tuple, but somehow it did not work out and gave me this error.
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
#function to split items
with open(csv, 'r') as f:
    def str_to_tuple(f):
        str_splitted = tuple(f.split(","))
        result_tuple = str_splitted.str()[1] + str_splitted.int()[-1]
        return(result_tuple)
    print(str_to_tuple(f))
And I also was not sure how to put extracted values in a new dictionary. Could anyone help me with this question? It has been just a couple of weeks for me to learn python so bear with my poor codes and explanation.
Any feedback & comments & tips are welcome to get used to this python world!

As this is a question that is part of a course, I will refrain from simply giving you the answer. Instead, I will try to give you some hints, which might help you find the answer by yourself.
My suggestion is that you start with trying to answer the following questions:
How should you call a function in Python? Is that how you do it in your script? Hint: probably not ;) If not, how would you fix this?
What is the type of f (i.e. print(type(f)) to find out)? Did you expect that to be the type of f? Hint: probably not ;) What do you expect the type of f to be? Perhaps we need to call .split(",") on a different variable, one that perhaps doesn't exist yet?
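
To make the second hint concrete without solving the exercise, here is a small sketch of the difference between a file object and the lines it contains. The in-memory file and its column layout (rank, country, population) are assumptions for illustration, not the real contents of country_population_data.csv.

```python
import io

# Stand-in for open(...): an in-memory file with two made-up lines.
f = io.StringIO("1,China,14442161070\n2,India,13934090380\n")

# The file object itself has no split method, hence the AttributeError.
print(hasattr(f, "split"))  # False

# Each line read from the file is a str, and str does have split.
lines = f.readlines()
first = lines[0].strip().split(",")
print(type(lines[0]).__name__, first)
```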

Reading Python CDispatch object only gives the first line

first question coming up...have just started using Python 3.6. I am creating an XML format document of tabulated data. The document object itself has a collection called CellValues. Using Dimensions (aka Unicom Intelligence) I can read this collection as a record set and loop round it with .movenext() etc.
However when I read it in Python with:
rs=tomdoc.tables["T0"].cellvalues()
for val in rs:
    print(val)
I only see the first line. In contrast, when I connect to a SQL database, the returned object is a SQLrows type and prints the whole thing, but this one says it's CDispatch.
How can I get it to either loop round or show me the whole recordset?
Apologies for my ignorance and thanks in advance :)

Thanks to a colleague, I do now have a working process.
In fact the collection needs to be indexed this way:
rs = tomdoc.Tables("T0").CellValues
Then it can be read as one normally would read a SQL-type record set:
rs.MoveFirst()
while not rs.EOF:
    rowStr = ""
    for f in rs.Fields:
        rowStr += f.value + "\t"
    print(rowStr)
    rs.MoveNext()
I'm not sure why the ["T0"] gave me the first line though - that threw me somewhat and made it look closer than it actually was (one of those jolly things one encounters when mixing objects) so I didn't investigate alternatives for that part of the script :(

Python, how to insert value in Powerpoint template?

I want to use an existing powerpoint presentation to generate a series of reports:
In my imagination the powerpoint slides will have content in such or similar form:
Date of report: {{report_date}}
Number of Sales: {{no_sales}}
...
Then my python app opens the powerpoint, fills in the values for this report and saves the report with a new name.
I googled, but could not find a solution for this.
There is python-pptx out there, but this is all about creating a new presentation and not inserting values in a template.
Can anybody advise?

Ultimately, unless some other library offers "Find" functionality for PowerPoint, you need a brute-force approach: iterate the Slides collection and each Slide's Shapes collection to identify the matching shape. Here is brute force using only win32com:
from win32com import client

find_date = r'{{report_date}}'
find_sales = r'{{no_sales}}'
report_date = '01/01/2016' # Modify as needed
no_sales = '604' # Modify as needed
path = 'c:/path/to/file.pptx'
outpath = 'c:/path/to/output.pptx'

ppt = client.Dispatch("PowerPoint.Application")
pres = ppt.Presentations.Open(path, WithWindow=False)
for sld in pres.Slides:
    for shp in sld.Shapes:
        # pictures, lines etc. have no text to search
        if not shp.HasTextFrame:
            continue
        tr = shp.TextFrame.TextRange
        # separate checks, so a shape holding both placeholders gets both filled
        if find_date in tr.Text:
            tr.Replace(find_date, report_date)
        if find_sales in tr.Text:
            tr.Replace(find_sales, no_sales)
pres.SaveAs(outpath)
pres.Close()
ppt.Quit()
If these strings are inside other strings with mixed text formatting, it gets trickier to preserve existing formatting, but it should still be possible.
If the template file is still in design and subject to your control, I would consider giving the shape a unique identifier like a CustomXMLPart or you could assign something to the shapes' AlternativeText property. The latter is easier to work with because it doesn't require well-formed XML, and also because it's able to be seen & manipulated via the native UI, whereas the CustomXMLPart is only accessible programmatically, and even that is kind of counterintuitive. You'll still need to do shape-by-shape iteration, but you can avoid the string comparisons just by checking the relevant property value.

I tried this on a ".pptx" file I had hanging around.
A Microsoft Office PowerPoint ".pptx" file is in ".zip" format.
When I unzipped my file, I got an ".xml" file and three directories.
My ".pptx" file has 116 slides, made up of 3,477 files in 22 directories/subdirectories.
Normally, I would say it is not workable, but since you have only two short changes you could probably figure out what to change and zip the files to make a new ".pptx" file.
A warning: there are some xml blobs of binary data in one or more of the ".xml" files.
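
A minimal sketch of that unzip, edit, re-zip idea using only the standard library. The function name fill_pptx_template is hypothetical, and the sketch assumes the slide XML sits under ppt/slides/ and that each placeholder appears as one unbroken string in the XML. PowerPoint sometimes splits text across several runs, in which case a plain string replace will not find it.

```python
import zipfile
from xml.sax.saxutils import escape

def fill_pptx_template(src, dst, replacements):
    """Copy archive src to dst, substituting placeholders in slide XML."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            # Only touch slide parts; everything else is copied unchanged.
            if item.filename.startswith("ppt/slides/") and item.filename.endswith(".xml"):
                text = data.decode("utf-8")
                for old, new in replacements.items():
                    # Escape &, < and > so the result stays well-formed XML.
                    text = text.replace(old, escape(new))
                data = text.encode("utf-8")
            zout.writestr(item, data)
```

It could be called as fill_pptx_template('template.pptx', 'report.pptx', {'{{report_date}}': '01/01/2016'}), where both file names are placeholders for your own paths.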

You can definitely do what you want with python-pptx, just perhaps not as straightforwardly as you imagine.
You can read the objects in a presentation, including the slides and the shapes on the slides. So if you wanted to change the text of the second shape on the second slide, you could do it like this:
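from pptx import Presentation
prs = Presentation('template.pptx')  # hypothetical path to your template
slide = prs.slides[1]
shape = slide.shapes[1]
shape.text = 'foobar'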
The only real question is how you find the shape you're interested in. If you can make non-visual changes to the presentation (template), you can determine the shape id or shape name and use that. Or you could fetch the text for each shape and use regular expressions to find your keyword/replacement bits.
It's not without its challenges, and python-pptx doesn't have features specifically designed for this role, but based on the parameters of your question, this is definitely a doable thing.
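
The regular-expression route might look roughly like this. The names render and fill_presentation are invented for the sketch; the python-pptx attributes it relies on are prs.slides, slide.shapes, shape.has_text_frame, text_frame.paragraphs, paragraph.runs and run.text. Replacing run by run keeps each run's formatting, but a placeholder that PowerPoint has split across two runs will not be matched.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render(text, values):
    """Replace {{name}} with values[name]; unknown names are left as-is."""
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), text)

def fill_presentation(prs, values):
    """Fill placeholders in every text run of every shape, in place."""
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue  # pictures, connectors and the like
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    run.text = render(run.text, values)

print(render("Date of report: {{report_date}}", {"report_date": "01/01/2016"}))
# Date of report: 01/01/2016
```

With python-pptx installed, prs would come from Presentation(...) and be written out afterwards with prs.save(...) under a new name.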

How to 'flatten' lines from text file if they meet certain criteria using Python?

To start, I am a complete newcomer to Python and to programming in anything other than web languages.
So, I have developed a script using Python as an interface between a piece of Software called Spendmap and an online app called Freeagent. This script works perfectly. It imports and parses the text file and pushes it through the API to the web app.
What I am struggling with is that Spendmap exports multiple lines per order, whereas Freeagent wants one line per order. So I need to add up the cost values from any orders spread across multiple lines and then 'flatten' the lines into one so it can be sent through the API. The 'key' field is the 'PO' field. So if the script sees any matching PO numbers, I want it to flatten them as per above.
This is a 'dummy' example of the text file produced by Spendmap:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
The above has been formatted for easier reading and normally is just one line after the next with no text formatting.
The 'key' or PO field is the fourth field (e.g. P000002) and the cost to be totalled is the sixth field (e.g. 42.000). So if this example was to be passed through the script I'd expect the first row to be left alone, the second and third row costs to be added as they're both from the same PO number, and the fourth row to be left alone.
Expected result:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,401.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
Any help with this would be greatly appreciated and if you need any further details just say.
Thanks in advance for looking!

I won't give you the solution. But you should:
Write and test a regular expression that breaks the line down into its parts, or use the CSV library.
Parse the numbers out so they're decimal numbers rather than strings
Collect the lines up by ID. Perhaps you could use a dict that maps IDs to lists of orders?
When all the input is finished, iterate over that dict and add up all orders stored in that list.
Make a string format function that outputs the line in the expected format.
Maybe feed the output back into the input to test that you get the same result. Second time round there should be no changes, if I understood the problem.
Good luck!
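
For readers who want to see how those steps fit together, here is one possible sketch. It assumes every record is a data line followed by exactly one COMMENT line, as in the sample, and it uses a plain split(",") instead of the csv module for brevity. The function name flatten_orders is made up.

```python
from decimal import Decimal

def flatten_orders(lines):
    """Merge records that share a PO number, summing their cost field."""
    merged = {}  # PO number -> [fields, comment line, running total]
    # Records are assumed to come in pairs: a data line, then a COMMENT line.
    for data_line, comment_line in zip(lines[0::2], lines[1::2]):
        fields = data_line.split(",")
        po, cost = fields[3], Decimal(fields[5])
        if po in merged:
            merged[po][2] += cost
        else:
            merged[po] = [fields, comment_line, cost]
    out = []
    # Dicts keep insertion order, so POs come out in order of first appearance.
    for fields, comment_line, total in merged.values():
        out.append(",".join(fields[:5] + [f"{total:.3f}"] + fields[6:]))
        out.append(comment_line)
    return out
```

Decimal is used rather than float so that sums of currency amounts stay exact. When two lines share a PO, the other fields and the comment are taken from the first one.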

I would use a dictionary to compile the lines, using get(key,0.0) to sum values if they exist already, or start with zero if not:
InputData = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP COMMENT,002143"""
OutD = {}
ValueD = {}
for Line in InputData.split('\n'):
    # commas in comments won't matter because we are joining after anyway
    Fields = Line.split(',')
    PO = Fields[3]
    Value = float(Fields[5])
    # set up the output string with a placeholder for .format()
    OutD[PO] = ",".join(Fields[:5] + ["{0:.3f}"] + Fields[6:])
    # add the value to the old value or to zero if it is not found
    ValueD[PO] = ValueD.get(PO,0.0) + Value
# on Python 3.7+ dicts keep insertion order, so POs print in order of first appearance
for POKey in ValueD:
    print(OutD[POKey].format(ValueD[POKey]))
P.S. Yes, I know Capitals are for Classes, but this makes it easier to tell what variables I have defined...

openpyxl please do not assume text as a number when importing

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:
I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I say yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned/offered to convert it.
So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.
How can I prevent this from happening? I do not want text that looks like "numbers stored as text" to be converted to numbers. It is text unless I say so.
So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).
I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk
So, any suggestions?

If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:
diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@
             data_type = element.get('t', 'n')
             if data_type == Cell.TYPE_STRING:
                 value = string_table.get(int(value))
-
-            ws.cell(coordinate).value = value
+                ws.cell(coordinate).set_value_explicit(value=value,
+                                                data_type=Cell.TYPE_STRING)
+            else:
+                ws.cell(coordinate).value = value
             # to avoid memory exhaustion, clear the item after use
             element.clear()
Cell.value is a property; on assignment it calls Cell._set_value, which in turn calls Cell.bind_value, which according to the method's docstring will: "Given a value, infer type and display options". Since the types of the values are in the XML file, those should be used (here I only do that for strings) instead of doing something 'smart'.
As you can see from the code, the test whether it is a string was already there.
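
The difference between inferring a type and honouring the declared one can be illustrated in a few lines. This is a stand-alone toy, not openpyxl code: infer_value and typed_value are made-up names, and the type codes 'n' (number) and 's' (string) are assumed to mirror the 't' attribute read in the patch above.

```python
def infer_value(raw):
    """Guess a type from the text alone, as a 'smart' reader might."""
    try:
        return float(raw)
    except ValueError:
        return raw

def typed_value(raw, data_type):
    """Honour the declared type; only numeric cells ('n') are converted."""
    return float(raw) if data_type == "n" else raw

print(infer_value("5E12"))       # 5000000000000.0
print(typed_value("5E12", "s"))  # 5E12
```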
