First question coming up... I have just started using Python 3.6. I am creating an XML-format document of tabulated data. The document object itself has a collection called CellValues. Using Dimensions (aka Unicom Intelligence) I can read this collection as a record set and loop round it with .MoveNext() etc.
However, when I read it in Python with:
rs = tomdoc.tables["T0"].cellvalues()
for val in rs:
    print(val)
I only see the first line. In contrast, when I connect to a SQL database, the returned object is a SQLrows type and prints the whole thing, but this one says it's CDispatch.
How can I get it to either loop round or show me the whole recordset?
Apologies for my ignorance and thanks in advance :)
Thanks to a colleague, I do now have a working process.
In fact the collection needs to be indexed this way:
rs = tomdoc.Tables("T0").CellValues
Then it can be read as one normally would read a SQL-type record set:
rs.MoveFirst()
while not rs.EOF:
    rowStr = ""
    for f in rs.Fields:
        rowStr += f.value + "\t"
    print(rowStr)
    rs.MoveNext()
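If this pattern recurs, one might wrap the traversal in a small helper that returns plain Python lists. This is just a minimal sketch assuming the same ADO-style MoveFirst/EOF/Fields/MoveNext API shown above (the helper name is my own):

def recordset_to_rows(rs):
    """Walk an ADO-style recordset and return a list of row-value lists."""
    rows = []
    rs.MoveFirst()
    while not rs.EOF:
        # f.value mirrors the property casing used above; adjust it if
        # your COM object exposes the property differently
        rows.append([f.value for f in rs.Fields])
        rs.MoveNext()
    return rows

rows = recordset_to_rows(tomdoc.Tables("T0").CellValues)
for row in rows:
    print("\t".join(str(v) for v in row))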
I'm not sure why the ["T0"] gave me the first line, though - that threw me somewhat and made it look closer to working than it actually was (one of those jolly things one encounters when mixing objects), so I didn't investigate alternatives for that part of the script :(
I'm about halfway through the Automate the Boring Stuff with Python textbook and video tutorials, but I have a big project at work where I need to auto-populate 60 Chemical Purchase Review documents that we can't seem to find. Rather than fill them out individually, I'd like to use what I've learned so far. I've had to jump ahead in chapters, but I can't seem to figure out how to get past the last line of code.
Basically, I have an Excel spreadsheet with four columns of information that need to be input into certain areas on the Word document form template.
I have "AAAA, BBBB..." in the Word doc as something to be found and replaced.
import openpyxl, os, docx, re

os.chdir(r'C:\Users\MYUSERNAME\OneDrive\Documents\Programming\ChemInv')
wb = openpyxl.load_workbook('cheminv.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
doc = docx.Document('ChemPurchaseForm_.docx')
fillObj = ('AAAA', 'BBBB', 'CCCC', 'DDDD')

for a in range(1, 61):
    for b in range(1, 5):
        fill = sheet.cell(row=a, column=b).value
        for x in range(len(fillObj)):
            inputRegex = re.compile(fillObj[x])
            inputRegex.sub(fill, doc)
    doc.save('ChemPurcaseForm_' + fill + '.docx')
I'm getting this error:
Traceback (most recent call last):
  File "C:/Users/MYUSERNAME/OneDrive/Documents/Programming/ChemInv/autofill.py", line 15, in <module>
    inputRegex.sub(fill,doc)
TypeError: expected string or bytes-like object
I'm assuming that either the "fill" variable or the "doc" variable is not a string or bytes-like value?
Thank you in advance for help!
To debug this, you'll need to figure out which of the values is not a string or bytes-like object. A convenient way is to add print statements for each value. For instance, you might try:
print(fill)
print(doc)
print(type(fill))
print(type(doc))
I don't know exactly how the docx module works, but two hypotheses occur to me:
1. doc is not the appropriate type for the sub function; if that's the case, you'll have to cast the object to something different, or access it a different way.
2. fill is None. That's easier to fix; it means you're not reading the Excel document properly.
Reading the docx documentation, I lean towards 1: a Document doesn't look like a byte or string object, or a byte- or string-compatible object, so the sub method won't be able to operate on it. If that's correct, read the python-docx docs for details that might help you figure out what you need to do. I'd explore what properties exist on your document; it seems there are some for directly accessing the text.
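For example, since re.sub operates on strings rather than Document objects, one approach is to run the replacement over each paragraph's text. This is only a minimal sketch assuming python-docx's paragraphs/text properties, and assuming your placeholders live in plain paragraphs rather than tables; the function name and sample value are my own:

import docx

def replace_in_doc(doc, placeholder, value):
    # paragraph.text is a plain string, which is what re.sub (or plain
    # str.replace) actually expects; assigning back to it replaces the
    # paragraph's runs (and drops any character-level formatting)
    for paragraph in doc.paragraphs:
        if placeholder in paragraph.text:
            paragraph.text = paragraph.text.replace(placeholder, value)

doc = docx.Document('ChemPurchaseForm_.docx')
replace_in_doc(doc, 'AAAA', 'some cell value')  # hypothetical value
doc.save('ChemPurchaseForm_filled.docx')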
Good luck!
There's some weird mysterious behavior here.
EDIT This has gotten really long and tangled, and I've edited it like 10 times. The TL/DR is that in the course of processing some text, I've managed to write a function that:
works on individual strings of a list
throws a variety of errors when I try to apply it to the whole list with a list comprehension
throws similar errors when I try to apply it to the whole list with a loop
after throwing those errors, stops working on the individual strings until I re-run the function definition and feed it some sample data, then it starts working again, and finally
turns out to work when I apply it to the whole list with map().
There's an ipython notebook saved as html which displays the whole mess here: http://paul-gowder.com/wtf.html ---I've put a link at the top to jump past some irrelevant stuff. I've also made a[nother] gist that just has the problem code and some sample data, but since this problem seems to throw around a bunch of state somehow, I can't guarantee it'll be reproducible from it: https://gist.github.com/paultopia/402891d05dd8c05995d2
End TL/DR, begin mess
I'm doing some toy text-mining on that old Enron dataset, and I have the following set of functions to clean up the emails preparatory to turning them into a document-term matrix, after loading NLTK stopwords and such. The following uses the email library in Python 2.7:
def parseEmail(document):
    # strip unnecessary headers, header text, etc.
    theMessage = email.message_from_string(document)
    tofield = theMessage['to']
    fromfield = theMessage['from']
    subjectfield = theMessage['subject']
    bodyfield = theMessage.get_payload()
    wholeMsgList = [tofield, fromfield, subjectfield, bodyfield]
    # get rid of any fields that don't exist in the email
    cleanMsgList = [x for x in wholeMsgList if x is not None]
    # now return a string with all that stuff run together
    return ' '.join(cleanMsgList)

def lettersOnly(document):
    return re.sub("[^a-zA-Z]", " ", document)

def wordBag(document):
    return lettersOnly(parseEmail(document)).lower().split()

def cleanDoc(document):
    dasbag = wordBag(document)
    # get rid of "enron" for obvious reasons, also the .com
    bagB = [word for word in dasbag if not word in ['enron', 'com']]
    unstemmed = [word for word in bagB if not word in stopwords.words("english")]
    return [stemmer.stem(word) for word in unstemmed]

print enronEmails[0][1]
print cleanDoc(enronEmails[0][1])
First (T-minus half an hour) running this on an email represented as a unicode string produced the expected result: print cleanDoc(enronEmails[0][1]) yielded a list of stemmed words. To be clear, the underlying data enronEmails is a list of [label, message] lists, where label is an integer 0 or 1, and message is a unicode string. (In python 2.7.)
Then at T-minus 10, I added a couple of lines of code (since deleted and lost, unfortunately... but see below) with some list comprehensions in them, just to extract the messages from enronEmails, run my cleanup function on them, and then join them back into strings for convenient conversion into a document-term matrix via sklearn. But the function started throwing errors. So I put my debugging hat on...
First I tried rerunning the original definition and test cell. But when I re-ran that cell, my email parsing function suddenly started throwing an error in the message_from_string method:
AttributeError: 'list' object has no attribute 'message_from_string'
So that was bizarre. This was exactly the same function, called on exactly the same data: cleanDoc(enronEmails[0][1]). The function had been working, on the same data, and I hadn't changed it.
So I checked to make extra-sure I didn't mutate the data. enronEmails[0][1] was still a string. Not a list. I have no idea why the traceback was of the opinion that I was passing a list to cleanDoc(). I wasn't.
But the plot thickens
So then I went to make a gist to create a wholly reproducible example for the purpose of posting this SO question. I started with the working part. The gist: https://gist.github.com/paultopia/c8c3e066c39336e5f3c2.
To make sure it was working, first I stuck it in a normal .py file and ran it from command line. It worked.
Then I stuck it in a cell at the bottom of my ipython notebook with all the other stuff in it. That worked too.
Then I tried the parseEmail function on enronEmails[0][1]. That worked again. Then I went all the way back up to the original cell that was throwing an error not five minutes ago and re-ran it (including the import from sklearn, and including the original definition of all functions). And it freaking worked.
BUT THEN
I then went back in and tried again with the list comprehensions and such. And this time, I kept track more carefully of what was going on. Adding the following cells:
1.
def atLeastThreeString(cleandoc):
    return ' '.join([w for w in cleandoc if len(w) > 2])

print atLeastThreeString(cleanDoc(enronEmails[0][1]))
THIS works, and produces the expected output: a string with words over 2 letters. But then:
2.
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]
and all of a sudden it starts throwing a whole new error, same place in the traceback:
AttributeError: 'unicode' object has no attribute 'message_from_string'
which is extra funny, because I was passing it unicode strings a minute ago and it was doing just fine. And, just to thicken the plot, going back and rerunning cleanDoc(enronEmails[0][1]) then throws the same error.
This is driving me insane. How is it possible that creating a new list, and then attempting to run function A on that list, not only throws an error on the new list, but ALSO causes function A to throw an error on data that it was previously working on? I know I'm not mutating the original list...
I've posted the entire notebook in html form here, if anyone wants to see full code and traceback: http://paul-gowder.com/wtf.html The relevant parts start about 2/3 of the way down, at the cells numbered 24-5, where it works, and then the cell numbered 26, where it blows up.
help??
Another edit: I've added some more debugging efforts to the bottom of the above-linked html notebook. As you can see, I've traced the problem down to the act of looping, whether done implicitly in list comprehension form or explicitly. My function works on an individual item in the list of just e-mails, but then fails on every single item when I try to loop over that list, except when I use map() to do it. ???? Has the world gone insane?
I believe the problem is these statements:
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]
In Python 2, the dummy variable email leaks out into the enclosing namespace, so you are overwriting the name of the email module, and you are then trying to call a method from that module on a plain string. I don't have nltk in Python 2, so I can't test it, but I think this must be it.
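Here is a standalone demonstration of the leak, in case it helps (Python 2, hypothetical data):

import email

# in Python 2 the comprehension variable leaks into the enclosing scope,
# so this rebinds the name `email` from the module to the last item
[email for email in [u'to: a', u'to: b']]
print type(email)  # <type 'unicode'> -- the module is gone, hence
                   # "'unicode' object has no attribute 'message_from_string'"

# the fix: use a dummy variable that doesn't shadow the module
# (enronEmails as in your notebook)
justEmails = [msg[1] for msg in enronEmails]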
My question is very similar to others' here but I haven't found the exact answer I'm looking for so I hope a veteran Python user will be able to further me along.
I'm learning scripting methods for my job, but they won't send me to a training center to learn it, so my Chief Technical Officer said that I should learn how to create log files from summarized Wireshark collection reports. I've had great luck in Bash, but he wants me to become fluent in Python - without any help or background in scripting/programming, this is a difficult task. I am attempting to essentially grep from the Wireshark report to a new file, giving a count and list of occurrences of DNS traffic. The only thing is that in order to be effective, it needs to work with new data sets at every use; otherwise this is a meaningless exercise.
f1 = open('/home/user/file', 'r')
for line in f1:
    if "DNS" in line:
        print line
Two questions:
1) How would I put a count on each DNS occurrence?
2) How would I pipe/print to a new txt file?
This might be a bit more advanced; however, for file processing I really like to use generator pipelining!
# this is a generator (an iterable) which only yields a
# line containing "DNS" when one is requested in an iteration;
# the syntax used here is called a "generator expression"
dns_lines = (line for line in open('/home/user/file', 'r') if "DNS" in line)
# the with-statement uses python's magic methods to take care of
# opening and closing the file
with open("output", 'w') as f:
    # enumerate works on generators: it numbers each item that is
    # iterated over and yields a tuple (count, line)
    for count_line in enumerate(dns_lines):
        f.write("%d - %s" % count_line)
More on Generators and file processing here by David Beazley
I assumed that you want to learn a bit more about how powerful Python is; hence my long comments. :)
// Edit:
A bit more of an explanation regarding what is going to happen here:
1. The first line will just create a generator object.
2. The file-reading will start in the for-loop.
3. As soon as the iteration is started, the file will be read until a line containing "DNS" is found.
4. A tuple (count, line) will be created and handed over to this very iteration.
5. The tuple is written into the file using the format string.
6. The next iteration will take place and request the next line, which will start the file-reading again.
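To see that laziness in action, here is a tiny standalone demo (Python 2, hypothetical data):

gen = (x * 2 for x in [1, 2, 3])  # nothing is computed yet
print next(gen)  # 2 -- the first item is computed only when requested
print next(gen)  # 4 -- and so on, one item at a time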
I hope this helps!
Generators prevent loading a whole list into memory and they allow for a lot of neat tricks and pipelined processing. However, there is a lot more to them than I could mention here in this post!
You can simply initialize a new variable to count your items:
counter = 0
for line in f1:
    if 'DNS' in line:
        counter += 1
print counter
WRT saving your data, you can either do it in Python or just print out the data and redirect the output to a file:
counter = 0
data = []
for line in f1:
    if 'DNS' in line:
        counter += 1
        data.append(line)

to_s = "\n".join(data)
f = open('out.txt', 'w')
f.write(to_s)
f.close()
I'm new to programming, and also to this site, so my apologies in advance for anything silly or "newbish" I may say or ask.
I'm currently trying to write a script in Python that will take a list of items and write them into a CSV file, among other things. Each item in the list is really a list of two strings, if that makes sense. In essence, the format is [['Google', 'http://google.com'], ['BBC', 'http://bbc.co.uk']], but with different values of course.
Within the CSV, I want this to show up as the first item of each list in the first column and the second item of each list in the second column.
This is the part of my code that I need help with:
with open('integration.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', dialect='excel')
    writer.writerows(w for w in foundInstances)
For whatever reason, it seems that the delimiter is being ignored. When I open the file in Excel, each cell has one list. Using the old example, each cell would have "Google, http://google.com". I want Google in the first column and http://google.com in the second. So basically "Google" and "http://google.com", and then below that "BBC" and "http://bbc.co.uk". Is this possible?
Within my code, foundInstances is the list in which all the items are contained. As a whole, the script works fine, but I cannot seem to get this last step. I've done a lot of looking around within stackoverflow and the rest of the Internet, but I haven't found anything that has helped me with this last step.
Any advice is greatly appreciated. If you need more information, I'd be happy to provide you with it.
Thanks!
In your code on pastebin, the problem is here:
foundInstances.append(['http://' + str(num) + 'endofsite' + ', ' + desc])
Here, for each row in your data, you create one string that already has a comma in it. That is not what the csv module needs; the csv module makes comma-delimited strings out of your data. You need to give it the data as a simple list of items, [col1, col2, col3]. What you are doing is ["col1, col2, col3"], which has already packed the data into a single string. Try this:
foundInstances.append(['http://' + str(num) + 'endofsite', desc])
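Putting it together with the sample data from the question, a minimal sketch (Python 2, hence the 'wb' mode):

import csv

foundInstances = [['Google', 'http://google.com'],
                  ['BBC', 'http://bbc.co.uk']]

with open('integration.csv', 'wb') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerows(foundInstances)

# integration.csv now contains:
# Google,http://google.com
# BBC,http://bbc.co.uk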
I just tested the code you posted with
foundInstances = [[1,2],[3,4]]
and it worked fine. It definitely produces the output csv in the format
1,2
3,4
So I assume that your foundInstances has the wrong format. If you construct the variable in a complex manner, you could try to add
import pdb; pdb.set_trace()
before the actual variable usage in the csv code. This lets you inspect the variable at runtime with the python debugger. See the Python Debugger Reference for usage details.
As a side note, according to the PEP-8 Style Guide, the name of the variable should be found_instances in Python.
I've spent a good part of today wrestling with this one -- I'm reading data from a serial-port server device (via socket module). Data is coming in OK, and I'm trying to do simple string processing on it (confirm correct data chunk size) prior to adding a timestamp and putting the complete chunks into a dictionary, with the timestamp as the key. Here is the code:
for i in range(0, (len(rawData)+1)):
    if len(rawData[i]) == 57:
        ss2000_data[str(time.time())] = (rawData[i].split(', '))
        print ss2000_data
    else:
        continue
The dictionary processing is going OK, in that I get a valid key:value pair -- once! The loop part is not working, so no matter how much serial data I receive, I'll only get a single key:value pair.
I've scanned questions here, also at the Python.org forum, and have also gone through the docs "Learning Python", "Python Pocket Ref" and the Python Tutorial at python.org, but I'm not getting anywhere. I'm a relative noob at Python, as well. I'd appreciate any suggestions or pointers to a potential source of information.
Thanks in advance, much appreciated
(I will assume that rawData contains some lines / datagrams from a serial connection.)
time.time() is not guaranteed to provide fractions of a second. You may be processing too quickly for time.time() to return anything other than its initial value, so each new chunk overwrites the previous one under the same dictionary key. Try prepending str(i) to the key you're using to store your split data, or use another key (possibly derived from i) that is guaranteed to change with each loop.
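For example, one way to guarantee a unique key per chunk is to fold the loop index into it. A minimal sketch along those lines (the key format is my own choice):

import time

ss2000_data = {}
for i, chunk in enumerate(rawData):
    if len(chunk) == 57:
        # append the index so two chunks processed within the same
        # time.time() tick don't overwrite each other
        key = '%s-%d' % (time.time(), i)
        ss2000_data[key] = chunk.split(', ')
print ss2000_data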
If you only get 1 entry printed, that means there's only 1 entry in rawData that has a length of 57, right?
Clean the code a bit, and add some debugging. Keeping it simple and close to what you have:
for block in rawData:
    print 'Block,len=%d' % (len(block),)
    if len(block) == 57:
        ss2000_data[str(time.time())] = (block.split(', '))
        print ss2000_data
If you're expecting more than 1 entry in rawData that has a length of 57, then are you sure "data is coming in OK"?