Parsing unicode string read from a cell in an xlrd.Book object

I am trying to parse some unicode text from an Excel 2007 cell that I read with xlrd (actually xlsxrd).
For some reason xlrd attaches "text: " to the beginning of the unicode string, which makes it difficult for me to type cast. I eventually want to reverse the order of the string, since it is a name and will be put in alphabetical order with several others. Any help would be greatly appreciated, thanks.
Here is a simple example of what I'm trying to do:
>>> import xlrd, xlsxrd
>>> book = xlsxrd.open_workbook('C:\\fileDir\\fileName.xlsx')
>>> book.sheet_names()
[u'Sheet1', u'Sheet2']
>>> sh = book.sheet_by_index(1)
>>> print sh
<xlrd.sheet.Sheet object at 0x(hexaddress)>
>>> name = sh.cell(0, 0)
>>> print name
text: u'First Last'
From here I would like to parse "name", exchanging 'First' with 'Last' or just separating the two for storage in two different variables, but every attempt I have made to type cast the unicode gives an error. Perhaps I am going about it the wrong way?
Thanks in advance!

I think you may need
name = sh.cell(0,0).value
to get the unicode object. Then, to split into two variables, you can obtain a list with the first and last name, using an empty space as separator:
split_name = name.split(' ')
print split_name
This gives [u'First', u'Last']. You can easily reverse the list in place (reverse() returns None, so do not assign its result back to the variable):
split_name.reverse()
print split_name
giving [u'Last', u'First'].
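For the alphabetical ordering mentioned in the question, here is a minimal sketch in Python 3 syntax. It assumes every name has at least two space-separated parts; the helper name last_first and the sample names are made up for illustration, and the strings stand in for values you would get from sh.cell(...).value.

```python
def last_first(full_name):
    """Return 'Last, First' for a 'First Last' string (sketch, at least two parts assumed)."""
    # rsplit with maxsplit=1 keeps any middle names attached to the first name.
    first, last = full_name.strip().rsplit(" ", 1)
    return "{}, {}".format(last, first)

# Hypothetical sample values standing in for cell contents.
names = ["First Last", "Ada Lovelace", "Guido van Rossum"]
sorted_names = sorted(last_first(n) for n in names)
print(sorted_names)
```

A single-word name would make the unpacking fail with a ValueError, so real data may need a check before splitting.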

Read about the Cell class in the xlrd documentation. Work through the tutorial that you can get via www.python-excel.org.

Related

Python Dictionary as Valid jSon?

In Python I have:
dict = {}
dict['test'] = 'test'
When I print it I get:
{'test':'test'}
How can I make it like this:
{"test":"test"}
Please Note, replace won't work as test may be test't...
I tried:
dict = {}
dict["test"] = "test"

You can use json.dumps().
For example, if you use print json.dumps(dict) you should get the desired output.
Additionally, as suggested in a different related question, you may construct your own version of a dict with special printing:
How to create a Python dictionary with double quotes as default quote format?
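As a small sketch of the json.dumps() suggestion (Python 3 syntax; the variable name data is used here instead of dict to avoid shadowing the built-in, and the apostrophe value mirrors the test't case from the question):

```python
import json

data = {}
data["test"] = "test't"  # value containing an apostrophe, as in the question

# json.dumps serializes the dict to a JSON string with double-quoted keys and values.
encoded = json.dumps(data)
print(encoded)

# json.loads turns the JSON text back into an equal Python dict.
decoded = json.loads(encoded)
```

Because the quoting is handled by the serializer, no manual replace of quote characters is needed.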

Python: Pass string into function with re

I am new to Python. I created the function below to search a text file using a regex. The result is then written to an Excel sheet.
But I get the error 'NoneType' object has no attribute 'group' (which means no match was found) for this line:
b_list=re.split('\s+', str(b.group()))
However, when I run the same code outside the function, it finds the text. So it seems the values passed into the function are not working.
How do I pass strings or variables correctly into the function? Thank you.
The complete code is below.
import re
import openpyxl
def eval_text(mh, search_text, excel_sht, excel_col):
    b_regex=re.compile(r'(?<=mh ).+')
    b=b_regex.search(search_text)
    b_list=re.split('\s+', str(b.group()))
    if abs(b)>1:
        cell_b=excel_sht.cell(row=i, column=excel_col).value='OK'
    else abs(b)<1:
        cell_b=excel_sht.cell(row=i, column=excel_col).value='Not OK'
wb=openpyxl.load_workbook('test.xlsm', data_only=True, read_only=False, keep_vba=True)
sht=wb['test']
url=sht.cell(row=1, column=1).value
with open (url, 'r') as b:
    diag_text_lines=b.readlines()
    diag_text="".join(diag_text_lines)
eval_text('jame', diag_text, sht, 9)

Since the mh parameter is not used anywhere else in the function, I assume that you expected it to get automatically inserted in place of the mh in the regular expression r'(?<=mh ).+'. However, this does not happen! You have to use a format string, e.g. f'(?<={mh} ).+' (note that besides the {...} I replaced the "raw" r prefix, which you do not really need here, with f).
def eval_text(mh, search_text, excel_sht, excel_col):
    b_regex=re.compile(f'(?<={mh} ).+')
    b = b_regex.search(search_text)
    ...
For Python versions before 3.6, which do not have f-strings, use the format method instead. If the regex contains other {...} (for example a quantifier such as {2}), this might not work unless they are doubled to {{...}}. In the worst case, you can still concatenate the string yourself: r'(?<=' + mh + r' ).+' or use the old % format r'(?<=%s ).+' % mh.
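A minimal sketch of the same idea with a guard for the no-match case. The function name find_after and the sample text are made up for illustration, and the re.escape call is an addition not present in the answer above; it only matters if mh could contain regex metacharacters.

```python
import re

def find_after(mh, search_text):
    """Return the whitespace-split words that follow 'mh ' on a line, or None if nothing matches."""
    # Interpolate the (escaped) search word into the lookbehind.
    pattern = re.compile(rf'(?<={re.escape(mh)} ).+')
    match = pattern.search(search_text)
    if match is None:
        # search() returns None when nothing matches; calling .group() on it
        # is what raises the 'NoneType' error from the question.
        return None
    return match.group().split()

print(find_after('jame', 'id 7\njame 12 34\n'))
print(find_after('jame', 'no such line'))
```

Returning None explicitly lets the caller decide what to write to the sheet when the text is missing.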

What is the Python way to express a GREL line that is creating as many tags as needed in an XML document?

I'm using Open Refine to do something that I KNOW Python can do. I'm using it to convert a csv into an XML metadata document. I can figure out most of it, but the one thing that trips me up, is this GREL line:
{{forEach(cells["subjectTopicsLocal"].value.split('; '), v, '<subject authority="local"><topic>'+v.escape("xml")+'</topic></subject>')}}
What this does, is beautiful for me. I've got a "subject" field in my Excel spreadsheet. My volunteers enter keywords, separated with a "; ". I don't know how many keywords they'll come up with, and sometimes there is only one. That GREL line creates a new <subject authority="local"><topic></topic></subject> for each term created, and of course slides it into the field.
I know there has to be a Python expression that can do this. Could someone recommend best practice for this? I'd appreciate it!

Basically you want to use 'split' in Python to convert the string from your subject field into a Python list, and then you can iterate over the list.
So assuming you've read the content of the 'subject' field from a line in your csv/excel document already and assigned it to a string variable 'subj' you could do something like:
subjList = subj.split("; ")
for subject in subjList:
    # do what you need to do to output 'subject' in an xml element here
    pass

This Python expression is the equivalent to your GREL expression:
['<subject authority="local"><topic>' + escape(v) + '</topic></subject>' for v in value.split('; ')]
It will create an array of XML snippets containing your subjects. It assumes that you've created or imported an appropriate escape function, such as
from xml.sax.saxutils import escape
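Put together as a runnable sketch (the function name subject_elements and the sample keywords are assumptions for illustration; splitting on ';' and stripping each piece is a slightly more forgiving variant of splitting on '; '):

```python
from xml.sax.saxutils import escape

def subject_elements(subjects):
    """Build one <subject> element per ';'-separated keyword."""
    template = '<subject authority="local"><topic>{}</topic></subject>'
    # Skip empty pieces so a trailing separator or a blank cell yields no empty tags.
    return [template.format(escape(term.strip()))
            for term in subjects.split(';') if term.strip()]

# Hypothetical cell content with two keywords.
snippets = subject_elements('Cats & dogs; Railroads')
print('\n'.join(snippets))
```

escape() takes care of &, < and > in the keywords, which is the role v.escape("xml") plays in the GREL line.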

How to check if there is a comment or not

I've got an .xlsx file. Some cells in it have comments whose content will be used later. How can I check, while iterating through every cell, whether it has a comment or not?
This code (in which I tried to iterate the third column and nothing else) returns an error:
import win32com.client, win32gui, re
xl = win32com.client.Dispatch("Excel.Application")
xl.Visible = 1
TempExchFilePath = win32gui.GetOpenFileNameW()[0]
wb = xl.Workbooks.Open(TempExchFilePath)
sh = wb.Sheets("Sheet1")
comments = None
for i in range (0,201,1):
    if sh.Cells(2,i).Comment.Text() != None:
        comment = sh.Cells(2,i).Comment.Text()
        comments += comment
print(comments)
input()
I am very new to Python and sorry for my English.
Thanks! :3

Here is what I think is the best way, using the Python Excel modules, specifically xlrd.
Suppose you have a workbook which has a cell A1 with a comment written by Joe Schmo which says "Hi!", here's how you'd get at that.
>>> from xlrd import *
>>> wb = open_workbook("test.xls")
>>> sheet = wb.sheet_by_index(0)
>>> notes = sheet.cell_note_map
>>> print notes
{(0, 0): <xlrd.sheet.Note object at 0x00000000033FE9E8>}
>>> notes[0,0].text
u'Schmo, Joe:\nHi!'
A Quick Explanation of What's Going On
So the xlrd module is a pretty handy thing, once you figure it out (full documentation here). The first two lines import the module and create a workbook object called wb. Next, we create a sheet object of the first sheet (index 0) and call that sheet (I'm feeling creative today). Then we create a dictionary of note objects called notes with the cell_note_map attribute of our sheet object. This dictionary has the (row,col) index of the comment as the key, and then a note object as the value. We can then extract the text of that note using the text attribute of the note object.
For multiple notes, you can iterate through your dictionary to get at all the text as shown below:
>>> comments = []
>>> for key in notes.keys():
...     comments.append(notes[key].text)
...
>>> print comments
[u"Schmo, Joe:\nHere's another\n", u'Schmo, Joe:\nhi!']
Some Things to Note
This will only work with .xls files, not .xlsx, but you can save any .xlsx as an .xls so there's no problem.
The author of the comment will always be listed first, but can be accessed separately by using the author attribute instead of text. There will also always be a \n in between the author and text.
Cells which do not have comments will not be mapped by cell_note_map, so a full sheet without any comments will yield an empty dictionary.
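Since cells without comments are simply absent from the map, checking a cell comes down to a key lookup. Here is a rough sketch of that in Python 3 syntax. It uses a stand-in Note type with only the author and text attributes described above so that it runs without a workbook; the helper names has_note and notes_by_cell and the sample data are made up, and with a real file you would pass sheet.cell_note_map instead of fake_map.

```python
from collections import namedtuple

# Stand-in for xlrd's note objects; only .author and .text are assumed.
Note = namedtuple("Note", ["author", "text"])

def has_note(cell_note_map, row, col):
    """True when the cell at (row, col) carries a comment."""
    return (row, col) in cell_note_map

def notes_by_cell(cell_note_map):
    """Map (row, col) -> note text, ordered by cell position."""
    return {pos: cell_note_map[pos].text for pos in sorted(cell_note_map)}

# Hypothetical data shaped like the cell_note_map shown above.
fake_map = {
    (2, 1): Note("Schmo, Joe", "Schmo, Joe:\nHere's another\n"),
    (0, 0): Note("Schmo, Joe", "Schmo, Joe:\nhi!"),
}
print(has_note(fake_map, 0, 0))
print(has_note(fake_map, 5, 5))
print(notes_by_cell(fake_map))
```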

I think defining comments as None and then trying to add something to it (a string, I guess) won't work.
Try comments = "" instead of comments = None.
Other than that, it would definitely help to see the error.

I think this should work. However, you have
comments = None
and then
comments += comment
I don't think you can do None + anything. Most likely, you either want to do
comments = ''
comments += comment
or
comments = []
comments.append(comment)
Another thing you probably need to fix:
if sh.Cells(2,i).Comment.Text() != None:
The (2,i) syntax doesn't appear to work in Python. Change to Cells[2][i]. Also, if Comment doesn't exist, then it will be None, and won't have a Text() function, i.e.:
if sh.Cells[2][i].Comment != None:
    comment = sh.Cells[2][i].Comment.Text()
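The two fixes above (collect into a list, and test the Comment object before calling Text() on it) can be sketched without Excel. The FakeCell and FakeComment classes below are hypothetical stand-ins that mimic only the .Comment / .Text() shape; they are not part of win32com, and the helper name gather_comments is made up.

```python
def gather_comments(cells):
    """Collect comment texts from an iterable of cell-like objects."""
    texts = []
    for cell in cells:
        comment = cell.Comment
        # Check the Comment object itself before calling Text() on it.
        if comment is not None:
            texts.append(comment.Text())
    return texts

# Hypothetical stand-ins so the sketch runs without Excel.
class FakeComment:
    def __init__(self, text):
        self._text = text

    def Text(self):
        return self._text

class FakeCell:
    def __init__(self, text=None):
        self.Comment = FakeComment(text) if text is not None else None

cells = [FakeCell("check this"), FakeCell(), FakeCell("and this")]
print(gather_comments(cells))
```

If a single string is needed at the end, "\n".join(...) over the returned list avoids the None + string problem entirely.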

associative list python

I am parsing an HTML form with Beautiful Soup. Basically I have around 60 input fields, mostly radio buttons and checkboxes. So far this works with the following code:
from BeautifulSoup import BeautifulSoup
x = open('myfile.html','r').read()
out = open('outfile.csv','w')
soup = BeautifulSoup(x)
values = soup.findAll('input',checked="checked")
# echoes some output like ('name',1) and ('value',4)
for cell in values:
    # the following line is my problem!
    statement = cell.attrs[0][1] + ';' + cell.attrs[1][1] + ';\r'
    out.write(statement)
out.close()
x.close()
As indicated in the code, my problem is where the attributes are selected: the HTML template is ugly and mixes up the order of the attributes that belong to an input field. I am interested in name="somenumber" value="someothernumber". Unfortunately my attrs[1] approach does not work, since name and value do not always occur in the same order in my HTML.
Is there any way to access the resulting BeautifulSoup list associatively?
Thx in advance for any suggestions!

My suggestion is to make values a dict. If soup.findAll returns a list of tuples as you seem to imply, then it's as simple as:
values = dict(soup.findAll('input',checked="checked"))
After that you can simply refer to the values by their attribute name, like what Peter said.
Of course, if soup.findAll doesn't return a list of tuples as you've implied, or if your problem is that the tuples themselves are being returned in some weird way (such that instead of ('name', 1) it would be (1, 'name')), then it could be a bit more complicated.
On the other hand, if soup.findAll returns one of a certain set of data types (dict or list of dicts, namedtuple or list of namedtuples), then you'll actually be better off because you won't have to do any conversion in the first place.
...Yeah, after checking the BeautifulSoup documentation, it seems that findAll returns an object that can be treated like a list of dicts, so you can just do as Peter says.
http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Oh yeah, if you want to enumerate through the attributes, just do something like this:
for cell in values:
    for attribute, value in cell.attrs:
        out.write(attribute + ';' + str(value) + ';\r')

I'm fairly sure you can use the attribute name like a key for a hash:
print cell['name']
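If you would rather stay with the attrs list from the question, turning the (name, value) pairs into a dict gives the same order-independent access. This is a sketch with made-up sample data; it assumes attrs is a list of 2-tuples as the question's own output suggests, and the helper name name_and_value is hypothetical.

```python
def name_and_value(attrs):
    """Return (name, value) from attribute pairs given in any order."""
    lookup = dict(attrs)  # list of (key, value) pairs -> dict keyed by attribute name
    return lookup.get('name'), lookup.get('value')

# Two inputs whose attributes appear in different orders (made-up sample data).
tags = [
    [('type', 'radio'), ('name', '1'), ('value', '4'), ('checked', 'checked')],
    [('value', '2'), ('checked', 'checked'), ('name', '7'), ('type', 'radio')],
]
rows = ['{};{};'.format(*name_and_value(attrs)) for attrs in tags]
print(rows)
```

Using .get() means a missing attribute shows up as None in the output instead of raising a KeyError.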
