The problem is that the Excel cell contains a line break (\n).
It looks like this (for practical reasons I don't want to change the cell so that it no longer contains a break):
Example-
123
My code works for other cells, but not in this case, where the cell has a break. I tried changing "Example-123" to "Example-\n123" and "Example-\r123", but neither worked.
How can I compare the two strings, ignoring the fact that the string contains a break?
if column[2].value == "Example-123":
    example_dict.update(column_Test=column[2].column - 1)
test123 = str(row[example_dict["column_Test"]].value)
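One possible approach, sketched under the assumption that the cell value arrives in Python as a string with an embedded "\n" (or "\r\n"), is to remove the line-break characters from the value before comparing it. The helper name without_breaks is hypothetical and not part of any library.

```python
def without_breaks(value):
    """Return the cell text with line-break characters removed.

    Hypothetical helper: None becomes "", other values go through str().
    """
    text = "" if value is None else str(value)
    return text.replace("\r", "").replace("\n", "")


cell_value = "Example-\n123"  # stand-in for column[2].value
print(without_breaks(cell_value) == "Example-123")
```

In the original check this would read: if without_breaks(column[2].value) == "Example-123":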
I am working with the [UCI adult dataset][1]. I have added a header row to make it easier to work with. I need to change the last column, which is named 'etiquette' and can take two values, '<=50K' and '>50K'. I have tried the following
num_datos.loc[num_datos.loc[:,"etiquette"]=="<=50K", "etiquette"]=1
num_datos.loc[num_datos.loc[:,"etiquette"]==">50K", "etiquette"]=0
and the following
num_datos['etiquette'].replace(['<=50K'], 1)
num_datos['etiquette'].replace(['>50K'], 0)
However, this seems to do nothing, since if I then execute
print(num_datos.etiquette[0])
I still get a value of <=50K. Is there a way for me to replace the values of the column in question?
Your second try, using Series.replace(), is close. replace() returns a new Series and leaves the original column unchanged, so you need to assign the result back:
num_datos['etiquette'] = num_datos['etiquette'].replace('<=50K', 1)
num_datos['etiquette'] = num_datos['etiquette'].replace('>50K', 0)
The square brackets are not the problem: replace() accepts either a single value or a list of values to look for. If the values still do not change, check that the strings match the data exactly, including case and any leading whitespace.
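As a minimal sketch of the assign-back idea, the snippet below uses a small made-up DataFrame in place of the real data. It swaps in Series.map with a dictionary rather than two replace() calls, and it strips whitespace first on the assumption that the raw file may carry a leading space in each field; both choices are illustrative, not something the question confirms.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the leading spaces are an assumption
num_datos = pd.DataFrame({"etiquette": [" <=50K", " >50K", " <=50K"]})

# Strip surrounding whitespace so the labels match exactly
labels = num_datos["etiquette"].str.strip()

# map() returns a new Series, so the result is assigned back to the column
num_datos["etiquette"] = labels.map({"<=50K": 1, ">50K": 0})

print(num_datos["etiquette"].tolist())
```

Any label that is not a key of the dictionary becomes NaN with map(), which makes unexpected values easy to spot.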
Hope this helps!
I am writing a bit of Python code to automate the manipulation of Excel spreadsheets. The idea is to use spreadsheet templates to create daily reports. Saw this idea working several years ago using Perl. Anyway.
Here are the simple rules:
Sheets within the workbook are processed in the order they appear.
Within the sheets, cells are processed left to right, then top to bottom.
There are names defined which are single cell ranges, can contain static values or the results of queries. Cells can contain comments which contain SQL queries to run. ...
Here is the problem: as I process the cells, I need to check whether the cell has an attached comment and whether the cell has a name. I am able to handle the attached cell comments, but I cannot figure out how to determine if a cell is within a named range (in my case, the single cell that makes up the range).
I saw a posting that suggested this would work:
cellName = ws.ActiveCell.Name.Name
No luck.
Does anybody have any idea how to do this?
I am so close but no cigar.
Thanks for your attention to this matter.
KD
What you may consider doing is first building a list of all addresses of names in the worksheet, and checking the address of each cell against the list to see if it's named.
In VBA, you obtain the names collection (all the names in a workbook) this way:
Set ns = ActiveWorkbook.Names
You can determine if the names are pointed toward part of the current sheet, and a single cell, this way:
shname = ActiveSheet.Name
' Dim needs constant bounds, so size the array at run time with ReDim
ReDim SheetNamedCellAddresses(1 To ns.Count) As String
i = 1
For Each n In ns
    If Split(n.Value, "!")(0) = "=" & shname And InStr(n.Value, ":") = 0 Then
        ' The name Value is something like "=Sheet1!A1"
        ' If there is no colon, it is a single cell, not a range of cells
        SheetNamedCellAddresses(i) = Split(n.Value, "=")(1) ' Add the address to your array, remove the "="
        i = i + 1
    End If
Next
So now you have a string array containing the addresses of all the named cells in your current sheet. Move that array into a python list and you are good to go.
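On the Python side, the same filtering could be sketched as a lookup from cell address to name. This assumes you have already collected (name, refers_to) string pairs from the workbook's Names collection, for example from each name's Name and RefersTo properties; the function name named_single_cells and the sample data are hypothetical.

```python
def named_single_cells(names, sheet_name):
    """Map cell address -> defined name for single-cell names on one sheet.

    `names` is an iterable of (name, refers_to) pairs such as
    ("Total", "=Sheet1!$B$2"). Hypothetical helper for illustration.
    """
    prefix = "=" + sheet_name + "!"
    lookup = {}
    for name, refers_to in names:
        # Keep names on this sheet that point at one cell (no colon)
        if refers_to.startswith(prefix) and ":" not in refers_to:
            address = refers_to[len(prefix):].replace("$", "")
            lookup[address] = name
    return lookup


sample = [
    ("Total", "=Sheet1!$B$2"),
    ("Block", "=Sheet1!$A$1:$C$3"),
    ("Other", "=Sheet2!$A$1"),
]
print(named_single_cells(sample, "Sheet1"))
```

With the dictionary built once per sheet, checking whether a cell is named becomes a plain lookup on its address while walking the cells.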
OK, so it errors out if the cell does NOT have a range name. If the cell has a range name, the following bit of code returns the name. Great success!!
ws.Cells(r,c).Activate()
c = xlApp.ActiveCell
cellName = c.Name.Name
If there is no name associated with the cell, an exception is thrown.
So even in VBA you would have to wrap this bit of code in error handling. Using exception handling for this call sounds expensive to me.
I've already asked the root question but I thought I might see if I can get more help with this. I'm trying to work with XlDirectionDown in order to select the last filled cell in an Excel spreadsheet.
Ultimately, I'd like to use Python to select all filled cells in this sheet from A through AE. It will be copied into a text file and appended into SQL Server...so I don't want any blanks.
What I have so far:
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.Visible = 1;
excel.Workbooks.Open('G:/working.xlsx')
XlDirectionDown = 4
last = excel.Range("A:A").End(XlDirectionDown)
excel.Range("A1:A"+str(last)).Select()
First of all, the XlDirectionDown does not seem to work. The cursor in Excel remains on the first cell.
Secondly, I get an exception for the last line in this code (something to do with Range). Does anybody understand what's going on with this code? Also, is there ANY documentation on win32com or Pywin32 out there?? I can't find any how-to's! Thanks as always everyone.
I have used a specific cell rather than a range of cells as the starting point. Replace
last = excel.Range("A:A").End(XlDirectionDown)
with
last = excel.Range("A1:A1").End(XlDirectionDown)
However, if there are any blank cells, this will stop just before the first one. You probably want to use the worksheet's UsedRange property instead. This is the smallest range that contains all your cells, according to Excel: you may find (as I have) that the resulting range is wider than AE (contains blank columns at the end) and contains many entirely blank rows at the bottom. However, since you want to filter out blank cells anyway, those will be skipped during filtering.
As to the exception on the last line of code, this is because End returns a Range object, not a row number. str(last) therefore does not give you a row number, so "A1:A"+str(last) is not a valid range address. Use the row number of that cell, last.Row, instead.
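To illustrate the string-building step, here is a small sketch that assembles the address from a row number. The helper name range_address is hypothetical; in the real script the row number would come from the Range returned by End, via its Row property.

```python
def range_address(first_col, last_col, last_row, first_row=1):
    """Build an A1-style address such as "A1:AE250" (hypothetical helper)."""
    return "{0}{1}:{2}{3}".format(first_col, first_row, last_col, last_row)


last_row = 250  # stand-in for last.Row from the COM Range object
print(range_address("A", "AE", last_row))
```

The resulting string can then be passed to excel.Range(...) in place of the concatenation that raised the exception.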
As to filtering out blank cells, I'm not sure what that means: when you copy the data to a text file, what will you put for blank cells? If you have "A blank C" will you put "A C"? The C will end up in wrong column of your database. Anyways just something that caught my attention.
There is no single place for win32com documentation, although the Python on Windows book has a lot of info, and Google gives quite useful results, including SO hits. The one thing that keeps tripping me up whenever I use Excel COM (this is not specific to Python's win32com) is that everything in a workbook is a Range. You can't have an individual cell: even when some methods or properties might lead you to think you are getting a cell, you are actually getting a range. It often takes a bit of extra thinking to work out how to get to the desired cell.
I got started with win32com and Excel here.
In your code, what does excel.Range("A:A").End(XlDirectionDown) return? Test it. You might want to add .Select(), and then use excel.Selection.Address to get the last cell. Test it in interactive mode, it's easier to see what's going on there.
As an alternative, you can use a while loop to go through your cells. This code is looping the rows until an empty cell:
excel.Range("A1").Select()
while excel.ActiveCell.Value:
    val = excel.ActiveCell.Value
    print(val)
    excel.ActiveCell.Offset(2,1).Select() # Move a row down
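The last line is a bit funny; in VBA you would write Offset(1,0) to go one row down. In Python, however, you have to add one to both row and column. This is probably due to indexing: calling Offset(2,1) on the win32com object appears to be treated as 1-based indexing into the range rather than as a true offset. With objects created through gencache, GetOffset(1, 0) should express the same move with VBA-style arguments.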
I am using XLRD to attempt to read from and manipulate string text encapsulated within the cells of my excel document. I am posting my code, as well as the text that is returned when I choose to print a certain column.
import xlrd
data = xlrd.open_workbook('data.xls')
sheetname = data.sheet_names()
employees = data.sheet_by_index(0)
print employees.col(2)
>>>[text:u'employee_first', text:u'\u201cRichard\u201d', text:u'\u201cCatesby\u201d', text:u'\u201cBrian\u201d']
My intention is either to create a dict or to reference the Excel document's contents as plain strings in Python. I would like a number of the functions in my program to manipulate the data locally and then, at a later point (not within the scope of this question), output it to a second Excel file.
How do I get rid of this extra information?
If you are only interested in the values of the cells, then you should do:
values = sheet.col_values(colx=2)
instead of:
cells = sheet.col(colx=2)
values = [c.value for c in cells]
because it's more concise and more efficient (Cell objects are constructed on the fly as/when requested).
employees.col(2) is a list of xlrd.sheet.Cell instances. To get all the values from the column (instead of the Cell objects), you can use the col_values method:
values = employees.col_values(2)
You could also do this (my original suggestion):
values = [c.value for c in employees.col(2)]
but that is much less efficient than using col_values.
\u201c and \u201d are unicode left and right double quotes, respectively. If you want to get rid of those, you can use, say, the lstrip and rstrip string methods. E.g. something like this:
values = [c.value.lstrip(u'\u201c').rstrip(u'\u201d') for c in employees.col(2)]
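As a small variation, str.strip accepts a set of characters and removes them from both ends, so the two calls can be folded into one. The list below is a stand-in for the values read from the sheet, not real xlrd output.

```python
# Stand-in for employees.col_values(2)
raw_values = [
    u'employee_first',
    u'\u201cRichard\u201d',
    u'\u201cCatesby\u201d',
    u'\u201cBrian\u201d',
]

# Remove curly double quotes from both ends of each value
cleaned = [v.strip(u'\u201c\u201d') for v in raw_values]
print(cleaned)
```

Unlike the lstrip/rstrip pair, this also removes a closing quote at the start or an opening quote at the end, which may or may not be what you want.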
There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:
I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I say yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned/offered to convert it.
So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.
How can I prevent this from happening? I do not want text that looks like "numbers stored as text" to be converted to numbers. It is text unless I say so.
So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).
I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk
So, any suggestions?
If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:
diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@
             data_type = element.get('t', 'n')
             if data_type == Cell.TYPE_STRING:
                 value = string_table.get(int(value))
-
-            ws.cell(coordinate).value = value
+                ws.cell(coordinate).set_value_explicit(value=value,
+                                                data_type=Cell.TYPE_STRING)
+            else:
+                ws.cell(coordinate).value = value
             # to avoid memory exhaustion, clear the item after use
             element.clear()
Cell.value is a property that, on assignment, calls Cell._set_value, which in turn calls Cell.bind_value. According to that method's docstring, it will: "Given a value, infer type and display options". Since the types of the values are stored in the XML file, those should be used (here I only do that for strings) instead of doing something 'smart'.
As you can see from the code, the test whether it is a string was already there.
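The difference between inferring a type and trusting the declared one can be shown with a toy model. This is not openpyxl's code: infer_value and read_cell are hypothetical functions, and the "s" and "n" type codes simply mirror the 't' attribute read in the patch above.

```python
def infer_value(raw):
    """Toy type inference: anything that parses as a float becomes a number."""
    try:
        return float(raw)
    except ValueError:
        return raw


def read_cell(raw, data_type):
    """Toy reader: keep declared strings as-is, infer everything else."""
    if data_type == "s":
        return raw  # the file says this is text, so do not guess
    return infer_value(raw)


print(repr(infer_value("5E12")))     # inference turns the text into a number
print(repr(read_cell("5E12", "s")))  # honouring the declared type keeps the text
```

The patch applies the same idea inside the reader: when the file marks a cell as a string, the value is stored explicitly as a string and the inference step is skipped.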