I thought this would be a pretty simple task, but it is turning out to be much more complicated than I expected. I am trying to read a simple Excel spreadsheet containing a table of values, perform calculations on the values, and output a new sheet.
First question: which library do people recommend? Pandas? Openpyxl? I am currently using openpyxl and struggling to get the value of an individual cell. Here's some code:
collectionOrder = np.empty( [numRows,2], dtype='object')
numCountries = 0
for i in burndownData.iter_rows():
    elemnt = burndownData.cell(row=i,column=1)
    print("elemnt=",elemnt.value )
    if not( np.isnan(burndownData.cell(row=i,column=1).value)):
        collectionOrder[ int(burndownData.cell(row=i,column=1).value) ][0] = burndownData.cell(row=i,column=1).value
        collectionOrder[ int(burndownData.cell(row=i,column=1).value) ][1] = i
        numCountries = numCountries + 1
But the first time I try to use the cell reference (burndownData.cell(row=i,column=1)), I get the following error:
Exception has occurred: TypeError '<' not supported between instances of 'tuple' and 'int'
  File "C:\Users\cpeddie\Documents\projects\generateBurndownReport.py", line 59, in run
    elemnt = burndownData.cell(row=i,column=1)
  File "C:\Users\cpeddie\Documents\projects\generateBurndownReport.py", line 96, in <module>
    run()
Everything I have seen on the web says this is the way to get the value of an individual cell. What am I doing wrong? Is there an easier way?
Thanks....
Unless you're doing something more complicated than gathering some cells, numpy or pandas is usually unnecessary overhead. openpyxl works well enough on its own.
You have two options for iterating through a worksheet but you're trying to mix them, hence the error.
One option is to simply query each cell's value using the worksheet's cell method with the row and column keyword arguments. The row and column indexes are 1-based integers, like this:
burndownData.cell(row=1, column=1).value
The other option is iterating the sheet and indexing each row, which is a tuple of cells:
for row in burndownData.iter_rows():
    elemnt = row[0].value
This will get you column A of row 1, column A of row 2, and so on. (Because it's an index into a Python tuple, it's zero-based.)
What you were doing above:
for i in burndownData.iter_rows():
    elemnt = burndownData.cell(row=i,column=1)
generates an error because i is a tuple of openpyxl Cell objects, and the row argument to cell expects an integer.
Update: I should add there's a third way to reference cells using the spreadsheet column:row syntax, like this:
burndownData['B9'].value
but unless you're only going to select a few specific cells, translating back and forth between letters and numbers seems clumsy.
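The three ways of reading a cell described in this answer can be compared side by side. This is only a minimal sketch: it builds a small in-memory workbook (the sheet contents are invented for illustration) instead of loading the asker's file, and it reuses the name burndownData from the question.

```python
from openpyxl import Workbook

# Build a small in-memory sheet so the example needs no file on disk.
# The rows below are made-up sample data, not the asker's spreadsheet.
wb = Workbook()
burndownData = wb.active
burndownData.append([2, "France"])
burndownData.append([0, "Spain"])
burndownData.append([1, "Italy"])

# Option 1: address each cell by 1-based integer row and column.
by_index = [
    burndownData.cell(row=r, column=1).value
    for r in range(1, burndownData.max_row + 1)
]

# Option 2: iterate the rows; each row is a tuple of Cell objects,
# so position 0 is column A.
by_iteration = [row[0].value for row in burndownData.iter_rows()]

# Option 3: spreadsheet-style coordinates.
by_coordinate = burndownData["A2"].value

print(by_index, by_iteration, by_coordinate)
```

Options 1 and 2 return the same values; the difference is only whether you drive the loop with integers or let openpyxl hand you the cells.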
I am trying to check a column of an Excel file for values in a given format and, if there is a match, append it to a list. Here is my code:
from openpyxl import load_workbook
import re
#Open file and read column with PBSID.
PBSID = []
wb = load_workbook(filename="FILE_PATH", data_only=True)
sheet = wb.active
for col in sheet["E"]:
    if re.search("\d{3}[-\.\s]??\d{5}", str(col)):
        PBSID.append(col.value)
print(PBSID)
Column E of the Excel file contains IDs like 431-00456 that I would like to append to the list named PBSID.
Expected result: PBSID list to be populated with ID in regex mask XXX-XXXXX.
Actual result: Output is an empty list ("[]").
Am I missing something? (I know there are more elegant ways of doing this, but I am relatively new to Python and very open to criticism.)
Thanks!
Semantically, I think the for loop should be written as:
for row in sheet["E"]:
since I'm guessing that sheet["E"] already refers to column 'E', so each item you get back belongs to one row.
Without seeing the exact data in a cell, I think what may be happening is that Python is interpreting your call to str() along the lines of str(256 - 23690).
It performs a maths operation (in my example) '256 - 23690' before giving you the string of the answer, which is '-23434', and then looks for your regular expression in '-23434', where it won't find any match (hence no results). Make sure the ID is stored and read as text rather than as a number or formula, and write the pattern as a raw string.
You also pass the whole cell object to str() in 'str(col)' but then use the cell's value in 'PBSID.append(col.value)'. str() of an openpyxl cell gives its representation (something like <Cell 'Sheet1'.E1>), not its contents, so the pattern cannot match. It's best to refer to the same thing in both places; here that is col.value.
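The point about matching against the value rather than the cell object can be sketched without a workbook at all. The helper name collect_ids and the sample values are hypothetical; in the asker's code the values would come from col.value for each cell in sheet["E"].

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
PBSID_PATTERN = re.compile(r"\d{3}[-.\s]?\d{5}")


def collect_ids(values):
    """Return the values whose text matches the XXX-XXXXX mask."""
    ids = []
    for value in values:
        if value is None:  # empty cells come back as None
            continue
        if PBSID_PATTERN.search(str(value)):
            ids.append(value)
    return ids


# Stand-ins for the cell values of column E.
sample_values = ["431-00456", None, "n/a", "abc"]
found = collect_ids(sample_values)

# The text you get from str() on a cell object has no ID in it.
cell_repr = "<Cell 'Sheet1'.E1>"
repr_match = PBSID_PATTERN.search(cell_repr)

print(found, repr_match)
```

With openpyxl this would be called as collect_ids(cell.value for cell in sheet["E"]).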
I am trying to write a new dataframe to a CSV file, but nothing is written (the file is empty) and no error is raised.
I feel the issue is with this line, since I am trying to write a single value to the column:
order['trade_type'] = trade_type
Any idea what's wrong here?
import pandas as pd

def write_transaction_to_file(trade_type):
    order = pd.DataFrame()
    order['trade_type'] = trade_type
    order.to_csv("open_pos.csv", sep='\t', index=False)

write_transaction_to_file('SELL')
Your code creates an empty DataFrame, without even column names.
Now look at order['trade_type'] = trade_type.
If order already contained some rows and trade_type (the variable) was a scalar,
every row would receive that value in the column named 'trade_type' (the string).
But since order contains no rows, there is no place to write the value to,
so you end up with a column name and no data.
Instead you can append a row. DataFrame.append was removed in pandas 2.0, so use pd.concat, e.g.:
order = pd.concat([order, pd.DataFrame([{'trade_type': trade_type}])], ignore_index=True)
The rest of the code is OK, and passing the output file name as an
ordinary string is also OK.
Other solution: Just create a DataFrame with a single row and single
named column, filled with your variable:
order = pd.DataFrame([[trade_type]], columns=['trade_type'])
Then write it to CSV file as before.
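As a minimal sketch of the single-row approach, the snippet below builds the frame from the scalar and renders the CSV to a string instead of a file so the result can be inspected directly. The helper name build_order is hypothetical; passing a path such as "open_pos.csv" to to_csv would write the same text to disk.

```python
import pandas as pd


def build_order(trade_type):
    """Return a one-row DataFrame holding the given trade type."""
    return pd.DataFrame([{"trade_type": trade_type}])


order = build_order("SELL")

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = order.to_csv(sep="\t", index=False)
print(csv_text)
```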
I am not able to find any method in openpyxl that inserts a new blank column at a known column letter or index taken from the current cell's properties. For instance, if I am searching through the cells of each row and find a cell whose cell.column value is "E", and I want to insert a new column before column E, I don't see any function or method in openpyxl for doing that. Is this possible at all, or is there any known working method for doing so?
UPDATE:
I tried the test code as is from "Insert column using openpyxl",
and it fails on the following lines. I'm using Python 2.7 and openpyxl 2.4.8.
  File "C:/Users/me/Desktop/my.py", line 30, in
    column_letter, row = coordinate_from_string(coordinate)
  File "C:\Python27\lib\site-packages\openpyxl\utils\cell.py", line 45, in coordinate_from_string
    match = COORD_RE.match(coord_string.upper())
AttributeError: 'tuple' object has no attribute 'upper'
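Looping could shuffle values to the right, but you already knew that. I assume you weren't happy with the cycles spent on interpreter overhead to copy that way. Column E keeps its old values afterwards and has to be cleared separately. It might look like:
for src_col, dst_col in zip(reversed(ws['E:Y']), reversed(ws['F:Z'])):
    # work right to left so a column is copied before it is overwritten
    for src, dst in zip(src_col, dst_col):
        dst.value = src.value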
There is support for pandas dataframes. Consider turning worksheet into a df, inserting a column, then spitting out a new .xlsx.
Using array slicing with a csv.writer may prove simpler.
None of this deals with the sticky wicket of adjusting formula references.
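For readers on a newer openpyxl than the 2.4.8 mentioned in the question, worksheets also have an insert_cols method (added in the 2.5 series, as far as I know), which avoids the manual copying. The sketch below assumes such a version and uses an invented in-memory sheet; like the approaches above, it does not rewrite formula references.

```python
from openpyxl import Workbook
from openpyxl.utils import column_index_from_string

# Made-up sample sheet with six columns, A through F.
wb = Workbook()
ws = wb.active
ws.append(["a", "b", "c", "d", "e", "f"])
ws.append([1, 2, 3, 4, 5, 6])

# Translate the column letter found while scanning into a 1-based index.
target = column_index_from_string("E")

# Insert one blank column before column E; existing cells shift right.
ws.insert_cols(target)

header = list(next(ws.iter_rows(min_row=1, max_row=1, values_only=True)))
second = list(next(ws.iter_rows(min_row=2, max_row=2, values_only=True)))
print(header, second)
```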
I read in a CSV file
times = pd.read_csv("times.csv",header=0)
times.columns.values
The column names are in a list
titles=('case','num_gen','year')
The real titles are much longer and more complex, but for simplicity's sake the list is truncated here.
I want to call an index of a column of times using an index from titles.
My attempt is:
times.titles[2][0]
This is to try to get the effect of:
times.year[0]
I need to do this because there are 75 columns that I need to access in a loop, so I cannot type out each column name as in the line above.
Any ideas on how to accomplish this?
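I think you need to use .iloc; see the pandas docs on selection by position. The row position comes first, then the column position, so this matches times.year[0] as long as the columns appear in the same order as titles:
times.iloc[0, 2] # returns the first row and third column; the indexes are zero-based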
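Selecting by name also works and does not depend on column order, because a DataFrame can be indexed with a string held in a variable. The sketch below assumes a tiny made-up CSV in place of times.csv and keeps the names times and titles from the question.

```python
import io

import pandas as pd

# Made-up stand-in for times.csv with the three truncated column names.
csv_data = io.StringIO("case,num_gen,year\nA,10,2001\nB,20,2002\n")
times = pd.read_csv(csv_data, header=0)

titles = ("case", "num_gen", "year")

# Bracket indexing accepts the column name as a string, so this has the
# same effect as times.year[0].
first_year = times[titles[2]][0]

# Looping over every title works the same way.
first_values = {name: times[name][0] for name in titles}

# Positional equivalent: row 0, column 2.
first_year_by_position = times.iloc[0, 2]

print(first_year, first_values, first_year_by_position)
```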
I want to delete a record from a google spreadsheet using the gspread library.
Also, how can I get the number of rows/records in a Google spreadsheet? gspread provides .row_count, which returns the total number of rows, including those that are blank, but I only want to count rows which have data.
Since gspread version 0.5.0 (December 2016) you can remove a row with delete_row().
For example, to delete a row with index 42, you do:
worksheet.delete_row(42)
Can you specify exactly how you want to delete the rows/records? Are they in the middle of a sheet? Bottom? Top?
I had a situation where I wanted to wipe all data except the headers once I had processed it. To do this I just resized the worksheet twice.
#first row is data header to keep
worksheet.resize(rows=1)
worksheet.resize(rows=30)
This is a simple brute force solution for wiping a whole sheet without deleting the worksheet.
Count Rows with data
One way would be to download the data with get_all_records(), which returns a list of dicts, then check the length of that list. The method returns all rows above the last non-blank row: blank rows are included if a later row is not blank, but trailing blank rows are not.
worksheet.delete_row(42) is deprecated (December 2021). Now you can achieve the same results using
worksheet.delete_rows(42)
The new function can also delete several rows at once when you pass the index of the last row to delete as a second argument:
worksheet.delete_rows(42, 44)
which deletes rows 42 through 44 (three rows).
Beware that it starts counting rows from 1 (so not zero-based numbering).
Reading the source code, it seems there is no method to remove rows directly; there are only methods to add them, plus the .resize() method to resize the worksheet.
When it comes to getting the number of rows, there's a .row_count property that should do the job for you.
Adding to #AsAP_Sherb's answer:
If you want to count how many rows there are, don't use get_all_records(); instead use worksheet.col_values(1) and count the length of that.
(Instead of getting the entire table, you get only one column.)
I think that would be more time efficient (and will definitely be more memory efficient).
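If blank rows can appear in the middle of the sheet, the counts above will include them. A rough sketch of a stricter count follows; it assumes the rows arrive as a list of lists of strings, which is the shape worksheet.get_all_values() returns in gspread, and the helper name count_rows_with_data and the sample data are hypothetical.

```python
def count_rows_with_data(rows):
    """Count rows that contain at least one non-blank cell."""
    return sum(
        1 for row in rows
        if any(str(cell).strip() for cell in row)
    )


# Made-up stand-in for worksheet.get_all_values().
sample_rows = [
    ["name", "qty"],
    ["apple", "3"],
    ["", ""],        # blank row in the middle
    ["pear", ""],    # partly filled row still counts
]

rows_with_data = count_rows_with_data(sample_rows)
print(rows_with_data)
```

This fetches the whole table, so the single-column approach above is still cheaper when column 1 is always filled.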