Getting all column values from google sheet using Gspread and Python - python

I have a problem with gspread for Python 3.
When I do something like:
x = worksheet.cell(1,1).value
print(x)
I get the value of cell (1, 1), which in my case is:
Nice
But when I do:
x = worksheet.col_values(1)
print(x)
I get all the results, as in:
'Nice', 'Cool','','','','','','','','','','','','','',''
That includes all the empty cells, which I don't understand. Since I am asking just for values, why do I get all the '' empty brackets, and why are the other results also in brackets? I would expect something like:
Nice
Cool
when I call for the values of a column and those are the only values. Does anyone know how to get such results?
According to the documentation at https://github.com/burnash/gspread it should work, but it does not.

You are getting all of the column data, contained in a list. It starts at row one and gives you all rows in that column to the bottom of the spreadsheet (1000 rows by default), including empty cells. The documentation tells you this:
col_values(col) Returns a list of all values in column col.
Empty cells in this list will be rendered as None.
This seems to have been changed to return empty strings instead, but the principle is the same.
To get just values, use a list comprehension:
x = [item for item in worksheet.col_values(1) if item]
Note that the above will remove blank rows between items, which might cause misalignment if you work with multiple columns where the row number is important. Since it's a list, individual items are accessed with:
for item in x:
    print(item)
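If the row numbers do matter, one option is to keep each value together with its row number instead of dropping the blanks outright. A minimal sketch, assuming `column` stands in for the list that `worksheet.col_values(1)` returned (the sample values are made up):

```python
# Stand-in for worksheet.col_values(1); the values are illustrative only.
column = ["Nice", "Cool", "", "", "Great", ""]

# Sheet rows are 1-based, so start counting at 1.
non_empty = [(row, value) for row, value in enumerate(column, start=1) if value]

for row, value in non_empty:
    print(row, value)
```

Each pair still says which sheet row the value came from, so values from several columns can be lined up again later.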

Looking again at the gspread documentation, I was able to create a dataframe and then obtain the column values from it:
import gspread
import pandas as pd
from oauth2client.client import GoogleCredentials  # assumed source of GoogleCredentials, as in older Colab snippets
gc = gspread.authorize(GoogleCredentials.get_application_default())
sht2 = gc.open_by_url('https://docs.google.com/spreadsheets/d/<id>')
worksheet = sht2.worksheet("Sheet-name")
dataframe = pd.DataFrame(worksheet.get_all_records())
dataframe.head(3)
Note: Don't forget to set your sheet's sharing settings to "Anyone with a link" to be able to access the sheet from e.g. Google Colab.

You can also use a while loop, something like this.
Let's say you want columns E to G: start the loop at x=5 and end it at x=7. Just make sure that you transpose the dataframe at the end before printing it.
columns = []
x = 5
while x < 8:
    data = sheet.col_values(x)[1:]  # [1:] skips the first row (presumably the header)
    x += 1
    columns.append(data)
df = pd.DataFrame(columns).T
print(df)
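The transpose is needed because a list of lists is read row by row, so each `col_values` result would otherwise become one row of the frame. A pandas-free sketch of the same reshaping, assuming made-up column data in place of the `col_values` calls. Padding short columns with None via `zip_longest` is an illustrative choice here:

```python
from itertools import zip_longest

# Hypothetical stand-ins for sheet.col_values(5)[1:], (6)[1:] and (7)[1:].
columns = [
    ["e1", "e2", "e3"],
    ["f1", "f2"],
    ["g1", "g2", "g3"],
]

# Turn the list of columns into a list of rows, padding short columns with None.
rows = [list(row) for row in zip_longest(*columns, fillvalue=None)]

for row in rows:
    print(row)
```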

Related

iterrows() loop is only reading last value and only modifying first row

I have a dataframe test. My goal is to search in the column t1 for specific strings, and if it matches exactly a specific string, put that string in the next column over called t1_selected. Only thing is, I can't get iterrows() to go over the entire dataframe, and to report results in respective rows.
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[i, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
I am only trying to get ABCD_T1w_MPR_vNav_passive to be in the 4th row under t1_selected, while all the other rows should say "not found". The first entry in t1_selected is from the last row under t1, which I didn't include in the screenshot because the dataframe has over 200 rows.
I tried to initialize an empty list to append output of
import ast
x = ast.literal_eval(row['t1'])
to see if I can put x in there, but the same issue occurred.
Is there anything I am missing?
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[index, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
Here index is the row that gets written to. With i, the target row never changed.
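Why `i` kept writing to a single row can be reproduced without pandas. In this sketch the rows are plain dicts with made-up values, and `i` is assumed to be a leftover variable from earlier code, as it appears to be in the question:

```python
target = "ABCD_T1w_MPR_vNav_passive"

# Illustrative rows only; the real frame has over 200 of them.
rows = [
    {"t1": "['localizer']", "t1_selected": "not found"},
    {"t1": "['ABCD_T1w_MPR_vNav_passive']", "t1_selected": "not found"},
    {"t1": "['ABCD_T1w_MPR_vNav_passive', 'other']", "t1_selected": "not found"},
]

i = 0  # stale value: never updated inside the loops below

buggy = [dict(r) for r in rows]
for index, row in enumerate(buggy):
    if target in row["t1"]:
        buggy[i]["t1_selected"] = target      # always writes to row 0

fixed = [dict(r) for r in rows]
for index, row in enumerate(fixed):
    if target in row["t1"]:
        fixed[index]["t1_selected"] = target  # writes to the row that matched

print([r["t1_selected"] for r in buggy])
print([r["t1_selected"] for r in fixed])
```

Note that `in` on a string is a substring test. If an exact match is required, comparing with `==` against the parsed value would be the stricter check.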

How to return the string of a header based on the max value of a cell in Openpyxl

Good morning guys! Quick question about openpyxl:
I am working with Python, editing an xlsx document and generating various stats. Part of my script generates the max values of a cell range:
temp_list=[]
temp_max=[]
for row in sheet.iter_rows(min_row=3, min_col=10, max_row=508, max_col=13):
    print(row)
    for cell in row:
        temp_list.append(cell.value)
    print(temp_list)
    temp_max.append(max(temp_list))
    temp_list=[]
I would also like to be able to print the string of the header of the column that contains the max value for the cell range desired. My data structure looks like this :
Any idea on how to do so?
Thanks!
This seems like a typical INDEX/MATCH Excel problem.
Have you tried retrieving the index of the max value in each temp_list?
You can use a function like numpy.argmax() to get the index of the max value within temp_list, then use this index to locate the header and append its string to a new list called, say, max_headers, which collects the header strings in order of appearance.
It would look something like this:
for cell in row:
    temp_list.append(cell.value)
i_max = int(np.argmax(temp_list))  # 0-based position within the row
max_headers.append(sheet.cell(row=1, column=10 + i_max).value)  # 10 is min_col of the range
And so on and so forth. np.argmax accepts a plain Python list, so temp_list can stay as it is, but max_headers has to be defined before the loop.
First, thanks Bernardo for the hint. I found a decently working solution but still have a little issue. Perhaps someone can be of assistance.
Let me amend my initial statement. Here is the code I am working with now:
temp_list=[]
headers_list=[]
for row in sheet.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32): #Index starts at 1 // Here we set the rows/columns containing the data to be analyzed
    for cell in row:
        temp_list.append(cell.value)
    for cell in row:
        if cell.value == max(temp_list):
            print(str(cell.column))
            print(cell.value)
            print(sheet.cell(row=1, column=cell.column).value)
            headers_list.append(sheet.cell(row=1,column=cell.column).value)
        else:
            print('keep going.')
    temp_list = []
This code works, but has a little issue: if, for instance, a row has the same max value twice (i.e. 25,9,25,8,9), this loop will print 2 headers instead of one. My question is:
how can I get this loop to take into account only the first match of a max value in a row?
You probably want something like this:
headers = [c for c in next(ws.iter_rows(min_col=27, max_col=32, min_row=1, max_row=1, values_only=True))]
for row in ws.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32, values_only=True):
    mx = max(row)
    idx = row.index(mx)  # index() returns the first match only
    col = headers[idx]
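`row.index(mx)` returns the position of the first occurrence, which is what resolves ties. A small sketch with plain tuples in place of worksheet rows. The header names and values are made up, except for the tie from the question (25, 9, 25, 8, 9):

```python
# Hypothetical header row and data rows, shaped like the tuples values_only=True yields.
headers = ("A", "B", "C", "D", "E")
data_rows = [
    (25, 9, 25, 8, 9),   # max appears twice
    (1, 7, 3, 7, 2),     # max appears twice
    (4, 2, 6, 1, 0),
]

first_max_headers = []
for row in data_rows:
    mx = max(row)
    idx = row.index(mx)          # index of the first occurrence only
    first_max_headers.append(headers[idx])

print(first_max_headers)
```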

Why is my data frame still empty?

I'm trying to make a data frame from the following code, but it is always empty and I'm not sure why. Any suggestions? Thanks!
step_size = 0.01
start = [100]
iter_list = list(range(10000))
for i in iter_list:
    start.append(start[i] - step_size)
iter_list2 = list(range(len(start)))
variable_step = pd.DataFrame()
for i in iter_list2:
    variable_step[i] = ((start[i]*step_size)/100)
It looks like you may have some sort of confusion about what a dataframe is. Your code doesn't seem to recognize that a DataFrame is a two-dimensional data structure, with rows and columns.
When you do variable_step[i] = ((start[i]*step_size)/100), you're creating a new column in variable_step with column label set to the current value of i, and initializing every element of that column to ((start[i]*step_size)/100), since ((start[i]*step_size)/100) is a scalar.
Creating a new column this way doesn't add more rows. It just adds more values to the existing rows - all 0 of them. Each new column you create has length 0, because you never create rows.
If you want me to tell you how to fix this, well, I can't, because I don't know what you were even trying to do.
You initialize an empty data frame here:
variable_step = pd.DataFrame()
Assuming your intention is to put a list into a data frame, pass the list to the constructor:
variable_step = pd.DataFrame(start) # or any other list you need
Also, in the last loop you assign values by index while the data frame is still empty.
Append the values to a list first and build the data frame from that list. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.)
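A sketch of building the values in a plain list first, so that the data frame can be created in one step afterwards. It assumes the goal was one computed value per entry of `start`, which the question does not state outright. The pandas call appears only as a comment:

```python
step_size = 0.01
n_steps = 10000

# Same values as the first loop in the question: 100, 99.99, 99.98, ...
start = [100.0]
for i in range(n_steps):
    start.append(start[i] - step_size)

# One computed value per entry of `start`.
variable_step = [(value * step_size) / 100 for value in start]

print(len(variable_step))

# With pandas installed, the list becomes a one-column frame:
# df = pd.DataFrame({"variable_step": variable_step})
```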

Trying to get the largest number in a column of a .csv file

This is what I have currently. I get the error 'int' object is not iterable. If I understand correctly, my issue is that BIKES_AVAILABLE is assigned a number at the top of my project, so instead of looking at the column the code is looking at that number and hitting an error. How should I go about going through the column? I apologize in advance for the newbie question.
for i in range(len(stations[BIKES_AVAILABLE]) -1):
    most_bikes = max(stations[BIKES_AVAILABLE])
    sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
    if most_bikes == max(stations[BIKES_AVAILABLE]):
        second_most = max(stations[BIKES_AVAILABLE])
    index_1 = index(most_bikes)
    index_2 = index(second_most)
    most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another method that might be better for you to use with data manipulation is to try the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers.
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
If you prefer working with indexes, pandas also has a built-in method called idxmax that returns the index of the max value.
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory-efficient solution (i.e. no list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv"))
which would give 9 for this example. Iterating over the file object directly, rather than calling readlines(), is what avoids building a list.
Update
To get the row (index), you need to keep the index of each number alongside it so that you can return the index of the max:
max(((i,int(l.split(',')[1])) for i,l in enumerate(open("test.csv"))),key=lambda t:t[1])[0]
which gives 2 here, because the line in test.csv (above) with the max number in column 2 (which is 9) has index 2 (i.e. it is the third line).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i,int(l.split(',')[1])) for i,l in enumerate(lines)),key=lambda t:t[1])[0]
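The same idea can be written with the standard `csv` module and a `with` block, so the file is closed afterwards and quoted fields are parsed properly. A sketch under assumptions: `max_in_column` is a hypothetical helper name, and the sample file is written to a temporary directory only to keep the example self-contained:

```python
import csv
import os
import tempfile

# Hypothetical sample file with the same contents as test.csv above.
sample = ["line1,3,abc", "line2,1,ahc", "line3,9,sbc", "line4,4,agc"]
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w", newline="") as f:
    f.write("\n".join(sample) + "\n")


def max_in_column(csv_path, column):
    """Return (row_index, value) for the largest integer in `column` (0-based)."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        # Generator of (index, value) pairs; max() compares on the value only.
        pairs = ((i, int(row[column])) for i, row in enumerate(reader))
        return max(pairs, key=lambda pair: pair[1])


print(max_in_column(path, 1))
```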
Assuming a csv structure like so:
data = ['1,blue,15,True',
        '2,red,25,False',
        '3,orange,35,False',
        '4,yellow,24,True',
        '5,green,12,True']
If I want to get the max value from the 3rd column I would do this, converting to int so that the comparison is numeric rather than alphabetical:
largest_number = max(int(n.split(',')[2]) for n in data)
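One thing to watch with this approach: `split` returns strings, and `max` on strings compares them character by character. A short sketch with made-up values, showing the difference and the `int` conversion that avoids it:

```python
# Illustrative rows where string ordering and numeric ordering disagree.
data = ["1,blue,9,True", "2,red,100,False", "3,orange,35,False"]

as_strings = max(n.split(",")[2] for n in data)        # compares "9", "100", "35" as text
as_numbers = max(int(n.split(",")[2]) for n in data)   # compares 9, 100, 35 as numbers

print(as_strings)
print(as_numbers)
```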

Adding new column to dataframe of different length from list

I'm trying to add a new column to my dataframe each time I run my function. This causes the error: 'ValueError: Length of values does not match length of index'. I assume this is because the list I add to df as a new column varies in length with every run of the function.
I have seen many threads suggest using concat, but this probably won't work for me as I can't seem to use concat and just overwrite my existing df, and I need one complete df at the end with a column from each run of my function.
My code works like this:
df = DataFrame()
mylist = []
def myfunc(number):
    mylist = []
    for x in range(0,10):
        if 'some condition':
            mylist.append(x)
    df['results%d' % number] = mylist
So for each function call I'm adding the contents of 'mylist' as a new dataframe column. On the second call this causes the above-mentioned error. I need some way of letting Python ignore the index/column length. From the threads suggesting concat, I gather that passing 'axis=1' fixes the problem of different lengths, so the solution might be parallel to that.
Alternatively, I could create a range of lists either before the function definition or at the beginning of it, one list for each 'number' parameter passed to the function, but this is a very primitive solution.
I'm not exactly clear on what you're trying to do, but maybe you want something like this?
df = DataFrame()
def myfunc(number):
    row_index = 0
    for x in range(0,10):
        if 'some condition':
            df.loc[row_index, 'results%d' % number] = x
            row_index += 1
