I'm trying to add a new column to my dataframe for each time I run my funciton. This causes the error: 'ValueError: Length of values does not match length of index'. I assume this is because the list I add to df as a new column varies in length with every run of the function.
I have seen many threads suggest using concad, but this probably won't work for me as I can't seem to use concad and just overwrite my existing df - and I need one complete df at the end with a column from each run of my function.
My code functions like this:
df = DataFrame()
mylist = []
def myfunc(number):
mylist = []
for x in range(0,10):
if 'some condition':
mylist.append(x)
df['results%d' % number] = mylist
So for each function iteration I'm adding contents of 'mylist' as a new dataframe column. At second iteration this causes above mentioned error. I need some way of letting python ignore index/column length. From the threads suggesting using concad, I get that passing that giving the instruction 'axis=1' fixes the problem of different lengths - so solution might be parallel to that.
Alternative I could create a range of lists either before function definition or at beginning of it - one list for each 'number' parameter passed to the function, but this is a very primitive solution
I'm not exactly clear on what you're trying to do, but maybe you want something like this?
df = DataFrame()
def myfunc(number):
row_index = 0
for x in range(0,10):
if 'some condition':
df.loc[row_index, 'results%d' % number] = x
row_index += 1
Related
I'm trying to build the below dataframe
df = pd.DataFrame(columns=['Year','Revenue','Gross Profit','Operating Profit','Net Profit'])
rep_vals =['year','net_sales','gross_income','operating_income','profit_to_equity_holders']
for i in range (len(yearly_reports)):
df.loc[i] = [yearly_reports[i].x for x in rep_vals]
However I get error as per.. 'Report' object has no attribute 'x'
The below (brute force version) of the code works:
for i in range (len(yearly_reports)):
df.loc[i] = [yearly_reports[i].year,yearly_reports[i].net_sales ,
yearly_reports[i].gross_income, yearly_reports[i].operating_income,
yearly_reports[i].profit_to_equity_holders]
My issue is however I want to add a lot more columns and also I don't want to fetch every item from my yearly_reports into the dataframe, how can I iterate the values I want more effeciently please?
Instead of using .x, use [x].
yearly_reports[i][x]
Also, it is probably a bad idea / not necessary / slow to iterate over your dataframe like this. Have a look at join/merge which might be a lot faster.
I want to change the target value in the data set within a certain interval. When doing it with 500 data, it takes about 1.5 seconds, but I have around 100000 data. Most of the execution time is spent in this process. I want to speed this up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution in this link, tried to create a dictionary, but I couldn't do it.
Here is the code which takes around 1.5 seconds for 500 data.
def add_new(df,base,interval):
df_appended = pd.DataFrame()
np.random.seed(5)
s = np.random.normal(base,interval/3,4)
s = np.append(s,base)
for i in range(0,5):
df_new = df
df_new["DeltaG"] = s[i]
df_appended = df_appended.append(df_new)
return df_appended
DataFrames in the pandas are continuous peaces of memory, so appending or concatenating etc. dataframes is very inefficient - this operations create new DataFrames and overwrite all data from old DataFrames.
But basic python structures as list and dicts are not, when append new element to it python just create pointer to new element of structure.
So my advice - make all you data processing on lists or dicts and convert them to DataFrames in the end.
Another advice can be creating preallocated DataFrame of the final size and just change values in it using .iloc. But it works only if you know final size of your resulting DataFrame.
Good examples with code: Add one row to pandas DataFrame
If you need more code examples - let me know.
def add_new(df1,base,interval,has_interval):
dictionary = {}
if has_interval == 0:
for i in range(0,5):
dictionary[i] = (df1.copy())
elif has_interval == 1:
np.random.seed(5)
s = np.random.normal(base,interval/3,4)
s = np.append(s,base)
for i in range(0,5):
df_new = df1
df_new[4] = s[i]
dictionary[i] = (df_new.copy())
return dictionary
It works. It takes around 10 seconds for whole data. Thanks for your answers.
I'm trying to make a data frame from the following code, but it is always empty and I'm not sure why. Any suggestions? Thanks!
step_size = 0.01
start = [100]
iter_list = list(range(10000))
for i in iter_list:
start.append(start[i] - step_size)
iter_list2 = list(range(len(start)))
variable_step = pd.DataFrame()
for i in iter_list2:
variable_step[i] = ((start[i]*step_size)/100)
It looks like you may have some sort of confusion about what a dataframe is. Your code doesn't seem to recognize that a DataFrame is a two-dimensional data structure, with rows and columns.
When you do variable_step[i] = ((start[i]*step_size)/100), you're creating a new column in variable_step with column label set to the current value of i, and initializing every element of that column to ((start[i]*step_size)/100), since ((start[i]*step_size)/100) is a scalar.
Creating a new column this way doesn't add more rows. It just adds more values to the existing rows - all 0 of them. Each new column you create has length 0, because you never create rows.
If you want me to tell you how to fix this, well, I can't, because I don't know what you were even trying to do.
You initialize empty data frame here:
variable_step = pd.DataFrame()
Assuming your intention is to put list into a data frame, you should:
variable_step = pd.DataFrame(start) # or any other list you need
Also, you address items by index in the data frame in the last loop to assign values while data frame is empty.
Use .append() instead
This is what I have currently, I get the error int is 'int' object is not iterable. If I understand correctly my issue is that BIKE_AVAILABLE is assigned a number at the top of my project with a number so instead of looking at the column it is looking at that number and hitting an error. How should I go about going through the column? I apologize in advance for the newby question
for i in range(len(stations[BIKES_AVAILABLE]) -1):
most_bikes = max(stations[BIKES_AVAILABLE])
sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
if most_bikes == max(stations[BIKES_AVAILABLE]):
second_most = max(stations[BIKES_AVAILABLE])
index_1 = index(most_bikes)
index_2 = index(second_most)
most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another method that might be better for you to use with data manipulation is to try the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers:
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
More importantly if you enjoy using indexes, there is a function to get you the index of the max value called idxmax built into pandas.
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory efficient solution (i.e. no list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv").readlines())
which would give 9 for this example.
Update
To get the row (index), you need to store the index of the max number in the column so that you can access this:
max(((i,int(l.split(',')[1])) for i,l in enumerate(open("test.csv").readlines())),key=lambda t:t[1])[0]
which gives 2 here as the line in test.csv (above) with the max number in column 2 (which is 9) is 2 (i.e. the third line).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i,int(l.split(',')[1])) for i,l in enumerate(lines)),key=lambda t:t[1])[0]
Assuming a csv structure like so:
data = ['1,blue,15,True',
'2,red,25,False',
'3,orange,35,False',
'4,yellow,24,True',
'5,green,12,True']
If I want to get the max value from the 3rd column I would do this:
largest_number = max([n.split(',')[2] for n in data])
So i have a problem with the Gspread for python 3
when i do something like:
x = worksheet.cell(1,1).value
print(x)
Then i get the value of cell 1,1 which in my case is:
Nice
But when i do:
x = worksheet.col_values(1)
print(x)
Then i get all the results as in
'Nice', 'Cool','','','','','','','','','','','','','',''
And all the empty cells as well which i don't understand since i am asking just for values why i do i get all the '', empty brackets and why the other results are also in brackets ? I would expect something like:
Nice
Cool
When i call for the values of a column and those are the only values. Anyone know how to get such results ?
According to this https://github.com/burnash/gspread documentation it should work but it dose not.
You are getting all of the column data, contained in a list. It starts at row one and gives you all rows in that column to the bottom of the spreadsheet (1000 rows by default), including empty cells. The documentation tells you this:
col_values(col) Returns a list of all values in column col.
Empty cells in this list will be rendered as None.
This seems to have been changed to return empty strings instead, but the principle is the same.
To get just values, use a list comprehension:
x = [item for item in worksheet.col_values(1) if item]
Noting that the above will remove blank rows between items, which might cause misalignment if you try to work with multiple columns where row number is important. Since it's a list, individual items are accessed with:
for item in x:
print(item)
Looking again at the gspread-documentation, I was able to create a dataframe and then thereafter obtain the column-values:
gc = gspread.authorize(GoogleCredentials.get_application_default())
sht2 = gc.open_by_url('https://docs.google.com/spreadsheets/d/<id>')
worksheet = sht2.worksheet("Sheet-name")
dataframe = pd.DataFrame(worksheet.get_all_records())
dataframe.head(3)
Note: Don't forget to enable your gsheet's sharing-settings to "Anyone with a link", to be able to access the sheet from e.g. google colab.
You can also create a while loop and make something like this.
Let's say you want column E to G, you can start the loop from x=5 and end it on x=7. Just make sure that you transpose the dataframe at the end before printing it.
columns = []
x = 5
while x < 8:
data = sheet.col_values(x)[1:]
x += 1
columns.append(data)
df = pd.DataFrame(columns).T
print(df)