Please help me out in writing a pandas custom function. I'm stuck on returning specific row and column values as custom results: I want to return the column means without using slicing or predefined functions like np.mean, and the only parameter passed to the custom function should be the dataset df.
Put simply, I want to return the means of columns ['A', 'B'] from a function col_mean() by passing the dataset df, without using pandas slicing or predefined functions like mean/np.mean.
Below is my dataset; please give me the code logic for getting the column means.
df = pd.DataFrame({'A': [10,20,30], 'B': [20, 30, 10]})
def col_men(df):
    means = [0 for i in range(df.shape[1])]
    for k in range(df.shape[1]):
        col_values = [row[k] for row in df]  # bug: iterating over a DataFrame yields column names, not rows
        means[k] = sum(col_values) / float(len(df))
    return means
Instead of using range(df.shape[1]) use enumerate(df.columns), so you keep both name and position:
df = pd.DataFrame({'A': [10, 20, 30], 'B': [20, 30, 10]})

def col_men(df):
    means = [0 for i in range(df.shape[1])]
    for index, k in enumerate(df.columns):
        col_values = [row for row in df[k]]
        means[index] = sum(col_values) / len(df)
    return means
col_men(df)
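For the sample frame above, both columns average to 20, so the call should return:

col_men(df)  # -> [20.0, 20.0]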
I have to write a function (column_means) that calculates the mean of each column of a DataFrame and returns a list of the means. I'm not allowed to use the mean function .mean(), so I'm implementing the general formula of the mean: sum(x_i) / number of elements.
This is my code:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def column_means(df):
    means = []
    for i, n in zip(df.columns, df.shape[0]):
        means[n] = sum(df[i]) / df.shape[0]
    return means
It doesn't work as intended. Could you please help me and tell me what my mistakes are?
Thank you in advance.
You are iterating over an int in the zip function: df.shape[0] returns a single integer, not an iterable.
So you can simply do the following:
def column_means(df):
    means = []
    for i in df.columns:
        means.append(sum(df[i]) / df.shape[0])
    return means
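For the sample frame above, this returns the two column means:

column_means(df)  # -> [2.0, 5.0]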
And if you want the mean to be an integer instead of a float, you can use floor division: sum(df[i]) // df.shape[0]
I hope this answers your question.
Do you want the mean of each column? You have to be careful if they don't have the exact same length:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def column_means(df):
    means = []
    for i, n in enumerate(df.columns):
        means.append(sum(df[n]) / len(df[n]))
    return means

print(column_means(df))
You can also use the mean method of a pandas DataFrame:
df.mean()
Change the first df.shape[0] to df.index and fix the assignment line (zip stops at the shorter sequence, so this assumes the frame has at least as many rows as columns):
def column_means(df):
    means = []
    for i, n in zip(df.columns, df.index):
        means.append(sum(df[i]) / df.shape[0])
    return means
I've been going crazy trying to figure this out. I'm trying to avoid using df.iterrows() to iterate through the rows of a dataframe, as it's quite time-consuming, and .loc is better from what I've seen.
I know this works:
df = df.loc[df.number == 3, :]
And that'll basically set df to be each row where the "number" column is equal to 3.
But, I get an error when I try something like this:
df = df.loc[someFunction(df.number), :]
What I want is to get every row where someFunction() returns True whenever the "number" value of said row is set as the parameter.
For some reason, it's passing the entire column (the dataframe's entire "number" column, in this example) instead of the value of each row as it iterates through the rows, like the previous example does.
Again, I know I can just use a for loop and .iterrows(), but I'm working with around 280,000 rows and it just takes longer than I'd like. Also have tried using a lambda function among other things.
apply is slow - if you can, put the vectorized logic in the function itself by having it take Series as arguments:
import pandas as pd

df = pd.DataFrame()
df['a'] = [7, 6, 5, 4, 3, 2]
df['b'] = [1, 2, 3, 4, 5, 6]

def my_func(series1, series2):
    return (series2 > 3) | (series1 == series2)

df.loc[my_func(df.b, df.a), 'new_column_name'] = True
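For this sample frame, rows 3-5 satisfy the mask, so printing df afterwards gives roughly:

   a  b new_column_name
0  7  1             NaN
1  6  2             NaN
2  5  3             NaN
3  4  4            True
4  3  5            True
5  2  6            True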
I think this is what you need:
import pandas as pd

df = pd.DataFrame({"number": [x for x in range(10)]})

def someFunction(row):
    if row > 5:
        return True
    else:
        return False

df = df.loc[df.number.apply(someFunction)]
print(df)
Output:
number
6 6
7 7
8 8
9 9
You can use an anonymous function with .loc; x refers to the dataframe you are indexing:
df.loc[lambda x: x.number > 5, :]
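Because the lambda receives the whole dataframe, you can also combine several conditions in a single call; a minimal sketch, equivalent to df[(df.number > 2) & (df.number < 7)]:

df.loc[lambda x: (x.number > 2) & (x.number < 7), :]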
Two options I can think of:
1. Create a new column using the pandas apply() method and a lambda function that returns either true or false depending on someFunction(). Then use loc to filter on the new column you just created (a sketch follows below).
2. Use a for loop and df.itertuples(), as it is much faster than iterrows. Make sure to look up the documentation, as the syntax is slightly different for itertuples.
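A minimal sketch of the first option, assuming the df and someFunction from the example above ('keep' is just a hypothetical helper column name):

df['keep'] = df['number'].apply(someFunction)   # boolean helper column
df = df.loc[df['keep'], :].drop(columns='keep')  # filter, then discard the helper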
Just using something like this will work:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['number'] = np.arange(10)
display(df[df['number'] > 5])
# combine conditions with & rather than chaining two boolean indexers,
# since the second mask would no longer align with the filtered index
display(df[(df['number'] > 2) & (df['number'] < 7)])
I have a dictionary where each key is a file name and each value is a dataframe that looks like:
col1 col2
A 10
B 20
A 20
A 10
B 10
I want to group by 'col1' to sum the values in 'col2' and store the result in a new dataframe df whose output should look like:
Index A B
file1 40 30
file2 50 35
My code:
df = pd.DataFrame(columns=['A', 'B'])
for key, value in data.items():
    cnt = value.groupby('Type')['Packets'].sum()
    print(cnt)
    df.append(cnt, ignore_index=True)
Another suggested way: group by, transpose, and row-stack into a dataframe.
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'col1': ['A', 'B', 'A', 'A', 'B'], 'col2': [10, 20, 20, 10, 10]})
df_2 = pd.DataFrame({'col1': ['A', 'B', 'A', 'A', 'B'], 'col2': [30, 10, 15, 5, 25]})

df_1_agg = df_1.groupby(['col1']).agg({'col2': 'sum'}).T.values
df_2_agg = df_2.groupby(['col1']).agg({'col2': 'sum'}).T.values

pd.DataFrame(np.row_stack((df_1_agg, df_2_agg)), index=['file1', 'file2']).rename(columns={0: 'A', 1: 'B'})
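This produces the requested frame:

        A   B
file1  40  30
file2  50  35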
Edited: to generalize, put this into a function and loop over the dataframes. You also need to format the index (file{i}) for the general case, as sketched after the loop below.
lst_df = [df_1, df_2]
df_all = []
for i in lst_df:
    # iterate over every data frame
    df_agg = i.groupby(['col1']).agg({'col2': 'sum'}).T.values
    # append to the accumulator
    df_all.append(df_agg)

pd.DataFrame(np.row_stack(df_all), index=['file1', 'file2']).rename(columns={0: 'A', 1: 'B'})
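To build the file{i} index for an arbitrary number of frames, one sketch (assuming the frames are stacked in list order):

index = ['file{}'.format(i + 1) for i in range(len(df_all))]  # 'file1', 'file2', ...
pd.DataFrame(np.row_stack(df_all), index=index).rename(columns={0: 'A', 1: 'B'})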
You should try to avoid appending in a loop. This is inefficient and not recommended.
Instead, you can concatenate your dataframes into one large dataframe, then use pivot_table:
# aggregate values in your dictionary, adding a "file" series
df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
# perform 'sum' aggregation, specifying index, columns & values
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
Explanation
v.assign(file=k) adds a series file to each dataframe with value set to the filename.
pd.concat concatenates all the dataframes in your dictionary.
pd.DataFrame.pivot_table is a Pandas method which allows you to create Excel-style pivot tables via specifying index, columns, values and aggfunc (aggregation function).
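As a quick check, building a hypothetical data dict from the two sample frames used earlier (df_1 and df_2) and running the two lines above gives:

data = {'file1': df_1, 'file2': df_2}  # hypothetical dict matching the question's structure
df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
# col1    A   B
# file
# file1  40  30
# file2  50  35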
In a pandas dataframe, a function can be used to group its index. I'm looking to define a function that instead is applied to a column.
I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?
Groupby can accept any combination of both labels and series/arrays (as long as the array has the same length as your dataframe), so you can map the function to your column and pass it into the groupby, like
df.groupby(['name', df[1].map(foo)])
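A minimal runnable sketch, assuming a frame with 'name', 'value', and 'tickets' columns and a foo that splits on sign:

import pandas as pd

df = pd.DataFrame({'name': ['x', 'x', 'y'],
                   'value': [-1, 2, 3],
                   'tickets': [10, 20, 30]})

def foo(v):
    return v > 0  # the group key: True for positive values

group_sum = df.groupby(['name', df['value'].map(foo)])['tickets'].sum()
# name  value
# x     False    10
#       True     20
# y     True     30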
Alternatively, you might want to add the condition as a new column to your dataframe before you perform the groupby; this has the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
Something like this will work:
x.groupby(['name', x['value']>0])['tickets'].sum()
As mentioned above, the groupby can accept labels and series. This should give you the answer you are looking for. Here is an example:
data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns = ['name', 'value', 'value2'])
x.groupby(['name', x['value']>0])['value2'].sum()
name value
1 False 20
True 100
2 False 100
Name: value2, dtype: int64
I need to create a duplicate of each row in a dataframe, apply some basic operations to the duplicated row, and then combine these duplicated rows along with the originals back into a dataframe.
I'm trying to use apply for this, and the print shows that it's working correctly, but when I return these 2 rows from the function and the dataframe is assembled, I get the error "cannot copy sequence with size 7 to array axis with dimension 2". It's as if it's trying to fit these 2 new rows back into the original 1-row slot. Any insight on how I can achieve this within apply (and not by iterating over every row in a loop)?
def f(x):
    x_cpy = x.copy()
    x_cpy['A'] = x['B']
    print(pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True))
    #return pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True)

hld_pos.apply(f, axis=1)
The apply function of pandas operates along an axis. With axis=1, it operates along every row. To do something like what you're trying to do, think of how you would construct a new row from your existing row. Something like this should work:
import pandas as pd

my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]})

def f(row):
    """Return a new row with the items of the old row squared."""
    return pd.Series({'a': row['a'] ** 2, 'b': row['b'] ** 2})

new_df = my_df.apply(f, axis=1)
combined = pd.concat([my_df, new_df], axis=0)
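If you want each squared row to sit next to its original, sorting the combined frame by index interleaves them, since concat preserves the original indices:

print(combined.sort_index())
#    a   b
# 0  1   2
# 0  1   4
# 1  2   4
# 1  4  16
# 2  3   6
# 2  9  36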