Finding .mean() of all columns in Python using a loop - python

I have the following dataframe:
Dataframe
Now I want to find the average of every column and create a new dataframe with the result.
My only solution has been:
#convert all rows to mean of values in column
df_find_mean['Germany'] = (df_find_mean["Germany"].mean())
df_find_mean['Turkey'] = (df_find_mean["Turkey"].mean())
df_find_mean['USA_NJ'] = (df_find_mean["USA_NJ"].mean())
df_find_mean['USA_TX'] = (df_find_mean["USA_TX"].mean())
df_find_mean['France'] = (df_find_mean["France"].mean())
df_find_mean['Sweden'] = (df_find_mean["Sweden"].mean())
df_find_mean['Italy'] = (df_find_mean["Italy"].mean())
df_find_mean['SouthAfrica'] = (df_find_mean["SouthAfrica"].mean())
df_find_mean['Taiwan'] = (df_find_mean["Taiwan"].mean())
df_find_mean['Hungary'] = (df_find_mean["Hungary"].mean())
df_find_mean['Portugal'] = (df_find_mean["Portugal"].mean())
df_find_mean['Croatia'] = (df_find_mean["Croatia"].mean())
df_find_mean['Albania'] = (df_find_mean["Albania"].mean())
df_find_mean['England'] = (df_find_mean["England"].mean())
df_find_mean['Switzerland'] = (df_find_mean["Switzerland"].mean())
df_find_mean['Denmark'] = (df_find_mean["Denmark"].mean())
#Remove all rows except first
df_find_mean = df_find_mean.loc[[0]]
#Verify data
display(df_find_mean)
Which works, but is not very elegant.
Is there some way to iterate over each column and construct a new dataframe from the average (.mean()) of each column?
Expected output:
Dataframe with average of columns from previous dataframes

Use DataFrame.mean, convert the resulting Series to a one-row DataFrame with Series.to_frame, and transpose:
df = df_find_mean.mean().to_frame().T
display(df)

Just use DataFrame.mean() to compute the mean of all your columns:
df_find_mean.mean() computes the per-column means as a Series, and wrapping that Series in pd.DataFrame([...]) turns it into a one-row DataFrame:
means = df_find_mean.mean()
df_mean = pd.DataFrame([means])
display(df_mean)
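As a quick sanity check, both one-liners produce the same one-row frame on a toy example (the column values here are made up):

import pandas as pd

df_find_mean = pd.DataFrame({'Germany': [1.0, 2.0, 3.0],
                             'Turkey': [4.0, 5.0, 6.0]})

a = df_find_mean.mean().to_frame().T
b = pd.DataFrame([df_find_mean.mean()])
print(a)            # one row: Germany 2.0, Turkey 5.0
print(a.equals(b))  # True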

Related

How to loop through many columns

I have about 88 columns in a pandas dataframe. I'm trying to apply a formula that calculates a single value for each column. How do I switch out the name of each column and then build a new single-row dataframe from the equation?
Below is the equation (linear mixed model) which results in a single value for each column.
B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum())/Area_sum) *
(gdf.groupby(['Benthic_Mo'])['W8_629044'].mean())).sum()
Below is a sample of the names of the columns
['OBJECTID', 'Benthic_Mo', 'SHAPE_Leng', 'SHAPE_Area', 'geometry', 'tmp', 'Species','W8_629044', 'W8_642938', 'W8_656877', 'W8_670861', 'W8_684891', 'W8_698965', 'W8_713086', 'W8_72726',...]
The columns named W8_## need to be switched out in the formula, and there are about 80 of them. The output I need is a new dataframe with a single row. I would also like to calculate the variance or standard deviation of the values calculated with the formula.
thank you!
You can loop through the dataframe columns. I think the below code should work.
collist = list(original_dataframe.columns)
emptylist = []
emptydict = {}
for i in collist[7:]:
    B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum()) / Area_sum) *
          (gdf.groupby(['Benthic_Mo'])[i].mean())).sum()
    emptydict[i] = B1
emptylist.append(emptydict)
resdf = pd.DataFrame(emptylist)
To create a new df with the results in one row (one column per result), you can use something like the below:
W8_cols = [col for col in df.columns if 'W8_' in col]
df_out = pd.DataFrame()
for col in W8_cols:
    B1 = (((gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum()) / Area_sum) *
          (gdf.groupby(['Benthic_Mo'])[col].mean())).sum()
    t_data = [{col: B1}]
    df_temp = pd.DataFrame(t_data)
    data = [df_out, df_temp]
    df_out = pd.concat(data, axis=1)
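A more compact variant (a sketch reusing the gdf, Area_sum, and W8_cols names from above): since the area weights are the same for every W8_ column, compute them once, build all the values in a dict, and create the one-row frame in a single step:

# per-class area weights, identical for every W8_ column
weights = gdf.groupby(['Benthic_Mo'])['SHAPE_Area'].sum() / Area_sum

results = {col: (weights * gdf.groupby(['Benthic_Mo'])[col].mean()).sum()
           for col in W8_cols}
df_out = pd.DataFrame([results])  # one row, one column per W8_ field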

pandas groupby column with rolling mean, limited between datetimes, without iterating over each row

I have data in a dataframe as follows:
ROWS = 1000
df = pandas.DataFrame()
df['DaT'] = pandas.date_range('2000-1-1', periods=ROWS, freq='H')
df['cat'] = numpy.random.choice(['a','b','c'],size=ROWS)
df['val'] = numpy.random.randint(2,size=ROWS)
df['r10'] = df.groupby(['cat'])['val'].apply(lambda x: x.rolling(10).mean() )
I need to calculate a column that is grouped by category 'cat' and is a rolling (10-period) mean of the 'val' column, but the rolling mean for a given row cannot include values from the day that row occurs on.
The desired result ('wanted') can be generated as follows:
df['wanted'] = numpy.nan
for idx, row in df.iterrows():
    Rdate = row['DaT'].normalize()
    Rcat = row['cat']
    try:
        df.loc[idx, 'wanted'] = df[(df['DaT'] < Rdate) & (df['cat'] == Rcat)]['val'].rolling(10).mean().iloc[-1]
    except Exception:
        df.loc[idx, 'wanted'] = numpy.nan
The above is an awful solution, but it gets the result. It is very slow for the 100,000+ rows it needs to go through. Is there a more elegant solution?
I have tried combinations of shift and even quantize to get something more efficient, but with no success yet.
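One possible vectorized approach (a sketch, assuming rows are sorted by 'DaT' within each category, as the construction above guarantees; worth validating against the loop on a small sample): compute the rolling mean once per category, keep its value at each day's last row, shift that by one day within the category, and map it back onto the rows:

df['day'] = df['DaT'].dt.normalize()

# rolling mean per category, aligned back to the original row index
df['roll'] = (df.groupby('cat')['val']
                .rolling(10).mean()
                .reset_index(level=0, drop=True))

# final rolling mean as of each (cat, day), shifted so a row only sees prior days
prev_day = (df.groupby(['cat', 'day'])['roll'].last()
              .groupby(level='cat').shift(1)
              .rename('wanted_fast'))

df = df.merge(prev_day, left_on=['cat', 'day'], right_index=True, how='left')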

Python: Assign value to DataFrame column elements based on comparison of two other columns

Given a DataFrame as below:
Desired DataFrame values for Quantile and Value columns
I have two objectives.
Create a quantile for each category (a, b, c, etc.) and assign it to the value in the Quantile column.
Compare each row in the Score column to the corresponding Quantile value for that group. If it’s above the 90th percentile, assign the number 3 to the Value column; above the 60th, assign the number 2; and so forth.
So far I have been able to create (in an inefficient way) the following but I’m sure there must be a way to make this more efficient:
df = pd.read_excel("file.xlsx")
# per-group thresholds, broadcast to every row so they can be compared row-wise
df['quantiles1'] = df.groupby(['Group']).scaled_score.transform(lambda s: s.quantile(0.9))
df['quantiles2'] = df.groupby(['Group']).scaled_score.transform(lambda s: s.quantile(0.7))
df['quantiles3'] = df.groupby(['Group']).scaled_score.transform(lambda s: s.quantile(0.5))
df['quantiles4'] = df.groupby(['Group']).scaled_score.transform(lambda s: s.quantile(0.4))
conditions2 = (df['scaled_score'] >= df['quantiles2']) & (df['scaled_score'] < df['quantiles1'])
conditions3 = (df['scaled_score'] >= df['quantiles3']) & (df['scaled_score'] < df['quantiles2'])
conditions4 = (df['scaled_score'] >= df['quantiles4'])
dfr1 = np.where(df['scaled_score'] >= df['quantiles1'], 0.5, 0)
dfr2 = np.where(conditions2, 0.35, 0)
dfr3 = np.where(conditions3, 0.25, 0)
dfr4 = np.where(conditions4, 0.15, 0)
dtest1 = pd.DataFrame(dfr1)
dtest2 = pd.DataFrame(dfr2)
dtest3 = pd.DataFrame(dfr3)
dtest4 = pd.DataFrame(dfr4)
dftest = pd.concat([dtest1, dtest2]).groupby(level=0).max()
dftest = pd.concat([dftest, dtest3]).groupby(level=0).max()
dftest = pd.concat([dftest, dtest4]).groupby(level=0).max()
df = df.drop(['quantiles1', 'quantiles2', 'quantiles3', 'quantiles4'], axis=1)
dftest.index = df.index
df['Value'] = dftest.squeeze()
I would use qcut:
s = (df.groupby('Group')
       .apply(lambda x: pd.qcut(x['scaled_score'], [0, 0.5, 0.7, 0.9, 1],
                                labels=[0.15, 0.25, 0.35, 0.5]))
       .reset_index(level=0, drop=True))
df['New'] = s
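For the exact thresholds in the question (0.9/0.7/0.5/0.4 within each Group), an alternative sketch uses np.select to pick the first matching bucket per row; the threshold values and column names simply mirror the question's code:

g = df.groupby('Group')['scaled_score']
q90 = g.transform(lambda s: s.quantile(0.9))
q70 = g.transform(lambda s: s.quantile(0.7))
q50 = g.transform(lambda s: s.quantile(0.5))
q40 = g.transform(lambda s: s.quantile(0.4))

# np.select returns the choice for the first condition that matches, top bucket first
df['Value'] = np.select(
    [df['scaled_score'] >= q90,
     df['scaled_score'] >= q70,
     df['scaled_score'] >= q50,
     df['scaled_score'] >= q40],
    [0.5, 0.35, 0.25, 0.15],
    default=0.0)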

Pandas: Setting values in GroupBy doesn't affect original DataFrame

data = pd.read_csv("file.csv")
As = data.groupby('A')
for name, group in As:
    current_column = group.iloc[:, i]
    current_column.iloc[0] = np.nan
The problem: 'data' stays the same after this loop, even though I'm trying to set values to np.nan.
As #ohduran suggested:
data = pd.read_csv("file.csv")
As = data.groupby('A')
new_data = []
for name, group in As:
    # edit the grouped data here,
    # e.g. group.loc[:, 'column'] = np.nan
    new_data.append(group)
new_data = pd.concat(new_data)
.groupby() does not change the initial DataFrame. You might want to store what you do with groupby() in a different variable and then accumulate it into a different DataFrame using that for loop (note that DataFrame.append was removed in pandas 2.0, so a list plus pd.concat is the safe way to accumulate).
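If the goal is just to blank out the first row of each 'A' group in the original frame, one direct way (a sketch; 'col' is a placeholder for whichever column is being edited) is to collect each group's first index label and write through .loc, which modifies 'data' itself:

first_rows = data.groupby('A').head(1).index  # index labels of each group's first row
data.loc[first_rows, 'col'] = np.nan          # 'col' is a placeholder column name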

Pandas: display groupby aggregate statistic with data

Data Snippet
I am trying to add a new column to my data frame that displays the average purchase amount per user. The data frame is called trainDf, and the line of code below produces the average by user. I'm trying to learn how to add it as a column, displayed similar to the above image.
AveragePurchaseAmountUser = trainDf.groupby(by='User_ID')['Purchase_Amount'].mean()
Thank you in advance!
You can try:
trainDf['AveragePurchaseAmountUser'] = trainDf.groupby(['User_ID'])['Purchase_Amount'].transform('mean')
I would use merge:
avg_df = (trainDf.groupby(by='User_ID')['Purchase_Amount'].mean()
                 .reset_index()
                 .rename(columns={'Purchase_Amount': 'Avg'}))
trainDf = trainDf.merge(avg_df, on='User_ID')
This will return the DataFrame with the new column:
def avg(df):
    df['Average_Purchase_Amount'] = df['Purchase_Amount'].mean()
    return df

newDf = trainDf.groupby(by='User_ID').apply(avg)
And if you want the column as a Series you can apply this function:
def avgSeries(df):
    return pd.Series(data=df['Purchase_Amount'].mean(), index=df.index)
Then add the column to your DataFrame later.
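Hypothetical usage of avgSeries (a sketch): apply it per group with group_keys=False so the result keeps the original row index and can be assigned straight back:

avg_col = trainDf.groupby(by='User_ID', group_keys=False).apply(avgSeries)
trainDf['Average_Purchase_Amount'] = avg_col  # aligns on the original index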
This is what transform is for:
AveragePurchaseAmountUser = trainDf.groupby(by='User_ID')['Purchase_Amount'].transform('mean')
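transform returns one value per row, aligned with the original index, which is why it can be assigned directly as a new column. A quick check on a made-up frame:

import pandas as pd

demo = pd.DataFrame({'User_ID': [1, 1, 2],
                     'Purchase_Amount': [10.0, 20.0, 30.0]})
demo['Avg'] = demo.groupby('User_ID')['Purchase_Amount'].transform('mean')
# user 1's rows both get 15.0, user 2's row gets 30.0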
