How to divide column sum by row sum in pandas - python

So I have a data frame called df. It looks like this.
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
I want to sum up the columns and divide each column sum by each row sum. So for example:
first row, column 0: (1+4+7)/(1+2+3)
second row, column 0: (1+4+7)/(4+5+6)
and so on.
so that my final result looks like this:
     0      1     2
0  2.0  2.500  3.00
1  0.8  1.000  1.20
2  0.5  0.625  0.75
How do I do it in Python using pandas?

You can also do it this way:
import numpy as np
import pandas as pd

a = df.to_numpy()
# divide is a ufunc (universal function) in numpy, and all ufuncs
# support outer, which applies the operation to every pair of elements.
# outer(col_sums, row_sums)[i, j] is col_sums[i] / row_sums[j], so we
# transpose to line the row sums up with the rows.
b = np.divide.outer(a.sum(0), a.sum(1)).T
out = pd.DataFrame(b, index=df.index, columns=df.columns)
output:
0 1 2
0 2.0 2.500 3.00
1 0.8 1.000 1.20
2 0.5 0.625 0.75

You can use the underlying numpy array:
a = df.to_numpy()
out = pd.DataFrame(a.sum(0) / a.sum(1)[:, None],
                   index=df.index, columns=df.columns)
output:
0 1 2
0 2.0 2.500 3.00
1 0.8 1.000 1.20
2 0.5 0.625 0.75
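For completeness, a pandas-only sketch of the same broadcast, using DataFrame.mul/div with an explicit axis (the variable names are illustrative, not from the answers above):
col_sums = df.sum(axis=0)   # one value per column
row_sums = df.sum(axis=1)   # one value per row
# Start from a frame of ones, scale each column by its column sum,
# then divide each row by its row sum.
out = (pd.DataFrame(1.0, index=df.index, columns=df.columns)
         .mul(col_sums, axis=1)
         .div(row_sums, axis=0))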

Related

How to get average of all the columns and store into a new row

I have a dataframe like this:
         A  B  C  D
user_id
1        1  0  0  1
2        2  1  0  2
3        2  3  1  3
4        3  2  0  4
I need to compute the average of all the columns and need the dataframe to look like this:
           A    B     C    D
user_id
1          1    0     0    1
2          2    1     0    2
3          2    3     1    3
4          3    2     0    4
Average    2  1.5  0.25  2.5
I'm trying this, but it gives me an error:
df = df.append({'user_id':'Average', df.mean}, ignore_index=True)
This also works:
df = pd.concat([df, df.mean().to_frame('Average').T])
which creates the following result:
A B C D
1 1.0 0.0 0.00 1.0
2 2.0 1.0 0.00 2.0
3 2.0 3.0 1.00 3.0
4 3.0 2.0 0.00 4.0
Average 2.0 1.5 0.25 2.5
Comment
If you really want to mix floats and integers, please use
pd.concat([df, df.mean().to_frame('Average').T.astype(object)])
This will result in
A B C D
1 1 0 0 1
2 2 1 0 2
3 2 3 1 3
4 3 2 0 4
Average 2.0 1.5 0.25 2.5
I want to quote from the official documentation on dtypes to show the disadvantage of this solution:
Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for performance and interoperability with other libraries and methods).
This is also the reason why the resulting data type defaults to float.
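A minimal illustration of that upcasting (my example, not part of the original comment):
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)    # int64
print(pd.Series([1, 2, 2.5]).dtype)  # float64: mixing ints and floats upcasts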
Do you mean to do:
df.loc["Average", :] = df.mean()
This creates a new row called "Average" in your df that stores, for each column, the mean of that column.
This also works:
pd.concat([df, df.describe().loc[['mean']]])
A B C D
0 1.0 0.0 0.00 1.0
1 2.0 1.0 0.00 2.0
2 2.0 3.0 1.00 3.0
3 3.0 2.0 0.00 4.0
mean 2.0 1.5 0.25 2.5
You can use pandas.DataFrame.loc to add a line at the bottom:
df.loc['Average'] = df.mean()
>>> print(df)
           A    B     C    D
1        1.0  0.0  0.00  1.0
2        2.0  1.0  0.00  2.0
3        2.0  3.0  1.00  3.0
4        3.0  2.0  0.00  4.0
Average  2.0  1.5  0.25  2.5

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

I have a pandas dataframe as follows:
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
polyid value
0 1 0.56
1 1 0.59
2 1 0.62
3 1 0.83
4 2 0.85
5 2 0.01
6 2 0.79
7 3 0.37
8 3 0.99
9 3 0.48
10 3 0.55
11 3 0.06
I need to reclassify the 'value' column separately for each 'polyid'. For the reclassification, I have two dictionaries. One with the bins that contain the information on how I want to cut the 'values' for each 'polyid' separately:
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
And one with the ids with which I want to label the resulting bins:
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
I tried to get this answer to work for my use case. I could only come up with applying pd.cut on each 'polyid' subset and then concatenating all subsets back into one dataframe with pd.concat:
import pandas as pd
def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
    df_lst = []
    for key in df[bin_key_col].unique():
        bins = bins_dic[key]
        names = names_dic[key]
        sub_df = df[df[bin_key_col] == key]
        sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
        df_lst.append(sub_df)
    return pd.concat(df_lst)
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
df = reclass_df_dic(df, bins_dic, ids_dic, 'polyid', 'value', 'id')
This results in my desired output:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
However, the line:
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
raises the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
which I am unable to resolve using .loc. Also, I guess there is generally a more efficient way of doing this without having to loop over each category?
A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
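As an aside on the warning in the original loop (my note, not part of the answer): the warning appears because sub_df is a slice of df. Taking an explicit copy, and cutting the subset rather than the full frame, avoids it:
sub_df = df[df[bin_key_col] == key].copy()
sub_df[name_col] = pd.cut(sub_df[val_col], bins, labels=names)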

Strange column names in dataframe after joining dataframe on summary of itself

When I summarize a dataframe and join the summary back to the original dataframe, I have trouble working with the column names.
This is the original dataframe:
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
Now I calculate some statistics and merge the summary back in:
group_summary = df.groupby('col1', as_index = False).agg({'col2': ['mean', 'count']})
df = pd.merge(df, group_summary, on = 'col1')
The dataframe has some strange column names now:
df
Out:
col1 col2 (col2, mean) (col2, count)
0 a 0 0.75 4
1 a 4 0.75 4
2 a -5 0.75 4
3 a 4 0.75 4
4 b 3 3.00 2
5 b 3 3.00 2
I know I can use the columns positionally, like df.iloc[:, 2], but I would also like to use them by name, like df['(col2, mean)']; this returns a KeyError.
Source: This grew out of this previous question.
It's because your GroupBy.agg operation results in a DataFrame with MultiIndex columns, and when a DataFrame with a single-level header is merged with one that has a MultiIndex header, the MultiIndex columns are flattened into tuples.
Fix your groupby code as follows:
group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
Merge now gives flat columns.
df.merge(group_summary, on='col1')
col1 col2 mean count
0 a 0 0.75 4
1 a 4 0.75 4
2 a -5 0.75 4
3 a 4 0.75 4
4 b 3 3.00 2
5 b 3 3.00 2
Better still, use transform to map the output to the input dimensions.
g = df.groupby('col1', as_index=False)['col2']
df.assign(mean=g.transform('mean'), count=g.transform('count'))
col1 col2 mean count
0 a 0 0.75 4
1 a 4 0.75 4
2 b 3 3.00 2
3 a -5 0.75 4
4 b 3 3.00 2
5 a 4 0.75 4
Pro-tip: you can use describe to compute many useful statistics in a single function call.
df.groupby('col1').describe()
col2
count mean std min 25% 50% 75% max
col1
a 4.0 0.75 4.272002 -5.0 -1.25 2.0 4.0 4.0
b 2.0 3.00 0.000000 3.0 3.00 3.0 3.0 3.0
Also see Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
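Another way to get flat column names directly from the aggregation is named aggregation (available since pandas 0.25); a sketch, not part of the original answer:
group_summary = df.groupby('col1', as_index=False).agg(
    mean=('col2', 'mean'),
    count=('col2', 'count'),
)
df.merge(group_summary, on='col1')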

Pandas add new columns in subloops back to the main dataframe

I have a dataframe that looks like this:
ids value
1 0.1
1 0.2
1 0.14
2 0.22
....
I am trying to loop through each id and calculate new columns for each id.
for id, row in df.groupby('ids'):
    x = row.loc[0, 'value']
    for i in range(len(row)):
        row.loc[i, 'new_col_1'] = i * x
        row.loc[i, 'new_col_2'] = i * x * 10
My goal is to add the 2 new columns for each id back to the original dataframe, so my df would look like this:
ids value new_col_1 new_col_2
1 0.1 0 0
1 0.2 0.2 2
1 0.14 0.28 2.8
2 0.22 0 0
....
cumcount
With a little NumPy broadcasting sprinkled in. cumcount gets you your for i in range(len(row)) bit:
df.groupby('ids').cumcount()
0 0
1 1
2 2
3 0
dtype: int64
c = df.groupby('ids').cumcount()
v = df.value

df.join(
    pd.DataFrame(
        (c.values * v.values)[:, None] * [1, 10],
        index=df.index,
    ).rename(columns=lambda x: f"new_col_{x + 1}")
)
ids value new_col_1 new_col_2
0 1 0.10 0.00 0.0
1 1 0.20 0.20 2.0
2 1 0.14 0.28 2.8
3 2 0.22 0.00 0.0
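A more direct sketch of the same idea, assigning the columns without the join (my variant, not from the answer):
c = df.groupby('ids').cumcount()
df['new_col_1'] = c * df['value']
df['new_col_2'] = c * df['value'] * 10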

How to use an input column as a major index in hierarchical indexing in pandas?

My csv file contains columns such as:
col1 col2
1 0.9
1 0.3
2 0.4
2 0.9
2 0.1
3 0.0
4 0.5
4 0.9
And I put this into a data frame, so naturally the df adds an index to all of the rows.
I want to keep the first column as my major index, and within each major index, have a minor index such as:
ID  col1  col2
1   1     0.9
    2     0.3
2   1     0.4
    2     0.9
    3     0.1
3   1     0.0
4   1     0.5
    2     0.9
How do I do this?
My end goal is to be able to eliminate rows of a certain major ID. For example, if the average of the rows in major ID 4 is below 0.5, then I'll eliminate those rows.
I assume the best way is to use a major index, but if there's a better way, please let me know.
Firstly, you can create the column ID from your col1, and then drop col1.
Then you can use DataFrame.groupby on the ID column and .cumcount() to get the result you want. Example -
df['ID'] = df['col1']
df = df.drop('col1',axis=1)
df['col1'] = (df.groupby('ID').cumcount() + 1)
Demo -
In [20]: df
Out[20]:
col1 col2
0 1 0.9
1 1 0.3
2 2 0.4
3 2 0.9
4 2 0.1
5 3 0.0
6 4 0.5
7 4 0.9
In [21]: df['ID'] = df['col1']
In [23]: df = df.drop('col1',axis=1)
In [24]: df['col1'] = (df.groupby('ID').cumcount() + 1)
In [25]: df
Out[25]:
col2 ID col1
0 0.9 1 1
1 0.3 1 2
2 0.4 2 1
3 0.9 2 2
4 0.1 2 3
5 0.0 3 1
6 0.5 4 1
7 0.9 4 2
After this, if you want ID as the index, you can use the .set_index() method, passing 'ID' as the parameter.
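For the stated end goal (dropping every row of a major ID whose average col2 falls below 0.5), a sketch that filters with groupby/transform instead of the MultiIndex; the 0.5 threshold comes from the question, the rest is my illustration:
# Keep only the rows whose group mean of col2 is at least 0.5.
keep = df.groupby('ID')['col2'].transform('mean') >= 0.5
df = df[keep]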
