Pandas creating new dataframe from several group by operations - python

I have a pandas dataframe
test = pd.DataFrame({'d':[1,1,1,2,2,3,3], 'id':[1,2,3,1,2,2,3], 'v1':[10, 20, 15, 35, 5, 10, 30], 'v2':[3, 4, 1, 6, 0, 2, 0], 'w1':[0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2], 'w2':[0.8, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0]})
d id v1 v2 w1 w2
0 1 1 10 3 0.10 0.80
1 1 2 20 4 0.30 0.10
2 1 3 15 1 0.20 0.20
3 2 1 35 6 0.10 0.30
4 2 2 5 0 0.40 0.10
5 3 2 10 2 0.30 0.10
6 3 3 30 0 0.20 0.00
and I would like to get some weighted values by group like
test['w1v1'] = test['w1'] * test['v1']
test['w1v2'] = test['w1'] * test['v2']
test['w2v1'] = test['w2'] * test['v1']
test['w2v2'] = test['w2'] * test['v2']
How can I get the result nicely into a DataFrame? Something that looks like
test.groupby('id').sum()['w1v1'] / test.groupby('id').sum()['w1']
id
1 22.50
2 11.00
3 22.50
but includes columns for each weighted value, so like
id w1v1 w1v2 w2v1 w2v2
1 22.50 ... ... ...
2 11.00 ... ... ...
3 22.50 ... ... ...
Any ideas how I can achieve this quickly and easily?

Use:
cols = ['w1v1','w2v1','w1v2','w2v2']
test1 = (test[['w1', 'w2', 'w1', 'w2']] * test[['v1', 'v1', 'v2', 'v2']].values)
test1.columns = cols
print (test1)
w1v1 w2v1 w1v2 w2v2
0 1.0 8.0 0.3 2.4
1 6.0 2.0 1.2 0.4
2 3.0 3.0 0.2 0.2
3 3.5 10.5 0.6 1.8
4 2.0 0.5 0.0 0.0
5 3.0 1.0 0.6 0.2
6 6.0 0.0 0.0 0.0
df = test.join(test1).groupby('id').sum()
df1 = df[cols] / df[['w1', 'w2', 'w1', 'w2']].values
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
Another more dynamic solution with MultiIndex DataFrames:
a = ['v1', 'v2']
b = ['w1', 'w2']
mux = pd.MultiIndex.from_product([a,b])
df1 = test.set_index('id').drop('d', axis=1)
v = df1.reindex(columns=mux, level=0)
w = df1.reindex(columns=mux, level=1)
print (v)
v1 v2
w1 w2 w1 w2
id
1 10 10 3 3
2 20 20 4 4
3 15 15 1 1
1 35 35 6 6
2 5 5 0 0
2 10 10 2 2
3 30 30 0 0
print (w)
v1 v2
w1 w2 w1 w2
id
1 0.1 0.8 0.1 0.8
2 0.3 0.1 0.3 0.1
3 0.2 0.2 0.2 0.2
1 0.1 0.3 0.1 0.3
2 0.4 0.1 0.4 0.1
2 0.3 0.1 0.3 0.1
3 0.2 0.0 0.2 0.0
df = w * v
print (df)
v1 v2
w1 w2 w1 w2
id
1 1.0 8.0 0.3 2.4
2 6.0 2.0 1.2 0.4
3 3.0 3.0 0.2 0.2
1 3.5 10.5 0.6 1.8
2 2.0 0.5 0.0 0.0
2 3.0 1.0 0.6 0.2
3 6.0 0.0 0.0 0.0
df1 = df.groupby('id').sum() / w.groupby('id').sum()
#flatten MultiIndex columns
df1.columns = ['{0[1]}{0[0]}'.format(x) for x in df1.columns]
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000

If you can accept MultiIndex columns, you can use groupby + dot:
test.groupby('id').apply(
    lambda g: g.filter(like='v').T.dot(g.filter(like='w') / g.filter(like='w').sum()).stack()
)
# v1 v2
# w1 w2 w1 w2
#id
#1 22.5 16.818182 4.5 3.818182
#2 11.0 11.666667 1.8 2.000000
#3 22.5 15.000000 0.5 1.000000
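If flat column names are preferred over the MultiIndex output, here is a minimal sketch (not one of the answers above) that operates on the test frame from the question: build every weight*value column in a loop, then divide the grouped sums.
# build every weight*value column explicitly
for w in ['w1', 'w2']:
    for v in ['v1', 'v2']:
        test[f'{w}{v}'] = test[w] * test[v]

# weighted average per id: sum(w*v) / sum(w) within each group
g = test.groupby('id').sum()
out = pd.DataFrame({f'{w}{v}': g[f'{w}{v}'] / g[w]
                    for w in ['w1', 'w2'] for v in ['v1', 'v2']})
print(out)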

Related

Pandas dataframe add rows with NaN in one column based on values in other columns

For example, there are three columns in a dataframe: x, y, z. x and y each have 3 different values at 0.5 intervals. They are coordinates, so they map to each other and there should be 3*3=9 rows with some z values. But the actual dataframe has, let's say, only 7 rows. How do I add the two missing rows with NaN in the z column? Below are the example input and output. Thank you!
Input:
DataFrame:
x y z
0 -0.5 0 5
1 -0.5 -0.5 10
2 0 -0.5 7
3 0 0.5 6
4 0 0 12
5 0.5 0 8
6 0.5 0.5 2
Output:
DataFrame:
x y z
0 -0.5 0 5
1 -0.5 -0.5 10
2 0 -0.5 7
3 0 0.5 6
4 0 0 12
5 0.5 0 8
6 0.5 0.5 2
7 -0.5 0.5 NaN // missing row
8 0.5 -0.5 NaN // missing row
One option is complete from pyjanitor, which adds the missing rows based on the combination of x and y:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete('x', 'y')
x y z
0 -0.5 0.0 5.0
1 -0.5 -0.5 10.0
2 -0.5 0.5 NaN
3 0.0 0.0 12.0
4 0.0 -0.5 7.0
5 0.0 0.5 6.0
6 0.5 0.0 8.0
7 0.5 -0.5 NaN
8 0.5 0.5 2.0
complete is just an efficient helper (a wrapper around pandas functions); if your data has no duplicates that would throw off pivot, you can use pivot and stack directly:
df.pivot(index='x', columns='y', values='z').stack(dropna=False).rename('z').reset_index()
x y z
0 -0.5 -0.5 10.0
1 -0.5 0.0 5.0
2 -0.5 0.5 NaN
3 0.0 -0.5 7.0
4 0.0 0.0 12.0
5 0.0 0.5 6.0
6 0.5 -0.5 NaN
7 0.5 0.0 8.0
8 0.5 0.5 2.0
I don't know exactly how you calculate the "x" and "y" values, but I guess your question is directed towards the NaN value in the "z" column. To add a single known missing row:
import numpy as np
new_row = {"x": -0.5, "y": 0.5, "z": np.nan}
# DataFrame.append was removed in pandas 2.0; pd.concat is the current equivalent
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
You can find the product of [-0.5,0,0.5] with itertools.product
import itertools
lst = pd.concat([df['x'], df['y']]).unique().tolist()
p = list(itertools.product(lst, repeat=2))
print(p)
[(-0.5, -0.5), (-0.5, 0.0), (-0.5, 0.5), (0.0, -0.5), (0.0, 0.0), (0.0, 0.5), (0.5, -0.5), (0.5, 0.0), (0.5, 0.5)]
Then fill the missing index:
out = df.set_index(['x', 'y']).reindex(p).reset_index()
print(out)
x y z
0 -0.5 -0.5 10.0
1 -0.5 0.0 5.0
2 -0.5 0.5 NaN
3 0.0 -0.5 7.0
4 0.0 0.0 12.0
5 0.0 0.5 6.0
6 0.5 -0.5 NaN
7 0.5 0.0 8.0
8 0.5 0.5 2.0
You can make a "Default" xy DataFrame, and then outer merge to it~
import itertools
xy = pd.DataFrame(itertools.product([-0.5, 0, 0.5], repeat=2), columns=['x', 'y'])
df.merge(xy, 'outer')
Output:
x y z
0 -0.5 0.0 5.0
1 -0.5 -0.5 10.0
2 0.0 -0.5 7.0
3 0.0 0.5 6.0
4 0.0 0.0 12.0
5 0.5 0.0 8.0
6 0.5 0.5 2.0
7 -0.5 0.5 NaN
8 0.5 -0.5 NaN
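A further option (my own sketch, not one of the answers above) is to build the full grid with pd.MultiIndex.from_product and reindex onto it, which avoids the explicit itertools step:
full = pd.MultiIndex.from_product([df['x'].unique(), df['y'].unique()], names=['x', 'y'])
out = df.set_index(['x', 'y']).reindex(full).reset_index()
print(out)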

Remove groups from a DataFrame that contain only a single unique value in one column

I am processing data with Pandas. 'A' is an ID column and column 'E' contains either 1 or 0. I want to keep only the groups where column E contains both 0 and 1. (I want to delete the rows where column A is 2 or 4, since those groups contain only 1s and only 0s respectively, leaving only the rows where column A is 1, 3 or 5.)
What is the best way to do this?
A B C D E F
1 1 0 0 0 1 1163.7
2 1 0.8 0.8 2.2 0 0
3 1 0.2 0.2 4.4 0 0
4 1 0.8 0.4 0.4 0 0
5 1 0.5 0.7 3.8 0 0
6 2 1 1 8.9 1 116
7 2 1.5 1.5 1.7 1 116
8 2 2 2 8.7 1 116
9 3 3 3 5. 0 0
10 3 4.5 4.5 2.2 0 0
11 3 6.0 6.5 0.8 0 0
12 3 8 8 0.3 0 0
13 3 5.3 0 0 1 116
14 3 0 0 0 1 116
15 4 0.8 0.8 1.1 0 0
16 4 0.2 0.5 3.4 0 0
17 4 0.4 0.8 3.2 0 0
18 4 0.7 0.5 3.0 0 0
19 5 1 1 1.5 0 0
20 5 1.5 1.5 1.7 0 0
21 5 2 2 7.9 1 116
I want to get the following data.
A B C D E F
1 1 0 0 0 1 1163.7
2 1 0.8 0.8 2.2 0 0
3 1 0.2 0.2 4.4 0 0
4 1 0.8 0.4 0.4 0 0
5 1 0.5 0.7 3.8 0 0
6 3 3 3 2.2 0 0
7 3 4.5 4.5 2.2 0 0
8 3 6.0 6.5 0.8 0 0
9 3 8 8 0.3 0 0
10 3 5.3 0 0 1 116
11 3 0 0 0 1 116
12 5 1 1 1.5 0 0
13 5 1.5 1.5 1.7 0 0
14 5 2 2 7.9 1 116
Use Series.groupby on column E and transform using any to create a boolean mask:
m = (df['E'].eq(0).groupby(df['A']).transform('any') &
     df['E'].eq(1).groupby(df['A']).transform('any'))
df1 = df[m]
Another idea, if column E contains only zeros and ones, is to keep the IDs with two unique values:
m = df.groupby('A')['E'].nunique().eq(2)
df1 = df[df['A'].isin(m[m].index)]
Result:
print(df1)
A B C D E F
1 1 0.0 0.0 0.0 1 1163.7
2 1 0.8 0.8 2.2 0 0.0
3 1 0.2 0.2 4.4 0 0.0
4 1 0.8 0.4 0.4 0 0.0
5 1 0.5 0.7 3.8 0 0.0
9 3 3.0 3.0 5.0 0 0.0
10 3 4.5 4.5 2.2 0 0.0
11 3 6.0 6.5 0.8 0 0.0
12 3 8.0 8.0 0.3 0 0.0
13 3 5.3 0.0 0.0 1 116.0
14 3 0.0 0.0 0.0 1 116.0
19 5 1.0 1.0 1.5 0 0.0
20 5 1.5 1.5 1.7 0 0.0
21 5 2.0 2.0 7.9 1 116.0
You can use drop_duplicates on columns A and E together with groupby.size to see which A groups have 2 different E values (E being only 0 or 1), then keep the rows whose A is in the index where the size equals 2:
s = df[['A','E']].drop_duplicates().groupby('A').size()
df_ = df[df['A'].isin(s[s.eq(2)].index)].copy()
print(df_)
A B C D E F
1 1 0.0 0.0 0.0 1 1163.7
2 1 0.8 0.8 2.2 0 0.0
3 1 0.2 0.2 4.4 0 0.0
4 1 0.8 0.4 0.4 0 0.0
5 1 0.5 0.7 3.8 0 0.0
9 3 3.0 3.0 5.0 0 0.0
10 3 4.5 4.5 2.2 0 0.0
11 3 6.0 6.5 0.8 0 0.0
12 3 8.0 8.0 0.3 0 0.0
13 3 5.3 0.0 0.0 1 116.0
14 3 0.0 0.0 0.0 1 116.0
19 5 1.0 1.0 1.5 0 0.0
20 5 1.5 1.5 1.7 0 0.0
21 5 2.0 2.0 7.9 1 116.0
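Another idiom that fits this kind of "keep whole groups that satisfy a condition" problem is GroupBy.filter; a minimal sketch (not one of the answers above):
# keep only the groups whose E column contains both a 0 and a 1
df1 = df.groupby('A').filter(lambda g: g['E'].eq(0).any() and g['E'].eq(1).any())
print(df1)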

Create Max and Min column values from a single column value pandas

I have a dataframe like the one below and I need to create two columns out of the base column.
Input
Kg
0.5
0.5
1
1
1
2
2
5
5
5
Expected Output
Kg_From Kg_To
0 0.5
0 0.5
0.5 1
0.5 1
0.5 1
1 2
1 2
2 5
2 5
2 5
How can this be done in pandas ?
Assuming your Kg column is sorted:
s = df["Kg"].unique()
df["Kg_from"] = df["Kg"].map({k:v for k,v in zip(s[1:], s)}).fillna(0)
print (df)
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
import numpy as np

# get the unique values and the count of each value in the Kg column
val, counts = np.unique(df.Kg, return_counts=True)
# shift forward by 1 and replace the first value with 0
val = np.roll(val, 1)
val[0] = 0
# repeat each shifted value according to the counts generated earlier
df['Kg_from'] = np.repeat(val, counts)
df
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
Use zip and dict to build the mapping: take the sorted unique values from np.unique, prepend a 0 with np.insert, and create the new column with DataFrame.insert:
import numpy as np

df = df.rename(columns={'Kg':'Kg_To'})
a = np.unique(df["Kg_To"])
df.insert(0, 'Kg_from', df['Kg_To'].map(dict(zip(a, np.insert(a, 0, 0)))))
print (df)
Kg_from Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
Code:
kgs = df.Kg.unique()
lower = [0] + list(kgs[:-1])
kg_dict = {k:v for v,k in zip(lower,kgs)}
# new dataframe
new_df = pd.DataFrame({
    'Kg_From': df['Kg'].map(kg_dict),
    'Kg_To': df['Kg']
})
# or if you want new columns:
df['Kg_from'] = df['Kg'].map(kg_dict)
Output:
Kg_From Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
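A further sketch (my own, assuming as above that the original df has a sorted Kg column): pd.factorize gives an integer code per row plus the array of unique values, so the lower bound can be looked up by shifting those uniques.
import numpy as np

codes, uniques = pd.factorize(df['Kg'])
# prepend 0 as the lower bound of the first bucket, then look up by code
lower = np.insert(np.asarray(uniques, dtype=float), 0, 0.0)
df['Kg_From'] = lower[codes]
df['Kg_To'] = df['Kg']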

Greedy most diverse subset of pandas dataframe

This is my dataset:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
I want to get the top-n rows (a subset) of that dataframe that are maximally diverse.
To compute diversity, I use 1 - Jaccard similarity.
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
Using dataframe operations, I can take the cartesian product of the dataframe with apply, compute the diversity value of each pair, and get each row's most diverse partner with df.idxmax(axis=1). But that way I have to compute the diversity of every pair first, which is not efficient.
0 1 2 3 4 5 6 7 8 9 10
0 0.0 1.0 0.8 0.5 0.5 0.8 0.5 1.0 0.8 0.8 0.8
1 0.0 0.0 1.0 0.8 1.0 0.8 1.0 0.8 0.8 0.8 0.8
2 0.0 0.0 0.0 1.0 0.5 1.0 0.5 0.8 0.8 1.0 1.0
3 0.0 0.0 0.0 0.0 0.8 0.8 0.8 0.8 0.5 0.8 0.5
4 0.0 0.0 0.0 0.0 0.0 0.8 0.8 1.0 0.5 1.0 0.8
df.idxmax(axis=1).sample(4)
5 6
2 3
0 1
8 9
dtype: int64
I want to implement this algorithm, but somehow I did not understand lines 6 and 7.
How do I compute the argmax here? And why does line 10 return Sk when Sk is never initialized inside the loop?
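The referenced pseudocode is not reproduced here, but a minimal sketch of one common greedy formulation may help with the argmax step: start from an arbitrary row, then repeatedly add the candidate whose minimum diversity to the already-selected rows is largest (that max over candidates is the argmax). The helper names (greedy_diverse, top_n) are my own, and the sketch reuses the jaccard function above.
def greedy_diverse(df, top_n):
    # each row becomes a set of its values so the jaccard function above applies
    rows = [set(r) for r in df.itertuples(index=False)]
    selected = [0]                      # arbitrary starting row
    while len(selected) < top_n:
        candidates = [i for i in range(len(rows)) if i not in selected]
        # argmax over candidates of the minimum diversity to the selected set
        best = max(candidates,
                   key=lambda i: min(1 - jaccard(rows[i], rows[j]) for j in selected))
        selected.append(best)
    return df.iloc[selected]

print(greedy_diverse(df, top_n=4))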

pandas merge dataframe and pivot creating new columns

I've got two input dataframes
df1 (note, this DF could have more columns of data)
Sample Animal Time Sex
0 1 A one male
1 2 A two male
2 3 B one female
3 4 C one male
4 5 D one female
and df2
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
and I'd like to combine them so that I get the following:
one_a one_b one_c two_a two_b two_c Sex
Animal
A 0.2 0.4 0.3 0.5 0.7 0.2 male
B 0.4 0.1 0.9 NaN NaN NaN female
C 0.4 0.2 0.3 NaN NaN NaN male
D 0.6 0.2 0.4 NaN NaN NaN female
This is how I'm doing things:
cols = df2.columns
df2.reset_index(inplace=True)
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=list(cols))
df4 = pd.merge(df3, df1, on='Sample')
df4['moo'] = df4['Time'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop(['Sample', 'Time'], axis=1)
This works just fine, but could potentially be slow for large datasets. I'm wondering whether any pandas pros see a better (read: faster, more efficient) way? I'm new to pandas and can imagine there are shortcuts here that I don't know about.
A few steps here. The key is that, in order to create columns like one_a, one_b, ..., two_c, we need to add the Time column to the Sample index to build a multi-level index and then unstack it to get the required shape. Then a groupby on the Animal index is required to aggregate and reduce the number of NaNs. The rest is just some manipulation of the format.
import pandas as pd
# your data
# ==============================
# set index
df1 = df1.set_index('Sample')
print(df1)
Animal Time Sex
Sample
1 A one male
2 A two male
3 B one female
4 C one male
5 D one female
print(df2)
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
# processing
# =============================
df = df1.join(df2)
df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()
print(df_temp)
a b c
Time one two one two one two
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(x, y) for x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))]
print(df_temp)
one_a two_a one_b two_b one_c two_c
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
result = df_temp.reset_index('Sex').groupby(level='Animal').agg(max).sort_index(axis=1)
print(result)
Sex one_a one_b one_c two_a two_b two_c
Animal
A male 0.2 0.4 0.3 0.5 0.7 0.2
B female 0.4 0.1 0.9 NaN NaN NaN
C male 0.4 0.2 0.3 NaN NaN NaN
D female 0.6 0.2 0.4 NaN NaN NaN
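For comparison, a shorter sketch (my own, not part of the answer above) that leans on pivot_table; it assumes df1 still has Sample as an ordinary column and df2 is indexed by Sample, as shown in the question.
merged = df1.join(df2, on='Sample')                       # bring a, b, c onto df1
wide = merged.pivot_table(index='Animal', columns='Time', values=['a', 'b', 'c'])
# flatten the (value, time) MultiIndex columns into names like one_a, two_c
wide.columns = ['{}_{}'.format(t, v) for v, t in wide.columns]
result = wide.join(merged.groupby('Animal')['Sex'].first())
print(result)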
