I want to group by column 1, take the sum of the values in column 2 where column 3 equals 1, and divide that by the total sum of column 2, still grouped by column 1.
An example is given below:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
I want to create a new column, col4. For this column I group by col1 and divide the sum of the col2 values where col3 is 1 by the total grouped sum of col2, so that I end up with the following result. (I put it in fractions to make the calculations easier to follow.)
col1 col2 col3 col4
0 1 3 1 3/5
1 2 4 1 4/11
2 1 2 0 3/5
3 2 7 0 4/11
I tried the following, but unfortunately it does not work (transform passes each column to the lambda separately, so x.col3 and x.col2 are not available inside it):
df.col4 = df.groupby(['col1']).transform(lambda x: np.where(x.col3 == 1, x.col2, 0).sum()) / df.groupby(['col1']).col2.transform('sum')
Edit | Extended example
I extended the example, as the solution provided by Wen only covered the simple example above.
d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
4 1 6 1
5 2 8 0
Edit | Possible solution
I found a possible solution. I would like to do it in a cleaner way, but this is readable and pretty simple. Any alternatives that combine these two lines of code are of course still appreciated.
df['col4'] = np.where(df.col3 == 1, df.col2, 0)
df['col4'] = df.groupby(['col1']).col4.transform('sum') / df.groupby(['col1']).col2.transform('sum')
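One way to combine the two lines into a single expression is to mask col2 first with where; a sketch, equivalent to the two-step version above:
df['col4'] = (
    df.col2.where(df.col3 == 1, 0).groupby(df.col1).transform('sum')
    / df.groupby('col1').col2.transform('sum')
)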
You may need to correct your expected output; then you can use map after filtering:
df.col1.map(df.loc[df.col3 == 1].set_index('col1').col2) / df.groupby(['col1']).col2.transform('sum')
Out[566]:
0 0.600000
1 0.363636
2 0.600000
3 0.363636
dtype: float64
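Note that set_index('col1') requires col1 to be unique among the rows where col3 == 1, so this breaks on the extended example (map raises on a non-unique index). A sketch that aggregates before mapping handles the duplicates:
df.col1.map(df.loc[df.col3 == 1].groupby('col1').col2.sum()) / df.groupby(['col1']).col2.transform('sum')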
simple :)
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
df['col4'] = 0.0

def con(data):
    # sum of col2 within the group, restricted to rows where col3 == 1
    part_a = sum(data[data['col3'] == 1]['col2'])
    # total sum of col2 within the group
    part_b = sum(data['col2'])
    data.col4 = part_a / part_b
    return data

df.groupby('col1').apply(con)
Output
col1 col2 col3 col4
0 1 3 1 0.600000
1 2 4 1 0.363636
2 1 2 0 0.600000
3 2 7 0 0.363636
Related
Let's assume I have the following dataframe:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
I want this dataframe sorted by col1 and col2 on the minimum value. The order of the indexes should be 2, 0, 1, 3.
I tried df.sort_values(by=['col2', 'col1']), but that sorts by col2 first and only then by col1. Is there any way to sort by the minimum of the two columns?
Using numpy.lexsort. np.sort orders each row, so the first column holds the row minimum; reversing the columns and transposing makes that minimum the primary sort key:
import numpy as np

order = np.lexsort(np.sort(df[['col1', 'col2']])[:, ::-1].T)
out = df.iloc[order]
Output:
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
Note that you can easily handle any number of columns:
df.iloc[np.lexsort(np.sort(df[['col1', 'col2', 'col3']])[:, ::-1].T)]
col1 col2 col3 outcome
1 2 2 0 0
2 3 1 1 1
0 1 4 1 1
3 4 3 1 0
One way (not the most efficient):
idx = df[['col2', 'col1']].apply(lambda x: tuple(sorted(x)), axis=1).sort_values().index
Output:
>>> df.loc[idx]
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
>>> idx
Int64Index([2, 0, 1, 3], dtype='int64')
You can decorate-sort-undecorate, where the decoration is the row-wise minimal and other (i.e., maximal) values:
cols = ["col1", "col2"]
(df.assign(_min=df[cols].min(axis=1), _other=df[cols].max(axis=1))
.sort_values(["_min", "_other"])
.drop(columns=["_min", "_other"]))
to get
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
I would compute min(col1, col2) as a new column and then sort by it:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
df['colmin'] = df[['col1','col2']].min(axis=1) # compute min
df = df.sort_values(by='colmin').drop(columns='colmin') # sort then drop min
print(df)
gives output
col1 col2 col3 outcome
0 1 4 1 1
2 3 1 1 1
1 2 2 0 0
3 4 3 1 0
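Note that sorting on the minimum alone cannot break the tie between rows 0 and 2, so you may get 0 before 2 (as shown) rather than the 2, 0, 1, 3 order from the question. A sketch that adds the row maximum as a tie-breaker:
df['colmin'] = df[['col1', 'col2']].min(axis=1)
df['colmax'] = df[['col1', 'col2']].max(axis=1)  # tie-breaker for equal minima
df = df.sort_values(by=['colmin', 'colmax']).drop(columns=['colmin', 'colmax'])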
I have a df as follows:
Col1 Col2
0 [7306914, 7306915]
1 [7295911, 7295912]
2 [7324496]
3 [7294109, 7294110]
4 [7313713]
Each entry in the second column is a list. What I would like is to create a new column that contains the number of elements in each list.
Expected Output:
Col1 Col2 Col3
0 [7306914, 7306915] 2
1 [7295911, 7295912] 2
2 [7324496] 1
3 [7294109, 7294110] 2
4 [7313713] 1
Use Series.str.len. Despite the str in the name it also works on lists, and it is more efficient than the apply function, which is essentially looping under the hood:
df = pd.DataFrame([{'Col1': 0, 'Col2': [7306914, 7306915]}, {'Col1': 1, 'Col2': [7295911, 7295912]}, {'Col1': 2, 'Col2': [7324496]}, {'Col1': 3, 'Col2': [7294109, 7294110]}, {'Col1': 4, 'Col2': [7313713]}])
df['Col3'] = df['Col2'].str.len()
[out]
print(df)
Col1 Col2 Col3
0 0 [7306914, 7306915] 2
1 1 [7295911, 7295912] 2
2 2 [7324496] 1
3 3 [7294109, 7294110] 2
4 4 [7313713] 1
Try this:
df_tmp = pd.DataFrame({'col1':[[1,2,3], [1,2]]}).reset_index()
In [360]:
df_tmp.head()
Out[360]:
index col1
0 0 [1, 2, 3]
1 1 [1, 2]
In [364]:
df_tmp['len'] = df_tmp.apply(lambda x: len(x['col1']), axis=1)
In [365]:
df_tmp
Out[365]:
index col1 len
0 0 [1, 2, 3] 3
1 1 [1, 2] 2
Apply should be the fastest way to do that.
Using Series.apply() with the built-in len, like this:
df['Col3'] = df['Col2'].apply(len)
Hope it helps.
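An equivalent spelling uses map with the same built-in (a sketch):
df['Col3'] = df['Col2'].map(len)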
I am counting unique values per row. However, I want to exclude the value 0 and then count the uniques.
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0],}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 0
1 2 4 4
2 3 0 0
Expected output
col1 col2 col3 uniques
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
I tried df.nunique(axis=1), but this includes all values.
To do this you can simply replace zeros with NaN values, which nunique ignores by default:
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]}
df = pd.DataFrame(data=d)
df['uniques'] = df.replace(0, np.nan).nunique(axis=1)
Try this:
def func(x):
    s = set(x)    # unique values in the row
    s.discard(0)  # drop 0 if present
    return len(s)

df['uniq'] = df.apply(func, axis=1)
A slightly more concise version without using replace; the boolean mask turns the zeros into NaN, which nunique skips:
df['unique'] = df[df!=0].nunique(axis=1)
df
Output:
col1 col2 col3 unique
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
If I have a dataframe df_i and I want to split it into sub-dataframes based on the unique values of 'Cycle Number', I use:
dfs = {k: df_i[df_i['Cycle Number'] == k] for k in df_i['Cycle Number'].unique()}
Assuming the 'Cycle Number' ranges from 1 to 50 and in each cycle, I have steps ranging from 1 to 15, how do I split each data frame into 15 further data frames?
I am presuming something of this type would work:
for i in range(1, 51):
    dsfs = {k: dfs[i][dfs[i]['Step Number'] == k] for k in dfs[i]['Step Number'].unique()}
But this overwrites dsfs on every iteration, so I end up with only the 15 data frames from cycle 50, not the ones before.
If I want to access a sub-dataframe in the 20th Cycle with step number 10, is there a way of generating the subdata frame such that I can access it using something like dfs[20][10]?
A simple illustration of the desired access pattern:
Step Number Cycle Number Desired Access
1 1 dfs[1][1]
2 1 dfs[1][2]
3 1 dfs[1][3]
4 1 dfs[1][4]
5 1 dfs[1][5]
1 2 dfs[2][1]
2 2 dfs[2][2]
3 2 dfs[2][3]
4 2 dfs[2][4]
5 2 dfs[2][5]
1 3 dfs[3][1]
2 3 dfs[3][2]
3 3 dfs[3][3]
4 3 dfs[3][4]
5 3 dfs[3][5]
1 4 dfs[4][1]
2 4 dfs[4][2]
3 4 dfs[4][3]
4 4 dfs[4][4]
5 4 dfs[4][5]
You can use tuple keys instead and utilize groupby. Here's a minimal example:
df = pd.DataFrame([[0, 1, 2], [0, 1, 3], [1, 2, 4], [1, 2, 5], [1, 3, 6], [1, 3, 7]],
columns=['col1', 'col2', 'col3'])
dfs = dict(tuple(df.groupby(['col1', 'col2'])))
for k, v in dfs.items():
print(k)
print(v)
(0, 1)
col1 col2 col3
0 0 1 2
1 0 1 3
(1, 2)
col1 col2 col3
2 1 2 4
3 1 2 5
(1, 3)
col1 col2 col3
4 1 3 6
5 1 3 7
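If you want the nested dfs[cycle][step] access from the question, a sketch that builds a nested dict from a single two-level groupby (column names taken from the question):
dfs = {}
for (cycle, step), group in df_i.groupby(['Cycle Number', 'Step Number']):
    dfs.setdefault(cycle, {})[step] = group
# e.g. dfs[20][10] is the sub-frame for cycle 20, step 10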
I have a pandas dataframe that has a column where the data is a list of statistics calculated from a groupby operation.
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    return len(x)/5, sum(x)/len(x), sum(x)
>>> df.groupby('a').apply(lambda row : calculate_stuff(row.b))
a
1 (0, 3, 9)
2 (0, 3, 10)
3 (0, 2, 2)
dtype: object
Basically, I have several statistics that depend on each other and have to be calculated for each groupby row. The function that does this returns a tuple of the statistics values. What I want is to create a new column for each index of the tuple so that it looks like this:
a col1 col2 col3
1 0 3 9
2 0 3 10
3 0 2 2
I don't think I can use df.groupby('a').agg because one of the calculations is required for the other calculations. Any suggestions?
Edit: I realized the aggregate functions in my example were not actually aggregate functions, so I changed them.
Adding an extra a-category item so the result is 4x3:
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 4],
'b': [3, 4, 2, 3, 4, 3, 2, 1]})
new_cols = ['col1', 'col2', 'col3']
gb = df.groupby('a').apply(lambda group: calculate_stuff(group.b))
>>> pd.DataFrame(zip(*gb), columns=gb.index, index=new_cols).T
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
4 0 1 1
You can try a list comprehension:
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    return len(x)/5, sum(x)/len(x), sum(x)
group_df = df.groupby('a').apply(lambda row : calculate_stuff(row.b))
print(pd.DataFrame([x for x in group_df],
                   columns=['col1', 'col2', 'col3'],
                   index=group_df.index))
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
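A slightly more direct variant of the same idea (a sketch reusing group_df from above): expand the Series of tuples with tolist and keep the group index:
pd.DataFrame(group_df.tolist(), columns=['col1', 'col2', 'col3'], index=group_df.index)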