create new columns from a list of columns in pandas - python

I have a pandas dataframe that has a column where the data is a list of statistics calculated from a groupby operation.
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    return len(x)/5, sum(x)/len(x), sum(x)
>>> df.groupby('a').apply(lambda row : calculate_stuff(row.b))
a
1 (0, 3, 9)
2 (0, 3, 10)
3 (0, 2, 2)
dtype: object
Basically, I have several statistics that depend on each other and have to be calculated for each groupby row. The function that does this returns a tuple of the statistics values. What I want is to create a new column for each index of the tuple so that it looks like this:
a col1 col2 col3
1 0 3 9
2 0 3 10
3 0 2 2
I don't think I can use df.groupby('a').agg because one of the calculations is required for the other calculations. Any suggestions?
Edit: I realized the functions in my example were not actually aggregate functions, so I changed them.

Adding an extra item to the a category so the result is 4x3.
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 4],
                   'b': [3, 4, 2, 3, 4, 3, 2, 1]})
new_cols = ['col1', 'col2', 'col3']
gb = df.groupby('a').apply(lambda group: calculate_stuff(group.b))
>>> pd.DataFrame(zip(*gb), columns=gb.index, index=new_cols).T
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
4 0 1 1
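An alternative sketch: returning a labeled pd.Series from the applied function lets pandas expand the tuple into columns directly (col1..col3 are the desired names from the question; note that Python 3 division produces floats here, unlike the Python 2 integer-division output above).

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3], 'b': [3, 4, 2, 3, 4, 3, 2]})

def calculate_stuff(x):
    # in Python 3 these are float divisions
    return len(x) / 5, sum(x) / len(x), sum(x)

# Wrapping the tuple in a labeled Series makes pandas expand it into columns
stats = df.groupby('a')['b'].apply(
    lambda s: pd.Series(calculate_stuff(s), index=['col1', 'col2', 'col3'])
).unstack()
```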

You can try list comprehension:
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    return len(x)/5, sum(x)/len(x), sum(x)
group_df = df.groupby('a').apply(lambda row: calculate_stuff(row.b))
print(pd.DataFrame([x for x in group_df],
                   columns=['col1', 'col2', 'col3'],
                   index=group_df.index))
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
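Since group_df is a Series of tuples, its .tolist() can also feed the DataFrame constructor directly, which avoids both the zip transpose and the list comprehension (a sketch of the same idea):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3], 'b': [3, 4, 2, 3, 4, 3, 2]})

def calculate_stuff(x):
    return len(x) / 5, sum(x) / len(x), sum(x)

# Series of tuples -> list of tuples -> one column per tuple position
group_df = df.groupby('a')['b'].apply(calculate_stuff)
result = pd.DataFrame(group_df.tolist(),
                      columns=['col1', 'col2', 'col3'],
                      index=group_df.index)
```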

How to count the number of elements in a list and create a new column?

I have a df as follows:
Col1 Col2
0 [7306914, 7306915]
1 [7295911, 7295912]
2 [7324496]
3 [7294109, 7294110]
4 [7313713]
The second column is a list.
What I would like is to create a new column that contains the total number of elements in each list.
Expected Output:
Col1 Col2 Col3
0 [7306914, 7306915] 2
1 [7295911, 7295912] 2
2 [7324496] 1
3 [7294109, 7294110] 2
4 [7313713] 1
Use Series.str.len. This is a vectorized method and is more efficient than the apply function, which essentially loops under the hood:
df = pd.DataFrame([{'Col1': 0, 'Col2': [7306914, 7306915]}, {'Col1': 1, 'Col2': [7295911, 7295912]}, {'Col1': 2, 'Col2': [7324496]}, {'Col1': 3, 'Col2': [7294109, 7294110]}, {'Col1': 4, 'Col2': [7313713]}])
df['Col3'] = df['Col2'].str.len()
[out]
print(df)
Col1 Col2 Col3
0 0 [7306914, 7306915] 2
1 1 [7295911, 7295912] 2
2 2 [7324496] 1
3 3 [7294109, 7294110] 2
4 4 [7313713] 1
Try this:
df_tmp = pd.DataFrame({'col1':[[1,2,3], [1,2]]}).reset_index()
In [360]:
df_tmp.head()
Out[360]:
index col1
0 0 [1, 2, 3]
1 1 [1, 2]
In [364]:
df_tmp['len'] = df_tmp.apply(lambda x: len(x['col1']), axis=1)
In [365]:
df_tmp
Out[365]:
index col1 len
0 0 [1, 2, 3] 3
1 1 [1, 2] 2
apply should be fast enough for this.
Using Series.apply() like this:
df['Col3'] = df['Col2'].apply(len)
Hope this helps.
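For completeness, Series.map(len) gives the same counts as the accessor and apply versions (a small sketch; Col4 is an illustrative name, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'Col1': range(5),
                   'Col2': [[7306914, 7306915], [7295911, 7295912], [7324496],
                            [7294109, 7294110], [7313713]]})
df['Col3'] = df['Col2'].str.len()   # accessor version from the accepted answer
df['Col4'] = df['Col2'].map(len)    # map(len) produces the same counts
```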

nunique excluding some values in pandas

I am calculating unique values per row. However, I want to exclude the value 0 before counting uniques.
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0],}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 0
1 2 4 4
2 3 0 0
Expected output
col1 col2 col3 uniques
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
df.nunique(axis=1) includes all values.
To do this you can simply replace zeroes with NaN values:
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]}
df = pd.DataFrame(data=d)
df['uniques'] = df.replace(0, np.nan).nunique(axis=1)
Try this:
def func(x):
    s = set(x)
    s.discard(0)
    return len(s)

df['uniq'] = df.apply(func, axis=1)
A slightly more concise version without using replace:
df['unique'] = df[df!=0].nunique(axis=1)
df
Output:
col1 col2 col3 unique
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
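The mask-based approach generalizes to excluding several values at once via isin (a sketch; the exclude list is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]})

exclude = [0]  # values to ignore; extend as needed
# mask() turns the excluded values into NaN, which nunique() skips
df['uniques'] = df.mask(df.isin(exclude)).nunique(axis=1)
```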

Conditional ratio with group by in pandas

I want to do a groupby on column 1 then get the sum of values from column 2, conditional on the value in column 3, which are then divided by the total sum in column 2, still grouped by column 1.
An example is given below:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
I want to create a new column: col4. For this column I group by col1 and then take the sum of col2 values where col3 is 1, divided by the total grouped sum of col2. That way I would end up with the following result (I put it in fractions to make the calculations easier to follow):
col1 col2 col3 col4
0 1 3 1 3/5
1 2 4 1 4/11
2 1 2 0 3/5
3 2 7 0 4/11
I tried the following, but this does not work unfortunately:
df.col4 = df.groupby(['col1']).transform(lambda x: np.where(x.col3 == 1, x.col2, 0).sum()) / df.groupby(['col1']).col2.transform('sum')
Edit | Extended example
I extended the example as the solution provided by Wen only covered the above simple example.
d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
4 1 6 1
5 2 8 0
Edit | Possible solution
I found a possible solution. I would like to do it in a cleaner way, but this is readable and pretty simple. Any alternatives that combine these two lines of code are still appreciated, of course.
df['col4'] = np.where(df.col3 == 1, df.col2, 0)
df['col4'] = df.groupby(['col1']).col4.transform('sum') / df.groupby(['col1']).col2.transform('sum')
You may need to correct your expected output; then use map after filtering:
df.col1.map(df.loc[df.col3==1,].set_index('col1').col2)/df.groupby(['col1']).col2.transform('sum')
Out[566]:
0 0.600000
1 0.363636
2 0.600000
3 0.363636
dtype: float64
simple :)
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
df['col4'] = 0.0
def con(data):
    part_a = sum(data[data['col3'] == 1]['col2'])
    part_b = sum(data['col2'])
    data.col4 = part_a / part_b
    return data
df.groupby('col1').apply(con)
Output
col1 col2 col3 col4
0 1 3 1 0.600000
1 2 4 1 0.363636
2 1 2 0 0.600000
3 2 7 0 0.363636
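The two-line solution above can also be written without the intermediate col4 assignment, using Series.where before a grouped transform (a sketch, checked against the extended example):

```python
import pandas as pd

d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)

# zero out col2 where col3 != 1, then divide the grouped sums
num = df['col2'].where(df['col3'].eq(1), 0).groupby(df['col1']).transform('sum')
den = df.groupby('col1')['col2'].transform('sum')
df['col4'] = num / den
```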

Subdataframes of Subdataframes

If I have a dataframe df_i and I want to split it into sub-dataframes based on unique values of 'Cycle Number', I use:
dfs = {k: df_i[df_i['Cycle Number'] == k] for k in df_i['Cycle Number'].unique()}
Assuming the 'Cycle Number' ranges from 1 to 50 and in each cycle, I have steps ranging from 1 to 15, how do I split each data frame into 15 further data frames?
I am presuming something of this type would work:
for i in range(1, 51):
    dsfs = {k: dfs[i][dfs[i]['Step Number'] == k] for k in dfs[i]['Step Number'].unique()}
But this will only give me the 15 data frames for cycle number 50, since dsfs is overwritten on every iteration.
If I want to access a sub-dataframe in the 20th Cycle with step number 10, is there a way of generating the subdata frame such that I can access it using something like dfs[20][10]?
A simple parallel:
Step Number Cycle Number Desired Access
1 1 dfs[1][1]
2 1 dfs[1][2]
3 1 dfs[1][3]
4 1 dfs[1][4]
5 1 dfs[1][5]
1 2 dfs[2][1]
2 2 dfs[2][2]
3 2 dfs[2][3]
4 2 dfs[2][4]
5 2 dfs[2][5]
1 3 dfs[3][1]
2 3 dfs[3][2]
3 3 dfs[3][3]
4 3 dfs[3][4]
5 3 dfs[3][5]
1 4 dfs[4][1]
2 4 dfs[4][2]
3 4 dfs[4][3]
4 4 dfs[4][4]
5 4 dfs[4][5]
You can use tuple keys instead and utilize groupby. Here's a minimal example:
df = pd.DataFrame([[0, 1, 2], [0, 1, 3], [1, 2, 4], [1, 2, 5], [1, 3, 6], [1, 3, 7]],
                  columns=['col1', 'col2', 'col3'])
dfs = dict(tuple(df.groupby(['col1', 'col2'])))
for k, v in dfs.items():
    print(k)
    print(v)
(0, 1)
col1 col2 col3
0 0 1 2
1 0 1 3
(1, 2)
col1 col2 col3
2 1 2 4
3 1 2 5
(1, 3)
col1 col2 col3
4 1 3 6
5 1 3 7
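For the dfs[20][10] access pattern from the question, a nested dict built from a two-column groupby works (a sketch with made-up data; 'Cycle Number' and 'Step Number' are the question's column names, 'value' is illustrative):

```python
from collections import defaultdict

import pandas as pd

df_i = pd.DataFrame({'Cycle Number': [1, 1, 2, 2],
                     'Step Number': [1, 2, 1, 2],
                     'value': [10, 20, 30, 40]})

# build a dict of dicts keyed first by cycle, then by step
dfs = defaultdict(dict)
for (cycle, step), group in df_i.groupby(['Cycle Number', 'Step Number']):
    dfs[cycle][step] = group

# dfs[2][1] is now the sub-dataframe for cycle 2, step 1
```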

Pandas merge duplicate DataFrame columns preserving column names

How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
df = pd.DataFrame({"col1": [0, 0, 1, 2, 5, 3, 7],
                   "col2": [0, 1, 2, 3, 3, 3, 4],
                   "col3": [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column
col1 col2
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
How can I keep the information on which columns were merged? e.g. something like
[col1] [col2, col3]
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
Thanks!
# group columns by their values
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one column from each group of the columns
unique_df = df.loc[:, grouped_columns.str[0]]
# make a new column name for each group; a list can't be used as a column name, so join the names
unique_df.columns = grouped_columns.apply("-".join)
unique_df
I also used T and tuple to group by:
def f(x):
    d = x.iloc[[0]]
    d.index = ['-'.join(x.index.tolist())]
    return d

df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T
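Note that groupby(..., axis=1) used above is deprecated in recent pandas. A sketch of the same merge that instead groups the column labels by their value tuples in plain Python:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({"col1": [0, 0, 1, 2, 5, 3, 7],
                   "col2": [0, 1, 2, 3, 3, 3, 4],
                   "col3": [0, 1, 2, 3, 3, 3, 4]})

# map each column's full value tuple to the list of columns sharing it
groups = defaultdict(list)
for col in df.columns:
    groups[tuple(df[col])].append(col)

# keep one representative column per group, joining the merged names
merged = pd.DataFrame({'-'.join(cols): df[cols[0]] for cols in groups.values()})
```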
