Can Pandas use a list for groupby?

import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
L = [0, 1, 0, 1, 2, 0]
print(df.groupby(L).sum())
The output is:
data1 data2
0 7 17
1 4 3
2 4 7
I need a clear explanation, please. What are the 0s, 1s and 2 in L? Are they values of the key column of the df? Or are they index labels of df? And how did groupby group based on L?

L is a list of integers in your example. When you group by L, you are simply saying: look at this list of integers and group my df based on those unique integers.
I think visualizing it will make sense (note that the df doesn't have a column L - I just added it for visualization):
Grouping by L means: take the unique values (in this case 0, 1 and 2) and sum data1 and data2 within each group. So the result for data1 when L=0 is 0 + 2 + 5 = 7 (etc.)
and the end result would be:
df.groupby(L).sum()
data1 data2
0 7 17
1 4 3
2 4 7
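An equivalent way to see this (just a sketch; the raw list is what pandas accepts directly) is to wrap L in a Series aligned on the df's index and group by that:
import pandas as pd

# hypothetical check: a Series built from L, aligned on df's index,
# produces the same grouping as passing the raw list
grouper = pd.Series(L, index=df.index)
print(df.groupby(grouper)[['data1', 'data2']].sum())  # same numbers as above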

You can use a list to group observations in your dataframe. For instance, say you have the heights of a few people:
import pandas as pd
df = pd.DataFrame({'names': ['John', 'Mark', 'Fred', 'Julia', 'Mary'],
                   'height': [180, 180, 180, 160, 160]})
print(df)
names height
0 John 180
1 Mark 180
2 Fred 180
3 Julia 160
4 Mary 160
And elsewhere, you received their assigned groups:
sex = ['man', 'man', 'man', 'woman', 'woman']
You won't need to concatenate a new column to your dataframe just to group them. You can use the list to do the work:
df.groupby(sex).mean()
height
man 180
woman 160
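For comparison, here is a sketch of the column-based approach the list lets you skip - attaching the list as a temporary column gives the same numbers:
# equivalent sketch: attach the list as a column, then group by its name
df.assign(sex=sex).groupby('sex')['height'].mean()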

You can see here how it works:
In [6006]: df.groupby(L).agg(list)
Out[6006]:
key data1 data2
0 [A, C, C] [0, 2, 5] [5, 3, 9]
1 [B, A] [1, 3] [0, 3]
2 [B] [4] [7]
In [6002]: list(df.groupby(L))
Out[6002]:
[(0, key data1 data2
0 A 0 5
2 C 2 3
5 C 5 9),
(1, key data1 data2
1 B 1 0
3 A 3 3),
(2, key data1 data2
4 B 4 7)]
Grouping by L: the 0 group has keys A, C, C (index 0, 2, 5), the 1 group has keys B, A (index 1, 3), and the 2 group has key B (index 4).
This is due to the positional alignment of L with the rows:
df['L'] = L
key data1 data2 L
0 A 0 5 0
1 B 1 0 1
2 C 2 3 0
3 A 3 3 1
4 B 4 7 2
5 C 5 9 0
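and once that helper column exists, grouping by its name gives the same sums (a quick check; selecting the numeric columns explicitly keeps the output identical to the list version):
print(df.groupby('L')[['data1', 'data2']].sum())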
I hope this makes sense

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have such Pandas dataframe
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l=[]):
    return l.nlargest(2).min()

print(df.apply(second_largest, axis=1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for the positions of the second largest values:
import numpy as np

df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
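To see why this works, here is an illustrative breakdown (restricted to the original a, b, c columns so the new column doesn't interfere): np.argsort returns, for each row, the column positions in ascending order of value, so position -2 holds the second largest:
arr = df[['a', 'b', 'c']].to_numpy()
order = np.argsort(arr)        # per-row column positions, ascending by value
second = order[:, -2]          # position of the 2nd largest in each row
print(second)                  # [1 1 2 2 2] -> b, b, c, c, c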
Your approach works if you return idxmin instead of min, but it is slow:
def second_largest(l=[]):
    return l.nlargest(2).idxmin()

print(df.apply(second_largest, axis=1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
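np.argpartition only guarantees that the entry at position -N lands where a full sort would put it, which is cheaper than fully sorting every row. A tiny illustration on one row:
row = np.array([3, 40, 5])
# only the index at position -2 is guaranteed correct; here it is 2 (value 5, the 2nd largest)
print(np.argpartition(row, -2))   # e.g. [0 2 1]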
If you want a pure pandas solution (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
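The one-liner is easier to read once broken apart: df.stack() produces a Series indexed by (row, column) pairs, each level-0 group is one original row, and s.nlargest(2).index[-1] is the (row, column) label of the second largest value, so [1] extracts the column name. The same idea on a single row, as a sketch:
s0 = df.loc[0, ['a', 'b', 'c']]       # Series with the column labels as its index
print(s0.nlargest(2))                 # c 25, b 20
print(s0.nlargest(2).index[-1])       # 'b' -> the 2nd largest column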

Multiple insert columns if not exist pandas

I have the following df
list_columns = ['A', 'B', 'C']
list_data = [
    [1, '2', 3],
    [4, '4', 5],
    [1, '2', 3],
    [4, '4', 6]
]
df = pd.DataFrame(columns=list_columns, data=list_data)
I want to check if multiple columns exist, and if not to create them.
Example:
If B, C, D do not exist, create them (for the above df it will create only the D column).
I know how to do this with one column:
if 'D' not in df:
    df['D'] = 0
Is there a way to test whether all my columns exist and, if not, create the ones that are missing, without writing an if for each column?
A loop is not necessary here - use DataFrame.reindex with Index.union:
cols = ['B','C','D']
df = df.reindex(df.columns.union(cols, sort=False), axis=1, fill_value=0)
print (df)
A B C D
0 1 2 3 0
1 4 4 5 0
2 1 2 3 0
3 4 4 6 0
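The Index.union call just builds the combined label set; with sort=False the existing columns keep their positions and the new ones are appended, which reindex then fills with 0. A quick look at the intermediate index (a sketch):
print(df.columns.union(['B', 'C', 'D'], sort=False))
# Index(['A', 'B', 'C', 'D'], dtype='object')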
Just to add, you can compute the set difference between the list and your columns and unpack it into assign with ** unpacking.
import numpy as np
cols = ['B','C','D','E']
df.assign(**{col : 0 for col in np.setdiff1d(cols,df.columns.values)})
A B C D E
0 1 2 3 0 0
1 4 4 5 0 0
2 1 2 3 0 0
3 4 4 6 0 0
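If you prefer something explicit, a plain loop generalizing the single-column check from the question also works (and, unlike np.setdiff1d, which returns its result sorted, it keeps whatever order you list):
# sketch: the question's one-column check, generalized
for col in ['B', 'C', 'D']:
    if col not in df.columns:
        df[col] = 0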

How Pandas does groupby for the scenario below

I am facing a problem while trying to understand the below code snippet of groupby. I am trying to understand how the calculation happens for df.groupby(L).sum().
This is a code snippet that I got from a link.
Thanks for any help.
Rows are grouped by the values of the list, because the length of the list is the same as the number of rows in the DataFrame; it means:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
L = [0, 1, 0, 1, 2, 0]
print (df)
key data1 data2
0 A 0 5 <-0
1 B 1 0 <-1
2 C 2 3 <-0
3 A 3 3 <-1
4 B 4 7 <-2
5 C 5 9 <-0
So:
data1 for 0 is 0 + 2 + 5 = 7
data2 for 0 is 5 + 3 + 9 = 17
data1 for 1 is 1 + 3 = 4
data2 for 1 is 0 + 3 = 3
data1 for 2 is 4
data2 for 2 is 7
Output:
print(df.groupby(L).sum())
data1 data2
0 7 17
1 4 3
2 4 7
The key column is omitted because of the automatic exclusion of 'nuisance' columns.
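A hedged note beyond the original snippet: that automatic exclusion applies to older pandas; since pandas 2.0 'nuisance' columns are no longer silently dropped, so on recent versions you would select the numeric columns explicitly:
print(df.groupby(L)[['data1', 'data2']].sum())
# or: df.groupby(L).sum(numeric_only=True)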

Pandas Dataframe. Expand tuple values as columns with multiindex

I have a dataframe df:
A B
first second
bar one 0.0 0.0
two 0.0 0.0
foo one 0.0 0.0
two 0.0 0.0
I transform it to another one where values are tuples:
A B
first second
bar one (6, 1, 0) (0, 9, 3)
two (9, 3, 4) (6, 2, 1)
foo one (1, 9, 0) (4, 0, 0)
two (6, 1, 5) (8, 3, 5)
My question is how can I get it (expanded) to be like below where tuples values become columns with multiindex? Can I do it during transform or should I do it as an additional step after transform?
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
Code for the above:
import numpy as np
import pandas as pd
np.random.seed(123)
def expand(s):
    # complex logic of `result` has been replaced with `np.random`
    result = [tuple(np.random.randint(10, size=3)) for i in s]
    return result
index = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame(np.zeros((4, 2)), index=index, columns=['A', 'B'])
print(df)
expanded = df.groupby(['second']).transform(expand)
print(expanded)
Try this:
df_lst = []
for col in df.columns:
    expanded_splt = expanded.apply(lambda x: pd.Series(x[col]), axis=1)
    columns = pd.MultiIndex.from_product([[col], ['m', 'n', 'k']])
    expanded_splt.columns = columns
    df_lst.append(expanded_splt)

pd.concat(df_lst, axis=1)
Output:
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
Finally I found time to find an answer that suits me.
expanded_data = expanded.agg(lambda x: np.concatenate(x), axis=1).to_numpy()
expanded_data = np.stack(expanded_data)
column_index = pd.MultiIndex.from_product([expanded.columns, ['m', 'n', 'k']])
exploded = pd.DataFrame(expanded_data, index=expanded.index, columns=column_index)
print(exploded)
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
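For what it's worth, another common idiom for the same expansion (a sketch, assuming the same 'm', 'n', 'k' labels) builds one block per column with tolist and lets pd.concat create the outer column level from the dict keys:
blocks = {col: pd.DataFrame(expanded[col].tolist(),
                            index=expanded.index,
                            columns=['m', 'n', 'k'])
          for col in expanded.columns}
print(pd.concat(blocks, axis=1))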

Axis error when dropping specific columns Pandas

I have identified specific columns I want to select as my predictors for my model based on some analysis. I have captured those column numbers and stored them in a list. I have roughly 80 columns and want to loop through and drop the columns not in this specific list. X_train is the dataframe in which I want to do this. Here is my code:
cols_selected = [24, 4, 7, 50, 2, 60, 46, 53, 48, 61]
cols_drop = []
for x in range(len(X_train.columns)):
    if x in cols_selected:
        pass
    else:
        X_train.drop([x])
When running this, I am faced with the following error pointing at the line X_train.drop([x]):
KeyError: '[3] not found in axis'
I am sure it is something very simple that I am missing. I tried adding inplace=True or axis=1 to this and got the same error message each time (though the value inside the [] changed).
Any help would be great!
Edit: Here is the addition to get this working:
cols_selected = [24, 4, 7, 50, 2, 60, 46, 53, 48, 61]
cols_drop = []
for x in range(len(X_train.columns)):
    if x in cols_selected:
        pass
    else:
        cols_drop.append(x)

X_train = X_train.drop(X_train.columns[cols_drop], axis=1)
According to the documentation of drop:
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names
You cannot drop columns by simply using the index of the column; you need the column names. Also, the axis parameter has to be set to 1 or 'columns'. Replace X_train.drop([x]) with X_train = X_train.drop(X_train.columns[x], axis='columns') to make your example work.
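As a side note (a sketch of an alternative, not a fix to drop itself): if the goal is just to keep the columns at the selected positions, positional selection with iloc avoids the loop entirely:
# keep only the columns at those positions; order follows cols_selected
X_train = X_train.iloc[:, cols_selected]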
I am just assuming as per the question title:
Example DataFrame:
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Dropping Specific columns B & C:
>>> df.drop(['B', 'C'], axis=1)
# df.drop(['B', 'C'], axis=1, inplace=True) <-- to change the df itself, use inplace=True
A D
0 0 3
1 4 7
2 8 11
If you are trying to drop them by column numbers (dropping by index), then try like below:
>>> df.drop(df.columns[[1, 2]], axis=1)
A D
0 0 3
1 4 7
2 8 11
OR
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Also, in addition to @pygo pointing out that df.drop takes a keyword arg to designate the axis, try this:
X_train = X_train[[col for col in X_train.columns if col in cols_selected]]
Here is an example:
>>> import numpy as np
>>> import pandas as pd
>>> cols_selected = ['a', 'c', 'e']
>>> X_train = pd.DataFrame(np.random.randint(low=0, high=10, size=(20, 5)), columns=['a', 'b', 'c', 'd', 'e'])
>>> X_train
a b c d e
0 4 0 3 5 9
1 8 8 6 7 2
2 1 0 2 0 2
3 3 8 0 5 9
4 5 9 7 8 0
5 1 9 3 5 9 ...
>>> X_train = X_train[[col for col in X_train.columns if col in cols_selected]]
>>> X_train
a c e
0 4 3 9
1 8 6 2
2 1 2 2
3 3 0 9
4 5 7 0
5 1 3 9 ...
