How to get the equivalent of pandas melt using groupby + stack?

I have recently been learning groupby and stack, and came across the pandas method melt. I would like to know how to achieve the same result that melt gives using groupby and stack.
Here is the MWE:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
df1 = pd.melt(df, id_vars='A', value_vars=['B', 'C'],
              var_name='variable', value_name='value')
print(df1)
A variable value
0 1 B 1
1 1 B 1
2 1 B 2
3 2 B 2
4 2 B 1
5 1 C 10
6 1 C 20
7 1 C 30
8 2 C 40
9 2 C 50
How to get the same result using groupby and stack?
My attempt
df.groupby('A')[['B','C']].count().stack(0).reset_index()
This is not quite correct, and I am looking for suggestions.

I guess you do not need groupby, just stack + sort_values:
result = df[['A', 'B', 'C']].set_index('A').stack().reset_index().sort_values(by='level_1')
result.columns = ['A', 'variable', 'value']
Output
A variable value
0 1 B 1
2 1 B 1
4 1 B 2
6 2 B 2
8 2 B 1
1 1 C 10
3 1 C 20
5 1 C 30
7 2 C 40
9 2 C 50
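As a quick sanity check (my addition, not part of the original answer), the stacked result matches the melt output exactly once the row index is reset:
print(result.reset_index(drop=True).equals(df1))   # True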

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have the following Pandas dataframe:
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l=[]):
    return l.nlargest(2).min()

print(df.apply(second_largest, axis=1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort to get the position of the second-largest value in each row (argsort sorts ascending, so column -2 of the result holds that position):
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but row-wise apply is slow:
def second_largest(l=[]):
    return l.nlargest(2).idxmin()

print(df.apply(second_largest, axis=1))
If efficiency is important, numpy.argpartition avoids a full sort:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas solution (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
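If the lookup is needed for arbitrary N, the argpartition trick can be wrapped in a small helper. This is a sketch of mine (the name nth_largest_col is hypothetical, assuming numpy and pandas are imported as np and pd):
def nth_largest_col(frame, n):
    # Column label holding the n-th largest value in each row.
    cols = frame.columns.to_numpy()
    pos = np.argpartition(frame.to_numpy(), -n, axis=1)[:, -n]
    return pd.Series(cols[pos], index=frame.index)

print(nth_largest_col(df, 2))   # b, b, c, c, c -- same as above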

Pandas: set preceding values conditional on current value in column (by group)

I have a pandas data frame where values should be greater than or equal to preceding values. In cases where the current value is lower than the preceding values, the preceding values must be set equal to the current value. This is best explained by the example below:
data = {'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B',
                  'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'value': [0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2]}
df = pd.DataFrame(data)
df
group value
0 A 0
1 A 1
2 A 2
3 A 3
4 A 2
5 B 0
6 B 1
7 B 2
8 B 3
9 B 1
10 B 5
11 C 0
12 C 1
13 C 0
14 C 3
15 C 2
and the result I am looking for is:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
So here's my go!
(Special thanks to @jezrael for helping me simplify it considerably!)
I'm basing this on expanding windows, in reverse, so as to always get a suffix of the elements in each group (from the last element, expanding towards the first).
The expanding window has the following logic: for the element at index i, you get a Series containing all elements in the group with indices >= i, and you return a single new value for position i in the result.
What value corresponds to this suffix? Its minimum, because if the later elements are smaller, we need to take the smallest among them.
Then we can assign the result of this operation to df['value'].
Try this:
df['value'] = (df.iloc[::-1]                       # reverse the row order
                 .groupby('group')['value']
                 .expanding()                      # each window is a suffix of the group
                 .min()                            # take the suffix minimum
                 .reset_index(level=0, drop=True)  # drop the group level, keep the original index
                 .astype(int))
print(df)
Output:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
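The same suffix minimum can also be computed with a reversed cumulative minimum per group. This is my own equivalent sketch, not part of the original answer:
df['value'] = (df.groupby('group')['value']
                 .transform(lambda s: s[::-1].cummin()[::-1]))
Reversing each group's values, taking cummin, and reversing back yields the minimum of every suffix, exactly like the reversed expanding window above.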
I didn't get your output, but I believe you are looking for something like:
df['fwd'] = df.value.shift(-1)
df['new'] = np.where(df['value'] > df['fwd'], df['fwd'], df['value'])
Note that this only compares each value with its immediate successor, so it will not propagate a drop back through a whole run of preceding values, and it ignores the group boundaries.

Return new columns by subtracting two lists of columns in a pandas DataFrame

It seems so basic, but I can't work out how to achieve the following...
Consider the scenario where I have the following data:
all_columns = ['A','B','C','D']
first_columns = ['A','B']
second_columns = ['C','D']
new_columns = ['E','F']
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
df = pd.DataFrame(data = values, columns = all_columns)
df
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Using this data, how can I subsequently subtract, say, column C - column A and then column D - column B, and return the results as two new columns E and F in my pandas dataframe? I have multiple columns, so writing the formula one by one is not an option.
I imagine it should be something like the line below, but Python behaves as if I were subtracting the list names rather than the values in the actual columns...
df[new_columns] = df[second_columns] - df[first_columns]
Expected output:
A B C D E F
0 1 2 3 4 2 2
1 5 6 7 8 2 2
2 9 10 11 12 2 2
3 13 14 15 16 2 2
df['E'] = df['C'] - df['A']
df['F'] = df['D'] - df['B']
Or, alternatively (similar to @rafaelc's comment):
new_cols = ['E', 'F']
second_cols = ['C', 'D']
first_cols = ['A', 'B']
df[new_cols] = df[second_cols] - df[first_cols].values
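The .values on the right-hand side matters: without it, pandas aligns the two frames on their column labels, so C - A and D - B never pair up and every resulting column is NaN. A quick demonstration (my addition):
df[second_cols] - df[first_cols]
#     A   B   C   D
# 0 NaN NaN NaN NaN
# 1 NaN NaN NaN NaN
# 2 NaN NaN NaN NaN
# 3 NaN NaN NaN NaN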
As @rafaelc and @Ben.T mentioned, the approach below is a good fit. I'm just placing it in the answer section for posterity...
>>> df
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Result:
>>> df[['E', 'F']] = df[['C', 'D']] - df[['A', 'B']].values
>>> df
A B C D E F
0 1 2 3 4 2 2
1 5 6 7 8 2 2
2 9 10 11 12 2 2
3 13 14 15 16 2 2

How can I add a column to a pandas DataFrame that uniquely identifies grouped data? [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
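For reference, cumcount numbers the rows within each group starting at 0, which is why the answer above adds 1:
In [26]: df.groupby(['A','B']).cumcount()
Out[26]:
0    0
1    1
2    0
3    0
4    1
5    2
dtype: int64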
Use the groupby.rank function. Here is a working example.
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1  1.0
1  a   2  2.0
2  a   3  3.0
3  b   4  1.0
4  b   5  2.0
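Note the difference between the two answers (my comparison, not from the originals): rank(method="first") numbers rows by the ranked column's values, while cumcount numbers them by row order; they coincide above only because C2 is already sorted within each group. A quick sketch with an unsorted column:
df2 = pd.DataFrame({'C1': ['a', 'a', 'a'], 'C2': [3, 1, 2]})
df2.groupby('C1')['C2'].rank(method='first')   # 3.0, 1.0, 2.0 -- value order
df2.groupby('C1').cumcount() + 1               # 1, 2, 3 -- row order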

In Pandas, how to create a variable that is n for the nth observation within a group?

Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group, as in the order column of the desired output below:
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column of ones and then using cumsum, but that seems too clunky...
I think you can use range with len(df):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['a', 'a', 'b'],
                   'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1, len(df) + 1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer, but there's a simpler (though less general) method for this particular example: just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
>>> df
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case you don't really need the ['C'], but it makes the ranking a little more explicit; if you had other unrelated columns in the dataframe, you would need it.
But if you are ranking by more than one column, you should use Jezrael's method.
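One small footnote (mine): as the output above shows, rank returns floats. If an integer order column is preferred, cast the result:
df['order'] = df.groupby('B')['C'].rank().astype(int)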
