How is pandas doing groupby for the scenario below - python

I am having trouble understanding the code snippet below that uses groupby. I am trying to understand how the calculation happens for df.groupby(L).sum().
This is a code snippet that I found in an online tutorial.
Thanks for any help.

Rows are grouped by the values of the list; because the length of the list is the same as the number of rows in the DataFrame, the positions align, meaning:
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
L = [0, 1, 0, 1, 2, 0]
print (df)
  key  data1  data2
0   A      0      5   <- 0
1   B      1      0   <- 1
2   C      2      3   <- 0
3   A      3      3   <- 1
4   B      4      7   <- 2
5   C      5      9   <- 0
So:
data1 for 0 is 0 + 2 + 5 = 7
data2 for 0 is 5 + 3 + 9 = 17
data1 for 1 is 1 + 3 = 4
data2 for 1 is 0 + 3 = 3
data1 for 2 is 4
data2 for 2 is 7
Output:
print(df.groupby(L).sum())
data1 data2
0 7 17
1 4 3
2 4 7
The key column is omitted because of the automatic exclusion of 'nuisance' (non-numeric) columns. (Note that pandas 2.0 removed this automatic exclusion, so on current versions you select the numeric columns explicitly or pass numeric_only=True.)
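To double-check the alignment, here is a small sketch (my own addition, not from the original answer) showing that grouping by the raw list gives the same result as grouping by a Series built from it and aligned on the index. The columns are selected explicitly so it behaves the same on pandas 2.0+, where string columns are no longer dropped automatically:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
L = [0, 1, 0, 1, 2, 0]

# Grouping by the raw list and by an index-aligned Series are equivalent.
by_list = df.groupby(L)[['data1', 'data2']].sum()
by_series = df.groupby(pd.Series(L, index=df.index))[['data1', 'data2']].sum()
assert by_list.equals(by_series)
print(by_list)
```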

Related

Insert multiple columns if they do not exist - pandas

I have the following df
import pandas as pd

list_columns = ['A', 'B', 'C']
list_data = [
    [1, '2', 3],
    [4, '4', 5],
    [1, '2', 3],
    [4, '4', 6]
]
df = pd.DataFrame(columns=list_columns, data=list_data)
I want to check if multiple columns exist, and if not to create them.
Example:
If B, C, D do not exist, create them (for the above df it will create only the D column).
I know how to do this with one column:
if 'D' not in df:
    df['D'] = 0
Is there a way to test whether all my columns exist, and if not, create the ones that are missing, without writing an if for each column?
A loop is not necessary here - use DataFrame.reindex with Index.union:
cols = ['B','C','D']
df = df.reindex(df.columns.union(cols, sort=False), axis=1, fill_value=0)
print (df)
A B C D
0 1 2 3 0
1 4 4 5 0
2 1 2 3 0
3 4 4 6 0
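To see what the reindex is doing, a small sketch (my own addition): Index.union with sort=False keeps the existing columns in their original order and appends only the labels that are missing.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': ['2', '4'], 'C': [3, 5]})
cols = ['B', 'C', 'D']

# Existing columns come first; only 'D' is new and gets appended.
new_cols = df.columns.union(cols, sort=False)
print(list(new_cols))
```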
Just to add: you can compute the set difference between the target list and your existing columns, and pass the missing ones to assign via ** unpacking.
import numpy as np
cols = ['B','C','D','E']
df.assign(**{col : 0 for col in np.setdiff1d(cols,df.columns.values)})
A B C D E
0 1 2 3 0 0
1 4 4 5 0 0
2 1 2 3 0 0
3 4 4 6 0 0
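One caveat with the snippet above, shown here as a small standalone sketch: assign returns a new DataFrame rather than modifying df in place, so the result has to be assigned back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': ['2', '4'], 'C': [3, 5]})
cols = ['B', 'C', 'D', 'E']

# setdiff1d yields the labels from cols that df is missing ('D' and 'E').
missing = np.setdiff1d(cols, df.columns.values)
df = df.assign(**{col: 0 for col in missing})
print(df.columns.tolist())
```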

Pandas: set preceding values conditional on current value in column (by group)

I have a pandas data frame where values should be greater than or equal to the preceding values. In cases where the current value is lower than the preceding values, the preceding values must be set equal to the current value. This is best explained by the example below:
import pandas as pd

data = {'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B',
                  'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'value': [0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2]}
df = pd.DataFrame(data)
df
group value
0 A 0
1 A 1
2 A 2
3 A 3
4 A 2
5 B 0
6 B 1
7 B 2
8 B 3
9 B 1
10 B 5
11 C 0
12 C 1
13 C 0
14 C 3
15 C 2
and the result I am looking for is:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
So here's my go!
(Special thanks to @jezrael for helping me simplify it considerably!)
I'm basing this on expanding windows, applied in reverse, so that each position always sees a suffix of the elements in its group (from that element, expanding towards the last).
The expanding window has the following logic: for the element at index i, you get a Series containing all elements in the group with indices >= i, and you return a single new value for i in the result.
What is the value corresponding to this suffix? Its minimum - because if the later elements are smaller, we need to take the smallest among them.
Then we can assign the result of this operation to df['value'].
Try this:
df['value'] = (df.iloc[::-1]
               .groupby('group')['value']
               .expanding()
               .min()
               .reset_index(level=0, drop=True)
               .astype(int))
print (df)
Output:
group value
0 A 0
1 A 1
2 A 2
3 A 2
4 A 2
5 B 0
6 B 1
7 B 1
8 B 1
9 B 1
10 B 5
11 C 0
12 C 0
13 C 0
14 C 2
15 C 2
I didn't get your output, but I believe you are looking for something like:
import numpy as np

df['fwd'] = df.value.shift(-1)
df['new'] = np.where(df['value'] > df['fwd'], df['fwd'], df['value'])
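For completeness, the expanding-minimum idea from the first answer can also be written with a per-group cummin on the reversed frame (my own variant, not from either answer); it avoids the reset_index/astype bookkeeping:

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B',
                             'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
                   'value': [0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2]})

# Reverse the rows, take the running minimum within each group,
# then flip back, so each value becomes the minimum of itself and
# everything after it in the same group.
df['value'] = df.iloc[::-1].groupby('group')['value'].cummin().iloc[::-1]
print(df['value'].tolist())
```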

Can Pandas use a list for groupby?

import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
L = [0, 1, 0, 1, 2, 0]
print(df.groupby(L).sum())
The output is:
data1 data2
0 7 17
1 4 3
2 4 7
I need a clear explanation, please! What are the 0s, 1s and 2 in L? Are they from the key column of the df, or are they the index labels of df? And how does groupby group based on L?
L is a list of integers in your example. When you group by L you are simply saying: look at this list of integers and group my df based on those unique integers.
I think visualizing it will make sense (note that the df doesn't have a column L - I just added it for visualization):
Grouping by L means: take the unique values (in this case 0, 1 and 2) and sum data1 and data2 within each group. So the result for data1 when L=0 would be 0 + 2 + 5 = 7 (etc.),
and the end result would be:
df.groupby(L).sum()
data1 data2
0 7 17
1 4 3
2 4 7
You can use a list to group observations in your dataframe. For instance, say you have the heights of a few people:
import pandas as pd
df = pd.DataFrame({'names': ['John', 'Mark', 'Fred', 'Julia', 'Mary'],
                   'height': [180, 180, 180, 160, 160]})
print(df)
names height
0 John 180
1 Mark 180
2 Fred 180
3 Julia 160
4 Mary 160
And elsewhere, you received their assigned groups:
sex = ['man', 'man', 'man', 'woman', 'woman']
You won't need to concatenate a new column to your dataframe just to group them. You can use the list to do the work:
df.groupby(sex).mean()
height
man 180
woman 160
You can see here how it's working:
In [6006]: df.groupby(L).agg(list)
Out[6006]:
key data1 data2
0 [A, C, C] [0, 2, 5] [5, 3, 9]
1 [B, A] [1, 3] [0, 3]
2 [B] [4] [7]
In [6002]: list(df.groupby(L))
Out[6002]:
[(0, key data1 data2
0 A 0 5
2 C 2 3
5 C 5 9),
(1, key data1 data2
1 B 1 0
3 A 3 3),
(2, key data1 data2
4 B 4 7)]
In L, the 0 group has keys A, C, C (indices 0, 2, 5); the 1 group has keys B, A (indices 1, 3); and the 2 group has key B (index 4).
This is due to the alignment of the L key:
df['L'] = L
key data1 data2 L
0 A 0 5 0
1 B 1 0 1
2 C 2 3 0
3 A 3 3 1
4 B 4 7 2
5 C 5 9 0
I hope this makes sense

How to get equivalent of pandas melt using groupby + stack?

Recently I have been learning groupby and stack, and I encountered a pandas method called melt. I would like to know how to achieve the same result given by melt using groupby and stack.
Here is the MWE:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
df1 = pd.melt(df, id_vars='A', value_vars=['B', 'C'],
              var_name='variable', value_name='value')
print(df1)
A variable value
0 1 B 1
1 1 B 1
2 1 B 2
3 2 B 2
4 2 B 1
5 1 C 10
6 1 C 20
7 1 C 30
8 2 C 40
9 2 C 50
How to get the same result using groupby and stack?
My attempt
df.groupby('A')[['B','C']].count().stack(0).reset_index()
It is not quite correct, and I am looking for suggestions.
I guess you do not need groupby, just stack + sort_values:
result = df[['A', 'B', 'C']].set_index('A').stack().reset_index().sort_values(by='level_1')
result.columns = ['A', 'variable', 'value']
Output
A variable value
0 1 B 1
2 1 B 1
4 1 B 2
6 2 B 2
8 2 B 1
1 1 C 10
3 1 C 20
5 1 C 30
7 2 C 40
9 2 C 50
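To verify that the two approaches agree, a quick check (my own sketch; the stacked version just needs its index reset before comparing):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})

melted = pd.melt(df, id_vars='A', value_vars=['B', 'C'],
                 var_name='variable', value_name='value')

stacked = (df[['A', 'B', 'C']].set_index('A').stack()
           .reset_index().sort_values(by='level_1'))
stacked.columns = ['A', 'variable', 'value']
stacked = stacked.reset_index(drop=True)

# Same rows in the same order once the index is reset.
assert melted.equals(stacked)
```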

Repeat data frame, with varying column value

I have the following data frame and need to repeat its rows for each of a set of values. That is, given:
import numpy as np
import pandas as pd

test3 = pd.DataFrame(data={'x': [1, 2, 3, 4, np.nan], 'y': ['a', 'a', 'a', 'b', 'b']})
test3
x y
0 1 a
1 2 a
2 3 a
3 4 b
4 NaN b
I need to do something like this, but more performant:
test3['group'] = np.nan
groups = ['a', 'b']
dfs = []
for group in groups:
    temp = test3.copy()
    temp['group'] = group
    dfs.append(temp)
pd.concat(dfs)
That is, the expected output is:
x y group
0 1 a a
1 2 a a
2 3 a a
3 4 b a
4 NaN b a
0 1 a b
1 2 a b
2 3 a b
3 4 b b
4 NaN b b
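A more performant sketch of the loop above (my suggestion, assuming pandas >= 1.2 for how='cross'): a cross join pairs every group value with every row. Putting the groups frame on the left keeps all of group 'a' before group 'b', matching the loop-and-concat order; the only difference is that the result gets a fresh 0..9 index instead of the repeated 0..4.

```python
import numpy as np
import pandas as pd

test3 = pd.DataFrame({'x': [1, 2, 3, 4, np.nan],
                      'y': ['a', 'a', 'a', 'b', 'b']})
groups = ['a', 'b']

# Cross join: every row of the groups frame paired with every row of test3.
out = (pd.DataFrame({'group': groups})
       .merge(test3, how='cross')[['x', 'y', 'group']])
print(out)
```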
