A similar question might have been asked before, but I couldn't find one that exactly fits my problem.
I want to group a dataframe by two columns.
For example, to turn this:
id product quantity
1 A 2
1 A 3
1 B 2
2 A 1
2 B 1
3 B 2
3 B 1
Into this:
id product quantity
1 A 5
1 B 2
2 A 1
2 B 1
3 B 3
That is, sum the "quantity" column for rows with the same "id" and the same "product".
You need groupby with the parameter as_index=False to return a DataFrame, and aggregate with sum:
df = df.groupby(['id','product'], as_index=False)['quantity'].sum()
print (df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
Or add reset_index:
df = df.groupby(['id','product'])['quantity'].sum().reset_index()
print (df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
You can use pivot_table with aggfunc='sum':
df.pivot_table('quantity', ['id', 'product'], aggfunc='sum').reset_index()
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
You can use groupby with the agg function:
import pandas as pd
df = pd.DataFrame({
'id': [1,1,1,2,2,3,3],
'product': ['A','A','B','A','B','B','B'],
'quantity': [2,3,2,1,1,2,1]
})
print(df)
id product quantity
0 1 A 2
1 1 A 3
2 1 B 2
3 2 A 1
4 2 B 1
5 3 B 2
6 3 B 1
df = df.groupby(['id','product']).agg({'quantity':'sum'}).reset_index()
print(df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
Related
I have a data frame representing the sales of an item:
import pandas as pd
data = {'id': [1,1,1,1,2,2], 'week': [1,2,2,3,1,3], 'quantity': [1,2,4,3,2,2]}
df_sales = pd.DataFrame(data)
>>> df_sales
id week quantity
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 2
4 2 3 2
I have another data frame that represents the available weeks:
data = {'week': [1,2,3]}
df_week = pd.DataFrame(data)
>>> df_week
week
0 1
1 2
2 3
I want to group by the id and the week and compute the mean, which I do as follows:
df = df_sales.groupby(by=['id', 'week'], as_index=False).mean()
>>> df
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 3 2
However, I want to fill the missing week values (present in df_week) with 0, such that the output is:
>>> df
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 2 0
5 2 3 2
Is it possible to merge the groupby with the df_week data frame?
We can reindex after groupby
# group and aggregate
df = df_sales.groupby(['id', 'week']).mean()
# define new MultiIndex
idx = pd.MultiIndex.from_product([df.index.levels[0], df_week['week']])
# reindex with fill_value=0
df = df.reindex(idx, fill_value=0).reset_index()
print(df)
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 2 0
5 2 3 2
Since all unique id and week combinations are needed in the result, one way is to first prepare a combinations frame with pd.merge, passing how="cross":
combs = pd.merge(df_sales.id.drop_duplicates(), df_week.week, how="cross")
or, for pandas versions below 1.2:
combs = pd.merge(df_sales.id.drop_duplicates().to_frame().assign(key=1),
df_week.week.to_frame().assign(key=1), on="key").drop(columns="key")
which gives
>>> combs
id week
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
Now we can left merge this with df (which holds the means), filling NaNs with 0:
result = combs.merge(df, how="left", on=["id", "week"]).fillna(0, downcast="infer")
where downcast converts back to integers from the float type caused by the NaNs that appeared in the intermediate step,
to get
>>> result
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 2 0
5 2 3 2
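If you would rather not use the downcast argument, a minimal alternative sketch (assuming the same combs and grouped df as above; the means here happen to be whole numbers, so the integer cast is safe) fills only the quantity column and casts it back explicitly:
# fill only the missing quantities with 0, then restore the integer dtype explicitly
result = (combs.merge(df, how="left", on=["id", "week"])
               .fillna({"quantity": 0})
               .astype({"quantity": int}))
print(result)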
UPDATED THE SAMPLE DATASET
I have the following data:
location ID Value
A 1 1
A 1 1
A 1 1
A 1 1
A 1 2
A 1 2
A 1 2
A 1 2
A 1 3
A 1 4
A 2 1
A 2 2
A 3 1
A 3 2
B 4 1
B 4 2
B 5 1
B 5 1
B 5 2
B 5 2
B 6 1
B 6 1
B 6 1
B 6 1
B 6 1
B 6 2
B 6 2
B 6 2
B 7 1
I want to count unique Values (only where the value equals 1 or 2) for each location and for each ID, to get the following output:
location ID_Count Value_Count
A 3 6
B 4 7
I tried df.groupby(['location'])[['ID', 'Value']].nunique(), but I am only getting the count of unique values per location, e.g. a Value_Count of 4 for A and 2 for B.
Try agg, slicing ID on the True values of the isin mask.
For your updated sample, you just need to drop duplicates before processing; the rest is the same:
df = df.drop_duplicates(['location', 'ID', 'Value'])
df_agg = (df.Value.isin([1,2]).groupby(df.location)
.agg(ID_count=lambda x: df.loc[x[x].index, 'ID'].nunique(),
Value_count='sum'))
Out[93]:
ID_count Value_count
location
A 3 6
B 4 7
IIUC, you can try Series.isin with groupby.agg:
out = (df.assign(Value_Count=df['Value'].isin([1,2])).groupby("location",as_index=False)
.agg({"ID":'nunique',"Value_Count":'sum'}))
print(out)
location ID Value_Count
0 A 3 6.0
1 B 4 7.0
Roughly the same as anky's answer, but using Series.where and named aggregations so we can rename the columns while creating them in the groupby.
grp = df.assign(Value=df['Value'].where(df['Value'].isin([1, 2]))).groupby('location')
grp.agg(
ID_count=('ID', 'nunique'),
Value_count=('Value', 'count')
).reset_index()
location ID_count Value_count
0 A 3 6
1 B 4 7
Let's try a very similar approach to the other answers. This time we filter first:
(df[df['Value'].isin([1,2])]
.groupby(['location'],as_index=False)
.agg({'ID':'nunique', 'Value':'size'})
)
Output:
location ID Value
0 A 3 6
1 B 4 7
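Note that this answer and the two before it appear to assume the original (unduplicated) sample, so 'sum'/'size' counts every matching row; with the updated sample the duplicate rows inflate those counts. A sketch combining the drop-duplicates step from the first answer with this filter-first approach (assuming pandas >= 0.25 for the named aggregation):
# dedupe (location, ID, Value) rows first, then filter to values 1/2 and aggregate
out = (df.drop_duplicates(['location', 'ID', 'Value'])
         .loc[lambda d: d['Value'].isin([1, 2])]
         .groupby('location', as_index=False)
         .agg(ID_Count=('ID', 'nunique'), Value_Count=('Value', 'size')))
print(out)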
I have a df1:
a b c
1 0 1 4
2 0 2 5
3 1 1 3
and a second df2:
a b c
1 0 1 5
2 0 2 5
3 1 1 4
These dfs have the same groups in a and b. Within each group of 'a' and 'b', I want df2 underneath df1:
a b c
1 0 1 4
2 0 1 5
3 0 2 5
4 0 2 5
5 1 1 3
6 1 1 4
How can I combine groupby() and concat() to get the desired output?
You can do concat then sort_values:
df = pd.concat([df1, df2]).sort_values(['a', 'b']).reset_index(drop=True)
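A minimal runnable sketch of the same one-liner, assuming the sample frames above (with this data the df1 row lands above the df2 row inside each group, matching the desired output):
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [4, 5, 3]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [5, 5, 4]})

# stack the two frames, then order the rows by the group keys
df = (pd.concat([df1, df2])
        .sort_values(['a', 'b'])
        .reset_index(drop=True))
print(df)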
I have the following pandas dataframe :
a
0 0
1 0
2 1
3 2
4 2
5 2
6 3
7 2
8 2
9 1
I want to store the values in another dataframe such that every group of consecutive identical values makes a labeled group, like this:
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
The column A represents the value of the group and B represents the number of occurrences.
This is what I've done so far:
df = pd.DataFrame({'a':[0,0,1,2,2,2,3,2,2,1]})
df2 = pd.DataFrame()
for i,g in df.groupby([(df.a != df.a.shift()).cumsum()]):
vc = g.a.value_counts()
df2 = df2.append({'A':vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)
It works but it's a bit messy.
Can you think of a shorter/better way of doing this?
Use GroupBy.agg in pandas >= 0.25.0:
new_df= ( df.groupby(df['a'].ne(df['a'].shift()).cumsum(),as_index=False)
.agg(A=('a','first'),B=('a','count')) )
print(new_df)
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
For pandas < 0.25.0:
new_df= ( df.groupby(df['a'].ne(df['a'].shift()).cumsum(),as_index=False)
.a
.agg({'A':'first','B':'count'}) )
I would try:
df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
(df.groupby(['a', 'blocks'], sort=False)
   .size()
   .reset_index(name='B')
   .drop('blocks', axis=1)
)
Output:
a B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
I have a dataframe:
>>> df
Category Score
0 A 1
1 A 2
2 A 3
3 B 5
4 B 9
I expect the output below, with Score sorted in descending order within each Category:
>>> df
Category Score
2 A 3
1 A 2
0 A 1
4 B 9
3 B 5
Any ideas?
Use sort_values, specifying the sort order for each column via the ascending list:
In [17]: df.sort_values(by=['Category', 'Score'], ascending=[True, False])
Out[17]:
Category Score
2 A 3
1 A 2
0 A 1
4 B 9
3 B 5
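An equivalent pattern, as a sketch, is two chained sorts where the second uses a stable algorithm so the descending Score order survives within each Category:
# sort by Score descending first, then re-sort by Category with a stable sort
out = (df.sort_values('Score', ascending=False)
         .sort_values('Category', kind='mergesort'))
print(out)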