Merging a groupby object with another data frame - python

I have a data frame representing the sales of an item:
import pandas as pd
data = {'id': [1,1,1,1,2,2], 'week': [1,2,2,3,1,3], 'quantity': [1,2,4,3,2,2]}
df_sales = pd.DataFrame(data)
>>> df_sales
id week quantity
0 1 1 1
1 1 2 2
2 1 2 4
3 1 3 3
4 2 1 2
5 2 3 2
I have another data frame that represents the available weeks:
data = {'week': [1,2,3]}
df_week = pd.DataFrame(data)
>>> df_week
week
0 1
1 2
2 3
I want to group by the id and the week and compute the mean, which I do as follows:
df = df_sales.groupby(by=['id', 'week'], as_index=False).mean()
>>> df
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 3 2
However, I want to fill the missing week values (present in df_week) with 0, such that the output is:
>>> df
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 2 0
5 2 3 2
Is it possible to merge the groupby result with the df_week data frame?

We can reindex after the groupby:
# group and aggregate
df = df_sales.groupby(['id', 'week']).mean()
# define new MultiIndex
idx = pd.MultiIndex.from_product([df.index.levels[0], df_week['week']])
# reindex with fill_value=0
df = df.reindex(idx, fill_value=0).reset_index()
print(df)
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 2 0
5 2 3 2
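Note that from_product inherits the level names here ('id' from the grouped index and 'week' from df_week['week']), which is what lets reset_index restore the original column names; with unnamed inputs you would pass names=['id', 'week'] explicitly.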

Since all unique id and week combinations are needed in the result, one way is to first prepare a combinations frame with pd.merge using how="cross":
combs = pd.merge(df_sales.id.drop_duplicates(), df_week.week, how="cross")
or, for pandas versions below 1.2 (which lack how="cross"), emulate it with a constant key:
combs = pd.merge(df_sales.id.drop_duplicates().to_frame().assign(key=1),
                 df_week.week.to_frame().assign(key=1), on="key").drop(columns="key")
which gives
>>> combs
id week
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
Now we can left merge this with df, which holds the means, filling the resulting NaNs with 0:
result = combs.merge(df, how="left", on=["id", "week"]).fillna(0, downcast="infer")
where downcast="infer" restores the integer dtype lost to the NaN that appears in the intermediate step, giving
>>> result
id week quantity
0 1 1 1
1 1 2 3
2 1 3 3
3 2 1 2
4 2 2 0
5 2 3 2
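To see the intermediate step that motivates downcast, here is the merge before fillna (reusing combs and df from above): the unmatched (id=2, week=2) pair becomes NaN, which forces quantity to float:
>>> combs.merge(df, how="left", on=["id", "week"])
id week quantity
0 1 1 1.0
1 1 2 3.0
2 1 3 3.0
3 2 1 2.0
4 2 2 NaN
5 2 3 2.0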

Related

How to create new dichotomized columns from values in an existing column using pandas

I have a dataframe that looks like this:
ID type period
1 2 3
1 2 3
1 3 3
2 2 3
2 3 2
2 3 2
3 2 2
There are a total of X types and X periods. Not all types/periods will be used, but I need columns to be created for all X of each so that the table doesn't break in the database when imported from pandas. (Assume X is 3 in this example; it's really 9, just shortened here.)
For each ID, I need a 1 to show that the type/period was present, and a 0 to show that it was not.
The desired dataframe looks like this:
ID type_1 type_2 type_3 period_1 period_2 period_3
1 0 1 1 0 0 1
2 0 1 1 0 1 1
3 0 1 0 0 1 0
Any advice towards the right direction would be greatly appreciated! Thank you!
From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
ID type period
1 2 3
1 2 3
1 3 3
2 2 3
2 3 2
2 3 2
3 2 2"""), sep=' ')
>>> df
ID type period
0 1 2 3
1 1 2 3
2 1 3 3
3 2 2 3
4 2 3 2
5 2 3 2
6 3 2 2
We can use groupby on columns 'ID' and 'type' to extract their size, then unstack the result, fill NaNs with zeros, and finally convert to bool and then int, since you want 0 and 1 values:
>>> df.groupby(['ID','type']).size().unstack(fill_value=0).astype(bool).astype(int)
type 2 3
ID
1 1 1
2 1 1
3 1 0
And for the period column:
>>> df.groupby(['ID','period']).size().unstack(fill_value=0).astype(bool).astype(int)
period 2 3
ID
1 0 1
2 1 1
3 1 0
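Since the question asks for columns for all X types/periods even when some never occur, one option is to reindex the unstacked columns before the bool/int conversion and then join the two blocks. A minimal sketch, assuming X = 3 as in the shortened example (the all_vals name is ours):
all_vals = range(1, 4)  # assumption: X = 3; the real data would use range(1, 10)
type_cols = (df.groupby(['ID', 'type']).size().unstack(fill_value=0)
               .reindex(columns=all_vals, fill_value=0)
               .astype(bool).astype(int).add_prefix('type_'))
period_cols = (df.groupby(['ID', 'period']).size().unstack(fill_value=0)
                 .reindex(columns=all_vals, fill_value=0)
                 .astype(bool).astype(int).add_prefix('period_'))
result = type_cols.join(period_cols).reset_index()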

Padding columns of dataframe

I have 2 dataframes like this,
df1
0 1 2 3 4 5 category
0 1 2 3 4 5 6 foo
1 4 5 6 5 6 7 bar
2 7 8 9 5 6 7 foo1
and
df2
0 1 2 category
0 1 2 3 bar
1 4 5 6 foo
Shape of df1 is (3,7) and shape of df2 is (2,4).
I want to reshape df2 to (2,7) (matching df1's columns), keeping the last column the same:
df2
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
If you want the dataframe with fewer columns to be padded with zeros to match the one with more columns, you can try DataFrame.align on axis=1, which aligns the columns of the two dataframes while leaving the rows unchanged:
df1, df2 = df1.align(df2, axis=1, fill_value=0)
print(df2)
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
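Note that align returns the pair of frames aligned to the union of their columns; since df2's columns are a subset of df1's here, df1 comes back unchanged.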
You can use .shape[1] to get the number of columns of each dataframe, then insert zero-filled columns with insert until the widths match:
s1, s2 = df1.shape[1], df2.shape[1]
for s in range(s2, s1):
    df2.insert(s - 1, s - 1, 0)
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
Another method using iloc:
s1, s2 = df1.shape[1] - 1, df2.shape[1] - 1
df3 = pd.concat([df2.iloc[:, :-1],
                 df1.iloc[:df2.shape[0], s2:s1],
                 df2.iloc[:, -1]], axis=1)
df3.iloc[:, s2:s1] = 0
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
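A third option along the same lines, as a minimal sketch: reindex df2's columns directly against df1.columns, which pads the missing columns with zeros and keeps category in place:
df2 = df2.reindex(columns=df1.columns, fill_value=0)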

Group identical consecutive values in pandas DataFrame

I have the following pandas dataframe :
a
0 0
1 0
2 1
3 2
4 2
5 2
6 3
7 2
8 2
9 1
I want to store the values in another dataframe such that every group of consecutive identical values makes a labeled group, like this:
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
The column A represents the value of the group and B the number of occurrences.
This is what I've done so far:
df = pd.DataFrame({'a':[0,0,1,2,2,2,3,2,2,1]})
df2 = pd.DataFrame()
for i, g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    vc = g.a.value_counts()
    df2 = df2.append({'A': vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)
It works, but it's a bit messy.
Can you think of a shorter/better way of doing this?
Use GroupBy.agg with named aggregation in pandas >= 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .agg(A=('a', 'first'), B=('a', 'count')))
print(new_df)
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
For pandas < 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .a
            .agg({'A': 'first', 'B': 'count'}))
I would try:
df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
new_df = (df.groupby(['a', 'blocks'], sort=False)
            .size()
            .reset_index(name='B')
            .drop('blocks', axis=1))
Output:
a B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
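For completeness, a sketch of the same idea without the groupby-on-cumsum trick, using itertools.groupby, which groups consecutive identical values directly:
from itertools import groupby
new_df = pd.DataFrame([(k, sum(1 for _ in g)) for k, g in groupby(df['a'])],
                      columns=['A', 'B'])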

how to create unique couple id for linked pairs in pandas

I have a dataframe linking people together. For example,
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[2,1],[3,4],[5,6],[4,3],[6,5]], columns=['m_id', 'f_id'])
>>> df
m_id f_id
0 1 2
1 2 1
2 3 4
3 5 6
4 4 3
5 6 5
My goal is to create a third column with a unique id for each pair of m_id and f_id. For instance, the following is the desired output.
>>> df
m_id f_id shared_id
0 1 2 0
1 2 1 0
2 3 4 1
3 5 6 2
4 4 3 1
5 6 5 2
UPDATE
This is not a duplicate of this question because I'm not trying to get the group ID back from a typical groupby. In my case, I have two columns and I want to assign a group ID based on if the two elements in a row are the same as the two elements in other rows, ignoring the order of the columns.
IIUC
import numpy as np
pd.DataFrame(np.sort(df.values, 1), index=df.index).groupby([0, 1]).ngroup()
Out[94]:
0 0
1 0
2 1
3 2
4 1
5 2
dtype: int64
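To attach this as the shared_id column from the desired output, assign it back; it aligns row for row because the frame is built with index=df.index:
df['shared_id'] = (pd.DataFrame(np.sort(df.values, 1), index=df.index)
                     .groupby([0, 1]).ngroup())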
With numeric values, you can use np.unique to get the groups after sorting each row:
df['share_id'] = np.unique(np.sort(df.to_numpy(), axis=1), axis=0, return_inverse=True)[1]
m_id f_id share_id
0 1 2 0
1 2 1 0
2 3 4 1
3 5 6 2
4 4 3 1
5 6 5 2

How to groupby based on two columns in pandas?

A similar question might have been asked before, but I couldn't find the exact one fitting to my problem.
I want to group a dataframe based on two columns.
For example, to make this:
id product quantity
1 A 2
1 A 3
1 B 2
2 A 1
2 B 1
3 B 2
3 B 1
Into this:
id product quantity
1 A 5
1 B 2
2 A 1
2 B 1
3 B 3
Meaning: sum the "quantity" column over rows with the same "id" and "product".
You need groupby with the parameter as_index=False to return a DataFrame, aggregating with sum:
df = df.groupby(['id','product'], as_index=False)['quantity'].sum()
print (df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
Or add reset_index:
df = df.groupby(['id','product'])['quantity'].sum().reset_index()
print (df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
You can use pivot_table with aggfunc='sum':
df.pivot_table('quantity', ['id', 'product'], aggfunc='sum').reset_index()
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
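(pivot_table aggregates with mean by default, which is why aggfunc='sum' must be passed explicitly here.)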
You can use groupby with an aggregate function:
import pandas as pd
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3],
    'product': ['A', 'A', 'B', 'A', 'B', 'B', 'B'],
    'quantity': [2, 3, 2, 1, 1, 2, 1]
})
print(df)
id product quantity
0 1 A 2
1 1 A 3
2 1 B 2
3 2 A 1
4 2 B 1
5 3 B 2
6 3 B 1
df = df.groupby(['id','product']).agg({'quantity':'sum'}).reset_index()
print(df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
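On pandas >= 0.25.0 the same result can also be written with named aggregation, as a minimal sketch:
df = df.groupby(['id', 'product'], as_index=False).agg(quantity=('quantity', 'sum'))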
