Group identical consecutive values in pandas DataFrame - python

I have the following pandas dataframe:
   a
0  0
1  0
2  1
3  2
4  2
5  2
6  3
7  2
8  2
9  1
I want to store the values in another dataframe such that every group of consecutive identical values makes a labeled group, like this:
   A  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
Column A represents the value of the group and B represents the number of occurrences.
This is what I've done so far:
df = pd.DataFrame({'a': [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]})
df2 = pd.DataFrame()
for i, g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    vc = g.a.value_counts()
    df2 = df2.append({'A': vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)
It works, but it's a bit messy.
Can you think of a shorter/better way of doing this?

Use GroupBy.agg with named aggregation in pandas >= 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .agg(A=('a', 'first'), B=('a', 'count')))
print(new_df)
   A  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
pandas < 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .a
            .agg({'A': 'first', 'B': 'count'}))
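To see why this grouper labels consecutive runs, it helps to print it on its own; a minimal sketch using the question's data:
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]})

# A comparison with the previous row is True at the start of every new run,
# so the cumulative sum hands out one integer label per consecutive block.
grouper = df['a'].ne(df['a'].shift()).cumsum()
print(grouper.tolist())  # [1, 1, 2, 3, 3, 3, 4, 5, 5, 6]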

I would try:
df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
(df.groupby(['a', 'blocks'], as_index=False, sort=False)
   .size()
   .drop('blocks', axis=1)
   .rename(columns={'size': 'B'})
)
Output:
   a  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
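A minimal standard-library sketch of the same run-length encoding, using itertools.groupby on the raw values (an alternative, not one of the answers above):
from itertools import groupby

import pandas as pd

values = [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]
# One (value, run length) pair per block of consecutive identical values.
new_df = pd.DataFrame(
    [(k, sum(1 for _ in g)) for k, g in groupby(values)],
    columns=['A', 'B'],
)
print(new_df)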

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if column B is the maximum within its group of column A values, or 0 if it is not.
Example:
[Input]
A  B
1  2
2  3
1  4
2  5
[Output]
A  B  is_max
1  2  0
2  3  0
1  4  1
2  5  1
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index (idxmax already returns the index labels that .loc needs):
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(), 'is_max'] = 1
df
Out[39]:
   A  B  is_max
0  1  2       0
1  2  3       0
2  1  4       1
3  2  5       1
I am making the assumption that A is your group, since you did not state it:
df['is_max'] = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].apply(lambda x: x == x.max()).astype(int)
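A small comparison sketch of the two approaches (the data below has no ties, so the columns coincide): on ties, idxmax flags only the first row reaching the group maximum, while the transform comparison flags every tying row.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [2, 3, 4, 5]})

# idxmax marks only the first row attaining each group's maximum...
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(), 'is_max'] = 1

# ...while comparing against transform('max') marks all rows that tie it.
df['is_max_ties'] = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)
print(df)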

Merging a groupby object with another data frame

I have a data frame representing the sales of an item:
import pandas as pd
data = {'id': [1,1,1,1,2,2], 'week': [1,2,2,3,1,3], 'quantity': [1,2,4,3,2,2]}
df_sales = pd.DataFrame(data)
>>> df_sales
   id  week  quantity
0   1     1         1
1   1     2         2
2   1     2         4
3   1     3         3
4   2     1         2
5   2     3         2
I have another data frame that represents the available weeks:
data = {'week': [1,2,3]}
df_week = pd.DataFrame(data)
>>> df_week
   week
0     1
1     2
2     3
I want to group by id and week and compute the mean, which I do as follows:
df = df_sales.groupby(by=['id', 'week'], as_index=False).mean()
>>> df
   id  week  quantity
0   1     1         1
1   1     2         3
2   1     3         3
3   2     1         2
4   2     3         2
However, I want to fill the missing week values (present in df_week) with 0, such that the output is:
>>> df
   id  week  quantity
0   1     1         1
1   1     2         3
2   1     3         3
3   2     1         2
4   2     2         0
5   2     3         2
Is it possible to merge the groupby with the df_week data frame?
We can reindex after groupby
# group and aggregate
df = df_sales.groupby(['id', 'week']).mean()
# define new MultiIndex
idx = pd.MultiIndex.from_product([df.index.levels[0], df_week['week']])
# reindex with fill_value=0
df = df.reindex(idx, fill_value=0).reset_index()
print(df)
   id  week  quantity
0   1     1         1
1   1     2         3
2   1     3         3
3   2     1         2
4   2     2         0
5   2     3         2
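An equivalent route, sketched here on the same data (an alternative to the answer above, not part of it): pivot the weeks into columns, reindex the columns against df_week so every available week is present, then stack back into long form:
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2], 'week': [1, 2, 2, 3, 1, 3], 'quantity': [1, 2, 4, 3, 2, 2]}
df_sales = pd.DataFrame(data)
df_week = pd.DataFrame({'week': [1, 2, 3]})

result = (df_sales.groupby(['id', 'week'])['quantity'].mean()
          .unstack(fill_value=0)                           # weeks become columns
          .reindex(columns=df_week['week'], fill_value=0)  # cover weeks absent everywhere
          .stack()                                         # back to one row per (id, week)
          .reset_index(name='quantity'))
print(result)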
Since all unique id and week combinations are needed in the result, one way is to first prepare a combinations frame with pd.merge passed how="cross":
combs = pd.merge(df_sales.id.drop_duplicates(), df_week.week, how="cross")
or for versions below 1.2
combs = pd.merge(df_sales.id.drop_duplicates().to_frame().assign(key=1),
                 df_week.week.to_frame().assign(key=1),
                 on="key").drop(columns="key")
which gives
>>> combs
   id  week
0   1     1
1   1     2
2   1     3
3   2     1
4   2     2
5   2     3
Now we can left merge this with the df that holds the means, filling NaNs with 0:
result = combs.merge(df, how="left", on=["id", "week"]).fillna(0, downcast="infer")
where downcast="infer" converts quantity back to integer from the float type that the intermediate NaN(s) introduced,
to get
>>> result
   id  week  quantity
0   1     1         1
1   1     2         3
2   1     3         3
3   2     1         2
4   2     2         0
5   2     3         2

Padding columns of dataframe

I have 2 dataframes like this:
df1
   0  1  2  3  4  5 category
0  1  2  3  4  5  6      foo
1  4  5  6  5  6  7      bar
2  7  8  9  5  6  7     foo1
and
df2
   0  1  2 category
0  1  2  3      bar
1  4  5  6      foo
The shape of df1 is (3, 7) and the shape of df2 is (2, 4).
I want to reshape df2 to (2, 7) (matching df1's columns), keeping the last column the same:
df2
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
If you want the dataframe with fewer columns to be padded with zeros according to the dataframe with more columns, you can try DataFrame.align on axis=1, which aligns the columns of the two dataframes while leaving the rows unchanged:
df1, df2 = df1.align(df2, axis=1, fill_value=0)
print(df2)
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
You can use .shape[1] to get the number of columns of each dataframe, then insert zero-filled columns until df2 is as wide as df1 (the list comprehension is used only for its insert side effects):
s1, s2 = df1.shape[1], df2.shape[1]
[df2.insert(s - 1, s - 1, 0) for s in range(s2, s1)]
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
Another method using iloc:
s1, s2 = df1.shape[1] - 1, df2.shape[1] - 1
df3 = pd.concat([df2.iloc[:, :-1],
                 df1.iloc[:df2.shape[0], s2:s1],
                 df2.iloc[:, -1]], axis=1)
df3.iloc[:, s2:s1] = 0
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
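For what it's worth, a one-line sketch of the same padding via DataFrame.reindex, assuming df1's columns are the desired target layout:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4, 5, 6, 'foo'], [4, 5, 6, 5, 6, 7, 'bar'], [7, 8, 9, 5, 6, 7, 'foo1']],
                   columns=[0, 1, 2, 3, 4, 5, 'category'])
df2 = pd.DataFrame([[1, 2, 3, 'bar'], [4, 5, 6, 'foo']], columns=[0, 1, 2, 'category'])

# Missing columns (3, 4, 5) are created and filled with 0; existing
# columns, including 'category', are kept as they are.
df2 = df2.reindex(columns=df1.columns, fill_value=0)
print(df2)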

Concatenate groups of multiple dataframes

I have a df1:
   a  b  c
1  0  1  4
2  0  2  5
3  1  1  3
and a second df2:
   a  b  c
1  0  1  5
2  0  2  5
3  1  1  4
These df's have the same groups in a and b. Within a groupby of 'a' and 'b' I want df2 underneath df1:
   a  b  c
1  0  1  4
2  0  1  5
3  0  2  5
4  0  2  5
5  1  1  3
6  1  1  4
How can I combine groupby() and concat() to get the desired output?
You can do concat then sort_values:
df = pd.concat([df1, df2]).sort_values(['a', 'b']).reset_index(drop=True)
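A runnable sketch on the question's data (df1 and df2 as above); pandas' multi-key sort is stable, which is what keeps df1's rows above df2's within each (a, b) group:
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [4, 5, 3]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [5, 5, 4]})

# concat stacks df2 under df1; the stable multi-key sort then interleaves
# the (a, b) groups while preserving the df1-before-df2 order inside each.
df = pd.concat([df1, df2]).sort_values(['a', 'b']).reset_index(drop=True)
print(df)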

Python pandas cumsum with reset every time there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]], columns=['a', 'b'])
print(df)
   a  b
0  0  1
1  1  1
2  0  1
3  1  0
4  1  1
5  0  1
The result I desire is:
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
   a  b
0  0  1
1  1  2
2  0  3
3  2  0
4  3  4
5  0  5
You can use:
a = df != 0
# cumulative count of nonzeros, minus the count reached at the most recent
# zero (kept with where and carried forward with ffill)
df1 = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
print(df1)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
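Unpacking that one-liner step by step may make it easier to follow; a sketch on the question's data:
import pandas as pd

df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]], columns=['a', 'b'])
a = df != 0

# running count of nonzero cells per column
running = a.cumsum()
# at each zero, record the running count reached so far,
# then carry it forward through the following run of nonzeros
baseline = running.where(~a).ffill().fillna(0).astype(int)
# subtracting the carried baseline restarts the count after every zero
df1 = running - baseline
print(df1)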
Try this:
df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]], columns=['a', 'b'])
df['groupId1'] = df.a.eq(0).cumsum()
df['groupId2'] = df.b.eq(0).cumsum()
New = pd.DataFrame()
New['a'] = df.groupby('groupId1').a.transform('cumsum')
New['b'] = df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
You may also try the following naive but reliable approach. For every column, create groups to sum within: a new group starts whenever the value differs from the previous row's, and lasts while the value stays constant: (x != x.shift()).cumsum().
Example (the group labels this produces for each column of the df below):
   a  b
0  1  1
1  2  1
2  3  1
3  4  2
4  4  3
5  5  3
Calculate cumulative sums within these groups per column using pd.DataFrame's apply and groupby methods, and you get the cumsum with zero reset in one line:
import pandas as pd

df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]], columns=['a', 'b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum (applied here to column b only):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]], columns=['a', 'b'])
z = np.where(df['b'] == 0)
df.loc[z[0], 'b'] = -z[0]
df['b'] = np.cumsum(df['b'])
df
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  1  1
5  0  2
Note that column a is left untouched, and that -z[0] only resets correctly when a column contains a single zero: it subtracts the zero's absolute position rather than the length of the run of ones before it.
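A sketch of the same idea made to work for any number of zeros per column (assuming 0/1 data and the default RangeIndex): each zero should cancel only the run of ones since the previous zero, whose length is the difference between consecutive zero positions minus one:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]], columns=['a', 'b'])

for col in df.columns:
    z = np.flatnonzero(df[col].to_numpy() == 0)  # positions of the zeros
    # length of the run of ones before each zero, negated so it cancels out
    df.loc[z, col] = -(np.diff(z, prepend=-1) - 1)
    df[col] = np.cumsum(df[col])
print(df)  # matches the desired output above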
