Pandas: Create count cumsum column and reset if condition

Overview
To create a conditional count_cumsum column in Pandas, I created a temporary Count column and then deleted it after the desired column was built.
Code
import numpy as np
import pandas as pd

df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})
df["Count"] = np.where(df.Price > df.Level, 1, np.nan)
df['count_cumsum'] = df.Count.groupby(df.Count.isna().cumsum()).cumsum()
del df["Count"]
   Level  Price  count_cumsum
0      1      2           1.0
1      2      3           2.0
2      3      4           3.0
3      4      5           4.0
4      5      6           5.0
5      6      7           6.0
6      7      1           NaN
7      8     10           1.0
Question
How can I use a zero instead of NaN in the df["Count"] column so that count_cumsum stays an int column, and is there a simpler way to produce this output?
Desired output
   Level  Price  count_cumsum
0      1      2             1
1      2      3             2
2      3      4             3
3      4      5             4
4      5      6             5
5      6      7             6
6      7      1             0
7      8     10             1

To use zero instead of NaN, replace np.nan with 0 and isna() with eq(0) in your code; that part is a straightforward substitution.
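A minimal sketch of that substitution, keeping the structure of the question's code (only the 0 and the eq(0) change):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})
# 0 instead of np.nan keeps the column numeric with no missing values
df["Count"] = np.where(df.Price > df.Level, 1, 0)
# eq(0) replaces isna() as the reset marker
df['count_cumsum'] = df.Count.groupby(df.Count.eq(0).cumsum()).cumsum()
del df["Count"]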
As for simplifying, you can streamline the processing logic as follows:
# Convert the boolean condition directly to 0/1 with astype(int),
# replacing the np.where call
m = (df.Price > df.Level).astype(int)

# Use the series m both for grouping and for the cumulative sum
df['count_cumsum'] = m.groupby(m.eq(0).cumsum()).cumsum()
This simplifies the code in two ways:
- no temporary column df["Count"] is defined on df and deleted afterwards
- np.where((df.Price > df.Level), 1, 0) is reduced to converting the boolean condition (df.Price > df.Level) to integers (0 for False, 1 for True)
Result:
print(df)
   Level  Price  count_cumsum
0      1      2             1
1      2      3             2
2      3      4             3
3      4      5             4
4      5      6             5
5      6      7             6
6      7      1             0
7      8     10             1
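As a quick sanity check, the intermediate series for the sample data work out as follows (values traced by hand from the frame above):

m = (df.Price > df.Level).astype(int)
print(m.tolist())                 # [1, 1, 1, 1, 1, 1, 0, 1]
print(m.eq(0).cumsum().tolist())  # [0, 0, 0, 0, 0, 0, 1, 1] -> group ids
# cumsum within each group id gives [1, 2, 3, 4, 5, 6, 0, 1]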

You can avoid the NaNs altogether with the clean and readable solution below:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Level": [1, 2, 3, 4, 5, 6, 7, 8],
                   "Price": [2, 3, 4, 5, 6, 7, 1, 10]})
df["Count"] = np.where(df.Price > df.Level, 1, 0)
df['count_cumsum'] = df.Count.groupby((df.Count == 0).cumsum()).cumsum()
del df["Count"]
   Level  Price  count_cumsum
0      1      2             1
1      2      3             2
2      3      4             3
3      4      5             4
4      5      6             5
5      6      7             6
6      7      1             0
7      8     10             1
This leaves everything as an int type as well, which seems to be what you're after.

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns, A and B.
I want to create a new column named C that identifies runs of consecutive A values sharing the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies consecutive values of A regardless of B:
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I tried to group by B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
 A  B  C
 1  1  0
 2  1  0
 3  2  1
 5  2  2
 6  3  3
10  3  4
11  3  4
12  3  4
13  4  5
18  4  6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter as consecutive integers starting from 0:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result:
    A  B  C
0   1  1  0
1   2  1  0
2   3  2  1
3   5  2  2
4   6  3  3
5  10  3  4
6  11  3  4
7  12  3  4
8  13  4  5
9  18  4  6
Use DataFrameGroupBy.diff, compare not-equal to 1, take the Series.cumsum, and last subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)

    A  B  C
0   1  1  0
1   2  1  0
2   3  2  1
3   5  2  2
4   6  3  3
5  10  3  4
6  11  3  4
7  12  3  4
8  13  4  5
9  18  4  6
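To see why the two one-liners agree, it can help to print the intermediate series (traced from the sample frame above):

d = df.groupby('B')['A'].diff()   # NaN at each group's first row
new_run = d.ne(1)                 # True wherever a new consecutive run starts
print(new_run.cumsum().tolist())  # [1, 1, 2, 3, 4, 5, 5, 5, 6, 7]
# factorize()[0] and sub(1) both map this to [0, 0, 1, 2, 3, 4, 4, 4, 5, 6]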

How to drop duplicates in pandas but keep more than the first

Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2
I want to drop duplicates when they exceed a certain threshold n, keeping only n occurrences. Let's say that n=3. Then, my target dataframe is
>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
You can create a unique value for each consecutive run, then use groupby and head:
import numpy as np

group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
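For reference, the run ids that group_value assigns on this data look like the following (a self-contained trace of the same steps):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
# a new id starts whenever the value changes from the previous row
group_value = np.cumsum(df.a.shift() != df.a)
print(group_value.tolist())  # [1, 2, 2, 2, 2, 3, 3, 3, 4, 4]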
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output (note that rows 8 and 9 are dropped here, since grouping on 'a' counts all the 2s together):
   a
0  1
1  2
2  2
3  2
5  1
6  1
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
To apply it per run of consecutive repetitions instead, group on the change points:
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is the 3rd of its local run
8  2
9  2
Advantages of boolean indexing
You can use it for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0
df.loc[~m] = -1
   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2

Pandas calculate column based on other rows

I need to calculate a column based on other rows. Basically, I want my new_column to be the sum of base_column over all rows with the same id.
I currently do the following (but it is not really efficient). What is the most efficient way to achieve this?
def calculate(x):
    # in fact my filter is more complex: basically the same id and a date in the last 4 weeks
    filtered_df = df[df["id"] == df.at[x.name, "id"]]
    df.at[x.name, "new_column"] = filtered_df["base_column"].sum()

df.apply(calculate, axis=1)
You can do as below:
df['new_column'] = df.groupby('id')['base_column'].transform('sum')
Input:
   id  base_column
0   1            2
1   1            4
2   2            5
3   3            6
4   5            7
5   7            4
6   7            5
7   7            3
Output:
   id  base_column  new_column
0   1            2           6
1   1            4           6
2   2            5           5
3   3            6           6
4   5            7           7
5   7            4          12
6   7            5          12
7   7            3          12
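The key point is that transform broadcasts the aggregated value back to the original index, whereas a plain aggregation collapses to one row per id. A small illustration, assuming the same df as above:

df.groupby('id')['base_column'].sum()             # one row per unique id
df.groupby('id')['base_column'].transform('sum')  # one row per row of df, aligned with df's index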
Another way to do this is to use groupby and merge:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'base_column': [2, 4, 5]})

# compute the sum per id
sum_base = (df.groupby("id")
              .agg({"base_column": 'sum'})
              .reset_index()
              .rename(columns={'base_column': 'new_column'}))

# join the result back onto df
df = pd.merge(df, sum_base, how='left', on='id')

#    id  base_column  new_column
# 0   1            2           6
# 1   1            4           6
# 2   2            5           5

Shuffling one Column of a DataFrame By Group Efficiently

I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:
    group  some_value  label
0       1           8      1
1       1           7      0
2       1           6      2
3       1           5      2
4       2           1      0
5       2           2      0
6       2           3      1
7       2           4      2
8       3           2      1
9       3           4      1
10      3           2      1
11      3           4      2
I want to group by the group column, shuffle the label column within each group, and write the result back to the dataframe, preferably in place. The some_value column should remain intact. The result should look something like the following:
    group  some_value  label
0       1           8      1
1       1           7      2
2       1           6      2
3       1           5      0
4       2           1      1
5       2           2      0
6       2           3      0
7       2           4      2
8       3           2      1
9       3           4      2
10      3           2      1
11      3           4      1
I used np.random.permutation but found it was very slow:
df["label"] = df.groupby("group")["label"].transform(np.random.permutation)
It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and in place?
We can use sample. Notice this assumes df = df.sort_values('group'):
df['New'] = df.groupby('group').label.apply(lambda x: x.sample(len(x))).values
Or we can do it by:
df['New'] = df.sample(len(df)).sort_values('group').label.values
What about providing a custom transform function?
def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)
Printing out what gets passed into the custom function is a helpful way to understand how transform works.
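If you need a reproducible shuffle, note that x.sample also accepts a random_state argument. The sketch below is a small variation on the answer above (to_numpy() is used so the shuffled values are assigned back positionally rather than re-aligned by index):

import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   "some_value": [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
                   "label": [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2]})

def sample(x):
    # frac=1 returns every row of the group in a random order;
    # random_state makes the shuffle repeatable
    return x.sample(frac=1, random_state=42).to_numpy()

df["label"] = df.groupby("group")["label"].transform(sample)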

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups whose value differs from the previous group's value by more than one. Here is an example:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
6      8   <- if I group by Value, this group's value is larger than the
7      8      previous group's value by 6, so I want to remove this
8      3      group from the dataframe
9      3
Expected result:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
6      3
7      3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:

   Value
0      1
1      1
2      1
3      3
4      3
5      3
6      1
7      1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare each difference with the shifted values, and last filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)

   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
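To see what this filter does on the sample data, the intermediates work out as follows (a trace of the answer's variables):

import pandas as pd

df = pd.DataFrame({'Value': [1, 1, 1, 2, 2, 2, 8, 8, 3, 3]})

s = df['Value'].drop_duplicates()  # 1, 2, 8, 3 -- one value per group
print(s.diff().tolist())           # [nan, 1.0, 6.0, -5.0]: jump from the previous group
print(s.shift().tolist())          # [nan, 1.0, 2.0, 8.0]: the previous group's value
v = s[s.diff().gt(s.shift())]      # only 8 qualifies: its jump (6.0) exceeds 2.0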
Maybe:
df2 = df.drop_duplicates()
keep = df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist()
print(df[df['Value'].isin(keep)])
Output:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
We can keep rows where the difference is less than or equal to 5 or is NaN. After that, we check for duplicates and keep those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
