Pandas - changing rows where fewer than n consecutive values are equal - python

I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set all rows to zero where fewer than four 1's appear "in a row", i.e. I would like to get the following resulting DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...

Try with groupby and where:
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
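To see why this works, here is a small illustration of the grouping above (a sketch reusing the df from the question):
import pandas as pd

df = pd.DataFrame({"col": [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]})

# Every change of value starts a new run, so the cumulative sum of the
# "value changed" indicator gives each consecutive run its own id.
run_id = df["col"].ne(df["col"].shift()).cumsum()

# The per-run sum equals the run length for runs of 1's and is 0 for runs of
# 0's, so keeping only rows whose run sum is >= 4 zeroes out short runs of 1's.
run_sum = df.groupby(run_id)["col"].transform("sum")

print(pd.concat([df["col"], run_id.rename("run_id"), run_sum.rename("run_sum")], axis=1))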

We can also group on the cumulative count of zeros and zero out the rows that fall in small groups:
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col']<5,'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
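The threshold here is 5 rather than 4 because each group keyed by df.col.eq(0).cumsum() contains the zero that starts it plus the run of 1's that follows it. A quick check of the group sizes (reusing the same df):
# Each zero increments the key, so a zero row plus the 1's after it share one
# key; a run of four 1's therefore shows up as a group of size 5.
group_key = df["col"].eq(0).cumsum()
print(df.groupby(group_key)["col"].agg(["size", "sum"]))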

Related

Pandas add column on condition: if a cell's boolean is True, set the new column to true for the largest value in the period

I have a pandas dataframe with, let's say, two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the maximum value in the last period rows.
For example, boolean == 1 in the row with value 5. So I look at the last period rows; the values are [1, 5], 5 is the maximum, so new_boolean for that row (the one with value 5) will be 1.
Second example: the row with value 7 (boolean == 1): I get the values [7, 4, 12, 9], 12 is the maximum, so new_boolean in the row with value 12 will be 1.
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply. (With raw=False, the default in recent pandas versions, apply passes each window to the function as a Series that keeps the original row labels, so values.idxmax() returns the label of the window's maximum; apply returns floats, hence the astype(int) below.)
def idxmax(values):
    return values.idxmax()

rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
    .apply(idxmax)
    .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
array([ 1,  4,  5, 10])
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
I did this in two steps, which I think makes the solution clearer:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''), delim_whitespace=True, index_col=0)

df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
Result:
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0
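The row-wise lambda above can also be written in a vectorized form; here is a minimal sketch of the same two-step logic, assuming df already has the 'value' and 'boolean' columns from the question:
# Step 1: rolling max over the last 4 rows (including the current one).
rolling_max = df['value'].rolling(min_periods=1, window=4).max()
# Step 2: flag rows where boolean == 1 and the row's value equals that rolling max.
df['new_bool'] = ((df['value'] == rolling_max) & (df['boolean'] == 1)).astype(int)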

How to do the classification and count of DataFrame columns? [duplicate]

This question already has answers here:
GroupBy Pandas Count Consecutive Zero's
(2 answers)
Closed 1 year ago.
I want to count consecutive 0s: whenever there is a run of 0s, write the length of that run into the count column for each row of the run, and when a 1 is encountered the count resets (rows with 1 get 0).
I tried several methods, but none of them produced the result I need.
An example of my Dataframe is as follows:
import numpy as np
import pandas as pd
np.random.seed(2021)
a = np.random.randint(0, 2, 20)
df = pd.DataFrame(a, columns=['No.'])
print(df)
No.
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
12 1
13 1
14 0
15 0
16 0
17 0
18 0
19 0
The result I need:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 3
6 0 3
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 6
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
I tried the following, but none of these attempts produce the result I need. What should I do?
groups = df['No.'].ne(0).cumsum()
df['count'] = df['No.'].eq(0).groupby(groups).count()
df['count'] = df['No.'].eq(0).groupby(groups).agg(len)
df['count'] = df['No.'].groupby(groups).agg(len)
df['count'] = df['No.'].groupby(groups).count()
For your groups variable, take the diff first, so that each run of consecutive identical values gets its own id. And to get a count Series of the same size as the original DataFrame (so it can be assigned back to it), use transform instead of agg:
df['count'] = 0
groups = df['No.'].diff().ne(0).cumsum()
df.loc[df['No.'] == 0, 'count'] = df['No.'].groupby(groups).transform('size')
df
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 3
6 0 3
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 6
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6

Multiple condition count across two dataframes

I've tried several solutions from similar problems, but so far, no luck. I know it's probably simple.
I have two pandas dataframes. One, df1, contains months and temperatures. The other, df2, contains months and the possible range of temperatures. I would like to count, for each (Month, Temp) combination listed in df2, how many times it occurs in df1.
df1:
Month Temp
1 10
1 10
1 20
2 5
2 10
2 15
df2:
Month Temp
1 0
1 5
1 10
1 15
1 20
1 25
2 0
2 5
2 10
2 15
2 20
2 25
desired output with a new column, Count, in df2:
Month Temp Count
1 0 0
1 5 0
1 10 2
1 15 0
1 20 1
1 25 0
2 0 0
2 5 1
2 10 1
2 15 1
2 20 0
2 25 0
import pandas as pd

df1 = pd.DataFrame({'Month': [1]*3 + [2]*3,
                    'Temp': [10, 10, 20, 5, 10, 15]})
df2 = pd.DataFrame({'Month': [1]*6 + [2]*6,
                    'Temp': [0, 5, 10, 15, 20, 25]*2})
df2['Count'] =
An approach using value_counts and reindex:
new_index = pd.MultiIndex.from_frame(df2)
new_df = (
    df1.value_counts(["Month", "Temp"])
    .reindex(new_index, fill_value=0)
    .rename("Count")
    .reset_index()
)
Month Temp Count
0 1 0 0
1 1 5 0
2 1 10 2
3 1 15 0
4 1 20 1
5 1 25 0
6 2 0 0
7 2 5 1
8 2 10 1
9 2 15 1
10 2 20 0
11 2 25 0
Try this:
(df2.join(
    df1.groupby(['Month', 'Temp']).size().rename('count'),
    on=['Month', 'Temp'])
 .fillna(0))   # note: 'count' comes back as float after fillna; add .astype(int) if integers are needed
Another solution:
x = (
    df1.assign(Count=1)
    .merge(df2, on=["Month", "Temp"], how="outer")
    .fillna(0)
    .groupby(["Month", "Temp"], as_index=False)
    .sum()
    .astype(int)
)
print(x)
Prints:
Month Temp Count
0 1 0 0
1 1 5 0
2 1 10 2
3 1 15 0
4 1 20 1
5 1 25 0
6 2 0 0
7 2 5 1
8 2 10 1
9 2 15 1
10 2 20 0
11 2 25 0
Try:
res = (df2.set_index(['Month', 'Temp'])
       .join(df1.value_counts().to_frame(name='count'))
       .reset_index().fillna(0).astype(int))
OR
di = df1.value_counts().to_dict()
df2['count'] = df2.apply(lambda x: 0 if tuple(x) not in di.keys() else di[tuple(x)], axis=1)
Month Temp count
0 1 0 0
1 1 5 0
2 1 10 2
3 1 15 0
4 1 20 1
5 1 25 0
6 2 0 0
7 2 5 1
8 2 10 1
9 2 15 1
10 2 20 0
11 2 25 0
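All of the answers above follow the same pattern: count the (Month, Temp) pairs that occur in df1, align those counts with the rows of df2, and fill the missing pairs with 0. A compact sketch of that pattern using groupby and a left merge, reusing df1 and df2 from the question:
# Count how often each (Month, Temp) pair occurs in df1.
counts = (df1.groupby(['Month', 'Temp'])
             .size()
             .rename('Count')
             .reset_index())
# Left-merge onto df2 so every row of df2 is kept; unmatched pairs get 0.
out = df2.merge(counts, on=['Month', 'Temp'], how='left')
out['Count'] = out['Count'].fillna(0).astype(int)
print(out)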

How do I create a Sequence in Pyspark that resets when rows change from 0 to 1 and increments when all are 1's

I have a pyspark dataframe like this and need the SEQ output as shown:
R_ID ORDER SC_ITEM seq
A 1 0
A 3 1 1
A 4 1 2
A 5 1 3
A 6 1 4
A 7 1 5
A 8 1 6
A 9 1 7
A 10 0 0
A 11 1 1
A 12 0 0
A 13 1
A 14 0
A 15 1 1
A 16 1 2
A 17 1 3
A 18 1 4
A 19 1 5
A 20 1 6
A 21 0 0
A 22 0 0
B 1 0 0
B 2 1 1
C 1 1 1
C 2 1 2
I did something like this:
RN = Window().orderBy(lit('A'))
.when(((F.col("R_ID")==(lag(F.col("R_ID"),1).over(RN))) & (F.col("SC_ITEM")== 1)), (F.col("SC_ITEM") + (lag(F.col("SEQ"),1).over(RN))))\
I am not sure whether I can use lead or lag over SEQ itself. Please help with how to do this.
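One way to get this kind of run-based sequence without chaining lags is to first mark the blocks of consecutive 1's with a running count of the zero rows, and then number the rows inside each block. A minimal sketch (assuming a DataFrame df with the R_ID, ORDER and SC_ITEM columns shown above; the column name "block" is only a helper introduced here):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("R_ID").orderBy("ORDER")

# Every SC_ITEM == 0 row starts a new block, so the running count of zeros
# gives each run of consecutive 1's (and each zero row) its own block id.
df = df.withColumn("block", F.sum((F.col("SC_ITEM") == 0).cast("int")).over(w))

# Within a block, a running sum of SC_ITEM numbers the 1's 1, 2, 3, ...;
# rows with SC_ITEM == 0 get seq = 0.
w_block = Window.partitionBy("R_ID", "block").orderBy("ORDER")
df = df.withColumn(
    "seq",
    F.when(F.col("SC_ITEM") == 1, F.sum("SC_ITEM").over(w_block)).otherwise(F.lit(0)),
)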

Leave blocks of 1 of size >= k in Pandas data frame

I need to keep blocks of '1' of size >= k; all other blocks of '1' should be transformed to zero. For example, k=2:
df=
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
where column a is the original sequence and column b is the desired result.
z = df.a.eq(0)                      # True on the zero rows
g = z.cumsum().mask(z, -1)          # one group id per block of 1's (all zero rows lumped together as -1)
k = 2
df['b'] = df.a.groupby(g).transform('size').ge(k).mask(z, 0).astype(int)
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
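Wrapped as a small reusable helper (a sketch; the function name keep_long_one_blocks is just chosen here), the same idea works for any k, including the k=4 case from the first question above:
import pandas as pd

def keep_long_one_blocks(s: pd.Series, k: int) -> pd.Series:
    """Zero out every run of 1's in a 0/1 Series that is shorter than k."""
    is_zero = s.eq(0)
    run_id = is_zero.cumsum().mask(is_zero, -1)   # one id per run of 1's
    keep = s.groupby(run_id).transform('size').ge(k)
    return keep.mask(is_zero, False).astype(int)

# Example: df['b'] = keep_long_one_blocks(df['a'], k=2)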
