Streaks of True or False in pandas Series - python

I'm trying to work out how to show streaks of True or False in a pandas Series.
Data:
p = pd.Series([True,False,True,True,True,True,False,False,True])
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
dtype: bool
I tried p.diff(), but I'm not sure how to count the False values this generates to produce my desired output, which is as follows:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0

You can use cumcount on consecutive groups, created by comparing p with its shifted self (ne) and taking the cumsum:
print (p.ne(p.shift()))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
dtype: bool
print (p.ne(p.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
dtype: int32
print (p.groupby(p.ne(p.shift()).cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Thank you MaxU for another solution:
print (p.groupby(p.diff().cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64

Another alternative is to take the cumulative sum of the p Series and subtract the most recent cumulative sum where p is False. Then invert p and do the same. Finally, multiply the two Series together:
c = p.cumsum()
a = c.sub(c.mask(p).ffill(), fill_value=0).sub(1).abs()
c = (~p).cumsum()
d = c.sub(c.mask(~(p)).ffill(), fill_value=0).sub(1).abs()
print (a)
0 0.0
1 1.0
2 0.0
3 1.0
4 2.0
5 3.0
6 1.0
7 1.0
8 0.0
dtype: float64
print (d)
0 1.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 1.0
8 1.0
dtype: float64
print (a.mul(d).astype(int))
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int32
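For reuse, the whole idiom can be wrapped in a small helper; a minimal sketch (the name streaks is illustrative, not from the answers above):
def streaks(s):
    # label consecutive runs: a new label starts wherever the value changes
    groups = s.ne(s.shift()).cumsum()
    # count positions within each run, starting from 0
    return s.groupby(groups).cumcount()

print (streaks(p))  # reproduces the desired output above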

Related

How do I assign elements to the column of a pandas dataframe based on the properties of groups derived from that dataframe?

Suppose I import pandas and numpy as follows:
import pandas as pd
import numpy as np
and construct the following dataframe:
df = pd.DataFrame({'Alpha' : ['A','A','A','B','B','B','B','C','C','C','C','C'],
                   'Beta' : np.nan})
...which gives me this:
Alpha Beta
0 A NaN
1 A NaN
2 A NaN
3 B NaN
4 B NaN
5 B NaN
6 B NaN
7 C NaN
8 C NaN
9 C NaN
10 C NaN
11 C NaN
How do I use pandas to get the following dataframe?
df_u = pd.DataFrame({'Alpha':['A','A','A','B','B','B','B','C','C','C','C','C'],'Beta' : [1,2,3,1,2,2,3,1,2,2,2,3]})
i.e. this:
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Generally speaking what I'm trying to achieve can be described by the following logic:
Suppose we group df by Alpha.
For every group, for every row in the group...
if the index of the row equals the minimum index of rows in the group, then assign 1 to Beta for that row,
else if the index of the row equals the maximum index of the rows in the group, then assign 3 to Beta for that row,
else assign 2 to Beta for that row.
Let's use duplicated:
df.loc[~df.duplicated('Alpha', keep='last'), 'Beta'] = 3
df.loc[~df.duplicated('Alpha', keep='first'), 'Beta'] = 1
df['Beta'] = df['Beta'].fillna(2)
print(df)
Output:
Alpha Beta
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 2.0
5 B 2.0
6 B 3.0
7 C 1.0
8 C 2.0
9 C 2.0
10 C 2.0
11 C 3.0
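The Beta column comes out as float because it started as NaN; if integers are wanted, a cast after the fillna fixes that (a small optional step):
df['Beta'] = df['Beta'].astype(int)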
Method 1
Use np.select:
mask1=df['Alpha'].ne(df['Alpha'].shift())
mask3=df['Alpha'].ne(df['Alpha'].shift(-1))
mask2=~(mask1|mask3)
cond=[mask1,mask2,mask3]
values=[1,2,3]
df['Beta']=np.select(cond,values)
print(df)
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Detail of cond list:
print(mask1)
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
Name: Alpha, dtype: bool
print(mask2)
0 False
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 True
11 False
Name: Alpha, dtype: bool
print(mask3)
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
Name: Alpha, dtype: bool
Method 2
Use groupby:
def assign_value(x):
    return pd.Series([1]+[2]*(len(x)-2)+[3])
new_df=df.groupby('Alpha').apply(assign_value).rename('Beta').reset_index('Alpha')
print(new_df)
Alpha Beta
0 A 1
1 A 2
2 A 3
0 B 1
1 B 2
2 B 2
3 B 3
0 C 1
1 C 2
2 C 2
3 C 2
4 C 3
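Note that the result keeps each group's own positional index (0, 1, 2, ...); if a plain RangeIndex is wanted, it can be reset afterwards:
new_df = new_df.reset_index(drop=True)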
Assuming that the "Alpha" column is sorted, you can do it like this:
df["Beta"] = 2
df.loc[~(df["Alpha"] == df["Alpha"].shift()), "Beta"] = 1
df.loc[~(df["Alpha"] == df["Alpha"].shift(-1)), "Beta"] = 3
df
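If "Alpha" is not sorted, a variant that does not rely on ordering can be built from group positions with cumcount and np.select; a minimal sketch reusing the df from the question:
pos = df.groupby('Alpha').cumcount()                   # 0-based position within each group
size = df.groupby('Alpha')['Alpha'].transform('size')  # size of each group
df['Beta'] = np.select([pos.eq(0), pos.eq(size - 1)], [1, 3], default=2)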

How to count consecutive repetitions in a pandas series

Consider the following series, ser
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How can I count the values such that every run of consecutive values is counted and returned? Thanks.
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1
Here is another approach using fillna to handle NaN values:
s = df.id.fillna('nan')
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
counts = s.groupby(mask.cumsum()).size().to_numpy()
# Convert 'nan' string back to `NaN`
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64
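As a cross-check, the same run lengths can be computed in plain Python with itertools.groupby; a minimal sketch, rebuilding only the id values from the question (NaNs are mapped to a sentinel because NaN != NaN):
from itertools import groupby

import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 1, 1, 2, 2, 2, np.nan, np.nan, 1, 1, 1, np.nan])

key = lambda x: 'nan' if pd.isna(x) else x
runs = [(k, sum(1 for _ in grp)) for k, grp in groupby(ser, key=key)]
print (runs)
# [('nan', 2), (1.0, 2), (2.0, 3), ('nan', 2), (1.0, 3), ('nan', 1)]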
The cumsum trick is useful here; it's a little tricky with the NaNs though, so you need to handle these separately. Compare each row with the previous one (shift()), treating two consecutive NaNs as equal. First the NaN pairs:
In [11]: df.id.isnull() & df.id.shift().isnull()
Out[11]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
Name: id, dtype: bool
In [12]: df.id.eq(df.id.shift())
Out[12]:
0 False
1 False
2 False
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 False
Name: id, dtype: bool
A row continues a run if either condition holds, so inverting the combined mask marks the start of each run, and the cumulative sum labels the runs:
In [13]: same = (df.id.isnull() & df.id.shift().isnull()) | df.id.eq(df.id.shift())
In [14]: (~same).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 4
10 4
11 4
12 5
Name: id, dtype: int64
Now you can use this labeling in your groupby:
In [15]: g = df.groupby((~same).cumsum())
In [16]: pd.DataFrame({"count": g.id.size(), "id": g.id.nth(0)})
Out[16]:
count id
id
0 2 NaN
1 2 1.0
2 3 2.0
3 2 NaN
4 3 1.0
5 1 NaN

Iterating over pandas rows

Having a df:
cell;value
0;8
1;2
2;1
3;6
4;4
5;6
6;7
And I'm trying to define a function that will check the cell value of the row after the observed one. If the value of the cell after the observed one (i+1) is bigger than the observed one (i), then the value in a new column maxValue should be 0; if smaller, 1.
The final df should look like:
cell;value;maxValue
0;8;1
1;2;1
2;1;0
3;6;1
4;4;0
5;6;0
6;7;0
My solution, which does not work yet, is:
def MaxFind(df, a, col='value'):
    if df.iloc[a+1][col] > df.iloc[a][col]:
        return 0

df['maxValue'] = df.apply(lambda row: MaxFind(df, row.value), axis=1)
I believe you need shift(-1) to compare each value with the next one, le for the comparison (a missing next value then compares as False), and a cast to integers:
df['maxValue'] = df['value'].shift(-1).le(df['value']).astype(int)
print (df)
cell value maxValue
0 0 8 1
1 1 2 1
2 2 1 0
3 3 6 1
4 4 4 0
5 5 6 0
6 6 7 0
Details:
df['shifted'] = df['value'].shift(-1)
df['mask'] = df['value'].shift(-1).le(df['value'])
df['maxValue'] = df['value'].shift(-1).le(df['value']).astype(int)
print (df)
cell value shifted mask maxValue
0 0 8 2.0 True 1
1 1 2 1.0 True 1
2 2 1 6.0 False 0
3 3 6 4.0 True 1
4 4 4 6.0 False 0
5 5 6 7.0 False 0
6 6 7 NaN False 0
EDIT:
df['maxValue'] = df['value'].shift(1).le(df['value'].shift(-1)).astype(int)
print (df)
cell value maxValue
0 0 8 0
1 1 2 0
2 2 1 1
3 3 6 1
4 4 4 1
5 5 6 1
6 6 7 0
df['shift_1'] = df['value'].shift(1)
df['shift_-1'] = df['value'].shift(-1)
df['mask'] = df['value'].shift(1).le(df['value'].shift(-1))
df['maxValue'] = df['value'].shift(1).le(df['value'].shift(-1)).astype(int)
print (df)
cell value shift_1 shift_-1 mask maxValue
0 0 8 NaN 2.0 False 0
1 1 2 8.0 1.0 False 0
2 2 1 2.0 6.0 True 1
3 3 6 1.0 4.0 True 1
4 4 4 6.0 6.0 True 1
5 5 6 4.0 7.0 True 1
6 6 7 6.0 NaN False 0
Shifting leaves missing values at the start or end of the column; the larger the shift, the more of them, as shift(2) shows below. If necessary, they can be replaced by the first or last non-NaN values with back or forward filling:
df['shift_2'] = df['value'].shift(2)
df['shift_-2'] = df['value'].shift(-2)
df['mask'] = df['value'].shift(2).le(df['value'].shift(-2))
df['maxValue'] = df['value'].shift(2).le(df['value'].shift(-2)).astype(int)
print (df)
cell value shift_2 shift_-2 mask maxValue
0 0 8 NaN 1.0 False 0
1 1 2 NaN 6.0 False 0
2 2 1 8.0 4.0 False 0
3 3 6 2.0 6.0 True 1
4 4 4 1.0 7.0 True 1
5 5 6 6.0 NaN False 0
6 6 7 4.0 NaN False 0
df['shift_2'] = df['value'].shift(2).bfill()
df['shift_-2'] = df['value'].shift(-2).ffill()
df['mask'] = df['value'].shift(2).bfill().le(df['value'].shift(-2).ffill())
df['maxValue'] = df['value'].shift(2).bfill().le(df['value'].shift(-2).ffill()).astype(int)
print (df)
cell value shift_2 shift_-2 mask maxValue
0 0 8 8.0 1.0 False 0
1 1 2 8.0 6.0 False 0
2 2 1 8.0 4.0 False 0
3 3 6 2.0 6.0 True 1
4 4 4 1.0 7.0 True 1
5 5 6 6.0 7.0 True 1
6 6 7 4.0 7.0 True 1
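For comparison, the asker's row-wise idea can also be made to work by guarding the last row, although the vectorized shift solutions above are preferred; a sketch (max_find is a hypothetical fixed-up version of MaxFind):
df = pd.DataFrame({'cell': range(7), 'value': [8, 2, 1, 6, 4, 6, 7]})

def max_find(df, i, col='value'):
    # 0 when the next value is bigger (or when there is no next value), else 1
    if i + 1 >= len(df) or df.iloc[i + 1][col] > df.iloc[i][col]:
        return 0
    return 1

df['maxValue'] = [max_find(df, i) for i in range(len(df))]
# matches the desired output: 1, 1, 0, 1, 0, 0, 0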

No results are returned in dataframe

I am trying to take the average of every fifth and sixth row of var A in a dataframe and put the result in a new column as var B, but NaN shows up instead. Did I fail to return the values correctly?
Here is the sample data:
PID A
1 0
1 3
1 2
1 6
1 0
1 2
2 3
2 3
2 1
2 4
2 0
2 4
Expected results:
PID A B
1 0 1
1 3 1
1 2 1
1 6 1
1 0 1
1 2 1
2 3 2
2 3 2
2 1 2
2 4 2
2 0 2
2 4 2
My codes:
lst1 = df.iloc[5::6, :]
lst2 = df.iloc[4::6, :]
df['B'] = (lst1['A'] + lst2['A'])/2
print(df['B'])
The script runs without error, but var B is empty and shows NaN.
Thanks for your help!
The problem is that the data are not aligned, because the indexes differ, so you get NaNs:
print(lst1)
PID A
5 1 2
11 2 4
print(lst2)
PID A
4 1 0
10 2 0
print (lst1['A'] + lst2['A'])
4 NaN
5 NaN
10 NaN
11 NaN
Name: A, dtype: float64
The solution is to use values to convert one Series to a numpy array, so no index alignment happens:
print (lst1['A'] + (lst2['A'].values))
5 2
11 4
Name: A, dtype: int64
Or you can sum 2 numpy arrays:
print (lst1['A'].values + (lst2['A'].values))
[2 4]
It seems you need:
df['B'] = (lst1['A'] + lst2['A'].values).div(2)
df['B'] = df['B'].bfill()
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
But if you need the mean of the 5th and 6th values per PID group, use groupby with transform:
df['B'] = df.groupby('PID').transform(lambda x: x.iloc[[4, 5]].mean())
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 1
Straightforward way taking the mean of the 5th and 6th positions within each group defined by 'PID'.
df.assign(B=df.groupby('PID').transform(lambda x: x.values[[4, 5]].mean()))
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 2
Fun way using join assuming there are actually exactly 6 rows for each 'PID'.
df.join(df.set_index('PID').A.pipe(lambda d: (d.iloc[4::6] + d.iloc[5::6]) / 2).rename('B'), on='PID')
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
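If the rule is really "the mean of the last two rows of each PID block" rather than fixed positions 5 and 6, a transform over the tail generalizes to groups of any size; a minimal sketch:
df['B'] = df.groupby('PID')['A'].transform(lambda x: x.iloc[-2:].mean())
# group 1: (0 + 2) / 2 = 1.0, group 2: (0 + 4) / 2 = 2.0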

How to "iron out" a column of numbers with duplicates in it

If one has the following column:
df = pd.DataFrame({"numbers":[1,2,3,4,4,5,1,2,2,3,4,5,6,7,7,8,1,1,2,2,3,4,5,6,6,7]})
How can one "iron" it out so that the duplicates become part of the series of numbers:
numbers new_numbers
1 1
2 2
3 3
4 4
4 5
5 6

1 1
2 2
2 3
3 4
4 5
5 6
6 7
7 8
7 9
8 10

1 1
1 2
2 3
2 4
3 5
4 6
5 7
6 8
6 9
7 10
(I put spaces into the df for clarification)
It seems you need cumcount. A Series created with diff and compared with lt (<) marks the start of each group, and cumsum turns those marks into group labels:
# helper DataFrame df1 for inspecting the intermediate steps
df1 = pd.DataFrame(index=df.index)
df1['dif'] = df.numbers.diff()
df1['compare'] = df.numbers.diff().lt(0)
df1['groups'] = df.numbers.diff().lt(0).cumsum()
print (df1)
dif compare groups
0 NaN False 0
1 1.0 False 0
2 1.0 False 0
3 1.0 False 0
4 0.0 False 0
5 1.0 False 0
6 -4.0 True 1
7 1.0 False 1
8 0.0 False 1
9 1.0 False 1
10 1.0 False 1
11 1.0 False 1
12 1.0 False 1
13 1.0 False 1
14 0.0 False 1
15 1.0 False 1
16 -7.0 True 2
17 0.0 False 2
18 1.0 False 2
19 0.0 False 2
20 1.0 False 2
21 1.0 False 2
22 1.0 False 2
23 1.0 False 2
24 0.0 False 2
25 1.0 False 2
df['new_numbers'] = df.groupby(df.numbers.diff().lt(0).cumsum()).cumcount() + 1
print (df)
numbers new_numbers
0 1 1
1 2 2
2 3 3
3 4 4
4 4 5
5 5 6
6 1 1
7 2 2
8 2 3
9 3 4
10 4 5
11 5 6
12 6 7
13 7 8
14 7 9
15 8 10
16 1 1
17 1 2
18 2 3
19 2 4
20 3 5
21 4 6
22 5 7
23 6 8
24 6 9
25 7 10
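As a sanity check, the same "ironing" logic in plain Python; a sketch that restarts the counter whenever the current number is smaller than the previous one and otherwise keeps counting, duplicates included:
new_numbers, prev, counter = [], None, 0
for n in df['numbers']:
    # a drop in value marks the start of a new group
    counter = 1 if prev is not None and n < prev else counter + 1
    new_numbers.append(counter)
    prev = n

assert new_numbers == list(df['new_numbers'])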
