Iterating over pandas rows - python

Having a df:
cell;value
0;8
1;2
2;1
3;6
4;4
5;6
6;7
And I'm trying to define a function that checks the cell value of the row after the observed one. If the value of the cell after the observed one (i+1) is bigger than the observed one (i), then the value in a new column maxValue is 0; if smaller, 1.
The final df should look like:
cell;value;maxValue
0;8;1
1;2;1
2;1;0
3;6;1
4;4;0
5;6;0
6;7;0
My solution that does not work yet is:
def MaxFind(df, a, col='value'):
    # fails on the last row, where a+1 is out of bounds, and returns None instead of 1
    if df.iloc[a+1][col] > df.iloc[a][col]:
        return 0

# bug: this passes the cell value, not the row position, to MaxFind
df['maxValue'] = df.apply(lambda row: MaxFind(df, row.value), axis=1)

I believe you need shift(-1) to align each value with the one from the following row, compare with le, and cast the boolean mask to integers:
df['maxValue'] = df['value'].shift(-1).le(df['value']).astype(int)
#another solution
#df['maxValue'] = df['value'].ge(df['value'].shift(-1)).astype(int)
print (df)
cell value maxValue
0 0 8 1
1 1 2 1
2 2 1 0
3 3 6 1
4 4 4 0
5 5 6 0
6 6 7 0
Details:
df['shifted'] = df['value'].shift(-1)
df['mask'] = df['value'].shift(-1).le(df['value'])
df['maxValue'] = df['value'].shift(-1).le(df['value']).astype(int)
print (df)
cell value shifted mask maxValue
0 0 8 2.0 True 1
1 1 2 1.0 True 1
2 2 1 6.0 False 0
3 3 6 4.0 True 1
4 4 4 6.0 False 0
5 5 6 7.0 False 0
6 6 7 NaN False 0
EDIT: To compare the previous row (i-1) with the next row (i+1) instead:
df['maxValue'] = df['value'].shift(1).le(df['value'].shift(-1)).astype(int)
print (df)
cell value maxValue
0 0 8 0
1 1 2 0
2 2 1 1
3 3 6 1
4 4 4 1
5 5 6 1
6 6 7 0
df['shift_1'] = df['value'].shift(1)
df['shift_-1'] = df['value'].shift(-1)
df['mask'] = df['value'].shift(1).le(df['value'].shift(-1))
df['maxValue'] = df['value'].shift(1).le(df['value'].shift(-1)).astype(int)
print (df)
cell value shift_1 shift_-1 mask maxValue
0 0 8 NaN 2.0 False 0
1 1 2 8.0 1.0 False 0
2 2 1 2.0 6.0 True 1
3 3 6 1.0 4.0 True 1
4 4 4 6.0 6.0 True 1
5 5 6 4.0 7.0 True 1
6 6 7 6.0 NaN False 0
Shifting always produces missing values at the start or the end of the Series. If necessary, they can be replaced by the first or last non-NaN value with back or forward filling:
df['shift_2'] = df['value'].shift(2)
df['shift_-2'] = df['value'].shift(-2)
df['mask'] = df['value'].shift(2).le(df['value'].shift(-2))
df['maxValue'] = df['value'].shift(2).le(df['value'].shift(-2)).astype(int)
print (df)
cell value shift_2 shift_-2 mask maxValue
0 0 8 NaN 1.0 False 0
1 1 2 NaN 6.0 False 0
2 2 1 8.0 4.0 False 0
3 3 6 2.0 6.0 True 1
4 4 4 1.0 7.0 True 1
5 5 6 6.0 NaN False 0
6 6 7 4.0 NaN False 0
df['shift_2'] = df['value'].shift(2).bfill()
df['shift_-2'] = df['value'].shift(-2).ffill()
df['mask'] = df['value'].shift(2).bfill().le(df['value'].shift(-2).ffill())
df['maxValue'] = df['value'].shift(2).bfill().le(df['value'].shift(-2).ffill()).astype(int)
print (df)
cell value shift_2 shift_-2 mask maxValue
0 0 8 8.0 1.0 False 0
1 1 2 8.0 6.0 False 0
2 2 1 8.0 4.0 False 0
3 3 6 2.0 6.0 True 1
4 4 4 1.0 7.0 True 1
5 5 6 6.0 7.0 True 1
6 6 7 4.0 7.0 True 1
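For completeness, the OP's row-by-row idea also works with a plain loop over positions; a minimal sketch against the df above (much slower than shift on large frames; the last row, which has no successor, gets 0, matching the expected output):
out = []
for i in range(len(df) - 1):
    # 1 if the next value is smaller than or equal to the current one, 0 if bigger
    out.append(1 if df['value'].iloc[i + 1] <= df['value'].iloc[i] else 0)
out.append(0)  # the last row has no following row
df['maxValue'] = out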

Related

How to insert list of values into null values of a column in python?

I am new to pandas and I am facing an issue with null values. I have a list of 3 values that have to be inserted into a column in place of its missing values. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list is the same length as the number of NaNs:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
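A slightly safer variant of the assignment above (a small sketch) checks the length match first, since a mismatch otherwise raises a less obvious indexing error:
mask = df['b'].isna()
assert mask.sum() == len(l), 'list length must match the number of NaNs'
df.loc[mask, 'b'] = l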
Try with stack, assign the values, then unstack back:
s = df.stack(dropna=False)
s.loc[s.isna()] = l  # change the list name to l here; shadowing the built-in name list creates problems later
df = s.unstack()
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0
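Another option, assuming as above that the NaN count equals the list length, is to build a Series indexed by the NaN positions and pass it to fillna, which aligns on the index (a minimal sketch):
fill = pd.Series([11, 22, 44], index=df.index[df['b'].isna()])
df['b'] = df['b'].fillna(fill)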

How to create an increment var from the first value of a dataframe group?

I have a dataframe:
data = [[0,1,5],
        [0,1,6],
        [0,0,8],
        [0,0,10],
        [0,1,12],
        [0,0,14],
        [0,1,16],
        [0,1,18],
        [1,0,2],
        [1,1,0],
        [1,0,1],
        [1,0,2]]
df = pd.DataFrame(data,columns=['KEY','COND','VAL'])
For RES1, I want to create a counter variable that increments where COND == 1. The value of RES1 for the first row of each KEY group remains the same as VAL (can I use cumcount() in some way?).
For RES2, I then just want to fill the missing values with the previous value (df.fillna(method='ffill'), I am thinking...).
KEY COND VAL RES1 RES2
0 0 1 5 5 5
1 0 1 6 6 6
2 0 0 8 6
3 0 0 10 6
4 0 1 12 7 7
5 0 0 14 7
6 0 1 16 8 8
7 0 1 18 9 9
8 1 0 2 2 2
9 1 1 0 3 3
10 1 0 1 3
11 1 0 2 3
The aim is a vectorized solution that is optimal over a million rows.
IIUC
con=(df.COND==1)|(df.index.isin(df.drop_duplicates('KEY').index))
df['res1'] = (df.groupby('KEY').VAL.transform('first')
              + df.groupby('KEY').COND.cumsum()[con]
              - df.groupby('KEY').COND.transform('first'))
df['res2']=df.res1.ffill()
df
Out[148]:
KEY COND VAL res1 res2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
You want:
s = (df[df.KEY.duplicated()]                                    # ignore first row in each KEY group
       .groupby('KEY').COND.cumsum()                            # counter within KEY
       .add(df.groupby('KEY').VAL.transform('first'))           # add first value per KEY
       .where(df.COND.eq(1))                                    # keep only where COND == 1
       .add(df.loc[~df.KEY.duplicated(), 'VAL'], fill_value=0)  # set 1st row by KEY
    )
df['RES1'] = s
df['RES2'] = df['RES1'].ffill()
KEY COND VAL RES1 RES2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
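The same logic can also be written per group with a plain apply; a readable but slower sketch (not tuned for a million rows):
def res1(g):
    # start the counter from the group's first VAL, increment at every COND == 1
    counter = g['VAL'].iloc[0] + g['COND'].cumsum() - g['COND'].iloc[0]
    out = counter.where(g['COND'].eq(1))
    out.iloc[0] = g['VAL'].iloc[0]  # the first row of each group keeps VAL
    return out

df['RES1'] = df.groupby('KEY', group_keys=False)[['VAL', 'COND']].apply(res1)
df['RES2'] = df['RES1'].ffill()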

How do I assign elements to the column of a pandas dataframe based on the properties of groups derived from that dataframe?

Suppose I import pandas and numpy as follows:
import pandas as pd
import numpy as np
and construct the following dataframe:
df = pd.DataFrame({'Alpha' : ['A','A','A','B','B','B','B','C','C','C','C','C'],
                   'Beta' : np.nan})
...which gives me this:
Alpha Beta
0 A NaN
1 A NaN
2 A NaN
3 B NaN
4 B NaN
5 B NaN
6 B NaN
7 C NaN
8 C NaN
9 C NaN
10 C NaN
11 C NaN
How do I use pandas to get the following dataframe?
df_u = pd.DataFrame({'Alpha':['A','A','A','B','B','B','B','C','C','C','C','C'],'Beta' : [1,2,3,1,2,2,3,1,2,2,2,3]})
i.e. this:
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Generally speaking what I'm trying to achieve can be described by the following logic:
Suppose we group df by Alpha.
For every group, for every row in the group...
if the index of the row equals the minimum index of rows in the group, then assign 1 to Beta for that row,
else if the index of the row equals the maximum index of the rows in the group, then assign 3 to Beta for that row,
else assign 2 to Beta for that row.
Let's use duplicated:
df.loc[~df.duplicated('Alpha', keep='last'), 'Beta'] = 3
df.loc[~df.duplicated('Alpha', keep='first'), 'Beta'] = 1
df['Beta'] = df['Beta'].fillna(2)
print(df)
Output:
Alpha Beta
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 2.0
5 B 2.0
6 B 3.0
7 C 1.0
8 C 2.0
9 C 2.0
10 C 2.0
11 C 3.0
method 1
Use np.select:
mask1=df['Alpha'].ne(df['Alpha'].shift())
mask3=df['Alpha'].ne(df['Alpha'].shift(-1))
mask2=~(mask1|mask3)
cond=[mask1,mask2,mask3]
values=[1,2,3]
df['Beta']=np.select(cond,values)
print(df)
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Details of the cond list:
print(mask1)
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
Name: Alpha, dtype: bool
print(mask2)
0 False
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 True
11 False
Name: Alpha, dtype: bool
print(mask3)
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
Name: Alpha, dtype: bool
method 2
Use groupby:
def assign_value(x):
    return pd.Series([1] + [2]*(len(x)-2) + [3])

new_df = df.groupby('Alpha').apply(assign_value).rename('Beta').reset_index('Alpha')
print(new_df)
Alpha Beta
0 A 1
1 A 2
2 A 3
0 B 1
1 B 2
2 B 2
3 B 3
0 C 1
1 C 2
2 C 2
3 C 2
4 C 3
Assuming that the "Alpha" column is sorted, you can do it like this:
df["Beta"] = 2
df.loc[~(df["Alpha"] == df["Alpha"].shift()), "Beta"] = 1
df.loc[~(df["Alpha"] == df["Alpha"].shift(-1)), "Beta"] = 3
df
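If "Alpha" may be unsorted, a position-based sketch with groupby.cumcount implements the first/last/middle logic from the question directly (a sketch, assuming every group has at least two rows so first and last do not coincide):
g = df.groupby('Alpha')
first = g.cumcount().eq(0)                # True on the first row of each group
last = g.cumcount(ascending=False).eq(0)  # True on the last row of each group
df['Beta'] = np.select([first, last], [1, 3], default=2)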

No results are returned in dataframe

I am trying to take the average of every fifth and every sixth row of var A in a dataframe and put the result in a new column as var B. But it shows NaN. Could this be because I did not return the values correctly?
Here is the sample data:
PID A
1 0
1 3
1 2
1 6
1 0
1 2
2 3
2 3
2 1
2 4
2 0
2 4
Expected results:
PID A B
1 0 1
1 3 1
1 2 1
1 6 1
1 0 1
1 2 1
2 3 2
2 3 2
2 1 2
2 4 2
2 0 2
2 4 2
My code:
lst1 = df.iloc[5::6, :]
lst2 = df.iloc[4::6, :]
df['B'] = (lst1['A'] + lst2['A'])/2
print(df['B'])
The script runs without error, but var B is empty and shows NaN.
Thanks for your help!
The problem is that the data are not aligned: the two slices have different indexes, so adding them produces NaNs.
print(lst1)
PID A
5 1 2
11 2 4
print(lst2)
PID A
4 1 0
10 2 0
print (lst1['A'] + lst2['A'])
4 NaN
5 NaN
10 NaN
11 NaN
Name: A, dtype: float64
The solution is to use values, which adds the Series to a numpy array positionally:
print (lst1['A'] + (lst2['A'].values))
5 2
11 4
Name: A, dtype: int64
Or you can sum 2 numpy arrays:
print (lst1['A'].values + (lst2['A'].values))
[2 4]
It seems you need:
df['B'] = (lst1['A'] + lst2['A'].values).div(2)
df['B'] = df['B'].bfill()
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
But if you need the mean of the 5th and 6th values per PID group, use groupby with transform:
df['B'] = df.groupby('PID').transform(lambda x: x.iloc[[4, 5]].mean())
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 1
Straightforward way taking the mean of the 5th and 6th positions within each group defined by 'PID'.
df.assign(B=df.groupby('PID').transform(lambda x: x.values[[4, 5]].mean()))
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 2
A fun way using join, assuming there are actually exactly 6 rows for each 'PID'.
df.join(df.set_index('PID').A.pipe(lambda d: (d.iloc[4::6] + d.iloc[5::6]) / 2).rename('B'), on='PID')
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
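On current pandas, .to_numpy() is the recommended spelling of .values; a minimal sketch of the same alignment fix, using the lst1/lst2 slices from the question:
df.loc[lst1.index, 'B'] = (lst1['A'].to_numpy() + lst2['A'].to_numpy()) / 2
df['B'] = df['B'].bfill()  # fill each group's rows backwards from its last row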

Streaks of True or False in pandas Series

I'm trying to work out how to show streaks of True or False in a pandas Series.
Data:
p = pd.Series([True,False,True,True,True,True,False,False,True])
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
dtype: bool
I tried p.diff() but I am not sure how to turn the result into counts that match my desired output, which is as follows:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
You can use cumcount on groups of consecutive values, created by comparing p with its shifted self and taking the cumsum:
print (p.ne(p.shift()))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
dtype: bool
print (p.ne(p.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
dtype: int32
print (p.groupby(p.ne(p.shift()).cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Thank you MaxU for another solution:
print (p.groupby(p.diff().cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Another alternative: create the cumulative sum of the p Series and subtract the most recent cumulative sum where p is False. Then invert p and do the same. Finally, multiply the two Series together:
c = p.cumsum()
a = c.sub(c.mask(p).ffill(), fill_value=0).sub(1).abs()
c = (~p).cumsum()
d = c.sub(c.mask(~(p)).ffill(), fill_value=0).sub(1).abs()
print (a)
0 0.0
1 1.0
2 0.0
3 1.0
4 2.0
5 3.0
6 1.0
7 1.0
8 0.0
dtype: float64
print (d)
0 1.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 1.0
8 1.0
dtype: float64
print (a.mul(d).astype(int))
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int32
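Wrapped as a reusable helper, the cumcount approach above looks like this (a minimal sketch):
def streaks(s):
    # label each run of consecutive equal values, then number positions within each run
    runs = s.ne(s.shift()).cumsum()
    return s.groupby(runs).cumcount()

print(streaks(p))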
