pandas series add with previous row on condition - python
I need to build a running total over a series, adding to the previous row's value only if a condition matches in the current cell. Here's the dataframe:
import pandas as pd
data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to add +1 to the previous running sum if the current value is > 0, else -1. So the result I'm expecting is:
col1 continuous
0 1 1 // +1 as it's non-zero
1 2 2 // +1 as it's non-zero
2 1 3 // +1 as it's non-zero
3 0 2 // -1 as it's zero
4 0 1
5 0 0
6 0 0 // must not go below 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
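In plain Python, the intended logic looks like this (an illustrative sketch, not part of the original question): start at 0, add 1 for a positive value, subtract 1 otherwise, and clamp at 0.

running = 0
continuous = []
for v in df['col1']:
    if v > 0:
        running += 1                    # +1 for a non-zero (positive) value
    else:
        running = max(running - 1, 0)   # -1 for zero, but never below 0
    continuous.append(running)
df['continuous'] = continuous

This reproduces the expected column above.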
Case 2: the same, but instead of > 0 the condition is < -0.1:
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.042161609,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = -0.1                          # case-2 threshold
s = df['col1'].lt(lim)              # condition: current value < -0.1
out = s.where(s, -1).cumsum()       # +1 where the condition holds, -1 otherwise
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)   # re-base so it never drops below 0
print(df)
The key problem here, to me, is keeping out from going below zero. With that in mind, we can mask the output where it is negative and adjust accordingly:
# a little longer data to cover a corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 4]})
s = df.col1.gt(0)                  # condition: col1 > 0
out = s.where(s, -1).cumsum()      # +1/-1 running total, may dip below zero
df['continuous'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
Output:
col1 continuous
0 1 1
1 2 2
2 1 3
3 0 2
4 0 1
5 0 0
6 0 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
12 0 0
13 0 0
14 0 0
15 2 1
16 3 2
17 4 3
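To see why the re-basing line works, split it into pieces (an illustrative breakdown; floor is a hypothetical name, not in the original answer):

s = df.col1.gt(0)                 # condition per row
out = s.where(s, -1).cumsum()     # unclamped running total; reaches -3 around rows 12-14
floor = out.where((out < 0) & (~s)).ffill().fillna(0)
# floor carries forward the most recent sub-zero level that out reached while the
# condition was False; subtracting it re-bases the total so it bottoms out at 0
df['continuous'] = out - floor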
You can do this using cumsum on booleans.
Give me a +1 whenever col1 is not zero:
(df.col1 != 0).cumsum()
Give me a -1 whenever col1 is zero:
-(df.col1 == 0).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0).cumsum() - (df.col1 == 0).cumsum()
However, this does not satisfy the not-going-below-zero criterion you mentioned.
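If you also need the floor at zero, one option (my own sketch, not part of this answer) is itertools.accumulate with a clamped step:

from itertools import accumulate

steps = [1 if v != 0 else -1 for v in df.col1]   # +1 per non-zero row, -1 per zero row
acc = accumulate(steps, lambda total, step: max(total + step, 0), initial=0)  # initial= needs Python 3.8+
df['continuous'] = list(acc)[1:]                 # drop the seed value 0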
Related
Compare two rows with a for loop in Pandas
I have the following dataframe, where I want to determine if column A is greater than column B, and if column B is greater than column C. In case a value is smaller than the one in the previous column, I want to change that value to 0.

d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9]}
df = pd.DataFrame(data=d)
df

I have tried this with np.where and it is working:

df['B'] = np.where(df['A'] > df['B'], 0, df['B'])
df['C'] = np.where(df['B'] > df['C'], 0, df['C'])

However, I have a huge number of columns and I want to know if there is any way to do this without writing each comparison separately, for example with a for loop. Thanks
A solution with different output, because the original columns are compared with DataFrame.diff and values less than 0 are set to 0 by DataFrame.mask:

df1 = df.mask(df.diff(axis=1).lt(0), 0)
print(df1)

    A   B   C
0   6   0   0
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9

If a loop with zip over shifted column names is used instead, the output is different, because the already reassigned columns B, C, ... are compared:

for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print(df)

    A   B   C
0   6   0   3
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
To use a vectorial approach, you cannot simply use a diff, as the condition depends on whether the previous value was itself replaced by 0; thus two consecutive diffs cannot both trigger a replacement. You can achieve a correct vectorial replacement using a shifted mask:

m1 = df.diff(axis=1).lt(0)                  # check if < than the previous column
m2 = ~m1.shift(axis=1, fill_value=False)    # and this didn't happen for the previous column too
df2 = df.mask(m1 & m2, 0)

output:

    A   B   C
0   6   0   3
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
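To see the difference from the plain diff mask, take a hypothetical row with two consecutive drops (extra illustration, not from the original answer):

row = pd.DataFrame({'A': [10], 'B': [8], 'C': [6]})
m1 = row.diff(axis=1).lt(0)                 # True for both B and C: each is below its left neighbour
m2 = ~m1.shift(axis=1, fill_value=False)    # exempts C, because B already fired
print(row.mask(m1 & m2, 0))                 # A=10, B=0, C=6 -- C survives, matching the sequential logic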
Doing a group by on one-hot encoded columns in pandas
I have the following dataframe called df. For each sector column (sector_*) I want to do a group by and get the number of unique ids in that sector. A sector column is 1 for a row if that id is part of the sector. How can I do this group by when the columns are one-hot encoded?

id  winner  sector_food  sector_learning  sector_parenting  sector_consumer
1   1       1            0                0                 0
1   0       1            0                0                 0
2   1       0            0                0                 0
2   0       1            0                0                 0
3   1       0            0                0                 1

Expected output:

sector            unique_id
sector_food       2
sector_learning   0
sector_parenting  0
sector_consumer   1
You can do something like this:

out = df.drop(columns=["id", "winner"]).multiply(df["id"], axis=0).nunique().subtract(1)

# sector_food         2
# sector_learning     0
# sector_parenting    0
# sector_consumer     1
# dtype: int64

To get your exact expected output you can add:

out = out.rename_axis("sector").to_frame("unique_id")

#                   unique_id
# sector
# sector_food               2
# sector_learning           0
# sector_parenting          0
# sector_consumer           1
Try this (grouping on the id column; if id is your index, use groupby(level=0) instead):

df.drop('winner', axis=1).groupby('id').sum().gt(0).sum().to_frame('unique_id')
Given

import pandas as pd

ids = [1, 1, 2, 2, 3]
winner = [1, 0, 1, 0, 1]
sector_food = [1, 1, 0, 1, 0]
sector_learning = [0, 0, 0, 0, 0]
sector_parenting = [0, 0, 0, 0, 0]
sector_consumer = [0, 0, 0, 0, 1]

df = pd.DataFrame({
    'id': ids,
    'winner': winner,
    'sector_food': sector_food,
    'sector_learning': sector_learning,
    'sector_parenting': sector_parenting,
    'sector_consumer': sector_consumer
})
print(df)

output

   id  winner  sector_food  sector_learning  sector_parenting  sector_consumer
0   1       1            1                0                 0                0
1   1       0            1                0                 0                0
2   2       1            0                0                 0                0
3   2       0            1                0                 0                0
4   3       1            0                0                 0                1

You can do

_df = (df
       # drop unused cols
       .drop('winner', axis=1)
       # melt with 'id' as index
       .melt(id_vars='id')
       # drop all duplicates
       .drop_duplicates(['id', 'variable', 'value'])
       # sum unique values
       .groupby('variable').value.sum()
)
print(_df)

output

variable
sector_consumer     1
sector_food         2
sector_learning     0
sector_parenting    0
Name: value, dtype: int64
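For completeness, a compact alternative along the same lines (my own sketch, not one of the answers above): take the per-id maximum of each sector flag, then sum over ids.

out = (df.drop(columns='winner')
         .groupby('id').max()      # 1 if the id appears in the sector at all
         .sum()                    # number of distinct ids per sector
         .rename_axis('sector')
         .to_frame('unique_id'))
print(out)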
How to remove rows from a DataFrame as the result of a groupby query?
I have this Pandas dataframe:

df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
                   'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})

#   site  day  hour  clicks
# 0    a    1     1     100
# 1    a    1     2     200
# 2    a    1     3      50
# 3    b    1     1       0
# 4    b    1     2       0
# 5    b    1     3       0
# 6    a    2     1      10
# 7    a    2     2       0
# 8    a    2     3      20

And I want to remove all rows for a site/day where there were 0 clicks in total. So in the example above, I would want to remove the rows with site='b' and day=1. I can group the rows and show where the sum is 0 for a day/site:

print(df.groupby(['site', 'day'])['clicks'].sum() == 0)

But what would be a straightforward way to remove the matching rows from the original dataframe? The solution I have so far iterates over the groups, saves all site/day tuples in a list, and then separately removes all rows with those site/day combinations. That works, but I am sure there must be a more functional and elegant way to achieve this result.
Option 1: using groupby, transform and boolean indexing:

df[df.groupby(['site', 'day'])['clicks'].transform('sum') != 0]

Output:

  site  day  hour  clicks
0    a    1     1     100
1    a    1     2     200
2    a    1     3      50
6    a    2     1      10
7    a    2     2       0
8    a    2     3      20

Option 2: using groupby and filter:

df.groupby(['site', 'day']).filter(lambda x: x['clicks'].sum() != 0)

Output: same as above.
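If you prefer your original idea of first collecting the zero-click site/day pairs, that can also be done without an explicit loop (a sketch of that variant, not from the answer above):

zero_sum = df.groupby(['site', 'day'])['clicks'].sum().eq(0)
bad_pairs = zero_sum[zero_sum].index                            # MultiIndex of (site, day) pairs with 0 clicks
out = df[~df.set_index(['site', 'day']).index.isin(bad_pairs)]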
Creating a string from pandas column and row data
I am interested in generating a string composed of pandas row and column data. Given the following data frame, I only want to build the string from columns with positive values:

index  A  B  C
1      0  1  2
2      0  0  3
3      0  0  0
4      1  0  0

I would like to create a new column that appends a string listing which columns in the row were positive. Then I would drop all of the rows that the data came from:

index  Positives
1      B-1, C-2
2      C-3
4      A-1
Here is one way using pd.DataFrame.apply + pd.Series.apply:

df = pd.DataFrame([[1, 0, 1, 2], [2, 0, 0, 3], [3, 0, 0, 0], [4, 1, 0, 0]],
                  columns=['index', 'A', 'B', 'C'])

def formatter(x):
    x = x[x > 0]
    return (x.index[1:].astype(str) + '-' + x[1:].astype(str))

df['Positives'] = df.apply(formatter, axis=1).apply(', '.join)
print(df)

   index  A  B  C  Positives
0      1  0  1  2   B-1, C-2
1      2  0  0  3        C-3
2      3  0  0  0
3      4  1  0  0        A-1

If you need to filter out zero-length strings, you can use the fact that empty strings evaluate to False with bool:

res = df[df['Positives'].astype(bool)]
print(res)

   index  A  B  C  Positives
0      1  0  1  2   B-1, C-2
1      2  0  0  3        C-3
3      4  1  0  0        A-1
I'd replace the zeros with np.nan to remove the things you don't care about, and stack. Then form the strings you want and groupby.apply(list):

import numpy as np

df = df.set_index('index')  # if 'index' is not your index
stacked = df.replace(0, np.nan).stack().reset_index()
stacked['Positives'] = stacked['level_1'] + '-' + stacked[0].astype(int).astype('str')
stacked = stacked.groupby('index').Positives.apply(list).reset_index()

stacked is now:

   index   Positives
0      1  [B-1, C-2]
1      2       [C-3]
2      4       [A-1]

Or if you just want one string and not a list, change the last line:

stacked.groupby('index').Positives.apply(lambda x: ', '.join(list(x))).reset_index()

#    index  Positives
# 0      1   B-1, C-2
# 1      2        C-3
# 2      4        A-1
pandas: Use if-else to populate new column
I have a DataFrame like this:

col1  col2
1     0
0     1
0     0
0     0
3     3
2     0
0     4

I'd like to add a column that is 1 if col2 > 0, or 0 otherwise. If I were using R, I'd do something like:

df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)

How would I do this in python / pandas?
You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):

df['col3'] = (df.col2 > 0).astype('int')

(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.) This produces col3 as:

   col2  col3
0     0     0
1     1     1
2     0     0
3     0     0
4     3     1
5     0     0
6     4     1

Another way to create the column could be to use np.where, which lets you specify a value for either of the true or false values and is perhaps closer to the syntax of the R function ifelse. For example:

>>> np.where(df['col2'] > 0, 4, -1)
array([-1,  4, -1, -1,  4, -1,  4])
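Applied to this question's exact condition, the same np.where call with 1 and 0 produces the desired column:

import numpy as np

df['col3'] = np.where(df['col2'] > 0, 1, 0)   # 1 where col2 > 0, else 0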
I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.

df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
                    'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1

>>> df1
   col1  col2  col3
0     1     0     0
1     0     1     1
2     0     0     0
3     0     0     0
4     3     3     1
5     2     0     0
6     0     4     1

You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.

df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)