pandas series add with previous row on condition - python
I need to build a running total over a series, adding to the previous row's value only if a condition matches in the current cell. Here's the dataframe:
import pandas as pd
data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to add +1 to the previous running sum if the current value is > 0, else -1. So the result I'm expecting is:
col1 continuous
0 1 1 // +1 as it's non-zero
1 2 2 // +1 as it's non-zero
2 1 3 // +1 as it's non-zero
3 0 2 // -1 as it's zero
4 0 1
5 0 0
6 0 0 // must not go below 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
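In plain Python, the intended logic looks like this (an illustrative sketch, not part of the original question): start at 0, add 1 for a positive value, subtract 1 otherwise, and clamp at 0.

running = 0
continuous = []
for v in df['col1']:
    if v > 0:
        running += 1                    # +1 for a non-zero (positive) value
    else:
        running = max(running - 1, 0)   # -1 for zero, but never below 0
    continuous.append(running)
df['continuous'] = continuous

This reproduces the expected column above.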
Case 2: the same, but instead of > 0 the condition is < -0.1:
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.042161609,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = -0.1                          # case-2 threshold
s = df['col1'].lt(lim)              # condition: current value < -0.1
out = s.where(s, -1).cumsum()       # +1 where the condition holds, -1 otherwise
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)   # re-base so it never drops below 0
print(df)
The key problem here, to me, is keeping out from going below zero. With that in mind, we can mask the output where it is negative and adjust accordingly:
# a little longer data to cover a corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 4]})
s = df.col1.gt(0)                  # condition: col1 > 0
out = s.where(s, -1).cumsum()      # +1/-1 running total, may dip below zero
df['continuous'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
Output:
col1 continuous
0 1 1
1 2 2
2 1 3
3 0 2
4 0 1
5 0 0
6 0 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
12 0 0
13 0 0
14 0 0
15 2 1
16 3 2
17 4 3
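To see why the re-basing line works, split it into pieces (an illustrative breakdown; floor is a hypothetical name, not in the original answer):

s = df.col1.gt(0)                 # condition per row
out = s.where(s, -1).cumsum()     # unclamped running total; reaches -3 around rows 12-14
floor = out.where((out < 0) & (~s)).ffill().fillna(0)
# floor carries forward the most recent sub-zero level that out reached while the
# condition was False; subtracting it re-bases the total so it bottoms out at 0
df['continuous'] = out - floor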
You can do this using cumsum on booleans.
Give me a +1 whenever col1 is not zero:
(df.col1 != 0).cumsum()
Give me a -1 whenever col1 is zero:
-(df.col1 == 0).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0).cumsum() - (df.col1 == 0).cumsum()
However, this does not satisfy the not-going-below-zero criterion you mentioned.
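If you also need the floor at zero, one option (my own sketch, not part of this answer) is itertools.accumulate with a clamped step:

from itertools import accumulate

steps = [1 if v != 0 else -1 for v in df.col1]   # +1 per non-zero row, -1 per zero row
acc = accumulate(steps, lambda total, step: max(total + step, 0), initial=0)  # initial= needs Python 3.8+
df['continuous'] = list(acc)[1:]                 # drop the seed value 0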
Related
Compare two rows with a for loop in Pandas
I have the following dataframe, where I want to determine if column A is greater than column B, and if column B is greater than column C. In case a value is smaller than the one in the previous column, I want to change that value to 0.

d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9]}
df = pd.DataFrame(data=d)
df

I have tried this with np.where and it is working:

df['B'] = np.where(df['A'] > df['B'], 0, df['B'])
df['C'] = np.where(df['B'] > df['C'], 0, df['C'])

However, I have a huge number of columns and I want to know if there is any way to do this without writing each comparison separately, for example with a for loop. Thanks
A solution with different output, because the original columns are compared with DataFrame.diff and values less than 0 are set to 0 by DataFrame.mask:

df1 = df.mask(df.diff(axis=1).lt(0), 0)
print(df1)

    A   B   C
0   6   0   0
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9

If a loop with zip over shifted column names is used instead, the output is different, because the already reassigned columns B, C, ... are compared:

for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print(df)

    A   B   C
0   6   0   3
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
To use a vectorial approach, you cannot simply use a diff, as the condition depends on whether the previous value was itself replaced by 0; thus two consecutive diffs cannot both trigger a replacement. You can achieve a correct vectorial replacement using a shifted mask:

m1 = df.diff(axis=1).lt(0)                  # check if < than the previous column
m2 = ~m1.shift(axis=1, fill_value=False)    # and this didn't happen for the previous column too
df2 = df.mask(m1 & m2, 0)

output:

    A   B   C
0   6   0   3
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
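To see the difference from the plain diff mask, take a hypothetical row with two consecutive drops (extra illustration, not from the original answer):

row = pd.DataFrame({'A': [10], 'B': [8], 'C': [6]})
m1 = row.diff(axis=1).lt(0)                 # True for both B and C: each is below its left neighbour
m2 = ~m1.shift(axis=1, fill_value=False)    # exempts C, because B already fired
print(row.mask(m1 & m2, 0))                 # A=10, B=0, C=6 -- C survives, matching the sequential logic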
Doing a group by on one-hot encoded columns in pandas
I have the following dataframe called df. For each sector column (sector_*) I want to do a group by and get the number of unique ids in that sector. A sector column is 1 for a row if that id is part of the sector. How can I do this group by when the columns are one-hot encoded?

id  winner  sector_food  sector_learning  sector_parenting  sector_consumer
1   1       1            0                0                 0
1   0       1            0                0                 0
2   1       0            0                0                 0
2   0       1            0                0                 0
3   1       0            0                0                 1

Expected output:

sector            unique_id
sector_food       2
sector_learning   0
sector_parenting  0
sector_consumer   1
You can do something like this:

out = df.drop(columns=["id", "winner"]).multiply(df["id"], axis=0).nunique().subtract(1)

# sector_food         2
# sector_learning     0
# sector_parenting    0
# sector_consumer     1
# dtype: int64

To get your exact expected output you can add:

out = out.rename_axis("sector").to_frame("unique_id")

#                   unique_id
# sector
# sector_food               2
# sector_learning           0
# sector_parenting          0
# sector_consumer           1
Try this (grouping on the id column; if id is your index, use groupby(level=0) instead):

df.drop('winner', axis=1).groupby('id').sum().gt(0).sum().to_frame('unique_id')
Given

import pandas as pd

ids = [1, 1, 2, 2, 3]
winner = [1, 0, 1, 0, 1]
sector_food = [1, 1, 0, 1, 0]
sector_learning = [0, 0, 0, 0, 0]
sector_parenting = [0, 0, 0, 0, 0]
sector_consumer = [0, 0, 0, 0, 1]

df = pd.DataFrame({
    'id': ids,
    'winner': winner,
    'sector_food': sector_food,
    'sector_learning': sector_learning,
    'sector_parenting': sector_parenting,
    'sector_consumer': sector_consumer
})
print(df)

output

   id  winner  sector_food  sector_learning  sector_parenting  sector_consumer
0   1       1            1                0                 0                0
1   1       0            1                0                 0                0
2   2       1            0                0                 0                0
3   2       0            1                0                 0                0
4   3       1            0                0                 0                1

You can do

_df = (df
       # drop unused cols
       .drop('winner', axis=1)
       # melt with 'id' as index
       .melt(id_vars='id')
       # drop all duplicates
       .drop_duplicates(['id', 'variable', 'value'])
       # sum unique values
       .groupby('variable').value.sum()
)
print(_df)

output

variable
sector_consumer     1
sector_food         2
sector_learning     0
sector_parenting    0
Name: value, dtype: int64
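For completeness, a compact alternative along the same lines (my own sketch, not one of the answers above): take the per-id maximum of each sector flag, then sum over ids.

out = (df.drop(columns='winner')
         .groupby('id').max()      # 1 if the id appears in the sector at all
         .sum()                    # number of distinct ids per sector
         .rename_axis('sector')
         .to_frame('unique_id'))
print(out)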
How to remove rows from a DataFrame as the result of a groupby query?
I have this Pandas dataframe:

df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
                   'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})

#   site  day  hour  clicks
# 0    a    1     1     100
# 1    a    1     2     200
# 2    a    1     3      50
# 3    b    1     1       0
# 4    b    1     2       0
# 5    b    1     3       0
# 6    a    2     1      10
# 7    a    2     2       0
# 8    a    2     3      20

And I want to remove all rows for a site/day where there were 0 clicks in total. So in the example above, I would want to remove the rows with site='b' and day=1. I can group the rows and show where the sum is 0 for a day/site:

print(df.groupby(['site', 'day'])['clicks'].sum() == 0)

But what would be a straightforward way to remove the matching rows from the original dataframe? The solution I have so far iterates over the groups, saves all site/day tuples in a list, and then separately removes all rows with those site/day combinations. That works, but I am sure there must be a more functional and elegant way to achieve this result.
Option 1: using groupby, transform and boolean indexing:

df[df.groupby(['site', 'day'])['clicks'].transform('sum') != 0]

Output:

  site  day  hour  clicks
0    a    1     1     100
1    a    1     2     200
2    a    1     3      50
6    a    2     1      10
7    a    2     2       0
8    a    2     3      20

Option 2: using groupby and filter:

df.groupby(['site', 'day']).filter(lambda x: x['clicks'].sum() != 0)

Output: same as above.
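If you prefer your original idea of first collecting the zero-click site/day pairs, that can also be done without an explicit loop (a sketch of that variant, not from the answer above):

zero_sum = df.groupby(['site', 'day'])['clicks'].sum().eq(0)
bad_pairs = zero_sum[zero_sum].index                            # MultiIndex of (site, day) pairs with 0 clicks
out = df[~df.set_index(['site', 'day']).index.isin(bad_pairs)]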
Creating a string from pandas column and row data
I am interested in generating a string composed of pandas row and column data. Given the following data frame, I only want to build the string from columns with positive values:

index  A  B  C
1      0  1  2
2      0  0  3
3      0  0  0
4      1  0  0

I would like to create a new column that appends a string listing which columns in the row were positive. Then I would drop all of the rows that the data came from:

index  Positives
1      B-1, C-2
2      C-3
4      A-1
Here is one way using pd.DataFrame.apply + pd.Series.apply:

df = pd.DataFrame([[1, 0, 1, 2], [2, 0, 0, 3], [3, 0, 0, 0], [4, 1, 0, 0]],
                  columns=['index', 'A', 'B', 'C'])

def formatter(x):
    x = x[x > 0]
    return (x.index[1:].astype(str) + '-' + x[1:].astype(str))

df['Positives'] = df.apply(formatter, axis=1).apply(', '.join)
print(df)

   index  A  B  C  Positives
0      1  0  1  2   B-1, C-2
1      2  0  0  3        C-3
2      3  0  0  0
3      4  1  0  0        A-1

If you need to filter out zero-length strings, you can use the fact that empty strings evaluate to False with bool:

res = df[df['Positives'].astype(bool)]
print(res)

   index  A  B  C  Positives
0      1  0  1  2   B-1, C-2
1      2  0  0  3        C-3
3      4  1  0  0        A-1
I'd replace the zeros with np.nan to remove the things you don't care about, and stack. Then form the strings you want and groupby.apply(list):

import numpy as np

df = df.set_index('index')  # if 'index' is not your index
stacked = df.replace(0, np.nan).stack().reset_index()
stacked['Positives'] = stacked['level_1'] + '-' + stacked[0].astype(int).astype('str')
stacked = stacked.groupby('index').Positives.apply(list).reset_index()

stacked is now:

   index   Positives
0      1  [B-1, C-2]
1      2       [C-3]
2      4       [A-1]

Or if you just want one string and not a list, change the last line:

stacked.groupby('index').Positives.apply(lambda x: ', '.join(list(x))).reset_index()

#    index  Positives
# 0      1   B-1, C-2
# 1      2        C-3
# 2      4        A-1
pandas: Use if-else to populate new column
I have a DataFrame like this:

col1  col2
1     0
0     1
0     0
0     0
3     3
2     0
0     4

I'd like to add a column that is 1 if col2 > 0, or 0 otherwise. If I were using R, I'd do something like:

df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)

How would I do this in python / pandas?
You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):

df['col3'] = (df.col2 > 0).astype('int')

(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.) This produces col3 as:

   col2  col3
0     0     0
1     1     1
2     0     0
3     0     0
4     3     1
5     0     0
6     4     1

Another way to create the column could be to use np.where, which lets you specify a value for either of the true or false values and is perhaps closer to the syntax of the R function ifelse. For example:

>>> np.where(df['col2'] > 0, 4, -1)
array([-1,  4, -1, -1,  4, -1,  4])
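Applied to this question's exact condition, the same np.where call with 1 and 0 produces the desired column:

import numpy as np

df['col3'] = np.where(df['col2'] > 0, 1, 0)   # 1 where col2 > 0, else 0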
I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.

df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
                    'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1

>>> df1
   col1  col2  col3
0     1     0     0
1     0     1     1
2     0     0     0
3     0     0     0
4     3     3     1
5     2     0     0
6     0     4     1

You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.

df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)