Python Pandas: Create Column That Acts As A Conditional Running Variable - python

I'm trying to create a new dataframe column that acts as a running variable that resets to zero or "passes" under certain conditions. Below is a simplified example of what I'm looking to accomplish. Let's say I'm trying to quit drinking coffee and I'm tracking the number of days in a row I've gone without drinking any. On days where I forgot to note whether I drank coffee, I put "forgot", and my tally is left unchanged.
Below is how I'm currently accomplishing this, though I suspect there's a much more efficient way of going about it.
Thanks in advance!
import pandas as pd
Day = [1,2,3,4,5,6,7,8,9,10,11]
DrankCoffee = ['no','no','forgot','yes','no','no','no','no','no','yes','no']
df = pd.DataFrame(list(zip(Day,DrankCoffee)), columns=['Day','DrankCoffee'])
df['Streak'] = 0
s = 0
for (index,row) in df.iterrows():
    if row['DrankCoffee'] == 'no':
        s += 1
    if row['DrankCoffee'] == 'yes':
        s = 0
    else:
        pass
    df.loc[index,'Streak'] = s

You can use groupby.transform.
Within each streak, what you're looking for is something like this:
def my_func(group):
    return (group == 'no').cumsum()
You can separate the streaks with a simple comparison and cumsum:
streak = (df['DrankCoffee'] == 'yes').cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
Then apply the transform:
df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func)
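
Put together, the pieces of this answer might look like the following minimal sketch. The data is the question's example and `my_func` and `streak` are the answer's own names; the final `astype(int)` is an added precaution so the result dtype does not depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'Day': range(1, 12),
    'DrankCoffee': ['no', 'no', 'forgot', 'yes', 'no', 'no',
                    'no', 'no', 'no', 'yes', 'no'],
})

def my_func(group):
    # running count of 'no' days within one group
    return (group == 'no').cumsum()

# every 'yes' bumps the label, so a 'yes' row and the rows after it
# (up to the next 'yes') share one group
streak = (df['DrankCoffee'] == 'yes').cumsum()

df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func).astype(int)
print(df)
```

A 'forgot' row adds nothing to the running count, so the tally carries over unchanged, and each 'yes' row opens a new group whose count starts at 0.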

First map DrankCoffee to 0/1 (as I understand it, 'yes' and 'forgot' should be 0 and 'no' should be 1). Then use a cumsum over the 'yes' rows as the group key, so that every 'yes' starts a new round of counting, and take the cumsum within each group:
df.DrankCoffee.replace({'no':1,'forgot':0,'yes':0}).groupby((df.DrankCoffee=='yes').cumsum()).cumsum()
Out[111]:
0 1
1 2
2 2
3 0
4 1
5 2
6 3
7 4
8 5
9 0
10 1
Name: DrankCoffee, dtype: int64

Use:
df['Streak'] = df.assign(streak=df['DrankCoffee'].eq('no'))\
.groupby(df['DrankCoffee'].eq('yes').cumsum())['streak'].cumsum().astype(int)
Output:
Day DrankCoffee Streak
0 1 no 1
1 2 no 2
2 3 forgot 2
3 4 yes 0
4 5 no 1
5 6 no 2
6 7 no 3
7 8 no 4
8 9 no 5
9 10 yes 0
10 11 no 1
First, create the streak increment: True when the value is 'no'.
Next, build the streak groups: each 'yes' starts a new streak, labelled with cumsum().
Lastly, use cumsum() to count the increments within each streak.
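
The three steps can also be written out one at a time and checked against the loop from the question. This is only a sketch: the names `increment`, `group_key`, `vectorized`, and `expected` are illustrative, and the increment is cast to int before the grouped cumsum.

```python
import pandas as pd

df = pd.DataFrame({
    'Day': range(1, 12),
    'DrankCoffee': ['no', 'no', 'forgot', 'yes', 'no', 'no',
                    'no', 'no', 'no', 'yes', 'no'],
})

# step 1: 1 on days that extend the streak, 0 otherwise
increment = df['DrankCoffee'].eq('no').astype(int)
# step 2: a new group label starts at every 'yes'
group_key = df['DrankCoffee'].eq('yes').cumsum()
# step 3: running total of increments inside each group
vectorized = increment.groupby(group_key).cumsum()

# reference result: the plain-Python loop logic from the question
expected, s = [], 0
for value in df['DrankCoffee']:
    if value == 'no':
        s += 1
    elif value == 'yes':
        s = 0
    expected.append(s)

print(vectorized.tolist() == expected)
```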

Related

Cumulative Sum that resets based on specific condition

Let's say I have the following data:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","First","First","First","Second","Second","Second","Second"],
'Payments':[1,2,3,4,9,3,1,6]})
I want to create a cumulative sum for payments, but it has to reset when flag turns from first to second. Any help?
The output that I'm looking for is the following:
Not sure if this is what you want, since you didn't provide an output, but try this:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","Second","First","Second","First","Second","Second","First"],
'Payments':[1,2,3,4,9,3,1,6]})
# make groups using consecutive Flags
groups = df.Flag.shift().ne(df.Flag).cumsum()
# groupby the groups and cumulatively sum payments
df['cumsum'] = df.groupby(groups).Payments.cumsum()
df
You can use df['Flag'].ne(df['Flag'].shift()).cumsum() to generate a grouper that will group by changes in the Flag column. Then, group by that, and cumsum:
df['cumsum'] = df['Payments'].groupby(df['Flag'].ne(df['Flag'].shift()).cumsum()).cumsum()
Output:
>>> df
Days Flag Payments cumsum
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
What is wrong with
df['Cumulative Payments'] = df.groupby('Flag')['Payments'].cumsum()
Days Flag Payments Cumulative Payments
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
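
Grouping by Flag alone gives the same result as the consecutive-run grouper only when each flag value sits in a single contiguous block, as it does in the question's data. A small sketch with the interleaved flags from the first answer shows where the two diverge (the column names `by_run` and `by_flag` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Days': [1, 2, 3, 4, 1, 2, 3, 4],
                   'Flag': ["First", "Second", "First", "Second",
                            "First", "Second", "Second", "First"],
                   'Payments': [1, 2, 3, 4, 9, 3, 1, 6]})

# label each run of identical consecutive flags
runs = df['Flag'].ne(df['Flag'].shift()).cumsum()

# resets whenever the flag changes
df['by_run'] = df.groupby(runs)['Payments'].cumsum()
# keeps one running total per flag value, no matter where it appears
df['by_flag'] = df.groupby('Flag')['Payments'].cumsum()

print(df)
```

Here `by_run` restarts on every change of flag, while `by_flag` carries the "First" total across the intervening "Second" rows.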

How to apply function to all rows in data frame?

I am confused about how to apply a function to a data frame. When writing user-defined functions, I am used to ending with a single "return" value. In this case, though, I need the returned value to show up in every cell of a data frame column, and I can't figure out how. The function is based on "if" and "else" conditional statements, and I am unsure how to apply it to my data frame. Perhaps I am missing a parenthesis or bracket somewhere, but I am not sure. I will explain below.
I have the following dataframe:
Day No_employee? No_machinery? Production_potential
---------------------------------------------------------------------------
0 Day 1 1 0 5
1 Day 2 1 1 4
2 Day 3 0 1 3
3 Day 4 1 0 8
4 Day 5 0 0 6
5 Day 6 0 1 3
6 Day 7 0 0 5
7 Day 8 1 1 2
...
Now I want to take my dataframe and append a new column called Production_lost, based on the following logic:
In a factory, to manufacture products, you need both 1) an employee present, and 2) functioning machinery. If you cannot produce any product, then that potential product becomes lost product.
For each day (thinking about a factory), if No_employee? is true ( = 1), then no products can be made, regardless of No_machinery? and Production_lost = Production_potential. If No_machinery? is true ( = 1), then no products can be made, regardless of No_employee?, and Production_lost = Production_potential. Only if No_employee? and No_machinery? both = 0, will Production_lost = 0. If you have both an employee present and functioning machinery, there will be no production loss.
So I have the following code:
df['Production_loss'] = df['No_employee?'].apply(lambda x: df['Production_potential'] if x == 1.0 else df['Production_potential'] * df['No_machinery?'])
which produces the following error message:
ValueError: Wrong number of items passed 70, placement implies 1
I understand this means that there are too many arguments being applied to a single column (I think), but I am not sure how to address this, or how I might have reached this problem. Is there a simple fix to this?
The dataframe I am trying to produce would look like this:
Day No_employee? No_machinery? Production_potential Production_lost
-----------------------------------------------------------------------------------------------
0 Day 1 1 0 5 5
1 Day 2 1 1 4 4
2 Day 3 0 1 3 3
3 Day 4 1 0 8 8
4 Day 5 0 0 6 0
5 Day 6 0 1 3 3
6 Day 7 0 0 5 0
7 Day 8 1 1 2 2
...
Use numpy.where:
import numpy as np
df['Production_lost'] = np.where(((df['No_employee?'] == 1) | (df['No_machinery?'] == 1)),
                                 df['Production_potential'], 0)
Day No_employee? No_machinery? Production_potential Production_lost
0 Day 1 1 0 5 5
1 Day 2 1 1 4 4
2 Day 3 0 1 3 3
3 Day 4 1 0 8 8
4 Day 5 0 0 6 0
5 Day 6 0 1 3 3
6 Day 7 0 0 5 0
7 Day 8 1 1 2 2
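
For comparison, here is a self-contained sketch on the question's eight sample rows. It shows the vectorized form next to a row-wise `apply` with `axis=1`, which is closer to what the question attempted. The helper name `lost` and the variables `blocked` and `row_wise` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Day': [f'Day {i}' for i in range(1, 9)],
    'No_employee?': [1, 1, 0, 1, 0, 0, 0, 1],
    'No_machinery?': [0, 1, 1, 0, 0, 1, 0, 1],
    'Production_potential': [5, 4, 3, 8, 6, 3, 5, 2],
})

# vectorized: one boolean mask for the whole column
blocked = (df['No_employee?'] == 1) | (df['No_machinery?'] == 1)
df['Production_lost'] = np.where(blocked, df['Production_potential'], 0)

# row-wise: the function receives one row at a time and returns a scalar
def lost(row):
    if row['No_employee?'] == 1 or row['No_machinery?'] == 1:
        return row['Production_potential']
    return 0

row_wise = df.apply(lost, axis=1)

print(df['Production_lost'].tolist())  # [5, 4, 3, 8, 0, 3, 0, 2]
```

The original error arose because the lambda returned the whole `df['Production_potential']` column for every element instead of a single value; with `axis=1` the function sees one row and returns one number.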
No need to use apply, use pd.Series.where instead:
df['Production_loss'] = df['Production_potential'].where(df['No_employee?'].eq(1), df['Production_potential'] * df['No_machinery?'])
You can also use multiplication. The factor below is 0 only when both flags are 0, and 1 otherwise:
df['Production_loss'] = (1 - (1 - df['No_employee?']) * (1 - df['No_machinery?'])) * df['Production_potential']

"Drop random rows" from pandas dataframe

In a pandas dataframe, how can I drop a random subset of rows that obey a condition?
In other words, if I have a Pandas dataframe with a Label column, I'd like to drop 50% (or some other percentage) of rows where Label == 1, but keep all of the rest:
Label   A     ->     Label   A
0       1            0       1
0       2            0       2
0       3            0       3
1       10           1       11
1       11           1       12
1       12
1       13
I'd love to know the simplest and most pythonic/panda-ish way of doing this!
Edit: This question provides part of an answer, but it only talks about dropping rows by index, disregarding the row values. I'd still like to know how to drop only from rows that are labeled a certain way.
Use the frac argument
df.sample(frac=.5)
If you define the amount you want to drop in a variable n
n = .5
df.sample(frac=1 - n)
To include the condition, use drop
df.drop(df.query('Label == 1').sample(frac=.5).index)
Label A
0 0 1
1 0 2
2 0 3
4 1 11
6 1 13
Using drop with sample
df.drop(df[df.Label.eq(1)].sample(2).index)
Label A
0 0 1
1 0 2
2 0 3
3 1 10
5 1 12
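
Because `sample` is random, the rows that survive differ from run to run. A minimal sketch with a fixed seed, assuming the question's seven-row frame, makes the result repeatable; `random_state=0` is an arbitrary choice and `to_drop` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Label': [0, 0, 0, 1, 1, 1, 1],
                   'A': [1, 2, 3, 10, 11, 12, 13]})

# pick half of the Label == 1 rows, then drop them by index
to_drop = df[df['Label'] == 1].sample(frac=0.5, random_state=0).index
result = df.drop(to_drop)

print(result)
```

All rows with Label 0 are kept, and two of the four rows with Label 1 remain.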

Python random sampling in multiple indices

I have a data frame according to below:
id_1 id_2 value
1 0 1
1 1 2
1 2 3
2 0 4
2 1 1
3 0 5
3 1 1
4 0 5
4 1 1
4 2 6
4 3 7
11 0 8
11 1 14
13 0 10
13 1 9
I would like to take out a random sample of size n, without replacement, from this table based on id_1. This row needs to be unique with respect to the id_1 column and can only occur once.
End result something like:
id_1 id_2 value
1 1 2
2 0 4
4 3 7
13 0 10
I have tried to do a group by and use the indices to take out a row through random.sample, but it doesn't go all the way.
Can someone give me a pointer on how to make this work? Code for DF below!
As always, thanks for time and input!
/swepab
df = pd.DataFrame({'id_1' : [1,1,1,2,2,3,3,4,4,4,4,11,11,13,13],
'id_2' : [0,1,2,0,1,0,1,0,1,2,3,0,1,0,1],
'value_col' : [1,2,3,4,1,5,1,5,1,6,7,8,14,10,9]})
You can do this with vectorized functions (no loops):
import numpy as np
uniqued = df.id_1.reindex(np.random.permutation(df.index)).drop_duplicates()
df.loc[np.random.choice(uniqued.index, 1, replace=False)]  # .ix was removed in pandas 1.0; replace 1 with the sample size n
uniqued is created by randomly shuffling the rows and keeping one row per id_1. A random sample (without replacement) is then drawn from its index.
This samples one random row per id:
for id in sorted(set(df["id_1"])):
    print(df[df["id_1"] == id].sample(1))
PS:
The same solution as a Python list comprehension, returning a list of indices:
idx = [df[df["id_1"] == val].sample(1).index[0] for val in sorted(set(df["id_1"]))]
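
To get a sample of n rows in which no id_1 repeats, one possible sketch is to shuffle the frame, keep the first row seen for each id_1, and then sample n of those rows. The helper name `sample_unique_ids` and its `seed` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'id_1': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 11, 11, 13, 13],
                   'id_2': [0, 1, 2, 0, 1, 0, 1, 0, 1, 2, 3, 0, 1, 0, 1],
                   'value_col': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8, 14, 10, 9]})

def sample_unique_ids(frame, n, seed=None):
    # shuffle all rows so the "first" row per id_1 is a random one
    shuffled = frame.sample(frac=1, random_state=seed)
    # keep exactly one row for each id_1
    one_per_id = shuffled.drop_duplicates('id_1')
    # draw n of those rows without replacement
    return one_per_id.sample(n, random_state=seed).sort_index()

picked = sample_unique_ids(df, n=4, seed=0)
print(picked)
```

`sample` raises a ValueError if n exceeds the number of distinct id_1 values, since sampling is without replacement by default.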

computing sum of pandas dataframes

I have two dataframes that I want to add bin-wise. That is, given
dfc1 = pd.DataFrame(list(zip(range(10),np.zeros(10))), columns=['bin', 'count'])
dfc2 = pd.DataFrame(list(zip(range(0,10,2), np.ones(5))), columns=['bin', 'count'])
which gives me this
dfc1:
bin count
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
dfc2:
bin count
0 0 1
1 2 1
2 4 1
3 6 1
4 8 1
I want to generate this:
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
where I've added the count columns where the bin columns matched.
In fact, it turns out that I only ever add 1 (that is, count in dfc2 is always 1). So an alternate version of the question is "given an array of bin values (dfc2.bin), how can I add one to each of their corresponding count values in dfc1?"
My only solution thus far feels grossly inefficient (and slightly unreadable in the end): doing an outer join between the two bin columns, thus creating a third dataframe on which I do a computation and then project out the unneeded column.
Suggestions?
First set bin as the index in both dataframes; then you can use add. fill_value is needed so that zero is used wherever a bin is missing from one of the dataframes:
dfc1 = dfc1.set_index('bin')
dfc2 = dfc2.set_index('bin')
result = pd.DataFrame.add(dfc1, dfc2, fill_value=0)
Pandas automatically sums up rows with equal index.
By the way, if you need to perform such operation frequently, I strongly recommend using numpy.bincount, which allows even repeating the bin index inside one dataframe
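
A minimal sketch of the numpy.bincount idea follows. It assumes that row i of dfc1 holds bin i, which is true for the question's data. The names `bins_to_increment` and `increments` are illustrative; the array stands in for dfc2.bin.

```python
import numpy as np
import pandas as pd

dfc1 = pd.DataFrame({'bin': range(10), 'count': np.zeros(10)})
bins_to_increment = np.arange(0, 10, 2)  # stands in for dfc2.bin

# bincount tallies how often each bin index occurs;
# minlength pads the result so it covers every bin in dfc1
increments = np.bincount(bins_to_increment, minlength=len(dfc1))
dfc1['count'] = dfc1['count'] + increments

print(dfc1)
```

If a bin index appears several times in `bins_to_increment`, bincount adds one for each occurrence.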
Since the dfc1 index is the same as your "bin" value, you could simply do the following:
dfc1.loc[dfc2.bin, 'cnt'] += 1
Notice that I renamed your "count" column to "cnt", since count is also a DataFrame method, so dfc1.count refers to the method rather than the column, which can cause confusion and errors!
As an alternative to @Alleo's answer, you can use the combineAdd method to add the two dataframes together after set_index, provided their indexes match on bin (combineAdd was later deprecated and removed from pandas; on current versions, add with fill_value=0 does the same job):
dfc1.set_index('bin').combineAdd(dfc2.set_index('bin')).reset_index()
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
