How to apply a function to all rows in a data frame? - python

I am confused about how to apply a function to a data frame. When I write a user-defined function, I normally end with a single "return" value. In this case, however, I need the returned value to fill every cell of a data frame column, and I can't figure out how to do that. The function is based on "if" / "else" conditional statements, and I am unsure how to apply it to my data frame. Perhaps I am missing a parenthesis or bracket somewhere, but I am not sure. I explain the details below.
I have the following dataframe:
     Day  No_employee?  No_machinery?  Production_potential
-----------------------------------------------------------
0  Day 1             1              0                     5
1  Day 2             1              1                     4
2  Day 3             0              1                     3
3  Day 4             1              0                     8
4  Day 5             0              0                     6
5  Day 6             0              1                     3
6  Day 7             0              0                     5
7  Day 8             1              1                     2
...
Now I want to take my dataframe and append a new column called Production_lost, based on the following logic:
In a factory, to manufacture products, you need both 1) an employee present, and 2) functioning machinery. If you cannot produce any product, then that potential product becomes lost product.
For each day (thinking about a factory), if No_employee? is true ( = 1), then no products can be made, regardless of No_machinery? and Production_lost = Production_potential. If No_machinery? is true ( = 1), then no products can be made, regardless of No_employee?, and Production_lost = Production_potential. Only if No_employee? and No_machinery? both = 0, will Production_lost = 0. If you have both an employee present and functioning machinery, there will be no production loss.
So I have the following code:
df['Production_loss'] = df['No_employee?'].apply(lambda x: df['Production_potential'] if x == 1.0 else df['Production_potential'] * df['No_machinery?'])
which produces the following error message:
ValueError: Wrong number of items passed 70, placement implies 1
I understand this means that there are too many arguments being applied to a single column (I think), but I am not sure how to address this, or how I might have reached this problem. Is there a simple fix to this?
The dataframe I am trying to produce would look like this:
     Day  No_employee?  No_machinery?  Production_potential  Production_lost
----------------------------------------------------------------------------
0  Day 1             1              0                     5                5
1  Day 2             1              1                     4                4
2  Day 3             0              1                     3                3
3  Day 4             1              0                     8                8
4  Day 5             0              0                     6                0
5  Day 6             0              1                     3                3
6  Day 7             0              0                     5                0
7  Day 8             1              1                     2                2
...

Use numpy.where:
import numpy as np
df['Production_lost'] = np.where((df['No_employee?'] == 1) | (df['No_machinery?'] == 1),
                                 df['Production_potential'], 0)
     Day  No_employee?  No_machinery?  Production_potential  Production_lost
0  Day 1             1              0                     5                5
1  Day 2             1              1                     4                4
2  Day 3             0              1                     3                3
3  Day 4             1              0                     8                8
4  Day 5             0              0                     6                0
5  Day 6             0              1                     3                3
6  Day 7             0              0                     5                0
7  Day 8             1              1                     2                2

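No need to use apply; use pd.Series.where instead:
df['Production_lost'] = df['Production_potential'].where(df['No_employee?'].eq(1), df['Production_potential'] * df['No_machinery?'])
You can also combine the two 0/1 flags with a bitwise OR and multiply. (Multiplying the flags and negating with ~ does not work here: the product is 1 only when both flags are 1, and ~ on an integer column is a bitwise NOT, so ~0 is -1.)
df['Production_lost'] = (df['No_employee?'] | df['No_machinery?']) * df['Production_potential']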
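If a row-wise function is still preferred, a minimal sketch is DataFrame.apply with axis=1, which passes each row to the function so that it can return one scalar per row. The function name production_lost and the rebuilt sample frame are illustrative assumptions based on the table in the question; the result is compared with the numpy.where version.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    'Day': [f'Day {i}' for i in range(1, 9)],
    'No_employee?': [1, 1, 0, 1, 0, 0, 0, 1],
    'No_machinery?': [0, 1, 1, 0, 0, 1, 0, 1],
    'Production_potential': [5, 4, 3, 8, 6, 3, 5, 2],
})

def production_lost(row):
    # Hypothetical helper: production is lost if either resource is missing
    if row['No_employee?'] == 1 or row['No_machinery?'] == 1:
        return row['Production_potential']
    return 0

# axis=1 calls the function once per row and collects one scalar per row
df['Production_lost'] = df.apply(production_lost, axis=1)

# Vectorized equivalent for comparison
vectorized = np.where(
    (df['No_employee?'] == 1) | (df['No_machinery?'] == 1),
    df['Production_potential'], 0)

print(df)
```

The original attempt failed because the lambda returned the whole Production_potential column for every element instead of a single value. The row-wise version avoids that, though the vectorized forms are usually faster on large frames.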

Related

How to find the last line and the diff of each line

I am trying to handle the following dataframe
df = pd.DataFrame({'ID':[1,1,2,2,3,3,3,4,4,4,4],
'sum':[1,2,1,2,1,2,3,1,2,3,4,]})
For each ID, I want the difference between every row and the last row of that ID's group.
Specifically, I tried this code.
df['diff'] = df.groupby('ID')['sum'].diff(-1)
df
However, this gives the difference from the next row, not from the last row of the group.
Is there any way to compute the difference from the last row of each group with groupby?
Thank you for your help.
You can use transform('last') to get the last value per group:
df['diff'] = df['sum'].sub(df.groupby('ID')['sum'].transform('last'))
or using groupby.apply (group_keys=False keeps the original index so the result can be assigned back):
df['diff'] = df.groupby('ID', group_keys=False)['sum'].apply(lambda x: x - x.iloc[-1])
output:
ID sum diff
0 1 1 -1
1 1 2 0
2 2 1 -1
3 2 2 0
4 3 1 -2
5 3 2 -1
6 3 3 0
7 4 1 -3
8 4 2 -2
9 4 3 -1
10 4 4 0
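As a self-contained sketch of the transform('last') approach, the snippet below rebuilds the question's data and makes the intermediate step explicit. The variable name last_per_id is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
                   'sum': [1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 4]})

# transform('last') broadcasts each group's last value back to every row
last_per_id = df.groupby('ID')['sum'].transform('last')

# Subtracting gives the distance of each row from its group's last row
df['diff'] = df['sum'] - last_per_id

print(df)
```

Because transform returns a Series aligned with the original index, the subtraction needs no merge or re-indexing.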

Cumulative Sum that resets based on specific condition

Let's say I have the following data:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","First","First","First","Second","Second","Second","Second"],
'Payments':[1,2,3,4,9,3,1,6]})
I want to create a cumulative sum for payments, but it has to reset when flag turns from first to second. Any help?
The output that I'm looking for is the following:
Not sure if this is what you want, since you didn't provide an output, but try this:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","Second","First","Second","First","Second","Second","First"],
'Payments':[1,2,3,4,9,3,1,6]})
# make groups using consecutive Flags
groups = df.Flag.shift().ne(df.Flag).cumsum()
# groupby the groups and cumulatively sum payments
df['cumsum'] = df.groupby(groups).Payments.cumsum()
df
You can use df['Flag'].ne(df['Flag'].shift()).cumsum() to generate a grouper that will group by changes in the Flag column. Then, group by that, and cumsum:
df['cumsum'] = df['Payments'].groupby(df['Flag'].ne(df['Flag'].shift()).cumsum()).cumsum()
Output:
>>> df
Days Flag Payments cumsum
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
What is wrong with a plain groupby on Flag?
df['Cumulative Payments'] = df.groupby('Flag')['Payments'].cumsum()
Days Flag Payments Cumulative Payments
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
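The two approaches agree on the question's data because each flag value appears in one contiguous block. A small sketch with made-up data (an assumption, not from the question) shows where they diverge: when a flag value reappears later, grouping by value continues the old total, while grouping by consecutive runs resets it.

```python
import pandas as pd

# Hypothetical data in which 'First' reappears after 'Second'
df = pd.DataFrame({'Flag': ['First', 'First', 'Second', 'Second', 'First', 'First'],
                   'Payments': [1, 2, 9, 3, 4, 6]})

# Grouping by value: both 'First' blocks share one running total
by_value = df.groupby('Flag')['Payments'].cumsum()

# Grouping by run: a new group id starts whenever Flag changes
runs = df['Flag'].ne(df['Flag'].shift()).cumsum()
by_run = df['Payments'].groupby(runs).cumsum()

print(pd.DataFrame({'Flag': df['Flag'], 'by_value': by_value, 'by_run': by_run}))
```

Which one is right depends on whether "reset when the flag changes" or "one total per flag" is the intended meaning.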

Create flag if customer ordered in the next month / If total of a column for the next weeks is more than 1

I have a dataframe like so:
  CUSTOMER  WEEK NO  ORDERS
0      Ann        4       1
1      Ann        6       3
2     John        1       1
3     John        7       2
I'd like to add a Flag column that indicates whether the customer made an order in the next month, i.e. within the next 4 weeks. My ideal output would be something like this:
  CUSTOMER  WEEK NO  ORDERS  FLAG
0      Ann        4       1     1
1      Ann        6       3     0
2     John        1       1     0
3     John        7       2     0
I looked at some examples on this site and derived this code. It seems to work on a few accounts, but when I apply it to the whole dataframe, everything is flagged as 1. I'm not sure why. I even added the Customer condition, but again, when applied to the whole dataframe it doesn't work:
df['Flag'] = df.apply(lambda x: 1 if df.loc[(df.Week_No >= x.Week_No) &
(df.Week_No <= x.Week_No+4) & (x.Customer==df.Customer), 'total_orders'].sum()>=1
else 0, axis=1)
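Group the dataframe by CUSTOMER, take the differences between consecutive WEEK NO values, and shift the result back one row so that each row holds the gap to that customer's next order. The mask is True where that gap is <= 4; the last row of each customer has no next order, so its gap is NaN and the mask stays False. This assumes the rows are sorted by customer and week.
mask = df.groupby('CUSTOMER')['WEEK NO'].diff().shift(-1).le(4)
df['FLAG'] = mask.astype(int)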
Output:
  CUSTOMER  WEEK NO  ORDERS  FLAG
0      Ann        4       1     1
1      Ann        6       3     0
2     John        1       1     0
3     John        7       2     0
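One likely reason the original apply flagged every row is that the window started at the current week (>=), so each row counted its own order. A hedged sketch of a row-wise version with a strict lower bound follows; the helper name ordered_within and its weeks parameter are illustrative assumptions, and the column names follow the table in the question rather than the asker's code.

```python
import pandas as pd

df = pd.DataFrame({'CUSTOMER': ['Ann', 'Ann', 'John', 'John'],
                   'WEEK NO': [4, 6, 1, 7],
                   'ORDERS': [1, 3, 1, 2]})

def ordered_within(row, weeks=4):
    # Hypothetical helper: look only at strictly later weeks of the same customer
    later = df[(df['CUSTOMER'] == row['CUSTOMER'])
               & (df['WEEK NO'] > row['WEEK NO'])
               & (df['WEEK NO'] <= row['WEEK NO'] + weeks)]
    return int(later['ORDERS'].sum() >= 1)

df['FLAG'] = df.apply(ordered_within, axis=1)

print(df)
```

This scans the frame once per row, so it is slower than a groupby-based mask, but it does not depend on row order and it considers every later order inside the window, not only the next one.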

check if value is changed between group of transactions

Given this df:
df = pd.DataFrame({"A":[1,0,0,0,0,1,0,1,0,0,1],'B':['enters','A','B','C','D','exit','walk','enters','Q','Q','exit'],"Value":[4,4,4,4,5,6,6,6,6,6,6]})
A B Value
0 1 enters 4
1 0 A 4
2 0 B 4
3 0 C 4
4 0 D 5
5 1 exit 6
6 0 walk 6
7 1 enters 6
8 0 Q 6
9 0 Q 6
10 1 exit 6
There are 2 'transactions' here, each running from when someone enters until they exit. So tx#1 spans rows 0 to 5 and tx#2 spans rows 7 to 10.
My goal is to show whether the value changed within each transaction. In tx#1 the value changed from 4 to 6, and in tx#2 there was no change. Expected result:
index tx value_before value_after
0 1 4 6
7 2 6 6
I tried to fill the 0s between the start and end of each tx with 1 and then group, but then the whole A column becomes 1. I'm not sure how to define the groupby so that each tx stands on its own.
Assign a new transaction number on each new 'enter' and pivot:
df['tx'] = np.where(df.B.eq('enters'),1,0).cumsum()
df[df.B.isin(['enters','exit'])].pivot(index='tx', columns='B', values='Value')
Result:
B enters exit
tx
1 4 6
2 6 6
Not exactly what you want but it has all the info
df[df['B'].isin(['enters', 'exit'])].drop(['A'], axis=1).reset_index()
index B Value
0 0 enters 4
1 5 exit 6
2 7 enters 6
3 10 exit 6
You can get what you need with cumsum(), and pivot_table():
df['tx'] = np.where(df['B']=='enters',1,0).cumsum()
res = pd.pivot_table(df[df['B'].isin(['enters','exit'])],
index=['tx'],columns=['B'],values='Value'
).reset_index().rename(columns ={'enters':'value_before','exit':'value_after'})
Which prints:
res
tx value_before value_after
0 1 4 6
1 2 6 6
If you always have an "enters - exit" sequence, you can create a new dataframe and assign the values to each column directly:
value_before = df['Value'].loc[df['B'] == 'enters']
result = pd.DataFrame({'tx': [x + 1 for x in range(len(value_before))],
                       'value_before': value_before,
                       'value_after': list(df['Value'].loc[df['B'] == 'exit'])})
Output:
tx value_before value_after
0 1 4 6
7 2 6 6
You can add 'reset_index(drop=True)' at the end if you don't want to see an index from the original dataframe.
I added 'list' for 'value_after' to get right concatenation.
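Putting the cumsum-and-pivot idea together, here is a self-contained sketch that also answers the original "was the value changed?" question directly. The changed column is an illustrative addition, not part of the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
                   "B": ['enters', 'A', 'B', 'C', 'D', 'exit', 'walk',
                         'enters', 'Q', 'Q', 'exit'],
                   "Value": [4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6]})

# Each 'enters' row starts a new transaction number
df['tx'] = df['B'].eq('enters').cumsum()

# Keep only the boundary rows and spread them into one row per transaction
endpoints = df[df['B'].isin(['enters', 'exit'])]
res = (endpoints.pivot(index='tx', columns='B', values='Value')
       .rename(columns={'enters': 'value_before', 'exit': 'value_after'})
       .reset_index())
res.columns.name = None

# Hypothetical extra column: True where the value differs across the transaction
res['changed'] = res['value_before'].ne(res['value_after'])

print(res)
```

This relies on every transaction having exactly one 'enters' and one 'exit' row; pivot raises an error if a transaction contains duplicates of either.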

Python Pandas: Create Column That Acts As A Conditional Running Variable

I'm trying to create a new dataframe column that acts as a running variable that resets to zero or "passes" under certain conditions. Below is a simplified example of what I'm looking to accomplish. Let's say I'm trying to quit drinking coffee and I'm tracking the number of days in a row I've gone without drinking any. On days where I forgot to note whether I drank coffee, I put "forgot", and my tally is not affected.
Below is how I'm currently accomplishing this, though I suspect there's a much more efficient way of going about it.
Thanks in advance!
import pandas as pd
Day = [1,2,3,4,5,6,7,8,9,10,11]
DrankCoffee = ['no','no','forgot','yes','no','no','no','no','no','yes','no']
df = pd.DataFrame(list(zip(Day,DrankCoffee)), columns=['Day','DrankCoffee'])
df['Streak'] = 0
s = 0
for (index,row) in df.iterrows():
    if row['DrankCoffee'] == 'no':
        s += 1
    if row['DrankCoffee'] == 'yes':
        s = 0
    else:
        pass
    df.loc[index,'Streak'] = s
You can use groupby.transform.
Within each streak, what you're looking for is something like this:
def my_func(group):
    return (group == 'no').cumsum()
You can separate the streaks with a simple comparison and cumsum:
streak = (df['DrankCoffee'] == 'yes').cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
Then apply the transform:
df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func)
First map DrankCoffee to 0/1 (as I understand it, 'yes' and 'forgot' should be 0 and 'no' should be 1). Then build the group key with a cumsum over the 'yes' rows, so that each 'yes' starts a new round, and take the cumulative sum within each group to count the 'no' days:
df.DrankCoffee.map({'no':1,'forgot':0,'yes':0}).groupby((df.DrankCoffee=='yes').cumsum()).cumsum()
Out[111]:
0 1
1 2
2 2
3 0
4 1
5 2
6 3
7 4
8 5
9 0
10 1
Name: DrankCoffee, dtype: int64
Use:
df['Streak'] = df.assign(streak=df['DrankCoffee'].eq('no'))\
.groupby(df['DrankCoffee'].eq('yes').cumsum())['streak'].cumsum().astype(int)
Output:
Day DrankCoffee Streak
0 1 no 1
1 2 no 2
2 3 forgot 2
3 4 yes 0
4 5 no 1
5 6 no 2
6 7 no 3
7 8 no 4
8 9 no 5
9 10 yes 0
10 11 no 1
First, create the streak increment: True where DrankCoffee is 'no'.
Next, build the grouping key: each 'yes' starts a new streak, via cumsum().
Lastly, use cumsum() within each group to count the increments in that streak.
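To check that the vectorized approach matches the original loop, here is a self-contained sketch that computes the streak both ways on the question's data. The names loop_streak, is_no and streak_id are illustrative assumptions.

```python
import pandas as pd

Day = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
DrankCoffee = ['no', 'no', 'forgot', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'no']
df = pd.DataFrame({'Day': Day, 'DrankCoffee': DrankCoffee})

# Reference: the explicit loop from the question
loop_streak = []
s = 0
for value in df['DrankCoffee']:
    if value == 'no':
        s += 1
    elif value == 'yes':
        s = 0
    # 'forgot' leaves the tally unchanged
    loop_streak.append(s)

# Vectorized: count 'no' days inside groups that restart at every 'yes'
is_no = df['DrankCoffee'].eq('no').astype(int)
streak_id = df['DrankCoffee'].eq('yes').cumsum()
df['Streak'] = is_no.groupby(streak_id).cumsum()

print(df)
```

The 'yes' row itself falls at the start of its new group and contributes 0, which is what resets the count.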
