I have the following data frame. For each date and each hour, I want to create a new column "result" such that if the value in column "B" is >= 0, use the value in column A; otherwise use the maximum of 0 and the previous row's value in column B.
Date Hour A B result
1/1/2018 1 5 95 5
1/1/2018 1 16 79 16
1/1/2018 1 85 -6 79
1/1/2018 1 12 -18 0
1/1/2018 2 17 43 17
1/1/2018 2 17 26 17
1/1/2018 2 16 10 16
1/1/2018 2 142 -132 10
1/1/2018 2 10 -142 0
I tried grouping by date and hour and then applying a lambda function using shift, but I got an error:
df['result'] = df.groupby(['Date','Hour']).apply(lambda x: x['A'] if x['B'] >= 0 else np.maximum(0, x['B'].shift(1)), axis = 1)
Use np.where. The groupby is only necessary when shifting "B", so you can vectorise this operation without using apply.
df['result'] = np.where(
df.B >= 0,
df.A,
df.groupby(['Date', 'Hour'])['B'].shift().clip(lower=0))
df
Date Hour A B result
0 1/1/2018 1 5 95 5.0
1 1/1/2018 1 16 79 16.0
2 1/1/2018 1 85 -6 79.0
3 1/1/2018 1 12 -18 0.0
4 1/1/2018 2 17 43 17.0
5 1/1/2018 2 17 26 17.0
6 1/1/2018 2 16 10 16.0
7 1/1/2018 2 142 -132 10.0
8 1/1/2018 2 10 -142 0.0
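For reference, the same approach as a self-contained sketch (the sample frame is rebuilt inline from the values in the question):

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'Date': ['1/1/2018'] * 9,
    'Hour': [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'A': [5, 16, 85, 12, 17, 17, 16, 142, 10],
    'B': [95, 79, -6, -18, 43, 26, 10, -132, -142],
})

# Where B >= 0 take A; otherwise take the previous B within the
# (Date, Hour) group, floored at 0. The first row of a group shifts
# to NaN, but that only surfaces if a group starts with B < 0.
df['result'] = np.where(
    df.B >= 0,
    df.A,
    df.groupby(['Date', 'Hour'])['B'].shift().clip(lower=0),
)
print(df['result'].tolist())
# → [5.0, 16.0, 79.0, 0.0, 17.0, 17.0, 16.0, 10.0, 0.0]
```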
I have a pandas df, like this:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
I would like to calculate the number of days since the last non-zero value, as in the example below. How can I use diff() for this? The calculation restarts for each ID.
Output:
ID date value days_last_value
0 10 2022-01-01 100 0
1 10 2022-01-02 150 1
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200 3
5 10 2022-01-06 0
6 10 2022-01-07 150 2
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100 5
12 23 2022-02-01 490 0
13 23 2022-02-02 0
14 23 2022-02-03 350 2
15 23 2022-02-04 333 1
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211 4
20 23 2022-02-09 100 1
Explanation below.
import pandas as pd
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
ID value days_last_value
0 10 100 0.0
1 10 150 1.0
2 10 0 NaN
3 10 0 NaN
4 10 200 3.0
5 10 0 NaN
6 10 150 2.0
7 10 0 NaN
8 10 0 NaN
9 10 0 NaN
10 10 0 NaN
11 10 100 5.0
12 23 490 0.0
13 23 0 NaN
14 23 350 2.0
15 23 333 1.0
16 23 0 NaN
17 23 0 NaN
18 23 0 NaN
19 23 211 4.0
20 23 100 1.0
First, we'll have to group by 'ID'.
We also create groups for each block of days by building a True/False series where value is not 0, then taking a cumulative sum. That is the (df['value'] != 0).cumsum() part, which results in
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 4
8 4
9 4
10 4
11 5
12 6
13 6
14 7
15 8
16 8
17 8
18 8
19 9
20 10
We can use the values in this series to group on as well; combined with the 'ID' group, that gives the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block, we get its size, which is exactly the interval you want: a block runs from one non-zero row up to (but not including) the next, so its size equals the day difference to the next non-zero value. That size belongs to the *next* block's starting row, so we shift down by one, filling with 0 at the top. This shift has to happen per ID group, so we group by ID again before shifting (the ID grouping is lost after .size()).
Now, this new series needs to be assigned back to the dataframe, but it's shorter, and since its index has also been reset, we can't easily reassign it (not with df['days_last_value'], df.loc[...], or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of the days series to those.
Now it's an easy step to assign the days directly to the relevant column in the dataframe: pandas will match on the index.
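One detail the output above shows: assigning the shorter series leaves NaN in the untouched rows, which forces the column to float. If you'd rather keep integers with proper missing values, pandas' nullable Int64 dtype is one option (a sketch building on the answer's code; the astype call is my addition, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100,
                             490, 0, 350, 333, 0, 0, 0, 211, 100]})

# Size of each non-zero-led block, shifted down within each ID.
days = (df.groupby(['ID', (df['value'] != 0).cumsum()])
          .size()
          .groupby('ID')
          .shift(fill_value=0))
days.index = df.index[df['value'] != 0]

# Nullable Int64 keeps missing rows as <NA> instead of forcing
# the whole column to float.
df['days_last_value'] = days.astype('Int64')
print(df['days_last_value'].dropna().tolist())
# → [0, 1, 3, 2, 5, 0, 2, 1, 4, 1]
```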
I have read a couple of similar posts about this before, but none of the solutions worked for me. I have the following csv:
Score date term
0 72 3 Feb · 1
1 47 1 Feb · 1
2 119 6 Feb · 1
8 101 7 hrs · 1
9 536 11 min · 1
10 53 2 hrs · 1
11 20 11 Feb · 3
3 15 1 hrs · 2
4 33 7 Feb · 1
5 153 4 Feb · 3
6 34 3 min · 2
7 26 3 Feb · 3
I want to sort the csv by date. What's the easiest way to do that?
You can create two helper columns: one with datetimes created by to_datetime and one with timedeltas created by to_timedelta. to_timedelta needs the HH:MM:SS format, so the 'min'/'hrs' strings are first rewritten with Series.replace using regexes. Finally, sort by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times', 'date1'])
print(df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
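For reference, the whole thing as a runnable sketch (the sample frame is rebuilt inline, and the regex patterns are written as raw strings):

```python
import pandas as pd

df = pd.DataFrame({
    'Score': [72, 47, 119, 101, 536, 53, 20, 15, 33, 153, 34, 26],
    'date': ['3 Feb', '1 Feb', '6 Feb', '7 hrs', '11 min', '2 hrs',
             '11 Feb', '1 hrs', '7 Feb', '4 Feb', '3 min', '3 Feb'],
    'term': [1, 1, 1, 1, 1, 1, 3, 2, 1, 3, 2, 3],
})

# Calendar dates parse here; relative times ('min'/'hrs') become NaT.
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')

# Rewrite '3 min' -> '00:3:00' and '1 hrs' -> '1:00:00' so that
# to_timedelta can parse them; calendar dates become NaT instead.
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')

# NaT sorts last, so timedeltas come first, then calendar dates.
df = df.sort_values(['times', 'date1'])
print(df['date'].tolist())
```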
I have a dataframe with a column called savings, which contains positive and negative values. If savings is negative, I want to assign 1 in a new column (called flag_negative); if it is positive, assign 0. The savings column has missing values, which I want to leave as they are.
I would like to use a loop or any other simple method.
My dataframe is named df.
I want to get the following:
Number of rows: 9000
savings flag_negative
100 0
-76 1
1200 0
-
-
-200 1
500 0
I tried a loop and created the new column flag_negative, but I got None for all the rows. Below is my code:
for i in sum['savings']:
    if i > 0:
        sum['flag_negative'] = print(0)
    elif i == " ":
        sum['flag_negative'] = print(" ")
    else:
        sum['flag_negative'] = print(1)
If your dataframe is like this:
savings
0 -4
1 -41
2 174
3 -103
4 -194
5 -160
6 126
7 100
8 -125
9 -71
10 -159
11 -100
12 -30
13 -50
14 83
15 124
16 -123
17 -70
18 -71
19 -29
then you can easily filter on positive/negative and assign to a new column like this:
df.loc[df.savings < 0, 'flag_negative'] = 1
df.loc[df.savings >= 0, 'flag_negative'] = 0
resulting in:
savings flag_negative
0 -4 1.0
1 -41 1.0
2 174 0.0
3 -103 1.0
4 -194 1.0
5 -160 1.0
6 126 0.0
7 100 0.0
8 -125 1.0
9 -71 1.0
10 -159 1.0
11 -100 1.0
12 -30 1.0
13 -50 1.0
14 83 0.0
15 124 0.0
16 -123 1.0
17 -70 1.0
18 -71 1.0
19 -29 1.0
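Note that this also satisfies the requirement about missing values: a NaN savings matches neither condition, so its flag_negative simply stays NaN. The same logic as a single nested np.where expression (a sketch with made-up sample values, not the asker's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'savings': [100, -76, 1200, np.nan, np.nan, -200, 500]})

# 1 where negative, 0 where non-negative, NaN left untouched.
df['flag_negative'] = np.where(df['savings'] < 0, 1.0,
                               np.where(df['savings'] >= 0, 0.0, np.nan))
print(df)
```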
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(25).reshape((5,5)), index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal']=[1,-1,1,-1,1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the values of trading_signal from df1 into df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge, you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on joining a single index to a MultiIndex:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining right for a reason; if you don't understand what this means, please read Brief primer on merge methods (relational algebra).
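A self-contained sketch of the join (the (Date, Time) MultiIndex for df2 is reconstructed by hand to mirror the question's layout):

```python
import numpy as np
import pandas as pd

# df1: daily frame carrying the trading signal.
df1 = pd.DataFrame(np.arange(25).reshape(5, 5),
                   index=pd.date_range('2015-01-01', periods=5, freq='D'))
df1['trading_signal'] = [1, -1, 1, -1, 1]
df1.index.names = ['Date']  # must match the level name in df2

# df2: intraday frame with a (Date, Time) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2015-01-01'), '22:55:00'),
     (pd.Timestamp('2015-01-01'), '23:55:00'),
     (pd.Timestamp('2015-01-02'), '00:55:00'),
     (pd.Timestamp('2015-01-02'), '01:55:00'),
     (pd.Timestamp('2015-01-02'), '02:55:00')],
    names=['Date', 'Time'])
df2 = pd.DataFrame(np.arange(25).reshape(5, 5), index=idx)

# Join df1's single 'Date' index against the 'Date' level of df2;
# how='right' keeps every intraday row of df2.
out = df1[['trading_signal']].join(df2, how='right')
print(out['trading_signal'].tolist())
# → [1, 1, -1, -1, -1]
```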
I have this dataframe
df1_9
date store_nbr item_nbr units station_nbr tavg preciptotal
8 2012-01-01 1 9 29 1 42 0.05
119 2012-01-02 1 9 60 1 41 0.01
...
452 2012-01-05 1 9 16 1 32 0.00
563 2012-01-06 1 9 12 1 36 T
I want to replace the 'T' in the preciptotal column with the value .01.
df1_9.ix[df1_9.preciptotal == 'T', 'preciptotal'] = 0.01
I wrote this code, but for some reason it is not working. I have been staring at this for a while; any help would be appreciated.
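One possible culprit: .ix was deprecated in pandas 0.20 and removed in 1.0, so on a current pandas that line raises an error; .loc is the label-based replacement. A sketch with a made-up sample column (the pd.to_numeric step is my addition, since the column otherwise keeps mixed string/float values):

```python
import pandas as pd

# Hypothetical stand-in for the preciptotal column from the question.
df1_9 = pd.DataFrame({'preciptotal': ['0.05', '0.01', '0.00', 'T']})

# .loc is the modern equivalent of the removed .ix indexer.
df1_9.loc[df1_9['preciptotal'] == 'T', 'preciptotal'] = 0.01

# The column still holds strings alongside the inserted float, so
# convert it to numeric if arithmetic is needed afterwards.
df1_9['preciptotal'] = pd.to_numeric(df1_9['preciptotal'])
print(df1_9['preciptotal'].tolist())
# → [0.05, 0.01, 0.0, 0.01]
```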