I have the following dataframe (df):
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN IMP_CLR_TIME_BIN
0 -1447310116 23:59:00 00:11:00 47 0
1 1673545041 00:00:00 00:01:00 0 0
2 -743717696 23:59:00 00:00:00 47 0
3 58641876 04:01:00 09:02:00 8 18
I want to duplicate the rows for which IMP_START_TIME_BIN is less than IMP_CLR_TIME_BIN, as many times as the absolute difference between IMP_START_TIME_BIN and IMP_CLR_TIME_BIN, appending the new rows at the end of the data frame (or, preferably, directly below the original row) while incrementing the value of IMP_START_TIME_BIN.
For example, for row 3 the difference is 10, so I should append 10 rows, incrementing the value in IMP_START_TIME_BIN from 8 (exclusive) to 18 (inclusive).
The result should look like this:
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN IMP_CLR_TIME_BIN
0 -1447310116 23:59:00 00:11:00 47 0
1 1673545041 00:00:00 00:01:00 0 0
2 -743717696 23:59:00 00:00:00 47 0
3 58641876 04:01:00 09:02:00 8 18
4 58641876 04:01:00 09:02:00 9 18
... ... ... ... ... ...
13 58641876 04:01:00 09:02:00 18 18
For this I tried the following, but it didn't work:
for i in range(len(df)):
    if df.ix[i,3] < df.ix[i,4]:
        for j in range(df.ix[i,3]+1, df.ix[i,4]+1):
            df = df.append((df.set_value(i,'IMP_START_TIME_BIN',j))*abs(df.ix[i,3] - df.ix[i,4]))
How can I do it?
You can use this solution; the only requirement is that the index values are unique:
#first filter only values for repeating
l = df['IMP_CLR_TIME_BIN'] - df['IMP_START_TIME_BIN']
l = l[l > 0]
print (l)
3 10
dtype: int64
#repeat rows by repeating index values
df1 = df.loc[np.repeat(l.index.values,l.values)].copy()
#add counter to column IMP_START_TIME_BIN
#better explanation http://stackoverflow.com/a/43518733/2901002
a = pd.Series(df1.index == df1.index.to_series().shift())
b = a.cumsum()
a = b.sub(b.mask(a).ffill().fillna(0).astype(int)).add(1)
df1['IMP_START_TIME_BIN'] = df1['IMP_START_TIME_BIN'] + a.values
#append to original df, if necessary sort
df = df.append(df1, ignore_index=True).sort_values('SERV_OR_IOR_ID')
print (df)
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN \
0 -1447310116 23:59:00 00:11:00 47
1 1673545041 00:00:00 00:01:00 0
2 -743717696 23:59:00 00:00:00 47
3 58641876 04:01:00 09:02:00 8
4 58641876 04:01:00 09:02:00 9
5 58641876 04:01:00 09:02:00 10
6 58641876 04:01:00 09:02:00 11
7 58641876 04:01:00 09:02:00 12
8 58641876 04:01:00 09:02:00 13
9 58641876 04:01:00 09:02:00 14
10 58641876 04:01:00 09:02:00 15
11 58641876 04:01:00 09:02:00 16
12 58641876 04:01:00 09:02:00 17
13 58641876 04:01:00 09:02:00 18
IMP_CLR_TIME_BIN
0 0
1 0
2 0
3 18
4 18
5 18
6 18
7 18
8 18
9 18
10 18
11 18
12 18
13 18
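Note that .ix, set_value, and DataFrame.append have since been removed from pandas. The same repeat-and-increment idea can be sketched with current APIs (this is an adaptation of the answer above, not its original code; the time columns are omitted for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SERV_OR_IOR_ID': [-1447310116, 1673545041, -743717696, 58641876],
    'IMP_START_TIME_BIN': [47, 0, 47, 8],
    'IMP_CLR_TIME_BIN': [0, 0, 0, 18],
})

# number of extra rows each original row needs (negative differences -> 0)
reps = (df['IMP_CLR_TIME_BIN'] - df['IMP_START_TIME_BIN']).clip(lower=0)

# repeat the qualifying rows by repeating their index values
extra = df.loc[np.repeat(df.index.to_numpy(), reps.to_numpy())].copy()

# per-index counter 1..n increments IMP_START_TIME_BIN
extra['IMP_START_TIME_BIN'] += extra.groupby(level=0).cumcount() + 1

# append to the original df and sort
df = pd.concat([df, extra], ignore_index=True).sort_values('SERV_OR_IOR_ID')
```

groupby(level=0).cumcount() replaces the shift/cumsum counter trick in the answer: it numbers the repeats of each original index 0..n-1 directly.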
I am having issues finding a solution for the cumulative sum for MTD (month-to-date) and YTD (year-to-date).
I need help to get this result.
Use groupby.cumsum combined with periods using to_period:
# ensure datetime
s = pd.to_datetime(df['date'], dayfirst=False)
# group by year
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
# group by month
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()
Example (with dummy data):
date count ytd mtd
0 2022-08-26 6 6 6
1 2022-08-27 1 7 7
2 2022-08-28 4 11 11
3 2022-08-29 4 15 15
4 2022-08-30 8 23 23
5 2022-08-31 4 27 27
6 2022-09-01 6 33 6
7 2022-09-02 3 36 9
8 2022-09-03 5 41 14
9 2022-09-04 8 49 22
10 2022-09-05 7 56 29
11 2022-09-06 9 65 38
12 2022-09-07 9 74 47
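Put together as a runnable sketch, with the dummy data copied from the table above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2022-08-26', periods=13, freq='D'),
    'count': [6, 1, 4, 4, 8, 4, 6, 3, 5, 8, 7, 9, 9],
})

# ensure datetime
s = pd.to_datetime(df['date'])

# cumulative sum within each year, then within each month
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()
```

The mtd column resets to 6 at 2022-09-01 while ytd keeps accumulating, matching the table.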
I have a pandas df, like this:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
I would like to calculate the number of days since the last non-zero value, as in the example below. How can I use diff() for this? The calculation restarts for each ID.
Output:
ID date value days_last_value
0 10 2022-01-01 100 0
1 10 2022-01-02 150 1
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200 3
5 10 2022-01-06 0
6 10 2022-01-07 150 2
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100 5
12 23 2022-02-01 490 0
13 23 2022-02-02 0
14 23 2022-02-03 350 2
15 23 2022-02-04 333 1
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211 4
20 23 2022-02-09 100 1
Explanation below.
import pandas as pd
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100,
                             490, 0, 350, 333, 0, 0, 0, 211, 100]})
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
ID value days_last_value
0 10 100 0.0
1 10 150 1.0
2 10 0 NaN
3 10 0 NaN
4 10 200 3.0
5 10 0 NaN
6 10 150 2.0
7 10 0 NaN
8 10 0 NaN
9 10 0 NaN
10 10 0 NaN
11 10 100 5.0
12 23 490 0.0
13 23 0 NaN
14 23 350 2.0
15 23 333 1.0
16 23 0 NaN
17 23 0 NaN
18 23 0 NaN
19 23 211 4.0
20 23 100 1.0
First, we'll have to group by 'ID'.
We also create groups for each block of days, by building a True/False series marking where value is not 0, then taking a cumulative sum. That is the (df['value'] != 0).cumsum() part, which results in
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 4
8 4
9 4
10 4
11 5
12 6
13 6
14 7
15 8
16 8
17 8
18 8
19 9
20 10
We can use the values in this series to also group on; combining that with the 'ID' group, you have the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block, we get its size, which is exactly the interval in days that you want. We do need to shift by one, since each non-zero row should receive the size of the previous block (the gap leading up to it), filling with 0 at the start of each group. This shift has to happen per ID group, so we group by ID again before shifting (we lost that grouping after calling .size()).
Now, this new series needs to be assigned back to the dataframe, but it's obviously shorter, and since its index is also reset, we can't easily reassign it (not with df['days_last_value'], df.loc[...] or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of the days series equal to that.
Now it's an easy step to assign the days directly to the relevant column in the dataframe: pandas will match the indices.
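As an alternative that actually uses diff(), as the question asked: since the dates here are consecutive daily rows (an assumption about the data, not part of the answer above), the gaps can be read straight off the row positions of the non-zero rows within each ID:

```python
import pandas as pd

df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100,
                             490, 0, 350, 333, 0, 0, 0, 211, 100]})

# row positions where value != 0
nz = df.index[df['value'] != 0].to_series()

# gap to the previous non-zero row within each ID (first per ID -> 0)
gaps = nz.groupby(df['ID']).diff().fillna(0).astype(int)

# assign only at the non-zero rows; the rest stay NaN, as in the output above
df.loc[gaps.index, 'days_last_value'] = gaps
```

This gives the same days_last_value column as the groupby/size/shift approach, but only holds if rows really are one day apart.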
I have read a couple of similar posts regarding the issue before, but none of the solutions worked for me. So I've got the following csv:
Score date term
0 72 3 Feb · 1
1 47 1 Feb · 1
2 119 6 Feb · 1
8 101 7 hrs · 1
9 536 11 min · 1
10 53 2 hrs · 1
11 20 11 Feb · 3
3 15 1 hrs · 2
4 33 7 Feb · 1
5 153 4 Feb · 3
6 34 3 min · 2
7 26 3 Feb · 3
I want to sort the csv by date. What's the easiest way to do that?
You can create two helper columns: one with datetimes created by to_datetime, and a second with timedeltas created by to_timedelta. to_timedelta needs the HH:MM:SS format, so the min/hrs strings are first rewritten with Series.replace using regexes; the last step is sorting by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times', 'date1'])
print (df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
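A self-contained sketch on the sample data (Score values copied from the table, the term column omitted; raw-string regex patterns are a small modernization to avoid invalid-escape warnings):

```python
import pandas as pd

df = pd.DataFrame({'Score': [72, 47, 119, 101, 536, 53, 20, 15, 33, 153, 34, 26],
                   'date': ['3 Feb', '1 Feb', '6 Feb', '7 hrs', '11 min', '2 hrs',
                            '11 Feb', '1 hrs', '7 Feb', '4 Feb', '3 min', '3 Feb']})

# day-month strings parse to datetimes; relative strings become NaT
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')

# rewrite "N min" / "N hrs" into HH:MM:SS so to_timedelta can parse them
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')

# timedelta rows sort first; the NaT rows fall to the end, ordered by date1
df = df.sort_values(['times', 'date1'])
```

sort_values places NaN/NaT last by default, which is exactly why the relative times end up before the calendar dates.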
I want to plot this data frame:
Date TradeSize
0 2013-04-17 0.780000
1 2013-04-20 0.034000
2 2013-04-23 21.972500
3 2013-05-28 0.021000
4 2013-06-16 11.000000
5 2013-06-19 0.013000
6 2013-07-01 9.021000
7 2013-07-13 0.150000
8 2013-09-01 6.000000
9 2013-09-04 0.008000
10 2013-09-16 0.082000
11 2013-09-17 0.010000
12 2013-09-21 0.161000
13 2013-09-22 1.000000
14 2013-09-23 1.000000
15 2013-09-24 1.119000
16 2013-09-28 1.000000
17 2013-12-17 3.000000
18 2013-12-18 1.500000
19 2014-01-11 1.170000
20 2014-01-14 0.000100
21 2014-01-25 4.000000
22 2014-01-26 0.060000
23 2014-01-28 2.029900
24 2014-02-22 0.089900
25 2014-03-02 8.000000
26 2014-03-18 0.008000
27 2014-03-31 0.000100
28 2014-04-05 0.052000
29 2014-04-19 0.122000
30 2014-04-20 0.027000
31 2014-04-21 0.000100
32 2014-04-22 0.001100
33 2014-04-27 0.100000
34 2014-04-29 0.039000
35 2014-05-05 3.521000
36 2014-05-07 0.000105
37 2014-05-11 0.000100
38 2014-05-14 0.000100
39 2014-06-15 0.000800
40 2014-06-21 0.000500
41 2014-06-24 0.000600
42 2014-06-28 0.000400
43 2014-07-14 0.135000
44 2014-07-15 0.002300
45 2014-07-21 300.000000
46 2014-07-22 10.000000
47 2014-08-09 2.000000
48 2014-08-23 19.000000
49 2014-09-13 2.000000
But there is a restriction I should apply to the data, to prettify the plot:
If the next row's TradeSize value is not within ±10% of today's TradeSize value, it should be replaced by the average of today's TradeSize value and the next row's TradeSize value; to clarify, see this example:
Date TradeSize
1 2013-04-20 0.034000
2 2013-04-23 21.972500
The value at index 2 is greater than +10% of the value at index 1, so the value at index 2 should be replaced by the average of these two values, and so on.
If the value is below -10%, the same applies.
If I understand right, 'tomorrow' means the next row?
Calculate the ±10% bounds first:
min_v = (df['TradeSize'] * 0.9).shift() #shift to next row
max_v = (df['TradeSize'] * 1.1).shift()
df = df.assign(min_v=min_v, max_v=max_v)
Then get the average:
df = df.assign(avg=(df['TradeSize']+df['TradeSize'].shift())/2.)
Make a copied result column (for the plot):
df = df.assign(res=df['TradeSize'].copy())
Find the values outside the ±10% range and replace them with the average:
not_in_range_bool = (df['TradeSize'] < df['min_v']) | (df['TradeSize'] > df['max_v'])
not_in_range_bool.iloc[0] = False  # first row cannot be evaluated, set it to False
df.loc[not_in_range_bool, 'res'] = df.loc[not_in_range_bool, 'avg']
Now you can use df['res'] to prettify the plot.
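The steps above can be condensed into a single sketch using Series.where (only the first five TradeSize values from the question are used; the NaN comparisons in the first row evaluate to False, so row 0 needs no extra masking):

```python
import pandas as pd

df = pd.DataFrame({'TradeSize': [0.78, 0.034, 21.9725, 0.021, 11.0]})

# previous row's +-10% band
min_v = (df['TradeSize'] * 0.9).shift()
max_v = (df['TradeSize'] * 1.1).shift()

# average of the current and previous raw value
avg = (df['TradeSize'] + df['TradeSize'].shift()) / 2

# keep values inside the band, replace the rest with the average
outside = (df['TradeSize'] < min_v) | (df['TradeSize'] > max_v)
df['res'] = df['TradeSize'].where(~outside, avg)
```

Note the comparison uses the previous raw value, not the already-smoothed one, matching the answer above.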
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(25).reshape((5, 5)),
                   index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal'] = [1, -1, 1, -1, 1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the value of trading_signal from df1 and send it to df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge, you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on joining a single index to a MultiIndex:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining right for a reason; if you don't understand what this means, please read the brief primer on merge methods (relational algebra).
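A self-contained sketch of the join, rebuilding both frames from the question (the MultiIndex tuples for df2 are reconstructed by hand):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(25).reshape(5, 5),
                   index=pd.date_range('2015-01-01', periods=5, freq='D'))
df1['trading_signal'] = [1, -1, 1, -1, 1]
df1.index.name = 'Date'  # must match the first level name of df2

idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2015-01-01'), '22:55:00'),
     (pd.Timestamp('2015-01-01'), '23:55:00'),
     (pd.Timestamp('2015-01-02'), '00:55:00'),
     (pd.Timestamp('2015-01-02'), '01:55:00'),
     (pd.Timestamp('2015-01-02'), '02:55:00')],
    names=['Date', 'Time'])
df2 = pd.DataFrame(np.arange(25).reshape(5, 5), index=idx)

# join the single-indexed frame onto the MultiIndexed one via the Date level
out = df1[['trading_signal']].join(df2, how='right')
```

Every time on 2015-01-01 picks up signal 1 and every time on 2015-01-02 picks up -1, as in the output above.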