Correcting data and replacing it with new values in pandas

I want to plot this data frame:
Date TradeSize
0 2013-04-17 0.780000
1 2013-04-20 0.034000
2 2013-04-23 21.972500
3 2013-05-28 0.021000
4 2013-06-16 11.000000
5 2013-06-19 0.013000
6 2013-07-01 9.021000
7 2013-07-13 0.150000
8 2013-09-01 6.000000
9 2013-09-04 0.008000
10 2013-09-16 0.082000
11 2013-09-17 0.010000
12 2013-09-21 0.161000
13 2013-09-22 1.000000
14 2013-09-23 1.000000
15 2013-09-24 1.119000
16 2013-09-28 1.000000
17 2013-12-17 3.000000
18 2013-12-18 1.500000
19 2014-01-11 1.170000
20 2014-01-14 0.000100
21 2014-01-25 4.000000
22 2014-01-26 0.060000
23 2014-01-28 2.029900
24 2014-02-22 0.089900
25 2014-03-02 8.000000
26 2014-03-18 0.008000
27 2014-03-31 0.000100
28 2014-04-05 0.052000
29 2014-04-19 0.122000
30 2014-04-20 0.027000
31 2014-04-21 0.000100
32 2014-04-22 0.001100
33 2014-04-27 0.100000
34 2014-04-29 0.039000
35 2014-05-05 3.521000
36 2014-05-07 0.000105
37 2014-05-11 0.000100
38 2014-05-14 0.000100
39 2014-06-15 0.000800
40 2014-06-21 0.000500
41 2014-06-24 0.000600
42 2014-06-28 0.000400
43 2014-07-14 0.135000
44 2014-07-15 0.002300
45 2014-07-21 300.000000
46 2014-07-22 10.000000
47 2014-08-09 2.000000
48 2014-08-23 19.000000
49 2014-09-13 2.000000
But there is a restriction I need to apply to the data first, to make the plot look cleaner.
If the next row's TradeSize value is not within ±10% of the current row's TradeSize value, it should be replaced by the average of the current row's value and the next row's value. To clarify, see this example:
Date TradeSize
1 2013-04-20 0.034000
2 2013-04-23 21.972500
The value at index 2 is more than 10% above the value at index 1, so the value at index 2 should be replaced by the average of these two values, and so on.
If the value is more than 10% below the previous one, the same replacement should apply.

If I understand correctly, 'tomorrow' means the next row?
Calculate the ±10% bounds first:
min_v = (df['TradeSize'] * 0.9).shift()  # previous row's value minus 10%, moved down one row
max_v = (df['TradeSize'] * 1.1).shift()  # previous row's value plus 10%, moved down one row
df = df.assign(min_v=min_v, max_v=max_v)
Then get the average of each row and the row before it:
df = df.assign(avg=(df['TradeSize'] + df['TradeSize'].shift()) / 2.)
Make a copy of the column to hold the result (for the plot):
df = df.assign(res=df['TradeSize'].copy())
Find the values outside ±10% and replace them with the average:
not_in_range_bool = (df['TradeSize'] < df['min_v']) | (df['TradeSize'] > df['max_v'])
not_in_range_bool[0] = False  # the first row has no previous row to compare with, so set it to False
df.loc[not_in_range_bool, 'res'] = df.loc[not_in_range_bool, 'avg']
Now you can use df['res'] to draw the cleaner plot.
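The steps above can be condensed into a short, self-contained sketch. It uses a handful of rows taken from the question's data and Series.mask instead of helper columns; the variable names prev, out_of_range and avg are chosen here for illustration. As in the answer, each row is compared with the original (unreplaced) value of the previous row, which is one possible reading of the requirement.

```python
import pandas as pd

# A few rows from the question's data, enough to show both cases.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2013-04-17", "2013-04-20", "2013-04-23",
                            "2013-09-22", "2013-09-23"]),
    "TradeSize": [0.78, 0.034, 21.9725, 1.0, 1.0],
})

# Previous row's value; NaN for the first row.
prev = df["TradeSize"].shift()

# True where the value lies outside +-10% of the previous row's value.
# Comparisons against NaN are False, so the first row is never replaced.
out_of_range = (df["TradeSize"] < prev * 0.9) | (df["TradeSize"] > prev * 1.1)

# Average of the current and the previous row.
avg = (df["TradeSize"] + prev) / 2

# Keep in-range values, replace out-of-range values with the average.
df["res"] = df["TradeSize"].mask(out_of_range, avg)
print(df)
```

Because the comparison with a missing previous value is already False, the explicit reset of the first row is not needed in this variant.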

Related

Python Pandas - Time Series Find Index of Previous Row

I am analyzing time series data of one stock to seek the highest price for further analysis, here is the sample dataframe df:
date close high_3days
2021-05-01 20 20
2021-05-02 23 23
2021-05-03 26 26
2021-05-04 24 26
2021-05-05 20 26
2021-05-06 26 26
2021-05-07 22 26
2021-05-08 30 30
2021-05-09 20 30
2021-05-10 20 30
I want to add a new column with the number of days since the previous 3-day high. My idea is to find the index of the row holding the previous high and subtract it from the index of the current row.
Here is the desired output:
date close high_3days days_previous_high
2021-05-01 20 20 0
2021-05-02 23 23 0
2021-05-03 26 26 0
2021-05-04 24 26 1
2021-05-05 20 26 2
2021-05-06 22 26 3
2021-05-07 20 26 4
2021-05-08 30 30 0
2021-05-09 20 30 1
2021-05-10 20 30 2
Could you help me figure out a way to do this? Thanks!

Try creating a boolean index with expanding max, then enumerate each group with groupby cumcount:
df['days_previous_high'] = df.groupby(
    df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
df:
date close high_3days days_previous_high
0 2021-05-01 20 20 0
1 2021-05-02 23 23 0
2 2021-05-03 26 26 0
3 2021-05-04 24 26 1
4 2021-05-05 20 26 2
5 2021-05-06 22 26 3
6 2021-05-07 20 26 4
7 2021-05-08 30 30 0
8 2021-05-09 20 30 1
9 2021-05-10 20 30 2
Explanation:
An expanding max determines the running maximum at each row.
df['high_3days'].expanding().max()
diff shows where the running maximum increases, i.e. where a new high is reached.
df['high_3days'].expanding().max().diff()
Groups can then be created by taking the cumsum of where the diff is greater than 0:
df['high_3days'].expanding().max().diff().gt(0).cumsum()
expanding_max expanding_max_diff expanding_max_gt_0 expanding_max_gt_0_cs
20.0 NaN False 0
23.0 3.0 True 1
26.0 3.0 True 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
30.0 4.0 True 3
30.0 0.0 False 3
30.0 0.0 False 3
Now that rows are grouped, groupby cumcount can be used to enumerate each group:
df.groupby(df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
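For reference, here is a self-contained sketch of the same grouping idea on the high_3days values from the question. It uses cummax(), which gives the same running maximum as expanding().max(); the names running_max and group_id are introduced here only for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-05-01", periods=10),
    "high_3days": [20, 23, 26, 26, 26, 26, 26, 30, 30, 30],
})

# Running maximum up to and including each row.
running_max = df["high_3days"].cummax()

# A new group starts every time the running maximum increases.
group_id = running_max.diff().gt(0).cumsum()

# Position of each row inside its group = rows since the last new high.
df["days_previous_high"] = df.groupby(group_id).cumcount()
print(df)
```

Note that this counts rows, so it equals the number of days only when the data has exactly one row per calendar day.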

Create new Pandas columns using the value from previous row

I need to create two new Pandas columns using the logic and value from the previous row.
I have the following data:
Day Vol Price Income Outgoing
1 499 75
2 3233 90
3 1812 70
4 2407 97
5 3474 82
6 1057 53
7 2031 68
8 304 78
9 1339 62
10 2847 57
11 3767 93
12 1096 83
13 3899 88
14 4090 63
15 3249 52
16 1478 52
17 4926 75
18 1209 52
19 1982 90
20 4499 93
My challenge is to come up with logic where the Income and Outgoing columns (which are currently empty) hold the value of Vol * Price.
The Income column should carry this value when the previous day's "Price" is lower than the current day's. The Outgoing column should carry this value when the previous day's "Price" is higher than the current day's. Everywhere else, the Income and Outgoing columns should just contain NaN. If the Price is unchanged, that day's row is to be dropped.
The logic should start with day (n + 1): the first row should be skipped and the logic should apply from row 2 onwards.
I have tried using shift, for example:
if sample_data['Price'].shift(1) < sample_data['Price'].shift(2)):
sample_data['Income'] = sample_data['Vol'] * sample_data['Price']
else:
sample_data['Outgoing'] = sample_data['Vol'] * sample_data['Price']
But it isn't working.
I feel there should be a simpler and more comprehensive way to go about this. Could someone please help?
Update (The final output should look like this):
For day 16, the data is deleted because we have two similar prices for day 15 and 16.

I'd calculate the product and the mask separately, and then update the cols:
In [11]: vol_price = df["Vol"] * df["Price"]
In [12]: incoming = df["Price"].diff() < 0
In [13]: df.loc[incoming, "Income"] = vol_price
In [14]: df.loc[~incoming, "Outgoing"] = vol_price
In [15]: df
Out[15]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 NaN 290970.0
2 3 1812 70 126840.0 NaN
3 4 2407 97 NaN 233479.0
4 5 3474 82 284868.0 NaN
5 6 1057 53 56021.0 NaN
6 7 2031 68 NaN 138108.0
7 8 304 78 NaN 23712.0
8 9 1339 62 83018.0 NaN
9 10 2847 57 162279.0 NaN
10 11 3767 93 NaN 350331.0
11 12 1096 83 90968.0 NaN
12 13 3899 88 NaN 343112.0
13 14 4090 63 257670.0 NaN
14 15 3249 52 168948.0 NaN
15 16 1478 52 NaN 76856.0
16 17 4926 75 NaN 369450.0
17 18 1209 52 62868.0 NaN
18 19 1982 90 NaN 178380.0
19 20 4499 93 NaN 418407.0
Or, if it should be the other way around:
In [21]: incoming = df["Price"].diff() > 0
In [22]: df.loc[incoming, "Income"] = vol_price
In [23]: df.loc[~incoming, "Outgoing"] = vol_price
In [24]: df
Out[24]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 290970.0 NaN
2 3 1812 70 NaN 126840.0
3 4 2407 97 233479.0 NaN
4 5 3474 82 NaN 284868.0
5 6 1057 53 NaN 56021.0
6 7 2031 68 138108.0 NaN
7 8 304 78 23712.0 NaN
8 9 1339 62 NaN 83018.0
9 10 2847 57 NaN 162279.0
10 11 3767 93 350331.0 NaN
11 12 1096 83 NaN 90968.0
12 13 3899 88 343112.0 NaN
13 14 4090 63 NaN 257670.0
14 15 3249 52 NaN 168948.0
15 16 1478 52 NaN 76856.0
16 17 4926 75 369450.0 NaN
17 18 1209 52 NaN 62868.0
18 19 1982 90 178380.0 NaN
19 20 4499 93 418407.0 NaN
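In both outputs above, the complement mask ~incoming also covers the first row (whose diff is NaN) and the unchanged-price row (day 16), so those rows end up in Outgoing. The following is a minimal sketch that instead follows the rules as stated in the question: the first row gets NaN in both columns and a row with an unchanged price is dropped. It uses four rows from the question's data (days 14 to 17) and assumes that Income means the price rose relative to the previous day; the names change and result are illustrative.

```python
import pandas as pd

# Days 14-17 from the question's data.
df = pd.DataFrame({
    "Day": [14, 15, 16, 17],
    "Vol": [4090, 3249, 1478, 4926],
    "Price": [63, 52, 52, 75],
})

vol_price = df["Vol"] * df["Price"]
change = df["Price"].diff()  # NaN for the first row

# Separate masks for "rose" and "fell"; NaN and 0 match neither.
df["Income"] = vol_price.where(change > 0)
df["Outgoing"] = vol_price.where(change < 0)

# Drop rows whose price is unchanged. NaN != 0 is True,
# so the first row is kept with NaN in both columns.
result = df[change != 0]
print(result)
```

Using two explicit masks rather than a mask and its negation is what keeps the first row and the unchanged-price rows out of both columns.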

Identify Updated Value in Time Series Data Python Pandas

I don't do a lot of time series work and I know the way I'm thinking about this solution is sub-optimal. Wanted to get input as to the most efficient way to approach this issue.
I have several days of values with multiple values per day identified by a time stamp.
Data looks like this:
Index Period Value Timestamp
0 1 73 2017-08-10 16:44:23
1 1 73 2017-08-09 16:30:12
2 1 73 2017-08-08 16:40:31
3 2 50 2017-08-10 16:44:23
4 2 45 2017-08-09 16:30:12
5 2 45 2017-08-08 16:40:31
6 3 13 2017-08-10 16:44:23
7 3 13 2017-08-09 16:30:12
8 3 13 2017-08-08 16:40:31
The example shows one data element for three different periods, captured three days in a row. The idea is to determine whether the value for any of the measured periods (Period 1, 2, or 3) changes.
As you can see in the example, on the third day (2017-08-10) the value for Period 2 was updated. I want to detect that changed value.
The only way I can figure out how to compare the values is to loop through them, which I think is inelegant, inefficient and definitely not Pythonic.
Does anyone have insight into an approach without looping/iteration?
Thanks in advance.
EDIT
Expected Output would be a df as follows if there is a value change in the most recent timestamped data:
Index Period Value Timestamp
0 1 73 2017-08-10 16:44:23
3 2 50 2017-08-10 16:44:23
6 3 13 2017-08-10 16:44:23

First, you can identify rows with a change like this:
df['diff'] = df.groupby('Period')['Value'].diff(-1).fillna(0)
Period Value Timestamp diff
0 1 73 2017-08-10 16:44:23 0.0
1 1 73 2017-08-09 16:30:12 0.0
2 1 73 2017-08-08 16:40:31 0.0
3 2 50 2017-08-10 16:44:23 5.0
4 2 45 2017-08-09 16:30:12 0.0
5 2 45 2017-08-08 16:40:31 0.0
6 3 13 2017-08-10 16:44:23 0.0
7 3 13 2017-08-09 16:30:12 0.0
8 3 13 2017-08-08 16:40:31 0.0
Then select rows to display (all rows with the same timestamp as a row with a change):
lst = df[ df['diff'] != 0. ]['Timestamp'].tolist()
df[ df['Timestamp'].isin(lst) ]
Period Value Timestamp diff
0 1 73 2017-08-10 16:44:23 0.0
3 2 50 2017-08-10 16:44:23 5.0
6 3 13 2017-08-10 16:44:23 0.0
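Put together as a runnable sketch, with the question's data rebuilt inline (the variable names changed and result are illustrative). It relies on the rows within each Period being ordered newest first, as in the question, so that diff(-1) compares each row with the next older one.

```python
import pandas as pd

stamps = ["2017-08-10 16:44:23", "2017-08-09 16:30:12", "2017-08-08 16:40:31"]
df = pd.DataFrame({
    "Period": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Value": [73, 73, 73, 50, 45, 45, 13, 13, 13],
    "Timestamp": pd.to_datetime(stamps * 3),
})

# Within each Period, subtract the next (older) row's value from the current one.
# The oldest row of each group has nothing to compare with, hence fillna(0).
df["diff"] = df.groupby("Period")["Value"].diff(-1).fillna(0)

# Timestamps at which at least one Period changed.
changed = df.loc[df["diff"] != 0, "Timestamp"]

# All rows captured at those timestamps.
result = df[df["Timestamp"].isin(changed)]
print(result)
```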

How to append and set value in one command using Python?

I have the following dataframe (df):
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN IMP_CLR_TIME_BIN
0 -1447310116 23:59:00 00:11:00 47 0
1 1673545041 00:00:00 00:01:00 0 0
2 -743717696 23:59:00 00:00:00 47 0
3 58641876 04:01:00 09:02:00 8 18
I want to duplicate each row for which IMP_START_TIME_BIN is less than IMP_CLR_TIME_BIN as many times as the absolute difference between IMP_START_TIME_BIN and IMP_CLR_TIME_BIN, and then append the copies (at the end of the data frame or, preferably, directly below the original row) while incrementing the value of IMP_START_TIME_BIN.
For example, for row 3 the difference is 10, so I should append 10 rows to the data frame, incrementing the value of IMP_START_TIME_BIN from 8 (exclusive) to 18 (inclusive).
The result should look like this:
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN IMP_CLR_TIME_BIN
0 -1447310116 23:59:00 00:11:00 47 0
1 1673545041 00:00:00 00:01:00 0 0
2 -743717696 23:59:00 00:00:00 47 0
3 58641876 04:01:00 09:02:00 8 18
4 58641876 04:01:00 09:02:00 9 18
... ... ... ... ... ...
13 58641876 04:01:00 09:02:00 18 18
For this I tried the following, but it didn't work:
for i in range(len(df)):
if df.ix[i,3] < df.ix[i,4]:
for j in range(df.ix[i,3]+1, df.ix[i,4]+1):
df = df.append((df.set_value(i,'IMP_START_TIME_BIN',j))*abs(df.ix[i,3] - df.ix[i,4]))
How can I do it?

You can use this solution; the only requirement is that the index values are unique:
#first filter only values for repeating
l = df['IMP_CLR_TIME_BIN'] - df['IMP_START_TIME_BIN']
l = l[l > 0]
print (l)
3 10
dtype: int64
#repeat rows by repeating index values
df1 = df.loc[np.repeat(l.index.values,l.values)].copy()
#add counter to column IMP_START_TIME_BIN
#better explanation http://stackoverflow.com/a/43518733/2901002
a = pd.Series(df1.index == df1.index.to_series().shift())
b = a.cumsum()
a = b.sub(b.mask(a).ffill().fillna(0).astype(int)).add(1)
df1['IMP_START_TIME_BIN'] = df1['IMP_START_TIME_BIN'] + a.values
#append to original df, if necessary sort
#DataFrame.append was removed in pandas 2.0, so pd.concat is used here
df = pd.concat([df, df1], ignore_index=True).sort_values('SERV_OR_IOR_ID')
print (df)
SERV_OR_IOR_ID IMP_START_TIME IMP_CLR_TIME IMP_START_TIME_BIN \
0 -1447310116 23:59:00 00:11:00 47
1 1673545041 00:00:00 00:01:00 0
2 -743717696 23:59:00 00:00:00 47
3 58641876 04:01:00 09:02:00 8
4 58641876 04:01:00 09:02:00 9
5 58641876 04:01:00 09:02:00 10
6 58641876 04:01:00 09:02:00 11
7 58641876 04:01:00 09:02:00 12
8 58641876 04:01:00 09:02:00 13
9 58641876 04:01:00 09:02:00 14
10 58641876 04:01:00 09:02:00 15
11 58641876 04:01:00 09:02:00 16
12 58641876 04:01:00 09:02:00 17
13 58641876 04:01:00 09:02:00 18
IMP_CLR_TIME_BIN
0 0
1 0
2 0
3 18
4 18
5 18
6 18
7 18
8 18
9 18
10 18
11 18
12 18
13 18
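As an alternative sketch of the same repeat-and-count idea, the counter can be built with groupby(level=0).cumcount() on the repeated index, and the pieces combined with pd.concat. This is a simplified variant rather than the answer's exact method: the time columns are left out for brevity, and the names gap, extra and out are introduced here for illustration. It also assumes unique index values in the original frame.

```python
import pandas as pd

df = pd.DataFrame({
    "SERV_OR_IOR_ID": [-1447310116, 1673545041, -743717696, 58641876],
    "IMP_START_TIME_BIN": [47, 0, 47, 8],
    "IMP_CLR_TIME_BIN": [0, 0, 0, 18],
})

# Number of extra rows per original row; negative differences become 0.
gap = (df["IMP_CLR_TIME_BIN"] - df["IMP_START_TIME_BIN"]).clip(lower=0)

# Repeat each index label gap[i] times and pull out those rows.
extra = df.loc[df.index.repeat(gap.to_numpy())].copy()

# Counter 1, 2, ... within each repeated original row, added positionally.
step = extra.groupby(level=0).cumcount().to_numpy() + 1
extra["IMP_START_TIME_BIN"] = extra["IMP_START_TIME_BIN"].to_numpy() + step

# Append the generated rows after the original ones.
out = pd.concat([df, extra], ignore_index=True)
print(out)
```

With several qualifying rows, the generated rows are appended at the end; sorting on a suitable key afterwards would place them below their originals.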

Python: How to split hourly values into 15 minute buckets?

happy new year to all!
I guess this question might be an easy one, but I can't figure it out.
How can I turn hourly data into 15-minute buckets quickly in Python (see table below)? Basically, the left column should be converted into the right one: just duplicate each hourly value four times and put it into a new column.
Thanks for the support!
Cheers!
Hourly 15mins
1 28.90 1 28.90
2 28.88 1 28.90
3 28.68 1 28.90
4 28.67 1 28.90
5 28.52 2 28.88
6 28.79 2 28.88
7 31.33 2 28.88
8 32.60 2 28.88
9 42.00 3 28.68
10 44.00 3 28.68
11 44.00 3 28.68
12 44.00 3 28.68
13 39.94 4 28.67
14 39.90 4 28.67
15 38.09 4 28.67
16 39.94 4 28.67
17 44.94 5 28.52
18 66.01 5 28.52
19 49.45 5 28.52
20 48.37 5 28.52
21 38.02 6 28.79
22 34.55 6 28.79
23 33.33 6 28.79
24 32.05 6 28.79
7 31.33
7 31.33
7 31.33
7 31.33

You could also do this by constructing a new DataFrame and using numpy methods.
import numpy as np
import pandas as pd
pd.DataFrame(np.column_stack((np.arange(df.shape[0]).repeat(4, axis=0),
                              np.array(df).repeat(4, axis=0))),
             columns=['hours', '15_minutes'])
which returns
hours 15_minutes
0 0 28.90
1 0 28.90
2 0 28.90
3 0 28.90
4 1 28.88
5 1 28.88
...
91 22 33.33
92 23 32.05
93 23 32.05
94 23 32.05
95 23 32.05
np.column_stack stacks the arrays side by side as columns of one 2-D array. np.arange(df.shape[0]).repeat(4, axis=0) builds the hour IDs by repeating each of 0 through 23 four times, and the values for each 15-minute slot are constructed in the same way from the hourly values. pd.DataFrame then builds the DataFrame and assigns the column names.
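A small self-contained sketch of the repeat idea, using the first three hourly values from the question. Here the two columns are passed as a dict instead of through column_stack, which is a deliberate variation so that the hour IDs stay integers; the column name price and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# First three hourly values from the question.
hourly = pd.DataFrame({"price": [28.90, 28.88, 28.68]})

quarter = pd.DataFrame({
    # Hour IDs 0, 1, 2, each repeated four times.
    "hours": np.arange(len(hourly)).repeat(4),
    # Each hourly value repeated four times, once per 15-minute slot.
    "15_minutes": hourly["price"].to_numpy().repeat(4),
})
print(quarter)
```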

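Alternatively, create a datetime-like index for your data; then you can use resample and forward-fill each hourly value into the new 15-minute slots ('15min' is the current spelling of the older '15T' alias):
series.resample('15min').ffill()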
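A minimal sketch of the resample approach, assuming the hourly values are stamped at the start of each hour (the dates below are made up for the example). Upsampling needs a fill method such as ffill to populate the new slots, and plain resampling stops at the last existing timestamp, so the final hour gets only one slot; reindexing onto an explicit 15-minute index is one way to cover it fully.

```python
import pandas as pd

# Three hourly values, indexed by the start of each hour (illustrative dates).
idx = pd.date_range("2021-01-01 00:00", periods=3, freq="60min")
series = pd.Series([28.90, 28.88, 28.68], index=idx)

# Upsample to 15 minutes and carry each hourly value forward.
# The result ends at 02:00, the last original timestamp.
quarter = series.resample("15min").ffill()

# To cover the last hour as well, build the full 15-minute index explicitly.
full_index = pd.date_range(series.index[0], periods=len(series) * 4, freq="15min")
quarter_full = series.reindex(full_index, method="ffill")
print(quarter_full)
```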
