Happy new year to all!
I guess this question might be an easy one, but I can't figure it out.
How can I turn hourly data into 15-minute buckets quickly in Python (see table below)? Basically, the left column should be converted into the right one: just duplicate each hourly value four times and dump it into a new column.
Thanks for the support!
Cheers!
Hourly          15mins
 1 28.90         1 28.90
 2 28.88         1 28.90
 3 28.68         1 28.90
 4 28.67         1 28.90
 5 28.52         2 28.88
 6 28.79         2 28.88
 7 31.33         2 28.88
 8 32.60         2 28.88
 9 42.00         3 28.68
10 44.00         3 28.68
11 44.00         3 28.68
12 44.00         3 28.68
13 39.94         4 28.67
14 39.90         4 28.67
15 38.09         4 28.67
16 39.94         4 28.67
17 44.94         5 28.52
18 66.01         5 28.52
19 49.45         5 28.52
20 48.37         5 28.52
21 38.02         6 28.79
22 34.55         6 28.79
23 33.33         6 28.79
24 32.05         6 28.79
                 7 31.33
                 7 31.33
                 7 31.33
                 7 31.33
                ...
You could also do this by constructing a new DataFrame and using NumPy methods.
import numpy as np
import pandas as pd

pd.DataFrame(np.column_stack((np.arange(df.shape[0]).repeat(4, axis=0),
                              np.array(df).repeat(4, axis=0))),
             columns=['hours', '15_minutes'])
which returns
hours 15_minutes
0 0 28.90
1 0 28.90
2 0 28.90
3 0 28.90
4 1 28.88
5 1 28.88
...
91 22 33.33
92 23 32.05
93 23 32.05
94 23 32.05
95 23 32.05
column_stack stacks 1-D arrays as columns of a 2-D array. np.arange(df.shape[0]).repeat(4, axis=0) builds the hour IDs by repeating each of 0 through 23 four times in a row, and the 15-minute values are constructed from the hourly values in the same manner. pd.DataFrame then produces the DataFrame, with the column names added as well.
Alternatively, create a datetime-like index for your DataFrame; then you can use resample with a forward fill:
series.resample('15T').ffill()
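A minimal sketch of that approach, assuming the hourly values live in a Series (the start date below is made up for illustration):

import pandas as pd

# hypothetical hourly series; the start date is an assumption for illustration
hourly = pd.Series(
    [28.90, 28.88, 28.68, 28.67],
    index=pd.date_range('2023-01-01', periods=4, freq='h'),
)

# upsample to 15-minute buckets, forward-filling each hourly value
# (newer pandas prefers '15min' over the legacy '15T' alias)
quarter_hourly = hourly.resample('15min').ffill()
print(quarter_hourly)

Note that resample stops at the last timestamp, so the final hour yields only one row; extend the index by one hour first if all four 15-minute rows of the last hour are needed.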
I am having issues finding a solution for the cumulative sum for MTD (month-to-date) and YTD (year-to-date). I need help to get this result.
Use groupby.cumsum, grouping on periods obtained with dt.to_period:
# ensure datetime
s = pd.to_datetime(df['date'], dayfirst=False)
# group by year
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()
# group by month
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()
Example (with dummy data):
date count ytd mtd
0 2022-08-26 6 6 6
1 2022-08-27 1 7 7
2 2022-08-28 4 11 11
3 2022-08-29 4 15 15
4 2022-08-30 8 23 23
5 2022-08-31 4 27 27
6 2022-09-01 6 33 6
7 2022-09-02 3 36 9
8 2022-09-03 5 41 14
9 2022-09-04 8 49 22
10 2022-09-05 7 56 29
11 2022-09-06 9 65 38
12 2022-09-07 9 74 47
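For reference, a minimal sketch that reconstructs the dummy data from the table above and applies the two groupby lines:

import pandas as pd

# dummy data copied from the table above
df = pd.DataFrame({
    'date': pd.date_range('2022-08-26', periods=13, freq='D'),
    'count': [6, 1, 4, 4, 8, 4, 6, 3, 5, 8, 7, 9, 9],
})

s = pd.to_datetime(df['date'], dayfirst=False)
df['ytd'] = df.groupby(s.dt.to_period('Y'))['count'].cumsum()  # resets each year
df['mtd'] = df.groupby(s.dt.to_period('M'))['count'].cumsum()  # resets each month
print(df)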
I am analyzing time-series data for one stock to find the highest price for further analysis. Here is the sample DataFrame df:
date close high_3days
2021-05-01 20 20
2021-05-02 23 23
2021-05-03 26 26
2021-05-04 24 26
2021-05-05 20 26
2021-05-06 22 26
2021-05-07 20 26
2021-05-08 30 30
2021-05-09 20 30
2021-05-10 20 30
I want to add a new column giving the number of days since the previous 3-day high. My logic is to find the index of the row of the previous high and subtract it from the index of the current row.
Here is the desired output:
date close high_3days days_previous_high
2021-05-01 20 20 0
2021-05-02 23 23 0
2021-05-03 26 26 0
2021-05-04 24 26 1
2021-05-05 20 26 2
2021-05-06 22 26 3
2021-05-07 20 26 4
2021-05-08 30 30 0
2021-05-09 20 30 1
2021-05-10 20 30 2
Could you help me figure out the way? Thanks, guys!
Try creating a boolean index with expanding max, then enumerate each group with groupby cumcount:
df['days_previous_high'] = df.groupby(
    df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
df:
date close high_3days days_previous_high
0 2021-05-01 20 20 0
1 2021-05-02 23 23 0
2 2021-05-03 26 26 0
3 2021-05-04 24 26 1
4 2021-05-05 20 26 2
5 2021-05-06 22 26 3
6 2021-05-07 20 26 4
7 2021-05-08 30 30 0
8 2021-05-09 20 30 1
9 2021-05-10 20 30 2
Explanation:
expanding max is used to determine the current maximum value at each row.
df['high_3days'].expanding().max()
diff can be used to see where the running maximum increases, i.e. where the current value exceeds the previous max.
df['high_3days'].expanding().max().diff()
groups can be created by taking the cumsum of where the diff is greater than 0:
df['high_3days'].expanding().max().diff().gt(0).cumsum()
expanding_max expanding_max_diff expanding_max_gt_0 expanding_max_gt_0_cs
20.0 NaN False 0
23.0 3.0 True 1
26.0 3.0 True 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
30.0 4.0 True 3
30.0 0.0 False 3
30.0 0.0 False 3
Now that rows are grouped, groupby cumcount can be used to enumerate each group:
df.groupby(df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
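Putting it all together, a self-contained sketch built from the sample data in the question:

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2021-05-01', periods=10, freq='D'),
    'close': [20, 23, 26, 24, 20, 22, 20, 30, 20, 20],
    'high_3days': [20, 23, 26, 26, 26, 26, 26, 30, 30, 30],
})

# a new group starts whenever the running maximum increases
groups = df['high_3days'].expanding().max().diff().gt(0).cumsum()
df['days_previous_high'] = df.groupby(groups).cumcount()
print(df)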
I have read a couple of similar posts regarding the issue before, but none of the solutions worked for me. So I have the following csv:
Score date term
0 72 3 Feb · 1
1 47 1 Feb · 1
2 119 6 Feb · 1
8 101 7 hrs · 1
9 536 11 min · 1
10 53 2 hrs · 1
11 20 11 Feb · 3
3 15 1 hrs · 2
4 33 7 Feb · 1
5 153 4 Feb · 3
6 34 3 min · 2
7 26 3 Feb · 3
I want to sort the csv by date. What's the easiest way to do that?
You can create two helper columns: one with datetimes created by to_datetime and one with timedeltas created by to_timedelta. to_timedelta needs the HH:MM:SS format, so first convert the raw strings with Series.replace using regexes; finally, sort by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times', 'date1'])
print(df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
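If the helper columns are only needed for the sort, they can be dropped afterwards:

df = df.sort_values(['times', 'date1']).drop(columns=['date1', 'times'])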
I have data flowing into a csv file daily which shows the number of pieces being manufactured, with a Day column and a Peice_Produced column (see the sample in the answers below). I want to clearly show the daily % increase in pieces being produced.
I have tried transpose() and unstack() but have not been able to solve this.
How should I get this done?
You need Series.pct_change() and Series.shift():
df.insert(1, 'Day2', df.Day.shift(-1))
df['Percent_change'] = (df.Peice_Produced.pct_change()*100).shift(-1).fillna(0).round(2)
print(df)
Day Day2 Peice_Produced Percent_change
0 1/1/17 1/2/17 10 -50.00
1 1/2/17 1/3/17 5 200.00
2 1/3/17 1/4/17 15 -60.00
3 1/4/17 1/5/17 6 250.00
4 1/5/17 1/6/17 21 -66.67
5 1/6/17 1/7/17 7 300.00
6 1/7/17 1/8/17 28 -71.43
7 1/8/17 1/9/17 8 350.00
8 1/9/17 1/10/17 36 -75.00
9 1/10/17 1/11/17 9 400.00
10 1/11/17 NaN 45 0.00
I admit I do not fully understand what your intent is. Nevertheless, I may have a solution as I understand it.
Use the diff() function to find the discrete difference. Note that this gives the absolute day-over-day change, not a true percentage, even though the output below labels it with '%'.
Your simulated DataFrame:
>>> df
Day Peice_Produced
0 1/1/17 10
1 1/2/17 5
2 1/3/17 15
3 1/4/17 6
4 1/5/17 21
5 1/6/17 7
6 1/7/17 28
7 1/8/17 8
8 1/9/17 36
9 1/10/17 9
10 1/11/17 45
Solution: one way of doing it:
>>> df['Day_over_day%'] = df.Peice_Produced.diff(periods=1).fillna(0).astype(str) + '%'
>>> df
Day Peice_Produced Day_over_day%
0 1/1/17 10 0.0%
1 1/2/17 5 -5.0%
2 1/3/17 15 10.0%
3 1/4/17 6 -9.0%
4 1/5/17 21 15.0%
5 1/6/17 7 -14.0%
6 1/7/17 28 21.0%
7 1/8/17 8 -20.0%
8 1/9/17 36 28.0%
9 1/10/17 9 -27.0%
10 1/11/17 45 36.0%
You can just add a calculated column. I'm assuming you are storing this data in a pandas DataFrame called df. You can do this simply with:
df['change'] = (df['Pieces Produced'] / df['Pieces Produced'].shift(1))-1
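That ratio is exactly what Series.pct_change computes; a minimal equivalence check, using the Peice_Produced column name from the sample data in this thread:

import pandas as pd

df = pd.DataFrame({'Peice_Produced': [10, 5, 15, 6, 21, 7, 28, 8, 36, 9, 45]})
# (x / x.shift(1)) - 1 and pct_change() give the same result
df['change'] = df['Peice_Produced'].pct_change()
print(df)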
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(25).reshape((5, 5)),
                   index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal'] = [1, -1, 1, -1, 1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the value of trading_signal from df1 and send it to df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge, you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on joining a single index to a MultiIndex:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame.
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining on the right for a reason; if you don't understand what this means, please read Brief primer on merge methods (relational algebra).
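For completeness, a merge-based sketch of the same operation (assuming df1's index has been named 'Date' as above; reset_index is needed because merge matches on columns):

df2_with_signal = (df2.reset_index()
                      .merge(df1[['trading_signal']],
                             left_on='Date', right_index=True)
                      .set_index(['Date', 'Time']))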