I am analyzing time-series data for one stock to find the highest price for further analysis. Here is the sample dataframe df:
date close high_3days
2021-05-01 20 20
2021-05-02 23 23
2021-05-03 26 26
2021-05-04 24 26
2021-05-05 20 26
2021-05-06 22 26
2021-05-07 20 26
2021-05-08 30 30
2021-05-09 20 30
2021-05-10 20 30
I want to add a new column with the number of days since the previous 3-day high. My logic is to find the index of the row of the previous high and subtract it from the index of the current row.
Here is the desired output:
date close high_3days days_previous_high
2021-05-01 20 20 0
2021-05-02 23 23 0
2021-05-03 26 26 0
2021-05-04 24 26 1
2021-05-05 20 26 2
2021-05-06 22 26 3
2021-05-07 20 26 4
2021-05-08 30 30 0
2021-05-09 20 30 1
2021-05-10 20 30 2
Could you help me figure out a way to do this? Thanks!
Try creating a boolean index with expanding max, then enumerate each group with groupby cumcount:
df['days_previous_high'] = df.groupby(
    df['high_3days'].expanding().max().diff().gt(0).cumsum()
).cumcount()
df:
date close high_3days days_previous_high
0 2021-05-01 20 20 0
1 2021-05-02 23 23 0
2 2021-05-03 26 26 0
3 2021-05-04 24 26 1
4 2021-05-05 20 26 2
5 2021-05-06 22 26 3
6 2021-05-07 20 26 4
7 2021-05-08 30 30 0
8 2021-05-09 20 30 1
9 2021-05-10 20 30 2
Explanation:
expanding max is used to determine the current maximum value at each row.
df['high_3days'].expanding().max()
diff can be used to see where the current value exceeds the previous running max.
df['high_3days'].expanding().max().diff()
groups can be created by taking the cumsum of where the diff is greater than 0:
df['high_3days'].expanding().max().diff().gt(0).cumsum()
expanding_max expanding_max_diff expanding_max_gt_0 expanding_max_gt_0_cs
20.0 NaN False 0
23.0 3.0 True 1
26.0 3.0 True 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
30.0 4.0 True 3
30.0 0.0 False 3
30.0 0.0 False 3
Now that rows are grouped, groupby cumcount can be used to enumerate each group:
df.groupby(df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
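For reference, here is the whole chain as a runnable sketch, rebuilding the sample frame from the question (the close column is omitted since only high_3days drives the logic):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'date': pd.date_range('2021-05-01', periods=10),
    'high_3days': [20, 23, 26, 26, 26, 26, 26, 30, 30, 30],
})

# Rows where the running max increases start a new group;
# cumcount then numbers the days elapsed since that high
groups = df['high_3days'].expanding().max().diff().gt(0).cumsum()
df['days_previous_high'] = df.groupby(groups).cumcount()
print(df['days_previous_high'].tolist())  # [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]
```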
I have a pandas df, like this:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
I would like to calculate the number of days since the last non-zero value, as in the example below. How can I use diff() for this? The calculation also resets for each ID.
Output:
ID date value days_last_value
0 10 2022-01-01 100 0
1 10 2022-01-02 150 1
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200 3
5 10 2022-01-06 0
6 10 2022-01-07 150 2
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100 5
12 23 2022-02-01 490 0
13 23 2022-02-02 0
14 23 2022-02-03 350 2
15 23 2022-02-04 333 1
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211 4
20 23 2022-02-09 100 1
Explanation below.
import pandas as pd
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
ID value days_last_value
0 10 100 0.0
1 10 150 1.0
2 10 0 NaN
3 10 0 NaN
4 10 200 3.0
5 10 0 NaN
6 10 150 2.0
7 10 0 NaN
8 10 0 NaN
9 10 0 NaN
10 10 0 NaN
11 10 100 5.0
12 23 490 0.0
13 23 0 NaN
14 23 350 2.0
15 23 333 1.0
16 23 0 NaN
17 23 0 NaN
18 23 0 NaN
19 23 211 4.0
20 23 100 1.0
First, we'll have to group by 'ID'.
We also create groups for each block of days by building a True/False series where value is not 0, then taking a cumulative sum. That is the (df['value'] != 0).cumsum() part, which results in
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 4
8 4
9 4
10 4
11 5
12 6
13 6
14 7
15 8
16 8
17 8
18 8
19 9
20 10
We can use the values in this series to also group on; combining that with the 'ID' group, you have the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block we get its size, which is exactly the interval in days you want: a row's gap equals the size of the block before it. So we shift the sizes down by one within each ID group, filling with 0 at the top of each group. Since the block grouping is lost after .size(), we group by ID again before shifting.
This new series needs to be assigned back to the dataframe, but it's shorter and its index doesn't line up, so we can't assign it directly (not with df['days_last_value'], df.loc[...] or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of days equal to that.
Now it's a simple step to assign days to the relevant column of the dataframe: pandas will match the indices.
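Putting the whole recipe together as a runnable sketch (the date column is left out, as in the construction snippet above, since only ID and value drive the logic):

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    'ID': 12 * [10] + 9 * [23],
    'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100,
              490, 0, 350, 333, 0, 0, 0, 211, 100],
})

# Each non-zero value starts a new block; block sizes are the day gaps
blocks = df['value'].ne(0).cumsum()
days = (df.groupby(['ID', blocks]).size()
          .groupby('ID').shift(fill_value=0))
# Re-index onto the non-zero rows, then let pandas align on assignment
days.index = df.index[df['value'].ne(0)]
df['days_last_value'] = days
print(df.loc[df['value'].ne(0), 'days_last_value'].tolist())
# [0.0, 1.0, 3.0, 2.0, 5.0, 0.0, 2.0, 1.0, 4.0, 1.0]
```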
So I have a data frame that is something like this
Resource 2020-06-01 2020-06-02 2020-06-03
Name1 8 7 8
Name2 7 9 9
Name3 10 10 10
Imagine that the header is literal all the days of the month. And that there are way more names than just three.
I need to reduce the columns to five. Considering the first column to be the days between 2020-06-01 till 2020-06-05. Then from Saturday till Friday of the same week. Or the last day of the month if it is before Friday. So for June would be these weeks:
week 1: 2020-06-01 to 2020-06-05
week 2: 2020-06-06 to 2020-06-12
week 3: 2020-06-13 to 2020-06-19
week 4: 2020-06-20 to 2020-06-26
week 5: 2020-06-27 to 2020-06-30
I have no problem defining these weeks. The problem is grouping the columns based on them.
I couldn't come up with anything.
Does someone have any ideas about this?
I had to use this code to generate your dataframe.
import numpy as np
import pandas as pd

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({
    'Name1': np.random.randint(1, 10, size=len(dates)),
    'Name2': np.random.randint(1, 10, size=len(dates)),
    'Name3': np.random.randint(1, 10, size=len(dates)),
})
df = df.set_index(dates).transpose().reset_index().rename(columns={'index': 'Resource'})
Then, the solution starts from here.
# Set the first column as index (this also drops it from the columns)
df = df.set_index('Resource')
# Transpose the dataframe
df = df.transpose()
# Output:
Resource Name1 Name2 Name3
2020-06-01 00:00:00 3 2 7
2020-06-02 00:00:00 5 6 8
2020-06-03 00:00:00 2 3 6
...
# Bring "Resource" from index to column
df = df.reset_index()
df = df.rename(columns={'index': 'Resource'})
# Add a column "week of year"
df['week_no'] = df['Resource'].dt.isocalendar().week  # dt.weekofyear is deprecated
# You can simply group by the week no column
df.groupby('week_no').sum().reset_index()
# Output:
Resource week_no Name1 Name2 Name3
0 23 38 42 41
1 24 37 30 43
2 25 38 29 23
3 26 29 40 42
4 27 2 8 3
I don't know what you want to do next. If you want the original shape back, just transpose() it again.
EDIT: OP clarified that the week should start on Saturday and end on Friday.
# 0: Monday
# 1: Tuesday
# 2: Wednesday
# 3: Thursday
# 4: Friday
# 5: Saturday
# 6: Sunday
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']
Output:
Resource Name1 Name2 Name3 week_no weekday customised_weekno
0 2020-06-01 4 7 7 23 0 23
1 2020-06-02 8 6 7 23 0 23
2 2020-06-03 5 9 5 23 0 23
3 2020-06-04 7 6 5 23 0 23
4 2020-06-05 6 3 7 23 0 23
5 2020-06-06 3 7 6 23 1 24
6 2020-06-07 5 4 4 23 1 24
7 2020-06-08 8 1 5 24 0 24
8 2020-06-09 2 7 9 24 0 24
9 2020-06-10 4 2 7 24 0 24
10 2020-06-11 6 4 4 24 0 24
11 2020-06-12 9 5 7 24 0 24
12 2020-06-13 2 4 6 24 1 25
13 2020-06-14 6 7 5 24 1 25
14 2020-06-15 8 7 7 25 0 25
15 2020-06-16 4 3 3 25 0 25
16 2020-06-17 6 4 5 25 0 25
17 2020-06-18 6 8 2 25 0 25
18 2020-06-19 3 1 2 25 0 25
So, you can use customised_weekno for grouping.
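Note that Series.dt.weekofyear has since been removed from pandas (in 2.0); dt.isocalendar().week is the replacement. A minimal sketch of the Saturday-to-Friday bucketing with the modern API, using toy Name1 values assumed purely for illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-06-01', '2020-06-30')
df = pd.DataFrame({'Resource': dates, 'Name1': np.ones(len(dates), dtype=int)})

# ISO week number (dt.weekofyear was removed in pandas 2.0)
week_no = df['Resource'].dt.isocalendar().week
# Saturday (5) and Sunday (6) roll forward into the following week
df['customised_weekno'] = week_no + (df['Resource'].dt.weekday >= 5).astype(int)

# June 2020 falls into five custom weeks of 5, 7, 7, 7 and 4 days
print(df.groupby('customised_weekno')['Name1'].sum().tolist())  # [5, 7, 7, 7, 4]
```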
I have a csv of daily maximum temperatures. I am trying to assign a "rank" to my data. I first sorted the daily maximum temperatures from lowest to highest, then created a new column called rank.
#Sort data smallest to largest
ValidFullData_Sorted=ValidFullData.sort_values(by="TMAX")
#count total obs
n=ValidFullData_Sorted.shape[0]
#add a numbered column 1-> n to use in return calculation for rank
ValidFullData_Sorted.insert(0,'rank',range(1,1+n))
How can I make the rank the same for values of daily maximum temperature that are the same? (i.e. every time the daily maximum temperature reaches 95° the rank for each of those instances should be the same)
Here is some sample data:(its daily temperature data so its thousands of lines long)
Date TMAX TMIN
1/1/00 22 11
1/2/00 26 12
1/3/00 29 14
1/4/00 42 7
1/5/00 42 21
And I want to add a TMAXrank column that would look like this:
Date TMAX TMIN TMAXRank
1/1/00 22 11 4
1/2/00 26 12 3
1/3/00 29 14 2
1/4/00 42 7 1
1/5/00 42 21 1
ValidFullData['TMAXRank'] = ValidFullData[ValidFullData['TMAX'] < 95]['TMAX'].rank(ascending=False, method='dense')
Output:
Unnamed: 0 TMAX TMIN TMAXRank
17 17 88 14 1.0
16 16 76 12 2.0
15 15 72 11 3.0
14 14 64 21 4.0
8 8 62 7 5.0
7 7 58 14 6.0
13 13 58 7 6.0
18 18 55 7 7.0
3 3 42 7 8.0
4 4 42 21 8.0
6 6 41 12 9.0
12 12 37 14 10.0
5 5 36 11 11.0
2 2 29 14 12.0
1 1 26 12 13.0
0 0 22 11 14.0
9 9 98 21 NaN
10 10 112 11 NaN
11 11 98 12 NaN
19 19 95 21 NaN
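A quick check of method='dense' against the five sample rows from the question (without the < 95 filter): ties share a rank, and 'dense' leaves no gap after a tie.

```python
import pandas as pd

df = pd.DataFrame({'TMAX': [22, 26, 29, 42, 42]})
# Ties share one rank; 'dense' increments by 1 after each distinct value
df['TMAXRank'] = df['TMAX'].rank(ascending=False, method='dense')
print(df['TMAXRank'].tolist())  # [4.0, 3.0, 2.0, 1.0, 1.0]
```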
Happy new year to all!
I guess this question might be an easy one, but I can't figure it out.
How can I turn hourly data into 15-minute buckets quickly in Python (see table below)? Basically the left column should be converted into the right one: just duplicate each hourly value four times and dump it into a new column.
Thanks for the support!
Cheers!
Hourly 15mins
1 28.90 1 28.90
2 28.88 1 28.90
3 28.68 1 28.90
4 28.67 1 28.90
5 28.52 2 28.88
6 28.79 2 28.88
7 31.33 2 28.88
8 32.60 2 28.88
9 42.00 3 28.68
10 44.00 3 28.68
11 44.00 3 28.68
12 44.00 3 28.68
13 39.94 4 28.67
14 39.90 4 28.67
15 38.09 4 28.67
16 39.94 4 28.67
17 44.94 5 28.52
18 66.01 5 28.52
19 49.45 5 28.52
20 48.37 5 28.52
21 38.02 6 28.79
22 34.55 6 28.79
23 33.33 6 28.79
24 32.05 6 28.79
7 31.33
7 31.33
7 31.33
7 31.33
You could also do this by constructing a new DataFrame with numpy methods.
import numpy as np
import pandas as pd

pd.DataFrame(np.column_stack((np.arange(df.shape[0]).repeat(4, axis=0),
                              np.array(df).repeat(4, axis=0))),
             columns=['hours', '15_minutes'])
which returns
hours 15_minutes
0 0 28.90
1 0 28.90
2 0 28.90
3 0 28.90
4 1 28.88
5 1 28.88
...
91 22 33.33
92 23 32.05
93 23 32.05
94 23 32.05
95 23 32.05
column_stack stacks the arrays as columns (axis=1). np.arange(df.shape[0]).repeat(4, axis=0) builds the hour IDs by repeating each of 0 through 23 four times, and the 15-minute values are constructed in the same way. pd.DataFrame then wraps the result and adds the column names.
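A minimal runnable version of this approach, shrunk to three hours so the repeat pattern is easy to see:

```python
import numpy as np
import pandas as pd

# Toy hourly frame: three hours instead of twenty-four
df = pd.DataFrame({'Hourly': [28.90, 28.88, 28.68]})

out = pd.DataFrame(
    np.column_stack((np.arange(df.shape[0]).repeat(4),   # hour id, each repeated 4x
                     np.array(df).repeat(4, axis=0))),   # hourly value, each repeated 4x
    columns=['hours', '15_minutes'])
print(out['15_minutes'].tolist())
# [28.9, 28.9, 28.9, 28.9, 28.88, 28.88, 28.88, 28.88, 28.68, 28.68, 28.68, 28.68]
```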
Create a datetime-like index for your DataFrame; then you can use resample with a forward-fill (the 'T' alias is deprecated in newer pandas in favour of 'min'):
series.resample('15min').ffill()
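A runnable sketch of the resample route. Note the .ffill(), which carries each hourly value forward, and the caveat that upsampling stops at the last index stamp, so the final hour is not expanded past its own label:

```python
import pandas as pd

hourly = pd.Series(
    [28.90, 28.88, 28.68],
    index=pd.date_range('2021-01-01', periods=3, freq='h'),
)
# Upsample to 15-minute buckets; ffill repeats each hourly value forward
quarter = hourly.resample('15min').ffill()
print(quarter.tolist())
# [28.9, 28.9, 28.9, 28.9, 28.88, 28.88, 28.88, 28.88, 28.68]
```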
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(25).reshape((5, 5)), index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal']=[1,-1,1,-1,1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the value of trading_signal from df1 and send it to df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge, you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on joining a single index to a multi index:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining right for a reason, if you don't understand what this means please read Brief primer on merge methods (relational algebra).
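For completeness, a self-contained sketch of the join; df2 is rebuilt here with a hand-made MultiIndex purely for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(25).reshape((5, 5)),
                   index=pd.date_range('2015-01-01', periods=5, freq='D'))
df1['trading_signal'] = [1, -1, 1, -1, 1]
df1.index.names = ['Date']  # must match the first level name of df2

idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2015-01-01'), '22:55:00'),
     (pd.Timestamp('2015-01-01'), '23:55:00'),
     (pd.Timestamp('2015-01-02'), '00:55:00'),
     (pd.Timestamp('2015-01-02'), '01:55:00'),
     (pd.Timestamp('2015-01-02'), '02:55:00')],
    names=['Date', 'Time'])
df2 = pd.DataFrame(np.arange(25).reshape((5, 5)), index=idx)

# The single 'Date' index joins against the matching level of the MultiIndex;
# how='right' keeps df2's rows
out = df1[['trading_signal']].join(df2, how='right')
print(out['trading_signal'].tolist())  # [1, 1, -1, -1, -1]
```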