Grouping data in DF but keeping all columns in Python

I have a df that includes high and low stock prices by day in 2 minute increments. I am trying to find the high and low for each day. I am able to do so by using the code below but the output only gives me the date and price data. I need to have the time column available as well. I've tried about 100 different ways but cannot get it to work.
high = df.groupby('Date')['High'].max()
low = df.groupby('Date')['Low'].min()
Below are my columns and dtypes.
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -------
 0   High    4277 non-null   float64
 1   Low     4277 non-null   float64
 2   Date    4277 non-null   object
 3   Time    4277 non-null   object
Any suggestions?

transform with boolean indexing:
import numpy as np
import pandas as pd

# sample data
np.random.seed(10)
df = pd.DataFrame({'date': pd.date_range('2020-01-01', '2020-01-03', freq='H'),
                   'high': np.random.randint(1, 10000, 49),
                   'low': np.random.randint(1, 10, 49)})
df['time'] = df['date'].dt.time
df['date'] = df['date'].dt.date

# transform max and min, then assign to a variable
mx = df.groupby('date')['high'].transform('max')
mn = df.groupby('date')['low'].transform('min')

# boolean indexing
high = df[df['high'] == mx]
low = df[df['low'] == mn]
# high
date high low time
4 2020-01-01 9373 9 04:00:00
42 2020-01-02 9647 2 18:00:00
48 2020-01-03 45 5 00:00:00
# low
date high low time
14 2020-01-01 2103 1 14:00:00
15 2020-01-01 3417 1 15:00:00
23 2020-01-01 654 1 23:00:00
27 2020-01-02 2701 1 03:00:00
30 2020-01-02 284 1 06:00:00
36 2020-01-02 6160 1 12:00:00
38 2020-01-02 631 1 14:00:00
40 2020-01-02 3417 1 16:00:00
44 2020-01-02 6860 1 20:00:00
45 2020-01-02 8989 1 21:00:00
47 2020-01-02 2811 1 23:00:00
48 2020-01-03 45 5 00:00:00

Do you want this:
# should use datetime type:
df['Date'] = pd.to_datetime(df['Date'])
df.groupby(df.Date.dt.normalize()).agg({'High': 'max', 'Low': 'min'})

After you apply groupby and the min or max function, you can select the columns you need using loc or iloc:
df.groupby('Date').max().loc[:,['High','Time']]
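If the goal is to keep the Time at which each day's high or low actually occurs, a variant using idxmax/idxmin may be closer to what is needed (a sketch against the column names in the question, not tested on the original data):
# rows where each day's high occurs, keeping Date, High and Time
daily_high = df.loc[df.groupby('Date')['High'].idxmax(), ['Date', 'High', 'Time']]
# rows where each day's low occurs
daily_low = df.loc[df.groupby('Date')['Low'].idxmin(), ['Date', 'Low', 'Time']]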

Related

Calculate average temperature/humidity between 2 dates pandas data frames

I have the following data frames:
df3
Harvest_date  Starting_date
2022-10-06    2022-08-06
2022-02-22    2021-12-22
df (I have all temp and humid data starting from 2021-01-01 till the present)
date                 temp  humid
2022-10-06 00:30:00     2     30
2022-10-06 00:01:00     1     30
2022-10-06 00:01:30     0     30
2022-10-06 00:02:00     0     30
2022-10-06 00:02:30    -2     30
I would like to calculate the avg temperature and humidity between the starting_date and harvest_date. I tried this:
import pandas as pd
df = pd.read_csv (r'C:\climate.csv')
df3 = pd.read_csv (r'C:\Flower_weight_Seson.csv')
df['date'] = pd.to_datetime(df.date)
df3['Harvest_date'] = pd.to_datetime(df3.Harvest_date)
df3['Starting_date'] = pd.to_datetime(df3.Starting_date)
df.style.format({"date": lambda t: t.strftime("%Y-%m-%d")})
df3.style.format({"Harvest_date": lambda t: t.strftime("%Y-%m-%d")})
df3.style.format({"Starting_date": lambda t: t.strftime("%Y-%m-%d")})
for harvest_date, starting_date in zip(df3['Harvest_date'], df3['Starting_date']):
    df3["Season avg temp"] = df[df["date"].between(starting_date, harvest_date)]["temp"].mean()
    df3["Season avg humid"] = df[df["date"].between(starting_date, harvest_date)]["humid"].mean()
I get the same value for all dates. Can someone point out what I did wrong, please?
In your loop, each iteration assigns a single scalar to the entire column, so only the last pair of dates survives. Instead, use DataFrame.loc and write each result to the matching row index of df3:
# changed sample data so the dates overlap with df3
print (df)
date temp humid
0 2022-10-06 00:30:00 2 30
1 2022-09-06 00:01:00 1 33
2 2022-09-06 00:01:30 0 23
3 2022-10-06 00:02:00 0 30
4 2022-01-06 00:02:30 -2 25
for i, harvest_date, starting_date in zip(df3.index, df3['Harvest_date'], df3['Starting_date']):
    mask = df["date"].between(starting_date, harvest_date)
    avg = df.loc[mask, ["temp", 'humid']].mean()
    df3.loc[i, ["Season avg temp", 'Season avg humid']] = avg.to_numpy()
print (df3)
  Harvest_date Starting_date  Season avg temp  Season avg humid
0   2022-10-06    2022-08-06              0.5              28.0
1   2022-02-22    2021-12-22             -2.0              25.0
EDIT: To add a condition that also matches on the Room column, use:
for i, harvest_date, starting_date, room in zip(df3.index,
                                                df3['Harvest_date'],
                                                df3['Starting_date'],
                                                df3['Room']):
    mask = df["date"].between(starting_date, harvest_date) & df['Room'].eq(room)
    avg = df.loc[mask, ["temp", 'humid']].mean()
    df3.loc[i, ["Season avg temp", 'Season avg humid']] = avg.to_numpy()
print (df3)

How to aggregate a function on a specific column for each day

I have a CSV file that has minute data in it.
The end goal is to find the standard deviations of all of the lows ('Low' column) of each day, using all of the data of each day.
The issue is that the CSV file has some holes in it: it does not always have exactly 390 minutes per day (the number of minutes in a trading day). The code looks like this:
import pandas as pd
import datetime as dt
df = pd.read_csv('/Volumes/Seagate Portable/S&P 500 List/AAPL.txt')
df.columns = ['Extra', 'Dates', 'Open', 'High', 'Low', 'Close', 'Volume']
df.drop(['Extra', 'Open', 'High', 'Volume'], axis=1, inplace=True)
df.Dates = pd.to_datetime(df.Dates)
df.set_index(df.Dates, inplace=True)
df = df.between_time('9:30', '16:00')
print(df.Low[::390])
The output is as follows:
Dates
2020-01-02 09:30:00 73.8475
2020-01-02 16:00:00 75.0875
2020-01-03 15:59:00 74.3375
2020-01-06 15:58:00 74.9125
2020-01-07 15:57:00 74.5028
...
2020-12-14 09:41:00 122.8800
2020-12-15 09:40:00 125.9900
2020-12-16 09:39:00 126.5600
2020-12-17 09:38:00 129.1500
2020-12-18 09:37:00 127.9900
Name: Low, Length: 245, dtype: float64
As you can see in the output, if even one 9:30 bar is missing I can no longer step through the data in chunks of 390. My idea for a workaround would be to detect day boundaries instead and take as much data as possible for each day, even with minutes missing: split whenever the time jumps back down from around 15:59/16:00 to around 9:30/9:32. Is that the right approach, and if so, what would be the best way to code it? Or are there other solutions?
Use .groupby() with pandas.Grouper() on 'date', with freq='D' for day, and then aggregate .std() on 'low'.
The 'date' column must be a datetime dtype. Use pd.to_datetime() to convert the 'Dates' column, if needed.
If desired, use df = df.set_index('date').between_time('9:30', '16:00').reset_index() to select only times within a specific range. This would be done before the .groupby().
The 'date' column needs to be the index, to use .between_time().
import requests
import pandas as pd
# sample stock data
periods = '3600'
resp = requests.get('https://api.cryptowat.ch/markets/poloniex/ethusdt/ohlc', params={'periods': periods})
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=['date', 'open', 'high', 'low', 'close', 'volume', 'amount'])
# convert to a datetime format
df['date'] = pd.to_datetime(df['date'], unit='s')
# display(df.head())
date open high low close volume amount
0 2020-11-22 02:00:00 550.544464 554.812114 536.523241 542.000000 2865.381737 1.567462e+06
1 2020-11-22 03:00:00 541.485933 551.621355 540.992000 548.500000 1061.275481 5.796859e+05
2 2020-11-22 04:00:00 548.722267 549.751680 545.153196 549.441709 310.874748 1.703272e+05
3 2020-11-22 05:00:00 549.157866 549.499632 544.135302 546.913493 259.077448 1.416777e+05
4 2020-11-22 06:00:00 547.600000 548.000000 541.668524 544.241871 363.433373 1.979504e+05
# groupby day, using pd.Grouper and then get std of low
std = df.groupby(pd.Grouper(key='date', freq='D'))['low'].std().reset_index(name='low std')
# display(std)
date low std
0 2020-11-22 14.751495
1 2020-11-23 14.964803
2 2020-11-24 6.542568
3 2020-11-25 9.523858
4 2020-11-26 24.041421
5 2020-11-27 8.272477
6 2020-11-28 12.340238
7 2020-11-29 8.444779
8 2020-11-30 10.290333
9 2020-12-01 13.605846
10 2020-12-02 6.201248
11 2020-12-03 9.403853
12 2020-12-04 12.667251
13 2020-12-05 10.180626
14 2020-12-06 4.481538
15 2020-12-07 3.881311
16 2020-12-08 10.518746
17 2020-12-09 12.077622
18 2020-12-10 6.161330
19 2020-12-11 5.035066
20 2020-12-12 6.297173
21 2020-12-13 9.739574
22 2020-12-14 3.505540
23 2020-12-15 3.304968
24 2020-12-16 16.753780
25 2020-12-17 10.963064
26 2020-12-18 5.574997
27 2020-12-19 4.976494
28 2020-12-20 7.243917
29 2020-12-21 16.844777
30 2020-12-22 10.348576
31 2020-12-23 15.769288
32 2020-12-24 10.329158
33 2020-12-25 5.980148
34 2020-12-26 8.530006
35 2020-12-27 21.136509
36 2020-12-28 16.115898
37 2020-12-29 10.587339
38 2020-12-30 7.634897
39 2020-12-31 7.278866
40 2021-01-01 6.617027
41 2021-01-02 19.708119
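Applied to the minute data from the question, the same recipe would look roughly like this (a sketch that assumes the 'Dates' and 'Low' columns from the question's CSV; not tested against the actual file):
# restrict to the trading session, then take the daily standard deviation of 'Low'
df['Dates'] = pd.to_datetime(df['Dates'])
df = df.set_index('Dates').between_time('9:30', '16:00').reset_index()
low_std = df.groupby(pd.Grouper(key='Dates', freq='D'))['Low'].std().reset_index(name='Low std')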

replace values greater than 0 in a range of time in pandas dataframe

I have a large CSV file in which I want to replace values with zero in a particular range of time. For example, between 20:00:00 and 05:00:00 I want to replace all values greater than zero with 0. How do I do it?
dff = pd.read_csv('108e.csv', header=None) # reading the data set
data = df.copy()
df = pd.DataFrame(data)
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
for i in df.set_index('timeStamp').between_time('20:00:00', '05:00:00')['luminosity']:
    if i > 0:
        df[['luminosity']] = df[["luminosity"]].replace({i: 0})
You can use numpy's select function. Note that the 20:00-05:00 window wraps past midnight, so the two time bounds are combined with OR rather than AND:
import numpy as np
t = df['timeStamp'].dt.time
night = (t >= pd.Timestamp('20:00').time()) | (t <= pd.Timestamp('05:00').time())
df['luminosity'] = np.select([night & (df['luminosity'] > 0)], [0], default=df['luminosity'])
More examples and the full signature are in the official numpy.select documentation.
Assume that your DataFrame contains:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 11
2 2020-01-02 22:00:00 12
3 2020-01-03 02:00:00 13
4 2020-01-03 05:00:00 14
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 17
8 2020-01-03 22:10:00 18
9 2020-01-04 02:10:00 19
10 2020-01-04 05:00:00 20
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
To only retrieve rows in the time range of interest you could run:
df.set_index('timeStamp').between_time('20:00' , '05:00')
But if you attempted to modify these data, e.g.
df = df.set_index('timeStamp')
df.between_time('20:00' , '05:00')['luminosity'] = 0
you would get SettingWithCopyWarning, and df itself would not be updated, because the assignment is applied to an intermediate object rather than to the original data.
To circumvent this limitation, you can use indexer_between_time on the index of a DataFrame, which returns a NumPy array of integer locations of the rows meeting your time-range criterion.
To update the underlying data, setting the index only temporarily to get the row positions, you can run:
df.iloc[df.set_index('timeStamp').index
          .indexer_between_time('20:00', '05:00'), 1] = 0
Note that to keep the code short, I passed the int location of the column
of interest.
Access by iloc should be quite fast.
When you print the df again, the result is:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 0
2 2020-01-02 22:00:00 0
3 2020-01-03 02:00:00 0
4 2020-01-03 05:00:00 0
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 0
8 2020-01-03 22:10:00 0
9 2020-01-04 02:10:00 0
10 2020-01-04 05:00:00 0
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22

Python/Pandas: extract intervals from a large dataframe

I have two pandas DataFrames:
20 million rows of continuous time series data with a DatetimeIndex (df)
20 thousand rows with two timestamps each (df_seq)
I want to use the second DataFrame to extract all sequences out of the first (all rows of the first between the two timestamps of each row of the second); each sequence then needs to be transposed into 990 columns, and all sequences have to be combined in a new DataFrame.
So the new DataFrame has one row with 990 columns for each sequence (the case row gets added later).
Right now my code looks like this:
sequences = pd.DataFrame()
for row in df_seq.itertuples(index=True, name='Pandas'):
    sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())
sequences = sequences.reset_index(drop=True)
This code works, but is terribly slow --> 20-25 min execution time
Is there a way to rewrite this in vectorised operations? Or any other way to improve the performance of this code?
Here's a way to do it. The large dataframe is 'df', and the intervals one is called 'intervals':
inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"
intervals = pd.DataFrame([("2020-01-01 00:00:12", "2020-01-01 00:00:18"),
                          ("2020-01-01 00:01:20", "2020-01-01 00:02:03")],
                         columns=["start_time", "end_time"])
intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals
t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time", )
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")
t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]
The result is:
timestamp 0 start_time end_time
12 2020-01-01 00:00:12 12 2020-01-01 00:00:12 2020-01-01 00:00:18
13 2020-01-01 00:00:13 13 2020-01-01 00:00:12 2020-01-01 00:00:18
14 2020-01-01 00:00:14 14 2020-01-01 00:00:12 2020-01-01 00:00:18
15 2020-01-01 00:00:15 15 2020-01-01 00:00:12 2020-01-01 00:00:18
16 2020-01-01 00:00:16 16 2020-01-01 00:00:12 2020-01-01 00:00:18
.. ... ... ... ...
119 2020-01-01 00:01:59 119 2020-01-01 00:01:20 2020-01-01 00:02:03
120 2020-01-01 00:02:00 120 2020-01-01 00:01:20 2020-01-01 00:02:03
121 2020-01-01 00:02:01 121 2020-01-01 00:01:20 2020-01-01 00:02:03
122 2020-01-01 00:02:02 122 2020-01-01 00:01:20 2020-01-01 00:02:03
123 2020-01-01 00:02:03 123 2020-01-01 00:01:20 2020-01-01 00:02:03
After the steps from the answer above, I added a groupby and an unstack, and the result is exactly the df I need.
Execution time is ~30 seconds!
The full code looks now like this:
sequences = pd.merge_asof(df, df_seq[["date"]], left_on="timestamp", right_on="date", )
sequences = pd.merge_asof(sequences, df_seq[["end_date"]], left_on="timestamp", right_on="end_date", direction="forward")
sequences = sequences[(sequences.timestamp >= sequences.date) & (sequences.timestamp <= sequences.end_date)]
sequences = sequences.groupby('date')['feature_1'].apply(lambda df_temp: df_temp.reset_index(drop=True)).unstack().loc[:,:990]
sequences = sequences.reset_index(drop=True)

How can I get different statistics for a rolling datetime range up to a current value in a pandas dataframe?

I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get rolling statistics for column_b over a window of time in datetime_column covering the last N months. The window is further restricted to rows with the same value in column_a.
Code example using a for loop, which is not feasible given the size:
mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big (~2M rows), which means that looping is not an option. I already tried looping and creating a subset based on conditional values, but it takes too long.
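For what it's worth, the per-row computation described above maps onto pandas' time-based rolling windows. A rough sketch (not from the thread; it assumes the column names in the question, a 180-day window, and closed='left' to exclude the current row):
# per column_a group, trailing 180-day mean of column_b, excluding the current row itself
df = df.sort_values('datetime_column')
rolling_mean = (df.set_index('datetime_column')
                  .groupby('column_a')['column_b']
                  .rolling('180D', closed='left')
                  .mean())
The result is indexed by (column_a, datetime_column); mapping it back onto the original rows needs care if timestamps repeat within a group.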
