I have transaction data that I hope to resample in a fashion similar to OHLC stock market prices.
The goal is to display a meaningful price summary for each day.
However, my challenge is that the transactional data is sparse:
There are days without transactions.
The opening price of the day might happen in the middle of the day and does not automatically roll over from the previous day.
I can do the naive OHLC with resample(), as seen in the example below. Because the data is sparse, the straightforward resample gives poor results. To get a meaningful price sample, the following conditions should also be met:
The opening price is always set to the closing price of the previous day.
If a day does not have any transactions, all OHLC values are the closing price of the previous day ("the price does not move").
I can do this in pure Python, as it is not that difficult, but it is not very computationally efficient for high volumes of data. Thus, my question is: does Pandas offer any clever way of doing a resample or aggregate that satisfies the conditions above, without needing to loop over values in Python manually?
The example code is below:
import pandas as pd
# Transactions do not have regular intervals and may miss days
data = {
    "timestamp": [
        pd.Timestamp("2020-01-01 01:00"),
        pd.Timestamp("2020-01-01 05:00"),
        pd.Timestamp("2020-01-02 03:00"),
        pd.Timestamp("2020-01-04 04:00"),
        pd.Timestamp("2020-01-05 00:00"),
    ],
    "transaction": [
        100.00,
        102.00,
        103.00,
        102.80,
        99.88,
    ],
}
df = pd.DataFrame.from_dict(data, orient="columns")
df.set_index("timestamp", inplace=True)
print(df)
transaction
timestamp
2020-01-01 01:00:00 100.00
2020-01-01 05:00:00 102.00
2020-01-02 03:00:00 103.00
2020-01-04 04:00:00 102.80
2020-01-05 00:00:00 99.88
# https://stackoverflow.com/a/36223274/315168
naive_resample = df["transaction"].resample("1D") .agg({'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'})
print(naive_resample)
In this result, you can see that:
open/close do not match across daily boundaries
if a day does not have transactions, the price is marked as NaN
open high low close
timestamp
2020-01-01 100.00 102.00 100.00 102.00
2020-01-02 103.00 103.00 103.00 103.00
2020-01-03 NaN NaN NaN NaN
2020-01-04 102.80 102.80 102.80 102.80
2020-01-05 99.88 99.88 99.88 99.88
You can use the following logic:
Shift "close" to the next row as "prev_close", to use it when processing that next row.
If "open" is NaN, fill in "prev_close" for all OHLC values.
Set "open" to "prev_close" for all other rows.
import math

# Fill missing days' "close" with the last known value.
naive_resample["close"] = naive_resample["close"].ffill()
# Shift "close" to the next row, to use it when processing that row.
naive_resample["prev_close"] = naive_resample["close"].shift(1)
# The first row has no "prev_close". Adjust it to prevent NaN spill-over.
naive_resample.iloc[0, naive_resample.columns.get_loc("prev_close")] = naive_resample.iloc[0, naive_resample.columns.get_loc("open")]
def adjust_ohlc(row):
    # Day without transactions: flat bar at the previous close
    if math.isnan(row["open"]):
        return pd.Series([row["prev_close"]] * 4,
                         index=["open", "high", "low", "close"])
    else:
        # Adjust "open" to the previous close
        return pd.Series([row["prev_close"], row["high"], row["low"], row["close"]],
                         index=["open", "high", "low", "close"])

naive_resample[["open", "high", "low", "close"]] = naive_resample.apply(adjust_ohlc, axis=1)
naive_resample = naive_resample.drop("prev_close", axis=1)
Output:
open high low close
timestamp
2020-01-01 100.0 102.00 100.00 102.00
2020-01-02 102.0 103.00 103.00 103.00
2020-01-03 103.0 103.00 103.00 103.00
2020-01-04 103.0 102.80 102.80 102.80
2020-01-05 102.8 99.88 99.88 99.88
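If you want to skip the row-wise apply() entirely, the same result can be built from vectorized column operations on top of the built-in Resampler.ohlc(). This is only a sketch of the idea, not a benchmarked drop-in for the code above:

ohlc = df["transaction"].resample("1D").ohlc()
# Previous day's close, carried over empty days
prev_close = ohlc["close"].ffill().shift(1)
# Open is always the previous close (keep the very first open as-is)
ohlc["open"] = prev_close.fillna(ohlc["open"])
# Days without transactions become flat bars at the previous close
for col in ["high", "low", "close"]:
    ohlc[col] = ohlc[col].fillna(prev_close)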
Unnamed: 0 open high low close volume Date time Profit Loss proforloss
0 0 252.70 254.25 252.35 252.60 319790 2022-01-03 09:15:00 0.000000 0.039573 -0.039573
1 1 252.60 253.65 251.75 252.80 220927 2022-01-03 09:30:00 0.079177 0.000000 0.079177
2 2 252.95 254.90 252.30 252.85 526445 2022-01-03 09:45:00 0.000000 0.039534 -0.039534
3 3 252.85 253.15 252.40 252.55 280414 2022-01-03 10:00:00 0.000000 0.118647 -0.118647
4 4 252.55 253.10 252.25 252.80 112875 2022-01-03 10:15:00 0.098990 0.000000 0.098990
This is the data given to me. I have calculated the profit and loss for the 15-minute time frame, but how can I get the profit of a day whose starting time is 09:30:00 and closing time is 15:00:00?
How can I also get the max profit and min profit of the day, when the capital is 100,000?
Thanks for your help.
You could filter for the rows within the relevant trading times first, then use groupby() to get the max / min profit of the day. Finally, multiply by the capital.
# Keep only rows between 09:30 and 15:00
df['time'] = pd.to_timedelta(df['time'])
df = df[
    (pd.Timedelta(hours=9, minutes=30) <= df["time"])
    & (df["time"] <= pd.Timedelta(hours=15))
]
# Min / max of Profit and Loss per day, scaled by the capital
df = df.groupby("Date", sort=False)[["Profit", "Loss"]].agg(["min", "max"])
df = df * 100_000
Output:
Profit Loss
min max min max
Date
2022-01-03 0.0 9899.0 0.0 11864.7
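If by "profit of a day" you mean the total over the whole 09:30-15:00 session rather than the extremes, you can also sum the same filtered frame (taken just before the agg() step above) per day and scale it the same way. A sketch under the same assumptions as the code above:

day_totals = df.groupby("Date", sort=False)[["Profit", "Loss"]].sum() * 100_000
print(day_totals)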
I have a pandas dataframe that has datetime in multiple columns and looks similar to below but with hundreds of columns, almost pushing 1k.
datetime, battery, datetime, temperature, datetime, pressure
2020-01-01 01:01:01, 13.8, 2020-01-01 01:01:02, 97, 2020-01-01 01:01:03, 10
2020-01-01 01:01:04, 13.8, 2020-01-01 01:01:05, 97, 2020-01-01 01:01:06, 11
What I have done is import it and then convert every datetime column using pd.to_datetime. This reduces the memory usage by more than half (2.4 GB to 1.0 GB), but I'm wondering if this is still inefficient and whether there is a better way.
Would I benefit from converting this down to 3 columns where I have datetime, data name, data measurement? If so, what is the best method of doing this? I've tried this but end up with a lot of empty spaces.
Would there be another way to handle this data that I'm just not considering?
Or does what I'm doing make sense and is it efficient enough?
I eventually want to plot some of this data by selecting specific data names.
I ran a small experiment with the above data, and converting the data to datetime / type / value columns reduces the overall memory consumption:
print(df)
datetime battery datetime.1 temperature datetime.2 pressure
0 2020-01-01 01:01:01 13.8 2020-01-01 01:01:02 97 2020-01-01 01:01:03 10
1 2020-01-01 01:01:04 13.8 2020-01-01 01:01:05 97 2020-01-01 01:01:06 11
print(df.memory_usage().sum())
==> 224
After converting the dataframe:
dfs = []
for i in range(0, 6, 2):
    # Take each (datetime, measurement) column pair
    d = df.iloc[:, i:i+2].copy()
    d["type"] = d.columns[1]
    d.columns = ["datetime", "value", "type"]
    dfs.append(d)
new_df = pd.concat(dfs)
print(new_df)
==>
datetime value type
0 2020-01-01 01:01:01 13.8 battery
1 2020-01-01 01:01:04 13.8 battery
0 2020-01-01 01:01:02 97.0 temperature
1 2020-01-01 01:01:05 97.0 temperature
0 2020-01-01 01:01:03 10.0 pressure
1 2020-01-01 01:01:06 11.0 pressure
print(new_df.memory_usage().sum())
==> 192
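On top of the reshaping, parsing the timestamps and storing the repeated measurement names as a pandas categorical usually shrinks the long format further; the long format also makes selecting one measurement for plotting straightforward. A sketch on new_df from above (actual savings depend on your data, so measure with deep=True):

new_df["datetime"] = pd.to_datetime(new_df["datetime"])
new_df["type"] = new_df["type"].astype("category")  # labels stored once, rows hold small integer codes
print(new_df.memory_usage(deep=True).sum())

# Plot a single measurement by name
new_df[new_df["type"] == "battery"].plot(x="datetime", y="value")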
I have a dataframe that contains NaN values, and I want to fill the missing data using information from the same month.
The dataframe looks like this:
import pandas as pd

data = {'x': [208.999, -894.0, -171.0, 108.999, -162.0, -29.0, -143.999, -133.0, -900.0],
        'e': [0.105, 0.209, 0.934, 0.150, 0.158, '', 0.333, 0.089, 0.189],
        }
df = pd.DataFrame(data, index=['2020-01-01', '2020-02-01',
                               '2020-03-01', '2020-01-01',
                               '2020-02-01', '2020-03-01',
                               '2020-01-01', '2020-02-01',
                               '2020-03-01'])
df.index = pd.to_datetime(df.index)
df['e'] = df['e'].apply(pd.to_numeric, errors='coerce')
Now I'm using df = df.fillna(df['e'].mean()) to fill the NaN value, but it takes all the column data and gives me 0.27. Is there a way to use only the data of the same month? The result should be 0.56.
Try grouping on index.month, getting the (transformed) mean, and then using fillna:
df.index = pd.to_datetime(df.index)
out = df.fillna({'e':df.groupby(df.index.month)['e'].transform('mean')})
print(out)
x e
2020-01-01 208.999 0.1050
2020-02-01 -894.000 0.2090
2020-03-01 -171.000 0.9340
2020-01-01 108.999 0.1500
2020-02-01 -162.000 0.1580
2020-03-01 -29.000 0.5615
2020-01-01 -143.999 0.3330
2020-02-01 -133.000 0.0890
2020-03-01 -900.000 0.1890
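(The 0.5615 for 2020-03-01 is the mean of the two known March values: (0.934 + 0.189) / 2.)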
Maybe you could use interpolate() instead of fillna(), but you have to sort the index first, i.e.:
df.e.sort_index().interpolate()
Output:
2020-01-01 0.1050
2020-01-01 0.1500
2020-01-01 0.3330
2020-02-01 0.2090
2020-02-01 0.1580
2020-02-01 0.0890
2020-03-01 0.9340
2020-03-01 0.5615
2020-03-01 0.1890
Name: e, dtype: float64
By default linear interpolation is used, so for a single isolated NaN you get the mean of its two neighbours, and the missing value is replaced by 0.5615 as you expected. However, if the NaN is the first sample of a month after sorting, the result will be the mean of the previous month's last value and the current month's next value. On the other hand, this approach still works when a whole month is NaN and there is nothing within the month to average. So depending on how strict you are about the same-month requirement and how your missing values are spread across the dataframe, this solution may or may not be acceptable.
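If you want to stay strictly within each month, the interpolation can also be applied per group, mirroring the groupby from the first answer. A sketch (a month with no values at all then simply stays NaN; group by s.index.to_period('M') instead if the data spans several years):

s = df['e'].sort_index()
# Interpolate only inside each calendar month
strict = s.groupby(s.index.month).transform(lambda g: g.interpolate())
print(strict)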
Having a terrible time finding information on this. I am tracking several completion times every single day to measure them against goal completion time.
I am reading the completion date and time into a pandas dataframe and using df.map to map a dictionary of completion times to create a "goal time" column in a dataframe.
Sample Data:
Date Process
1/2/2020 10:20:00 AM Test 1
1/2/2020 10:25:00 AM Test 2
1/3/2020 10:15:00 AM Test 1
1/3/2020 10:00:00 AM Test 2
Using df.map() to create a column with the goal time:
import datetime as dt

goalmap = {
    'Test 1': dt.datetime.strptime('10:15', '%H:%M'),
    'Test 2': dt.datetime.strptime('10:30', '%H:%M')}
df['Goal Time'] = df['Process'].map(goalmap)
I am then trying to create a new column "Delta" that calculates the time difference between the two in minutes. Most of the issues I am running into relate to the data types. I got it to calculate a time difference by converting column one (Date) using pd.to_datetime, but because my 'Goal Time' column does not store a date, it calculates a massive delta (back to 1900). I've also tried parsing the time out of the Date column, to no avail.
What is the best way to calculate the difference between timestamps only?
I recommend timedelta over datetime:
goalmap = {
    'Test 1': pd.to_timedelta('10:15:00'),
    'Test 2': pd.to_timedelta('10:30:00')}
df['Goal Time'] = df['Process'].map(goalmap)
# Midnight of each Date plus the goal time-of-day
df['Goal_Timestamp'] = df['Date'].dt.normalize() + df['Goal Time']
df['Meet_Goal'] = df['Date'] <= df['Goal_Timestamp']
Output:
Date Process Goal Time Goal_Timestamp Meet_Goal
0 2020-01-02 10:20:00 Test 1 10:15:00 2020-01-02 10:15:00 False
1 2020-01-02 10:25:00 Test 2 10:30:00 2020-01-02 10:30:00 True
2 2020-01-03 10:15:00 Test 1 10:15:00 2020-01-03 10:15:00 True
3 2020-01-03 10:00:00 Test 2 10:30:00 2020-01-03 10:30:00 True
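If you also want the delta in minutes that the question asks for, subtracting the two timestamp columns gives a timedelta that converts cleanly. A sketch building on the columns above (positive values mean the goal was missed by that many minutes):

df['Delta'] = (df['Date'] - df['Goal_Timestamp']).dt.total_seconds() / 60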
My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over a specified time range and save it to a new dataframe.
Let's say
I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00. This is the sum over 6 hours spanning 2 days. I want to find the sum of the data in a time range such as 10 PM to 4 AM the next day and put it in a different data frame, for example df_timerange_sum. Please note that the sum covers times from 2 different dates.
What did I do?
I used sum() to calculate the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df, not of the time range I specified.
I also tried resample, but I cannot find a way to do it for a specific time window.
df['time'].dt.hour.between(10, 4) is always False because no hour is greater than or equal to 10 and less than or equal to 4 at the same time. What you want is to mark between(4, 21) and then negate it to get the other hours.
Here's what I would do:
# Mark the rows between 4AM and 10PM;
# the data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)

# s.cumsum() labels each consecutive False block,
# over which we will take the sum
blocks = s.cumsum()

# again we only care about ~s
(df[~s].groupby(blocks[~s], as_index=False)  # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})  # time: min -- the beginning of each block
)                                            # Open: sum -- the sum of Open over the block
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas.
df.set_index(['time'], inplace=True)  # make time the index col (not 100% necessary)
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])  # new df that stores your desired output + start and end times if you need them
df2['start_time'] = df[df.index.hour == 22].index  # gets/stores all start datetimes
df2['end_time'] = df[df.index.hour == 4].index  # gets/stores all end datetimes
for i, row in df2.iterrows():
    # set_value() was removed from recent pandas; .at[] does the same single-cell assignment
    df2.at[i, 'sum_Open'] = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()
You'd have to add an if statement or something to handle the last day, which ends at 11 PM.
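For reference, pandas can also select a window that wraps past midnight directly with between_time(), and shifting the timestamps by a few hours lets you group each overnight session under a single date. A sketch on the original frame from the question, before any set_index (variable names here are mine):

night = df.set_index("time").between_time("22:00", "04:00")
# Shift by 5 hours so that 22:00 through 04:00 all map to the session's starting date
session_date = (night.index - pd.Timedelta(hours=5)).date
df_timerange_sum = night.groupby(session_date)["Open"].sum().to_frame()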