Optimize a sum by unique id within a certain period - python

I have the dataframe "df" below and am calculating a sum per unique id ("Id").
Can anyone help me optimize the code I have tried?
import pandas as pd
from datetime import datetime, timedelta

df = {'Date': ['2019-01-11 10:23:45', '2019-01-09 10:23:45', '2019-01-11 10:27:45',
               '2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
               '2019-02-09 10:25:45'],
      'Id': ['100', '200', '300', '100', '100', '100', '200'],
      'Amount': [200, 400, 330, 100, 300, 200, 500],
      }
df = pd.DataFrame(df)
df["Date"] = pd.to_datetime(df['Date'])

You can try groupby; that way each look-back window is applied within its sub-group rather than to the whole df:
s = {}
# note: 'NCC' and 'Sys' are columns from the asker's full data,
# not present in the sample dataframe above
for x, y in df.groupby(['Id', 'NCC']):
    for i in y.index:
        start_date = y['Date'][i] - timedelta(seconds=300)
        end_date = y['Date'][i]
        mask = (y['Date'] >= start_date) & (y['Date'] < end_date)
        count = y.loc[mask]
        count = count.loc[y['Sys'] == 1]
        if len(count) == 0:
            s.update({i: 0})
        else:
            s.update({i: count['Amount'].sum()})
df['New'] = pd.Series(s)

If the original data frame has 2 million rows, it would probably be faster to convert the 'Date' column to an index and sort it. Then you can sub-select each 5-minute interval:
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = 0
for end in df.index:
    start = end - pd.Timedelta('5min')
    current_window = df[start:end]  # data frame with 5-minute look-back
    sum_amt = <calc logic applied to `current_window` goes here>
    df.at[end, 'Sum_Amt'] = sum_amt
    print(current_window)
    print()
I'm not following the logic for calculating Sum_Amt, so I left that out.
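If the goal is just the trailing 5-minute sum of Amount per Id, here is a vectorized sketch using a time-based rolling window. It is hedged: it uses only the sample columns (the first answer also filters on 'NCC'/'Sys' columns from the asker's full data), and pandas' time window is right-closed, so it includes the current row, while the asker's loop excludes it:

import pandas as pd

# assumes df from the question above; sort so dates are monotonic within each Id
df = df.sort_values('Date')

# trailing 300-second sum of Amount within each Id
df['New'] = (
    df.groupby('Id')
      .rolling('300s', on='Date')['Amount']
      .sum()
      .reset_index(level=0, drop=True)  # drop the Id level so values align by row index
)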

Related

Python - calculating difference between prices by extracting time

I need to create a new column whose value is:
the current fair_price minus the fair_price 15 minutes ago (or the closest earlier row).
So I need to find the row 15 minutes before each row, then calculate the diff.
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame(pd.read_csv('./data.csv'))

def calculate_15min(row):
    end_date = pd.to_datetime(row['date']) - timedelta(minutes=15)
    mask = (pd.to_datetime(df['date']) <= end_date).head(1)
    price_before = df.loc[mask]
    return price_before['fair_price']

def calc_new_val(row):
    return 'show date 15 minutes before, maybe it will be null, nope'

df['15_min_ago'] = df.apply(lambda row: calculate_15min(row), axis=1)
myFields = ['pkey_id', 'date', '15_min_ago', 'fair_price']
print(df[myFields].head(5))
df[myFields].head(5).to_csv('output.csv', index=False)
I did it using nodejs, but Python is not my forte; maybe you have a fast solution...
pkey_id,date,fair_price,15_min_ago
465620,2021-05-17 12:28:30,45080.23,fair_price_15_min_before
465625,2021-05-17 12:28:35,45060.17,fair_price_15_min_before
465629,2021-05-17 12:28:40,45052.74,fair_price_15_min_before
465633,2021-05-17 12:28:45,45043.89,fair_price_15_min_before
465636,2021-05-17 12:28:50,45040.93,fair_price_15_min_before
465640,2021-05-17 12:28:56,45049.95,fair_price_15_min_before
465643,2021-05-17 12:29:00,45045.38,fair_price_15_min_before
465646,2021-05-17 12:29:05,45039.87,fair_price_15_min_before
465650,2021-05-17 12:29:10,45045.55,fair_price_15_min_before
465652,2021-05-17 12:29:15,45042.53,fair_price_15_min_before
465653,2021-05-17 12:29:20,45039.34,fair_price_15_min_before
466377,2021-05-17 12:42:50,45142.74,fair_price_15_min_before
466380,2021-05-17 12:42:55,45143.24,fair_price_15_min_before
466393,2021-05-17 12:43:00,45130.98,fair_price_15_min_before
466398,2021-05-17 12:43:05,45128.13,fair_price_15_min_before
466400,2021-05-17 12:43:10,45140.9,fair_price_15_min_before
466401,2021-05-17 12:43:15,45136.38,fair_price_15_min_before
466404,2021-05-17 12:43:20,45118.54,fair_price_15_min_before
466405,2021-05-17 12:43:25,45120.69,fair_price_15_min_before
466407,2021-05-17 12:43:30,45121.37,fair_price_15_min_before
466413,2021-05-17 12:43:36,45133.71,fair_price_15_min_before
466415,2021-05-17 12:43:40,45137.74,fair_price_15_min_before
466419,2021-05-17 12:43:45,45127.96,fair_price_15_min_before
466431,2021-05-17 12:43:50,45100.83,fair_price_15_min_before
466437,2021-05-17 12:43:55,45091.78,fair_price_15_min_before
466438,2021-05-17 12:44:00,45084.75,fair_price_15_min_before
466445,2021-05-17 12:44:06,45094.08,fair_price_15_min_before
466448,2021-05-17 12:44:10,45106.51,fair_price_15_min_before
466456,2021-05-17 12:44:15,45122.97,fair_price_15_min_before
466461,2021-05-17 12:44:20,45106.78,fair_price_15_min_before
466466,2021-05-17 12:44:25,45096.55,fair_price_15_min_before
466469,2021-05-17 12:44:30,45088.06,fair_price_15_min_before
466474,2021-05-17 12:44:35,45086.12,fair_price_15_min_before
466491,2021-05-17 12:44:40,45065.95,fair_price_15_min_before
466495,2021-05-17 12:44:45,45068.21,fair_price_15_min_before
466502,2021-05-17 12:44:55,45066.47,fair_price_15_min_before
466506,2021-05-17 12:45:00,45063.82,fair_price_15_min_before
466512,2021-05-17 12:45:05,45070.48,fair_price_15_min_before
466519,2021-05-17 12:45:10,45050.59,fair_price_15_min_before
466523,2021-05-17 12:45:16,45041.13,fair_price_15_min_before
466526,2021-05-17 12:45:20,45038.36,fair_price_15_min_before
466535,2021-05-17 12:45:25,45029.72,fair_price_15_min_before
466553,2021-05-17 12:45:31,45016.2,fair_price_15_min_before
466557,2021-05-17 12:45:35,45011.2,fair_price_15_min_before
466559,2021-05-17 12:45:40,45007.04,fair_price_15_min_before
This is the CSV
First, convert your date column to datetime dtype:
df['date']=pd.to_datetime(df['date'])
Then filter values:
date15min=df['date']-pd.offsets.DateOffset(minutes=15)
out=df.loc[df['date'].isin(date15min.tolist())]
Finally, do your calculations:
df['price_before_15min']=df['fair_price'].where(df['date'].isin((out['date']+pd.offsets.DateOffset(minutes=15)).tolist()))
df['price_before_15min']=df['price_before_15min'].diff()
df['date_before_15min']=date15min
Now if you print df, you will get your desired output.
Update:
For that purpose, just make a slight change to the above method:
out=df.loc[df['date'].dt.minute.isin(date15min.dt.minute.tolist())]
df['price_before_15min']=df['fair_price'].where(df['date'].dt.minute.isin((out['date']+pd.offsets.DateOffset(minutes=15)).dt.minute.tolist()))
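As an alternative to the isin matching above (my suggestion, not part of the original answer), pd.merge_asof is built for exactly this "closest row at or before t minus 15 minutes" lookup. A minimal sketch, assuming df['date'] is already datetime and the frame is sorted by date:

import pandas as pd

df = df.sort_values('date')

# lookup table of past prices, renamed so the merge doesn't collide
lookup = df[['date', 'fair_price']].rename(
    columns={'date': 'date_15min_ago', 'fair_price': 'price_before_15min'})

# for each row, find the closest row at or before date - 15 minutes
df = pd.merge_asof(
    df.assign(target=df['date'] - pd.Timedelta(minutes=15)),
    lookup,
    left_on='target',
    right_on='date_15min_ago',
    direction='backward',
)
df['diff'] = df['fair_price'] - df['price_before_15min']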

How to filter a DataFrame by date in a python function

I tried the following code.
result1 is filtered by the given date, but result2 isn't.
How can I filter by date in a function?
import pandas as pd
over20='https://gist.githubusercontent.com/shinokada/dfcdc538dedf136d4a58b9bcdcfc8f18/raw/d1db4261b76af67dd67c00a400e373c175eab428/LNS14000024.csv'
df_over20 = pd.read_csv(over20)
display(df_over20)
result1=df_over20[df_over20['DATE']>='1972-01-01']
display(result1)
def changedate(item):
    # something more here
    item['DATE'] = pd.to_datetime(item['DATE'])
    start = pd.to_datetime('1972-01-01')
    item[item['DATE'] >= start]
    return item
result2=changedate(df_over20)
display(result2)
In my experience I would make the Date column the index by running:
df.index = df["DATE"]
df.drop("DATE", inplace=True, axis=1)
Try to use the index column:
date = DT.datetime(2020, 4, 1)
x = df[df.index > date]
You can also use the following command to make sure your index is a datetime index:
df.index = pd.to_datetime(df.index)
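Put together for the question's data, a minimal runnable version of that suggestion (assuming import datetime as DT for the DT alias used above):

import datetime as DT
import pandas as pd

# make the DATE column the (datetime) index
df.index = pd.to_datetime(df['DATE'])
df = df.drop('DATE', axis=1)

# then filter using the index, with the question's cutoff date
date = DT.datetime(1972, 1, 1)
x = df[df.index >= date]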
You should not compare datetimes as plain strings; it can lead to bad results.
Use something like this instead:
import datetime

def compare(date1, date2):
    date1 = datetime.datetime.fromisoformat(date1).timestamp()
    date2 = datetime.datetime.fromisoformat(date2).timestamp()
    if date1 > date2:
        return 1
    elif date1 == date2:
        return 0
    else:
        return -1
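For what it's worth, the likely root cause in the question is that the filtered frame inside changedate is computed but never assigned, so the function returns the unfiltered data. A minimal sketch of that fix (my reading of the question's code, not from the answers above):

import pandas as pd

def changedate(item):
    item = item.copy()  # avoid mutating the caller's dataframe
    item['DATE'] = pd.to_datetime(item['DATE'])
    start = pd.to_datetime('1972-01-01')
    # the original code computed this filter but discarded the result;
    # returning (or assigning) it is the fix
    return item[item['DATE'] >= start]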

Correcting Muddled Dates in Pandas DataFrame

I have a million-row time-series dataframe, in which some of the values in the Date column have muddled day/month values.
How do I efficiently unmuddle them without also ruining those that are correct?
# this creates a dataframe with muddled dates
import pandas as pd
import numpy as np
from pandas import Timestamp

start = Timestamp(2013, 1, 1)
dates = pd.date_range(start, periods=942)[::-1]
muddler = {}
for d in dates:
    if d.day < 13:
        muddler[d] = Timestamp(d.year, d.day, d.month)
    else:
        muddler[d] = Timestamp(d.year, d.month, d.day)
df = pd.DataFrame()
df['Date'] = dates
df['Date'] = df['Date'].map(muddler)
# now what? (assuming I don't know how the dates are muddled)
An option might be to calculate a fit for the timestamps and modify those that deviate from the fit by more than a certain threshold. Example:
import pandas as pd
import numpy as np

start = pd.Timestamp(2013, 1, 1)
dates = pd.date_range(start, periods=942)[::-1]
muddler = {}
for d in dates:
    if d.day < 13:
        muddler[d] = pd.Timestamp(d.year, d.day, d.month)
    else:
        muddler[d] = pd.Timestamp(d.year, d.month, d.day)
df = pd.DataFrame()
df['Date'] = dates
df['Date'] = df['Date'].map(muddler)

# convert date col to posix timestamp (np.int64, since np.float was removed from numpy)
df['ts'] = df['Date'].values.astype(np.int64) / 10**9

# calculate a linear fit for ts col
x = np.linspace(df['ts'].iloc[0], df['ts'].iloc[-1], df['ts'].size)
df['ts_linfit'] = np.polyval(np.polyfit(x, df['ts'], 1), x)

# set a thresh and derive a mask that masks differences between
# fit and timestamp greater than thresh:
thresh = 1.2e6  # you might want to tweak this...
m = np.absolute(df['ts'] - df['ts_linfit']) > thresh

# create new date col as copy of original
df['Date_filtered'] = df['Date']

# modify values that were caught in the mask
df.loc[m, 'Date_filtered'] = df['Date_filtered'][m].apply(
    lambda x: pd.Timestamp(x.year, x.day, x.month))

# also to posix timestamp
df['ts_filtered'] = df['Date_filtered'].values.astype(np.int64) / 10**9

ax = df['ts'].plot(label='original')
ax = df['ts_filtered'].plot(label='filtered')
ax.legend()
While attempting to create a minimal reproducible example, I have actually solved my problem -- but I expect there is a more efficient and effective way to do what I'm trying to do...
# i first define a function to examine the dates
def disordered_muddle(date_series, future_first=True):
    """Check whether a series of dates is disordered or just muddled"""
    disordered = []
    muddle = []
    dates = date_series
    different_dates = pd.Series(dates.unique())
    date = different_dates[0]
    for i, d in enumerate(different_dates[1:]):
        # we expect the date's dayofyear to decrease by one
        if d.dayofyear != date.dayofyear - 1:
            # unless the year is changing
            if d.year != date.year - 1:
                try:
                    # we check if the day and month are muddled
                    # if d.day > 12 this will cause an Exception
                    unmuddle = Timestamp(d.year, d.day, d.month)
                    if unmuddle.dayofyear == date.dayofyear - 1:
                        muddle.append(d)
                        d = unmuddle
                    elif unmuddle.year == date.year - 1:
                        muddle.append(d)
                        d = unmuddle
                    else:
                        disordered.append(d)
                except:
                    disordered.append(d)
        date = d
    if len(disordered) == 0 and len(muddle) == 0:
        return False
    else:
        return disordered, muddle

disorder, muddle = disordered_muddle(df['Date'])

# finally unmuddle the dates
date_correction = {}
for d in df['Date']:
    if d in muddle:
        date_correction[d] = Timestamp(d.year, d.day, d.month)
    else:
        date_correction[d] = Timestamp(d.year, d.month, d.day)
df['CorrectedDate'] = df['Date'].map(date_correction)
disordered_muddle(df['CorrectedDate'])
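A small efficiency note on that final loop (my observation, not part of the original post): with a million rows, d in muddle scans a list on every iteration. A set makes the membership test O(1), and iterating the unique dates avoids redundant work:

muddle_set = set(muddle)  # O(1) membership tests instead of O(n) list scans

date_correction = {
    d: Timestamp(d.year, d.day, d.month) if d in muddle_set
    else Timestamp(d.year, d.month, d.day)
    for d in df['Date'].unique()
}
df['CorrectedDate'] = df['Date'].map(date_correction)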

Convert negative duration in seconds to negative %H:%M:%S.%f

I'm making a function to calculate the time difference between two durations using Pandas.
The function is:
import datetime
import numpy as np
import pandas as pd

# ids is a list of row identifiers defined elsewhere in the asker's code

def time_calc(dur1, dur2):
    date1 = pd.to_datetime(pd.Series(dur2))
    date2 = pd.to_datetime(pd.Series(dur1))
    df = pd.DataFrame(dict(ID=ids, DUR1=date2, DUR2=date1))
    df1 = pd.DataFrame(dict(ID=ids, Duration1=date2, Duration2=date1))
    df1['Duration1'] = df['DUR1'].dt.strftime('%H:%M:%S.%f')
    df1['Duration2'] = df['DUR2'].dt.strftime('%H:%M:%S.%f')
    cols = ['ID', 'DUR1', 'DUR2']
    df = df[cols]
    df['diff_seconds'] = df['DUR2'] - df['DUR1']
    df['diff_seconds'] = df['diff_seconds'] / np.timedelta64(1, 's')
    df['TimeDelta'] = df['diff_seconds'].apply(
        lambda d: str(datetime.timedelta(seconds=abs(d))))
    df3 = df1.merge(df, on='ID')
    cols = ['ID', 'Duration1', 'Duration2', 'TimeDelta', 'diff_seconds']
    df3 = df3[cols]
    print(df3)
The math is: Duration2-Duration1=TimeDelta
The function does it nicely:
Duration1 Duration2 TimeDelta diff_seconds
00:00:23.999891 00:00:25.102076 0:00:01.102185 1.102185
00:00:43.079173 00:00:44.621481 0:00:01.542308 1.542308
But when Duration2 < Duration1 we have a negative diff_seconds, but TimeDelta is still positive:
Duration1 Duration2 TimeDelta diff_seconds
00:05:03.744332 00:04:58.008081 0:00:05.736251 -5.736251
So what I need my function to do is to convert TimeDelta to negative value like this:
Duration1 Duration2 TimeDelta diff_seconds
00:05:03.744332 00:04:58.008081 -0:00:05.736251 -5.736251
I suppose I need to convert 'TimeDelta' another way, but all my attempts have failed.
I'd be very thankful if somebody could help me with this.
Thanks in advance!
I've solved this issue.
I pick the timestamps one by one and pass them to a 'time_convert' function:
df['diff_seconds'] = df['DUR2'] - df['DUR1']
df['diff_seconds'] = df['diff_seconds'] / np.timedelta64(1, 's')
lst = []
for i in df['diff_seconds']:
    time_convert(i)  # appends the formatted value to lst
And the time_convert function just prepends "-" to the formatted timestamp if the seconds were negative:
def time_convert(d):
    if d > 0:
        lst.append(str(datetime.timedelta(seconds=d)))
    else:
        lst.append('-' + str(datetime.timedelta(seconds=abs(d))))
And then I've just created a new data frame using lst and merged everything together:
df_t = pd.DataFrame(dict(ALERTS = alerts, TimeDelta = lst))
df_f = df_t.merge(df3, on='ID')
Hope this will help somebody.
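A vectorized alternative (my sketch, not from the original answer) that avoids the per-row function calls: format the absolute value and prepend the sign with np.where:

import datetime
import numpy as np

# format |diff_seconds| and prepend '-' where the difference is negative
secs = df['diff_seconds']
formatted = secs.abs().apply(lambda s: str(datetime.timedelta(seconds=s)))
df['TimeDelta'] = np.where(secs < 0, '-' + formatted, formatted)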

A few operations with df.groupby()

I'm working with a forex dataset, trying to fill my dataframe with open, high, low, close updated every tick.
Here is my code:
import pandas as pd
# pandas settings
pd.set_option('display.max_columns', 320)
pd.set_option('display.max_rows', 320)
pd.set_option('display.width', 320)
# creating dataframe
df = pd.read_csv('https://www.dropbox.com/s/tcek3kmleklgxm5/eur_usd_lastweek.csv?dl=1', names=['timestamp', 'ask', 'bid', 'avol', 'bvol'], parse_dates=[0], header=0)
df['spread'] = df.ask - df.bid
df['symbol'] = 'EURUSD'
times = pd.DatetimeIndex(df.timestamp)
# parameters for df.groupby()
df['date'] = times.date
df['hour'] = times.hour
# 1h candles updated every tick
df['candle_number'] = '...'
df['1h_open'] = '...'
df['1h_high'] = '...'
df['1h_low'] = '...'
df['1h_close'] = '...'
# print(df)
grouped = df.groupby(['date', 'hour'])
for idx, x in enumerate(grouped):
    print(idx)
    print(x)
So as you can see, with the for loop I'm getting the groups.
Now I want to fill the following columns in my dataframe:
idx should be my df['candle_number']
df['1h_open'] must be equal to the very first df.bid in the group
df['1h_high'] = the highest df.bid up to the current row (so, for instance, if there are 350 rows in the group, for the 20th value we take the highest number over the 0-20 span; at the 215th value we take the highest value over the 0-215 span, which can be completely different)
df['1h_low'] = the lowest value up to the current row (same approach as above)
I hope it's not too confusing =)
Cheers
It's convenient to set the index to date and hour:
df_new = df.set_index(['date', 'hour'])
Then apply groupby functions, aggregating by index:
df_new['candle_number'] = df_new.groupby(level=[0,1]).ngroup()
df_new['1h_open'] = df_new.groupby(level=[0,1])['bid'].first()
df_new['1h_high'] = df_new.groupby(level=[0,1])['bid'].cummax()
df_new['1h_low'] = df_new.groupby(level=[0,1])['bid'].cummin()
you can reset_index() back to a flat dataframe.
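One gap worth noting (my addition, not in the answer): 1h_close is never filled. Since the candles are updated every tick, the running close at any row is arguably just the current bid, under that reading of the question:

# running close per tick: the latest bid seen so far is the current bid
df_new['1h_close'] = df_new['bid']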
