How to Create a Pandas Index Faster?

How to Create a Pandas Index Faster? - python

Why is the following snippet performing so badly:
import numpy
import pandas
time = numpy.array(range(0, 1000000, 10), dtype = numpy.uint32)
index = [ pandas.Timedelta(str(t) + 'ms') for t in time ]
It takes approximately a second and a half on a decent desktop and we are talking only a million of pandas.Timedelta. Any ideas how to rewrite the last line?

If need TimedeltaIndex is possible use to_timedelta or TimedeltaIndex:
index = pd.to_timedelta(time, unit='ms')
Or:
index = pd.TimedeltaIndex(time, unit='ms')

You can also use pd.timedelta_range
index = pd.timedelta_range(0, periods=10000, freq='10ms')

Related

Calculate Pandas df timedelta of index

Would anyone know how to calculate the time delta of the time stamp of the index?
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='T')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
Am I on track using df['elapsed_time'] = pd.TimedeltaIndex(df) at all with this?
This will throw an error:
ValueError: Wrong number of items passed 2000, placement implies 1

This answer is beautiful!
This will create another pandas column which I called time_td where then I can cast it as a timedelta64 where m stands for minutes which I am looking for.
df['time_td'] = df.index.to_series().diff().astype('timedelta64[m]')
I can then sum this time_td column with:
df.time_td.sum()

Python - calculating difference between price extracting time

I need to create a new column and the value should be:
the current fair_price - fair_price 15 minutes ago(or the closest row)
I need to filter who is the row 15 minutes before then calculate the diff.
import numpy as np
import pandas as pd
from datetime import timedelta
df = pd.DataFrame(pd.read_csv('./data.csv'))
def calculate_15min(row):
end_date = pd.to_datetime(row['date']) - timedelta(minutes=15)
mask = (pd.to_datetime(df['date']) <= end_date).head(1)
price_before = df.loc[mask]
return price_before['fair_price']
def calc_new_val(row):
return 'show date 15 minutes before, maybe it will be null, nope'
df['15_min_ago'] = df.apply(lambda row: calculate_15min(row), axis=1)
myFields = ['pkey_id', 'date', '15_min_ago', 'fair_price']
print(df[myFields].head(5))
df[myFields].head(5).to_csv('output.csv', index=False)
I did it using nodejs but python is not my beach, maybe you have a fast solution...
pkey_id,date,fair_price,15_min_ago
465620,2021-05-17 12:28:30,45080.23,fair_price_15_min_before
465625,2021-05-17 12:28:35,45060.17,fair_price_15_min_before
465629,2021-05-17 12:28:40,45052.74,fair_price_15_min_before
465633,2021-05-17 12:28:45,45043.89,fair_price_15_min_before
465636,2021-05-17 12:28:50,45040.93,fair_price_15_min_before
465640,2021-05-17 12:28:56,45049.95,fair_price_15_min_before
465643,2021-05-17 12:29:00,45045.38,fair_price_15_min_before
465646,2021-05-17 12:29:05,45039.87,fair_price_15_min_before
465650,2021-05-17 12:29:10,45045.55,fair_price_15_min_before
465652,2021-05-17 12:29:15,45042.53,fair_price_15_min_before
465653,2021-05-17 12:29:20,45039.34,fair_price_15_min_before
466377,2021-05-17 12:42:50,45142.74,fair_price_15_min_before
466380,2021-05-17 12:42:55,45143.24,fair_price_15_min_before
466393,2021-05-17 12:43:00,45130.98,fair_price_15_min_before
466398,2021-05-17 12:43:05,45128.13,fair_price_15_min_before
466400,2021-05-17 12:43:10,45140.9,fair_price_15_min_before
466401,2021-05-17 12:43:15,45136.38,fair_price_15_min_before
466404,2021-05-17 12:43:20,45118.54,fair_price_15_min_before
466405,2021-05-17 12:43:25,45120.69,fair_price_15_min_before
466407,2021-05-17 12:43:30,45121.37,fair_price_15_min_before
466413,2021-05-17 12:43:36,45133.71,fair_price_15_min_before
466415,2021-05-17 12:43:40,45137.74,fair_price_15_min_before
466419,2021-05-17 12:43:45,45127.96,fair_price_15_min_before
466431,2021-05-17 12:43:50,45100.83,fair_price_15_min_before
466437,2021-05-17 12:43:55,45091.78,fair_price_15_min_before
466438,2021-05-17 12:44:00,45084.75,fair_price_15_min_before
466445,2021-05-17 12:44:06,45094.08,fair_price_15_min_before
466448,2021-05-17 12:44:10,45106.51,fair_price_15_min_before
466456,2021-05-17 12:44:15,45122.97,fair_price_15_min_before
466461,2021-05-17 12:44:20,45106.78,fair_price_15_min_before
466466,2021-05-17 12:44:25,45096.55,fair_price_15_min_before
466469,2021-05-17 12:44:30,45088.06,fair_price_15_min_before
466474,2021-05-17 12:44:35,45086.12,fair_price_15_min_before
466491,2021-05-17 12:44:40,45065.95,fair_price_15_min_before
466495,2021-05-17 12:44:45,45068.21,fair_price_15_min_before
466502,2021-05-17 12:44:55,45066.47,fair_price_15_min_before
466506,2021-05-17 12:45:00,45063.82,fair_price_15_min_before
466512,2021-05-17 12:45:05,45070.48,fair_price_15_min_before
466519,2021-05-17 12:45:10,45050.59,fair_price_15_min_before
466523,2021-05-17 12:45:16,45041.13,fair_price_15_min_before
466526,2021-05-17 12:45:20,45038.36,fair_price_15_min_before
466535,2021-05-17 12:45:25,45029.72,fair_price_15_min_before
466553,2021-05-17 12:45:31,45016.2,fair_price_15_min_before
466557,2021-05-17 12:45:35,45011.2,fair_price_15_min_before
466559,2021-05-17 12:45:40,45007.04,fair_price_15_min_before
This is the CSV

Firstly convert your date column to datetime dtype:
df['date']=pd.to_datetime(df['date'])
Then filter values:
date15min=df['date']-pd.offsets.DateOffset(minutes=15)
out=df.loc[df['date'].isin(date15min.tolist())]
Now Finally do your calculations:
df['price_before_15min']=df['fair_price'].where(df['date'].isin((out['date']+pd.offsets.DateOffset(minutes=15)).tolist()))
df['price_before_15min']=df['price_before_15min'].diff()
df['date_before_15min']=date15min
Now If you print df you will get your desired output
Update:
For that purpose just make a slightly change in the above method:
out=df.loc[df['date'].dt.minute.isin(date15min.dt.minute.tolist())]
df['price_before_15min']=df['fair_price'].where(df['date'].dt.minute.isin((out['date']+pd.offsets.DateOffset(minutes=15)).dt.minute.tolist()))

Copy and convert all values in pandas dataframe

In a dataframe, I have a column "UnixTime" and want to convert it to a new column containing the UTC time.
import pandas as pd
from datetime import datetime
df = pd.DataFrame([1565691196, 1565691297, 1565691398], columns = ["UnixTime"])
unix_list = df["UnixTime"].tolist()
utc_list = []
for i in unix_list:
i = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
utc_list.append(i)
df["UTC"] = utc_list
This works, but I guess there is a smarter approach?

Could you try this:
df["UTC"] = pd.to_datetime(df['UnixTime'], unit='s')

If you mean by smarter approach is pandas-way and less code, then this is your answer :
df["UTC"] = pd.to_datetime(df["UnixTime"], unit = "s")
Hope this helps.

python pandas: vectorized time series window function

I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30 day windows:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
window_start_date_idx = np.argmax(x < start_date_period.end_time)
return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
window_end_date_idx = np.argmin(x > end_date_period.start_time)
return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, this is far too slow doing the row-wise apply for my application. I would greatly appreciate any tips on vectorizing these functions if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs 'window_start_dt' and 'window_end_dt' columns to be added. The current code works, it just need to be vectorized if possible.

EDIT 2: pandas.cut is built-in:
tt=[[1,'2004-01-02',0.1,25,47],
[1,'2004-01-17',0.2,150,8],
[2,'2004-01-29',0.2,150,25],
[3,'2017-07-15',0.3,55,17],
[3,'2016-05-12',0.3,55,47],
[4,'2012-02-23',0.2,150,22],
[4,'2009-10-10',0.1,25,12],
[4,'2014-04-04',0.2,150,2],
[5,'2008-07-09',0.2,150,43]]
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')
df = pd.DataFrame(tt,columns=['customer_id','transaction_dt','product','price','units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1],format='%Y-%m-%d') for sub_t in tt])
the_cut = pd.cut(df['transaction_dt'],bins=start_date_period,right=True,labels=False,include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
Note the change to pd.date_range and the use of the start-of-month and end-of-month flags M and MS, as well as converting the date strings into datetime.

Equivalent in DataFrame.rolling of ngroups from DataFrame.groupby

Is there an equivalent of ngroups from DataFrame.groupby in DataFrame.rolling?
If the window is numeric I get it that it is
nwindows = len(DataFrame)-min_periods+1
but what happens when the window is some freq? Is it lazy-evaluated or is there any variable which contains the number of windows that are going to be used? Some kind of property of the Rolling object.
EDIT: Added example
import pandas as pd
import numpy as np
N = 10
dates = pd.pandas.date_range(start="2017-01-01", periods=N, freq="10s").values
vals = np.random.rand(N)
df = pd.DataFrame(data=list(zip(dates, vals)), columns=['date', 'rnd'])
roll = df.rolling(window='1min', min_periods=2, on='date')
roll.mean()
Rephrasing my question: Can I know beforehand how many times mean() is going to be called?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to Create a Pandas Index Faster? - python

If need TimedeltaIndex is possible use to_timedelta or TimedeltaIndex: index = pd.to_timedelta(time, unit='ms') Or: index = pd.TimedeltaIndex(time, unit='ms')

You can also use pd.timedelta_range index = pd.timedelta_range(0, periods=10000, freq='10ms')

Related

Calculate Pandas df timedelta of index

Python - calculating difference between price extracting time

Copy and convert all values in pandas dataframe

python pandas: vectorized time series window function

Equivalent in DataFrame.rolling of ngroups from DataFrame.groupby

Categories

Resources