How can I get a count of the number of periods in a Pandas DatetimeIndex using a frequency string (offset alias)? For example, let's say I have the following DatetimeIndex:
idx = pd.date_range("2019-03-01", periods=10000, freq='5T')
I would like to know how many 5 minute periods are in a week, or '7D'. I can calculate this "manually":
periods = (7*24*60)//5
Or I can get the length of a dummy index:
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
Neither approach seems very efficient. Is there a better way using Pandas date functionality?
try using numpy
len(np.arange(pd.Timedelta('1 days'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
2016
My testing, first import time:
import time
the OP solution:
start_time = time.time()
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
print((time.time() - start_time))
out:
0.0011057853698730469]
using numpy
start_time = time.time()
len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
print((time.time() - start_time))
out:
0.0001723766326904297
Follow the sugestion of #meW, doing the performance test using timeit
using timedelta_range:
%timeit len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
out:
91.1 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using numpy:
%timeit len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
16.3 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I finally figured out a reasonable solution:
pd.to_timedelta('7D')//idx.freq
This has the advantage that I can specify a range using a frequency string (offset alias) and the period or frequency is inferred from the dataframe. The numpy solution suggested by #Terry is still the fastest solution where speed is important.
Related
I want to count how often a regex-expression (prior and ensuing characters are needed to identify the pattern) occurs in multiple dataframe columns. I found a solution which seems a litte slow. Is there a more sophisticated way?
column_A
column_B
column_C
Test • test abc
winter • sun
snow rain blank
blabla • summer abc
break • Data
test letter • stop.
So far I created a solution which is slow:
print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
The str.count should be able to apply to the whole dataframe without hard coding this way. Try
sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
I have tried with 1000 * 1000 dataframes. Here is a benchmark for your reference.
%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use list comprehension and re.search. You can reduce 938 µs to 26.7 µs. (make sure don't create list and use generator)
res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
for col in ['column_A', 'column_B','column_C'])
print(res)
# 5
Benchmark:
%%timeit
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# --------------------------------------------------------------------#
The following code for getting the week number and year works:
import pandas as pd
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year
date weekNo year
0 2021-12-05 48 2021
1 2021-12-12 49 2021
2 2021-12-19 50 2021
3 2021-12-26 51 2021
4 2022-01-02 52 2022
5 2022-01-09 1 2022
6 2022-01-16 2 2022
7 2022-01-23 3 2022
8 2022-01-30 4 2022
9 2022-02-06 5 2022
but,
df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
Gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
TypeError: %d format: a number is required, not Series
I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.
Why doesn't that work when I want a formatted string? The error is clear, because '%d' expects a single decimal value, not a pandas.Series
Providing there is a format code for the value to be extracted, dt.strftime can be used.
This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
'%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
'%Y': Year with century as a decimal number.
import pandas as pd
# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')
# display(df)
date weekYear
0 2021-12-05 48/2021
1 2021-12-12 49/2021
2 2021-12-19 50/2021
3 2021-12-26 51/2021
4 2022-01-02 52/2022
5 2022-01-09 01/2022
6 2022-01-16 02/2022
7 2022-01-23 03/2022
8 2022-01-30 04/2022
9 2022-02-06 05/2022
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df.date.dt.strftime('%V/%Y')
[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can just use:
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
Or using pandas.Series.str.cat
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
Or using list comprehension
df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to use the formatting, can use map to get that map or apply the formatting to every road, the .dt is not needed since you will be working with date itself, not Series of dates. Also isocalendar() returns a tuple where second element is the week number:
df["date"] = pd.to_datetime(df["date"])
df['weekYear'] = df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
See this answer to modify the code to create a timing analysis plot with a varying number of rows.
See IPython: %timeit for the option descriptions.
import pandas as pd
# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])
# functions
def test1(d):
return d.date.dt.strftime('%V/%Y')
def test2(d):
return d['date'].dt.isocalendar().week.astype(str) + '/' + d['date'].dt.year.astype(str)
def test3(d):
return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')
def test4(d):
return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]
def test5(d):
return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)
print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')
test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
I am trying to parse and convert "2020-07-30T20:40:33.1000000Z"in Python:
from datetime import datetime
Data = [{'id': 'XXXXXXXXXXXXX', 'number': 3, 'externalId': '0000', 'dateCreated': '2020-07-30T20:40:33.1005865Z', 'dateUpdated': '2020-07-30T20:40:33.36Z'}], 'tags': []}]
for i in Data:
creationtime= datetime.strptime(i["dateCreated"],"%Y-%m-%dT%H:%M:%S")
Error:
raise ValueError("unconverted data remains: %s" %
ValueError: unconverted data remains: .1005865Z
I tried :
%Y-%m-%dT%H:%M:%S.%fZ
Can anyone please suggest the correct format that I am missing.
if you really have 7 decimal places of fractional seconds and don't care about the 1/10th of the microseconds, you could use a re.sub and datetime.fromisoformat:
import re
from datetime import datetime
s = "2020-07-30T20:40:33.1000000Z"
dt = datetime.fromisoformat(re.sub('[0-9]Z', '+00:00', s))
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
datetime.datetime(2020, 7, 30, 20, 40, 33, 100000, tzinfo=datetime.timezone.utc)
...or use dateutil's parser:
from dateutil import parser
dt = parser.parse(s)
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
datetime.datetime(2020, 7, 30, 20, 40, 33, 100000, tzinfo=tzutc())
...or even pandas's to_datetime, if you maybe work with that lib anyway:
import pandas as pd
dt = pd.to_datetime(s)
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
Timestamp('2020-07-30 20:40:33.100000+0000', tz='UTC')
often irrelevant (depending on use-case) but note that convenience costs you some more time:
%timeit datetime.fromisoformat(re.sub('[0-9]Z', '+00:00', s))
1.92 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit parser.parse(s)
79.8 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.to_datetime(s)
62.4 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Im using 12 hours sensor data at 25Hrz that I query from mongo db into a dataframe
I'm trying to extract a list or a dict of 1 minute dataframes from the 12 hours.
I use a window of 1 minute and a stride/ step of 10 seconds.
The goal is to build a dataset by creating al list or dict of 1 minute dataframes/samples from 12 hours of data, that will be converted to tensor and fed to a deep learning model.
The index of the dataframe is datetime and 4 columns of sensor values.
here is how part of the data looks like:
A B C D
2020-06-17 22:00:00.000 1.052 -0.147 0.836 0.623
2020-06-17 22:00:00.040 1.011 -0.147 0.820 0.574
2020-06-17 22:00:00.080 1.067 -0.131 0.868 0.607
2020-06-17 22:00:00.120 1.033 -0.163 0.820 0.607
2020-06-17 22:00:00.160 1.030 -0.147 0.820 0.607
below is a sample code that is similar to how I extract windows of 1 minutes data. For 12 hours it takes 5 minutes-which is a long time..
Any ideas on how to reduce the running time in this case?
step= 10*25
w=60*25
df # 12 hours df data
sensor_dfs=[]
df_range = range(0, df.shape[0]-step, step)
for a in df_range:
sample = df.iloc[a:a+w]
sensor_dfs.append(sample)
I created random data and made the following experiments looking at runtime:
# create random normal samples
w= 60*25 # 1 minute window
step=w # no overlap
num_samples=50000
data= np.random.normal(size=(num_samples,3))
date_rng=pd.date_range(start="2020-07-09 00:00:00.000",
freq="40ms",periods=num_samples)
data=pd.DataFrame(data, columns=["x","y","z"], index=date_rng)
data.head()
x y z
2020-07-09 00:00:00.000 -1.062264 -0.008656 0.399642
2020-07-09 00:00:00.040 0.182398 -1.014290 -1.108719
2020-07-09 00:00:00.080 -0.489814 -0.020697 0.651120
2020-07-09 00:00:00.120 -0.776405 -0.596601 0.611516
2020-07-09 00:00:00.160 0.663900 0.149909 -0.552779
numbers are of type float64
data.dtypes
x float64
y float64
z float64
dtype: object
using for loops
minute_samples=[]
for i in range(0,len(data)-w,step):
minute_samples.append(data.iloc[i:i+w])
result:6.45 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using list comprehension
minute_samples=[data.iloc[i:i+w] for i in range(0,len(data)-w,step)]
result: 6.13 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with list comprehension
minute_samples=[df for i, df in data.groupby(pd.Grouper(freq="1T"))]
result:7.89 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using grouper with dict
minute_samples=dict(tuple(data.groupby(pd.Grouper(freq="1T"))))
result: 7.41 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
resample is also an option here but since behind the scenes it uses grouper then I don't think it will be different in terms of runtime
It seems like list comprehension is slightly better than the rest
How do I convert a string to a date object in python?
The string would be: "24052010" (corresponding to the format: "%d%m%Y")
I don't want a datetime.datetime object, but rather a datetime.date.
You can use strptime in the datetime package of Python:
>>> import datetime
>>> datetime.datetime.strptime('24052010', "%d%m%Y").date()
datetime.date(2010, 5, 24)
import datetime
datetime.datetime.strptime('24052010', '%d%m%Y').date()
Directly related question:
What if you have
datetime.datetime.strptime("2015-02-24T13:00:00-08:00", "%Y-%B-%dT%H:%M:%S-%H:%M").date()
and you get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/_strptime.py", line 308, in _strptime
format_regex = _TimeRE_cache.compile(format)
File "/usr/local/lib/python2.7/_strptime.py", line 265, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "/usr/local/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/local/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'H' as group 7; was group 4
and you tried:
<-24T13:00:00-08:00", "%Y-%B-%dT%HH:%MM:%SS-%HH:%MM").date()
but you still get the traceback above.
Answer:
>>> from dateutil.parser import parse
>>> from datetime import datetime
>>> parse("2015-02-24T13:00:00-08:00")
datetime.datetime(2015, 2, 24, 13, 0, tzinfo=tzoffset(None, -28800))
If you are lazy and don't want to fight with string literals, you can just go with the parser module.
from dateutil import parser
dt = parser.parse("Jun 1 2005 1:33PM")
print(dt.year, dt.month, dt.day,dt.hour, dt.minute, dt.second)
>2005 6 1 13 33 0
Just a side note, as we are trying to match any string representation, it is 10x slower than strptime
you have a date string like this, "24052010" and you want date object for this,
from datetime import datetime
cus_date = datetime.strptime("24052010", "%d%m%Y").date()
this cus_date will give you date object.
you can retrieve date string from your date object using this,
cus_date.strftime("%d%m%Y")
For single value the datetime.strptime method is the fastest
import arrow
from datetime import datetime
import pandas as pd
l = ['24052010']
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.86 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
305 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
46 µs ± 978 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a list of values the pandas pd.to_datetime is the fastest
l = ['24052010'] * 1000
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.32 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
1.76 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
44.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For ISO8601 datetime format the ciso8601 is a rocket
import ciso8601
l = ['2010-05-24'] * 1000
%timeit _ = list(map(lambda x: ciso8601.parse_datetime(x).date(), l))
241 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
There is another library called arrow really great to make manipulation on python date.
import arrow
import datetime
a = arrow.get('24052010', 'DMYYYY').date()
print(isinstance(a, datetime.date)) # True
string "24052010"
In a very manual way you could just go like this:-
first split the string as (yyyy-mm-dd) format so you could get a tuple something like this (2010, 5, 24), then simply convert this tuple to a date format something like 2010-05-24.
you could run this code on a list of string object similar to above and convert the entire list of tuples object to date object by simply unpacking(*tuple) check the code below.
import datetime
#for single string simply use:-
my_str= "24052010"
date_tup = (int(my_str[4:]),int(my_str[2:4]),int(my_str[:2]))
print(datetime.datetime(*date_tup))
output: 2012-01-01 00:00:00
# for a list of string date objects you could use below code.
date_list = []
str_date = ["24052010", "25082011", "25122011","01012012"]
for items in str_date:
date_list.append((int(items[4:]),int(items[2:4]),int(items[:2])))
for dates in date_list:
# unpack all tuple objects and convert to date
print(datetime.datetime(*dates))
output:
2010-05-24 00:00:00
2011-08-25 00:00:00
2011-12-25 00:00:00
2012-01-01 00:00:00
Use time module to convert data.
Code snippet:
import time
tring='20150103040500'
var = int(time.mktime(time.strptime(tring, '%Y%m%d%H%M%S')))
print var