What is the maximum numpy.datetime64 with ns resolution - python

I want to find the maximum np.datetime64[ns] (so that in my algorithm min() never chooses it). I've tried the suggestion from What is the maximum timestamp numpy.datetime64 can handle? but this gives strange (very wrong!) results in nanosecond resolution:
>>> from datetime import datetime
>>> import numpy as np
>>> np.datetime64(datetime.max, "ns")
numpy.datetime64('1816-03-30T05:56:08.066276376')
I assume this is because datetime.max is a later date than the maximum np.datetime64[ns], so it wraps when converting.
Edit: I've found np.datetime64((1 << 63) - 1, 'ns') works (I assume that is the maximum), but is obviously gross. Is there a nicer way to construct this?
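One slightly tidier construction (a sketch, assuming datetime64[ns] is backed by a signed 64-bit nanosecond count, with the minimum value reserved for NaT):
import numpy as np

# the largest representable datetime64[ns] is simply the largest int64
max_ns = np.datetime64(np.iinfo(np.int64).max, "ns")
print(max_ns)  # 2262-04-11T23:47:16.854775807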

Related

Converting np.array of unix timestamps (dtype '<U21') to np.datetime64

I am looking to process a large amount of data, so I am interested in the fastest way to compute the following:
I have the np.array below, part of an np.ndarray, which I would like to convert from '<U21' to 'np.datetime64' (ms).
When I execute the following code on one entry, it works:
tmp_array[:,0][0].astype(int).astype('datetime64[ms]')
Result: numpy.datetime64('2019-10-09T22:54:00.000')
When I execute the same on the sub-array like so:
tmp_array[:,0] = tmp_array[:,0].astype(int).astype('datetime64[ms]')
I always get the following error:
RuntimeError: The string provided for NumPy ISO datetime formatting was too short, with length 21
numpy version 1.22.4; the array looks like:
array(['1570661640000', '1570661700000', '1570661760000'], dtype='<U21')
I am sure there is a way to use the power of numpy to do this more efficiently but this approach works:
Given your tmp_array of the form:
array(['1570661640000', '1570661700000', '1570661760000'], dtype='<U21')
express the unix base date as:
db = np.datetime64('1970-01-01')
then create the desired datetime array by:
cnvrt_array = np.array([db + np.timedelta64(int(x), 'ms') for x in tmp_array])
This yields the array:
array(['2019-10-09T22:54:00.000', '2019-10-09T22:55:00.000',
'2019-10-09T22:56:00.000'], dtype='datetime64[ms]')
As suggested by @FObersteiner, you can utilize the power of numpy to convert the array at a rate which is an order of magnitude faster than the list-comprehension approach:
cvrted_array = tmp_array.astype(np.longlong).astype("datetime64[ms]")
which yields the same results as the list comprehension
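(For what it's worth, the error in the question most likely comes from assigning the datetime64 results back into the original '<U21' array, which forces a cast back to a 21-character string that is too short for the full ISO format.) A self-contained sanity check of the vectorized approach, using the sample values from the question:
import numpy as np

tmp_array = np.array(['1570661640000', '1570661700000', '1570661760000'],
                     dtype='<U21')

# string -> int64 -> datetime64[ms], fully vectorized
cvrted_array = tmp_array.astype(np.longlong).astype('datetime64[ms]')
print(cvrted_array[0])  # 2019-10-09T22:54:00.000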

pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset

I am confused how pandas blew out of bounds for datetime objects with these lines:
import pandas as pd
BOMoffset = pd.tseries.offsets.MonthBegin()
# here some code sets the all_treatments dataframe and the newrowix, micolix, mocolix counters
all_treatments.iloc[newrowix,micolix] = BOMoffset.rollforward(all_treatments.iloc[i,micolix] + pd.tseries.offsets.DateOffset(months = x))
all_treatments.iloc[newrowix,mocolix] = BOMoffset.rollforward(all_treatments.iloc[newrowix,micolix]+ pd.tseries.offsets.DateOffset(months = 1))
Here all_treatments.iloc[i,micolix] is a datetime set by pd.to_datetime(all_treatments['INDATUMA'], errors='coerce',format='%Y%m%d'), and INDATUMA is date information in the format 20070125.
This logic seems to work on mock data (no errors, dates make sense), so at the moment I cannot reproduce the problem, but it fails on my full data with the following error:
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
In [54]: pd.Timestamp.min
Out[54]: Timestamp('1677-09-21 00:12:43.145225')
In [55]: pd.Timestamp.max
Out[55]: Timestamp('2262-04-11 23:47:16.854775807')
Your value, 2262-05-01 00:00:00, falls outside this range, hence the out-of-bounds error.
Straight out of: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations
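As a quick back-of-the-envelope check of the ~584-year figure:
# a signed 64-bit nanosecond counter spans 2**64 ns around the epoch
total_ns = 2**64
seconds_per_year = 365.25 * 24 * 60 * 60
print(total_ns / 1e9 / seconds_per_year)  # ~584.5 years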
Workaround:
This will force the dates which are outside the bounds to NaT
pd.to_datetime(date_col_to_force, errors = 'coerce')
Setting the errors parameter in pd.to_datetime to 'coerce' causes replacement of out of bounds values with NaT. Quoting the docs:
If ‘coerce’, then invalid parsing will be set as NaT
E.g.:
datetime_variable = pd.to_datetime(datetime_variable, errors = 'coerce')
This does not fix the data (obviously), but still allows processing the non-NaT data points.
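A minimal illustration of that behaviour (dates invented for this example):
import pandas as pd

s = pd.Series(['2000-01-01', '2262-05-01'])
# under the default nanosecond resolution, the second value is out of
# bounds and is coerced to NaT instead of raising
print(pd.to_datetime(s, errors='coerce'))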
The reason you are seeing an error message like "OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-23 00:00:00" is that the pandas Timestamp data type stores dates in nanosecond resolution (from the docs).
This means the date values have to lie between pd.Timestamp.min (1677-09-21 00:12:43.145225) and pd.Timestamp.max (2262-04-11 23:47:16.854775807).
Even if you only want the date with resolution of seconds or microseconds, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above mentioned range.
This is surprising, because databases like SQL Server and libraries like numpy allow dates beyond this range, and in most cases they too use at most 64 bits to store a date.
But here is the difference: SQL Server stores dates in nanosecond resolution, but only to an accuracy of 100 ns (as opposed to 1 ns in pandas). Since the space is limited (64 bits), it's a matter of range vs. accuracy; with the pandas Timestamp we get higher accuracy but a smaller date range.
In the case of numpy's datetime64 data type (pandas is built on top of numpy): if the date falls in the above-mentioned range you can store it in nanoseconds, exactly as pandas does, or you can give up nanosecond resolution and go with microseconds, which gives you a much larger range. This is something that is missing from the pandas Timestamp type.
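For instance, a small sketch of that trade-off (datetime64[us] is one arbitrary choice of coarser unit):
import numpy as np

# year 3000 is out of range for nanoseconds, but fine for microseconds
d_us = np.datetime64('3000-06-19T08:17:14', 'us')
print(d_us)  # 3000-06-19T08:17:14.000000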
However, if you choose to store in nanoseconds and the date is outside the range, then numpy will silently wrap around, and you might get unexpected results (referenced below in option 4).
np.datetime64("3000-06-19T08:17:14.073456178", "ns")
> numpy.datetime64('1831-05-11T09:08:06.654352946')
Now with pandas we have below options,
import pandas as pd
data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)
1) If you are OK with losing the data that is out of range, then simply use the parameter below to convert out-of-range dates to NaT (not a time).
my_df['dob'] = pd.to_datetime(my_df['dob'], errors = 'coerce')
2) If you don't want to lose the data, you can convert the values to the Python datetime type. The column "dob" then has pandas object dtype, but each individual value is a Python datetime. However, doing this we lose the benefit of vectorized functions.
import datetime as dt
my_df['dob'] = my_df['dob'].apply(lambda x: dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
print(type(my_df.iloc[0][1]))
> <class 'datetime.datetime'>
3) Another option is to use numpy instead of a pandas Series where possible. In the case of a pandas dataframe, you can convert a Series (or column in a df) to a numpy array, process the data separately, and then join it back to the dataframe; see the sketch after this list.
4) We can also use pandas timespans (Period), as suggested in the docs. Do check out the difference between Timestamp and Period before using this data type. Date range and frequency here work similarly to numpy (mentioned above in the numpy section).
my_df['dob'] = my_df['dob'].apply(lambda x: pd.Period(x, freq='ms'))
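Following up on option 3, a minimal sketch of the numpy round trip (reusing the example dataframe above; datetime64[s] is an arbitrary choice of coarser unit, and dob_np is a name of my own):
import pandas as pd

data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)

# convert the column to a numpy datetime64[s] array (year 3000 fits),
# process it with numpy, then attach the result back to the dataframe
dob_np = my_df['dob'].to_numpy().astype('datetime64[s]')
my_df['dob_np'] = list(dob_np)  # numpy scalars kept in an object column
print(dob_np)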
You can use strptime() from the datetime library together with a lambda expression to convert text to date values in a Series object:
Example:
df['F'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S') if type(x)==str else np.NaN)
None of the above is great if it deletes your data. Instead, you can keep all of it and adjust the conversion:
# converting from epoch to datetime, maintaining the nanosecond timestamp
xbarout= pd.to_datetime(xbarout.iloc[:,0],unit='ns')
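A self-contained version of that idea (the frame below is a hypothetical stand-in for xbarout, holding raw epoch nanoseconds):
import pandas as pd

# hypothetical frame of raw epoch-nanosecond integers
xbarout = pd.DataFrame({'epoch_ns': [1341100800000000000, 1570661640000000000]})

# converting from epoch to datetime, keeping the nanosecond resolution
converted = pd.to_datetime(xbarout.iloc[:, 0], unit='ns')
print(converted.iloc[0])  # 2012-07-01 00:00:00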

Pandas number of business days between a DatetimeIndex and a Timestamp

This is quite similar to the question here but I'm wondering if there is a clean way in pandas to make a business day aware TimedeltaIndex? Ultimately I am trying to get the number of business days (no holiday calendar) between a DatetimeIndex and a Timestamp. As per the referenced question, something like this works
import pandas as pd
import numpy as np
drg = pd.date_range('2015-07-31', '2015-08-05', freq='B')
A = [d.date() for d in drg]
B = pd.Timestamp('2015-08-05', 'B').date()
np.busday_count(A, B)
which gives
array([3, 2, 1, 0], dtype=int64)
but this seems a bit kludgy. If I try something like
drg - pd.Timestamp('2015-08-05', 'B')
I get a TimedeltaIndex but the business day frequency is dropped
TimedeltaIndex(['-5 days', '-2 days', '-1 days', '0 days'], dtype='timedelta64[ns]', freq=None)
Just wondering if there is a more elegant way to go about this.
TimedeltaIndexes represent fixed spans of time. They can be added to Pandas Timestamps to increment them by fixed amounts. Their behavior is never dependent on whether or not the Timestamp is a business day.
The TimedeltaIndex itself is never business-day aware.
Since the ultimate goal is to count the number of days between a DatetimeIndex and a Timestamp, I would look in another direction than conversion to TimedeltaIndex.
Unfortunately, date calculations are rather complicated, and a number of data structures have sprung up to deal with them -- Python datetime.dates, datetime.datetimes, Pandas Timestamps, NumPy datetime64s. They each have their strengths, but no one of them is good for all purposes. To take advantage of their strengths, it is sometimes necessary to convert between these types.
To use np.busday_count you need to convert the DatetimeIndex and Timestamp to some type np.busday_count understands. What you call kludginess is the code required to convert types. There is no way around that, assuming we want to use np.busday_count -- and I know of no better tool for this job than np.busday_count.
So, although I don't think there is a more succinct way to count business days than the method you propose, there is a far more performant way: convert to datetime64[D]'s instead of Python datetime.date objects:
import pandas as pd
import numpy as np
drg = pd.date_range('2000-07-31', '2015-08-05', freq='B')
timestamp = pd.Timestamp('2015-08-05', 'B')
def using_astype(drg, timestamp):
    A = drg.values.astype('<M8[D]')
    B = timestamp.asm8.astype('<M8[D]')
    return np.busday_count(A, B)

def using_datetimes(drg, timestamp):
    A = [d.date() for d in drg]
    B = pd.Timestamp('2015-08-05', 'B').date()
    return np.busday_count(A, B)
This is over 100x faster for the example above (where len(drg) is close to 4000):
In [88]: %timeit using_astype(drg, timestamp)
10000 loops, best of 3: 95.4 µs per loop
In [89]: %timeit using_datetimes(drg, timestamp)
100 loops, best of 3: 10.3 ms per loop
np.busday_count converts its input to datetime64[D]s anyway, so avoiding this extra conversion to and from datetime.dates is far more efficient.

Is there a datetime ± infinity?

For floats we have special objects like -inf (and +inf), which are guaranteed to compare less than (and greater than) all other numbers.
I need something similar for datetimes; is there any such thing? In-db ordering must work correctly with Django queryset filters, and ideally it should be db-agnostic (but at the very least it must work with MySQL and SQLite) and timezone-agnostic.
At the moment I'm using null/None, but it is creating very messy queries because None is doing the job of both -inf and +inf and I have to explicitly account for all those cases in the queries.
Try this:
>>> import datetime
>>> datetime.datetime.max
datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)
You can get min/max for datetime, date, and time.
There isn't; the best you have is the datetime.datetime.min and datetime.datetime.max values.
These are guaranteed to be the smallest and largest datetime values, but datetime.datetime.min == datetime.datetime.min is still True; everything else is larger. The inverse is true for the datetime.datetime.max value.
There are also min and max values for datetime.date and datetime.time.
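A small sketch of the sentinel pattern this enables (the names are my own):
import datetime

# finite stand-ins for -inf/+inf: they compare correctly against every
# other datetime, but they are real values, not true infinities
NEG_INF = datetime.datetime.min  # 0001-01-01 00:00:00
POS_INF = datetime.datetime.max  # 9999-12-31 23:59:59.999999

d = datetime.datetime(2024, 1, 1)
assert NEG_INF < d < POS_INF
assert min(d, POS_INF) == d  # min() never picks the sentinel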
In case someone is using dates in Pandas dataframe:
>>> import pandas as pd
>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
>>> pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')

Convert numpy.datetime64 to string object in python

I am having trouble converting a numpy.datetime64 object into a string in Python. For example:
t = numpy.datetime64('2012-06-30T20:00:00.000000000-0400')
Into:
'2012.07.01' as a string (note the date shift caused by the timezone offset).
I have already tried converting the datetime64 object to a long and then to a string, but I seem to get this error:
dt = t.astype(datetime.datetime) #1341100800000000000L
time.ctime(dt)
ValueError: unconvertible time
Solution was:
import pandas as pd
ts = pd.to_datetime(str(date))
d = ts.strftime('%Y.%m.%d')
If you don't want to do that conversion gobbledygook and are OK with just one date format, this was the best solution for me:
str(t)[:10]
Out[11]: '2012-07-01'
As noted this works for pandas too
df['d'].astype(str).str[:10]
df['d'].dt.strftime('%Y-%m-%d') # equivalent
You can use Numpy's datetime_as_string function. The unit='D' argument specifies the precision, in this case days.
>>> t = numpy.datetime64('2012-06-30T20:00:00.000000000-0400')
>>> numpy.datetime_as_string(t, unit='D')
'2012-07-01'
t.item().strftime('%Y.%m.%d')
.item() will cast numpy.datetime64 to datetime.datetime, no need to import anything.
There is a route without using pandas; but see caveat below.
Well, the t variable has a resolution of nanoseconds, which can be shown by inspection in python:
>>> t.dtype
dtype('<M8[ns]')
This means that the integer value of t is 10^9 times the UNIX timestamp. The value printed in your question gives that hint. Your best bet is to divide the integer value of t by 1 billion, and then you can use time.strftime:
>>> import time
>>> time.strftime("%Y.%m.%d", time.gmtime(t.astype(int)/1000000000))
2012.07.01
In using this, be conscious of two assumptions:
1) the datetime64 resolution is nanoseconds, and
2) the time stored in the datetime64 is in UTC.
Side note 1: Interestingly, the numpy developers decided [1] that a datetime64 object with a resolution finer than microseconds will be cast to a long type, which explains why t.astype(datetime.datetime) yields 1341100800000000000L. The reason is that a datetime.datetime object can't accurately represent a nanosecond or finer timescale, because the finest resolution supported by datetime.datetime is the microsecond.
Side note 2: Beware the different conventions between numpy 1.10 and earlier vs 1.11 and later:
in numpy <= 1.10, datetime64 is stored internally as UTC, and printed as local time. Parsing is assuming local time if no TZ is specified, otherwise the timezone offset is accounted for.
in numpy >= 1.11, datetime64 is stored internally as timezone-agnostic value (seconds since 1970-01-01 00:00 in unspecified timezone), and printed as such. Time parsing does not assume the timezone, although +NNNN style timezone shift is still permitted and that the value is converted to UTC.
[1]: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/datetime.c see routine convert_datetime_to_pyobject.
I wanted an ISO 8601 formatted string without needing any extra dependencies. My numpy_array has a single datetime64 element. With help from @Wirawan-Purwanto, I added just a bit:
from datetime import datetime
ts = numpy_array.values.astype(datetime)/1000000000
return datetime.utcfromtimestamp(ts).isoformat() # "2018-05-24T19:54:48"
Building on this answer I would do the following:
import numpy
import datetime
t = numpy.datetime64('2012-06-30T20:00:00.000000000')
datetime.datetime.fromtimestamp(t.item() / 10**9).strftime('%Y.%m.%d')
The division by a billion is to convert from nanoseconds to seconds.
Here is a one-liner (note the padding with extra zeros):
datetime.strptime(str(t),'%Y-%m-%dT%H:%M:%S.%f000').strftime("%Y-%m-%d")
code sample
import numpy
from datetime import datetime
t = numpy.datetime64('2012-06-30T20:00:00.000000000-0400')
method 1:
datetime.strptime(str(t),'%Y-%m-%dT%H:%M:%S.%f000').strftime("%Y-%m-%d")
method 2:
datetime.strptime(str(t)[:10], "%Y-%m-%d").strftime("%Y-%m-%d")
output
'2012-07-01'
Also, if someone wants to apply the same approach to a whole series of datetimes in a dataframe, you can follow the steps below:
import pandas as pd
temp = []
for i in range(len(t["myDate"])):
    ts = pd.to_datetime(str(t["myDate"].iloc[i]))
    temp.append(ts.strftime('%Y-%m-%d'))
t["myDate"] = temp
datetime objects can be converted to strings with the built-in str(), which invokes the __str__() method:
str(t)  # equivalent to t.__str__()
