Convert timezone of np.datetime64 without loss of precision - python

I have a DataFrame, one of whose columns is of type datetime64[ns]. These represent times in the "Europe/London" timezone, with nanosecond-level precision. (The data comes from an external system.)
I need to convert these to datetime64[ns] entries that represent UTC time instead. In other words, shift each entry by 0 or 1 hours, depending on whether it falls within summer time or not.
What is the best way of doing this?
Unfortunately, I couldn't find any timezone support baked into np.datetime64. At the same time, I can't just directly convert to and work with datetime.datetime objects, as that would mean a loss of precision. The only thing I could think of so far: convert np.datetime64 to datetime.datetime, adjust the timezone, get some sort of timedelta between the unadjusted and adjusted datetime.datetime, and then apply that timedelta back to the np.datetime64. That sounds like a lot of hoops to jump through for something which I'm hoping can be done more easily.
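A sketch of that roundtrip (using pytz; the truncation to microseconds only affects the offset lookup, not the stored value):
import numpy as np
import pytz
from datetime import datetime

tz = pytz.timezone('Europe/London')
t = np.datetime64('2019-05-01T12:00:00.000000010')  # sample value

# Truncate to microseconds to obtain a datetime.datetime for the offset lookup
dt_us = t.astype('datetime64[us]').astype(datetime)
offset = tz.utcoffset(dt_us)  # 1:00:00 during British Summer Time
# Apply the offset to the full-precision value
print(t - np.timedelta64(offset))  # 2019-05-01T11:00:00.000000010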
Thanks!

It appears pandas has some built-in support for this, using the dt accessor:
import pandas as pd
import numpy as np

dt_arr = np.array(['2019-05-01T12:00:00.000000010',
                   '2019-05-01T12:00:00.000000100'],
                  dtype='datetime64[ns]')
df = pd.DataFrame(dt_arr)

# Represent naive datetimes as London time
df[0] = df[0].dt.tz_localize('Europe/London')
# Convert to UTC
df[0] = df[0].dt.tz_convert("UTC")

print(df)
#                                      0
# 0 2019-05-01 11:00:00.000000010+00:00
# 1 2019-05-01 11:00:00.000000100+00:00
Assuming you are starting with ISO 8601 strings in your datetime64[ns] column, you can use dt.tz_localize to assign a time zone to them, then dt.tz_convert to convert them into another time zone.
I will warn, though, that if they came in as integers like 1556708400000000010, there's a good chance they already represent UTC: timestamps given in seconds or nanoseconds are usually Unix epoch times, which are independent of the time zone they were recorded in (a count of seconds/nanoseconds since the Unix epoch, not a civil time).
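For instance, interpreted directly as nanoseconds since the Unix epoch, that number already denotes an unambiguous UTC instant, matching the converted output above:
pd.to_datetime(1556708400000000010, unit='ns', utc=True)
# Timestamp('2019-05-01 11:00:00.000000010+0000', tz='UTC')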

How to use pandas beyond the pd.Timestamp.min and pd.Timestamp.max value?

pd.Timestamp.min
# Timestamp('1677-09-21 00:12:43.145224193')
pd.Timestamp.max
# Timestamp('2262-04-11 23:47:16.854775807')
I found out that pandas has min and max Timestamp values. If I need dates beyond these values, is that possible?
Is it not possible to move the min/max values, like a sliding century window?
Are there any alternatives without pandas?
Thanks very much.
This is a known limitation due to the nanosecond precision of Timestamps.
Timestamp limitations
Since pandas represents timestamps in nanosecond resolution, the time
span that can be represented using a 64-bit integer is limited to
approximately 584 years
The documentation suggests using pandas.period_range:
Representing out-of-bounds span
If you have data that is outside of the Timestamp bounds, see
Timestamp limitations, then you can use a PeriodIndex and/or Series of
Periods to do computations.
pd.period_range("1215-01-01", "1381-01-01", freq="D")
# PeriodIndex(['1215-01-01', '1215-01-02', '1215-01-03', '1215-01-04',
#              '1215-01-05', '1215-01-06', '1215-01-07', '1215-01-08',
#              '1215-01-09', '1215-01-10',
#              ...
#              '1380-12-23', '1380-12-24', '1380-12-25', '1380-12-26',
#              '1380-12-27', '1380-12-28', '1380-12-29', '1380-12-30',
#              '1380-12-31', '1381-01-01'],
#             dtype='period[D]', length=60632)
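Periods support the usual arithmetic, so computations outside the Timestamp bounds work, e.g.:
p = pd.Period("1215-06-15", freq="D")
p + 100
# Period('1215-09-23', 'D')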
Converting a Series
There is no direct method (like to_period) to convert an existing Series; you need to go through a PeriodIndex:
df = pd.DataFrame({'str': ['1900-01-01', '2500-01-01']})
df['period'] = pd.PeriodIndex(df['str'], freq='D').values
output:
print(df)
          str      period
0  1900-01-01  1900-01-01
1  2500-01-01  2500-01-01

print(df.dtypes)
str          object
period    period[D]
dtype: object
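As for alternatives without pandas: numpy's datetime64 itself supports coarser resolutions with a far wider representable range, so plain numpy arrays can hold such dates (a sketch; you lose pandas' indexing conveniences):
import numpy as np

# Day resolution covers roughly +/- 2.5e16 years, far beyond the pandas bounds
dates = np.array(['1215-01-01', '2500-01-01'], dtype='datetime64[D]')
print(dates[1] - dates[0])  # difference as a timedelta64 in days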

convert time to UTC in pandas

I have multiple csv files, I've set DateTime as the index.
df6.set_index("gmtime", inplace=True)
#correct the underscores in old datetime format
df6.index = [" ".join(str(val).split("_")) for val in df6.index]
df6.index = pd.to_datetime(df6.index)
The time was logged in GMT, but I think it's been saved as BST (British Summer Time) because of how the clock was set on the Raspberry Pi.
I want to shift the time one hour backwards. When I use
df6.tz_convert(pytz.timezone('utc'))
it gives me the error below, as it assumes the time is already correct:
Cannot convert tz-naive timestamps, use tz_localize to localize
How can I shift the time by one hour?
Given a column that contains date/time info as strings, you would convert to datetime, localize to a time zone (here: Europe/London), then convert to UTC. You can do that before you set it as the index.
Ex:
import pandas as pd
dti = pd.to_datetime(["2021-09-01"]).tz_localize("Europe/London").tz_convert("UTC")
print(dti) # notice 1 hour shift:
# DatetimeIndex(['2021-08-31 23:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
Note: setting a time zone means that DST is accounted for, i.e. here, during winter you'd have UTC+0 and during summer UTC+1.
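To illustrate, localizing a winter date and a summer date shows the offset changing with DST:
dti = pd.to_datetime(["2021-01-15", "2021-07-15"]).tz_localize("Europe/London")
print(dti)
# DatetimeIndex(['2021-01-15 00:00:00+00:00', '2021-07-15 00:00:00+01:00'],
#               dtype='datetime64[ns, Europe/London]', freq=None)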
To add to FObersteiner's answer (sorry, new user, can't comment on answers yet):
I've noticed that in all the real-world situations I've run across (with full DataFrames or pandas Series instead of just a single date), .tz_localize() and .tz_convert() need to be called slightly differently.
What's worked for me is
df['column'] = pd.to_datetime(df['column']).dt.tz_localize('Europe/London').dt.tz_convert('UTC')
Without the .dt accessor, I get "index is not a valid DatetimeIndex or PeriodIndex."
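If the datetimes already sit in the index, as with the original df6, the same two methods can be called on the DatetimeIndex directly, without .dt (a sketch, assuming the naive index from the question):
# Localize the naive index as London time, then convert to UTC
df6.index = df6.index.tz_localize('Europe/London').tz_convert('UTC')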

Storing with Dask date/timestamp columns in Parquet

I have a Dask data frame that has two columns, a date and a value.
I store it like so:
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)
I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (timestamps, I would say) instead of dates. For example I get:
1546300800000000 instead of 2019-01-01
1548979200000000 instead of 2019-02-01
Is there a way to tell Dask to store columns as dates? How can I run a select with Apache Drill and get the dates? I tried using SELECT CAST in Drill but it doesn't convert the numbers to dates.
Not sure if it is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into dates using .dt.date.
import pandas as pd
import dask.dataframe as dd

sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02',
]
df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
ddf = dd.from_pandas(df, npartitions=2)

# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()

# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date

ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
For the time conversion, you can use .astype or dd.to_datetime (see the answers to this question). There is also a very similar question and answer which suggests that ensuring the timestamp is downcast to ms resolves the issue.
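For instance, the dd.to_datetime variant of the conversion would look like this (equivalent in effect to the .astype('M8[s]') call, reusing the ddf from the example above):
# Parse the strings with dask's to_datetime instead of .astype
ddf['datestring'] = dd.to_datetime(ddf['datestring'])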
So, playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scaling of the variable:
# both yield: Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
If memory serves, Drill uses an old non-standard of INT96 timestamps, which was never officially supported by parquet. A parquet timestamp is essentially a Unix timestamp, stored as an int64, with varying precision. Drill must have a function to correctly convert this to its internal format.
I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but it might give you the idea:
SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000) FROM ...;
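Before fiddling with Drill, it can help to confirm which precision actually landed in the file; the parquet schema can be inspected from Python (a quick check, assuming pyarrow and the test.parquet directory from the earlier example):
import pyarrow.parquet as pq

# Dask writes a directory of part files; inspect one of them (the part name may vary).
# This prints the column types, e.g. timestamp[ms] after coerce_timestamps='ms'.
print(pq.read_schema('test.parquet/part.0.parquet'))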
Here's a link to the Drill docs about the TO_TIMESTAMP() function: https://drill.apache.org/docs/data-type-conversion/#to_timestamp. I think #mdurant is correct in his approach.
I would try either:
SELECT TO_TIMESTAMP(<date_col>) FROM ...
or
SELECT TO_TIMESTAMP((<date_col> / 1000)) FROM ...

Timedelta time difference expressed as float variable

I have data in a pandas DataFrame that is marked by timestamps as datetime objects. I would like to make a graph that treats the time as something fluid. My idea was to subtract the first timestamp from the others (shown here for the second entry)
xhertz_df.loc[1]['Dates']-xhertz_df.loc[0]['Dates']
to get the time passed since the first measurement. This gives 350 days 08:27:51 as a timedelta object. So far so good.
This might be a duplicate, but I have not found the solution here so far. Is there a way to quickly transform this object into a number of e.g. minutes, seconds or hours? I know I could extract the individual days, hours and minutes and do a tedious calculation to get it, but is there a built-in way to just turn this object into what I want?
Something like
timedelta.tominutes
that gives it back as a float of minutes, would be great.
If all you want is a float representation, it is maybe as simple as:
float_index = pd.Index(xhertz_df['Dates'].values.astype(float))
In pandas, Timestamp and Timedelta values are internally handled as numpy datetime64[ns]/timedelta64[ns], that is, an integer number of nanoseconds.
So it is trivial to convert a Timedelta to a number of minutes by dividing its nanosecond value:
(xhertz_df.loc[1, 'Dates'] - xhertz_df.loc[0, 'Dates']).value / 60_000_000_000
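Alternatively, Timedelta exposes total_seconds(), which is the kind of built-in conversion the question asks for; minutes are then a single division:
td = xhertz_df.loc[1, 'Dates'] - xhertz_df.loc[0, 'Dates']
print(td.total_seconds() / 60)  # 504507.85 for the '350 days 08:27:51' above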
Here is a way to do so with timestamp: two examples for converting, and one for the diff.
import datetime as dt
import time

# current date and time
now = dt.datetime.now()
timestamp1 = dt.datetime.timestamp(now)
print("timestamp1 =", timestamp1)

time.sleep(4)

now = dt.datetime.now()
timestamp2 = dt.datetime.timestamp(now)
print("timestamp2 =", timestamp2)

# difference in seconds, as a float
print(timestamp2 - timestamp1)

How can I convert a timestamp string of the form "%d%H%MZ" to a datetime object?

I have timestamp strings of the form "091250Z", where the first two numbers are the date and the last four numbers are the hours and minutes. The "Z" indicates UTC. Assuming the timestamp corresponds to the current month and year, how can this string be converted reliably to a datetime object?
I have the basic parsing sorted, but the task quickly becomes nontrivial when going further, and I'm not sure how to proceed:
datetime.strptime("091250Z", "%d%H%MZ")
What you need is to replace the year and month of your existing datetime object:
from datetime import datetime

your_datetime_obj = datetime.strptime("091250Z", "%d%H%MZ")
new_datetime_obj = your_datetime_obj.replace(year=datetime.now().year, month=datetime.now().month)
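Since the trailing "Z" means UTC, you may also want to attach the timezone explicitly; a standard-library-only sketch:
from datetime import datetime, timezone

parsed = datetime.strptime("091250Z", "%d%H%MZ")
now = datetime.now(timezone.utc)
# Take year/month from the current date and mark the result as UTC
aware = parsed.replace(year=now.year, month=now.month, tzinfo=timezone.utc)
print(aware)  # e.g. 2021-09-09 12:50:00+00:00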
Like this? You've basically already done it, you just needed to assign it to a variable (note the directive must be %d for day-of-month, not %m):
from datetime import datetime

dt = datetime.strptime('091250Z', '%d%H%MZ')
