I need to convert a timezone-aware date_range (Timestamps) to UNIX epoch values for use in an external JavaScript library.
My approach is:
import numpy as np
import pandas as pd

# Create localized test data for one day
rng = pd.date_range('1.1.2014', freq='H', periods=24, tz="Europe/Berlin")
val = np.random.randn(24)
df = pd.DataFrame(data=val, index=rng, columns=['values'])
# Reset index as df column
df = df.reset_index()
# Convert the index column to the desired UNIX epoch format
df['index'] = df['index'].apply(lambda x: x.value // 10**6 )
df['index'] contains the UNIX epoch values as expected, but they are stored in UTC(!).
I suppose this is because pandas stores timestamps in numpy UTC datetime64 values under the hood.
Is there a smart way to get "right" epoch values in the requested time zone?
Note that shifting by a fixed offset doesn't work here because of DST.
In [17]: df
Out[17]:
values
2014-01-01 00:00:00+01:00 1.027799
2014-01-01 01:00:00+01:00 1.579586
2014-01-01 02:00:00+01:00 0.202947
2014-01-01 03:00:00+01:00 -0.214921
2014-01-01 04:00:00+01:00 0.021499
2014-01-01 05:00:00+01:00 -1.368302
2014-01-01 06:00:00+01:00 -0.261738
...
2014-01-01 22:00:00+01:00 0.808506
2014-01-01 23:00:00+01:00 0.459895
[24 rows x 1 columns]
Use the index attribute asi8 to convert to int64 (the values are already in nanoseconds since the epoch).
These are the UTC times!
In [18]: df.index.asi8//10**6
Out[18]:
array([1388530800000, 1388534400000, 1388538000000, 1388541600000,
1388545200000, 1388548800000, 1388552400000, 1388556000000,
1388559600000, 1388563200000, 1388566800000, 1388570400000,
1388574000000, 1388577600000, 1388581200000, 1388584800000,
1388588400000, 1388592000000, 1388595600000, 1388599200000,
1388602800000, 1388606400000, 1388610000000, 1388613600000])
These are the local times since the epoch. Note that this is NOT a public method; normally I would always exchange UTC data (and pass along the timezone separately if you need it).
In [7]: df.index._local_timestamps()//10**6
Out[7]:
array([1388534400000, 1388538000000, 1388541600000, 1388545200000,
1388548800000, 1388552400000, 1388556000000, 1388559600000,
1388563200000, 1388566800000, 1388570400000, 1388574000000,
1388577600000, 1388581200000, 1388584800000, 1388588400000,
1388592000000, 1388595600000, 1388599200000, 1388602800000,
1388606400000, 1388610000000, 1388613600000, 1388617200000])
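A public-API alternative (my sketch, not part of the original answer): tz_localize(None) drops the timezone but keeps the local wall time, so the resulting naive index yields the same local-time epoch values.
# sketch: local-time epochs without the private method
local_naive = df.index.tz_localize(None)  # naive index holding local wall times
local_epoch_ms = local_naive.asi8 // 10**6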
Related
I have a pandas DataFrame which has two columns, with column names 'DateTimeInUTC' and 'TimeZone'. 'DateTimeInUTC' is the date and time of the instance in UTC, and 'TimeZone' is the time zone of the location of the instance.
An instance in the dataframe could be like this:
DateTimeInUTC: '2019-12-31 07:00:00'
TimeZone: 'US/Eastern'
I want to add another column to the DataFrame, with dtype datetime64, which converts 'DateTimeInUTC' to the time zone specified in that instance.
I tried using the method tz_convert(), but it takes the timezone as a single argument, not as another column of the DataFrame.
EDIT: My best solution so far is to split the DataFrame by timezone using pandas selection, apply the timezone conversion to each same-timezone chunk, and then concatenate all the DataFrames.
The better solution:
I was able to improve my own solution substantially:
timezones = weatherDf['TimeZone'].unique()
for timezone in timezones:
    mask = weatherDf['TimeZone'] == timezone
    weatherDf.loc[mask, 'DateTimeInTimeZone'] = (
        weatherDf.loc[mask, 'DateTimeInUTC']
        .dt.tz_localize('UTC')
        .dt.tz_convert(timezone)
        .dt.tz_localize(None)
    )
This solution converted around 7 million instances on my system in 3.6 seconds
My previous solution:
This solution works, but it is probably not optimal:
Let's assume weatherDf is my DataFrame, which has these columns: DateTimeInUTC and TimeZone.
timezones = weatherDf['TimeZone'].unique()
weatherDfs = []
for timezone in timezones:
    tempDf = weatherDf[weatherDf['TimeZone'] == timezone]
    # assumes DateTimeInUTC is already tz-aware (localized to 'UTC')
    tempDf['DateTimeInTimeZone'] = tempDf['DateTimeInUTC'].dt.tz_convert(timezone)
    weatherDfs.append(tempDf)
weatherDfConverted = pd.concat(weatherDfs)
This solution converted around 7 million instances on my system in around 40 seconds
Approach with groupby():
import pandas as pd
import pytz
import random
import time

tic = time.perf_counter()
ltz = len(pytz.all_timezones) - 1
length = 7 * 10 ** 6
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None
# generate the dummy data
df = pd.DataFrame({'DateTimeInUTC': pd.date_range('01.01.2000', periods=length, freq='T', tz='UTC'),
                   'TimeZone': [pytz.all_timezones[random.randint(0, ltz)] for tz in range(length)]})
toc = time.perf_counter()
print(f"Generated the df in {toc - tic:0.4f} seconds\n")
tic = time.perf_counter()
df['Converted'] = df.groupby('TimeZone')['DateTimeInUTC'].apply(lambda x: x.dt.tz_convert(x.name).dt.tz_localize(None))
print(df)
toc = time.perf_counter()
print(f"\nConverted the df in {toc - tic:0.4f} seconds")
Output:
Generated the df in 6.3333 seconds
DateTimeInUTC TimeZone Converted
0 2000-01-01 00:00:00+00:00 Asia/Qyzylorda 2000-01-01 05:00:00
1 2000-01-01 00:01:00+00:00 America/Moncton 1999-12-31 20:01:00
2 2000-01-01 00:02:00+00:00 America/Cordoba 1999-12-31 21:02:00
3 2000-01-01 00:03:00+00:00 Africa/Dakar 2000-01-01 00:03:00
4 2000-01-01 00:04:00+00:00 Pacific/Wallis 2000-01-01 12:04:00
... ... ... ...
6999995 2013-04-23 02:35:00+00:00 America/Guyana 2013-04-22 22:35:00
6999996 2013-04-23 02:36:00+00:00 America/St_Vincent 2013-04-22 22:36:00
6999997 2013-04-23 02:37:00+00:00 MST7MDT 2013-04-22 20:37:00
6999998 2013-04-23 02:38:00+00:00 Antarctica/McMurdo 2013-04-23 14:38:00
6999999 2013-04-23 02:39:00+00:00 America/Atikokan 2013-04-22 21:39:00
[7000000 rows x 3 columns]
Converted the df in 4.1579 seconds
How do I remove T00:00:00+05:30 after the year, month and day values in pandas? I tried converting the column to datetime, but it still shows the same result. I'm using pandas in Streamlit. I tried the code below:
df['Date'] = pd.to_datetime(df['Date'])
The output is the same as before:
Date
2019-07-01T00:00:00+05:30
2019-07-01T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-03T00:00:00+05:30
2019-07-03T00:00:00+05:30
2019-07-04T00:00:00+05:30
2019-07-04T00:00:00+05:30
2019-07-05T00:00:00+05:30
Can anyone help me remove the T00:00:00+05:30 from the above rows?
If I understand correctly, you want to keep only the date part.
Convert date strings to datetime
df = pd.DataFrame(
    columns=['date'],
    data=["2019-07-01T02:00:00+05:30", "2019-07-02T01:00:00+05:30"]
)
date
0 2019-07-01T02:00:00+05:30
1 2019-07-02T01:00:00+05:30
df['date'] = pd.to_datetime(df['date'])
date
0 2019-07-01 02:00:00+05:30
1 2019-07-02 01:00:00+05:30
Remove the timezone
df['date'] = df['date'].dt.tz_localize(None)
date
0 2019-07-01 02:00:00
1 2019-07-02 01:00:00
Keep the date only
df['date'] = df['date'].dt.date
0 2019-07-01
1 2019-07-02
Don't bother with apply to Python dates or with string manipulation. The former will leave you with an object-dtype column and the latter is slow. Just round to the day frequency using the library function.
>>> pd.Series([pd.Timestamp('2000-01-05 12:01')]).dt.round('D')
0 2000-01-06
dtype: datetime64[ns]
If you have a timezone aware timestamp, convert to UTC with no time zone then round:
>>> pd.Series([pd.Timestamp('2019-07-01T00:00:00+05:30')]).dt.tz_convert(None) \
.dt.round('D')
0 2019-07-01
dtype: datetime64[ns]
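A caveat worth adding (my note, a sketch): tz_convert(None) converts to UTC first, so the rounding happens on the UTC clock. If you want the local calendar date instead, drop the timezone with tz_localize(None) and floor rather than round, so afternoon timestamps don't spill into the next day:
>>> pd.Series([pd.Timestamp('2019-07-01T18:00:00+05:30')]).dt.tz_localize(None) \
        .dt.floor('D')
0   2019-07-01
dtype: datetime64[ns]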
If you want datetime.date objects instead of strings (note that this yields an object-dtype column), you could use .apply:
import pandas as pd
import datetime
df = pd.DataFrame(
{"date": [
"2019-07-01T00:00:00+05:30",
"2019-07-01T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-03T00:00:00+05:30",
"2019-07-03T00:00:00+05:30",
"2019-07-04T00:00:00+05:30",
"2019-07-04T00:00:00+05:30",
"2019-07-05T00:00:00+05:30"]})
df["date"] = df["date"].apply(lambda x: datetime.datetime.fromisoformat(x).date())
print(df)
I have a DataFrame that I grouped with the function groupby. In order to do so, I had to use a DatetimeIndex. However, I would like to transform my DatetimeIndex to integers to use them as the index for a dynamic optimization model. I'm able to transform my DatetimeIndex to floats, but not to integers that differentiate hours.
# My data look like this:
Date Hour MktDemand HOEP hour
Datetime
2019-01-01 01:00:00 2019-01-01 1 16231 0.00 0
2019-01-01 02:00:00 2019-01-01 2 16051 0.00 1
2019-01-01 03:00:00 2019-01-01 3 15805 -0.11 2
2019-01-01 04:00:00 2019-01-01 4 15580 -1.84 3
2019-01-01 05:00:00 2019-01-01 5 15609 -0.47 4
...
import datetime as dt
df['Datetime'] = pd.to_datetime(df.Date) + pd.to_timedelta(df.Hour, unit='h')
grouped = df.set_index('Datetime').groupby(pd.Grouper(freq="15d"))
for name, group in grouped:
    print(pd.to_numeric(group.index, downcast='integer'))
# It returns this:
Int64Index([1546304400000000000, 1546308000000000000, 1546311600000000000,
1546315200000000000, 1546318800000000000, 1546322400000000000,
1546326000000000000, 1546329600000000000, 1546333200000000000,
1546336800000000000,
...
# However, I would like to have integers in this format:
20190523
20190524
# I tried this but it doesn't work:
for name, group in grouped:
    print(pd.to_timedelta(group.index).dt.total_hours().astype(int))
ERROR: dtype datetime64[ns] cannot be converted to timedelta64[ns]
The integers you expect represent a datetime format; they're not an actual numeric representation of datetime (which pd.to_numeric gives you, as nanoseconds since 1970-1-1 UTC).
Therefore, you'll want to format to string and then convert to integer.
Ex:
import pandas as pd
# some synthetic example data...
dti = pd.date_range("2015", "2016", freq='d')
df = pd.DataFrame({'some_value': [i for i in range(len(dti))]})
grouped = df.set_index(dti).groupby(pd.Grouper(freq="15d"))
for name, group in grouped:
    print(group.index.strftime('%Y%m%d').astype(int))
# gives you e.g.
Int64Index([20150101, 20150102, 20150103, 20150104, 20150105, 20150106,
20150107, 20150108, 20150109, 20150110, 20150111, 20150112,
20150113, 20150114, 20150115],
dtype='int64')
...
You could also extend the strftime directive to include additional components such as hours or minutes, as in the sketch below.
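For example (my sketch, reusing the grouped object from above; with the daily example data the hour and minute digits will simply be zeros):
for name, group in grouped:
    print(group.index.strftime('%Y%m%d%H%M').astype('int64'))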
I have a DataFrame like this:
Date X
....
2014-01-02 07:00:00 16
2014-01-02 07:15:00 20
2014-01-02 07:30:00 21
2014-01-02 07:45:00 33
2014-01-02 08:00:00 22
....
2014-01-02 23:45:00 0
....
1)
So my "Date" Column is a datetime and has values vor every 15min of a day.
What i want is to remove ALL Rows where the time is NOT between 08:00 and 18:00 o'clock.
2)
Some days are missing in the datas...how could i put the missing days in my dataframe and fill them with the value 0 as X.
My approach: Create a new Series between two Dates and set 15min as frequenz and concat my X Column with the new created Series. Is that right?
Edit:
Problem with my second question:
#create new full DF without missing dates and reindex
full_range = pandas.date_range(start='2014-01-02', end='2017-11-14', freq='15min')
df = df.reindex(full_range, fill_value=0)
df.head()
Output:
Date X
2014-01-02 00:00:00 1970-01-01 0
2014-01-02 00:15:00 1970-01-01 0
2014-01-02 00:30:00 1970-01-01 0
2014-01-02 00:45:00 1970-01-01 0
2014-01-02 01:00:00 1970-01-01 0
That didn't work, as you can see.
The "Date" column is not an index, btw; I need it as a column in my df.
And why did it pick "1970-01-01"? 1970 as a year makes no sense to me.
What I want is to remove ALL Rows where the time is NOT between 08:00
and 18:00 o'clock.
Create a mask with datetime.time. Example:
import numpy as np
import pandas as pd
from datetime import time

idx = pd.date_range('2014-01-02', freq='15min', periods=10000)
df = pd.DataFrame({'x': np.empty(idx.shape[0])}, index=idx)
t1 = time(8); t2 = time(18)
times = df.index.time
mask = (times > t1) & (times < t2)
df = df.loc[mask]
Some days are missing in the data...how could I put the missing days
in my DataFrame and fill them with the value 0 as X?
Build a date range that doesn't have missing data with pd.date_range() (see above).
Call reindex() on df and specify fill_value=0; see the sketch below.
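Putting both steps together (my sketch, assuming 'Date' is your timestamp column): the 1970-01-01 values in your edit appeared because fill_value=0 was written into the datetime64 'Date' column too, and 0 nanoseconds since the epoch is 1970-01-01. Keep the timestamps only in the index and the problem goes away.
df = df.set_index('Date')  # timestamps live in the index, not in a column
full_range = pd.date_range(start='2014-01-02', end='2017-11-14', freq='15min')
df = df.reindex(full_range, fill_value=0)  # missing rows get X = 0
df = df.rename_axis('Date').reset_index()  # 'Date' back as a regular column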
Answering your questions in comments:
np.empty creates an uninitialized array. I was just using it to build some "example" data that is basically garbage. Here idx.shape is the shape of your index, the tuple (length,). So np.empty(idx.shape[0]) creates a 1d array with the same length as idx.
times = df.index.time creates a variable (a NumPy array) called times. df.index.time is the time for each element in the index of df. You can explore this yourself by just breaking the code down in pieces and experimenting with it on your own.
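To see what that array actually looks like (a quick sketch):
>>> idx = pd.date_range('2014-01-02', freq='15min', periods=3)
>>> idx.time
array([datetime.time(0, 0), datetime.time(0, 15), datetime.time(0, 30)],
      dtype=object)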
I am using pandas DataFrames with a DatetimeIndex to manipulate timeseries data. The data is stored at UTC time and I usually keep it that way (with a naive DatetimeIndex), and only use timezones for output. I like it that way because nothing in the world confuses me more than trying to manipulate timezones.
e.g.
In: ts = pd.date_range('2017-01-01 00:00','2017-12-31 23:30',freq='30Min')
data = np.random.rand(17520,1)
df= pd.DataFrame(data,index=ts,columns = ['data'])
df.head()
Out[15]:
data
2017-01-01 00:00:00 0.697478
2017-01-01 00:30:00 0.506914
2017-01-01 01:00:00 0.792484
2017-01-01 01:30:00 0.043271
2017-01-01 02:00:00 0.558461
I want to plot a chart of data versus time for each day of the year so I reshape the dataframe to have time along the index and dates for columns
df.index = [df.index.time,df.index.date]
df_new = df['data'].unstack()
In: df_new.head()
Out :
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 \
00:00:00 0.697478 0.143626 0.189567 0.061872 0.748223
00:30:00 0.506914 0.470634 0.430101 0.551144 0.081071
01:00:00 0.792484 0.045259 0.748604 0.305681 0.333207
01:30:00 0.043271 0.276888 0.034643 0.413243 0.921668
02:00:00 0.558461 0.723032 0.293308 0.597601 0.120549
If I'm not worried about timezones I can plot like this:
fig, ax = plt.subplots()
ax.plot(df_new.index,df_new)
but I want to plot the data in the local timezone (tz = pytz.timezone('Australia/Sydney')), making allowance for daylight savings time. However, the times and dates are no longer Timestamp objects, so I can't use pandas timezone handling. Or can I?
Assuming I can't, I'm trying to do the shift manually, (given DST starts 1/10 at 2am and finishes 1/4 at 2am), so I've got this far:
df_new[[c for c in df_new.columns if c >= dt.datetime(2017,4,1) and c <dt.datetime(2017,10,1)]].shift_by(+10)
df_new[[c for c in df_new.columns if c < dt.datetime(2017,4,1) or c >= dt.datetime(2017,10,1)]].shift_by(+11)
but am not sure how to write the function shift_by.
(This doesn't handle midnight to 2am on the changeover days correctly, which is not ideal, but I could live with that.)
Use tz_localize + tz_convert to convert the DataFrame's dates to a particular timezone:
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
Be a little careful when creating the MultiIndex - as you observed, it creates rows with duplicate timestamps (at the DST changeover), so if that's the case, get rid of them with duplicated:
df = df[~df.index.duplicated()]
df = df['data'].unstack()
You can also create subplots with df.plot:
df.plot(subplots=True)
plt.show()