There are already a lot of questions about this topic, but I could not find any answers that solve my problem.
1. The context
I have timestamps stored in a list as strings, which look like:
print(my_timestamps)
...
3 Sun Mar 31 2019 00:00:00 GMT+0100
4 Sun Mar 31 2019 01:00:00 GMT+0100
5 Sun Mar 31 2019 03:00:00 GMT+0200
6 Sun Mar 31 2019 04:00:00 GMT+0200
...
13 Sun Oct 27 2019 01:00:00 GMT+0200
14 Sun Oct 27 2019 02:00:00 GMT+0200
15 Sun Oct 27 2019 02:00:00 GMT+0100
16 Sun Oct 27 2019 03:00:00 GMT+0100
17 Sun Oct 27 2019 04:00:00 GMT+0100
Name: date, dtype: object
You will notice I have kept the two spots where the DST changeovers occur.
I use to_datetime() to store them as timestamps in a pandas DataFrame:
df['date'] = pd.to_datetime(my_timestamps)
print(df)
...
3 2019-03-31 00:00:00-01:00
4 2019-03-31 01:00:00-01:00
5 2019-03-31 03:00:00-02:00
6 2019-03-31 04:00:00-02:00
...
13 2019-10-27 01:00:00-02:00
14 2019-10-27 02:00:00-02:00
15 2019-10-27 02:00:00-01:00
16 2019-10-27 03:00:00-01:00
17 2019-10-27 04:00:00-01:00
Name: date, dtype: object
A first thing that surprises me is that the 'date' column keeps its dtype as 'object' and not 'datetime64'.
When I want to use these timestamps as indexes with
df.set_index('date', inplace = True, verify_integrity = True)
I get an error from the verify_integrity check informing me there are duplicate index values:
ValueError: Index has duplicate keys: Index([2019-10-27 02:00:00-01:00, 2019-10-27 03:00:00-01:00], dtype='object', name='date')
I obviously would like to solve that.
2. What I tried
My understanding is that the timezone information is not being used, and that to make use of it I should convert the timestamps so that the column gets the dtype 'datetime64'.
I first added the flag utc=True to to_datetime():
test = pd.to_datetime(my_timestamps,utc=True)
But then, I simply don't understand the result:
...
3 2019-03-31 01:00:00+00:00
4 2019-03-31 02:00:00+00:00
5 2019-03-31 05:00:00+00:00
6 2019-03-31 06:00:00+00:00
...
13 2019-10-27 03:00:00+00:00
14 2019-10-27 04:00:00+00:00
15 2019-10-27 03:00:00+00:00
16 2019-10-27 04:00:00+00:00
17 2019-10-27 05:00:00+00:00
According to my understanding, the timezone offset has been interpreted with its sign reversed?!
3 Sun Mar 31 2019 00:00:00 GMT+0100
shifted to UTC should read as
3 2019-03-30 23:00:00+00:00
but here it is translated into:
3 2019-03-31 01:00:00+00:00
This then likely explains the duplicate-index error, with duplicate timestamps appearing:
14 2019-10-27 04:00:00+00:00
...
16 2019-10-27 04:00:00+00:00
Please, does anyone have an idea how to correctly handle the timezone information so that it doesn't lead to a duplicate index?
I thank you in advance for your help.
Have a good day,
Bests,
Pierrot
PS: I am fine with having the timestamps expressed in UTC, as long as the hour shift is correctly handled.
3. Edit
It would seem the fromisoformat() function, new in Python 3.7, could help. However, it accepts a single string as input, and I am not certain how it can be applied in a "vectorized" manner to a complete dataframe column.
How to convert a timezone aware string to datetime in python without dateutil?
So there does seem to be an issue in dateutil, as indicated above.
I reversed the +/- sign in my original data file as indicated here:
How to replace a sub-string conditionally in a pandas dataframe column?
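For reference, here is a minimal sketch of how that sign flip could be done directly on the strings with pandas before parsing (the variable names are mine; it relies on the reversed-sign parsing behaviour described above, and on the fact that all offsets in my data are written as GMT+HHMM):

import pandas as pd

raw = pd.Series(my_timestamps)
# Flip 'GMT+' to 'GMT-' so that the reversed interpretation ends up yielding the true offset.
flipped = raw.str.replace('GMT+', 'GMT-', regex=False)
df['date'] = pd.to_datetime(flipped, utc=True)
# 'Sun Mar 31 2019 00:00:00 GMT+0100' -> 2019-03-30 23:00:00+00:00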
Bests,
Pierrot
Related
I have a dataframe which looks like below:
trip_id date journey_duration weekday
0 913460 2019-08-31 00:13:00 Sat
1 913459 2019-08-31 00:17:00 Sat
2 913455 2019-08-31 00:05:00 Sat
3 913454 2019-08-31 00:07:00 Sat
4 913453 2019-08-31 00:13:00 Sat
5 913452 2019-08-31 00:05:00 Sat
6 913451 2019-08-31 00:15:00 Sat
7 913450 2019-08-31 00:04:00 Sat
8 913449 2019-08-31 00:03:00 Sat
9 913448 2019-08-31 00:15:00 Sat
10 913443 2019-08-31 00:12:00 Sat
11 913442 2019-08-31 00:10:00 Sat
12 913441 2019-08-31 00:07:00 Sat
13 913440 2019-08-31 00:05:00 Sat
14 913435 2019-08-31 00:08:00 Sat
15 913434 2019-08-31 00:05:00 Sat
16 913433 2019-08-31 00:03:00 Sat
17 913432 2019-08-31 00:12:00 Sat
18 913431 2019-08-31 00:10:00 Sat
19 913429 2019-08-31 00:15:00 Sat
I would like to aggregate it to a daily level: change the trip_id column to a count of the number of trips per day, and the journey_duration column to an average per day.
I have used this:
trip_data = (pd.to_datetime(trip_data['date'])
.dt.floor('d')
.value_counts()
.rename_axis('date')
.reset_index(name='count'))
which works well to count the trips per day; however, this drops the journey_duration column.
Hope that makes sense; I'm conscious my nomenclature might not be quite right as I'm a newbie.
Thanks
Here's a way to do what your question asks:
trip_data.date = pd.to_datetime(trip_data.date)
trip_data.journey_duration = pd.to_timedelta(trip_data.journey_duration)
trip_data = (trip_data
             .assign(date=trip_data.date.dt.floor('d'))
             .groupby('date', as_index=False)
             .agg(count=("trip_id", "count"), journey_duration=("journey_duration", "mean")))
Output:
date count journey_duration
0 2019-08-31 20 0 days 00:09:12
Explanation:
ensure date is a pandas datetime and journey_duration is a pandas timedelta type
round date to its day component using floor()
use groupby() to prepare for aggregation by unique date
use agg() to aggregate trip_id using count in a column named count and journey_duration using mean.
First, convert date to a datetime type. Since journey_duration doesn't contain the day, month, etc., it is a better idea to use pd.to_timedelta for its conversion:
df['date'] = pd.to_datetime(df['date'])
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])
Then set date as the index, resample the dataframe to a daily frequency, and use agg for multiple operations on different columns:
df.set_index('date').resample('D').agg(no_trips_per_day=('trip_id', 'count'),
                                       avg_duration=('journey_duration', 'mean'))
no_trips_per_day avg_duration
date
2019-08-31 20 0 days 00:09:12
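A small note on the difference between the two answers (my own observation, not from either answer): resample('D') emits a row for every calendar day in the covered range, so a day with no trips would still appear with a count of 0 and a missing mean, whereas the groupby approach only produces rows for dates that actually occur in the data. With this single-day sample, the two outputs are identical.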
Let's say I have a pandas dataframe df
DF
Timestamp Value
Jan 1 12:32 10
Jan 1 12:50 15
Jan 1 13:01 5
Jan 1 16:05 17
Jan 1 16:10 17
Jan 1 16:22 20
The result I want back is a dataframe with per-hour (or any user-specified time segment, really) averages. Let's say my specified time segment is 1 hour here. I want back something like:
Jan 1 12:00 12.5
Jan 1 13:00 5
Jan 1 14:00 0
Jan 1 15:00 0
Jan 1 16:00 18
Is there a simple way built into pandas to segment like this? It feels like there should be, but my googling of "splitting pandas dataframe" in a variety of ways is failing me.
We need to convert to datetime first, then resample:
df.Timestamp = pd.to_datetime('2020 ' + df.Timestamp)
df.set_index('Timestamp').Value.resample('1H').mean().fillna(0)
Timestamp
2020-01-01 12:00:00 12.5
2020-01-01 13:00:00 5.0
2020-01-01 14:00:00 0.0
2020-01-01 15:00:00 0.0
2020-01-01 16:00:00 18.0
Freq: H, Name: Value, dtype: float64
If you want the index in a different display format, assign the result to newdf and convert the index:
newdf.index = newdf.index.strftime('%B %d %H:%M')
newdf
Timestamp
January 01 12:00 12.5
January 01 13:00 5.0
January 01 14:00 0.0
January 01 15:00 0.0
January 01 16:00 18.0
Name: Value, dtype: float64
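If a different time segment is needed, only the resample frequency string has to change; for example (a minimal sketch, the 30-minute window is just an illustration):

# 30-minute bins instead of hourly ones
df.set_index('Timestamp').Value.resample('30min').mean().fillna(0)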
Here's a quick peek of my dataframe:
local_date amount
0 2017-08-16 10.00
1 2017-10-26 21.70
2 2017-11-04 5.00
3 2017-11-12 37.20
4 2017-11-13 10.00
5 2017-11-18 31.00
6 2017-11-27 14.00
7 2017-11-29 10.00
8 2017-11-30 37.20
9 2017-12-16 8.00
10 2017-12-17 43.20
11 2017-12-17 49.60
12 2017-12-19 102.50
13 2017-12-19 28.80
14 2017-12-22 72.55
15 2017-12-23 24.80
16 2017-12-24 62.00
17 2017-12-26 12.40
18 2017-12-26 15.50
19 2017-12-26 40.00
20 2017-12-28 57.60
21 2017-12-31 37.20
22 2018-01-01 18.60
23 2018-01-02 12.40
24 2018-01-04 32.40
25 2018-01-05 17.00
26 2018-01-06 28.80
27 2018-01-11 20.80
28 2018-01-12 10.00
29 2018-01-12 26.00
I am trying to plot the monthly sum of transactions, which is fine, except for the ugly x-ticks.
I would like to change them to the name of the month and year (e.g. Jan 2019). So I sort the dates, change them using strftime, and plot again, but the order of the dates is completely messed up.
The code I used to sort the dates and convert them is:
transactions = transactions.sort_values(by='local_date')
transactions['month_year'] = transactions['local_date'].dt.strftime('%B %Y')
#And then groupby that column:
transactions.groupby('month_year').amount.sum().plot(kind='bar')
When doing this, the month_year values end up grouped by month name: January 2019 comes right after January 2018, and so on.
I thought sorting by date would fix this, but it doesn't. What's the best way to approach this?
You can convert the column to month periods with Series.dt.to_period and then change the PeriodIndex to a custom format in rename:
transactions = transactions.sort_values(by='local_date')
(transactions.groupby(transactions['local_date'].dt.to_period('m'))
.amount.sum()
.rename(lambda x: x.strftime('%B %Y'))
.plot(kind='bar'))
Alternative solution:
transactions = transactions.sort_values(by='local_date')
s = transactions.groupby(transactions['local_date'].dt.to_period('m')).amount.sum()
s.index = s.index.strftime('%B %Y')
s.plot(kind='bar')
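A small illustration of why both versions keep the right order (my own note, not part of the original answer): string month labels sort alphabetically, while Period values sort chronologically, and the '%B %Y' strings are only produced at the very end as display labels.

import pandas as pd

labels = pd.Index(['January 2018', 'February 2018', 'January 2019'])
print(labels.sort_values())    # alphabetical: February 2018, January 2018, January 2019

periods = pd.PeriodIndex(['2018-01', '2018-02', '2019-01'], freq='M')
print(periods.sort_values())   # chronological: 2018-01, 2018-02, 2019-01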
I am kind of completely lost...
I have created a pandas DataFrame in Python with the read_csv() function.
I had to extract the column containing the timestamps into a list and do some cleaning.
This list now looks like:
0 Sat Mar 30 2019 21:00:00 GMT+0100
1 Sat Mar 30 2019 22:00:00 GMT+0100
2 Sat Mar 30 2019 23:00:00 GMT+0100
...
I convert it back to datetime objects and add it back to my dataframe with the following command:
df['date'] = pd.to_datetime(my_timestamps)
df['date'] now looks like:
0 2019-03-30 21:00:00-01:00
1 2019-03-30 22:00:00-01:00
2 2019-03-30 23:00:00-01:00
Either before or after that conversion, I would like to actually apply the timezone offset, so as to have:
0 2019-03-30 20:00:00+00:00
1 2019-03-30 21:00:00+00:00
2 2019-03-30 22:00:00+00:00
Please, how can I obtain that?
I thank you in advance for your help.
Have a good evening,
Bests,
Pierrot
Try subtracting a pd.DateOffset in hours after parsing with utc=True, as follows:
df['date'] = pd.to_datetime(my_timestamps, utc=True) - pd.DateOffset(hours=2)
Out[860]:
0 2019-03-30 20:00:00+00:00
1 2019-03-30 21:00:00+00:00
2 2019-03-30 22:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
If you want to strip the timezone completely, try this
import pytz
df['date'] = (pd.to_datetime(my_timestamps).dt.tz_convert(pytz.FixedOffset(-120))
                .dt.tz_localize(None))
Out[863]:
0 2019-03-30 20:00:00
1 2019-03-30 21:00:00
2 2019-03-30 22:00:00
Name: 0, dtype: datetime64[ns]
This is a known issue with dateutil.
See:
https://github.com/dateutil/dateutil/issues/70
Timezone offset sign reversed by Python dateutil?
https://github.com/dateutil/dateutil/issues/968
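As an alternative workaround (my own suggestion, not taken from the linked issues), the dateutil guessing can be bypassed entirely by passing an explicit format string to pd.to_datetime, in which case the %z offset is taken at face value:

import pandas as pd

# Minimal sketch, assuming every string follows the 'Sat Mar 30 2019 21:00:00 GMT+0100' pattern.
fmt = '%a %b %d %Y %H:%M:%S GMT%z'
df['date'] = pd.to_datetime(my_timestamps, format=fmt, utc=True)
# 'Sat Mar 30 2019 21:00:00 GMT+0100' -> 2019-03-30 20:00:00+00:00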
I have a time series like the following:
date value
2017-08-27 564.285714
2017-09-03 28.857143
2017-09-10 NaN
2017-09-17 NaN
2017-09-24 NaN
2017-10-01 236.857143
... ...
2018-09-02 345.142857
2018-09-09 288.714286
2018-09-16 274.000000
2018-09-23 248.142857
2018-09-30 166.428571
It corresponds to data ranging from July 2017 to November 2019, resampled by week. However, there are some weeks where the values were 0. I replaced those with missing values (NaN), and now I would like to fill them based on the values of the homologous period of a different year. For example, I have a lot of data missing for September 2017; I would like to interpolate those values using the values from September 2018. However, I'm a newbie and I'm not quite sure how to do it based only on a selected period. I'm working in Python, btw.
If anyone has any idea how to do this quickly, it would be very much appreciated.
If you are OK with using the pandas library:
One option is to find the week number from the date and fill the NaN values.
df['week'] = pd.to_datetime(df['date'], format='%Y-%m-%d').dt.strftime("%V")
df2 = df.sort_values(['week']).fillna(method='bfill').sort_values(['date'])
df2
which will give you the following output.
date value week
0 2017-08-27 564.285714 34
1 2017-09-03 28.857143 35
2 2017-09-10 288.714286 36
3 2017-09-17 274.000000 37
4 2017-09-24 248.142857 38
5 2017-10-01 236.857143 39
6 2018-09-02 345.142857 35
7 2018-09-09 288.714286 36
8 2018-09-16 274.000000 37
9 2018-09-23 248.142857 38
10 2018-09-30 166.428571 39
In pandas, once there is a column value_last_year holding the homologous value for each row, the gaps can be filled with:
df['value'] = df['value'].fillna(df['value_last_year'])
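That one-liner assumes such a column already exists. Here is a minimal sketch of how it could be built (the helper columns week and year and the name value_last_year are mine; it also assumes at most one row per (year, week) pair):

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].dt.isocalendar().week.astype(int)
df['year'] = df['date'].dt.year

# Look up the value recorded for the same ISO week one year later
# (the question fills September 2017 from September 2018, so "last year"
# here really means the homologous week of the following year).
lookup = df[['year', 'week', 'value']].copy()
lookup['year'] -= 1                                   # align 2018 rows with 2017 keys
lookup = lookup.rename(columns={'value': 'value_last_year'})

df = df.merge(lookup, on=['year', 'week'], how='left')
df['value'] = df['value'].fillna(df['value_last_year'])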