Splitting Pandas Dataframe into chunks by Timestamp - python

Let's say I have a pandas dataframe df:
Timestamp Value
Jan 1 12:32 10
Jan 1 12:50 15
Jan 1 13:01 5
Jan 1 16:05 17
Jan 1 16:10 17
Jan 1 16:22 20
The result I want back is a dataframe with per-hour (or any user-specified time segment, really) averages. Let's say my specified time segment is 1 hour here. I want back something like
Jan 1 12:00 12.5
Jan 1 13:00 5
Jan 1 14:00 0
Jan 1 15:00 0
Jan 1 16:00 18
Is there a simple way built into pandas to segment like this? It feels like there should be, but my googling of "splitting pandas dataframe" in a variety of ways is failing me.

We need to convert to datetime first, then resample:
df.Timestamp=pd.to_datetime('2020 '+df.Timestamp)
newdf=df.set_index('Timestamp').Value.resample('1H').mean().fillna(0)
newdf
Timestamp
2020-01-01 12:00:00 12.5
2020-01-01 13:00:00 5.0
2020-01-01 14:00:00 0.0
2020-01-01 15:00:00 0.0
2020-01-01 16:00:00 18.0
Freq: H, Name: Value, dtype: float64
Then convert the index back to the original string format:
newdf.index=newdf.index.strftime('%B %d %H:%M')
newdf
Timestamp
January 01 12:00 12.5
January 01 13:00 5.0
January 01 14:00 0.0
January 01 15:00 0.0
January 01 16:00 18.0
Name: Value, dtype: float64
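For a self-contained version with a user-specified segment length, here is a minimal sketch. The year 2020, the explicit format string and the helper name bucket_means are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample data from the question. The year 2020 is an assumption,
# because the original timestamps carry no year.
df = pd.DataFrame({
    "Timestamp": ["Jan 1 12:32", "Jan 1 12:50", "Jan 1 13:01",
                  "Jan 1 16:05", "Jan 1 16:10", "Jan 1 16:22"],
    "Value": [10, 15, 5, 17, 17, 20],
})
df["Timestamp"] = pd.to_datetime("2020 " + df["Timestamp"],
                                 format="%Y %b %d %H:%M")


def bucket_means(frame, freq="60min"):
    """Average Value per time bucket; empty buckets are filled with 0.

    Hypothetical helper: freq is any pandas offset alias such as
    "60min" or "30min".
    """
    return (frame.set_index("Timestamp")["Value"]
                 .resample(freq)
                 .mean()
                 .fillna(0))


hourly = bucket_means(df)                 # one row per hour
half_hourly = bucket_means(df, "30min")   # one row per half hour
print(hourly)
```

Note that fillna(0) makes an empty hour indistinguishable from an hour whose average really is 0; dropping that call keeps NaN for empty buckets instead.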

Related

Code required to aggregate dataframe group total as well as retrieve first and last date by group

I am using Python for some supply chain/manufacturing purposes. I am trying to figure out some code that will let me compile the relevant information I need.
I am trying to aggregate the total 'UnitsProduced' by 'Lot' and also grab only the first occurring 'Date/StartTime' and last occurring 'Date/EndTime'.
Right now the (simplified) dataframe is as follows:
Lot  UnitsProduced  Date/StartTime  Date/EndTime
1    5              1/1/2021 8:00   1/1/2021 13:00
1    13             1/2/2021 10:00  1/2/2021 14:00
2    20             1/3/2021 7:00   1/3/2021 11:00
3    15             1/4/2021 14:30  1/4/2021 19:00
3    6              1/4/2021 20:00  1/4/2021 22:00
3    28             1/5/2021 7:00   1/5/2021 13:00
The end result should look something like:
Lot  Units Produced  Date/StartTime  Date/EndTime
1    18              1/1/2021 8:00   1/2/2021 14:00
2    20              1/3/2021 7:00   1/3/2021 11:00
3    49              1/4/2021 14:30  1/5/2021 13:00
Thank you for the help. If there is any other information I can provide please let me know
You could use groupby.agg with a dictionary of aggregate functions. Just make sure that the date columns are in datetime format first:
# df['Date/StartTime'] = pd.to_datetime(df['Date/StartTime'])
# df['Date/EndTime'] = pd.to_datetime(df['Date/EndTime'])
df.groupby('Lot', as_index=False).agg({'UnitsProduced':'sum',
                                       'Date/StartTime':'min',
                                       'Date/EndTime':'max'})
Lot UnitsProduced Date/StartTime Date/EndTime
0 1 18 2021-01-01 08:00:00 2021-01-02 14:00:00
1 2 20 2021-01-03 07:00:00 2021-01-03 11:00:00
2 3 49 2021-01-04 14:30:00 2021-01-05 13:00:00
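The same aggregation can also be sketched end to end with named aggregation, which lets the output columns be renamed in one step. The month-first date format and the output column names FirstStart and LastEnd are assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Lot": [1, 1, 2, 3, 3, 3],
    "UnitsProduced": [5, 13, 20, 15, 6, 28],
    "Date/StartTime": ["1/1/2021 8:00", "1/2/2021 10:00", "1/3/2021 7:00",
                       "1/4/2021 14:30", "1/4/2021 20:00", "1/5/2021 7:00"],
    "Date/EndTime": ["1/1/2021 13:00", "1/2/2021 14:00", "1/3/2021 11:00",
                     "1/4/2021 19:00", "1/4/2021 22:00", "1/5/2021 13:00"],
})

# Assumption: dates are month/day/year. Parsing them first makes
# min/max compare actual points in time rather than strings.
for col in ["Date/StartTime", "Date/EndTime"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y %H:%M")

# Named aggregation: output_column=(input_column, function).
summary = df.groupby("Lot", as_index=False).agg(
    UnitsProduced=("UnitsProduced", "sum"),
    FirstStart=("Date/StartTime", "min"),
    LastEnd=("Date/EndTime", "max"),
)
print(summary)
```

Using min and max rather than first and last means the result does not depend on the rows already being sorted by time within each lot.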

Pandas MultiIndex: Partial indexing on second level

I have a data-set open in Pandas with a 2-level MultiIndex. The first level of the MultiIndex is a unique ID (SID) while the second level is time (ISO_TIME). A sample of the data-set is given below.
SEASON NATURE NUMBER
SID ISO_TIME
2020138N10086 2020-05-16 12:00:00 2020 NR 26
2020-05-16 15:00:00 2020 NR 26
2020-05-16 18:00:00 2020 NR 26
2020-05-16 21:00:00 2020 NR 26
2020-05-17 00:00:00 2020 NR 26
2020155N17072 2020-06-02 18:00:00 2020 NR 30
2020-06-02 21:00:00 2020 NR 30
2020-06-03 00:00:00 2020 NR 30
2020-06-03 03:00:00 2020 NR 30
2020-06-03 06:00:00 2020 NR 30
2020327N11056 2020-11-21 18:00:00 2020 NR 103
2020-11-21 21:00:00 2020 NR 103
2020-11-22 00:00:00 2020 NR 103
2020-11-22 03:00:00 2020 NR 103
2020-11-22 06:00:00 2020 NR 103
2020329N10084 2020-11-23 12:00:00 2020 NR 104
2020-11-23 15:00:00 2020 NR 104
2020-11-23 18:00:00 2020 NR 104
2020-11-23 21:00:00 2020 NR 104
2020-11-24 00:00:00 2020 NR 104
I can do df.loc[("2020138N10086")] to select rows with SID=2020138N10086 or df.loc[("2020138N10086", "2020-05-17")] to select rows with SID=2020138N10086 and are on 2020-05-17.
What I want to do, but am not able to, is to partially index using the second level of the MultiIndex. That is, select all rows on 2020-05-17, irrespective of the SID.
I have read through Pandas MultiIndex / advanced indexing which explains how indexing is done with MultiIndex. But nowhere in it could I find how to do a partial indexing on the second/inner level of a Pandas MultiIndex. Either I missed it in the document or it is not explained in there.
So, is it possible to do a partial indexing in the second level of a Pandas MultiIndex?
If it is possible, how do I do it?
You can do this with slicing. See the pandas documentation.
Example for your dataframe:
df.loc[(slice(None), '2020-05-17'), :]
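Alternatively, reset the index and filter on the ISO_TIME column. Comparing the normalized timestamps matches every row on that date, not only the one stamped exactly at midnight:
df=df.reset_index()
dates_rows= df[df["ISO_TIME"].dt.normalize()=="2020-05-17"]
If you want, you can convert it back to a multi-level index again, like below: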
df.set_index(['SID', 'ISO_TIME'], inplace=True)
Use a cross-section:
df.xs('2020-05-17', level="ISO_TIME")
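Another way to select by date on the inner level is a boolean mask built from the level values. This is only a sketch: the frame below is a reduced version of the sample, and the 03:00 row on 2020-05-17 is hypothetical, added so that more than one row falls on the target date.

```python
import pandas as pd

# Reduced sample. The 2020-05-17 03:00 row is hypothetical.
index = pd.MultiIndex.from_tuples(
    [
        ("2020138N10086", pd.Timestamp("2020-05-16 21:00")),
        ("2020138N10086", pd.Timestamp("2020-05-17 00:00")),
        ("2020138N10086", pd.Timestamp("2020-05-17 03:00")),
        ("2020155N17072", pd.Timestamp("2020-06-02 18:00")),
    ],
    names=["SID", "ISO_TIME"],
)
df = pd.DataFrame({"NUMBER": [26, 26, 26, 30]}, index=index)

# Pull out the inner level as a DatetimeIndex, truncate each value to
# midnight, and compare against the wanted date.
times = df.index.get_level_values("ISO_TIME")
on_day = df[times.normalize() == pd.Timestamp("2020-05-17")]
print(on_day)
```

The mask keeps the MultiIndex intact, so no reset_index / set_index round trip is needed, and it does not rely on the index being sorted.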

Plotting count of unique values in groupby

I have a dataset of this form:
>>> df
my_timestamp disease month
0 2016-01-01 15:00:00 2 jan
0 2016-01-01 11:00:00 1 jan
1 2016-01-02 15:00:00 3 jan
2 2016-01-03 15:00:00 4 jan
3 2016-01-04 15:00:00 2 jan
I want to count the number of occurrences of each disease value per month, then plot the count of every value by month.
df
values count
jan 2 3
jan 2 3
How can I plot it? I would like one plot with month on the x axis, one line for each value, and the count on the y axis.
If you want to plot by month, you also need to group by year when the data spans multiple years. You can use dt.strftime inside .groupby to group by year and month.
Given the following slightly altered dataset to include more months:
my_timestamp disease month
2016-01-01 15:00:00 2 jan
2016-02-01 11:00:00 1 feb
2017-01-02 15:00:00 3 jan
2017-01-02 15:00:00 4 jan
2016-01-04 15:00:00 2 jan
You can run the following:
df['my_timestamp'] = pd.to_datetime(df['my_timestamp'])
df.groupby(df['my_timestamp'].dt.strftime('%Y-%m'))['disease'].nunique().plot()
Here is what I did to get that data into a bar plot. I created a month column, then:
import matplotlib.pyplot as plt
for v in df.disease.unique():
    # count the rows for this disease value in each month
    diseases = df_cut[df_cut['disease']==v].groupby('month_num')['disease'].count()
    x = diseases.index
    y = diseases.values
    plt.bar(x, y)
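To get one line per disease value, as the question asks, the counts can first be arranged with months as rows and disease values as columns. A minimal sketch using the altered dataset from the answer above; the names year_month and counts are chosen here for illustration.

```python
import pandas as pd

# Altered dataset from the answer above.
df = pd.DataFrame({
    "my_timestamp": pd.to_datetime([
        "2016-01-01 15:00:00", "2016-02-01 11:00:00", "2017-01-02 15:00:00",
        "2017-01-02 15:00:00", "2016-01-04 15:00:00",
    ]),
    "disease": [2, 1, 3, 4, 2],
})

# Group by year-month and disease, count the rows, then move the
# disease values into columns. Missing combinations become 0.
year_month = df["my_timestamp"].dt.strftime("%Y-%m").rename("year_month")
counts = (df.groupby([year_month, "disease"])
            .size()
            .unstack(fill_value=0))
print(counts)
```

With matplotlib installed, counts.plot() should then draw one line per disease column with the year-month on the x axis, and counts.plot(kind="bar") gives grouped bars.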

Handling timezone in pandas to_datetime function

There are already a lot of questions about that topic, but I could not find replies that solve my troubles.
1. The context
I have timestamps stored in a list as strings, which look like:
print(my_timestamps)
...
3 Sun Mar 31 2019 00:00:00 GMT+0100
4 Sun Mar 31 2019 01:00:00 GMT+0100
5 Sun Mar 31 2019 03:00:00 GMT+0200
6 Sun Mar 31 2019 04:00:00 GMT+0200
...
13 Sun Oct 27 2019 01:00:00 GMT+0200
14 Sun Oct 27 2019 02:00:00 GMT+0200
15 Sun Oct 27 2019 02:00:00 GMT+0100
16 Sun Oct 27 2019 03:00:00 GMT+0100
17 Sun Oct 27 2019 04:00:00 GMT+0100
Name: date, dtype: object
You will notice I have kept the two periods where a DST change occurs.
I use to_datetime() to store them as timestamps in a pandas dataframe
df['date'] = pd.to_datetime(my_timestamps)
print(df)
...
3 2019-03-31 00:00:00-01:00
4 2019-03-31 01:00:00-01:00
5 2019-03-31 03:00:00-02:00
6 2019-03-31 04:00:00-02:00
...
13 2019-10-27 01:00:00-02:00
14 2019-10-27 02:00:00-02:00
15 2019-10-27 02:00:00-01:00
16 2019-10-27 03:00:00-01:00
17 2019-10-27 04:00:00-01:00
Name: date, dtype: object
A first surprising (to me) thing is that the 'date' column keeps its dtype as 'object' and not 'datetime64'.
When I want to use these timestamps as indexes with
df.set_index('date', inplace = True, verify_integrity = True)
I get an error with verify_integrity check informing me there are duplicate indexes.
ValueError: Index has duplicate keys: Index([2019-10-27 02:00:00-01:00, 2019-10-27 03:00:00-01:00], dtype='object', name='date')
I obviously would like to solve that.
2. What I tried
My understanding is that the timezone data is not used, and that to use it, I should convert the timestamps so that their dtype is 'datetime64'.
I first added the flag utc=True in to_datetime.
test = pd.to_datetime(my_timestamps,utc=True)
But then, I simply don't understand the result:
...
3 2019-03-31 01:00:00+00:00
4 2019-03-31 02:00:00+00:00
5 2019-03-31 05:00:00+00:00
6 2019-03-31 06:00:00+00:00
...
13 2019-10-27 03:00:00+00:00
14 2019-10-27 04:00:00+00:00
15 2019-10-27 03:00:00+00:00
16 2019-10-27 04:00:00+00:00
17 2019-10-27 05:00:00+00:00
According to my understanding, the timezone has been interpreted with the sign reversed?!
3 Sun Mar 31 2019 00:00:00 GMT+0100
shifted in UTC time should read as
3 2019-03-30 23:00:00+00:00
but here it is translated into:
3 2019-03-31 01:00:00+00:00
This likely explains then the error of duplicate timestamps appearing
14 2019-10-27 04:00:00+00:00
...
16 2019-10-27 04:00:00+00:00
Please, has anyone any idea how to correctly handle the timezone information so that it doesn't lead to duplicate Index?
I thank you in advance for your help.
Have a good day,
Bests,
Pierrot
PS: I am fine with having the timestamps expressed in UTC, as long as the shift in hour is correctly managed.
3. Edit
It would seem the fromisoformat() function, new in Python 3.7, could help. However, it accepts a single string as input. I am not certain how it can be used in a "vectorized" manner to apply it to a complete dataframe column.
How to convert a timezone aware string to datetime in python without dateutil?
So there is indeed an issue in dateutil, as indicated above.
I reversed +/- sign in my original data file as indicated here:
How to replace a sub-string conditionally in a pandas dataframe column?
Bests,
Pierrot
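One possible alternative to reversing the sign in the data file is to give to_datetime an explicit format, so that "GMT" is matched as a literal and the offset is read by %z. This is only a sketch on a few of the strings shown above, not something stated in the original post.

```python
import pandas as pd

# A few of the strings from the question.
my_timestamps = pd.Series([
    "Sun Mar 31 2019 00:00:00 GMT+0100",
    "Sun Mar 31 2019 03:00:00 GMT+0200",
    "Sun Oct 27 2019 02:00:00 GMT+0200",
    "Sun Oct 27 2019 02:00:00 GMT+0100",
], name="date")

# "GMT" is treated as literal text; %z parses "+0100" as one hour
# ahead of UTC. utc=True converts everything to a single UTC column.
parsed = pd.to_datetime(my_timestamps,
                        format="%a %b %d %Y %H:%M:%S GMT%z",
                        utc=True)
print(parsed)
print(parsed.is_unique)
```

Because all values end up in UTC, the column gets a timezone-aware datetime64 dtype instead of object, and the two 02:00 local readings on Oct 27 map to different instants.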

Average pandas dataframe on time index for a particular time interval

I have a dataframe where each timestamp has some points earned by the user. It looks like the following, i.e. data was collected every few seconds
>> df.head()
points
timestamp
2017-05-29 17:40:45 5
2017-05-29 17:41:53 7
2017-05-29 17:42:34 3
2017-05-29 17:42:36 8
2017-05-29 17:42:37 6
Then I wanted to resample it to an interval of 5 minutes so I did this
>> df.resample("5min").mean()
points
timestamp
5/29/2017 17:40 8
5/29/2017 17:45 1
5/29/2017 17:50 4
5/29/2017 17:55 3
5/29/2017 18:00 8
5/30/2017 17:30 3
5/30/2017 17:35 3
5/30/2017 17:40 7
5/30/2017 17:45 8
5/30/2017 17:50 5
5/30/2017 17:55 7
5/30/2017 18:00 1
Now I want to give an input like input_time = "17:00-18:00" and divide that time range into 5-minute intervals, e.g. [17:05, 17:10 ... 17:55, 18:00]. Then, for each interval, I want to get the average points earned in that interval across all days. The results should look like the following
interval points
17:00 -
17:05 -
….
17:30 3
17:35 3
17:40 7.5
17:45 4.5
17:50 4.5
17:55 5
18:00 4.5
Need your help. Thanks
Create a DatetimeIndex with date_range and change its format with strftime:
input_time = "17:00-18:00"
s,e = input_time.split('-')
r = pd.date_range(s, e, freq='5min').strftime('%H:%M')
print (r)
['17:00' '17:05' '17:10' '17:15' '17:20' '17:25' '17:30' '17:35' '17:40'
'17:45' '17:50' '17:55' '18:00']
Then group the resampled frame by its index formatted as HH:MM, aggregate with mean, and finally reindex by the range:
df = df.groupby(df.index.strftime('%H:%M'))['points'].mean().reindex(r)
print (df)
17:00 NaN
17:05 NaN
17:10 NaN
17:15 NaN
17:20 NaN
17:25 NaN
17:30 3.0
17:35 3.0
17:40 7.5
17:45 4.5
17:50 4.5
17:55 5.0
18:00 4.5
Name: points, dtype: float64
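Put together as a reusable function, the approach might look like the sketch below. The function name interval_means and the four-row frame (taken from the resampled output in the question) are chosen for illustration.

```python
import pandas as pd

# A few rows of the 5-minute resampled data from the question.
resampled = pd.DataFrame(
    {"points": [8, 1, 7, 8]},
    index=pd.to_datetime(["2017-05-29 17:40", "2017-05-29 17:45",
                          "2017-05-30 17:40", "2017-05-30 17:45"]),
)


def interval_means(frame, input_time, freq="5min"):
    """Average points per time of day over the given "HH:MM-HH:MM" range.

    Hypothetical helper: frame is expected to be already resampled to freq.
    """
    start, end = input_time.split("-")
    # All slot labels in the range, e.g. "17:00", "17:05", ..., "18:00".
    slots = pd.date_range(start, end, freq=freq).strftime("%H:%M")
    # Average across days for each time of day.
    by_time = frame.groupby(frame.index.strftime("%H:%M"))["points"].mean()
    # Slots without data show up as NaN.
    return by_time.reindex(slots)


result = interval_means(resampled, "17:00-18:00")
print(result)
```

Grouping on the formatted time string is what drops the date, so the same slot on different days lands in one group.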
