Create Datetime index after groupby - python

I would like to revert index after groupby function.
Question is how to create a DateTime index having year, month, day in separate columns in Multindex.
Given a DataFrame as an example:
import pandas as pd
import numpy as np
index=pd.date_range('2011-1-1 00:00:00', '2011-1-31 23:50:00', freq='10min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
Then, get the sum over each hour using grupby:
day_h = df.groupby([lambda x: x.year, lambda x: x.month, lambda x: x.day,lambda x: x.hour]).mean()
This creates an Index, where year, month, day and hour are in separate columns.
A B
2011 1 1 0 0.209908 1.196164
2011 1 1 1 0.692531 0.518185
2011 1 1 2 1.674748 0.013136
2011 1 1 3 1.674748 0.013136
2011 1 1 4 1.674748 0.013136
2011 1 1 5 1.674748 0.013136
The desired output would be to have DateTime index:
A B
2011-1-1 00:00 0.209908 1.196164
2011-1-1 01:00 0.692531 0.518185
2011-1-1 03:00 1.674748 0.013136
2011-1-1 04:00 1.674748 0.013136
2011-1-1 05:00 1.674748 0.013136
In my files there are some missing rows, so I can't create a new index with 1h timestep.
My data after groupby Example data

Someone else on SO had a similar question, but their solution was to use resample. You can avoid resampling by mapping the tuples in the multi-index to create a new index. This will handle missing rows just fine.
day_h['new_index'] = day_h.index.map(lambda x: datetime.datetime(x[0], x[1], x[2], x[3]))
day_h.set_index('new_index')
Output:
A B
new_index
2011-01-01 00:00:00 -1.095114 1.995776
2011-01-01 01:00:00 -2.411459 4.508794
2011-01-01 02:00:00 -1.261747 4.953709
2011-01-01 03:00:00 -0.311934 5.454112
2011-01-01 04:00:00 2.095718 6.854375
2011-01-01 05:00:00 1.696756 3.518919
2011-01-01 06:00:00 0.623589 1.740478
2011-01-01 07:00:00 0.544426 0.916016
2011-01-01 08:00:00 2.331326 0.891177

Related

Plotting count of unique values in groupby

I have a dataset with that form :
>>> df
my_timestamp disease month
0 2016-01-01 15:00:00 2 jan
0 2016-01-01 11:00:00 1 jan
1 2016-01-02 15:00:00 3 jan
2 2016-01-03 15:00:00 4 jan
3 2016-01-04 15:00:00 2 jan
I wont to count the number of unique apparition by month, by values, then plot the count of every value by month.
df
values count
jan 2 3
jan 2 3
How can I plot it ? In one plot with month on x axis, one line for every values, and their count on y
If you want to plot by month, then you also need to plot by year if multiple years. You can use dt.strftime when using .groupby to group by year and month.
Given the following slightly altered dataset to include more months:
my_timestamp disease month
2016-01-01 15:00:00 2 jan
2016-02-01 11:00:00 1 feb
2017-01-02 15:00:00 3 jan
2017-01-02 15:00:00 4 jan
2016-01-04 15:00:00 2 jan
You can run the following
df['my_timestamp'] = pd.to_datetime(df['my_timestamp'])
df.groupby(df['my_timestamp'].dt.strftime('%Y-%m'))['disease'].nunique().plot()
What I did to get that data into barplot.
I created a month column. Then :
for v in df.disease.unique():
diseases = df_cut[df_cut['disease']==v].groupby('month_num')['disease'].count()
x = diseases.index
y = diseases.values
plt.bar(x, y)

Python: upsampling dataframe from daily to hourly data using ffill()

I'm trying to upsample my data from daily to hourly frequency and forward fill missing data.
I start with the following code:
df1 = pd.read_csv("DATA.csv")
df1.head(5)
I then used the following to convert to a datetime string and set the date/time as an index:
df1['DT'] = pd.to_datetime(df1['DT']).dt.strftime('%Y-%m-%d %H:%M:%S')
df1.set_index('DT')
I try to resample hourly as follows:
df1['DT'] = df1.resample('H').ffill()
But I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or
PeriodIndex, but got an instance of 'RangeIndex'
I thought my dtype was already date time as instructed by the pd.to_datetime code above. Nothing I try seems to be working. Can anyone please help me?
My expected output is as follows:
DT VALUE
2016-08-01 00:00:00 0.000000
2016-08-01 01:00:00 0.000000
2016-08-01 02:00:00 0.000000
etc.
The file itself has approximately 1000 rows. The first 50 rows or so are zero so to clarify where there's actual data:
DT VALUE
2018-12-13 00:00:00 24000.000000
2018-12-13 01:00:00 24000.000000
2018-12-13 02:00:00 24000.000000
...
2018-12-13 23:00:00 24000.000000
2018-12-14 00:00:00 26000.000000
2018-12-14 01:00:00 26000.000000
etc.
Try assign it back
df1=df1.set_index('DT')
Or
df1.set_index('DT',inplace=True)
I am assuming some initial rows of your dataset as you mentioned,
DT VALUE
0 2016-08-01 0
1 2016-08-02 0
2 2016-08-03 0
3 2016-08-04 0
4 2016-08-05 0
5 2016-08-06 0
6 2016-08-07 0
7 2016-08-08 0
8 2016-08-09 0
Then, make index on DT like this,
df = df.set_index('DT')
df
Output:
VALUE
DT
2016-08-01 0
2016-08-02 0
2016-08-03 0
2016-08-04 0
2016-08-05 0
2016-08-06 0
2016-08-07 0
2016-08-08 0
2016-08-09 0
Now, resample your dataframe,
df = df.resample('H').ffill()
df
Output: showing some initial values of output,
VALUE
DT
2016-08-01 00:00:00 0
2016-08-01 01:00:00 0
2016-08-01 02:00:00 0
2016-08-01 03:00:00 0
2016-08-01 04:00:00 0
2016-08-01 05:00:00 0
2016-08-01 06:00:00 0
2016-08-01 07:00:00 0
2016-08-01 08:00:00 0
2016-08-01 09:00:00 0
2016-08-01 10:00:00 0
You could convert the index to a pd.DatetimeIndex and then resample that. I also don't think you need (or want) the strftime() call:
df1 = pd.read_csv("DATA.csv")
df1['DT'] = pd.to_datetime(df1['DT'])
df1.set_index('DT')
df1.index = pd.DatetimeIndex(df1.index)
df1['DT'] = df1.resample('H').ffill()
NOTE: You could probably combine a bunch of this and it would still be quite clear, like:
df1 = pd.read_csv("DATA.csv")
df1.index = pd.DatetimeIndex(pd.to_datetime(df1['DT']))
df1['DT'] = df1.resample('H').ffill()

Reorder timestamps pandas

I have a pandas column that contain timestamps that are unordered. When I sort them it works fine except for the values H:MM:SS.
d = ({
'A' : ['8:00:00','9:00:00','10:00:00','20:00:00','24:00:00','26:20:00'],
})
df = pd.DataFrame(data=d)
df = df.sort_values(by='A',ascending=True)
Out:
A
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
0 8:00:00
1 9:00:00
Ideally, I'd like to add a zero before 5 letter strings. If I convert them all to time delta it converts the times after midnight into 1 day plus n amount of hours. e.g.
df['A'] = pd.to_timedelta(df['A'])
A
0 0 days 08:00:00
1 0 days 09:00:00
2 0 days 10:00:00
3 0 days 20:00:00
4 1 days 00:00:00
5 1 days 02:20:00
Intended Output:
A
0 08:00:00
1 09:00:00
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
If you only need to sort by the column as timedelta, you can convert the column to timedelta and use argsort on it to create the sorting order to sort the data frame:
df.iloc[pd.to_timedelta(df.A).argsort()]
# A
#0 8:00:00
#1 9:00:00
#2 10:00:00
#3 20:00:00
#4 24:00:00
#5 26:20:00

Convert and order time in a pandas df

I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am. I'd like to add 24hrs to times after midnight. So times read 08:00:00 to 27:00:00 am. The problem is the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try order the times via
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done though. Specifically, if I can differentiate between 8:00:00 am and the times after midnight (1am-3am).
Add a day offset for times after midnight and before when a new "day" is supposed to begin (pick some time after 3 am & before 7 am) & then sort values
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string-formatting with the appropriate timedelta components
Ex:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object

How does pandas perform a groupby on a DateTime series by using a list as grouping criteria?

If df is a Dataframe indexed by DateTime objects, the following code splits it into the list groups_list where each index containts all the data in df that belongs to a given day:
groupby_clause = [df.index.year,df.index.month,df.index.day]
groups_list = [group[1] for group in df.groupby(groupby_clause)]
I am having trouble, though, to understand how the grouping is actually made, since I don't need to label the elements of groupby_clause as year, month, and day for the grouping to be made on DateTime objects.
As an example, I have the following components for groups_list:
Maybe I'm missing something obvious, but I don't get it: how does pandas know that it should associate groupby_clause[0] to year, groupby_clause[1] to month, and groupby_clause[2] to day in order to group the dataframe indexes that have DateTime type?
Suppose you have a DataFrame like this:
0
2011-01-01 00:00:00 -0.324398
2011-01-01 01:00:00 -0.761585
2011-01-01 02:00:00 0.057204
2011-01-01 03:00:00 -1.162510
2011-01-01 04:00:00 -0.680896
2011-01-01 05:00:00 -0.701835
2011-01-01 06:00:00 -0.431338
2011-01-01 07:00:00 0.306935
2011-01-01 08:00:00 -0.503177
2011-01-01 09:00:00 -0.507444
2011-01-01 10:00:00 0.230590
2011-01-01 11:00:00 -2.326702
2011-01-01 12:00:00 -0.034664
2011-01-01 13:00:00 0.224373
2011-01-01 14:00:00 -0.242884
If you want the index to be by year month and date then just set_index it:
df.set_index([ts.index.year, ts.index.month, ts.index.day])
Output
0
2011 1 1 -0.324398
1 -0.761585
1 0.057204
1 -1.162510
1 -0.680896
1 -0.701835
1 -0.431338
1 0.306935
1 -0.503177
1 -0.507444
1 0.230590
1 -2.326702
1 -0.034664
1 0.224373
1 -0.242884
1 -0.134757
1 -1.177362
1 0.931335
1 0.904084
1 -0.757860
1 0.406597
1 -0.664150

Categories

Resources