Efficiently slicing non-integer multilevel indexes with integers in Pandas - python

The following code generates a sample DataFrame with a multilevel index. The first level is a string, the second level is a datetime.
Script
import pandas as pd
from datetime import datetime
import random

networks = ['ALPHA', 'BETA', 'GAMMA']
times = pd.date_range(datetime.strptime('2021-01-01 00:00:00', '%Y-%m-%d %H:%M:%S'),
                      datetime.strptime('2021-01-01 12:00:00', '%Y-%m-%d %H:%M:%S'),
                      periods=7).tolist()

# one row per (network, time) pair with random metrics
rows = []
for n in networks:
    for t in times:
        rows.append({'network': n, 'time': t,
                     'active_clients': random.randint(10, 30),
                     'throughput': random.randint(1500, 5000),
                     'speed': random.randint(10000, 12000)})

df = pd.DataFrame(rows, columns=['network', 'time', 'active_clients', 'throughput', 'speed'])
df.set_index(['network', 'time'], inplace=True)
print(df.to_string())
Output
active_clients throughput speed
network time
ALPHA 2021-01-01 00:00:00 16 4044 11023
2021-01-01 02:00:00 17 2966 10933
2021-01-01 04:00:00 10 4649 11981
2021-01-01 06:00:00 23 3629 10113
2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 00:00:00 17 3073 11798
2021-01-01 02:00:00 20 1941 10640
2021-01-01 04:00:00 17 1980 11869
2021-01-01 06:00:00 23 3346 10002
2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 00:00:00 21 4366 11587
2021-01-01 02:00:00 22 3404 11669
2021-01-01 04:00:00 20 1608 10344
2021-01-01 06:00:00 28 1849 10278
2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
For each item in the first level, I want to select the last three records in the second level. The catch is that I don't know the datetime values, so I need to select by integer-based index location instead. What's the most efficient way of slicing the DataFrame to achieve the following?
Desired output
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
My attempts
Returns the full dataframe:
df_sel = df.iloc[:,-3:]
Raises an error because loc doesn't support using integer values on datetime objects:
df_sel = df.loc[:,-3:]
Returns the last three entries in the second level, but only for the last entry in the first level:
df_sel = df.loc[:].iloc[-3:]

I have two methods to solve this problem:
Method 1:
As Quang Hoang mentions in the first comment, you can use groupby to do this, which I believe is the shortest code:
df.groupby(level=0).tail(3)
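If you prefer to refer to the index level by name rather than by position, the same call can be written as a small variant:
df.groupby(level='network').tail(3)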
Method 2:
You can also slice the last three rows for each name in networks and then concat them:
pd.concat([df.loc[[i]][-3:] for i in networks])
Both of these methods output the desired result shown above.

Another method is to do some reshaping:
df.unstack(0).iloc[-3:].stack().swaplevel(0,1).sort_index()
Output:
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 26 4081 11325
2021-01-01 10:00:00 13 3370 10716
2021-01-01 12:00:00 13 3691 10737
BETA 2021-01-01 08:00:00 28 2105 10465
2021-01-01 10:00:00 21 2444 10158
2021-01-01 12:00:00 24 1947 11226
GAMMA 2021-01-01 08:00:00 13 1850 10288
2021-01-01 10:00:00 23 2241 11521
2021-01-01 12:00:00 30 3515 11138
Details:
unstack the outermost index level, level=0
Use iloc to select the last three records in the dataframe
stack that level back into the index, then swaplevel and sort_index (see the sketch below)
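Broken out into separate steps, the same reshaping chain looks roughly like this (a sketch using the df from the question):
df_wide = df.unstack(0)              # move 'network' from the index into the columns
df_last3 = df_wide.iloc[-3:]         # keep only the last three timestamps
df_out = (df_last3.stack()           # put 'network' back into the index
                  .swaplevel(0, 1)   # restore the ('network', 'time') order
                  .sort_index())     # group each network's rows together again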

Related

Turn float number into datetime format

I am working with a dataset where I have dates in datetime format in the first column and hours as floats in separate columns, like this:
date 1.0 2.0 3.0 ... 21.0 22.0 23.0 24.0
0 2021-01-01 24.95 24.35 23.98 ... 27.32 26.98 26.44 25.64
1 2021-01-02 25.59 24.91 24.74 ... 27.38 26.96 26.85 25.94
and what I want to achieve is this:
Date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 04:00:00 ...
So I have been figuring that the first step should be to change the hours into datetime format.
I have been trying, for example: df[1.0] = pd.to_datetime(df[1.0], format='%h')
which gives me: "ValueError: 'h' is a bad directive in format '%h'"
And then I would rearrange the columns and rows. I have been thinking about doing this with pandas pivot_table and transform. Any help would be appreciated. Thank you.
Use DataFrame.set_index first, convert all columns to timedeltas, reshape by DataFrame.unstack, and finally join the dates and timedeltas:
df['date'] = pd.to_datetime(df['date'])
f = lambda x: pd.to_timedelta(float(x), unit='h')
df1 = (df.set_index('date')
         .rename(columns=f)
         .unstack()
         .reset_index(name='Price')
         .assign(date=lambda x: x['date'] + x.pop('level_0')))
print(df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-02 01:00:00 25.59
2 2021-01-01 02:00:00 24.35
3 2021-01-02 02:00:00 24.91
4 2021-01-01 03:00:00 23.98
5 2021-01-02 03:00:00 24.74
6 2021-01-01 21:00:00 27.32
7 2021-01-02 21:00:00 27.38
8 2021-01-01 22:00:00 26.98
9 2021-01-02 22:00:00 26.96
10 2021-01-01 23:00:00 26.44
11 2021-01-02 23:00:00 26.85
12 2021-01-02 00:00:00 25.64
13 2021-01-03 00:00:00 25.94
Or use DataFrame.melt and then join the column converted to timedeltas:
df['date'] = pd.to_datetime(df['date'])
df1 = (df.melt('date', value_name='Price')
         .assign(date=lambda x: x['date'] +
                 pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
         .sort_values('date', ignore_index=True))
print(df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 21:00:00 27.32
4 2021-01-01 22:00:00 26.98
5 2021-01-01 23:00:00 26.44
6 2021-01-02 00:00:00 25.64
7 2021-01-02 01:00:00 25.59
8 2021-01-02 02:00:00 24.91
9 2021-01-02 03:00:00 24.74
10 2021-01-02 21:00:00 27.38
11 2021-01-02 22:00:00 26.96
12 2021-01-02 23:00:00 26.85
13 2021-01-03 00:00:00 25.94

How to apply a condition to Pandas dataframe rows, but only apply the condition to rows of the same day?

I have a dataframe that's indexed by datetime and has one column of integers and another column in which I want to put a string if a condition on the integers is met. I need the condition to assess the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition; however, if the shifted row is on a different day the condition still uses it, and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't figure out whether that will work or not.
Not sure if this is the best way to do it, but it gets you the answer:
import numpy as np

df['out'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
I think this is what you want. You were probably closer to the answer than you thought...
Two dataframes are used below to show that the logic works whether the data is random or a sorted integer range.
You will need to import random to generate the second one.
dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))

def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x

# Shows Success in all rows except where the date changes, because the values
# are a range in ascending order
df = pd.DataFrame({'IntCol': range(10, 26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
# Random numbers, to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success
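The same day-bounded comparison can also be written without apply, as a rough sketch:
prev_same_day = df.groupby(df.index.date)['IntCol'].shift(1)
df.loc[df['IntCol'] > prev_same_day, 'StringCol'] = 'Success'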

How to use pd.interpolate to fill only the gaps with one missing data point

I have time series data for air pollution with several missing gaps, like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of several consecutive or non-consecutive NAs, and there are some helpful statistics computed in R, like:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1(occurring 50 times)
Generally, if I use df.interpolate(limit=1),
gaps with more than one missing value get partially interpolated as well.
So I guess a better way to interpolate only the gaps with a single missing value is to get the gap id.
To do so, I grouped the different gap sizes and used the following:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How to get the index of each first missing value in each gap like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the datetime index was lost.
Are there better solutions to do this?
Or how can I interpolate case by case?
Can anyone help me?
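A rough sketch for measuring the gap sizes and filling only the single-value gaps could look like this (assuming a numeric 'PM2.5' column with a DatetimeIndex; the 'PM2.5_filled' name is just illustrative):
s = df['PM2.5']
is_na = s.isna()
gap_id = (~is_na).cumsum()[is_na]             # consecutive NAs share the same id
gap_size = gap_id.map(gap_id.value_counts())  # length of the gap each NA belongs to
single = (gap_size == 1).reindex(s.index, fill_value=False)
df['PM2.5_filled'] = s.mask(single, s.interpolate())  # fill only the size-1 gaps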

Pandas, insert datetime values that increase one hour for each row

I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x=pd.to_datetime('2021-01-01 00:00')
y=pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x,y,freq='H'))
Output:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
...
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
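To attach this directly to the predictions as a column, a minimal sketch (assuming the predictions live in a DataFrame called preds with 168 rows):
preds['datetime'] = pd.date_range('2021-01-01 00:00', periods=len(preds), freq='H')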

Select groups using slicing based on the group index in pandas DataFrame

I have a DataFrame with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how could I achieve the same kind of integer slicing on the groups?
I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
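An equivalent sketch uses GroupBy.ngroup(), which numbers the groups, so you can keep only the rows belonging to the first three:
df[df.groupby('user_id').ngroup() < 3]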
