Pandas MultiIndex: Partial indexing on second level - python

I have a dataset loaded in Pandas with a two-level MultiIndex. The first level of the MultiIndex is a unique ID (SID) and the second level is time (ISO_TIME). A sample of the dataset is given below.
                                  SEASON NATURE  NUMBER
SID           ISO_TIME
2020138N10086 2020-05-16 12:00:00   2020     NR      26
              2020-05-16 15:00:00   2020     NR      26
              2020-05-16 18:00:00   2020     NR      26
              2020-05-16 21:00:00   2020     NR      26
              2020-05-17 00:00:00   2020     NR      26
2020155N17072 2020-06-02 18:00:00   2020     NR      30
              2020-06-02 21:00:00   2020     NR      30
              2020-06-03 00:00:00   2020     NR      30
              2020-06-03 03:00:00   2020     NR      30
              2020-06-03 06:00:00   2020     NR      30
2020327N11056 2020-11-21 18:00:00   2020     NR     103
              2020-11-21 21:00:00   2020     NR     103
              2020-11-22 00:00:00   2020     NR     103
              2020-11-22 03:00:00   2020     NR     103
              2020-11-22 06:00:00   2020     NR     103
2020329N10084 2020-11-23 12:00:00   2020     NR     104
              2020-11-23 15:00:00   2020     NR     104
              2020-11-23 18:00:00   2020     NR     104
              2020-11-23 21:00:00   2020     NR     104
              2020-11-24 00:00:00   2020     NR     104
I can do df.loc[("2020138N10086")] to select the rows with SID=2020138N10086, or df.loc[("2020138N10086", "2020-05-17")] to select the rows with SID=2020138N10086 that fall on 2020-05-17.
What I want to do, but am not able to, is to partially index using the second level of the MultiIndex; that is, select all rows on 2020-05-17, irrespective of the SID.
I have read through the Pandas MultiIndex / advanced indexing documentation, which explains how indexing is done with a MultiIndex, but nowhere in it could I find how to do partial indexing on the second (inner) level. Either I missed it or it is not explained there.
So, is it possible to do partial indexing on the second level of a Pandas MultiIndex?
If it is possible, how do I do it?

You can do this with slicing; see the pandas documentation.
Example for your dataframe:
df.loc[(slice(None), '2020-05-17'), :]
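Equivalently, pd.IndexSlice gives the same selection with slightly more readable syntax (a small addition, not part of the original answer; if pandas complains about an unsorted index, sort it first with df = df.sort_index()):
idx = pd.IndexSlice
df.loc[idx[:, '2020-05-17'], :]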

df = df.reset_index()
dates_rows = df[df["ISO_TIME"].dt.normalize() == "2020-05-17"]  # normalize drops the time, so every row on that date matches
If you want, you can convert it back to a multi-level index again, like below:
df.set_index(['SID', 'ISO_TIME'], inplace=True)

Use a cross-section
df.xs('2020-05-17', level="ISO_TIME")
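By default xs drops the level you select on; pass drop_level=False to keep ISO_TIME in the result (a small addition to the answer above):
df.xs('2020-05-17', level="ISO_TIME", drop_level=False)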

Related

Filtering dataframe given a list of dates

I have the following dataframe:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
3 1999-10-05 12:00:00 53
4 1999-10-10 16:00:00 43
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
I have a list of datetimes that I get with tolist() from another dataframe.
[Timestamp('1999-10-01 00:00:00'),
Timestamp('1999-10-02 00:00:00'),
Timestamp('1999-10-24 00:00:00')]
The purpose of the list is to filter the dataframe based on the dates it contains. The end result is:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
where only the rows for 1, 2 and 24 Oct appear in the dataframe.
What is the approach to do this? I have looked around and only found solutions for filtering between two dates or on a single date.
Thank you.
If you want to compare Timestamps without the time component, use Series.dt.normalize:
df1 = df[df['Date'].dt.normalize().isin(L)]
Or Series.dt.floor:
df1 = df[df['Date'].dt.floor('d').isin(L)]
To compare by date it is also necessary to convert the list to dates:
df1 = df[df['Date'].dt.date.isin([x.date() for x in L])]
print (df1)
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
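For reference, a self-contained version of the first approach, with the dataframe and list reconstructed from the question:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "1999-10-01 12:00:00", "1999-10-01 16:00:00", "1999-10-02 11:00:00",
        "1999-10-05 12:00:00", "1999-10-10 16:00:00", "1999-10-24 07:00:00",
        "1999-10-24 08:00:00"]),
    "Site": [65, 21, 57, 53, 43, 33, 21]})

L = [Timestamp('1999-10-01'), Timestamp('1999-10-02'), Timestamp('1999-10-24')]

# normalize drops the time component, so rows are matched by date only
df1 = df[df['Date'].dt.normalize().isin(L)]
print(df1)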

How to use pd.interpolate to fill only the gaps with a single missing value

I have time series data for air pollution with several gaps of missing values, like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of several consecutive or non-consecutive NAs, and there are some helpful statistics produced in R:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1(occurring 50 times)
Generally, if I use df.interpolate(limit=1),
gaps with more than one missing value get partially interpolated as well.
So I guess a better way to interpolate only the gaps with a single missing value is to identify each gap first.
To do so, I grouped the gaps by size using the following code:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How can I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I use cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the datetime index is lost.
Are there better solutions for this? Or how can I interpolate case by case?
Can anyone help me?
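A minimal sketch of the cumsum idea described in the question, on a toy series (my own illustration under assumed column names, not an answer from the original thread):
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0],
              index=pd.date_range("2019-01-09 12:00", periods=6, freq="H"),
              name="PM2.5")

grp = s.notna().cumsum()          # consecutive NaNs share one group id
na = s.isna()
gap_size = na.groupby(grp).sum()  # length of the NaN run in each group
first_missing = s.index.to_series()[na].groupby(grp[na]).first()

gaps = pd.DataFrame({"gap size": gap_size[gap_size > 0].values},
                    index=first_missing.values)
print(gaps)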

How to aggregate a function on a specific column for each day

I have a CSV file that has minute data in it.
The end goal is to find the standard deviations of all of the lows ('Low' column) of each day, using all of the data of each day.
The issue is that the CSV file has some holes in it, in that each day does not have exactly 390 minutes (the number of minutes in a trading day). The code looks like this:
import pandas as pd
import datetime as dt
df = pd.read_csv('/Volumes/Seagate Portable/S&P 500 List/AAPL.txt')
df.columns = ['Extra', 'Dates', 'Open', 'High', 'Low', 'Close', 'Volume']
df.drop(['Extra', 'Open', 'High', 'Volume'], axis=1, inplace=True)
df.Dates = pd.to_datetime(df.Dates)
df.set_index(df.Dates, inplace=True)
df = df.between_time('9:30', '16:00')
print(df.Low[::390])
The output is as follows:
Dates
2020-01-02 09:30:00 73.8475
2020-01-02 16:00:00 75.0875
2020-01-03 15:59:00 74.3375
2020-01-06 15:58:00 74.9125
2020-01-07 15:57:00 74.5028
...
2020-12-14 09:41:00 122.8800
2020-12-15 09:40:00 125.9900
2020-12-16 09:39:00 126.5600
2020-12-17 09:38:00 129.1500
2020-12-18 09:37:00 127.9900
Name: Low, Length: 245, dtype: float64
As you can see in the output, if even one 9:30 row is missing I can no longer step through the data by 390. So my idea is to use as much data as is available for each day, detecting the start of a new day whenever the time jumps back from around 15:59 or 16:00 to around 9:30, 9:31 or 9:32. Is that a reasonable approach, and if so, what would be the best way to code it? Or are there other solutions?
Use .groupby() with pandas.Grouper() on 'date', with freq='D' for day, and then aggregate .std() on 'low'.
The 'date' column must be a datetime dtype. Use pd.to_datetime() to convert the 'Dates' column, if needed.
If desired, use df = df.set_index('date').between_time('9:30', '16:00').reset_index() to select only times within a specific range. This would be done before the .groupby().
The 'date' column needs to be the index, to use .between_time().
import requests
import pandas as pd
# sample stock data
periods = '3600'
resp = requests.get('https://api.cryptowat.ch/markets/poloniex/ethusdt/ohlc', params={'periods': periods})
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=['date', 'open', 'high', 'low', 'close', 'volume', 'amount'])
# convert to a datetime format
df['date'] = pd.to_datetime(df['date'], unit='s')
# display(df.head())
date open high low close volume amount
0 2020-11-22 02:00:00 550.544464 554.812114 536.523241 542.000000 2865.381737 1.567462e+06
1 2020-11-22 03:00:00 541.485933 551.621355 540.992000 548.500000 1061.275481 5.796859e+05
2 2020-11-22 04:00:00 548.722267 549.751680 545.153196 549.441709 310.874748 1.703272e+05
3 2020-11-22 05:00:00 549.157866 549.499632 544.135302 546.913493 259.077448 1.416777e+05
4 2020-11-22 06:00:00 547.600000 548.000000 541.668524 544.241871 363.433373 1.979504e+05
# groupby day, using pd.Grouper and then get std of low
std = df.groupby(pd.Grouper(key='date', freq='D'))['low'].std().reset_index(name='low std')
# display(std)
date low std
0 2020-11-22 14.751495
1 2020-11-23 14.964803
2 2020-11-24 6.542568
3 2020-11-25 9.523858
4 2020-11-26 24.041421
5 2020-11-27 8.272477
6 2020-11-28 12.340238
7 2020-11-29 8.444779
8 2020-11-30 10.290333
9 2020-12-01 13.605846
10 2020-12-02 6.201248
11 2020-12-03 9.403853
12 2020-12-04 12.667251
13 2020-12-05 10.180626
14 2020-12-06 4.481538
15 2020-12-07 3.881311
16 2020-12-08 10.518746
17 2020-12-09 12.077622
18 2020-12-10 6.161330
19 2020-12-11 5.035066
20 2020-12-12 6.297173
21 2020-12-13 9.739574
22 2020-12-14 3.505540
23 2020-12-15 3.304968
24 2020-12-16 16.753780
25 2020-12-17 10.963064
26 2020-12-18 5.574997
27 2020-12-19 4.976494
28 2020-12-20 7.243917
29 2020-12-21 16.844777
30 2020-12-22 10.348576
31 2020-12-23 15.769288
32 2020-12-24 10.329158
33 2020-12-25 5.980148
34 2020-12-26 8.530006
35 2020-12-27 21.136509
36 2020-12-28 16.115898
37 2020-12-29 10.587339
38 2020-12-30 7.634897
39 2020-12-31 7.278866
40 2021-01-01 6.617027
41 2021-01-02 19.708119

Interpolating missing values for time series based on the values of the same period from a different year

I have a time series like the following:
date value
2017-08-27 564.285714
2017-09-03 28.857143
2017-09-10 NaN
2017-09-17 NaN
2017-09-24 NaN
2017-10-01 236.857143
... ...
2018-09-02 345.142857
2018-09-09 288.714286
2018-09-16 274.000000
2018-09-23 248.142857
2018-09-30 166.428571
The series ranges from July 2017 to November 2019 and is resampled by week. However, there are some weeks where the values were 0; I replaced those with missing values, and now I would like to fill them based on the values of the same period from a different year. For example, a lot of data is missing for September 2017, and I would like to interpolate those values using the values from September 2018. However, I'm a newbie and I'm not quite sure how to do it based only on a selected period. I'm working in Python, btw.
If anyone has any idea on how to do this quickly, I'd very much appreciate it.
If you are OK with using the pandas library, one option is to derive the week number from the date and use it to fill the NaN values.
df['week'] = pd.to_datetime(df['date'], format='%Y-%m-%d').dt.strftime("%V")
df2 = df.sort_values(['week']).fillna(method='bfill').sort_values(['date'])
df2
which will give you the following output.
date value week
0 2017-08-27 564.285714 34
1 2017-09-03 28.857143 35
2 2017-09-10 288.714286 36
3 2017-09-17 274.000000 37
4 2017-09-24 248.142857 38
5 2017-10-01 236.857143 39
6 2018-09-02 345.142857 35
7 2018-09-09 288.714286 36
8 2018-09-16 274.000000 37
9 2018-09-23 248.142857 38
10 2018-09-30 166.428571 39
In Pandas:
df['value'] = df['value'].fillna(df['value_last_year'])
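The snippet above assumes a value_last_year column already exists. A sketch of one way to build it, assuming the series is strictly weekly so that shifting by 52 rows lines up with the same week of the previous year (the column name and shift length are assumptions, not from the original answer):
# 52 weekly rows back ~ the same week one year earlier (assumes no missing weeks)
df['value_last_year'] = df['value'].shift(52)
df['value'] = df['value'].fillna(df['value_last_year'])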

Automating interpolation of missing values in pandas dataframe

I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar datasets in the system.
In each dataset, there are holes in the data. In the current example, there are about 85 days of the year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
This resamples the data to one row per day and then fills the gaps (departure_date needs to be the index for resample to work):
dataFrame['bookings'].resample('D').pad()
You can find more resampler options on this page (so you can select the one that best fits your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
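Alternatively, closer to the plan in the question (insert the missing dates as NaN, then interpolate by time), a sketch with the booking data reconstructed as a toy example:
import pandas as pd

# toy reconstruction of the booking data described in the question
dataFrame = pd.DataFrame({
    'departure_date': pd.to_datetime(['2017-11-02', '2017-11-03', '2017-11-05', '2017-11-06']),
    'bookings': [43, 27, 27, 22]})

# reindex to daily frequency so the missing dates appear with NaN bookings
daily = dataFrame.set_index('departure_date').asfreq('D')

# then interpolate over the gaps, as planned in the question
daily['bookings'] = daily['bookings'].interpolate(method='time')
print(daily)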
