Automating interpolation of missing values in pandas dataframe - python

I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set, there are holes in the data. In the current example, there are about 85 days of the year for which we have no booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column and set the corresponding values in the bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
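For reference, a minimal sketch of both steps using reindex, assuming departure_date has already been parsed with pd.to_datetime (column and variable names are taken from the question):
import pandas as pd

# full daily range covered by the data set
full_range = pd.date_range('2017-11-02', '2018-11-01', freq='D')

# step 1: insert the missing dates; their bookings show up as NaN
daily = (dataFrame.set_index('departure_date')
                  .reindex(full_range)
                  .rename_axis('departure_date'))

# step 2: time-weighted interpolation works because the index is now a DatetimeIndex
daily['bookings'] = daily['bookings'].interpolate(method='time')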

This resamples the data to one row per day and then fills the gaps:
dataFrame['bookings'].resample('D').pad()
You can find more resampling options on this page (so you can select the one that best fits your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
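Note that resample, like interpolate(method='time'), needs a DatetimeIndex (or an on= column). If the goal is to leave the new daily rows as NaN so they can be interpolated afterwards, as planned in the question, asfreq() may be closer than pad(), which forward-fills. A minimal sketch under that assumption:
# pad() forward-fills the inserted daily rows; asfreq() leaves them as NaN,
# which keeps them available for the time-based interpolation planned above
bookings_daily = dataFrame.set_index('departure_date')['bookings'].resample('D').asfreq()
bookings_daily = bookings_daily.interpolate(method='time')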


Shift specific rows to correct missing values in a Pandas Dataframe

Python beginner here.
I couldn't find anything similar to this, but I have the feeling it shouldn't be so hard.
I have a large Excel sheet with values from different sensors, but some of the values are missing due to errors in the measurements. So when I put everything into a pandas dataframe I have something like this:
TimeStamp1  Sensor1  TimeStamp2  Sensor2
08:00       100      08:00       60
08:05       102      08:10       40
08:10       105      08:15       50
08:15       101      08:25       31
08:20       103      NaT         NaN
08:25       104      NaT         NaN
The real dataframe has 7 sensors and more than 100k rows, so there are different numbers of NaT's and NaN's in different columns.
I need the timestamps for each sensor to be aligned in order to avoid inconsistencies. So I want to shift the rows in TimeStamp2 and Sensor2 down from the point where they start to differ from TimeStamp1, insert the missing time together with a NaN (or empty) value in Sensor2, and drop the trailing NaT and NaN from both columns.
An output like this:
TimeStamp1  Sensor1  TimeStamp2  Sensor2
08:00       100      08:00       60
08:05       102      08:05       Empty (NaN)
08:10       105      08:10       40
08:15       101      08:15       50
08:20       103      08:20       Empty (NaN)
08:25       104      08:25       31
I guess I could simplify the question by asking for a way to insert a specific element at a specific row of a specific column. All the shifting examples I've seen shift the entire column up or down. Is there an easy way to do this?
If it's easier, this solution also works for me:
TimeStamp  Sensor1  Sensor2
08:00      100      60
08:05      102      Empty (NaN)
08:10      105      40
08:15      101      50
08:20      103      Empty (NaN)
08:25      104      31
#ti7's suggestion is spot on; split the dataframe into individual frames, then merge and fillna:
sensor1 = df.filter(like='1')
sensor2 = df.filter(like='2')

(sensor1.merge(sensor2,
               how='outer',
               left_on='TimeStamp1',
               right_on='TimeStamp2',
               sort=True)
        .fillna({"TimeStamp2": df.TimeStamp1})
        .dropna(subset=['TimeStamp1'])
)
TimeStamp1 Sensor1 TimeStamp2 Sensor2
0 08:00 100.0 08:00 60.0
1 08:05 102.0 08:05 NaN
2 08:10 105.0 08:10 40.0
3 08:15 101.0 08:15 50.0
4 08:20 103.0 08:20 NaN
5 08:25 104.0 08:25 31.0
This will work if your data is set up exactly as in your example; otherwise you'll have to adapt it to your data.
# change the timestamp columns to datetime. You don't say if there's a date
# component, so you may have to get your timestamps in order before moving on.
timestamps = df.filter(regex='TimeStamp').columns.tolist()
for t in timestamps:
    df[t] = pd.to_datetime(df[t])

# get the max and min of all datetimes in the timestamp columns
end = df.filter(regex='TimeStamp').max().max()
start = df.filter(regex='TimeStamp').min().min()

# create a new date range
new_dates = pd.date_range(start=start, end=end, freq='5Min')

# get the number of columns for iteration - it should be even, with alternating
# timestamp and sensor columns, as your example shows
num_columns = df.shape[1]

# iterate and concat
dflist = []
for i in range(0, num_columns, 2):
    d = df.iloc[:, i:i+2].set_index(df.iloc[:, i].name).dropna().reindex(new_dates)
    dflist.append(d)

pd.concat(dflist, axis=1)
Sensor1 Sensor2
2021-10-18 08:00:00 100 60.0
2021-10-18 08:05:00 102 NaN
2021-10-18 08:10:00 105 40.0
2021-10-18 08:15:00 101 50.0
2021-10-18 08:20:00 103 NaN
2021-10-18 08:25:00 104 31.0
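For completeness, here is a self-contained sketch of the same reindex-and-concat idea on the two-sensor sample from the question. The times are parsed onto an arbitrary (today's) date by pd.to_datetime, so treat it as illustrative only:
import pandas as pd
import numpy as np

# two-sensor sample reconstructed from the question
df = pd.DataFrame({
    'TimeStamp1': pd.to_datetime(['08:00', '08:05', '08:10', '08:15', '08:20', '08:25']),
    'Sensor1':    [100, 102, 105, 101, 103, 104],
    'TimeStamp2': pd.to_datetime(['08:00', '08:10', '08:15', '08:25', None, None]),
    'Sensor2':    [60, 40, 50, 31, np.nan, np.nan],
})

# one common 5-minute grid covering the whole measurement window
new_dates = pd.date_range(df['TimeStamp1'].min(), df['TimeStamp1'].max(), freq='5Min')

# index each sensor by its own timestamps, drop the trailing NaT rows,
# reindex onto the common grid, and line the sensors up side by side
aligned = pd.concat(
    [df[['TimeStamp1', 'Sensor1']].dropna().set_index('TimeStamp1').reindex(new_dates),
     df[['TimeStamp2', 'Sensor2']].dropna().set_index('TimeStamp2').reindex(new_dates)],
    axis=1,
)
print(aligned)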

Pandas MultiIndex: Partial indexing on second level

I have a data-set open in Pandas with a 2-level MultiIndex. The first level of the MultiIndex is a unique ID (SID) while the second level is time (ISO_TIME). A sample of the data-set is given below.
SEASON NATURE NUMBER
SID ISO_TIME
2020138N10086 2020-05-16 12:00:00 2020 NR 26
2020-05-16 15:00:00 2020 NR 26
2020-05-16 18:00:00 2020 NR 26
2020-05-16 21:00:00 2020 NR 26
2020-05-17 00:00:00 2020 NR 26
2020155N17072 2020-06-02 18:00:00 2020 NR 30
2020-06-02 21:00:00 2020 NR 30
2020-06-03 00:00:00 2020 NR 30
2020-06-03 03:00:00 2020 NR 30
2020-06-03 06:00:00 2020 NR 30
2020327N11056 2020-11-21 18:00:00 2020 NR 103
2020-11-21 21:00:00 2020 NR 103
2020-11-22 00:00:00 2020 NR 103
2020-11-22 03:00:00 2020 NR 103
2020-11-22 06:00:00 2020 NR 103
2020329N10084 2020-11-23 12:00:00 2020 NR 104
2020-11-23 15:00:00 2020 NR 104
2020-11-23 18:00:00 2020 NR 104
2020-11-23 21:00:00 2020 NR 104
2020-11-24 00:00:00 2020 NR 104
I can do df.loc[("2020138N10086")] to select rows with SID=2020138N10086, or df.loc[("2020138N10086", "2020-05-17")] to select rows with SID=2020138N10086 that fall on 2020-05-17.
What I want to do, but am not able to, is to partially index using the second level of the MultiIndex. That is, select all rows on 2020-05-17, irrespective of the SID.
I have read through Pandas MultiIndex / advanced indexing, which explains how indexing is done with a MultiIndex. But nowhere in it could I find how to do partial indexing on the second/inner level of a Pandas MultiIndex. Either I missed it in the document or it is not explained there.
So, is it possible to do partial indexing on the second level of a Pandas MultiIndex?
If it is possible, how do I do it?
You can do this with slicing; see the pandas documentation.
Example for your dataframe:
df.loc[(slice(None), '2020-05-17'), :]
Alternatively, you can reset the index and filter on the column:
df = df.reset_index()
dates_rows = df[df["ISO_TIME"].dt.normalize() == "2020-05-17"]  # match the date regardless of the time of day
If you want, you can convert it back to a multi-level index again, like below:
df.set_index(['SID', 'ISO_TIME'], inplace=True)
Use a cross-section
df.xs('2020-05-17', level="ISO_TIME")
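A self-contained sketch pulling these together (the data is made up to mirror the question's index structure; note that label slicing on a MultiIndex generally wants a sorted index, hence the sort_index() call):
import pandas as pd

# toy frame mirroring the (SID, ISO_TIME) MultiIndex from the question
idx = pd.MultiIndex.from_product(
    [['2020138N10086', '2020155N17072'],
     pd.date_range('2020-05-16 12:00', periods=5, freq='3H')],
    names=['SID', 'ISO_TIME'],
)
df = pd.DataFrame({'SEASON': 2020, 'NATURE': 'NR', 'NUMBER': 26}, index=idx)

# partial indexing on the inner level needs a lexsorted index
df = df.sort_index()

# all rows on 2020-05-17, irrespective of the SID
print(df.loc[(slice(None), '2020-05-17'), :])

# the same selection written with pd.IndexSlice
print(df.loc[pd.IndexSlice[:, '2020-05-17'], :])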

Pandas percentage difference Calculation

I have the Pandas dataframe below. The first column is a date in YYYY-MM-DD format. It has month-on-month data, but a month does not necessarily start on the 1st or end on the 31st/30th (or 29th/28th in the case of February); it might vary. For example, February 2020 has data starting only from 2020-02-03, and the last available date for February is 2020-02-28 (not the 29th).
Date start_Value end_value
2020-01-01 115 120
2020-01-02 122 125
2020-01-03 125.2 126
...
2020-01-31 132 135
2020-02-03 135.5 137
2020-02-04 137.8 138
...
2020-02-28 144 145
My objective is to create a new column which calculates the percentage difference between the end value on the previous month's last available date in the dataframe and the end value on the current month's last available date. It should be 0 for all dates except the last available date of each month. For Jan 2020, since we don't have the previous month's data, the percentage difference should be calculated using the end value of the month's first available date.
For Jan 2020, the percentage difference will therefore be calculated between the end value on 2020-01-01 and the end value on 2020-01-31.
For the rest (for example Feb 2020), the percentage difference is calculated between the end value on 2020-01-31 and the end value on 2020-02-28.
Date start_Value end_value percentage difference
2020-01-01 115 120 0
2020-01-02 122 125 0
2020-01-03 125.2 126 0
...
2020-01-31 132 135 17.4
2020-02-03 135.5 137 0
2020-02-04 137.8 138 0
...
2020-02-28 144 145 7.41
How can I achieve this in Python and pandas?
Check with transform together with duplicated:
s = df.Date.dt.strftime('%Y-%m')
df['pct'] = (df.groupby(s)['end_value'].transform('last')
             / df.groupby(s)['start_Value'].transform('first') - 1).mask(s.duplicated(keep='last'))
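The mask call leaves NaN on the non-month-end rows. If you want literal zeros there and a value on a percent scale, as in the expected output, a small follow-up along the same lines (assuming Date is already a datetime column) could be:
# group key: one label per calendar month (assumes df['Date'] is datetime64)
s = df['Date'].dt.strftime('%Y-%m')

# same ratio as above, expressed in percent
pct = (df.groupby(s)['end_value'].transform('last')
       / df.groupby(s)['start_Value'].transform('first') - 1) * 100

# keep the value only on each month's last row, zero everywhere else
df['percentage difference'] = pct.mask(s.duplicated(keep='last')).fillna(0).round(2)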

Interpolating missing values for time series based on the values of the same period from a different year

I have a time series like the following:
date value
2017-08-27 564.285714
2017-09-03 28.857143
2017-09-10 NaN
2017-09-17 NaN
2017-09-24 NaN
2017-10-01 236.857143
... ...
2018-09-02 345.142857
2018-09-09 288.714286
2018-09-16 274.000000
2018-09-23 248.142857
2018-09-30 166.428571
It ranges from July 2017 to November 2019 and is resampled by week. However, there are some weeks where the values were 0. I replaced those with NaN since the values were really missing, and now I would like to fill them based on the values from the same period of a different year. For example, I have a lot of data missing for September 2017. I would like to interpolate those values using the values from September 2018. However, I'm a newbie and I'm not quite sure how to do it based only on a selected period. I'm working in Python, btw.
If anyone has any idea of how to do this, it would be very much appreciated.
If you are OK with the pandas library, one option is to find the week number from the date and fill the NaN values.
df['week'] = pd.to_datetime(df['date'], format='%Y-%m-%d').dt.strftime("%V")
df2 = df.sort_values(['week']).fillna(method='bfill').sort_values(['date'])
df2
which will give you the following output.
date value week
0 2017-08-27 564.285714 34
1 2017-09-03 28.857143 35
2 2017-09-10 288.714286 36
3 2017-09-17 274.000000 37
4 2017-09-24 248.142857 38
5 2017-10-01 236.857143 39
6 2018-09-02 345.142857 35
7 2018-09-09 288.714286 36
8 2018-09-16 274.000000 37
9 2018-09-23 248.142857 38
10 2018-09-30 166.428571 39
In Pandas:
df['value'] = df['value'].fillna(df['value_last_year'])
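The snippet above assumes a value_last_year column already exists. One way to build such a column is to key the rows by ISO week number and take the same week's value from the other year(s), for example via a per-week mean that ignores NaNs. A minimal sketch, with the column names and sample data taken from the question:
import pandas as pd
import numpy as np

# sample from the question, with the missing September 2017 weeks as NaN
df = pd.DataFrame({
    'date': pd.to_datetime(['2017-08-27', '2017-09-03', '2017-09-10', '2017-09-17',
                            '2017-09-24', '2017-10-01', '2018-09-02', '2018-09-09',
                            '2018-09-16', '2018-09-23', '2018-09-30']),
    'value': [564.285714, 28.857143, np.nan, np.nan, np.nan, 236.857143,
              345.142857, 288.714286, 274.000000, 248.142857, 166.428571],
})

# ISO week number lets weeks from different years line up
week = df['date'].dt.isocalendar().week

# value observed for the same week in other years (the mean skips NaNs)
df['value_last_year'] = df.groupby(week)['value'].transform('mean')

# fill the gaps as in the answer above
df['value'] = df['value'].fillna(df['value_last_year'])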

Pandas get unique monthly data based on date range

I have something like the following dataframe:
d = pd.DataFrame()
d['id'] = ['a', 'a', 'a', 'b', 'b', 'c']
d['version_start'] = ['2017-01-01', '2017-02-12', '2017-03-25', '2017-01-01', '2017-6-15', '2017-01-22']
d['version_end'] = ['2017-02-11', '2017-03-24', '2017-08-01', '2017-06-14', '2018-01-01', '2018-01-01']
d['version_start'] = pd.to_datetime(d.version_start)
d['version_end'] = pd.to_datetime(d.version_end)
d['values'] = [10, 15, 20, 5, 6, 200]
print(d)
id version_start version_end values
0 a 2017-01-01 2017-02-11 10
1 a 2017-02-12 2017-03-24 15
2 a 2017-03-25 2017-08-01 20
3 b 2017-01-01 2017-06-14 5
4 b 2017-06-15 2018-01-01 6
5 c 2017-01-22 2018-01-01 200
The version start and version end represent for each ID, the date range for which that row can be considered valid. For example, the total values for a given date would be the records for which that date is between the version start and version end.
I am looking to get for a set of dates (the first of the month for each month in 2017) the sum of the "values" field. I can do this by looping through each month as follows:
df = pd.DataFrame()
for month in pd.date_range('2017-01-01', '2018-01-01', freq='MS'):
    s = d[(d.version_start <= month) & (d.version_end > month)]
    s['month'] = month
    s = s.set_index(['month', 'id'])[['values']]
    df = df.append(s)
print(df.groupby(level='month')['values'].sum())
2017-01-01 15
2017-02-01 215
2017-03-01 220
2017-04-01 225
2017-05-01 225
2017-06-01 225
2017-07-01 226
2017-08-01 206
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
Name: values, dtype: int64
Is there a more elegant / efficient solution that doesn't require looping through this list of dates?
d.version_start = d.version_start + pd.offsets.MonthBegin(0)
d.version_end = d.version_end + pd.offsets.MonthBegin(0)
d['New'] = d[['version_start', 'version_end']].apply(
    lambda x: pd.date_range(start=x.version_start, end=x.version_end, freq='MS').tolist(), axis=1)
(d.set_index(['id', 'version_start', 'version_end', 'values'])
  .New.apply(pd.Series)
  .stack()
  .reset_index('values')
  .groupby(0)['values'].sum())
Out[845]:
0
2017-01-01 15
2017-02-01 215
2017-03-01 230
2017-04-01 240
2017-05-01 225
2017-06-01 225
2017-07-01 231
2017-08-01 226
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
2018-01-01 206
Name: values, dtype: int64
I keep thinking there should be a more elegant way to do this, but for now:
s = pd.Series(0, index=pd.date_range('2017-01-01', '2018-01-01', freq='MS'))
for _id, start, end, values in d.itertuples(index=False):
    s[start:end] += values
This returns the proper series, and it works with any series for that matter.
If you want the version_end day to be excluded, a quick fix is to add this line before the for loop (it only works if you are using 'MS' as the frequency):
d.version_end = d.version_end.apply(lambda t: t.replace(day=2))
I think the idea of using explicit indexing is cleaner than conditional indexing based on comparisons between dates, which at scale is terribly slow (timestamps are a valid alternative if you are forced to do this on huge arrays).
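For what it's worth, here is a loop-free alternative sketch using NumPy broadcasting over the same condition as the original loop (assuming d is the frame built in the question):
import pandas as pd
import numpy as np

months = pd.date_range('2017-01-01', '2018-01-01', freq='MS')

# boolean matrix: entry (i, j) is True when interval i covers month j
covered = ((d['version_start'].values[:, None] <= months.values)
           & (d['version_end'].values[:, None] > months.values))

# sum the 'values' of every covering interval for each month
totals = pd.Series((covered * d['values'].values[:, None]).sum(axis=0), index=months)
print(totals)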
