I have a large time series dataframe that contains some missing values in two columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor or the average of the previous and following timestamps. Is there an easy way to do it? I have tried fancyimpute, but the dataset contains around 180000 rows and it gives a memory error.
Consider interpolate (Series - DataFrame). This example shows how to fill gaps of any size with a straight line:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='H'),
                   'value': range(10)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan
df
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 NaN
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
df['value'] = df['value'].interpolate(method='linear')
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 2.0
3 2013-01-01 03:00:00 3.0
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 6.0
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
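Since the question also mentions taking the value of the nearest neighbour, a minimal sketch of that variant on the same frame (applied before the linear fill above; pandas delegates method='nearest' to SciPy, so SciPy must be installed):

# Each NaN takes the value of the closest non-missing neighbour
# instead of a point on a straight line.
df['value'] = df['value'].interpolate(method='nearest')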
Interpolate & fillna:
Since it's a time series question, I will use output graphs in the answer for explanation purposes:
Consider we have time series data as follows (x axis = number of days, y = quantity):
pdDataFrame.set_index('Dates')['QUANTITY'].plot(figsize = (16,6))
We can see there is some NaN data in the time series; about 19.400% of the total data is NaN. Now we want to impute these null/NaN values.
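That percentage can be computed directly from the series; a minimal sketch, using the pdDataFrame and QUANTITY names from this answer:

nan_pct = pdDataFrame['QUANTITY'].isna().mean() * 100   # share of missing values
print(f'% of NaN = {nan_pct:.3f}% of total data')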
I will show the output of the interpolate() and fillna() methods for filling NaN values in the data.
interpolate():
First we will use interpolate:
pdDataFrame.set_index('Dates')['QUANTITY'].interpolate(method='linear').plot(figsize = (16,6))
NOTE: method='time' is not used here; this is plain linear interpolation.
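For unevenly spaced timestamps, a minimal sketch of the time-aware variant (a toy series, assuming the dates are set as the index and pandas/numpy are imported as above):

s = pd.Series([1.0, np.nan, 3.0],
              index=pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-04']))
s.interpolate(method='time')   # 2013-01-02 becomes 1.666667, weighted by elapsed time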
fillna() with backfill method
pdDataFrame.set_index('Dates')['QUANTITY'].fillna(method='backfill').plot(figsize = (16,6))
fillna() with backfill method & limit = 7
limit: this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled.
pdDataFrame.set_index('Dates')['QUANTITY'].fillna(method='backfill', limit=7).plot(figsize = (16,6))
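A minimal sketch of how limit leaves part of a long gap unfilled (a toy series, not the QUANTITY data):

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
s.fillna(method='backfill', limit=2)
# 0    1.0
# 1    NaN   <- gap of 3 NaNs; only the 2 closest to the next valid value are filled
# 2    5.0
# 3    5.0
# 4    5.0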
I find the fillna function more useful, but you can use either method to fill the NaN values in both columns.
For more details about these functions, refer to the following links:
fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html#pandas.Series.fillna
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
There is one more library, impyute, that you can check out. For more details regarding this library, refer to this link: https://pypi.org/project/impyute/
You could use rolling like this:
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Humidity': np.arange(50, 64)})
frame.loc[[3, 7, 10, 11], 'Humidity'] = np.nan
frame.Humidity.fillna(frame.Humidity.rolling(4, min_periods=1).mean())
Output:
0 50.0
1 51.0
2 52.0
3 51.0
4 54.0
5 55.0
6 56.0
7 55.0
8 58.0
9 59.0
10 58.5
11 58.5
12 62.0
13 63.0
Name: Humidity, dtype: float64
Looks like your data is hourly. How about just taking the average of the hour before and the hour after? Or change the window size to 2, meaning the average of the two hours before and after?
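A minimal sketch of the centered variant, which averages the neighbours on both sides of each gap rather than only past values (the window size of 3 is illustrative; frame is the one defined above):

# rolling mean skips NaNs, so at a missing hour this averages the hour before and the hour after
frame.Humidity.fillna(
    frame.Humidity.rolling(3, min_periods=1, center=True).mean())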
Imputing using other variables can be expensive and you should only consider those methods if the dummy methods do not work well (e.g. introducing too much noise).
Related
I have two dataframes:
Date Variable
2013-04-01 05:00:00 S
2013-04-01 05:00:00 A
2013-04-01 05:10:00 S
2013-04-01 05:20:00 A
2013-04-01 05:25:00 S
2013-04-01 05:35:00 S
And:
Date Variable
2013-04-01 04:50:00 A
2013-04-01 05:00:00 A
2013-04-01 05:05:00 S
2013-04-01 05:15:00 S
2013-04-01 05:35:00 S
2013-04-01 05:40:00 S
My goal is to count, for each date in the second dataframe, the number of dates in the first dataframe that fall within 20 min before and 20 min after it. So I need to iterate over all dates in the second dataframe and count how many dates in the first dataframe are within 20 min before and 20 min after each specific date. I also want to count the occurrences of variable A or S; in other words, the Nr_var_20_bef column holds the number of dates 20 min before with the same variable. Therefore, the output would be something like:
Date Variable Nr_20_bef Nr_20_aft Nr_var_20_bef Nr_var_20_after
2013-04-01 04:50:00 A 0 3 0 1
2013-04-01 05:00:00 A 2 4 1 2
2013-04-01 05:05:00 S 2 3 1 2
2013-04-01 05:15:00 S 3 3 2 2
2013-04-01 05:35:00 S 3 1 2 1
2013-04-01 05:40:00 S 3 0 2 0
My main problem is that both dataframes have over 1 million rows, which means I cannot use a for loop or a pandas apply, because they are way too time consuming with such huge dataframes. Thank you very much in advance.
This is a tough problem! I can offer you a partial solution, which will hopefully be enough to get you started.
You should look into the pandas rolling methods, which can take advantage of your DatetimeIndex. Note that, as far as I'm aware, the rolling functions can only look at the previous time period, not a future period. This solution calculates the number of times the bar column appears in the past 20 minutes according to a set of merged times of foo and bar, which I believe is what you're asking for.
import pandas as pd
import numpy as np
# Attempting to generate some similar data
np.random.seed(0)
rng = pd.date_range('4/1/2013', periods=1000, freq='5T', name='Date')
df = pd.DataFrame({'Variable': np.random.choice(['S', 'A'], 1000)}, index=rng)
df1 = df.sample(frac=0.5)
df2 = df.sample(frac=0.5)
merged = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_foo', '_bar'])
# pandas can't count objects, but can count bools
m = merged.notnull()
# Rolling functions can't count "after", only "before" or "center"
merged['Nr_20_bef'] = m.Variable_bar.rolling('20T').sum()
print(merged.head(10))
#                     Variable_foo Variable_bar  Nr_20_bef
# Date
# 2013-04-01 00:05:00 A NaN 0.0
# 2013-04-01 00:10:00 A NaN 0.0
# 2013-04-01 00:15:00 NaN S 1.0
# 2013-04-01 00:20:00 A A 2.0
# 2013-04-01 00:25:00 A NaN 2.0
# 2013-04-01 00:40:00 NaN A 1.0
# 2013-04-01 00:45:00 A A 2.0
# 2013-04-01 00:50:00 NaN A 3.0
# 2013-04-01 01:05:00 NaN A 2.0
# 2013-04-01 01:10:00 S S 2.0
Generating the Nr_20_bef column is very fast, ~1 second for 10 million rows on my two year old laptop. If you want to count just "S" characters, for instance, you could instead do m = merged == 'S'.
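If you also need the "after" counts, one loop-free sketch (not part of the rolling approach above) is to sort the event times and use np.searchsorted to count how many fall inside each window; here df1 holds the events and df2 the query times from the generated data above, and the exact boundary handling (inclusive vs. exclusive) is an assumption you may want to adjust:

events = df1.index.sort_values().values        # event times, sorted
queries = df2.sort_index()                     # frame to annotate
t = queries.index.values
delta = np.timedelta64(20, 'm')

# count events falling roughly in (t - 20min, t) and (t, t + 20min);
# adjust side= if you need exact boundary behaviour
queries['Nr_20_bef'] = (np.searchsorted(events, t) -
                        np.searchsorted(events, t - delta))
queries['Nr_20_aft'] = (np.searchsorted(events, t + delta) -
                        np.searchsorted(events, t, side='right'))

The per-variable counts (Nr_var_20_bef, Nr_var_20_after) could follow the same pattern after first filtering events by Variable.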
I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': pd.date_range('2018-01-01', '2018-01-02', freq='2h', closed='right'),
                   'col1': [np.nan, np.nan, np.nan, 1, 2, 3, 4, 5, 6, 7, 8, np.nan],
                   'col2': [np.nan, np.nan, 0, 1, 2, 3, 4, 5, np.nan, np.nan, np.nan, np.nan],
                   'col3': [np.nan, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col4': [-2, -1, 0, 1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan, np.nan]
                   })[['timestamp', 'col1', 'col2', 'col3', 'col4']]
which looks like this:
timestamp col1 col2 col3 col4
0 2018-01-01 02:00:00 NaN NaN NaN -2.0
1 2018-01-01 04:00:00 NaN NaN -1.0 -1.0
2 2018-01-01 06:00:00 NaN 0.0 NaN 0.0
3 2018-01-01 08:00:00 1.0 1.0 1.0 1.0
4 2018-01-01 10:00:00 2.0 NaN 2.0 2.0
5 2018-01-01 12:00:00 3.0 3.0 NaN 3.0
6 2018-01-01 14:00:00 NaN 4.0 4.0 4.0
7 2018-01-01 16:00:00 5.0 NaN 5.0 NaN
8 2018-01-01 18:00:00 6.0 NaN 6.0 NaN
9 2018-01-01 20:00:00 7.0 NaN 7.0 NaN
10 2018-01-01 22:00:00 8.0 NaN 8.0 NaN
11 2018-01-02 00:00:00 NaN NaN 9.0 NaN
Now, I want to find an efficient and pythonic way of chopping off, for each column (not counting timestamp), everything before the first valid index and after the last valid index. In this example I have 4 columns, but in reality I have a lot more, 600 or so. I am looking for a way to chop off all the NaN values before the first valid index and all the NaN values after the last valid index.
One way would be to loop through, I guess, but is there a better way? It has to be efficient. I tried to "unpivot" the dataframe using melt, but that didn't help.
An obvious point is that each column would have a different number of rows after the chopping, so I would like the result to be a list of dataframes (one for each column), each holding timestamp and the column in question. For instance:
timestamp col1
3 2018-01-01 08:00:00 1.0
4 2018-01-01 10:00:00 2.0
5 2018-01-01 12:00:00 3.0
6 2018-01-01 14:00:00 NaN
7 2018-01-01 16:00:00 5.0
8 2018-01-01 18:00:00 6.0
9 2018-01-01 20:00:00 7.0
10 2018-01-01 22:00:00 8.0
My try
I tried like this:
final = []
columns = [c for c in df if c !='timestamp']
for col in columns:
    first = df.loc[:, col].first_valid_index()
    last = df.loc[:, col].last_valid_index()
    final.append(df.loc[:, ['timestamp', col]].iloc[first:last+1, :])
One idea is to use a list or dictionary comprehension after setting your index as timestamp. You should test with your data to see if this resolves your issue with performance. It is unlikely to help if your limitation is memory.
df = df.set_index('timestamp')
final = {col: df[col].loc[df[col].first_valid_index(): df[col].last_valid_index()]
         for col in df}
print(final)
{'col1': timestamp
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
2018-01-01 16:00:00 5.0
2018-01-01 18:00:00 6.0
2018-01-01 20:00:00 7.0
2018-01-01 22:00:00 8.0
Name: col1, dtype: float64,
...
'col4': timestamp
2018-01-01 02:00:00 -2.0
2018-01-01 04:00:00 -1.0
2018-01-01 06:00:00 0.0
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
Name: col4, dtype: float64}
You can use the power of functional programming and apply a function to each column. This may speed things up. Also, as your timestamps look sorted, you can use them as the index of your DataFrame.
df.set_index('timestamp', inplace=True)
final = []
def func(col):
    first = col.first_valid_index()
    last = col.last_valid_index()
    final.append(col.loc[first:last])
    return
df.apply(func)
Also, you can compact everything in a one liner:
final = []
df.apply(lambda col: final.append(col.loc[col.first_valid_index() : col.last_valid_index()]))
My approach is to take, for each column, the cumulative count of non-null values running forward and running backward, and keep only the rows where both are greater than 0. Then I do a dict comprehension to return a dataframe for each column (you can change that to a list if that's what you prefer).
For your example we have
cols = [c for c in df.columns if c!='timestamp']
result_dict = {c: df[(df[c].notnull().cumsum() > 0) &
                     (df[c].iloc[::-1].notnull().cumsum().iloc[::-1] > 0)][['timestamp', c]]
               for c in cols}
Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and the max of that group, and output a dataframe with the last row for each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04',
                         '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2016-01-01 2 78.0
4 2016-01-01 2 80.0
5 2016-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2016-01-01 2 80.0
5 2016-01-02 2 80.0
6 2016-01-03 2 80.0
7 2016-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, for sub_id=1 a row was added for 2016-01-02 and its amount was imputed as 10.0, because the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2016-01-02 and 2016-01-03 with amount 80.0, as that was the last row before those dates. The first row for 2016-01-01 was also dropped, because we only want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this, as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id, but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
By using resample with groupby
x.dt=pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
use asfreq & groupby
first convert dt to datetime & get rid of duplicates
then for each group of sub_id use asfreq('D', method='ffill') to generate missing dates and impute amounts
finally reset_index on the amount column, as there is a duplicate sub_id column as well as the index.
x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
    ['dt', 'sub_id'], keep='last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The below works for me and seems pretty efficient, but I can't say whether it's efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0
For a dataframe with no missing values, this would be as easy as df.diff(periods=24, axis=0). But how is it possible to connect the calculations to the index values?
Reproducible dataframe - Code:
# Imports
import pandas as pd
import numpy as np
# A dataframe with two variables, random numbers and hourly time series
np.random.seed(123)
rows = 36
rng = pd.date_range('1/1/2017', periods=rows, freq='H')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['A', 'B'])
df = df.set_index(rng)
Reproducible dataframe - Screenshot:
Desired output - Code:
# Running difference step = 24
df = df.diff(periods=24, axis=0)
df = df.dropna(axis=0, how='all')
Desired output - Screenshot
The real challenge
The problem is that my real-world examples are full of missing values.
So I'll have to connect the difference intervals to the index values, and I have no idea how. I've tried a few solutions that fill in the missing hours in the index first and then take the differences as before, but it's not very elegant.
Thank you for any suggestions!
Edit - As requested in the comments, here's my best attempt for a bit longer time period:
df_missing = df.drop(df.index[[2,3]])
newIndex = pd.date_range(start = '1/1/2017', end = '1/3/2017', freq='H')
df_missing = df_missing.reindex(newIndex, fill_value = np.nan)
df_refilled = df_missing.diff(periods=24, axis=0)
Compared to the other suggestions, I would say that this is not very elegant =)
I think maybe you can use groupby
df.groupby(df.index.hour).diff().dropna()
Out[784]:
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
You can snap your dataframe to hourly recordings using asfreq, and then use diff?
df.asfreq('1H').diff(periods=24, axis=0).dropna()
Or, use shift and then subtract (instead of diff),
v = df.asfreq('1h')
(v - v.shift(periods=24)).dropna()
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
I have a data frame that looks like this, with monthly data points:
Date Value
1 2010-01-01 18.45
2 2010-02-01 18.13
3 2010-03-01 18.25
4 2010-04-01 17.92
5 2010-05-01 18.85
I want to make it daily data and fill in the resulting new dates with the current month value. For example:
Date Value
1 2010-01-01 18.45
2 2010-01-02 18.45
3 2010-01-03 18.45
4 2010-01-04 18.45
5 2010-01-05 18.45
....
This is the code I'm using to add the interim dates and fill the values:
today = get_datetime('US/Eastern') #.strftime('%Y-%m-%d')
enddate='1881-01-01'
idx = pd.date_range(enddate, today.strftime('%Y-%m-%d'), freq='D')
df = df.reindex(idx)
df = df.fillna(method = 'ffill')
The output is as follows:
Date Value
2010-01-01 00:00:00 NaN NaN
2010-01-02 00:00:00 NaN NaN
2010-01-03 00:00:00 NaN NaN
2010-01-04 00:00:00 NaN NaN
2010-01-05 00:00:00 NaN NaN
The logs show that the NaN values appear just before the .fillna method is invoked. So the forward fill is not the culprit.
Any ideas why this is happening?
option 1
safest approach, very general
up-sample to daily, then group monthly with a transform
The reason why this is important is that your day may not fall on the first of the month. If you want to ensure that that day's value gets broadcast to every other day in the month, do this:
df.set_index('Date').asfreq('D') \
  .groupby(pd.Grouper(freq='M')).Value \
  .transform('first').reset_index()
option 2
asfreq
df.set_index('Date').asfreq('D').ffill().reset_index()
option 3
resample
df.set_index('Date').resample('D').first().ffill().reset_index()
For pandas=0.16.1
df.set_index('Date').resample('D').ffill().reset_index()
All produce the same result over this sample data set
You need to set the dates as the index of the original dataframe before calling reindex; otherwise reindex aligns on the existing integer index, finds no matching labels, and fills every row with NaN.
test = pd.DataFrame(np.random.randn(4), index=pd.date_range('2017-01-01', '2017-01-04'), columns=['test'])
test.reindex(pd.date_range('2017-01-01', '2017-01-05'), method='ffill')
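Applied to the code in the question, a minimal sketch (assuming the monthly dates currently sit in a 'Date' column alongside 'Value') would be:

df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')        # reindex matches on index labels, so they must be dates
idx = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(idx).ffill()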