I have a data frame that looks like this, with monthly data points:
Date Value
1 2010-01-01 18.45
2 2010-02-01 18.13
3 2010-03-01 18.25
4 2010-04-01 17.92
5 2010-05-01 18.85
I want to make it daily data and fill in the resulting new dates with the current month value. For example:
Date Value
1 2010-01-01 18.45
2 2010-01-02 18.45
3 2010-01-03 18.45
4 2010-01-04 18.45
5 2010-01-05 18.45
....
This is the code I'm using to add the interim dates and fill the values:
today = get_datetime('US/Eastern') #.strftime('%Y-%m-%d')
enddate='1881-01-01'
idx = pd.date_range(enddate, today.strftime('%Y-%m-%d'), freq='D')
df = df.reindex(idx)
df = df.fillna(method = 'ffill')
The output is as follows:
Date Value
2010-01-01 00:00:00 NaN NaN
2010-01-02 00:00:00 NaN NaN
2010-01-03 00:00:00 NaN NaN
2010-01-04 00:00:00 NaN NaN
2010-01-05 00:00:00 NaN NaN
The logs show that the NaN values appear just before the .fillna method is invoked. So the forward fill is not the culprit.
Any ideas why this is happening?
option 1
safest approach, very general
up-sample to daily, then group monthly with a transform
The reason this matters is that your data point may not fall on the first of the month. If you want to ensure that that day's value gets broadcast to every other day in the month, do this:
df.set_index('Date').asfreq('D') \
  .groupby(pd.Grouper(freq='M')).Value \
  .transform('first').reset_index()
option 2
asfreq
df.set_index('Date').asfreq('D').ffill().reset_index()
option 3
resample
df.set_index('Date').resample('D').first().ffill().reset_index()
Or, more concisely (pandas >= 0.18, where resample returns a Resampler):
df.set_index('Date').resample('D').ffill().reset_index()
All three produce the same result over this sample data set.
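For reference, here is the asfreq variant run end-to-end as a self-contained sketch, with the dates and values copied from the question's sample frame:

```python
import pandas as pd

# Reconstruction of the question's monthly frame
df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-02-01', '2010-03-01',
                            '2010-04-01', '2010-05-01']),
    'Value': [18.45, 18.13, 18.25, 17.92, 18.85],
})

# Up-sample to daily (new days are NaN), then forward-fill each month's value
daily = df.set_index('Date').asfreq('D').ffill().reset_index()
```

This yields 121 daily rows (2010-01-01 through 2010-05-01), with every January day holding 18.45, every February day 18.13, and so on.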
You need to set the Date column as the index of the original dataframe before calling reindex; otherwise reindex aligns against the default integer index, finds no matching date labels, and fills every row with NaN:
test = pd.DataFrame(np.random.randn(4), index=pd.date_range('2017-01-01', '2017-01-04'), columns=['test'])
test.reindex(pd.date_range('2017-01-01', '2017-01-05'), method='ffill')
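Applied to the monthly frame from the question, a minimal sketch of the fix (with a shortened date range for brevity) looks like this:

```python
import pandas as pd

# Two rows taken from the question's sample data
df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-02-01']),
    'Value': [18.45, 18.13],
})

idx = pd.date_range('2010-01-01', '2010-02-01', freq='D')
# Setting the DatetimeIndex first is the crucial step; reindexing the
# default integer index against dates produces all NaN
out = df.set_index('Date').reindex(idx).ffill()
```

With the DatetimeIndex in place, reindex inserts the interim days as NaN and ffill carries each month's value forward as intended.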
Related
I have a dataset where every value carries a specific date. I want to fill these values, according to their dates, into an Excel sheet that contains the date range of the whole year. The dates start at 01-01-2020 00:00:00 and end at 31-12-2020 23:45:00 with a frequency of 15 minutes, so there will be a total of 35040 date-time values in the sheet.
my data is like:
load date
12 01-02-2020 06:30:00
21 29-04-2020 03:45:00
23 02-07-2020 12:15:00
54 07-08-2020 16:00:00
23 22-09-2020 16:30:00
As you can see, these values are not continuous, but each has a specific date attached. I want to use these dates as the index, place each value at its particular date in the sheet's date column, and put zero in the missing values. Can someone please help?
Use DataFrame.reindex with date_range, so 0 is added for all datetimes that do not exist in the data:
rng = pd.date_range('2020-01-01','2020-12-31 23:45:00', freq='15Min')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').reindex(rng, fill_value=0)
print (df)
load
2020-01-01 00:00:00 0
2020-01-01 00:15:00 0
2020-01-01 00:30:00 0
2020-01-01 00:45:00 0
2020-01-01 01:00:00 0
...
2020-12-31 22:45:00 0
2020-12-31 23:00:00 0
2020-12-31 23:15:00 0
2020-12-31 23:30:00 0
2020-12-31 23:45:00 0
[35136 rows x 1 columns]
I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column defined as the last N months. The window of time to look at however, is filtered by the value in column_a.
Code example using a for loop, which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of rows 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big ~2M rows which means that looping is not an option. I already tried looping and creating a subset based on conditional values but it takes too long.
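One vectorized alternative (a sketch, not the asker's exact data; the four rows below are simplified) is a time-based rolling window per group. rolling('180D', closed='left') reproduces the loop's [t - 180 days, t) window, i.e. it excludes the current row, and the groupby restricts the window to matching column_a values:

```python
import pandas as pd

df = pd.DataFrame({
    'column_a': ['A', 'A', 'A', 'B'],
    'column_b': [10.0, 20.0, 30.0, 40.0],
    'datetime_column': pd.to_datetime(
        ['2019-11-01 10:00', '2019-12-01 08:00',
         '2020-01-01 11:00', '2020-01-01 05:00']),
})

rolled = (df.sort_values('datetime_column')       # offset windows need a sorted index
            .set_index('datetime_column')
            .groupby('column_a')['column_b']
            .rolling('180D', closed='left')        # [t - 180 days, t): current row excluded
            .mean())
```

The result carries a (column_a, datetime_column) MultiIndex; mapping it back to the original row order takes a reset_index plus a merge, which is left out here. Other statistics (sum, std, count, ...) drop in for mean unchanged.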
In a pandas df, I have the day of the month for a given month in the first column and an Amount in the second column. How can I add the days of that month that are missing from the first column, with the value 0 in the second column?
df = pd.DataFrame({
'Date':['5/23/2019', '5/9/2019'],
'Amount':np.random.choice([10000])
})
I would like the result to look like the following:
Expected Output
Date Amount
0 5/01/2019 0
1 5/02/2019 0
.
.
. 5/23/2019 10000
. 5/24/2019 0
Look at date_range from pandas.
I'm assuming that 5/31/2019 is not in your output like the comment asks because you want the differences between the min and max dates?
I convert the date column to a datetime type, pass the min and max dates to date_range, store that in a dataframe, and then do a left join.
df['Date'] = pd.to_datetime(df['Date'])
date_range = pd.DataFrame(pd.date_range(start=df['Date'].min(), end=df['Date'].max()), columns=['Date'])
final_df = pd.merge(date_range, df, how='left')
Date Amount
0 2019-05-09 10000.0
1 2019-05-10 NaN
2 2019-05-11 NaN
3 2019-05-12 NaN
4 2019-05-13 NaN
5 2019-05-14 NaN
6 2019-05-15 NaN
7 2019-05-16 NaN
8 2019-05-17 NaN
9 2019-05-18 NaN
10 2019-05-19 NaN
11 2019-05-20 NaN
12 2019-05-21 NaN
13 2019-05-22 NaN
14 2019-05-23 10000.0
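Since the question asks for 0 rather than NaN in the missing rows, one small addition to the steps above finishes the job, a final fillna (the full sketch repeated for context):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['5/23/2019', '5/9/2019'], 'Amount': 10000})
df['Date'] = pd.to_datetime(df['Date'])

# Build the full daily range between the min and max dates, then left-join
date_range = pd.DataFrame(
    pd.date_range(start=df['Date'].min(), end=df['Date'].max()),
    columns=['Date'])
final_df = pd.merge(date_range, df, how='left')

# Replace the NaNs introduced by the join with 0, as the question requests
final_df['Amount'] = final_df['Amount'].fillna(0).astype(int)
```

To cover the whole month rather than just min-to-max, pass explicit start/end dates (e.g. '5/1/2019' and '5/31/2019') to date_range instead.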
I have a time-series dataframe. The dataframe is quite big and contains some missing values in two columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbour or the average of the previous and following timestamps. Is there an easy way to do it? I have tried fancyimpute, but the dataset contains around 180,000 examples and gives a memory error.
Consider interpolate (Series - DataFrame). This example shows how to fill gaps of any size with a straight line:
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='H'), 'value': range(10)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan
df
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 NaN
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
df['value'].interpolate(method='linear', inplace=True)
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 2.0
3 2013-01-01 03:00:00 3.0
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 6.0
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
interpolate & fillna:
Since it's a time-series question, I will use output graphs in the answer for explanation purposes.
Suppose we have time-series data as follows (x axis = number of days, y axis = quantity):
pdDataFrame.set_index('Dates')['QUANTITY'].plot(figsize = (16,6))
We can see there is some NaN data in the time series (about 19.4% of the total data). Now we want to impute the null/NaN values.
I will show the output of the interpolate and fillna methods for filling NaN values in the data.
interpolate() :
First we will use interpolate:
pdDataFrame.set_index('Dates')['QUANTITY'].interpolate(method='linear').plot(figsize = (16,6))
Note: the interpolation here is plain linear (method='linear'), not the time-weighted method='time'.
fillna() with backfill method
pdDataFrame.set_index('Dates')['QUANTITY'].fillna(value=None, method='backfill', axis=None, limit=None, downcast=None).plot(figsize = (16,6))
fillna() with backfill method & limit = 7
limit: this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled.
pdDataFrame.set_index('Dates')['QUANTITY'].fillna(value=None, method='backfill', axis=None, limit=7, downcast=None).plot(figsize = (16,6))
I find the fillna function more useful, but you can use either of the methods to fill NaN values in both columns.
For more details about these functions refer following links:
fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html#pandas.Series.fillna
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
There is one more library, impyute, that you can check out. For more details see: https://pypi.org/project/impyute/
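On an irregularly spaced DatetimeIndex, method='linear' treats rows as equally spaced, while method='time' weights by the elapsed time between timestamps. A minimal sketch (timestamps invented for illustration):

```python
import numpy as np
import pandas as pd

# Irregular spacing: a 1-hour gap followed by a 2-hour gap
s = pd.Series(
    [0.0, np.nan, 3.0],
    index=pd.to_datetime(['2013-01-01 00:00',
                          '2013-01-01 01:00',
                          '2013-01-01 03:00']))

# 'time' interpolation: the gap sits 1 hour into a 3-hour span, so 1.0;
# plain 'linear' would treat it as the midpoint and give 1.5
filled = s.interpolate(method='time')
```

For evenly sampled data the two methods agree; the difference only shows up when timestamps are unevenly spaced.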
You could use rolling like this:
frame = pd.DataFrame({'Humidity':np.arange(50,64)})
frame.loc[[3,7,10,11],'Humidity'] = np.nan
frame.Humidity.fillna(frame.Humidity.rolling(4,min_periods=1).mean())
Output:
0 50.0
1 51.0
2 52.0
3 51.0
4 54.0
5 55.0
6 56.0
7 55.0
8 58.0
9 59.0
10 58.5
11 58.5
12 62.0
13 63.0
Name: Humidity, dtype: float64
Looks like your data is hourly. How about just taking the average of the hour before and the hour after? Or change the window size to 2, meaning the average of the two hours before and after?
Imputing using other variables can be expensive and you should only consider those methods if the dummy methods do not work well (e.g. introducing too much noise).
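The "hour before and hour after" idea maps onto a centered rolling window; a minimal sketch with made-up humidity values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Humidity': [50.0, np.nan, 52.0, 53.0]})

# window=3 with center=True averages the row itself plus one neighbour on
# each side; NaNs are skipped, so row 1 becomes (50 + 52) / 2
neighbours = frame['Humidity'].rolling(3, min_periods=1, center=True).mean()
frame['Humidity'] = frame['Humidity'].fillna(neighbours)
```

Widening the window (e.g. 5 with center=True) averages two neighbours on each side, at the cost of smoothing over more of the local signal.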
I'm looking for a way to get the value from the previous year for the same day.
For example, we have a value for 2014-01-01, and I want to create a new column with the value for that day from one year earlier.
Here is a sample of the table; I want to compute the Previos_Year column.
Date Ticks Previos_Year
2013-01-01 0 NaN
2013-01-02 1 NaN
2013-01-03 2 NaN
....
2014-01-01 3 0
2014-01-02 4 1
What have I tried so far:
I created a new column day of the year,
df['Day_in_Year'] = df.Date.dt.dayofyear
but I could not figure out how to use it for my task.
Also, I tried the shift function:
df['Ticks'].shift(365)
and it works, until a leap year...
You can group by month and day, then shift, i.e.
df['Previous'] = df.groupby([df['Date'].dt.month,df['Date'].dt.day])['Value'].shift()
Sample Output :
Date Ticks Value Previous
0 2013-01-01 0 99 NaN
1 2013-01-02 1 0 NaN
2 2013-01-03 2 5 NaN
3 2014-01-01 3 0 99.0
4 2014-01-02 4 1 0.0
5 2014-01-03 2 5 5.0
7 2014-01-04 2 5 NaN
You can also use Pandas DateOffset. The help document is here.
For example, to get the date one year earlier, use:
from pandas.tseries.offsets import DateOffset
df['Previous'] = df['Date'] - DateOffset(years=1)
If you want last month, you can try:
df['Previous'] = df['Date'] - DateOffset(months=1)
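Since DateOffset shifts dates rather than values, one way to turn this into the lookup the question asks for is a self-merge on the shifted date (a sketch; the Previous_Year label is just illustrative). Unlike shift(365), DateOffset is calendar-aware, so leap years are handled correctly:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

df = pd.DataFrame({
    'Date': pd.to_datetime(['2013-01-01', '2013-01-02',
                            '2014-01-01', '2014-01-02']),
    'Ticks': [0, 1, 3, 4],
})

# Shift each row's date forward one year, then join the table to itself:
# the row dated 2013-01-01 supplies the value for 2014-01-01
lookup = df.rename(columns={'Ticks': 'Previous_Year'})
lookup['Date'] = lookup['Date'] + DateOffset(years=1)
out = df.merge(lookup, on='Date', how='left')
```

Rows with no same-day counterpart a year earlier (here, the 2013 rows) come out as NaN.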