I have a pandas Series object consisting of a DatetimeIndex and some values, which looks like the following:
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 00:30:00 35.6
2020-01-01 00:45:00 39.2
2020-01-01 01:00:00 56.7
...
2020-12-31 23:45:00 56.3
I am adding some values to this df with .append(). Since the result is no longer sorted, I sort its index via .sort_index(). However, what I would like to achieve is to sort only a given day.
So, for example, I add some values to the day 2020-01-01, and since the added values land after the existing end of 2020-01-01, I only need to sort the first day of the year, NOT the whole df.
Here is an example; the NaN value was added with .append():
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
...
2020-01-01 23:45:00 34.3
2020-01-01 15:00:00 NaN
...
2020-12-31 23:45:00 56.3
Now I cannot simply call df.sort_index(), because it would reorder the other days as well. That is why I want to apply .sort_index() only to the day 2020-01-01. How do I do that?
What I tried so far (and it does not work):
df.loc['2020-01-01'] = df.loc['2020-01-01'].sort_index()
Filter the rows for day 2020-01-01, sort them, and join them back with the non-matching rows:
mask = df.index.normalize() == '2020-01-01'
df = pd.concat([df[mask].sort_index(), df[~mask]])
print (df)
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 15:00:00 NaN
2020-01-01 23:45:00 34.3
2020-12-31 23:45:00 56.3
Name: a, dtype: float64
Another idea:
df1 = df['2020-01-01'].sort_index()
df = pd.concat([df1, df.drop(df1.index)])
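For reuse, the same mask-and-concat idea can be wrapped in a small helper (a sketch; sort_one_day is a hypothetical name, and note that it moves the selected day's rows to the front of the result, which is harmless here because 2020-01-01 is the first day):
import pandas as pd

def sort_one_day(s: pd.Series, day: str) -> pd.Series:
    # Sort only the rows falling on `day`; leave all other rows untouched.
    mask = s.index.normalize() == day
    return pd.concat([s[mask].sort_index(), s[~mask]])

df = sort_one_day(df, '2020-01-01')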
I have a DataFrame that looks like the following:
import pandas as pd
import numpy as np
date = pd.date_range(start='2020-01-01', freq='H', periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0, 6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({'x': x})
df.index = pd.MultiIndex.from_product([locations, date], names=['location', 'date'])
df = df.sort_index()
df
x
location date
AA3 2020-01-01 00:00:00 5.5
2020-01-01 01:00:00 10.2
2020-01-01 02:00:00 NaN
2020-01-01 03:00:00 2.3
AB1 2020-01-01 00:00:00 11.2
2020-01-01 01:00:00 NaN
2020-01-01 02:00:00 2.1
2020-01-01 03:00:00 4.0
AC0 2020-01-01 00:00:00 4.9
2020-01-01 01:00:00 15.2
2020-01-01 02:00:00 21.3
2020-01-01 03:00:00 NaN
AD1 2020-01-01 00:00:00 6.1
2020-01-01 01:00:00 NaN
2020-01-01 02:00:00 20.3
2020-01-01 03:00:00 11.3
Index values are location codes and hours of the day. I want to fill missing values of column x with a valid value of the same column from the nearest location at the same day and hour, where the proximity of each location to the others is defined as
nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
"AB1": ["AB1", "AA3", "AC0", "AD1"],
"AD1": ["AD1", "AC0", "AB1", "AA3"],
"AC0": ["AC0", "AD1", "AA3", "AB1"]})
nearest
AA3 AB1 AD1 AC0
0 AA3 AB1 AD1 AC0
1 AB1 AA3 AC0 AD1
2 AD1 AC0 AB1 AA3
3 AC0 AD1 AA3 AB1
In this dataset, column names are location codes, and the row values under each column list the other locations in order of their proximity to the location named in the column header.
If the nearest location also has a missing value at the same day and hour, I take the value of the second-nearest location at the same day and hour, then the third-nearest, and so on. For example, AA3 at 02:00 is NaN; its nearest neighbour AB1 does have a value at 02:00 (2.1), so 2.1 is used.
Desired output:
x
location date
AA3 2020-01-01 00:00:00 5.5
2020-01-01 01:00:00 10.2
2020-01-01 02:00:00 2.1
2020-01-01 03:00:00 2.3
AB1 2020-01-01 00:00:00 11.2
2020-01-01 01:00:00 10.2
2020-01-01 02:00:00 2.1
2020-01-01 03:00:00 4.0
AC0 2020-01-01 00:00:00 4.9
2020-01-01 01:00:00 15.2
2020-01-01 02:00:00 21.3
2020-01-01 03:00:00 11.3
AD1 2020-01-01 00:00:00 6.1
2020-01-01 01:00:00 15.2
2020-01-01 02:00:00 20.3
2020-01-01 03:00:00 11.3
The following, based on suggestions by kiona1018, works as intended, but it is slow.
def fillna_by_nearest(x: pd.Series, nn_data: pd.DataFrame):
    out = x.copy()
    # Series.iteritems() was removed in pandas 2.0; .items() is the equivalent.
    for index, value in x.items():
        if np.isnan(value) and (index[0] in nn_data.columns):
            location, date = index
            # Walk the neighbours in proximity order and take the first valid value.
            for near_location in nn_data[location]:
                if ((near_location, date) in x.index) and pd.notna(x.loc[near_location, date]):
                    out.loc[index] = x.loc[near_location, date]
                    break
    return out
fillna_by_nearest(df['x'], nearest)
I agree with Serial Lazer that there is no neater built-in pandas/numpy routine for this. The requirement is a little too involved, so in such cases you should write your own function. An example is below.
nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
"AB1": ["AB1", "AA3", "AC0", "AD1"],
"AD1": ["AD1", "AC0", "AB1", "AA3"],
"AC0": ["AC0", "AD1", "AA3", "AB1"]})
def fill_by_nearest(sr: pd.Series):
    if not np.isnan(sr['x']):
        return sr
    location = sr.name[0]
    date = sr.name[1]
    # Take the value of the first neighbour that has one.
    for near_location in nearest[location]:
        if not np.isnan(df.loc[near_location, date]['x']):
            sr['x'] = df.loc[near_location, date]['x']
            return sr
    return sr
df = df.apply(fill_by_nearest, axis=1)
You can use the apply function:
def find_nearest(row):
    # Walk locations in proximity order; return the first non-null x at the same date.
    for item in list(nearest[row['location']]):
        match = df[lambda x: (x['location'] == item) & (x['date'] == row['date']) & (~pd.isnull(x['x']))]
        if len(match):
            return match.x.values[0]

df = df.reset_index()
df = df.assign(x=lambda x: x.apply(find_nearest, axis=1))
Output:
location date x
0 AA3 2020-01-01 00:00:00 5.5
1 AA3 2020-01-01 01:00:00 10.2
2 AA3 2020-01-01 02:00:00 2.1
3 AA3 2020-01-01 03:00:00 2.3
4 AB1 2020-01-01 00:00:00 11.2
5 AB1 2020-01-01 01:00:00 10.2
6 AB1 2020-01-01 02:00:00 2.1
7 AB1 2020-01-01 03:00:00 4.0
8 AC0 2020-01-01 00:00:00 4.9
9 AC0 2020-01-01 01:00:00 15.2
10 AC0 2020-01-01 02:00:00 21.3
11 AC0 2020-01-01 03:00:00 11.3
12 AD1 2020-01-01 00:00:00 6.1
13 AD1 2020-01-01 01:00:00 15.2
14 AD1 2020-01-01 02:00:00 20.3
15 AD1 2020-01-01 03:00:00 11.3
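For larger frames, a vectorized alternative may be worth a try (a sketch, assuming the df and nearest objects from the question): unstack the series so that each location becomes a column, then fill each column from its neighbours in proximity order.
# Wide view: one row per timestamp, one column per location.
wide = df['x'].unstack('location')

filled = wide.copy()
for loc in wide.columns:
    # nearest[loc] lists locations in proximity order (starting with loc itself).
    for neighbour in nearest[loc]:
        filled[loc] = filled[loc].fillna(wide[neighbour])

# Restore the original (location, date) MultiIndex layout.
result = filled.stack().swaplevel().sort_index().rename('x')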
If we divide the time of day from 00:00:00 to 23:59:00 into 15-minute blocks, we get 96 blocks, which we can number from 0 to 95.
I want to add a "timeblock" column to the dataframe, numbering each row with the block its timestamp falls in, as shown below.
tagdatetime tagvalue timeblock
2020-01-01 00:00:00 47.874423 0
2020-01-01 00:01:00 14.913561 0
2020-01-01 00:02:00 56.368034 0
2020-01-01 00:03:00 16.555687 0
2020-01-01 00:04:00 42.138176 0
... ... ...
2020-01-01 00:13:00 47.874423 0
2020-01-01 00:14:00 14.913561 0
2020-01-01 00:15:00 56.368034 1
2020-01-01 00:16:00 16.555687 1
2020-01-01 00:17:00 42.138176 1
... ... ...
2020-01-01 23:55:00 18.550685 95
2020-01-01 23:56:00 51.219147 95
2020-01-01 23:57:00 15.098951 95
2020-01-01 23:58:00 37.863191 95
2020-01-01 23:59:00 51.380950 95
There is probably a neater way, but the following works.
import pandas as pd
import numpy as np

tindex = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='min')
tvalue = np.random.randint(1, 50, (1440,))
df = pd.DataFrame({'tagdatetime': tindex, 'tagvalue': tvalue})

# One row per 15-minute block start, numbered 0..95.
min15 = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='15min')
tblock = np.arange(96)
df2 = pd.DataFrame({'min15': min15, 'timeblock': tblock})

# After the merge, block numbers exist only at block starts, so forward-fill the rest.
# (timeblock comes out as float because the unmatched rows are NaN before the ffill.)
df3 = pd.merge(df, df2, left_on='tagdatetime', right_on='min15', how='outer')
df3.ffill(axis=0, inplace=True)
df3 = df3.drop('min15', axis=1)
df3.iloc[10:20]
tagdatetime tagvalue timeblock
10 2020-01-01 00:10:00 20 0.0
11 2020-01-01 00:11:00 25 0.0
12 2020-01-01 00:12:00 42 0.0
13 2020-01-01 00:13:00 45 0.0
14 2020-01-01 00:14:00 11 0.0
15 2020-01-01 00:15:00 15 1.0
16 2020-01-01 00:16:00 38 1.0
17 2020-01-01 00:17:00 23 1.0
18 2020-01-01 00:18:00 5 1.0
19 2020-01-01 00:19:00 32 1.0
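A shorter, vectorized way to get the same numbering directly (a sketch, assuming tagdatetime is a datetime column as above): the block index is simply the minutes elapsed since midnight, integer-divided by 15.
# 4 blocks per hour: 00:14 falls in block 0, 00:15 in block 1, 23:59 in block 95.
t = df['tagdatetime']
df['timeblock'] = t.dt.hour * 4 + t.dt.minute // 15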
I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The entire dataset runs until 2021-01-01 00:00:00 (value 95.6), at 15-minute intervals.
Since the frequency is 15 minutes, I would like to change it to 1 hour and simply drop the intermediate values.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science, I don't think dropping the intermediate values is a good approach at all! You should aggregate them instead, e.g. sum or average them (I don't know your use case, but I know a bit about time series data).
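For example, a mean-based (or sum-based) hourly downsampling could look like this (a sketch, assuming dates is a datetime column):
# Aggregate each hour instead of discarding three of every four readings.
hourly = df.set_index('dates')['Final'].resample('1h').mean()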
I'd like to create multiple columns while resampling a pandas DataFrame like the built-in ohlc method.
import numpy as np
import pandas as pd

def mhl(data):
    return pd.Series([np.mean(data), np.max(data), np.min(data)], index=['mean', 'high', 'low'])

ts.resample('30Min', how=mhl)
Dies with
Exception: Must produce aggregated value
Any suggestions? Thanks!
You can pass a dictionary of functions to the resample method:
In [35]: ts
Out[35]:
2013-01-01 00:00:00 0
2013-01-01 00:15:00 1
2013-01-01 00:30:00 2
2013-01-01 00:45:00 3
2013-01-01 01:00:00 4
2013-01-01 01:15:00 5
...
2013-01-01 23:00:00 92
2013-01-01 23:15:00 93
2013-01-01 23:30:00 94
2013-01-01 23:45:00 95
2013-01-02 00:00:00 96
Freq: 15T, Length: 97
Create a dictionary of functions:
mhl = {'m':np.mean, 'h':np.max, 'l':np.min}
Pass the dictionary to the how parameter of resample:
In [36]: ts.resample("30Min", how=mhl)
Out[36]:
h m l
2013-01-01 00:00:00 1 0.5 0
2013-01-01 00:30:00 3 2.5 2
2013-01-01 01:00:00 5 4.5 4
2013-01-01 01:30:00 7 6.5 6
2013-01-01 02:00:00 9 8.5 8
2013-01-01 02:30:00 11 10.5 10
2013-01-01 03:00:00 13 12.5 12
2013-01-01 03:30:00 15 14.5 14
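Note for current pandas: the how argument was deprecated in 0.18 and later removed, so on recent versions the equivalent is to call .agg on the resampler (a sketch):
# Modern replacement for resample(..., how=...): aggregate, then rename the columns.
out = ts.resample('30min').agg(['mean', 'max', 'min'])
out.columns = ['m', 'h', 'l']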