I'm looking for a Pythonic approach to capture stats based on the number of matches in a DataFrame column. Working with this example:
rng = pd.DataFrame( {'initial_data': ['A', 'A','A', 'A', 'B','B', 'A' , 'A', 'A', 'A','B' , 'B', 'B', 'A',]}, index = pd.date_range('4/2/2014', periods=14, freq='BH'))
test_B_mask = rng['initial_data'] == 'B'
rng['test_for_B'] = rng['initial_data'][test_B_mask]
and running this function to provide matches:
def func_match(df_in, val):
    return ((df_in == val) & (df_in.shift() == val)).astype(int)
func_match(rng['test_for_B'],rng['test_for_B'])
I get the following output:
2014-04-02 09:00:00 0
2014-04-02 10:00:00 0
2014-04-02 11:00:00 0
2014-04-02 12:00:00 0
2014-04-02 13:00:00 0
2014-04-02 14:00:00 1
2014-04-02 15:00:00 0
2014-04-02 16:00:00 0
2014-04-03 09:00:00 0
2014-04-03 10:00:00 0
2014-04-03 11:00:00 0
2014-04-03 12:00:00 1
2014-04-03 13:00:00 1
2014-04-03 14:00:00 0
Freq: BH, Name: test_for_B, dtype: int64
I can use something simple like func_match(rng['test_for_B'],rng['test_for_B']).sum()
which returns
3
to get the total number of times the values match, but could someone help with a function to provide the following, more granular, statistics please?
Amount and percentage of times a single match is seen.
Amount and percentage of times two consecutive matches are seen (up to n max matches, which is just the single 3-match run from 2014-04-03 11:00:00 through 13:00:00 in this example).
I'm guessing this would be a dict used within the function, but I'm sure many of the experienced coders on Stack Overflow are used to conducting this kind of analysis, so I would love to learn how to approach this task.
Thank you in advance for any help with this.
Edit:
I didn't initially specify the desired output as I am open to all options and didn't want to deter anyone from providing solutions. However, as per the request from MaxU for the desired output, something like this would be great:
Matches Matches_Percent
0 match 3 30
1 match 4 40
2 match 2 20
3 match 1 10
etc
Initial setup
rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A',]},
index = pd.date_range('4/2/2014', periods=14, freq='BH'))
Assign a bool column 'test_for_B'
rng['test_for_B'] = rng['initial_data'] == 'B'
Tricky bit
Test for 'B' and that the last row was not 'B'. This signifies the beginning of a group. Then cumsum ties the groups together.
contiguous_groups = ((rng.initial_data == 'B') & (rng.initial_data != rng.initial_data.shift())).cumsum()
Now I groupby this grouping we created and sum the bools within each group. This gets at whether it's a double, triple, etc.
counts = rng.loc[contiguous_groups.astype(bool)].groupby(contiguous_groups).test_for_B.sum()
Then use value_counts to get the frequency of each group type, and divide by contiguous_groups.max() because that's a count of how many groups there are.
counts.value_counts() / contiguous_groups.max()
3.0 0.5
2.0 0.5
Name: test_for_B, dtype: float64
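If you want the output shaped like the table in the question, one possible follow-up (a sketch, assuming the percentages should be taken over the runs counted here; it does not produce the '0 match' rows from the desired output) is:
run_counts = counts.value_counts().sort_index()
summary = pd.DataFrame({'Matches': run_counts,
                        'Matches_Percent': run_counts / run_counts.sum() * 100})
print(summary)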
df = pd.DataFrame({'ID': ['A', 'A','A', 'A', 'B','B', 'A' , 'A', 'A', 'A','B' , 'B', 'B', 'A',]},
index = pd.date_range('4/2/2014', periods=14, freq='BH'))
df.head()
Out: ID
2014-04-02 09:00:00 A
2014-04-02 10:00:00 A
2014-04-02 11:00:00 A
2014-04-02 12:00:00 A
2014-04-02 13:00:00 B
To count occurrences of each ID, you can use pd.Series.value_counts:
df['ID'].value_counts()
Out: A 9
B 5
Name: ID, dtype: int64
To count consecutive occurrences, you can do as follows: pivot the table with dummy variables for each ID:
df2 = df.assign(Count = lambda x: 1)\
.reset_index()\
.pivot_table('Count', columns='ID', index='index')
df2.head()
Out: ID A B
index
2014-04-02 09:00:00 1.0 NaN
2014-04-02 10:00:00 1.0 NaN
2014-04-02 11:00:00 1.0 NaN
2014-04-02 12:00:00 1.0 NaN
2014-04-02 13:00:00 NaN 1.0
The following function counts the number of consecutive matches:
df2.apply(lambda x: x.notnull()\
.groupby(x.isnull().cumsum()).sum())
Out:
ID A B
0 4.0 NaN
1 0.0 0.0
2 4.0 0.0
3 0.0 0.0
4 0.0 2.0
5 1.0 0.0
6 NaN 0.0
7 NaN 0.0
8 NaN 3.0
9 NaN 0.0
We just need to group by ID and values:
df2.apply(lambda x: x.notnull().groupby(x.isnull().cumsum()).sum())\
.unstack()\
.reset_index()\
.groupby(['ID', 0]).count()\
.reset_index()\
.pivot_table(values='level_1', index=0, columns=['ID']).fillna(0)
Out:
ID A B
0
0.0 3.0 7.0
1.0 1.0 0.0
2.0 0.0 1.0
3.0 0.0 1.0
4.0 2.0 0.0
For instance, the previous table reads: A has two runs of 4 consecutive matches.
To get percentages instead, add .pipe(lambda x: x/x.values.sum()):
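The full chain then reads (identical to the code above, with the normalisation step appended):
df2.apply(lambda x: x.notnull().groupby(x.isnull().cumsum()).sum())\
.unstack()\
.reset_index()\
.groupby(['ID', 0]).count()\
.reset_index()\
.pivot_table(values='level_1', index=0, columns=['ID']).fillna(0)\
.pipe(lambda x: x/x.values.sum())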
Out:
ID A B
0
0.0 0.200000 0.466667
1.0 0.066667 0.000000
2.0 0.000000 0.066667
3.0 0.000000 0.066667
4.0 0.133333 0.000000
Initial problem statement
Using pandas, I would like to apply a function that is available for resample() but not for rolling().
This works:
df1 = df.resample(to_freq,
closed='left',
kind='period',
).agg(OrderedDict([('Open', 'first'),
('Close', 'last'),
]))
This doesn't:
df2 = df.rolling(my_indexer).agg(
OrderedDict([('Open', 'first'),
('Close', 'last') ]))
>>> AttributeError: 'first' is not a valid function for 'Rolling' object
df3 = df.rolling(my_indexer).agg(
OrderedDict([
('Close', 'last') ]))
>>> AttributeError: 'last' is not a valid function for 'Rolling' object
What would be your advice for keeping the first and last value of a rolling window, to be put into two different columns?
EDIT 1 - with usable input data
import pandas as pd
from random import seed
from random import randint
from collections import OrderedDict
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0,10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
# First & last work with resample
resampled_first = df.resample('3H',
closed='left',
kind='period',
).agg(OrderedDict([('Values', 'first')]))
resampled_last = df.resample('3H',
closed='left',
kind='period',
).agg(OrderedDict([('Values', 'last')]))
# They don't with rolling
rolling_first = df.rolling(3).agg(OrderedDict([('Values', 'first')]))
rolling_last = df.rolling(3).agg(OrderedDict([('Values', 'last')]))
Thanks for your help!
You can use your own function to get the first or last element in a rolling window:
rolling_first = df.rolling(3).agg(lambda rows: rows[0])
rolling_last = df.rolling(3).agg(lambda rows: rows[-1])
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
df['last'] = df['Values'].rolling(3).agg(lambda rows: rows[-1])
print(df)
Result
Values first last
2020-01-01 00:00:00+00:00 2 NaN NaN
2020-01-01 01:00:00+00:00 9 NaN NaN
2020-01-01 02:00:00+00:00 1 2.0 1.0
2020-01-01 03:00:00+00:00 4 9.0 4.0
2020-01-01 04:00:00+00:00 1 1.0 1.0
2020-01-01 05:00:00+00:00 7 4.0 7.0
2020-01-01 06:00:00+00:00 7 1.0 7.0
2020-01-01 07:00:00+00:00 7 7.0 7.0
2020-01-01 08:00:00+00:00 10 7.0 10.0
2020-01-01 09:00:00+00:00 6 7.0 6.0
2020-01-01 10:00:00+00:00 3 10.0 3.0
2020-01-01 11:00:00+00:00 1 6.0 1.0
2020-01-01 12:00:00+00:00 7 3.0 7.0
2020-01-01 13:00:00+00:00 0 1.0 0.0
2020-01-01 14:00:00+00:00 6 7.0 6.0
2020-01-01 15:00:00+00:00 6 0.0 6.0
2020-01-01 16:00:00+00:00 9 6.0 9.0
2020-01-01 17:00:00+00:00 0 6.0 0.0
2020-01-01 18:00:00+00:00 7 9.0 7.0
2020-01-01 19:00:00+00:00 4 0.0 4.0
2020-01-01 20:00:00+00:00 3 7.0 3.0
2020-01-01 21:00:00+00:00 9 4.0 9.0
2020-01-01 22:00:00+00:00 1 3.0 1.0
2020-01-01 23:00:00+00:00 5 9.0 5.0
2020-01-02 00:00:00+00:00 0 1.0 0.0
EDIT:
When using a dictionary, you have to pass the lambda directly, not a string:
result = df['Values'].rolling(3).agg({'first': lambda rows: rows[0], 'last': lambda rows: rows[-1]})
print(result)
The same applies with your own function: you have to pass the function itself, not a string with its name.
def first(rows):
    return rows[0]
def last(rows):
    return rows[-1]
result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
result = df['Values'].rolling(3).agg({'first': lambda rows: rows[0], 'last': lambda rows: rows[-1]})
print(result)
def first(rows):
    return rows[0]
def last(rows):
    return rows[-1]
result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
In case anyone else needs the difference between the first and last value in a rolling window: I used this on stock market data, where I wanted the price difference from the beginning to the end of the window, so I created a new column from the current row's 'close' value and the 'open' value taken from 60 rows above using .shift().
df[windowColumn] = df["close"] - (df["open"].shift(60))
I think it's a very quick method for large datasets.
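A self-contained illustration of the same trick (the toy data, the window variable, and the 'window_diff' column name are only placeholders):
import numpy as np
import pandas as pd
window = 60  # the 60-row offset used above; adjust to your window size
df = pd.DataFrame({'open': np.arange(100.0), 'close': np.arange(100.0) + 0.5})
# current row's close minus the open from `window` rows earlier
df['window_diff'] = df['close'] - df['open'].shift(window)
print(df.tail())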
Say I have a dataframe as follows:
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=6, freq='M'),
'value': [3, 3.5, -5, 2, 7, 6.8], 'type': ['a', 'a', 'a', 'b', 'b', 'b']})
df['pct'] = df.groupby(['type'])['value'].pct_change()
Output:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b -0.028571
I want to take the pct values which are bigger than 0.2 or smaller than -0.2 and replace them with the groupby type means.
My attempt to solve this: first, replace the "outliers" with the extreme value -999, then replace those by the groupby outputs. This is what I have done:
df.loc[df['pct'] >= 0.2, 'pct'] = -999
df.loc[df['pct'] <= -0.2, 'pct'] = -999
df["pct"] = df.groupby(['type'])['pct'].transform(lambda x: x.replace(-999, x.mean()))
But obviously this is not the best solution to the problem, and the results are not correct:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -499.416667
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b -499.514286
5 2013-06-30 6.8 b -0.028571
The expected result should look like this:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -1.130
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.24
What have I done wrong? Again, thanks for your kind help.
Instead of both conditions, it is possible to use Series.between and then set the values in pct by GroupBy.transform with mean:
mask = df['pct'].between(-0.2, 0.2)
df.loc[mask, 'pct'] = df.groupby('type')['pct'].transform('mean').values
print (df)
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a -1.130952
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.235714
An alternative solution is to use numpy.where:
mask = df['pct'].between(-0.2, 0.2)
df['pct'] = np.where(mask, df.groupby('type')['pct'].transform('mean'), df['pct'])
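For completeness, a self-contained version of this alternative (it simply combines the question's setup with the two lines above):
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=6, freq='M'),
                   'value': [3, 3.5, -5, 2, 7, 6.8], 'type': ['a', 'a', 'a', 'b', 'b', 'b']})
df['pct'] = df.groupby(['type'])['value'].pct_change()
# values inside [-0.2, 0.2] are replaced by the per-type mean, the rest are kept
mask = df['pct'].between(-0.2, 0.2)
df['pct'] = np.where(mask, df.groupby('type')['pct'].transform('mean'), df['pct'])
print(df)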
I have a dataframe like this:
df = pd.DataFrame({'timestamp': pd.date_range('2018-01-01', '2018-01-02', freq='2h', closed='right'),
                   'col1': [np.nan, np.nan, np.nan, 1, 2, 3, 4, 5, 6, 7, 8, np.nan],
                   'col2': [np.nan, np.nan, 0, 1, 2, 3, 4, 5, np.nan, np.nan, np.nan, np.nan],
                   'col3': [np.nan, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col4': [-2, -1, 0, 1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan, np.nan]
                   })[['timestamp', 'col1', 'col2', 'col3', 'col4']]
which looks like this:
timestamp col1 col2 col3 col4
0 2018-01-01 02:00:00 NaN NaN NaN -2.0
1 2018-01-01 04:00:00 NaN NaN -1.0 -1.0
2 2018-01-01 06:00:00 NaN 0.0 NaN 0.0
3 2018-01-01 08:00:00 1.0 1.0 1.0 1.0
4 2018-01-01 10:00:00 2.0 NaN 2.0 2.0
5 2018-01-01 12:00:00 3.0 3.0 NaN 3.0
6 2018-01-01 14:00:00 NaN 4.0 4.0 4.0
7 2018-01-01 16:00:00 5.0 NaN 5.0 NaN
8 2018-01-01 18:00:00 6.0 NaN 6.0 NaN
9 2018-01-01 20:00:00 7.0 NaN 7.0 NaN
10 2018-01-01 22:00:00 8.0 NaN 8.0 NaN
11 2018-01-02 00:00:00 NaN NaN 9.0 NaN
Now, I want to find an efficient and pythonic way of chopping off (for each column, not counting timestamp) everything before the first valid index and after the last valid index. In this example I have 4 columns, but in reality I have a lot more, 600 or so. I am looking for a way to chop off all the NaN values before the first valid index and all the NaN values after the last valid index.
One way would be to loop through, I guess, but is there a better way? It has to be efficient. I tried to "unpivot" the dataframe using melt, but that didn't help.
An obvious point is that each column would have a different number of rows after the chopping. So I would like the result to be a list of data frames (one for each column) having timestamp and the column in question. For instance:
timestamp col1
3 2018-01-01 08:00:00 1.0
4 2018-01-01 10:00:00 2.0
5 2018-01-01 12:00:00 3.0
6 2018-01-01 14:00:00 NaN
7 2018-01-01 16:00:00 5.0
8 2018-01-01 18:00:00 6.0
9 2018-01-01 20:00:00 7.0
10 2018-01-01 22:00:00 8.0
My try
I tried like this:
final = []
columns = [c for c in df if c !='timestamp']
for col in columns:
    first = df.loc[:, col].first_valid_index()
    last = df.loc[:, col].last_valid_index()
    final.append(df.loc[:, ['timestamp', col]].iloc[first:last+1, :])
One idea is to use a list or dictionary comprehension after setting your index as timestamp. You should test with your data to see if this resolves your issue with performance. It is unlikely to help if your limitation is memory.
df = df.set_index('timestamp')
final = {col: df[col].loc[df[col].first_valid_index(): df[col].last_valid_index()] \
for col in df}
print(final)
{'col1': timestamp
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
2018-01-01 16:00:00 5.0
2018-01-01 18:00:00 6.0
2018-01-01 20:00:00 7.0
2018-01-01 22:00:00 8.0
Name: col1, dtype: float64,
...
'col4': timestamp
2018-01-01 02:00:00 -2.0
2018-01-01 04:00:00 -1.0
2018-01-01 06:00:00 0.0
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
Name: col4, dtype: float64}
You can use the power of functional programming and apply a function to each column. This may speed things up. Also, as your timestamps look sorted, you can use them as the index of your DataFrame.
df.set_index('timestamp', inplace=True)
final = []
def func(col):
    first = col.first_valid_index()
    last = col.last_valid_index()
    final.append(col.loc[first:last])
    return
df.apply(func)
Also, you can compact everything into a one-liner:
final = []
df.apply(lambda col: final.append(col.loc[col.first_valid_index() : col.last_valid_index()]))
My approach is to take the cumulative sum of non-null values for each column, both forwards and backwards, and keep the entries where both sums are greater than 0. Then I do a dict comprehension to return a dataframe for each column (you can change that to a list if that's what you prefer).
For your example we have
cols = [c for c in df.columns if c!='timestamp']
result_dict = {c: df[(df[c].notnull().cumsum() > 0) &
                     (df.loc[::-1, c].notnull().cumsum()[::-1] > 0)][['timestamp', c]]
               for c in cols}
I have a time series dataframe. The dataframe is quite big and contains some missing values in 2 columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor or the average of the previous and following timestamps. Is there an easy way to do it? I have tried fancyimpute, but the dataset contains around 180000 examples and it gives a memory error.
Consider interpolate (Series - DataFrame). This example shows how to fill gaps of any size with a straight line:
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='H'), 'value': range(10)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan
df
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 NaN
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
df['value'].interpolate(method='linear', inplace=True)
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 2.0
3 2013-01-01 03:00:00 3.0
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 6.0
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
Interpolate & fillna:
Since it's a time series question, I will use output graphs in the answer for explanation purposes.
Consider that we have time series data as follows (x axis = number of days, y = quantity):
pdDataFrame.set_index('Dates')['QUANTITY'].plot(figsize = (16,6))
We can see there is some NaN data in the time series. The NaNs make up 19.400% of the total data. Now we want to impute the null/NaN values.
I will try to show you the output of the interpolate and fillna methods for filling NaN values in the data.
interpolate():
First we will use interpolate:
pdDataFrame.set_index('Dates')['QUANTITY'].interpolate(method='linear').plot(figsize = (16,6))
NOTE: the 'time' interpolation method is not used here; the call above uses method='linear'.
fillna() with backfill method
pdDataFrame.set_index('Dates')['QUANTITY'].fillna(value=None, method='backfill', axis=None, limit=None, downcast=None).plot(figsize = (16,6))
fillna() with backfill method & limit = 7
limit: this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled.
pdDataFrame.set_index('Dates')['QUANTITY'].fillna(value=None, method='backfill', axis=None, limit=7, downcast=None).plot(figsize = (16,6))
I find the fillna function more useful, but you can use either of the methods to fill up the NaN values in both of your columns.
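A minimal, self-contained comparison of the two approaches on toy data (not the QUANTITY series plotted above):
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, 6.0])
print(s.interpolate(method='linear'))        # fills the gap with a straight line
print(s.fillna(method='backfill'))           # fills backwards from the next valid value
print(s.fillna(method='backfill', limit=2))  # fills at most 2 consecutive NaNs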
For more details about these functions, refer to the following links:
fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html#pandas.Series.fillna
interpolate: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
There is one more library, impyute, that you can check out. For more details regarding this library, refer to this link: https://pypi.org/project/impyute/
You could use rolling like this:
frame = pd.DataFrame({'Humidity':np.arange(50,64)})
frame.loc[[3,7,10,11],'Humidity'] = np.nan
frame.Humidity.fillna(frame.Humidity.rolling(4,min_periods=1).mean())
Output:
0 50.0
1 51.0
2 52.0
3 51.0
4 54.0
5 55.0
6 56.0
7 55.0
8 58.0
9 59.0
10 58.5
11 58.5
12 62.0
13 63.0
Name: Humidity, dtype: float64
It looks like your data is hourly. How about just taking the average of the hour before and the hour after? Or change the window size to 2, meaning the average of the two hours before and after?
Imputing using other variables can be expensive and you should only consider those methods if the dummy methods do not work well (e.g. introducing too much noise).
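A minimal sketch of the 'hour before / hour after' idea, assuming an hourly series; the 'Humidity'-style data here is made up:
import numpy as np
import pandas as pd
s = pd.Series([50.0, 51.0, np.nan, 53.0, 54.0],
              index=pd.date_range('2020-01-01', periods=5, freq='h'))
# fill each NaN with the average of the value one hour before and one hour after
filled = s.fillna((s.shift(1) + s.shift(-1)) / 2)
print(filled)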
I have a pandas dataframe that looks like this:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
I would like to groupby on KEY and sum on VALUE but only on continuous periods of time. For instance in the above example I would like to get:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-01 5.0
2 B 2017-01-01 2017-02-10 6.0
There are two groups for A since there is a gap in the time periods.
I would like to avoid for loops since the dataframe has tens of millions of rows.
Create a helper Series by comparing the shifted START column per group, and use it for groupby:
s = df.loc[df.groupby('KEY')['START'].shift(-1) == df['END'], 'END']
s = s.combine_first(df['START'])
print (s)
0 2017-01-01
1 2017-01-23
2 2017-01-23
3 2017-02-02
4 2017-02-02
Name: END, dtype: datetime64[ns]
df = df.groupby(['KEY', s], as_index=False).agg({'START':'first','END':'last','VALUE':'sum'})
print (df)
KEY VALUE START END
0 A 2.1 2017-01-01 2017-01-16
1 A 5.0 2017-01-28 2017-03-01
2 B 6.0 2017-01-01 2017-02-10
The answer from jezrael works like a charm if there are only two consecutive rows to aggregate. In the new example below, it would not aggregate the last three rows for KEY = A.
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
5 A 2017-03-01 2017-03-23 1.0
The following solution (slight modification of jezrael's solution) enables to aggregate all rows that should be aggregated:
df = df.sort_values(by='START')
idx = df.groupby('KEY')['START'].shift(-1) != df['END']
df['DATE'] = df.loc[idx, 'START']
df['DATE'] = df.groupby('KEY').DATE.fillna(method='backfill')
df = (df.groupby(['KEY', 'DATE'], as_index=False)
.agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
.drop(['DATE'], axis=1))
Which gives:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-23 6.0
2 B 2017-01-01 2017-02-10 6.0
Thanks #jezrael for the elegant approach!