I am trying to groupby my DataFrame and shift some columns by 1 day.
Code for the test df:
import pandas as pd
import datetime as dt
d = {'date' : ['202211', '202211', '202211','202211','202211', '202212', '202212', '202212', '202212', '202213', '202213', '202213', '202213', '202213'],
'id' : ['a', 'b', 'c', 'd','e', 'a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'e'],
'price' : [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 2, 1.1, 2, 1.5, 0.8, 1.3, 1.5],
'shrs' : [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data = d)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.set_index('date', inplace=True)
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(
    lambda x: pd.qcut(x, 2, labels=range(1, 3), duplicates='drop'))
df['weight'] = df['me'] / (df.groupby(['date', 'rank'])['me'].transform('sum'))
df['ew'] = 1 / (df.groupby(['date', 'rank'])['price'].transform('count'))
df.sort_values(['id', 'date'], inplace=True)
print(df)
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 1 0.285714 0.333333
2022-01-02 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-03 a 2.0 100 200.0 2 1.000000 1.000000
2022-01-01 b 1.2 100 120.0 1 0.342857 0.333333
2022-01-02 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-03 b 1.5 100 150.0 1 0.294118 0.250000
2022-01-01 c 1.3 100 130.0 1 0.371429 0.333333
2022-01-02 c 2.0 100 200.0 2 0.500000 0.500000
2022-01-03 c 0.8 100 80.0 1 0.156863 0.250000
2022-01-01 d 1.5 100 150.0 2 0.468750 0.500000
2022-01-02 d 1.1 100 110.0 1 0.423077 0.500000
2022-01-03 d 1.3 100 130.0 1 0.254902 0.250000
2022-01-01 e 1.7 100 170.0 2 0.531250 0.500000
2022-01-03 e 1.5 100 150.0 1 0.294118 0.250000
The following code gets me almost what I want. But because my data is not consistent and some days may be skipped (see the observations for id "e"), I cannot simply use shift(1); I need to take the date frequency into account as well.
df['rank'] = df.groupby('id')['rank'].shift(1)
df['weight'] = df.groupby('id')['weight'].shift(1)
df['ew'] = df.groupby('id')['ew'].shift(1)
print(df)
results in:
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 NaN NaN NaN
2022-01-02 a 2.0 100 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100 120.0 NaN NaN NaN
2022-01-02 b 1.5 100 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100 130.0 NaN NaN NaN
2022-01-02 c 2.0 100 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100 150.0 NaN NaN NaN
2022-01-02 d 1.1 100 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100 170.0 NaN NaN NaN
2022-01-03 e 1.5 100 150.0 2 0.531250 0.500000
The desired outcome would be (note the observations for id "e"):
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 NaN NaN NaN
2022-01-02 a 2.0 100 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100 120.0 NaN NaN NaN
2022-01-02 b 1.5 100 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100 130.0 NaN NaN NaN
2022-01-02 c 2.0 100 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100 150.0 NaN NaN NaN
2022-01-02 d 1.1 100 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100 170.0 NaN NaN NaN
2022-01-03 e 1.5 100 150.0 NaN NaN NaN
I did not manage to simply use freq='d' here. What could be the easiest solution?
One quick and dirty solution that works with your data is to unstack() and stack() the id column to create nulls for the gaps and then drop the nulls at the end:
In [50]: df = df.set_index('id', append=True).unstack().stack(dropna=False).reset_index(level=1).sort_values('id')
In [51]: df['rank'] = df.groupby('id')['rank'].shift(1)
...: df['weight'] = df.groupby('id')['weight'].shift(1)
...: df['ew'] = df.groupby('id')['ew'].shift(1)
In [52]: print(df[df['price'].notnull()])
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100.0 100.0 NaN NaN NaN
2022-01-02 a 2.0 100.0 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100.0 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100.0 120.0 NaN NaN NaN
2022-01-02 b 1.5 100.0 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100.0 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100.0 130.0 NaN NaN NaN
2022-01-02 c 2.0 100.0 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100.0 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100.0 150.0 NaN NaN NaN
2022-01-02 d 1.1 100.0 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100.0 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100.0 170.0 NaN NaN NaN
2022-01-03 e 1.5 100.0 150.0 NaN NaN NaN
You can resample on a groupby to generate a daily series:
# Resample to produce daily data for each id
shifted = df.groupby("id").resample("1D").first().groupby(level=0).shift()
# `shifted` and `df`'s indexes must have the same shape
df = df.set_index("id", append=True).swaplevel()
# Rely on pandas' automatic row alignment to handle the assignment
cols = ["rank", "weight", "ew"]
df[cols] = shifted[cols]
df.reset_index(0, inplace=True)
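To sanity-check the gap handling for id "e", one could inspect just those rows after the assignment; because the daily resample inserts a (missing) 2022-01-02 row for "e", the shifted columns on 2022-01-03 should come out as NaN, matching the desired output above:
print(df.loc[df['id'] == 'e', ['rank', 'weight', 'ew']])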
I've seen solutions in different languages (e.g. SQL, Fortran, or C++) that mainly rely on for loops.
I am hoping that someone can help me solve this task using pandas instead.
If I have a data frame that looks like this:
date pcp sum_count sumcum
7/13/2013 0.1 3.0 48.7
7/14/2013 48.5
7/15/2013 0.1
7/16/2013
8/1/2013 1.5 1.0 1.5
8/2/2013
8/3/2013
8/4/2013 0.1 2.0 3.6
8/5/2013 3.5
9/22/2013 0.3 3.0 26.3
9/23/2013 14.0
9/24/2013 12.0
9/25/2013
9/26/2013
10/1/2014 0.1 11.0
10/2/2014 96.0 135.5
10/3/2014 2.5
10/4/2014 37.0
10/5/2014 9.5
10/6/2014 26.5
10/7/2014 0.5
10/8/2014 25.5
10/9/2014 2.0
10/10/2014 5.5
10/11/2014 5.5
And I was hoping I could do the following:
STEP 1: create the sum_count column by counting each run of consecutive non-missing values in the 'pcp' column.
STEP 2: create the sumcum column by summing the 'pcp' values within each of those runs.
STEP 3 : create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! Thank you!
UPDATED QUESTION:
I had previously emphasized that the result should only return the maximum over 3 consecutive pcp values, but I mistakenly gave the wrong data frame and had to edit it. Sorry.
The sumcum of 135.5 comes from 96.0 + 2.5 + 37.0. It is the maximum sum of 3 consecutive pcp values within the run whose sum_count is 11.
Thank you
Use:
#filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#test NaNs
m = df['pcp'].isna()
#groups by consecutive non NaNs
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#filter no NaNs rows
df = df[~m].copy()
#keep only rows whose run has at least N values
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#get rolling sum per groups per N days
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#get the maximum rolling sum per year
#add missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
First, convert date to a real datetime dtype and create a boolean mask that keeps rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
grp = df.loc[mask, 'date'] \
.ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
.cumsum()
df = df.join(df.reset_index()
.groupby(grp)
.agg(index=('index', 'first'),
sum_count=('pcp', 'size'),
sumcum=('pcp', 'sum'))
.set_index('index'))
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
.rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
I am using the pandas .qcut() function to divide a column 'AveragePrice' into 4 bins. I would like to assign each bin to a new variable so that I can do a separate analysis on each quartile, i.e. something like:
bin1 = quartile 1
bin2 = quartile 2
bin3 = quartile 3
bin4 = quartile 4
Here is what I'm working with.
`pd.qcut(data['AveragePrice'], q=4)`
2 (0.439, 1.1]
3 (0.439, 1.1]
17596 (1.1, 1.38]
17600 (1.1, 1.38]
Name: AveragePrice, Length: 14127, dtype: category
Categories (4, interval[float64]): [(0.439, 1.1] < (1.1, 1.38] < (1.38, 1.69] < (1.69, 3.25]]
If I understand correctly, you can "pivot" your quartile values into columns.
Toy example:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'AveragePrice': np.random.randint(0, 100, size=10) })
   AveragePrice
0            20
1            29
2            53
3            30
4             3
5             4
6            78
7            62
8            75
9             1
Create the Quartile column, pivot Quartile into columns, and rename the columns to something more reader-friendly:
df['Quartile'] = pd.qcut(df.AveragePrice, q=4)
pivot = df.reset_index().pivot_table(
    index='index',
    columns='Quartile',
    values='AveragePrice')
pivot.columns = ['Q1', 'Q2', 'Q3', 'Q4']
    Q1    Q2    Q3    Q4
0  NaN  20.0   NaN   NaN
1  NaN  29.0   NaN   NaN
2  NaN   NaN  53.0   NaN
3  NaN   NaN  30.0   NaN
4  3.0   NaN   NaN   NaN
5  4.0   NaN   NaN   NaN
6  NaN   NaN   NaN  78.0
7  NaN   NaN   NaN  62.0
8  NaN   NaN   NaN  75.0
9  1.0   NaN   NaN   NaN
Now you can analyze the bins separately, e.g., describe them:
pivot.describe()
             Q1         Q2         Q3         Q4
count  3.000000   2.000000   2.000000   3.000000
mean   2.666667  24.500000  41.500000  71.666667
std    1.527525   6.363961  16.263456   8.504901
min    1.000000  20.000000  30.000000  62.000000
25%    2.000000  22.250000  35.750000  68.500000
50%    3.000000  24.500000  41.500000  75.000000
75%    3.500000  26.750000  47.250000  76.500000
max    4.000000  29.000000  53.000000  78.000000
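If the wide layout is not required, a plain groupby on the Quartile column gives the same per-bin statistics in one step. A minimal alternative sketch, assuming the same df and Quartile column as above:
# Per-quartile summary without pivoting; observed=True avoids the
# categorical-grouping warning in newer pandas versions
df.groupby('Quartile', observed=True)['AveragePrice'].describe()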
I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good Pythonic method for this; does one exist?
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine for maybe three, but they all require method chaining once the number of columns grows beyond that:
Example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[np.NaN, 2, 4, 5, np.NaN],
'col2':[np.NaN, 5, 1, 0, np.NaN],
'col3':[2, np.NaN, 9, 1, np.NaN],
'col4':[np.NaN, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill over the columns axis (axis=1), we get the coalesced values in a generalized way, no matter how many columns there are.
As a bonus, this also works for string-type columns!
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Using Series.combine_first (the accepted answer) can get quite cumbersome and eventually becomes unmanageable as the number of columns grows:
df['coalesce'] = (
df['col1'].combine_first(df['col2'])
.combine_first(df['col3'])
.combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Try this as well; it's easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster: df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
combine_first is the most straightforward option, but there are others. Below I outline a few more solutions, some applicable to different cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
Alternatively, to combine first on b, switch the conditions around.
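For reference, the equivalent np.where call, and the same idea flipped around to combine first on b, might look like this (a small sketch reusing the df above; the new column names are just illustrative):
import numpy as np
# np.where equivalent of the mask/where idiom above
df['c'] = np.where(df['a'].notnull(), df['a'], df['b'])
# Combine first on b instead: keep b where present, fall back to a
df['d'] = df['b'].mask(df['b'].isnull(), df['a'])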
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
I encountered this problem too, but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A
def get_first_non_null(dfrow, columns_to_search):
for c in columns_to_search:
if pd.notnull(dfrow[c]):
return dfrow[c]
return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
I'm thinking of a solution like this,
def coalesce(s: pd.Series, *series: List[pd.Series]):
"""coalesce the column information like a SQL coalesce."""
for other in series:
s = s.mask(pd.isnull, other)
return s
because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
For a more general case, where there are no NaNs but you want the same behavior:
Merge 'left', but override 'right' values where possible
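In case that link goes stale, here is a minimal sketch of one way to get that kind of behavior, using DataFrame.update with made-up left/right frames: start from the right-hand frame and overwrite its values with the left-hand ones wherever a matching key exists.
import pandas as pd
left = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2, 3], 'val': ['x', 'y', 'z']})
# Align on 'key' and let `left` override `right` where it has data
out = right.set_index('key')
out.update(left.set_index('key'))
out = out.reset_index()   # keys 1 and 2 take 'a' and 'b'; key 3 keeps 'z'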
Good code, but you have a typo for Python 3; the corrected version looks like this:
def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,np.NaN, 3, 4, 5],
'B':[np.NaN, 2, 3, 4, np.NaN]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B c
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0
I'm looking for a Pythonic approach to capturing stats based on the number of matches in a DataFrame column. So, working with this example:
rng = pd.DataFrame( {'initial_data': ['A', 'A','A', 'A', 'B','B', 'A' , 'A', 'A', 'A','B' , 'B', 'B', 'A',]}, index = pd.date_range('4/2/2014', periods=14, freq='BH'))
test_B_mask = rng['initial_data'] == 'B'
rng['test_for_B'] = rng['initial_data'][test_B_mask]
and running this function to provide matches:
def func_match(df_in,val):
return ((df_in == val) & (df_in.shift() == val)).astype(int)
func_match(rng['test_for_B'],rng['test_for_B'])
I get the following output:
2014-04-02 09:00:00 0
2014-04-02 10:00:00 0
2014-04-02 11:00:00 0
2014-04-02 12:00:00 0
2014-04-02 13:00:00 0
2014-04-02 14:00:00 1
2014-04-02 15:00:00 0
2014-04-02 16:00:00 0
2014-04-03 09:00:00 0
2014-04-03 10:00:00 0
2014-04-03 11:00:00 0
2014-04-03 12:00:00 1
2014-04-03 13:00:00 1
2014-04-03 14:00:00 0
Freq: BH, Name: test_for_B, dtype: int64
I can use something simple like func_match(rng['test_for_B'],rng['test_for_B']).sum()
which returns
3
to get the total number of times the values match, but could someone help with a function that provides the following, more granular, breakdown?
The number and percentage of times a single match is seen.
The number and percentage of times two consecutive matches are seen (up to a maximum of n matches, which is just the 3 matches from 2014-04-02 11:00:00 through 13:00:00 in this example).
I'm guessing this would be a dict used within the function, but I'm sure many of the experienced coders on Stack Overflow are used to conducting this kind of analysis, so I would love to learn how to approach this task.
Thank you in advance for any help with this.
Edit:
I didn't initially specify the desired output as I am open to all options and didn't want to deter anyone from providing solutions. However, as per MaxU's request for the desired output, something like this would be great:
Matches Matches_Percent
0 match 3 30
1 match 4 40
2 match 2 20
3 match 1 10
etc
Initial setup
rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A',]},
index = pd.date_range('4/2/2014', periods=14, freq='BH'))
Assign bools to column 'test_for_B'
rng['test_for_B'] = rng['initial_data'] == 'B'
Tricky bit
Test for 'B' where the previous row was not 'B'. This signifies the beginning of a group. Then cumsum ties each group together.
contigious_groups = ((rng.initial_data == 'B') & (rng.initial_data != rng.initial_data.shift())).cumsum()
Now I group by this grouping we created and sum the bools within each group. This tells us whether each group is a double, triple, etc.
counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()
Then use value_counts to get the frequency of each group type and divide by contigious_groups.max(), because that is the total number of groups.
counts.value_counts() / contigious_groups.max()
3.0 0.5
2.0 0.5
Name: test_for_B, dtype: float64
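If you want this in the tabular shape shown in the question (counts plus percentages), the same counts series can be reshaped along these lines; Matches and Matches_Percent are simply the column names borrowed from the desired output:
summary = counts.value_counts().to_frame('Matches')
summary['Matches_Percent'] = summary['Matches'] / summary['Matches'].sum() * 100
print(summary)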
df = pd.DataFrame({'ID': ['A', 'A','A', 'A', 'B','B', 'A' , 'A', 'A', 'A','B' , 'B', 'B', 'A',]},
index = pd.date_range('4/2/2014', periods=14, freq='BH'))
df.head()
Out: ID
2014-04-02 09:00:00 A
2014-04-02 10:00:00 A
2014-04-02 11:00:00 A
2014-04-02 12:00:00 A
2014-04-02 13:00:00 B
To count occurrences of each ID, you can use pd.Series.value_counts:
df['ID'].value_counts()
Out: A 9
B 5
Name: ID, dtype: int64
To count consecutive occurrences, you can do as follows: pivot the table with dummy variables for each ID:
df2 = df.assign(Count=lambda x: 1)\
        .reset_index()\
        .pivot_table('Count', columns='ID', index='index')
df2.head()
Out: ID A B
index
2014-04-02 09:00:00 1.0 NaN
2014-04-02 10:00:00 1.0 NaN
2014-04-02 11:00:00 1.0 NaN
2014-04-02 12:00:00 1.0 NaN
2014-04-02 13:00:00 NaN 1.0
The following function counts the number of consecutive matches:
df2.apply(lambda x: x.notnull()\
.groupby(x.isnull().cumsum()).sum())
Out:
ID A B
0 4.0 NaN
1 0.0 0.0
2 4.0 0.0
3 0.0 0.0
4 0.0 2.0
5 1.0 0.0
6 NaN 0.0
7 NaN 0.0
8 NaN 3.0
9 NaN 0.0
We just need to group by ID and values:
df2.apply(lambda x: x.notnull().groupby(x.isnull().cumsum()).sum())\
   .unstack()\
   .reset_index()\
   .groupby(['ID', 0]).count()\
   .reset_index()\
   .pivot_table(values='level_1', index=0, columns=['ID']).fillna(0)
Out:
ID A B
0
0.0 3.0 7.0
1.0 1.0 0.0
2.0 0.0 1.0
3.0 0.0 1.0
4.0 2.0 0.0
For instance, the previous table reads: A has 2 runs of 4 consecutive matches.
To get percentages instead, add .pipe(lambda x: x/x.values.sum()):
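For reference, the full chain then reads:
df2.apply(lambda x: x.notnull().groupby(x.isnull().cumsum()).sum())\
   .unstack()\
   .reset_index()\
   .groupby(['ID', 0]).count()\
   .reset_index()\
   .pivot_table(values='level_1', index=0, columns=['ID']).fillna(0)\
   .pipe(lambda x: x/x.values.sum())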
Out:
ID A B
0
0.0 0.200000 0.466667
1.0 0.066667 0.000000
2.0 0.000000 0.066667
3.0 0.000000 0.066667
4.0 0.133333 0.000000