rolling unique value count in pandas across multiple columns - python

There are several answers around rolling counts in pandas:
Rolling unique value count in pandas
How to efficiently compute a rolling unique count in a pandas time series?
How do I count unique values across multiple columns?
For one column, I can do:
df[my_col] = df[my_col].rolling(300).apply(lambda x: len(np.unique(x)))
How do I extend this to multiple columns, counting unique values overall across all values in the rolling window?

Inside a list comprehension, iterate over the rolling windows; for each window, flatten the values in the required columns, then use set to get the distinct elements:
cols = [...] # define your cols here
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(300)]
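One caveat: iterating a Rolling object also yields the partial windows at the start of the frame, so the first 299 counts cover fewer than 300 rows. To mimic the usual NaN behaviour of rolling for incomplete windows, you can blank them out afterwards:
# the first window-1 rows have incomplete windows; mark them NaN
df.loc[df.index[:299], 'count'] = float('nan')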

I took a dataframe as an example (3-row rolling window, taking all the columns into account at the same time).
Dataframe for visualization
   col1  col2  col3
0     1     1     1
1     1     1     4
2     2     5     2
3     3     3     3
4     3     7     3
5     5     3     9
6     8     8     2
Proposed script for checking
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})
df['count'] = df.rolling(3).apply(lambda w: len(set(df.iloc[w.index].to_numpy().flatten())))['col1']
print(df)
Output
   col1  col2  col3  count
0     1     1     1    NaN
1     1     1     4    NaN
2     2     5     2    4.0
3     3     3     3    5.0
4     3     7     3    4.0
5     5     3     9    4.0
6     8     8     2    6.0
Another method
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})
df = df.assign(count=df.rolling(3, method='table')
                       .apply(lambda d: len(set(d.flatten())), raw=True, engine='numba')
                       .iloc[:, -1])
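If numba is unavailable, a plain NumPy sketch using sliding_window_view (requires NumPy >= 1.20) gives the same counts; the leading NaNs for the incomplete windows are padded back by hand:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})
# one 2-D block per full window, shape (n_cols, window)
windows = np.lib.stride_tricks.sliding_window_view(df.to_numpy(), 3, axis=0)
df['count'] = [np.nan] * 2 + [len(np.unique(w)) for w in windows]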

Related

Concatenate data frames over finite index otherwise start a new column - pandas

I need to add new data to the last column of a DataFrame if it has any empty cells, or create a new column otherwise. I wonder if there is any pythonic way to achieve this through pandas functionality (e.g. concat, join, merge, etc.). The example is as follows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'0': [8, 9, 3, 5, 0], '1': [9, 6, 6, np.nan, np.nan]})
df2 = pd.DataFrame({'2': [2, 9, 4]}, index=[3, 4, 0])
desired_output = pd.DataFrame({'0': [8, 9, 3, 5, 0],
                               '1': [9, 6, 6, 2, 9],
                               '2': [4, np.nan, np.nan, np.nan, np.nan]})
# df1
   0    1
0  8    9
1  9    6
2  3    6
3  5  NaN
4  0  NaN
# df2
   2
3  2
4  9
0  4
# desired_output
   0  1    2
0  8  9    4
1  9  6  NaN
2  3  6  NaN
3  5  2  NaN
4  0  9  NaN
Your problem can be broken down into two steps:
1. Concatenate df1 and df2 based on their indexes.
2. For each row of the concatenated dataframe, move the NaNs to the end.
Try this:
# Step 1: concatenate the two dataframes
result = pd.concat([df1, df2], axis=1)
# Step 2a: for each row, sort the elements based on their nan status
# For example: sort [1, 2, nan, 3] based on [False, False, True, False]
# np.argsort will return [0, 1, 3, 2]
# Stable sort is critical here since we don't want to swap elements whose
# sort keys are equal.
arr = result.to_numpy()
idx = np.argsort(np.isnan(arr), kind="stable")
# Step 2b: reconstruct the result dataframe based on the sort order
result = pd.DataFrame(np.take_along_axis(arr, idx, axis=1), columns=result.columns)
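Running this on the example above reproduces the desired output (the dtypes become float after the NumPy round-trip):
print(result)
#      0    1    2
# 0  8.0  9.0  4.0
# 1  9.0  6.0  NaN
# 2  3.0  6.0  NaN
# 3  5.0  2.0  NaN
# 4  0.0  9.0  NaN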

using isin across multiple columns

I'm trying to use .isin with ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.
So, I have 2 data-sets with 9 rows each; df2 is the top table and df1 is the bottom one:
# df2
Index  Serial  Count  Churn
1      9       5      0
2      8       6      0
3      10      2      1
4      7       4      2
5      7       9      2
6      10      2      2
7      2       9      1
8      9       8      3
9      4       3      5
# df1
Index  Serial  Count  Churn
1      10      2      1
2      10      2      1
3      9       3      0
4      8       6      0
5      9       8      0
6      1       9      1
7      10      3      1
8      6       7      1
9      4       8      0
I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.
For example if I base my search on the columns Serial and Count I wouldn't get Index 1 and 2 back from df1 as it appears in df2 at Index position 6, the same with Index position 4 in df1 as it appears at Index position 2 in df2. The same would apply to Index position 5 in df1 as it is at Index position 8 in df2.
The churn column doesn't really matter.
I can get it to work based on only 1 column, but not on more than 1 column.
df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more. The expected output is:
Index  Serial  Count  Churn
3      9       3      0
6      1       9      1
7      10      3      1
8      6       7      1
9      4       8      0
One solution is to merge with indicators:
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# merge with indicator on
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
Output
   Serial  Count  Churn
1       9      4      1
5       1      9      1
6      10      3      1
7       6      7      1
8       4      8      1
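A merge-free alternative, as a sketch: build a MultiIndex from the key columns on each side and anti-join with Index.isin (MultiIndex.from_frame needs pandas >= 0.24):
keys1 = pd.MultiIndex.from_frame(df1[['Serial', 'Count']])
keys2 = pd.MultiIndex.from_frame(df2[['Serial', 'Count']])
res = df1[~keys1.isin(keys2)]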
I've had a similar issue to solve, and I've found the easiest way to deal with it is to create a temporary column consisting of the merged identifier columns, then use isin on this newly created temporary column's values.
A simple function achieving this could be the following:
from functools import reduce

get_temp_col = lambda df, cols: reduce(lambda x, y: x + df[y].astype('str'), cols, "")

def subset_on_x_columns(df1, df2, cols):
    """
    Subsets the input dataframe `df1` based on the missing unique values of input
    columns `cols` of dataframe `df2`.

    :param df1: Pandas dataframe to be subsetted
    :param df2: Pandas dataframe whose values are going to be used to
                subset `df1` by
    :param cols: List of column names
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)
    return df1[~df1_temp_col.isin(df2_temp_col.unique())]
Thus for your case all that is needed, is to execute:
result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])
which has the wanted rows:
Index  Serial  Count  Churn
3      9       3      0
6      1       9      1
7      10      3      1
8      6       7      1
9      4       8      0
The nice thing about this solution is that it scales naturally in the number of columns: all that is needed is to specify, in the input parameter cols, which columns to use as identifiers.
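One caveat worth flagging: concatenating without a separator can make distinct key pairs collide, e.g. Serial=1, Count=23 and Serial=12, Count=3 both become '123'. Inserting a delimiter that cannot occur inside the values avoids this:
# '|' assumed not to occur in the identifier values
get_temp_col = lambda df, cols: reduce(lambda x, y: x + '|' + df[y].astype('str'), cols, '')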

How to iterate a vectorized if/else statement over multiple columns with a list that can change?

The numbers in ltlist refer to ID numbers that can change. Is it possible to iterate through multiple columns for the items in ltlist? Assume the elements in ltlist in this example aren't constant. I'd hoped to use a loop instead of a vectorized if/else too, but couldn't get it to work.
import pandas as pd, numpy as np

ltlist = [1, 2]
org = pd.DataFrame({'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]})
ltlist_set = set(ltlist)
org['LT'] = np.where(org['ID'].isin(ltlist_set), org['ID'], 0)
I'll need to check the ID2 column and write the ID in, unless it already has an ID.
output
ID  ID2  LT
1   3    1
3   4    0
4   5    0
5   6    0
6   7    0
7   2    2
Thanks!
Since you are using 0 as the default value, you can initialise LT to 0 and pass the current LT column as the fallback in each np.where call against the data frame:
import pandas as pd
import numpy as np

ltset = set([1, 2])
org = pd.DataFrame({'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]})
org['LT'] = 0
for col in org.columns.drop('LT'):
    org['LT'] = np.where(org[col].isin(ltset), org[col], org['LT'])
org
# returns:
   ID  ID2  LT
0   1    3   1
1   3    4   0
2   4    5   0
3   5    6   0
4   6    7   0
5   7    2   2
This will always keep the value of the right-most column that has a value in ltlist. If you want to keep the left-most column that has a value, you can just iterate over the columns in reverse.
for col in org.columns.drop('LT')[::-1]:
    ...
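For comparison, a loop-free sketch of the same idea: mask the values that are not in ltlist, then take the first surviving value per row (left-most column wins), falling back to 0:
vals = org[['ID', 'ID2']]
org['LT'] = (vals.where(vals.isin(ltset))   # keep only values in ltset
                 .bfill(axis=1)             # pull the first match to the front
                 .iloc[:, 0]
                 .fillna(0)
                 .astype(int))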

Pandas merge duplicate DataFrame columns preserving column names

How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
df = pd.DataFrame({"col1": [0, 0, 1, 2, 5, 3, 7],
                   "col2": [0, 1, 2, 3, 3, 3, 4],
                   "col3": [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column
   col1  col2
0     0     0
1     0     1
2     1     2
3     2     3
4     5     3
5     3     3
6     7     4
How can I keep the information on which columns were merged? e.g. something like
   [col1]  [col2, col3]
0       0             0
1       0             1
2       1             2
3       2             3
4       5             3
5       3             3
6       7             4
Thanks!
# group columns by their values
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one column from each group of the columns
unique_df = df.loc[:, grouped_columns.str[0]]
# make a new column name for each group; a list can't work as a column name, so join them
unique_df.columns = grouped_columns.apply("-".join)
unique_df
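On the example DataFrame this yields the merged frame below. Note that groupby(..., axis=1) is deprecated in recent pandas (2.x), so the transpose-based variant that follows may age better:
   col1  col2-col3
0     0          0
1     0          1
2     1          2
3     2          3
4     5          3
5     3          3
6     7          4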
I also used T and tuple to groupby:
def f(x):
    d = x.iloc[[0]]
    d.index = ['-'.join(x.index.tolist())]
    return d

df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T
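This produces the same merged frame as above, with the surviving column labelled col2-col3.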

pandas rearrange dataframe to have all values in ascending order per every column independently

The title should say it all: I want to turn this DataFrame:
A  NaN  4  3
B    2  1  4
C    3  4  2
D    4  2  8
into this DataFrame:
A    2  1  2
B    3  2  3
C    4  4  4
D  NaN  4  8
And I want to do it in a nice manner. The ugly solution would be to take every column and form a new DataFrame.
To test, use:
d = {'one': [None, 2, 3, 4],
     'two': [4, 1, 4, 2],
     'three': [3, 4, 6, 8]}
df = pd.DataFrame(d, index=list('ABCD'))
The desired sort ignores the index values, so the operation appears to be more like a NumPy operation than a Pandas one:
import pandas as pd

d = {'one': [None, 2, 3, 4],
     'two': [4, 1, 4, 2],
     'three': [3, 4, 6, 8]}
df = pd.DataFrame(d, index=list('ABCD'))
#    one  three  two
# A  NaN      3    4
# B    2      4    1
# C    3      6    4
# D    4      8    2

arr = df.values
arr.sort(axis=0)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
yields
   one  three  two
A    2      3    1
B    3      4    2
C    4      6    4
D  NaN      8    4
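One hedge on the snippet above: depending on the frame's dtypes, df.values can be a view of the underlying data, so the in-place arr.sort may reorder the original df as well. np.sort returns a sorted copy and avoids the surprise:
import numpy as np
df_sorted = pd.DataFrame(np.sort(df.to_numpy(), axis=0),
                         index=df.index, columns=df.columns)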
