How to compare 2 dataframe without comparing null values in python? [duplicate]

How to compare 2 dataframe without comparing null values in python? [duplicate] - python

This question already has answers here:
Keeping NaNs with pandas dataframe inequalities
(4 answers)
Closed 4 years ago.
I have 2 dataframes:
df1:
a b c
0 1 2 6
1 2 3 7
2 3 4 8
3 4 5 9
4 5 6 10
df2:
a b c
0 NaN NaN NaN
1 1.0 NaN NaN
2 2.0 2.0 NaN
3 4.0 3.0 NaN
4 6.0 6.0 11.0
When I trying to do df1 > df2, the output is:
In [150]:df1 > df2
Out[150]:
a b c
0 False False False
1 True False False
2 True True False
3 False True False
4 False False False
But what I expect is like this:
a b c
0 NaN NaN NaN
1 True NaN NaN
2 True True NaN
3 False True NaN
4 False False False
So, how should I do to compare 2 df and keep the null to null?

Let's try:
df1.gt(df2).astype(str).mask(df2.isnull())
Output:
a b c
0 NaN NaN NaN
1 True NaN NaN
2 True True NaN
3 False True NaN
4 False False False
You could try, but the way pandas changes the dtype of any series with a null to float dtype you will get the following:
df1.gt(df2).mask(df2.isnull())
Output:
a b c
0 NaN NaN NaN
1 1.0 NaN NaN
2 1.0 1.0 NaN
3 0.0 1.0 NaN
4 0.0 0.0 0.0

Related

Pandas assign series to another Series based on index

I have three Pandas Dataframes:
df1:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
df2:
0 1
3 7
6 5
9 2
df3:
1 2
4 6
7 6
My goal is to assign the values of df2 and df3 to df1 based on the index.
df1 should then become:
0 1
1 2
2 NaN
3 7
4 6
5 NaN
6 5
7 6
8 NaN
9 2
I tried with simple assinment:
df1.loc[df2.index] = df2.values
or
df1.loc[df2.index] = df2
but this gives me an ValueError:
ValueError: Must have equal len keys and value when setting with an iterable
Thanks for your help!

You can do concat with combine_first:
pd.concat([df2,df3]).combine_first(df1)
Or reindex:
pd.concat([df2,df3]).reindex_like(df1)
0 1.0
1 2.0
2 NaN
3 7.0
4 6.0
5 NaN
6 5.0
7 6.0
8 NaN
9 2.0

Pandas: Generate column on groupby and value_counts

Goal is to generate a column pct group by id where 'pct' = (1st value of 'pts' group by 'id' * 100) / number of same consecutive 'id' value where 'x' and 'y' both are 'NaN'. For e.g. when id=1, pct = (5*100) / 2 = 250. It will loop through whole dataframe.
Sample df:
id pts x y
0 1 5 NaN NaN
1 1 5 1.0 NaN
2 1 5 NaN NaN
3 2 8 NaN NaN
4 2 8 2.0 1.0
5 3 7 NaN NaN
6 3 7 NaN 5.0
7 3 7 NaN NaN
8 3 7 NaN NaN
9 4 1 NaN NaN
Expected Output:
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
I tried:
df['pct'] = df.groupby('id')['pts']/df.groupby('id')['x']['y'].count(axis=1)* 100

This works:
df['pct'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum()))
Output:
>>> df
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
Explanation
>>> df[['x', 'y']]
x y
0 NaN NaN
1 1.0 NaN
2 NaN NaN
3 NaN NaN
4 2.0 1.0
5 NaN NaN
6 NaN 5.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
First, we create a mask of the selected x and y columns where each value is True if it's not NaN and False if it is NaN:
>>> df[['x', 'y']].isna()
0 True True
1 False True
2 True True
3 True True
4 False False
5 True True
6 True False
7 True True
8 True True
9 True True
Next, we count how many NaNs were in each row by summing horizontally. Since True is interepreted as 1 and False as 0, this will work:
>>> df[['x', 'y']].isna().sum(axis=1)
0 2
1 1
2 2
3 2
4 0
5 2
6 1
7 2
8 2
9 2
Then, we count how many rows had 2 NaN values (2 because x and y are 2 columns):
>>> df[['x', 'y']].isna().sum(axis=1).eq(2)
0 True
1 False
2 True
3 True
4 False
5 True
6 False
7 True
8 True
9 True
Finally, we count how many True values there were (a True value means that row contained only NaNs), by summing the True values again:
>>> df[['x', 'y']].isna().sum(axis=1).eq(2).sum()
7
Of course, we do this in a .groupby(...).apply(...) call, so this code gets executed for each group of id, not across the whole dataframe like this explanation has done. But the concepts are identical:
>>> df.groupby('id').apply(lambda x: x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 2
2 1
3 3
4 1
dtype: int64
So for id = 1, 2 rows have x and y NaN. For id = 2, 1 row has x and y NaN. And so on...
The other (first) part of the code in the groupby call:
x['pts'].iloc[0] * 100
All it does is, for each group, it selects the 0th (first) value, and multiplies it by 100:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100)
id
1 500
2 800
3 700
4 100
dtype: int64
Combined with the other code just explained:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 250
2 800
3 233
4 100
dtype: int64
Finally, we map the values in id to the values we've just computed (notice in the above that the numbers are indexes by the values of id):
>>> df['id']
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 3
9 4
Name: id, dtype: int64
>>> computed = df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
>>> computed
id
1 250
2 800
3 233
4 100
dtype: int64
>>> df['id'].map(computed)
0 250
1 250
2 250
3 800
4 800
5 233
6 233
7 233
8 233
9 100
Name: id, dtype: int64

Pandas Groupby, does not take selection into account

I am struggling for a while now with the Pandas grouping function.
Explaination of data:
In the code below I load a CSV file into a DataFrame. I had to preprocess the data to structure in a way that each column is a variable, before a column was a variable of a certain wave. The result DataFrame has over 120 variables and each variable has it's own column. Row wise is an observation. Each participant has between the 1 and 13 observations, uniquely identified by a ResponseID and wave number.
Goal:
In this experiment I have two scenario's, one for only the Dutch participants and one scenario for all participants from all over the world. I want to know the mean of the observations per participant for each variable, for both scenario's.
Problem:
Whenever I run this code, the shape of the group per ID is the same for both the Dutch Data scenario as for the World data scenario. While when i deterimine the amount of participants with the .unique() function of Pandas, it does exactly what i expect, namely a small number for the amount of Dutch participants and a large number for the world Data participants.
Question:
Can anyone help me solve this problem?
it states that I have a equal amount of participant for both the Dutch Data scenario as for the World data scenario.
Preprocessing code
import sklearn as sk
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import KBinsDiscretizer
# CSV Settings
seperator = ","
decimal='.'
# Input files
data_raw_file = {"path": "./data.csv", "sheet": None}
data_question_file = {"path": "./Variable Justification.xlsx", "sheet": "Full sheet"}
# View x lines of a table
view_lines = 10
pd.set_option("display.max_rows", view_lines, "display.max_columns", None)
# Load data
data = pd.read_csv(data_raw_file["path"], sep=seperator, decimal=decimal)
# Use
identifier_vars = ["ResponseId", "id"]
demographic_vars = ["age", "coded_country", "edu", "gender"]
outcome_vars = ["affAnx", "affBor", "affCalm", "affDepr", "affEnerg", "affNerv", "affExh", "affInsp", "affRel", "affAng", "affLov"]
# Types (Other variables are considered numerical)
discrete_vars = ["employstatus", "isoObjWho", "coronaClose", "houseLeaveWhy"]
orderd_categorical_vars = ["age", "edu"]
categorical_vars = ["coded_country", "gender", "ResponseId", "id"]
# Wave information
waves = pd.DataFrame(data = [["base" , "" , "Ongoing"],
["wave01", "w1_", "1-3-2020"],
["wave02", "w2_", "2-4-2020"],
["wave03", "w3_", "3-5-2020"],
["wave04", "w4_", "4-6-2020"],
columns =["Waves" , "Wave_ref", "Date"])
# Extract the unique variable names, by making use of the general structure wx_varName, where x is the wave number.
# ================================================================
variable_names = data.keys().str.replace(r'(w\d*\_)', "").unique().to_frame(index=False, name="name")
# Define the different types of variable and their use
# ================================================================
variable_names["use"] = "IV"
variable_names.loc[variable_names["name"].isin(demographic_vars), "use"] = "Demographic"
variable_names.loc[variable_names["name"].isin(identifier_vars), "use"] = "Identifier"
variable_names.loc[variable_names["name"].isin(outcome_vars), "use"] = "DV"
variable_names["type"] = "Continuous"
variable_names.loc[variable_names["name"].isin(categorical_vars), "type"] = "Categorical"
variable_names.loc[variable_names["name"].isin(orderd_categorical_vars), "type"] = "Ordered_Categorical"
for var in discrete_vars:
variable_names.loc[variable_names["name"].str.match('^' + var + '.*'), "type"] = "Discrete"
# Wave in to dataFrame
# ==============================================================
df_waves = pd.DataFrame(columns=variable_names["name"])
for idx, w_ref in enumerate(waves["Wave_ref"]):
# Add Wx_ to the variable names
temp_var = [w_ref + s for s in variable_names[variable_names["type"].isin(["Continuous", "Discrete"])]["name"]] + demographic_vars + identifier_vars
temp_df = data[data.columns.intersection(temp_var)].copy()
temp_df.columns = [s.replace(str(w_ref), "") for s in temp_df.columns] # Remove wave number from column
temp_df["wave"] = waves[waves["Wave_ref"] == w_ref]["Waves"].values[0]
temp_df["wave_date"] = waves[waves["Wave_ref"] == w_ref]["Date"].values[0]
df_waves = df_waves.append(temp_df, ignore_index=True)
data = df_waves.copy()
del(df_waves, temp_df, temp_var, idx, w_ref)
# Define data types
# =================================================================
discrete_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Discrete")]["name"]
data[discrete_vars] = data[discrete_vars].replace(1, True)
data[discrete_vars] = data[discrete_vars].fillna(value=False)
data[discrete_vars] = data[discrete_vars].astype(bool)
continuous_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Continuous")]["name"]
data[continuous_vars] = data[continuous_vars].astype(float)
o_categorical_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Ordered_Categorical")]["name"]
data[o_categorical_vars] = data[o_categorical_vars].astype(float)
continuous_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Categorical")]["name"]
data[continuous_vars] = data[continuous_vars].astype("category")
Part of code, where I think/thought the problem is (comes after preprocessing)
p1_data = data.copy()
p1_scenarios = {"Dutch_Data" : p1_data[p1_data["coded_country"]=="Netherlands"],
"World_Data": p1_data
}
for i, scenario in enumerate(p1_scenarios):
p1_data_scene = p1_scenarios[scenario]
participants = 1_data_scene["ResponseId"].unique()
mean_per_id = p1_data_scene[outcome_vars+["ResponseId"]].groupby(by="ResponseId", dropna=False).mean()
print(scenario)
print(p1_data_scene.shape)
print("Amount of participants " + str(len(participants)))
print("Shape of group per ID" + str(mean_per_id.shape))
Example of the Data after preprocessing:
observation
affAnx
affBor
affCalm
affDepr
affEnerg
affNerv
affExh
affInsp
affRel
affAng
affLov
PLRAC19
PLRAEco
disc01
disc02
disc03
jbInsec01
jbInsec02
jbInsec03
jbInsec04
employstatus_1
employstatus_2
employstatus_3
employstatus_14
employstatus_4
employstatus_5
employstatus_11
employstatus_12
employstatus_6
employstatus_7
employstatus_8
employstatus_9
employstatus_10
employstatus_13
hours_worked_1
PFS01
fail01
isoFriends_inPerson
isoOthPpl_inPerson
isoFriends_online
isoOthPpl_online
isoObj
isoObjWho_1
isoObjWho_2
isoObjWho_3
isoObjWho_4
isoObjWho_5
isoObjWho_6
houseTrad
discPers
lone01
mentHealth
mentPhys
happy
lifeSat
MLQ
JWB_1
tightNorms
tightLoose
tightTreat
probSolving01
probSolving02
probSolving03
posrefocus01
posrefocus02
posrefocus03
C19Know
c19Hope
c19Eff
c19ProSo01
c19ProSo03
c19perBeh01
c19perBeh02
c19perBeh03
c19RCA01
c19RCA02
coronaClose_1
coronaClose_2
coronaClose_3
coronaClose_4
coronaClose_5
coronaClose_6
ecoHope
ecoProSo01
ecoProSo03
ecoRCA02
ecoRCA03
houseLeave
houseLeaveWhy_1
houseLeaveWhy_2
houseLeaveWhy_8
houseLeaveWhy_4
houseLeaveWhy_7
houseLeaveWhy_6
houseActWant
houseActHave
bor02
tempFocPast
tempFocPres
tempFocFut
neuro01
neuro02
neuro03
para01
para02
para03
consp01
consp02
relYesNo
godyesno
godOpinCtrl
godOpinInfl
godPunish
godForgive
trustGovCtry
trustGovState
ctrGJob
solMyCtr
solOthCtr
depressed
gender
age
edu
coded_country
ResponseId
wave
wave_date
1
4
2
2
2
3
4
4
3
1
nan
nan
2
1
1
0
1
nan
0
0
0
False
False
False
False
False
False
False
False
False
False
False
True
True
False
nan
1
-1
1
0
7
7
nan
False
False
False
False
False
False
nan
1
2
nan
nan
4
4
0
nan
9
7
7
4
4
3
3
3
1
3
2
2
3
0
0
3
2
1
3
False
False
False
False
False
True
2
0
1
0
0
1
False
False
False
False
False
False
nan
nan
0
2
2
2
1
0
2
10
5
2
4
1
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
1
1
5
Netherlands
d455fa2ff
base
Ongoing
2
3
4
1
nan
nan
nan
nan
nan
nan
nan
nan
1
7
1
1
-2
-2
1
1
0
False
False
False
False
False
False
False
False
False
False
False
True
False
False
nan
2
0
7
5
0
3
nan
False
False
False
False
False
False
nan
1
3
nan
nan
5
4
3
nan
3
9
2
3
2
4
4
4
2
2
2
1
1
1
3
3
2
1
3
False
False
False
False
False
True
0
1
0
2
0
2
False
False
False
True
False
False
nan
nan
-3
3
3
2
3
3
1
10
10
2
2
1
1
1
3
3
nan
nan
nan
nan
nan
nan
nan
nan
1
1
5
Netherlands
d455fa2ff
wave1
12-3-2021
3
4
4
3
4
1
3
4
2
1
nan
nan
4
5
1
-1
-1
1
-1
1
1
False
False
False
False
False
False
False
False
True
False
False
False
False
False
nan
2
1
7
7
4
3
nan
False
False
False
False
False
False
nan
1
2
nan
nan
6
5
2
nan
3
3
3
3
4
5
3
3
1
3
1
-3
-1
-3
0
3
0
1
1
False
False
False
False
False
True
0
0
0
0
3
1
False
False
False
False
False
False
nan
nan
-3
3
0
1
-3
1
-2
4
5
1
2
3
1
1
3
3
3
6
nan
nan
nan
nan
nan
nan
1
2
4
nan
028a6e28
base
Ongoing
4
4
5
3
4
3
5
2
2
3
nan
nan
4
4
1
1
1
-2
2
-2
-2
True
False
False
False
False
False
False
False
False
False
False
False
False
False
nan
0
1
0
0
6
6
nan
False
False
False
False
False
False
nan
-1
5
nan
nan
5
3
0
nan
6
1
6
3
3
4
3
3
3
4
2
3
-3
1
3
3
3
3
3
False
False
False
False
True
False
1
0
2
2
2
3
False
True
False
False
False
False
nan
nan
2
3
2
3
2
2
2
0
6
2
10
10
1
1
3
3
3
6
5
5
nan
nan
nan
nan
1
5
5
Saudi Arabia
4212d3a1
base
Ongoing
5
2
5
4
3
5
2
2
4
4
nan
nan
3
5
2
1
0
nan
-2
-2
nan
False
False
False
False
False
False
False
False
False
True
False
False
False
False
nan
1
0
0
0
7
3
nan
False
False
False
False
False
False
nan
1
4
nan
nan
4
4
1
nan
8
5
9
5
5
3
3
3
2
5
3
3
0
2
3
2
3
3
3
False
False
False
False
False
True
3
3
3
2
-2
1
False
False
False
False
False
False
nan
nan
3
1
2
-2
1
1
1
10
3
0
0
0
1
1
3
3
6
6
nan
nan
nan
nan
nan
nan
2
6
6
Saudi Arabia
26dc23cf
base
Ongoing
6
2
3
1
4
2
2
4
1
1
nan
nan
5
6
1
2
-2
1
-1
1
nan
False
False
False
False
False
False
False
False
True
False
False
False
False
False
nan
1
2
0
3
2
0
nan
False
False
False
False
False
False
nan
1
4
nan
nan
2
3
-1
nan
6
1
4
2
1
1
1
2
1
3
-2
-2
1
0
1
1
2
0
3
False
False
False
False
True
False
-3
1
1
2
2
2
False
False
False
False
False
True
nan
nan
-1
-3
2
2
1
1
-1
10
10
10
7
10
0
1
2
2
1
6
2
2
nan
nan
nan
nan
1
2
6
Egypt
bed32257
base
Ongoing
7
3
3
2
1
2
4
2
1
1
nan
nan
4
2
2
-1
-1
nan
nan
nan
nan
False
False
False
False
True
False
False
False
False
False
False
True
False
False
nan
2
1
0
0
2
0
nan
False
False
False
False
False
False
nan
-1
5
nan
nan
1
1
1
nan
9
9
9
5
5
5
1
1
1
5
-2
-1
3
3
3
3
2
3
3
False
False
False
True
False
False
-2
-2
-1
2
0
2
False
False
False
False
False
True
nan
nan
2
-1
1
2
1
2
0
10
8
9
7
10
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2
1
5
Morocco
4sc2f1ae
base
Ongoing
8
nan
nan
nan
3
3
nan
nan
nan
nan
nan
nan
nan
4
nan
nan
nan
nan
nan
nan
nan
False
False
False
False
False
False
False
False
False
False
False
False
False
False
nan
nan
1
nan
nan
nan
nan
nan
False
False
False
False
False
False
nan
1
1
nan
nan
10
6
2
nan
nan
nan
nan
3
3
3
5
5
5
4
3
2
1
1
2
2
2
2
2
False
False
False
False
False
True
nan
1
1
nan
1
1
False
False
False
False
False
False
nan
nan
1
1
2
2
1
0
2
9
0
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
1
nan
nan
Saudi Arabia
3a86dadc
base
Ongoing
9
3
2
1
3
2
2
1
1
3
nan
nan
4
3
1
0
-1
nan
nan
nan
nan
False
False
False
False
False
True
False
False
False
False
False
False
False
False
nan
-1
-1
1
0
5
0
nan
False
False
False
False
False
False
nan
1
1
nan
nan
7
5
0
nan
7
8
7
1
1
2
2
2
2
3
2
1
1
2
3
3
3
2
3
False
True
False
False
False
False
1
1
1
2
2
1
False
False
False
False
False
False
nan
nan
1
1
2
-2
1
2
1
3
4
5
3
5
1
1
3
2
4
6
nan
nan
nan
nan
nan
nan
1
2
5
Netherlands
5d181ac9
base
Ongoing
10
3
2
4
1
2
2
3
3
2
nan
nan
3
3
1
1
0
-2
2
-1
-2
False
False
False
False
False
False
False
False
False
True
False
False
False
False
nan
0
-1
2
5
7
7
nan
False
False
False
False
False
False
nan
-1
1
nan
nan
8
5
3
nan
7
6
5
4
4
4
2
2
3
4
3
2
-1
0
3
3
3
1
3
False
False
False
False
False
True
1
2
1
2
1
3
False
True
False
False
False
True
nan
nan
-2
-2
2
-1
-2
1
1
6
2
1
3
3
1
1
3
nan
2
6
nan
nan
nan
nan
nan
nan
1
1
5
Netherlands
d455fa2ff
wave2
16-3-2021

After copying your example data into a CSV file, I ran the code below. Note I've fixed typos and indentation, made sure the NAN values are recognized as such, and replaced outcome_vars (which wasn't defined in the code you posted) with two sample columns.
import numpy as np
import pandas as pd
data = pd.read_csv("data_path.csv", delim_whitespace=True)
data.replace('nan', np.nan, inplace=True)
p1_data = data.copy()
p1_scenarios = {"Dutch_Data" : p1_data[p1_data["coded_country"]=="Netherlands"],
"World_Data": p1_data}
for i, scenario in enumerate(p1_scenarios):
p1_data_scene = p1_scenarios[scenario]
participants = p1_data_scene["ResponseId"].unique()
mean_per_id = p1_data_scene[['affAnx', 'affBor', "ResponseId"]].groupby(
by="ResponseId", dropna=False).mean()
print(scenario)
print(p1_data_scene.shape)
print("Amount of participants " + str(len(participants)))
print("Shape of group per ID" + str(mean_per_id.shape))
The output is as expected in view of the fact that three of the Dutch response ID codes are identical, and the rest are unique:
Dutch_Data
(4, 128)
Amount of participants 2
Shape of group per ID(2, 2)
World_Data
(10, 128)
Amount of participants 8
Shape of group per ID(8, 2)
So for this dataset, the problem you describe is not caused by the code you posted, unless the choice of outcome_vars somehow affects the groupby operation, which it shouldn't.
Can you come up with a minimal data sample that does reproduce the problem?

How to delete cells and cut rows find by " ̶i̶s̶i̶n̶(̶)̶ " df.mask()?

I have dataframe with random cells, for example "boss".
How can I delete the cells "boss" and all right cells in the same row using df.isin()?
x=[]
for i in range (5):
x.append("boss")
df=pd.DataFrame(np.diagflat(x) )
0 1 2 3 4
0 boss
1 boss
2 boss
3 boss
4 boss
cut rows using "i̶s̶i̶n̶(̶)̶ " df.mask to
mask=(df.eq('boss').cumsum(axis=1).ne(0))
df.mask(mask,"Nan", inplace =True)
df
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN
4 NaN

Use DataFrame.mask with mask:
df = df.mask(df.eq('boss').cumsum(axis=1).ne(0))
print (df)
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN
4 NaN
Details:
Compare value boss:
print (df.eq('boss'))
0 1 2 3 4
0 True False False False False
1 False True False False False
2 False False True False False
3 False False False True False
4 False False False False True
Use cumulative sum per rows, so after first match get 1, after second 2..:
print (df.eq('boss').cumsum(axis=1))
0 1 2 3 4
0 1 1 1 1 1
1 0 1 1 1 1
2 0 0 1 1 1
3 0 0 0 1 1
4 0 0 0 0 1
Compare if not equal 0 and pass to mask:
print (df.eq('boss').cumsum(axis=1).ne(0))
0 1 2 3 4
0 True True True True True
1 False True True True True
2 False False True True True
3 False False False True True
4 False False False False True

Count distinct strings in rolling window include NaN using pandas

I would like to use rolling count with maximum value is 36 which need to include NaN value such as start with 0 if its NaN. I have dataframe that look like this:
Input:
val
NaN
1
1
NaN
2
1
3
NaN
5
Code:
b = a.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
It gives me:
Val count
NaN 1
1 2
1 2
NaN 3
2 4
1 4
3 5
NaN 6
5 7
Expected Output:
Val count
NaN 0
1 1
1 1
NaN 1
2 2
1 2
3 3
NaN 3
5 4

You can just filter out nan
df.val.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x[~np.isnan(x)]))).fillna(0)
Out[35]:
0 0.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 3.0
7 3.0
8 4.0
Name: val, dtype: float64
The reason why
np.unique([np.nan]*2)
Out[38]: array([nan, nan])
np.nan==np.nan
Out[39]: False

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to compare 2 dataframe without comparing null values in python? [duplicate] - python

Related

Pandas assign series to another Series based on index

Pandas: Generate column on groupby and value_counts

Pandas Groupby, does not take selection into account

How to delete cells and cut rows find by " ̶i̶s̶i̶n̶(̶)̶ " df.mask()?

Count distinct strings in rolling window include NaN using pandas

Categories

Resources