Pandas groupby does not take selection into account - python

I have been struggling for a while now with the Pandas grouping function.
Explanation of the data:
In the code below I load a CSV file into a DataFrame. I had to preprocess the data to structure it so that each column is a variable; previously each column was a variable for a specific wave. The resulting DataFrame has over 120 variables and each variable has its own column. Each row is an observation. Each participant has between 1 and 13 observations, uniquely identified by a ResponseId and wave number.
Goal:
In this experiment I have two scenarios: one for only the Dutch participants and one for participants from all over the world. I want to know the mean of the observations per participant for each variable, for both scenarios.
Problem:
Whenever I run this code, the shape of the group per ID is the same for the Dutch Data scenario and the World Data scenario. However, when I determine the number of participants with the Pandas .unique() function, it does exactly what I expect: a small number of Dutch participants and a large number of World Data participants.
Question:
Can anyone help me solve this problem? The output states that I have an equal number of participants for the Dutch Data scenario and the World Data scenario.
Preprocessing code
import sklearn as sk
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import KBinsDiscretizer
# CSV Settings
separator = ","
decimal='.'
# Input files
data_raw_file = {"path": "./data.csv", "sheet": None}
data_question_file = {"path": "./Variable Justification.xlsx", "sheet": "Full sheet"}
# View x lines of a table
view_lines = 10
pd.set_option("display.max_rows", view_lines, "display.max_columns", None)
# Load data
data = pd.read_csv(data_raw_file["path"], sep=separator, decimal=decimal)
# Use
identifier_vars = ["ResponseId", "id"]
demographic_vars = ["age", "coded_country", "edu", "gender"]
outcome_vars = ["affAnx", "affBor", "affCalm", "affDepr", "affEnerg", "affNerv", "affExh", "affInsp", "affRel", "affAng", "affLov"]
# Types (Other variables are considered numerical)
discrete_vars = ["employstatus", "isoObjWho", "coronaClose", "houseLeaveWhy"]
ordered_categorical_vars = ["age", "edu"]
categorical_vars = ["coded_country", "gender", "ResponseId", "id"]
# Wave information
waves = pd.DataFrame(data=[["base",   "",    "Ongoing"],
                           ["wave01", "w1_", "1-3-2020"],
                           ["wave02", "w2_", "2-4-2020"],
                           ["wave03", "w3_", "3-5-2020"],
                           ["wave04", "w4_", "4-6-2020"]],
                     columns=["Waves", "Wave_ref", "Date"])
# Extract the unique variable names, by making use of the general structure wx_varName, where x is the wave number.
# ================================================================
variable_names = data.keys().str.replace(r'(w\d*\_)', "", regex=True).unique().to_frame(index=False, name="name")
# Define the different types of variable and their use
# ================================================================
variable_names["use"] = "IV"
variable_names.loc[variable_names["name"].isin(demographic_vars), "use"] = "Demographic"
variable_names.loc[variable_names["name"].isin(identifier_vars), "use"] = "Identifier"
variable_names.loc[variable_names["name"].isin(outcome_vars), "use"] = "DV"
variable_names["type"] = "Continuous"
variable_names.loc[variable_names["name"].isin(categorical_vars), "type"] = "Categorical"
variable_names.loc[variable_names["name"].isin(ordered_categorical_vars), "type"] = "Ordered_Categorical"
for var in discrete_vars:
    variable_names.loc[variable_names["name"].str.match('^' + var + '.*'), "type"] = "Discrete"
# Wave in to dataFrame
# ==============================================================
df_waves = pd.DataFrame(columns=variable_names["name"])
for idx, w_ref in enumerate(waves["Wave_ref"]):
    # Add the wx_ prefix to the variable names
    temp_var = [w_ref + s for s in variable_names[variable_names["type"].isin(["Continuous", "Discrete"])]["name"]] + demographic_vars + identifier_vars
    temp_df = data[data.columns.intersection(temp_var)].copy()
    temp_df.columns = [s.replace(str(w_ref), "") for s in temp_df.columns]  # Remove the wave prefix from the column names
    temp_df["wave"] = waves[waves["Wave_ref"] == w_ref]["Waves"].values[0]
    temp_df["wave_date"] = waves[waves["Wave_ref"] == w_ref]["Date"].values[0]
    df_waves = df_waves.append(temp_df, ignore_index=True)  # DataFrame.append was removed in pandas 2.0; use pd.concat there
data = df_waves.copy()
del(df_waves, temp_df, temp_var, idx, w_ref)
# Define data types
# =================================================================
discrete_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Discrete")]["name"]
data[discrete_vars] = data[discrete_vars].replace(1, True)
data[discrete_vars] = data[discrete_vars].fillna(value=False)
data[discrete_vars] = data[discrete_vars].astype(bool)
continuous_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Continuous")]["name"]
data[continuous_vars] = data[continuous_vars].astype(float)
o_categorical_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Ordered_Categorical")]["name"]
data[o_categorical_vars] = data[o_categorical_vars].astype(float)
cat_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Categorical")]["name"]
data[cat_vars] = data[cat_vars].astype("category")
Part of the code where I think/thought the problem is (runs after preprocessing):
p1_data = data.copy()
p1_scenarios = {"Dutch_Data": p1_data[p1_data["coded_country"] == "Netherlands"],
                "World_Data": p1_data}
for i, scenario in enumerate(p1_scenarios):
    p1_data_scene = p1_scenarios[scenario]
    participants = p1_data_scene["ResponseId"].unique()
    mean_per_id = p1_data_scene[outcome_vars + ["ResponseId"]].groupby(by="ResponseId", dropna=False).mean()
    print(scenario)
    print(p1_data_scene.shape)
    print("Amount of participants " + str(len(participants)))
    print("Shape of group per ID" + str(mean_per_id.shape))
Example of the data after preprocessing (a wide table with 128 columns; each observation is shown as one whitespace-separated row, so the lines below wrap):

observation affAnx affBor affCalm affDepr affEnerg affNerv affExh affInsp affRel affAng affLov PLRAC19 PLRAEco disc01 disc02 disc03 jbInsec01 jbInsec02 jbInsec03 jbInsec04 employstatus_1 employstatus_2 employstatus_3 employstatus_14 employstatus_4 employstatus_5 employstatus_11 employstatus_12 employstatus_6 employstatus_7 employstatus_8 employstatus_9 employstatus_10 employstatus_13 hours_worked_1 PFS01 fail01 isoFriends_inPerson isoOthPpl_inPerson isoFriends_online isoOthPpl_online isoObj isoObjWho_1 isoObjWho_2 isoObjWho_3 isoObjWho_4 isoObjWho_5 isoObjWho_6 houseTrad discPers lone01 mentHealth mentPhys happy lifeSat MLQ JWB_1 tightNorms tightLoose tightTreat probSolving01 probSolving02 probSolving03 posrefocus01 posrefocus02 posrefocus03 C19Know c19Hope c19Eff c19ProSo01 c19ProSo03 c19perBeh01 c19perBeh02 c19perBeh03 c19RCA01 c19RCA02 coronaClose_1 coronaClose_2 coronaClose_3 coronaClose_4 coronaClose_5 coronaClose_6 ecoHope ecoProSo01 ecoProSo03 ecoRCA02 ecoRCA03 houseLeave houseLeaveWhy_1 houseLeaveWhy_2 houseLeaveWhy_8 houseLeaveWhy_4 houseLeaveWhy_7 houseLeaveWhy_6 houseActWant houseActHave bor02 tempFocPast tempFocPres tempFocFut neuro01 neuro02 neuro03 para01 para02 para03 consp01 consp02 relYesNo godyesno godOpinCtrl godOpinInfl godPunish godForgive trustGovCtry trustGovState ctrGJob solMyCtr solOthCtr depressed gender age edu coded_country ResponseId wave wave_date
1 4 2 2 2 3 4 4 3 1 nan nan 2 1 1 0 1 nan 0 0 0 False False False False False False False False False False False True True False nan 1 -1 1 0 7 7 nan False False False False False False nan 1 2 nan nan 4 4 0 nan 9 7 7 4 4 3 3 3 1 3 2 2 3 0 0 3 2 1 3 False False False False False True 2 0 1 0 0 1 False False False False False False nan nan 0 2 2 2 1 0 2 10 5 2 4 1 0 nan nan nan nan nan nan nan nan nan nan nan 1 1 5 Netherlands d455fa2ff base Ongoing
2 3 4 1 nan nan nan nan nan nan nan nan 1 7 1 1 -2 -2 1 1 0 False False False False False False False False False False False True False False nan 2 0 7 5 0 3 nan False False False False False False nan 1 3 nan nan 5 4 3 nan 3 9 2 3 2 4 4 4 2 2 2 1 1 1 3 3 2 1 3 False False False False False True 0 1 0 2 0 2 False False False True False False nan nan -3 3 3 2 3 3 1 10 10 2 2 1 1 1 3 3 nan nan nan nan nan nan nan nan 1 1 5 Netherlands d455fa2ff wave1 12-3-2021
3 4 4 3 4 1 3 4 2 1 nan nan 4 5 1 -1 -1 1 -1 1 1 False False False False False False False False True False False False False False nan 2 1 7 7 4 3 nan False False False False False False nan 1 2 nan nan 6 5 2 nan 3 3 3 3 4 5 3 3 1 3 1 -3 -1 -3 0 3 0 1 1 False False False False False True 0 0 0 0 3 1 False False False False False False nan nan -3 3 0 1 -3 1 -2 4 5 1 2 3 1 1 3 3 3 6 nan nan nan nan nan nan 1 2 4 nan 028a6e28 base Ongoing
4 4 5 3 4 3 5 2 2 3 nan nan 4 4 1 1 1 -2 2 -2 -2 True False False False False False False False False False False False False False nan 0 1 0 0 6 6 nan False False False False False False nan -1 5 nan nan 5 3 0 nan 6 1 6 3 3 4 3 3 3 4 2 3 -3 1 3 3 3 3 3 False False False False True False 1 0 2 2 2 3 False True False False False False nan nan 2 3 2 3 2 2 2 0 6 2 10 10 1 1 3 3 3 6 5 5 nan nan nan nan 1 5 5 Saudi Arabia 4212d3a1 base Ongoing
5 2 5 4 3 5 2 2 4 4 nan nan 3 5 2 1 0 nan -2 -2 nan False False False False False False False False False True False False False False nan 1 0 0 0 7 3 nan False False False False False False nan 1 4 nan nan 4 4 1 nan 8 5 9 5 5 3 3 3 2 5 3 3 0 2 3 2 3 3 3 False False False False False True 3 3 3 2 -2 1 False False False False False False nan nan 3 1 2 -2 1 1 1 10 3 0 0 0 1 1 3 3 6 6 nan nan nan nan nan nan 2 6 6 Saudi Arabia 26dc23cf base Ongoing
6 2 3 1 4 2 2 4 1 1 nan nan 5 6 1 2 -2 1 -1 1 nan False False False False False False False False True False False False False False nan 1 2 0 3 2 0 nan False False False False False False nan 1 4 nan nan 2 3 -1 nan 6 1 4 2 1 1 1 2 1 3 -2 -2 1 0 1 1 2 0 3 False False False False True False -3 1 1 2 2 2 False False False False False True nan nan -1 -3 2 2 1 1 -1 10 10 10 7 10 0 1 2 2 1 6 2 2 nan nan nan nan 1 2 6 Egypt bed32257 base Ongoing
7 3 3 2 1 2 4 2 1 1 nan nan 4 2 2 -1 -1 nan nan nan nan False False False False True False False False False False False True False False nan 2 1 0 0 2 0 nan False False False False False False nan -1 5 nan nan 1 1 1 nan 9 9 9 5 5 5 1 1 1 5 -2 -1 3 3 3 3 2 3 3 False False False True False False -2 -2 -1 2 0 2 False False False False False True nan nan 2 -1 1 2 1 2 0 10 8 9 7 10 0 nan nan nan nan nan nan nan nan nan nan nan 2 1 5 Morocco 4sc2f1ae base Ongoing
8 nan nan nan 3 3 nan nan nan nan nan nan nan 4 nan nan nan nan nan nan nan False False False False False False False False False False False False False False nan nan 1 nan nan nan nan nan False False False False False False nan 1 1 nan nan 10 6 2 nan nan nan nan 3 3 3 5 5 5 4 3 2 1 1 2 2 2 2 2 False False False False False True nan 1 1 nan 1 1 False False False False False False nan nan 1 1 2 2 1 0 2 9 0 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1 nan nan Saudi Arabia 3a86dadc base Ongoing
9 3 2 1 3 2 2 1 1 3 nan nan 4 3 1 0 -1 nan nan nan nan False False False False False True False False False False False False False False nan -1 -1 1 0 5 0 nan False False False False False False nan 1 1 nan nan 7 5 0 nan 7 8 7 1 1 2 2 2 2 3 2 1 1 2 3 3 3 2 3 False True False False False False 1 1 1 2 2 1 False False False False False False nan nan 1 1 2 -2 1 2 1 3 4 5 3 5 1 1 3 2 4 6 nan nan nan nan nan nan 1 2 5 Netherlands 5d181ac9 base Ongoing
10 3 2 4 1 2 2 3 3 2 nan nan 3 3 1 1 0 -2 2 -1 -2 False False False False False False False False False True False False False False nan 0 -1 2 5 7 7 nan False False False False False False nan -1 1 nan nan 8 5 3 nan 7 6 5 4 4 4 2 2 3 4 3 2 -1 0 3 3 3 1 3 False False False False False True 1 2 1 2 1 3 False True False False False True nan nan -2 -2 2 -1 -2 1 1 6 2 1 3 3 1 1 3 nan 2 6 nan nan nan nan nan nan 1 1 5 Netherlands d455fa2ff wave2 16-3-2021

After copying your example data into a CSV file, I ran the code below. Note that I've fixed typos and indentation, made sure the NaN values are recognized as such, and replaced outcome_vars (which wasn't defined in the code you posted) with two sample columns.
import numpy as np
import pandas as pd

data = pd.read_csv("data_path.csv", delim_whitespace=True)
data.replace('nan', np.nan, inplace=True)

p1_data = data.copy()
p1_scenarios = {"Dutch_Data": p1_data[p1_data["coded_country"] == "Netherlands"],
                "World_Data": p1_data}
for i, scenario in enumerate(p1_scenarios):
    p1_data_scene = p1_scenarios[scenario]
    participants = p1_data_scene["ResponseId"].unique()
    mean_per_id = p1_data_scene[['affAnx', 'affBor', "ResponseId"]].groupby(
        by="ResponseId", dropna=False).mean()
    print(scenario)
    print(p1_data_scene.shape)
    print("Amount of participants " + str(len(participants)))
    print("Shape of group per ID" + str(mean_per_id.shape))
The output is as expected in view of the fact that three of the Dutch response ID codes are identical, and the rest are unique:
Dutch_Data
(4, 128)
Amount of participants 2
Shape of group per ID(2, 2)
World_Data
(10, 128)
Amount of participants 8
Shape of group per ID(8, 2)
So for this dataset, the problem you describe is not caused by the code you posted, unless the choice of outcome_vars somehow affects the groupby operation, which it shouldn't.
Can you come up with a minimal data sample that does reproduce the problem?
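For example, a tiny synthetic frame along these lines (hypothetical values, not your real data) already exercises the same groupby code path and shows the group count shrinking with the selection, which is the behavior you expect:
import numpy as np
import pandas as pd

# hypothetical minimal sample: 3 participants, 5 observations
data = pd.DataFrame({
    "ResponseId": ["a", "a", "b", "c", "c"],
    "coded_country": ["Netherlands", "Netherlands", "Netherlands", "Germany", "Germany"],
    "affAnx": [1.0, 2.0, 3.0, 4.0, np.nan],
    "affBor": [2.0, 2.0, 1.0, 0.0, 1.0],
})
scenarios = {"Dutch_Data": data[data["coded_country"] == "Netherlands"],
             "World_Data": data}
for name, scene in scenarios.items():
    mean_per_id = scene[["affAnx", "affBor", "ResponseId"]].groupby(
        by="ResponseId", dropna=False).mean()
    print(name, scene.shape, mean_per_id.shape)
# Dutch_Data (3, 4) (2, 2)
# World_Data (5, 4) (3, 2)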

Related

How to delete cells and cut rows found by isin() using df.mask()?

I have a dataframe with random cells containing, for example, "boss".
How can I delete the "boss" cells and all cells to the right of them in the same row using df.isin()?
import numpy as np
import pandas as pd

x = []
for i in range(5):
    x.append("boss")
df = pd.DataFrame(np.diagflat(x))
      0     1     2     3     4
0  boss
1        boss
2              boss
3                    boss
4                          boss
Cut the rows using df.mask (instead of isin()) to get:
mask = df.eq('boss').cumsum(axis=1).ne(0)
df.mask(mask, "Nan", inplace=True)
df
     0    1    2    3    4
0  NaN  NaN  NaN  NaN  NaN
1       NaN  NaN  NaN  NaN
2            NaN  NaN  NaN
3                 NaN  NaN
4                      NaN
Use DataFrame.mask with the boolean mask:
df = df.mask(df.eq('boss').cumsum(axis=1).ne(0))
print (df)
     0    1    2    3    4
0  NaN  NaN  NaN  NaN  NaN
1       NaN  NaN  NaN  NaN
2            NaN  NaN  NaN
3                 NaN  NaN
4                      NaN
Details:
Compare value boss:
print (df.eq('boss'))
0 1 2 3 4
0 True False False False False
1 False True False False False
2 False False True False False
3 False False False True False
4 False False False False True
Take the cumulative sum along the rows, so after the first match you get 1, after the second 2, and so on:
print (df.eq('boss').cumsum(axis=1))
0 1 2 3 4
0 1 1 1 1 1
1 0 1 1 1 1
2 0 0 1 1 1
3 0 0 0 1 1
4 0 0 0 0 1
Compare with not equal to 0 and pass the result to mask:
print (df.eq('boss').cumsum(axis=1).ne(0))
0 1 2 3 4
0 True True True True True
1 False True True True True
2 False False True True True
3 False False False True True
4 False False False False True
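Putting the steps together, a self-contained version of this approach (rebuilding the example frame from the question) looks like this:
import numpy as np
import pandas as pd

# "boss" on the diagonal, empty strings elsewhere
df = pd.DataFrame(np.diagflat(["boss"] * 5))
# NaN out each "boss" cell and everything to its right in the same row
df = df.mask(df.eq('boss').cumsum(axis=1).ne(0))
print(df)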

How to count consecutive repetitions in a pandas series

Consider the following series, ser
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How can I count the values such that every run of consecutive equal values is counted and returned? Thanks.
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1
Here is another approach using fillna to handle NaN values:
s = df.id.fillna('nan')
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
counts = s.groupby(mask.cumsum()).cumcount().add(1).groupby(mask.cumsum()).max().to_numpy()
# Convert 'nan' string back to `NaN`
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64
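For completeness, here is a self-contained sketch of the same approach, rebuilding the sample series from the question and using groupby(...).size() for the run lengths instead of the cumcount chain:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1, 1, 2, 2, 2, np.nan, np.nan, 1, 1, 1, np.nan], name='id')
s_filled = s.fillna('nan')               # make NaN runs comparable
mask = s_filled.ne(s_filled.shift())     # True at the start of each run
ids = s_filled[mask].to_numpy()
counts = s_filled.groupby(mask.cumsum()).size().to_numpy()  # length of each run
ids[ids == 'nan'] = np.nan               # convert the 'nan' strings back to NaN
print(pd.Series(counts, index=ids, name='counts'))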
The cumsum trick is useful here. It's a little tricky with the NaNs, though, so I think you need to handle these separately:
In [11]: df.id.isnull() & df.id.shift(-1).isnull()
Out[11]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
12 True
Name: id, dtype: bool
In [12]: df.id.eq(df.id.shift(-1))
Out[12]:
0 False
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 False
9 True
10 True
11 False
12 False
Name: id, dtype: bool
In [13]: (df.id.isnull() & df.id.shift(-1).isnull()) | (df.id.eq(df.id.shift(-1)))
Out[13]:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 True
8 False
9 True
10 True
11 False
12 True
Name: id, dtype: bool
In [14]: ((df.id.isnull() & df.id.shift(-1).isnull()) | (df.id.eq(df.id.shift(-1)))).cumsum()
Out[14]:
0 1
1 1
2 2
3 2
4 3
5 4
6 4
7 5
8 5
9 6
10 7
11 7
12 8
Name: id, dtype: int64
Now you can use this labeling in your groupby:
In [15]: g = df.groupby(((df.id.isnull() & df.id.shift(-1).isnull()) | (df.id.eq(df.id.shift(-1)))).cumsum())
In [16]: pd.DataFrame({"count": g.id.size(), "id": g.id.nth(0)})
Out[16]:
count id
id
1 2 NaN
2 2 1.0
3 1 2.0
4 2 2.0
5 2 NaN
6 1 1.0
7 2 1.0
8 1 NaN
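A compact, self-contained variant of this labeling idea, comparing each element with the previous row instead, reproduces the run lengths asked for in the question (a sketch, assuming the same series):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [np.nan, np.nan, 1, 1, 2, 2, 2, np.nan, np.nan, 1, 1, 1, np.nan]})
# a new run starts wherever the value differs from the previous row
# (two consecutive NaNs count as the same run)
same_as_prev = (df.id.isnull() & df.id.shift().isnull()) | df.id.eq(df.id.shift())
g = df.id.groupby((~same_as_prev).cumsum())
print(pd.DataFrame({"count": g.size(), "id": g.first()}))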

Iterating over pandas rows

Having a df:
cell;value
0;8
1;2
2;1
3;6
4;4
5;6
6;7
And I'm trying to define a function that checks the cell value of the row after the observed one. If the value of the cell after the observed one (i+1) is bigger than the observed one (i), then the value in a new column maxValue should be 0; if smaller, 1.
The final df should look like:
cell;value;maxValue
0;8;1
1;2;1
2;1;0
3;6;1
4;4;0
5;6;0
6;7;0
My solution that does not work yet is:
def MaxFind(df, a, col='value'):
if df.iloc[a+1][col] > df.iloc[a][col]:
return 0
df['maxValue'] = df.apply(lambda row: MaxFind(df, row.value), axis=1)
I believe you need shift, comparing by gt, inverting the mask and casting to integers:
df['maxValue'] = (~df['value'].shift().gt(df['value'])).astype(int)
#another solution
#df['maxValue'] = df['value'].shift().le(df['value']).astype(int)
print (df)
cell value maxValue
0 0 8 1
1 1 2 0
2 2 1 0
3 3 6 1
4 4 4 0
5 5 6 1
6 6 7 1
Details:
df['shifted'] = df['value'].shift()
df['mask'] = (df['value'].shift().gt(df['value']))
df['inverted_mask'] = (~df['value'].shift().gt(df['value']))
df['maxValue'] = (~df['value'].shift().gt(df['value'])).astype(int)
print (df)
cell value shifted mask inverted_mask maxValue
0 0 8 NaN False True 1
1 1 2 8.0 True False 0
2 2 1 2.0 True False 0
3 3 6 1.0 False True 1
4 4 4 6.0 True False 0
5 5 6 4.0 False True 1
6 6 7 6.0 False True 1
EDIT:
df['maxValue'] = df['value'].shift(1).le(df['value'].shift(-1)).astype(int)
print (df)
cell value maxValue
0 0 8 0
1 1 2 0
2 2 1 1
3 3 6 1
4 4 4 1
5 5 6 1
6 6 7 0
df['shift_1'] = df['value'].shift(1)
df['shift_-1'] = df['value'].shift(-1)
df['mask'] = df['value'].shift(1).le(df['value'].shift(-1))
df['maxValue'] = df['value'].shift(1).le(df['value'].shift(-1)).astype(int)
print (df)
cell value shift_1 shift_-1 mask maxValue
0 0 8 NaN 2.0 False 0
1 1 2 8.0 1.0 False 0
2 2 1 2.0 6.0 True 1
3 3 6 1.0 4.0 True 1
4 4 4 6.0 6.0 True 1
5 5 6 4.0 7.0 True 1
6 6 7 6.0 NaN False 0
If you shift values, you get missing values for the first or last rows. If necessary, you can replace them with the first non-NaN or last non-NaN values using back or forward filling:
df['shift_1'] = df['value'].shift(2)
df['shift_-1'] = df['value'].shift(-2)
df['mask'] = df['value'].shift(2).le(df['value'].shift(-2))
df['maxValue'] = df['value'].shift(2).le(df['value'].shift(-2)).astype(int)
print (df)
cell value shift_1 shift_-1 mask maxValue
0 0 8 NaN 1.0 False 0
1 1 2 NaN 6.0 False 0
2 2 1 8.0 4.0 False 0
3 3 6 2.0 6.0 True 1
4 4 4 1.0 7.0 True 1
5 5 6 6.0 NaN False 0
6 6 7 4.0 NaN False 0
df['shift_1'] = df['value'].shift(2).bfill()
df['shift_-1'] = df['value'].shift(-2).ffill()
df['mask'] = df['value'].shift(2).bfill().le(df['value'].shift(-2).ffill())
df['maxValue'] = df['value'].shift(2).bfill().le(df['value'].shift(-2).ffill()).astype(int)
print (df)
cell value shift_1 shift_-1 mask maxValue
0 0 8 8.0 1.0 False 0
1 1 2 8.0 6.0 False 0
2 2 1 8.0 4.0 False 0
3 3 6 2.0 6.0 True 1
4 4 4 1.0 7.0 True 1
5 5 6 6.0 7.0 True 1
6 6 7 4.0 7.0 True 1
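If the intent is indeed to compare each value with the next row's value (next value bigger gives 0, smaller gives 1, and the last row, which has no next value, gives 0), a minimal self-contained sketch that reproduces the asker's expected maxValue column is:
import pandas as pd

df = pd.DataFrame({'cell': range(7), 'value': [8, 2, 1, 6, 4, 6, 7]})
# 1 where the next row's value is smaller or equal, 0 where it is bigger;
# the trailing NaN from shift(-1) compares as False, giving 0 in the last row
df['maxValue'] = df['value'].shift(-1).le(df['value']).astype(int)
print(df)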

How to compare 2 dataframe without comparing null values in python? [duplicate]

This question already has answers here:
Keeping NaNs with pandas dataframe inequalities
(4 answers)
Closed 4 years ago.
I have 2 dataframes:
df1:
a b c
0 1 2 6
1 2 3 7
2 3 4 8
3 4 5 9
4 5 6 10
df2:
a b c
0 NaN NaN NaN
1 1.0 NaN NaN
2 2.0 2.0 NaN
3 4.0 3.0 NaN
4 6.0 6.0 11.0
When I try df1 > df2, the output is:
In [150]:df1 > df2
Out[150]:
a b c
0 False False False
1 True False False
2 True True False
3 False True False
4 False False False
But what I expect is like this:
a b c
0 NaN NaN NaN
1 True NaN NaN
2 True True NaN
3 False True NaN
4 False False False
So how should I compare the two dataframes while keeping the nulls as nulls?
Let's try:
df1.gt(df2).astype(str).mask(df2.isnull())
Output:
a b c
0 NaN NaN NaN
1 True NaN NaN
2 True True NaN
3 False True NaN
4 False False False
You could also try the following, but because pandas casts any series containing a null to float dtype, you will get:
df1.gt(df2).mask(df2.isnull())
Output:
a b c
0 NaN NaN NaN
1 1.0 NaN NaN
2 1.0 1.0 NaN
3 0.0 1.0 NaN
4 0.0 0.0 0.0
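If you want to keep readable booleans while still blanking the nulls, one option is to cast to object dtype before masking, and to mask nulls from either frame (a sketch that rebuilds the two frames from the question):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 3, 4, 5, 6],
                    'c': [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({'a': [np.nan, 1.0, 2.0, 4.0, 6.0],
                    'b': [np.nan, np.nan, 2.0, 3.0, 6.0],
                    'c': [np.nan, np.nan, np.nan, np.nan, 11.0]})
# object dtype can hold True/False next to NaN without casting to float
out = df1.gt(df2).astype(object).mask(df1.isnull() | df2.isnull())
print(out)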

Count rows that match string and numeric with pandas

I have the numbers 1-12 in the SAMPLE column, and for each number I try to count the mutation numbers (A:T, C:G, etc.). This code works, but how can I modify it so that it gives me all 12 conditions for each mutation, instead of writing the same code 12 times and repeating it for each mutation?
In this example, AT gives me the count where SAMPLE = 1. I am trying to get the number of A:T mutations for each sample number (1, 2, ..., 12). How can I modify the code for that? I'll appreciate any help. Thank you.
SAMPLE MUT
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
... ... ...
d = df[["SAMPLE", "MUT"]]
chars1 = "TGC-"
number = {}
for item in chars1:
    dm = d[(d["MUT"].str.contains("A:" + item)) & (d["SAMPLE"].isin([1]))]
    num1 = dm.count()
    number[item] = num1
AT = number["T"]
AG = number["G"]
AC = number["C"]
A_ = number["-"]
I would use the native string extraction methods in pandas
df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')
Which returns the matches of the different groups:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN G NaN NaN
5 NaN NaN NaN NaN
6 NaN G NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
Then I would convert this to True or False using pd.isnull and invert it with ~, thereby getting True where there is a match and False where there is not.
~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
0 1 2 3
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False True False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
Then assign this to the dataframe
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
SAMPLE MUT T G C -
0 11 chr1:100154376:G:A False False False False
1 2 chr1:100177723:C:T False False False False
2 9 chr1:100177723:C:T False False False False
3 1 chr1:100194200:-:AA False False False False
4 8 chr1:10032249:A:G False True False False
5 2 chr1:100340787:G:A False False False False
6 1 chr1:100349757:A:G False True False False
7 3 chr1:10041186:C:A False False False False
8 10 chr1:100476986:G:C False False False False
9 4 chr1:100572459:C:T False False False False
10 5 chr1:100572459:C:T False False False False
Now we can simply sum the columns:
df[["T","G","C","-"]].sum()
T 0
G 2
C 0
- 0
But wait, we have not yet restricted this to rows where SAMPLE == 1.
We can do this very easily with a mask:
sample_one_mask = df.SAMPLE == 1
df[sample_one_mask][["T","G","C","-"]].sum()
T 0
G 1
C 0
- 0
If you want this to count per SAMPLE instead, you can use the groupby function:
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
T G C -
SAMPLE
1 0 1 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
TLDR;
Do this:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
You can create a column with the mutation type (A->T, G->C) with a regular expression substitution, then apply a pandas groupby to count.
import pandas as pd
import re
df = pd.read_table('df.tsv')
df['mutation_type'] = df['MUT'].apply(lambda x: re.sub(r'^.*?:([^:]+:[^:]+)$', r'\1', x))
df.groupby(['SAMPLE','mutation_type']).agg('count')['MUT']
The output is like this for your data:
SAMPLE mutation_type
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
Name: MUT, dtype: int64
I had a similar answer to A.P.
import pandas as pd
df = pd.DataFrame(data={'SAMPLE': [11,2,9,1,8,2,1,3,10,4,5], 'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T', 'chr1:100177723:C:T', 'chr1:100194200:-:AA', 'chr1:10032249:A:G', 'chr1:100340787:G:A', 'chr1:100349757:A:G', 'chr1:10041186:C:A', 'chr1:100476986:G:C', 'chr1:100572459:C:T', 'chr1:100572459:C:T']}, columns=['SAMPLE', 'MUT'])
df['Sequence'] = df['MUT'].str.replace(r'\w+:\d+:', '', regex=True)
df.groupby(['SAMPLE', 'Sequence']).count()
Produces
MUT
SAMPLE Sequence
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
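As a variation on the answers above, pd.crosstab on the extracted mutation type gives the per-sample counts as a table with zeros filled in (a sketch using the same sample data):
import pandas as pd

df = pd.DataFrame({'SAMPLE': [11, 2, 9, 1, 8, 2, 1, 3, 10, 4, 5],
                   'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T',
                           'chr1:100177723:C:T', 'chr1:100194200:-:AA',
                           'chr1:10032249:A:G', 'chr1:100340787:G:A',
                           'chr1:100349757:A:G', 'chr1:10041186:C:A',
                           'chr1:100476986:G:C', 'chr1:100572459:C:T',
                           'chr1:100572459:C:T']})
# grab the trailing 'REF:ALT' part of each 'chr:pos:REF:ALT' string
mut_type = df['MUT'].str.extract(r'([^:]+:[^:]+)$', expand=False)
print(pd.crosstab(df['SAMPLE'], mut_type))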
