Count rows that match string and numeric with pandas

Count rows that match string and numeric with pandas - python

I have 1-12 numbers in SAMPLE column and for each number I try to count mutation numbers(A:T, C:G etc..). This code works but how can I modify this code to gives me all 12 condition for each mutation, instead of writing the same code 12 times and also for each mutation?
In this example; AT gives me the number while SAMPLE=1. I am trying to get number of AT for each sample number(1,2,..12). So how can modify this code for that? I'll appreciate for any help. Thank you.
SAMPLE MUT
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
... ... ...
d= df["SAMPLE", "MUT" ]
chars1 = "TGC-"
number = {}
for item in chars1:
dm= d[(d["MUT"].str.contains("A:" + item)) & (d["SAMPLE"].isin([1]))]
num1 = dm.count()
number[item] = num1
AT=number["T"]
AG=number["G"]
AC=number["C"]
A_=number["-"]

I would use the native string extraction methods in pandas
df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')
Which returns the matches of the different groups:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN G NaN NaN
5 NaN NaN NaN NaN
6 NaN G NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
Then I would convert this to True or False using pd.isnull and invert it with ~. Thereby getting True where it is a match, and false where there is not.
~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
0 1 2 3
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False True False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
Then assign this to the dataframe
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
SAMPLE MUT T G C -
0 11 chr1:100154376:G:A False False False False
1 2 chr1:100177723:C:T False False False False
2 9 chr1:100177723:C:T False False False False
3 1 chr1:100194200:-:AA False False False False
4 8 chr1:10032249:A:G False True False False
5 2 chr1:100340787:G:A False False False False
6 1 chr1:100349757:A:G False True False False
7 3 chr1:10041186:C:A False False False False
8 10 chr1:100476986:G:C False False False False
9 4 chr1:100572459:C:T False False False False
10 5 chr1:100572459:C:T False False False False
Now we can simply sum the columns:
df[["T","G","C","-"]].sum()
T 0
G 2
C 0
- 0
But wait, we have not done this only where SAMPLE == 1
We can do this very easily with a mask:
sample_one_mask = df.SAMPLE == 1
df[sample_one_mask][["T","G","C","-"]].sum()
T 0
G 1
C 0
- 0
If you want this to count per SAMPLE instead, you can use the groupby function:
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
T G C -
SAMPLE
1 0 1 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
TLDR;
Do this:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)

You can create a column with the mutation type (A->T, G->C) with a regular expression substitution then apply pandas groupby to count.
import pandas as pd
import re
df = pd.read_table('df.tsv')
df['mutation_type'] = df['MUT'].apply(lambda x: re.sub(r'^.*?:([^:]+:[^:]+)$', r'\1', x))
df.groupby(['SAMPLE','mutation_type']).agg('count')['MUT']
The output is like this for your data:
SAMPLE mutation_type
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
Name: MUT, dtype: int64

I had a similar answer to A.P.
import pandas as pd
df = pd.DataFrame(data={'SAMPLE': [11,2,9,1,8,2,1,3,10,4,5], 'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T', 'chr1:100177723:C:T', 'chr1:100194200:-:AA', 'chr1:10032249:A:G', 'chr1:100340787:G:A', 'chr1:100349757:A:G', 'chr1:10041186:C:A', 'chr1:100476986:G:C', 'chr1:100572459:C:T', 'chr1:100572459:C:T']}, columns=['SAMPLE', 'MUT'])
df['Sequence'] = df['MUT'].str.replace(r'\w+:\d+:', '\1')
df.groupby(['SAMPLE', 'Sequence']).count()
Produces
MUT
SAMPLE Sequence
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1

Related

Cumulative count resetting to and staying 0 based on a condition in pandas

I am trying to create a column 'count' on a pandas DF that cumulatively counts when field 'boolean' is True but resets and stays at 0 when 'boolean' is False. Also needs to be grouped by the ID column, so the count resets when looking at a new ID. No loops please as working with a big data set
Used the code from the following question which works but need to add a group by to include the ID column grouping
Pandas Dataframe - Row Iteration with Resetting Count-Value by Condition without loop
Expected output below: (ID, Boolean columns already exist, just need to create Count)
ID Boolean Count
1 True 1
1 True 2
1 True 3
1 True 4
1 True 5
1 False 0
1 False 0
1 False 0
1 False 0
1 True 1
1 True 2
1 True 3
2 True 1
2 True 2
2 True 3
2 True 4
2 False 0
2 False 0
2 False 0
2 True 1
2 True 2
2 True 3

Identify blocks by using cumsum on inverted boolean mask, then group the dataframe by ID and blocks and use cumsum on Boolean to create a counter
b = (~df['Boolean']).cumsum()
df['Count'] = df.groupby(['ID', b])['Boolean'].cumsum()
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3

df['Count'] = df.groupby('ID')['Boolean'].diff()
df = df.fillna(False)
df['Count'] = df.groupby('ID')['Count'].cumsum()
df['Count'] = df.groupby(['ID', 'Count'])['Boolean'].cumsum()
df
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3

You can use a column shift for ID and Boolean columns to identify the groups to do the groupby on. Then do a cumsum for each of those groups.
groups = ((df['ID']!=df['ID'].shift()) | (df['Boolean']!=df['Boolean'].shift())).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())
Result
ID Boolean Count Count2
0 1 True 1 1
1 1 True 2 2
2 1 True 3 3
3 1 True 4 4
4 1 True 5 5
5 1 False 0 0
6 1 False 0 0
7 1 False 0 0
8 1 False 0 0
9 1 True 1 1
10 1 True 2 2
11 1 True 3 3
12 2 True 1 1
13 2 True 2 2
14 2 True 3 3
15 2 True 4 4
16 2 False 0 0
17 2 False 0 0
18 2 False 0 0
19 2 True 1 1
20 2 True 2 2
21 2 True 3 3

Pandas Groupby, does not take selection into account

I am struggling for a while now with the Pandas grouping function.
Explaination of data:
In the code below I load a CSV file into a DataFrame. I had to preprocess the data to structure in a way that each column is a variable, before a column was a variable of a certain wave. The result DataFrame has over 120 variables and each variable has it's own column. Row wise is an observation. Each participant has between the 1 and 13 observations, uniquely identified by a ResponseID and wave number.
Goal:
In this experiment I have two scenario's, one for only the Dutch participants and one scenario for all participants from all over the world. I want to know the mean of the observations per participant for each variable, for both scenario's.
Problem:
Whenever I run this code, the shape of the group per ID is the same for both the Dutch Data scenario as for the World data scenario. While when i deterimine the amount of participants with the .unique() function of Pandas, it does exactly what i expect, namely a small number for the amount of Dutch participants and a large number for the world Data participants.
Question:
Can anyone help me solve this problem?
it states that I have a equal amount of participant for both the Dutch Data scenario as for the World data scenario.
Preprocessing code
import sklearn as sk
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import KBinsDiscretizer
# CSV Settings
seperator = ","
decimal='.'
# Input files
data_raw_file = {"path": "./data.csv", "sheet": None}
data_question_file = {"path": "./Variable Justification.xlsx", "sheet": "Full sheet"}
# View x lines of a table
view_lines = 10
pd.set_option("display.max_rows", view_lines, "display.max_columns", None)
# Load data
data = pd.read_csv(data_raw_file["path"], sep=seperator, decimal=decimal)
# Use
identifier_vars = ["ResponseId", "id"]
demographic_vars = ["age", "coded_country", "edu", "gender"]
outcome_vars = ["affAnx", "affBor", "affCalm", "affDepr", "affEnerg", "affNerv", "affExh", "affInsp", "affRel", "affAng", "affLov"]
# Types (Other variables are considered numerical)
discrete_vars = ["employstatus", "isoObjWho", "coronaClose", "houseLeaveWhy"]
orderd_categorical_vars = ["age", "edu"]
categorical_vars = ["coded_country", "gender", "ResponseId", "id"]
# Wave information
waves = pd.DataFrame(data = [["base" , "" , "Ongoing"],
["wave01", "w1_", "1-3-2020"],
["wave02", "w2_", "2-4-2020"],
["wave03", "w3_", "3-5-2020"],
["wave04", "w4_", "4-6-2020"],
columns =["Waves" , "Wave_ref", "Date"])
# Extract the unique variable names, by making use of the general structure wx_varName, where x is the wave number.
# ================================================================
variable_names = data.keys().str.replace(r'(w\d*\_)', "").unique().to_frame(index=False, name="name")
# Define the different types of variable and their use
# ================================================================
variable_names["use"] = "IV"
variable_names.loc[variable_names["name"].isin(demographic_vars), "use"] = "Demographic"
variable_names.loc[variable_names["name"].isin(identifier_vars), "use"] = "Identifier"
variable_names.loc[variable_names["name"].isin(outcome_vars), "use"] = "DV"
variable_names["type"] = "Continuous"
variable_names.loc[variable_names["name"].isin(categorical_vars), "type"] = "Categorical"
variable_names.loc[variable_names["name"].isin(orderd_categorical_vars), "type"] = "Ordered_Categorical"
for var in discrete_vars:
variable_names.loc[variable_names["name"].str.match('^' + var + '.*'), "type"] = "Discrete"
# Wave in to dataFrame
# ==============================================================
df_waves = pd.DataFrame(columns=variable_names["name"])
for idx, w_ref in enumerate(waves["Wave_ref"]):
# Add Wx_ to the variable names
temp_var = [w_ref + s for s in variable_names[variable_names["type"].isin(["Continuous", "Discrete"])]["name"]] + demographic_vars + identifier_vars
temp_df = data[data.columns.intersection(temp_var)].copy()
temp_df.columns = [s.replace(str(w_ref), "") for s in temp_df.columns] # Remove wave number from column
temp_df["wave"] = waves[waves["Wave_ref"] == w_ref]["Waves"].values[0]
temp_df["wave_date"] = waves[waves["Wave_ref"] == w_ref]["Date"].values[0]
df_waves = df_waves.append(temp_df, ignore_index=True)
data = df_waves.copy()
del(df_waves, temp_df, temp_var, idx, w_ref)
# Define data types
# =================================================================
discrete_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Discrete")]["name"]
data[discrete_vars] = data[discrete_vars].replace(1, True)
data[discrete_vars] = data[discrete_vars].fillna(value=False)
data[discrete_vars] = data[discrete_vars].astype(bool)
continuous_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Continuous")]["name"]
data[continuous_vars] = data[continuous_vars].astype(float)
o_categorical_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Ordered_Categorical")]["name"]
data[o_categorical_vars] = data[o_categorical_vars].astype(float)
continuous_vars = variable_names[(variable_names["name"].isin(data.columns)) & (variable_names["type"] == "Categorical")]["name"]
data[continuous_vars] = data[continuous_vars].astype("category")
Part of code, where I think/thought the problem is (comes after preprocessing)
p1_data = data.copy()
p1_scenarios = {"Dutch_Data" : p1_data[p1_data["coded_country"]=="Netherlands"],
"World_Data": p1_data
}
for i, scenario in enumerate(p1_scenarios):
p1_data_scene = p1_scenarios[scenario]
participants = 1_data_scene["ResponseId"].unique()
mean_per_id = p1_data_scene[outcome_vars+["ResponseId"]].groupby(by="ResponseId", dropna=False).mean()
print(scenario)
print(p1_data_scene.shape)
print("Amount of participants " + str(len(participants)))
print("Shape of group per ID" + str(mean_per_id.shape))
Example of the Data after preprocessing:
observation
affAnx
affBor
affCalm
affDepr
affEnerg
affNerv
affExh
affInsp
affRel
affAng
affLov
PLRAC19
PLRAEco
disc01
disc02
disc03
jbInsec01
jbInsec02
jbInsec03
jbInsec04
employstatus_1
employstatus_2
employstatus_3
employstatus_14
employstatus_4
employstatus_5
employstatus_11
employstatus_12
employstatus_6
employstatus_7
employstatus_8
employstatus_9
employstatus_10
employstatus_13
hours_worked_1
PFS01
fail01
isoFriends_inPerson
isoOthPpl_inPerson
isoFriends_online
isoOthPpl_online
isoObj
isoObjWho_1
isoObjWho_2
isoObjWho_3
isoObjWho_4
isoObjWho_5
isoObjWho_6
houseTrad
discPers
lone01
mentHealth
mentPhys
happy
lifeSat
MLQ
JWB_1
tightNorms
tightLoose
tightTreat
probSolving01
probSolving02
probSolving03
posrefocus01
posrefocus02
posrefocus03
C19Know
c19Hope
c19Eff
c19ProSo01
c19ProSo03
c19perBeh01
c19perBeh02
c19perBeh03
c19RCA01
c19RCA02
coronaClose_1
coronaClose_2
coronaClose_3
coronaClose_4
coronaClose_5
coronaClose_6
ecoHope
ecoProSo01
ecoProSo03
ecoRCA02
ecoRCA03
houseLeave
houseLeaveWhy_1
houseLeaveWhy_2
houseLeaveWhy_8
houseLeaveWhy_4
houseLeaveWhy_7
houseLeaveWhy_6
houseActWant
houseActHave
bor02
tempFocPast
tempFocPres
tempFocFut
neuro01
neuro02
neuro03
para01
para02
para03
consp01
consp02
relYesNo
godyesno
godOpinCtrl
godOpinInfl
godPunish
godForgive
trustGovCtry
trustGovState
ctrGJob
solMyCtr
solOthCtr
depressed
gender
age
edu
coded_country
ResponseId
wave
wave_date
1
4
2
2
2
3
4
4
3
1
nan
nan
2
1
1
0
1
nan
0
0
0
False
False
False
False
False
False
False
False
False
False
False
True
True
False
nan
1
-1
1
0
7
7
nan
False
False
False
False
False
False
nan
1
2
nan
nan
4
4
0
nan
9
7
7
4
4
3
3
3
1
3
2
2
3
0
0
3
2
1
3
False
False
False
False
False
True
2
0
1
0
0
1
False
False
False
False
False
False
nan
nan
0
2
2
2
1
0
2
10
5
2
4
1
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
1
1
5
Netherlands
d455fa2ff
base
Ongoing
2
3
4
1
nan
nan
nan
nan
nan
nan
nan
nan
1
7
1
1
-2
-2
1
1
0
False
False
False
False
False
False
False
False
False
False
False
True
False
False
nan
2
0
7
5
0
3
nan
False
False
False
False
False
False
nan
1
3
nan
nan
5
4
3
nan
3
9
2
3
2
4
4
4
2
2
2
1
1
1
3
3
2
1
3
False
False
False
False
False
True
0
1
0
2
0
2
False
False
False
True
False
False
nan
nan
-3
3
3
2
3
3
1
10
10
2
2
1
1
1
3
3
nan
nan
nan
nan
nan
nan
nan
nan
1
1
5
Netherlands
d455fa2ff
wave1
12-3-2021
3
4
4
3
4
1
3
4
2
1
nan
nan
4
5
1
-1
-1
1
-1
1
1
False
False
False
False
False
False
False
False
True
False
False
False
False
False
nan
2
1
7
7
4
3
nan
False
False
False
False
False
False
nan
1
2
nan
nan
6
5
2
nan
3
3
3
3
4
5
3
3
1
3
1
-3
-1
-3
0
3
0
1
1
False
False
False
False
False
True
0
0
0
0
3
1
False
False
False
False
False
False
nan
nan
-3
3
0
1
-3
1
-2
4
5
1
2
3
1
1
3
3
3
6
nan
nan
nan
nan
nan
nan
1
2
4
nan
028a6e28
base
Ongoing
4
4
5
3
4
3
5
2
2
3
nan
nan
4
4
1
1
1
-2
2
-2
-2
True
False
False
False
False
False
False
False
False
False
False
False
False
False
nan
0
1
0
0
6
6
nan
False
False
False
False
False
False
nan
-1
5
nan
nan
5
3
0
nan
6
1
6
3
3
4
3
3
3
4
2
3
-3
1
3
3
3
3
3
False
False
False
False
True
False
1
0
2
2
2
3
False
True
False
False
False
False
nan
nan
2
3
2
3
2
2
2
0
6
2
10
10
1
1
3
3
3
6
5
5
nan
nan
nan
nan
1
5
5
Saudi Arabia
4212d3a1
base
Ongoing
5
2
5
4
3
5
2
2
4
4
nan
nan
3
5
2
1
0
nan
-2
-2
nan
False
False
False
False
False
False
False
False
False
True
False
False
False
False
nan
1
0
0
0
7
3
nan
False
False
False
False
False
False
nan
1
4
nan
nan
4
4
1
nan
8
5
9
5
5
3
3
3
2
5
3
3
0
2
3
2
3
3
3
False
False
False
False
False
True
3
3
3
2
-2
1
False
False
False
False
False
False
nan
nan
3
1
2
-2
1
1
1
10
3
0
0
0
1
1
3
3
6
6
nan
nan
nan
nan
nan
nan
2
6
6
Saudi Arabia
26dc23cf
base
Ongoing
6
2
3
1
4
2
2
4
1
1
nan
nan
5
6
1
2
-2
1
-1
1
nan
False
False
False
False
False
False
False
False
True
False
False
False
False
False
nan
1
2
0
3
2
0
nan
False
False
False
False
False
False
nan
1
4
nan
nan
2
3
-1
nan
6
1
4
2
1
1
1
2
1
3
-2
-2
1
0
1
1
2
0
3
False
False
False
False
True
False
-3
1
1
2
2
2
False
False
False
False
False
True
nan
nan
-1
-3
2
2
1
1
-1
10
10
10
7
10
0
1
2
2
1
6
2
2
nan
nan
nan
nan
1
2
6
Egypt
bed32257
base
Ongoing
7
3
3
2
1
2
4
2
1
1
nan
nan
4
2
2
-1
-1
nan
nan
nan
nan
False
False
False
False
True
False
False
False
False
False
False
True
False
False
nan
2
1
0
0
2
0
nan
False
False
False
False
False
False
nan
-1
5
nan
nan
1
1
1
nan
9
9
9
5
5
5
1
1
1
5
-2
-1
3
3
3
3
2
3
3
False
False
False
True
False
False
-2
-2
-1
2
0
2
False
False
False
False
False
True
nan
nan
2
-1
1
2
1
2
0
10
8
9
7
10
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2
1
5
Morocco
4sc2f1ae
base
Ongoing
8
nan
nan
nan
3
3
nan
nan
nan
nan
nan
nan
nan
4
nan
nan
nan
nan
nan
nan
nan
False
False
False
False
False
False
False
False
False
False
False
False
False
False
nan
nan
1
nan
nan
nan
nan
nan
False
False
False
False
False
False
nan
1
1
nan
nan
10
6
2
nan
nan
nan
nan
3
3
3
5
5
5
4
3
2
1
1
2
2
2
2
2
False
False
False
False
False
True
nan
1
1
nan
1
1
False
False
False
False
False
False
nan
nan
1
1
2
2
1
0
2
9
0
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
1
nan
nan
Saudi Arabia
3a86dadc
base
Ongoing
9
3
2
1
3
2
2
1
1
3
nan
nan
4
3
1
0
-1
nan
nan
nan
nan
False
False
False
False
False
True
False
False
False
False
False
False
False
False
nan
-1
-1
1
0
5
0
nan
False
False
False
False
False
False
nan
1
1
nan
nan
7
5
0
nan
7
8
7
1
1
2
2
2
2
3
2
1
1
2
3
3
3
2
3
False
True
False
False
False
False
1
1
1
2
2
1
False
False
False
False
False
False
nan
nan
1
1
2
-2
1
2
1
3
4
5
3
5
1
1
3
2
4
6
nan
nan
nan
nan
nan
nan
1
2
5
Netherlands
5d181ac9
base
Ongoing
10
3
2
4
1
2
2
3
3
2
nan
nan
3
3
1
1
0
-2
2
-1
-2
False
False
False
False
False
False
False
False
False
True
False
False
False
False
nan
0
-1
2
5
7
7
nan
False
False
False
False
False
False
nan
-1
1
nan
nan
8
5
3
nan
7
6
5
4
4
4
2
2
3
4
3
2
-1
0
3
3
3
1
3
False
False
False
False
False
True
1
2
1
2
1
3
False
True
False
False
False
True
nan
nan
-2
-2
2
-1
-2
1
1
6
2
1
3
3
1
1
3
nan
2
6
nan
nan
nan
nan
nan
nan
1
1
5
Netherlands
d455fa2ff
wave2
16-3-2021

After copying your example data into a CSV file, I ran the code below. Note I've fixed typos and indentation, made sure the NAN values are recognized as such, and replaced outcome_vars (which wasn't defined in the code you posted) with two sample columns.
import numpy as np
import pandas as pd
data = pd.read_csv("data_path.csv", delim_whitespace=True)
data.replace('nan', np.nan, inplace=True)
p1_data = data.copy()
p1_scenarios = {"Dutch_Data" : p1_data[p1_data["coded_country"]=="Netherlands"],
"World_Data": p1_data}
for i, scenario in enumerate(p1_scenarios):
p1_data_scene = p1_scenarios[scenario]
participants = p1_data_scene["ResponseId"].unique()
mean_per_id = p1_data_scene[['affAnx', 'affBor', "ResponseId"]].groupby(
by="ResponseId", dropna=False).mean()
print(scenario)
print(p1_data_scene.shape)
print("Amount of participants " + str(len(participants)))
print("Shape of group per ID" + str(mean_per_id.shape))
The output is as expected in view of the fact that three of the Dutch response ID codes are identical, and the rest are unique:
Dutch_Data
(4, 128)
Amount of participants 2
Shape of group per ID(2, 2)
World_Data
(10, 128)
Amount of participants 8
Shape of group per ID(8, 2)
So for this dataset, the problem you describe is not caused by the code you posted, unless the choice of outcome_vars somehow affects the groupby operation, which it shouldn't.
Can you come up with a minimal data sample that does reproduce the problem?

How to delete cells and cut rows find by " ̶i̶s̶i̶n̶(̶)̶ " df.mask()?

I have dataframe with random cells, for example "boss".
How can I delete the cells "boss" and all right cells in the same row using df.isin()?
x=[]
for i in range (5):
x.append("boss")
df=pd.DataFrame(np.diagflat(x) )
0 1 2 3 4
0 boss
1 boss
2 boss
3 boss
4 boss
cut rows using "i̶s̶i̶n̶(̶)̶ " df.mask to
mask=(df.eq('boss').cumsum(axis=1).ne(0))
df.mask(mask,"Nan", inplace =True)
df
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN
4 NaN

Use DataFrame.mask with mask:
df = df.mask(df.eq('boss').cumsum(axis=1).ne(0))
print (df)
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN
4 NaN
Details:
Compare value boss:
print (df.eq('boss'))
0 1 2 3 4
0 True False False False False
1 False True False False False
2 False False True False False
3 False False False True False
4 False False False False True
Use cumulative sum per rows, so after first match get 1, after second 2..:
print (df.eq('boss').cumsum(axis=1))
0 1 2 3 4
0 1 1 1 1 1
1 0 1 1 1 1
2 0 0 1 1 1
3 0 0 0 1 1
4 0 0 0 0 1
Compare if not equal 0 and pass to mask:
print (df.eq('boss').cumsum(axis=1).ne(0))
0 1 2 3 4
0 True True True True True
1 False True True True True
2 False False True True True
3 False False False True True
4 False False False False True

How to count consecutive repetitions in a pandas series

Consider the following series, ser
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How to count the values such that every consecutive number is counted and returned? Thanks.
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1

Here is another approach using fillna to handle NaN values:
s = df.id.fillna('nan')
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
counts = s.groupby(mask.cumsum()).cumcount().add(1).groupby(mask.cumsum()).max().to_numpy()
# Convert 'nan' string back to `NaN`
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64

The cumsum trick is useful here, it's a little tricky with the NaNs though, so I think you need to handle these separately:
In [11]: df.id.isnull() & df.id.shift(-1).isnull()
Out[11]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
12 True
Name: id, dtype: bool
In [12]: df.id.eq(df.id.shift(-1))
Out[12]:
0 False
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 False
9 True
10 True
11 False
12 False
Name: id, dtype: bool
In [13]: (df.id.isnull() & df.id.shift(-1).isnull()) | (df.id.eq(df.id.shift(-1)))
Out[13]:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 True
8 False
9 True
10 True
11 False
12 True
Name: id, dtype: bool
In [14]: ((df.id.isnull() & df.id.shift(-1).isnull()) | (df.id.eq(df.id.shift(-1)))).cumsum()
Out[14]:
0 1
1 1
2 2
3 2
4 3
5 4
6 4
7 5
8 5
9 6
10 7
11 7
12 8
Name: id, dtype: int64
Now you can use this labeling in your groupby:
In [15]: g = df.groupby(((df.id.isnull() & df.id.shift(-1).isnull()) | (df.id.eq(df.id.shift(-1)))).cumsum())
In [16]: pd.DataFrame({"count": g.id.size(), "id": g.id.nth(0)})
Out[16]:
count id
id
1 2 NaN
2 2 1.0
3 1 2.0
4 2 2.0
5 2 NaN
6 1 1.0
7 2 1.0
8 1 NaN

Problems while creating a two column based index in a new pandas column?

Given the following dataframe:
col_1 col_2
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 2
True 2
False 2
False 2
True 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
How can I create a new index that help to identify when a True value is present in col_1? That is, when in the first column a True value appears I would like to fill backward with a number starting from one the new column. For example, this is the expected output for the above dataframe:
col_1 col_2 new_id
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 2 1
True 2 1 --------- ^ (fill with 1 and increase the counter)
False 2 2
False 2 2
True 2 2 --------- ^ (fill with 2 and increase the counter)
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
True 2 4 --------- ^ (fill with 3 and increase the counter)
The problem is that I do not know how to create the id although I know that pandas provide a bfill object that may help to achieve this purpose. So far I tried to iterate with a simple for loop:
count = 0
for index, row in df.iterrows():
if row['col_1'] == False:
print(count+1)
else:
print(row['col_2'] + 1)
However, I do not know how to increase the counter to the next number. Also I tried to create a function and then apply it to the dataframe:
def create_id(col_1, col_2):
counter = 0
if col_1 == True and col_2.bool() == True:
return counter + 1
else:
pass
Nevertheless, i lose control of filling backward the column.

Just do with cumsum
df['new_id']=(df.col_1.cumsum().shift().fillna(0)+1).astype(int)
df
Out[210]:
col_1 col_2 new_id
0 False 1 1
1 False 1 1
2 False 1 1
3 False 1 1
4 False 1 1
5 False 1 1
6 False 1 1
7 False 1 1
8 False 1 1
9 False 1 1
10 False 1 1
11 False 1 1
12 False 1 1
13 False 1 1
14 False 2 1
15 True 2 1
16 False 2 2
17 False 2 2
18 True 2 2
19 False 2 3
20 False 2 3
21 False 2 3
22 False 2 3
23 False 2 3
24 False 2 3
25 False 2 3
26 False 2 3
27 False 2 3
28 False 2 3
29 False 2 3

If you aim to append the new_id column to your dataframe:
new_id=[]
counter=1
for index, row in df.iterrows():
new_id+= [counter]
if row['col_1']==True:
counter+=1
df['new_id']=new_id

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Count rows that match string and numeric with pandas - python

Related

Cumulative count resetting to and staying 0 based on a condition in pandas

Pandas Groupby, does not take selection into account

How to delete cells and cut rows find by " ̶i̶s̶i̶n̶(̶)̶ " df.mask()?

How to count consecutive repetitions in a pandas series

Problems while creating a two column based index in a new pandas column?

Categories

Resources