Data problem: identifying data rows where colleagues have reached a consensus - python

I have a table that shows the results of four colleagues trying to classify several objects as either a, b, c or d. If the colleagues were able to agree on the classification, or if only one colleague is able to classify the object, then in a new column I want to show the colleague's classification. If the colleagues disagree, I want to create a separate dataframe that displays
those objects. For each object, at max only two colleagues are assigned to try classify it, so there won't be a situation where three colleagues cannot agree on the classification.
It is easy to show an object's classification if only one colleague is able to identify it, but I am struggling when there are two. I can only get as far as the following given my noob python skills.
The end result I am looking for, is 'a' for the first row, 'b' for third, and 'd' for fourth. The second row would be singled out for manual classification by a more experienced colleague.
df_test = pd.DataFrame({'check1':['a','a','unknown','d'],
'check2':['unknown','b','unknown','unknown'],
'check3':['unknown','unknown','c','d'],
'check4':['unknown','unknown','c','unknown']})
cols = ['check_ind','check1_ind','check2_ind','check3_ind','check4_ind']
for col in cols:
df_test[col] = 0
checks = [('check1','check1_ind'),('check2','check2_ind'),('check3','check3_ind'),('check4','check4_ind')]
rows = df_test.shape[0]
for r in range(rows):
for c in checks:
if df_test.iloc[r, df_test.columns.get_loc(c[0])] != 'unknown':
df_test.iloc[r, df_test.columns.get_loc(c[1])] = 1
sumcolumn = df_test['check1_ind'] + df_test['check2_ind'] + df_test['check3_ind'] + df_test['check4_ind']
df_test['body_check'] = sumcolumn

df.replace('unknown', np.nan, inplace=True)
df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else 'No Consensus', axis=1)
Output:
0 a
1 No Consensus
2 c
3 d
dtype: object
In use:
df['consensus'] = df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else np.nan, axis=1)
print(df)
...
check1 check2 check3 check4 consensus
0 a NaN NaN NaN a
1 a b NaN NaN NaN
2 NaN NaN c c c
3 d NaN d NaN d

Something like this should do the trick:
def function(series):
val_counts = series.value_counts()
if val_counts.size > 1:
return 'No Consensus'
else:
return val_counts.index[0]
df_test.replace({'unknown': np.nan}).apply(function, axis=1)

For an efficient, vectorial, approach, use mode:
df2 = (df_test
.mask(df_test.eq('unknown'))
.mode(1)
# ensure having a "1" column
.reindex(columns=[0,1])
)
print(df2)
# 0 1
# 0 a NaN
# 1 a b
# 2 c NaN
# 3 d NaN
m = df2[1].notna()
df_test['consensus'] = df2[0].mask(m, 'No consensus')
print(df_test)
Output:
check1 check2 check3 check4 consensus
0 a unknown unknown unknown a
1 a b unknown unknown No consensus
2 unknown unknown c c c
3 d unknown d unknown d

Related

How to drill down data using pandas - pythonic way?

I have a dataframe like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'grade': rng.choice(list('ACD'),size=(5)),
'dash': rng.choice(list('PQRS'),size=(5)),
'dumeel': rng.choice(list('QWER'),size=(5)),
'dumma': rng.choice((1234),size=(5)),
'target': rng.choice([0,1],size=(5))
})
My objective is to compute the drill down info for each column
Let me explain by an example.
If we filter the dataframe by df[df['grade']=='A'], we get 2 records as result. let's consider the filtered column grade as parent_variable. Out of those 2 records returned as result, how much dumeel column (child_variable) values and dash column (child_variable) values account for target column values (which is 0 and 1). All categorical/object columns other than parent variable are called child variables.
We have to repeat the above exaple procedure for all the categorical/object variables in our dataset
As a first step, I made use of the below from a SO post
funcs = {
'cnt of records': 'count',
'target met': lambda x: sum(x),
'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
out.rename(columns={'variable': 'parent_variable','value': 'parent_value'}, inplace=True)
But the above, gets me only the % and count of target based on all parent variable. I would like to get the breakdown by child variables as well (for each parent variable)
%_contrib is obtained by computing the % of that record to the target value. ex: for dash=P, we have one grade values A (for target = 1). So, it has to be 100%. Hope this helps.
I expect my output to be like as shown below. I have shown sample only for couple of columns under parent_variable. But in my real data, there will be more than 20 categorical variables. So, any efficient approach is welcome and useful
As you are using a random function to generate the DataFrame it is hard for me to reproduce your example, but I think you are looking for value_counts -
This is the DataFrame I generated with your code -
grade dash dumeel dumma target
0 D P W 50 1
1 D S R 595 0
2 C P E 495 1
3 A Q Q 690 0
4 B P W 653 1
5 D R E 554 0
6 C P Q 392 1
7 D Q Q 186 0
8 B Q E 1228 1
9 C P E 14 0
When I do a value_counts() on the two columns -
df[(df['dash']=='P') & (df['target'] == 1)]['dumeel'].value_counts(normalize=True)
W 0.50
Q 0.25
E 0.25
Name: dumeel, dtype: float64
df[(df['dash']=='P') & (df['target'] == 1)]['grade'].value_counts(normalize=True)
C 0.50
D 0.25
B 0.25
Name: grade, dtype: float64
If you want to loop over all the child_columns - you can do
excl_cols = ['dash', 'target']
child_cols = [col for col in df.columns if col not in excl_cols]
for col in child_cols:
print(df[(df['dash']=='P') & (df['target'] == 1)][col].value_counts(normalize=True))
If you want to loop over all the columns - then you can use:
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
print(f'Parent column is {parent_col}\n')
parent_vals = df[parent_col].unique()
child_cols = loop_columns - {parent_col}
for parent_val in parent_vals:
for child_col in child_cols:
print(df[(df[parent_col]==parent_val) & (df['target'] == 1)][child_col].value_counts(normalize=True))

lambda function referencing a column value not specified in function

I have a situation where I want to use the results of a groupby in my training set to fill in results for my test set.
I don't think there's a straight forward way to do this in pandas, so I'm trying use the apply method on the column in my test set.
MY SITUATION:
I want to use the average values from my MSZoning column to infer the missing value for my LotFrontage column.
If I use the groupby method on my training set I get this:
train.groupby('MSZoning')['LotFrontage'].agg(['mean', 'count'])
giving.....
Now, I want to use these values to impute missing values on my test set, so I can't just use the transform method.
Instead, I created a function that I wanted to pass into the apply method, which can be seen here:
def fill_MSZoning(row):
if row['MSZoning'] == 'C':
return 69.7
elif row['MSZoning'] == 'FV':
return 59.49
elif row['MSZoning'] == 'RH':
return 58.92
elif row['MSZoning'] == 'RL':
return 74.68
else:
return 52.4
I call the function like this:
test['LotFrontage'] = test.apply(lambda x: x.fillna(fill_MSZoning), axis=1)
Now, the results for the LotFrontage column are the same as the Id column, even though I didn't specify this.
Any idea what is happening?
you can do it like this
import pandas as pd
import numpy as np
## creating dummy data
np.random.seed(100)
raw = {
"group": np.random.choice("A B C".split(), 10),
"value": [np.nan if np.random.rand()>0.8 else np.random.choice(100) for _ in range(10)]
}
df = pd.DataFrame(raw)
display(df)
## calculate mean
means = df.groupby("group").mean()
display(means)
Fill With Group Mean
## fill with mean value
def fill_group_mean(x):
group_mean = means["value"].loc[x["group"].max()]
return x["value"].mask(x["value"].isna(), group_mean)
r= df.groupby("group").apply(fill_group_mean)
r.reset_index(level=0)
Output
group value
0 A NaN
1 A 24.0
2 A 60.0
3 C 9.0
4 C 2.0
5 A NaN
6 C NaN
7 B 83.0
8 C 91.0
9 C 7.0
group value
0 A 42.00
1 A 24.00
2 A 60.00
5 A 42.00
7 B 83.00
3 C 9.00
4 C 2.00
6 C 27.25
8 C 91.00
9 C 7.00

Pandas: Dynamically replace NaN values with the average of previous and next non-missing values

I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).
You can try to understand your logic behind the average that is Geometric progression
s=df.isnull().cumsum()
t1=df[(s==1).shift(-1).fillna(False)].stack().reset_index(level=0,drop=True)
t2=df.lookup(s.idxmax()+1,s.idxmax().index)
df.fillna(t1/(2**s)+t2*(1-0.5**s)*2/2)
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation:
1st NaN x/2+y/2=1st
2nd NaN 1st/2+y/2=2nd
3rd NaN 2nd/2+y/2+3rd
Then x/(2**n)+y(1-(1/2)**n)/(1-1/2), this is the key
Got a simular Problem.
The following code worked for me.
def fill_nan_with_mean_from_prev_and_next(df):
NANrows = pd.isnull(df).any(1).nonzero()[0]
null_df = df.isnull()
for row in NANrows :
for colum in range(0,df.shape[1]):
if(null_df.iloc[row][colum]):
df.iloc[row][colum] = (df.iloc[row-1][colum]+df.iloc[row-1][colum])/2
return df
maybe it is helps someone too.
as Ben.T has mentioned above
if you have another group of NaN in the same column
you can consider this lazy solution :)
for column in df:
for ind,row in df[[column]].iterrows():
if ~np.isnan(row[column]):
previous = row[column]
else:
indx = ind + 1
while np.isnan(df.loc[indx,column]):
indx += 1
next = df.loc[indx,column]
previous = df[column][ind] = (previous + next)/2

Python pandas apply too slow Fuzzy Match

def fuzzy_clean(i, dfr, merge_list, key):
for col in range(0,len(merge_list)):
if col == 0:
scaled_down = dfr[dfr[merge_list[col]]==i[merge_list[col]]]
else:
scaled_down = scaled_down[scaled_down[merge_list[col]]==i[merge_list[col]]]
if len(scaled_down)>0:
if i[key] in scaled_down[key].values.tolist():
return i[key]
else:
return pd.to_datetime(scaled_down[key][min(abs([scaled_down[key]-i[key]])).index].values[0])
else:
return i[key]
df[key]=df.apply(lambda i: fuzzy_clean(i,dfr,merge_list,key), axis=1)
I'm trying to eventually merge together two dataframes, dfr and df. The issue I have is that I need to merge on about 9 columns, one of which being a timestamp that doesn't quite match up between the two dataframes where sometimes it is slightly lagging, sometimes leading. I wrote a function that works when using the following; however, in practice it is just too slow running through hundreds of thousands of rows.
merge_list is a list of columns that each dataframe share that match up 100%
key is a string of a column, 'timestamp', that each share, which is what doesn't match up too well
Any suggestions in speeding this up would be greatly appreciated!
The data looks like the following:
df:
timestamp A B C
0 100 x y z
1 101 y i u
2 102 r a e
3 103 q w e
dfr:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 101.05 y i u
3 102 r a e
4 103.01 q w e
5 103.20 q w e
I want df to look like the following:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 102 r a e
3 103.01 q w e
Adding the final merge for reference:
def fuzzy_merge(df_left, df_right, on, key, how='outer'):
df_right[key]=df_right.apply(lambda i: fuzzy_clean(i,df_left,on,key), axis=1)
return pd.merge(df_left, df_right, on=on+[key], how=how, indicator=True).sort_values(key)
I've found a solution that I believe works. Pandas has a merge_asof that follows, still verifying possible double counting but seemed to do a decent job.
pd.merge_asof(left_df, right_df, on='timestamp', by=merge_list, direction='nearest')

Filter a pandas dataframe using values from a dict

I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value that I want to filter:
filter_v = {'A':1, 'B':0, 'C':'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] ==0)& (df['C'] == 'This is right')]
But I want to do something on the lines
for column, value in filter_v.items():
df[df[column] == value]
but this will filter the data frame several times, one value at a time, and not apply all filters at the same time. Is there a way to do it programmatically?
EDIT: an example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':1, 'B':0, 'C':'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
A B C D
0 1 1 right 1
1 0 1 right 2
3 1 0 right 3
but the expected result was
A B C D
3 1 0 right 3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A 1
B 0
C right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
A C B
0 1 right 1
1 0 right 1
2 1 wrong 1
3 1 right 0
4 NaN right 1
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
A B C
0 True False True
1 False False True
2 True False False
3 True True True
4 False False True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0 False
1 False
2 False
3 True
4 False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
Abstraction of the above for case of passing array of filter values rather than single value (analogous to pandas.core.series.Series.isin()). Using the same example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':[1], 'B':[1,0], 'C':['right']}
##Start with array of all True
ind = [True] * len(df1)
##Loop through filters, updating index
for col, vals in filter_v.items():
ind = ind & (df1[col].isin(vals))
##Return filtered dataframe
df1[ind]
##Returns
A B C D
0 1.0 1 right 1
3 1.0 0 right 3
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
With values being the same across columns you could then do something like this:
# Create your filtering function:
def filter_dict(df, dic):
return df[df[dic.keys()].apply(
lambda x: x.equals(pd.Series(dic.values(), index=x.index, name=x.name)), asix=1)]
# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
A B C D
3 1 0 right 3
If it something that you do frequently you could go as far as to patch DataFrame for an easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
For python2, that's OK in #primer's answer. But, you should be careful in Python3 because of dict_keys. For instance,
>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
>> TypeError: unhashable type: 'dict_keys'
The correct way to Python3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(np.ones(df.shape[0],dtype=bool))
for column, value in filter_v.items():
filterSeries = ((df[column] == value) & filterSeries)
This gives:
>>> df[filterSeries]
A B C D
3 1 0 right 3
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query
query_string = ' and '.join(
[f'({key} == "{val}")' if type(val) == str else f'({key} == {val})' for key, val in filter_v.items()]
)
df1.query(query_string)
Combining previous answers, here's a function you can feed to df1.loc. Allows for AND/OR (using how='all'/'any'), plus it allows comparisons other than == using the op keyword, if desired.
import operator
def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
if how == 'all':
comb = pd.Series.all
elif how == 'any':
comb = pd.Series.any
return comb(op(df[[*filters]], pd.Series(filters)), axis=1)
# Usage
df1.loc[quick_mask(df1, filter_v)]
I had an issue due to my dictionary having multiple values for the same key.
I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]

Categories

Resources