How to retrieve rows based on a mismatch condition between particular columns? - python

I need to do the following tasks.
I have 9 columns along with the original label; each of those 9 columns holds a probability value. Each model contributes 3 of the values, one per class: there are 3 classifier models and 3 classes in total.
Now I have to apply the max rule.
For each class I have to pick the max probability; this gives me three max values. Finally, I return the class whose max value is the largest of those three.
My code and sample
import numpy as np

# For each class, take the row-wise max over the columns carrying that class name
df['Covid_max'] = np.where(df.columns == 'Covid', df.values, 0).max(axis=1)
df['Normal_max'] = np.where(df.columns == 'Normal', df.values, 0).max(axis=1)
df['Pneumonia_max'] = np.where(df.columns == 'Pneumonia', df.values, 0).max(axis=1)
# The predicted class is the one whose max probability is largest
df['pred'] = df[['Covid_max', 'Normal_max', 'Pneumonia_max']].idxmax(axis=1)
# Map the helper column names back to the numeric class labels
new_label = {"pred": {"Covid_max": 0, "Normal_max": 1, "Pneumonia_max": 2}}
df.replace(new_label, inplace=True)
This much I have done already, but now I am stuck: I only need the records where the Class and pred columns mismatch (here it should print only the 2nd row). How do I do that?
Also, if anybody offers another solution, I would be happy to learn it.
TIA

Try this.
# Keep only the rows where Class and pred disagree
df_mismatch = df.loc[~(df['Class'] == df['pred'])]
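Since the question also asks for alternative solutions, here is a hedged sketch (assuming, as the question's np.where trick implies, that each class name appears once per model, so the 9 probability columns carry duplicated names): transpose, group the duplicated labels, take the per-class max, then compare the argmax against the label.
import numpy as np
import pandas as pd

# Toy frame standing in for the question's data: 3 models x 3 classes = 9
# probability columns with duplicated class names, plus the true label.
cols = ['Covid', 'Normal', 'Pneumonia'] * 3
df = pd.DataFrame(np.random.rand(4, 9), columns=cols)
df['Class'] = [0, 1, 2, 0]

# Max rule: per-class max across the duplicated columns, then the argmax
class_max = df.drop(columns='Class').T.groupby(level=0).max().T
pred = class_max.idxmax(axis=1).map({'Covid': 0, 'Normal': 1, 'Pneumonia': 2})

# Rows where the prediction disagrees with the original label
mismatch = df[df['Class'] != pred]
print(mismatch)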

Related

Assignment with both fillna() and loc() apparently not working

I've searched around for an answer, but I cannot find one.
My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill it.
My code looks like this: NOTE - THIS FIRST PART IS NOT IMPORTANT, IT IS JUST TO GIVE CONTEXT
import pandas as pd
from sklearn import neighbors

train_df = df[df['my_column'].notna()]  # train the model without the missing data
train_x = train_df[['lat','long']]      # lat and long are the inputs
train_y = train_df[['my_column']]       # my_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x, train_y)               # clf is the classifier; here we train it
df_x = df[['lat','long']]               # I need this part to do the prediction
prediction = clf.predict(df_x)          # clf.predict() returns an array
series_pred = pd.Series(prediction)     # now the array is a Series
print(series_pred.shape)                # RETURNS (2381,)
print(series_pred.isna().sum())         # RETURNS 0
So far, so good. I have my 2381 predictions (I only need a few of them) and there is no NaN value among them (why would there be a NaN in the predictions? I just wanted to be sure, as I don't understand my error).
Here I try to assign the predictions to my Dataframe:
#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred   # assign the predictions using .loc
#test_2
df['my_column'] = df['my_column'].fillna(series_pred)       # double check: assign the predictions using .fillna()
print(df['my_column'].shape)         # RETURNS (2381,)
print(df['my_column'].isna().sum())  # RETURNS 6
As you can see, it didn't work: there are still 6 missing values. I randomly tried a slightly different approach:
#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred)  # will it work?
print(df[['my_column']].shape)          # RETURNS (2381, 1)
print(df[['my_column']].isna().sum())   # RETURNS 6
Did not work. I decided to try one last thing: check the fillna result even before assigning the results to the original df:
In[42]:
print(df['my_column'].fillna(series_pred).isna().sum()) #extreme test
Out[42]:
6
So... where is my very very stupid mistake? Thanks a lot
EDIT 1
To show a little bit of the data,
In[1]:
df.head()
Out[1]:
    my_column  lat  long
id
9df       Wil   51     5
4f3     Fabio   47     9
x32     Fabio   47     8
z6f     Fabio   47     9
a6f  Giovanni   47     7
Also, I've added info at the beginning of the question
@Ben.T or @Dan should post their own answers; they deserve to be accepted as the correct one.
Following their hints, I would say that there are two solutions:
Solution 1 (Best): Use loc()
The problem
The problem with the current solution is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds the predictions for both the missing and the non-missing rows.
The solution
pred_df = df[df['my_column'].isna()] #For the prediction, use a Dataframe with only the missing values. Problem solved
df_x = pred_df[['lat','long']]
prediction = clf.predict(df_x)
df.loc[df['my_column'].isna(), 'my_column'] = prediction
Solution 2: Use fillna()
The problem
The problem with the current solution is that df['my_column'].fillna(series_pred) requires the index of my df to be the same as that of series_pred, which is impossible in this situation unless your df has a simple RangeIndex like [0, 1, 2, 3, 4, ...]
The solution
Resetting the index of the df at the very beginning of the code.
Why is this not the best
The cleanest way is to run the prediction only where you need it. That is easy to do with loc(), and I do not know how you would obtain it with fillna(), because you would need to preserve the index through the classification.
Edit: series_pred.index = df['my_column'].isna().index. Thanks @Dan
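To see the alignment issue concretely, here is a minimal standalone sketch (toy data, not the question's dataset) showing that Series.fillna matches on index labels, not on position:
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 'b', np.nan], index=['x', 'y', 'z'])
fill = pd.Series(['A', 'B', 'C'])   # default RangeIndex 0, 1, 2

print(s.fillna(fill))   # nothing is filled: the two indexes share no labels
fill.index = s.index    # align the indexes first
print(s.fillna(fill))   # now 'x' gets 'A' and 'z' gets 'C'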

How do I get one column of an array into a new Array whilst applying a fancy indexing filter on it?

So basically I have an array that consists of 14 columns and 426 rows; every column represents one property of a dog and every row represents one dog. I want to know the average heart frequency of the ill dogs. The 14th column indicates whether the dog is ill (0 = healthy, 1 = ill), and the 8th column is the heart frequency. My problem is that I don't know how to get the 8th column out of the whole array and apply the boolean filter to it.
I am pretty new to Python. As I mentioned above, I think I know what I have to do (use a fancy-indexing filter), but not how to do it. I tried it while staying inside the original array, but that didn't work out, so I thought I needed to copy the information into a new array and apply the boolean filter to that one.
EDIT: Ok, so here is the code that I got right now:
import numpy as np

def average_heart_rate_for_pathologic_group(D):
    a = np.array(D[:, 13])  # gets the information whether the dogs are sick or not
    b = np.array(D[:, 7])   # gets the heart frequency
    R = (a >= 0)            # gets all the values that are from sick dogs
    amhr = np.mean(R)       # calculates the average heart frequency
    return amhr
I think boolean indexing is the way forward.
The shortcuts for this work like:
# Your data:
data = [[0,1,2,3,4,5,6,7,8...],[..]...]
# This indexing chooses the rows whose 14th column (index 13, the illness flag)
# equals 1, and then takes their 8th-column (index 7, heart frequency) values.
# Any analysis can be done on the new variable after this.
heart_frequency_ill = data[data[:, 13] == 1, 7]
Probably you'll have to actually copy the data from the original array into a new one with the selected data.
Could you please share a sample with, let's say, 3 or 4 rows of your data?
I will give it a try though.
Let me build data with 4 columns here (but you could use 14 as in your problem)
data = [['c1a','c2a','c3a','c4a'], ['c1b','c2b','c3b','c4b']]
You could use numpy.array to get its nth column.
See how one can get the column at index 2:
import numpy as np
a = np.array(data)
a[:,2]
If you want to get the 8th column for all the dogs that are healthy, you can do the following:
# we use 7 for the column index because indexing starts at 0
# np.argwhere gives the indices where the condition is true,
# which act as a fancy-indexing filter on the rows
A[np.argwhere([A[:, 13] == 0])[:, 1], 7]
If you also want to compute the mean:
A[np.argwhere([A[:, 13] == 0])[:, 1], 7].mean()
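Putting the pieces together, a minimal sketch of the function the question is after (assuming, per the question, that the illness flag is the 14th column, index 13, and the heart frequency the 8th column, index 7):
import numpy as np

def average_heart_rate_for_pathologic_group(D):
    ill = D[:, 13] == 1       # boolean mask selecting the sick dogs
    return D[ill, 7].mean()   # mean heart frequency over those rows only

# Tiny toy array (4 dogs x 14 columns) to exercise it:
D = np.zeros((4, 14))
D[:, 13] = [1, 0, 1, 0]         # illness flags
D[:, 7] = [120, 80, 100, 90]    # heart frequencies
print(average_heart_rate_for_pathologic_group(D))   # -> 110.0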

Finding euclidean distance from multiple mean vectors

This is what I am trying to do. I was able to do steps 1 to 4; I need help with step 5 onward. Basically, for each data point I would like to find the euclidean distance from all mean vectors, based upon the y column:
1. take the data
2. separate out the non-numerical columns
3. find the mean vectors grouped by the y column
4. save the means
5. subtract the matching mean vector from each row, based upon its y value
6. square each column
7. add all columns
8. join back to the numerical dataset, and then join the non-numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric; then take the square of each column in the output, and for each row add up all the columns. Then join this data back to df_numeric and df_non_numeric.
---------- Update 1 ----------
I added code as below. My questions have changed, and the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    return np.sum(np.square(row - means.head(1)), 1)

def calculate_distance2(row):
    return np.sum(np.square(row - means.tail(1)), 1)

df_numeric2 = df_numeric.drop("class", axis=1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
Could anyone confirm that this is a correct way to achieve the result? I am mainly concerned about the last two statements. Does the second-to-last statement do a correct join? Does the final statement assign the original class? I would like to confirm that Python won't do the concat and the class assignment in a random order, and that the order in which the rows appear is maintained.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
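As for the question's worry about row order: pd.concat with axis=1 aligns rows by index label, not by position, so as long as df_non_numeric and df_numeric2 keep the original index the join is correct. A minimal sketch (toy frames, not the question's data):
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [30, 40]}, index=[1, 0])   # rows deliberately reversed

# Rows are matched on the index labels, so 'a' and 'b' line up correctly
print(pd.concat([left, right], axis=1))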
I'm sure there is a better way to do this, but I iterated through depending on the class and followed these exact steps:
1. Assigned 'class' as the index.
2. Rotated the data so that 'class' was in the columns.
3. Performed the mean subtraction that corresponded with df_numeric.
4. Squared the values.
5. Summed the rows.
6. Concatenated the dataframes back together.
import numpy as np

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
#print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T
# Changed the index
df_numeric.index = df_numeric['class']
df_numeric.drop('class', axis=1, inplace=True)
# Rotated the numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterate through the values in means and see which df_numeric values match
store = []  # an empty list to collect the results
for j in means:
    sto = df_numeric[j]
    if type(sto) == type(pd.Series()):  # a single value comes out as a pd.Series
        sto = sto.to_frame()            # need to convert it to a DataFrame
    store.append(sto - j)               # append the various values to the list
values = np.array(store)**2             # squaring the values
# Summing the rows
summed = []
for i in values:
    summed.append(i.sum(axis=1))
df_new = pd.concat(summed, axis=1)
df_new.T

Duplicate the samples in a dataset?

I used the code to check my dataset df and see a serious imbalance in the column 'Has_Arrears'. I would like to expand my target dataset by duplicating the samples with Has_Arrears = 1 35 times, i.e. repeating each observation with Has_Arrears = 1 35 times. How can I achieve this? Cheers
Also, if I would like to use stratified sampling, how can I code that?
If I understand you correctly, this may be what you're looking for:
new = df['Has_Arrears'] == 1   # boolean mask for the minority class
a = df[new]
# repeat those rows 35 times and append them to the original frame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df = pd.concat([df] + [a] * 35, ignore_index=True)
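The stratified-sampling part of the question went unanswered above; a minimal sketch, assuming scikit-learn is available and that the goal is a train/test split that preserves the Has_Arrears class ratio in both parts:
from sklearn.model_selection import train_test_split

# stratify=df['Has_Arrears'] keeps the 0/1 proportions identical
# in the train and test splits
train, test = train_test_split(df, test_size=0.2,
                               stratify=df['Has_Arrears'],
                               random_state=42)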

Conditional Sum/Average/etc... CSV file in Python

First off, I've found similar articles, but I haven't been able to figure out how to translate the answers from those questions to my own problem. Secondly, I'm new to python, so I apologize for being a noob.
Here's my question: I want to perform conditional calculations (average/proportion/etc.) on values within a text file.
More concretely, I have a file that looks a little something like below
0 Diamond Correct
0 Cross Incorrect
1 Diamond Correct
1 Cross Correct
Thus far, I am able to read in the file and collect all of the rows.
import pandas as pd

fileLocation = r'C:/Users/Me/Desktop/LogFiles/SubjectData.txt'
df = pd.read_csv(fileLocation, header=None, sep='\t', index_col=False,
                 names=["Session Number", "Image", "Outcome"])
I'm looking to query the file such that I can ask questions like:
--What is the proportion of "Correct" values in the 'Outcome' column when the first column ('Session Number') is 0? So this would be 0.5, because there is one "Correct" and one "Incorrect".
I have other calculations I'd like to perform, but I should be able to figure out where to go once I know how to do this, hopefully simple, command.
Thanks!
you can also do it this way:
In [467]: df.groupby('Session Number')['Outcome'].apply(lambda x: (x == 'Correct').sum()/len(x))
Out[467]:
Session Number
0    0.5
1    1.0
Name: Outcome, dtype: float64
it'll group your DF by 'Session Number' and calculate the ratio of Correct outcomes for each group
# getting the rows where 'Session Number' is 0
session_zero = df[df['Session Number'] == 0]
# getting the number of those rows that have 'Correct' for 'Outcome'
correct_and_session_zero = len(session_zero[session_zero['Outcome'] == 'Correct'])
# if you're using Python 2 you might need to convert one of the counts
# to float so you won't lose precision
print(correct_and_session_zero / len(session_zero))  # 0.5 for the sample data
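A small aside: since comparing against 'Correct' yields booleans, whose mean is a proportion, the groupby version above can be shortened (same result, arguably clearer intent):
# Proportion of 'Correct' outcomes per session; the mean of a
# boolean Series is exactly the proportion of True values
print(df.groupby('Session Number')['Outcome']
        .apply(lambda x: (x == 'Correct').mean()))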
