I have a large dataset. I partitioned the data into training and test sets.
I found the missing values of the independent variables.
I want to count the number of columns that have missing values; in this case, I should get 12 column names. So far I have only been able to sum all the missing values together.
Here is my attempt:
finding_missing_values = data.train.isnull().sum()
finding_missing_values
finding_missing_values.sum()
Is there a way I can count the number of columns that have a missing value?
Convert to a list and then count the non-zero values, as follows:
finding_missing_values = data.train.isnull().sum().to_list()
number_of_missing_value_columns = sum(k > 0 for k in finding_missing_values)
print(number_of_missing_value_columns)

This should give 12.
You wrote:

finding_missing_values.sum()

You were looking for:

(finding_missing_values > 0).values.sum()

From .values we get a NumPy array. The comparison gives False/True values, which are conveniently treated as 0/1 by .sum().
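For instance, a minimal sketch on a toy DataFrame (the frame and its column names are invented for illustration):

import pandas as pd
import numpy as np

# toy frame: columns 'a' and 'c' contain missing values, 'b' does not
toy = pd.DataFrame({'a': [1, np.nan], 'b': [2, 3], 'c': [np.nan, np.nan]})

finding_missing_values = toy.isnull().sum()       # per-column NaN counts
print((finding_missing_values > 0).values.sum())  # -> 2

An equivalent one-liner is toy.isnull().any().sum(), since .any() flags each column that contains at least one NaN.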
I am working with the Titanic data set, which has 891 rows. At the moment I am focusing on the column 'Age'.
import pandas as pd
import numpy as np
import os
titanic_df = pd.read_csv('titanic_data.csv')
titanic_df['Age']
The column 'Age' has 177 NaN values, so I want to replace those values with values drawn from my sample. I have already made a sample for this column, as you can see in the code below.
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(177)
So the next step should be to replace the NaN values in titanic_df['Age'] with values from age_sample. To do this I tried these lines of code:
titanic_df['Age'] = age_sample
titanic_df['Age'].isna() = age_sample
But obviously I made some mistakes here. Can anybody help me replace values from the sample (177 rows) into the original data set (891 rows), replacing only the NaN values?
A two-line solution (note that this rebuilds the whole column, so the existing ages are no longer aligned with their original rows; the loc-based answers below preserve row alignment):

age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(177))

If the number of NaN values is not known:

nan_len = len(df['Age'][df['Age'].isna()])
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(nan_len))
You need to select the subframe you want to update using loc, and assign the sample's values positionally (via .values), since assigning a Series would align on its index and the sample's index does not match the rows being filled:

titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.values
I will divide my answer into two parts: the solution you are looking for, and suggestions to make it more robust.
Solution you are looking for
We have to find the number of missing values first, then draw that many values from the non-missing ones, and then assign. This ensures the sample has exactly the size needed to fill the missing values.
...
age_na_size = titanic_df['Age'].isna().sum()
# generate a sample of that size from the non-missing ages
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(age_na_size)
# feed that to the missing values; .values assigns positionally and avoids index alignment
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.values
Solutions to make it robust

Find the group mean age and replace missing values accordingly: for example, group by gender, cabin, or other features that make sense, and use the group median age as the replacement (see the sketch after this list).
Use k-Nearest Neighbours as the age replacer; see scikit-learn's KNNImputer (also sketched below).
Use bins of age instead of actual ages. That way you can first build a classifier to predict the age bin and then use that as your imputer.
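As an illustration of the first two suggestions, here is a minimal sketch. It assumes the Titanic frame has a 'Sex' column and numeric columns including 'Age', and that scikit-learn is installed; the two approaches are alternatives, not steps to run together:

import pandas as pd
from sklearn.impute import KNNImputer

# Alternative 1: fill missing ages with the median age of each gender group
titanic_df['Age'] = titanic_df.groupby('Sex')['Age'].transform(
    lambda s: s.fillna(s.median()))

# Alternative 2: k-Nearest Neighbour imputation over all numeric columns
numeric_cols = titanic_df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
titanic_df[numeric_cols] = imputer.fit_transform(titanic_df[numeric_cols])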
This is what I am trying to do. I was able to do steps 1 to 4; I need help with step 5 onward.
Basically, for each data point I would like to find the Euclidean distance from all mean vectors, based upon the y column.

1. take the data
2. separate out the non-numerical columns
3. find the mean vectors grouped by the y column
4. save the means
5. subtract the appropriate mean vector from each row, based upon its y value
6. square each column
7. sum all the columns
8. join back to the numerical dataset, and then join the non-numerical columns
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
print(df)

df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric with the matching class, then square each column of the result, and then for each row sum all the columns. Then join this data back to df_numeric and df_non_numeric.
Update 1

I added code as below. My questions have changed, and the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    return np.sum(np.square(row - means.head(1)), 1)

def calculate_distance2(row):
    return np.sum(np.square(row - means.tail(1)), 1)

df_numeric2 = df_numeric.drop("class", axis=1)
# np.sum(np.square(df_numeric2.head(1) - means.head(1)), 1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)

final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that pandas won't do the concat and the class assignment in a random order, and that it will maintain the order in which the rows appear:
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
I think this is what you want:
import pandas as pd
import numpy as np

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
print(df)

df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric = df.select_dtypes(exclude='number').copy()

# Subtract the per-class mean (calculated with the transform function, which
# preserves the number of rows) to get each row's distance-to-mean components
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')

# Finally calculate the Euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']

# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
I'm sure there is a better way to do this, but I iterated through class by class and followed the exact steps:

Assigned 'class' as the index.
Rotated the data so that 'class' was in the columns.
Subtracted the corresponding mean vector from the matching columns of df_numeric.
Squared the values.
Summed over the feature rows (one squared distance per sample).
Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
# print(df)

df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T

import numpy as np

# Changed the index to the class
df_numeric.index = df_numeric['class']
df_numeric.drop('class', axis=1, inplace=True)
# Rotated the numeric data sideways so the class is in the columns
df_numeric = df_numeric.T

# Iterated through the classes in means and picked the matching df_numeric columns
store = []  # an empty list to collect the per-class deviations
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):  # a single matching sample comes out as a pd.Series
        sto = sto.to_frame()        # need to convert it to a DataFrame
    store.append(sto.sub(means[j], axis=0))  # subtract the class-j mean vector
values = [s**2 for s in store]  # squaring the deviations
# Summing down the feature rows gives one squared distance per sample
summed = []
for i in values:
    summed.append(i.sum(axis=0))
df_new = pd.concat(summed)
print(df_new)
I need to group a Pandas dataframe by date and then take a weighted average of given values. Here's how it's currently done, using the margin value as an example (and it works perfectly until there are NaN values):
df = orders.copy()
# Create new columns as required
df['margin_WA'] = df['net_margin'].astype(float) # original data as str or Decimal
def group_wa():
    return lambda num: np.average(num, weights=df.loc[num.index, 'order_amount'])

agg_func = {
    'margin_WA': group_wa(),  # agg_func includes WAs for other elements
}
result = df.groupby('order_date').agg(agg_func)
result['margin_WA'] = result['margin_WA'].astype(str)
In the case where the 'net_margin' fields contain NaN values, the WA is set to NaN. I can't seem to dropna() or filter by pd.notnull when creating the new columns, and I don't know where to create a masked array to avoid passing NaN to the group_wa function (as suggested here). How do I ignore NaN in this case?
I think a simple solution is to drop the missing values before you groupby/aggregate, like:

result = df.dropna(subset=['margin_WA']).groupby('order_date').agg(agg_func)

In this case, no indices containing missing values are passed to your group_wa function.
Edit
Another approach is to move the dropna into your aggregating function, like:

def group_wa(series):
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

agg_func = {'margin_WA': group_wa}
result = df.groupby('order_date').agg(agg_func)
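As a quick sanity check, here is a toy frame (with invented numbers) showing that the NaN row no longer poisons the group average:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'order_date': ['2020-01-01', '2020-01-01', '2020-01-01'],
    'margin_WA': [0.1, np.nan, 0.3],
    'order_amount': [100.0, 50.0, 300.0],
})

def group_wa(series):
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

print(df.groupby('order_date').agg({'margin_WA': group_wa}))
# weighted average of 0.1 and 0.3 with weights 100 and 300 -> 0.25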
There are good solutions for imputing a pandas DataFrame. But since I work mainly with NumPy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a NumPy array, as follows:
nomDF = pd.DataFrame(x_nominal)  # convert the np.array to a pd.DataFrame
nomDF = nomDF.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
x_nominal = nomDF.values  # convert the pd.DataFrame back to a np.array
Is there a way to impute directly in a NumPy array?
We could use SciPy's mode to get the most frequent value in each column. The leftover work is to get the NaN indices and replace those entries in the input array with the column modes by indexing.
So the implementation would look something like this:
import numpy as np
from scipy.stats import mode

R, C = np.where(np.isnan(x_nominal))       # row/column indices of the NaNs
vals = mode(x_nominal, axis=0)[0].ravel()  # per-column modes
x_nominal[R, C] = vals[C]                  # fill each NaN with its column's mode
Please note that with pandas' value_counts we would be choosing the highest value in case of ties (many categories/elements with the same highest count), while SciPy's mode picks the lowest one in such tie cases.
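A tiny example of that tie-breaking difference (values invented; the ordering pandas uses for ties is also version-dependent):

import numpy as np
from scipy.stats import mode

arr = np.array([1, 1, 2, 2])  # 1 and 2 are tied
print(mode(arr, axis=0)[0])   # SciPy reports the smallest tied value: 1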
If you are dealing with such a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work:

x_nominal_U3 = x_nominal.astype('U3')   # view the data as fixed-width strings
R, C = np.where(x_nominal_U3 == 'nan')  # NaNs now show up as the string 'nan'
vals = mode(x_nominal_U3, axis=0)[0].ravel()
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore the NaNs for that mode calculation, we should be okay there.