Exclude rows which have NA value for a column [duplicate] - python

This question already has answers here:
How to drop rows of Pandas DataFrame whose value in a certain column is NaN
(15 answers)
Closed 1 year ago.
This is a sample of my data (image not shown).
I have written this code, which removes all categorical columns (e.g. MSZoning). However, some non-categorical columns have NA values. How can I exclude them from my data set?
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def main():
    print('Starting program execution')
    iowa_train_prices_file_path = 'C:\\...\\programs\\python\\kaggle_competition_iowa_house_prices_train.csv'
    iowa_file_data = pd.read_csv(iowa_train_prices_file_path)
    print('Read file')
    model_random_forest = RandomForestRegressor(random_state=1)
    features = ['MSSubClass', 'MSZoning', ...]
    y = iowa_file_data.SalePrice
    # every column except SalePrice
    X = iowa_file_data.drop('SalePrice', axis=1)
    # The object dtype indicates a column has text (hint that the column is categorical)
    X_dropped = X.select_dtypes(exclude=['object'])
    print("fitting model")
    model_random_forest.fit(X_dropped, y)
    print("MAE of dropped categorical approach")

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
main()
When I run the program, I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float32'), which I believe is due to the NA value at Id=8.
Question 1 - How do I remove such rows entirely?
Question 2 - What is the type of such columns, which are mostly numbers but have text in between? I thought I would do print("X types", type(X.columns)) but that doesn't give the result.

To remove NaNs, you can replace them with another value; it is common practice to use zeros.
iowa_file_data = iowa_file_data.fillna(0)
If you still want to remove the whole column, use
iowa_file_data = iowa_file_data.dropna(axis='columns')
And if you want to remove the entire row, use
iowa_file_data = iowa_file_data.dropna()
For your second question, from what I understand, you might want to read the pandas documentation on the object dtype.
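If you only want to drop the rows where particular, otherwise numeric, columns are NA (rather than filling them or dropping whole columns), dropna(subset=...) does that. A short sketch, where LotFrontage is purely an assumed example of such a column, and where X.dtypes answers your second question:
# drop only the rows that have NA in the listed columns (LotFrontage is an assumed example)
X_dropped = X_dropped.dropna(subset=['LotFrontage'])
y = y.loc[X_dropped.index]  # keep the target aligned with the surviving rows
# dtypes (rather than type(X.columns)) shows each column's type;
# a mostly-numeric column containing some text shows up as 'object'
print(X.dtypes)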

Related

new column returns NaN

Upon adding a new column ('as_%_of_revenue') to my pandas data frame, I get NaN throughout the whole column. I intend to find each row's percentage of revenue by dividing each row by 'total revenue'. To do this I used the code:
income_statement = pd.DataFrame.from_dict(IS["annualReports"][0], orient='index')
income_statement = income_statement[3:26]
income_statement.columns = ['currentYear']
income_statement['as_%_of_revenue'] = income_statement['currentYear'] / income_statement.iloc[0]
Here income_statement['currentYear'] is a single column with 26 rows of different values, and income_statement.iloc[0] is a single row holding one value ('total revenue').
I understand the dimensions of the two are not the same, which may be why I cannot do the calculation. I tried an alternative using the set_index command, although it didn't seem to work either.
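For what it's worth, the NaNs here typically come from pandas aligning on the index during division: income_statement.iloc[0] is a Series whose index is the column label, not a scalar. A tiny sketch with made-up numbers:
import pandas as pd

income_statement = pd.DataFrame({'currentYear': [100.0, 40.0, 25.0]},
                                index=['totalRevenue', 'costOfRevenue', 'grossProfit'])
# iloc[0] is a Series indexed by column name ('currentYear'), so dividing by it
# aligns on the index labels and produces NaN everywhere
print(income_statement['currentYear'] / income_statement.iloc[0])
# pulling out the scalar instead divides every row by total revenue
total_revenue = income_statement['currentYear'].iloc[0]
income_statement['as_%_of_revenue'] = income_statement['currentYear'] / total_revenue
print(income_statement)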

I do not understand the behavior of pandas.drop, since I get different results from dropna (too many rows are dropped)

I have a DataFrame with some NA and I want to drop the rows where a particular column has NA values.
My first trial has been:
- Identifying the rows where the specific column values were NA
- Pass them to pandas.drop()
In my specific case, I have a DataFrame of 39164 rows by 40 columns.
If I look for NA in the specific column, I find 17715 affected labels, which I saved to a dedicated variable. I then passed them to pandas.drop(), expecting about 22000 rows remaining, but I only got 2001.
If I use pandas.dropna() I get 21449 rows remaining, which is what I was expecting.
Here follows my code.
The first code portion downloads data from gouv.fr (sorry for not using fake data... but it should take less than 10 seconds to execute).
WARNING: only the last 5 years are stored in the online database, so my example may need to be adapted later...
import pandas as pd

villes = {'Versailles': '78646',
          'Aix-en-Provence': '13001'}
years = range(2014, 2019)
root = "https://cadastre.data.gouv.fr/data/etalab-dvf/latest/csv/"

data = pd.DataFrame({})
for ville in villes.keys():
    for year in years:
        file_addr = '/'.join([root, str(year), 'communes', villes[ville][:2], villes[ville] + '.csv'])
        print(file_addr)
        tmp = pd.read_csv(file_addr)
        data = pd.concat([data, tmp])
This is the second portion, where I try to drop some rows. As said, the results are very different depending on the chosen strategy (data_1 vs data_2); data_2, obtained with dropna(), is the expected result.
print(data.shape)
undefined_surface = data.index[data.surface_reelle_bati.isna()]
print(undefined_surface)
data_1 = data.drop(undefined_surface)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_1.shape)
print(data_2.shape)
Using dropna() is totally fine for me, but I would like to understand what I was doing wrong with drop(), since I got a very silly result compared to my expectation and I would like to be aware of it in the future.
Thanks in advance for your help.
This is because your index is not unique: each CSV you concatenate keeps its own index starting at 0, so many rows share the same label. Look, for example, at label 0:
data_idx0 = data.loc[0]
data_idx0.shape
# (one row per concatenated file, 40 columns)
If at least one of the rows labelled 0 has surface_reelle_bati missing, drop() removes every row carrying that label from data_1. This is why you drop far more rows when creating data_1 than when creating data_2.
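A minimal repro of this behaviour, with made-up values rather than the gouv.fr data:
import pandas as pd

a = pd.DataFrame({'surface': [10.0, None]}, index=[0, 1])
b = pd.DataFrame({'surface': [20.0, 30.0]}, index=[0, 1])
data_demo = pd.concat([a, b])          # index is now 0, 1, 0, 1

na_labels = data_demo.index[data_demo['surface'].isna()]   # Index([1])
# drop() removes *every* row carrying label 1, not just the NaN one
print(data_demo.drop(na_labels).shape)             # (2, 1)
# dropna() removes only the row that actually contains NaN
print(data_demo.dropna(subset=['surface']).shape)  # (3, 1)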
To solve this, use reset_index() so the index runs from 0 to the number of rows of data:
data = data.reset_index()
undefined_surface = data.index[data.surface_reelle_bati.isna()].tolist()
data_1 = data.drop(undefined_surface)
print(data_1.shape)
# (21449, 41)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_2.shape)
# (21449, 41)
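For completeness, a boolean mask gives the same result as dropna() here and sidesteps the label question entirely, because it selects rows by a True/False vector rather than by index label (just a sketch, equivalent to your data_2):
data_3 = data[data['surface_reelle_bati'].notna()]
print(data_3.shape)
# same shape as data_2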

Remove outliers from the target column when an independent variable column has a specific value

I have a dataframe that looks as follows (see the link below):
df.head(10)
https://ibb.co/vqmrkXb
What I would like to do is remove outliers from the target column (occupied_parking_spaces) when the value of the day column is equal to 6, for instance, which refers to Sunday (df['day'] == 6), using the normal distribution 68-95-99.7 rule.
I tried the following code :
df = df.mask((df['occupied_parking_spaces'] - df['occupied_parking_spaces'].mean()).abs() > 2 * df['occupied_parking_spaces'].std()).dropna()
This line of code removes outliers from the whole dataset regardless of the independent variables, but I only want to remove outliers from the occupied_parking_spaces column where the day value is equal to 6, for example.
What I can do is to create a different dataframe for which I will remove outliers:
sunday_df = df.loc[df['day'] == 0]
sunday_df = sunday_df.mask((sunday_df['occupied_parking_spaces'] - sunday_df['occupied_parking_spaces'].mean()).abs() > 2 * sunday_df['occupied_parking_spaces'].std()).dropna()
But by doing this I will get multiple dataframes, one for every day of the week, that I will have to concatenate at the end, and this is something I do not want to do, as there must be a way to do this inside the same dataframe.
Could you please help me out?
Having defined some function to remove outliers, you could use np.where to apply it selectively:
import numpy as np
df['occupied_parking_spaces'] = np.where(
    df['day'] == 0,
    remove_outliers(df['occupied_parking_spaces']),
    df['occupied_parking_spaces']
)
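If the goal is to actually drop the Sunday outliers while keeping every other row untouched, one way (a sketch, assuming the 2-standard-deviation cutoff from your own code and taking df['day'] == 6 for Sunday as in your description) is to build a single boolean mask on the same dataframe:
col = 'occupied_parking_spaces'
sunday = df['day'] == 6
# deviation from the mean, computed on the Sunday rows only
dev = (df.loc[sunday, col] - df.loc[sunday, col].mean()).abs()
is_outlier = dev > 2 * df.loc[sunday, col].std()
# keep every non-Sunday row, plus the Sunday rows that are not outliers
df_filtered = df[~sunday | ~is_outlier.reindex(df.index, fill_value=False)]
The same idea extends to all days at once by computing the mean and standard deviation with groupby('day').transform.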

Finding euclidean distance from multiple mean vectors

This is what I am trying to do - I was able to do steps 1 to 4. Need help with steps 5 onward
Basically, for each data point I would like to find the Euclidean distance from all the mean vectors, based upon column y.
1. take data
2. separate out non-numerical columns
3. find mean vectors by y column
4. save means
5. subtract each mean vector from each row based upon y value
6. square each column
7. add all columns
8. join back to numerical dataset and then join non-numerical columns
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then take the square of each column in the output, and for each row add up all the columns. Then join this data back to df_numeric and df_non_numeric.
-------------- Update 1
I added the code below. My questions have changed and the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    return np.sum(np.square(row - means.head(1)), 1)

def calculate_distance2(row):
    return np.sum(np.square(row - means.tail(1)), 1)

df_numeric2 = df_numeric.drop("class", axis=1)
# np.sum(np.square(df_numeric2.head(1) - means.head(1)), 1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the result? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and the class assignment in a random order, and that the order in which rows appear is maintained.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even denser, but this way you'll see what's going on.
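If, as in the update, you want a separate distance to each class mean for every row (distance0, distance1, ...) rather than the distance to the row's own class mean, a sketch along the same lines (reusing the df_numeric and df_non_numeric frames from above) could be:
import numpy as np
import pandas as pd

feature_cols = ['Age', 'weight']
means = df_numeric.groupby('class')[feature_cols].mean()

dists = df_numeric[feature_cols].copy()
for cls, mean_vec in means.iterrows():
    # Euclidean distance of every row to this class's mean vector
    dists[f'distance{int(cls)}'] = np.sqrt(((df_numeric[feature_cols] - mean_vec) ** 2).sum(axis=1))

final = pd.concat([df_non_numeric, dists], axis=1)  # concat on axis=1 keeps the row order
final['class'] = df['class']
print(final)
Drop the np.sqrt if you only want the summed squared differences, as in your calculate_distance functions.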
I'm sure there is a better way to do this, but I iterated through depending on the class and followed the exact steps:
Assigned the 'class' as the index.
Rotated so that the 'class' was in the columns.
Performed that operation of means that corresponded with df_numeric
Squared the values.
Summed the rows.
Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
#print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T
# Changed index: label each row with its class
df_numeric = df_numeric.set_index('class')
# Rotated the numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterated through the classes in means and picked the matching df_numeric columns
store = []  # Assigned an empty list
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):  # a single data point comes out as a pd.Series
        sto = sto.to_frame()        # need to convert it to a DataFrame
    # subtract this class's mean vector, aligned on the feature rows
    store.append(sto.sub(means[j], axis=0))
values = [frame ** 2 for frame in store]  # squaring the differences
# Summing over the feature rows gives one squared distance per data point
summed = []
for i in values:
    summed.append(i.sum(axis=0))
df_new = pd.concat(summed)  # one squared distance per data point, labelled by its class
print(df_new)

Trying to divide a dataframe column by a float yields NaN

Background
I deal with a CSV datasheet that prints out columns of numbers. I am working on a program that will take the first column, ask a user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtract that number from the first column. I have been successful in that regard. Now, I need to find the row index of the "zero" time point. I use min to find that index and then use it on the following column, A1.1. I need to find the reading at time 0 and then normalize to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can pull from for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value is a float when I use type(r1).
However, when I divide df[' A1.1'] / r1, it yields only one correct value, and that value is where r1/r1 = 1. All other values come out as NaN.
My Questions:
How do I divide a column by a float, and why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. A2/r2, A3/r3, etc.)?
Do I need to use inplace=True anywhere to make the operations stick prior to resaving the data, or is that only for adding/deleting rows?
Example
Dataframe that looks like this: http://i.imgur.com/ObUzY7p.png
Zero time sets properly (image not shown).
After dividing the column: http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is because r1 is a Series. Try r1? instead of type(r1) and pandas will tell you that r1 is a Series, not an individual float.
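If you do want to keep your original division, a small sketch (using the leading-space column name from your question) is to pull the single value out of that Series first so the division broadcasts as a scalar:
# take the one value out of the one-row Series
r1 = zero_location_series[' A1.1'].iloc[0]
df[' A1.1'] = df[' A1.1'] / r1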
To do it in one attempt, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c] / df[c].min()
If you want to divide every value in the column by r1, it's best to use apply, for example:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5])
# apply an anonymous function to the first column ([0]): divide every value
# in the column by 3
df[0] = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1, 0)
This really only answers part 2 of your question. apply is probably your best bet for running a function on multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float, is it possible the values in your columns are anything other than floats or integers?
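For all 16 columns at once, a sketch that divides every reading column by its own value at the zero-time row (the column names here are assumptions following the question's ' A1.1' pattern; adjust them to your real headers):
# the row whose adjusted time is closest to zero
zero_idx = df['A1'].abs().idxmin()
# hypothetical list of the reading columns to normalize
reading_cols = [' A1.1', ' A2.1', ' A3.1']  # ... out to your 16 columns
# dividing a DataFrame by a Series aligns the Series on the column labels,
# so each column is divided by its own reading at the zero-time row
df[reading_cols] = df[reading_cols] / df.loc[zero_idx, reading_cols]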
