I am trying to use scikit-learn's KNNImputer to impute missing values in my dataframe.
Here is my dataframe:
pd.DataFrame(numeric_data)
age bmi children charges
0 19 NaN 0.0 16884.9240
1 18 33.770 1.0 NaN
2 28 33.000 3.0 4449.4620
3 33 22.705 0.0 NaN
Here is how I apply the imputer and output the dataframe:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data))
This gives:
0 1 2 3
0 19.0 34.0850 0.0 16884.924000
1 18.0 33.7700 1.0 6309.517125
2 28.0 33.0000 3.0 4449.462000
3 33.0 22.7050 0.0 4610.464925
How do I do the same without losing my column names? Can I store the column names somewhere and add them back later, or can I impute without the column names being lost in the first place?
I have tried to exclude the column but I get the following error:
ValueError: could not convert string to float: 'age'
This should give you the desired result:
imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data), columns=numeric_data.columns)
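fit_transform returns a plain NumPy array, which is why both the column names and the index are dropped. If your row index matters too, you can carry it across the same way:
impute_data = pd.DataFrame(
    imputer.fit_transform(numeric_data),
    columns=numeric_data.columns,
    index=numeric_data.index,
)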
Let's say I have a dataset like below:
I want to replace the null values with the median of each column. But when I try to do that, every NA is replaced with the median of the first column only.
Rough_df = pd.read_excel(r'Cleandata_withOutliers.xlsx', sheet_name='Sheet2')
Rough_df.fillna(Rough_df.select_dtypes(include='number').median().iloc[0], inplace=True)
My output looks like this:
But ideally, the NA values in the 2nd column should be replaced with 10170.5 and not with 77.5. Where am I going wrong?
You can just pass the per-column medians to fillna:
out = df.fillna(df.median())
Out[68]:
X Y
0 60.0 9550.0
1 85.0 10170.5
2 77.5 10791.0
3 101.0 14215.0
4 47.0 16321.0
5 108.0 10170.5
6 77.5 8658.0
7 70.0 7945.0
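For reference, the original attempt filled everything with 77.5 because .median().iloc[0] reduces the per-column medians to a single scalar (the first column's median), and fillna with a scalar fills every column with that one value, while fillna with a Series (as above) matches each column by label. If you still want to restrict the fill to numeric columns, a sketch along the lines of the original code:
num_cols = Rough_df.select_dtypes(include='number').columns
Rough_df[num_cols] = Rough_df[num_cols].fillna(Rough_df[num_cols].median())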
ValueError: Input contains NaN
I have run:
from sklearn.preprocessing import OrdinalEncoder

data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])
Here is data_:
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
Before doing any processing, you always have to preprocess the data and get a summary of what it looks like. Concretely, the error you obtained is telling you that you have NaN values. To check, try this command:
df.isnull().any().any()
If the output is True, you have NaN values. You can run the next command if you want to know where these NaN values are:
df.isnull().any()
Then you will know which columns contain NaN values.
Once you know you have NaN values, you have to preprocess them (drop, fill, ... whatever you believe is the best option). The link gtomer posted in the comments is a nice resource.
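For example, a minimal sketch for the data_ shown above (filling missing categories with a placeholder label is only one of the options; the right strategy depends on your data):
# The columns passed to OrdinalEncoder must be free of NaN first.
cols = data_.columns[1:-1]
data_[cols] = data_[cols].fillna('missing')
data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])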
I am having a little problem with pandas.concat
Namely, I am concatenating a dataframe with 3 series. The dataframe and 2 of the series concatenate as expected. One series, however, is being attached to the bottom of my new dataframe instead of as a column.
Here is my minimal working example. To get the output below, run it on the titanic Kaggle dataset.
#INCLUDED ONLY SO MY CODE WILL RUN ON YOUR MACHINE. IGNORE.
def bin_dump(data, increment):
    if data <= increment:
        return f'0 - {increment}'
    if data % increment == 0:
        return f'{data - increment} - {data}'
    else:
        m = data % increment
        a = data - m
        b = data + (increment - m)
        return f'{a} - {b}'
#INCLUDED SO MY CODE WILL RUN ON YOUR MACHINE. IGNORE
train_df['AgeGroup'] = train_df.apply(lambda x: bin_dump(x.Age, 3), axis=1)
# THE PROBLEM IS ACTUALLY IN THIS METHOD:
def plot_dists(X, Y, input_df, percent_what):
    totals = input_df[X].value_counts()
    totals.name = 'totals'
    df = pd.Series(totals.index).str.extract(r'([0-9]+)').astype('int64')
    df.columns = ['index']
    values = pd.Series(totals.index, name=X)
    percentages = []
    for group, total in zip(totals.index, totals):
        x = input_df.loc[(input_df[X] == group) & (input_df[Y] == 1), Y].sum()
        percent = 1 - x/total
        percentages.append(percent)
    percentages = pd.Series(percentages, name='Percentages')
    # THE PROBLEM IS HERE:
    df = pd.concat([df, values, totals, percentages], axis=1).set_index('index').sort_index(axis=0)
    return df
The output looks like this:
AgeGroup totals Percentages
index
0.0 0 - 3 NaN 0.333333
3.0 3.0 - 6.0 NaN 0.235294
6.0 6.0 - 9.0 NaN 0.666667
9.0 9.0 - 12.0 NaN 0.714286
12.0 12.0 - 15.0 NaN 0.357143
15.0 15.0 - 18.0 NaN 0.625000
18.0 18.0 - 21.0 NaN 0.738462
21.0 21.0 - 24.0 NaN 0.57534
. . . .
. . . .
. . . .
NaN NaN 11.0 NaN
NaN NaN 15.0 NaN
NaN NaN 9.0 NaN
NaN NaN 6.0 NaN
So, the 'totals' are being appended as a dataframe on the bottom.
In addition to trying to fix this concat/append issue, I'd welcome any suggestions on how to optimize my code. This is my first go at building my own tool for visualizing data (I cut out the plotting part because it's not really part of the question).
Check this out: did you try changing from concat to merge?
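If you would rather keep concat, the likely culprit (inferred from the output above) is index alignment: pd.concat with axis=1 aligns on the index, and totals still carries the AgeGroup strings as its index (it comes straight from value_counts), while df, values, and percentages all use a default RangeIndex. A sketch of the fix, resetting that index before concatenating:
df = pd.concat(
    [df, values, totals.reset_index(drop=True), percentages],
    axis=1,
).set_index('index').sort_index(axis=0)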
I have a dataframe called res_df:
In [54]: res_df.head()
Out[54]:
Bldg_Sq_Ft GEOID CensusPop HU_Pop Pop_By_Area
0 753.026123 240010013002022 11.0 7.0 NaN
7 95.890495 240430003022003 17.0 8.0 NaN
8 1940.862793 240430003022021 86.0 33.0 NaN
24 2254.519775 245102801012021 27.0 13.0 NaN
25 11685.613281 245101503002000 152.0 74.0 NaN
I have a second dataframe made from the summarized information in res_df. It's grouped by the GEOID column and then summarized using aggregations to get the sum of the Bldg_Sq_Ft and the mean of the CensusPop columns for each unique GEOID. Let's call it geoid_sum:
In [55]: geoid_sum = res_df.groupby('GEOID').agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'})
In [56]: geoid_sum.head()
Out[56]:
GEOID Bldg_Sq_Ft CensusPop
GEOID
100010431001011 1 1154.915527 0.0
100030144041044 1 5443.207520 26.0
100050519001066 1 1164.390503 4.0
240010001001001 15 30923.517090 41.0
240010001001007 3 6651.656677 0.0
My goal is to find the GEOIDs in res_df that match the GEOIDs in geoid_sum. I want to populate the value in Pop_By_Area for each row using the equation:
Pop_By_Area = (geoid_sum['CensusPop'] * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft']
I've created a simple function that takes those parameters, but I am unsure how to iterate through the dataframes and apply the function.
def popByArea(census_pop_mean, bldg_sqft, bldg_sqft_sum):
    x = (census_pop_mean * bldg_sqft) / bldg_sqft_sum
    return x
I've tried creating a series based on the GEOID matches: s = res_df.GEOID.isin(geoid_sum.GEOID.values) but that didn't seem to work (produced all false boolean values). How can I find the matches and apply my function to populate the Pop_By_Area column?
I think you need reindex. (Note that after your groupby/agg, GEOID is the index of geoid_sum and its GEOID column holds the counts, which is why your isin check came back all False.)
geoid_sum = res_df.groupby('GEOID').\
    agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'}).\
    reindex(res_df['GEOID'])
res_df['Pop_By_Area'] = (geoid_sum['CensusPop'].values * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft'].values
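As an alternative sketch (using only the res_df shown above), groupby(...).transform broadcasts the per-GEOID aggregates back onto each row directly, avoiding the intermediate reindex:
sqft_sum = res_df.groupby('GEOID')['Bldg_Sq_Ft'].transform('sum')
pop_mean = res_df.groupby('GEOID')['CensusPop'].transform('mean')
res_df['Pop_By_Area'] = (pop_mean * res_df['Bldg_Sq_Ft']) / sqft_sum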
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
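First, the None: with inplace=True, dropna modifies df in place and returns None, so df = df.dropna(how="any", inplace=True) rebinds df to None. Use one form or the other:
df = df.dropna(how="any")           # returns a new dataframe
df.dropna(how="any", inplace=True)  # modifies df in place, returns None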
Beyond that, as per OP's comment, the NA here is the string 'NA' rather than NaN, so dropna() is no good anyway. One of many possible options for filtering out the string value 'NA' is:
df = df[df["_c2"] != "NA"]
A better option to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any non-numeric strings rather than only 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
This will work too. If the NA in your df is NaN (np.nan), it will not affect getting the mean of the column; the problem only arises when your NA is the string 'NA':
df.apply(pd.to_numeric, errors='coerce', axis=1).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info:
df.apply(pd.to_numeric, errors='coerce', axis=1)  # every non-numeric value becomes NaN and is excluded from the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0