How to fill the NaN values with Mode in Python Pandas dataset? - python

In my data sets (train, test), max_floor is null for some records. I am trying to fill the null values with the mode of max_floor for apartments that share the same apartment name:
for t in full.apartment_name.unique():
    for df in frames:
        df['max_floor'].fillna((df.loc[df["apartment_name"]==t,
                                'max_floor']).mode, inplace=True)
where full is train.append(test) and frames is [train, test].
The code runs without errors, but instead of the expected result it fills all the null max_floor values with the text below:
bound method Series.mode of 0 NaN
1084 NaN
23278 9.0
Name: max_floor, dtype: float64
I just want the nulls replaced with actual max_floor values instead of the text above. Any help would be appreciated.

mode() is a method; you've referred to it but not called it.
Change mode to mode().
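To make the difference concrete, here is a minimal sketch using a made-up Series (the values are hypothetical, not taken from the asker's data). It shows that `s.mode` is only a reference to the method, `s.mode()` returns a Series of modes, and indexing that result gives a single scalar.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one apartment's max_floor column.
s = pd.Series([np.nan, np.nan, 9.0, 9.0, 5.0], name="max_floor")

method_ref = s.mode        # a bound method object, nothing is computed yet
modes = s.mode()           # a Series holding the most frequent value(s); NaN is ignored
first_mode = s.mode()[0]   # a single scalar, suitable as a fill value

print(type(modes).__name__)
print(first_mode)
```

Passing `method_ref` to fillna is what produced the "bound method Series.mode of ..." text in the question.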

You need to access the first value from the mode() result. For example:
A B
0 1 3.0
1 2 NaN
2 2 NaN
3 3 NaN
Fill the missing values with the mode of column A:
df.fillna(df['A'].mode()[0])
Output:
A B
0 1 3.0
1 2 2.0
2 2 2.0
3 3 2.0
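For the original question (the mode per apartment name rather than one global mode), one possible approach is a groupby with transform. This is only a sketch: the sample frame and the helper name `first_mode` are made up for illustration, and a group whose values are all NaN is simply left unfilled.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for one of the train/test frames.
df = pd.DataFrame({
    "apartment_name": ["A", "A", "A", "B", "B", "B", "C"],
    "max_floor": [9.0, 9.0, np.nan, 5.0, np.nan, 5.0, np.nan],
})

def first_mode(s):
    """Return the first mode of a Series, or NaN if it has no non-null values."""
    modes = s.mode()  # NaN values are ignored by default
    return modes.iloc[0] if not modes.empty else np.nan

# transform broadcasts each group's mode back onto the original row index,
# so fillna can align the fill values row by row.
group_mode = df.groupby("apartment_name")["max_floor"].transform(first_mode)
df["max_floor"] = df["max_floor"].fillna(group_mode)
print(df)
```

Apartment C has no known max_floor at all, so its value stays NaN; a global fallback could be applied afterwards if that is desired.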

Related

Pandas .iloc indexing coupled with boolean indexing in a Dataframe

I looked into existing threads on indexing, but none of them addresses this use case.
I would like to alter specific values in a DataFrame based on their position: the values in the second column, from the first to the fourth row, should become NaN, and the values in the third column, first and second rows, should become NaN. Say we have the following `DataFrame`:
df = pd.DataFrame(np.random.standard_normal((7,3)))
print(df)
0 1 2
0 -1.102888 1.293658 -2.290175
1 -1.826924 -0.661667 -1.067578
2 1.015479 0.058240 -0.228613
3 -0.760368 0.256324 -0.259946
4 0.496348 0.437496 0.646149
5 0.717212 0.481687 -2.640917
6 -0.141584 -1.997986 1.226350
I want to alter df as shown below with the least amount of code:
0 1 2
0 -1.102888 NaN NaN
1 -1.826924 NaN NaN
2 1.015479 NaN -0.228613
3 -0.760368 NaN -0.259946
4 0.496348 0.437496 0.646149
5 0.717212 0.481687 -2.640917
6 -0.141584 -1.997986 1.226350
I tried using boolean indexing with .loc, but it resulted in an error:
df.loc[(:2,1:) & (2:4,1)] = np.nan
# exception message:
df.loc[(:2,1:) & (2:4,1)] = np.nan
^
SyntaxError: invalid syntax
I also thought about converting the DataFrame to a NumPy ndarray, but then I wouldn't know how to use boolean indexing in that case.
One way is to define the requirement explicitly as a mapping from column position to number of rows, then assign:
d = {1: 4, 2: 2}
for col, val in d.items():
    df.iloc[:val, col] = np.nan
print(df)
0 1 2
0 -1.102888 NaN NaN
1 -1.826924 NaN NaN
2 1.015479 NaN -0.228613
3 -0.760368 NaN -0.259946
4 0.496348 0.437496 0.646149
5 0.717212 0.481687 -2.640917
6 -0.141584 -1.997986 1.226350
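As a complement to the loop above, the sketch below shows two other ways to express the same positional assignment: two direct .iloc statements, and a boolean NumPy mask passed to DataFrame.mask (closer to the boolean indexing the question was reaching for). The seeded random generator and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is reproducible
original = pd.DataFrame(rng.standard_normal((7, 3)))

# Option 1: direct positional assignment on a copy.
direct = original.copy()
direct.iloc[:4, 1] = np.nan   # second column, first four rows
direct.iloc[:2, 2] = np.nan   # third column, first two rows

# Option 2: build a boolean mask with the same shape and blank the True cells.
mask = np.zeros(original.shape, dtype=bool)
mask[:4, 1] = True
mask[:2, 2] = True
masked = original.mask(mask)  # cells where mask is True become NaN

print(direct)
```

Both options yield the same frame; the mask version leaves the original untouched and returns a new DataFrame.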

Why do we sometimes have to add .values when doing element-wise operations in pandas?

Suppose I have a dataframe that looks like
A
0 0
1 1
2 2
3 3
and when I run:
a = df.loc[np.arange(0,2)] / df.loc[np.arange(2,4)]
I get
A
0 NaN
1 NaN
2 NaN
3 NaN
I know I could get the right result by writing
a = df.loc[np.arange(0,2)].values / df.loc[np.arange(2,4)]
b = df.loc[np.arange(0,2)] / df.loc[np.arange(2,4)].values
Can anyone explain why?
Pandas aligns on index and column labels: when you perform a calculation, the labels are matched first and the operation is applied to the matching pairs, while labels without a match produce NaN. If you only want to match values by position and ignore the index and columns, strip the labels from one operand with .values or to_numpy(). Index alignment also has its advantages, as the second example shows.
Example 1: the indexes do not match, so the result is NaN
s1=pd.Series([1],index=[1])
s2=pd.Series([1],index=[999])
s1/s2
1 NaN
999 NaN
dtype: float64
s1.values/s2.values
array([1.])
Example 2: one index label matches, so pandas returns a value for that label and NaN for the other
s1=pd.Series([1],index=[1])
s2=pd.Series([1,999],index=[1,999])
s1/s2
1 1.0
999 NaN
dtype: float64
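Applied to the DataFrame from the question, the same alignment rule can be sketched as follows. The variable names `top`, `bottom`, `aligned`, and `by_position` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3]})
top = df.loc[[0, 1]]      # index labels 0 and 1
bottom = df.loc[[2, 3]]   # index labels 2 and 3

# Label-based division: no index label is shared, so every cell is NaN
# and the result carries the union of both indexes (0..3).
aligned = top / bottom

# Position-based division: converting one side to a NumPy array removes its
# labels, so rows are paired by position and the result keeps top's index.
by_position = top / bottom.to_numpy()

print(aligned)
print(by_position)
```

If the two pieces are meant to be paired by position, another option is to give them a common index first, for example with reset_index(drop=True) on both.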

Trying to fill NaNs with fillna() and groupby()

So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location,etc). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols=['reviews_per_month','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value']
for i in cols:
    airbnb[i]=airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it runs without any error, it does not fill the NaN values: when I check afterwards, the number of NaNs hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!
The problem is that fillna aligns on the index when it is given a Series. The index of airbnb.groupby('host_id')[i].mean() consists of the host_id values, not the original index of airbnb, so the labels do not line up and the NaNs are not filled as you expect. There are several ways to fix this. One is to use transform after the groupby: it broadcasts each group's mean back onto the original index, so fillna then works as expected:
for i in cols:
    airbnb[i]=airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even apply the same method without a loop:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
For example:
airbnb = pd.DataFrame({'host_id':[1,1,1,2,2,2],
'reviews_per_month':[4,5,np.nan,9,3,5],
'review_scores_rating':[3,np.nan,np.nan,np.nan,7,8]})
print (airbnb)
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 NaN 5.0
2 1 NaN NaN
3 2 NaN 9.0
4 2 7.0 3.0
5 2 8.0 5.0
and you get:
cols=['reviews_per_month','review_scores_rating'] # would work with all your columns
print (airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 3.0 5.0
2 1 3.0 4.5
3 2 7.5 9.0
4 2 7.0 3.0
5 2 8.0 5.0
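The index mismatch described above can be checked directly. The sketch below uses made-up host ids (101 and 102) that deliberately do not coincide with the row labels 0 to 5, which mirrors a real data set where host ids are large numbers; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

airbnb = pd.DataFrame({
    "host_id": [101, 101, 101, 102, 102, 102],
    "review_scores_rating": [3, np.nan, np.nan, np.nan, 7, 8],
})
col = "review_scores_rating"

# Indexed by host_id (101, 102): these labels do not exist in airbnb's index.
group_means = airbnb.groupby("host_id")[col].mean()
not_filled = airbnb[col].fillna(group_means)

# Indexed like airbnb itself (0..5): every row gets its own host's mean.
row_means = airbnb.groupby("host_id")[col].transform("mean")
filled = airbnb[col].fillna(row_means)

print(not_filled.isna().sum(), filled.isna().sum())
```

Note that if host ids happened to coincide with row labels, the first variant would silently fill some rows with the wrong host's mean, which is arguably worse than filling nothing.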

Pandas: Sum multiple columns, but write NaN if any column in that row is NaN or 0

I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns are blank (NaN or 0), I need the new column to also be written as blank (NaN)
a b c d sum
3 5 7 4 19
2 6 0 2 NaN (note the 0 in column c)
4 NaN 3 7 NaN
I am currently using the DataFrame sum method, formatted like this
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice
Replace the 0 values with np.nan, then pass skipna=False so that any NaN in a row makes the row's sum NaN:
df.replace(0, np.nan).sum(axis=1, skipna=False)
0 19.0
1 NaN
2 NaN
dtype: float64
df['sum'] = df[['a','b','c','d']].replace(0, np.nan).sum(axis=1, skipna=False)
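A self-contained version of this approach, rebuilt from the three sample rows in the question, might look like the sketch below. The `cols` list is an assumption that restricts the sum to the four source columns.

```python
import numpy as np
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({
    "a": [3, 2, 4],
    "b": [5, 6, np.nan],
    "c": [7, 0, 3],
    "d": [4, 2, 7],
})
cols = ["a", "b", "c", "d"]

# Treat 0 as missing, then refuse to skip missing values when summing:
# any row containing a NaN (or a former 0) gets NaN as its sum.
df["sum"] = df[cols].replace(0, np.nan).sum(axis=1, skipna=False)
print(df)
```

The replace call works on a temporary copy, so the zeros in the source columns themselves are left as they were.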

thresh in dropna for DataFrame in pandas in python

df1 = pd.DataFrame(np.arange(15).reshape(5,3))
df1.iloc[:4,1] = np.nan
df1.iloc[:2,2] = np.nan
df1.dropna(thresh=1 ,axis=1)
It seems that no NaN values have been dropped:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN 8.0
3 9 NaN 11.0
4 12 13.0 14.0
If I run
df1.dropna(thresh=2,axis=1)
why does it give the following?
0 2
0 0 NaN
1 3 NaN
2 6 8.0
3 9 11.0
4 12 14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, shouldn't the column be dropped?
thresh=N requires that a column has at least N non-NaN values to survive. In the first example, every column has at least one non-NaN value, so all of them survive. In the second example, column 1 has only one non-NaN value, so it is dropped, while columns 0 and 2 each have at least two and survive.
Try setting thresh to 4 to get a better sense of what's happening.
The thresh parameter sets the minimum number of non-NaN values that a row (axis=0) or column (axis=1) must have in order not to be dropped.
With axis=1, this checks whether each column has at least 1 non-NaN value:
df1.dropna(thresh=1 ,axis=1)
Column 1 has only one non-NaN value (13.0), but thresh=2 requires at least 2, so this call drops that column:
df1.dropna(thresh=2,axis=1)
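The effect of different thresholds can be seen side by side in the sketch below. One deviation from the question's code is an assumption on my part: the frame is created with a float dtype, so that assigning NaN does not depend on how a given pandas version handles writing NaN into an integer column.

```python
import numpy as np
import pandas as pd

# Same shape and values as in the question, but float from the start.
df1 = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3))
df1.iloc[:4, 1] = np.nan   # column 1 keeps 1 non-NaN value
df1.iloc[:2, 2] = np.nan   # column 2 keeps 3 non-NaN values

# Count of non-NaN values per column: this is what thresh is compared against.
non_nan_per_column = df1.notna().sum()

kept_1 = list(df1.dropna(thresh=1, axis=1).columns)
kept_2 = list(df1.dropna(thresh=2, axis=1).columns)
kept_4 = list(df1.dropna(thresh=4, axis=1).columns)

print(non_nan_per_column.tolist())
print(kept_1, kept_2, kept_4)
```

A column survives exactly when its non-NaN count is greater than or equal to thresh, which is why raising the threshold to 4 removes column 2 as well.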
