Trying to fill NaNs with fillna() and groupby() - python

So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location, etc.). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols=['reviews_per_month','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value']
for i in cols:
    airbnb[i]=airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it does run and it doesn't return any error, it does not fill the NaN values, since when I check if there are still any NaNs, the amount hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!

The problem is index alignment. fillna aligns the Series you pass it with the column being filled, and the index of airbnb.groupby('host_id')[i].mean() consists of the host_id values, not the original row labels of airbnb, so the fill does not do what you expect. There are several ways to fix this. One is to use transform after the groupby, which broadcasts each group's mean back onto the original index, so fillna then works as expected:
for i in cols:
    airbnb[i]=airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even do this without a loop:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
For example:
import numpy as np
import pandas as pd
airbnb = pd.DataFrame({'host_id':[1,1,1,2,2,2],
                       'reviews_per_month':[4,5,np.nan,9,3,5],
                       'review_scores_rating':[3,np.nan,np.nan,np.nan,7,8]})
print (airbnb)
   host_id  review_scores_rating  reviews_per_month
0        1                   3.0                4.0
1        1                   NaN                5.0
2        1                   NaN                NaN
3        2                   NaN                9.0
4        2                   7.0                3.0
5        2                   8.0                5.0
and you get:
cols=['reviews_per_month','review_scores_rating'] # would work with all your columns
print (airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
   host_id  review_scores_rating  reviews_per_month
0        1                   3.0                4.0
1        1                   3.0                5.0
2        1                   3.0                4.5
3        2                   7.5                9.0
4        2                   7.0                3.0
5        2                   8.0                5.0
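To see the alignment issue directly, here is a minimal sketch comparing the two indexes. The host ids 101 and 102 are made up for illustration, chosen so that they do not overlap with the row labels 0 to 5:

```python
import numpy as np
import pandas as pd

# Hypothetical data: host ids deliberately differ from the row labels 0..5.
airbnb = pd.DataFrame({
    "host_id": [101, 101, 101, 102, 102, 102],
    "reviews_per_month": [4, 5, np.nan, 9, 3, 5],
})

# Plain groupby mean: one value per host, indexed by host_id.
per_host = airbnb.groupby("host_id")["reviews_per_month"].mean()

# transform: one value per original row, indexed like airbnb itself.
per_row = airbnb.groupby("host_id")["reviews_per_month"].transform("mean")

print(per_host.index.tolist())  # the host ids
print(per_row.index.tolist())   # the original row labels

# fillna aligns on the index, so only per_row can reach the NaN in row 2.
unfilled = airbnb["reviews_per_month"].fillna(per_host)
filled = airbnb["reviews_per_month"].fillna(per_row)

print(unfilled.isna().sum())  # the NaN is still there
print(filled.isna().sum())    # the NaN has been replaced by the host mean
```

Note that if host ids happened to coincide with row labels, the plain groupby version would silently fill with the wrong host's mean, which is arguably worse than not filling at all.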

Related

thresh in dropna for DataFrame in pandas in python

df1 = pd.DataFrame(np.arange(15).reshape(5,3))
df1.iloc[:4,1] = np.nan
df1.iloc[:2,2] = np.nan
df1.dropna(thresh=1 ,axis=1)
It seems that no nan value has been deleted.
    0     1     2
0   0   NaN   NaN
1   3   NaN   NaN
2   6   NaN   8.0
3   9   NaN  11.0
4  12  13.0  14.0
If I run
df1.dropna(thresh=2,axis=1)
why does it give the following?
    0     2
0   0   NaN
1   3   NaN
2   6   8.0
3   9  11.0
4  12  14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, shouldn't the column be dropped?
thresh=N requires that a column has at least N non-NaNs to survive. In the first example, both columns have at least one non-NaN, so both survive. In the second example, only the last column has at least two non-NaNs, so it survives, but the previous column is dropped.
Try setting thresh to 4 to get a better sense of what's happening.
The thresh parameter sets the minimum number of non-NaN values that a row (or, with axis=1, a column) must have in order not to be dropped.
This checks each column and keeps it if it has at least 1 non-NaN value:
df1.dropna(thresh=1 ,axis=1)
Column 1 has only one non-NaN value (13), but thresh=2 requires at least 2 non-NaN values, so that column fails the test and is dropped:
df1.dropna(thresh=2,axis=1)
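A quick way to check this is to count the non-NaN values per column and compare them with a few thresholds. This is only a sketch; the astype(float) call is my addition so that assigning NaN does not clash with an integer dtype:

```python
import numpy as np
import pandas as pd

# Same frame as in the question, cast to float up front (my assumption)
# so that NaN can be assigned without dtype complaints.
df1 = pd.DataFrame(np.arange(15).reshape(5, 3)).astype(float)
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan

# Number of non-NaN values in each column: this is what thresh is compared to.
non_nan_per_column = df1.notna().sum()
print(non_nan_per_column.tolist())

# A column survives when its non-NaN count is >= thresh.
for thresh in (1, 2, 4):
    kept = df1.dropna(thresh=thresh, axis=1).columns.tolist()
    print(thresh, kept)
```

With thresh=4 only column 0 should remain, since it is the only one with at least four non-NaN values.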

Python - iterate over rows and columns

First of all, please pardon my skills. I am trying to get into Python, I learn just for fun, let's say, I don't use it professionally and I am quite bad, to be honest. Probably there will be basic errors on my question.
Anyway, I am trying to go over a dataframe's rows and columns. I want to check if the values of the columns (except the first one) are NaNs. If they are, then they should change to the value of the first one.
import math
for index, row in rawdata3.iterrows():
    test = row[0]
    for column in row:
        if math.isnan(row.loc[column]) == True:
            row.loc[column] = test
The error I get is something like this:
the label [4.0] is not in the [columns]
I also had other errors with slightly different code like:
cannot do label indexing on class pandas.core.indexes.base.Index with these indexers class float
Could you give me a hand, please?
Thanks in advance!
Cheers.
I don't know if there is a better way but this works fine:
for i in df.columns:
    df.loc[df[i].isnull(), i] = df.loc[df[i].isnull(), 'A']
Output:
   A    B    C
0  5  5.0  2.0
1  6  5.0  6.0
2  9  9.0  9.0
3  2  4.0  6.0
Where df is:
   A    B    C
0  5  NaN  2.0
1  6  5.0  NaN
2  9  NaN  NaN
3  2  4.0  6.0
Use transpose and fillna.
df.fillna(value=df.A, axis=1) will not work, because fillna raises:
NotImplementedError: Currently only can fill with dict/Series column by column
Therefore we use:
df.T.fillna(df.A).T
Output:
     A    B    C
0  5.0  5.0  2.0
1  6.0  5.0  6.0
2  9.0  9.0  9.0
3  2.0  4.0  6.0
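Another option, not from the answers above, is to fill each column from column A with apply. This is a minimal sketch that rebuilds the example frame; it relies on fillna aligning the Series df['A'] with each column on the row index:

```python
import numpy as np
import pandas as pd

# The example frame from the answer above.
df = pd.DataFrame({
    "A": [5, 6, 9, 2],
    "B": [np.nan, 5.0, np.nan, 4.0],
    "C": [2.0, np.nan, np.nan, 6.0],
})

# apply passes each column in turn; fillna with a Series aligns on the
# row index, so a NaN in row k is replaced by df["A"] at row k.
filled = df.apply(lambda col: col.fillna(df["A"]))
print(filled)
```

This avoids both the explicit loop and the double transpose, and it returns a new DataFrame rather than modifying df in place.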

pandas - partially updating DataFrame with derived calculations of a subset groupby

I have a DataFrame with some NaN records that I want to fill based on a combination of data of the NaN record (index in this example) and of the non-NaN records. The original DataFrame should be modified.
Details of input/output/code below:
I have an initial DataFrame that contains some pre-calculated data:
Initial Input
raw_data = {'raw':[x for x in range(5)]+[np.nan for x in range(2)]}
source = pd.DataFrame(raw_data)
raw
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
I want to identify and perform calculations to "update" the NaN data, where the calculations are based on data of the non-NaN data and some data from the NaN records.
In this contrived example I am calculating this as:
1. Calculate the average/mean of the 'valid' records.
2. Add this to the index number of the 'invalid' records.
3. Finally, update the initial DataFrame with the result.
Desired Output
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
The current solution I have (below) makes a calculation on a copy then updates the original DataFrame.
# Setup grouping by NaN in 'raw'
source['valid'] = ~np.isnan(source['raw'])*1
subsets = source.groupby('valid')
# Mean of 'valid' is used later to fill 'invalid' records
valid_mean = subsets.get_group(1)['raw'].mean()
# Operate on a copy of group(0), then update the original DataFrame
invalid = subsets.get_group(0).copy()
invalid['raw'] = subsets.get_group(0).index + valid_mean
source.update(invalid)
Is there a less clunky or more efficient way to do this? The real application is on significantly larger DataFrames (and with a significantly longer process of processing NaN rows).
Thanks in advance.
You can use combine_first:
#mean() omits NaNs by default
m = source['raw'].mean()
#same as
#m = source['raw'].dropna().mean()
print (m)
2.0
#create valid column if necessary
source['valid'] = source['raw'].notnull().astype(int)
#update NaNs
source['raw'] = source['raw'].combine_first(source.index.to_series() + m)
print (source)
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
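If you prefer to make the "only touch the NaN rows" step explicit, a boolean mask with loc is another way to write it. This is a sketch on the same toy data, not a benchmark; the variable names missing and valid_mean are mine:

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({"raw": [0.0, 1.0, 2.0, 3.0, 4.0, np.nan, np.nan]})

# Boolean mask marking the rows that need to be filled.
missing = source["raw"].isna()

# Mean of the valid rows only.
valid_mean = source.loc[~missing, "raw"].mean()

# Flag column: 1 for originally valid rows, 0 for filled ones.
source["valid"] = (~missing).astype(int)

# Assign only into the masked rows: index label plus the valid mean.
source.loc[missing, "raw"] = source.index[missing.to_numpy()] + valid_mean
print(source)
```

The assignment modifies source in place, so no copy and no update() call is needed. For a longer per-row computation, the right-hand side can be replaced by any expression evaluated on source.loc[missing].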

Python Pandas: How to merge based on an "OR" condition?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber.
However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])
then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.
In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge
I hope my question makes sense...
Any help is really really appreciated!
As suggested, I looked into this post:
Python pandas merge with OR logic
But I don't think it is quite the same issue, as the OP from that post has a mapping file, so they can simply do 2 merges to solve this. I don't have a mapping file; rather, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).
Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).
df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})
   df1          df2
   A  B         A  B
0  3  7      0  1  4
1  2  8      1  5  1
2  1  9      2  6  8
3  4  5      3  4  5
With these data frames we should see:
df1.loc[2] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
I would suggest an alternative way of doing a merge like this, which seems easier to me:
table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
You can add the same column to table2 as well if needed, and then use it in left_on or right_on depending on your requirement.
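For the ShipNumber/TrackNumber case from the question, one possible sketch is to merge once per key, collect the matched row pairs, drop duplicate pairs, and then pull the full rows back in. The sample values and the helper column names row1 and row2 are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
tab1 = pd.DataFrame({
    "ShipNumber": [1, 2, 3],
    "TrackNumber": ["a", "b", "c"],
    "Quantity": [10, 20, 30],
})
tab2 = pd.DataFrame({
    "ShipNumber": [1, 9, 3],
    "TrackNumber": ["x", "b", "c"],
    "AmountReceived": [100, 200, 300],
})

# Tag each row with its original position so matched pairs can be identified.
left = tab1.reset_index().rename(columns={"index": "row1"})
right = tab2.reset_index().rename(columns={"index": "row2"})

# One inner merge per key: rows that match on either key show up in one of them.
on_ship = left.merge(right, on="ShipNumber", suffixes=("_1", "_2"))
on_track = left.merge(right, on="TrackNumber", suffixes=("_1", "_2"))

# Union of matched (row1, row2) pairs; pairs matching on both keys appear once.
pairs = (
    pd.concat([on_ship[["row1", "row2"]], on_track[["row1", "row2"]]])
    .drop_duplicates()
    .sort_values(["row1", "row2"])
    .reset_index(drop=True)
)

# Bring the full rows from both tables back in via the pair positions.
result = (
    pairs.merge(tab1, left_on="row1", right_index=True)
         .merge(tab2, left_on="row2", right_index=True, suffixes=("_1", "_2"))
)
print(result)
```

This gives an inner-join style result. If you need left-join behaviour, the tab1 rows that never appear in pairs would have to be appended separately.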

How to fill the NaN values with Mode in Python Pandas dataset?

In my data sets (train, test), max_floor is null for some records. I am trying to fill the null values with the mode of max_floor for apartments that share the same apartment name:
for t in full.apartment_name.unique():
    for df in frames:
        df['max_floor'].fillna((df.loc[df["apartment_name"]==t,
                                       'max_floor']).mode, inplace=True)
where full is train.append(test)
and frames is [train, test]
The code runs without errors, but it does not give the expected result: it fills all the null max_floor values with the text below:
bound method Series.mode of 0 NaN
1084 NaN
23278 9.0
Name: max_floor, dtype: float64
I just wanted to replace the above text with just the max_floor values. Any help would be appreciated.
mode() is a function and you've referred to it but not invoked it.
Change mode to mode()
You need to access the first value from the mode() result. For example:
A B
0 1 3.0
1 2 NaN
2 2 NaN
3 3 NaN
Fill missing values with the mode of column A:
df.fillna(df['A'].mode()[0])
Output:
A B
0 1 3.0
1 2 2.0
2 2 2.0
3 3 2.0
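To get the per-apartment behaviour the question asks for, one possible sketch is a groupby with transform. The sample data and the helper first_mode are made up for illustration; the helper returns NaN when a group has no non-null values, since mode() is empty in that case:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the column names from the question.
df = pd.DataFrame({
    "apartment_name": ["a", "a", "a", "b", "b", "c"],
    "max_floor": [9.0, 9.0, np.nan, 5.0, np.nan, np.nan],
})

def first_mode(s):
    """Return the first mode of s, or NaN if s has no non-null values."""
    modes = s.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# One mode per apartment, broadcast back onto the original rows.
group_mode = df.groupby("apartment_name")["max_floor"].transform(first_mode)

# Only the NaN entries are replaced; existing values are untouched.
df["max_floor"] = df["max_floor"].fillna(group_mode)
print(df)
```

Apartment "c" has no known max_floor at all, so its value stays NaN. You would need a separate fallback (for example the overall mode) if that is not acceptable.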
