pivot = pd.pivot_table(buliding_area_notnull, values=['BuildingArea', 'Landsize'], index=['Bedroom2', 'Bathroom', 'Car', 'Type'])
This is my code, which gives a pivot table like:
Bedroom2  Bathroom  Car  Type  Landsize
1         1         1          365.2
          0         2          555
                    1          666
Now I want to fill the NaN values in dataframe['Landsize'] using the pivot table above. What is the syntax?
Note: the pivot table above is just a small part.
EDIT: So now I have a better idea of what you are doing.
What you need to do is flatten the multi index of the first dataframe with reset_index().
Then you want to join the two dataframes together on [Bedroom2, Bathroom, Car, Type].
This will give you an 8-column df (the four key columns above, plus BuildingArea and Landsize twice).
Then I would just create a new column and fill it with the building area from the second df where it is non-NaN, falling back to the building area from the first df where it is NaN; see the sketch below.
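A rough sketch of that recipe, assuming df is the original frame and pivot is the pivot table from the question (both names are placeholders):
import pandas as pd

flat = pivot.reset_index()  # flatten the MultiIndex into ordinary columns

merged = df.merge(flat,
                  on=['Bedroom2', 'Bathroom', 'Car', 'Type'],
                  how='left',
                  suffixes=('', '_mean'))

# Keep the original value where present, else fall back to the group mean.
for col in ['BuildingArea', 'Landsize']:
    merged[col] = merged[col].fillna(merged[col + '_mean'])
merged = merged.drop(columns=['BuildingArea_mean', 'Landsize_mean'])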
EDIT END:
Your output does not match the code you typed at all. That said, there is a fill_value parameter that you may find useful.
Docs below.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
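For illustration, a minimal sketch of fill_value, reusing the names from the question (the 0 is an arbitrary placeholder):
pivot = pd.pivot_table(buliding_area_notnull,
                       values=['BuildingArea', 'Landsize'],
                       index=['Bedroom2', 'Bathroom', 'Car', 'Type'],
                       fill_value=0)  # missing cells in the result become 0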
I have the following dataframe, which contains 2 rows:
index  name     food   color  number  year  hobby  music
0      Lorenzo  pasta  blue   5       1995  art    jazz
1      Lorenzo  pasta  blue   3       1995  art    jazz
I want to write code that can tell me which column is the one that distinguishes between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply, by just going over column after column using iloc and looking at the values.
>>> duplicates.iloc[:, 3]
0    blue
1    blue
It's important to take into account that:
This should run in a for loop, since each time I check a newly generated dataframe.
There may be more than 2 rows that I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought the way to check this would be something like taking one column at a time, getting its unique values, and checking whether they are all equal, similar to this:
import numpy as np

for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then I thought to compare whether all the values in the temporary series are equal to each other, but here I got stuck, because sometimes I have many rows.
My end goal: to be able to check, inside a for loop, which column has a different value in each row of the temporary dataframe, and hence can help to distinguish between the rows.
You can apply the duplicated method on all columns:
# Mark, per column, which values are repeats, then check whether any repeat exists
s = df.apply(pd.Series.duplicated).any()
# Columns with no repeated values are the ones that distinguish the rows
s[~s].index
Output: ['number']
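An equivalent check, if you prefer counting distinct values: a column distinguishes every row exactly when its number of unique values equals the number of rows. A sketch on the example frame (note that nunique drops NaN, so treat columns with missing data with care):
import pandas as pd

df = pd.DataFrame({'name': ['Lorenzo', 'Lorenzo'],
                   'color': ['blue', 'blue'],
                   'number': [5, 3]})

# Columns where every row holds a distinct value
distinguishing = df.columns[df.nunique() == len(df)]
print(list(distinguishing))  # ['number']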
Good evening everyone!
I have a problem with NaN values in python with pandas.
I am working on a database with information on different countries. I cannot get rid of all my NaN values altogether, or I would lose too much data.
I wish to replace the NaN values based on some condition.
[Screenshot: the dataframe I am working on]
What I would like is to create a new column that would take the existing values of a column (Here: OECDSTInterbkRate) and replace all its NaN values based on a specific condition.
For example, I want to replace the NaN corresponding to Australia with the moving average of the values I already have for Australia.
Same thing for every other country for which I am missing values (Replace NaN observations in this column for France by the moving average of the values I already have for France, etc.).
What piece of code do you think I could use?
Thank you very much for your help!
Maybe you can try something like this: df.fillna(df.mean(), inplace=True)
Replace df.mean() with your mean values.
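If you specifically want a per-country moving average rather than one mean per column, a sketch along these lines may work (assuming the frame has a Country column; the column names and window size here are placeholders):
# Moving average computed within each country; NaNs are excluded from the
# mean, and min_periods=1 yields a value as soon as one observation exists.
rolling_avg = (df.groupby('Country')['OECDSTInterbkRate']
                 .transform(lambda s: s.rolling(window=3, min_periods=1).mean()))

df['OECDSTInterbkRate_filled'] = df['OECDSTInterbkRate'].fillna(rolling_avg)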
I have a pandas DataFrame
    ID  Unique_Countries
0  123  [Japan]
1  124  [nan]
2  125  [US,Brazil]
...
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods, including isnull() and isnan(), but it gets messed up because it is a numpy array.
If the NaN in your cell is not in the first position, try explode with groupby(...).all():
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
or:
df[df.Unique_Countries.explode().notna().all(level=0)]
Let's try:
df.Unique_Countries.str[0].isna()   # 'nan' is True
df.Unique_Countries.str[0].notna()  # 'nan' is False
To pick only the non-NaN rows, just use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe the answers based on the string method contains would fail if a country name contains the substring nan.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops nan from your dataframe and returns the dataset in the original form.
I am not sure from your question whether you want to drop the NaNs or to know the IDs of the records that have NaN in the Unique_Countries column. For the latter, you can use something like this:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
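With the sample frame from the question, that lookup would return something like:
ID
124    NaN
Name: Unique_Countries, dtype: object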
So, what I am trying to do is fill in the NaN values of a DataFrame with the correct values, which are to be found in a second dataframe. It would be something like this:
df={"Name":["Lennon","Mercury","Jagger"],"Band":["The Beatles", "Queen", NaN]}
df2={"Name":["Jagger"],"Band":["The Rolling Stones"]}
So, I have this command to know which rows have at least one NaN:
inds = list(pd.isnull(dfinal).any(1).nonzero()[0].astype(int))
I thought this would be useful to use in some kind of for loop (I didn't succeed there).
And then I tried this:
result = df.join(dfinal, on=["Name"])
But it gives me the following error
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
I checked, and both "Name" Series hold string values, so I am unable to solve this.
Keep in mind there are more columns; the likely situation is that if a row has one NaN, it will have around 7 NaNs.
Is there a way to solve this?
Thanks in advance!
Map and fillna()
We can fill missing values in your target df with values from the second dataframe, matched on the Name column:
df["Band"] = df["Band"].fillna(df["Name"].map(df2.set_index("Name")["Band"]))
print(df)
Name Band
0 Lennon The Beatles
1 Mercury Queen
2 Jagger The Rolling Stones
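As a side note on the ValueError: df.join(other, on="Name") aligns the Name column of the left frame against the index of the right frame, which here is a default integer RangeIndex, hence the object-vs-int64 complaint. An explicit merge sidesteps that; a sketch, with "_2" as an arbitrary suffix:
result = df.merge(df2, on="Name", how="left", suffixes=("", "_2"))
result["Band"] = result["Band"].fillna(result["Band_2"])
result = result.drop(columns=["Band_2"])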
I have an original data frame with information on real estate properties. To fill NaN values in the column price per m2 in usd, I have made a multi-index pivot table that holds the mean of the price per m2, sliced by property type, place, and surface covered in m2.
Now, I want to iterate over the original data frame's column price per m2 in usd to fill NaN values with the ones I created in the pivot table.
Pivot table code:
df6 = df4.pivot_table(values=['price_usd_per_m2'],
                      index=['cuartiles_superficie_cubierta'],
                      columns=['localidad', 'property_type'],
                      aggfunc=['mean'])
I'm not sure whether my understanding is correct; do you mind showing what your data table looks like? Based on your question, you have calculated the mean values, and you wish to replace the NaN in the original, pre-pivot table.
Would it be possible to fill the NaNs with the mean values in the pivoted table, and only then transform back to the original structure, as you wish?
Apologies if my answer is not helpful; I just wish to learn how to solve problems, and I will also learn from others who give advice on this question.
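For what it's worth, here is a sketch that skips the pivot lookup entirely and fills straight from the group means via groupby.transform (assuming df4 and the column names from the pivot code above; this is an alternative technique, not the pivot-based one in the question):
# Per-group mean aligned back to the original rows; NaNs are excluded
# from each mean but still receive their group's value.
group_means = df4.groupby(
    ['cuartiles_superficie_cubierta', 'localidad', 'property_type']
)['price_usd_per_m2'].transform('mean')

df4['price_usd_per_m2'] = df4['price_usd_per_m2'].fillna(group_means)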