I am working on a dataset of average age of marriage, and I am doing a data cleaning job on it. While doing this, I came across the 'location' column, where I had to fill the NaN values. But the location column has many unique values, and I am not sure of the best way to fill them. I need some suggestions on how to fill NaN values in a column that has many unique values.
I have attached the dataset for reference, DataSet
I suggest doing it in 3 steps:
Fill in the missing values of location with either the most common location or with a separate value "Unknown";
Fill in the missing values of "age_of_marriage" with the median value of this feature by location;
If there are any missing values of "age_of_marriage" left, fill them in with the overall mean.
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/atharva07/Age-of-marriage/main/age_of_marriage_data.csv', sep=',')
# Step 1: treat missing locations as a separate category
df['location'] = df['location'].fillna('Unknown')
# Step 2: fill ages with the per-location median; transform keeps the result aligned with the original index
df['age_of_marriage'] = df['age_of_marriage'].fillna(df.groupby('location')['age_of_marriage'].transform('median'))
# Step 3: fill anything still missing with the overall mean
df['age_of_marriage'] = df['age_of_marriage'].fillna(df['age_of_marriage'].mean())
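The three steps can be sketched end-to-end on a small made-up frame (the column names match the real CSV, but the rows below are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the real CSV
df = pd.DataFrame({
    "location": ["Delhi", "Delhi", None, "Mumbai", "Mumbai"],
    "age_of_marriage": [24.0, 26.0, 30.0, None, 28.0],
})

# Step 1: label missing locations as a separate "Unknown" category
df["location"] = df["location"].fillna("Unknown")

# Step 2: fill ages with the per-location median
# (transform broadcasts each group's median back to its rows)
df["age_of_marriage"] = df["age_of_marriage"].fillna(
    df.groupby("location")["age_of_marriage"].transform("median")
)

# Step 3: any remaining NaN (e.g. a location with no known ages)
# gets the overall mean
df["age_of_marriage"] = df["age_of_marriage"].fillna(df["age_of_marriage"].mean())
```

Here the missing Mumbai age is filled with 28.0, the Mumbai median, before the overall mean is ever needed.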
The attached image shows test data with missing values in multiple columns.
I need to fill the missing values using the rate of change over the previous 12 months.
For example, in the attached dataset I have missing values in rows 23 and 24 for the columns weight_a, weight_b, and weight_c.
To fill the missing value in row 23, weight_a column I need to do =(B22-B10)/12 + B22
To fill the missing value in row 24, weight_a column I need to do =(B23-B11)/12 + B23
To fill the missing value in row 23, weight_b column I need to do =(C22-C10)/12 + C22
To fill the missing value in row 24, weight_b column I need to do =(C23-C11)/12 + C23
and so on; the same pattern repeats for the weight_c column (and the real dataset has many missing values across multiple columns).
How do I write python code to implement this for all missing values in a dataframe?
Calculate the values, then update the rows manually:
result_23 = [1, 2, 3]  # calculate the real values instead of [1, 2, 3]
result_24 = [1, 2, 3]  # calculate the real values instead of [1, 2, 3]
# Calculate them in this way, based on what you want, e.g. for weight_a in row 23:
# (df.iloc[22]["weight_a"] - df.iloc[10]["weight_a"]) / 12 + df.iloc[22]["weight_a"]
df.loc[df.index == 23, ["weight_a", "weight_b", "weight_c"]] = result_23
df.loc[df.index == 24, ["weight_a", "weight_b", "weight_c"]] = result_24
I have two Pandas DataFrames with one column in common, namely "Dates". I need to merge the two where the "Dates" match. pd.merge() does what I expect, but it drops the non-matching rows; I want to keep those values too.
Example: I have historical 1-minute data for a stock and an indicator calculated on 5-minute data, i.e. one new indicator value for every 5 rows of the 1-minute DataFrame.
I know the Series.dt.floor method may be useful here, but I couldn't figure it out.
I concatenated the respective "Dates" onto the calculated indicator Series so that I could merge where the column matches. I got the right result, but with missing values. I need continuity of the 1-minute values, i.e. the same indicator must remain valid for the next 5 entries, until the next indicator value takes its turn to be merged.
df1.merge(df2, left_on='Dates', right_on='Dates')
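Assuming the goal is to keep every 1-minute row and carry each 5-minute indicator forward until the next one appears, a left merge followed by a forward fill is one possible sketch (the frames below and the column names `Close` and `Indicator` are invented for illustration):

```python
import pandas as pd

# Hypothetical 1-minute price frame
dates_1m = pd.date_range("2023-01-02 09:30", periods=10, freq="1min")
df1 = pd.DataFrame({"Dates": dates_1m, "Close": range(10)})

# Hypothetical 5-minute indicator frame (one value per 5 rows of df1)
dates_5m = pd.date_range("2023-01-02 09:30", periods=2, freq="5min")
df2 = pd.DataFrame({"Dates": dates_5m, "Indicator": [1.5, 2.5]})

# how="left" keeps every 1-minute row instead of dropping non-matches;
# ffill() then propagates each 5-minute value to the following rows
merged = df1.merge(df2, on="Dates", how="left")
merged["Indicator"] = merged["Indicator"].ffill()
```

Each indicator value now stays valid for its five 1-minute entries, exactly until the next 5-minute value arrives.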
In the above image, I colored some rows with the same colors; I want to create new DataFrames from the rows that share a color. For example, the rows with a multiple of 0.8 and option type CE appear twice in this DataFrame, so I want those rows in a new DataFrame of their own, and I want to do the same for all the other groups.
Below is some code that may help you.
df_dictionary = dict(tuple(your_dataframe.groupby('columns_to_groupby')))
This will produce a dictionary whose keys are the grouped values (in your case "CE", "PE", etc.) and whose values are the DataFrames split by the specified grouping. Hope this helps.
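A minimal illustration of the split (the column names and values below are invented; substitute your own):

```python
import pandas as pd

df = pd.DataFrame({
    "option_type": ["CE", "CE", "PE", "PE", "CE"],
    "multiple": [0.8, 0.8, 0.9, 0.9, 0.8],
    "value": [1, 2, 3, 4, 5],
})

# One sub-DataFrame per (multiple, option_type) combination;
# grouping by several columns makes the dictionary keys tuples
df_dictionary = dict(tuple(df.groupby(["multiple", "option_type"])))
```

Here `df_dictionary[(0.8, "CE")]` holds the three matching rows, and `df_dictionary[(0.9, "PE")]` the other two.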
I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are in total 5 Regions. Every Region consists of an x amount of ZipCodes. I would like to use the two different datasets to create a new column.
I tried some code already, but it failed because the Series are not identically labelled. How should I tackle this problem?
I have two datasets: one has 1518 rows × 3 columns and the other has 46603 rows × 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset, where the Regio column is missing as you can see. I would like to add a new column to df2 that contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zipcodes in the second DataFrame to the Regio column of the first. Assuming Postcode and ZipCode are the same:
First create a dictionary from df1, then map the zipcode values through it and assign the result to a new column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2['ZipCode'].map(zip_dict)
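A small self-contained demo of the idea (the zip codes and regions below are made up; `map()` is used so that zip codes absent from the lookup come back as NaN, which makes gaps easy to spot):

```python
import pandas as pd

# Hypothetical lookup table (df1) and target frame (df2)
df1 = pd.DataFrame({"Postcode": [1011, 1012, 2011],
                    "Regio": ["North", "North", "South"]})
df2 = pd.DataFrame({"ZipCode": [1012, 2011, 9999]})

# Dictionary from zip code to region
zip_dict = dict(zip(df1.Postcode, df1.Regio))

# map() looks each ZipCode up in the dictionary and
# creates the new Regio column (NaN where no match exists)
df2["Regio"] = df2["ZipCode"].map(zip_dict)
```

The unmatched zip code 9999 ends up as NaN, which you can then fill or investigate separately.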
I have a housing dataframe:
where there are missing values in the Price column. I wish to fill the missing values by the mean price in the respective suburb.
This is my code for filling up the mean price by the same column:
all_housing_df['Price'].fillna(all_housing_df['Price'].mean())
How to fill in the mean price by the respective suburb?
You can group by Suburb, get the mean Price and save this as a dictionary to conditionally replace null values.
# Create dictionary for NaN values
nan_dict = all_housing_df.groupby('Suburb')['Price'].mean().to_dict()
# Replace NaN with dictionary
all_housing_df['Price'] = all_housing_df['Price'].fillna(all_housing_df['Suburb'].map(nan_dict))
You can use transform to fill the missing values with the per-suburb mean after grouping by Suburb:
all_housing_df["Price"] = all_housing_df["Price"].fillna(all_housing_df.groupby("Suburb")["Price"].transform("mean"))
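Here is a runnable sketch of the transform approach on a tiny invented frame (the suburbs and prices are made up for illustration):

```python
import pandas as pd
import numpy as np

all_housing_df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Airport West", "Airport West"],
    "Price": [1000000.0, np.nan, 600000.0, 620000.0],
})

# transform("mean") broadcasts each suburb's mean Price back to every
# row of that suburb, so the result aligns with the original index
# and can be passed straight to fillna
all_housing_df["Price"] = all_housing_df["Price"].fillna(
    all_housing_df.groupby("Suburb")["Price"].transform("mean")
)
```

The missing Abbotsford price is filled with 1000000.0, the mean of the known Abbotsford prices, without any intermediate dictionary.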