The attached image shows test data that has missing values in multiple columns.
I need to fill the missing values using the rate of change over the previous 12 months.
For example, in the attached dataset I have missing values in rows 23 and 24 for the columns weight_a, weight_b, and weight_c.
To fill the missing value in row 23, weight_a column I need to do =(B22-B10)/12 + B22
To fill the missing value in row 24, weight_a column I need to do =(B23-B11)/12 + B23
To fill the missing value in row 23, weight_b column I need to do =(C22-C10)/12 + C22
To fill the missing value in row 24, weight_b column I need to do =(C23-C11)/12 + C23
and so on for the weight_c column (and the real dataset has many missing values across multiple columns).
How do I write Python code to implement this for all missing values in a DataFrame?
Calculate the values, then update the rows manually:
result_23 = [1, 2, 3]  # replace [1, 2, 3] with the real calculated values
result_24 = [1, 2, 3]  # replace [1, 2, 3] with the real calculated values
# Calculate them like this, based on what you want:
# (df.iloc[22]["weight_a"] - df.iloc[10]["weight_a"]) / 12 + df.iloc[22]["weight_a"]
df.loc[df.index == 23, ["weight_a", "weight_b", "weight_c"]] = result_23
df.loc[df.index == 24, ["weight_a", "weight_b", "weight_c"]] = result_24
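To apply the same formula to every missing cell automatically, you can loop over the NaN positions in each column and fill them top-down, so an earlier fill can feed a later one. This is a minimal sketch on a made-up frame (the real values would come from your file); it assumes a default integer index and that the 13 preceding rows exist:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the spreadsheet: one value per month,
# with the last two months missing (like rows 23-24 in the sheet).
df = pd.DataFrame({
    "weight_a": [float(i) for i in range(22)] + [np.nan, np.nan],
    "weight_b": [float(2 * i) for i in range(22)] + [np.nan, np.nan],
})

# Fill each NaN top-down:
# new value = (previous value - value 13 rows back) / 12 + previous value
for col in df.columns:
    for i in df.index[df[col].isna()]:
        prev = df.at[i - 1, col]       # e.g. B22 for row 23
        year_ago = df.at[i - 13, col]  # e.g. B10 for row 23
        df.at[i, col] = (prev - year_ago) / 12 + prev
```

Because the rows are processed in order, row 24's fill can use row 23's freshly filled value, just as the spreadsheet formulas do.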
I have two Pandas DataFrames with one column in common, namely "Dates". I need to merge them where the "Dates" match. pd.merge() does this as expected, but it drops the non-matching rows, and I want to keep those values too.
For example, I have historical 1-minute data for a stock and an indicator calculated on 5-minute data, i.e. for every 5 rows of the 1-minute DataFrame I have one new indicator value.
I know the Series.dt.floor method may be useful here, but I couldn't figure it out.
I concatenated the respective "Dates" onto the calculated indicator Series so that I could merge where the column matches. I got the right result, but with missing values. I need continuity in the 1-minute values, i.e. the same indicator must hold for the next 5 entries, and then it is the second indicator value's turn to be merged.
df1.merge(df2, left_on='Dates', right_on='Dates')
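One way to keep every 1-minute row is to floor each 1-minute timestamp to its 5-minute bucket with Series.dt.floor, then left-merge on that bucket so each indicator value repeats for its 5 rows. This sketch uses made-up frames and column names (df1, df2, close, indicator are assumptions, not from the original post):

```python
import pandas as pd

# Hypothetical frames: df1 has 1-minute bars, df2 has one indicator per 5 minutes.
df1 = pd.DataFrame({
    "Dates": pd.date_range("2023-01-02 09:30", periods=10, freq="1min"),
    "close": range(10),
})
df2 = pd.DataFrame({
    "Dates": pd.date_range("2023-01-02 09:30", periods=2, freq="5min"),
    "indicator": [1.5, 2.5],
})

# Floor each 1-minute timestamp to its 5-minute bucket, then left-merge
# so every 1-minute row is kept and each indicator repeats for 5 rows.
merged = (
    df1.assign(bucket=df1["Dates"].dt.floor("5min"))
       .merge(df2, left_on="bucket", right_on="Dates",
              how="left", suffixes=("", "_5min"))
       .drop(columns="bucket")
)
```

If both frames are sorted by time, pd.merge_asof(df1, df2, on="Dates", direction="backward") achieves the same "carry the latest indicator forward" effect in one call.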
I am working on a dataset of average age of marriage, on which I am doing a data-cleaning job. During this process I came across a feature where I had to fill the NaN values in the location column, but the location column contains many unique values. I need some suggestions on how to fill NaN values in a column that has many unique values.
I have attached the dataset for reference, DataSet
I suggest doing it in 3 steps:
Fill in the missing values of location with either the most common location or with a separate value "Unknown";
Fill in the missing values of "age_of_marriage" with the median value of this feature by location;
If there are any missing values of "age_of_marriage" left, fill them in with the average value.
df = pd.read_csv('https://raw.githubusercontent.com/atharva07/Age-of-marriage/main/age_of_marriage_data.csv', sep=',')
df['location'] = df['location'].fillna('Unknown')
df['age_of_marriage'] = df['age_of_marriage'].fillna(df.groupby('location')['age_of_marriage'].transform('median'))
df['age_of_marriage'] = df['age_of_marriage'].fillna(df['age_of_marriage'].mean())
If I have two dataframes (say df4avg and df5avg) with identical corrected wavelengths and different count rates, and I want to divide the df4avg count rate by df5avg's count rate and get an output of the corrected wavelength and the new divided value with a new column name (say 'ratio'), how would I do this?
If you want to add the ratio column to the df4avg DataFrame:
df4avg['ratio'] = df4avg['COUNT_RATE'] / df5avg['COUNT_RATE']
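Note that the direct division above pairs rows by index position, which only works if the two frames are aligned. A merge on the shared wavelength column pairs matching rows even if the frames are ordered differently. A minimal sketch with made-up data (the column name corr_wavelength is an assumption; substitute your actual wavelength column):

```python
import pandas as pd

# Hypothetical frames sharing a corrected-wavelength column.
df4avg = pd.DataFrame({"corr_wavelength": [100.0, 200.0, 300.0],
                       "COUNT_RATE": [10.0, 20.0, 30.0]})
df5avg = pd.DataFrame({"corr_wavelength": [100.0, 200.0, 300.0],
                       "COUNT_RATE": [2.0, 4.0, 5.0]})

# Merge on the shared wavelength so the division pairs matching rows,
# then compute the ratio and keep only the columns of interest.
out = df4avg.merge(df5avg, on="corr_wavelength", suffixes=("_4", "_5"))
out["ratio"] = out["COUNT_RATE_4"] / out["COUNT_RATE_5"]
result = out[["corr_wavelength", "ratio"]]
```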
I have a Pandas DataFrame extracted from Estespark Weather for the dates between Sep-2009 and Oct-2018, and the mean of the Average windspeed column is 4.65. I am taking a challenge with a sanity check that requires the mean of this column to be 4.64. How can I modify the values of this column so that its mean becomes 4.64? Is there a code solution for this, or does it have to be done manually?
I can see two solutions:
1. Subtract 0.01 (4.65 - 4.64) from every value of that column:
df['AvgWS'] -= 0.01
2. If you don't want to alter all rows, find which rows you could remove to reach the desired mean (if any exist):
current_mean = 4.65
desired_mean = 4.64
n_rows = len(df['AvgWS'])
df['can_remove'] = df['AvgWS'].map(lambda x: (current_mean*n_rows - x)/(n_rows-1) == desired_mean)
This creates a new boolean column in your DataFrame with True in the rows that, if removed, leave the rest of the column with a mean of 4.64. If there is more than one such row, you can analyse them to choose which one seems least important and remove that one.
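One caveat: an exact == comparison on floats rarely fires, because the recomputed mean usually differs from 4.64 in the last few bits. Comparing with a tolerance via numpy.isclose is more robust. A runnable sketch on a made-up column whose mean is 4.65:

```python
import numpy as np
import pandas as pd

# Toy column whose mean is 4.65 (values are made up for the demo).
df = pd.DataFrame({"AvgWS": [4.65, 4.65, 4.65, 4.61, 4.69]})

desired_mean = 4.64
current_sum = df["AvgWS"].sum()
n_rows = len(df)

# Mean of the remaining rows if each row were dropped; compare with a
# tolerance instead of ==, since float arithmetic rarely matches exactly.
mean_without = (current_sum - df["AvgWS"]) / (n_rows - 1)
df["can_remove"] = np.isclose(mean_without, desired_mean)
```

Here only the 4.69 row is flagged: dropping it leaves (23.25 - 4.69) / 4 = 4.64.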
Hi, I am trying to edit rows of a particular column 'PM2.5' using df.at['0', 'PM2.5'] = 10 for the first row of the PM2.5 column, but instead of editing, it adds a new row. The column headers are titles but my rows are numbered; how do I get around this? I want to do this for 18 rows and manually add data to the PM2.5 column. Thanks!
As discussed in the comments, the problem is that the index values are integers (a RangeIndex), so setting a value needs an integer label too.
So change '0' (a string):
df.at['0', 'PM2.5'] = 10
to 0 (an integer):
df.at[0, 'PM2.5'] = 10
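To set all 18 rows, you can loop over integer labels the same way. A small sketch on a toy frame (the values here are made up; substitute your real readings):

```python
import pandas as pd

# Toy frame with a default RangeIndex, standing in for the real data.
df = pd.DataFrame({"PM2.5": [float("nan")] * 5})

# Integer labels edit existing rows in place (a string label like '0'
# would instead enlarge the frame with a new row labelled '0').
values = [10.0, 11.0, 12.0]  # made-up readings for the first rows
for i, v in enumerate(values):
    df.at[i, "PM2.5"] = v
```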