Wrong median used when replacing null values in a pandas DataFrame - python

Let's say I have a dataset like the one below.
I want to replace the null values with the median of each column, but when I try to do that, every NA is replaced with the median of the first column only.
Rough_df = pd.read_excel(r'Cleandata_withOutliers.xlsx', sheet_name='Sheet2')
Rough_df.fillna(Rough_df.select_dtypes(include='number').median().iloc[0], inplace=True)
My output looks like this:
But the NA values in the 2nd column should be replaced with 10170.5, not 77.5. Where am I going wrong?

You can pass the per-column medians straight to fillna: df.median() returns a Series indexed by column name, and fillna fills each column with its matching entry. Your .iloc[0] collapsed that Series to a single scalar (the first column's median, 77.5), so every column was filled with it.
out = df.fillna(df.median())
Out[68]:
       X        Y
0   60.0   9550.0
1   85.0  10170.5
2   77.5  10791.0
3  101.0  14215.0
4   47.0  16321.0
5  108.0  10170.5
6   77.5   8658.0
7   70.0   7945.0
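If the frame mixes numeric and non-numeric columns, df.median() may warn or fail on the non-numeric ones. A minimal sketch that limits the fill to the numeric columns, reusing the select_dtypes call from the question:
num_cols = Rough_df.select_dtypes(include='number').columns
Rough_df[num_cols] = Rough_df[num_cols].fillna(Rough_df[num_cols].median())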

Related

Sum two columns only if the values of one column are greater than 0

I've got the following dataframe:
lst=[['01012021','',100],['01012021','','50'],['01022021',140,5],['01022021',160,12],['01032021','',20],['01032021',200,25]]
df1=pd.DataFrame(lst,columns=['Date','AuM','NNA'])
I am looking for code that sums the columns AuM and NNA only if column AuM contains a value. The desired result is shown below:
lst=[['01012021','',100,''],['01012021','','50',''],['01022021',140,5,145],['01022021',160,12,172],['01032021','',20,'']]
df2=pd.DataFrame(lst,columns=['Date','AuM','NNA','Sum'])
It is not good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']]                    # you can use as many columns as you want
              .apply(pd.to_numeric, errors='coerce') # convert to numeric
              .sum(axis=1, skipna=False)             # sum only if all values are non-NaN
              .fillna('')                            # fill NaN with empty string (bad practice)
             )
output:
       Date  AuM NNA    Sum
0  01012021      100
1  01012021       50
2  01022021  140   5  145.0
3  01022021  160  12  172.0
4  01032021       20
5  01032021  200  25  225.0
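Following the advice above, a sketch of the same computation that keeps NaN instead of converting back to empty strings (assuming the same df1):
df1['Sum'] = (df1[['AuM', 'NNA']]
              .apply(pd.to_numeric, errors='coerce')
              .sum(axis=1, skipna=False))  # stays NaN wherever AuM or NNA is missing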
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
          .fillna(""))
print(df2)
Result:
       Date  AuM NNA    Sum
0  01012021      100
1  01012021       50
2  01022021  140   5  145.0
3  01022021  160  12  172.0
4  01032021       20
5  01032021  200  25  225.0

Appending from one dataframe to another dataframe (with different sizes) when two values match

I have two pandas dataframes where some of the values overlap, and I'd like to append to the original dataframe wherever the time_hour and origin values are the same.
Here is my original dataframe, flightsDF, which is very long; it has the format:
year month origin dep_time dep_delay arr_time time_hour
2001 01 EWR 15:00 15 17:00 2013-01-01T06:00:00Z
I have another dataframe, weatherDF (much shorter than flightsDF), with some extra information for some of the values in the original dataframe:
origin temp dewp humid wind_dir wind_speed precip visib time_hour
0 EWR 39.02 26.06 59.37 270.0 10.35702 0.0 10.0 2013-01-01T06:00:00Z
1 EWR 39.02 26.96 61.63 250.0 8.05546 0.0 10.0 2013-01-01T07:00:00Z
2 LGH 39.02 28.04 64.43 240.0 11.50780 0.0 10.0 2013-01-01T08:00:00Z
I'd like to append the extra information (temp, dewp, humid, ...) from weatherDF to the original dataframe flightsDF wherever both time_hour and origin match.
I have tried
for x in weatherDF:
    if x['time_hour'] == flightsDF['time_hour'] & flightsDF['origin'] == 'EWR':
        flights_df.append(x)
and some other similar approaches, but I can't seem to get it working. Can anyone help?
I am planning to append all the corresponding values and then dropping any from the combined dataframe that don't have those values.
You are probably looking for pd.merge:
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='left')
print(out)
# Output
year month origin dep_time dep_delay arr_time time_hour temp dewp humid wind_dir wind_speed precip visib
0 2001 1 EWR 15:00 15 17:00 2013-01-01T06:00:00Z 39.02 26.06 59.37 270.0 10.35702 0.0 10.0
If that's what you're after, take the time to read Pandas Merging 101.
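Since you plan to drop the rows that have no matching weather data anyway, an inner join handles the merge and the drop in one step; a sketch with the same frames and keys:
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='inner')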

Python Dataframe NaN rows slicing, filling and rejoining

I have a big dataframe. Some of the values in one column are NaN, and I want to fill them with a value based on the other column.
Data:
df =
A B
2019-10-01 09:19:40 667.029710 10
2019-10-01 09:20:15 673.518030 20
2019-10-01 09:21:29 533.137144 30
2020-07-25 15:51:15 NaN 40
2020-07-25 17:20:20 NaN 50
2020-07-25 17:21:23 NaN 60
I want to fill NaN in A column based on the B column value.
My code:
sdf = df[df['A'].isnull()] # slice NaN and create a new dataframe
sdf['A'] = sdf['B']*sdf['B']
df = pd.concat([df,sdf])
Everything works fine, but my code feels lengthy. Is there a one-line way to do this?
With fillna you can do this in one line:
df.A.fillna(df.B**2, inplace=True)
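Note that calling fillna with inplace=True on a selected column can trigger chained-assignment warnings on recent pandas versions; an equivalent one-liner using plain assignment:
df['A'] = df['A'].fillna(df['B'] ** 2)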

Shifting column values in Pandas Dataframe causes missing values

I want to shift column values one space to the left. I don't want to save the original values of the column 'average_rating'.
I used the shift command:
data3 = data3.shift(-1, axis=1)
But the output I get has missing values in two columns: 'num_pages' and 'text_reviews_count'.
This happens because the data types of the source and target columns do not match. After the shift(), convert each affected column to its target data type, for example with .fillna(0).astype(int).
Alternatively, you can convert all the data in the dataframe to strings, perform the shift, and then convert the columns back to their original data types:
from io import StringIO

df = df.astype(str)                # convert all data to str so no values are lost in the shift
df_shifted = df.shift(-1, axis=1)  # perform the shift
df_string = df_shifted.to_csv()    # dump the shifted frame to a string
new_df = pd.read_csv(StringIO(df_string), index_col=0)  # read it back so dtypes are re-inferred
Output:
average_rating isbn isbn13 language_code num_pages ratings_count text_reviews_count extra
0 3.57 0674842111 978067 en-US 236 55 6.0 NaN
1 3.60 1593600119 978067 eng 400 25 4.0 NaN
2 3.63 156384155X 978067 eng 342 38 4.0 NaN
3 3.98 1857237250 978067 eng 383 2197 17.0 NaN
4 0.00 0851742718 978067 eng 49 0 0.0 NaN
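The CSV round-trip is one way to let pandas re-infer the dtypes; a shorter sketch of the same idea using pd.to_numeric per column (errors='ignore' leaves non-numeric columns untouched, though it is deprecated in newer pandas):
df_shifted = df.astype(str).shift(-1, axis=1)              # shift as strings so no values are lost
new_df = df_shifted.apply(pd.to_numeric, errors='ignore')  # re-infer numeric dtypes column by column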

Selecting column values of a dataframe that are in a range and putting them in the appropriate columns of another dataframe in pandas

I have a csv file which looks something like this:
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select those column values which are > 20 and < 500 (i.e. in the range 20 to 500) and put those values, along with the date, into another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to take the date and value from the csv and add them to the new dataframe in the appropriate columns. Something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but not of one for finding values within a range. How can this be achieved?
Given dataframes df1 and df2, you can achieve this by aligning the column names, cleaning the numeric data, and then using pd.DataFrame.append.
df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
            .rename(columns={'date': 'Date'})\
            .replace(np.inf, 0)\
            .fillna(0)
print(df_app)
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)
df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = df2.append(df_app.loc[:, ['Date', 'percentage_change']])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837
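Note: pd.DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas the final step can be written with pd.concat instead, using the same variables as above:
res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])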
