I have a big dataframe. Some of the values in one column are NaN. I want to fill them with a value based on another column.
Data:
df =
A B
2019-10-01 09:19:40 667.029710 10
2019-10-01 09:20:15 673.518030 20
2019-10-01 09:21:29 533.137144 30
2020-07-25 15:51:15 NaN 40
2020-07-25 17:20:20 NaN 50
2020-07-25 17:21:23 NaN 60
I want to fill the NaN in the A column based on the B column value.
My code:
sdf = df[df['A'].isnull()].copy()             # slice out the NaN rows into a new dataframe
sdf['A'] = sdf['B'] * sdf['B']                # fill A from B
df = pd.concat([df[df['A'].notnull()], sdf])  # stitch back together, dropping the old NaN rows
Everything works, but the code feels lengthy. Is there a one-line way to do this?
fillna accepts a Series aligned by index, so we can do:
df.A.fillna(df.B**2, inplace=True)
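A self-contained sketch of this, using a hypothetical three-row frame shaped like the question's data (note that inplace=True on a single column can trigger chained-assignment warnings in newer pandas, so plain assignment is shown):

import numpy as np
import pandas as pd

# hypothetical miniature of the question's data
df = pd.DataFrame({'A': [667.0, np.nan, np.nan], 'B': [10, 40, 50]})

# fillna accepts a Series: each NaN in A is filled from B**2 at the same index
df['A'] = df['A'].fillna(df['B'] ** 2)
print(df)   # the NaNs become 1600.0 and 2500.0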
I have a pandas series called mean_return:
BHP 0.094214
GOOG 0.180892
INTC -0.179899
MRK 0.065741
MSFT 0.205519
MXL 0.153332
SHEL 0.001714
TSM 0.162741
WBD -0.233863
dtype: float64
pandas.core.series.Series
When I try to merge the above with the dataframe below, I get NaN values for mean_return (excuse the formatting; I'm not sure how to copy and paste the dataframe). The tickers are not ordered the same in the series as in the DataFrame. What can I do to merge the DataFrame and the series?
      0   mkt_value   weights    investment      shares  mean_return
0  GOOG   51.180000  0.115308  14413.469698  281.623087          NaN
1   BHP   99.570000  0.140488  17560.996495  176.368349          NaN
2  INTC   25.719999  0.092804  11600.473577  451.029311          NaN
3   MXL   87.599998  0.110175  13771.865664  157.213081          NaN
4   MRK  234.240005  0.102416  12801.944297   54.653108          NaN
5  MSFT   34.220001  0.123298  15412.217160  450.386225          NaN
6  SHEL   51.970001  0.142114  17764.225757  341.816920          NaN
7   TSM   69.750000  0.134963  16870.389838  241.869388          NaN
8   WBD   11.980000  0.038435   4804.417515   40
Here is the code I used:
df = pd.DataFrame(tickers_list)
df.rename({'index': 'tickers_list'}, axis='columns', inplace=True)
df['mkt_value'] = data.values[-1]
df['weights'] = weights
df['investment'] = port_size * weights
df['shares'] = df['investment'] / df['mkt_value']
df['mean_return'] = mean_return
df
If the tickers in the df are unique, I would set them as the index and then use pd.concat, which aligns on the index:
df = df.set_index('tickers_list')
mean_return.name = 'mean_return'
df = pd.concat([df, mean_return], axis=1)
Setting the name attribute matters because it becomes the name of the new column.
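A minimal sketch of that alignment, with made-up numbers and a deliberately shuffled ticker order:

import pandas as pd

# hypothetical miniature of the question's data
df = pd.DataFrame({'mkt_value': [51.18, 99.57, 25.72]},
                  index=['GOOG', 'BHP', 'INTC'])
mean_return = pd.Series({'INTC': -0.179899, 'GOOG': 0.180892, 'BHP': 0.094214},
                        name='mean_return')

out = pd.concat([df, mean_return], axis=1)   # rows are matched by index, not position
print(out)   # each ticker gets its own return despite the different ordering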
I have several columns in my df; one is error. If that column has a value in a row (it is always 99 as the error message value), I want to remove those rows and keep the ones that are NaN.
df:

error  date        count
99     nan         nan
nan    2022-02-01  234
nan    2022-02-02  34643
99     nan         nan
nan    2022-03-02  23425
99     nan         nan
I know how to drop rows if a value is NaN, but I want to do the opposite for the error column.
A more general solution than the one proposed by enke is:
df = df[df.error.isna()]
This way you retain only the rows with NaN in the error column, regardless of what the error value is in the original DataFrame.
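For example, on a hypothetical reconstruction of the question's frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'error': [99, np.nan, np.nan, 99],
                   'date': [np.nan, '2022-02-01', '2022-02-02', np.nan],
                   'count': [np.nan, 234, 34643, np.nan]})

clean = df[df['error'].isna()]   # keep only rows where error is missing
print(clean)                     # the dated rows survive; the error-99 rows are gone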
Let's say I have a dataset like the one below. I want to replace the null values with the median of each column, but when I try to do that, every NA is replaced with the median of the first column only.
Rough_df = pd.read_excel(r'Cleandata_withOutliers.xlsx', sheet_name='Sheet2')
Rough_df.fillna(Rough_df.select_dtypes(include='number').median().iloc[0], inplace=True)
My output looks like this:
But ideally, the NA values in the 2nd column should be replaced with 10170.5, not with 77.5. Where am I going wrong?
You can just do median with fillna. df.median() returns a Series with one median per column, and fillna aligns that Series by column name, so each column is filled with its own median; your .iloc[0] reduced the Series to the first column's median (a single scalar), which was then applied everywhere:
out = df.fillna(df.median())
Out[68]:
X Y
0 60.0 9550.0
1 85.0 10170.5
2 77.5 10791.0
3 101.0 14215.0
4 47.0 16321.0
5 108.0 10170.5
6 77.5 8658.0
7 70.0 7945.0
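Applied to the original code, a minimal fix could look like this (assuming the NaNs sit in numeric columns; numeric_only=True keeps median() away from any text columns):

import pandas as pd

Rough_df = pd.read_excel(r'Cleandata_withOutliers.xlsx', sheet_name='Sheet2')
# median() returns one value per column; fillna matches them by column name
Rough_df = Rough_df.fillna(Rough_df.median(numeric_only=True))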
I have a data frame with the date as the index and one parameter column. I want to convert that column into a new data frame with the year as the row index and the week number as the column names, where each cell holds the weekly mean value. I would then use this to plot with seaborn https://seaborn.pydata.org/generated/seaborn.relplot.html.
My data:
df =
data
2019-01-03 10
2019-01-04 20
2019-05-21 30
2019-05-22 40
2020-10-15 50
2020-10-16 60
2021-04-04 70
2021-04-05 80
My code:
# convert the df into a weekly-averaged series
# (the index is a DatetimeIndex, so call strftime on it directly; .dt exists only on Series)
wdf = df.groupby(df.index.strftime('%Y-%W')).data.mean()
wdf
2019-01 15
2019-26 35
2020-45 55
2021-20 75
Expected answer: the column names denote the week number, the index denotes the year, and each cell is the mean of the samples in that week.
01 20 26 45
2019 15 NaN 35 NaN # 15 is mean of 1st week (10,20) in above df
2020 NaN NaN NaN 55
2021 NaN 75 NaN NaN
I have no idea how to proceed from the above result to the expected answer.
You can use a pivot_table. Since the dates are in the index, derive the year and week from df.index (DatetimeIndex.week was removed in pandas 2.0, so use isocalendar().week):
df['year'] = df.index.year
df['week'] = df.index.isocalendar().week
final_table = pd.pivot_table(data=df, index='year', columns='week', values='data', aggfunc='mean')
You need to use two dimensions in the groupby, and then unstack to lay out the data as a grid:
df.groupby([df.index.year, df.index.isocalendar().week])['data'].mean().unstack()
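Put together on the question's sample data, a self-contained sketch:

import pandas as pd

# the question's sample data
df = pd.DataFrame(
    {'data': [10, 20, 30, 40, 50, 60, 70, 80]},
    index=pd.to_datetime(['2019-01-03', '2019-01-04', '2019-05-21', '2019-05-22',
                          '2020-10-15', '2020-10-16', '2021-04-04', '2021-04-05']))

# group by (year, ISO week), average, then spread the weeks across columns
wdf = df.groupby([df.index.year, df.index.isocalendar().week])['data'].mean().unstack()
print(wdf)   # rows: years; columns: week numbers; cells: weekly means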
I have a CSV file that looks something like this:
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select the values that are > 20 and < 500 (i.e., in the range 20 to 500) and put those values, along with the date, into a column of another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to take the date and value from the CSV and add them to the new dataframe in the appropriate columns, something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the max and min, but I am not sure how to find values based on a range. How can this be achieved?
Given dataframes df1 and df2, you can achieve this by aligning the column names, cleaning the numeric data, and then concatenating. (DataFrame.append was removed in pandas 2.0, so pd.concat is used below.)
import numpy as np
import pandas as pd

# align column names with df2 and clean inf/NaN out of the numeric data
df_app = (df1.loc[:, ['date', 'mean', 'min', 'std']]
             .rename(columns={'date': 'Date'})
             .replace(np.inf, 0)
             .fillna(0))

# per row, take the larger of the two candidate columns
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])

# keep only values in the 20-500 range
df_app = df_app[df_app['percentage_change'].between(20, 500)]

res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837