How to calculate the mean of a pandas DataFrame with NaN values - python

I have a DataFrame that looks as follows (the address key is the index):
address date1 date2 date3 date4 date5 date6 date7
<email> NaN NaN NaN 1 NaN NaN NaN
I want to calculate the mean across a row, but when I use DataFrame.mean(axis=1), I get NaN (in the above example, I want a mean of 1). I get NaN even when I use DataFrame.mean(axis=1, skipna=True, numeric_only=True). How can I get the correct mean for the rows in this DataFrame?

Despite appearances, your dtypes are not numeric, which is why you get NaN; you need to cast the columns using astype:
df['date4'] = df['date4'].astype(float)  # float rather than int, since int columns cannot hold NaN
Then it will work. Depending on how you loaded/created this data, this is something you should correct at that stage rather than as a post-processing step, if possible.
You can confirm what the dtypes are by looking at the output of df.info(), and you can also filter out non-numeric columns using select_dtypes: df.select_dtypes(include=[np.number]) selects just the numeric columns.
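A minimal sketch of that check-and-convert step, assuming the numbers were read in as strings (the column names and index label here are illustrative, not taken from the real data):
import numpy as np
import pandas as pd

# Reproduce a row whose values were read in as strings, so the dtype is object
df = pd.DataFrame({'date1': ['NaN'], 'date4': ['1']}, index=['some_address'])
print(df.dtypes)  # both columns show as object, which is why mean() gives NaN

# Coerce everything to numeric; entries that cannot be parsed become real NaN
df = df.apply(pd.to_numeric, errors='coerce')

# Now skipna works as expected
print(df.select_dtypes(include=[np.number]).mean(axis=1, skipna=True))  # -> 1.0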

Related

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates the following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN value in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
import pandas as pd
from datetime import datetime

df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash  # cash is defined elsewhere in the program
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills my missing dates. However, it is part of a program that tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
# Solution provided by Uts after asking on Stack Overflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it, the code adds a duplicate date, which then causes the .reindex() call to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex(), or does the assignment of today's date need changing?
Pandas has the asfreq function for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has a reindex method: given a list of indices, it conforms the DataFrame to that list, keeping existing rows and inserting NaN for indices that were missing.
In your case, you can create all the dates you want, with date_range for example, and then give that to reindex. You might need a simple set_index and reset_index as well, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex, giving it the full list of dates (produced by date_range from the minimum to the maximum date in the 'Date' column, at daily frequency) as the new index. This results in NaNs in the places that had no former value.
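Regarding the follow-up ValueError ("cannot reindex from a duplicate axis"): it appears when today's date is already in the CSV the next time the script runs. A sketch of one way to make the update idempotent, assuming the CSV has Date and Portfoliovalue columns as in the question (portfolio_value is a hypothetical placeholder here):
import pandas as pd
from datetime import date

df2 = pd.read_csv("Portfoliovalues.csv", parse_dates=['Date'])
portfolio_value = 12345.0  # placeholder; in the real script this comes from the tickers CSV

# Overwrite today's row by label instead of appending by position,
# so re-running on the same day never creates a duplicate date
df2 = df2.set_index('Date')
df2.loc[pd.Timestamp(date.today()), 'Portfoliovalue'] = portfolio_value

# Drop any duplicate dates that may already exist, then fill the gaps
df2 = df2[~df2.index.duplicated(keep='last')].sort_index().asfreq('D')
df2 = df2.reset_index()
Overwriting by label means a rerun on the same day simply updates that row, so asfreq (and the reindex it performs internally) never sees a duplicated index.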

Does this occur because there is an NaN?

I have a list of floats that looks like this, and I try to convert it into a Series or DataFrame:
code
000001.SZ 1.305442
000002.SZ 1.771655
000004.SZ 2.649862
000005.SZ 1.373074
000006.SZ 1.115238
...
601512.SH 16.305734
688123.SH 53.395579
603995.SH 19.598881
688268.SH 70.174454
002972.SZ 19.644900
300811.SZ 24.042762
688078.SH 86.263280
603109.SH NaN
Length: 3753, dtype: float64
df = pd.DataFrame(data=mylist, columns=["std_r_in20days"])
print(df)
s = pd.Series(mylist)
print(s)
The result is:
std_r_in20days
0 NaN
0 code
000001.SZ 1.305442
000002.SZ 1.77...
dtype: object
AttributeError: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')
Does this occur because there is an NaN in mylist? If so, how can I fix it? I don't want to delete the rows with NaN; I'd rather just leave them there.
import numpy as np
import pandas as pd

Ser = pd.DataFrame([[1, 2, 3, 4, None], [2, 3, 4, 5, np.nan]])
Ser = Ser.replace(np.nan, 0)
You can do it like this. There are also other functions in pandas, like fillna(). :)
Instead of removing the whole row, you can just replace those values with 0 using the pandas function
df.fillna(0)
Also, it helps to perform a null check at the beginning of your script using
df.isna().sum().sum()
This will give you the number of NaN values in your whole dataframe.
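A compact sketch of that check-then-fill pattern, with illustrative values mirroring the question's data:
import numpy as np
import pandas as pd

s = pd.Series([1.305442, 1.771655, np.nan, 2.649862], name='std_r_in20days')
df = s.to_frame()

print(df.isna().sum().sum())   # number of NaN values in the whole frame -> 1
df_filled = df.fillna(0)       # rows are kept; NaN is simply replaced with 0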

Reorganize Dataframe to Multi Index

I have the following dataframe after appending data from different source files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe into a multi-index layout, with Date as the index and a column level for the Input/Net/Output/Total labels.
I have tried to use melt and pivot, but with limited success in reshaping anything. I would appreciate some guidance!
P.S.: When using print(df), the day is shown as DD (e.g. 08). However, if I write this to a CSV file, a single-digit day becomes 8 instead of 08. I hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index level. Then unstack(0) moves the original index labels to become column names. Finally, drop the all-NaN columns and you have your desired result.
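A self-contained sketch of that answer on part of the sample data from the question, so it can be run directly; only the first two dates are included for brevity:
import numpy as np
import pandas as pd

labels = ['Input', 'Net', 'Output', 'Total'] * 2
df = pd.DataFrame({
    'Owed': [np.nan, np.nan, np.nan, -1.26, np.nan, np.nan, np.nan, -3.48],
    'Due':  [51.83, 35.91, -49.02, 38.72, 58.43, 9.15, -57.08, 10.50],
    'Date': ['08012019'] * 4 + ['09012019'] * 4,
}, index=labels)

wide = df.set_index('Date', append=True).unstack(0).dropna(axis=1)
print(wide)
As for the P.S.: the leading zero disappears only if the day ends up stored as a number, so keeping Date as a string (as above), or formatting a datetime column explicitly before writing, e.g. with dt.strftime('%d%m%Y'), preserves the 08.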

Find closest valid numbers among missing values in a pandas dataframe

I've got a dataset with multiple missing sequences of varying lengths where I'd like to find the first valid numbers that occur before and after these sequences for some particular dates. In the sample dataset below, I would like to find the valid numbers for ColumnB that occur closest to the date 2018-11-26.
Datasample:
Date ColumnA ColumnB
2018-11-19 107.00 NaN
2018-11-20 104.00 NaN
2018-11-21 106.00 NaN
2018-11-22 105.24 80.00
2018-11-23 104.63 NaN
2018-11-26 104.62 NaN
2018-11-28 104.54 NaN
2018-11-29 103.91 86.88
2018-11-30 103.43 NaN
2018-12-01 106.13 NaN
2018-12-02 110.83 NaN
Expected output:
[80, 86.88]
Some details:
If this particular sequence were the only one with missing values, I would have been able to solve it using for loops, or the pandas functions first_valid_index() or isnull() as described in Pandas - find first non-null value in column, but that will rarely be the case.
I'm able to solve this using a few for loops, but it's very slow for larger datasets and not very elegant, so I'd really like to hear other suggestions!
Try it this way: get the positional index of the date, then slice on each side and take the nearest valid number.
idx = np.where(df['Date'] == '2018-11-26')[0][0]
# idx = 5
num = (df.loc[df.loc[:idx, 'ColumnB'].last_valid_index(), 'ColumnB'],   # closest valid value at or before the date
       df.loc[df.loc[idx:, 'ColumnB'].first_valid_index(), 'ColumnB'])  # closest valid value at or after the date
num
(80.0, 86.879999999999995)
I'd try it this way:
import pandas as pd
import numpy as np

df_vld = df.dropna()  # note: this assumes the Date column is the (datetime) index of df
idx = np.argmin(abs(df_vld.index - pd.Timestamp(2018, 11, 26)))
# idx = 1
df_vld.loc[df_vld.index[idx]]
Out:
ColumnA 103.91
ColumnB 86.88
Name: 2018-11-29 00:00:00, dtype: float64
[df['ColumnB'].ffill().loc['2018-11-26'], df['ColumnB'].bfill().loc['2018-11-26']]
You can use ffill and bfill to create two columns with the values from before and after, as follows:
df['before'] = df.ColumnB.ffill()
df['after'] = df.ColumnB.bfill()
then get the values for the date you want with loc:
print (df.loc[df.Date == pd.to_datetime('2018-11-26'),['before','after']].values[0].tolist())
[80.0, 86.88]
and if you have a list of dates then you can use isin:
list_dates = ['2018-11-26','2018-11-28']
print (df.loc[df.Date.isin(pd.to_datetime(list_dates)),['before','after']].values.tolist())
[[80.0, 86.88], [80.0, 86.88]]
Here's a way to do it:
t = '2018-11-26'
Look for the index of the date t:
ix = df.loc[df.Date==t].index.values[0]
Keep positions of non-null values in ColumnB:
non_nulls = np.where(~df.ColumnB.isnull())[0]
Get the nearest non-null values both above and below:
[df.loc[non_nulls[non_nulls < ix][-1],'ColumnB']] + [df.loc[non_nulls[non_nulls > ix][0],'ColumnB']]
[80.0, 86.88]
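If this lookup has to be repeated for many dates, the ffill/bfill idea above can be wrapped in a small helper; a sketch assuming Date is a regular, datetime-typed column as in the sample (the function name is just illustrative):
import pandas as pd

def nearest_valid(df, column, dates):
    # Closest valid value at or before each row, and at or after each row
    before = df[column].ffill()
    after = df[column].bfill()
    mask = df['Date'].isin(pd.to_datetime(dates))
    return pd.concat([before[mask], after[mask]], axis=1).values.tolist()

# nearest_valid(df, 'ColumnB', ['2018-11-26'])  # -> [[80.0, 86.88]]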

optimal normalization of dataframe in pandas using std of backward looking window

Given the following DataFrame:
var
date
1900-01-31 0.0357
1900-02-28 0.0362
1900-03-31 0.0371
1900-04-30 0.0379
1900-05-31 0.0410
1900-06-30 0.0435
1900-07-31 0.0448
1900-08-31 0.0455
1900-09-30 0.0478
1900-10-31 0.0474
1900-11-30 0.0451
1900-12-31 0.0437
1901-01-31 0.0427
1901-02-28 0.0418
1901-03-31 0.0406
1901-04-30 0.0377
1901-05-31 0.0399
1901-06-30 0.0365
1901-07-31 0.0393
1901-08-31 0.0390
I need to normalize these values by dividing each one by the standard deviation over a backward-looking window of 500 days (~16 months, which means I can use a backward-looking window of 16 monthly observations). There are two ways to do this from what I've researched:
The first that came to mind is to use df.rolling().std() to make a new dataframe, iterate over both, divide accordingly, and replace the values in the original DataFrame with the results of the division.
The second idea I had was to iterate over the original DataFrame, make a new DataFrame that is simply all values over the last 16 months, calculate the std of that, and divide and store accordingly.
However, I'm not sure which is more efficient, or if either of these are efficient compared to other solutions. Here is the idea I'm running with:
def normalize(df, window=16):
    columnName = list(df.columns.values)[0]
    df[[columnName]] = df[[columnName]].apply(pd.to_numeric)
    df['std'] = df[columnName].rolling(window=window).std()
    df[columnName] = df.apply(lambda row: float(row[columnName]) / row['std'], axis=1)
    return df.loc[:, :columnName]
Output:
var
date
1900-01-31 NaN
1900-02-28 NaN
1900-03-31 NaN
1900-04-30 NaN
1900-05-31 NaN
1900-06-30 NaN
1900-07-31 NaN
1900-08-31 NaN
1900-09-30 NaN
1900-10-31 NaN
1900-11-30 NaN
1900-12-31 NaN
1901-01-31 NaN
1901-02-28 NaN
1901-03-31 NaN
1901-04-30 9.565440
1901-05-31 10.969396
1901-06-30 10.122305
1901-07-31 11.416832
1901-08-31 11.604732
Note that this output is good. I haven't manually checked whether the values are actually correct, but they seem accurate. My question is whether there is a more efficient or more intuitive way to perform this operation.
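For what it's worth, the division can usually be done without apply or an explicit loop by dividing the frame by its own rolling standard deviation; a sketch under the same 16-period-window assumption:
import pandas as pd

def normalize(df, window=16):
    # Each value is divided by the std of the preceding `window` observations;
    # the first window-1 rows come out as NaN because the std is undefined there.
    numeric = df.apply(pd.to_numeric)
    return numeric / numeric.rolling(window=window).std()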
