I am trying to calculate the Jarque-Bera test (a normality test) on my data, which looks like this (after a chain of operations):
ranking Q1 Q2 Q3 Q4
Date
2009-12-29 nan nan nan nan
2009-12-30 0.12 -0.21 -0.36 -0.39
2009-12-31 0.05 0.09 0.06 -0.02
2010-01-01 nan nan nan nan
2010-01-04 1.45 1.90 1.81 1.77
... ... ... ... ...
2020-10-13 -0.67 -0.59 -0.63 -0.61
2020-10-14 -0.05 -0.12 -0.05 -0.13
2020-10-15 -1.91 -1.62 -1.78 -1.91
2020-10-16 1.21 1.13 1.09 1.37
2020-10-19 -0.03 0.01 0.06 -0.02
I use a function like this:
import numpy as np
import pandas as pd
from scipy import stats

def stat(x):
    return pd.Series([x.mean(),
                      np.sqrt(x.var()),
                      stats.jarque_bera(x),
                      ],
                     index=['Return',
                            'Volatility',
                            'JB P-Value'
                            ])

data.apply(stat)
Whereas the mean and variance calculations work fine, I get an error message from the stats.jarque_bera function, which is:
ValueError: Length of passed values is 10, index implies 9.
Any idea?
I tried to reproduce this, and the function works fine for me when copying the 10 rows of data you provide above. This looks like a data input issue, where some column seems to have more values than its index implies (effectively, somehow len(data[col].values) > len(data[col].index)). You can try to figure out which column it is by running a naive "debugging" loop such as:
for col in data.columns:
    if len(data[col].values) != len(data[col].index):
        print(f"Column {col} has more/less values than the index")
However, the Jarque-Bera test documentation in SciPy says that x can be any "array-like" structure, so you don't need to pass a pd.Series, which might run you into issues with missing values, etc. Essentially, you can just pass a list of values and compute their JB test statistic and p-value.
So with that, I would modify your function to:
def stat(x):
    return pd.Series([x.mean(),
                      np.sqrt(x.var()),
                      # drop NaN and pass a plain numpy array instead of a pd.Series;
                      # jarque_bera returns (statistic, p-value), so keep element [1]
                      stats.jarque_bera(x.dropna().values)[1],
                      ],
                     index=['Return',
                            'Volatility',
                            'JB P-Value'
                            ])
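For reference, a minimal self-contained run of the modified function (on synthetic data mimicking your Q1..Q4 layout, not your actual returns) looks like this:

import numpy as np
import pandas as pd
from scipy import stats

def stat(x):
    return pd.Series([x.mean(),
                      np.sqrt(x.var()),
                      stats.jarque_bera(x.dropna().values)[1]],
                     index=['Return', 'Volatility', 'JB P-Value'])

# synthetic example frame with a few all-NaN dates, as in the original data
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=['Q1', 'Q2', 'Q3', 'Q4'])
data.iloc[[0, 3, 10]] = np.nan

print(data.apply(stat))  # one Return / Volatility / JB P-Value set per column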
I'm working with a dataframe containing environmental values (Sentinel-2 satellite NDVI), like:
Date ID_151894 ID_109386 ID_111656 ID_110006 ID_112281 ID_132408
0 2015-07-06 0.82 0.61 0.85 0.86 0.76 nan
1 2015-07-16 0.83 0.81 0.77 0.83 0.84 0.82
2 2015-08-02 0.88 0.89 0.89 0.89 0.86 0.84
3 2015-08-05 nan nan 0.85 nan 0.83 0.77
4 2015-08-12 0.82 0.77 nan 0.65 nan 0.42
5 2015-08-22 0.85 0.85 0.88 0.87 0.83 0.83
The columns correspond to different places, and the NaN values are due to cloudy conditions (which happen often in Belgium). There are obviously a lot more values than shown. To remove outliers, I use the method described in the TIMESAT manual (Jönsson & Eklundh, 2015), which flags a value as an outlier if:
it deviates by more than a maximum deviation (here called the cutoff) from the median,
it is lower than the mean value of its immediate neighbors minus the cutoff,
or it is larger than the highest value of its immediate neighbors plus the cutoff.
So I have written the code below to do this:
import numpy as np
import pandas as pd

NDVI = pd.read_excel("C:/Python_files/Cartofor/NDVI_frene_5ha.xlsx")
date = NDVI["Date"]
MED = NDVI.median(axis=0, skipna=True, numeric_only=True)  # per-column median (Date excluded)
SD = NDVI.std(axis=0, skipna=True, numeric_only=True)      # per-column standard deviation
cutoff = 1.5 * SD
NDVIF = NDVI.copy()                                        # filtered copy to write NaNs into

for j in range(1, 21):       # columns (column 0 is the Date)
    for i in range(1, 480):  # rows
        # MED/SD/cutoff exclude the Date column, hence the j - 1 offset
        med_j, cut_j = MED.iloc[j - 1], cutoff.iloc[j - 1]
        # 2) lower than the mean of its immediate neighbours minus the cutoff
        if NDVIF.iloc[i, j] < ((NDVIF.iloc[i - 1, j] + NDVIF.iloc[i + 1, j]) / 2) - cut_j:
            NDVIF.iloc[i, j] = np.nan
        # 3) larger than the highest immediate neighbour plus the cutoff
        elif NDVIF.iloc[i, j] > max(NDVIF.iloc[i - 1, j], NDVIF.iloc[i + 1, j]) + cut_j:
            NDVIF.iloc[i, j] = np.nan
        # 1) within the cutoff of the median: keep the value
        elif abs(NDVIF.iloc[i, j] - med_j) <= cut_j:
            pass
        else:
            NDVIF.iloc[i, j] = np.nan
The problem is that I need to omit the NaN values in these calculations. The goal is to have a dataframe like the one above, but without the outliers.
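In case it is useful, here is a NaN-aware sketch of the same three rules using shifted columns instead of explicit loops (a sketch only, assuming a 'Date' column plus numeric NDVI columns as in the excerpt above; comparisons involving NaN evaluate to False, so cloudy dates are simply left in place):

import numpy as np
import pandas as pd

def remove_outliers(ndvi, k=1.5):
    # flag TIMESAT-style outliers with NaN, ignoring values that are already NaN
    values = ndvi.drop(columns=["Date"])              # numeric NDVI columns only
    med = values.median(skipna=True)                  # per-column median
    cutoff = k * values.std(skipna=True)              # per-column cutoff (1.5 * SD)

    prev_, next_ = values.shift(1), values.shift(-1)  # immediate neighbours in time
    too_low = values < (prev_ + next_) / 2 - cutoff            # rule 2
    too_high = values > np.maximum(prev_, next_) + cutoff      # rule 3
    far_from_median = (values - med).abs() > cutoff            # rule 1

    cleaned = values.mask(too_low | too_high | far_from_median)
    return pd.concat([ndvi["Date"], cleaned], axis=1)

# NDVIF = remove_outliers(NDVI)   # NDVI as read from the Excel file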
Once this is done, I have to interpolate the values onto a new, chosen time index (e.g. one value per day, or one value every five days, from 2016 to 2020) and write each interpolated column to a txt file to feed it into the TIMESAT software.
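For that second step, a rough sketch (assuming the cleaned frame NDVIF from above, with arbitrary file names and a 5-day grid) could be:

import pandas as pd

# put the series on a DatetimeIndex
ts = NDVIF.set_index(pd.to_datetime(NDVIF["Date"])).drop(columns=["Date"])

# resample onto a regular 5-day grid, interpolate in time, keep 2016-2020
regular = (ts.resample("5D").mean()
             .interpolate(method="time")
             .loc["2016":"2020"])

# write one plain-text file per column, e.g. for TIMESAT
for col in regular.columns:
    regular[col].to_csv(f"{col}.txt", sep="\t", header=False)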
I hope my English is not too bad, and thank you for your answers! :)
I'm new to Python and also new to this site. My colleague and I are working on a time series dataset. We wish to introduce some missing values into the dataset and then use some techniques to fill them in, to see how well those techniques perform at the data imputation task. The challenge we have at the moment is how to introduce missing values in a consecutive manner, not just randomly. For example, we want to replace the data for a period of time with NaNs, e.g. 3 consecutive days. I would really appreciate it if anyone could point us in the right direction on how to get this done. We are working with Python.
Here is my sample data
There is a method for filling NaNs:
dataframe['name_of_column'].fillna('value')
See the set_missing_data function below:
import numpy as np

np.set_printoptions(precision=3, linewidth=1000)

def set_missing_data(data, missing_locations, missing_length):
    # blank out `missing_length` consecutive entries starting at each chosen location
    for i in missing_locations:
        data[i:i + missing_length] = np.nan

np.random.seed(0)
n_data_points = np.random.randint(40, 50)
data = np.random.normal(size=[n_data_points])

n_missing = np.random.randint(3, 6)
missing_length = 3
missing_locations = np.random.choice(
    n_data_points - missing_length,
    size=n_missing,
    replace=False
)

print(data)
set_missing_data(data, missing_locations, missing_length)
print(data)
Console output:
[ 0.118 0.114 0.37 1.041 -1.517 -0.866 -0.055 -0.107 1.365 -0.098 -2.426 -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 1.419 1.168 0.947 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
[ 0.118 nan nan nan -1.517 -0.866 -0.055 -0.107 nan nan nan -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 nan nan nan 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
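Since your data is a time series with dates, the same idea can also be expressed directly in pandas by slicing a DatetimeIndex; a small sketch on made-up daily data (column name and dates are arbitrary):

import numpy as np
import pandas as pd

# synthetic daily series standing in for your dataset
idx = pd.date_range("2020-01-01", periods=30, freq="D")
df = pd.DataFrame({"value": np.random.default_rng(0).normal(size=len(idx))}, index=idx)

# knock out 3 consecutive days starting at a chosen date (.loc slicing is inclusive)
start = pd.Timestamp("2020-01-10")
df.loc[start:start + pd.Timedelta(days=2), "value"] = np.nan

print(df["value"].isna().sum())   # -> 3 consecutive NaNs introduced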
I called describe on one column of a dataframe and ended up with the following output:
count 1.048575e+06
mean 8.232821e+01
std 2.859016e+02
min 0.000000e+00
25% 3.000000e+00
50% 1.400000e+01
75% 6.000000e+01
max 8.599700e+04
What parameter do I pass to get meaningful integer values? What I mean is, when I check the SQL count it's about 43 million, and all the other values are also different. Can someone help me understand what this conversion means, and how do I get the floats rounded to 2 decimal places? I'm new to Pandas.
You can directly use round() and pass the number of decimals you want as an argument:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# setting the seed to create the dataframe
np.random.seed(25)
# Creating a 5 * 4 dataframe
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
# rounding describe
df.describe().round(2)
A B C D
count 5.00 5.00 5.00 5.00
mean 0.52 0.47 0.38 0.42
std 0.21 0.23 0.19 0.29
min 0.33 0.12 0.16 0.11
25% 0.41 0.37 0.28 0.19
50% 0.45 0.58 0.37 0.44
75% 0.56 0.59 0.40 0.52
max 0.87 0.70 0.68 0.84
DOCS
There are two ways to control the float output in pandas: either by setting a global display option, or by formatting with apply.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df['X'].describe().apply("{0:.5f}".format)
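Note that the set_option approach changes the display globally, for every DataFrame; if that is not what you want, you can revert it afterwards:

pd.reset_option('display.float_format')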
multidimensional interpolation with dataframe not working
import pandas as pd
import numpy as np
raw_data = {'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
'17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
'STRIKE':[0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
'VOLATILITY':[np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan]
}
df_volsurface = pd.DataFrame(raw_data,columns = ['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima',limit_direction='both')
Output:
    CCY_CODE   END_DATE  STRIKE  VOLATILITY
0        SGD  3/16/2018   0.005         NaN
1        USD  3/17/2018   0.01          NaN
2        USD  3/17/2018   0.015      0.3424
3        USD  3/17/2018   0.02     0.296358
4        USD  3/17/2018   0.025      0.2617
5        USD  3/17/2018   0.03       0.2414
6        USD  3/17/2018   0.035    0.230295
7        EUR  3/17/2018   0.04     0.220911
8        EUR  3/17/2018   0.045       0.215
9        EUR  3/17/2018   0.05        0.212
10       EUR  3/17/2018   0.55       0.2103
11       EUR  3/17/2018   0.06     0.209471
12       EUR  3/17/2018   0.065      0.2092
13       USD  3/17/2018   0.07          NaN
Expected Result:
    CCY_CODE   END_DATE  STRIKE  VOLATILITY
0        SGD  3/16/2018   0.005  NaN
1        USD  3/17/2018   0.01   expected some logical value
2        USD  3/17/2018   0.015  0.3424
3        USD  3/17/2018   0.02   0.296358
4        USD  3/17/2018   0.025  0.2617
5        USD  3/17/2018   0.03   0.2414
6        USD  3/17/2018   0.035  0.230295
7        EUR  3/17/2018   0.04   0.220911
8        EUR  3/17/2018   0.045  0.215
9        EUR  3/17/2018   0.05   0.212
10       EUR  3/17/2018   0.55   0.2103
11       EUR  3/17/2018   0.06   0.209471
12       EUR  3/17/2018   0.065  0.2092
13       USD  3/17/2018   0.07   expected some logical value
The linear interpolation method just copies the last available value to all backward and forward missing values, without considering CCY_CODE:
df_volsurface.interpolate(method='linear',limit_direction='both')
Output:
    CCY_CODE   END_DATE  STRIKE  VOLATILITY
0        SGD  3/16/2018   0.005     0.3424
1        USD  3/17/2018   0.01      0.3424
2        USD  3/17/2018   0.015     0.3424
3        USD  3/17/2018   0.02      0.30205
4        USD  3/17/2018   0.025     0.2617
5        USD  3/17/2018   0.03      0.2414
6        USD  3/17/2018   0.035     0.2326
7        EUR  3/17/2018   0.04      0.2238
8        EUR  3/17/2018   0.045     0.215
9        EUR  3/17/2018   0.05      0.212
10       EUR  3/17/2018   0.55      0.2103
11       EUR  3/17/2018   0.06      0.20975
12       EUR  3/17/2018   0.065     0.2092
13       USD  3/17/2018   0.07      0.2092
Any help is appreciated! Thanks!
I'd like to point out that this is still one-dimensional interpolation. We have one independent variable ('STRIKE') and one dependent variable ('VOLATILITY'). Interpolation is done for different conditions, e.g. for each day, each currency, each scenario, etc. The following is an example of how the interpolation can be done based on 'END_DATE' and 'CCY_CODE'.
import scipy.interpolate

# set all the conditions as index
df_volsurface.set_index(['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)
df_volsurface.sort_index(level=['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)

# create separate columns for all criteria except the independent variable
df_volsurface = df_volsurface.unstack(level=['END_DATE', 'CCY_CODE'])

for ccy in df_volsurface:
    indices = df_volsurface[ccy].notna()
    if not any(indices):
        continue  # we are not interested in a column with only NaN
    x = df_volsurface.index.get_level_values(level='STRIKE')  # independent variable
    y = df_volsurface[ccy]                                    # dependent variable
    # create interpolation function
    f_interp = scipy.interpolate.interp1d(x[indices], y[indices], kind='linear',
                                          bounds_error=False, fill_value='extrapolate')
    df_volsurface['VOL_INTERP', ccy[1], ccy[2]] = f_interp(x)

print(df_volsurface)
The interpolation for the other conditions should work analogously. This is the resulting DataFrame:
VOLATILITY VOL_INTERP
END_DATE 2018-03-16 2018-03-17 2018-03-17
CCY_CODE SGD EUR USD EUR USD
STRIKE
0.005 NaN NaN NaN 0.23900 0.42310
0.010 NaN NaN NaN 0.23600 0.38275
0.015 NaN NaN 0.3424 0.23300 0.34240
0.020 NaN NaN NaN 0.23000 0.30205
0.025 NaN NaN 0.2617 0.22700 0.26170
0.030 NaN NaN 0.2414 0.22400 0.24140
0.035 NaN NaN NaN 0.22100 0.22110
0.040 NaN NaN NaN 0.21800 0.20080
0.045 NaN 0.2150 NaN 0.21500 0.18050
0.050 NaN 0.2120 NaN 0.21200 0.16020
0.055 NaN 0.2103 NaN 0.21030 0.13990
0.060 NaN NaN NaN 0.20975 0.11960
0.065 NaN 0.2092 NaN 0.20920 0.09930
0.070 NaN NaN NaN 0.20865 0.07900
Use df_volsurface.stack() to return to a MultiIndex of your choice. There are also several pandas interpolation methods to choose from. However, I have not found a satisfactory solution for your problem using method='akima', because it only interpolates between the given data points but does not seem to extrapolate beyond them.
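For instance, a short sketch of going back to a long layout (assuming the column level names END_DATE and CCY_CODE were preserved by the unstack above):

# back from the wide layout to one row per (STRIKE, END_DATE, CCY_CODE)
df_long = df_volsurface.stack(level=['END_DATE', 'CCY_CODE'])
print(df_long)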
As part of a data comparison in Python, I have a dataframe's output. As you can see, PROD_ and PROJ_ data are compared.
Sample:
print (df)
PROD_Label PROJ_Label Diff_Label PROD_OAD PROJ_OAD \
0 Energy Energy True 1.94 1.94
1 Food and Beverage Food and Beverage True 1.97 1.97
2 Healthcare Healthcare True 8.23 8.23
3 Consumer Products Consumer Products True 3.67 NaN
4 Retailers Retailers True 5.88 NaN
Diff_OAD PROD_OAD_Tin PROJ_OAD_Tin Diff_OAD_Tin
0 True 0.02 0.02 True
1 True 0.54 0.01 False
2 True 0.05 0.05 True
3 False 0.02 0.02 True
4 False 0.06 0.06 True
String columns like PROD_Label, PROJ_Label are "non-null objects". Here the comparison results in True/False, as expected.
Numeric columns like PROD_OAD, PROJ_OAD, PROD_OAD_Tin, PROJ_OAD_Tin are "non-null float64". Currently my output shows the comparison as True/False (as above), but I expect it to show the actual differences, and only for the numeric columns.
Is there a way to specify the particular column names and get the resulting differences dumped into the corresponding Diff_ column?
Please note I don't want to compare all the PROD_ and PROJ_ columns; the string differences are already correct as True/False. I'm just looking at some specific columns which are in numeric format.
I think if only numeric columns with the same structure exist, it is possible to extract the numeric columns and get their unique suffixes, which are then used in a for loop with sub:
a = df.select_dtypes([np.number]).columns.str.split('_', n=1).str[1].unique()
print (a)
Index(['OAD', 'OAD_Tin'], dtype='object')
for x in a:
    df['Diff_' + x] = df['PROJ_' + x].sub(df['PROD_' + x], fill_value=0)
print (df)
PROD_Label PROJ_Label Diff_Label PROD_OAD PROJ_OAD \
0 Energy Energy True 1.94 1.94
1 Food and Beverage Food and Beverage True 1.97 1.97
2 Healthcare Healthcare True 8.23 8.23
3 Consumer Products Consumer Products True 3.67 NaN
4 Retailers Retailers True 5.88 NaN
Diff_OAD PROD_OAD_Tin PROJ_OAD_Tin Diff_OAD_Tin
0 0.00 0.02 0.02 0.00
1 0.00 0.54 0.01 -0.53
2 0.00 0.05 0.05 0.00
3 -3.67 0.02 0.02 0.00
4 -5.88 0.06 0.06 0.00
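If you prefer to name the columns explicitly rather than detect them by dtype, a small variation with a hand-written suffix list would be:

# compare only the suffixes you care about
for x in ['OAD', 'OAD_Tin']:
    df['Diff_' + x] = df['PROJ_' + x].sub(df['PROD_' + x], fill_value=0)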