Exponentially Weighted Covariance Matrix in Python

I have weekly return data in ascending order. I want to calculate an exponentially weighted moving covariance matrix with a half-life of 5 years, i.e. 260 weeks. I am trying to follow pandas.DataFrame.ewm, but it is not clear how to implement this in my case, since I can't pass halflife directly: "Only applicable to mean() and halflife value will not apply to the other functions."
Using alpha = 1 - exp(ln(0.5)/halflife) together with alpha = 2/(span + 1), a half-life of 260 weeks gives span ≈ 750, so I used span = 750.
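A quick numeric check of that conversion:
import numpy as np

halflife = 260                               # 5 years of weekly data
alpha = 1 - np.exp(np.log(0.5) / halflife)   # decay rate implied by the half-life
span = 2 / alpha - 1                         # invert alpha = 2 / (span + 1)
print(span)                                  # ~750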
ret is my return data frame with DatetimeIndex.
ret.shape is (895, 11)
I tried SigmaExpW = ret.ewm(span=750).cov(), but the result is not what I expected: SigmaExpW.shape is (9845, 11).
For context -
ret.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 895 entries, 2003-06-04 to 2020-07-22
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GE 895 non-null float64
1 IGM 895 non-null float64
2 USLC 895 non-null float64
3 USSC 895 non-null float64
4 ExUS 895 non-null float64
5 ExUSHedged 895 non-null float64
6 EM 895 non-null float64
7 Commodities 895 non-null float64
8 HYM 895 non-null float64
9 Government 895 non-null float64
10 Cash 895 non-null float64
dtypes: float64(11)
I am clearly making some elementary mistake. Would appreciate help.

ewm().cov() returns a full 11x11 covariance matrix for every date, stacked into one long frame with a MultiIndex of (date, column): 895 dates * 11 rows each = 9845 rows. So a shape of (9845, 11) is expected, not a mistake.
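If you only want the matrix as of a single date, slice it out of the MultiIndex; a minimal sketch, assuming ret is the return frame from the question:
SigmaExpW = ret.ewm(span=750).cov()
# rows are indexed by (date, column); grab the 11x11 block for the last date
Sigma_latest = SigmaExpW.loc[ret.index[-1]]
print(Sigma_latest.shape)  # (11, 11)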

Related

pandas.DataFrame.convert_dtypes increasing memory usage

This question is to discuss and understand pandas.DataFrame.convert_dtypes a bit better.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null object
1 cd_ref_cnv 856389 non-null object
2 cd_cli 849637 non-null object
3 nm_prd 857613 non-null object
4 nm_ctgr_cpr 857613 non-null object
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null float64
9 qt_prd 857613 non-null float64
10 pc_cmss_rec 857242 non-null float64
11 nm_loja 857242 non-null object
12 vl_brto_cpr 857242 non-null float64
13 vl_cpr 857242 non-null float64
14 qt_dvlc 857613 non-null float64
15 cd_in_evt_espl 857613 non-null float64
16 cd_mm_aa_ref 840959 non-null object
17 nr_est_ctbc_evt 857613 non-null float64
18 nr_est_cnfc_pcr 18963 non-null float64
19 cd_tran_pcr 0 non-null object
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null object
22 vl_tran 18963 non-null float64
23 cd_pcr 0 non-null float64
24 vl_cbac_cli 653563 non-null float64
25 pc_cbac_cli 653563 non-null float64
26 cd_vndr 18963 non-null float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64, and object dtypes, none of which are memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes for optimizing memory usage; this is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null string
1 cd_ref_cnv 856389 non-null string
2 cd_cli 849637 non-null string
3 nm_prd 857613 non-null string
4 nm_ctgr_cpr 857613 non-null string
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null Float64
9 qt_prd 857613 non-null Int64
10 pc_cmss_rec 857242 non-null Float64
11 nm_loja 857242 non-null string
12 vl_brto_cpr 857242 non-null Float64
13 vl_cpr 857242 non-null Float64
14 qt_dvlc 857613 non-null Int64
15 cd_in_evt_espl 857613 non-null Int64
16 cd_mm_aa_ref 840959 non-null string
17 nr_est_ctbc_evt 857613 non-null Int64
18 nr_est_cnfc_pcr 18963 non-null Int64
19 cd_tran_pcr 0 non-null Int64
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null string
22 vl_tran 18963 non-null Float64
23 cd_pcr 0 non-null Int64
24 vl_cbac_cli 653563 non-null Float64
25 pc_cbac_cli 653563 non-null Float64
26 cd_vndr 18963 non-null Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64, which I expected to reduce memory usage, but as we can see, the memory usage increased!
Any guess why?
After doing some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes: they take approximately 9 bytes to store a single value (8 bytes for the value plus one byte in a validity mask), while plain int64/float64 take 8 bytes per value.
Here is a small example to demonstrate this:
pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index 128
col 80 # 8 byte per item * 10 items
dtype: int64
pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index 128
col 90 # 9 byte per item * 10 items
dtype: int64
Now, coming back to your example: after executing convert_dtypes, around 15 columns were converted from float64 to Int64/Float64. Let's calculate the amount of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
It turns out extra_mega_bytes is around 12.26 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes (188.9 - 176.7 ≈ 12.2 MB).
Some details about the new nullable integer datatype:
Int64/Float64 (notice the capital first letter) are among the nullable types introduced in pandas version >= 0.24. At a high level, they allow you to use pd.NA instead of np.nan to represent missing values; the implication of this can be better understood in the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0 1.0
1 2.0
2 NaN
dtype: float64
Say you have the series s above. When you check the dtype, you'll see pandas automatically cast it to float64 because of the presence of null values. This is not problematic in most cases, but if you have a column that acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced these new nullable integer types:
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
Some details on the string dtype:
As of now there isn't much of a performance or memory difference when using the new string type, but this can change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
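If you want to measure the object-vs-string difference yourself, here is a small sketch; memory_usage(deep=True) accounts for the per-element Python string payloads:
import pandas as pd

s_obj = pd.Series(['a', 'bb', 'ccc'] * 100_000, dtype='object')
s_str = s_obj.astype('string')

# deep=True includes the memory held by the string objects themselves
print(s_obj.memory_usage(deep=True))
print(s_str.memory_usage(deep=True))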

Trying to use a Lambda Function and getting NaN

I have an object named air_list and it contains the following in the first column:
['John_F_Kennedy_International_Airport', 'Charlotte_Douglas_International_Airport', 'OHare_International_Airport']
In the second and third columns, I have Lat and Long coordinates.
I have another object named cust_loc; it contains Lat/Long coordinates and, for each airport, a column with the distance to that airport. Now, I'm trying to use a lambda function to basically say: if the distance is less than 500 miles, the Condition is 'In', otherwise it's 'Out'.
Here's the function that I am testing out.
for i in air_list:
    cust_loc.loc['Condition'] = cust_loc.loc[cust_loc.Condition == 'Out'][i].apply(lambda x: 'In' if x <= 500 else 'Out')
The 'In' flags seem to be fine, but all 'Out' flags come in as NaNs. All coordinates and distances are floats and the coor column is an object. Any idea what's wrong with my setup?
These are two Pandas Dataframes:
cust_loc.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Longitude 150961 non-null float64
1 Latitude 150961 non-null float64
2 coor 150961 non-null object
3 John_F_Kennedy_International_Airport 150961 non-null float64
4 Charlotte_Douglas_International_Airport 150961 non-null float64
5 OHare_International_Airport 150961 non-null float64
6 Tucson_International_Airport 150961 non-null float64
7 Candy_Kitchen_Ranch_Airport 150961 non-null float64
8 Canandaigua_Airport 150961 non-null float64
9 Asheville_Regional_Airport 150961 non-null float64
10 Dallas_Love_Field_Airport 150961 non-null float64
11 Fly_Barts 150961 non-null float64
12 Tampa_International_Airport 150961 non-null float64
13 Condition 150961 non-null object
air_list.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Place_Name 10 non-null object
1 Latitude 10 non-null float64
2 Longitude 10 non-null float64
3 coor 10 non-null object
Thanks.
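For reference, a minimal vectorized sketch that avoids assigning from a filtered subset (which leaves the unmatched rows as NaN). It assumes the distance columns in cust_loc are named after the airports in air_list['Place_Name']:
import numpy as np

# 'In' if the customer is within 500 miles of any airport, else 'Out'
airport_cols = list(air_list['Place_Name'])
cust_loc['Condition'] = np.where(
    cust_loc[airport_cols].le(500).any(axis=1), 'In', 'Out'
)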

Division of two dataframes with a groupby on a column in Pandas

I have a dataframe df_F1:
df_F1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 7 columns):
class_energy 2 non-null object
ACT_TIME_AERATEUR_1_F1 2 non-null float64
ACT_TIME_AERATEUR_1_F3 2 non-null float64
ACT_TIME_AERATEUR_1_F5 2 non-null float64
ACT_TIME_AERATEUR_1_F8 2 non-null float64
ACT_TIME_AERATEUR_1_F7 2 non-null float64
ACT_TIME_AERATEUR_1_F8 2 non-null float64
dtypes: float64(6), object(1)
memory usage: 128.0+ bytes
df_F1.head()
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 ACT_TIME_AERATEUR_1_F5
low 5.875550 431.000000 856.666667
medium 856.666667 856.666667 856.666667
I am trying to create a dataframe Ratio which contains, for each class_energy, the energy value of each ACT_TIME_AERATEUR_1_Fx divided by the sum of the energy of all ACT_TIME_AERATEUR_1_Fx columns.
For example:
ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 ACT_TIME_AERATEUR_1_F5
low 5.875550/(5.875550 + 431.000000+856.666667) 431.000000/(5.875550+431.000000+856.666667) 856.666667/(5.875550+431.000000+856.666667)
medium 856.666667/(856.666667+856.666667+856.666667) 856.666667/(856.666667+856.666667+856.666667) 856.666667/(856.666667+856.666667+856.666667)
Any ideas to help me, please?
You could use DataFrame.divide to divide the required columns by their row-wise sum, as shown:
# divide each Fx column by the row-wise sum of those same columns
df.iloc[:, 1:4] = df.iloc[:, 1:4].divide(df.iloc[:, 1:4].sum(axis=1), axis=0)
print(df)
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 \
0 low 0.004542 0.333194
1 medium 0.333333 0.333333
ACT_TIME_AERATEUR_1_F5
0 0.662264
1 0.333333
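An equivalent sketch that covers all the Fx columns at once (assuming class_energy should simply label the rows) is to move class_energy into the index first:
ratio = df.set_index('class_energy')
ratio = ratio.div(ratio.sum(axis=1), axis=0)
print(ratio)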

Python - Input contains NaN, infinity or a value too large for dtype('float64')

I am new to Python. I am trying to use sklearn.cluster.
Here is my code:
from sklearn.cluster import MiniBatchKMeans
kmeans=MiniBatchKMeans(n_clusters=2)
kmeans.fit(df)
But I get the following error:
50 and not np.isfinite(X).all()):
51 raise ValueError("Input contains NaN, infinity"
---> 52 " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I checked that there are no NaN or infinity values, so only one option is left. However, my data info tells me that all variables are float64, so I don't understand where the problem comes from.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User 362358 non-null float64
Hour 362352 non-null float64
Minute 362352 non-null float64
Day 362352 non-null float64
Month 362352 non-null float64
Year 362352 non-null float64
Latitude 362352 non-null float64
Longitude 362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
Thanks a lot,
Looking at your df.info(), there are 6 more non-null User values (362358) than there are in any other column (362352). That indicates you have 6 nulls in each of the other columns, and that is the reason for the error.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User 362358 non-null float64
Hour 362352 non-null float64
Minute 362352 non-null float64
Day 362352 non-null float64
Month 362352 non-null float64
Year 362352 non-null float64
Latitude 362352 non-null float64
Longitude 362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
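A minimal sketch of the fix under that diagnosis:
from sklearn.cluster import MiniBatchKMeans

df = df.dropna()                       # drop the 6 rows containing nulls
kmeans = MiniBatchKMeans(n_clusters=2)
kmeans.fit(df)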
I think fit() accepts only "array-like, shape = [n_samples, n_features]", not pandas dataframes. So try passing the values of the dataframe into it:
kmeans=MiniBatchKMeans(n_clusters=2)
kmeans.fit(df.values)
Or shape them in order to run the function correctly. Hope that helps.
As noted above, df.info() shows 6 more non-null User values than any other column, i.e. 6 rows with nulls, and that is the reason for the error.
So you can slice your data down to complete rows with iloc(), for example:
df = pd.read_csv(location1, encoding = "ISO-8859-1").iloc[2:20]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 2 to 19
Data columns (total 6 columns):
zip_code 18 non-null int64
latitude 18 non-null float64
longitude 18 non-null float64
city 18 non-null object
state 18 non-null object
county 18 non-null object
dtypes: float64(2), int64(1), object(3)

Plotting with GroupBy in Pandas/Python

Although it is straightforward and easy to plot groupby objects in pandas, I am wondering what the most pythonic (pandastic?) way is to grab the unique groups from a groupby object. For example:
I am working with atmospheric data and trying to plot diurnal trends over a period of several days or more. The following is the DataFrame containing many days worth of data where the timestamp is the index:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10909 entries, 2013-08-04 12:01:00 to 2013-08-13 17:43:00
Data columns (total 17 columns):
Date 10909 non-null values
Flags 10909 non-null values
Time 10909 non-null values
convt 10909 non-null values
hino 10909 non-null values
hinox 10909 non-null values
intt 10909 non-null values
no 10909 non-null values
nox 10909 non-null values
ozonf 10909 non-null values
pmtt 10909 non-null values
pmtv 10909 non-null values
pres 10909 non-null values
rctt 10909 non-null values
smplf 10909 non-null values
stamp 10909 non-null values
no2 10909 non-null values
dtypes: datetime64[ns](1), float64(11), int64(2), object(3)
To average the data (and take other statistics) at every minute across several days, I group the dataframe:
data = no.groupby('Time')
I can then easily plot the mean NO concentration as well as quartiles:
from matplotlib.pyplot import figure, title, ylabel
from numpy import percentile

ax = figure(figsize=(12,8)).add_subplot(111)
title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
ylabel('Concentration [ppb]')
data.no.mean().plot(ax=ax, style='b', label='Mean')
data.no.apply(lambda x: percentile(x, 25)).plot(ax=ax, style='r', label='25%')
data.no.apply(lambda x: percentile(x, 75)).plot(ax=ax, style='r', label='75%')
The issue that fuels my question is that in order to plot more interesting-looking things, such as plots using fill_between(), it is necessary to know the x-axis information, per the documentation:
fill_between(x, y1, y2=0, where=None, interpolate=False, hold=None, **kwargs)
For the life of me, I cannot figure out the best way to accomplish this. I have tried:
- Iterating over the groupby object and creating an array of the groups
- Grabbing all of the unique Time entries from the original DataFrame
I can make these work, but I know there is a better way. Python is far too beautiful. Any ideas/hints?
UPDATES:
The statistics can be dumped into a new dataframe using unstack() such as
no_new = no.groupby('Time')['no'].describe().unstack()
no_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1440 entries, 00:00 to 23:59
Data columns (total 8 columns):
count 1440 non-null values
mean 1440 non-null values
std 1440 non-null values
min 1440 non-null values
25% 1440 non-null values
50% 1440 non-null values
75% 1440 non-null values
max 1440 non-null values
dtypes: float64(8)
Although I should be able to plot with fill_between() using no_new.index, I receive a TypeError.
Current Plot code and TypeError:
ax = figure(figsize=(12,8)).add_subplot(111)
ax.plot(no_new['mean'])
ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
TypeError:
TypeError Traceback (most recent call last)
<ipython-input-6-47493de920f1> in <module>()
2 ax = figure(figsize=(12,8)).add_subplot(111)
3 ax.plot(no_new['mean'])
----> 4 ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
5 #title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
6 #ylabel('Concentration [ppb]')
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\axes.pyc in fill_between(self, x, y1, y2, where, interpolate, **kwargs)
6986
6987 # Convert the arrays so we can work with them
-> 6988 x = ma.masked_invalid(self.convert_xunits(x))
6989 y1 = ma.masked_invalid(self.convert_yunits(y1))
6990 y2 = ma.masked_invalid(self.convert_yunits(y2))
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\ma\core.pyc in masked_invalid(a, copy)
2237 cls = type(a)
2238 else:
-> 2239 condition = ~(np.isfinite(a))
2240 cls = MaskedArray
2241 result = a.view(cls)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The plot as of now looks like this: [plot image from the original post omitted]
Storing the groupby stats (mean/25/75) as columns in a new dataframe and then passing the new dataframe's index as the x parameter of plt.fill_between() works for me (tested with matplotlib 1.3.1). e.g.,
import matplotlib.pyplot as plt

gdf = df.groupby('Time')[col].describe().unstack()
plt.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5)
gdf.info() should look like this:
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 00:00:00 to 22:00:00
Data columns (total 8 columns):
count 12 non-null float64
mean 12 non-null float64
std 12 non-null float64
min 12 non-null float64
25% 12 non-null float64
50% 12 non-null float64
75% 12 non-null float64
max 12 non-null float64
dtypes: float64(8)
Update: to address the TypeError: ufunc 'isfinite' not supported exception, it is necessary to first convert the Time column from a series of string objects in "HH:MM" format to a series of datetime.time objects, which can be done as follows:
df['Time'] = df.Time.map(lambda x: pd.datetools.parse(x).time())
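Note that pd.datetools has since been removed from pandas; with current versions an equivalent conversion (my substitution, not part of the original answer) would be:
import pandas as pd

# parse "HH:MM" strings into datetime.time objects
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.time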
