Covariance of each key in a dictionary - Python

I have a list of tickers. For each ticker, I get the daily returns going back six months. I then want to compute the covariance between each pair of tickers. I am having trouble with np.cov; here is my code to test it:
newStockDict = {}
for i in newList_of_index:
    a = Share(i)
    dataB = a.get_historical(look_back_date, end_date)
    stockData = pd.DataFrame(dataB)
    stockData['Daily Return'] = ""
    yList = []
    for y in range(0, len(stockData) - 1):
        stockData['Daily Return'][y] = np.log(float(stockData['Adj_Close'][y]) / float(stockData['Adj_Close'][y + 1]))
    yList = stockData['Daily Return'].values.tolist()
    newStockDict[stockData['Symbol'][0]] = yList
g = (np.cov(pd.Series((newStockDict[newList_of_index[0]]))), pd.Series(((newStockDict[newList_of_index[1]]))))
return g
My error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Udaya\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:/Users/Udaya/Documents/Python Scripts/SR_YahooFinanceRead.py", line 150, in <module>
print CumReturnStdDev(stock_list)
File "C:/Users/Udaya/Documents/Python Scripts/SR_YahooFinanceRead.py", line 132, in CumReturnStdDev
g = (np.cov(pd.Series((newStockDict[newList_of_index[0]]))), pd.Series(((newStockDict[newList_of_index[1]]))))
File "C:\Users\Udaya\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 1885, in cov
X -= X.mean(axis=1-axis, keepdims=True)
File "C:\Users\Udaya\Anaconda\lib\site-packages\numpy\core\_methods.py", line 66, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'numpy.float64' and 'str'
I've tried using pd.cov on a dataframe, then np.cov; nothing works. Here I am appending the daily returns to a list, then to a dictionary, before manually calculating an n-by-n covariance matrix, but I am unable to get np.cov to work.
Please help. The idea is that I can easily construct a dataframe of N tickers, with each row being a daily return, but I am unable to compute the covariance with that dataframe, hence this df -> list -> dict process.
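The likely culprit is `stockData['Daily Return'] = ""`: it seeds the column with empty strings, and the loop never overwrites the last row, so np.cov receives an array that still contains a str. A minimal sketch (with made-up prices for two hypothetical tickers, newest first, as get_historical returns them) of getting log returns and the full N-by-N covariance matrix directly from a numeric DataFrame, with no list/dict round-trip:

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted-close prices for two tickers, newest row first.
prices = pd.DataFrame({
    'AAA': [101.0, 100.5, 99.8, 100.2],
    'BBB': [55.2, 55.0, 54.6, 54.9],
})

# Log daily returns: shift(-1) mirrors the price[y] / price[y+1] ratio above.
# The last row becomes NaN (no older price), so drop it.
returns = np.log(prices / prices.shift(-1)).dropna()

# Full N-by-N covariance matrix in one call.
cov_matrix = returns.cov()
print(cov_matrix)
```

Because every column is float dtype from the start, neither np.cov nor DataFrame.cov has anything to choke on.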

Related

AttributeError: 'tuple' object has no attribute 'ravel'

I'm trying to solve two simultaneous nonlinear equations using the scipy.optimize.brute function:
import numpy as np
import scipy.optimize as so

def root2d(x, a, b):
    F1 = np.exp(-np.exp(-(x[0] + x[1]))) - x[1]*(b + x[0]**2)
    F2 = x[0]*np.cos(x[1]) + x[1]*np.sin(x[0]) - a
    return (F1, F2)

a = 0.5
b = 1
x0 = np.array([-0.1, 0.1])  # initial guesses
rranges = (slice(-4, 4, 0.2), slice(-4, 4, 0.2))
print(so.brute(root2d, rranges, args=(a, b), finish=so.fmin))
I get an error that I don't understand: AttributeError: 'tuple' object has no attribute 'ravel'. What does this mean and how do I fix my code (if it's possible)?
Edit: full error message
Traceback (most recent call last):
File "<ipython-input-2-29b9507fcb99>", line 1, in <module>
runfile('.../test')
File "C:\WinPython\WinPython-64bit-3.5.2.3\python-3.5.2.amd64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\WinPython\WinPython-64bit-3.5.2.3\python-3.5.2.amd64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "../test.py", line 111, in <module>
print(so.brute(root2d,rranges,args=(a,b),finish=so.fmin))
File "C:\WinPython\WinPython-64bit-3.5.2.3\python-3.5.2.amd64\lib\site-packages\scipy\optimize\optimize.py", line 2711, in brute
indx = argmin(Jout.ravel(), axis=-1)
AttributeError: 'tuple' object has no attribute 'ravel'
Your root2d returns two values, F1 and F2, as a tuple. scipy.optimize.brute minimizes a scalar objective, so it expects the function to return a single number it can compare across the grid; when it gets a tuple back, the internal call Jout.ravel() fails with exactly this AttributeError. To find a root of the system with brute, return a single scalar that is zero precisely at a root, such as the sum of squared residuals F1**2 + F2**2.
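Concretely, a sketch of that fix, keeping the same equations and ranges but returning the sum of squared residuals:

```python
import numpy as np
import scipy.optimize as so

def root2d(x, a, b):
    F1 = np.exp(-np.exp(-(x[0] + x[1]))) - x[1] * (b + x[0]**2)
    F2 = x[0] * np.cos(x[1]) + x[1] * np.sin(x[0]) - a
    # brute needs a scalar: the sum of squares is zero exactly at a root.
    return F1**2 + F2**2

a, b = 0.5, 1
rranges = (slice(-4, 4, 0.2), slice(-4, 4, 0.2))
sol = so.brute(root2d, rranges, args=(a, b), finish=so.fmin)
print(sol)
```

The grid search locates the basin and fmin then polishes the estimate, so the residuals at sol are close to zero.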

Python sklearn kaggle/titanic tutorial fails on the last feature scale

I was in the process of working through this tutorial: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
And it went with no problems until I got to the last part of the middle section:
As you can see, the features range in different intervals. Let's normalize all of them in the unit interval. All of them except the PassengerId that we'll need for the submission
In [48]:
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    print 'Features scaled successfully !'
In [49]:
scale_all_features()
Features scaled successfully !
and despite typing it word for word in my python script:
#Cell 48
GreatDivide.split()
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    print 'Features scaled successfully !'

#Cell 49
GreatDivide.split()
scale_all_features()
It keeps giving me an error:
--------------------------------------------------48--------------------------------------------------
--------------------------------------------------49--------------------------------------------------
Traceback (most recent call last):
File "KaggleTitanic[2-FE]--[01].py", line 350, in <module>
scale_all_features()
File "KaggleTitanic[2-FE]--[01].py", line 332, in scale_all_features
combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4061, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4157, in _apply_standard
results[i] = func(v)
File "KaggleTitanic[2-FE]--[01].py", line 332, in <lambda>
combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 651, in wrapper
return left._constructor(wrap_results(na_op(lvalues, rvalues)),
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 592, in na_op
result[mask] = op(x[mask], y)
TypeError: ("unsupported operand type(s) for /: 'str' and 'str'", u'occurred at index Ticket')
What's the problem here? All of the previous 49 sections ran with no problem, so if I was getting an error it would have shown by now, right?
You can help ensure that the scaling is only applied to the numeric columns with the following:
numeric_cols = combined.columns[combined.dtypes != 'object']
combined.loc[:, numeric_cols] = combined[numeric_cols] / combined[numeric_cols].max()
There is no need for that apply function.
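A self-contained illustration of that idea, using a toy stand-in for combined (the real frame comes from the tutorial): the string Ticket column is left untouched, and PassengerId is excluded from scaling as the tutorial requires:

```python
import pandas as pd

# Toy stand-in for the tutorial's `combined` frame: one string column
# (like Ticket) mixed in with numeric features.
combined = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.28, 7.92],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2.'],
})

# Numeric columns only, minus the id needed for the submission.
numeric_cols = combined.columns[combined.dtypes != 'object'].drop('PassengerId')
combined.loc[:, numeric_cols] = combined[numeric_cols] / combined[numeric_cols].max()
print(combined)
```

Dividing a string column by a string column is what produced the original TypeError; selecting by dtype sidesteps it entirely.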

Numpy std calculation: TypeError: cannot perform reduce with flexible type

I am trying to read lines of numbers starting at line 7, compiling the numbers into a list until there is no more data, and then calculate the standard deviation and %rms of this list. It seems straightforward, but I keep getting the error:
Traceback (most recent call last):
File "rmscalc.py", line 21, in <module>
std = np.std(values)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/fromnumeric.py", line 2817, in std
keepdims=keepdims)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py", line 116, in _std
keepdims=keepdims)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py", line 86, in _var
arrmean = um.add.reduce(arr, axis=axis, dtype=dtype, keepdims=True)
TypeError: cannot perform reduce with flexible type
Here is my code below:
import numpy as np
import glob
import os

values = []
line_number = 6
road = '/Users/allisondavis/Documents/HCl'

for pbpfile in glob.glob(os.path.join(road, 'pbpfile*')):
    lines = open(pbpfile, 'r').readlines()
    while line_number < 400:
        if lines[line_number] == '\n':
            break
        else:
            variables = lines[line_number].split()
            values.append(variables)
            line_number = line_number + 3
    print values

a = np.asarray(values).astype(np.float32)
std = np.std(a)
rms = std * 100
print rms
Edit: it now produces an rms value (which is wrong, and I'm not sure why yet), but the following error message is confusing. I need the count to be high (I picked 400 just to ensure it would get through the entire file, no matter how large):
Traceback (most recent call last):
File "rmscalc.py", line 13, in <module>
if lines[line_number] == '\n':
IndexError: list index out of range
values is an array of strings, and so is a. Convert a to a numeric type using astype. For example:
a = np.asarray(values).astype(np.float32)
std = np.std(a)
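A tiny reproduction of the problem and the fix, with made-up numbers standing in for the file contents:

```python
import numpy as np

# Strings, exactly what .split() on a text line produces.
values = [['1.0', '2.0'], ['3.0', '4.0']]

# np.std on the raw strings raises the reported error,
#   TypeError: cannot perform reduce with flexible type,
# because the array has a (flexible) string dtype.
a = np.asarray(values).astype(np.float32)
print(np.std(a))
```

Separately, the IndexError from the edit is a different bug: line_number is never reset between files, and the while loop has no bound tied to len(lines), so a short file (or any second file) runs past the end of the list; guarding with `while line_number < min(400, len(lines)):` would avoid that.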

Pandas correlation error - decimal and float type mismatch

This problem has been raised here, but has not been answered. I am providing more details in this thread, hoping that gets the juices flowing.
I have a pandas dataframe master_frame that contains timeseries data:
SUBMIT_DATE CRUX_VOL CRUX_RATE
0 2016-02-01 76.38733173161 0.02832710529
1 2016-01-31 76.68984699154 0.02720243998
2 2016-01-30 75.59094829615 0.02720243998
3 2016-01-29 75.91758975956 0.02720243998
4 2016-01-28 76.31809997200 0.02671927211
... ... ... ...
I want the correlation between CRUX_VOL and CRUX_RATE columns. Both are decimal type:
In [3]: print type(master_frame["CRUX_VOL"][0]), type(master_frame["CRUX_RATE"][0])
Out[3]: <class 'decimal.Decimal'> <class 'decimal.Decimal'>
When I use the corr function, I get a nasty error that relates to the type of the inputs.
print master_frame['CRUX_VOL'].corr(master_frame['CRUX_RATE'])
Traceback (most recent call last):
File "U:/Programming/VolPathReport/VolPath.py", line 52, in <module>
print master_frame['CRUX_VOL'].corr(master_frame['CRUX_RATE'])
File "C:\Anaconda2\lib\site-packages\pandas\core\series.py", line 1312, in corr
min_periods=min_periods)
File "C:\Anaconda2\lib\site-packages\pandas\core\nanops.py", line 47, in _f
return f(*args, **kwargs)
File "C:\Anaconda2\lib\site-packages\pandas\core\nanops.py", line 644, in nancorr
return f(a, b)
File "C:\Anaconda2\lib\site-packages\pandas\core\nanops.py", line 652, in _pearson
return np.corrcoef(a, b)[0, 1]
File "C:\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 2145, in corrcoef
c = cov(x, y, rowvar)
File "C:\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 2065, in cov
avg, w_sum = average(X, axis=1, weights=w, returned=True)
File "C:\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 599, in average
scl = np.multiply(avg, 0) + scl
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
I've messed with the types and can't get this thing to work. Help me, o wizards of the internet!
The last line of the error message points to
np.multiply(avg, 0) + scl
as the cause for
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
numpy has no Decimal dtype, so the Series is stored as an object array of Decimal instances. In that last line, np.multiply(avg, 0) is therefore still a Decimal, while scl is a plain float, and Decimal + float is exactly the unsupported operation the TypeError complains about. Since pandas relies on numpy, it's probably best to convert the columns to float dtype, remembering to assign the result back:
master_frame[['CRUX_VOL', 'CRUX_RATE']] = master_frame[['CRUX_VOL', 'CRUX_RATE']].astype(float)
or
master_frame = master_frame.convert_objects(convert_numeric=True)
(note that convert_objects is deprecated in later pandas versions; pd.to_numeric is the modern replacement).
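A sketch of the astype route on a toy frame with Decimal entries (made-up values mirroring master_frame):

```python
import pandas as pd
from decimal import Decimal

# Toy frame with Decimal entries, like master_frame in the question.
master_frame = pd.DataFrame({
    'CRUX_VOL': [Decimal('76.387'), Decimal('76.690'), Decimal('75.591')],
    'CRUX_RATE': [Decimal('0.0283'), Decimal('0.0272'), Decimal('0.0272')],
})

# Cast to float dtype; .corr then works without the Decimal/float clash.
master_frame = master_frame.astype(float)
print(master_frame['CRUX_VOL'].corr(master_frame['CRUX_RATE']))
```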

Python Pandas: Increase Maximum Number of Rows

I am processing a large text file (500k lines), formatted as below:
S1_A16
0.141,0.009340221649748676
0.141,4.192618196894668E-5
0.11,0.014122135626540204
S1_A17
0.188,2.3292323316081486E-6
0.469,0.007928706856794138
0.172,3.726771730573038E-5
I'm using the code below to return the correlation coefficients of each series, e.g. S1_A16:
import numpy as np
import pandas as pd
import csv

pd.options.display.max_rows = None

fileName = 'wordUnigramPauseTEST.data'
df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match('^S\d+_A\d+')
df['S/A'] = (df['pause']
             .where(mask, np.nan)
             .fillna(method='ffill'))
df = df.loc[~mask]
result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)
However, on some large files, this returns the error:
Traceback (most recent call last):
File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
print(result)
File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
return self.__bytes__()
File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
return self.__unicode__().encode(encoding, 'replace')
File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
I understand that this is related to the print statement, but how do I fix it?
EDIT:
This is related to the maximum number of rows. Does anyone know how to accommodate a greater number of rows?
The error message:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
is saying None minus an int is a TypeError. If you look at the next-to-last line in the traceback you see that the only subtraction going on there is
max_rows - 4
So max_rows must be None. If you dive into /Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py, near line 857 and ask yourself how max_rows could end up being equal to None, you'll see that somehow
get_option("display.max_rows")
must be returning None.
This part of the code calls _tidy_repr, which is used to summarize a long Series, but None is precisely the value you set when you want pandas to display every line of the Series, so this branch should never be reached when max_rows is None.
I've made a pull request to correct this.
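In the meantime, a workaround consistent with that diagnosis is to set a large finite limit instead of None, which never reaches the broken max_rows - 4 subtraction:

```python
import pandas as pd

# A large finite limit behaves like "show everything" for practical
# purposes, without handing None to the summarising code path.
pd.set_option('display.max_rows', 10**6)

s = pd.Series(range(100))
print(s)  # all rows are printed
```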
