I was in the process of working through this tutorial: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
It went with no problems until I got to the last part of the middle section:
As you can see, the features range over different intervals. Let's normalize all of them to the unit interval, all except PassengerId, which we'll need for the submission.
In [48]:
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    print 'Features scaled successfully !'
In [49]:
scale_all_features()
Features scaled successfully !
and despite typing it word for word in my Python script:
#Cell 48
GreatDivide.split()
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    print 'Features scaled successfully !'

#Cell 49
GreatDivide.split()
scale_all_features()
#Cell 49
GreatDivide.split()
scale_all_features()
It keeps giving me an error:
--------------------------------------------------48--------------------------------------------------
--------------------------------------------------49--------------------------------------------------
Traceback (most recent call last):
File "KaggleTitanic[2-FE]--[01].py", line 350, in <module>
scale_all_features()
File "KaggleTitanic[2-FE]--[01].py", line 332, in scale_all_features
combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4061, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4157, in _apply_standard
results[i] = func(v)
File "KaggleTitanic[2-FE]--[01].py", line 332, in <lambda>
combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 651, in wrapper
return left._constructor(wrap_results(na_op(lvalues, rvalues)),
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 592, in na_op
result[mask] = op(x[mask], y)
TypeError: ("unsupported operand type(s) for /: 'str' and 'str'", u'occurred at index Ticket')
What's the problem here? All of the previous 49 sections ran with no problem, so if I was getting an error it would have shown by now, right?
You can help ensure that the scaling only occurs on numeric columns with the following.
# select every column whose dtype is numeric (i.e. not object/string)
numeric_cols = combined.columns[combined.dtypes != 'object']
# scale each numeric column by its own maximum
combined.loc[:, numeric_cols] = combined[numeric_cols] / combined[numeric_cols].max()
There is no need for that apply function.
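The traceback names index Ticket: that column holds strings, and dividing one string by another is exactly the unsupported 'str' / 'str' operation in the error. A minimal reproduction of the problem and the numeric-only fix, on a made-up two-row frame:

import pandas as pd

toy = pd.DataFrame({'Fare': [7.25, 71.28], 'Ticket': ['A/5 21171', 'PC 17599']})

# toy[['Fare', 'Ticket']].apply(lambda x: x/x.max(), axis=0)  # TypeError: 'str' and 'str'

# restrict the scaling to numeric columns, leaving string columns alone
numeric_cols = toy.columns[toy.dtypes != 'object']
toy.loc[:, numeric_cols] = toy[numeric_cols] / toy[numeric_cols].max()
print(toy)  # Fare is now scaled to the unit interval; Ticket is untouched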
What am I trying to do?
pd.read_csv(..., nrows=###) can read just the top nrows of a file. I'd like to do the same with pd.read_hdf(...).
What is the problem?
I am confused by the documentation. start and stop look like what I need, but when I try them, a ValueError is returned. The second thing I tried was nrows=10, thinking it might be an allowable **kwargs. When I do that, no errors are thrown, but the full dataset is returned instead of just 10 rows.
Question: How does one correctly read a smaller subset of rows from an HDF file? (edit: without having to read the whole thing into memory first!)
Below is my interactive session:
>>> import pandas as pd
>>> df = pd.read_hdf('storage.h5')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
df = pd.read_hdf('storage.h5')
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 367, in read_hdf
raise ValueError('key must be provided when HDF5 file '
ValueError: key must be provided when HDF5 file contains multiple datasets.
>>> import h5py
>>> f = h5py.File('storage.h5', mode='r')
>>> list(f.keys())[0]
'table'
>>> f.close()
>>> df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 740, in select
return it.get_result()
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 1447, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 733, in func
columns=columns, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 2890, in read
return self.obj_type(BlockManager(blocks, axes))
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 2795, in __init__
self._verify_integrity()
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 3006, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
passed, implied))
ValueError: Shape of passed values is (614, 593430), indices imply (614, 10)
>>> df = pd.read_hdf('storage.h5', key='table', nrows=10)
>>> df.shape
(593430, 614)
Edit:
I just attempted to use where:
mylist = list(range(30))
df = pd.read_hdf('storage.h5', key='table', where='index=mylist')
I received a TypeError indicating a Fixed format store (the default format of df.to_hdf(...)):
TypeError: cannot pass a where specification when reading from a
Fixed format store. this store must be selected in its entirety
Does this mean I can't select a subset of rows if the format is Fixed format?
I ran into the same problem. I am pretty certain by now that https://github.com/pandas-dev/pandas/issues/11188 tracks this very problem. It is a ticket from 2015 and it contains a repro. Jeff Reback suggested that this is actually a bug, and he even pointed towards a solution back in 2015. It's just that nobody has built that solution yet. I might give it a try.
Seems like this now works, at least with pandas 1.0.1. Just provide start and stop arguments:
df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)
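For the where-based selection from the edit, the store has to be written in table format first; a Fixed format store (the to_hdf default) can only be read in its entirety. A minimal sketch, assuming a throwaway frame and file:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list('abc'))
df.to_hdf('demo.h5', key='table', format='table')  # 'table', unlike the default 'fixed', supports partial reads

head = pd.read_hdf('demo.h5', key='table', start=0, stop=10)      # first 10 rows only
subset = pd.read_hdf('demo.h5', key='table', where='index < 30')  # row selection by condition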
This is somewhat of an extension of my previous problem:
python pandas rolling function with two arguments.
How do I perform the same by group? Let's say that the 'C' column below is used for grouping.
I am struggling to:
Group by column 'C'
Within each group, sort by 'A'
Within each group, apply a rolling function taking two arguments, such as kendalltau, to columns 'A' and 'B'.
The expected result would be a DataFrame like the one below:
I have been trying the 'pass an index' workaround described in the link above, but the complexity of this case is beyond my skills :-( . This is a toy example, not far from what I am working with, so for simplicity I used randomly generated data.
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  # makes sp.stats (and sp.stats.mstats) available

rand = np.random.RandomState(1)
dff = pd.DataFrame({'A': np.arange(20),
                    'B': rand.randint(100, 120, 20),
                    'C': rand.randint(0, 2, 20)})

def my_tau_indx(indx):
    x = dff.iloc[indx, 0]
    y = dff.iloc[indx, 1]
    tau = sp.stats.mstats.kendalltau(x, y)[0]
    return tau

dff['tau'] = dff.sort_values(['C', 'A']).groupby('C').rolling(window=5).apply(my_tau_indx, args=([dff.index.values]))
Every fix I make creates yet another bug...
The above issue was solved by Nickil Maveli, and it works with numpy 1.11.0, pandas 0.18.1, scipy 0.17.1, and conda 4.1.4. It generates some warnings, but works.
On another machine with the latest and greatest numpy 1.12.0, pandas 0.19.2, scipy 0.18.1, conda 3.10.0, and BLAS/LAPACK, it does not work and I get the traceback below. This seems version-related: since I upgraded the first machine, it has stopped working there too... In the name of science... ;-)
As Nickil suggested, this was due to an incompatibility between numpy 1.11 and 1.12. Downgrading numpy helped. Since I needed BLAS/LAPACK on Windows, I installed numpy 1.11.3+mkl from http://www.lfd.uci.edu/~gohlke/pythonlibs/ .
Traceback (most recent call last):
File "<ipython-input-4-bbca2c0e986b>", line 16, in <module>
t = grp.apply(func)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 651, in apply
return self._python_apply_general(f)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 655, in _python_apply_general
self.axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 1527, in apply
res = f(group)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 647, in f
return func(g, *args, **kwargs)
File "<ipython-input-4-bbca2c0e986b>", line 15, in <lambda>
func = lambda x: pd.Series(pd.rolling_apply(np.arange(len(x)), 5, my_tau_indx), x.index)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\stats\moments.py", line 584, in rolling_apply
kwargs=kwargs)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\stats\moments.py", line 240, in ensure_compat
result = getattr(r, name)(*args, **kwds)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 863, in apply
return super(Rolling, self).apply(func, args=args, kwargs=kwargs)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 621, in apply
center=False)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 560, in _apply
result = calc(values)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 555, in calc
return func(x, window, min_periods=self.min_periods)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 618, in f
kwargs)
File "pandas\algos.pyx", line 1831, in pandas.algos.roll_generic (pandas\algos.c:51768)
File "<ipython-input-4-bbca2c0e986b>", line 8, in my_tau_indx
x = dff.iloc[indx, 0]
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 1294, in __getitem__
return self._getitem_tuple(key)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 1560, in _getitem_tuple
retval = getattr(retval, self.name)._getitem_axis(key, axis=axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 1614, in _getitem_axis
return self._get_loc(key, axis=axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 96, in _get_loc
return self.obj._ixs(key, axis=axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\frame.py", line 1908, in _ixs
label = self.index[i]
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\indexes\range.py", line 510, in __getitem__
return super_getitem(key)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\indexes\base.py", line 1275, in __getitem__
result = getitem(key)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The final check:
One way to achieve this would be to iterate through every group and use pd.rolling_apply on each such group.
import scipy.stats as ss

def my_tau_indx(indx):
    x = dff.iloc[indx, 0]
    y = dff.iloc[indx, 1]
    tau = ss.mstats.kendalltau(x, y)[0]
    return tau

grp = dff.sort_values(['A', 'C']).groupby('C', group_keys=False)
func = lambda x: pd.Series(pd.rolling_apply(np.arange(len(x)), 5, my_tau_indx), x.index)
t = grp.apply(func)
dff.reindex(t.index).assign(tau=t)
EDIT:
def my_tau_indx(indx):
    x = dff.ix[indx, 0]
    y = dff.ix[indx, 1]
    tau = ss.mstats.kendalltau(x, y)[0]
    return tau

grp = dff.sort_values(['A', 'C']).groupby('C', group_keys=False)
t = grp.rolling(5).apply(my_tau_indx).get('A')
grp.head(dff.shape[0]).reindex(t.index).assign(tau=t)
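For newer pandas versions, where pd.rolling_apply and .ix no longer exist, the same result can be sketched without the index workaround. This is only a sketch, assuming pandas >= 1.0 and the toy dff above; rolling_tau is a helper name I made up:

import numpy as np
import pandas as pd
from scipy.stats import kendalltau

rand = np.random.RandomState(1)
dff = pd.DataFrame({'A': np.arange(20),
                    'B': rand.randint(100, 120, 20),
                    'C': rand.randint(0, 2, 20)})

def rolling_tau(group, window=5):
    # Kendall's tau over a trailing window of rows, computed within one group
    taus = [np.nan] * len(group)
    for i in range(window - 1, len(group)):
        win = group.iloc[i - window + 1:i + 1]
        taus[i] = kendalltau(win['A'], win['B'])[0]
    return pd.Series(taus, index=group.index)

dff['tau'] = (dff.sort_values(['C', 'A'])
                 .groupby('C', group_keys=False)
                 .apply(rolling_tau))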
I am trying to read lines of numbers starting at line 7, compiling the numbers into a list until there is no more data, and then calculate the standard deviation and %rms of this list. It seems straightforward, but I keep getting the error:
Traceback (most recent call last):
File "rmscalc.py", line 21, in <module>
std = np.std(values)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/fromnumeric.py", line 2817, in std
keepdims=keepdims)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py", line 116, in _std
keepdims=keepdims)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py", line 86, in _var
arrmean = um.add.reduce(arr, axis=axis, dtype=dtype, keepdims=True)
TypeError: cannot perform reduce with flexible type
Here is my code below:
import numpy as np
import glob
import os

values = []
line_number = 6
road = '/Users/allisondavis/Documents/HCl'

for pbpfile in glob.glob(os.path.join(road, 'pbpfile*')):
    lines = open(pbpfile, 'r').readlines()
    while line_number < 400:
        if lines[line_number] == '\n':
            break
        else:
            variables = lines[line_number].split()
            values.append(variables)
            line_number = line_number + 3

print values

a = np.asarray(values).astype(np.float32)
std = np.std(a)
rms = std * 100
print rms
Edit: it now produces an rms value (which is wrong; not sure why yet), but the following error message is confusing. I need the count to be high (I picked 400 just to ensure it would cover the entire file, no matter how large):
Traceback (most recent call last):
File "rmscalc.py", line 13, in <module>
if lines[line_number] == '\n':
IndexError: list index out of range
values is a string array and so is a. Convert a into a numeric type using astype. For example:
a = np.asarray(values).astype(np.float32)
std = np.std(a)
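The IndexError in the edit is a separate problem: line_number is never reset between files, and the while loop indexes past the end of any file that has fewer than 400 lines and no trailing blank line. A minimal sketch of a guard, keeping the rest of the loop unchanged:

import glob
import os

values = []
road = '/Users/allisondavis/Documents/HCl'

for pbpfile in glob.glob(os.path.join(road, 'pbpfile*')):
    lines = open(pbpfile, 'r').readlines()
    line_number = 6                            # reset the cursor for each file
    while line_number < min(400, len(lines)):  # never index past the end of the file
        if lines[line_number] == '\n':
            break
        values.append(lines[line_number].split())
        line_number += 3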
I have a list, which is a set of tickers. For each ticker, I get the daily return going back six months. I then want to compute the covariance between each pair of tickers. I am having trouble with np.cov; here is my code to test cov:
newStockDict = {}
for i in newList_of_index:
    a = Share(i)
    dataB = a.get_historical(look_back_date, end_date)
    stockData = pd.DataFrame(dataB)
    stockData['Daily Return'] = ""
    yList = []
    for y in range(0, len(stockData)-1):
        stockData['Daily Return'][y] = np.log(float(stockData['Adj_Close'][y])/float(stockData['Adj_Close'][y+1]))
    yList = stockData['Daily Return'].values.tolist()
    newStockDict[stockData['Symbol'][0]] = yList
g = (np.cov(pd.Series((newStockDict[newList_of_index[0]]))), pd.Series(((newStockDict[newList_of_index[1]]))))
return g
My error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Udaya\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:/Users/Udaya/Documents/Python Scripts/SR_YahooFinanceRead.py", line 150, in <module>
print CumReturnStdDev(stock_list)
File "C:/Users/Udaya/Documents/Python Scripts/SR_YahooFinanceRead.py", line 132, in CumReturnStdDev
g = (np.cov(pd.Series((newStockDict[newList_of_index[0]]))), pd.Series(((newStockDict[newList_of_index[1]]))))
File "C:\Users\Udaya\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 1885, in cov
X -= X.mean(axis=1-axis, keepdims=True)
File "C:\Users\Udaya\Anaconda\lib\site-packages\numpy\core\_methods.py", line 66, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'numpy.float64' and 'str'
I've tried using pd.cov on a DataFrame, then np.cov. Nothing works. Here I am appending the daily returns to a list, then to a dictionary, before manually calculating an n-by-n covariance matrix, but I am unable to get np.cov to work.
Please help. The idea is that I can easily construct a DataFrame of N tickers, with each row being a daily return, but I am unable to compute cov with that DataFrame, hence this df --> list --> dict process.
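The 'numpy.float64' and 'str' in the traceback comes from the 'Daily Return' column: it is initialized to the empty string "" and the loop stops at len(stockData)-1, so the last row keeps its string value and the whole column ends up with object dtype. A sketch of a fix with toy data standing in for the question's dict (pd.to_numeric turns the stray strings into NaN so they can be dropped):

import numpy as np
import pandas as pd

# toy stand-ins for the question's newList_of_index / newStockDict; the trailing ''
# mimics the last 'Daily Return' row that is never overwritten by the loop
newList_of_index = ['AAA', 'BBB']
newStockDict = {'AAA': [0.011, -0.020, 0.005, ''],
                'BBB': [0.002, 0.013, -0.010, '']}

r1 = pd.to_numeric(pd.Series(newStockDict[newList_of_index[0]]), errors='coerce')
r2 = pd.to_numeric(pd.Series(newStockDict[newList_of_index[1]]), errors='coerce')

rets = pd.DataFrame({'t1': r1, 't2': r2}).dropna()  # align the series and drop the coerced-NaN tail
cov_matrix = np.cov(rets['t1'], rets['t2'])         # 2x2 covariance matrix
# or, directly on the DataFrame: rets.cov()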
I have a pandas DataFrame that I am reading from a csv. It includes three columns: a subject line and two numbers I am not using yet.
>>> input
                                          0         1  2
0  Stress Free Christmas Gift They'll Love  0.010574  8
I have converted the list of subjects to a numpy array, and I want to use CountVectorizer for Naive Bayes. When I do that, I get the following error.
>>> cv=CountVectorizer()
>>> subjects=np.asarray(input[0])
>>> cv.fit_transform(subjects)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'float' object has no attribute 'lower'
These items should definitely all be strings. When I read the csv in with the csv library instead and created an array of that column, I didn't have any problems. Any ideas?
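A float in a column that should be all strings is almost always a missing value: pandas reads empty csv cells as NaN, which is a float, and CountVectorizer's preprocessing then fails calling .lower() on it. A minimal sketch of the usual fix, with a toy frame standing in for the csv (named frame here to avoid shadowing the input builtin):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# toy frame standing in for the csv; the None becomes NaN, a float
frame = pd.DataFrame({0: ["Stress Free Christmas Gift They'll Love", None]})

subjects = frame[0].fillna('').astype(str).values  # replace NaN with '' and force a string dtype
cv = CountVectorizer()
X = cv.fit_transform(subjects)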