numpy and pytables issue (error: tuple index out of range) - python

I am new to Python and PyTables. I am currently working on a project about clustering and the KNN algorithm. This is what I have so far.
********** code *****************
import ctypes
import numpy.random as npr
import numpy as np
import tables
# step 0: obtain the clusters
dtype = np.dtype('f4')
pnts_inds = np.arange(100)
npr.shuffle(pnts_inds)
pnts_inds = pnts_inds[:10]
pnts_inds = np.sort(pnts_inds)
for i, ind in enumerate(pnts_inds):
    clusters[i] = pnts_obj[ind]
# step 1: save the result to an HDF5 file called clst_fn.h5
filters = tables.Filters(complevel=1, complib='zlib')
clst_fobj = tables.openFile('clst_fn.h5', 'w')
clst_obj = clst_fobj.createCArray(clst_fobj.root, 'clusters',
                                  tables.Atom.from_dtype(dtype), clusters.shape,
                                  filters=filters)
clst_obj[:] = clusters
clst_fobj.close()
# step 2: other function
# blabla
# step 3: load the clusters from clst_fn.h5
pnts_fobj = tables.openFile('clst_fn.h5', 'r')
for pnts in pnts_fobj.walkNodes('/', classname='Array'):
    break
#
# step 4: invoke another function (called knn). Its input argument is the data from pnts.
# I have tested knn on its own, and it works when the input is pnts = npr.rand(100, 128).
def knn(pnts):
    pnts = np.ascontiguousarray(pnts)
    N = ctypes.c_uint(pnts.shape[0])
    D = ctypes.c_uint(pnts.shape[1])
#
# invoke knn using the clusters loaded from clst_fn.h5 (see step 3)
knn(pnts)
********** end of code *****************
My problem is that Python raises this error:
IndexError: tuple index out of range
The error comes from this line:
D = ctypes.c_uint(pnts.shape[1])
Obviously, something is wrong with the input argument. Any thoughts on how to fix the problem? Thank you in advance.
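An IndexError on pnts.shape[1] means the shape tuple has fewer than two entries, so whatever reached knn was not a 2-D array. The sketch below, written for Python 3 with plain NumPy input, shows a guard that reports the actual shape instead of failing on the index. The helper name as_2d_points is hypothetical, and the sketch does not diagnose why the object loaded from the HDF5 file is not 2-D.

```python
import numpy as np

def as_2d_points(pnts):
    """Return pnts as a contiguous 2-D array, or raise a descriptive error."""
    arr = np.ascontiguousarray(pnts)
    if arr.ndim != 2:
        # shape[1] would raise IndexError here, so fail with a clearer message
        raise ValueError("expected a 2-D (N, D) array, got shape %r" % (arr.shape,))
    return arr

# A 2-D input passes through unchanged.
good = as_2d_points(np.random.rand(100, 128))
print(good.shape)

# A 1-D input is rejected with the offending shape in the message.
try:
    as_2d_points(np.arange(10))
except ValueError as exc:
    print(exc)
```

Printing type(pnts) and the shape right before the call to knn is a quick way to see what step 3 actually produced.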

Related

Apriori Algorithm - not getting the rules in python

[Image: screenshot of the Market_Basket_Optimisation dataset]
Here is my code, along with an image of my dataset, "Market_Basket_Optimisation". I built a list of lists, transactions, to use as the input to the apriori algorithm, but I am not getting the rules. I am new to machine learning and cannot find the error.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Data Preprocessing
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])
# Training Apriori on the dataset
from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
# Visualising the results
results = list(rules)
It is not clear from your question whether you are using a Jupyter notebook or an IDE such as Spyder. In an IDE such as Spyder, you are not likely to see the result unless you use a print statement. I suggest adding another line as follows:
print(results)
You should see the list of rules. I had the same issue, and the print statement solved it for me. You will still need to define a function that outputs the result in a tabular format that makes sense.
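A function for tabular output could look like the sketch below. It assumes each record exposes items, support and ordered_statistics, and that each ordered statistic exposes items_base, items_add, confidence and lift, which is my understanding of the records apyori returns. To stay self-contained, the example uses namedtuple stand-ins instead of importing apyori, and the sample record is made up for illustration. The name rules_to_rows is hypothetical.

```python
from collections import namedtuple

# Stand-ins that mirror the record layout apyori is assumed to return.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

def rules_to_rows(results):
    """Flatten rule records into (antecedent, consequent, support, confidence, lift) rows."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append((
                ", ".join(sorted(stat.items_base)),
                ", ".join(sorted(stat.items_add)),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows

# Hypothetical record, for illustration only (not real output from the dataset).
example = [RelationRecord(
    items=frozenset(["pasta", "shrimp"]),
    support=0.005,
    ordered_statistics=[
        OrderedStatistic(frozenset(["pasta"]), frozenset(["shrimp"]), 0.32, 4.5),
    ],
)]

rows = rules_to_rows(example)
for row in rows:
    print("%-20s -> %-20s support=%.3f confidence=%.2f lift=%.2f" % row)
```

With the real library, passing results = list(rules) to rules_to_rows should give one row per rule, provided the field names match the assumption above.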

getting linear models fama macbeth function output

I am having an issue with this function. I want to run a cross-sectional regression on 25 portfolios ranked on value and size, with 7 independent variables on the right-hand side of the equation.
import pandas as pd
import numpy as np
from linearmodels import FamaMacBeth
#creating a multi_index of independent variables
ind_var = pd.read_excel('FAMA_MACBETH.xlsx')
ind_var['date'] = pd.to_datetime(ind_var['date'])
# dropping our dependent variables
ind_var = ind_var.drop(['Mkt_rf', 'div_innovations', 'term_innovations',
'def_innovations', 'rf_innovations', 'hml_innovations',
'smb_innovations'],axis = 1)
ind_var = pd.DataFrame(ind_var.set_index('date').stack())
ind_var.columns = ['x']
x = np.asarray(ind_var)
len(x)
11600
#creating a multi_index of dependent variables
# reading in our data
dep_var = pd.read_excel('FAMA_MACBETH.xlsx')
dep_var['date'] = pd.to_datetime(dep_var['date'])
# dropping our independent variables
dep_var = dep_var.drop(['SMALL_LoBM', 'ME1_BM2', 'ME1_BM3', 'ME1_BM4',
'SMALL_HiBM', 'ME2_BM1', 'ME2_BM2', 'ME2_BM3', 'ME2_BM4', 'ME2_BM5',
'ME3_BM1', 'ME3_BM2', 'ME3_BM3', 'ME3_BM4', 'ME3_BM5', 'ME4_BM1',
'ME4_BM2', 'ME4_BM3', 'ME4_BM4', 'ME4_BM5', 'BIG_LoBM', 'ME5_BM2',
'ME5_BM3', 'ME5_BM4', 'BIG_HiBM'],axis = 1)
dep_var = pd.DataFrame(dep_var.set_index('date').stack())
dep_var.columns = ['y']
y = np.asarray(dep_var)
len(y)
3248
mod = FamaMacBeth(y, x)
res = mod.fit(cov_type='kernel', kernel='Parzen')
Ideally, the output would include t-stats and standard errors.
I have tried numerous ways of getting this to work and am seriously considering switching to SAS, but I would prefer to get it running with pandas.
I expect a cross-sectional regression output with standard errors and t-stats.
I got it to work on the first try. See this site and run the lines of code for OLS under: "Here the difference is presented using the canonical Grunfeld data on investment."
(Note that this line is important: etdata = data.set_index(['firm','year']). Without it, Python won't know the correct dimensions to run Fama-MacBeth on.)
Then run:
from linearmodels import FamaMacBeth
FamaMacBeth(etdata.invest,etdata[['value','capital']]).fit()
Note: I updated linearmodels to the latest version, which gave me access to the data.
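To make the "correct dimensions" point concrete: the question's x has 11600 rows (25 x 464) and its y has 3248 rows (7 x 464), so the two stacked arrays cannot be matched observation by observation. The sketch below is a plain NumPy illustration of the Fama-MacBeth idea, written for Python 3, and is not the linearmodels implementation. It runs one cross-sectional OLS per period, then averages the coefficients over time. The function name fama_macbeth, the (T, N) and (T, N, K) layout, and the simulated data are all assumptions for illustration. The standard errors are the simple time-series ones, not the kernel (Parzen) covariance used in the question.

```python
import numpy as np

def fama_macbeth(y, x):
    """y: (T, N) outcomes, x: (T, N, K) regressors.

    Returns the time-averaged coefficients (intercept first),
    their standard errors, and the t-statistics.
    """
    T, N, K = x.shape
    coefs = np.empty((T, K + 1))
    for t in range(T):
        # Cross-sectional OLS for period t, with an intercept column.
        design = np.column_stack([np.ones(N), x[t]])
        coefs[t] = np.linalg.lstsq(design, y[t], rcond=None)[0]
    mean = coefs.mean(axis=0)
    se = coefs.std(axis=0, ddof=1) / np.sqrt(T)
    return mean, se, mean / se

# Simulated panel: 40 periods, 25 entities, 2 regressors (made-up data).
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 25, 2))
y = 1.0 + 2.0 * x[..., 0] - 3.0 * x[..., 1] + 0.01 * rng.normal(size=(40, 25))

mean, se, tstat = fama_macbeth(y, x)
print(mean)
print(se)
print(tstat)
```

The important part is the layout: every outcome value is paired with its regressors for the same entity and period, which is what the two-level index in the answer gives linearmodels.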

imshow() returns invalid dimensions for 2D array when using multiprocessing.Pool

I'm trying to use the multiprocessing module to create figures from 2D arrays faster. In the code below I create a 2D array from an HDF5 data file (please message me if you would like a sample file to test on). Using multiprocessing.Pool, I try to pass this array to the map function, but it raises TypeError: Invalid dimensions for image data. I've checked that my array has 2 dimensions using da.shape, so I'm not sure why it isn't working. Any help is much appreciated!
To import yt, see yt-project.org/#getyt.
P.S. This is my first question on Stack Overflow so please let me know if/how I can improve.
import yt
import numpy as np
import multiprocessing
from multiprocessing import Pool, Process, Array
fl_nm = raw_input("enter filename: ").strip()
level = int(raw_input("resolution level: ").strip())
ds = yt.load(fl_nm)
all_data_level_x = ds.covering_grid(level=level,left_edge=[-3.70281620e+21,0.00000000e+00,-3.70281620e+21],dims=ds.domain_dimensions*2**level)
disp_array = []
for x in xrange(0,16*2**level):
    vbin = []
    for z in xrange(0,80*2**level):
        v = []
        for y in xrange(0,8*2**level):
            vel = all_data_level_x["velocity_magnitude"][x,y,z].in_units("km/s")
            v.append(vel)
        sigma = np.sqrt(np.sum((v - np.mean(v))**2) / np.size(v))
        vbin.append(sigma)
    disp_array.append(vbin)
    print "{0:.1f} %".format((x+1)*100/float(16*2**level))
da = np.array(disp_array)
print "fixed resolution array created"
def __main__(data_array):
    import matplotlib
    matplotlib.use('Agg')
    from matplotlib import pyplot as plt
    plt.imshow(data_array, origin = "lower", aspect = "equal", extent=[-1.2,10.8,-1.2,1.2])
    plt.colorbar(fraction=0.046, pad=0.04)
    print "plot created. Saving figure..."
    fig_nm = 'velocity_disp_{0}_lvl_{1}.png'.format(fl_nm[-4:],level)
    plt.savefig(fig_nm)
    plt.close()
    print "File saved as: " + fig_nm
    return
pool = multiprocessing.Pool(4)
pool.map(__main__,da)
pool.map(func, iterable[, chunksize]) iterates over da. So if da is a 2-D array like [[1,2],[3,4]], your __main__ function receives [1,2] in one call and [3,4] in another, i.e. one 1-D row per call rather than the whole 2-D array.
I'm not sure what you want to do, so if you want more complete help, you can upload your runnable project (to GitHub or anywhere else) and I will take a look.
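The iteration behaviour can be checked with a small sketch. It is written for Python 3 and uses multiprocessing.pool.ThreadPool, which offers the same map interface as Pool, so the example runs without process start-up concerns. The function ndim_of and the tiny array are made up for illustration.

```python
from multiprocessing.pool import ThreadPool

import numpy as np

def ndim_of(chunk):
    """Report how many dimensions the worker actually received."""
    return np.asarray(chunk).ndim

da = np.arange(6).reshape(2, 3)  # stand-in for the 2-D dispersion array

with ThreadPool(2) as pool:
    per_row = pool.map(ndim_of, da)    # one call per row of da
    whole = pool.map(ndim_of, [da])    # one call with the full array

print(per_row)
print(whole)
```

Wrapping the array in a list passes it through intact, but then there is only one task, so the pool gives no speed-up for a single figure.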

sci-kit learn: Reshape your data either using X.reshape(-1, 1)

I'm training a classifier for text classification in Python (2.7.11), and while it runs I get a deprecation warning; I can't tell which line of my code is causing it. The code works fine and gives me results, but this is the warning:
\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
My code:
def main():
    data = []
    folds = 10
    ex = [ [] for x in range(0,10)]
    results = []
    for i,f in enumerate(sys.argv[1:]):
        data.append(csv.DictReader(open(f,'r'),delimiter='\t'))
    for f in data:
        for i,datum in enumerate(f):
            ex[i % folds].append(datum)
    #print ex
    for held_out in range(0,folds):
        l = []
        cor = []
        l_test = []
        cor_test = []
        vec = []
        vec_test = []
        for i,fold in enumerate(ex):
            for line in fold:
                if i == held_out:
                    l_test.append(line['label'].rstrip("\n"))
                    cor_test.append(line['text'].rstrip("\n"))
                else:
                    l.append(line['label'].rstrip("\n"))
                    cor.append(line['text'].rstrip("\n"))
        vectorizer = CountVectorizer(ngram_range=(1,1),min_df=1)
        X = vectorizer.fit_transform(cor)
        for c in cor:
            tmp = vectorizer.transform([c]).toarray()
            vec.append(tmp[0])
        for c in cor_test:
            tmp = vectorizer.transform([c]).toarray()
            vec_test.append(tmp[0])
        clf = MultinomialNB()
        clf.fit(vec,l)
        result = accuracy(l_test,vec_test,clf)
        print result
if __name__ == "__main__":
    main()
Any idea which line raises this warning?
Another issue is that running this code with different data sets gives me exactly the same accuracy, and I can't figure out what causes this.
I also want to use this model in another Python process. The documentation has an example using the pickle library but none for joblib, so I tried following the same pattern, but it gave me errors:
clf = joblib.load('model.pkl')
pred = clf.predict(vec);
Also, if my data is a CSV file with this format: "label \t text \n"
what should go in the label column of the test data?
Thanks in advance.
The 'vec' you pass to clf.fit(vec,l) needs to be of type [[]], not just []. This is a quirk that I always forget when I fit models.
Just adding an extra set of square brackets should do the trick!
It's:
pred = clf.predict(vec);
I used this in my code and it worked:
#This makes it into a 2d array
temp = [2 ,70 ,90 ,1] #an instance
temp = np.array(temp).reshape((1, -1))
print(model.predict(temp))
Two solutions, both based on the same idea: turn your data from 1D into 2D.
Solution 1: just add []:
vec = [vec]
Solution 2: reshape your data:
import numpy as np
vec = np.array(vec).reshape(1, -1)
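The two fixes above produce the same shape, which a quick check confirms. This sketch is written for Python 3 and uses only NumPy. The sample values are the ones from the earlier answer, and the variable names are illustrative.

```python
import numpy as np

vec = [2, 70, 90, 1]  # one sample with four features

wrapped = np.array([vec])                  # extra brackets: [vec]
reshaped = np.array(vec).reshape(1, -1)    # one sample, many features
as_column = np.array(vec).reshape(-1, 1)   # many samples, one feature

print(wrapped.shape, reshaped.shape, as_column.shape)
```

For a single instance passed to predict, the (1, -1) form is the one the warning message recommends. The (-1, 1) form would instead treat each number as its own sample.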
If you want to find out where the warning is coming from, you can temporarily promote warnings to exceptions. This gives you a full traceback and therefore the lines where your program encountered the warning.
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("error")
    main()
If you run the program from the command line, you can also use the -W flag. More information on warning handling can be found in the Python documentation.
I know this answers only one part of your question, but have you tried debugging your code?
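The promotion of warnings to exceptions can be tried on a toy function before applying it to main(). This is a sketch for Python 3 using only the standard library. The function noisy and its message are invented stand-ins for the real code that warns.

```python
import traceback
import warnings

def noisy():
    """Stand-in for code that triggers a deprecation warning."""
    warnings.warn("passing 1d arrays is deprecated", DeprecationWarning)
    return "finished"

caught = None
with warnings.catch_warnings():
    warnings.simplefilter("error")   # every warning now raises an exception
    try:
        noisy()
    except DeprecationWarning as exc:
        caught = exc
        traceback.print_exc()        # traceback names the file and line that warned
```

From the command line, running the script as python -W error script.py has the same effect for the whole program.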
Passing a 1D array is deprecated, so try passing a 2D array as the parameter instead. This might help:
clf = joblib.load('model.pkl')
pred = clf.predict([vec]);
The predict method expects a 2-D array. You can watch this video, which I have linked at the exact time: https://youtu.be/KjJ7WzEL-es?t=2602. You have to change from [] to [[]].

a kind of kmean clustering

I tried to run this code in Python 2.7 with a 20*20 matrix, and I want to get two clusters, just like with the k-means algorithm.
import numpy as np
filename = np.genfromtxt('Matrix.txt')
M = np.sort (np.random.choice (2,20))
##m = np.copy(M) => I get an error there : 'module' object is not callable
M= m #this option work better but i am not sure that it is appropriate
#initialization of the clusters
C = {}
for t in xrange(tmax=100):
    #determination of clusters
    J = np.mean(filename[:,M], axis = 1)
    for k in range (2):
        C[k] = np.where (J==k, 0,0) # np.where (J==k)=> another error for 'np.where': it take exactly three arguments but one given.I saw that it could take only one argument
    #update
    for k in range (2):
        J = np.mean(filename[np.ix_(C[k],C[k])], axis = 1)
        j= np.argmin(J)
        m[k] = C[k][j] #[j] => another error for '[j]': invalid index to scalar variable
#results
print M, C
My result:
{0 : 0, 1:0}
The expected result:
{0:8, 1:12}
For example, this would mean that there are 8 elements in cluster '0' and 12 in cluster '1'.
This is probably because of the np.where function, but I am not sure.
To get this result I ran the program with workarounds for the errors mentioned above, but it still doesn't work as it should.
Thanks for your help
Another variant (it uses the scikit-learn library):
import numpy as np
from sklearn import cluster
n_clusters = 2
k_means = cluster.KMeans(n_clusters=n_clusters)
k_means.fit(filename)
values = k_means.cluster_centers_
labels = k_means.labels_
print values
print labels
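The question asks for the number of elements per cluster, which can be derived from a label array such as k_means.labels_. The sketch below is written for Python 3 and uses only NumPy. The labels are hypothetical and chosen to reproduce the expected {0:8, 1:12}, so they are not the output of a real clustering run. It also shows the one-argument form of np.where, which returns a tuple of index arrays.

```python
import numpy as np

# Hypothetical labels for 20 items (not the result of an actual fit).
labels = np.array([0] * 8 + [1] * 12)

# Number of items in each cluster.
sizes = {k: int(n) for k, n in enumerate(np.bincount(labels, minlength=2))}

# Indices of the items that belong to each cluster.
members = {k: np.where(labels == k)[0] for k in range(2)}

print(sizes)
print(members[0])
print(members[1])
```

By contrast, np.where(J==k, 0, 0) from the question picks 0 in both branches, so it always yields zeros regardless of the condition.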
