Trying to do Ward clustering on an n-by-n matrix in scipy - python

I have a similarity score between 0 and 1 from each entry to every other entry in a 100 by 100 matrix. So e.g. [0,0] would be 1, [0,1] might be 0.54, etc. The similarity score was generated using Jensen-Shannon divergence.
I want to use Ward clustering (but am open to other suggestions) to cluster these together, and I tried the following code:
import numpy as np
import scipy.cluster.hierarchy

x = np.array(mylist)
print x.shape  # (100, 100)
clustering = scipy.cluster.hierarchy.ward(x)
scipy.cluster.hierarchy.dendrogram(clustering)
but I am getting the error:
Traceback (most recent call last):
File "C:/Python27/ward.py", line 38, in <module>
clustering = scipy.cluster.hierarchy.ward(y)
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 465, in ward
return linkage(y, method='ward', metric='euclidean')
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 620, in linkage
y = _convert_to_double(np.asarray(y, order='c'))
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 928, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Do I need to do some transformation on my array first or use some other method?
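One common fix, sketched under the assumption that mylist really is a square 100 x 100 similarity matrix: the ValueError above usually means the rows of mylist have unequal lengths, and even with a clean square array, ward() would treat each row as a 100-dimensional observation rather than as precomputed distances. Converting similarity to distance and condensing it with squareform lets linkage cluster on the scores themselves:

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

x = np.array(mylist, dtype=np.double)  # mylist as in the question; this cast
                                       # raises the same ValueError if the rows
                                       # have unequal lengths
dist = 1.0 - x                   # one common way to turn similarity into distance
np.fill_diagonal(dist, 0.0)      # squareform requires exact zeros on the diagonal
condensed = squareform(dist, checks=False)  # the condensed form linkage expects
clustering = hierarchy.linkage(condensed, method='ward')
hierarchy.dendrogram(clustering)

Note that Ward linkage formally assumes Euclidean distances, so treat the result as exploratory when feeding it Jensen-Shannon based scores.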

Related

What is the rank list in Tucker decomposition?

I am going to decompose a 4D tensor using Tucker decomposition in Python. I found a library, tensorly, to do this.
I only want to perform the decomposition on the first and second dimensions. To perform Tucker decomposition on some modes (not all modes) using tensorly, I have to use the partial_tucker command. This is my code:
import numpy as np
import tensorly as tl
from tensorly.decomposition import partial_tucker

F = 256
D = 96
h = 5
w = 6
ranks = [89, 48]
modes = [0, 1]
tensor = tl.tensor((np.arange(F*D*h*w).reshape((F, D, h, w))).astype(np.float64))
core, factors = partial_tucker(tensor, modes=modes, rank=ranks)
This code works well, but when I try to change the rank list, for example:
ranks = [3,4]
I get an error as follows:
Traceback (most recent call last):
File "D:\PhD_Thessaloniki\Codes\LRF_Convolutional\tucker-decomposition.py", line 49, in <module>
core, factors = partial_tucker(tensor, modes=modes, rank=ranks)
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\tensorly\decomposition\_tucker.py", line 109, in partial_tucker
eigenvecs, _, _ = svd_fun(unfold(core_approximation, mode), n_eigenvecs=rank[index], random_state=random_state)
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\tensorly\backend\core.py", line 913, in partial_svd
S, V = scipy.sparse.linalg.eigsh(
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py", line 1689, in eigsh
params.iterate()
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py", line 571, in iterate
raise ArpackError(self.info, infodict=self.iterate_infodict)
scipy.sparse.linalg._eigen.arpack.arpack.ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.
I don't know whether there is a constraint on how the rank can be defined in Tucker decomposition, but when I try to perform the decomposition on only one dimension, for example:
ranks = [3]
modes = [0]
or
ranks = [4]
modes = [1]
it works well again.
I want to know:
Is this an algorithmic constraint or a problem in the code (tensorly)?
What exactly is the problem?
Which rank lists are valid?
Thanks
Tucker relies on higher-order PCA. The error you are seeing is in the sparse SVD, applied to an unfolding of the main tensor.
You can try a different SVD function (the svd parameter in partial_tucker); you can see the available options using tensorly.tenalg.svd.SVD_FUNS.
You might also want to try a tensor with random elements, using tensorly.random.random_tensor or tensorly.randn.
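A sketch combining both suggestions (hedged: the 'truncated_svd' option name is an assumption, and the exact strings accepted by the svd parameter vary by tensorly version, so check SVD_FUNS first):

import numpy as np
import tensorly as tl
from tensorly.decomposition import partial_tucker

tensor = tl.tensor(np.random.randn(256, 96, 5, 6))  # random-valued test tensor

# Swap the SVD backend via the svd parameter mentioned above; 'truncated_svd'
# is one option name commonly listed in SVD_FUNS, but verify it against your
# installed tensorly version.
core, factors = partial_tucker(tensor, modes=[0, 1], rank=[3, 4],
                               svd='truncated_svd')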

Pybats forecast error when using Poisson family distribution

I am building a time series PyBats model using a Poisson distribution to describe the distribution of observations. My model instantiation looks like this:
model = define_dglm(
    Y=data.actual.values,
    X=None,
    family="poisson",
    k=1,
    prior_length=8,
    dates=data["month"],
    ntrend=2,
    seasPeriods=[],
    seasHarmComponents=[],
    nsamps=10000,
)
where data.actual.values is a numpy array of integers. After instantiating the model, in order to forecast into the future with pybats, I run:
forecast_samples = model.forecast_path(k=steps_to_forecast, X=X_future, nsamps=10000)
and get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 289, in forecast_path
return forecast_path_copula(self, k, X, nsamps, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 211, in forecast_path_copula
return forecast_path_copula_sim(mod, k, lambda_mu, lambda_cov, nsamps, t_dist, nu)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in forecast_path_copula_sim
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in <lambda>
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 477, in simulate_from_sampling_model
return np.random.poisson(rate, [nsamps])
File "mtrand.pyx", line 3573, in numpy.random.mtrand.RandomState.poisson
File "_common.pyx", line 824, in numpy.random._common.disc
File "_common.pyx", line 621, in numpy.random._common.discrete_broadcast_d
File "_common.pyx", line 355, in numpy.random._common.check_array_constraint
ValueError: lam value too large
I have tried converting my Y array to floats, and have tried replacing all 0 values with 1 and get the same error. What is causing this error?
The issue is that you are exceeding the maximum rate value allowed by numpy.random.poisson. It looks like any lam value larger than roughly 1E19 (e.g. np.random.poisson(1E19)) will cause this error.
A couple things you can try:
Use a longer prior length than 8 when defining the model. This will help produce more stable estimates of the coefficients. After defining your model, check the coefficient mean vector (model.a) and covariance matrix (model.R) to make sure they're reasonable. If they're not, you can change them manually.
If some of your 'Y' values are truly that large, a Poisson model is probably not appropriate. I would suggest modeling log(Y) using the normal dlm model in Pybats.
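A minimal sketch of the log(Y) route (hedged: the family="normal" string and the +1 offset for zero counts are assumptions to verify against your PyBats version):

import numpy as np

# define_dglm imported exactly as in the question; only Y, family, and
# prior_length change here
model = define_dglm(
    Y=np.log(data.actual.values + 1.0),  # +1 guards against log(0); an assumption
    X=None,
    family="normal",      # a normal DLM on the log scale, per the advice above
    k=1,
    prior_length=12,      # longer than 8, for more stable coefficient estimates
    dates=data["month"],
    ntrend=2,
    seasPeriods=[],
    seasHarmComponents=[],
)
print(model.a)  # coefficient mean vector - sanity-check these values
print(model.R)  # coefficient covariance matrix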
I hope that this helps!
Thanks,
Isaac

Creating an ITK displacement field image from a numpy array

In my PhD project I analyze 3D microCT datasets of lung tissue samples. One topic is the simulation of an atelectasis by warping the image using ITK in Python. To achieve that (with the WarpImageFilter or the ResampleImageFilter in ITK), I have to create a displacement vector field, which means converting a 3D numpy array into an ITK image using the GetImageFromArray function. The resulting output should be in a format the ResampleImageFilter or WarpImageFilter can work with.
Here's my code:
import itk
import numpy as np

array1 = []
for i in range(-5, 5):
    for j in range(-5, 5):
        for k in range(-5, 5):
            if i == 0 and j == 0 and k == 0:
                array1.append([0, 0, 0])
            else:
                x = float(i) / float(i**2 + j**2 + k**2)
                y = float(j) / float(i**2 + j**2 + k**2)
                z = float(k) / float(i**2 + j**2 + k**2)
                array1.append([x, y, z])

displacementFieldFileName = itk.image_from_array(np.reshape(array1, (10, 10, 10, 3)), is_vector=True)
The last line shows the conversion from a numpy array into a 3D ITK vector image format which is needed by the filters mentioned above. However, I receive the following error message:
Traceback (most recent call last):
File "Test_Displacement.py", line 39, in <module>
displacementFieldFileName = itk.image_from_array(np.reshape(array1, (10,10,10,3)), is_vector = True)
File "/XXXX/YYYY/.local/lib/python2.7/site-packages/itkExtras.py", line 297, in GetImageFromArray
return _GetImageFromArray(arr, "GetImageFromArray", is_vector)
File "/XXXX/YYYY/.local/lib/python2.7/site-packages/itkExtras.py", line 291, in _GetImageFromArray
templatedFunction = getattr(itk.PyBuffer[ImageType], function)
File "/XXXX/YYYY/.local/lib/python2.7/site-packages/itkTemplate.py", line 340, in __getitem__
raise TemplateTypeError(self, tuple(cleanParameters))
itkTemplate.TemplateTypeError: itk.PyBuffer is not wrapped for input type itk.Image[itk.Vector[itk.D,3],3].
A similar topic can be found here:
https://discourse.itk.org/t/importing-image-from-array-and-axis-reorder/1192
I already tried using dtype=np.float32 and .astype(np.float32) to specify the float data type but this leads to another error:
File "Test_Displacement.py", line 59, in <module>
fieldReader.SetFileName(displacementFieldFileName)
TypeError: in method 'itkImageFileReaderIF3_SetFileName', argument 2 of type 'std::string const &'
How can the displacement field be created properly? Any help will be highly appreciated!
Alex
It seems like it's asking for:
itk.Image[itk.Vector[itk.D,3],3]
not a numpy array. Or maybe your numpy array has the wrong dimensionality or element type.
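To make that concrete, here is a sketch under the assumption that your ITK build wraps the float (itk.F) vector image type but not the double (itk.D) one, which would explain the TemplateTypeError:

import itk
import numpy as np

# Build the field as float32 so image_from_array produces
# itk.Image[itk.Vector[itk.F,3],3], which is typically wrapped
arr = np.zeros((10, 10, 10, 3), dtype=np.float32)  # stand-in for array1 above
displacement_field = itk.image_from_array(arr, is_vector=True)

# The second error in the question comes from the variable name: it holds an
# itk.Image, not a filename string, so passing it to fieldReader.SetFileName()
# fails. Use the image object directly as the displacement-field input of
# WarpImageFilter/ResampleImageFilter instead of going through a file reader.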

Save complex numpy array as image using scikit-image

I get the following error when I try to use io.imsave("image.jpg", array):
Traceback (most recent call last):
File "Fourer.py", line 37, in <module>
io.imsave( "test.jpg", fImage2)
File "C:\ProgramData\Miniconda3\lib\site-packages\skimage\io\_io.py", line 131, in imsave
if is_low_contrast(arr):
File "C:\ProgramData\Miniconda3\lib\site-packages\skimage\exposure\exposure.py", line 503, in is_low_contrast
dlimits = dtype_limits(image, clip_negative=False)
File "C:\ProgramData\Miniconda3\lib\site-packages\skimage\util\dtype.py", line 49, in dtype_limits
imin, imax = dtype_range[image.dtype.type]
KeyError: <class 'numpy.complex128'>
It's a 2D complex array I use:
array = [[ 3.25000000e+02+0.00000000e+00j -1.25000000e+01+1.72047740e+01j
-1.25000000e+01+4.06149620e+00j -1.25000000e+01-4.06149620e+00j
-1.25000000e+01-1.72047740e+01j]
[-6.25000000e+01+8.60238700e+01j -8.88178420e-16+8.88178420e-16j
0.00000000e+00+1.29059879e-15j 0.00000000e+00+1.29059879e-15j
-8.88178420e-16-8.88178420e-16j]
[-6.25000000e+01+2.03074810e+01j -8.88178420e-16+4.44089210e-16j
-3.55271368e-15+5.46706420e-15j -3.55271368e-15+5.46706420e-15j
-8.88178420e-16-4.44089210e-16j]
[-6.25000000e+01-2.03074810e+01j -8.88178420e-16+4.44089210e-16j
-3.55271368e-15-5.46706420e-15j -3.55271368e-15-5.46706420e-15j
-8.88178420e-16-4.44089210e-16j]
[-6.25000000e+01-8.60238700e+01j -8.88178420e-16+8.88178420e-16j
0.00000000e+00-1.29059879e-15j 0.00000000e+00-1.29059879e-15j
-8.88178420e-16-8.88178420e-16j]]
How can I save a complex array as an image?
If that matrix was obtained from the FFT of an image, then you first need to do an inverse FFT. Only then can you save it using io.imsave.
If that is the case, take a look at skimage's Inverse Fourier Transform.
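A sketch of that route, assuming the complex array came from np.fft.fft2 of a real image (the rescaling step is an assumption to get values into a range img_as_ubyte accepts):

import numpy as np
from skimage import io, img_as_ubyte

image = np.random.rand(64, 64)          # stand-in for the original image
fImage2 = np.fft.fft2(image)            # the complex array from the question

spatial = np.fft.ifft2(fImage2).real    # back to the real-valued spatial domain
spatial = (spatial - spatial.min()) / np.ptp(spatial)  # rescale to [0, 1]
io.imsave("test.jpg", img_as_ubyte(spatial))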

python scikit-learn cosine similarity value error: could not convert integer scalar

I am trying to produce a cosine similarity matrix using text descriptions of apps. The script below first reads in a csv data file (I can provide the data file if needed) which contains two columns, one with two app categories and the other with tokenized, stemmed descriptions for a number of apps in each of these two categories. The script then creates a tfidf matrix and attempts to produce a cosine similarity matrix.
I updated Anaconda 64 bit for Windows yesterday to make sure I have the latest versions of Python, numpy, scipy, and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
print ('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])
for i in cats:
    print ()
    print ('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()
    print ('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print (tfidf_matrix.shape)
    print ('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
The script works just fine for the smaller category of comics, with tfidf_matrix.shape = (3119, 8217). However, I receive the error message below for the larger category of education, with tfidf_matrix.shape = (90327, 62863). This matrix has more than 2^32 entries.
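As a quick sanity check, the dense similarity matrix for education would hold 90327 x 90327 entries, which is what overflows the 32-bit index inside scipy's toarray():

print(90327 * 90327)  # 8158966929 entries in the dense similarity matrix
print(2 ** 32)        # 4294967296 - the dense result no longer fits a 32-bit index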
Traceback (most recent call last):
File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>
runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
ret = ret.toarray()
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
B.ravel('A'), fortran)
ValueError: could not convert integer scalar
I can overcome this error by running the code below, but using a dense matrix is a massive memory hog and I need to run this script on 40+ categories.
print ('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
tfidf_matrixD = tfidf_matrix.toarray()
print ('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
This is the closest similar issue I could find on StackOverflow, but I couldn't see how it would help my situation...
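One hedged workaround sketch: keep the tf-idf matrix sparse and compute the similarities in row blocks, so no single dense block exceeds the 32-bit index limit (the chunk size of 5000 is an arbitrary assumption to tune for your memory budget):

from sklearn.metrics.pairwise import cosine_similarity

def chunked_cosine_similarity(X, chunk_size=5000):
    # Yield dense row blocks of the full similarity matrix, one at a time
    for start in range(0, X.shape[0], chunk_size):
        yield start, cosine_similarity(X[start:start + chunk_size], X)

for start, block in chunked_cosine_similarity(tfidf_matrix):  # tfidf_matrix from above
    # process or save each (chunk_size x n_docs) block here, instead of
    # materializing the full 90327 x 90327 matrix at once
    pass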
