Appending values to the array duplicates it - Python

I'm running the k-means algorithm a few times and storing the final cluster centers in a list:
center_array = []
backXnorm = Xnorm
for i in range(1, 3):
    X = dataML
    X = X[np.random.default_rng(seed=i).permutation(X.columns.values)]
    print(X.head())
    Xnorm = mms.fit_transform(X)
    km = KMeans(n_clusters=4, n_init=10, max_iter=30, random_state=42)
    y_kmeans = km.fit_predict(Xnorm)
    center_array.append(km.cluster_centers_)
The values appear to be duplicated: it seems that the entire set of centers is added again on each iteration. Below is the output I'm getting.
[array([[ 0.91902229, 0.99146416, 0.11154588, -0.41348193, -0.45307083,
0.18579957, 0.20004497, -0.91902229, -0.17537297, -0.99146416,
-0.4091783 , -0.12493111],
[-0.17637011, -0.02577591, -0.48222273, 1.39450598, 1.50699298,
-0.14651225, -0.12975152, 0.17637011, 0.65213679, 0.02577591,
1.37195399, 0.44572744],
[ 0.91902229, -1.00860933, 0.11367937, -0.40910528, -0.45108061,
0.19771608, 0.23722015, -0.91902229, -0.18480587, 1.00860933,
-0.40459059, -0.13536744],
[-1.08811289, -0.0290917 , 0.19925625, -0.46264585, -0.48998741,
-0.14748408, -0.1943812 , 1.08811289, -0.23289607, 0.0290917 ,
-0.45219009, -0.14996175]]), array([[-0.17537297, 0.18579957, -0.91902229, -0.99146416, 0.99146416,
-0.41348193, -0.45307083, -0.4091783 , -0.12493111, 0.11154588,
0.91902229, 0.20004497],
[ 0.65213679, -0.14651225, 0.17637011, 0.02577591, -0.02577591,
1.39450598, 1.50699298, 1.37195399, 0.44572744, -0.48222273,
-0.17637011, -0.12975152],
[-0.18480587, 0.19771608, -0.91902229, 1.00860933, -1.00860933,
-0.40910528, -0.45108061, -0.40459059, -0.13536744, 0.11367937,
0.91902229, 0.23722015],
[-0.23289607, -0.14748408, 1.08811289, 0.0290917 , -0.0290917 ,
-0.46264585, -0.48998741, -0.45219009, -0.14996175, 0.19925625,
-1.08811289, -0.1943812 ]])]
I was expecting the final array to be something like this
[[ 0.91902229, 0.99146416, 0.11154588, -0.41348193, -0.45307083,
0.18579957, 0.20004497, -0.91902229, -0.17537297, -0.99146416,
-0.4091783 , -0.12493111],
[-0.17637011, -0.02577591, -0.48222273, 1.39450598, 1.50699298,
-0.14651225, -0.12975152, 0.17637011, 0.65213679, 0.02577591,
1.37195399, 0.44572744],
[ 0.91902229, -1.00860933, 0.11367937, -0.40910528, -0.45108061,
0.19771608, 0.23722015, -0.91902229, -0.18480587, 1.00860933,
-0.40459059, -0.13536744],
[-1.08811289, -0.0290917 , 0.19925625, -0.46264585, -0.48998741,
-0.14748408, -0.1943812 , 1.08811289, -0.23289607, 0.0290917 ,
-0.45219009, -0.14996175]]
Am I using append wrong? Should I use another type of structure to store the final center values?

K-means is not sensitive to column order: permuting the columns does not change the distances between points. That is why you get the same centers each time, just with their columns shuffled to match the shuffle you applied to the data.
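A small sketch of this on toy data (not the original dataML): fitting KMeans on the same data with shuffled columns yields the same centers with their columns shuffled the same way.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
perm = np.array([2, 0, 3, 1])                      # a column shuffle

km1 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
km2 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X[:, perm])

# Undo the column shuffle, then sort rows so cluster labeling order doesn't matter:
c1 = km1.cluster_centers_[:, perm]
c2 = km2.cluster_centers_
c1 = c1[np.argsort(c1[:, 0])]
c2 = c2[np.argsort(c2[:, 0])]

print(np.allclose(c1, c2))
```

Fixing the column order before fitting (e.g. `X = X[sorted(X.columns)]`) makes the stored centers directly comparable across runs.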


Loading binary data into a NumPy array

I am having trouble reading a binary file back. I have a NumPy array:
data = array([[ 0. , 0. , 7.821725 ],
[ 0.05050505, 0. , 7.6358337 ],
[ 0.1010101 , 0. , 7.453858 ],
...,
[ 4.8989897 , 5. , 16.63227 ],
[ 4.949495 , 5. , 16.88153 ],
[ 5. , 5. , 17.130795 ]], dtype=float32)
I wrote this array to a file in binary format.
file = open('model_binary', 'wb')
data.tofile(file)
Now, I am unable to get back the data from the saved binary file. I tried using numpy.fromfile() but it didn't work out for me.
file = open('model_binary', 'rb')
data = np.fromfile(file)
When I printed the data I got [0.00000000e+00 2.19335211e-13 8.33400000e+04 ... 2.04800049e+03 2.04800050e+03 5.25260241e+07] which is absolutely not what I want.
I ran the following code to check what was in the file,
for line in file:
    print(line)
    break
I got the output as b'\x00\x00\x00\x00\......\c1\x07#\x00\x00\x00\x00S\xc5{#j\xfd\n' which I suppose is in binary format.
I would like to get the array back from the binary file as it was saved. Any help will be appreciated.
As Kevin noted, passing the dtype is required. You might also need to reshape (you have 3 columns in your example). So
file = open('model_binary', 'rb')
data = np.fromfile(file, dtype=np.float32).reshape((-1, 3))
should work for you.
As an aside, np.save does save in binary format, and it avoids these issues because it stores the dtype and shape alongside the data.
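A minimal round trip showing the difference: `tofile` writes raw bytes with no metadata, whereas `np.save` records dtype and shape in a header, so `np.load` reconstructs the array exactly (filename here is illustrative).

```python
import numpy as np

data = np.arange(12, dtype=np.float32).reshape(-1, 3)

np.save('model_binary.npy', data)    # header records dtype and shape
loaded = np.load('model_binary.npy')

print(loaded.dtype, loaded.shape)    # float32 (4, 3)
```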

Dot product for correlation with complex numbers

OK, this question probably has a very simple answer, but I've been searching for quite a while with no luck...
I want to get the dot product of 2 complex numbers in complex-plane-space. However, np.dot and np.vdot both give the wrong result.
Example of what I WANT to do:
a = 1+1j
b = 1-1j
dot(a,b) == 0
What I actually get:
np.dot(a,b) == 2+0j
np.vdot(a,b) == 0-2j
np.conj(a)*b == 0-2j
I am able to get what I want using this rather clumsy expression (edited for readability):
a.real*b.real + a.imag*b.imag
But I am very surprised not to find a nice ufunc to do this. Does it not exist? I was not expecting to have to write my own ufunc to vectorize such a common operation.
Part of my concern here, is that it seems like my expression is doing a lot of extra work extracting out the real/imaginary parts when they should be already in adjacent memory locations (considering a,b are actually already combined in a data type like complex64). This has the potential to cause a pretty severe slowdown.
** EDIT
Using Numba I ended up defining a ufunc:
from numba import vectorize

@vectorize
def cdot(a, b):
    return a.real*b.real + a.imag*b.imag
This allowed me to correlate complex data properly. Here's a correlation image for the guys who helped me!
For arrays and NumPy complex scalars (but not plain Python complex numbers) you can view-cast to float. For example:
a = np.exp(1j*np.arange(4))
b = np.exp(-1j*np.arange(4))
a
# array([ 1. +0.j , 0.54030231+0.84147098j,
# -0.41614684+0.90929743j, -0.9899925 +0.14112001j])
b
# array([ 1. -0.j , 0.54030231-0.84147098j,
# -0.41614684-0.90929743j, -0.9899925 -0.14112001j])
ar = a[...,None].view(float)
br = b[...,None].view(float)
ar
# array([[ 1. , 0. ],
# [ 0.54030231, 0.84147098],
# [-0.41614684, 0.90929743],
# [-0.9899925 , 0.14112001]])
br
# array([[ 1. , -0. ],
# [ 0.54030231, -0.84147098],
# [-0.41614684, -0.90929743],
# [-0.9899925 , -0.14112001]])
Now, for example, all pairwise dot products:
np.inner(ar,br)
# array([[ 1. , 0.54030231, -0.41614684, -0.9899925 ],
# [ 0.54030231, -0.41614684, -0.9899925 , -0.65364362],
# [-0.41614684, -0.9899925 , -0.65364362, 0.28366219],
# [-0.9899925 , -0.65364362, 0.28366219, 0.96017029]])
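For the elementwise case in the question, a plain-NumPy alternative to a custom ufunc is to take the real part of the conjugate product, since Re(a · conj(b)) = a.real*b.real + a.imag*b.imag. A small sketch:

```python
import numpy as np

a = np.exp(1j * np.arange(4))
b = np.exp(-1j * np.arange(4))

# Re(a * conj(b)) equals a.real*b.real + a.imag*b.imag, elementwise,
# without splitting out the real/imaginary parts by hand.
cdot = (a * np.conj(b)).real

print(np.allclose(cdot, a.real * b.real + a.imag * b.imag))  # True
```

This stays inside NumPy's vectorized machinery, so there is no per-element Python overhead.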

Why is this Python array not slicing?

Initial data is:
array([[0.0417634 ],
[0.04493844],
[0.04932728],
[0.04601787],
[0.04511007],
[0.04312284],
[0.0451733 ],
[0.04560687],
[0.04263394],
[0.04183227],
[0.048634 ],
[0.05198746],
[0.05615724],
[0.05787913], dtype=float32)
Then I reshaped it into a 2D array:
array2d = np.reshape(dataset, (-1, 2))
Now I have:
array([[0.0417634 , 0.04493844],
[0.04932728, 0.04601787],
[0.04511007, 0.04312284],
[0.0451733 , 0.04560687],
[0.04263394, 0.04183227],
[0.048634 , 0.05198746],
[0.05615724, 0.05787913],
[0.05989346, 0.0605077 ], dtype=float32)
Now I calculate the mean of each row of the array:
paa = []
paa.append(array2d.mean(axis=1))
Now I want a list of intervals from this result:
intervals = paa[::10]
intervals
but the result is the same list (paa). Why? I already tried converting it with np.array(paa).
I expected a new list with fewer elements. Since 10 is the step size, I'm expecting [0.0417634, ... paa[11], .... paa[21] .... ]
array2d.mean(axis=1) returns a NumPy array. You are appending that whole result as a single element of a list, so paa has length 1; when you slice paa[::10], you just get the 0th (and only) element back, which is the entire array.
Get rid of the list and the append, and slice the result of mean directly.
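Concretely, slicing the array returned by mean (instead of a one-element list wrapping it) gives the expected subsampling; toy data here stands in for the original dataset.

```python
import numpy as np

array2d = np.arange(40, dtype=np.float32).reshape(-1, 2)  # 20 rows of 2

paa = array2d.mean(axis=1)   # one mean per row -> shape (20,)
intervals = paa[::10]        # every 10th mean -> shape (2,)

print(intervals)             # [ 0.5 20.5]
```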

How can I convert a multidimensional-array string back to an array in Python?

I found an interesting problem. I am trying to save a huge list of NumPy arrays of different lengths to a file so I can reuse it later on. While I managed to save the list, I struggle to read it back. Neither NumPy nor plain Python seems able to convert the string back into the initial list. Any tips on how to do that?
I have already tried:
list(), np.array() and np.fromstring()
The list looks like this, except that it continues for about 100,000 lines:
[[array([[[ 0.481903 , 0.15787785, 0.05661286],
[-0.08817253, -0.14168766, -0.13894859],
[-0.27888685, -0.11231906, 0.26054043]],
[[ 0.0913363 , 0.09927119, 0.42296773],
[ 0.45385012, 0.0164008 , 0.823071 ],
[-0.7438939 , -0.72650474, -0.4468163 ]],
[[-0.34211668, -0.00215243, 0.26458675],
[-0.23189187, 0.9370323 , -0.6188508 ],
[-0.85894495, 0.43526295, 0.17926843]]], dtype=float32), array([[[ 0.78955674, -0.6114772 , -0.18336566],
[-0.12059411, -1.0608526 , -0.47686368],
[-0.00781631, -0.36990076, 0.23920381]],
[[-0.2827969 , -0.5920803 , 1.1788696 ],
[-0.02591886, -0.24817304, -0.17913376],
[-0.7543818 , -0.00784254, -0.38197488]],
[[-0.566821 , -0.35077536, 0.32748973],
[ 0.26770943, 0.04574856, -0.7584006 ],
[ 1.1999835 , -0.42707324, -0.2599928 ]]], dtype=float32), array([[[ 0.501889 , 0.11805235, -0.28508088],
[-0.18496978, -1.2954917 , 0.39576113],
[ 0.03896124, -0.80981237, 0.8888588 ]],
[[ 0.28127173, -0.04418045, 0.74862033],
[ 0.5746676 , -1.0427617 , -0.00984947],
[ 1.357876 , 0.49865335, -0.5559544 ]],
[[-0.2253674 , 0.01848532, 0.16229743],
[ 0.02945629, -0.3473735 , -0.16368015],
[-0.21004315, 0.75182045, -0.14023288]]], dtype=float32),
list() and np.array() both work but produce totally different results from what the list initially was
EDIT: Answer found thanks to @hpaulj.
The list was converted to an ndarray and then saved as a .npy binary file using np.save.
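A minimal sketch of that approach (filename is illustrative): arrays of different shapes become a 1-D object array, which np.save pickles and np.load restores with allow_pickle=True.

```python
import numpy as np

ragged = [np.zeros((2, 3), dtype=np.float32),
          np.zeros((4,), dtype=np.float32)]

# Ragged shapes force a 1-D object array; pickling handles the nested arrays.
np.save('nested.npy', np.array(ragged, dtype=object), allow_pickle=True)
back = np.load('nested.npy', allow_pickle=True)

print(len(back), back[0].shape, back[1].shape)  # 2 (2, 3) (4,)
```

Unlike writing the repr of the list to a text file, this round-trips the exact dtypes and shapes with no string parsing.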

Memory issue when iterating over a list of arrays

I have a list of arrays such that:
arr = [array([1,2,3,4,5]), array([1,4,6,7]) .......]
which contains 40000 arrays. I would ideally like to have it as a 2d numpy array but I can't guarantee that all the arrays will be the same length.
I want to do something basic to all of the values in the list, such as:
out = (3*arr)+2
but I quickly run out of memory (on a 32 GB machine), so it's obviously very inefficient. I have tried iterating over the list and appending the results to a new list, but this is equally inefficient.
Is there an efficient way to achieve this?
--------------- EDIT --------------
arr looks like:
[array([ 451.481649, 456.490319, 461.498989, 466.507659, 471.516329,
476.524999, 481.533669, 486.542339, 491.551009, 496.559679,
501.568349, 506.577019, 511.585689, 516.594359, 521.603029,
526.611699, 531.62037 , 536.62904 , 541.63771 , 546.64638 ,
551.65505 , 556.66372 , 561.67239 , 566.68106 , 571.68973 ,
576.6984 , 581.70707 , 586.71574 , 591.72441 , 596.73308 ,
601.74175 , 606.75042 , 611.75909 , 616.76776 , 621.77643 ,
626.7851 , 631.79377 , 636.80244 , 641.811111, 646.819781,
651.828451, 656.837121, 661.845791, 666.854461, 671.863131]),
array([ 451.481649, 456.490319, 461.498989, 466.507659, 471.516329,
476.524999, 481.533669, 486.542339, 491.551009, 496.559679,
501.568349, 506.577019, 511.585689, 516.594359, 521.603029,
526.611699, 531.62037 , 536.62904 , 541.63771 , 546.64638 ,
551.65505 , 556.66372 , 561.67239 , 566.68106 , 571.68973 ,
576.6984 , 581.70707 , 586.71574 , 591.72441 , 596.73308 ,
601.74175 , 606.75042 , 611.75909 , 616.76776 , 621.77643 ,
626.7851 , 631.79377 , 636.80244 , 641.811111, 646.819781,
651.828451, 656.837121, 661.845791, 666.854461, 671.863131]),
array([ 451.481649, 456.490319, 461.498989, 466.507659, 471.516329,
476.524999, 481.533669, 486.542339, 491.551009, 496.559679,
501.568349, 506.577019, 511.585689, 516.594359, 521.603029,
526.611699, 531.62037 , 536.62904 , 541.63771 , 546.64638 ,
551.65505 , 556.66372 , 561.67239 , 566.68106 , 571.68973 ,
576.6984 , 581.70707 , 586.71574 , 591.72441 , 596.73308 ,
601.74175 , 606.75042 , 611.75909 , 616.76776 , 621.77643 ,
626.7851 , 631.79377 , 636.80244 , 641.811111, 646.819781,
651.828451, 656.837121, 661.845791, 666.854461, 671.863131])]
If you are really doing it as written, it is likely that you will run into memory issues, since in
out = (3*arr)+2
3*arr is Python list repetition: it replicates your whole list three times rather than multiplying the values. The larger that constant, the larger the list gets, hence the memory explosion (and + 2 then fails anyway, since an int cannot be added to a list).
To achieve what you want without memory issues, use
out = [[3*x + 2 for x in arr_list] for arr_list in arr]
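Since each element of arr is already a NumPy array, a lighter variant is one vectorized operation per array, or in-place updates that reuse the existing buffers; this keeps the values in compact float arrays instead of lists of Python floats. A sketch:

```python
import numpy as np

arr = [np.array([1., 2., 3., 4., 5.]), np.array([1., 4., 6., 7.])]

# One vectorized operation per array:
out = [3 * a + 2 for a in arr]

# Or update in place, avoiding a second list of arrays entirely:
for a in arr:
    a *= 3
    a += 2

print(out[0])        # [ 5.  8. 11. 14. 17.]
```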
