I have a pandas dataframe like this, where each ID is an observation with variables attr1, attr2 and attr3:
ID attr1 attr2 attr3
20 2 1 2
10 1 3 1
5 2 2 4
7 1 2 1
16 1 2 3
28 1 1 3
35 1 1 1
40 1 2 3
46 1 2 3
21 3 1 3
and from it built a similarity (distance) matrix, where each pair of IDs is compared by the sum of the absolute pairwise attribute differences:
[[ 0. 4. 3. 3. 3. 2. 2. 3. 3. 2.]
[ 4. 0. 5. 1. 3. 4. 2. 3. 3. 6.]
[ 3. 5. 0. 4. 2. 3. 5. 2. 2. 3.]
[ 3. 1. 4. 0. 2. 3. 1. 2. 2. 5.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 4. 3. 3. 1. 0. 2. 1. 1. 2.]
[ 2. 2. 5. 1. 3. 2. 0. 3. 3. 4.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 6. 3. 5. 3. 2. 4. 3. 3. 0.]]
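For reference, this is the pairwise Manhattan ('cityblock') distance; a sketch of one way to build it, assuming the table above is loaded as a dataframe df:
from scipy.spatial.distance import pdist, squareform

# pdist with 'cityblock' sums the absolute attribute differences per pair;
# squareform expands the condensed result into the full 10x10 matrix
X = df[['attr1', 'attr2', 'attr3']].values
dist_matrix = squareform(pdist(X, metric='cityblock'))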
I tried DBSCAN from sklearn for clustering the data, but it seems that only the clusters themselves get labeled? I want to find the ID of each data point in the visualization later, so I only want to cluster on the differences between the IDs, not on the ID values themselves. Is there another algorithm better suited to this kind of data, or a way to label the distance matrix values so it can be used with DBSCAN or another method?
P.S. The dataset has over 50 attributes and 10,000 observations.
The labels_ attribute will give you an array of labels for each of your data points from training. The first index of that array is the label of your first training data point, and so on.
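For example, DBSCAN accepts metric='precomputed', in which case fit() takes the distance matrix directly, and labels_ follows the row order, so you can zip it back to your IDs. A minimal sketch; eps and min_samples are illustrative values you would need to tune:
from sklearn.cluster import DBSCAN

ids = [20, 10, 5, 7, 16, 28, 35, 40, 46, 21]
# dist_matrix is the 10x10 matrix from the question
db = DBSCAN(eps=2, min_samples=2, metric='precomputed').fit(dist_matrix)

# labels_[i] is the cluster assigned to row i, i.e. to ids[i]; -1 marks noise
for id_, label in zip(ids, db.labels_):
    print(id_, label)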
I have a 2D numpy array. I'm trying to compute the similarities between rows and put them into a similarities array. Is this possible without a loop? Thanks for your time!
# ratings.shape = (943, 1682)
arri = np.zeros(943)
arri = np.where(arri == 0)[0]
arrj = np.zeros(943)
arrj = np.where(arrj == 0)[0]
similarities = np.zeros((ratings.shape[0], ratings.shape[0]))
similarities[arri, arrj] = np.abs(ratings[arri]-ratings[arrj])
I want to make a 2D array similarities such that similarities[i, j] is the difference between row i and row j of ratings, but the code above fails with:
ValueError: shape mismatch: value array of shape (943,1682) could not be broadcast to indexing result of shape (943,)
The problem is how numpy iterates through the array when indexing a two-dimensional array with two arrays.
First some setup:
import numpy
ratings = numpy.arange(1, 6)
indicesX = numpy.indices((ratings.shape[0],1))[0]
indicesY = numpy.indices((ratings.shape[0],1))[0]
ratings:  [1 2 3 4 5]
indicesX: [[0] [1] [2] [3] [4]]
indicesY: [[0] [1] [2] [3] [4]]
Now let's see what your program produces:
similarities = numpy.zeros((ratings.shape[0], ratings.shape[0]))
similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[0])
similarities:
[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 4.]]
As you can see, numpy iterates over similarities basically like the following:
for i in range(5):
    similarities[indicesX[i], indicesY[i]] = numpy.abs(ratings[i]-ratings[0])
similarities:
[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 4.]]
Now instead we need indices like the following to iterate through the entire array:
indicesX = [0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]
indicesY = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4]
We can build those as follows:
# Reshape indicesX from (x,1) to (x,). That's important for numpy.tile().
indicesX = indicesX.reshape(indicesX.shape[0])
indicesX = numpy.tile(indicesX, ratings.shape[0])
indicesY = numpy.repeat(indicesY, ratings.shape[0])
indicesX: [0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]
indicesY: [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4]
Perfect! Now just call similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[indicesY]) again and we see:
similarities:
[[0. 1. 2. 3. 4.]
[1. 0. 1. 2. 3.]
[2. 1. 0. 1. 2.]
[3. 2. 1. 0. 1.]
[4. 3. 2. 1. 0.]]
Here is the whole code again:
import numpy
ratings = numpy.arange(1, 6)
indicesX = numpy.indices((ratings.shape[0],1))[0]
indicesY = numpy.indices((ratings.shape[0],1))[0]
similarities = numpy.zeros((ratings.shape[0], ratings.shape[0]))
indicesX = indicesX.reshape(indicesX.shape[0])
indicesX = numpy.tile(indicesX, ratings.shape[0])
indicesY = numpy.repeat(indicesY, ratings.shape[0])
similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[indicesY])
print(similarities)
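For completeness, the same matrix can also be built without any index arrays at all, using broadcasting; and for the real (943, 1682) ratings array, scipy's cdist computes the per-row sums directly. A sketch under those assumptions:
import numpy
from scipy.spatial.distance import cdist

ratings = numpy.arange(1, 6)

# A column vector minus a row vector broadcasts to the full 5x5 table
similarities = numpy.abs(ratings[:, None] - ratings[None, :])
print(similarities)

# For a 2-D ratings array of shape (n, m), cdist sums the absolute
# per-column differences for every pair of rows without a Python loop:
# similarities = cdist(ratings, ratings, metric='cityblock')
Broadcasting the 2-D case directly (ratings[:, None, :] - ratings[None, :, :]) would allocate a huge (943, 943, 1682) intermediate, so cdist is the safer choice at that size.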
P.S. You commented on your own post to improve it. When you want to improve a question, you should edit it instead of commenting on it.
Say I have a large dataframe and some lists of its columns, and I want to be able to pass them to patsy's dmatrices without writing out each name individually. That is, I want to build the formula terms from lists of column names rather than writing out every single column of my dataframe.
For example take the following df
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8],
                   'c': [8, 4, 5, 3], 'd': [1, 3, 55, 3],
                   'e': [8, 4, 5, 3]})
df
>>
a b c d e
0 1 5 8 1 8
1 2 6 4 3 4
2 3 7 5 55 5
3 4 8 3 3 3
As I understand it, building the design matrices requires me to do the following:
y,x = dmatrices('a~b+c+d+e', data=df)
However I would like to be able to run something more along the lines of:
regress=['b', 'c']
control=['e', 'd']
y, x = dmatrices('a~{}+{}'.format(' '.join(e for e in regress),
                                  ' '.join(c for c in control)), data=df)
However, this was unsuccessful.
I also attempted to use a dictionary with two entries, say regress and control, filled with lists of the column names, and to pass that as the first argument of dmatrices, but it didn't work either.
Does anyone have suggestions for a more efficient way to get things into patsy's dmatrices rather than writing out each and every column name we would like to include in the matrix?
Thanks in advance and let me know if I was not clear on anything.
Doing it with a for loop here:
for z in regress:
    for t in control:
        y, x = dmatrices('a~{}+{}'.format(z, t), data=df)
        print('a~{}+{}'.format(z, t))
        print(y, x)
a~b+e
[[1.]
[2.]
[3.]
[4.]] [[1. 5. 8.]
[1. 6. 4.]
[1. 7. 5.]
[1. 8. 3.]]
a~c+e
[[1.]
[2.]
[3.]
[4.]] [[1. 8. 8.]
[1. 4. 4.]
[1. 5. 5.]
[1. 3. 3.]]
a~d+e
[[1.]
[2.]
[3.]
[4.]] [[ 1. 1. 8.]
[ 1. 3. 4.]
[ 1. 55. 5.]
[ 1. 3. 3.]]
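If the goal is instead a single design matrix containing all of the listed columns at once, joining the names with '+' should work; a sketch, reusing df, regress, and control from above:
from patsy import dmatrices

formula = 'a~' + '+'.join(regress + control)  # 'a~b+c+e+d'
y, x = dmatrices(formula, data=df)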
I have the following input:
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 17. 18. 19.]
Expected output:
[ 0. 0. 0. 0. 0. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
0. 0. 0. 0. 0.]
Current code:
from numpy import linspace
input_list = linspace(0, 20, 20, endpoint=False)
input_list[:5] = 0
input_list[15:] = 0
print(input_list)
I'm wondering if there are more elegant/pythonic ways of doing it?
I mean, you could do this if you just wanted that range.
list(range(5,15))
Or, if you want to zero out the first and last few entries of a plain Python list:
[0]*5 + list(input_list[5:15]) + [0]*5
Or if it's conditional:
[x if 4 < x < 15 else 0 for x in input_list]
Try a list comprehension:
l1 = [0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.]
l2 = [x if x in range(5, 15) else 0. for x in l1]
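Since the input is already a numpy array, an index mask keeps the whole thing vectorized; a minimal sketch, assuming positions 5 through 14 are the ones to keep:
import numpy as np

arr = np.linspace(0, 20, 20, endpoint=False)
idx = np.arange(arr.size)
# zero out everything outside positions 5..14
result = np.where((idx >= 5) & (idx < 15), arr, 0.0)
print(result)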
I'm aligning multiple datasets (model and observations), and I thought it would make a lot of sense if xarray.align had an option to propagate NaNs/missing data in one dataset to the others. For now, I'm using xr.DataArray.where in combination with np.isfinite, but especially my attempt to generalize this for more than two arrays feels a bit tricky. Is there a better way to do this?
a = xr.DataArray(np.arange(10).astype(float))
b = xr.DataArray(np.arange(10).astype(float))
a[[4, 5]] = np.nan
print(a.values)
print(b.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Default behaviour
c, d = xr.align(a, b)
print(c.values)
print(d.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Desired behaviour
e, f = xr.align(a.where(np.isfinite(b)), b.where(np.isfinite(a)))
print(e.values)
print(f.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
# Attempt to generalize for multiple arrays
c = b.copy()
c[[1, -1]] = np.nan

def align_better(*dataarrays):
    allvalid = np.all(np.array([np.isfinite(x) for x in dataarrays]), axis=0)
    return xr.align(*[da.where(allvalid) for da in dataarrays])
g, h, i = align_better(a, b, c)
print(g.values)
print(h.values)
print(i.values)
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
From the xarray docs:
Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
Array from the aligned objects are suitable as input to mathematical operators, because along each dimension they have the same index and size.
Missing values (if join != 'inner') are filled with NaN.
Nothing about this function deals with the values in the arrays, just the dimensions and coordinates. This function is used for setting up arrays for operations against each other.
If your desired behavior is a function that returns NaN for all arrays where any arrays are NaN, your align_better function seems like a decent way to do it.
The function in my initial attempt was slow because the dataarrays were cast to numpy arrays. In this modified version, I align the datasets first; then I can safely use the .values attribute, which is much faster.
def align_better(*dataarrays):
    """Align dataarrays and propagate NaNs."""
    aligned = xr.align(*dataarrays)
    allvalid = np.all(np.asarray([np.isfinite(x).values for x in aligned]), axis=0)
    return [da.where(allvalid) for da in aligned]
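Usage is the same as before; a quick check with the arrays defined above:
g, h, i = align_better(a, b, c)
print(g.values)
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]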
I have a function that returns a numpy array every second, which I want to store in another array for reference. For example (array_a is returned):
array_a = [[ 25. 50. 25. 25. 50. ]
[ 1. 1. 1. 1. 1. ]]
array_collect = np.append(array_a,array_collect)
But when I print array_collect, I get one flattened array, not a bigger array with the original arrays inside it:
array_collect = [ 25. 50. 25. 25. 50.
1. 1. 1. 1. 1.
25. 50. 25. 25. 50.
1. 1. 1. 1. 1.
25. 50. 25. 25. 50. ]
What I want is:
array_collect = [ [[ 25. 50. 25. 25. 50. ]
[1. 1. 1. 1. 1. ]]
[[ 25. 50. 25. 25. 50. ]
[1. 1. 1. 1. 1. ]]
[[ 25. 50. 25. 25. 50. ]
[1. 1. 1. 1. 1. ]] ]
How do I get that?
You could use vstack:
array_collect = np.array([[25.,50.,25.,25.,50.],[1.,1.,1.,1.,1.]])
array_a = np.array([[2.,5.,2.,2.,5.],[1.,1.,1.,1.,1.]])
array_collect=np.vstack((array_collect,array_a))
However, if you know the total number of minutes in advance, it would be better to define your array first (e.g. using zeros) and gradually fill it - this way, it is easier to stay within memory limits.
no_minutes = 5 #say 5 minutes
array_collect = np.zeros((no_minutes,array_a.shape[0],array_a.shape[1]))
Then, for every minute m:
array_collect[m] = array_a
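Put together, the filling loop might look like this (a sketch; produce_array is a hypothetical stand-in for whatever returns the new (2, 5) array each minute):
for m in range(no_minutes):
    array_a = produce_array()  # hypothetical producer
    array_collect[m] = array_a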
Just use np.concatenate() and reshape this way:
import numpy as np
array_collect = np.array([[25.,50.,25.,25.,50.],[1.,1.,1.,1.,1.]])
array_a = np.array([[2.,5.,2.,2.,5.],[1.,1.,1.,1.,1.]])
array_collect = np.concatenate((array_collect,array_a),axis=0).reshape(2,2,5)
>>
[[[ 25. 50. 25. 25. 50.]
[ 1. 1. 1. 1. 1.]]
[[ 2. 5. 2. 2. 5.]
[ 1. 1. 1. 1. 1.]]]
I found it; this can be done using np.reshape(). The new array can be reshaped with
y = np.reshape(y, (a, b, c))
where a is the number of arrays stored and (b, c) is the shape of the original array.
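A minimal sketch of that approach, assuming n copies of a (2, 5) array were appended:
import numpy as np

array_a = np.array([[25., 50., 25., 25., 50.],
                    [1., 1., 1., 1., 1.]])

array_collect = np.array([])
n = 3
for _ in range(n):
    # np.append flattens its inputs, which is why the result looked merged
    array_collect = np.append(array_collect, array_a)

# restore the block structure: n arrays of shape (2, 5)
array_collect = np.reshape(array_collect, (n, 2, 5))
print(array_collect)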