How can I get a sparse confusion matrix? - python

I'm trying to do a "rolling" random forest classification on time-series data.
The model classifies two classes. By "rolling" I mean that it changes the data sample and refits several times.
I get a confusion matrix for each sample set and sum them up as the final step.
However, in several sample sets only one class exists.
In this case, the matrix looks like this:
[[22]]
I want it to look like this instead:
[[22, 0],
 [0, 0]]
Do you have any idea how to make this happen?

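Try this. Passing an explicit shape keeps the matrix 2x2 even when only one class occurs:
import numpy as np
from scipy import sparse
obs = np.random.randint(0, 2, 50)
pred = np.random.randint(0, 2, 50)
vals = np.ones(50).astype('int')
# rows are predictions, columns are observations;
# duplicate (pred, obs) pairs are summed when converting to dense
con = sparse.coo_matrix((vals, (pred, obs)), shape=(2, 2))
print(con.todense())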
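If SciPy is not needed, the same fixed-size result can be sketched with NumPy alone. The helper name fixed_confusion_matrix and its n_classes parameter are hypothetical, and the sketch assumes the labels are integers 0..n_classes-1, with true labels in rows and predictions in columns. (scikit-learn's confusion_matrix also accepts a labels argument, e.g. labels=[0, 1], which forces a fixed shape in the same way.)

```python
import numpy as np

def fixed_confusion_matrix(y_true, y_pred, n_classes=2):
    """Hypothetical helper: always returns an (n_classes, n_classes) matrix."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Encode each (true, pred) pair as a single index, then count occurrences.
    flat = y_true * n_classes + y_pred
    counts = np.bincount(flat, minlength=n_classes * n_classes)
    return counts.reshape(n_classes, n_classes)

# A window where only class 0 occurs still yields a 2x2 matrix.
only_one_class = fixed_confusion_matrix(np.zeros(22, dtype=int),
                                        np.zeros(22, dtype=int))
# Matrices from different windows can now be summed safely.
other_window = fixed_confusion_matrix([0, 1, 1], [0, 1, 0])
total = only_one_class + other_window
print(total)
```

Because every window produces the same shape, the final summation step no longer depends on which classes happened to appear in a given sample set.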

Related

Generate bootstrap sample from ndarray

Is there a way to generate a bootstrap sample on an N-dimensional array? I am limited to using numpy==1.19.4
I have already tried using a for loop on the other dimensions to no avail, but the following works for 1-dimensional arrays.
import numpy as np
# Set random state and number of resamples
random_state = 42  # example value; not defined in the original snippet
np.random.seed(random_state)
n_resamples = 9999
# Generate data
data_1d = np.arange(2, 3, 0.1)
data_nd = np.random.default_rng(42).random((2,3,2))
data = data_1d.copy()
# Resample the data with replacement, computing the test statistic for each set of resamples
bs_samples = [np.std(np.random.choice(data, size=len(data))) for _ in range(n_resamples)]
If I understand your problem correctly, this is the method I usually apply.
Suppose you have this multi-dimensional array:
data_nd = np.random.rand(100, 3, 2)
data_nd.shape #(100, 3, 2)
You can draw bootstrap samples from it like this:
n_resamples = 99
data_nd[np.random.randint(len(data_nd), size=len(data_nd)*n_resamples)].reshape(n_resamples, *data_nd.shape).shape
What I'm doing is randomly drawing indices with replacement (randint) and then reshaping the result to obtain 99 bootstrapped datasets with the same dimensions as the original one.
Note that this procedure treats the arrays along the first axis as the "elements", so each element you sample has shape (3, 2).
I hope that is clear, but if you have any doubts please let me know.
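To connect this back to the test statistic in the question, here is a minimal sketch that draws the indices as a 2-D array (so no reshape is needed) and computes the standard deviation over each resample. The variable names idx, samples and bs_std are illustrative, and np.random.default_rng is assumed to be acceptable since it already appears in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
data_nd = rng.random((100, 3, 2))
n_resamples = 99

# One row of indices per resample, drawn with replacement along the first axis.
idx = rng.integers(0, len(data_nd), size=(n_resamples, len(data_nd)))

# Fancy indexing with a 2-D index array gives shape (n_resamples, 100, 3, 2).
samples = data_nd[idx]

# Standard deviation over the resampled elements (axis 1) for every resample.
bs_std = samples.std(axis=1)
print(samples.shape, bs_std.shape)
```

Each entry of bs_std along its first axis is the statistic for one bootstrap dataset, computed separately for every position in the trailing (3, 2) dimensions.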

Euclidean distance between the two points using vectorized approach

I have two large numpy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE achieves what I want in the final result, but since my real-life data is large, I really want a vectorized solution as opposed to using a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: arrays are a numpy type, so it is useful to know the numpy API itself (which is probably what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
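As a quick sanity check, the following NumPy-only sketch compares the row-wise loop with the vectorized form and with the diagonal of the full pairwise matrix. The variable names loop, vec and full are illustrative, and the full matrix is built by broadcasting here rather than by calling sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(5, 3))
Y = rng.integers(0, 10, size=(5, 3))

# Row-by-row distances, equivalent to the loop in the question.
loop = np.array([np.sqrt(np.sum((x - y) ** 2)) for x, y in zip(X, Y)])

# Vectorized: one norm per row of the difference.
vec = np.linalg.norm(X - Y, axis=1)

# Full pairwise matrix via broadcasting; its diagonal holds the paired distances.
full = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))

print(np.allclose(loop, vec), np.allclose(np.diag(full), vec))
```

The full matrix needs memory proportional to the square of the sample size, which is why the row-wise form is preferable when only paired distances are required.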

Why does my sklearn.metrics confusion_matrix output look transposed?

It's my understanding that confusion matrices should show the TRUE classes in the columns and the PREDICTED classes in the rows. Therefore the sum of the columns should be equal to the value_counts() of the TRUE series.
I have provided an example here:
from sklearn.metrics import confusion_matrix
pred = [0, 0, 0, 1]
true = [1, 1, 1, 1]
confusion_matrix(true, pred)
Why does this give me the following output? Surely it should be the transpose of that?
array([[0, 0],
       [3, 1]], dtype=int64)
The confusion probably arises because sklearn follows a different convention for the axes of the confusion matrix than the Wikipedia article. So, to answer your question: it gives you the output in that format because sklearn expects you to read it in a specific way.
There are two different ways of writing a confusion matrix:
sklearn's way: true labels in rows and predicted labels in columns
the Wikipedia example: the opposite of sklearn, with predicted labels in rows and true labels in columns
scikit-learn's confusion matrix follows a specific order and structure.
Reference: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
You can get the layout you want with sklearn; adapt the code below as needed (predict and y_test stand for your predicted and true labels):
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(7, 4))
# Passing the predictions first puts predicted labels in rows and true labels in columns
cm = confusion_matrix(predict, y_test, labels=[1, 0])
ConfusionMatrixDisplay(cm, display_labels=[1, 0]).plot(values_format=".0f", ax=ax)
ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()
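The two conventions differ only by a transpose, which can be checked on the example from the question. This is a NumPy-only sketch that rebuilds the matrix by hand (it does not call sklearn); the names cm and cm_pred_rows are illustrative.

```python
import numpy as np

true = np.array([1, 1, 1, 1])
pred = np.array([0, 0, 0, 1])

# sklearn-style layout: row index = true label, column index = predicted label.
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (true, pred), 1)

# Transposing gives the other layout: predicted labels in rows, true in columns.
cm_pred_rows = cm.T

print(cm)
print(cm_pred_rows)
```

In the first layout the row sums equal the counts of the true labels; in the transposed layout the column sums do, which is the property the question expected.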

Adding new points to the t-SNE model

I am trying to use the t-SNE algorithm in scikit-learn:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
Output:
array([[ 0.00017599, 0.00003993], #1
[ 0.00009891, 0.00021913],
[ 0.00018554, -0.00009357],
[ 0.00009528, -0.00001407]]) #2
After that I try to add some points, with coordinates identical to rows of the first array X, to the existing model:
Y = np.array([[0, 0, 0], [1, 1, 1]])
model.fit_transform(Y)
Output:
array([[ 0.00017882, 0.00004002], #1
[ 0.00009546, 0.00022409]]) #2
But the coordinates in the second array are not equal to the first and last coordinates from the first array.
I understand that this is the correct behaviour, but how can I add new points to the model and get the same output coordinates for the same input coordinates?
I also still need to be able to find the closest points after appending new points.
Quoting the author of t-SNE from here: https://lvdmaaten.github.io/tsne/
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
Also, this answer on stats.stackexchange.com contains ideas and a link to a very nice and very fast recent Python implementation of t-SNE, https://github.com/pavlin-policar/openTSNE, that allows embedding of new points out of the box, as well as a link to https://github.com/berenslab/rna-seq-tsne/.
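As a rough illustration of the "train a regressor on the map" idea, the simplest possible regressor is a 1-nearest-neighbour lookup: a new point receives the map location of its closest training point. This is an assumption chosen for brevity, not the method from the quoted paper, and the helper name embed_by_nearest_neighbor is hypothetical. The embedding values are copied from the output shown in the question.

```python
import numpy as np

def embed_by_nearest_neighbor(X_train, emb_train, X_new):
    """Hypothetical helper: map each new point to the embedding of its nearest training point."""
    diffs = X_new[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_new, n_train)
    nearest = dists.argmin(axis=1)
    return emb_train[nearest]

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
# Map coordinates taken from the fit_transform(X) output in the question.
emb = np.array([[0.00017599, 0.00003993],
                [0.00009891, 0.00021913],
                [0.00018554, -0.00009357],
                [0.00009528, -0.00001407]])

Y = np.array([[0, 0, 0], [1, 1, 1]], dtype=float)
Y_emb = embed_by_nearest_neighbor(X, emb, Y)
print(Y_emb)
```

With this lookup, inputs that coincide with training points get exactly the same map coordinates, at the cost of placing every unseen point directly on top of an existing one; a smoother regressor would interpolate instead.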

Efficient two dimensional numpy array statistics

I have many 100x100 grids, is there an efficient way using numpy to calculate the median for every grid point and return just one 100x100 grid with the median values? Presently, I'm using a for loop to run through each grid point, calculating the median and then combining them into one grid at the end. I'm sure there's a better way to do this using numpy. Any help would be appreciated! Thanks!
Create a 100x100xN array (or stack the grids together if that's not possible) and use np.median with the correct axis to do it in one go:
import numpy as np
a = np.random.rand(100,100)
b = np.random.rand(100,100)
c = np.random.rand(100,100)
d = np.dstack((a,b,c))
result = np.median(d,axis=2)
How many grids are there?
One option would be to create a 3D array that is 100x100xnumGrids and compute the median across the 3rd dimension.
Use the axis parameter of np.median:
import numpy as np
data = np.random.rand(100, 5, 5)
print(np.median(data, axis=0))
print(np.median(data[:, 0, 0]))
print(np.median(data[:, 1, 0]))
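When the grids already live in a Python list, a minimal sketch of the same idea is to stack them along a new leading axis and take the median over it. The number of grids (7) and the names grids, stacked and median_grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: a list of 100x100 grids.
grids = [rng.random((100, 100)) for _ in range(7)]

# Stack into shape (n_grids, 100, 100), then reduce over the grid axis.
stacked = np.stack(grids, axis=0)
median_grid = np.median(stacked, axis=0)

print(median_grid.shape)
```

Each cell of median_grid is the median of that cell across all grids, which matches what the per-point loop computes.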
