I've trained a machine learning model using sklearn and want to simulate the result by sampling the predictions according to the predict_proba probabilities. So I want to do something like
samples = np.random.choice(a = possible_outcomes, size = (n_data, n_samples), p = probabilities)
Where probabilities is an (n_data, n_possible_outcomes) array.
But np.random.choice only allows 1d arrays for the p argument. I've currently gotten around this using a for-loop like the following implementation
sample_outcomes = np.zeros((len(probs), n_samples))
for i in trange(len(probs)):  # trange: tqdm's range with a progress bar
    sample_outcomes[i, :] = np.random.choice(outcomes, size=n_samples, p=probs[i])
but that's relatively slow. Any suggestions to speed this up would be much appreciated!
If I understood correctly, you want a vectorized way of applying choice
several times, each time with a different probability vector.
You could implement this by hand as follows:
import numpy as np
# for reproducibility
np.random.seed(42)
# number of samples
k = 5
# possible outcomes
outcomes = np.arange(10)
# generate a random probability matrix for 15 runs
probabilities = np.random.random((15, 10))
probs = probabilities / probabilities.sum(1)[:, None]
# generate the choices by picking those probabilities above a random generated number
# the higher the value in probs the higher the probability to pick it
choices = probs - np.random.random((15, 10))
# to pick the top k using argpartition need to multiply by -1
choices = -1 * choices
# pick the top k values
res = outcomes[np.argpartition(choices, k, axis=1)][:, :k]
# flatten to match the expected output
print(res.flatten())
Output
[1 8 2 5 3 6 4 8 7 0 1 5 9 3 7 1 4 9 0 8 5 0 4 3 6 8 5 1 2 6 5 3 2 0 6 5 4
2 3 7 7 9 4 6 1 3 6 4 2 1 4 9 3 0 1 6 9 2 3 8 5 4 7 6 1 5 3 8 2 1 1 0 9 7
4]
In the above example the code samples 5 (k) elements from a population of 10 (outcomes) 15 times, each time with a different probability vector (probs, with a shape of 15 by 10).
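Note that picking the top k this way returns k distinct outcomes per row, i.e. it samples without replacement. If you need draws with replacement, as np.random.choice does by default, here is a sketch of a fully vectorized route that inverts each row's cumulative distribution with uniform draws (variable names are illustrative and follow the question; memory grows as n_data x n_samples x n_outcomes):
import numpy as np

rng = np.random.default_rng(42)
n_data, n_outcomes, n_samples = 15, 10, 5
outcomes = np.arange(n_outcomes)

# random probability matrix with rows summing to 1
p = rng.random((n_data, n_outcomes))
probs = p / p.sum(axis=1, keepdims=True)

# row-wise CDF, then find the first CDF entry exceeding each uniform draw
cdf = probs.cumsum(axis=1)                            # (n_data, n_outcomes)
u = rng.random((n_data, n_samples))                   # (n_data, n_samples)
idx = (u[:, :, None] < cdf[:, None, :]).argmax(axis=2)
samples = outcomes[idx]                               # (n_data, n_samples)
print(samples.shape)                                  # (15, 5)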
Here is an example of what you can do, if I understand your question correctly:
import numpy as np
#create a list of indices
index_list = np.arange(len(possible_outcomes))
# sample indices based on the probabilities
choice = np.random.choice(a = index_list, size = n_samples, p = probabilities)
# get samples based on randomly chosen indices
samples = possible_outcomes[choice]
I'm making sure I understand your problem correctly. Can you just create samples as an array of size n_data * n_samples and then use the resize method to get it to the right size?
samples = np.random.choice(a = possible_outcomes, size = n_data * n_samples, p = probabilities)
samples.resize((n_data, n_samples))
I have a vector which contains 10 values of sample 1 and 25 values of sample 2.
Fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))
I want to create a stratified output vector where :
sample 1 is split 80% / 20%: 8 values of 1 and 2 values of 0.
sample 2 is split 80% / 20%: 20 values of 1 and 5 values of 0.
The expected output will be :
Output = np.array((0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1))
How can I automate this? I can't use the sampling function from scikit-learn because this is not for a machine learning experiment.
Here is one way to get your desired result, with reproducibility of output added. We draw random index values for each of the two groups from the input (fact) array, without replacement. Then, we create a new output array where we assign 1's in locations corresponding to the drawn index values and assign 0's everywhere else.
import numpy as np
from numpy.random import RandomState
rng = RandomState(123)
fact = np.array(
(2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2),
dtype='int8'
)
idx_arr = np.hstack(
(
rng.choice(np.argwhere(fact == 1).flatten(), 8, replace=False),
rng.choice(np.argwhere(fact == 2).flatten(), 20, replace=False),
)
)
out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)
print(out)
# [0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1]
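If you want to derive the 80% counts from the data rather than hard-coding 8 and 20, here is a minimal sketch reusing the same fact and rng objects from above (the 0.8 ratio is the only assumption):
ratio = 0.8
idx_arr = np.hstack([
    rng.choice(np.flatnonzero(fact == g),
               int(round(ratio * np.count_nonzero(fact == g))),
               replace=False)
    for g in np.unique(fact)
])
out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)
print(out)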
Following the StackOverflow post Elegantly calculate mean of first three values of a list, I have tweaked the code to find the maximum.
However, I also require to know the position/index of the max.
So the code below calculates the max value for the first 3 numbers and then the max value for the next 3 numbers and so on.
For example, for the list of values [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1], the code below takes the first 3 values (6, 3, 7) and outputs the max as 7, then for the next 3 values (4, 6, 9) outputs 9, and so on.
But I also want to find which position/index they are at, i.e. 7 is at position 2 and 9 at position 5. The final result would be [2, 5, 8, 11, 12, ...]. Any ideas on how to calculate the index? Thanks in advance.
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
output: test_data : [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
output: [7, 9, 7, 7, 7, 7, 5]
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
# max of each chunk of 3
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
# argmax within each chunk, shifted by the chunk's start offset i
index = [(np.argmax(test_data[i: i+3]) + i) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
print(index)  # -> 2, 5, 8, 11, 12, 17, 18
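A vectorized alternative, as a sketch: pad the tail with -1 so the length becomes a multiple of 3 (safe here because the values are non-negative), then reshape and take the max/argmax per row.
import numpy as np

np.random.seed(42)
test_data = np.random.randint(low=0, high=10, size=20)

chunk = 3
padded = np.pad(test_data, (0, (-len(test_data)) % chunk), constant_values=-1)
blocks = padded.reshape(-1, chunk)
maxval = blocks.max(axis=1)
index = blocks.argmax(axis=1) + np.arange(0, len(padded), chunk)
print(maxval)  # [7 9 7 7 7 7 5]
print(index)   # [ 2  5  8 11 12 17 18]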
Problem:
Create the most efficient function to turn one 1d array (the group_size column) into another 1d array (the batch_id column).
The conditions are:
At most n groups can be in any batch, in this example n=2.
Each batch must contain groups of the same size.
Trivial condition: minimise the number of batches.
The function will distribute these groups of different sizes into batches with unique identifiers, with the condition that each batch holds at most n groups AND contains only groups of the same size.
data = {'group_size': [1,2,3,1,2,3,4,5,1,2,1,1,1],
'batch_id': [1,4,6,1,4,6,7,8,2,5,2,3,3]}
df = pd.DataFrame(data=data)
print(df)
group_size batch_id
0 1 1
1 2 4
2 3 6
3 1 1
4 2 4
5 3 6
6 4 7
7 5 8
8 1 2
9 2 5
10 1 2
11 1 3
12 1 3
What I need:
some_function( data['group_size'] ) to give me data['batch_id']
Edit:
My Clumsy Function
def generate_array():
    out = 1
    batch_size = 2
    dictionary = {}
    for i in range(df['group_size'].max()):
        # get the mini df corresponding to the group size
        sub_df = df[df['group_size'] == i+1]
        # how many batches will we create?
        no_of_new_batches = np.ceil(sub_df.shape[0] / batch_size)
        # create new array
        a = np.repeat(np.arange(out, out+no_of_new_batches), batch_size)
        shift = len(a) - sub_df.shape[0]
        # remove last elements from array to match the size
        if len(a) != sub_df.shape[0]:
            a = a[0:-shift]
        # update batch id
        out = out + no_of_new_batches
        # create dictionary to store idx
        indexes = sub_df.index.values
        d = dict(zip(indexes, a))
        dictionary.update(d)
    array = [dictionary[i] for i in range(len(dictionary))]
    return array
generate_array()
Out[78]:
[1.0, 4.0, 6.0, 1.0, 4.0, 6.0, 7.0, 8.0, 2.0, 5.0, 2.0, 3.0, 3.0]
Here is my solution. I don't think it gives exactly the same result as your function, but it satisfies your three rules:
import numpy as np
def package(data, mxsz):
    idx = data.argsort()
    ds = data[idx]
    chng = np.empty((ds.size + 1,), bool)
    chng[0] = True
    chng[-1] = True
    chng[1:-1] = ds[1:] != ds[:-1]
    szs = np.diff(*np.where(chng))
    corr = (-szs) % mxsz
    result = np.empty_like(idx)
    result[idx] = (np.arange(idx.size) + corr.cumsum().repeat(szs)) // mxsz
    return result
data = np.random.randint(0, 4, (20,))
result = package(data, 3)
print(f'group_size {data}')
print(f'batch_id {result}')
check = np.lexsort((data, result))
print('sorted:')
print(f'group_size {data[check]}')
print(f'batch_id {result[check]}')
Sample run with n=3; the last two lines of the output are the same as the first two, only sorted for easier checking:
group_size [1 1 0 1 2 0 2 2 2 3 1 2 3 2 1 0 1 0 2 0]
batch_id [3 3 1 3 6 1 6 5 6 7 2 5 7 5 2 1 2 0 4 0]
sorted:
group_size [0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3]
batch_id [0 0 1 1 1 2 2 2 3 3 3 4 5 5 5 6 6 6 7 7]
How it works:
1) sort data
2) detect where sorted data change to identify groups of equal values ("groups of group sizes")
3) determine the sizes of the groups of group sizes and, for each, calculate what is missing to reach a clean multiple of n
4) enumerate the sorted data, jumping to the next clean multiple of n at each switch to a new group of group sizes; we use (3) to do this in a vectorized fashion
5) floor divide by n to get the batch ids
6) shuffle back to original order
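As a quick usage check, here is a sketch applying the package function above to the question's group_size column (batch ids here start at 0 and their exact numbering can differ from the expected batch_id column, but the batching rules are satisfied):
import numpy as np

group_size = np.array([1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 1, 1, 1])
batch_id = package(group_size, 2)   # at most n=2 groups per batch
print(batch_id)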
I have data in one array, say of size [1x9]. I am generating random numbers 1 to 9 and shuffling them, and I want to arrange the data in that order.
# generating an array of number
BCI = tf.concat(0, [tf.fill([1,3],1),tf.fill([1,3],2),tf.fill([1,3],3)])
# making it in to 1x9
BCI1 = tf.reshape(BCI,[-1])
# generating random numbers with length of BCI and shuffling it
rn = tf.random_shuffle(tf.range(tf.shape(BCI1)[0]))
rna = tf.cast(rn,tf.int32)
# rearranging data
BCI2 = tf.gather(BCI1,rna)
print(sess.run(BCI1))
print(sess.run(rn))
print(sess.run(BCI2))
# output is
[1 1 1 2 2 2 3 3 3]
[3 5 0 2 6 1 4 8 7]
[2 2 1 3 1 2 1 3 3] # expected to be [2 2 1 1 3 1 2 3 3]
This is because I am not able to keep the rn value constant; every time I call sess.run it changes.
But I need the random values generated in 'rn' the first time, as I need them for testing on other data.
However many times I print rn, it should show the same values without regenerating them.
How can I do this?
I tried by importing random
n = tf.shape(BCI1)
rna = random.sample(list(range(n[0].eval())),9)
but it gives ValueError: Cannot evaluate tensor using eval(): No default session is registered. Use with sess.as_default() or pass an explicit session to eval(session=sess)
The tf.random_shuffle() op (and in general the other tf.random_*() ops) will generate new random values on each call to sess.run(). If you want to capture a particular value for a random tensor and use it in multiple calls to sess.run(), you should assign it to a tf.Variable. For example, you could restructure your program as follows to solve the problem:
# generating the array of numbers
BCI1 = tf.constant([1, 1, 1, 2, 2, 2, 3, 3, 3])
# shuffle the indices 0..8 once and store them in a Variable so they stay fixed across sess.run calls
rn = tf.Variable(tf.random_shuffle(tf.range(9)))
rna = tf.cast(rn, tf.int32)
# rearranging data
BCI2 = tf.gather(BCI1, rna)
sess.run(tf.global_variables_initializer())
print(sess.run(BCI1)) # ==> '[1 1 1 2 2 2 3 3 3]'
print(sess.run(rn)) # ==> '[2 8 3 0 1 4 6 5 7]'
print(sess.run(BCI2)) # ==> '[1 3 2 1 1 2 3 2 3]'
print(sess.run(BCI2)) # ==> '[1 3 2 1 1 2 3 2 3]'
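As a side note (a sketch, not part of the original answer): in TensorFlow 2.x with eager execution the shuffled indices are ordinary tensors evaluated once, so they stay fixed without a Variable or a session:
import tensorflow as tf

BCI1 = tf.constant([1, 1, 1, 2, 2, 2, 3, 3, 3])
rn = tf.random.shuffle(tf.range(tf.shape(BCI1)[0]))  # computed once, then reused
BCI2 = tf.gather(BCI1, rn)
print(rn.numpy(), BCI2.numpy())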
I tried to build a very simple SVM predictor that I could understand with my basic Python knowledge. As my code looks so different from this question and also this question, I don't know how I can find the most important features for SVM prediction in my example.
I have the following 'sample' containing features and class (status):
A B C D E F status
1 5 2 5 1 3 1
1 2 3 2 2 1 0
3 4 2 3 5 1 1
1 2 2 1 1 4 0
I saved the feature names as 'features':
A B C D E F
The features 'X':
1 5 2 5 1 3
1 2 3 2 2 1
3 4 2 3 5 1
1 2 2 1 1 4
And the status 'y':
1
0
1
0
Then I build X and y arrays out of the sample, train & test on half of the sample and count the correct predictions.
import pandas as pd
import numpy as np
from sklearn import preprocessing  # needed for preprocessing.scale below
from sklearn import svm
X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
y = sample['status'].values.tolist()
y = np.array(y)
test_size = int(X.shape[0]/2)
clf = svm.SVC(kernel="linear", C= 1)
clf.fit(X[:-test_size],y[:-test_size])
correct_count = 0
for x in range(1, test_size+1):
    if clf.predict(X[-x].reshape(-1, len(features)))[0] == y[-x]:
        correct_count += 1
accuracy = (float(correct_count)/test_size) * 100.00
My problem now is that I have no idea how I could adapt the code from the questions above so that I could also see which features are the most important.
I would be grateful if you could tell me whether that's even possible for my simple version. And if yes, any tips on how to do it would be great.
Yes, it is possible for a linear kernel: the trained model exposes a weight vector, and the features with the largest squared weights contribute most to the decision function, so those are the ones of high importance (this squared-weight criterion is also what SVM recursive feature elimination ranks by).
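A minimal sketch of reading those weights from the model trained above (assuming clf was fit with kernel="linear" and features holds the six column names):
import numpy as np

# for a binary linear SVM, clf.coef_ has shape (1, n_features);
# a larger absolute (or squared) weight means more influence on the decision
importance = np.abs(clf.coef_[0])
for name, weight in sorted(zip(features, importance), key=lambda t: t[1], reverse=True):
    print(f"{name}: {weight:.3f}")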