How can I reproduce data in numpy with random.choice?

I have a labeled dataset:
data = np.array([5.2, 4, 5, 2, 5.3, 10, 0])
labels = np.array([1, 0, 1, 2, 1, 3, 4])
I want to pick the data 5.2, 5 and 5.3 with the label 1 and reproduce it, as follows:
datalabel1 = data[(labels == 1)]
Then I want to do a random.choice(), for example (pseudo):
# indices are the indices from label 1
random_choices = np.random.choice(indices, size = 5)
And get as output different values with different indices:
# indices are the different indices of the data from the pool out of random choice
data: [5.3 5.2 5.2 5.2 5]
indices: [4 0 0 2 2]
My goal is to sample from the pool of data that has label 1.

labels == 1 is a boolean mask. You need to apply it to data, not back to labels, to get the data elements labeled 1:
np.random.choice(data[labels == 1], ...)
You can also convert labels == 1 to a set of indices and choose on those before indexing:
indices = np.flatnonzero(labels == 1)
data[np.random.choice(indices, ...)]
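The index-based variant can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: it assumes NumPy's newer Generator API (np.random.default_rng) with an arbitrary seed, and the names pool, picked_indices and picked_data are made up for illustration.

```python
import numpy as np

data = np.array([5.2, 4, 5, 2, 5.3, 10, 0])
labels = np.array([1, 0, 1, 2, 1, 3, 4])

# Seeded generator so the draw is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

# Positions in `data` whose label is 1 -> array([0, 2, 4])
pool = np.flatnonzero(labels == 1)

# Draw 5 positions from the pool, with replacement (the default).
picked_indices = rng.choice(pool, size=5)
picked_data = data[picked_indices]

print("data:", picked_data)
print("indices:", picked_indices)
```

Choosing on the indices rather than on data[labels == 1] keeps both pieces of information the question asks for: the sampled values and the positions they came from.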

Related

If y is a pandas Series of 0s and 1s, what do y.values==0,1 and y.values==0,0 mean?

y= pd.Series([0,1,0,1,1,0])
The code below uses this and I am stuck on this point. What does y.values==0,0 mean, and how do the other combinations differ from one another?
plt.figure(dpi=120)
plt.scatter(pca[y.values==0,0], pca[y.values==0,1], alpha=0.5, label='Edible', s=2)
plt.scatter(pca[y.values==1,0], pca[y.values==1,1], alpha=0.5, label='Poisonous', s=2)
plt.legend()

Suppose the following numpy array pca and Series y:
import pandas as pd
import numpy as np
pca = np.arange(0, 12).reshape(-1, 2)
y = pd.Series([0, 1, 0, 1, 1, 0])
# pca
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
# y
0 0
1 1
2 0
3 1
4 1
5 0
dtype: int64
To get elements from a 2D array, you have to pass the coordinates of rows and columns you want to get:
# Get rows from pca where y==0 and get the first column (0)
>>> pca[y.values==0, 0] # or pca[y==0, 0]
array([ 0, 4, 10])
# Get rows from pca where y==0 and get the second column (1)
>>> pca[y.values==0, 1] # or pca[y==0, 1]
array([ 1, 5, 11])
# The same applies to the other scatter call.
Instead of passing the selected rows explicitly, here you are using a boolean mask y==0. It returns another Series with the same length as y, holding boolean values:
>>> y == 0 # Original
0 True # 0
1 False # 1
2 True # 0
3 False # 1
4 False # 1
5 True # 0
dtype: bool
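To make the link between the mask and explicit row numbers concrete, here is a small sketch. It assumes a plain NumPy array y_values standing in for y.values, so it runs without pandas; the variable names are illustrative only.

```python
import numpy as np

pca = np.arange(12).reshape(-1, 2)
y_values = np.array([0, 1, 0, 1, 1, 0])  # stands in for y.values

# Boolean mask: True where the label is 0.
mask = y_values == 0

# The same selection written as explicit row numbers -> array([0, 2, 5])
rows = np.flatnonzero(mask)

# Both forms select rows 0, 2 and 5, then take column 0.
first_col_by_mask = pca[mask, 0]
first_col_by_rows = pca[rows, 0]

print(first_col_by_mask)
print(first_col_by_rows)
```

So pca[y.values==0, 0] reads as "rows where y is 0, column 0", which is what the first scatter call uses for its x coordinates.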

Count the occurrences of each element in a numpy array where the elements are elementwise equal to another array?

I have two arrays like
[2,2,0,1,1,1,2] and [2,2,0,1,1,1,0]
I need to count (e.g. with bincount) the occurrences of each element in the first array at the positions where it equals the element in the second array.
So in this case, we get [1,3,2], because 0 occurs once in the same position of the arrays, 1 occurs three times in the same positions and 2 occurs twice in the same positions.
I tried this, but it did not give the desired result:
np.bincount(a[a==b])
Can someone help me?

You must first convert your lists to NumPy arrays:
import numpy as np
a = np.array([2,2,0,1,1,1,2])
b = np.array([2,2,0,1,1,1,0])
np.bincount(a, weights=(a==b)) # array([1., 3., 2.]) (weights make the result float)
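If integer counts are preferred, one possible variant is to filter first and pass minlength, so that values which never match still get a zero bin. This is a sketch under that assumption, not part of the answer above; n_bins and counts are illustrative names.

```python
import numpy as np

a = np.array([2, 2, 0, 1, 1, 1, 2])
b = np.array([2, 2, 0, 1, 1, 1, 0])

# One bin per possible value, even if the largest values never match.
n_bins = max(a.max(), b.max()) + 1

# Keep only positions where both arrays agree, then count the values.
counts = np.bincount(a[a == b], minlength=n_bins)

print(counts)
```

Without minlength, the result is shorter than expected whenever the largest values have no matching positions, because bincount only allocates bins up to the largest value it actually sees.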

from datatable import dt, f, by
df = dt.Frame(
col1=[2, 2, 0, 1, 1, 1, 2],
col2=[2, 2, 0, 1, 1, 1, 0]
)
df['equal'] = dt.ifelse(f.col1 == f.col2, 1, 0)
df_sub = df[:, {"sum": dt.sum(f.equal)}, by('col1')]
yourlist = df_sub['sum'].to_list()[0]
yourlist
[1, 3, 2]

array_1 = np.array([2,2,0,1,1,1,2])
array_2 = np.array([2,2,0,1,1,1,0])
# set up a bins array for the results:
if array_1.max() > array_2.max():
    bins = np.zeros(array_1.max()+1)
else:
    bins = np.zeros(array_2.max()+1)
# fill the bin values:
for val1, val2 in zip(array_1, array_2):
    if val1 == val2:
        bins[val1] += 1
# convert bins to a list with int values
bins = bins.astype(int).tolist()
And the results:
[1, 3, 2]

Pick lines with highest values from np.zeros

I have the following data structure:
Pl = np.zeros((7,2,7))
Pl[0,0,1]=1
Pl[1,0,2]=1
Pl[2,0,3]=1
Pl[5,0,6]=0.9
Pl[5,0,5]=0.1
...
Pl[5,1,4]=1
For a specified first index, how can I get the entry that has the highest assigned value?
For example for x=5, I want to get Pl[5,1,4]. I have seen max but I can't specify the value of x.
Thank you!

import numpy as np
Pl = np.zeros((7, 2, 7))
Pl[0, 0, 1] = 1
Pl[1, 0, 2] = 1
Pl[2, 0, 3] = 1
Pl[5, 0, 6] = 0.9
Pl[5, 0, 5] = 0.1
Pl[5, 1, 4] = 1
res = np.array([5, *np.unravel_index(Pl[5].argmax(), Pl[5].shape)])
# array([5, 1, 4])
Here, Pl[5].argmax() returns the position of the maximal value in the flattened Pl[5]. That is a 1D integer index, but you can convert it to a 2D index into Pl[5] with np.unravel_index. Finally, the index along the zeroth dimension is missing; we know it is 5, so prepend it and return the array.
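The same steps can be wrapped in a small helper for any first index x. This is only a sketch; argmax_index_for is a hypothetical name, not a NumPy function.

```python
import numpy as np

def argmax_index_for(arr, x):
    """Return the full index of the largest entry in arr[x] (hypothetical helper)."""
    sub = arr[x]
    # Position of the maximum in the flattened sub-array.
    flat = sub.argmax()
    # Convert the flat position back to per-axis indices and prepend x.
    return (x, *np.unravel_index(flat, sub.shape))

Pl = np.zeros((7, 2, 7))
Pl[5, 0, 6] = 0.9
Pl[5, 0, 5] = 0.1
Pl[5, 1, 4] = 1

idx = argmax_index_for(Pl, 5)
best = Pl[idx]  # indexing with the tuple returns the maximal value itself

print(idx, best)
```

If several entries share the maximum, argmax returns the first one in flattened (row-major) order.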

How to randomly throw numbers in a 2D dimensional board

I have a 50x50 2D board with empty cells. I want to fill 20% of the cells with 0, 30% with 1, 30% with 2 and 20% with 3. How can I randomly place these 4 numbers on the board with these percentages?
import numpy as np
from numpy import random
dim = 50
map = [[" "for i in range(dim)] for j in range(dim)]
print(map)

One way to get this kind of randomness would be to start with a random permutation of the numbers from 0 to the total number of cells you have minus one.
perm = np.random.permutation(2500)
Now split the permutation according to the proportions you want and treat the entries of the permutation as indices into the array.
array = np.empty(2500)
p1 = int(0.2*2500)
p2 = int(0.3*2500)
p3 = int(0.3*2500)
array[perm[range(0, p1)]] = 0
array[perm[range(p1, p1 + p2)]] = 1
array[perm[range(p1 + p2, p1 + p2 + p3)]] = 2
array[perm[range(p1 + p2 + p3, 2500)]] = 3
array = array.reshape(50, 50)
This way you ensure the proportions for each number.
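The permutation idea can also be written without hand-computed range boundaries. The sketch below is one way to do it, assuming the counts sum to the number of cells; fill_by_permutation is a hypothetical helper name and the seed is arbitrary.

```python
import numpy as np

def fill_by_permutation(n_cells, values, counts, rng):
    """Assign values[i] to counts[i] randomly chosen cells (hypothetical helper).

    Assumes sum(counts) == n_cells.
    """
    perm = rng.permutation(n_cells)
    flat = np.empty(n_cells, dtype=int)
    # Cut the permutation into consecutive chunks, one chunk per value.
    chunks = np.split(perm, np.cumsum(counts)[:-1])
    for value, chunk in zip(values, chunks):
        flat[chunk] = value
    return flat

rng = np.random.default_rng(0)
flat = fill_by_permutation(2500, [0, 1, 2, 3], [500, 750, 750, 500], rng)
board = flat.reshape(50, 50)

print(np.unique(board, return_counts=True))
```

Because every cell index appears exactly once in the permutation, every cell is written exactly once and the counts are exact.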

Since the percentages sum up to 1, you can start with a board of zeros
bsize = 50
board = np.zeros((bsize, bsize))
In this approach the board positions are interpreted as 1D positions, so we need a set of positions covering 80% of all positions.
for i, pos in enumerate(np.random.choice(bsize**2, int(0.8*bsize**2), replace=False)):
    # the first 30% will be set to 1
    if i < int(0.3*bsize**2):
        board[pos//bsize][pos%bsize] = 1
    # the second 30% (between 30% and 60%) will be set to 2
    elif i < int(0.6*bsize**2):
        board[pos//bsize][pos%bsize] = 2
    # the remaining 20% (between 60% and 80%) will be set to 3
    else:
        board[pos//bsize][pos%bsize] = 3
At the end, the last 20% of positions remain zeros.
As suggested by @alexis in the comments, this approach can be simplified by using the shuffle function from the random module:
from random import shuffle
bsize = 50
board = np.zeros((bsize, bsize))
l = list(range(bsize**2))
shuffle(l)
for i, pos in enumerate(l):
    # the first 30% will be set to 1
    if i < int(0.3*bsize**2):
        board[pos//bsize][pos%bsize] = 1
    # the second 30% (between 30% and 60%) will be set to 2
    elif i < int(0.6*bsize**2):
        board[pos//bsize][pos%bsize] = 2
    # the next 20% (between 60% and 80%) will be set to 3
    elif i < int(0.8*bsize**2):
        board[pos//bsize][pos%bsize] = 3
The last 20% of positions will remain as zeros again.

A different approach (admittedly it's probabilistic, so you won't get exact proportions, unlike the solution proposed by Brad Solomon):
import numpy as np
res = np.random.random((50, 50))
zeros = np.where(res <= 0.2, 0, 0)
ones = np.where(np.logical_and(res <= 0.5, res > 0.2), 1, 0)
twos = np.where(np.logical_and(res <= 0.8, res > 0.5), 2, 0)
threes = np.where(res > 0.8, 3, 0)
final_result = zeros + ones + twos + threes
Running
np.unique(final_result, return_counts=True)
yielded
(array([0, 1, 2, 3]), array([499, 756, 754, 491]))
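The same probabilistic idea can be expressed in a single call by drawing each cell independently with the given probabilities. This is a sketch assuming NumPy's Generator API with an arbitrary seed; like the code above, it gives approximate rather than exact proportions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each cell is drawn independently: P(0)=0.2, P(1)=0.3, P(2)=0.3, P(3)=0.2.
board = rng.choice([0, 1, 2, 3], size=(50, 50), p=[0.2, 0.3, 0.3, 0.2])

values, counts = np.unique(board, return_counts=True)
print(values, counts)
```

The counts will fluctuate around 500, 750, 750 and 500 from run to run, which is the trade-off for the shorter code.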

Here's an approach with np.random.choice to shuffle indices, then filling those indices with repeats of the inserted ints. It will fill the array in the exact proportions that you specify:
import numpy as np
np.random.seed(444)
board = np.zeros(50 * 50, dtype=np.uint8).flatten()
# The "20% cells with 0" can be ignored since that is the default.
#
# This will work as long as the proportions are "clean" ints
# (I.e. mod to 0; 2500 * 0.2 is a clean 500. Otherwise, need to do some rounding.)
rpt = (board.shape[0] * np.array([0.3, 0.3, 0.2])).astype(int)
repl = np.repeat([1, 2, 3], rpt)
idx = np.random.choice(board.shape[0], size=repl.size, replace=False)
board[idx] = repl
board = board.reshape((50, 50))
Resulting frequencies:
>>> np.unique(board, return_counts=True)
(array([0, 1, 2, 3], dtype=uint8), array([500, 750, 750, 500]))
>>> board
array([[1, 3, 2, ..., 3, 2, 2],
[0, 0, 2, ..., 0, 2, 0],
[1, 1, 1, ..., 2, 1, 0],
...,
[1, 1, 2, ..., 2, 2, 2],
[1, 2, 2, ..., 2, 1, 2],
[2, 2, 2, ..., 1, 0, 1]], dtype=uint8)
Approach
Flatten the board. Easier to work with indices when the board is (temporarily) one-dimensional.
rpt is a 1d vector of the number of repeats per int. It gets "zipped" together with [1, 2, 3] to create repl, which is length 2000. (80% of the size of the board; you don't need to worry about the 0s in this example.)
The indices of the flattened array are effectively shuffled (idx), and the length of this shuffled array is constrained to the size of the replacement candidates. Lastly, those indices in the 1d board are filled with the replacements, after which it can be made 2d again.
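The code comment above notes that some rounding is needed when the proportions do not give whole numbers. One possible way to handle that is largest-remainder rounding, sketched below. counts_from_fractions is a hypothetical helper, the 7x7 board is just an example where 20% and 30% are not whole numbers, and the fill step uses repeat-then-shuffle rather than the np.random.choice indexing shown above.

```python
import numpy as np

def counts_from_fractions(n_cells, fractions):
    """Integer counts that sum to n_cells (hypothetical helper).

    Assumes the fractions sum to 1. Uses largest-remainder rounding.
    """
    exact = np.asarray(fractions, dtype=float) * n_cells
    counts = np.floor(exact).astype(int)
    shortfall = n_cells - counts.sum()
    # Give the leftover cells to the entries with the largest fractional parts.
    order = np.argsort(exact - counts)[::-1]
    counts[order[:shortfall]] += 1
    return counts

rng = np.random.default_rng(444)
side = 7
counts = counts_from_fractions(side * side, [0.2, 0.3, 0.3, 0.2])

# Build the exact multiset of values, shuffle it, then make it 2D.
flat = np.repeat([0, 1, 2, 3], counts)
rng.shuffle(flat)
board = flat.reshape(side, side)

print(counts, counts.sum())
```

Each count ends up within one cell of its exact share, and the total always matches the board size.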

Grouping elements of a numpy array using an array of group counts

Given two arrays, one representing a stream of data, and another representing group counts, such as:
import numpy as np
# given group counts: 3 4 3 2
# given flattened data:[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ]
group_counts = np.array([3,4,3,2])
data = np.arange(group_counts.sum()) # placeholder data, real life application will be a very large array
I want to generate matrices based on the group counts for the streamed data, such as:
target_count = 3 # I want to make a matrix of all data items whose group_counts = target_count
# Expected result
# [[ 0 1 2]
# [ 7 8 9]]
To do this I wrote the following:
# Find all matches
match = np.where(group_counts == target_count)[0]
i1 = np.cumsum(group_counts)[match] # end index for slicing
i0 = i1 - group_counts[match] # start index for slicing
# Prep the blank matrix and fill with results
matched_matrix = np.empty((match.size, target_count), dtype=data.dtype)
# Is it possible to get rid of this loop?
for i in range(match.size):
    matched_matrix[i] = data[i0[i]:i1[i]]
matched_matrix
# Result: array([[0, 1, 2],
#                [7, 8, 9]])
This works, but I would like to get rid of the loop and I can't figure out how.
Doing some research I did find numpy.split and numpy.array_split:
match = np.where(group_counts == target_count)[0]
match = np.array(np.split(data, np.cumsum(group_counts)), dtype=object)[match]
# Result: array([array([0, 1, 2]), array([7, 8, 9])], dtype=object) #
But numpy.split produces a list of dtype=object that I have to convert.
Is there an elegant way to produce the desired result without a loop?

You can repeat group_counts so it has the same size as data, then filter and reshape based on the target:
group_counts = np.array([3,4,3,2])
data = np.arange(group_counts.sum())
target = 3
data[np.repeat(group_counts, group_counts) == target].reshape(-1, target)
#array([[0, 1, 2],
# [7, 8, 9]])
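To see why this works, it helps to look at the intermediate array and to compare the result with a loop-based reference. The sketch below does both; size_per_item, vectorized and looped are illustrative names.

```python
import numpy as np

group_counts = np.array([3, 4, 3, 2])
data = np.arange(group_counts.sum())
target = 3

# One entry per data item, holding the size of the group that item belongs to.
size_per_item = np.repeat(group_counts, group_counts)
# -> [3 3 3 4 4 4 4 3 3 3 2 2]

# Keep items from groups of the target size; each such group fills one row.
vectorized = data[size_per_item == target].reshape(-1, target)

# Loop-based reference: slice each group out and keep those of the target size.
ends = np.cumsum(group_counts)
starts = ends - group_counts
looped = np.array(
    [data[s:e] for s, e, c in zip(starts, ends, group_counts) if c == target]
)

print(vectorized)
print(np.array_equal(vectorized, looped))
```

The reshape is safe because every selected group contributes exactly target consecutive items, so the filtered array's length is a multiple of target.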
