Weighted random numbers in Python from a list of values

I am trying to create a list of 10,000 random numbers between 1 and 1000. But I want 80-85% of the numbers to come from the same small pool (I mean some 100 numbers out of these should appear about 80% of the time in the list of random numbers) and the rest to appear around 15-20% of the time. Any idea if this can be done in Python/NumPy/SciPy? Thanks.

This can be easily done using one call to random.randint() to select a list and another call to random.choice() on the selected list. I'll assume the list frequent contains the 100 elements to be chosen 80 percent of the time and rare contains the 900 elements to be chosen 20 percent of the time.
import random

a = random.randint(1, 5)
if a == 1:
    # case for rare numbers (1 in 5, i.e. 20% of the time)
    choice = random.choice(rare)
else:
    # case for frequent numbers (80% of the time)
    choice = random.choice(frequent)

Here's an approach -
import numpy as np

a = np.arange(1, 1001)  # input array to extract numbers from
# Select 100 random unique numbers from the input array and also store the leftovers
p1 = np.random.choice(a, size=100, replace=False)
p2 = np.setdiff1d(a, p1)
# Get random indices for indexing into p1 and p2
p1_idx = np.random.randint(0, p1.size, 8000)
p2_idx = np.random.randint(0, p2.size, 2000)
# Index, concatenate and shuffle the positions
out = np.random.permutation(np.hstack((p1[p1_idx], p2[p2_idx])))
Let's verify after a run -
In [78]: np.in1d(out, p1).sum()
Out[78]: 8000
In [79]: np.in1d(out, p2).sum()
Out[79]: 2000
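
A single vectorized alternative, for what it's worth: build a per-element probability vector and pass it to np.random.choice through its p argument. A minimal sketch, assuming the same 80/20 split over 100 frequent and 900 rare values:

import numpy as np

a = np.arange(1, 1001)
prob = np.full(a.size, 0.20 / 900)  # the 900 rare values share 20%
frequent_idx = np.random.choice(a.size, size=100, replace=False)
prob[frequent_idx] = 0.80 / 100     # the 100 frequent values share 80%
out = np.random.choice(a, size=10000, p=prob)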

Related

How to shuffle my array with no repetitions?

I'm trying to create an array of 256 stimuli representing the frequency values to input into my sound stimuli. So far I have created an array of 4 numbers representing the 4 different frequency levels for my audio tones:
# Pitch list - create an array from 1 to 4 repeated for 256 stimuli
pitch_list = [1, 2, 3, 4]
new_pitch_list = np.repeat(pitch_list, 64)
random.shuffle(new_pitch_list)
print(new_pitch_list)
# Replace 1-4 integers in new_pitch_list with frequency values
for x in range(len(new_pitch_list)):
    if new_pitch_list[x] == 1:
        new_pitch_list[x] = 500
    elif new_pitch_list[x] == 2:
        new_pitch_list[x] = 625
    elif new_pitch_list[x] == 3:
        new_pitch_list[x] = 750
    else:
        new_pitch_list[x] = 875
My code works for randomly producing an array of 256 numbers drawn from 4 possibilities (500, 625, 750, 875). However, my problem is that I need to create new_pitch_list so that no two consecutive entries are the same. I need this so the frequency of the audio tones is never identical for consecutive audio tones.
I understand that I may need to change the way I use the random.shuffle function; however, I'm not sure if I also need to change my for loop to make this work.
So far I have tried to replace the random.shuffle function with the random.choice function, but I'm not sure if I'm going in the wrong direction. Because I'm still fairly new to Python coding, I'm not sure if I can solve this problem without having to change my for loop, so any help would be greatly appreciated!
I would make it so that you populate your array with 3 of your 4 values, and then each time you see consecutive duplicate values you replace the second one with the 4th value. Something like this (untested, but you get the gist).
Also - I'd cut out some of the lines you don't need:
new_pitch_list = np.repeat([500, 625, 750], 64)
random.shuffle(new_pitch_list)
print(new_pitch_list)
# Replace the second of any two consecutive duplicates with the 4th value
for x in range(1, len(new_pitch_list)):
    if new_pitch_list[x - 1] == new_pitch_list[x]:
        new_pitch_list[x] = 875
After you assign each value, remove it from the list of choices before the next random.choice(), then put it back:
import random

pitches = [500, 625, 750, 875]
last_pitch = random.choice(pitches)
new_pitch_list = [last_pitch]
for _ in range(255):
    pitches.remove(last_pitch)
    pitch = random.choice(pitches)
    new_pitch_list.append(pitch)
    pitches.append(last_pitch)
    last_pitch = pitch
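
Note that both approaches above can disturb the exact 64-per-pitch balance (replacing a duplicate with 875 adds extra 875s, and free drawing need not produce 64 of each). A possible alternative that keeps exactly 64 copies of each value: draw the next pitch weighted by its remaining count, never repeating the previous one, and restart on the rare dead end. no_repeat_sequence is a hypothetical helper, not from the answers above:

import random

def no_repeat_sequence(values, reps, max_tries=1000):
    # Build a sequence with exactly `reps` copies of each value and no two
    # equal neighbours: each draw is weighted by the remaining counts and
    # excludes the previous value; on a dead end, start over.
    for _ in range(max_tries):
        counts = {v: reps for v in values}
        seq = []
        while len(seq) < len(values) * reps:
            last = seq[-1] if seq else None
            pool = [v for v in values if v != last for _ in range(counts[v])]
            if not pool:
                break  # only the previous value remains - retry from scratch
            pick = random.choice(pool)
            counts[pick] -= 1
            seq.append(pick)
        else:
            return seq
    raise RuntimeError("no valid sequence found")

new_pitch_list = no_repeat_sequence([500, 625, 750, 875], 64)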

Add to an array using a formula based on the last entry, n number of times

I'm attempting to build a list from a known starting value, appending new entries based on a formula that uses the last number in the list. I'd like to do this a specified number of times. So for example:
The list starts at 5
The formula I use = the last number in the list x 2
The new entry to the list is 10
The next new entry is 20
My non-working code is below:
mean = 198
standard_deviation = 85
values = [mean - standard_deviation * 3]  # avoid shadowing the built-in list
values.append(values[-1] + standard_deviation * 0.1)
print(values)
[-55.930192782739596, -47.4592697321513]
I'd like to be able to tell the array to stop after 30 entries.
values = [mean - standard_deviation * 3]
for n in range(29):  # 29 appends after the initial entry gives 30 entries
    values.append(values[-1] + standard_deviation * 0.1)
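
Since the update rule only ever adds the same constant, the whole 30-entry list is an arithmetic progression and can also be built in one vectorized step; a minimal sketch (for multiplicative rules like the doubling example, np.cumprod would play the analogous role):

import numpy as np

mean = 198
standard_deviation = 85
start = mean - standard_deviation * 3
# entry n equals start + n * step, so one arange call produces all 30 entries
values = start + standard_deviation * 0.1 * np.arange(30)
print(values.tolist())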

Distribute a number of elements randomly but weighted onto a list, which should only contain integers

I have a list called employees_summed with len(employees_summed) = 9081, which contains numbers of employees (per company).
My goal is to distribute an amount of elements bev_calc, say bev_calc = 2000, onto this list, according to how many employees each company has.
I tried using numpy.random.choice, but it doesn't distribute a given number; it returns a weighted sample of the list itself:
from numpy.random import choice

n = sum(employees_summed)
percentage_list = [x / n for x in employees_summed]
weightedList = choice(employees_summed, len(employees_summed), percentage_list)
The goal is to do something like the following:
bev_company = []
percentage = bev_calc / n
for i in employees_summed:
    cars_per_building = percentage * i
    bev_company.append(cars_per_building)
But due to the length of employees_summed the returned numbers are mostly < 1, and rounding would lose too much of bev_calc, since most values would round to 0.
Is there any way to do this so that bev_company has integer values that add up to roughly 2000 - a random distribution weighted by how many employees a company has?
Not an efficient method, but I think the following code will do what you're looking for:
import numpy as np

np.random.seed(0)
companies = 10
employees = np.random.randint(5, 100, size=(companies,))
cars = 100
# make an initial guess of the car distribution and floor its values
guess = employees / employees.sum() * cars
floor_guess = np.floor(guess)
# if the floored values don't add up to the number of cars, scale the
# distribution so that the company with the closest value is rounded up
while floor_guess.sum() != cars:
    distances_up = guess % 1
    closest = np.argmax(distances_up)
    # if two or more companies are the same distance from their ceiling,
    # choose one at random
    if np.sum(distances_up == distances_up[closest]) > 1:
        idx = np.random.choice(np.where(distances_up == distances_up[closest])[0])
        guess[idx] = np.ceil(guess[idx])
        floor_guess = np.floor(guess)
        continue
    scale = np.ceil(guess[closest]) / guess[closest]
    guess *= scale
    floor_guess = np.floor(guess)
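
If a randomized allocation (rather than the deterministic rounding above) is acceptable, numpy can also do this in a single call: np.random.multinomial distributes bev_calc draws over bins weighted by employee share, returning integer counts that sum to exactly bev_calc. A sketch with toy stand-in data:

import numpy as np

employees_summed = [12, 7, 30, 5, 46]  # toy stand-in for the real 9081-element list
bev_calc = 2000
n = sum(employees_summed)
weights = [e / n for e in employees_summed]
bev_company = np.random.multinomial(bev_calc, weights)
print(bev_company, bev_company.sum())  # integer counts per company, total 2000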

enumerate in dictionary loop takes a long time - how to improve the speed

I am using Python 3.x and I would like to speed up my code. In every loop I create new values and check whether they exist in the dictionary; if a value exists, I keep the index where it is found. I am using enumerate, but it takes a long time. Is there another way to speed this up, or is enumerate the only way to do this? I am also not sure whether using numpy would be better in my case.
Here is my code:
# import numpy
import numpy as np

# my first array
my_array_1 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)
# here I want to find the unique values from my_array_1
indx = np.unique(my_array_1, return_index=True, return_counts=True, axis=0)
# then save the result to a dictionary
dic_t = {"my_array_uniq": indx[0],  # unique values in my_array_1
         "counts": indx[2]}         # how many times each unique element appears in my_array_1
# here I want to create a random array 100 times
for i in range(100):
    print(i)
    # my 2nd array
    my_array_2 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)
    # I would like to check whether the values in my_array_2 exist in the
    # dictionary (my_array_uniq = indx[0]).
    # If a value exists, hold the index where it is found in the dictionary and
    # add 1 to dic_t["counts"], meaning this value appeared again; count how many times.
    # If it does not exist, add this value to the dictionary (my_array_uniq)
    # and also add 1 to dic_t["counts"].
    for j, a in enumerate(my_array_2):
        ix = [k for k, u in enumerate(dic_t["my_array_uniq"]) if (a == u).all()]
        if ix:
            print(50 * "*", j, "Yes", "at", ix[0])
            dic_t["counts"][ix[0]] += 1
        else:
            # print(50 * "*", j, "No")
            dic_t["counts"] = np.hstack((dic_t["counts"], 1))
            dic_t["my_array_uniq"] = np.vstack((dic_t["my_array_uniq"], a))
explanation:
1- I will create an initial array.
2- Then I want to find the unique values, indices and counts of the initial array using np.unique.
3- Save the result to the dictionary (dic_t).
4- Then I want to start the loop, creating random values 100 times.
5- I would like to check whether these random values in my_array_2 exist in the dictionary (my_array_uniq = indx[0]).
6- If one of them exists, I want to hold the index of that value in the dictionary.
7- Add 1 to dic_t["counts"], meaning this value appears again; count how many times.
8- If it does not exist, add this value to the dictionary as a new unique value (my_array_uniq).
9- Also add 1 to dic_t["counts"].
So from what I can see you are:
Creating 256 random numbers from a linear distribution of numbers between -1000 and 1000
Generating 100 triplets from those (it could be fewer than 100 due to unique, but with overwhelming probability it will be exactly 100)
Then doing pretty much the same thing 100 times, each time checking for each of the triplets in the new list whether they exist in the old list.
You're then trying to get a count of how often each element occurs.
I'm wondering why you're trying to do this, because it doesn't make much sense to me, but I'll give a few pointers:
There's no reason to make a dictionary dic_t if you're only going to hold two objects in it; just use two variables my_array_uniq and counts.
You're dealing with triplets of floating point numbers. In the given range, that should give you about 10^48 different possible triplets (I may be wrong on the exact number, but it's an absurdly large number either way). The way you're generating them does reduce the total phase space a fair bit, but nowhere near enough. The probability of finding identical ones is very, very low.
If you have a set of objects (in this case number triplets) and you want to determine whether you have seen a given one before, you want to use sets. Sets can only contain immutable objects, so you want to turn your triplets into tuples. Determining whether a given triplet is already contained in your set is then an O(1) operation. (The sketch below combines this with counting.)
For counting the number of occurrences of something, collections.Counter is the natural data structure to use.
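
Putting those pointers together, a minimal sketch of the tuple/Counter approach: converting each triplet to a tuple makes it hashable, and collections.Counter then handles both the membership test and the counting in O(1) per triplet:

from collections import Counter
import numpy as np

values = np.linspace(-1000, 1000, 2 ** 8)
counts = Counter(map(tuple, np.random.choice(values, size=(100, 3))))
for _ in range(100):
    new_triplets = np.random.choice(values, size=(100, 3))
    counts.update(map(tuple, new_triplets))  # O(1) lookup and update per triplet
# counts now maps each distinct triplet to how many times it has appeared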

Parallel algorithm for set splitting

I'm trying to solve a set-splitting problem.
The input data is a list and an integer.
The task is to divide a set into N-element subsets whose sums of elements are almost equal. As this is an NP-hard problem, I tried two approaches:
a) iterate over all possibilities and distribute them using mpi4py to many machines (for a list above 100 elements and 20-element subsets this runs too long)
b) using mpi4py, send the list to nodes with different seeds, but in this case I potentially calculate the same set many times. For instance, with 100 numbers and 5 subsets of 20 elements each, in 60 s my result could easily be beaten by a human simply looking at the table.
Finally, I'm looking for a heuristic algorithm that could run on a distributed system and create N-element subsets of a bigger set whose sums are almost equal.
a = list(range(1, 13))
k = 3
One possible solution:
[1,2,11,12] [3,4,9,10] [5,6,7,8]
because the sums are 26, 26, 26.
It is not always possible to create exactly equal sums or numbers of elements. The difference between the maximum and minimum number of elements in the sets can be 0 (if len(a)/k is an integer) or 1.
edit 1:
I investigated two options: 1. The parent generates all iterations and then sends them to the parallel algorithm (but this is slow for me). 2. The parent sends the list, and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result to the parent. The parent receives these results and chooses the one that minimizes the difference between the subset sums. I think the second option has the potential to be faster.
Best regards,
Szczepan
I think you're trying to do something more complicated than necessary - do you actually need an exact solution (global optimum)? Regarding the heuristic solution, I had to do something along these lines in the past so here's my take on it:
Reformulate the problem as follows: You have a vector with given mean ('global mean') and you want to break it into chunks such that means of each individual chunk will be as close as possible to the 'global mean'.
Just divide it into chunks randomly and then iteratively swap elements between the chunks until you get acceptable results. You can experiment with different ways to do it; here I'm just reshuffling the elements of the chunks with the minimum and maximum 'chunk-mean'.
In general, the bigger the chunk is, the easier it becomes, because the first random split would already give you not-so-bad solution (think sample means).
How big is your input list? I tested this with a 100000-element input (uniformly distributed integers). With 50 chunks of 2000 elements you get the result instantly; with 2000 chunks of 50 elements you need to wait < 1 min.
import numpy as np

my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1
split = np.array_split(my_numbers, chunks)  # a list of arrays, so pop() works below
for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split])  # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    mean_diff = np.ptp(split_means)
    if i % 100 == 0 or mean_diff <= acceptable_range:
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    # pop the higher index first so the lower index stays valid
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff + 1
    while reshuffle_range > mean_diff:
        # this while just ensures you're not getting a worse split - either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.ptp([array.mean() for array in modified_arrays])
    split += modified_arrays
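
As a cheap serial baseline (or a better-than-random seed for the swapping loop above), the classic largest-first greedy rule already gets close on small inputs, constrained here so subset sizes differ by at most one. greedy_split is a hypothetical helper, not part of the answer above:

def greedy_split(numbers, k):
    # assign each number, largest first, to the subset with the smallest
    # running sum among the subsets that still have room
    numbers = sorted(numbers, reverse=True)
    cap = -(-len(numbers) // k)  # ceil(len/k): maximum elements per subset
    subsets = [[] for _ in range(k)]
    sums = [0] * k
    for x in numbers:
        i = min((j for j in range(k) if len(subsets[j]) < cap),
                key=lambda j: sums[j])
        subsets[i].append(x)
        sums[i] += x
    return subsets, sums

subsets, sums = greedy_split(range(1, 13), 3)
print(sums)  # [26, 26, 26] for the 12-number example above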
