I'm creating a python script that randomly picks 1000 names from the list of male first names located here: http://www.census.gov/genealogy/www/data/1990surnames/names_files.html
That works all fine and dandy, but I would like the names to be selected based on the probability column provided in the census text files (second column).
I've been trying to wrap my head around this for the past few hours, but I haven't made any real progress, even after looking through other answers.
Can anybody help me out or point me in the right direction? Thanks in advance :)
An easy algorithm for weighted selection is:
Assign each name its relative probability, such that the sum of all probabilities is 1. This relative value is called "weight".
Select a random number between 0 and 1
Walk the list, subtracting the weight of each item from your number as you go.
When you get to 0 or below, pick the current item.
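A minimal sketch of that walk (the function name is mine, and it assumes the weights have already been normalized to sum to 1):

import random

def weighted_pick(names, weights):
    # subtract each weight from a uniform draw; pick where it crosses zero
    r = random.random()  # uniform in [0, 1)
    for name, weight in zip(names, weights):
        r -= weight
        if r <= 0:
            return name
    return names[-1]  # guard against floating-point round-off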
The third column of the data file is the cumulative probability, the running sum of the second column.
To select a random name with respect to the cumulative probability distribution:
Generate a random number between 0 and 1,
Find the first row whose cumulative probability is bigger than that random number,
Select the name in that row.
import urllib.request
import random
import bisect

url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'
response = urllib.request.urlopen(url)

names, cumprobs = [], []
for line in response:
    name, prob, cumprob, rank = line.decode().split()
    names.append(name)
    cumprobs.append(float(cumprob))

# normalize the cumulative probabilities to the range [0, 1]
cumprobs = [p / cumprobs[-1] for p in cumprobs]

# generate 1000 names at random, using the cumulative probability distribution
N = 1000
selected = [names[bisect.bisect(cumprobs, random.random())] for _ in range(N)]
print('\n'.join(selected))
Note, the alias method has better computational complexity, but for selection of a mere 1000 items, that may not be very important for your use case.
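For what it's worth, on Python 3.6+ the standard library can do the cumulative lookup for you; random.choices accepts a cumulative weight column directly:

# equivalent one-liner using the cumulative column built above
selected = random.choices(names, cum_weights=cumprobs, k=1000)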
A quick and VERY dirty hack that will work for smaller datasets is simply to append each name a number of times proportional to its weight. Note that this will consume a whole ton of memory, especially on larger datasets, so consider this a quick implementation for small weighted distributions ONLY.
import random

filename = r"location/of/file"
data = list()  # accumulator
with open(filename) as in_:
    for line in in_:
        name, prob, *_ = line.split()
        for _ in range(int(float(prob)*1000)):
            data.append(name)
print(random.choice(data))
The goal is to sample n data points from the original population. But the original population has serial correlation (think of it as time series data), and I want each draw to take three neighboring points as one unit. That is to say, three adjacent data points have to be chosen together each time. The choice has to be done without replacement.
The sampling repeats until the number of sampled data points reaches n. Each chosen data point has to be unique. (Assume the population data points are all unique.)
How can I write this into code? I hope the code is fast.
def subsampling(self, population, size, consecutive=3):
    # make seeds which don't have neighbors
    seed_samples = np.random.choice(population,
                                    size=int(size / consecutive),
                                    replace=False)
    target_samples = set(seed_samples)
    # add the neighbors of each seed sample
    for dpoint in seed_samples:
        start = np.searchsorted(population, dpoint, side='right')
        neighbors = population[start:(start + consecutive - 1)]
        target_samples.update(neighbors)
    return sorted(target_samples)
This code is my rough trial, but it doesn't give the correct size because there can be duplicates.
Suppose the population is 1000 entries and you want 200 non-overlapping triplets.
One simple method is: extract 200 unique random numbers x[0], x[1], ..., x[199] from 0 to 599 (600 = 1000 - 200*2). Sort the values; the required indexes for the triplets are then:
0. x[0], x[0]+1, x[0]+2
1. x[1]+2, x[1]+3, x[1]+4
2. x[2]+4, x[2]+5, x[2]+6
...
n. x[n]+2*n, x[n]+2*n+1, x[n]+2*n+2
...
199. x[199]+398, x[199]+399, x[199]+400
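A minimal numpy sketch of that mapping (the function name is mine, and it assumes population is a 1-D numpy array):

import numpy as np

def sample_triplets(population, n_triplets, consecutive=3):
    gap = consecutive - 1
    # draw unique start offsets in the "compressed" index space
    n_free = len(population) - n_triplets * gap  # e.g. 1000 - 200*2 = 600
    x = np.sort(np.random.choice(n_free, size=n_triplets, replace=False))
    # re-expand: shift the k-th start right by gap*k so the runs cannot overlap
    starts = x + gap * np.arange(n_triplets)
    idx = (starts[:, None] + np.arange(consecutive)).ravel()
    return population[idx]

Because the offsets are unique and sorted, successive starts end up at least 3 apart after the shift, so no two triplets share a point.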
import random
number = list(range(1, 10))
weighted = [1]*2 + [2]*2 + [3]*2 + [4]*2 + [5]*2
number_weighted = random.choices(number, weighted, k=1)  # if k=4, the same number is sometimes chosen
I want to loop 3 times, choosing one number each time.
I want the chosen numbers to be weighted but independent (no repeats).
If you know how to do this in Python, I would appreciate it if you taught me.
For example,
number=[1,2,3,4,5]
weighted=[0.1,0.1,0.4,0.3,0.1]
then choose two numbers.
I want 3 and 4 to be the most probable picks,
but random.choices sometimes selects 1, 1 (the same number twice).
So I think:
I take one number (suppose number 3), then
number=[1,2,4,5]
weighted=[0.1,0.1,0.3,0.1]
and I take one more number (suppose number 4), using a loop.
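A minimal sketch of that remove-and-redraw loop (the function name is mine; random.choices re-normalizes the remaining weights on every pass, so they need not sum to 1):

import random

def weighted_sample_without_replacement(numbers, weights, k):
    numbers, weights = list(numbers), list(weights)  # work on copies
    picks = []
    for _ in range(k):
        pick = random.choices(numbers, weights=weights, k=1)[0]
        i = numbers.index(pick)
        del numbers[i], weights[i]  # drop the pick before the next draw
        picks.append(pick)
    return picks

print(weighted_sample_without_replacement([1, 2, 3, 4, 5], [0.1, 0.1, 0.4, 0.3, 0.1], 2))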
Your question isn't quite clear, so comment if this doesn't solve your problem.
Define a function which returns a random number from the list for a given weight, and another function to make sure you get n randoms from different weights.
Also, your number and weight lists were of different lengths; I hope that was an error.
import random

def get_rand(num_list, weight_list, weight):
    # indices of all numbers that carry this weight
    selection_from = [i for i, v in enumerate(weight_list) if v == weight]
    rand_index = random.choice(selection_from)
    return num_list[rand_index]

def get_n_rand(num_list, weight_list, n):
    weights = list(set(weight_list))
    random.shuffle(weights)
    final_list = []
    # if you don't want numbers from the same weight:
    for weight in weights[:n]:
        final_list.append(get_rand(num_list, weight_list, weight))
    # if the same weight is also fine, use this instead:
    # for i in range(n):
    #     weight = random.choice(weights)
    #     final_list.append(get_rand(num_list, weight_list, weight))
    return final_list

number = list(range(1, 10))
weighted = [1]*2 + [2]*2 + [3]*2 + [4]*2 + [5]*1
assert len(number) == len(weighted)
rand = get_n_rand(number, weighted, 3)
print("selected numbers:", rand)
print("their weights:", [weighted[number.index(i)] for i in rand])
Since you had a hard time understanding,
selection_from = [i for i, v in enumerate(weight_list) if v == weight]
is equivalent to:
selection_from = []
for i in range(len(weight_list)):
    v = weight_list[i]
    if v == weight:
        selection_from.append(i)
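For completeness, and as an alternative to the answer above rather than part of it: when the weights are true probabilities, numpy offers weighted sampling without replacement in a single call:

import numpy as np

number = [1, 2, 3, 4, 5]
weighted = [0.1, 0.1, 0.4, 0.3, 0.1]  # must sum to 1 for np.random.choice
picks = np.random.choice(number, size=2, replace=False, p=weighted)
print(picks)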
The function randint from the random module can be used to produce random numbers. A call on random.randint(1, 6), for example, will produce the values 1 to 6 with equal probability. Write a program that loops 1000 times. On each iteration it makes two calls on randint to simulate rolling a pair of dice. Compute the sum of the two dice, and record the number of times each value appears.
The output should be two columns: one displays all the sums (i.e. from 2 to 12), and the other displays each sum's frequency over the 1000 rolls.
My code is shown below:
import random
freq = [0]*13
for i in range(1000):
    Sum = random.randint(1,6) + random.randint(1,6)
    #compute the sum of two random numbers
    freq[sum] += 1
    #add on the frequency of a particular sum
for Sum in xrange(2,13):
    print Sum, freq[Sum]
    #Print a column of sums and a column of their frequencies
However, I didn't manage to get any results.
You shouldn't use Sum because simple variables should not be capitalized.
You shouldn't use sum because that would shadow the built-in sum().
Use a different non-capitalized variable name. I suggest diceSum; that also says a bit about the context and the idea behind your program, so a reader understands it faster.
You don't want to make any readers of your code happy? Think again. You asked for help here ;-)
Try this:
import random
freq = [0]*13
for i in range(1000):
    Sum = random.randint(1,6) + random.randint(1,6)
    #compute the sum of two random numbers
    freq[Sum] += 1
    #add on the frequency of a particular sum
for Sum in xrange(2,13):
    print Sum, freq[Sum]
    #Print a column of sums and a column of their frequencies
There's a letter-case error on sum: the counter is indexed with the built-in sum instead of your variable Sum.
The seed generator that Python uses should suffice for your task.
Looks like a typo: the Sum variable is mistyped as sum in the increment.
Below is the modified code in Python 3.x:
#!/usr/bin/env python3
import random

freq = [0]*13
for i in range(1000):
    #compute the sum of two random numbers
    Sum = random.randint(1,6) + random.randint(1,6)
    #add on the frequency of a particular sum
    freq[Sum] += 1
for Sum in range(2,13):
    #Print a column of sums and a column of their frequencies
    print(Sum, freq[Sum])
I have already asked a few questions on here about this same topic, but I'm really trying not to disappoint the professor I'm doing research with. This is my first time using Python and I may have gotten in a little over my head.
Anyways, I was sent a file to read, and I was able to load it using this command:
import numpy
SNdata = numpy.genfromtxt('...', dtype=None,
                          usecols=(0,6,7,8,9,19,24,29,31,33,34,37,39,40,41,42,43,44),
                          names=['sn','off1','dir1','off2','dir2','type','gal','dist',
                                 'htype','d1','d2','pa','ai','b','berr','b0','k','kerr'])
sn is just an array of the supernova names; type is an array of the supernova types (Ia or II), etc.
One of the first things I need to do is simply calculate the probabilities of certain properties given the SN type (Ia or II).
For instance, the column htype is the morphology of a galaxy (given as an integer, 1=elliptical to 8=irregular). I need to calculate the probability of an elliptical given a Type Ia and of an elliptical given a Type II, for each of the integers up to 8.
For ellipticals, I know that I just need the number of elements that have htype = 1 and type = Ia divided by the total number of elements of type = Ia. And then the number of elements that have htype = 1 and type = II divided by the total number of elements that have type = II.
I just have no idea how to write code for this. I was planning on finding the total number of each type first and then running a for loop to find the number of elements that have a certain htype given their type (Ia or II).
Could anyone help me get started with this? If any clarification is needed, let me know.
Thanks a lot.
Numpy supports boolean array operations, which will make your code fairly straightforward to write. For instance, you could do:
htype_sums = {}
# the type masks don't change per morphology, so build them once
Ia_mask = SNdata['type'] == 'Ia'
II_mask = SNdata['type'] == 'II'
for htype_number in range(1, 9):
    htype_mask = SNdata['htype'] == htype_number
    Ia_sum = (htype_mask & Ia_mask).sum() / Ia_mask.sum()
    II_sum = (htype_mask & II_mask).sum() / II_mask.sum()
    htype_sums[htype_number] = (Ia_sum, II_sum)
Each of the _mask variables is a boolean array, so when you sum one you count the number of elements that are True.
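A quick illustration of that counting trick:

import numpy as np

mask = np.array([True, False, True, True])
print(mask.sum())  # 3, since True counts as 1 and False as 0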
You can use collections.Counter to count the needed observations.
For example,
from collections import Counter
types_counter = Counter(row['type'] for row in SNdata)
will give you the desired counts of SN types, and
htypes_types_counter = Counter((row['type'], row['htype']) for row in SNdata)
gives counts for (type, morphology) pairs. Then, to get your estimate for ellipticals, just divide:
1.0*htypes_types_counter['Ia', 1]/types_counter['Ia']
I would like to create an array of Zipf-distributed values within the range [0, 1000].
I am using numpy.random.zipf to create the values but I cannot create them within the range I want.
How can I do that?
Normalize and multiply by 1000?

import numpy as np

a = 2
s = np.random.zipf(a, 1000)
result = (s / float(max(s))) * 1000

print(min(s), max(s))
print(min(result), max(result))

Although, isn't the whole point of Zipf that the range of values is a function of the number of values generated?
I agree with the original answer (Felix) that forcing Zipf values to a specific range is a very unusual thing, and it likely means that you're doing something wrong.
Having said that, I actually had a similar problem, where I really did need to generate Zipf values conforming to a certain criteria. In my case, I wanted to generate a brand new set of data that was similar to an existing data set. I wanted the sum to be the same as the existing distribution, but the values to be different.
My insight is that it's possible to re-generate the values a few times until you get ones you like.
import numpy as np

# generate a quantity of Zipf-distributed values close to a desired sum
def gen_zipf_values(alpha, target_sum, quantity):
    best = []
    best_sum = 0
    for _ in range(10):
        s = np.random.zipf(alpha, quantity)
        this_sum = s.sum()
        # keep the draw that gets closest to the target without going over
        if (this_sum > best_sum) and (this_sum <= target_sum):
            best = s
            best_sum = this_sum
    return best
Again, this solution is tailored to my problem, where I wanted to generate values close to a sum without going over. I also had a pretty good idea of what I wanted alpha to be each time. I omitted some of the condition checking, sorting, etc. for clarity.
If you had to do it more than a few times though (i.e. you had to run the for loop 1 million times to get your distribution), you probably have something wrong (like alpha, or unrealistic expectations on the values). I feel it's valid to 'let the computer do the work', or to hand-pick the best option from a few reasonable ones.
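If you really do need draws confined to [0, 1000], one more option (my sketch, not part of either answer) is rejection sampling, which keeps the Zipf shape within the bound instead of rescaling it:

import numpy as np

def bounded_zipf(alpha, size, upper=1000):
    # redraw until enough values fall at or below the bound
    out = np.empty(0, dtype=int)
    while out.size < size:
        draw = np.random.zipf(alpha, size)
        out = np.concatenate([out, draw[draw <= upper]])
    return out[:size]

print(bounded_zipf(2.0, 1000).max())  # never exceeds 1000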