I have a list of ranges. I would like to compute a dictionary of key : value pairs, where the key is a number and the value is the count of ranges that contain that number.
A bad way to compute this is:
from collections import defaultdict
my_dict = defaultdict(int)
ranges = [range(-4200,4200), range(-420,420), range(-42,42), range(8,9), range(9,9), range(9,10)]
for singleRange in ranges:
    for number in singleRange:
        my_dict[number] += 1
sort_dict = sorted(my_dict.items(), key=lambda x: x[1], reverse=True)
print(sort_dict)
How would you do this more efficiently?
Improving on my previous answer, this algorithm solves the problem in O(n + m), where n is the length of the total range and m is the number of sub-ranges.
The basic idea is to iterate through the n numbers just once, keeping a counter of how many ranges the current number belongs to. At each step, we check whether we have passed a range start, in which case the counter is incremented. Conversely, if we have passed a range stop, the counter is decremented.
The actual implementation below uses numpy and pandas for all the heavy lifting, so the iterative nature of the algorithm may seem unclear, but it's basically just a vectorized version of what I've described.
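In plain Python, the sweep looks roughly like this (a non-vectorized sketch of the idea only, not the implementation benchmarked below):
from collections import defaultdict

def count_coverage(ranges):
    # +1 where a range opens, -1 where it closes
    events = defaultdict(int)
    for r in ranges:
        events[r.start] += 1
        events[r.stop] -= 1
    boundaries = sorted(events)
    counts = {}
    current = 0
    # sweep once over the full span, carrying the running count
    for n in range(boundaries[0], boundaries[-1]):
        current += events.get(n, 0)
        if current:
            counts[n] = current
    return counts

ranges = [range(-4200, 4200), range(-420, 420), range(-42, 42),
          range(8, 9), range(9, 9), range(9, 10)]
print(sorted(count_coverage(ranges).items(), key=lambda x: x[1], reverse=True)[:5])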
Compared to the 600 ms of my previous answer, we're down to 20 ms for 10k ranges on my laptop. Moreover, the memory usage is also O(n + m) here while it was O(nm) there, so much larger n and m become possible. You should probably use this solution instead of the first version.
from collections import defaultdict
import numpy as np
import pandas as pd
# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]
ranges = generate_ranges(10_000)
# Extract boundaries
boundaries = np.array([[r.start, r.stop] for r in ranges])
# Add a +1 offset for range starts and -1 for range stops
offsets = np.array([1, -1])[None, :].repeat(boundaries.shape[0], axis=0)
boundaries = np.stack([boundaries, offsets], axis=-1)
boundaries = boundaries.reshape(-1, 2)
# Compute range counts at each crossing of a range boundary
df = pd.DataFrame(boundaries, columns=["n", "offset"])
df = df.sort_values("n")
df["count"] = df["offset"].cumsum()
df = df.groupby("n")["count"].last()  # count after all events at each boundary
# Expand to all integers by joining and filling NaN
index = pd.RangeIndex(df.index[0], df.index[-1] + 1)
df = pd.DataFrame(index=index).join(df).ffill()
# Finally wrap the result in a defaultdict
d = defaultdict(int, df["count"].astype(int).to_dict())
Probably something more efficient can be done, but this solution has the advantage of heavily relying on the speed of numpy. For 10k ranges this runs in ~600 ms on my laptop.
from collections import defaultdict
import numpy as np
# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]
ranges = generate_ranges(10_000)
# Extract boundaries
starts, stops = np.array([[r.start, r.stop] for r in ranges]).T
# Set of all numbers we should test
n = np.arange(starts.min(), stops.max() + 1)[:, None]
# Test those numbers
counts = ((n >= starts[None, :]) & (n < stops[None, :])).sum(axis=1)
# Wrap the result into a dict
d = defaultdict(int, dict(zip(n.flatten(), counts)))
I have a task where I have a list of values: l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]. I have a formula for computing a kind of probability over this list (the probability is high when there are many different values in the list and low when there are only a few kinds of values):
$p = -\sum_{i=1}^{m} f_i \log_m f_i$
where $m$ is the length of the list and $f_i$ is the frequency of the i-th element of the list.
I want to code this in Python with the following:
from math import log
from collections import Counter
-sum([loc*log(loc, len(set(l))) for loc in Counter(l).values()])
But I somehow suspect that this is not the right way. Any better idea?
Additionally: I do not understand the negative sign in the formula, what is the explanation of this?
Here is an alternative way to calculate the entropy of the list using numpy:
import numpy as np
arr = np.array(l)
elem, c = np.unique(arr, return_counts=True)
# occurrences to probabilities
pc = c / c.sum()
# calculate the entropy (and account for log_m)
entropy = -np.sum(pc * np.log(pc)) * (1/np.log(len(c)))
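Run on the example list from the question, this prints roughly 0.87 (a quick sanity check; the final factor rescales the natural-log entropy to base len(c), which is 3 here):
l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
elem, c = np.unique(np.array(l), return_counts=True)
pc = c / c.sum()
print(-np.sum(pc * np.log(pc)) / np.log(len(c)))  # ~0.87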
Although the numpy array is a better solution, in case you don't want to use numpy:
You would be better off saving the Counter and using its length instead of len(set(l)), so that the set is not rebuilt on every iteration; the length of the Counter is the same as len(set(l)) (I assume you use CPython 3.x).
If you don't get the desired result, then probably your formula is wrong
In your code you use len(set(l)) and not len(l), and you iterate over the frequencies, not over the list, which is not what you describe in your formula.
You don't need to wrap the expression inside sum within a list since you only need to iterate over it once (Generator expressions vs. list comprehensions)
EDIT: As to why you get a negative result, this is expected.
You sum terms of the form f[i] * log(f[i]) >= 0:
f[i] >= 1: the frequency of the i-th element of the list.
log(f[i]) >= 0 because f[i] >= 1: the log of each frequency is non-negative in any base greater than 1.
You then take the negative of that sum, so the result will always be less than or equal to 0.
from math import log
from collections import Counter
l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
f = Counter(l)
# This is from your code
p1 = -sum(f[e] * log(f[e], len(f)) for e in f)
# This is from your formula
p2 = -sum(f[e] * log(f[e], len(l)) for e in l)
print(p1, p2)
I wrote this Python code to do a particular computation in a bigger project. It works fine for smaller values of N, but it doesn't scale up very well for large values, and even though I ran it for a number of hours to collect the data, I was wondering if there is a way to speed it up.
import numpy as np
def FillArray(arr):
    while(0 in arr):
        ind1 = np.random.randint(0,N)
        if(arr[ind1]==0):
            if(ind1==0):
                arr[ind1] = 1
                arr[ind1+1] = 2
            elif(ind1==len(arr)-1):
                arr[ind1] = 1
                arr[ind1-1] = 2
            else:
                arr[ind1] = 1
                arr[ind1+1] = 2
                arr[ind1-1] = 2
        else:
            continue
    return arr

N = 50000
dist = []
for i in range(1000):
    arr = [0 for x in range(N)]
    dist.append(FillArray(arr).count(2))
For N = 50,000, it currently takes slightly over a minute on my computer for one iteration to fill the array. So if I want to simulate this, let's say, 1000 times, it takes many hours. Is there something I can do to speed this up?
Edit 1: I forgot to mention what it actually does. I have a list of length N and I initialize it with zeros in each entry. Then I pick a random index between 0 and N; if that entry of the list is zero, I replace it with 1 and its neighboring entries with 2, to indicate that they are not filled with 1 but can't be filled again. I keep doing this until I have populated the whole list with 1s and 2s, and then I count how many of the entries contain 2, which is the result of this computation. In other words, I want to find out how many entries will not be filled if I fill an array randomly under this constraint.
Obviously I do not claim that this is the most efficient way to find this number, so I am hoping there is perhaps a better alternative if this code can't be sped up.
As @SylvainLeroux noted in the comments, the approach of drawing a random location and hoping it's a zero is going to slow down as you start running out of zeros. Simply choosing from the locations you know are still zero will speed it up dramatically. Something like
def faster(N):
    # pad on each side
    arr = np.zeros(N+2)
    arr[0] = arr[-1] = -1  # ignore edges
    while True:
        # zeros left
        zero_locations = np.where(arr == 0)[0]
        if not len(zero_locations):
            break  # we're done
        np.random.shuffle(zero_locations)
        for zloc in zero_locations:
            if arr[zloc] == 0:
                arr[zloc-1:zloc+2] = [2, 1, 2]
    return arr[1:-1]  # remove edges
will be much faster (times on my old notebook):
>>> %timeit faster(50000)
10 loops, best of 3: 105 ms per loop
>>> %time [(faster(50000) == 2).sum() for i in range(1000)]
CPU times: user 1min 46s, sys: 4 ms, total: 1min 46s
Wall time: 1min 46s
We could improve this by vectorizing more of the computation, but depending on your constraints this might already suffice.
First I will reformulate the problem from tri-variate to bi-variate. What you are doing is splitting the vector of length N into two smaller vectors at a random point k.
Let's assume that you start with a vector of zeros. You put a '1' at a randomly selected k, and from there take two smaller vectors of zeros, [0..k-2] and [k+2..N-1]. No need for a third state. You repeat the process until exhaustion, when you are left with vectors containing at most one element.
Using recursion, this is reasonably fast even on my iPad mini with Pythonista.
import numpy as np
from random import randint
def SplitArray(l, r):
    while(l < r):
        k = randint(l, r)
        arr[k] = 1
        return SplitArray(l, k-2) + [k] + SplitArray(k+2, r)
    return []

N = 50000
L = 1000
dist = np.zeros(L)
for i in xrange(L):
    arr = [0 for x in xrange(N)]
    SplitArray(0, N-1)
    dist[i] = arr.count(0)
print dist, np.mean(dist), np.std(dist)
However, if you would like to make it really fast, the bivariate problem can be coded very effectively and naturally as bit arrays instead of storing 1s and 0s in arrays of integers, or worse, floats in numpy arrays. The bit manipulation should be quick, and in some cases you could easily get close to machine-level speed.
Something along these lines (this is an idea, not optimal code):
from bitarray import BitArray
from random import randint
import numpy as np

def SplitArray(l, r):
    while(l < r):
        k = randint(l, r)
        arr.set_bit(k)
        return SplitArray(l, k-2) + [k] + SplitArray(k+2, r)
    return []

def count0(ba):
    i = 0
    for n in xrange(1, N):
        if ba.get_bit(n) == 0:
            i += 1
    return i

N = 50000
L = 1000
dist = np.zeros(L)
for i in xrange(L):
    arr = BitArray(N, initialize = 0)
    SplitArray(1, N)
    dist[i] = count0(arr)
print np.mean(dist), np.std(dist)
using bitarray
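For reference, a runnable variant of the same idea written against the actual API of the bitarray package might look like this (my sketch, assuming bitarray is installed; it keeps the l < r split condition of the code above):
from random import randint
from bitarray import bitarray
import numpy as np

def split_array(arr, l, r):
    # Set a random bit in [l, r] and recurse into the two remaining gaps.
    if l < r:
        k = randint(l, r)
        arr[k] = 1
        split_array(arr, l, k - 2)
        split_array(arr, k + 2, r)

N = 50000
L = 1000
dist = np.zeros(L)
for i in range(L):
    arr = bitarray(N)
    arr.setall(0)
    split_array(arr, 0, N - 1)
    dist[i] = arr.count(0)  # bits never set, i.e. the blocked positions
print(np.mean(dist), np.std(dist))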
The solution converges very nicely, so perhaps half an hour spent looking for an analytical solution would make this whole MC exercise unnecessary?
I have a block of code that I need to optimize as much as possible since I have to run it several thousand times.
What it does is find, for a random float, the closest float in one sub-list of a given array, and store the corresponding float (i.e. the one with the same index) from another sub-list of that array. It repeats the process until the sum of the stored floats reaches a certain limit.
Here's the MWE to make it clearer:
import numpy as np
# Define array with two sub-lists.
a = [np.random.uniform(0., 100., 10000), np.random.random(10000)]
# Initialize empty final list.
b = []
# Run until the condition is met.
while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[1].
    idx = np.argmin(np.abs(u - a[1]))
    # Store value located in sub-list a[0].
    b.append(a[0][idx])
The code is reasonably simple but I haven't found a way to speed it up. I tried to adapt the great (and very fast) answer given to a similar question I asked some time ago, to no avail.
OK, here's a slightly left-field suggestion. As I understand it, you are just trying to sample uniformly from the elements in a[0] until you have a list whose sum exceeds some limit.
Although it will be more costly memory-wise, I think you'll probably find it's much faster to generate a large random sample from a[0] first, then take the cumsum and find where it first exceeds your limit.
For example:
import numpy as np
# array of reference float values, equivalent to a[0]
refs = np.random.uniform(0, 100, 10000)
def fast_samp_1(refs, lim=10000, blocksize=10000):
    # sample uniformly from refs
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)

    # find where the cumsum first exceeds your limit
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]

    # # if it's ok to be just under lim rather than just over then this might
    # # be quicker
    # return samp[samp_sum <= lim]
Of course, if the sum of the sample of blocksize elements is < lim then this will fail to give you a sample whose sum is >= lim. You could check whether this is the case, and append to your sample in a loop if necessary.
def fast_samp_2(refs, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)

    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))

    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Note that concatenating arrays is pretty slow, so it would probably be better to make blocksize large enough to be reasonably sure that the sum of a single block will be greater than or equal to your limit, without being excessively large.
Update
I've adapted your original function a little bit so that its syntax more closely resembles mine.
def orig_samp(refs, lim=10000):
    # Initialize empty final list.
    b = []
    a1 = np.random.random(10000)
    # Run until the condition is met.
    while (sum(b) < lim):
        # Draw random [0,1) value.
        u = np.random.random()
        # Find closest value in sub-list a[1].
        idx = np.argmin(np.abs(u - a1))
        # Store value located in sub-list a[0].
        b.append(refs[idx])
    return b
Here's some benchmarking data.
%timeit orig_samp(refs, lim=10000)
# 100 loops, best of 3: 11 ms per loop
%timeit fast_samp_2(refs, lim=10000, blocksize=1000)
# 10000 loops, best of 3: 62.9 µs per loop
That's well over two orders of magnitude faster. You can do a bit better by reducing the blocksize a fraction - you basically want it to be comfortably larger than the length of the arrays you're getting out. In this case, you know that on average the output will be about 200 elements long, since the mean of all real numbers between 0 and 100 is 50, and 10000 / 50 = 200.
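To turn that estimate into code, something like this could pick the block size automatically (a rough heuristic of mine, not part of the benchmark above; it reuses the fast_samp_2 defined earlier):
import numpy as np

refs = np.random.uniform(0, 100, 10000)
lim = 10000

# Expect roughly lim / mean(refs) draws (~10000 / 50 = ~200 here), then pad
# generously so a single block almost always covers the limit in one pass.
blocksize = int(5 * lim / refs.mean())
samp = fast_samp_2(refs, lim=lim, blocksize=blocksize)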
Update 2
It's easy to get a weighted sample rather than a uniform sample - you can just pass the p= parameter to np.random.choice:
def weighted_fast_samp(refs, weights=None, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True, p=weights)
    samp_sum = np.cumsum(samp)

    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True,
                                   p=weights)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))

    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Write it in Cython. That's going to get you a lot more speed for a high-iteration operation.
http://cython.org/
One obvious optimization: don't recalculate the sum on each iteration, accumulate it instead.
b_sum = 0
while b_sum < 10000:
    ....
    idx = np.argmin(np.abs(u - a[1]))
    add_val = a[0][idx]
    b.append(add_val)
    b_sum += add_val
EDIT:
I think some minor improvement (check it out if you feel like it) may be achieved by pre-referencing sublists before the loop
a_0 = a[0]
a_1 = a[1]
...
while ...:
    ....
    idx = np.argmin(np.abs(u - a_1))
    b.append(a_0[idx])
It may save some on run time - though I don't believe it will matter that much.
Sort your reference array.
That allows O(log n) lookups instead of scanning the whole list (using bisect, for example, to find the closest elements).
For starters, I reverse a[0] and a[1] to simplify the sort:
import bisect
import numpy as np

# Sort both sub-lists together by the [0,1) values so the pairing is preserved
# (np.sort alone would sort each row independently and break the correspondence).
lookup, refs = np.random.random(10000), np.random.uniform(0., 100., 10000)
order = np.argsort(lookup)
a = [lookup[order], refs[order]]
Now, a is sorted by order of a[0], meaning if you are looking for the closest value to an arbitrary number, you can start with a bisect:
b = []
while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find the insertion point of u in the sorted sub-list a[0].
    idx = bisect.bisect(a[0], u)
    # now, the closest value is either at idx or idx - 1
    if idx == len(a[0]) or (idx != 0 and np.abs(a[0][idx] - u) > np.abs(a[0][idx - 1] - u)):
        idx = idx - 1
    # Store value located in sub-list a[1].
    b.append(a[1][idx])
I have 10 bins:
bins = [0,1,2,3,4,5,6,7,8,9]
I have a list of 25 values:
values = [10,0,0,14,14,123,235,0,0,0,0,0,12,12,1235,23,234,15,15,23,136,34,34,37,45]
I want to bin the values sequentially into the bins so each value is grouped into its bin:
binnedValues = [[10,0],[0,14,14],[123,235],[0,0,0],[0,0],[12,12,1235],[23,234],[15,15,23],[136,34,34],[37,45]]
As you can see, the number of values in each bin is not always the same (since len(values) != len(bins)).
Also, I have lots of different values lists that are all different sizes, so I need to do this a number of times for the same number of bins but different lengths of values lists. The above is an example: the real bin size is 10k, and the real len(values) is from ~10k to ~750k.
Is there a way to do this consistently? I need to maintain the order of the values, but split the values list so that a 'fair' and 'even' share of the values is distributed to each of the bins.
I think I can use numpy.digitize, but having had a look, I can't see how to generate the 'binned' list
Are you trying to split the list into lists of alternating size between 2 and 3 elements? That's doable, then.
from itertools import cycle
values = [10,0,0,14,14,123,235,0,0,0,0,0,12,12,1235,23,234,15,15,23,136,34,34,37,45]
splits = cycle([2,3])
bins = []
count = 0
while count < len(values):
    splitby = splits.next()
    bins.append(values[count:count+splitby])
    count += splitby
print bins
Edit:
Ah, I see what you're requesting... sort of. Something more like:
from itertools import cycle
from math import floor, ceil
values = [10,0,0,14,14,123,235,0,0,0,0,0,12,12,1235,23,234,15,15,23,136,34,34,37,45]
number_bins = 10
bins_lower = int(floor(len(values) / float(number_bins)))
bins_upper = int(ceil(len(values) / float(number_bins)))
splits = cycle([bins_lower, bins_upper])
bins = []
count = 0
while count < len(values):
    splitby = splits.next()
    bins.append(values[count:count+splitby])
    count += splitby
print bins
If you want more variety in bin sizes, you can add more numbers to splits.
Edit 2:
Ashwin's way, which is more concise without being harder to understand.
from itertools import cycle, islice
from math import floor, ceil
values = [10,0,0,14,14,123,235,0,0,0,0,0,12,12,1235,23,234,15,15,23,136,34,34,37,45]
number_bins = 10
bins_lower = int(floor(len(values) / float(number_bins)))
bins_upper = int(ceil(len(values) / float(number_bins)))
splits = cycle([bins_lower, bins_upper])
it = iter(values)
bins = [list(islice(it,next(splits))) for _ in range(10)]
print bins
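For completeness, numpy itself can produce this kind of order-preserving, near-even split in one call with np.array_split (a sketch; note its chunk sizes come out largest-first, e.g. five 3s then five 2s for 25 values, rather than alternating as in the example above):
import numpy as np

values = [10, 0, 0, 14, 14, 123, 235, 0, 0, 0, 0, 0, 12,
          12, 1235, 23, 234, 15, 15, 23, 136, 34, 34, 37, 45]
number_bins = 10

# Split into 10 nearly equal, order-preserving chunks (sizes differ by at most 1).
binned_values = [chunk.tolist() for chunk in np.array_split(values, number_bins)]
print(binned_values)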
I need to generate a set of unique random (x, y) points to exclude from a training set. The range for x and y is from 0 to 99.
I am currently doing it like this:
import random

excludeFromTrainingSet = []
while len(excludeFromTrainingSet) < 4000:
    tempX = random.randint(0, 99)
    tempY = random.randint(0, 99)
    if [tempX, tempY] not in excludeFromTrainingSet:
        excludeFromTrainingSet.append([tempX, tempY])
But it takes ages and I really need to speed this up.
Any ideas?
Vincent Savard has an answer that's almost twice as fast as the first solution offered here.
Here's my take on it. It requires tuples instead of lists for hashability:
def method2(size):
    ret = set()
    while len(ret) < size:
        ret.add((random.randint(0, 99), random.randint(0, 99)))
    return ret
Just make sure that the limit is sane, as other answerers have pointed out. For sane input, this is algorithmically better, O(n) as opposed to O(n^2), because of the set instead of the list. Also, Python is much more efficient at loading locals than globals, so always put this stuff in a function.
EDIT: Actually, I'm not sure that they're O(n) and O(n^2) respectively because of the probabilistic component, but the estimates are correct if n is taken as the number of unique elements that they see. They'll both be slower as they approach the total number of available spaces. If you want a number of points that approaches the total number available, then you might be better off using:
import random
import itertools
def method2(size, min_, max_):
    range_ = range(min_, max_)
    points = itertools.product(range_, range_)
    return random.sample(list(points), size)
This will be a memory hog but is sure to be faster as the density of points increases, because it avoids looking at the same point more than once. Another option worth profiling (probably better than the last one) would be
def method3(size, min_, max_):
    range_ = range(min_, max_)
    points = list(itertools.product(range_, range_))
    N = (max_ - min_)**2
    L = N - size
    i = 1
    while i <= L:
        del points[random.randint(0, N - i)]
        i += 1
    return points
My suggestion:
def method2(size):
    randints = range(0, 100)
    excludeFromTrainingSet = set()
    while len(excludeFromTrainingSet) < size:
        excludeFromTrainingSet.add((random.choice(randints), random.choice(randints)))
    return excludeFromTrainingSet
Instead of generating 2 random numbers every time, you first generate the list of numbers from 0 to 99, then you choose 2 and add them to the set. As others pointed out, there are only 10,000 possibilities so you can't loop until you get 40,000, but you get the point.
I'm sure someone is going to come in here with a usage of numpy, but how about using a set and tuple?
E.g.:
excludeFromTrainingSet = set()
while len(excludeFromTrainingSet) < 40000:
    temp = (random.randint(0, 99), random.randint(0, 99))
    if temp not in excludeFromTrainingSet:
        excludeFromTrainingSet.add(temp)
EDIT: Isn't this an infinite loop since there are only 100^2 = 10000 POSSIBLE results, and you're waiting until you get 40000?
Make a list of all possible (x,y) values:
allpairs = list((x,y) for x in xrange(100) for y in xrange(100))
# or with Py2.6 or later:
from itertools import product
allpairs = list(product(xrange(100), xrange(100)))
# or even taking DRY to the extreme
allpairs = list(product(*[xrange(100)]*2))
Shuffle the list:
from random import shuffle
shuffle(allpairs)
Read off the first 'n' values:
n = 4000
trainingset = allpairs[:n]
This runs pretty snappily on my laptop.
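Equivalently, random.sample picks the n pairs without replacement in one call, so the shuffle-and-slice above can be collapsed (same idea, just more compact):
from random import sample

# picks n distinct pairs directly, replacing the explicit shuffle and slice
trainingset = sample(allpairs, n)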
You could make a lookup table of random values... make a random index into that lookup table, and then step through it with a static increment counter...
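That answer is terse, so here is one possible reading of it (my sketch, not the answerer's code): precompute a shuffled table of every possible pair once, then hand out points by stepping a cursor through it.
import random

# Build the lookup table of all 10,000 possible (x, y) pairs once and shuffle it.
table = [(x, y) for x in range(100) for y in range(100)]
random.shuffle(table)

cursor = 0
def next_points(n):
    # Hand out the next n unique points by stepping a counter through the table.
    global cursor
    chunk = table[cursor:cursor + n]
    cursor += n
    return chunk

excludeFromTrainingSet = next_points(4000)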
Generating 40 thousand numbers inevitably will take a while. But you are performing an O(n) linear search on excludeFromTrainingSet, which takes quite a while, especially later in the process. Use a set instead. You could also consider generating a number of coordinate sets, e.g. overnight, and pickling them, so you don't have to generate new data for each test run (dunno what you're doing, so this might or might not help). Using tuples, as someone noted, is not only the semantically correct choice, it might also help with performance (tuple creation is faster than list creation). Edit: Silly me, using tuples is required when using sets, since set members must be hashable and lists are unhashable.
But in your case, your loop isn't terminating because 0..99 is 100 numbers and two-tuples of them have only 100^2 = 10000 unique combinations. Fix that, then apply the above.
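The pickling suggestion could look like this (a sketch of mine; the filename is just a placeholder and the generation step reuses the set-of-tuples approach from the other answers):
import pickle
import random

# Generate the coordinate set once (e.g. overnight) ...
points = set()
while len(points) < 4000:
    points.add((random.randint(0, 99), random.randint(0, 99)))

with open("excluded_points.pkl", "wb") as fh:  # hypothetical filename
    pickle.dump(points, fh)

# ... and in a later test run just load it instead of regenerating.
with open("excluded_points.pkl", "rb") as fh:
    points = pickle.load(fh)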
Taking Vince Savard's code:
>>> from random import choice
>>> def method2(size):
...     randints = range(0, 100)
...     excludeFromTrainingSet = set()
...     while True:
...         x = size - len(excludeFromTrainingSet)
...         if not x:
...             break
...         else:
...             excludeFromTrainingSet.update((choice(randints), choice(randints)) for _ in range(x))
...     return excludeFromTrainingSet
...
>>> s = method2(4000)
>>> len(s)
4000
This is not a great algorithm because it has to deal with collisions, but the tuple-generation makes it tolerable. This runs in about a second on my laptop.
## for py 3.0+
## generate 4000 points in 2D
##
import random
maxn = 10000
goodguys = 0
excluded = [0 for excl in range(0, maxn)]
for ntimes in range(0, maxn):
    alea = random.randint(0, maxn - 1)
    excluded[alea] += 1
    if(excluded[alea] > 1): continue
    goodguys += 1
    if goodguys > 4000: break

    two_num = divmod(alea, 100)  ## Unfold the 2 numbers
    print(two_num)