I am working on a Python project and need help sorting a list of elements.
It is a list of investment portfolio returns (10,000 items). I want to sort them into intervals so that I can then assign probabilities to each interval.
Basically, I want to end up with a set of return ranges and how often the returns fall into each one.
Could you help me out with a library/code that could do this sorting?
Thanks
TL;DR
import random
import numpy as np
ranges = [2e6, 2.5e6, 3e6, 3.5e6, 4e6, 4.5e6, 5e6]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]  # randint needs integer bounds
_, counts = np.unique(np.digitize(returns, bins=ranges), return_counts=True)
Long Answer
If I understand your question right, you want to count how many elements of a list fall into each of a set of ranges. Assuming you pick the ranges yourself, a pure Python solution could look like this:
import random
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]  # randint needs integer bounds
ranges = [[2e6, 2.5e6], [2.5e6, 3e6], [3e6, 3.5e6], [3.5e6, 4e6], [4e6, 4.5e6], [4.5e6, 5e6]]
print("Probability \t Return Range")
for lower, upper in ranges:
    returns_in_range = sum(1 for r in returns if lower <= r < upper)
    probability = 100 * (returns_in_range / len(returns))
    print(f"{probability:.2f}% \t\t {lower / 1e6:.1f}M - {upper / 1e6:.1f}M")
Which outputs something like:
Probability     Return Range
17.36%          2.0M - 2.5M
16.18%          2.5M - 3.0M
16.63%          3.0M - 3.5M
17.13%          3.5M - 4.0M
16.35%          4.0M - 4.5M
16.35%          4.5M - 5.0M
This is a simple solution; however, it iterates over the returns once for every range, which is costly in your case. A better way is to iterate over the returns once and keep a counter for every range.
import random
ranges = [[2e6, 2.5e6], [2.5e6, 3e6], [3e6, 3.5e6], [3.5e6, 4e6], [4e6, 4.5e6], [4.5e6, 5e6]]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]  # randint needs integer bounds
counts = [0] * len(ranges)
for ret in returns:
    for idx, (lower, upper) in enumerate(ranges):
        if lower <= ret < upper:
            counts[idx] += 1
            break  # the ranges don't overlap, so stop at the first match
# print the results
print("Probability \t Return Range")
for idx, (lower, upper) in enumerate(ranges):
    returns_in_range = counts[idx]
    probability = 100 * (returns_in_range / len(returns))
    print(f"{probability:.2f}% \t\t {lower / 1e6:.1f}M - {upper / 1e6:.1f}M")
Using a library like numpy may increase performance significantly, as it is mostly written in C. The operation you are trying to do is called binning and can be implemented like this:
import random
import numpy as np
ranges = [2e6, 2.5e6, 3e6, 3.5e6, 4e6, 4.5e6, 5e6]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]  # randint needs integer bounds
_, counts = np.unique(np.digitize(returns, bins=ranges), return_counts=True)
# print the results
print("Probability \t Return Range")
for idx, (lower, upper) in enumerate(zip(ranges, ranges[1:])):
    probability = 100 * (counts[idx] / len(returns))
    print(f"{probability:.2f}% \t\t {lower / 1e6:.1f}M - {upper / 1e6:.1f}M")
Related
I saw many solutions for generating random floats within a specific range (like this), which actually helps me, and solutions for generating random floats summing to 1 (like this). Separately, the solutions work perfectly, but I can't figure out how to merge them.
Currently my code is:
import random
def sample_floats(low, high, k=1):
    """ Return a k-length list of unique random floats
        in the range of low <= x <= high
    """
    result = []
    seen = set()
    for i in range(k):
        x = random.uniform(low, high)
        while x in seen:
            x = random.uniform(low, high)
        seen.add(x)
        result.append(x)
    return result
And still, applying
weights = sample_floats(0.055, 1.0, 11)
weights /= np.sum(weights)
returns a weights array in which some of the floats are less than 0.055.
Should I somehow incorporate np.random.dirichlet into the function above, or should the solution be built on np.random.dirichlet and then enforce the > 0.055 condition? I can't figure out a solution.
Thank you in advance!
IIUC, you want to generate an array of k values, with minimum value of low=0.055.
It is easier to generate numbers starting from 0 that sum up to 1 - low*k, and then add low to each so that the final array sums to 1. This guarantees both the lower bound and the sum.
Regarding high, I am pretty sure it is mathematically impossible to add this constraint: once you fix the lower bound and the sum, there are not enough degrees of freedom left to choose an upper bound. The effective upper bound will be 1 - low*(k-1) (here 0.505).
Also, be aware that, with a minimum value, you necessarily enforce a maximum k of 1//low (here 18 values). If you set k higher, the lower bound won't be respected.
import numpy as np

# parameters
low = 0.055
k = 10

a = np.random.rand(k)
a = a / a.sum() * (1 - low*k)
weights = a + low

# checking that the sum is 1
assert np.isclose(weights.sum(), 1)
Example output:
array([0.13608635, 0.06796974, 0.07444545, 0.1361171 , 0.07217206,
0.09223554, 0.12713463, 0.11012871, 0.1107402 , 0.07297022])
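Since the question mentions np.random.dirichlet, the same shift-and-scale idea can also be phrased with it; this is only a sketch of an equivalent construction, not a different method:
import numpy as np

low, k = 0.055, 10

# dirichlet samples already sum to 1; scale them to sum to 1 - low*k, then shift by low
weights = low + np.random.dirichlet(np.ones(k)) * (1 - low * k)

assert np.isclose(weights.sum(), 1)
assert weights.min() >= low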
You could generate k-1 numbers iteratively by varying the lower and upper bounds of the uniform random number generator; the constraint at each iteration is that the number generated leaves enough room for the remaining numbers to be at least low.
import random

def sample_floats(low, high, k=1):
    result = []
    generated = 0
    while generated < k-1:
        current_higher_bound = max(low, 1 - (k - 1 - generated)*low - sum(result))
        next_num = random.uniform(low, current_higher_bound)
        result.append(next_num)
        generated += 1
    last_num = 1 - sum(result)
    result.append(last_num)
    return result

print(sample_floats(0.01, 1, k=15))
#[0.08878760926151083,
# 0.17897435239586243,
# 0.5873150041878156,
# 0.021487776792166513,
# 0.011234379498998357,
# 0.012408564286727042,
# 0.015391011259745103,
# 0.01264921242128719,
# 0.010759267284382326,
# 0.010615007333002748,
# 0.010288605412288477,
# 0.010060487014659121,
# 0.010027216923973544,
# 0.010000064276203318,
# 0.010001441651377285]
The samples are correlated, so I believe you can't generate them in an IID way. You can, however, do it in an iterative manner, as I show in the code below. There are a few more special cases to check, like what if the user inputs low > high or high*k < sum, but I figure you can find and account for them using my modification to your code.
import random
import warnings

def sample_floats(low=0.055, high=1., x_sum=1., k=1):
    """ Return a k-length list of unique random floats
        in the range of 'low' <= x <= 'high' summing up to 'x_sum'.
    """
    sum_i = 0
    xs = []
    if x_sum - (k-1)*low < high:
        warnings.warn(f'high = {high} is too high to be generated under the'
                      f' conditions set by k = {k}, sum = {x_sum}, and low = {low}.'
                      f' high automatically set to {x_sum - (k-1)*low}.')
    if k == 1:
        if high < x_sum:
            raise ValueError(f'The parameter combination k = {k}, sum = {x_sum},'
                             f' and high = {high} is impossible.')
        else:
            return x_sum
    high_i = high
    for i in range(k-1):
        x = random.uniform(low, high_i)
        xs.append(x)
        sum_i = sum_i + x
        if high < (x_sum - sum_i - (k-1-i)*low):
            high_i = high
        else:
            high_i = x_sum - sum_i - (k-1-i)*low
    xs.append(x_sum - sum_i)
    return xs
For example:
random.seed(0)
xs = sample_floats(low = 0.055, high = 0.5, x_sum = 1., k = 5)
print(xs)
print(sum(xs))
Output:
[0.43076772392864643, 0.27801464913542906, 0.08495210994346317, 0.06568433355884717, 0.14058118343361425]
1.0
My dataset has 2 million observations. I want to split it into 200 categories based on the value of a variable, 'rv'. For example, imagine I had the categories 0-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000. I would want to split an observation with value 4500 like this: 1000 into each of the first 4 categories, and 500 into the final one. I have the following code, which works but is very slow:
# create random data set
import pandas as pd
import numpy as np
data = np.random.randint(0, 5000, size=2000)
df = pd.DataFrame({'rv': data})
#%% slice
sizes = [0, 1000, 2000, 3000, 4000, 5000]
size_names = ['{:.0f} to {:.0f}'.format(lower, upper) for lower, upper in zip(sizes[0:-1], sizes[1:])]
for lower, upper, name in zip(sizes[0:-1], sizes[1:], size_names):
    df[name] = df['rv'].apply(lambda x: max(0, (min(x, upper) - lower)))
# summary table
df_slice = df[size_names].sum()
Are there better ways of doing this, where 'better' principally means faster? With 2 million observations and 200 categories this takes quite a long time (not sure how long, as I stopped the code before it finished).
I wrote an algorithm that sorts the data beforehand, which takes it from an O(n*m) loop (over the data and the categories) to an O(n) loop (just over the data, albeit with an O(n log n) cost for the sort). By sorting, you already know which bin you're in and just have to take care of the summing for that particular bin, then apply the accumulated count to that bin and all bins below it once per bin. It takes about 1.2 seconds on 2 million data points over 200 categories. Hope it helps:
from random import randint

data = [randint(0, 4999) for i in range(2000000)]
sizes = range(0, 5001, 25)
bound_pairs = [[sizes[i], sizes[i + 1]] for i in range(len(sizes) - 1)]
results = [0 for i in range(len(sizes) - 1)]

data.sort()

curr_bin = 0
curr_bin_count = 0
curr_bin_sum = 0

for d in data:
    if d >= bound_pairs[curr_bin][1]:
        results[curr_bin] += curr_bin_sum
        for i in range(curr_bin):
            results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
        curr_bin_count = 0
        curr_bin_sum = 0
        while d >= bound_pairs[curr_bin][1]:
            curr_bin += 1
    curr_bin_count += 1
    curr_bin_sum += d - bound_pairs[curr_bin][0]

results[curr_bin] += curr_bin_sum
for i in range(curr_bin):
    results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
EDIT: There may be some issues here depending on whether you want the upper bound or lower bound to be inclusive or exclusive. I leave the particulars to you.
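Another possibility, sketched here only as an alternative and assuming the rv values fit in memory as a numpy array, is to vectorise the clipping instead of calling apply once per bin; the chunking below (100 chunks is an arbitrary choice) just keeps the observations-by-bins intermediate matrix from growing too large:
import numpy as np

# example data, matching the shapes described in the question
rv = np.random.randint(0, 5000, size=2_000_000)
sizes = np.arange(0, 5001, 25)
lowers, uppers = sizes[:-1], sizes[1:]

totals = np.zeros(len(lowers), dtype=np.int64)
for chunk in np.array_split(rv, 100):
    # each value contributes min(x, upper) - lower (floored at 0) to every bin
    totals += (np.clip(chunk[:, None], lowers, uppers) - lowers).sum(axis=0)

size_names = ['{:.0f} to {:.0f}'.format(lo, hi) for lo, hi in zip(lowers, uppers)]
print(dict(zip(size_names[:3], totals[:3])))   # peek at the first few bins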
I want to simply distribute N items in n cells; both N and n can be large, so I would rather not loop over the random draws as here:
import numpy as np

nitems = 100
ncells = 3
cells = np.zeros((ncells), dtype=int)  # np.int is deprecated; use the builtin int
for _ in range(nitems):
    dest = np.random.randint(ncells)
    cells[dest] += 1
print(cells)
In this case, the output is:
[31 34 35]
(the sum is always N)
Is it there any faster way?
An answer to the question follows (I have to thank @pjs here for his help). I think it is the fastest, and probably the shortest and most space-efficient one possible:
import numpy as np
from time import sleep  # only used by the commented-out debugging line below

g_nitems = 10000
g_ncells = 10
g_nsamples = 10000

def genDist(nitems, ncells):
    # ncells-1 sorted cut points in [0, nitems]; successive differences give the counts
    r = np.sort(np.random.randint(0, nitems + 1, ncells - 1))
    return np.concatenate((r, [nitems])) - np.concatenate(([0], r))

# Some stats
test = np.zeros(g_ncells, dtype=int)
Max = np.zeros(g_ncells, dtype=int)
for _ in range(g_nsamples):
    tmp = genDist(g_nitems, g_ncells)
    print(tmp.sum(), tmp, end='\r')
    # print(_, end='\r')
    # sleep(0.5)
    test += tmp
    for i in range(g_ncells):
        if tmp[i] > Max[i]:
            Max[i] = tmp[i]

print("\n", Max)
print(test // g_nsamples)
On my machine, your code with a timeit took 151 microseconds. The following took 11 microseconds:
import numpy as np

nitems = 100
ncells = 3
values = np.random.randint(0, ncells, nitems)
cells = np.array_split(values, ncells)
lengths = [len(cell) for cell in cells]
print(lengths, np.sum(lengths))
The result of the print is [34, 33, 33] 100.
The magic here is using numpy to do the splitting, but note that this method will split as close to uniform as possible.
If you want the partitioning done randomly:
import numpy as np

nitems = 100
ncells = 3
values = np.random.randint(0, ncells, nitems)
ind_split = [np.random.randint(0, nitems)]
ind_split.append(np.random.randint(ind_split[-1], nitems))
cells = np.array_split(values, ind_split)
lengths = [len(cell) for cell in cells]
print(lengths, np.sum(lengths))
This takes advantage of numpy.array_split taking indices of where to perform the split as an argument (rather than the number of near-uniform partitions).
You haven't specified that the counts have to have any particular distribution as long as they add up to N, so the following will work as requested:
import numpy as np
nitems = 100
ncells = 3
range_array = [np.random.randint(nitems + 1) for _ in range(ncells - 1)] + [0, nitems]
range_array.sort()
cells = [range_array[i + 1] - range_array[i] for i in range(ncells)]
print(cells)
It generates an ordered set of random values between 0 and nitems, then takes successive differences to generate the desired number of cell counts.
The complexity is O(ncells) rather than O(nitems), so it should be more efficient when there are substantially more items than cells.
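If you do want the counts to follow the same distribution as the original loop (each item independently lands in a uniformly chosen cell), numpy can sample that directly; a minimal sketch using np.random.multinomial:
import numpy as np

nitems = 100
ncells = 3

# one multinomial draw: nitems trials, each cell equally likely
cells = np.random.multinomial(nitems, [1.0 / ncells] * ncells)
print(cells, cells.sum())  # the counts always sum to nitems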
What is the best way to perform the actions below?
Find whether a given image is plain or holds some drawing/graphics.
Which pixel value (RGB) has been used the most in a given image?
i1 = list(self.__img1.getdata())
result = 0
resultVal = None
a = list(set(i1))
length = len(i1)
for val in a:
    print val
    occurencePercent = (i1.count(val) / float(length)) * 100  # float() avoids integer division in Python 2
    if occurencePercent > result:
        result = occurencePercent
        resultVal = val
print resultVal
print result
But since the image is 640 x 480, this takes a very long time. What is the best approach? Please guide.
This is the solution I have arrived at for now, but if there are smarter methods, please advise:
import itertools
import operator

i1 = list(self.__img1.getdata())
UniqOccurenceData = {}
# groupby only groups consecutive equal items, so the pixels must be sorted first
for tempVal in itertools.groupby(sorted(i1)):
    UniqOccurenceData[tempVal[0]] = len(list(tempVal[1]))
maxrgboccurence = max(UniqOccurenceData.iteritems(), key=operator.itemgetter(1))
maxVal = float(maxrgboccurence[1])
maxPercent = float(maxVal / len(i1)) * 100
print maxrgboccurence
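If it is of use, collections.Counter does this kind of counting in optimized code and is usually much faster than calling list.count in a loop; a minimal sketch, where img stands in for self.__img1:
from collections import Counter

pixels = list(img.getdata())              # img stands in for self.__img1
top_pixel, top_count = Counter(pixels).most_common(1)[0]
top_percent = 100.0 * top_count / len(pixels)
print(top_pixel)
print(top_percent)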
I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows. l is the list of times of events and queries has the times at which I want the value of this average.
a = 0.01
l = [3, 7, 10, 20, 200]
y = [0]*1000
for item in l:
    y[int(item)] = 1

s = [0]*1000
for i in xrange(1, 1000):
    s[i] = a*y[i-1] + (1-a)*s[i-1]

queries = [23, 68, 103]
for q in queries:
    print s[q]
Outputs:
0.0355271185019
0.0226018371526
0.0158992102478
In practice l will be very large and the range of values in l will also be huge. How can you find the values at the times in queries more efficiently, and in particular without computing the potentially huge lists y and s explicitly? I need it to be in pure Python so I can use PyPy.
Is it possible to solve the problem in time proportional to len(l)
and not max(l) (assuming len(queries) < len(l))?
Here is my code for doing this:
def ewma(l, queries, a=0.01):
    def decay(t0, x, t1, a):
        from math import pow
        return pow((1-a), (t1-t0))*x

    assert l == sorted(l)
    assert queries == sorted(queries)

    samples = []
    try:
        t0, x0 = (0.0, 0.0)
        it = iter(queries)
        q = it.next()-1.0
        for t1 in l:
            # new value is decayed previous value, plus a
            x1 = decay(t0, x0, t1, a) + a
            # take care of all queries between t0 and t1
            while q < t1:
                samples.append(decay(t0, x0, q, a))
                q = it.next()-1.0
            # take care of all queries equal to t1
            while q == t1:
                samples.append(x1)
                q = it.next()-1.0
            # update t0, x0
            t0, x0 = t1, x1
        # take care of any remaining queries
        while True:
            samples.append(decay(t0, x0, q, a))
            q = it.next()-1.0
    except StopIteration:
        return samples
I've also uploaded a fuller version of this code with unit tests and some comments to pastebin: http://pastebin.com/shhaz710
EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general. This code supports non-integer values in l and queries. It also works for any kind of iterables, not just lists since I don't do any indexing.
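For reference, the recurrence in the question has a simple closed form: an event at time t first contributes a at time t+1 and then decays, so its contribution to the value at query time q is a*(1-a)**(q-1-t). That gives a direct per-query check against either implementation; a minimal sketch:
from bisect import bisect_right

def ewma_at(l, q, a=0.01):
    # events strictly before time q contribute a * (1 - a) ** (q - 1 - t)
    idx = bisect_right(l, q - 1)
    return sum(a * (1 - a) ** (q - 1 - t) for t in l[:idx])

# matches the brute-force s[23], s[68], s[103] from the question
for q in [23, 68, 103]:
    print(ewma_at([3, 7, 10, 20, 200], q))
This is O(len(l)) per query, so it is meant as a correctness check rather than a replacement for the streaming approaches above.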
I think you could do it in O(log len(l)) time per query, if l is sorted. The basic idea is that the non-recursive form of the EMA is s_i = a*y_(i-1) + a*(1-a)*y_(i-2) + a*(1-a)^2*y_(i-3) + ...
This means that for query k, you find the greatest event time in l that is less than k and, up to an estimation limit, sum terms of the form
(1-a)^(k - l[v]) * v[l[v]] + ...., where v is the index into l, l[v] is the event time, and v[l[v]] is its value.
Then you spend O(log len(l)) time on the search plus a constant multiple of the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it; just wanted to get my idea out there while I was thinking about it.
Here's the code.
v is the dictionary of values at a given time; replace v[l[i]] with 1 if the value is just 1 every time...
import math
from bisect import bisect_right

a = .01
limit = 1000
l = [1, 5, 14, 29]  # ... example event times

def find_nearest_lt(l, time):
    i = bisect_right(l, time)
    if i:
        return i-1
    raise ValueError

def find_ema(l, time):
    i = find_nearest_lt(l, time)
    if l[i] == time:
        result = a * v[l[i]]
        i -= 1
    else:
        result = 0
    # stop at the start of l, or once events are older than `limit`
    while i >= 0 and (time - l[i]) < limit:
        result += math.pow(1-a, time-l[i]) * v[l[i]]
        i -= 1
    return result
If I'm thinking correctly, find_nearest is O(log n), and the while loop runs at most limit (here 1000) iterations, guaranteed, so it's technically a constant (though a kind of large one). find_nearest was stolen from the page on bisect - http://docs.python.org/2/library/bisect.html
It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? That's an efficient way to store and look up a set of numbers.
Your code will behave incorrectly the first time through this loop:
s = [0]*1000
for i in xrange(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
because i-1 is -1 when i=0 (the first pass of the loop), and both y[-1] and s[-1] refer to the last element of the list, not the previous one. Maybe you want xrange(1, 1000)?
How about this code:
a = 0.01
l = [3.0, 7.0, 10.0, 20.0, 200.0]
y = set(int(item) for item in l)
queries = [23, 68, 103]
ewma = []

x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1] + 1):  # include the last query time
    x = (1-a)*x
    if i in y:
        x += a
    if queries and i == queries[0]:
        ewma.append(x)
        queries.pop(0)
When it's done, ewma should have the moving averages for each query point.
Edited to include SchighSchagh's improvements.
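For completeness, a minimal way to read the results off afterwards; note that the loop pops from queries, so keep a copy of the original query times (the list below is just that copy) if you still need them:
query_times = [23, 68, 103]   # copy of the original query times, since the loop empties queries
for t, val in zip(query_times, ewma):
    print("%d -> %f" % (t, val))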