I simply want to distribute N items across n cells at random. Both N and n can be large, so I would like to avoid looping over random draws as in the following:
import numpy as np
nitems = 100
ncells = 3
cells = np.zeros(ncells, dtype=int)
for _ in range(nitems):
    dest = np.random.randint(ncells)
    cells[dest] += 1
print(cells)
In this case, the output is:
[31 34 35]
(the sum is always N)
Is there any faster way?
An answer to the question follows (I have to thank #pjs here for his help). I think it is the fastest, and probably also the shortest and most space-efficient one possible:
from numpy import *
from time import sleep

g_nitems = 10000
g_ncells = 10
g_nsamples = 10000

def genDist(nitems, ncells):
    r = sort(random.randint(0, nitems+1, ncells-1))
    return concatenate((r, [nitems])) - concatenate(([0], r))

# Some stats
test = zeros(g_ncells, dtype=int)
Max = zeros(g_ncells, dtype=int)
for _ in range(g_nsamples):
    tmp = genDist(g_nitems, g_ncells)
    print(tmp.sum(), tmp, end='\r')
    # print(_, end='\r')
    # sleep(0.5)
    test += tmp
    for i in range(g_ncells):
        if tmp[i] > Max[i]:
            Max[i] = tmp[i]
print("\n", Max)
print(test//g_nsamples)
On my machine, your code with a timeit took 151 microseconds. The following took 11 microseconds:
import numpy as np
nitems = 100
ncells = 3
values = np.random.randint(0, ncells, nitems)
cells = np.array_split(values, ncells)
lengths = [len(cell) for cell in cells]
print(lengths, np.sum(lengths))
The result of the print is [34, 33, 33] 100.
The magic here is letting numpy do the splitting, but note that this method splits as close to uniformly as possible.
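If you want the cell counts to have exactly the same distribution as the original loop (each item independently lands in a uniformly chosen cell), numpy also has a direct primitive for that. This is a minimal sketch, not part of the original answers:

import numpy as np

nitems = 100
ncells = 3

# each item independently picks one of ncells equally likely cells;
# the counts follow the same multinomial distribution as the explicit loop
cells = np.random.multinomial(nitems, [1.0 / ncells] * ncells)
print(cells, cells.sum())  # the counts always sum to nitems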
If you want the partitioning done randomly:
import numpy as np
nitems = 100
ncells = 3
values = np.random.randint(0, ncells, nitems)
ind_split = [np.random.randint(0, nitems)]
ind_split.append(np.random.randint(ind_split[-1], nitems))
cells = np.array_split(values, ind_split)
lengths = [len(cell) for cell in cells]
print(lengths, np.sum(lengths))
This takes advantage of numpy.array_split taking indices of where to perform the split as an argument (rather than the number of near-uniform partitions).
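If you need more than three cells, the same trick generalizes by drawing and sorting ncells - 1 split indices. A sketch under that assumption (it is not part of the original answer):

import numpy as np

nitems = 100
ncells = 5  # any number of cells, not just 3

values = np.random.randint(0, ncells, nitems)
# draw ncells - 1 random cut points and sort them, since array_split
# expects the split indices in increasing order
ind_split = np.sort(np.random.randint(0, nitems, ncells - 1))
cells = np.array_split(values, ind_split)
lengths = [len(cell) for cell in cells]
print(lengths, np.sum(lengths))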
You haven't specified that the counts have to have any particular distribution as long as they add up to N, so the following will work as requested:
import numpy as np
nitems = 100
ncells = 3
range_array = [np.random.randint(nitems + 1) for _ in range(ncells - 1)] + [0, nitems]
range_array.sort()
cells = [range_array[i + 1] - range_array[i] for i in range(ncells)]
print(cells)
It generates an ordered set of random values between 0 and nitems, then takes successive differences to generate the desired number of cell counts.
The complexity is O(ncells) rather than O(nitems), so it should be more efficient when there are substantially more items than cells.
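For reference, the same differencing idea can be written with numpy's np.diff using its prepend/append arguments. This is just a vectorized sketch equivalent to the genDist function above, not additional functionality:

import numpy as np

nitems = 100
ncells = 3

# sort ncells - 1 random cut points in [0, nitems] and take successive
# differences; the differences always sum to nitems
cuts = np.sort(np.random.randint(0, nitems + 1, ncells - 1))
cells = np.diff(cuts, prepend=0, append=nitems)
print(cells, cells.sum())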
I am working on a Python project and I need help sorting a list of elements.
It is a list of investment portfolio returns (10000 items). I want to sort them into intervals so that I can then assign probabilities to them.
Basically, for each return range I want to know what fraction of the returns falls into it.
Could you help me out with a library/code that could do this sorting?
Thanks.
TL;DR
import random
import numpy as np
ranges = [2e6, 2.5e6, 3e6, 3.5e6, 4e6, 4.5e6, 5e6]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]
_, counts = np.unique(np.digitize(returns, bins=ranges), return_counts=True)
Long Answer
If I understand your question right, you want to count the number of elements in a list, given a condition. Assuming you pick the ranges yourself, a pure Python solution could look like this:
import random
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]
ranges = [[2e6, 2.5e6], [2.5e6, 3e6], [3.5e6, 4e6], [4e6, 4.5e6], [4.5e6, 5e6]]

print(f"Probability \t Return Range")
for lower, upper in ranges:
    returns_in_range = sum(1 for r in returns if lower <= r < upper)
    probability = 100 * (returns_in_range / len(returns))
    print(f"{probability :.2f}% \t\t {lower / 1e6 :.1f}M - {upper / 1e6 :.1f}M")
Which outputs something like:
Probability Return Range
17.36% 2.0M - 2.5M
16.18% 2.5M - 3.0M
17.13% 3.5M - 4.0M
16.35% 4.0M - 4.5M
16.35% 4.5M - 5.0M
This is a simple solution; however, it iterates over the returns once for every range, which is costly in your case. A better way is to iterate over the returns once and keep a counter for every range.
import random
ranges = [[2e6, 2.5e6], [2.5e6, 3e6], [3.5e6, 4e6], [4e6, 4.5e6], [4.5e6, 5e6]]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]

counts = [0] * len(ranges)
for ret in returns:
    for idx, (lower, upper) in enumerate(ranges):
        if lower <= ret < upper:
            counts[idx] += 1

# print the results
print(f"Probability \t Return Range")
for idx, (lower, upper) in enumerate(ranges):
    returns_in_range = counts[idx]
    probability = 100 * (returns_in_range / len(returns))
    print(f"{probability :.2f}% \t\t {lower / 1e6 :.1f}M - {upper / 1e6 :.1f}M")
Using a library like numpy may increase performance significantly, as it is mostly written in C. The operation you are trying to do is called binning and can be implemented like this:
import random
import numpy as np
ranges = [2e6, 2.5e6, 3e6, 3.5e6, 4e6, 4.5e6, 5e6]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]
_, counts = np.unique(np.digitize(returns, bins=ranges), return_counts=True)

# print the results
print(f"Probability \t Return Range")
for idx, (lower, upper) in enumerate(zip(ranges, ranges[1:])):
    probability = 100 * (counts[idx] / len(returns))
    print(f"{probability :.2f}% \t\t {lower / 1e6 :.1f}M - {upper / 1e6 :.1f}M")
My dataset has 2 million observations. I want to split it into 200 categories based on the value of a variable, 'rv'. For example, imagine I had the categories 0-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000. I would want to split an observation with value 4500 like this: 1000 in each of the first 4 categories, and 500 in the final category. I have the following code, which works but is very slow:
# create random data set
import pandas as pd
import numpy as np
data = np.random.randint(0, 5000, size=2000)
df = pd.DataFrame({'rv': data})
#%% slice
sizes = [0, 1000, 2000, 3000, 4000, 5000]
size_names = ['{:.0f} to {:.0f}'.format(lower, upper) for lower, upper in zip(sizes[0:-1], sizes[1:])]
for lower, upper, name in zip(sizes[0:-1], sizes[1:], size_names):
    df[name] = df['rv'].apply(lambda x: max(0, (min(x, upper) - lower)))
# summary table
df_slice = df[size_names].sum()
Are there better ways of doing this, where better principally means faster? With 2 million observations and 200 categories this takes quite a long time (I'm not sure how long, as I stopped the code before it had finished).
I wrote an algorithm that sorts the data beforehand, which takes it from an O(n*m) loop (over the data and the categories) to an O(n) loop (just over the data, plus O(n log n) for the sort). Because the data is sorted, you always know which bin you are in: you only have to accumulate the sum for the current bin, and when you move past it you credit that sum to the bin and the full bin width to every bin below it, once per bin. It takes about 1.2 seconds on 2 million data points over 200 categories. Hope it helps:
from time import time
from random import randint

data = [randint(0, 4999) for i in range(2000000)]
sizes = range(0, 5001, 25)
bound_pairs = [[sizes[i], sizes[i + 1]] for i in range(len(sizes) - 1)]
results = [0 for i in range(len(sizes) - 1)]

data.sort()

curr_bin = 0
curr_bin_count = 0
curr_bin_sum = 0

for d in data:
    if d >= bound_pairs[curr_bin][1]:
        results[curr_bin] += curr_bin_sum
        for i in range(curr_bin):
            results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
        curr_bin_count = 0
        curr_bin_sum = 0
        while d >= bound_pairs[curr_bin][1]:
            curr_bin += 1
    curr_bin_count += 1
    curr_bin_sum += d - bound_pairs[curr_bin][0]

results[curr_bin] += curr_bin_sum
for i in range(curr_bin):
    results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
EDIT: There may be some issues here depending on whether you want the upper bound or lower bound to be inclusive or exclusive. I leave the particulars to you.
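If you prefer to stay inside numpy, the same per-bin totals can also be computed without an explicit Python loop by combining sorting, prefix sums, and np.searchsorted. This is only a sketch of that alternative (the variable names are mine), not the algorithm from the answer above:

import numpy as np

data = np.sort(np.random.randint(0, 5000, size=2_000_000))
edges = np.arange(0, 5001, 25)
lowers, uppers = edges[:-1], edges[1:]

# prefix sums of the sorted data, so csum[j] is the sum of the first j values
csum = np.concatenate(([0], np.cumsum(data, dtype=np.int64)))
i_lo = np.searchsorted(data, lowers, side='left')  # first index >= lower
i_hi = np.searchsorted(data, uppers, side='left')  # first index >= upper

# values inside [lower, upper) contribute (value - lower);
# values >= upper contribute the full bin width (upper - lower)
inside = (csum[i_hi] - csum[i_lo]) - lowers * (i_hi - i_lo)
above = (len(data) - i_hi) * (uppers - lowers)
results = inside + above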
We have a 2d list; we can convert it into anything if necessary. Each row contains some positive integers (deltas of the original increasing numbers). In total there are 2 billion numbers, more than half of which are equal to 1. Using Elias gamma coding, we can encode the 2d list row by row (we will need to access arbitrary rows by index later) at around 3 bits per number, based on a calculation from the distribution. However, our program has been running for 12 hours and it still hasn't finished the encoding.
Here's what we are doing:
from math import floor, log
from typing import List
from bitstring import BitArray

def _compress_2d_list(input: List[List[int]]) -> List[BitArray]:
    res = []
    for row in input:
        res.append(sum(_elias_gamma_compress_number(num) for num in row))
    return res

def _elias_gamma_compress_number(x: int) -> BitArray:
    n = _log_floor(x)
    return BitArray(bin="0" * n) + BitArray(uint=x, length=_log_floor(x) + 1)

def _log_floor(num: int) -> int:
    return floor(log(num, 2))
Called by:
input_2d_list: List[List[int]] # containing 1.5M lists, total 2B numbers
compressed_list = _compress_2d_list(input_2d_list)
How can I optimize my code to make it run faster? I mean, MUCH FASTER...... I am ok with using any reliable popular library or data structure.
Also, how do we decompress faster with BitStream? Currently I read prefix 0's one by one, then read the binary of the compressed number in a while loop. It's not very fast either...
If you are ok with numpy "bitfields" you can get the compression done in a matter of minutes. Decoding is slower by a factor of three but still a matter of minutes.
Sample run:
# create example (1'000'000 numbers)
a = make_example()
a
# array([2, 1, 1, ..., 3, 4, 3])
b,n = encode(a) # takes ~100 ms on my machine
c = decode(b,n) # ~300 ms
# check round trip
(a==c).all()
# True
Code:
import numpy as np
def make_example():
    a = np.random.choice(2000000, replace=False, size=1000001)
    a.sort()
    return np.diff(a)

def encode(a):
    a = a.view(f'u{a.itemsize}')
    l = np.log2(a).astype('u1')
    L = ((l<<1)+1).cumsum()
    out = np.zeros(L[-1], 'u1')
    for i in range(l.max()+1):
        out[L-i-1] += (a>>i)&1
    return np.packbits(out), out.size

def decode(b, n):
    b = np.unpackbits(b, count=n).view(bool)
    s = b.nonzero()[0]
    s = (s<<1).repeat(np.diff(s, prepend=-1))
    s -= np.arange(-1, len(s)-1)
    s = s.tolist()  # list has faster __getitem__
    ns = len(s)
    def gen():
        idx = 0
        yield idx
        while idx < ns:
            idx = s[idx]
            yield idx
    offs = np.fromiter(gen(), int)
    sz = np.diff(offs)>>1
    mx = sz.max()+1
    out = np.zeros(offs.size-1, int)
    for i in range(mx):
        out[b[offs[1:]-i-1] & (sz>=i)] += 1<<i
    return out
Some simple optimizations result in a factor of three improvement:
def _compress_2d_list(input):
    res = []
    for row in input:
        res.append(BitArray('').join(BitArray(uint=x, length=2*x.bit_length()-1) for x in row))
    return res
However, I think you'll need something better than that. On my machine, this would finish in about 12 hours on 1.5 million lists with 1400 deltas each.
In C it takes about a minute to encode. About 15 seconds to decode.
I have two matrices. One is of size (CxK) and the other is of size (SxK) (where S, C, and K all have the potential to be very large). I want to combine these into an output matrix using the cosine similarity function (it would be of size [CxS]). When I run my code, it takes a very long time to produce an output, and I was wondering if there is any way to optimize what I currently have. [Note: the two input matrices are often very sparse.]
I was previously traversing each matrix using two for index,row loops, but I have since switched to the while loops, which improved my run time significantly.
A #this is one of my input matrices (pandas dataframe)
B #this is my second input matrix (pandas dataframe)
C = pd.DataFrame(columns = ['col_1' ,'col_2' ,'col_3'])
i=0
k=0
while i <= 5:
    col_1 = A.iloc[i].get('label_A')
    while k < 5:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
Right now I have the loops to run on only 5 items from each matrix, producing a 5x5 matrix, but I would obviously like this to work for very large inputs. This is the first time I have done anything like this so please let me know if any facet of code can be improved (data types used to hold matrices, how to traverse them, updating the output matrix, etc.).
Thank you in advance.
This can be done much more easily and much faster by passing the whole arrays to cosine_similarity after you move the labels to the index:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time
c = 50
s = 50
k = 100
A = pd.DataFrame( np.random.rand(c,k))
B = pd.DataFrame( np.random.rand(s,k))
A['label_A'] = [f'A{i}' for i in range(c)]
B['label_B'] = [f'B{i}' for i in range(s)]
C = pd.DataFrame()
# your program
start = time.time()
i=0
k=0
while i < c:
    col_1 = A.iloc[i].get('label_A')
    while k < s:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
print(f'elementwise: {time.time() - start:7.3f} s')
# my solution
start = time.time()
A = A.set_index('label_A')
B = B.set_index('label_B')
C1 = pd.DataFrame(cosine_similarity(A, B), index=A.index, columns=B.index).stack().rename('col_3')
C1.index.rename(['col_1','col_2'], inplace=True)
C1 = C1.reset_index()
print(f'whole array: {time.time() - start:7.3f} s')
# verification
assert(C[['col_1','col_2']].to_numpy()==C1[['col_1','col_2']].to_numpy()).all()\
and np.allclose(C.col_3.to_numpy(), C1.col_3.to_numpy())
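Since the question notes that the inputs are often very sparse, it may be worth adding that cosine_similarity also accepts scipy sparse matrices directly, which keeps the input memory footprint small. The sizes and density below are made-up illustration values, not taken from the question:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical sparse inputs with ~1% non-zero entries
A_sp = sparse.random(5000, 300, density=0.01, format='csr', random_state=0)
B_sp = sparse.random(4000, 300, density=0.01, format='csr', random_state=1)

sim = cosine_similarity(A_sp, B_sp)  # dense (5000, 4000) ndarray of similarities
print(sim.shape)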
I have a function, which for simplicity I'll call shuffler; it takes a list, seeds random with 17, and then prints that list shuffled.
def shuffler(n):
    import random
    random.seed(17)
    print(random.shuffle(n))
How would I create another function called unshuffler that "unshuffles" the list returned by shuffler(), bringing it back to the list I passed into shuffler(), assuming that I know the seed?
Just wanted to contribute an answer that's more compatible with the functional patterns commonly used with numpy. This solution should ultimately perform the fastest, as it takes advantage of numpy's internal optimizations, which can be pushed further with projects like numba. It ought to be much faster than conventional loop structures in Python.
import numpy as np
original_data = np.array([23, 44, 55, 19, 500, 201]) # Some random numbers to represent the original data to be shuffled
data_length = original_data.shape[0]
# Here we create an array of shuffled indices
shuf_order = np.arange(data_length)
np.random.shuffle(shuf_order)
shuffled_data = original_data[shuf_order] # Shuffle the original data
# Create an inverse of the shuffled index array (to reverse the shuffling operation, or to "unshuffle")
unshuf_order = np.zeros_like(shuf_order)
unshuf_order[shuf_order] = np.arange(data_length)
unshuffled_data = shuffled_data[unshuf_order] # Unshuffle the shuffled data
print(f"original_data: {original_data}")
print(f"shuffled_data: {shuffled_data}")
print(f"unshuffled_data: {unshuffled_data}")
assert np.all(np.equal(unshuffled_data, original_data))
Here are two functions that do what you need (written in Python 2 syntax; a Python 3 version follows below):
import random
import numpy as np
def shuffle_forward(l):
    order = range(len(l)); random.shuffle(order)
    return list(np.array(l)[order]), order

def shuffle_backward(l, order):
    l_out = [0] * len(l)
    for i, j in enumerate(order):
        l_out[j] = l[i]
    return l_out
Example
l = range(10000); random.shuffle(l)
l_shuf, order = shuffle_forward(l)
l_unshuffled = shuffle_backward(l_shuf, order)
print l == l_unshuffled
#True
Reseed the random generator with the seed in question and then shuffle the list 1, 2, ..., n. This tells you exactly what ended up where in the shuffle.
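A minimal sketch of that idea, assuming the seed 17 from the question (the helper name unshuffler and the rest of the code are illustrative, not part of the answer above):

import random

def unshuffler(shuffled, seed=17):
    # replay the same shuffle on the index list to learn where each
    # element went, then invert that permutation
    perm = list(range(len(shuffled)))
    random.seed(seed)
    random.shuffle(perm)  # shuffled[i] came from position perm[i]
    original = [None] * len(shuffled)
    for i, src in enumerate(perm):
        original[src] = shuffled[i]
    return original

# usage: if data was shuffled via random.seed(17); random.shuffle(data),
# then unshuffler(data) returns the original ordering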
In Python 3, the shuffle_forward/shuffle_backward pair from above looks like this:
import random
import numpy as np
def shuffle_forward(l):
    order = list(range(len(l))); random.shuffle(order)
    return list(np.array(l)[order]), order

def shuffle_backward(l, order):
    l_out = [0] * len(l)
    for i, j in enumerate(order):
        l_out[j] = l[i]
    return l_out