I currently have a list of numbers. How would you go about adding similar numbers (those within 850 of each other) and finding their average to make the list smaller?
For example I have the list
l = [2000,2200,5000,2350]
In this list, I want to find numbers that are within n + 500 of each other.
So I want all the numbers similar by n + 500, which are 2000, 2200, and 2350, to be added and divided by their count, which is 3, to find the mean. The mean then replaces the three numbers added, so the list becomes l = [2183, 5000].
My actual data is the longer list shown further below; there I would like the numbers close by n + 850 to all be selected and the mean found.
It seems that you are looking for a clustering algorithm - something like K-means.
This algorithm is implemented in the scikit-learn package.
After you find your K means, you can count how many of your data points were clustered with each mean and make your computations.
However, it's not clear in your case what K is. You can try running the algorithm for several values of K until you satisfy your constraint (the n+500 distance between the means).
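For instance, a minimal sketch with scikit-learn's KMeans (K=2 is an assumption here; you would vary it as described above):
import numpy as np
from sklearn.cluster import KMeans

l = np.array([2000, 2200, 5000, 2350]).reshape(-1, 1)  # KMeans expects 2-D input
km = KMeans(n_clusters=2, n_init=10).fit(l)

# replace each cluster with the (integer) mean of its members
new_list = [int(l[km.labels_ == k].mean()) for k in range(km.n_clusters)]
print(new_list)  # e.g. [2183, 5000] (cluster order may vary)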
You can use:
import numpy as np
l = np.array([2000,2200,5000,2350])
# group numbers by integer division: values in the same 500-wide bin count as similar
similar = l // 500
# for each similar group get the average and convert it to integer (as in the desired output)
new_list = [np.average(l[similar == num]).astype(int) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
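Note that integer division groups values into fixed bins rather than by pairwise distance, so two numbers that are close but straddle a bin boundary end up in different groups:
import numpy as np

l = np.array([499, 501])  # only 2 apart, but...
print(l // 500)           # [0 1] -> different bins, so they are not averaged together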
Step 1:
lst = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
import numpy as np

seen = []      # to log used index pairs
diff_dic = {}  # to record index pairs and their diff
for i, a in enumerate(lst):
    for j, b in enumerate(lst):
        if i != j and (i, j)[::-1] not in seen:
            seen.append((i, j))
            diff_dic[(i, j)] = abs(a - b)

keys = []
for ind, diff in diff_dic.items():
    if diff <= 850:
        keys.append(ind)

uniques_k = []  # to record unique indices
for pair in keys:
    for key in pair:
        if key not in uniques_k:
            uniques_k.append(key)

lst_arr = np.array(lst)
nearest_avg = np.mean(lst_arr[uniques_k])
lst_arr = np.delete(lst_arr, uniques_k)
lst_arr = np.append(lst_arr, nearest_avg)
lst_arr
Output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])
You just need a conditional list comprehension like this:
l = [2000,2200,5000,2350]
n = 2000
a = [x for x in l if (n - 500) < x < (n + 500)]
Then you can average with
np.mean(a)
or whatever method you prefer.
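To produce the reduced list from the question's example, you can replace the matched numbers with their mean and keep the rest; a minimal sketch building on the snippet above:
import numpy as np

l = [2000, 2200, 5000, 2350]
n = 2000

a = [x for x in l if (n - 500) < x < (n + 500)]  # numbers close to n
rest = [x for x in l if x not in a]              # everything else

new_l = [int(np.mean(a))] + rest
print(new_l)  # [2183, 5000]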
I have a collection of data and a variable containing indexes to some of them.
A filtering operation is applied on the data that eliminates a subset of the data.
I want to shift the indexes so that they refer to the updated collection of data (eliminating indexes to deleted instances).
I'm using the implementation in the function below. I'm also posting the code I used to validate that it works.
Is there a quick way to do the index realignment via the core libraries, or a better approach in general?
import random

def align_index(wanted_idx, mask):
    """
    Align a set of indexes to a collection after deletions,
    indicated with a mask.

    Arguments:
        wanted_idx: List of desired integer indexes prior to deletion
        mask: Binary mask, where 1's indicate elements that survive deletion
    Returns:
        List of integer indexes to (surviving) desired elements, post-deletion
    """
    # rebuild indexes: remove dangling
    new_idx = [idx for idx in wanted_idx if mask[idx]]
    # mark deleted
    not_mask = [int(not m) for m in mask]
    # cumsum deleted regions
    realigned_idx = [k - sum(not_mask[:k + 1]) for k in new_idx]
    return realigned_idx
# data
data = [random.randint(0, 500) for _ in range(1000)]
rng = list(range(len(data)))

for _ in range(1000):
    # random data deletion / request
    wanted_idx = random.sample(rng, random.randint(5, 100))
    del_index = random.sample(rng, random.randint(5, 100))
    # apply deletion
    mask = [int(i not in del_index) for i in range(len(data))]
    filtered_data = [data[i] for (i, m) in enumerate(mask) if m]
    realigned_index = align_index(wanted_idx, mask)
    # verify
    new_idx = [idx for idx in wanted_idx if mask[idx]]
    l1 = [data[k] for k in new_idx]
    l2 = [filtered_data[k] for k in realigned_index]
    assert l1 == l2
If you use numpy it's quite trivial:
import numpy as np
mask = np.array(mask, dtype=bool)
new_idx = np.cumsum(mask) - 1  # new position of each surviving element
new_idx[~mask] = -1            # deleted elements map to -1
You shouldn't need to recompute new_idx unless more elements get deleted. Then you can get the remapped index for an old index i just by looking at new_idx[i], or remap a whole array at once:
wanted_idx = np.array(wanted_idx, dtype=np.int64)
remapped_idx = new_idx[wanted_idx]
Note that deleted indices get assigned value -1. You can filter these out if you want:
remapped_idx = remapped_idx[remapped_idx >= 0]
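A quick end-to-end illustration on a small hypothetical mask:
import numpy as np

mask = np.array([1, 0, 1, 1, 0, 1], dtype=bool)  # elements 1 and 4 deleted
new_idx = np.cumsum(mask) - 1
new_idx[~mask] = -1
print(new_idx)  # [ 0 -1  1  2 -1  3]

wanted_idx = np.array([0, 1, 3, 5])
remapped_idx = new_idx[wanted_idx]
print(remapped_idx[remapped_idx >= 0])  # [0 2 3] -> old index 1 was deleted, so it is dropped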
I want to calculate the relative rank of each element in an array among the elements before it. For example, in the array [2,1,4,3], the relative rank (from small to large) of the second element (1) among the subarray [2,1] is 1. The relative rank of the third element (4) among the subarray [2,1,4] is 3. The final relative ranks should be [1,1,3,3].
I'm using the following python code:
import numpy as np

x = np.array([2,1,4,3])
rr = np.ones(4)
for i in range(1, 4):
    rr[i] = sum(x[i] >= x[:i+1])
Are there any other faster ways?
Not sure if it's faster, but you can do this with a list comprehension, which always brightens my day:
[sorted(x[:i+1]).index(v)+1 for i, v in enumerate(x)]
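Checking it against the question's example:
x = [2, 1, 4, 3]
print([sorted(x[:i+1]).index(v) + 1 for i, v in enumerate(x)])  # [1, 1, 3, 3]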
Here's a vectorized way with broadcasting -
n = len(x)
m1 = x[1:,None]>=x
m2 = np.tri(n-1,n,k=1, dtype=bool)
rr[1:] = (m1 & m2).sum(1)
Alternatively, we could bring in np.einsum or np.matmul (the @ operator) to do the last step of sum-reduction -
(m1.astype(np.float32)[:,None,:] @ m2[:,:,None])[:,0,0]
np.einsum('ij,ij->i',m1.astype(np.float32),m2)
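Wrapped up as a self-contained function and checked against the question's example (the rr initialization is taken from the question's snippet):
import numpy as np

def relative_rank(x):
    n = len(x)
    rr = np.ones(n, dtype=int)
    m1 = x[1:, None] >= x                   # pairwise comparisons against all elements
    m2 = np.tri(n - 1, n, k=1, dtype=bool)  # mask out elements that come later
    rr[1:] = (m1 & m2).sum(1)
    return rr

print(relative_rank(np.array([2, 1, 4, 3])))  # [1 1 3 3]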
Your current algorithm takes quadratic time, which isn't going to scale to large inputs. You can do a lot better.
One way to do better would be to use a sorted data structure, like sortedcontainers.SortedList, and perform a series of lookups and insertions. The following example implementation returns a list, assumes no ties, and starts ranks from 0:
import sortedcontainers

def rank(nums):
    sortednums = sortedcontainers.SortedList()
    ranks = []
    for num in nums:
        ranks.append(sortednums.bisect_left(num))
        sortednums.add(num)
    return ranks
Most of the work is inside the SortedList implementation, and SortedList is pretty fast, so this shouldn't have too much Python overhead. The existence of sortedcontainers definitely makes this more convenient than the next option, if not necessarily more efficient.
This option runs in... O(n log n)-ish time. SortedList uses a two-layer hierarchy instead of a traditional tree structure, making a deliberate tradeoff of more data movement for less pointer chasing, so insertion isn't theoretically O(log n), but it's efficient in practice.
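Checking it against the question's example (these ranks start from 0, so add 1 to match the question's convention):
print(rank([2, 1, 4, 3]))                   # [0, 0, 2, 2]
print([r + 1 for r in rank([2, 1, 4, 3])])  # [1, 1, 3, 3]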
The next option would be to use an augmented mergesort. If you do this, you're going to want to use Numba or Cython, because you'll have to write the loops manually.
The basic idea is to do a mergesort, but tracking the rank of each element in its subarray as you go. When you merge two sorted subarrays, each element on the left side keeps its old rank, while the rank values for elements on the right side get adjusted upward for how many elements on the left were less than them.
This option runs in O(n log n).
An unoptimized implementation operating on Python lists, assuming no ties, and starting ranks at 0, would look like this:
def rank(nums):
    _, indexes, ranks = _augmented_mergesort(nums)
    result = [None] * len(nums)
    for i, rank_ in zip(indexes, ranks):
        result[i] = rank_
    return result

def _augmented_mergesort(nums):
    # returns sorted nums, indexes of sorted nums in original nums, and corresponding ranks
    if len(nums) == 1:
        return nums, [0], [0]
    left, right = nums[:len(nums)//2], nums[len(nums)//2:]
    return _merge(*_augmented_mergesort(left), *_augmented_mergesort(right))

def _merge(lnums, lindexes, lranks, rnums, rindexes, rranks):
    nums, indexes, ranks = [], [], []
    i_left = i_right = 0

    def add_from_left():
        nonlocal i_left
        nums.append(lnums[i_left])
        indexes.append(lindexes[i_left])
        ranks.append(lranks[i_left])
        i_left += 1

    def add_from_right():
        nonlocal i_right
        nums.append(rnums[i_right])
        indexes.append(rindexes[i_right] + len(lnums))
        ranks.append(rranks[i_right] + i_left)
        i_right += 1

    while i_left < len(lnums) and i_right < len(rnums):
        if lnums[i_left] < rnums[i_right]:
            add_from_left()
        elif lnums[i_left] > rnums[i_right]:
            add_from_right()
        else:
            raise ValueError("Tie detected")

    if i_left < len(lnums):
        # left leftovers keep their old indexes and ranks
        nums += lnums[i_left:]
        indexes += lindexes[i_left:]
        ranks += lranks[i_left:]
    else:
        # right leftovers still need their index and rank adjustments
        while i_right < len(rnums):
            add_from_right()

    return nums, indexes, ranks
For an optimized implementation, you'd want an insertion sort base case, you'd want to use Numba or Cython, you'd want to operate on arrays, and you'd want to not do so much allocation.
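For a flavor of the Numba part, here is a minimal sketch (an illustrative compiled quadratic base case of the kind you might use for short subarrays, not the full optimized mergesort):
import numba
import numpy as np

@numba.njit
def rank_small(nums):
    # O(n^2) rank-among-predecessors; cheap for short arrays once compiled
    n = len(nums)
    ranks = np.empty(n, dtype=np.int64)
    for i in range(n):
        r = 0
        for j in range(i):
            if nums[j] < nums[i]:
                r += 1
        ranks[i] = r
    return ranks

print(rank_small(np.array([2.0, 1.0, 4.0, 3.0])))  # [0 0 2 2]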
You are all my heroes, doing a great job! I'd like to show you a comparison of your solutions:
import numpy as np
import time
import sortedcontainers

def John(x):
    n = len(x)
    rr = np.ones(n)
    for i in range(1, n):
        rr[i] = sum(x[i] >= x[:i+1])
    return rr

def Matvei(x):
    return [sorted(x[:i+1]).index(v) + 1 for i, v in enumerate(x)]

def Divarkar1(x):
    n = len(x)
    rr = np.ones(n)
    m1 = x[1:, None] >= x
    m2 = np.tri(n-1, n, k=1, dtype=bool)
    rr[1:] = (m1 & m2).sum(1)
    return rr

def Divarkar2(x):
    n = len(x)
    rr = np.ones(n)
    m1 = x[1:, None] >= x
    m2 = np.tri(n-1, n, k=1, dtype=bool)
    (m1.astype(np.float32)[:, None, :] @ m2[:, :, None])[:, 0, 0]  # matmul variant; result unused
    rr[1:] = np.einsum('ij,ij->i', m1.astype(np.float32), m2)
    return rr

def Monica(x):
    sortednums = sortedcontainers.SortedList()
    ranks = []
    for num in x:
        ranks.append(sortednums.bisect_left(num))
        sortednums.add(num)
    return np.array(ranks) + 1

x = np.random.rand(4000)
for f in [John, Matvei, Divarkar1, Divarkar2, Monica]:
    t1 = time.time()
    rr = f(x)
    t2 = time.time()
    print(t2 - t1)
    #print(rr)
The results are:
19.5
2.9
0.079
0.25
0.017
I ran it several times and the results were similar. The best one is Monica's algorithm!
Many thanks to everyone!
John
When I converted all the algorithms to work on a numpy 2D array, I found that my algorithm performs best. Of course, performance also depends on the dimensions of the 2D array, but 380x900 is my case; I think it benefits a lot from numpy array calculation. Here is the code:
import numpy as np
import time
import sortedcontainers

def John(x):  # x is a 1D array
    n = len(x)
    rr = []
    for i in range(n):
        rr.append(np.sum(x[i] >= x[:i+1]))
    return np.array(rr)

def John_2D(rv):  # rv is a 2D numpy array; rank it along axis 1!
    nr, nc = rv.shape
    rr = []
    for i in range(nc):
        rr.append(np.sum((rv[:, :i+1] <= rv[:, i:i+1]), axis=1))
    return np.array(rr).T

def Matvei(x):  # x is a 1D array
    return [sorted(x[:i+1]).index(v) + 1 for i, v in enumerate(x)]

def Divarkar1(x):  # x is a 1D array
    n = len(x)
    rr = np.ones(n, dtype=int)
    m1 = x[1:, None] >= x
    m2 = np.tri(n-1, n, k=1, dtype=bool)
    rr[1:] = (m1 & m2).sum(1)
    return rr

def Divarkar2(x):  # x is a 1D array
    n = len(x)
    rr = np.ones(n, dtype=int)
    m1 = x[1:, None] >= x
    m2 = np.tri(n-1, n, k=1, dtype=bool)
    (m1.astype(np.float32)[:, None, :] @ m2[:, :, None])[:, 0, 0]  # matmul variant; result unused
    rr[1:] = np.einsum('ij,ij->i', m1.astype(np.float32), m2)
    return rr

def Monica1(nums):  # nums is a 1D array
    sortednums = sortedcontainers.SortedList()
    ranks = []
    for num in nums:
        ranks.append(sortednums.bisect_left(num))
        sortednums.add(num)
    return np.array(ranks) + 1

def Monica2(nums):  # nums is a 1D array
    _, indexes, ranks = _augmented_mergesort(nums)
    result = [None] * len(nums)
    for i, rank_ in zip(indexes, ranks):
        result[i] = rank_
    return np.array(result) + 1

def _augmented_mergesort(nums):  # nums is a 1D array
    # returns sorted nums, indexes of sorted nums in original nums, and corresponding ranks
    if len(nums) == 1:
        return nums, [0], [0]
    left, right = nums[:len(nums)//2], nums[len(nums)//2:]  # split the array in half
    return _merge(*_augmented_mergesort(left), *_augmented_mergesort(right))

def _merge(lnums, lindexes, lranks, rnums, rindexes, rranks):
    nums, indexes, ranks = [], [], []
    i_left = i_right = 0

    def add_from_left():
        nonlocal i_left
        nums.append(lnums[i_left])
        indexes.append(lindexes[i_left])
        ranks.append(lranks[i_left])
        i_left += 1

    def add_from_right():
        nonlocal i_right
        nums.append(rnums[i_right])
        indexes.append(rindexes[i_right] + len(lnums))
        ranks.append(rranks[i_right] + i_left)
        i_right += 1

    while i_left < len(lnums) and i_right < len(rnums):
        if lnums[i_left] < rnums[i_right]:
            add_from_left()
        elif lnums[i_left] > rnums[i_right]:
            add_from_right()
        else:
            raise ValueError("Tie detected")

    if i_left < len(lnums):
        while i_left < len(lnums):
            add_from_left()
        #nums += lnums[i_left:]
        #indexes += lindexes[i_left:]
        #ranks += lranks[i_left:]
    else:
        while i_right < len(rnums):
            add_from_right()

    return nums, indexes, ranks

def rank_2D(f, nums):  # f is a 1D ranking function, nums is a 2D numpy array
    result = []
    for x in nums:
        result.append(f(x))
    return np.array(result)

x = np.random.rand(6000)
for f in [John, Matvei, Divarkar1, Divarkar2, Monica1, Monica2]:
    t1 = time.time()
    rr = f(x)
    t2 = time.time()
    print(f'{f.__name__ + "_1D: ":16} {(t2-t1):.3f}')
print()

x = np.random.rand(380, 900)
t1 = time.time()
rr = John_2D(x)
t2 = time.time()
print(f'{"John_2D:":16} {(t2-t1):.3f}')
#print(rr)

for f in [Matvei, Divarkar1, Divarkar2, Monica1, Monica2]:
    t1 = time.time()
    rr = rank_2D(f, x)
    t2 = time.time()
    print(f'{f.__name__ + "_2D: ":16} {(t2-t1):.3f}')
    #print(rr)
The typical results are:
John_1D: 0.069
Matvei_1D: 7.208
Divarkar1_1D: 0.163
Divarkar2_1D: 0.488
Monica1_1D: 0.032
Monica2_1D: 0.082
John_2D: 0.409
Matvei_2D: 49.044
Divarkar1_2D: 1.276
Divarkar2_2D: 4.065
Monica1_2D: 1.090
Monica2_2D: 3.571
For a 1D array, Monica1's method is the best, but my numpy version is not bad either.
For a 2D array, my numpy version is the best.
You're welcome to test and comment.
Thanks
John
Recently I got a coding challenge where I was given some arrays like the following:
[(4, 5.6], (5, 9.1], [-2, -3.5]]
Here ( means that the interval is open on the left side, i.e., it does not include that endpoint but includes everything up to the other one. For example, (4, 5.6] does not include 4 but includes everything between 4 and 5.6, as well as 5.6 itself. I can merge the intervals if I have [ instead of ( with the following code. Based on my research, I cannot represent such an interval in numpy.
So, the first question is: how do I represent such an interval in my code? Or is it not an array at all, but represented in a different way?
def MergeIntervals(intervals):
    intervals.sort()
    i, L = 0, len(intervals) - 1
    while i < L:
        if intervals[i+1][0] <= intervals[i][1]:
            intervals[i+1][0] = intervals[i][0]
            intervals[i+1][1] = max(intervals[i][1], intervals[i+1][1])
            intervals[i] = None
        i += 1
    return [interval for interval in intervals if interval]

intervals = [[4,5.6],[5,9.1],[-2, -3.5]]
MergeIntervals(intervals)
[[-2, -3.5], [4, 9.1]]
This is the brute-force way I did it; I am sure the complexity can be improved.
However, I am not sure how to make it operate on intervals with open endpoints like the ones I got in the question.
I have not yet found any similar question and/or answer here.
Thank you, and I appreciate any help.
Represent every interval as a tuple: (begin point, is begin point included, end point, is end point included).
Assume that each input list is sorted and all of its partial intervals are legal.
Examples of legal interval lists:
i1 = [(1,True,1.5,False), (3,False,3.5,False), (3.5,False,5,True)]
i2 = [(0.02, False,1,False), (3,True,4,False),(5,False,7,True)]
BTW, I liked that question.
def mergeTupelToInterval(tup, inter):
    # tup lies entirely before the first interval
    if tup[2] < inter[0][0]:
        return [tup] + inter
    if tup[2] == inter[0][0] and not tup[3] and not inter[0][1]:
        return [tup] + inter
    # tup lies entirely after the last interval
    if tup[0] > inter[-1][2]:
        return inter + [tup]
    if tup[0] == inter[-1][2] and (not tup[1]) and (not inter[-1][3]):
        return inter + [tup]
    # collect the indices of all intervals that overlap or touch tup
    relevant = []
    for i in range(len(inter)):
        if inter[i][0] <= tup[0] < inter[i][2]:
            relevant.append(i)
            continue
        if tup[0] == inter[i][2] and (tup[1] or inter[i][3]):
            relevant.append(i)
            continue
        if inter[i][0] < tup[2] <= inter[i][2]:
            relevant.append(i)
            continue
        if tup[2] == inter[i][0] and (tup[3] or inter[i][1]):
            relevant.append(i)
    # merge tup with everything it overlaps, tracking the combined endpoints
    min = tup
    max = tup
    min_index = 0
    if len(relevant) > 1:
        relevant.reverse()  # pop from the back so earlier indices stay valid
    for i in relevant:
        if inter[i][0] <= tup[0]:
            min = inter[i]
            min_index = i
        if inter[i][2] >= tup[2]:
            max = inter[i]
        inter.pop(i)
    bool1 = min[1]
    if min[0] == tup[0]:
        bool1 = (min[1] or tup[1])
    bool2 = max[3]
    if max[2] == tup[2]:
        bool2 = (max[3] or tup[3])
    new_tup = (min[0], bool1, max[2], bool2)
    inter.insert(min_index, new_tup)
    return inter  # returned so the caller can rebind in every case

def merge_intervals(i1, i2):
    merged = i1.copy()
    for i in i2:
        merged = mergeTupelToInterval(i, merged)
    return merged

i1 = [(1, True, 1.5, False), (3, False, 3.5, False), (3.5, False, 5, True)]
i2 = [(0.02, False, 1, False), (3, True, 4, False), (5, False, 7, True)]
print(merge_intervals(i1, i2))
The output is:
[(0.02, False, 1.5, False), (3, True, 7, True)]
import itertools
import numpy as np

def models():
    default = [0.6,0.67,2.4e-2,1e-2,2e-5,1.2e-3,2e-5]
    lower = [np.log10(i/10) for i in default]
    upper = [np.log10(i*10) for i in default]
    n = 5
    a = np.logspace(lower[0], upper[0], n)
    b = np.logspace(lower[1], upper[1], n)
    c = np.logspace(lower[2], upper[2], n)
    d = np.logspace(lower[3], upper[3], n)
    e = np.logspace(lower[4], upper[4], n)
    f = np.logspace(lower[5], upper[5], n)
    g = np.logspace(lower[6], upper[6], n)
    combs = itertools.product(a, b, c, d, e, f, g)
    list1 = []
    for x in combs:
        list1.append(list(x))
    return list1
The code above returns a list of 5^7 = 78,125 lists. Is there a way I can combine the items in a, b, c, d, e, f, g, possibly randomly, to create a list of, say, 10,000 lists?
You could take random samples of each array and combine them, especially if you don't need to guarantee that specific combinations don't occur more than once:
import numpy as np
import random

def random_models(num_values):
    n = 5
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    ranges = zip((np.log10(i/10) for i in default),
                 (np.log10(i*10) for i in default))
    data_arrays = []
    for lower, upper in ranges:
        data_arrays.append(np.logspace(lower, upper, n))
    results = []
    for i in range(num_values):
        results.append([random.choice(arr) for arr in data_arrays])
    return results

l = random_models(10000)
print(len(l))
Here's a version that will avoid repeats, up until you request more data than can be given without repeating:
import itertools

def random_models_avoid_repeats(num_values):
    n = 5
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    # Build the range data (tuples of (lower, upper) range)
    ranges = zip((np.log10(i/10) for i in default),
                 (np.log10(i*10) for i in default))
    # Create the data arrays to sample from
    data_arrays = []
    for lower, upper in ranges:
        data_arrays.append(np.logspace(lower, upper, n))
    sequence_data = []
    for entry in itertools.product(*data_arrays):
        sequence_data.append(entry)
    results = []
    # Holds the current choices to choose from. The data will come from
    # sequence_data above, but randomly shuffled. Values are popped off the
    # end to keep things efficient. It's possible to ask for more data than
    # the samples can give without repeats. In that case, we'll reload
    # temp_data, randomly shuffle again, and start the process over until
    # we've delivered the number of desired results.
    temp_data = []
    # Build the lists
    for i in range(num_values):
        if len(temp_data) == 0:
            temp_data = sequence_data[:]
            random.shuffle(temp_data)
        results.append(temp_data.pop())
    return results
Also note that we can avoid building a results list entirely by turning this into a generator with yield. However, you'd then want to consume the results with a for statement:
def random_models_avoid_repeats_generator(num_values):
    n = 5
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    # Build the range data (tuples of (lower, upper) range)
    ranges = zip((np.log10(i/10) for i in default),
                 (np.log10(i*10) for i in default))
    # Create the data arrays to sample from
    data_arrays = []
    for lower, upper in ranges:
        data_arrays.append(np.logspace(lower, upper, n))
    sequence_data = []
    for entry in itertools.product(*data_arrays):
        sequence_data.append(entry)
    # Holds the current choices to choose from. The data will come from
    # sequence_data above, but randomly shuffled. Values are popped off the
    # end to keep things efficient. It's possible to ask for more data than
    # the samples can give without repeats. In that case, we'll reload
    # temp_data, randomly shuffle again, and start the process over until
    # we've delivered the number of desired results.
    temp_data = []
    # Build the lists
    for i in range(num_values):
        if len(temp_data) == 0:
            temp_data = sequence_data[:]
            random.shuffle(temp_data)
        yield temp_data.pop()
You'd have to use it like this:
for entry in random_models_avoid_repeats_generator(10000):
    # Do stuff...
    pass
Or manually iterate over it using next().
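For example, a minimal sketch of manual iteration:
gen = random_models_avoid_repeats_generator(10000)
first = next(gen)   # pull one combination
second = next(gen)  # and another; next() raises StopIteration once exhausted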