Say I have this type of array
y
array([299839, 667136, 665420, 665418, 665421, 667135, 299799, 665419, 667137, 299800])
as the result of a "top 10" argpartition:
y = np.argpartition(-x, np.arange(10))[:10]
Now, I want to remove the elements that are sequential, only keeping the first (maximum) element in the series such that:
y_new
array([299839, 667136, 665420, 299799])
But while that seems like it should be simple, I'm not seeing an efficient way to do it (or even a good way to start). Assume the real-world application will take the top 1000 or so and will need to do this many times.
Here's one approach based on sorting -
# Get the sorted indices
sidx = y.argsort()
# Get sorted array
ys = y[sidx]
# Get indices at which islands of sequential numbers start/stop
cut_idx = np.flatnonzero(np.concatenate(([True], np.diff(ys)!=1 )))
# Finally get the minimum indices for each island and then index into
# input for the desired output
y_new = y[np.minimum.reduceat(sidx, cut_idx)]
If you would like to keep the order of elements in the output, sort the indices and then index at the last step -
y[np.sort(np.minimum.reduceat(sidx, cut_idx))]
Sample input, output -
In [56]: y
Out[56]:
array([299839, 667136, 665420, 665418, 665421, 667135, 299799, 665419,
       667137, 299800])
In [57]: y_new
Out[57]: array([299799, 299839, 665420, 667136])
In [58]: y[np.sort(np.minimum.reduceat(sidx, cut_idx))]
Out[58]: array([299839, 667136, 665420, 299799])
Here's my implementation for that problem:
from itertools import groupby
from operator import itemgetter
a = [299839, 667136, 665420, 665418, 665421, 667135, 299799, 665419,
     667137, 299800]
new = a[:]
# to keep the first number
b = a[0]
new.sort()
# to store the different runs of consecutive numbers
saver = []
final_array = []
# consecutive values share the same (index - value) key in the sorted list
for k, g in groupby(enumerate(new), lambda pair: pair[0] - pair[1]):
    ac = list(map(itemgetter(1), g))
    saver.append(ac)
final_array.append(b)
# for each run, keep the element that appears first in the original list
for i in range(len(saver)):
    for j in range(len(a)):
        if a[j] in saver[i]:
            if b == a[j]:
                continue
            final_array.append(a[j])
            break
print(final_array)
Output:
[299839, 299799, 665420, 667136]
Related
I currently have the numbers above in a list. How would I go about grouping similar numbers (those within 850 of each other) and replacing each group with its average, to make the list smaller?
For example I have the list
l = [2000,2200,5000,2350]
In this list, I want to find the numbers that are similar to within n+500.
The numbers similar to within n+500 are 2000, 2200, and 2350, so I want to add them up and divide by their count (3) to find the mean. The mean then replaces the three numbers, so the list becomes l = [2183,5000].
The image above shows the numbers in my actual list. There I would like all the numbers that are close to within n+850 to be selected and their mean to be found.
It seems that you are looking for a clustering algorithm, something like K-means.
This algorithm is implemented in the scikit-learn package.
After you find your K means, you can count how many of your data points were clustered with each mean and do your computations.
However, it's not clear what K is in your case. You can try running the algorithm for several values of K until your constraint is met (the n+500 distance between the means).
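As a rough illustration of the K-means idea in one dimension, here is a minimal sketch in plain NumPy (not scikit-learn, so it stays self-contained). The function name kmeans_1d, the evenly spaced starting centers, and the choice k=2 are assumptions made for this example only.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Hypothetical helper: plain Lloyd iterations on 1-D data."""
    values = np.asarray(values, dtype=float)
    # Assumption: start with centers spread evenly over the data range.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its members (empty clusters stay put).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans_1d([2000, 2200, 5000, 2350], k=2)
print([int(c) for c in centers])
```

With the sample list, the three small values end up in one cluster and 5000 in the other. For real data you would still have to pick K, as noted above.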
You can use:
import numpy as np
l = np.array([2000,2200,5000,2350])
# bin the numbers into buckets of width 500 (numbers in the same bucket count as similar)
similar = l // 500
# for each bucket get the average and convert it to a plain int (as in the desired output)
new_list = [int(np.average(l[similar == num])) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
Step 1:
list = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
seen = []  # to log used index pairs
diff_dic = {}  # to record index pairs and their absolute difference
for i, a in enumerate(list):
    for j, b in enumerate(list):
        if i != j and (i, j)[::-1] not in seen:
            seen.append((i, j))
            diff_dic[(i, j)] = abs(a - b)
keys = []
for ind, diff in diff_dic.items():
    if diff <= 850:
        keys.append(ind)
uniques_k = []  # to record unique indices
for pair in keys:
    for key in pair:
        if key not in uniques_k:
            uniques_k.append(key)
import numpy as np
list_arr = np.array(list)
nearest_avg = np.mean(list_arr[uniques_k])
list_arr = np.delete(list_arr, uniques_k)
list_arr = np.append(list_arr, nearest_avg)
list_arr
output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])
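The same pairwise comparison can probably be written without the nested Python loops. The sketch below uses NumPy broadcasting; the function name merge_close and the small sample list are made up for illustration. Like the code above, it replaces every value that has at least one neighbour within the threshold by a single overall mean.

```python
import numpy as np

def merge_close(values, threshold=850):
    """Hypothetical helper mirroring the loop-based Step 2."""
    arr = np.asarray(values, dtype=float)
    # Absolute difference between every pair of values (n x n matrix).
    diff = np.abs(arr[:, None] - arr[None, :])
    # A value never counts as its own neighbour.
    np.fill_diagonal(diff, np.inf)
    has_neighbour = (diff <= threshold).any(axis=1)
    if not has_neighbour.any():
        return arr
    # Keep the isolated values and append the mean of all the others.
    return np.append(arr[~has_neighbour], arr[has_neighbour].mean())

merged = merge_close([100, 500, 5000, 5600, 20000])
print(merged)
```

Note that the n x n difference matrix costs memory quadratic in the list length, so this is only a sketch for lists of moderate size.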
You just need a conditional list comprehension like this:
l = [2000,2200,5000,2350]
n = 2000
a = [ (x) for x in l if ((n -250) < x < (n + 250)) ]
Then you can average with
np.mean(a)
or whatever method you prefer.
I want to get the borders of the data in a list using Python.
For example, I have this list:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
I want code that returns the data borders. For example:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
     ^       ^     ^         ^
b = get_border_index(a)
print(b)
output:
[0,4,7,12]
How can I implement a get_border_index(lst: list) -> list function?
The scalable answer that also works for very long lists or arrays is to use np.diff. In that case you should avoid a for loop at all costs.
import numpy as np
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
a = np.array(a)
# this is nonzero wherever there is a step
d = np.diff(a)
# boolean array marking where the steps are
is_step = d != 0
# get the indices of the steps (the first border, index 0, is trivial and not included)
ics = np.where(is_step)
# take the first dimension and shift by one, since you want
# the index of the element to the right of the step
ics_shift = ics[0] + 1
# and if you need a list
ics_list = ics_shift.tolist()
print(ics_list)
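To get exactly the output asked for in the question, the np.diff steps can be wrapped in a function that also reports index 0. This is a minimal sketch; the handling of the empty list is an assumption, since the question does not say what should happen there.

```python
import numpy as np

def get_border_index(lst):
    arr = np.asarray(lst)
    # Assumption: an empty input has no borders.
    if arr.size == 0:
        return []
    # Positions where the value changes, shifted to the right-hand element.
    steps = np.flatnonzero(np.diff(arr) != 0) + 1
    # Index 0 always starts the first run.
    return [0] + steps.tolist()

borders = get_border_index([1,1,1,1,4,4,4,6,6,6,6,6,1,1,1])
print(borders)
```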
You can use a for loop with enumerate:
def get_border_index(a):
    last_value = None
    result = []
    for i, v in enumerate(a):
        if v != last_value:
            last_value = v
            result.append(i)
    return result
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
b = get_border_index(a)
print(b)
Output
[0, 4, 7, 12]
This code checks whether each element of the list a differs from the element before it and, if so, appends that element's index to the result list.
Suppose I have the following two arrays:
>>> a = np.random.normal(size=(5,))
>>> a
array([ 1.42185826, 1.85726088, -0.18968258, 0.55150255, -1.04356681])
>>> b = np.random.normal(size=(10,10))
>>> b
array([[ 0.64207828, -1.08930317, 0.22795289, 0.13990505, -0.9936441 ,
1.07150754, 0.1701072 , 0.83970818, -0.63938211, -0.76914925],
[ 0.07776129, -0.37606964, -0.54082077, 0.33910246, 0.79950839,
0.33353221, 0.00967273, 0.62224009, -0.2007335 , -0.3458876 ],
[ 2.08751603, -0.52128218, 1.54390634, 0.96715102, 0.799938 ,
0.03702108, 0.36095493, -0.13004965, -1.12163463, 0.32031951],
[-2.34856521, 0.11583369, -0.0056261 , 0.80155082, 0.33421475,
-1.23644508, -1.49667424, -1.01799365, -0.58232326, 0.404464 ],
[-0.6289335 , 0.63654201, -1.28064055, -1.01977467, 0.86871352,
0.84909353, 0.33036771, 0.2604609 , -0.21102014, 0.78748329],
[ 1.44763687, 0.84205291, 0.76841512, 1.05214051, 2.11847126,
-0.7389102 , 0.74964783, -1.78074088, -0.57582084, -0.67956203],
[-1.00599479, -0.93125754, 1.43709533, 1.39308038, 1.62793589,
-0.2744919 , -0.52720952, -0.40644809, 0.14809867, -1.49267633],
[-1.8240385 , -0.5416585 , 1.10750423, 0.56598464, 0.73927224,
-0.54362927, 0.84243497, -0.56753587, 0.70591902, -0.26271302],
[-1.19179547, -1.38993415, -1.99469983, -1.09749452, 1.28697997,
-0.74650318, 1.76384156, 0.33938808, 0.61647274, -0.42166111],
[-0.14147554, -0.96192206, 0.14434349, 1.28437894, -0.38865447,
-1.42540195, 0.93105528, 0.28993325, -1.16119916, -0.58244758]])
I have to find a way to round all values from b to the nearest value found in a.
Does anyone know of a good way to do this with python? I am at a total loss myself.
Here is something you can try
import numpy as np
def rounder(values):
    def f(x):
        idx = np.argmin(np.abs(values - x))
        return values[idx]
    return np.frompyfunc(f, 1, 1)
a = np.random.normal(size=(5,))
b = np.random.normal(size=(10,10))
rounded = rounder(a)(b)
print(rounded)
The rounder function takes the values which we want to round to. It creates a function which takes a scalar and returns the closest element from the values array. We then transform this function to a broadcast-able function using numpy.frompyfunc. This way you are not limited to using this on 2d arrays, numpy automatically does broadcasting for you without any loops.
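One detail worth knowing when using this: a function built with numpy.frompyfunc returns an array of dtype object, so you may want to cast the result back to float. A small sketch with made-up sample values (the arrays below are illustrative, not the random data from the question):

```python
import numpy as np

def rounder(values):
    def f(x):
        idx = np.argmin(np.abs(values - x))
        return values[idx]
    return np.frompyfunc(f, 1, 1)

a = np.array([-1.0, 0.0, 2.5])
b = np.array([[-0.9, 0.2], [1.3, 9.0]])
rounded = rounder(a)(b)                # dtype is object here
rounded_float = rounded.astype(float)  # cast back to a numeric array
print(rounded.dtype, rounded_float.dtype)
print(rounded_float)
```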
If you sort a you can use bisect to find the index in array a where each element from the sub arrays of array b would land:
import numpy as np
from bisect import bisect
a = np.random.normal(size=(5,))
b = np.random.normal(size=(10, 10))
a.sort()
size = a.size
for sub in b:
    for ind2, ele in enumerate(sub):
        i = bisect(a, ele, hi=size-1)
        i1, i2 = a[i], a[i-1]
        sub[ind2] = i1 if abs(i1 - ele) < abs(i2 - ele) else i2
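The same sorted-lookup idea can be sketched without Python loops by using np.searchsorted, which accepts a whole array of query values. The function name round_to_nearest and the sample arrays are assumptions for illustration, and the sketch assumes a has at least two elements.

```python
import numpy as np

def round_to_nearest(a, b):
    """Hypothetical helper; assumes a holds at least two values."""
    a_sorted = np.sort(np.asarray(a, dtype=float))
    b = np.asarray(b, dtype=float)
    # Insertion position of each b value, clipped so idx-1 and idx are both valid.
    idx = np.clip(np.searchsorted(a_sorted, b), 1, a_sorted.size - 1)
    left, right = a_sorted[idx - 1], a_sorted[idx]
    # Pick whichever neighbour is closer (ties go to the smaller value).
    return np.where(b - left <= right - b, left, right)

a = np.array([2.5, -1.0, 0.0])
b = np.array([[-0.9, 0.2], [1.3, 9.0], [-5.0, 1.25]])
result = round_to_nearest(a, b)
print(result)
```

Unlike the loop above, this returns a new array and leaves b unchanged.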
This solution assumes that a will always be one-dimensional; b can have any number of dimensions.
Create two temporary arrays tiling a and b into the dimensions of the other (here both will now have a shape of (5,10,10)).
at = np.tile(np.reshape(a, (-1, *list(np.ones(len(b.shape)).astype(int)))), (1, *b.shape))
bt = np.tile(b, (a.size, *list(np.ones(len(b.shape)).astype(int))))
For the nearest operation, you can take the absolute value of the difference between the two. The minimum value of that operation in the first dimension (dimension 0) gives the index in the a array.
idx = np.argmin(np.abs(at-bt),axis=0)
All that is left is to select the values from array a using the index, which will return an array in the shape of b with the nearest values from a.
ans = a[idx]
This method can also be used (modifying how the index is calculated) to do other operations, such as a floor, ceil, etc.
Note that this solution can be memory intensive, which is not much of an issue with small arrays. A looping solution could be less memory intensive at the cost of speed.
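If the explicit tiling is the main concern, the same argmin can likely be obtained by letting NumPy broadcast a reshaped view of a against b. This is a sketch with small made-up arrays; the intermediate difference array still has shape (a.size, *b.shape), so the peak memory is similar, but the two tiled copies are avoided.

```python
import numpy as np

a = np.array([-1.0, 0.0, 2.5])
b = np.array([[-0.9, 0.2], [1.3, 9.0]])
# Give a one leading axis plus one length-1 axis per dimension of b: shape (3, 1, 1).
a_col = a.reshape((-1,) + (1,) * b.ndim)
# Broadcasting produces the (3, 2, 2) difference array without np.tile.
idx = np.argmin(np.abs(a_col - b), axis=0)
# Select the nearest values; the result has the shape of b.
ans = a[idx]
print(ans)
```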
I don't know Numpy, but I don't think knowledge of Numpy is needed to be able to answer this question. Assuming that an array can be iterated and modified in the same way as a list, the following code solves your problem by using a nested loop to find the closest value.
for i in range(len(b)):
    for k in range(len(b[i])):
        closest = a[0]
        for j in range(1, len(a)):
            if abs(a[j] - b[i][k]) < abs(closest - b[i][k]):
                closest = a[j]
        b[i][k] = closest
Disclaimer: a more pythonic approach may exist.
That's what I get:
TypeError: 'float' object is unsubscriptable
That's what I did:
import numpy as N
import itertools
#I created two lists, containing large amounts of numbers, i.e. 3.465
lx = [3.625, 4.625, ...]
ly = [41.435, 42.435, ...] #The lists are not the same size!
xy = list(itertools.product(lx,ly)) #create a nice "table" of my lists
# itertools gives me something like
print xy
[(3.625, 41.435), (3.625, 42.435), (... , ..), ... ]
print xy[0][0]
print xy[0][1] # that works just fine, I can access the various values of the tuple in the list
# down here is where the error occurs
# I basically try to access certain points in "lon"/"lat" with values from xy through `b` and `v` with that iteration. lon/lat are read earlier in the script
b = -1
v = 1
for l in xy:
    b += 1
    idx = N.where(lon==l[b][b])[0][0]
    idy = N.where(lat==l[b][v])[0][0]
lon/lat are read earlier in the script. I am working with a netCDF file, and these are the longitude/latitude values, read into lon/lat.
Each is an array built with numpy.
Where is the mistake?
I tried to convert b and v with int() to integers, but that did not help.
The N.where call uses the value from xy to look up a certain point on a grid, which I then want to work with. If you need more code or some plots, please let me know.
Your problem is that when you loop over xy, each value of l is a single element of your xy list, one of the tuples. The value of l in the first iteration of the loop is (3.625, 41.435), the second is (3.625, 42.435), and so on.
When you do l[b] (b is 0 in the first iteration), you get 3.625. When you do l[b][b], you try to get the first element of 3.625, but that is a float, so it cannot be indexed. That gives you an error.
To put it another way, in the first iteration of the loop, l is the same as xy[0], so l[0] is the same as xy[0][0]. In the second iteration, l is the same as xy[1], so l[0] is the same as xy[1][0]. In the third iteration, l is equivalent to xy[2], and so on. So in the first iteration, l[0][0] is the same as xy[0][0][0], but there is no such thing so you get an error.
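The first iteration can be reproduced in isolation with a short sketch (the two-tuple list is just the start of the xy list from the question). The exact wording of the error message differs between Python versions, so it is caught and printed rather than quoted:

```python
xy = [(3.625, 41.435), (3.625, 42.435)]
l = xy[0]          # what the loop variable holds in the first iteration
first = l[0]       # 3.625, a plain float
try:
    l[0][0]        # same as trying to index into 3.625
    message = None
except TypeError as exc:
    message = str(exc)
print(first)
print(message)
```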
To get the first and second values of the tuple, using the indexing approach you could just do:
x = l[0]
y = l[1]
Or, in your case:
for l in xy:
    idx = N.where(lon==l[0])[0][0]
    idy = N.where(lat==l[1])[0][0]
However, the simplest solution would be to use what is called "tuple unpacking":
for x, y in xy:
    idx = N.where(lon==x)[0][0]
    idy = N.where(lat==y)[0][0]
This is equivalent to:
for l in xy:
    x, y = l
    idx = N.where(lon==x)[0][0]
    idy = N.where(lat==y)[0][0]
which in turn is equivalent to:
for l in xy:
    x = l[0]
    y = l[1]
    idx = N.where(lon==x)[0][0]
    idy = N.where(lat==y)[0][0]
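For a version you can run on its own, here is a small sketch of the tuple-unpacking loop. The lon and lat arrays below are hypothetical stand-ins for the netCDF coordinate arrays, and numpy is imported as np rather than N.

```python
import itertools
import numpy as np

# Hypothetical coordinate axes standing in for the netCDF lon/lat arrays.
lon = np.array([3.625, 4.625, 5.625])
lat = np.array([41.435, 42.435])
xy = list(itertools.product([3.625, 4.625], [41.435, 42.435]))
indices = []
for x, y in xy:
    # Position of each coordinate value along its axis.
    idx = np.where(lon == x)[0][0]
    idy = np.where(lat == y)[0][0]
    indices.append((int(idx), int(idy)))
print(indices)
```

The exact == comparison works here because the same float literals appear on both sides; with coordinates computed in different ways, a tolerance-based comparison may be safer.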
I have a block of code which does the following:
take a float from a list, b_lst below, of index indx
check if this float is located between a float of index i and the next one (of index i+1) in list a_lst
if it is, then store indx in a sub-list of a third list (c_lst) where the index of that sub-list is the index of the left float in a_lst (ie: i)
repeat for all floats in b_lst
Here's a MWE which shows what the code does:
import numpy as np
import timeit
def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N).tolist()
# Data lists.
# Note that a_lst is sorted.
a_lst = np.sort(random_data(1000))
b_lst = random_data(5000)
# Fixed index value (int)
c = 25
def func():
    # Create empty list with as many sub-lists as elements present
    # in a_lst beyond the 'c' index.
    c_lst = [[] for _ in range(len(a_lst[c:])-1)]
    # For each element in b_lst.
    for indx,elem in enumerate(b_lst):
        # For elements in a_lst beyond the 'c' index.
        for i in range(len(a_lst[c:])-1):
            # Check if 'elem' is between this a_lst element
            # and the next.
            if a_lst[c+i] < elem <= a_lst[c+(i+1)]:
                # If it is then store the index of 'elem' ('indx')
                # in the 'i' sub-list of c_lst.
                c_lst[i].append(indx)
    return c_lst
print(func())
# time function.
func_time = timeit.timeit(func, number=10)
print(func_time)
This code works as it should but I really need to improve its performance since it's slowing down the rest of my code.
Add
This is the optimized function based on the accepted answer. It's quite ugly but it gets the job done.
def func_opt():
    c_lst = [[] for _ in range(len(a_lst[c:])-1)]
    c_opt = np.searchsorted(a_lst[c:], b_lst, side='left')
    for elem in c_opt:
        if 0<elem<len(a_lst[c:]):
            c_lst[elem-1] = np.where(c_opt==elem)[0].tolist()
    return c_lst
In my tests this is ~7x faster than the original function.
Add 2
Much faster not using np.where:
def func_opt2():
    c_lst = [[] for _ in range(len(a_lst[c:])-1)]
    c_opt = np.searchsorted(a_lst[c:], b_lst, side='left')
    for indx,elem in enumerate(c_opt):
        if 0<elem<len(a_lst[c:]):
            c_lst[elem-1].append(indx)
    return c_lst
This is ~130x faster than the original function.
Add 3
Following jtaylor's advice I converted the result of np.searchsorted to a list with .tolist():
def func_opt3():
    c_lst = [[] for _ in range(len(a_lst[c:])-1)]
    c_opt = np.searchsorted(a_lst[c:], b_lst, side='left').tolist()
    for indx,elem in enumerate(c_opt):
        if 0<elem<len(a_lst[c:]):
            c_lst[elem-1].append(indx)
    return c_lst
This is ~470x faster than the original function.
You want to take a look at numpy's searchsorted. Calling
np.searchsorted(a_lst, b_lst, side='right')
will return an array of indices, the same length as b_lst, giving for each item of b_lst the position in a_lst before which it should be inserted to preserve order. It will be very fast, as it uses binary search and the looping happens in C. You could then create your subarrays with fancy indexing, e.g.:
>>> a = np.arange(1, 10)
>>> b = np.random.rand(100) * 10
>>> c = np.searchsorted(a, b, side='right')
>>> b[c == 0]
array([ 0.54620226, 0.40043875, 0.62398925, 0.40097674, 0.58765603,
0.14045264, 0.16990249, 0.78264088, 0.51507254, 0.31808327,
0.03895417, 0.92130027])
>>> b[c == 1]
array([ 1.34599709, 1.42645778, 1.13025996, 1.20096723, 1.75724448,
1.87447058, 1.23422399, 1.37807553, 1.64118058, 1.53740299])
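If what you need is the list of indices per interval (as in the question) rather than the values, one possible way to build all sub-lists at once is to sort the searchsorted result and slice it. This is a sketch, not a benchmarked solution: the function name group_indices and the sample data are assumptions, it uses side='left' to match the a < elem <= a_next condition from the question, and it ignores the c offset (pass a_lst[c:] to include it).

```python
import numpy as np

def group_indices(a_sorted, b):
    """Hypothetical helper: indices of b grouped by interval of a_sorted."""
    a_sorted = np.asarray(a_sorted)
    n = a_sorted.size
    # bins[k] == i means a_sorted[i-1] < b[k] <= a_sorted[i].
    bins = np.searchsorted(a_sorted, b, side='left')
    # A stable sort keeps the original order of b inside every bin.
    order = np.argsort(bins, kind='stable')
    # Start offset of bins 1..n inside the sorted bin numbers.
    starts = np.searchsorted(bins[order], np.arange(1, n + 1), side='left')
    # One sub-list per interval between consecutive elements of a_sorted.
    return [order[s:e].tolist() for s, e in zip(starts[:-1], starts[1:])]

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 1.5, 2.5, 1.2, 3.0, 4.5, 2.0]
groups = group_indices(a, b)
print(groups)
```

Values below the first or above the last element of a_sorted fall outside every interval and are dropped, as in the original loop.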