Reverse indices of a sorted list - python

I want to return the 'reverse' indices of a sorted list. What I mean by that is: I have an unsorted list U and I sort it via S = sorted(U). I can get the sort indices idx such that U[idx[i]] == S[i], but what I want is Ridx such that S[Ridx[i]] == U[i].
Here is a small example:
U=[5,2,3,1,4]
S=sorted(U)
idx = [U.index(S[i]) for i in range(len(U))]
>>> idx
[3, 1, 2, 4, 0]
Ridx = [S.index(U[i]) for i in range(len(U))]
>>> Ridx
[4, 1, 2, 0, 3]
>>>[U[idx[i]] for i in range(len(U))] == S
True
>>>[S[Ridx[i]] for i in range(len(U))] == U
True
What I need is an efficient way to get Ridx.
Thanks!
Edit:
All right! I ran a quick speed test of both solutions that answered the question (from @Jon Clements and @Whatang).
The script:
import datetime as DT
import random
U=[int(1000*random.random()) for i in xrange(pow(10,8))]
S=sorted(U)
idx = sorted(xrange(len(U)), key=U.__getitem__)
T0 = DT.datetime.now()
ridx = sorted(xrange(len(U)), key=idx.__getitem__)
print [S[ridx[i]] for i in range(len(U))]==U
elapsed = DT.datetime.now()-T0
print str(elapsed)
print '==============='
T0 = DT.datetime.now()
ridx = [ y for (x,y) in sorted(zip(idx, range(len(idx)))) ]
print [S[ridx[i]] for i in range(len(U))]==U
elapsed = DT.datetime.now()-T0
print str(elapsed)
And the results:
True
0:02:45.278000
===============
True
0:06:48.889000
Thank you all for the quick and meaningful help!

The most efficient I can think of (short of possibly looking to numpy) that gets rid of the .index and can be used for both idx and ridx:
U=[5,2,3,1,4]
idx = sorted(xrange(len(U)), key=U.__getitem__)
ridx = sorted(xrange(len(U)), key=idx.__getitem__)
# [3, 1, 2, 4, 0] [4, 1, 2, 0, 3]

Not quite the data structure you asked for, but I think this gets the info you want:
>>> sorted(x[::-1] for x in enumerate(['z', 'a', 'c', 'x', 'm']))
[('a', 1), ('c', 2), ('m', 4), ('x', 3), ('z', 0)]

With numpy you can do
>>> import numpy as np
>>> U = [5, 2, 3, 1, 4]
>>> np.array(U).argsort().argsort()
array([4, 1, 2, 0, 3])
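The double argsort sorts twice. As a possible alternative, the second sort can be replaced by a scatter assignment, because the inverse of a permutation only records where each index ended up. A minimal sketch, where the helper name inverse_sort_indices is made up for illustration and is not part of numpy:

```python
import numpy as np

def inverse_sort_indices(values):
    """Return ridx such that np.sort(values)[ridx] equals values."""
    values = np.asarray(values)
    # Stable sort so that equal values keep their original relative order.
    idx = np.argsort(values, kind="stable")
    ridx = np.empty_like(idx)
    # Element idx[k] of the input lands at sorted position k.
    ridx[idx] = np.arange(len(idx))
    return ridx

print(inverse_sort_indices([5, 2, 3, 1, 4]))  # [4 1 2 0 3]
```

This does one O(n log n) sort plus one O(n) assignment instead of two sorts.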

Assuming you already have the list idx, you can do
ridx = [ y for (x,y) in sorted(zip(idx, range(len(idx)))) ]
Then for all i from 0 to len(U)
S[ridx[i]] == U[i]
You can avoid the sort if you use a dictionary:
ridx_dict = dict(zip(idx, range(len(idx))))
which can then be converted to a list:
ridx = [ ridx_dict[k] for k in range(len(idx)) ]
Thinking about permutations is the key to this problem. One way of writing down a permutation is to write all the indexes in order on one line, then on the line below write the new index of the element with that index. e.g., for your example
0 1 2 3 4
3 1 2 4 0
This second line is your idx list. You read down the columns, so the element which starts at index 0 moves to index 3, the element which starts at index 1 stays at index 1, and so on.
The inverse permutation is the ridx you're looking for. To find this, sort the lower line of your permutation keeping columns together, then write down the new top line. So the example becomes:
4 1 2 0 3
0 1 2 3 4
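The column-swapping described above can also be done without any sorting, by walking over idx once and writing each position into its slot. A minimal sketch in Python 3 (so range instead of xrange); the function name invert_permutation is only illustrative:

```python
def invert_permutation(idx):
    """Return ridx with ridx[idx[k]] == k for every position k."""
    ridx = [0] * len(idx)
    for sorted_pos, original_pos in enumerate(idx):
        # The element at original_pos in U sits at sorted_pos in S.
        ridx[original_pos] = sorted_pos
    return ridx

U = [5, 2, 3, 1, 4]
S = sorted(U)
idx = sorted(range(len(U)), key=U.__getitem__)
ridx = invert_permutation(idx)
print(idx, ridx)  # [3, 1, 2, 4, 0] [4, 1, 2, 0, 3]
```

This assumes idx really is a permutation of 0..len(U)-1, which holds when it comes from sorting the index range as shown.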

If I understand the question correctly (which I didn't) I think U.index(S[i]) is what you are looking for
EDIT: so I guess you could save a dictionary of the original indices and keep the retrieval syntax pretty simple
OIDX = {U[i]: i for i in range(0, len(U))}
S = sorted(U)
OIDX[S[i]]

Related

Finding indices of first non-zero items in a list

I have the following list :
list_test = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
I would like to find the indices of all the first numbers in the list that are not equal to zero.
In this case the output should be:
output = [3,5,10]
Is there a Pythonic way to do this?
According to the output, I think you want the first index of each contiguous run of non-zero values.
As for Pythonic, I take that to mean a list comprehension, although this one is not very readable.
# Also works when the list starts with a non-zero element.
# list_test = [1, 0, 0, 1, 0, 2, 5, 4, 0, 0, 5, 5, 3, 0, 0]
list_test = [0, 0, 0, 1, 0, 2, 5, 4, 0, 0, 5, 5, 3, 0, 0]
output = [i for i in range(len(list_test)) if list_test[i] != 0 and (i == 0 or list_test[i - 1] == 0)]
print(output)
There is also a numpy based solution:
import numpy as np
l = np.array([0,0,0,1,0,2,5,4,0,0,5,5,3,0,0])
non_zeros = np.where(l != 0)[0]
diff = np.diff(non_zeros)
np.append(non_zeros [0], non_zeros [1 + np.where(diff>=2)[0]]) # array([ 3, 5, 10], dtype=int64)
Explanation:
First we find the positions of the non-zero elements, then we compute the pairwise differences between those positions. A difference of 2 or more marks a gap, so the position right after it starts a new run; we add 1 to the index because np.diff computes out[i] = a[i+1] - a[i] (see the np.diff documentation). Finally we prepend the first non-zero position, which always starts a run.
Note:
It also works when the array starts with a non-zero element or contains only non-zero elements.
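An alternative way to express the same idea, sketched here under the assumption that a 1-D sequence is passed in, is to compare each element with its predecessor directly. The function name first_nonzero_of_runs is made up for this example:

```python
import numpy as np

def first_nonzero_of_runs(values):
    """Indices where a non-zero value follows a zero (or starts the array)."""
    arr = np.asarray(values)
    nonzero = arr != 0
    # previous[i] holds nonzero[i - 1]; index 0 has no predecessor, so False.
    previous = np.zeros_like(nonzero)
    previous[1:] = nonzero[:-1]
    return np.flatnonzero(nonzero & ~previous)

print(first_nonzero_of_runs([0, 0, 0, 1, 0, 2, 5, 4, 0, 0, 5, 5, 3, 0, 0]))  # [ 3  5 10]
```

Unlike indexing non_zeros[0], this version returns an empty array when there are no non-zero elements at all.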
From the Link.
l = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
v = {}
for i, x in enumerate(l):
    if x != 0 and x not in v:
        v[x] = i
list_test = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
res = {}
for index, item in enumerate(list_test):
    if item > 0:
        res.setdefault(index, None)
print(res.keys())
I don't know what you mean by a Pythonic way, but this is an answer using a simple loop:
list_test = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
out = []
if list_test[0] != 0:  # a leading non-zero value starts the first run
    out.append(0)
for i in range(1, len(list_test)):
    if (list_test[i-1] == 0) and (list_test[i] != 0):
        out.append(i)
Feel free to clarify what you mean by "Pythonic"!

How to merge an array with its array elements in Python?

I have an array like the one below:
constants = ['(1,2)', '(1,5,1)', '1']
I would like to transform it into this:
constants = [(1,2), 1, 2, 3, 4, 5, 1]
To do this, I tried the following:
from ast import literal_eval
import numpy as np
constants = literal_eval(str(constants).replace("'",""))
constants = [(np.arange(*i) if len(i)==3 else i) if isinstance(i, tuple) else i for i in constants]
And the output was:
constants = [(1, 2), array([1, 2, 3, 4]), 1]
This is not the expected result, and I'm stuck at this step. The question is: how can I merge the inner array into its parent array?
This is one approach.
Demo:
from ast import literal_eval
constants = ['(1,2)', '(1,5,1)', '1']
res = []
for i in constants:
    val = literal_eval(i)  # Convert to a Python object
    if isinstance(val, tuple):  # Check if the element is a tuple
        if len(val) == 3:  # Check if the tuple has 3 elements
            val = list(val)
            val[1] += 1
            res.extend(range(*val))
            continue
    res.append(val)
print(res)
Output:
[(1, 2), 1, 2, 3, 4, 5, 1]
I'm going to assume that this question is very literal, and that you always want to transform this:
constants = ['(a, b)', '(x, y, z)', 'i']
into this:
transformed = [(a,b), x, x+z, x+2*z, ..., y, i]
such that the second tuple is a range from x to y with step z. So your final transformed array is the first element, then the range defined by your second element, and then your last element. The easiest way to do this is simply step-by-step:
constants = ['(a, b)', '(x, y, z)', 'i']
literals = [eval(k) for k in constants] # get rid of the strings
part1 = [literals[0]] # individually make each of the three parts of your list
part2 = [k for k in range(literals[1][0], literals[1][1] + 1, literals[1][2])] # or, if you don't need to include y, you could just do range(*literals[1])
part3 = [literals[2]]
transformed = part1 + part2 + part3
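The step-by-step version above is written with placeholder names. A runnable sketch with the concrete strings from the question could look like this; it uses ast.literal_eval rather than eval, treats every 3-tuple as (start, stop, step), and assumes a positive step with an inclusive stop. The function name expand_constants is hypothetical:

```python
from ast import literal_eval

def expand_constants(constants):
    """Parse each string; expand 3-tuples into an inclusive range."""
    result = []
    for text in constants:
        value = literal_eval(text)
        if isinstance(value, tuple) and len(value) == 3:
            start, stop, step = value
            # Assumption: step is positive and stop is meant to be included.
            result.extend(range(start, stop + 1, step))
        else:
            result.append(value)
    return result

print(expand_constants(['(1,2)', '(1,5,1)', '1']))  # [(1, 2), 1, 2, 3, 4, 5, 1]
```

Unlike the positional version, this does not depend on where the 3-tuple sits in the list.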
I propose the following:
res = []
for cst in constants:
    if isinstance(cst,tuple) and (len(cst) == 3):
        #add the range to the list
        res.extend(range(cst[0],cst[1], cst[2]))
    else:
        res.append(cst)
res has the result you want.
There may be a more elegant way to solve it.
The code below handles the parsing described above:
from ast import literal_eval
constants = ['(1,2)', '(1,5,1)', '1']
processed = []
for index, c in enumerate(constants):
    parsed = literal_eval(c)
    if isinstance(parsed, (tuple, list)) and index != 0:
        processed.extend(range(1, max(parsed) + 1))
    else:
        processed.append(parsed)
print processed # [(1, 2), 1, 2, 3, 4, 5, 1]

find the start position of the longest sequence of 1's

I want to find the start position of the longest sequence of 1's in my array:
a1=[0,0,1,1,1,1,0,0,1,1]
#2
I am following this answer to find the length of the longest sequence. However, I was not able to determine the position.
Inspired by this solution, here's a vectorized approach to solve it -
import numpy as np
# a1 is assumed to be a NumPy array, e.g. np.array([0,0,1,1,1,1,0,0,1,1])
# Get start, stop index pairs for islands/seq. of 1s
idx_pairs = np.where(np.diff(np.hstack(([False],a1==1,[False]))))[0].reshape(-1,2)
# Get the island lengths, whose argmax would give us the ID of longest island.
# Start index of that island would be the desired output
start_longest_seq = idx_pairs[np.diff(idx_pairs,axis=1).argmax(),0]
Sample run -
In [89]: a1 # Input array
Out[89]: array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1])
In [90]: idx_pairs # Start, stop+1 index pairs
Out[90]:
array([[ 2, 6],
[ 8, 10]])
In [91]: np.diff(idx_pairs,axis=1) # Island lengths
Out[91]:
array([[4],
[2]])
In [92]: np.diff(idx_pairs,axis=1).argmax() # Longest island ID
Out[92]: 0
In [93]: idx_pairs[np.diff(idx_pairs,axis=1).argmax(),0] # Longest island start
Out[93]: 2
A more compact one-liner using groupby(). It uses enumerate() on the raw data to keep the starting positions through the analysis pipeline, eventually ending up with the list of tuples [(2, 4), (8, 2)], each tuple containing the starting position and length of a non-zero run:
from itertools import groupby
L = [0,0,1,1,1,1,0,0,1,1]
print max(((lambda y: (y[0][0], len(y)))(list(g)) for k, g in groupby(enumerate(L), lambda x: x[1]) if k), key=lambda z: z[1])[0]
lambda: x is the key function for groupby() since we enumerated L
lambda: y packages up results we need since we can only evaluate g once, without saving
lambda: z is the key function for max() to pull out the lengths
Prints '2' as expected.
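For readers who find the one-liner hard to follow, the same groupby idea can be unrolled into a loop. This is only a sketch in Python 3; the function name longest_run_start and the choice to return (-1, 0) when no run exists are assumptions made for the example:

```python
from itertools import groupby

def longest_run_start(values, target=1):
    """Return (start, length) of the longest run of target; first run wins ties."""
    best_start, best_len = -1, 0
    pos = 0  # index of the first element of the current group
    for key, group in groupby(values):
        length = sum(1 for _ in group)
        if key == target and length > best_len:
            best_start, best_len = pos, length
        pos += length
    return best_start, best_len

print(longest_run_start([0, 0, 1, 1, 1, 1, 0, 0, 1, 1]))  # (2, 4)
```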
This seems to work, using groupby from itertools, this only goes through the list once:
from itertools import groupby
pos, max_len, cum_pos = 0, 0, 0
for k, g in groupby(a1):
    if k == 1:
        pat_size = len(list(g))
        pos, max_len = (pos, max_len) if pat_size < max_len else (cum_pos, pat_size)
        cum_pos += pat_size
    else:
        cum_pos += len(list(g))
pos
# 2
max_len
# 4
You could use a for loop and check if the next few items (of length m where m is the max length) are the same as the maximum length:
# Using your list and the answer from the post you referred
from itertools import groupby
L = [0,0,1,1,1,1,0,0,1,1]
m = max(sum(1 for i in g) for k, g in groupby(L))
# Here is the for loop
for i, s in enumerate(L):
    if len(L) - i + 2 < len(L) - m:
        break
    if s == 1 and 0 not in L[i:i+m]:
        print i
        break
This will give:
2
Another way of doing it in a single loop, without resorting to itertools' groupby:
max_start = 0
max_reps = 0
start = 0
reps = 0
for (pos, val) in enumerate(a1):
    start = pos if reps == 0 else start
    reps = reps + 1 if val == 1 else 0
    max_reps = max(reps, max_reps)
    max_start = start if reps == max_reps else max_start
This could also be done in a one-liner fashion using reduce:
max_start = reduce(lambda (max_start, max_reps, start, reps), (pos, val): (start if reps == max(reps, max_reps) else max_start, max(reps, max_reps), pos if reps == 0 else start, reps + 1 if val == 1 else 0), enumerate(a1), (0, 0, 0, 0))[0]
In Python 3, you cannot unpack tuples inside the lambda arguments definition, so it's preferable to define the function using def first:
from functools import reduce  # reduce is no longer a builtin in Python 3
def func(acc, x):
    max_start, max_reps, start, reps = acc
    pos, val = x
    return (start if reps == max(reps, max_reps) else max_start,
            max(reps, max_reps),
            pos if reps == 0 else start,
            reps + 1 if val == 1 else 0)
max_start = reduce(func, enumerate(a1), (0, 0, 0, 0))[0]
In any of the three cases, max_start gives your answer (i.e. 2).
Using more_itertools, a third-party library:
Given
import itertools as it
import more_itertools as mit
lst = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
Code
longest_contiguous = max([tuple(g) for _, g in it.groupby(lst)], key=len)
longest_contiguous
# (1, 1, 1, 1)
pred = lambda w: w == longest_contiguous
next(mit.locate(mit.windowed(lst, len(longest_contiguous)), pred=pred))
# 2
See also the more_itertools.locate docstring for details on how these tools work.
For another solution that uses only Numpy, I think this should work in all the cases. The most upvoted solution is probably faster though.
tmp = np.cumsum(np.insert(np.array(a1) != 1, 0, False)) # value of tmp[i+1] was not incremented when a1[i] is 1
# [0, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4]
values, counts = np.unique(tmp, return_counts=True)
# [0, 1, 2, 3, 4], [1, 1, 5, 1, 3]
counts_idx = np.argmax(counts)
longest_sequence_length = counts[counts_idx] - 1
# 4
longest_sequence_idx = np.argmax(tmp == values[counts_idx])
# 2
I've implemented a run-searching function for numpy arrays in haggis.npy_util.mask2runs. You can use it like this:
runs, lengths = mask2runs(a1, return_lengths=True)
result = runs[lengths.argmax(), 0]

Algorithm to offset a list of data

Given a list of data as follows:
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
I would like to create an algorithm that is able to offset the list by a certain number of steps. For example, if the offset = -1:
def offsetFunc(inputList, offsetList):
    #make something
    return output
where:
output = [0,0,0,0,1,1,5,5,5,5,5,5,3,3,3,2,2]
Important Note: The elements of the list are float numbers and they are not in any progression. So I actually need to shift them, I cannot use any work-around for getting the result.
So basically, the algorithm should replace the first set of values (the four 1s) with 0, and then it should:
Detect the length of the next range of values
Create a parallel output vector with the values delayed by one set
The way I have roughly described the algorithm above is how I would do it. However, I'm new to Python (and a beginner in programming in general), and I keep finding out that Python has a lot of built-in functions that could make the algorithm lighter and less iterative. Does anyone have a suggestion for a better way to write this? This is the code I have written so far (assuming a static offset of -1):
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
output = []
PrevVal = 0
NextVal = input[0]
i = 0
while input[i] == NextVal:
    output.append(PrevVal)
    i += 1
while i < len(input):
    PrevVal = NextVal
    NextVal = input[i]
    while input[i] == NextVal:
        output.append(PrevVal)
        i += 1
        if i >= len(input):
            break
print output
Thanks in advance for any help!
BETTER DESCRIPTION
My list will always be composed of "sets" of values. They are usually float numbers, and they take values such as this short example below:
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
In this example, the first set (the one with value "1.236") has length 4 while the second one has length 6. What I would like to get as an output, when the offset = -1, is:
The value "0.000" in the first 4 elements;
The value "1.236" in the second 6 elements.
So basically, this "offset" function is creating the list with the same "structure" (ranges of lengths) but with the values delayed by "offset" times.
I hope it's clear now; unfortunately the problem itself still seems a bit silly to me (plus I don't speak very good English :) )
Please don't hesitate to ask for any additional info needed to complete the question and make it clearer.
How about this:
def generateOutput(input, value=0, offset=-1):
    values = []
    for i in range(len(input)):
        if i < 1 or input[i] == input[i-1]:
            yield value
        else:  # value change in input detected
            values.append(input[i-1])
            if len(values) >= -offset:
                value = values.pop(0)
            yield value
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
print list(generateOutput(input))
It will print this:
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
And in case you just want to iterate, you do not even need to build the list. Just use for i in generateOutput(input): … then.
For other offsets, use this:
print list(generateOutput(input, 0, -2))
prints:
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 5, 5, 5, 3, 3]
This uses a deque as the queue, with maxlen defining the shift length. The queue only holds unique consecutive values: pushing new values in at the end pushes old values out at the start of the queue once the shift length has been reached.
from collections import deque
def shift(it, shift=1):
    q = deque(maxlen=shift+1)
    q.append(0)
    for i in it:
        if q[-1] != i:
            q.append(i)
        yield q[0]
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
print list(shift(Sample))
#[0, 0, 0, 0, 1.236, 1.236, 1.236, 1.236, 1.236, 1.236]
My try:
#Input
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
shift = -1
#Build service structures: for each 'set of data' store its length and its value
set_lengths = []
set_values = []
prev_value = None
set_length = 0
for value in input:
    if prev_value is not None and value != prev_value:
        set_lengths.append(set_length)
        set_values.append(prev_value)
        set_length = 0
    set_length += 1
    prev_value = value
else:
    set_lengths.append(set_length)
    set_values.append(prev_value)
#Output the result, shifting the values
output = []
for i, l in enumerate(set_lengths):
    j = i + shift
    if j < 0:
        output += [0] * l
    else:
        output += [set_values[j]] * l
print input
print output
gives:
[1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
def x(list, offset):
    return [el + offset for el in list]
A completely different approach than my first answer is this:
import itertools
First analyze the input:
values, amounts = zip(*((n, len(list(g))) for n, g in itertools.groupby(input)))
We now have (1, 5, 3, 2, 5) and (4, 2, 6, 3, 2). Now apply the offset:
values = (0,) * (-offset) + values # nevermind that it is longer now.
And synthesize it again:
output = sum([ [v] * a for v, a in zip(values, amounts) ], [])
This is way more elegant, way less understandable and probably way more expensive than my other answer, but I didn't want to hide it from you.
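Put together as one function, the analyze / offset / synthesize steps might look like the sketch below. It is written for Python 3, only handles offsets of zero or less (a delay), and the name offset_runs and the fill parameter are assumptions for the example. It also uses extend instead of sum(..., []) to avoid repeatedly copying the list:

```python
from itertools import groupby

def offset_runs(data, offset=-1, fill=0):
    """Keep the run lengths of data but delay the run values by -offset runs."""
    if offset > 0:
        raise ValueError("this sketch only supports offset <= 0")
    # Analyze: one (value, length) pair per run of equal values.
    runs = [(value, sum(1 for _ in group)) for value, group in groupby(data)]
    values = [value for value, _ in runs]
    lengths = [length for _, length in runs]
    # Offset: pad the front; zip below drops the surplus values at the end.
    shifted = [fill] * (-offset) + values
    # Synthesize: repeat each shifted value for the original run length.
    output = []
    for value, length in zip(shifted, lengths):
        output.extend([value] * length)
    return output

data = [1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
print(offset_runs(data))  # [0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
```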

Find span where condition is True using NumPy

Imagine I have a numpy array and I need to find the spans/ranges where that condition is True. For example, I have the following array in which I'm trying to find spans where items are greater than 1:
[0, 0, 0, 2, 2, 0, 2, 2, 2, 0]
I would need to find indices (start, stop):
(3, 5)
(6, 9)
The fastest thing I've been able to implement is making a boolean array of:
truth = data > threshold
and then looping through the array using numpy.argmin and numpy.argmax to find start and end positions.
pos = 0
truth = container[RATIO,:] > threshold
while pos < len(truth):
    start = numpy.argmax(truth[pos:]) + pos + offset
    end = numpy.argmin(truth[start:]) + start + offset
    if not truth[start]:#nothing more
        break
    if start == end:#goes to the end
        end = len(truth)
    pos = end
But this has been too slow for the billions of positions in my arrays and the fact that the spans I'm finding are usually just a few positions in a row. Does anyone know a faster way to find these spans?
Here's one way. First take the boolean array you have:
In [11]: a
Out[11]: array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
In [12]: a1 = a > 1
Shift it one to the right (to get the previous state at each index) using roll. Note that roll wraps around, so the last element becomes the previous state of index 0:
In [13]: a1_rshifted = np.roll(a1, 1)
In [14]: starts = a1 & ~a1_rshifted # it's True but the previous isn't
In [15]: ends = ~a1 & a1_rshifted
Where this is non-zero is the start of each True batch (or, respectively, end batch):
In [16]: np.nonzero(starts)[0], np.nonzero(ends)[0]
Out[16]: (array([3, 6]), array([5, 9]))
And zipping these together:
In [17]: zip(np.nonzero(starts)[0], np.nonzero(ends)[0])
Out[17]: [(3, 5), (6, 9)]
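Because roll wraps around, a span touching the first or last element can be missed or mispaired. One way to sidestep that, sketched here with a made-up helper name true_spans, is to pad the mask with False on both sides and look for positions where neighbouring values differ:

```python
import numpy as np

def true_spans(mask):
    """Return an (n, 2) array of [start, stop) index pairs for runs of True."""
    mask = np.asarray(mask, dtype=bool)
    # Padding guarantees every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate start, stop, start, stop, ...
    return edges.reshape(-1, 2)

a = np.array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
print(true_spans(a > 1).tolist())  # [[3, 5], [6, 9]]
```

Everything here is vectorized, so there is no Python-level loop over positions.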
If you have access to the scipy library:
You can use scipy.ndimage.label to identify any regions of non-zero values. It returns an array where the value of each element is the id of a span or range in the original array, together with the number of regions found.
You can then use scipy.ndimage.find_objects to return the slices you would need to extract those ranges. You can access the start / end values directly from those slices.
In your example:
import numpy
from scipy.ndimage import label, find_objects
data = numpy.array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
labels, number_of_regions = label(data)
ranges = find_objects(labels)
for identified_range in ranges:
print(identified_range[0].start, identified_range[0].stop)
You should see:
3 5
6 9
Hope this helps!
