How to exclude one min and one max number? - python

I have a list:
numbers = [2,3,1,6,5]
I need to remove one minimum and one maximum number:
sorted(numbers)[1:-1]
This works, but I also want the positions of the removed numbers in the original list:
remains = sorted(numbers)[1:-1]
min_number_position = 2
max_number_position = 3
How can I do this? Numbers can be repeated.

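Use the min and max functions together with the list's index method to get the positions. Delete the larger index first; otherwise the first del shifts the later elements and the second del removes the wrong one (if all elements are equal, both positions coincide and only one element is removed):
min_position = numbers.index(min(numbers))
max_position = numbers.index(max(numbers))
for position in sorted({min_position, max_position}, reverse=True):
    del numbers[position]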

A pure Python solution that builds an argsorted list of indices (like the one returned by numpy.argsort()). Example:
numbers = [2,3,1,6,5]
argsorted = sorted(range(len(numbers)),key=lambda x:numbers[x])
maxpos,minpos = argsorted[-1],argsorted[0]
remains = [numbers[i] for i in argsorted[1:-1]]
Demo -
>>> numbers = [2,3,1,6,5]
>>> argsorted = sorted(range(len(numbers)),key=lambda x:numbers[x])
>>> argsorted
[2, 0, 1, 4, 3]
>>> maxpos,minpos = argsorted[-1],argsorted[0]
>>> remains = [numbers[i] for i in argsorted[1:-1]]
>>> remains
[2, 3, 5]
>>> maxpos
3
>>> minpos
2
If you can use the numpy library, this is easily done with array.argsort(). Example:
import numpy as np
nnumbers = np.array(numbers)
nnumargsort = nnumbers.argsort()
minpos,maxpos = nnumargsort[[0,-1]]
remains = nnumbers[nnumargsort[1:-1]]
Demo -
In [136]: numbers = [2,3,1,6,5]
In [137]: nnumbers = np.array(numbers)
In [138]: nnumargsort = nnumbers.argsort()
In [139]: minpos,maxpos = nnumargsort[[0,-1]]
In [140]: remains = nnumbers[nnumargsort[1:-1]]
In [141]: remains
Out[141]: array([2, 3, 5])
In [142]: maxpos
Out[142]: 3
In [143]: minpos
Out[143]: 2

>>> import operator
>>> sorted(enumerate(numbers), key=operator.itemgetter(1))
[(2, 1), (0, 2), (1, 3), (4, 5), (3, 6)]
The rest is left as an exercise for the reader.
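One way to finish the exercise, as a minimal sketch. The helper name drop_min_max is hypothetical and not from the answer above. It sorts the (index, value) pairs by value, reads the positions off the first and last pair, and keeps the values in between. Because sorted() is stable, repeated values resolve to the first occurrence of the minimum and the last occurrence of the maximum.

```python
from operator import itemgetter


def drop_min_max(numbers):
    """Return (min_position, remains, max_position) for a list of numbers.

    Hypothetical helper: 'remains' comes back in sorted order, as in the question.
    """
    if len(numbers) < 2:
        raise ValueError("need at least two numbers")
    # Pairs of (original index, value), ordered by value.
    pairs = sorted(enumerate(numbers), key=itemgetter(1))
    min_position = pairs[0][0]
    max_position = pairs[-1][0]
    remains = [value for _, value in pairs[1:-1]]
    return min_position, remains, max_position


print(drop_min_max([2, 3, 1, 6, 5]))  # (2, [2, 3, 5], 3)
```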

You can use a function that returns the indices of the min and max via the list.index method:
>>> def func(li):
...     sorted_li = sorted(li)
...     return (li.index(sorted_li[0]), sorted_li[1:-1], li.index(sorted_li[-1]))
...
>>> min_number_position,remains,max_number_position=func(numbers)
>>> min_number_position
2
>>> remains
[2, 3, 5]
>>> max_number_position
3
In Python 3.x you can use extended unpacking:
>>> def func(li):
...     mi, *re, ma = sorted(li)
...     return (li.index(mi), re, li.index(ma))
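If the remaining numbers should keep their original order rather than come back sorted, something along these lines may work. This is only a sketch: the name remove_min_max is made up for illustration, and it assumes that with repeated values it is acceptable to drop the first minimum and the last maximum.

```python
def remove_min_max(numbers):
    """Return (min_position, remains, max_position), keeping the original order.

    Hypothetical helper; assumes at least two elements.
    """
    if len(numbers) < 2:
        raise ValueError("need at least two numbers")
    # First occurrence of the smallest value.
    min_position = numbers.index(min(numbers))
    # Last occurrence of the largest value, so the two positions differ
    # even when every element is equal.
    max_position = len(numbers) - 1 - numbers[::-1].index(max(numbers))
    remains = [value for i, value in enumerate(numbers)
               if i not in (min_position, max_position)]
    return min_position, remains, max_position


print(remove_min_max([2, 3, 1, 6, 5]))  # (2, [2, 3, 5], 3)
```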

Related

Lists become pd.Series, then lists again with one more dimension

I have another problem with pandas; I feel I will never master this library.
First, this is, I think, how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
    print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
Problem. I create a dataframe, with list elements (in the actual code the lists come from grouping ops and I cannot change them, basically they contain all the values in a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
x y
0 [1, 2, 3] [4, 5, 6]
However, the lists are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, tolist() returned a list of lists (why?).
Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in zip(col1, col2):
    print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I only need to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could index with df["x"][0], but that does not seem very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in two columns df["x"] and df["y"]?
This should calculate df['s']
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
Basically, the tolist() returned a list of list (why?):
Because your dataframe has only one row with two columns, and each column holds a list as its single value. Returning such a column as a list of its values therefore gives a list with one element (the list that is the value).
I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}
df = pd.DataFrame(values)
print(df) # yields
x y
0 1 4
1 2 5
2 3 6
whereas your code creates a single row of lists:
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df) # yields
x y
0 [1, 2, 3] [4, 5, 6]
An elegant way to compute sqrt(x^2 + y^2) is to first convert the dataframe as follows:
new_df = df.iloc[0,:].apply(pd.Series).T.reset_index(drop=True)
This yields the following output:
x y
0 1 4
1 2 5
2 3 6
Now compute sqrt(x^2 + y^2):
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0 4.123106
1 5.385165
2 6.708204
dtype: float64
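If the list-valued cells have to stay as they are, the row-wise idea from the question can also be sketched with np.hypot, which computes sqrt(x**2 + y**2) element-wise. This is an illustration under the assumption that each row holds two lists of equal length; the variable s is introduced here only for the example.

```python
import numpy as np
import pandas as pd

# One row whose cells are lists, as in the question.
df = pd.DataFrame([{"x": [1, 2, 3], "y": [4, 5, 6]}])

# zip over the two columns yields one (x_list, y_list) pair per row;
# np.hypot converts each pair to arrays and works element-wise.
s = [np.hypot(x, y).tolist() for x, y in zip(df["x"], df["y"])]

# Wrap in a Series so each row receives one list-valued cell.
df["s"] = pd.Series(s, index=df.index)
print(s)
```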

Round off each value of a given vector(B) to closest value in A

Given a set of numbers:
A = np.array([12,10,7,4,2,0,-3])
and another set of values:
B = np.array([14,8.8,2.3,-4,5.5])
Is there a method in python which can round off B to the nearest value of A?
Here's one approach:
res = A[np.abs(A-B[:, None]).argmin(axis=1)]
[12 10 2 -3 7]
To understand how this works from a pure Python perspective, consider this list comprehension:
[A[np.abs(A-b).argmin()] for b in B]
Note that this does not handle ties specially: argmin returns the index of the first minimum.
Here is an O(n log n) solution:
>>> AS = np.sort(A)
>>> bnd = (AS[:-1] + AS[1:]) / 2
>>> nearest = AS[bnd.searchsorted(B)]
>>>
>>> nearest
array([12, 10, 2, -3, 4])
Or if you want ties to be rounded up:
>>> nearest = AS[bnd.searchsorted(B, 'right')]
>>> nearest
array([12, 10, 2, -3, 7])
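The two searchsorted variants above can be wrapped into one function. This is a rough sketch; the name round_to_nearest and the ties parameter are assumptions made for illustration. The midpoints between neighbouring sorted values act as decision boundaries, and the side argument of searchsorted decides which neighbour wins when a value sits exactly on a midpoint.

```python
import numpy as np


def round_to_nearest(A, B, ties="down"):
    """Map every value in B to the closest value in A (hypothetical helper)."""
    AS = np.sort(np.asarray(A))
    # Midpoints between consecutive sorted values.
    bnd = (AS[:-1] + AS[1:]) / 2
    # 'left' sends an exact midpoint to the lower neighbour, 'right' to the upper.
    side = "left" if ties == "down" else "right"
    return AS[np.searchsorted(bnd, B, side=side)]


A = np.array([12, 10, 7, 4, 2, 0, -3])
B = np.array([14, 8.8, 2.3, -4, 5.5])
print(round_to_nearest(A, B))             # [12 10  2 -3  4]
print(round_to_nearest(A, B, ties="up"))  # [12 10  2 -3  7]
```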

Find size/internal structure of list in Python

If I have a list c like so:
a = [1,2,3,4]
c = [a,a]
What's the simplest way of finding that it's a list of length two where each element is a list of length 4? If I do len(c) I get 2 but it doesn't give any indication that those elements are lists or their size unless I explicitly do something like
print(type(c[0]))
print(len(c[0]))
print(len(c[1]))
I could do something like
import numpy as np
np.asarray(c).shape
which gives me (2,4), but that only works when the internal lists are of equal size. If instead, the list is like
a = [1,2,3,4]
b = [1,2]
d = [a,b]
then np.asarray(d).shape just gives me (2,). In this case, I could do something like
import pandas as pd
df = pd.DataFrame(d)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
0 2 non-null int64
1 2 non-null int64
2 1 non-null float64
3 1 non-null float64
dtypes: float64(2), int64(2)
memory usage: 144.0 bytes
From this, I can tell that there are lists inside the original list, but I would like to be able to see this without using pandas. What's the best way to look at the internal structure of a list?
Depending on the output format you expect, you could write a recursive function that returns nested tuples of length and shape.
Code
def shape(lst):
    length = len(lst)
    shp = tuple(shape(sub) if isinstance(sub, list) else 0 for sub in lst)
    if any(x != 0 for x in shp):
        return length, shp
    else:
        return length
Examples
lst = [[1, 2, 3, 4], [1, 2, 3, 4]]
print(shape(lst)) # (2, (4, 4))
lst = [1, [1, 2]]
print(shape(lst)) # (2, (0, 2))
lst = [1, [1, [1]]]
print(shape(lst)) # (2, (0, (2, (0, 1))))
This approach returns the type and length of each element of the list; the first item describes the parent list itself.
def check(item):
    res = [(type(item), len(item))]
    for i in item:
        res.append((type(i), (len(i) if hasattr(i, '__len__') else None)))
    return res
>>> a = [1,2,3,4]
>>> c = [a,a]
>>> check(c)
[(<class 'list'>, 2), (<class 'list'>, 4), (<class 'list'>, 4)]
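Another possible view of the structure, sketched here as an illustration, is to collect the lengths found at each nesting depth. The function name lengths_by_depth is hypothetical, and treating tuples like lists is an assumption of this sketch.

```python
from collections import defaultdict


def lengths_by_depth(obj, depth=0, out=None):
    """Return {depth: [lengths of the lists found at that depth]}."""
    if out is None:
        out = defaultdict(list)
    if isinstance(obj, (list, tuple)):
        out[depth].append(len(obj))
        for item in obj:
            # Recurse into every element; non-sequences are simply skipped.
            lengths_by_depth(item, depth + 1, out)
    return dict(out)


print(lengths_by_depth([[1, 2, 3, 4], [1, 2]]))  # {0: [2], 1: [4, 2]}
print(lengths_by_depth([1, [1, [1]]]))           # {0: [2], 1: [2], 2: [1]}
```

A ragged list shows up as differing lengths at the same depth, which np.asarray(d).shape cannot express.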

matching occurrence of opposite number in a list python

I have a list such as
results = [100, 100, -100, 100, -100, -100]
I would like to match each number with the first unmatched occurrence of its opposite, so the first 100 is matched with the first -100, the second 100 with the second -100, and so on.
I would like the positions as output, such as:
[0, 2], [1, 4], [3, 5]
i.e. [0, 2] represents results[0] and results[2], where the first occurrence of 100 is matched with the first occurrence of -100.
Edit: you can assume there will always be the same number of positive and negative values, and that the list will only contain one number and its opposite.
Any help would be appreciated.
For your simple case where the list only contains 2 integers (x and -x), you could simply zip() together the indexes:
indexes = [[],[]]
for i,x in enumerate(results):
    indexes[0].append(i) if x > 0 else indexes[1].append(i)
list(zip(*indexes))
Example:
>>> results = [100, 100, -100, 100, -100, -100]
>>> indexes = [[],[]]
>>> for i,x in enumerate(results): indexes[0].append(i) if x > 0 else indexes[1].append(i)
...
>>> list(zip(*indexes))
[(0, 2), (1, 4), (3, 5)]
Note that for small inputs, two separate list comprehensions (e.g. [i for i,x in enumerate(results) if x > 0]) may be faster than appending in a for loop.
IMO, the fastest approach (for large inputs) should be the following one (though, my solution doesn't assume that the input list contains just one value and its opposite, so it can be made even faster if that assumption is added):
x = [100, 300, -300, 100, -100, -100]
from collections import defaultdict, deque
unmatched_positives = defaultdict(deque)
solution=[]
for i, val in enumerate(x):
    if val > 0:
        unmatched_positives[val].append(i)
    else:
        solution.append( (unmatched_positives[-val].popleft(), i) )
print('Unsorted solution:', solution)
# If you need the result to be sorted
print('Sorted solution:', sorted(solution))
Output:
Unsorted solution: [(1, 2), (0, 4), (3, 5)]
Sorted solution: [(0, 4), (1, 2), (3, 5)]
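The loop above relies on each positive value appearing before its negative partner. If that cannot be guaranteed, a variant along these lines might help. It is only a sketch: the name match_opposites is made up for illustration, and each pair is reported as (earlier index, later index) rather than (positive index, negative index).

```python
from collections import defaultdict, deque


def match_opposites(values):
    """Pair each value with the earliest unmatched occurrence of its opposite."""
    pending = defaultdict(deque)  # value -> indices still waiting for a partner
    pairs = []
    for i, val in enumerate(values):
        if pending[-val]:
            # An opposite value is waiting: match it with the current index.
            pairs.append((pending[-val].popleft(), i))
        else:
            # Nothing to match yet: remember this index for later.
            pending[val].append(i)
    return pairs


print(sorted(match_opposites([100, 100, -100, 100, -100, -100])))
# [(0, 2), (1, 4), (3, 5)]
```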
This should work (note that it overwrites matched entries in results):
results = [100, 100, -100, 100, -100, -100]
solution = []
for i, x in enumerate(results):
    # check the type first, so 'found' markers are never compared with 0
    if isinstance(x, int) and x > 0:
        y = results.index(-x)
        results[y] = 'found'
        solution.append([i,y])
print(solution)
This would work as well for the general case in which different numbers occur:
solutions = []
for x in set(abs(x) for x in results):
    solutions += list(zip([i for i, x2 in enumerate(results) if x2 == x],
                          [i for i, x2 in enumerate(results) if x2 == x*-1]))
We can do this efficiently in two phases. In the analysis phase, we keep only the negative numbers, sort them by value and then by index, and group them by value, like:
from itertools import groupby
subresult = dict(map(lambda x:(x[0],iter(tuple(x[1]))),
groupby(sorted(filter(lambda x:x[1] < 0,enumerate(results)),
key=lambda x:x[::-1]),lambda x:x[1])
))
Or we can generate it step-by-step, like:
subresult = filter(lambda x:x[1] < 0,enumerate(results)) # filter negative values
subresult = sorted(subresult,key=lambda x:x[::-1]) # sort them on value and then on index
subresult = groupby(subresult,lambda x:x[1]) # group them on the value
subresult = map(lambda x:(x[0],iter(tuple(x[1]))),subresult) # construct a sequence of tuples (value, iterator over (index, value) pairs)
subresult = dict(subresult) # make it a dictionary
This generates a dictionary:
{-100: <itertools._grouper object at 0x7fedfb523ef0>}
Next, in the construction phase, we iterate over all positive integers and always take the next opposite one from the subresult dictionary, like:
end_result = [[i,next(subresult[-v])[0]] for i,v in enumerate(results) if v > 0]
This generates:
>>> subresult = dict(map(lambda x:(x[0],iter(tuple(x[1]))),groupby(sorted(filter(lambda x:x[1] < 0,enumerate(results)),key=lambda x:x[::-1]),lambda x:x[1])))
>>> [[i,next(subresult[-v])[0]] for i,v in enumerate(results) if v > 0]
[[0, 2], [1, 4], [3, 5]]
Because of the dictionary lookup, and because each iterator keeps track of how far we have advanced, this usually works quite efficiently.
How about this simple, observation-based approach? Split the indices into two lists using list comprehensions and then just zip them in the order you want.
Using list comprehensions
In [18]: neg_list = [idx for idx, el in enumerate(results) if el < 0]
In [19]: pos_list = [idx for idx, el in enumerate(results) if el > 0]
In [20]: neg_list
Out[20]: [2, 4, 5]
In [21]: pos_list
Out[21]: [0, 1, 3]
In [22]: list(zip(pos_list, neg_list))
Out[22]: [(0, 2), (1, 4), (3, 5)]
You can also change which index comes first by changing the order in which you zip them.
NumPy Version:
For larger lists (or arrays equivalently), the numpy version should be much faster.
In [30]: res = np.array(results)
In [38]: pos_idx = np.where(res > 0)[0]
In [39]: pos_idx
Out[39]: array([0, 1, 3])
In [40]: neg_idx = np.where(res < 0)[0]
In [42]: neg_idx
Out[42]: array([2, 4, 5])
In [44]: list(zip(pos_idx, neg_idx))
Out[44]: [(0, 2), (1, 4), (3, 5)]
# If you want to avoid using zip, then
# just use np.vstack and transpose the result
In [59]: np.vstack((pos_idx, neg_idx)).T
Out[59]:
array([[0, 2],
[1, 4],
[3, 5]])
P.S.: You could also use generator expressions to achieve the same result, but please note that a generator is exhausted once you have converted it to a list.
Using generator expressions
In [24]: neg_gen = (idx for idx, el in enumerate(results) if el < 0)
In [25]: pos_gen = (idx for idx, el in enumerate(results) if el > 0)
In [27]: list(zip(pos_gen, neg_gen))
Out[27]: [(0, 2), (1, 4), (3, 5)]
# on 2nd run, there won't be any element in the generator.
In [28]: list(zip(pos_gen, neg_gen))
Out[28]: []
pos = {}
for i,item in enumerate(results ):
    if item < 0: continue
    if item not in pos:
        pos[item] = []
    pos[item].append(i)
[ [pos[-item].pop(0), i] for i,item in enumerate(results ) if item < 0]
[[0, 2], [1, 4], [3, 5]]
For the sample case where results only contains two different integers:
import numpy as np
results = np.array([100, 100, -100, 100, -100, -100])
output = list(zip(np.where(results > 0)[0], np.where(results < 0)[0]))
Output:
[(0, 2), (1, 4), (3, 5)]
Time is ~0.002 for results * 1000.

Checking for and indexing non-unique/duplicate values in a numpy array

I have an array traced_descIDs containing object IDs and I want to identify which items are not unique in this array. Then, for each unique duplicate (careful) ID, I need to identify which indices of traced_descIDs are associated with it.
As an example, if we take the traced_descIDs here, I want the following process to occur:
traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]
I'm currently finding out which objects have more than 1 entry by:
mentions = np.array([len(np.argwhere( traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)
however, this takes too long as len( traced_descIDs ) is around 150,000. Is there a faster way to achieve the same result?
Any help greatly appreciated. Cheers.
While dictionaries are O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which use sorting and are O(n*log n). In your case, the starting point would be:
a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
If you are using a version of numpy earlier than 1.9, then that last line would have to be:
unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)
The contents of the three arrays we have created are:
>>> unq
array([ 1, 23, 90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])
To get the repeated items:
cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]
>>> dup_ids
array([ 1, 345])
Getting the indices is a little more involved, but pretty straightforward:
cnt_idx, = np.nonzero(cnt_mask)
idx_mask = np.in1d(unq_idx, cnt_idx)
idx_idx, = np.nonzero(idx_mask)
srt_idx = np.argsort(unq_idx[idx_mask])
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
>>> dup_idx
[array([0, 5]), array([1, 3])]
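There is scipy.stats.itemfreq, which gives the frequency of each item (note that it was deprecated in SciPy 1.0 and removed in later releases; np.unique(xs, return_counts=True) provides the same counts):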
>>> xs = np.array([1, 345, 23, 345, 90, 1])
>>> ifreq = sp.stats.itemfreq(xs)
>>> ifreq
array([[ 1, 2],
[ 23, 1],
[ 90, 1],
[345, 2]])
>>> [(xs == w).nonzero()[0] for w in ifreq[ifreq[:,1] > 1, 0]]
[array([0, 5]), array([1, 3])]
Your current approach is O(N**2); use a dictionary to do it in O(N) time:
>>> from collections import defaultdict
>>> traced_descIDs = [1, 345, 23, 345, 90, 1]
>>> d = defaultdict(list)
>>> for i, x in enumerate(traced_descIDs):
...     d[x].append(i)
...
>>> for k, v in list(d.items()):  # copy the items so entries can be deleted while looping
...     if len(v) == 1:
...         del d[k]
...
>>> d
defaultdict(<class 'list'>, {1: [0, 5], 345: [1, 3]})
And to get the items and indices:
>>> dupIds, dupInds = zip(*d.items())
>>> dupIds, dupInds
((1, 345), ([0, 5], [1, 3]))
Note that dictionaries preserve insertion order from Python 3.7 on, so dupIds follows the order of first appearance; on older versions, use collections.OrderedDict with the dict.setdefault() method to preserve the order.
td = np.array(traced_descIDs)
si = np.argsort(td)
td[si][np.append(False, np.diff(td[si]) == 0)]
That gives you:
array([ 1, 345])
I haven't figured out the second part quite yet, but maybe this will be inspiration enough for you, or maybe I'll get back to it. :)
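One possible way to finish the argsort idea, as a sketch rather than a tuned implementation. The function name duplicate_groups is hypothetical. A stable argsort places equal IDs next to each other while keeping their original order, so splitting the sorted index array at the cumulative counts from np.unique gives one index group per unique ID.

```python
import numpy as np


def duplicate_groups(ids):
    """Return (duplicated ids, list of index arrays), one array per duplicated id."""
    ids = np.asarray(ids)
    # Stable sort: indices of equal ids stay in ascending order.
    order = np.argsort(ids, kind="stable")
    unq, counts = np.unique(ids, return_counts=True)
    # Cut the sorted index array into one chunk per unique id.
    groups = np.split(order, np.cumsum(counts)[:-1])
    keep = counts > 1
    return unq[keep], [g for g, k in zip(groups, keep) if k]


dup_ids, dup_inds = duplicate_groups([1, 345, 23, 345, 90, 1])
print(dup_ids)   # [  1 345]
print(dup_inds)  # [array([0, 5]), array([1, 3])]
```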
A solution of the same vectorized efficiency as proposed by Jaime is embedded in the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
print(npi.group_by(traced_descIDs, np.arange(len(traced_descIDs))))
This gets us most of the way there; but if we also want to filter out singleton groups while avoiding any python loops and staying entirely vectorized, we can go a little lower level, and do:
g = npi.group_by(traced_descIDs)
unique = g.unique
idx = g.split_array_as_list(np.arange(len(traced_descIDs)))
duplicates = unique[g.count>1]
idx_duplicates = np.asarray(idx)[g.count>1]
print(duplicates, idx_duplicates)
np.unique for N dimensions
I had a similar problem with an ndarray in which I wanted to find which rows are duplicated.
x = np.arange(60).reshape(5,4,3)
x[1] = x[0]
Rows 0 and 1 should be duplicates along axis 0. I used np.unique with all the optional outputs enabled, then used Jaime's method to locate the duplicates.
_,i,_,c = np.unique(x,1,1,1,axis=0)
x_dup = x[i[1<c]]
I request return_inverse even though it is not needed, for clarity. Here is the result:
>>> print(x_dup)
[[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]]
