I want to filter repeated elements in my list
for instance
foo = ['a','b','c','a','b','d','a','d']
I am only interested in:
['a','b','c','d']
What would be an efficient way to achieve this?
Cheers
list(set(foo)) if you are using Python 2.5 or greater, but that doesn't maintain order.
Cast foo to a set, if you don't care about element order.
Since there isn't an order-preserving answer with a list comprehension, I propose the following:
>>> temp = set()
>>> [c for c in foo if c not in temp and (temp.add(c) or True)]
['a', 'b', 'c', 'd']
which could also be written as
>>> temp = set()
>>> filter(lambda c: c not in temp and (temp.add(c) or True), foo)
['a', 'b', 'c', 'd']
Depending on how many elements are in foo, you might have faster results through repeated hash lookups instead of repeated iterative searches through a temporary list.
c not in temp verifies that temp does not have an item c; and the or True part forces c to be emitted to the output list when the item is added to the set.
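To see why the or True is needed: set.add() returns None, which is falsy, so without it the condition would never let any element through. A quick check in the interpreter:
>>> temp = set()
>>> print(temp.add('a'))   # set.add() returns None
None
>>> temp.add('b') or True  # None or True evaluates to True
True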
>>> bar = []
>>> for i in foo:
...     if i not in bar:
...         bar.append(i)
...
>>> bar
['a', 'b', 'c', 'd']
This is the most straightforward way of removing duplicates from the list while preserving the order as much as possible (even though "order" is arguably the wrong concept here).
If you care about order a readable way is the following
def filter_unique(a_list):
    characters = set()
    result = []
    for c in a_list:
        if c not in characters:
            characters.add(c)
            result.append(c)
    return result
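For example, with foo from the question:
>>> filter_unique(foo)
['a', 'b', 'c', 'd']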
Depending on your requirements for speed, maintainability, and space consumption, you might find the above unfitting. In that case, specify your requirements and we can try to do better :-)
If you write a function to do this, I would use a generator; it just begs to be used in this case.
def unique(iterable):
    yielded = set()
    for item in iterable:
        if item not in yielded:
            yield item
            yielded.add(item)
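Usage would then look like this (using foo from the question):
>>> foo = ['a','b','c','a','b','d','a','d']
>>> list(unique(foo))
['a', 'b', 'c', 'd']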
Inspired by Francesco's answer, rather than making our own filter()-type function, let's make the builtin do some work for us:
def unique(a, s=set()):
    if a not in s:
        s.add(a)
        return True
    return False
Usage:
uniq = filter(unique, orig)
This may or may not perform faster or slower than an answer that implements all of the work in pure Python. Benchmark and see. Of course, this only works once, but it demonstrates the concept. The ideal solution is, of course, to use a class:
class Unique(set):
    def __call__(self, a):
        if a not in self:
            self.add(a)
            return True
        return False
Now we can use it as much as we want:
uniq = filter(Unique(), orig)
Once again, we may (or may not) have thrown performance out the window - the gains of using a built-in function may be offset by the overhead of a class. I just thought it was an interesting idea.
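One practical note: in Python 3, filter() returns a lazy iterator rather than a list, so if you actually need a list you would wrap the call, e.g.:
uniq = list(filter(Unique(), orig))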
This is what you want if you need a sorted list at the end:
>>> foo = ['a','b','c','a','b','d','a','d']
>>> bar = sorted(set(foo))
>>> bar
['a', 'b', 'c', 'd']
import numpy as np
np.unique(foo)
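Keep in mind that np.unique sorts its result and returns a NumPy array rather than a list, so for the example above you would get something roughly like:
>>> np.unique(foo)
array(['a', 'b', 'c', 'd'], dtype='<U1')
>>> np.unique(foo).tolist()
['a', 'b', 'c', 'd']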
You could do a sort of ugly list comprehension hack.
[l[i] for i in range(len(l)) if l.index(l[i]) == i]
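A slightly tidier spelling of the same idea (still quadratic, since index() rescans the list for every element):
[x for i, x in enumerate(l) if l.index(x) == i]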
Related
I have a bit of code that runs many thousands of times in my project:
def resample(freq, data):
    output = []
    for i, elem in enumerate(freq):
        for _ in range(elem):
            output.append(data[i])
    return output
eg. resample([1,2,3], ['a', 'b', 'c']) => ['a', 'b', 'b', 'c', 'c', 'c']
I want to speed this up as much as possible. It seems like a list comprehension could be faster. I have tried:
def resample(freq, data):
    return [item for sublist in [[data[i]]*elem for i, elem in enumerate(freq)] for item in sublist]
Which is hideous and also slow because it builds the list and then flattens it. Is there a way to do this with one line list comprehension that is fast? Or maybe something with numpy?
Thanks in advance!
edit: The answer does not necessarily need to eliminate the nested loops; the fastest code is best.
I highly suggest using generators like so:
from itertools import repeat, chain
def resample(freq, data):
    return chain.from_iterable(map(repeat, data, freq))
This will probably be the fastest method there is - map(), repeat() and chain.from_iterable() are all implemented in C so you technically can't get any better.
As for a small explanation:
repeat(i, n) returns an iterator that repeats an item i, n times.
map(repeat, data, freq) returns an iterator that calls repeat every time on an element of data and an element of freq. Basically an iterator that returns repeat() iterators.
chain.from_iterable() flattens the iterator of iterators to return the end items.
No list is created on the way, so there is no overhead and as an added benefit - you can use any type of data and not just one char strings.
While I don't suggest it, you are able to convert it into a list() like so:
result = list(resample([1,2,3], ['a','b','c']))
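To make the intermediate steps concrete, here is a quick interactive sketch (wrapping things in list() purely to display the values; the exact reprs of the repeat objects may vary slightly by Python version):
>>> from itertools import repeat, chain
>>> list(map(repeat, ['a', 'b', 'c'], [1, 2, 3]))
[repeat('a', 1), repeat('b', 2), repeat('c', 3)]
>>> list(chain.from_iterable(map(repeat, ['a', 'b', 'c'], [1, 2, 3])))
['a', 'b', 'b', 'c', 'c', 'c']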
import itertools
def resample(freq, data):
    return itertools.chain.from_iterable([el]*n for el, n in zip(data, freq))
Besides being faster, this also has the advantage of being lazy: it returns an iterator and the elements are generated one at a time.
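A quick usage sketch with the example from the question:
>>> list(resample([1, 2, 3], ['a', 'b', 'c']))
['a', 'b', 'b', 'c', 'c', 'c']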
No need to create lists at all, just use a nested loop:
[e for i, e in enumerate(data) for j in range(freq[i])]
# ['a', 'b', 'b', 'c', 'c', 'c']
You can just as easily make this lazy by removing the brackets:
(e for i, e in enumerate(data) for j in range(freq[i]))
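A zip-based variant of the same comprehension (my own adaptation, not part of the original answer) avoids the indexing entirely:
[e for e, n in zip(data, freq) for _ in range(n)]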
The list.index(x) function returns the index in the list of the first item whose value is x.
Is there a function, list_func_index(), similar to index(), that takes a function f() as a parameter? The function f() is run on every element e of the list until f(e) returns True, and then list_func_index() returns the index of e.
Codewise:
>>> def list_func_index(lst, func):
...     for i in range(len(lst)):
...         if func(lst[i]):
...             return i
...     raise ValueError('no element making func True')
...
>>> l = [8,10,4,5,7]
>>> def is_odd(x): return x % 2 != 0
>>> list_func_index(l,is_odd)
3
Is there a more elegant solution? (and a better name for the function)
You could do that in a one-liner using generators:
next(i for i,v in enumerate(l) if is_odd(v))
The nice thing about generators is that they only compute up to the requested amount. So requesting the first two indices is (almost) just as easy:
y = (i for i,v in enumerate(l) if is_odd(v))
x1 = next(y)
x2 = next(y)
Though, expect a StopIteration exception once the matches are exhausted (that is how generators work). In your "take-first" approach this is also convenient: it tells you that no such value was found, where the list.index() function would raise ValueError instead.
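If you would rather get a sentinel value than an exception when nothing matches, next() accepts a default as its second argument:
first_odd = next((i for i, v in enumerate(l) if is_odd(v)), None)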
One possibility is the built-in enumerate function:
def index_of_first(lst, pred):
    for i, v in enumerate(lst):
        if pred(v):
            return i
    return None
It's typical to refer to a function like the one you describe as a "predicate"; it returns true or false for some question. That's why I call it pred in my example.
I also think it would be better form to return None, since that's the real answer to the question. The caller can choose to explode on None, if required.
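For example, with the list and predicate from the question:
>>> index_of_first([8, 10, 4, 5, 7], is_odd)
3
>>> index_of_first([8, 10, 4], is_odd) is None
True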
Paul's accepted answer is best, but here's a little lateral-thinking variant, mostly for amusement and instruction purposes...:
>>> class X(object):
... def __init__(self, pred): self.pred = pred
... def __eq__(self, other): return self.pred(other)
...
>>> l = [8,10,4,5,7]
>>> def is_odd(x): return x % 2 != 0
...
>>> l.index(X(is_odd))
3
essentially, X's purpose is to change the meaning of "equality" from the normal one to "satisfies this predicate", thereby allowing the use of predicates in all kinds of situations that are defined as checking for equality -- for example, it would also let you code, instead of if any(is_odd(x) for x in l):, the shorter if X(is_odd) in l:, and so forth.
Worth using? Not when a more explicit approach like that taken by Paul is just as handy (especially when changed to use the new, shiny built-in next function rather than the older, less appropriate .next method, as I suggest in a comment to that answer), but there are other situations where it (or other variants of the idea "tweak the meaning of equality", and maybe other comparators and/or hashing) may be appropriate. Mostly, worth knowing about the idea, to avoid having to invent it from scratch one day;-).
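For instance, continuing the session above, the membership test mentioned would look like this:
>>> X(is_odd) in l
True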
Not one single function, but you can do it pretty easily:
>>> test = lambda c: c == 'x'
>>> data = ['a', 'b', 'c', 'x', 'y', 'z', 'x']
>>> map(test, data).index(True)
3
>>>
If you don't want to evaluate the entire list at once you can use itertools, but it's not as pretty:
>>> from itertools import imap, ifilter
>>> from operator import itemgetter
>>> test = lambda c: c == 'x'
>>> data = ['a', 'b', 'c', 'x', 'y', 'z']
>>> ifilter(itemgetter(1), enumerate(imap(test, data))).next()[0]
3
>>>
Just using a generator expression is probably more readable than itertools though.
Note in Python3, map and filter return lazy iterators and you can just use:
from operator import itemgetter
test = lambda c: c == 'x'
data = ['a', 'b', 'c', 'x', 'y', 'z']
next(filter(itemgetter(1), enumerate(map(test, data))))[0] # 3
A variation on Alex's answer. This avoids having to type X every time you want to use is_odd or whichever predicate
>>> class X(object):
... def __init__(self, pred): self.pred = pred
... def __eq__(self, other): return self.pred(other)
...
>>> L = [8,10,4,5,7]
>>> is_odd = X(lambda x: x%2 != 0)
>>> L.index(is_odd)
3
>>> less_than_six = X(lambda x: x<6)
>>> L.index(less_than_six)
2
You could do this with a list comprehension:
l = [8,10,4,5,7]
filterl = [a for a in l if a % 2 != 0]
Then filterl will contain all members of the list fulfilling the expression a % 2 != 0. I would say that's a more elegant approach...
Intuitive one-liner solution:
i = list(map(lambda value: value > 0, data)).index(True)
Explanation:
We use the map function to build a list of True/False values indicating whether each element in our list passes the condition in the lambda.
Then we convert the map output to a list.
Then, using the index method, we get the index of the first True, which is also the index of the first value passing the condition.
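For example, with the list and predicate from the earlier answers (swapping the lambda for an odd-number check):
>>> data = [8, 10, 4, 5, 7]
>>> list(map(lambda value: value % 2 != 0, data)).index(True)
3
Note that, unlike the generator-based approaches, this evaluates the condition for every element before searching for the first True.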
Not sure how else to word this, but say I have a list containing the following sequence:
[a,a,a,b,b,b,a,a,a]
and I would like to return:
[a,b,a]
How would one do this in principle?
You can use itertools.groupby; it groups consecutive equal elements into the same group and returns an iterator of (key, group) pairs, where the key is the unique element you are looking for:
from itertools import groupby
lst = ['a','a','a','b','b','b','a','a','a']
[k for k, _ in groupby(lst)]
# ['a', 'b', 'a']
Psidom's way is a lot better, but I may as well write this so you can see how it would be possible using just basic loops and statements. It's always good to figure out what steps you'd need to take for any problem, as it usually makes coding the simple things a bit easier :)
original = ['a','a','a','b','b','b','a','a','a']
new = [original[0]]
for letter in original[1:]:
    if letter != new[-1]:
        new.append(letter)
Basically it will append a letter if the previous letter is something different.
Using list comprehension:
original = ['a','a','a','b','b','b','a','a','a']
packed = [original[i] for i in range(len(original)) if i == 0 or original[i] != original[i-1]]
print(packed) # > ['a', 'b', 'a']
Similarly (thanks to pylang) you can use enumerate instead of range:
[ x for i,x in enumerate(original) if i == 0 or x != original[i-1] ]
more_itertools has an implementation of the unique_justseen recipe from itertools:
import more_itertools as mit
list(mit.unique_justseen(["a","a","a","b","b","b","a","a","a"]))
# ['a', 'b', 'a']
I've been sitting here for almost 5 hours trying to solve the problem and now I'm hoping for your help.
Here is my Python Code:
def powerset3(a):
    if (len(a) == 0):
        return frozenset({})
    else:
        s=a.pop()
        b=frozenset({})
        b|=frozenset({})
        b|=frozenset({s})
        for subset in powerset3(a):
            b|=frozenset({str(subset)})
            b|=frozenset({s+subset})
        return b
If I run the program with:
print(powerset3(set(['a', 'b'])))
I get the following solution
frozenset({'a', 'b', 'ab'})
But I want to have
{frozenset(), frozenset({'a'}), frozenset({'b'}), frozenset({'b', 'a'})}
I don't want to use libraries and it should be recursive!
Thanks for your help
Here's a slightly more readable implementation using itertools. If you don't want to use a library for the combinations, you can replace the combinations call with its pure-Python implementation, e.g. from https://docs.python.org/2/library/itertools.html#itertools.combinations
import itertools

def powerset(l):
    result = [()]
    for i in range(len(l)):
        result += itertools.combinations(l, i+1)
    return frozenset([frozenset(x) for x in result])
Testing on IPython, with different lengths
In [82]: powerset(['a', 'b'])
Out[82]:
frozenset({frozenset(),
frozenset({'b'}),
frozenset({'a'}),
frozenset({'a', 'b'})})
In [83]: powerset(['x', 'y', 'z'])
Out[83]:
frozenset({frozenset(),
frozenset({'x'}),
frozenset({'x', 'z'}),
frozenset({'y'}),
frozenset({'x', 'y'}),
frozenset({'z'}),
frozenset({'y', 'z'}),
frozenset({'x', 'y', 'z'})})
In [84]: powerset([])
Out[84]: frozenset({frozenset()})
You sort of have the right idea. If a is non-empty, then the powerset of a can be formed by taking some element s from a; call what's left over rest. Then build up the powerset of a from the powerset of rest by adding to it, for each subset in powerset3(rest), both subset itself and subset | frozenset({s}).
That last bit, doing subset | frozenset({s}) instead of string concatenation, is half of what's missing from your solution. The other problem is the base case: the powerset of the empty set is not the empty set, it is the set containing exactly one element, the empty set.
One more issue with your solution is that you're trying to use frozenset, which is immutable, in mutable ways (e.g. pop(), b |= something, etc.)
Here's a working solution:
from functools import partial, reduce

def helper(x, accum, subset):
    return accum | frozenset({subset}) | frozenset({frozenset({x}) | subset})

def powerset(xs):
    if len(xs) == 0:
        return frozenset({frozenset({})})
    else:
        # this loop is the only way to access an element of a frozenset; notice
        # that it always returns out of the first iteration
        for x in xs:
            return reduce(partial(helper, x), powerset(xs - frozenset({x})), frozenset({}))
a = frozenset({'a', 'b'})
print(powerset(a))
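With those fixes in place, this should print something like the following (the ordering of the inner frozensets in the repr may vary):
frozenset({frozenset(), frozenset({'b'}), frozenset({'a'}), frozenset({'a', 'b'})})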
I have two lists say
List1 = ['a','c','c']
List2 = ['x','b','a','x','c','y','c']
Now I want to find out if all elements of List1 are in List2. In this case they all are. I can't use the subset function because I can have repeated elements in the lists. I could use a for loop to count the number of occurrences of each item in List1 and check that it is less than or equal to the number of occurrences in List2. Is there a better way to do this?
Thanks.
When the number of occurrences doesn't matter, you can still use the subset functionality by creating a set on the fly:
>>> list1 = ['a', 'c', 'c']
>>> list2 = ['x', 'b', 'a', 'x', 'c', 'y', 'c']
>>> set(list1).issubset(list2)
True
If you need to check if each element shows up at least as many times in the second list as in the first list, you can make use of the Counter type and define your own subset relation:
>>> from collections import Counter
>>> def counterSubset(list1, list2):
...     c1, c2 = Counter(list1), Counter(list2)
...     for k, n in c1.items():
...         if n > c2[k]:
...             return False
...     return True
...
>>> counterSubset(list1, list2)
True
>>> counterSubset(list1 + ['a'], list2)
False
>>> counterSubset(list1 + ['z'], list2)
False
If you already have counters (which might be a useful alternative to store your data anyway), you can also just write this as a single line:
>>> all(n <= c2[k] for k, n in c1.items())
True
Be aware of the following:
>>> listA = ['a', 'a', 'b','b','b','c']
>>> listB = ['b', 'a','a','b','c','d']
>>> all(item in listB for item in listA)
True
If you read the "all" line as you would in English, this is not wrong, but it can be misleading: listA has a third 'b' that listB does not, yet the expression still evaluates to True.
This also has the same issue:
def list1InList2(list1, list2):
    for item in list1:
        if item not in list2:
            return False
    return True
Just a note: the following does not work:
>>> tupA = (1,2,3,4,5,6,7,8,9)
>>> tupB = (1,2,3,4,5,6,6,7,8,9)
>>> set(tupA) < set(tupB)
False
Converting the tuples to lists first makes no difference. The reason is that < tests for a proper (strict) subset, and after deduplication the two sets are equal, so the comparison is False.
This works, but has the same issue of not keeping count of element occurrences:
>>> set(tupA).issubset(set(tupB))
True
Using sets is not a comprehensive solution for matching elements with multiple occurrences.
But here is a one-liner solution/adaption to shantanoo's answer without try/except:
all(True if sequenceA.count(item) <= sequenceB.count(item) else False for item in sequenceA)
A builtin function wrapping a generator expression that uses a ternary conditional operator. Python is awesome! Note that the "<=" should not be "==".
With this solution, sequences A and B can be tuples, lists, or other "sequences" with a count method, and the elements in both sequences can be of most types. I would not use this with dicts as it is now, hence the use of "sequence" instead of "iterable".
A solution using Counter and multiset subtraction (note that - on Counters is proper multiset difference, dropping non-positive counts, rather than element-wise subtraction):
from collections import Counter

def is_subset(l1, l2):
    c1, c2 = Counter(l1), Counter(l2)
    return not c1 - c2
Test:
>>> List1 = ['a','c','c']
>>> List2 = ['x','b','a','x','c','y','c']
>>> is_subset(List1, List2)
True
I can't use the subset function because I can have repeated elements in lists.
What this means is that you want to treat your lists as multisets rather than sets. The usual way to handle multisets in Python is with collections.Counter:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
And, while you can implement subset for multisets (implemented with Counter) by looping and comparing counts, as in poke's answer, this is unnecessary, just as it would be unnecessary to implement subset for sets (implemented with set or frozenset) by looping and testing membership with in.
The Counter type already implements all the set operators extended in the obvious way for multisets.[1] So you can just write subset in terms of those operators, and it will work for both set and Counter out of the box.
With (multi)set difference:[2]
def is_subset(c1, c2):
    return not c1 - c2
Or with (multi)set intersection:
def is_subset(c1, c2):
    return c1 & c2 == c1
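For example, with the lists from the question converted to Counters first (either variant of is_subset behaves the same here):
>>> from collections import Counter
>>> is_subset(Counter(['a','c','c']), Counter(['x','b','a','x','c','y','c']))
True
>>> is_subset(Counter(['a','c','c','c']), Counter(['x','b','a','x','c','y','c']))
False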
1. You may be wondering why, if Counter implements the set operators, it doesn't just implement < and <= for proper subset and subset. Although I can't find the email thread, I'm pretty sure this was discussed, and the answer was that "the set operators" are defined as the specific set of operators in the initial version of collections.abc.Set (which has since been expanded, IIRC…), not every operator that set happens to include for convenience, in the exact same way that Counter doesn't get a named intersection method (which, on set, is friendlier to other types than &) just because set has one.
2. This depends on the fact that collections in Python are expected to be falsey when empty and truthy otherwise. This is documented here for the builtin types, and the fact that bool tests fall back to len is explained here—but it's ultimately just a convention, so that "quasi-collections" like numpy arrays can violate it if they have a good reason. It holds for "real collections" like Counter, OrderedDict, etc. If you're really worried about that, you can write len(c1 - c2) == 0, but note that this is against the spirit of PEP 8.
This will return True if all the items in list1 are in list2:
def list1InList2(list1, list2):
    for item in list1:
        if item not in list2:
            return False
    return True
def check_subset(list1, list2):
    # work on a copy so the caller's list2 is not modified
    remaining = list(list2)
    try:
        for x in list1:
            remaining.remove(x)  # raises ValueError if x is missing or exhausted
        return 'all elements in list1 are in list2'
    except ValueError:
        return 'some elements in list1 are not in list2'
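A quick usage sketch with the lists from the question (the originals are left untouched, since the function works on a copy):
>>> check_subset(['a','c','c'], ['x','b','a','x','c','y','c'])
'all elements in list1 are in list2'
>>> check_subset(['a','c','c','c'], ['x','b','a','x','c','y','c'])
'some elements in list1 are not in list2'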