find common data python

Using
def compare_lsts(list1, list2):
    first_set = set(list1)
    second_set = set(list2)
    results = [x for x in list1 if x in list2]
    print(results)
and running compare_lsts([1,2,3,4,5],[3,8,9,1,7]) gives the numbers contained in both sets, i.e. [1,3].
However, making list1 contain more than one list, e.g. compare_lsts([[1,2,3,4,5],[5,8,2,9,12],[3,7,19,4,16]],[3,7,2,16,19]), gives [], [], [].
I have used for list in list1 followed by the results line inside the loop. I clearly don't know what I am doing.
Basically the question is: How does one compare items in one static list with as many lists as there are?

First of all, you already started using sets, so you should definitely use them, as they are faster when checking containment. Also, there are already a few helpful built-in features for sets, so for comparing two lists, you can just intersect the sets to get those items that are in both lists:
>>> set1 = set([1, 2, 3, 4, 5])
>>> set2 = set([3, 8, 9, 1, 7])
>>> set1 & set2
{1, 3}
>>> list(set1 & set2) # in case you need a list as the output
[1, 3]
Similarly, you can also find the union of two sets to get those items that are in any of the sets:
>>> set1 | set2
{1, 2, 3, 4, 5, 7, 8, 9}
So, if you want to find all items from list2 that are in any of list1’s sublists, then you could intersect all the sublists with list2 and then union all those results:
>>> sublists = [set([1, 2, 3, 4, 5]), set([5, 8, 2, 9, 12]), set([3, 7, 19, 4, 16])]
>>> otherset = set([3, 7, 2, 16, 19])
>>> intersections = [sublist & otherset for sublist in sublists]
>>> intersections
[{2, 3}, {2}, {16, 3, 19, 7}]
>>> union = set()
>>> for intersection in intersections:
...     union = union | intersection
...
>>> union
{16, 19, 2, 3, 7}
You can also do that a little bit nicer using functools.reduce:
>>> import functools
>>> functools.reduce(set.union, intersections)
{16, 19, 2, 3, 7}
Similarly, if you want to actually intersect those results, you could do that as well:
>>> functools.reduce(set.intersection, intersections)
set()
And finally, you can pack that all in a nice function:
def compareLists(mainList, *otherLists):
    mainSet = set(mainList)
    otherSets = [set(otherList) for otherList in otherLists]
    intersections = [mainSet & otherSet for otherSet in otherSets]
    return functools.reduce(set.union, intersections)  # or replace with set.intersection
And use it like this:
>>> compareLists([1, 2, 3, 4, 5], [3, 8, 9, 1, 7])
{1, 3}
>>> compareLists([3, 7, 2, 16, 19], [1, 2, 3, 4, 5], [5, 8, 2, 9, 12], [3, 7, 19, 4, 16])
{16, 19, 2, 3, 7}
Note that I changed the order of the arguments in the function, so the main list (in your case list2) comes first, as that is the one the others are compared to.

If you're after elements from the first that are in all of the lists:
set(first).intersection(second, third) # fourth, fifth, etc...
>>> set([1, 2, 3]).intersection([2, 3, 4], [3, 4, 5])
set([3])
If you're after elements from the first that are in any of the other lists:
>>> set([1, 2, 3]) & set([2, 4]).union([5, 6])
set([2])
So, then a simple func:
def in_all(fst, *rst):
    return set(fst).intersection(*rst)

def in_any(fst, *rst):
    it = iter(rst)
    return set(fst) & set(next(it, [])).union(*it)
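For example, a quick usage check of both helpers (output shown in Python 3 set notation):
>>> in_all([1, 2, 3], [2, 3, 4], [3, 4, 5])
{3}
>>> in_any([1, 2, 3], [2, 4], [5, 6])
{2}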

Not sure if it's the best way but:
def flat(l):
    c_l = []
    for i in l:
        if isinstance(i, list):
            c_l.extend(i)  # append every element of the sublist
        else:
            c_l.append(i)
    return c_l
def compare_lsts(a, b):
    if all(isinstance(x, list) for x in a):  # if a consists of sublists
        a = flat(a)  # flatten a
    if all(isinstance(x, list) for x in b):  # if b consists of sublists
        b = flat(b)  # flatten b
    return list(set(a) & set(b))  # intersection between a and b
print(compare_lsts([[1,2,3,4,5],[5,8,2,9,12],[3,7,19,4,16]], [3,7,2,16,19]))  # [16, 3, 2, 19, 7]

Related

Why loop through series in list comprehension gives different result from looping through list?

Can anyone help me understand why the list comprehension generates different results when I just change the Series to a list?
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
[i for i in ser1 if i not in ser2]
# the output is [5]
but if I change it to loop through a list inside the list comprehension, I get the result I want:
l1 = [1, 2, 3, 4, 5]
l2 = [4, 5, 6, 7, 8]
[i for i in l1 if i not in l2]
# the output is [1, 2, 3]
Why does the Series generate the wrong answer?
Thanks in advance.
For a pandas series, the in operator refers to the keys (indexes), not the contents...
Ah, someone just posted a link to an extensive answer; I won't recreate it here
However, one further note: depending on the situation, another way to get a similar result is with sets:
s1 = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7, 8}
s1 - s2
# answer is {1, 2, 3} in arbitrary order; may be shuffled
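If you want to keep working with pandas objects but compare values rather than index labels, one possible sketch (using the ser1 and ser2 from the question) is Series.isin:
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# isin() tests membership against ser2's values, not its index labels
ser1[~ser1.isin(ser2)].tolist()
# [1, 2, 3]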

Finding common elements from lists in list

I have a list that includes more lists (this is how it looks https://pastebin.com/BW4B9gfa). The number of lists is not constant. I need to create another list that contains only elements that are in all lists in the main list.
I made something like this as a prototype but it doesn't work:
def common_elements(list_of_lists):
    lists = list_of_lists
    common = lists[0].intersection(lists[1].intersection(lists[2].intersection(lists[3].intersection(lists[4].intersection(lists[5])))))
    return common
I have also seen something like this:
A = [1,2,3,4]
B = [2,4,7,8]
commonalities = set(A) - (set(A) - set(B))
but I don't know how to use it with a bigger number of lists.
If you have a list of sets, you can simply do the following (to get a list of sets, just do lists = [set(lst) for lst in lists]):
lists[0].intersection(*lists)
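For example, with some made-up sample data (the actual lists are in the pastebin):
lists = [[1, 2, 4], [1, 3, 4], [1, 4, 6], [1, 4, 7, 9]]
lists = [set(lst) for lst in lists]
print(lists[0].intersection(*lists))
# {1, 4}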
You need to convert the first list to a set so you can use the intersection() method.
Use a loop rather than hard-coding all the indexes of the list elements.
def common_elements(lists):
    if len(lists) == 0:
        return []
    common = set(lists[0])
    for l in lists[1:]:
        common = common.intersection(l)
    return list(common)
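For example (again with hypothetical data, since the real lists are in the pastebin):
print(common_elements([[1, 2, 4], [1, 3, 4], [1, 4, 6], [1, 4, 7, 9]]))
# [1, 4]  (a set is built internally, so the order of the result may vary)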
using functools.reduce():
from functools import reduce
items = [[1, 2, 4], [1, 3, 4], [1, 4, 6], [1, 4, 7, 9]]
eggs = reduce(lambda x, y: set(x) & set(y), items)
print(eggs)
output:
{1, 4}
If you want to get intermediate results, you can use itertools.accumulate()
from itertools import accumulate
items = [[1, 2, 4, 5], [1, 3, 4, 5], [1, 4, 6], [1, 4, 7, 9]]
eggs = list(accumulate(items, func = lambda x, y: set(x) & set(y)))
print(eggs)
output:
[[1, 2, 4, 5], {1, 4, 5}, {1, 4}, {1, 4}]

How to get the unselected population in python random module

So, I know I can get a random list from a population using the random module,
l = [0, 1, 2, 3, 4, 8 ,9]
print(random.sample(l, 3))
# [1, 3, 2]
But, how do I get the list of the unselected ones? Do, I need to remove them manually from the list? Or, is there a method to get them too?
Edit: The list l from the example doesn't contain the same item multiple times, but when it does, I wouldn't want an item removed more times than it is selected in the sample.
l = [0, 1, 2, 3, 4, 8 ,9]
s1 = set(random.sample(l, 3))
s2 = set(l).difference(s1)
>>> s1
{0, 3, 8}
>>> s2
{1, 2, 4, 9}
Update: same items multiple times
You can shuffle your list first and then partition the shuffled population into two parts:
l = [7, 4, 5, 4, 5, 9, 8, 6, 6, 6, 9, 8, 6, 3, 8]
pop = l[:]
random.shuffle(pop)
pop1, pop2 = pop[:3], pop[3:]
>>> pop1
[8, 4, 9]
>>> pop2
[7, 6, 8, 6, 5, 6, 9, 6, 5, 8, 4, 3]
Because your list can contain the same item multiple times, you can switch to the approach below:
import random
l = [0, 1, 2, 3, 4, 8 ,9]
random.shuffle(l)
selected = l[:3]
unselected = l[3:]
print(selected)
# [4, 0, 1]
print(unselected)
# [8, 2, 3, 9]
If you want to keep track of duplicates, you could count the items of each type and compare the population count to the sample count.
If you don't care about the order of items in the population, you could do it like this:
from collections import Counter
import random
population = [1, 1, 2, 2, 9, 7, 9]
sample = random.sample(population, 3)
pop_count = Counter(population)
samp_count = Counter(sample)
unsampled = [
    k
    for k in pop_count
    for _ in range(pop_count[k] - samp_count[k])
]
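For illustration only (the sample is random), if random.sample happened to return [1, 2, 9], then:
print(unsampled)
# [1, 2, 9, 7] -- one copy of each duplicate survives; the order follows the Counter, not the population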
If you care about the order in the population, you could do something like this:
check = sample.copy()
unsampled = []
for val in population:
    if val in check:
        check.remove(val)
    else:
        unsampled.append(val)
Or there's this weird list comprehension (not recommended):
check = sample.copy()
unsampled = [
    x
    for x in population
    if x not in check or check.remove(x)
]
The if clause here uses two tricks:
both parts of the test will be falsy if x is in check (list.remove() always returns None), and
remove() will only be called if the first part fails, i.e., if x is in check.
Basically, if (and only if) x is in check, it will bomb through and check the next condition, which will also be False (None), but will have the side effect of removing one copy of x from check.
You can do it with:
import random
l = [0, 1, 2, 3, 4, 8 ,9]
rand = random.sample(l, 3)
rest = list(set(l) - set(rand))
print(f"initial list: {l}")
print(f"random list: {rand}")
print (f"rest list: {rest}")
Result:
initial list: [0, 1, 2, 3, 4, 8, 9]
random list: [2, 9, 0]
rest list: [8, 1, 3, 4]

Mapping two list without looping

I have two lists of equal length. The first list l1 contains data.
l1 = [2, 3, 5, 7, 8, 10, ... , 23]
The second list l2 contains the category the data in l1 belongs to:
l2 = [1, 1, 2, 1, 3, 4, ... , 3]
How can I partition the first list based on the categories given by numbers such as 1, 2, 3, 4 in the second list, using a list comprehension or lambda function? For example, 2, 3, 7 from the first list belong to the same partition, as they have corresponding values in the second list.
The number of partitions is known at the beginning.
You can use a dictionary:
>>> l1 = [2, 3, 5, 7, 8, 10, 23]
>>> l2 = [1, 1, 2, 1, 3, 4, 3]
>>> d = {}
>>> for i, j in zip(l1, l2):
... d.setdefault(j, []).append(i)
...
>>>
>>> d
{1: [2, 3, 7], 2: [5], 3: [8, 23], 4: [10]}
If a dict is fine, I suggest using a defaultdict:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for number, category in zip(l1, l2):
... d[category].append(number)
...
>>> d
defaultdict(<type 'list'>, {1: [2, 3, 7], 2: [5], 3: [8, 23], 4: [10]})
Consider using itertools.izip for memory efficiency if you are using Python 2.
This is basically the same solution as Kasramvd's, but I think the defaultdict makes it a little easier to read.
This will give a list of partitions using list comprehension :
>>> l1 = [2, 3, 5, 7, 8, 10, 23]
>>> l2 = [1, 1, 2, 1, 3, 4, 3]
>>> [[value for i, value in enumerate(l1) if j == l2[i]] for j in set(l2)]
[[2, 3, 7], [5], [8, 23], [10]]
A nested list comprehension :
[ [ l1[j] for j in range(len(l1)) if l2[j] == i ] for i in range(1, max(l2)+1 )]
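With the l1 and l2 from above, this gives (note that it assumes the categories are the consecutive integers 1 to max(l2)):
>>> [[l1[j] for j in range(len(l1)) if l2[j] == i] for i in range(1, max(l2) + 1)]
[[2, 3, 7], [5], [8, 23], [10]]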
If it is reasonable to have your data stored in numpy's ndarrays you can use extended indexing
{i:l1[l2==i] for i in set(l2)}
to construct a dictionary of ndarrays indexed by category code.
There is an overhead associated with l2==i (i.e., building a new Boolean array for each category) that grows with the number of categories, so that you may want to check which alternative, either numpy or defaultdict, is faster with your data.
I tested with n=200000 and nc=20, and numpy was faster than defaultdict + izip (124 vs 165 ms), but with nc=10000 numpy was (much) slower (11300 vs 251 ms).
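A minimal sketch of the numpy variant, assuming l1 and l2 are first converted to arrays:
import numpy as np

l1 = np.array([2, 3, 5, 7, 8, 10, 23])
l2 = np.array([1, 1, 2, 1, 3, 4, 3])

# the boolean mask l2 == i picks out the positions that belong to category i
parts = {i: l1[l2 == i] for i in set(l2)}

print(parts[1])  # [2 3 7]
print(parts[4])  # [10]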
Using some itertools and operator goodies and a sort, you can do this in a one-liner:
>>> l1 = [2, 3, 5, 7, 8, 10, 23]
>>> l2 = [1, 1, 2, 1, 3, 4, 3]
>>> itertools.groupby(sorted(zip(l2, l1)), operator.itemgetter(0))
The result of this is a itertools.groupby object that can be iterated over:
>>> for g, li in itertools.groupby(sorted(zip(l2, l1)), operator.itemgetter(0)):
...     print(g, list(map(operator.itemgetter(1), li)))
1 [2, 3, 7]
2 [5]
3 [8, 23]
4 [10]
This is not a list comprehension but a dictionary comprehension. It resembles #cromod's solution but preserves the "categories" from l2:
{k:[val for i, val in enumerate(l1) if k == l2[i]] for k in set(l2)}
Output:
>>> l1
[2, 3, 5, 7, 8, 10, 23]
>>> l2
[1, 1, 2, 1, 3, 4, 3]
>>> {k:[val for i, val in enumerate(l1) if k == l2[i]] for k in set(l2)}
{1: [2, 3, 7], 2: [5], 3: [8, 23], 4: [10]}
>>>

Python, Remove list B from list A to make list C?

How do I remove one list from within another list, into a new list? So subtract b from a to produce a new list, c?
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,]
b = [3, 4, 5, 6]
c = []?
Convert the lists into a set and take the set difference.
c = list(set(a).difference(set(b)))
To keep ordering and get speedup from using set membership:
bs = set(b)
c = [x for x in a if x not in bs]
Or use a list comprehension:
c = [x for x in a if x not in b]
Depending on what you're doing, you might be better off with sets in the first place:
>>> a = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, }
>>> b = {3, 4, 5, 6}
>>> a
set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b
set([3, 4, 5, 6])
>>> a.difference(b)
set([0, 1, 2, 7, 8, 9])
collections.Counter is another useful standard type if you want to count multiple repetitions:
>>> from collections import Counter as C
>>> a = C([1,1,1,2,2,3,4])
>>> b = C([1,4,5])
>>> a - b
Counter({1: 2, 2: 2, 3: 1})
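If you then need a flat list again, Counter.elements() expands the remaining counts back out:
>>> list((a - b).elements())
[1, 1, 2, 2, 3]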
