Fast way to find lists containing two particular items? - python

I have a list of lists (about 200) containing different strings:
lists = [
['a', 'b', 'c', 'g', ...],
['b', 'c', 'f', 'a', ...],
...
]
Now I'd like to find all the lists that contain two given strings, in the given order.
For example, given ('a', 'g'), ['a', 'b', 'c', 'g', ...] will be matched.
What's the Pythonic way of doing this?

In my opinion the most Pythonic way would be:
selection = [L for L in lists
             if x1 in L and x2 in L and L.index(x1) < L.index(x2)]
The drawback is that it searches the list twice for each element: once to check for presence (discarding the index) and again to check the ordering.
An alternative could be
def match(a, b, L):
    try:
        return L.index(a) < L.index(b)
    except ValueError:
        return False
selection = [L for L in lists if match(x1, x2, L)]
but I find it slightly uglier and I wouldn't use it unless performance is a problem.
If the logic required instead is to accept a list containing [... x2 ... x1 ... x2 ...] then the check is different:
selection = [L for L in lists
             if x1 in L and x2 in L[L.index(x1)+1:]]
which translates to English as "x1 is in the list and x2 is in the part following the first x1". This also works as expected if x1 and x2 are the same value.
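If performance does matter, the double search mentioned above can be avoided with the optional start argument of list.index. The following is a minimal sketch, not part of the original answer: the helper name contains_in_order and the sample data are made up for illustration, and it implements the second interpretation (x2 appears somewhere after the first x1).

```python
def contains_in_order(seq, first, second):
    """Return True if `second` occurs somewhere after the first `first`."""
    try:
        i = seq.index(first)       # position of the first occurrence of `first`
        seq.index(second, i + 1)   # search for `second` only after that position
    except ValueError:             # raised when either search finds nothing
        return False
    return True


# Hypothetical sample data for illustration.
lists = [
    ['a', 'b', 'c', 'g'],
    ['b', 'c', 'f', 'a'],
    ['g', 'a', 'b'],
]
selection = [L for L in lists if contains_in_order(L, 'a', 'g')]
print(selection)
```

Each list is scanned at most once from start to end, and no slice copy is created.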

Related

How does zip(*) generate n-grams?

I am reviewing some notes on n-grams, and I came across a couple of interesting functions. First there's this one to generate bigrams:
def bigrams(word):
    return sorted(list(set(''.join(bigram)
        for bigram in zip(word,word[1:]))))
def bigram_print(word):
    print("The bigrams of", word, "are:")
    print(bigrams(word))
bigram_print("ababa")
bigram_print("babab")
After doing some reading and playing on my own with Python I understand why this works. However, when looking at this function, I am very puzzled by the use of zip(*word[i:]) here. I understand that the * is an unpacking operator (as explained here), but I really am getting tripped up by how it's working in combination with the list comprehension here. Can anyone explain?
def ngrams(word, n):
    return sorted(list(set(''.join(ngram)
        for ngram in zip(*[word[i:]
            for i in range(n)]))))
def ngram_print(word, n):
    print("The {}-grams of {} are:".format(n, word))
    print(ngrams(word, n))
for n in [2, 3, 4]:
    ngram_print("ababa", n)
    ngram_print("babab", n)
    print()
The following example should explain how this works. I have added code and a visual representation of it.
Intuition
The core idea is to zip together multiple versions of the same list where each of them starts from the next subsequent element.
Let's say L is a list of words/elements ['A', 'B', 'C', 'D']
Then, what's happening here is that L, L[1:], L[2:] get zipped, which means the first elements of each of these (which are the 1st, 2nd, and 3rd elements of L) get grouped together, the second elements get grouped together, and so on.
Visually this can be shown as:
The statement we are worried about -
zip ( * [L[i:] for i in range(n)])
#|___||_______||________________________|
# | | |
# zip unpack versions of L with subsequent 0 to n elements skipped
Code example
l = ['A','B','C','D']
print('original list: '.ljust(27),l)
print('list skipping 1st element: ',l[1:])
print('list skipping 2 elements: '.ljust(27),l[2:])
print('bi-gram: '.ljust(27), list(zip(l,l[1:])))
print('tri-gram: '.ljust(27), list(zip(l,l[1:],l[2:])))
original list: ['A', 'B', 'C', 'D']
list skipping 1st element: ['B', 'C', 'D']
list skipping 2 elements: ['C', 'D']
bi-gram: [('A', 'B'), ('B', 'C'), ('C', 'D')]
tri-gram: [('A', 'B', 'C'), ('B', 'C', 'D')]
As you can see, you are basically zipping the same list but with one skipped. This zips (A, B) and (B, C) ... together for bigrams.
The * operator is for unpacking. When you change the i value to skip elements, you are basically zipping a list of [l[0:], l[1:], l[2:]...]. This is passed to the zip() and unpacked inside it with *.
zip(*[word[i:] for i in range(n)])  # where word is the list of words
Alternate to list comprehension
The above list comprehension is equivalent to -
n = 3
lists = []
for i in range(n):
    print(l[i:])  # comment this out if not needed
    lists.append(l[i:])
out = list(zip(*lists))
print(out)
['A', 'B', 'C', 'D']
['B', 'C', 'D']
['C', 'D']
[('A', 'B', 'C'), ('B', 'C', 'D')]
If you break down
zip(*[word[i:] for i in range(n)])
You get:
[word[i:] for i in range(n)]
Which is equivalent to:
[word[0:], word[1:], word[2:], ... word[n-1:]]
Which are each strings that start from different positions in word
Now, if you apply the unpacking * operator to it:
*[word[0:], word[1:], word[2:], ... word[n-1:]]
You get each of the lists word[0:], word[1:] etc passed to zip()
So, zip is getting called like this:
zip(word[0:], word[1:], word[2:], ... word[n-1:])
Which - according to how zip works - would create n-tuples, with each entry coming from one of the corresponding arguments:
[(word[0:][0], word[1:][0], ...),
 (word[0:][1], word[1:][1], ...),
 ...]
If you map the indexes, you'll see that these values correspond to the n-gram definitions for word
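To make the index mapping concrete, here is a small sketch using the word "ababa" from the question with n = 3. The variable names shifted and trigrams are chosen only for this illustration. Note that zip stops at the shortest argument, which is why a word of length 5 yields 5 - 3 + 1 = 3 trigrams.

```python
word = "ababa"
n = 3

# The n shifted copies of the word: word[0:], word[1:], word[2:]
shifted = [word[i:] for i in range(n)]
print(shifted)   # ['ababa', 'baba', 'aba']

# zip(*shifted) is the same call as zip('ababa', 'baba', 'aba');
# it stops as soon as the shortest argument ('aba') is exhausted.
trigrams = [''.join(t) for t in zip(*shifted)]
print(trigrams)  # ['aba', 'bab', 'aba']
```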

Count letter differences of two strings Python

I'm having some problems with an exercise about strings in Python.
I have 2 different lists:
list1= "ABCDEFABCDEF"
and
list2= "AZBYCXDWEVFABCDEF"
I need to compare those 2 lists position by position (the 1st letters together, then the 2nd, and so on), up to the shorter length (here, the length of list1), and store each letter in a new variable depending on whether the letters are different or the same.
identicals=[]
different=[]
I tried to code something and it seems to find the same ones, but doesn't work on the different ones since it copies them multiple times.
for x in list1:
    for y in list2:
        if list1>list2:
            if x==y:
                identicals.append(x)
            if x!=y :
                different.append(x)
        if list2>list1:
            if y==x:
                identicals.append(y)
            if y!=x:
                different.append(y)
EDIT: Output result should be something like this:
identicals=['A']
different=["Z","B","Y","C","X","D","W","E","V","F","A"]
The thing is that the letter A is only shown on identicals but not in different even if F!=A.
You are getting unwanted duplicates because you have a nested pair of for loops, so each item in list2 gets tested for every item in list1.
The key idea is to iterate over the two strings in parallel. You can do that with the built-in zip function, which yields a tuple of the corresponding items from each iterable you feed it, stopping as soon as one of the iterables runs out of items.
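As a quick illustration of that behaviour, here is a tiny sketch with two made-up strings of different lengths (s1, s2, and pairs are names chosen for this example only):

```python
s1 = "ABC"
s2 = "AZBYC"

# zip pairs items by position and stops when the shorter string runs out.
pairs = list(zip(s1, s2))
print(pairs)  # [('A', 'A'), ('B', 'Z'), ('C', 'B')]
```

Only three pairs are produced, because s1 has three characters; the trailing 'Y' and 'C' of s2 are never visited.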
From your example code, it looks like you want to take the items for the different list from the longer string. To do that efficiently, figure out which string is the longer before you start looping.
I've renamed your strings because it's confusing to give strings a name starting with "list".
s1 = "ABCDEFABCDEF"
s2 = "AZBYCXDWEVFABCDEF"
identicals = []
different = []
small, large = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
for x, y in zip(small, large):
    if x == y:
        identicals.append(y)
    else:
        different.append(y)
print(identicals)
print(different)
output
['A']
['Z', 'B', 'Y', 'C', 'X', 'D', 'W', 'E', 'V', 'F', 'A']
We can make the for loop more compact at the expense of readability. We put our destination lists into a tuple and then use the equality test to select which list in that tuple to append to. This works because False has a numeric value of 0, and True has a numeric value of 1.
for x, y in zip(small, large):
    (different, identicals)[x == y].append(y)
The problem is the inner loop. You are comparing each of the letters in list1 with all the letters of list2.
Instead you should have a single loop:
identicals=[]
different=[]
short_list = list1 if len(list1) <= len(list2) else list2
for i in range(len(short_list)):
    if list1[i] == list2[i]:
        identicals.append(list1[i])
    else:
        different.append(short_list[i])
Try this
a = "ABCDEFABCDEF"
b = "AZBYCXDWEVFABCDEF"
import numpy
A = numpy.array(list(a))
B = numpy.array(list(b))
common = A[:len(B)] [ (A[:len(B)] == B[:len(A)]) ]
# ~ inverts the boolean mask (current NumPy rejects unary minus on boolean arrays)
different = A[:len(B)] [ ~ (A[:len(B)] == B[:len(A)]) ]
>>> list(common)
['A']
>>> list(different)
['B', 'C', 'D', 'E', 'F', 'A', 'B', 'C', 'D', 'E', 'F']

Python: match lists in two lists of lists

I have two lists of lists. The first is composed of lists formatted as follows:
listInA = [id, a1, a2, a3]
The second is composed of lists formatted similarly, with the id first:
listInB = [id, b1, b2, b3]
Neither list is sorted, and they are not of equal lengths. What is the best way to make a list of lists, with each list of the format:
listInC = [id, a1, a2, a3, b1, b2, b3]
where the id's are matched between both lists? Thanks!
You can create a dictionary using dict comprehension from the second list of lists from ID to list. Then, create your new list using list comprehension, appending the list based on IDs.
listA = [
[1, 'a', 'b', 'c'],
[2, 'd', 'e', 'f'],
]
listB = [
[2, 'u', 'v', 'w'],
[1, 'x', 'y', 'z'],
]
b_map = {b[0]: b for b in listB}
print([a + b_map[a[0]][1:] for a in listA])
Output:
[
[1, 'a', 'b', 'c', 'x', 'y', 'z'],
[2, 'd', 'e', 'f', 'u', 'v', 'w']
]
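Since the question says the two lists are not of equal length, some IDs may appear in only one of them, in which case the b_map[a[0]] lookup above would raise a KeyError. The following is a minimal sketch of one way to handle that, assuming rows without a partner should simply be skipped (an assumption; the question does not say what should happen to them). The extra rows with IDs 3 and 4 are made up for illustration.

```python
listA = [
    [1, 'a', 'b', 'c'],
    [2, 'd', 'e', 'f'],
    [3, 'g', 'h', 'i'],   # hypothetical row with no match in listB
]
listB = [
    [2, 'u', 'v', 'w'],
    [1, 'x', 'y', 'z'],
    [4, 'p', 'q', 'r'],   # hypothetical row with no match in listA
]

# Map each ID in listB to the rest of its row.
b_map = {b[0]: b[1:] for b in listB}

# Keep only rows of listA whose ID also exists in listB (assumed policy).
merged = [a + b_map[a[0]] for a in listA if a[0] in b_map]
print(merged)
```

Building the dictionary takes one pass over listB and each lookup is a hash lookup, so the whole merge stays roughly linear in the combined size of the two lists.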
The fact that the lists are not sorted and are not of equal length increases the difficulty of coming up with an efficient solution to the problem. However, a quick and dirty solution that would work in the end is definitely still feasible.
The ID seems to be first in both lists. Since this is the case, then for each list a in A, we can get the first element of a and check the lists b in B. If the first elements match, then we can create a list including the remaining elements of a and b and append that to C. In short...
def foo(A, B):
    C = []
    for a in A:
        aID = a[0]
        for b in B:
            if aID == b[0]:
                c = [aID, a[1], a[2], a[3], b[1], b[2], b[3]]
                C.append(c)
    return C
When dealing with large list sizes for A and B, the efficiency of this solution would drop abysmally, but it should work for reasonably-sized lists.

Comparing Order of 2 Python Lists

I am looking for some help comparing the order of 2 Python lists, list1 and list2, to detect when list2 is out of order.
list1 is static and contains the strings a,b,c,d,e,f,g,h,i,j. This is the "correct" order.
list2 contains the same strings, but the order and the number of strings may change. (e.g. a,b,f,d,e,g,c,h,i,j or a,b,c,d,e)
I am looking for an efficient way to detect when list2 is out of order by comparing it against list1.
For example, if list2 is a,c,d,e,g,i, the check should return true (as the strings are in order),
while if list2 is a,d,b,c,e, it should return false (as string d appears out of order).
First, let's define list1:
>>> list1='a,b,c,d,e,f,g,h,i,j'.split(',')
>>> list1
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
While your list1 happens to be in alphabetical order, we will not assume that. This code works regardless.
Now, let's create a list2 that is out-of-order:
>>> list2 = 'a,b,f,d,e,g,c,h,i,j'.split(',')
>>> list2
['a', 'b', 'f', 'd', 'e', 'g', 'c', 'h', 'i', 'j']
Here is how to test whether list2 is out of order or not:
>>> list2 == sorted(list2, key=lambda c: list1.index(c))
False
False means out-of-order.
Here is an example that is in order:
>>> list2 = 'a,b,d,e'.split(',')
>>> list2 == sorted(list2, key=lambda c: list1.index(c))
True
True means in-order.
Ignoring elements of list2 not in list1
Let's consider a list2 that has an element not in list1:
>>> list2 = 'a,b,d,d,e,z'.split(',')
To ignore the unwanted element, let's create list2b:
>>> list2b = [c for c in list2 if c in list1]
We can then test as before:
>>> list2b == sorted(list2b, key=lambda c: list1.index(c))
True
Alternative not using sorted
>>> list2b = ['a', 'b', 'd', 'd', 'e']
>>> indices = [list1.index(c) for c in list2b]
>>> all(c <= indices[i+1] for i, c in enumerate(indices[:-1]))
True
Why do you need to compare it to list1 since it seems like list1 is in alphabetical order? Can't you do the following?
def is_sorted(alist):
    return alist == sorted(alist)
print(is_sorted(['a','c','d','e','g','i']))
# True
print(is_sorted(['a','d','b','c','e']))
# False
Here's a solution that runs in expected linear time. That isn't too important if list1 is always 10 elements and list2 isn't any longer, but with longer lists, solutions based on index will experience extreme slowdowns.
First, we preprocess list1 so we can quickly find the index of any element. (If we have multiple list2s, we can do this once and then use the preprocessed output to quickly determine whether multiple list2s are sorted):
list1_indices = {item: i for i, item in enumerate(list1)}
Then, we check whether each element of list2 has a lower index in list1 than the next element of list2:
is_sorted = all(list1_indices[x] < list1_indices[y] for x, y in zip(list2, list2[1:]))
We can do better by pairing list2 with a lazy islice of itself, which avoids copying list2 and materializing the whole list of pairs, letting us save a substantial amount of work if we detect that list2 is out of order early in the list. On Python 2 this needs itertools.izip; on Python 3 the built-in zip is already lazy:
# Python 3: the built-in zip is lazy. islice is still needed to avoid copying list2.
# On Python 2, use itertools.izip in place of zip.
from itertools import islice
is_sorted = all(list1_indices[x] < list1_indices[y]
                for x, y in zip(list2, islice(list2, 1, None)))
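The same idea can be wrapped in a reusable function. This is only a sketch under stated assumptions: the function name in_reference_order is made up, and it treats an element that is missing from the reference list as "out of order" (the answer above would raise a KeyError instead). Like the answer's strict < comparison, it also rejects repeated elements.

```python
def in_reference_order(candidate, reference):
    """Return True if candidate's items appear in strictly increasing
    order of their positions in reference."""
    # Preprocess once: item -> index in the reference order.
    position = {item: i for i, item in enumerate(reference)}
    previous = -1
    for item in candidate:
        current = position.get(item)
        # Assumption: unknown items count as out of order.
        if current is None or current <= previous:
            return False          # stop at the first violation
        previous = current
    return True


list1 = 'a,b,c,d,e,f,g,h,i,j'.split(',')
print(in_reference_order(['a', 'c', 'd', 'e', 'g', 'i'], list1))  # True
print(in_reference_order(['a', 'd', 'b', 'c', 'e'], list1))       # False
```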
is_sorted = not any(list1.index(list2[i]) > list1.index(list2[i+1]) for i in range(len(list2)-1))
The function any returns true if any of the items in an iterable are true. I combined this with a generator expression that loops through all the values of list2 and makes sure they're in order according to list1.
if list2 == sorted(list2, key=lambda element: list1.index(element)):
    print('sorted')
Let's assume that when you write that list1 contains the strings a,b,c,d,e,f,g,h,i, you mean that a could be 'zebra' and b could actually be 'elephant', so the order may not be alphabetical. Also, this approach will return false if an item is in list2 but not in list1.
good_list2 = ['a','c','d','e','g','i']
bad_list2 = ['a','d','b','c','e']
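def verify(somelist):
    list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    # Walk somelist from the end (note: pop() empties the caller's list),
    # each time keeping only the part of list1 before the popped item.
    while somelist:
        try:
            list1 = list1[:list1.index(somelist.pop())]
        except ValueError:
            return False
    return True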
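A related way to express the same "consume the reference list as you go" idea is to test membership against a single iterator over list1. This is a sketch rather than part of the answer above; the name is_subsequence is made up. It relies on the fact that `item in it` advances the iterator until it finds the item, so later searches can only see later positions.

```python
def is_subsequence(candidate, reference):
    """Return True if candidate's items appear in reference in the same order."""
    it = iter(reference)
    # Each membership test consumes the iterator up to and including the match,
    # so an item that only occurs earlier in reference can no longer be found.
    return all(item in it for item in candidate)


list1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
good_list2 = ['a', 'c', 'd', 'e', 'g', 'i']
bad_list2 = ['a', 'd', 'b', 'c', 'e']
print(is_subsequence(good_list2, list1))  # True
print(is_subsequence(bad_list2, list1))   # False
```

Unlike the pop-based version, this leaves the input list untouched, and it also returns False when an item of list2 is not in list1.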

list comprehension question

Is there a way to add multiple items to a list in a list comprehension per iteration? For example:
y = ['a', 'b', 'c', 'd']
x = [1,2,3]
return [x, a for a in y]
output: [[1,2,3], 'a', [1,2,3], 'b', [1,2,3], 'c', [1,2,3], 'd']
Sure there is, but not with a plain list comprehension:
EDIT: Inspired by another answer:
y = ['a', 'b', 'c', 'd']
x = [1,2,3]
return sum([[x, a] for a in y],[])
How it works: sum will add a sequence of anythings, so long as there is an __add__ member to do the work. BUT, it starts off with an initial total of 0. You can't add 0 to a list, but you can give sum() another starting value. Here we use an empty list.
If, instead of needing an actual list, you wanted just a generator, you can use itertools.chain.from_iterable, which just strings a bunch of iterators into one long iterator.
from itertools import *
return chain.from_iterable((x,a) for a in y)
or, even more itertools-friendly (on Python 3 the built-in zip takes the place of itertools.izip):
return chain.from_iterable(zip(repeat(x), y))
There are other ways, too, of course: To start with, we can improve Adam Rosenfield's answer by eliminating an unneeded lambda expression (on Python 3, reduce must first be imported from functools):
return reduce(list.__add__,([x, a] for a in y))
since list already has a member that does exactly what we need. We could achieve the same using map and side effects in list.extend:
l = []
# On Python 3, map is lazy, so wrap it in list() to force the extend calls to run.
list(map(l.extend, [[x, a] for a in y]))
return l
Finally, let's go for a pure list comprehension that is as inelegant as possible:
return [ y[i//2] if i%2 else x for i in range(len(y)*2)]
Here's one way:
y = ['a', 'b', 'c', 'd']
x = [1,2,3]
return reduce(lambda a,b:a+b, [[x,a] for a in y])
x = [1,2,3]
y = ['a', 'b', 'c', 'd']
z = []
[z.extend([x, a]) for a in y]
(The correct value will be in z)
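One further option, not mentioned in the answers above, is a list comprehension with two for clauses: the outer clause walks y, and the inner clause yields both x and the current item. A minimal sketch using the data from the question (the name result is chosen for this example):

```python
y = ['a', 'b', 'c', 'd']
x = [1, 2, 3]

# For every a in y, emit two items: x itself, then a.
result = [item for a in y for item in (x, a)]
print(result)
# [[1, 2, 3], 'a', [1, 2, 3], 'b', [1, 2, 3], 'c', [1, 2, 3], 'd']
```

As with the other solutions, every [1, 2, 3] in the result is the same list object as x, not a copy.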
