Running across a subset of a list in python - python

Is it possible to run through a subset in a python list?
I have two lists: list1 is very long and list2 is quite short. I want to check which elements of list2 are also in list1. My current version looks like this:
for item in list1:
    if item in list2:
        # do something
This takes a very long time. Is it possible to get a subset and then run through the list?
I need to do this many times.

If the list elements are hashable, you can find the intersection using sets:
>>> for x in set(list2).intersection(list1):
...     print(x)
If they are not hashable but can be sorted, you can at least speed up the search by sorting the shorter list and doing bisected lookups:
>>> from bisect import bisect_left
>>> list2.sort()
>>> n = len(list2)
>>> for x in list1:
...     i = bisect_left(list2, x)
...     if i != n and list2[i] == x:
...         print(x)
If your data elements are neither hashable nor sortable, then you won't be able to speed up your original code:
>>> for x in list1:
...     if x in list2:
...         print(x)
The running time of the set-intersection approach is proportional to the sum of lengths of the two lists, O(n1 + n2). The running time of the bisected-search approach is O((n1 + n2) * log(n2)). The running time of the original brute-force approach is O(n1 * n2).
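The first two approaches can be combined into one helper. The following is only a minimal sketch and not part of the original answer; the function name common_items and the sample data are made up for illustration. It tries the set-based intersection first and falls back to the bisected search when an element turns out to be unhashable.

```python
from bisect import bisect_left


def common_items(long_list, short_list):
    """Return the distinct items of short_list that also occur in long_list."""
    try:
        # Hashable elements: average O(n1 + n2). Result order is arbitrary.
        return list(set(short_list).intersection(long_list))
    except TypeError:
        # Unhashable but sortable elements: O((n1 + n2) * log(n2)).
        ordered = sorted(short_list)
        reported = [False] * len(ordered)
        found = []
        for x in long_list:
            i = bisect_left(ordered, x)
            if i != len(ordered) and ordered[i] == x and not reported[i]:
                reported[i] = True  # report each matching value only once
                found.append(x)
        return found


print(sorted(common_items([1, 2, 3, 4, 5, 3], [3, 5, 9])))  # [3, 5]
print(common_items([[1], [2], [3], [2]], [[2], [7]]))       # [[2]]
```

If the elements are neither hashable nor sortable, the TypeError from sorted() propagates, which matches the brute-force-only case described above.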

You can use sets here; they provide O(1) average-case lookup, compared to O(N) for lists.
However, sets require their items to be hashable (for built-in types, that generally means immutable).
s = set(list1)
for item in list2:
    if item in s:
        pass  # do something

You can use a list comprehension as shorthand:
list3 = [l for l in list1 if l in list2]
If your lists have repeated elements, remove the duplicates first:
l2 = list(set(list2))
l1 = list(set(list1))
list3 = [l for l in l1 if l in l2]
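As a hedged variation on the shorthand above: if the elements are hashable, converting list2 to a set once keeps each membership test at O(1) on average, while the comprehension preserves the order (and any duplicates) of list1. The sample values and the name lookup are illustrative assumptions.

```python
list1 = ["a", "b", "c", "b", "d", "e"]  # hypothetical long list
list2 = ["e", "b", "x"]                 # hypothetical short list

lookup = set(list2)  # built once, reusable for repeated checks
list3 = [item for item in list1 if item in lookup]

print(list3)  # ['b', 'b', 'e']
```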

Related

Check if part of a multi-dimensional list is in a separate multi-dimensional list

Here is some example code.
list1 = [['one','a'],['two','a'],['three','a'],['four','a']]
list2 = [['three','b'],['four','a'],['five','b']]
for l in list1:
    if l not in list2:
        print(l[0])
and the output from this code.
one
two
three
because ['four','a'] does indeed appear in both lists.
What I am trying to do is check whether just the first item of each entry in the first list appears in the second list. I have tried variations of the following:
list1 = [['one','a'],['two','a'],['three','a'],['four','a']]
list2 = [['three','b'],['four','a'],['five','b']]
for l in list1:
    if l[0] not in list2:
        print(l[0])
however, that code returns
one
two
three
four
though both 'three' and 'four' do appear in the second list.
I have used different methods before now to find the values that appear in only one of a pair of lists, then used that to make a master list that contains all possible values with no duplicates. I believe the same should be possible using this method, but the syntax is a mystery to me. Where am I going wrong here?
You could use not any() and check the specific requirement in the generator expression:
list1 = [['one','a'],['two','a'],['three','a'],['four','a']]
list2 = [['three','b'],['four','a'],['five','b']]
for l in list1:
    if not any(l[0] == l2[0] for l2 in list2):
        print(l[0])
# one
# two
You could also use sets if order doesn't matter:
list1 = [['one','a'],['two','a'],['three','a'],['four','a']]
list2 = [['three','b'],['four','a'],['five','b']]
set(l[0] for l in list1) - set(l2[0] for l2 in list2)
# {'one', 'two'}
You can use set operations:
list1 = [['one','a'],['two','a'],['three','a'],['four','a']]
list2 = [['three','b'],['four','a'],['five','b']]
result = set(i[0] for i in list1) - set(i[0] for i in list2)
print(result)
# output {'one', 'two'}
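If the order of list1 matters, the two ideas above can be combined. This is a sketch under the assumption that the first items are hashable; the names first_items and only_in_list1 are introduced here for illustration.

```python
list1 = [['one', 'a'], ['two', 'a'], ['three', 'a'], ['four', 'a']]
list2 = [['three', 'b'], ['four', 'a'], ['five', 'b']]

# Collect the first item of every entry in list2 for fast membership tests.
first_items = {entry[0] for entry in list2}

# Keep list1's order while skipping entries whose first item occurs in list2.
only_in_list1 = [entry[0] for entry in list1 if entry[0] not in first_items]

print(only_in_list1)  # ['one', 'two']
```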

extract strings from a list based on another string in python

I have a list containing integers like this (not in order):
list1 = [2,1,3]
I have a second list like this:
list2 = ['Contig_1_Length_1000','Contig_2_Length_500','Contig_3_Length_400','Contig_4_Length_300','Contig_5_Length_200','Contig_6_Length_100']
These lists are from FASTA files. The items in list2 always start with "Contig_", but may not always be in sorted order. I'd like to return a list like this:
list3 = ['Contig_1_Length_1000','Contig_2_Length_500','Contig_3_Length_400']
list3 contains only the contigs whose number appears in list1.
How can I do this in Python?
Thank you very much!
You can create a dictionary from the second list for an O(n) (linear) solution:
import re
list1 = [2,1,3]
list2 = ['Contig_1_Length_1000','Contig_2_Length_500','Contig_3_Length_400','Contig_4_Length_300','Contig_5_Length_200','Contig_6_Length_100']
new_result = {int(re.findall(r'(?<=^Contig_)\d+', i)[0]): i for i in list2}
final_result = [new_result[i] for i in list1]
Output:
['Contig_2_Length_500', 'Contig_1_Length_1000', 'Contig_3_Length_400']
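You can use a list comprehension like this (assuming list1 holds strings such as 'Contig_1' rather than integers):
list3 = [i for i in list2 if any(j in i for j in list1)]
You can use startswith, which accepts a tuple of prefixes and checks them efficiently (again assuming list1 holds strings):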
[i for i in list2 if i.startswith(tuple(list1))]
['Contig_1_Length_1000', 'Contig_2_Length_500', 'Contig_3_Length_400']
A pretty simple list comprehension like:
list1 = ['Contig_1','Contig_2','Contig_3']
list2 = ['Contig_1_Length_1000','Contig_2_Length_500','Contig_3_Length_400','Contig_4_Length_300','Contig_5_Length_200','Contig_6_Length_100']
list3 = [s for s in list2 for k in list1 if k in s]
print(list3)
gives an output of:
['Contig_1_Length_1000', 'Contig_2_Length_500', 'Contig_3_Length_400']
You'll have to iterate over the two input lists, and see for each combination whether there's a match. One way to do this is
[list2_item for list2_item in list2 if any([list1_item in list2_item for list1_item in list1])]
I tried Ajax1234's method of using re; blhsing's code, which is nearly the same as mine except that it uses a generator rather than a list (and has more opaque variable names); jeremycg's method of startswith; and bilbo_strikes_back's method of zip. The zip method was by far the fastest, but it just takes the first three elements of list2 without regard to the contents of list1, so we might as well do list3 = list2[:3], which was even faster. Ajax1234's method took about twice as long as blhsing's, which took slightly longer than mine. jeremycg's took slightly more than half as much time, but keep in mind that it assumes the substring will be at the beginning.
Try zip and slicing:
list1 = ['Contig_1','Contig_2','Contig_3']
list2 = ['Contig_1_Length_1000','Contig_2_Length_500','Contig_3_Length_400','Contig_4_Length_300','Contig_5_Length_200','Contig_6_Length_100']
list3 = [x[1] for x in zip(list1, list2)]
print(list3)
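Another possible approach, sketched here on the assumption that every name has the form Contig_<number>_Length_<number> as in the question: parse the contig number and test it against a set built from the integer list1. This keeps the order of list2 and avoids a prefix such as 'Contig_1' also matching 'Contig_10'. The helper name contig_number is hypothetical.

```python
list1 = [2, 1, 3]
list2 = ['Contig_1_Length_1000', 'Contig_2_Length_500', 'Contig_3_Length_400',
         'Contig_4_Length_300', 'Contig_5_Length_200', 'Contig_6_Length_100']


def contig_number(name):
    """Extract the integer that follows 'Contig_' (assumed name format)."""
    return int(name.split('_')[1])


wanted = set(list1)
list3 = [name for name in list2 if contig_number(name) in wanted]

print(list3)
# ['Contig_1_Length_1000', 'Contig_2_Length_500', 'Contig_3_Length_400']
```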

Search through lists in Python to find matches?

I have Python lists with various strings in them such as:
List1 = ["a","b","c","d"]
List2 = ["b","d","e","f"]
List3 = []
List4 = ["d","f","g"]
I need to iterate through these lists, provided they are not blank, and find the items that are in all non-blank lists. In the above example, the exact matches list would be ["d"], since that is the only item that appears in all non-blank lists. List3 is blank, so it would not matter that it is not in that list.
Here's some functional programming beauty:
from operator import and_
from functools import reduce
Lists = List1, List2, List3, List4
result = reduce(and_, map(set, filter(None, Lists)))
I can't test this right now, but something like the following should work:
set.intersection(*(set(l) for l in [List1, List2, List3, List4] if l))
It uses Python's built-in set datatype to do the intersection operation.
for thing in List1:  # iterate over each item; you can check beforehand that the list is not empty
    if len(List2) > 0:  # if not empty
        if thing in List2:  # in the list
            pass  # do the same thing for the other lists
Something like that.
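A minimal sketch that wraps the set-based idea in a function and also covers the case where every list is empty (reduce without an initial value raises TypeError on an empty sequence). The name common_to_all is an assumption made for this example.

```python
def common_to_all(*lists):
    """Return the items present in every non-empty list."""
    non_empty = [set(lst) for lst in lists if lst]
    if not non_empty:
        return set()  # nothing to intersect
    return set.intersection(*non_empty)


List1 = ["a", "b", "c", "d"]
List2 = ["b", "d", "e", "f"]
List3 = []
List4 = ["d", "f", "g"]

print(common_to_all(List1, List2, List3, List4))  # {'d'}
```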

Python script to remove unique elements from a list and print the list with repeated elements in proper order

I have written a script that removes all unique elements from a list and prints the list with only the repeated elements.
Below are some examples of the expected output for a given input:
Input list1:
1,2,1,1,3,5,3,4,3,1,6,7,8,5
Output List1:
1,1,1,3,5,3,3,1,5
Input list2:
1,2,1,1,3,3,4,3,1,6,5
Output List2:
1,1,1,3,3,3,1
#! /bin/python
def remove_unique(*n):
    dict1={}
    list1=[]
    for i in range(len(n)):
        for j in range(i+1,len(n)):
            if n[i] == n[j]:
                dict1[j]=n[j]
                dict1[i]=n[i]
    for x in range(len(n)):
        if x in dict1.keys():
            list1.append(dict1[x])
    return list1
lst1=remove_unique(1,2,1,1,3,5,3,4,3,1,6,7,8,5)
for n in lst1:
    print(n, end=" ")
The script above works exactly as expected when tested with a few smaller lists. However, I would like some ideas on how to optimize it (considering both time and space complexity) for much longer input lists (50000 <= len(list) <= 50M).
Your script has a number of issues:
the classic if x in dict1.keys() should be if x in dict1, to be sure to use the dictionary's hash lookup instead of a linear search (in Python 2, keys() returns a list)
no list comprehension: appending in a loop is not as performant
O(n^2) complexity because of the double loop
My approach:
You could count your elements using collections.Counter, then build a new list with a list comprehension that filters on the number of occurrences:
from collections import Counter
list1 = [1,2,1,1,3,5,3,4,3,1,6,7,8,5]
c = Counter(list1)
new_list1 = [k for k in list1 if c[k]>1]
print(new_list1)
result:
[1, 1, 1, 3, 5, 3, 3, 1, 5]
I may be wrong, but the complexity of this approach is (roughly) O(n) on average: a linear scan of the list to count the elements, plus one dictionary lookup (O(1) on average, thanks to hashing) per element in the list comprehension. So it's good performance-wise.
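For very large inputs, the same idea can be written so that the filtered result is produced lazily instead of being held in memory all at once; the Counter still needs one entry per distinct value. This is only a sketch, and the function name keep_repeated is made up here. It is checked against the two examples from the question.

```python
from collections import Counter


def keep_repeated(items):
    """Return a generator over the elements of items that occur more than once.

    items must be a sequence (such as a list) that can be iterated twice.
    """
    counts = Counter(items)  # first pass: count every value
    return (x for x in items if counts[x] > 1)  # second pass: lazy filter


print(list(keep_repeated([1, 2, 1, 1, 3, 5, 3, 4, 3, 1, 6, 7, 8, 5])))
# [1, 1, 1, 3, 5, 3, 3, 1, 5]
print(list(keep_repeated([1, 2, 1, 1, 3, 3, 4, 3, 1, 6, 5])))
# [1, 1, 1, 3, 3, 3, 1]
```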

I need to make two lists the same

I have two quite long lists and I know that all of the elements of the shorter are contained in the longer, yet I need to isolate the elements in the longer list which are not in the shorter so that I can remove them individually from the dictionary I got the longer list from.
What I have so far is:
for e in range(len(lst_ck)):
    if lst_ck[e] not in lst_rk:
        del currs[lst_ck[e]]
        del lst_ck[e]
lst_ck is the longer list and lst_rk is the shorter; currs is the dictionary that lst_ck came from. If it helps, they are both lists of 3-digit keys from dictionaries.
Use sets to find the difference:
l1 = [1,2,3,4]
l2 = [1,2,3,4,6,7,8]
print(set(l2).difference(l1))
set([6, 7, 8]) # in l2 but not in l1
Then remove the elements:
diff = set(l2).difference(l1)
your_list[:] = [ele for ele in your_list if ele not in diff]
If your lists are very big, you may prefer a generator expression:
your_list[:] = (ele for ele in your_list if ele not in diff)
If you don't care about multiple occurrences of the same item, use a set:
diff = set(lst_ck) - set(lst_rk)
If you do care, try this:
diff = [e for e in lst_ck if e not in lst_rk]
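Applied to the question's names, a hedged sketch could look like the following. The contents of currs and lst_rk are invented sample data. Note that the loop in the question deletes from lst_ck while indexing over its initial length, which skips elements and raises IndexError once anything has been deleted; computing the extra keys first avoids that.

```python
currs = {"101": "USD", "202": "EUR", "303": "GBP", "404": "JPY"}  # sample data
lst_rk = ["101", "303"]  # shorter list: keys to keep
lst_ck = list(currs)     # longer list: all keys of the dictionary

# Keys that are in the longer list but not in the shorter one.
extra = set(lst_ck) - set(lst_rk)

for key in extra:
    del currs[key]

# Rebuild the longer list in place, preserving its order.
lst_ck[:] = [key for key in lst_ck if key not in extra]

print(currs)   # {'101': 'USD', '303': 'GBP'}
print(lst_ck)  # ['101', '303']
```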
