Comparing lists and extracting unique values - python

I have two lists:
l1: 38510 entries
l2: 6384 entries
I want to extract only values, which are present in both lists.
So far that was my approach:
equals = []
for quote in l2:
    for quote2 in l1:
        if quote == quote2:
            equals.append(quote)
len(equals)       # 4999
len(set(equals))  # 4452
First of all, I have the feeling this approach is pretty inefficient, because I am checking every value in l1 once for each value in l2.
Furthermore, it seems that I still get duplicates. Is this due to the inner loop over l1?
Thank you!!

You can use list comprehension and the in operator.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 8, 0]
[x for x in a if x in b]
#[2, 4, 6, 8]
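One caveat with this approach: x in b scans the list b for every element of a, which gets slow for lists as long as the ones in the question. A minimal sketch of a variant, assuming the values are hashable, builds a set from b first (the name b_lookup is only illustrative). Unlike a pure set intersection, it keeps the order of a and any repeated values in a:

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 8, 0]

# Build the lookup set once; each membership test is then a hash lookup
b_lookup = set(b)
common = [x for x in a if x in b_lookup]
print(common)
# [2, 4, 6, 8]
```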

You were on the right track by using sets. One of set's coolest features is that you can get the intersection between two sets. An intersection is another way to say the values that occur in both sets. You can read more about it in the docs.
Here is my example:
l1_set = set(l1)
l2_set = set(l2)
equals = l1_set & l2_set
#If you really want it as a list
equals = list(equals)
print(equals)
The & operator tells python to return a new set that only has values in both sets. At the end, I went ahead and converted equals back to a list because that's what your original example wanted. You can omit that if you don't need it.
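As for the duplicates in the original loop: it appends once for every matching pair, so a value that occurs more than once in either list is appended several times. A small sketch with made-up data (the two short lists are assumptions for illustration only) shows the effect and how the set version avoids it:

```python
l1 = ["a", "b", "b", "c"]   # "b" appears twice
l2 = ["b", "c", "c", "d"]   # "c" appears twice

equals = []
for quote in l2:
    for quote2 in l1:
        if quote == quote2:
            equals.append(quote)   # one append per matching pair

print(equals)
# ['b', 'b', 'c', 'c']
print(sorted(set(l1) & set(l2)))
# ['b', 'c']
```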

1. This is the simplest method; it uses only a list comprehension and the in operator, with no sets.
# Intersection of two lists using a list comprehension
def intersection(list_one, list_two):
    temp_list = [value for value in list_one if value in list_two]
    return temp_list
# Illustrate the intersection
list_one = [4, 9, 1, 17, 11, 26, 28, 54, 69]
list_two = [9, 9, 74, 21, 45, 11, 63, 28, 26]
print(intersection(list_one, list_two))
# [9, 11, 26, 28]
2. You can use Python's built-in set() type.
# Intersection of two lists using set()
def intersection(list_one, list_two):
    return list(set(list_one) & set(list_two))
# Illustrate the intersection
list_one = [15, 13, 123, 23, 31, 10, 3, 311, 738, 25, 124, 19]
list_two = [12, 14, 1, 15, 36, 123, 23, 3, 315, 87]
print(intersection(list_one, list_two))
# [123, 3, 23, 15] (sets are unordered, so the order may differ)
3. In this technique, we use the set method intersection() to compute the intersected list.
First, we convert the larger list to a set, then compute its intersection with the other list.
# Intersection of two lists using set() and intersection()
def intersection_list(list_one, list_two):
    return list(set(list_one).intersection(list_two))
# Illustrate the intersection
list_one = [15, 13, 123, 23, 31, 10, 3, 311, 738, 25, 124, 19]
list_two = [12, 14, 1, 15, 36, 123, 23, 3, 315, 87, 978, 4, 13, 19, 20, 11]
# Make sure list_one is the larger list
if len(list_one) < len(list_two):
    list_one, list_two = list_two, list_one
print(intersection_list(list_one, list_two))
# [3, 13, 15, 19, 23, 123] (sets are unordered, so the order may differ)
Additionally, you can follow the tutorials below:
Geeksforgeeks
docs.python.org
LearnCodingFast
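Note that a set has no defined order, so the order of the lists returned by methods 2 and 3 may vary. If a stable result matters, one possible sketch is to sort the intersection, or to keep the first-occurrence order of the first list while dropping repeats (both helper names are made up for illustration):

```python
def intersection_sorted(list_one, list_two):
    # Sort the set intersection to get a reproducible order
    return sorted(set(list_one) & set(list_two))

def intersection_in_order(list_one, list_two):
    # Keep first-occurrence order from list_one and drop repeats
    lookup = set(list_two)
    return list(dict.fromkeys(v for v in list_one if v in lookup))

list_one = [15, 13, 123, 23, 31, 10, 3, 311, 738, 25, 124, 19]
list_two = [12, 14, 1, 15, 36, 123, 23, 3, 315, 87]
print(intersection_sorted(list_one, list_two))
# [3, 15, 23, 123]
print(intersection_in_order(list_one, list_two))
# [15, 123, 23, 3]
```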

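Let's assume that all the entries in both of your lists are integers. If so, computing the intersection of the two lists with sets would be more efficient than using a list comprehension, because each membership test on a list scans it element by element, while a set lookup is a hash lookup: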
import timeit
l1 = [i for i in range(0, 38510)]
l2 = [i for i in range(0, 6384)]
st1 = timeit.default_timer()
# Using list comprehension
l3 = [i for i in l1 if i in l2]
ed1 = timeit.default_timer()
# Using set
st2 = timeit.default_timer()
l4 = list(set(l1) & set(l2))
ed2 = timeit.default_timer()
print(ed1-st1) # 5.7621682 secs
print(ed2-st2) # 0.004478600000000554 secs
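A single start/stop measurement can be noisy. Here is a hedged sketch of the same comparison using timeit.timeit, which runs each function several times. The list sizes are deliberately smaller than in the question so it finishes quickly, and the function names are illustrative:

```python
import timeit

l1 = list(range(3000))
l2 = list(range(500))

def with_list_comprehension():
    # Each "i in l2" scans the list l2
    return [i for i in l1 if i in l2]

def with_sets():
    # Set intersection uses hash lookups
    return list(set(l1) & set(l2))

# Total time in seconds for 5 runs of each approach
t_list = timeit.timeit(with_list_comprehension, number=5)
t_set = timeit.timeit(with_sets, number=5)
print(f"list comprehension: {t_list:.4f} s, sets: {t_set:.4f} s")
```

The absolute numbers depend on the machine, so only the ratio between the two timings is meaningful.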

As you have such long lists, you might want to use numpy, which specializes in efficient array processing for Python.
For your case, you can use numpy.intersect1d() to get the sorted, unique values that are in both of the input arrays, as follows:
import numpy as np
l1 = [1, 3, 5, 10, 11, 12]
l2 = [2, 3, 4, 10, 12, 14, 16, 18]
l_uniques = np.intersect1d(l1, l2)
print(l_uniques)
[ 3 10 12]
You can keep the result as a numpy array for further fast processing, or convert it back to a Python list with:
l_uniques2 = l_uniques.tolist()

Related

Check if double of element in first list is present in second list and print the output

Suppose
List1 = [ 23, 45, 6, 7, 34]
List2 = [46, 23, 1, 14, 68, 56]
Compare List1 and List2 and print the elements of List1 whose double is present in List2
Output = [23,7,34]
Try this:
Output = [i for i in List1 if i*2 in List2]
You can convert list2 to a set for efficient lookups, and use a list comprehension with the said condition for the desired output:
set2 = set(List2)
[i for i in List1 if i * 2 in set2]
You already have the answer, but just for the sake of simplicity: basically, you want to iterate through List1 and check whether the doubled value is in List2. If so, add the element to the output list.
List1 = [ 23, 45, 6, 7, 34]
List2 = [46, 23, 1, 7, 14, 68, 56]
output = []
for i in List1:
    if i*2 in List2:
        output.append(i)
print(output)
You already got the answers. However, just for fun, I came up with the following method. I did not benchmark it against the other approaches listed here, though that could be interesting to do.
import numpy as np
l = np.array(List1) * 2
print(l)
## [46 90 12 14 68]
print(set(l) & set(List2))
## {68, 46, 14}
l2 = set(l) & set(List2)
print([List1[list(np.nonzero(l == i))[0][0]] for i in l if i in l2])
## [23, 7, 34]
It uses numpy broadcasting along with the fast intersection operation of Python sets. This may be useful if the two lists are very big.

How to remove varying multiple strings from a string extracted from a csv file

I am quite new to programming and have a string with integrated list values. I am trying to isolate the numerical values in the string to be able to use them later.
I have tried to split the string, and change it back to a list and remove the EU variables with a loop. The initial definition produces the indexes of the duplicates and writes them in a list/string format that I am trying to change.
This is the csv file extract example:
Country,Population,Number,code,area
,,,,
Canada,8822267,83858,EU15,central
Denmark,11413058,305010,EU6,west
Southafrica,705034,110912,EU6,south
We are trying to add up the populations for each repeated EU code.
def duplicates(listed, number):
    return [i for i,x in enumerate(listed) if x == number]
a=list((x, duplicates(EUlist, x)) for x in set(EUlist) if EUlist.count(x) > 1)
str1 = ''.join(str(e) for e in a)
for x in range (6,27):
    str2=str1.replace("EUx","")
    #split=str1.split("EUx")
    #Here is where I tried to split it as a list. Changing str1 back to a list. str1= [x for x in split]
This is what the code produces:
('EU6', [1, 9, 10, 14, 17, 19])('EU12', [21, 25])('EU25', [4, 5, 7, 12, 15, 16, 18, 20, 23, 24])('EU27', [2, 22])('EU9', [6, 13])('EU15', [0, 8, 26])
I am trying to isolate the numbers in the square brackets so it prints:
[1, 9, 10, 14, 17, 19]
[21, 25]
[4, 5, 7, 12, 15, 16, 18, 20, 23, 24]
[2, 22]
[6, 13]
[0, 8, 26]
This will allow me to isolate the indexes for further use.
I'm not sure without example data but I think this might do the trick:
def duplicates(listed, number):
    return [i for i,x in enumerate(listed) if x == number]
a=list((x, duplicates(EUlist, x)) for x in set(EUlist) if EUlist.count(x) > 1)
for item in a:
    print(item[1])
At least I think this should print what you asked for in the question.
As an alternative, you can use the pandas module and save some typing. Remove the four commas on the second line and then:
import pandas as pd
csvfile = r'C:\Test\pops.csv'
df = pd.read_csv(csvfile)
df.groupby('membership')['Population'].sum()
Will output:
membership
Brexit 662307
EU10 10868
EU12 569219
EU15 8976639
EU25 17495803
EU27 900255
EU28 41053
EU6 13694963
EU9 105449
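If pandas is not available, a rough standard-library sketch can do both jobs in one pass: collect the row indexes per code and add up the populations. It assumes the column names from the extract in the question (Population, code) and reads the sample rows from a string instead of a file; the variable names are illustrative.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the extract in the question
sample = """Country,Population,Number,code,area
,,,,
Canada,8822267,83858,EU15,central
Denmark,11413058,305010,EU6,west
Southafrica,705034,110912,EU6,south
"""

indexes = defaultdict(list)   # code -> row positions
totals = defaultdict(int)     # code -> summed population

# Skip the row that contains only commas
rows = [r for r in csv.DictReader(io.StringIO(sample)) if r["code"]]
for i, row in enumerate(rows):
    indexes[row["code"]].append(i)
    totals[row["code"]] += int(row["Population"])

print(dict(indexes))
# {'EU15': [0], 'EU6': [1, 2]}
print(dict(totals))
# {'EU15': 8822267, 'EU6': 12118092}
```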

Recursive function that takes in one list and returns two lists

I am asked to define a recursive function that takes in a list and then assigns the values of that list among two other lists in such a way that when you take the sum of each of those two lists you get two results that are in close proximity to each other.
Example:
If I run:
print(proximity_lists([5, 8, 8, 9, 17, 21, 24, 27, 31, 41]))
I get back two lists :
[31, 27, 21, 9, 8] #sum = 96
[41, 24, 17, 8, 5] #sum = 95
This is how I did it, however I can't get my head around understanding how to return two lists in a recursive function. So far I was comfortable with conditions where I had to return one list.
This is my code so far:
def proximity_lists(lst, lst1 = [], lst2 = []):
    """
    parameters : lst of type list;
    returns : returns two lists such that the sum of the elements in the lst1
              is in the proximity of the sum of the elements in the lst2
    """
    if not lst:
        if abs(sum(lst1)-sum(lst2)) in range(5):
            return lst1, lst2
    else:
        return {Not sure what to put here} + proximity_lists(lst[1:])
As far as range() goes, it can take any argument, as long as the two sums end up as close to each other as they can get. I picked 5 because, based on the example output above, the difference between them is 1.
I need to add that this has to be done without the help of any modules. It has to be done using simple functions.
This is potentially not the optimal solution in terms of performance (exponential complexity), but maybe it gets you started:
def proximity_lists(values):
    def _recursion(values, list1, list2):
        if len(values) == 0:
            return list1, list2
        head, tail = values[0], values[1:]
        r1, r2 = _recursion(tail, list1 + [head], list2)
        s1, s2 = _recursion(tail, list1, list2 + [head])
        if abs(sum(r1) - sum(r2)) < abs(sum(s1) - sum(s2)):
            return r1, r2
        return s1, s2
    return _recursion(values, [], [])
values = [5, 8, 8, 9, 17, 21, 24, 27, 31, 41]
s1, s2 = proximity_lists(values)
print(sum(s1), sum(s2))
print(s1)
print(s2)
96 95
[24, 31, 41]
[5, 8, 8, 9, 17, 21, 27]
If it is not OK to have a wrapper function, just call _recursion(values, [], []) directly.
You can find all permutations of the original input for the first list, and filter the original to obtain the second. This answer assumes that "close proximity" means a difference less than or equal to 1 between the sums of the two lists:
from collections import Counter
def close_proximity(d, _dist = 1):
    def _valid(c, _original):
        return abs(sum(c) - sum([i for i in _original if i not in c])) <= _dist
    def combos(_d, current = []):
        if _valid(current, _d) and current:
            yield current
        else:
            for i in _d:
                _c1, _c2 = Counter(current+[i]), Counter(_d)
                if all(_c2[a] >= b for a, b in _c1.items()):
                    yield from combos(_d, current+[i])
    return combos(d)
start = [5, 8, 8, 9, 17, 21, 24, 27, 31, 41]
t = next(close_proximity(start))
_c = [i for i in start if i not in t]
print(t, _c, abs(sum(t) - sum(_c)))
Output:
[5, 8, 8, 9, 17, 21, 27] [24, 31, 41] 1
"I can't get my head around understanding how to return two lists in a recursive function."
Here's a simple solution that produces your original result but without extra arguments, inner functions, etc. It just keeps augmenting the lesser list from the next available value:
def proximity_lists(array):
    if array:
        head, *tail = array
        a, b = proximity_lists(tail)
        ([a, b][sum(b) < sum(a)]).append(head)
        return [a, b]
    return [[], []]
USAGE
>>> proximity_lists([5, 8, 8, 9, 17, 21, 24, 27, 31, 41])
[[41, 24, 17, 8, 5], [31, 27, 21, 9, 8]]
>>>
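A side note on the signature in the question: default values such as lst1 = [] are created once, when the function is defined, and are shared between calls, so results can leak from one call into the next. A small illustration with made-up helpers (collect and collect_safe are hypothetical names, not part of any answer above):

```python
def collect(value, bucket=[]):
    # The same default list object is reused on every call
    bucket.append(value)
    return bucket

def collect_safe(value, bucket=None):
    # Create a fresh list per call when none is passed in
    if bucket is None:
        bucket = []
    bucket.append(value)
    return bucket

collect(1)
shared = collect(2)
print(shared)
# [1, 2]
print(collect_safe(1), collect_safe(2))
# [1] [2]
```

Passing the accumulator lists explicitly on every recursive call, as the wrapper-based answer does, avoids the problem as well.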

Removing duplicates in list of lists

I have a list consisting of lists, and each sublist has 4 items (integers and floats) in it. My problem is that I want to remove those sublists whose values at index 1 and index 3 both match those of another sublist.
[[1, 2, 0, 50], [2, 19, 0, 25], [3, 12, 25, 0], [4, 18, 50, 50], [6, 19, 50, 67.45618854993529], [7, 4, 50, 49.49657024231138], [8, 12, 50, 41.65340802385248], [9, 12, 50, 47.80600357035001], [10, 18, 50, 47.80600357035001], [11, 18, 50, 53.222014760339356], [12, 18, 50, 55.667812693447615], [13, 12, 50, 41.65340802385248], [14, 12, 50, 47.80600357035001], [15, 13, 50, 47.80600357035001], [16, 3, 50, 49.49657024231138], [17, 3, 50, 49.49657024231138], [18, 4, 50, 49.49657024231138], [19, 5, 50, 49.49657024231138]]
For example,[7, 4, 50, 49.49657024231138] and [18, 4, 50, 49.49657024231138] have the same integers at index 1 and 3. So I want to remove one, which one doesn't matter.
I have looked at codes which allow me to do this on the basis of single index.
def unique_items(L):
    found = set()
    for item in L:
        if item[1] not in found:
            yield item
            found.add(item[1])
I have been using this code, which allows me to remove lists, but only on the basis of a single index. (I haven't really understood the code completely, but it is working.)
Hence, the problem is removing sublists only on the basis of duplicate values at both index 1 and index 3 in the list of lists.
If you need to compare (item[1], item[3]), use a tuple. A tuple is a hashable type, so it can be used as a set member or dict key.
def unique_items(L):
    found = set()
    for item in L:
        key = (item[1], item[3])  # use tuple as key
        if key not in found:
            yield item
            found.add(key)
This is how you could make it work:
def unique_items(L):
    # Build a set to keep track of all the (index 1, index 3) pairs we've found so far
    found = set()
    for item in L:
        # Now check if the values at index 1 and 3 of the current item already are in the set
        if (item[1], item[3]) not in found:
            # if it's new, then add its values at index 1 and 3 as a tuple to our set
            found.add((item[1], item[3]))
            # and give back the current item
            # (I find this order more logical, but it doesn't matter much)
            yield item
This should work:
from pprint import pprint
d = {}
for sublist in lists:
    # Key built from the values at index 1 and 3
    k = str(sublist[1]) + ',' + str(sublist[3])
    if k not in d:
        d[k] = sublist
pprint(list(d.values()))
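The same idea can be written once for any combination of indexes. A sketch, assuming a hypothetical helper named unique_by and using operator.itemgetter to build the tuple key; the data is a small subset of the list from the question:

```python
from operator import itemgetter

def unique_by(rows, key):
    # Yield the first row seen for each distinct key value
    seen = set()
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row

data = [
    [7, 4, 50, 49.49657024231138],
    [8, 12, 50, 41.65340802385248],
    [13, 12, 50, 41.65340802385248],
    [18, 4, 50, 49.49657024231138],
]
# itemgetter(1, 3) returns the tuple (row[1], row[3])
result = list(unique_by(data, key=itemgetter(1, 3)))
print(result)
# [[7, 4, 50, 49.49657024231138], [8, 12, 50, 41.65340802385248]]
```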

Splitting a list in python

Hey, I'm new to Python. How do you get a portion of a list based on the relative value of its sorting key?
example...
list = [11,12,13,14,15,16,1,2,3,4,5,6,7,8,9,10]
list.sort()
newList = list.split("all numbers that are over 13")
assert newList == [14,15,16]
>>> l = [11,12,13,14,15,16,1,2,3,4,5,6,7,8,9,10]
>>> sorted(x for x in l if x > 13)
[14, 15, 16]
or with filter (would be a little bit slower if you have a big list, because of the lambda)
>>> sorted(filter(lambda x: x > 13, l))
[14, 15, 16]
Use [item for item in newList if item > 13].
There is a decent chance this could be replaced with the generator expression (item for item in newList if item > 13), which filters lazily rather than storing the whole list in memory.
You might also be interested in changing the code just a bit to something like
all_numbers = [11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
filtered_sorted_numbers = sorted(number for number in all_numbers if number > 13)
which performs the sorting—a worst case O(n log n) operation—on only the filtered values.
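If the list is already sorted, or has to be sorted anyway, the split point can also be found with a binary search. A minimal sketch using the standard-library bisect module (the variable names are illustrative):

```python
import bisect

all_numbers = [11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sorted_numbers = sorted(all_numbers)

# Index of the first element strictly greater than 13
split_at = bisect.bisect_right(sorted_numbers, 13)
over_13 = sorted_numbers[split_at:]
up_to_13 = sorted_numbers[:split_at]
print(over_13)
# [14, 15, 16]
```

This gives both halves of the split at once, which the filtering approaches do not.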
