Removing duplicates in list of lists

Removing duplicates in list of lists - python

I have a list of lists, and each sublist has 4 items (integers and floats). I want to remove those sublists whose values at index 1 and index 3 both match those of another sublist.
[[1, 2, 0, 50], [2, 19, 0, 25], [3, 12, 25, 0], [4, 18, 50, 50], [6, 19, 50, 67.45618854993529], [7, 4, 50, 49.49657024231138], [8, 12, 50, 41.65340802385248], [9, 12, 50, 47.80600357035001], [10, 18, 50, 47.80600357035001], [11, 18, 50, 53.222014760339356], [12, 18, 50, 55.667812693447615], [13, 12, 50, 41.65340802385248], [14, 12, 50, 47.80600357035001], [15, 13, 50, 47.80600357035001], [16, 3, 50, 49.49657024231138], [17, 3, 50, 49.49657024231138], [18, 4, 50, 49.49657024231138], [19, 5, 50, 49.49657024231138]]
For example, [7, 4, 50, 49.49657024231138] and [18, 4, 50, 49.49657024231138] have the same values at index 1 and 3, so I want to remove one of them; which one doesn't matter.
I have looked at code that does this on the basis of a single index:
def unique_items(L):
    found = set()
    for item in L:
        if item[1] not in found:
            yield item
            found.add(item[1])
I have been using this code, which removes sublists but only on the basis of a single index. (I haven't fully understood the code, but it works.)
So the problem is how to remove sublists only when both index 1 and index 3 duplicate the values of another sublist.

If you need to compare (item[1], item[3]), use a tuple. A tuple is a hashable type, so it can be used as a set member or dict key.
def unique_items(L):
    found = set()
    for item in L:
        key = (item[1], item[3])  # use tuple as key
        if key not in found:
            yield item
            found.add(key)
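The same idea generalizes to any combination of positions. The following is only a sketch: the helper name `unique_by` and its `key` parameter are hypothetical and not part of the original answer.

```python
def unique_by(rows, key):
    """Yield the first row seen for each distinct key(row)."""
    seen = set()
    for row in rows:
        k = key(row)  # must be hashable, e.g. a tuple
        if k not in seen:
            seen.add(k)
            yield row


data = [
    [7, 4, 50, 49.49657024231138],
    [16, 3, 50, 49.49657024231138],
    [17, 3, 50, 49.49657024231138],
    [18, 4, 50, 49.49657024231138],
]
# Compare on positions 1 and 3 only; the first row of each group is kept.
result = list(unique_by(data, key=lambda row: (row[1], row[3])))
print(result)
```

Because the key is built by a function, switching to other positions only means changing the lambda.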

This is how you could make it work:
def unique_items(L):
    # Build a set to keep track of all the (index 1, index 3) pairs we've found so far
    found = set()
    for item in L:
        # Check whether the values at index 1 and 3 of the current item are already in the set
        if (item[1], item[3]) not in found:
            # If the pair is new, add it to our set as a tuple
            found.add((item[1], item[3]))
            # and give back the current item
            # (I find this order more logical, but it doesn't matter much)
            yield item

This should work:
from pprint import pprint
# `lists` is the list of sublists from the question
d = {}
for sublist in lists:
    k = str(sublist[1]) + ',' + str(sublist[3])
    if k not in d:
        d[k] = sublist
pprint(list(d.values()))
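A variant of the dict approach, sketched here with made-up sample data, uses a tuple as the key instead of a string. One difference worth knowing: the tuples (2, 50) and (2, 50.0) compare equal and hash the same, whereas the strings "2,50" and "2,50.0" do not, so string keys treat an int and an equal float as different values.

```python
data = [[1, 2, 0, 50], [2, 19, 0, 25], [6, 19, 50, 25], [7, 2, 50, 50.0]]

first_seen = {}
for sublist in data:
    # setdefault stores the sublist only if the key is not present yet
    first_seen.setdefault((sublist[1], sublist[3]), sublist)

deduped = list(first_seen.values())
print(deduped)
```

Dictionaries preserve insertion order (guaranteed since Python 3.7), so the surviving sublists stay in their original order.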

Related

Python - 2D list - find duplicates in one column and sum values in another column

I have a 2D list that contains soccer player names, the number of times they scored a goal, and the number of times they attempted a shot on goal, respectively.
player_stats = [['Adam', 5, 10], ['Kyle', 12, 18], ['Jo', 20, 35], ['Adam', 15, 20], ['Charlie', 31, 58], ['Jo', 6, 14], ['Adam', 10, 15]]
From this list, I'm trying to return another list that shows only one instance of each player with their respective total goals and total attempts on goal, like so:
player_stats_totals = [['Adam', 30, 45], ['Kyle', 12, 18], ['Jo', 26, 49], ['Charlie', 31, 58]]
After searching on Stack Overflow I was able to learn (from this thread) how to return the indexes of the duplicate players
x = [player_stats[i][0] for i in range(len(player_stats))]
for i in range(len(x)):
    if (x[i] in x[:i]) or (x[i] in x[i+1:]):
        print(x[i], i)
but I got stuck on how to proceed from there, and I'm not sure whether this method is even relevant for what I need.
What's the most efficient way to return the desired list of totals?

Use a dictionary to accumulate the values for a given player:
player_stats = [['Adam', 5, 10], ['Kyle', 12, 18], ['Jo', 20, 35], ['Adam', 15, 20], ['Charlie', 31, 58], ['Jo', 6, 14], ['Adam', 10, 15]]
lookup = {}
for player, first, second in player_stats:
    # if the player has not been seen add a new list with 0, 0
    if player not in lookup:
        lookup[player] = [0, 0]
    # get the accumulated total so far
    first_total, second_total = lookup[player]
    # add the current values to the accumulated total, and update the values
    lookup[player] = [first_total + first, second_total + second]
# create the output in the expected format
res = [[player, first, second] for player, (first, second) in lookup.items()]
print(res)
Output
[['Adam', 30, 45], ['Kyle', 12, 18], ['Jo', 26, 49], ['Charlie', 31, 58]]
A more advanced, and pythonic, version is to use a collections.defaultdict:
from collections import defaultdict
player_stats = [['Adam', 5, 10], ['Kyle', 12, 18], ['Jo', 20, 35],
                ['Adam', 15, 20], ['Charlie', 31, 58], ['Jo', 6, 14], ['Adam', 10, 15]]
lookup = defaultdict(lambda: [0, 0])
for player, first, second in player_stats:
    # get the accumulated total so far
    first_total, second_total = lookup[player]
    # add the current values to the accumulated total, and update the values
    lookup[player] = [first_total + first, second_total + second]
# create the output in the expected format
res = [[player, first, second] for player, (first, second) in lookup.items()]
print(res)
This approach has the advantage of skipping the explicit initialisation. Both approaches are O(n).
Notes
The expression:
res = [[player, first, second] for player, (first, second) in lookup.items()]
is a list comprehension, equivalent to the following for loop:
res = []
for player, (first, second) in lookup.items():
    res.append([player, first, second])
Additionally, read this for understanding unpacking.

What you want to do is use a dictionary where the key is the player name and the value is a list containing [goals, shots]. Constructing it would look like this:
all_games_stats = {}
for stat in player_stats:
    player, goals, shots = stat
    if player not in all_games_stats:
        all_games_stats[player] = [goals, shots]
    else:
        stat_list = all_games_stats[player]
        stat_list[0] += goals
        stat_list[1] += shots
Then, if you want to represent the players and their stats as a list, you would do:
list(all_games_stats.items())
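Note that items() yields (name, [goals, shots]) pairs rather than the flat [name, goals, shots] lists the question asks for. A minimal sketch of one way to flatten them, using shortened sample data (the name `player_stats_totals` is taken from the question; everything else here is illustrative):

```python
player_stats = [['Adam', 5, 10], ['Kyle', 12, 18], ['Adam', 15, 20]]

all_games_stats = {}
for player, goals, shots in player_stats:
    totals = all_games_stats.setdefault(player, [0, 0])
    totals[0] += goals
    totals[1] += shots

# Flatten each (name, [goals, shots]) pair into a single list
player_stats_totals = [[name, *totals] for name, totals in all_games_stats.items()]
print(player_stats_totals)
```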

You can convert the list to a dictionary (it can always be changed back once done). This works:
player_stats = [['Adam', 5, 10], ['Kyle', 12, 18], ['Jo', 20, 35],
                ['Adam', 15, 20], ['Charlie', 31, 58], ['Jo', 6, 14],
                ['Adam', 10, 15]]
new_stats = {}
for item in player_stats:
    if item[0] not in new_stats:
        new_stats[item[0]] = [item[1], item[2]]
    else:
        new_stats[item[0]][0] += item[1]
        new_stats[item[0]][1] += item[2]
print(new_stats)

I might as well submit something, too. Here's yet another method with some list comprehension worked in:
# Unique values to new dictionary with goal and shots on goal default entries
agg_stats = dict.fromkeys(set([p[0] for p in player_stats]), [0, 0])
# Iterate over the player stats list
for player in player_stats:
    # Set entry to sum of current and next stats values for the corresponding player.
    agg_stats[player[0]] = [sum([agg_stats.get(player[0])[i], stat]) for i, stat in enumerate(player[1:])]

Yet another way, storing the whole triples (including the name) in the dict and updating them:
stats = {}
for name, goals, attempts in player_stats:
    entry = stats.setdefault(name, [name, 0, 0])
    entry[1] += goals
    entry[2] += attempts
player_stats_totals = list(stats.values())
And for fun, a solution with complex numbers, which makes adding nice but requires an annoying conversion back:
from collections import defaultdict
tmp = defaultdict(complex)
for name, *stats in player_stats:
    tmp[name] += complex(*stats)
player_stats_totals = [[name, int(stats.real), int(stats.imag)]
                       for name, stats in tmp.items()]

Pythonic way to separate a list into several sublists where each sublist can only take values within a given range

I have a list of integers which span several useful ranges. I found in:
Split a list based on a condition?
a way to do that, and my code works, but I wondered if there is a better way to do it while keeping it readable.
EDIT: I would encourage anyone with a similar need to explore all of the answers given, they all have different methods and merits.
Thank you to all who helped on this.
Working inelegant code
my_list = [1, 2, 11, 29, 37]
r1_lim = 10
r2_lim = 20
r3_lim = 30
r4_lim = 40
r1_goodvals = list(range(1, r1_lim+1))
print("r1_goodvals : ", r1_goodvals)
r2_goodvals = list(range(r1_lim+1, r2_lim+1))
print("r2_goodvals : ", r2_goodvals)
r3_goodvals = list(range(r2_lim+1, r3_lim+1))
print("r3_goodvals : ", r3_goodvals)
r4_goodvals = list(range(r3_lim+1, r4_lim+1))
print("r4_goodvals : ", r4_goodvals)
r1, r2, r3, r4 = [], [], [], []
for x in my_list:
    if x in r1_goodvals:
        r1.append(x)
    elif x in r2_goodvals:
        r2.append(x)
    elif x in r3_goodvals:
        r3.append(x)
    elif x in r4_goodvals:
        r4.append(x)
print("r1 : ", r1)
print("r2 : ", r2)
print("r3 : ", r3)
print("r4 : ", r4)
Output
r1_goodvals : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
r2_goodvals : [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
r3_goodvals : [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
r4_goodvals : [31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
r1 : [1, 2]
r2 : [11]
r3 : [29]
r4 : [37]

You can achieve linear time complexity with minimal code by initializing the limits as a list in reverse order and making comparisons between the lowest limit at the end of the limit list to the current item in the iteration of the input list. If that limit is lower, append a new sub-list to the result list and pop the limit list for the next lowest limit. Then append the current item to the last sub-list, which will always be the one within the current limit.
my_list = [1, 2, 11, 29, 37]
limits = [40, 30, 20, 10]
r = [[]]
for i in my_list:
    if limits[-1] < i:
        r.append([])
        limits.pop()
    r[-1].append(i)
r becomes:
[[1, 2], [11], [29], [37]]
As @Drecker mentions in the comments, however, this solution assumes that my_list is pre-sorted. If that assumption doesn't hold, sorting the list first costs O(n log n).
Also, as @Drecker mentions in the comments, some may find an iterator more Pythonic than a list with pop. In that case the limits can be listed in the more intuitive ascending order, but an additional variable is needed to keep track of the current lowest limit, since calling next on an iterator consumes the item:
my_list = [1, 2, 11, 29, 37]
limits = iter((10, 20, 30, 40))
limit = next(limits)
r = [[]]
for i in my_list:
    if limit < i:
        r.append([])
        limit = next(limits)
    r[-1].append(i)
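The loops above advance by one limit per value, so they rely on the data not skipping a range and not running past the last limit. A sketch of a more defensive variant follows; the function name `split_sorted` and the extra overflow bucket are assumptions made for illustration, not part of the answer.

```python
def split_sorted(values, limits):
    """Split ascending `values` into one bucket per inclusive upper limit.

    Values above the last limit go into a final overflow bucket.
    """
    buckets = [[] for _ in range(len(limits) + 1)]
    index = 0
    for value in values:
        # Advance past every limit that this value exceeds
        while index < len(limits) and value > limits[index]:
            index += 1
        buckets[index].append(value)
    return buckets


print(split_sorted([1, 2, 11, 29, 37], [10, 20, 30, 40]))
print(split_sorted([1, 25, 100], [10, 20, 30, 40]))
```

The index only ever moves forward, so the whole pass is still linear in len(values) + len(limits). Like the original, it assumes the input is sorted in ascending order.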

If I needed this to be more readable while keeping the same functionality, I would probably do it this way. Although if you started adding more r_n limits, this approach would quickly become cluttered.
my_list = [1, 2, 11, 29, 37]
r1, r2, r3, r4 = [], [], [], []
for x in my_list:
    if x in range(1, 11):
        r1.append(x)
    elif x in range(11, 21):
        r2.append(x)
    elif x in range(21, 31):
        r3.append(x)
    elif x in range(31, 41):
        r4.append(x)
print("r1 : ", r1)
print("r2 : ", r2)
print("r3 : ", r3)
print("r4 : ", r4)
Using list comprehensions in this case would make the runtime O(n * number_of_ranges), since you would need to loop over the entire my_list for each range. The loop above instead runs in O(n), with n being the length of the list.
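For comparison, the list-comprehension form being described might look like the sketch below. The `bounds` list of inclusive (low, high) pairs is an assumed representation, not something from the question.

```python
my_list = [1, 2, 11, 29, 37]
bounds = [(1, 10), (11, 20), (21, 30), (31, 40)]  # inclusive (low, high) pairs

# Each inner comprehension scans my_list once, so the total work is
# len(my_list) * len(bounds) comparisons.
groups = [[x for x in my_list if low <= x <= high] for low, high in bounds]
print(groups)
```

For four ranges the extra passes are negligible; the difference only matters when the number of ranges grows.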

Another solution would involve using binary search:
from bisect import bisect
my_list = [1, 2, 11, 29, 37, 100]
limits = [10, 20, 30, 40]
groups = [[] for _ in range(len(limits) + 1)]
for x in my_list:
    groups[bisect(limits, x)].append(x)
print(groups)
[[1, 2], [11], [29], [37], [100]]
This is quite a fast solution even for a high number of limits, O(number_of_elements * log(number_of_limits)), and in a certain sense it is as fast as you can get for arbitrary limits.
However, if you have additional information -- for example you want to group the numbers based on their rounding to tens and the list is pre-sorted, you could use itertools.groupby:
from itertools import groupby
my_list = [1, 2, 11, 29, 37, 100]
groups = {key: list(lst) for key, lst in groupby(my_list, lambda x: round(x, -1))}
# We could use 'x // 10' instead of 'round(x, -1)' to get 'flooring'
# hence, results would be more similar to your original usage
print(groups)
{0: [1, 2], 10: [11], 30: [29], 40: [37], 100: [100]}
You can drop the pre-sorting requirement by replacing the comprehension with a full for loop and using collections.defaultdict(list).
This solution is just O(number_of_elements) in terms of time complexity.
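A minimal sketch of that defaultdict variant, assuming flooring with `x // 10 * 10` as the grouping key (an illustrative choice; note that it puts 10 in the bucket keyed 10, unlike the inclusive upper limits in the question):

```python
from collections import defaultdict

my_list = [37, 1, 100, 11, 2, 29]  # deliberately unsorted

groups = defaultdict(list)
for x in my_list:
    groups[x // 10 * 10].append(x)  # floor to the nearest lower multiple of ten

print(dict(groups))
```

Unlike groupby, equal keys do not have to be adjacent here, which is why no sorting is needed.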

Comparing lists and extracting unique values

I have two lists:
l1: 38510 entries
l2: 6384 entries
I want to extract only the values that are present in both lists.
So far, this has been my approach:
equals = []
for quote in l2:
    for quote2 in l1:
        if quote == quote2:
            equals.append(quote)
len(equals)       # 4999
len(set(equals))  # 4452
First of all, I have the feeling this approach is pretty inefficient, because I am checking every value in l1 several times.
Furthermore, it seems that I still get duplicates. Is this due to the inner loop over l1?
Thank you!!

You can use list comprehension and the in operator.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 8, 0]
[x for x in a if x in b]
#[2, 4, 6, 8]

You were on the right track by using sets. One of set's coolest features is that you can get the intersection of two sets, that is, the values that occur in both. You can read more about it in the docs.
Here is my example:
l1_set = set(l1)
l2_set = set(l2)
equals = l1_set & l2_set
#If you really want it as a list
equals = list(equals)
print(equals)
The & operator tells python to return a new set that only has values in both sets. At the end, I went ahead and converted equals back to a list because that's what your original example wanted. You can omit that if you don't need it.
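As for the duplicates in the question: one likely cause, assuming the lists themselves contain repeated values, is that the nested loops append once per matching pair. A small sketch with made-up data:

```python
l1 = ["a", "b", "b", "c"]
l2 = ["b", "b", "c", "d"]

# Nested loops append once per matching *pair*, so "b" (2 x 2) appears 4 times
pairs = [q for q in l2 for q2 in l1 if q == q2]
print(pairs)

# Set intersection keeps each common value exactly once
common = set(l1) & set(l2)
print(sorted(common))
```

The set version also avoids the quadratic number of comparisons, at the cost of losing the original order.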

1. This is the simplest method, which uses only a list comprehension and the in operator.
# Intersection of two lists with a list comprehension
def intersection(list_one, list_two):
    temp_list = [value for value in list_one if value in list_two]
    return temp_list
# Illustrate the intersection
list_one = [4, 9, 1, 17, 11, 26, 28, 54, 69]
list_two = [9, 9, 74, 21, 45, 11, 63, 28, 26]
print(intersection(list_one, list_two))
# [9, 11, 26, 28]
2. You can convert both lists with set() and intersect the sets.
# Two lists using set()
def intersection(list_one, list_two):
    return list(set(list_one) & set(list_two))
# Illustrate the intersection
list_one = [15, 13, 123, 23, 31, 10, 3, 311, 738, 25, 124, 19]
list_two = [12, 14, 1, 15, 36, 123, 23, 3, 315, 87]
print(intersection(list_one, list_two))
# [123, 3, 23, 15]
3. In this technique, we use the set method intersection() to compute the intersected list.
First, we convert the larger list to a set, then compute its intersection with the other list.
# Two lists using set() and intersection()
def intersection_list(list_one, list_two):
    return list(set(list_one).intersection(list_two))
# Illustrate the intersection
list_one = [15, 13, 123, 23, 31, 10, 3, 311, 738, 25, 124, 19]
list_two = [12, 14, 1, 15, 36, 123, 23, 3, 315, 87, 978, 4, 13, 19, 20, 11]
if len(list_one) < len(list_two):
    list_one, list_two = list_two, list_one
print(intersection_list(list_one, list_two))
# [3, 13, 15, 19, 23, 123]
Additionally, you can follow the tutorials below:
Geeksforgeeks
docs.python.org
LearnCodingFast

Let's assume that all the entries in both of your lists are integers. If so, computing the intersection between the 2 lists would be more efficient than using list comprehension:
import timeit
l1 = [i for i in range(0, 38510)]
l2 = [i for i in range(0, 6384)]
st1 = timeit.default_timer()
# Using list comprehension
l3 = [i for i in l1 if i in l2]
ed1 = timeit.default_timer()
# Using set
st2 = timeit.default_timer()
l4 = list(set(l1) & set(l2))
ed2 = timeit.default_timer()
print(ed1-st1) # 5.7621682 secs
print(ed2-st2) # 0.004478600000000554 secs
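If the order of l1 matters, a middle ground is to convert only l2 to a set and keep a single pass over l1. This is a sketch with small made-up lists, not a benchmark:

```python
l1 = [5, 3, 9, 3, 7, 1]
l2 = [7, 3, 8]

lookup = set(l2)  # average O(1) membership tests
seen = set()
ordered_common = []
for value in l1:
    # Keep values found in l2, skipping repeats already emitted
    if value in lookup and value not in seen:
        seen.add(value)
        ordered_common.append(value)

print(ordered_common)
```

This keeps the first-occurrence order from l1 while avoiding the repeated linear scans of `i in l2` on a list.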

Since your lists are this long, you might want to use NumPy, which specializes in efficient array processing for Python.
For your case, you can use numpy.intersect1d() to get the sorted, unique values that are in both of the input arrays, as follows:
import numpy as np
l1 = [1, 3, 5, 10, 11, 12]
l2 = [2, 3, 4, 10, 12, 14, 16, 18]
l_uniques = np.intersect1d(l1, l2)
print(l_uniques)
[ 3 10 12]
You can keep the result as a NumPy array for further fast processing, or convert it back to a Python list with:
l_uniques2 = l_uniques.tolist()

How to get duplicate strings of list with indices in Python

I do realize this has already been addressed here (e.g., Removing duplicates in the lists, Accessing the index in 'for' loops?, Append indices to duplicate strings in Python efficiently, and many more). Nevertheless, I hope this question is different.
I need to write a program that checks whether a list has any duplicates and, if it does, returns each duplicate element along with its indices.
The sample list sample_list
sample = """An article is any member of a class of dedicated words that are used with noun phrases to
mark the identifiability of the referents of the noun phrases. The category of articles constitutes a
part of speech. In English, both "the" and "a" are articles, which combine with a noun to form a noun
phrase."""
sample_list = sample.split()
my_list = [x.lower() for x in sample_list]
len(my_list)
output: 55
The common approach to get a unique collection of items is to use a set, which removes the duplicates:
unique_list = list(set(my_list))
len(unique_list)
output: 38
This is what I have tried but honestly, I don't know what to do next...
from functools import partial
def list_duplicates_of(seq, item):
    start_at = -1
    locs = []
    while True:
        try:
            loc = seq.index(item, start_at+1)
        except ValueError:
            break
        else:
            locs.append(loc)
            start_at = loc
    return locs
dups_in_source = partial(list_duplicates_of, my_list)
for i in my_list:
    print(i, dups_in_source(i))
This prints every element with its indices, including the elements that occur only once:
an [0]
article [1]
.
.
.
form [51]
a [6, 33, 48, 52]
noun [15, 26, 49, 53]
phrase. [54]
Here I want to return only duplicate elements along with their indices like below
of [5, 8, 21, 24, 30, 35]
a [6, 33, 48, 52]
are [12, 43]
with [14, 47]
.
.
.
noun [15, 26, 49, 53]

You could do something along these lines:
from collections import defaultdict
indices = defaultdict(list)
for i, w in enumerate(my_list):
    indices[w].append(i)
for k, v in indices.items():
    if len(v) > 1:
        print(k, v)
of [5, 8, 21, 24, 30, 35]
a [6, 33, 48, 52]
are [12, 43]
with [14, 47]
noun [15, 26, 49, 53]
to [17, 50]
the [19, 22, 25, 28]
This uses collections.defaultdict and enumerate to efficiently collect the indices of each word. Filtering out the words that occur only once is then a simple conditional comprehension or a loop with an if statement.
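The conditional-comprehension form mentioned above could look like this sketch, which uses a short made-up sentence instead of the question's sample text:

```python
from collections import defaultdict

words = "the cat and the dog and the bird".split()

indices = defaultdict(list)
for i, word in enumerate(words):
    indices[word].append(i)

# Keep only the words that occur more than once
duplicates = {word: locs for word, locs in indices.items() if len(locs) > 1}
print(duplicates)
```

Returning a dict rather than printing makes the result reusable by the rest of a program.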

Recursive function that takes in one list and returns two lists

I am asked to define a recursive function that takes in a list and then assigns the values of that list among two other lists in such a way that when you take the sum of each of those two lists you get two results that are in close proximity to each other.
Example:
If I run:
print(proximity_lists([5, 8, 8, 9, 17, 21, 24, 27, 31, 41]))
I get back two lists :
[31, 27, 21, 9, 8] #sum = 96
[41, 24, 17, 8, 5] #sum = 95
This is how I did it, however I can't get my head around understanding how to return two lists in a recursive function. So far I was comfortable with conditions where I had to return one list.
This is my code so far:
def proximity_lists(lst, lst1 = [], lst2 = []):
    """
    parameters : lst of type list;
    returns : returns two lists such that the sum of the elements in the lst1
              is in the proximity of the sum of the elements in the lst2
    """
    if not lst:
        if abs(sum(lst1)-sum(lst2)) in range(5):
            return lst1, lst2
    else:
        return {Not sure what to put here} + proximity_lists(lst[1:])
As for range(), its argument can be anything, as long as the two sums end up as close to each other as they can get. I picked 5 because, based on the example output above, the difference between them is 1.
I need to add that this has to be done without the help of any modules. It has to be done using simple functions.

This is potentially not the optimal solution in terms of performance (exponential complexity), but maybe it gets you started:
def proximity_lists(values):
    def _recursion(values, list1, list2):
        if len(values) == 0:
            return list1, list2
        head, tail = values[0], values[1:]
        r1, r2 = _recursion(tail, list1 + [head], list2)
        s1, s2 = _recursion(tail, list1, list2 + [head])
        if abs(sum(r1) - sum(r2)) < abs(sum(s1) - sum(s2)):
            return r1, r2
        return s1, s2
    return _recursion(values, [], [])
values = [5, 8, 8, 9, 17, 21, 24, 27, 31, 41]
s1, s2 = proximity_lists(values)
print(sum(s1), sum(s2))
print(s1)
print(s2)
96 95
[24, 31, 41]
[5, 8, 8, 9, 17, 21, 27]
If it is not OK to have a wrapper function, just call _recursion(values, [], []) directly.

You can find all permutations of the original input for the first list, and filter the original to obtain the second. This answer assumes that "close proximity" means a difference less than or equal to 1 between the sums of the two lists:
from collections import Counter
def close_proximity(d, _dist = 1):
    def _valid(c, _original):
        return abs(sum(c) - sum([i for i in _original if i not in c])) <= _dist
    def combos(_d, current = []):
        if _valid(current, _d) and current:
            yield current
        else:
            for i in _d:
                _c1, _c2 = Counter(current+[i]), Counter(_d)
                if all(_c2[a] >= b for a, b in _c1.items()):
                    yield from combos(_d, current+[i])
    return combos(d)
start = [5, 8, 8, 9, 17, 21, 24, 27, 31, 41]
t = next(close_proximity(start))
_c = [i for i in start if i not in t]
print(t, _c, abs(sum(t) - sum(_c)))
Output:
[5, 8, 8, 9, 17, 21, 27] [24, 31, 41] 1

"I can't get my head around understanding how to return two lists in a recursive function."
Here's a simple solution that produces your original result but without extra arguments, inner functions, etc. It just keeps appending the next available value to whichever list currently has the smaller sum:
def proximity_lists(array):
    if array:
        head, *tail = array
        a, b = proximity_lists(tail)
        ([a, b][sum(b) < sum(a)]).append(head)
        return [a, b]
    return [[], []]
USAGE
>>> proximity_lists([5, 8, 8, 9, 17, 21, 24, 27, 31, 41])
[[41, 24, 17, 8, 5], [31, 27, 21, 9, 8]]
>>>
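The indexing trick `[a, b][sum(b) < sum(a)]` picks `b` when its sum is smaller and `a` otherwise. A more explicit sketch of the same greedy idea, written with an if/else and returning a tuple (both are stylistic assumptions, not the answer's code):

```python
def proximity_lists(values):
    """Greedy split: each value goes to whichever list currently sums lower."""
    if not values:
        return [], []
    head, *tail = values
    first, second = proximity_lists(tail)  # split the rest of the list first
    if sum(second) < sum(first):
        second.append(head)
    else:
        first.append(head)
    return first, second


a, b = proximity_lists([5, 8, 8, 9, 17, 21, 24, 27, 31, 41])
print(a, sum(a))
print(b, sum(b))
```

Because the recursion unwinds from the end of the list, an ascending input is distributed largest value first. This greedy strategy is a heuristic: it gives a close split here, but it is not guaranteed to find the smallest possible difference for every input.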
