Efficiently remove duplicates, order-agnostic, from list of lists - python

The following list has some duplicated sublists, with elements in different order:
l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
    ['there', 'hi'],
    ['jumps', 'dog', 'over', 'lazy', 'the'],
]
How can I remove duplicates, retaining the first instance seen, to get:
l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
]
I tried to:
[list(i) for i in set(map(tuple, l1))]
However, I don't know whether this is the fastest approach for large lists, and my attempt is not producing the desired result. Any idea how to remove the duplicates efficiently?
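A quick check of why the `set(map(tuple, ...))` attempt misses the reordered duplicates: plain tuples are order-sensitive, so the two permutations hash to different set members. Sorting each sublist first gives both permutations the same key (this is the idea the answers below build on):

```python
l1 = [['hi', 'there'], ['there', 'hi']]

# Plain tuples are order-sensitive: both permutations survive.
print(len(set(map(tuple, l1))))             # 2 -- nothing was de-duplicated
# Sorting first gives both permutations the same key.
print(len({tuple(sorted(x)) for x in l1}))  # 1
```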

This one is a little tricky. You want to key a dict off of frozen counters, but counters are not hashable in Python. For a small degradation in the asymptotic complexity, you could use sorted tuples as a substitute for frozen counters:
seen = set()
result = []
for x in l1:
    key = tuple(sorted(x))
    if key not in seen:
        result.append(x)
        seen.add(key)
The same idea in a one-liner would look like this:
[*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]

I did a quick benchmark, comparing the various answers:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over', 'lazy', 'the']]

from collections import Counter

def method1():
    """manually construct set, keyed on sorted tuple"""
    seen = set()
    result = []
    for x in l1:
        key = tuple(sorted(x))
        if key not in seen:
            result.append(x)
            seen.add(key)
    return result

def method2():
    """frozenset-of-Counter"""
    return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())

def method3():
    """wim"""
    return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]

from timeit import timeit
print(timeit(lambda: method1(), number=1000))
print(timeit(lambda: method2(), number=1000))
print(timeit(lambda: method3(), number=1000))
Prints:
0.0025010189856402576
0.016385524009820074
0.0026451340527273715

This:
l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over', 'lazy', 'the']]
s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]
l2 is the list with order-insensitive duplicates removed (note that going through a set does not preserve the original order, so the first-seen instance is not necessarily the one retained).
Compare with: Pythonic way of removing reversed duplicates in list

wim's answer is inefficient since it sorts each sublist as a way to uniquely identify its multiset of items, which costs O(n log n) per sublist.
To achieve the same in linear time, you can instead use a frozenset of item counts built with the collections.Counter class. Since a dict comprehension retains the last value for duplicate keys, and you want to retain the first, construct the dict from the reversed list, and reverse it again after the list of de-duplicated sublists has been constructed:
from collections import Counter
list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]
This returns:
[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
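To see why the Counter-based key is used rather than a plain frozenset of the elements: a frozenset alone discards multiplicities, while a frozenset of (element, count) pairs keeps two different multisets distinct. A quick illustrative check:

```python
from collections import Counter

a = ['a', 'a', 'b']
b = ['a', 'b', 'b']

# A frozenset of the elements collapses the two different multisets...
print(frozenset(a) == frozenset(b))  # True
# ...while a frozenset of (element, count) pairs keeps them distinct.
print(frozenset(Counter(a).items()) == frozenset(Counter(b).items()))  # False
```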


Find unique words in a list of lists in python

I have a list of lists that I would like to iterate over using a for loop, creating a new list with only the unique words. This is similar to a question asked previously, but I could not get that solution to work for a list within a list.
For example, the nested list is as follows:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
The desired output would be a single list:
List_Unique = [['is','and','so','he','his','run']]
I have tried the following two variations of code, but the output of all of them is a list of repeats:
unique_redundant = []
for i in redundant_search:
    redundant_i = [j for j in i if not i in unique_redundant]
    unique_redundant.append(redundant_i)
unique_redundant
unique_redundant = []
for list in redundant_search:
    for j in list:
        redundant_j = [i for i in j if not i in unique_redundant]
        unique_redundant.append(length_j)
unique_redundant
Example output for the above two (incorrect) variations (I ran the code on my real data and it gave repeating lists-within-lists of the same pair of words; these aren't the actual two words, just an example):
List_Unique = [['is','and'],['is','and'],['is','and']]
I'd suggest using set().union() in this way:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
set().union(*ListofList)
# => {'run', 'and', 'so', 'is', 'his', 'he'}
Explanation
It works like the following:
test_set = set().union([1])
print(test_set)
# => {1}
The asterisk operator before the list (*ListofList) unpacks the list:
lst = [[1], [2], [3]]
print(lst) #=> [[1], [2], [3]]
print(*lst) #=> [1] [2] [3]
First flatten the list with itertools.chain, then use set to keep only the unique elements, and pass that into a list:
from itertools import chain

ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
print([list(set(chain(*ListofList)))])
Use itertools.chain to flatten the list and dict.fromkeys to keep the unique values in order:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
from itertools import chain
List_Unique = [list(dict.fromkeys(chain.from_iterable(ListofList)))]
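Unlike the set-based answers, dict.fromkeys keeps first-seen order (dict insertion order is guaranteed since Python 3.7), so the result comes out in the order the words first appear. A quick check:

```python
from itertools import chain

ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
# dict keys are unique and keep first-seen order (Python 3.7+).
flat = chain.from_iterable(ListofList)
print(list(dict.fromkeys(flat)))
# ['is', 'and', 'so', 'he', 'his', 'run']
```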
Index into the nested list with a while loop, appending each value to a new list while cnt < len(ListofList):
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
list_new = []
cnt = 0
while cnt < len(ListofList):
    for i in ListofList[cnt]:
        if i in list_new:
            continue
        else:
            list_new.append(i)
    cnt += 1
print(list_new)
OUTPUT
['is', 'and', 'so', 'he', 'his', 'run']
flat_list = [item for sublist in ListofList for item in sublist]

# use this if order should not change
List_Unique = []
for item in flat_list:
    if item not in List_Unique:
        List_Unique.append(item)

# use this if order is not an issue
# List_Unique = list(set(flat_list))
You could try this:
ListofList = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
uniqueItems = []
for firstList in ListofList:
    for item in firstList:
        if item not in uniqueItems:
            uniqueItems.append(item)
print(uniqueItems)
It uses a nested for loop to access each item and check whether it is in uniqueItems.
Using the basic property of sets — a set consists of unique elements:
lst = [['is', 'and', 'is'], ['so', 'he', 'his'], ['his', 'run']]
new_list = []
for x in lst:
    for y in set(x):
        new_list.append(y)
print(list(set(new_list)))
['run', 'and', 'is', 'so', 'he', 'his']


Nested List Iteration

I was attempting some preprocessing on a nested list before training a small word2vec model and encountered the following issue:
corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']
corpus = [_.split(' ') for _ in corpus]
[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]
So the output above was given as a nested list & I intended to remove the stopwords e.g. 'is', 'a'.
for _ in range(0, len(corpus)):
    for x in corpus[_]:
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]
The output seems to indicate that the loop skipped to the next sublist after removing 'is' from each sublist, instead of iterating over the sublist entirely.
What is the reason for this? Indexing? If so, how can I fix it while retaining the nested structure?
Your code is correct except for one minor change: use [:] to iterate over a copy of each sublist, so that removals don't affect the sequence you are iterating over. Specifically, you create a copy of a list as lst_copy = lst[:]; this is one way to copy among several others (see here for a comprehensive list). When you iterate through the original list and modify it by removing items, the iteration counter creates the problem you observe.
for _ in range(0, len(corpus)):
    for x in corpus[_][:]:  # <--- iterate over a copy of the sublist
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
OUTPUT
[['he', 'brave', 'king'],
['she', 'kind', 'queen'],
['he', 'young', 'boy'],
['she', 'gentle', 'girl']]
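The skipping behaviour itself can be seen in a minimal sketch: list iteration advances an internal index, and removing the current element shifts the remaining items left, so the element that slides into the current slot is never examined.

```python
lst = ['is', 'a', 'brave']
for x in lst:          # iterator advances by index: 0, 1, 2, ...
    if x in ('is', 'a'):
        lst.remove(x)  # shifts remaining items left; the next item is skipped
print(lst)
# ['a', 'brave']  -- 'a' slid into index 0 and was never visited
```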
Maybe you can define a custom method to reject elements matching a certain condition. Similar to itertools (for example: itertools.dropwhile).
def reject_if(predicate, iterable):
    for element in iterable:
        if not predicate(element):
            yield element
Once you have the method in place, you can use this way:
stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
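For what it's worth, the standard library already provides this generator as itertools.filterfalse, which yields exactly the elements for which the predicate is false:

```python
from itertools import filterfalse

corpus = [['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen']]
stopwords = {'is', 'a'}
# filterfalse keeps the elements where the predicate returns False.
result = [list(filterfalse(lambda x: x in stopwords, doc)) for doc in corpus]
print(result)
# [['he', 'brave', 'king'], ['she', 'kind', 'queen']]
```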
nested = [input()]
nested = [i.split() for i in nested]

Build a dictionary from list of lists

I am trying to build an inverted index, i.e. map each word to the document it came from and its position within that list/document.
In my case I have a parsed list containing lists (i.e. a list of lists).
My input is like this.
[
    ['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
    ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
    ['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
    ['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
    ['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
This is my code
def create_inverted(mylists):
    myDict = {}
    for sublist in mylists:
        for i in range(len(sublist)):
            if sublist[i] in myDict:
                myDict[sublist[i]].append(i)
            else:
                myDict[sublist[i]] = [i]
    return myDict
It does build the dictionary, but when I do a search I am not getting the correct result. I am trying to do something like this:
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
index = {'owl': [0, 2],
         'lion': [0, 1],  # IDs are sorted.
         'deer': [1],
         'leopard': [2]}

def indexed_search(documents, index, query):
    return [documents[doc_id] for doc_id in index[query]]

print indexed_search(documents, index, 'lion')
where I can enter search text and it retrieves the list IDs. Any ideas?
You're mapping each word to the positions it was found in in each document, not which document it was found in. You should store indexes into the list of documents instead of indexes into the documents themselves, or perhaps just map words to documents directly instead of to indices:
def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index
return index
Most of this is the same as your code. The main differences are in these two lines:
for i, document in enumerate(documents):
for word in set(document):
which correspond to the following part of your code:
for sublist in mylists:
    for i in range(len(sublist)):
enumerate iterates over the indices and elements of a sequence. Since enumerate is on the outer loop, i in my code is the index of the document, while i in your code is the index of a word within a document.
set(document) creates a set of the words in the document, where each word appears only once. This ensures that each word is only counted once per document, rather than having 10 occurrences of 2 in the list for 'Cheetos' if 'Cheetos' appears in document 2 10 times.
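To see the effect of the set() step, here is a condensed variant of the index builder (using setdefault, names reused from the answer above purely for illustration) run on a document with a repeated word:

```python
def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):   # each word counted once per document
            index.setdefault(word, []).append(i)
    return index

docs = [['owl', 'owl', 'lion'], ['lion', 'deer']]
print(create_inverted_index(docs)['owl'])
# [0]  -- a single entry, despite 'owl' appearing twice in document 0
```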
First, I would extract all possible words and store them in one set.
Then I look up each word in each list and collect the indexes of all the lists the word occurs in:
source = [
    ['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
    ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
    ['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
    ['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
    ['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
allWords = set(word for lst in source for word in lst)
wordDict = {word: [i for i, lst in enumerate(source) if word in lst]
            for word in allWords}
print wordDict
Out[30]:
{'a': [1, 2, 3],
'afraid': [3],
'always': [1, 4],
'and': [2],
...
This is straightforward as long you don't need efficient code:
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index
Now you get a two-level dictionary giving you access to the document ids, and then to the positions of the terms in this document:
>>> index(documents)
{'lion': {1: [2], 2: [1]}, 'leopard': {3: [2]}, 'deer': {2: [2]}, 'owl': {1: [1], 3: [1]}}
This is only a preliminary step for indexing; afterwards, you need to separate the term dictionary from the document postings from the positions postings. Typically, the dictionary is stored in a tree-like structures (there are Python packages for this), and the document postings and positions postings are represented as arrays of unsigned integers.
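With the index function above, a simple lookup of which documents contain a term (and where) might look like this sketch:

```python
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index

idx = index(documents)
# Which documents contain 'owl', and at which positions?
print(idx['owl'])          # {1: [1], 3: [1]}
print(sorted(idx['owl']))  # document ids: [1, 3]
```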
I'd accumulate the indices into a set to avoid duplicates, and then sort:
>>> documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
>>> from collections import defaultdict
>>> D = defaultdict(set)
>>> for i, doc in enumerate(documents):
...     for word in doc:
...         D[word].add(i)
...
>>> D ## Take a look at the defaultdict
defaultdict(<class 'set'>, {'owl': {0, 2}, 'leopard': {2}, 'lion': {0, 1}, 'deer': {1}})
>>> {k:sorted(v) for k,v in D.items()}
{'lion': [0, 1], 'owl': [0, 2], 'leopard': [2], 'deer': [1]}

Group a list by word length

For example, I have a list, say
list = ['sight', 'first', 'love', 'was', 'at', 'It']
I want to group this list by word length, say
newlist = [['sight', 'first'],['love'], ['was'], ['at', 'It']]
Please help me with this. Thanks!
Use itertools.groupby:
>>> from itertools import groupby
>>> lis = ['sight', 'first', 'love', 'was', 'at', 'It']
>>> [list(g) for k, g in groupby(lis, key=len)]
[['sight', 'first'], ['love'], ['was'], ['at', 'It']]
Note that for itertools.groupby to work properly, all the items must already be sorted by length; otherwise, either use collections.defaultdict (O(N)) or sort the list first and then use itertools.groupby (O(N log N)):
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> lis = ['sight', 'first', 'foo', 'love', 'at', 'was', 'at', 'It']
>>> for x in lis:
...     d[len(x)].append(x)
...
>>> d.values()
[['at', 'at', 'It'], ['foo', 'was'], ['love'], ['sight', 'first']]
If you want the final output list to be sorted as well, it is better to sort the items by length first and then apply itertools.groupby.
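A minimal sketch of that sort-then-groupby route on an unsorted input (Python's stable sort keeps the original relative order of words of equal length):

```python
from itertools import groupby

lis = ['at', 'sight', 'love', 'It', 'first', 'was']
lis.sort(key=len)  # groupby only merges *adjacent* equal keys
grouped = [list(g) for _, g in groupby(lis, key=len)]
print(grouped)
# [['at', 'It'], ['was'], ['love'], ['sight', 'first']]
```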
You can use a temp dictionary then sort by length:
li = ['sight', 'first', 'love', 'was', 'at', 'It']
d = {}
for word in li:
    d.setdefault(len(word), []).append(word)
result = [d[n] for n in sorted(d, reverse=True)]
print result
# [['sight', 'first'], ['love'], ['was'], ['at', 'It']]
You can use defaultdict:
from collections import defaultdict
d = defaultdict(list)
for word in li:
    d[len(word)].append(word)
result = [d[n] for n in sorted(d, reverse=True)]
print result
or use __missing__ like so:
class Dicto(dict):
    def __missing__(self, key):
        self[key] = []
        return self[key]

d = Dicto()
for word in li:
    d[len(word)].append(word)
result = [d[n] for n in sorted(d, reverse=True)]
print result
Since the groupby solution was already taken ;-)
from collections import defaultdict
lt = ['sight', 'first', 'love', 'was', 'at', 'It']
d = defaultdict(list)
for x in lt:
    d[len(x)].append(x)
d.values()
[['at', 'It'], ['was'], ['love'], ['sight', 'first']]
