Find duplicates in a list of tuples

You are given information about users of your website. The information includes a username, a phone number and/or an email address. Write a program that takes a list of tuples, where each tuple holds the information for one user, and returns a list of lists in which each sublist contains the indices of the tuples that describe the same person. For example:
Input:
[("MLGuy42", "andrew#example.com", "123-4567"),
("CS229DungeonMaster", "123-4567", "ml#example.net"),
("Doomguy", "john#example.org", "carmack#example.com"),
("andrew26", "andrew#example.com", "mlguy#example.com")]
Output:
[[0, 1, 3], [2]]
Since "MLGuy42", "CS229DungeonMaster" and "andrew26" are all the same person.
Each sublist in the output should be sorted and the outer list should be sorted by the first element in the sublist.
Below is the code snippet that I did for this problem. It seems to work fine, but I'm wondering if there is a better/optimized solution. Any help would be appreciated. Thanks!
def find_duplicates(user_info):
    results = list()
    seen = dict()
    for i, user in enumerate(user_info):
        first_seen = True
        key_info = None
        for info in user:
            if info in seen:
                first_seen = False
                key_info = info
                break
        if first_seen:
            results.append([i])
            pos = len(results) - 1
        else:
            index = seen[key_info]
            results[index].append(i)
            pos = index
        for info in user:
            seen[info] = pos
    return results

I think I've reached an optimized working solution using a graph. Basically, I created a graph in which each node holds a user's information and its index. Then I use DFS to traverse the graph and find the duplicates.
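The answer above describes the graph approach without showing code. One way it could look is sketched below; the function name `find_duplicates_graph` and the helper names `owners` and `adjacency` are assumptions made for illustration, not the author's actual implementation. Two users are treated as neighbours when they share any identifier, and each connected component found by an iterative depth-first search becomes one group.

```python
from collections import defaultdict


def find_duplicates_graph(user_info):
    """Group tuple indices that share at least one identifier (hypothetical helper)."""
    # Map each identifier to the indices of the tuples that contain it.
    owners = defaultdict(list)
    for index, user in enumerate(user_info):
        for identifier in user:
            owners[identifier].append(index)

    # Users sharing an identifier are neighbours in an undirected graph.
    adjacency = defaultdict(set)
    for indices in owners.values():
        first = indices[0]
        for other in indices[1:]:
            adjacency[first].add(other)
            adjacency[other].add(first)

    visited = set()
    groups = []
    for start in range(len(user_info)):
        if start in visited:
            continue
        # Iterative depth-first search collects one connected component.
        visited.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    # Start nodes are tried in ascending order, so the groups are already
    # ordered by their smallest index.
    return groups


users = [
    ("MLGuy42", "andrew#example.com", "123-4567"),
    ("CS229DungeonMaster", "123-4567", "ml#example.net"),
    ("Doomguy", "john#example.org", "carmack#example.com"),
    ("andrew26", "andrew#example.com", "mlguy#example.com"),
]
print(find_duplicates_graph(users))  # [[0, 1, 3], [2]]
```

Because the traversal works on whole components, it should also merge two groups that only become connected by a later tuple, a case that a single left-to-right pass over the list can miss.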

I think we can simplify this using sets:
from random import shuffle

def find_duplicates(user_info):
    reduced = unreduced = {frozenset(info): [i] for i, info in enumerate(user_info)}
    while reduced is unreduced or len(unreduced) > len(reduced):
        unreduced = dict(reduced) # make a copy
        for identifiers_1, positions_1 in unreduced.items():
            for identifiers_2, positions_2 in unreduced.items():
                if identifiers_1 is identifiers_2:
                    continue
                if identifiers_1 & identifiers_2:
                    del reduced[identifiers_1], reduced[identifiers_2]
                    reduced[identifiers_1 | identifiers_2] = positions_1 + positions_2
                    break
            else: # no break
                continue
            break
    return sorted(sorted(value) for value in reduced.values())

my_input = [
    ("CS229DungeonMaster", "123-4567", "ml#example.net"),
    ("Doomguy", "john#example.org", "carmack#example.com"),
    ("andrew26", "andrew#example.com", "mlguy#example.com"),
    ("MLGuy42", "andrew#example.com", "123-4567"),
]
shuffle(my_input) # shuffle to prove order independence
print(my_input)
print(find_duplicates(my_input))
OUTPUT
> python3 test.py
[('CS229DungeonMaster', '123-4567', 'ml#example.net'), ('MLGuy42', 'andrew#example.com', '123-4567'), ('andrew26', 'andrew#example.com', 'mlguy#example.com'), ('Doomguy', 'john#example.org', 'carmack#example.com')]
[[0, 1, 2], [3]]
>

Related

how to select pieces of 2 lists using np.sign?

I have two separate lists and would like to select pieces of both. I want to detect the sign changes (zero crossings) in the first list, then slice both lists at the indices where those sign changes occur.
I can write a function that returns the zero-crossing indices, but I do not know how to use those indices to select from both lists.
def func_1(arr1):
    for i, a in enumerate(arr1):
        zero_crossings = np.where(np.diff(np.sign(arr1)))[0]
    return zero_crossings
def func_2(arr1,arr2):
    res_list1 = []
    res_list2 = []
    for i in (zero_crossings):
        res_list1.append(arr1[i:i+1])
        res_list2.append(arr2[i:i+1])
zero_crossing = [3,12,18]
list1 = [-12,-14,-16,-10,1,3,6,2,,1,5,5,3,-1,-12,-2,-3,-5,-3,3,2,5,2]
list2 = [0.00040409 0.00041026 0.0004162 ... 0.00116538 0.001096 0.00102431]
The expected results:
new_list_1 = [list1[0:3]+list1[3:12]+list1[12:18]]
new_list_2 = [list2[0:3]+list2[3:12]+list2[12:18]]
for i in range(len(zero_crossing)):
    list_{%d} = []
    list_i = list1[zero_crossing[i]:zero_crossing[i+1]]
I want to use list1 to find where the sign changes, then use the resulting list of indices to select the corresponding values from both lists.
All efforts will be appreciated.
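One possible way to do the selection described above is to hand the zero-crossing indices to `np.split`, which cuts an array at a list of indices. This is only a sketch: the helper names `zero_crossings` and `split_at_crossings` and the sample data are made up for illustration, and it assumes both lists have the same length. The cuts are placed exactly at the crossing indices, mirroring the slices `list1[0:3]`, `list1[3:12]`, ... in the question; it also returns the final piece after the last crossing.

```python
import numpy as np


def zero_crossings(values):
    """Indices i where the sign of values[i] differs from that of values[i + 1]."""
    signs = np.sign(np.asarray(values))
    return np.where(np.diff(signs))[0]


def split_at_crossings(first, second):
    """Cut both sequences at the sign changes of `first` (hypothetical helper)."""
    cuts = zero_crossings(first)
    # np.split(a, [i, j]) returns the pieces a[:i], a[i:j] and a[j:].
    return np.split(np.asarray(first), cuts), np.split(np.asarray(second), cuts)


# Made-up sample data, not the values from the question.
signal = [-1, -2, 3, 4, -5]
other = [10, 20, 30, 40, 50]
pieces_signal, pieces_other = split_at_crossings(signal, other)
print([piece.tolist() for piece in pieces_signal])  # [[-1], [-2, 3], [4, -5]]
print([piece.tolist() for piece in pieces_other])   # [[10], [20, 30], [40, 50]]
```

If each piece should contain values of a single sign instead, adding 1 to `cuts` before splitting would move every cut to just after the sign change.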

Forming the sublist using same indices

Hi, I am trying to solve a problem where I have to return, as sublists, the indices of the tuples that belong to the same person. By "same person" I mean tuples that share the same username, phone or email (any one of them).
I understand that these identities are mostly unique, but for the sake of the question let's assume they can repeat.
E.g.
data = [("username1","phone_number1", "email1"),
        ("usernameX","phone_number1", "emailX"),
        ("usernameZ","phone_numberZ", "email1Z"),
        ("usernameY","phone_numberY", "emailX"),
        ("username2","phone_number2", "emailX")]
Expected output:
[[0,1,3,4],[2]]
Explanation: indices 0 and 1 share a phone number, and indices 1, 3 and 4 share an email, so they all fall into one category; index 2 falls into another category.
My approach so far is:
data = [("username1","phone_number1", "email1"),
        ("usernameX","phone_number1", "emailX"),
        ("usernameZ","phone_numberZ", "email1Z"),
        ("usernameY","phone_numberY", "emailX"),
        ]
def match(t1,t2):
    if(t1[0] == t2[0] or t1[1] == t2[1] or t1[2] == t2[2]):
        return True
    else:
        return False
# print(match(data[1],data[3]))
together = []
for i in range(len(data)):
    temp = {i}
    for j in range(len(data)):
        if(match(data[i],data[j])):
            temp.add(j)
    together.append(temp)
for i in range(len(data)):
    ans = together[i]
    for j in range(i+1,len(data)):
        if(bool(ans.intersection(together[j]))):
            ans = ans.union(together[j])
    print(ans)
I am not able to reach desired result.
Any help is appreciated. Thank you.

A first solution is similar to yours, with some enhancements:
Using any for the match, so that the code does not need to know the number of items inside the tuples.
Checking whether a user is already part of "together", to skip useless comparisons.
Here it is:
together = set()
for user_idx, user in enumerate(data):
    if user_idx in together:
        continue  # That user is already matched
    # No need to check with previous users
    for other_idx, other in enumerate(data[user_idx + 1 :], user_idx + 1):
        # Match
        if any(val_ref == val_other for val_ref, val_other in zip(user, other)):
            together.update((user_idx, other_idx))
isolated = set(range(len(data))) ^ together
Another solution uses a numpy array to identify isolated users. With numpy it is easy to compare a user to every other user (that is, to the original array). An isolated user matches only itself, once per field, so summing the match counts along the fields gives, for an isolated user, exactly the number of fields in the tuple.
import numpy as np
data = np.array(data)
# For each user, match it with the whole matrix
matches = sum(user == data for user in data)
# Isolated users only match with themselves, hence only have 1s on their line
isolated = set(np.where(np.sum(matches, axis=1) == data.shape[1])[0])
# Together are other users
together = set(range(len(data))) ^ set(isolated)
See the matches array for a better understanding:
[[1 2 1]
 [1 2 3]
 [1 1 1]
 [1 1 3]
 [1 1 3]]
However, it does not use any of the optimisations mentioned before.
Still, numpy is fast, so it should be OK.
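Both solutions above separate matched users from isolated ones, while the question asks for the groups themselves. A union-find (disjoint-set) structure is one technique that could produce them; it is swapped in here as an illustration and is not part of the answer above. The function name `group_users` is hypothetical, and the sketch assumes, as in the question's `match` function, that two tuples match only when they agree in the same field position.

```python
def group_users(data):
    """Return sorted groups of indices whose tuples share a value in the same field."""
    parent = list(range(len(data)))

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_owner = {}  # (field position, value) -> first index seen with it
    for index, user in enumerate(data):
        for key in enumerate(user):
            if key in first_owner:
                root_a, root_b = find(index), find(first_owner[key])
                # Union: attach the larger root to the smaller one.
                parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_owner[key] = index

    groups = {}
    for index in range(len(data)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())


data = [("username1", "phone_number1", "email1"),
        ("usernameX", "phone_number1", "emailX"),
        ("usernameZ", "phone_numberZ", "email1Z"),
        ("usernameY", "phone_numberY", "emailX"),
        ("username2", "phone_number2", "emailX")]
print(group_users(data))  # [[0, 1, 3, 4], [2]]
```

Each tuple is visited once and each field costs one dictionary lookup, so this avoids comparing every pair of users.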

Is there a better way to combine multiple items in a python list

I've created a function to combine specific items in a python list, but I suspect there is a better way I can't find despite extreme googling. I need the code to be fast, as I'm going to be doing this thousands of times.
mergeleft takes a list of items and a list of indices. In the example below, I call it as mergeleft(fields,(2,4,5)). Items 5, 4, and 2 of list fields will be concatenated to the item immediately to the left. In this case, 3 and d get concatenated to c; b gets concatenated to a. The result is a list ('ab', 'cd3', 'f').
fields = ['a','b','c','d', 3,'f']
def mergeleft(x, fieldnums):
    if 1 in fieldnums: raise Exception('Cannot merge field 1 left')
    if max(fieldnums) > len(x): raise IndexError('Fieldnum {} exceeds available fields {}'.format(max(fieldnums),len(x)))
    y = []
    deleted_rows = ''
    for i,l in enumerate(reversed(x)):
        if (len(x) - i) in fieldnums:
            deleted_rows = str(l) + deleted_rows
        else:
            y.append(str(l)+deleted_rows)
            deleted_rows = ''
    y.reverse()
    return y
print(mergeleft(fields,(2,4,5)))
# Returns ['ab','cd3','f']

fields = ['a','b','c','d', 3,'f']
This assumes a list of indices in monotonic ascending order.
I reverse the order, so that I'm merging right-to-left.
For each given index, I merge that element into the one on the left, converting to string at each point.
Do note that I've changed the fieldnums type to list, so that it's easily reversible. You can also just traverse the tuple in reverse order.
def mergeleft(lst, fieldnums):
    fieldnums.reverse()
    for pos in fieldnums:
        # Merge this field left
        lst[pos-2] = str(lst[pos-2]) + str(lst[pos-1])
        lst = lst[:pos-1] + lst[pos:]
    return lst
print(mergeleft(fields,[2,4,5]))
Output:
['ab', 'cd3', 'f']

Here's a decently concise solution, probably among many.
def mergeleft(x, fieldnums):
    if 1 in fieldnums: raise Exception('Cannot merge field 1 left')
    if max(fieldnums) > len(x): raise IndexError('Fieldnum {} exceeds available fields {}'.format(max(fieldnums),len(x)))
    ret = list(x)
    # Field numbers are 1-based, so field i lives at index i-1
    for i in reversed(sorted(set(fieldnums))):
        ret[i-2] = str(ret[i-2]) + str(ret.pop(i-1))
    return ret

matching results in a list of lists

I have a list with the following structure:
[('0','927','928'),('2','693','694'),('2','742','743'),('2','776','777'),('2','804','805'),
 ('2','987','988'),('2','997','998'),('2','1019','1020'),
 ('2','1038','1039'),('2','1047','1048'),('2','1083','1084'),('2','659','660'),
 ('2','677','678'),('2','743','744'),('2','777','778'),('2','805','806'),('2','830','831')]
The first number is an id, the second is the position of a word, and the third is the position of a second word. What I need to do, and am struggling with, is find sets of words that are next to each other.
These results are given for searches of 3 words, so there are the positions of word 1 with word 2 and the positions of word 2 with word 3. For example:
I run the phrase query "women in science" and get the values in the list above, so ('2','776','777') is the result for 'women in' and ('2','777','778') is the result for 'in science'.
I need to find a way to match these results up, so that for every document the words are grouped together depending on the number of words in the query (so if there are 4 words in the query, there will be 3 results that need to be matched together).
Is this possible?

You need to quickly find word info by its position. Create a dictionary keyed by word position:
# from your example; I wonder why you use strings and not numbers.
positions = [('0','927','928'),('2','693','694'),('2','742','743'),('2','776','777'),('2','804','805'),
('2','987','988'),('2','997','998'),('2','1019','1020'),
('2','1038','1039'),('2','1047','1048'),('2','1083','1084'),('2','659','660'),
('2','677','678'),('2','743','744'),('2','777','778'),('2','805','806'),('2','830','831')]
# create the dictionary
dict_by_position = {w_pos:(w_id, w_next) for (w_id, w_pos, w_next) in positions}
Now it's a piece of cake to follow chains:
>>> dict_by_position['776']
('2', '777')
>>> dict_by_position['777']
('2', '778')
Or programmatically:
def followChain(start, position_dict):
    result = []
    scanner = start
    while scanner in position_dict:
        next_item = position_dict[scanner]
        result.append(next_item)
        unused_id, scanner = next_item # unpack the (id, next_position)
    return result
>>> followChain('776', dict_by_position)
[('2', '777'), ('2', '778')]
Finding all chains that are not subchains of each other:
seen_items = set()
for start in dict_by_position:
    if start not in seen_items:
        chain = followChain(start, dict_by_position)
        # mark every position reached by this chain as seen
        seen_items.update(w_next for (w_id, w_next) in chain)
        print(chain) # or do something reasonable instead

The following will do what you're asking, as I understand it - it's not the prettiest output in the world, and I think that if possible you should be using numbers if numbers are what you're trying to work with.
There are probably more elegant solutions, and simplifications that could be made to this:
positions = [('0','927','928'),('2','693','694'),('2','742','743'),('2','776','777'),('2','804','805'),
('2','987','988'),('2','997','998'),('2','1019','1020'),
('2','1038','1039'),('2','1047','1048'),('2','1083','1084'),('2','659','660'),
('2','677','678'),('2','743','744'),('2','777','778'),('2','805','806'),('2','830','831')]
sorted_dict = {}
sorted_list = []
grouped_list = []
doc_ids = []
def sort_func(positions):
    for item in positions:
        if item[0] not in doc_ids:
            doc_ids.append(item[0])
    for doc_id in doc_ids:
        sorted_set = []
        for item in positions:
            if item[0] != doc_id:
                continue
            else:
                if item[1] not in sorted_set:
                    sorted_set.append(item[1])
                if item[2] not in sorted_set:
                    sorted_set.append(item[2])
        sorted_list = sorted(sorted_set)
        sorted_dict[doc_id] = sorted_list
    for k in sorted_dict:
        group = []
        grouped_list = []
        for i in sorted_dict[k]:
            try:
                if int(i)-1 == int(sorted_dict[k][sorted_dict[k].index(i)-1]):
                    group.append(i)
                else:
                    if group != []:
                        grouped_list.append(group)
                    group = [i]
            except IndexError:
                group.append(i)
                continue
        if grouped_list != []:
            sorted_dict[k] = grouped_list
        else:
            sorted_dict[k] = group
    return sorted_dict
My output for the above was:
{'0': ['927', '928'], '2': [['1019', '1020'], ['1038', '1039'], ['1047', '1048'], ['1083', '1084'], ['659', '660'], ['677', '678'], ['693', '694'], ['742', '743', '744'], ['776', '777', '778'], ['804', '805', '806'], ['830', '831'], ['987', '988']]}
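Following the suggestion above to work with numbers, a variant that converts the positions to integers and keeps each document separate might look like the sketch below. The function name `find_phrases` and its `query_length` parameter are assumptions for illustration. It also assumes that, within one document, a word position has at most one recorded successor, and it keeps only runs that span the full query length.

```python
from collections import defaultdict


def find_phrases(pairs, query_length):
    """Return {doc_id: [[pos1, pos2, ...], ...]} for full-length runs (hypothetical helper)."""
    # doc_id -> {position of a word: position of the word that follows it}
    next_position = defaultdict(dict)
    for doc_id, first, second in pairs:
        next_position[doc_id][int(first)] = int(second)

    phrases = defaultdict(list)
    for doc_id, links in next_position.items():
        for start in sorted(links):
            run = [start]
            # Follow the links until the run is long enough or the chain ends.
            while len(run) < query_length and run[-1] in links:
                run.append(links[run[-1]])
            if len(run) == query_length:
                phrases[doc_id].append(run)
    return dict(phrases)


# A subset of the tuples from the question.
pairs = [('0', '927', '928'), ('2', '742', '743'), ('2', '776', '777'),
         ('2', '743', '744'), ('2', '777', '778'), ('2', '830', '831')]
print(find_phrases(pairs, 3))  # {'2': [[742, 743, 744], [776, 777, 778]]}
```

Keying the lookup by document id avoids chaining a position in one document to the same position number in another, and integer positions sort numerically rather than alphabetically.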

2d list not working

I am trying to create a 2D list, and I keep getting the same error "TypeError: list indices must be integers, not tuple" I do not understand why, or how to use a 2D list correctly.
Total = 0
server = xmlrpclib.Server(url);
mainview = server.download_list("", "main")
info = [[]]
info[0,0] = hostname
info[0,1] = time
info[0,2] = complete
info[0,3] = Errors
for t in mainview:
    Total += 1
    print server.d.get_hash(t)
    info[Total, 0] = server.d.get_hash(t)
    info[Total, 1] = server.d.get_name(t)
    info[Total, 2] = server.d.complete(t)
    info[Total, 3] = server.d.message(t)
    if server.d.complete(t) == 1:
        Complete += 1
    else:
        Incomplete += 1
    if (str(server.d.message(t)).__len__() >= 3):
        Error += 1
info[0,2] = Complete
info[0,3] = Error
Everything works, except for trying to deal with info.

Your mistake is in accessing the 2D-list, modify:
info[0,0] = hostname
info[0,1] = time
info[0,2] = complete
info[0,3] = Errors
to:
info[0].append(hostname)
info[0].append(time)
info[0].append(complete)
info[0].append(Errors)
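The same goes for info[Total, 0] and the other assignments inside the loop: append a new row to info for each item, then append the values to that row.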

The way you created info, it is a list containing only one element, namely an empty list. When working with lists, you have to address the nested items like
info[0][0] = hostname
For initialization, you have to create a list of lists by e.g.
# create list of lists of 0, size is 10x10
info = [[0]*10 for i in range(10)]
When using numpy arrays, you can address the elements as you did.
One advantage of "lists of lists" is that the entries of the "2D list" need not all have the same data type!

info = [[] for i in range(4)] # create 4 empty lists (rows) inside a list
info[0].append(hostname)
info[0].append(time)
info[0].append(complete)
info[0].append(Errors)
You need to create the 2D list first.
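Putting the answers together, a table built row by row could look like the following sketch. The `downloads` records are made-up stand-ins for the values the question fetches over XML-RPC, and the header names are only placeholders. Each row is its own list, and a cell is addressed with two separate subscripts, `info[row][column]`.

```python
# Hypothetical stand-ins for the values fetched from the server in the question.
downloads = [
    ("hash-a", "first.iso", 1, ""),
    ("hash-b", "second.iso", 0, "Tracker timeout"),
]

# Row 0 is a header row; every later row is appended as its own list.
info = [["hostname", "time", "complete", "errors"]]
for record in downloads:
    info.append(list(record))

print(info[1][0])  # hash-a
print(info[2][3])  # Tracker timeout

# info[1, 0] passes the tuple (1, 0) as a single index, which a list rejects.
try:
    info[1, 0]
except TypeError as error:
    print("TypeError:", error)
```

Appending a whole row at a time means the list never has to be pre-sized, which suits a loop whose length is not known in advance.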
