What is an efficient Python algorithm to remove all mirrored text duplicates from a list whose items are in the format below?
ExList = [' dutch italian english', ' italian english dutch', ' dutch italian german', ' dutch german italian' ]
Required result: [' dutch english italian ', 'dutch german italian' ]
This solution uses the set data structure and focuses on producing compact code, mostly with list/set comprehensions and generator expressions. If this is a homework task for a beginner course and you just copy the result, it will be very obvious that you did not write the code yourself. Try to follow the thought process and reproduce the results yourself.
1) split each element at " " (space)
for item in ExList:
    splitted = item.split(" ")
2) Remove the empty elements produced by superfluous spaces in the input. This can be done in one line together with the step above (empty strings are "falsy") using a list comprehension:
for item in ExList:
    splitted = [lang for lang in item.split(" ") if lang]
3) Put the result in a set, which by definition disregards order and ignores duplicates. For this step we primarily need the property of order-independent equality, meaning set([1, 2]) == set([2, 1]). This can be combined with the line above using a generator expression:
for item in ExList:
    itemSet = set(lang for lang in item.split(" ") if lang)
Now, within that loop, put all those sets of languages into another set. Because item sets with the same items in any order are considered equal, the outer set will automatically discard any duplicates. To be able to put the item set into another set, it needs to be immutable and therefore hashable (a mutable set could change after being inserted, which would invalidate its hash). Python's immutable set type is called frozenset. The code looks like this:
ExList = [' dutch italian english', ' italian english dutch', ' dutch italian german', ' dutch german italian' ]
result = set()
for item in ExList:
    result.add(frozenset(lang for lang in item.split(" ") if lang))
Or, as a set comprehension on one line:
result = {frozenset(lang for lang in item.split(" ") if lang) for item in ExList}
The result is as follows:
>>> print(result)
{frozenset({'italian', 'dutch', 'german'}), frozenset({'italian', 'dutch', 'english'})}
You can turn that back into lists if the printed set output looks confusing to you:
>>> print([list(itemSet) for itemSet in result])
[['italian', 'dutch', 'german'], ['italian', 'dutch', 'english']]
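If you need the output in the string format shown in the question, here is a minimal sketch built on the same frozenset idea. The function name dedupe_mirrored is my own, and I am assuming that alphabetical word order within each result and first-seen order between results are acceptable. Note that a frozenset also collapses a word repeated inside one item.

```python
ExList = [' dutch italian english', ' italian english dutch',
          ' dutch italian german', ' dutch german italian']

def dedupe_mirrored(items):
    """Return one space-joined, alphabetically sorted string per distinct word set."""
    seen = set()
    result = []
    for item in items:
        key = frozenset(item.split())   # order-insensitive identity of the item
        if key not in seen:             # keep only the first occurrence
            seen.add(key)
            result.append(" ".join(sorted(key)))
    return result

print(dedupe_mirrored(ExList))
# ['dutch english italian', 'dutch german italian']
```

Calling split() without an argument already drops the empty strings caused by extra spaces, so no extra filtering is needed here.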
This may work for you:
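def unique_list(items):
    # sort each item's words so mirrored variants compare equal, then dedupe with a set
    unique = set(tuple(sorted(item.split())) for item in items)
    return [" ".join(words) for words in unique]
print(unique_list(ExList))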
This might not be the most efficient solution, but hope it will be of some help.
Using the property that the keys of a dictionary are unique:
m_dict = {}
for a in ExList:
    b = a.split()
    b.sort()
    m_dict[' '.join(b)] = None
print(list(m_dict.keys()))
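The same dictionary-key idea can probably be written more compactly with dict.fromkeys. This is only a sketch, assuming Python 3.7 or later, where dicts keep insertion order, so the first occurrence of each group decides its position in the output. The variable name unique is my own.

```python
ExList = [' dutch italian english', ' italian english dutch',
          ' dutch italian german', ' dutch german italian']

# Normalise each item to its sorted words; dict keys are unique and
# (from Python 3.7 on) remember the order in which they were inserted.
unique = list(dict.fromkeys(" ".join(sorted(item.split())) for item in ExList))

print(unique)
# ['dutch english italian', 'dutch german italian']
```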
I managed to do that, but the case I'm struggling with is when I have to treat 'color' as equal to 'colour' (and likewise for all such words) and return the count accordingly. To do this, I wrote a dictionary of common words whose spelling differs between American and British English, but I'm pretty sure this isn't the right approach.
ukus={'COLOUR':'COLOR','CHEQUE':'CHECK',
'PROGRAMME':'PROGRAM','GREY':'GRAY',
'JEWELLERY':'JEWELERY','ALUMINIUM':'ALUMINUM',
'THEATER':'THEATRE','LICENSE':'LICENCE','ARMOUR':'ARMOR',
'ARTEFACT':'ARTIFACT','CENTRE':'CENTER',
'CYPHER':'CIPHER','DISC':'DISK','FIBRE':'FIBER',
'FULFILL':'FULFIL','METRE':'METER',
'SAVOURY':'SAVORY','TONNE':'TON','TYRE':'TIRE',
'COLOR':'COLOUR','CHECK':'CHEQUE',
'PROGRAM':'PROGRAMME','GRAY':'GREY',
'JEWELERY':'JEWELLERY','ALUMINUM':'ALUMINIUM',
'THEATRE':'THEATER','LICENCE':'LICENSE','ARMOR':'ARMOUR',
'ARTIFACT':'ARTEFACT','CENTER':'CENTRE',
'CIPHER':'CYPHER','DISK':'DISC','FIBER':'FIBRE',
'FULFIL':'FULFILL','METER':'METRE','SAVORY':'SAVOURY',
'TON':'TONNE','TIRE':'TYRE'}
This is the dictionary I wrote to check the values. As you can see, this is degrading the performance. PyEnchant isn't available for 64-bit Python. Someone please help me out. Thank you in advance.
Okay, I think I know enough from your comments to provide this as a solution. The function below lets you choose either UK or US replacement (it defaults to US, but you can of course flip that) and lets you optionally perform minor hygiene on the string.
import re
ukus={'COLOUR':'COLOR','CHEQUE':'CHECK',
'PROGRAMME':'PROGRAM','GREY':'GRAY',
'JEWELLERY':'JEWELERY','ALUMINIUM':'ALUMINUM',
'THEATER':'THEATRE','LICENSE':'LICENCE','ARMOUR':'ARMOR',
'ARTEFACT':'ARTIFACT','CENTRE':'CENTER',
'CYPHER':'CIPHER','DISC':'DISK','FIBRE':'FIBER',
'FULFILL':'FULFIL','METRE':'METER',
'SAVOURY':'SAVORY','TONNE':'TON','TYRE':'TIRE'}
usuk={'COLOR':'COLOUR','CHECK':'CHEQUE',
'PROGRAM':'PROGRAMME','GRAY':'GREY',
'JEWELERY':'JEWELLERY','ALUMINUM':'ALUMINIUM',
'THEATRE':'THEATER','LICENCE':'LICENSE','ARMOR':'ARMOUR',
'ARTIFACT':'ARTEFACT','CENTER':'CENTRE',
'CIPHER':'CYPHER','DISK':'DISC','FIBER':'FIBRE',
'FULFIL':'FULFILL','METER':'METRE','SAVORY':'SAVOURY',
'TON':'TONNE','TIRE':'TYRE'}
def str_wd_count(my_string, uk=False, hygiene=True):
    us = not uk
    # if the UK flag is True, default to the UK version, else default to the US version
    print("Using the "+uk*"UK"+us*"US"+" dictionary for default words")
    # optional hygiene: replace non-alphanumeric characters with spaces for pure word counting
    if hygiene:
        my_string = re.sub(r'[^ \d\w]',' ',my_string)
        my_string = re.sub(r' {1,}',' ',my_string)
    # map every word in the text to its preferred spelling
    ttl_wds = [ukus.get(w,w) if us else usuk.get(w,w) for w in my_string.upper().split(' ')]
    wd_counts = {}
    for wd in ttl_wds:
        wd_counts[wd] = wd_counts.get(wd,0)+1
    return wd_counts
As a sample of use, consider the string
str1 = 'The colour of the dog is not the same as the color of the tire, or is it tyre, I can never tell which one will fulfill'
# Resulting sorted dict.items() With Default Settings
'[(THE,5),(TIRE,2),(COLOR,2),(OF,2),(IS,2),(FULFIL,1),(NEVER,1),(DOG,1),(SAME,1),(IT,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1)]'
# Resulting sorted dict.items() With hygiene=False
'[(THE,5),(COLOR,2),(OF,2),(IS,2),(FULFIL,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1),(IT,1),(TYRE,,1)]'
# Resulting sorted dict.items() With UK Swap, hygiene=True
'[(THE,5),(OF,2),(IS,2),(TYRE,2),(COLOUR,2),(WHICH,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(OR,1),(WILL,1),(AS,1),(CAN,1),(TELL,1),(NOT,1),(FULFILL,1),(ONE,1),(IT,1)]'
# Resulting sorted dict.items() With UK Swap, hygiene=False
'[(THE,5),(OF,2),(IS,2),(COLOUR,2),(ONE,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(FULFILL,1),(TYRE,,1),(IT,1),(OR,1)]'
You can use the resulting dictionary of word counts in any way you'd like, and if you need the original string with the modifications added it is easy enough to modify the function to also return that.
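For what it's worth, here is a shorter Python 3 sketch of the same idea. It keeps a single mapping and derives the reverse direction from it, so the two dictionaries cannot drift apart. The names UK_TO_US, US_TO_UK and count_words, as well as the deliberately tiny word list, are my own assumptions for illustration and not a complete spelling table.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny mapping: UK spelling -> US spelling.
UK_TO_US = {'COLOUR': 'COLOR', 'TYRE': 'TIRE', 'GREY': 'GRAY'}
# The reverse direction is derived rather than typed out a second time.
US_TO_UK = {us: uk for uk, us in UK_TO_US.items()}

def count_words(text, uk=False):
    """Count words after folding spelling variants into one preferred form."""
    mapping = US_TO_UK if uk else UK_TO_US
    # Tokenise on letters, digits and apostrophes, which drops punctuation.
    words = re.findall(r"[A-Z0-9']+", text.upper())
    return Counter(mapping.get(w, w) for w in words)

counts = count_words('The colour of the dog is not the color of the tire, or is it tyre')
print(counts['COLOR'], counts['TIRE'], counts['THE'])
# 2 2 4
```

Each word costs one dictionary lookup, so the mapping itself should not be a performance problem even for long texts.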
Step 1:
Create a temporary string, then replace every word in it that matches a value of your dict with its corresponding key:
>>> temp_string = str(my_string)
>>> for k, v in ukus.items():
...     temp_string = temp_string.replace(" {} ".format(v), " {} ".format(k)) # <--surround by space " " to replace only words
Step 2:
Now, to find the words in the string, first split it into a list of words and then use collections.Counter() to get the count of each element in the list. Below is the sample code:
>>> from collections import Counter
>>> my_string = 'Hello World! Hello again. I am saying Hello one more time'
>>> count_dict = Counter(my_string.split())
# Value of count_dict:
# Counter({'Hello': 3, 'saying': 1, 'again.': 1, 'I': 1, 'am': 1, 'one': 1, 'World!': 1, 'time': 1, 'more': 1})
>>> count_dict['Hello']
3
Step 3:
Now, since you want the count of both "colour" and "color" in your dict, re-iterate the dict to add those values, and the missing values as "0"
for k, v in ukus.items():
    if k in count_dict:
        count_dict[v] = count_dict[k]
    else:
        count_dict[v] = count_dict[k] = 0
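The three steps can be put together roughly as follows. This is a sketch under my own assumptions: a one-way, lower-case mapping named uk_to_us (hypothetical, only two entries) and a made-up sample sentence. It also swaps str.replace for a regular expression with \b word boundaries, because surrounding a word with spaces misses it at the start or end of the string and next to punctuation.

```python
import re
from collections import Counter

# Hypothetical one-way mapping (UK -> US), lower-case for this example.
uk_to_us = {'colour': 'color', 'tyre': 'tire'}

my_string = 'Colour me surprised: the colour of the tyre matches the color of the tire.'

# Step 1: replace whole words only; \b also matches next to punctuation
# and at the ends of the string.
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, uk_to_us)) + r')\b', re.IGNORECASE)
normalised = pattern.sub(lambda m: uk_to_us[m.group(0).lower()], my_string)

# Step 2: count the words, ignoring case and punctuation.
count_dict = Counter(re.findall(r'\w+', normalised.lower()))

# Step 3: report the same count under both spellings (a Counter returns 0 for missing keys).
for uk, us in uk_to_us.items():
    count_dict[uk] = count_dict[us]

print(count_dict['color'], count_dict['colour'], count_dict['tyre'])
# 3 3 2
```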
I need to build one list of the words that appear once and another list of the words that appear twice, without using any count method. I tried using a set, but it removes only the duplicate, not the original. Is there any way to keep the words appearing once in one list and the words appearing twice in another list?
The sample file is text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n'], so technically Andy and Andy would be in one list, and the rest in the other.
Using dictionaries is not allowed :/
for word in text:
    clean = clean_up(word)
    for words in clean.split():
        clean2 = clean_up(words)
        l = clean_list.append(clean2)
        if clean2 not in clean_list:
            clean_list.append(clean2)
print(clean_list)
This is a very bad, unPythonic way of doing things; but once you disallow Counter and dict, this is about all that's left. (Edit: except for sets, d'oh!)
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']
once_words = []
more_than_once_words = []
for sentence in text:
    for word in sentence.split():
        if word in more_than_once_words:
            pass # do nothing
        elif word in once_words:
            once_words.remove(word)
            more_than_once_words.append(word)
        else:
            once_words.append(word)
which results in
# once_words
['Fennimore', 'Cooper', 'Peter,', 'Paul,', 'and', 'Mary', 'Gosling']
# more_than_once_words
['Andy']
It is a silly problem, removing key data structures or loops or whatever. Why not just program in C then? Tell your teacher to get a job...
Editorial aside, here is a solution:
>>> text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n','Andy Gosling\n']
>>> data=' '.join(e.strip('\n,.') for e in ''.join(text).split()).split()
>>> data
['Andy', 'Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary', 'Andy', 'Gosling']
>>> [e for e in data if data.count(e)==1]
['Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary', 'Gosling']
>>> list({e for e in data if data.count(e)==2})
['Andy']
If you can use a set (I wouldn't use it either if you're not allowed to use dictionaries), then you can use one set to keep track of the words you have 'seen' and another one for the words that appear more than once. E.g.:
seen = set()
duplicate = set()
Then, each time you get a word, test whether it is in seen. If it is not, add it to seen. If it is in seen, add it to duplicate.
At the end, you'd have a seen set containing all the words, and a duplicate set with all those that appear more than once.
Then you only need to subtract duplicate from seen, and the result is the words that have no duplicates (i.e. the ones that appear only once).
This can also be implemented using only lists (which would be more honest to your homework, if a bit more laborious).
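A minimal sketch of the two-set approach described above, using the sample data from the question. Stripping commas and periods from each word is my own assumption about what the clean-up step should do, and the name once is mine as well.

```python
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']

seen = set()        # every word encountered so far
duplicate = set()   # words encountered more than once

for line in text:
    for raw in line.split():
        word = raw.strip(',.')   # assumption: punctuation is not part of the word
        if word in seen:
            duplicate.add(word)
        else:
            seen.add(word)

once = seen - duplicate          # set difference: words that appear exactly once

print(sorted(once))
# ['Cooper', 'Fennimore', 'Gosling', 'Mary', 'Paul', 'Peter', 'and']
print(sorted(duplicate))
# ['Andy']
```

Sets have no order, so the results are sorted here only to make the printed output predictable.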
from itertools import groupby
from operator import itemgetter
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']
one, two = [list(group) for key, group in groupby( sorted(((key, len(list(group))) for key, group in groupby( sorted(' '.join(text).split()))), key=itemgetter(1)), key=itemgetter(1))]
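The one-liner above yields lists of (word, count) pairs and only unpacks cleanly when exactly two distinct counts occur in the data. A more readable sketch of the same sort-then-groupby idea, assuming you want plain word lists for the counts 1 and 2 (the names words, counts, once and twice are my own), might look like this:

```python
from itertools import groupby

text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']

# groupby only merges adjacent equal items, so the words are sorted first.
words = sorted(' '.join(text).split())
counts = [(word, len(list(group))) for word, group in groupby(words)]

once = [word for word, n in counts if n == 1]
twice = [word for word, n in counts if n == 2]

print(once)
# ['Cooper', 'Fennimore', 'Gosling', 'Mary', 'Paul,', 'Peter,', 'and']
print(twice)
# ['Andy']
```

Punctuation is left attached to the words here, just as in the one-liner.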