Issues removing words from a list in Python

I'm building a Wordle solver. Basically, it removes words from a list if they don't have specific characters, or don't have them at specific locations. I'm not concerned about the statistics for optimal choices yet.
When I run the code below (I think all relevant sections are included), the output clearly shows that it found a letter in the same position as in the 'word of the day'. But on the next iteration it chooses a word that doesn't have that letter, when it should only select from the remaining words.
Are words not actually being removed? Or is there something shadowing a scope I can't find?
I've rewritten whole sections, with the exact same problem happening.
#Some imports and reading the word list here.
def word_compare(word_of_the_day, choice_word):
    results = []
    index = 0
    letters[:] = choice_word
    for letter in letters:
        if letter is word_of_the_day[index]:
            results.append((letter, 2, index))
        elif letter in word_of_the_day:
            results.append((letter, 1, index))
        else:
            results.append((letter, 0, index))
        index += 1
    print("\nIteration %s\nWord of the Day: %s,\nChoice Word: %s,\nResults: %s" % (
        iteration, word_of_the_day, choice_word, results))
    return results
def remove_wrong_words():
    for item in results:
        if item[1] == 0:
            for word in words:
                if item[0] in word:
                    words.remove(word)
    for item in results:
        if item[1] == 2:
            for word in words:
                if word[item[2]] != item[0]:
                    words.remove(word)
    print("Words Remaining: %s" % len(words))
    return words
words, letters = prep([])
# choice_word = best_word_choice()
choice_word = "crane"
iteration = 1
word_of_the_day = random.choice(words)
while True:
    if choice_word == word_of_the_day:
        break
    else:
        words.remove(choice_word)
        results = word_compare(word_of_the_day, choice_word)
        words = remove_wrong_words()
        if len(words) < 10:
            print(words)
        choice_word = random.choice(words)
        iteration += 1
Output I'm getting:
Iteration 1
Word of the Day: stake,
Choice Word: crane,
Results: [('c', 0, 0), ('r', 0, 1), ('a', 2, 2), ('n', 0, 3), ('e', 2, 4)]
Words Remaining: 386
Iteration 2
Word of the Day: stake,
Choice Word: lease,
Results: [('l', 0, 0), ('e', 1, 1), ('a', 2, 2), ('s', 1, 3), ('e', 2, 4)]
Words Remaining: 112
Iteration 3
Word of the Day: stake,
Choice Word: paste,
Results: [('p', 0, 0), ('a', 1, 1), ('s', 1, 2), ('t', 1, 3), ('e', 2, 4)]
Words Remaining: 81
Iteration 4
Word of the Day: stake,
Choice Word: spite,
... This continues for a while until solved. In this output, 'a' is found to be in the correct place (value of 2 in the tuple) on the second iteration. This should remove all words from the list that don't have 'a' as the third character. Instead 'paste' and 'spite' are chosen for later iterations from that same list, instead of having been removed.

Your issue has to do with removing an item from a list while you iterate over it. This often results in skipping later values, as the list iteration is being handled by index, under the covers.
Specifically, the problem is here (and probably in the other loop too):
for word in words:
    if item[0] in word:
        words.remove(word)
If the if condition is true for the first word in the words list, the second word will not be checked. That's because when the for loop asks the list iterator for the next value, it's going to yield the second value of the list as it now stands, which is going to be the third value from the original list (since the first one is gone).
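A small, self-contained demonstration of the skipping behaviour may make this concrete. The word list here is made up for illustration and is not taken from the question:

```python
# Hypothetical word list: two words containing "c" sit next to each other.
words = ["crane", "crate", "stake", "slate"]

# Buggy pattern: removing from the same list that is being iterated over.
for word in words:
    if "c" in word:
        words.remove(word)

# "crate" survives: once "crane" is removed, "crate" slides into index 0,
# but the iterator has already moved on to index 1.
print(words)  # ['crate', 'stake', 'slate']
```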
There are a few ways you could avoid this problem.
One approach is to iterate on a copy of the list you're going to modify. This means that the iterator won't ever skip over anything, since the copied list is not having anything removed from it as you go (only the original list is changing). A common way to make the copy is with a slice:
for word in words[:]: # iterate on a copy of the list
    if item[0] in word:
        words.remove(word) # modify the original list here
Another option is to build a new list full of the valid values from the original list, rather than removing the invalid ones. A list comprehension is often good enough for this:
words = [word for word in words if item[0] not in word]
This may be slightly complicated in your example because you're using global variables. You would either need to change that design (e.g. accept a list as an argument and return the new version), or add a global words statement to let the function's code rebind the global variable (rather than modifying it in place).
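As a rough sketch of the first design (pass the list in, return a new one), the function from the question could be reshaped along these lines. The signature with words and results as parameters is an assumption made for illustration, and the sample data is invented; the filtering rules themselves are carried over unchanged from the question:

```python
def remove_wrong_words(words, results):
    """Return a new list without the words ruled out by `results`.

    `results` holds (letter, code, index) tuples as in the question:
    code 0 means the letter is absent, code 2 means it is correct at `index`.
    """
    for letter, code, index in results:
        if code == 0:
            # Keep only words that do not contain the absent letter.
            words = [w for w in words if letter not in w]
        elif code == 2:
            # Keep only words that have the letter at the known position.
            words = [w for w in words if w[index] == letter]
    return words

# Hypothetical sample data in the same shape as the question's output.
remaining = remove_wrong_words(
    ["paste", "spite", "stake", "shake", "crane"],
    [("l", 0, 0), ("e", 1, 1), ("a", 2, 2), ("s", 1, 3), ("e", 2, 4)],
)
print(remaining)  # ['stake', 'shake', 'crane']
```

Because each comprehension builds a fresh list, nothing is removed from a list while it is being iterated, and the caller's original list is left untouched.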

I think one of your issues is the following line: if letter is word_of_the_day[index]:. This should use == rather than is, because is checks whether the two operands are the same object (i.e. have the same id()), not whether they have the same value. For single-character strings the comparison often happens to succeed anyway, since CPython reuses those objects, which would explain why your output does contain tuples with a value of 2. That is an implementation detail, though, and should not be relied on. There may be more going on but I'd like a concrete example to run before digging in further.

Related

How to optimize comparing combinations in a python list

I have a list of 16,000 lists. Each sublist contains a tuple and a score, like this:
mylist = [
    [('only','one','time'),10.54],
    [('one','time'),3.21],
    [('red','hot','chili','peppers'),0.223],
    [('red','hot','chili'),1.98]
]
My goal is to iterate through combinations in mylist and remove any element when a superset or subset is detected. The element to be removed is based on the lowest score between the two. So in this example, I want to remove
[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223]
because ('one','time') is a subset of ('only','one','time') and, of the two, ('one','time') has the lower score (10.54 > 3.21).
Likewise, ('red','hot','chili','peppers') is a superset of ('red','hot','chili') and, of the two, it has the lower score (0.223 < 1.98).
My initial solution was brute force: get every possible pair from the list (n choose 2), compare the tuples for subsets using the all() function, then drop the item with the min() score.
This performs poorly due to the number of combinations to search:
from itertools import combinations
from operator import itemgetter
removelist = []
for x,y in combinations(mylist,2):
    if (all(word in x[0] for word in y[0]) or all(word in y[0] for word in x[0])):
        smallest = min([x,y],key=itemgetter(1))
        removelist.append(smallest)
removelist = set(removelist)
outlist = [x for x in mylist if x not in removelist]
return outlist
returns:
outlist = [
    [('only','one','time'),10.54],
    [('red','hot','chili'),1.98]
]
So for a list of ~16,000 sublists, this would be roughly:
combinations = n! / (r! * (n-r)!)
combinations = 16,000! / (2! * (15998)!)
combinations = 16,000 * 15999 / 2
combinations = 127,992,000
Is there a smarter way to do this, reducing the 127 million items I need to check?

This might be a thousand times faster than yours. First I convert the word tuples to sets for simpler and faster subset checks, like @Alexander. Then I sort by set size, so I don't have to check for superset. (Because if |A| ≤ |B|, then the only way A is a superset of B is if it is B, in which case it is also a subset of B).
And then comes my main trick. Let's say we have the word set {'red','hot','chili'}, and we want to find the word sets of which it is a subset. Do we need to check all other (larger or equal-size) sets? No. It suffices to check only those sets that contain the word 'red'. Or only those with 'hot'. Or only those with 'chili'. Let's take the rarest word, i.e., the one in the fewest sets (in this case I'd guess 'chili').
I decided to call your lists "songs", makes it nice to talk about them.
from collections import defaultdict
def process_list_stefan(mylist):
    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))
    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(list)
    for song in songs:
        i, words1, score1 = song
        # Pick the song's rarest word
        word = min(words1, key=lambda word: len(songs_with_word[word]))
        # Go through songs containing that word
        for j, words2, score2 in songs_with_word[word]:
            if words1 <= words2:
                # Lower score loses. In case of tie, lower index loses.
                remove.add(min((score1, i), (score2, j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].append(song)
    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]
Update: Actually, instead of just using the song's rarest word and going through all its "supersets" (sets containing that word), consider all words in the current song and use the intersection of their "supersets". In my testing with made-up data, it's even faster, by a factor of about 1.6:
from collections import defaultdict
def process_list_stefan(mylist):
    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))
    # Helper: Intersection of sets
    def intersect(sets):
        s = next(sets).copy()
        for t in sets:
            s &= t
        return s
    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(set)
    for song in songs:
        i, words1, score1 = song
        for j in intersect(songs_with_word[word] for word in words1):
            # Lower score loses. In case of tie, lower index loses.
            remove.add(min((score1, i), (mylist[j][1], j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].add(i)
    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]

First, create a new list that retains the original scores but converts the tuples of words into sets for faster comparisons and set membership testing.
Enumerate through each set of words and scores in this new list and compare against all remaining sets and scores. Using sets, we can detect subsets via s1.issubset(s2) and supersets via s1.issuperset(s2).
Once the subset/superset has been detected, we compare scores. If the current record has a higher score, we mark the other for removal and then continue comparing against the remaining records. Otherwise, we add the current index location to a set of indices to be subsequently removed and continue any remaining comparisons against this record.
Once we have processed all of the records, we use a conditional list comprehension to create a new list of all records to keep.
In terms of subset comparisons, the worst case is n(n-1)/2 comparisons, which is still O(n^2). Of course, each subset comparison has its own time complexity based on the number of unique words in each sublist. This solution thus makes the same number of comparisons as the OP's for x,y in combinations(mylist,2) method, but the subset/superset comparisons are done using sets rather than tuples. As a result, this method should still be significantly faster.
def process_list(my_list):
    # Convert tuples to sets.
    my_sets = [(set(tuples), score) for tuples, score in my_list]
    idx_to_remove = set()
    for i, (subset1, score1) in enumerate(my_sets):
        for j, (subset2, score2) in enumerate(my_sets[(i + 1):], start=i + 1):
            if subset1.issubset(subset2) | subset1.issuperset(subset2):
                # Subset/Superset detected.
                idx_to_remove.add(i if score1 < score2 else j)
    # Remove filtered items from list and return filtered list.
    return [tup for n, tup in enumerate(my_list) if n not in idx_to_remove]
# TEST CASES
# Case 1.
mylist = [
    [('only','one','time'), 10.54],
    [('one','time'), 3.21],
    [('red','hot','chili','peppers'), 0.223],
    [('red','hot','chili'), 1.98],
]
>>> process_list(mylist)
[[('only', 'one', 'time'), 10.54], [('red', 'hot', 'chili'), 1.98]]
# Case 2.
# ('a', 'b', 'd') is superset of ('a', 'b') and has a lower score, so remove former.
# ('a', 'b') is a subset of ('a', 'b', 'c') and has a lower score, so remove former.
mylist = [[('a', 'b', 'c'), 3], [('a', 'b', 'd'), 1], [('a', 'b'), 2]]
>>> process_list(mylist)
[[('a', 'b', 'c'), 3]]
# Case 3. Same items as Case 2, but different order. Same logic as Case 2.
mylist = [[('a', 'b'), 2], [('a', 'b', 'c'), 3], [('a', 'b', 'd'), 1]]
>>> process_list(mylist)
[[('a', 'b', 'c'), 3]]
# Case 4.
# ('a', 'b', 'c') is a superset of ('a', 'b') and has a lower score, so remove former.
# ('d','c') is a subset of ('d','c','w') and has a lower score, so remove former.
mylist = [[('a', 'b'), 2], [('a', 'b', 'c'), 1], [('d','c','w'), 4], [('d','c'), 2]]
>>> process_list(mylist)
[[('a', 'b'), 2], [('d', 'c', 'w'), 4]]

Reducing compute time for Anagram word search

The code below is a brute force method of searching a list of words and creating sub-lists of any that are Anagrams.
Searching the entire English dictionary is prohibitively time-consuming, so I'm curious if anyone has tips for reducing the computational complexity of the code.
def anogramtastic(anagrms):
    d = []
    e = []
    for j in range(len(anagrms)):
        if anagrms[j] in e:
            pass
        else:
            templist = []
            tester = anagrms[j]
            tester = list(tester)
            tester.sort()
            tester = ''.join(tester)
            for k in range(len(anagrms)):
                if k == j:
                    pass
                else:
                    testers = anagrms[k]
                    testers = list(testers)
                    testers.sort()
                    testers = ''.join(testers)
                    if testers == tester:
                        templist.append(anagrms[k])
                        e.append(anagrms[k])
            if len(templist) > 0:
                templist.append(anagrms[j])
                d.append(templist)
    d.sort(key=len,reverse=True)
    return d
print(anogramtastic(wordlist))

How about using a dictionary of frozensets? Frozensets are immutable, meaning you can hash them for constant lookup. And when it comes to anagrams, what makes two words anagrams of each other is that they have the same letters with the same count. So you can construct a frozenset of {(letter, count), ...} pairs, and hash these for efficient lookup.
Here's a quick little function to convert a word to a multiset using collections.Counter:
from collections import Counter, defaultdict
def word2multiset(word):
    return frozenset(Counter(word).items())
Now, given a list of words, populate your anagram dictionary like this:
list_of_words = [... ]
anagram_dict = defaultdict(set)
for word in list_of_words:
    anagram_dict[word2multiset(word)].add(word)
For example, when list_of_words = ['hello', 'olleh', 'test', 'apple'], this is the output of anagram_dict after a run of the loop above:
print(anagram_dict)
defaultdict(set,
{frozenset({('e', 1), ('h', 1), ('l', 2), ('o', 1)}): {'hello',
'olleh'},
frozenset({('e', 1), ('s', 1), ('t', 2)}): {'test'},
frozenset({('a', 1), ('e', 1), ('l', 1), ('p', 2)}): {'apple'}})
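To get from this dictionary to the actual anagram groups, one possible way to tie the pieces together is sketched below. The function name group_anagrams and the choice to drop words without any anagram are assumptions for illustration, not part of the approach described above:

```python
from collections import Counter, defaultdict

def group_anagrams(list_of_words):
    """Group words by their letter multiset; keep groups of two or more."""
    anagram_dict = defaultdict(set)
    for word in list_of_words:
        # Same key construction as word2multiset above.
        key = frozenset(Counter(word).items())
        anagram_dict[key].add(word)
    # A group with a single member has no anagram partner, so skip it.
    return [group for group in anagram_dict.values() if len(group) > 1]

groups = group_anagrams(['hello', 'olleh', 'test', 'apple'])
print(groups)
```

Each word is processed once, so the work grows roughly linearly with the number of words instead of quadratically.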

Unless I'm misunderstanding the problem, simply grouping the words by sorting their characters should be an efficient solution -- as you've already realized. The trick is to avoid comparing every word to all the other ones. A dict with the char-sorted string as key will make finding the right group for each word fast; a lookup or insertion takes O(1) time on average.
#!/usr/bin/env python3
#coding=utf8
from sys import stdin
groups = {}
for line in stdin:
    w = line.strip()
    g = ''.join(sorted(w))
    if g not in groups:
        groups[g] = []
    groups[g].append(w)
for g, words in groups.items():
    if len(words) > 1:
        print('%2d %-20s' % (len(words), g), ' '.join(words))
Testing on my words file (99171 words), it seems to work well:
anagram$ wc /usr/share/dict/words
99171 99171 938848 /usr/share/dict/words
anagram$ time ./anagram.py < /usr/share/dict/words | tail
2 eeeprsw sweeper weepers
2 brsu burs rubs
2 aeegnrv avenger engrave
2 ddenoru redound rounded
3 aesy ayes easy yeas
2 gimnpu impugn umping
2 deeiinsst densities destinies
2 abinost bastion obtains
2 degilr girdle glider
2 orsttu trouts tutors
real 0m0.366s
user 0m0.357s
sys 0m0.012s

You can speed things up considerably by using a dictionary for checking membership instead of doing linear searches. The only "trick" is to devise a way to create keys for it such that it will be the same for anagrammatical words (and not for others).
In the code below this is being done by creating a sorted tuple from the letters in each word.
def anagramtastic(words):
    dct = {}
    for word in words:
        key = tuple(sorted(word)) # Identifier based on letters.
        dct.setdefault(key, []).append(word)
    # Return a list of all that had an anagram.
    return [words for words in dct.values() if len(words) > 1]
wordlist = ['act', 'cat', 'binary', 'brainy', 'case', 'aces',
            'aide', 'idea', 'earth', 'heart', 'tea', 'tee']
print('result:', anagramtastic(wordlist))
Output produced:
result: [['act', 'cat'], ['binary', 'brainy'], ['case', 'aces'], ['aide', 'idea'], ['earth', 'heart']]

Trying to sort a dict by dict.values()

The task is to read a file, create a dict and print out the word and its counter value. Below is code that works fine, but I can't seem to get my mind to understand why in the print_words() function, I can't change the sort to:
words = sorted(word_count.values())
and then print the word and its counter, sorted by the counter (number of times that word is in word_count[]).
def word_count_dict(filename):
    word_count = {}
    input_file = open(filename, 'r')
    for line in input_file:
        words = line.split()
        for word in words:
            word = word.lower()
            if not word in word_count:
                word_count[word] = 1
            else:
                word_count[word] = word_count[word] + 1
    input_file.close()
    return word_count
def print_words(filename):
    word_count = word_count_dict(filename)
    words = sorted(word_count.keys())
    for word in words:
        print word, word_count[word]

If you want the output sorted by value (with the keys included), the simplest approach is to sort the items (key-value pairs), using a key argument to sorted that sorts on the value, then iterate over the result. So for your example, you'd replace:
words = sorted(word_count.keys())
for word in words:
    print word, word_count[word]
with (adding from operator import itemgetter to the top of the module):
# key=itemgetter(1) means the sort key is the second value in each key-value
# tuple, meaning the value
sorted_word_counts = sorted(word_count.items(), key=itemgetter(1))
for word, count in sorted_word_counts:
    print word, count
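The reason sorted(word_count.values()) alone is not enough is that it returns only the counts, so the words they belong to are lost. Sorting the items keeps each word paired with its count. The snippets in this thread use the Python 2 print statement; a minimal Python 3 sketch of the same idea, with a made-up word_count dict standing in for the result of word_count_dict(filename), might look like this:

```python
from operator import itemgetter

# Hypothetical counts, standing in for word_count_dict(filename).
word_count = {"the": 3, "cat": 1, "sat": 2}

# Sort the (word, count) pairs by the count, i.e. element 1 of each pair.
sorted_word_counts = sorted(word_count.items(), key=itemgetter(1))
for word, count in sorted_word_counts:
    print(word, count)
```

Passing reverse=True to sorted would list the most frequent words first.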

First thing to note is that dictionaries are not sorted by value. Historically they guaranteed no order at all, and since Python 3.7 they preserve insertion order only. Therefore, it is good practice to convert your dict to a list of tuples ordered in some way.
The below function will help you convert a dictionary to a list of tuples ordered by values.
d = {'a': 5, 'b': 1, 'c': 7, 'd': 3}
def order_by_values(dct):
    rev = sorted((v, k) for k, v in dct.items())
    return [t[::-1] for t in rev]
order_by_values(d) # [('b', 1), ('d', 3), ('a', 5), ('c', 7)]

Mapping modified string indices to original string indices in Python

I'm relatively new to programming and wanted to get some help with a problem I have. I need to figure out a way to map the indices of a string back to an original string after removing certain positions. For example, say I had a string:
original_string = 'abcdefgh'
And I removed a few elements to get:
new_string = 'acfh'
I need a way to get the "true" indices of new_string. In other words, I want the indices of the positions I've kept as they were in original_string. Thus returning:
original_indices_of_new_string = [0,2,5,7]
My general approach has been something like this:
I find the positions I've removed in the original_string to get:
removed_positions = [1,3,4,6]
Then given the indices of new_string:
new_string_indices = [0,1,2,3]
Then I think I should be able to do something like this:
original_indices_of_new_string = []
for i in new_string_indices:
    offset = 0
    corrected_value = i + offset
    if corrected_value in removed_positions:
        #somehow offset to correct value
        offset+=1
    else:
        original_indices_of_new_string.append(corrected_value)
This doesn't really work because the offset is reset to 0 after every loop, which I only want to happen if the corrected_value is in removed_positions (ie. I want to offset 2 for removed_positions 3 and 4 but only 1 if consecutive positions weren't removed).
I need to do this based off positions I've removed rather than those I've kept because further down the line I'll be removing more positions and I'd like to just have an easy function to map those back to the original each time. I also can't just search for the parts I've removed because the real string isn't unique enough to guarantee that the correct portion gets found.
Any help would be much appreciated. I've been using stack overflow for a while now and have always found the question I've had in a previous thread but couldn't find something this time so I decided to post a question myself! Let me know if anything needs clarification.
*Letters in the string are not unique

Given your string original_string = 'abcdefgh', you can create a list of (index, character) tuples:
>>> li=[(i, c) for i, c in enumerate(original_string)]
>>> li
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h')]
Then remove your desired characters:
>>> new_li=[t for t in li if t[1] not in 'bdeg']
>>> new_li
[(0, 'a'), (2, 'c'), (5, 'f'), (7, 'h')]
Then rejoin that into a string:
>>> ''.join([t[1] for t in new_li])
'acfh'
Your 'answer' is the method used to create new_li and referring to the index there:
>>> ', '.join(map(str, (t[0] for t in new_li)))
'0, 2, 5, 7'

You can create a new class to deal with this stuff:
class String:
    def __init__(self, myString):
        self.myString = myString
        self.myMap = {}
        self.__createMapping(self.myString)
    def __createMapping(self, myString):
        index = 0
        for character in myString:
            # If the character already exists in the map, append the index to the list
            if character in self.myMap:
                self.myMap[character].append(index)
            else:
                self.myMap[character] = [index,]
            index += 1
    def removeCharacters(self, myList):
        for character in self.myString:
            if character in myList:
                self.myString = self.myString.replace(character, '')
                # pop() instead of del: no KeyError if the character occurs more than once
                self.myMap.pop(character, None)
        return self.myString
    def getIndeces(self):
        return self.myMap
if __name__ == '__main__':
    myString = String('abcdef')
    print myString.removeCharacters(['a', 'b']) # Prints cdef
    print myString.getIndeces() # Prints each character and a list of the indices these occur at
This will give a mapping of the characters to a list of the indices at which they occur in the original string. You can add more functionality if you want a single list returned, etc. Hopefully this gives you an idea of how to start.

If removing by index, you simply need to start with a list of all indexes, e.g.: [0, 1, 2, 3, 4] and then, as you remove at each index, remove it from that list. For example, if you're removing indexes 1 and 3, you'll do:
idxlst.remove(1)
idxlst.remove(3)
idxlst # => [0, 2, 4]
[update]: if not removing by index, it's probably easiest to find the index first and then proceed with the above solution, e.g. if removing 'c' from 'abc', do:
i = mystr.index('c')
# remove 'c'
idxlst.remove(i)
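Since the question mentions removing more positions later on, the index-list idea can be wrapped in a small helper that is applied once per round. This is only a sketch: the name remove_positions and its signature are made up for illustration, and positions are assumed to be indices into the current (already shortened) string:

```python
def remove_positions(text, kept, positions):
    """Remove `positions` (indices into `text`) and update the index map.

    `kept[i]` is the index in the original string of the character `text[i]`.
    Returns the shortened string and the matching shortened index list.
    """
    drop = set(positions)
    new_text = "".join(c for i, c in enumerate(text) if i not in drop)
    new_kept = [orig for i, orig in enumerate(kept) if i not in drop]
    return new_text, new_kept

original_string = "abcdefgh"
idxlst = list(range(len(original_string)))  # [0, 1, ..., 7]

# First round: remove positions 1, 3, 4 and 6 of the original string.
new_string, idxlst = remove_positions(original_string, idxlst, [1, 3, 4, 6])
print(new_string, idxlst)  # acfh [0, 2, 5, 7]

# Second round: position 1 now refers to 'c' in 'acfh'.
newer_string, idxlst = remove_positions(new_string, idxlst, [1])
print(newer_string, idxlst)  # afh [0, 5, 7]
```

Because the mapping is carried by position rather than by character, it does not matter whether the letters in the string are unique.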

Trying to stay as close as possible to what you were originally trying to accomplish, this code should work:
big = 'abcdefgh'
small='acfh'
l = []
current = 0
while len(small) >0:
    if big[current] == small[0]:
        l.append(current)
        small = small[1:]
    # Always advance, so a repeated letter cannot match the same position twice
    current += 1
print(l)
The idea is working from the front so you don't need to worry about offset.
A precondition is of course that small actually is obtained by removing a few indices from big. Otherwise, an IndexError is thrown. If you need the code to be more robust, just catch the exception at the very end and return an empty list or something. Otherwise the code should work fine.

Assuming the characters in your input string are unique, this is what is happening with your code:
original_indices_of_new_string = []
for i in new_string_indices:
    offset = 0
    corrected_value = i + offset
    if corrected_value in removed_positions:
        #somehow offset to correct value
        offset+=1
    else:
        original_indices_of_new_string.append(corrected_value)
Setting offset to 0 every time in the loop is as good as having it preset to 0 outside the loop. And if you are adding 0 to i every time in the loop, you might as well use i. That boils the loop body down to:
if i in removed_positions:
    #somehow offset to correct value
    pass
else:
    original_indices_of_new_string.append(i)
This code gives the output [0, 2], and the logic is right (again assuming the characters in the input are unique). What you should be doing is running the loop over the length of original_string. That will give you what you want, like this:
original_indices_of_new_string = []
for i in range(len(original_string)):
    if i in removed_positions:
        #somehow offset to correct value
        pass
    else:
        original_indices_of_new_string.append(i)
print original_indices_of_new_string
This prints:
[0, 2, 5, 7]
A simpler one-liner to achieve the same, valid only when the characters are unique because str.index returns the first occurrence, would be:
original_indices_of_new_string = [original_string.index(c) for c in new_string]
Hope this helps.

It may help to map the characters in the new string with their positions in the original string in a dictionary and recover the new string like this:
import operator
chars = {'a':0, 'c':2, 'f':5, 'h':7}
sorted_chars = sorted(chars.items(), key=operator.itemgetter(1))
new_string = ''.join([char for char, pos in sorted_chars]) # 'acfh'

Can Words in List1 be Spelled by Letters in List2

I'm new to coding and python, and I'm trying a version of the Scrabble Challenge at OpenHatch: https://openhatch.org/wiki/Scrabble_challenge.
The goal is to check whether each word in a list can be spelled by letters in a tile rack. I wrote a for-loop to check whether each letter in the word is in the tile rack, and if so, remove the letter from the rack (to deal with duplicates). However, I'm stumped on how to add a word to my valid_word list if the for-loop finds that each letter of the word is in the rack.
In this example, 'age' should be valid, but 'gag' should not be, as there is only one 'g' in the rack.
word_list = ['age', 'gag']
rack = 'page'
valid_words = []
for word in word_list:
    new_rack = rack
    for x in range(len(word)):
        if word[x] in new_rack:
            new_rack = new_rack.replace(str(word[x]), "")

I would probably use a Counter here to simplify things. What the Counter class does is create a mapping from the items in an iterable to their frequencies. I can use that to check whether the frequency of any individual character in the word is greater than its frequency in the rack, and print the word accordingly.
>>> from collections import Counter
>>> word_list = ['age', 'gag']
>>> rack = Counter('page')
>>> print rack
Counter({'a': 1, 'p': 1, 'e': 1, 'g': 1})
>>> for word in word_list:
        word_count = Counter(word)
        for key, val in word_count.iteritems():
            if rack[key] < val:
                break
        else:
            print word
age # Output.
Also, Counter has the nice property that it returns 0 if the given key does not exist in the Counter. So we can skip the check for whether the rack has the letter at all: rack[key] is 0 in that case, so rack[key] < val is true and the word is rejected.
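To actually collect the valid words, as the question asks, the same check can be put into a helper. The following is a Python 3 sketch; the name can_spell is made up for illustration, and it uses Counter subtraction, a slightly different technique from the explicit loop above:

```python
from collections import Counter

def can_spell(word, rack):
    # Counter subtraction keeps only positive counts, so the result is empty
    # exactly when the rack has at least as many of each letter as the word needs.
    return not (Counter(word) - Counter(rack))

word_list = ['age', 'gag']
rack = 'page'
valid_words = [word for word in word_list if can_spell(word, rack)]
print(valid_words)  # ['age']
```

'gag' is rejected because it needs two 'g' tiles while the rack has only one.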
