Find anagrams of a given word in a file - python

Alright so for class we have this problem where we need to be able to input a word and from a given text file (wordlist.txt) a list will be made using any anagrams of that word found in the file.
My code so far looks like this:
def find_anagrams1(string):
    """Takes a string and returns a list of anagrams for that string from the wordlist.txt file.
    string -> list"""
    anagrams = []
    file = open("wordlist.txt")
    next = file.readline()
    while next != "":
        isit = is_anagram(string, next)
        if isit is True:
            anagrams.append(next)
        next = file.readline()
    file.close()
    return anagrams
Every time I try to run the program it just returns an empty list, despite the fact that I know there are anagrams present. Any ideas on what's wrong?
P.S. The is_anagram function looks like this:
def is_anagram(string1, string2):
    """Takes two strings and returns True if the strings are anagrams of each other.
    string, string -> bool"""
    a = sorted(string1)
    b = sorted(string2)
    if a == b:
        return True
    else:
        return False
I am using Python 3.4

The problem is that you are using the readline function. From the documentation:
file.readline = readline(...)
readline([size]) -> next line from the file, as a string.
Retain newline. A non-negative size argument limits the maximum
number of bytes to return (an incomplete line may be returned then).
Return an empty string at EOF.
The key information here is "Retain newline". That means that if you have a file containing a list of words, one per line, each word is going to be returned with a terminal newline. So when you call:
next = file.readline()
You're not getting example, you're getting example\n, so this will never match your input string.
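A quick way to see the mismatch, as a minimal sketch (the word pair is only an illustration and is not taken from the asker's file):

```python
# A line as readline() returns it, with the trailing newline retained
line = "silent\n"
word = "listen"

# The newline counts as a character, so the sorted letters differ
with_newline = sorted(word) == sorted(line)

# After strip() both strings contain exactly the same letters
without_newline = sorted(word) == sorted(line.strip())

print(with_newline, without_newline)
```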
A simple solution is to call the strip() method on the lines read from the file:
next = file.readline().strip()
while next != "":
    isit = is_anagram(string, next)
    if isit is True:
        anagrams.append(next)
    next = file.readline().strip()
file.close()
However, there are several problems with this code. To start with, file and next are poor names for variables: in Python 2, file shadows the built-in file type (Python 3 no longer has it), and next shadows the built-in next() function.
Rather than repeatedly calling readline(), you're better off taking advantage of the fact that an open file is an iterator which yields the lines of the file:
words = open('wordlist.txt')
for word in words:
    word = word.strip()
    isit = is_anagram(string, word)
    if isit:
        anagrams.append(word)
words.close()
Note also here that since is_anagram returns True or False, you don't need to compare the result to True or False (e.g., if isit is True). You can simply use the return value on its own.
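Putting the points from this answer together, one possible version of the whole function is sketched below. The path parameter and its default are assumptions added so the sketch can be pointed at any file; the with block is a substitution for the explicit close() call.

```python
def is_anagram(first, second):
    """Return True if the two strings contain exactly the same letters."""
    return sorted(first) == sorted(second)


def find_anagrams(target, path="wordlist.txt"):
    """Return the words in the file at `path` that are anagrams of `target`."""
    anagrams = []
    # The with block closes the file even if an exception is raised
    with open(path) as words:
        for line in words:
            word = line.strip()  # drop the trailing newline
            if is_anagram(target, word):
                anagrams.append(word)
    return anagrams
```

Note that if the target word itself appears in the file, it is returned as well, since every word is an anagram of itself.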

Yikes, don't use for loops:
import collections
def find_anagrams(x):
    anagrams = [''.join(sorted(list(i))) for i in x]
    anagrams_counts = [item for item, count in collections.Counter(anagrams).items() if count > 1]
    return [i for i in x if ''.join(sorted(list(i))) in anagrams_counts]

Here's another solution, that I think is quite elegant. This runs in O(n * m) where n is the number of words and m is number of letters (or average number of letters/word).
# anagarams.py
from collections import Counter
import urllib.request
def word_hash(word):
    return frozenset(Counter(word).items())
def download_word_file():
    url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-no-swears.txt'
    urllib.request.urlretrieve(url, 'words.txt')
def read_word_file():
    with open('words.txt') as f:
        words = f.read().splitlines()
    return words
if __name__ == "__main__":
    # downloads a file to your working directory
    download_word_file()
    # reads file into memory
    words = read_word_file()
    d = {}
    for word in words:
        k = word_hash(word)
        if k in d:
            d[k].append(word)
        else:
            d[k] = [word]
    # Prints the filtered results to only words with anagrams
    print([x for x in d.values() if len(x) > 1])
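The grouping idea can also be tried without downloading anything. The sketch below is a variation, not the answer's own code: it uses the sorted letters as the dictionary key instead of a frozenset of letter counts, and the sample word list is made up for illustration.

```python
from collections import defaultdict


def group_anagrams(words):
    """Group words that share the same letters, keeping only groups of two or more."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # every anagram of a word produces the same key
        groups[key].append(word)
    return [group for group in groups.values() if len(group) > 1]


sample = ["listen", "silent", "apple", "enlist", "pale", "leap"]
print(group_anagrams(sample))
```

Sorting each word costs slightly more than counting its letters, but the key is easier to read when debugging.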

Related

Python simple check

I am just learning Python and I want to check if a list contains a word. When I check for the word it always returns 0, even though the function can find it and print it. The if/else statement always returns 0 when I use two returns as below. Can you help me?
def line_number(text, word):
    """
    Returns the line number (beginning with 1) where the word appears for the first time
    :param text: Text in which the word is searched for
    :param word: A word to search for
    :return: List of Line Numbers
    """
    lines = text.splitlines()
    i = 0
    for line in lines:
        i = i +1
        if word in line:
            return i
        else:
            return 0
            # print("nope")
words = ['erwachet', 'Brust', 'Wie', 'Ozean', 'Knabe']
for word in words:
    num = line_number(wilhelm_tell, word)
    if num > 0:
        print(f"Das Word {word} findet sich auf Zeile {num}.")
    else:
        print(f"Das Wort {word} wurde nicht gefunden!")
You should return 0 after the for loop ends and not inside the loop.
def line_number(text, word):
    """
    Returns the line number (beginning with 1) where the word appears for the first time
    :param text: Text in which the word is searched for
    :param word: A word to search for
    :return: List of Line Numbers
    """
    lines = text.splitlines()
    i = 0
    for line in lines:
        i = i +1
        if word in line:
            return i
    return 0
The problem is the else branch inside the loop: it returns 0, and therefore exits the function, on the first iteration whenever the first line does not contain the word.
I think the problem is that you're returning 0 on the first line the word is not in. So instead, you should return 0 after you looked through every line:
def line_number(text, word):
    """
    Returns the line number (beginning with 1) where the word appears for the first time
    :param text: Text in which the word is searched for
    :param word: A word to search for
    :return: List of Line Numbers
    """
    lines = text.splitlines()
    i = 0
    for line in lines:
        i = i +1
        if word in line:
            return i
    return 0
I think the problem might be that you are returning 0 if the word is not within a line. When you use the return keyword, it exits the function and returns the value. So if a word is not within the first line of the text, the function returns 0 (even though the word is present on later lines).
Here is a little example of what I think is happening:
def is_element_within_list(element_to_search, some_list):
    for element in some_list:
        if element == element_to_search:
            return True
        else:
            return False
some_list = [1, 2, 3]
element_to_search = 2
print(is_element_within_list(element_to_search, some_list))
# output: False
We have a function that checks if an element is within a list (we could use the keyword "in", but this is for the sake of the example). So, even though 2 is within some_list, the function outputs False, because it returns False from the else branch as soon as the first element does not match.
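For the original line_number function, the same fix can be written a little more compactly with enumerate. This is only a sketch: the sample text below is made up, standing in for the asker's wilhelm_tell string.

```python
def line_number(text, word):
    """Return the 1-based number of the first line containing word, or 0 if none does."""
    for number, line in enumerate(text.splitlines(), start=1):
        if word in line:
            return number
    # Only reached when no line matched
    return 0


sample_text = "first line\nsecond line with Ozean\nthird line"
print(line_number(sample_text, "Ozean"))
print(line_number(sample_text, "Knabe"))
```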

Binary Search using a for loop, searching for words in a list and comparing

I'm trying to compare the words in "alice_list" to "dictionary_list", and if a word isn't found in "dictionary_list", print it and say it is probably misspelled. I'm having issues where it's not printing anything when a word isn't found; maybe you guys could help me out. I convert the words in "alice_list" to uppercase, as "dictionary_list" is all in capitals. Any help with why it's not working would be appreciated, as I'm about to pull my hair out over it!
import re
# This function takes in a line of text and returns
# a list of words in the line.
def split_line(line):
    return re.findall('[A-Za-z]+(?:\'[A-Za-z]+)?', line)
# --- Read in a file from disk and put it in an array.
dictionary_list = []
alice_list = []
misspelled_words = []
for line in open("dictionary.txt"):
    line = line.strip()
    dictionary_list.extend(split_line(line))
for line in open("AliceInWonderLand200.txt"):
    line = line.strip()
    alice_list.extend(split_line(line.upper()))
def searching(word, wordList):
    first = 0
    last = len(wordList) - 1
    found = False
    while first <= last and not found:
        middle = (first + last)//2
        if wordList[middle] == word:
            found = True
        else:
            if word < wordList[middle]:
                last = middle - 1
            else:
                first = middle + 1
    return found
for word in alice_list:
    searching(word, dictionary_list)
--------- EDITED CODE THAT WORKED ----------
Updated a few things in case anyone has the same issue, and used "if word not in" to double check what was being output by the search.
"""-----Binary Search-----"""
# search for each word; if it is not found in the dictionary, print it
words = alice_list
for word in alice_list:
    first = 0
    last = len(dictionary_list) - 1
    found = False
    while first <= last and not found:
        middle = (first + last) // 2
        if dictionary_list[middle] == word:
            found = True
        else:
            if word < dictionary_list[middle]:
                last = middle - 1
            else:
                first = middle + 1
    # test the search result itself; comparing against dictionary_list[last]
    # misses words that sort before every dictionary entry (last becomes -1)
    if not found:
        print("NEW:", word)
# checking to make sure words match
for word in alice_list:
    if word not in dictionary_list:
        print(word)
Your function split_line() returns a list. You then take the output of the function and append it to the dictionary list, which means each entry in the dictionary is a list of words rather than a single word. The quick fix is to use extend instead of append.
dictionary_list.extend(split_line(line))
A set might be a better choice than a list here, then you wouldn't need the binary search.
--EDIT--
To print words not in the list, just filter the list based on whether your function returns False. Something like:
notfound = [word for word in alice_list if not searching(word, dictionary_list)]
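The set suggestion from this answer could look roughly like the sketch below. The function name and the sample words are made up for illustration; in the asker's script the two arguments would be alice_list and dictionary_list.

```python
def find_unknown_words(words, dictionary_words):
    """Return the words that do not appear in dictionary_words, in their original order."""
    known = set(dictionary_words)  # hashing gives average constant-time lookups
    return [word for word in words if word not in known]


dictionary_sample = ["ALICE", "RABBIT", "WAS"]
text_sample = ["ALICE", "WAS", "SLEEPPY"]
print(find_unknown_words(text_sample, dictionary_sample))
```

Unlike binary search, this does not require the dictionary to be sorted.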
Are you required to use binary search for this program? Python has a handy operator called "in". Given an element as the first operand and a list/set/dictionary/tuple as the second, it returns True if that element is in the structure, and False if it is not.
Examples:
1 in [1, 2, 3, 4] -> True
"APPLE" in ["HELLO", "WORLD"] -> False
So, for your case, most of the script can be simplified to:
for word in alice_list:
    if word not in dictionary_list:
        print(word)
This will print each word that is not in the dictionary list.
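One caveat: on a list, "in" checks the elements one by one. If binary search is a requirement, the standard library's bisect module can do the halving. The sketch below assumes the list is already sorted; the helper name and sample data are made up.

```python
from bisect import bisect_left


def contains_sorted(sorted_words, word):
    """Binary search: return True if word is in the already sorted list."""
    # bisect_left returns the leftmost position where word could be inserted
    index = bisect_left(sorted_words, word)
    return index < len(sorted_words) and sorted_words[index] == word


sorted_sample = ["APPLE", "HELLO", "WORLD"]
print(contains_sorted(sorted_sample, "HELLO"))
print(contains_sorted(sorted_sample, "ZEBRA"))
```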

Python - importing 127,000+ words to a list, but function only returning partial results

This function is meant to compare all 127,000+ words imported from a dictionary file to a user-inputted length. It should then return the number of words that are equal to that length. It does do this to an extent.
If I enter "15" it returns "0".
If I enter "4" it returns "3078".
I am positive that there are words that are 15 characters in length, but it returns "0" anyway.
I should also mention that if I enter anything greater than 15 the result is still 0, even though there are words longer than 15.
try:
    dictionary = open("dictionary.txt")
except:
    print("Dictionary not found")
    exit()
def reduceDict():
    first_list = []
    for line in dictionary:
        line = line.rstrip()
        if len(line) == word_length:
            for letter in line:
                if len([ln for ln in line if line.count(ln) > 1]) == 0:
                    if first_list.count(line) < 1:
                        first_list.append(line)
                else:
                    continue
    if showTotal == 'y':
        print('|| The possible words remaing are: ||\n ',len(first_list))
My reading of:
if len([ln for ln in line if line.count(ln) > 1]) == 0:
is that the words in question can't have any repeated letters, which could explain why no words are being found: once you get up to 15 letters, repeated letters are quite common. Since this requirement wasn't mentioned in the explanation, if we drop it we can write:
def reduceDict(word_length, showTotal):
    first_list = []
    for line in dictionary:
        line = line.rstrip()
        if len(line) == word_length:
            if line not in first_list:
                first_list.append(line)
    if showTotal:
        print('The number of words of length {} is {}'.format(word_length, len(first_list)))
        print(first_list)
try:
    dictionary = open("dictionary.txt")
except FileNotFoundError:
    exit("Dictionary not found")
reduceDict(15, True)
Which turns up about 40 words from my Unix words file. If we want to put back the unique letters requirement:
import re
def reduceDict(word_length, showTotal):
    first_list = []
    for line in dictionary:
        line = line.rstrip()
        if len(line) == word_length and not re.search(r"(.).*\1", line):
            if line not in first_list:
                first_list.append(line)
    if showTotal:
        print('The number of words of length {} is {}'.format(word_length, len(first_list)))
        print(first_list)
Which starts returning 0 results around 13 letters as one might expect.
In your code, you don't need this line -
for letter in line:
In your list comprehension, if your intention is to loop over all the words in the line, use this -
if len([ln for ln in line.split() if line.count(ln) > 1]) == 0:
In your code, the list comprehension loops over every character and checks whether that character appears more than once in line. That way, if your file contains chemotherapeutic, it will not be added to first_list, because some of its letters appear multiple times. So, unless your file contains words with more than 14 letters in which every letter appears only once, your code will fail to find them.
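To make the effect of the unique-letters condition visible, here is a small sketch that counts words of a given length with and without it. The helper names and the sample words are assumptions for illustration; the set comparison is a substitute for the nested count() calls, not the asker's code.

```python
def has_unique_letters(word):
    """Return True if no letter occurs more than once in word."""
    return len(set(word)) == len(word)


def count_words(words, length, unique_only=False):
    """Count the distinct words of the given length, optionally requiring unique letters."""
    matches = {
        word for word in words
        if len(word) == length and (not unique_only or has_unique_letters(word))
    }
    return len(matches)


sample = ["dermatoglyphics", "uncopyrightable", "congratulations", "dermatoglyphics"]
print(count_words(sample, 15))
print(count_words(sample, 15, unique_only=True))
```

Collecting the matches in a set also removes duplicate entries, which replaces the first_list.count(line) check.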

Sorting words in a text file (with parameters) and writing them into a new file with Python

I have a file.txt with thousands of words, and I need to create a new file based on certain parameters, and then sort them a certain way.
Assuming the user imports the proper libraries when they test, what is wrong with my code? (There are 3 separate functions)
For the first, I must create a file with words containing certain letters, and sort them lexicographically, then put them into a new file list.txt.
def getSortedContain(s,ifile,ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
The second is similar, but must be sorted reverse lexicographically, if the string inputted is NOT in the word.
def getReverseSortedNotContain(s,ifile,ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s not in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
For the third, I must select words of a given length (an integer n) and sort them lexicographically by the last character in each word.
def getRhymeSortedCount(n, ifile, ofile):
    toWrite = ""
    for line in ifile:
        word = line[:-1] #gets rid of \n
        if len(word) == n:
            toWrite += word + "\n"
    reversetoWrite = toWrite[::-1]
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    reversetoWrite = toWrites[::-1]
    ofile.write(reversetoWrites[:-1])
Could someone please point me in the right direction for these? Right now they are not sorting as they're supposed to.
There is a lot of stuff that is unclear here, so I'll try my best to clean this up.
You're concatenating strings together into one big string, then appending that one big string to a list. You then try to sort your 1-element list, which does nothing. Instead, put all the strings into a list and then sort that list.
I.e., for your first example do the following:
def getSortedContain(s,ifile,ofile):
    # strip the trailing newline from each line, keeping only words that contain s
    words = [word.strip() for word in ifile if s in word]
    words.sort()
    ofile.write("\n".join(words))
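The other two functions can follow the same pattern by passing reverse= or key= to sorted. The sketch below is one reading of the requirements, not a definitive solution: the helper name and sample words are made up, io.StringIO stands in for the real output file, and the "rhyme" order is assumed to mean comparing the words read backwards.

```python
import io


def write_sorted(words, ofile, key=None, reverse=False):
    """Sort the words and write them to ofile, one per line."""
    ofile.write("\n".join(sorted(words, key=key, reverse=reverse)))


words = ["banana", "cherry", "apple", "fig"]

# Reverse lexicographic order of the words that do NOT contain "an"
not_containing = io.StringIO()
write_sorted([w for w in words if "an" not in w], not_containing, reverse=True)
print(not_containing.getvalue().splitlines())

# "Rhyme" order: compare the words read backwards, so the endings decide the order
rhyme = io.StringIO()
write_sorted(words, rhyme, key=lambda w: w[::-1])
print(rhyme.getvalue().splitlines())
```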

Anagram not matching up (string to list), Python

I'm trying to make a script where I can input an anagram of any word and it will read from a dictionary to see if there's a match.
(ex. estt returns: = unjumble words: test)
If there are two matches it will write
(ex. estt returns: there are multiple matches: test, sett (assuming sett is a word lol))
I couldn't even get one match going; it keeps returning "no match" even though I can see the words if I look at my list made from the dictionary.
Here's the code I wrote so far
def anagrams(s):
    if s =="":
        return [s]
    else:
        ans = []
        for w in anagrams(s[1:]):
            for pos in range(len(w)+1):
                ans.append(w[:pos]+s[0]+w[pos:])
            return ans
dic_list = []
def dictionary(filename):
    openfile = open(filename,"r")
    read_file = openfile.read()
    lowercase = read_file.lower()
    split_words = lowercase.split()
    for words in split_words:
        dic_list.append(words)
def main():
    dictionary("words.txt")
    anagramsinput = anagrams(input("unjumble words here: "))
    for anagram in anagramsinput:
        if anagram in dic_list:
            print(anagram)
        else:
            print("no match")
            break
It's as if anagram isn't in dic_list. what's happening?
You are breaking after a single check in your loop; remove the break to get all anagrams:
def main():
    dictionary("words.txt")
    anagramsinput = anagrams(input("unjumble words here: "))
    for anagram in anagramsinput:
        if anagram in dic_list: # don't break, loop over every possibility
            print(anagram)
If you don't want to print "no match", just remove that branch. Also, if you want all possible permutations of the letters, use itertools.permutations:
from itertools import permutations
def anagrams(s):
    return ("".join(p) for p in permutations(s))
Output:
unjumble words here: onaacir
aaronic
In your anagrams function you are returning before you finish the outer loop, therefore missing many permutations:
def anagrams(s):
    if s =="":
        return [s]
    else:
        ans = []
        for w in anagrams(s[1:]):
            for pos in range(len(w)+1):
                ans.append(w[:pos]+s[0]+w[pos:])
        return ans # only return when both loops are done
Now, after both changes, your code will work.
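As a self-contained sketch of the permutations approach, the function below combines it with set lookups. The function name and the sample dictionary are made up for illustration; in the asker's script the second argument would be dic_list.

```python
from itertools import permutations


def unjumble(letters, dictionary_words):
    """Return the dictionary words that can be formed from all of the letters."""
    known = set(dictionary_words)
    # A set removes the duplicate strings that repeated letters produce
    candidates = {"".join(p) for p in permutations(letters)}
    return sorted(candidates & known)


print(unjumble("estt", ["test", "sett", "tent", "set"]))
```

Keep in mind that a string of n letters has n! orderings, so this is only practical for short inputs.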
