For my homework, I need to isolate the 50 most frequent words in a text. I have tried a whole lot of things, and in my most recent attempt I have built a concordance using this:
concordance = {}
lineno = 0
for line in vocab:
    lineno = lineno + 1
    words = re.findall(r'[A-Za-z][A-Za-z\'\-]*', line)
    for word in words:
        word = word.title()
        if word in concordance:
            concordance[word].append(lineno)
        else:
            concordance[word] = [lineno]

listing = []
for key in sorted(concordance.keys()):
    listing.append([key, concordance[key]])
What I would like to know is whether I can sort the resulting concordance from the most frequently used word to the least frequently used word, and then isolate and print the top 50. I am not permitted to import any modules other than re and sys, and I'm struggling to come up with a solution.
sorted is a builtin which does not require import. Try something like:
sorted(concordance.items(), key=lambda item: len(item[1]), reverse=True)[:50]
Not tested, but you get the idea.
sorted returns a list, so you can slice the result directly; the key function sorts by how many line numbers each word has collected, and reverse=True puts the most frequent words first.
There are probably slightly more efficient ways to take the first 50, but I doubt it matters here.
A few hints:
Use enumerate(list) in your for loop to get the line number and the line at once.
Try using \w for word characters in your regular expression instead of listing [A-Za-z...].
Read about the dict.items() method. It will return a list of (key, value) pairs.
Manipulate that list with list.sort(key=...), where the key function takes one item and returns the value to sort by.
You can define that function with a lambda, but it is not necessary.
Use the len(list) function to get the length of the list. You can use it to get the number of matches of a word (which are stored in a list).
UPDATE: Oh yeah, and use slices to get a part of the resulting list. list[:50] to get the first 50 items (equivalent to list[0:50]), and list[5:10] to get the items from index 5 inclusive to index 10 exclusive.
To print them, loop through the resulting list, then print every word. Alternatively, you can use something similar to print '[separator]'.join(list) to print a string with all the items separated by '[separator]'.
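Putting those hints together, a rough, untested sketch (the filename is made up; swap in whatever iterable of lines you already have for vocab):

import re

vocab = open('input.txt')   # or whatever iterable of lines you already have

concordance = {}
for lineno, line in enumerate(vocab, start=1):      # enumerate gives the line number for free
    for word in re.findall(r"\w[\w'-]*", line):     # \w instead of spelling out [A-Za-z...]
        word = word.title()
        concordance.setdefault(word, []).append(lineno)

# dict.items() gives (word, line_numbers) pairs; sort them by how many
# occurrences each word collected, largest first.
pairs = list(concordance.items())
pairs.sort(key=lambda item: len(item[1]), reverse=True)

for word, lines in pairs[:50]:                      # slice off the first 50
    print(word, len(lines))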
Good luck.
I think I was close to figuring out how to print all the possible words based on user input from my set dictionary. It's based on the assumption that the user input is 'ART', so the possible words I have in my dictionary are ART, RAT, TART, and TAR, but only the three-letter combinations are printing out. Can anyone tell me where I am going wrong? Thanks!
Dictionary = ["tar","art","tart","rat"] #creates dictionary of set words
StoredLetters = input('input your word here: ') #allows the user to input any word
list(StoredLetters)

def characters(word):
    Dictionary = {}
    for i in word:
        Dictionary[i] = Dictionary.get(i, 0) + 1
    return Dictionary

def all_words(StoredLetters, wordSet):
    for word in StoredLetters:
        flag = 1
        words = characters(word)
        for key in words:
            if key not in wordSet:
                flag = 0
            else:
                if wordSet.count(key) != words[key]:
                    flag = 0
        if flag == 1:
            print(word)

if __name__ == "__main__":
    print(all_words(Dictionary, StoredLetters))
It appears there are a few things that could contribute to this.
You are swapping the parameters: the function is defined as def all_words(StoredLetters, wordSet): but in main you call it as all_words(Dictionary, StoredLetters). Without specifying the names (look up keyword arguments in Python), arguments are bound by position, so the two values end up swapped inside the function.
In the characters function you reuse the name Dictionary for a brand-new local dict. This doesn't actually empty the word list you set at the top (the local name just shadows it), but it makes the code very confusing; try using unique names when creating new variables.
First off, you are confusing things by having the name of your variable StoredLetters also being the name of one of the arguments to your all_words function.
Second, you are actually passing in StoredLetters, which is art, as the 2nd argument to the function, so it is wordSet in the function, not StoredLetters!
You should really keep things clearer by using different variable names and by making it obvious which value you are passing for which argument. words isn't really a list of words; it's a dictionary with letters as keys and how many times they appear as the values. Making code clear and understandable goes a long way toward making it easy to debug. You have word, StoredLetters, wordSet, another StoredLetters argument, and words = characters(word), which doesn't do what its name suggests. This could all use a good cleanup.
As for the functionality, with art, each letter only appears once, so for tart, which has t twice, if wordSet.count(key) != words[key] will evaluate as True, and flag will be set to 0, and the word will not be printed.
Hope that helps, and happy coding!
Based on the follow-up comments, the rule is that we must use all characters in the target word, but we can use each character as many times as we want.
I'd set up the lookup "dictionary" data structure as a Python dict which maps sorted, unique characters as tuples in each dictionary word to a list of the actual words that can be formed from those characters.
Next, I'd handle the lookups as follows:
Sort the unique characters of the user input (the target word) and index into the dictionary to get the list of words it could make. Using a set of characters means repeated letters collapse to a single entry (so repetition is allowed), and sorting the characters normalizes all of the possible permutations of those letters.
The above alone can give false positives, so we filter the resulting word list to remove any actual result words that are shorter than the target word. This ensures that we handle a target word like "artt" correctly and prevent it from matching "art".
Code:
from collections import defaultdict

class Dictionary:
    def __init__(self, words):
        self.dictionary = defaultdict(list)
        for word in words:
            self.dictionary[tuple(sorted(set(word)))].append(word)

    def search(self, target):
        candidates = self.dictionary[tuple(sorted(set(target)))]
        return [x for x in candidates if len(x) >= len(target)]

if __name__ == "__main__":
    dictionary = Dictionary(["tar", "art", "tart", "rat"])
    tests = ["art", "artt", "ar", "arttt", "aret"]
    for test in tests:
        print(f"{test}\t=> {dictionary.search(test)}")
Output:
art => ['tar', 'art', 'tart', 'rat']
artt => ['tart']
ar => []
arttt => []
aret => []
The issues in the original code have been addressed nicely in the other answers. The logic doesn't seem clear since it's comparing characters to words and variable names often don't match the logic represented by the code.
It's fine to use a frequency counter, but then you're stuck iterating over the whole dictionary, checking that each character count in a dictionary word is at least the corresponding count in the target word. I doubt the code I'm offering is optimal, but it should be much faster than that linear counter approach, I think.
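For comparison, a minimal sketch of the counter-based check (the helper name can_form is made up; it reproduces the same results as the indexed version above, just by scanning every word):

from collections import Counter

def can_form(candidate, target):
    # Accept `candidate` if it uses no letters outside `target` and
    # contains every letter of `target` at least as many times.
    c, t = Counter(candidate), Counter(target)
    return set(c) <= set(t) and all(c[ch] >= n for ch, n in t.items())

words = ["tar", "art", "tart", "rat"]
print([w for w in words if can_form(w, "art")])   # ['tar', 'art', 'tart', 'rat']
print([w for w in words if can_form(w, "artt")])  # ['tart']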
I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:
from difflib import SequenceMatcher

headwordList = []  # This is a list of 650,000 words

sentenceList = []
openFile = open("sentences.txt", "r")
for line in openFile:
    sentenceList.append(line)

percentage = 0
count = 0

for y in sentenceList:
    if y not in headwordList:
        for x in headwordList:
            m = SequenceMatcher(None, y.lower(), x)
            if m.ratio() > percentage:
                percentage = m.ratio()
                word = x
        if percentage > 0.86:
            sentenceList[count] = word
    count = count + 1
Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file the most efficiently.
2) Change your code from
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
to
yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, x)
You're converting each sentence to lower 650,000 times. No need for that.
You should change headwordList into a set.
The test y not in headwordList will be very slow. It must do a string comparison against each word in headwordList, one word at a time. It takes time proportional to the length of the list; if you double the length of the list, you double the amount of time the test takes (on average).
With a set, it always takes the same amount of time to do the in test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.
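In other words, one change up front and the membership test becomes cheap (the rest of the loop can stay the same):

headwords = set(headwordList)   # build the set once

for y in sentenceList:
    if y not in headwords:      # constant-time membership test
        ...                     # fuzzy matching only for words with no exact match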
Now, this whole loop can be simplified:
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
    if m.ratio() > percentage:
        percentage = m.ratio()
        word = x
if percentage > 0.86:
    sentenceList[count] = word
All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords as I want you to make it be a set and not a list.
def check_ratio(item):
    matcher, word = item  # max() passes each (matcher, word) tuple to the key function
    return matcher.ratio()

y = y.lower()  # do the .lower() call one time
m, word = max((SequenceMatcher(None, y, word), word) for word in headwords, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    sentence_list.append(word)
This might seem a bit tricky but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that extracts the value we care about from each tuple, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the variables m and word.
In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/
It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method function to append values.
Also, you might do better to build a dictionary of words and their ratios.
Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.
EDIT: This is to answer a question about generator expressions needing to be parenthesized.
Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:
def check_ratio(item):
    matcher, word = item  # each item from the genexp is a (matcher, word) tuple
    return matcher.ratio()

y = y.lower()  # do the .lower() call one time
genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word = max(genexp, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    sentence_list.append(word)
That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.
But I think it's easier to read if you split out the genexp on its own line.
EDIT: @Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.
Happily, the code got simpler! Always a good sign.
EDIT: I just tested the code. This code works for me; see if it works for you.
from difflib import SequenceMatcher
headwords = [
    # This is a list of 650,000 words
    # Dummy list:
    "happy",
    "new",
    "year",
]

def words_from_file(filename):
    with open(filename, "rt") as f:
        for line in f:
            for word in line.split():
                yield word

def _match(matcher, s):
    matcher.set_seq2(s)
    return (matcher.ratio(), s)

ratios = {}
best_ratio = 0

matcher = SequenceMatcher()

for word in words_from_file("sentences.txt"):
    matcher.set_seq1(word.lower())
    if word not in headwords:
        ratio, word = max(_match(matcher, word.lower()) for word in headwords)
        best_ratio = max(best_ratio, ratio)  # remember best ratio
        if ratio > 0.86:
            ratios[word] = ratio

print(best_ratio)
print(ratios)
1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.
2) You have sentenceList defined as a list and then assign replacement words back into it by index with sentenceList[count] = word. I would define a separate structure specifically for the corrected output.
3) You construct sentenceList which doesn't need to be done.
for line in file:
    if line not in headwordList...
4) You never tokenize line, which means you store the entire line (up to the newline character) in sentenceList and then check whether that whole line is in the word list (see the small sketch below).
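Tokenizing is just one extra loop over the words in each line, something like:

for line in open("sentences.txt"):
    for word in line.split():                  # compare individual words, not whole lines
        if word.lower() not in headwordList:
            ...                                # do the fuzzy matching here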
This is a data structures question. What you want to do is turn your list into something with faster element lookup; for example, a binary search tree would work great here: lookup is only O(log n) as opposed to O(n) for a list, which is incredibly fast by comparison.
There's a fairly simple explanation here:
http://interactivepython.org/runestone/static/pythonds/Trees/balanced.html
But if you are not familiar with tree concepts, you might want to start few chapters earlier:
http://interactivepython.org/runestone/static/pythonds/Trees/trees.html
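Python's standard library has no ready-made balanced tree, but the bisect module gives you the same O(log n) lookup over a sorted list; a rough sketch of the idea (the word list here is just a dummy):

import bisect

def contains_sorted(sorted_words, word):
    # binary search: O(log n) comparisons instead of a linear O(n) scan
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

headwords = sorted(["happy", "new", "year"])   # sort the 650,000 words once
print(contains_sorted(headwords, "new"))       # True
print(contains_sorted(headwords, "nwe"))       # False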
I'm super duper new to Python. I'm kinda stuck on one of my class exercises.
The question goes something like this: You have a file that contains characters i.e. words. (I'm still at the stage where all the terms get mixed up, I apologize if that is not the correct term)
Example of the file.txt content: accbd
The question asks me to import the file into the Python editor and make sure that no letter occurs more often than any letter that comes later in the alphabet, e.g. a cannot occur more frequently than b, b cannot occur more frequently than c, and so on. In the example file, c occurs more frequently than d, so I need to raise an error message.
Here's my pathetic attempt:
def main():
    f = open('.txt','r')       # 1st import the file and open it
    data = f.read()            # 2nd read the file
    words = list(data)         # 3rd create a list that contains every letter
    newwords = sorted(words)   # sort according to alphabetical order
I'm stuck at the last part which is to count that the former word doesn't occur more than the later word, and so on. I tried two ways but neither is working. Here's trial 1:
from collections import counter

for i in newwords:
    try:
        if counter(i) <= counter(i+1):
            print 'ok'
        else:
            print 'not ok between indexes %d and %d' % (i, i+1)
    except:
        pass
The 2nd trial is similar
for i in newwords:
    try:
        if newwords.count(i) <= newwords.count(i+1):
            print 'ok'
        else:
            print 'ok between indexes %d and %d' % (i, i+1)
    except:
        pass
What is the correct way to compare the count for each word in sequential order?
I had posted an answer, but I see it's for an assignment, so I'll try to explain instead of just splatting a solution here.
My suggestion would be to solve it in three steps:
1) First, create a list of the sorted characters that appear in the string:
from the data string you can use set(data) to pick every unique character
if you pass that set to sorted() you get back a list of the characters, sorted alphabetically.
2) then use this list in a for loop (or list comprehension) to create a second list, of their number of occurrences in data, using data.count(<letter in the list>); note that the elements in this second list are technically sorted based on the alphabetical order of the letters in the first list you made (because of the for loop).
3) compare this second list of values with a sorted version of itself (now sorted by values), and see if they match or not. If they don't match, it's because some of the initial letters appears too many times compared to the next ones.
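Putting the three steps together, a minimal sketch with the 'accbd' example from the question:

data = 'accbd'

letters = sorted(set(data))                   # step 1: unique letters, in alphabetical order
counts = [data.count(ch) for ch in letters]   # step 2: occurrences, in that same order
if counts == sorted(counts):                  # step 3: the counts must never decrease
    print('ok')
else:
    print('not ok: some letter occurs more often than a later one')

For 'accbd' this gives letters ['a', 'b', 'c', 'd'] and counts [1, 1, 2, 1], so it prints the error branch.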
To be a little more clear:
In [2]: string = 'accbd'
In [3]: import collections
In [4]: collections.Counter(string)
Out[4]: Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})
Then it's just a for loop with enumerate(list_).
A question from homework:
Define a function censor(words,nasty) that takes a list of words, and
replaces all the words appearing in nasty with the word CENSORED, and
returns the censored list of words.
>>> censor(['it','is','raining'], ['raining'])
['it','is','CENSORED']
I see a solution like this:
find the index of each word from nasty
replace the words at those indexes with "CENSORED"
but I get stuck on finding the index.
You can find the index of any element of a list by using the .index method.
>>> l=['a','b','c']
>>> l.index('b')
1
Actually you don't have to operate with indexes here. Just iterate over the words list and check if the word is listed in nasty. If it is, append 'CENSORED' to the result list; otherwise append the word itself.
Or you can use a list comprehension with a conditional expression to get a more elegant version:
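Something along these lines (just a sketch of that comprehension):

def censor(words, nasty):
    # conditional expression inside a list comprehension
    return ['CENSORED' if word in nasty else word for word in words]

print(censor(['it', 'is', 'raining'], ['raining']))   # ['it', 'is', 'CENSORED']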
Your approach might work, but it’s unnecessarily complicated.
Python allows a very simple syntax to check whether something is contained in a list:
censor = [ 'bugger', 'nickle' ]
word = 'bugger'
if word in censor: print 'CENSORED'
With that approach, simply walk over your list of words and test for each word whether it's in the censor list.
To walk over your list of words, you can use the for loop. Since you might need to modify the current word, use an index, like so:
for index in range(len(words)):
    print index, words[index]
Now all you need to do is put the two code fragments together.
You could use the handy built-in enumerate() function to step through the items in the list. For example:
def censor(words, nasty):
    for i, word in enumerate(words):
        if word...
I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement this. Also, if anyone can find existing code for this that would be great as well. Basically the functionality is going to be like a boggle solver but without being a matrix, just searching for all word possibilities from a string of characters. I do already have adequate dictionaries.
I was planning to do this in either python or ruby.
Thanks in advance for your help guys!
I'd use a Trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)
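If you just want to see the shape of the idea, here is a minimal nested-dict trie sketch (a toy version, not the linked implementation):

END = object()   # marker key: "a word ends at this node"

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = word

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ["tar", "tart", "rat"]:
    trie_insert(trie, w)

print(trie_contains(trie, "tart"), trie_contains(trie, "ta"))   # True False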
I may be missing an understanding of the game, but barring some complications in the rules, such as the introduction of "joker" (wildcard) letters, missing or additional letters, multiple words, etc., I think the following ideas would help turn the problem into something relatively uninteresting. :-(
Main idea: index words by the ordered sequence of their letters.
For example, "computer" gets keyed as "cemoprtu". Whatever the random drawing provides is sorted the same way and used as a key to find possible matches (a short sketch follows below).
Using trie structures, as suggested by perimosocordiae, as the underlying storage for these sorted keys and associated word(s)/wordIds in the "leaf" nodes, word lookup can be done in O(n) time, where n is the number of letters (or better, on average, thanks to non-existing words).
To further help with indexing we can have several tables/dictionaries, one per number of letters. Also depending on statistics the vowels and consonants could be handled separately. Another trick would be to have a custom sort order, placing the most selective letters first.
Additional twists to the game (such as finding words made from a subset of the letters) are mostly a matter of iterating over the power set of these letters and checking the dictionary for each combination.
A few heuristics can be introduced to help prune some of the combinations (for example, combinations without vowels [and of a given length] are not possible solutions). One should manage these heuristics carefully, for the lookup cost is relatively small.
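A short sketch of the sorted-letters index (the word list and the drawn letters here are made up):

from collections import defaultdict

index = defaultdict(list)
for word in ["computer", "commuter", "art", "rat", "tar"]:
    index["".join(sorted(word))].append(word)   # "computer" is keyed as "cemoprtu"

draw = "tra"                                    # whatever the random drawing provides
print(index["".join(sorted(draw))])             # ['art', 'rat', 'tar']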
For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you can get O(1) word lookup. A Bag[Char] is an identifier for a word that is unique up to character order. It is basically a hash map from Char to Int, where the Char is a given character in the word and the Int is the number of times that character appears in the word.
Example:
{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]
To find words, take every combination of characters from the input string and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the length of the dictionary.
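In Python, a hashable Bag[Char] can be faked with a frozenset of (character, count) pairs; a small sketch of the index (the word list is made up):

from collections import Counter, defaultdict

def bag(word):
    # hashable "Bag[Char]": character -> count, frozen so it can be used as a dict key
    return frozenset(Counter(word).items())

index = defaultdict(list)
for word in ["anagram", "stressed", "desserts", "art", "rat", "tar"]:
    index[bag(word)].append(word)

print(index[bag("desserts")])   # ['stressed', 'desserts']
print(index[bag("tra")])        # ['art', 'rat', 'tar']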
This sounds like Rabin-Karp string search would be a good choice. If you use a rolling hash-function then at each position you need one hash value update and one dictionary lookup. You also need to create a good way to cope with different word lengths, like truncating all words to the shortest word in the set and rechecking possible matches. Splitting the word set into separate length ranges will reduce the amount of false positives at the expense of increasing the hashing work.
There are two ways to do this. One is to check every candidate permutation of letters in the word to see if the candidate is in your dictionary of words. That's an O(N!) operation, depending on the length of the word.
The other way is to check every candidate word in your dictionary to see if it's contained within the word. This can be sped up by aggregating the dictionary; instead of every candidate word, you check all words that are anagrams of each other at once, since if any one of them is contained in your word, all of them are.
So start by building a dictionary whose key is a sorted string of letters and whose value is a list of the words that are anagrams of the key:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
        for line in f.readlines():
            if line[0].isupper(): continue
            word = line.strip()
            key = "".join(sorted(word.lower()))
            d[key].append(word)
Now we need a function to see if a word contains a candidate. This function assumes that the word and candidate are both sorted, so that it can go through them both letter by letter and give up quickly when it finds that they don't match.
>>> def contains(sorted_word, sorted_candidate):
        wchars = (c for c in sorted_word)
        for cc in sorted_candidate:
            while True:
                try:
                    wc = wchars.next()
                except StopIteration:
                    return False
                if wc < cc: continue
                if wc == cc: break
                return False
        return True
Now find all the candidate keys in the dictionary that are contained by the word, and aggregate all of their values into a single list:
>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
        if contains(w, k): result.extend(d[k])
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']
That last step takes about a quarter second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).