Python: unique words and their frequency in descending order

I'm doing a pretty simple homework problem for a Python class involving all sorts of statistics on characters, words and their relative frequencies etc. At the moment I'm trying to analyse a string of text and get a list of every unique word in the text followed by the number of times it is used. I have very limited knowledge of Python (or any language for that matter) as this is an introductory course and so have only come up with the following code:
for k in (""",.’?/!":;«»"""):
text=text.replace(k,"")
text=text.split()
list1=[(text.count(text[n]),text[n]) for n in range(0,len(text))]
for item in sorted(list1, reverse=True):
print("%s : %s" % (item[1], item[0]))
This unfortunately prints out each individual word of the text (in order of appearance), followed by its frequency n, n times. Obviously this is extremely useless, and I'm wondering if I can add in a nifty little bit of code to what I've already written to make each word appear in this list only once, and then eventually in descending order. All the other questions like this I've seen use a lot of code we haven't learned, so I think the answer should be relatively simple.

Take a look at collections.Counter. You can use it to count your word frequencies, and it'll help you print out the list in sorted order, with the most_common method.
(No example code as this is a homework question, you'll have to do some work yourself).

Related

Finding the shortest unique substring

I have a name and a list of names. I can guarantee that the selected name is contained by the list of other names.
I'd like to generate the shortest substring of the selected name that is contained only by that name, and not by any of the other names in the data.
>>> names = ['smith','jones','williams','brown','wilson','taylor','johnson','white','martin','anderson']
>>> find_substring('smith', names)
"sm"
>>> find_substring('williams', names)
"ll"
>>> find_substring('taylor', names)
"y"
I can probably brute-force this fairly easily, by taking the first letter of the selected name and seeing if it matches any of the names, then iterating through the rest of the letters followed by pairs of letters, etc.
My problem is that my list contains more than ten thousand names and they're fairly long - more similar to book titles. Brute force would take forever.
Is there some simple way to efficiently achieve this?
I believe your best bet would be brute force; however, keep a dictionary of the letter combinations you have already checked and whether or not they matched any other names:
{"s": True, "m": True, "sm": False}
Consulting this dictionary first would help reduce the cost of checking against the other strings and speed up the method as it runs.
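For illustration, here is a minimal sketch of that memoised brute force; the function name find_substring follows the question's examples, and the seen dictionary is the memo described above.

def find_substring(name, names):
    others = [n for n in names if n != name]
    seen = {}  # substring -> True if it appears in at least one other name
    # try candidates from shortest to longest so the first miss is minimal
    for length in range(1, len(name) + 1):
        for start in range(len(name) - length + 1):
            candidate = name[start:start + length]
            if candidate not in seen:
                seen[candidate] = any(candidate in other for other in others)
            if not seen[candidate]:
                return candidate
    return None  # every substring of name also appears in another name

names = ['smith', 'jones', 'williams', 'brown', 'wilson',
         'taylor', 'johnson', 'white', 'martin', 'anderson']
print(find_substring('smith', names))     # sm
print(find_substring('williams', names))  # ll
print(find_substring('taylor', names))    # y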
A variation of a common suffix tree might be enough to achieve this at less than O(n^2) time (it is used in bioinformatics for large genome sequencing), but as @HeapOverflow mentioned in the comments, I do not believe brute forcing this problem would be much of an issue unless you are considering running the algorithm with literally hundreds of millions of strings.
Using the Wikipedia article above for reference: you can build the tree in O(n) time (for all strings, not each individual string), and use it to find all z occurrences of a string P of length m in O(m + z) time. Implemented right, you'll likely be looking at a time of O(n) + O(am + az) = O(am + az) for a list of a words (anyone is welcome to double-check my math on this).

Text Segmentation using Python package of wordsegment

Folks,
I have been using the Python library wordsegment by Grant Jenks for the past couple of hours. The library works fine for completing incomplete words or separating combined words, such as e nd ==> end and thisisacat ==> this is a cat.
I am working on textual data which involves numbers as well, and using this library on that data has the reverse effect. The perfectly fine text increased $55 million or 23.8% for converts to something very weird: increased 55millionor238 for (after performing a join operation on the returned list). Note that this happens randomly (it may or may not happen) for any part of the text which involves numbers.
Has anybody worked with this library before?
If yes, have you faced similar situation and found a workaround?
If not, do you know of any other python library that does this trick for us?
Thank you.
Looking at the code, the segment function first runs clean, which removes all non-alphanumeric characters; it then searches for known unigrams and bigrams within the text clump and scores the words it finds based on their frequency of occurrence in English.
'increased $55 million or 23.8% for'
becomes
'increased55millionor238for'
When searching for sub-terms, it finds 'increased' and 'for', but the score for the unknown phrase '55millionor238' is better than the score for breaking it up for some reason.
It seems to do better with unknown text, especially smaller unknown text elements. You could substitute out non-alphabetic character sequences, run it through segment and then substitute back in.
import re
from random import choices

import wordsegment
wordsegment.load()  # newer versions of wordsegment require loading the data first

CONS = 'bdghjklmpqvwxz'

def sub_map(s, mapping):
    # replace every occurrence of each key with its placeholder
    out = s
    for k, v in mapping.items():
        out = out.replace(k, v)
    return out

s = 'increased $55 million or 23.8% for'

# map each numeric/currency token onto a random three-consonant placeholder
mapping = {m.group(): ''.join(choices(CONS, k=3))
           for m in re.finditer(r'[0-9\.,$%]+', s)}
revmap = {v: k for k, v in mapping.items()}

word_list = wordsegment.segment(sub_map(s, mapping))
word_list = [revmap.get(w, w) for w in word_list]
word_list
# returns:
['increased', '$55', 'million', 'or', '23.8%', 'for']
There are implementations in Ruby and Python at Need help understanding this Python Viterbi algorithm.
The algorithm (and those implementations) are pretty straightforward, and copy and paste may be better than using a library, because (in my experience) this problem almost always needs some customisation to fit the data at hand (i.e. language-specific topics, custom entities, date or currency formats).

Finding a substring's position in a larger string

I have a large string and a large number of smaller substrings and I am trying to check if each substring exists in the larger string and get the position of each of these substrings.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
if each_sub_string in string:
print each_sub_string, string.index(each_sub_string)
The problem is, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, maybe by using regular expressions or some other way?
The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
    'c': ['a', {'a': ['t']}],  # 'a' is a leaf completing 'ca'; the nested dict leads on to 'cat'
    's': ['at']                # 'at' is a leaf completing 'sat'
}
Apologies for the gross formatting, but I think it's helpful to map back to a Python data structure directly. You can build a tree where the top level entries are the starting letters, and they map to the list of potential final substrings that could be completed. If you hit something that is a list element and has nothing more nested beneath, you've hit a leaf and you know that you've hit the first instance of that substring.
Holding that tree in memory is a little hefty, but if you've only got a million strings this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of words.
For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
If you're interested in learning more about these approaches there are three main algorithms used in practice:
Aho-Corasick (Basis of fgrep) [Worst case: O(m+n)]
Commentz-Walter (Basis of vanilla GNU grep) [Worst case: O(mn)]
Rabin-Karp (Used for plagiarism detection) [Worst case: O(mn)]
There are domains in which all of these algorithms will outperform the others, but based on the fact that you've got a very high number of sub-strings that you're searching, and there's likely a lot of overlap between them, I would bet that Aho-Corasick is going to give you significantly better performance than the other two methods as it avoids the O(mn) worst-case scenario.
There is also a great Python library that implements the Aho-Corasick algorithm, found here, that should allow you to avoid writing the gross implementation details yourself.
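As a rough sketch (assuming the library referred to is the pyahocorasick package, which is not named explicitly above), usage looks something like this:

import ahocorasick  # pip install pyahocorasick; assumed to be the library meant above

string = "some large text here"
sub_strings = ["some", "text"]

automaton = ahocorasick.Automaton()
for s in sub_strings:
    automaton.add_word(s, s)      # store the substring itself as the payload
automaton.make_automaton()        # build the Aho-Corasick failure links

# iter() yields (index of the last matched character, payload) for every match
for end_index, found in automaton.iter(string):
    print(found, end_index - len(found) + 1)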
Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.
Say the set of the lengths of your substrings form the set {23, 33, 45} (meaning that you might have millions of substrings, but each one takes one of these three lengths).
Then, for each of these lengths, find the Rabin Window over your large string, and place the results into a dictionary for that length. That is, let's take 23. Go over the large string, and find the 23-window hashes. Say the hash for position 0 is 13. So you insert into the dictionary rabin23 that 13 is mapped to [0]. Then you see that for position 1, the hash is 13 as well. Then in rabin23, update that 13 is mapped to [0, 1]. Then in position 2, the hash is 4. So in rabin23, 4 is mapped to [2].
Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).
BTW, in many cases, the lengths of your substrings will exhibit a Pareto behavior, where say 90% of the strings are in 10% of the lengths. If so, you can do this for these lengths only.
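A minimal sketch of this preprocessing idea, using Python's built-in hash on fixed-length slices for brevity (a real Rabin-Karp implementation would roll the hash along the string instead of rehashing each window):

from collections import defaultdict

string = "some large text here"
sub_strings = ["some", "text", "here"]

# index every window of each substring length: hash -> list of start positions
windows = {}
for length in set(len(s) for s in sub_strings):
    table = defaultdict(list)
    for i in range(len(string) - length + 1):
        table[hash(string[i:i + length])].append(i)
    windows[length] = table

for s in sub_strings:
    # candidate positions share the hash; verify to rule out collisions
    for i in windows[len(s)].get(hash(s), []):
        if string[i:i + len(s)] == s:
            print(s, i)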
This approach is sub-optimal compared to the other answers, but might be good enough regardless, and is simple to implement. The idea is to turn the algorithm around so that instead of testing each sub-string in turn against the larger string, iterate over the large string and test against possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.
The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.
Create a dictionary mapping each possible 1-3 leading characters to the list of sub-strings beginning with those characters. Then iterate over the string and, at each position, read the 1-3 characters after it and check each sub-string in the dictionary that begins with those characters for a match at that position:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based the first 1-3 characters
dict = {}
for s in sub_strings:
if s[0:3] in dict:
dict[s[0:3]].append(s)
else:
dict[s[0:3]] = [s];
# iterate over the chars in string, testing words that match on first 1-3 chars
for i in range(0, len(string)):
for j in range(1,4):
char = string[i:i+j]
if char in dict:
for word in dict[char]:
if string[i:i+len(word)] == word:
print word, i
If you don't need to match any sub-strings 1 or 2 characters long, then you can get rid of the for j loop and just assign char with char = string[i:i+3].
Using this second approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:
with open ("warandpeace.txt", "r") as textfile:
string=textfile.read().replace('\n', '')
sub_strings=list(set(string.split()))
Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.

Why is a list of cumulative frequency sums required for implementing a random word generator?

I'm working on exercise 13.7 from Think Python: How to Think Like a Computer Scientist. The goal of the exercise is to come up with a relatively efficient algorithm that returns a random word from a file of words (let's say a novel), where the probability of the word being returned is correlated to its frequency in the file.
The author suggests the following steps (there may be a better solution, but this is presumably the best solution for what we've covered so far in the book).
1. Create a histogram showing {word: frequency}.
2. Use the keys method to get a list of words in the book.
3. Build a list that contains the cumulative sum of the word frequencies, so that the last item in this list is the total number of words in the book, n.
4. Choose a random number from 1 to n.
5. Use a bisection search to find the index where the random number would be inserted in the cumulative sum.
6. Use the index to find the corresponding word in the word list.
My question is this: What's wrong with the following solution?
1. Turn the novel into a list t of words, exactly as they appear in the novel, without eliminating repeat instances or shuffling.
2. Generate a random integer from 0 to n, where n = len(t) – 1.
3. Use that random integer as an index to retrieve a random word from t.
Thanks.
Your approach is (also) correct, but it uses space proportional to the input text size. The approach suggested by the book uses space proportional only to the number of distinct words in the input text, which is usually much smaller. (Think about how often words like "the" appear in English text.)
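For comparison, a minimal sketch of the cumulative-sum approach the book suggests (the sample text here is made up for illustration):

import bisect
import random

text_words = "the cat sat on the mat and the dog sat too".split()

# histogram of {word: frequency} and the list of distinct words
hist = {}
for w in text_words:
    hist[w] = hist.get(w, 0) + 1
words = list(hist.keys())

# cumulative sums of the frequencies; the last entry is n, the total word count
cumulative = []
total = 0
for w in words:
    total += hist[w]
    cumulative.append(total)

def random_word():
    r = random.randint(1, total)               # random number from 1 to n
    return words[bisect.bisect_left(cumulative, r)]

print(random_word())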

Algorithm for testing multiple substrings in multiple strings

I have several million strings, X, each with around 20 or fewer words. I also have a list of several thousand candidate substrings, C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python, if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
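As a rough sketch of the trie idea (built over the candidate substrings C rather than the bigger set, purely to keep the example short), you scan each x position by position against the trie:

def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node['$'] = p  # marks a complete pattern ending at this node
    return root

def first_match(x, trie):
    for start in range(len(x)):
        node = trie
        for ch in x[start:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:
                return node['$']  # some candidate is contained in x
    return None

C = ['cat', 'dog bite', 'fish']       # illustrative candidates
X = ['the cat sat on the mat', 'no animals here', 'beware of dog bites']
trie = build_trie(C)
for x in X:
    print(x, '->', first_match(x, trie))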
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
If your x in X only contains words, and you only want to match words you could do the following:
Insert your keywords into a set, which makes each membership test fast (average O(1)), and then check for every word in x whether it is contained in that set,
like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass  # do what you need to do
A good alternative would be to use Google's RE2 library, which uses some very nice automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language, which makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
You could try using a regex:
import re

# escape the candidates so any regex metacharacters are matched literally
subs = re.compile('|'.join(re.escape(c) for c in C))
for x in X:
    if subs.search(x):
        print 'found'
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
