Repeated phrases in the text - Python

I have a problem and I have no idea how to solve it. Please, give a piece of advice.
I have a text. A big, big text. The task is to find all of the repeated phrases of length 3 (i.e. consisting of three words) in the text.

You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are "the black dog" and "The black, dog?" the same phrase?
A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.
A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
>>> import re
>>> def words(text):
...     pattern = re.compile(r"[^\s]+")
...     non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
...     for match in pattern.finditer(text):
...         nxt = non_alpha.sub("", match.group()).lower()
...         if nxt:  # skip blank, non-alpha words
...             yield nxt
...
>>> text
"O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
>>> def phrases(words):
...     phrase = []
...     for word in words:
...         phrase.append(word)
...         if len(phrase) > 3:
...             phrase.remove(phrase[0])
...         if len(phrase) == 3:
...             yield tuple(phrase)
...
>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
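For instance, one possible simpler variant (a sketch only, using collections.deque with maxlen=3 so that the oldest word falls off the window automatically) could be:
>>> import collections
>>> def phrases(words):
...     window = collections.deque(maxlen=3)
...     for word in words:
...         window.append(word)  # the oldest word is discarded automatically
...         if len(window) == 3:
...             yield tuple(window)
...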
Significantly, chaining the generators together traverses the input only once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:
>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
...     counts[phrase] += 1
...
This makes a single pass over text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
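Continuing with the counts dictionary built above, that final step could look like this (a comprehension over counts.items()):
>>> repeated = [phrase for phrase, count in counts.items() if count > 1]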

The crudest way would be to read the text into a string, do a string.split() to get the individual words in a list, then slice the list three words at a time and use a collections.defaultdict(int) to keep the counts:
d = collections.defaultdict(int)
d[phrase] += 1
As I said, it's very crude, but it should certainly get you started.
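Spelled out, that crude approach might look something like this sketch (the sample text here is just for illustration):
import collections

text = "the quick brown fox and the quick brown dog"
words = text.split()

d = collections.defaultdict(int)
for i in range(len(words) - 2):  # overlapping slices of three words
    phrase = tuple(words[i:i + 3])
    d[phrase] += 1

repeated = [p for p, n in d.items() if n > 1]
print(repeated)  # [('the', 'quick', 'brown')]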

I would suggest looking at the NLTK toolkit. It is open source and intended for natural-language teaching. As well as higher-level NLP functions, it has a lot of tokenizing-type functions and collections.
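For instance, a minimal sketch using NLTK's ngrams helper together with collections.Counter (assuming NLTK is installed; the tokenization here is a plain split to keep the example self-contained):
from collections import Counter
from nltk.util import ngrams  # assumes NLTK is installed

text = "the quick brown fox and the quick brown dog"
trigrams = Counter(ngrams(text.lower().split(), 3))

repeated = [gram for gram, count in trigrams.items() if count > 1]
print(repeated)  # [('the', 'quick', 'brown')]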

Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl, which was designed for text processing, or C++ for pure performance.
>>> import collections
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> words = s.lower().split()
>>> phrases = collections.defaultdict(int)
>>> for a, b, c in zip(words[:-2], words[1:-1], words[2:]):
...     phrases[(a, b, c)] += 1
...
>>> phrases
defaultdict(<type 'int'>, {('the', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps', 'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1})
>>> [phrase for phrase, count in phrases.iteritems() if count > 1]
[]

Related

Comprehension dictionary counter and refactoring python code

I'm learning Python by myself, and I'm starting to refactor Python code to learn new and more efficient ways to code.
I tried to write a dictionary comprehension for word_dict, but I couldn't find a way to do it. I had two problems with it:
I tried to add word_dict[word] += 1 in my dictionary comprehension using word_dict[word] := word_dict[word] + 1
I wanted to check if the element was already in the dictionary I was creating, using if word not in word_dict, and it didn't work.
The dictionary comprehension is:
word_dict = {word_dict[word] := 0 if word not in word_dict else word_dict[word] := word_dict[word] + 1 for word in text_split}
Here is the code; it reads a text and counts the different words in it. If you know a better way to do it, just let me know.
text = "hello Hello, water! WATER:HELLO. water , HELLO"
# clean then text
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello water WATER HELLO water HELLO'
# creates list without spaces elements
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = {}
for word in text_split:
if word not in word_dict:
word_dict[word] = 0
word_dict[word] += 1
word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary, where the keys are words, and the associated values are counts/occurrences:
import re
from collections import Counter
text = "hello Hello, water! WATER:HELLO. water , HELLO"
pattern = r"\b\w+\b"
print(Counter(re.findall(pattern, text)))
Output:
Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})
Here's what the regex pattern is composed of:
\b - represents a word boundary (will not be included in the match)
\w+ - one or more characters from the set [a-zA-Z0-9_].
\b - another word boundary (will also not be included in the match)
Welcome to Python. There is the library collections (https://docs.python.org/3/library/collections.html), which has a class called Counter. It seems very likely that it could fit in your code. Does that work for you?
from collections import Counter
...
word_dict = Counter(text_split)

Python find word occurrences in list of phrases and link word to phrases

Let's say I have a .txt file with phrases divided by newlines (\n).
I split them into a list of phrases:
["Rabbit eats banana", "Fox eats apple", "bear eats sanwich", "Tiger sleeps"]
What I need to do:
I need to make a list of word objects; each word should have:
name
frequency (how many times it occurred in the phrases)
list of phrases it belongs to
For the word eats the result will be:
{'name':'eats', 'frequency': '3', 'phrases': [0,1,2]}
What I've already done:
Right now, I am doing it in a simple but not effective way:
I get the list of words by splitting the .txt file on the space character (" "):
words = split_my_input_file_by_spaces
#["banana", 'eats', 'apple', ....]
And loop for every word and every phrase:
for word in words:
    for phrase in phrases:
        if word in phrase:
            # add word freq +1
What is the problem with the current approach:
I will have up to 10k phrases, so I have run into some problems with speed and performance, and I want to make it faster.
I saw this interesting and promising way of counting occurrences (but I don't know how I can make a list of the phrases each word belongs to):
from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})
What you're referring to (the interesting and promising way of counting occurrences) is called a HashMap, or a dictionary in Python. These are key-value stores, which allow you to store & update some value (such as a count, a list of phrases, or a Word object) with constant time retrieval.
You mentioned that you ran into some runtime issues. Switching to a HashMap-based approach will speed up the runtime of your algorithm significantly (from quadratic to linear).
phrases = ["Hello there", "Hello where"];
wordCounts = {};
wordPhrases = {};
for phrase in phrases:
for word in phrase.split():
if (wordCounts.get(word)):
wordCounts[word] = wordCounts[word] + 1
wordPhrases[word].append(phrase)
else:
wordCounts[word] = 1
wordPhrases[word] = [phrase]
print(wordCounts)
print(wordPhrases)
Output:
{'there': 1, 'where': 1, 'Hello': 2}
{'there': ['Hello there'], 'where': ['Hello where'], 'Hello': ['Hello there', 'Hello where']}
This will leave you with two dictionaries:
For each word, how often does it appear: {word: count}
For each word, in what phrases does it appear: {word: [phrases...]}
Some small effort is required from this point to achieve the output you're looking for.
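As a rough sketch of that last step (reusing wordCounts and wordPhrases from above, and reporting phrase indices rather than the phrase strings, as in your desired output):
word_objects = []
for word, count in wordCounts.items():
    word_objects.append({
        'name': word,
        'frequency': count,
        'phrases': [phrases.index(p) for p in wordPhrases[word]],  # indices into the phrases list
    })
print(word_objects)
# e.g. [{'name': 'Hello', 'frequency': 2, 'phrases': [0, 1]}, ...]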

simple [pythonic] way to create ids for a given set of words

I have a large corpus of texts on which I want to run a couple of algorithms. These algorithms do not care about what a word is - a word is just a unique object to them. Hence I want to reduce the size of the text by simply replacing the words with integer IDs.
An example:
my_string = "an example sentence with an example repetition."
my_ids = get_ids_from_string(my_string)
print my_ids
>>> [0, 1, 2, 3, 0, 1, 4] ### note that the ID for 'example' is always the same
I am looking for a neat, efficient, pythonic way of solving this.
You're not winning much by replacing strings with integers; you'd win just as much by making sure that identical strings are stored only once in memory.
my_string = "an example sentence with an example repetition."
words = my_string.split()
unique_words = [intern(word) for word in words]
The "unique_words" list is equal to the "words" list, but intern() guarantees that the strings will be used once. If you do this on a large corpus of text with a much smaller set of possible words, it will not use substantially more memory than integers.
The answer that I came up with is the following (you can re-use the next_i function for multiple ID generators):
from collections import defaultdict
COUNTER = defaultdict(lambda : -1)
def next_i(counter_id=0):
    """Return a new ID"""
    global COUNTER
    COUNTER[counter_id] += 1
    return COUNTER[counter_id]
id_generator = defaultdict(lambda : next_i(0))
my_string = "an example sentence with an example repetition."
my_ids = [id_generator[t] for t in my_string.lower().split()]
print my_ids
>>> [0, 1, 2, 3, 0, 1, 4]
On my 6 million document setup this algorithm finishes in 45.56s.
I'd use the fromkeys method of a collections.OrderedDict:
>>> from collections import OrderedDict
>>> my_string = "an example sentence with an example repetition."
>>> words = my_string.split()
>>> code = {w: i for i,w in enumerate(OrderedDict.fromkeys(words))}
>>> [code[w] for w in words]
[0, 1, 2, 3, 0, 1, 4]
This works because OrderedDict.fromkeys will make a dictionary containing unique words in the order they were first seen:
>>> OrderedDict.fromkeys(words)
OrderedDict([('an', None), ('example', None), ('sentence', None), ('with', None), ('repetition.', None)])
Since that works but apparently doesn't perform very well, maybe:
from itertools import count
counter = count()
code = {}
coded = [code[w] if w in code else code.setdefault(w, next(counter)) for w in words]
We have a corpus (about a million documents) that we produce recommendations from for my company, so size reduction is very important to us. The biggest gains in space reduction we get are from doing three simple things, two of them using nltk:
Use the default punkt sentence tokenization to reduce the sentences to more meaningful symbols.
Remove all stopwords as they don't contain any useful information.
Stem the words using the nltk porter stemmer.
This is not the most processor-efficient way to do this but will be pretty efficient in terms of space if that is your bottleneck. If the sentence tokenizer is too slow (I think it's the slowest part), it would still probably be worth it to do the stemming and remove the stopwords.
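A minimal sketch of the stopword-removal and stemming steps, assuming NLTK and its stopwords data are available (the sample sentence is illustrative):
from nltk.corpus import stopwords  # assumes the stopwords corpus has been downloaded
from nltk.stem import PorterStemmer

stop = set(stopwords.words('english'))
stemmer = PorterStemmer()

def reduce_words(text):
    # lowercase, drop stopwords, and stem what remains
    return [stemmer.stem(w) for w in text.lower().split() if w not in stop]

print(reduce_words("the dogs were running quickly over the hills"))
# something like: ['dog', 'run', 'quickli', 'hill']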

Split a string into all possible ordered phrases

I am trying to explore the functionality of Python's built-in functions. I'm currently trying to work up something that takes a string such as:
'the fast dog'
and break the string down into all possible ordered phrases, as lists. The example above would output as the following:
[['the', 'fast dog'], ['the fast', 'dog'], ['the', 'fast', 'dog']]
The key thing is that the original ordering of the words in the string needs to be preserved when generating the possible phrases.
I've been able to get a function to work that can do this, but it is fairly cumbersome and ugly. However, I was wondering if some of the built-in functionality in Python might be of use. I was thinking that it might be possible to split the string at various white spaces, and then apply that recursively to each split. Might anyone have some suggestions?
Using itertools.combinations:
import itertools

def break_down(text):
    words = text.split()
    ns = range(1, len(words))  # n = 1..(len(words) - 1)
    for n in ns:  # split into 2, 3, 4, ..., n parts
        for idxs in itertools.combinations(ns, n):
            yield [' '.join(words[i:j]) for i, j in zip((0,) + idxs, idxs + (None,))]
Example:
>>> for x in break_down('the fast dog'):
... print(x)
...
['the', 'fast dog']
['the fast', 'dog']
['the', 'fast', 'dog']
>>> for x in break_down('the really fast dog'):
... print(x)
...
['the', 'really fast dog']
['the really', 'fast dog']
['the really fast', 'dog']
['the', 'really', 'fast dog']
['the', 'really fast', 'dog']
['the really', 'fast', 'dog']
['the', 'really', 'fast', 'dog']
Think of the set of gaps between words. Every subset of this set corresponds to a set of split points and finally to the split of the phrase:
the fast dog jumps
   ^1   ^2  ^3 - these are split points
For example, the subset {1,3} corresponds to the split {"the", "fast dog", "jumps"}
Subsets can be enumerated as binary numbers from 1 to 2^(L-1)-1, where L is the number of words:
001 -> the fast dog, jumps
010 -> the fast, dog jumps
011 -> the fast, dog, jumps
etc.
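A compact sketch of that enumeration (illustrative only; bit k of the mask decides whether to split at gap k):
def splits(phrase):
    words = phrase.split()
    gaps = len(words) - 1
    for mask in range(1, 1 << gaps):  # every non-empty subset of split points
        parts, current = [], words[0]
        for k in range(gaps):
            if mask & (1 << k):       # bit k set: split at gap k
                parts.append(current)
                current = words[k + 1]
            else:                     # bit k clear: keep the words together
                current += ' ' + words[k + 1]
        parts.append(current)
        yield parts

for p in splits('the fast dog'):
    print(p)
# ['the', 'fast dog'], ['the fast', 'dog'], ['the', 'fast', 'dog']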
I'll elaborate a bit on #grep's solution, while using only built-ins as you stated in your question and avoiding recursion. You could possibly implement his answer somehow along these lines:
#! /usr/bin/python3
def partition(phrase):
    words = phrase.split()             # split your phrase into words
    gaps = len(words) - 1              # one gap less than words (fencepost problem)
    for i in range(1 << gaps):         # the 2^n possible partitions
        r = words[:1]                  # the result starts with the first word
        for word in words[1:]:
            if i & 1: r.append(word)   # if "1", split at the gap
            else: r[-1] += ' ' + word  # if "0", don't split at the gap
            i >>= 1                    # next 0 or 1 indicating split or don't split
        yield r                        # cough up r

for part in partition('The really fast dog.'):
    print(part)
The operation you request is usually called a “partition”, and it can be done over any kind of list. So, let's implement partitioning of any list:
def partition(lst):
    for i in xrange(1, len(lst)):
        for r in partition(lst[i:]):
            yield [lst[:i]] + r
    yield [lst]
Note that there will be lots of partitions for longer lists, so it is preferable to implement it as a generator. To check if it works, try:
print list(partition([1, 2, 3]))
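For this implementation, that should print:
[[[1], [2], [3]], [[1], [2, 3]], [[1, 2], [3]], [[1, 2, 3]]]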
Now, you want to partition a string using words as elements. The easiest way to do this operation is to split text by words, run the original partitioning algorithm, and merge groups of words back into strings:
def word_partition(text):
    for p in partition(text.split()):
        yield [' '.join(group) for group in p]
Again, to test it, use:
print list(word_partition('the fast dog'))
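which should give:
[['the', 'fast', 'dog'], ['the', 'fast dog'], ['the fast', 'dog'], ['the fast dog']]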

How to return frequency distributions as integers

I'm using FreqDist to get the number of occurrences of a word in a corpus, as it's quite fast. The issue is that ftable is not returning answers as the integers I need to do some basic operations on.
import nltk
from nltk.corpus import brown

words = brown.words()
content = [w for w in words if w.lower()]
ftable = nltk.FreqDist(content)
e.g.:
percent = (ftable[sing] / ftable[s]) * 100
I've tried things like ftable.N[sing] but no luck.
Thanks!
EDIT: Also in comments.
The w.lower() is to lowercase the words in the corpora so that when I run a for loop against them I'm only comparing lowercase values, as the ftable matches strings exactly (Hello != hello).
If using Counter, is it as fast?
Is there an easy way of lowercasing the corpora/wordlist being searched?
It seems like you're looking for collections.Counter, not a frequency distribution:
>>> from collections import Counter
>>> words = ['a', 'a', 'a', 'b']
>>> Counter(words)
Counter({'a': 3, 'b': 1})
Also, [w for w in words if w.lower()] might be a little slow. You could speed it up by using filter(None, words).
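For the follow-up questions, lowercasing can be folded in while counting, and the percentage just needs the counts pulled out by (lowercased) key; a rough sketch, assuming the Brown corpus is available and using placeholder words for sing and s:
from collections import Counter
from nltk.corpus import brown  # assumes the Brown corpus has been downloaded

# lowercase while counting, so 'Hello' and 'hello' share one entry
counts = Counter(w.lower() for w in brown.words())

sing, s = 'sing', 'said'  # placeholder words; substitute your own
percent = counts[sing] / float(counts[s]) * 100  # float() so Python 2 doesn't truncate
print(percent)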
