How to return frequency distributions as integers - Python

I'm using FreqDist to get the number of occurrences of a word in a corpus, as it's quite fast. The issue is that ftable is not returning answers as integers, which I need in order to do some basic operations on them.
words = brown.words()
content = [w for w in words if w.lower()]
ftable = nltk.FreqDist(content)
e.g.:
percent = (ftable[sing] / ftable[s]) * 100
I've tried things like ftable.N[sing] but no luck.
Thanks!
EDIT: Also in comments.
The w.lower() is to lowercase the words in the corpus so that when I run a for loop against them I'm only comparing lowercase values, since ftable matches strings exactly (Hello != hello).
If I use Counter instead, is it as fast?
Is there an easy way of lowercasing the corpus/wordlist being searched?

It seems like you're looking for collections.Counter, not a frequency distribution:
>>> from collections import Counter
>>> words = ['a', 'a', 'a', 'b']
>>> Counter(words)
Counter({'a': 3, 'b': 1})
Also, [w for w in words if w.lower()] might be a little slow. You could speed it up by using filter(None, words).
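To the follow-up questions: the values stored in a Counter are plain ints, so the division works directly, and the easiest place to lowercase the corpus is while building the counter. A minimal sketch, using 'sing' and 's' purely as example lookups:
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> ftable = Counter(w.lower() for w in brown.words())
>>> percent = (ftable['sing'] / ftable['s']) * 100  # both lookups return ints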

Related

Comprehension dictionary counter and refactoring python code

I'm learning Python by myself, and I'm starting to refactor Python code to learn new and more efficient ways to code.
I tried to write a dictionary comprehension for word_dict, but I couldn't find a way to do it. I had two problems with it:
I tried to add word_dict[word] += 1 in my dictionary comprehension using word_dict[word] := word_dict[word] + 1.
I wanted to check if the element was already in the dictionary (which I'm creating) using if word not in word_dict, and it didn't work.
The dictionary comprehension is:
word_dict = {word_dict[word] := 0 if word not in word_dict else word_dict[word] := word_dict[word] + 1 for word in text_split}
Here is the code; it reads a text and counts the different words in it. If you know a better way to do it, just let me know.
text = "hello Hello, water! WATER:HELLO. water , HELLO"
# clean the text
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello water WATER HELLO water HELLO'
# creates list without spaces elements
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = {}
for word in text_split:
    if word not in word_dict:
        word_dict[word] = 0
    word_dict[word] += 1
word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary, where the keys are words, and the associated values are counts/occurrences:
import re
from collections import Counter
text = "hello Hello, water! WATER:HELLO. water , HELLO"
pattern = r"\b\w+\b"
print(Counter(re.findall(pattern, text)))
Output:
Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})
Here's what the regex pattern is composed of:
\b - represents a word boundary (will not be included in the match)
\w+ - one or more characters from the set [a-zA-Z0-9_].
\b - another word boundary (will also not be included in the match)
Welcome to Python. There is the collections library (https://docs.python.org/3/library/collections.html), which has a class called Counter. It seems very likely that this could fit in your code. Would that work for you?
from collections import Counter
...
word_dict = Counter(text_split)
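For the text_split list shown above, this gives the same counts as the manual loop (Counter prints the most common entries first):
>>> Counter(text_split)
Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})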

Matching 2 words in 2 lines and +1 to the matching pair?

So I've got a variable list which is always being fed a new line.
And a variable words which is a big list of single-word strings.
Every time list updates I want to compare it to words and see if any strings from words are in list.
If they do match, let's say the word and is in both of them, I then want to print "And : 1". Then if the next sentence has that as well, print "And : 2", etc. If another word like The comes in, I want to print +1 for that.
So far I have split the incoming text into an array with text.split(), but unfortunately that is where I'm stuck. I do see some use in [x for x in words if x in list], but I don't know how I would use that, or how I would extract the specific word that is matching.
You can use a collections.Counter object to keep a tally for each of the words that you are tracking. To improve performance, use a set for your word list (you said it's big). To keep things simple assume there is no punctuation in the incoming line data. Case is handled by converting all incoming words to lowercase.
from collections import Counter

words = {'and', 'the', 'in', 'of', 'had', 'is'}  # words to keep counts for
word_counts = Counter()
lines = ['The rabbit and the mole live in the ground',
         'Here is a sentence with the word had in it',
         'Oh, it also had in in it. AND the and is too']

for line in lines:
    tracked_words = [w for word in line.split() if (w := word.lower()) in words]
    word_counts.update(tracked_words)
    print(*[f'{word}: {word_counts[word]}'
            for word in set(tracked_words)], sep=', ')
Output
the: 3, and: 1, in: 1
the: 4, in: 2, is: 1, had: 1
the: 5, and: 3, in: 4, is: 2, had: 2
Basically this code takes a line of input, splits it into words (assuming no punctuation), converts these words to lowercase, and discards any words that are not in the main list of words. Then the counter is updated. Finally, the current values of the relevant words are printed.
This does the trick:
sentence = "Hello this is a sentence"
list_of_words = ["this", "sentence"]
dict_of_counts = {} #This will hold all words that have a minimum count of 1.
for word in sentence.split(): #sentence.split() returns a list with each word of the sentence, and we loop over it.
if word in list_of_words:
if word in dict_of_counts: #Check if the current sentence_word is in list_of_words.
dict_of_counts[word] += 1 #If this key already exists in the dictionary, then add one to its value.
else:
dict_of_counts[word] = 1 #If key does not exists, create it with value of 1.
print(f"{word}: {dict_of_counts[word]}") #Print your statement.
The total count is kept in dict_of_counts and would look like this if you print it:
{'this': 1, 'sentence': 1}
You can use a defaultdict here to keep the counting simple and fast.
from collections import defaultdict

input_string = "This is an input string"
list_of_words = ["input", "is"]
counts = defaultdict(int)

for word in input_string.split():
    if word in list_of_words:
        counts[word] += 1
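For the sample input above, the resulting counter holds the two tracked words that actually appear (as the first answer notes, a set would make the word in list_of_words test faster for a big word list):
>>> counts
defaultdict(<class 'int'>, {'is': 1, 'input': 1})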

Convert consecutive duplicate character string to known word

I am trying to convert a string with consecutive duplicate characters to its 'dictionary' word. For example, 'aaawwesome' should be converted to 'awesome'.
Most answers I've come across have either removed all duplicates (so 'stations' would be converted to 'staion', for example) or have removed all consecutive duplicates using itertools.groupby(); however, this doesn't account for cases such as 'happy', which would be converted to 'hapy'.
I have also tried using string intersections with Counter(); however, this disregards the order of the letters, so 'botaniser' and 'baritones' would match incorrectly.
In my specific case, I have a few lists of words:
list_1 = ["wife", "kid", "hello"]
list_2 = ["husband", "child", "goodbye"]
and, given a new word, for example 'hellllo', I want to check if it can be reduced down to any of the words, and if it can, replace it with that word.
Use the enchant module; you may need to install it using pip.
See which letters are duplicated, and remove a letter from the word until the word is in the English dictionary.
import enchant

d = enchant.Dict("en_US")
list_1 = ["wiffe", "kidd", "helllo"]

def dup(x):
    for n, j in enumerate(x):
        # indices of every occurrence of the current letter
        y = [g for g, m in enumerate(x) if m == j]
        for h in y:
            # if the letter is repeated and the word is not yet a dictionary word, drop one copy
            if len(y) > 1 and not d.check(x):
                x = x[:h] + x[h + 1:]
    return x

list_1 = list(map(dup, list_1))
print(list_1)
>>> ['wife', 'kid', 'hello']
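If you only need to check new words against fixed lists such as list_1 and list_2 rather than a full dictionary, a regex that lets each letter of a candidate word repeat is another possible approach. This is just a sketch of that idea, not part of the answer above:
import re

list_1 = ["wife", "kid", "hello"]
list_2 = ["husband", "child", "goodbye"]

def reduce_word(word, candidates):
    # Build a pattern like 'h+e+l+l+o+' for each candidate, so extra consecutive
    # copies of a letter are allowed but missing letters are not.
    for candidate in candidates:
        pattern = "".join(re.escape(ch) + "+" for ch in candidate)
        if re.fullmatch(pattern, word):
            return candidate
    return word  # no candidate matches: leave the word unchanged

print(reduce_word("hellllo", list_1 + list_2))  # hello
print(reduce_word("happy", list_1 + list_2))    # happy (unchanged)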

simple [pythonic] way to create ids for a given set of words

I have a large corpus of texts on which I want to run a couple of algorithms. These algorithms do not care about what a word is - a word is just a unique object to them. Hence I want to reduce the size of the text by simply replacing the words with integer IDs.
An example:
my_string = "an example sentence with an example repetition."
my_ids = get_ids_from_string(my_string)
print(my_ids)
>>> [0, 1, 2, 3, 0, 1, 4] ### note that the ID for 'example' is always the same
I am looking for a neat, efficient, pythonic way of solving this.
You're not winning much by replacing strings with integers; you'd win just as much by making sure that identical strings are stored only once in memory.
my_string = "an example sentence with an example repetition."
words = my_string.split()
unique_words = [intern(word) for word in words]
The "unique_words" list is equal to the "words" list, but intern() guarantees that the strings will be used once. If you do this on a large corpus of text with a much smaller set of possible words, it will not use substantially more memory than integers.
The answer that I came up with is the following (you can reuse the next_i function for multiple ID generators):
from collections import defaultdict

COUNTER = defaultdict(lambda: -1)

def next_i(counter_id=0):
    """Return a new ID"""
    global COUNTER
    COUNTER[counter_id] += 1
    return COUNTER[counter_id]

id_generator = defaultdict(lambda: next_i(0))

my_string = "an example sentence with an example repetition."
my_ids = [id_generator[t] for t in my_string.lower().split()]

print(my_ids)
>>> [0, 1, 2, 3, 0, 1, 4]
On my 6 million document setup this algorithm finishes in 45.56s.
I'd use the fromkeys method of a collections.OrderedDict:
>>> from collections import OrderedDict
>>> my_string = "an example sentence with an example repetition."
>>> words = my_string.split()
>>> code = {w: i for i,w in enumerate(OrderedDict.fromkeys(words))}
>>> [code[w] for w in words]
[0, 1, 2, 3, 0, 1, 4]
This works because OrderedDict.fromkeys will make a dictionary containing unique words in the order they were first seen:
>>> OrderedDict.fromkeys(words)
OrderedDict([('an', None), ('example', None), ('sentence', None), ('with', None), ('repetition.', None)])
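On Python 3.7 and later, where plain dicts preserve insertion order, dict.fromkeys behaves the same way, so OrderedDict is no longer strictly needed:
>>> code = {w: i for i, w in enumerate(dict.fromkeys(words))}
>>> [code[w] for w in words]
[0, 1, 2, 3, 0, 1, 4]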
Since that works but apparently doesn't perform very well, maybe:
from itertools import count
counter = count()
code = {}
coded = [code[w] if w in code else code.setdefault(w, next(counter)) for w in words]
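Applied to the words list from the question, this again produces the expected sequence of IDs:
print(coded)  # [0, 1, 2, 3, 0, 1, 4]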
We have a corpus (about a million documents) that we produce recommendations from for my company so size reduction is very important to us. The biggest gains in space reduction we get are from doing three simple things, two of them using nltk:
Use the default punkt sentence tokenization to reduce the sentences to more meaningful symbols.
Remove all stopwords as they don't contain any useful information.
Stem the words using the nltk porter stemmer.
This is not the most processor-efficient way to do this but will be pretty efficient in terms of space if that is your bottleneck. If the sentence tokenizer is too slow (I think it's the slowest part), it would still probably be worth it to do the stemming and remove the stopwords.
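A rough sketch of those three steps with nltk, assuming the punkt and stopwords data have already been fetched with nltk.download, and using word_tokenize as a stand-in for the tokenization step:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop = set(stopwords.words('english'))
stemmer = PorterStemmer()

text = "an example sentence with an example repetition."
tokens = word_tokenize(text.lower())  # tokenize (relies on the punkt models)
reduced = [stemmer.stem(t) for t in tokens
           if t.isalpha() and t not in stop]  # drop stopwords and punctuation, then stem
print(reduced)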

repeated phrases in the text Python

I have a problem and I have no idea how to solve it. Please give me some advice.
I have a text. A big, big text. The task is to find all the repeated phrases of length 3 (consisting of three words) in the text.
You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are 'the black dog' and 'The black, dog?' the same phrase?
A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.
A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
>>> import re
>>> def words(text):
...     pattern = re.compile(r"[^\s]+")
...     non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
...     for match in pattern.finditer(text):
...         nxt = non_alpha.sub("", match.group()).lower()
...         if nxt:  # skip blank, non-alpha words
...             yield nxt
>>> text
"O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
>>> def phrases(words):
...     phrase = []
...     for word in words:
...         phrase.append(word)
...         if len(phrase) > 3:
...             phrase.remove(phrase[0])
...         if len(phrase) == 3:
...             yield tuple(phrase)
>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
Significantly, chaining the generators together only traverses the list once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:
>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
...     counts[phrase] += 1
This makes a single pass over text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
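The repeated phrases can then be pulled out with a comprehension (for this short example text there are none, so the result is empty):
>>> [phrase for phrase, count in counts.items() if count > 1]
[]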
The crudest way would be to read the text into a string, do a string.split(), and get the individual words in a list. You could then slice the list three words at a time, and use collections.defaultdict(int) to keep the count.
d = collections.defaultdict(int)
d[phrase] += 1
As I said, it's very crude, but it should certainly get you started.
I would suggest looking at the NLTK toolkit. It is open source and intended for teaching natural language processing. As well as higher-level NLP functions, it has a lot of tokenizing-type functions and collections.
Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl, which was designed for text processing, or C++ for pure performance.
>>> import collections
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> words = s.lower().split()
>>> phrases = collections.defaultdict(int)
>>> for a, b, c in zip(words, words[1:], words[2:]):
...     phrases[(a, b, c)] += 1
...
>>> phrases
defaultdict(<class 'int'>, {('the', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps', 'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1})
>>> [phrase for phrase, count in phrases.items() if count > 1]
[]
