Count only the words in a text file in Python

I have to count all the words in a file and create a histogram of the words. I am using the following Python code.
for word in re.split('[,. ]', f2.read()):
    if word not in histogram:
        histogram[word] = 1
    else:
        histogram[word] += 1
f2 is the file I am reading. I tried to parse the file with multiple delimiters, but it still does not work: it counts all strings in the file and makes a histogram, but I only want words. I get results like this:
1-1-3: 3
where "1-1-3" is a string that occurs 3 times. How do I check so that only actual words are counted? casing does not matter. I also need to repeat this question but for two word sequences, so an output that looks like:
and the: 4
where "and the" is a two word sequence that appears 4 times. How would I group two word sequences together for counting?

from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk import bigrams
from string import punctuation
# preparatory stuff
>>> tokenizer = RegexpTokenizer(r'[^\W\d]+')  # runs of word characters excluding digits, i.e. letters
>>> my_string = "this is my input string. 12345 1-2-3-4-5. this is my input"
# single words
>>> tokens = tokenizer.tokenize(my_string)
>>> Counter(tokens)
Counter({'this': 2, 'input': 2, 'is': 2, 'my': 2, 'string': 1})
# word pairs
>>> nltk_bigrams = bigrams(my_string.split())
>>> bigrams_list = [' '.join(x).strip(punctuation) for x in list(nltk_bigrams)]
>>> Counter([x for x in bigrams_list if x.replace(' ','').isalpha()])
Counter({'is my': 2, 'this is': 2, 'my input': 2, 'input string': 1})
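To run the same pipeline over a file rather than a test string, something like this should work (a sketch: 'input.txt' stands in for whatever file f2 points at, and the lower() call reflects the question's note that casing does not matter):
# Sketch: the same tokenizer/Counter approach applied to a file
with open('input.txt') as f2:
    tokens = tokenizer.tokenize(f2.read().lower())
word_counts = Counter(tokens)
pair_counts = Counter(' '.join(pair) for pair in bigrams(tokens))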

Assuming you want to count all words in a string, you could do something like this, using a defaultdict as a counter:
#!/usr/bin/env python3
# coding: utf-8
from collections import defaultdict

# For the sake of simplicity we are using a string instead of a read file
sentence = "The quick brown fox jumps over the lazy dog. THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. The quick brown fox"

# Specify the word pairs you want to count as a single phrase
special_pairs = [('the', 'quick')]

# Convert the input to lowercase in order to neglect case sensitivity, and print it to double-check
sentence = sentence.lower()
print(sentence)

# Split the string into single words
word_list = sentence.split(' ')
print(word_list)

# Correct word_list so that our word pairs are counted as a single phrase
# and not as two single words (the index bound check guards against
# looking past the last word)
for pair in special_pairs:
    for index, word in enumerate(word_list):
        if index + 1 < len(word_list) and pair[0] == word and pair[1] == word_list[index + 1]:
            word_list.remove(pair[0])
            word_list.remove(pair[1])
            word_list.append(' '.join([pair[0], pair[1]]))

d = defaultdict(int)
for word in word_list:
    d[word] += 1
print(d.items())
Output:
the quick brown fox jumps over the lazy dog. the quick brown fox jumps over the lazy dog. the quick brown fox
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox']
dict_items([('lazy', 2), ('dog.', 2), ('fox', 3), ('brown', 3), ('jumps', 2), ('the quick', 3), ('the', 2), ('over', 2)])
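As an aside, collections.Counter would do the final tally in one line; this sketch replaces the defaultdict loop above:
from collections import Counter

d = Counter(word_list)  # same counts as the defaultdict loop
print(d.items())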

Related

Working through bugs of Python program, not sure why they are occurring

I am building a program to add lists of a specific length to a bigger list from a string. For example, in the string "The quick brown fox jumped over the lazy dog", if I were to split it up into lists of 4, my return value would be
"[[The, quick, brown, fox], [jumped, over, the, lazy], [dog]]"
My code is:
text = "The quick brown fox jumped over the lazy dog"
micro = []
count = 1
split = text.split(" ")
total_list = []
for i in range(0, len(split)):
    print(split[i], count)
    if count < 5:
        micro.append(split[i])
        print(micro)
    if count == 4:
        total_list.append(micro)
        print(total_list)
        micro.clear()
        count = 0
    count += 1
print(total_list)
The idea of this is to split the text into a big list, keep a counter to add in groups of 4, and then add that smaller split to the overall list. Because this string doesn't split evenly, I know that I won't add "dog" at the end, which I don't know how to fix given my current setup. My output is:
The 1
['The']
quick 2
['The', 'quick']
brown 3
['The', 'quick', 'brown']
fox 4
['The', 'quick', 'brown', 'fox']
[['The', 'quick', 'brown', 'fox']]
jumped 1
['jumped']
over 2
['jumped', 'over']
the 3
['jumped', 'over', 'the']
lazy 4
['jumped', 'over', 'the', 'lazy']
[['jumped', 'over', 'the', 'lazy'], ['jumped', 'over', 'the', 'lazy']]
[[], []]
Mostly, I am confused and am wondering if there is an easier way to do this. I am hoping to use this breakdown in order to use apriori. Because of the scale of the data I am working with (a set of 1100+ texts of ~100 words each), I want a list of lists and to be less intensive overall.
Any help would be appreciated. Thanks.
You can use range and a list slice for this:
text = "The quick brown fox jumped over the lazy dog"
text = text.split()
result = []
step = 4
for i in range(0, len(text), step):
    result.append(text[i: i + step])
result
#[['The', 'quick', 'brown', 'fox'], ['jumped', 'over', 'the', 'lazy'], ['dog']]
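The duplicate sublists in your output come from total_list holding two references to the same micro object, which micro.clear() then empties in place; slicing sidesteps that entirely. The whole thing also fits in a single list comprehension:
result = [text[i:i + step] for i in range(0, len(text), step)]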

What is the most pythonic way to split a string into contiguous, overlapping lists of words

Say I had a sentence "The cat ate the mouse." I want to split the sentence with size = 2.
So the result array becomes:
["the cat", "cat ate", "ate the", "the mouse"]
If my size was 3, it should become:
["the cat ate", "cat ate the", "ate the mouse"]
The method I have right now uses tons of for loops and I'm not sure if there is a better way.
Using a list slice, you can get a sub-list.
>>> words = "The cat ate the mouse.".rstrip('.').split()
>>> words[0:3]
['The', 'cat', 'ate']
Use str.join to convert the list to a string joined by a delimiter:
>>> ' '.join(words[0:3])
'The cat ate'
A list comprehension provides a concise way to create the list of word groups:
>>> n = 2
>>> [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
['The cat', 'cat ate', 'ate the', 'the mouse']
>>> n = 3
>>> [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
['The cat ate', 'cat ate the', 'ate the mouse']
# [' '.join(words[0:3]), ' '.join(words[1:4]),...]
You can use the nltk library to do all the work:
import nltk
from nltk.util import ngrams

text = "The cat ate the mouse."
tokens = nltk.word_tokenize(text)
trigrams = ngrams(tokens, 3)  # n=3 yields trigrams; use 2 for bigrams
for gram in trigrams:
    print(gram)
which gives us:
('The', 'cat', 'ate')
('cat', 'ate', 'the')
('ate', 'the', 'mouse')
('the', 'mouse', '.')
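If you would rather not depend on nltk, zip produces the same n-grams from a plain word list; a minimal sketch:
words = "The cat ate the mouse.".rstrip('.').split()
n = 3
trigrams = list(zip(*(words[i:] for i in range(n))))
# [('The', 'cat', 'ate'), ('cat', 'ate', 'the'), ('ate', 'the', 'mouse')]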

Python: Expanding the scope of the iterator variable in the any() function

I wrote some structurally equivalent real-world code, where I anticipated that the result for firstAdjective would be quick. But the result shows that word is out of scope. What is a neat solution that will overcome this but still retain the 'linguistic' style of what I want to do?
>>> text = 'the quick brown fox jumps over the lazy dog'
>>> adjectives = ['slow', 'quick', 'brown', 'lazy']
>>> if any(word in text for word in adjectives):
... firstAdjective = word
...
Traceback (most recent call last):
File "<interactive input>", line 2, in <module>
NameError: name 'word' is not defined
You can use next on a generator expression:
firstAdjective = next((word for word in adjectives if word in text), None)
if firstAdjective:
...
A default value of None is returned when no match is found (credit @Bakuriu).
Trial:
>>> firstadjective = next((word for word in adjectives if word in text), None)
>>> firstadjective
'quick'
Would something like
for word in filter(lambda x: x in adjectives, text.split()):
    print(word)
work?
Your example does not work because the word variable only exists during the evaluation of the condition. Once it is done finding whether the condition is true or false, the variable goes out of scope and does not exist anymore.
Even declaring it first doesn't help, as it doesn't affect the existing variable:
text = 'the quick brown fox jumps over the lazy dog'
adjectives = ['slow', 'quick', 'brown', 'lazy']
word = None
if any(word in text for word in adjectives):
    print(word)
None
Therefore, you have to do it differently, using the next function:
text = 'the quick brown fox jumps over the lazy dog'
adjectives = ['slow', 'quick', 'brown', 'lazy']
word = next((word for word in adjectives if word in text), None)
if word:
    print(word)
quick
Note, perhaps word is a misleading variable name here, because it doesn't necessarily have to match a whole word, e.g.:
text = 'the quick brown fox jumps over the lazy dog'
adjectives = ['slo', 'uick', 'rown', 'azy']
word = next((word for word in adjectives if word in text), None)
if word:
    print(word)
uick
so it might be best to split the text into words first:
text = 'the quick brown fox jumps over the lazy dog'
adjectives = ['slow', 'quick', 'brown', 'lazy']
word_list = text.split()
word = next((word for word in adjectives if word in word_list), None)
if word:
    print(word)
but note that if we change the order of the adjectives:
text = 'the quick brown fox jumps over the lazy dog'
adjectives = ['brown', 'quick', 'slow', 'lazy']
word_list = text.split()
word = next((word for word in adjectives if word in word_list), None)
if word:
    print(word)
we get:
brown
which isn't the first adjective in the text.
Therefore, we need to check in the order of the words in the text, not the order of adjectives in the list.
text = 'the quick brown fox jumps over the lazy dog'
adjectives = ['brown', 'quick', 'slow', 'lazy']
word_list = text.split() # NOTE: this "optimization" is no longer necessary now
word = next((word for word in word_list if word in adjectives), None)
if word:
    print(word)
credit to HolyDanna for spotting this
>>> text = 'the quick brown fox jumps over the lazy dog'
>>>
>>> # 1. Case with no match among the adjectives: Result None as expected
>>> adjectives = ['slow', 'crippled']
>>> firstAdjective = next((word for word in adjectives if word in text), None)
>>> firstAdjective
>>>
>>> # 2. Case with a match to 1st available in adjectives but actually 2nd in the text
>>> adjectives = ['slow', 'brown', 'quick', 'lazy']
>>> firstAdjective = next((word for word in adjectives if word in text), None)
>>> firstAdjective
'brown'
>>> # 3. Case with a match to 1st available in the text, which is what is wanted
>>> firstAdjective = next((word for word in text.split() if word in adjectives), None)
>>> firstAdjective
'quick'
>>> # 4. Case where .split() is omitted. NOTE: This does not work.
>>> # In python 2.7 at least, .split() is required
>>> firstAdjective = next((word for word in text if word in adjectives), None)
>>> firstAdjective
>>>
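Case 4 returns None because iterating over a string yields single characters, not words, and no one-character string matches an adjective:
>>> list(text)[:5]
['t', 'h', 'e', ' ', 'q']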

Dictionary and position list back to sentence

I've managed to get my program to store a sentence or two into a dictionary and at the same time create a word position list.
What I need to do now is recreate the original sentence just from the dictionary and the position list. I've done lots of searches but the results I'm getting are either not what I need or are too confusing and beyond me.
Any help would be much appreciated, thanks.
Here is my code so far:
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
print ('This is the sentence with spaces before the punctuations:', sentence)
words_list = sentence.split()
print ('A list of the words in the sentence:', words_list)
dictionary = {}
word_pos_list = []
counter = 0
for word in words_list:
    if word not in dictionary:
        counter += 1
        dictionary[word] = counter
    word_pos_list.append(dictionary[word])
print ('The positions of the words in the sentence are:', word_pos_list)
While, as mentioned in the comments, dictionaries are not sorted data structures, if you are breaking up a sentence, indexing it into a dictionary, and trying to put it back together, you can use an OrderedDict from the collections library to do what you're doing.
That said, this is without any further background on how you are splitting your sentence (punctuation etc.; I suggest looking into NLTK if you are doing any sort of natural language processing (NLP)).
from collections import OrderedDict
In [182]: def index_sentence(s):
.....: return {s.split(' ').index(i): i for i in s.split(' ')}
.....:
In [183]: def build_sentence_from_dict(d):
.....: return ' '.join(OrderedDict(d).values())
.....:
In [184]: s
Out[184]: 'See spot jump over the brown fox.'
In [185]: id = index_sentence(s)
In [186]: id
Out[186]: {0: 'See', 1: 'spot', 2: 'jump', 3: 'over', 4: 'the', 5: 'brown', 6: 'fox.'}
In [187]: build_sentence_from_dict(id)
Out[187]: 'See spot jump over the brown fox.'
In [188]:
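One caveat with index_sentence as written: list.index always returns the first occurrence, so repeated words collapse onto a single position and the rest are lost. A sketch of a fix using enumerate, which keeps every position:
from collections import OrderedDict

def index_sentence(s):
    # enumerate keeps duplicate words; list.index would collapse them
    return OrderedDict(enumerate(s.split(' ')))

def build_sentence_from_dict(d):
    return ' '.join(d.values())

print(build_sentence_from_dict(index_sentence('See spot jump over the spot fox.')))
# See spot jump over the spot fox.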
To reconstruct from your list you have to reverse the location mapping:
# reconstruct
reversed_dictionary = {x:y for y, x in dictionary.items()}
print(' '.join(reversed_dictionary[x] for x in word_pos_list))
This can be done more nicely using a defaultdict (a dictionary with a predefined default value, in your case a list of locations for each word):
#!/usr/bin/env python3.4
from collections import defaultdict
# preprocessing
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
punctuation = '()?:;,.!/"\''  # a string of characters, so the loop visits each one
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
# using defaultdict this time
word_to_locations = defaultdict(list)
for position, word in enumerate(sentence.split()):
    word_to_locations[word].append(position)
# word -> list of locations
print(word_to_locations)
# location -> word
location_to_word = dict((y, x) for x in word_to_locations for y in word_to_locations[x])
print(location_to_word)
# reconstruct
print(' '.join(location_to_word[x] for x in range(len(location_to_word))))
It's not the randomness of dictionary keys that's the problem here, it's the failure to record every position at which a word was seen, duplicate or not. The following does that and then unwinds the dictionary to produce the original sentence, sans punctuation:
from collections import defaultdict
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = set('()?:;\\,.!/"\'')
sentence = ''.join(character for character in sentence if character not in punctuation)
print ('This is the sentence with no punctuation:', sentence)
words = sentence.split()
print('A list of the words in the sentence:', words)
dictionary = defaultdict(list)
last_word_position = 0
for word in words:
    last_word_position += 1
    dictionary[word].append(last_word_position)
print('A list of unique words in the sentence and their positions:', dictionary.items())
# Now the tricky bit to unwind our random dictionary:
sentence = []
for position in range(1, last_word_position + 1):
    sentence.extend([word for word, positions in dictionary.items() if position in positions])
print(*sentence)
The output of the various print() statements:
This is the sentence: This Sentence is a very, very good sentence. Did you like my very good sentence?
This is the sentence with no punctuation: This Sentence is a very very good sentence Did you like my very good sentence
A list of the words in the sentence: ['This', 'Sentence', 'is', 'a', 'very', 'very', 'good', 'sentence', 'Did', 'you', 'like', 'my', 'very', 'good', 'sentence']
A list of unique words in the sentence and their positions: dict_items([('Sentence', [2]), ('is', [3]), ('a', [4]), ('very', [5, 6, 13]), ('This', [1]), ('my', [12]), ('Did', [9]), ('good', [7, 14]), ('you', [10]), ('sentence', [8, 15]), ('like', [11])])
This Sentence is a very very good sentence Did you like my very good sentence
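As an aside, enumerate can replace the manual last_word_position counter in the loop above; a sketch:
dictionary = defaultdict(list)
for position, word in enumerate(words, start=1):
    dictionary[word].append(position)
last_word_position = len(words)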

Removing empty entries and adding them as whitespace to the previous entry of a list

I'm currently working on a project that requires splitting sentences in order to compare two words (a given word which the user needs to type, giving us the second) to each other, and check the accuracy of the user's typing. I've been using x.split(" ") to do this, however, this is causing me an issue.
Let's say the given sentence was "The quick brown fox", and the user types in "The quick  brown fox" (with an extra space after "quick").
Instead of returning ['The', 'quick ', 'brown', 'fox'], it's returning ['The', 'quick', '', 'brown', 'fox']. This makes it harder to check for accuracy, as I'd like it to be checked word by word.
In other words, I'd like to append any extra spaces to the word that came before, but the split function is creating separate (empty) elements instead. How do I go about removing any empty entries and adding them to the word that came before them?
I'd like this to work for lists where there are multiple '' entries in a row as well, such as ['The', 'quick', '', '', 'brown', 'fox'].
Thanks!
EDIT - The code I'm using to test this is just some variation of x = "The quick  brown fox".split(' '), with different whitespace.
EDIT 2 - I didn't think about this (thanks Malonge), but if the sentence starts with a space, I would actually like that to be counted as well. I don't know how easy that would be, since I'd need to make this particular instance an exception where the whitespace needs to be appended to the word that follows rather than the one that precedes it. However, I'll make a conscious choice to ignore that scenario when calculating accuracy due to the difficulty in implementing it.
You can use a regex for this; it will match each word together with any extra spaces that follow it:
>>> import re
>>> s = "The quick brown fox"
>>> re.findall(r'\S+\s*(?=\s\S|$)', s)
['The', 'quick ', 'brown', 'fox']
Debuggex Demo:
\S+\s*(?=\s\S|$)
Update:
To match leading spaces at the start of the string, some modifications to the above regex are required:
>>> s = "The quick brown fox"
>>> re.findall(r'((?:(?<=^\s)\s*)?\S+\s*(?=\s\S|$))', s)
['The', 'quick ', 'brown', 'fox']
>>> s1 = " The quick brown fox"
>>> re.findall(r'((?:(?<=^\s)\s*)?\S+\s*(?=\s\S|$))', s1)
[' The', 'quick ', 'brown', 'fox']
Debuggex Demo:
((?:(?<=^\s)\s*)?\S+\s*(?=\s\S|$))
You can get there a number of ways, but perhaps the easiest with what you've demonstrated is just to split without specifying a split parameter, which makes it split on runs of whitespace (and drop empty strings), not just on a single space:
>>> s = "The quick brown fox"
>>>
>>> s.split(' ')
['The', 'quick', '', 'brown', 'fox']
>>> s.split()
['The', 'quick', 'brown', 'fox']
You could also get there with:
>>> words = [w for w in s.split(" ") if w]
>>> words
['The', 'quick', 'brown', 'fox']
Or using regex:
>>> import re
>>>
>>> re.split(r'\s+', s)
['The', 'quick', 'brown', 'fox']
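If you prefer to avoid a regex, a plain loop over the split result can fold the empty entries back onto the preceding word (a sketch; merge_spaces is a hypothetical helper, not a library function):
def merge_spaces(parts):
    # Fold each empty entry produced by split(' ') back onto the
    # previous word as a trailing space; a leading empty entry
    # (string starting with a space) is left alone, matching the
    # asker's choice to ignore that case.
    merged = []
    for part in parts:
        if part == '' and merged:
            merged[-1] += ' '
        else:
            merged.append(part)
    return merged

print(merge_spaces("The quick  brown fox".split(' ')))
# ['The', 'quick ', 'brown', 'fox']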
