How do I program a bigram table in Python?

I'm doing this homework, and I am stuck at this point: I can't figure out how to program bigram frequencies for English text, i.e. the 'conditional probability', in Python.
That is, the probability of a token given the preceding token is equal to the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token.
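In symbols, that is P(b|a) = P(ab) / P(a), which estimated from counts becomes count(ab) / count(a), where count(ab) is the number of times token b immediately follows token a.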
I have a text with many letters, and I have already calculated the probability of each letter in this text; for example, the letter 'a' makes up 0.015% of the letters in the text.
The letters are from a-zA-Z, and what I want is:
How can I make a table with the dimensions of the alphabet ((alphabet) x (alphabet)), and how do I calculate the conditional probability for every cell?
It's like:
[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
[(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
... ...
[(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]
For each cell I should calculate the probability, i.e.: what are the chances of getting the letter 'a' if the current letter is 'a', and so on.
I can't get started; I hope you can kickstart me, and I hope it's clear what I need to solve.

Assuming your file has no other punctuation (easy enough to strip out):
import itertools

def pairwise(s):
    a, b = itertools.tee(s)
    next(b)
    return zip(a, b)

def index(c):
    # Map 'a'-'z' to 0-25 and 'A'-'Z' to 26-51 so all 52 letters fit the table.
    return ord(c) - ord('a') if c.islower() else ord(c) - ord('A') + 26

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet

with open('path/to/input') as infile:
    # get pairwise characters from the text
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        given = index(a)  # index (in `counts`) of the "given" character
        char = index(b)   # index of the character that follows the "given" character
        counts[given][char] += 1

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
totals = [sum(row) for row in counts]
for given in range(52):
    if not totals[given]:
        continue  # avoid dividing by zero when a character never occurs
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]
I haven't tested this, but it should be a good start
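For example, the pairwise helper yields each adjacent pair of characters:

print(list(pairwise('abcd')))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]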
Here's a dictionary version, which should be easier to read and debug:
counts = {}
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        given = a  # use the characters themselves as keys
        char = b
        if given not in counts:
            counts[given] = {}
        if char not in counts[given]:
            counts[given][char] = 0
        counts[given][char] += 1

answer = {}
for given, chardict in counts.items():  # iterate over the counts we just built
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count / total
Now, answer contains the probabilities you are after. If you want the probability of 'a', given 'b', look at answer['b']['a']
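As a quick sanity check, the same logic can be run on an in-memory string instead of a file (this reuses the pairwise helper from above; the three "words" are made up for illustration):

words = "ab ab ac".split()
counts = {}
for w in words:
    for a, b in pairwise(w):
        counts.setdefault(a, {}).setdefault(b, 0)
        counts[a][b] += 1
answer = {}
for given, chardict in counts.items():
    total = sum(chardict.values())
    answer[given] = {char: count / total for char, count in chardict.items()}
print(answer)  # {'a': {'b': 0.666..., 'c': 0.333...}}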

Related

How to extract words from repeating strings

Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' from the list, but not just one copy of it; rather, as many copies as the word 'april' actually occurs in the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well, the word itself didn't actually occur three times; all of its characters did. So I want to arrange those characters into 'april' as many times as they appear in the string.
My idea is basically to extract words from random strings, and not just one copy of the word, but every copy of the word that appears in the string. Each word should be extracted and its characters ordered the way I want.
But here I have some annoying conditions: you can't delete all the elements in the list and then just replace them with the word 'april' (you can't replace the whole string with the word 'april'); you can only extract 'april' from the string, not substitute it. You also can't delete the list holding the string. Think of the string as very important data: we only want some of it, and that data must be ordered, so we need to drop everything that doesn't match our "data chain" (the word 'april'). But if you delete the whole string you lose all the important data, and you don't know how to make another of these "data chains", so we can't just put the word 'april' back in the list.
If anyone knows how to solve my weird problem, please help me out; I am a beginner Python programmer. Thank you!
One way is to use itertools.groupby, which groups consecutive identical characters; unpacking the groups into zip then iterates n times, where n is the size of the smallest group (the character with the fewest repeats).
from itertools import groupby

s = 'aaaaaaappppppprrrrrriiiiiilll'
result = ''
for each in zip(*[list(g) for k, g in groupby(s)]):
    result += ''.join(each)
# result = 'aprilaprilapril'
Another possible solution is to create a custom counter that counts each unique run of characters (note that this relies on dictionaries preserving insertion order, so it only works on Python 3.6+):
def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts.keys():
            if current == c:
                counts[c] += 1
        else:
            current = c
            counts[c] = 1
    return counts.keys(), min(counts.values())

result = ''
counts = getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
    result += ''.join(counts[0])
# result = 'aprilaprilapril'
How about using regex?
import re

word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
Outputs:
aprilaprilapril
Works like a charm
Here is a more native approach, with plain iteration.
It has a time complexity of O(n).
It uses an outer loop to iterate over the characters in the search key, and an inner while loop that consumes all consecutive occurrences of that character in the search string while maintaining a counter (any characters before the first key letter are skipped first). Once all consecutive occurrences of the current letter have been consumed, it updates minLetterCount to the minimum of its previous value and this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
def countCompleteSequenceOccurences(searchString, key):
    left = 0
    minLetterCount = 0
    letterCount = 0
    for i, searchChar in enumerate(key):
        if i == 0:
            # skip any characters before the first letter of the key,
            # so leading noise doesn't zero out the count
            while left < len(searchString) and searchString[left] != searchChar:
                left += 1
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1
        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0
    return minLetterCount
Testing:
testCasesToOracles = {
    "aaaaaaappppppprrrrrriiiiiilll": 3,
    "ppppppprrrrrriiiiiilll": 0,
    "aaaaaaappppppprrrrrriiiiii": 0,
    "aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
    "pppppppaaaaaaarrrrrriiiiiilll": 0,
    "zaaaaaaappppppprrrrrriiiiiilll": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilll": 3,
    "aaaaaaappppppprrrrrriiiiiilllzzz": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}

key = "april"
for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril
A word will only occur as many times as its least frequent letter. To account for the possibility of repeated letters in the word (for example, 'appril'), you need to factor this count out. Here is one way of doing this using collections.Counter:
from collections import Counter

def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # now get the effective count by dividing the occurrences in the string
    # by the occurrences in the kernel
    effective_counter = {
        k: int(string_counter.get(k, 0) / v)
        for k, v in kernel_counter.items()
    }
    # the minimum effective count is the number of complete occurrences of the kernel
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
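For example:

print(count_recurrence('april', 'aaaaaaappppppprrrrrriiiiiilll'))
# -> 'aprilaprilapril'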

Finding a word from a text dictionary with given random letters

When a user calls the function (e.g. find_from_dict(letters)), it searches dictionary.txt for a word that can be made from the letters the user entered, choosing the word that uses the most of those letters.
For example, for random input like "BAJPPNLE" it should find "APPLE" in the dictionary, since "APPLE" uses the most letters of "BAJPPNLE".
def find_from_dict(letters):
    n = 0
    y = 0
    x = 0
    dictFile = [line.rstrip('\n') for line in open("dictionary.txt")]
    listLetters = list(letters)
    final = []
    while True:
        if n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] in listLetters:
            x = x + 1
        elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x < len(list(dictFile[n])) and list(dictFile[n])[x] not in listLetters:
            x = 0
            n = n + 1
        elif n < len(dictFile) and len(list(dictFile[n])) <= len(listLetters) and x == len(list(dictFile[n])):
            final.append(dictFile[n])
        elif n < len(dictFile) and len(list(dictFile[n])) > len(listLetters):
            n = n + 1
        else:
            print(final)
            break
I have this code at the moment, but since my dictionary.txt file is huge and the code is inefficient, it takes forever to run.
Does anyone have any idea how I could make this code efficient?
You can speed this up by preparing a word index formed of the sorted letters in your word list. Then look for sorted combinations of the letters in that index:
for example:
from collections import defaultdict
from itertools import combinations

with open("/usr/share/dict/words", "r") as wordList:
    words = defaultdict(list)
    for word in wordList.read().upper().split("\n"):
        words[tuple(sorted(word))].append(word)  # index by sorted letters

def findWords(letters):
    for size in range(len(letters), 2, -1):  # from large to small (minimum 3 letters)
        for combo in combinations(sorted(letters), size):  # combinations of that size
            for word in words[combo]:  # matching words from the index
                yield word  # return them as you go (iterator)
                # If you only want one, change this to: return word
Testing:

while True:
    letters = input("Enter letters:")
    if not letters:
        break
    for word in findWords(letters.upper()):
        stop = input(word)
        if stop:
            break
    print("")
sample output:
Enter letters:BAJPPNLE
JELAB
BEJAN
LEBAN
NABLE
PEBAN
PEBAN
ALPEN
NEPAL
PANEL
PENAL
PLANE
ALPEN
NEPAL
PANEL
PENAL
PLANE
APPLE
NAPPE.
Enter letters:EPROING
PERIGON
PIGEON
IGNORE
REGION
PROGNE
OPINER.
Enter letters:
If you need a solution without using libraries, you will need a recursive approach that does a breadth-first traversal of the combination tree:
with open("/usr/share/dict/words", "r") as wordList:
    words = dict()
    for word in wordList.read().upper().split("\n"):
        words.setdefault(tuple(sorted(word)), list()).append(word)  # index by sorted letters

def findWords(letters, size=None):
    if size is None:
        letters = sorted(letters)
        for size in range(len(letters), 2, -1):
            for word in findWords(letters, size):
                yield word
    elif len(letters) == size:
        for word in words.get(tuple(letters), []):
            yield word
    elif len(letters) > size:
        for i in range(len(letters)):
            for word in findWords(letters[:i] + letters[i+1:], size):
                yield word
You can kind of "cheat" your way through it by pre-processing the dictionary file.
The idea is: instead of having a flat list of words, you have groups of words, where each group is keyed by the sorted letters of its words.
For example, something like:
"aeegr": [
"agree",
"eager",
],
"alps": [
"alps",
"laps",
"pals",
]
Then if you wanted to just find the exact match, you could sort the letters from the input and search in the processed file.
But you want the one that matches the most letters, so what you could do is number the letters with prime numbers (I'm only considering lowercase ascii characters), so that a is 2, b is 3, c is 5, d is 7 and so on.
Then, you can get a number by multiplying all the letters, so for example for alps you'd get 2*37*53*67.
In your dictionary file you then have the numbers obtained the same way for each word.
Like:
262774: [
"alps",
"laps",
"pals",
]
You then go through your dictionary and if the initial number divided by the dictionary number has a remainder of 0, that's a possible match.
The maximum number with a remainder of 0 is the one that you want, because that's the one with the most letters present.
Keep in mind that the numbers might get very big very quickly, depending on how many letters you use.
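Here is a minimal sketch of that idea, assuming lowercase ASCII words (the tiny in-line word list is just a stand-in for dictionary.txt):

from functools import reduce

# One prime per lowercase letter: 'a' -> 2, 'b' -> 3, ..., 'z' -> 101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def encode(word):
    # Multiply the primes of all letters; anagrams share the same product.
    return reduce(lambda acc, c: acc * PRIMES[ord(c) - ord('a')], word, 1)

index = {}
for word in ["agree", "eager", "alps", "laps", "pals"]:
    index.setdefault(encode(word), []).append(word)

def best_match(letters):
    n = encode(letters)
    # A word's letters can all be drawn from `letters` iff its product divides n.
    divisors = [k for k in index if n % k == 0]
    return index[max(divisors)] if divisors else []

print(best_match("aplse"))  # ['alps', 'laps', 'pals']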

Scrabble cheater: scoring wildcard characters to zero in Python

I'm new to the Python world, and I wrote a Scrabble word finder that supports two wildcards (* and ?). When scoring a word, I want wildcard letters to score zero, but it doesn't work. I'm wondering what is missing here.
If you look at the lines after "# Add score and valid word to the empty list": I tried to code it so that if a letter in the word is not in the rack, I remove that letter, so that I only score the characters that match letters in the rack and not ones that come from wildcards. For example, if I have B* in my rack and the word is BO, I want to remove O and only score B, so that the wildcard scores zero.
But the result is not what I expected.
import sys

if len(sys.argv) < 2:
    print("no rack error.")
    exit(1)

rack = sys.argv[1]
rack_low = rack.lower()

# Turn the words in the sowpods.txt file into a Python list.
with open("sowpods.txt", "r") as infile:
    raw_input = infile.readlines()
data = [datum.strip('\n') for datum in raw_input]

# Find all of the valid sowpods words that can be made
# up of the letters in the rack.
valid_words = []

# Check each word in sowpods.txt.
for word in data:
    # Change the word to lowercase so comparisons don't fail due to case.
    word_low = word.lower()
    candidate = True
    rack_letters = list(rack_low)
    # Iterate over each letter in the word and check if the letter is in the
    # Scrabble rack. If it is, remove the letter from the rack. If it isn't,
    # try to use up a wildcard; otherwise this word is not a candidate.
    for letter in word_low:
        if letter in rack_letters:
            rack_letters.remove(letter)
        elif '*' in rack_letters:
            rack_letters.remove('*')
        elif '?' in rack_letters:
            rack_letters.remove('?')
        else:
            candidate = False
    if candidate == True:
        # Add score and valid word to the empty list
        total = 0
        for letter in word_low:
            if letter not in rack_letters:
                word_strip = word_low.strip(letter)
                for letter in word_strip:
                    total += scores[letter]
        valid_words.append([total, word_low])
I'm going to go a slightly different route with my answer and hopefully speed the overall process up. We're going to import another function from the standard library -- permutations -- and then find possible results by trimming the total possible word list by the length of the rack (or, whatever argument is passed).
I've commented accordingly.
import sys
from itertools import permutations  # So we can get our permutations from all the letters.

if len(sys.argv) < 2:
    print("no rack error.")
    exit(1)

rack = sys.argv[1]
rack_low = rack.lower()

# Turn the words in the sowpods.txt file into a Python list.
txt_path = r'C:\sowpods.txt'
with open(txt_path, 'r') as infile:
    raw_input = infile.readlines()

# Added .lower() here.
data = [i.strip('\n').lower() for i in raw_input]

## Sample rack of 7 letters with wildcard character.
sample_rack = 'jrnyoj?'

# Remove any non-alphabetic characters (i.e. - wildcards)
# We're using the isalpha() method.
clean_rack = ''.join([i for i in sample_rack if i.isalpha()])

# Trim word list to the letter count in the rack.
# (You can skip this part, but it might make producing results a little quicker.)
trimmed_data = [i for i in data if len(i) <= len(clean_rack)]

# Create all permutations from the letters in the rack.
# We'll iterate over a count from 2 to the length of the rack
# so that we get all relevant permutations.
all_permutations = list()
for i in range(2, len(clean_rack) + 1):
    all_permutations.extend(list(map(''.join, permutations(clean_rack, i))))

# We'll use set().intersection() to help speed the discovery process.
valid_words = list(set(all_permutations).intersection(set(trimmed_data)))

# Print sorted list of results to check.
print(f'Valid words for a rack containing letters \'{sample_rack}\' are:\n\t* ' + '\n\t* '.join(sorted(valid_words)))
Our output would be the following:
Valid words for a rack containing letters 'jrnyoj?' are:
* jo
* jor
* joy
* no
* nor
* noy
* ny
* on
* ony
* or
* oy
* yo
* yon
If you want to verify that the results are actually in the sowpods.txt file, you can just index the sowpods.txt list by where the word you want to look up is indexed:
trimmed_data[trimmed_data.index('jor')]
When you are totalling the scores, you are iterating over the letters of the word from the word list rather than the letters of the inputted rack:

total = 0
for letter in word_low:
    ...

Rather, this should be:

total = 0
for letter in rack_low:
    ...

Also, you do not need to loop and remove the letters with strip at the end. You can just have:

total = 0
for letter in rack_low:
    if letter not in rack_letters:
        try:
            total += scores[letter]
        except KeyError:  # If letter is * or ? then a KeyError occurs
            pass
valid_words.append([total, word_low])
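As a quick illustration of why the KeyError approach scores wildcards as zero, consider a rack of "b*" that has just spelled the word "bo" (the two-letter scores dict here is only a stand-in for the real score table):

scores = {'b': 3, 'o': 1}  # stand-in; the real table has all 26 letters
rack_low = 'b*'
rack_letters = []  # both rack tiles were consumed spelling "bo"
total = 0
for letter in rack_low:
    if letter not in rack_letters:
        try:
            total += scores[letter]
        except KeyError:  # '*' is not in scores, so it adds nothing
            pass
print(total)  # 3 -- the 'o' came from the wildcard, so it scores 0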

How do you count a negative or positive word prior to a specific word - Sentiment Analysis in Python?

I'm trying to count how many times a negative word from a list appears before a specific word. For example, take "This terrible laptop." With the specified word being "laptop", I want the output to be "terrible 1" in Python.
def run(path):
    negWords = {}  # dictionary to return the count
    # load the negative lexicon
    negLex = loadLexicon('negative-words.txt')
    fin = open(path)
    for line in fin:  # for every line in the file (1 review per line)
        line = line.lower().strip().split(' ')
        review_set = set()  # add all the words in the review to a set
        for word in line:
            review_set.add(word)  # as it is a set, each word is only added once
        for word in review_set:  # check if the word is in the negative lexicon
            if word in negLex:
                if word in negWords:
                    negWords[word] = negWords[word] + 1
                else:
                    negWords[word] = 1
    fin.close()
    return negWords

if __name__ == "__main__":
    print(run('textfile'))
This should do what you're looking for; it uses sets and intersection to avoid some of the looping. The steps are:
get the negative words in the line
check the location of each word
if the word after that location is 'laptop', record it
Note that this will only find the first occurrence of a negative word in a line, so "terrible terrible laptop" will not be a match.
from collections import defaultdict

def run(path):
    negWords = defaultdict(int)  # a defaultdict(int) starts at 0, so we can just add
    # load the negative lexicon
    negLex = loadLexicon('negative-words.txt')
    # if the above returns a list rather than a set, convert it to a set
    negLex = set(negLex)
    fin = open(path)
    for line in fin:  # for every line in the file (1 review per line)
        line = line.lower().strip().split(' ')
        # we can pass a list to set() to make a set of its items
        review_set = set(line)
        # Compare the review set against the negLex set. We want words that are in
        # *both* sets, so we can use intersection.
        neg_words_used = review_set & negLex
        # Is the bad word followed by the word laptop?
        for word in neg_words_used:
            # Find the word in the line list
            ix = line.index(word)
            if ix > len(line) - 2:
                # Can't have laptop after it; it's the last word.
                continue
            # Check whether the word after this index in the line is laptop.
            if line[ix+1] == 'laptop':
                negWords[word] += 1
    fin.close()
    return negWords
If you're only interested in words preceding the word 'laptop', a far more sensible approach would be to look for the word 'laptop', then check the word prior to that to see if it is a negative word. The following example does that.
find laptop in the current line
if laptop isn't in the line, or is the first word, skip the line
get the word before laptop, check against the negative words
if you have a match add it to our result
This avoids doing lookups for words which are not related to laptops.
from collections import defaultdict

def run(path):
    negWords = defaultdict(int)  # a defaultdict(int) starts at 0, so we can just add
    # load the negative lexicon
    negLex = loadLexicon('negative-words.txt')
    # if the above returns a list rather than a set, convert it to a set
    negLex = set(negLex)
    fin = open(path)
    for line in fin:  # for every line in the file (1 review per line)
        line = line.lower().strip().split(' ')
        try:
            ix = line.index('laptop')
        except ValueError:
            # If we don't find laptop, continue to the next line.
            continue
        if ix == 0:
            # Laptop is the first word of the line; we can't check the prior word.
            continue
        previous_word = line[ix-1]
        if previous_word in negLex:
            # Negative word before the current one.
            negWords[previous_word] += 1
    fin.close()
    return negWords
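To see the core of the check in isolation (negLex here is a stand-in for the loaded lexicon):

negLex = {'terrible', 'awful'}
line = "this terrible laptop".split()
ix = line.index('laptop')
if ix > 0 and line[ix - 1] in negLex:
    print(line[ix - 1], 1)  # -> terrible 1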
It looks like you want to check a condition against consecutive words. Here is one way to do it: the condition is checked against every pair of consecutive words.
text = 'Do you like bananas? Not only do I like bananas, I love bananas!'
trigger_words = {'bananas'}
positive_words = {'like', 'love'}

def condition(w):
    return w[0] in positive_words and w[1] in trigger_words

for c in '.,?!':
    text = text.replace(c, '')
words = text.lower().split()
matches = list(filter(condition, zip(words, words[1:])))  # a list, so it can be reused below

n_positives = 0
for w1, w2 in matches:
    print(f'{w1.upper()} {w2} => That\'s positive !')
    n_positives += 1
print(f'This text had a score of {n_positives}')
Output:
LIKE bananas => That's positive !
LIKE bananas => That's positive !
LOVE bananas => That's positive !
This text had a score of 3
Bonus:
You can search for 3 consecutive words by just changing zip(words, words[1:]) to zip(words, words[1:], words[2:]), with a condition that checks three words; a sketch follows below.
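A sketch of that three-word variant (the condition shown is just one possible choice):

def condition3(w):
    # one possible choice: a positive word with any single word between it and the trigger
    return w[0] in positive_words and w[2] in trigger_words

matches3 = list(filter(condition3, zip(words, words[1:], words[2:])))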
You can also get a counter dictionary of the matched positive words:

from collections import Counter
counter = Counter(i[0] for i in matches)  # counter == {'like': 2, 'love': 1}

10 most frequent words in a string in Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least, as well as the number of times each one is used. I can't use the dictionary or Counter function. So far I have this:
import urllib

cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
    for word in words:
        while i < len(uniques):
            i += 1
            if word in uniques:
                cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you
The above problem can be done easily using Python collections; below is the solution.
from collections import Counter

data_set = "Welcome to the world of Geeks " \
           "This portal has been created to provide well written well " \
           "thought and well explained solutions for selected questions " \
           "If you like Geeks for Geeks and would like to contribute " \
           "here is your chance You can write article and mail your article " \
           "to contribute at geeksforgeeks org See your article appearing on " \
           "the Geeks for Geeks main page and help thousands of other Geeks."

# split() returns a list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to an instance of the Counter class.
Counters_found = Counter(split_it)
# print(Counters_found)

# most_common() produces the k most frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)
You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the list of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0               # Initialize the count to zero.
    for word in words:      # Iterate over the words.
        if word == unique:  # Is this word equal to the current unique?
            count += 1      # If so, increment the count.
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
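For reference, here is a minimal sketch of the trie idea mentioned above (one possible shape, not the only one): nested dicts form the trie, and a sentinel key stores each word's count.

END = object()  # sentinel key marking the end of a word

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = node.get(END, 0) + 1

def trie_items(node, prefix=''):
    # Yield (word, count) pairs stored in the trie.
    for key, child in node.items():
        if key is END:
            yield prefix, child
        else:
            for item in trie_items(child, prefix + key):
                yield item

root = {}
for word in words:  # reuses the lowercase word list from above
    trie_insert(root, word)
for count, word in sorted(((c, w) for w, c in trie_items(root)), reverse=True)[:10]:
    print('%s %d' % (word, count))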
from string import punctuation  # you will need it to strip the punctuation
import urllib

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower()  # so 'The'/'the' and 'You'/'you' are counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k, ]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1

# and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
import urllib
import operator

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or a space

word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' already in word_counter, increment it by 1

for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    # sorts the dict by value, from top to bottom, and takes the 10 top items
    print "%s: %s - %s" % (i+1, word, word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces end up in the counter. It doesn't matter that much, though.
Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
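For instance, the normalization step hinted at above might look like this (a small sketch):

import string

def normalize(word):
    # strip leading/trailing punctuation and lowercase, so 'Someword!' counts as 'someword'
    return word.strip(string.punctuation).lower()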
You can also do it with pandas DataFrames and get the result in a convenient form: a table of words and their frequencies, ordered.
import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res
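Hypothetical usage with a plain list of tokens:

words = "the cat and the hat and the bat".split()
print(count_words(words))
#   unique  count
# 0    the      3
# 2    and      2
# ...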
To do the same operation on a pandas DataFrame, you may use the Counter class from collections as follows:
from collections import Counter

cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1

# Find the most common 10 words in the pandas dataframe
cnt.most_common(10)
