Removing punctuation and creating a dictionary Python - python

I'm trying to create a function that removes punctuation and lowercases every letter in a string. Then, it should return all this in the form of a dictionary that counts the word frequency in the string.
This is the code I wrote so far:
def word_dic(string):
    string = string.lower()
    new_string = string.split(' ')
    result = {}
    for key in new_string:
        if key in result:
            result[key] += 1
        else:
            result[key] = 1
    for c in result:
        "".join([ c if not c.isalpha() else "" for c in result])
    return result
But this is what I'm getting after executing it:
{'am': 3,
'god!': 1,
'god.': 1,
'i': 2,
'i?': 1,
'thanks': 1,
'to': 1,
'who': 2}
I just need to remove the punctuation at the end of the words.

Another option is to use Python's famous "batteries included" standard library.
>>> import re
>>> from collections import Counter
>>> sentence = 'Is this a test? It could be!'
>>> Counter(re.sub(r'\W', ' ', sentence.lower()).split())
Counter({'a': 1, 'be': 1, 'this': 1, 'is': 1, 'it': 1, 'test': 1, 'could': 1})
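This leverages collections.Counter for counting words, and re.sub for replacing everything that's not a word character with a space. Note that an apostrophe is not a word character either, so "it's" would be counted as "it" and "s".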

"".join([ c if not c.isalpha() else "" for c in result]) creates a new string without the punctuation, but it doesn't do anything with it; it's thrown away immediately, because you never store the result.
Really, the best way to do this is to normalize your keys before counting them in result. For example, you might do:
for key in new_string:
    # Keep only the alphabetic parts of each key, and replace key for future use
    key = "".join([c for c in key if c.isalpha()])
    if key in result:
        result[key] += 1
    else:
        result[key] = 1
Now result never has keys with punctuation (and the counts for "god." and "god!" are summed under the key "god" alone), and there is no need for another pass to strip the punctuation after the fact.
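Alternatively, if you only care about leading and trailing punctuation on each word (so "it's" should be preserved as is, not converted to "its"), you can simplify a lot further. Simply import string (and rename the function's string parameter, for example to text, so that it no longer shadows the module), then change: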
key = "".join([c for c in key if c.isalpha()])
to:
key = key.rstrip(string.punctuation)
This matches what you specifically asked for in your question (remove punctuation at the end of words, but not at the beginning or embedded within the word). To strip leading punctuation as well, use key.strip(string.punctuation) instead.
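Putting the pieces above together, a minimal sketch of the whole function might look like the following. It assumes the parameter is renamed to text (so the string module is not shadowed) and uses strip rather than rstrip, so leading punctuation is removed as well. The sample sentence is made up for illustration and is not the asker's original input.

```python
import string

def word_dic(text):
    """Count words, ignoring case and punctuation at either end of a word."""
    result = {}
    for raw in text.lower().split():
        # Normalize the key before counting; embedded apostrophes survive.
        key = raw.strip(string.punctuation)
        if not key:
            # Skip tokens that consisted of punctuation only, e.g. "--".
            continue
        result[key] = result.get(key, 0) + 1
    return result

# Hypothetical sample input, not taken from the question.
counts = word_dic("Thanks to God! It's who I am. Who am I? God.")
print(counts)
```

Calling split() without an argument also avoids the empty-string keys that split(' ') produces when the input contains consecutive spaces.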

You can use string.punctuation to recognize punctuation and collections.Counter to count occurrences once the string is correctly decomposed.
from collections import Counter
from string import punctuation
line = "It's a test and it's a good ol' one."
Counter(word.strip(punctuation) for word in line.casefold().split())
# Counter({"it's": 2, 'a': 2, 'test': 1, 'and': 1, 'good': 1, 'ol': 1, 'one': 1})
Using str.strip instead of str.replace preserves words such as It's, because only leading and trailing punctuation is removed.
The method str.casefold is a more aggressive version of str.lower, intended for caseless matching.
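To illustrate the difference between the two methods, here is a small sketch with made-up sample words. The German sharp s is a common case where lower() leaves two spellings distinct while casefold() merges them.

```python
# Three spellings of the same word (illustrative sample data).
samples = ["Straße", "STRASSE", "strasse"]

lowered = {word.lower() for word in samples}
folded = {word.casefold() for word in samples}

print(lowered)  # two distinct strings remain after lower()
print(folded)   # a single string remains after casefold()
```

For plain ASCII text the two methods give the same result, so the choice only matters when the input may contain such characters.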

If you want to reuse the words later, you can store each one in a sub-dictionary along with its number of occurrences, so that each word has its own place in the dictionary. We can write our own function to remove punctuation; it's pretty simple.
See if the code below serves your needs:
def remove_punctuation(word):
    for c in word:
        if not c.isalpha():
            word = word.replace(c, '')
    return word
def word_dic(s):
    words = s.lower().split(' ')
    result = {}
    for word in words:
        word = remove_punctuation(word)
        if not result.get(word, None):
            result[word] = {
                'word': word,
                'ocurrences': 1,
            }
            continue
        result[word]['ocurrences'] += 1
    return result
phrase = 'Who am I and who are you? Are we gods? Gods are we? We are what we are!'
print(word_dic(phrase))
and you'll have an output like this:
{
'who': {
'word': 'who',
'ocurrences': 2},
'am': {
'word': 'am',
'ocurrences': 1},
'i': {
'word': 'i',
'ocurrences': 1},
'and': {
'word': 'and',
'ocurrences': 1},
'are': {
'word': 'are',
'ocurrences': 5},
'you': {
'word': 'you',
'ocurrences': 1},
'we': {
'word': 'we',
'ocurrences': 4},
'gods': {
'word': 'gods',
'ocurrences': 2},
'what': {
'word': 'what',
'ocurrences': 1}
}
Then you can easily access each word and its occurrences simply by doing:
word_dic(phrase)['are']['word'] # output: are
word_dic(phrase)['are']['ocurrences'] # output: 5

Related

Python frequency table, exclude characters

Good evening,
I'm wondering how I can exclude certain characters from a frequency table.
First I read the file and create a frequency table. After this I convert it to a list of tuples to be able to get a percentage of occurrence for each letter.
However, I am wondering how I can make it skip certain letters,
i.e. an exclude list.
with open('test.txt', 'r') as file:
    data = file.read().replace('\n', '')
frequency_table = {char : data.count(char) for char in set(data)}
x0= ("Character frequency table for '{}' is :\n {}".format(data, str(frequency_table)))
from collections import Counter
res = [(*key, val) for key, val in Counter(frequency_table).most_common()]
print("Frequency Tuple list : " + str(res))
print(res[1][1]/res[1][1])
Sounds like you need an if at the end of your dictionary comprehension:
frequency_table = {char : data.count(char) for char in set(data) if char not in EXCLUDE}
You can then set your EXCLUDE as, for example:
a list, i.e. ['a', 'b', 'c', 'd'] or list('abcd')
or, you can just use a string of characters directly, such as 'abcd'.
>>> data = 'aaaabcdefededefefedf'
>>> EXCLUDE_LIST = 'a'
>>> frequency_table = {char : data.count(char) for char in set(data) if char not in EXCLUDE_LIST}
>>> frequency_table
{'b': 1, 'c': 1, 'e': 6, 'f': 4, 'd': 4}
Add an if to your comprehension:
exclude = {'a', 'r'}
frequency_table = {char: data.count(char) for char in set(data) if char not in exclude}
Alternatively, use the difference between two sets:
exclude = {'a', 'r'}
frequency_table = {char: data.count(char) for char in set(data) - exclude}
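Since the question also asks for a percentage per letter, the filter above can be combined with collections.Counter, which counts in a single pass instead of calling data.count once per distinct character. This is only a sketch: the function name frequency_table and the choice to compute percentages over the non-excluded characters are assumptions, not something stated in the question.

```python
from collections import Counter

def frequency_table(data, exclude=""):
    """Return {char: (count, percent)} for characters not in `exclude`.

    Percentages are relative to the characters that were kept (an assumption).
    """
    counts = Counter(ch for ch in data if ch not in exclude)
    total = sum(counts.values())
    # most_common() yields the pairs ordered from most to least frequent.
    return {ch: (n, 100 * n / total) for ch, n in counts.most_common()}

table = frequency_table("aaaabcdefededefefedf", exclude="a")
print(table["e"])
```

If every character is excluded the Counter is empty, so the comprehension never divides by zero and the function returns an empty dict.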

Checking keys in a defaultdict

I have this code that should run through the keys in a Python defaultdict, and if a key isn't in the defaultdict, it gets added.
I'm getting an error that I don't encounter with regular defined dictionaries, and I'm having a bit of trouble working it out:
The code:
from collections import defaultdict
def counts(line):
    for word in line.split():
        if word not in defaultdict.keys():
            word = "".join(c for c in word if c not in ('!', '.', ':', ','))
            defaultdict[word] = 0
        if word != "--":
            defaultdict[word] += 1
The error:
if word not in defaultdict.keys():
TypeError: descriptor 'keys' of 'dict' object needs an argument
You did not construct a defaultdict object here, you simply refer to the defaultdict class.
You can create one, like:
from collections import defaultdict
def counts(line):
    dd = defaultdict(int)
    for word in line.split():
        word = ''.join(c for c in word if c not in ('!', '.', ':', ','))
        if word not in dd:
            dd[word] = 0
        if word != '--':
            dd[word] += 1
    return dd
That being said, you probably want to use a Counter here, like:
from collections import Counter
def counts(line):
    words = (
        ''.join(c for c in word if c not in ('!', '.', ':', ',')) for word in line.split()
    )
    return Counter(
        word for word in words if word != '--'
    )
defaultdict is a class; you need an object:
from collections import defaultdict
def counts(line, my_dict):
    for word in line.split():
        # Strip punctuation first, so that "man." and "man" share one key
        word = "".join(c for c in word if c not in ('!', '.', ':', ','))
        if word != "--":
            my_dict[word] += 1
# int as the default factory makes missing keys start at 0
my_dict = defaultdict(int)
counts("Now is the time for all good parties to come to the aid of man.", my_dict)
print(my_dict)
Output:
defaultdict(<class 'int'>, {'Now': 1, 'is': 1, 'the': 2, 'time': 1, 'for': 1, 'all': 1, 'good': 1, 'parties': 1, 'to': 2, 'come': 1, 'aid': 1, 'of': 1, 'man': 1})
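The class-versus-instance distinction behind the TypeError can be reproduced in isolation. The sketch below uses a made-up sample sentence; the exact wording of the error message differs between Python versions, so it is captured rather than assumed.

```python
from collections import defaultdict

# Calling an instance method on the class itself fails:
# there is no dictionary object whose keys could be returned.
try:
    defaultdict.keys()
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

# An instance created with int as its factory starts missing keys at 0.
word_counts = defaultdict(int)
for word in "to be or not to be".split():
    word_counts[word] += 1

print(error_message)
print(dict(word_counts))
```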

Python split text into tokens using regex

Hi, I have a question about splitting strings into tokens.
Here is an example string:
string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."
and I'm trying to split the string correctly into its tokens.
Here is my function count_words:
import re
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict() # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
and here is the result of split:
['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a',
'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he',
'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut',
'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left',
'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed',
'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it',
'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong',
'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and',
'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he',
'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling',
'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a',
'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for',
'the', 'more', 'favoured', 'of', 'his', 'guests', '']
As you can see, there is an empty string '' at the last index of the split list.
Please help me understand this empty string in the list and how to split this example string correctly.
You could use a list comprehension to iterate over the list items produced by re.split and only keep them if they are not empty strings:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict() # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    split = [x for x in split if x != ''] # <- list comprehension
    print(split)
You should also consider returning the data from the function, and printing it from the caller rather than printing it from within the function. That will provide you with flexibility in future.
That happens because the string ends with a '.', which is part of the split pattern. re.split splits at that final match too, and what remains after it is an empty string, which is why you see '' at the end.
I suggest using re.findall instead, which works the opposite way by matching the words rather than the separators:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict() # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
Python's documentation for re.split explains this behavior:
If there are capturing groups in the separator and it matches at the
start of the string, the result will start with an empty string. The
same holds for the end of the string
Even though yours is not actually a capturing group, the effect is the same. Note that it could be at the end as well as at the start (for instance if your string started with a whitespace).
The two solutions already proposed (more or less) by others are these:
Solution 1: findall
As other users pointed out, you can use findall and invert the logic of the pattern. With yours, you can easily negate the character class: [^\s\.,!?:;'\"-]+.
But it depends on your regex pattern, because it is not always that easy.
Solution 2: check on the starting and ending token
Instead of checking whether each token is != '', you can just look at the first and the last token. Because the + quantifier consumes every run of separator characters in one match, an empty string can only appear at the very start or the very end of the list.
split = re.split(r"[\s\.,!?:;'\"-]+", lowerText)
# The "split and" guard avoids an IndexError when the list becomes empty
if split and split[0] == '':
    split = split[1:]
if split and split[-1] == '':
    split = split[:-1]
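The behaviour described above, and both workarounds, can be checked with a short sketch. The sample strings and the name SEPARATORS are made up for illustration; the pattern is the one from the question, written as a raw string.

```python
import re

# The question's separator class, as a raw string (illustrative name).
SEPARATORS = r"[\s.,!?:;'\"-]+"

# A separator at the end leaves a trailing empty string...
trailing = re.split(SEPARATORS, "one, two.")
# ...and a separator at the start leaves a leading one.
leading = re.split(SEPARATORS, " one two")

# Workaround 1: drop empty tokens after splitting.
filtered = [token for token in re.split(SEPARATORS, " one, two.") if token]

# Workaround 2: match the words instead, using the negated class.
matched = re.findall(r"[^\s.,!?:;'\"-]+", " one, two.")

print(trailing, leading, filtered, matched)
```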
You get an empty string because the final period also matches the split pattern at the end of the string, and nothing comes after it. You can, however, filter out empty strings with the filter function and thus complete your function:
import re
import collections
def count_words(text):
    """Count how many times each unique word occurs in text."""
    lowerText = text.lower()
    split = re.split(r"[ .,!?:;'\"\-]+", lowerText)
    # filter out empty strings and count words:
    return collections.Counter(filter(None, split))
count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})
import string
def count_words(text):
    counts = dict()
    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()
    words = text.split()
    print(words)
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
It works.

how to dict map each word to list of words which follow it in python?

What I am trying to do:
Build a dict that maps each word that appears in the file
to a list of all the words that immediately follow that word in the file.
The list of words can be in any order and should include
duplicates. So, for example, the key "and" might have the list
["then", "best", "then", "after", ...] listing
all the words which came after "and" in the text.
f = open(filename,'r')
s = f.read().lower()
words = s.split()#list of words in the file
dict = {}
l = []
i = 0
for word in words:
    if i < (len(words)-1) and word == words[i]:
        dict[word] = l.append(words[i+1])
print dict.items()
sys.exit(0)
collections.defaultdict is helpful for such iterations. For simplicity, I've invented a string rather than loaded from a file.
from collections import defaultdict
import string
x = '''This is a random string with some
string elements repeated. This is so
that, with someluck, we can solve a problem.'''
translator = str.maketrans('', '', string.punctuation)
# split() with no argument splits on any whitespace, including newlines
y = x.lower().translate(translator).split()
result = defaultdict(list)
for i, j in zip(y, y[1:]):
    result[i].append(j)
# result
# defaultdict(list,
# {'a': ['random', 'problem'],
# 'can': ['solve'],
# 'elements': ['repeated'],
# 'is': ['a', 'so'],
# 'random': ['string'],
# 'repeated': ['this'],
# 'so': ['that'],
# 'solve': ['a'],
# 'some': ['string'],
# 'someluck': ['we'],
# 'string': ['with', 'elements'],
# 'that': ['with'],
# 'this': ['is', 'is'],
# 'we': ['can'],
# 'with': ['some', 'someluck']})
You can use defaultdict for this:
from collections import defaultdict
words = ["then", "best", "then", "after"]
words_dict = defaultdict(list)
for w1, w2 in zip(words, words[1:]):
    words_dict[w1].append(w2)
Results:
defaultdict(<class 'list'>, {'then': ['best', 'after'], 'best': ['then']})
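For the file-based case in the question, the same pairing idea can be wrapped in a function that takes the file's text. This is a sketch under assumptions: the name followers is hypothetical, and stripping punctuation is a choice borrowed from the first answer rather than something the question asks for.

```python
from collections import defaultdict
import string

def followers(text):
    """Map each word to the list of words that immediately follow it."""
    table = str.maketrans("", "", string.punctuation)
    # split() with no argument handles spaces and newlines alike.
    words = text.lower().translate(table).split()
    result = defaultdict(list)
    # Pair every word with its successor; the last word has none.
    for current, following in zip(words, words[1:]):
        result[current].append(following)
    return dict(result)

# Hypothetical sample text standing in for the file contents.
mapping = followers("And then the best, and then after.")
print(mapping["and"])
```

With a real file, the text argument would simply be the result of reading it, for example inside a with open(filename) as f block.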

Getting key values from a list of dictionaries

I have a list that contains dictionaries with Letters and Frequencies. Basically, I have 53 dictionaries, one for each letter of the alphabet (lowercase and uppercase) plus one for the space character.
adict = {'Letter':'a', 'Frequency':0}
bdict = {'Letter':'b', 'Frequency':0}
cdict = {'Letter':'c', 'Frequency':0}
If you input a word, it will scan the word and update the frequency for its corresponding letter.
for ex in range(0, len(temp)):
    if temp[count] == 'a': adict['Frequency']+=1
    elif temp[count] == 'b': bdict['Frequency']+=1
    elif temp[count] == 'c': cdict['Frequency']+=1
For example, if I enter the word "Hello", the letters H, e, l, l, o are detected and their frequencies updated. Non-zero frequencies are then transferred to a new list.
if adict['Frequency'] != 0 : newArr.append(adict)
if bdict['Frequency'] != 0 : newArr.append(bdict)
if cdict['Frequency'] != 0 : newArr.append(cdict)
After this, I had the newArr sorted by Frequency and transferred to a new list called finalArr. Below is a sample list contents for the word "Hello"
{'Letter': 'H', 'Frequency': 1}
{'Letter': 'e', 'Frequency': 1}
{'Letter': 'o', 'Frequency': 1}
{'Letter': 'l', 'Frequency': 2}
Now what I want is to transfer only the values to two separate lists, letterArr and numArr. How do I do this? My desired output is:
letterArr = [H,e,o,l]
numArr = [1,1,1,2]
Why don't you just use a collections.Counter? For example:
from collections import Counter
from operator import itemgetter
word = input('Enter a word: ')
c = Counter(word)
letter_arr, num_arr = zip(*sorted(c.items(), key=itemgetter(1,0)))
print(letter_arr)
print(num_arr)
Note the use of sorted() to sort by increasing frequency. itemgetter(1, 0) builds the sort key as (frequency, letter), so that the sort is performed first on the frequency, and then on the letter. The sorted pairs are then separated into two sequences using zip() on the unpacked list.
Demo
Enter a word: Hello
('H', 'e', 'o', 'l')
(1, 1, 1, 2)
The results are tuples, but you can convert to lists if you want with list(letter_arr) and list(num_arr).
I have a hard time understanding your data structure choice for this problem.
Why don't you just go with a dictionary like this:
frequencies = { 'H': 1, 'e': 1, 'l': 2, 'o': 1 }
Which is even easier to implement with a Counter:
from collections import Counter
frequencies = Counter("Hello")
print(frequencies)
# prints: Counter({'l': 2, 'H': 1, 'e': 1, 'o': 1})
Then, to add another word, you simply use the update method:
frequencies.update("How")
print(frequencies)
# prints: Counter({'H': 2, 'l': 2, 'o': 2, 'e': 1, 'w': 1})
Finally, to get your 2 arrays, you can do:
letterArr, numArr = zip(*frequencies.items())
This will give you tuples though, if you really need lists, just do: list(letterArr)
You wanted a simple answer without further ado like zip, collections, itemgetter etc. This does the minimum to get it done, 3 lines in a loop.
finalArr = [{'Letter': 'H', 'Frequency': 1},
            {'Letter': 'e', 'Frequency': 1},
            {'Letter': 'o', 'Frequency': 1},
            {'Letter': 'l', 'Frequency': 2}]
letterArr = []
numArr = []
for i in range(len(finalArr)):
    letterArr.append(finalArr[i]['Letter'])
    numArr.append(finalArr[i]['Frequency'])
print(letterArr)
print(numArr)
Output is
['H', 'e', 'o', 'l']
[1, 1, 1, 2]
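The same extraction can also be written with two list comprehensions, which avoids indexing by position. This is only a sketch of an alternative: the snake_case names are illustrative, and the data is the sample list from the question.

```python
# Sample data from the question, already sorted by frequency.
final_arr = [
    {'Letter': 'H', 'Frequency': 1},
    {'Letter': 'e', 'Frequency': 1},
    {'Letter': 'o', 'Frequency': 1},
    {'Letter': 'l', 'Frequency': 2},
]

# Pull one field out of every dictionary, preserving the list order.
letter_arr = [entry['Letter'] for entry in final_arr]
num_arr = [entry['Frequency'] for entry in final_arr]

print(letter_arr)
print(num_arr)
```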
