Python count of words by word length - python

I was given a .txt file containing some text. I have already cleaned it (removed punctuation, uppercase letters and symbols), so I now have a string of words.
I am now trying to get the number of characters, len(), of each word in the string, and then make a plot where the number of characters N is on the X-axis and the Y-axis is the number of words that have N characters.
So far I have:
text = "sample.txt"
def count_chars(txt):
    result = 0
    for char in txt:
        result += 1 # same as result = result + 1
    return result
print(count_chars(text))
So far this counts the total length of the text instead of the length of each word.
I would like something similar to collections.Counter, which returns each word together with the number of times it is repeated in the text.
from collections import Counter
word_count=Counter(text)
I want to get the number of characters per word. Once we have that count, the plotting should be easier.
Thanks and anything helps!

Okay, first of all you need to open the sample.txt file and read its contents.
with open('sample.txt', 'r') as text_file:
    text = text_file.read()
or
text = open('sample.txt', 'r').read()
Now we can compute the length of each word in the text and store it, for example, in a dict.
counter_dict = {}
for word in text.split(" "):
    counter_dict[word] = len(word)
print(counter_dict)

It looks like the accepted answer doesn't solve the problem as it was posed by the asker:
Then make a plot where N of characters is on the X-axis and the Y-axis is the number of words that have such N len() of characters
import matplotlib.pyplot as plt
# ch10 = ... the text of "Moby Dick"'s chapter 10, as found
# in https://www.gutenberg.org/files/2701/2701-h/2701-h.htm
# split ch10 into a list of words,
words = [w for w in ch10.split() if w]
# some words are joined by an em-dash
words = sum((w.split('—') for w in words), [])
# remove suffixes and one prefix
for suffix in (',','.',':',';','!','?','"'):
    words = [w.removesuffix(suffix) for w in words]
words = [w.removeprefix('"') for w in words]
# count the different lengths using a dict
d = {}
for w in words:
    l = len(w)
    d[l] = d.get(l, 0) + 1
# retrieve the relevant info from the dict
lengths, counts = zip(*d.items())
# plot the relevant info
plt.bar(lengths, counts)
plt.xticks(range(1, max(lengths)+1))
plt.xlabel('Word lengths')
plt.ylabel('Word counts')
# what is the longest word?
plt.title(' '.join(w for w in words if len(w)==max(lengths)))
# T H E E N D
plt.show()
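The counting step can also be written with collections.Counter, which the question mentions. The following is only a minimal sketch, assuming the text has already been cleaned and can be split on whitespace; the helper name length_histogram and the sample string are made up for illustration.

```python
from collections import Counter

def length_histogram(text):
    """Map each word length to the number of words with that length."""
    return Counter(len(word) for word in text.split())

sample = "the quick brown fox jumps over the lazy dog"  # made-up sample text
hist = length_histogram(sample)
# sort by length so the two sequences line up on the X-axis
lengths, counts = zip(*sorted(hist.items()))
print(lengths, counts)
```

The two tuples can then be passed to plt.bar(lengths, counts) in the same way as in the answer above.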

Related

Ignore words with '

I need to count all unique five-letter words in a txt file and ignore any word with an apostrophe. I'm new to Python, so I am quite confused about how to get just the five-letter words, and I'm not sure how to ignore words that contain an '.
What I have written so far seems to work for finding the unique words, but not for keeping just the five-letter words.
with open ("names.txt", 'r') as f: #open the file
    words = f.read().lower().split() #read the contents into a string, make all the characters lower case and split the string into a list of words
    print(words)
unique_words = set(words) #get unique words
print(len(unique_words))
for w in unique_words:
    if len(w) == 5:
        print(unique_words)
    else:
        pass
Your code looks good. I think the only mistake was to print(unique_words) instead of print(w) when you found a word w of length 5.
To ignore the words containing ', you can add this condition:
for w in unique_words:
    if len(w) == 5 and "'" not in w:
        print(w)
By the way, you don't need the else: pass branch; an if without an else simply does nothing when the condition is false.
This should do the trick:
with open("names.txt", 'r') as f: #open the file
    words = f.read().lower().split() #read the contents into a string, make all the characters lower case and split the string into a list of words
    print(words)
unique_words = set() #create an empty set
for w in words:
    if len(w) == 5 and "'" not in w:
        unique_words.add(w) #add the word to the set
print(len(unique_words))
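The same filter can be expressed as a set comprehension. This is a minimal sketch that works on a string rather than on names.txt; the function name unique_five_letter_words and the sample text are hypothetical.

```python
def unique_five_letter_words(text):
    """Return the distinct 5-character words that contain no apostrophe."""
    return {w for w in text.lower().split() if len(w) == 5 and "'" not in w}

sample = "Apple can't grape APPLE don't lemon pears"  # made-up sample text
result = unique_five_letter_words(sample)
print(len(result), sorted(result))
```

Note that "can't" and "don't" are five characters long, so the length test alone would keep them; the apostrophe test is what removes them.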

Count words (even multiples) in a text with Python

I have to write a function that counts how many times a word (or a series of words) appears in a given text.
This is my function so far. What I noticed is that with a series of 3 words the function works well, but not with 4 or more words.
from nltk import ngrams
def function(text, word):
    for char in ".?!-":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = word.lower()
    n_grams = ngrams(bigram_lower.split(), n)
    for gram in n_grams:
        for i in range (0, n):
            if gram[i] == word_lower.split()[i]:
                countN = countN + 1
    print (countN)
First, please fix your indentation, and don't use bigrams as the variable name for the n-grams; it is confusing, since you are not storing just bigrams in it. Second, let's look at this part of your code:
for gram in bigrams:
    for i in range (0, n):
        if gram[i] == word_lower.split()[i]:
            countN = countN + 1
print (countN)
Here you are increasing countN by one each time a single word in your n-gram matches, instead of increasing it only when the whole n-gram matches. You should only increase countN if all the words match:
for gram in bigrams:
    if list(gram) == word_lower.split():
        countN = countN + 1
print (countN)
Maybe this has already been answered elsewhere.
Is nltk mandatory?
# Open the file in read mode
text = open("sample.txt", "r")
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
This should work for you:
import nltk
def function(text, word):
    for char in ".?!-,":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = tuple(word.lower().split())
    bigrams = nltk.ngrams(bigram_lower.split(), n)
    for gram in bigrams:
        if gram == word_lower:
            countN += 1
    print (countN)
>>> tekst="this is the text i want to search, i want to search it for the words i want to search for, and it should count the occurances of the words i want to search for"
>>> function(tekst, "i want to search")
4
>>> function(tekst, "i want to search for")
2
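If nltk is not mandatory, the same whole-phrase comparison can be done with plain list slices. This is a sketch rather than a replacement for the answers above; count_phrase is a hypothetical name, and it strips the same punctuation characters as the last answer.

```python
def count_phrase(text, phrase):
    """Count how often the words of `phrase` appear consecutively in `text`."""
    for char in ".?!-,":
        text = text.replace(char, " ")
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return 0
    # compare every window of n consecutive words with the whole target
    return sum(words[i:i + n] == target for i in range(len(words) - n + 1))

tekst = ("this is the text i want to search, i want to search it for the "
         "words i want to search for, and it should count the occurances "
         "of the words i want to search for")
print(count_phrase(tekst, "i want to search"))
print(count_phrase(tekst, "i want to search for"))
```

Because each window is compared as a whole list, the count only goes up when every word of the phrase matches in order, which is the fix described in the first answer.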

How do I get the specific number of a word in a txt file?

I'm trying to find every place where one of some specific words is used in a TXT file, and then report the position of that word in the file (that is, that it is the n-th word). My code returns the number for some but not all of the words, and I have no idea why.
My code right now goes through the file word by word with a counter and writes out the counter if the word matches one of the words I want.
def wordnumber(file, filewrite, word1, word2, word3):
    import os
    wordlist = [word1, word2, word3]
    infile = open(file, 'r')
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ' '.join(lines)
    words = wordsString.split()
    n = 1
    for w in words:
        if w in wordlist:
            g.write(str(n))
            g.write(os.linesep)
        n = n+1
This works sometimes, but for some text files it only returns some of the numbers and leaves others blank.
If you want to find the first occurrence of a word w in your words list, just use
wordIndex = words.index(w) if w in words else None
and for all occurrences use
wordIndexes = [i for i,x in enumerate(words) if x==w]
(taken from Python: Find in list)
But beware: if your text is "cat, dog, mouse", your code won't find the index of "cat" or "dog", because "cat, dog, mouse".split() returns ['cat,', 'dog,', 'mouse'], and 'cat,' is not equal to 'cat'.
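One way to deal with the punctuation problem is to strip punctuation from each token before comparing it. The sketch below assumes that leading and trailing punctuation is the only obstacle; word_positions is a hypothetical helper, and it returns 1-based positions to match the counter n in the question.

```python
import string

def word_positions(text, targets):
    """Return {target: [1-based word positions]} for each target in `text`."""
    positions = {t: [] for t in targets}
    for n, raw in enumerate(text.split(), start=1):
        word = raw.strip(string.punctuation)  # 'cat,' becomes 'cat'
        if word in positions:
            positions[word].append(n)
    return positions

print(word_positions("cat, dog, mouse. The dog ran", ["cat", "dog", "owl"]))
```

Separately, the question's function never closes the output file g. Whether that explains the missing numbers is only a guess, but opening the file in a with block would rule it out.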

How to create a sentence with list of position and a list of words

I wanted to write code that asks the user to enter a list of positions in a plain text file, saves the positions the user entered as a list, then asks the user to enter the word each position represents (in the same order as the list of positions), and finally re-creates the sentence. However, when I run this code:
import subprocess
subprocess.Popen(["notepad","list_of_numbers.txt"])
p =open("list_of_numbers.txt","r")
l = p.read()
p.close()
positions = list(l)
subprocess.Popen(["notepad","list_of_words.txt"])
s = open("list_of_words.txt","r")
s.read()
s.close()
sentence = str(s)
print (sentence)
mapping = {}
words = sentence.split()
for (position, word) in zip(positions, words):
    mapping[position] = word
output = [mapping[position] for position in positions]
print(' '.join(output))
and I enter
1 2 3 4 5 1 2 3 4 5
as the list of positions
and this as the list of words:
this is a repeated sentence
the output should be:
this is a repeated sentence this is a repeated sentence
but I get
"key error:3"
I think the problem is that I didn't store the list of positions in a list properly, but I'm not sure. Can somebody help me?
Try this one:
import subprocess
subprocess.Popen(["notepad","list_of_numbers.txt"])
with open("list_of_numbers.txt","r") as p:
    l = p.read()
positions = l.split() # you have to create the list by splitting the string
subprocess.Popen(["notepad","list_of_words.txt"])
with open("list_of_words.txt","r") as s:
    sentence = s.read() # you have to assign the string that was read to a variable
print (sentence)
mapping = {}
words = sentence.split()
for position, word in zip(positions, words):
    mapping[position] = word
output = [mapping[position] for position in positions]
print(' '.join(output))
But it could also be written as:
import subprocess
subprocess.Popen(["notepad","list_of_numbers.txt"])
with open("list_of_numbers.txt","r") as pos_file:
    positions = pos_file.read().split()
subprocess.Popen(["notepad","list_of_words.txt"])
with open("list_of_words.txt","r") as sentence_file:
    words = sentence_file.read().split()
mapping = dict(zip(positions, words))
output = [mapping[position] for position in positions]
print(' '.join(output))
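To see why the original code fails, it helps to compare list() and str.split() on the same string: list() yields single characters, spaces included, while split() yields the whitespace-separated tokens. The sketch below works on plain strings instead of the Notepad files; rebuild_sentence is a hypothetical name.

```python
def rebuild_sentence(positions_text, words_text):
    """Rebuild a sentence from whitespace-separated positions and words."""
    positions = positions_text.split()
    words = words_text.split()
    mapping = dict(zip(positions, words))  # zip stops at the shorter input
    return " ".join(mapping[p] for p in positions)

print(list("1 2 3"))    # single characters, spaces included
print("1 2 3".split())  # whitespace-separated tokens
print(rebuild_sentence("1 2 3 4 5 1 2 3 4 5", "this is a repeated sentence"))
```

In the question, str(s) also produces the text representation of the file object rather than the file contents, so only a few keys end up in the mapping, which is consistent with the KeyError.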

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count the number of characters, the number of lines, and the number of occurrences of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to lower case so I don't get two different counts for the same word when one occurrence is in lower case and the other is in upper case.
Now, looking at the output, I realized that the word count is not clean. I have been struggling to output clean data that does not count special characters and that does not include a trailing period or comma when counting words.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a small function whose body is a single expression: it cannot contain statements such as assignments, it simply returns the value of that expression.
So get_alphabetical_characters is a function that, as its name suggests, takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
The some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
It steps through every character in the given word and puts it into the list if it is alphabetical; if it isn't, it puts an empty string in its place.
For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluated by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
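Applied to the example line from the question, the idea looks roughly like this. It is a sketch only: keep_letters is a hypothetical stand-in for the lambda above, and the final filter is an added assumption, needed because a token such as "*" is reduced to an empty string that would otherwise be counted as a word.

```python
from collections import Counter

def keep_letters(word):
    """Drop every character that is not a lowercase ASCII letter."""
    return "".join(char for char in word if char in "abcdefghijklmnopqrstuvwxyz")

line = "Hello, I am Bob. Hello to Bob *"
cleaned = [keep_letters(w) for w in line.lower().split()]
# '*' has become '', so drop empty results before counting
counts = Counter(w for w in cleaned if w)
print(counts)
```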
Edit #2: If you want to keep periods that are used as decimal points, we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(char if include_char(pos, fname) else ' ' for pos, char in enumerate(fname)).split()
Here the include_char function checks whether the character at a given position is alphanumeric (i.e. a letter or a digit), or whether it is a period and the character preceding it is a digit. We use it to replace every character we don't want with a space, join the result into a single string, and then separate that into a list of words using the str.split method.
This program may help you (Python 2):
# A list of characters that should not be
# considered part of a word
char2remove = (".",",",";","!","?","*",":")
# Read a string from the user
string = raw_input("Enter your string: ")
# Make all the letters lower-case
string = string.lower()
# Replace the special characters with white-space
for char in char2remove:
    string = string.replace(char," ")
# Extract all the words in the new string (with repeats)
words = string.split(" ")
# Create a dictionary to remove repeats
to_count = dict()
for word in words:
    to_count[word]=0
# Count the word repeats
for word in to_count:
    # skip entries that are not made of letters (e.g. empty strings left by consecutive spaces)
    if word.isalpha():
        print word, string.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is to use a regex to remove all non-letter characters (which gets rid of the char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
your_str = regex.sub(' ', your_str) # sub() returns a new string, so assign the result
words = your_str.split(" ")
to_count = dict()
for word in words:
    to_count[word]=0
for word in to_count:
    if word.isalpha():
        print word, your_str.count(word)
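The two programs above are written for Python 2, and str.count counts substrings, so a short word such as "am" would also be counted inside "sample". A Python 3 sketch that counts whole tokens instead is shown below; word_counts is a hypothetical name, and treating only runs of ASCII letters as words is an assumption (it would split "can't" into "can" and "t").

```python
import re
from collections import Counter

def word_counts(text):
    """Count whole words made of ASCII letters, ignoring case."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = word_counts("Hello, I am Bob. Hello to Bob *")
# most_common() lists the words from the highest count to the lowest
for word, count in counts.most_common():
    print(count, word)
```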
