I have a text file for which I am counting the number of lines, the number of characters, and the count of each word. How can I clean the data by removing stop words such as "the", "for", and "a" using string.replace()?
Ex. if the text file contains the line:
"The only words to count are Apple and Grapes for this text"
It should output:
2 Apple
2 Grapes
1 words
1 only
1 text
And should not output words like:
the
to
are
for
this
Below is the code I have as of now.
# Open and read the input file
fname = open('2013_honda_accord.txt', 'r').read()

# COUNT CHARACTERS
num_chars = len(fname)

# COUNT LINES
num_lines = fname.count('\n')

# COUNT WORDS
fname = fname.lower()  # convert the text to lower case first
words = fname.split()
d = {}
for w in words:
    # if the word has been seen before, increment its count
    if w in d:
        d[w] += 1
    # otherwise give it a count of 1
    else:
        d[w] = 1

# the total number of words is the sum of all the counts
num_words = sum(d[w] for w in d)

lst = [(d[w], w) for w in d]
# sort the list; ties on count fall back to alphabetical order
lst.sort()
# list word counts from greatest to lowest (ties will also show in reverse order, Z-A)
lst.reverse()

# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))

print('\n The 30 most frequent words are \n')
# print each word together with the number of times it is used in the text
i = 1
for count, word in lst[:30]:  # slice to 30 to match the heading above
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
After opening and reading the file (fname = open('2013_honda_accord.txt', 'r').read()), you can place this code:

blacklist = ["the", "to", "are", "for", "this"]  # blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(word, "")

# The above leaves multiple spaces in the text (e.g. ' Apple  Grapes  Apple')
while "  " in fname:
    fname = fname.replace("  ", " ")  # collapse double spaces as long as any remain
Edit:
To avoid clobbering longer words that contain one of the unwanted words, you can match them together with the surrounding spaces (assuming the words occur in the middle of a sentence):

blacklist = ["the", "to", "are", "for", "this"]  # blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(" " + word + " ", " ")
    # the same idea applies to words followed by '.', '!', '?' etc.

A check for double spaces is not required here.
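As a side note, a split-based filter side-steps both the substring problem and the double-space problem, because it compares whole tokens instead of editing the raw string. A minimal sketch (the blacklist is just the example list from above):

```python
blacklist = {"the", "to", "are", "for", "this"}

def remove_stop_words(text):
    # keep only the tokens that are not blacklisted (case-insensitive)
    return " ".join(w for w in text.split() if w.lower() not in blacklist)

print(remove_stop_words("The only words to count are Apple and Grapes for this text"))
# only words count Apple and Grapes text
```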
Hope this helps!
You can easily eliminate those words by writing a simple function:

# This function drops the restricted words from a sentence.
# Input  - sentence, list of restricted words (the restricted list should be all lower case)
# Output - list of allowed words.
def restrict(sentence, restricted):
    return list(set([word for word in sentence.split() if word.lower() not in restricted]))
Then you can use this function whenever you want (before or after the word count). For example:

restricted = ["the", "to", "are", "and", "for", "this"]
sentence = "The only words to count are Apple and Grapes for this text"
word_list = restrict(sentence, restricted)
print(word_list)

Would print (in arbitrary order, since a set does not preserve ordering):

['count', 'Apple', 'text', 'only', 'Grapes', 'words']
Of course you can also add removal of empty words (note that split() with no argument already discards empty strings; the extra check matters if you split on a literal space):

return list(set([word for word in sentence.split() if word.lower() not in restricted and len(word) > 0]))
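If the counts themselves are still needed after filtering, collections.Counter from the standard library does the tallying in one step; a minimal sketch along the same lines:

```python
from collections import Counter

restricted = {"the", "to", "are", "and", "for", "this"}
sentence = "The only words to count are Apple and Grapes for this text"

# count only the words that are not restricted (case-insensitive check)
counts = Counter(w for w in sentence.split() if w.lower() not in restricted)
for word, count in counts.most_common():
    print(count, word)
```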
Hey, this is a program I want to write in Python. I tried, and I successfully iterated over the words, but now how do I count the individual score?

a_string = input("Enter a sentence: ").lower()
vowel_counts = {}
splits = a_string.split()
for i in splits:
    words = []
    words.append(i)
    print(words)
You can flag the vowels by using translate to convert all the vowels to 'a'. Then count the 'a's in each word using the count method:
sentence = "computer programmers rock"
vowels = str.maketrans("aeiouAEIOU","aaaaaaaaaa")
flagged = sentence.translate(vowels) # all vowels --> 'a'
counts = [word.count('a') for word in flagged.split()] # counts per word
score = sum(1 if c<=2 else 2 for c in counts) # sum of points
print(counts,score)
# [3, 3, 1] 5
I tried using this code that I found online:
K = sentences
m = [len(i.split()) for i in K]
lengthorder = sorted(K, key=len, reverse=True)
# print(lengthorder)
# print("\n")

list1 = lengthorder
str1 = '\n'.join(list1)
print(str1)
print('\n')
Sentence1 = "We have developed speed, but we have shut ourselves in"
res = len(Sentence1.split())
print ("The longest sentence in this text contains" + ' ' + str(res) + ' ' + "words.")
Sentence2 = "More than cleverness we need kindness and gentleness"
res = len(Sentence2.split())
print ("The second longest sentence in this text contains" + ' ' + str(res) + ' ' + "words.")
Sentence3 = "Machinery that gives abundance has left us in want"
res = len(Sentence3.split())
print ("The third longest sentence in this text contains" + ' ' + str(res) + ' ' + "words.")
but it doesn't sort the sentences by word count; it sorts them by actual character length.
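For reference, the underlying fix is to key the sort on the number of words rather than on len itself, which counts characters; a minimal sketch using the sentences from the text:

```python
sentences = [
    "We have developed speed, but we have shut ourselves in",
    "More than cleverness we need kindness and gentleness",
    "Machinery that gives abundance has left us in want",
]

# key=len would sort by character count; len(s.split()) sorts by word count
by_word_count = sorted(sentences, key=lambda s: len(s.split()), reverse=True)

for rank, sentence in enumerate(by_word_count, start=1):
    print(f"Sentence {rank} contains {len(sentence.split())} words.")
```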
You can simply iterate through the different sentences and split them up into words like this:

text = " We have developed speed. but we have. shut ourselves in Machinery that. gives abundance has left us in want Our knowledge has made us cynical Our cleverness, hard and unkind We think too much and feel too little More than machinery we need humanity More than cleverness we need kindness and gentleness"

# split into sentences
text2array = text.split(".")

i = 0
# iterate through sentences and split them into words
for sentence in text2array:
    text2array[i] = sentence.split(" ")
    i += 1

# sort the sentences by word count
text2array.sort(key=len, reverse=True)

i = 0
# iterate through sentences and print them to screen
for sentence in text2array:
    i += 1
    sentenceOut = ""
    for word in sentence:
        sentenceOut += " " + word
    sentenceOut += "."
    print("the nr " + str(i) + " longest sentence is" + sentenceOut)
You can define a function that uses a regex to obtain the number of words in a given sentence:

import re

def get_word_count(sentence: str) -> int:
    return len(re.findall(r"\w+", sentence))
Assuming you already have a list of sentences, you can iterate the list, pass each sentence to the word count function, and store each sentence with its word count in a dictionary:
sentences = [
    "Assume that this sentence has one word. Really?",
    "Assume that this sentence has more words than all sentences in this list. Obviously!",
    "Assume that this sentence has more than one word. Duh!",
]

word_count_dict = {}
for sentence in sentences:
    word_count_dict[sentence] = get_word_count(sentence)
At this point, the word_count_dict contains sentences as keys and their associated word count as values.
You can then sort word_count_dict by values:
sorted_word_count_dict = dict(
    sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True)
)
Here's the full snippet:

import re

def get_word_count(sentence: str) -> int:
    return len(re.findall(r"\w+", sentence))

sentences = [
    "Assume that this sentence has one word. Really?",
    "Assume that this sentence has more words than all sentences in this list. Obviously!",
    "Assume that this sentence has more than one word. Duh!",
]

word_count_dict = {}
for sentence in sentences:
    word_count_dict[sentence] = get_word_count(sentence)

sorted_word_count_dict = dict(
    sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True)
)
print(sorted_word_count_dict)
Let's assume that your sentences are already separate and there is no need to detect sentence boundaries.
So we have a list of sentences. Then we need to calculate the length of each sentence based on its word count. The basic way is to split on spaces, since each space separates two words in a sentence.
list_of_sen = ['We have developed speed, but we have shut ourselves in', 'Machinery that gives abundance has left us in want Our knowledge has made us cynical Our cleverness', 'hard and unkind We think too much and feel too little More than machinery we need humanity More than cleverness we need kindness and gentleness']

sen_len = [len(i.split()) for i in list_of_sen]
sen_len = sorted(sen_len, reverse=True)
for index, count in enumerate(sen_len):
    print(f'The {index+1} longest sentence in this text contains {count} words')
But if your sentences are not already separated, we first need to recognize the end of each sentence and then split. Your sample data does not contain any punctuation that could be used to separate sentences, so assuming your data does have punctuation, the snippet below can be helpful.
see this question

from nltk import tokenize

p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
tokenize.sent_tokenize(p)
I have to write a function that counts how many times a word (or a series of words) appears in a given text.
This is my function so far. What I noticed is that the function works well with a series of 3 words, but not with 4 words and more.
from nltk import ngrams

def function(text, word):
    for char in ".?!-":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = word.lower()
    n_grams = ngrams(bigram_lower.split(), n)
    for gram in n_grams:
        for i in range(0, n):
            if gram[i] == word_lower.split()[i]:
                countN = countN + 1
                print(countN)
First, please fix your indentation, and don't use bigrams as a variable name for ngrams, as it is a bit confusing (you are not storing just bigrams in that variable). Secondly, let's look at this part of your code:
for gram in bigrams:
    for i in range(0, n):
        if gram[i] == word_lower.split()[i]:
            countN = countN + 1
            print(countN)
Here you are increasing countN by one each time a single word of the ngram matches, instead of increasing it when the whole ngram matches. You should only increase countN if all the words have matched:
for gram in bigrams:
    if list(gram) == word_lower.split():
        countN = countN + 1
        print(countN)
Maybe it was already done here, but is nltk mandatory?
# Open the file in read mode
text = open("sample.txt", "r")

# Create an empty dictionary
d = dict()

# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
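Since the question is about counting a series of words, the sliding window that nltk's ngrams provides can also be written by hand; a minimal nltk-free sketch (the function name is just an illustration):

```python
def count_phrase(text, phrase):
    # strip basic punctuation, as in the other answers
    for char in ".?!-,":
        text = text.replace(char, ' ')
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    # slide a window of n words over the text and compare whole windows
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

print(count_phrase("Hello to Bob. Hello to Alice.", "hello to"))  # 2
```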
This should work for you:

import nltk

def function(text, word):
    for char in ".?!-,":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = tuple(word.lower().split())
    bigrams = nltk.ngrams(bigram_lower.split(), n)
    for gram in bigrams:
        if gram == word_lower:
            countN += 1
    print(countN)
>>> tekst="this is the text i want to search, i want to search it for the words i want to search for, and it should count the occurances of the words i want to search for"
>>> function(tekst, "i want to search")
4
>>> function(tekst, "i want to search for")
2
The program correctly identifies the words regardless of punctuation. I am having trouble integrating this into spam_indicator(text).
def spam_indicator(text):
    text = text.split()
    w = 0
    s = 0
    words = []
    for char in string.punctuation:
        text = text.replace(char, '')
        return word
    for word in text:
        if word.lower() not in words:
            words.append(word.lower())
            w = w + 1
        if word.lower() in SPAM_WORDS:
            s = s + 1
    return float("{:.2f}".format(s/w))
The second block is wrong. I am trying to remove the punctuation so that the function runs.
Try removing the punctuation first, then split the text into words.
def spam_indicator(text):
    for char in string.punctuation:
        text = text.replace(char, ' ')  # N.B. replace with ' ', not ''
    text = text.split()
    w = 0
    s = 0
    words = []
    for word in text:
        if word.lower() not in words:
            words.append(word.lower())
            w = w + 1
        if word.lower() in SPAM_WORDS:
            s = s + 1
    return float("{:.2f}".format(s/w))
There are many improvements that could be made to your code.
Use a set for words rather than a list. Since a set cannot contain duplicates, you don't need to check whether you've already seen the word before adding it to the set.
Use str.translate() to remove the punctuation. You want to replace punctuation with whitespace so that split() will break the text into words.
Use round() instead of converting to a string and then back to a float.
Here is an example:
import string

def spam_indicator(text):
    trans_table = {ord(c): ' ' for c in string.punctuation}
    text = text.translate(trans_table).lower()
    text = text.split()
    spam_count = 0
    words = set()
    for word in text:
        if word in SPAM_WORDS:
            spam_count += 1
        else:
            words.add(word)  # the set discards duplicates for us
    word_count = len(words)  # number of distinct non-spam words
    return round(spam_count / word_count, 2)
You need to take care not to divide by zero if there are no non-spam words. Anyway, I'm not sure what you want the spam indicator value to be. Perhaps it should be the number of spam words divided by the total number of words (both spam and non-spam), to make it a value between 0 and 1?
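That last variant could look something like the sketch below; SPAM_WORDS is stubbed in here only for illustration, since the real set comes from the exercise:

```python
import string

SPAM_WORDS = {"winner", "prize"}  # stand-in set, just for the example

def spam_ratio(text):
    # spam words divided by total words, guarded against empty input
    trans_table = {ord(c): ' ' for c in string.punctuation}
    words = text.translate(trans_table).lower().split()
    if not words:
        return 0.0
    spam = sum(1 for w in words if w in SPAM_WORDS)
    return round(spam / len(words), 2)

print(spam_ratio("You are a winner! Claim your prize now."))  # 0.25
```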
I have an input text file for which I have to count the number of characters, the number of lines, and the count of each word.
So far I have been able to get the count of characters, lines, and words. I also converted the text to all lower case so I don't get two different counts for the same word when one occurrence is in lower case and the other is in upper case.
Now looking at the output, I realized that the word counts are not clean. I have been struggling to output clean data that does not count any special characters, and that does not include a trailing period or comma as part of a word.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open and read the input file
fname = open('2013_honda_accord.txt', 'r').read()

# COUNT CHARACTERS
num_chars = len(fname)

# COUNT LINES
num_lines = fname.count('\n')

# COUNT WORDS
fname = fname.lower()  # convert the text to lower case first
words = fname.split()
d = {}
for w in words:
    # if the word has been seen before, increment its count
    if w in d:
        d[w] += 1
    # otherwise give it a count of 1
    else:
        d[w] = 1

# the total number of words is the sum of all the counts
num_words = sum(d[w] for w in d)

lst = [(d[w], w) for w in d]
# sort the list; ties on count fall back to alphabetical order
lst.sort()
# list word counts from greatest to lowest (ties will also show in reverse order, Z-A)
lst.reverse()

# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))

print('\n The 30 most frequent words are \n')
# print each word together with the number of times it is used in the text
i = 1
for count, word in lst[:30]:  # slice to 30 to match the heading above
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Try replacing

words = fname.split()

with

get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line: whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
what you're looking at is the definition of a function that doesn't have any side effects, meaning it can't change the value of variables; it can only return a value.
So get_alphabetical_characters is a function that, as the suggestive name tells us, takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
And some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is step through every character in the given word and put it into the list if it is alphabetical, or put a blank string in its place if it isn't.
For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
Evaluates to the following list:
['h', 'e', 'l', 'l', 'o', '']
Which is then evaluated by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to keep periods used to mark decimals, we can write a function that works by character position, so it can look at the preceding character:

include_char = lambda pos: fname[pos] if fname[pos].isalnum() or (fname[pos] == '.' and fname[pos-1:pos].isdigit()) else ' '
words = "".join(map(include_char, range(len(fname)))).split()

What we're doing here is that the include_char function keeps a character if it is "alphanumeric" (i.e. a letter or a digit), or if it is a period whose preceding character is a digit, and replaces every other character with a space. We then join the results into a single string, which we separate into a list of words using the str.split method.
This program may help you:

# I created a list of characters that I don't want
# to be considered as parts of words!
char2remove = (".", ",", ";", "!", "?", "*", ":")

# Receive a string from the user.
string = input("Enter your string: ")

# Make all the letters lower-case
string = string.lower()

# Replace the special characters with white-space.
for char in char2remove:
    string = string.replace(char, " ")

# Extract all the words in the new string (with repeats)
words = string.split(" ")

# Create a dictionary to remove repeats
to_count = dict()
for word in words:
    to_count[word] = 0

# Count the word repeats.
# (Note: str.count matches substrings, so a short word may also be
# counted inside longer words.)
for word in to_count:
    # if there is a space in a word, it is white-space!
    if word.isalpha():
        print(word, string.count(word))
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is to use a regex to remove all non-letter chars (and get rid of the char2remove list):

import re

regex = re.compile('[^a-zA-Z]')
your_str = input("Enter String: ")
your_str = your_str.lower()
your_str = regex.sub(' ', your_str)  # assign the result back; sub() does not modify in place
words = your_str.split(" ")

to_count = dict()
for word in words:
    to_count[word] = 0

for word in to_count:
    if word.isalpha():
        print(word, your_str.count(word))