I'm not sure how to remove the "\n" at the end of the output.
Basically, I have a txt file with sentences such as:
"What does Bessie say I have done?" I asked.
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child
taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I managed to split the sentences by semicolon with code:
import re

with open("testing.txt") as file:
    read_file = file.readlines()

for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';', low)
But I'm not sure how to count the words in each split part, as len() doesn't give me what I want:
The output of the sentences:
['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']
For the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.
Thanks for reading!
The number of words is the number of spaces plus one:
e.g.
Two spaces, three words:
World is wonderful
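A quick sanity check of that rule (a minimal sketch; both counts agree for single-spaced text):

```python
s = "World is wonderful"

print(s.count(' ') + 1)  # 3 - number of spaces plus one
print(len(s.split()))    # 3 - equivalent word count via split()
```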
Code:
import re
import string

lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()

DELIMITER = ';'
word_count = []

for i, sentence in enumerate(lines):
    # Skip empty sentences
    if not sentence.strip():
        continue

    # Remove punctuation besides our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))

    # Split by our delimiter
    parts = re.split(DELIMITER, sentence)

    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in parts])

# [[9], [7, 9], [7], [3, 8]]
print(word_count)
Use str.rstrip('\n') to remove the \n at the end of each sentence.
To count the words in a sentence, you can use len(sentence.split()) (splitting on any whitespace also avoids counting empty strings produced by repeated spaces).
To transform a list of sentences into a list of counts, you can use the map function.
So here it is:
import re

with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip(' ') != '\n':
            line = line.lower().rstrip('\n')
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)
Outputs
SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]
You'll need the nltk library, plus its tokenizer models (e.g. nltk.download('punkt')):
from nltk import sent_tokenize, word_tokenize
mytext = """I have a dog.
The dog is called Bob."""
for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))
Output
5
6
Step by step explanation:
for sent in sent_tokenize(mytext):
    print('Sentence >>>', sent)
    print('List of words >>>', word_tokenize(sent))
    print('Count words per sentence >>>', len(word_tokenize(sent)))
Output:
Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
import re

sentences = []  # empty list for storing the result
with open('testtext.txt') as fileObj:
    # make a list of lines, already stripped of '\n's
    lines = [line.strip() for line in fileObj if line.strip()]
for line in lines:
    sentences += re.split(';', line)  # split lines by ';' and store the result in sentences
for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))  # output
try this one:
import re

with open("testing.txt") as file:
    read_file = file.readlines()

for i, word in enumerate(read_file):
    low = word.lower()
    low = low.strip()  # strip() already removes the trailing '\n', so no extra replace() is needed
    parts = re.split(';', low)
    print(parts)
So I've got a variable list which is always being fed a new line, and a variable words which is a big list of single-word strings.
Every time list updates, I want to compare it to words and see if any strings from words are in list.
If they do match, let's say the word and is in both of them, I then want to print "And : 1". Then if the next sentence has it as well, to print "And : 2", etc. If another word comes in, like The, I want to add 1 to that.
So far I have split the incoming text into an array with text.split() - unfortunately that is where I'm stuck. I do see some use in [x for x in words if x in list] but don't know how I would use that, or how I would extract the specific word that is matching.
You can use a collections.Counter object to keep a tally for each of the words that you are tracking. To improve performance, use a set for your word list (you said it's big). To keep things simple assume there is no punctuation in the incoming line data. Case is handled by converting all incoming words to lowercase.
from collections import Counter

words = {'and', 'the', 'in', 'of', 'had', 'is'}  # words to keep counts for
word_counts = Counter()

lines = ['The rabbit and the mole live in the ground',
         'Here is a sentence with the word had in it',
         'Oh, it also had in in it. AND the and is too']

for line in lines:
    # ':=' (walrus operator) requires Python 3.8+
    tracked_words = [w for word in line.split() if (w := word.lower()) in words]
    word_counts.update(tracked_words)
    print(*[f'{word}: {word_counts[word]}'
            for word in set(tracked_words)], sep=', ')
Output
the: 3, and: 1, in: 1
the: 4, in: 2, is: 1, had: 1
the: 5, and: 3, in: 4, is: 2, had: 2
Basically this code takes a line of input, splits it into words (assuming no punctuation), converts these words to lowercase, and discards any words that are not in the main list of words. Then the counter is updated. Finally the current values of the relevant words is printed.
This does the trick:
sentence = "Hello this is a sentence"
list_of_words = ["this", "sentence"]
dict_of_counts = {}  # This will hold all words that have a minimum count of 1.

for word in sentence.split():  # sentence.split() returns a list with each word of the sentence, and we loop over it.
    if word in list_of_words:  # Check if the current word is in list_of_words.
        if word in dict_of_counts:
            dict_of_counts[word] += 1  # If this key already exists in the dictionary, add one to its value.
        else:
            dict_of_counts[word] = 1  # If the key does not exist, create it with a value of 1.
        print(f"{word}: {dict_of_counts[word]}")  # Print your statement.
The total count is kept in dict_of_counts and would look like this if you print it:
{'this': 1, 'sentence': 1}
You can use a defaultdict here to keep the counting code simple.
from collections import defaultdict

input_string = "This is an input string"
list_of_words = ["input", "is"]

counts = defaultdict(int)
for word in input_string.split():
    if word in list_of_words:
        counts[word] += 1
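For completeness, here is the same snippet run end to end, with the resulting tally printed (converted to a plain dict just for display):

```python
from collections import defaultdict

input_string = "This is an input string"
list_of_words = ["input", "is"]

counts = defaultdict(int)  # missing keys default to 0, so no KeyError on first hit
for word in input_string.split():
    if word in list_of_words:
        counts[word] += 1

print(dict(counts))  # {'is': 1, 'input': 1}
```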
Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case; for example, Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the words and how frequently each word is used. The output should be sorted from the most frequent word to the least frequent word.
The only problem I am having is getting the code to count "The" and "the" as the same thing; the code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1:
    userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w, 0) + 1

lst = list()
for k, v in di.items():
    newtup = (v, k)
    lst.append(newtup)

lst = sorted(lst, reverse=True)
print(lst)
Need to count "the" and "The" as one single word.
We start by getting the words into a list, lowercasing them all. You can disregard punctuation by replacing each punctuation character in the string with a space:
punctuations = '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
    s = s.replace(punc, ' ')

words = s.split(' ')
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1

print(freq)
# {'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#  'words': 2, 'are': 2, 'there': 2}
You can use Counter and re like this:
from collections import Counter
import re
sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuations (you can add more here)
punctuationsToBeremoved = ",|\n|\?"
#to make all of them in lower case
sentence = sentence.lower()
#to clean up the punctuations
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# getting the word list
words = sentence.split()
# printing the frequency of each word
print(Counter(words))
# Counter({'egg': 2, 'bird': 1, 'goat': 1, 'afterdoublespace': 1, 'newline': 1})
I have two lists. One is made up of positions from a sentence and the other is made up of words that make up the sentence. I want to recreate the variable sentence using poslist and wordlist.
recreate = []
sentence = "This and that, and this and this."
poslist = [1, 2, 3, 2, 4, 2, 5]
wordlist = ['This', 'and', 'that', 'this', 'this.']
I wanted to use a for loop to go through poslist and if the item in poslist was equal to the position of a word in wordlist it would append it to a new list, recreating the original list. My first try was:
for index in poslist:
    recreate.append(wordlist[index])
print(recreate)
I had to convert the lists to strings to write them into a text file. When I tried splitting them again and using the code shown above, it did not work: it said that list indices must be integers or slices, not list. I would like a solution to this problem. Thank you.
The list of words is gotten using:
sentence = input("Enter a sentence >>")  # asking the user for an input
sentence_lower = sentence.lower()  # making the sentence lower case
wordlist = []  # creating an empty list
sentencelist = sentence.split()  # making the sentence into a list
for word in sentencelist:  # for loop iterating over the sentence as a list
    if word not in wordlist:
        wordlist.append(word)

txtfile = open("part1.txt", "wt")
for word in wordlist:
    txtfile.write(word + "\n")
txtfile.close()

txtfile = open("part1.txt", "rt")
for item in txtfile:
    print(item)
txtfile.close()

print(wordlist)
And the positions are gotten using:
poslist = []
textfile = open("part2.txt", "wt")
for word in sentencelist:
    poslist.append([position + 1 for position, i in enumerate(wordlist) if i == word])
print(poslist)

str1 = " ".join(str(x) for x in poslist)
textfile = open("part2.txt", "wt")
textfile.write(str1)
textfile.close()
Lists are 0-indexed (the first item has the index 0, the second the index 1, ...), so you have to subtract 1 from the indexes if you want to use "human" indexes in the poslist:
for index in poslist:
    recreate.append(wordlist[index-1])
print(recreate)
Afterwards, you can glue them together again (with spaces between the words) and write them to a file:
with open("thefile.txt", "w") as f:
    f.write(" ".join(recreate))
First, your code can be simplified to:
sentence = input("Enter a sentence >>")  # asking the user for an input
sentence_lower = sentence.lower()  # making the sentence lower case
wordlist = []  # creating an empty list
sentencelist = sentence.split()  # making the sentence into a list

with open("part1.txt", "wt") as txtfile:
    for word in sentencelist:  # for loop iterating over the sentence as a list
        if word not in wordlist:
            wordlist.append(word)
            txtfile.write(word + "\n")

poslist = [wordlist.index(word) for word in sentencelist]
print(poslist)

str1 = " ".join(str(x) for x in poslist)
with open("part2.txt", "wt") as textfile:
    textfile.write(str1)
In your original code, poslist was a list of lists instead of a list of integers.
Then, if you want to reconstruct your sentence from poslist (which is now a list of int and not a list of lists as in the code you provided) and wordlist, you can do the following:
sentence = ' '.join(wordlist[pos] for pos in poslist)
You can also do it using a generator expression and the string join method:
sentence = ' '.join(wordlist[pos-1] for pos in poslist if pos <= len(wordlist))
# 'This and that and this and this.'
You can use operator.itemgetter() for this.
from operator import itemgetter
poslist = [0, 1, 2, 1, 3, 1, 4]
wordlist = ['This', 'and', 'that', 'this', 'this.']
print(' '.join(itemgetter(*poslist)(wordlist)))
Note that I had to subtract one from all of the items in poslist, as Python is a zero-indexed language. If you need to programmatically change poslist, you could do poslist = (n - 1 for n in poslist) right after you declare it.
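Combining that with the 1-indexed poslist from the question (note the comma after "that" is gone, since wordlist stores 'that' without it):

```python
from operator import itemgetter

poslist = [1, 2, 3, 2, 4, 2, 5]  # 1-indexed, as in the question
wordlist = ['This', 'and', 'that', 'this', 'this.']

zero_based = [n - 1 for n in poslist]  # shift to Python's 0-based indexing
print(' '.join(itemgetter(*zero_based)(wordlist)))
# This and that and this and this.
```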
I am relatively new to python and have a question:
I am trying to write a script that will read a .txt file and check if words are in a list that I've provided and then return a count as to how many words were in that list.
So far,
import string

#this is just an example of a list
list = ['hi', 'how', 'are', 'you']
filename = "hi.txt"
infile = open(filename, "r")
lines = infile.readlines()
for line in lines:
    words = line.split()
    for word in words:
        word = word.strip(string.punctuation)
I've tried to split the file into lines and then the lines into words without punctuation.
I am not sure where to go after this. I would like ultimately for the output to be something like this:
"your file has x words that are in the list".
Thank you!
You can split your file into words using the following command (this is Python 2 code, using the built-in reduce and the print statement):
words = reduce(lambda x, y: x + y, [line.split() for line in f])
Then count each word from your word list by looping over the list and using the count function:
w_list = ['hi', 'how', 'are', 'you']
with open("hi.txt", "r") as f:
    words = reduce(lambda x, y: x + y, [line.split() for line in f])
for w in w_list:
    print "your file has {} {}".format(words.count(w), w)
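Note that the snippet above is Python 2. A rough Python 3 equivalent, with a small sample hi.txt written first so the sketch is self-contained (the real file contents are an assumption here):

```python
# create a small sample file (stand-in for the real hi.txt)
with open("hi.txt", "w") as f:
    f.write("hi how are you\nhi how are you\n")

w_list = ['hi', 'how', 'are', 'you']

# read the whole file and flatten it into a list of words
with open("hi.txt") as f:
    words = f.read().split()

for w in w_list:
    print("your file has {} {}".format(words.count(w), w))
```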
# words to search for;
# (stored as a set so `word in search_for` is O(1))
search_for = set(["hi", "how", "are", "you"])
# get search text
# (no need to split into lines)
with open("hi.txt") as inf:
text = inf.read().lower()
# create a translation table that
# - converts punctuation to spaces (this maintains appropriate word-breaks)
# - keeps the apostrophe (for words like "don't" or "couldn't")
import string
trans = str.maketrans({ch: " " for ch in string.punctuation.replace("'", "")})
# apply translation table and split into words
words = text.translate(trans).split()
# count desired words
word_count = sum(word in search_for for word in words)
# show result
print("your file has {} words that are in the list".format(word_count))
Read the file content with a with statement and the open() method.
Remove punctuation from the file content with the string module.
Split the file content with the split() method and iterate over every word with a for loop.
Check whether each word is present in the input list, and increment the count accordingly.
input file: hi.txt
hi, how are you?
hi, how are you?
code:
# Note: this is Python 2 code (string.maketrans and the print
# statement work differently in Python 3)
import string

input_list = ['hi', 'how', 'are', 'you']
filename = "hi.txt"
count = 0
with open(filename, "rb") as fp:
    data = fp.read()
data = data.translate(string.maketrans("", ""), string.punctuation)
for word in data.split():
    if word in input_list:
        count += 1
print "Total number of word present in file from the list are %d" % (count)
Output:
vivek#vivek:~/Desktop/stackoverflow$ python 18.py
Total number of word present in file from the list are 8
vivek#vivek:~/Desktop/stackoverflow$
Do not use variable names that are already defined by the Python interpreter, e.g. list in your code:
>>> list
<type 'list'>
>>> a = list([1,2,3])
>>> a
[1, 2, 3]
>>> list = ["hi", "how"]
>>> b = list([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable
>>>
Use len(). For example, for a list:
myList = ["c", "b", "a"]
len(myList)
it will return 3, meaning there are three items in your list.
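Applied to the word-counting question, the same len() call works on the list that split() returns:

```python
sentence = "be seated somewhere"
words = sentence.split()

print(words)       # ['be', 'seated', 'somewhere']
print(len(words))  # 3 words
```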
I have 4 text files that I want to read to find the top 5 most occurring names. The text files have names in the format "Rasmus,M,11". Below is my code, which is currently able to open and read all of the files; right now, it prints out all of the names in the files.
def top_male_names():
    for x in range(2008, 2012):
        txt = "yob" + str(x) + ".txt"
        file_handle = open(txt, "r", encoding="utf-8")
        file_handle.seek(0)
        line = file_handle.readline().strip()
        while line != "":
            print(line)
            line = file_handle.readline().strip()

top_male_names()
My question is: how can I keep track of all of these names and find the top 5 that occur the most? The only way I could think of was creating a variable for each name, but that wouldn't work because there are 100s of entries in each text file, probably with 100s of different names.
This is the gist of it:
from collections import Counter

counter = Counter()
for line in file_handle:
    name, gender, age = line.split(',')
    counter[name] += 1

print(counter.most_common())
You can adapt it to your program.
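A sketch of that adaptation for the files in the question (hypothetical: it assumes yob2008.txt-style files with one Name,Gender,Count record per line, and follows the answer's approach of counting occurrences; to count total babies instead, sum the third field):

```python
from collections import Counter

def top_male_names(filenames):
    # Tally how often each male name appears across the given files
    counter = Counter()
    for txt in filenames:
        with open(txt, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                name, gender, _count = line.split(',')
                if gender == 'M':
                    counter[name] += 1
    return counter.most_common(5)  # the five most frequent names

# tiny demo file standing in for a real yob file
with open("yob2008.txt", "w", encoding="utf-8") as f:
    f.write("Rasmus,M,11\nEmma,F,20\nRasmus,M,3\nJacob,M,9\n")

print(top_male_names(["yob2008.txt"]))
# [('Rasmus', 2), ('Jacob', 1)]
```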
If you need to count the number of words in a text, you can use a regex.
For example
import re
my_string = "Wow! Is this true? Really!?!? This is crazy!"
words = re.findall(r'\w+', my_string) #This finds words in the document
Output:
>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']
"Is" and "is" are counted as two different words. So we can just uppercase all the words, and then count them.
from collections import Counter
cap_words = [word.upper() for word in words]  # uppercase all the words
word_counts = Counter(cap_words)  # counts the number of times each word appears
Output:
>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})
Now reading a file:
import re
from collections import Counter

with open('file.txt') as f:
    text = f.read()

words = re.findall(r'\w+', text)
cap_words = [word.upper() for word in words]
word_counts = Counter(cap_words)
Then you only have to sort the dict containing all the words by its values (not its keys) to see the top 5 words.
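Counter can do that sort for you via most_common; for instance, with the cap_words list from the earlier example:

```python
from collections import Counter

cap_words = ['WOW', 'IS', 'THIS', 'TRUE', 'REALLY', 'THIS', 'IS', 'CRAZY']
word_counts = Counter(cap_words)

# most_common(n) returns the n highest counts, sorted descending
print(word_counts.most_common(2))  # [('IS', 2), ('THIS', 2)]
```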