I want to check how many times a word is repeated in a file. I have seen other code for finding words in a file, but it doesn't solve my problem. What I mean is: if I search for "Python is my favourite language", the program should split the text and tell me how many times each word is repeated in the file.
def search_tand_export():
    file = open("mine.txt")
    # targetlist = list()
    # targetList = [line.rstrip() for line in open("mine.txt")]
    contentlist = file.read().split(" ")
    string = input("search box").split(" ")
    print(string)
    fre = {}
    outputfile = open("outputfile.txt", 'w')
    for word in contentlist:
        print(word)
        for i in string:
            # print(i)
            if i == word:
                print(f"'{string}' is in text file ")
                outputfile.write(word)
                print(word)
                spl = tuple(string.split())
                for j in range(0, len(contentist)):
                    if spl in contentlist:
                        fre[spl] += 1
                    else:
                        fre[spl] = 1
                sor_list = sorted(fre.items(), key=lambda x: x[1])
                for x, y in sor_list:
                    print(f"Word\tFrequency")
                    print(f"{x}\t{y}")
            else:
                continue
                print(f"The word or collection of word is not present")

search_tand_export()
I don't quite understand what you're trying to do.
But I suppose you are trying to find how many times every word from a given sentence is repeated in the file.
If this is the case, you can try something like this:
sentence = "Python is my favorite programming language"
words = sentence.split()
with open("file.txt") as fp:
file_data = fp.read()
for word in words:
print(f"{file_data.count(word)} occurence(s) of '{word}' found")
Note that the code above is case-sensitive (that is, "Python" and "python" are different words). To make it case-insensitive, convert file_data and each word to lowercase with str.lower() before comparing.
sentence = "Python is my favorite programming language"
words = sentence.split()
with open("file.txt") as fp:
file_data = fp.read().lower()
for word in words:
print(f"{file_data.count(word.lower())} occurence(s) of '{word}' found")
A couple of things to note:
You open a file but never close it (you should). It's better to use with open(...) as ... (a context manager), so the file is closed automatically.
Python strings (as well as lists, tuples, etc.) have a .count(what) method. It returns how many occurrences of what are found in the object.
Read about the PEP 8 coding style and give your variables better names. For example, it is not easy to understand what fre means in your code, but if you name it frequency, the code becomes more readable and easier to work with.
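Putting those notes together, a cleaned-up version of the snippet above might look like this (same behavior, just restructured):

sentence = "Python is my favorite programming language"

# context manager: the file is closed automatically when the block ends
with open("file.txt") as fp:
    file_data = fp.read()

# a descriptive name instead of something like 'fre'
frequency = {word: file_data.count(word) for word in sentence.split()}

for word, count in frequency.items():
    print(f"{count} occurrence(s) of '{word}' found")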
Try this script. It counts how many times a given word is found in the file:

word = 'Python'
count = 0
with open('hello.txt', 'r') as file:
    for line in file:
        count += line.split().count(word)   # occurrences of the word in this line
print('File contains ' + word + ' ' + str(count) + ' times')
I'm new to Python and attempting an exercise where I open a txt file and then read its contents (probably straightforward for most, but I'll admit I'm struggling a bit).
I opened my file and used .read() to read it. I then proceeded to strip the contents of any punctuation.
Next I created a for loop. In this loop I used .split() and added to an expression:
words = words + len(characters)
words being previously defined as 0 outside the loop, and characters being what was split at the beginning of the loop.
Very long story short, the problem I'm having now is that instead of adding each entire word to my counter, each individual character is being added. Is there anything I can do to fix that in my for loop?
my_document = open("book.txt")
readTheDocument = my_document.read()
comma = readTheDocument.replace(",", "")
period = comma.replace(".", "")
stripDocument = period.strip()

numberOfWords = 0
for line in my_document:
    splitDocument = line.split()
    numberOfWords = numberOfWords + len(splitDocument)

print(numberOfWords)
A more Pythonic way is to use with:

with open("book.txt") as infile:
    count = len(infile.read().split())
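If the file is large, .read() pulls the whole thing into memory at once; a line-by-line variant of the same idea keeps memory use flat:

with open("book.txt") as infile:
    # sum the word count of each line instead of reading the whole file
    count = sum(len(line.split()) for line in infile)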
You've got to understand that by using .split() you are not really getting grammatical words, only word-like fragments. If you want proper words, use the nltk module:

import nltk

with open("book.txt") as infile:
    count = len(nltk.word_tokenize(infile.read()))
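Note that word_tokenize depends on NLTK's tokenizer models, which (as far as I recall) are a separate one-time download:

import nltk
nltk.download('punkt')   # tokenizer models used by nltk.word_tokenize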
Just open the file and split to get the count of words.

with open("path/to/file/name.txt") as file:
    count = 0
    for word in file.read().split():
        count = count + 1
    print(count)
I have a dictionary, dict, with some words (2000 of them), and I have a huge text, like a Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
for line in original:
new_line = line
for word in line.split():
if (dict.get(word.lower()) is not None):
new_line = new_line.replace(word,word+"_1")
mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer ones that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem: if a line is Albert is a friend of Albert and my dictionary contains Albert, then after the for cycle the line becomes Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit 2:
To solve the previous problem, I changed your code:

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word + "_1 ")
            else:
                mod.write(word + " ")
        mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line: change the new_line = new_line.replace(...) line to line = line.replace(...), and write(line) afterwards.
You could add words = line.split() and loop with for word in words:, as this removes a call to .split() on every iteration over the words.
You could (manually?) split your large .txt file into multiple smaller files, run an instance of your program on each file, and then combine the outputs into one file. Note: you would have to remember to change the filename for each file you're reading/writing.
So, your code would look like:

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
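Note that this still has the repeated-suffix problem from the edit above, because str.replace substitutes every occurrence each time it is called. A word-boundary regex avoids that by visiting each word exactly once; a sketch, assuming your dictionary is named word_dict:

import re

with open("wiki.txt") as original, open("new.txt", "w") as mod:
    for line in original:
        # \w+ matches each word token once; the replacement function
        # appends "_1" only when the word is in the dictionary
        new_line = re.sub(
            r"\w+",
            lambda m: m.group(0) + "_1" if m.group(0).lower() in word_dict else m.group(0),
            line,
        )
        mod.write(new_line)

This also preserves the original spacing and punctuation, which the split-and-rejoin version above does not.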
Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
The task is to find the number of words, lines, and characters.
Below is my program, but the count of characters without spaces is not correct.
The number of words is correct and the number of lines is correct.
What is the mistake in the same loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
wordslist = line.split()
lines = lines + 1
words = words + len(wordslist)
characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines = 3 (correct)
words = 13 (correct)
characters = 47
whereas the number of characters without spaces is 35, and with spaces is 45.
I've looked at the many answers on this site and I am confused, because I haven't learned some of the other Python functions yet. How do I correct the code while keeping it as simple and basic as the loop I've written?
If possible, I want to find the number of characters without spaces, but even a loop for the number of characters with spaces is fine.
Sum up the length of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:

with open('my_words.txt') as infile:
    lines = 0
    words = 0
    characters = 0
    for line in infile:
        wordslist = line.split()
        lines = lines + 1
        words = words + len(wordslist)
        characters += sum(len(word) for word in wordslist)

print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
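The generator expression does the same work as an explicit loop; spelled out for comparison:

total = 0
for word in wordslist:
    total += len(word)   # same result as sum(len(word) for word in wordslist)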
Improved version
This version takes advantage of enumerate, so you save two lines of code, while keeping the readability:
with open('my_words.txt') as infile:
    words = 0
    characters = 0
    for lineno, line in enumerate(infile, 1):
        wordslist = line.split()
        words += len(wordslist)
        characters += sum(len(word) for word in wordslist)

print(lineno)
print(words)
print(characters)
This line:
with open('my_words.txt') as infile:
opens the file with the promise to close it as soon as you leave the indented block.
It is always good practice to close a file after you are done using it.
Remember that each line (except possibly the last) ends with a line separator,
i.e. "\r\n" for Windows or "\n" for Linux and Mac.
Thus exactly two extra characters, the two newlines, are counted here: 47 instead of 45.
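A quick illustration, using the first line of the sample file above:

line = "hey how are you\n"        # as read from the file
print(len(line))                  # 16: the trailing '\n' is counted
print(len(line.rstrip('\n')))     # 15: newline stripped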
A nice way to overcome this could be to use:
import os

fname = input("enter the name of the file:")
infile = open(fname, 'r')

lines = 0
words = 0
characters = 0

for line in infile:
    line = line.strip(os.linesep)
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)

print(lines)
print(words)
print(characters)
To count the characters, you should count each individual word. So you could have another loop that counts characters:

for word in wordslist:
    characters += len(word)

That ought to do it. The wordslist should probably strip newline characters on the right; something like wordslist = line.rstrip().split(), perhaps.
I found this solution very simple and readable:

with open("filename", 'r') as file:
    text = file.read().strip().split()
    len_chars = sum(len(word) for word in text)
    print(len_chars)
This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>> len("taña")
5
Python 3.5.2
>>> len("taña")
4
Huh? The answer lies in Unicode. In Python 2 a plain string literal is a byte string, and ñ encodes to two bytes in UTF-8, so len counts 5; in Python 3 strings are Unicode and ñ is a single code point, so len counts 4. (And with a combining diacritical, 'n' followed by a combining tilde, even Python 3 counts two code points for one visible character.) So unless you're working with plain ASCII text, you'd better specify which version of Python your character-counting function is for.
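Relatedly, if the text does contain combining sequences, normalizing first makes the count match what a reader sees; a Python 3 sketch:

import unicodedata

s = "tan\u0303a"                              # 'n' + combining tilde renders as "taña"
print(len(s))                                 # 5 code points
print(len(unicodedata.normalize("NFC", s)))   # 4 after composing 'ñ'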
How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re

DATA = """
hey how are you
I am fine and you
Yes I am fine
"""

def get_char_count(s):
    return len(re.findall(r'\S', s))

if __name__ == '__main__':
    print(get_char_count(DATA))
Output
35
It is probably counting the newline characters. Subtract (lines - 1) from the character count (the last line has no trailing newline).
Here is the code (Python 2):
fp = open(fname, 'r+').read()
chars = fp.decode('utf8')
print len(chars)
Check the output. I just tested it.
A more Pythonic solution than the others:

with open('foo.txt') as f:
    text = f.read().splitlines()   # list of lines, line endings stripped

lines = len(text)                                 # length of the list = number of lines
words = sum(len(line.split()) for line in text)   # split each line on spaces, sum the word counts
characters = sum(len(line) for line in text)      # sum the length of each line

print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.
You do have the correct answer, and your code is completely correct. The thing is that an end-of-line character is being counted on each line but the last, which inflates your character count by two (the last line has no new line to go to). If you want to remove this, the simple fudge would be to do as Loaf suggested:

characters = characters - (lines - 1)
See csl's answer for the second part...
Simply skip unwanted characters while calling len:

characters = characters + len([c for c in line if c not in ('\n', '\r', ' ')])

or sum the count:

characters = characters + sum(1 for c in line if c not in ('\n', '\r', ' '))

or build a str from wordslist and take its len:

characters = characters + len(''.join(wordslist))

or sum the characters in wordslist. I think this is the fastest:

characters = characters + sum(1 for word in wordslist for char in word)
You have two problems. One is the line endings and the other is the spaces in between.
Now there are many people who posted pretty good answers, but I find this method easier to understand:
characters = characters + len(line.strip()) - line.strip().count(' ')
line.strip() removes the trailing and leading spaces. Then I'm subtracting the number of spaces from the total length.
It's very simple:

f = open('file.txt', 'rb')
f.seek(0)               # move to the start of the file
print(len(f.read()))
Here is a small program, with low memory usage, for your problem:

with open('FileName.txt') as f:
    lines = f.readlines()

data = ''.join(lines)
print('lines =', len(lines))
print('Words = ', len(data.split()))
data = ''.join(data.split())
print('characters = ', len(data))

lines is a list of the file's lines, so its length is the number of lines. After joining, data holds the file contents as a single string (each word separated by whitespace), so splitting data gives the list of words in the file, and the length of that list is the number of words. Joining that word list again gives all the characters as one string with the whitespace removed, so its length is the number of characters.
Taking the file name (e.g. files.txt) as input, then counting the total number of lines, words, and characters in the file and saving them to variables:

fname = input("Enter the name of the file:")
infile = open(fname, 'r')           # open the file

lines = 0
words = 0
char = 0                            # initialize the counters to zero

for line in infile:
    wordslist = line.split()        # split the line into words
    lines = lines + 1               # count the line
    words = words + len(wordslist)  # count the words
    char = char + len(line)         # count the characters

print("lines are: " + str(lines))
print("words are: " + str(words))
print("chars are: " + str(char))
num_lines = sum(1 for line in open('filename.txt'))
num_words = sum(1 for word in open('filename.txt').read().split())
num_chars = sum(len(word) for word in open('filename.txt').read().split())
Hi, I am learning Python and, out of curiosity, I have written a program to remove the extra words in a file.
I am comparing the text in 'text1.txt' and 'text2.txt', and based on the text in text1, I am removing the words that are extra in text2.
# Bin/Python
text1 = open('text1.txt', 'r')
text2 = open('text2.txt', 'r')

t_l1 = text1.readlines()
t_l2 = text2.readlines()

# printing to check if the file 1 contents were read properly
print ' Printing the file 1 contents:'
w_t1 = []
for i in range(len(t_l1)):
    w_t1 = t_l1[i].split(' ')
    for j in range(len(w_t1)):
        print w_t1[j]

# printing to see if the file 2 contents were read properly
print 'File 2 contents:'
w_t2 = []
for i in range(len(t_l2)):
    w_t2.extend(t_l2[i].split(' '))
for j in range(len(w_t2)):
    print w_t2[j]

print 'comparing and deleting the excess variables.'
w = []   # collects the extra words
i = 1
while i <= len(w_t1):
    # I put all words of file1 in list w_t1 and file2 in list w_t2. Now I am
    # checking if each word in w_t1 is the same as the word in the same place
    # of w_t2; if not, I delete that word from w_t2 and continue the while loop.
    if w_t1[i-1] == w_t2[i-1]:
        print w_t1[i-1]
        i += 1
    else:
        w.append(str(w_t2[i-1]))
        w_t2.remove(w_t2[i-1])

print 'The extra words are: ' + str(w) + '\n'
print w
print 'The original words are: ' + str(w_t2) + '\n'
print 'The extra values are: '
for item in w:
    print item

# opening the file out.txt to write the output
out = open('out.txt', 'w')
out.write(str(w))

# closing the files
text1.close()
text2.close()
out.close()
Say text1.txt has the words "Happy birthday dear Friend" and text2.txt has the words "Happy claps birthday to you my dear Best Friend".
The program should give the extra words in text2.txt, which are "claps, to, you, my, Best".
The above program works fine, but what if I have to do this for a file containing millions of words, or millions of lines? Checking each and every word doesn't seem like a good idea. Does Python have any predefined functions for that?
P.S.: Kindly bear with me if this is a wrong question; I am learning Python. Very soon I'll stop asking these.
This looks like a set problem. First add your words to a set structure:

textSet1 = set()
with open('text1.txt', 'r') as text1:
    for line in text1:
        for word in line.split(' '):
            textSet1.add(word)

textSet2 = set()
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split(' '):
            textSet2.add(word)
then simply apply the set difference operator:

textSet2.difference(textSet1)

which gives you this result:

set(['claps', 'to', 'you', 'my', 'Best'])

You can obtain a list from the previous structure in this way:

list(textSet2.difference(textSet1))
['claps', 'to', 'you', 'my', 'Best']
Then, as you can read here, you shouldn't worry about large file sizes, because with the given loader:

When the next line is read, the previous one will be garbage collected, unless you have stored a reference to it somewhere else.

More about lazy file loading here.
Finally, in a real problem I suppose there is a first set (bad words) that has a relatively small size and a second file with a huge amount of data. If this is the case, then you can avoid the creation of the second set and check each word as you stream the big file (note the not in: we keep the words of text2 that are missing from textSet1):

diff = []
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split(' '):
            if word not in textSet1:
                diff.append(word)
I have a file with entries such as:
26 1
33 2
.
.
.
and another file with sentences in English.
I have to write a script to print the 1st word of sentence number 26, the 2nd word of sentence 33, and so on.
How do I do it?
The following code should do the task, assuming the files are not too large. You may have to modify it to deal with edge cases (double spaces, etc.):

# Get the sentence/word number pairs from the first file
pairs = []
with open('1.txt') as file:
    for line in file:
        sentence_no, word_no = map(int, line.split())
        pairs.append((sentence_no, word_no))

# Get the text from the second file
with open('2.txt') as file:
    text = file.read()

# Parse the text into a list of word lists, one per sentence
data = []
for sentence in text.split('.'):   # split the text into sentences
    words = sentence.split()       # split each sentence into a word list
    if len(words) > 0:
        data.append(words)

# Get the desired result (the numbers in the file are 1-based)
for sentence_no, word_no in pairs:
    print(data[sentence_no - 1][word_no - 1])
Here's a general sketch:
Read the first file into a list (a numeric entry in each element)
Read the second file into a list (a sentence in each element)
Iterate over the entry list, for each number find the sentence and print its relevant word
Now, if you show some effort of how you tried to implement this in Python, you will probably get more help.
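As a minimal sketch of that outline (assuming one sentence per line in the text file, 1-based "sentence word" pairs in the numbers file, and placeholder file names):

with open('sentences.txt') as f:
    sentences = [line.split() for line in f]   # one word list per sentence

with open('numbers.txt') as f:
    for line in f:
        s, w = map(int, line.split())          # sentence number, word number
        print(sentences[s - 1][w - 1])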
The big issue is that you have to decide what separates "sentences". For example, is a '.' the end of a sentence, or maybe part of an abbreviation, e.g. the one I've just used? Secondarily, and less difficult: what separates "words"? E.g., is "TCP/IP" one word, or two?
Once you have sharply defined these rules, you can easily read the text file into a list of "sentences", each of which is a list of "words". Then you read the other file as a sequence of pairs of numbers, and use them as indices into the overall list and into the sublist thus identified. But the problem of sentence and word separation is really the hard part.
In the following code, I am assuming that sentences end with '. '. You can modify it easily to accommodate other sentence delimiters as well. Note that abbreviations will therefore be a source of bugs.
Also, I am going to assume that words are delimited by spaces.
sentences = []
queries = []

english = ""
for line in file2:
    english += line

while english:
    period = english.find('.')
    if period == -1:      # no more sentence delimiters
        break
    sentences.append(english[:period].split())   # one word list per sentence
    english = english[period + 1:]

q = ""
for line in file1:
    q += " " + line.strip()
q = q.split()

for i in range(0, len(q) - 1, 2):
    sentence = int(q[i])
    word = int(q[i + 1])
    queries.append((sentence, word))

for s, w in queries:
    print(sentences[s - 1][w - 1])
I haven't tested this, so please let me know (preferably with the case that broke it) if it doesn't work, and I will look into the bugs.
Hope this helps!