Count words from many .txt files - python

I want to show every word that appears in each .txt file and the number of times it occurs. Here is my code:
listtxt = ['a.txt','b.txt','c.txt',...]
output = open('dictionary.txt','w')
d = {}
for i in listtxt:
    with open(i,'r') as file:
        data = file.read()
    for char in '-.,;”“':
        data = data.replace(char,' ')
    data = data.lower()
    word_list = data.split()
    for word in word_list:
        d[word] = d.get(word,0) + 1
    for words, count in d.items():
        output.write('{} : {} \n'.format(words, count))
output.close()
When I run it, the words and their occurrence counts appear separately for each file, meaning it did not combine them with the words from the previous files. I don't know why it is not working.

Simple indentation error. The last loop (for words, count in d.items()) needs to be outside the first loop (for i in listtxt).
Indentation matters in Python.
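For reference, here is a corrected sketch (the file names are placeholders), with the write loop moved out of the file loop and both files handled so they are always closed:

listtxt = ['a.txt', 'b.txt', 'c.txt']  # placeholder file names
d = {}
for name in listtxt:
    with open(name, 'r') as f:
        data = f.read()
    for char in '-.,;”“':
        data = data.replace(char, ' ')
    for word in data.lower().split():
        d[word] = d.get(word, 0) + 1

# Write once, after every file has been counted.
with open('dictionary.txt', 'w') as output:
    for word, count in d.items():
        output.write('{} : {}\n'.format(word, count))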

Calculating how many times sentence words are repeating in the file

I want to check how many times each word is repeated in the file. I have seen other code for finding words in a file, but it doesn't solve my problem. What I mean is: given the sentence "Python is my favourite language", the program should split the text and tell how many times each word is repeated in the file.
def search_tand_export():
    file = open("mine.txt")
    #targetlist = list()
    #targetList = [line.rstrip() for line in open("mine.txt")]
    contentlist = file.read().split(" ")
    string=input("search box").split(" ")
    print(string)
    fre={}
    outputfile=open("outputfile.txt",'w')
    for word in contentlist:
        print(word)
        for i in string:
            # print(i)
            if i == word:
                print(f"'{string}' is in text file ")
                outputfile.write(word)
                print(word)
                spl=tuple(string.split())
                for j in range(0,len(contentist)):
                    if spl in contentlist:
                        fre[spl]+=1
                    else:
                        fre[spl]=1
                sor_list=sorted(fre.items(),key =lambda x:x[1])
                for x,y in sor_list:
                    print(f"Word\tFrequency")
                    print(f"{x}\t{y}")
            else:
                continue
    print(f"The word or collection of word is not present")

search_tand_export()
I don't quite understand what you're trying to do.
But I suppose you are trying to find how many times every word from a given sentence is repeated in the file.
If this is the case, you can try something like this:
sentence = "Python is my favorite programming language"
words = sentence.split()
with open("file.txt") as fp:
file_data = fp.read()
for word in words:
print(f"{file_data.count(word)} occurence(s) of '{word}' found")
Note that the code above is case-sensitive (that is, "Python" and "python" are different words). To make it case-insensitive, convert file_data and every word being compared to lowercase using str.lower().
sentence = "Python is my favorite programming language"
words = sentence.split()
with open("file.txt") as fp:
file_data = fp.read().lower()
for word in words:
print(f"{file_data.count(word.lower())} occurence(s) of '{word}' found")
A couple of things to note:
You open a file and never close it (although you should). It's better to use with open(...) as ... (a context manager), so the file is closed automatically.
Python strings (as well as lists, tuples, etc.) have a .count(what) method. It returns how many occurrences of what are found in the object.
Read about the PEP 8 coding style and give better names to variables. For example, it is not easy to understand what fre means in your code. If you name it frequency, the code becomes more readable and easier to work with.
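Note also that str.count() matches substrings, so "is" would also be counted inside "this". A whole-word variant of the snippet above, under the same assumptions:

sentence = "Python is my favorite programming language"
with open("file.txt") as fp:
    file_words = fp.read().lower().split()
for word in sentence.split():
    # Compare whole tokens, so "is" does not match inside "this".
    print(f"{file_words.count(word.lower())} occurrence(s) of '{word}' found")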
Try this script. It counts how many times the given word occurs in the file:
file = open('hello.txt', 'r')
word = 'Python'
count = 0
for line in file:
    count += line.count(word)  # occurrences of the word in this line
file.close()
print('File contains ' + word + ' ' + str(count) + ' times')

How can I merge two snippets of text that both contain a desired keyword?

I have a program that pulls out text around a specific keyword. I'm trying to modify it so that if two keywords are close enough together, it just shows one longer snippet of text instead of two individual snippets.
My current code, below, adds words after the keyword to a list and resets the counter if another keyword is found. However, I've found two problems with this. The first is that the data rate limit in my Spyder notebook is exceeded, and I haven't been able to deal with that. The second is that although this would make a longer snippet, it wouldn't get rid of the duplicate.
Does anyone know a way to get rid of the duplicate snippet, or how to merge the snippets in a way that doesn't exceed the data rate limit (or how to change the Spyder rate limit)? Thank you!!
def occurs(word1, word2, file, filewrite):
    import os
    infile = open(file,'r') #opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2] #this list allows for multiple words
    wordsString = ''.join(lines) #splits file into individual words
    words = wordsString.split()
    f = open(file, 'w')
    f.write("start")
    f.write(os.linesep)
    g = open(filewrite,'w')
    g.write("start")
    g.write(os.linesep)
    for item in wordlist: #multiple words
        matches = [i for i, w in enumerate(words) if w.lower().find(item) != -1]
        #above line goes through lines, finds the specific words we want
        for m in matches: #next three lines find each instance of the word, print out surrounding words
            list = []
            s = ""
            l = " ".join(words[m-20:m+1])
            j = 0
            while j < 20:
                list.append(words[m+i])
                j = j+1
                if words[m+i] == word1 or words[m+i] == word2:
                    j = 0
            print (list)
            k = " ".join(list)
            f.write(f"{s}...{l}{k}...") #writes the data to the external file
            f.write(os.linesep)
            g.write(str(m))
            g.write(os.linesep)
    f.close
    g.close
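One way to avoid the duplicate snippets, sketched here under the assumption that each snippet is a window of words around a match: collect the index windows first, merge the ones whose ranges overlap, and only then join each window into text (merged_snippets is a hypothetical helper, not part of the code above):

def merged_snippets(words, keywords, radius=20):
    # Indices of every keyword occurrence, in order of appearance.
    hits = [i for i, w in enumerate(words) if w.lower() in keywords]
    windows = []
    for i in hits:
        start, end = max(0, i - radius), min(len(words), i + radius + 1)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a duplicate.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return ["..." + " ".join(words[s:e]) + "..." for s, e in windows]

# Usage: snippets = merged_snippets(text.split(), {"word1", "word2"})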

Word Count from File: Is it having problems opening the file, or have I coded it incorrectly?

Problem: Program seems to get stuck opening a file to read.
My problem is that at the very beginning the program seems to be broken. It just displays
[(1, 'C:\Users\....\Desktop\Sense_and_Sensibility.txt')]
over and over, never-ending.
(NOTE: .... is a replacement for the purpose of posting because my computer username is my full name).
I'm not sure if I've coded this completely incorrectly, or if it's having problems opening the file. Any help is appreciated.
The program should:
1: open a file, replace all punctuation with spaces, change all words to lowercase, then store them in a dictionary.
2: look at a list of words (stop words) that will be removed from the original dictionary.
3: count the remaining words and sort based on frequency.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" # words to delete
with open(fname) as file: # have the program run the file
for line in file: # loop through
fname.replace('-.,"!?', " ") # replace punc. with space
words = fname.lower() # make all words lowercase
word_list = fname.split() # separate the words, store
word_dict = {} # create a dictionary
with open(swfilename) as delete: # open stop word list
for line in delete:
sw_list = swfilename.split() # separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) # delete common words
for word in word_list: # loop through
word_dict[word] = word_dict.get(word, 0) + 1 # count frequency
word_freq = [] # create index
for key, value in word_dict.items(): # count occurrences
word_freq.append((value, key)) # append freq list
word_freq.sort(reverse=True) # sort the words by freq
print(word_freq) # print most to least
Importing files in Windows using Python is somewhat different compared to macOS and Linux.
Just change the path of the file from fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
to fname = "C:\\Users\\....\\Desktop\\Sense_and_Sensibility.txt"
Use double backslashes.
There are a couple of issues with your code. I will only discuss the most obvious ones, given that it is impossible to reproduce your exact observations because the input you are using is not accessible to the readers.
I will first report your code verbatim and mark weak points with ??? followed by a number, which I address after the code.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" #file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" #words to delete
with open(fname) as file: #???(1) have the program run the file
for line in file: #loop through
fname.replace ('-.,"!?', " ") #???(2) replace punc. with space
words = fname.lower() #???(3) make all words lowercase
word_list = fname.split() #separate the words, store
word_dict = {} #???(4) create a dictionary
with open(swfilename) as delete: #open stop word list
for line in delete:
sw_list = swfilename.split() #separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) #???(5) delete common words
for word in word_list: #???(6) loop through
word_dict[word] = word_dict.get(word, 0) + 1 #???(7) count frequency
word_freq = [] #???(8)create index
for key, value in word_dict.items(): #count occurrences
word_freq.append((value, key)) #append freq list
word_freq.sort(reverse = True) #sort the words by freq
print(word_freq) #print most to least
(minor) file is a built-in name in Python, and it is good practice not to reuse it for custom purposes as you are doing.
(major) .replace() replaces the exact string on the left with the exact string on the right, but what you want is some sort of multi_replace(), which you could implement yourself (for example, as a function) with consecutive calls to .replace(), e.g. in a loop (or using functools.reduce()); see the sketch after this list.
(major) fname contains the file name (the path, actually), not the content of the file you want to work with.
(major) You are looping through the lines of the file, but if you create your word_list and word_dict for each line, you "overwrite" the content at each iteration. Also, word_dict is created empty and never filled.
(major) The logic you are trying to implement will not work on a dictionary, because dictionaries cannot contain multiple identical keys. A more effective approach is to create a filtered_list from the word_list by excluding the stop words; the dictionary can then be used to implement a counter. It may be worth learning how to implement a counter yourself, but keep in mind that collections.Counter from the standard library (accessible via import collections) does exactly what you want; see the sketch after this list.
(major) At this point there is nothing useful left from your code: looping through the original list instead of through the filtered list means the stop-word information is lost.
(major) dictionary[key] can be used both for reading (which you do not do) and for writing (which you do) the value associated with a specific key in a dictionary.
(minor) Your approach to sorting by word frequency would work, but a much better approach is to use the key parameter of .sort() and sorted().
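To make points (2) and (5) concrete, here is a minimal sketch (the file paths are placeholders for yours):

import collections

def multi_replace(text, chars, replacement=" "):
    # Point (2): consecutive .replace() calls, one character at a time.
    for ch in chars:
        text = text.replace(ch, replacement)
    return text

with open("Sense_and_Sensibility.txt") as book:  # placeholder path
    text = multi_replace(book.read(), '-.,"!?').lower()
with open("stopwords.txt") as sw:  # placeholder path
    stop_words = set(sw.read().split())

# Point (5): filter the word list first, then count with collections.Counter.
filtered_list = [w for w in text.split() if w not in stop_words]
word_freq = collections.Counter(filtered_list)
print(word_freq.most_common())  # already sorted from most to least frequent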
Hope this helps!

printing 5 words before and after a specific word in a file in python

I have a folder which contains some other folders, and these folders contain some text files. (The language is Persian.) I want to print the 5 words before and after a keyword, with the keyword in the middle. I wrote the code, but it gives the 5 words at the start and the end of the line, not the words around the keyword. How can I fix it?
Note: I have only included the end of the code, which relates to the question above. The start of the code opens and normalizes the files.
def c():
    y = "آرامش"
    text = normal_text(folder_path)  # the first function to open and normalize the files
    for i in text:
        for line in i:
            if y in line:
                z = line.split()
                print(z[-6], z[-5],
                      z[-4], z[-3],
                      z[-2], z[-1], y,
                      z[+1], z[+2],
                      z[+3], z[+4],
                      z[+5], z[+6])
what I expect is something like this:
word word word word word keyword word word word word word
Each sentence in a new line.
Try this. It splits the line into words, then calculates how much to show before and after the keyword (at most 5 words on each side, or however many are actually available) and prints that slice:
words = line.split()
if y in words:
    index = words.index(y)
    before = index - min(index, 5)
    after = index + min(len(words) - 1 - index, 5) + 1
    print(words[before:after])
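For example, if the keyword is at index = 2 in a 20-word line, then before = 2 - min(2, 5) = 0 and after = 2 + min(17, 5) + 1 = 8, so words[0:8] yields the two words available before the keyword and the full five after it.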
You need to get the neighboring words based on your keyword's index. You can use the list.index() method to get the intended index, then use simple slicing to get the expected words:
for f in normal_text(folder_path):
    for line in f:
        if keyword in line:
            words = line.split()
            ind = words.index(keyword)
            print(words[max(0, ind-5):min(ind+6, len(words))])
Or, as a more optimized approach, you can use a generator function to produce the words as an iterator, which is much more efficient in terms of memory usage.
def get_words(keyword):
    for f in normal_text(folder_path):
        for line in f:
            if keyword in line:
                words = line.split()
                ind = words.index(keyword)
                yield words[max(0, ind-5):min(ind+6, len(words))]
Then you can simply loop over the result to print it, etc.
y = "آرامش"
for words in get_words(y):
    # do stuff
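For example, to print each context window as one line of text (a hypothetical completion of the loop body):

y = "آرامش"
for words in get_words(y):
    print(" ".join(words))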
def c():
    y = "آرامش"
    text = normal_text(folder_path)  # the first function to open and normalize the files
    for i in text:
        for line in i:
            split_line = line.split()
            if y in split_line:
                index = split_line.index(y)
                print(' '.join(split_line[max(0, index-5):min(index+6, len(split_line))]))
Assuming the keyword must be an exact word.

Displaying the Top 10 words in a string

I am writing a program that grabs a txt file off the internet and reads it. It then displays a bunch of data related to that txt file. Now, this all works well, until we get to the end. The last thing I want to do is display the top 10 most frequent words used in the txt file. The code I have right now only displays the most frequent word 10 times. Can someone look at this and tell me what the problem is? The only part you have to look at is the last part.
import urllib
open = urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt").read()
v = str(open) # this variable makes the file a string
strip = v.replace(" ", "") # this trims spaces
char = len(strip) # this variable counts the number of characters in the string
ch = v.splitlines() # this variable separates the lines
line = len(ch) # this counts the number of lines
print "Here's the number of lines in your file:", line
wordz = v.split()
print wordz
print "Here's the number of characters in your file:", char
spaces = v.count(' ')
words = ''.join(c if c.isalnum() else ' ' for c in v).split()
words = len(words)
print "Here's the number of words in your file:", words
topten = map(lambda x:filter(str.isalpha,x.lower()),v.split())
print "\n".join(sorted(words,key=words.count)[-10:][::-1])
Use collections.Counter to count all the words; Counter.most_common(10) will return the ten most common words and their counts:
wordz = v.split()
from collections import Counter
c = Counter(wordz)
print(c.most_common(10))
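Note that this counts raw whitespace-separated tokens, so "Alice," and "alice" end up as different entries. To match the normalization already used in the question (lowercasing and keeping only alphanumeric characters), clean the tokens first:

from collections import Counter
# v holds the file contents, as in the question's code.
cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in v).lower().split()
print(Counter(cleaned).most_common(10))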
Counting all the words in the text file line by line (note that open() only works on local file paths, not URLs, so download the file first or iterate over the urllib response instead; the local file name below is a placeholder):
from collections import Counter
with open("alice30.txt") as f:  # a local copy of the text file
    c = Counter()
    for line in f:
        c.update(line.split())  # Counter.update adds the counts
print(c.most_common(10))
To get the total number of characters in the file, sum the length of each key multiplied by the number of times it appears:
print(sum(len(k)*v for k,v in c.items()))
To get the word count:
print(sum(c.values()))
