I have a script that opens and reads a text file, splits out every word, and makes a list of those words. I made a Counter to count how many times each word from the list appears. Then I want to export to a .csv file, with each row looking something like this:
word hello appears 10 times
word house appears 5 times
word tree appears 3 times
...and so on.
Can you show me what I need to change here to make the script work?
from collections import Counter
import re
import csv
cnt = Counter()
writefile = open('test1.csv', 'wb')
writer = csv.writer(writefile)
with open('screenplay.txt') as file: # Open .txt file with text
    text = file.read().lower()
    file.close()
    text = re.sub('[^a-z\ \']+', " ", text)
    words = list(text.split()) # Making list of each word
    for word in words:
        cnt[word] += 1 # Counting how many times word appears
        for key, count in cnt.iteritems():
            key = text
            writer.writerow([cnt[word]])
The big issue is that your second for-loop is happening for every occurrence of every word, not just once for each unique word. You will need to de-dent the whole loop so that it executes after you have finished your counting. Try something like this:
from collections import Counter
import re
import csv
cnt = Counter()
writefile = open('test1.csv', 'wb')
writer = csv.writer(writefile)
with open('screenplay.txt') as file:
    text = file.read().lower()
text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())
for word in words:
    cnt[word] += 1
for key, count in cnt.iteritems(): # De-dent this block
    writer.writerow([key, count]) # Output both the key and the count
writefile.close() # Make sure to close your file to guarantee it gets flushed
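For reference, under Python 3 the same fix looks slightly different: csv files are opened in text mode with newline='', and Counter exposes items()/most_common() rather than iteritems(). A minimal sketch, using inline sample text and an in-memory buffer in place of screenplay.txt and test1.csv:

```python
import csv
import io
import re
from collections import Counter

# Sample text standing in for the contents of screenplay.txt
text = "Hello house! Hello tree... hello HOUSE."
text = re.sub(r"[^a-z ']+", " ", text.lower())

cnt = Counter(text.split())  # Counter can consume the word list directly

# In Python 3, csv files are opened in text mode with newline='';
# a StringIO buffer stands in for test1.csv here
out = io.StringIO()
writer = csv.writer(out)
for word, count in cnt.most_common():
    writer.writerow([word, count])

print(out.getvalue())
```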
Currently, I have
import re
import string
input_file = open('documents.txt', 'r')
stopwords_file = open('stopwords_en.txt', 'r')

stopwords_list = []
for line in stopwords_file.readlines():
    stopwords_list.extend(line.split())
stopwords_set = set(stopwords_list)

word_count = {}
for line in input_file.readlines():
    words = line.strip()
    words = words.translate(str.maketrans('', '', string.punctuation))
    words = re.findall('\w+', line)
    for word in words:
        if word.lower() in stopwords_set:
            continue
        word = word.lower()
        if not word in word_count:
            word_count[word] = 1
        else:
            word_count[word] = word_count[word] + 1

word_index = sorted(word_count.keys())
for word in word_index:
    print(word, word_count[word])
What it does is parse through a txt file I have, remove stopwords, and output the number of times each word appears in the document it is reading from.
The problem is that the txt file is not one file, but five.
The text in the document looks something like this:
1
The cat in the hat was on the mat
2
The rat on the mat sat
3
The bat was fat and named Pat
Each "document" is a line preceded by the document ID number.
In Python, I want to find a way to go through 1, 2, and 3 and count how many times a word appears in an individual document, as well as the total amount of times a word appears in the whole text file - which my code currently does.
e.g. "mat" appears 2 times in the text file overall, and it appears in Document 1 and Document 2. Ideally less wordy.
Give this a try:
import string

def count_words(file_name):
    word_count = {}
    with open(file_name, 'r') as input_file:
        doc_id = None
        for line in input_file:
            line = line.strip()
            if line.isdigit():  # a document ID on a line of its own
                doc_id = line
                continue
            for word in line.split():
                word = word.translate(str.maketrans('', '', string.punctuation)).lower()
                if word in word_count:
                    word_count[word][doc_id] = word_count[word].get(doc_id, 0) + 1
                else:
                    word_count[word] = {doc_id: 1}
    return word_count

word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
    print(f"{word} appears in: {doc_count}")
You have deleted your previous, similar question and with it my answer, so I'm not sure it's a good idea to answer again. I'll give a slightly different answer, without groupby, although I think that approach was fine.
You could try:
import re
from collections import Counter
from string import punctuation
with open("stopwords_en.txt", "r") as file:
    stopwords = set().union(*(line.rstrip().split() for line in file))

translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")

with open("documents.txt", "r") as file:
    word_count, doc_no = {}, 0
    for line in file:
        match = re_new_doc.match(line)
        if match:
            doc_no = int(match[1])
            continue
        line = line.translate(translation)
        for word in re.findall(r"\w+", line):
            word = word.casefold()
            if word in stopwords:
                continue
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}
I would build the translation table only once, beforehand, not again for each line.
The regex for identifying a new document, (\d+)\s*$ (used with match, so anchored at the start of the line), looks for a line containing digits and nothing else, except maybe some whitespace before the line break. You have to adjust it if the identifier follows a different logic.
word_count records each occurrence of a word by appending the number of the current document to a list.
word_count_overall just takes the length of the respective list to get the overall count of a word.
word_count_docs applies a Counter to the lists to get the counts per document for each word.
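Applied to the sample documents from the question (stopword filtering left out for brevity, and the lines built inline rather than read from documents.txt), the derivation of the two summary dicts can be checked in isolation:

```python
from collections import Counter

# Occurrence lists as the loop above would build them, one entry
# per occurrence, tagged with the document number
docs = {1: "the cat in the hat was on the mat",
        2: "the rat on the mat sat",
        3: "the bat was fat and named pat"}

word_count = {}
for doc_no, line in docs.items():
    for word in line.split():
        word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(d) for word, d in word_count.items()}
word_count_docs = {word: Counter(d) for word, d in word_count.items()}

print(word_count_overall["mat"])  # total occurrences across all documents
print(word_count_docs["mat"])     # per-document breakdown
```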
So I'm trying to extract the most used words from a .txt file and then put the 4 most common words into a csv file (appending if need be). At the moment it extracts the most common words and appends them to a csv file, but it writes each letter to its own cell.
import collections
import pandas as pd
import matplotlib.pyplot as plt
import csv

fields = ['first', 'second', 'third']

# Read input file, note the encoding is specified here
# It may be different in your text file
file = open('pyTest.txt', encoding="utf8")
a = file.read()

# Stopwords
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr', 'mrs', 'one', 'two', 'said']))

# Instantiate a dictionary, and for every word in the file,
# add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}

# To eliminate duplicates, remember to split by punctuation, and use case delimiters.
for word in a.lower().split():
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace(":", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    word = word.replace("“", "")
    word = word.replace("‘", "")
    word = word.replace("*", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word in word_counter.most_common(n_print):
    print(word[0])

# Close the file
file.close()

with open('old.csv', 'a') as out_file:
    writer = csv.writer(out_file)
    for word in word_counter.most_common(4):
        print(word)
        writer.writerow(word[0])
Output csv file
p,i,p,e
d,i,a,m,e,t,e,r
f,i,t,t,i,n,g,s
o,u,t,s,i,d,e
You can use a generator expression to extract the first item of each tuple in the list returned by the most_common method, and write that as a single row instead:
with open('old.csv', 'a') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(word for word, _ in word_counter.most_common(4))
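The difference is visible with an in-memory buffer in place of old.csv (the counts below are made up for illustration): writerow now receives one word per cell, whereas passing a bare string makes writerow iterate over its characters.

```python
import csv
import io
from collections import Counter

# Illustrative counts standing in for the real word_counter
word_counter = Counter({"pipe": 9, "diameter": 7, "fittings": 5, "outside": 4, "said": 2})

buf = io.StringIO()
writer = csv.writer(buf)
# Each word becomes one cell; writer.writerow("pipe") would instead
# treat the string as a sequence and emit p,i,p,e
writer.writerow(word for word, _ in word_counter.most_common(4))

print(buf.getvalue())
```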
I am trying to count elements in a text file. I know I am missing an obvious part, but I can't put my finger on it. This is what I currently have, which just produces a count of the letter "f", not of the file:
filename = open("output3.txt")
f = open("countoutput.txt", "w")
import collections

for line in filename:
    for number in line.split():
        print(collections.Counter("f"))
        break
import collections

counts = collections.Counter()  # create a new counter
with open("output3.txt") as infile:  # open the file for reading
    for line in infile:
        for number in line.split():
            counts.update((number,))
            print("Now there are {} instances of {}".format(counts[number], number))
print(counts)
Say that I have a file of restaurant names and that I need to search through said file for a particular string like "Italian". How would the code look if I searched the file for the string and printed out the number of restaurants containing it?
f = open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt", "r")
content = f.read()
f.close()
lines = content.split("\n")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    print("There are", len(f.readlines()), "restaurants in the dataset")

with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    searchlines = f.readlines()

for i, line in enumerate(searchlines):
    if "GREEK" in line:
        for l in searchlines[i:i+3]:
            print(l)
        print()
You could count all the words using a Counter dict and then do lookups for certain words:
from collections import Counter
from string import punctuation
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
    # sum(1 for _ in f) -> counts lines
    print("There are", sum(1 for _ in f), "restaurants in the dataset")
    # reset file pointer back to the start
    f.seek(0)
    # get count of how many times each word appears, at most once per line
    cn = Counter(word.strip(punctuation).lower() for line in f for word in set(line.split()))
    print(cn["italian"])  # no KeyError if missing, will be 0
We use set(line.split()) so that if a word appears twice in a certain restaurant's name, we only count it once. This looks for exact matches; if you also want to match partials, like foo in foobar, it becomes more complex to build a dataset in which you can efficiently look up multiple words.
If you really just want to count one word, all you need to do is sum how many lines contain the substring:
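The effect of the set() is easy to verify on a couple of made-up restaurant lines:

```python
from collections import Counter
from string import punctuation

# Hypothetical restaurant names, one per line as in the dataset
lines = ["Luigi's Italian Italian Kitchen",
         "Athens GREEK Taverna",
         "Roma Italian Cafe"]

# set(line.split()) collapses repeats within a line, so each restaurant
# contributes at most one count per word
cn = Counter(word.strip(punctuation).lower()
             for line in lines for word in set(line.split()))

print(cn["italian"])  # 2 restaurants mention it, not 3 occurrences
```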
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
    print("There are", sum(1 for _ in f), "restaurants in the dataset")
    f.seek(0)
    sub = "italian"
    count = sum(sub in line.lower() for line in f)
If you want exact matches, you would need the split logic again or to use a regex with word boundaries.
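A word-boundary regex is one way to keep the one-pass sum while matching exactly (a sketch, with hypothetical restaurant lines):

```python
import re

lines = ["Luigi's Italian Kitchen",
         "The Italianate House",
         "Roma Italian Cafe"]

sub = "italian"
pattern = re.compile(r"\b%s\b" % re.escape(sub))
# \b boundaries stop "italian" from matching inside "Italianate"
count = sum(bool(pattern.search(line.lower())) for line in lines)

print(count)
```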
You read the file into a string.
Then use the count method of strings.
Code:
# Let the file be read into a string s1
print s1.count("italian")
I am writing a program that grabs a txt file off the internet and reads it. It then displays a bunch of data related to that txt file. Now, this all works well, until we get to the end. The last thing I want to do is display the top 10 most frequent words used in the txt file. The code I have right now only displays the most frequent word 10 times. Can someone look at this and tell me what the problem is? The only part you have to look at is the last part.
import urllib
open = urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt").read()
v = str(open) # this variable makes the file a string
strip = v.replace(" ", "") # this trims spaces
char = len(strip) # this variable counts the number of characters in the string
ch = v.splitlines() # this variable seperates the lines
line = len(ch) # this counts the number of lines
print "Here's the number of lines in your file:", line
wordz = v.split()
print wordz
print "Here's the number of characters in your file:", char
spaces = v.count(' ')
words = ''.join(c if c.isalnum() else ' ' for c in v).split()
words = len(words)
print "Here's the number of words in your file:", words
topten = map(lambda x:filter(str.isalpha,x.lower()),v.split())
print "\n".join(sorted(words,key=words.count)[-10:][::-1])
Use collections.Counter to count all the words; Counter.most_common(10) will return the ten most common words and their counts:
wordz = v.split()
from collections import Counter
c = Counter(wordz)
print(c.most_common(10))
Using urllib to fetch the file (the built-in open cannot read a URL) and counting all the words in it line by line:
import urllib
from collections import Counter
from contextlib import closing

with closing(urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt")) as f:
    c = Counter()
    for line in f:
        c.update(line.split())  # Counter.update adds the values
print(c.most_common(10))
To get total characters in the file get the sum of length of each key multiplied by the times it appears:
print(sum(len(k)*v for k,v in c.items()))
To get the word count:
print(sum(c.values()))
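Both identities can be checked on a tiny Counter built by the same line-by-line update (sample lines made up for the sketch):

```python
from collections import Counter

lines = ["alice was beginning", "alice alice"]

c = Counter()
for line in lines:
    c.update(line.split())

total_words = sum(c.values())                         # every occurrence counts
total_chars = sum(len(k) * v for k, v in c.items())   # characters, whitespace excluded

print(total_words, total_chars)
```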