Counting words in Python from a text file

I need to open a text file and count the number of occurrences of the names given in another file. The program should write name; count pairs, separated by semicolons, into a file in .csv format.
It should look like:
Jane; 77
Hector; 34
Anna; 39
...
I tried to use Counter, but then the output looks like a list, so I think this is the wrong way to do the task:
import re
import collections
from collections import Counter
wanted = re.findall('\w+', open('iliadcounts.csv').read().lower())
cnt = Counter()
words = re.findall('\w+', open('pg6130.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
but this is definitely not the right code for this task...

You can feed the whole list of words to Counter at once; it will count them for you.
You can then print only the words in wanted by iterating over it:
import re
from collections import Counter

# create some demo data as I do not have your data at hand - uses your filenames
def create_demo_files():
    with open('iliadcounts.csv', "w") as f:
        f.write("hug,crane,box")
    with open('pg6130.txt', "w") as f:
        f.write("hug,shoe,blues,crane,crane,box,box,box,wood")

create_demo_files()

# work with your files
with open('iliadcounts.csv') as f:
    wanted = re.findall(r'\w+', f.read().lower())

with open('pg6130.txt') as f:
    cnt = Counter(re.findall(r'\w+', f.read().lower()))

# printed output for all words in wanted (all words are counted)
for word in wanted:
    print("{}; {}".format(word, cnt.get(word)))

# would work as well:
# https://docs.python.org/3/library/string.html#string-formatting
# print(f"{word}; {cnt.get(word)}")
Output:
hug; 1
crane; 2
box; 3
Or you can print the whole Counter:
print(cnt)
Output:
Counter({'box': 3, 'crane': 2, 'hug': 1, 'shoe': 1, 'blues': 1, 'wood': 1})
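If the pairs should end up in a .csv file rather than on the screen, the same loop can write through csv.writer with a ';' delimiter. A sketch; 'counts.csv' is a placeholder filename, and wanted and cnt are the variables from the code above:
import csv

with open('counts.csv', 'w', newline='') as out:
    writer = csv.writer(out, delimiter=';')
    for word in wanted:
        # rows come out as "hug;1"; write "{}; {}".format(...) lines
        # manually if the space after the semicolon matters
        writer.writerow([word, cnt.get(word)])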
Links:
https://pyformat.info/
string formatting
with open(...) as f:

Related

Count common words

I have two .txt files and I need to compare them and just count the common words. What I should get is just a total count of how many words the two different files have in common. How can I do it? Can you help me? This is the code I have tried, but I need only the total count, e.g. "I have found 125 occurrences".
So now you have a dict with non-zero values for common words. Counting them is as simple as:
sum(v != 0 for v in dct.values())
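(dct itself comes from the asker's code, which is not shown in this excerpt; assuming it maps each word to a count that is zero when the word is not common to both files, a tiny hypothetical example:)
# hypothetical dct: word -> count, 0 when the word is not in both files
dct = {"well": 2, "she": 0, "ends": 1, "not": 0}
print(sum(v != 0 for v in dct.values()))  # -> 2, since True counts as 1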
If I understood correctly, I would create one set for each of the files with its words, but I would also convert everything to lowercase to be sure that I'll have "proper" matches. Then I would create an intersection of those sets and just get the length.
file1_words = set()
file2_words = set()

file1 = open("text1.txt")
file2 = open("verbs.txt")

for word in file1.read().strip().split():
    file1_words.add(word.lower())
for word in file2.read().strip().split():
    file2_words.add(word.lower())

print(file1_words)
print(file2_words)

common_words = file1_words.intersection(file2_words)
print(len(common_words))
Based on your explanation, what you want to do is sum the values of the dictionary you have created. You can do this simply by:
print(sum(dct.values()))
This counts all the occurrences of the words in common.
This uses Counter; see https://docs.python.org/fr/3/library/collections.html#collections.Counter
import re
from collections import Counter

with open('text1.txt', 'r') as f:
    words1 = re.findall(r'\w+', f.read().lower())
with open('verbs.txt', 'r') as f:
    words2 = re.findall(r'\w+', f.read().lower())

counter1 = Counter(words1)
counter2 = Counter(words2)
common = set(counter1.keys()).intersection(counter2.keys())
sum(counter1[e] + counter2[e] for e in common)
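As a side note, Counter supports intersection directly: counter1 & counter2 keeps the minimum of the two counts for each common word. That is a different total than the sum above, but it is handy when "occurrences in common" means matched pairs:
common_counts = counter1 & counter2  # min(count1, count2) per word
print(sum(common_counts.values()))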

Total number of first words in a txt file

I need a program that counts the top 5 most common first words of the lines in a file, and which skips lines where the first word is followed by "DM" or "RT".
I don't have any code so far because I'm completely lost.
f = open("C:/Users/Joe Simpleton/Desktop/talking.txt", "r")
?????
Read each line of your text in. For each line, split it into words using a regular expression; this will return a list of the words. If there are at least two words, test the second word to make sure it is not in your exclusion list. Then use a Counter() to keep track of all of the word counts. Store the lowercase of each word so that uppercase and lowercase versions of the same word are not counted separately:
from collections import Counter
import re

word_counts = Counter()
with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        if (len(words) > 1 and words[1] not in ['DM', 'RT']) or len(words) == 1:
            word_counts.update(word.lower() for word in words)

print(word_counts.most_common(5))
The Counter() has a useful feature in being able to show the most common values.
Not tested, but should work roughly like that:
from collections import Counter

count = Counter()
with open("path") as f:
    for line in f:
        parts = line.split(" ")
        if parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1
print(count.most_common(5))
You should also add a check that ensures that parts has at least two elements, otherwise parts[1] will raise an IndexError on single-word lines.
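A guarded variant of the sketch above (still using the placeholder "path", and counting single-word lines the way the first answer does):
from collections import Counter

count = Counter()
with open("path") as f:
    for line in f:
        parts = line.split()
        # count the first word when the line has only one word, or when
        # the second word is not "DM"/"RT"
        if parts and (len(parts) == 1 or parts[1] not in ["DM", "RT"]):
            count[parts[0]] += 1
print(count.most_common(5))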

How to save data from a text file to a Python dictionary and select designated data

Example of data in txt file:
apple
orange
banana
lemon
pears
Code for filtering words with 5 letters, without a dictionary:
def numberofletters(n):
    file = open("words.txt", "r")
    lines = file.readlines()
    file.close()
    for line in lines:
        if len(line) == 6:
            print(line)
    return

print("===================================================================")
print("This program can be used to identify and print out all words in 5 letters from words.txt")
n = input("Please press enter to start filtering words")
print("===================================================================")
numberofletters(n)
My question is: how do I create a dictionary whose keys are integers and whose values are the English words with that many letters, and then use the dictionary to identify and print out all the 5-letter words?
Imagine this with a huge list of words.
Sounds like a job for a defaultdict.
>>> from collections import defaultdict
>>> length2words = defaultdict(set)
>>>
>>> with open('file.txt') as f:
...     for word in f:  # one word per line
...         word = word.strip()
...         length2words[len(word)].add(word)
...
>>> length2words[5]
set(['lemon', 'apple', 'pears'])
If you care about duplicates and insertion order, use a defaultdict(list) and append instead of add.
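For example, a sketch of that variant on the same data:
>>> from collections import defaultdict
>>> length2words = defaultdict(list)
>>> with open('file.txt') as f:
...     for word in f:
...         word = word.strip()
...         length2words[len(word)].append(word)  # keeps file order and duplicates
...
>>> length2words[5]
['apple', 'lemon', 'pears']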
You can make your for loop like this:
for line in lines:
    line_len = len(line)
    if line_len not in dicword:
        dicword[line_len] = [line]
    else:
        dicword[line_len].append(line)
Then you can get it by just doing dicword[5]
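As a side note, dict.setdefault collapses the if/else into one line (same variables as above):
for line in lines:
    # setdefault returns the existing list for this length,
    # or inserts and returns a new empty one
    dicword.setdefault(len(line), []).append(line)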
If I understood correctly, you need to filter your document and write the result into a file. For that you can write a CSV file with DictWriter (https://docs.python.org/2/library/csv.html).
DictWriter: Creates an object which operates like a regular writer but maps dictionaries onto output rows.
This also lets you store and structure your document.
import csv

def numberofletters(n):
    file = open("words.txt", "r")
    lines = file.readlines()
    file.close()
    dicword = {}
    writer = csv.DictWriter(filename, fieldnames=fieldnames)
    writer.writeheader()
    for line in lines:
        if len(line) == 6:
            writer.writerow({'param_label': line, [...]})
    return
I hope that helps you.
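For reference, a minimal self-contained sketch of the DictWriter approach; the field names 'word' and 'length' and the output name 'filtered.csv' are placeholders, not from the question:
import csv

def five_letter_words_to_csv():
    with open("words.txt") as infile:
        words = [line.strip() for line in infile]
    with open("filtered.csv", "w", newline="") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=["word", "length"])
        writer.writeheader()
        for word in words:
            if len(word) == 5:  # newline already stripped, so 5 means 5 letters
                writer.writerow({"word": word, "length": len(word)})

five_letter_words_to_csv()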

Exporting Python Counter to .CSV file

I have a script that opens and reads a text file, separates every word, and makes a list of those words. I made a Counter to count how many times each word from the list appears. Then I want to export to a .csv file rows that look something like this:
word hello appears 10 times
word house appears 5 times
word tree appears 3 times
...And so on
Can you show me what I need to change here to make the script work?
from collections import Counter
import re
import csv

cnt = Counter()
writefile = open('test1.csv', 'wb')
writer = csv.writer(writefile)

with open('screenplay.txt') as file:  # Open .txt file with text
    text = file.read().lower()
    file.close()

text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())  # Making list of each word

for word in words:
    cnt[word] += 1  # Counting how many times word appears
    for key, count in cnt.iteritems():
        key = text
        writer.writerow([cnt[word]])
The big issue is that your second for-loop is happening for every occurrence of every word, not just once for each unique word. You will need to de-dent the whole loop so that it executes after you have finished your counting. Try something like this:
from collections import Counter
import re
import csv

cnt = Counter()
writefile = open('test1.csv', 'wb')
writer = csv.writer(writefile)

with open('screenplay.txt') as file:
    text = file.read().lower()

text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())
for word in words:
    cnt[word] += 1

for key, count in cnt.iteritems():  # De-dent this block
    writer.writerow([key, count])   # Output both the key and the count

writefile.close()  # Make sure to close your file to guarantee it gets flushed
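Note that iteritems() and opening the CSV in 'wb' mode are Python 2 idioms. A sketch of the same fix for Python 3, where the file is opened in text mode with newline='' and dict.items() is used:
from collections import Counter
import re
import csv

with open('screenplay.txt') as f:
    text = f.read().lower()
text = re.sub(r"[^a-z ']+", " ", text)
cnt = Counter(text.split())  # Counter can count the whole list at once

with open('test1.csv', 'w', newline='') as writefile:
    writer = csv.writer(writefile)
    for key, count in cnt.items():
        writer.writerow([key, count])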

how to find word frequency without duplicates across multiple files? [duplicate]

This question already has an answer here:
word frequency calculation in multiple files [duplicate]
(1 answer)
Closed 9 years ago.
I am trying to find the frequency of words in multiple files in a folder; I need to increment a word's count by 1 if it is found in a file, no matter how many times it occurs within that file.
For example: if the line "all's well that ends well" is read in file1, the count of "well" must increase by 1 and not 2,
and if "she is not well" is read in file2, the count of "well" becomes 2.
I need to increment the counter without counting duplicates within a file, but my program does not take that into account, so please help!
import os
import re
import sys
sys.stdout = open('f1.txt', 'w')
from collections import Counter
from glob import glob

def removegarbage(text):
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    sorted(text)
    return text

def removeduplicates(l):
    return list(set(l))

folderpath = 'd:/articles-words'
counter = Counter()

filepaths = glob(os.path.join(folderpath, '*.txt'))
num_files = len(filepaths)

# Add all words to counter
for filepath in filepaths:
    with open(filepath, 'r') as filehandle:
        lines = filehandle.read()
    words = removegarbage(lines).split()
    cwords = removeduplicates(words)
    counter.update(cwords)

# Display most common
for word, count in counter.most_common():
    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1 * num_files:
        break
    print('{} {}'.format(word, count))
I have tried the sort and remove duplicate techniques, but it still doesn't work!
If I understand your problem correctly, you basically want to know, for each word, how many times it appears across all files (regardless of whether the same word occurs more than once in the same file).
In order to do this, I used the following scheme, which simulates a list of many files (I just cared about the process, not the files per se, so you may have to change "files" to the actual list you want to process).
d = {}
i = 0
for f in files:
    i += 1
    for line in f:
        words = line.split()
        for word in words:
            if word not in d:
                d[word] = {}
            d[word][i] = 1  # mark that this word occurs in file i

d2 = {}
for word, occurrences in d.iteritems():
    d2[word] = sum(d[word].values())
The result will give you something like the following:
{'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}
I'd do it a much different way, but the crux of it is using a set.
from collections import Counter

frequency = Counter()
for line in open("file", "r"):
    # split the line into words first; set(line) alone would
    # iterate over characters, not words
    for word in set(line.split()):
        frequency[word] += 1
I'm not sure if it's preferable to use .readline() or whatnot; I typically use for loops because they're so damn simple.
Edit: I see what you're doing wrong. You read the entire contents of the file with .read() (performing removegarbage() on it) and then .split() the result. That'll give you a single list, destroying the newlines:
>>> "Hello world!\nFoo bar!".split()
['Hello', 'world!', 'Foo', 'bar!']
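Putting those pieces together: de-duplicating each file's words with a set before updating the Counter gives each word at most one count per file. A sketch along the lines of the question's setup:
import os
import re
from glob import glob
from collections import Counter

counter = Counter()
for filepath in glob(os.path.join('d:/articles-words', '*.txt')):
    with open(filepath, 'r') as filehandle:
        words = re.sub(r'\W+', ' ', filehandle.read()).lower().split()
    counter.update(set(words))  # each word counted at most once per file

for word, count in counter.most_common():
    print('{} {}'.format(word, count))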
