Count common words - python

I have two .txt files, and I need to compare them and count the words they have in common. What I should get is just a total count of how many words the two files share. How can I do it? Can you help me? This is the code that I have tried, but I need only the total count, e.g. "I have found 125 occurrences".

So now you have a dict with non-zero values for common words. Counting them is as simple as:
sum(v != 0 for v in dct.values())
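
For illustration, a toy run of that idea (the dict contents here are made up):

dct = {"casa": 2, "vita": 0, "sole": 1}   # hypothetical word -> match count, 0 for misses
print(sum(v != 0 for v in dct.values()))  # 2 common words (each True counts as 1)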

If I understood correctly, I would create one set for each of the files with its words, but I would also convert everything to lowercase to be sure that I'll have "proper" matches. Then I would create an intersection of those sets and just get the length.
file1_words = set()
file2_words = set()

file1 = open("text1.txt")
file2 = open("verbs.txt")
for word in file1.read().strip().split():
    file1_words.add(word.lower())  # lowercase so matches are case-insensitive
for word in file2.read().strip().split():
    file2_words.add(word.lower())

print(file1_words)
print(file2_words)

# the intersection holds exactly the words both files share
common_words = file1_words.intersection(file2_words)
print(len(common_words))
Based on your explanation, what you want to do is to sum the values of the dictionary you have created. You can do this simply by:
print(sum(dct.values()))
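
For example, with made-up counts:

dct = {"well": 3, "ends": 1}  # hypothetical word -> occurrences in common
print(sum(dct.values()))      # 4 -> "I have found 4 occurrences"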

This counts all the occurrences of the words in common.
It uses Counter; see https://docs.python.org/fr/3/library/collections.html#collections.Counter
import re
from collections import Counter

with open('text1.txt', 'r') as f:
    words1 = re.findall(r'\w+', f.read().lower())
with open('verbs.txt', 'r') as f:
    words2 = re.findall(r'\w+', f.read().lower())

counter1 = Counter(words1)
counter2 = Counter(words2)

# words that appear in both files
common = set(counter1.keys()).intersection(counter2.keys())

# total occurrences of the common words across both files
print(sum(counter1[e] + counter2[e] for e in common))
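
As a side note (not part of the original answer), dict key views support set operations directly, so the intersection can also be written as:

common = counter1.keys() & counter2.keys()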


The output is unsorted and sorting on the second value is not possible. Is there a special method to sort on the second value?

The output is unsorted, and sorting on the second column is not possible. Is there a special method to sort on the second value?
This program takes a text and counts how many times each word occurs in it:
import string

with open("romeo.txt") as file:  # opens the file with text
    lst = []
    d = dict()
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in d:
                d[word] = 1
            else:
                d[word] = d[word] + 1
            if word not in lst:
                lst.append(word)  # append only this unique word to the list
                uniquewords.write(str(word) + '\n')  # write the unique word to the file

print(d)
Dictionaries with default value
The code snippet:
d = dict()
...
if word not in d:
    d[word] = 1
else:
    d[word] = d[word] + 1
has become so common in Python that a subclass of dict was created to get rid of it. It goes by the name defaultdict and can be found in the module collections.
Thus we can simplify your code snippet to:
from collections import defaultdict

d = defaultdict(int)
...
d[word] = d[word] + 1
No need for this manual if/else test; if word is not in the defaultdict, it will be added automatically with initial value 0.
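
A complete toy run of that pattern (the word list is made up):

from collections import defaultdict

d = defaultdict(int)
for word in ["well", "that", "ends", "well"]:
    d[word] += 1  # a missing key springs into existence with value 0
print(dict(d))    # {'well': 2, 'that': 1, 'ends': 1}
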
Counters
Counting occurrences is also something that is frequently useful; so much so that there exists a subclass of dict called Counter in module collections. It will do all the hard work for you.
from collections import Counter
import string

with open('romeo.txt') as input_file:
    counts = Counter(
        word.translate(str.maketrans('', '', string.punctuation)).upper()
        for line in input_file
        for word in line.split()
    )

with open('romeo_unique.txt', 'w') as output_file:
    for word in counts:
        output_file.write(word + '\n')
As far as I can tell from the documentation, Counters are not guaranteed to be ordered by number of occurrences by default; however:
when I use them in the interactive Python interpreter, they are always printed in decreasing number of occurrences;
they provide a method .most_common() which is guaranteed to return results in decreasing number of occurrences.
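
For example (the counts are made up):

from collections import Counter

counts = Counter({"THE": 5, "AND": 3, "ROMEO": 2})
print(counts.most_common(2))  # [('THE', 5), ('AND', 3)]
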
In Python, standard dictionaries are an unsorted data type, but you can look here, assuming that by sorting your output you mean sorting d.
A couple of remarks first:
You are not sorting explicitly (e.g. by using sorted) by a given property. Dictionaries might be considered to have a "natural" order by the alphanumeric value of the value part of each key-value pair and they might sort correctly when iterated (e.g. for printing), but it is better to explicitly sort a dict.
You check the existence of a word in the lst variable, which is very slow, since checking a list requires scanning its entries one by one until something is found (or not). It would be much better to check for existence in a dict; a quick illustration follows these remarks.
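
A quick illustration of that difference (toy data):

lst = ["WELL", "THAT", "ENDS"]
d = {"WELL": 1, "THAT": 1, "ENDS": 1}

print("ENDS" in lst)  # True, but scans the list element by element: O(n)
print("ENDS" in d)    # True, via a hash lookup: O(1) on average
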
I'm assuming by "the second column" you mean the information for each word that counts the order in which the word first appeared.
With that, I'd change the code to also record the word index of the first occurrence of each word, which then allows sorting on exactly that.
Edit: Fixed the code. The sorting yielded by sorted sorts by key, not value. That's what I get for not testing code before posting an answer.
import string
from operator import itemgetter

with open("romeo.txt") as file:  # opens the file with text
    first_occurence = {}
    uniqueness = {}
    word_index = 1
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in uniqueness:
                uniqueness[word] = 1
            else:
                uniqueness[word] += 1
            if word not in first_occurence:
                first_occurence[word] = word_index
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
            word_index += 1

print(sorted(uniqueness.items(), key=itemgetter(1)))
print(sorted(first_occurence.items(), key=itemgetter(1)))

Counting words in python from the text file

I need to open a text file and find the number of occurrences of the names given in another file. The program should write name; count pairs, separated by semicolons, into a file in .csv format.
It should look like:
Jane; 77
Hector; 34
Anna; 39
...
I tried to use Counter, but then the output looks like a list, so I think this is the wrong way to do the task.
import re
from collections import Counter

wanted = re.findall(r'\w+', open('iliadcounts.csv').read().lower())
cnt = Counter()
words = re.findall(r'\w+', open('pg6130.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
but this is definitely not the right code for this task...
You can feed the whole list of words to Counter at once; it will count them for you.
You can then print only the words in wanted by iterating over it:
import re
from collections import Counter

# create some demo data as I do not have your data at hand - uses your filenames
def create_demo_files():
    with open('iliadcounts.csv', "w") as f:
        f.write("hug,crane,box")
    with open('pg6130.txt', "w") as f:
        f.write("hug,shoe,blues,crane,crane,box,box,box,wood")

create_demo_files()

# work with your files
with open('iliadcounts.csv') as f:
    wanted = re.findall(r'\w+', f.read().lower())
with open('pg6130.txt') as f:
    cnt = Counter(re.findall(r'\w+', f.read().lower()))

# printed output for all words in wanted (all words are counted)
for word in wanted:
    print("{}; {}".format(word, cnt.get(word)))

# would work as well:
# https://docs.python.org/3/library/string.html#string-formatting
# print(f"{word}; {cnt.get(word)}")
Output:
hug; 1
crane; 2
box; 3
Or you can print the whole Counter:
print(cnt)
Output:
Counter({'box': 3, 'crane': 2, 'hug': 1, 'shoe': 1, 'blues': 1, 'wood': 1})
Links:
https://pyformat.info/
string formatting
with open(...) as f:

Total number of first words in a txt file

I need a program that counts the top 5 most common first words of the lines in a file, and that does not include lines where the first word is followed by "DM" or "RT".
I don't have any code as of so far because I'm completely lost.
f = open("C:/Users/Joe Simpleton/Desktop/talking.txt", "r")
?????
Read each line of your text in. For each line, split it into words using a regular expression; this will return a list of the words. If there are at least two words, test the second word to make sure it is not in your list. Then use a Counter() to keep track of the counts of the first words. Store the lowercase of each word so that uppercase and lowercase versions of the same word are not counted separately:
from collections import Counter
import re

word_counts = Counter()
with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        # keep the line unless its first word is followed by DM or RT
        if (len(words) > 1 and words[1] not in ['DM', 'RT']) or len(words) == 1:
            word_counts[words[0].lower()] += 1  # count only the first word of the line

print(word_counts.most_common(5))
Counter() has the useful feature of being able to show the most common values.
Not tested, but should work roughly like that:
from collections import Counter

count = Counter()
with open("path") as f:
    for line in f:
        parts = line.split(" ")
        if parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1

print(count.most_common(5))
You should also add a check that ensures that parts has at least two elements before indexing parts[1].
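
A minimal form of that guard could look like this (a sketch; it also counts single-word lines, which have no second word to filter on):

if len(parts) >= 2 and parts[1] not in ["DM", "RT"]:
    count[parts[0]] += 1
elif len(parts) == 1 and parts[0].strip():
    count[parts[0]] += 1  # single-word line: nothing after the first word to filter on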

unique words in lines of text read from standard input

I am trying to see how many unique words there are in standard input.
import sys

s = sys.stdin.readlines()
seen = []
for lines in s:
    if lines not in seen:
        seen = seen + (lines.split())
        seen.append(lines)
print(len(seen))
I know I am on the right track, but Tree and tree should not be counted as separate unique words.
Also, Monday and 1 are words, but – is not.
seen = []
for line in s:
    for word in line.strip().split():
        if word.isalnum() and word.lower() not in (x.lower() for x in seen):
            seen.append(word)
print(len(seen))
Or better (if you want only the length, but not the words themselves):
print(len(set(word.lower() for line in s for word in line.strip().split() if word.isalnum())))
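
For example, with a made-up input, both variants report 2: Tree and tree collapse to one word, and – is skipped by isalnum().

s = ["Tree tree – Monday\n"]  # example input
print(len(set(word.lower() for line in s for word in line.strip().split() if word.isalnum())))  # 2
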
I guess this code snippet can help you in a few lines. Basically, the idea is to use a set.
st = set()
for lines in s.split('\n'):  # here s is assumed to be the whole input as one string
    print(lines)
    st = set(lines.split()).union(st)
print(st)

how to find word frequency without duplicates across multiple files? [duplicate]

I am trying to find the frequency of words in multiple files in a folder; I need to increment a word's count by 1 if it is found in a file, no matter how often it occurs there.
For example: if the line "all's well that ends well" is read in file 1, the count of "well" must increase by 1 and not 2,
and if "she is not well" is read in file 2, the count of "well" becomes 2.
I need to increment the counter without counting duplicates, but my program does not take that into account, so please help!
import os
import re
import sys
from collections import Counter
from glob import glob

sys.stdout = open('f1.txt', 'w')

def removegarbage(text):
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    sorted(text)  # note: this builds a sorted list and discards it; it has no effect
    return text

def removeduplicates(l):
    return list(set(l))

folderpath = 'd:/articles-words'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))
num_files = len(filepaths)

# Add all words to counter
for filepath in filepaths:
    with open(filepath, 'r') as filehandle:
        lines = filehandle.read()
    words = removegarbage(lines).split()
    cwords = removeduplicates(words)
    counter.update(cwords)

# Display most common
for word, count in counter.most_common():
    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1 * num_files:
        break
    print('{} {}'.format(word, count))
I have tried the sort and remove duplicate techniques, but it still doesn't work!
If I understand your problem correctly, you basically want to know, for each word, how many times it appears across all files (regardless of whether the same word occurs more than once in the same file).
In order to do this, I used the following schema, which simulates a list of many files (I only cared about the process, not the files per se, so you may have to change "files" to the actual list you want to process).
d = {}
i = 0
for f in files:
    i += 1
    for line in f:
        words = line.split()
        for word in words:
            if word not in d:
                d[word] = {}
            d[word][i] = 1  # mark the word as seen in file i (at most once per file)

d2 = {}
for word, occurences in d.items():  # .iteritems() in Python 2
    d2[word] = sum(occurences.values())
The result will give you something like the following:
{'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}
I'd do it a much different way, but the crux of it is using a set.
from collections import Counter

frequency = Counter()
for line in open("file", "r"):
    for word in set(line.split()):  # the set removes duplicate words within the line
        frequency[word] += 1
I'm not sure if it's preferable to use .readline() or whatnot; I typically use for loops because they're so damn simple.
Edit: I see what you're doing wrong. You read the entire contents of the file with .read() (performing removegarbage() on it) and then .split() the result. That'll give you a single list, destroying the newlines:
>>> "Hello world!\nFoo bar!".split()
['Hello', 'world!', 'Foo', 'bar!']
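
To tie this back to the multi-file requirement: build one set per file, so each word is counted at most once per file, and feed that to a Counter. A minimal sketch (the filenames are placeholders):

from collections import Counter

frequency = Counter()
for filename in ["file1.txt", "file2.txt"]:  # placeholder filenames
    with open(filename) as f:
        # one set per file: each word counts at most once per file
        frequency.update(set(f.read().lower().split()))
print(frequency)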
