Python - Locating Duplicate Words in a Text File

I was wondering if you could help me with a Python programming issue. I'm trying to write a program that reads a text file and, for each word, outputs "word 1 True" if the word has already occurred earlier in the file, or "word 1 False" if this is the first time the word appears.
Here's what I came up with:
fh = open(fname)
lst = list()
for line in fh:
    words = line.split()
    for word in words:
        if word in words:
            print("word 1 True", word)
        else:
            print("word 1 False", word)
However, it only returns "word 1 True"
Please advise.
Thanks!

A simple (and fast) way to implement this is with a Python dictionary. A dictionary can be thought of as an array whose index (key) is a string rather than a number.
This gives a code fragment like:
found_words = {}  # empty dictionary
with open("words1.txt", "rt") as fh:
    # split() with no argument splits on any whitespace, including newlines
    words1 = fh.read().split()  # TODO - handle punctuation
for word in words1:
    if word in found_words:
        print(word + " already in file")
    else:
        found_words[word] = True  # could be set to anything
Now when processing your words, simply checking to see if the word already exists in the dictionary indicates that it was seen already.
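The same idea can be sketched with a set, which is enough when only membership matters. This is a minimal sketch, not the asker's program: the function name flag_repeats and the sample string are made up for illustration, and punctuation is left untouched.

```python
def flag_repeats(text):
    """Return (word, seen_before) pairs in reading order."""
    seen = set()  # words encountered so far
    flags = []
    for word in text.split():
        # Test membership BEFORE adding, so a first occurrence is False.
        flags.append((word, word in seen))
        seen.add(word)
    return flags


sample = "the cat saw the other cat"  # hypothetical input
for word, seen_before in flag_repeats(sample):
    print(word, seen_before)
```

The original loop always printed True because it tested each word against the list of words on the same line, which by construction contains it. Checking against what has been seen so far avoids that.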

You might also want to track previous locations, something like this:
with open(fname) as fh:
    vocab = {}
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                # word was seen before: report where, then record this spot
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)

This snippet doesn't read from a file, which makes it easy to test and study. The main difference is that you would have to open the file and read it line by line, as you did in your example.
example_file = """
This is a text file example
Let's see how many time example is typed.
"""
result = {}
words = example_file.split()
for word in words:
    # result.get(word, 0) returns 0 the first time a word is seen
    result[word] = result.get(word, 0) + 1
for word, occurrence in result.items():
    print("word:%s; occurrence:%s" % (word, occurrence))
UPDATE:
As suggested by @khachik, a better solution is to use collections.Counter.
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

Following your route you could do this:
with open('tyger.txt', 'r') as f:
    lines = f.read().split()
for word in lines:
    if lines.count(word) > 1:
        print(f"{word}: True")
    else:
        print(f"{word}: False")
Output
(xenial)vash@localhost:~/python/stack_overflow$ python3.7 read_true.py
When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False
...
You could also count every word:
with open('tyger.txt', 'r') as f:
    count = {}
    lines = f.read()
    lines = lines.split()
for i in lines:
    count[i] = lines.count(i)
print(count)
Output
{'When': 1, 'the': 2, 'stars': 1, 'threw': 1, 'down': 1, 'their': 2,
'spears': 1, 'And': 1, "water'd": 1, 'heaven': 1, 'with': 1, 'tears:':
1, 'Did': 2, 'he': 2, 'smile': 1, 'his': 1, 'work': 1, 'to': 1,
'see?': 1, 'who': 1, 'made': 1, 'Lamb': 1, 'make': 1, 'thee?': 1}
You can use the dictionary like so:
for k in count:
    if count[k] > 1:
        print(f"{k}: True")
    else:
        print(f"{k}: False")
Output
When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False

Related

How do I convert a string to a dictionary, where each entry to the dictionary is assigned a value?

At the minute, I have this code:
text = "Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too."
d = {}
lst = []
def findDupeWords():
    string = text.lower()
    #Split the string into words using built-in function
    words = string.split(" ")
    print("Duplicate words in a given string : ")
    for i in range(0, len(words)):
        count = 1
        for j in range(i+1, len(words)):
            if(words[i] == (words[j])):
                count = count + 1
                #Set words[j] to 0 to avoid printing visited word
                words[j] = "0"
        #Displays the duplicate word if count is greater than 1
        if(count > 1 and words[i] != "0"):
            print(words[i])
for key in d:
    text = text.replace(key,d[key])
print(text)
findDupeWords()
The output I get when I run this is:
Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too.
Duplicate words in a given string :
she
doesn't
have
to
but
just
her
will
you
and
How can I turn this list of words into a dictionary, like the following:
{'she': 1, "doesn't": 2, 'have': 3, 'to': 4}, etc.

Don't reinvent the wheel; use collections.Counter from the standard library.
from collections import Counter
def findDupeWords(text):
    # Counter maps each word to the number of times it occurs
    counter = Counter(text.lower().split(" "))
    for word in counter:
        if counter[word] > 1:
            print(word)
text = "Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too."
findDupeWords(text)

Well, you could replace your call to print with an assignment to a dictionary:
def findDupeWords():
    duplicates = {}
    duplicate_counter = 0
    ...
        #Displays the duplicate word if count is greater than 1
        if(count > 1 and words[i] != "0"):
            duplicate_counter += 1
            duplicates[words[i]] = duplicate_counter
But there are easier ways to achieve this, for example with collections.Counter:
from collections import Counter
words = text.lower().split()
word_occurrences = Counter(words)
dupes_in_order = sorted(
(word for word in set(words) if word_occurrences[word] > 1),
key=lambda w: words.index(w),
)
dupe_dictionary = {word: i+1 for i, word in enumerate(dupes_in_order)}
Afterwards:
>>> dupe_dictionary
{'she': 1,
"doesn't": 2,
'have': 3,
'to': 4,
'but': 5,
'just': 6,
'her': 7,
'will': 8,
'you': 9,
'and': 10}
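The numbering can also be done in a single pass over the words, without sorting by words.index. The following is only a sketch of that alternative: number_duplicates is a hypothetical name, and it assumes Python 3.7+ so that the dictionary keeps insertion order.

```python
from collections import Counter


def number_duplicates(text):
    """Map each repeated word to its rank by first appearance (1-based)."""
    words = text.lower().split()
    counts = Counter(words)
    numbered = {}
    for word in words:
        # Number a word the first time it is met, if it repeats anywhere.
        if counts[word] > 1 and word not in numbered:
            numbered[word] = len(numbered) + 1
    return numbered


print(number_duplicates("a b a c b d"))
```

Because the words are visited in reading order, the first repeated word gets 1, the next gets 2, and so on.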

Python: Creating a function counting specific words in a textfile

I want to create a function that returns the value of word count of a specific word in a text file.
Here's what I currently have:
def Word_Counter(Text_File, Word):
    Data = open(Text_File, 'r').read().lower()
    count = Data.count(Word)
    print(Word, "; ", count)
Word_Counter('Example.txt', "the")
Which returns: "the ; 35"
That is pretty much what I want it to do. But what if I want to test a text for a range of words? I want the words (keys) and their counts (values) in, say, a list or dictionary. What's a way of doing that without using modules?
Say if I tested the function with this list of words: [time, when, left, I, do, an, who, what, sometimes].
The results I would like would be something like:
Word Counts = {'time': 1, 'when': 4, 'left': 0, 'I': 5, 'do': 2, 'an': 0, 'who': 1, 'what': 3, 'sometimes': 1}
I have been able to create a dictionary which does a word count for every word, like example below.
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
I'd like to do a similar style but only targeting specific words, any suggestions?

Building on your code (untested):
def Word_Counter(Text_File, word_list):
    Data = open(Text_File, 'r').read().lower()
    output = {}
    for word in word_list:
        # count the loop variable, not the old Word parameter
        output[word] = Data.count(word)
    return output
Or you can do this
text = open("sample.txt", "r")
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

UPDATE
Try the following:
keywords = ['the', 'that']
worddict = {}
with open('out.txt', 'r') as f:
    text = f.read().split()  # or f.read().split(',')
for word in text:
    worddict[word] = worddict[word] + 1 if word in worddict else 1
# .get() avoids a KeyError for keywords that never appear in the file
print({x: worddict.get(x, 0) for x in keywords})
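One caveat with str.count is that it counts substrings, so "the" is also counted inside "there" or "other". A minimal sketch that counts whole words only, and needs no imports, might look like the following. The function name count_target_words, the sample sentence, and the set of punctuation characters stripped are all assumptions made for illustration.

```python
def count_target_words(text, targets):
    """Count whole-word, case-insensitive occurrences of each target."""
    counts = {target: 0 for target in targets}
    # Map the lowercased form back to the caller's spelling (e.g. "I").
    lookup = {target.lower(): target for target in targets}
    for raw in text.lower().split():
        # Assumption: only these characters need trimming from word edges.
        token = raw.strip(".,;:!?\"'()")
        if token in lookup:
            counts[lookup[token]] += 1
    return counts


sample = "When I left, I did not know what time it was."  # hypothetical text
print(count_target_words(sample, ["time", "when", "left", "I", "an"]))
```

Targets that never occur stay at 0, which matches the 'left': 0 style of result asked for in the question.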

Word Frequency HW

Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case, for example Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the words and how frequently each word is used. The output should be sorted from the most frequent word to the least frequent word.
The only problem I am having is getting the code to count "The" and "the" as the same word. The code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1 : userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w,0) + 1
lst = list()
for k,v in di.items():
    newtup = (v, k)
    lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)
I need to count "the" and "The" as one single word.

We start by getting the words into a list and converting them all to lowercase. You can disregard punctuation by replacing each punctuation character in the string with a space.
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
    s = s.replace(punc, ' ')
# split() with no argument also drops the empty strings left by repeated spaces
words = s.split()
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
print(freq)
#{'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#'words': 2, 'are': 2, 'there': 2}

You can use Counter and re like this:
from collections import Counter
import re
sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuations (you can add more here)
punctuationsToBeremoved = r",|\n|\?"  # raw string, so \? is not treated as an invalid string escape
#to make all of them in lower case
sentence = sentence.lower()
#to clean up the punctuations
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# getting the word list
words = sentence.split()
# printing the frequency of each word
print(Counter(words))
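Putting the pieces together for the homework's "most frequent first" requirement, a sketch could look like this. It works on a string rather than a file so it is easy to try out; the function name word_frequencies and the sample text are made up, and treating a word as a run of letters, digits or apostrophes is an assumption.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first, ignoring case."""
    # Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # most_common() with no argument returns every pair, highest count first.
    return Counter(words).most_common()


sample = "The spam, the eggs and THE ham. Spam!"  # hypothetical file contents
for word, count in word_frequencies(sample):
    print(word, count)
```

To use it on a file, the text could be read with open(filename).read() and passed in unchanged.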

Python word counter sensitive to if word is surrounded by quotation marks?

I have a problem with my Python program. I am trying to make a word counter, an exercise from Exercism.
Now, my program must pass 13 tests, all of which are different strings with spaces, characters, digits, etc.
I used to have a problem because I would replace all non-letters and non-digits with a space. This created problems for words like "don't", because it would divide them into two strings, don and t. To counter this I added an if statement excluding single ' marks from being replaced, which worked.
However, one of the strings I must test is "Joe can't tell between 'large' and large.". The problem is that since I exclude ' marks, here large and 'large' are considered two different things, although they are the same word. How do I tell my program to "erase" quotes surrounding a word?
Here is my code, and I have added two scenarios, one being the string above, and the other being another string with only one ' mark that you should not delete:
def word_count(phrase):
    count = {}
    for c in phrase:
        if not c.isalpha() and not c.isdigit() and c != "'":
            phrase = phrase.replace(c, " ")
    for word in phrase.lower().split():
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
    return count
print(word_count("Joe can't tell between 'large' and large."))
print(word_count("Don't delete that single quote!"))
Thank you for your help.

The module string holds some nice text constants - important for you would be punctuation. The module collections holds Counter - a specialized dictionary class used to count things:
from collections import Counter
from string import punctuation
# lookup in a set is fast
ps = set(punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
def cleanSplitString(s):
    """Cleans all punctuation from the string s and returns split words."""
    return ''.join([m for m in s if m not in ps]).lower().split()
def word_count(sentence):
    return dict(Counter(cleanSplitString(sentence)))  # return a "normal" dict
print(word_count("Joe can't tell between 'large' and large."))
print(word_count("Don't delete that single quote!"))
Output:
{'joe': 1, 'cant': 1, 'tell': 1, 'between': 1, 'large': 2, 'and': 1}
{'dont': 1, 'delete': 1, 'that': 1, 'single': 1, 'quote': 1}
If you want to keep the punctuations inside words, use:
def cleanSplitString_2(s):
    """Cleans all punctuation from the start and end of words, keeps it if inside."""
    return [w.strip(punctuation) for w in s.lower().split()]
Output:
{'joe': 1, "can't": 1, 'tell': 1, 'between': 1, 'large': 2, 'and': 1}
{"don't": 1, 'delete': 1, 'that': 1, 'single': 1, 'quote': 1}
Read up on str.strip().

Use .strip("'") to remove leading and trailing quote characters from each word once you have split the phrase into a list - https://python-reference.readthedocs.io/en/latest/docs/str/strip.html
def word_count(phrase):
    count = {}
    for c in phrase:
        if not c.isalpha() and not c.isdigit() and c != "'":
            phrase = phrase.replace(c, " ")
    print(phrase)
    for word in phrase.lower().split():
        # drop quotes around the word; apostrophes inside it are kept
        word = word.strip("'")
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
    return count
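Another way to get the same effect is to describe a word with a regular expression, so that surrounding quotes never become part of a match in the first place. This is a sketch of a different technique from the answers above, under the assumption that a word is a run of ASCII letters or digits, optionally joined by apostrophes that have a letter or digit on both sides. The name WORD_PATTERN is made up for illustration.

```python
import re

# Assumption: words are ASCII letters/digits; an apostrophe only counts
# as part of a word when it sits between two such characters.
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def word_count(phrase):
    """Count words, keeping inner apostrophes but not surrounding quotes."""
    count = {}
    for word in WORD_PATTERN.findall(phrase.lower()):
        count[word] = count.get(word, 0) + 1
    return count


print(word_count("Joe can't tell between 'large' and large."))
print(word_count("Don't delete that single quote!"))
```

With this pattern "can't" stays one word while 'large' and large are counted together. Words containing non-ASCII letters would need a broader character class.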

Python - counting duplicate strings

I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict
def repeat_word_count(text, n):
    words = text.split()
    tally = defaultdict(int)
    answer = []
    for i in words:
        if i in tally:
            tally[i] += 1
        else:
            tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']

Try
for i in words:
    tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
    if i in tally:
        tally[words] += 1 #you are using words the list as key, you should use i the item
    else:
        tally[words] = 1
If you simply want to count the words, using collections.Counter would be fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3

Here is a way to do it:
from collections import defaultdict
tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
    tally[i] += 1
print(tally)  # defaultdict(<class 'int'>, {'one': 1, 'two': 2, 'three': 3})
Putting this in a function:
def repeat_word_count(text, n):
    output = []
    tally = defaultdict(int)
    for i in text.split():
        tally[i] += 1
    # keep every word whose tally is greater than n
    for k in tally:
        if tally[k] > n:
            output.append(k)
    return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['three', 'four']
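Note that the example in the question expects ['one'] for n = 3 even though "one" occurs exactly 3 times, which suggests "at least n" rather than "more than n". A sketch using collections.Counter under that assumption (the choice of >= is a guess at what the asker meant, not something the question states outright):

```python
from collections import Counter


def repeat_word_count(text, n):
    """Return the words that occur at least n times, in first-seen order."""
    counts = Counter(text.split())
    # Assumption: "exceeds n" in the question is meant as count >= n.
    return [word for word, count in counts.items() if count >= n]


print(repeat_word_count("one one was a racehorse two two was one too", 3))
# → ['one']
```

Changing >= to > gives the strict reading used in the function above.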

If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
    d[word] = d.get(word, 0) + 1
print(d)
Output:
{'hello': 2, 'world': 2, 'again': 1, 'now': 1, 'hi': 2, 'there': 1}

Why not use the Counter class for this:
from collections import Counter
cnt = Counter(text.split())
The elements are stored as dictionary keys and their counts as dictionary values. It is then easy to keep the words whose count exceeds n by looping over the keys:
result = []
for k in cnt:
    if cnt[k] > n:
        result.append(k)
result then holds your list of words.
Edited: sorry, that's if you need many words; BrianO has the right one for your case.

As luoluo says, use collections.Counter.
To get the item with the highest tally, use the Counter.most_common method with argument 1, which returns a list holding the single (word, tally) pair with the maximum tally. If the "sentence" contains at least one word, that list is nonempty. So the following function returns some word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter
def repeat_word_count(text, n):
    if not text: return None  # guard against '' and None!
    counter = Counter(text.split())
    if not counter: return None  # text contained only whitespace
    max_pair = counter.most_common(1)[0]
    return max_pair[0] if max_pair[1] >= n else None
