How to remove duplicates in a document?


I'm writing a word unjumble program. Here is my code:
import collections
sortedWords = collections.defaultdict(list)
with open("/xxx/xxx/words.txt", "r") as f:
    for word in f:
        word = word.strip().lower()
        sortFword = ''.join(sorted(word))
        sortedWords[sortFword].append(word)
while True:
    jumble = input("Enter your jumbled word:").lower()
    sortedJumble = ''.join(sorted(jumble))
    if sortedJumble in sortedWords:
        words = sortedWords[sortedJumble]
        if len(words) > 1:
            print ("Your words are: ")
            print ("\n".join(words))
        else:
            print ("Your word is", words[0]+".")
        break
    else:
        print ("Oops, it can not be unjumbled.")
        break
This code works. However, my program often prints two identical words. For example, when I typed "prisng" as the jumbled word, I got "spring" twice. That is because the word file contains two entries, "spring" and "Spring", which become identical once lowercased. I want to remove all duplicates from words.txt. How can I do that?

You can use the builtin set function to do this.
words = ['hi', 'Hi']
words = list(map(lambda x: x.lower(), words)) # makes all the words lowercase
words = list(set(words)) # removes all duplicates
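Applied to the question's lookup table, the same idea could be sketched as follows. This is only an illustration: `build_index` is a hypothetical helper name, and an in-memory `io.StringIO` stands in for words.txt. Storing each group of anagrams in a set means "spring" and "Spring" collapse into one entry after lowercasing.

```python
import collections
import io


def build_index(lines):
    """Map each sorted-letter key to the set of distinct lowercase words."""
    index = collections.defaultdict(set)
    for line in lines:
        word = line.strip().lower()
        if word:  # skip blank lines
            index["".join(sorted(word))].add(word)
    return index


# Stand-in for the word file (assumed contents, one word per line)
sample_file = io.StringIO("spring\nSpring\nsprig\n")
index = build_index(sample_file)

jumble = "prisng"
matches = sorted(index.get("".join(sorted(jumble)), ()))
print(matches)
```

Using `index.get(...)` rather than `index[...]` avoids creating empty entries in the defaultdict when a jumble has no match.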

Related

Create one list with strings using read text file in Python

I want to make a jumble game in python that uses words from a text file rather than from words written directly into the python file(in this case the code works perfectly). But when I want to import them, I get this list:
[['amazement', ' awe', ' bombshell', ' curiosity', ' incredulity', '\r\n'], ['godsend', ' marvel', ' portent', ' prodigy', ' revelation', '\r\n'], ['stupefaction', ' unforeseen', ' wonder', ' shock', ' rarity', '\r\n'], ['miracle', ' abruptness', ' astonishment\r\n']]
I want words to be sorted in one single list, for example:
["amazement", "awe", "bombshell"...]
This is my python code:
import random
#Welcome the player
print("""
Welcome to Word Jumble.
Unscramble the letters to make a word.
""")
filename = "words/amazement_words.txt"
lst = []
with open(filename) as afile:
    for i in afile:
        i=i.split(",")
        lst.append(i)
print(lst)
word = random.choice(lst)
theWord = word
jumble = ""
while(len(word)>0):
    position = random.randrange(len(word))
    jumble+=word[position]
    word=word[:position]+word[position+1:]
print("The jumble word is: {}".format(jumble))
#Getting player's guess
guess = input("Enter your guess: ")
#congratulate the player
if(guess==theWord):
    print("Congratulations! You guessed it")
else:
    print ("Sorry, wrong guess.")
input("Thanks for playing. Press the enter key to exit.")
I have a text file with words:
amazement, awe, bombshell, curiosity, incredulity,
godsend, marvel, portent, prodigy, revelation,
stupefaction, unforeseen, wonder, shock, rarity,
miracle, abruptness, astonishment
Thank you for help and any suggestions!

A near one-liner does it:
with open("list_of_words.txt") as f:
    the_list = sorted(word.strip(",") for line in f for word in line.split())
print(the_list)
This uses a double for in a generator expression: the outer loop iterates over the lines, the inner one over the words in each line.
Splitting on whitespace is the trick: it gets rid of the line-termination characters and repeated spaces. Then strip(",") removes the commas.
Finally, sorted is applied to the resulting generator expression.
result:
['abruptness', 'amazement', 'astonishment', 'awe', 'bombshell', 'curiosity', 'godsend', 'incredulity', 'marvel', 'miracle', 'portent', 'prodigy', 'rarity', 'revelation', 'shock', 'stupefaction', 'unforeseen', 'wonder']
The only drawback of this quick method is that if two words are separated only by a comma (with no space), they come out as a single item such as "awe,bombshell".
In that case, add another for to the generator expression to split on commas and drop the empty strings that result (if word):
with open("list_of_words.txt") as f:
    the_list = sorted(word for line in f for word_commas in line.split() for word in word_commas.split(",") if word)
print(the_list)
Alternatively, a regex split may be cleaner here (empty strings still have to be discarded). The split expression is whitespace or a comma:
import re
with open("list_of_words.txt") as f:
    the_list = sorted(word for line in f for word in re.split(r"\s+|,",line) if word)
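To try the regex approach without a file on disk, here is a minimal sketch. It assumes made-up sample contents held in an `io.StringIO`, the helper name `read_words` is hypothetical, and it uses the slightly different pattern `[\s,]+`, which treats any run of whitespace and commas as one separator.

```python
import io
import re


def read_words(lines):
    """Split each line on runs of whitespace/commas and drop empty pieces."""
    return sorted(
        word
        for line in lines
        for word in re.split(r"[\s,]+", line)
        if word
    )


# Stand-in for the text file (assumed contents, Windows line endings kept)
sample_file = io.StringIO(
    "amazement, awe, bombshell,\r\n"
    "godsend,marvel, portent\r\n"
)

words = read_words(sample_file)
print(words)
```

The `if word` filter is still needed because a line that starts or ends with a separator produces an empty string at that end of the split result.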

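Use
lst.extend(i)
instead of
lst.append(i)
split returns a list, so append adds that whole list as a single element each time, which gives you a list of lists. extend adds the items individually, which gives you one flat list. The items will still carry leading spaces and line endings, so strip them as well.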

Please try the following code:
import random
#Welcome the player
print("""
Welcome to Word Jumble.
Unscramble the letters to make a word.
""")
filename = "words/amazement_words.txt"
lst = []
another_lst = []
# The with block closes the file once the lines have been read
with open(filename, 'r') as file:
    data = file.readlines()
for line in data:
    lst.append(line.strip().split(','))
print(lst)
# Flatten the list of lists, skipping the empty strings left by trailing commas
for line in lst:
    for li in line:
        if li.strip():
            another_lst.append(li.strip())
print()
print()
print(another_lst)
# Choose from the flat list of words, not from the list of lists
word = random.choice(another_lst)
theWord = word
jumble = ""
while(len(word)>0):
    position = random.randrange(len(word))
    jumble+=word[position]
    word=word[:position]+word[position+1:]
print("The jumble word is: {}".format(jumble))
#Getting player's guess
guess = input("Enter your guess: ")
#congratulate the player
if(guess==theWord):
    print("Congratulations! You guessed it")
else:
    print ("Sorry, wrong guess.")
input("Thanks for playing. Press the enter key to exit.")

str.split() returns a list, so if you append it to your result you get a list of lists.
One solution is to concatenate the two lists with + (or +=) instead of appending.
You can get rid of the '\r\n' by stripping i before splitting it.
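Put together, those two suggestions might look like the sketch below. The helper names `flatten_words` and `make_jumble` are hypothetical, the sample lines are an assumption standing in for the file, and `random.shuffle` is swapped in for the question's manual letter-picking loop.

```python
import random


def flatten_words(lines):
    """Strip each line, split on commas, and concatenate into one flat list."""
    words = []
    for line in lines:
        parts = line.strip().split(",")
        # Concatenate instead of append; skip empty pieces from trailing commas
        words = words + [p.strip() for p in parts if p.strip()]
    return words


def make_jumble(word):
    """Return the letters of word in random order."""
    letters = list(word)
    random.shuffle(letters)
    return "".join(letters)


# Stand-in for the lines read from the file (assumed contents)
lines = [
    "amazement, awe, bombshell,\r\n",
    "miracle, abruptness, astonishment\r\n",
]
word_list = flatten_words(lines)
the_word = random.choice(word_list)
print(word_list)
print("The jumble word is: {}".format(make_jumble(the_word)))
```

Because `random.choice` now receives a flat list of strings, the guess can be compared directly with `the_word`.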

How to make a Python program that lists the positions and displays an error message if not found

I did this code:
sentence = input("Type in your sentance ").lower().split()
Word = input("What word would you like to find? ")
Keyword = Word.lower().split().append(Word)
positions = []
for (S, subword) in enumerate(sentence):
    if (subword == Word):
        positions.append
        print("The word" , Word , "is in position" , S+1)
There are two problems with it, though: I don't know how to handle the case where the user's word is not found, and I don't know how to put the positions in a single message such as "The word is in positions [1, 3, 6, 9]".
Any help?
Thanks

Your code has several errors. Here is some sample code for reference:
from collections import defaultdict
sentence_string = input('Enter Sentence: ')
# Enter Sentence: Here is the content I need to check for index of all words as Yes Hello Yes Yes Hello Yes
word_string = input("Enter Words: ")
# Enter Words: yes hello
word_list = sentence_string.lower().split()
words = word_string.lower().split()
my_dict = defaultdict(list)
for i, word in enumerate(word_list):
    my_dict[word].append(i)
for word in words:
    print("The word", word, "is in position", my_dict[word])
# The word yes is in position [14, 16, 17, 19]
# The word hello is in position [15, 18]
The approach here is:
Break your sentence (sentence_string here) into a list of words.
Break your word string into a list of words.
Create a dictionary, my_dict, to store all the (0-based) indexes of the words in word_list.
Iterate over the words to get your result from the indexes stored in my_dict. A word that never occurs yields an empty list, which you can test for to print an error message.
Note: the comments in the example above show the input and the output of the code.

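Use index.
a = 'Some sentence of multiple words'
b = 'a word'
def list_matched_indices(list_of_words, word):
    pos = 0
    indices = []
    while True:
        try:
            # list.index raises ValueError once there are no more matches
            pos = list_of_words.index(word, pos)
            indices.append(pos)
            pos+=1
        except ValueError:
            break
    if len(indices):
        print(indices)
        return indices
    else:
        print ("Not found")
# Pass a list of words, not the raw string; 'a word' is not in the list, so this prints "Not found"
list_matched_indices(a.split(), b)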
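For the two things the question asks about (one message listing all positions, and an error when the word is missing), a minimal sketch could look like this. The function names `find_positions` and `report` are hypothetical, and positions are counted from 1 as in the question.

```python
def find_positions(sentence, word):
    """Return the 1-based positions of word in sentence, ignoring case."""
    target = word.lower()
    return [
        i
        for i, w in enumerate(sentence.lower().split(), start=1)
        if w == target
    ]


def report(sentence, word):
    """Build one message with every position, or an error if there is none."""
    positions = find_positions(sentence, word)
    if positions:
        return "The word {} is in positions {}".format(word, positions)
    return "The word {} was not found".format(word)


print(report("the cat sat on the mat which was below the cat", "cat"))
print(report("the cat sat on the mat", "dog"))
```

Collecting the positions first and printing once afterwards is what allows a single combined message; an empty list doubles as the "not found" signal.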

Python strings with anagrams

At the moment this code takes a string from the user and compares it to a text file in which many words are stored. It then outputs all the words that are an exact anagram of the string (e.g. "otp" gives opt, top, pot). Currently, when I input a string, it only matches words made of EXACTLY the same letters in a rearranged order.
My question is: how can I type in excess letters and still get all the words that can be formed from them? For example, typing "orkignwer" should output "working" even though there are extra letters.
words = []
def isAnAnagram(word, user):
    wordList= list(word)
    wordList.sort()
    inputList= list(user)
    inputList.sort()
    return (wordList == inputList)
def getAnagrams(user):
    lister = [word for word in words if len(word) == len(user) ]
    for item in lister:
        if isAnAnagram(item, user):
            yield item
with open('Dictionary.txt', 'r') as f:
    allwords = f.readlines()
f.close()
for x in allwords:
    x = x.rstrip()
    words.append(x)
inp = 1
while inp != "99":
    inp = input("enter word:")
    result = getAnagrams(inp)
    print(list(result))

You have to edit both the isAnAnagram and the getAnagrams functions. First, getAnagrams should be changed so that the lister list also includes words that are shorter than the input:
def getAnagrams(user):
    lister = [word for word in words if len(word) <= len(user) ]
    for item in lister:
        if isAnAnagram(item, user):
            yield item
Then you need to edit the isAnAnagram function. As Alexander Huszagh pointed out, you can use Counter from the collections module:
from collections import Counter
def isAnAnagram(word, user):
    word_counter = Counter(word)
    input_counter = Counter(user)
    return all(count <= input_counter[key] for key, count in word_counter.items())
The all(count <= input_counter[key] for key, count in word_counter.items()) expression checks that every letter of word appears in user at least as many times as it does in word.
P.S. If you want a more optimized solution, you might want to check out tries (e.g. MARISA-trie, python-trie or PyTrie).
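Here is a small self-contained sketch of that letter-count check, using an in-memory word list as a stand-in for Dictionary.txt. The names `can_build` and `find_words` are hypothetical, and the check is written with Counter subtraction, which is an equivalent alternative to the all(...) expression above.

```python
from collections import Counter


def can_build(word, letters):
    """True if word can be spelled using only the given letters."""
    # Counter subtraction keeps only positive counts, so an empty result
    # means no letter of word is missing from letters.
    return not (Counter(word) - Counter(letters))


def find_words(letters, dictionary):
    """Return every dictionary word that fits inside the given letters."""
    return [
        w for w in dictionary
        if len(w) <= len(letters) and can_build(w, letters)
    ]


# Stand-in for the dictionary file (assumed contents)
dictionary = ["working", "work", "king", "wrong", "zebra"]
found = find_words("orkignwer", dictionary)
print(found)
```

The length test is only a cheap pre-filter; `can_build` alone already rejects words that are longer than the input.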

How to find the position of a repeating word in a string - Python

How to get Python to return the position of a repeating word in a string?
E.g. the word "cat" in "the cat sat on the mat which was below the cat" is in the 2nd and 11th position in the sentence.

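You can use re.finditer to find all occurrences of each word in a string along with their starting character indexes. Escaping the word and wrapping it in \b word boundaries avoids matching inside longer words:
import re
sentence = "the cat sat on the mat which was below the cat"
for word in set(sentence.split()):
    # \b keeps "cat" from matching inside a longer word such as "catalog"
    indexes = [w.start() for w in re.finditer(r"\b" + re.escape(word) + r"\b", sentence)]
    print(word, len(indexes), indexes)
And using a dictionary comprehension:
{word: [w.start() for w in re.finditer(r"\b" + re.escape(word) + r"\b", sentence)] for word in sentence.split()}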

This returns a dictionary mapping each word that occurs more than once in the sentence to the list of its word positions (not character indexes), counted from 1:
from collections import defaultdict
sentence = "the cat sat on the mat which was below the cat"
def foo(mystr):
    sentence = mystr.lower().split()
    counter = defaultdict(list)
    for i in range(len(sentence)):
        counter[sentence[i]].append(i+1)
    new_dict = {}
    for k, v in counter.items():
        if len(v) > 1:
            new_dict[k] = v
    return new_dict
print(foo(sentence))

The following will take an input sentence, take a word from the sentence, and then print the position(s) of the word in a list with a starting index of 1 (it looks like that's what you want from your code).
sentence = input("Enter a sentence, ").lower()
word = input("Enter a word from the sentence, ").lower()
words = sentence.split(' ')
positions = [ i+1 for i,w in enumerate(words) if w == word ]
print(positions)

I prefer simplicity and here is my code below:
sentence = input("Enter a sentence, ").lower()
word_to_find = input("Enter a word from the sentence, ").lower()
words = sentence.split() ## Splits the string 'sentence' and returns a list of words in it. Split() method splits on default delimiters.
for pos in range(len(words)):
    if word_to_find == words[pos]: ## words[pos] corresponds to the word present in the 'words' list at 'pos' index position.
        print (pos+1)
The list 'words' holds all the words in the sentence. We then iterate over it and compare the word at index 'pos' with the word we are looking for (word_to_find); if they are the same, we print pos + 1.
I hope this is simple enough to follow and serves your purpose.
If you wish to use a list comprehension for the above, then:
words = sentence.split()
positions = [ i+1 for i in range(len(words)) if word_to_find == words[i]]
print (positions)
Both approaches are equivalent; the latter just gives you a list.

positions= []
sentence= input("Enter the sentence please: ").lower()
sentence=sentence.split( )
length=len(sentence)
word = input("Enter the word that you would like to search for please: ").lower()
if word not in sentence:
    print ("Error, '"+word+"' is not in this sentence.")
else:
    for x in range(0,length):
        if sentence[x]==word: # record the 1-based position of each match
            positions.append(x+1)
    print(word,"is at positions", positions)

s="hello fattie i'm a fattie too"
# rough approach: find each character offset, then convert it to a word position
lw= "fattie"
li=[]
for i in range(len(s)):
    if s.startswith(lw, i):
        print (i)
        # the number of spaces before the match, plus one, is the word position
        space = s[:i].count(" ")
        hello = space+1
        print (hello)
        li.append(hello)
print(li)

How to create a dictionary for a text file

My program opens a file and counts the words it contains, but I want to create a dictionary consisting of all the unique words in the text.
For example, if the word 'computer' appears three times, I want that to count as one unique word.
def main():
    file = input('Enter the name of the input file: ')
    infile = open(file, 'r')
    file_contents = infile.read()
    infile.close()
    words = file_contents.split()
    number_of_words = len(words)
    print("There are", number_of_words, "words contained in this paragraph")
main()

Use a set. This will only include unique words:
words = set(words)
If you don't care about case, you can do this:
words = set(word.lower() for word in words)
This assumes there is no punctuation. If there is, you will need to strip the punctuation.
import string
words = set(word.lower().strip(string.punctuation) for word in words)
If you need to keep track of how many of each word you have, just replace set with Counter in the examples above:
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
This will give you a dictionary-like object that tells you how many of each word there is.
You can also get the number of unique words from this (although it is slower if that is all you care about):
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
nword = len(words)
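As a quick check of the Counter version, here is a sketch run on a made-up sample string instead of a file (the text and the variable names are assumptions for illustration). It mirrors the question's example of 'computer' appearing three times.

```python
import string
from collections import Counter

# Stand-in for file_contents from the question (assumed sample text)
text = "The computer is fast. The Computer is new; a computer!"

words = text.split()
# Normalize case and trim punctuation from both ends before counting
counts = Counter(word.lower().strip(string.punctuation) for word in words)

unique_words = len(counts)
print("Unique words:", unique_words)
print("Occurrences of 'computer':", counts["computer"])
```

Note that `strip(string.punctuation)` only trims the ends of each token, so an apostrophe inside a word such as "don't" is left in place.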

@TheBlackCat's solution works, but the set version only tells you how many unique words are in the string or file. This solution also shows how many times each word occurs.
dictionaryName = {}
for word in words:
    if word not in dictionaryName:
        dictionaryName[word] = 1
    else:
        dictionaryName[word] = dictionaryName[word] + 1
print(dictionaryName)
Tested with:
words = "Foo", "Bar", "Baz", "Baz"
Output: {'Foo': 1, 'Bar': 1, 'Baz': 2}

A probably cleaner and quicker solution:
words_dict = {}
for word in words:
    word_count = words_dict.get(word, 0)
    words_dict[word] = word_count + 1
