I'm trying to remove repeated words in a list, saving each word and its location to a file, but it doesn't save the word that occurs after the repeated word. Can someone tell me what's wrong with it?
sen = input("Input a sentence")
sen1 = sen.lower()
sen2 = sen1.split()
sen4 = sen2
f = open("newfile.txt", "w+")
for words in sen2:
    print(words)
    ok = sen1.count(words)
    print(ok)
    sent = sen4.index(words)
    print(sent)
    f.write(str(sent))
    f.write(str(words))
    if ok > 1:
        while ok > 0:
            sen2.remove(words)
            ok = ok - 1
            if ok == 0:
                break
f.close()
You are making a common mistake: modifying a list while you are looping over its items.
Inside the loop for words in sen2: you sometimes execute sen2.remove(words), which modifies the list sen2. Strange things happen when you do this.
To avoid this, make a copy of the list sen2 with sen2copy = sen2[:], then loop over one of them and modify the other. You could do this with
sen2copy = sen2[:]
for words in sen2copy:
or, if you want to be brief,
for words in sen2[:]:
If you don't understand the notation, sen2[:] is a slice of sen2 from the beginning to the end. In other words, it copies each item in sen2 to the new list. If you leave out the brackets and colon you just copy a reference to the entire list, which is not what you want.
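To make the difference concrete, here is a minimal, self-contained sketch of the iterate-over-a-copy pattern, using a made-up sentence rather than the original input() call:

```python
# Sketch: loop over a snapshot (words[:]) while mutating the original list.
words = "the cat sat on the mat the end".split()
for w in words[:]:               # words[:] is a copy, so iteration is stable
    while words.count(w) > 1:    # as long as w is duplicated...
        words.remove(w)          # ...drop its earliest occurrence
print(words)                     # each word now appears exactly once
```

Note that removing the earliest occurrence each time means the last occurrence of a duplicated word is the one that survives.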
I'm a total beginner to Python; I'm studying it at university and my professor gave us some work to do before the exam. I've been stuck on this program for almost two weeks now, and the rule is that we can't use any library.
Basically I have a dictionary with several possible translations from an ancient language to English, a dictionary from English to Italian (only 1 key - 1 value pairs), a text file in the ancient language, and another text file in Italian. So far, what I've done is scan the ancient-language file and search for strings matching the dictionary (using the .strip(".,:;?!") method); I've saved the matching strings that contain at least 2 words in a list of strings.
Now comes the hard part: I need to try every possible combination of translations (values from ancient language to English), then take those translations from English to Italian with the other dictionary, and check if the resulting string exists in the Italian file. If it does, I save the result and the paragraph where it was found (results in different paragraphs don't count; they must be in the same one. I've made a small piece of code to count the paragraphs).
I'm having issues here for the following reasons:
In the strings that I've found, how am I supposed to replace the words and keep the punctuation? The returned result must contain all the punctuation, otherwise the output will be wrong.
If the string is contained in the text but split across 2 different lines, how should I proceed to make it work? For example, I have a string of 5 words: at the end of one line I find the first 2 matching words, but the remaining 3 words are the first 3 words of the next line.
As mentioned, the dictionary from the ancient language to English is huge and can have up to 7 values (translations) for each key (ancient-language word). Is there any efficient way to try all the combinations while checking whether the string exists in a text file? This is probably the hardest part.
Probably the best way to process this is a word-by-word scan each time, and if the sequence is broken I reset it somehow and keep scanning the text file...
Any idea?
Here is commented code showing what I've managed to do so far:
k = 2  # Random value; the whole program will become a function and the "k" value will be different each time
file = [line.strip().split(';') for line in open('lexicon-GR-EN.csv', encoding="utf8").readlines()]  # Opening CSV file with possible translations from ancient Greek to English
gr_en = {words[0]: tuple(words[1:]) for words in file}  # Creating a dictionary with the several translations (values)
file = open('lexicon-EN-IT.csv', encoding="utf8")  # Opening 2nd CSV file
en_it = {}  # Initializing dictionary
for row in file:  # Scanning each row of the CSV file (From English to Italian)
    L = row.rstrip("\n").split(';')  # Clearing newline char and splitting the words
    x = L[0]
    t1 = L[1]
    en_it[x] = t1  # Since in this CSV file all the words are 1 - 1, no length check is necessary (len(L) is always 2 basically)
file = open('odyssey.txt', encoding="utf8")  # Opening text file
result = ()  # Empty tuple
spacechecker = 0  # Variable to determine whether I'm on an even or odd line; if odd the line is scanned normally, otherwise word order and words are reversed
wordcount = 0  # Counter of how many words have been found
paragraph = 0  # Paragraph counter, starts at 0
paragraphspace = 0  # Another paragraph variable; needed to prevent a double space from counting as a paragraph
string = ""  # Empty string to store corresponding sequences
foundwords = []  # Empty list to store words that have been found
completed_sequences = []  # Empty list; all completed sequences of words are stored here
completed_paragraphs = []  # Paragraph list; shows in which paragraph each sequence of completed_sequences was found
for index, line in enumerate(file.readlines()):  # Starting line-by-line scan of the txt file
    words = line.split()  # Splitting words
    if not line.isspace() and index == 0:  # Since I know nothing about the "secret tests" that will be run against this program, this check marks the start of the first paragraph to prevent errors: if the first line is not space
        paragraph += 1  # Add +1 to paragraph counter
        spacechecker += 1  # Add +1 to spacechecker
    elif not line.isspace() and paragraphspace == 1:  # Checking if the previous line was space and the current one is not
        paragraphspace = 0  # Resetting the paragraphspace (previous line was space) value
        spacechecker += 1  # Increasing spacechecker by 1
        paragraph += 1  # This means we're on a new paragraph, so +1 to paragraph
    elif line.isspace() and paragraphspace == 1:  # Checking if the current line is space and the previous line was space too
        continue  # Do nothing and cycle again
    elif line.isspace():  # Checking if the current line is space
        paragraphspace += 1  # Increase paragraphspace (previous line was space) by 1
        continue
    else:
        spacechecker += 1  # In any other case, increase spacechecker by 1
    if spacechecker % 2 == 1:  # Check if spacechecker is odd
        for i in range(len(words)):  # If yes, scan the words in normal order
            if words[i].strip(",.!?:;-") in gr_en != "[unavailable]":  # If words[i] without any special char is in the dictionary
                currword = words[i]  # If yes, we will call it "currword"
                foundwords.append(currword)  # Add currword to the foundwords list
                wordcount += 1  # Increase wordcount by 1
            elif (words[i].strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword not in gr_en and wordcount >= k):  # Elif: check if it's not in the dictionary but wordcount has reached k
                string = " ".join(foundwords)  # Put the foundwords list into a string
                completed_sequences.append(string)  # And add this string to the list of completed_sequences
                completed_paragraphs.append(paragraph)  # Then add the paragraph of that string to completed_paragraphs
                result = list(zip(completed_sequences, completed_paragraphs))  # This is the required output format: a tuple with the string and its paragraph
                wordcount = 0
                foundwords.clear()  # Clearing the foundwords list
            else:  # If none of the above happened (word is not in the dictionary and wordcount still isn't >= k)
                wordcount = 0  # Reset wordcount to 0
                foundwords.clear()  # Clear the foundwords list
                continue  # Do nothing and cycle again
    else:  # The case of spacechecker not being odd
        words = words[::-1]  # Reverse the word order
        for i in range(len(words)):  # Scanning the row of words
            currword = words[i][::-1]  # Currword is reversed here, since the words on even lines are written in reverse
            if currword.strip(",.!?:;-") in gr_en != "[unavailable]":  # If currword without any special char is in the dictionary
                foundwords.append(currword)  # Append it to the foundwords list
                wordcount += 1  # Increase wordcount by 1
            elif (currword.strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword.strip(",.!?:;-") not in gr_en and wordcount >= k):  # Elif: check if it's not in the dictionary but wordcount has reached k
                string = " ".join(foundwords)  # Put the words that have been found into a string
                completed_sequences.append(string)  # Append the string to the completed_sequences list
                completed_paragraphs.append(paragraph)  # Append the paragraph of the string to the completed_paragraphs list
                result = list(zip(completed_sequences, completed_paragraphs))  # Adding to the result the tuple combination of strings and corresponding paragraphs
                wordcount = 0  # Reset wordcount
                foundwords.clear()  # Clear the foundwords list
            else:  # In case none of the above happened
                wordcount = 0  # Reset wordcount to 0
                foundwords.clear()  # Clear the foundwords list
                continue  # Do nothing and cycle again
I'd probably take the following approach to solving this:
Try to collapse down the 2 word dictionaries into one (ancient_italian below), removing English from the equation. For example, if ancient->English has {"canus": ["dog","puppy", "wolf"]} and English->Italian has {"dog":"cane"} then you can create a new dictionary {"canus": "cane"}. (Of course if the English->Italian dict has all 3 English words, you need to either pick one, or display something like cane|cucciolo|lupo in the output).
Come up with a regular expression that can distinguish between words and separators (punctuation), and output them in order into a list (word_list below), i.e. something like ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
Step through this list, generating a new list. Something like:
translation = []
for item in word_list:
    if item.isalpha():
        # It's a word - translate it and add to the list
        translation.append(ancient_italian[item])
    else:
        # It's a separator - add to the list as-is
        translation.append(item)
Finally join the list back together: ''.join(translation)
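For step 2, one possible regular expression is sketched below (the exact character classes are an assumption; for actual ancient Greek text you would widen the word class beyond [a-zA-Z]):

```python
import re

def tokenize(text):
    # A run of letters is a word; any single non-letter character
    # (space, punctuation) is kept as its own separator token, in order.
    return re.findall(r'[a-zA-Z]+|[^a-zA-Z]', text)

print(tokenize('ecce! magnus canus esurit.'))
```

This produces exactly the kind of list described above, with words and separators interleaved so that joining the translated list reproduces the punctuation.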
I'm unable to reply to your comment on the answer by match, but this may help:
For one, it's not the most elegant approach, but it should work:
GR_IT = {}
for greek, eng in GR_EN.items():
    for word in eng:
        try:
            GR_IT[greek] = EN_IT[word]
        except KeyError:
            pass
If there's no translation for a word it will be ignored, though.
To get a list of words and punctuation split apart, try this:
def repl_punc(s):
    punct = ['.', ',', ':', ';', '?', '!']
    for p in punct:
        s = s.replace(p, ' ' + p + ' ')
    return s

repl_punc(s).split()
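For example (with a made-up input string, since s is not defined in the snippet above), the helper behaves like this:

```python
def repl_punc(s):
    # Pad each punctuation mark with spaces so split() isolates it
    punct = ['.', ',', ':', ';', '?', '!']
    for p in punct:
        s = s.replace(p, ' ' + p + ' ')
    return s

print(repl_punc('ecce! magnus canus.').split())
```

Note that unlike the regex approach, split() discards the spaces themselves, so you get words and punctuation but not the original spacing.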
I am trying to generate a sentence in the style of the bible. But whenever I run it, it stops with a KeyError on the same exact word. This is confusing, as it is only using its own keys, and it is the same word every time in the error despite random.choice.
This is the txt file if you want to run it: ftp://ftp.cs.princeton.edu/pub/cs226/textfiles/bible.txt
import random

files = []
content = ""
output = ""
words = {}
files = ["bible.txt"]
sentence_length = 200
for file in files:
    file = open(file)
    content = content + " " + file.read()
content = content.split(" ")
for i in range(100):  # I didn't want to go through every word in the bible, so I'm just going through 100 words
    words[content[i]] = []
    words[content[i]].append(content[i+1])
word = random.choice(list(words.keys()))
output = output + word
for i in range(int(sentence_length)):
    word = random.choice(words[word])
    output = output + word
print(output)
The KeyError happens on this line:
word = random.choice(words[word])
It always happens for the word "midst". How? "midst" is the 100th word in the text, and that 100th position is the first time it is seen. The consequence is that "midst" itself was never put into words as a key. Hence the KeyError.
Why does the program reach this word so fast? Partly because of a bug here:
for i in range(100):
    words[content[i]] = []
    words[content[i]].append(content[i+1])
The bug here is the words[content[i]] = [] statement. Every time you see a word, you recreate an empty list for it. And the word before "midst" is "the". It's a very common word; many other words in the text are followed by "the". And since words["the"] is ["midst"], the problem tends to happen a lot, despite the randomness.
You can fix the bug when building words:
for i in range(100):
    if content[i] not in words:
        words[content[i]] = []
    words[content[i]].append(content[i+1])
And then, when you select words randomly, I suggest adding an if word in words condition, to handle the corner case of the last word in the input.
"midst" is the 101st word in your source text and it is the first time it shows up. When you do this:
words[content[i]].append(content[i+1])
you are making a key:value pair, but you aren't guaranteed that the value will also exist as a key. So when you later use that value to look up a key, the key doesn't exist and you get a KeyError.
If you change your range to 101 instead of 100 you will see that your program almost works. That is because the 102nd word is "of" which has already occurred in your source text.
It's up to you how you want to deal with this edge case. You could do something like this:
if i == (100-1):
    words[content[i]].append(content[0])
else:
    words[content[i]].append(content[i+1])
which basically loops back around to the beginning of the source text when you get to the end.
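Putting both fixes together (guarding the dict entries, and wrapping around at the last word), a minimal sketch with a tiny made-up corpus (standing in for bible.txt) might look like:

```python
import random

# Tiny stand-in corpus; in the real program this comes from bible.txt
content = "in the beginning god created the heaven and the earth".split()
n = len(content)

words = {}
for i in range(n):
    successor = content[(i + 1) % n]      # wrap around at the last word
    words.setdefault(content[i], []).append(successor)

word = random.choice(list(words))
output = [word]
for _ in range(10):
    word = random.choice(words[word])     # every word is now a valid key
    output.append(word)
print(" ".join(output))
```

setdefault creates the empty list only the first time a word is seen, which is the same fix as the if content[i] not in words check above.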
What I am trying to do in my program is have the program open a file with many different words inside it, receive a user input, and check if any word inside the file is in the user input.
Inside the file redflags.txt:
happy
angry
ball
jump
each word on a different line.
For example if the user input is "Hey I am a ball" then it will print redflag.
If the user input is "Hey this is a sphere" then it will print noflag.
Redflags = open("redflags.txt")
data = Redflags.read()
text_post = raw_input("Enter text you wish to analyse")
words = text_post.split() and text_post.lower()
if data in words:
    print("redflag")
else:
    print("noflag")
This should do the trick! Set lookups are generally much faster than list lookups. Sets can give you the intersection (overlapping words, as in this case), differences, etc. We consume the file that has the list of words, remove newline characters, lowercase each word, and we have our first list. The second list is created from the user input, split on spaces. Now we can take the set intersection to see if any common words exist.
# read in the words from the file, one word per line
# replace newline chars and lowercase
words = [word.replace('\n', '').lower() for word in open('filepath_to_word_list.txt', 'r').readlines()]

# collect user input, lowercase it, and split into a list
user_input = raw_input('Please enter your words').lower().split()

# use set intersection to find if common words exist and print redflag
if set(words).intersection(set(user_input)):
    print('redflag')
else:
    print('noflag')
with open('redflags.txt', 'r') as f:
    # See this post: https://stackoverflow.com/a/20756176/1141389
    file_words = f.read().splitlines()

# get the user's words, lower case them all, then split
user_words = raw_input('Please enter your words').lower().split()

# use sets to find if any user_words are in file_words
if set(file_words).intersection(set(user_words)):
    print('redflag')
else:
    print('noredflag')
I would suggest using list comprehensions.
Take a look at this code, which does what you want (I will explain it below):
Redflags = open("redflags.txt")
data = Redflags.readlines()
data = [d.strip() for d in data]
text_post = input("Enter text you wish to analyse:")
text_post = text_post.split()
fin_list = [i for i in data if i in text_post]
if fin_list:
    print("RedFlag")
else:
    print("NoFlag")
Output 1:
Enter text you wish to analyse:i am sad
NoFlag
Output 2:
Enter text you wish to analyse:i am angry
RedFlag
So first open the file and read it using readlines(); this gives a list of lines from the file:
>>> data = Redflags.readlines()
['happy\n', 'angry \n', 'ball \n', 'jump\n']
See all those unwanted spaces and newlines (\n)? Use strip() to remove them! But you can't strip() a list, so take the individual items from the list and apply strip() to each. This can be done efficiently using a list comprehension.
data = [d.strip() for d in data]
Also, why are you using raw_input()? In Python 3, use input() instead.
After getting and splitting the input text.
fin_list = [i for i in data if i in text_post]
I'm creating a list of items where each item i from the data list is also in the text_post list. This way I get the items common to both lists.
>>> fin_list
['angry'] #if my input is "i am angry"
Note: In Python, empty lists are considered false,
>>> bool([])
False
while those with values are considered true:
>>> bool(['some','values'])
True
This way your if only executes if the list is non-empty, meaning 'RedFlag' will be printed only when a common item is found between the two lists. You get what you want.
I'm working on learning Python with Program Arcade Games and I've gotten stuck on one of the labs.
I'm supposed to compare each word of a text file (http://programarcadegames.com/python_examples/en/AliceInWonderLand200.txt) to find if it is not in the dictionary file (http://programarcadegames.com/python_examples/en/dictionary.txt) and then print it out if it is not. I am supposed to use a linear search for this.
The problem is even words I know are not in the dictionary file aren't being printed out. Any help would be appreciated.
My code is as follows:
# Imports regular expressions
import re

# This function takes a line of text and returns
# a list of words in the line
def split_line(line):
    split = re.findall('[A-Za-z]+(?:\'\"[A-Za-z]+)?', line)
    return split

# Opens the dictionary text file and adds each line to an array, then closes the file
dictionary = open("dictionary.txt")
dict_array = []
for item in dictionary:
    dict_array.append(split_line(item))
print(dict_array)
dictionary.close()

print("---Linear Search---")

# Opens the text for the first chapter of Alice in Wonderland
chapter_1 = open("AliceInWonderland200.txt")

# Breaks down the text by line
for each_line in chapter_1:
    # Breaks down each line to a single word
    words = split_line(each_line)
    # Checks each word against the dictionary array
    for each_word in words:
        i = 0
        # Continues as long as there are more words in the dictionary and no match
        while i < len(dict_array) and each_word.upper() != dict_array[i]:
            i += 1
        # if no match was found print the word being checked
        if not i <= len(dict_array):
            print(each_word)

# Closes the first chapter file
chapter_1.close()
Something like this should do (pseudo code)
sampleDict = {}
For each word in AliceInWonderLand200.txt:
    sampleDict[word] = True

actualWords = {}
For each word in dictionary.txt:
    actualWords[word] = True

For each word in sampleDict:
    if not (word in actualWords):
        # Oh no! word isn't in the dictionary
A set may be more appropriate than a dict, since the value of the dictionary in the sample isn't important. This should get you going, though
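Following that hint, a concrete set-based sketch could look like the following (the word lists are made up here, since the real ones come from the two text files):

```python
# Words from the sample text and from the dictionary file; in the real
# program these would be built by reading and splitting the two files.
sample_words = {"alice", "was", "beginning", "to", "get", "verry", "tired"}
dict_words = {"alice", "was", "beginning", "to", "get", "very", "tired"}

# The words not found in the dictionary are exactly the set difference.
misspelled = sample_words - dict_words
print(sorted(misspelled))
```

The set difference replaces the whole inner linear-search loop, and membership tests on a set are O(1) on average instead of O(n).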
I have a file in which I am trying to replace parts of a line with another word.
A line looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to go about doing this, but I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a (:) in a spot where I don't want the line to be split, if that makes any sense...
If anyone could help I would really appreciate it.
ok, it looks to me like you are using a colon (:) to separate your strings.
in this case you can use .split(":") to break your strings into their component substrings
eg:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, with the same number of substrings in the main string, you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")

secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")

if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1, seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is because the index is being used to find the relevant strings rather than trying to use some form of string type matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists that are too short) you can explicitly test for them before you start, using len(list) to see how many elements are in it.
or you could let it run and catch the exception, however in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
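a small sketch of that length check, with one deliberately malformed row for illustration, might look like:

```python
rows = [
    "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212",
    "short:line",                 # malformed on purpose
]
hashes = []
for raw in rows:
    fields = raw.split(":")
    if len(fields) < 5:           # too short to index safely; skip it
        continue
    hashes.append(fields[3])
print(hashes)
```

this way a bad line is simply skipped instead of raising an IndexError or, worse, silently matching the wrong field.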
hope this helps
James
EDIT:
ok, so if you are trying to match up a long list of strings from files, you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode="r")
secondfile = open("secondfile.txt", mode="r")

first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()

first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n", "").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n", "").split(":"))

output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1, entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break

for entry in output_strings:
    print(entry)
this should achieve what you're after, and as a proof of concept it will print the resulting list of strings for you.
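as a side note (a variation, not part of the answer above): if the second file is large, the inner loop can be avoided by building a dict keyed on the hash field once up front, making each lookup O(1). a sketch with inline sample data:

```python
first_data = [["bobkeiser", "bob123#bobscarshop.com", "0.0.0.0.0",
               "23rh32o3hro2rh2", "234212"]]
second_data = [["23rh32o3hro2rh2", "poniacvibe"], ["deadbeef", "other"]]

# hash -> replacement word, built once
lookup = {entry[0]: entry[1] for entry in second_data}

output_strings = []
for item in first_data:
    if item[3] in lookup:
        parts = item[:]                      # copy; leave the source list intact
        parts.insert(1, lookup[item[3]])     # same insert position as above
        output_strings.append(":".join(parts))
print(output_strings)
```

":".join(parts) also replaces the manual string-stitching loop.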
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file change the last two lines to:
outputfile = open("outputfile.txt", mode="w")
for entry in output_strings:
    outputfile.write(entry + "\n")
outputfile.close()