I'm a total beginner to Python. I'm studying it at university and the professor gave us some work to do before the exam. I've been stuck on this program for almost two weeks now; the rule is that we can't use any library.
Basically I have a dictionary with several possible translations from an ancient language to English, a dictionary from English to Italian (only 1 key - 1 value pairs), a text file in the ancient language, and another text file in Italian. What I've done so far is scan the ancient-language file and search for strings that match the dictionary (using the .strip(".,:;?!") method), and I've saved the matching strings that contain at least 2 words in a list of strings.
Now comes the hard part: I need to try every possible combination of translations (values from ancient language to English), then take those translations from English to Italian with the other dictionary and check whether the resulting string exists in the Italian file. If it does, I save the result and the paragraph where it was found (a result split across different paragraphs doesn't count, it must be within the same one; I've written a small piece of code to count the paragraphs).
I'm having issues here for the following reasons:
In the strings I've found, how am I supposed to replace the words while keeping the punctuation? The returned result must contain all the punctuation, otherwise the output will be wrong.
If the string is in the text but split across 2 different lines, how should I proceed to make it work? For example, I have a string of 5 words: at the end of one line I find the first 2 words, but the remaining 3 words are the first 3 words of the next line.
As mentioned before, the dictionary from the ancient language to English is huge and can have up to 7 values (translations) for each key (ancient-language word). Is there any efficient way to try all the combinations while checking whether the string exists in the text file? This is probably the hardest part.
Probably the best approach is to scan word by word every time, and whenever the sequence is broken, reset it somehow and keep scanning the text file...
Any ideas?
Here is the commented code of what I've managed to do so far:
k = 2  # Random value; the whole program will become a function and the "k" value will be different each time
file = [line.strip().split(';') for line in open('lexicon-GR-EN.csv', encoding="utf8").readlines()]  # Opening the CSV file with possible translations from ancient Greek to English
gr_en = {words[0]: tuple(words[1:]) for words in file}  # Creating a dictionary with the several translations (values)
file = open('lexicon-EN-IT.csv', encoding="utf8")  # Opening the 2nd CSV file
en_it = {}  # Initializing dictionary
for row in file:  # Scanning each row of the CSV file (from English to Italian)
    L = row.rstrip("\n").split(';')  # Stripping the newline char and splitting the words
    x = L[0]
    t1 = L[1]
    en_it[x] = t1  # Since in this CSV file all the words are 1-to-1, no length check is necessary (len(L) is always 2)
file = open('odyssey.txt', encoding="utf8")  # Opening the text file
result = ()  # Empty tuple
spacechecker = 0  # Variable that tells whether I'm on an even or odd line: if odd the line is scanned normally, otherwise word order and words are reversed
wordcount = 0  # Counter of how many words have been found
paragraph = 0  # Paragraph counter, starts at 0
paragraphspace = 0  # Another paragraph variable, needed to prevent a double blank line from counting as a paragraph
string = ""  # Empty string to store matching sequences
foundwords = []  # Empty list to store words that have been found
completed_sequences = []  # Empty list where all completed sequences of words are stored
completed_paragraphs = []  # Paragraph list: shows in which paragraph each sequence of completed_sequences was found
for index, line in enumerate(file.readlines()):  # Starting the line-by-line scan of the txt file
    words = line.split()  # Splitting words
    if not line.isspace() and index == 0:  # Since I know nothing about the "secret tests" that will be run on this program, this check handles the start of the first paragraph: if the first line is not blank
        paragraph += 1  # Add +1 to the paragraph counter
        spacechecker += 1  # Add +1 to spacechecker
    elif not line.isspace() and paragraphspace == 1:  # Checking if the previous line was blank and the current one is not
        paragraphspace = 0  # Resetting the paragraphspace (previous line was blank) value
        spacechecker += 1  # Increasing spacechecker by 1
        paragraph += 1  # This means we are on a new paragraph, so +1 to paragraph
    elif line.isspace() and paragraphspace == 1:  # Checking if the current line is blank and the previous line was blank too
        continue  # Do nothing and cycle again
    elif line.isspace():  # Checking if the current line is blank
        paragraphspace += 1  # Increase paragraphspace (previous line was blank variable) by 1
        continue
    else:
        spacechecker += 1  # In any other case increase spacechecker by 1
    if spacechecker % 2 == 1:  # Check if spacechecker is odd
        for i in range(len(words)):  # If so, scan the words in normal order
            if words[i].strip(",.!?:;-") in gr_en != "[unavailable]":  # If words[i] without any special char is in the dictionary
                currword = words[i]  # If yes, we will call it "currword"
                foundwords.append(currword)  # Add currword to the foundwords list
                wordcount += 1  # Increase wordcount by 1
            elif (words[i].strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword not in gr_en and wordcount >= k):  # Elif: check if it's not in the dictionary but wordcount has already reached k
                string = " ".join(foundwords)  # Join the foundwords list into a string
                completed_sequences.append(string)  # And add this string to the list of completed_sequences
                completed_paragraphs.append(paragraph)  # Then add the paragraph of that string to the completed_paragraphs list
                result = list(zip(completed_sequences, completed_paragraphs))  # This is the required output format: tuples of (string, paragraph)
                wordcount = 0
                foundwords.clear()  # Clearing the foundwords list
            else:  # If none of the above happened (the word is not in the dictionary and wordcount still isn't >= k)
                wordcount = 0  # Reset wordcount to 0
                foundwords.clear()  # Clear the foundwords list
                continue  # Do nothing and cycle again
    else:  # The case of spacechecker being even
        words = words[::-1]  # Reverse the word order
        for i in range(len(words)):  # Scanning the row of words
            currword = words[i][::-1]  # currword is reversed in this case since the words on even lines are written backwards
            if currword.strip(",.!?:;-") in gr_en != "[unavailable]":  # If currword without any special char is in the dictionary
                foundwords.append(currword)  # Append it to the foundwords list
                wordcount += 1  # Increase wordcount by 1
            elif (currword.strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword.strip(",.!?:;-") not in gr_en and wordcount >= k):  # Elif: check if it's not in the dictionary but wordcount has already reached k
                string = " ".join(foundwords)  # Join the words that have been found into a string
                completed_sequences.append(string)  # Append the string to the completed_sequences list
                completed_paragraphs.append(paragraph)  # Append the paragraph of the string to the completed_paragraphs list
                result = list(zip(completed_sequences, completed_paragraphs))  # Adding to the result the (string, paragraph) pairs
                wordcount = 0  # Reset wordcount
                foundwords.clear()  # Clear the foundwords list
            else:  # In case none of the above happened
                wordcount = 0  # Reset wordcount to 0
                foundwords.clear()  # Clear the foundwords list
                continue  # Do nothing and cycle again
I'd probably take the following approach to solving this:
Try to collapse down the 2 word dictionaries into one (ancient_italian below), removing English from the equation. For example, if ancient->English has {"canus": ["dog","puppy", "wolf"]} and English->Italian has {"dog":"cane"} then you can create a new dictionary {"canus": "cane"}. (Of course if the English->Italian dict has all 3 English words, you need to either pick one, or display something like cane|cucciolo|lupo in the output).
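As a rough sketch of that collapsing step (using the gr_en and en_it dictionaries from the question's code, and joining multiple Italian options with | as suggested):

ancient_italian = {}
for ancient, english_words in gr_en.items():
    # keep every English sense that has an Italian entry; join the alternatives with "|"
    italian_options = [en_it[w] for w in english_words if w in en_it]
    if italian_options:
        ancient_italian[ancient] = "|".join(italian_options)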
Come up with a regular expression that can distinguish between words and separators (punctuation), and output them in order into a list (word_list below), i.e. something like ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
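One possible pattern for that step, if the re module were allowed (the assignment forbids libraries, so this is only illustrative), makes each word one token and every non-letter character its own token; for real Greek text the [a-zA-Z] class would need widening:

import re

word_list = re.findall(r"[a-zA-Z]+|[^a-zA-Z]", "ecce! magnus canus esurit.")
print(word_list)  # ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']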
Step through this list, generating a new list. Something like:
translation = []
for item in word_list:
    if item.isalpha():
        # It's a word - translate it and add to the list
        translation.append(ancient_italian[item])
    else:
        # It's a separator - add to the list as-is
        translation.append(item)
Finally join the list back together: ''.join(translation)
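For example, with a couple of made-up ancient_italian entries (purely illustrative) and the example word_list from above, and using .get() so unknown words pass through unchanged instead of raising KeyError, the whole thing would run like this:

ancient_italian = {"canus": "cane", "magnus": "grande"}  # hypothetical entries for illustration
word_list = ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
translation = [ancient_italian.get(item, item) if item.isalpha() else item
               for item in word_list]
print(''.join(translation))  # -> ecce! grande cane esurit.

Because the separators are carried through untouched, the punctuation in the output matches the original text.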
I'm unable to reply to your comment on the answer by match, but this may help:
It's not the most elegant approach, but it should work:
GR_IT = {}
for greek, eng in GR_EN.items():
    for word in eng:
        try:
            GR_IT[greek] = EN_IT[word]
        except KeyError:
            pass
If there's no translation for a word it will be ignored, though.
To get a list with the words and punctuation split apart, try this:
def repl_punc(s):
    punct = ['.', ',', ':', ';', '?', '!']
    for p in punct:
        s = s.replace(p, ' ' + p + ' ')
    return s

repl_punc(s).split()
Related question:
The basic task is to write a function, get_words_from_file(filename), that returns a list of lower-case words that are within the region of interest. We were given a regular expression, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words meeting this definition. My code works well on some of the tests but fails when the line that indicates the region of interest is repeated.
Here is my code:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, i.e. every word in the text file, but not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag = True
                    count += 1
            elif line.startswith("*** END"):
                flag = False
                break
            elif flag:
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
    return words
#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
print(word)
The issue is that the string "*** START OF" is repeated inside the region of interest, and the words on that line aren't being included.
The test code should result in:
bee.txt loaded ok.↩
16 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
start↩
of↩
synthetic↩
test↩
case↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too
But I'm getting:
bee.txt loaded ok.↩
11 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too
Any help would be great!
The specific problem in your code is the if .. elif .. elif statement: you're ignoring every line that looks like the line signalling the start or end of a block, even when it's inside the test block.
You wanted something like this for your function:
def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, i.e. every word in the text file, but not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)
    return words
This is assuming you are actually looking for the whole line as a marker, but of course you can still use .startswith() if you actually accept that as the start or end of the block, as long as it's sufficiently unambiguous.
Your idea of using a flag is fine, although naming a flag after whatever it represents is always a good idea.
Some of the unique words in the text file are not counted, and I have no idea what's wrong with my code.
file = open('tweets2.txt', 'r')
unique_count = 0
lines = file.readlines()
line = lines[3]
per_word = line.split()
for i in per_word:
    if line.count(i) == 1:
        unique_count = unique_count + 1
print(unique_count)
file.close()
Here is the text file:
"I love REDACTED and Fiesta and all but can REDACTED host more academic-related events besides strand days???"
The output of this code is:
16
The expected output for this line of the text file should be:
17
"i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
The output of this code is:
20
The expected output for this line of the text file should be:
23
If you want to count the number of unique whitespace delimited tokens (case-sensitive) in the entire file then:
with open('myfile.txt') as infile:
    print(len(set(infile.read().split())))
count() on a string matches substrings (sequences of characters), not whole words; instead, use the Pythonic way with set() to drop duplicated words:
per_word = set(line.split())
print(len(per_word))
You are counting each word as a substring in the whole line because you do:
for i in per_word:
    if line.count(i) == 1:
So now some words are repeated as substrings, but not as words. For example, the first word is "i". line.count("i") gives 7 (it is also in "if", "im", etc.) so you don't count it as a unique word (even though it is). If you do:
for i in per_word:
    if per_word.count(i) == 1:
then you will count each word as a whole word and get the output you need.
Anyway this is very inefficient (O(n^2)) as you iterate over each word and then count iterates over the whole list again to count it. Either use a set as suggested in other answers or use a Counter:
from collections import Counter
unique_count = 0
line = "i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
per_word = line.split()
counter = Counter(per_word)
for count in counter.values():
    if count == 1:
        unique_count += 1
# Or simply
unique_count = sum(count == 1 for count in counter.values())
print(unique_count)
I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file along with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the txt file, finds the word pairs, and counts how many times each word pair was seen in the corpus. I have managed to write the code below:
tuples = [[]]  # array for word tuples to be stored in
data = []  # array for tuple frequencies to be stored in
data.append(0)  # the tuples array starts with an empty element at the beginning for some reason;
# adding a zero at the beginning of the frequency array keeps the indexes of the two arrays aligned

with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split the line into an array of words
    tupleIndex = 0
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples:  # if the word pair was seen before
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1  # increment the frequency of said pair
        else:
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])  # if the word pair was never seen before,
            data.append(1)  # add the pair to the list and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if (lineIndex % 1000) == 0:
        print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write the tuples to the txt file
    for tuple in tuples:
        if len(tuple) > 0:  # if the tuple is not empty
            markovWindowSize1File.write(str(tuple[0]) + "," + str(tuple[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    # blank lines between the two data sections

    # write the frequencies of the tuples to the txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to be working well for the first couple thousands of lines. Then things start to get slower because the tuple list keeps getting bigger and I have to search the whole tuple list to check if the next word pair was seen before or not. I managed to get the data of 50k lines in 30 minutes but I have much bigger corpuses with millions of lines. Is there a way to store and search for the word pairs in a more efficient way? Matrices would probably work a lot faster but my unique word count is about 300.000 words. Which means I have to create a 300k*300k matrix with integers as data type. Even after taking advantage of symmetric matrices, it would require a lot more memory than what I have.
I tried using memmap from numpy to store the matrix in disk rather than memory but it required about 500 GB free disk space.
Then I studied the sparse matrices and found out that I can just store the non-zero values and their corresponding row and column numbers. Which is what I did in my code.
Right now this model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train on bigger corpora to get better results. What can I do to make this word-pair finding code more efficient?
Thanks.
Edit: Thanks to everyone who answered. I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {}  # empty dict

with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split the line into an array of words
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict:  # if the word pair was seen before
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  # increment the frequency of said pair
        else:
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1  # if the word pair was never seen before,
            # add the pair to the dict and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if (lineIndex % 1000) == 0:
        print(lineIndex)
end = time.time()
print(end - start)

keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
As I understand it, you are trying to build a Hidden Markov Model, using frequencies of n-grams (word tuples of length n). Maybe just try a more efficiently searchable data structure, for example a nested dictionary. It could be of the form
{ID_word1: {ID_word1: x1, ... ID_wordk: y1}, ... ID_wordk: {ID_word1: xn, ... ID_wordk: yn}}.
This would mean you have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should boost your performance, since you do not have to search a growing list of tuples. x and y stand for the occurrence counts, which you should increment whenever you encounter a tuple. (Never use the built-in function count()!)
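A minimal sketch of that nested-dictionary idea, assuming markovData.txt holds one sentence of space-separated word IDs per line:

bigrams = {}  # outer key: first word ID, inner key: following word ID, value: count
with open("markovData.txt", encoding="utf8") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):  # adjacent pairs in the line
            inner = bigrams.setdefault(first, {})
            inner[second] = inner.get(second, 0) + 1

Each lookup and update here is a constant-time dictionary operation, so the cost stays roughly linear in the corpus size instead of growing with the number of distinct pairs.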
I would also look into collections.Counter, a data structure made for your task. A Counter object is like a dictionary but counts the occurrences of a key entry. You could use this by simply incrementing a word pair as you encounter it:
from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over adjacent word pairs in the line and count them
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can construct the tuple list as you have and simply pass this into a Counter as an object to compute the frequencies at the end:
word_counts = Counter(word_tuple_list)
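Note that the pairs must be tuples (not lists) so they are hashable and usable as Counter keys. A small, hypothetical usage sketch for prediction, picking the most frequent follower of a given word ID (the ID "17" here is just an example):

current = "17"  # hypothetical word ID
followers = {pair[1]: n for pair, n in word_counts.items() if pair[0] == current}
if followers:
    print(max(followers, key=followers.get))  # most likely next word ID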
I have this code where I am trying to decompress some text. First I compress it, which all works, but when I go on to decompress it there is a ValueError.
List.append(dic[int(bob)])
ValueError: invalid literal for int() with base 10: '1,2,3,4,5,6,7,8,9,'
This is the code...
def menu():
    print("..........................................................")
    para = input("Please enter a paragraph.")
    print()
    s = para.split()  # splits the sentence
    another = [0]  # will gradually hold all of the numbers, repeated or not
    index = []  # empty index
    word_dictionary = []  # names the word_dictionary variable
    for count, i in enumerate(s):  # has a count and an index for enumerating the split sentence
        if s.count(i) < 2:  # if it is not repeated
            another.append(max(another) + 1)  # adds the next count to another
        else:  # if it has been repeated
            another.append(s.index(i) + 1)  # adds the index of (i) to another
    new = " "  # this has been added because otherwise the numbers would be 01234567891011121341513161718192320214
    another.remove(0)  # takes away the 0 so that it doesn't have a 0 at the start
    for word in s:  # for every word in the list
        if word not in word_dictionary:  # if it's not in word_dictionary
            word_dictionary.append(word)  # adds it to the dictionary
        else:  # otherwise
            s.remove(word)  # it will remove the word
    fo = open("indx.txt", "w+")  # opens the file
    for index in another:  # for each index in another
        index = str(index)  # turn it into a string
        fo.write(index)  # add the index to the file
        fo.write(new)  # add a space
    fo.close()  # closes the file
    fo = open("words.txt", "w+")  # opens the words file
    for word in word_dictionary:
        fo.write(str(word))  # adds the word to the file
        fo.write(new)
    fo.close()  # closes the file

menu()

index = open("indx.txt", "r+").read()
dic = open("words.txt", "r+").read()
index = index.split()
dic = dic.split()
Num = 0
List = []
while Num != len(index):
    bob = index[Num]
    List.append(dic[int(bob)])
    Num += 1
print(List)
The problem is down on line 50, with List.append(dic[int(bob)]).
Is there a way to stop the error message from popping up and get the code to output the sentence as it was input above?
This is the latest error message that has occurred:
List.append(dic[int(bob)])
IndexError: list index out of range
When I run the code, I input "This is a sentence. This is another sentence, with commas."
The issue is that index = index.split() splits on whitespace by default, and, as the exception shows, your numbers are separated by commas.
Without seeing indx.txt I can't be certain it will fix all of your indexes, but for the issue in the OP you can fix it by specifying what to split on, namely a comma:
index = index.split(',')
To your second issue, List.append(dic[int(bob)]) raising IndexError: list index out of range: your indexes start at 1, not 0, so you are off by one when reconstituting your array.
This can be fixed with:
List.append(dic[int(bob) - 1])
Additionally you're doing a lot more work than you need to. This:
fo = open("indx.txt", "w+")  # opens the file
for index in another:  # for each index in another
    index = str(index)  # turn it into a string
    fo.write(index)  # add the index to the file
    fo.write(new)  # add a space
fo.close()  # closes the file
is equivalent to:
with open("indx.txt", "w") as fo:
    for index in another:
        fo.write(str(index) + new)
and this:
Num = 0
List = []
while Num != len(index):
    bob = index[Num]
    List.append(dic[int(bob)])
    Num += 1
is equivalent to
List = []
for item in index:
    List.append(dic[int(item)])
Also, take a moment to review PEP-8 and try to follow those standards. Your code is very difficult to read because it doesn't follow them. I fixed the formatting on your comments so StackOverflow's parser could parse your code, but most of them only add clutter.
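Putting the two fixes together, a sketch of the decompression part (assuming, as above, that indx.txt really ended up comma-separated and that the stored indexes are 1-based, which they are since the compression code adds 1):

with open("indx.txt") as f:
    index = [item for item in f.read().split(',') if item.strip()]  # drop the empty piece after a trailing comma
with open("words.txt") as f:
    dic = f.read().split()
List = [dic[int(item) - 1] for item in index]  # the -1 corrects the off-by-one
print(List)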
I'm working on learning Python with Program Arcade Games and I've gotten stuck on one of the labs.
I'm supposed to compare each word of a text file (http://programarcadegames.com/python_examples/en/AliceInWonderLand200.txt) to find if it is not in the dictionary file (http://programarcadegames.com/python_examples/en/dictionary.txt) and then print it out if it is not. I am supposed to use a linear search for this.
The problem is that even words I know are not in the dictionary file aren't being printed out. Any help would be appreciated.
My code is as follows:
# Imports regular expressions
import re

# This function takes a line of text and returns
# a list of words in the line
def split_line(line):
    split = re.findall('[A-Za-z]+(?:\'\"[A-Za-z]+)?', line)
    return split

# Opens the dictionary text file and adds each line to an array, then closes the file
dictionary = open("dictionary.txt")
dict_array = []
for item in dictionary:
    dict_array.append(split_line(item))
print(dict_array)
dictionary.close()

print("---Linear Search---")

# Opens the text for the first chapter of Alice in Wonderland
chapter_1 = open("AliceInWonderland200.txt")
# Breaks down the text by line
for each_line in chapter_1:
    # Breaks down each line to a single word
    words = split_line(each_line)
    # Checks each word against the dictionary array
    for each_word in words:
        i = 0
        # Continues as long as there are more words in the dictionary and no match
        while i < len(dict_array) and each_word.upper() != dict_array[i]:
            i += 1
        # If no match was found, print the word being checked
        if not i <= len(dict_array):
            print(each_word)

# Closes the first chapter file
chapter_1.close()
Something like this should do (pseudo code)
sampleDict = {}
For each word in AliceInWonderland200.txt:
    sampleDict[word] = True

actualWords = {}
For each word in dictionary.txt:
    actualWords[word] = True

For each word in sampleDict:
    if not (word in actualWords):
        # Oh no! word isn't in the dictionary
A set may be more appropriate than a dict, since the value of the dictionary in the sample isn't important. This should get you going, though.
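For completeness, a runnable version of that idea using sets, reusing the split_line() helper and the file names from the question:

def get_word_set(filename):
    # collect every word from the file, upper-cased so both files compare the same way
    with open(filename) as f:
        return {word.upper() for line in f for word in split_line(line)}

dictionary_words = get_word_set("dictionary.txt")
sample_words = get_word_set("AliceInWonderland200.txt")
for word in sorted(sample_words - dictionary_words):
    print(word)  # words not found in the dictionary, i.e. possible spelling errors

The set difference does each membership check in roughly constant time, which is what makes it so much faster than scanning the whole dictionary list for every word.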