Building a Markov model from a text file? - python

I have an assignment to build a program that, based on an input file, reads text and then generates new text. The dictionary should map each n-character string of letters to a list of letters that could follow that string, based on the text in the input file. So far, I have:
def create_dic():
    n = order_entry.get()
    inputfile = file_entry.get()  # name of input file
    lines = open(inputfile, 'r').read()  # reads input file into a string
    model = {}  # empty dictionary to build the Markov model
For every n-character sequence in the input, I have to "look it up in the dictionary to get a list of possible succeeding characters and get the next character." I'm confused by the instruction to look up the string in the dictionary when the dictionary is empty to begin with. Won't there be nothing in the dictionary?

Since this is an assignment, I will give you leading questions rather than an answer. As @Quirliom said, "Populate the dictionary."
When you want to use the Markov model, what key would you like to search the dictionary for?
When you search for that key, what would you like to get back?
The sentence, "The dictionary should map each n-character string of letters to a list of letters that could follow that string, based on the text in the input file," has the answers to those questions. This means that you will have to do some work on the input file to figure out how to extract the dictionary keys and what they should map to.

This is definitely not the best approach, but you can start with it.
On a letter basis: find which letter appears in first position most often across the entire data. The first characters of words are countable entities, so it is rational to check which character occurs most often and start your generated text with it. Then look at which letter most often succeeds it, and so on. Also take the average word length and distribute the generated words around that length.
For better results, work on an n-gram basis: find which n-gram is most likely to precede others (you can extend this to sentences as well).
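A minimal sketch of that n-gram idea (all names are illustrative, and the sampling scheme is just one possible choice):

import random

def build_model(text, n):
    # Map each n-character window to the list of characters seen right after it.
    model = {}
    for i in range(len(text) - n):
        model.setdefault(text[i:i+n], []).append(text[i+n])
    return model

def generate(model, n, length):
    # Start from a random key, then repeatedly sample a recorded successor.
    out = random.choice(list(model))
    while len(out) < length:
        successors = model.get(out[-n:])
        if not successors:  # dead end: this window only appeared at the very end
            break
        out += random.choice(successors)
    return out

model = build_model("the cat sat on the mat", 2)
print(generate(model, 2, 20))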


Print something when a word is in a word list

So I am currently trying to build a Caesar decrypter that automatically tries all the possibilities and compares them to a big list of words to see whether each result is a real word, so some sort of dictionary attack, I guess.
I found a list with a lot of German words, split so that each word is on its own line. Currently I am struggling to compare my candidate sentence with the whole word list, so that when the program sees that a word in my sentence is also in the word list, it prints out that this is a real word and possibly the right sentence.
This is how far I currently am; I have not included the code that tries all 26 shifts, only my way of looking through the word list and comparing it to a sentence. Maybe someone can tell me what I am doing wrong and why it doesn't work.
I have no idea why it doesn't work. I have also tried it with regular expressions, but nothing works. The list is really long (166k words).
There is a \n at the end of each word in the list you created from the file, so the entries will never be equal to the words they are compared against.
Remove the newline character before appending, for example with wordlist.append(line.rstrip()).
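A minimal sketch, assuming the word list lives in a file called wordlist.txt (file and variable names are illustrative):

# Load the word list, stripping the trailing newline from each entry.
with open('wordlist.txt', 'r') as f:
    words = set(line.rstrip() for line in f)  # a set makes each lookup O(1)

sentence = 'das ist ein echtes wort'
for word in sentence.split():
    if word in words:
        print(word, 'is a real word, possibly the right sentence')

Using a set rather than a list also matters here: with 166k entries, membership tests against a list would scan the whole list every time.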

How to iterate over the file and find the closest match between words [updated]?

I am trying to find the closest match for misspelled words in my list of correct words (like a look-up table). I have code that uses Levenshtein similarity (source: Wikipedia) to compare one word against a look-up list and select the best match (also by defining the cost).
My word list, correctList.txt, looks like:
words = ['computer','test','right','tesla','omega','energy']
Based on the two inputs required by the Levenshtein similarity, I provide:
userInput = 'compute'
limitSearch = int('3')
output = check(userInput, limitSearch)
for result in output:
    print('\n closeMatches: ', result)
Now I want to expand this: instead of checking one misspelled word against the look-up dictionary, I want to use a list of misspelled words (like the following file), compare it with my correctList.txt, and substitute the best matches.
Example of my misspelled.txt:
misspelled = ['computee','teste','righ','tessla','oomega','energie']
It would be great if you could help.
I literally just made something like this about an hour ago. I used a module called leven to do it. I made it in repl.it because I was on a Chromebook. Here is a link to it if you want to look at it.
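Since the linked project isn't reproduced here, a hedged sketch of the batch lookup, using the standard library's difflib.get_close_matches in place of leven (the two lists are taken from the question):

from difflib import get_close_matches

words = ['computer', 'test', 'right', 'tesla', 'omega', 'energy']
misspelled = ['computee', 'teste', 'righ', 'tessla', 'oomega', 'energie']

corrected = []
for w in misspelled:
    # n=1 keeps only the single best match; cutoff filters weak matches
    matches = get_close_matches(w, words, n=1, cutoff=0.6)
    corrected.append(matches[0] if matches else w)

print(corrected)  # ['computer', 'test', 'right', 'tesla', 'omega', 'energy']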

Creating a dictionary in Python and using it to translate a word

I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from nltk.corpus import swadesh
import my_books

es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)

for sentence in my_books.sents("book_1"):
    for word in my_books.words("book_1"):
        if word in es2en:
            print(translation, end=" ")
        else:
            print("unknown_word", end=" ")
    print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of unknown_word. I think I'm probably using translation in the wrong way... how could I achieve my desired result?
The .entries() method, when given more than one language, returns not a dictionary but a list of tuples. See here for an example.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable and check for if word in es2en:
You need to check whether the word is in translation and, if so, look up its actual translation, instead of printing the entire dictionary.
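A standalone sketch of the corrected lookup (the two-entry dict is an illustrative stand-in for the full Swadesh mapping built in the question):

# Stand-in for: translation = dict(swadesh.entries(['es', 'en']))
translation = {'perro': 'dog', 'gato': 'cat'}

for word in "el perro y el gato".split():
    if word in translation:
        print(translation[word], end=" ")  # look up the word, not the whole dict
    else:
        print("unknown_word", end=" ")
print("")
# unknown_word dog unknown_word unknown_word cat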
It can be a case-sensitivity issue.
For example, if a dict contains the key 'Bomb' and you look for 'bomb', it won't be found.
Lowercase all the keys in es2en and then look for word.lower() in es2en.
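For instance (the pairs are illustrative):

es2en = [('Perro', 'dog'), ('Gato', 'cat')]  # illustrative pairs
translation = {es.lower(): en for es, en in es2en}
print(translation.get('Perro'.lower(), 'unknown_word'))  # dog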
I am in the process of building a translation machine (language dictionary) from Bahasa Indonesia to English and vice versa.
I am building it from zero: what I'm doing is collecting all the words in Bahasa and their meanings, then comparing them with a WordNet database (crawled).
After you have a group of meanings and have paired the English meanings with the Bahasa ones, collect as much data as you can and separate it into scientific content and everyday content.
Tokenize all the data into sentences and calculate which word pairs with which other word with the highest probability (in both Bahasa and English). This is needed because every word can have several meanings, and this calculation is used to choose which one to use.
Example in Bahasa:
'bisa' can mean poison, and in that sense it pairs with high probability with snake or bite;
'bisa' can also mean to be able to do something, and then it pairs with high probability with verbs or expressions of willingness to do something.
So if the tokenized result pairs with snake or bite, you search for the analogous meaning in English by checking snake and poison there; searching an English database, you will find that venom always pairs with snake (and has a meaning similar to toxin/poison).
Another grouping can be done by word type (noun, verb, adjective, etc.):
bisa == poison (noun)
bisa == can (verb)
That's it. Once you have the calculation, you no longer need the database; you only need the word-matching data.
You can do the calculation from online data (e.g. Wikipedia), a downloaded dump, a Bible or book file, or any other source that contains lots of sentences.
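A toy sketch of the pairing-count idea (the sentences and names are illustrative, not real corpus data):

from collections import Counter
from itertools import combinations

sentences = [
    "ular itu menggigit dengan bisa",  # 'bisa' in its poison/venom sense
    "saya bisa berenang",              # 'bisa' in its 'can' sense
]

# Count how often each pair of words appears in the same sentence.
cooccur = Counter()
for s in sentences:
    for a, b in combinations(sorted(set(s.split())), 2):
        cooccur[(a, b)] += 1

# A word's most frequent neighbours hint at which sense is in play.
print([pair for pair in cooccur if 'bisa' in pair])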

Change two characters into one symbol (Python)

I'm currently working on a file-compression task for school, and I find myself unable to understand what's happening in this code (more specifically, what ISN'T happening and why it is not happening).
In this section of the code, what I'm aiming to do is, in non-coding terms, change two adjacent letters which are the same into one symbol, therefore taking up less memory:
for i, word in enumerate(file_contents):
    # file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    for ind, letter in enumerate(word_contents[:-1]):
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
However, when I run the full code with a sample text file, it seemingly doesn't do what I told it to do. For instance, the word 'Sally' should become 'Sa★y' but instead stays the same.
Could anyone help me get on the right track?
EDIT: I missed out a pretty key detail. I want the compressed string to appear back in the original file_contents list where there are double letters, as the purpose of the full compression algorithm is to return a compressed version of the text in an inputted file.
I would suggest using a regex that matches identical adjacent characters.
Example:
import re
txt = 'sally and bobby'
print(re.sub(r"(.)\1", '*', txt))
# sa*y and bo*y
The loop and condition checking in your code are not required. Use the line below instead, applied to the word string itself (note that your word_contents is a list after split(), so the substitution has to run on the word before splitting):
file_contents[i] = re.sub(r"(.)\1", '*', file_contents[i])
There are a few things wrong with your code (I think).
1) split() produces a list, not a str, so when you write enumerate(word_contents[:-1]) it looks like you're assuming that gets you a string, when it actually gets you a list of words.
But then:
2) with these lines:
if word_contents[ind] == word_contents[ind+1]:
    word_contents[ind] = ''
    word_contents[ind+1] = '★'
you're operating on your list again, where it looks pretty clear that you want to be operating on the string, or on a list of characters of the word you're processing. At best this function will do nothing, and at worst you're corrupting the word-contents list.
So when you perform your modifications, you are modifying the word_contents list and not the list item [:-1] you are actually looping over. There are more issues, but I think that answers your question (I hope).
If you really want to understand what you're doing wrong, I recommend putting in print statements along the way. If you're looking for someone to do your homework for you, another answer already gives you one, I guess.
Here is an example of how you could add logging to the function:
for i, word in enumerate(file_contents):
    # file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    # See what the word-contents list actually is
    print(word_contents)
    # See what your slice is actually returning
    print(word_contents[:-1])
    # Unless something modifies your list elsewhere, you probably want to
    # iterate over the whole words list, not just a slice of it.
    for ind, letter in enumerate(word_contents[:-1]):
        # See what your test is testing
        print(word_contents[ind], word_contents[ind+1])
        # Here you probably actually want word_contents[:-1][ind],
        # which is the list item you iterate over (the actual string, I suspect)
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
UPDATE: based on the follow-up questions from the OP, I've made a sample program annotated with descriptions. Note this isn't an optimal solution, but mainly an exercise in teaching flow control and the use of basic structures.
# define the initial data...
file = "sally was a quick brown fox and jumped over the lazy dog which we'll call billy"
file_contents = file.split()

# enumerate isn't needed here unless you intend to use the index later (as we do below)
for list_index, word in enumerate(file_contents):
    # Changing something you iterate over is dangerous and sometimes confusing;
    # in your case you iterated over word_contents and then modified it. If you
    # take out two characters, you change the indices and the size of the
    # structure, making later changes potentially invalid. So we'll create a
    # new data structure to dump the results in.
    compressed_word = []
    # since we have a list of strings, iterate over each string (word) individually
    for character in word:
        # if the intermediate structure is still empty, there are no duplicate chars yet
        if compressed_word:
            # if there are chars in the new structure, test whether we hit the same character twice
            if character == compressed_word[-1]:
                # looks like we did; replace it with your star
                compressed_word[-1] = "*"
                # continue skips the rest of this iteration of the loop
                continue
        # if we haven't seen the character before, or it is the first character, just add it to the list
        compressed_word.append(character)
    # this is one reason you may want enumerate: to update the list with the new item
    # join() just converts the list back into a string
    file_contents[list_index] = "".join(compressed_word)

# prints the new version of the original "file" string
print(" ".join(file_contents))
outputs: "sa*y was a quick brown fox and jumped over the lazy dog which we'* ca* bi*y"

Finding a substring's position in a larger string

I have a large string and a large number of smaller substrings, and I am trying to check whether each substring exists in the larger string and to get the position of each one.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
if each_sub_string in string:
print each_sub_string, string.index(each_sub_string)
The problem is, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, maybe by using regular expressions or some other way?
The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
    'c': ['a', {'a': ['t']}],
    's': ['at'],
}
You can build a tree where the top-level entries are the starting letters, and they map to the list of potential final substrings that could be completed from them. If you hit a list element with nothing more nested beneath it, you've reached a leaf and you know you've found the first instance of that substring.
Holding that tree in memory is a little hefty, but if you've only got a million strings this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of each word.
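A minimal sketch of that tree idea using nested dicts (a plain trie scan without Aho-Corasick's failure links, so it re-walks from each position; a '$' key marks a completed substring):

def build_trie(substrings):
    # Build a character trie; the special '$' key marks a complete substring.
    root = {}
    for s in substrings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node['$'] = s  # leaf marker storing the completed substring
    return root

def first_matches(text, trie):
    # Walk the trie from each position, recording first occurrences only.
    found = {}
    for i in range(len(text)):
        node, j = trie, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '$' in node and node['$'] not in found:
                found[node['$']] = i
    return found

print(first_matches('The cat sat on the mat, it was great', build_trie(['cat', 'sat', 'ca'])))
# {'ca': 4, 'cat': 4, 'sat': 8}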
For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
If you're interested in learning more about these approaches there are three main algorithms used in practice:
Aho-Corasick (Basis of fgrep) [Worst case: O(m+n)]
Commentz-Walter (Basis of vanilla GNU grep) [Worst case: O(mn)]
Rabin-Karp (Used for plagiarism detection) [Worst case: O(mn)]
There are domains in which each of these algorithms will outperform the others, but given that you have a very high number of sub-strings with likely a lot of overlap between them, I would bet that Aho-Corasick will give you significantly better performance than the other two methods, as it avoids the O(mn) worst-case scenario.
There is also a great Python library that implements the Aho-Corasick algorithm, found here, that should allow you to avoid writing the gross implementation details yourself.
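Assuming the pyahocorasick package is the library meant (its API as sketched here is an assumption, not confirmed by the answer), usage looks roughly like this:

import ahocorasick  # pip install pyahocorasick

string = "some large text here"
sub_strings = ["some", "text"]

A = ahocorasick.Automaton()
for s in sub_strings:
    A.add_word(s, s)  # store the substring itself as the payload
A.make_automaton()

# iter() yields (end_index, payload) for every match in a single pass.
for end, s in A.iter(string):
    print(s, end - len(s) + 1)  # convert end index to start index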
Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.
Say the set of lengths of your substrings is {23, 33, 45} (meaning that you might have millions of substrings, but each one has one of these three lengths).
Then, for each of these lengths, run a Rabin window over your large string and place the results into a dictionary for that length. Take 23, for instance: go over the large string and compute the 23-window hashes. Say the hash at position 0 is 13; insert into the dictionary rabin23 that 13 maps to [0]. Then you see that at position 1 the hash is 13 as well; update rabin23 so that 13 maps to [0, 1]. At position 2 the hash is 4, so in rabin23, 4 maps to [2].
Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).
BTW, in many cases the lengths of your substrings will exhibit Pareto behavior, where say 90% of the strings fall into 10% of the lengths. If so, you can do this for those lengths only.
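A simplified sketch of the per-length window index; Python's dict already hashes the window strings for us, so the explicit rolling Rabin hash is elided (for long windows you would substitute a true rolling hash so that each slide costs O(1)):

from collections import defaultdict

def index_windows(text, length):
    # Map each window of the given length to every position where it occurs.
    windows = defaultdict(list)
    for i in range(len(text) - length + 1):
        windows[text[i:i+length]].append(i)
    return windows

string = "some large text here"
sub_strings = ["some", "text", "here"]

# Group the substrings by length and index the large string once per length.
by_len = defaultdict(list)
for s in sub_strings:
    by_len[len(s)].append(s)

for length, subs in by_len.items():
    windows = index_windows(string, length)
    for s in subs:
        print(s, windows.get(s, []))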
This approach is sub-optimal compared to the other answers, but it might be good enough regardless, and it is simple to implement. The idea is to turn the algorithm around: instead of testing each sub-string in turn against the larger string, iterate over the large string and test against possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.
The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.
Create a dictionary mapping each possible 1-3 starting characters to a list of the sub-strings that begin with them. Then iterate over the string and, at each character, read the 1-3 characters after it and check for a match at that position for each sub-string in the dictionary that begins with those 1-3 characters:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based the first 1-3 characters
dict = {}
for s in sub_strings:
if s[0:3] in dict:
dict[s[0:3]].append(s)
else:
dict[s[0:3]] = [s];
# iterate over the chars in string, testing words that match on first 1-3 chars
for i in range(0, len(string)):
for j in range(1,4):
char = string[i:i+j]
if char in dict:
for word in dict[char]:
if string[i:i+len(word)] == word:
print word, i
If you don't need to match any sub-strings 1 or 2 characters long, then you can get rid of the for j loop and just assign char with char = string[i:i+3].
Using this second approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:
with open ("warandpeace.txt", "r") as textfile:
string=textfile.read().replace('\n', '')
sub_strings=list(set(string.split()))
Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.
