Itertools compress

Itertools compress - python

I am trying to extract all the words that are possible within a string as part of vocabulary game. Consider the string "driver". I would like to find all the English words that can be formed by using the available letters from left to right.
From “driver” we could extract drive, dive, river & die.
But we could not extract “rid” because is not all the letter appears in order from left to right.
For now I would be content of extracting all the letter combination disregarding whether or not it is a word.
I was considering using a loop to extract binary pattern
1=“r”
10=“e”
11=“re”
100=“v”
101=“vr”
110=“ve”
111=“ver”
1000=“i”
1001=”ir”
1010=”ie”
1011=”ier”
1100=”iv”
1101=”ivr”
1110=”ive”
1111=”iver”
…
111110=”drive”
Please help!
Thank-you

Simple maths suggests that the approach you have is the best possible approach there is.
Since index i can either be present or absent, hence the number of combinations will be 2^n (since we are not shuffling).

Related

Creating a dictionary in Python and using it to translate a word

I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from corpus.nltk import swadesh
import my_books
es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)
for sentence in my_books.sents("book_1"):
for word in my_books.words("book_1"):
if word in es2en:
print(translation, end= " ")
else:
print("unknown_word", end= " ")
print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of unknown word. I think I'm probably using translation in the wrong way... how could I achieve my desired result?

The .entries() method, when given more than one language, returns not a dictionary but a list of tuples. See here for an example.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable, and check for if word in es2en:
You need to check if the word is in translation, and subsequently look up the correct translation, instead of printing the entire dictionary.

It can be a 'Case Sensitivity' issue.
For Example:
If a dict contain a key 'Bomb' and you will look for 'bomb',
it won't be found.
Lower all the keys at es2en and then look for:word.lower() in es2en

i am in progress build a translate machine (language dictionary).
it's in bahasa (indonesia) to english and vice versa.
I build it from zero, what i'm doing is collecting all words in bahasa, and the means of the words.
then compare it with wordnet database (crawl it).
after have a group of meaning and already pairing / grouping the meaning in english with the bahasa, do this, collecting ad many as data, separate it, scienting content and daily content.
tokenize all data in to sentence, make a calculation which word is more high probabilty pairing with other word (both in bahasa and english), this is needed because every words could have several means. this calculation use to choose which word you will use.
example in bahasa:
'bisa', could means poison in bahasa and high probability pair with snake or bite
'bisa', could means can do something in bahasa, high probabilty pairing with verbs words or expression of willing to do something (verbs)
so if the tokenize result pairing with snake or bite, you search the similar meaning in answer by checking snake and poison in english. and search in english database, and you will found venom always pair with snake(have similar means with toxin / poison).
another group can do by words type (nouns, verbs, adjective, etc).
bisa == poison (noun)
bisa == can (verbs).
that's it. after have the calculation, you don't need the data base, you only need word matching data.
so the calcultaion that you can do by checking online data (ex: wikipedia) or download it or use bible/book file or any other database that contains lots of sentence.

Tokenizing a concatenated string

I got a set of strings that contain concatenated words like the followings:
longstring (two English words)
googlecloud (a name and an English word)
When I type these terms into Google, it recognizes the words with "did you mean?" ("long string", "google cloud"). I need similar functionality in my application.
I looked into the options provided by Python and ElasticSearch. All the tokenizing examples I found are based on whitespace, upper case, special characters etc.
What are my options provided the strings are in English (but they may contain names)? It doesn't have to be on a specific technology.
Can I get this done with Google BigQuery?

Can you also roll your own implementation? I am thinking of an algorithm like this:
Get a dictionary with all words you want to distinguish
Build a data structure that allows quick lookup (I am thinking of a trie)
Try to find the first word (starting with one character and increasing it until a word is found); if found, use the remaining string and do the same until nothing is left. If it doesn't find anything, backtrack and extend the previous word.
Should be ok-ish if the string can be split, but will try all possibilities if its gibberish. Of course, it depends on how big your dictionary is going to be. But this was just a quick thought, maybe it helps.

If you do choose to solve this with BigQuery, then the following is a candidate solution:
Load list of all possible English words into a table called words. For example, https://github.com/dwyl/english-words has list of ~350,000 words. There are other datasets (i.e. WordNet) freely available in Internet too.
Using Standard SQL, run the following query over list of candidates:
SELECT first, second FROM (
SELECT word AS first, SUBSTR(candidate, LENGTH(word) + 1) AS second
FROM dataset.words
CROSS JOIN (
SELECT candidate
FROM UNNEST(["longstring", "googlecloud", "helloxiuhiewuh"]) candidate)
WHERE STARTS_WITH(candidate, word))
WHERE second IN (SELECT word FROM dataset.words)
For this example it produces:
Row first second
1 long string
2 google cloud
Even very big list of English words would be only couple of MBs, so the cost of this query is minimal. First 1 TB scan is free - which is good enough for about 500,000 scans on 2 MB table. After that each additional scan is 0.001 cents.

NLTK, reading in word numbers to float numbers

I've looked at the corpus section of NLTK, but there doesn't seem to be a numbers corpus. I want to change word numbers into text. For example:
input: one thousand two hundred forty three output: 1243
input: second output: 2
input: five percent output: 0.05

There isn't. What you need to do is build off this Is there a way to convert number words to Integers? or someone else you find useful/easier to work with.
To start off you'll need regex to extract those strings of interest (i.e. one, two...) then replace using the code above.
The first example you've given will be the easiest of the three, the last example is just divide that number by 100 since the output is actually an integer. The second one will be a little tricky as you'll have to modify the code or possibly create a whole new function.
AFAIK, there is no module that will parse the whole text for that.
Another possibility, as I looked further into this, is to use CD tagging from Tree Parser to help identify numbers. But you'll still need a function similar to the one mentioned above.

Finding a substring's position in a larger string

I have a large string and a large number of smaller substrings and I am trying to check if each substring exists in the larger string and get the position of each of these substrings.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
if each_sub_string in string:
print each_sub_string, string.index(each_sub_string)
The problem is, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, maybe by using regular expressions or some other way?

The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
'c':
['a', {
'a': ['t']
}], {
's':
['at']
}
}
Apologies for the gross formatting, but I think it's helpful to map back to a python data structure directly. You can build a tree where the top level entries are the starting letters, and they map to the list of potential final substrings that could be completed. If you hit something that is a list element and has nothing more nested beneath you've hit a leaf and you know that you've hit the first instance of that substring.
Holding that tree in memory is a little hefty, but if you've only got a million string this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of words.
For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
If you're interested in learning more about these approaches there are three main algorithms used in practice:
Aho-Corasick (Basis of fgrep) [Worst case: O(m+n)]
Commentz-Walter (Basis of vanilla GNU grep) [Worst case: O(mn)]
Rabin-Karp (Used for plagiarism detection) [Worst case: O(mn)]
There are domains in which all of these algorithms will outperform the others, but based on the fact that you've got a very high number of sub-strings that you're searching and there's likely a lot of overlap between them I would bet that Aho-Corasick is going to give you significantly better performance than the other two methods as it avoid the O(mn) worst-case scenario
There is also a great python library that implements the Aho-Corasick algorithm found here that should allow you to avoid writing the gross implementation details yourself.

Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.
Say the set of the lengths of your substrings form the set {23, 33, 45} (meaning that you might have millions of substrings, but each one takes one of these three lengths).
Then, for each of these lengths, find the Rabin Window over your large string, and place the results into a dictionary for that length. That is, let's take 23. Go over the large string, and find the 23-window hashes. Say the hash for position 0 is 13. So you insert into the dictionary rabin23 that 13 is mapped to [0]. Then you see that for position 1, the hash is 13 as well. Then in rabin23, update that 13 is mapped to [0, 1]. Then in position 2, the hash is 4. So in rabin23, 4 is mapped to [2].
Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).
BTW, in many cases, then lengths of your substrings will exhibit a Pareto behavior, where say 90% of the strings are in 10% of the lengths. If so, you can do this for these lengths only.

This is approach is sub-optimal compared to the other answers, but might be good enough regardless, and is simple to implement. The idea is to turn the algorithm around so that instead of testing each sub-string in turn against the larger string, iterate over the large string and test against possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.
The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.
Create a dictionary containing a list of sub-strings beginning each possible 1-3 characters. Then iterate over the string and at each character read the 1-3 characters after it and check for a match at that position for each sub-string in the dictionary that begins with those 1-3 characters:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based the first 1-3 characters
dict = {}
for s in sub_strings:
if s[0:3] in dict:
dict[s[0:3]].append(s)
else:
dict[s[0:3]] = [s];
# iterate over the chars in string, testing words that match on first 1-3 chars
for i in range(0, len(string)):
for j in range(1,4):
char = string[i:i+j]
if char in dict:
for word in dict[char]:
if string[i:i+len(word)] == word:
print word, i
If you don't need to match any sub-strings 1 or 2 characters long then you can get rid of the for j loop and just assign char with char = string[i:3]
Using this second approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:
with open ("warandpeace.txt", "r") as textfile:
string=textfile.read().replace('\n', '')
sub_strings=list(set(string.split()))
Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.

Which algorithm would fit best to solve a word-search game like "Boggle" with Python

I'm coding a game similar to Boggle where the gamer should find words inside a big string made of random letters.
For example, there are five arrays with strings inside like this. Five rows, made of six letters each one :
AMSDNS
MASDOM
ASDAAS
DSMMMS
OAKSDO
So, the users of the game should make words using the letters available with the following restrictions and rules in mind:
Its not possible to repeat the same letter to make a word. Im talking about the "physical" letter, in the game that is a dice. Its not possible to use the same dice twice or more to make the word.
Its not possible to "jump" any letter to make a word. The letters that make the word must be contiguous.
The user is able to move in any direction she wants without any restriction further than the two mentioned above. So its possible to go to the top, then bottom, then to the right, then top again, and so on. So the movements to look for words might be somehow erratic.
I want to know how to go through all the strings to make words. To know the words Im gonna use a txt file with words.
I don't know how to design an algorithm that is able to perform the search, specially thinking about the erratic movements that are needed to find the words and respecting the restrictions, too.
I already implemented the UX, the logic to throw the dice and fill the boardgame, and all the logic for the six-letters dice.
But this part its not easy, and I would like to read your suggestions to this interesting challenge.
Im using Python for this game because is the language I use to code and the language that I like the most. But an explanation or suggestion of an algorithm itself, should be nice too, independently of the language.

The basic algorithm is simple.
For each tile, do the following.
Start with an empty candidate word, then visit the current tile.
Visit a tile by following these steps.
Add the tile's position's letter to the candidate word.
Is the candidate word a known word? If so, add it to the found word list.
Is the candidate word a prefix to any known word?
If so, for each adjacent tile that has not been visited to form the candidate word, visit it (i.e., recurse).
If not, backtrack (stop considering new tiles for this candidate word).
To make things run smoothly when asking the question "is this word a prefix of any word in my dictionary", consider representing your dictionary as a trie. Tries offer fast lookup times for both words and prefixes.

You might find a Trie useful - put all dictionary words into a Trie, then make another Trie from the Boggle grid, only as long you're matching the dictionary Trie.
I.e. Dictionary trie:
S->T->A->C->K = stack
\->R->K = stark
\->T = start
Grid: (simplified)
STARKX
XXXTXX
XXXXXX
Grid trie: (only shown starting at S - also start at A for ART, etc)
S->X (no matches in dict Trie, so it stops)
\->T->X
\->A-R->K (match)
| |->T (match)
| \->X
\->C->K (match)
\->X
You could visualise your Tries with GraphViz like this.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.