Tokenizing a concatenated string - python

I have a set of strings that contain concatenated words, like the following:
longstring (two English words)
googlecloud (a name and an English word)
When I type these terms into Google, it recognizes the words with "did you mean?" ("long string", "google cloud"). I need similar functionality in my application.
I looked into the options provided by Python and Elasticsearch. All the tokenizing examples I found are based on whitespace, uppercase letters, special characters, etc.
What are my options, given that the strings are in English (but may contain names)? The solution doesn't have to be tied to a specific technology.
Can I get this done with Google BigQuery?

Can you also roll your own implementation? I am thinking of an algorithm like this:
Get a dictionary with all words you want to distinguish
Build a data structure that allows quick lookup (I am thinking of a trie)
Try to find the first word (starting with one character and increasing it until a word is found); if found, use the remaining string and do the same until nothing is left. If it doesn't find anything, backtrack and extend the previous word.
This should be OK-ish if the string can be split, but it will try all possibilities if it's gibberish. Of course, it depends on how big your dictionary is going to be. This was just a quick thought, but maybe it helps.
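A minimal sketch of that idea in Python, using a plain set for the dictionary lookups instead of a trie (the words.txt file name is an assumption; any one-word-per-line English word list would do):

# Minimal sketch of the backtracking split described above.
# Assumes words.txt holds one lowercase English word per line.
def load_words(path="words.txt"):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def split_words(s, words):
    """Return a list of dictionary words covering s, or None if s cannot be split."""
    if not s:
        return []
    # Try prefixes from shortest to longest; backtrack if the rest cannot be split.
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in words:
            rest = split_words(s[i:], words)
            if rest is not None:
                return [prefix] + rest
    return None

words = load_words()
print(split_words("longstring", words))   # ['long', 'string'] with a typical word list
print(split_words("googlecloud", words))  # ['google', 'cloud'] only if 'google' is in the list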

If you do choose to solve this with BigQuery, then the following is a candidate solution:
Load a list of all possible English words into a table called words. For example, https://github.com/dwyl/english-words has a list of ~350,000 words. Other datasets (e.g. WordNet) are freely available on the Internet too.
Using Standard SQL, run the following query over the list of candidates:
SELECT first, second FROM (
  SELECT word AS first, SUBSTR(candidate, LENGTH(word) + 1) AS second
  FROM dataset.words
  CROSS JOIN (
    SELECT candidate
    FROM UNNEST(["longstring", "googlecloud", "helloxiuhiewuh"]) candidate)
  WHERE STARTS_WITH(candidate, word))
WHERE second IN (SELECT word FROM dataset.words)
For this example it produces:
Row first second
1 long string
2 google cloud
Even a very big list of English words would only be a couple of MB, so the cost of this query is minimal. The first 1 TB scanned is free, which is good enough for about 500,000 scans of a 2 MB table. After that, each additional scan costs about 0.001 cents.

Related

Itertools compress

I am trying to extract all the words that are possible within a string, as part of a vocabulary game. Consider the string "driver". I would like to find all the English words that can be formed by using the available letters from left to right.
From "driver" we could extract drive, dive, river and die.
But we could not extract "rid", because its letters do not all appear in order from left to right.
For now I would be content with extracting all the letter combinations, regardless of whether or not they are words.
I was considering using a loop to extract binary patterns:
1 = "r"
10 = "e"
11 = "er"
100 = "v"
101 = "vr"
110 = "ve"
111 = "ver"
1000 = "i"
1001 = "ir"
1010 = "ie"
1011 = "ier"
1100 = "iv"
1101 = "ivr"
1110 = "ive"
1111 = "iver"
…
111110 = "drive"
Please help!
Thank-you
Simple maths suggests that the approach you have is the best possible one.
Since each index can either be present or absent, the number of combinations is 2^n (we are not shuffling the letters).
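Since the related question is titled "Itertools compress", here is a short sketch of that 2^n enumeration using itertools.compress, where each 0/1 selector picks one left-to-right subsequence (filtering against an English word list is left as a comment, since the question doesn't require it yet):

from itertools import compress, product

word = "driver"

# Each selector in {0,1}^n picks a subsequence of the letters, preserving order.
subsequences = {
    "".join(compress(word, mask))
    for mask in product([0, 1], repeat=len(word))
    if any(mask)  # skip the empty selection
}

print(len(subsequences))                   # at most 2**len(word) - 1, fewer after deduplication
print(sorted(subsequences, key=len)[-5:])  # a few of the longest subsequences

# To keep only real words, intersect with a word set:
# real_words = subsequences & english_words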

Creating a dictionary in Python and using it to translate a word

I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from nltk.corpus import swadesh
import my_books

es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)

for sentence in my_books.sents("book_1"):
    for word in my_books.words("book_1"):
        if word in es2en:
            print(translation, end=" ")
        else:
            print("unknown_word", end=" ")
    print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of "unknown_word". I think I'm probably using translation in the wrong way... how could I achieve my desired result?
The .entries() method, when given more than one language, returns not a dictionary but a list of tuples.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable, and check for if word in es2en:
You need to check if the word is in translation, and subsequently look up the correct translation, instead of printing the entire dictionary.
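Putting that together, a corrected version of the loop might look like this (untested, and it assumes my_books.sents() yields each sentence as a list of word tokens, the way the NLTK corpus readers do):

from nltk.corpus import swadesh
import my_books  # the asker's own corpus module, assumed to behave like an NLTK corpus reader

translation = dict(swadesh.entries(['es', 'en']))  # Spanish -> English lookup table

for sentence in my_books.sents("book_1"):
    for word in sentence:
        # Look the word up in the dict and print its translation,
        # not the whole dictionary.
        print(translation.get(word, "unknown_word"), end=" ")
    print("")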
It can also be a case-sensitivity issue.
For example, if the dict contains the key 'Bomb' and you look up 'bomb', it won't be found.
Lowercase all the keys in es2en and then check: word.lower() in es2en
I am in the process of building a translation machine (language dictionary), from Bahasa Indonesia to English and vice versa.
I am building it from zero: I collect all the words in Bahasa along with their meanings, then compare them against a WordNet database (crawled).
Once you have the groups of meanings and have paired the English meanings with the Bahasa ones, do this: collect as much data as possible and separate it into scientific content and everyday content.
Tokenize all the data into sentences and calculate which words have a high probability of pairing with which other words (both in Bahasa and English). This is needed because every word can have several meanings, and the calculation is used to choose which translation to use.
Example in Bahasa:
'bisa' can mean poison, and in that sense it has a high probability of pairing with snake or bite.
'bisa' can also mean being able to do something, and in that sense it tends to pair with verbs or expressions of willingness to do something.
So if the tokenized result pairs with snake or bite, you look for the corresponding meaning in English by checking snake and poison in the English database, and you will find that venom always pairs with snake (and has a meaning similar to toxin/poison).
Another grouping can be done by word type (nouns, verbs, adjectives, etc.):
bisa == poison (noun)
bisa == can (verb)
That's it. Once you have the calculation, you no longer need the database; you only need the word-matching data.
You can do the calculation from online data (e.g. Wikipedia), a downloaded dump, a Bible/book file, or any other corpus that contains lots of sentences.
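A rough sketch of that co-occurrence idea, with entirely made-up counts just to illustrate choosing between the two senses of 'bisa':

from collections import Counter

# Hypothetical co-occurrence counts gathered from a corpus: for each candidate
# English translation of 'bisa', which Bahasa context words tend to appear near it.
sense_contexts = {
    "poison": Counter({"ular": 5, "gigit": 3}),       # snake, bite
    "can":    Counter({"melakukan": 4, "akan": 2}),   # do, will
}

def choose_translation(context_words):
    """Pick the sense whose known context words overlap the sentence the most."""
    scores = {
        sense: sum(counts[w] for w in context_words)
        for sense, counts in sense_contexts.items()
    }
    return max(scores, key=scores.get)

print(choose_translation(["ular", "itu", "gigit"]))  # -> "poison"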

Fast way to find a substring with some mismatches allowed

I am looking for help in making an efficient way to process some high throughput DNA sequencing data.
The data are in 5 files with a few hundred thousand sequences each, within which each sequence is formatted as follows:
#M01102:307:000000000-BCYH3:1:1102:19202:1786 1:N:0:TAGAGGCA+CTCTCTCT
TAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT
+
AAABBFAABBBFGGGGFGGGGGAG5GHHHCH54BEEEEA5GGHDHHHH5BAE5DF5GGCEB33AF3313GHHHE255D55D55D53#5#B5DBD5#E/#//>/1??/?/E#///FDF0B?CC??CAAA;--./;/BBE?;AFFA./;/;.;AEA//BFFFF/BB/////;/..:.9999.;
What I am doing at the moment is iterating over the lines, checking if the first and last letter is an allowed character for a DNA sequence (A/C/G/T or N), then doing a fuzzy search for the two primer sequences that flank the coding sequence fragment I am interested in. This last step is the part where things are going wrong...
When I search for exact matches, I get usable data in a reasonable time frame. However, I know I am missing out on a lot of data that is being skipped because of a single mismatch in the primer sequences. This happens because read quality degrades with length, so more unreadable bases ('N') crop up. These aren't a problem in my analysis otherwise, but they are a problem for a simple direct string search: N should be allowed to match anything from a DNA perspective, but not from a string-search perspective (I am less concerned about insertions or deletions). For this reason I am trying to implement some sort of fuzzy or more biologically informed search, but have yet to find an efficient way of doing it.
What I have now does work on test datasets, but is much too slow to be useful on a full size real dataset. The relevant fragment of the code is:
from Bio import pairwise2

Sequence = 'NNNNNTAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT'
fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
revprimer = 'TAGGACGGGGGGCGGAAA'

# Only process lines that look like DNA (A/C/G/T or N at both ends)
if Sequence.endswith(('N', 'A', 'C', 'G', 'T')) and Sequence.startswith(('N', 'A', 'C', 'G', 'T')):
    fwdalign = pairwise2.align.localxs(Sequence, fwdprimer, -1, -1, one_alignment_only=1)
    revalign = pairwise2.align.localxs(Sequence, revprimer, -1, -1, one_alignment_only=1)
    if fwdalign[0][2] > 45 and revalign[0][2] > 15:
        startIndex = fwdalign[0][3] + 45
        endIndex = revalign[0][3] + 3
        Sequence = Sequence[startIndex:endIndex]
        print(Sequence)
(obviously the first conditional is not needed in this example, but helps to filter out the other 3/4 of the lines that don't have DNA sequence and so don't need to be searched)
This approach uses the pairwise alignment method from biopython, which is designed for finding alignments of DNA sequences with mismatches allowed. That part it does well, but because it needs to do a sequence alignment for each sequence with both primers it takes way too long to be practical. All I need it to do is find the matching sequence, allowing for one or two mismatches. Is there another way of doing this that would serve my goals but be computationally more feasible? For comparison, the following code from a previous version works plenty fast with my full data sets:
if ('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG' in Line) and ('TAGGACGGGGGGCGGAAA' in Line):
    startIndex = Line.find('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG') + 45
    endIndex = Line.find('TAGGACGGGGGGCGGAAA') + 3
    Line = Line[startIndex:endIndex]
    print(Line)
This is not something I run frequently, so don't mind if it is a little inefficient, but don't want to have to leave it running for a whole day. I would like to get a result in seconds or minutes, not hours.
The tre library provides fast approximate matching functions. You can specify the maximum number of mismatched characters with maxerr as in the example below:
https://github.com/laurikari/tre/blob/master/python/example.py
There is also the regex module, which supports fuzzy searching options: https://pypi.org/project/regex/#additional-features
In addition, you can also use a simple regular expression to allow alternative characters at each position, as in:
import re

# Allow each expected base to also match an unreadable 'N'
pattern = re.compile('[TN][AN][AN][TN]')
if pattern.match('TANN'):
    print('found')
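For completeness, here is roughly how the regex module's fuzzy matching could be applied to the primers from the question; the budget of two substitutions and the exact slicing are assumptions, so adjust them to match the offsets used in the original code:

import regex  # third-party module (pip install regex), not the built-in re

fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
revprimer = 'TAGGACGGGGGGCGGAAA'

def trim_read(sequence, max_subs=2):
    """Return the fragment between the two primers, or None if either is not found."""
    # {s<=N} allows up to N substitutions, so an unreadable 'N' in the read
    # under a primer base costs one substitution instead of killing the match.
    fwd = regex.search('(?:%s){s<=%d}' % (fwdprimer, max_subs), sequence, regex.BESTMATCH)
    rev = regex.search('(?:%s){s<=%d}' % (revprimer, max_subs), sequence, regex.BESTMATCH)
    if fwd and rev:
        return sequence[fwd.end():rev.start()]
    return None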

algorithm for testing multiple substrings in multiple strings

I have several million strings, X, each with fewer than 20 or so words. I also have a list of several thousand candidate substrings, C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python, if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
If your x in X only contains words, and you only want to match whole words, you could do the following:
Insert your keywords into a set (which makes each lookup effectively constant time), and then check for every word in x whether it is contained in that set,
like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass  # do what you need to do
A good alternative would be to use Google's RE2 library, which uses some very nice automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language; that makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
You could try using a regex:
import re

# escape the candidates in case any of them contain regex metacharacters
subs = re.compile('|'.join(re.escape(c) for c in C))
for x in X:
    if subs.search(x):
        print('found')
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
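If you go the Aho-Corasick route, the third-party pyahocorasick package implements it; something like the following should work (untested here, so check the package docs for the exact API):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for substring in C:                # C: the several thousand candidate substrings
    automaton.add_word(substring, substring)
automaton.make_automaton()

for x in X:                        # X: the several million strings
    # iter() yields (end_index, stored_value) for every occurrence in x
    for end_index, substring in automaton.iter(x):
        print('found', substring, 'in', x)
        break                      # one hit is enough to know x contains a candidate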

Search many expressions in many documents using Python

I often have to search for many words (1000+) in many documents (a million+). I need the position of each matched word (if matched).
So a slow pseudo-version of the code is:
for text in documents:
    for word in words:
        position = search(word, text)
        if position:
            print(word, position)
Is there any fast Python module for doing this? Or should I implement something myself?
For fast exact-text, multi-keyword search, try acora - http://pypi.python.org/pypi/acora/1.4
If you want a few extras - result relevancy, near-matches, word-rooting etc, Whoosh might be better - http://pypi.python.org/pypi/Whoosh/1.4.1
I don't know how well either scales to millions of docs, but it wouldn't take long to find out!
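A quick sketch of the acora route, based on its documented builder/matcher API (untested here, so double-check against the project page):

from acora import AcoraBuilder  # pip install acora

builder = AcoraBuilder(*words)   # words: the 1000+ search terms
matcher = builder.build()

for text in documents:
    # finditer() yields (keyword, position) pairs for every occurrence
    for word, position in matcher.finditer(text):
        print(word, position)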
What's wrong with grep?
So you have to use python? How about:
import subprocess
# pass the command as an argument list (or use shell=True for a single string)
subprocess.Popen(['grep', '<pattern>', '<file>'])
which is insane. But hey! You are using python ;-)
Assuming documents is a list of strings, you can use text.index(word) to find the first occurrence and text.count(word) to find the total number of occurrences. Your pseudocode seems to assume words will only occur once, so text.count(word) may be unnecessary.
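If you stick with plain Python, str.find in a loop gives every position rather than just the first; it is still the brute-force approach from the pseudocode (one scan per word per document), just spelled out:

def positions(word, text):
    """Yield every index at which word occurs in text."""
    pos = text.find(word)
    while pos != -1:
        yield pos
        pos = text.find(word, pos + 1)

for text in documents:
    for word in words:
        for position in positions(word, text):
            print(word, position)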
