Extracting a section of a string in python with limitations

I have a string output that looks like this:
Distance AAAB: ,0.13634,0.13700,0.00080,0.00080,-0.00066,.00001,
Distance AAAC: ,0.12617,0.12680,0.00080,0.00080,-0.00063,,
Distance AAAD: ,0.17045,0.16990,0.00080,0.00080,0.00055,,
Distance AAAE: ,0.09330,0.09320,0.00080,0.00080,0.00010,,
Distance AAAF: ,0.21048,0.21100,0.00080,0.00080,-0.00052,,
Distance AAAG: ,0.02518,0.02540,0.00040,0.00040,-0.00022,,
Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,
Distance AAAI: ,0.10811,0.10860,0.00080,0.00070,-0.00049,,
Distance AAAJ: ,0.02430,0.02400,0.00200,0.00200,0.00030,,
Distance AAAK: ,0.09449,0.09400,0.00200,0.00100,0.00049,,
Distance AAAL: ,0.07689,0.07660,0.00050,0.00050,0.00029,
What I want to do is extract a specific set of data out of this block, for example only Distance AAAH like so:
Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,
The measurements will always begin with Distance AAA*: with the star being the only character that will change.
Complications:
This needs to be generic, because I have a lot of different data sets, so Distance AAAH might not always be followed by Distance AAAI or preceded by Distance AAAG, since the measurements for different items vary. I also can't rely on len(), because the last measurement can sometimes be blank (as it is with Distance AAAH) or filled (as with Distance AAAB). And I don't think I can use .find(), because I need all of the numbers following Distance AAAH.
I am still very new, and I tried my best to find a solution to a similar problem, but have not had much luck.

You can search your text with this script:
#fullText = YOUR STRING
text = fullText.splitlines()
for line in text:
    if line.startswith('Distance AAAH:'):
        print(line)
Output: Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,

You could use the re module, and wrapping it in a function is convenient:
import re

def SearchDistance(pattern, text):
    pattern = pattern.replace(' ', r'\s')
    print(re.findall(r'{0}.+'.format(pattern), text))

SearchDistance('Distance AAAH', fullText)
Output:
['Distance AAAH: ,0.11404,0.11450,0.00120,0.00110,-0.00046,,']
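If you want the numeric fields rather than the raw line, a small variation of the same idea works. This is only a sketch, not from either answer above; it assumes fullText holds the whole block (as in the first answer) and that the empty trailing fields should simply be dropped:
import re

def distance_values(label, text):
    # find the line for this label, then keep only the non-empty comma-separated fields
    match = re.search(r'{0}: ,(.+)'.format(re.escape(label)), text)
    if match is None:
        return None
    return [float(field) for field in match.group(1).split(',') if field]

print(distance_values('Distance AAAH', fullText))
# [0.11404, 0.1145, 0.0012, 0.0011, -0.00046]
re.escape just keeps the label safe in case it ever contains regex metacharacters.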

Related

Why the fuzzywuzzy Ratio() uses a slightly different implementation of Levenshtein Distance while calculating the ratio between two strings?

I am trying to wrap my head around how the fuzzywuzzy library calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein Distance algorithm looks for the minimum number of edits between the two strings, achieved through addition, deletion, and substitution of characters. Each of these operations counts as a single edit when calculating the score.
Here are a couple of examples:
Example 1
s1 = 'hello'
s2 = 'hell'
Levenshtein Score = 1 (it requires 1 edit, addition of 'o')
Example 2
s1 = 'hello'
s2 = 'hella'
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the Fuzzywuzzy formula (len(s1) + len(s2) - LevenshteinScore) / (len(s1) + len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now fuzzywuzzy does return the same score for Example 1, but not for Example 2: the score for Example 2 is 80%. On investigating how it calculates the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein Distance when it actually is not.
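A quick way to see the two behaviours side by side is to compare a plain dynamic-programming Levenshtein distance with difflib's SequenceMatcher ratio. This is only an illustrative sketch; the levenshtein helper below is hand-rolled, not taken from fuzzywuzzy:
import difflib

def levenshtein(s1, s2):
    # classic dynamic-programming edit distance, substitution counts as 1 edit
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (c1 != c2)))   # substitution
        previous = current
    return previous[-1]

for s1, s2 in [('hello', 'hell'), ('hello', 'hella')]:
    d = levenshtein(s1, s2)
    normalized = (len(s1) + len(s2) - d) / (len(s1) + len(s2))
    ratio = difflib.SequenceMatcher(None, s1, s2).ratio()
    print(f"{s1}/{s2}: distance={d}, formula={normalized:.0%}, difflib={ratio:.0%}")
# hello/hell: distance=1, formula=89%, difflib=89%
# hello/hella: distance=1, formula=90%, difflib=80%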
I am just trying to figure out why there is a distinction here and what it means: basically, the reason for counting a substitution as 2 operations rather than one, as defined in Levenshtein Distance, while still calling it Levenshtein Distance. Does it have something to do with gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
I would love it if somebody could give me some insight. Also, is there a better way to convert LD to a similarity score, or in general to measure the similarity between two strings? I am trying to measure the similarity between audio file transcriptions done by a human transcription service and by an Automatic Speech Recognition system.
Thank you!

Fast way find strings within hamming distance x of each other in a large array of random fixed length strings

I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A,T,G,C,N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the Hamming distance between every pair of strings, but this would take way too long.
My second approach used a masking method to create all possible variations of the strings, store them in a dictionary, and then check whether a variation was found more than once. This worked pretty fast (20 min) for a Hamming distance of 1, but it is very memory intensive and would not be viable for a Hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []
masks = {}
for sequence in sequences:
    for i in range(len(sequence)):
        try:
            masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
        except KeyError:
            masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]

matches = {}
for mask in masks:
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.
One approach is Locality Sensitive Hashing
First, you should note that this method does not necessarily return all the pairs; it returns all the pairs (or most of them) with high probability.
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^{24}, N << 5^24, and a maximum Hamming distance d, return the pairs with Hamming distance at most d.
The way you solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^{24}, where K is a parameter you'll have to experiment with. For every data point v, you define its hash as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0,1} denotes whether v is above or below plane P_i. You can prove (I'll omit the proof) that if the Hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the datapoints in the sequences, you only check data points in the bin of "close" hashes.
Note that this is very heuristic-based and will require you to experiment with K and with how "close" you want to search from each hash. As K increases, the number of bins grows exponentially, but the likelihood that points sharing a bin are truly similar also increases.
Judging by what you said, it looks like you have a gigantic dataset so I thought I would throw this for you to consider.
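As a concrete illustration (not the hyperplane construction above, but the closely related bit-sampling LSH family that works directly on Hamming distance): bucket every sequence by the characters at k randomly chosen positions, repeat with several independent tables, and only verify pairs that ever share a bucket. The k and tables values below are assumed parameters you would have to tune for your distance threshold:
import random
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sequences, k=8, tables=20, seed=0):
    # sequences: list of equal-length strings (24 characters of A/T/G/C/N here)
    rng = random.Random(seed)
    length = len(sequences[0])
    candidates = set()
    for _ in range(tables):
        positions = rng.sample(range(length), k)
        buckets = defaultdict(list)
        for index, seq in enumerate(sequences):
            # two sequences at Hamming distance d agree on all k sampled
            # positions with probability roughly ((length - d) / length) ** k
            buckets[tuple(seq[p] for p in positions)].append(index)
        for bucket in buckets.values():
            candidates.update(combinations(bucket, 2))
    return candidates  # verify each candidate's exact Hamming distance afterwards
Pairs that never share a bucket in any table are missed, which is the "with high probability, not guaranteed" trade-off noted above.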
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1 GB of RAM. I modified the demo.c file to work for my use case.

Minimum Levenshtein distance across multiple words

I am trying to do some string matching using the Levenshtein algorithm for closest words on businesses. (In python but language won't make a huge difference)
An example query would be
search = 'bna'
The lat & lon in the query are close to the result I am looking for.
There is a pub right by that latitude and longitude called BNA Brewing Co. By searching BNA, my hope would be that it shows up first (as bna == bna).
I have tried two different ways.
m = min([editdistance.eval(search, place_split) for place_split in place.name.split(' ')
         if place_split not in string.punctuation])
which returns, ranked only by Levenshtein distance (without taking geographical distance into account):
Coffee & Books In Town Center
Talk 'n' Coffee
Raggedy Ann & Andy's
and, taking geographical distance into account (secondary to Levenshtein):
Shapers Hair Salon & Spa
Amora Day Spa
Pure Esthetics and Micro-Pigmentation
And
m = editdistance.eval(search, place.name)
which returns, ranked only by Levenshtein distance (without taking geographical distance into account):
KFC
MOO
A&W
and, taking geographical distance into account (secondary to Levenshtein):
A&W
A&W
KFC
So you can see that neither way returns anything close to BNA Brewing Co.
What kind of logic do I have to use to get it to return something when the search term exactly matches one of the place names in my database?
Recall that Levenshtein distances count the number of substitutions, additions and deletions required to transform one string into another. Because of this, they often are minimized when comparing strings of similar length (because even if a lot of substitutions are required, you don't have to add or remove a bunch of characters). You can see this playing out in your second example where your best outputs all are the same length as your search string (len("bna") == len("A&W")).
If your search string is always going to be a single word, then your idea to calculate the distance for each word in the string is a good one, since each word is more likely to be a similar length to your search string. However, you are currently doing a case-sensitive comparison, which means that editdistance.eval('bna', 'BNA') == 3, which I'm guessing you don't want.
Try:
m = min([editdistance.eval(search.lower(), place_split.lower())
         for place_split in place.name.split(' ')
         if place_split not in string.punctuation])
which should give you a case-insensitive search.
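Putting the pieces together, a sketch of the ranking logic might look like the following. Here places, lat, lon and geo_distance are placeholders for whatever objects and geographic-distance helper you already have, and name_distance is just an illustrative name; editdistance.eval is the same function used above:
import string
import editdistance

def name_distance(search, place_name):
    # best (lowest) edit distance over the individual words, case-insensitive
    words = [w.strip(string.punctuation).lower() for w in place_name.split()]
    return min(editdistance.eval(search.lower(), w) for w in words if w)

# exact word matches get distance 0 and therefore always sort first;
# geography only breaks ties between equally good name matches
results = sorted(places, key=lambda p: (name_distance(search, p.name),
                                        geo_distance(p, lat, lon)))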

Matching 2 short descriptions and returning a confidence level

I have some data that I get from the Banks using Yodlee and the corresponding transaction messages on the mobile. Both have some description in them - short descriptions.
For example -
string1 = "tatasky_TPSL MUMBA IND"
string2 = "tatasky_TPSL"
They can be matched if one is a completely inside the other. However, some strings like
string1 = "T.G.I Friday's"
string1 = "TGI Friday's MUMBA MAH"
Still need to be matched. Is there a y algorithm which gives a confidence level in matching 2 descriptions ?
You might want to use normalized edit distance, also called Levenshtein distance (see the Levenshtein distance article on Wikipedia). After getting the Levenshtein distance between two strings, you can normalize it by dividing by the length of the longest string (or by the average length of the two strings). This normalized score can act as the confidence. There are 4-5 Python packages for calculating Levenshtein distance, and you can also try it online with an edit distance calculator.
Alternatively, one simple solution is the longest common subsequence algorithm, which can also be used here.
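A minimal sketch of that confidence score, using the editdistance package as one of those Levenshtein implementations (match_confidence and the containment shortcut are only illustrative choices, not a standard API):
import editdistance  # one of several Levenshtein packages

def match_confidence(a, b):
    # 1.0 means identical or one description completely inside the other
    a, b = a.lower(), b.lower()
    if a in b or b in a:
        return 1.0
    distance = editdistance.eval(a, b)
    return 1.0 - float(distance) / max(len(a), len(b))  # normalise by the longer string

print(match_confidence("tatasky_TPSL MUMBA IND", "tatasky_TPSL"))      # 1.0
print(match_confidence("T.G.I Friday's", "TGI Friday's MUMBA MAH"))
Dividing by the longer string keeps the score in [0, 1] even when the two descriptions have very different lengths.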

Getting Keys Within Range/Finding Nearest Neighbor From Dictionary Keys Stored As Tuples

I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbor may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more than one point being used. Here's how I'd do it with my current knowledge:
import math

dimensions = 3
minimumDistance = 0.9
#example dictionary + input
dictionary = {}
dictionary[(0,0,0)] = []
dictionary[(0,0,1)] = []
keyToAdd = [0,1,1]
closestMatch = float('inf')
tooClose = False
for key in dictionary:
    #calculate distance to new point (the key itself is the coordinate tuple)
    distanceToPoint = math.sqrt(sum((key[i] - keyToAdd[i]) ** 2
                                    for i in range(dimensions)))
    #if you want the overall closest match
    if distanceToPoint < closestMatch:
        closestMatch = distanceToPoint
    #if you want to just check it's not within that radius
    if distanceToPoint < minimumDistance:
        tooClose = True
        break
However, performing calculations this way may still run very slowly (it must do this for millions of values). I've searched the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips I'd be grateful.
You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, the sum of the absolute component offsets), since the Euclidean distance can never exceed it: any key whose Manhattan distance is already within the radius is guaranteed to be within it. This would avoid some of the more expensive calculations (though I don't think you need any trigonometry).
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.
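A sketch of that scan, dimension-agnostic, using the squared-distance comparison and the Manhattan shortcut described above plus a per-component early reject in the same spirit (any_within_radius is just an illustrative name):
def any_within_radius(dictionary, point, radius):
    # return True as soon as any stored key lies within `radius` of `point`
    radius_sq = radius * radius
    for key in dictionary:
        offsets = [abs(k - p) for k, p in zip(key, point)]
        if any(offset > radius for offset in offsets):
            continue                       # a single component already rules this key out
        if sum(offsets) <= radius:
            return True                    # Manhattan distance small enough, so Euclidean is too
        if sum(offset * offset for offset in offsets) <= radius_sq:
            return True                    # full squared-distance check, no square root needed
    return False

# usage with the example from the question
dictionary = {(0, 0, 0): [], (0, 0, 1): []}
print(any_within_radius(dictionary, (0, 1, 1), 0.9))   # False: the nearest key is 1.0 away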
