Find Similar Elements in List using Python - python

I need to look for similar Items in a list using python. (e.g. 'Limits' is similar to 'Limit' or 'Download ICD file' is similar to 'Download ICD zip file')
I really want my results to be similar with chars, not with digits (e.g. 'Angle 1' is similar to 'Angle 2'). All these strings in my list end with an '\0'
What I am trying to do is split every item at blanks and look if any part consists of a digit.
But somehow it is not working as I want it to work.
Here is my code example:
for k in range(len(split)): # split already consists of splitted list entry
replace = split[k].replace(
"\\0", ""
) # replace \0 at every line ending to guarantee it is only a digit
is_num = lambda q: q.replace(
".", "", 1
).isdigit() # lambda i found somewhere on the internet
check = is_num(replace)
if check == True: # break if it is a digit and split next entry of list
break
elif check == False: # i know, else would be fine too
seq = difflib.SequenceMatcher(a=List[i].lower(), b=List[j].lower())
if seq.ratio() > 0.9:
print(Element1, "is similar to", Element2, "\t")
break

Try this, its using get_close_matches from difflib instead of sequencematcher.
from difflib import get_close_matches
a = ["abc/0", "efg/0", "bc/0"]
b=[]
for i in a:
x = i.rstrip("/0")
b.append(x)
for i in range(len(b)):
print(get_close_matches(b[i], (b)))

Related

Breaking single line data to multi-line data

Working on project where i am given raw log data and need to parse it out to a readable state, know enought with python to break off all the undeed part and just left with raw data that needs to be split and formated, but cant figure out a way to break it apart if they put multiple records on the same line, which does not always happen.
This is the string value i am getting so far.
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,
I need to break it out apart so that each line starts with the number value, but since i can assume there will only be 1 or two of them i cant think of a way to do it, and the number of other comma seperate values after it vary so i cant go by length. this is what i am looking to get to use will further operations with data from the above example.
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000
*190206*01,2050,0100550,01,4999,0000000,,
txt = "*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,"
output = list()
i = 0
x = txt.split("*")
while i < len(x):
if len(x[i]) == 0:
i += 1
continue
print ("*{0}*{1}".format(x[i],x[i+1]))
output.append("*{0}*{1}".format(x[i],x[i+1]))
i += 2
Use split to tokezine the words between *
Print two constitutive tokens
You can use regex:
([*][0-9]*[*])
You can catch the header part with this and then split according to it.
Same answer as #mujiga but I though a dict might better for further operations
txt = "*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,"
datadict=dict()
i=0
x=txt.split("*")
while i < len(x):
if len(x[i]) == 0:
i += 1
continue
datadict[x[i]]=x[i+1]
i += 2
Adding on to #Ali Nuri Seker's suggestion to use regex, here' a simple one lacking lookarounds (which might actually hurt it in this case)
>>> import re
>>> string = '''*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,'''
>>> print(re.sub(r'([\*][0-9,]+[\*]+[0-9,]+)', r'\n\1', string))
#Output
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000
*190206*01,2050,0100550,01,4999,0000000,,

Consolidate similar patterns into single consensus pattern

In the previous post, I did not clarify the questions properly, therefore, I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (range from 3 characters "FFK" to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the location of where the match is found. (My friend helped wrtoe a script for that.)
import sys
import re
from itertools import chain, izip
# Read input
with open(sys.argv[1], 'r') as f:
sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
patterns = g.read().splitlines()
# Write output
with open(sys.argv[3], 'w') as outputFile:
data_iter = iter(sequences)
order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
header = '\t'.join([k for k in order])
outputFile.write(header + '\n')
for seq_name, seq in izip(data_iter, data_iter):
locations = [[{'antibody name': seq_name, 'epitope sequence': pattern, 'start': match.start()+1, 'end': match.end(), 'length': len(pattern)} for match in re.finditer(pattern, seq)] for pattern in patterns]
for loc in chain.from_iterable(locations):
output = '\t'.join([str(loc[k]) for k in order])
outputFile.write(output + '\n')
f.close()
g.close()
outputFile.close()
Problem is, within these 59,000 patterns, after sorted, I found that some part of one pattern match with part of the other patterns, and I would like to consolidate these into one big "consensus" patterns and just keep the consensus (see examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS : I am aligning them here so it's easier to visualize. The 59,000 patterns initially are not sorted so it's hard to see the consensus in the actual file.
In my particular problem, I am not picking the longest patterns, instead, I need to take each pattern into account to find the consensus. I hope I have explained clearly enough for my specific problem.
Thanks!
Here's my solution with randomized input order to improve confidence of the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
copy_data_li = data_li[:]
while len(copy_data_li) > 0:
remove_li = []
len_remove_li = len(remove_li)
longest_str = max(copy_data_li, key=len)
copy_data_li.remove(longest_str)
remove_li.append(longest_str)
while len_remove_li != len(remove_li):
len_remove_li = len(remove_li)
for value in copy_data_li:
value_pattern = "".join([x+"?" for x in value])
longest_match = max(re.findall(value_pattern, longest_str), key=len)
if longest_match in value:
longest_str_index = longest_str.index(longest_match)
value_index = value.index(longest_match)
if value_index > longest_str_index and longest_str_index > 0:
longest_str = value[:value_index] + longest_str
copy_data_li.remove(value)
remove_li.append(value)
elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
longest_str += value[len(longest_str)-longest_str_index:]
copy_data_li.remove(value)
remove_li.append(value)
elif value in longest_str:
copy_data_li.remove(value)
remove_li.append(value)
print(longest_str)
print(remove_li)
random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
#patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
test = find_core(patterns)
test = find_pre_and_post(test, patterns)
#final = "YLQMNSLRAED"
final = "KPGQAPRLLIYGASSRATGIPD"
if test == final:
print("worked:" + test)
else:
print("fail:"+ test)
def find_pre_and_post(core, patterns):
pre = ""
post = ""
for pattern in patterns:
start_index = pattern.find(core)
if len(pattern[0:start_index]) > len(pre):
pre = pattern[0:start_index]
if len(pattern[start_index+len(core):len(pattern)]) > len(post):
post = pattern[start_index+len(core):len(pattern)]
return pre+core+post
def find_core(patterns):
test = ""
for i in range(len(patterns)):
for j in range(2,len(patterns[i])):
patterncount = 0
for pattern in patterns:
if patterns[i][0:j] in pattern:
patterncount += 1
if patterncount == len(patterns):
test = patterns[i][0:j]
return test
main()
So what I do first is find the main core in the find_core function by starting with a string of length two, as one character is not sufficient information, for the first string. I then compare that substring and see if it is in ALL the strings as the definition of a "core"
I then find the indexes of the substring in each string to then find the pre and post substrings added to the core. I keep track of these lengths and update them if one length is greater than the other. I didn't have time to explore edge cases so here is my first shot

Python: Split between two characters

Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like such. Splitting between each ><.
But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep it, and split between the two equality operators.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
a_list.append('<{}>'.format(b[i]))
This will output the given list in python 2.7.2, but it should work in python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach using re.findall() function on extended example:
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
if len(chars) is not 2: raise IndexError("Argument chars must contain two characters.")
result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
result_list[0] = result_list[0][1:]
result_list[-1] = result_list[-1][:-1]
return result_list
Credit goes to #cforemanand #Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])

python - matching string and replacing

I have a file i am trying to replace parts of a line with another word.
it looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
i need to delete everything but bob123#bobscarshop.com, but i need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe , from a different text file and place poniacvibe infront of bob123#bobscarshop.com
so it would look like this bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to go about doing this, but i think i would have to split the bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":") , but some of the lines have a (:) in a spot that i don't want the line to be split at, if that makes any sense...
if anyone could help i would really appreciate it.
ok, it looks to me like you are using a colon : to separate your strings.
in this case you can use .split(":") to break your strings into their component substrings
eg:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and the same number of substrings in the main string you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
outputdata = firstdata
outputdata.insert(1,seconddata[1])
outputstring = ""
for item in outputdata:
if outputstring == "":
outputstring = item
else
outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is because the index is being used to find the relevant strings rather than trying to use some form of string type matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists too short) you can explicitly test for them before you start using len(list) to see how many elements are in it.
or you could let it run and catch the exception, however in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
hope this helps
James
EDIT:
ok so if you are trying to match up a long list of strings from files you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode = "r")
secondfile= open("secondfile.txt",mode = "r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()
first_data = []
for item in first_raw_data:
first_data.append(item.replace("\n","").split(":"))
second_data = []
for item in second_raw_data:
second_data.append(item.replace("\n","").split(":"))
output_strings = []
for item in first_data:
searchstring = item[3]
for entry in second_data:
if searchstring == entry[0]:
output_data = item
output_string = ""
output_data.insert(1,entry[1])
for data in output_data:
if output_string == "":
output_string = data
else:
output_string = output_string + ":" + data
output_strings.append(output_string)
break
for entry in output_strings:
print(entry)
this should achieve what you're after and as prove of concept will print the resulting list of stings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file change the last two lines to:
outputfile = open("outputfile.txt", mode = "w")
for entry in output_strings:
outputfile.write(entry+"\n")
outputfile.close()

Split specific items in list into two

I'm building an XML parser in python for an SVG file. It will eventually become specific instructions for stepper motors.
SVG files contain commands such as 'M', 'C' and 'L.' The path data might look like this:
[M199.66, 0.50C199.6, 0.50...0.50Z]
When I extracted the path data, it's a list of one item (which is a string). I split the long string into multiple strings:
[u'M199.6', u'0.50C199.66', u'0.50']
The 'M, C and L' commands are important - I'm having difficulty splitting '0.5C199.6' into '0.5' and 'C199.6' because it only exists for certain items in the list, and I'd like to retain the C and not discard it. This is what I have so far:
for item in path_strings[0]:
s=string.split(path_strings[0], ',')
print s
break
for i in range(len(s)):
coordinates=string.split(s[i],'C')
print coordinates
break
You could try breaking it into substrings like this:
whole = "0.5C199.66"
start = whole[0:whole.find("C")]
end = whole[whole.find("C"):]
That should give you start == "0.5" and end == "C199.66"
Alternatively you could use the index function instead of find, which raises a ValueError when the substring can't be found. That would give you the benefit of easily determining that for the current string, no 'C' command is present.
http://docs.python.org/2/library/string.html#string-functions
Use a regex to search for the commands ([MCL]).
import re
lst = [u'M199.6', u'0.50C199.66', u'0.50']
for i, j in enumerate(lst):
m = re.search('(.+?)([MCL].+)', j)
if m:
print [m.group(1), m.group(2)] # = coordinates from your example
lst[i:i+1] = [m.group(1), m.group(2)] # replace the item in the lst with the splitted thing
# or do something else with the coordinates, whatever you want.
print lst
splits your list in:
[u'M199.6', u'0.50', u'C199.66', u'0.50']

Categories

Resources