Most optimal way to reverse search list of similar strings - python

I have a list of data that includes both command strings and the letters of the alphabet, upper and lower case, totaling 512+ strings (including those in sub-lists). I want to parse the input data, but I can't think of any way to do it properly other than starting from the largest possible command size and cutting it down until I find a command that matches the string, then outputting the location of that command. That takes forever, and any other way I can think of will cause overlapping matches. I'm doing this in Python.
Say:
L = ['a', 'b',['aa','bb','cc'], 'c']
For 'bb' the output would be '0201', and for 'c' it would be '03'.
So how should I do this?

It sounds like you're searching through the list for every substring. How about building a dict to look up the keys? Of course, you still have to start searching with the longest subkey.
L = ['a', 'b',['aa','bb','cc'], 'c']
def lookups( L ):
    """ returns `item`, `code` tuples """
    for i, item in enumerate(L):
        if isinstance(item, list):
            for j, sub in enumerate(item):
                yield sub, "%02d%02d" % (i,j)
        else:
            yield item, "%02d" % i
You could then look up substrings with:
lookupdict = dict(lookups(L))
print(lookupdict['bb']) # but you have to try 'bb' before trying 'b' ...
But if the key length is not just 1 or 2, it might also make sense to group the items into separate dicts where each key has the same length.
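As a rough sketch of that grouping idea (not part of the original answer): the helper names group_by_length and encode are made up for illustration, and the sketch assumes the input should be consumed greedily, trying the longest keys first at each position.

```python
from collections import defaultdict

L = ['a', 'b', ['aa', 'bb', 'cc'], 'c']

def lookups(L):
    """Yield (item, code) pairs, as in the answer above."""
    for i, item in enumerate(L):
        if isinstance(item, list):
            for j, sub in enumerate(item):
                yield sub, "%02d%02d" % (i, j)
        else:
            yield item, "%02d" % i

def group_by_length(pairs):
    """Hypothetical helper: one dict per key length."""
    tables = defaultdict(dict)
    for key, code in pairs:
        tables[len(key)][key] = code
    return tables

def encode(text, tables):
    """Hypothetical helper: greedy longest-match lookup over the input."""
    lengths = sorted(tables, reverse=True)  # longest keys first
    out, pos = [], 0
    while pos < len(text):
        for n in lengths:
            chunk = text[pos:pos + n]
            if chunk in tables[n]:
                out.append(tables[n][chunk])
                pos += n
                break
        else:
            # No key of any length matches at this position.
            raise ValueError("no command matches at position %d" % pos)
    return out

tables = group_by_length(lookups(L))
print(encode('abbc', tables))  # ['00', '0201', '03']
```

Each position costs at most one dict lookup per distinct key length, instead of a scan over all 512+ strings.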

If you must use this data structure:
from collections.abc import MutableSequence
def scanList( command, theList ):
    for i, elt in enumerate( theList ):
        if elt == command:
            return ( i, None )
        if isinstance( elt, MutableSequence ):
            for j, elt2 in enumerate( elt ):
                if elt2 == command:
                    return i, j
L = ['a', 'b',['aa','bb','cc'], 'c']
print( scanList( "bb", L ) )
# (2, 1)
print( scanList( "c", L ) )
# (3, None)
BUT
This is a bad data structure. Are you able to get this data in a nicer form?

Related

How to use enumerate in a list comprehension with two lists?

I just started to use list comprehensions and I'm struggling with them. In this case, I need the index in each list (sequence_0 and sequence_1) that the iteration has reached at each step. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) shared by the two sequences. Once a matching pair is found, the program should continue to the next nucleotides of the sequences, checking whether they are also equal and, if so, elongating the motif with them. The final output should be a list of all the motifs found.
The problem is that, to continue to the next nucleotides once a pair is found, I need the position of the pair in both sequences. The index function does not work in this case, and that's why I need enumerate.
Also, I don't understand exactly the reason for the x and y between (), so it would be good to understand that too :)
Just to explain, the lists contain DNA sequences, so it's basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
    data = fastaread(arq)
    seqs = [list(sequence) for sequence in data.values()]
    motifs = [[]]
    i = 0
    sequence_0, sequence_1 = seqs[0], seqs[1] # just to simplify
    for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
        print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
        if x == y:
            print(f'Pairs {"".join(x)} and {"".join(y)} match!')
            motifs[i].append(x[0]), motifs[i].append(x[1])
            k = sequence_0.index(x[0]) + 2 # THIS IS NOT RETURNING THE RIGHT NUMBER
            u = sequence_1.index(y[0]) + 2
            print(k, u)
            # Determines if the rest of the sequence is compatible
            print(f'Starting to elongate the motif {x}...')
            for j, m in enumerate(sequence_1[u::]):
                try:
                    # Checks if the nucleotide is equal for both of the sequences
                    print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
                    if m == sequence_0[k + j]:
                        motifs[i].append(m)
                        print(f'The pair {sequence_0[k + j]}, {m} is equal!')
                    # Stop in the first nonequal residue
                    else:
                        print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
                        break
                except IndexError:
                    print('IndexError, end of the string')
            else:
                i += 1
                motifs.append([])
    return motifs
...
One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will bring something like this to you:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now, you have to deal with the nested tuples. Focus on the internal ones. It is obvious that what you want is to compare the first element of the tuples with the second ones. But, also, you will need the position where the difference resides (that lies outside). So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a,b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
It is important to notice that, in the case of two lists with different sizes, the zip function will bring a list of tuples with the size matching the smaller of the lists. The extra elements of the other list will be ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
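One possible way to plug compare into motif elongation is sketched below. The name elongate and its parameters k and u (start positions in each sequence) are assumptions for illustration, not part of the question's code.

```python
def compare(a, b):
    """Return the index of the first difference, or -1 if none is found."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1

def elongate(seq_a, seq_b, k, u):
    """Hypothetical helper: shared run starting at seq_a[k] and seq_b[u]."""
    tail_a, tail_b = seq_a[k:], seq_b[u:]
    n = compare(tail_a, tail_b)
    if n == -1:
        # No difference within the zipped range: the run lasts
        # until the shorter tail ends.
        n = min(len(tail_a), len(tail_b))
    return seq_a[k:k + n]

a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
print(elongate(a, b, 0, 0))  # ['A', 'T', 'C']
```

Because the slices start at k and u, the index returned by compare is relative to those positions, which avoids calling list.index on repeated nucleotides.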

Adding certain lengthy elements to a list

I'm doing a project for my school and for now I have the following code:
def conjunto_palavras_para_cadeia1(conjunto):
    acc = []
    conjunto = sorted(conjunto, key=lambda x: (len(x), x))
    def by_size(words, size):
        result = []
        for word in words:
            if len(word) == size:
                result.append(word)
        return result
    for i in range(0, len(conjunto)):
        if i > 0:
            acc.append(("{} ->".format(i)))
            acc.append(by_size(conjunto, i))
    acc = ('[%s]' % ', '.join(map(str, acc)))
    print( acc.replace(",", "") and acc.replace("'", "") )
conjunto_palavras_para_cadeia1(c)
I have this list: c = ['A', 'E', 'LA', 'ELA'] and what I want is to return a string where the words go from the shortest to the longest, with words of the same length ordered alphabetically. I haven't been able to do that...
OUTPUT: [;1 ->, [A, E], ;2 ->, [LA], ;3 ->, [ELA]]
WANTED OUTPUT: '[1->[A, E];2->[LA];3->[ELA]]'
Taking a look at your program, the only issue appears to be when you are formatting your output for display. Note that you can use str.format to insert lists into strings, something like this:
'{}->{}'.format(i, sublist)
Here's my crack at your problem, using sorted + itertools.groupby.
from itertools import groupby
r = []
for i, g in groupby(sorted(c, key=len), key=len):
    r.append('{}->{}'.format(i, sorted(g)).replace("'", ''))
print('[{}]'.format(';'.join(r)))
[1->[A, E];2->[LA];3->[ELA]]
A breakdown of the algorithm stepwise is as follows -
sort elements by length
group consecutive elements by length
for each group, sort sub-lists alphabetically, and then format them as strings
at the end, join each group string and surround with square brackets []
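The steps above can also be wrapped in a function that returns the string instead of printing it. This is only a sketch; the name words_to_chain is made up here, and it joins the words directly rather than stripping quotes from a list repr.

```python
from itertools import groupby

def words_to_chain(words):
    """Hypothetical helper: group words by length, alphabetical within a group."""
    # Sorting by (length, word) makes each length group contiguous
    # and already in alphabetical order.
    ordered = sorted(words, key=lambda w: (len(w), w))
    parts = []
    for size, group in groupby(ordered, key=len):
        parts.append('{}->[{}]'.format(size, ', '.join(group)))
    return '[{}]'.format(';'.join(parts))

c = ['A', 'E', 'LA', 'ELA']
print(words_to_chain(c))  # [1->[A, E];2->[LA];3->[ELA]]
```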
Shortest solution (using pure Python):
c = ['A', 'E', 'LA', 'ELA']
result = {}
for item in c:
    result[len(item)] = [item] if len(item) not in result else result[len(item)] + [item]
str_result = ', '.join(['{0} -> {1}'.format(res, sorted(result[res])) for res in result])
Let me explain:
We take the items one by one in a loop and add them to a dictionary, building lists keyed by word length.
In result we have:
{1: ['A', 'E'], 2: ['LA'], 3: ['ELA']}
And in str_result:
1 -> ['A', 'E'], 2 -> ['LA'], 3 -> ['ELA']
If you have any questions, ask.

Replace one item in a string with one item from a list

I have a string and a list:
seq = '01202112'
l = [(0,1,0),(1,1,0)]
I would like a pythonic way of replacing each '2' with the value at the corresponding index in the list l such that I obtain two new strings:
list_seq = ['01001110', '01101110']
By using .replace(), I could iterate through l, but I wondered whether there is a more pythonic way to get list_seq?
I might do something like this:
out = [''.join(c if c != '2' else str(next(f, c)) for c in seq) for f in map(iter, l)]
The basic idea is that we call iter to turn the tuples in l into iterators. At that point every time we call next on them, we get the next element we need to use instead of the '2'.
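A small demonstration of that idea, using the seq and l from the question (the name drawn and the placeholder 'x' are only for illustration):

```python
seq = '01202112'
l = [(0, 1, 0), (1, 1, 0)]

# next() with a default never raises: once the iterator is exhausted,
# it keeps returning the default instead.
fill = iter(l[0])
drawn = [next(fill, 'x') for _ in range(4)]
print(drawn)  # [0, 1, 0, 'x']

# The same mechanism inside the comprehension: each '2' pulls the next
# value from the current tuple's iterator.
list_seq = [''.join(c if c != '2' else str(next(f, c)) for c in seq)
            for f in map(iter, l)]
print(list_seq)  # ['01001110', '01101110']
```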
If this is too compact, the logic might be easier to read as a function:
def replace(seq, to_replace, fill):
    fill = iter(fill)
    for element in seq:
        if element != to_replace:
            yield element
        else:
            yield next(fill, element)
giving
In [32]: list(replace([1,2,3,2,2,3,1,2,4,2], to_replace=2, fill="apple"))
Out[32]: [1, 'a', 3, 'p', 'p', 3, 1, 'l', 4, 'e']
Thanks to @DanD in the comments for noting that I had assumed I'd always have enough characters to fill from! We'll follow his suggestion to keep the original characters if we run out, but modifying this approach to behave differently is straightforward and left as an exercise for the reader. :-)
[''.join([str(next(digit, 0)) if x == '2' else x for x in seq])
 for digit in map(iter, l)]
I don't know if this solution is 'more pythonic' but:
def my_replace(s, c=None, *other):
    return s if c is None else my_replace(s.replace('2', str(c), 1), *other)
seq = '01202112'
l = [(0,1,0),(1,1,0)]
list_req = [my_replace(seq, *x) for x in l]
seq = '01202112'
li = [(0,1,0),(1,1,0)]
def grunch(s, tu):
    it = map(str,tu)
    return ''.join(next(it) if c=='2' else c for c in s)
list_seq = [grunch(seq,tu) for tu in li]

Given an iterable of sets, and a name (string), return a set of names which is connected with the given "name"

Given a name (as a string) and an iterable of sets, containing two names (as strings), return a new set consisting of names that share a set with the given name.
For example:
itr = ({"a", "b"}, {"b", "c"}, {"c", "a"})
name = "a"
newset = {"b", "c"}
I'm looking for a pythonic way of approaching this problem. This is the current mess that I have:
def friends(itr, name):
    newset = []
    for i in itr:
        if name in i:
            for j in i:
                if j != name:
                    newset.append(j)
    return set(newset)
Any help would be appreciated. I'm relatively new to Python and programming in general. Thank you
>>> set(e for s in itr for e in s if name in s) - set((name,))
set(['c', 'b'])
Your logic is fine, but the solution is messy as you say:
def friends(itr, name):
    newset = [] # You should probably make this a set
    for i in itr:
        if name in i:
            for j in i: # This loop is really not necessary
                if j != name:
                    newset.append(j)
    return set(newset)
Your code can be changed to something like this without any fancy tools:
def friends(itr, name):
    newset = set()
    for subset in itr:
        if name in subset:
            newset.update(subset)
    return newset.difference((name,))
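The same logic can be sketched more compactly with set.union, which accepts any number of iterables. The name friends_union is hypothetical, used here only to keep it apart from the versions above.

```python
def friends_union(itr, name):
    """Hypothetical variant: union of all sets containing name, minus name."""
    # set().union(*sets) merges every selected set in one call and
    # returns an empty set when nothing matches.
    return set().union(*(s for s in itr if name in s)) - {name}

itr = ({"a", "b"}, {"b", "c"}, {"c", "a"})
print(friends_union(itr, "a") == {"b", "c"})  # True
```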
>>> reduce(set.union, filter(lambda x: name in x, itr), set()) - set((name,))
set(['c', 'b'])
First, keep only the sets that contain the name, removing the name from each one, with a generator expression
>>> filtered_sets = (item - {name} for item in itr if name in item)
Then, by iterating the generator, flatten the remaining sets into one set (the name != item check is redundant after the subtraction, but harmless)
>>> {item for items in filtered_sets for item in items if name != item}
{'b', 'c'}

Find the index of an item in a list that starts with a user defined input

Given a list such as:
lst = ['abc123a:01234', 'abcde123a:01234', ['gfh123a:01234', 'abc123a:01234']]
is there a way of quickly returning the index of all the items which start with a user-defined string, such as 'abc'?
Currently I can only return perfect matches using:
print(lst.index('abc123a:01234'))
or by doing this in a number of steps: finding all the elements that start with 'abc', saving these to a new list, and searching the original list for perfect matches against these.
If the only quick way is to use regex, how could I still give the user the flexibility to input what the match should be?
You can accomplish that using the following script/method (which I admit is quite primitive):
lst = ['abc123a:01234', 'abcde123a:01234', ['gfh123a:01234', 'abc123a:01234']]
user_in = 'abc'
def get_ind(lst, searchterm, path=None, indices=None):
    if indices is None:
        indices = []
    if path is None:
        path = []
    for index, value in enumerate(lst):
        if isinstance(value, list):
            get_ind(value, searchterm, path + [index], indices)
        elif value.startswith(searchterm):
            indices.append(path + [index])
    return indices
new_lst = get_ind(lst, user_in)
>>> print(new_lst)
[[0], [1], [2, 1]]
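On the regex part of the question: if a regex is preferred, user input can still be accepted safely by escaping it first. The sketch below is one way this might look; get_ind_regex is a made-up name, and it assumes the user's text should be matched literally at the start of each string.

```python
import re

def get_ind_regex(lst, user_input, path=()):
    """Hypothetical variant of get_ind using a compiled, escaped pattern."""
    # re.escape makes characters such as '.' or '*' in the user's input
    # match literally; pattern.match only matches at the start of a string.
    pattern = re.compile(re.escape(user_input))
    indices = []
    for index, value in enumerate(lst):
        if isinstance(value, list):
            indices.extend(get_ind_regex(value, user_input, path + (index,)))
        elif pattern.match(value):
            indices.append(list(path + (index,)))
    return indices

lst = ['abc123a:01234', 'abcde123a:01234', ['gfh123a:01234', 'abc123a:01234']]
print(get_ind_regex(lst, 'abc'))  # [[0], [1], [2, 1]]
```

For a plain prefix test, str.startswith as used above is simpler; the regex form mainly helps if the pattern later needs to grow beyond a literal prefix.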
