Counting same length items in a list - python

I am trying to port a CGI script to Python, using a Pythonic coding style.
sequence = "aaaabbababbbbabbabb"
res = sequence.split("a") + sequence.split("b")
res = [l for l in res if l]
The result is
>>> res
['bb', 'b', 'bbbb', 'bb', 'bb', 'aaaa', 'a', 'a', 'a', 'a']
This was ~100 lines of code in C. Now I want to efficiently count the items with the same length in the res list. For example, here res contains 5 elements with length 1, 3 elements with length 2 and 2 elements with length 4.
The problem is that the sequence string can be very big.

The easiest way to generate a histogram of string lengths given a list of strings is to use collections.Counter:
>>> from collections import Counter
>>> a = ["a", "b", "aaa", "bb", "aa", "bbb", "", "a", "b"]
>>> Counter(map(len, a))
Counter({1: 4, 2: 2, 3: 2, 0: 1})
Edit: There is also a better way to find runs of equal characters, namely itertools.groupby():
>>> from itertools import groupby
>>> sequence = "aaaabbababbbbabbabb"
>>> Counter(len(list(it)) for k, it in groupby(sequence))
Counter({1: 5, 2: 3, 4: 2})
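Since the question mentions that the sequence can be very big, one possible variation (a sketch, not part of the original answer) counts the items of each group with sum() instead of building a temporary list per run. The function name run_length_histogram is made up for illustration.

```python
from collections import Counter
from itertools import groupby


def run_length_histogram(sequence):
    """Count how many runs of equal characters have each length."""
    # sum(1 for _ in run) consumes the group lazily, so no list is built per run.
    return Counter(sum(1 for _ in run) for _, run in groupby(sequence))


hist = run_length_histogram("aaaabbababbbbabbabb")
print(hist)
```

Because groupby() accepts any iterable, the same function should also work on a stream of characters read in chunks, as long as the characters arrive in order.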

You could probably do something like
occurrences_by_length = {}  # map of length of string -> number of strings with that length
for i in (len(x) for x in (sequence.split("a") + sequence.split("b"))):
    if i in occurrences_by_length:
        occurrences_by_length[i] = occurrences_by_length[i] + 1
    else:
        occurrences_by_length[i] = 1
Now occurrences_by_length has a mapping of the length of each string to the number of times a string of that length appears.
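A shorter variant of the same counting idea, sketched here with collections.defaultdict so that the membership test is not needed. Note one deliberate difference, which is an assumption about what the asker wants: this version skips the empty strings that split() produces, so no entry for length 0 appears.

```python
from collections import defaultdict

sequence = "aaaabbababbbbabbabb"

# Missing keys start at 0, so each length can be incremented directly.
occurrences_by_length = defaultdict(int)
for piece in sequence.split("a") + sequence.split("b"):
    if piece:  # ignore the empty strings produced by adjacent separators
        occurrences_by_length[len(piece)] += 1

print(dict(occurrences_by_length))
```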

Related

Sorting list based on order of substrings in another list

I have two lists of strings.
list_one = ["c11", "a78", "67b"]
list_two = ["a", "b", "c"]
What is the shortest way of sorting list_one using strings from list_two to get the following output?
["a78", "67b", "c11"]
Edit 1:
There is a similar question Sorting list based on values from another list?, but in that question he already has the list of required indexes for resulting string, while here I have just the list of substrings.
Edit 2:
Since the example lists above might not be fully representative, I am adding another case.
list_one is ["1.cde.png", "1.abc.png", "1.bcd.png"]
list_two is ["abc", "bcd", "cde"].
The output is supposed to be [ "1.abc.png", "1.bcd.png", "1.cde.png"]
If, for example, list_one is shorter than list_two, it should still work:
list_one is ["1.cde.png", "1.abc.png"]
list_two is ["abc", "bcd", "cde"]
The output is supposed to be [ "1.abc.png", "1.cde.png"]
key = {next((s for s in list_one if v in s), None): i for i, v in enumerate(list_two)}
print(sorted(list_one, key=key.get))
This outputs:
['a78', '67b', 'c11']
Try this
list_one = ["c11", "a78", "67b"]
list_two = ["a", "b", "c"]
[x for y in list_two for x in list_one if y in x]
Output :
["a78", "67b", "c11"]
Assuming that each item in list_one contains exactly one of the characters from list_two, and that you know the class of those characters, e.g. letters, you can extract those using a regex and build a dictionary mapping the characters to the element. Then, just look up the correct element for each character.
>>> import re
>>> list_one = ["c11", "a78", "67b"]
>>> list_two = ["a", "b", "c"]
>>> d = {re.search("[a-z]", s).group(): s for s in list_one}
>>> list(map(d.get, list_two))
['a78', '67b', 'c11']
>>> [d[c] for c in list_two]
['a78', '67b', 'c11']
Unlike the other approaches posted so far, which all seem to be O(n²), this is only O(n).
Of course, the approach can be generalized to e.g. more than one character, or characters in specific positions of the first string, but it will always require some pattern and knowledge about that pattern. E.g., for your more recent example:
>>> list_one = ["1.cde.png", "1.abc.png", "1.bcd.png"]
>>> list_two = ["abc", "cde"]
>>> d = {re.search(r"\.(\w+)\.", s).group(1): s for s in list_one}
>>> d = {s.split(".")[1]: s for s in list_one} # alternatively without re
>>> [d[c] for c in list_two if c in d]
['1.abc.png', '1.cde.png']
>>> sorted(list_one, key=lambda x: [i for i,e in enumerate(list_two) if e in x][0])
['a78', '67b', 'c11']
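For comparison, here is a minimal sketch of a key function that ranks each item by the position of the first substring it contains. It also covers the case from the question where list_one is shorter than list_two. The name sort_by_substrings is hypothetical, and sorting unmatched items last is an assumption, since the question does not say what should happen to them.

```python
def sort_by_substrings(items, substrings):
    """Sort items by the position of the first substring each one contains."""
    def rank(item):
        for position, sub in enumerate(substrings):
            if sub in item:
                return position
        # Assumption: items that match nothing are placed at the end.
        return len(substrings)

    return sorted(items, key=rank)


print(sort_by_substrings(["c11", "a78", "67b"], ["a", "b", "c"]))
print(sort_by_substrings(["1.cde.png", "1.abc.png"], ["abc", "bcd", "cde"]))
```

Like most of the answers above, this scans list_two for every item, so it is quadratic in the worst case. The dictionary approach is faster when a reliable pattern is available.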

Generating all possible k-mers (string combinations) from a given list

I have a string S that is composed of 20 characters:
S='ARNDCEQGHILKMFPSTWYV'
I need to generate all possible k-mer combinations from a given input k.
When k == 3, then there are 8000 combinations (20*20*20) and the output list looks like this:
output = ['AAA', 'AAR', ..., 'AVV', ..., 'VVV'] #len(output)=8000
When k == 2, then there are 400 combinations (20*20) and the output list looks like this:
output = ['AA', 'AR', 'AN', ..., 'VV'] #len(output)=400
When k == 1, then there are only 20 combinations:
output =['A', 'R', 'N', ..., 'Y', 'V'] #len(output)=20
I know how to do this if the number k is fixed, like if k == 3, then I can do this:
for a in S:
    for b in S:
        for c in S:
            output.append(a+b+c)
#then len(output)=8000
But the number k is chosen randomly.
I tried to use permutations, but it does not give me strings with repeated letters like 'AAA'. Maybe it can and I'm just doing it wrong.
What you are looking for is itertools.product(). Pass k as the repeat argument.
from itertools import product
...
list(product('ARNDCEQGHILKMFPSTWYV', repeat=2)) # len = 400
list(product('ARNDCEQGHILKMFPSTWYV', repeat=3)) # len = 8000
Bear in mind that it returns tuples of characters by default. If you want strings instead, you can join them using a list comprehension as below:
[''.join(c) for c in product('ARNDCEQGHILKMFPSTWYV', repeat=3)]
# ['AAA', 'AAR', ..., 'AVV', ..., 'VVV']
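Because the number of k-mers grows as 20**k, it may be preferable to generate them lazily instead of building the whole list. The following is a small sketch along those lines. The function name kmers is made up for illustration.

```python
from itertools import product


def kmers(alphabet, k):
    """Lazily yield every string of length k over the given alphabet."""
    return ("".join(chars) for chars in product(alphabet, repeat=k))


S = "ARNDCEQGHILKMFPSTWYV"

# Take only the first three 2-mers without generating the other 397.
first_three = [kmer for _, kmer in zip(range(3), kmers(S, 2))]
print(first_three)

# Materialize the full list only when it is really needed.
all_3mers = list(kmers(S, 3))
print(len(all_3mers))
```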
You can use itertools.product and generate the random value for k:
import itertools
import random
S = 'ARNDCEQGHILKMFPSTWYV'
final_results = map(''.join, itertools.product(*[S]*random.randint(1, 10)))
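Just generate a random integer V in the range 0..L^k-1, where L is the length of the string and k is the length of the k-mer. Then build the corresponding combination from the base-L digits of V:
import random
A = 'ARNDCEQGHILKMFPSTWYV'
L, k = len(A), 3            # k = 3 is an example value
V = random.randrange(L**k)  # random integer in 0..L^k-1
C = []
for i in range(k):
    C.append(A[V % L])      # i-th letter using integer modulo
    V = V // L              # integer division
''.join(C) is then one random k-mer.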

Use list of nested indices to access list element

How can a list of indices (called "indlst"), such as [[1,0], [3,1,2]] which corresponds to elements [1][0] and [3][1][2] of a given list (called "lst"), be used to access their respective elements? For example, given
indlst = [[1,0], [3,1,2]]
lst = ["a", ["b","c"], "d", ["e", ["f", "g", "h"]]]
(required output) = [lst[1][0],lst[3][1][2]]
The output should correspond to ["b","h"]. I have no idea where to start, let alone find an efficient way to do it (as I don't think parsing strings is the most pythonic way to go about it).
EDIT: I should mention that the nested level of the indices is variable, so while [1,0] has two elements in it, [3,1,2] has three, and so forth. (examples changed accordingly).
Recursion can grab arbitrary/deeply indexed items from nested lists:
indlst = [[1,0], [3,1,2]]
lst = ["a", ["b","c"], "d", ["e", ["f", "g", "h"]]]
#(required output) = [lst[1][0],lst[3][1][2]]
def nested_lookup(nlst, idexs):
    if len(idexs) == 1:
        return nlst[idexs[0]]
    return nested_lookup(nlst[idexs[0]], idexs[1::])
reqout = [nested_lookup(lst, i) for i in indlst]
print(reqout)
dindx = [[2], [3, 0], [0], [2], [3, 1, 2], [3, 0], [0], [2]]
reqout = [nested_lookup(lst, i) for i in dindx]
print(reqout)
['b', 'h']
['d', 'e', 'a', 'd', 'h', 'e', 'a', 'd']
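I also found that arbitrary extra zero indices are fine (this works here because the leaves are one-character strings, and indexing such a string with 0 returns the same string):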
lst[1][0][0]
Out[36]: 'b'
lst[3][1][2]
Out[37]: 'h'
lst[3][1][2][0][0]
Out[38]: 'h'
So if you actually know the maximum nesting depth, you can pad each (variable-length, shorter) index list to that depth: overwrite its values into a fixed-length dictionary primed with zeros, using the dictionary .update() method.
Then hard-code the nested indexing to that depth; any "extra" zero-valued indices are ignored.
Below, the depth is hard-coded to 4:
def fix_depth_nested_lookup(nlst, idexs):
    reqout = []
    for i in idexs:
        ind = dict.fromkeys(range(4), 0)
        ind.update(dict(enumerate(i)))
        reqout.append(nlst[ind[0]][ind[1]][ind[2]][ind[3]])
    return reqout
print(fix_depth_nested_lookup(lst, indlst))
['b', 'h']
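A caveat worth checking before relying on zero padding: it depends on the leaves being one-character strings. The short sketch below shows the trick working on the list from the question and failing on a hypothetical list with integer leaves (the name numbers is made up for this illustration). With multi-character string leaves it would not fail, but would silently return only the first character.

```python
lst = ["a", ["b", "c"], "d", ["e", ["f", "g", "h"]]]

# Indexing a one-character string with 0 returns the same string,
# so trailing zeros are harmless for this particular list.
padded_ok = lst[1][0][0][0]
print(padded_ok)

# Hypothetical list with integer leaves: the same padding raises TypeError,
# because integers cannot be indexed.
numbers = [1, [2, 3]]
try:
    numbers[1][0][0]
    padding_failed = False
except TypeError:
    padding_failed = True
print(padding_failed)

# Multi-character leaves are truncated to their first character.
truncated = ["xyz"][0][0]
print(truncated)
```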
You can try this code block (it assumes every index list has exactly two elements):
required_output = []
for i,j in indlst:
    required_output.append(lst[i][j])
You can just iterate through and collect the value.
>>> for i,j in indlst:
...     print(lst[i][j])
...
b
f
Or, you can use a simple list comprehension to form a list from those values.
>>> [lst[i][j] for i,j in indlst]
['b', 'f']
Edit:
For variable length, you can do the following:
>>> for i in indlst:
...     temp = lst
...     for j in i:
...         temp = temp[j]
...     print(temp)
...
b
h
You can form a list with functools.reduce and a list comprehension.
>>> from functools import reduce
>>> [reduce(lambda temp, x: temp[x], i,lst) for i in indlst]
['b', 'h']
N.B. this is a python3 solution. For python2, you can just ignore the import statement.
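The lambda in the reduce call can also be replaced by operator.getitem from the standard library, which performs the same container[index] lookup. This is a sketch for Python 3. The helper name lookup is hypothetical.

```python
from functools import reduce
from operator import getitem


def lookup(nested, path):
    """Follow a list of indices into a nested list and return the element."""
    # reduce applies getitem step by step: nested[path[0]][path[1]]...
    return reduce(getitem, path, nested)


indlst = [[1, 0], [3, 1, 2]]
lst = ["a", ["b", "c"], "d", ["e", ["f", "g", "h"]]]

result = [lookup(lst, path) for path in indlst]
print(result)
```

An empty path returns the whole list, since reduce then just returns its initial value.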

How do I pick an arbitrary number of an element occurring many times in a list?

I have two variables holding a string each and an empty list:
a = 'YBBB'
b = 'RYBB'
x = []
I want to loop through each of the strings and treat each 'B' in the two strings as an independent element (I wish I could just type a.('B') and b.('B')). What I actually want to do is loop through b and ask whether each of the items in b is in a. If so, the number of occurrences of that item (say 'B') in a is checked; this should give 3. Then I want to compare the counts of the item in the two strings and push the lesser of the two into the empty list. In this case, only two 'B's will be pushed into x.
You can use a nested list comprehension like following:
>>> [i for i in set(b) for _ in range(min(b.count(i), a.count(i)))]
['B', 'B', 'Y']
If the order is important you can use collections.OrderedDict for creating the unique items from b:
>>> from collections import OrderedDict
>>>
>>> [i for i in OrderedDict.fromkeys(b) for _ in range(min(b.count(i), a.count(i)))]
['Y', 'B', 'B']
This is useless text for the moderators.
import collections
a = 'YBBB'
b = 'RYBB'
x = []
a_counter = collections.Counter(a)
b_counter = collections.Counter(b)
print(a_counter)
print(b_counter)
for ch in b:
    if a_counter[ch]:
        x.append(min(a_counter[ch], b_counter[ch]) * ch)
print(x)
--output:--
Counter({'B': 3, 'Y': 1})
Counter({'B': 2, 'Y': 1, 'R': 1})
['Y', 'BB', 'BB']
Or, if you only want to step through each unique element in b:
x = []  # start again from an empty list
for ch in set(b):  # set order is arbitrary, so the output order may vary
    if a_counter[ch]:
        x.append(min(a_counter[ch], b_counter[ch]) * ch)
print(x)
--output:--
['Y', 'BB']
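As a possible shortcut, Counter objects support intersection with the & operator, which keeps the minimum count of each element. That matches the "lesser of the two" rule from the question. The sketch below uses elements() to expand the counts back into individual characters. Whether separate characters or grouped strings like 'BB' are wanted is an assumption about the desired output.

```python
from collections import Counter

a = 'YBBB'
b = 'RYBB'

# Intersection keeps min(count in b, count in a) for every character;
# characters missing from either string drop out.
common = Counter(b) & Counter(a)

# elements() repeats each character as many times as its count.
x = list(common.elements())
print(x)
```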

Combinations of a string with specific variable characters

How can I collect the combinations of a string, in which certain characters (but not all) are variable?
In other words, I have an input string and a character map. The character map specifies which characters are variable, and what they could be replaced with. The function then yields all possible combinations.
To put this in context, I'm trying to collect possible variations for an OCR output string that could have been misinterpreted by the OCR engine.
Example input:
"ABCD"
Example character map:
dict(
B=("X", "Z"),
D=("E")
)
Intended output:
[
"ABCD",
"ABCE",
"AXCD",
"AXCE",
"AZCD",
"AZCE"
]
You can use itertools.product:
>>> from itertools import product
>>> s = "ABCD"
>>> d = {"B": ["X", "Z"], "D": ["E"]}
>>> poss = [[c]+d.get(c,[]) for c in s]
>>> poss
[['A'], ['B', 'X', 'Z'], ['C'], ['D', 'E']]
>>> [''.join(p) for p in product(*poss)]
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
Note that I made d["D"] a list rather than simply a string for consistency.
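The same idea can be wrapped in a function that accepts the character map exactly as written in the question, where D=("E") is a plain string rather than a tuple. This is only a sketch. The name variations is hypothetical, and it assumes every replacement is a single character, because list() would split a longer string value into its letters.

```python
from itertools import product


def variations(text, character_map):
    """Return every string obtained by optionally replacing mapped characters."""
    # For each position: the original character followed by its alternatives.
    # list() handles tuples, lists, and one-character strings alike.
    choices = [[ch] + list(character_map.get(ch, ())) for ch in text]
    return ["".join(combo) for combo in product(*choices)]


character_map = dict(
    B=("X", "Z"),
    D=("E"),  # note: ("E") is just the string "E", not a tuple
)
result = variations("ABCD", character_map)
print(result)
```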
My own solution was very ugly and non-Pythonic, but here goes:
def fuzzy_search(string, character_map):
    all_variations = []
    for i, character in enumerate(string):
        if character in character_map:
            character_variations = list(character_map[character])
            character_variations.insert(0, character)
            if i == len(string) - 1:
                return [string[:-1] + variation for variation in character_variations]
            for variation in character_variations:
                sub_variations = fuzzy_search(string[i + 1:], character_map)
                for sub_variation in sub_variations:
                    all_variations.append(string[:i] + variation + sub_variation)
            return all_variations
    # No variable character found: the string itself is the only variation.
    return [string]
character_map = dict(
    B=("X", "Z"),
    D=("E")
)
print(fuzzy_search("ABCD", character_map))
Outputs:
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
I figured there should be way more elegant solutions than a recursive function with multiple loops.
