Sort sets of objects by numbers in python - python

I have a sequence of lists such as:
>>> result
[['Human_COII_1000-4566_hsp', 'Human_COII_500-789_hsp', 'Human_COII_100-300_hsp'], ['Human_COI_100-300_hsp', 'Human_COI_500-789_hsp', 'Human_COI_1000-4566_hsp']]
and within each list I would like to sort the items by their number-number part, to get:
[['Human_COII_100-300_hsp', 'Human_COII_500-789_hsp', 'Human_COII_1000-4566_hsp'], ['Human_COI_100-300_hsp', 'Human_COI_500-789_hsp', 'Human_COI_1000-4566_hsp']]
I tried:
for i in result:
    sorted(i)
but the order is not the one I wanted.

You could build a new sorted list with a list comprehension:
>>> import re
>>> x
[set(['Human_COII_1000-4566_hsp', 'Human_COII_100-300_hsp', 'Human_COII_500-789_hsp']), set(['Human_COI_100-300_hsp', 'Human_COI_500-789_hsp', 'Human_COI_1000-4566_hsp'])]
>>>
>>> # for python2
>>> [sorted(y, key=lambda item: map(int, re.findall(r'\d+', item))) for y in x]
[['Human_COII_100-300_hsp', 'Human_COII_500-789_hsp', 'Human_COII_1000-4566_hsp'], ['Human_COI_100-300_hsp', 'Human_COI_500-789_hsp', 'Human_COI_1000-4566_hsp']]
>>>
>>> # python3
>>> [sorted(y, key=lambda item: tuple(map(int, re.findall(r'\d+', item)))) for y in x]
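As a minimal sketch of the same idea for Python 3 (the helper name numeric_key is made up for illustration): sorted returns a new list rather than modifying its argument, which is why the loop in the question appears to do nothing, and the key function turns each name into a tuple of the integers it contains.

```python
import re

def numeric_key(name):
    # Hypothetical helper: 'Human_COII_100-300_hsp' -> (100, 300)
    return tuple(int(n) for n in re.findall(r'\d+', name))

result = [
    ['Human_COII_1000-4566_hsp', 'Human_COII_500-789_hsp', 'Human_COII_100-300_hsp'],
    ['Human_COI_100-300_hsp', 'Human_COI_500-789_hsp', 'Human_COI_1000-4566_hsp'],
]

# sorted() returns a new list, so the results have to be collected...
sorted_result = [sorted(group, key=numeric_key) for group in result]

# ...or each inner list can be sorted in place with list.sort()
for group in result:
    group.sort(key=numeric_key)

print(sorted_result)
```

Comparing tuples of integers avoids the string comparison in which '1000' sorts before '500'.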

Related

Python: Split a list into multiple lists based on a subset of elements [duplicate]

I am trying to split a list that I have into individual lists whenever a specific character or a group of characters occur.
eg.
Main_list = [ 'abcd 1233','cdgfh3738','hryg21','**L**','gdyrhr657','abc31637','**R**','7473hrtfgf'...]
I want to break this list and save it into a sublist whenever I encounter an 'L' or an 'R'
Desired Result:
sublist_1 = ['abcd 1233','cdgfh3738','hryg21']
sublist_2 = ['gdyrhr657','abc31637']
sublist_3 = ['7473hrtfgf'...]
Is there a built in function or a quick way to do this ?
Edit: I do not want the delimiter to be in the list
Use a dictionary for a variable number of variables.
In this case, you can use itertools.groupby to efficiently separate your lists:
L = ['abcd 1233','cdgfh3738','hryg21','**L**',
'gdyrhr657','abc31637','**R**','7473hrtfgf']
from itertools import groupby
# define separator keys
def split_condition(x):
    return x in {'**L**', '**R**'}
# define groupby object
grouper = groupby(L, key=split_condition)
# convert to dictionary via enumerate
res = dict(enumerate((list(j) for i, j in grouper if not i), 1))
print(res)
{1: ['abcd 1233', 'cdgfh3738', 'hryg21'],
2: ['gdyrhr657', 'abc31637'],
3: ['7473hrtfgf']}
Consider using one of many helpful tools from a library, i.e. more_itertools.split_at:
Given
import more_itertools as mit
lst = [
"abcd 1233", "cdgfh3738", "hryg21", "**L**",
"gdyrhr657", "abc31637", "**R**",
"7473hrtfgf"
]
Code
result = list(mit.split_at(lst, pred=lambda x: set(x) & {"L", "R"}))
Demo
sublist_1, sublist_2, sublist_3 = result
sublist_1
# ['abcd 1233', 'cdgfh3738', 'hryg21']
sublist_2
# ['gdyrhr657', 'abc31637']
sublist_3
# ['7473hrtfgf']
Details
The more_itertools.split_at function splits an iterable at positions that meet a special condition. The conditional function (predicate) happens to be a lambda function, which is equivalent to and substitutable with the following regular function:
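def pred(x):
    a = set(x)
    b = {"L", "R"}
    return a.intersection(b)
Whenever the characters of string x intersect with L or R, the predicate returns a non-empty (truthy) set, and the split occurs at that position. Note that this matches any item containing an "L" or "R" character, not only the exact delimiters.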
Install this package at the commandline via > pip install more_itertools.
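Because that predicate fires on any item containing an "L" or "R" character, a stricter variant may be safer for real data. The following is only a sketch, in plain Python with no third-party dependency; split_on is a hypothetical helper name, not part of more_itertools, and it compares whole items against a set of delimiters.

```python
def split_on(items, delimiters):
    # Hypothetical helper: start a new sublist at each exact delimiter match
    groups, current = [], []
    for item in items:
        if item in delimiters:
            groups.append(current)
            current = []
        else:
            current.append(item)
    groups.append(current)
    return groups

lst = ["abcd 1233", "cdgfh3738", "hryg21", "**L**",
       "gdyrhr657", "abc31637", "**R**", "7473hrtfgf"]

sublists = split_on(lst, {"**L**", "**R**"})
print(sublists)

# An item that merely contains an "L" is kept as data, not treated as a delimiter
print(split_on(["aLb", "**L**", "c"], {"**L**", "**R**"}))
```

The delimiters themselves are dropped from the output, which matches the edit in the question.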
@Polyhedronic, you can also try this.
>>> import re
>>> Main_list = [ 'abcd 1233','cdgfh3738','hryg21','**L**','gdyrhr657','abc31637','**R**','7473hrtfgf']
>>>
>>> s = ','.join(Main_list)
>>> s
'abcd 1233,cdgfh3738,hryg21,**L**,gdyrhr657,abc31637,**R**,7473hrtfgf'
>>>
>>> items = re.split(r'\*\*R\*\*|\*\*L\*\*', s)
>>> items
['abcd 1233,cdgfh3738,hryg21,', ',gdyrhr657,abc31637,', ',7473hrtfgf']
>>>
>>> output = [[a for a in item.split(',') if a] for item in items]
>>> output
[['abcd 1233', 'cdgfh3738', 'hryg21'], ['gdyrhr657', 'abc31637'], ['7473hrtfgf']]
>>>
>>> sublist_1 = output[0]
>>> sublist_2 = output[1]
>>> sublist_3 = output[2]
>>>
>>> sublist_1
['abcd 1233', 'cdgfh3738', 'hryg21']
>>>
>>> sublist_2
['gdyrhr657', 'abc31637']
>>>
>>> sublist_3
['7473hrtfgf']
>>>

Replace in strings of list

I can use re.sub easily in single string like this:
>>> a = "ajhga':&+?%"
>>> a = re.sub('[.!,;+?&:%]', '', a)
>>> a
"ajhga'"
If I use this on a list of strings, I do not get the expected result. What I am doing is:
>>> a = ["abcd:+;", "(l&'kka)"]
>>> for x in a:
...     x = re.sub(r"[()&':+]", '', x)
...
>>> a
['abcd:+;', "(l&'kka)"]
How can I strip these characters from the strings in a list?
>>> a = ["abcd:+;", "(l&'kka)"]
>>> a = [re.sub(r"[()&':+]", '', x) for x in a]
>>> a
['abcd;', 'lkka']
>>>
for index, x in enumerate(a):
    a[index] = re.sub(r"[()&':+]", '', x)
You're changing the value of x but not updating your list. enumerate is a function that yields an (index, value) tuple for each item of the list, so the result can be assigned back by index.
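To see the difference concretely, here is a small sketch (Python 3; the character class is the one from the question, written as a raw string, and the name unchanged is only for illustration). Rebinding the loop variable creates a new string but never touches the list, while assigning through the index does.

```python
import re

a = ["abcd:+;", "(l&'kka)"]
pattern = re.compile(r"[()&':+]")

# Rebinding the loop variable leaves the list untouched
for x in a:
    x = pattern.sub('', x)
unchanged = list(a)

# Assigning through the index updates the list in place
for index, x in enumerate(a):
    a[index] = pattern.sub('', x)

print(unchanged)
print(a)
```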

Filter List by Longest Element Containing a String

I want to group the items of a list by their last 4 digits and, for each group, print the longest item.
For example:
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
# want to return abcdabcd1234 and poiupoiupoiupoiu7890
In this case, we print the longer of the elements containing 1234, and the longer of the elements containing 7890. Finding the longest element containing a certain element is not hard, but doing it for all items in the list (different last four digits) efficiently seems difficult.
My attempt was to first identify all the different last 4 digits using a loop and a slice:
ids = []
for x in lst:
    ids.append(x[-4:])
ids = list(set(ids))
Next, I would search through the list by index, keeping a "max_length" variable and a "current_id" to find the longest element for each id. This is clearly very inefficient, and I was wondering what the best way to do this would be.
Use a dictionary:
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> d = {} # to keep the longest items for digits.
>>> for item in lst:
...     key = item[-4:] # last 4 characters
...     d[key] = max(d.get(key, ''), item, key=len)
...
>>> d.values() # list(d.values()) in Python 3.x
['abcdabcd1234', 'poiupoiupoiupoiu7890']
from collections import defaultdict
d = defaultdict(str)
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
for x in lst:
    if len(x) > len(d[x[-4:]]):
        d[x[-4:]] = x
To display the results:
for key, value in d.items():
    print(key, '=', value)
which produces:
1234 = abcdabcd1234
7890 = poiupoiupoiupoiu7890
itertools is great. Use groupby with a lambda to group items that share the same ending (this relies on such items being adjacent in the list, as they are here), and from there it is easy:
>>> from itertools import groupby
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> [max(y, key=len) for x, y in groupby(lst, lambda l: l[-4:])]
['abcdabcd1234', 'poiupoiupoiupoiu7890']
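One caveat worth illustrating: groupby only merges consecutive items with equal keys. The sketch below (Python 3; suffix is a hypothetical key function, and the list is reordered on purpose) shows what happens when matching items are not adjacent, and how sorting by the same key first restores one group per ending.

```python
from itertools import groupby

def suffix(item):
    # Hypothetical key function: the last 4 characters
    return item[-4:]

lst = ['abcd1234', 'gqweri7890', 'abcdabcd1234', 'poiupoiupoiupoiu7890']

# Unsorted: the two '1234' items are not adjacent, so they land in separate groups
unsorted_groups = [max(g, key=len) for _, g in groupby(lst, key=suffix)]

# Sorted by the same key first: one group per suffix
longest = [max(g, key=len) for _, g in groupby(sorted(lst, key=suffix), key=suffix)]

print(unsorted_groups)
print(longest)
```

The sort costs O(n log n), so for large unsorted inputs the dictionary approaches shown earlier may be the simpler choice.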
Slightly more generic
import string
import collections
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
# key each item by the digits it contains (Python 3; str.translate(None, ...) only works in Python 2)
z = [(''.join(c for c in x if c in string.digits), x) for x in lst]
x = collections.defaultdict(list)
for a, b in z:
    x[a].append(b)
for k in x:
    print(k, max(x[k], key=len))
1234 abcdabcd1234
7890 poiupoiupoiupoiu7890

How to split a list into subsets based on a pattern?

I'm doing this but it feels this can be achieved with much less code. It is Python after all. Starting with a list, I split that list into subsets based on a string prefix.
# Splitting a list into subsets
# expected outcome:
# [['sub_0_a', 'sub_0_b'], ['sub_1_a', 'sub_1_b']]
mylist = ['sub_0_a', 'sub_0_b', 'sub_1_a', 'sub_1_b']
def func(l, newlist=[], index=0):
    newlist.append([i for i in l if i.startswith('sub_%s' % index)])
    # create a new list without the items in newlist
    l = [i for i in l if i not in newlist[index]]
    if len(l):
        index += 1
        func(l, newlist, index)
func(mylist)
You could use itertools.groupby:
>>> import itertools
>>> mylist = ['sub_0_a', 'sub_0_b', 'sub_1_a', 'sub_1_b']
>>> for k, v in itertools.groupby(mylist, key=lambda x: x[:5]):
...     print(k, list(v))
...
sub_0 ['sub_0_a', 'sub_0_b']
sub_1 ['sub_1_a', 'sub_1_b']
or exactly as you specified it:
>>> [list(v) for k,v in itertools.groupby(mylist,key=lambda x:x[:5])]
[['sub_0_a', 'sub_0_b'], ['sub_1_a', 'sub_1_b']]
Of course, the common caveats apply (Make sure your list is sorted with the same key you're using to group), and you might need a slightly more complicated key function for real world data...
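As an illustration of a "slightly more complicated key function", the sketch below (Python 3) assumes names of the form sub_<number>_<suffix>; prefix_key is a hypothetical helper. A fixed slice such as x[:5] would cut 'sub_10_a' down to 'sub_1', so the key extracts the whole number with a regular expression, and the list is sorted with that same key before grouping.

```python
import re
from itertools import groupby

def prefix_key(name):
    # Hypothetical key: the numeric index after 'sub_', as an int (-1 if absent)
    match = re.match(r'sub_(\d+)_', name)
    return int(match.group(1)) if match else -1

mylist = ['sub_10_a', 'sub_2_b', 'sub_10_b', 'sub_2_a']

# Sort with the same key used for grouping, as the caveat above requires
ordered = sorted(mylist, key=prefix_key)
subsets = [list(group) for _, group in groupby(ordered, key=prefix_key)]
print(subsets)
```

Returning an int rather than the matched string makes index 2 sort before index 10.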
In [28]: mylist = ['sub_0_a', 'sub_0_b', 'sub_1_a', 'sub_1_b']
In [29]: lis=[]
In [30]: for x in mylist:
   ....:     i = x.split("_")[1]
   ....:     try:
   ....:         lis[int(i)].append(x)
   ....:     except IndexError:
   ....:         lis.append([])
   ....:         lis[-1].append(x)
   ....:
In [31]: lis
Out[31]: [['sub_0_a', 'sub_0_b'], ['sub_1_a', 'sub_1_b']]
Use itertools' groupby:
from itertools import groupby
def get_field_sub(x): return x.split('_')[1]
mylist = sorted(mylist, key=get_field_sub)
[(x, list(y)) for x, y in groupby(mylist, get_field_sub)]

Nested List and For Loop

Consider this:
list = 2*[2*[0]]
for y in range(0,2):
    for x in range(0,2):
        if x == 0:
            list[x][y] = 1
        else:
            list[x][y] = 2
print(list)
Result:
[[2,2],[2,2]]
Why isn't the result [[1,1],[2,2]]?
Because you are creating a list that is two references to the same sublist
>>> L = 2*[2*[0]]
>>> id(L[0])
3078300332L
>>> id(L[1])
3078300332L
so changes to L[0] will affect L[1] because they are the same list
The usual way to do what you want would be
>>> L = [[0]*2 for x in range(2)]
>>> id(L[0])
3078302124L
>>> id(L[1])
3078302220L
notice that L[0] and L[1] are now distinct
Alternatively, to build the desired result directly:
>>> [[x,x] for x in range(1,3)] # xrange in Python 2
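The aliasing can also be checked with the is operator instead of comparing id values by eye. This is a short Python 3 sketch; the names shared and distinct are illustrative only.

```python
# Multiplying the outer list copies references, not the inner list
shared = 2 * [2 * [0]]
shared[0][0] = 1

# The comprehension evaluates [0] * 2 once per row, giving distinct lists
distinct = [[0] * 2 for _ in range(2)]
distinct[0][0] = 1

print(shared[0] is shared[1])      # both rows are the same object
print(shared)
print(distinct[0] is distinct[1])  # rows are separate objects
print(distinct)
```

Multiplying the inner list, [0] * 2, is safe here because integers are immutable; only the repeated reference to a mutable inner list causes the surprise.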
