Say I have a sorted list of strings as in:
['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
Now I want to sort based on the trailing numerical value for the Bs - so I have:
['A', 'B' , 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
One possible algorithm would be to hack up a regex like regex = re.compile(ur'(B)(\d*)'), find the indices of the first and last B, slice the list, sort the slice using the regex's second group, then insert the sorted slice back. However, this seems like too much of a hassle. Is there a way to write a key function that "leaves the item in place" if it does not match the regex and only sorts the items (sublists) that match?
Note: the above is just an example; I don't necessarily know the pattern (or I may want to also sort C's, or any string that has a trailing number in there). Ideally, I'm looking for an approach to the general problem of sorting only subsequences which match a given criterion (or failing that, just those that meet the specific criterion of a given prefix followed by a string of digits).
In the simple case where you just want to sort trailing digits numerically and their non-digit prefixes alphabetically, you need a key function which splits each item into non-digit and digit components as follows:
'AB123' -> ['AB', 123]
'CD' -> ['CD']
'456' -> ['', 456]
Note: In the last case, the empty string '' is not strictly necessary in CPython 2.x, as integers sort before strings – but that's an implementation detail rather than a guarantee of the language, and in Python 3.x it is necessary, because strings and integers can't be compared at all.
You can build such a key function using a list comprehension and re.split():
import re
def trailing_digits(x):
    return [
        int(g) if g.isdigit() else g
        for g in re.split(r'(\d+)$', x)
    ]
Here it is in action:
>>> s1 = ['11', '2', 'A', 'B', 'B1', 'B11', 'B2', 'B21', 'C', 'C11', 'C2']
>>> sorted(s1, key=trailing_digits)
['2', '11', 'A', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'C2', 'C11']
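As a quick check of what the key actually contains, the sketch below prints the raw pieces that re.split() produces for the three sample strings. With a capturing group anchored at the end of the string, re.split() also yields a trailing empty string, so the real keys are slightly longer than the simplified forms listed earlier (this does not affect the ordering). The snippet also shows, assuming Python 3, why the leading '' matters. The names samples, pieces and mixed_ok are only for illustration.

```python
import re

samples = ['AB123', 'CD', '456']
pieces = {s: re.split(r'(\d+)$', s) for s in samples}
for s in samples:
    print(s, '->', pieces[s])

# In Python 3 an int cannot be ordered against a str, so a key such as
# [456] could not be compared with ['CD']; the leading '' avoids that.
try:
    [456] < ['CD']
    mixed_ok = True
except TypeError:
    mixed_ok = False
print('int vs str comparable:', mixed_ok)
```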
Once you add the restriction that only strings with a particular prefix or prefixes have their trailing digits sorted numerically, things get a little more complicated.
The following function builds and returns a key function which fulfils the requirement:
def prefixed_digits(*prefixes):
    disjunction = '|'.join('^' + re.escape(p) for p in prefixes)
    pattern = re.compile(r'(?<=%s)(\d+)$' % disjunction)
    def key(x):
        return [
            int(g) if g.isdigit() else g
            for g in re.split(pattern, x)
        ]
    return key
The main difference here is that a precompiled regex is created (containing a lookbehind constructed from the supplied prefix or prefixes), and a key function using that regex is returned.
Here are some usage examples:
>>> s2 = ['A', 'B', 'B11', 'B2', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']
>>> sorted(s2, key=prefixed_digits('B'))
['A', 'B', 'B2', 'B11', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']
>>> sorted(s2, key=prefixed_digits('B', 'C'))
['A', 'B', 'B2', 'B11', 'B21', 'C', 'C2', 'C11', 'D12', 'D2']
>>> sorted(s2, key=prefixed_digits('B', 'D'))
['A', 'B', 'B2', 'B11', 'B21', 'C', 'C11', 'C2', 'D2', 'D12']
If called with no arguments, prefixed_digits() returns a key function which behaves identically to trailing_digits:
>>> sorted(s1, key=prefixed_digits())
['2', '11', 'A', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'C2', 'C11']
Caveats:
Due to a restriction in Python's re module regarding lookbehind syntax, multiple prefixes must all have the same length.
In Python 2.x, strings which are purely numeric will be sorted numerically regardless of which prefixes are supplied to prefixed_digits(). In Python 3, they'll cause an exception (except when called with no arguments, or in the special case of key=prefixed_digits('') – which will sort purely numeric strings numerically, and prefixed strings alphabetically). Fixing that may be possible with a significantly more complex regex, but I gave up trying after about twenty minutes.
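For what it's worth, both caveats can probably be sidestepped by dropping the lookbehind and checking the prefix in plain Python. The following is only a sketch, not a drop-in replacement: prefixed_digits_py3 is a hypothetical name, it assumes that none of the prefixes ends in a digit, and it returns tuples of a fixed shape (str, int, int) so that Python 3 never has to compare an int with a str.

```python
import re

def prefixed_digits_py3(*prefixes):
    """Hypothetical variant of prefixed_digits() that avoids lookbehind."""
    tail = re.compile(r'^(.*?)(\d+)$')  # (everything before, trailing digits)

    def key(x):
        m = tail.match(x)
        # Split only when the part before the digits is exactly one of the
        # prefixes, or when no prefixes were given at all.
        if m and (not prefixes or m.group(1) in prefixes):
            return (m.group(1), 1, int(m.group(2)))
        # Everything else keeps its plain string ordering.
        return (x, 0, 0)
    return key

s1 = ['11', '2', 'A', 'B', 'B1', 'B11', 'B2', 'B21', 'C', 'C11', 'C2']
s2 = ['A', 'B', 'B11', 'B2', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']

print(sorted(s2, key=prefixed_digits_py3('B')))
print(sorted(s2, key=prefixed_digits_py3('B', 'D')))
print(sorted(s1, key=prefixed_digits_py3()))
# Purely numeric strings no longer raise; here they sort as plain strings.
print(sorted(s1, key=prefixed_digits_py3('B')))
```

Because the prefix test is an ordinary membership check, prefixes of different lengths can be mixed freely.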
If I understand correctly, your ultimate goal is to sort sub-sequences,
while leaving alone the items that are not part of the sub-sequences.
In your example, the sub-sequence is defined as items starting with "B".
Your example list happens to contain items in lexicographic order,
which is a bit too convenient,
and can be distracting from finding a generalized solution.
Let's mix things up a little by using a different example list.
How about:
['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2']
Here, the items are no longer ordered (at least I tried to organize them so that they are not), neither the ones starting with "B", nor the others.
However, the items starting with "B" still form a single contiguous sub-sequence, occupying the single range 1-6 rather than split ranges for example as 0-3 and 6-7.
This again might be distracting, I will address that aspect further down.
If I understand your ultimate goal correctly, you would like this list to get sorted like this:
['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
To make this work, we need a key function that will return a tuple, such that:
First value:
If the item doesn't start with "B", then the index in the original list (or a value in the same order)
If the item starts with "B", then the index of the last item that didn't start with "B"
Second value:
If the item doesn't start with "B", then omit this
If the item starts with "B", then the numeric value
This can be implemented like this, and with some doctests:
def order_sublist(items):
    """
    >>> order_sublist(['A', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'C1', 'C11', 'C2'])
    ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']

    >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    """
    def key():
        ord1 = [0]
        def inner(item):
            if not item.startswith('B'):
                ord1[0] += 1
                return ord1[0],
            return ord1[0], int(item[1:] or 0)
        return inner
    return sorted(items, key=key())
In this implementation, the items get sorted by these keys:
[(1,), (1, 2), (1, 11), (1, 22), (1, 0), (1, 1), (1, 21), (2,), (3,), (4,), (5,)]
The items not starting with "B" keep their order, thanks to the first value in the key tuple, and the items starting with "B" get sorted thanks to the second value of the key tuple.
This implementation contains a few tricks that are worth explaining:
The key function returns a tuple of 1 or 2 elements, as explained earlier: the non-B items have one value, the B items have two.
The first value of the tuple is not exactly the original index, but it's good enough. The value before the first B item is 1, all the B items use the same value, and the values after the B get an incremented value every time. Since (1,) < (1, x) < (2,) where x can be anything, these keys will get sorted as we wanted them.
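To make the tuple ordering concrete, here is a small check using the keys listed above (the variable names are just for illustration). A shorter tuple sorts before a longer one that starts with the same values, which is what keeps 'X' ahead of the B items.

```python
keys = [(1,), (1, 2), (1, 11), (1, 22), (1, 0), (1, 1), (1, 21), (2,), (3,), (4,), (5,)]

# A tuple that is a prefix of another one compares as smaller.
assert (1,) < (1, 0) < (1, 22) < (2,)

ordered_keys = sorted(keys)
print(ordered_keys)
```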
And now on to the "real" tricks :-)
What's up with the ord1 = [0] and ord1[0] += 1? This is a technique to change a non-local value in a function. Simply using ord1 = 0 and ord1 += 1 would not work, because ord1 would then be an immutable integer defined outside of the inner function. It can be read inside inner, but it cannot be reassigned there: an assignment such as ord1 += 1 makes ord1 a local variable of inner, shadowing the outer one. The global keyword would not help either, since the outer ord1 is not a module-level variable. But with ord1 being a list, it is visible inside inner, and its content can be modified. Note that the name itself still cannot be reassigned. If you replaced ord1[0] += 1 with ord1 = [ord1[0] + 1], which would result in the same value, it would not work, as in that case ord1 on the left side is a local variable, shadowing the ord1 in the outer scope, and not modifying its value.
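If you are on Python 3, the nonlocal statement offers another way to get the same effect without the one-element list. This is only a sketch of that alternative; make_key and count are names I made up for the example, and like the original it assumes the key function is called on the items from left to right.

```python
def make_key():
    count = 0  # a plain integer, no list wrapper needed

    def inner(item):
        nonlocal count  # Python 3 only: rebind the variable of the enclosing scope
        if not item.startswith('B'):
            count += 1
            return (count,)
        return (count, int(item[1:] or 0))
    return inner

items = ['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2']
result = sorted(items, key=make_key())
print(result)
```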
What's up with the key and inner functions? I thought it would be neat if the key function we will pass to sorted will be reusable. This simpler version works too:
def order_sublist(items):
    ord1 = [0]
    def inner(item):
        if not item.startswith('B'):
            ord1[0] += 1
            return ord1[0],
        return ord1[0], int(item[1:] or 0)
    return sorted(items, key=inner)
The important difference is that if you wanted to use inner twice, both uses would share the same ord1 list. That can be acceptable, as long as the integer value ord1[0] doesn't overflow during the use. In this case you won't use the function twice, and even if you did there would be no real risk of overflow (Python integers have arbitrary precision), but as a matter of principle, it's nice to make the function clean and reusable by wrapping it as I did in my initial proposal. What the key function does is simply initialize ord1 = [0] in its scope, define the inner function, and return the inner function. This way ord1 is effectively private, thanks to the closure. Every time you call key(), it returns a function that has its own private, fresh ord1 value.
Last but not least, notice the doctests: the """ ... """ comment is more than just documentation, it's executable tests. The >>> lines are code to execute in a Python shell, and the following lines are the expected output. If you have this program in a file called script.py, you can run the tests with python -m doctest script.py. When all tests pass, you get no output. When a test fails, you get a nice report. It's a great way to verify that your program works, through demonstrated examples. You can have multiple test cases, separated by blank lines, to cover interesting corner cases. In this example there are two test cases, with your original sorted input, and the modified unsorted input.
However, @zero-piraeus has made an interesting remark:
I can see that your solution relies on sorted() scanning the list left-to-right (which is reasonable – I can't imagine TimSort is going to be replaced or radically changed any time soon – but not guaranteed by Python AFAIK, and there are sorting algorithms that don't work like that).
I tried to be self-critical and question whether relying on left-to-right scanning is reasonable.
But I think it is.
After all, the sorting really happens based on the keys,
not the actual values.
I think most likely Python does something like this:
Take a list of the key values with [key(value) for value in input], visiting the values from left to right.
zip the list of keys with the original items
Apply whatever sorting algorithm on the zipped list, comparing items by the first value of the zip, and swapping items
At the end, return the sorted items with return [t[1] for t in zipped]
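The steps above can be sketched as follows. This is only my mental model written out as code, not CPython's actual implementation, and sorted_by_key is a made-up name. One detail I had to add: the original position is included in each tuple so that ties between equal keys never fall through to comparing the items themselves, which also keeps the sort stable.

```python
def sorted_by_key(values, key):
    # 1. Compute the keys, visiting the values from left to right.
    keys = [key(value) for value in values]
    # 2. Zip the keys with the original positions and items.
    zipped = list(zip(keys, range(len(values)), values))
    # 3. Sort the tuples; the comparison is decided by the key, then the position.
    zipped.sort()
    # 4. Return just the items, now in sorted order.
    return [t[2] for t in zipped]

words = ['pear', 'fig', 'apple', 'kiwi']
by_length = sorted_by_key(words, key=len)
print(by_length)
```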
When building the list of key values,
it could work on multiple threads,
let's say two, the first thread one populating the first half and the second thread populating the second half in parallel.
That would mess up the ord1[0] += 1 trick.
But I doubt it does this kind of optimization,
as it simply seems overkill.
But to eliminate any shadow of doubt,
we can follow this alternative implementation strategy ourselves,
though the solution becomes a bit more verbose:
def order_sublist(items):
    """
    >>> order_sublist(['A', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'C1', 'C11', 'C2'])
    ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']

    >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    """
    ord1 = 0
    zipped = []
    for item in items:
        if not item.startswith('B'):
            ord1 += 1
        zipped.append((ord1, item))

    def key(item):
        if not item[1].startswith('B'):
            return item[0],
        return item[0], int(item[1][1:] or 0)

    return [v for _, v in sorted(zipped, key=key)]
Do note that thanks to the doctests,
we have an easy way to verify that the alternative implementation still works as before.
What if you wanted this example list:
['X', 'B', 'B1', 'B11', 'B2', 'B22', 'C', 'Q1', 'C11', 'C2', 'B21']
To get sorted like this:
['X', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'Q1', 'C11', 'C2', 'B22']
That is, the items starting with "B" sorted by their numeric value,
even when they don't form a contiguous sub-sequence?
That won't be possible with a magical key function.
It certainly is possible though, with some more legwork.
You could:
Create a list with the original indexes of the items starting with "B"
Create a list with the items starting with "B" and sort it with whatever way you like
Write back the content of the sorted list at the original indexes
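A minimal sketch of those three steps might look like this. The function name sort_matching and its predicate and key parameters are my own choices for the example; it returns a new list rather than modifying the input.

```python
def sort_matching(items, predicate, key):
    # 1. Remember the original indexes of the items to be sorted.
    indexes = [i for i, item in enumerate(items) if predicate(item)]
    # 2. Sort just those items.
    ordered = sorted((items[i] for i in indexes), key=key)
    # 3. Write them back at the remembered indexes.
    result = list(items)
    for i, value in zip(indexes, ordered):
        result[i] = value
    return result

items = ['X', 'B', 'B1', 'B11', 'B2', 'B22', 'C', 'Q1', 'C11', 'C2', 'B21']
result = sort_matching(
    items,
    predicate=lambda s: s.startswith('B'),
    key=lambda s: int(s[1:] or 0),
)
print(result)
```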
If you need help with this last implementation, let me know.
Most of the answers focused on the B's while I needed a more general solution as noted. Here's one:
import re

def _order_by_number(items):
    regex = re.compile(r'(.*?)(\d*)$')  # pass as an argument for generality
    keys = {k: regex.match(k) for k in items}
    keys = {k: (v.groups()[0], int(v.groups()[1] or 0))
            for k, v in keys.items()}
    items.sort(key=keys.__getitem__)
I am still looking for a magic key, however, that would leave stuff in place.
You can use the natsort module:
>>> from natsort import natsorted
>>>
>>> a = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
>>> natsorted(a)
['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
If the elements that are to be sorted are all adjacent to each other in the list:
You can use cmp in the sorted()-function instead of key:
s1 = ['A', 'B', 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']

def compare(a, b):
    if (a[0], b[0]) == ('B', 'B'):  # change to whichever condition you'd like
        inta = int(a[1:] or 0)
        intb = int(b[1:] or 0)
        return cmp(inta, intb)  # change to whichever mode of comparison you'd like
    else:
        return 0  # if one of a, b doesn't fulfill the condition, do nothing

# Python 2 only: sorted() lost its cmp argument (and the cmp builtin) in Python 3
sorted(s1, cmp=compare)
This assumes transitivity for the comparator, which is not true for a more general case. This is also much slower than using key, but the advantage is that it can take context into account (to a small extent).
If the elements that are to be sorted are not all adjacent to each other in the list:
You could generalise the comparison-type sorting algorithms by checking every other element in the list, and not just neighbours:
s1 = ['11', '2', 'A', 'B', 'B11', 'B21', 'B1', 'B2', 'C', 'C11', 'C2', 'B09', 'C8', 'B19']

def cond1(a):  # change this to whichever condition you'd like
    return a[0] == 'B'

def comparison(a, b):  # change this to whichever type of comparison you'd like to make
    inta = int(a[1:] or 0)
    intb = int(b[1:] or 0)
    return (inta > intb) - (inta < intb)  # same result as cmp(inta, intb), also works in Python 3

def n2CompareSort(alist, condition, comparison):
    for i in range(len(alist)):
        for j in range(i):
            if condition(alist[i]) and condition(alist[j]):
                if comparison(alist[i], alist[j]) == -1:
                    alist[i], alist[j] = alist[j], alist[i]  # in-place swap

n2CompareSort(s1, cond1, comparison)
I don't think that any of this is less of a hassle than making a separate list/tuple, but it is "in-place" and leaves elements that don't fulfill our condition untouched.
You can use the following key function. It will return a tuple of the form (letter, number) if there is a number, or of the form (letter,) if there is no number. This works since ('A',) < ('A', 1).
import re
a = ['A', 'B' ,'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
regex = re.compile(r'(\d+)')
def order(e):
    num = regex.findall(e)
    if num:
        num = int(num[0])
        return e[0], num
    return e,
print(sorted(a, key=order))
>> ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
If I'm understanding your question correctly, you are trying to sort an array by two attributes: the letter and the trailing 'number'.
You could just do something like
data = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
data.sort(key=lambda elem: (elem[0], int(elem[1:])))
but since this would throw an exception for elements without a number trailing them, we can go ahead and just make a function (we shouldn't be using lambda anyways!)
def sortKey(elem):
    try:
        attribute = (elem[0], int(elem[1:]))
    except ValueError:  # no trailing number, e.g. 'A'
        attribute = (elem[0], 0)
    return attribute
With this sorting key function made, we can sort the element in place by
data.sort(key=sortKey)
Also, you could just go ahead and adjust the sortKey function to give priority to certain letters if you wanted to.
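One way to read "give priority" is to put a rank in front of the tuple, so that chosen letters come before everything else. This is just a sketch under that assumption; PRIORITY and sort_key_with_priority are hypothetical names, and which letters get which rank is up to you.

```python
PRIORITY = {'C': 0}  # hypothetical: letters listed here sort before all others

def sort_key_with_priority(elem):
    try:
        number = int(elem[1:])
    except ValueError:  # no trailing number, e.g. 'A'
        number = 0
    # Rank first, then the letter, then the trailing number.
    return (PRIORITY.get(elem[0], 1), elem[0], number)

data = ['A', 'B', 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
data.sort(key=sort_key_with_priority)
print(data)
```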
To answer precisely what you describe you can do this :
l = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2', 'D']
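def custom_sort(data, c):
    # index of the first item starting with c
    s = next(i for i, x in enumerate(data) if x.startswith(c))
    # index of the first later item not starting with c (end of the list if there is none)
    e = next((i for i, x in enumerate(data) if i > s and not x.startswith(c)), len(data))
    return data[:s] + sorted(data[s:e], key=lambda d: int(d[1:] or -1)) + data[e:]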
print(custom_sort(l, "B"))
If you want a complete sort, you can simply do this (as @Mike JS Choi answered, but simpler):
output = sorted(l, key=lambda elem: (elem[0], int(elem[1:] or -1)))
You can use ord() to transform, for example, 'B11' into a numerical value:
cells = ['B11', 'C1', 'A', 'B1', 'B2', 'B21', 'B22', 'C11', 'C2', 'B']
conv_cells = []

# Transform each expression into a numerical value.
for x, cell in enumerate(cells):
    # Weight by letter so that the letter dominates the order
    # (assumes uppercase letters and small trailing numbers).
    val = ord(cell[0]) * (ord(cell[0]) - 65)
    if len(cell) > 1:
        val += int(cell[1:])
    conv_cells.append((val, x))  # list of tuples (num_val, index)

# Display the result.
for x in sorted(conv_cells):
    print(str(cells[x[1]]) + ' - ' + str(x[0]))
If you wish to sort with different rules for different subgroups, you may use tuples as sorting keys. In this case items are grouped and sorted layer by layer: first by the first tuple item, then within each subgroup by the second tuple item, and so on. This allows us to have different sorting rules in different subgroups. The only limitation is that items must be comparable within each group. For example, you cannot have int and str type keys in the same subgroup, but you can have them in different subgroups.
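A tiny illustration of that limitation, assuming Python 3 (the variable names are made up for the example): tuples whose first items differ are never compared on their second items, so mixing int and str across subgroups is fine, while mixing them inside one subgroup raises TypeError.

```python
# int keys inside the 'B' subgroup, str keys inside the 'C' subgroup: fine
different_groups = sorted([('B', 11), ('B', 2), ('C', '2'), ('C', '11')])
print(different_groups)

# int and str keys inside the same 'B' subgroup: not comparable
try:
    sorted([('B', 2), ('B', '11')])
    mixed_failed = False
except TypeError:
    mixed_failed = True
print('mixed types in one subgroup failed:', mixed_failed)
```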
Let's try to apply it to the task. We will prepare tuples with element types (str, int) for B elements, and tuples with (str, str) for all others.
def sorter(elem):
    letter, num = elem[0], elem[1:]
    if letter == 'B':
        return letter, int(num or 0)  # hack - if we've got `''` as num, replace it with `0`
    else:
        return letter, num
data = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
sorted(data, key=sorter)
# returns
['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
UPDATE
If you prefer it in one line:
data = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
sorted(data, key=lambda elem: (elem[0], int(elem[1:] or 0) if elem[0] == 'B' else elem[1:]))
# result
['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
Anyway, these key functions are quite simple, so you can adapt them to your real needs.
import numpy as np
def sort_with_prefix(items, prefix):
    alist = np.array(items)
    ix = np.where([l.startswith(prefix) for l in items])
    alist[ix] = [prefix + str(n or '')
                 for n in np.sort([int(l.split(prefix)[-1] or 0)
                                   for l in alist[ix]])]
    return alist.tolist()
For example:
l = ['A', 'B', 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
print(sort_with_prefix(l, 'B'))
>> ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
Using just key and the precondition that the sequence is already 'sorted':
import re
s = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
def subgroup_ordinate(element):
    # Split the sequence element values into groups and ordinal values.
    # use a simple regex and int() in this case
    m = re.search('(B)(.+)', element)
    if m:
        subgroup = m.group(1)
        ordinate = int(m.group(2))
    else:
        subgroup = element
        ordinate = None
    return (subgroup, ordinate)
print sorted(s, key=subgroup_ordinate)
['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
The subgroup_ordinate() function does two things: it identifies the groups to be sorted and also determines the ordinal number within the groups. This example uses a regular expression, but the function could be arbitrarily complex. For example, we can change it to ur'(B|C)(.+)' and sort both B and C sequences.
['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
Reading the bounty question carefully I note the requirement 'sorts some values while leaving others "in place"'. Defining the comparison function to return 0 for elements that are not in subgroups would leave these elements where they were in the sequence.
s2 = ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'A', 'C', 'C1', 'C2', 'C11']
def compare((_a, a), (_b, b)):
    return 0 if a is None or b is None else cmp(a, b)
print sorted(s2, compare, subgroup_ordinate)
['X', 'B', 'B1', 'B2', 'B11', 'B21', 'A', 'C', 'C1', 'C2', 'C11']
import re
from collections import OrderedDict
a = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
groups = OrderedDict()

def get_str(item):
    _str = list(map(str, re.findall(r"[A-Za-z]", item)))
    return _str

def get_digit(item):
    _digit = list(map(int, re.findall(r"\d+", item)))
    return _digit

for item in a:
    _str = get_str(item)
    groups[_str[0]] = sorted([get_digit(dig) for dig in a if _str[0] in dig])

nested_result = [[("{0}{1}".format(k, v[0]) if v else k) for v in groups[k]] for k in groups.keys()]
print(nested_result)
# >>> [['A'], ['B', 'B1', 'B2', 'B11', 'B21', 'B22'], ['C', 'C1', 'C2', 'C11']]

result = []
for k in groups.keys():
    for v in groups[k]:
        result.append("{0}{1}".format(k, v[0]) if v else k)
print (result)
# >>> ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
If you want to sort an arbitrary subset of elements while leaving other elements in place, it can be useful to design a view over the original list.
The idea of a view in general is that it's like a lens over the original list, but modifying it will manipulate the underlying original list.
Consider this helper class:
class SubList:
    def __init__(self, items, predicate):
        self.items = items
        self.indexes = [i for i in range(len(items)) if predicate(items[i])]

    @property
    def values(self):
        return [self.items[i] for i in self.indexes]

    def sort(self, key):
        for i, v in zip(self.indexes, sorted(self.values, key=key)):
            self.items[i] = v
The constructor saves the original list in self.items, and the original indexes in self.indexes, as determined by predicate. In your examples, the predicate function can be this:
def predicate(item):
    return item.startswith('B')
Then, the values property is the lens over the original list,
returning a list of values picked from the original list by the original indexes.
Finally, the sort function uses self.values to sort,
and then modifies the original list.
Consider this demo with doctests:
def demo(values):
    """
    >>> demo(['X', 'b3', 'a', 'b1', 'b2'])
    ['X', 'b1', 'a', 'b2', 'b3']
    """
    def predicate(item):
        return item.startswith('b')
    sub = SubList(values, predicate)

    def key(item):
        return int(item[1:])
    sub.sort(key)

    return values
Notice how SubList is used only as a tool through which to manipulate the input values. After the sub.sort call, values is modified, with elements to sort selected by the predicate function, and sorted according to the key function, and all other elements never moved.
Using this SubList helper with appropriate predicate and key functions,
you can sort an arbitrary selection of elements of a list.
def compound_sort(input_list, natural_sort_prefixes=()):
    padding = '{:0>%s}' % len(max(input_list, key=len))
    return sorted(
        input_list,
        key=lambda li: ''.join(
            [li for c in '_' if not li.startswith(natural_sort_prefixes)] or
            [c for c in li if not c.isdigit()] +
            [c for c in padding.format(li) if c.isdigit()]
        )
    )
This sort method receives:
input_list: The list to be sorted,
natural_sort_prefixes: A string or a tuple of strings.
List items targeted by the natural_sort_prefixes will be sorted naturally. Items not matching those prefixes will be sorted lexicographically.
This method assumes that the list items are structured as one or more non-numerical characters followed by one or more digits.
It should be more performant than solutions using regex, and doesn't depend on external libraries.
You can use it like:
print(compound_sort(['A', 'B', 'B11', 'B1', 'B2', 'C11', 'C2'], natural_sort_prefixes=("A", "B")))
# ['A', 'B', 'B1', 'B2', 'B11', 'C11', 'C2']
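The zero-padding idea behind this approach can also be seen in isolation. The sketch below is a simplified variation rather than the function above: it pads with str.zfill instead of a format string, applies natural ordering to every item, and uses made-up names (padded_key, width). It keeps the same assumption that items are letters followed by digits.

```python
items = ['B11', 'B1', 'C2', 'B2', 'B', 'C11']
width = max(len(s) for s in items)  # enough room for the longest run of digits

def padded_key(s):
    letters = ''.join(c for c in s if not c.isdigit())
    digits = ''.join(c for c in s if c.isdigit())
    # Left-pad the digits with zeros so that string order equals numeric order.
    return letters + digits.zfill(width)

padded = [padded_key(s) for s in items]
print(padded)
natural = sorted(items, key=padded_key)
print(natural)
```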