Sort strings based on the number of distinct characters - python

I am confused about why the code below, which sorts strings by their number of distinct characters, requires the set() and list() parts.
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']
strings.sort(key = lambda x: len(set(list(x))))
print(strings)
Thanks

The key to that code is the set() function, because it returns a set that contains each element only once. For example:
set('foo') -> {'f', 'o'}
set('aaaa') -> {'a'}
set('abab') -> {'a', 'b'}
Then len() counts the elements of that set, which gives the number of distinct characters to sort by.
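A minimal sketch of that idea, using the list from the question. The helper name distinct_count is made up for illustration:

```python
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

def distinct_count(s):
    # set(s) keeps one copy of each character; len() counts them
    return len(set(s))

counts = [distinct_count(s) for s in strings]
print(counts)  # [2, 4, 3, 1, 2]

by_distinct = sorted(strings, key=distinct_count)
print(by_distinct)  # ['aaaa', 'foo', 'abab', 'bar', 'card']
```

'foo' and 'abab' both have two distinct characters; they keep their original relative order because Python's sort is stable.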

Nice question! Let's peel the layers off the sort() call.
According to the Python docs on sort and sorted,
key specifies a function of one argument that is used to extract a comparison key from each list element: key=str.lower. The default value is None (compare the elements directly).
That is, sort takes a keyword argument key and expects it to be a function. Specifically, it calls key(x) on each string in the strings list and orders the list by the resulting values instead of by the usual lexical ordering. In the Python shell:
>>> key = lambda x: len(set(list(x)))
>>> ordering = [key(x) for x in strings]
>>> ordering
[2, 4, 3, 1, 2]
This could be any ordering scheme you like. Here, we want to order by the number of unique letters. That's where set and list come in. list("foo") will result in ['f', 'o', 'o']. Then we get len(list('foo')) == 3 -- the length of the word. Not the number of unique characters.
>>> key2 = lambda x: len(list(x))
>>> ordering2 = [key2(x) for x in strings]
>>> ordering2
[3, 4, 3, 4, 4]
So we use set and list to get a set of characters. A set is like a list, except that it contains only the unique elements. For instance, we can make a list of the characters in any word like this:
>>> list(strings[0])
['f', 'o', 'o']
And a set:
>>> set(list(strings[0]))
set(['o', 'f'])
The len() of that set is 2, so when sort compares the "foo" in strings[0] to the other elements of strings, it uses this value. For example:
>>> (len(set(strings[0][:])) < len(set(strings[1][:])))
True
Which gives us the ordering we want.
EDIT: @PeterGibson pointed out above that list(strings[i]) isn't needed. This is true because strings are iterable in Python, just like lists:
>>> set("foo")
set(['o', 'f'])
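As a quick check of that point, the sketch below compares the two forms on the question's list and then sorts with the shorter key:

```python
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

# set() accepts any iterable, so the intermediate list changes nothing
assert all(set(s) == set(list(s)) for s in strings)

strings.sort(key=lambda s: len(set(s)))
print(strings)  # ['aaaa', 'foo', 'abab', 'bar', 'card']
```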

Related

Sort a list of strings using a user-provided order in Python

I have a list of strings, say ['ayyaaauu', 'shhasyhh', 'shaash'] and an accompanying order list ['s', 'y', 'u', 'h', 'a'].
I want to sort the list of strings using the order list in place of the normal alphabetical order used in sorting.
sorted(['ayyaaauu', 'shhasyhh', 'shaash']) would return ['ayyaaauu', 'shaash', 'shhasyhh'], sorting using alphabetical order.
I want the sort function to return ['shhasyhh', 'shaash', 'ayyaaauu'], which is the alphabet order specified in the order list.
Similar questions like this one are only useful if you only consider the first element of the string. I want to sequentially consider the entire string if the previous letters are the same.
P.S. In practice, I am generating the list of strings using letters from the order list, so all the letters in the strings have a "correct" order.
You can use sorted(), and then generate the key parameter by using map() to map each letter to its priority (sorted() then uses tuple comparison to generate the ordering):
data = ['ayyaaauu', 'shhasyhh', 'shaash']
ordering = ['s', 'y', 'u', 'h', 'a']
priorities = {letter: index for index, letter in enumerate(ordering)}
result = sorted(data, key=lambda x: tuple(map(lambda y: priorities[y], x)))
# Prints ['shhasyhh', 'shaash', 'ayyaaauu']
print(result)
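To see why tuple comparison gives the desired order, it can help to print the keys themselves. This is only a sketch; priority_key is a hypothetical helper that does the same job as the lambda above:

```python
data = ['ayyaaauu', 'shhasyhh', 'shaash']
ordering = ['s', 'y', 'u', 'h', 'a']
priorities = {letter: index for index, letter in enumerate(ordering)}

def priority_key(word):
    # Hypothetical helper: one priority number per character
    return tuple(priorities[ch] for ch in word)

keys = {word: priority_key(word) for word in data}
print(keys['shaash'])    # (0, 3, 4, 4, 0, 3)
print(keys['shhasyhh'])  # (0, 3, 3, 4, 0, 1, 3, 3)

# The tuples first differ at index 2 (3 < 4), so 'shhasyhh' sorts first
result = sorted(data, key=priority_key)
print(result)  # ['shhasyhh', 'shaash', 'ayyaaauu']
```

A character that is missing from ordering would raise a KeyError here, which is acceptable under the question's assumption that every letter comes from the order list.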

Subtle difference when applying `set()` to find max count items in a list

A friend has asked this question, and I just cannot find a good explanation for it. (He knows how the max() and key works in this case)
Given a list of scores as this:
lst = ['A', 'B', 'B', 'B', 'C', 'C', 'C', 'E']
>>> max(lst, key=lst.count)
'B'
>>> max(set(lst), key=lst.count)
'C'
# if run min - will return different results - w/ and w/o set():
>>> min(lst, key=lst.count)
'A'
>>> min(set(lst), key=lst.count)
'E'
>>>
max and min return the first maximal / minimal element in an iterable.
lst.count("A") and lst.count("E") are equal (evaluating to 1), and so are lst.count("B") and lst.count("C") (evaluating to 3). A set is unordered in Python, and converting a list to a set does not preserve its order. (The internal order of a set is not exactly random, but arbitrary.)
This is the reason why the results differ.
If you want to keep the order, but have unique elements, you could do:
unique_lst = sorted(set(lst), key=lst.index)
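A short sketch of the effect: once the unique elements are back in order of first appearance, ties are resolved the same way as on the original list. The dict.fromkeys line is an alternative that assumes Python 3.7 or later.

```python
lst = ['A', 'B', 'B', 'B', 'C', 'C', 'C', 'E']

# Unique elements in order of first appearance
unique_lst = sorted(set(lst), key=lst.index)
print(unique_lst)  # ['A', 'B', 'C', 'E']

# Ties now resolve the same way as on the original list
most_common = max(unique_lst, key=lst.count)
least_common = min(unique_lst, key=lst.count)
print(most_common, least_common)  # B A

# Alternative (assumes Python 3.7+, where dicts keep insertion order)
assert list(dict.fromkeys(lst)) == unique_lst
```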

Sum of first elements in nested lists

I am trying to get the first element in a nested list and sum up the values.
eg.
nested_list = [[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']]
print(sum(i[0] for i in nested_list))
However, there are times in which the first element in a list is None instead
nested_list = [[1, 'a'], [None, 'b'], [3, 'c'], [4, 'd']]
new = []
for nest in nested_list:
    if not nest[0]:
        pass
    else:
        new.append(nest[0])
print(sum(new))
Wondering what is the better way that I can code this?
Just filter then, in this case testing for values that are not None:
sum(i[0] for i in nested_list if i[0] is not None)
A generator expression (and list, dict and set comprehensions) takes any number of nested for loops and if statements. The above is the equivalent of:
for i in nested_list:
    if i[0] is not None:
        i[0]  # used to sum()
Note how this mirrors your own code; rather than use if not ...: pass and else, I inverted the test to only allow values you can actually sum. If you need more loops or filters, add more for loops or if clauses in the same left-to-right order in which you would nest the equivalent statements, or use and or or to combine multiple tests in a single if filter.
In your specific case, just testing for if i[0] would also suffice; this would filter out None and 0, but the latter value would not make a difference to the sum anyway:
sum(i[0] for i in nested_list if i[0])
You already approached this in your own if test in the loop code.
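A small sketch comparing the two filters. The list is a made-up variation of the question's data with a 0 added:

```python
nested_list = [[1, 'a'], [None, 'b'], [0, 'c'], [4, 'd']]

explicit = sum(i[0] for i in nested_list if i[0] is not None)
truthy = sum(i[0] for i in nested_list if i[0])

# The truthiness test also skips the 0, which contributes nothing anyway
print(explicit, truthy)  # 5 5
```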
First of all, Python has no null; the equivalent of null in languages like Java, C#, JavaScript, etc. is None.
Secondly, we can use a filter in the generator expression. The most generic check is probably to test against Number from the numbers module:
from numbers import Number
print(sum(i[0] for i in nested_list if isinstance(i[0], Number)))
Number will usually make sure that we accept ints, longs, floats, complexes, etc. So we do not have to keep track of all objects in the Python world that are numerical ourselves.
Since it is also possible that the list contains empty sublists, we can also check for that:
from numbers import Number
print(sum(i[0] for i in nested_list if i and isinstance(i[0], Number)))
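As an illustration of what the combined test lets through, here is a sketch on made-up data that contains an empty sublist, a None and a string in first position:

```python
from numbers import Number

# Made-up sample: an empty sublist, a None and a string in first position
nested_list = [[1, 'a'], [], [None, 'b'], ['x', 'c'], [2.5, 'd']]

# "i and ..." short-circuits on the empty sublist, so i[0] is never evaluated for it
total = sum(i[0] for i in nested_list if i and isinstance(i[0], Number))
print(total)  # 3.5
```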

Filter list's elements by type of each element

I have a list with different types of data (string, int, etc.). I need to create a new list with, for example, only the int elements, and another list with only the string elements. How can I do it?
You can accomplish this with a list comprehension:
integers = [elm for elm in data if isinstance(elm, int)]
Here data is your list. The above creates a new list and populates it with the elements of data (elm) that meet the condition after the if, which checks whether the element is an instance of int. You can also use filter:
integers = list(filter(lambda elm: isinstance(elm, int), data))
The above keeps only the elements for which the lambda returns a true value, dropping all non-integers. You can apply the same approach to strings, using isinstance(elm, str) to check whether an element is a string.
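A short sketch on made-up data. One caveat worth knowing: bool is a subclass of int, so isinstance(True, int) is true. If booleans should not count as integers, an exact type test is one option:

```python
data = ['a', 1, 2, 'b', 9.2, True]

integers = [elm for elm in data if isinstance(elm, int)]
strings = [elm for elm in data if isinstance(elm, str)]
print(integers)  # [1, 2, True]
print(strings)   # ['a', 'b']

# type(elm) is int matches only exact ints, excluding bool
strict_integers = [elm for elm in data if type(elm) is int]
print(strict_integers)  # [1, 2]
```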
Sort the list by type, and then use groupby to group it:
>>> import itertools
>>> l = ['a', 1, 2, 'b', 'e', 9.2, 'l']
>>> l.sort(key=lambda x: str(type(x)))
>>> lists = [list(v) for k,v in itertools.groupby(l, lambda x: str(type(x)))]
>>> lists
[[9.2], [1, 2], ['a', 'b', 'e', 'l']]
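groupby only merges adjacent items, which is why the list is sorted first. If sorting the list is undesirable, a dict keyed by type gives the same grouping in one pass. This is a different technique from the one above, sketched here with collections.defaultdict:

```python
from collections import defaultdict

l = ['a', 1, 2, 'b', 'e', 9.2, 'l']

by_type = defaultdict(list)
for item in l:
    # Each type gets its own list; items keep their original order
    by_type[type(item)].append(item)

print(by_type[int])    # [1, 2]
print(by_type[str])    # ['a', 'b', 'e', 'l']
print(by_type[float])  # [9.2]
```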

Multi Dimensional List - Sum Integer Element X by Common String Element Y

I have a multi dimensional list:
multiDimList = [['a',1],['a',1],['a',1],['b',2],['c',3],['c',3]]
I'm trying to sum the instances of element [1] where element [0] is common.
To put it more clearly, my desired output is another multi dimensional list:
multiDimListSum = [['a',3],['b',2],['c',6]]
I see I can access, say the value '2' in multiDimList by
x = multiDimList [3][1]
so I can grab the individual elements, and could probably build some sort of function to do this job, but it would be disgusting.
Does anyone have a suggestion of how to do this pythonically?
Assuming your actual sequence has similar elements grouped together as in your example (all instances of 'a', 'b' etc. together), you can use itertools.groupby() and operator.itemgetter():
from itertools import groupby
from operator import itemgetter
[[k, sum(v[1] for v in g)] for k, g in groupby(multiDimList, itemgetter(0))]
# result: [['a', 3], ['b', 2], ['c', 6]]
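The grouping assumption matters. The sketch below uses a made-up list in which equal keys are not adjacent, and shows that sorting by the key first restores the expected totals:

```python
from itertools import groupby
from operator import itemgetter

# Made-up input where equal keys are not adjacent
shuffled = [['c', 3], ['a', 1], ['b', 2], ['a', 1], ['c', 3], ['a', 1]]

# Without sorting, every run of equal keys becomes its own group
unsorted_result = [[k, sum(v[1] for v in g)]
                   for k, g in groupby(shuffled, itemgetter(0))]
print(unsorted_result)
# [['c', 3], ['a', 1], ['b', 2], ['a', 1], ['c', 3], ['a', 1]]

# Sorting by the same key first makes equal keys adjacent
sorted_result = [[k, sum(v[1] for v in g)]
                 for k, g in groupby(sorted(shuffled, key=itemgetter(0)),
                                     itemgetter(0))]
print(sorted_result)  # [['a', 3], ['b', 2], ['c', 6]]
```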
Zero Piraeus's answer covers the case when field entries are grouped in order. If they're not, then the following is short and reasonably efficient.
from collections import Counter
from functools import reduce  # reduce is a builtin in Python 2 but must be imported in Python 3
reduce(lambda c, x: c.update({x[0]: x[1]}) or c, multiDimList, Counter())
This returns a collection, accessible by element name. If you prefer it as a list you can call the .items() method on it, but note that the order of the labels in the output may be different from the order in the input even in the cases where the input was consistently ordered.
You could use a dict to accumulate the total associated with each string:
d = {}
multiDimList = [['a',1],['a',1],['a',1],['b',2],['c',3],['c',3]]
for string, value in multiDimList:
    # Retrieves the current value in the dict if it exists or 0
    current_value = d.get(string, 0)
    d[string] = current_value + value
print(d)  # {'a': 3, 'b': 2, 'c': 6}
You can then access the value for b by using d["b"].
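If the result is needed in the list-of-lists shape from the question, one possible variation uses collections.defaultdict and converts back at the end. This is a sketch that assumes Python 3.7 or later, so that the dict keeps insertion order:

```python
from collections import defaultdict

multiDimList = [['a', 1], ['a', 1], ['a', 1], ['b', 2], ['c', 3], ['c', 3]]

totals = defaultdict(int)  # missing keys start at 0
for string, value in multiDimList:
    totals[string] += value

# Assumes Python 3.7+, where dicts keep insertion order
multiDimListSum = [[key, total] for key, total in totals.items()]
print(multiDimListSum)  # [['a', 3], ['b', 2], ['c', 6]]
```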
