Categorize elements of a list in Python

I want to categorize the elements of a given list L1. This list can be arbitrarily long, so I am looking for an efficient way to do the following.
The list L1 contains several elements [e_1,...,e_N] that can be compared with a generic function called areTheSame(e1,e2). If this function returns True, it means that both elements belong to the same category.
At the end, I want to have another list L2, which in turn contains different lists [LC_1, ..., LC_M]. Each LC list contains all the elements from the same category.

Assuming that the function is an equivalence relation, i.e. reflexive, symmetric and transitive (and if it's not, the whole grouping does not seem to make much sense), it is enough to compare each word to one "representative" from each group, e.g. just the first or last element. If no such group exists, create a new group, e.g. using next with an empty list as the default value.
lst = "a list with some words with different lengths".split()
areTheSame = lambda x, y: len(x) == len(y)
res = []
for w in lst:
    l = next((x for x in res if areTheSame(w, x[0])), [])
    if l == []:
        res.append(l)
    l.append(w)
Result: [['a'], ['list', 'with', 'some', 'with'], ['words'], ['different'], ['lengths']]
Still, this has complexity O(n*k), where n is the number of words and k the number of groups. It would be more efficient if instead of areTheSame(x,y) you had a function getGroup(x), then you'd have O(n). That is, instead of testing whether two elements belong to the same group, that function would extract the attribute(s) that determine which group the element belongs to. In my example, that's just the len of the strings, but in your case it might be more complex.
import collections
getGroup = lambda x: len(x)
d = collections.defaultdict(list)
for w in lst:
    d[getGroup(w)].append(w)
Result: {1: ['a'], 4: ['list', 'with', 'some', 'with'], 5: ['words'], 9: ['different'], 7: ['lengths']}

I believe you can use the itertools.groupby function, but you might need to replace the areTheSame function with a keyfunc, i.e. a function that yields some kind of key for each element.
from itertools import groupby
L1 = sorted(L1, key=keyfunc)
L2 = [list(g) for _, g in groupby(L1, keyfunc)]
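As a minimal sketch of the sort-then-group idea, assuming the category can be expressed as a key: here len is used as the key to match the word-length example, and group_by_key is a hypothetical helper name, not something from the question.

```python
from itertools import groupby

def group_by_key(items, keyfunc):
    """Group items whose keyfunc values are equal (hypothetical helper)."""
    # groupby only merges *adjacent* items with equal keys, so sort by the same key first.
    ordered = sorted(items, key=keyfunc)
    return [list(group) for _, group in groupby(ordered, key=keyfunc)]

words = "a list with some words with different lengths".split()
groups = group_by_key(words, len)
print(groups)
# [['a'], ['list', 'with', 'some', 'with'], ['words'], ['lengths'], ['different']]
```

Note that the sort makes this O(n log n) and requires the keys to be orderable; the dictionary approach only needs the keys to be hashable.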

Related

How do I shorten this code with python list comprehension?

I am trying to code this,
def retrieve_smallest_letter(words):
    """
    Input: 'words' (lst) which represents a list of strings.
    Output: A new list containing, for each word, the smaller of its
    first and last letters.
    For Example:
    >>> lst = ['sandbox', 'portabello', 'lion', 'australia', 'salamander']
    >>> retrieve_smallest_letter(lst)
    ['s', 'o', 'l', 'a', 'r']
    """
My code
def retrieve_smallest_letter(words):
    lst = []
    for i in range(len(words)):
        first_word = words[i][0]
        last_word = words[i][len(words[i])-1]
        if first_word < last_word:
            lst.append(str(first_word))
        else:
            lst.append(str(last_word))
    return lst
How can I shorten this code with list comprehension?
The first thing to understand is that a list comprehension is fundamentally a for loop with restricted semantics:
r = [a for a in b if c]
is essentially syntactic sugar for
r = []
for a in b:
    if c:
        r.append(a)
so the first step is to get the problem into a "shape" which fits a list comprehension:
simple iteration and filtering
no assignments (because that's as yet not supported)
only one production
Using Python idiomatically also helps, so let's start by simplifying the existing loop:
iterate collections directly, Python has a powerful iterator protocol and you don't usually iterate by indexing unless absolutely necessary
use indexing or destructuring on the word, Python allows indexing from the end (using negative indices)
perform the selection "inline" as a single expression, using either a conditional expression or in this case the min builtin which does exactly what's needed
def retrieve_smallest_letter(words):
    lst = []
    for word in words:
        lst.append(min(word[0], word[-1]))
    return lst
or
def retrieve_smallest_letter(words):
    lst = []
    for first, *_, last in words:
        lst.append(min(first, last))
    return lst
from there, the conversion is trivial (there is no filtering so it can be ignored):
def retrieve_smallest_letter(words):
    return [min(first, last) for first, *_, last in words]
Yes
[min(word[0], word[-1]) for word in words]
lst = [min(w[0], w[-1]) for w in words]
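For comparison, here is a sketch of the same comprehension written with a conditional expression instead of min, which mirrors the original if/else more directly. Skipping empty strings with `if w` is an assumption on my part; the question does not say how they should be handled.

```python
def retrieve_smallest_letter(words):
    # Conditional expression in place of min(); w[-1] is the last character.
    # Empty strings are skipped (an assumption, the question does not cover them).
    return [w[0] if w[0] < w[-1] else w[-1] for w in words if w]

lst = ['sandbox', 'portabello', 'lion', 'australia', 'salamander']
print(retrieve_smallest_letter(lst))
# ['s', 'o', 'l', 'a', 'r']
```

Indexing with w[0] and w[-1] also works for one-letter words, whereas unpacking with `first, *_, last` needs at least two characters per word.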

Given a list of lists of strings, find most frequent pair of strings, second most frequent pair, ....., then most frequent triplet of strings, etc

I have a list that contains k lists of strings (each of these k lists do not have any duplicate string). We know the union of all possible strings (suppose we have n unique strings).
What we need to find is: What is the most frequent pair of strings (i.e., which 2 strings appear together the most across the k lists)? And the second most frequent pair of strings, the third most frequent pair of strings, etc. Also, I'd like to know the most frequent triplet of strings, the second most frequent triplet of strings, etc.
The only algorithm that I could think of to do this is of terrible complexity, where basically to solve for the most frequent pair, I'd enumerate all possible pairs out of the n strings (O(n^2)) and for each of them check how many lists have them (O(k)) and then I'll sort the results to get what I need, and so my overall complexity is O(n^2.x), ignoring the last sort.
Any ideas for a better algorithm time-wise? (that would hopefully work well for triplets of strings and quadruplets of strings, etc)? Code in python is best, but detailed pseudocode (and data structure, if relevant) or detailed general idea is fine, too!
For example:
If
myList=[['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']],
Then the expected output of the pairs question would be: 'AC','ACC' is the most frequent pair and 'AB','ACC' is the second most frequent pair.
You can use combinations, Counter and frozenset:
from itertools import combinations
from collections import Counter
combos = (combinations(i, r=2) for i in myList)
Counter(frozenset(i) for c in combos for i in c).most_common(2)
Output:
[(frozenset({'AC', 'ACC'}), 3), (frozenset({'AB', 'ACC'}), 2)]
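The same idea should extend to triplets and larger groups by making the combination size a parameter. A minimal sketch, where most_common_groups and its size/top parameters are names I made up for illustration:

```python
from collections import Counter
from itertools import combinations

def most_common_groups(lists, size, top=None):
    """Count how often each group of `size` strings appears together (hypothetical helper)."""
    counts = Counter(
        frozenset(combo)  # frozenset makes the count independent of order within a list
        for strings in lists
        for combo in combinations(strings, size)
    )
    # top=None returns every group, most frequent first.
    return counts.most_common(top)

myList = [['AB', 'AC', 'ACC'], ['AB', 'ACC'], ['ACC'], ['AC', 'ACC'], ['ACC', 'BB', 'AC']]
pairs = most_common_groups(myList, 2, top=2)
triplets = most_common_groups(myList, 3)
print(pairs)
print(triplets)
```

This relies on each inner list having no duplicate strings, as stated in the question. A list of length m contributes C(m, size) combinations, so the cost depends on the list lengths rather than on the number n of distinct strings.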
This is a general solution for combinations of any length:
import itertools
def most_freq(myList, n):
    d = {}  # create a dictionary that will keep combination:frequency
    for i in myList:
        if len(i) >= n:
            # sort first so the same strings always produce the same tuple
            for k in itertools.combinations(sorted(i), n):  # generates all combinations of length n in i
                if k in d:  # increases the frequency for this combination by 1
                    d[k] += 1
                else:
                    d[k] = 1
    # this just sorts the dictionary based on the value, in descending order
    return {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
Examples:
myList=[['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']]
>>> most_freq(myList,2)
{('AC', 'ACC'): 3, ('AB', 'ACC'): 2, ('AB', 'AC'): 1, ('AC', 'BB'): 1, ('ACC', 'BB'): 1}
>>> most_freq(myList,3)
{('AB', 'AC', 'ACC'): 1, ('AC', 'ACC', 'BB'): 1}
Found a snippet on my hard drive, check if it helps you:
from collections import Counter
from itertools import combinations
mylist = [['AB', 'AC', 'ACC'], ['AB','ACC'],['ACC'],['AC','ACC'],['ACC','BB','AC']]
d = Counter()
for s in mylist:
    if len(s) < 2:  # a list with fewer than two strings has no pairs
        continue
    s.sort()  # sort so that a pair is always counted under the same tuple
    for c in combinations(s, 2):
        d[c] += 1
print(list(d.most_common()[0][0]))
Will return the list ['AC','ACC']
I have a rather simple approach, without using any libraries.
Firstly, for each list inside the main list, we can compute the hash for every pair of strings (more on string hashing here: https://cp-algorithms.com/string/string-hashing.html). Maintain a dictionary that holds the number of times each hash has occurred. In the end, we just need to sort the dictionary to get all pairs, ranked in order of their occurrence count.
Example: [['AB', 'AC', 'ACC', 'TR'], ['AB','ACC']]
For list 1, that is ['AB', 'AC', 'ACC', 'TR'],
Compute the hash for every pair, i.e. "AB AC", "AB ACC", "AB TR", "AC ACC", "AC TR" and "ACC TR", and add each one to the dictionary. Repeat the same for all lists inside the main list.
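A rough sketch of this dictionary-based counting without any imports. It assumes a tuple of the two strings can serve as the dictionary key (Python hashes the tuple itself, so no hand-written string hash is needed here), and count_pairs is a hypothetical name:

```python
def count_pairs(lists):
    """Count how many lists contain each pair of strings (hypothetical helper)."""
    counts = {}
    for strings in lists:
        ordered = sorted(strings)  # fixed order, so a pair always maps to the same key
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pair = (ordered[i], ordered[j])
                counts[pair] = counts.get(pair, 0) + 1
    # Rank the pairs from most to least frequent.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

ranked = count_pairs([['AB', 'AC', 'ACC', 'TR'], ['AB', 'ACC']])
print(ranked[0])
# (('AB', 'ACC'), 2)
```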

How would you find the maximum number a string has been repeated consecutively in a list?

Say for example a list looks like this:
['ABC', 'ABC', 'XYZ', 'ABC', 'ABC', 'ABC']
And you pass in 'ABC'
It should return the number 3, since the maximum number of times 'ABC' has repeated consecutively is 3.
How would you go about doing this?
I've tried iterating through the list checking if the index is 'ABC', and if the next index is 'ABC', but this makes finding the maximum number almost impossible.
You can do this using itertools.groupby, which groups consecutive elements. By providing no key function, it groups equal elements.
We need to use sum(1 for _ in v) to get the length of the run (since it is an iterable rather than a list, we can't use len), and max to get the biggest length.
from itertools import groupby
def longest_run_of(lst, element):
    # default=0 covers the case where element does not occur in lst at all
    return max((sum(1 for _ in v) for k, v in groupby(lst) if k == element), default=0)
Example:
>>> longest_run_of(lst, 'ABC')
3
>>> longest_run_of(lst, 'XYZ')
1
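If you would rather keep the plain loop you started with, a running counter that resets on every mismatch is enough. This is only a sketch, and longest_run_loop is a made-up name:

```python
def longest_run_loop(lst, element):
    """Length of the longest consecutive run of element in lst (0 if absent)."""
    best = current = 0
    for item in lst:
        # Extend the current run on a match, otherwise reset it to zero.
        current = current + 1 if item == element else 0
        best = max(best, current)
    return best

lst = ['ABC', 'ABC', 'XYZ', 'ABC', 'ABC', 'ABC']
print(longest_run_loop(lst, 'ABC'))
# 3
```

There is no need to look at the next index: tracking the length of the current run and the best length seen so far is sufficient.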
Try:
from itertools import groupby
x=['ABC', 'ABC', 'XYZ', 'ABC', 'ABC', 'ABC']
y=dict(sorted([(gr, len(list(val))) for gr, val in groupby(x, lambda el: el)], key=lambda el: el[1]))
It will return a dict of the maximal consecutive occurrences per element, so to get it for ABC just do:
>>> y["ABC"]
3
In short what's happening there - groupby will group consecutive occurrences together.
len(list(val)) will count them.
Next we sort them to get the elements with highest number of consecutive occurrences at the end.
Then we convert to dict - leveraging the fact, that in case of multiple occurrences of the same key - dict always takes the last one, which thanks to sorting will always be the one with highest number of occurrences.
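The same per-element dictionary can probably be built without the sorting step by keeping only the largest run length seen for each value. A sketch, with longest_runs as a hypothetical name:

```python
from itertools import groupby

def longest_runs(items):
    """Map each value to the length of its longest consecutive run."""
    longest = {}
    for value, run in groupby(items):
        length = sum(1 for _ in run)  # run is an iterator, so count it instead of len()
        if length > longest.get(value, 0):
            longest[value] = length
    return longest

x = ['ABC', 'ABC', 'XYZ', 'ABC', 'ABC', 'ABC']
print(longest_runs(x))
# {'ABC': 3, 'XYZ': 1}
```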

Efficiently comparing the first item in each list of two large list of lists?

I'm currently working with a large list of lists (~280k lists) and a smaller one (~3.5k lists). I'm trying to efficiently compare the first index in the smaller list to the first index in the large list. If they match, I want to return both lists from the small and large list that have a matching first index.
For example:
Large List 1:
[[a,b,c,d],[e,f,g,h],[i,j,k,l],[m,n,o,p]]
Smaller list 2:
[[e,q,r,s],[a,t,w,s]]
Would return
[([e,q,r,s],[e,f,g,h]),([a,t,w,s],[a,b,c,d])]
I currently have it setup as shown here, where a list of tuples is returned with each tuple holding the two lists that have a matching first element. I'm fine with any other data structures being used. I was trying to use a set of tuples but was having issues trying to figure out how to do it quicker than what I already have.
My code to compare these two lists of lists is currently this:
match = []
for list_one in small_list:
    for list_two in large_list:
        if str(list_one[0]).lower() in str(list_two[0]).lower():
            match.append((list_one, list_two))
            break
return match
Assuming order doesn't matter, I would highly recommend using a dictionary to map prefix (one character) to items and set to find matches:
# generation of data... not important
>>> lst1 = [list(c) for c in ["abcd", "efgh", "ijkl", "mnop"]]
>>> lst2 = [list(c) for c in ["eqrs", "atws"]]
# mapping prefix to list (assuming uniqueness)
>>> by_prefix1 = {chars[0]: chars for chars in lst1}
>>> by_prefix2 = {chars[0]: chars for chars in lst2}
# actually finding matches by intersecting sets (fast)
>>> common = set(by_prefix1.keys()) & set(by_prefix2.keys())
>>> tuples = tuple(((by_prefix1[k], by_prefix2[k]) for k in common))
>>> tuples
Here's a one liner using list comprehension. I'm not sure how efficient it is, though.
large = [list(c) for c in ["abcd", "efgh", "ijkl", "mnop"]]
small = [list(c) for c in ["eqrs", "atws"]]
ret = [(x,y) for x in large for y in small if x[0] == y[0]]
print(ret)
#output
[(['a', 'b', 'c', 'd'], ['a', 't', 'w', 's']), (['e', 'f', 'g', 'h'], ['e', 'q', 'r', 's'])]
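The comprehension above compares every row of one list with every row of the other. With ~280k and ~3.5k rows, indexing the large list by its first element once should be noticeably cheaper. A sketch under two assumptions: matching is by case-insensitive equality (not the substring test used in the question), and every row is non-empty. match_by_first is a hypothetical name:

```python
from collections import defaultdict

def match_by_first(small_list, large_list):
    """Pair each small row with every large row sharing its first element."""
    # Index the large list once: normalised first element -> rows that start with it.
    index = defaultdict(list)
    for row in large_list:
        index[str(row[0]).lower()].append(row)
    # One dictionary lookup per small row instead of a scan of the large list.
    return [
        (row, candidate)
        for row in small_list
        for candidate in index.get(str(row[0]).lower(), [])
    ]

large = [list(c) for c in ["abcd", "efgh", "ijkl", "mnop"]]
small = [list(c) for c in ["eqrs", "atws"]]
matches = match_by_first(small, large)
print(matches)
```

Unlike a plain dict keyed by first element, the list-valued index keeps all rows when several rows in the large list share the same first element.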
I'm actually using Python 2.7.11, although I guess this may work.
l1 =[['a','b','c','d'],['e','f','g','h'],['i','j','k','l'],['m','n','o','p']]
l2 =[['e','q','r','s'],['a','t','w','s']]
def org(Smalllist,Largelist):
    L = Largelist
    S = Smalllist
    Final = []
    for i in range(len(S)):
        for j in range(len(L)):
            if S[i][0] == L[j][0]:
                Final.append((S[i],L[j]))
    return Final
I suggest passing the smaller list as the first argument in order to get the results in the order you expected.
It's very important that you enter these letters as strings upon testing, as I did, otherwise they might be considered variables and the code will not run properly.

Multi Dimensional List - Sum Integer Element X by Common String Element Y

I have a multi dimensional list:
multiDimList = [['a',1],['a',1],['a',1],['b',2],['c',3],['c',3]]
I'm trying to sum the instances of element [1] where element [0] is common.
To put it more clearly, my desired output is another multi dimensional list:
multiDimListSum = [['a',3],['b',2],['c',6]]
I see I can access, say the value '2' in multiDimList by
x = multiDimList [3][1]
so I can grab the individual elements, and could probably build some sort of function to do this job, but it would be disgusting.
Does anyone have a suggestion of how to do this pythonically?
Assuming your actual sequence has similar elements grouped together as in your example (all instances of 'a', 'b' etc. together), you can use itertools.groupby() and operator.itemgetter():
from itertools import groupby
from operator import itemgetter
[[k, sum(v[1] for v in g)] for k, g in groupby(multiDimList, itemgetter(0))]
# result: [['a', 3], ['b', 2], ['c', 6]]
Zero Piraeus's answer covers the case when field entries are grouped in order. If they're not, then the following is short and reasonably efficient.
from collections import Counter
from functools import reduce  # reduce is no longer a builtin in Python 3
reduce(lambda c,x: c.update({x[0]: x[1]}) or c, multiDimList, Counter())
This returns a collection, accessible by element name. If you prefer it as a list you can call the .items() method on it, but note that the order of the labels in the output may be different from the order in the input even in the cases where the input was consistently ordered.
You could use a dict to accumulate the total associated to each string
d = {}
multiDimList = [['a',1],['a',1],['a',1],['b',2],['c',3],['c',3]]
for string, value in multiDimList:
    # Retrieves the current value in the dict if it exists or 0
    current_value = d.get(string, 0)
    d[string] = current_value + value
print(d)  # {'a': 3, 'b': 2, 'c': 6}
You can then access the value for b by using d["b"].
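If you need the result back as a list of lists, as in the question, a sketch along the same lines could look like this. It uses defaultdict(int) so the "get or 0" step disappears, and sum_by_label is a hypothetical name:

```python
from collections import defaultdict

def sum_by_label(pairs):
    """Sum the values for each label; labels need not be grouped together."""
    totals = defaultdict(int)  # missing labels start at 0
    for label, value in pairs:
        totals[label] += value
    # dicts keep insertion order (Python 3.7+), so labels appear in first-seen order.
    return [[label, total] for label, total in totals.items()]

multiDimList = [['a', 1], ['b', 2], ['a', 1], ['c', 3], ['a', 1], ['c', 3]]
multiDimListSum = sum_by_label(multiDimList)
print(multiDimListSum)
# [['a', 3], ['b', 2], ['c', 6]]
```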
