I have a list with more than 100 million tuples, with key-value elements like this:
list_a = [(1,'a'), (2,'b'), (1,'a'), (3,'b'), (3,'b'), (1,'a')]
I need to output a second list like this:
list_b = [(1,'a', 3), (2, 'b', 1), (3, 'b', 2) ]
The last element of each tuple is the number of times that tuple appears in the list. The order of list_b doesn't matter.
Then, I wrote this code:
import collections
list_b = []
for e, c in collections.Counter(list_a).most_common():
    list_b.append("{}, {}, {}".format(e[0], e[1], c))
Running it with 1000 tuples takes approximately 2 seconds... imagine how long it will take with more than 100 million. Any ideas to speed it up?
Your bottleneck is the explicit Python-level loop: every iteration does a method lookup, string formatting and a list.append call in interpreted bytecode, which is much slower than letting the loop machinery run in C.
You can use a list comprehension instead and it'll be much faster:
from collections import Counter

c = Counter(list_a)
result = [(*k, v) for k, v in c.items()]
I ran this on a 1000-item list on my machine and it was pretty quick.
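If you want to measure the difference yourself, a rough timeit sketch along these lines should do (the sample data below is synthetic and the sizes are made up; your numbers will differ):
import timeit
from collections import Counter

# synthetic sample data, just for illustration
list_a = [(i % 1000, 'a') for i in range(100_000)]

def original_loop():
    list_b = []
    for e, c in Counter(list_a).most_common():
        list_b.append("{}, {}, {}".format(e[0], e[1], c))
    return list_b

def comprehension():
    c = Counter(list_a)
    return [(*k, v) for k, v in c.items()]

print(timeit.timeit(original_loop, number=10))
print(timeit.timeit(comprehension, number=10))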
Given a list L = [('a',3),('b',4),('c',14),('d',10)],
the desired output is the first item from a tuple and the second item from the next tuple, e.g.:
a 4
b 14
c 10
A straightforward but unpythonic way would be
for i in range(len(L)-1):
    print(L[i][0], L[i+1][1])
Alternatively, this is what I've come up with:
for (a0,a1),(b0,b1) in zip(L,L[1:]):
    print(a0,b1)
but it seems to be wasteful. Is there a standard way to do this?
I personally think both options are just fine. It is also possible to extract the items and join them:
from operator import itemgetter

pairs = zip(map(itemgetter(0), L), map(itemgetter(1), L[1:]))
# list(pairs) == [('a', 4), ('b', 14), ('c', 10)]
A pythonic way is to use a generator expression.
You could write it like this:
for newTuple in ((L[i][0], L[i+1][1]) for i in range(len(L)-1)):
    print(newTuple)
It looks like a list comprehension, but the generator does not create the full list; it yields the tuples one by one, so it doesn't take additional memory for a full copy of the list.
To improve your zip example (which is already good), you could use itertools.islice to avoid creating a sliced copy of the initial list. In Python 3, the code below only generates values; no temporary list is created in the process.
import itertools
L = [('a',3),('b',4),('c',14),('d',10)]
for (a0,_),(_,b1) in zip(L,itertools.islice(L,1,None)):
    print(a0,b1)
I'd split the first and second items with the help of two generator expressions, use islice to drop one item and zip the two streams together again.
from itertools import islice

first_items = (a for a, _ in L)
second_items = (b for _, b in L)
result = zip(first_items, islice(second_items, 1, None))
print(list(result))
# [('a', 4), ('b', 14), ('c', 10)]
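As an aside, if you happen to be on Python 3.10 or newer, itertools.pairwise yields exactly these consecutive pairs, so the same result can be written as a short sketch:
from itertools import pairwise

L = [('a',3),('b',4),('c',14),('d',10)]
result = [(a0, b1) for (a0, _), (_, b1) in pairwise(L)]
print(result)
# [('a', 4), ('b', 14), ('c', 10)]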
I have a list of data from which I need to copy some of its elements into a couple of different lists. Would it be better to do a single iteration of the list, or to perform multiple list comprehensions?
E.g.
def split_data(data):
    a = []
    b = []
    c = []
    for d in data:
        if d[0] > 1: a.append(d)
        if d[1] == 'b': b.append(d)
        if len(d) == 3: c.append(d)
    return a, b, c
Versus
def split_data(data):
    a = [d for d in data if d[0] > 1]
    b = [d for d in data if d[1] == 'b']
    c = [d for d in data if len(d) == 3]
    return a, b, c
I know the more pythonic way of doing this is with list comprehensions, but is that the case in this instance?
Your first example only needs to iterate through the data once, using multiple if statements, while the second needs to iterate through the data three times. I believe a list comprehension will win most of the time given an equal number of iterations over the data.
For simple operations like your example I would prefer the list comprehension method; when the operations become more complex, the other approach may be better for the sake of code readability.
Some benchmarking of the two functions should tell you more.
A quick benchmark of the two functions on a dummy data set gives the runtimes below. These numbers won't always hold; they depend on the data set.
# without list comprehension
>>> timeit.timeit('__main__.split_data([("a","b")] * 1000000)', 'import __main__', number=1)
0.43826036048574224
# with list comprehension
>>> timeit.timeit('__main__.split_data([("a","b")] * 1000000)', 'import __main__', number=1)
0.31136326966964134
I'd say it depends. If your data is a list and comparatively small, you could go for the list comprehensions. However, if your data is comparatively large (hint: %timeit is your friend), your first option only iterates over it once and might therefore be more efficient.
Also note that your first version would work with a generator as input, whereas the second version won't, since a generator's items can only be consumed once (the first comprehension would exhaust it). You could even chain this by providing a generator yourself, i.e. using yield a, b, c instead of return.
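To make that concrete, here is a rough sketch (the generator input is made up) of what happens when data is a generator rather than a list:
def split_one_pass(data):
    # same as the first version from the question
    a, b, c = [], [], []
    for d in data:
        if d[0] > 1: a.append(d)
        if d[1] == 'b': b.append(d)
        if len(d) == 3: c.append(d)
    return a, b, c

def split_comprehensions(data):
    # same as the second version from the question
    a = [d for d in data if d[0] > 1]
    b = [d for d in data if d[1] == 'b']
    c = [d for d in data if len(d) == 3]
    return a, b, c

gen = ((i, 'b', i) for i in range(5))   # a generator can only be consumed once
print(split_one_pass(gen))              # all three lists are filled

gen = ((i, 'b', i) for i in range(5))
print(split_comprehensions(gen))        # only a is filled; b and c see an exhausted generator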
If you want to go with what's more pythonic, we can consult the Zen of Python:
Explicit is better than implicit.
Sparse is better than dense.
Readability counts.
Although both are readable, I'd say your first example is the more readable of the two. If your data had more dimensions and required nested for loops, the first example would also make it clearer how each nested element is handled once more logic is involved.
Although Skycc's answer does show slightly faster results for the list comprehensions, ideally you should go for readability first and optimize later, unless you really need that little speedup.
I have a Python dictionary (say D) where every key corresponds to some predefined list. I want to create an array with two columns where the first column corresponds to the keys of the dictionary D and the second column corresponds to the sum of the elements in the corresponding lists. As an example, if,
D = {1: [5,55], 2: [25,512], 3: [2, 18]}
Then, the array that I wish to create should be,
A = array( [[1,60], [2,537], [3, 20]] )
I have given a small example here, but I would like to know of a way where the implementation is the fastest. Presently, I am using the following method:
A_List = map( lambda x: [x,sum(D[x])] , D.keys() )
I realize that the output from my method is a list. I can convert it into an array in a separate step, but I don't know whether that will be fast (I presume that using arrays will be faster than using lists). I would really appreciate an answer showing the fastest way of achieving this.
You can use a list comprehension to create the desired output:
>>> [(k, sum(v)) for k, v in D.items()] # Py2 use D.iteritems()
[(1, 60), (2, 537), (3, 20)]
On my computer, this runs about 50% quicker than the map(lambda:.., D) version.
Note: on Python 3, map just returns an iterator, so you need list(map(...)) to measure the real time it takes.
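If the A = array(...) in the question really means a NumPy array, the comprehension result can be handed to np.array directly; a minimal sketch, assuming NumPy is available:
import numpy as np

D = {1: [5, 55], 2: [25, 512], 3: [2, 18]}
A = np.array([(k, sum(v)) for k, v in D.items()])
print(A)
# [[  1  60]
#  [  2 537]
#  [  3  20]]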
I hope that helps:
Build a list with the keys of D:
first_column = list(D.keys())
Build a list with the sum of the values for each key:
second_column = [sum(D[key]) for key in D.keys()]
Zip the two together into [key, sum] pairs:
your_array = list(zip(first_column,second_column))
You can try this also:
a = []
for i in D.keys():
    a += [[i, sum(D[i])]]
I've tried using Counter and itertools, but since a list is unhashable, they don't work.
My data looks like this: [ [1,2,3], [2,3,4], [1,2,3] ]
I would like to know that the list [1,2,3] appears twice, but I can't figure out how to do this. I was thinking of just converting each list to a tuple and then hashing that. Is there a better way?
>>> from collections import Counter
>>> li=[ [1,2,3], [2,3,4], [1,2,3] ]
>>> Counter(str(e) for e in li)
Counter({'[1, 2, 3]': 2, '[2, 3, 4]': 1})
The method that you state also works, as long as there are no nested mutables in the sublists (such as [ [1,2,3], [2,3,4,[11,12]], [1,2,3] ]):
>>> Counter(tuple(e) for e in li)
Counter({(1, 2, 3): 2, (2, 3, 4): 1})
If you do have other unhashable types nested in the sublists, use the str or repr method, since that deals with all the sublists as well. Or recursively convert everything to tuples (more work).
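A rough sketch of that recursive conversion (the helper name here is my own):
from collections import Counter

def to_hashable(obj):
    # turn lists (including nested ones) into tuples so they can be counted
    if isinstance(obj, list):
        return tuple(to_hashable(x) for x in obj)
    return obj

li = [[1, 2, 3], [2, 3, 4, [11, 12]], [1, 2, 3]]
print(Counter(to_hashable(e) for e in li))
# Counter({(1, 2, 3): 2, (2, 3, 4, (11, 12)): 1})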
ll = [ [1,2,3], [2,3,4], [1,2,3] ]
print(len(set(map(tuple, ll))))
Also, if you wanted to count the occurrences of a unique* list:
print(ll.count([1,2,3]))
*unique by value, not by reference
I think using the Counter class on tuples, like
Counter(tuple(item) for item in li)
will be optimal in terms of elegance and "pythonicity": it's probably the shortest solution, it's perfectly clear what you want to achieve and how it's done, and it combines standard methods (and thus avoids reinventing the wheel).
The only performance drawback I can see is that every element has to be converted to a tuple (in order to be hashable), which more or less means that all elements of all sublists have to be copied once. Also, the internal hash function on tuples may be suboptimal if you know that the list elements will, for example, always be integers.
In order to improve on performance, you would have to
Implement some kind of hash algorithm working directly on lists (more or less reimplementing the hashing of tuples but for lists)
Somehow reimplement the Counter class in order to use this hash algorithm and provide some suitable output (this class would probably use a dictionary using the hash values as key and a combination of the "original" list and the count as value)
At least the first step would need to be done in C/C++ in order to match the speed of the internal hash function. If you know the type of the list elements you could probably even improve the performance.
As for the Counter class, I do not know whether its standard implementation is in Python or in C; if the latter, you'll probably also have to reimplement it in C in order to achieve the same (or better) performance.
So the question "Is there a better solution" cannot be answered (as always) without knowing your specific requirements.
lists = [ [1,2,3], [2,3,4], [1,2,3] ]   # avoid shadowing the built-in name list
repeats = []
unique = 0
for i in lists:
    count = 0
    if i not in repeats:
        for i2 in lists:
            if i == i2:
                count += 1
        if count > 1:
            repeats.append(i)
        elif count == 1:
            unique += 1
print("Repeated Items")
for r in repeats:
    print(r, end=" ")
print("\nUnique items:", unique)
This loops through the list to find repeated sublists, skipping items that have already been detected as repeats, adds the repeats to the repeats list, and counts the number of unique lists.
Given a programming language that supports iteration through lists i.e.
for element in list do
...
If we have a program that takes a dynamic number of lists as input, list[1] ... list[n] (where n can take any value), what is the best way to iterate through every combination of elements in these lists?
e.g. list[1] = [1,2], list[2] = [1,3] then we iterate through [[1,1], [1,3], [2,1], [2,3]].
Here are my ideas, which I don't think are very good:
1) Create a big product of these lists into list_product (e.g. in Python you could use itertools.product() multiple times) and then iterate over list_product. Problem is that this requires us to store a (potentially huge) iterable.
2) Find the product of the lengths of all the lists, total_length, and do something along the lines of the following, using a modular-arithmetic kind of idea.
len_lists = [len(list[i]) for i in [1..n]]
total_length = Product(len_lists)
for i in [1 ... total_length] do
    total = i-1
    list_index = [1...n]
    for j in [n ... 1] do
        list_index[j] = IntegerPartOf(total / Product(len_lists[1:j-1]))
        total = RemainderOf(total / Product(len_lists[1:j-1]))
    od
    print list_index
od
where the list_index are then printed for all different combinations.
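For concreteness, a rough Python sketch of that indexing idea (0-based, with divmod doing the integer-part/remainder work; the variable names are just illustrative) might look like this:
lists = [[1, 2], [1, 3]]               # example input
lengths = [len(l) for l in lists]

total_length = 1
for n in lengths:
    total_length *= n

for i in range(total_length):
    rem, indices = i, []
    # peel off one index per list, starting from the last list
    for n in reversed(lengths):
        rem, idx = divmod(rem, n)
        indices.append(idx)
    indices.reverse()
    print([lists[j][idx] for j, idx in enumerate(indices)])
# prints [1, 1], [1, 3], [2, 1], [2, 3]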
Is there a better way with regard to speed (I don't care so much about readability)?
1) Create a big product of these lists into list_product (e.g. in Python you could use itertools.product() multiple times) and then iterate over list_product. Problem is that this requires us to store a (potentially huge) iterable.
The point of itertools (and iterators in general) is that they do not construct their entire result at once, but create and return items of the result one at a time. So if you have a list of lists ListOfLists and you want all tuples containing one element from each list in it, use
import itertools

for elt in itertools.product(*ListOfLists):
    ...
Note that you only need to call product once. It's simple and efficient.
You can use itertools.product without needing to materialize a list:
>>> from itertools import product
>>> lol = [[1,2],[1,3]]
>>> product(*lol)
<itertools.product object at 0xaa414b4>
>>> for x in product(*lol):
...     print(x)
...
(1, 1)
(1, 3)
(2, 1)
(2, 3)
As for performance, it's very easy to spend more time thinking of ways to optimize it than you can ever hope to gain from the optimizations. If you're doing anything inside the loop at all, then it's pretty likely that the iteration overhead itself is negligible. (Most common exception is a tight numerical loop, in which case you should try to do it numpythonically instead.)
My advice would be to use itertools.product and get on with your day.