I've tried using Counter and itertools, but since a list is unhashable, they don't work.
My data looks like this: [ [1,2,3], [2,3,4], [1,2,3] ]
I would like to know that the list [1,2,3] appears twice, but I can't figure out how to do this. I was thinking of just converting each list to a tuple, then hashing with that. Is there a better way?
>>> from collections import Counter
>>> li=[ [1,2,3], [2,3,4], [1,2,3] ]
>>> Counter(str(e) for e in li)
Counter({'[1, 2, 3]': 2, '[2, 3, 4]': 1})
The method that you state also works as long as there are no nested mutables in each sublist (such as [ [1,2,3], [2,3,4,[11,12]], [1,2,3] ]):
>>> Counter(tuple(e) for e in li)
Counter({(1, 2, 3): 2, (2, 3, 4): 1})
If you do have other unhashable types nested in the sublists, use the str or repr method, since that deals with all nested sublists as well. Or recursively convert everything to tuples (more work).
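The recursive conversion mentioned above might look roughly like this. It is only a sketch: to_hashable is a made-up helper name, and it assumes the nested containers are lists or tuples whose leaf values are already hashable.

```python
from collections import Counter

def to_hashable(obj):
    """Recursively turn lists (and tuples) into tuples so the result can be hashed."""
    if isinstance(obj, (list, tuple)):
        return tuple(to_hashable(item) for item in obj)
    return obj  # assumed to be hashable already (int, str, ...)

nested = [[1, 2, 3], [2, 3, 4, [11, 12]], [1, 2, 3]]
counts = Counter(to_hashable(sub) for sub in nested)
print(counts[(1, 2, 3)])  # 2
```

Unlike the str approach, the keys stay real tuples, so they can be turned back into lists if needed.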
ll = [ [1,2,3], [2,3,4], [1,2,3] ]
print(len(set(map(tuple, ll))))
Also, if you wanted to count the occurrences of a unique* list:
print(ll.count([1,2,3]))
(*value unique, not reference unique)
I think using the Counter class on tuples, like
Counter(tuple(item) for item in li)
will be optimal in terms of elegance and "Pythonicity": it's probably the shortest solution, it's perfectly clear what you want to achieve and how it's done, and it combines standard methods (and thus avoids reinventing the wheel).
The only performance drawback I can see is that every element has to be converted to a tuple (in order to be hashable), which more or less means that all elements of all sublists have to be copied once. Also, the internal hash function on tuples may be suboptimal if you know that list elements will, for example, always be integers.
In order to improve on performance, you would have to
1) Implement some kind of hash algorithm working directly on lists (more or less reimplementing the hashing of tuples but for lists)
2) Somehow reimplement the Counter class in order to use this hash algorithm and provide some suitable output (this class would probably use a dictionary using the hash values as key and a combination of the "original" list and the count as value)
At least the first step would need to be done in C/C++ in order to match the speed of the internal hash function. If you know the type of the list elements you could probably even improve the performance.
As for the Counter class, I do not know if its standard implementation is in Python or in C. If the latter is the case, you'll probably also have to reimplement it in C in order to achieve the same (or better) performance.
So the question "Is there a better solution" cannot be answered (as always) without knowing your specific requirements.
list = [ [1,2,3], [2,3,4], [1,2,3] ]
repeats = []
unique = 0
for i in list:
    count = 0
    if i not in repeats:
        for i2 in list:
            if i == i2:
                count += 1
        if count > 1:
            repeats.append(i)
        elif count == 1:
            unique += 1
print "Repeated Items"
for r in repeats:
    print r,
print "\nUnique items:", unique
This loops through the list to find repeated sequences, skipping items that have already been detected as repeats, adds each repeated item to the repeats list, and counts the number of unique lists.
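The nested loop compares every item against every other item, so it is quadratic in the length of the list. A single-pass variant could look like the sketch below. It swaps in Counter on tuples (not what the loop above does) and assumes the sublists contain only hashable values; the name data is just for illustration.

```python
from collections import Counter

data = [[1, 2, 3], [2, 3, 4], [1, 2, 3]]

# Count each sublist once, using its tuple form as the dictionary key.
counts = Counter(tuple(item) for item in data)

repeats = [list(key) for key, n in counts.items() if n > 1]
unique = sum(1 for n in counts.values() if n == 1)

print(repeats)  # [[1, 2, 3]]
print(unique)   # 1
```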
Related
I have a list of ~3000 items. Let's call it listA.
And another list with 1,000,000 items. Let's call it listB.
I want to check how many items of listA belong in listB. For example to get an answer like 436.
The obvious way is to have a nested loop looking for each item, but this is slow, especially due to the size of the lists.
What is the fastest and/or Pythonic way to get the number of the items of one list belonging to another?
Make a set out of list_b. That will avoid nested loops and make the contains-check O(1). The entire process will be O(M+N) which should be fairly optimal:
set_b = set(list_b)
count = sum(1 for a in list_a if a in set_b)
# OR shorter, but maybe less intuitive
count = sum(a in set_b for a in list_a)
# where the bool expression is coerced to int {0; 1} for the summing
If you don't want to (or have to) count repeated elements in list_a, you can use set intersection:
count = len(set(list_a) & set(list_b))
# OR
count = len(set(list_a).intersection(list_b)) # avoids one conversion
One should also note that these set-based operations only work if the items in your lists are hashable (e.g. not lists themselves)!
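To make the difference between the two counts concrete, here is a small sketch with made-up sample data in which list_a contains a repeated element:

```python
list_a = ["x", "y", "x", "z"]   # "x" appears twice
list_b = ["x", "y", "w"]

set_b = set(list_b)

# Counts every element of list_a found in list_b, repeats included.
count_with_repeats = sum(a in set_b for a in list_a)

# Counts each distinct shared value only once.
count_distinct = len(set(list_a) & set_b)

print(count_with_repeats, count_distinct)  # 3 2
```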
Another option is to use set and find the intersection:
len(set(listA).intersection(listB))
You can loop over the contents of listA and use a generator to yield the matching values lazily (for fast lookups, pass listB as a set, since i in a is a linear scan when a is a list):
def get_number_of_elements(s, a):
    for i in s:
        if i in a:
            yield i
print(len(list(get_number_of_elements(listA, listB))))
I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict
diffs = defaultdict(int)
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
for pair in list1:
    diffs[pair[0]] = pair[1]
for pair in list2:
    diffs[pair[0]] -= pair[1]
differences = [[k,abs(v)] for k,v in diffs.items()]
print(differences)
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
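The dictionary output mentioned above could look roughly like this. It is a sketch rather than the code that was timed, and it assumes every ID in list1 also occurs in list2, as the question states; lookup and diff_by_id are names made up for the example.

```python
list1 = [['ID_001', 1000], ['ID_002', 2000], ['ID_003', 3000]]
list2 = [['ID_001', 500], ['ID_003', 1000], ['ID_002', 1000]]

# Each inner [id, value] pair becomes one key/value entry.
lookup = dict(list2)

# One pass over list1, with a constant-time lookup per ID; no sorting needed.
diff_by_id = {key: abs(value - lookup[key]) for key, value in list1}

print(diff_by_id["ID_003"])  # 2000
```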
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts both lists with sorted so that matching IDs line up (sorted returns a new list, unlike list.sort(), which sorts in place). It then pairs up the corresponding entries, e.g. ['ID_001', 1000] and ['ID_001', 500], by feeding the sorted lists to zip.
Finally:
(i[0], abs(i[1] - j[1]))
returns i[0], the ID for each entry, and abs(i[1] - j[1]), their absolute difference. These are added as a tuple to the final result list (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data, but as far as I'm aware that depends on how disorganized the data is.
Other than that, zip creates an iterator so memory wise it doesn't affect you. Speed wise, list comps tend to be quite efficient and in most cases are your best options.
Given a programming language that supports iteration through lists i.e.
for element in list do
...
If we have a program that takes a dynamic number of lists as input, list[1] ... list[n] (where n can take any value), what is the best way to iterate through every combination of elements in these lists?
e.g. list[1] = [1,2], list[2] = [1,3] then we iterate through [[1,1], [1,3], [2,1], [2,3]].
My ideas that I don't think are very good:
1) Create a big product of these lists into list_product (e.g. in Python you could use itertools.product() multiple times) and then iterate over list_product. Problem is that this requires us to store a (potentially huge) iterable.
2) Find the product of the length of all the lists, total_length and do something along the lines of the following by using a modular arithmetic type idea.
len_lists = [len(list[i]) for i in [1..n]]
total_length = Product(len_lists)
for i in [1 ... total_length] do
    total = i-1
    list_index = [1...n]
    for j in [n ... 1] do
        list_index[j] = IntegerPartOf(total / Product(len_lists[1:j-1]))
        total = RemainderOf(total / Product(len_lists[1:j-1]))
    od
    print list_index
od
where the list_index are then printed for all different combinations.
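For illustration, the modular-arithmetic idea can be written in Python roughly as follows. This is a sketch, not a recommendation: index_tuples is a made-up name, indices are 0-based, and the last list varies fastest so that the output order matches the example in the question.

```python
def index_tuples(lengths):
    """Yield one index tuple per combination, decoded from a running counter."""
    total = 1
    for n in lengths:
        total *= n
    for counter in range(total):
        rest = counter
        indices = []
        # Peel off one mixed-radix digit per list, starting with the last list.
        for n in reversed(lengths):
            rest, digit = divmod(rest, n)
            indices.append(digit)
        yield tuple(reversed(indices))

lists = [[1, 2], [1, 3]]
combos = [tuple(lst[i] for lst, i in zip(lists, idx))
          for idx in index_tuples([len(lst) for lst in lists])]
print(combos)  # [(1, 1), (1, 3), (2, 1), (2, 3)]
```

Only the current index tuple is held in memory at any time, at the cost of one divmod per list for every combination.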
Is there a better way with regards to speed (don't care so much about readability)?
1) Create a big product of these lists into list_product (e.g. in Python you could use itertools.product() multiple times) and then iterate over list_product. Problem is that this requires us to store a (potentially huge) iterable.
The point of itertools (and iterators in general) is that they do not construct their entire result at once, but create and return terms from the result one at a time. So if you have a list of lists ListOfLists and you want all tuples containing one element from each list in it, do use
for elt in itertools.product(*ListOfLists):
    ...
Note that you only need to call product once. It's simple and efficient.
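A quick way to convince yourself of the lazy behaviour is to take only the first few tuples of a product that would be far too large to store. The sketch below uses made-up inputs (big_lists); note that product does keep a copy of each input sequence, just not the combined result.

```python
from itertools import islice, product

# Three hypothetical inputs; the full product would have 10**9 tuples.
big_lists = [range(1000)] * 3

# islice pulls just three tuples out of the iterator and then stops.
first_three = list(islice(product(*big_lists), 3))
print(first_three)  # [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```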
You can use itertools.product without needing to materialize a list:
>>> from itertools import product
>>> lol = [[1,2],[1,3]]
>>> product(*lol)
<itertools.product object at 0xaa414b4>
>>> for x in product(*lol):
...     print x
...
(1, 1)
(1, 3)
(2, 1)
(2, 3)
As for performance, it's very easy to spend more time thinking of ways to optimize it than you can ever hope to gain from the optimizations. If you're doing anything inside the loop at all, then it's pretty likely that the iteration overhead itself is negligible. (Most common exception is a tight numerical loop, in which case you should try to do it numpythonically instead.)
My advice would be to use itertools.product and get on with your day.
Update:
Hello again. My question is: how can I compare the values of a dictionary for equality? More information about my dictionary:
keys are session numbers
the values of each key are nested lists, e.g.
[[1,0],[2,0],[3,1]]
the number of values for each key isn't the same, so session number 1 could have more values than session number 2
Here is an example dictionary:
order_session =
{1:[[100,0],[22,1],[23,2]],10:[[100,0],[232,0],[10,2],[11,2]],22:[[5,2],[23,2],....],
... }
My goal:
Step 1: compare the values of session number 1 with the values of all the other session numbers in the dictionary for equality
Step 2: take the next session number and compare its values with the values of the other session numbers, and so on
- in the end, the values of every session number have been compared
Step 3: save the result into a list, e.g.
output = [[100,0],[23,2], ... ] or output = [(100,0),(23,2), ... ]
As you can see, the value pair [100,0] is the same in sessions 1 and 10. Likewise, the value pair [23,2] is the same in sessions 1 and 22.
Thanks for helping me out.
Update 2
Thank you for all your help and for the tips on changing the nested list of lists into a list of tuples, which is much easier to handle.
I prefer Boaz Yaniv's solution ;)
I also like the use of collections.Counter() ... unfortunately I use 2.6.4 (Counter is available from 2.7), so maybe I'll switch to 2.7 sometime.
If your dictionary is long, you'd want to use sets, for better performance (looking up already-encountered values in lists is going to be quite slow):
def get_repeated_values(sessions):
    known = set()
    already_repeated = set()
    for lst in sessions.itervalues():
        session_set = set(tuple(x) for x in lst)
        repeated = (known & session_set) - already_repeated
        already_repeated |= repeated
        known |= session_set
        for val in repeated:
            yield val
sessions = {1:[[100,0],[22,1],[23,2]],10:[[100,0],[232,0],[10,2],[11,2]],22:[[5,2],[23,2]]}
for x in get_repeated_values(sessions):
    print x
I also suggest (again, for performance reasons) to nest tuples inside your lists instead of lists, if you're not going to change them on-the-fly. The code I posted here will work either way, but it would be faster if the values are already tuples.
There's probably a nicer and more optimal way to do this, but I'd work my way from here:
seen = []
output = []
for val in order_session.values():
    for vp in val:
        if vp in seen:
            if vp not in output:
                output.append(vp)
        else:
            seen.append(vp)
print(output)
Basically, what this does is to look through all the values, and if the value has been seen before, but not output before, it is appended to the output.
Note that this works with the actual values of the value pairs: the in test on a list uses ==, and lists of integers compare equal element by element, so it does not matter whether two pairs are the same object. Python (CPython) does re-use the same object for "low" integers, currently -5 through 256; that is, if you run the statements a = 5 and b = 5 after each other, a and b will point to the same integer object, whereas for larger values such as 10**5 they may not. That only affects identity checks with is, not == or in. If your values are objects of other kinds that do not define equality, the comparison falls back to identity and my algorithm might miss matches (I haven't tested it, so I'm not sure).
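A quick check, assuming plain lists of integers as in the question, suggests that the membership test compares by value rather than by object identity:

```python
a = [100, 0]
b = [100, 0]          # equal contents, but a separate list object

seen = [a]

print(a is b)         # False: two distinct list objects
print(a == b)         # True: lists compare element by element
print(b in seen)      # True: `in` relies on equality, so b is found
```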
>>> from collections import Counter
>>> D = {1:[[100,0],[22,1],[23,2]],
... 10:[[100,0],[232,0],[10,2],[11,2]],
... 22:[[5,2],[23,2]]}
>>> [k for k,v in Counter(tuple(j) for i in D.values() for j in i).items() if v>1]
[(23, 2), (100, 0)]
If you really really need a list of lists
>>> [list(k) for k,v in Counter(tuple(j) for i in D.values() for j in i).items() if v>1]
[[23, 2], [100, 0]]
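On Python versions without Counter (it was added in 2.7), the same counting can be sketched with defaultdict. The names counts and repeated are made up for the example, and the result is sorted only to get a predictable order.

```python
from collections import defaultdict

D = {1: [[100, 0], [22, 1], [23, 2]],
     10: [[100, 0], [232, 0], [10, 2], [11, 2]],
     22: [[5, 2], [23, 2]]}

# Tally every value pair, using its tuple form as the dictionary key.
counts = defaultdict(int)
for pairs in D.values():
    for pair in pairs:
        counts[tuple(pair)] += 1

repeated = sorted(key for key, n in counts.items() if n > 1)
print(repeated)  # [(23, 2), (100, 0)]
```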
order_session = {1:[[100,0],[22,1],[23,2]],10:[[100,0],[232,0],[10,2],[11,2]],22:[[5,2],[23,2],[80,21]],}
output = []
for pair in sum(order_session.values(), []):
    if sum(order_session.values(), []).count(pair) > 1 and pair not in output:
        output.append(pair)
print output
...
[[100, 0], [23, 2]]
I have a list of dictionary items like so:
L = [{"a":1, "b":0}, {"a":3, "b":1}...]
I would like to split these entries based upon the value of "b", either 0 or 1.
A(b=0) = [{"a":1, "b":1}, ....]
B(b=1) = [{"a":3, "b":2}, .....]
I am comfortable with using simple list comprehensions, and I am currently looping through the list L two times.
A = [d for d in L if d["b"] == 0]
B = [d for d in L if d["b"] != 0]
Clearly this is not the most efficient way.
An else clause does not seem to be available within the list comprehension functionality.
Can I do what I want via list comprehension?
Is there a better way to do this?
I am looking for a good balance between readability and efficiency, leaning towards readability.
Thanks!
update:
Thanks, everyone, for the comments and ideas! The easiest one for me to read is the one by Thomas, but I will look at Alex's suggestion as well. I had not found any reference to the collections module before.
Don't use a list comprehension. List comprehensions are for when you want a single list result. You obviously don't :) Use a regular for loop:
A = []
B = []
for item in L:
    if item['b'] == 0:
        target = A
    else:
        target = B
    target.append(item)
You can shorten the snippet by doing, say, (A, B)[item['b'] != 0].append(item), but why bother?
If the b value can be only 0 or 1, @Thomas's simple solution is probably best. For a more general case (in which you want to discriminate among several possible values of b -- your sample "expected results" appear to be completely divorced from and contradictory to your question's text, so it's far from obvious whether you actually need some generality ;-):
from collections import defaultdict
separated = defaultdict(list)
for x in L:
    separated[x['b']].append(x)
When this code executes, separated ends up with a dict (actually an instance of collections.defaultdict, a dict subclass) whose keys are all values for b that actually occur in dicts in list L, the corresponding values being the separated sublists. So, for example, if b takes only the values 0 and 1, separated[0] would be what (in your question's text as opposed to the example) you want as list A, and separated[1] what you want as list B.
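As a small usage sketch with made-up sample data (the contents of L below are assumptions, since the question elides the full list):

```python
from collections import defaultdict

L = [{"a": 1, "b": 0}, {"a": 3, "b": 1}, {"a": 5, "b": 0}]  # made-up sample data

# Group the dicts by their "b" value in a single pass over L.
separated = defaultdict(list)
for x in L:
    separated[x["b"]].append(x)

A = separated[0]
B = separated[1]
print(A)  # [{'a': 1, 'b': 0}, {'a': 5, 'b': 0}]
print(B)  # [{'a': 3, 'b': 1}]
```

Looking up a value of b that never occurred simply gives an empty list, which is one reason defaultdict is convenient here.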