Complexity of checking value in a list while creating list - python

What is the worst case time complexity of the following code:
temp_lst = [(1, "one"), (2, "two"), (3, "three")]
if 5 not in [i[0] for i in temp_lst]:
    print("5 is not here")
My understanding is that it's O(n^2), because you're building a list while also searching it, so the equivalent would be some sort of for loop inside another for loop.

Assuming n means the length of temp_lst, this code has a time complexity of O(n).
The list comprehension [i[0] for i in temp_lst] is equivalent to the following loop, which is clearly O(n):
result = []
for i in temp_lst:
    result.append(i[0])
The resulting list has the same length, n, so the expression 5 not in ..., which is implemented as a linear search, also takes O(n) time.
The list comprehension and the linear search are done one after the other, so we should add, not multiply: O(n) + O(n) = O(n).
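Incidentally, if memory matters, a generator expression gives the same O(n) time without materializing the intermediate list, since in consumes the generator lazily:
temp_lst = [(1, "one"), (2, "two"), (3, "three")]
if 5 not in (i[0] for i in temp_lst):  # no intermediate list is built
    print("5 is not here")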

This is linear: build the lookup list once and traverse it once. O(2n) reduces to O(n).
Having said that, this appears to be a problematic design that I'd consider an antipattern, absent any further information. If the tuples are indeed sequentially numbered and unique, then this structure makes far more sense as
temp_lst = ["one", "two", "three"]
Now, we can simply say if 5 < len(temp_lst) and we have O(1) lookup time. Not only that, the code is simpler and there is no redundant information. If you need 1-indexing, either add a None to the front of the list or subtract 1 from all your lookups.
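A minimal sketch of this restructuring (my own illustration, assuming the IDs are the 1-based positions from the question, with None padding index 0):
temp_lst = [None, "one", "two", "three"]  # None pads index 0 so IDs map 1:1 to positions

def has_id(n):
    # O(1): just a bounds check, no scan of the list
    return 1 <= n < len(temp_lst)

print(has_id(5))      # False
print(temp_lst[2])    # "two"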
If the numbers aren't sequential and would leave holes in the list, then a dict is likely the appropriate structure:
users = {"51232": "bob", "12342": "amy", "17652": "carol"}
Again, we have O(1) lookup time when searching by id, "51232" in users.

Related

Find index for multiple elements in a long list

I have a very long lst containing unique elements. I want to design a function which takes a list of elements as input and efficiently returns a list of their indices. We assume all the items to look up are present in lst.
Here is an example:
lst = ['ab','sd','ef','de']
items_to_find = ['sd', 'ef', 'sd']
>>> fo(lst, items_to_find)
# Output: [1,2,1]
I have one solution of my own, but it looks inefficient.
>>> [lst.index(x) for x in items_to_find]
Because lst is very long, I need a very fast algorithm to solve it.
First create a dictionary containing the index location of each item in the list (you state that all items are unique, hence no issue with duplicate keys).
Then use the dictionary to look up each item's index location, which has average time complexity O(1).
my_list = ['ab', 'sd', 'ef', 'de']
d = {item: idx for idx, item in enumerate(my_list)}
items_to_find = ['sd', 'ef', 'sd']
>>> [d.get(item) for item in items_to_find]
[1, 2, 1]
You could use a dictionary with the elements from lst as the keys and their indices as the values. Lookup in a dictionary is O(1).
Although the answer you've accepted is very good, here's something that would be more memory efficient and is probably almost as fast. @Alexander's answer creates a potentially huge dictionary if the list is very long (since all its elements are unique).
The code below also builds a dictionary to speed up searching, but it's keyed on the target elements, so it is likely to be much smaller than the list being searched. For the sample data, the dictionary it creates (named targets) contains only: {'sd': [0, 2], 'ef': [1]}
It makes one pass through the sequence, checking whether each value in it is a target and, if so, updating the results list accordingly. This approach requires a little more code since the setup is slightly more involved, so that's another trade-off.
def find_indices(seq, elements):
    # Map each target element to the positions where it occurs in `elements`.
    targets = {}
    for index, element in enumerate(elements):
        targets.setdefault(element, []).append(index)
    indices = [None for _ in elements]  # Pre-allocate.
    for location, value in enumerate(seq):
        if value in targets:
            for index in targets[value]:
                indices[index] = location
    return indices
lst = ['ab', 'sd', 'ef', 'de']
indices = find_indices(lst, ['sd', 'ef', 'sd'])
print(indices) # -> [1, 2, 1]
A simple first approximation...
def get_indices(data_list, query_list):
    datum_index_mapping = {datum: None for datum in query_list}
    for index, datum in enumerate(data_list):
        if datum in datum_index_mapping:
            datum_index_mapping[datum] = index
    return [datum_index_mapping[d] for d in query_list]
The above is the most simple, intuitive solution which makes some effort to be efficient (by only bothering to store a dictionary of indices for the elements you actually want to look up).
However, it suffers from the fact that, even if the initial query list is very short, it will iterate through the entire data list / data generator. In addition, it has to do a dictionary write every time it encounters a queried value, even one it has already recorded. The version below fixes those inefficiencies, although it adds the overhead of a set: it must do a set write for each unique element in the query list, as well as a dictionary write for each unique element in the query list.
def get_indices(data_list, query_list):
    not_found = set(query_list)
    datum_index_mapping = {}
    for index, datum in enumerate(data_list):
        if datum in not_found:
            datum_index_mapping[datum] = index
            not_found.remove(datum)
            if len(not_found) == 0:
                break
    return [datum_index_mapping[d] for d in query_list]
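For example, with the sample data from the question:
>>> get_indices(['ab', 'sd', 'ef', 'de'], ['sd', 'ef', 'sd'])
[1, 2, 1]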
Obviously, depending on your program, you may not actually want to have a list of indices at all, but simply have your function return the mapping.
If you'll be resolving multiple arbitrary query lists, you may want to simply do an enumerate() on the original dataset as other answers have shown and keep the dictionary that maps values to indices in memory as well for query purposes.
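A minimal sketch of that reuse pattern (my own illustration): build the full value-to-index map once, then answer any number of query lists against it:
lst = ['ab', 'sd', 'ef', 'de']
index_of = {value: index for index, value in enumerate(lst)}  # built once, O(n)

def resolve(query_list):
    return [index_of[q] for q in query_list]  # each lookup is average O(1)

print(resolve(['sd', 'ef', 'sd']))  # [1, 2, 1]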
What counts as efficient often depends upon the larger program; all we can do here are localized optimizations. It also depends on the memory hierarchy and the processing power available (can we parallelize? Is compute more expensive, or is memory? What's the I/O hit if we need to fall back to swap?).
If you are sure all the searched values actually exist in the list being searched, and lst is sorted (of course, the sorting itself might take some time), you can do it in one pass (linear complexity):
def sortedindex(lst, find):
    find.sort()
    indices = []
    start = 0
    for item in find:
        start = lst.index(item, start)
        indices.append(start)
    return indices
The "start" shows the first index where the algorithm starts comparing the inspected item to the item in the main list. When the correct index is found, it will become the next starting mark. Because both lists are sorted in the same way, you do not have to worry that you skipped any of the next items.

2d-list calculations

I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict

list1 = [['ID_001', 1000], ['ID_002', 2000], ['ID_003', 3000]]
list2 = [['ID_001', 500], ['ID_003', 1000], ['ID_002', 1000]]

diffs = defaultdict(int)
for pair in list1:
    diffs[pair[0]] = pair[1]
for pair in list2:
    diffs[pair[0]] -= pair[1]

differences = [[k, abs(v)] for k, v in diffs.items()]
print(differences)  # [['ID_001', 500], ['ID_002', 1000], ['ID_003', 2000]]
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
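A sketch of that dictionary-output variant (my reading of the remark above, not code from the answer):
diffs_abs = {k: abs(v) for k, v in diffs.items()}
print(diffs_abs)  # {'ID_001': 500, 'ID_002': 1000, 'ID_003': 2000}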
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts both lists with sorted so that matching IDs line up (not with list.sort(), which sorts in place), and then feeds the sorted lists to zip, which pairs up the corresponding entries, e.g. ['ID_001', 1000] with ['ID_001', 500].
Finally:
(i[0], abs(i[1] - j[1]))
returns i[0], the ID for each entry, while abs(i[1] - j[1]) computes their absolute difference. These are added as a tuple to the final result list (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data (it is O(n log n)), though as far as I'm aware that depends on how disordered the data is.
Other than that, zip creates an iterator, so memory-wise it doesn't affect you. Speed-wise, list comprehensions tend to be quite efficient and in most cases are your best option.
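For reference, running the comprehension on the sample lists from the question yields tuples rather than inner lists; wrap each pair in list() if the exact original shape matters:
>>> list1 = [['ID_001', 1000], ['ID_002', 2000], ['ID_003', 3000]]
>>> list2 = [['ID_001', 500], ['ID_003', 1000], ['ID_002', 1000]]
>>> [(i[0], abs(i[1] - j[1])) for i, j in zip(sorted(list1), sorted(list2))]
[('ID_001', 500), ('ID_002', 1000), ('ID_003', 2000)]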

Efficient way to loop on comparing string list element to a string list sub-element

I am currently struggling to find an efficient way to compare part of each string element in one list to the string elements in another list. The current computation is very slow (1 hour with 4.8 million elements in the first list and 5,000 elements in the second one).
What I need to do: if the first 8 characters of an element of the first list equal a full element of the second list, a third list is updated with the full first element. Once a match is found, we move on to the next element of the first list.
Here is the code:
for first_element in first_List:
    for second_element in second_List:
        if first_element[:8] == second_element:
            third_List.append(first_element)
            break
I know those kinds of loops are not the best way to deal with very big lists. The number of if tests is really huge.
I was wondering if there is an efficient way to do this.
I think intersection with sets won't work, since I'm comparing part of an element to a full one, and I need to copy the full first element into the third list.
Do you have some suggestions or ideas please?
This works:
second_set = set(second_list)
third_list = [value for value in first_list if value[:8] in second_set]
Example:
>>> first_list = ['abcdfghij', 'xyzxyzxyz', 'fjgjgggjhhh']
>>> second_list = ['abcdfghi', 'xyzxyzxy', 'xxx']
>>> second_set = set(second_list)
>>> third_list = [value for value in first_list if value[:8] in second_set]
>>> third_list
['abcdfghij', 'xyzxyzxyz']
This should be much more efficient.
The conversion of the list second_list into the set is O(n).
There is one loop over first_list, which is O(n), and each lookup in second_set is O(1).
Consider using a hash set, which in Python is the built-in set type.
The nice thing about a hash set is that it can check membership very fast (O(1)), in your case improving runtime by a factor of up to 5000 compared with scanning the 5000-element list once per element of the first list.
Create a new list whose elements are taken from first_List, provided that each element's initial part (8 characters) is present in second_List:
third_List = [x for x in first_List if x[:8] in second_List]
This approach should be optimized by using second_Set instead of second_List:
second_Set = set(second_List)
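With the set in place, the lookup line becomes (the natural completion of this suggestion, matching the accepted answer above):
third_List = [x for x in first_List if x[:8] in second_Set]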

How to count number of unique lists within list?

I've tried using Counter and itertools, but since a list is unhashable, they don't work.
My data looks like this: [ [1,2,3], [2,3,4], [1,2,3] ]
I would like to know that the list [1,2,3] appears twice, but I can't figure out how to do this. I was thinking of just converting each list to a tuple, then hashing with that. Is there a better way?
>>> from collections import Counter
>>> li=[ [1,2,3], [2,3,4], [1,2,3] ]
>>> Counter(str(e) for e in li)
Counter({'[1, 2, 3]': 2, '[2, 3, 4]': 1})
The method that you state also works, as long as there are no nested mutables in any sublist (such as [ [1,2,3], [2,3,4,[11,12]], [1,2,3] ]):
>>> Counter(tuple(e) for e in li)
Counter({(1, 2, 3): 2, (2, 3, 4): 1})
If you do have other unhashable types nested in the sublists, use the str or repr approach, since that deals with all sublists as well. Or recursively convert everything to tuples (more work; see the sketch below).
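A minimal sketch of that recursive conversion (my own illustration; the helper name is hypothetical):
from collections import Counter

def to_hashable(obj):
    # Recursively turn lists into tuples so nested lists become hashable.
    if isinstance(obj, list):
        return tuple(to_hashable(item) for item in obj)
    return obj

li = [[1, 2, 3], [2, 3, [4, [11, 12]]], [1, 2, 3]]
print(Counter(to_hashable(e) for e in li))
# Counter({(1, 2, 3): 2, (2, 3, (4, (11, 12))): 1})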
ll = [ [1,2,3], [2,3,4], [1,2,3] ]
print(len(set(map(tuple, ll))))
Also, if you wanted to count the occurrences of a unique* list:
print(ll.count([1,2,3]))
(*value-unique, not reference-unique)
I think using the Counter class on tuples, like
Counter(tuple(item) for item in li)
will be optimal in terms of elegance and "Pythonicity": it's probably the shortest solution, it's perfectly clear what you want to achieve and how it's done, and it uses and combines standard methods (thus avoiding reinventing the wheel).
The only performance drawback I can see is that every element has to be converted to a tuple (in order to be hashable), which more or less means that all elements of all sublists have to be copied once. Also, the internal hash function on tuples may be suboptimal if you know that the list elements will always be, e.g., integers.
In order to improve on performance, you would have to:
1) Implement some kind of hash algorithm working directly on lists (more or less reimplementing the hashing of tuples, but for lists)
2) Somehow reimplement the Counter class to use this hash algorithm and provide some suitable output (this class would probably use a dictionary with the hash values as keys and a combination of the "original" list and the count as values)
At least the first step would need to be done in C/C++ in order to match the speed of the internal hash function. If you know the type of the list elements, you could probably improve the performance even further.
As for the Counter class, I do not know whether its standard implementation is in Python or in C; if the latter, you'll probably also have to reimplement it in C to achieve the same (or better) performance.
So the question "Is there a better solution" cannot be answered (as always) without knowing your specific requirements.
lists = [[1, 2, 3], [2, 3, 4], [1, 2, 3]]  # renamed from `list` to avoid shadowing the builtin
repeats = []
unique = 0
for i in lists:
    count = 0
    if i not in repeats:
        for i2 in lists:
            if i == i2:
                count += 1
    if count > 1:
        repeats.append(i)
    elif count == 1:
        unique += 1
print("Repeated Items")
for r in repeats:
    print(r, end=' ')
print("\nUnique items:", unique)
This loops through the list to find repeated sequences, skipping items that have already been detected as repeats; it adds those to the repeats list while counting the number of unique lists.

Best way to iterate through unknown number of lists in general?

Given a programming language that supports iteration through lists i.e.
for element in list do
...
If we have a program that takes a dynamic number of lists as input, list[1] ... list[n] (where n can take any value), what is the best way to iterate through every combination of elements in these lists?
e.g. list[1] = [1,2], list[2] = [1,3] then we iterate through [[1,1], [1,3], [2,1], [2,3]].
My ideas that I don't think are very good:
1) Create a big product of these lists into list_product (e.g. in Python you could use itertools.product() multiple times) and then iterate over list_product. Problem is that this requires us to store a (potentially huge) iterable.
2) Find the product of the lengths of all the lists, total_length, and decode each number in [0, total_length) into one combination of indices using a modular-arithmetic (mixed radix) idea, along these lines:
lists = [[1, 2], [1, 3]]  # works for any number of lists
n = len(lists)
len_lists = [len(lst) for lst in lists]
total_length = 1
for length in len_lists:
    total_length *= length
for i in range(total_length):
    total = i
    list_index = [0] * n
    # Peel off one index "digit" per list (mixed radix), least significant digit first.
    for j in range(n - 1, -1, -1):
        list_index[j] = total % len_lists[j]
        total //= len_lists[j]
    print(list_index)
where the printed list_index values run through all the different combinations of positions.
Is there a better way with regard to speed (I don't care so much about readability)?
1) Create a big product of these lists into list_product (e.g. in Python you could use itertools.product() multiple times) and then iterate over list_product. Problem is that this requires us to store a (potentially huge) iterable.
The point of itertools (and iterators in general) is that they do not construct their entire result at once, but create and return items of the result one at a time. So if you have a list of lists, ListOfLists, and you want all tuples containing one element from each list in it, use:
for elt in itertools.product(*ListOfLists):
    ...
Note that you only need to call product once. It's simple and efficient.
You can use itertools.product without needing to materialize a list:
>>> from itertools import product
>>> lol = [[1,2],[1,3]]
>>> product(*lol)
<itertools.product object at 0xaa414b4>
>>> for x in product(*lol):
...     print(x)
...
(1, 1)
(1, 3)
(2, 1)
(2, 3)
As for performance, it's very easy to spend more time thinking of ways to optimize this than you can ever hope to gain from the optimizations. If you're doing anything inside the loop at all, then it's pretty likely that the iteration overhead itself is negligible. (The most common exception is a tight numerical loop, in which case you should try to do it numpythonically instead.)
My advice would be to use itertools.product and get on with your day.
