Find index for multiple elements in a long list - python

I have a very long lst containing unique elements. I want to design a function which takes a list of elements as the input and it can return a list of index efficiently. We assume the items needed to find their index are all in the lst.
Here is an example:
lst = ['ab','sd','ef','de']
items_to_find = ['sd', 'ef', 'sd']
>>> fo(lst, items_to_find)
# Output: [1,2,1]
I have one solution of my own, but it looks less efficient.
>> [lst.index(x) for x in items_to_find]
Because the lst is very long, I need a very fast algorithm to solve it.

First create a dictionary containing in the index location of each item in the list (you state that all items are unique, hence no issue with duplicate keys).
Then use the dictionary to look up each item's index location which is average time complexity O(1).
my_list = ['ab', 'sd', 'ef', 'de']
d = {item: idx for idx, item in enumerate(my_list)}
items_to_find = ['sd', 'ef', 'sd']
>>> [d.get(item) for item in items_to_find]
[1, 2, 1]

You could use a dictionary with elements from lst as the key and index and as the value. Search in a dictionary is O(1).

Although the answer you've accepted is very good, here's something that would be more memory efficient and is probably almost as fast. However #Alexander's answer creates a potentially huge dictionary if the list is very long (since the elements in it are all unique).
The code below also builds a dictionary to speed up searching, but it's for the target elements so is likely to be much smaller than the list being searched. For the sample data the one it creates (named targets) contains only: {'sd': [0, 2], 'ef': [1]}
It one pass through the sequence and checks each of the values in it are targets and, if so, updates the results list according. This approach requires a little more code to implement since the setup is slightly more involved, so that's another trade-off.
def find_indices(seq, elements):
targets = {}
for index, element in enumerate(elements):
targets.setdefault(element, []).append(index)
indices = [None for _ in elements] # Pre-allocate.
for location, value in enumerate(seq):
if value in targets:
for element, indexes in targets.items():
if element == value:
for index in indexes:
indices[index] = location
return indices
lst = ['ab', 'sd', 'ef', 'de']
indices = find_indices(lst, ['sd', 'ef', 'sd'])
print(indices) # -> [1, 2, 1]

A simple first approximation...
def get_indices(data_list, query_list):
datum_index_mapping = {datum:None for datum in query_list}
for index, datum in enumerate(data_list):
if datum in datum_index_mapping:
datum_index_mapping[datum] = index
return [datum_index_mapping[d] for d in query_list]
The above is the most simple, intuitive solution which makes some effort to be efficient (by only bothering to store a dictionary of indices for the elements you actually want to look up).
However, it suffers from the fact that- even if the initial query list is very short- it'll iterate through the entire data list / data generator. In addition, it has to do a dictionary write every time it sees a value it's seen before. The below fixes those inefficiencies, although it adds the overhead of a set, so it must do a set write for each unique element in the query list, as well as a dictionary write for each unique element in the query list.
def get_indices(data_list, query_list):
not_found = set(query_list)
datum_index_mapping = {}
for index, datum in enumerate(data_list):
if datum in not_found:
datum_index_mapping[datum] = index
not_found.remove(datum)
if len(not_found) == 0:
break
return [datum_index_mapping[d] for d in query_list]
Obviously, depending on your program, you may not actually want to have a list of indices at all, but simply have your function return the mapping.
If you'll be resolving multiple arbitrary query lists, you may want to simply do an enumerate() on the original dataset as other answers have shown and keep the dictionary that maps values to indices in memory as well for query purposes.
What counts as efficient often depends upon the larger program; all we can do here are pigeonhole optimizations. It also depends on whether the memory hierarchy and processing power (i.e. can we parallelize? Is compute more expensive, or is memory more expensive? What's the I/O hit if we need to fallback to swap?).

If you are sure all the searched values actually exist in the searching list and the lst is sorted (of course, the sorting itself might take some time), you can do that in one pass (linear complexity):
def sortedindex(lst,find):
find.sort()
indices = []
start = 0
for item in find:
start = lst.index(item,start)
indices.append(start)
return indices
The "start" shows the first index where the algorithm starts comparing the inspected item to the item in the main list. When the correct index is found, it will become the next starting mark. Because both lists are sorted in the same way, you do not have to worry that you skipped any of the next items.

Related

How to know when to refer an item in a for loop by itself vs. referring to an item as an index of the list

In a for loop, I'm trying to understand when to refer to an item by its item name and when to refer to the item as an index of the list I'm looping through.
In the code pasted below, I don't understand why "idx" is referred to in the "if" statement with a reference to the list index but then in the definition of maximum_score_index, it is referred to by itself.
def linear_search(search_list):
maximum_score_index = None
for **idx** in range(len(search_list)):
if not maximum_score_index or **search_list[idx]** > search_list[maximum_score_index]:
maximum_score_index = **idx**
return maximum_score_index
I'd love to have an explanation so I can differentiate in the future and some examples to show the difference so I can understand.
In Python, range(num) (more or less) returns a list of numbers from 0 through num - 1. It follows that range(len(my_list)) will generate a list of numbers from 0 through the length of my_list minus one. This is frequently useful, because the generated numbers are the indices of each item in my_list (Python lists start counting at 0). For example, range(len(["a", "b", "c"])) is [0, 1, 2], the indices needed to access each item in the original list. ["a", "b", "c"][0] is "a", and so on.
In Python, the for x in mylist loop iterates through each item in mylist, setting x to the value of each item in order. One common pattern for Python for loops is the for x in range(len(my_list)). This is useful, because you loop through the indices of each list item instead of the values themselves. It's almost as easy to access the values (just use my_list[x]) but it's much easier to do things like access the preceding value (just use my_list[x-1], much simpler than it would be if you didn't have the index!).
In your example, idx is tracking the index of each list item as the program iterates through search_list. In order to retrieve values from search_list, the program uses search_list[idx], much like I used my_list[x] in my example. The code then assigns maximum_score_index to the index itself, a number like 0, 1, or 2, rather than the value. It's still easy to find out what the maximum score is, with search_list[maximum_score_index]. The reason idx is not being used as a list accessor in the second case is because the program is storing the index itself, not the value of the array at that index.
Basically, this line:
if not maximum_score_index or **search_list[idx]** > search_list[maximum_score_index]:
maximum_score_index = **idx**
Can be thought of as:
if (this is the first pass) or (element at index > this loop-iteration element):
keep this index as largest element
What I recommend to do:
Go through the code, on a piece of paper and iterate over a list to
see what the code does
Write the code in any IDE, and use a debugger
to see what the code does
Are you looking for the index of the highest element in the list or the value?
If you are looking for the value, it can be as simple as:
highest = max(search_list)
You could also use enumerate, which will grant you "free" access to the current index in the loop:
>>> search_list
[10, 15, 5, 3]
>>> maximum_score_index = None
>>> for idx, value in enumerate(search_list):
... if not maximum_score_index or search_list[idx] > value:
... maximum_score_index = idx
...
>>> maximum_score_index
1
>>> search_list[maximum_score_index]
15

Need help speeding up this function

Input: A list of lists of various positions.
[['61097', '12204947'],
['61097', '239293'],
['61794', '37020977'],
['61794', '63243'],
['63243', '5380636']]
Output: A sorted list that contains the count of unique numbers in a list.
[4, 3, 3, 3, 3]
The idea is fairly simple, I have a list of lists where each list contains a variable number of positions (in our example there is only 2 in each list, but lists of up to 10 exist). I want to loop through each list and if there exists ANY other list that contains the same number then that list gets appended to the original list.
Example: Taking the input data from above and using the following code:
def gen_haplotype_blocks(df):
counts = []
for i in range(len(df)):
my_list = [item for item in df if any(x in item for x in df[i])]
my_list = list(itertools.chain.from_iterable(my_list))
uniq_counts = len(set(my_list))
counts.append(uniq_counts)
clear_output()
display('Currently Running ' +str(i))
return sorted(counts, reverse=True)
I get the output that is expected. In this case when I loop through the first list ['61097', '12204947'] I find that my second list ['61097', '239293'] both contain '61097' so these who lists get concatenated together and form ['61097', '12204947', '61097', '239293']. This is done for every single list outputting the following:
['61097', '12204947', '61097', '239293']
['61097', '12204947', '61097', '239293']
['61794', '37020977', '61794', '63243']
['61794', '37020977', '61794', '63243', '63243', '5380636']
['61794', '63243', '63243', '5380636']
Once this list is complete, I then count the number of unique values in each list, append that to another list, then sort the final list and return that.
So in the case of ['61097', '12204947', '61097', '239293'], we have two '61097', one '12204947' and one '239293' which equals to 3 unique numbers.
While my code works, it is VERY slow. Running for nearly two hours and still only on line ~44k.
I am looking for a way to speed up this function considerably. Preferably without changing the original data structure. I am very new to python.
Thanks in advance!
Too considerably improve the speed of your program, especially for larger data set. The key is to use a hash table, or a dictionary in Python's term, to store different numbers as the key, and the lines each unique number exist as value. Then in the second pass, merge the lists for each line based on the dictionary and count unique elements.
def gen_haplotype_blocks(input):
unique_numbers = {}
for i, numbers in enumerate(input):
for number in numbers:
if number in unique_numbers:
unique_numbers[number].append(i)
else:
unique_numbers[number] = [i]
output = [[] for _ in range(len(input))]
for i, numbers in enumerate(input):
for number in numbers:
for line in unique_numbers[number]:
output[i] += input[line]
counts = [len(set(x)) for x in output]
return sorted(counts, reverse=True)
In theory, the time complexity of your algorithm is O(N*N), N as the size of the input list. Because you need to compare each list with all other lists. But in this approach the complexity is O(N), which should be considerably faster for a larger data set. And the trade-off is extra space complexity.
Not sure how much you expect by saying "considerably", but converting your inner lists to sets from the beginning should speed up things. The following works approximately 2.5x faster in my testing:
def gen_haplotype_blocks_improved(df):
df_set = [set(d) for d in df]
counts = []
for d1 in df_set:
row = d1
for d2 in df_set:
if d1.intersection(d2) and d1 != d2:
row = row.union(d2)
counts.append(len(row))
return sorted(counts, reverse=True)

2d-list calculations

I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict
diffs = defaultdict(int)
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
for pair in list1:
diffs[pair[0]] = pair[1]
for pair in list2:
diffs[pair[0]] -= pair[1]
differences = [[k,abs(v)] for k,v in diffs.items()]
print(differences)
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts the lists with sorted in order for the order to be similar (not with list.sort() which sorts in place) and then, it creates tuples containing each entry in the lists ['ID_001', 1000], ['ID_001', 500] by feeding the sorted lists to zip.
Finally:
(i[0], abs(i[1] - j[1]))
returns i[0] indicating the ID for each entry and abs(i[1] - j[1]) computes their absolute difference. There are added as a tuple in the final list result (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data, but that depends on how disorganized the data is from what I'm aware.
Other than that, zip creates an iterator so memory wise it doesn't affect you. Speed wise, list comps tend to be quite efficient and in most cases are your best options.

Efficient way to loop on comparing string list element to a string list sub-element

I am currently struggling to find an efficient way to compare part of a string element attached to a list, to another string element. The current code computation is very long (1 hour with 4,8 millions elements in first list and 5000 elements in second one).
What I need to do: If 8 first characters of the first string element is equal to the full second element, a third list is updated with the full first element. Once it is found, we test another element of the first list.
Here is the code:
for first_element in first_List :
for second_element in second_List:
if first_element[:8] == second_element :
third_List.append(first_element)
break
I know those kinds of loops are not the best way to deal with very big lists. The number of if tests is really huge.
I was wondering if there is an efficient way to do this.
I think intersection with sets won't work since I'm comparing a part of an element to a full one and I need to copy the full first element in a third list.
Do you have some suggestions or ideas please?
This works:
second_set = set(second_list)
third_list = [value for value in first_list if value[:8] in second_set]
Example:
>>> first_list = ['abcdfghij', 'xyzxyzxyz', 'fjgjgggjhhh']
>>> second_list = ['abcdfghi', 'xyzxyzxy', 'xxx']
>>> second_set = set(second_list)
>>> third_list = [value for value in first_list if value[:8] in second_set]
>>> third_list
['abcdfghij', 'xyzxyzxyz']
This should be much more efficient.
The conversion of the list second_list into the set is O(n).
There is one loop over first_list that is O(n). The lookup in the set, i.e. in second_set is O(1).
Consider using a hash set, or just Set in python.
The nice thing about a hash set is that it can check if an element is in the set very fast (O(1)), in your case improving runtime by a factor of up to 5000 over the O(n) solution of iterating though the list every time.
Create a new list whose elements are taken from the first_List provided that its initial part (8 characters) is present in the second_List:
third_List = [x for x in first_List if x[:8] in second_List]
This approach should be optimized by using second_Set instead of second_List:
second_Set = set(second_List)

Python list index splitting and manipulation

My question seems simple, but for a novice to python like myself this is starting to get too complex for me to get, so here's the situation:
I need to take a list such as:
L = [(a, b, c), (d, e, d), (etc, etc, etc), (etc, etc, etc)]
and make each index an individual list so that I may pull elements from each index specifically. The problem is that the list I am actually working with contains hundreds of indices such as the ones above and I cannot make something like:
L_new = list(L['insert specific index here'])
for each one as that would mean filling up the memory with hundreds of lists corresponding to individual indices of the first list and would be far too time and memory consuming from my point of view. So my question is this, how can I separate those indices and then pull individual parts from them without needing to create hundreds of individual lists (at least to the point where I wont need hundreds of individual lines to create them).
I might be misreading your question, but I'm inclined to say that you don't actually have to do anything to be able to index your tuples. See my comment, but: L[0][0] will give "a", L[0][1] will give "b", L[2][1] will give "etc" etc...
If you really want a clean way to turn this into a list of lists you could use a list comprehension:
cast = [list(entry) for entry in L]
In response to your comment: if you want to access across dimensions I would suggest list comprehension. For your comment specifically:
crosscut = [entry[0] for entry in L]
In response to comment 2: This is largely a part of a really useful operation called slicing. Specifically to do the referenced operation you would do this:
multiple_index = [entry[0:3] for entry in L]
Depending on your readability preferences there are actually a number of possibilities here:
list_of_lists = []
for sublist in L:
list_of_lists.append(list(sublist))
iterator = iter(L)
for i in range(0,iterator.__length_hint__()):
return list(iterator.next())
# Or yield list(iterator.next()) if you want lazy evaluation
What you have there is a list of tuples, access them like a list of lists
L[3][2]
will get the second element from the 3rd tuple in your list L
Two way of using inner lists:
for index, sublist in enumerate(L):
# do something with sublist
pass
or with an iterator
iterator = iter(L)
sublist = L.next() # <-- yields the first sublist
in both case, sublist elements can be reached via
direct index
sublist[2]
iteration
iterator = iter(sublist)
iterator.next() # <-- yields first elem of sublist
for elem in sublist:
# do something with my elem
pass

Categories

Resources