Input: A list of lists of various positions.
[['61097', '12204947'],
['61097', '239293'],
['61794', '37020977'],
['61794', '63243'],
['63243', '5380636']]
Output: A sorted list containing the count of unique numbers in each merged list.
[4, 3, 3, 3, 3]
The idea is fairly simple: I have a list of lists where each inner list contains a variable number of positions (in our example there are only 2 in each list, but lists of up to 10 exist). I want to loop through each list, and if ANY other list contains the same number, that list gets appended to the original list.
Example: Taking the input data from above and using the following code:
import itertools
from IPython.display import clear_output, display

def gen_haplotype_blocks(df):
    counts = []
    for i in range(len(df)):
        # collect every list that shares at least one position with df[i]
        my_list = [item for item in df if any(x in item for x in df[i])]
        my_list = list(itertools.chain.from_iterable(my_list))
        uniq_counts = len(set(my_list))
        counts.append(uniq_counts)
        clear_output()
        display('Currently Running ' + str(i))
    return sorted(counts, reverse=True)
I get the output that is expected. In this case, when I loop through the first list ['61097', '12204947'] I find that my second list ['61097', '239293'] also contains '61097', so these two lists get concatenated together and form ['61097', '12204947', '61097', '239293']. This is done for every single list, producing the following:
['61097', '12204947', '61097', '239293']
['61097', '12204947', '61097', '239293']
['61794', '37020977', '61794', '63243']
['61794', '37020977', '61794', '63243', '63243', '5380636']
['61794', '63243', '63243', '5380636']
Once this list is complete, I then count the number of unique values in each list, append that to another list, then sort the final list and return that.
So in the case of ['61097', '12204947', '61097', '239293'], we have two '61097', one '12204947' and one '239293', which gives 3 unique numbers.
While my code works, it is VERY slow: after running for nearly two hours it is still only at row ~44k.
I am looking for a way to speed up this function considerably, preferably without changing the original data structure. I am very new to Python.
Thanks in advance!
To considerably improve the speed of your program, especially for a larger data set, the key is to use a hash table (a dictionary, in Python terms) that stores each distinct number as a key and the lines that number appears in as the value. Then, in a second pass, merge the lists for each line based on the dictionary and count the unique elements.
def gen_haplotype_blocks(input):
    # first pass: map each number to the indices of the lines it appears in
    unique_numbers = {}
    for i, numbers in enumerate(input):
        for number in numbers:
            if number in unique_numbers:
                unique_numbers[number].append(i)
            else:
                unique_numbers[number] = [i]

    # second pass: for each line, merge every line that shares a number with it
    output = [[] for _ in range(len(input))]
    for i, numbers in enumerate(input):
        for number in numbers:
            for line in unique_numbers[number]:
                output[i] += input[line]

    counts = [len(set(x)) for x in output]
    return sorted(counts, reverse=True)
In theory, the time complexity of your algorithm is O(N*N), where N is the size of the input list, because you need to compare each list with every other list. With this approach the complexity is O(N), which should be considerably faster for a larger data set. The trade-off is extra space.
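To make the first pass concrete, here is roughly what unique_numbers looks like for the sample input (a sketch of the intermediate state; the values are 0-based indices into the input list):

# Sketch: unique_numbers after the first pass over the sample input
unique_numbers = {
    '61097':    [0, 1],   # shared by lines 0 and 1
    '12204947': [0],
    '239293':   [1],
    '61794':    [2, 3],
    '37020977': [2],
    '63243':    [3, 4],
    '5380636':  [4],
}
# Line 3 then pulls in lines 2, 3 and 4, giving the 4 unique numbers seen in the expected output.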
Not sure how much you expect by saying "considerably", but converting your inner lists to sets from the beginning should speed things up. The following runs approximately 2.5x faster in my testing:
def gen_haplotype_blocks_improved(df):
    df_set = [set(d) for d in df]
    counts = []
    for d1 in df_set:
        row = d1
        for d2 in df_set:
            if d1.intersection(d2) and d1 != d2:
                row = row.union(d2)
        counts.append(len(row))
    return sorted(counts, reverse=True)
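As a quick sanity check, running the improved function on the sample data from the question should reproduce the expected output:

df = [['61097', '12204947'],
      ['61097', '239293'],
      ['61794', '37020977'],
      ['61794', '63243'],
      ['63243', '5380636']]

print(gen_haplotype_blocks_improved(df))  # [4, 3, 3, 3, 3]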
Related
Say I am continuously generating new data (e.g. integers) and want to collect them in a list.
import random

lst = []
for _ in range(50):
    num = random.randint(0, 10)
    lst.append(num)
When a new value is generated, I want it to be positioned in the list based on the count of occurrences of that value, so data with lower "current occurrence" should be placed before those with higher "current occurrence".
"Current occurrence" means "the number of duplicates of that data that have already been collected so far, up to this iteration". For the data that have the same occurrence, they should then follow the order in which they are generated.
For example, if at iteration 10 the current list is [1,2,3,4,2,3,4,3,4] and a new value 1 is generated, it should be inserted at index 7, giving [1,2,3,4,2,3,4,1,3,4]. Because this is the second occurrence of 1, it should be placed after all the values that occur only once and after the other existing items that already occur twice (2, 3 and 4, hence preserving the order), but before any items that occur three times.
This is my current code that can rearrange the list:
from collections import defaultdict

def rearrange(lst):
    d = defaultdict(list)
    count = defaultdict(int)
    for x in lst:
        count[x] += 1
        d[count[x]].append(x)
    res = []
    for k in sorted(d.keys()):
        res += d[k]
    return res

lst = rearrange(lst)
However, this is not giving my expected result.
I wrote a separate algorithm that keeps generating new data until some convergence criterion is met, where the list has the potential to become extremely large.
Therefore I want to rearrange my generated values on-the-fly, i.e. to constantly insert data into the list "in-place". Of course I could call my rearrange function in each iteration, but that would be super inefficient. What I want to do is insert new data into the correct position of the list, not replace the list with a new one in each iteration.
Any suggestions?
Edit: the data structure doesn't necessarily need to be a list, but it has to be ordered, and doesn't require another data structure to hold information.
The data structure that I think might work better for your purpose is a forest (in this case, a disjoint union of lists).
In summary, you keep one internal list for each occurrence level of the values. When a new value comes in, you add it to the list just after the one where that value's previous occurrence was placed.
In order to keep track of the counts of occurrences, you can use a built-in Counter.
Here is a sample implementation:
from collections import Counter

def rearranged(iterable):
    forest, counter = list(), Counter()
    for x in iterable:
        c = counter[x]              # how many times x has been seen so far
        if c == len(forest):
            forest.append([x])      # first value to reach this occurrence level
        else:
            forest[c] += [x]
        counter[x] += 1
    return [x for lst in forest for x in lst]
rearranged([1,2,3,4,2,3,4,3,4,1])
# [1, 2, 3, 4, 2, 3, 4, 1, 3, 4]
For this to work better, your input iterable should be a generator (so the items can be generated on the fly).
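For example, here is a minimal sketch of feeding rearranged from a generator; the stream helper and its parameters are made up for illustration:

import random

def stream(n=50):
    # hypothetical on-the-fly data source
    for _ in range(n):
        yield random.randint(0, 10)

print(rearranged(stream()))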
I have a very long lst containing unique elements. I want to design a function which takes a list of elements as input and efficiently returns a list of their indices. We can assume that all the items whose indices we need are present in lst.
Here is an example:
lst = ['ab','sd','ef','de']
items_to_find = ['sd', 'ef', 'sd']
>>> fo(lst, items_to_find)
# Output: [1,2,1]
I have one solution of my own, but it looks inefficient:
>>> [lst.index(x) for x in items_to_find]
Because the lst is very long, I need a very fast algorithm to solve it.
First create a dictionary containing the index location of each item in the list (you state that all items are unique, hence there is no issue with duplicate keys).
Then use the dictionary to look up each item's index location, which has an average time complexity of O(1).
my_list = ['ab', 'sd', 'ef', 'de']
d = {item: idx for idx, item in enumerate(my_list)}
items_to_find = ['sd', 'ef', 'sd']
>>> [d.get(item) for item in items_to_find]
[1, 2, 1]
You could use a dictionary with the elements from lst as the keys and their indices as the values. Lookup in a dictionary is O(1).
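A minimal sketch of that idea, using the example data from the question:

lst = ['ab', 'sd', 'ef', 'de']
items_to_find = ['sd', 'ef', 'sd']

index_of = {value: i for i, value in enumerate(lst)}  # built once, O(len(lst))
print([index_of[x] for x in items_to_find])           # [1, 2, 1]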
Although the answer you've accepted is very good, here's something that would be more memory efficient and is probably almost as fast. @Alexander's answer creates a potentially huge dictionary if the list is very long (since the elements in it are all unique).
The code below also builds a dictionary to speed up searching, but it's keyed on the target elements, so it is likely to be much smaller than the list being searched. For the sample data the one it creates (named targets) contains only: {'sd': [0, 2], 'ef': [1]}
It makes one pass through the sequence, checks whether each value in it is a target and, if so, updates the results list accordingly. This approach requires a little more code since the setup is slightly more involved, so that's another trade-off.
def find_indices(seq, elements):
    # map each target element to the position(s) it occupies in the query list
    targets = {}
    for index, element in enumerate(elements):
        targets.setdefault(element, []).append(index)

    indices = [None for _ in elements]  # pre-allocate the results list
    for location, value in enumerate(seq):
        if value in targets:
            for index in targets[value]:
                indices[index] = location
    return indices
lst = ['ab', 'sd', 'ef', 'de']
indices = find_indices(lst, ['sd', 'ef', 'sd'])
print(indices) # -> [1, 2, 1]
A simple first approximation...
def get_indices(data_list, query_list):
    datum_index_mapping = {datum: None for datum in query_list}
    for index, datum in enumerate(data_list):
        if datum in datum_index_mapping:
            datum_index_mapping[datum] = index
    return [datum_index_mapping[d] for d in query_list]
The above is the most simple, intuitive solution which makes some effort to be efficient (by only bothering to store a dictionary of indices for the elements you actually want to look up).
However, it suffers from the fact that, even if the initial query list is very short, it will iterate through the entire data list or data generator. In addition, it has to do a dictionary write every time it sees a value it has seen before. The version below fixes those inefficiencies, although it adds the overhead of a set, so it must do a set write for each unique element in the query list, as well as a dictionary write for each unique element in the query list.
def get_indices(data_list, query_list):
    not_found = set(query_list)
    datum_index_mapping = {}
    for index, datum in enumerate(data_list):
        if datum in not_found:
            datum_index_mapping[datum] = index
            not_found.remove(datum)
            if len(not_found) == 0:
                break
    return [datum_index_mapping[d] for d in query_list]
Obviously, depending on your program, you may not actually want to have a list of indices at all, but simply have your function return the mapping.
If you'll be resolving multiple arbitrary query lists, you may want to simply do an enumerate() on the original dataset as other answers have shown and keep the dictionary that maps values to indices in memory as well for query purposes.
What counts as efficient often depends upon the larger program; all we can do here are pigeonhole optimizations. It also depends on the memory hierarchy and the available processing power (can we parallelize? Is compute more expensive, or is memory more expensive? What's the I/O hit if we need to fall back to swap?).
If you are sure that all the searched values actually exist in the list being searched, and lst is sorted (of course, the sorting itself might take some time), you can do it in one pass (linear complexity):
def sortedindex(lst, find):
    find.sort()
    indices = []
    start = 0
    for item in find:
        start = lst.index(item, start)
        indices.append(start)
    return indices
The "start" shows the first index where the algorithm starts comparing the inspected item to the item in the main list. When the correct index is found, it will become the next starting mark. Because both lists are sorted in the same way, you do not have to worry that you skipped any of the next items.
Description
I have two lists of lists which are derived from CSVs (minimal working example below). The real dataset is too large to do this manually.
mainlist = [["MH75","QF12",0,38], ["JQ59","QR21",105,191], ["JQ61","SQ48",186,284], ["SQ84","QF36",0,123], ["GA55","VA63",80,245], ["MH98","CX12",171,263]]
replacelist = [["MH75","QF12","BA89","QR29"], ["QR21","JQ59","VA51","MH52"], ["GA55","VA63","MH19","CX84"], ["SQ84","QF36","SQ08","JQ65"], ["SQ48","JQ61","QF87","QF63"], ["MH98","CX12","GA34","GA60"]]
mainlist contains a pair of identifiers (mainlist[x][0], mainlist[x][1]) and these are associated with two integers (mainlist[x][2] and mainlist[x][3]).
replacelist is a second list of lists which also contains the same pairs of identifiers (but not necessarily in the same order within a pair, or across rows). All sublist pairs are unique. Importantly, replacelist[x][2] and replacelist[x][3] are the replacements for replacelist[x][0] and replacelist[x][1], respectively.
I need to create a new third list, newlist, which copies mainlist but replaces the identifiers with those from replacelist[x][2] and replacelist[x][3].
For example, given:
mainlist[2] is: ['JQ61', 'SQ48', 186, 284]
The matching pair in replacelist is
replacelist[4]: ['SQ48', 'JQ61', 'QF87', 'QF63']
Therefore the expected output is
newlist[2] = ['QF87', 'QF63', 186, 284]
More clearly put:
if replacelist = [[A, B, C, D]],
A is replaced with C, and B is replaced with D,
but the pair may appear in mainlist as [[B, A]].
Note that newlist rows keep the same positions as their counterparts in mainlist.
Attempt
What has me totally stumped on a simple problem is that I feel I can't use a basic list comprehension like [i for i in replacelist if i in mainlist], since the order within a pair changes, and if I sorted the lists I would lose the information about what to replace them with. Current solution (with a commented blank):
newlist = []
for k in replacelist:
    for i in mainlist:
        if k[0] in i and k[1] in i:
            # retrieve mainlist order, then use some kind of indexing to check a series
            # of nested if statements to work out positional replacement.
            pass
As you can see, this solution is clearly inefficient and I can't work out the best way to perform the final step in a few lines.
I can add more information if this is not clear
It'll help to have replacelist as a dict:
mainlist = [["MH75","QF12",0,38], ["JQ59","QR21",105,191], ["JQ61","SQ48",186,284],
            ["SQ84","QF36",0,123], ["GA55","VA63",80,245], ["MH98","CX12",171,263]]
replacelist = [["MH75","QF12","BA89","QR29"], ["QR21","JQ59","VA51","MH52"],
               ["GA55","VA63","MH19","CX84"], ["SQ84","QF36","SQ08","JQ65"],
               ["SQ48","JQ61","QF87","QF63"], ["MH98","CX12","GA34","GA60"]]

# map each unordered identifier pair to a per-identifier replacement lookup
replacements = {frozenset(r[:2]): dict(zip(r[:2], r[2:])) for r in replacelist}

newlist = []
for *ids, val1, val2 in mainlist:
    reps = replacements[frozenset(ids)]
    newlist.append([reps[ids[0]], reps[ids[1]], val1, val2])
First thing to do: transform both lists into dictionaries:
from collections import OrderedDict

maindct = OrderedDict((frozenset(item[:2]), item[2:]) for item in mainlist)
replacedct = {frozenset(item[:2]): item[2:] for item in replacelist}

# Now it is trivial to create the desired output:
output_list = [replacedct[key] + maindct[key] for key in maindct]
The big gain here is that by using a dictionary you cut out the search time for the indices in the replacement list: with a list you have to scan the whole list for each item you have, which makes performance degrade with the square of the list length. With Python dictionaries, the lookup time is essentially constant and does not depend on the data length at all.
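For the sample data from the question (with maindct and replacedct built as above), this sketch would print something like:

print(output_list)
# [['BA89', 'QR29', 0, 38],
#  ['VA51', 'MH52', 105, 191],
#  ['QF87', 'QF63', 186, 284],
#  ['SQ08', 'JQ65', 0, 123],
#  ['MH19', 'CX84', 80, 245],
#  ['GA34', 'GA60', 171, 263]]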
I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict

diffs = defaultdict(int)

list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]

for pair in list1:
    diffs[pair[0]] = pair[1]
for pair in list2:
    diffs[pair[0]] -= pair[1]

differences = [[k, abs(v)] for k, v in diffs.items()]
print(differences)
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
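If a dictionary result is acceptable, the final conversion step can simply be dropped; a minimal sketch reusing the diffs mapping built above:

# keep the result as a dict instead of building a list of pairs
differences = {k: abs(v) for k, v in diffs.items()}
print(differences)  # {'ID_001': 500, 'ID_002': 1000, 'ID_003': 2000}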
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts both lists with sorted (not list.sort(), which sorts in place) so that matching IDs line up, and then pairs the corresponding entries, e.g. ['ID_001', 1000] with ['ID_001', 500], by feeding the sorted lists to zip.
Finally:
(i[0], abs(i[1] - j[1]))
returns i[0], the ID of each entry, while abs(i[1] - j[1]) computes the absolute difference of the two integers. These are added as a tuple to the final result list (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data, though as far as I'm aware that depends on how disordered the data is.
Other than that, zip creates an iterator, so memory-wise it doesn't affect you. Speed-wise, list comprehensions tend to be quite efficient and in most cases are your best option.
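With the two example lists from the question, the comprehension gives the expected differences, though as tuples rather than inner lists:

list1 = [['ID_001', 1000], ['ID_002', 2000], ['ID_003', 3000]]
list2 = [['ID_001', 500], ['ID_003', 1000], ['ID_002', 1000]]

diff = [(i[0], abs(i[1] - j[1])) for i, j in zip(sorted(list1), sorted(list2))]
print(diff)  # [('ID_001', 500), ('ID_002', 1000), ('ID_003', 2000)]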
I've got a nested list and I'd like to check whether i is contained on the lowest level of my list (i is the first of two elements of one "sublist").
1) Is there a direct way to do this?
2) I tried the following:
for i in randomlist:
    if [i, randomlist.count(i)] in list1:
Is there a way to replace randomlist.count(i) with a wildcard? I tried *, %, ..., but none of these worked well. Any ideas?
Thanks in advance!
I think what you want is:
if any(l[0] == i for l in list1):
This will only check the first item in each sub-list, which is effectively the same as having a wild-card second element.
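For example, here is a small self-contained sketch; the nested list and the value i are only for illustration:

list1 = [[86, 4], [67, 1], [89, 1]]  # hypothetical nested list of [number, count] pairs
i = 67

if any(l[0] == i for l in list1):
    print('{} is already present as a first element'.format(i))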
It seems that this is the actual problem: the input is a nested list with numbers and their counts in sublists, e.g. [[86, 4], [67, 1], [89, 1], ...], and what is needed is to know whether a number with its count is already in the list (in order not to add it a second time), but the count is unknown during the for loop.
There are two ways to approach this problem. First, if the list does not have duplicates, simply convert it to a dictionary:
numbers = dict([[86,4],[67,1],[89,1]])
Now each number is a key, and its count is the value. Next, if you want to know whether a number is in the dictionary or not, you have many ways to do that:
# Fetch the number
try:
    count = numbers[14]
except KeyError:
    print('{} does not exist'.format(14))

# Another way to write the above is:
count = numbers.get(14)
if not count:
    print('{} does not exist'.format(14))

# From a list of numbers, add them only if they don't
# exist in the dictionary:
for i in list_of_numbers:
    if i not in numbers:
        numbers[i] = some_value
If there are already duplicates in the original list, you can still convert it into a dictionary but you need to do some extra work if you want to preserve all the values for the numbers:
from collections import defaultdict

numbers = defaultdict(list)
for key, value in original_list:
    numbers[key].append(value)
Now if you have duplicate numbers, all their values are stored in a list. You can still follow the same logic:
for i in new_numbers:
    numbers[i].append(new_value)
Except now if the number already existed, the new_value will just be added to the list of existing values.
Finally, if all you want to do is add to the list if the first number doesn't exist:
numbers = set(i[0] for i in original_list)
for i in new_numbers:
    if i not in numbers:
        original_list.append([i, some_value])
        numbers.add(i)  # keep the set in sync so the same number is not added twice