I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict
diffs = defaultdict(int)
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
for pair in list1:
    diffs[pair[0]] = pair[1]
for pair in list2:
    diffs[pair[0]] -= pair[1]
differences = [[k,abs(v)] for k,v in diffs.items()]
print(differences)
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
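For reference, the dictionary-output variant is just the same code without the final list build (a sketch; the values keep their sign here, since abs() was only applied when building the list):
from collections import defaultdict
diffs = defaultdict(int)
for key, value in list1:
    diffs[key] = value
for key, value in list2:
    diffs[key] -= value
# dict(diffs) -> {'ID_001': 500, 'ID_002': 1000, 'ID_003': 2000}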
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts both lists with sorted so that matching entries line up (not with list.sort(), which sorts in place), and then feeds the sorted lists to zip, which pairs up the corresponding entries, e.g. ['ID_001', 1000] with ['ID_001', 500].
Finally:
(i[0], abs(i[1] - j[1]))
takes i[0], the ID of each entry, while abs(i[1] - j[1]) computes the absolute difference of their values. These are added as a tuple to the final result list (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data, but as far as I'm aware that depends on how disordered the data is.
Other than that, zip creates an iterator, so memory-wise it doesn't affect you. Speed-wise, list comprehensions tend to be quite efficient and in most cases are your best option.
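For the sample lists above this gives the expected result (note the entries come back as tuples rather than inner lists):
diff = [(i[0], abs(i[1] - j[1])) for i, j in zip(sorted(list1), sorted(list2))]
print(diff)  # [('ID_001', 500), ('ID_002', 1000), ('ID_003', 2000)]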
Input: A list of lists of various positions.
[['61097', '12204947'],
['61097', '239293'],
['61794', '37020977'],
['61794', '63243'],
['63243', '5380636']]
Output: A sorted list containing the count of unique numbers in each resulting list.
[4, 3, 3, 3, 3]
The idea is fairly simple, I have a list of lists where each list contains a variable number of positions (in our example there is only 2 in each list, but lists of up to 10 exist). I want to loop through each list and if there exists ANY other list that contains the same number then that list gets appended to the original list.
Example: Taking the input data from above and using the following code:
import itertools
from IPython.display import clear_output, display  # assuming this runs in a Jupyter notebook

def gen_haplotype_blocks(df):
    counts = []
    for i in range(len(df)):
        my_list = [item for item in df if any(x in item for x in df[i])]
        my_list = list(itertools.chain.from_iterable(my_list))
        uniq_counts = len(set(my_list))
        counts.append(uniq_counts)
        clear_output()
        display('Currently Running ' + str(i))
    return sorted(counts, reverse=True)
I get the output that is expected. In this case, when I loop through the first list ['61097', '12204947'], I find that the second list ['61097', '239293'] also contains '61097', so these two lists get concatenated together and form ['61097', '12204947', '61097', '239293']. This is done for every single list, outputting the following:
['61097', '12204947', '61097', '239293']
['61097', '12204947', '61097', '239293']
['61794', '37020977', '61794', '63243']
['61794', '37020977', '61794', '63243', '63243', '5380636']
['61794', '63243', '63243', '5380636']
Once this list is complete, I then count the number of unique values in each list, append that to another list, then sort the final list and return that.
So in the case of ['61097', '12204947', '61097', '239293'], we have two '61097', one '12204947' and one '239293', which gives 3 unique numbers.
While my code works, it is VERY slow. After running for nearly two hours it is still only at line ~44k.
I am looking for a way to speed up this function considerably. Preferably without changing the original data structure. I am very new to python.
Thanks in advance!
To considerably improve the speed of your program, especially for a larger data set, the key is to use a hash table, or a dictionary in Python terms, to store the different numbers as keys and the lines each unique number appears in as the values. Then, in a second pass, merge the lists for each line based on the dictionary and count the unique elements.
def gen_haplotype_blocks(input):
    # First pass: map each number to the list of line indices it appears in.
    unique_numbers = {}
    for i, numbers in enumerate(input):
        for number in numbers:
            if number in unique_numbers:
                unique_numbers[number].append(i)
            else:
                unique_numbers[number] = [i]

    # Second pass: for each line, merge in every line that shares a number with it.
    output = [[] for _ in range(len(input))]
    for i, numbers in enumerate(input):
        for number in numbers:
            for line in unique_numbers[number]:
                output[i] += input[line]

    counts = [len(set(x)) for x in output]
    return sorted(counts, reverse=True)
In theory, the time complexity of your algorithm is O(N*N), with N the size of the input list, because you need to compare each list with all the other lists. With this approach the complexity is O(N), which should be considerably faster for a larger data set. The trade-off is extra space complexity.
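As a quick sanity check, running this on the sample input from the question gives the expected result:
data = [['61097', '12204947'],
        ['61097', '239293'],
        ['61794', '37020977'],
        ['61794', '63243'],
        ['63243', '5380636']]
print(gen_haplotype_blocks(data))  # [4, 3, 3, 3, 3]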
Not sure how much you expect by saying "considerably", but converting your inner lists to sets from the beginning should speed things up. The following runs approximately 2.5x faster in my testing:
def gen_haplotype_blocks_improved(df):
    df_set = [set(d) for d in df]
    counts = []
    for d1 in df_set:
        row = d1
        for d2 in df_set:
            if d1.intersection(d2) and d1 != d2:
                row = row.union(d2)
        counts.append(len(row))
    return sorted(counts, reverse=True)
I have a very long list lst containing unique elements. I want to design a function which takes a list of elements as input and efficiently returns a list of their indices. We assume the items whose indices we need are all present in lst.
Here is an example:
lst = ['ab','sd','ef','de']
items_to_find = ['sd', 'ef', 'sd']
>>> fo(lst, items_to_find)
# Output: [1,2,1]
I have one solution of my own, but it looks inefficient:
>>> [lst.index(x) for x in items_to_find]
Because lst is very long, I need a very fast algorithm to solve this.
First create a dictionary containing the index location of each item in the list (you state that all items are unique, hence there is no issue with duplicate keys).
Then use the dictionary to look up each item's index location, which is O(1) on average.
my_list = ['ab', 'sd', 'ef', 'de']
d = {item: idx for idx, item in enumerate(my_list)}
items_to_find = ['sd', 'ef', 'sd']
>>> [d.get(item) for item in items_to_find]
[1, 2, 1]
You could use a dictionary with the elements from lst as the keys and their indices as the values. Lookup in a dictionary is O(1).
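A minimal sketch of that idea (variable names are just illustrative):
lst = ['ab', 'sd', 'ef', 'de']
index_by_value = {value: index for index, value in enumerate(lst)}  # built once
print([index_by_value[item] for item in ['sd', 'ef', 'sd']])  # [1, 2, 1]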
Although the answer you've accepted is very good, here's something that should be more memory efficient and is probably almost as fast: #Alexander's answer creates a potentially huge dictionary if the list is very long (since the elements in it are all unique).
The code below also builds a dictionary to speed up searching, but it's keyed on the target elements, so it is likely to be much smaller than the list being searched. For the sample data, the one it creates (named targets) contains only: {'sd': [0, 2], 'ef': [1]}
It makes one pass through the sequence, checks whether each value in it is a target and, if so, updates the results list accordingly. This approach requires a little more code to implement since the setup is slightly more involved, so that's another trade-off.
def find_indices(seq, elements):
    targets = {}
    for index, element in enumerate(elements):
        targets.setdefault(element, []).append(index)

    indices = [None for _ in elements]  # Pre-allocate.
    for location, value in enumerate(seq):
        if value in targets:
            for element, indexes in targets.items():
                if element == value:
                    for index in indexes:
                        indices[index] = location
    return indices
lst = ['ab', 'sd', 'ef', 'de']
indices = find_indices(lst, ['sd', 'ef', 'sd'])
print(indices) # -> [1, 2, 1]
A simple first approximation...
def get_indices(data_list, query_list):
    datum_index_mapping = {datum: None for datum in query_list}
    for index, datum in enumerate(data_list):
        if datum in datum_index_mapping:
            datum_index_mapping[datum] = index
    return [datum_index_mapping[d] for d in query_list]
The above is the simplest, most intuitive solution, and it makes some effort to be efficient (by only bothering to store a dictionary of indices for the elements you actually want to look up).
However, it suffers from the fact that, even if the initial query list is very short, it will iterate through the entire data list or data generator. In addition, it does a dictionary write every time it sees a query value, even one it has already recorded. The version below fixes those inefficiencies, although it adds the overhead of a set: it must do a set write for each unique element in the query list, as well as a dictionary write for each unique element in the query list.
def get_indices(data_list, query_list):
    not_found = set(query_list)
    datum_index_mapping = {}
    for index, datum in enumerate(data_list):
        if datum in not_found:
            datum_index_mapping[datum] = index
            not_found.remove(datum)
            if len(not_found) == 0:
                break
    return [datum_index_mapping[d] for d in query_list]
Obviously, depending on your program, you may not actually want to have a list of indices at all, but simply have your function return the mapping.
If you'll be resolving multiple arbitrary query lists, you may want to simply do an enumerate() on the original dataset as other answers have shown and keep the dictionary that maps values to indices in memory as well for query purposes.
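A rough sketch of that last suggestion, with the mapping built once and reused across query lists (the names here are just illustrative):
data_list = ['ab', 'sd', 'ef', 'de']
# Build the value -> index mapping once, then reuse it for every query list.
index_of = {value: index for index, value in enumerate(data_list)}
print([index_of[v] for v in ['sd', 'ef', 'sd']])  # [1, 2, 1]
print([index_of[v] for v in ['de', 'ab']])        # [3, 0]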
What counts as efficient often depends on the larger program; all we can do here are pigeonhole optimizations. It also depends on the memory hierarchy and the processing power available (i.e. can we parallelize? Is compute more expensive, or is memory? What's the I/O hit if we need to fall back to swap?).
If you are sure that all the searched-for values actually exist in the search list and that lst is sorted (of course, the sorting itself might take some time), you can do it in one pass (linear complexity):
def sortedindex(lst, find):
    find.sort()
    indices = []
    start = 0
    for item in find:
        start = lst.index(item, start)
        indices.append(start)
    return indices
The "start" shows the first index where the algorithm starts comparing the inspected item to the item in the main list. When the correct index is found, it will become the next starting mark. Because both lists are sorted in the same way, you do not have to worry that you skipped any of the next items.
I have a list of ~3000 items. Let's call it listA.
And another list with 1,000,000 items. Let's call it listB.
I want to check how many items of listA belong in listB. For example to get an answer like 436.
The obvious way is to have a nested loop looking for each item, but this is slow, especially due to the size of the lists.
What is the fastest and/or Pythonic way to get the number of the items of one list belonging to another?
Make a set out of list_b. That will avoid nested loops and make the contains-check O(1). The entire process will be O(M+N) which should be fairly optimal:
set_b = set(list_b)
count = sum(1 for a in list_a if a in set_b)
# OR shorter, but maybe less intuitive
count = sum(a in set_b for a in list_a)
# where the bool expression is coerced to int {0; 1} for the summing
If you don't want to (or have to) count repeated elements in list_a, you can use set intersection:
count = len(set(list_a) & set(list_b))
# OR
count = len(set(list_a).intersection(list_b)) # avoids one conversion
One should also note that these set-based operations only work if the items in your lists are hashable (e.g. not lists themselves)!
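A small illustration of the difference between the two counts when list_a contains repeats (sample data made up for the example):
list_a = ['x', 'y', 'y', 'z']
list_b = ['y', 'z', 'w']
set_b = set(list_b)
print(sum(a in set_b for a in list_a))  # 3 -- the repeated 'y' is counted twice
print(len(set(list_a) & set(list_b)))   # 2 -- only distinct common elements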
Another option is to use set and find the intersection:
len(set(listA).intersection(listB))
You can loop over the contents of listA and use a generator to yield the values that also appear in listB (note that membership testing against a plain list is linear, so converting listB to a set first would be faster):
def get_number_of_elements(s, a):
    for i in s:
        if i in a:
            yield i

print(len(list(get_number_of_elements(listA, listB))))
Description
I have two lists of lists which are derived from CSVs (minimal working example below). The real dataset is too large to do this manually.
mainlist = [["MH75","QF12",0,38], ["JQ59","QR21",105,191], ["JQ61","SQ48",186,284], ["SQ84","QF36",0,123], ["GA55","VA63",80,245], ["MH98","CX12",171,263]]
replacelist = [["MH75","QF12","BA89","QR29"], ["QR21","JQ59","VA51","MH52"], ["GA55","VA63","MH19","CX84"], ["SQ84","QF36","SQ08","JQ65"], ["SQ48","JQ61","QF87","QF63"], ["MH98","CX12","GA34","GA60"]]
mainlist contains a pair of identifiers (mainlist[x][0], mainlist[x][1]) and these are associated with two integers (mainlist[x][2] and mainlist[x][3]).
replacelist is a second list of lists which also contains the same pairs of identifiers (but not in the same order within a pair, or across rows). All sublist pairs are unique. Importantly, replacelist[x][2] and replacelist[x][3] correspond to replacements for replacelist[x][0] and replacelist[x][1], respectively.
I need to create a new third list, newlist, which copies mainlist but replaces the identifiers with those from replacelist[x][2] and replacelist[x][3].
For example, given:
mainlist[2] is: ['JQ61', 'SQ48', 186, 284]
The matching pair in replacelist is
replacelist[4]: ['SQ48', 'JQ61', 'QF87', 'QF63']
Therefore the expected output is
newlist[2] = ['QF87', 'QF63', 186, 284]
More clearly put:
if replacelist = [[A, B, C, D]]
A is replaced with C, and B is replaced with D.
but it may appear in mainlist as [[B, A]]
Note that newlist rows keep the same positions as in mainlist.
Attempt
What has me totally stumped on such a simple problem is that I feel I can't use a basic list comprehension like [i for i in replacelist if i in mainlist], since the order within a pair changes, and if I use sorted(list) then I lose the information about what to replace the items with. Current solution (with commented blanks):
newlist = []
for k in replacelist:
    for i in mainlist:
        if k[0] in i and k[1] in i:
            # retrieve mainlist order, then use some kind of indexing to check a series
            # of nested if statements to work out the positional replacement.
As you can see, this solution is clearly inefficient and I can't work out the best way to perform the final step in a few lines.
I can add more information if this is not clear
It'll help if you turn replacelist into a dict:
mainlist = [['MH75','QF12',0,38], ['JQ59','QR21',105,191], ['JQ61','SQ48',186,284], ['SQ84','QF36',0,123], ['GA55','VA63',80,245], ['MH98','CX12',171,263]]
replacelist = [['MH75','QF12','BA89','QR29'], ['QR21','JQ59','VA51','MH52'], ['GA55','VA63','MH19','CX84'], ['SQ84','QF36','SQ08','JQ65'], ['SQ48','JQ61','QF87','QF63'], ['MH98','CX12','GA34','GA60']]

replacements = {frozenset(r[:2]): dict(zip(r[:2], r[2:])) for r in replacelist}

newlist = []
for *ids, val1, val2 in mainlist:
    reps = replacements[frozenset(ids)]
    newlist.append([reps[ids[0]], reps[ids[1]], val1, val2])
First thing to do: transform both lists into dictionaries:
from collections import OrderedDict

maindct = OrderedDict((frozenset(item[:2]), item[2:]) for item in mainlist)
replacedct = {frozenset(item[:2]): item[2:] for item in replacelist}

# Now it is trivial to create the desired output:
output_list = [replacedct[key] + maindct[key] for key in maindct]
The big deal here is that by using a dictionary, you eliminate the search time for the matching pair in the replacement list: with a list you have to scan the whole list for each item you have, which makes performance degrade with the square of the list length. With Python dictionaries, the lookup time is essentially constant and does not depend on the data length at all.
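For the sample data this reproduces the expected row from the question (a quick check, using the corrected variable name replacelist):
print(output_list[2])  # ['QF87', 'QF63', 186, 284] -- matches the expected newlist[2]
print(output_list[0])  # ['BA89', 'QR29', 0, 38]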
My question seems simple, but for a Python novice like myself this is starting to get too complex to grasp, so here's the situation:
I need to take a list such as:
L = [(a, b, c), (d, e, d), (etc, etc, etc), (etc, etc, etc)]
and make each index an individual list so that I may pull elements from each index specifically. The problem is that the list I am actually working with contains hundreds of indices such as the ones above and I cannot make something like:
L_new = list(L['insert specific index here'])
for each one, as that would mean filling up memory with hundreds of lists corresponding to individual indices of the first list, which seems far too time and memory consuming. So my question is this: how can I separate those indices and then pull individual parts from them without needing to create hundreds of individual lists (or at least without needing hundreds of individual lines to create them)?
I might be misreading your question, but I'm inclined to say that you don't actually have to do anything to be able to index your tuples. See my comment, but: L[0][0] will give "a", L[0][1] will give "b", L[2][1] will give "etc" etc...
If you really want a clean way to turn this into a list of lists you could use a list comprehension:
cast = [list(entry) for entry in L]
In response to your comment: if you want to access across dimensions I would suggest list comprehension. For your comment specifically:
crosscut = [entry[0] for entry in L]
In response to comment 2: This is largely a part of a really useful operation called slicing. Specifically to do the referenced operation you would do this:
multiple_index = [entry[0:3] for entry in L]
Depending on your readability preferences there are actually a number of possibilities here:
list_of_lists = []
for sublist in L:
    list_of_lists.append(list(sublist))

# Or, inside a generator function, if you want lazy evaluation:
iterator = iter(L)
for i in range(iterator.__length_hint__()):
    yield list(next(iterator))
What you have there is a list of tuples; access them like a list of lists:
L[3][2]
will get the third element (index 2) of the fourth tuple (index 3) in your list L.
Two ways of using the inner lists:

for index, sublist in enumerate(L):
    # do something with sublist
    pass

or with an iterator

iterator = iter(L)
sublist = next(iterator)  # <-- yields the first sublist

In both cases, the sublist elements can be reached via

direct indexing

sublist[2]

or iteration

iterator = iter(sublist)
next(iterator)  # <-- yields the first elem of sublist

for elem in sublist:
    # do something with elem
    pass