I have a list of element, label pairs like this: [(e1, l1), (e2, l2), (e3, l1)]
I have to count how many labels two elements have in common - i.e. in the list above, e1 and e3 have the label l1 in common and thus 1 label in common.
I have this Python implementation:
from collections import defaultdict

def common_count(e_l_list):
    count = defaultdict(int)
    l_list = defaultdict(set)
    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            elif e1 > e2:
                count[e1, e2] += 1
            else:
                count[e2, e1] += 1
        l_list[l].add(e1)
    return count
It takes a list like the one above and computes a dictionary of element pairs and counts. The result for the list above should be {(e3, e1): 1}, since e1 and e3 share exactly one label.
Now I have to scale this to millions of elements and labels, and I thought Cython would be a good solution to save CPU time and memory. But I can't find docs on how to use maps in Cython.
How would I implement the above in pure Cython?
It can be assumed that all elements and labels are unsigned integers.
Thanks in advance :-)
I think you are overcomplicating this by creating pairs of elements and storing all common labels as the value, when you can create a dict with the element as the key and a list of all labels associated with that element as the value. When you want to find common labels, convert the lists to sets and intersect them; the resulting set holds the labels the two elements have in common. The average time of the intersection, checked with ~20000 lists, is roughly 0.006 seconds, i.e. very fast.
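A minimal sketch of that layout (the element/label integers here are made up for illustration; it builds sets directly instead of converting lists at query time, which is the same idea):

from collections import defaultdict

pairs = [(1, 10), (2, 20), (3, 10), (3, 20)]   # (element, label) pairs

labels_by_element = defaultdict(set)
for element, label in pairs:
    labels_by_element[element].add(label)

# Labels that elements 2 and 3 have in common, and how many:
common = labels_by_element[2] & labels_by_element[3]   # -> {20}
count = len(common)                                    # -> 1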
I tested this with the following code (the intersection is included in the timed region and the average is taken over all timings):

from collections import *
import random
import time

l = []
for i in xrange(10000000):
    # With element range 0-10000000 the dictionary creation time increases to ~16 seconds
    l.append((random.randrange(0, 50000), random.randrange(0, 50000)))

start = time.clock()
d = defaultdict(list)
for e, label in l:          # O(n) overall
    d[e].append(label)
print time.clock() - start

times = []
for i in xrange(10000):
    start = time.clock()
    tmp = set(d[random.randrange(0, 50000)])   # picks a random list of labels
    tmp2 = set(d[random.randrange(0, 50000)])  # not guaranteed to be a different list, but more than likely
    common_elements = tmp.intersection(tmp2)
    times.append(time.clock() - start)
print sum(times) / len(times)
18.6747529999 #creation of list
4.17812619876 #creation of dictionary
0.00633531142994 #intersection
Note: The times do change slightly depending on the number of labels. Also, creating the dict might take too long for your situation, but that is only a one-time operation.
I would also strongly recommend against creating all pairs of elements. If you have 5,000,000 elements and they all share at least one label, which is the worst case, then you are looking at about 1.25e+13 pairs or, more bluntly, 12.5 trillion. That would be on the order of ~1700 terabytes, or ~1.7 petabytes.
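For reference, the arithmetic behind that estimate (the ~140 bytes per stored pair is my own rough assumption about Python dict/tuple overhead, not a measured figure):

n = 5000000
n_pairs = n * (n - 1) // 2        # all unordered pairs: 12499997500000, ~1.25e13
approx_tb = n_pairs * 140 / 1e12  # ~1750 terabytes, assuming ~140 bytes per stored pair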
Related
Given a list containing N sublists of multiple lengths, find all unique combinations of size k, selecting only one element from each sublist.
The order of the elements in the combination is not relevant: (a, b) = (b, a)
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output =
[
('B1', 'T1'),('B1', 'T2'),('B1', 'L1'),('B1', 'L2'),('B1', 'L3'),('B1', 'L4'),
('B2', 'T1'),('B2', 'T2'),('B2', 'L1'),('B2', 'L2'),('B2', 'L3'),('B2', 'L4'),
('B3', 'T1'),('B3', 'T2'),('B3', 'L1'),('B3', 'L2'),('B3', 'L3'),('B3', 'L4'),
('T1', 'L1'),('T1', 'L2'),('T1', 'L3'),('T1', 'L4'),
('T2', 'L1'),('T2', 'L2'),('T2', 'L3'),('T2', 'L4')
]
Extra points for a pythonic way of doing it
Speed/efficiency matters; the idea is to use this on a list with hundreds of sublists ranging from 5 to 50 in length.
What I have been able to accomplish so far:
Using for and while loops to move pointers and build the answer; however, I am having a hard time figuring out how to include the k parameter to set the size of the tuple combinations dynamically. (Not really happy about it.)
def build_combinations(lst):
    result = []
    count_of_lst = len(lst)
    for i, sublist in enumerate(lst):
        if i == count_of_lst - 1:
            continue
        else:
            for item in sublist:
                j = 0
                while i < len(lst) - 1:
                    while j <= len(lst[i+1]) - 1:
                        comb = (item, lst[i+1][j])
                        result.append(comb)
                        j = j + 1
                    i = i + 1
                    j = 0
                i = 0
    return result
I've seen many similar questions in stack overflow, but none of them addressed the parameters the way I am trying to (one item from each list, and the size of the combinations being a params of function)
I tried using itertools combinations, product, and permutations, and flipping them around, without success. Whenever I use itertools I either have a hard time using only one item from each list, or can't set the size of the tuples I need.
I tried NumPy using arrays and a more math/matrix approach, but didn't go too far. There's definitely a way of solving with NumPy, hence why I tagged numpy as well
You need to combine two itertools helpers: combinations to select the sample_k unique, ordered sublists to draw from, then product to combine the elements of those sublists:
from itertools import combinations, product
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output = [pair
                   for lists in combinations(sample_list, sample_k)
                   for pair in product(*lists)]
print(expected_output)
If you want to get really fancy/clever/ugly, you can push all the work down to the C layer with:
from itertools import combinations, product, starmap, chain
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output = list(chain.from_iterable(starmap(product, combinations(sample_list, sample_k))))
print(expected_output)
That will almost certainly run meaningfully faster for huge inputs (especially if you can loop the results from chain.from_iterable directly rather than realizing them as a list), but it's probably not worth the ugliness unless you're really tight for cycles (I wouldn't expect much more than a 10% speed-up, but you'd need to benchmark to be sure).
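For example, if you only need to iterate over the pairs, you can consume the chained iterator directly and never build the list at all (a small sketch reusing sample_list and sample_k from above):

from itertools import combinations, product, starmap, chain

sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]

for pair in chain.from_iterable(starmap(product, combinations(sample_list, sample_k))):
    print(pair)  # handle each pair one at a time; no intermediate list is realized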
Assume I have the following data structure (a list of lists):
myList = [['Something','1'], ['Something','2'], ['Something Else','5'], ['Yet Another Something','1'], ['Yet ANOTHER Something','0'], ['Yet Another Something','2']]
I have a function that will remove duplicates from that list, choosing the highest number for the 2nd value. However, it seems to choke on very large data sets (150+ entries in myList). For this small data set, I expect the returned list to be:
[['Something','2'], ['Something Else','5'], ['Yet Another Something','2']]
What kind of optimization can be applied to this function, using only standard Python (no custom or external modules), so that it returns the same result set without issues on large data sets?
Here is my function:
def remove_duplicates(duplicate):
    final_list = []
    final_list_upper = []
    for k, v in duplicate:
        found = False
        for x in range(len(final_list)):
            if k in final_list[x] or k.upper() in final_list_upper[x]:
                if k == final_list[x][0] or k.upper() == final_list_upper[x][0]:
                    if int(v) >= int(final_list[x][1]):
                        final_list.pop(final_list.index(final_list[x]))
                        final_list_upper.pop(final_list_upper.index(final_list_upper[x]))
                        break
                    else:
                        found = True
                        break
        if not found:
            final_list.append([k, v])
            final_list_upper.append([k.upper(), v])
    final_list_upper = []  # clear the list
    return final_list
You're using a second loop to check whether the current "key" already exists in the list. This is what slows down your code.
Why? Because, as your code demonstrates, checking for membership in lists is slow. Really slow, because you need to iterate over the entire list, which means it's an O(N) operation, so the time depends linearly on the size of the list.
Instead, you could simply change the list to a dictionary. Lookup in a dictionary is an O(1) operation, so the lookup happens in constant (or nearly constant) time regardless of the size of the dictionary.
When you do this, there's no longer a need for two loops. Here's an idea:
def remove_duplicates_new(duplicate):
    final_dict = {}
    case_sensitive_keys = {}
    for k, v in duplicate:
        klower = k.lower()
        vint = int(v)
        old_val = final_dict.get(klower, 0)  # get the current value for klower, defaulting to zero if the key doesn't exist
        if vint > old_val:
            # Replace if the current value is greater than the old value
            final_dict[klower] = vint
            case_sensitive_keys[klower] = k
    # Now we're done looping, so create the list
    final_list = [[case_sensitive_keys[k], str(v)] for k, v in final_dict.items()]
    return final_list
To compare, let's make a test list with 10000 elements. The "keys" are random numbers between 1 and 100, so we're bound to get a whole bunch of duplicates:
import random
import timeit
testList = [[str(random.randint(1, 100)), str(random.randint(1, 10))] for i in range(10000)]
timeit.timeit('remove_duplicates(testList)', setup='from __main__ import testList, remove_duplicates', number=10)
# Output: 1.1064800999999989
timeit.timeit('remove_duplicates_new(testList)', setup='from __main__ import testList, remove_duplicates_new', number=10)
# Output: 0.03743689999998878
Hot damn! That's a ~30x speedup!
I have a question about a MemoryError in Python 3.6.
import itertools

input_list = ['a','b','c','d']
group_to_find = list(itertools.product(input_list, input_list))

a = []
for i in range(len(group_to_find)):
    if group_to_find[i] not in a:
        a.append(group_to_find[i])
The error is raised at the line that builds the full product list:

group_to_find = list(itertools.product(input_list, input_list))
MemoryError
You are creating a list, in full, from the Cartesian product of your input list, so in addition to input_list you now need len(input_list) ** 2 memory slots for all the results. You then filter that list down again into yet another list. All in all, for N items you need memory for 2N + (N * N) references. If N is 1000, that's 1,002,000 references; for N = 1 million, you need a million million plus 2 million references. Etc.
Your code doesn't need to create the group_to_find list, at all, for two reasons:
You could just iterate and handle each pair individually:
a = []
for pair in itertools.product(input_list, repeat=2):
    if pair not in a:
        a.append(pair)
This is still going to be slow, because pair not in a has to scan the whole list to find matches. You do this for each of the N * N pairs the product generates, and a can grow to up to K unique pairs (where K is the square of the number of unique values in input_list, so potentially N * N itself), which means on the order of N * N * K time is spent just checking for duplicates. You could use a = set() to make each check fast. But see point 2.
Your end product in a is the exact same list of pairs that itertools.product() would produce anyway, unless your input values are not unique. You could just make those unique first:
a = itertools.product(set(input_list), repeat=2)
Again, don't put this in a list. Iterate over it in a loop and use the pairs it produces one by one.
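A minimal sketch of that lazy approach (the input values are just illustrative):

import itertools

input_list = ['a', 'b', 'c', 'd', 'a']  # duplicates are fine; set() removes them
for x, y in itertools.product(set(input_list), repeat=2):
    # handle each pair here, one at a time; no list of pairs is ever built
    print(x, y)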
I am using two architecture programs, with visual programming plugins (Grasshopper for Rhino and Dynamo for Revit - for those that know / are interested)
Grasshopper contains a function called 'Jitter'. This will shuffle a list; however, it has an input from 0.0 to 1.0 which controls the degree of shuffling - 0.0 results in no shuffling and 1.0 produces a complete shuffle.
The second of the programs (Dynamo) does not contain this functionality. It contains a shuffle module (which contains a seed value) however it is a complete random shuffle.
Ultimately the goal is to produce a series of solid and glazed panels with a slight random effect (but avoiding large clumping of solid and glazed elements - hence I want a "light shuffle").
I have written code which will calculate the number of glazed (True) and solid (False) values required and then evenly distribute True and False values based on the number of items and the percent specified.
I have checked out the random module reference; however, I'm not familiar with the various distributions described there.
Could someone help out or point me in the right direction if an existing function would achieve this?
(I have cheated slightly by adding True/False values alternately to make up the correct number of items within the list - list3 is the final list, list2 contains the repeating pattern of True/False values.)
Many thanks
import math
import random

percent = 30
items = 42

def remainder():
    remain = items % len(list2)
    list3.append(True)
    remain -= 1
    while remain > 0:
        list3.append(False)
        remain -= 1
    return list3

# find module of repeating True and False values
list1 = ([True] + [False] * int((100/percent)-1))

# multiply this list to nearest multiple based on len(items)
list2 = list1 * int(items/(100/percent))

# make a copy of list2
list3 = list2[:]

# add alternating True and False to match len(list3) to len(items)
remainder()

# an example of a completely shuffled list - which is not desired
shuffled = random.sample(list3, k=len(list3))
Here is an approach based on this paper, which proves a result about the mixing time needed to scramble a list by using swaps of adjacent items.
from random import choice
from math import log

def jitter(items, percent):
    n = len(items)
    m = n**2 * log(n)
    items = items[:]
    indices = list(range(n-1))
    for i in range(int(percent*m)):
        j = choice(indices)
        items[j], items[j+1] = items[j+1], items[j]
    return items
A test, each line showing the result of jitter with various percents being applied to the same list:
ls = list(('0'*20 + '1'*20)*2)
for i in range(11):
    p = i/10.0
    print(''.join(jitter(ls, p)))
Typical output:
00000000000000000000111111111111111111110000000000000000000011111111111111111111
00000000000000111100001101111011011111001010000100010001101000110110111111111111
00000000100100000101111110000110111101000001110001101001010101100011111111111110
00000001010010011011000100111010101100001111011100100000111010110111011001011111
00100001100000001101010000011010011011111011001100000111011011111011010101011101
00000000011101000110000110000010011001010110011111100100111101111011101100111110
00110000000001011001000010110011111101001111001001100101010011010111111011101100
01101100000100100110000011011000001101111111010100000100000110111011110011011111
01100010110100010100010100011000000001000101100011111011111011111011010100011111
10011100101000100010001100100000100111001111011011000100101101101010101101011111
10000000001000111101101011000011010010110011010101110011010100101101011110101110
I'm not sure how principled the above is, but it seems like a reasonable place to start.
There's no clear definition of what "degree of shuffling" (d) means, so you'll need to choose one. One option would be: "the fraction of items remaining unshuffled is (1-d)".
You could implement that as:
Produce a list of indices
Remove (1-d)*N of them
Shuffle the rest
Reinsert the ones removed
Use these to look up values from the original data
import random

def partial_shuffle(x, d):
    """
    x: data to shuffle
    d: fraction of the data to shuffle (the rest stays in place)
    """
    n = len(x)
    dn = int(d*n)
    indices = list(range(n))
    random.shuffle(indices)
    ind_fixed, ind_shuff = indices[dn:], indices[:dn]

    # copy across the fixed values
    result = x[:]

    # shuffle the shuffled values
    for src, dest in zip(ind_shuff, sorted(ind_shuff)):
        result[dest] = x[src]
    return result
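A quick usage sketch for the function above (the data is illustrative, not from the question):

data = list(range(20))
lightly_shuffled = partial_shuffle(data, 0.3)  # about 30% of the positions are reordered
fully_shuffled = partial_shuffle(data, 1.0)    # effectively a complete shuffle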
The other algorithms you're referring to are probably using the Fisher-Yates shuffle under the hood.
This O(n) shuffle starts with the first element of an array and swaps it with an element at a randomly chosen equal or higher index, then does the same for the second element, and so on.
Naturally, stopping this shuffle partway through, at some fraction in [0,1] of the array, would give a partially randomized array, like you want.
Unfortunately, the effect of the foregoing is that all the "randomness" builds up on one side of the array.
Therefore, make a list of array indices, shuffle these completely, and then use those shuffled indices to decide where the Fisher-Yates swap steps are applied, so the original array ends up partially shuffled without the randomness being concentrated at one end.
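Here is a minimal sketch of that idea under one reasonable reading of it (the function name and parameters are mine, not from the answer): apply the Fisher-Yates swap step at only a randomly chosen fraction of positions, visited in shuffled order.

import random

def partial_fisher_yates(data, fraction):
    """Apply the Fisher-Yates swap step at a random `fraction` of positions,
    visited in shuffled order so the randomness is not concentrated at one end."""
    result = data[:]
    n = len(result)
    positions = list(range(n))
    random.shuffle(positions)              # decide *which* positions get a swap
    for pos in positions[:int(fraction * n)]:
        j = random.randrange(pos, n)       # classic Fisher-Yates step: same or higher index
        result[pos], result[j] = result[j], result[pos]
    return result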
I believe I found a more versatile, robust, and consistent way to implement this "adjustable shuffling" technique.
import random
import numpy as np

def acc_shuffle(lis, sr, array=False, exc=None):  # "sr" = shuffling rate
    if type(lis) != list:  # Make it compatible with shuffling (m x n) numpy.ndarrays
        arr = lis
        shape = arr.shape
        lis = list(arr.reshape(-1))
    lis = lis[:]  # Work on a copy, so changes applied to "lis" won't affect the original input
    indices = list(range(len(lis)))
    if exc is not None:  # Exclude any indices if necessary
        for ele in sorted(exc, reverse=True):
            del indices[ele]
    shuff_range = int(sr * len(lis) / 2)  # How much to shuffle (depends on the shuffling rate)
    if shuff_range < 1:
        shuff_range = 1  # At least one shuffle (swap 2 elements)
    for _ in range(shuff_range):
        i = random.choice(indices)
        indices.remove(i)  # You can opt not to remove the indices for more flexibility
        j = random.choice(indices)
        indices.remove(j)
        temp = lis[i]
        lis[i] = lis[j]
        lis[j] = temp
    if array is True:
        return np.array(lis).reshape(shape)
    return lis
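A short usage sketch for the function above, reusing its imports (the inputs are illustrative):

data = list(range(20))
lightly = acc_shuffle(data, 0.3)  # ~30% of the elements take part in swaps
grid = acc_shuffle(np.arange(12).reshape(3, 4), 0.5, array=True)  # also handles 2-D arrays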
Here is what I am trying to do. The output of a calculation on a dataframe gives a number. I use that number to rank the different dataframes and I need to retain the top-N (in the example below, the top 10 is chosen). The ranking is achieved by comparing the number to the last number of a reverse sorted list. If the current number is larger, the list is popped and the new entry added to the list followed by reverse sorting again. The following is structurally identical to what I have and it works, albeit slowly. I would appreciate any suggestions to improve its speed, efficiency or Pythonicness.
import random
import pandas as pd

def gen_df():
    return random.uniform(0.0, 1.0), pd.DataFrame()

if __name__ == '__main__':
    mylist = []
    for i in range(1000):
        val, df = gen_df()
        if len(mylist) < 10:
            mylist.append((val, df))
        else:
            mylist.sort(reverse=True)
            if mylist[-1][0] < val:
                mylist.pop()
                mylist.append((val, df))
EDIT: Reduced one sort after suggestion by zondo.
The way to speed it up is to replace your list with a min-heap of size 10. Put the first 10 frames into the heap. Then, for each item, if it's larger than the smallest item on the heap, pop the smallest item and push the new item.
I'm not a Python programmer, so I'll present the pseudocode.
heap = new min-heap
for each item
    if (heap.length < 10)
        heap.push(item)
    else if (item > heap.peek())
        heap.pop();       // remove smallest item
        heap.push(item);  // add new item
This assumes, of course, that there's a min-heap implementation that you can use. I suspect heapq will do the trick.
That's going to be significantly faster than sorting the list every time you insert a new item.
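A minimal Python sketch of that idea using heapq (it reuses gen_df() from the question; the loop counter is included in each tuple purely as a tie-breaker so that equal scores never force a comparison between DataFrames):

import heapq

def get_top_10_heap():
    heap = []
    for i in range(1000):
        val, df = gen_df()
        if len(heap) < 10:
            heapq.heappush(heap, (val, i, df))
        elif val > heap[0][0]:                     # heap[0] is the smallest score kept so far
            heapq.heapreplace(heap, (val, i, df))  # pop the smallest, push the new item
    return sorted(heap, key=lambda t: t[0], reverse=True)

If you can produce all the (val, df) pairs as an iterable up front, heapq.nlargest(10, pairs, key=...) with a key on the score does the same thing in a single call.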
Remember that, in Python, lists really just hold pointers to the things they contain. So certain list operations can be quite fast, even if the list contains some pretty heavy data structures (i.e. the DataFrames in your example). Your approach involves making a small list (10 items long) and constantly modifying it to stay "correct" as more DataFrames are "considered" for the top 10. That feels a bit unnecessary to me. I would just make one big list of all the candidates, sort it once, and take the first 10. Also, repeatedly appending is slower than assigning into a preallocated list, so it's better to allocate the memory all at once.
My guess is that for big data sets, the approach I lay out below will be a bit faster. But regardless, I find it a bit more readable.
def get_top_10_so():
    mylist = []
    for i in range(1000):
        val, df = gen_df()
        if len(mylist) < 10:
            mylist.append((val, df))
        else:
            mylist.sort(reverse=True)
            if mylist[-1][0] < val:
                mylist.pop()
                mylist.append((val, df))
    return mylist

def get_top_10_mine():
    mylist = [None] * 1000
    for i in range(1000):
        mylist[i] = gen_df()
    mylist.sort(key=lambda tup: tup[0], reverse=True)
    return mylist[:10]