Each sample is an array of integer features. I need to split my samples into two groups by finding the best feature and the best splitting value for that feature. By "best", I mean the split that gives the greatest entropy drop between the pre-split set and the weighted average of the entropies of the left and right sides. I need to try all (2^m - 2)/2 possible ways to partition the items into two nonempty lists, where m is the number of distinct values for the feature (all samples with the same value for that feature move together as a group).
The following is extremely slow, so I need a more reasonable/faster way of doing this.
sorted_by_feature is a list of (feature_value, 0_or_1) tuples.
import itertools

same_vals = {}
for ele in sorted_by_feature:
    if ele[0] not in same_vals:
        same_vals[ele[0]] = [ele]
    else:
        same_vals[ele[0]].append(ele)

l = same_vals.keys()
orderings = list(itertools.permutations(l))
for ordering in orderings:
    list_tups = []
    for dic_key in ordering:
        list_tups += same_vals[dic_key]
    left_1 = 0
    left_0 = 0
    right_1 = num_one
    right_0 = num_zero
    for index, tup in enumerate(list_tups):
        # number of 0's or 1's on the left goes up or down by 1;
        # calculate entropy on left/right, calculate entropy drop, etc.
Trivial details (continuing the code above):
        if index == len(sorted_by_feature) - 1:
            break
        if tup[1] == 1:
            left_1 += 1
            right_1 -= 1
        if tup[1] == 0:
            left_0 += 1
            right_0 -= 1
        # only calculate entropy if values to left and right of split are different
        if list_tups[index][0] != list_tups[index + 1][0]:
            pass  # entropy drop calculation goes here (omitted above)
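(For reference, here is a minimal sketch of the entropy-drop / information-gain criterion described above, assuming binary 0/1 labels; the helper names are illustrative and not part of the code above.)

import math

def entropy(n_zero, n_one):
    # Shannon entropy of a group with n_zero 0-labels and n_one 1-labels.
    total = n_zero + n_one
    if total == 0:
        return 0.0
    h = 0.0
    for count in (n_zero, n_one):
        if count:
            p = count / float(total)
            h -= p * math.log(p, 2)
    return h

def entropy_drop(left_0, left_1, right_0, right_1):
    # Parent entropy minus the size-weighted average of the child entropies.
    n_left = left_0 + left_1
    n_right = right_0 + right_1
    n = n_left + n_right
    parent = entropy(left_0 + right_0, left_1 + right_1)
    children = (n_left * entropy(left_0, left_1) +
                n_right * entropy(right_0, right_1)) / float(n)
    return parent - children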
tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use a better approach than the one you're considering!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want it to handle an input of size m. You know that means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate - and since generating each partition takes a constant amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40) * 10^-11 seconds ≈ 1 trillion / 100 billion ≈ 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means pow(2,100) * 10^-11 seconds ≈ 10^19 seconds - that's hundreds of billions of years. What happened?
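These back-of-the-envelope figures are easy to reproduce; a quick sanity check of the arithmetic above, assuming the idealised rate of one partition per 10^-11 seconds:

for m in (20, 40, 100):
    seconds = 2 ** m * 1e-11  # one partition per 10^-11 s, as assumed above
    print("m = %3d: %.3g seconds (%.3g years)" % (m, seconds, seconds / 3.15e7))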
You're a victim of algorithmic theory. Simply put, a solution with exponential time complexity - any solution that takes on the order of 2^m steps to complete - cannot be rescued by better programming. Generating pow(2,m) outputs will always take at least pow(2,m) steps, no matter how each individual step is implemented.
Note further that 100 billion instructions per second is an ideal for high-end desktop computers - your CPU also has to worry about processes other than the program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time, since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations above. Even worse: storing 2^40 partitions would require at least a terabyte of memory. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
What You Can Do Instead
Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you only want to walk through a reduced space of optimal solutions. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
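As a small illustration of the dynamic-programming style referred to here, the following sketch counts integer partitions by filling in a table instead of enumerating every partition (it illustrates the technique in general, not a solution to the splitting problem above):

def count_partitions(n):
    # ways[k] = number of ways to write k as a sum of positive integers,
    # built up one allowed part size at a time instead of enumerating them all.
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

print(count_partitions(100))  # 190569292, computed without generating ~1.9 * 10^8 partitions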
Dynamic programming approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google the term if you're not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit about what you mean by best. If you can provide more information on what "best" means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not all that more useful when it comes to generating new partitions, because that's not something that removing the dynamic bytecode generation barrier actually affects.
A couple of minor improvements I can see:
Use try/except instead of the "not in" check to avoid looking up the key twice:

if ele[0] not in same_vals:
    same_vals[ele[0]] = [ele]
else:
    same_vals[ele[0]].append(ele)

# Should be changed to

try:
    same_vals[ele[0]].append(ele)  # most of the time this will work
except KeyError:
    same_vals[ele[0]] = [ele]
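An alternative that avoids the try/except entirely is collections.defaultdict, which creates the missing list on first access (not part of the original code, just a common idiom):

from collections import defaultdict

same_vals = defaultdict(list)
for ele in sorted_by_feature:
    same_vals[ele[0]].append(ele)  # a missing key starts out as an empty list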
Don't explicitly convert a generator to a list if you don't have to. I don't immediately see any need for the cast to a list, and it will slow things down:
orderings = list(itertools.permutations(l))
for ordering in orderings:
# Should be changed to
for ordering in itertools.permutations(l):
But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.
Related
For the sake of the argument, consider the following (very bad) sorting algorithm in Python:
def so(ar):
    while True:
        le = len(ar)
        switch = False
        for y in range(le):
            if y + 1 == le:
                break
            if ar[y] > ar[y+1]:
                ar[y], ar[y+1] = ar[y+1], ar[y]
                switch = True
        if switch == False:
            break
    return ar
I'm trying to understand the concept of "complexity of the algorithm" and there is one thing I don't get.
I came across a post that explains how to find the complexity of an algorithm here:

You add up how many machine instructions it will execute as a function of the size of its input, and then simplify the expression to the largest (when N is very large) term and can include any simplifying constant factor.

But the problem is that I cannot calculate how many machine instructions will be executed just by knowing the length of the list.
Consider first example:
import random
import time

li = [random.randint(1,5000) for x in range(3000)]
start = time.time()
so(li)
end = time.time() - start
print(end)
Output: 2.96921706199646
Now have a look at the second example:
ok = [5000,43000,232] + [x for x in range(2997)]
start = time.time()
so(ok)
end = time.time() - start
print(end)
Output: 0.010689020156860352
We can see the same sorting algorithm applied to two different lists of the same length, with two completely different execution times.
When people are talking about algorithm complexity (big O notation) they normally assume that the only variable that determines complexity of the algo is the size of the input, but clearly, in the example above it is not the case. It is not only the size of the list, but also the positioning of each value within the list that determines the speed of the sorting.
So my question is, why do we only consider size of input when estimating complexity?
And, if it is possible, can you tell me what the complexity of the algorithm above will be?
You're correct, complexity doesn't only depend on N. That's why you'll often see indications about average, worst and best cases.
Timsort is used in Python because it's O(n log n) on average, still fast in the worst case (O(n log n)) and extremely fast in the best case (O(n), when the list is already sorted).
Quicksort also has an average complexity of O(n log n), but with a naive pivot choice its worst case is O(n²), reached when the list is already sorted. This use case happens very often, so it might be worth it to actually shuffle the list before sorting it!
why do we only consider size of input when estimating complexity?
In the narrow sense of complexity as of the use of Big O notation in computer science, it is simply by definition:
In computer science, big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows.
In the broader sense your question could be interpreted as "why do we use Big O notation to describe algorithm complexity when the nature of the data can be just as important as its size."
The answer here lies in the fact that algorithm development is often done on small datasets to make it easy, while in the real world the datasets are huge. When you are writing your sorting function you're most likely going to try it first on small lists of random data. You'd want the result small enough that you can verify that it worked by simply looking at the result...
The time complexity does not always depend only on the size of the input. When we look at sorting algorithms, the input pattern can play a significant role in determining the running time.
We usually state time complexity for the worst, best and average cases, and we can study which specific input orders/patterns lead to each of those cases.
For example, in the first case you provide, since the input is randomized, there is a 1/n! probability for any particular ordering to occur. The best case (when the list is already sorted) is Ω(n) and the worst case (when the list is sorted in reverse) is O(n²), but the probability of hitting exactly the best or worst case is low.
Therefore, the sorting algorithm has Θ(n²) average time complexity, since with randomly distributed numbers the probability that a comparison leads to a swap is high for a typical input.
In the second case the order is almost fully determined, which pushes the input strongly toward either the best or the worst case. Your input is close to the best case (the list is nearly sorted), which is why it takes so much less time.
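To see these cases concretely with the so() function from the question, here is a small timing experiment along the lines of the question's own benchmarks (the exact numbers will of course vary by machine):

import random
import time

def timed(label, data):
    start = time.time()
    so(data)  # the bubble sort defined in the question
    print("%s %.4f seconds" % (label, time.time() - start))

n = 3000
timed("already sorted (best case):  ", list(range(n)))
timed("reverse sorted (worst case): ", list(range(n, 0, -1)))
timed("random (average case):       ", [random.randint(1, 5000) for _ in range(n)])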
I have 2.92M data points in a 3.0GB CSV file and I need to loop through it twice to create a graph which I want to load into NetworkX. At the current rate it will take me days to generate this graph. How can I speed this up?
similarity = 8
graph = {}
topic_pages = {}
require "csv"

CSV.foreach("topic_page_node_and_edge.csv") do |row|
  topic_pages[row[0]] = row[1..-1]
end

CSV.open("generate_graph.csv", "wb") do |csv|
  i = 0
  topic_pages.each do |row|
    i += 1
    row = row.flatten
    topic_pages_attributes = row[1..-1]
    graph[row[0]] = []
    topic_pages.to_a[i..-1].each do |row2|
      row2 = row2.flatten
      topic_pages_attributes2 = row2[1..-1]
      num_matching_attributes = (topic_pages_attributes2 & topic_pages_attributes).count
      if num_matching_attributes >= similarity or num_matching_attributes == topic_pages_attributes2.count or num_matching_attributes == topic_pages_attributes.count
        graph[row[0]].push(row2[0])
      end
    end
    csv << [row[0], graph[row[0]]].flatten
  end
end
A) Benchmark, for example using cProfile, which comes with Python. It's easy to have some costly inefficiencies in your code, and they can easily come at a 10x performance cost in intensive applications. Pretty code such as

(topic_pages_attributes2 & topic_pages_attributes).count

may turn out to be a major factor in your runtime, and can often be reduced by using more traditional code.
B) Use a more efficient language. For example, in benchmarksgame.alioth, on a number of intensive problems the fastest Python 3 program is at the median 63x slower than the fastest C program (Ruby is at 67x, JRuby at 33x). Yes, the performance gap can be big, even with well-optimized Python code. But if you didn't optimize your code, it may be even bigger; and you may be able to get a 100x-1000x speedup by using a more efficient language and carefully optimizing your code.
C) Consider more clever formulations of your problem. For example, instead of iterating over each node, iterate over each edge once. In your case, that would probably mean building an inverted index, topic -> pages. This is very similar to the way text search engines work, and a popular way to compute such operations on clusters: the individual topics can be split across separate nodes. This approach benefits from the sparsity in your data and can reduce the runtime of your algorithm drastically (a sketch of the inverted-index idea follows after these notes).
You have about 3 million documents. Judging from your total data size, they probably have fewer than 100 topics each on average? Your pairwise comparison approach needs on the order of (3 million)^2 comparisons, and that is what hurts you. If the more popular topics are used in only 30,000 documents each, you may get away with computing only 30k^2 * number-of-topics comparisons. Assuming you have 100 such very popular topics (rare topics don't matter much), this would be a 100x speedup.
D) Simplify your problem. For example, first merge all documents that have exactly the same topics by sorting. To make this more effective, also eliminate all topics that occur exactly once. Probably there are only some 10,000-100,000 different sets of documents. This step can easily be solved with sorting, and it will make your problem some 900-90,000 times easier (assuming the value range above).
Some of these numbers may be too optimistic - for example, I/O was not taken into account at all, and if your problem is I/O bound, using C/Java may not help much. There may be some highly popular topics that hurt the approach discussed in C). For D) you need O(n log n) time for sorting your data, but there are very good implementations available. It definitely is a simplification that you should do: these identical documents will also form fully connected cliques in your final graph, which likely hurt other analyses as well.
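As an illustration of the inverted-index idea in C), here is a minimal Python sketch (the input format and helper names are hypothetical; the question's code is Ruby, so treat this as a sketch of the approach rather than a drop-in replacement):

from collections import defaultdict
from itertools import combinations

def build_graph(pages, similarity=8):
    # pages: dict mapping page_id -> set of topic ids (hypothetical input format)
    # Inverted index: topic -> pages containing that topic.
    topic_to_pages = defaultdict(list)
    for page, topics in pages.items():
        for topic in topics:
            topic_to_pages[topic].append(page)

    # Count shared topics, but only for pairs that actually share at least one topic.
    shared = defaultdict(int)
    for topic, members in topic_to_pages.items():
        for a, b in combinations(members, 2):
            shared[(a, b)] += 1

    # Keep an edge when the overlap is large enough, or one topic set contains the other.
    graph = defaultdict(list)
    for (a, b), count in shared.items():
        if count >= similarity or count == len(pages[a]) or count == len(pages[b]):
            graph[a].append(b)
            graph[b].append(a)
    return graph

Note that very popular topics still generate many pairs here, which is exactly the caveat about C) above.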
Most of the time is spent loading data from disk, I believe. Parallelize reading the data across multiple threads/processes and then create the graph.
You could also probably create subset graphs on different machines and combine them later.
Wondering about the performance impact of doing one iteration vs many iterations. I work in Python -- I'm not sure if that affects the answer or not.
Consider trying to perform a series of data transformations to every item in a list.
def one_pass(my_list):
    for i in xrange(0, len(my_list)):
        my_list[i] = first_transformation(my_list[i])
        my_list[i] = second_transformation(my_list[i])
        my_list[i] = third_transformation(my_list[i])
    return my_list

def multi_pass(my_list):
    range_end = len(my_list)
    for i in xrange(0, range_end):
        my_list[i] = first_transformation(my_list[i])
    for i in xrange(0, range_end):
        my_list[i] = second_transformation(my_list[i])
    for i in xrange(0, range_end):
        my_list[i] = third_transformation(my_list[i])
    return my_list
Now, apart from issues with readability, strictly in performance terms, is there a real advantage to one_pass over multi_pass? Assuming most of the work happens in the transformation functions themselves, wouldn't each iteration in multi_pass only take roughly 1/3 as long?
The difference will be how often the values and code you're reading are in the CPU's cache.
If the elements of my_list are large, but fit into the CPU cache, the first version may be beneficial. On the other hand, if the (byte)code of the transformations is large, caching the operations may be better than caching the data.
Both versions are probably slower than the way more readable:
def simple(my_list):
    return [third_transformation(second_transformation(first_transformation(e)))
            for e in my_list]
Timing it yields:
one_pass: 0.839533090591
multi_pass: 0.840938806534
simple: 0.569097995758
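For reference, a minimal harness of the kind that could produce timings like the above (the transformation functions here are cheap hypothetical stand-ins; real numbers depend entirely on what the transformations do):

import timeit

def first_transformation(x):
    return x + 1

def second_transformation(x):
    return x * 2

def third_transformation(x):
    return x - 3

data = list(range(100000))

for fn in (one_pass, multi_pass, simple):
    t = timeit.timeit(lambda: fn(list(data)), number=20)
    print("%s: %.3f" % (fn.__name__, t))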
Assuming you're considering a program that can easily be one loop with multiple operations, or multiple loops doing one operation each, then it never changes the computational complexity (e.g. an O(n) algorithm is still O(n) either way).
One advantage of the single-pass approach is that you save on the "book-keeping" of the looping. Whether the iteration mechanism is incrementing and comparing counters, or retrieving "next" pointers and checking for null, or whatever, you do it less when you do everything in one pass. Assuming that your operations do any significant amount of work at all (and that your looping mechanism is simple and straightforward, not looping over an expensive generator or something), this "book-keeping" work will be dwarfed by the actual work of your operations, which makes this definitely a micro-optimisation that you shouldn't be doing unless you know your program is too slow and you've exhausted all more significant available optimisations.
Another advantage can be that applying all your operations to each element of the iteration before you move on to the next one tends to benefit better from the CPU cache, since each item could still be in the cache in subsequent operations on the same item, whereas using multiple passes makes that almost impossible (unless your entire collection fits in the cache). Python has so much indirection via dictionaries going on though that it's probably not hard for each individual operation to overflow the cache by reading hash buckets scattered all over the memory space. So this is still a micro-optimisation, but this analysis gives it more of a chance (though still no certainty) of making a significant difference.
One advantage of multi-pass can be that if you need to keep state between loop iterations, the single-pass approach forces you to keep the state of all operations around. This can hurt the CPU cache (maybe the state of each operation individually fits in the cache for an entire pass, but not the state of all the operations put together). In the extreme case this effect could theoretically make the difference between the program fitting in memory and not (I have encountered this once in a program that was chewing through very large quantities of data). But in the extreme cases you know that you need to split things up, and the non-extreme cases are again micro-optimisations that are not worth making in advance.
So performance generally favours single-pass by an insignificant amount, but can in some cases favour either single-pass or multi-pass by a significant amount. The conclusion you can draw from this is the same as the general advice that applies to all programming: start by writing code in whatever way is most clear and maintainable and still adequately solves your problem. Only once you've got a mostly finished program, and only if it turns out to be "not fast enough", should you measure the performance effects of the various parts of your code to find out where it's worth spending your time.
Time spent worrying about whether to write single-pass or multi-pass algorithms for performance reasons will almost always turn out to have been wasted. So unless you have unlimited development time available to you, you will get the "best" results from your total development effort (where best includes performance) by not worrying about this up-front, and addressing it on an as-needed basis.
Personally, I would no doubt prefer the one_pass option. It definitely performs better. You may be right that the difference wouldn't be huge. Python has optimized the xrange iterator really well, but you are still doing 3 times more iterations than needed.
You may get fewer cache misses in either version compared to the other; it depends on what those transformation functions actually do.
If those functions have a lot of code and operate on different sets of data (besides the input and output), multi-pass may be better. Otherwise the single pass is likely to be better, because the current list element will likely remain cached and the loop operations are only done once instead of three times.
This is a case where comparing actual run times would be very useful.
Which is more CPU intensive, to do an if(x==num): check, or to do a sum x+y?
Your question is somewhat incomplete, because you are comparing two different operations. If you need to add two things together, then testing x == num isn't going to get you anywhere. So presumably you want to compare

if y != 0:
    sum += y

with

sum += y
It's a lot more complex for interpreted languages like Python, but on the hardware a test for non-zero introduces a branch, and that in itself can be expensive. Still, I wouldn't want to say which would be faster without timing it.
Throw into the equation different performance characteristics of different architectures and you have another confounding factor.
As always, you are best to write your code in the most natural maintainable way first and then time it. If you feel you need to extract more performance use a profiler to find hot spots and then optimise.
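In Python specifically, timeit makes this kind of measurement straightforward; a minimal sketch (the data and loop bodies are illustrative, not taken from the question):

import timeit

setup = "import random; xs = [random.randint(0, 1) for _ in range(10000)]"

with_check = '''
total = 0
for y in xs:
    if y != 0:
        total += y
'''

without_check = '''
total = 0
for y in xs:
    total += y
'''

print("with check:    %.3f" % timeit.timeit(with_check, setup=setup, number=1000))
print("without check: %.3f" % timeit.timeit(without_check, setup=setup, number=1000))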
I have an interesting problem. I'm faced with a function that takes a long time to compute a value based on some index. Call it takes_a_long_time(index). The values returned from this function are guaranteed to have a global minimum, but there are no guarantees that the index associated with it will be close to zero.
Since takes_a_long_time takes arbitrarily large positive integers as its index, there are unique constraints on how to begin the binary search. I need a way to create a finite interval in which to search for the exact minimum. My first thought was to check increasingly large intervals starting from zero. Something like:
def find_interval_with_minimum():
    start = 0
    end = 1
    interval_size = 1
    minimum_in_interval = check_minimum_in(start, end)
    while not minimum_in_interval:
        interval_size = interval_size * 2
        start = end
        end = start + interval_size
        minimum_in_interval = check_minimum_in(start, end)
    return start, end
This would seem to work fine, but there is an additional detail that really throws things off. takes_a_long_time requires exponentially more time to compute a value as indexes approach zero. Since check_minimum_in would require multiple calls to takes_a_long_time, I would like to avoid starting at zero.
So my question is, given that the minimum could be anywhere on [0, +infinity), is there any reasonable way to run this "backwards?" Or, is there some good heuristic to use to avoid checking low indices if not necessary?
I'd love a language agnostic solution. However, I am writing this in Python, so if there is a python specific approach to this, I'd take that as well.
From the comments to the question, the curve is well-behaved, so you could use something like ternary search. The only problem then is how to handle the inconvenient behaviour as you approach zero. So don't start at zero: define a new function g from your function f with g(x) = f(1/x). Search this starting from x = 0 and a small value, doubling or otherwise increasing the interval size until it contains the minimum.
To do this, you need to know the limit of f as its argument approaches infinity, or the equivalent limit of g as its argument goes to zero. If it can't be evaluated explicitly, I'd try a numerical approximation.
See the comments to the answer for some points to consider in how you increase the interval size, especially that by Steve Jessop.
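A minimal sketch of that reparameterisation, mirroring the question's find_interval_with_minimum but working in x = 1/index space so that the search starts far away from the expensive small indices (check_minimum_in_x is a hypothetical helper, analogous to the question's check_minimum_in, that evaluates g(x) = f(round(1/x)) at sample points and uses the known limit of f at infinity for x = 0):

def find_interval_with_minimum_in_x(check_minimum_in_x, x0=1e-6):
    # x near 0 corresponds to very large indices, so small indices
    # (the expensive calls) are only reached if the interval has to grow that far.
    start = 0.0
    end = x0
    interval_size = x0
    while not check_minimum_in_x(start, end):
        interval_size *= 2
        start = end
        end = start + interval_size
    return start, end

Once a bracketing interval in x is found, it can be narrowed with ternary search (the curve is assumed unimodal) and mapped back to an index range via index = 1/x.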
Sounds like the thing to do is to pick a large number, big enough that takes_a_long_time doesn't take unacceptably long. Start two threads: one that starts looking up towards positive infinity for a range containing the minimum, and another that starts looking down towards zero for a range containing the minimum. Because of the exponential time increase, 0 might as well be at infinity as far as searching is concerned. Whichever thread finds a result, cancel the other one.
But then, unless you want to take advantage of multiple CPU cores, don't start two threads (and if you do, don't start exactly two threads; start one per core or so). Just alternate doing work on one side or the other.
Given this basic strategy, you now need to tune the rate at which you approach 0. The faster you approach it, the fewer steps it takes to find the minimum if it really is on that side, but the bigger the range left to be binary searched, because on average you'll "overshoot" further towards zero. If the performance curve is reciprocal-exponential, then presumably you want to overshoot as little as possible, so you should approach 0 very slowly. It might even be that your task is computationally infeasible; "exponential" often means "impossible".
Obviously I can't say anything about what the initial "large number" should be. Is a hundred tolerable? Is a million? Graham's number? If you don't even know what's likely to have acceptable performance, you could find out by running in parallel (again, either via threads or via dovetailing) a set of calculations of takes_a_long_time for different indexes until one of them completes. Again, there's no guarantee that this is computationally feasible - if every single index that fits in the memory of your computer takes at least a billion years, you're stuck in practice even though you have a solution in theory.
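A sketch of the alternating ("dovetailing") idea, assuming each direction of the search is written as a generator that yields None while it is still working and a (start, end) interval once it has found one (all names here are hypothetical):

import itertools

def dovetail(*searches):
    # Round-robin one step of each search; return the first finished result.
    for search in itertools.cycle(searches):
        result = next(search)
        if result is not None:
            return result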