Run Python Program on Site - python

this is a question that I don't know where to search for an answer. I have a Python program that has too many calculations, for example, consider the DFS algorithm with branch factor equals 62 and depth equal to 20. If I run this program on my pc I would take ages to be completed. But is there any website that gives me resources to do this job? Like I put my code in it and run it then two days later check the results.
I'm aware that maybe this question flagged as spam or anything like that, but thanks for your help anyway!
UPDATE
I investigate further on this little question. What I really want is a Cloud Computing but for free!!

I doubt that cloud computing could help you as it might take ages in cloud as well. Cloud computing should be used when your code can be efficiently parallelised into problems with reasonable complexity. You can parallelise DFS (say, based on the choice of the first branch) but you still left with the problem of almost the same size. You should consider optimising/approximating your calculations to be runnable with restricted resources.

Related

How can I maximize throughput for an embarrassingly-parallel task in Python on Google Cloud Platform?

I am trying to use Apache Beam/Google Cloud Dataflow to speed up an existing Python application. The bottleneck of the application occurs after randomly permuting an input matrix N (default 125, but could be more) times, when the system runs a clustering algorithm on each matrix. The runs are fully independent of one another. I've captured the top of the pipeline below:
This processes the default 125 permutations. As you can see, only the RunClustering step takes an appreciable amount of time (there are 11 more steps not shown below that total to 11 more seconds). I ran the pipeline earlier today for just 1 permutation, and the Run Clustering step takes 3 seconds (close enough to 1/125th the time shown above).
I'd like the RunClustering step to finish in 3-4 seconds no matter what the input N is. My understanding is that Dataflow is the correct tool for speeding up embarrassingly-parallel computation on Google Cloud Platform, so I've spent a couple weeks learning it and porting my code. Is my understanding correct? I've also tried throwing more machines at the problem (instead of Autoscaling, which, for whatever reason, only scales up to 2-3 machines*) and specifying more powerful machine types, but those don't help.
*Is this because of a long startup time for VMs? Is there a way to use quickly-provisioned VMs, if that's the case? Another question I have is how to cut down on the pipeline startup time; it's a deal breaker if users can't get results back quickly, and the fact that the total Dataflow job time is 13–14 minutes (compared to the already excessive 6–7 for the pipeline) is unacceptable.
Your pipeline is suffering from excessive fusion, and ends up doing almost everything on one worker. This is also why autoscaling doesn't scale higher: it detects that it is unable to parallelize your job's code, so it prefers not to waste extra workers. This is also why manually throwing more workers at the problem doesn't help.
In general fusion is a very important optimization, but excessive fusion is also a common problem that, ideally, Dataflow would be able to mitigate automatically (like it automatically mitigates imbalanced sharding), but it is even harder to do, though some ideas for that are in the works.
Meanwhile, you'll need to change your code to insert a "reshuffle" (a group by key / ungroup will do - fusion never happens across a group by key operation). See Preventing fusion; the question Best way to prevent fusion in Google Dataflow? contains some example code.

When building Python with profile guided optimization do I have to leave the computer alone?

This may seem or even be a stupid question: When I build something self-tuning like Python with PGO (or ATLAS or I believe FFTW also does it), does the computer have to be otherwise idle (to not interfere with the measurements) or can I pass the time playing Doom?
The linked README from the python source distribution seems to deem this too trivial a matter to mention, but I'm genuinely unsure about this.
What you do on your computer while it is performing the PGO measurements should have no impact what so ever on the result of the optimization. What PGO do is to use measurments to find the hot paths in the code for a given data set and use this information to make the program as fast as possible for this data set and which path is hot and which is not is independent of other programs running on the computer.
To explain things a bit, when optimizing code there are trade offs. The improvement will be higher in some parts of the code and lower in others depending on which code transforms are used and where they are applied. To get a better final result you want high improvements in code that is executed a lot (hot code in compiler lingo) while you can live with smaller improvements in code that is executed less frequently (cold code). Normally a set of heuristics are used to identify these hot parts of the program and apply optimizations in a way that makes these parts as fast as possible. The problem with this approach is that the heuristics does not know anything about how the program will be used in practice and may misidentify hot code as cold.
Profile guided optimization (PGO) is a method to help the compiler to locate the hot parts of the code using data from real executions. As a first step you tell the compiler to build an instrumented version of the program to measure how the code is executed in practice, typically by adding counters to count the number of iterations in loops and which branch is chosen in if-statements. The second step is to run the instrumented program on real data. At the end of execution the program will output the values of all the added counters and by matching counters with the code it is possible to see which parts of the program are hot (high numbers) and which are cold (low numbers). Finally the program is compiled but this time agumented with the program profile. This implies that the compiler no longer need to guess which parts should be faster and which could be slower it can look it up in the profile.

Running parallel iterations

I am trying to run a sort of simulations where there are fixed parameters i need to iterate on and find out the combinations which has the least cost.I am using python multiprocessing for this purpose but the time consumed is too high.Is there something wrong with my implementation?Or is there better solution.Thanks in advance
import multiprocessing
class Iters(object):
#parameters for iterations
iters['cwm']={'min':100,'max':130,'step':5}
iters['fx']={'min':1.45,'max':1.45,'step':0.01}
iters['lvt']={'min':106,'max':110,'step':1}
iters['lvw']={'min':9.2,'max':10,'step':0.1}
iters['lvk']={'min':3.3,'max':4.3,'step':0.1}
iters['hvw']={'min':1,'max':2,'step':0.1}
iters['lvh']={'min':6,'max':7,'step':1}
def run_mp(self):
mps=[]
m=multiprocessing.Manager()
q=m.list()
cmain=self.iters['cwm']['min']
while(cmain<=self.iters['cwm']['max']):
t2=multiprocessing.Process(target=mp_main,args=(cmain,iters,q))
mps.append(t2)
t2.start()
cmain=cmain+self.iters['cwm']['step']
for mp in mps:
mp.join()
r1=sorted(q,key=lambda x:x['costing'])
returning=[r1[0],r1[1],r1[2],r1[3],r1[4],r1[5],r1[6],r1[7],r1[8],r1[9],r1[10],r1[11],r1[12],r1[13],r1[14],r1[15],r1[16],r1[17],r1[18],r1[19]]
self.counter=len(q)
return returning
def mp_main(cmain,iters,q):
fmain=iters['fx']['min']
while(fmain<=iters['fx']['max']):
lvtmain=iters['lvt']['min']
while (lvtmain<=iters['lvt']['max']):
lvwmain=iters['lvw']['min']
while (lvwmain<=iters['lvw']['max']):
lvkmain=iters['lvk']['min']
while (lvkmain<=iters['lvk']['max']):
hvwmain=iters['hvw']['min']
while (hvwmain<=iters['hvw']['max']):
lvhmain=iters['lvh']['min']
while (lvhmain<=iters['lvh']['max']):
test={'cmain':cmain,'fmain':fmain,'lvtmain':lvtmain,'lvwmain':lvwmain,'lvkmain':lvkmain,'hvwmain':hvwmain,'lvhmain':lvhmain}
y=calculations(test,q)
lvhmain=lvhmain+iters['lvh']['step']
hvwmain=hvwmain+iters['hvw']['step']
lvkmain=lvkmain+iters['lvk']['step']
lvwmain=lvwmain+iters['lvw']['step']
lvtmain=lvtmain+iters['lvt']['step']
fmain=fmain+iters['fx']['step']
def calculations(test,que):
#perform huge number of calculations here
output={}
output['data']=test
output['costing']='foo'
que.append(output)
x=Iters()
x.run_thread()
From a theoretical standpoint:
You're iterating every possible combination of 6 different variables. Unless your search space is very small, or you wanted just a very rough solution, there's no way you'll get any meaningful results within reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straighforward "shape" (it's injective), you can use a greedy algorithm such as hill climbing, or gradient descent. If it's more complex, you can try shotgun hill climbing.
There are a lot more complex algorithms, but these are the basic, and will probably help you a lot in this situation.
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function 19,200. If this is what you want, it seems very feasible. In fact, if I comment the y=calculations(test,q), this returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advise is to not use it until you already have your code executing reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible, compared to the gains you get by the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange

Efficient scheduling of university courses

I'm currently working on a website that will allow students from my university to automatically generate valid schedules based on the courses they'd like to take.
Before working on the site itself, I decided to tackle the issue of how to schedule the courses efficiently.
A few clarifications:
Each course at our university (and I assume at every other
university) comprises of one or more sections. So, for instance,
Calculus I currently has 4 sections available. This means that, depending on the amount of sections, and whether or not the course has a lab, this drastically affects the scheduling process.
Courses at our university are represented using a combination of subject abbreviation and course code. In the case of Calculus I: MATH 1110.
The CRN is a code unique to a section.
The university I study at is not mixed, meaning males and females study in (almost) separate campuses. What I mean by almost is that the campus is divided into two.
The datetimes and timeranges dicts are meant to decreases calls to datetime.datetime.strptime(), which was a real bottleneck.
My first attempt consisted of the algorithm looping continuously until 30 schedules were found. Schedules were created by randomly choosing a section from one of the inputted courses, and then trying to place sections from the remaining courses to try to construct a valid schedule. If not all of the courses fit into the schedule i.e. there were conflicts, the schedule was scrapped and the loop continued.
Clearly, the above solution is flawed. The algorithm took too long to run, and relied too much on randomness.
The second algorithm does the exact opposite of the old one. First, it generates a collection of all possible schedule combinations using itertools.product(). It then iterates through the schedules, crossing off any that are invalid. To ensure assorted sections, the schedule combinations are shuffled (random.shuffle()) before being validated. Again, there is a bit of randomness involved.
After a bit of optimization, I was able to get the scheduler to run in under 1 second for an average schedule consisting of 5 courses. That's great, but the problem begins once you start adding more courses.
To give you an idea, when I provide a certain set of inputs, the amount of combinations possible is so large that itertools.product() does not terminate in a reasonable amount of time, and eats up 1GB of RAM in the process.
Obviously, if I'm going to make this a service, I'm going to need a faster and more efficient algorithm. Two that have popped up online and in IRC: dynamic programming and genetic algorithms.
Dynamic programming cannot be applied to this problem because, if I understand the concept correctly, it involves breaking up the problem into smaller pieces, solving these pieces individually, and then bringing the solutions of these pieces together to form a complete solution. As far as I can see, this does not apply here.
As for genetic algorithms, I do not understand them much, and cannot even begin to fathom how to apply one in such a situation. I also understand that a GA would be more efficient for an extremely large problem space, and this is not that large.
What alternatives do I have? Is there a relatively understandable approach I can take to solve this problem? Or should I just stick to what I have and hope that not many people decide to take 8 courses next semester?
I'm not a great writer, so I'm sorry for any ambiguities in the question. Please feel free to ask for clarification and I'll try my best to help.
Here is the code in its entirety.
http://bpaste.net/show/ZY36uvAgcb1ujjUGKA1d/
Note: Sorry for using a misleading tag (scheduling).
Scheduling is a very famous constraint satisfaction problem that is generally NP-Complete. A lot of work has been done on the subject, even in the same context as you: Solving the University Class Scheduling Problem Using Advanced ILP Techniques. There are even textbooks on the subject.
People have taken many approaches, including:
Dynamic programming
Genetic algorithms
Neural networks
You need to reduce your problem-space and complexity. Make as many assumptions as possible (max amount of classes, block based timing, ect). There is no silver bullet for this problem but it should be possible to find a near-optimal solution.
Some semi-recent publications:
QUICK scheduler a time-saving tool for scheduling class sections
Scheduling classes on a College Campus
Did you ever read anything about genetic programming? The idea behind it is that you let the 'thing' you want solved evolve, just by itsself, until it has grown to the best solution(s) possible.
You generate a thousand schedules, of which usually zero are anywhere in the right direction of being valid. Next, you change 'some' courses, randomly. From these new schedules you select some of the best, based on ratings you give according to the 'goodness' of the schedule. Next, you let them reproduce, by combining some of the courses on both schedules. You end up with a thousand new schedules, but all of them a tiny fraction better than the ones you had. Let it repeat until you are satisfied, and select the schedule with the highest rating from the last thousand you generated.
There is randomness involved, I admit, but the schedules keep getting better, no matter how long you let the algorithm run. Just like real life and organisms there is survival of the fittest, and it is possible to view the different general 'threads' of the same kind of schedule, that is about as good as another one generated. Two very different schedules can finally 'battle' it out by cross breeding.
A project involving school schedules and genetic programming:
http://www.codeproject.com/Articles/23111/Making-a-Class-Schedule-Using-a-Genetic-Algorithm
I think they explain pretty well what you need.
My final note: I think this is a very interesting project. It is quite difficult to make, but once done it is just great to see your solution evolve, just like real life. Good luck!
The way you're currently generating combinations of sections is probably throwing up huge numbers of combinations that are excluded by conflicts between more than one course. I think you could reduce the number of combinations that you need to deal with by generating the product of the sections for only two courses first. Eliminate the conflicts from that set, then introduce the sections for a third course. Eliminate again, then introduce a fourth, and so on. This should see a more linear growth in the processing time required as the number of courses selected increases.
This is a hard problem. It you google something like 'course scheduling problem paper' you will find a lot of references. Genetic algorithm - no, dynamic programming - yes. GAs are much harder to understand and implement than standard DP algos. Usually people who use GAs out of the box, don't understand standard techniques. Do some research and you will find different algorithms. You might be able to find some implementations. Coming up with your own algorithm is way, way harder than putting some effort into understanding DP.
The problem you're describing is a Constraint Satisfaction Problem. My approach would be the following:
Check if there's any uncompatibilities between courses, if yes, record them as constraints or arcs
While not solution is found:
Select the course with less constrains (that is, has less uncompatibilities with other courses)
Run the AC-3 algorithm to reduce search space
I've tried this approach with sudoku solving and it worked (solved the hardest sudoku in the world in less than 10 seconds)

Running multiple instances of a python program efficiently & economically?

I wrote a program that calls a function with the following prototype:
def Process(n):
# the function uses data that is stored as binary files on the hard drive and
# -- based on the value of 'n' -- scans it using functions from numpy & cython.
# the function creates new binary files and saves the results of the scan in them.
#
# I optimized the running time of the function as much as I could using numpy &
# cython, and at present it takes about 4hrs to complete one function run on
# a typical winXP desktop (three years old machine, 2GB memory etc).
My goal is to run this function exactly 10,000 times (for 10,000 different values of 'n') in the fastest & most economical way. following these runs, I will have 10,000 different binary files with the results of all the individual scans. note that every function 'run' is independent (meaning, there is no dependency whatsoever between the individual runs).
So the question is this. having only one PC at home, it is obvious that it will take me around 4.5 years (10,000 runs x 4hrs per run = 40,000 hrs ~= 4.5 years) to complete all runs at home. yet, I would like to have all the runs completed within a week or two.
I know the solution would involve accessing many computing resources at once. what is the best (fastest / most affordable, as my budget is limited) way to do so? must I buy a strong server (how much would it cost?) or can I have this run online? in such a case, is my propritary code gets exposed, by doing so?
in case it helps, every instance of 'Process()' only needs about 500MB of memory. thanks.
Check out PiCloud: http://www.picloud.com/
import cloud
cloud.call(function)
Maybe it's an easy solution.
Does Process access the data on the binary files directly or do you cache it in memory? Reducing the usage of I/O operations should help.
Also, isn't it possible to break Process into separate functions running in parallel? How is the data dependency inside the function?
Finally, you could give some cloud computing service like Amazon EC2 a try (don't forget to read this for tools), but it won't be cheap (EC2 starts at $0.085 per hour) - an alternative would be going to an university with a computer cluster (they are pretty common nowadays, but it will be easier if you know someone there).
Well, from your description, it sounds like things are IO bound... In which case parallelism (at least on one IO device) isn't going to help much.
Edit: I just realized that you were referring more to full cloud computing, rather than running multiple processes on one machine... My advice below still holds, though.... PyTables is quite nice for out-of-core calculations!
You mentioned that you're using numpy's mmap to access the data. Therefore, your execution time is likely to depend heavily on how your data is structured on the disc.
Memmapping can actually be quite slow in any situation where the physical hardware has to spend most of its time seeking (e.g. reading a slice along a plane of constant Z in a C-ordered 3D array). One way of mitigating this is to change the way your data is ordered to reduce the number of seeks required to access the parts you are most likely to need.
Another option that may help is compressing the data. If your process is extremely IO bound, you can actually get significant speedups by compressing the data on disk (and sometimes even in memory) and decompressing it on-the-fly before doing your calculation.
The good news is that there's a very flexible, numpy-oriented library that's already been put together to help you with both of these. Have a look at pytables.
I would be very surprised if tables.Expr doesn't significantly (~ 1 order of magnitude) outperform your out-of-core calculation using a memmapped array. See here for a nice, (though canned) example. From that example:

Categories

Resources