I have a problem in my Storm setup and it looks like there's some discrepancy between the number of executors I set for the topology, and the number of actual bolt processes I see running on one of the servers in that topology.
When setting the number of executors per bolt I use the setBolt method of TopologyBuilder. The number of executors shown in the Storm UI is correct (105 in total), and when drilling down to the number of executors per server I see that every server in my topology should hold 7-9 executors. This is all well and good; however, when SSHing into the server and using htop I see one parent process with at least 30 child processes running for that bolt type.
A few notes:
I am using a very old version of Storm (0.9.3) that unfortunately I can't upgrade.
I'm running a Storm topology that runs Python processes (not sure how relevant that is).
I think I'm missing something about the relationship between the number of Storm processes and the number of bolts/executors I'm configuring, or about how to read htop properly. In any case, I would love an explanation.
I found this answer, saying that htop shows threads as processes, but I still don't think that answers my question.
Thank you
Related
I'm trying to understand the logic behind Flink's slot and parallelism settings in the .yaml configuration file.
The official Flink documentation states that for each core in your CPU you should allocate one slot and increase the parallelism level by one accordingly.
But I suppose this is just a recommendation. If, for example, I have powerful cores (e.g. the newest i7 at a high clock speed), that is different from having an old CPU with a limited clock rate, so running more slots and higher parallelism than my CPU's core count isn't irrational.
But is there any way, other than just testing different configurations, to determine my system's maximum capabilities with Flink?
Just for the record, I'm using Flink's Batch Python API.
It is recommended to assign each slot at least one CPU core because each operator is executed by at least one thread. Provided that you don't execute blocking calls in your operators and the bandwidth is high enough to constantly feed the operators with new data, one slot per CPU core should keep your CPU busy.
If, on the other hand, your operators issue blocking calls (e.g. communicating with an external DB), it can sometimes make sense to configure more slots than you have cores.
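For reference, the knobs this answer is talking about live in flink-conf.yaml; a minimal sketch with illustrative values (not a recommendation for your hardware):

```yaml
# flink-conf.yaml -- illustrative values only
# Usual starting point: one slot per physical core on the TaskManager.
taskmanager.numberOfTaskSlots: 4

# Parallelism used when a job does not set its own.
parallelism.default: 4
```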
There are several interesting points in your question.
First, the slots in Flink are the processing capability that each TaskManager brings to the cluster, and they limit both the number of applications that can be executed on it and the number of operators that can run at the same time. As a rule of thumb, a machine should not offer more processing slots than the CPU units present in it. Of course, this holds if all the tasks running on it are CPU-intensive with little I/O. If your application has operators that block heavily on I/O, there is no problem in configuring more slots than the CPU cores available to your TaskManager, as @Till_Rohrmann said.
On the other hand, the default parallelism is the number of CPU cores available to your application in the Flink cluster, although you can override it with a parameter when you run your application or set it in your code. Note that a Flink cluster can run multiple applications simultaneously, and it is usually undesirable for a single one to block the entire cluster (unless that is the goal), so the default parallelism is usually lower than the number of slots available in your cluster (the sum of all slots contributed by your TaskManagers).
However, an application with parallelism 4 roughly means that if it contains a pipeline such as input().Map().Reduce().Sink(), there should be 4 instances of each operator, so the total number of cores used by the application can be greater than 4. But this is something the Flink developers should explain ;)
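Since the question mentions the Batch Python API: you can also set the parallelism per job from code. A rough sketch using the legacy flink.plan Python API (names recalled from the old docs and may differ between 1.x versions):

```python
# Sketch only -- the legacy Flink Batch Python API; exact names may vary by version.
from flink.plan.Environment import get_environment

env = get_environment()
env.set_parallelism(4)      # overrides parallelism.default for this job only

# ... build your plan on `env` here (from_elements / read_text, map, reduce, output) ...

env.execute(local=True)     # local=True runs the plan in a local mini cluster
```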
I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs in which I have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
For example, there seems to be an upper limit on how many entries I can fetch from the database.
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I have too many entries the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (SQLAlchemy, Postgres)?
2. Since this does seem to behave differently on my local machine there must be a way to control this. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries which connect to PostgreSQL will read the entire result set into memory, by default. But some libraries have a way to tell it to process the results row by row, so they aren't all read into memory at once. I don't know if flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
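If you want to try the row-by-row route with SQLAlchemy, yield_per() plus a server-side cursor does roughly that. A minimal sketch, assuming a Flask-SQLAlchemy style session and a hypothetical Entry model (adjust the names to your app):

```python
# `Entry`, `db.session` and `recompute` are placeholders for your own code.
# yield_per() fetches rows in batches instead of materialising the whole
# result set; stream_results=True asks the driver for a server-side cursor.
query = (
    db.session.query(Entry)
    .execution_options(stream_results=True)
    .yield_per(1000)
)

for entry in query:
    entry.value = recompute(entry)   # placeholder for your update logic

db.session.commit()
```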
Most likely the kernel is killing your Python script; Python can have horrible memory usage.
I have a feeling you are trying to run these 2000-entry batches in a loop in a single Python process. Python does not release all of the memory it has used back to the OS, so memory usage grows until the process gets killed. (You can watch this happen with the top command.)
You should try adapting your script to process 2000 records per run and then quit; if you run it multiple times, it should continue where it left off. Or, a better option, use multiprocessing and run each job in a separate worker. Run the jobs serially and let each worker die when it finishes; that way the memory is released back to the OS when the worker exits.
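A minimal sketch of that multiprocessing variant (the batch function, table size and batch size are placeholders); with maxtasksperchild=1 every batch gets a brand-new worker process, so its memory is returned to the OS as soon as the batch finishes:

```python
from multiprocessing import Pool

BATCH = 2000
TOTAL_ROWS = 100000   # placeholder: however many entries you need to process

def process_batch(offset):
    # Placeholder: open a fresh DB session here, load rows
    # [offset, offset + BATCH), update them, commit and close the session.
    return offset

if __name__ == "__main__":
    offsets = range(0, TOTAL_ROWS, BATCH)
    # processes=1 keeps the jobs serial; maxtasksperchild=1 retires each
    # worker after a single batch so the memory goes back to the OS.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        for done in pool.imap(process_batch, offsets):
            print("finished batch starting at offset", done)
```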
The setup
Our setup is unique in the following ways:
* we have a large number of distinct Django installations on a single server.
* each of these has its own code base, and even runs as a separate Linux user. (Currently implemented using Apache mod_wsgi, with each installation configured with a small number of threads (2-5) behind an nginx proxy.)
* each of these installations has a significant memory footprint (20-200 MB).
* these installations are "web apps": they are not exposed to the general web, and will be used by a limited number of users (1-100).
* traffic is expected to come in (small) bursts per installation, i.e. if a certain installation is being used, a number of follow-up requests are to be expected for that installation (but not others).
As each of these processes can rack up anywhere between 20 and 200 MB of memory, the total memory footprint of the Django processes is "too large"; i.e. it quickly exceeds the available physical memory on the server, leading to extensive swapping.
I see 2 specific problems with the current setup:
1. We're leaving the guessing of which installation needs to be in physical memory to the OS, and it seems to me that we can do better. Specifically, an installation that is currently getting more traffic would be better off with a larger number of ready workers. Also: installations that get no traffic for extended periods could even do with 0 ready workers, as we can live with the 1-2 s for the initial request as long as follow-up requests are fast enough. A specific reason I think we can be "smarter than the OS": after a server restart on a slow day, the server is much more responsive (the difference is so great it can be observed with the naked eye). This suggests to me that the overhead of presumably swapped-out processes is significant even when they have not actively served requests for a full day.
2. Some requests have larger memory needs than others. A process that has once dealt with such a request has claimed the memory from the OS but, due to fragmentation, will likely not be able to return it. It would be worthwhile to be able to retire such memory hogs. (Currently we simply have a restart-after-n-requests configured in Apache, but this is not specifically triggered by fragmentation.)
The question:
My idea for a solution would be to have the main server spin workers up and down per installation, depending on each installation's traffic. Further niceties:
* configure some general system constraints, i.e. once the server becomes busy be less generous in spinning up processes
* restart memory hogs.
There are many Python (WSGI) servers available. Which of them would (easily) allow for such a setup, and what are good pointers for configuring it?
See if uWSGI works for you; I don't think there is anything more flexible.
You can have it spawn and kill workers dynamically, set a maximum memory usage, etc. Or you might come up with better ideas after reading its docs.
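To make that concrete, a rough sketch of a per-installation uWSGI vassal config (paths and numbers are illustrative, not tuned values):

```ini
; site1.ini -- one such file per installation, e.g. managed by uWSGI's Emperor
[uwsgi]
chdir           = /srv/site1              ; per-installation code base
module          = site1.wsgi:application
uid             = site1                   ; keeps the separate-unix-user model
gid             = site1
socket          = /run/uwsgi/site1.sock   ; nginx proxies to this socket

processes       = 6        ; upper bound of workers for this installation
cheaper         = 1        ; keep only one worker around when traffic is low
cheaper-initial = 1        ; start with one worker, spawn more under load
reload-on-rss   = 200      ; recycle any worker whose RSS exceeds 200 MB
max-requests    = 1000     ; periodic recycling as a safety net
```

The cheaper subsystem covers the spin-up/down-per-traffic requirement and reload-on-rss covers the memory hogs; the idle and die-on-idle options can shut a rarely used installation down entirely, though how that interacts with the Emperor respawning vassals is something to verify against the uWSGI docs for your version.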
Python seems to have many different packages available to assist with parallel processing on an SMP-based system or across a cluster. I'm interested in building a client-server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes a few inputs and processes things for a while. I need to collect enough samples from the simulation to estimate a mean within a user-specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which reports back to the server at some interval with the samples it has collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the process.
Given this need for communication between the client and server processes, I question whether batch scheduling is a viable solution. Sorry, I should have been clearer to begin with.
Have a go with ParallelPython. It seems easy to use, and should provide the jobs-and-queues interface that you want.
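For a flavour of what that looks like with the pp module, a small sketch (the simulate function, sample counts and node discovery are placeholders/assumptions):

```python
import pp  # Parallel Python

def simulate(n_samples, seed):
    """Placeholder for your simulator: return a list of numeric samples."""
    import random                    # imports go inside functions shipped to workers
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_samples)]

# "*" autodiscovers ppserver.py instances on the network; you can also list
# nodes explicitly, e.g. ppservers=("node1:60000", "node2:60000").
job_server = pp.Server(ppservers=("*",))

jobs = [job_server.submit(simulate, (1000, seed)) for seed in range(8)]

samples = []
for job in jobs:
    samples.extend(job())            # calling the job object blocks until it finishes

print(len(samples), "samples collected")
job_server.print_stats()
```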
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.
Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.
The simplest way to do this would probably be just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see if they're sufficient or if more jobs need to be submitted.
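A hedged sketch of that polling loop (the file layout, one-sample-per-line format and the 95% normal-approximation interval are all assumptions):

```python
import glob
import math
import time

TARGET_HALF_WIDTH = 0.01   # user-specified precision for the 95% CI (made up)

def load_samples():
    """Read one float per line from every finished worker output file."""
    samples = []
    for path in glob.glob("results/*.done"):       # assumed naming convention
        with open(path) as f:
            samples.extend(float(line) for line in f if line.strip())
    return samples

while True:
    samples = load_samples()
    if len(samples) >= 2:
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        half_width = 1.96 * math.sqrt(var / n)     # normal approximation
        print(f"n={n} mean={mean:.4f} +/- {half_width:.4f}")
        if half_width <= TARGET_HALF_WIDTH:
            break        # enough samples: stop the clients / reconfigure here
    time.sleep(30)       # poll every 30 seconds
```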
I'm running Django as threaded fastcgi via flup, served by lighttpd, communicating via sockets.
What is the expected CPU usage for each FastCGI thread under no load? On startup, each thread runs at 3-4% CPU usage for a while, and then backs off to around 0.5% over the course of a couple of hours. It doesn't sink below this level.
Is this much CPU usage normal? Do I have some bug in my code that is causing the idle loop to require more processing than it should? I expected the process to use no measurable CPU when it was completely idle.
I'm not doing anything ridiculously complicated with Django, definitely nothing that should require extended processing. I realize that this isn't a lot of load, but if it's a bug I introduced, I would like to fix it.
I've looked at this with Django running as FastCGI on both Slicehost (Django 1.1, Python 2.6) and Dreamhost (Django 1.0, Python 2.5), and I can say this:
Running the top command shows the processes use a large amount of CPU to start up for ~2-3 seconds, then drop down to 0 almost immediately.
Running the ps aux command after starting the Django app shows something similar to what you describe; however, this is actually misleading. From the Ubuntu man page for ps:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.
Basically, the %CPU column shown by ps is an average over the time the process has been running, and the decay you see comes from the high initial spike being averaged with the inactivity that follows.
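A toy illustration of that averaging (the numbers are made up): a process that burns 3 seconds of CPU during startup and then sits idle shows a %CPU that decays only because the process lifetime in the denominator keeps growing.

```python
# ps-style %CPU: cumulative CPU time divided by process lifetime.
startup_cpu = 3.0                        # seconds of CPU burned at startup (made up)

for lifetime in (3, 60, 3600, 7200):     # process lifetime in seconds
    print(f"after {lifetime:>5}s: {100 * startup_cpu / lifetime:6.2f}%")
# after     3s: 100.00%
# after    60s:   5.00%
# after  3600s:   0.08%
# after  7200s:   0.04%
```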
Your FastCGI threads should not consume any (noticeable) CPU if there are no requests to process.
You should investigate the load you are describing. I use the same architecture and my threads are completely idle.