I have a few basic questions on Dask:
Is it correct that I have to use Futures when I want to use Dask for distributed computations (i.e. on a cluster)?
In that case, i.e. when working with Futures, are task graphs still the way to reason about computations? If yes, how do I create them?
How can I generally, i.e. no matter whether I am working with a Future or with a Delayed, get the dictionary associated with a task graph?
As an edit:
My application is that I want to parallelize a for loop either on my local machine or on a cluster (i.e. it should work on a cluster).
As a second edit:
I think I am also somewhat unclear regarding the relation between Futures and delayed computations.
Thanks
1) Yup. If you're sending data across a network, you need some way of asking the machine doing the computing how the number-crunching is coming along, and Futures represent more or less exactly that.
2) No. With Futures, you're executing the functions eagerly - spinning up the computations as soon as you can, then waiting for the results to come back (from another thread/process locally, or from some remote machine you've offloaded the job onto). The relevant abstraction here would be a Queue (a priority queue, specifically).
3) For a Delayed instance you could do some_delayed.dask, or for an Array, Array.dask, and optionally wrap the whole thing in dict(). I don't know for sure if it's reliably set up this way for every single API, though (I would assume so, but you know what they say about assuming...). See the sketch below.
4) The simplest analogy would probably be: Delayed is essentially a fancy Python yield wrapper over a function; Future is essentially a fancy async/await wrapper over a function.
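As a rough illustration of 3) and 4), here's a minimal sketch. The inc function is a made-up placeholder, and Client(processes=False) just spins up a local scheduler for demonstration; on a real cluster you'd pass the scheduler address instead.

    import dask
    from dask.distributed import Client

    def inc(x):                           # placeholder work function
        return x + 1

    # Delayed route: build the task graph lazily, compute later
    lazy = [dask.delayed(inc)(i) for i in range(5)]
    total = dask.delayed(sum)(lazy)
    print(dict(total.dask))               # the dict-like task graph behind it
    print(total.compute())                # 15

    # Futures route: eager execution on a (local or remote) cluster
    client = Client(processes=False)      # or Client("tcp://scheduler:8786")
    futures = client.map(inc, range(5))   # starts running immediately
    print(client.gather(futures))         # [1, 2, 3, 4, 5]
    client.close()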
I have a C++ game which sends a Python-SocketIO request to a server; the server loads the requested JSON data into memory for reference and then sends portions of it to the client as necessary. Most of the previous answers here describe the server having to repeatedly search a database, whereas in this case all of the data is kept in memory after the first request and is released after the client disconnects.
I don't want a large influx of memory usage whenever a new client joins; however, most of what I have seen points away from using small files (50-100 kB absolute maximum) and toward using large files, which would cause exactly the large memory usage I'm trying to avoid.
My question is this: would it still be beneficial to use one large file, or should I use the smaller files, both from an organization standpoint and from a performance one?
Is it better to have one large file or many smaller files for data storage?
Both can potentially be better. Each has its advantages and disadvantages. Which is better depends on the details of the use case. It's quite possible that the best approach is something in between, such as a few medium-sized files.
Regarding performance, the most accurate way to verify what is best is to try out each of them and measure.
You should split it into multiple files, to use less memory, if you're only accessing small parts of it at a time. For example, if you're only accessing, say, a single player, then your folder structure would look like this:
players
- 0.json
- 1.json
other
- 0.json
Then you could write a function that just gets the player with a certain id (0, 1, etc.).
If you're planning on accessing all of the players, other objects, and more at once, then have the same folder structure and just concatenate the parts you need into one object in memory.
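A minimal sketch of that per-id loader, assuming the players folder layout above (error handling omitted):

    import json
    from pathlib import Path

    DATA_DIR = Path("players")            # folder from the layout above

    def get_player(player_id):
        # load only the one small file for this player
        with open(DATA_DIR / f"{player_id}.json") as f:
            return json.load(f)

    def get_all_players():
        # merge every per-player file into one in-memory dict keyed by id
        return {p.stem: json.loads(p.read_text()) for p in DATA_DIR.glob("*.json")}

    player = get_player(0)                # e.g. players/0.json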
I would like to ask a question about transferring data from C++ to Python in real time.
My situation is:
1) I am generating data every 1 ms in C++,
2) I would like to stack the data for a certain amount of time and make a dataset,
3) I would like to run some machine learning algorithm written in Python without shutting down the C++ program.
So far, I have thought about several things:
Option 1) Save the dataset as a txt file and read it from Python. But this seems too slow due to the file I/O.
Option 2) Use IPC such as ZeroMQ (a rough sketch of this option is below). I am quite new to IPC, so I am not sure if it is the thing I really want. Also, among the multiple methods (mmap, shared memory, message queues, ...), I do not know which one is the best fit for me.
Option 3) Use UDP. From my understanding, UDP sometimes sends the same data twice, skips data, or mixes data up (e.g. data from a previous time step arrives later).
Are there any recommendations on what I should search for and study?
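For what it's worth, here is a rough sketch of what the Python end of option 2 could look like with ZeroMQ (pyzmq). The port, message format and run_model call are made-up placeholders; the C++ side would use the matching PUSH socket.

    import zmq

    context = zmq.Context()
    sock = context.socket(zmq.PULL)
    sock.bind("tcp://*:5555")            # the C++ producer connects and pushes samples

    batch = []
    while True:
        sample = sock.recv_json()        # e.g. {"t": 0.001, "values": [...]}
        batch.append(sample)
        if len(batch) >= 1000:           # stack ~1 s worth of 1 ms samples
            # run_model(batch)           # hand the dataset to the ML code (hypothetical)
            batch = []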
I have a Python program which runs in a loop, downloading 20k RSS feeds using feedparser and inserting the feed data into an RDBMS.
I have observed that it starts at 20-30 feeds a minute and gradually slows down. After a couple of hours it comes down to 4-5 feeds an hour. If I kill the program and restart it from where it left off, the throughput is again 20-30 feeds a minute.
It is certainly not MySQL that is slowing down.
What could be potential issues with the program?
In all likelihood the issue is to do with memory. You are probably holding the feeds in memory or somehow accumulating memory that isn't getting garbage collected. To diagnose:
Look at the size of your process (Task Manager on Windows, top on Unix/Linux) and monitor it as it grows while the feeds are processed.
Then you can use a memory profiler to figure out what exactly is consuming the memory.
Once you have found that, you can restructure the code accordingly.
A few tips:
Do an explicit garbage collection call (gc.collect()) after setting any relevant unused data structures to empty
Use a multiprocessing scheme where you spawn multiple processes that each handle a smaller number of feeds (see the sketch after these tips)
Maybe move to a 64-bit system if you are on a 32-bit one
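A rough sketch of that multiprocessing idea (the DB insert and batch size are placeholders; feedparser.parse is the real call):

    import multiprocessing as mp
    import feedparser

    def fetch_batch(urls):
        # each worker parses its own batch; its memory is freed when the process exits
        for url in urls:
            feed = feedparser.parse(url)
            # insert_into_db(feed)       # hypothetical DB insert
        return len(urls)

    def chunks(seq, size):
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    if __name__ == "__main__":
        urls = ["http://example.com/rss"]             # your 20k feed URLs
        with mp.Pool(processes=4) as pool:
            pool.map(fetch_batch, list(chunks(urls, 500)))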
Some suggestions on memory profilers:
https://pypi.python.org/pypi/memory_profiler
This one is quite good and the decorators are helpful
https://stackoverflow.com/a/110826/559095
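The decorator-based usage of memory_profiler looks roughly like this (process_feeds is a stand-in for your real feed loop; run it with python -m memory_profiler script.py):

    from memory_profiler import profile

    @profile                 # prints a line-by-line memory report for this function
    def process_feeds(urls):
        results = []
        for url in urls:
            results.append(len(url))     # placeholder for the real feed work
        return results

    if __name__ == "__main__":
        process_feeds(["http://example.com/rss"] * 100)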
I have to write something that pings more than 3000 IP addresses continuously (nonstop), and if an IP has not responded to the ping x times in a row, report it to the operators. I do not know what kinds of subjects I have to take care of: resource checking, threading or multiprocessing, using Celery or RabbitMQ (since I do not have any experience working with them), or anything else? I seriously do not have a clue where to start.
I appreciate any ideas in advance.
Do you have to reinvent this? There are lots of excellent monitoring apps (including free, open-source ones) already out there, e.g. Nagios, Splunk, Ganglia, to name a few.
There are lots of problems you will run into doing it yourself; some that come immediately to mind:
Running out of resources on the monitoring box itself (i.e. it is starved of CPU / network to do all that monitoring). This shouldn't be a problem for those numbers, but at greater scale it would be.
Dealing with multi-threading in your Python app. It's always hard, especially when things go wrong.
Dealing with flapping of these services (possibly less of an issue just for pings).
"Who monitors the monitor?"
Firewalls / routers ditching responses to pings on healthy boxes.
Detecting higher-level issues for that machine (i.e. pings still responding, but everything else useful on that machine is dead, out of disk, etc).
If you do still want to do it yourself, I'd start with a basic queue of hosts, polled in a round-robin fashion (see the sketch below).
You could try scheduling these tasks with generators (though they can be quite hard to understand / debug), or go straight to multi-threading. As you say, using an AMQP implementation like RabbitMQ would be good for persistence (so you can restart your Python program, etc.), but it sounds a bit like overkill to start with.
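A minimal sketch of that queue / round-robin idea (the ping flags are for Linux, and the failure threshold and notify function are placeholders):

    import subprocess
    from collections import deque

    HOSTS = ["192.0.2.1", "192.0.2.2"]      # your ~3000 IPs
    MAX_FAILURES = 3                        # the "x times in a row" threshold

    def ping(host):
        # one ICMP echo request with a 1 second timeout (Linux ping flags)
        result = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                                stdout=subprocess.DEVNULL)
        return result.returncode == 0

    def notify_operators(host):
        print(f"ALERT: {host} is not responding")   # placeholder for email/pager/etc.

    failures = {host: 0 for host in HOSTS}
    queue = deque(HOSTS)                    # round-robin: pop the front, push it back
    while True:
        host = queue.popleft()
        if ping(host):
            failures[host] = 0
        else:
            failures[host] += 1
            if failures[host] == MAX_FAILURES:
                notify_operators(host)
        queue.append(host)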