I have a few basic questions on Dask:
Is it correct that I have to use Futures when I want to use Dask for distributed computations (i.e. on a cluster)?
In that case, i.e. when working with Futures, are task graphs still the way to reason about computations? If yes, how do I create them?
How can I get the dictionary associated with a task graph in general, i.e. whether I am working with a Future or with a Delayed?
As an edit:
My application is parallelizing a for loop, either on my local machine or on a cluster (i.e. the same code should work in both settings).
As a second edit:
I think I am also somewhat unclear regarding the relation between Futures and delayed computations.
Thanks
1) Yup. If you're sending the data through a network, you need some way of asking the machine doing the computing how the number-crunching is coming along, and a Future represents more or less exactly that.
2) No. With Futures, you're executing the functions eagerly - spinning up the computations as soon as you can, then waiting for the results to come back (from another thread/process locally, or from some remote you've offloaded the job onto). The relevant abstraction here would be a Queue (a Priority Queue, specifically).
3) For a Delayed instance you could do some_delayed.dask, or for an Array, some_array.dask; optionally wrap the result in dict(). I don't know for sure whether it's reliably set up this way for every single API, though (I would assume so, but you know what they say about assuming...).
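To illustrate, here is a minimal sketch (the add function is a made-up toy example) of pulling the task graph out of a Delayed as a plain dictionary:
```python
import dask

@dask.delayed
def add(x, y):  # hypothetical toy function
    return x + y

total = add(add(1, 2), 3)

# Both forms give a mapping of task keys to tasks;
# dict() turns it into a plain dictionary you can inspect.
graph = dict(total.__dask_graph__())  # protocol method, or: dict(total.dask)
print(graph)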
4) The simplest analogy would probably be: Delayed is essentially a fancy Python yield wrapper over a function; Future is essentially a fancy async/await wrapper over a function.
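Putting 1), 2) and 4) together for the for-loop use case, here is a minimal sketch (assuming dask.distributed is installed; inc is a stand-in for the real per-iteration work) of the same loop done lazily with Delayed and eagerly with Futures:
```python
import dask
from dask.distributed import Client

def inc(x):  # stand-in for the real per-iteration work
    return x + 1

if __name__ == "__main__":
    client = Client()  # local cluster; pass a scheduler address for a remote one

    # Lazy (Delayed): build the whole task graph first, run it on compute()
    lazy = [dask.delayed(inc)(i) for i in range(10)]
    results = dask.compute(*lazy)

    # Eager (Futures): each submit starts running immediately
    futures = [client.submit(inc, i) for i in range(10)]
    results = client.gather(futures)
```
The same code runs unchanged on a real cluster; only the Client construction differs.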
I have a C++ game that sends a Python-SocketIO request to a server, which loads the requested JSON data into memory and then sends portions of it to the client as necessary. Most of the previous answers here say the server has to repeatedly search the database, whereas in this case all of the data is stored in memory after the first request and released when the client disconnects.
I don't want a large spike in memory usage whenever a new client joins; however, most of what I have seen points away from small files (50-100 kB absolute maximum) and toward large files, which would cause exactly the memory usage I'm trying to avoid.
My question is this: would it still be beneficial to use one large file, or should I use the smaller files, both from an organization standpoint and from a performance one?
Is it better to have one large file or many smaller files for data storage?
Either can be better. Each has its advantages and disadvantages; which is better depends on the details of the use case. It's quite possible that the best approach is something in between, such as a few medium-sized files.
Regarding performance, the most accurate way to verify what is best is to try out each of them and measure.
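For example, a rough way to measure it (the file and folder names are placeholders) might be:
```python
import json
import time
from pathlib import Path

def time_load(path):
    """Time how long one JSON file takes to load into memory."""
    start = time.perf_counter()
    data = json.loads(Path(path).read_text())
    return time.perf_counter() - start, data

# Compare one big file against the sum of many small ones.
big_time, _ = time_load("all_data.json")  # placeholder path
small_time = sum(time_load(p)[0] for p in Path("players").glob("*.json"))
print(big_time, small_time)
```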
If you're only accessing small parts of the data, you should split it into multiple files to use less memory. For example, if you're only accessing a single player, your folder structure could look like this:
players
- 0.json
- 1.json
other
- 0.json
Then you could write a function that just gets the player with a certain id (0, 1, etc.).
If you're planning on accessing all of the players, other objects, and more at once, then have the same folder structure and just concatenate the parts you need into one object in memory.
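For instance, a minimal sketch built on the layout above (paths and ids are illustrative):
```python
import json
from pathlib import Path

DATA_DIR = Path("players")  # the players folder from the layout above

def load_player(player_id):
    """Load a single player's file, keeping only that player in memory."""
    with open(DATA_DIR / f"{player_id}.json") as f:
        return json.load(f)

def load_all_players():
    """Concatenate every per-player file into one dict keyed by id."""
    return {p.stem: json.loads(p.read_text()) for p in DATA_DIR.glob("*.json")}
```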
I have a Python program that runs in a loop, downloading 20k RSS feeds using feedparser and inserting the feed data into an RDBMS.
I have observed that it starts at 20-30 feeds a minute and gradually slows down. After a couple of hours it is down to 4-5 feeds an hour. If I kill the program and restart it from where it left off, the throughput is back to 20-30 feeds a minute.
It is certainly not MySQL that is slowing down.
What could be potential issues with the program?
In all likelihood the issue has to do with memory. You are probably holding the feeds in memory or somehow accumulating memory that isn't getting garbage collected. To diagnose:
Look at the size of your process (Task Manager on Windows, top on Unix/Linux) and monitor it as it grows while feeds are processed.
Then use a memory profiler to figure out what exactly is consuming the memory.
Once you have found that, you can restructure the code accordingly.
A few tips:
Do an explicit garbage-collection call (gc.collect()) after clearing any unused data structures
Use a multiprocessing scheme where you spawn multiple processes that each handle a smaller number of feeds (see the sketch after this list)
Maybe move to a 64-bit system if you are using a 32-bit one
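Here is a rough sketch of the multiprocessing tip (save_entries and load_feed_urls are hypothetical stand-ins for your RDBMS insert and your feed list):
```python
import gc
from multiprocessing import Pool

import feedparser  # third-party: pip install feedparser

def process_feed(url):
    # Parse one feed and hand its entries to the database layer; return
    # nothing so no feed data accumulates in the parent process.
    parsed = feedparser.parse(url)
    save_entries(parsed.entries)  # hypothetical RDBMS insert helper
    del parsed
    gc.collect()  # explicit collection after dropping the reference

if __name__ == "__main__":
    feed_urls = load_feed_urls()  # hypothetical: returns the 20k feed URLs
    # maxtasksperchild recycles each worker after 100 feeds, so any
    # leaked memory is handed back to the OS instead of accumulating.
    with Pool(processes=4, maxtasksperchild=100) as pool:
        pool.map(process_feed, feed_urls)
```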
Some suggestions for memory profilers:
https://pypi.python.org/pypi/memory_profiler
This one is quite good and the decorators are helpful
https://stackoverflow.com/a/110826/559095
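For example, memory_profiler's decorator can be used like this (the loop body is a stand-in for real feed parsing):
```python
# pip install memory_profiler; run with: python -m memory_profiler script.py
from memory_profiler import profile

@profile  # prints line-by-line memory usage when the function runs
def fetch_feeds(urls):
    results = []
    for url in urls:
        results.append(url.upper())  # stand-in for real feed parsing
    return results

if __name__ == "__main__":
    fetch_feeds(["http://example.com/rss"] * 100000)
```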
As the title suggests, I'm interested in the best (perhaps the most Pythonic) way to structure a program that uses many global variables.
First of all, by "many", I mean some 30 variables (which may be dictionaries, floats or strings) that every module of my program needs to access. There seem to be two ways to do this:
define the "global" variables in separate modules
use an object-oriented approach
The advantage of the object-oriented approach is that I can have many instances of some main class initialized, and perhaps compare different values (results of some analysis, for example) later on.
I already have a program written, but basically it breaks down to one class with some 30 or so attributes. Although it works fine, I'm aware this is a pretty messy way to do this.
So, basically, if I use an OOP approach, I would perhaps need to break my main class down into a few subclasses, each of which stores specific, logically related variables.
Any suggestions are welcome.
P.S. Just to be concrete about what I'm trying to do: I have a FEM-solver which needs to store structure info, element and node data, analysis result data, etc. So, I'm dealing with a lot of data types most of which are connected in some way.
Unfortunately, as was hinted at in the comments, there is no "Pythonic" way to do this. Having a large number of global constants is just fine - many programs and libraries do this. But in the comments, you've specified that all of your globals are being modified.
You need to take your program's architecture back to the drawing board. Rethink the relationships between your program's entities (functions, classes, modules, etc). There has to be a better way to organize it.
And by the way, it also sounds like you're getting close to the God Object antipattern. Use some of the advice in this SO question to refactor your massive class that has its fingers all over your program.
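As a rough sketch of what that refactoring might look like for the FEM case (all class and attribute names here are made up), logically related state can be grouped into small classes instead of one 30-attribute monolith:
```python
from dataclasses import dataclass, field

@dataclass
class Mesh:
    # structure info: node coordinates and element connectivity
    nodes: dict = field(default_factory=dict)
    elements: dict = field(default_factory=dict)

@dataclass
class AnalysisResult:
    displacements: dict = field(default_factory=dict)
    stresses: dict = field(default_factory=dict)

@dataclass
class Model:
    # composes the pieces; several analyses become several Model instances
    mesh: Mesh = field(default_factory=Mesh)
    result: AnalysisResult = field(default_factory=AnalysisResult)

model_a, model_b = Model(), Model()  # compare results later via model_a.result
```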
Sometimes, when looking at Python code examples, I'll come across one where the whole program is contained within its own class, and almost every function of the program is actually a method of that class apart from a 'main' function.
Because it's a fairly new concept to me, I can't easily find an example even though I've seen it before, so I hope someone understands what I am referring to.
I know how classes can be used outside of the rest of a program's functions, but what is the advantage of using them in this way compared with having functions on their own?
Also, can/should a separate module with no function calls be structured using a class in this way?
A module is preferred when it is a collection of pure functions, i.e. there is no shared state such as module-level variables. A big class is often used when multiple functions operate on shared state.
In Python scripts, you will often see the pattern of the main function being just the instantiation of a class and a call to one of its methods, e.g. youtube-dl. This is done for various reasons:
You can instantiate multiple objects without mixing state, which makes it easier to be thread-safe.
Classes can be inherited from or composed (e.g. see BaseHTTPRequestHandler).
Classes have more features than modules, such as constructors, iteration support, etc.
In general, classes offer more power at the cost of slightly added complexity. Some people prefer functions for simplicity, especially for one-off scripts. The trade-off is up to the developer, and both are valid options in Python.
A program often has to maintain state and share resources between functions (command-line options, DB connection, etc.). When that's the case, a class is usually a better solution (with respect to readability, testability and overall maintainability) than passing the whole context to every function or (worse) using global state.
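A minimal sketch of that pattern (all names here are illustrative, not from any particular project):
```python
import argparse
import sqlite3

class App:
    """Holds the shared state (options, DB connection) every method needs."""

    def __init__(self, options):
        self.options = options
        self.db = sqlite3.connect(options.database)

    def run(self):
        # Methods share self.options and self.db instead of receiving the
        # whole context as parameters or reaching for globals.
        self.load()
        self.report()

    def load(self):
        self.rows = self.db.execute("SELECT 1").fetchall()  # placeholder query

    def report(self):
        print(len(self.rows), "rows loaded with", self.options)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--database", default="app.db")
    App(parser.parse_args()).run()

if __name__ == "__main__":
    main()
```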