I have a Python program that runs in a loop, downloading 20k RSS feeds with feedparser and inserting the feed data into an RDBMS.
I have observed that it starts at 20-30 feeds a minute and gradually slows down; after a couple of hours it is down to 4-5 feeds an hour. If I kill the program and restart it from where it left off, the throughput is back to 20-30 feeds a minute.
It is certainly not MySQL that is slowing down.
What could be the potential issues with the program?
In all likelihood the issue has to do with memory. You are probably holding the feeds in memory or otherwise accumulating objects that aren't being garbage collected. To diagnose:
Look at the size of your process (Task Manager on Windows, top on Unix/Linux) and watch how it grows as feeds are processed.
Then use a memory profiler to figure out what exactly is consuming the memory.
Once you have found that, you can change the code accordingly.
A few tips:
Do an explicit garbage collection call (gc.collect()) after clearing any large data structures you no longer need
Use a multiprocessing scheme where you spawn multiple processes that each handle a smaller number of feeds, as sketched after this list
Move to a 64-bit system if you are currently on 32-bit
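A minimal sketch of the multiprocessing idea, assuming the feeds are a list of URLs and that only the fields you need are kept; the function and parameter names here are illustrative, not from the original program:

```python
import gc
from multiprocessing import Pool

import feedparser


def parse_feed(url):
    """Parse one feed and return only the fields to be inserted into the RDBMS."""
    parsed = feedparser.parse(url)
    entries = [(e.get("title", ""), e.get("link", "")) for e in parsed.entries]
    del parsed       # drop the large parsed structure
    gc.collect()     # explicit collection once it is released
    return entries


def parse_all(urls, workers=4, tasks_per_worker=500):
    # maxtasksperchild recycles each worker periodically, so any memory a
    # worker accumulates is returned to the OS when it is replaced.
    with Pool(processes=workers, maxtasksperchild=tasks_per_worker) as pool:
        for entries in pool.imap_unordered(parse_feed, urls):
            yield entries   # do the database inserts here
```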
Some suggestions for a memory profiler:
https://pypi.python.org/pypi/memory_profiler
This one is quite good and its decorators are helpful; see the short example after these links
https://stackoverflow.com/a/110826/559095
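A hedged example of the decorator style from the memory_profiler package linked above (the feed URL and field names are placeholders):

```python
# Run with:  python -m memory_profiler profile_feed.py
import feedparser
from memory_profiler import profile


@profile
def process_feed(url):
    parsed = feedparser.parse(url)
    # Keep only small tuples instead of the whole parsed structure.
    return [(e.get("title", ""), e.get("link", "")) for e in parsed.entries]


if __name__ == "__main__":
    process_feed("https://example.com/feed.xml")
```

The line-by-line report shows which statements grow the process, which is usually enough to spot an accumulating list or cache.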
I have a few basic questions about Dask:
Is it correct that I have to use Futures when I want to use Dask for distributed computations (i.e. on a cluster)?
In that case, i.e. when working with Futures, are task graphs still the way to reason about computations? If yes, how do I create them?
How can I generally, i.e. no matter whether I am working with a Future or with a Delayed, get the dictionary associated with a task graph?
As an edit:
My use case is that I want to parallelize a for loop, either on my local machine or on a cluster (i.e. it should also work on a cluster).
As a second edit:
I think I am also somewhat unclear about the relation between Futures and delayed computations.
Thanks
1) Yup. If you're sending the data over a network, you need some way of asking the machine doing the computing how the number-crunching is coming along, and Futures represent more or less exactly that.
2) No. With Futures you're executing the functions eagerly - spinning up the computations as soon as you can, then waiting for the results to come back (from another thread/process locally, or from some remote machine you've offloaded the job onto). The relevant abstraction here is a queue (a priority queue, specifically).
3) For a Delayed instance you could do some_delayed.dask, or for an Array, Array.dask; optionally wrap the result in dict(). I don't know for sure whether it's reliably set up this way for every single API, though (I would assume so, but you know what they say about assuming...).
4) The simplest analogy would probably be: Delayed is essentially a fancy Python yield wrapper over a function, while Future is essentially a fancy async/await wrapper over a function. A short sketch contrasting the two styles follows below.
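A minimal sketch of the two styles for parallelizing a loop, assuming a placeholder function process(); the Client setup and names are illustrative:

```python
from dask import compute, delayed
from dask.distributed import Client


def process(x):
    return x * x


if __name__ == "__main__":
    # Client() starts a local cluster; pass a scheduler address for a real cluster.
    client = Client()

    # Lazy style: build a task graph first, then execute it.
    lazy = [delayed(process)(i) for i in range(10)]
    print(dict(lazy[0].dask))        # the dictionary behind one task graph
    results = compute(*lazy)

    # Eager style: submit work immediately and collect Futures.
    futures = [client.submit(process, i) for i in range(10)]
    results_eager = client.gather(futures)

    client.close()
```

Both paths give the same results; Delayed lets you inspect the graph before running it, while Futures start work as soon as they are submitted.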
I have a C++ game that sends a Python-SocketIO request to a server; the server loads the requested JSON data into memory for reference and then sends portions of it to the client as necessary. Most of the previous answers here assume the server has to repeatedly search a database, whereas in this case all of the data is held in memory after the first request and released after the client disconnects.
I don't want a large spike in memory usage whenever a new client joins, yet most of what I have read argues against using small files (50-100 kB absolute maximum) and in favour of large files, which would cause exactly the memory usage I'm trying to avoid.
My question is this: would it still be beneficial to use one large file, or should I use the smaller files, both from an organization standpoint and from a performance one?
Is it better to have one large file or many smaller files for data storage?
Either can be better; each has its advantages and disadvantages, and which one wins depends on the details of the use case. It's quite possible that the best approach is something in between, such as a few medium-sized files.
Regarding performance, the most accurate way to find out is to try each option and measure.
You should split the data into multiple files if you're only accessing small parts of it; that keeps memory usage down. For example, if you're only accessing, say, a single player, your folder structure could look like this:
players
- 0.json
- 1.json
other
- 0.json
Then you could write a function that loads just the player with a certain id (0, 1, etc.), as in the sketch below.
If you're planning on accessing all of the players, other objects, and more at once, keep the same folder structure and simply concatenate the parts you need into one object in memory.
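A minimal sketch of both access patterns, assuming the players/<id>.json layout above; the directory name and ids are placeholders:

```python
import json
from pathlib import Path

PLAYERS_DIR = Path("players")


def load_player(player_id):
    """Load one player's file without touching the rest of the dataset."""
    with open(PLAYERS_DIR / f"{player_id}.json") as f:
        return json.load(f)


def load_all_players():
    """Concatenate every per-player file into one dict keyed by player id."""
    return {path.stem: json.loads(path.read_text())
            for path in sorted(PLAYERS_DIR.glob("*.json"))}
```

With this layout the per-client memory cost is proportional to what that client actually requests, rather than to the size of one monolithic file.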
I would like to ask a question about transferring data from C++ to Python in real time.
My situation is:
1) I am generating data every 1 ms in C++,
2) I would like to accumulate the data for a certain amount of time and build a dataset,
3) I would like to run a machine learning algorithm written in Python on that dataset without shutting down the C++ program.
So far, I have thought about several options:
Option 1) Save the dataset as a txt file and read it from Python. But this seems too slow because of the file I/O involved.
Option 2) Use IPC such as ZeroMQ. I am quite new to IPC, so I am not sure it is what I really need. Also, among the various mechanisms (mmap, shared memory, message queues, ...), I do not know which one is the best fit for me.
Option 3) Use UDP. From my understanding, UDP can deliver the same datagram twice, drop datagrams, or deliver them out of order (e.g. a previous time step's data arrives later).
Are there any approaches you would recommend I research and study?
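To make option 2 concrete, here is a minimal sketch of what the Python side could look like with ZeroMQ; the endpoint, batch size, and message layout are assumptions, and the C++ side would use a matching PUSH socket and pack each sample the same way:

```python
import struct

import zmq


def receive_batches(endpoint="tcp://127.0.0.1:5555", batch_size=1000):
    """Yield fixed-size batches of samples pushed by the C++ producer."""
    context = zmq.Context()
    socket = context.socket(zmq.PULL)
    socket.bind(endpoint)

    batch = []
    while True:
        message = socket.recv()                    # one sample per message
        (value,) = struct.unpack("<d", message)    # assume one little-endian double
        batch.append(value)
        if len(batch) >= batch_size:               # hand a full dataset to the ML code
            yield batch
            batch = []
```

ZeroMQ over TCP (or its ipc:// transport on the same machine) preserves message boundaries and ordering, which avoids the duplication and reordering concerns raised for UDP in option 3.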
I've got a very nice machine to play with over at Azure. It has 16 cores and memory up the wazoo.
Running on it is an app I wrote that does a LOT of crunching: basically dividing about 100,000 text documents into ngrams and creating a document index.
I recently moved this app over from a pretty small AWS instance with about 1/20th of the processing power. There I couldn't even do 40,000 records without running out of memory, and it took about 30 minutes to index 30,000 records.
So now, even with all that processing power, I'm still sitting here waiting 30 minutes to crunch 30,000 records. Is that just the nature of this type of process, or am I not really taking advantage of my resources properly?
EDIT (THE CODE EXPLANATION):
The part of the app that takes the most time loops through the NLTK library looking for named entities in the text of each document. I am running the 100k documents through a process very similar to this example:
https://gist.github.com/onyxfish/322906
Some stats:
- Windows Azure VM
- Python 2.7 (32-bit) (Enthought Canopy environment)
- NumPy 1.7.0
If your process uses only 0.3% of CPU time yet takes a long time to execute, it clearly isn't CPU-bound.
If I had to guess from the limited information provided, I'd guess the code is I/O-bound. Write a little program that simply reads the 100,000 files and time it in the exact same execution environment. If that too is slow, consider merging the many files into a few; it should improve things considerably.
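A hedged sketch of that measurement: a pass that only reads the documents, with no NLTK work at all (the glob pattern is a placeholder for wherever the documents live):

```python
import glob
import time


def time_read_only(pattern="documents/*.txt"):
    """Time a pure read pass over the corpus to see whether I/O dominates."""
    start = time.time()
    total_bytes = 0
    for path in glob.glob(pattern):
        with open(path, "rb") as f:
            total_bytes += len(f.read())
    elapsed = time.time() - start
    print("Read %d bytes in %.1f seconds" % (total_bytes, elapsed))


if __name__ == "__main__":
    time_read_only()
```

If this alone takes a large fraction of the 30 minutes, the bottleneck is the disk rather than the 16 cores, and consolidating the files (or moving them to faster storage) is where the win is.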