I am trying to use to_parquet but it crashes my system due to memory error. I've discovered it's trying to save 100-300 of my partitions at a time.
Is it possible to somehow specify that I want fewer partitions processed at a time in order to prevent a crash due to using up all the RAM?
Dask will use as many threads at a time as you give it. The tasks may be "processing" but that just means that they have been sent to a worker, which will handle them when it has a spare thread.
I am trying to use to_parquet but it crashes my system due to memory error.
However it could still be that your partitions are large enough that you can't fit several of them in memory at once. In this case you might want to select a smaller partition size. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions for more information.
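The idea of bounding how many partitions are in memory at once can be sketched with the standard library alone (in Dask itself you would instead repartition to smaller partitions, or run with fewer workers/threads). The `write_partition` function and the toy `partitions` list below are stand-ins, not Dask APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def write_partition(part):
    # Stand-in for writing one partition to parquet; returns a
    # small "result" so the sketch is checkable.
    return sum(part)

partitions = [[i, i + 1] for i in range(6)]

# Bound concurrency: at most two partitions are in flight at once,
# so peak memory stays near two partitions' worth instead of all six.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(write_partition, partitions))
```

The same principle applies to Dask's scheduler: fewer concurrent threads means fewer partitions materialized simultaneously, at the cost of throughput.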
I develop a highly loaded application that reads data from DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use Python, Spark and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for data writing.
In the beginning, I observe some write throttled requests and write capacity is only 4k but then upscaling takes place, and write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading and writing?
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I didn't manage to find a clear answer so far.
That's a lot of questions in one question, you'll get a high level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens, it's a managed service that does this in the background.
The number of partitions only ever grows.
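Those limits make the initial throttling easy to estimate with back-of-the-envelope arithmetic. The item size below is an assumption (1 WCU covers a write of up to 1 KB), just to show how the per-partition limits translate into a minimum partition count:

```python
wcu_target = 40_000              # the write throughput the question targets
records = 100_000_000
item_size_kb = 1                 # assumed average item size (1 WCU = 1 KB write)

# Each partition handles at most 1000 WCU, so sustained 40k WCU needs:
min_parts_by_wcu = -(-wcu_target // 1_000)       # ceiling division

# Each partition stores at most 10 GB:
data_gb = records * item_size_kb / (1024 * 1024)
min_parts_by_storage = -(-round(data_gb) // 10)  # ceiling division
```

Until the table has split its way up to enough partitions, writes above the current capacity come back throttled, which matches the behaviour described in the question.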
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading and writing?
The scaling mechanism is the same for read and write activity, but the scaling point differs as mentioned above. In an on-demand table AutoScaling is not involved, that's only for tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I assume the throughput you set is a budget that Spark can use for writing; it won't have much of an impact on on-demand tables. It's information the client can use internally to decide how much parallelization is possible.
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I didn't manage to find a clear answer so far.
If the client uses BatchWriteItem, it will get a list of items that couldn't be written for each request and can enqueue them again. Exponential backoff may be involved but that is an implementation detail. It's not magic, you just have to keep track of which items you've successfully written and enqueue those that you haven't again until the "to-write" queue is empty.
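That "re-enqueue until the to-write queue is empty" loop can be sketched without AWS at all. The `fake_batch_write` function below is a stand-in for DynamoDB's BatchWriteItem (which really does cap requests at 25 items and return the rejected ones as unprocessed); the throttling rate is invented for the demo:

```python
import random

random.seed(42)
stored = set()

def fake_batch_write(batch):
    """Stand-in for BatchWriteItem: writes most items but returns
    some as 'unprocessed', the way a throttled request would."""
    unprocessed = []
    for item in batch:
        if random.random() < 0.3:        # pretend ~30% get throttled
            unprocessed.append(item)
        else:
            stored.add(item)
    return unprocessed

def write_all(items, batch_size=25):     # 25 is the real per-request limit
    queue = list(items)
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        # Re-enqueue whatever came back unprocessed; a real client
        # would also back off exponentially before retrying.
        queue.extend(fake_batch_write(batch))

write_all(range(100))
```

The loop terminates once every item has been accepted, which is exactly why "eventually all the data is successfully uploaded" despite the throttling along the way.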
I've been slowly learning to use the multiprocessing library in Python these last few days, and I've come to a point where I'm asking myself this question, and I can't find an answer to this.
I understand that the answer might vary depending on the application, so I'll explain what my application is.
I've created a scheduler in my main process that controls when multiple processes execute (these processes are spawned earlier, loop continuously, and execute code when my scheduler raises a flag in a shared Value). Using counters in my scheduler, I can have multiple processes executing code at different frequencies (in the 100-400 Hz range), and they are all perfectly synchronized.
For example, one process executes a dynamic model of a quadcopter (ODE) at 400 Hz and updates the quadcopter's state. My other processes (command generation and trajectory generation) run at lower frequencies (200 Hz and 100 Hz), but require the updated state. I currently see 2 methods of doing this:
With Pipes: This requires separate Pipes for the dynamics/control and dynamics/trajectory connections. Furthermore, I need the control and trajectory processes to use the latest calculated quadcopter's state, so I need to flush the Pipes until the last value in them. This works, but doesn't look very clean.
With a shared Value/Array: I would only need one Array for the state; my dynamics process would write to it, while my other processes would read from it. I would probably have to implement locks (can I read a shared Value/Array from 2 processes at the same time without a lock?). This hasn't been tested yet, but would probably be cleaner.
I've read around that it is a bad practice to use shared memory too much (why is that?). Yes, I'll be updating it at 400 Hz and reading it at 200 and 100 Hz, but it's not going to be such a large array (10-ish floats or doubles). However, I've also read that shared memory is faster than Pipes/Queues, and I would like to prioritize speed in my code, if it's not too much of an issue to use shared memory.
Mind you, I'll have to send generated commands to my dynamics process (another 5-ish floats), and generated desired states to my control process (another 10-ish floats), so that's either more shared Arrays, or more Pipes.
So I was wondering, for my application, what are the pros and cons of both methods. Thanks!
We are using the most recent Spark build. We have as input a very large list of tuples (800 million). We run our PySpark program using Docker containers with a master and multiple worker nodes. A driver is used to run the program and connect to the master.
When running the program, at the line sc.parallelize(tuplelist) the program either quits with a Java heap error message or quits without any error at all. We do not use any Hadoop HDFS layer and no YARN.
We have so far considered the possible factors as mentioned in these SO postings:
Spark java.lang.OutOfMemoryError : Java Heap space
Spark java.lang.OutOfMemoryError: Java heap space (also the list of possible solutions by samthebest did not help to solve the issue)
At this point we have the following question(s):
How do we know how many partitions we should use for the sc.parallelize step? What is here a good rule-of-thumb?
Do you know any (common?) mistake which may lead to the observed behavior?
How do we know how many partitions we should use for the sc.parallelize step? What is here a good rule-of-thumb?
Ans: Multiple factors go into deciding the number of partitions.
1) Having the number of partitions at 3-4x the number of your cores is often a good choice (assuming each partition takes more than a few seconds to process).
2) Partitions shouldn't be too small or too large; 128 MB to 256 MB per partition is usually good enough.
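Those two rules of thumb can be combined into a quick calculation. The cluster and data sizes below are assumptions for illustration; the resulting number would be passed as the `numSlices` argument of `sc.parallelize`:

```python
total_cores = 4 * 16             # assumed: 4 executors with 16 cores each
data_size_mb = 200_000           # assumed: rough in-memory size of tuplelist
target_partition_mb = 128        # keep partitions in the 128-256 MB range

by_cores = 3 * total_cores                       # rule 1: 3-4x the core count
by_size = data_size_mb // target_partition_mb    # rule 2: cap partition size
num_slices = max(by_cores, by_size)

# then: rdd = sc.parallelize(tuplelist, numSlices=num_slices)
```

Taking the max of the two estimates satisfies both constraints: enough partitions to keep every core busy, and no partition so large that it blows the heap during serialization.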
Do you know any (common?) mistake which may lead to the observed behavior?
Can you check the executor memory and disk available for a job of this size?
If you can specify more details about the job, e.g. number of cores, executor memory, number of executors, and disk available, it will be easier to point out the issue.
I have multiprocessing code wherein each process does a disk write (pickling data), and the resulting pickle files can be upwards of 50 MB (and sometimes even more than 1 GB depending on what I'm doing). Also, different processes are not writing to the same file, each process writes a separate file (or set of files).
Would it be a good idea to implement a lock around disk writes so that only one process is writing to the disk at a time? Or would it be best to just let the operating system sort it out even if that means 4 processes may be trying to write 1 GB to the disk at the same time?
As long as the processes aren't fighting over the same file, let the OS sort it out. That's its job.
Unless your processes try and dump their data in one big write, the OS is in a better position to schedule disk writes.
If you do use one big write, you might try to partition it into smaller chunks. That might give the OS a better chance of handling them.
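A minimal sketch of that chunking idea for the pickling workload described in the question (the chunk size and file name are arbitrary choices, not recommendations):

```python
import os
import pickle
import tempfile

def dump_in_chunks(obj, path, chunk_size=8 * 1024 * 1024):
    # Serialize once, then hand the OS moderately sized writes
    # instead of a single giant write() call.
    data = pickle.dumps(obj)
    with open(path, "wb") as f:
        for start in range(0, len(data), chunk_size):
            f.write(data[start:start + chunk_size])

path = os.path.join(tempfile.mkdtemp(), "out.pkl")
dump_in_chunks(list(range(100_000)), path)

# Round-trip check: the chunked file loads back identically.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Note that `pickle.dump(obj, f)` already writes incrementally in many cases; this sketch only matters when you would otherwise hand the kernel one multi-gigabyte buffer in a single call.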
Of course you will hit a limit somewhere. Your program might be CPU-bound, memory-bound or disk-bound. It might hit different limits depending on the input or load.
But unless you've got evidence that you're constantly disk-bound and you've got a good idea how to solve that, I'd say don't bother. Because the days when a write system call actually meant that the data was directly sent to disk are long gone.
Most operating systems these days use unallocated RAM as a disk cache. And HDDs have built-in caches as well. Unless you disable both of these (which will give you a huge performance hit) there is precious little connection between your program completing a write and the data actually hitting the platters or flash.
You might consider using mmap (memory-mapped files, if your OS supports it), and letting the OS's virtual memory do the work for you. See e.g. the architect notes for the Varnish cache.
I am using ubuntu. I have some management commands which when run, does lots of database manipulations, so it takes nearly 15min.
My system monitor shows that my system has 4 CPUs and 6 GB of RAM. But this process is not utilising all the CPUs; I think it is using only one of them, and very little RAM. I think if I can make it use all the CPUs and most of the RAM, the process will finish in much less time.
I tried renice, setting the priority to -18 (meaning very high), but it is still slow.
Details:
It's a Python script with a loop count of nearly 10,000, and there are nearly ten such loops. On every iteration, it saves to a Postgres database.
If you are looking to make this application run across multiple CPUs then there are a number of things you can try depending on your setup.
The most obvious thing that comes to mind is making the application make use of threads and multiprocesses. This will allow the application to "do more" at once. Obviously the issue you might have here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiprocesses in the first place).
Secondly, make sure you are not opening and closing lots of database connections, ensure your application can hold the connection open for as long as it needs.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings then things are going to be slow.
Fourthly, do everything you can in SQL, leaving little manipulation to Python; SQL is tremendously quick at data manipulation if you let it. As soon as you start taking data out of the database and into code, things slow down big time.
Fifthly, make use of stored procedures which can be cached and optimized internally within the database. These can be a lot quicker than application built queries which cannot be optimized as easily.
Sixthly, don't save on each iteration of the program. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one batch. This will reduce the amount of IO on each iteration and speed up the process massively.
Django does support the use of a bulk update method, there was also a question on stackoverflow a while back about saving multiple django objects at once.
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
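In Django itself the batching advice above is `Model.objects.bulk_create(...)`. Since the underlying win is the same in any SQL stack, here is a self-contained sqlite3 sketch of the idea (table name, batch size and row count are arbitrary demo values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (loop INTEGER, value REAL)")

rows = [(i, i * 0.1) for i in range(10_000)]
batch_size = 1_000

# One executemany per batch instead of one INSERT per loop iteration:
for start in range(0, len(rows), batch_size):
    conn.executemany("INSERT INTO results VALUES (?, ?)",
                     rows[start:start + batch_size])
conn.commit()   # a single commit at the end, not 10,000 of them

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

With Postgres the effect is even larger, because each per-row commit is a network round trip plus an fsync; batching amortizes both.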
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.