What is the proper way of using connection pools in a streaming pyspark application ?
I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand the proper way is to use a singleton for scala/java. Is this possible in python ? A small code example would be greatly appreciated. I believe creating a connection perPartition will be very inefficient for a streaming application.
Long story short connection pools will be less useful in Python than on JVM due to PySpark architecture. Unlike its Scala counterpart Python executors use separate processes. It means there is no shared state between executors and since by default each partition is processed sequentially you can have only one active connection per interpreter.
Of course it can be still useful to maintain connections between batches. To achieve that you'll need two things:
spark.python.worker.reuse has to be set to true.
A way to reference an object between different calls.
The first one is pretty obvious and the second one is not really Spark specific. You can for example use module singleton (you'll find Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or a Borg pattern.
Related
I am planing to setup a small proxy service for a remote sensor, that only accepts one connection. I have a temporary solution and I am now designing a more robust version, and therefore dived deeper into the python multiprocessing module.
I have written a couple of systems in python using a main process, which spawns subprocesses using the multiprocessing module and used multiprocessing.Queue to communicate between them. This works quite well and some of theses programs/scripts are doing their job in a production environment.
The new case is slightly different since it uses 2+n processes:
One data-collector, that reads data from the sensor (at 100Hz) and every once in a while receives short ASCII strings as command
One main-server, that binds to a socket and listens, for new connections and spawns...
n child-servers, that handle clients who want to have the sensor data
while communication from the child servers to the data collector seems pretty straight forward using a multiprocessing.Queue which manages a n:1 connection well enough, I have problems with the other way. I can't use a queue for that as well, because all child-servers need to get all the data the sensor produces, while they are active. At least I haven't found a way to configure a Queue to mimic that behaviour, as get takes the top most out of the Queue by design.
I looked into shared memory already, which massively increases the management overhead, since as far as I understand it while using it, I would basically need to implement a streaming buffer myself.
The only safe way I see right now, is using a redis server and messages queues, but I am a bit hesitant, since that would need more infrastructure than I like.
Is there a pure python internal way?
maybe You can use MQTT for that ?
You did not clearly specify, but sounds like observer pattern -
or do You want the clients to poll each time they need data ?
It depends which delays / data rate / jitter etc. You can accept.
after You provided the information :
The whole setup runs on one machine in one process space. What I would like to have, is a way without going through a third party process
I would suggest to check for observer pattern.
More informations can be found for example:
https://www.youtube.com/watch?v=_BpmfnqjgzQ&t=1882s
and
https://refactoring.guru/design-patterns/observer/python/example
and
https://www.protechtraining.com/blog/post/tutorial-the-observer-pattern-in-python-879
and
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Observer.html
Your Server should fork for each new connection and register with the observer, and will be therefore informed about every change.
I have a singleton in-memory data-structure inside my Django project (some kind of kd-tree that needs to get accessed all across the project).
For those that don't know Django, I believe the same issue would appear with a regular Python code.
I know it's evil (Singleton), and I'm looking for better ways to implement that, but my question here is related to another topic:
I am instantiating the singleton inside my code by calling
Singleton.instance() and it gives me the object correctly, it then stays at some place in the memory inside my ./manage.py runserver.
The problem is that I am making some asynchronous processing with Celery on this same Singleton data-structure (such as reconstructing the kd-tree).
BUT when launching a Celery worker, it runs the code within a different process, and therefore has a different memory space, which means that it works on a totally different instance of the Singleton.
What would be the best Design Pattern to this issue? I have thought of doing all the processing related to my data-structure inside the Django project (without using Celery) but what I liked very much about Celery is that the processing required on the data-structure can take a long-time (around 30 seconds) and it needs to handle concurrency nicely (there could be several requests at the same time to reconstruct the Kd-Tree).
I would be very glad to have some insights for this since I'm not making any progress those last 3 days.
Thanks a lot.
I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a python user who wants to make programs that either/both:
Are too large to
Take too long to
do on a single machine, and process them on multiple machines. I am familiar with the (single-machine) multiprocessing package in python, and I write mapreduce style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big data standard, but Celery also looks well supported; I appreciate that it isn't java (the streaming API python has to use for hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.
They are the same in that both can solve the problem that you describe (map-reduce). They are different in that Hadoop is entirely build to solve only that usecase and Celey/RabbitMQ is build to facilitate Task execution on different nodes using message passing. Celery also supports different usecases.
Hadoop is solving the map-reduce problem by having a large and special filesystem from which the mapper takes its data, sends it to a bunch of map nodes and reduces it to that filesystem. This has the advantage that it is really fast in doing this. The downsides are that it only operates on text based data input, Python is not really supported and that if you can't do (slightly) different usecases.
Celery is a message based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is python based, that you can stitch tasks together in a custom workflow. Disadvantages are its reliance on single broker/result backend and its setup time.
So if you have a couple of Gb's worth of logfiles and don't care to write in Java and have some servers to spare that are exclusively used to run Hadoop, use that. If you want flexibility in running workflowed tasks use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ and it takes distributed messaging/execution to the next level by strangely going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sounds like "what good is a thin wrapper around a socket" it is actually at the right level of abstraction. Right now at our company we are factoring out all our celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also that broker in the middle that has to handle all traffic was becoming to much of a bottleneck.
Resume:
Count the occurrences of "the" in a book with as less programming as possible and lots of setup/config time: Hadoop
Create atomic Tasks and be able to have them work together with not to much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets
I don't immediately care about fifo or filo options, but it might be nice in the future..
What I'm looking for a is a nice fast simple way to store (at most a gig of data or tens of millions of entries) on disk that can be get and put by multiple processes. The entries are just simple 40 byte strings, not python objects. Don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.
Try using Celery. It's not pure python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better then using files or database in the long run.
I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes although I don't know if that is possible from the Python bindings. About multiple processes writing to the db I found this thread.
This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a serial of
sqlite3-based queues. The goals is to achieve following requirements:
Disk-based: each queued item should be stored in disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue use pickle object serialization module to
support object instances. Most built-in type, like int, dict, list are
able to be persisted by persist-queue directly, to support customized
objects, please refer to Pickling and unpickling extension
types(Python2) and Pickling Class Instances(Python3)
Using files is not working?...
Use a journaling file system to recover from power interruptions. That's their purpose.
I noticed that sqlite3 isn´t really capable nor reliable when i use it inside a multiprocessing enviroment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: Sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together or is there a reliable way to handle multi-connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than
one thread.
On some operating systems, a database connection should
always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information
from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each
process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each others toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions, in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if then capability so that you could SELECT a record, check that a field is equal to X, and only then, UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/