I am a newbie to dask and distributed. I want to change the dask scheduler from localhost to the address of another server. I couldn't find how to do it on the internet.
Could you help me, please?
Thanks.
I suppose you can just pass the IP of a server into a worker constructor, like it's done in the docs: http://docs.dask.org/en/latest/setup/python-advanced.html?highlight=scheduler#worker
w = Worker('tcp://{your_ip_here}:8786')
From the user session's point of view, you connect to the remote Dask scheduler using the client:
client = dask.distributed.Client('tcp://machine.ip:port')
where you need to fill in the machine's address and port as appropriate. You should not be constructing a Worker in your session; I am assuming that your scheduler already has some workers set up to talk to it.
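For example, a minimal sketch, with a placeholder address standing in for your scheduler:
import dask.distributed

# connect to the remote scheduler; replace the address/port with your own
client = dask.distributed.Client('tcp://192.168.1.100:8786')

# work submitted through this client now runs on that cluster's workers
future = client.submit(sum, [1, 2, 3])
print(future.result())   # 6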
Yes, there are also ways to include the default address in config files, including having the scheduler write it for you on start-up, but the XML file you mention is unlikely to be something directly read by Dask. On the other hand, whoever designed your system may have added their own config layer.
I am fairly new to MongoDB, and I am wondering how I can establish multiple connections to a single Mongo instance without specifying ports or making a new config file for each user. I am running the Mongo instance in a Singularity container on a remote server.
Here is my sample config file:
# mongod.conf
# for documentation of all options, see:
#   https://docs.mongodb.com/manual/reference/configuration-options/

# where to write logging data for debugging and such.
systemLog:
  destination: file
  logAppend: true
  path: /path-to-log/

# network interfaces
net:
  port: 27017
  bindIp: 127.0.0.1
  maxIncomingConnections: 65536

# security
security:
  authorization: 'enabled'
Do I need to use a replica set? If so, can someone explain the concept behind a replica set?
Do I need to change my config file? If so, what changes do I need to make to allow for multiple connections?
Here is my code that I use to connect to the server (leaving out import statements for clarity):
PWD = "/path-to-singularity-container/"
os.chdir(PWD)
self.p = subprocess.Popen(f"singularity run --bind {PWD}/data:/data/db mongo.sif --auth --config {PWD}/mongod.conf", shell=True, preexec_fn=os.setpgrp)
connection_string = "mongodb://user:password@127.0.0.1:27017/"
client = pymongo.MongoClient(connection_string, serverSelectionTimeoutMS=60_000)
EDIT: I am trying to have multiple people connect to MongoDB using pymongo at the same time, given the same connection string. I am not sure how I can achieve this without giving each user a separate config file.
Thank you for your help!
You can increase the value of ulimit; mongod tracks each incoming connection with a file descriptor and a thread.
You can go through the link below, which explains each ulimit parameter and its recommended value.
https://docs.mongodb.com/manual/reference/ulimit/
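To see how close you are to those limits, you can check mongod's current connection counts from pymongo; a rough sketch, where the credentials are placeholders:
import pymongo

# hypothetical credentials; any user allowed to run serverStatus will do
client = pymongo.MongoClient("mongodb://user:password@127.0.0.1:27017/")
status = client.admin.command("serverStatus")
print(status["connections"])   # e.g. {'current': 12, 'available': 51188, ...}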
For high availability: if you don't want downtime in your environment, then you need an HA solution, which means a three-node replica set that can tolerate one node being down at a time.
If the primary node goes down, there will be an internal vote and one of the remaining nodes will be promoted to primary within seconds, so your application will be minimally impacted. Another benefit: if a node crashes, you still have another copy of the data.
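Clients connect to a replica set by listing its members in the connection string; the driver finds the primary and fails over automatically. A rough sketch, where the hostnames and set name are placeholders:
import pymongo

# hypothetical hosts and replica set name
client = pymongo.MongoClient(
    "mongodb://user:password@host1:27017,host2:27017,host3:27017/"
    "?replicaSet=rs0&authSource=admin"
)
print(client.primary)   # (host, port) of the current primary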
Hope this answers your question.
No special work is required; you simply create a client and execute queries.
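Each user can construct their own client with the same connection string; mongod accepts many concurrent client connections on the single configured port. A rough sketch, reusing the connection string from the question:
import pymongo

CONNECTION_STRING = "mongodb://user:password@127.0.0.1:27017/"

# two independent clients (e.g. two different users or scripts)
# connected to the same mongod at the same time
client_a = pymongo.MongoClient(CONNECTION_STRING)
client_b = pymongo.MongoClient(CONNECTION_STRING)
print(client_a.admin.command("ping"))
print(client_b.admin.command("ping"))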
I’ve been using the Luigi visualizer for pipelining my Python code.
Now I’ve started using an AWS instance, and want to access the visualizer from my own machine.
Any ideas on how I could do that?
We had the very same problem today on GCP, and solved it with the following steps:
setting firewall rules to allow incoming TCP connections on the port used by the service (which by default is 8082);
installing an apache2 server on the instance with a site.conf configuration that resolves incoming requests on ip-of-instance:8082.
That's it. Hope this can help.
Good question, and I'm amazed I can't find a duplicate on StackOverflow. You broadly need to do two things:
Ensure the luigi webserver is hosting content correctly. You can probably do this via site.conf, or via luigi's default-scheduler-host property. This corresponds to @PierluigiPuce's second point.
Correctly expose and secure your EC2 instance. This is a VPC exercise (see docs) and an entire area to learn, but in short you need to configure your VPC so that valid requests are routed to the instance on the correct port and invalid requests are blocked. This corresponds to @PierluigiPuce's first point.
Your primary consideration is whether it is okay for this to be public facing. Probably not. Then you can secure the instance via IP address ranges, via a VPN, or even via SSH port forwarding through a jump host.
Having it completely open is the easiest and worst solution. Putting the instance in a public subnet, and restricting access based on IP address, is probably the second easiest solution, and might be a reasonable compromise for you.
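Once the scheduler's port is reachable from your machine (or tunneled to localhost over SSH), you can also point local runs at that remote scheduler so they show up in its visualizer. A rough sketch with luigi's Python API, assuming the standard scheduler_host/scheduler_port parameters and a placeholder hostname:
import luigi

class Ping(luigi.Task):
    """Trivial task used only to check the remote scheduler connection."""
    def output(self):
        return luigi.LocalTarget("ping.txt")
    def run(self):
        with self.output().open("w") as f:
            f.write("ok")

# host/port are placeholders; point them at your instance
# (or at localhost if you forward port 8082 over SSH)
luigi.build([Ping()], workers=1,
            scheduler_host="ip-of-instance", scheduler_port=8082)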
I have a few servers that require executing commands on other servers. For example a Bitbucket Server post receive hook executing a git pull on another server. Another example is the CI server pulling a new docker image and restarting an instance on another server.
I would normally use ssh for this, creating a user/group specifically for the job with limited permission.
A few downsides with ssh:
A synchronous ssh call means a git push will have to wait until the remote command completes.
If a host is not contactable for whatever reason, the ssh command will fail.
Maintaining keys, users, and sudoers permissions can become unwieldy.
A few possibilities:
Find an open-source, out-of-the-box solution (I have tried, with no luck so far)
Set up a REST API on each server that accepts calls with some type of authentication, e.g. POST https://server/git/pull/?apikey=a1b2c3
Set up Python/Celery to execute tasks on a different queue for each host. This means a Celery worker on each server that can execute commands, and possibly a service that accepts REST API calls and converts them to Celery tasks.
Is there a nice solution to this problem?
Defining the problem
You want to be able to trigger a remote task without waiting for it to complete.
This can be achieved in any number of ways, including with SSH. You can execute a remote command without waiting for it to complete by backgrounding it and redirecting all of its I/O streams, e.g. like this:
ssh user@host "nohup /usr/bin/foobar </dev/null >/dev/null 2>&1 &"
You want to be able to defer the task if the host is currently unavailable.
This requires a queuing/retry system of some kind. You will also need to decide whether the target hosts will be querying for messages ("pull") or whether messages will be sent to the target hosts from elsewhere ("push").
You want to simplify access control as much as possible.
There's no way to completely avoid this issue. One solution would be to put most of the authentication logic in a centralized task server. This splits the problem into two parts: configuring access rights in the task server, and configuring authentication between the task server and the target hosts.
Example solutions
Hosts attempt to start tasks over SSH, using the method above for asynchrony. If the host is unavailable, the task is written to a local file. A cron job periodically retries sending failed tasks. Access control via SSH keys. (A rough sketch of this approach follows the list.)
Hosts add tasks by writing commands to files on an SFTP server. Cron job on target hosts periodically checks for new commands and executes them if found. Access control managed via SSH keys on the SFTP server.
Hosts post tasks to REST API which adds them to queue. Celery daemon on each target host consumes from queue and executes tasks. Access managed primarily by credentials sent to the task queuing server.
Hosts post tasks to API which adds tasks to queue. Task consumer nodes pull tasks off the queue and send requests to API on target hosts. Authentication managed by cryptographic signature of sender appended to request, verified by task server on target host.
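A rough sketch of that first approach, where the spool-file path, hostname, and command are placeholders for your own:
import subprocess

SPOOL_FILE = "/var/spool/remote-tasks/failed.txt"  # hypothetical path

def fire_and_forget(host, command):
    """Start a remote command over SSH without waiting for it; if the
    host is unreachable, append the task to a local spool file."""
    remote = f"nohup {command} </dev/null >/dev/null 2>&1 &"
    result = subprocess.run(["ssh", "-o", "ConnectTimeout=5", host, remote],
                            capture_output=True)
    if result.returncode != 0:
        with open(SPOOL_FILE, "a") as spool:
            spool.write(f"{host}\t{command}\n")

def retry_spooled():
    """Run from cron: retry every spooled task; failures get re-appended."""
    try:
        with open(SPOOL_FILE) as spool:
            tasks = [line.rstrip("\n").split("\t", 1) for line in spool if line.strip()]
    except FileNotFoundError:
        return
    open(SPOOL_FILE, "w").close()   # clear the spool; failures are re-added
    for host, command in tasks:
        fire_and_forget(host, command)

# example: trigger a git pull on another machine (hypothetical host/command)
fire_and_forget("deploy@app-server", "git -C /srv/app pull")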
You can also look into tools that do some or all of the required functions out of the box. For example, some Google searching came up with Rundeck which seems to have some job scheduling capabilities and a REST API. You should also consider whether you can leverage any existing automated deployment or management tools already present in your system.
Conclusions
Ultimately, there's no single right answer to this question. It really depends on your particular needs. Ask yourself: How much time and effort do you want to spend creating this system? What about maintenance? How reliable does it need to be? How much does it need to scale? And so on, ad infinitum...
This question is related to this project.
I have to send device information in JSON format to a server. However, I'm concerned about what will happen if I can't connect to the remote server. I don't want to lose data, so I thought each datum could be placed in a queue, and a connection thread could work through the queue and send the data to the server. In my opinion this is a better solution than having a connection thread that sends data directly. Am I correct?
Something like a queue is always suitable for decoupling things, especially when the processing of data can be deferred. A queue implementation like RabbitMQ is transactional, so it will integrate nicely into a system where transactional integrity is a must.
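A rough sketch of the producer side with RabbitMQ, using the pika client; the broker host and queue name below are placeholders:
import json
import pika

# hypothetical broker host and queue name
connection = pika.BlockingConnection(pika.ConnectionParameters("broker-host"))
channel = connection.channel()
channel.queue_declare(queue="device_data", durable=True)

def enqueue(device_info):
    """Publish one device record; it stays queued until a consumer forwards it."""
    channel.basic_publish(
        exchange="",
        routing_key="device_data",
        body=json.dumps(device_info),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

enqueue({"device_id": "abc123", "temperature": 21.5})
connection.close()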
Right, you might want persistent queue-like storage between the application components. Depending on your requirements, it might be anything from a simple file to a fully fledged transactional store.
The end result I am trying to achieve is to allow a server to assign specific tasks to a client when it makes its connection. A simplified version would be like this:
Client connects to Server
Server tells Client to run some network task
Client receives task and fires up another process to complete task
Client tells Server it has started
Server tells Client it has another task to do (and so on...)
A couple of notes
There would be a cap on how many tasks a client can do
The client would need to be able to monitor the task/process (running? died?)
It would be nice if the client could receive data back from the process to send to the server if needed
At first, I was going to try threading, but I have heard python doesn't do threading correctly (is that right/wrong?)
Then I thought of firing off a system call from Python and recording the PID, then sending certain signals to it for status and stop (SIGUSR1, SIGUSR2, SIGINT). But I'm not sure that will work, because I don't know if I can capture data from another process. If I can, I don't have a clue how that would be accomplished (stdout or a socket file?).
What would you guys suggest as far as the best way to handle this?
Use spawnProcess to spawn a subprocess. If you're using Twisted already, then this should integrate pretty seamlessly into your existing protocol logic.
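A minimal sketch of that approach (the spawned command is just a placeholder); the ProcessProtocol callbacks give you the monitoring and output-capture asked about:
from twisted.internet import protocol, reactor

class TaskProtocol(protocol.ProcessProtocol):
    """Collects the child's stdout and notices when it exits."""
    def outReceived(self, data):
        # data arriving on the child's stdout; forward to the server as needed
        print("task output:", data)
    def processEnded(self, reason):
        # called when the child exits, cleanly or not
        print("task finished:", reason.value)
        reactor.stop()

# hypothetical task: replace with the real command the server assigned
reactor.spawnProcess(TaskProtocol(), "/usr/bin/python3",
                     ["python3", "-c", "print('hello from task')"])
reactor.run()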
Use Celery, a Python distributed task queue. It probably does everything you want or can be made to do everything you want, and it will also handle a ton of edge cases you might not have considered yet (what happens to existing jobs if the server crashes, etc.)
You can communicate with Celery from your other software using a messaging queue like RabbitMQ; see the Celery tutorials for details on this.
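A minimal sketch of that setup, assuming a RabbitMQ broker at a placeholder address and a made-up task name:
# tasks.py -- run a worker on each client machine with: celery -A tasks worker
from celery import Celery

app = Celery("tasks",
             broker="amqp://guest:guest@broker-host//",   # hypothetical broker
             backend="rpc://")                            # lets callers fetch results

@app.task
def run_network_task(target):
    """Do the actual work on the worker and return data for the caller."""
    return f"finished scanning {target}"

# From the server side, queue work without blocking:
#   result = run_network_task.delay("10.0.0.5")
#   print(result.get(timeout=300))   # later, fetch the return value if needed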
It will probably be most convenient to use a database such as MySQL or PostgreSQL to store information about tasks and their results, but you may be able to engineer a solution that doesn't use a database if you prefer.