MLflow artifact logging and retrieval on a remote server - python

I am trying to set up an MLflow tracking server on a remote machine as a systemd service.
I have an SFTP server running and have created an SSH key pair.
Everything seems to work fine except the artifact logging. MLflow does not seem to have permission to list the artifacts saved in the mlruns directory.
I create an experiment and log artifacts in this way:
import mlflow

uri = 'http://192.XXX:8000'
mlflow.set_tracking_uri(uri)
mlflow.create_experiment('test', artifact_location='sftp://192.XXX:_path_to_mlruns_folder_')
experiment = mlflow.get_experiment_by_name('test')
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name=run_name) as run:
    mlflow.log_param(_parameter_name_, _parameter_value_)
    mlflow.log_artifact(_an_artifact_, _artifact_folder_name_)
I can see the metrics in the UI and the artifacts in the correct destination folder on the remote machine. However, in the UI I receive this message when trying to see the artifacts:
Unable to list artifacts stored
under sftp://192.XXX:path_to_mlruns_folder/run_id/artifacts
for the current run. Please contact your tracking server administrator
to notify them of this error, which can happen when the tracking
server lacks permission to list artifacts under the current run's root
artifact directory.
I cannot figure out why, as the mlruns folder has drwxrwxrwx permissions and all the subfolders have drwxrwxr-x. What am I missing?
UPDATE
Looking at it with fresh eyes, it seems weird that it tries to list files through sftp://192.XXX:; it should just look in the folder _path_to_mlruns_folder_/_run_id_/artifacts. However, I still do not know how to get around that.
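For reference, the URI the UI tries to list is the run's artifact_uri, and the same listing can be reproduced from a client (a minimal sketch, not part of the original post; the run ID is a placeholder, and listing over sftp requires pysftp plus SSH access from the machine running it):
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri('http://192.XXX:8000')

run = mlflow.get_run('_run_id_')
print(run.info.artifact_uri)  # prints the sftp://192.XXX:... root the UI tries to list

client = MlflowClient()
for artifact in client.list_artifacts('_run_id_'):
    print(artifact.path)
If the client-side listing works but the UI does not, the problem is specific to the account the tracking server runs as, which is what the answer below addresses.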

The problem seems to be that, by default, the systemd service is run by root.
Specifying a user, and creating an SSH key pair for that user to access the same remote machine, worked.
[Unit]
Description=MLflow server
After=network.target
[Service]
Restart=on-failure
RestartSec=20
User=_user_
Group=_group_
ExecStart=/bin/bash -c 'PATH=_yourpath_/anaconda3/envs/mlflow_server/bin/:$PATH exec mlflow server --backend-store-uri postgresql://mlflow:mlflow@localhost/mlflow --default-artifact-root sftp://_user_@192.168.1.245:_yourotherpath_/MLFLOW_SERVER/mlruns -h 0.0.0.0 -p 8000'
[Install]
WantedBy=multi-user.target
_user_ and _group_ should be the same as those listed by ls -la in the mlruns directory.
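To double-check that the account the service runs as can actually reach the artifact store, a quick SFTP listing from Python can help (a minimal sketch, assuming paramiko is installed; the host, username, key path and mlruns path are placeholders matching the unit file above, and it should be run as _user_, e.g. via sudo -u _user_):
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('192.168.1.245', username='_user_', key_filename='/home/_user_/.ssh/id_rsa')
sftp = ssh.open_sftp()
print(sftp.listdir('_yourotherpath_/MLFLOW_SERVER/mlruns'))  # should list the experiment folders
sftp.close()
ssh.close()
If this works for one account but not another, the key pair (or known_hosts entry) is tied to the wrong user, which matches the symptom described in the question.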

Related

Datadog Python log collection from self-hosted Github Runner

I'm trying to collect logs from cron jobs running on our self-hosted GitHub runners, but so far I can only see the logs of the github-runner host itself.
I've created a self-hosted GitHub runner in AWS running on Ubuntu with a standard config.
We've also installed the Datadog agent v7 with their script and basic configuration, and added log collection from files using these instructions.
Our config for log collection is below.
curl https://s3.amazonaws.com/dd-agent/scripts/install_script.sh -o ddinstall.sh
export DD_API_KEY=${datadog_api_key}
export DD_SITE=${datadog_site}
export DD_AGENT_MAJOR_VERSION=7
bash ./ddinstall.sh
# Configure logging for GitHub runner
tee /etc/datadog-agent/conf.d/runner-logs.yaml << EOF
logs:
  - type: file
    path: /home/ubuntu/actions-runner/_diag/Worker_*.log
    service: github
    source: github-worker
  - type: file
    path: /home/ubuntu/actions-runner/_diag/Runner_*.log
    service: github
    source: github-runner
EOF
chown dd-agent:dd-agent /etc/datadog-agent/conf.d/runner-logs.yaml
# Enable log collection
echo 'logs_enabled: true' >> /etc/datadog-agent/datadog.yaml
systemctl restart datadog-agent
After these steps, I can see logs from our GitHub runner servers. However, on those runners we have several Python cron jobs running in Docker containers, logging to stdout. I can see those logs in the GitHub runner UI, but they're not available in Datadog, and those are the logs I'd really like to capture so that I can extract metrics from them.
Do the Docker containers for the Python scripts need some special Datadog setup as well? Do they need to log to a file that the Datadog agent registers as a log file, as in the setup above?
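The configuration above only tails the two runner log files, so container stdout never reaches the agent through it; the usual routes are Datadog's Docker log collection integration or logging to a file the agent tails. Whichever route is chosen, emitting structured lines on stdout from the cron jobs makes them easier to parse later. A minimal sketch using only the standard library (field names are illustrative, not a Datadog requirement):
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    # Render each record as a single JSON line on stdout.
    def format(self, record):
        return json.dumps({
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger('cron-job').info('job finished')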

Service file is not giving permission to required files

I am executing a Python script through a service file. The Python script is responsible for creating 3 more scripts and then executing them one by one. I am also giving permissions to all of them and to a folder in my home directory.
The problem is that when the service runs, neither the Python files nor the folder gets the permissions. I am giving 777 permissions.
Following is my service file
[Unit]
Description=systemd service to run upload script
[Service]
Type=simple
User=jetson
ExecStart=/usr/bin/python3 /home/project/file_upload.py
[Install]
WantedBy=multi-user.target
The folder I am trying to give permissions to is created by the Azure IoT Edge module.
Please let me know if I need to make any changes in the service file.
I resolved this issue by creating the folder and giving it permissions before starting the service, so now there are no issues.
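An equivalent workaround can also live at the top of the script itself (a rough sketch, not from the original post; the path is a placeholder, and chmod will still fail if the folder already exists and is owned by another user, such as the IoT Edge module's):
import os

TARGET = '/home/jetson/_your_folder_'   # placeholder for the folder the script needs

os.makedirs(TARGET, exist_ok=True)      # create it before anything else runs
os.chmod(TARGET, 0o777)                 # the 777 permissions mentioned in the question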

Django running on an ECS task does not work. "Connection refused" or "No data response" when requesting the webapp

I have some problems running Django on an ECS task.
I want to have a Django webapp running on an ECS task and accessible to the world.
Here are the symptoms:
When I run an ECS task using Django's python manage.py runserver 0.0.0.0:8000 as the entry point for my container, I get a connection refused response.
When I run the task with Gunicorn, using gunicorn --bind 0.0.0.0:8000 my-project.wsgi, I get a no data response.
I don't see any logs on CloudWatch, and I can't find any server logs when I SSH into the ECS instance.
Here are some of my settings related to that kind of issue:
I have set my ECS instance security group inbound rules to All TCP | TCP | 0 - 65535 | 0.0.0.0/0 to be sure it's not a firewall problem. I can confirm that, because a Ruby on Rails server runs perfectly on the same ECS instance.
In my container task definition I set a port mapping of 80:8000 and another of 8000:8000.
In my settings.py, I have set ALLOWED_HOSTS = ["*"] and DEBUG = False.
Locally, my server runs perfectly from the same Docker image when doing docker run -it -p 8000:8000 my-image gunicorn --bind=0.0.0.0:8000 wsgi, or the same with manage.py runserver.
Here is my Dockerfile for a Gunicorn web server.
FROM python:3.6
WORKDIR /usr/src/my-django-project
COPY my-django-project .
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["gunicorn","--bind","0.0.0.0:8000","wsgi"]
# CMD ["python","manage.py", "runserver", "0.0.0.0:8000"]
Any help would be appreciated!
To help you debug:
What is the status of the task when you try to access your webapp?
Figure out which instance the task is running on and run docker ps on that ECS instance to find the running container.
If you can see the container running on the instance, try accessing your webapp directly on the server with something like curl http://localhost:8000 or wget (a Python stdlib alternative is sketched after these steps).
If your container is not running, try docker ps -a to see which one has just stopped, and check it with docker logs -f.
With this approach you cut out all the AWS firewall settings, so you can see whether your container is configured correctly. I think it will make tracking down the issue easier.
Once you have figured out that the container is running fine and you can reach it on localhost, you can move on to the security group inbound/outbound rules.
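If curl or wget happens not to be available inside the container, the same localhost check can be done with the Python standard library (a small sketch; the port matches the Dockerfile above):
import urllib.request

try:
    with urllib.request.urlopen('http://localhost:8000', timeout=5) as resp:
        print(resp.status, resp.reason)
except Exception as exc:  # connection refused, timeout, HTTP errors, ...
    print('request failed:', exc)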

Nginx Django and Gunicorn. Gunicorn sock file is missing?

I have an Ansible-provisioned VM based on this one, https://github.com/jcalazan/ansible-django-stack, but for some reason trying to start Gunicorn gives the following error:
Can't connect to /path/to/my/gunicorn.sock
and in nginx log file:
connect() to unix:/path/to/my/gunicorn.sock failed (2: No such file or directory) while connecting to upstream
And actually the socket file is missing in the specified directory. I have checked the permissions of the directory and they are fine.
Here is my gunicorn_start script:
NAME="{{ application_name }}"
DJANGODIR={{ application_path }}
SOCKFILE={{ virtualenv_path }}/run/gunicorn.sock
USER={{ gunicorn_user }}
GROUP={{ gunicorn_group }}
NUM_WORKERS={{ gunicorn_num_workers }}
# Set this to 0 for unlimited requests. During development, you might want to
# set this to 1 to automatically restart the process on each request (i.e. your
# code will be reloaded on every request).
MAX_REQUESTS={{ gunicorn_max_requests }}
echo "Starting $NAME as `whoami`"
# Activate the virtual environment.
cd $DJANGODIR
. ../../bin/activate
# Set additional environment variables.
. ../../bin/postactivate
# Create the run directory if it doesn't exist.
RUNDIR=$(dirname $SOCKFILE)
test -d $RUNDIR || mkdir -p $RUNDIR
# Programs meant to be run under supervisor should not daemonize themselves
# (do not use --daemon).
exec gunicorn \
--name $NAME \
--workers $NUM_WORKERS \
--max-requests $MAX_REQUESTS \
--user $USER --group $GROUP \
--log-level debug \
--bind unix:$SOCKFILE \
{{ application_name }}.wsgi
Can anyone suggest what else could cause the missing socket file?
Thanks
Well, since I don't have enough rep to comment, I'll mention here that the missing socket by itself doesn't tell you much, but I can tell you a bit about how I started out in your shoes and got things working.
The long and short of it is that gunicorn has encountered a problem when run by upstart and either never got up and running or shut down. Here are some steps that may help you get more info to track down your issue:
In my case, when this happened, gunicorn never got around to doing any error logging, so I had to look elsewhere. Try ps auxf | grep gunicorn to see if you have any workers going. I didn't.
Looking in the syslog for complaints from upstart, grep init: /var/log/syslog, showed me that my gunicorn service had been stopped because it was respawning too fast, though I doubt that'll be your problem since you don't have a respawn in your conf. Regardless, you might find something there.
After seeing gunicorn was failing to run or log errors, I decided to try running it from the command line. Go to the directory where your manage.py lives and run the expanded version of your upstart command against your gunicorn instance. Something like this (replace all of the vars with the appropriate literals instead of the garbage I use):
/path/to/your/virtualenv/bin/gunicorn --name myapp --workers 4 --max-requests 10 --user appuser --group webusers --log-level debug --error-logfile /somewhere/I/can/find/error.log --bind unix:/tmp/myapp.socket myapp.wsgi
If you're lucky, you may get a python traceback or find something in your gunicorn error log after running the command manually. Some things that can go wrong:
django errors (maybe problems loading your settings module?). Make sure your wsgi.py is referencing the appropriate settings module on the server.
whitespace issues in your upstart script. I had a tab hiding among spaces that munged things up.
user/permission issues. Finally, I was able to run gunicorn as root on the command line but not as a non-root user via the upstart config.
Hope that helps. It's been a couple of long days tracking this stuff down.
I encountered the same problem after following Michal Karzynski's great guide 'Setting up Django with Nginx, Gunicorn, virtualenv, supervisor and PostgreSQL'.
And this is how I solved it.
I had this variable in the bash script used to start gunicorn via Supervisor (myapp/bin/gunicorn_start):
SOCKFILE={{ myapp absolute path }}/run/gunicorn.sock
When you run the bash script for the first time, this creates a 'run' folder and a sock file with root privileges. So I deleted the run folder with sudo, then recreated it without sudo privileges, and voila! Now if you rerun Gunicorn or Supervisor you won't have the annoying missing sock file error message anymore!
TL;DR
Sudo delete run folder.
Recreate it without sudo privileges.
Run Gunicorn again.
????
Profit
The error could also arise when you haven't pip installed a requirement. In my case, looking at the gunicorn error logs, I found that there was a missing module. Usually happens when you forget to pip install new requirements.
Well, I worked on this issue for more than a week and finally was able to figure it out.
Please follow the links from DigitalOcean, but they do not pinpoint some important issues, which include:
no live upstreams while connecting to upstream
*4 connect() to unix:/myproject.sock failed (13: Permission denied) while connecting to upstream
gunicorn OSError: [Errno 1] Operation not permitted
*1 connect() to unix:/tmp/myproject.sock failed (2: No such file or directory)
etc.
These are basically permission issues with the connection between Nginx and Gunicorn.
To make things simple, I recommend giving the same nginx permissions to every file/project/Python program you create.
To solve all of these issues, follow this approach:
The first thing is:
Log in to the system as a root user
Create /home/nginx directory.
After doing this, follow the website as far as 'Create an Upstart Script'.
Run chown -R nginx:nginx /home/nginx
For the upstart script, make the following change in the last line:
exec gunicorn --workers 3 --bind unix:myproject.sock -u nginx -g nginx wsgi
DON'T add the -m (umask) option, as it messes up the socket. According to the Gunicorn documentation, when -m is left at its default, Python will figure out the best permission.
Start the upstart script
Now go to the /etc/nginx/nginx.conf file.
Go to the server block and append:
location / {
    include proxy_params;
    proxy_pass http://unix:/home/nginx/myproject.sock;
}
Do not follow the DigitalOcean article from here on.
Now restart the nginx server and you are good to go.
I had the same problem and found out that I had set DJANGO_SETTINGS_MODULE to the production settings in the gunicorn script, while the wsgi module was using the dev settings.
I pointed DJANGO_SETTINGS_MODULE to dev and everything worked.
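For reference, a stock Django wsgi.py looks roughly like the sketch below (the module names are placeholders, not the asker's project). The setdefault call is why a DJANGO_SETTINGS_MODULE exported in the gunicorn start script silently wins over the value written here, which is exactly the kind of mismatch described above:
import os

from django.core.wsgi import get_wsgi_application

# setdefault only applies if the variable is not already set in the environment,
# e.g. by the gunicorn_start script or an upstart/systemd unit.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings.dev')

application = get_wsgi_application()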

Properly specifying path for git pull from my local development machine

I'm trying to set up Fabric so that I can automatically deploy my Django app to my web server.
What I want to do is pull the data from my development machine (OS X) to the server.
How do I correctly specify my path in the git url?
This is the error I'm getting:
$ git pull
fatal: '/Users/Bryan/work/tempReview_app/.': unable to chdir or not a git archive
fatal: The remote end hung up unexpectedly
This is .git/config:
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = /Users/Bryan/work/my_app/.
[branch "master"]
    remote = origin
    merge = refs/heads/master
On your server, create a folder called myapp. Chdir to this folder, and then run
server ~/myapp$ git init
Then, let git know about your server. After this, push to the server's repository from your local machine.
local ~/myapp$ git remote add origin user@server:~/myapp.git
local ~/myapp$ git push origin master
Anytime you want to push changes to your server, just run git push. If you make a mistake, just log in to your server and run git checkout last-known-good-commit or something to that effect.
Git hooks are also very useful in situations such as the one you're facing. I would give you pointers on that but I don't know what your workflow is like, so it probably wouldn't be very helpful.
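Since the question mentions Fabric, a rough Fabric 1.x sketch of that flow might look like this (the host, user and path are placeholders; it assumes the remote named origin points at the server repository set up above):
from fabric.api import cd, env, local, run

env.hosts = ['user@server']

def deploy():
    # Mirrors the answer above: push local commits to the repository on the server...
    local('git push origin master')
    # ...then check on the server what its tip now looks like.
    with cd('~/myapp'):
        run('git log -1 --oneline')
Running fab deploy from the project directory on the development machine would then push the latest commits and show what the server repository ended up with.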
