Datadog Python log collection from self-hosted Github Runner - python

I'm trying to collect logs from cron jobs running on our self-hosted GitHub runners, but so far I can only see the logs of the github-runner host itself.
I've created a self-hosted GitHub runner in AWS running on Ubuntu with a standard config.
We've also installed the Datadog Agent v7 with their install script and a basic configuration, and added log collection from files following these instructions.
Our config for log collection is below.
curl https://s3.amazonaws.com/dd-agent/scripts/install_script.sh -o ddinstall.sh
export DD_API_KEY=${datadog_api_key}
export DD_SITE=${datadog_site}
export DD_AGENT_MAJOR_VERSION=7
bash ./ddinstall.sh
# Configure logging for GitHub runner
tee /etc/datadog-agent/conf.d/runner-logs.yaml << EOF
logs:
  - type: file
    path: /home/ubuntu/actions-runner/_diag/Worker_*.log
    service: github
    source: github-worker
  - type: file
    path: /home/ubuntu/actions-runner/_diag/Runner_*.log
    service: github
    source: github-runner
EOF
chown dd-agent:dd-agent /etc/datadog-agent/conf.d/runner-logs.yaml
# Enable log collection
echo 'logs_enabled: true' >> /etc/datadog-agent/datadog.yaml
systemctl restart datadog-agent
After these steps, I can see logs from our GitHub runner servers. However, on those runners we have several Python cron jobs running in Docker containers and logging to stdout. I can see those logs in the GitHub UI, but they're not available in Datadog, and those are the logs I'd really like to capture so that I can extract metrics from them.
Do the Docker containers for the Python scripts need some special Datadog setup as well? Do they need to log to a file that the Datadog Agent registers as a log file in the setup above?
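One thing that may be missing here: the file-based configs above only tail the paths they list, so stdout from Docker containers is never picked up that way. Container logs normally come in through the Agent's Docker log collection instead: give the dd-agent user access to the Docker socket and turn on container log collection in datadog.yaml. A minimal sketch, assuming the Agent runs directly on the runner host alongside Docker:

# Let the Agent read container logs via the Docker socket
usermod -a -G docker dd-agent
# Collect logs from all running containers
tee -a /etc/datadog-agent/datadog.yaml << EOF
logs_config:
  container_collect_all: true
EOF
systemctl restart datadog-agent

With that in place, the source and service for each cron container can be set through Docker labels (Datadog Autodiscovery) when the container is started; my-cron-job and my-cron-image below are placeholders:

docker run \
  --label com.datadoghq.ad.logs='[{"source": "python", "service": "my-cron-job"}]' \
  my-cron-image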

Related

Can't see the logs for new relic on python service

I'm trying to add the logs from my Python service that runs in a Docker container.
I followed this tutorial: https://docs.newrelic.com/docs/apm/agents/python-agent/installation/standard-python-agent-install/
So I added this to my Dockerfile: RUN newrelic-admin generate-config defcf97e23c0621d66d085a35e56f93fac788774 newrelic.ini
and changed my entrypoint to run my Python app like this: NEW_RELIC_CONFIG_FILE=newrelic.ini newrelic-admin run-program python run.py
When I look at the logs on my pod, I can see that New Relic is being used:
2022-09-29 14:09:43,457 (6/MainThread) newrelic.core.agent INFO - New Relic Python Agent (8.2.0.181)
But when I look in New Relic at the new service I created for it under Services - APM, I can't see it for some reason.
Am I looking at the wrong place or am I missing something?

Use az webapp deployment source to deploy code from git when the app is in a git repository subfolder

I'm trying to deploy code from my GitLab CI runner to Azure. I'm using az webapp deployment source to do the job:
az webapp deployment source config --branch master --manual-integration --name [myWebApp] --repo-url [git url] --app-working-dir [folder] --resource-group [myResourceGroup]
But there is a problem: the default working directory is the git repository's root folder, and my application is inside a child folder of the root. I checked the command's options and saw --app-working-dir, which does what I want, but it only works if --cd-project-url is set (which I don't need, since I'm not using VSTS). There is also no example available for using this option. I'm looking for a command set similar to what I used when deploying via Azure Cloud Shell: cd <app-folder>; az webapp up --sku B1 --name <app-name>. Is there a way to do the deployment in a simpler way?
Create a .deployment file in the root directory of your git repository and add the project subfolder to the configuration. It will work then.
https://github.com/projectkudu/kudu/wiki/Customizing-deployments
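For reference, a minimal sketch of such a file, assuming the application lives in a subfolder called webapp (a placeholder for your actual folder name):

# Write a Kudu .deployment file at the repository root;
# "webapp" is a placeholder for the subfolder that contains the app
cat > .deployment << 'EOF'
[config]
project = webapp
EOF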

MLflow artifact logging and retrieval on a remote server

I am trying to set up an MLflow tracking server on a remote machine as a systemd service.
I have an SFTP server running and have created an SSH key pair.
Everything seems to work fine except the artifact logging: MLflow does not seem to have permission to list the artifacts saved in the mlruns directory.
I create an experiment and log artifacts in this way:
import mlflow

uri = 'http://192.XXX:8000'
mlflow.set_tracking_uri(uri)
mlflow.create_experiment('test', artifact_location='sftp://192.XXX:_path_to_mlruns_folder_')
experiment = mlflow.get_experiment_by_name('test')
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name=run_name) as run:
    mlflow.log_param(_parameter_name_, _parameter_value_)
    mlflow.log_artifact(_an_artifact_, _artifact_folder_name_)
I can see the metrics in the UI and the artifacts in the correct destination folder on the remote machine. However, in the UI I receive this message when trying to see the artifacts:
Unable to list artifacts stored
under sftp://192.XXX:path_to_mlruns_folder/run_id/artifacts
for the current run. Please contact your tracking server administrator
to notify them of this error, which can happen when the tracking
server lacks permission to list artifacts under the current run's root
artifact directory.
I cannot figure out why, as the mlruns folder has drwxrwxrwx permissions and all the subfolders have drwxrwxr-x. What am I missing?
UPDATE
Looking at it with fresh eyes, it seems weird that it tries to list files through sftp://192.XXX:; it should just look in the folder _path_to_mlruns_folder_/_run_id_/artifacts. However, I still do not know how to get around that.
The problem seems to be that by default the systemd service is run by root.
Specifying a user and creating an SSH key pair for that user to access the same remote machine worked.
[Unit]
Description=MLflow server
After=network.target
[Service]
Restart=on-failure
RestartSec=20
User=_user_
Group=_group_
ExecStart=/bin/bash -c 'PATH=_yourpath_/anaconda3/envs/mlflow_server/bin/:$PATH exec mlflow server --backend-store-uri postgresql://mlflow:mlflow@localhost/mlflow --default-artifact-root sftp://_user_@192.168.1.245:_yourotherpath_/MLFLOW_SERVER/mlruns -h 0.0.0.0 -p 8000'
[Install]
WantedBy=multi-user.target
_user_ and _group_ should be the same as those listed by ls -la in the mlruns directory.
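If it helps, a minimal sketch of creating that key pair for the service user; _user_, the host, and the paths are the placeholders from the unit file above:

# Generate an SSH key pair owned by the service user
sudo -u _user_ ssh-keygen -t rsa -b 4096 -N "" -f /home/_user_/.ssh/id_rsa
# Install the public key on the artifact host; the first connection also
# records the host key in /home/_user_/.ssh/known_hosts
sudo -u _user_ ssh-copy-id _user_@192.168.1.245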

ModuleNotFoundError: No module named 'django' when deploying on Azure

I'm trying to deploy a Django web app to Microsoft Azure. It is deployed correctly by the pipeline on Azure DevOps, but I get the error message (ModuleNotFoundError: No module named 'django') in the Azure portal and cannot reach my app via the URL.
The app also works properly locally.
Here is the whole error message: https://pastebin.com/mGHSS8kQ
How can I solve this error?
I understand you have tried the steps suggested in the SO thread Eyap shared, and a few of the points below already cover that. Kindly review these settings.
You can use this command instead - source /antenv3.6/bin/activate.
As a side note, antenv will be available only after a deployment is initiated. Kindly check the "/" path from SSH and you should see a folder with a name starting with antenv.
Browse to .python_packages/lib/python3.6/site-packages/ or .python_packages/lib/site-packages/ and check that the file path exists.
Review the application logs as well (/home/LogFiles folder) from Kudu: https://<your-webapp-name>.scm.azurewebsites.net/api/logs/docker
The App Service deployment engine automatically activates a virtual environment and runs
pip install -r requirements.txt
The requirements.txt file must be in the project root for dependencies to be installed.
For Django apps, App Service looks for a file named wsgi.py within your app code, and then runs Gunicorn using the following command, where <module> is the name of the folder that contains wsgi.py:
gunicorn --bind=0.0.0.0 --timeout 600 <module>.wsgi
If you want more specific control over the startup command, use a custom startup command: replace <module> with the name of the folder that contains wsgi.py, and add a --chdir argument if that module is not in the project root.
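A minimal sketch of such a custom startup command, assuming the folder containing wsgi.py is called website and sits inside a repo subfolder called src (both names are placeholders):

# App Service custom startup command; "src" and "website" are placeholders
gunicorn --bind=0.0.0.0 --timeout 600 --chdir src website.wsgi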
For additional details, please check out these documents:
Configure a Linux Python app for Azure App Service
Quickstart: Create a Python app in Azure App Service on Linux

Django running on an ECS task does not work. "Connection refused" or "No data response" when requesting the webapp

I have some problems running Django on an ECS task.
I want to have a Django webapp running on an ECS task and accessible to the world.
Here are the symptoms:
When I run an ECS task using Django's python manage.py runserver 0.0.0.0:8000 as the entry point for my container, I get a connection refused response.
When I run the task using Gunicorn with gunicorn --bind 0.0.0.0:8000 my-project.wsgi, I get a no data response.
I don't see logs in CloudWatch and I can't find any server logs when I SSH into the ECS instance.
Here are some of my settings related to that kind of issue:
I have set my ECS instance security group's inbound rules to All TCP | TCP | 0 - 65535 | 0.0.0.0/0 to be sure it's not a firewall problem, and I can confirm that because I can run a Ruby on Rails server on the same ECS instance perfectly.
In my container task definition I set one port mapping to 80:8000 and another to 8000:8000.
In my settings.py, I have set ALLOWED_HOSTS = ["*"] and DEBUG = False.
Locally my server runs perfectly on the same Docker image when doing docker run -it -p 8000:8000 my-image gunicorn --bind=0.0.0.0:8000 wsgi, or the same with manage.py runserver.
Here is my Dockerfile for a Gunicorn web server.
FROM python:3.6
WORKDIR /usr/src/my-django-project
COPY my-django-project .
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["gunicorn","--bind","0.0.0.0:8000","wsgi"]
# CMD ["python","manage.py", "runserver", "0.0.0.0:8000"]
Any help would be greatly appreciated!
To help you debug:
Check the status of the task when you are trying to access your webapp.
Figure out which instance the task is running on and run docker ps on that ECS instance to find the running container.
If you can see the container running on the instance, try accessing your webapp directly on the server with a command like curl http://localhost:8000 or wget.
If your container is not running, try docker ps -a to see which container has just stopped, and check it with docker logs -f <container-id>.
With this approach you cut out all the AWS firewall settings, so you can see whether your container is configured correctly. I think it will make tracking down the issue easier.
Once you have figured out that the container is running fine and you can reach it on localhost, you can then work on the security group inbound/outbound filters.
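Put together, such a debugging session on the ECS container instance could look roughly like this (the container ID is a placeholder, and 8000 matches the port mapping from the question):

# On the ECS container instance, after SSHing in:
docker ps                          # is the Gunicorn container actually running?
curl -v http://localhost:8000/     # bypasses security groups and load balancers entirely
# If the container is not listed above:
docker ps -a                       # find the container that just exited
docker logs -f <container-id>      # read its stdout/stderr for the real error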
