Run FastAPI inside docker container - python

I am deploying a web scraping microservice in a Docker container. I've used Scrapy, and I am exposing an API endpoint with FastAPI that executes the crawler command.
I've created a Docker image using Ubuntu as the base and installed all required dependencies. Then I use docker exec container_name bash as an entry point to run the FastAPI server command. But how do I run the server as a background job?
I've also tried building from the FastAPI Docker image (tiangolo/uvicorn-gunicorn-fastapi:python3.6), but it fails to start.

I used the tiangolo/uvicorn-gunicorn-fastapi:python3.6 image, installed my web scraping dependencies in it along with the environment variables, and changed the working directory to the folder from which the scrapy crawl mybot command can be executed.
The issue I was facing with this solution earlier was a response timeout, because I run the scrapy crawl mybot command as an OS process using os.popen('scrapy crawl mybot') inside the API function, log the output, and only then return the response. I know this is not the right way to do it; I will run it as a background job instead (see the BackgroundTasks sketch after the endpoint code below), but it's a workaround for now.
Dockerfile:
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.6
# Install dependencies:
COPY requirements.txt .
RUN pip3 install -r requirements.txt
ENV POSTGRESQL_HOST=localhost
ENV POSTGRESQL_PORT=5432
ENV POSTGRESQL_DB=pg
ENV POSTGRESQL_USER=pg
ENV POSTGRESQL_PWD=pwd
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
COPY ./app /app
WORKDIR "/app"
FastAPI Endpoint:
#app.get("/scraper/crawlWeb")
async def scrapy_crawl_web(bot_name):
current_time = datetime.datetime.now()
start = current_time.timestamp()
print("--START JOB at " + str(current_time))
stream = os.popen(
'scrapy crawl %s 2>&1 & echo "$! `date`" >> ./scrapy_pid.txt' % bot_name)
output = stream.read()
print(output)
current_time = datetime.datetime.now()
end = current_time.timestamp()
print("--END JOB at " + str(current_time))
return "Crawler job took %s minutes and closed at %s" % ((end-start)/60.00, str(current_time))

Related

Execute bash script Using FLASK+DOCKER

I've been trying to create a Flask API that executes simple shell scripts, e.g. session = Popen(['echo "server_info.js" | node {}'.format(cmd)], shell=True, stdout=PIPE, stderr=PIPE). This worked very well, but when I dockerized the application the script stopped running and returned this error: b'/bin/sh: 1: /path No such file or directory'.
Also, I use Swagger and a blueprint with Flask. The dockerized version shows the Swagger UI but does not pick up any change I make in the swagger.json file. The code:
SWAGGER_URL = '/swagger'
API_URL = '/static/swagger.json'
SWAGGERUI_BLUEPRINT = get_swaggerui_blueprint(
    SWAGGER_URL,
    API_URL,
    config={
        'app_name': "NAME"
    }
)
Also, the Dockerfile:
FROM python:3.7
RUN mkdir /usr/src/app
WORKDIR /usr/src/app
COPY . /usr/src/app
RUN pip3 install --trusted-host pypi.python.org -r requirements.txt
EXPOSE 5000
ENTRYPOINT ["python3", "/usr/src/app/app.py"]
Any suggestions?
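Not a definitive answer, but the /bin/sh: 1: /path No such file or directory error usually means the binary or script path the shell sees inside the container is wrong. A small sketch to check this from inside the dockerized app, assuming node and server_info.js are expected under the image's /usr/src/app working directory (both paths are assumptions):

import shutil
import subprocess

# Check whether node is on PATH inside the container and whether the script
# can be run at the absolute path we expect.
print("node found at:", shutil.which("node"))
result = subprocess.run(
    ["node", "/usr/src/app/server_info.js"],
    capture_output=True, text=True,
)
print(result.returncode, result.stdout, result.stderr)

Two related points: the python:3.7 base image does not ship Node.js, so if the script relies on node it has to be installed in the Dockerfile; and swagger.json is copied into the image at build time, so edits to it on the host only show up after rebuilding the image or bind-mounting the file into the container.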

Python process never exits in Docker container during CircleCI workflow

I have a Dockerfile that looks like this:
FROM python:3.6
WORKDIR /app
ADD . /app/
# Install system requirements
RUN apt-get update && \
xargs -a requirements_apt.txt apt-get install -y
# Install Python requirements
RUN python -m pip install --upgrade pip
RUN python -m pip install -r requirements_pip.txt
# Circle CI ignores entrypoints by default
ENTRYPOINT ["dostuff"]
I have a CircleCI config that does:
version: 2.1
orbs:
  aws-ecr: circleci/aws-ecr@6.15.3
jobs:
  benchmark_tests_dev:
    docker:
      - image: blah_blah_image:test_dev
        # auth
    steps:
      - checkout
      - run:
          name: Compile and run benchmarks
          command: make bench
workflows:
  workflow_test_and_deploy_dev:
    jobs:
      - aws-ecr/build-and-push-image:
          name: build_test_dev
          context: my_context
          account-url: AWS_ECR_ACCOUNT_URL
          region: AWS_REGION
          repo: my_repo
          aws-access-key-id: AWS_ACCESS_KEY_ID
          aws-secret-access-key: AWS_SECRET_ACCESS_KEY
          dockerfile: Dockerfile
          tag: test_dev
          filters:
            branches:
              only: my-test-branch
      - benchmark_tests_dev:
          requires: [build_test_dev]
          context: my_context
          filters:
            branches:
              only: my-test-branch
      - aws-ecr/build-and-push-image:
          name: deploy_dev
          requires: [benchmark_tests_dev]
          context: my_context
          account-url: AWS_ECR_ACCOUNT_URL
          region: AWS_REGION
          repo: my_repo
          aws-access-key-id: AWS_ACCESS_KEY_ID
          aws-secret-access-key: AWS_SECRET_ACCESS_KEY
          dockerfile: Dockerfile
          tag: test2
          filters:
            branches:
              only: my-test-branch
make bench looks like:
bench:
	python tests/benchmarks/bench_1.py
	python tests/benchmarks/bench_2.py
Both benchmark tests follow this pattern:
# imports
# define constants
# Define functions/classes
if __name__ == "__main__":
    # Run those tests
If I build my Docker container on my-test-branch locally, override the entrypoint to get inside of it, and run make bench from inside the container, both Python scripts execute perfectly and exit.
If I commit to the same branch and trigger the CircleCI workflow, bench_1.py runs and then never exits. I have tried switching the order of the Python scripts in the make command; in that case, bench_2.py runs and then never exits. I have tried putting a sys.exit() at the end of the if __name__ == "__main__": block of both scripts, and that doesn't force an exit on CircleCI. I know the first script to run reaches completion, because I have placed logs throughout the script to track progress. It just never exits.
Any idea why these scripts would run and exit in the container locally but not exit in the container on CircleCI?
EDIT
I just realized "never exits" is an assumption I'm making. It's possible the script exits but the CircleCI job hangs silently after that? The point is the script runs, finishes, and the CircleCI job continues to run until I get a timeout error at 10 minutes (Too long with no output (exceeded 10m0s): context deadline exceeded).
Turns out the snowflake.connector Python lib we were using has this issue where if an error occurs during an open Snowflake connection, the connection is not properly closed and the process hangs. There is also another issue where certain errors in that lib are being logged and not raised, causing the first issue to occur silently.
I updated our snowflake IO handler to explicitly open/close a connection for every read/execute so that this doesn't happen. Now my scripts run just fine in the container on CircleCI. I still don't know why they ran in the container locally and not remotely, but I'm going to leave that one for the dev ops gods.
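For reference, a rough sketch of the kind of change described above: open a snowflake.connector connection per query and always close it, even when the query fails, so an error cannot leave a connection (and the process) hanging. The connection parameters are placeholders:

import snowflake.connector

def run_query(sql, **conn_params):
    # Fresh connection per call; the finally blocks guarantee cleanup even if
    # execute() raises, which is the failure mode that caused the hang above.
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()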

Passing AWS credentials to Python container

I am building a Python container for the first time using VS Code and WSL2. Here is my sample Python code. It runs fine in VS Code interactive mode because it picks up my default AWS credentials.
import boto3
s3BucketName = 'test-log-files'
s3 = boto3.resource('s3')
def s3move():
    try:
        s3.Object(s3BucketName, "destination/Changes.xlsx").copy_from(CopySource=(s3BucketName + "/source/Changes.xlsx"))
        s3.Object(s3BucketName, "source/Changes.xlsx").delete()
        print("Transfer Complete")
    except:
        print("Transfer failed")

if __name__ == "__main__":
    s3move()
The Dockerfile built by VS Code:
# For more information, please refer to https://aka.ms/vscode-docker-python
FROM python:3.8-slim-buster
# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1
# Install pip requirements
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN pip install boto3
WORKDIR /app
COPY . /app
# Switching to a non-root user, please refer to https://aka.ms/vscode-docker-python-user-rights
RUN useradd appuser && chown -R appuser /app
USER appuser
# During debugging, this entry point will be overridden. For more information, please refer to https://aka.ms/vscode-docker-python-debug
CMD ["python", "S3MoveFiles/S3MoveFiles.py"]
I would like to test this in the Docker container, and it seems I have to pass the AWS credentials to it. While there are other, probably more secure, ways, I wanted to test the method of mounting my credentials as a volume via an argument to the docker run command.
docker run -v ~/.aws/credentials:/appuser/home/.aws/credentials:ro image_id
I get the "Transfer failed" message in the Terminal window in VS Code. What am I doing wrong here? Checked several articles but couldn't find any hints. I am not logged in as root.

OpenWhisk Docker action behaves differently from the IBM Cloud CLI than from its frontend

I want to run my Python program on IBM Cloud Functions; because of its dependencies, this needs to be done in an OpenWhisk Docker action. I've changed my code so that it accepts a JSON argument:
json_input = json.loads(sys.argv[1])
INSTANCE_NAME = json_input['INSTANCE_NAME']
I can run it from the terminal:
python main/main.py '{"INSTANCE_NAME": "example"}'
I've added this Python program to OpenWhisk with this Dockerfile:
# Dockerfile for example whisk docker action
FROM openwhisk/dockerskeleton
ENV FLASK_PROXY_PORT 8080
### Add source file(s)
ADD requirements.txt /action/requirements.txt
RUN cd /action; pip install -r requirements.txt
# Move the file to
ADD ./main /action
# Rename our executable Python action
ADD /main/main.py /action/exec
CMD ["/bin/bash", "-c", "cd actionProxy && python -u actionproxy.py"]
But now if I run it using the IBM Cloud CLI, I just get my JSON back:
ibmcloud fn action invoke --result e2t-whisk --param-file ../env_var.json
# {"INSTANCE_NAME": "example"}
And if I run it from the IBM Cloud Functions website with the same JSON input, I get an error as if the parameter isn't even there.
stderr: INSTANCE_NAME = json_input['INSTANCE_NAME']",
stderr: KeyError: 'INSTANCE_NAME'"
What can be wrong that the code runs when directly invoked, but not from the OpenWhisk container?
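A hedged sketch of the shape /action/exec usually takes with openwhisk/dockerskeleton: the proxy invokes it with the action parameters as a single JSON argument and treats the JSON object printed on the last line of stdout as the result, and the file needs a shebang plus the executable bit. The exact behaviour depends on the skeleton version, so treat this as an assumption to verify:

#!/usr/bin/env python3
# /action/exec -- sketch of a dockerskeleton-style entry point.
import json
import sys

def main(params):
    # params holds whatever was passed via --param / --param-file.
    instance_name = params.get("INSTANCE_NAME")
    if instance_name is None:
        return {"error": "INSTANCE_NAME missing from params"}
    return {"INSTANCE_NAME": instance_name}

if __name__ == "__main__":
    params = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
    # The last line printed to stdout must be a JSON object (the action result).
    print(json.dumps(main(params)))

If exec is not executable (no chmod +x in the Dockerfile) or prints something other than JSON as its last line, the proxy can return unexpected results, which could explain why main.py behaves when invoked directly but not as an action.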

Request had insufficient authentication scopes (403) when trying to write crawling data to Bigquery from pipeline of Scrapy

I'm trying to build a Scrapy crawler: the spider crawls the data, and then in pipeline.py the data is saved to BigQuery. I built it with Docker, set up a crontab job, and pushed it to a Google Cloud server to run daily.
The problem is that when crontab executes the Scrapy crawler, it gets "google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/project_name/datasets/dataset_name/tables/table_name: Request had insufficient authentication scopes.".
For more detail: when I access the container (docker exec -it ... /bin/bash) and execute the crawler manually (scrapy crawl spider_name), it works like a charm. The data appears in BigQuery.
I use a service account (JSON file) with the bigquery.admin role to set up GOOGLE_APPLICATION_CREDENTIALS.
# spider file is fine
# pipeline.py
from google.cloud import bigquery
import logging
from scrapy.exceptions import DropItem
...

class SpiderPipeline(object):

    def __init__(self):
        # BIGQUERY
        # Setup GOOGLE_APPLICATION_CREDENTIALS in docker file
        self.client = bigquery.Client()
        table_ref = self.client.dataset('dataset').table('data')
        self.table = self.client.get_table(table_ref)

    def process_item(self, item, spider):
        if item['key']:
            # BIGQUERY
            '''Order: key, source, lang, created, previous_price, lastest_price, rating, review_no, booking_no'''
            rows_to_insert = [(item['key'], item['source'], item['lang'])]
            error = self.client.insert_rows(self.table, rows_to_insert)
            if error == []:
                logging.debug('...Save data to bigquery {}...'.format(item['key']))
                # raise DropItem("Missing %s!" % item)
            else:
                logging.debug('[Error upload to Bigquery]: {}'.format(error))
            return item
        raise DropItem("Missing %s!" % item)
In the Dockerfile:
FROM python:3.5-stretch
WORKDIR /app
COPY requirements.txt ./
RUN pip install --trusted-host pypi.python.org -r requirements.txt
COPY . /app
# For Bigquery
# key.json is already in right location
ENV GOOGLE_APPLICATION_CREDENTIALS='/app/key.json'
# Sheduler cron
RUN apt-get update && apt-get -y install cron
# Add crontab file in the cron directory
ADD crontab /etc/cron.d/s-cron
# Give execution rights on the cron job
RUN chmod 0644 /etc/cron.d/s-cron
# Apply cron job
RUN crontab /etc/cron.d/s-cron
# Create the log file to be able to run tail
RUN touch /var/log/cron.log
# Run the command on container startup
CMD cron && tail -f /var/log/cron.log
In crontab:
# Run once every day at midnight. Need empty line at the end to run.
0 0 * * * cd /app && /usr/local/bin/scrapy crawl spider >> /var/log/cron.log 2>&1
In conclusion: how do I get crontab to run the crawler without the 403 error? Thanks so much for any support.
I suggest you load the service account directly in your code rather than from the environment, like this:
from google.cloud import bigquery
from google.cloud.bigquery.client import Client
service_account_file_path = "/app/key.json"  # your service account auth file
client = bigquery.Client.from_service_account_json(service_account_file_path)
The rest of the code can stay the same, since you've verified that it works when run manually.
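Applied to the pipeline above, that suggestion would look roughly like this. It also sidesteps the likely root cause: cron jobs typically run with a minimal environment and may not see variables set with ENV in the Dockerfile, which would explain why the manual run works while the cron run gets a 403:

from google.cloud import bigquery

class SpiderPipeline(object):

    def __init__(self):
        # Load the key file directly instead of relying on
        # GOOGLE_APPLICATION_CREDENTIALS, which the cron job's environment
        # may not contain.
        self.client = bigquery.Client.from_service_account_json('/app/key.json')
        table_ref = self.client.dataset('dataset').table('data')
        self.table = self.client.get_table(table_ref)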
