I would like to set up a Python shell job as an ETL job. This job would do some basic data cleaning.
For now, I have a script that runs locally. I would like to test this script (or parts of it). I tried to set up a dev endpoint as explained here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html
Everything went fine and I can SSH into the machine. But with this tutorial I get a "gluepyspark" shell, and this shell isn't compatible with the AWS Data Wrangler library: https://github.com/awslabs/aws-data-wrangler.
I would like to know if it's possible to set up a dev endpoint to test Python shell jobs. Alternatively, I would also accept a workflow for testing Python shell jobs.
Is it possible to run Locust tests from inline code without access to the shell? I am trying to deploy an AWS Lambda-based backend for my application to run load tests. I have not been able to find any documentation about running tests directly from the Python file without using the terminal.
This is covered in the Locust docs, though it's not framed quite as you are phrasing it.
"It is possible to start a load test from your own Python code, instead of running Locust using the locust command."
https://docs.locust.io/en/stable/use-as-lib.html
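For reference, here is a minimal sketch of that approach (the host, user count, and run time are placeholders, and the exact API can vary between Locust versions):

    import gevent
    from locust import HttpUser, task, between
    from locust.env import Environment

    class ApiUser(HttpUser):
        host = "https://example.com"  # placeholder target
        wait_time = between(1, 2)

        @task
        def index(self):
            self.client.get("/")

    # Build an Environment programmatically instead of using the locust CLI.
    env = Environment(user_classes=[ApiUser])
    env.create_local_runner()

    # Spawn 10 users at 2 users/second, stop after 30 seconds.
    env.runner.start(10, spawn_rate=2)
    gevent.spawn_later(30, lambda: env.runner.quit())
    env.runner.greenlet.join()

If this runs inside a Lambda handler, the total run time would need to stay well under the Lambda timeout.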
I am trying to run a Python script that runs a bunch of queries against tables in my Snowflake database and, based on the results of those queries, stores the output in Snowflake tables. The new company I work for leverages Informatica Cloud as their ETL tool, and while my setup works with Microsoft Azure (ADF) and Azure Batch, I cannot figure out for the life of me how to trigger the Python script from the Informatica Cloud Data Integration tool.
I think this can be tricky for a cloud implementation.
You can create an executable from your .py script, put that file on the Informatica Cloud agent server, and then call it using a shell command.
You can also put the .py file on the agent server and run it from the shell, like:
    $PYTHON_HOME/python your_script.py
You need to make sure the Python version is compatible and that all required packages are installed on the agent server.
Then set up the shell command in Informatica Cloud and run it as part of a workflow. Schedule it if needed.
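As a rough illustration, a script invoked this way might look like the sketch below (connection parameters, table names, and the query are placeholders; it assumes the snowflake-connector-python package is installed on the agent server):

    import os
    import snowflake.connector

    def main():
        # Credentials pulled from environment variables set on the agent server.
        conn = snowflake.connector.connect(
            account=os.environ["SNOWFLAKE_ACCOUNT"],
            user=os.environ["SNOWFLAKE_USER"],
            password=os.environ["SNOWFLAKE_PASSWORD"],
            warehouse="MY_WH",
            database="MY_DB",
            schema="MY_SCHEMA",
        )
        try:
            cur = conn.cursor()
            # Run a query and store its output in a results table.
            cur.execute(
                "INSERT INTO QUERY_RESULTS "
                "SELECT CURRENT_TIMESTAMP(), COUNT(*) FROM MY_TABLE"
            )
        finally:
            conn.close()

    if __name__ == "__main__":
        main()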
I'm working on a Flask API, where one of the endpoints receives a message and publishes it to Pub/Sub. Currently, in order to test that endpoint, I have to manually spin up a Pub/Sub emulator from the command line and keep it running during the test. That works just fine, but it isn't ideal for automated tests.
I wonder if anyone knows a way to spin up a test Pub/Sub emulator from Python? Or does anyone have a better solution for testing such an API?
As far as I know, there is no Python-native Google Cloud Pub/Sub emulator available.
You have a few options, all of which require launching an external program from Python:
Invoke the gcloud command you mentioned, gcloud beta emulators pubsub start [options], directly from your Python application to start the emulator as an external program.
The Pub/Sub emulator that comes as part of the Cloud SDK is a JAR file bootstrapped by the bash script at CLOUD_SDK_INSTALL_DIR/platform/pubsub-emulator/bin/cloud-pubsub-emulator. You could possibly run this bash script directly.
Here is a StackOverflow answer which covers multiple ways to launch an external program from Python.
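As an illustration of the first option, here is a minimal sketch that starts the emulator as a subprocess for the duration of a test run (it assumes the Cloud SDK and its pubsub-emulator component are installed; the port, project ID, and wait logic are placeholders):

    import os
    import subprocess
    import time

    def start_pubsub_emulator(host_port="localhost:8085", project="test-project"):
        proc = subprocess.Popen(
            ["gcloud", "beta", "emulators", "pubsub", "start",
             "--project=" + project, "--host-port=" + host_port],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Point the Python client library at the emulator.
        os.environ["PUBSUB_EMULATOR_HOST"] = host_port
        time.sleep(5)  # crude startup wait; poll the port in real code
        return proc

    emulator = start_pubsub_emulator()
    try:
        pass  # run the tests here
    finally:
        emulator.terminate()
        emulator.wait()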
Also, it is not quite clear from your question how you're calling the PubSub APIs in Python.
For unit tests, you could consider putting a wrapper around the code that actually invokes the Cloud Pub/Sub APIs and injecting a fake for this API wrapper. That way, you can test the rest of the code, which only calls your fake wrapper instead of the real one, without worrying about starting any external programs.
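A minimal sketch of that wrapper idea (the class names are illustrative, not a real API):

    class PubSubPublisher:
        """Thin wrapper around the real Pub/Sub publisher client."""
        def __init__(self, publisher_client, topic_path):
            self._client = publisher_client
            self._topic_path = topic_path

        def publish(self, data):
            return self._client.publish(self._topic_path, data=data)

    class FakePublisher:
        """Test double that just records what was published."""
        def __init__(self):
            self.messages = []

        def publish(self, data):
            self.messages.append(data)

    # In a unit test, inject FakePublisher() into the endpoint under test
    # and assert on fake.messages instead of talking to Pub/Sub.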
For integration tests, the PubSub emulator will definitely be useful.
This is how I usually do it:
1. I create a Python client class that publishes and subscribes using the topic, project, and subscription configured in the emulator.
Note: you need to set PUBSUB_EMULATOR_HOST=localhost:8085 as an environment variable in your Python project.
2. I spin up a Pub/Sub emulator as a Docker container.
Note: you need to set some environment variables, mount volumes, and expose port 8085.
Set the following environment variables for the container:
PUBSUB_EMULATOR_HOST
PUBSUB_PROJECT_ID
PUBSUB_TOPIC_ID
PUBSUB_SUBSCRIPTION_ID
3. Write whatever integration tests you want. Use the publisher or subscriber from the client depending on your test requirements.
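A minimal sketch of such a client class (it assumes google-cloud-pubsub 2.x is installed, PUBSUB_EMULATOR_HOST points at the container, and the topic and subscription already exist in the emulator; all names are placeholders):

    import os
    from google.cloud import pubsub_v1

    class EmulatorPubSubClient:
        def __init__(self, project_id, topic_id, subscription_id):
            # Without this variable the clients would try to reach the real service.
            assert os.environ.get("PUBSUB_EMULATOR_HOST"), \
                "PUBSUB_EMULATOR_HOST must be set before creating clients"
            self.publisher = pubsub_v1.PublisherClient()
            self.subscriber = pubsub_v1.SubscriberClient()
            self.topic_path = self.publisher.topic_path(project_id, topic_id)
            self.subscription_path = self.subscriber.subscription_path(
                project_id, subscription_id)

        def publish(self, data):
            # Blocks until the emulator acknowledges the message.
            return self.publisher.publish(self.topic_path, data=data).result()

        def pull(self, max_messages=10):
            response = self.subscriber.pull(
                request={"subscription": self.subscription_path,
                         "max_messages": max_messages})
            return [m.message.data for m in response.received_messages]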
I am building a complex Python application that distributes data between very different services, devices, and APIs. Obviously, there is a lot of private authentication information. I am handling it by passing it as environment variables to a Supervisor-managed process using the environment= keyword in the configuration file.
I also have a test that checks whether all API authentication information is set up correctly and whether the external APIs are available. Currently I am using Nosetests as the test runner.
Is there a way to run the tests in the Supervisor context without brute-force parsing the Supervisor configuration file within my test runner?
I decided to use Python Celery, which is already installed on my machine. My API queries are wrapped as tasks and sent to Celery. Given this setup, I created my test runner as just another task that runs the API tests.
The web application tests do not need the stored credentials, but they run fine in the Celery context as well.
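A minimal sketch of that idea (the broker URL and test path are placeholders): the test runner is just another Celery task, so it executes inside a worker that Supervisor started with the environment= variables already applied.

    import nose
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def run_api_tests():
        # nose.run returns True if all tests passed.
        return nose.run(argv=["nosetests", "tests/api"])

The task can then be triggered with run_api_tests.delay() from anywhere, while the credentials never leave the worker's environment.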
I have a set of simple Python 2.7 scripts and a set of Linux nodes. I want to run these scripts on these nodes at a specific time.
Each script may run on any node, but a given script must not run on multiple nodes simultaneously.
So, I want to complete 3 simple tasks:
Deploy the set of scripts.
Run the main script with specific parameters on any node at a specific time.
Get the result when the script is finished.
It seems that I am able to complete the first task. I have the following code snippet:
    import urllib
    import urlparse
    from pyspark import SparkContext

    def path2url(path):
        return urlparse.urljoin(
            'file:', urllib.pathname2url(path))

    MASTER_URL = "spark://My-PC:7077"
    deploy_zip_path = "deploy.zip"
    sc = SparkContext(master=("%s" % MASTER_URL), appName="Job Submitter", pyFiles=[path2url("%s" % deploy_zip_path)])
But I have a problem: this code immediately launches tasks, whereas I only want to deploy the scripts to all nodes.
I would recommend keeping the code to deploy your PySpark scripts outside of your PySpark scripts.
Chronos is a job scheduler that runs on Apache Mesos, and Spark can run on Mesos. Chronos runs jobs as shell commands, so you can run your scripts with any arguments you specify. You will need to deploy Spark and your scripts to the Mesos nodes. Then, you can submit your Spark scripts from Chronos using spark-submit as the command.
You would store your results by writing to some kind of storage mechanism within your PySpark scripts. Spark has support for text files, HDFS, Amazon S3, and more. If Spark doesn't support the storage mechanism you need, you can use an external library that does. For example, I write to Cassandra in my PySpark scripts using cassandra-driver.
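For example, a minimal sketch of writing results from inside a PySpark script (the output URI is a placeholder; it could also be an s3a:// or local path):

    from pyspark import SparkContext

    sc = SparkContext(appName="Result Writer")
    # Collect script results into an RDD and persist them as text files.
    results = sc.parallelize(["node-1: ok", "node-2: ok"])
    results.saveAsTextFile("hdfs:///results/run-001")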