I have a python script that mounts a storage account in databricks and then installs a wheel from the storage account. I am trying to run it as a cluster init script but it keeps failing. My script is of the form:
#!/databricks/python/bin/python
mount_point = "/mnt/...."
configs = {....}
source = "...."
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(source = source, mount_point = mount_point, extra_configs = configs)
dbutils.library.install("dbfs:/mnt/.....")
dbutils.library.restartPython()
It works when I run it directly in a notebook, but if I save it to a file called dbfs:/databricks/init_scripts/datalakes/init.py and use it as a cluster init script, the cluster fails to start and the error message says that the init script has a non-zero exit status. I've checked the logs, and it appears that it is being run as bash instead of python:
bash: line 1: mount_point: command not found
I have tried running the python script from a bash script called init.bash containing this one line:
/databricks/python/bin/python "dbfs:/databricks/init_scripts/datalakes/init.py"
Then the cluster using init.bash fails to start, with the logs saying it can't find the python file:
/databricks/python/bin/python: can't open file 'dbfs:/databricks/init_scripts/datalakes/init.py': [Errno 2] No such file or directory
Can anyone tell me how I could get this working please?
Related question: Azure Databricks cluster init script - Install wheel from mounted storage
The solution I went with was to run a notebook which mounts the storage and creates a bash init script that just installs the wheel. Something like this:
mount_point = "/mnt/...."
configs = {....}
source = "...."
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(source = source, mount_point = mount_point, extra_configs = configs)
dbutils.fs.put("dbfs:/databricks/init_scripts/datalakes/init.bash","""
/databricks/python/bin/pip install "../../../dbfs/mnt/package-source/parser-3.0-py3-none-any.whl" """, True)
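To sanity-check the result before attaching it to the cluster, you can read the generated script back from DBFS in the same notebook (a minimal sketch; the path is the one used above and dbutils is only available in a notebook context):
print(dbutils.fs.head("dbfs:/databricks/init_scripts/datalakes/init.bash"))
The cluster's init script setting then points at dbfs:/databricks/init_scripts/datalakes/init.bash, which only has to run pip, so it works fine as a bash init script.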
I have a ROS2 package and have successfully created a Docker image of it. When I'm inside the container, I would like to run only a single node of the ROS2 package. First I set up the environment with PATH=$PATH:/home/user/.local/bin, then vcs import . <system_integration/ros.repos, then docker pull ghcr.io/test-inc/base_images:foxy. I run and enter the container with
docker run --name test -d --rm -v $(pwd):/home/ros2/foxy/src ghcr.io/company-inc/robot1_vnc_ros2:foxy
docker exec -it test /bin/bash
Then, when I'm inside the container, I build the package with
colcon build --symlink-install --event-handlers console_cohesion+ --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-up-to system_integration
So now I'm inside the container at root#1942eef8d977:~/ros2/foxy and would like to run one Python node. But ros2 run package_name node_name would not work, right? I'm not very familiar with Docker, so I'm not sure how to run the node. Any help?
Thanks
Before you can use ros2 run to run packages, you have to source the correct workspace. Otherwise tab auto-completion cannot find any packages and thus no package can be run.
To do so:
cd into the root path of your workspace
colcon build your workspace (your packages should be under the src directory)
run source install/setup.bash to source it
you can check with echo $COLCON_PREFIX_PATH that the prefix path is correctly set
Try rerunning the ros2 command to start your node
Have you sourced the setup file within the container?
Wherever the package source is located, you need to run source ./install/setup.bash
I typically install packages in EMR through Spark's install_pypi_package method. This limits where I can install packages from. How can I install a package from a specific GitHub branch? Is there a way I can do this through the install_pypi_package method?
If you have access to the cluster creation step, you can install the package from GitHub using pip in a bootstrap action. (install_pypi_package is needed because at that point the cluster is already running, and packages might not resolve on all nodes.)
Installing before the cluster is running:
A simple example of a bootstrap action (e.g. a download.sh bootstrap file) that installs from GitHub using pip is shown below; note that pip needs the git+ URL form, and a specific branch can be selected by appending @<branch-name> to it:
#!/bin/bash
sudo pip install git+<your-repo>.git@<branch-name>
then you can use this bash script as a bootstrap action:
aws emr create-cluster --name "Test cluster" --bootstrap-actions \
Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
or you can use pip3 in the bootstrap script
sudo pip3 install git+<your-repo>.git@<branch-name>
or just clone the repo and build it locally on EMR with its setup.py file
#!/bin/bash
git clone <your-repo>.git
cd <your-repo>
sudo python setup.py install
After the cluster is running (complex and not recommended)
If you still want to install or build a custom package when the cluster is already running, AWS has some explanation here that uses AWS-RunShellScript to install the package on all core nodes. It says:
(I) Install the package on the master node (by doing pip install on the running cluster via a shell, or via a Jupyter notebook on top of it).
(II) Run the following script locally on EMR, passing the cluster ID and the bootstrap script path (e.g. download.sh above) as arguments.
import argparse
import time
import boto3


def install_libraries_on_core_nodes(
        cluster_id, script_path, emr_client, ssm_client):
    """
    Copies and runs a shell script on the core nodes in the cluster.

    :param cluster_id: The ID of the cluster.
    :param script_path: The path to the script, typically an Amazon S3 object URL.
    :param emr_client: The Boto3 Amazon EMR client.
    :param ssm_client: The Boto3 AWS Systems Manager client.
    """
    core_nodes = emr_client.list_instances(
        ClusterId=cluster_id, InstanceGroupTypes=['CORE'])['Instances']
    core_instance_ids = [node['Ec2InstanceId'] for node in core_nodes]
    print(f"Found core instances: {core_instance_ids}.")

    commands = [
        # Copy the shell script from Amazon S3 to each node instance.
        f"aws s3 cp {script_path} /home/hadoop",
        # Run the shell script to install libraries on each node instance.
        "bash /home/hadoop/install_libraries.sh"]
    for command in commands:
        print(f"Sending '{command}' to core instances...")
        command_id = ssm_client.send_command(
            InstanceIds=core_instance_ids,
            DocumentName='AWS-RunShellScript',
            Parameters={"commands": [command]},
            TimeoutSeconds=3600)['Command']['CommandId']
        while True:
            # Verify the previous step succeeded before running the next step.
            cmd_result = ssm_client.list_commands(
                CommandId=command_id)['Commands'][0]
            if cmd_result['StatusDetails'] == 'Success':
                print("Command succeeded.")
                break
            elif cmd_result['StatusDetails'] in ['Pending', 'InProgress']:
                print(f"Command status is {cmd_result['StatusDetails']}, waiting...")
                time.sleep(10)
            else:
                print(f"Command status is {cmd_result['StatusDetails']}, quitting.")
                raise RuntimeError(
                    f"Command {command} failed to run. "
                    f"Details: {cmd_result['StatusDetails']}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('cluster_id', help="The ID of the cluster.")
    parser.add_argument('script_path', help="The path to the script in Amazon S3.")
    args = parser.parse_args()
    emr_client = boto3.client('emr')
    ssm_client = boto3.client('ssm')
    install_libraries_on_core_nodes(
        args.cluster_id, args.script_path, emr_client, ssm_client)


if __name__ == '__main__':
    main()
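The script above takes the cluster ID as its first argument. If you only know the cluster name, a minimal sketch for looking the ID up with boto3 (assuming default AWS credentials and region; 'Test cluster' is the name from the create-cluster example above) is:
import boto3

emr = boto3.client('emr')
# List clusters that are still alive and pick the one with the matching name.
active = emr.list_clusters(ClusterStates=['STARTING', 'RUNNING', 'WAITING'])
cluster_id = next(c['Id'] for c in active['Clusters'] if c['Name'] == 'Test cluster')
print(cluster_id)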
I want to run my Python program on IBM Cloud Functions; because of its dependencies, this needs to be done in an OpenWhisk Docker action. I've changed my code so that it accepts a JSON string:
import json
import sys

json_input = json.loads(sys.argv[1])
INSTANCE_NAME = json_input['INSTANCE_NAME']
I can run it from the terminal:
python main/main.py '{"INSTANCE_NAME": "example"}'
I've added this Python program to OpenWhisk with this Dockerfile:
# Dockerfile for example whisk docker action
FROM openwhisk/dockerskeleton
ENV FLASK_PROXY_PORT 8080
### Add source file(s)
ADD requirements.txt /action/requirements.txt
RUN cd /action; pip install -r requirements.txt
# Move the file to
ADD ./main /action
# Rename our executable Python action
ADD /main/main.py /action/exec
CMD ["/bin/bash", "-c", "cd actionProxy && python -u actionproxy.py"]
But now when I invoke it using the IBM Cloud CLI, I just get my JSON back:
ibmcloud fn action invoke --result e2t-whisk --param-file ../env_var.json
# {"INSTANCE_NAME": "example"}
And if I run it from the IBM Cloud Functions website with the same JSON input, I get an error as if the parameter isn't even there.
stderr: INSTANCE_NAME = json_input['INSTANCE_NAME']",
stderr: KeyError: 'INSTANCE_NAME'"
What can be wrong that the code runs when directly invoked, but not from the OpenWhisk container?
I am using the Python SDK package to run Docker from Python.
Here is the docker command I am trying to run using the Python package:
docker run -v /c/Users/msagovac/pdf_ocr:/home/docker jbarlow83/ocrmypdf-polyglot --skip-text 0ce9d58432bf41174dde7148486854e2.pdf output.pdf
Here is a python code:
import docker
client = docker.from_env()
client.containers.run('jbarlow83/ocrmypdf-polyglot', '--skip-text "0ce9d58432bf41174dde7148486854e2.pdf" "output.pdf"', "-v /c/Users/msagovac/pdf_ocr:/home/docker")
The error says file not found. I am not sure where to set the run options:
-v /c/Users/msagovac/pdf_ocr:/home/docker
Try with named parameters:
client.containers.run(
image='jbarlow83/ocrmypdf-polyglot',
command='--skip-text "0ce9d58432bf41174dde7148486854e2.pdf" "output.pdf"',
volumes={'/c/Users/msagovac/pdf_ocr': {'bind': '/home/docker', 'mode': 'rw'}},
)
Also it seems that the path of the volume to mount is incorrect, try with C:/Users/msagovac/pdf_ocr
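Putting the two suggestions together, a sketch of the full call on a Windows host (assuming the input PDF sits in C:/Users/msagovac/pdf_ocr) would be:
import docker

client = docker.from_env()
# Mount the host folder at /home/docker inside the container and run OCR;
# run() returns the container's log output once it finishes.
output = client.containers.run(
    image='jbarlow83/ocrmypdf-polyglot',
    command='--skip-text "0ce9d58432bf41174dde7148486854e2.pdf" "output.pdf"',
    volumes={'C:/Users/msagovac/pdf_ocr': {'bind': '/home/docker', 'mode': 'rw'}},
)
print(output)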
I am trying to run a very simple Python file that just prints "woof" inside a Docker container. As far as I know, I have created a Docker container called:
c5d3c4c383d1
I then run the following command, in an attempt to tell myself what directory I am running things from in docker:
sudo docker run c5d3c4c383d1 pwd
This returns the following value:
/
Which I assume to be my root directory, so I go to my root directory. Typing pwd shows:
/
I then create a file called meow.py via the nano command and enter a single line:
print("Woof!")
I save this and confirm this is in the / directory with an ls command.
I then enter the following:
sudo docker run c5d3c4c383d1 python meow.py
Which returns:
python: can't open file 'meow.py': [Errno 2] No such file or directory
I don't understand this. Apparently I am not in the root directory when running a command with Docker, yet the meow.py file is DEFINITELY in the root directory, but it says the file cannot be found. What the heck... As I said, when I run pwd within the Docker container it says I am in the / directory, so I shouldn't be getting this file-not-found error.
Docker runs a container ... that / is the container's own root directory ... think of it like a totally different machine that you would normally ssh into. Try something like this:
docker run -it c5d3c4c383d1 bash
That's basically like you have just ssh'd into your remote machine.
Go ahead and try some commands (ls, pwd, etc.)
Now run echo 'print("hello world")' > test.py
Now run ls and you should see your test.py ... go ahead and run it with python test.py
Now you can exit your container ... if you launch the same container again you should still have your test.py file there... although I think it's more common for people to write a Dockerfile that sets up their environment and then treat each session as disposable, as opposed to keeping the same container.