I'm trying to export a CSV to Azure Data Lake Storage, but when the file system/container does not exist the code breaks. I have also read through the documentation, but I cannot seem to find anything helpful for this situation.
How do I go about creating a container in Azure Data Lake Storage if the container specified by the user does not exist?
Current Code:
try:
    file_system_client = service_client.get_file_system_client(file_system="testfilesystem")
except Exception:
    file_system_client = service_client.create_file_system(file_system="testfilesystem")
Traceback:
(FilesystemNotFound) The specified filesystem does not exist.
RequestId:XXXX
Time:2021-03-31T13:39:21.8860233Z
The try/except pattern should not be used here, since the Azure Data Lake Gen2 library has a built-in exists() method on file_system_client.
First, make sure you've installed the latest version of the library: azure-storage-file-datalake 12.3.0. If you're not sure which version you're using, run the pip show azure-storage-file-datalake command to check the currently installed version.
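If you prefer checking the version from Python rather than the shell, here is a minimal sketch using only the standard library (Python 3.8+); the package name is the one mentioned above:

import importlib.metadata

# prints the installed version of the azure-storage-file-datalake package, e.g. "12.3.0"
print(importlib.metadata.version("azure-storage-file-datalake"))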
Then you can use the code below:
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
    "https", "xxx"), credential="xxx")

# the get_file_system_client method will not throw an error if the file system does not exist,
# provided you're using the latest library (12.3.0)
file_system_client = service_client.get_file_system_client("filesystem333")
print("the file system exists: " + str(file_system_client.exists()))

# create the file system if it does not exist
if not file_system_client.exists():
    file_system_client.create_file_system()
    print("the file system is created.")

# other code
I've tested it locally and it works successfully.
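Once the file system is guaranteed to exist, exporting the CSV is just an upload to a file client. A minimal sketch under the same setup (the file name out.csv and the csv_bytes variable are illustrative placeholders, not from the original question):

# csv_bytes holds the CSV content to export, e.g. df.to_csv(index=False).encode("utf-8")
csv_bytes = b"col1,col2\n1,2\n"

# get a client for the target file inside the (now existing) file system and upload the data
file_client = file_system_client.get_file_client("out.csv")
file_client.upload_data(csv_bytes, overwrite=True)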
I am creating an azure-ml webservice. The following script shows the code for creating the webservice and deploying it locally.
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model(ws, 'textDNN-20News')
ws.write_config(file_name='config.json')

env = Environment(name="init-env")
python_packages = ['numpy', 'pandas']
for package in python_packages:
    env.python.conda_dependencies.add_pip_package(package)

dummy_inference_config = InferenceConfig(
    environment=env,
    source_directory="./source_dir",
    entry_script="./init_score.py",
)

from azureml.core.webservice import LocalWebservice

deployment_config = LocalWebservice.deploy_configuration(port=6789)

service = Model.deploy(
    ws,
    "myservice",
    [model],
    dummy_inference_config,
    deployment_config,
    overwrite=True,
)
service.wait_for_deployment(show_output=True)
As can be seen, the above code deploys the entry script (init_score.py) to my local machine. Within the entry script, I need to load the workspace again to connect to an Azure SQL database. I do it like the following:
from azureml.core import Dataset, Datastore
from azureml.data.datapath import DataPath
from azureml.core import Workspace

def init():
    pass

def run(data):
    try:
        ws = Workspace.from_config()

        # create tabular dataset from a SQL database in datastore
        datastore = Datastore.get(ws, 'sql_db_name')
        query = DataPath(datastore, 'SELECT * FROM my_table')
        tabular = Dataset.Tabular.from_sql_query(query, query_timeout=10)
        df = tabular.to_pandas_dataframe()

        return len(df)
    except Exception as e:
        output0 = "{}:".format(type(e).__name__)
        output1 = "{} ".format(e)
        output2 = f"{type(e).__name__} occurred at line {e.__traceback__.tb_lineno} of {__file__}"
        return output0 + output1 + output2
The try/except block is there to catch any potential exception and return it as the output.
The exception that I keep getting is:
UserErrorException: The workspace configuration file config.json, could not be found in /var/azureml-app or its
parent directories. Please check whether the workspace configuration file exists, or provide the full path
to the configuration file as an argument. You can download a configuration file for your workspace,
via http://ml.azure.com and clicking on the name of your workspace in the right top.
I have actually tried saving the config file by passing an absolute path to the path argument of ws.write_config(path='my_absolute_path'), and also passing that path when loading it with Workspace.from_config(path='my_absolute_path'), but I got pretty much the same error:
UserErrorException: The workspace configuration file config.json, could not be found in /var/azureml-app/my_absolute_path or its
parent directories. Please check whether the workspace configuration file exists, or provide the full path
to the configuration file as an argument. You can download a configuration file for your workspace,
via http://ml.azure.com and clicking on the name of your workspace in the right top.
It looks like even providing the path does not change the root directory from which the entry script starts looking.
I also tried saving the file directly to /var/azureml-app/, but this path is not recognized when I pass it to ws.write_config(path='/var/azureml-app/').
Do you have any idea where exactly /var/azureml-app/ is?
Any idea on how to fix this?
Hi, I am pretty new to this AWS world. What I am trying to do is connect a Python client to the AWS IoT service and publish a message. I am using the Python SDK and its example, but I have problems with the certificate setup. I have already created the thing, the policies, and the certificate, and I downloaded the files, but in the Python program I have no idea whether I am writing the path to these files correctly.
First I tried writing the whole path of each file and got nothing, then I tried just putting "certificados\thefile" and still nothing.
The error that pops up says the problem is the path, which is precisely what I do not know how to write.
Thanks for taking the time, and sorry if this question is too basic; I am just jumping into this.
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import time as t
import json
import AWSIoTPythonSDK.MQTTLib as AWSIoTPyMQTT

# Define ENDPOINT, CLIENT_ID, PATH_TO_CERT, PATH_TO_KEY, PATH_TO_ROOT, MESSAGE, TOPIC, and RANGE
ENDPOINT = "MYENDPOINT"
CLIENT_ID = "testDevice"
PATH_TO_CERT = "certificados/5a7e19a0269abe740ac8b38a1bfdab115d14074eb212167a3ba359c0d237a8c3-certificate.pem.crt"
PATH_TO_KEY = "certificados/5a7e19a0269abe740ac8b38a1bfdab115d14074eb212167a3ba359c0d237a8c3-private.pem.key"
PATH_TO_ROOT = "certificados/AmazonRootCA1.pem"
MESSAGE = "Hello World"
TOPIC = "Prueba/A"
RANGE = 20

myAWSIoTMQTTClient = AWSIoTPyMQTT.AWSIoTMQTTClient(CLIENT_ID)
myAWSIoTMQTTClient.configureEndpoint(ENDPOINT, 8883)
myAWSIoTMQTTClient.configureCredentials(PATH_TO_ROOT, PATH_TO_KEY, PATH_TO_CERT)
myAWSIoTMQTTClient.connect()

print('Begin Publish')
for i in range(RANGE):
    data = "{} [{}]".format(MESSAGE, i + 1)
    message = {"message": data}
    myAWSIoTMQTTClient.publish(TOPIC, json.dumps(message), 1)
    print("Published: '" + json.dumps(message) + "' to the topic: " + "'test/testing'")
    t.sleep(0.1)
print('Publish End')
myAWSIoTMQTTClient.disconnect()
I have created a directory on my desktop to store these files; its name is "certificados" and that is where I am taking the path from, but it doesn't work.
OSError: certificados/AmazonRootCA1.pem: No such file or directory
Also, I am using VS Code to run this application.
The error is pretty clear: it can't find the CA cert file at the path you've given it. A relative path is interpreted relative to the current working directory of the process, which (especially when launching from VS Code) is not necessarily the folder containing the Python file itself. If that isn't the Desktop, then you need to provide the fully qualified path:
So assuming Linux, change the paths to:
PATH_TO_CERT = "/home/user/Desktop/certificados/5a7e19a0269abe740ac8b38a1bfdab115d14074eb212167a3ba359c0d237a8c3-certificate.pem.crt"
PATH_TO_KEY = "/home/user/Desktop/certificados/5a7e19a0269abe740ac8b38a1bfdab115d14074eb212167a3ba359c0d237a8c3-private.pem.key"
PATH_TO_ROOT = "/home/user/Desktop/certificados/AmazonRootCA1.pem"
I am trying to run tabula-py on AWS Lambda in a Python 3.7 environment. The code is quite straightforward:
import tabula

def main(event, context):
    try:
        print(event['Url'])
        df = tabula.read_pdf(event['Url'])
        print(str(df))
        return {
            "StatusCode": 200,
            "ResponseCode": 0,
            "ResponseMessage": str(df)
        }
    except Exception as e:
        print('exception = %r' % e)
        return {
            "ResponseCode": 1,
            "ErrorMessage": str(e)
        }
As you can see, there's just one real line of code: the tabula.read_pdf() call. I am not writing any files anywhere, yet I am getting the exception exception = OSError(30, 'Read-only file system').
FYI, the tabula details are available here
Following is what I've already tried that didn't work:
Verified that the URL is read correctly. Also tried with a hard-coded link in the code.
Searched Google, Stack Overflow & co., but did not find anything that solves this issue.
Removed the __pycache__ directory from the ZIP before uploading it to update the code. Also ensured no OS-specific local directory is in the Lambda deployment package.
Any help will be highly appreciated.
tabula writes temporary files to the filesystem, and on Lambda only /tmp is writable; for now you could try a different PDF table scraping package instead, such as camelot.
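A minimal sketch of the camelot route, assuming camelot-py and its dependencies can be packaged within the Lambda size limits (the event field and page selection below are illustrative, not from the original question):

import camelot

def main(event, context):
    # camelot.read_pdf accepts a file path or URL and returns a list-like TableList
    tables = camelot.read_pdf(event['Url'], pages='1')
    # each table exposes a pandas DataFrame via .df
    df = tables[0].df
    return {
        "StatusCode": 200,
        "ResponseCode": 0,
        "ResponseMessage": df.to_json(orient='records')
    }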
I have a Lambda function that needs to use pandas, sqlalchemy, and cx_Oracle.
Installing and packaging all these libraries together exceeds the 250MB deployment package limit of AWS Lambda.
I would like to include only the .zip of the Oracle Basic Light Package, then extract and use it at runtime.
What I have tried
My project is structured as follows:
cx_Oracle-7.2.3.dist-info/
dateutil/
numpy/
pandas/
pytz/
six-1.12.0.dist-info/
sqlalchemy/
SQLAlchemy-1.3.8.egg-info/
cx_Oracle.cpython-36m-x86_64-linux-gnu.so
instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip
main.py
six.py
template.yml
In main.py, I run the following:
import json, traceback, os
import sqlalchemy as sa
import pandas as pd

def main(event, context):
    try:
        unzip_oracle()
        return {'statusCode': 200,
                'body': json.dumps(run_query()),
                'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}}
    except:
        em = traceback.format_exc()
        print("Error encountered. Error is: \n" + str(em))
        return {'statusCode': 500,
                'body': str(em),
                'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}}

def unzip_oracle():
    print('extracting oracle drivers and copying results to /var/task/lib')
    os.system('unzip /var/task/instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip -d /tmp')
    print('extraction steps complete')
    os.system('export ORACLE_HOME=/tmp/instantclient_19_3')

def get_db_connection():
    return sa.engine.url.URL('oracle+cx_oracle',
                             username='do_not_worry', password='about_any',
                             host='of_these', port=1521,
                             query=dict(service_name='details')
                             )

def run_query():
    query_text = """SELECT * FROM dont_worry_about_it"""
    conn = sa.create_engine(get_db_connection())
    print('Connected')
    df = pd.read_sql(sa.text(query_text), conn)
    print(df.shape)
    return df.to_json(orient='records')
This returns the error:
sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory". See https://oracle.github.io/odpi/doc/installation.html#linux for help
(Background on this error at: http://sqlalche.me/e/4xp6)
What I have also tried
I've tried:
Adding
Environment:
  Variables:
    ORACLE_HOME: /tmp
    LD_LIBRARY_PATH: /tmp
to template.yml and redeploying. Same error as above.
Adding os.system('export LD_LIBRARY_PATH=/tmp/instantclient_19_3') into the python script. Same error as above.
Various cp and ln commands, which are forbidden in Lambda outside of the /tmp folder. Same error as above.
One way that works, but is bad
If I make a folder called lib/ in the Lambda package, and include an odd assortment of libaio.so.1, libclntsh.so, etc. files, the function will work as expected, for some reason. I ended up with this:
<all the other libraries and files as above>
lib/
-libaio.so.1
-libclntsh.so
-libclntsh.so.10.1
-libclntsh.so.11.1
-libclntsh.so.12.1
-libclntsh.so.18.1
-libclntsh.so.19.1
-libclntshcore.so.19.1
-libipc1.so
-libmql1.so
-libnnz19.so
-libocci.so
-libocci.so.10.1
-libocci.so.11.1
-libocci.so.12.1
-libocci.so.18.1
-libocci.so.19.1
-libociicus.so
-libons.so
However, I chose these files through trial and error and don't want to go through this again.
Is there a way to unzip instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip in Lambda at runtime, and make Lambda see/use it to connect to an Oracle database?
I am not by any means an expert at Python, but these lines seem very strange:
print('extracting oracle drivers and copying results to /var/task/lib')
os.system('unzip /var/task/instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip -d /tmp')
print('extraction steps complete')
os.system('export ORACLE_HOME=/tmp/instantclient_19_3')
Normally, you will have very limited access to OS-level APIs with Lambda. And even when you do, they can behave in ways you do not expect. (Think of it as: who owns the "unzip" command? A file created by that command would be visible/invokable by whom?)
I see you mentioned that you have no issue extracting the files, which is also a bit strange.
My only answer for you is:
1/ Try to "bring your own" tools (unzip, etc.).
2/ Never rely on OS-level calls like os.system('export ...'): the export runs in a child shell and never reaches your Python process or the Oracle client. Always use the full path (see the sketch after this list).
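To illustrate the point about os.system('export ...'), here is a minimal sketch showing that the export does not affect the running process, and that os.environ is the in-process way to set a variable. Note that LD_LIBRARY_PATH in particular is read by the dynamic loader at startup, so for the Oracle client it generally has to be set before the process starts, i.e. through the Lambda environment variables in template.yml as shown below (this sketch is illustrative, not the author's code):

import os
import subprocess

# runs in a child shell; the variable dies with that shell
os.system('export ORACLE_HOME=/tmp/instantclient_19_3')
print(os.environ.get('ORACLE_HOME'))  # still None in this Python process

# modifying os.environ changes this process and anything it spawns afterwards
os.environ['ORACLE_HOME'] = '/tmp/instantclient_19_3'
print(subprocess.run(['printenv', 'ORACLE_HOME'],
                     capture_output=True, text=True).stdout)  # /tmp/instantclient_19_3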
Looking again at your question, it seems like the way you specify the environment variables is conflicting:
ORACLE_HOME: /tmp
Should it not be:
Environment:
  Variables:
    ORACLE_HOME: /tmp/instantclient_19_3
    LD_LIBRARY_PATH: /tmp/instantclient_19_3
Also, see: How to access an AWS Lambda environment variable from Python
Python is my preferred language but any supported by Lambda will do.
-- All AWS Architecture --
I have Prod, Beta, and Gamma branches and corresponding folders in S3. I am looking for a method to have Lambda respond to a CodeCommit trigger and based on the Branch that triggered it, clone the repo and place the files in the appropriate S3 folder.
S3://Example-Folder/Application/Prod
S3://Example-Folder/Application/Beta
S3://Example-Folder/Application/Gamma
I tried to utilize GitPython but it does not work because Lambda does not have Git installed on the base Lambda AMI and GitPython depends on it.
I also looked through the Boto3 docs and there are only custodial tasks available; it is not able to return the project files.
Thank you for the help!
The latest version of the boto3 CodeCommit client includes the methods get_differences and get_blob.
You can get all the content of a CodeCommit repository using these two methods (at least, if you are not interested in retaining the .git history).
The script below takes all the content of the master branch and adds it to a tar file. Afterwards you could upload it to S3 as you please.
You can run this as a Lambda function, which can be invoked when you push to CodeCommit.
This works with the current Lambda Python 3.6 environment.
botocore==1.5.89
boto3==1.4.4
import boto3
import pathlib
import tarfile
import io
import sys

def get_differences(repository_name, branch="master"):
    # page through get_differences until no nextToken is returned
    differences = []
    response = codecommit.get_differences(
        repositoryName=repository_name,
        afterCommitSpecifier=branch,
    )
    differences += response.get("differences", [])
    while "nextToken" in response:
        response = codecommit.get_differences(
            repositoryName=repository_name,
            afterCommitSpecifier=branch,
            nextToken=response["nextToken"]
        )
        differences += response.get("differences", [])
    return differences

if __name__ == "__main__":
    repository_name = sys.argv[1]
    codecommit = boto3.client("codecommit")
    repository_path = pathlib.Path(repository_name)

    buf = io.BytesIO()
    with tarfile.open(None, mode="w:gz", fileobj=buf) as tar:
        for difference in get_differences(repository_name):
            blobid = difference["afterBlob"]["blobId"]
            path = difference["afterBlob"]["path"]
            mode = difference["afterBlob"]["mode"]  # noqa
            blob = codecommit.get_blob(
                repositoryName=repository_name, blobId=blobid)
            tarinfo = tarfile.TarInfo(str(repository_path / path))
            tarinfo.size = len(blob["content"])
            tar.addfile(tarinfo, io.BytesIO(blob["content"]))

    tarobject = buf.getvalue()
    # save to s3
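To complete the "# save to s3" step, a minimal sketch using the boto3 S3 client; the bucket name and key below follow the folder layout from the question and are placeholders, not part of the original answer:

s3 = boto3.client("s3")

# upload the in-memory tar.gz archive; adjust Bucket/Key to your own layout
s3.put_object(
    Bucket="Example-Folder",
    Key="Application/Prod/{}.tar.gz".format(repository_name),
    Body=tarobject,
)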
Looks like LambCI does exactly what you want.
Unfortunately, CodeCommit currently doesn't have an API to upload the repository to an S3 bucket. However, if you are open to trying out CodePipeline, you can configure AWS CodePipeline to use a branch in an AWS CodeCommit repository as the source stage for your code. That way, when you make changes to your selected tracking branch, an archive of the repository at the tip of that branch will be delivered to your CodePipeline bucket. For more information about CodePipeline, please refer to the following link:
http://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-simple-codecommit.html