I have a Lambda function that needs to use pandas, sqlalchemy, and cx_Oracle.
Installing and packaging all these libraries together exceeds the 250MB deployment package limit of AWS Lambda.
I would like to include only the .zip of the Oracle Basic Light Package, then extract and use it at runtime.
What I have tried
My project is structured as follows:
cx_Oracle-7.2.3.dist-info/
dateutil/
numpy/
pandas/
pytz/
six-1.12.0.dist-info/
sqlalchemy/
SQLAlchemy-1.3.8.egg-info/
cx_Oracle.cpython-36m-x86_64-linux-gnu.so
instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip
main.py
six.py
template.yml
In main.py, I run the following:
import json, traceback, os
import sqlalchemy as sa
import pandas as pd
def main(event, context):
    try:
        unzip_oracle()
        return {'statusCode': 200,
                'body': json.dumps(run_query()),
                'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}}
    except:
        em = traceback.format_exc()
        print("Error encountered. Error is: \n" + str(em))
        return {'statusCode': 500,
                'body': str(em),
                'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}}

def unzip_oracle():
    print('extracting oracle drivers and copying results to /var/task/lib')
    os.system('unzip /var/task/instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip -d /tmp')
    print('extraction steps complete')
    os.system('export ORACLE_HOME=/tmp/instantclient_19_3')

def get_db_connection():
    return sa.engine.url.URL('oracle+cx_oracle',
                             username='do_not_worry', password='about_any',
                             host='of_these', port=1521,
                             query=dict(service_name='details'))

def run_query():
    query_text = """SELECT * FROM dont_worry_about_it"""
    conn = sa.create_engine(get_db_connection())
    print('Connected')
    df = pd.read_sql(sa.text(query_text), conn)
    print(df.shape)
    return df.to_json(orient='records')
This returns the error:
sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory". See https://oracle.github.io/odpi/doc/installation.html#linux for help
(Background on this error at: http://sqlalche.me/e/4xp6)
What I have also tried
I've tried:
Adding
Environment:
  Variables:
    ORACLE_HOME: /tmp
    LD_LIBRARY_PATH: /tmp
to template.yml and redeploying. Same error as above.
Adding os.system('export LD_LIBRARY_PATH=/tmp/instantclient_19_3') into the python script. Same error as above.
Many cp and ln things that were forbidden in Lambda outside of the /tmp folder. Same error as above.
One way that works, but is bad
If I make a folder called lib/ in the Lambda package, and include an odd assortment of libaio.so.1, libclntsh.so, etc. files, the function will work as expected, for some reason. I ended up with this:
<all the other libraries and files as above>
lib/
-libaio.so.1
-libclntsh.so
-libclntsh.so.10.1
-libclntsh.so.11.1
-libclntsh.so.12.1
-libclntsh.so.18.1
-libclntsh.so.19.1
-libclntshcore.so.19.1
-libipc1.so
-libmql1.so
-libnnz19.so
-libocci.so
-libocci.so.10.1
-libocci.so.11.1
-libocci.so.12.1
-libocci.so.18.1
-libocci.so.19.1
-libociicus.so
-libons.so
However, I chose these files through trial and error and don't want to go through this again.
Is there a way to unzip instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip in Lambda at runtime, and make Lambda see/use it to connect to an Oracle database?
I am not by any means an expert at Python, but these lines seem very strange:
print('extracting oracle drivers and copying results to /var/task/lib')
os.system('unzip /var/task/instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip -d /tmp')
print('extraction steps complete')
os.system('export ORACLE_HOME=/tmp/instantclient_19_3')
Normally, you will have very limited access to OS-level APIs in Lambda. And even when you do, they can behave in ways you do not expect. (Think of it this way: who owns the unzip binary? Which user can see or invoke the files created by that command?)
I see you mentioned that you have no issue extracting the files, which is also a bit strange.
My only answer for you is:
1/ Try to "bring your own" tools (unzip, etc.).
2/ Never rely on OS-level calls like os.system('export ...'); always use the full path (a minimal sketch follows below).
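To illustrate point 2/, here is a minimal sketch (my assumption of how it could look, reusing the paths from your question): extract with Python's zipfile module rather than shelling out to unzip, and set os.environ instead of os.system('export ...'), which only changes a throwaway subshell. Keep in mind the dynamic loader reads LD_LIBRARY_PATH at process start, so setting it here may still not be enough for libclntsh.so to be found.
import os
import zipfile

def unzip_oracle():
    # Extract with the zipfile module instead of shelling out to `unzip`,
    # which may not be present in the Lambda runtime image.
    if not os.path.exists('/tmp/instantclient_19_3'):
        with zipfile.ZipFile('/var/task/instantclient-basiclite-linux.x64-19.3.0.0.0dbru.zip') as zf:
            zf.extractall('/tmp')
    # os.system('export ...') only affects a subshell; os.environ at least
    # changes this process, but LD_LIBRARY_PATH is read by the loader at
    # process start, so this alone may not make libclntsh.so visible.
    os.environ['ORACLE_HOME'] = '/tmp/instantclient_19_3'
    os.environ['LD_LIBRARY_PATH'] = '/tmp/instantclient_19_3'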
Looking again at your question, it seems like the way you specify the environment variables is conflicting:
ORACLE_HOME: /tmp
Shouldn't it be:
Environment:
  Variables:
    ORACLE_HOME: /tmp/instantclient_19_3
    LD_LIBRARY_PATH: /tmp/instantclient_19_3
Also, see: How to access an AWS Lambda environment variable from Python
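For reference, reading those variables from Python is just os.environ (a trivial sketch):
import os

oracle_home = os.environ.get('ORACLE_HOME')          # e.g. /tmp/instantclient_19_3
ld_library_path = os.environ.get('LD_LIBRARY_PATH')  # e.g. /tmp/instantclient_19_3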
Related
I'm trying to export a CSV to an Azure Data Lake Storage but when the file system/container does not exist the code breaks. I have also read through the documentation but I cannot seem to find anything helpful for this situation.
How do I go about creating a container in Azure Data Lake Storage if the container specified by the user does not exist?
Current Code:
try:
    file_system_client = service_client.get_file_system_client(file_system="testfilesystem")
except Exception:
    file_system_client = service_client.create_file_system(file_system="testfilesystem")
Traceback:
(FilesystemNotFound) The specified filesystem does not exist.
RequestId:XXXX
Time:2021-03-31T13:39:21.8860233Z
The try/except pattern should not be used here, since the Azure Data Lake gen2 library has a built-in exists() method on file_system_client.
First, make sure you've installed the latest version of the library: azure-storage-file-datalake 12.3.0. If you're not sure which version you're using, run the pip show azure-storage-file-datalake command to check.
Then you can use the code below:
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
    "https", "xxx"), credential="xxx")

# the get_file_system_client method will not throw an error if the file system
# does not exist, if you're using the latest library 12.3.0
file_system_client = service_client.get_file_system_client("filesystem333")
print("the file system exists: " + str(file_system_client.exists()))

# create the file system if it does not exist
if not file_system_client.exists():
    file_system_client.create_file_system()
    print("the file system is created.")

# other code
I've tested it locally and it works successfully.
I am trying to run tabula-py on AWS Lambda in a Python 3.7 environment. The code is quite straightforward:
import tabula

def main(event, context):
    try:
        print(event['Url'])
        df = tabula.read_pdf(event['Url'])
        print(str(df))
        return {
            "StatusCode": 200,
            "ResponseCode": 0,
            "ResponseMessage": str(df)
        }
    except Exception as e:
        print('exception = %r' % e)
        return {
            "ResponseCode": 1,
            "ErrorMessage": str(e)
        }
As you can see, there's just one real line of code, the tabula.read_pdf() call. I am not writing files anywhere, yet I am getting the exception exception = OSError(30, 'Read-only file system')
FYI, the tabula details are available here
Following is what I've already tried and didn't work :
Verified that the URL is read correctly. Also tried a hard-coded link in the code.
Checked Google, Stack Overflow & co. but did not find anything that solves this issue.
Removed the __pycache__ directory from the ZIP before uploading it to update the code. Also ensured no OS-specific local directory is in the Lambda deployment package.
Any help will be highly appreciated.
tabula writes to the OS filesystem (which is read-only on Lambda outside /tmp); you can try a different PDF table-scraping package for now, such as camelot.
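A rough sketch of that substitution (my own assumption rather than tested code: it presumes camelot-py and its Ghostscript dependency can be packaged within Lambda's limits, and that the PDF at event['Url'] is accessible):
import camelot

def main(event, context):
    # camelot.read_pdf returns a TableList; each table exposes a pandas DataFrame as .df
    tables = camelot.read_pdf(event['Url'])
    return {
        "StatusCode": 200,
        "ResponseCode": 0,
        "ResponseMessage": tables[0].df.to_json(orient='records')
    }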
I have created an Event Grid triggered Azure Function in Python. I have deployed my solution to Azure successfully and the execution works fine. But I have an issue with calling another Python script in the same folder. My code is given below:
import os, json, subprocess
import logging
import azure.functions as func

def main(event: func.EventGridEvent):
    try:
        correctionsMessages = event.get_json()
        for correctionMessage in correctionsMessages:
            strMessage = json.dumps(correctionMessage)
            full_path_to_script = os.path.join(os.path.dirname(os.path.realpath(__file__)) + '/' + correctionMessage['ScriptName'] + '.py')
            logging.info('Script Path: %s', full_path_to_script)
            logging.info('Parameter: %s', json.dumps(correctionMessage))
            subprocess.check_call('python ' + full_path_to_script + ' ' + json.dumps(strMessage))
        result = json.dumps({
            'id': event.id,
            'data': event.get_json(),
            'topic': event.topic,
            'subject': event.subject,
            'event_type': event.event_type,
        })
        logging.info('Python EventGrid trigger processed an event: %s', result)
    except Exception as e:
        logging.info('Error: %s', e)
The above code gives an error for subprocess.check_call. The error is "Error: [Errno 2] No such file or directory: 'python /home/site/wwwroot/Detections/Script1.py'". Script1.py is in the same folder as __init__.py. When I run this function locally, it works absolutely fine.
In my experience, the error was caused by the subprocess.check_call function not knowing the path of the python executable, not by the path to Script1.py.
In your local Azure Functions development environment, the python path is configured in the PATH environment variable, so subprocess.check_call can invoke python by searching for the executable in the paths from that variable. But in the cloud there is no python path pre-configured in that environment variable; only the Azure Functions host knows the real absolute path to Python.
So the solution is to find out the real absolute path of Python and use it instead of python in your code.
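For example (a small sketch, not from the question, reusing full_path_to_script and strMessage from the code above: sys.executable is the absolute path of the interpreter hosting the worker, and passing a list avoids relying on shell parsing):
import sys
import json
import subprocess

# Run the helper script with the worker's own interpreter.
subprocess.check_call([sys.executable, full_path_to_script, json.dumps(strMessage)])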
However, in the Azure Functions Python runtime, I think it's not a good idea to use subprocess.check_call to spawn a child process to process a given message. The safer and more correct way is to define a function in Script1.py, or directly in __init__.py, and pass the given message to it as a parameter to achieve the same result.
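A minimal sketch of that in-process approach (assuming, hypothetically, that Script1.py exposes a run() function accepting the message dict; the function name is mine, not from the question):
import json
import logging

import azure.functions as func

from . import Script1  # Script1.py sits next to __init__.py, so a relative import works

def main(event: func.EventGridEvent):
    for correctionMessage in event.get_json():
        # Call the helper directly instead of spawning a child Python process.
        Script1.run(correctionMessage)
        logging.info('Processed message: %s', json.dumps(correctionMessage))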
I have run into a peculiar issue while using the teradatasql package (installed from pypi). I use the following code (let's call it pytera.py) to query a database:
import os
from dotenv import load_dotenv
import pandas as pd
import teradatasql

# Load the database credentials from .env file
_ = load_dotenv()
db_host = os.getenv('db_host')
db_username = os.getenv('db_username')
db_password = os.getenv('db_password')

def run_query(query):
    """Run query string on teradata and return DataFrame."""
    if query.strip()[-1] != ';':
        query += ';'
    with teradatasql.connect(host=db_host, user=db_username,
                             password=db_password) as connect:
        df = pd.read_sql(query, connect)
    return df
When I import this function in the IPython/Python interpreter or in Jupyter Notebook, I can run queries just fine like so:
import pytera as pt
pt.run_query('select top 5 * from table_name;')
However, if I save the above code in a .py file and try to run it, I get an error message most of the time (not all the time). The error message is below.
E teradatasql.OperationalError: [Version 16.20.0.49] [Session 0] [Teradata SQL Driver] Hostname lookup failed for None
E at gosqldriver/teradatasql.(*teradataConnection).makeDriverError TeradataConnection.go:1046
E at gosqldriver/teradatasql.(*Lookup).getAddresses CopDiscovery.go:65
E at gosqldriver/teradatasql.discoverCops CopDiscovery.go:137
E at gosqldriver/teradatasql.newTeradataConnection TeradataConnection.go:133
E at gosqldriver/teradatasql.(*teradataDriver).Open TeradataDriver.go:32
E at database/sql.dsnConnector.Connect sql.go:600
E at database/sql.(*DB).conn sql.go:1103
E at database/sql.(*DB).Conn sql.go:1619
E at main.goCreateConnection goside.go:229
E at main._cgoexpwrap_e6e101e164fa_goCreateConnection _cgo_gotypes.go:214
E at runtime.call64 asm_amd64.s:574
E at runtime.cgocallbackg1 cgocall.go:316
E at runtime.cgocallbackg cgocall.go:194
E at runtime.cgocallback_gofunc asm_amd64.s:826
E at runtime.goexit asm_amd64.s:2361
E Caused by lookup None on <ip address redacted>: server misbehaving
I am using Python 3.7.3 and teradatasql 16.20.0.49 on Ubuntu (WSL) 18.04.
Perhaps not coincidentally, I run into a similar issue when trying a similar workflow on Windows (using the teradata package and the Teradata Python drivers installed). Works when I connect inside the interpreter or in Jupyter, but not in a script. In the Windows case, the error is:
E teradata.api.DatabaseError: (10380, '[08001] [Teradata][ODBC] (10380) Unable to establish connection with data source. Missing settings: {[DBCName]}')
I have a feeling that there's something basic that I'm missing, but I can't find a solution to this anywhere.
Thanks, ravioli, for the fresh eyes. It turns out the issue was loading the environment variables using dotenv. My module is in a Python package (a separate folder), and my script and .env files are in the working directory.
dotenv successfully reads the environment variables (.env in my working directory) when I run the code from my original post line by line in the interpreter or in Jupyter. However, when I run the same code in a script, it does not find the .env file in my working directory. That will be a separate question I'll have to find the answer to.
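(One possible fix, sketched here as my own assumption rather than part of the original answer: give load_dotenv an explicit path so the lookup no longer depends on the working directory.)
import os
from pathlib import Path
from dotenv import load_dotenv

# Assume .env sits next to the script being run.
load_dotenv(Path(__file__).resolve().parent / '.env')
db_host = os.getenv('db_host')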
import teradatasql
import pandas as pd

def run_query(query, db_host, db_username, db_password):
    """Run query string on teradata and return DataFrame."""
    if query.strip()[-1] != ';':
        query += ';'
    with teradatasql.connect(host=db_host, user=db_username,
                             password=db_password) as connect:
        df = pd.read_sql(query, connect)
    return df
The code below runs fine in a script now:
import os
import pytera as pt
from dotenv import load_dotenv

_ = load_dotenv()
db_host = os.getenv('db_host')
db_username = os.getenv('db_username')
db_password = os.getenv('db_password')

data = pt.run_query('select top 5 * from table_name;', db_host, db_username, db_password)
It looks like your client can't find the Teradata server, which is why you see that DBCName missing error. This should be the "system name" of your Teradata server (i.e. TDServProdA).
A couple things to try:
If you are trying to connect directly with a hostname, try disabling COP discovery in your connection with this flag: cop = false (a sketch of this follows after the excerpt below). More info
Try updating your hosts file on your local system. From the documentation:
Modifying the hosts File
If your site does not use DNS, you must define the IP address and the Teradata Database name to use in the system hosts file on the computer.
Locate the hosts file on the computer. This file is typically located in the following folder: %SystemRoot%\system32\drivers\etc
Open the file with a text editor, such as Notepad.
Add the following entry to the file: xxx.xx.xxx.xxx sssCOP1, where xxx.xx.xxx.xxx is the IP address and sss is the Teradata Database name.
Save the hosts file.
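As a sketch of the first suggestion (my reading of the teradatasql documentation rather than code from the question; the host and credentials are placeholders):
import teradatasql

# Connect straight to the given hostname and skip COP discovery.
with teradatasql.connect(host='my-td-host', user='my_user',
                         password='my_password', cop='false') as con:
    cur = con.cursor()
    cur.execute('select 1')
    print(cur.fetchall())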
I have been searching for a couple of days for a solution, without success.
We have built a Windows service to copy some files from one location to another.
So I built the code shown below with Python 3.7.
The full code can be found on GitHub.
When I run the service using Python, everything works fine: I can install the service and also start it.
This is done using the following commands:
Install the service:
python jis53_backup.py install
Run the service:
python jis53_backup.py start
When I compile this code using pyinstaller with the command:
pyinstaller -F --hidden-import=win32timezone jis53_backup.py
After the exe is created, I can install the service, but when trying to start it I get the error:
Error starting service: The service did not respond to the start or
control request in a timely fashion
I have gone through multiple posts on Stack Overflow and Google related to this error, however without success. I don't have the option to install Python 3.7 on the PCs that need to run this service. That's why we are trying to get a .exe build.
I have made sure to have the path updated according to the information that I found in the different questions.
Image of path definitions:
I also copied the pywintypes37.dll file.
From -> Python37\Lib\site-packages\pywin32_system32
To -> Python37\Lib\site-packages\win32
Does anyone have any other suggestions on how to get this working?
'''
Windows service to copy a file from one location to another
at a certain interval.
'''
import sys
import time
from distutils.dir_util import copy_tree

import servicemanager
import win32serviceutil
import win32service

from HelperModules.CheckFileExistance import check_folder_exists, create_folder
from HelperModules.ReadConfig import (check_config_file_exists,
                                      create_config_file, read_config_file)
from ServiceBaseClass.SMWinService import SMWinservice

sys.path += ['filecopy_service/ServiceBaseClass',
             'filecopy_service/HelperModules']


class Jis53Backup(SMWinservice):
    _svc_name_ = "Jis53Backup"
    _svc_display_name_ = "JIS53 backup copy"
    _svc_description_ = "Service to copy files from server to local drive"

    def start(self):
        self.conf = read_config_file()
        if not check_folder_exists(self.conf['dest']):
            create_folder(self.conf['dest'])
        self.isrunning = True

    def stop(self):
        self.isrunning = False

    def main(self):
        self.ReportServiceStatus(win32service.SERVICE_RUNNING)
        while self.isrunning:
            # Copy the files from the server to a local folder
            # TODO: build function to trigger only when a file is changed.
            copy_tree(self.conf['origin'], self.conf['dest'], update=1)
            time.sleep(30)


if __name__ == '__main__':
    if sys.argv[1] == 'install':
        if not check_config_file_exists():
            create_config_file()
    if len(sys.argv) == 1:
        servicemanager.Initialize()
        servicemanager.PrepareToHostSingle(Jis53Backup)
        servicemanager.StartServiceCtrlDispatcher()
    else:
        win32serviceutil.HandleCommandLine(Jis53Backup)
I was also facing this issue after compiling with pyinstaller. For me, the problem was that I was building the paths to the config and log files dynamically, for example:
curr_path = os.path.dirname(os.path.abspath(__file__))
configs_path = os.path.join(curr_path, 'configs', 'app_config.json')
opc_configs_path = os.path.join(curr_path, 'configs', 'opc.json')
log_file_path = os.path.join(curr_path, 'logs', 'application.log')
This was working fine when I started the service using python service.py install/start. But after compiling it using pyinstaller, it always gave me the error about not starting in a timely fashion.
To resolve this, I made all the dynamic paths static, for example:
configs_path = 'C:\\Program Files (x86)\\ScantechOPC\\configs\\app_config.json'
opc_configs_path = 'C:\\Program Files (x86)\\ScantechOPC\\configs\\opc.json'
debug_file = 'C:\\Program Files (x86)\\ScantechOPC\\logs\\application.log'
After compiling via pyinstaller, it is now working fine without any error. It looks like with dynamic paths the compiled exe doesn't get the actual path to the files and thus raises the error.
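(As a middle ground, a sketch of my own rather than part of this answer: PyInstaller one-file builds extract to a temporary folder, so __file__ no longer points at the install directory, but the base directory can still be derived at runtime instead of hard-coding it.)
import os
import sys

# When frozen by PyInstaller, sys.executable is the real .exe location;
# __file__ would point into the temporary extraction directory instead.
if getattr(sys, 'frozen', False):
    base_dir = os.path.dirname(sys.executable)
else:
    base_dir = os.path.dirname(os.path.abspath(__file__))

configs_path = os.path.join(base_dir, 'configs', 'app_config.json')
log_file_path = os.path.join(base_dir, 'logs', 'application.log')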
Hope this solves your problem too. Thanks