Is it possible to use dask dataframe with teradata python module? - python

I have this code:
import teradata
import dask.dataframe as dd

login = login
pwd = password
udaExec = teradata.UdaExec(appName="CAF", version="1.0",
                           logConsole=False)
session = udaExec.connect(method="odbc", DSN="Teradata",
                          USEREGIONALSETTINGS='N', username=login,
                          password=pwd, authentication="LDAP")
And the connection is working.
I want to get a dask dataframe. I have tried this:
sqlStmt = "SOME SQL STATEMENT"
df = dd.read_sql_table(sqlStmt, session, index_col='id')
And I'm getting this error message:
AttributeError: 'UdaExecConnection' object has no attribute '_instantiate_plugins'
Does anyone have a suggestion?
Thanks in advance.

read_sql_table expects a table name and an SQLAlchemy connection URI, not a SQL statement and a "session" as you are passing. I have not heard of Teradata being used via SQLAlchemy, but apparently there is at least one connector you could install, and possibly other solutions using the generic ODBC driver.
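If you do track down a working SQLAlchemy dialect for Teradata, the call would look roughly like this; the URI scheme below is an assumption based on typical dialect naming, and "my_table" is a placeholder:

import dask.dataframe as dd

# hypothetical SQLAlchemy URI for a Teradata dialect
uri = "teradata://user:password@host/dbname"
df = dd.read_sql_table("my_table", uri, index_col="id")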
However, you may wish to use a more direct approach using delayed, something like:
from dask import delayed

# make a statement for each partition
statements = [sqlStmt + " where id > {} and id <= {}".format(*bounds)
              for bounds in boundslist]  # I don't know the exact syntax for Teradata

def get_part(statement):
    # however you make a concrete dataframe from a SQL statement
    udaExec = ...
    session = ...
    df = ...
    return df

# ideally you should provide the meta and divisions info here
df = dd.from_delayed([delayed(get_part)(stm) for stm in statements],
                     meta=..., divisions=...)
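For concreteness, here is one way get_part might be filled in with the teradata module; this is an untested sketch, reusing the connection arguments from the question, and the id bounds are placeholders:

import pandas as pd
import teradata
from dask import delayed
import dask.dataframe as dd

def get_part(statement):
    # open the connection inside the task, so each worker creates its own
    udaExec = teradata.UdaExec(appName="CAF", version="1.0", logConsole=False)
    session = udaExec.connect(method="odbc", DSN="Teradata",
                              username=login, password=pwd,
                              authentication="LDAP")
    # pandas accepts any DB-API connection here
    return pd.read_sql(statement, session)

parts = [delayed(get_part)(sqlStmt + " where id > {} and id <= {}".format(lo, hi))
         for lo, hi in boundslist]
df = dd.from_delayed(parts)  # pass meta=/divisions= when you know them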
We will be interested to hear of your success.

Related

How do I run cx_Oracle queries within my session limit?

I am querying an Oracle database through cx_Oracle in Python and getting the following error after 12 queries. How can I run multiple queries within the same session? Or do I need to quit the session after each query, and if so, how do I do that?
DatabaseError: (cx_Oracle.DatabaseError) ORA-02391: exceeded
simultaneous SESSIONS_PER_USER limit (Background on this error at:
https://sqlalche.me/e/14/4xp6)
My code looks something like:
import pandas as pd
import pytz
import cx_Oracle
from sqlalchemy import create_engine

def get_query(item):
    ...
    return valid_query

cx_Oracle.init_oracle_client(lib_dir=r"\instantclient_21_3")
user = ...
password = ...
tz = pytz.timezone('US/Eastern')
dsn_tns = cx_Oracle.makedsn(...,
                            ...,
                            service_name=...)
cstr = f'oracle://{user}:{password}@{dsn_tns}'
engine = create_engine(cstr,
                       convert_unicode=False,
                       pool_recycle=10,
                       pool_size=50,
                       echo=False)
df_dicts = {}
for item in items:
    df = pd.read_sql_query(get_query(item), con=cstr)
    df_dicts[item.attribute] = df
Thank you!
You can use the cx_Oracle connection object directly in Pandas - for Oracle connections I have always found that to work better than SQLAlchemy for simple use as a Pandas connection.
Something like:
conn = cx_Oracle.connect(f'{user}/{password}@{dsn_tns}')
df_dicts = {}
for item in items:
    df = pd.read_sql(sql=get_query(item), con=conn, parse_dates=True)
    df_dicts[item.attribute] = df
(Not sure if you had dates, I just remember that being a necessary element for parsing).
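If you want the session released deterministically once the loop is done, the same approach works with contextlib.closing; a minimal sketch reusing the names from the question:

from contextlib import closing

import cx_Oracle
import pandas as pd

with closing(cx_Oracle.connect(f'{user}/{password}@{dsn_tns}')) as conn:
    df_dicts = {}
    for item in items:
        # every query runs on the single session, so SESSIONS_PER_USER is not exceeded
        df_dicts[item.attribute] = pd.read_sql(get_query(item), con=conn)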

PySpark to MySQL Insert Error?

I am trying to learn PySpark and have written a simple script that loads some JSON files from one of my HDFS directories, loads each one as a Python dictionary (using json.loads()), and then, for each object, extracts some fields.
The relevant info is stored in a Spark Dataframe and I want to insert this data into a MySQL Table (I created this locally).
But, when I run this, I get an error with my connection URL.
It says "java.lang.RuntimeException: [1.5] failure: ``.'' expected but `:' found"
At this point:
jdbc:mysql://localhost:3306/bigdata?user=root&password=pwd
^
Database name is "bigdata"
username and password are included above
Port number I believe is correct
Here's the full script I have....:
import json
import pandas as pd
import numpy as np
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext

SQL_CONNECTION = "jdbc:mysql://localhost:3306/bigdata?user=root&password=pwd"

sc = SparkContext()
sqlContext = SQLContext(sc)

cols = ['Title', 'Site']
df = pd.DataFrame(columns=cols)

# First, load my files as an RDD and parse them as JSON
rdd1 = sc.wholeTextFiles("hdfs://localhost:8020/user/ashishu/Project/sample data/*.json")
rdd2 = rdd1.map(lambda kv: json.loads(kv[1]))

# Read in the RDDs and do stuff
for record in rdd2.take(2):
    title = record['title']
    site = record['thread']['site_full']
    vals = np.array([title, site])
    df.loc[len(df)] = vals

sdf = sqlContext.createDataFrame(df)
sdf.show()
sdf.insertInto(SQL_CONNECTION, "sampledata")
SQL_CONNECTION is the connection URL at the beginning, and "sampledata" is the name of the table I want to insert into in MySQL. The specific database to use was specified in the connection url ("bigdata").
This is my spark-submit statement:
./bin/spark-submit /Users/ashishu/Desktop/sample.py --driver-class-path /Users/ashishu/Documents/Spark/.../bin/mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar
I am using Spark 1.6.1
Am I missing something stupid here about the MySQL connection? I tried replacing the ":" (between jdbc and mysql) with a "." but that obviously didn't fix anything and generated a different error...
Thanks
EDIT
I modified my code as per suggestions so that instead of using sdf.insertInto, I said...
sdf.write.jdbc(SQL_CONNECTION, table="sampledata", mode="append")
However, now I get a new error after using the following submit command in terminal:
./bin/spark-submit sample.py --jars <path to mysql-connector-java-5.1.42-bin.jar>
The error is basically saying "an error occurred while calling o53.jdbc, no suitable driver found".
Any idea about this one?
insertInto expects a table name or database.tablename, which is why it throws the ``.'' expected but `:' found error. What you need is the JDBC dataframe writer, i.e. see here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.jdbc
something like:
sdf.write.jdbc(SQL_CONNECTION, table="bigdata.sampledata", mode="append")
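A slightly fuller sketch, splitting the credentials out of the URL into the properties dict (the driver class name is the standard one for the Connector/J 5.1 jar):

properties = {
    "user": "root",
    "password": "pwd",
    "driver": "com.mysql.jdbc.Driver",  # Connector/J 5.1 driver class
}
sdf.write.jdbc(url="jdbc:mysql://localhost:3306/bigdata",
               table="sampledata", mode="append",
               properties=properties)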
I figured it out: the solution was to create a spark-env.sh file in my /spark/conf folder and add the following setting to it:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/<path to your mysql connector jar file>
Thanks!

Python: Having Trouble Connecting to Postgresql DB

I have been using the following lines of code for the longest time without any hitch, but today they produced the following error and I cannot figure out why. The strange thing is that I have other scripts that use the same code and they all seem to work...
import pandas as pd
import psycopg2
link_conn_string = "host='<host>' dbname='<db>' user='<user>' password='<pass>'"
conn = psycopg2.connect(link_conn_string)
df = pd.read_sql("SELECT * FROM link._link_bank_report_lms_loan_application", link_conn_string)
Error Message:
"Could not parse rfc1738 URL from string '%s'" % name)
sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string 'host='<host>' dbname='<db>' user='<user>' password='<pass>''
Change link_conn_string to a URL of this form (when pandas is given a string it hands it to SQLAlchemy, which expects an rfc1738 URL, hence the error):
postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]
Eg:
>>> import psycopg2
>>> cs = 'postgresql://vao#localhost:5432/t'
>>> c = psycopg2.connect(cs)
>>> import pandas as pd
>>> df = pd.read_sql("SELECT now()",c)
>>> print df;
                                now
0  2017-02-27 21:58:27.520372+00:00
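Alternatively, you can keep the keyword-style DSN from the question and pass the connection object itself, not the string, to pandas; a minimal sketch:

import pandas as pd
import psycopg2

conn = psycopg2.connect("host='<host>' dbname='<db>' user='<user>' password='<pass>'")
df = pd.read_sql("SELECT * FROM link._link_bank_report_lms_loan_application", conn)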
Your connection string is wrong.
You should remove the line where you assign a string to the "link_conn_string" variable and replace the next line with something like this (remember to replace localhost, postgres and secret with the machine where PostgreSQL is running, the database name, and the user and password required to make that connection):
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres", password="secret")
Also, you can check whether the database is working with the "psql" command from the terminal (again, remember to change the user and database):
psql -U postgres database

DB2 sql query to a Pandas DataFrame from my Mac

I need to access data that resides in a remote DB2 database via a SQL statement and convert it to a Pandas DataFrame, all from my Mac. I looked at using Pandas' read_sql with the ibm_db_sa adapter, but it looks like the prerequisite client-side software is not supported on the Mac.
I came up with a JDBC option, which I'm posting, but I'm curious to know if anyone else has any ideas.
Here's an option using JDBC, the pip-installable JayDeBeApi, and the appropriate db jar file.
Note: this could be used for other JDBC/JayDeBeApi-compliant databases like Oracle, MS SQL Server, etc.
import jaydebeapi
import pandas as pd

def read_jdbc(sql, jclassname, driver_args, jars=None, libs=None):
    '''
    Reads jdbc compliant data sources and returns a Pandas DataFrame

    uses jaydebeapi.connect and doc strings :-)
    https://pypi.python.org/pypi/JayDeBeApi/

    :param sql: select statement
    :param jclassname: Fully qualified Java class name of the JDBC driver,
                       e.g. org.postgresql.Driver or com.ibm.db2.jcc.DB2Driver
    :param driver_args: Argument or sequence of arguments to be passed to the
                        Java DriverManager.getConnection method. Usually the
                        database URL. See
                        http://docs.oracle.com/javase/6/docs/api/java/sql/DriverManager.html
                        for more details
    :param jars: Jar filename or sequence of filenames for the JDBC driver
    :param libs: Dll/so filenames or sequence of dlls/sos used as
                 shared library by the JDBC driver
    :return: Pandas DataFrame
    '''
    try:
        conn = jaydebeapi.connect(jclassname, driver_args, jars, libs)
    except jaydebeapi.DatabaseError as de:
        raise

    try:
        curs = conn.cursor()
        curs.execute(sql)
        columns = [desc[0] for desc in curs.description]  # getting column headers
        # convert the list of tuples from fetchall() to a df
        return pd.DataFrame(curs.fetchall(), columns=columns)
    except jaydebeapi.DatabaseError as de:
        raise
    finally:
        curs.close()
        conn.close()
Some examples:

#DB2
conn = 'jdbc:db2://<host>:5032/<db>:currentSchema=<schema>;'
class_name = 'com.ibm.db2.jcc.DB2Driver'
sql = 'SELECT name FROM table_name FETCH FIRST 5 ROWS ONLY'
df = read_jdbc(sql, class_name, [conn, 'myname', 'mypwd'])

#PostgreSQL
conn = 'jdbc:postgresql://<host>:5432/<db>?currentSchema=<schema>'
class_name = 'org.postgresql.Driver'
jar = '/path/to/jar/postgresql-9.4.1212.jar'
sql = 'SELECT name FROM table_name LIMIT 5'
df = read_jdbc(sql, class_name, [conn, 'myname', 'mypwd'], jars=jar)
I got a simpler answer from https://stackoverflow.com/a/33805547/914967, which uses only the pip module ibm_db:
import ibm_db
import ibm_db_dbi
import pandas as pd

# db_name, hostname, port_number, user, password and sql are placeholders to fill in
conn_handle = ibm_db.connect(
    'DATABASE={};HOSTNAME={};PORT={};PROTOCOL=TCPIP;UID={};PWD={};'.format(
        db_name, hostname, port_number, user, password), '', '')
conn = ibm_db_dbi.Connection(conn_handle)
df = pd.read_sql(sql, conn)
Bob, you should check out ibmdbpy (https://pypi.python.org/pypi/ibmdbpy). It is a pandas-style dataframe API to DB2 and dashDB tables. It supports both underlying DB2 client drivers, ODBC and JDBC.
As a prerequisite you need to set up the DB2 client driver package for Mac, which you can find here: http://www-01.ibm.com/support/docview.wss?uid=swg21385217
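A rough sketch of what ibmdbpy usage looks like; the DSN name and table below are placeholders, and you should check the package docs for the exact connection options:

import pandas as pd
from ibmdbpy import IdaDataBase, IdaDataFrame

idadb = IdaDataBase(dsn="MYDB2DSN")                # ODBC DSN, a placeholder
ida_df = IdaDataFrame(idadb, "SCHEMA.TABLE_NAME")  # lazy handle to the table
df = ida_df.as_dataframe()                         # pull the table into pandas
idadb.close()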
After @IanBjorhovde commented on my question, I investigated another solution that allows me to use SQLAlchemy and pandas' read_sql().
Here are the steps I took. Note: I got this working on OSX Yosemite (10.10.4) for Python 3.4 and 3.5.
1) Download IBM DB2 Express-C (no-cost community edition of DB2)
https://www-01.ibm.com/marketing/iwm/iwm/web/pick.do?source=swg-db2expressc&S_TACT=000000VR&lang=en_US&S_OFF_CD=10000761
2) Navigate to the unzipped dir and run:
sudo ./db2_install
I accepted the default location of /opt/IBM/db2/V10.1
3) Install ibm_db and ibm_db_sa:
pip install ibm_db
I built ibm_db_sa from source because the pip install failed:
python setup.py install
That should do it. You might get an error like 'Reason: image not found' when you try to connect to your db, so read this for the fix. Note: it might require a reboot.
Example usage:
import ibm_db_sa
import pandas as pd
from sqlalchemy import select, create_engine

eng = create_engine('ibm_db_sa://<user_name>:<pwd>@<host>:5032/<db name>')
sql = 'SELECT name FROM table_name FETCH FIRST 5 ROWS ONLY'
df = pd.read_sql(sql, eng)

Using JPype - How can I access JDBC Meta Data Functions

I'm using JayDeBeApi, which uses JPype to load FileMaker's JDBC driver and pull data.
But I also want to be able to get a listing of all tables in the database.
In the JDBC documentation (page 55) it lists the following functions:
The JDBC client driver supports the following Meta Data functions:
getColumns
getColumnPrivileges
getMetaData
getTypeInfo
getTables
getTableTypes
Any ideas how I might call them from JPype or JayDeBeAPI?
If it helps, here's my current code:
import jaydebeapi
import jpype

jar = r'/opt/drivers/fmjdbc.jar'
args = '-Djava.class.path=%s' % jar
jvm_path = jpype.getDefaultJVMPath()
jpype.startJVM(jvm_path, args)

conn = jaydebeapi.connect('com.filemaker.jdbc.Driver',
                          SETTINGS['SOURCE_URL'], SETTINGS['SOURCE_UID'],
                          SETTINGS['SOURCE_PW'])
curs = conn.cursor()

# Sample Query:
curs.execute("select * from table")
result_rows = curs.fetchall()
Update:
Here's some progress and it seems like it should work, but I'm getting the error below. Any ideas?
> conn.jconn.metadata.getTables()
*** RuntimeError: No matching overloads found. at src/native/common/jp_method.cpp:121
Ok, thanks to eltabo and Juan Mellado I figured it out!
I just had to pass in the correct parameters to match the method signature.
Here's the working code:
import jaydebeapi
import jpype

jar = r'/opt/drivers/fmjdbc.jar'
args = '-Djava.class.path=%s' % jar
jvm_path = jpype.getDefaultJVMPath()
jpype.startJVM(jvm_path, args)

conn = jaydebeapi.connect('com.filemaker.jdbc.Driver',
                          SETTINGS['SOURCE_URL'], SETTINGS['SOURCE_UID'],
                          SETTINGS['SOURCE_PW'])
results = conn.jconn.getMetaData().getTables(None, None, "%", None)

# I'm not sure if this is how to read the result set, but jaydebeapi's cursor
# object has a lot of logic for getting information out of a result set, so
# let's harness that.
table_reader_cursor = conn.cursor()
table_reader_cursor._rs = results
read_results = table_reader_cursor.fetchall()

# get just the table names
table_names = [row[2] for row in read_results if row[3] == 'TABLE']
From the DatabaseMetaData Javadoc:
public ResultSet getTables(String catalog,
                           String schemaPattern,
                           String tableNamePattern,
                           String[] types)
                    throws SQLException
You need to pass all four parameters to the method. I'm not a Python developer, but in Java I use:
ResultSet rs = metadata.getTables(null, "public", "%", new String[] {"TABLE"});
to get all the tables (and only the tables) in a schema.
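In JPype terms the same call looks roughly like this (a sketch; JArray/JString build the Java String[] so the right overload is matched, and conn is the jaydebeapi connection from above):

from jpype import JArray, JString

# build a Java String[] {"TABLE"} and pass it as the types argument
types = JArray(JString)(["TABLE"])
rs = conn.jconn.getMetaData().getTables(None, "public", "%", types)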
Regards.
