SQL error when using pyspark to insert data to Mariadb - python

I try to insert data to mariadb using pyspark and jdbc, but it seems that the pyspark doesn't generate the right SQL,my Spark version is 2.1.0, I din't have this problem util the manager of the cluster updating the Spark from 1.6.1 to 2.1.0, here is My python code
from pyspark.sql import Row, SparkSession as SS
if __name__ == "__main__":
spark = SS.builder.appName("boot_count").getOrCreate()
sc = spark.SparkContext
l = [(str(20160101), str(1)]
rdd = sc.parallelize(l)
rdd = rdd.map(lambda x: Row(day=x[0], count=x[1]))
df = spark.createDataFrame(rdd)
df.createOrReplaceTempView("boot_count")
mysql_url = "jdbc:mariadb://master.cluster:3306/dbname"
properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}
df.write.jdbc(url=mysql_url, table="boot_count", mode="append",
properties=properties)
Here is my error information
Caused by: java.sql.SQLSyntaxErrorException: (conn:364) You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '"count","day") VALUES ('1','20160101')' at line 1 Query is : INSERT INTO boot_count ("count","day") VALUES ('1','20160101')
I use command in MariaDB to solve this problem.
>set global sql_mode=ANSI_QUOTES

Either put backtics around column names or use the setting that allows double-quotes around column names.

Related

Is it possible to use dask dataframe with teradata python module?

I have this code:
import teradata
import dask.dataframe as dd
login = login
pwd = password
udaExec = teradata.UdaExec (appName="CAF", version="1.0",
logConsole=False)
session = udaExec.connect(method="odbc", DSN="Teradata",
USEREGIONALSETTINGS='N', username=login,
password=pwd, authentication="LDAP");
And the connection is working.
I want to get a dask dataframe. I have tried this:
sqlStmt = "SOME SQL STATEMENT"
df = dd.read_sql_table(sqlStmt, session, index_col='id')
And I'm getting this error message:
AttributeError: 'UdaExecConnection' object has no attribute '_instantiate_plugins'
Does anyone have a suggestion?
Thanks in advance.
read_sql_table expects a SQLalchemy connection string, not a "session" as you are passing. I have not heard of teradata being used via sqlalchemy, but apparently there is at least one connector you could install, and possibly other solutions using the generic ODBC driver.
However, you may wish to use a more direct approach using delayed, something like
from dask import delayed
# make a set of statements for each partition
statements = [sqlStmt + " where id > {} and id <= {}".format(bounds)
for bounds in boundslist] # I don't know syntax for tera
def get_part(statement):
# however you make a concrete dataframe from a SQL statement
udaExec = ..
session = ..
df = ..
return dataframe
# ideally you should provide the meta and divisions info here
df = dd.from_delayed([delayed(get_part)(stm) for stm in statements],
meta= , divisions=)
We will be interested to hear of your success.

Can't Save Dataframe to Local Mac Machine

I am using a Databricks notebook and trying to export my dataframe as CSV to my local machine after querying it. However, it does not save my CSV to my local machine. Why?
Connect to Database
#SQL Connector
import pandas as pd
import psycopg2
import numpy as np
from pyspark.sql import *
#Connection
cnx = psycopg2.connect(dbname= 'test', host='test', port= '1234', user= 'test', password= 'test')
cursor = cnx.cursor()
SQL Query
query = """
SELECT * from products;
"""
# Execute the query
try:
cursor.execute(query)
except OperationalError as msg:
print ("Command skipped: ")
#Fetch all rows from the result
rows = cursor.fetchall()
# Convert into a Pandas Dataframe
df = pd.DataFrame( [[ij for ij in i] for i in rows] )
Exporting Data as CSV to Local Machine
df.to_csv('test.csv')
It does NOT give any error but when I go to my Mac machine's search icon to find "test.csv", it is not existent. I presume that the operation did not work, thus the file was never saved from the Databricks cloud server to my local machine...Does anybody know how to fix it?
Select from SQL Server:
import pypyodbc
cnxn = pypyodbc.connect("Driver={SQL Server Native Client 11.0};"
"Server=Server_Name;"
"Database=TestDB;"
"Trusted_Connection=yes;")
#cursor = cnxn.cursor()
#cursor.execute("select * from Actions")
cursor = cnxn.cursor()
cursor.execute('SELECT * FROM Actions')
for row in cursor:
print('row = %r' % (row,))
From SQL Server to Excel:
import pyodbc
import pandas as pd
# cnxn = pyodbc.connect("Driver={SQL Server};SERVER=xxx;Database=xxx;UID=xxx;PWD=xxx")
cnxn = pyodbc.connect("Driver={SQL Server};SERVER=EXCEL-PC\SQLEXPRESS;Database=NORTHWND;")
data = pd.read_sql('SELECT * FROM Orders',cnxn)
data.to_excel('C:\\your_path_here\\foo.xlsx')
Since you are using Databricks, you are most probably working on a remote machine. Like it was already mentioned, saving the way you do wont work (file will be save to the machine your notebooks master node is on). Try running:
import os
os.listdir(os.getcwd())
This will list all the files that are in directory from where notebook is running (at least it is how jupyter notebooks work). You should see saved file here.
However, I would think that Databricks provides a utility functions to their clients for easy data download from the cloud. Also, try using spark to connect to db - might be a little more convenient.
I think these two links should be useful for you:
Similar question on databricks forums
Databricks documentation
Because you're running this in a Databricks notebook, when you're using Pandas to save your file to test.csv, this is being saved to the Databricks driver node's file directory. A way to test this out is the following code snippet:
# Within Databricks, there are sample files ready to use within
# the /databricks-datasets folder
df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", inferSchema=True, header=True)
# Converting the Spark DataFrame to a Pandas DataFrame
import pandas as pd
pdDF = df.toPandas()
# Save the Pandas DataFrame to disk
pdDF.to_csv('test.csv')
The location of your test.csv is within the /databricks/driver/ folder of your Databricks' cluster driver node. To validate this:
# Run the following shell command to see the results
%sh cat test.csv
# The output directory is shown here
%sh pwd
# Output
# /databricks/driver
To save the file to your local machine (i.e. your Mac), you can view the Spark DataFrame using the display command within your Databricks notebook. From here, you can click on the "Download to CSV" button which is highlighted in red in the below image.

PySpark to MySQL Insert Error?

I am trying to learn PySpark and have written a simple script that loads some JSON files from one of my HDFS directories, loads each in as a python dictionary (using json.loads() ) and then for each object, extracts some fields.
The relevant info is stored in a Spark Dataframe and I want to insert this data into a MySQL Table (I created this locally).
But, when I run this, I get an error with my connection URL.
It says "java.lang.RuntimeException: [1.5] failure: ``.'' expected but `:' found"
At this point:
jdbc:mysql://localhost:3306/bigdata?user=root&password=pwd
^
Database name is "bigdata"
username and password are included above
Port number I believe is correct
Here's the full script I have....:
import json
import pandas as pd
import numpy as np
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext
SQL_CONNECTION="jdbc:mysql://localhost:3306/bigdata?user=root&password=pwd"
sc=SparkContext()
sqlContext = SQLContext(sc)
cols = ['Title', 'Site']
df = pd.DataFrame(columns=cols)
#First, load my files as RDD and convert them as JSON
rdd1 = sc.wholeTextFiles("hdfs://localhost:8020/user/ashishu/Project/sample data/*.json")
rdd2 = rdd1.map(lambda kv: json.loads(kv[1]))
#Read in the RDDs and do stuff
for record in rdd2.take(2):
title = record['title']
site = record['thread']['site_full']
vals = np.array([title, site])
df.loc[len(df)] = vals
sdf = sqlContext.createDataFrame(df)
sdf.show()
sdf.insertInto(SQL_CONNECTION, "sampledata")
SQL_CONNECTION is the connection URL at the beginning, and "sampledata" is the name of the table I want to insert into in MySQL. The specific database to use was specified in the connection url ("bigdata").
This is my spark-submit statement:
./bin/spark-submit /Users/ashishu/Desktop/sample.py --driver-class-path /Users/ashishu/Documents/Spark/.../bin/mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar
I am using Spark 1.6.1
Am I missing something stupid here about the MySQL connection? I tried replacing the ":" (between jdbc and mysql) with a "." but that obviously didn't fix anything and generated a different error...
Thanks
EDIT
I modified my code as per suggestions so that instead of using sdf.InsertInto, I said...
sdf.write.jdbc(SQL_CONNECTION, table="sampledata", mode="append")
However, now I get a new error after using the following submit command in terminal:
./bin/spark-submit sample.py --jars <path to mysql-connector-java-5.1.42-bin.jar>
The error is basically saying "an error occurred while calling o53.jdbc, no suitable driver found".
Any idea about this one?
insertInto expects a tablename or database.tablename thats why its throwing . expected but : found error. What you need is jdbc dataframe writer i.e. see here http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.jdbc
something like -
sdf.write.jdbc(SQL_CONNECTION, table=bigdata.sampledata,mode='append')
I figured it out, the solution was to create a spark-env.sh file in my /spark/conf folder and in it, have the following setting:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/<path to your mysql connector jar file>
Thanks!

Querying json object in dataframe using Pyspark

I have a MySql table with following schema:
id-int
path-varchar
info-json {"name":"pat", "address":"NY, USA"....}
I used JDBC driver to connect pyspark to MySql. I can retrieve data from mysql using
df = sqlContext.sql("select * from dbTable")
This query works all fine. My question is, how can I query on "info" column? For example, below query works all fine in MySQL shell and retrieve data but this is not supported in Pyspark (2+).
select id, info->"$.name" from dbTable where info->"$.name"='pat'
from pyspark.sql.functions import *
res = df.select(get_json_object(df['info'],"$.name").alias('name'))
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')
There is already a function named get_json_object
For your situation:
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
properties={'user': 'hive', 'password': '123456'})
df.createOrReplaceTempView('test_json')
res = spark.sql("""
select col_json,get_json_object(col_json,'$.name') from test_json
""")
res.show()
Spark sql is almost like HIVE sql, you can see
https://cwiki.apache.org/confluence/display/Hive/Home

DB2 sql query to a Pandas DataFrame from my Mac

I have a need to access data that resides in a remote db2 database via a sql statement and convert it to a Pandas DataFrame. All from my Mac. I looked at using Pandas' read_sql with the ibm_db_sa adapter, but it looks like the prerequisite client side software is not supported on the Mac
I came up with an jdbc option, which I'm posting, but I'm curious to know if anyone else has any ideas
Here's an option using jdbc, the pip installable JayDeBeApi and the appropriate db jar file
Note: this could be used for other jdbc/jaydebeapi compliant databases like Oracle, MS Sql Server, etc
import jaydebeapi
import pandas as pd
def read_jdbc(sql, jclassname, driver_args, jars=None, libs=None):
'''
Reads jdbc compliant data sources and returns a Pandas DataFrame
uses jaydebeapi.connect and doc strings :-)
https://pypi.python.org/pypi/JayDeBeApi/
:param sql: select statement
:param jclassname: Full qualified Java class name of the JDBC driver.
e.g. org.postgresql.Driver or com.ibm.db2.jcc.DB2Driver
:param driver_args: Argument or sequence of arguments to be passed to the
Java DriverManager.getConnection method. Usually the
database URL. See
http://docs.oracle.com/javase/6/docs/api/java/sql/DriverManager.html
for more details
:param jars: Jar filename or sequence of filenames for the JDBC driver
:param libs: Dll/so filenames or sequence of dlls/sos used as
shared library by the JDBC driver
:return: Pandas DataFrame
'''
try:
conn = jaydebeapi.connect(jclassname, driver_args, jars, libs)
except jaydebeapi.DatabaseError as de:
raise
try:
curs = conn.cursor()
curs.execute(sql)
columns = [desc[0] for desc in curs.description] #getting column headers
#convert the list of tuples from fetchall() to a df
return pd.DataFrame(curs.fetchall(), columns=columns)
except jaydebeapi.DatabaseError as de:
raise
finally:
curs.close()
conn.close()
Some examples
#DB2
conn = 'jdbc:db2://<host>:5032/<db>:currentSchema=<schema>;'
class_name = 'com.ibm.db2.jcc.DB2Driver'
sql = 'SELECT name FROM table_name FETCH FIRST 5 ROWS ONLY'
df = read_jdbc(sql, class_name, [conn, 'myname', 'mypwd'])
#PostgreSQL
conn = 'jdbc:postgresql://<host>:5432/<db>?currentSchema=<schema>'
class_name = 'org.postgresql.Driver'
jar = '/path/to/jar/postgresql-9.4.1212.jar'
sql = 'SELECT name FROM table_name LIMIT 5'
df = read_jdbc(sql, class_name, [conn, 'myname', 'mypwd'], jars=jar)
I got a simpler answer from https://stackoverflow.com/a/33805547/914967 where it uses pip module ibm_db only:
import ibm_db
import ibm_db_dbi
import pandas as pd
conn_handle = ibm_db.connect('DATABASE={};HOSTNAME={};PORT={};PROTOCOL=TCPIP;UID={};PWD={};'.format(db_name, hostname, port_number, user, password), '', '')
conn = ibm_db_dbi.Connection(conn_handle)
df = pd.read_sql(sql, conn)
Bob, you should check out ibmdbpy (https://pypi.python.org/pypi/ibmdbpy). It is a pandas data frame style API to DB2 and dashDB tables. It supports both underlying DB2 client drivers, ODBC and JDBC.
So as prerequisites you need to set up the DB2 client driver package for Mac that you can find here: http://www-01.ibm.com/support/docview.wss?uid=swg21385217
After #IanBjorhovde commented on my question I investigated another solution that allows me to use sqlalchemy and pandas' read_sql()
Here are the steps I took. Note: I got this working on OSX Yosemite (10.10.4) for python 3.4 and 3.5
1) Download IBM DB2 Express-C (no-cost community edition of DB2)
https://www-01.ibm.com/marketing/iwm/iwm/web/pick.do?source=swg-db2expressc&S_TACT=000000VR&lang=en_US&S_OFF_CD=10000761
2) After navigating to the unzipped dir
sudo ./db2_install
I accepted the default location of /opt/IBM/db2/V10.1
3) Install ibm_db and ibm_db_sa
pip install ibm_db
I built ibm_db_sa from source because the pip installed failed
python setup.py install
That should do it. You might get an error like 'Reason: image not found' when you try to connect to your db so read this for the fix. Note: might require a reboot
Example usage:
import ibm_db_sa
import pandas as pd
from sqlalchemy import select, create_engine
eng = create_engine('ibm_db_sa://<user_name>:<pwd>#<host>:5032/<db name>')
sql = 'SELECT name FROM table_name FETCH FIRST 5 ROWS ONLY'
df = pd.read_sql(sql, eng)

Categories

Resources