I am trying to learn PySpark and have written a simple script that loads some JSON files from one of my HDFS directories, parses each one into a Python dictionary (using json.loads()) and then, for each object, extracts some fields.
The relevant info is stored in a Spark DataFrame, and I want to insert this data into a MySQL table (which I created locally).
But when I run this, I get an error with my connection URL.
It says "java.lang.RuntimeException: [1.5] failure: ``.'' expected but `:' found"
At this point:
jdbc:mysql://localhost:3306/bigdata?user=root&password=pwd
^
Database name is "bigdata"
username and password are included above
Port number I believe is correct
Here's the full script I have:
import json
import pandas as pd
import numpy as np
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext
SQL_CONNECTION="jdbc:mysql://localhost:3306/bigdata?user=root&password=pwd"
sc=SparkContext()
sqlContext = SQLContext(sc)
cols = ['Title', 'Site']
df = pd.DataFrame(columns=cols)
#First, load my files as RDD and convert them as JSON
rdd1 = sc.wholeTextFiles("hdfs://localhost:8020/user/ashishu/Project/sample data/*.json")
rdd2 = rdd1.map(lambda kv: json.loads(kv[1]))
#Read in the RDDs and do stuff
for record in rdd2.take(2):
    title = record['title']
    site = record['thread']['site_full']
    vals = np.array([title, site])
    df.loc[len(df)] = vals
sdf = sqlContext.createDataFrame(df)
sdf.show()
sdf.insertInto(SQL_CONNECTION, "sampledata")
SQL_CONNECTION is the connection URL at the beginning, and "sampledata" is the name of the table I want to insert into in MySQL. The specific database to use was specified in the connection URL ("bigdata").
This is my spark-submit statement:
./bin/spark-submit /Users/ashishu/Desktop/sample.py --driver-class-path /Users/ashishu/Documents/Spark/.../bin/mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar
I am using Spark 1.6.1
Am I missing something stupid here about the MySQL connection? I tried replacing the ":" (between jdbc and mysql) with a "." but that obviously didn't fix anything and generated a different error...
Thanks
EDIT
I modified my code as per the suggestions so that, instead of using sdf.insertInto, I said:
sdf.write.jdbc(SQL_CONNECTION, table="sampledata", mode="append")
However, now I get a new error after using the following submit command in terminal:
./bin/spark-submit sample.py --jars <path to mysql-connector-java-5.1.42-bin.jar>
The error is basically saying "an error occurred while calling o53.jdbc, no suitable driver found".
Any idea about this one?
insertInto expects a table name or database.tablename; that's why it's throwing the ". expected but : found" error. What you need is the JDBC DataFrame writer, i.e. see here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.jdbc
Something like:
sdf.write.jdbc(SQL_CONNECTION, table="bigdata.sampledata", mode="append")
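For reference, a fuller hedged sketch of the same call, with the credentials moved into properties and the driver class named explicitly (the class name assumes MySQL Connector/J 5.1.x, and the connector jar still has to be on Spark's classpath):
# Minimal sketch, reusing the database and credentials from the question.
mysql_url = "jdbc:mysql://localhost:3306/bigdata"
props = {"user": "root",
         "password": "pwd",
         "driver": "com.mysql.jdbc.Driver"}  # Connector/J 5.1.x driver class
sdf.write.jdbc(url=mysql_url, table="sampledata", mode="append",
               properties=props)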
I figured it out; the solution was to create a spark-env.sh file in my /spark/conf folder and, in it, add the following setting:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/<path to your mysql connector jar file>
Thanks!
Related
I have a SQL statement that I want to run against an Oracle database using a JDBC driver in Databricks. I can get this to run successfully if the SQL statement is quite short, for example if it's just selecting all of the data from a table with no filters (e.g. select * from tbl).
However, I have an extremely long SQL query that I need to execute, so I am creating the string to pass to the JDBC driver by loading it from a .sql file saved on the Databricks file storage.
When running this I was presented with an error, and on investigating and printing the contents of the text file I find that it drops part of the SQL statement and shows a message before resuming it:
*** WARNING: skipped 62431 bytes of output ***
Effectively it looks like this in the printed string:
sum (
case
when dpr.pricing_Type in ('P', 'C') then
nvl (
decode (dpr.price / 100, null, 0,
decode (apr.price, 'N',
*** WARNING: skipped 62431 bytes of output ***
then
dpr.percentage_applied
else
0
end
) as price_percent,
Note that the code before the warning is for a completely different field from the code after the warning message.
Are there any suggestions on the cause of this and how to resolve it?
The full script I am running is below for reference. Also note that the .sql file I am using is only 113 KB, and I am using Python 3.7.5 via runtime 7.4 of Databricks:
%python
# setup jdbc credentials (from key vault) and url
jdbcUsername = dbutils.secrets.get(scope="USER", key="ID")
jdbcPassword = dbutils.secrets.get(scope="PWD", key="PWD")
jdbcUrl = "jdbc:oracle:thin:#<REDACTED>"
jdbcDrv = "oracle.jdbc.driver.OracleDriver"
# Table Name
OutputTbl = "db.tblCore"
# Drop table.
spark.sql("DROP TABLE IF EXISTS " + OutputTbl )
# parallelism
lbound = 20160101
ubound = 20210115
fsize = "1000"
colname = "date_value_yyyymmdd"
numParts = "10"
# Get sql statement from file.
with open('/dbfs/FileStore/shared_uploads/<REDACTED>/SQL', 'r') as f:
    sql = f.read()
# Create DF and write output to a table.
spdf = (spark.read.format("jdbc")
.option("driver", jdbcDrv)
.option("url", jdbcUrl)
.option("user", jdbcUsername)
.option("password", jdbcPassword)
.option("dbtable", sql)
.option("numPartitions", numParts)
.option("fetchsize", fsize)
.option("partitionColumn", colname)
.option("lowerBound", lbound)
.option("upperBound", ubound)
.load())
spdf.write.mode("overwrite").saveAsTable(OutputTbl)
This is not an error; it's just a warning saying that the printed output was truncated to avoid overloading the browser, etc. You may look into the driver's and executors' logs via the Spark UI of your cluster - there should be more information there...
I would also suggest first trying to execute that statement directly against Oracle, just to check whether it works at all.
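One quick way to confirm that only the printed output is truncated, not the SQL string itself (a small sketch reusing the file path from the question):
# Check the length of the loaded SQL text instead of printing all of it;
# the "skipped ... bytes of output" message only trims notebook output.
with open('/dbfs/FileStore/shared_uploads/<REDACTED>/SQL', 'r') as f:
    sql = f.read()
print(len(sql))    # should roughly match the ~113 KB file size
print(sql[:200])   # spot-check the start of the statement
print(sql[-200:])  # spot-check the end of the statement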
I could use some help. In part 1, "Getting Started with SQL and BigQuery", I'm running into the following issue. I've gotten down to In[7]:
# Preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()
and I get the error:
getting_started_with_bigquery.py:41: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.
client.list_rows(table, max_results=5).to_dataframe()
I'm writing my code in Notepad++ and then running it from the command prompt on Windows. I've gotten everything else working up to this point, but I'm having trouble finding a solution to this problem. A Google search leads me to the source code for google.cloud.bigquery.table, which suggests that this error should come up if pandas is not installed, so I installed it and added import pandas to my code, but I'm still getting the same error.
Here is my full code:
from google.cloud import bigquery
import os
import pandas
#need to set credential path
credential_path = (r"C:\Users\crlas\learningPython\google_application_credentials.json")
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path
#create a "Client" object
client = bigquery.Client()
#construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
#API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
#list all tables in the dataset
tables = list(client.list_tables(dataset))
#print all table names
for table in tables:
    print(table.table_id)
print()
#construct a reference to the "full" table
table_ref = dataset_ref.table("full")
#API request - fetch the dataset
table = client.get_table(table_ref)
#print info on all the columns in the "full" table
print(table.schema)
# print("table schema should have printed above")
print()
#preview first 5 lines of the table
client.list_rows(table, max_results=5).to_dataframe()
As the warning message says: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.
So this still works despite the warning; it simply falls back to the tabledata.list API to retrieve the data.
You just need to assign the output to a DataFrame object and print it, like below:
df = client.list_rows(table, max_results=5).to_dataframe()
print(df)
I have this code:
import teradata
import dask.dataframe as dd
login = login
pwd = password
udaExec = teradata.UdaExec (appName="CAF", version="1.0",
logConsole=False)
session = udaExec.connect(method="odbc", DSN="Teradata",
USEREGIONALSETTINGS='N', username=login,
password=pwd, authentication="LDAP");
And the connection is working.
I want to get a dask dataframe. I have tried this:
sqlStmt = "SOME SQL STATEMENT"
df = dd.read_sql_table(sqlStmt, session, index_col='id')
And I'm getting this error message:
AttributeError: 'UdaExecConnection' object has no attribute '_instantiate_plugins'
Does anyone have a suggestion?
Thanks in advance.
read_sql_table expects a SQLAlchemy connection string, not a "session" as you are passing. I have not heard of Teradata being used via SQLAlchemy, but apparently there is at least one connector you could install, and possibly other solutions using the generic ODBC driver.
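If you do go the SQLAlchemy route, the call would look roughly like this (a sketch only: the teradatasql:// URI assumes the teradatasqlalchemy dialect, the host and table name are made up, and login/pwd are the variables from your code):
import dask.dataframe as dd

# read_sql_table takes a table name plus a SQLAlchemy URI, not a raw SQL
# statement or a DBAPI session object.
uri = "teradatasql://{user}:{password}@td-host.example.com".format(
    user=login, password=pwd)
df = dd.read_sql_table("my_table", uri, index_col="id")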
However, you may wish to use a more direct approach using delayed, something like
from dask import delayed

# make a set of statements for each partition
statements = [sqlStmt + " where id > {} and id <= {}".format(*bounds)
              for bounds in boundslist]  # I don't know syntax for tera

def get_part(statement):
    # however you make a concrete dataframe from a SQL statement
    udaExec = ...
    session = ...
    df = ...
    return df

# ideally you should provide the meta and divisions info here
df = dd.from_delayed([delayed(get_part)(stm) for stm in statements],
                     meta=..., divisions=...)
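For concreteness, one hedged way to fill in get_part, reusing the teradata connection settings and the login/pwd variables from the question (pandas.read_sql over the ODBC session is just one way to turn a statement into a concrete dataframe):
import pandas as pd
import teradata

def get_part(statement):
    # Open a fresh connection inside the task; a connection object generally
    # cannot be shared across dask workers.
    udaExec = teradata.UdaExec(appName="CAF", version="1.0", logConsole=False)
    session = udaExec.connect(method="odbc", DSN="Teradata",
                              USEREGIONALSETTINGS='N', username=login,
                              password=pwd, authentication="LDAP")
    try:
        return pd.read_sql(statement, session)
    finally:
        session.close()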
We will be interested to hear of your success.
I am using a Databricks notebook and trying to export my dataframe as CSV to my local machine after querying it. However, it does not save my CSV to my local machine. Why?
Connect to Database
#SQL Connector
import pandas as pd
import psycopg2
import numpy as np
from pyspark.sql import *
#Connection
cnx = psycopg2.connect(dbname= 'test', host='test', port= '1234', user= 'test', password= 'test')
cursor = cnx.cursor()
SQL Query
query = """
SELECT * from products;
"""
# Execute the query
try:
    cursor.execute(query)
except psycopg2.OperationalError as msg:
    print("Command skipped: ")
#Fetch all rows from the result
rows = cursor.fetchall()
# Convert into a Pandas Dataframe
df = pd.DataFrame( [[ij for ij in i] for i in rows] )
Exporting Data as CSV to Local Machine
df.to_csv('test.csv')
It does NOT give any error, but when I go to my Mac machine's search icon to find "test.csv", it is nowhere to be found. I presume the operation did not work and the file was never saved from the Databricks cloud server to my local machine... Does anybody know how to fix this?
Select from SQL Server:
import pypyodbc
cnxn = pypyodbc.connect("Driver={SQL Server Native Client 11.0};"
"Server=Server_Name;"
"Database=TestDB;"
"Trusted_Connection=yes;")
#cursor = cnxn.cursor()
#cursor.execute("select * from Actions")
cursor = cnxn.cursor()
cursor.execute('SELECT * FROM Actions')
for row in cursor:
    print('row = %r' % (row,))
From SQL Server to Excel:
import pyodbc
import pandas as pd
# cnxn = pyodbc.connect("Driver={SQL Server};SERVER=xxx;Database=xxx;UID=xxx;PWD=xxx")
cnxn = pyodbc.connect("Driver={SQL Server};SERVER=EXCEL-PC\SQLEXPRESS;Database=NORTHWND;")
data = pd.read_sql('SELECT * FROM Orders',cnxn)
data.to_excel('C:\\your_path_here\\foo.xlsx')
Since you are using Databricks, you are most probably working on a remote machine. As already mentioned, saving the way you do won't work (the file will be saved to the machine your notebook's driver node is on). Try running:
import os
os.listdir(os.getcwd())
This will list all the files in the directory the notebook is running from (at least that is how Jupyter notebooks work). You should see the saved file there.
However, I would think that Databricks provides utility functions to its clients for easy data download from the cloud. Also, try using Spark to connect to the database - it might be a little more convenient.
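If you go the Spark route, here is a hedged sketch of reading the same products table over JDBC (host, port, and credentials are copied from the question's psycopg2 call; the PostgreSQL JDBC driver must be available on the cluster):
# Minimal sketch of Spark's JDBC reader against the same PostgreSQL database.
jdbc_url = "jdbc:postgresql://test:1234/test"
spark_df = (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", "products")
            .option("user", "test")
            .option("password", "test")
            .load())
display(spark_df)  # Databricks' display output offers a CSV download option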
I think these two links should be useful for you:
Similar question on databricks forums
Databricks documentation
Because you're running this in a Databricks notebook, when you're using Pandas to save your file to test.csv, this is being saved to the Databricks driver node's file directory. A way to test this out is the following code snippet:
# Within Databricks, there are sample files ready to use within
# the /databricks-datasets folder
df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", inferSchema=True, header=True)
# Converting the Spark DataFrame to a Pandas DataFrame
import pandas as pd
pdDF = df.toPandas()
# Save the Pandas DataFrame to disk
pdDF.to_csv('test.csv')
The location of your test.csv is within the /databricks/driver/ folder of your Databricks cluster's driver node. To validate this:
# Run the following shell command to see the results
%sh cat test.csv
# The output directory is shown here
%sh pwd
# Output
# /databricks/driver
To save the file to your local machine (i.e. your Mac), you can view the Spark DataFrame using the display command within your Databricks notebook. From there, you can click on the "Download to CSV" button.
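Another hedged option is to copy the file from the driver's local disk into the FileStore area of DBFS, which Databricks exposes for browser download (the target path below is just an example):
# Copy the CSV from the driver's local filesystem into /FileStore on DBFS.
dbutils.fs.cp("file:/databricks/driver/test.csv", "dbfs:/FileStore/test.csv")
# It should then be downloadable in a browser at
# https://<databricks-instance>/files/test.csv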
I am trying to insert data into MariaDB using PySpark and JDBC, but it seems that PySpark doesn't generate the right SQL. My Spark version is 2.1.0; I didn't have this problem until the manager of the cluster updated Spark from 1.6.1 to 2.1.0. Here is my Python code:
from pyspark.sql import Row, SparkSession as SS

if __name__ == "__main__":
    spark = SS.builder.appName("boot_count").getOrCreate()
    sc = spark.sparkContext
    l = [(str(20160101), str(1))]
    rdd = sc.parallelize(l)
    rdd = rdd.map(lambda x: Row(day=x[0], count=x[1]))
    df = spark.createDataFrame(rdd)
    df.createOrReplaceTempView("boot_count")
    mysql_url = "jdbc:mariadb://master.cluster:3306/dbname"
    properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}
    df.write.jdbc(url=mysql_url, table="boot_count", mode="append",
                  properties=properties)
Here is my error information
Caused by: java.sql.SQLSyntaxErrorException: (conn:364) You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '"count","day") VALUES ('1','20160101')' at line 1 Query is : INSERT INTO boot_count ("count","day") VALUES ('1','20160101')
I used this command in MariaDB to solve the problem:
> set global sql_mode=ANSI_QUOTES
Either put backticks around the column names or use the setting that allows double quotes around column names.
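Alternatively, a hedged sketch of enabling ANSI_QUOTES for just this connection from the PySpark side via the JDBC URL (sessionVariables is a MariaDB Connector/J connection option; please verify it against your driver version):
# Minimal sketch, reusing df and the connection details from the question.
mysql_url = ("jdbc:mariadb://master.cluster:3306/dbname"
             "?sessionVariables=sql_mode=ANSI_QUOTES")
properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}
df.write.jdbc(url=mysql_url, table="boot_count", mode="append",
              properties=properties)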