I have a table in an Azure SQL database. I want to delete some data from it using the JDBC connector in PySpark.
I have tried this:
query = "delete from table where condition"
spark.read\
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", 'jdbcurl') \
.option("database", 'db') \
.option("user", "user") \
.option("password", "pass") \
.option("query",query)
But this does not seem to work. I cannot call .load(), since a DELETE does not return anything, and it gives me an error.
I found a solution here that uses a custom function defined in Scala, but I want to do it in Python.
Is there a way to do this?
These types of queries are not supported by Apache Spark unless you are using Delta.
To do this, first create a database connection using pyodbc, and then run your delete statement through it:
connection.execute("delete statement")
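A minimal sketch, assuming pyodbc and the Microsoft ODBC Driver 17 for SQL Server are available on the cluster; the server, database, and credentials are placeholders:
import pyodbc

# Connection details are placeholders; adjust the driver name, server, database and credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=db;"
    "UID=user;"
    "PWD=pass"
)

# pyodbc does not autocommit by default, so commit explicitly after the delete
conn.execute("DELETE FROM table WHERE condition")
conn.commit()
conn.close()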
Related
I'm trying to write a dataframe from a pyspark job that runs on AWS Glue to a Databricks cluster and I'm facing an issue I can't solve.
Here's the code that does the writing:
spark_df.write.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", table_name) \
.option("password", pwd) \
.mode("overwrite") \
.save()
Here's the error I'm getting:
[Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: \nno viable alternative at input '(\"TEST_COLUMN\"'(line 1, pos 82)\n\n== SQL ==\nCREATE TABLE MYDATABASE.MYTABLE (\"TEST_COLUMN\" TEXT
It seems that the issue comes from the fact that the SQL statement uses the column name with double quotes instead of single quotes, and it fails because of that.
I thought simple things like that would be handled automatically by Spark, but it seems that's not the case.
Do you know how I can solve this issue?
Thanks in advance!
I want to issue a DESC TABLE SQL command for a Snowflake table and using Azure Databricks but I can't quite figure it out! I'm not getting any errors but I'm not getting any results either. Here's the Python code I'm using:
options_vcp = {
    "sfUrl": snowflake_url,
    "sfUser": user,
    "sfPassword": password,
    "sfDatabase": db,
    "sfWarehouse": wh,
    "sfSchema": sch
}
sfUtils = sc._jvm.net.snowflake.spark.snowflake.Utils
sfUtils.runQuery(options_vcp, "DESC TABLE myTable")
I can download the Snowflake table using the "sfDatabase", "sfWarehouse", etc. values, so they seem to be correct. I can run the DESC TABLE command in Snowflake and get correct results. But the only output I'm getting from Databricks is this:
Out[1]: JavaObject id=o315
Does anyone know how to display this JavaObject or know of a different method to run DESC TABLE from Databricks?
From the docs on Executing DDL/DML SQL Statements:
The runQuery method returns only TRUE or FALSE. It is intended for statements that do not return a result set, for example DDL statements like CREATE TABLE and DML statements like INSERT, UPDATE, and DELETE. It is not useful for statements that return a result set, such as SELECT or SHOW.
An alternative approach is to use the INFORMATION_SCHEMA.COLUMNS view:
# SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "SELECT * FROM information_schema.columns WHERE table_name ILIKE 'myTable'") \
    .load()
Related: Moving Data from Snowflake to Spark:
When using DataFrames, the Snowflake connector supports SELECT queries only.
Usage Notes
Currently, the connector does not support other types of queries (e.g. SHOW or DESC, or DML statements) when using DataFrames.
I suggest using get_ddl() in your select statement to get the object definition:
https://docs.snowflake.com/en/sql-reference/functions/get_ddl.html
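A minimal sketch of that approach, reusing the options from the question and assuming the connector is registered as net.snowflake.spark.snowflake; the table name is a placeholder:
# get_ddl() runs inside a SELECT, which the connector does support
df_ddl = spark.read.format("net.snowflake.spark.snowflake") \
    .options(**options_vcp) \
    .option("query", "SELECT GET_DDL('TABLE', 'myTable')") \
    .load()

df_ddl.show(truncate=False)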
While trying to load data into Redshift from AWS S3, I am facing an issue with any column in the Redshift table of type decimal. I am able to load non-decimal numbers into Redshift, but I cannot load a datatype like NUMERIC(18,4).
DF schema in S3: A Integer, B string, C decimal(18,4), D timestamp
Redshift table schema: A INTEGER, B VARCHAR(20), C NUMERIC(18,4), D TIMESTAMP
Error Message from stl_load_errors table:
Invalid digit, Value '"', Pos 0, Type: Decimal
Data that Redshift is trying to add:
2|sample_string||2021-04-03|
Why is the decimal column coming in as empty or NULL? As you can see above, all the Redshift data comes in the proper format except the decimal column, which is empty.
This is the code that I am using to load data into Redshift from S3:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("App_name").getOrCreate()
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.conf.set("spark.kryo.unsafe", "true")
spark.conf.set("spark.kryoserializer.buffer.max", "1024m")
df = spark.read.parquet(s3_input_path)
pre_query = """
begin;
create table {} (like {});
end;
""".format(temp_table,actual_table_name)
post_query = """
begin;
--some action
insert into {} select * from {};
end;
""".format(actual_table_name,temp_table)
df.write.format('com.databricks.spark.redshiftLib') \
.option("url", "jdbc:redshift://aws.redshift.amazonaws.sampleurl.com:5439/") \
.option("user","UserName") \
.option("preactions",pre_query) \
.option("password","Password") \
.option("dbtable","table_name" ) \
.option("extracopyoptions", "ACCEPTINVCHARS AS '?' TRUNCATECOLUMNS") \
.option("postactions",post_query) \
.options("tempformat", "CSV GZIP") \
.option("tempdir", "s3a://aws-bucket/") \
.option("csvseparator","|") \
.option("forward_spark_s3_credentials","true")\
.mode("append") \
.save()
I found the problem: I was using the 2.x spark-redshift connector. In order to save the tempdir data in CSV format, you need the 3.x connector; you can use the latest version, 3.0.0-preview1.
You can upgrade the package,
or
you can pass it on the command line, like spark-submit --packages com.databricks:spark-redshift_2.10:3.0.0-preview1....
Explanation:
When writing to Redshift, data is first stored in a temp folder in S3 before being loaded into Redshift. The default format used for storing temp data between Apache Spark and Redshift is Spark-Avro. However, Spark-Avro stores a decimal as a binary, which is interpreted by Redshift as empty strings or nulls.
But I want to improve performance and get rid of the blank values, and for that purpose the CSV format is the best fit. The 2.x connector I was using defaults to the Avro tempformat, even when CSV is requested explicitly.
After supplying the 3.0.0-preview1 package on the command line, the job can use the CSV tempformat available in the 3.x release.
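For reference, a minimal sketch of the write with the CSV tempformat requested explicitly via .option (not .options); it assumes the standard com.databricks.spark.redshift source name and reuses the placeholder URL, credentials, and bucket from the question:
df.write.format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://aws.redshift.amazonaws.sampleurl.com:5439/") \
    .option("user", "UserName") \
    .option("password", "Password") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3a://aws-bucket/") \
    .option("tempformat", "CSV GZIP") \
    .option("csvseparator", "|") \
    .option("forward_spark_s3_credentials", "true") \
    .mode("append") \
    .save()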
Reference:
https://kb.databricks.com/data/redshift-fails-decimal-write.html
https://github.com/databricks/spark-redshift/issues/308
I've looked around but haven't been able to find this question yet... I'm working in a Jupyter notebook writing Python code, and a lot of the datasets we use are in Teradata, so my code usually looks like this:
cs = '''
(
select
*
from SST01.data
where snap_dt = '2020-08-31'
)foo'''
dfclnt_status = spark.read.format('jdbc') \
.option('url', 'jdbc:teradata://teradataservernamehere') \
.option('driver', 'com.teradata.jdbc.TeraDriver') \
.option('user', 'redacted') \
.option('password', PASS) \
.option('dbtable', cs) \
.load()
I know that in Spark, when running code against our Hive tables, I can pass date variables using '{VAR}', but when I try to apply the same thing in queries against Teradata I get this error:
Py4JJavaError: An error occurred while calling o233.load.
: java.sql.SQLException: [Teradata Database] [TeraJDBC 16.30.00.00] [Error 3535] [SQLState 22003] A character string failed conversion to a numeric value.
How is it possible to pass date variables into Teradata?
EDIT: My variables look like this:
END_DT='2020-08-31'
The easiest way is probably to explicitly convert your field to a date, like so:
to_date('2020-08-31')
If you're still getting an error, take a look at the table DDL. The error says the field is numeric.
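For example, a minimal sketch that interpolates the Python variable into the pushed-down subquery with an f-string; the to_date() call follows the suggestion above (depending on your Teradata release you may need an explicit format string or a DATE literal instead), and the table, column, and connection values come from the question:
END_DT = '2020-08-31'

# Interpolate the variable into the subquery and convert the string to a date explicitly
cs = f'''
(
select *
from SST01.data
where snap_dt = to_date('{END_DT}')
) foo'''

dfclnt_status = spark.read.format('jdbc') \
    .option('url', 'jdbc:teradata://teradataservernamehere') \
    .option('driver', 'com.teradata.jdbc.TeraDriver') \
    .option('user', 'redacted') \
    .option('password', PASS) \
    .option('dbtable', cs) \
    .load()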
I'm reading data from various JDBC sources using PySpark's read method. JDBC reads from Teradata, MySQL, Oracle, and SQL Server all work 100%, however, I'm now trying to read from Informix and the result is the column header repeated in the column values instead of the actual data:
query_cbu = '''
SELECT first 5
ac2_analysis_p
FROM informix.ac2_aux_cust
'''
Specifying the header option did not help:
df_cbu = \
spark.read.format("jdbc") \
.option("url", url) \
.option("dbtable", '({}) tbl'.format(query_cbu)) \
.option("user", db_username) \
.option("password", db_password) \
.option("header", "true") \
.load()
df_cbu.show()
Result:
+--------------+
|ac2_analysis_p|
+--------------+
|ac2_analysis_p|
|ac2_analysis_p|
|ac2_analysis_p|
|ac2_analysis_p|
|ac2_analysis_p|
+--------------+
Using the same JDBC driver (ifxjdbc.jar), values are returned correctly from DbVisualizer.
I can't imagine any mechanism that can cause this. Can anyone advise me where to start looking for the problem?
I do believe (and I saw this once before some time ago so going from memory here) that you need to enable DELIMIDENT in your JDBC driver URL.
DELIMIDENT=Y
https://www.ibm.com/support/knowledgecenter/en/SSGU8G_12.1.0/com.ibm.jdbc_pg.doc/ids_jdbc_040.htm#ids_jdbc_040
The reason is that while the other JDBC drivers already quote user/table names in the metadata that Spark goes after, the Informix JDBC driver does not, which confuses Spark's JDBC layer. Enabling DELIMIDENT in the driver adds that quoting. There are other repercussions to using DELIMIDENT, so make sure it does what you want, but it should be fine to turn it on.
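For example, a minimal sketch with DELIMIDENT added to the JDBC URL, reusing the query and credential variables from the question; the host, port, database, and INFORMIXSERVER values are placeholders:
# Placeholder Informix connection URL with DELIMIDENT enabled
url = "jdbc:informix-sqli://hostname:9088/mydb:INFORMIXSERVER=myserver;DELIMIDENT=Y;"

df_cbu = spark.read.format("jdbc") \
    .option("url", url) \
    .option("dbtable", '({}) tbl'.format(query_cbu)) \
    .option("user", db_username) \
    .option("password", db_password) \
    .load()

df_cbu.show()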