Unable to create table in snowflake from databricks for kafka topic - python

I am connecting to a Kafka topic, applying some transformations to the DataFrame, and writing that data to Snowflake with the help of Databricks. If the table is present, the data is written successfully. However, if it is not present, I get the following error:
net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
Object 'new_table_name' does not exist or not authorized.
This is the code I am using:
def foreach_batch_function(df, epoch_id):
    df.write.format("snowflake").options(**sfOptions).option("dbtable", "new_table_name").mode('append').save()

query = my_df.writeStream.foreachBatch(foreach_batch_function).trigger(processingTime='30 seconds').start()
query.awaitTermination()
Note: with the same user, I can create a new table manually in Snowflake.

When the mode is changed to "overwrite", the table is created, but the old data from previous batches gets deleted.
The preactions option did create the table, but it was not suitable for this scenario, because it runs for every batch and therefore fails.
So I used the following to create the table if it doesn't exist:
sf_utils = sc._jvm.net.snowflake.spark.snowflake.Utils
sf_utils.runQuery(sfOptions,"create table if not exists {table_name} like existing_table".format(table_name=db_table))
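Putting these pieces together, here is a minimal sketch of the full flow under the same assumptions as the snippets above (sfOptions, my_df, db_table, and the schema-source table existing_table all come from the question; this is one way to arrange it, not the only one):

sf_utils = sc._jvm.net.snowflake.spark.snowflake.Utils

# Create the target table once, up front, so every micro-batch can simply append.
sf_utils.runQuery(
    sfOptions,
    "create table if not exists {table_name} like existing_table".format(table_name=db_table),
)

def foreach_batch_function(df, epoch_id):
    # The table is guaranteed to exist here, so append mode no longer fails.
    (df.write.format("snowflake")
        .options(**sfOptions)
        .option("dbtable", db_table)
        .mode("append")
        .save())

query = (my_df.writeStream
    .foreachBatch(foreach_batch_function)
    .trigger(processingTime="30 seconds")
    .start())
query.awaitTermination()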

Related

Tableau error Missing data for not-null field when using Redshift table

I have loaded a pandas dataframe into a new Redshift table using a SQLAlchemy create_engine connection and the pd.to_sql function.
However, when I load the table in Tableau, it throws the "Missing data for not-null field" error.
I know there is a BLANKASNULL function, but I'm not sure where to apply it when pushing the dataframe into Redshift. Can someone please help?
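BLANKASNULL is a Redshift COPY option, so it does not apply directly to pd.to_sql. One possible workaround is to turn blank strings into NULLs in pandas before loading; a minimal sketch, with a hypothetical connection string, table name, and sample data:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Redshift connection and sample data.
engine = create_engine("postgresql+psycopg2://user:password@redshift-host:5439/mydb")
df = pd.DataFrame({"id": [1, 2, 3], "name": ["alice", "", "  "]})

# Replace empty or whitespace-only strings with NaN so they load as SQL NULLs,
# which is roughly what Redshift's BLANKASNULL COPY option does.
df = df.replace(r"^\s*$", np.nan, regex=True)

df.to_sql("my_table", engine, index=False, if_exists="replace")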

Pyspark: Incremental load, How to overwrite/update Hive table where data is being read

I'm currently writing a script for a daily incremental ETL. I used an initial-load script to load base data into a Hive table. I then created a daily incremental script that reads from that same table and uses its data to run the second script.
Initially, I tried to APPEND the new data with the daily incremental script, but that created duplicate rows. So now I'm attempting to OVERWRITE the Hive table instead, which produces the exception below.
I noticed others with a similar issue, who want to read and overwrite the same table, have tried refreshTable before overwriting. I tried this solution as well, but I'm still receiving the same error.
Maybe I should refresh the table path as well?
Thanks
The Error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://essnp/sync/dev_42124_edw_b/EDW/SALES_MKTG360/RZ/FS_FLEET_ACCOUNT_LRF/Data/part-00000-4db6432b-f59c-4112-83c2-672140348454-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
End of my code:
###### Loading the LRF table ######
spark.catalog.refreshTable(TABLE_NAME)
hive.write_sdf_as_parquet(spark,final_df_converted,TABLE_NAME,TABLE_PATH,mode='overwrite')
print("LOAD COMPLETED " + str(datetime.now()))
####### Ending SparkSession #######
spark.sparkContext.stop()
spark.stop()
It's not good practice to read from and write to the same path. Because of Spark's DAG lineage, the read and the write can sometimes happen at the same time, so this error is expected.
It is better to read from one location and write to another.
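A minimal sketch of that pattern, with a hypothetical staging path and placeholder merge logic: read the table, write the merged result to a different location, then reload the table from that location.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental_load").getOrCreate()

TABLE_NAME = "edw.fs_fleet_account_lrf"            # hypothetical table name
STAGING_PATH = "/tmp/fs_fleet_account_lrf_staging" # hypothetical staging location

# 1. Read the existing table into a DataFrame.
base_df = spark.table(TABLE_NAME)

# 2. Apply the incremental merge logic (placeholder; e.g. union with the daily
#    extract and drop duplicates on the business key).
final_df = base_df

# 3. Write to a different location first, so the files being read are never
#    deleted while the job is still using them.
final_df.write.mode("overwrite").parquet(STAGING_PATH)

# 4. Read the staged result back and overwrite the table from it.
spark.read.parquet(STAGING_PATH) \
    .write.mode("overwrite").saveAsTable(TABLE_NAME)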

Issue with Bigquery table created using Dataframe in Python

I have created a temporary BigQuery table using Python and loaded data from a pandas dataframe (code snippet given below).
client = bigquery.Client(project)
client.create_table(tmp_table)
client.load_table_from_dataframe(df, tmp_table)
The table is created successfully and I can run select queries from the web UI.
But when I run a select query using Python:
query =f"""select * from {tmp_table.project_id}.{tmp_table.dataset_id}.{tmp_table.table_id} """
it throws the error "select * would expand to zero columns".
This is because Python is not able to detect any schema; the query below returns null:
print(tmp_table.schema)
If I hardcode the table name as below, it works fine:
query =f"""select * from project_id.dataset_id.table_id """
Can someone suggest how I can get data from the temporary table using a select query in Python? I can't hardcode the table name, as it's created at runtime.
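One thing worth checking: load_table_from_dataframe returns a job that runs asynchronously, and the local tmp_table object is never refreshed, so its schema stays empty. A minimal sketch that waits for the load and re-fetches the table before querying (project, dataset, and table names are hypothetical):

from google.cloud import bigquery
import pandas as pd

project = "my-project"                        # hypothetical
table_id = f"{project}.my_dataset.tmp_table"  # hypothetical fully-qualified name

client = bigquery.Client(project=project)
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Load the dataframe and block until the load job finishes.
load_job = client.load_table_from_dataframe(df, table_id)
load_job.result()

# Re-fetch the table so the local object carries the server-side schema.
tmp_table = client.get_table(table_id)
print(tmp_table.schema)  # now populated

# Query using the refreshed table reference.
query = f"SELECT * FROM `{tmp_table.project}.{tmp_table.dataset_id}.{tmp_table.table_id}`"
for row in client.query(query).result():
    print(dict(row.items()))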

When I save a PySpark DataFrame with saveAsTable in AWS EMR Studio, where does it get saved?

I can save a dataframe using df.write.saveAsTable('tableName') and read the resulting table with spark.table('tableName'), but I'm not sure where the table actually gets saved.
It is stored under the default location of your database.
You can get the location by running the following Spark SQL query:
spark.sql("DESCRIBE TABLE EXTENDED tableName")
You can find the Location under the # Detailed Table Information section.
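If you want the location programmatically rather than by scanning the output, a minimal sketch that filters the DESCRIBE output down to the Location row (tableName is a placeholder):

# Sketch: pull just the storage location out of DESCRIBE TABLE EXTENDED.
details = spark.sql("DESCRIBE TABLE EXTENDED tableName")
location = details.filter("col_name = 'Location'").select("data_type").first()[0]
print(location)  # e.g. an s3:// or hdfs:// path under the database's default location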

Google BigQuery Results Don't Show

I created a Python script that pushes a pandas dataframe into Google BigQuery, and I am able to query the table directly in GBQ. However, another user is unable to view the results when they query that same table I generated in GBQ. This seems to be a BigQuery issue, because when they connected to GBQ and queried the table indirectly using pandas (pd.read_gbq("SELECT * FROM ...", project_id)), it worked fine. What is causing this strange behaviour?
I've encountered this when loading tables to BigQuery via Python GBQ. If you take the following steps, the table will display properly:
1. Load the dataframe to BigQuery via Python GBQ.
2. Run SELECT * FROM uploaded_dataset.uploaded_dataset; doing so will properly show the table.
3. Within the BigQuery UI, save the table as a new table name.
From there, you will be able to see the table properly. Unfortunately, I don't know how to resolve this without a manual step in the UI.
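If the manual UI step becomes a bottleneck, one option that may achieve the same effect is to copy the uploaded table into a new table with CREATE TABLE ... AS SELECT from Python; a minimal sketch with hypothetical project and destination names (the source table name mirrors the query above):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Copy the uploaded table into a fresh table, mirroring the manual
# "save as new table" step in the BigQuery UI.
sql = """
CREATE OR REPLACE TABLE `my-project.uploaded_dataset.uploaded_table_copy` AS
SELECT * FROM `my-project.uploaded_dataset.uploaded_dataset`
"""
client.query(sql).result()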
