I'm curious if there's a way to reference Databricks tables without importing them to every Databricks notebook.
Here's what I normally do:
'''
# Load the required tables
df1 = spark.read.load("dbfs:/hive_metastore/cadors_basic_event")
# Convert the dataframe to a temporary view for SQL processing
df1.createOrReplaceTempView('Event')
# Perform join to create master table
master_df = spark.sql(f'''
SELECT O.CADORSNUMBER, O.EVENT_CD, O.EVENT_SEQ_NUM,
       E.EVENT_NAME_ENM, E.EVENT_NAME_FNM, E.EVENT_DESCRIPTION_ETXT, E.EVENT_DESCRIPTION_FTXT,
       E.EVENT_GROUP_TYPE_CD, O.DATE_CREATED_DTE, O.DATE_LAST_UPDATE_DTE
FROM Occ_Events O INNER JOIN Event E
ON O.EVENT_CD = E.EVENT_CD
ORDER BY O.CADORSNUMBER''')
'''
However, I also remember in SQL Server Management Studio, you could easily reference these tables and their fields without having to "import" the table into each notebook like I did above. For example:
'''
SELECT occ.cadorsnumber,
       occ_evt.event_seq_num, occ_evt.event_cd,
       evt.event_name_enm, evt.event_group_type_cd,
       evt_grp.event_group_type_elbl
FROM cadorsstg.occurrence_information occ
JOIN cadorsstg.occurrence_events occ_evt ON (occ_evt.cadorsnumber = occ.cadorsnumber)
JOIN cadorsstg.ta003_event evt ON (evt.event_cd = occ_evt.event_cd)
JOIN cadorsstg.ta012_event_group_type evt_grp ON (evt_grp.event_group_type_cd = evt.event_group_type_cd)
WHERE occ.date_deleted_dte IS NULL AND occ_evt.date_deleted_dte IS NULL
ORDER BY occ.cadorsnumber, occ_evt.event_seq_num;
'''
The way I do it currently is not really scalable and gets very tedious when I'm working with multiple tables. If there's a better way to do this, I'd highly appreciate any tips/advice.
I've tried using SELECT/USE SCHEMA (database name), but that didn't work.
I agree with David; there are several ways to do this, and you are confusing the concepts. I am going to add some links for you to study.
1 - Data is stored in files. The storage can be either remote or local. To use remote storage, I suggest mounting it, since a mount gives older Python libraries access to the storage. Only utilities such as dbutils.fs() understand URLs.
https://www.mssqltips.com/sqlservertip/7081/transform-raw-file-refined-file-microsoft-azure-databricks-synapse/
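For example, here is a minimal sketch of mounting an Azure Blob Storage container with dbutils.fs.mount (the storage account, container, and secret scope names below are placeholders, not values from the tip above):
# mount remote storage so it appears under /mnt like local storage
dbutils.fs.mount(
    source = "wasbs://raw@mystorageacct.blob.core.windows.net",
    mount_point = "/mnt/raw",
    extra_configs = {
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    }
)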
2 - Data engineering is used to join and transform input files into a new output file. The spark.read() and spark.write() methods are key to reading and writing files using the power of the cluster.
https://learn.microsoft.com/en-us/azure/databricks/clusters/configure
This same processing can be done with plain Python libraries, but it will not leverage the power of the worker nodes; it will run only on the driver node. Please look into the high-level design of a cluster.
3 - Data engineering can be done with dataframes. But this means you have to get very good at the methods associated with the object.
https://learn.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes
In the example below, I read in two sample data files, join them, remove a duplicate column, and save the result as a new file.
A - The read-files code is the same for both design patterns (dataframe methods + Spark SQL).
# read in low temps
path1 = "/databricks-datasets/weather/low_temps"
df1 = (
spark.read
.option("sep", ",")
.option("header", "true")
.option("inferSchema", "true")
.csv(path1)
)
# read in high temps
path2 = "/databricks-datasets/weather/high_temps"
df2 = (
spark.read
.option("sep", ",")
.option("header", "true")
.option("inferSchema", "true")
.csv(path2)
)
B - The data engineering code uses methods when dealing with data frames.
# rename columns - file 1
df1 = df1.withColumnRenamed("temp", "low_temp")
# rename columns - file 2
df2 = df2.withColumnRenamed("temp", "high_temp")
df2 = df2.withColumnRenamed("date", "date2")
# join + drop col
df3 = df1.join(df2, df1["date"] == df2["date2"]).drop("date2")
# show top 5 rows
display(df3.head(5))
C - The write-files code is the same for both design patterns (dataframe methods + Spark SQL).
Now that the data frame (df3) has our data, we write it to storage. The /lake/bronze directory is on local storage; it is a make-believe data lake.
# How many partitions?
df3.rdd.getNumPartitions()
# Write out parquet file with 1 partition
dst_path = "/lake/bronze/weather/temp"
(
df3.repartition(1).write
.format("parquet")
.mode("overwrite")
.save(dst_path)
)
4 - Data engineering can be done with Spark SQL. But this means you have to expose the datasets as temporary views. Both steps A + C are the same.
B.1 - This code exposes the dataframes as temporary views.
# create temp view
df1.createOrReplaceTempView("tmp_low_temps")
# create temp view
df2.createOrReplaceTempView("tmp_high_temps")
B.2 - This code replaces the dataframe methods with Spark SQL.
# make sql string
sql_stmt = """
select
l.date as obs_date,
h.temp as obs_high_temp,
l.temp as obs_low_temp
from
tmp_high_temps as h
join
tmp_low_temps as l
on
h.date = l.date
"""
# execute
df3 = spark.sql(sql_stmt)
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/
5 - Last but not least, who wants to always query the data using a data frame? We can create a Hive database and table to expose the file stored on the storage.
I have a utility function that finds the saved part file in the temporary subdirectory and renames it.
# create single file
unwanted_file_cleanup("/lake/bronze/weather/temp/", "/lake/bronze/weather/temperature-data.parquet", "parquet")
Finally, we create a database and table. Look into the concepts of managed and unmanaged tables as well as a remote metastore. I usually use an unmanaged table with the default Hive metastore.
%sql
DROP DATABASE IF EXISTS talks CASCADE
%sql
CREATE DATABASE IF NOT EXISTS talks
%sql
CREATE TABLE talks.weather_observations
USING PARQUET
LOCATION '/lake/bronze/weather/temperature-data.parquet'
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table.html
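Once the table exists in the metastore, any notebook attached to the same workspace can reference it by name - no spark.read or temporary view required - which is exactly what the original question was asking for:
# query the registered table directly from any notebook
df = spark.table("talks.weather_observations")
display(df.limit(5))
# or with Spark SQL, much like you would in SSMS
df2 = spark.sql("SELECT * FROM talks.weather_observations")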
In short, I hope you now have a good understanding of data processing using either dataframe methods or Spark SQL.
Sincerely
John Miner ~ The Crafty DBA ~ Data Platform MVP
PS: I have a couple of videos out on YouTube on this topic.
Related
I have an S3 bucket with the structure //storage-layer/raw/__SOME_FOLDERS__, e.g. //storage-layer/raw/GTest and //storage-layer/raw/HTest. These folders can potentially contain a few other folders as well, such as raw/GTest/abc and raw/HTest/xyz. The folders abc and xyz from GTest and HTest will not overlap.
I have successfully set up a Spark Structured Streaming job to monitor raw/GTest/abc for incoming parquet files and write the results out to the console.
def process_row(df, epoch_id):
df.show()
# Structured Streaming
(
self.spark
.readStream
.format("parquet")
.option("maxFilesPerTrigger", 20)
.option("inferSchema", "true")
.load("s3a://storage-layer/raw/GTest/abc/*")
.writeStream
.format("console")
.outputMode("append")
.trigger(processingTime="5 seconds")
# .foreachBatch(process_row)
.start()
.awaitTermination()
)
My problem is: how can I set up one structured streaming app to readStream from the upper folder storage-layer/raw/*, do some processing on it, and save it into a completely different folder/bucket in S3?
I have taken a look at foreachBatch above, but I'm not sure how to set it up to achieve the end result. I get the error message: Unable to infer schema for Parquet. It must be specified manually.
Example of end result:
parquet files saving into s3 storage-layer/raw/GTest/abc -> structured streamed + processed into storage-layer/processed/GTest/abc as parquet file.
parquet files saving into s3 storage-layer/raw/HTest/xyz -> structured streamed + processed into storage-layer/processed/HTest/xyz as parquet file.
Regarding "Unable to infer schema for Parquet. It must be specified manually.": a Spark stream cannot infer the schema automatically the way a static read can, so you need to provide the schema for the data at s3a://storage-layer/raw/* explicitly, either programmatically or stored in an external file. Have a look at this.
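For instance, a minimal PySpark sketch of supplying the schema programmatically (the column names below are placeholders, not the real layout of your files):
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType
# hypothetical columns - replace with the actual schema of the parquet files
raw_schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_time", TimestampType(), True),
    StructField("value", DoubleType(), True),
])
stream_df = (
    spark.readStream
    .schema(raw_schema)          # streaming file sources require an explicit schema
    .format("parquet")
    .option("maxFilesPerTrigger", 20)
    .load("s3a://storage-layer/raw/*")
)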
You have two different source locations, so normally you would need two readStreams. If the data at storage-layer/raw/* has the same schema and you want to achieve this with only one readStream, then include an extra field such as stream_source_path at write time; the process that writes data to storage-layer/raw/* should populate this field. Your streaming app then knows which source location each record was read from, and you can derive two data frames from a single readStream based on the stream_source_path value.
The above two data frames can now be written to separate sinks.
Spark has out-of-the-box support for the file sink, and you want to write data in parquet format, so you don't need a foreach or foreachBatch implementation.
Code snippet -
// imports assumed for this snippet (Spark 3.x with the spark-avro module on the classpath)
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType
// derive the Spark schema from an Avro .avsc schema file
val schemaObj = new Schema.Parser().parse(avsc_schema_file)
val schema = SchemaConverters.toSqlType(schemaObj).dataType.asInstanceOf[StructType]
val stream = sparkSession.readStream
.schema(schema)
.format("parquet")
.option("cleanSource","archive")
.option("maxFilesPerTrigger", "1")
.option("sourceArchiveDir","storage-layer/streaming_source_archive/")
.option("latestFirst", value = true)
.load("s3a://storage-layer/raw/*")
val df_abc = stream.filter(col("stream_source_path") === "storage-layer/raw/GTest/abc")
val df_xyz = stream.filter(col("stream_source_path") === "storage-layer/raw/HTest/xyz")
df_abc.writeStream
.format("parquet")
.option("path", "storage-layer/processed/GTest/abc")
.option("checkpointLocation", "storage-layer/streaming_checkpoint/GTest/abc")
.trigger(Trigger.ProcessingTime("2 seconds"))
.start()
df_xyz.writeStream
.format("parquet")
.option("path", "storage-layer/processed/GTest/xyz")
.option("checkpointLocation", "storage-layer/streaming_checkpoint/GTest/xyz")
.trigger(Trigger.ProcessingTime("2 seconds"))
.start()
sparkSession.streams.active.foreach(x => x.awaitTermination())
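On the producer side, a rough PySpark sketch of how the process that writes into storage-layer/raw/* could populate the extra stream_source_path column (source_df and the exact paths are placeholders):
from pyspark.sql.functions import lit
# tag every record with its source folder before landing it in raw/,
# so the single readStream above can split the stream again downstream
(
    source_df
    .withColumn("stream_source_path", lit("storage-layer/raw/GTest/abc"))
    .write
    .mode("append")
    .parquet("s3a://storage-layer/raw/GTest/abc/")
)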
I am running a custom component in Kubeflow to do some data manipulation and then save the result as a BigQuery table. How do I register the table as an artifact so that I can pass it down to the different stages of the pipeline?
Eventually I am planning on using a ParallelFor op to create multiple BigQuery tables, from which I will create multiple machine learning models. I would like to be able to pass these tables to the next stage so that I can create models from them.
Currently what I am doing is just saving the URI into a pandas dataframe:
import pandas as pd
from kfp.v2.dsl import Dataset, Output  # kfp v2 artifact types

def get_the_data(
    project_id: str,
    url: str,
    dataset_uri: Output[Dataset],
    lag: int = 0,
):
    ## table name
    table_id = url + "_lag_" + str(lag)
    ## code to query and create new table
    ##
    ##
    ## store the URI in a dataframe which can be passed to the next stage
    df = pd.DataFrame(data=[table_id], columns=['path'])
    df.to_csv(dataset_uri.path + ".csv", index=False, encoding='utf-8-sig')
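A downstream component can then read the table id back out of that CSV, roughly like this (illustrative sketch only, not my real component):
import pandas as pd
from kfp.v2.dsl import Dataset, Input

def use_the_table(
    dataset_uri: Input[Dataset],
):
    ## read back the table id written by the previous component
    table_id = pd.read_csv(dataset_uri.path + ".csv")["path"].iloc[0]
    ## query the BigQuery table / train a model here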
Eventually I am going to be using a ParallelFor op to run this component multiple times in parallel and create multiple tables. I don't know how to manage and collect the table IDs so I can run subsequent ops on them.
I keep multiple files with similar names in Google Cloud Storage. My data arrives there daily via an API, and I want to add it to my table in BigQuery, which is refreshed daily; I do this via Python. When I use WRITE_TRUNCATE, it deletes the table and creates a new one from all the files in the storage, but I need to protect my table because it contains historical data that is not in the storage. When I use WRITE_APPEND, I have a duplicate problem because it adds all the files in the storage again. Does anyone have a suggested solution?
Here is a piece of my code:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True
job_config.max_bad_records = 5
uri = "gs://" + gcsbucket + "/" + tableprefix ```
@AhmetBuğraBUĞA, as you mentioned in the comment, the date names are written at the end of the data, and the problem can be solved by adding the date in BigQuery as below, which takes a single day from the dates.
(dt.datetime.today() - dt.timedelta(days=1)).strftime("%Y-%m-%d")
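For example, a rough sketch of using that date to load only yesterday's files with WRITE_APPEND (the URI construction assumes the object names end with the date, as discussed; gcsbucket, tableprefix, and table_id are placeholders from the question):
import datetime as dt
from google.cloud import bigquery

client = bigquery.Client()

# only yesterday's objects are loaded, so older files are not re-appended
yesterday = (dt.datetime.today() - dt.timedelta(days=1)).strftime("%Y-%m-%d")
uri = "gs://" + gcsbucket + "/" + tableprefix + yesterday  # add a trailing "*" if the names continue after the date

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    max_bad_records=5,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()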
I am trying to find a way to move our MySQL databases onto Amazon Redshift for its speed and scalable storage. Amazon recommends splitting the data into multiple files and using the COPY command to copy the data from S3 into the data warehouse. I am using Python to attempt to automate this process and plan to use boto3 for client-side encryption of the data.
import boto3
import psycopg2

s3 = boto3.client('s3',
aws_access_key_id='[Access key id]',
aws_secret_access_key='[Secret access key]')
filename = '[S3 file path]'
bucket_name = '[Bucket name]'
# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
#create table for data
statement = 'create table [table_name] ([table fields])'
conn = psycopg2.connect(
host='[host]',
user='[user]',
port=5439,
password='[password]',
dbname='dev')
cur = conn.cursor()
cur.execute(statement)
conn.commit()
#load data to redshift
conn_string = "dbname='dev' port='5439' user='[user]' password='[password]'
host='[host]'"
conn = psycopg2.connect(conn_string);
cur = conn.cursor()
cur.execute("""copy [table_name] from '[data location]'
access_key_id '[Access key id]'
secret_access_key '[Secret access key]'
region 'us-east-1'
null as 'NA'
delimiter ','
removequotes;""")
conn.commit()
The problem with this code is that I think I would have to create a table for every table individually and then copy it over for every file individually. Is there a way to get the data into Redshift using a single COPY for multiple files? Or is it possible to run multiple COPY statements at once? And is it possible to do this without creating a table for every single file?
Redshift does support a parallelized form of COPY from a single connection; in fact, it appears to be an anti-pattern to concurrently COPY data to the same tables from multiple connections.
There are two ways to do parallel ingestion:
Specify a common prefix in the COPY FROM, instead of a specific file name.
In this case, COPY will attempt to load all files from the bucket / folder with that prefix
OR, provide a manifest file, containing the names of the files
In both instances, you should split the source data up into an appropriate number of files of approximately equal size. Again from the docs:
Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 32 slices.
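For illustration, here is how both variants might look from Python with psycopg2, in the same placeholder style as the question (the table name orders and the S3 paths are made up):
import psycopg2

conn = psycopg2.connect(host='[host]', user='[user]', port=5439,
                        password='[password]', dbname='dev')
cur = conn.cursor()

# Variant 1: common prefix - one COPY loads every object whose key starts with
# 'tables/orders/part_', and Redshift spreads the files across the slices.
cur.execute("""copy orders
    from 's3://[bucket]/tables/orders/part_'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    delimiter ','
    removequotes;""")

# Variant 2: manifest - a JSON manifest file lists the exact files to load.
cur.execute("""copy orders
    from 's3://[bucket]/manifests/orders.manifest'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    manifest
    delimiter ','
    removequotes;""")

conn.commit()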
Edit - I am using Windows 10
Is there a faster alternative to pd.read_sql_query for an MS SQL database?
I was using pandas to read the data and add some columns and calculations. I have cut out most of the alterations now, and I am basically just reading the data (1-2 million rows per day at a time; my query reads all of the data from the previous date) and saving it to a local database (Postgres).
The server I am connecting to is across the world, and I have no privileges at all other than to query for the data. I want the solution to remain in Python if possible. I'd like to speed it up, though, and remove any overhead. Also, you can see that I am writing a file to disk temporarily and then opening it to COPY FROM STDIN. Is there a way to skip the file creation? It is sometimes over 500 MB, which seems like a waste.
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params={query_date})
df.to_csv('../raw/temp_table.csv', index=False)
df= open('../raw/temp_table.csv')
process_file(conn=pg_engine, table_name=table_name, file_object=df)
UPDATE:
You can also try to unload the data using the bcp utility, which might be a lot faster compared to pd.read_sql(), but you will need a local installation of the Microsoft Command Line Utilities for SQL Server.
After that you can use PostgreSQL's COPY ... FROM to load the exported file.
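A rough sketch of that bcp-then-COPY route (server, credentials, table name, and file paths are placeholders in the same [bracket] style as your code; bcp must be installed locally):
import subprocess
import psycopg2

# 1) Export the day's rows from SQL Server to a flat file with bcp.
#    -c = character mode, -t, = comma field terminator.
subprocess.run([
    "bcp",
    "SELECT * FROM [table_name] WHERE row_date = '[query_date]'",
    "queryout", "../raw/temp_table.csv",
    "-S", "[host]", "-U", "[user]", "-P", "[password]",
    "-c", "-t,",
], check=True)

# 2) Bulk-load the exported file into Postgres with COPY.
conn = psycopg2.connect("dbname='[dbname]' user='[user]' password='[password]' host='localhost'")
with conn, conn.cursor() as cur, open("../raw/temp_table.csv") as f:
    cur.copy_expert("COPY [table_name] FROM STDIN WITH (FORMAT csv)", f)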
OLD answer:
You can try to write your DataFrame directly to PostgreSQL (skipping the df.to_csv(...) and df= open('../raw/temp_table.csv') parts):
from sqlalchemy import create_engine
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params={query_date})
pg_engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
df.to_sql(table_name, pg_engine, if_exists='append')
Just test whether it's faster compared to COPY FROM STDIN...