I am running a custom component in Kubeflow to do some data manipulation and then save the result as a BigQuery table. How do I register the table as an artifact so that I can pass it down to the different stages of the pipeline?
Eventually I am planning on using a ParallelFor op to create multiple BigQuery tables, from which I will create multiple machine learning models. I would like to be able to pass these tables to the next stage so that I can create models from them.
Currently what I am doing is just saving the URI into a pandas dataframe:
from kfp.v2.dsl import Dataset, Output   # on newer SDK versions this is `from kfp.dsl import ...`
import pandas as pd

def get_the_data(
    project_id: str,
    url: str,
    dataset_uri: Output[Dataset],
    lag: int = 0,
):
    ## table name
    table_id = url + "_lag_" + str(lag)

    ## code to query and create new table
    ##
    ##

    ## store URI in a dataframe which can be passed to next stage
    df = pd.DataFrame(data=[table_id], columns=['path'])
    df.to_csv(dataset_uri.path + ".csv", index=False, encoding='utf-8-sig')
Eventually I am going to be using a ParallelFor op to run this component multiple times in parallel and create multiple tables. I don't know how to manage and collect the table IDs so I can run subsequent ops on them.
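For reference, this is roughly how a downstream component can read the table ID back out of that CSV artifact today (a sketch only; the component name, the packages_to_install argument, and the kfp.v2.dsl import path are illustrative assumptions, not part of my actual pipeline):

from kfp.v2.dsl import Dataset, Input, component

@component(packages_to_install=["pandas"])
def train_from_table(dataset_uri: Input[Dataset]):
    import pandas as pd

    # Read back the table ID written by get_the_data.
    df = pd.read_csv(dataset_uri.path + ".csv")
    table_id = df["path"].iloc[0]
    # ... query BigQuery with table_id and train a model ...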
I'm curious if there's a way to reference Databricks tables without importing them to every Databricks notebook.
Here's what I normally do:
# Load the required tables
df1 = spark.read.load("dbfs:/hive_metastore/cadors_basic_event")

# Convert the dataframe to a temporary view for SQL processing
df1.createOrReplaceTempView('Event')

# Perform join to create master table
master_df = spark.sql('''
SELECT O.CADORSNUMBER, O.EVENT_CD, O.EVENT_SEQ_NUM,
       E.EVENT_NAME_ENM, E.EVENT_NAME_FNM, E.EVENT_DESCRIPTION_ETXT, E.EVENT_DESCRIPTION_FTXT,
       E.EVENT_GROUP_TYPE_CD, O.DATE_CREATED_DTE, O.DATE_LAST_UPDATE_DTE
FROM Occ_Events O INNER JOIN Event E
    ON O.EVENT_CD = E.EVENT_CD
ORDER BY O.CADORSNUMBER''')
However, I also remember in SQL Server Management Studio, you could easily reference these tables and their fields without having to "import" the table into each notebook like I did above. For example:
SELECT occ.cadorsnumber,
       occ_evt.event_seq_num, occ_evt.event_cd,
       evt.event_name_enm, evt.event_group_type_cd,
       evt_grp.event_group_type_elbl
FROM cadorsstg.occurrence_information occ
JOIN cadorsstg.occurrence_events occ_evt ON (occ_evt.cadorsnumber = occ.cadorsnumber)
JOIN cadorsstg.ta003_event evt ON (evt.event_cd = occ_evt.event_cd)
JOIN cadorsstg.ta012_event_group_type evt_grp ON (evt_grp.event_group_type_cd = evt.event_group_type_cd)
WHERE occ.date_deleted_dte IS NULL AND occ_evt.date_deleted_dte IS NULL
ORDER BY occ.cadorsnumber, occ_evt.event_seq_num;
The way I do it currently is not really scalable and gets very tedious when I'm working with multiple tables. If there's a better way to do this, I'd highly appreciate any tips/advice.
I've tried using SELECT/USE SCHEMA (database name), but that didn't work.
I agree with David; there are several ways to do this, and you are confusing the concepts. I am going to add some links for you to study.
1 - Data is stored in files. The storage can be either remote or local. To use remote storage, I suggest mounting it, since mounting allows older Python libraries access to the storage. Only utilities such as dbutils.fs() understand URLs.
https://www.mssqltips.com/sqlservertip/7081/transform-raw-file-refined-file-microsoft-azure-databricks-synapse/
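For example, a rough sketch of mounting an Azure Blob Storage container with dbutils (the storage account, container, and secret scope names below are placeholders; see the linked article for a full walk-through):

# Mount a blob container so that plain Python code can read /mnt/bronze like a local path.
configs = {
    "fs.azure.account.key.mystorageacct.blob.core.windows.net":
        dbutils.secrets.get(scope="my-scope", key="storage-account-key")
}
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageacct.blob.core.windows.net/",
    mount_point="/mnt/bronze",
    extra_configs=configs,
)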
2 - Data engineering is used to join and transform input files into a new output file. spark.read() and spark.write() are key to reading and writing files using the power of the cluster.
https://learn.microsoft.com/en-us/azure/databricks/clusters/configure
This same processing can be done with plain Python libraries, but it will not leverage the power of the worker nodes; it will run only on the driver node. Please look into the high-level design of a cluster.
3 - Data engineering can be done with dataframes. But this means you have to get very good at the methods associated with the object.
https://learn.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes
In the example below, I read in two sample data files. I join the files, remove a duplicate column, and save the result as a new file.
A - Read files code is the same for both design patterns (dataframes + pyspark)
# read in low temps
path1 = "/databricks-datasets/weather/low_temps"
df1 = (
    spark.read
    .option("sep", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path1)
)

# read in high temps
path2 = "/databricks-datasets/weather/high_temps"
df2 = (
    spark.read
    .option("sep", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path2)
)
B - The data engineering code uses methods when dealing with data frames.
# rename columns - file 1
df1 = df1.withColumnRenamed("temp", "low_temp")
# rename columns - file 2
df2 = df2.withColumnRenamed("temp", "high_temp")
df2 = df2.withColumnRenamed("date", "date2")
# join + drop col
df3 = df1.join(df2, df1["date"] == df2["date2"]).drop("date2")
# show top 5 rows
display(df3.head(5))
C - Write files code is the same for both design patterns (dataframes + pyspark)
Now that the data frame (df3) has our data, we write it to storage. The /lake/bronze directory is on local storage; it is a make-believe data lake.
# How many partitions?
df3.rdd.getNumPartitions()
# Write out a parquet file with 1 partition
dst_path = "/lake/bronze/weather/temp"
(
    df3.repartition(1).write
    .format("parquet")
    .mode("overwrite")
    .save(dst_path)
)
4 - Data engineering can be done with Spark SQL. But this means you have to expose the datasets as temporary views. Both steps A + C are the same.
B.1 - This code exposes the dataframes as temporary views.
# create temp view
df1.createOrReplaceTempView("tmp_low_temps")
# create temp view
df2.createOrReplaceTempView("tmp_high_temps")
B.2 - This code replaces the methods with Spark SQL (pyspark).
# make sql string
sql_stmt = """
select
l.date as obs_date,
h.temp as obs_high_temp,
l.temp as obs_low_temp
from
tmp_high_temps as h
join
tmp_low_temps as l
on
h.date = l.date
"""
# execute
df3 = spark.sql(sql_stmt)
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/
5 - Last but not least, who wants to always query the data through a data frame? We can create a Hive database and table to expose the stored file on storage.
I have a utility function that finds the saved part file and renames it out of the temporary subdirectory.
# create single file
unwanted_file_cleanup("/lake/bronze/weather/temp/", "/lake/bronze/weather/temperature-data.parquet", "parquet")
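The original helper is not shown here, but a minimal sketch of what it might look like (an assumption on my part; it relies on dbutils being available in the notebook):

def unwanted_file_cleanup(src_dir: str, dst_path: str, extension: str):
    # Spark writes its output as part files inside a temporary directory;
    # find the single part file with the requested extension.
    part_files = [f.path for f in dbutils.fs.ls(src_dir) if f.path.endswith("." + extension)]
    if len(part_files) != 1:
        raise ValueError(f"Expected exactly one .{extension} file in {src_dir}, found {len(part_files)}")
    # Copy it to the desired single-file name, then remove the temporary directory.
    dbutils.fs.cp(part_files[0], dst_path)
    dbutils.fs.rm(src_dir, recurse=True)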
Last but not least, we create a database and table. Look into the concepts of managed and unmanaged tables, as well as remote metastores. I usually use an unmanaged table with the default Hive metastore.
%sql
DROP DATABASE IF EXISTS talks CASCADE
%sql
CREATE DATABASE IF NOT EXISTS talks
%sql
CREATE TABLE talks.weather_observations
USING PARQUET
LOCATION '/lake/bronze/weather/temperature-data.parquet'
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table.html
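Once the table exists, any notebook attached to the workspace can query it by name without re-reading the file into a data frame first; for example:

# Query the table registered above by name from any notebook.
display(spark.sql("SELECT * FROM talks.weather_observations LIMIT 5"))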
In short, I hope you now have a good understanding of data processing using either data frames or pyspark.
Sincerely
John Miner ~ The Crafty DBA ~ Data Platform MVP
PS: I have a couple of videos out on YouTube somewhere on this topic.
I have created a column family in my local Cassandra as below with cqlsh.
CREATE TABLE sample.stackoverflow_question12 (
id1 int,
class1 int,
name1 text,
PRIMARY KEY (id1)
)
I have a sample CSV file named "data.csv", and the data in the file is as below.
id1 | name1 |class1
1 | hello | 10
2 | world | 20
I used the Python code below to connect to the database and load the data from the CSV, using Anaconda (after installing the Cassandra driver with pip in Anaconda).
# Connecting to the local Cassandra server
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(["127.0.0.1"], auth_provider=auth_provider, protocol_version=4)
session = cluster.connect()
session.set_keyspace('sample')

# File loading
prepared = session.prepare('INSERT INTO stackoverflow_question12 (id1, class1, name1) VALUES (?, ?, ?)')
with open('D:/Cassandra/NoSQL/data.csv', 'r') as fares:
    for fare in fares:
        columns = fare.split(",")
        id1 = columns[0]
        class1 = columns[1]
        name1 = columns[2]
        session.execute(prepared, [id1, class1, name1])
# the with block closes the file automatically
When I executed the above code, I got the error below:
Received an argument of invalid type for column "id1". Expected: <class 'cassandra.cqltypes.Int32Type'>, Got: <class 'str'>; (required argument is not an integer)
When I changed the data types to text and ran the code again, it loaded the data, but with the header row included too.
Can anyone help me change my code so it loads the data without the header content? A working version of your own code would also be fine.
The reason I named the columns id1 and class1 is that id and class are keywords and throw errors in the code when used within the "fares" loop.
But in the real world, the column names would be class and id. How do I run the code when columns like these come into the picture?
Another question I have in mind: Cassandra stores the primary key first and the remaining columns in ascending order. Can we load CSV columns whose order does not match the order in which Cassandra stores its columns?
Based on this, I need to build another solution.
You need to use types according to your schema - for integer columns you need to use int(columns[...]), because split() generates strings. If you want to skip the header, then you can do something like this:
cnt = 0
with open('D:/Cassandra/NoSQL/data.csv', 'r') as fares:
    for fare in fares:
        cnt += 1
        if cnt == 1:  # skip the header row
            continue
        ...
Although it's better to use Python's built-in csv reader, which can be configured to skip the header automatically...
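For example, a sketch with the csv module (it reuses the session and prepared statement from the question, and assumes the column order id1, name1, class1 from the sample file):

import csv

with open('D:/Cassandra/NoSQL/data.csv', newline='') as fares:
    reader = csv.reader(fares)
    next(reader)  # skip the header row
    for id1, name1, class1 in reader:
        # cast to match the schema: id1 and class1 are int columns
        session.execute(prepared, [int(id1), int(class1), name1])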
P.S. If you just want to load data from a CSV, I recommend using external tools like DSBulk, which is flexible and heavily optimized for that task. See the following blog posts for examples:
https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading
https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading
https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings
https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting
https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations
Has anyone worked with Amazon Quantum Ledger Database (QLDB) Amazon Ion files? If so, do you know how to extract the "data" part to formulate tables? Maybe use Python to scrape the data?
I am trying to get the "data" information from these files, which are stored in S3 (I don't have access to QLDB, so I cannot query it directly), and then upload the results to Glue.
I am trying to perform an ETL job using Glue, but Glue doesn't like Amazon Ion files, so I need to either query data from these files or scrape the files for relevant information.
Thanks.
PS: by "data" information I mean this:
{
    PersonId: "4tPW8xtKSGF5b6JyTihI1U",
    LicenseNumber: "LEWISR261LL",
    LicenseType: "Learner",
    ValidFromDate: 2016-12-20,
    ValidToDate: 2020-11-15
}
ref : https://docs.aws.amazon.com/qldb/latest/developerguide/working.userdata.html
Have you tried working with the Amazon Ion library?
Assuming the data mentioned in the question is present in a file called "myIonFile.ion", and the file contains only Ion objects, we can read the data from the file as follows:
from amazon.ion import simpleion

file = open("myIonFile.ion", "rb")                    # opening the file
data = file.read()                                    # getting the bytes of the file
iondata = simpleion.loads(data, single_value=False)   # loading as a list of Ion values
print(iondata[0]['PersonId'])                         # should print "4tPW8xtKSGF5b6JyTihI1U"
Further guidance on using the ion library is provided in the Ion Cookbook
Besides, I'm unsure about your use case, but interacting with QLDB can also be done via the QLDB driver, which has a direct dependency on the Ion library.
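For completeness, a minimal sketch of querying a ledger with the Python QLDB driver (pyqldb); the ledger name, table, and field are placeholders, and this only applies if you eventually get query access to QLDB:

from pyqldb.driver.qldb_driver import QldbDriver

def print_person_ids(txn):
    # Run a PartiQL statement inside the transaction; each row comes back as an Ion value.
    cursor = txn.execute_statement("SELECT PersonId FROM Person")
    for row in cursor:
        print(row['PersonId'])

driver = QldbDriver(ledger_name="my-ledger")
driver.execute_lambda(print_person_ids)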
Nosiphiwe,
AWS Glue is able to read Amazon Ion input. Many other services and applications can't, though, so it's a good idea to use Glue to convert the Ion data to JSON. Note that Ion is a super-set of JSON, adding some data types to JSON, so converting Ion to JSON may cause some down-conversion.
One good way to get access to your QLDB documents from the QLDB S3 export is to use Glue to extract the document data, store it in S3 as JSON, and query it with Amazon Athena. The process would go as follows:
Export your ledger data to S3
Create a Glue crawler to crawl and catalog the exported data.
Run a Glue ETL job to extract the revision data from the export files, convert it to JSON, and write it out to S3.
Create a Glue crawler to crawl and catalog the extracted data.
Query the extracted document revision data using Amazon Athena.
Take a look at the PySpark script below. It extracts just the revision metadata and data payload from the QLDB export files.
The QLDB export maps each document to its table, but it does so separately from the revision data. You'll have to do some extra coding to include the table name in your revision data in the output. The code below doesn't do this, so you'll end up with all of your revisions in one table in the output.
Also note that you'll get whatever revisions happen to be in the exported data. That is, you might get multiple document revisions for a given document ID. Depending on your intended use of the data, you may need to figure out how to grab just the latest revision of each document ID (see the sketch after the script below).
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import explode
from pyspark.sql.functions import col
from awsglue.dynamicframe import DynamicFrame
# Initializations
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
# Load data. 'vehicle-registration-ion' is the name of your database in the Glue catalog for the export data. '2020' is the name of your table in the Glue catalog.
dyn0 = glueContext.create_dynamic_frame.from_catalog(database = "vehicle-registration-ion", table_name = "2020", transformation_ctx = "datasource0")
# Only give me exported records with revisions
dyn1 = dyn0.filter(lambda line: "revisions" in line)
# Now give me just the revisions element and convert to a Spark DataFrame.
df0 = dyn1.select_fields("revisions").toDF()
# Revisions is an array, so give me all of the array items as top-level "rows" instead of being a nested array field.
df1 = df0.select(explode(df0.revisions))
# Now I have a list of elements with "col" as their root node and the revision
# fields ("data", "metadata", etc.) as sub-elements. Explode() gave me the "col"
# root node and some rows with null "data" fields, so filter out the nulls.
df2 = df1.where(col("col.data").isNotNull())
# Now convert back to a DynamicFrame
dyn2 = DynamicFrame.fromDF(df2, glueContext, "dyn2")
# Prep and send the output to S3
applymapping1 = ApplyMapping.apply(frame = dyn2, mappings = [("col.data", "struct", "data", "struct"), ("col.metadata", "struct", "metadata", "struct")], transformation_ctx = "applymapping1")
datasink0 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://YOUR_BUCKET_NAME_HERE/YOUR_DESIRED_OUTPUT_PATH_HERE/"}, format = "json", transformation_ctx = "datasink0")
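On that last point about multiple revisions per document: one possible way to keep only the latest revision is a window over the revision metadata. This is a sketch, not part of the original job; it assumes the exploded revisions expose metadata.id and metadata.version fields, which QLDB revision metadata normally contains.

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank each document's revisions by version (highest first) and keep the top one.
w = Window.partitionBy(col("col.metadata.id")).orderBy(col("col.metadata.version").desc())
latest_df = (
    df2.withColumn("rn", row_number().over(w))
       .where(col("rn") == 1)
       .drop("rn")
)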
I hope this helps!
I am trying to find a way to move our MySQL databases and put them on Amazon Redshift for its speed and scalable storage. AWS recommends splitting the data into multiple files and using the COPY command to copy the data from S3 into the data warehouse. I am using Python to attempt to automate this process and plan to use boto3 for client-side encryption of the data:
import boto3
import psycopg2

s3 = boto3.client('s3',
                  aws_access_key_id='[Access key id]',
                  aws_secret_access_key='[Secret access key]')

filename = '[S3 file path]'
bucket_name = '[Bucket name]'

# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)

# create table for the data
statement = 'create table [table_name] ([table fields])'
conn = psycopg2.connect(
    host='[host]',
    user='[user]',
    port=5439,
    password='[password]',
    dbname='dev')
cur = conn.cursor()
cur.execute(statement)
conn.commit()

# load data into Redshift
conn_string = "dbname='dev' port='5439' user='[user]' password='[password]' host='[host]'"
conn = psycopg2.connect(conn_string)
cur = conn.cursor()
cur.execute("""copy [table_name] from '[data location]'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    null as 'NA'
    delimiter ','
    removequotes;""")
conn.commit()
The problem with this code is that I think I would have to create a table for every table and then copy it over for every file individually. Is there a way to get the data into Redshift using a single COPY for multiple files? Or is it possible to run multiple COPY statements at once? And is it possible to do this without creating a table for every single file?
Redshift does support a parallelized form of COPY from a single connection; in fact, it appears to be an anti-pattern to concurrently COPY data into the same tables from multiple connections.
There are two ways to do parallel ingestion:
1. Specify a common prefix in the COPY FROM clause instead of a specific file name. In this case, COPY will attempt to load all files in the bucket/folder that share that prefix.
2. Or, provide a manifest file containing the names of the files to load.
In both instances, you should split the source data up into an appropriate number of files of approximately equal size. Again from the docs:
Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 32 slices.
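For example, a rough sketch of both approaches using the psycopg2 connection from the question (the bucket, prefix, table, and manifest names are placeholders):

# Option 1: COPY every file that shares a common S3 prefix.
cur.execute("""copy my_table from 's3://my-bucket/exports/my_table/part-'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    delimiter ','
    removequotes;""")

# Option 2: COPY an explicit list of files via a manifest, a small JSON file in S3 such as:
# {"entries": [
#   {"url": "s3://my-bucket/exports/my_table/file1.csv", "mandatory": true},
#   {"url": "s3://my-bucket/exports/my_table/file2.csv", "mandatory": true}
# ]}
cur.execute("""copy my_table from 's3://my-bucket/exports/my_table.manifest'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    delimiter ','
    removequotes
    manifest;""")
conn.commit()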
I want to load data from hundreds of CSV files on Google Cloud Storage and append them to a single table in BigQuery on a daily basis using Cloud Dataflow (preferably using the Python SDK). Can you please let me know how I can accomplish that?
Thanks
We can do it through Python as well.
Please find the code snippet below.
def format_output_json(element):
    """
    :param element: is the row data in the csv
    :return: a dictionary with key as column name and value as real data in a row of the csv.
    :row_indices: I have hard-coded here, but can get it at the run time.
    """
    row_indices = ['time_stamp', 'product_name', 'units_sold', 'retail_price']
    row_data = element.split(',')
    dict1 = dict()
    for i in range(len(row_data)):
        dict1[row_indices[i]] = row_data[i]
    return [dict1]
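And a minimal sketch of how a function like this could be wired into a Beam/Dataflow pipeline that reads the CSV files from GCS and appends the rows to a single BigQuery table (the project, bucket, table, and schema names are placeholders, and the all-STRING schema is just an assumption for illustration):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    temp_location="gs://my-bucket/tmp",
    region="us-central1",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadCSVs" >> beam.io.ReadFromText("gs://my-bucket/daily/*.csv", skip_header_lines=1)
        | "ToDicts" >> beam.FlatMap(format_output_json)   # FlatMap because the function returns a list
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="time_stamp:STRING,product_name:STRING,units_sold:STRING,retail_price:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )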