I am trying to find a way to move our MySQL databases onto Amazon Redshift for its speed and scalable storage. The Redshift documentation recommends splitting the data into multiple files and using the COPY command to load the data from S3 into the data warehouse. I am using Python to automate this process and plan to use boto3 for client-side encryption of the data.
import boto3

# Create an S3 client with explicit credentials
s3 = boto3.client('s3',
                  aws_access_key_id='[Access key id]',
                  aws_secret_access_key='[Secret access key]')

filename = '[S3 file path]'
bucket_name = '[Bucket name]'

# Upload the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
# create table for data
import psycopg2

statement = 'create table [table_name] ([table fields])'

conn = psycopg2.connect(
    host='[host]',
    user='[user]',
    port=5439,
    password='[password]',
    dbname='dev')
cur = conn.cursor()
cur.execute(statement)
conn.commit()
# load data to redshift
conn_string = ("dbname='dev' port='5439' user='[user]' "
               "password='[password]' host='[host]'")
conn = psycopg2.connect(conn_string)
cur = conn.cursor()
cur.execute("""copy [table_name] from '[data location]'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    null as 'NA'
    delimiter ','
    removequotes;""")
conn.commit()
The problem with this code is that, as far as I can tell, I would have to create each table individually and then run a separate COPY for every file. Is there a way to get the data into Redshift using a single COPY for multiple files? Or is it possible to run multiple COPY statements at once? And is it possible to do this without creating a table for every single file?
Redshift does support a parallelized form of COPY from a single connection, and in fact it appears to be an anti-pattern to concurrently COPY data to the same tables from multiple connections.
There are two ways to do parallel ingestion:
- Specify a common prefix in the COPY FROM clause, instead of a specific file name. In this case, COPY will attempt to load all files from the bucket/folder with that prefix.
- Or, provide a manifest file containing the names of the files (both variants are sketched below).
In both instances, you should split the source data up into an appropriate number of files of approximately equal size. Again from the docs:
Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 32 slices.
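For illustration, here is roughly what the two variants might look like when issued through psycopg2, following the same pattern as the code in the question; the table name, bucket, prefix, and manifest path are placeholders rather than values from the question.

# Sketch only: a single COPY that loads every object sharing a key prefix,
# and a COPY driven by a manifest file listing the objects explicitly.
cur = conn.cursor()

# Variant 1: common prefix -- loads all files under s3://[bucket]/[prefix] in parallel
cur.execute("""copy [table_name] from 's3://[bucket]/[prefix]'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    null as 'NA'
    delimiter ','
    removequotes;""")

# Variant 2: manifest -- loads exactly the files listed in the manifest JSON
cur.execute("""copy [table_name] from 's3://[bucket]/[manifest file].manifest'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    null as 'NA'
    delimiter ','
    removequotes
    manifest;""")

conn.commit()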
Related
I have a pyspark.sql.DataFrame sourced from some parquet files that contains a binary column holding one PDF file per row. Currently, I can write them locally by calling write_documents:
# full_path includes the name of the file and its suffix (.pdf)
def write_document_locally(full_path: str, byte_file: bytearray):
    with open(full_path, "wb") as f:
        f.write(byte_file)

def write_documents(data_frame: sql.DataFrame) -> None:
    [
        write_document_locally(full_path=full_path, byte_file=byte_file)
        for full_path, byte_file in zip(
            data_frame["file_path_and_name"], data_frame["byte_file"]
        )
    ]
From the same job I'm also writing a parquet table to a separate location. Both of the resulting folders (PDF and parquet) are partitioned by year and id. In the PDF case I partition by manually concatenating year=XXXX/id=XX onto full_path; in the parquet case I use:
data_frame.write.mode("overwrite").partitionBy("year", "id").parquet(path=another_path)
To replicate the PDF export in AWS and write to an S3 bucket instead, I would have to use boto3. I'm wondering whether there is a more efficient way of doing this using data_frame.write instead.
The problems with using boto3 are: 1) I would write each PDF locally on the driver before uploading it to S3, which is inefficient and gathers all the data on one driver (I think), and 2) it would not create the partitions automatically for me.
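One possible way around both problems (not something the question already confirms works): upload from the executors with foreachPartition instead of collecting to the driver. The sketch below assumes each executor can reach S3 (e.g. via an instance profile); the bucket name and key layout are hypothetical, while the column names are taken from the question.

import boto3

def upload_partition(rows):
    # One S3 client per partition; this runs on the executors, not the driver.
    s3 = boto3.client("s3")
    for row in rows:
        # Rebuild the same year=XXXX/id=XX style key that was used locally
        # ("my-bucket" and the "pdfs/" prefix are placeholders).
        key = f"pdfs/year={row['year']}/id={row['id']}/{row['file_path_and_name']}"
        s3.put_object(Bucket="my-bucket", Key=key, Body=bytes(row["byte_file"]))

# Upload directly from the executors, avoiding a collect() on the driver.
data_frame.select("year", "id", "file_path_and_name", "byte_file") \
    .foreachPartition(upload_partition)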
I have Azure Delta tables on blob storage with the folder structure below.
Lvl1/Lvl2/db1/Table1
Lvl1/Lvl2/db1/Table2
Lvl1/Lvl2/db1/Table3
Lvl1/Lvl2/db2/Table1
Lvl1/Lvl2/db2/Table2
Lvl1/Lvl2/db2/Table3
Lvl1/Lvl2/db3/Table1
I want to create Hive metastore table links for all the above tables under a single database.
So I created the database using the following Command
spark.sql(f'CREATE DATABASE IF NOT EXISTS parentdb')
I am currently linking the tables by using the following command
Tablename = [dynamically generates the tablename]
spark.sql(f'CREATE TABLE IF NOT EXISTS parent_db.{tablename} USING DELTA LOCATION \'{path}\'')
I want spark to read all the above table locations, and create the tables with the tablenames within the single database that I have created above.
So the Hive metastore, when browsed from the Databricks Data tab, should look like this:
Parent_db --> db1_table1
Db2_table1
Db2_table2
Db1_table2
Db1_table3
Db3_table3
.
.
.
I can create the dynamic table names with db1, db2, db3, …; the issue is only reading all the tables from the delta location and creating the tables (reading all subfolders within the root folder).
So all I want is to loop through the folders and create links for all the tables under the single db.
Any help with this one please …
I have reproduced the above and am able to get the tables stored in the Hive metastore database.
First, I have the same delta tables in my blob storage, with the same paths, under my mount location.
Then use the code below to get the list of delta table paths and build lists of the db names and table names.
import glob, os

# List all delta table paths under the mount point and strip the '/dbfs' prefix
paths = [x[5:] for x in glob.iglob('/dbfs/mnt/data/Lvl1/Lvl2/**/*')]
print("paths list : ", paths)

# Second-to-last path component is the db folder, last component is the table folder
dbs_list = [x[-2] for x in [y.split('/') for y in paths]]
print("dbs list : ", dbs_list)
table_list = [x[-1] for x in [y.split('/') for y in paths]]
print("table list : ", table_list)
Then use the below code to create the tables in hive metastore.
spark.sql('CREATE DATABASE IF NOT EXISTS parentdb2')

for i in range(0, len(paths)):
    table_name = dbs_list[i] + '_' + table_list[i]
    spark.sql(f'CREATE TABLE IF NOT EXISTS parentdb2.{table_name} USING DELTA LOCATION \'{paths[i]}\'')
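As a quick sanity check (not part of the original answer), the created links can be listed afterwards:

# One table link per delta folder should show up here
display(spark.sql('SHOW TABLES IN parentdb2'))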
I'm curious if there's a way to reference Databricks tables without importing them to every Databricks notebook.
Here's what I normally do:
'''
# Load the required tables
df1 = spark.read.load("dbfs:/hive_metastore/cadors_basic_event")
# Convert the dataframe to a temporary view for SQL processing
df1.createOrReplaceTempView('Event')
# Perform join to create master table
master_df = spark.sql(f'''
SELECT O.CADORSNUMBER, O.EVENT_CD, O.EVENT_SEQ_NUM,\
E.EVENT_NAME_ENM, E.EVENT_NAME_FNM, E.EVENT_DESCRIPTION_ETXT, E.EVENT_DESCRIPTION_FTXT,\
E.EVENT_GROUP_TYPE_CD, O.DATE_CREATED_DTE, O.DATE_LAST_UPDATE_DTE\
FROM Occ_Events O INNER JOIN Event E\
ON O.EVENT_CD = E.EVENT_CD\
ORDER BY O.CADORSNUMBER''')
'''
However, I also remember in SQL Server Management Studio, you could easily reference these tables and their fields without having to "import" the table into each notebook like I did above. For example:
'''
SELECT occ.cadorsnumber,\
occ_evt.event_seq_num, occ_evt.event_cd,\
evt.event_name_enm, evt.event_group_type_cd,\
evt_grp.event_group_type_elbl\
FROM cadorsstg.occurrence_information occ\
JOIN cadorsstg.occurrence_events occ_evt ON (occ_evt.cadorsnumber = occ.cadorsnumber)\
JOIN cadorsstg.ta003_event evt ON (evt.event_cd = occ_evt.event_cd)\
JOIN cadorsstg.ta012_event_group_type evt_grp ON (evt_grp.event_group_type_cd = evt.event_group_type_cd)\
WHERE occ.date_deleted_dte IS NULL AND occ_evt.date_deleted_dte IS NULL\
ORDER BY occ.cadorsnumber, occ_evt.event_seq_num;
'''
The way I do it currently is not really scalable and gets very tedious when I'm working with multiple tables. If there's a better way to do this, I'd highly appreciate any tips/advice.
I've tried using SELECT/USE SCHEMA (database name), but that didn't work.
I agree with David: there are several ways to do this, and you are confusing the concepts. I am going to add some links for you to study.
1 - Data is stored in files. The storage can be either remote or local. To use remote storage, I suggest mounting it, since a mount gives older Python libraries access to the storage; only utilities such as dbutils.fs() understand URLs.
https://www.mssqltips.com/sqlservertip/7081/transform-raw-file-refined-file-microsoft-azure-databricks-synapse/
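For illustration only (this is not in the linked article, and the storage account, container, and secret scope names are placeholders), a blob storage mount on Azure Databricks can look roughly like this:

# Hypothetical mount of an Azure Blob Storage container
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-account-key")
    }
)
# After mounting, plain Python libraries can read /dbfs/mnt/raw/... directly.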
2 - Data engineering is used to join and transform input files into a new output file. The spark.read and DataFrame.write interfaces are key to reading and writing files using the power of the cluster.
https://learn.microsoft.com/en-us/azure/databricks/clusters/configure
This same processing can be done with plain Python libraries, but it will not leverage the power of the worker nodes; it will run only on the driver node. Please look into the high-level design of a cluster.
3 - Data engineering can be done with dataframes. But this means you have to get very good at the methods associated with the object.
https://learn.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes
In the example below, I read in two sample data files, join them, remove a duplicate column, and save the result as a new file.
A - The read-files code is the same for both design patterns (dataframe methods and Spark SQL)
# read in low temps
path1 = "/databricks-datasets/weather/low_temps"
df1 = (
    spark.read
    .option("sep", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path1)
)

# read in high temps
path2 = "/databricks-datasets/weather/high_temps"
df2 = (
    spark.read
    .option("sep", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path2)
)
B - The data engineering code uses methods when dealing with data frames.
# rename columns - file 1
df1 = df1.withColumnRenamed("temp", "low_temp")
# rename columns - file 2
df2 = df2.withColumnRenamed("temp", "high_temp")
df2 = df2.withColumnRenamed("date", "date2")
# join + drop col
df3 = df1.join(df2, df1["date"] == df2["date2"]).drop("date2")
# show top 5 rows
display(df3.head(5))
C - The write-files code is the same for both design patterns (dataframe methods and Spark SQL)
Now that the dataframe (df3) has our data, we write it to storage. The /lake/bronze directory is on local storage; it is a make-believe data lake.
# How many partitions?
df3.rdd.getNumPartitions()

# Write out a parquet file with 1 partition
dst_path = "/lake/bronze/weather/temp"
(
    df3.repartition(1).write
    .format("parquet")
    .mode("overwrite")
    .save(dst_path)
)
4 - Data engineering can be done with Spark SQL. But this means you have to expose the datasets as temporary views. Both steps A + C are the same.
B.1 - This code exposes the dataframes as temporary views.
# create temp view
df1.createOrReplaceTempView("tmp_low_temps")
# create temp view
df2.createOrReplaceTempView("tmp_high_temps")
B.2 - This code replaces the methods with Spark SQL (pyspark).
# make sql string
sql_stmt = """
    select
        l.date as obs_date,
        h.temp as obs_high_temp,
        l.temp as obs_low_temp
    from
        tmp_high_temps as h
    join
        tmp_low_temps as l
    on
        h.date = l.date
"""

# execute
df3 = spark.sql(sql_stmt)
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/
5 - Last but not least, who wants to always query the data by loading a data frame? We can create a HIVE database and TABLE to expose the stored file on the storage.
I have a utility function that finds the saved part file in the temporary subdirectory and renames it to the final file name.
# create single file
unwanted_file_cleanup("/lake/bronze/weather/temp/", "/lake/bronze/weather/temperature-data.parquet", "parquet")
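The body of unwanted_file_cleanup is not shown in the answer; a minimal sketch of what such a helper might do on Databricks (find the single part file Spark wrote, copy it to the target name, and remove the temporary directory) is:

# Hypothetical sketch of the cleanup helper; the real implementation may differ.
def unwanted_file_cleanup(tmp_dir: str, dst_file: str, extension: str):
    # Find the single part-*.<extension> file Spark wrote into tmp_dir
    part_files = [f.path for f in dbutils.fs.ls(tmp_dir)
                  if f.name.startswith("part-") and f.name.endswith(extension)]
    # Copy it to the final file name, then drop the temporary directory
    dbutils.fs.cp(part_files[0], dst_file)
    dbutils.fs.rm(tmp_dir, recurse=True)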
Finally, we create a database and table. Look into the concepts of managed and unmanaged tables as well as a remote metastore. I usually use unmanaged tables with the default Hive metastore.
%sql
DROP DATABASE IF EXISTS talks CASCADE
%sql
CREATE DATABASE IF NOT EXISTS talks
%sql
CREATE TABLE talks.weather_observations
USING PARQUET
LOCATION '/lake/bronze/weather/temperature-data.parquet'
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table.html
In short, I hope you now have a good understanding of data processing using either dataframe methods or Spark SQL.
Sincerely
John Miner ~ The Crafty DBA ~ Data Platform MVP
PS: I have a couple of videos out on YouTube somewhere on this topic.
I have data in a zipped format in container A, which I need to transform using a Python script, and I am trying to schedule this to run within Azure. But when the output is written to a new storage container (container B), it simply produces a CSV containing the name of the file rather than the data.
I've followed the tutorial given on the microsoft site exactly, but I can't get it to work - what am I missing?
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
import pandas as pd

file_n = 'iris.csv'

# Load iris dataset from the task node
df = pd.read_csv(file_n)

# Subset records
df = df[df['Species'] == "setosa"]

# Save the subset of the iris dataframe locally on the task node
df.to_csv("iris_setosa.csv", index=False, encoding="utf-8")

# Upload iris dataset (blobService and containerName are set up earlier in the script)
blobService.create_blob_from_text(containerName, "iris_setosa.csv", "iris_setosa.csv")
Specifically, the final line seems to produce a CSV called "iris_setosa.csv" whose only content is the text "iris_setosa.csv" in cell A1, rather than the actual data that was read in.
Update:
replace create_blob_from_text with create_blob_from_path.
create_blob_from_text creates a new blob from str/unicode, or updates the content of an existing blob. So you will find the text "iris_setosa.csv" as the content of the new blob.
create_blob_from_path creates a new blob from a file path, or updates the content of an existing blob. It is what you want.
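Applied to the script above, the corrected final line would look like this (same container and file names as in the question):

# Upload the local file itself rather than the literal string "iris_setosa.csv"
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")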
This workaround uses copy_blob and delete_blob to move an Azure blob from one container to another.
from azure.storage.blob import BlobService

def copy_azure_files(self):
    blob_service = BlobService(account_name='account_name', account_key='account_key')
    blob_name = 'iris_setosa.csv'
    copy_from_container = 'test-container'
    copy_to_container = 'demo-container'

    blob_url = blob_service.make_blob_url(copy_from_container, blob_name)
    # blob_url: https://demostorage.blob.core.windows.net/test-container/iris_setosa.csv
    blob_service.copy_blob(copy_to_container, blob_name, blob_url)

    # to move (rather than copy) the file, also delete the source blob
    blob_service.delete_blob(copy_from_container, blob_name)
As I'm new to SQLite databases, I highly appreciate every useful comment, answer, or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files, each ~7 GB in size. The relevant information in these files is written into an SQLite database, resulting in a 17,000,000 x 4 table, which takes approximately 1 day. Later on, the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated if it were possible to write to the database in parallel. For instance, I could run several processes in parallel, each taking only one of the 400 txt files as input and writing its results to the database. So is it possible to let several processes write to a database in parallel?
EDIT1: Answer w.r.t. W4t3randWinds' comment: it is possible (and faster) to process one file per core, write the results into a database, and merge all the databases after that. However, writing into one database using multithreading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database. For instance, does it make sense to create a database per txt file, resulting in 400 databases each containing a 17,000,000/400 x 4 table?
Finally, I'm storing the database as a file on my machine. However, I also read about the possibility of setting up a server. So when does it make sense to use a server, and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
### SET UP
import io
import os
import sqlite3

# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")

# set up variable to store db rows
to_db = []

# set input directory ('~' must be expanded explicitly)
indir = os.path.expanduser('~/data/')

### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
    if filename.endswith(".txt"):
        filename = os.path.join(indir, filename)
        # open txt file in dir
        with io.open(filename, mode='r', encoding='utf-8') as mytxt:

            ### EXTRACT RELEVANT INFORMATION
            # for every line in the txt file
            for i, line in enumerate(mytxt):
                # strip linebreak
                line = line.strip()
                # read the line where the sentence is stated
                if i == 0 or i % 9 == 0:
                    sentence = line
                    ngram = " ".join(line.split(" ")[:-1])
                    word = line.split(" ")[-1]
                # read the line where the result is stated
                if (i - 4) == 0 or (i - 4) % 9 == 0:
                    result = line.split(r'= ')[1].split(r' [')[0]
                    # make a tuple representing a new row of the db
                    db_row = (sentence, ngram, word, result)
                    to_db.append(db_row)

### WRITE TO DATABASE
# insert all collected rows (column name must match the CREATE TABLE above)
cur.executemany("INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel
I am not sure of that. You only have a little processing, so the whole process is likely to be IO-bound. SQLite is a very nice tool, but it only supports a single writer at a time.
Possible improvements:
- Use x threads to read and process the text files, and a single thread to write to the database in large chunks, connected by a queue (a sketch of this pattern follows below). As the process is IO-bound, the Python Global Interpreter Lock should not be a problem.
- Use a full-featured database like PostgreSQL or MariaDB on a separate machine, with multiple processes on the client machine each processing its own set of input files.
In either case, I am unsure of the benefit...
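A minimal sketch of the first option, assuming the parsing logic from the question is factored into a parse_file(path) function that yields row tuples; parse_file and txt_paths are hypothetical names, not part of the original code:

import queue
import sqlite3
import threading

work_q = queue.Queue(maxsize=10000)

def reader(path):
    # parse_file() is assumed to yield (sentence, ngram, word, probability) tuples
    for row in parse_file(path):
        work_q.put(row)

def writer(db_path):
    db = sqlite3.connect(db_path)
    cur = db.cursor()
    buffer = []
    while True:
        row = work_q.get()
        if row is None:           # sentinel: no more rows
            break
        buffer.append(row)
        if len(buffer) >= 50000:  # write in large chunks
            cur.executemany("INSERT INTO t VALUES (?, ?, ?, ?);", buffer)
            db.commit()
            buffer.clear()
    if buffer:
        cur.executemany("INSERT INTO t VALUES (?, ?, ?, ?);", buffer)
        db.commit()
    db.close()

# one writer thread, several reader threads
w = threading.Thread(target=writer, args=("mydatabase.db",))
w.start()
readers = [threading.Thread(target=reader, args=(p,)) for p in txt_paths]
for r in readers:
    r.start()
for r in readers:
    r.join()
work_q.put(None)   # signal the writer to finish
w.join()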
I do daily updates to an SQLite database using Python multithreading. It works beautifully. Two different tables have nearly 20,000,000 records, one with 8 fields and the other with 10. This is on my laptop, which is 4 years old.
If you are having performance issues, I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD, you will gain amazing performance by upgrading to an SSD.
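For example (a hypothetical illustration using the table from the question, not the poster's actual schema), an index on a column you filter by often can make later queries far faster:

# Hypothetical index on the 'word' column of the question's table
cur.execute("CREATE INDEX IF NOT EXISTS idx_t_word ON t (word);")
db.commit()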