Processing large number of JSONs (~12TB) with Databricks


I am looking for guidance/best practice to approach a task. I want to use Azure-Databricks and PySpark.
Task: Load and prepare data so that it can be analyzed quickly and efficiently in the future. The analysis will involve summary statistics, exploratory data analysis and maybe simple ML (regression). The analysis part is not clearly defined yet, so my solution needs flexibility in this area.
Data: session-level data (12TB) stored in 100 000 single-line JSON files. The JSON schema is nested and includes arrays. The schema is not uniform: new fields were added over time, since the data is a time series.
Overall, the task is to build an infrastructure so the data can be processed efficiently in the future. There will be no new data coming in.
My initial plan was to:
1) Load data into blob storage
2) Process data using PySpark
- flatten by reading into data frame
- save as parquet (alternatives?)
3) Store in a DB so the data can be quickly queried and analyzed
- I am not sure which Azure solution (DB) would work here
- Can I skip this step when data is stored in efficient format (e.g. parquet)?
4) Analyze the data using PySpark by querying it from DB (or from blob storage when in parquet)
Does this sound reasonable? Does anyone have materials/tutorials that follow a similar process, so I could use them as blueprints for my pipeline?

Yes, it sounds reasonable, and in fact it's quite a standard architecture (often referred to as a lakehouse). The usual implementation approach is the following:
JSON data loaded into blob storage is consumed using Databricks Auto Loader, which provides an efficient way of ingesting only the data that is new since the previous run. You can trigger the pipeline regularly (for example, nightly) or run it continuously if data is arriving all the time. Auto Loader also handles schema evolution of the input data.
Processed data is better stored as Delta Lake tables, which provide better performance than "plain" Parquet: additional information in the transaction log makes it possible to efficiently access only the necessary data. (Delta Lake is built on top of Parquet, but has more capabilities.)
Processed data can then be accessed via Spark code or via Databricks SQL (which could be more efficient for reporting, etc., as it's heavily optimized for BI workloads). Because of the large amount of data, storing it in a "traditional" database may be inefficient or very costly.
P.S. I would recommend looking at implementing this with Delta Live Tables, which may simplify development of your pipelines.
Also, you may have access to Databricks Academy, which has introductory courses on lakehouse architecture and data engineering patterns. If you don't have access to it, you can at least look at the Databricks courses published on GitHub.
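A minimal sketch of the Auto Loader to Delta Lake step, assuming a Databricks runtime where a `spark` session already exists. The function names `autoloader_options` and `ingest_json_to_delta`, and all paths and table names passed to them, are hypothetical and not from the post. Since no new data will arrive, the sketch uses a one-off run (`availableNow`) rather than a continuous stream:

```python
def autoloader_options(schema_location):
    """Auto Loader options for nested JSON whose schema grows over time."""
    return {
        "cloudFiles.format": "json",
        # Auto Loader keeps the inferred schema here and updates it as new fields appear.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def ingest_json_to_delta(spark, source_path, schema_location, checkpoint_path, table_name):
    """Read all JSON files under source_path and append them to a Delta table.

    `spark` is the SparkSession that Databricks provides; every path and the
    table name are placeholders to be supplied by the caller.
    """
    raw = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(schema_location))
        .load(source_path)
    )
    return (
        raw.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")  # let the Delta table pick up new columns
        .trigger(availableNow=True)     # process everything available, then stop
        .toTable(table_name)
    )
```

As far as I know, in `addNewColumns` mode the stream stops when it first sees a new field and picks up the updated schema on restart, so a job with automatic retries may be needed. Flattening the nested arrays would be a separate transformation on `raw` before the write.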

Related

Where am I supposed to store the matching engine data?

I'm working on an algorithm for a matching engine in Python, and I was wondering where to store the data (the order book).
Is there a fast way to read and write data to some storage other than a database? The matching engine has to be fast at reading and writing the stored data.
I tried saving the data in a regular database (Postgres), but it seems to be slow at writing, reading and updating.

I was involved with a financial matching engine once. The only way we could manage the volume of data was to forego the dbms. Live data was kept in memory, and an append-only log of order book changes was kept in flat files in the local file system. Actual trades were stored in a proper (beefy) db, but they were orders of magnitude fewer than orderbook changes.
On restart, the in-memory order book would be reconstructed from the log data. This was S.L.O.W. In normal operation, there was no need to read from storage. We did sharding to spread orderbooks around to keep the memory requirements manageable, and had multiple parallel instances of each shard to protect against the slow restarts.
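A toy sketch of that pattern, not the actual system described above: live state sits in a dict, every change is appended to a flat file first, and a restart replays the file. The `OrderBook` class, the JSON-lines record layout and the `add`/`cancel` operations are assumptions made for illustration, and there is no matching logic or sharding here.

```python
import json
import os
import tempfile


class OrderBook:
    """In-memory order book backed by an append-only change log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.orders = {}  # order_id -> {"side", "price", "qty"}
        self._replay()
        # Append mode: existing log records are never rewritten.
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the in-memory state from the log (the slow restart path)."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self._apply(json.loads(line))

    def _apply(self, change):
        if change["op"] == "add":
            self.orders[change["id"]] = {
                "side": change["side"],
                "price": change["price"],
                "qty": change["qty"],
            }
        elif change["op"] == "cancel":
            self.orders.pop(change["id"], None)

    def submit(self, change):
        """Log the change first, then apply it in memory."""
        self._log.write(json.dumps(change) + "\n")
        # flush() hands the data to the OS; real durability would also need os.fsync.
        self._log.flush()
        self._apply(change)

    def close(self):
        self._log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "orderbook.log")
    book = OrderBook(path)
    book.submit({"op": "add", "id": 1, "side": "buy", "price": 100.0, "qty": 5})
    book.submit({"op": "add", "id": 2, "side": "sell", "price": 101.0, "qty": 3})
    book.submit({"op": "cancel", "id": 1})
    book.close()
    # Simulate a restart: a fresh instance rebuilds its state from the log.
    restored = OrderBook(path)
    state = dict(restored.orders)
    restored.close()
    return state
```

In normal operation nothing is read from disk; only `_replay` touches the file, which is why restart time grows with the length of the log.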

Migrating large tables using Airflow

I'm new to using Airflow (and newish to Python.)
I need to migrate some very large MySQL tables to s3 files using Airflow. All of the relevant hooks and operators in Airflow seem geared to using Pandas dataframes to load the full SQL output into memory and then transform/export to the desired file format.
This is causing obvious problems for the large tables which cannot fully fit into memory and are failing. I see no way to have Airflow read the query results and save them off to a local file instead of tanking it all up into memory.
I see ways to bulk_dump to output results to a file on the MySQL server using the MySqlHook, but no clear way to transfer that file to s3 (or to Airflow local storage then to s3).
I'm scratching my head a bit because I've worked in Pentaho which would easily handle this problem, but cannot see any apparent solution.
I can try to slice the tables up into small enough chunks that Airflow/Pandas can handle them, but that's a lot of work, a lot of query executions, and there are a lot of tables.
What would be some strategies for moving very large tables from a MySQL server to s3?

You don't have to use the Airflow transfer operators if they don't fit your scale. You can (and probably should) create your own CustomMySqlToS3Operator with logic that fits your process.
A few options:
Don't transfer all the data in one task. Slice the data based on dates, number of rows, or some other criterion. You can use several CustomMySqlToS3Operator tasks in your workflow. This is not as much work as you suggest: it is simply a matter of providing the proper WHERE conditions to the SQL queries that you generate. Depending on the process that you build, you can have every run process the data of a single day, so your WHERE condition is simply date_column between execution_date and next_execution_date (you can read about it in https://stackoverflow.com/a/65123416/14624409 ). Then use catchup=True to backfill runs.
Use Spark as part of your operator.
As you pointed out, you can dump the data to local disk and then upload it to S3 using the load_file method of S3Hook. This can be done as part of the logic of your CustomMySqlToS3Operator or, if you prefer, as a Python callable from a PythonOperator.
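A rough sketch of the "dump to local disk" step for one date slice, written against the generic DB-API cursor so that rows are streamed with `fetchmany` instead of being loaded into a DataFrame. The function `dump_query_to_csv`, the `events` table and its columns are hypothetical, and `sqlite3` stands in for MySQL only so that the example runs anywhere:

```python
import csv
import os
import sqlite3
import tempfile


def dump_query_to_csv(conn, sql, params, out_path, chunk_size=10_000):
    """Stream a query result to a CSV file without holding it all in memory."""
    cursor = conn.cursor()
    cursor.execute(sql, params)
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow([col[0] for col in cursor.description])  # header row
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            writer.writerows(rows)
            rows_written += len(rows)
    cursor.close()
    return rows_written


def demo():
    # sqlite3 is a stand-in for the MySQL connection.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(i, "2021-01-0%d" % (1 + i % 3)) for i in range(10)],
    )
    out_path = os.path.join(tempfile.mkdtemp(), "events_2021-01-01.csv")
    # One day's slice, like the per-run WHERE condition described above.
    count = dump_query_to_csv(
        conn,
        "SELECT id, created_at FROM events WHERE created_at >= ? AND created_at < ?",
        ("2021-01-01", "2021-01-02"),
        out_path,
        chunk_size=2,
    )
    conn.close()
    return count, out_path
```

Inside an operator or Python callable, the resulting file could then be passed to `S3Hook.load_file(filename=..., key=..., bucket_name=...)`. With a real MySQL connection, I believe a server-side (unbuffered) cursor is also needed, since the default client cursor buffers the full result set; the parameter placeholder style (`?` here) differs between drivers as well.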

Extracting data continuously from RDS MySQL schemas in parallel

I have got a requirement to extract data from Amazon Aurora RDS instance and load it to S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in real-time capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or Copy activity provided by AWS. But I can't use them since the plan is to make the solution cloud platform independent as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I got to know it doesn't support JDBC as a source in Structured streaming. I read about multi-threading and multiprocessing techniques in Python for this but have to assess if they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background and they run continuously in defined cycles, say every 5 minutes). The data synchronization between RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and might be missing a few numbers in between, because rows are removed when the corresponding entity (say, a customer) becomes inactive. It is not necessary to pull the entire record; only a few columns are pulled, which are predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now, I'm quite confused to decide which approach to use and how to implement the solution once decided. Hence, I seek the help of people who dealt with or know of any solution to this problem statement. I'm happy to provide more info in case it is required to get to the right solution. Any help on this would be greatly appreciated.

how to apply elasticsearch using python on files in azure-data-lake?

I have uploaded files to Azure Data Lake (Gen1) using Python. I need to apply Elasticsearch to those files (they can be .pdf, .csv, .xlsx or .doc) using Python and Django.

This article helps you to index and query large amounts of structured data by combining ADLS and Elasticsearch using a third-party tool called Dremio.
About Dremio: Dremio provides a self-service semantic layer and governance for your data. Dremio’s semantic layer is an integrated, searchable catalog in the Data Graph that indexes all of your metadata, allowing business users to easily make sense of the data in the data lake. Anything created by users (spaces, directories, and virtual datasets) makes up the semantic layer, all of which is indexed and searchable. The relationships between your data sources, virtual datasets, and all your queries are also maintained in the Data Graph, creating a data lineage, allowing you to govern and maintain your data.
Azure Data Lake Store is a highly scalable and secure data storage and analytics service that deals with big data problems easily. It provides a variety of functions and solutions for data management and governance.
Elasticsearch is a powerful search and analytics engine. It is highly popular due to the scale-out architecture, JSON data model, and text search capabilities. Also, with the help of Elasticsearch, you can index and query large amounts of structured data, use convenient RESTful API, and more.
Hope this helps.

Google CloudSQL : structuring history data on cloudSQL

I'm using Google Cloud SQL to apply advanced search on people data to fetch lists of users. Data is already stored in Datastore using 2 models: the first tracks the current data of users and the other tracks the historical timeline. The current data stored on Google Cloud SQL amounts to millions of rows for all users. Now I want to implement advanced search on historical data, including between dates, by adding all history data to the cloud.
Can anyone suggest a better structure for this historical model? I've gone through many links and articles but cannot find a proper solution, as I have to take care of search performance (in the current search, the time taken to fetch results is normal, but when history is fetched it scans all the records, which slows down queries because of the complex JOINs needed). The queries used to fetch data from Cloud SQL are built dynamically based on the user's needs. For example, if a user wants the list of employees whose manager is "xyz.123#abc.in", the Python code builds the query accordingly. Now a user wants to find users whose manager WAS "xyz.123#abc.in" with effectiveFrom 2016-05-02 to 2017-01-01.
I've found some possible structures, as below:
1) Same model as the current structure, with a new flag column isCurrentData (whether the row is history or active)
Disadvantages:
- Queries slow down while fetching data, as they will scan all records.
- Duplication of data might increase.
All of these disadvantages will affect the performance of advanced search by increasing query time.
A solution to this problem is to partition the whole table into different tables.
2) Partition based on year.
As time passes, this will generate too many tables.
3) Two tables might be maintained.
The first for current data and the second for history. But when a user wants to search data in both models, building the query becomes complex.
So, I need suggestions for structuring the historical timeline with improved performance and effective data handling.
Thanks in advance.

Depending on how often you want to do live queries vs historical queries and the size of your data set, you might want to consider placing the historical data elsewhere.
For example, if you need quick queries for live data and do many of them, but can handle higher-latency queries for historical data and only execute them sometimes, you might consider periodically exporting data to Google BigQuery. BigQuery can be useful for searching a large corpus of data but has much higher latency and doesn't have a wire protocol that is MySQL-compatible (although its query language will look familiar to those who know any flavor of SQL). In addition, while for Cloud SQL you pay for data storage and the amount of time your database is running, in BigQuery you mostly pay for data storage and the amount of data scanned during your query executions. Therefore, if you plan on executing many of these historical queries it may get a little expensive.
Also, if you don't have a very large data set, BigQuery may be a bit of an overkill. How large is your "live" data set and how large do you expect your "historical" data set to grow over time? Is it possible to just increase the size of the Cloud SQL instance as the historical data grows until the point at which it makes sense to start exporting to BigQuery?

#Kevin Malachowski : Thanks for guiding me with your info and questions, as they gave me a new way of thinking.
Historical data will be around 0.3-0.5 million records (maximum). I'll now use BigQuery for the historical advanced search.
For live data, Cloud SQL will be used, as we must focus on performance when fetching data.
There will be some performance issues for historical search when a user wants results from both live and historical data (BigQuery takes about 5-6 seconds or more in the worst case), but this will be optimized according to the data and the structure of the model.
