In this illuminating answer, a technique is presented for getting the data sizes of statically coded datasets that live in a Google BigQuery project.
To further automate this, I am working in a Python Jupyter notebook environment, using the BigQuery Python wrapper to run BigQuery queries inside Python. As a first step, I would like to construct a BigQuery query that fetches all the datasets that live in my project (their names have the form of an 8- or 9-digit identifier), feed them into a Python data structure, and then apply the referenced answer programmatically to get table sizes for all the identified datasets. Is there a way to do this?
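Here is a minimal sketch of how the two steps could be wired together in the notebook, assuming the google-cloud-bigquery client library and that the referenced answer reads each dataset's `__TABLES__` metadata view (one common way to get table sizes); the project id `my-project` is a placeholder:

```python
# Minimal sketch: list datasets whose ids are 8- or 9-digit numbers, then read
# table sizes from each dataset's __TABLES__ metadata view. "my-project" is a
# placeholder; substitute whatever technique the referenced answer actually uses.
import re
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Step 1: collect the dataset ids that match the 8/9-digit naming pattern.
dataset_ids = [
    ds.dataset_id
    for ds in client.list_datasets()
    if re.fullmatch(r"\d{8,9}", ds.dataset_id)
]

# Step 2: for each matching dataset, query table sizes and collect them in Python.
sizes = {}
for dataset_id in dataset_ids:
    query = f"""
        SELECT table_id, size_bytes
        FROM `my-project.{dataset_id}.__TABLES__`
    """
    for row in client.query(query).result():
        sizes[(dataset_id, row.table_id)] = row.size_bytes

for (dataset_id, table_id), size_bytes in sorted(sizes.items()):
    print(f"{dataset_id}.{table_id}: {size_bytes / 1e9:.2f} GB")
```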
I am looking for guidance/best practices on how to approach a task. I want to use Azure Databricks and PySpark.
Task: Load and prepare the data so that it can be efficiently/quickly analyzed in the future. The analysis will involve summary statistics, exploratory data analysis, and maybe simple ML (regression). The analysis part is not clearly defined yet, so my solution needs flexibility in this area.
Data: session-level data (12 TB) stored in 100,000 single-line JSON files. The JSON schema is nested and includes arrays. The schema is not uniform; new fields are added over time, as the data is a time series.
Overall, the task is to build an infrastructure so the data can be processed efficiently in the future. There will be no new data coming in.
My initial plan was to:
Load data into blob storage
Process data using PySpark
flatten by reading into a DataFrame (see the sketch below)
save as Parquet (alternatives?)
Store in a DB so the data can be quickly queried and analyzed
I am not sure which Azure solution (DB) would work here
Can I skip this step when the data is stored in an efficient format (e.g., Parquet)?
Analyze the data using PySpark by querying it from the DB (or from blob storage when stored as Parquet)
Does this sound reasonable? Does anyone have materials/tutorials that follow a similar process that I could use as blueprints for my pipeline?
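For reference, a minimal PySpark sketch of the flatten-and-save step from the plan above; the mount paths and the nested field/array names are assumptions:

```python
# Minimal sketch of the "flatten into a DataFrame and save as Parquet" step.
# The mount paths and the nested field/array names (user.id, events) are assumptions;
# `spark` is the SparkSession provided by Databricks.
from pyspark.sql import functions as F

raw_path = "/mnt/raw/sessions/*.json"          # assumed blob-storage mount with the JSON files
out_path = "/mnt/processed/sessions_parquet"   # assumed output location

# Each file is a single-line JSON document, so the default (line-delimited) reader works;
# Spark infers the nested schema by sampling the input.
df = spark.read.json(raw_path)

# Example of flattening: promote a nested struct field and explode one array column.
flat = (
    df.select(
        "*",
        F.col("user.id").alias("user_id"),         # assumed nested field
        F.explode_outer("events").alias("event"),  # assumed array column
    )
    .drop("user", "events")
)

flat.write.mode("overwrite").parquet(out_path)
```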
Yes, it sounds reasonable, and in fact it's a quite standard architecture (often referred to as a lakehouse). The usual implementation approach is the following:
JSON data loaded into blob storage is consumed using Databricks Auto Loader, which provides an efficient way of ingesting only the data that is new since the previous run. You can trigger the pipeline regularly, for example nightly, or run it continuously if data is arriving all the time. Auto Loader also handles schema evolution of the input data.
The processed data is better stored as Delta Lake tables, which provide better performance than "plain" Parquet thanks to the additional information in the transaction log, making it possible to efficiently access only the necessary data. (Delta Lake is built on top of Parquet, but has more capabilities.)
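For illustration, a minimal Auto Loader sketch; the paths and the table name are made up, and schema tracking/evolution is handled via the schema location:

```python
# Minimal Auto Loader sketch: incrementally ingest the JSON files and store them
# as a Delta table. Paths and the table name are placeholders.
input_path = "/mnt/raw/sessions/"
schema_path = "/mnt/autoloader/schemas/sessions"
checkpoint_path = "/mnt/autoloader/checkpoints/sessions"

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)  # lets Auto Loader track/evolve the schema
    .load(input_path)
)

(
    stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)   # process everything new, then stop (suits a nightly job);
                                  # on older runtimes, .trigger(once=True) behaves similarly
    .toTable("sessions_bronze")   # written as a Delta table
)
```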
The processed data can then be accessed via Spark code or via Databricks SQL (which can be more efficient for reporting, etc., as it is heavily optimized for BI workloads). Given the large amount of data, storing it in some "traditional" database may not be very efficient, or may be very costly.
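For example (the table name comes from the sketch above):

```python
# Example of accessing the processed Delta table from Spark (names assumed).
df = spark.read.table("sessions_bronze")
df.describe().show()   # quick summary statistics

spark.sql("SELECT COUNT(*) AS sessions FROM sessions_bronze").show()
```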
P.S. I would recommend looking at implementing this with Delta Live Tables, which may simplify the development of your pipelines.
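A rough Delta Live Tables sketch of the same ingest-and-flatten pipeline (table names and the exploded column are assumptions):

```python
# Rough Delta Live Tables sketch of the same ingest-and-flatten pipeline.
# Table names and the exploded column are assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw session JSON ingested with Auto Loader")
def sessions_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/sessions/")   # assumed input path
    )

@dlt.table(comment="Flattened sessions")
def sessions_silver():
    return (
        dlt.read_stream("sessions_bronze")
        .select("*", F.explode_outer("events").alias("event"))  # assumed array column
        .drop("events")
    )
```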
Also, you may have access to Databricks Academy, which has introductory courses about lakehouse architecture and data engineering patterns. If you don't have access to it, you can at least look at the Databricks courses published on GitHub.
I'm doing a web scraping + data analysis project that consists of scraping product prices every day, cleaning the data, and storing it in a PostgreSQL database. The final user won't have access to the data in this database (the daily scrapes become huge, so eventually I won't be able to upload them to GitHub), but I want to explain how to replicate the project. The steps are basically:
Scrape with Selenium (Python) and save the raw data into CSV files (already on GitHub);
Read these CSV files, clean the data, and store it in the database (the cleaning script is already on GitHub);
Retrieve the data from the database to create dashboards and anything that I want (not yet implemented).
To clarify, my question is about how I can teach someone who sees my project to replicate it, given that this person won't have the database info (tables, columns). My idea is:
Add SQL queries in a folder, showing how to create the database skeleton (same tables and columns);
Add info to the README, such as how to create the environment variables used to access the database (see the sketch below);
Is it okay to do that? I'm looking for best practices in this context. Thanks!
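For instance, the replication steps could boil down to something like this sketch (the variable names, the use of psycopg2, and the SQL file path are assumptions, not part of the project yet):

```python
# Sketch of the replication path: the schema comes from the SQL folder, and the
# connection details come from environment variables that each user sets themselves.
# Variable names, psycopg2, and the SQL file path are assumptions.
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

with conn, conn.cursor() as cur:
    # Create the database skeleton from the SQL kept in the repository.
    with open("sql/create_tables.sql") as fh:
        cur.execute(fh.read())
```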
I have a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
The question may arise of using dedicated services like Data Migration or Copy activity provided by AWS. But I can't use them, since the plan is to make the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I have learned that it doesn't support JDBC as a source in Structured Streaming. I have read about multithreading and multiprocessing techniques in Python for this, but have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes). The data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and may skip a few numbers where rows have been removed due to the inactivity of the corresponding entity, say customers. It is not necessary to pull the entire record; only a few columns are pulled, and these are predefined in the configuration. The solution must be reliable, sustainable, and automatable.
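To make the multithreading idea concrete, this is roughly what I have in mind (the connection string, schema/table/column names, and the S3 bucket are placeholders; note that pulling by ID like this only captures new rows, not updates or deletes, which is part of the synchronization problem):

```python
# Rough sketch of the daemon-thread idea: one thread per schema, each polling its
# tables every 5 minutes and exporting only rows with an ID greater than the last
# one seen, written to S3 as Parquet (requires s3fs + pyarrow). All names below are
# placeholders; this captures inserts only, not updates/deletes.
import threading
import time

import pandas as pd
from sqlalchemy import create_engine

ENGINE = create_engine("mysql+pymysql://user:password@aurora-host:3306")  # placeholder
CONFIG = {  # schema -> {table: [selected columns]}
    "schema_a": {"customers": ["id", "name", "created_at"]},
    "schema_b": {"customers": ["id", "name", "created_at"]},
}
POLL_SECONDS = 300
last_seen = {}  # (schema, table) -> highest id already exported

def pull_schema(schema, tables):
    while True:
        for table, columns in tables.items():
            since = last_seen.get((schema, table), 0)
            cols = ", ".join(columns)
            df = pd.read_sql(
                f"SELECT {cols} FROM {schema}.{table} WHERE id > {since} ORDER BY id",
                ENGINE,
            )
            if not df.empty:
                last_seen[(schema, table)] = int(df["id"].max())
                df.to_parquet(
                    f"s3://my-datalake/{schema}/{table}/batch_{int(time.time())}.parquet",
                    index=False,
                )
        time.sleep(POLL_SECONDS)

threads = [
    threading.Thread(target=pull_schema, args=(schema, tables), daemon=True)
    for schema, tables in CONFIG.items()
]
for t in threads:
    t.start()

# Keep the main process alive; the daemon threads do the periodic pulls.
while True:
    time.sleep(60)
```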
Now, I'm quite confused about which approach to use and how to implement the solution once decided. Hence, I seek the help of people who have dealt with, or know of, a solution to this problem statement. I'm happy to provide more info if it is required to get to the right solution. Any help on this would be greatly appreciated.
Can a MariaDB database be used with Zarr or migrated to Zarr in a lossless fashion? If so, please provide some guidance on how this can be achieved.
I have searched the Zarr docs and the MariaDB docs and did not find enough information on this topic. I don't want to lose or modify any of the data, and I would like to be able to decompress or restore the data to its original MariaDB state. I receive output in the form of a 4 TB MariaDB (10.2) database containing multiple tables of various dimensions and multiple variable types. I am using Python (3.6+) and would like to take advantage of Zarr so that I can perform exploratory data analysis on the data contained across multiple tables in the MariaDB database while it is compressed, in an effort to save local disk space. The storage and processing of the data are all done locally, and there is no plan to use cloud services.
I have considered converting the MariaDB database to a SQLite database with Python, but stopped looking into that route, as I understand it could lead to loss or corruption of data.
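To make the question more concrete, this is roughly the kind of per-table export I have been picturing with the zarr 2.x API. It is emphatically not a lossless migration (the connection string, table name, and chunking are placeholders, and strings, NULLs, and exact SQL types are ignored), which is exactly the gap I'm asking about:

```python
# Rough sketch only (zarr 2.x API): stream numeric columns of one MariaDB table into
# a compressed Zarr group. The connection string, table name, and chunk size are
# placeholders; strings, NULLs, and exact SQL types are deliberately ignored here,
# which is why a lossless round trip is the hard part.
import numpy as np
import pandas as pd
import zarr
from numcodecs import Blosc
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")  # placeholder
compressor = Blosc(cname="zstd", clevel=5)

root = zarr.open_group("my_table.zarr", mode="w")

for chunk in pd.read_sql("SELECT * FROM my_table", engine, chunksize=1_000_000):
    numeric = chunk.select_dtypes(include=[np.number])
    for col in numeric.columns:
        if col not in root:
            # One growable 1-D compressed array per numeric column.
            root.create_dataset(
                col,
                shape=(0,),
                chunks=(1_000_000,),
                dtype=numeric[col].dtype,
                compressor=compressor,
            )
        root[col].append(numeric[col].to_numpy())
```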
Thank you in advance,
Brian
I have a Django-based application with an Oracle backend. I want to do analysis of the application's data in R. What can I do?
I would like to avoid directly querying the database because there are several aspects of our Django models that make it hard to understand the resulting database schema and would make the SQL very complicated.
I would also like to avoid writing a separate Python script to manually export data to a file and then load that file into R because these separate steps would slow down the analysis and iteration process.
My ideal would be some interface that would allow me to write Django queries directly in R. As far as I can tell the only option for this is rPython and this would be tricky to set up with the necessary Django/Python environment variables et al (right?). Are there any other ways this direct interface could be possible?
I want to get the data into R because: 1) there are some statistical R packages that aren't well implemented in Python, 2) I am quicker at transforming data in R than Python, and 3) I need to plot the results and I find it easier to make ggplot2 plots look nice than matplotlib.
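For concreteness, the separate export script I mentioned above (and would like to avoid) would look roughly like this; the model and field names are hypothetical:

```python
# Roughly the manual export step I'd like to avoid: dump a Django queryset to CSV
# so it can be read into R. Model and field names are hypothetical, and
# DJANGO_SETTINGS_MODULE must point at the project settings before running this.
import csv

import django

django.setup()

from myapp.models import Order  # hypothetical model

fields = ["id", "customer_id", "total", "created_at"]
rows = Order.objects.values(*fields)

with open("orders.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)

# Then in R: orders <- read.csv("orders.csv")
```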