How do I install the Delta Lake package in an on-premises environment? - python

I want to build a data lake for myself without using any cloud service. I have a Debian server and want to create the data lake with Delta Lake, the Databricks solution.
All the samples I have found set up Delta Lake on a cloud service.
How can I do this on my own server?
I may later want to create a cluster for storing data and doing machine learning, and I want to use only Python to create the Delta Lake.

It's a broad question. Delta Lake itself is just a library that lets you work with data in a specific format. To use it you need a few things:
Compute layer that reads and writes Delta Lake data. You can run Apache Spark on a local machine or on a Hadoop or Kubernetes cluster, or work with Delta files using the Python or Rust libraries (although you may not get all the features). The full list of integrations is available here.
Storage layer to keep your Delta Lake tables. If you use one server, you can use the local file system, but as the data grows you will need to think about distributed storage such as HDFS or MinIO.
Data access layer, i.e. how you will access the data. It could be Spark code or something similar, but you may also need to expose the data via JDBC/ODBC. In that case you may need to set up Spark's Thrift server or something like it.
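For the single-server, Python-only case, here is a minimal sketch using the `deltalake` package (the Python bindings of the Rust library) together with pandas. The `/srv/lake` directory, the `events` table name and both helper functions are made up for illustration, and both packages are assumed to be installed with pip.

```python
from pathlib import PurePosixPath


def table_uri(base_dir: str, name: str) -> str:
    """Return the directory that holds one Delta table.

    The <base_dir>/<name> layout is an assumption made for this sketch.
    """
    return str(PurePosixPath(base_dir) / name)


def append_and_read(uri: str, rows: list[dict]):
    """Append rows to a Delta table on local disk and read the table back."""
    # Imported lazily so the helper above works without these packages.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    # mode="append" creates the table on the first call and adds to it afterwards.
    write_deltalake(uri, pd.DataFrame(rows), mode="append")
    return DeltaTable(uri).to_pandas()
```

Something like `append_and_read(table_uri("/srv/lake", "events"), [{"id": 1, "kind": "click"}])` should then return a pandas DataFrame with the stored rows. No Spark or JVM is involved here, which is why some Delta features may be missing.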

Related

Google Cloud Storage JSONs to Pandas Dataframe to Warehouse

I am a newbie in ETL. I just managed to extract a lot of information in the form of JSON files to GCS. Each JSON file contains the same keys, and now I would like to transform the files into dataframes based on certain key values.
The next step would be loading this into a data warehouse like Clickhouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If these are too much, I would love it if you can point me to resources around this. Appreciate the help
There are some ways to do this.
You can add files to storage, then a Cloud Function is triggered every time a new file is added (https://cloud.google.com/functions/docs/calling/storage) and calls an endpoint in Cloud Run (container service - https://cloud.google.com/run/docs/building/containers) running a Python application that transforms these JSONs into a dataframe. Note that the container image will be stored in Container Registry. The Python application running on Cloud Run then saves the rows incrementally to BigQuery (warehouse). After that you can do analytics with Looker Studio.
If you need to scale the solution to millions or billions of rows, you can add files to storage, have the Cloud Function trigger and call Dataproc, a service where you can run Python, Anaconda, etc. (How to call google dataproc job from google cloud function). The Dataproc cluster then structures the JSONs as a dataframe and saves it to the warehouse (BigQuery).
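The transformation step itself can be small. Below is a rough sketch of what the Python application might do, assuming each object holds a single JSON document and that the `google-cloud-storage` and pandas packages are available. The function names, and the bucket, prefix and key you pass in, are placeholders of mine, not anything from the question.

```python
import json


def filter_records(json_texts, key, wanted):
    """Parse JSON documents and keep those whose `key` equals `wanted`."""
    records = (json.loads(text) for text in json_texts)
    return [rec for rec in records if rec.get(key) == wanted]


def iter_gcs_json(bucket_name, prefix):
    """Yield the text of each object under a prefix, one object at a time."""
    # Imported lazily; needs the google-cloud-storage package and credentials.
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        yield blob.download_as_text()


def to_dataframe(bucket_name, prefix, key, wanted):
    """Build a DataFrame from the matching JSON documents only."""
    import pandas as pd

    matching = filter_records(iter_gcs_json(bucket_name, prefix), key, wanted)
    return pd.json_normalize(matching)
```

Objects are read one by one, so nothing is written to local disk and only the matching records are kept in memory.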

Is there any way to replicate real-time streaming from Azure Blob Storage to Azure MySQL

We can basically use Databricks as an intermediate layer, but I'm stuck on the Python script that replicates data from Blob Storage to Azure MySQL every 30 seconds. We are using CSV files here. The script needs to store the CSV data with the current timestamp.
There is no ready-made streaming option for MySQL in Spark/Databricks, as MySQL is not a stream source/sink technology.
In Databricks you can use the writeStream .foreach(...) or .foreachBatch(...) option. With foreachBatch, each micro-batch is passed to your function as a regular dataframe, which you can save in the place of your choice (so write it to MySQL).
Personally I would go for a simple solution. In Azure Data Factory it is enough to create two datasets (it can even be done without them), one for MySQL and one for Blob, and use a pipeline with a Copy activity to transfer the data.
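A rough sketch of the foreachBatch route in PySpark follows. It assumes a MySQL JDBC driver is installed on the cluster; the `loaded_at` column name, the helper names and all connection details are placeholders.

```python
def jdbc_url(host: str, database: str, port: int = 3306) -> str:
    """Build a MySQL JDBC URL (3306 is the default MySQL port)."""
    return f"jdbc:mysql://{host}:{port}/{database}"


def make_batch_writer(url, table, user, password):
    """Return a function suitable for DataStreamWriter.foreachBatch."""

    def write_batch(batch_df, batch_id):
        # Imported lazily; only needed where Spark is available.
        from pyspark.sql.functions import current_timestamp

        # Stamp every row with the load time, then append the batch to MySQL.
        (batch_df.withColumn("loaded_at", current_timestamp())
            .write.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .mode("append")
            .save())

    return write_batch


def start_stream(spark, source_path, schema, writer):
    """Watch a folder of CSV files and hand a micro-batch to `writer` every 30 s."""
    return (spark.readStream.schema(schema)
            .option("header", "true")
            .csv(source_path)
            .writeStream.foreachBatch(writer)
            .trigger(processingTime="30 seconds")
            .start())
```

Streaming file sources need an explicit schema, which is why `start_stream` takes one as an argument.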

Access datalake from Azure datafactory V2 using on demand HD Insight cluster

I am trying to execute a Spark job on an on-demand HDInsight cluster using Azure Data Factory.
The documentation clearly indicates that ADF (v2) does not support a Data Lake linked service for an on-demand HDInsight cluster, and that one has to copy the data to Blob storage with a Copy activity and then execute the job. But this workaround seems hugely resource-expensive when there are a billion files in the data lake. Is there an efficient way to access the data lake files, either from the Python script that executes the Spark jobs or by accessing the files directly in some other way?
P.S. Is there a possibility of doing a similar thing from v1? If yes, then how? "Create on-demand Hadoop clusters in HDInsight using Azure Data Factory" describes an on-demand Hadoop cluster that accesses Blob storage, but I want an on-demand Spark cluster that accesses the data lake.
P.P.s Thanks in advance
Currently, we don't have support for the ADLS data store with an HDI Spark cluster in ADF v2. We plan to add that in the coming months. Until then, you will have to continue using the workaround you mentioned in your post above. Sorry for the inconvenience.
The Blob storage is used for the scripts and config files that the on-demand cluster will use. The scripts you write and store in the attached Blob storage can themselves write from ADLS to SQLDB, for example.

AWS Redshift Data Processing

I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there's so many that I'm a little overwhelmed. Thanks in advance for the assistance!
ETL
Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
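If you would rather stay in Python, the same statement can be sent with a PostgreSQL driver. This is only a sketch, assuming `psycopg2` is installed and can reach the cluster; the helper names are mine, and the table and column names must come from trusted input because they are interpolated directly into the SQL.

```python
def insert_select_sql(target, source, columns):
    """Build an INSERT ... SELECT statement (identifiers must be trusted)."""
    column_list = ", ".join(columns)
    return f"INSERT INTO {target} SELECT {column_list} FROM {source}"


def run_statement(sql, **connect_kwargs):
    """Execute one statement and commit it."""
    # Imported lazily; requires the psycopg2 package.
    import psycopg2

    conn = psycopg2.connect(**connect_kwargs)
    try:
        # `with conn` commits on success and rolls back on error.
        with conn, conn.cursor() as cur:
            cur.execute(sql)
    finally:
        conn.close()
```

A nightly cron entry could then call something like `run_statement(insert_select_sql("z", "x", ["a", "b"]), host=..., port=5439, dbname=..., user=..., password=...)`, with the connection values filled in for your cluster.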
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
Machine Learning
Yes, you could run code on an EC2 instance. If it is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that could do your processing and then self-Terminate.
Lots of options, indeed!
The 2 options for running ETL on Redshift
1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table.
2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full featured than cron jobs.
Basic transforming of existing data
It seems you are a Python developer (you mentioned you are developing a Python-based ML model), so you can do the transformation with the following steps:
You can use boto3 (https://aws.amazon.com/sdk-for-python/) to talk to Redshift from any workstation on your LAN (make sure your IP has the proper privileges).
You can write your own Python functions that mimic stored procedures. Inside these functions, you can put / construct your transformation logic.
Alternatively, you can create Python functions in Redshift as well that will act like stored procedures. See more here (https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/)
Finally, you can use Windows Task Scheduler or a cron job to schedule your Python scripts with parameters, as a SQL Server Agent job does.
Best way to host my python logic
It seems to me you are reading some data from Redshift, creating test and training sets, and finally getting some predicted results (records). If so:
Host the script on any of your servers (LAN) and connect to Redshift using boto3. If you need to transfer a large number of rows over the internet, then an EC2 instance in the same region is an option. Start the EC2 instance on an ad-hoc basis, complete your job and stop it again. That will be cost effective. You can do this using the AWS SDK. I have done it using the .NET framework, and I assume boto3 supports it as well.
If your result sets are relatively small, you can save them directly into the target Redshift table.
If the result sets are larger, save them as CSV (there are several Python libraries for this) and load the rows into a staging table using the COPY command if you need any intermediate calculation. If not, load them directly into the target table.
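For the larger case, a sketch of the CSV and COPY step using only the standard library is shown below. It assumes the CSV text is uploaded to S3 before COPY runs; the table name, S3 URI and IAM role are placeholders you would replace.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def copy_statement(table, s3_uri, iam_role):
    """Build a Redshift COPY statement for a CSV file that has a header row."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1"
    )
```

You would upload the output of `rows_to_csv(...)` to S3 and then execute the string returned by `copy_statement(...)` against the cluster.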
Hope this helps.

Combining many log files in Amazon S3 and read in locally

I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks and months worth of these log files and read it into python.
I have used boto to open and read every key and append all the logs together but it's way too slow. I am looking for an alternate solution to this. Do you have any suggestion?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, then your app can combine the files rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
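If you do want to roll your own, a minimal sketch with a thread pool and boto3 might look like this. The function names and the worker count of 16 are arbitrary choices of mine; the right number depends on your bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=16):
    """Fetch many objects concurrently, returning results in key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))


def make_s3_fetcher(bucket):
    """Return a function that downloads one S3 object as bytes."""
    # Imported lazily; requires boto3 and AWS credentials.
    import boto3

    client = boto3.client("s3")  # one client shared by all worker threads

    def fetch_one(key):
        return client.get_object(Bucket=bucket, Key=key)["Body"].read()

    return fetch_one
```

`b"".join(fetch_all(keys, make_s3_fetcher("my-log-bucket")))` would then give you the combined logs in key order, where `my-log-bucket` stands in for your bucket name.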
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given that a year's worth of this log file is only ~50k objects, a multi-connection client on a fast EC2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCp to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering.) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
