Retrieving Data from Elasticsearch SQL CLI - Insert into DataFrame - Python

I am working with Elasticsearch 6.7, which ships with an Elasticsearch SQL CLI. This lets me run more standard SQL queries, and I prefer it over the API method because the query capabilities are much more robust.
I am attempting to run a query through this CLI and insert the results into a pandas DataFrame. Is this something I can do via subprocess, or is there an easier/better way? This will go into production, so it needs to run in multiple environments.
This python program will be running on a different host than the Elasticsearch machine.
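Driving the interactive CLI through subprocess is possible but brittle across environments. As a hedged alternative, assuming the 6.7 cluster has X-Pack SQL enabled and its REST endpoint is reachable from the Python host (the host, index, and query below are placeholders), the same SQL can be sent to the _xpack/sql endpoint and loaded straight into pandas:

    import io

    import pandas as pd
    import requests

    # Hypothetical connection details -- adjust for your environment.
    ES_HOST = "http://elasticsearch-host:9200"
    SQL_ENDPOINT = ES_HOST + "/_xpack/sql"   # 6.x path; 7.x+ uses /_sql

    def es_sql_to_dataframe(query):
        """Run an Elasticsearch SQL query and load the result into a DataFrame."""
        # format=csv makes the response trivially parseable by pandas.
        resp = requests.post(
            SQL_ENDPOINT,
            params={"format": "csv"},
            json={"query": query},
            timeout=30,
        )
        resp.raise_for_status()
        return pd.read_csv(io.StringIO(resp.text))

    if __name__ == "__main__":
        df = es_sql_to_dataframe("SELECT * FROM my_index LIMIT 100")
        print(df.head())

Because this only needs requests and pandas, it avoids any dependency on the CLI binary being installed on the application hosts.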

Related

Is there a way to connect to Microsoft SQL Server with Apache Beam?

I wanted to ask if there's a connector, function, or anything else that makes it possible to read Microsoft SQL Server tables in Apache Beam, so as to write them to BigQuery.
Thanks!
Yes! Apache Beam supports JdbcIO, so as long as a Dataflow worker can reach the database and you use the right drivers for it, it should be straightforward to achieve.
It is available in Python as apache_beam.io.jdbc and uses the expansion service, so you can get it working as long as you use Dataflow Runner v2.
If you want something even simpler, you may be able to use the Google-provided JDBC to BigQuery template, which is currently in beta/pre-GA.
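For illustration only, a minimal sketch of such a pipeline with apache_beam.io.jdbc.ReadFromJdbc and WriteToBigQuery might look like the following; the JDBC URL, credentials, table names, and schema are placeholders, and the ReadFromJdbc keyword arguments can vary slightly between Beam releases:

    import apache_beam as beam
    from apache_beam.io.jdbc import ReadFromJdbc
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical connection details -- replace with your own.
    JDBC_URL = "jdbc:sqlserver://my-sql-server:1433;databaseName=mydb"
    DRIVER = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

    def run():
        # Project, region, and temp_location options are omitted for brevity.
        options = PipelineOptions(
            runner="DataflowRunner",
            experiments=["use_runner_v2"],  # cross-language IO needs Runner v2
        )
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadFromSqlServer" >> ReadFromJdbc(
                    table_name="dbo.customers",      # placeholder table
                    driver_class_name=DRIVER,
                    jdbc_url=JDBC_URL,
                    username="my-user",              # placeholder credentials
                    password="change-me",
                )
                # Rows arrive as named tuples; convert to dicts for BigQuery.
                | "ToDict" >> beam.Map(lambda row: row._asdict())
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    "my-project:my_dataset.customers",   # placeholder table
                    schema="id:INTEGER,name:STRING",     # placeholder schema
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )

    if __name__ == "__main__":
        run()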

Is there a way to programmatically DROP a SQL Database in Azure?

I am working on a process to automatically remove and add databases to Azure. When the database isn't in use, it can be removed from Azure and placed in cheaper S3 storage as a .bacpac.
I am using Microsoft's SqlPackage.exe from a PowerShell script to export and import these databases from and to Azure, and I invoke it via a Python script so I can use boto3.
The issue I have is with the down direction at step 3. The sequence would be:
Download the Azure SQL DB to a .bacpac (can be achieved with SqlPackage.exe)
Upload this .bacpac to cheaper S3 storage (using boto3 Python SDK)
Delete the Azure SQL Database (it appears the Azure Blob Python SDK can't help me, and SqlPackage.exe does not seem to have a delete function)
Is step 3 impossible to automate with a script? Could a workaround be to SqlPackage.exe import a small dummy .bacpac with the same name to overwrite the old bigger DB?
Thanks.
To remove an Azure SQL Database using PowerShell, you will need to use the Remove-AzSqlDatabase cmdlet.
To remove an Azure SQL Database using the Azure CLI, you will need to use az sql db delete.
If you want to write code in Python to delete the database, you will need to use the Azure SDK for Python.
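As a rough sketch of the Python route, using azure-identity and azure-mgmt-sql (all resource names below are placeholders, and older track-1 releases of the SDK expose databases.delete instead of begin_delete):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    # Hypothetical names -- substitute your subscription, group, server, and db.
    SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
    RESOURCE_GROUP = "my-resource-group"
    SERVER_NAME = "my-sql-server"
    DATABASE_NAME = "my-database"

    credential = DefaultAzureCredential()
    client = SqlManagementClient(credential, SUBSCRIPTION_ID)

    # Recent (track 2) versions of azure-mgmt-sql return a poller from begin_delete.
    poller = client.databases.begin_delete(RESOURCE_GROUP, SERVER_NAME, DATABASE_NAME)
    poller.result()  # block until the deletion completes
    print("Deleted {} from {}".format(DATABASE_NAME, SERVER_NAME))

Dropping the database this way fits naturally as step 3, after the .bacpac has been exported and uploaded to S3.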

Azure Databricks Python Job

I have a requirement to parse a lot of small unstructured files in near real-time inside Azure and load the parsed data into a SQL database. I chose Python (because I don't think any Spark cluster or big data setup would suit the volume and size of the source files) and the parsing logic has already been written. I am now looking to schedule this Python script in different ways using Azure PaaS:
Azure Data Factory
Azure Databricks
Both 1+2
May I ask what the implications are of running a Python notebook activity from Azure Data Factory pointing to Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?
Also, please advise whether you think the script has to be converted to PySpark to meet my use-case requirement of running in Azure Databricks. My only hesitation is that the files are only kilobytes in size and they are unstructured.
If the script is pure Python then it would only run on the driver node of the Databricks cluster, making it very expensive (and slow due to cluster startup times).
You could rewrite it as PySpark, but if the data volumes are as low as you say then this is still expensive and slow: the smallest cluster will consume two VMs, each with 4 cores.
I would look at using Azure Functions instead. Python is now an option: https://learn.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01
Azure Functions also have great integration with Azure Data Factory so your workflow would still work.
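For illustration only, here is a hedged sketch of a blob-triggered function using the decorator-based Python programming model; the container path, connection string, target table, and the toy comma-separated parsing are all placeholders standing in for the real parsing logic:

    import logging

    import azure.functions as func
    import pyodbc

    app = func.FunctionApp()

    # Hypothetical connection string -- adjust to your Azure SQL setup.
    SQL_CONN_STR = (
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=tcp:my-server.database.windows.net,1433;"
        "Database=my-db;Uid=my-user;Pwd=my-password;Encrypt=yes;"
    )

    @app.blob_trigger(arg_name="blob", path="incoming/{name}",
                      connection="AzureWebJobsStorage")
    def parse_and_load(blob: func.InputStream):
        """Fires once per small file landing in the hypothetical 'incoming' container."""
        logging.info("Parsing %s", blob.name)

        # Stand-in for your existing parsing logic: one name,value pair per line.
        rows = []
        for line in blob.read().decode("utf-8").splitlines():
            name, _, value = line.partition(",")
            rows.append((name.strip(), value.strip()))

        # pyodbc commits automatically when the with-block exits without error.
        with pyodbc.connect(SQL_CONN_STR) as conn:
            conn.cursor().executemany(
                "INSERT INTO parsed_data (name, value) VALUES (?, ?)", rows
            )

Since each file is tiny, a per-blob trigger scales with the number of files rather than paying for an always-on cluster.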

Delivering python program with database postgresql

I want to give my Python program, which uses PostgreSQL as its database, to another person without them having the fuss of installing PostgreSQL themselves.
Is there a way of doing that without switching to sqlite?

Testing Hive + spark python programs locally?

I'd like to develop programs with spark + hive and unit test them locally.
Is there a way to get hive to run in-process? Or something else that will facilitate unit testing?
I'm using python 2.7 on Mac
EDIT: since Spark 2, it is possible to create a local Hive metastore that can be used in tests. The original answer is at the bottom.
From the Spark SQL programming guide:
When working with Hive, one must instantiate SparkSession with Hive
support, including connectivity to a persistent Hive metastore,
support for Hive serdes, and Hive user-defined functions. Users who do
not have an existing Hive deployment can still enable Hive support.
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started. Note that the hive.metastore.warehouse.dir property in
hive-site.xml is deprecated since Spark 2.0.0. Instead, use
spark.sql.warehouse.dir to specify the default location of database in
warehouse. You may need to grant write privilege to the user who
starts the Spark application.
Basically this means that if you don't configure Hive, Spark will create a metastore for you and store it on local disk.
Two configurations you should be aware of:
spark.sql.warehouse.dir - a Spark config that points to where the table data is stored on disk, e.g. "/path/to/test/folder/warehouse/"
javax.jdo.option.ConnectionURL - a Hive config that should be set in hive-site.xml (or as a system property), e.g. "jdbc:derby:;databaseName=/path/to/test/folder/metastore_db;create=true"
Neither is mandatory (they have default values), but it is sometimes convenient to set them explicitly.
You also need to make sure the test folder is cleaned between tests, so each suite gets a clean environment; a sketch follows below.
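Putting the above together, a minimal sketch of a test-time SparkSession with a throwaway warehouse and Derby metastore could look like this (the folder layout and table are placeholders; passing javax.jdo.option.ConnectionURL through the builder is one common way to avoid shipping a hive-site.xml in tests):

    import shutil
    import tempfile

    from pyspark.sql import SparkSession

    # Throwaway folder so each suite gets its own warehouse and metastore.
    test_dir = tempfile.mkdtemp()

    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("hive-unit-tests")
        .config("spark.sql.warehouse.dir", test_dir + "/warehouse")
        .config("javax.jdo.option.ConnectionURL",
                "jdbc:derby:;databaseName=" + test_dir + "/metastore_db;create=true")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("CREATE TABLE IF NOT EXISTS test_table (id INT, name STRING)")
    spark.sql("SHOW TABLES").show()

    # Tear down and wipe the folder so the next suite starts clean.
    spark.stop()
    shutil.rmtree(test_dir, ignore_errors=True)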
Original Answer:
I would recommend installing a Vagrant box that contains a full (small) Hadoop cluster in a VM on your machine.
You can find a ready-made Vagrant setup here: http://blog.cloudera.com/blog/2014/06/how-to-install-a-virtual-apache-hadoop-cluster-with-vagrant-and-cloudera-manager/
That way your tests run in the same environment as production.
