I want to send data from Kibana (Elasticsearch) to MySQL.
Is there any simple way to do so directly, or is it possible through Python?
I think the whole task can be divided into two parts:
how to fetch the data from Elasticsearch (you can do it via Python): https://elasticsearch-py.readthedocs.io/en/master/
how to add the data to MySQL (you can do it via Python): https://dev.mysql.com/doc/connector-python/en/connector-python-example-cursor-transaction.html
Btw, you can check this page for a sample script that gets all documents from one index in ES via Python: https://discuss.elastic.co/t/get-all-documents-from-an-index/86977
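To make that concrete, here is a minimal, untested sketch combining the two steps (the index name "my_index", the table "my_table" and all connection details are placeholders):

import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import mysql.connector

es = Elasticsearch(["http://localhost:9200"])
db = mysql.connector.connect(host="localhost", user="root",
                             password="secret", database="mydb")
cursor = db.cursor()

# scan() pages through every document in the index for you
for hit in scan(es, index="my_index", query={"query": {"match_all": {}}}):
    cursor.execute(
        "INSERT INTO my_table (id, payload) VALUES (%s, %s)",
        (hit["_id"], json.dumps(hit["_source"])),
    )

db.commit()
cursor.close()
db.close()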
What you need is called an ETL. I am not giving an exact answer, since your question is fairly general.
You can develop a small Python script to achieve this, but in general it is more useful to use a real ETL tool.
I recommend Apache Spark, with the official elasticsearch-hadoop plugin:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#write-data-to-jdbc
Example in Scala (but you could use Python, Java or R):
val df = sqlContext.read.format("org.elasticsearch.spark.sql").load("spark/trips")
df.write.jdbc(jdbcUrl, "_table_", connectionProperties)
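Since the question mentions Python, roughly the same thing in PySpark might look like this (untested sketch; it assumes the elasticsearch-hadoop jar is on the classpath, and the ES address, JDBC URL, table name and credentials are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-to-mysql").getOrCreate()

# read the index through the elasticsearch-spark connector
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .load("spark/trips"))

# write it out to MySQL over JDBC
df.write.mode("append").jdbc(
    "jdbc:mysql://localhost:3306/mydb", "trips",
    properties={"user": "root", "password": "secret",
                "driver": "com.mysql.cj.jdbc.Driver"})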
The benefits:
Spark will distribute the work across workers (it reads all Elasticsearch shards at the same time!)
Handles failover
Lets you modify the data
I have been asked to write a Python script that pulls out all the Glue databases in our AWS account and then lists all the tables and partitions in each database in a CSV file. It's acceptable for it to just run on my desktop for now. I would really love some guidance on how to go about this, as I'm a new junior and would like to explore my options before going back to my manager.
Format: layout of the CSV file (omitted)
Can be easily done using Boto3 - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client
I'll start it off for you and you can figure out the rest.
import boto3
glue_client = boto3.client('glue')
db_name_list = [db['Name'] for db in glue_client.get_databases()['DatabaseList']]
I haven't tested this code, but it should create a list of all of your database names. From here you can run nested loops to get your tables with get_tables(DatabaseName=...) and then your partitions with get_partitions(DatabaseName=..., TableName=...).
Make sure to read the documentation to double-check that the arguments you're providing are correct.
EDIT: You will also likely need to use a paginator if you have a large number of values to be returned. Best practice would be to use a paginator for all three calls, which just means an additional loop at each step. The paginator documentation is here - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Paginator.GetDatabases
And there are plenty of Stack Overflow examples of how to use it.
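Putting the pieces together, a rough sketch of the whole script could look like this (untested; the output file name and column layout are assumptions, and tables without partitions simply produce no rows here):

import csv
import boto3

glue_client = boto3.client('glue')

# paginators handle the NextToken bookkeeping for large accounts
db_paginator = glue_client.get_paginator('get_databases')
table_paginator = glue_client.get_paginator('get_tables')
part_paginator = glue_client.get_paginator('get_partitions')

with open('glue_inventory.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['database', 'table', 'partition_values'])
    for db_page in db_paginator.paginate():
        for db in db_page['DatabaseList']:
            for tbl_page in table_paginator.paginate(DatabaseName=db['Name']):
                for tbl in tbl_page['TableList']:
                    for part_page in part_paginator.paginate(
                            DatabaseName=db['Name'], TableName=tbl['Name']):
                        for part in part_page['PartitionList']:
                            writer.writerow([db['Name'], tbl['Name'],
                                             ','.join(part['Values'])])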
I am currently porting some code from Spark to MSSQL Analytical Services with Python. Everything is nice and dandy, but I am not sure whether my solution is the correct one for providing multiple inputs to the script.
Consider the following code snippet:
DROP PROCEDURE IF EXISTS SampleModel;
GO
CREATE PROCEDURE SampleModel
AS
BEGIN
exec sp_execute_external_script
@language = N'Python',
@script = N'
import sys
sys.path.append(r"C:\path\to\custom\package")
from super_package.sample_model import run_model
OutputDataSet = run_model()'
WITH RESULT SETS ((Score float));
END
GO
INSERT INTO [dbo].[SampleModelPredictions] (prediction) EXEC [dbo].[SampleModel]
GO
I have a custom package called super_package and a sample model called sample_model. Since this model uses multiple database tables as input, and I would rather have everything in one place, I have a module which connects to the database and fetches the data directly:
# revoscalepy provides RxSqlServerData / rx_data_step inside the SQL Server Python runtime
from revoscalepy import RxSqlServerData, rx_data_step

def go_go_get_data(query, config):
    return rx_data_step(RxSqlServerData(
        sql_query=query,
        connection_string=config.connection_string,
        user=config.user,
        password=config.password))
Inside the run_model() function I fetch all necessary data from the database with the go_go_get_data function.
If the data is too big to handle in one go, I would do some pagination.
In general I cannot join the tables, so that solution doesn't work.
The question is: is this the right approach to tackle the problem, or did I miss something? For now this works, but as I am still in the development / try-out phase I cannot be certain that it will scale. I would rather use parameters for the stored procedure than fetch the data inside the Python context.
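For illustration only, the shape of run_model() I have in mind is roughly this (the queries, join key and scoring step are placeholders; go_go_get_data is assumed to return a pandas DataFrame):

import pandas as pd

def run_model():
    # each input table is pulled separately via go_go_get_data / RxSqlServerData
    trips = go_go_get_data("SELECT * FROM dbo.Trips", config)
    drivers = go_go_get_data("SELECT * FROM dbo.Drivers", config)

    # tables that cannot be joined in SQL are combined here in pandas instead
    data = trips.merge(drivers, on="driver_id", how="left")

    # placeholder "model": one float per row, returned as the Score result set
    return pd.DataFrame({"Score": data.iloc[:, 0].astype(float)})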
As you've already figured out, sp_execute_external_script only allows one result set to be passed in. :-(
You can certainly query from inside the script to fetch data as long as your script is okay with the fact that it's not executing under the current SQL session's user's permissions.
If pagination is important and one data set is significantly larger than the others and you're using Enterprise Edition, you might consider passing the largest data set into the script in chunks using sp_execute_external_script's streaming feature.
If you'd like all of your data to be assembled in SQL Server (vs. fetched by queries in your script), you could try to serialize the result sets and then pass them in as parameters (link describes how to do this in R but something similar should be possible with Python).
I'm trying to build a simple time series database in Prometheus. I'm looking at financial time series data and need somewhere to store it for quick access via Python. I'm loading the data into the time series from XML or CSV files, so this isn't some crazy "lots of data in and out at the same time" kind of project. I'm the only user, maybe a couple of others will use it in time, and I just want something that's easy to load data into and pull out of.
I was hoping for some guidance on how to do this. A few questions:
1) Is it simple to pull data from a prometheus database via python?
2) I wanted to run this all locally off my windows machine, is that doable?
3) Am I completely overengineering this? (My worry with SQL is that it would be a mess to work with, as these are large time series data sets)
Thanks
Prometheus is intended primarily for operational monitoring. While you may be able to get something working, Prometheus doesn't, for example, support bulk loading of data.
1) Is it simple to pull data from a prometheus database via python?
The HTTP API should be easy to use.
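For example, a minimal sketch against the HTTP API might look like this (it assumes Prometheus on localhost:9090; the metric name and time range are placeholders):

import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query_range",
    params={
        "query": "my_metric",
        "start": "2023-01-01T00:00:00Z",
        "end": "2023-01-02T00:00:00Z",
        "step": "1h",
    },
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:5])  # [timestamp, value] pairs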
2) I wanted to run this all locally off my windows machine, is that doable?
That should work.
3) Am I completely overengineering this? (My worry with SQL is that it would be a mess to work with, as these are large time series data sets)
I'd more say that Prometheus is probably not the right tool for the job here. Up to say 100GB I'd consider a SQL database to be a good starting point.
My company has decided to implement a datamart using Greenplum, and I have the task of figuring out how to go about it. A ballpark figure for the amount of data to be transferred from the existing DB2 DB to the Greenplum DB is about 2 TB.
I would like to know :
1) Is the Greenplum DB the same as vanilla PostgreSQL? (I've worked on Postgres AS 8.3)
2) Are there any (free) tools available for this task (extract and import)?
3) I have some knowledge of Python. Is it feasible, or even easy, to do this in a reasonable amount of time?
I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.
1) Greenplum is not vanilla Postgres, but it is similar. It has some new syntax, but in general it is highly consistent.
2) Greenplum itself provides something called "gpfdist" which lets you listen on a port that you specify in order to bring in a file (but the file has to be split up). You want readable external tables. They are quite fast. Syntax looks like this:
CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
'gpfdist://server:port1/path/to/filep1.txt',
'gpfdist://server:port2/path/to/filep2.txt',
'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
If you play by their rules and your data is clean, the loading can be blazing fast.
3) You don't need Python to do this, although you could automate it by using Python to kick off the gpfdist processes and then sending a command to psql that creates the external table and loads the data. It depends on what you want to do, though.
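If you do want to automate it, a rough Python sketch could be (the -d/-p flags are gpfdist's standard directory/port options; host names, paths and the SQL file are placeholders):

import subprocess
import time

# start a gpfdist server that serves the directory containing the split files
gpfdist = subprocess.Popen(["gpfdist", "-d", "/path/to/files", "-p", "8081"])
time.sleep(2)  # give it a moment to start listening

try:
    # create the readable external table and load it through psql
    subprocess.run(
        ["psql", "-h", "gp-master", "-d", "mydb", "-f", "load_external_table.sql"],
        check=True)
finally:
    gpfdist.terminate()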
Many of Greenplum's utilities are written in Python, and the current DBMS distribution comes with Python 2.6.2 installed, including the pygresql module, which you can use to work inside the GPDB.
For data transfer into Greenplum, I've written Python scripts that connect to the source (Oracle) DB using cx_Oracle and then dump the output either to flat files or named pipes. gpfdist can read from either kind of source and load the data into the system.
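As a rough illustration of that flat-file approach (connection details, query and output path are placeholders, not my actual scripts):

import csv
import cx_Oracle

conn = cx_Oracle.connect("user", "password", "oracle-host/SERVICE")
cursor = conn.cursor()
cursor.execute("SELECT * FROM source_table")

# dump the rows as a tab-delimited file for gpfdist to serve
with open("/path/to/files/source_table.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for row in cursor:
        writer.writerow(row)

cursor.close()
conn.close()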
Generally, it is really slow if you use SQL INSERT or MERGE statements to import big bulk data.
The recommended way is to use external tables, which you define to use file-based, web-based or gpfdist-protocol hosted files.
Greenplum also has a utility named gpload, which can be used to define your transfer jobs: source, output, mode (insert, update or merge).
1) It's not vanilla postgres
2) I have used Pentaho Data Integration with good success in various types of data transfer projects.
It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully.
Also, I believe Pentaho supports Greenplum specifically, though I have no experience of this.
I want to be able to add daily info to each object, and I want the ability to easily delete info that is x days old. With the tables I need to look at trends and do things like selecting objects which match some criteria.
Edit: I asked this because I'm not able to think of a way to implement deleting old data easily, since you cannot delete tables in SQLite.
Using SQLite would be the best option: it is file-based, easy to use, you can do lookups with SQL, and it's built into Python, so you don't need to install anything.
→ http://docs.python.org/library/sqlite3.html
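For example, deleting data older than x days is a single statement; a minimal sketch (the table name and retention window are placeholders):

import sqlite3

conn = sqlite3.connect("objects.db")
conn.execute("""CREATE TABLE IF NOT EXISTS daily_info (
                    object_id INTEGER,
                    recorded_on TEXT,   -- ISO date, e.g. '2024-01-31'
                    value REAL)""")

# drop everything older than 30 days in one go
conn.execute("DELETE FROM daily_info WHERE recorded_on < date('now', '-30 days')")
conn.commit()
conn.close()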
If your question means that you are just going to be using "table-like data" but not bound to a DB, look into using this Python module: Module for table-like syntax
If you are going to be binding to a back end, and not distributing your data among computers, then SQLite is the way to go.
A "proper" database would probably be the way to go. If your application only runs on one computer and the database doesn't get to big, sqlite is good and easy to use with python (standard module sqlite3, see the Library Reference for more information)
Take a look at the sqlite3 module; it lets you create a single-file database (no server to set up) that will let you perform SQL queries. It's part of the Python standard library, so you don't need to install anything additional.
http://docs.python.org/library/sqlite3.html