How to create a Glue search script in Python

So I have been asked to write a Python script that pulls out all the Glue databases in our AWS account and then lists all the tables and partitions in each database in a CSV file. It's acceptable for it to just run on my desktop for now. I would really love some guidance/direction on how to go about this, as I'm a new junior and would like to explore my options before going back to my manager.
Format:
[image: layout of CSV file]

Can be easily done using Boto3 - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client
I'll start it off for you and you can figure out the rest.
import boto3
glue_client = boto3.client('glue')
db_name_list = [db['Name'] for db in glue_client.get_databases()['DatabaseList']]
I haven't tested this code, but it should create a list of all the names of your databases. From here you can use this information to run nested loops: get your tables with get_tables(DatabaseName=...), and then nest your partitions with get_partitions(DatabaseName=..., TableName=...).
Make sure to read the documentation to double-check that the arguments you're providing are correct.
EDIT: You will also likely need to use a paginator if you have a large number of values to be returned. Best practice would be to use a paginator for all three calls, which would just mean an additional loop at each step. Documentation about paginators is here - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Paginator.GetDatabases
And there are plenty of Stack Overflow examples of how to use it.
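Putting those pieces together, here is an untested sketch of the whole script with a paginator at each level (the output filename and CSV column layout are placeholders to adapt to your required format):

import csv
import boto3

glue_client = boto3.client('glue')

with open('glue_inventory.csv', 'w', newline='') as f:  # placeholder filename
    writer = csv.writer(f)
    writer.writerow(['database', 'table', 'partition_values'])
    # Outer loop: every Glue database in the account/region.
    for db_page in glue_client.get_paginator('get_databases').paginate():
        for db in db_page['DatabaseList']:
            # Middle loop: every table in the database.
            for tbl_page in glue_client.get_paginator('get_tables').paginate(
                    DatabaseName=db['Name']):
                for tbl in tbl_page['TableList']:
                    # Inner loop: every partition of the table.
                    for part_page in glue_client.get_paginator('get_partitions').paginate(
                            DatabaseName=db['Name'], TableName=tbl['Name']):
                        for part in part_page['PartitionList']:
                            writer.writerow([db['Name'], tbl['Name'],
                                             '/'.join(part['Values'])])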


Mapping fields inside other fields

Hello, I would like to make an app that allows the user to import data from a source of his choice (Airtable, XLS, CSV, JSON) and export to a JSON file which will be pushed to an SQLite database using an API.
The "core" of the functionality of the app is that it allows the user to create a "template" and "map" of the source columns inside the destination columns. Which source column(s) go to which destination column is up to the user. I am attaching two photos here (used in airtable/zapier), so you can get a better idea of the end result:
[image: adding fields inside fields - Airtable]
[image: adding fields inside fields - Zapier]
I would like to know if you can recommend a library or a way to approach this problem. I have tried looking for Python or Node.js libraries, but I am lost between using ETL libraries, mapping/zipping features, or coding my own classes. Do you know of any libraries that do the same thing as Airtable/Zapier? Any suggestions?
Saving files in a database is really bad practice, since it takes up a lot of database storage space and adds latency to the communication.
I highly recommend saving the file on disk and storing the path in the database.
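For example, a minimal sketch of that pattern with SQLite (the table, directory, and column names here are made up):

import os
import sqlite3
import uuid

UPLOAD_DIR = 'uploads'  # hypothetical directory for stored files

def save_upload(conn, filename, data):
    # Write the file to disk; only its path goes into the database.
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    path = os.path.join(UPLOAD_DIR, '%s_%s' % (uuid.uuid4(), filename))
    with open(path, 'wb') as f:
        f.write(data)
    conn.execute('INSERT INTO uploads (original_name, path) VALUES (?, ?)',
                 (filename, path))
    conn.commit()
    return path

conn = sqlite3.connect('app.db')
conn.execute('CREATE TABLE IF NOT EXISTS uploads '
             '(id INTEGER PRIMARY KEY, original_name TEXT, path TEXT)')
save_upload(conn, 'data.json', b'{"rows": []}')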

How to copy Elasticsearch data to SQL Server

I want to send data from Kibana (Elasticsearch) to MySQL.
Is there any simple way to do so directly, or is it possible through Python?
I think the whole task can be divided into two parts:
1. How to fetch the data from Elasticsearch (you can do it via Python): https://elasticsearch-py.readthedocs.io/en/master/
2. How to add the data to MySQL (you can do it via Python): https://dev.mysql.com/doc/connector-python/en/connector-python-example-cursor-transaction.html
By the way, you can check this page to find a sample script for getting all documents from one index in ES via Python: https://discuss.elastic.co/t/get-all-documents-from-an-index/86977
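As a rough sketch of gluing the two parts together (the index, table, and column names are hypothetical; adjust hosts and credentials to your setup):

from elasticsearch import Elasticsearch, helpers
import mysql.connector

es = Elasticsearch(['http://localhost:9200'])
db = mysql.connector.connect(user='user', password='secret', database='target_db')
cursor = db.cursor()

# helpers.scan streams every document from the index without manual paging.
for hit in helpers.scan(es, index='my_index', query={'query': {'match_all': {}}}):
    # Map the fields of hit['_source'] onto your own MySQL schema here.
    cursor.execute('INSERT INTO my_table (id, payload) VALUES (%s, %s)',
                   (hit['_id'], str(hit['_source'])))

db.commit()
cursor.close()
db.close()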
What you need is called an ETL. I am not giving an exact answer, since your question is rather general.
You could develop a small Python script to achieve this, but in general it is more useful to use a real ETL.
I recommend Apache Spark, with the official elasticsearch-hadoop plugin:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#write-data-to-jdbc
Example in Scala (but you could use Python, Java, or R):
val df = sqlContext.read.format("org.elasticsearch.spark.sql").load("spark/trips")
df.write.jdbc(jdbcUrl, "_table_")
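And roughly the same pipeline in PySpark, since Python works too (the JDBC URL and credentials are placeholders, and you still need the elasticsearch-hadoop jar on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('es-to-sql').getOrCreate()

# Read the Elasticsearch index into a DataFrame, one partition per shard.
df = (spark.read
      .format('org.elasticsearch.spark.sql')
      .load('spark/trips'))

# Write the DataFrame out over JDBC; URL and table name are placeholders.
jdbc_url = 'jdbc:mysql://host:3306/target_db'
df.write.jdbc(jdbc_url, '_table_',
              properties={'user': 'user', 'password': 'secret'})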
The benefits:
- Spark will distribute the work across workers (it will read all Elasticsearch shards at the same time!)
- Handles failover
- Lets you modify the data

Multiple input sources for MSSQL Server 2017 Python analytical services

I am currently porting some code from Spark to MSSQL Analytical Services with Python. Everything is nice and dandy, but I am not sure whether my solution is the correct one for providing multiple inputs to the scripts.
Consider the following code snippet:
DROP PROCEDURE IF EXISTS SampleModel;
GO
CREATE PROCEDURE SampleModel
AS
BEGIN
    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'
import sys
sys.path.append(r"C:\path\to\custom\package")  # raw string so the backslashes survive
from super_package.sample_model import run_model
OutputDataSet = run_model()'
    WITH RESULT SETS ((Score float));
END
GO
INSERT INTO [dbo].[SampleModelPredictions] (prediction) EXEC [dbo].[SampleModel]
GO
I have a custom package called super_package and a sample model called sample_model. Since this model uses multiple database tables as input, and I would rather have everything in one place, I have a module which connects to the database and fetches the data directly:
# requires Microsoft's revoscalepy package
from revoscalepy import RxSqlServerData, rx_data_step

def go_go_get_data(query, config):
    return rx_data_step(RxSqlServerData(
        sql_query=query,
        connection_string=config.connection_string,
        user=config.user,
        password=config.password))
Inside the run_model() function I fetch all the necessary data from the database with the go_go_get_data function.
If the data is too big to handle in one go, I do some pagination.
In general I cannot join the tables, so passing everything in as a single input result set doesn't work.
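For context, a sketch of how run_model() combines these pieces (the queries, columns, and scoring step here are placeholders, not the real model):

import pandas as pd

def run_model():
    # Each input table is fetched separately, since they cannot be joined in SQL.
    trips = go_go_get_data('SELECT trip_id, fare FROM dbo.Trips', config)
    tips = go_go_get_data('SELECT trip_id, tip FROM dbo.Tips', config)
    # Combine and score in Python instead; rx_data_step returns pandas DataFrames.
    scored = pd.merge(trips, tips, on='trip_id')
    scored['Score'] = scored['fare'] + scored['tip']  # stand-in for the real scoring
    return scored[['Score']]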
The question is: Is this the right approach to tackle this problem? Or did I miss something? For now this works, but as I am still in the development/tryout phase I cannot be certain that it will scale. I would rather use parameters to the stored procedure than fetch inside the Python context.
As you've already figured out, sp_execute_external_script only allows one result set to be passed in. :-(
You can certainly query from inside the script to fetch data as long as your script is okay with the fact that it's not executing under the current SQL session's user's permissions.
If pagination is important and one data set is significantly larger than the others and you're using Enterprise Edition, you might consider passing the largest data set into the script in chunks using sp_execute_external_script's streaming feature.
If you'd like all of your data to be assembled in SQL Server (vs. fetched by queries in your script), you could try to serialize the result sets and then pass them in as parameters (link describes how to do this in R but something similar should be possible with Python).

Importing a CSV file into a PostgreSQL DB using Python-Django

Note: Scroll down to the Background section for useful details. In the following illustration, assume the project uses Python-Django and South.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far, two approaches have been considered:
ETL-API Approach: Providing an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parsing the CSV-to-Database table/field mappings, parsing the spreadsheet, and generating SQL inserts in "join-order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in sharepoint into databases. We're near completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative, though it isn't, with a DB modeling UI, CSV import/export functionality, customizable browse, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternate approaches.
How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
Does that help any?
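A minimal sketch of that split, using plain sqlite3 rather than Django's ORM (the column positions and table schema are guesses based on the sample CSV):

import csv
import sqlite3

class Person(object):
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    @classmethod
    def from_csv_row(cls, row):
        # CSV-to-Person: only this method knows the CSV layout.
        return cls(first_name=row[0], last_name=row[1])

    def save(self, conn):
        # Person-to-Database: only this method knows the schema.
        conn.execute('INSERT INTO person (first_name, last_name) VALUES (?, ?)',
                     (self.first_name, self.last_name))

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE person (first_name TEXT, last_name TEXT)')
with open('people.csv') as f:
    for row in csv.reader(f):
        Person.from_csv_row(row).save(conn)
conn.commit()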
I write import sub-systems almost every month at work, and since I do that kind of task so much, I wrote django-data-importer some time ago. This importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into a list of dicts, iterate over it with a for loop, and save the lines to the DB.
With the importer you can do the same, but with the bonus of validating each field of each line, logging errors and actions, and saving it all at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and help you with processing any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, and you can validate lines, save them, and log errors. With some effort you can even present the progress of the task to the users who uploaded the sheet.
I ended up taking a few steps back to address this problem, per Occam's razor, using updatable SQL views. It meant a few sacrifices:
- Removing the South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
- Defining models.py and an initial South migration by hand
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
- Define unmanaged models in models.py for each spreadsheet
- Map those to updatable SQL views (INSERT/UPDATE-INSTEAD SQL RULEs) in the initial South migration, adhering to the spreadsheet field layout
- Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>);
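A minimal sketch of that arrangement (the model fields and view SQL are guesses based on the sample CSV, and the INSTEAD rule body is abbreviated; the real one would populate all three tables in join order):

# models.py -- unmanaged model mirroring the spreadsheet layout
from django.db import models

class AccountSheet(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()
    account = models.TextField()
    account_type = models.TextField()

    class Meta:
        managed = False                  # Django never creates or migrates this
        db_table = 'account_sheet_view'  # backed by an updatable view, not a table

# SQL issued from the initial South migration (abbreviated):
CREATE_VIEW_SQL = """
CREATE VIEW account_sheet_view AS
    SELECT p.first_name, p.last_name, a.name AS account, t.name AS account_type
    FROM person p
    JOIN account a ON a.person_id = p.id
    JOIN account_type t ON t.id = a.type_id;

CREATE RULE account_sheet_insert AS ON INSERT TO account_sheet_view
DO INSTEAD (
    -- populate person/account/account_type here, in join order
    INSERT INTO person (first_name, last_name)
    VALUES (NEW.first_name, NEW.last_name)
);
"""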
Here is another approach that I found on GitHub: basically it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing ALTERs (such as creating primary keys from several existing columns) and for merging/dumps.

Use standard datastore index or build my own

I am running a web app on Google App Engine with Python. My app lets users post topics and respond to them, and the website is basically a collection of these posts categorized onto different pages.
I only have around 200 posts and 30 visitors a day right now, but that is already taking up nearly 20% of my reads and 10% of my writes with the datastore. I am wondering if it is more efficient to use App Engine's built-in get_by_id() function to retrieve posts by their IDs, or if it is better to build my own. For some of the queries I will simply have to use GQL or the built-in query language, because posts are retrieved by more than just an ID, but I wanted to see which is better.
Thanks!
Thanks!
Are you doing efficient caching? (Or any caching at all?)
Also, if you're using that many writes for ~200 posts, it seems like you might have a problem with your models. Have you looked at the Datastore viewer to see how many writes you use per entity?
You might read the docs on exploding indexes; maybe that's part of your problem.
It's way better to use get_by_id(). It finds the exact object and costs way less (it counts as a query with only one entity).
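On the caching suggestion above, a minimal sketch of wrapping get_by_id() with memcache using the old db API (the model fields and cache lifetime are arbitrary choices):

from google.appengine.api import memcache
from google.appengine.ext import db

class Post(db.Model):
    title = db.StringProperty()
    body = db.TextProperty()

def get_post(post_id):
    # Serve repeat views from memcache so they cost no datastore reads.
    cache_key = 'post:%d' % post_id
    post = memcache.get(cache_key)
    if post is None:
        post = Post.get_by_id(post_id)  # single-entity fetch, the cheapest read
        if post is not None:
            memcache.set(cache_key, post, time=600)  # cache for ten minutes
    return post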
I'd suggest using pre-existing code and building around that instead of re-inventing the wheel.
