What are some of the fastest ways to process this? - python

I'm at a bit of a crossroads in my application. I'm using Python/Django, MySQL, and Ubuntu 12.04.
My application will be accessing other applications online, indexing their path structures, and submitting forms. If you imagine this happening for tens or hundreds of accounts, each with one or more domain names, the processing load can get a little out of hand.
My initial thinking was to set up an EC2 environment to distribute the load of accessing all of these paths on each domain across many EC2 instances, each running celery/rabbitmq to distribute the processing load across these instances.
The thing is, I want to store the results of the forms I submit. I read that I would likely need to use a NoSQL db (e.g. Hadoop, Redis, etc).
My question to you all is:
Is there a different way to use celery/rabbitmq with a SQL-db and what are the advantages/disadvantages?
I can see one problem with having to use NoSQL: the learning curve.
Secondly: is there some other way to distribute the (processing) load of several python scripts being run at the same time on multiple ec2 environments?
Thank you.

Is there a different way to use celery/rabbitmq with a SQL-db and what are the advantages/disadvantages? I can see one problem with having to use nosql: the learning curve
Yes.
If you are talking about storing your Django application/model data, you can use it with any SQL type of database as long as you have the Python bindings for it. Most popular SQL databases have Python bindings.
If you are referring to storing task results in a specific backend, there's support for multiple databases/protocols, both SQL and NoSQL. I believe there's no specific advantage or disadvantage between storing the results in SQL (MySQL, Postgres) or NoSQL (Mongo, CouchDB), but that's just my personal opinion and it depends on what type of application you are running. These are some examples you can use for SQL databases (from their docs):
# sqlite (filename)
CELERY_RESULT_BACKEND = 'db+sqlite:///results.sqlite'

# mysql
CELERY_RESULT_BACKEND = 'db+mysql://scott:tiger@localhost/foo'

# postgresql
CELERY_RESULT_BACKEND = 'db+postgresql://scott:tiger@localhost/mydatabase'

# oracle
CELERY_RESULT_BACKEND = 'db+oracle://scott:tiger@127.0.0.1:1521/sidname'
If you are referring to a broker (queuing mechanism), Celery fully supports only RabbitMQ and Redis (other transports are experimental).
Secondly: is there some other way to distribute the (processing) load of several python scripts being run at the same time on multiple ec2 environments?
That's exactly what celery does, you can setup your workers on multiple machines which can be different EC2 instances. Then all you have to do is point their celery installations to the same queues/broker in your configs. If you want redundancy in your broker (RabbitMQ and/or Redis) you should look at setting them up in clustered configs.
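A minimal sketch of that layout, assuming RabbitMQ as the shared broker and MySQL as the result backend (the module name, broker URL, backend URL, and task body below are placeholders):

# tasks.py -- deployed identically on every EC2 worker instance
from celery import Celery

app = Celery(
    'crawler',                                          # placeholder app name
    broker='amqp://user:password@broker-host:5672//',   # the one RabbitMQ all workers share
    backend='db+mysql://user:password@db-host/results'  # task results stored in MySQL
)

@app.task
def index_domain(domain):
    # crawl the domain's paths, submit forms, and return whatever should be persisted
    return {'domain': domain, 'status': 'done'}

Each instance then runs celery -A tasks worker, and any producer can call index_domain.delay('example.com'); RabbitMQ hands the queued tasks out across every connected worker.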

Related

Why do we need airflow hooks?

Doc says:
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators. Ref
But why do we need them?
I want to select data from one Postgres DB and store it in another one. Can I use, for example, the psycopg2 driver inside a Python script run by a PythonOperator, or does Airflow need to know for some reason what exactly I'm doing inside the script, so that I have to use PostgresHook instead of the plain psycopg2 driver?
You should just use PostgresHook. Instead of using psycopg2 like so:
import psycopg2

conn = psycopg2.connect(host='...', dbname='...', user='...', password='...')  # credentials hardcoded in the script
cur = conn.cursor()
cur.execute(query)
data = cur.fetchall()
You can just type:
postgres = PostgresHook('connection_id')
data = postgres.get_pandas_df(query)
This also lets you benefit from Airflow's encrypted connection storage.
So using hooks is cleaner, safer and easier.
While it is possible to just hardcode the connections in your script and run it, the power of hooks is that the connection details live in Airflow and can be edited from within the UI (as connections and variables) instead of in your code.
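For the copy-between-two-Postgres-databases case from the question, a minimal sketch (the connection IDs, query, and target table are placeholders you would define in the Airflow UI; the import path varies by Airflow version):

from airflow.hooks.postgres_hook import PostgresHook

def copy_table(**context):
    src = PostgresHook(postgres_conn_id='source_db')          # placeholder connection IDs
    dst = PostgresHook(postgres_conn_id='target_db')
    rows = src.get_records('SELECT id, name FROM my_table')   # placeholder query
    dst.insert_rows(table='my_table_copy', rows=rows)         # placeholder target table

Wired to a PythonOperator via python_callable=copy_table, neither the credentials nor the hostnames appear in the DAG code.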
Have a look at "Automate AWS Tasks Thanks to Airflow Hooks" to learn a bit more about how to use hooks.

AWS Redshift Data Processing

I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there's so many that I'm a little overwhelmed. Thanks in advance for the assistance!
ETL
Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
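An equivalent sketch in Python with psycopg2, if you would rather schedule it from cron or an orchestrator than shell out to psql (the endpoint, credentials, and SQL are placeholders):

import psycopg2

# Placeholders: point these at your Redshift cluster endpoint and credentials.
conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                        port=5439, dbname='analytics', user='etl_user', password='...')
with conn, conn.cursor() as cur:
    cur.execute('insert into z select a, b from x')  # your transformation SQL
conn.close()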
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
Machine Learning
Yes, you could run code on an EC2 instance. If it is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that do your processing and then self-terminate.
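As a hedged sketch of that pattern, a Lambda handler that starts a pre-built, stopped worker instance (the instance ID is a placeholder; the instance's own startup script would run the job and power the machine off when finished):

import boto3

def handler(event, context):
    # Placeholder instance ID: a stopped EC2 instance whose startup script runs the
    # processing job and shuts the machine down when it is done.
    boto3.client('ec2').start_instances(InstanceIds=['i-0123456789abcdef0'])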
Lots of options, indeed!
The 2 options for running ETL on Redshift:

1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table.
2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full featured than cron jobs.
Basic transforming of existing data
It seems you are a Python developer (you mentioned you are developing a Python-based ML model), so you can do the transformation along these lines:

You can use boto3 (https://aws.amazon.com/sdk-for-python/) to talk to Redshift from any workstation on your LAN (make sure your IP has the proper privileges).

You can write your own Python functions that mimic stored procedures; inside these functions you put your transformation logic.

Alternatively, you can create Python UDFs in Redshift that act like stored procedures. See more here (https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/).

Finally, you can use Windows Task Scheduler / a cron job to schedule your Python scripts with parameters, much like SQL Server Agent jobs do.
Best way to host my python logic
It seems to me you are reading some data from Redshift, creating test and training sets, and finally producing some predicted results (records). If so:
Host the script on any of your servers (LAN) and connect to Redshift using boto3. If you need to transfer a large number of rows over the internet, then an EC2 instance in the same region is an option. Enable the EC2 instance on an ad-hoc basis, complete your job, and disable it; that will be cost-effective. You can do this using an AWS SDK (I have done it with the .NET framework, and I assume boto3 has the same support).
If your result sets are relatively small you can save them directly into the target Redshift table.
If the result sets are larger, save them to CSV (there are several Python libraries for this) and upload the rows into a staging table using the COPY command if you need any intermediate calculation; if not, upload them directly into the target table.
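A hedged sketch of that CSV-then-COPY path (bucket, key, table, and IAM role are placeholders; Redshift's COPY loads from S3, so the file is staged there first):

import boto3
import psycopg2

# Placeholders throughout: adjust bucket, key, table, and the IAM role ARN.
boto3.client('s3').upload_file('scores.csv', 'my-etl-bucket', 'staging/scores.csv')

conn = psycopg2.connect(host='...', port=5439, dbname='...', user='...', password='...')  # same Redshift endpoint as above
with conn, conn.cursor() as cur:
    cur.execute("""
        copy staging_scores
        from 's3://my-etl-bucket/staging/scores.csv'
        iam_role 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        csv;
    """)
conn.close()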
Hope this helps.

A command-line/API tool for tracking history of tables in a database, does it exist or should I go and develop one?

I am currently working on a project where I need to do database synchronization. We have a main database on a server and a webapp on it to interact with the data. But since this data is geographic (complex polygons and some points), it is more convenient and more efficient for the users to have a local database when working on the polygons (we use QGIS), and then upload the changes to the server. But while a user was working locally, it is possible that some points were modified on the server (it is only possible to interact with the points on the server). This is why I need the ability to synchronize the databases.
Having a history of the INSERTs, UPDATEs and DELETEs of the points on the local database, and the same on the server database, should be enough to reconstruct a history of the points and then synchronize.
By the way, we use SpatiaLite for the local databases and PostGIS for the main server database.
I found a bunch of resources on how to do this using triggers on databases:
http://database-programmer.blogspot.com/2008/07/history-tables.html
How to Store Historical Data
...
But I could not find any tool or library for doing this without having to manually write the triggers. For my needs I could absolutely do it manually, but I feel like it is also something that could be made easier and more convenient with a dedicated command-line/API tool. The tool would, for instance, generate history tables and triggers for the tables where the user wants to track history, and we could also imagine different options such as:
Which columns do we want to track?
Do we only want to track the actions, or also the values?
...
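For illustration only, a sketch of the kind of PostgreSQL DDL such a generator might emit (the table and primary-key names are hypothetical, and the generated trigger only records the action and row id):

def history_ddl(table, pk='id'):
    # Hypothetical generator: history table + trigger for one tracked table.
    return f"""
    CREATE TABLE {table}_history (
        hid      serial PRIMARY KEY,
        action   text NOT NULL,
        row_id   integer NOT NULL,
        happened timestamptz NOT NULL DEFAULT now()
    );
    CREATE OR REPLACE FUNCTION {table}_track() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO {table}_history (action, row_id) VALUES (TG_OP, OLD.{pk});
        ELSE
            INSERT INTO {table}_history (action, row_id) VALUES (TG_OP, NEW.{pk});
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;
    CREATE TRIGGER {table}_history_trg
        AFTER INSERT OR UPDATE OR DELETE ON {table}
        FOR EACH ROW EXECUTE PROCEDURE {table}_track();
    """

print(history_ddl('points'))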
So, to conclude, my questions are:
Is there any existing tool doing this? I searched and found nothing.
Do you think it would be feasible/relevant to implement such a tool? I was thinking of doing it in Python (since my project is Django-powered) and supporting different backends (right now I need SQLite/SpatiaLite and PostgreSQL/PostGIS)...
Thanks for your answers,
Dim'
Check out GeoGig. GeoGig can track and synchronize geodata from various sources, e.g. PostGIS, Esri shapefiles and SpatiaLite. It implements the typical Git workflow, but on data. You will have a data repository on a server which can be cloned, pulled from, and pushed to from your local workstation.
GeoGig is a young project, still in beta but already powerful and feature-rich, with the ability to merge different commits, create diffs, switch branches, track history and all other typical Git tasks.
An example of a typical GeoGig workflow:
GeoGig has a comfortable command-line interface:
# on http://server, initialize and start the remote repository on port 8182 (default)
geogig init
geogig serve
# on local, clone the remote repository to your machine
geogig clone http://server:8182 your_repository
cd your_repository/
# on local, import in geogig the data you are working on (Postgis)
geogig pg import --schema public --database your_database --user your_user --password your_pass --table your_table
# on local, add the local changes
geogig add
# on local, commit your changes
geogig commit -m "First commit"
# on local, push to the remote repository
geogig push
You could ask Bucardo to do the heavy lifting in terms of multi-master synchronization. Have a look at https://bucardo.org/wiki/Bucardo
They promise they can even synchronize between different types of databases, e.g. PostgreSQL <-> SQLite: http://blog.endpoint.com/2015/08/bucardo-postgres-replication-pgbench.html
I'm not sure about special geospatial capabilities though (synchronizing only regions).
GeoGig is definitely worth a try. You can plug a GeoGig repository directly into GeoServer to serve WMS and to edit features via Web/WFS.
As Wander hinted at, this is not as simple as "having a history of INSERT, UPDATE and DELETE" and keeping them synchronized. There's a lot going on under the hood. There are plenty of DBMS tools for replication/mirroring. Here is one example for PostgreSQL: pgpool.
Thanks for the answers Wander Nauta and David G. I totally agree that, in general, synchronization is not this simple. I should have given more details, but in my case I believed it could be enough because:
The local data is always a subset of the server data, and each user is assigned a subset. So there is always only one person working offline on a given subset.
On the server, the users can only modify/delete the data they created.
To give more information on the context: each user is locally digitizing an area in a district from aerial images. Each user is assigned a district to digitize and is able to upload his work to the server. On the server, through a webapp, the users can consult everyone's work, post problem points and comment on them, mainly to point out a doubt or an omission in the digitizing. What I want is for users to be able to download a copy of the district they're working on with the points added by their colleagues, solve the problems locally, delete the points, possibly add new doubts, and upload again.
There is not really a master/slave relation between a local database and the server one; each has a specific role. Because of this, I am not sure that replication/mirroring will fit my needs, but maybe I'm wrong? Also, I'd like to avoid going with a too-sophisticated solution that exceeds the needs, and avoid adding too many new dependencies, because the needs won't evolve much.

Search for a key in django.core.cache

I am writing a simple live-chat app using Django. I keep data about chat sessions in a static variable on my Chat class. Locally it works fine.
I have deployed a test version of the app on Heroku, but Heroku is a cloud platform: there is no synchronization of class variables between different threads.
So I decided to use Memcached. But I can't find out whether django.core.cache allows searching for a key in the cache, or iterating through the entire cache to check values. What is the best way to solve this problem?
Memcached only allows you to get/set entries by their keys; you can't iterate over the entries to check something. But if your cache keys are sequential (like sess1, sess2, etc.) you can check for their existence in a loop:
from django.core.cache import cache

for i in range(1000):
    sess = cache.get('sess%s' % i)
    # some logic
But anyway, this seems like a bad design decision. I don't have enough information about what you're doing, but I'd guess that some sort of persistent storage (like a database) would work nicely. You can also consider http://redis.io/, which has more features than Memcached but is still very fast.
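For example, Redis can enumerate keys by pattern, which Memcached cannot; a minimal sketch with the redis-py client (host and key pattern are placeholders):

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)  # placeholder connection details
# scan_iter walks keys matching a pattern without blocking the server the way KEYS does
for key in r.scan_iter(match='sess*'):
    sess = r.get(key)
    # some logic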

Best way to extract data from a FileMaker Pro database in a script?

My job would be easier, or at least less tedious, if I could come up with an automated way (preferably a Python script) to extract useful information from a FileMaker Pro database. I am working on a Linux machine and the FileMaker database is on the same LAN, running on an OS X machine. I can log into the webby interface from my machine.
I'm quite handy with SQL, and if somebody could point me to some FileMaker plug-in that could give me SQL access to the data within FileMaker, I would be pleased as punch. Everything I've found only goes the other way: Having FileMaker get data from SQL sources. Not useful.
It's not my first choice, but I'd use Perl instead of Python if there was a Perl-y solution at hand.
Note: XML/XSLT services (as suggested by some folks) are only available on FM Server, not FM Pro. Otherwise, that would probably be the best solution. ODBC is turning out to be extremely difficult to even get working. There is absolutely zero feedback from FM when you set it up so you have to dig through /var/log/system.log and parse obscure error messages.
Conclusion: I got it working by running a python script locally on the machine that queries the FM database through the ODBC connections. The script is actually a TCPServer that accepts socket connections from other systems on the LAN, runs the queries, and returns the data through the socket connection. I had to do this to bypass the fact that FM Pro only accepts ODBC connections locally (FM server is required for external connections).
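A minimal sketch of that local-bridge idea, assuming the FileMaker ODBC driver and a local DSN are configured on the machine running the script (the DSN name, credentials, and port are placeholders; each connection sends one query and gets the rows back as JSON):

import json
import socketserver

import pyodbc  # assumes the FileMaker ODBC driver and a local DSN are set up

class QueryHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one SQL query per connection, run it against the local FileMaker DSN,
        # and write the rows back as a JSON array.
        query = self.rfile.readline().decode().strip()
        conn = pyodbc.connect('DSN=filemaker_dsn;UID=user;PWD=password')  # placeholders
        rows = [list(row) for row in conn.cursor().execute(query).fetchall()]
        conn.close()
        self.wfile.write(json.dumps(rows, default=str).encode())

if __name__ == '__main__':
    socketserver.TCPServer(('0.0.0.0', 9000), QueryHandler).serve_forever()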
It has been a really long time since I did anything with FileMaker Pro, but I know that it does have capabilities for an ODBC (and JDBC) connection to be made to it (however, I don't know how, or if, that translates to the linux/perl/python world though).
This article shows how to share/expose your FileMaker data via ODBC & JDBC:
Sharing FileMaker Pro data via ODBC or JDBC
From there, if you're able to create an ODBC/JDBC connection you could query out data as needed.
You'll need the FileMaker Pro installation CD to get the drivers. This document details the process for FMP 9 - it is similar for versions 7.x and 8.x as well. Versions 6.x and earlier are completely different and I wouldn't bother trying (xDBC support in those previous versions is "minimal" at best).
FMP 9 supports SQL-92 standard syntax (mostly). Note that rather than querying tables directly you query using the "table occurrence" name which serves as a table alias of sorts. If the data tables are stored in multiple files it is possible to create a single FMP file with table occurrences/aliases pointing to those data tables. There's an "undocumented feature" where such a file must have a table defined in it as well and that table "related" to any other table on the relationships graph (doesn't matter which one) for ODBC access to work. Otherwise your queries will always return no results.
The PDF document details all of the limitations of using the xDBC interface FMP provides. Performance of simple queries is reasonably fast, ymmv. I have found the performance of queries specifying the "LIKE" operator to be less than stellar.
FMP also has an XML/XSLT interface that you can use to query FMP data over an HTTP connection. It also provides a PHP class for accessing and using FMP data in web applications.
If your leaning is to Python, you may be interested in checking out the Python Wrapper for FileMaker. It provides two-way access to FileMaker data via FileMaker's built-in XML services. You can find some quite thorough information on this at:
http://code.google.com/p/pyfilemaker/
