My company has decided to implement a datamart using [Greenplum] and I have the task of figuring out how to go on about it. A ballpark figure of the amount of data to be transferred from the existing [DB2] DB to the Greenplum DB is about 2 TB.
I would like to know :
1) Is the Greenplum DB the same as vanilla [PostgresSQL]? (I've worked on Postgres AS 8.3)
2) Are there any (free) tools available for this task (extract and import)
3) I have some knowledge of Python. Is it feasible, even easy to do this in a resonable amount of time?
I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.
1) Greenplum is not vanilla postgres, but it is similar. It has some new syntax, but in general, is highly consistent.
2) Greenplum itself provides something called "gpfdist" which lets you listen on a port that you specify in order to bring in a file (but the file has to be split up). You want readable external tables. They are quite fast. Syntax looks like this:
CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
'gpfdist://server:port1/path/to/filep1.txt',
'gpfdist://server:port2/path/to/filep2.txt',
'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
If you play to their rules and your data is clean, the loading can be blazing fast.
3) You don't need python to do this, although you could automate it by using python to kick off the gpfdist processes, and then sending a command to psql that creates the external table and loads the data. Depends on what you want to do though.
Many of Greenplum's utilities are written in python and the current DBMS distribution comes with python 2.6.2 installed, including the pygresql module which you can use to work inside the GPDB.
For data transfer into greenplum, I've written python scripts that connect to the source (Oracle) DB using cx_Oracle and then dumping that output either to flat files or named pipes. gpfdist can read from either sort of source and load the data into the system.
Generally, it is really slow if you use SQL insert or merge to import big bulk data.
The recommended way is to use the external tables you define to use file-based, web-based or gpfdist protocol hosted files.
And also greenplum has a utility named gpload, which can be used to define your transferring jobs, like source, output, mode(inert, update or merge).
1) It's not vanilla postgres
2) I have used pentaho data integration with good success in various types of data transfer projects.
It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully.
Also I believe Pentaho support Greenplum specifically though I have no experience of this.
Related
Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will be populating the charts. 1) I can use the same script that exports the data to postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer for instance), then perform calculations on the groups by columns. I could do this for each different slice I want and then export different tables for each model class to postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with python because I have been doing it that way for a while. I also understand that django's queries are little more difficult to implement. However, doing it with python would mean that I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performances and flexibility. With the first approach you get better performances (your schema is taylored for the exact queries you want to run) but lacks flexibility (if you need to add more queries the scheam might not match so well - or even not match at all - in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema), with the second one you (hopefully) have a well normalized schema but one that makes queries much more complex and much more heavy on the database server.
Now the question is: do you really have to choose ? You could also have both the fully normalized data AND the denormalized (pre-processed) data alongside.
As a side note: Django ORM is indeed most of a "80/20" tool - it's designed to make the 80% simple queries super easy (much easier than say SQLAlchemy), and then it becomes a bit of a PITA indeed - but nothing forces you to use django's ORM for everything (you can always drop down to raw sql or use SQLAlchemy alongside).
Oh and yes: your problem is nothing new - you may want to read about OLAP
I am retrieving structured numerical data (float 2-3 decimal spaces) via http requests from a server. The data comes in as sets of numbers which are then converted into an array/list. I want to then store each set of data locally on my computer so that I can further operate on it.
Since there are very many of these data sets which need to be collected, simply writing each data set that comes in to a .txt file does not seem very efficient. On the other hand I am aware that there are various solutions such as mongodb, python to sql interfaces...ect but i'm unsure of which one I should use and which would be the most appropriate and efficient for this scenario.
Also the database that is created must be able to interface and be queried from different languages such as MATLAB.
If you just want to store it somewhere so MATLAB can work with it; pick your choice from the databases supported by matlab and then install the appropriate drivers for Python for that database.
All databases in Python have a standard API (called the dbapi) so there is a uniform way of dealing with databases.
As you haven't told us how to intend to work with this data later on, it is difficult to provide any further specifics.
the idea is that i wish to essentially download all of the data onto
my machine so that i can operate on it locally later (run analytics
and perform certain mathematical operations on it) instead of having
to call it from the server constantly.
For that purpose you can use any storage mechanism from text files to any of the databases supported by MATLAB - as all databases supported by MATLAB are supported by Python.
You can choose to store the data as "text" and then do the numeric calculations on the application side (ie, MATLAB side). Or you can choose to store the data as numbers/float/decimal (depending on the precision you need) and this will allow you to do some calculations on the database side.
If you just want to store it as text and do calculations on the application side then the easiest option is mongodb as it is schema-less. You would be storing the data as JSON - which may be the format that it is being retrieved from the web.
If you wish to take advantages of some math functions or other capabilities (for example, geospatial calculations) then a better choice is a traditional database that you are familiar with. You'll have to create a schema and define the datatypes for each of your incoming data objects; and then store them appropriately in order to take advantage of the database's query features.
I can recommend using a lightweight ORM like peewee which can use a number of SQL databases as the storage method. Then it becomes a matter of choosing the database you want. The simplest database to use is sqlite, but should you decide that's not fast enough switching to another database like PostgreSQL or MySQL is trivial.
The advantage of the ORM is that you can use Python syntax to interact with the SQL database and don't have to learn any SQL.
Have you considered HDF5? It's very efficient for numerical data, and is supported by both Python and Matlab.
I am building a database interface using Python's peewee module. I am trying to figure out how to insert data into an existing database where I do not know the schema.
My idea is to use playhouse.reflection.Introspector to find out the database schema, then use that information to create class objects which can then be inserted into the existing database.
So far I've gotten to:
introspector = Introspector.from_database(database)
models = introspector.generate_models()
I'm don't know where to go from there.
1) Can I create database objects in this manner? What is the next step?
2) Is there an easier way to do this?
peewee includes an introspection tool called pwiz that can (basically) introspect a database and produce model definitions. It is run as a command line script and dumps the model definitions to stdout, so invokation is like any other unix tool. Here is an example from the docs:
python -m pwiz -e postgresql my_postgres_db > mymodels.py
From there edit mymodels.py to get what you need.
You could do this on the fly, but it would require a few steps and is hackish (not to mention pointless if you really don't know anything about the schema):
Run pwiz as an os command
Read it to pick out the model names
Import whatever you find
BUT
If you really don't know the schema to start with then you have no idea what the semantics of the database are anyway, which means whatever you find is literally meaningless. Unless you at least know some schema/table/column names you are hunting for (in which case you do know something about the schema) there isn't really much you can do with regard to inserting data (not in a sane way), though you could certainly dump data from the db. But if you just wanted a database dump then pg_dump would have been easier.
I suspect this is actually an X-Y problem. What problem is it you are trying to solve by using this technique? What effect is it supposed to achieve within the context of your system?
If you want to create a GUI, check out the sqlite_web project. It uses Peewee to create a web-based SQLite database manager.
I want to be able to add daily info to each object and want to have the ability to delete info x days old easily. With the tables I need to look at the trends and do stuff like selecting objects which match some criteria.
Edit: I asked this because I'm not able to think of a way to implement deleting old data easily because you cannot delete tables in sqlite
Using sqlite would it be the best option, is file based, easy to use, you can use Lookups with SQL and it's builtin on python you don't need to install anything.
→ http://docs.python.org/library/sqlite3.html
If your question means that you are just going to be using "table like data" but not bound to a db, look into using this python modul: Module for table like snytax
If you are going to be binding to a back end, and not* distributing your data among computers, then SQLite is the way to go.
A "proper" database would probably be the way to go. If your application only runs on one computer and the database doesn't get to big, sqlite is good and easy to use with python (standard module sqlite3, see the Library Reference for more information)
take a look at the sqlite3 module, it lets you create a single-file database (no server to setup) that will let you perform sql queries. It's part of the standard library in python, so you don't need to install anythin additional.
http://docs.python.org/library/sqlite3.html
What would be the best way to import multi-million record csv files into django.
Currently using python csv module, it takes 2-4 days for it process 1 million record file. It does some checking if the record already exists, and few others.
Can this process be achieved to execute in few hours.
Can memcache be used somehow.
Update: There are django ManyToManyField fields that get processed as well. How will these used with direct load.
I'm not sure about your case, but we had similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we took a radical strategy change and did the import(only) with Java and JDBC (+ some mysql tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACKUP UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned you want to bypass the ORM and go directly to the database. Depending on what type of database you're using you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for mysql you can use the LOAD command. I'm sure there's something similar for Postgres as well.
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into mysql running on a very load end machine in minutes using mysqldump.
Like Craig said, you'd better fill the db directly first.
It implies creating django models that just fits the CSV cells (you can then create better models and scripts to move the data)
Then, db feedding : a tool of choice for doing this is Navicat, you can grab a functional 30 days demo on their site. It allows you to import CSV in MySQL, save the importation profile in XML...
Then I would launch the data control scripts from within Django, and when you're done, migrating your model with South to get what you want or , like I said earlier, create another set of models within your project and use scripts to convert/copy the data.