From MongoDB to PostgreSQL - Django - python

Could anyone shed some light on how to migrate my MongoDB database to PostgreSQL? What tools do I need, and how should I handle primary keys and foreign key relationships?
I had MongoDB set up with Django, but would like to convert it back to PostgreSQL.

Whether the migration is easy or hard depends on a very large number of things including how many different versions of data structures you have to accommodate. In general you will find it a lot easier if you approach this in stages:
Ensure that all the Mongo data is consistent in structure with your RDBMS model and that the data structure versions are all the same.
Move your data (a rough sketch of this step is shown below). Expect that problems will be found and you will have to go back to step 1.
The primary problems you can expect are data validation problems because you are moving from a less structured data platform to a more structured one.
Depending on what you are doing regarding MapReduce you may have some work there as well.
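To make step 2 concrete, here is a minimal sketch of moving documents over row by row, assuming pymongo and psycopg2 are installed; the database, collection, table, and column names are made up, and the validation is just a placeholder for whatever your step 1 requires:

    # Copy MongoDB documents into a PostgreSQL table (hypothetical names).
    import psycopg2
    from pymongo import MongoClient

    mongo = MongoClient("mongodb://localhost:27017")["mydb"]
    pg = psycopg2.connect("dbname=mydb user=postgres")

    with pg, pg.cursor() as cur:
        for doc in mongo["customers"].find():
            # Validate/normalise each document before it hits the stricter schema.
            name = doc.get("name")
            email = doc.get("email")
            if not name or not email:
                continue  # or log it and go back to step 1
            cur.execute(
                "INSERT INTO customers (mongo_id, name, email) VALUES (%s, %s, %s)",
                (str(doc["_id"]), name, email),
            )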

In the meantime, a Postgres Foreign Data Wrapper for MongoDB has emerged (PostgreSQL versions 9.1-9.4). With it, one can set up a view onto MongoDB through PostgreSQL and then handle the data as SQL.
This would probably also make copying the data over rather easy.
Limitations of FDW that I have faced:
objects within arrays (in MongoDB) do not seem to be addressable
objects with dynamic key names do not seem to be addressable
I know it's 2015 now. :)
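For illustration, here is a rough sketch of the FDW route driven from Python with psycopg2. The extension name (mongo_fdw is assumed here), the server options, and the column mapping all depend on the FDW build you install, so treat every identifier below as a placeholder:

    # Expose a MongoDB collection as a foreign table, then copy it with plain SQL.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS mongo_fdw")
        cur.execute("""
            CREATE SERVER mongo_server
                FOREIGN DATA WRAPPER mongo_fdw
                OPTIONS (address '127.0.0.1', port '27017')
        """)
        cur.execute("""
            CREATE FOREIGN TABLE mongo_customers (
                _id       name,   -- mongo_fdw typically wants the document id first
                full_name text,
                email     text
            )
            SERVER mongo_server
            OPTIONS (database 'mydb', collection 'customers')
        """)
        # Once the foreign table exists, copying into a native table is plain SQL:
        cur.execute(
            "INSERT INTO customers (name, email) "
            "SELECT full_name, email FROM mongo_customers"
        )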

Related

Storing large dataset of tweets: Text files vs Database

I have collected a large Twitter dataset (>150GB) that is stored in some text files. Currently I retrieve and manipulate the data using custom Python scripts, but I am wondering whether it would make sense to use a database technology to store and query this dataset, especially given its size. If anybody has experience handling twitter datasets of this size, please share your experiences, especially if you have any suggestions as to what database technology to use and how long the import might take. Thank you
I recommend using a database for this, especially considering its size (and this is without knowing anything about what the dataset holds). That being said, for this and future questions of this nature, I'd suggest the Software Recommendations site, along with more detail about what the dataset looks like.
As for recommending a specific database, do some research into what each one offers, but for something that just holds data with no relations, any of them will do and should show a big query improvement over plain text files: queries can be cached, and data is faster to retrieve because of how databases store and look up records, whether by hashed values or whatever they use.
Some popular databases:
MySQL, PostgreSQL - relational databases (simple, fast, and easy to use/set up, but they require some knowledge of SQL)
MongoDB - NoSQL database (also easy to use and set up, with no SQL needed; you access the DB through the API using dicts. It is also memory-mapped, so it can be faster than a relational database, but you need enough RAM for the indexes.)
ZODB - pure-Python NoSQL database (kind of like MongoDB, but written in Python)
These are very brief descriptions of each DB; be sure to do your own research before using them, as they each have their pros and cons. Also, remember these are just a few of the many popular and widely used databases; there are also TinyDB, SQLite (which ships with Python), and PickleDB, which are pure Python but generally meant for small applications.
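As a rough illustration of the relational route, here is a sketch using the sqlite3 module that ships with Python, assuming the tweets sit in newline-delimited JSON files; the file, table, and column names are made up:

    # Load tweets from a JSON-lines dump into an indexed SQLite table.
    import json
    import sqlite3

    conn = sqlite3.connect("tweets.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "  id INTEGER PRIMARY KEY, user TEXT, text TEXT, created_at TEXT)"
    )

    with open("tweets.jsonl") as fh, conn:
        rows = (
            (t["id"], t["user"]["screen_name"], t["text"], t["created_at"])
            for t in (json.loads(line) for line in fh)
        )
        conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", rows)

    # Lookups then hit an index instead of scanning 150 GB of text files:
    count = conn.execute(
        "SELECT COUNT(*) FROM tweets WHERE user = ?", ("some_user",)
    ).fetchone()
    print(count)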
My experience is mainly with PostgreSQL, TinyDB, and MongoDB, my favorites being MongoDB and PostgreSQL. For you, I'd look at either of those, but don't limit yourself: there's a slew of databases out there, plus many drivers that help you write easier/less code if that's what you want. Remember, Google is your friend! And welcome to Stack Overflow!
Edit
If your dataset is, and will remain, fairly simple but large, and you want to stick with text files, consider pandas together with a JSON or CSV format and library. It can greatly increase efficiency when querying and managing this kind of data from text files, with lower memory usage, since it won't always (or ever) need the entire dataset in memory.
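If you do stay with flat files, here is a sketch of that chunked-pandas idea, assuming newline-delimited JSON (the file name and columns are hypothetical):

    # Process a large tweet dump in chunks so it never has to fit in memory.
    import pandas as pd

    matches = []
    for chunk in pd.read_json("tweets.jsonl", lines=True, chunksize=100_000):
        subset = chunk[["id", "text"]]
        # Keep only the rows of interest from each chunk.
        matches.append(
            subset[subset["text"].str.contains("python", case=False, na=False)]
        )

    result = pd.concat(matches, ignore_index=True)
    print(len(result))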
You can try using any NoSQL DB. MongoDB would be a good place to start.

Recommendation for manipulating data with python vs. SQL for a django app

Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will be populating the charts. 1) I can use the same script that exports the data to postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer for instance), then perform calculations on the groups by columns. I could do this for each different slice I want and then export different tables for each model class to postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean that I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored to the exact queries you want to run) but lose flexibility: if you need to add more queries, the schema might not match so well, or even not match at all, in which case you'll have to repopulate the database, possibly from the raw sources, with an updated schema. With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also keep both the fully normalized data AND the denormalized (pre-processed) data side by side.
As a side note: Django's ORM is indeed mostly an "80/20" tool - it's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and past that it does become a bit of a PITA - but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
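For illustration, a hedged sketch of that mixed approach: keep the normalized tables in Postgres, and drop to raw SQL for the heavy grouped aggregation that feeds a chart, returning a DataFrame for the view layer. The table and column names (sales_partsale, quantity, unit_price) are hypothetical:

    # Run a grouped aggregate with raw SQL through Django's connection.
    import pandas as pd
    from django.db import connection

    def revenue_by_customer():
        query = """
            SELECT customer_id, SUM(quantity * unit_price) AS revenue
            FROM sales_partsale
            GROUP BY customer_id
            ORDER BY revenue DESC
        """
        with connection.cursor() as cur:
            cur.execute(query)
            columns = [col[0] for col in cur.description]
            # A DataFrame is convenient to hand straight to the charting layer.
            return pd.DataFrame(cur.fetchall(), columns=columns)

The same aggregate can also be expressed through the ORM (a values() plus annotate(Sum(...)) query) once you are comfortable with it; the point is that the two styles coexist in one app.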
Oh, and yes: your problem is nothing new - you may want to read about OLAP.

Python, computationally efficient data storage methods

I am retrieving structured numerical data (floats with 2-3 decimal places) via HTTP requests from a server. The data comes in as sets of numbers which are then converted into an array/list. I want to then store each set of data locally on my computer so that I can operate on it further.
Since there are very many of these data sets to collect, simply writing each incoming data set to a .txt file does not seem very efficient. On the other hand, I am aware that there are various solutions such as MongoDB, Python-to-SQL interfaces, etc., but I'm unsure which one would be the most appropriate and efficient for this scenario.
Also the database that is created must be able to interface and be queried from different languages such as MATLAB.
If you just want to store it somewhere so MATLAB can work with it, pick one of the databases supported by MATLAB and then install the appropriate Python drivers for that database.
Database drivers in Python follow a standard API (the DB-API), so there is a uniform way of dealing with databases.
As you haven't told us how you intend to work with this data later on, it is difficult to provide any further specifics.
the idea is that I wish to essentially download all of the data onto my machine so that I can operate on it locally later (run analytics and perform certain mathematical operations on it) instead of having to call it from the server constantly.
For that purpose you can use any storage mechanism from text files to any of the databases supported by MATLAB - as all databases supported by MATLAB are supported by Python.
You can choose to store the data as "text" and then do the numeric calculations on the application side (ie, MATLAB side). Or you can choose to store the data as numbers/float/decimal (depending on the precision you need) and this will allow you to do some calculations on the database side.
If you just want to store it as text and do calculations on the application side, then the easiest option is MongoDB, as it is schema-less. You would be storing the data as JSON - which may well be the format in which it is retrieved from the web.
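For example (a sketch; the connection string, database, and collection names are placeholders):

    # Store each incoming numeric set as one MongoDB document.
    from pymongo import MongoClient

    sets = MongoClient("mongodb://localhost:27017")["readings"]["sets"]
    sets.insert_one({"source": "endpoint_1", "values": [1.25, 3.14, 2.72]})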
If you wish to take advantage of math functions or other capabilities (for example, geospatial calculations), then a better choice is a traditional database that you are familiar with. You'll have to create a schema and define the datatypes for each of your incoming data objects, and then store them appropriately in order to take advantage of the database's query features.
I can recommend using a lightweight ORM like peewee which can use a number of SQL databases as the storage method. Then it becomes a matter of choosing the database you want. The simplest database to use is sqlite, but should you decide that's not fast enough switching to another database like PostgreSQL or MySQL is trivial.
The advantage of the ORM is that you can use Python syntax to interact with the SQL database and don't have to learn any SQL.
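A minimal peewee sketch along those lines; the model and field names are made up, and the incoming data is assumed to arrive as lists of floats:

    # Define a simple peewee model and bulk-insert one numeric series.
    from peewee import SqliteDatabase, Model, IntegerField, FloatField

    db = SqliteDatabase("readings.db")

    class Sample(Model):
        series_id = IntegerField(index=True)
        position = IntegerField()
        value = FloatField()

        class Meta:
            database = db

    db.connect()
    db.create_tables([Sample])

    def store_series(series_id, values):
        # One row per number keeps the data queryable from SQL (and from MATLAB).
        rows = [{"series_id": series_id, "position": i, "value": v}
                for i, v in enumerate(values)]
        with db.atomic():
            Sample.insert_many(rows).execute()

    store_series(1, [1.25, 3.14, 2.72])

Switching to PostgreSQL or MySQL later would only mean swapping SqliteDatabase for PostgresqlDatabase or MySQLDatabase on the first line.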
Have you considered HDF5? It's very efficient for numerical data, and is supported by both Python and MATLAB.
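A small sketch with h5py (the file and dataset names are arbitrary); MATLAB can then read the same file directly, e.g. with h5read('readings.h5', '/series_1'):

    # Write one numeric series as a compressed HDF5 dataset.
    import h5py
    import numpy as np

    values = np.array([1.25, 3.14, 2.72, 0.5])
    with h5py.File("readings.h5", "a") as f:
        f.create_dataset("series_1", data=values, compression="gzip")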

Pandas as fast data storage for Flask application

I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but the import (coming from CSV files) is slow, tedious and error prone, and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
The question is:
(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, as DataFrames are, or should be, stateless and immutable, which is taken care of on the application side.) What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent DataFrames)? Flask (or Werkzeug) has a SimpleCache class, but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache), and can concurrent (possibly different-process) requests access the same cache)?
I realise these are many questions, but before I invest more time and build a proof-of-concept, I thought I get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables only in your SQL database, only as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
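For example (a sketch; the connection string, table, and key names are made up):

    # Pull a table into a DataFrame via read_sql and park it in an HDF5 store.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost/mydb")
    df = pd.read_sql("SELECT * FROM measurements", engine)
    df.to_hdf("store.h5", key="measurements", format="table")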
If you don't need any relational operations, then I would say ditch the Postgres server; if it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, they will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a Redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or an HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
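A hedged sketch of that idea using Flask-Caching (the maintained successor of Flask-Cache); the backend configuration, file name, and route are placeholders:

    # Memoize the DataFrame-loading function so the HDF5 file is read only on cache misses.
    import pandas as pd
    from flask import Flask
    from flask_caching import Cache

    app = Flask(__name__)
    cache = Cache(app, config={"CACHE_TYPE": "RedisCache"})

    @cache.memoize(timeout=3600)
    def load_frame(name):
        # The result is pickled into the cache backend and shared across workers.
        return pd.read_hdf("data.h5", key=name)

    @app.route("/table/<name>")
    def show_table(name):
        return load_frame(name).head(100).to_html()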
You could, of course, also generate a global variable, for example, where you create the Flask app and just import that wherever it's needed. I have not tried this and would thus not recommend it. It might cause all sorts of concurrency issues.

Is it a good practice to use pickled data instead of additional tables?

Many times while designing a database structure, I get stuck on the question of what would be more effective: storing data in pickled format in a column of the same table, or creating an additional table and then using a JOIN.
Which path should be followed? Any advice?
For example:
There is a table of Customers, containing fields like Name, Address
Now, for managing Orders (each customer can have many), you can either create an Order table or store the orders in a serialized format in a separate column of the Customers table itself.
It's usually better to create separate tables. If you go with pickling and later find you want to query the data in a different way, it could be difficult.
See Database normalization.
Usually it's best to keep your data normalized (i.e. create more tables). Storing data 'pickled', as you say, is acceptable when you don't need to perform relational operations on it.
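To make the tradeoff concrete, here is a small sqlite3 sketch of the normalized version of the question's Customers/Orders example (column names are made up). With a separate orders table, "all orders above 100" is a single query; with a pickled list in a Customers column you would have to load and unpickle every row in Python:

    # Normalized Customers/Orders schema with a JOIN query.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            total REAL
        );
    """)
    conn.execute("INSERT INTO customers VALUES (1, 'Alice', '1 Main St')")
    conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                     [(1, 50.0), (1, 120.0)])

    rows = conn.execute("""
        SELECT c.name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id
        WHERE o.total > 100
    """).fetchall()
    print(rows)  # [('Alice', 120.0)]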
Mixing SQL databases and pickling seems to be asking for trouble. I'd go with either sticking all data in the SQL database or using only pickling, in the form of the ZODB, which is a Python-only OO database that is pretty damn awesome.
Mixing makes sense sometimes, but is usually just more trouble than it's worth.
I agree with Mchi: there is no problem storing "pickled" data if you don't need to search it or do relational-type operations on it.
Denormalisation is also an important tool that can scale up database performance when applied correctly.
It's probably a better idea to use JSON instead of pickles. It only uses a little more space, and makes it possible to use the database from languages other than Python.
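A small illustration of the difference (the order data is made up):

    # JSON is readable from any language; pickle is Python-only.
    import json
    import pickle

    orders = [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]
    as_json = json.dumps(orders)      # readable from any language that can reach the DB
    as_pickle = pickle.dumps(orders)  # only readable from Python
    print(len(as_json), len(as_pickle))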
I agree with @Lennart Regebro. You should probably see whether you need a relational DB or an OODB. If an RDBMS is your choice, I would suggest you stick with more tables. IMHO, pickling may have issues with scalability. If that's what you want, you should look at ZODB. It is pretty good and supports caching etc. for better performance.
