I'm trying to build a simple time series database with Prometheus. I'm working with financial time series data and need somewhere to store it that I can quickly access via Python. I'm loading the data from XML or .csv files, so this isn't some crazy "lots of data in and out at the same time" kind of project. I'm the only user, maybe with a couple of others in time, and I just want something that's easy to load data into and pull out of.
I was hoping for some guidance on how to do this. A few questions:
1) Is it simple to pull data from a Prometheus database via Python?
2) I wanted to run this all locally off my Windows machine; is that doable?
3) Am I completely overengineering this? (My worry with SQL is that it would be a mess to work with, since these are large time series data sets.)
Thanks
Prometheus is intended primarily for operational monitoring. While you may be able to get something working, Prometheus doesn't, for example, support bulk loading of data.
1) Is it simple to pull data from a Prometheus database via Python?
The HTTP API should be easy to use.
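For example, a range query against the HTTP API needs nothing more than requests; something along these lines (the server address and metric name are placeholders):

import requests

PROMETHEUS = "http://localhost:9090"  # placeholder server address

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={
        "query": "my_price_metric",        # hypothetical metric name
        "start": "2023-01-01T00:00:00Z",
        "end": "2023-01-02T00:00:00Z",
        "step": "1h",
    },
)
resp.raise_for_status()

# Each result carries a "values" list of [timestamp, value] pairs
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:3])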
2) I wanted to run this all locally off my Windows machine; is that doable?
That should work.
3) Am I completely overengineering this? (My worry with SQL is that it would be a mess to work with, since these are large time series data sets.)
I'd say rather that Prometheus is probably not the right tool for the job here. Up to, say, 100GB, I'd consider a SQL database a good starting point.
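As a rough illustration of how little is involved, loading a CSV into SQLite and pulling a slice back with pandas could look like this (the file, table, and column names are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect("prices.db")  # a single local file, no server required

# Hypothetical CSV with columns: timestamp, symbol, price
df = pd.read_csv("prices.csv", parse_dates=["timestamp"])
df.to_sql("prices", conn, if_exists="append", index=False)

# Pull a slice back out for analysis
result = pd.read_sql("SELECT * FROM prices WHERE symbol = 'ABC' ORDER BY timestamp",
                     conn, parse_dates=["timestamp"])
conn.close()

If you later outgrow SQLite, the same to_sql/read_sql calls work against Postgres through an SQLAlchemy engine.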
Related
I have a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to serve as a data lake for analytics. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selected columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
One might ask about using dedicated services like Data Migration or Copy activity provided by AWS, but I can't use them since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned it doesn't support JDBC as a source in Structured Streaming. I have read about multi-threading and multiprocessing techniques in Python but still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes; a rough sketch is below). The data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and may be missing a few numbers where rows were removed due to inactivity of the corresponding entity (say, customers). I don't need to pull every column of a record, only a few that would be predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now I'm quite confused about which approach to use and how to implement the solution once decided. Hence I'm seeking the help of people who have dealt with, or know of, a solution to this problem. I'm happy to provide more info if it is needed to get to the right solution. Any help would be greatly appreciated.
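To make the daemon-thread idea above more concrete, here is roughly what I have in mind (the connection details, schema/table/column names, and bucket are placeholders, and this only catches new rows via the auto-increment ID, not updates or deletes):

import threading
import time

import boto3
import pymysql  # assuming the Aurora instance is MySQL-compatible

CYCLE_SECONDS = 300  # the 5-minute cycle mentioned above
s3 = boto3.client("s3")

def poll_schema(schema, tables_and_columns):
    # Continuously pull the configured columns from one schema and push them to S3.
    # Watermarking on the auto-increment ID only catches new rows, not updates/deletes.
    last_seen_id = {table: 0 for table in tables_and_columns}
    while True:
        conn = pymysql.connect(host="aurora-endpoint", user="user",
                               password="password", database=schema)
        with conn.cursor() as cur:
            for table, columns in tables_and_columns.items():
                cur.execute(
                    f"SELECT id, {', '.join(columns)} FROM {table} WHERE id > %s",
                    (last_seen_id[table],),
                )
                rows = cur.fetchall()
                if rows:
                    last_seen_id[table] = max(row[0] for row in rows)
                    body = "\n".join(",".join(str(v) for v in row) for row in rows)
                    s3.put_object(Bucket="my-datalake",
                                  Key=f"{schema}/{table}/{int(time.time())}.csv",
                                  Body=body)
        conn.close()
        time.sleep(CYCLE_SECONDS)

# One daemon thread per schema, each with its own column configuration
for schema in ["schema_a", "schema_b"]:
    threading.Thread(target=poll_schema,
                     args=(schema, {"customers": ["name", "status"]}),
                     daemon=True).start()

# Keep the main thread alive so the daemon threads keep running
while True:
    time.sleep(60)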
Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will populate the charts. 1) I can use the same script that exports the data to Postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer, for instance) and then perform column-wise calculations on the groups. I could do this for each different slice I want and then export a different table for each model class to Postgres.
2) I can upload the entire database to Postgres and manipulate it later with Django queries that produce SQL.
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean that I will need more tables (because I will have grouped the data in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
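To illustrate option 1, the kind of pre-aggregation I do today looks roughly like this (the connection string, file, table, and column names are just examples):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder DSN

raw = pd.read_csv("part_sales.csv")  # raw client data; columns are examples

# Group by customer and pre-compute the aggregates the chart needs
by_customer = (raw.groupby("customer")
                  .agg(total_sales=("sale_amount", "sum"),
                       order_count=("order_id", "nunique"))
                  .reset_index())

# One table per pre-computed slice
by_customer.to_sql("sales_by_customer", engine, if_exists="replace", index=False)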
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored for the exact queries you want to run) but less flexibility (if you need to add more queries, the schema might not match so well, or even not match at all, in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema). With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also keep both the fully normalized data AND the denormalized (pre-processed) data side by side.
As a side note: the Django ORM is indeed more of an "80/20" tool. It's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and then it becomes a bit of a PITA for the rest, but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
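To give you an idea, a grouped aggregate that is trivial in the ORM, and the raw-SQL escape hatch for anything beyond it, could look like this (the model, table, and field names are hypothetical):

from django.db import connection
from django.db.models import Count, Sum

from myapp.models import PartSale  # hypothetical model

# The "80%" case: group by customer and aggregate, which is easy in the ORM
totals = (PartSale.objects
          .values("customer")
          .annotate(total=Sum("amount"), orders=Count("id")))

# The "20%" case: drop down to raw SQL when the ORM gets in the way
with connection.cursor() as cur:
    cur.execute("""
        SELECT customer, date_trunc('month', sold_at) AS month, SUM(amount) AS total
        FROM myapp_partsale
        GROUP BY customer, date_trunc('month', sold_at)
        ORDER BY month
    """)
    rows = cur.fetchall()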
Oh, and yes: your problem is nothing new; you may want to read about OLAP.
I have a Django-based application with an Oracle backend. I want to do analysis of the application's data in R. What can I do?
I would like to avoid directly querying the database because there are several aspects of our Django models that make it hard to understand the resulting database schema and would make the SQL very complicated.
I would also like to avoid writing a separate Python script to manually export data to a file and then load that file into R because these separate steps would slow down the analysis and iteration process.
My ideal would be some interface that would allow me to write Django queries directly in R. As far as I can tell, the only option for this is rPython, and that would be tricky to set up with the necessary Django/Python environment variables et al. (right?). Are there any other ways this direct interface could be possible?
I want to get the data into R because: 1) there are some statistical R packages that aren't well implemented in Python, 2) I am quicker at transforming data in R than in Python, and 3) I need to plot the results, and I find it easier to make ggplot2 plots look nice than matplotlib plots.
I have a Python Flask app I'm writing, and I'm about to start on the backend. The main part of it involves users POSTing data to the backend, usually a small piece of data every second or so, to be retrieved later by other users. The data will always be retrieved within an hour, and could be retrieved in as little as a minute. I need a database or storage solution that can constantly take in and store the data, purge all data that has been retrieved, and also purge data that has been in storage for longer than an hour.
I do not need any relational system; JSON/key-value should be able to handle both incoming and outgoing data. Also, there will be very constant reading, writing, and deleting.
Should I go with something like MongoDB? Should I use a database system at all, or instead write to a directory full of .json files constantly, or something? (Using only files is probably a bad idea, but it's kind of the extent of what I need.)
You might look at mongoengine; we use it in production with Flask (there's an extension) and it has suited our needs well. There's also mongoalchemy, which I haven't tried but which seems decently popular.
The downside to using Mongo is that there is no automatic expiry. Having said that, you might take a look at Redis, which has the ability to auto-expire items. There are a few ORMs out there that might suit your needs.
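For example, with redis-py you can set a TTL on every write so anything older than an hour disappears on its own; a minimal sketch (the key scheme and payload fields are just examples):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Store an incoming payload under a per-user key with a one-hour TTL
def store(user_id, payload):
    key = f"data:{user_id}:{payload['ts']}"     # example key scheme; 'ts' is assumed
    r.setex(key, 3600, json.dumps(payload))     # Redis drops the key after 3600 seconds

# Retrieve everything queued for a user and purge it as it is read
def retrieve(user_id):
    items = []
    for key in r.scan_iter(f"data:{user_id}:*"):
        value = r.get(key)
        if value is not None:
            items.append(json.loads(value))
            r.delete(key)
    return items

Retrieval deletes what it reads, and the TTL quietly cleans up anything that is never picked up.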
My company has decided to implement a datamart using Greenplum, and I have the task of figuring out how to go about it. A ballpark figure for the amount of data to be transferred from the existing DB2 DB to the Greenplum DB is about 2 TB.
I would like to know :
1) Is the Greenplum DB the same as vanilla PostgreSQL? (I've worked on Postgres AS 8.3.)
2) Are there any (free) tools available for this task (extract and import)?
3) I have some knowledge of Python. Is it feasible, or even easy, to do this in a reasonable amount of time?
I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.
1) Greenplum is not vanilla Postgres, but it is similar. It has some new syntax, but in general it is highly consistent.
2) Greenplum itself provides something called "gpfdist" which lets you listen on a port that you specify in order to bring in a file (but the file has to be split up). You want readable external tables. They are quite fast. Syntax looks like this:
CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
'gpfdist://server:port1/path/to/filep1.txt',
'gpfdist://server:port2/path/to/filep2.txt',
'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
If you play by their rules and your data is clean, the loading can be blazingly fast.
3) You don't need Python to do this, although you could automate it by using Python to kick off the gpfdist processes and then sending a command to psql that creates the external table and loads the data. It depends on what you want to do, though.
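A rough sketch of that kind of automation, with the directories, ports, and SQL script as placeholders, might be:

import subprocess
import time

# Start one gpfdist listener per file location (directories and ports are placeholders)
gpfdist_procs = [
    subprocess.Popen(["gpfdist", "-d", "/path/to/files", "-p", str(port)])
    for port in (8081, 8082, 8083)
]
time.sleep(2)  # give the listeners a moment to come up

# Run the external-table DDL and the load through psql
# (create_and_load.sql is a hypothetical script holding the SQL shown above)
subprocess.run(["psql", "-d", "mydb", "-f", "create_and_load.sql"], check=True)

# Shut the gpfdist listeners down once the load is done
for proc in gpfdist_procs:
    proc.terminate()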
Many of Greenplum's utilities are written in Python, and the current DBMS distribution comes with Python 2.6.2 installed, including the pygresql module, which you can use to work inside the GPDB.
For data transfer into Greenplum, I've written Python scripts that connect to the source (Oracle) DB using cx_Oracle and then dump the output either to flat files or to named pipes. gpfdist can read from either sort of source and load the data into the system.
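A stripped-down version of such a script might look like this (the connect string, query, and output path are placeholders):

import csv
import cx_Oracle

conn = cx_Oracle.connect("user/password@source-host:1521/ORCL")  # placeholder connect string
cur = conn.cursor()
cur.execute("SELECT id, name, amount FROM source_table")  # hypothetical query

# Write tab-delimited output that a gpfdist-backed external table can read
with open("/path/to/files/source_table.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for row in cur:
        writer.writerow(row)

cur.close()
conn.close()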
Generally, it is really slow if you use SQL INSERT or MERGE statements to import big bulk data.
The recommended way is to use external tables, which you define over file-based, web-based, or gpfdist-protocol-hosted files.
Greenplum also has a utility named gpload, which can be used to define your transfer jobs: source, output, and mode (insert, update, or merge).
1) It's not vanilla Postgres.
2) I have used Pentaho Data Integration with good success in various types of data transfer projects.
It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully.
Also, I believe Pentaho supports Greenplum specifically, though I have no experience of this.