Where am I supposed to store the matching engine data? - python

I'm working a specific algorithm related to matching engine using python and I was wondering where to store the data (orderbook)?
Is there a fast way (read and write) data to a storage rather than the database? taking in the consideration the matching engine has to be fast in reading and writing the data in the stored place.
I tried to save the data in a usual database (postgres) but it seems to be slow in writing, reading and updating

I was involved with a financial matching engine once. The only way we could manage the volume of data was to forego the dbms. Live data was kept in memory, and an append-only log of order book changes was kept in flat files in the local file system. Actual trades were stored in a proper (beefy) db, but they were orders of magnitude fewer than orderbook changes.
On restart, the in-memory order book would be reconstructed from the log data. This was S.L.O.W. In normal operation, there was no need to read from storage. We did sharding to spread orderbooks around to keep the memory requirements manageable, and had multiple parallel instances of each shard to protect against the slow restarts.

Related

Processing large number of JSONs (~12TB) with Databricks

I am looking for guidance/best practice to approach a task. I want to use Azure-Databricks and PySpark.
Task: Load and prepare data so that it can be efficiently/quickly analyzed in the future. The analysis will involve summary statistics, exploratory data analysis and maybe simple ML (regression). Analysis part is not clearly defined yet, so my solution needs flexibility in this area.
Data: session level data (12TB) stored in 100 000 single line JSON files. JSON schema is nested, includes arrays. JSON schema is not uniform but new fields are added over time - data is a time-series.
Overall, the task is to build an infrastructure so the data can be processed efficiently in the future. There will be no new data coming in.
My initial plan was to:
Load data into blob storage
Process data using PySpark
flatten by reading into data frame
save as parquet (alternatives?)
Store in a DB so the data can be quickly queried and analyzed
I am not sure which Azure solution (DB) would work here
Can I skip this step when data is stored in efficient format (e.g. parquet)?
Analyze the data using PySpark by querying it from DB (or from blob storage when in parquet)
Does this sound reasonable? Does anyone has materials/tutorials that follow similar process so I could use them as blueprints for my pipeline?
Yes, it's sound reasonable, and in fact it's quite standard architecture (often referred as lakehouse). Usual implementation approach is following:
JSON data loaded into blob storage are consumed using Databricks Auto Loader that provides efficient way of ingesting only new data (since previous run). You can trigger pipeline regularly, for example, nightly, or run it continuously if data arriving all the time. Auto Loader is also handling schema evolution of input data.
Processed data is better to store as Delta Lake tables that provide better performance than "plain" Parquet due use of additional information in the transaction log so it's possible to efficiently access only necessary data. (Delta Lake is built on top of Parquet, but has more capabilities).
Processed data then could be accessed via Spark code, or via Databricks SQL (it could be more efficient for reporting, etc., as it's heavily optimized for BI workloads). Due the big amount of data, storing them in some "traditional" database may not be very efficient or be very costly.
P.S. I would recommend to look on implementing this with Delta Live Tables that may simplify development of your pipelines.
Also, you may have access to Databricks Academy that has introductory courses about lakehouse architecture and data engineering patterns. If you don't have access to it, you can at least look to Databricks courses published on GitHub.

Pandas as fast data storage for Flask application

I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but the import (coming from csv files) of the data is slow, tedious and error prone and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it's ideal to store it as several pandas DataFrame (stored in hdf5 format and loaded via pytables).
The question is:
(1) Is this a good idea and what are the things to watch out for? (For instance I don't expect concurrency problems as DataFrames are (should?) be stateless and immutable (taken care of from application-side)). What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the hdf5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least the most recent/frequent dataframes). Flask (or werkzeug) has a SimpleCaching class, but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache) and can concurrent (different process?) requests access the same cache?).
I realise these are many questions, but before I invest more time and build a proof-of-concept, I thought I get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled leading to fairly large files. You can also generate pandas DataFrame's directly from SQL queries using read_sql, for example.
If you don't need any relational operations then I would say ditch the postgre server, if it's already set up and you might need that in future keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, it will be handled automatically for you using (Flask-)SQLAlchemy causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than maintaining multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or HDF5 file. Importantly, it also let's you cache templates which may play a role in displaying large tables.
You could, of course, also generate a global variable, for example, where you create the Flask app and just import that wherever it's needed. I have not tried this and would thus not recommend it. It might cause all sorts of concurrency issues.

Read-only python shelve to store {key->value} on google-app-engine?

I'm implementing morphology module for my app on google-app-engine. To do this I need a key-value database that stores a lot of small objects. I also need to perform huge amount of queries. So, I consider shelve module to be perfect solution.
Unfortunately, google-app-engine doesn't allow to use any python built-in databases because it doesn't allow to write into local files. However, I don't need writing, but only reading.
Is there any implementation of read-only db that can be run under google-app-engine.
P.S. I don't consider using google app engine datastore for this purpose because of huge amount (but small sized) of stored objects and because of huge abount of queries.
If you only need a read-only db, then write the data into a file before you deploy your app. You can read from it with your program on appengine, just like you would read a template file. Write the data in sorted order and use a binary search to locate the key(s).
Another option is to put the key-value pairs into memcache. That's very fast and can handle a huge amount of queries.

A good blobstore / memcache solution

Setting up a data warehousing mining project on a Linux cloud server. The primary language is Python .
Would like to use this pattern for querying on data and storing data:
SQL Database - SQL database is used to query on data. However, the SQL database stores only fields that need to be searched on, it does NOT store the "blob" of data itself. Instead it stores a key that references that full "blob" of data in the a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue that we are having is that we would like more frequently accessed blobs of data to be automatically stored in RAM. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first, if it can't find it there, then it goes to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
Thanks so much!
Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest to have a look at Couchbase Server which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs are kept sequential. The blobs are never rewritten, but simply appended at the end of a file (it is fine for an archiving application).
The same approach has been also used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of couchbase 2.0, but it seems to be in pretty active development.
CouchBase already tries to serve results out of RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and already there is a FUSE module written for it.
Since fileststems are effectively the lowest-common-denominator, it should be really easy for you to access it from python, and would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs

Memory usage of file versus database for simple data storage

I'm writing the server for a Javascript app that has a syncing feature. Files and directories being created and modified by the client need to be synced to the server (the same changes made on the client need to be made on the server, including deletes).
Since every file is on the server, I'm debating the need for a MySQL database entry corresponding to each file. The following information needs to be kept on each file/directory for every user:
Whether it was deleted or not (since deletes need to be synced to other clients)
The timestamp of when every file was last modified (so I know whether the file needs updating by the client or not)
I could keep both of those pieces of information in files (e.g. .deleted file and .modified file in every user's directory containing file paths + timestamps in the latter) or in the database.
However, I also have to fit under an 80mb memory constraint. Between file storage and
database storage, which would be more memory-efficient for this purpose?
Edit: Files have to be stored on the filesystem (not in a database), and users have a quota for the storage space they can use.
Probably the filesystem variant will be more efficient memory wise as long as the number of files is low, but that solution probably won't scale. Databases are optimized to do exactly that. Searching the filesystem, opening the file, searching the document, will be expensive as the number of files and requests increase.
But nobody says you have to use MySQl. A NoSQL database like Redis, or maybe something like CouchDB (where you could keep the file itself and include versioning) might be solutions that are more attractive.
here a quick comparison of NoSQL databases.
And a longer comparison.
Edit: From your comments, I would build it as follows: create an API abstracting the backend for all the operations you want to do. Then implement the backend part with the 2 or 3 operations that happen most, or could be more expensive, for the filesytem, and for a database (or two). Test and benchmark.
I'd go for one of the NoSQL databases. You can store file contents and provide some key function based on user's IDs in order to retrieve those contents when you need them. Redis or Casandra can be good choices for this case. There are many libs to use these databases in Python as well as in many other languages.
In my opinion, the only real way to be sure is to build a test system and compare the space requirements. It shouldn't take that long to generate some random data programatically. One might think the file system would be more efficient, but databases can and might compress the data or deduplicate it, or whatever. Don't forget that a database would also make it easier to implement new features, perhaps access control.

Categories

Resources