Storing and Searching Protobuf messages in a database - python

I have a nested data structure defined with protocol-buffer messages, and a service that receives these messages. On the server side, I need to store the messages and be able to search for messages that have certain values in different fields, or to find the message(s) referenced by another one.
I have looked into the best way to do this, and it seems that a database which can store these messages (directly or via JSON) and allows querying them would be a good fit.
I searched for a database that provides this kind of support effectively, but without much success.
One approach I found was built around MongoDB: define a mirror schema, convert the messages to JSON, and store them in MongoDB.
I also found ProfaneDB, which describes a problem very much like mine. However, it seems to have been dormant for the last 3-4 years, and I'm not sure how stable or scalable it is, or whether there are more recent or more popular solutions.
I suspect there are better solutions for this use case. I'd appreciate advice on a good way to do this.

I think you should discard the binary protobuf messages as soon as you've unmarshaled them on your server, unless you have a legal requirement to retain the transmitted messages as-is. The protobuf wire format is optimized for network transmission, not for searching.
Once you have the message in your language's native types, most databases will be able to store the data. Your focus should then be on how you want to access the data, what levels of reliability, availability, and consistency you need, and how much you want to pay.
One important requirement is whether you want structured queries against your data or free-form (arbitrary text) searches. For the former, consider SQL and NoSQL databases; for the latter, something like Elasticsearch.
There are so many excellent, well-supported (and, if you want, cloud-based) databases that can meet your needs that you should disregard anything that isn't popular, unless you have very specific needs that are only addressed by a niche solution.
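For concreteness, a minimal sketch of the MongoDB route mentioned in the question, assuming a hypothetical Order message generated from your .proto and a local MongoDB instance; MessageToDict/ParseDict come from the protobuf Python package, and pymongo handles storage and querying.

```python
# Sketch only: the Order message, module name and field paths are hypothetical.
from google.protobuf.json_format import MessageToDict, ParseDict
from pymongo import MongoClient

from my_schema_pb2 import Order  # generated from your .proto

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["orders"]

def store(msg: Order) -> None:
    # Convert the protobuf message to a plain dict and store it as one document.
    doc = MessageToDict(msg, preserving_proto_field_name=True)
    collection.insert_one(doc)

def find_by_customer(customer_id: str):
    # Query on any (possibly nested) field, e.g. customer.id.
    for doc in collection.find({"customer.id": customer_id}):
        doc.pop("_id", None)              # drop Mongo's internal id
        yield ParseDict(doc, Order())     # back to a protobuf message if needed
```

The same dict could just as easily be handed to a relational database with JSON columns or to Elasticsearch; the conversion step is the only protobuf-specific part.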

Related

in-memory database with publish subscribe and query filter?

I am looking for a trading UI solution for my work. I require an in-memory database that can:
1. Store table-pattern data (rows and columns) with indexing capability.
2. Provide a publish and subscribe mechanism; there will be multiple subscribers to a topic/table.
3. Offer query-filter capability, since every user will have different criteria for subscription.
I have found a few technologies/options myself:
AMPS (60East Technologies): the most efficient one; it provides pretty much everything I mentioned above, but it is a paid solution. It uses column-based storage and allows indexing as well.
MongoDB tailable cursors/capped collections: these also provide query-based subscription with open cursors, though not in-memory. Any thoughts on their performance? (I expect more than a million rows with hundreds of columns.)
Use a simple pub/sub mechanism and perform the query filtering at the client. But that would require unnecessary data flow, which would result in security issues and a performance bottleneck.
Any suggestions on a product or toolset ideal for such a scenario? Our client side is a Python/C++ UI, and the server side will have a mixture of C++/Java/Python components. All ideas are welcome.
Many thanks!
SQLite, maybe? https://www.sqlite.org/index.html
I'm not exactly sure about your publish/subscribe mechanism requirements, but SQLite is used all over the place.
Though, to be honest, your in-memory database seems like it's going to be huge ("I expect more than a million rows with hundreds of columns").
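For what it's worth, a minimal sketch of an in-memory SQLite table with an index (table and column names are invented); it says nothing about the pub/sub side, which SQLite does not provide on its own.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; it disappears when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (symbol TEXT, price REAL, ts INTEGER)")
conn.execute("CREATE INDEX idx_quotes_symbol ON quotes (symbol)")

conn.executemany(
    "INSERT INTO quotes (symbol, price, ts) VALUES (?, ?, ?)",
    [("ABC", 10.5, 1), ("XYZ", 99.0, 2)],
)

# Each subscriber's "filter" is just a WHERE clause run against the shared table.
rows = conn.execute("SELECT * FROM quotes WHERE symbol = ?", ("ABC",)).fetchall()
print(rows)
```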

MongoDB reference vs nested

I want to store 'status updates' in MongoDB, so this collection/array can get very big. I think one option would be to save the documents in an array nested inside the user/group/... document. (Different collections need their own 'status updates'.) The other way would be to create another collection, save the messages there, and relate the user/group/... to the status updates via another ObjectId.
I want to know
what is faster
what is easier to administer and query
I don't think I'm going to use an ORM/ODM, just "plain" pymongo.
I haven't found any clear answer in the docs, maybe someone already tested this?
This is an older presentation, but still relevant for these kinds of questions, and discusses some of the tradeoffs.
http://www.10gen.com/presentations/mongosf2011/schemascale
TL;DR(W): it depends on how many updates "very big" means and on how you're accessing them. If you always need to access the full set at once and it is under 16 MB, you can embed; if you generally need only a few at a time, you can link. There's also a hybrid approach, which is to embed the recent updates and link the rest.
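A small pymongo sketch of the two shapes (collection and field names are made up), in case it helps make the trade-off concrete:

```python
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient()["mydb"]
user_id = ObjectId()  # stand-in for an existing user's _id

# Embedded: updates live inside the user document (fine while it stays under 16 MB).
db.users.update_one(
    {"_id": user_id},
    {"$push": {"status_updates": {"text": "hello", "ts": 1}}},
)
user = db.users.find_one({"_id": user_id})  # one read returns everything

# Linked: updates live in their own collection, referenced by the owner's _id.
db.status_updates.insert_one({"owner_id": user_id, "text": "hello", "ts": 1})
recent = db.status_updates.find({"owner_id": user_id}).sort("ts", -1).limit(10)
```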

In need of a light, changing database/storage solution

I have a Python Flask app I'm writing, and I'm about to start on the backend. The main part of it involves users POSTing data to the backend, usually a small piece of data every second or so, to later be retrieved by other users. The data will always be retrieved within an hour, and could be retrieved in as little as a minute. I need a database or storage solution that can constantly take in and store the data, purge data that has been retrieved, and also purge data that's been in storage for longer than an hour.
I do not need a relational system; JSON/key-value storage should be able to handle both incoming and outgoing data. There will also be constant reading, writing, and deleting.
Should I go with something like MongoDB? Should I use a database system at all, or instead constantly write to a directory full of .json files, or something like that? (Using only files is probably a bad idea, but it's about the extent of what I need.)
You might look at MongoEngine; we use it in production with Flask (there's an extension) and it has suited our needs well. There's also MongoAlchemy, which I haven't tried but which seems to be decently popular.
The downside to using Mongo is that there is no automatic expiry; having said that, you might take a look at Redis, which can auto-expire items. There are a few ORMs out there that might suit your needs.
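A minimal sketch of the Redis route with redis-py, using made-up key names: every value is written with a one-hour TTL, so anything never retrieved is purged automatically, and keys are deleted as soon as they are consumed.

```python
import json
import redis

r = redis.Redis()  # assumes a local Redis server with default settings

def store(item_id: str, payload: dict) -> None:
    # setex writes the value with a TTL; Redis drops it after 3600 s on its own.
    r.setex(f"item:{item_id}", 3600, json.dumps(payload))

def retrieve(item_id: str):
    # GET then DELETE: return the value (if still present) and purge it immediately.
    raw = r.get(f"item:{item_id}")
    if raw is None:
        return None
    r.delete(f"item:{item_id}")
    return json.loads(raw)
```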

Reverse Search Best Practices?

I'm making an app that needs reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects are entered into the system, if they match a user's saved search parameters, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database whose last-modified date is later than the last run; these then get filtered and alerts issued. You can perhaps push some of the filtering into the database query. However, this is a bit trickier if notifications need to be sent when items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
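For the queue piece, a minimal publish with pika against a local RabbitMQ broker might look like the sketch below (the queue name and payload are arbitrary); whichever worker consumes the queue then runs the saved-search matching and sends the notifications.

```python
import json
import pika

# Assumes RabbitMQ is running locally with default credentials.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="object_saved", durable=True)

def publish_saved(obj_id: int) -> None:
    # Called by the trigger (database-level or in application code) after every save.
    channel.basic_publish(
        exchange="",
        routing_key="object_saved",
        body=json.dumps({"id": obj_id}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
```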
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-documents and indexing them on all of their must-have and may-have terms. A new document's term list was used as a sort of query against this "database of queries", which built a list of possibly interesting searches to run, and then only those searches were run against the new documents. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree, and that was used to build the list of must/may-have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because they could cause an explosion in the number of queries selected.
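A toy Python sketch of that idea (everything here is invented and ignores the Boolean/similarity subtleties): stored queries are indexed by their terms, and a new document's term list pulls out only the candidate queries worth running in full.

```python
from collections import defaultdict

# term -> ids of stored queries that mention that term
query_index: dict[str, set[int]] = defaultdict(set)

def index_query(query_id: int, terms: list[str]) -> None:
    for term in terms:
        query_index[term].add(query_id)

def candidate_queries(doc_terms: list[str]) -> set[int]:
    # Use the new document's term list as a "query against the queries":
    # only stored queries sharing at least one term can possibly match.
    candidates: set[int] = set()
    for term in doc_terms:
        candidates |= query_index.get(term, set())
    return candidates

index_query(1, ["acme", "earnings"])
index_query(2, ["futures", "copper"])
print(candidate_queries(["acme", "quarterly", "report"]))  # -> {1}
```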
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64 MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
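A rough Django sketch of that last suggestion, assuming a hypothetical SavedSearch model (holding a content type and a pickled Q object), an arbitrary watched model called Article, and a notify() helper; the post_save handler loads only the searches registered for the sender's type.

```python
import pickle

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Article, SavedSearch   # hypothetical models
from myapp.notifications import notify          # hypothetical notification hook


@receiver(post_save, sender=Article)  # register one handler per watched model
def run_saved_searches(sender, instance, created, **kwargs):
    ct = ContentType.objects.get_for_model(sender)
    # Only searches stored for this content type are loaded and replayed,
    # each against just the one row that was saved.
    for search in SavedSearch.objects.filter(content_type=ct):
        q = pickle.loads(search.query)
        if sender.objects.filter(pk=instance.pk).filter(q).exists():
            notify(search.user, instance)
```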

Database change underneath SQLObject

I'm starting a web project that should likely be fine with SQLite. I have SQLObject on top of it, but I'm thinking long term here -- if this project requires something more robust (e.g. able to handle high traffic), I will need to have a transition plan ready. My questions:
1. How easy is it to transition from one DB (SQLite) to another (MySQL, Firebird, or PostgreSQL) under SQLObject?
2. Does SQLObject provide any tools to make such a transition easier? Is it as simple as taking the objects I've defined and calling createTable?
3. What about having multiple SQLite databases instead, e.g. one per visitor group? Does SQLObject provide a mechanism for handling this scenario, and if so, what is the mechanism to use?
Thanks,
Sean
3) is quite an interesting question. In general, SQLite is pretty useless for web-based stuff. It scales fairly well for size, but scales terribly for concurrency, so if you are planning to hit it with a few requests at the same time, you will be in trouble.
Now, your idea in part 3) of the question is to use multiple SQLite databases (e.g. one per user group, or even one per user). Unfortunately, SQLite will give you no help in this department, but it is possible. The one project I know of that has done this before is Divmod's Axiom, so I would certainly check that out.
Of course, it would probably be much easier to just use a good concurrent DB like the ones you mention (Firebird, PG, etc).
For completeness:
1 and 2) It should be straightforward without you actually writing much code. I find SQLObject a bit restrictive in this department, and would strongly recommend SQLAlchemy instead. It is far more flexible, and if I were starting a new project today, I would certainly use it over SQLObject. It won't be moving "objects" anywhere; there is no magic involved here, it will be transferring rows between tables in a database. Which, as mentioned, you could do by hand, but this might save you some time.
Your success with createTable() will depend on your existing underlying table schema / data types. In other words, how well SQLite maps to the database you choose and how SQLObject decides to use your data types.
The safest option may be to create the new database by hand. Then you'll have to deal with data migration, which may be as easy as instantiating two SQLObject database connections over the same table definitions.
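A sketch of that two-connection idea, assuming a trivial Person table and made-up connection URIs; SQLObject lets you pass a connection per call, so the copy is just select-from-old, insert-into-new.

```python
from sqlobject import SQLObject, StringCol, connectionForURI

class Person(SQLObject):
    name = StringCol()

old = connectionForURI("sqlite:///full/path/to/app.db")        # existing SQLite file
new = connectionForURI("mysql://user:password@localhost/app")  # target database

# Create the same table definition in the target database.
Person.createTable(connection=new, ifNotExists=True)

# Copy row by row; fine for small tables, slow for big ones.
for row in Person.select(connection=old):
    Person(name=row.name, connection=new)
```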
Why not just start with the more full-featured database?
I'm not sure I understand the question.
The SQLObject documentation lists six kinds of connections available. Further, the database connection (or scheme) is specified in a connection string. Changing database connections from SQLite to MySQL is trivial. Just change the connection string.
The documentation lists the different kinds of schemes that are supported.
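For instance (URIs are illustrative), the only line that changes when moving from SQLite to MySQL is the connection string handed to connectionForURI:

```python
from sqlobject import connectionForURI, sqlhub

# SQLite during development...
sqlhub.processConnection = connectionForURI("sqlite:///full/path/to/app.db")

# ...later, the same model definitions against MySQL (or postgres://, firebird://, ...):
# sqlhub.processConnection = connectionForURI("mysql://user:password@localhost/app")
```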
