GCP Datastore vs Search API performance benchmarks? - python

Are there any existing benchmarks comparing the performance of GCP Datastore queries and Search queries?
I'm interested in how the performance changes as the data grows. For instance, if we have:
class Project(ndb.Model):
    members = ndb.StringProperty(repeated=True)
and we have a document in Search like:
search.Document(fields=[search.AtomField(name='member', value='...'), ...])
I want to run a search to get the ids of all projects the user is a member of. Something like:
Project.query(Project.members == 'This Member').fetch(keys_only=True)
in Datastore, and a similar query in Search.
How would the performance compare when there are 10, 100, ... 10**6 objects?
I'm interested in whether there is a rule of thumb for the latency to expect from this simple kind of query. Of course I could go and try it, but I'd like to get an intuitive idea of the performance beforehand, in case someone has already done similar benchmarks. I would also like to avoid spending money and time on writing/reading data I would later need to delete, so if someone could share their experience, that would be much appreciated!
P.S. I use Python, but I'd assume the answer would be the same or similar for all languages that have support for GCP.

As of this moment, the Search API is only supported on Python 2; unfortunately, that version of Python is no longer maintained, so keep in mind that you will not be able to receive support for this service.
On the other hand, take a look at the code provided in this thread; it can give you an idea of how to run a benchmark against Datastore using Python 3.
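For illustration, here is a minimal sketch of such a benchmark using the google-cloud-datastore client for Python 3 (this is not the code from the linked thread; the Project kind and members property just mirror the question's model):

import time
from google.cloud import datastore

client = datastore.Client()

def time_member_query(member, runs=10):
    # Run the keys-only query several times and record each latency.
    latencies = []
    for _ in range(runs):
        query = client.query(kind="Project")
        query.add_filter("members", "=", member)
        query.keys_only()  # keys-only results are cheaper and faster to fetch
        start = time.perf_counter()
        keys = [entity.key for entity in query.fetch()]
        latencies.append(time.perf_counter() - start)
    return min(latencies), sum(latencies) / len(latencies), len(keys)

best, avg, count = time_member_query("This Member")
print("%d keys; best %.1f ms, avg %.1f ms" % (count, best * 1000, avg * 1000))

Repeating this at each dataset size (10, 100, ... rows) would show how the latency curve grows.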

Related

Advantage of Django ORM V/S Performing raw SQL queries [duplicate]

If you were to motivate the pros of using an ORM to management or a client, what would those reasons be?
Try and keep one reason per answer so that we can see which one gets voted up as the best reason.
The most important reason to use an ORM is so that you can have a rich, object-oriented business model and still be able to store it and quickly write effective queries against a relational database. From my viewpoint, I don't see any real advantages that a good ORM gives you compared with other generated DALs, other than the advanced types of queries you can write.
One type of query I am thinking of is a polymorphic query. A simple ORM query might select all shapes in your database. You get a collection of shapes back, but each instance is a square, circle, or rectangle according to its discriminator.
Another type of query is one that eagerly fetches an object and one or more related objects or collections in a single database call, e.g. each shape object is returned with its vertex and side collections populated.
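To make those two ideas concrete, here is a minimal SQLAlchemy (1.4+) sketch; the Shape/Square/Circle model is illustrative, not taken from any particular project:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, joinedload, relationship

Base = declarative_base()

class Shape(Base):
    __tablename__ = "shape"
    id = Column(Integer, primary_key=True)
    kind = Column(String(20))  # discriminator column
    vertices = relationship("Vertex", back_populates="shape")
    __mapper_args__ = {"polymorphic_on": kind, "polymorphic_identity": "shape"}

class Square(Shape):
    __mapper_args__ = {"polymorphic_identity": "square"}

class Circle(Shape):
    __mapper_args__ = {"polymorphic_identity": "circle"}

class Vertex(Base):
    __tablename__ = "vertex"
    id = Column(Integer, primary_key=True)
    shape_id = Column(Integer, ForeignKey("shape.id"))
    shape = relationship("Shape", back_populates="vertices")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Square(), Circle()])
    session.commit()
    # One polymorphic query: each row comes back as the concrete subclass
    # named by its discriminator, with the vertices collection populated
    # via a JOIN in the same SELECT.
    shapes = session.query(Shape).options(joinedload(Shape.vertices)).all()
    print([type(s).__name__ for s in shapes])  # ['Square', 'Circle']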
I'm sorry to disagree with so many others here, but I don't think that code generation is a good enough reason by itself to go with an ORM. You can write or find many good DAL templates for code generators that do not have the conceptual or performance overhead that ORMs do.
Or, if you think that you don't need to know how to write good SQL to use an ORM, again, I disagree. It might be true that from the perspective of writing single queries, relying on an ORM is easier. But with ORMs it is far too easy to create poorly performing routines when developers don't understand how their queries work with the ORM and what SQL they translate into.
Having a data layer that works against multiple databases can be a benefit. It's not one that I have had to rely on that often, though.
In the end, I have to reiterate that in my experience, if you are not using the more advanced query features of your ORM, there are other options that solve the remaining problems with less learning and fewer CPU cycles.
Oh yeah, some developers do find working with ORMs to be fun, so ORMs are also good from the keep-your-developers-happy perspective. =)
Speeding up development. For example, eliminating repetitive code like mapping query result fields to object members and vice versa.
Making data access more abstract and portable. ORM implementation classes know how to write vendor-specific SQL, so you don't have to.
Supporting OO encapsulation of business rules in your data access layer. You can write (and debug) business rules in your application language of preference, instead of clunky trigger and stored procedure languages.
Generating boilerplate code for basic CRUD operations. Some ORM frameworks can inspect database metadata directly, read metadata mapping files, or use declarative class properties.
You can move to different database software easily because you are developing to an abstraction.
Development happiness, IMO. ORM abstracts away a lot of the bare-metal stuff you have to do in SQL. It keeps your code base simple: fewer source files to manage and schema changes don't require hours of upkeep.
I'm currently using an ORM and it has sped up my development.
So that your object model and persistence model match.
To minimise duplication of simple SQL queries.
The reason I'm looking into it is to avoid the generated code from VS2005's DAL tools (schema mapping, TableAdapters).
The DAL/BLL I created over a year ago was working fine (for what I had built it for) until someone else started using it to take advantage of some of the generated functions (which I had no idea were there).
It looks like it will provide a much more intuitive and cleaner solution than the DAL/BLL solution from http://www.asp.net
I was thinking about creating my own SQL Command C# DAL code generator, but the ORM looks like a more elegant solution.
Abstract the SQL away 95% of the time, so not everyone on the team needs to know how to write super-efficient, database-specific queries.
I think there are a lot of good points here (portability, ease of development/maintenance, focus on OO business modeling etc), but when trying to convince your client or management, it all boils down to how much money you will save by using an ORM.
Do some estimations for typical tasks (or even larger projects that might be coming up) and you'll (hopefully!) get a few arguments for switching that are hard to ignore.
Compilation and testing of queries.
As the tooling for ORMs improves, it is easier to determine the correctness of your queries faster, through compile-time errors and tests.
Compiling your queries helps developers find errors faster. Right? Right. This compilation is made possible because developers now write queries in code using their business objects or models instead of just strings of SQL or SQL-like statements.
If you use the correct data access patterns in .NET, it is easy to unit test your query logic against in-memory collections. This speeds up the execution of your tests because you don't need to access the database, set up data in the database, or even spin up a full-blown data context. (Edit: this isn't as true as I thought it was, since unit testing in memory can present difficult challenges to overcome. But I still find these integration tests easier to write than in previous years.)
This is definitely more relevant today than a few years ago when the question was asked, but that may only be the case for Visual Studio and Entity Framework, where my experience lies. Plug in your own environment if possible.
.netTiers using CodeSmith templates
http://nettiers.com/default.aspx?AspxAutoDetectCookieSupport=1
Why code something that can be generated just as well.
Convince them of how much time and money you will save when changes come in and you don't have to rewrite your SQL, since the ORM tool will do that for you.
One con is that an ORM requires some changes in your POJOs, mainly related to schema, relations, and queries. In scenarios where you are not supposed to modify the model objects, perhaps because they are shared among more than one project or between client and server, you will need to split them into two layers, which requires additional effort.
I am an Android developer, and since mobile apps are usually not huge, this additional effort to segregate a pure model from an ORM-affected model does not seem worthwhile.
I understand that the question is a generic one, but mobile apps also fall under that generic umbrella.

Searching with the pyramid framework

I'm trying to implement a search function on my website, which runs on Pyramid, and I was wondering about the most efficient way to approach this problem. I am currently looking into Whoosh and MySQL full-text search with SQLAlchemy. I need a fast and simple implementation, and I'm wondering which would be the best choice.
I tried using full-text search in the native database for a while, and it was just too much work to keep things working across SQLite, MySQL, and PostgreSQL. I ported all the search code over to Whoosh and have been really happy ever since. It performs well for small workloads, is pure Python, and there's no server to set up.
You implement it almost like writing and updating a file on disk (see the sketch below). From what I've read it does well into the single millions of documents. I'm using it with some 18k documents and an index size of around 100MB. There's a lot of flexibility to implement various tokenizers and other configuration with it. I really suggest people start there, and if they outgrow Whoosh, then look at running extra processes with Elasticsearch, Lucene/Solr, and the like.
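A minimal sketch of that write-it-like-a-file pattern (the schema fields and index directory name are illustrative assumptions, not from the linked project):

import os
from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

schema = Schema(id=ID(stored=True, unique=True), content=TEXT)

# Create the on-disk index the first time, open it afterwards.
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
    ix = index.create_in("indexdir", schema)
else:
    ix = index.open_dir("indexdir")

# Adding/updating documents feels much like updating a file on disk.
writer = ix.writer()
writer.update_document(id=u"42", content=u"a bookmark about python search")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"python")
    for hit in searcher.search(query, limit=10):
        print(hit["id"])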
You can see how I've got it implemented here:
https://github.com/mitechie/Bookie/blob/develop/bookie/models/fulltext.py
and I update it using SqlAlchemy event hooks:
https://github.com/mitechie/Bookie/blob/develop/bookie/models/__init__.py#L663
and you can judge a basic implementation of it by searching:
https://bmark.us/search
I'm a huge fan of Elasticsearch. It's the easiest to set up, maintain, and work with.
I generally use the requests library.
to index:
import json
import requests

# POST lets Elasticsearch assign the document id
# (use PUT with an explicit id in the URL otherwise).
requests.post("http://localhost:9200/myindex/category/", data=json.dumps(document))
to search:
requests.get("http://localhost:9200/myindex/category/_search?q=" + somequery)
You can get much more in-depth searching using the query DSL (a sketch follows after the link):
http://www.elasticsearch.org/guide/reference/query-dsl/
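For example, a hypothetical request body for the same search through the DSL endpoint (the "content" field name is an assumption about your mapping):

import json
import requests

body = {"query": {"match": {"content": "somequery"}}}
response = requests.get(
    "http://localhost:9200/myindex/category/_search",
    data=json.dumps(body),
    headers={"Content-Type": "application/json"},
)
print(response.json()["hits"]["hits"])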

use standard datastore index or build my own

I am running a webapp on Google App Engine with Python. My app lets users post topics and respond to them, and the website is basically a collection of these posts categorized onto different pages.
Right now I only have around 200 posts and 30 visitors a day, but that is already taking up nearly 20% of my reads and 10% of my writes with the datastore. I am wondering whether it is more efficient to use App Engine's built-in get_by_id() function to retrieve posts by their IDs, or whether it is better to build my own queries. For some of the queries I will simply have to use GQL or the built-in query language, because posts are retrieved by more than just an ID, but I wanted to see which is better.
Thanks!
Are you doing efficient caching (or any caching at all)?
Also, if you're using that many writes for so few posts, it seems like you might have a problem with your models. Have you looked at the Datastore viewer to see how many writes you use per entity?
You might read the docs on exploding indexes; maybe that's part of your problem.
It's way better to use get_by_id(). It fetches the exact object by key and costs much less (it counts as a fetch of a single entity rather than a query).
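For instance, a quick sketch with the (old) App Engine NDB API; the Post model here is hypothetical:

from google.appengine.ext import ndb

class Post(ndb.Model):
    title = ndb.StringProperty()

# Direct lookup by key: fetches exactly one entity, no index scan.
post = Post.get_by_id(12345)

# The query equivalent: scans an index first, then fetches the entity.
post = Post.query(Post.title == "some title").get()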
I'd suggest using pre-existing code and building around that instead of re-inventing the wheel.

Building a DSL query language

I'm working on a project (written in Django) which has only a few entities, but many rows for each entity.
In my application I have several static "reports", written directly in plain SQL. Users can also search the database via a generic filter form. Since the target audience is really tech-savvy and at some point the filter doesn't fit their needs, I'm thinking about creating a query language for my database, like YQL or Jira's advanced search.
I found http://sourceforge.net/projects/littletable/ and http://www.quicksort.co.uk/DeeDoc.html, but it seems that they only operate on in-memory objects. Since the database can be too large to hold in memory, I would prefer that the query be translated into SQL (or, better, a Django query) before doing the actual work.
Are there any library or best practices on how to do this?
Writing such a DSL is actually surprisingly easy with PLY, and what ho, there's already an example available for doing just what you want in Django. You see, Django has this fancy thing called a Q object which makes the Django querying side of things fairly easy.
At DjangoCon EU 2012, Matthieu Amiguet gave a session entitled Implementing Domain-specific Languages in Django Applications in which he went through the process, right down to implementing such a DSL as you desire. His slides, which include all you need, are available on his website. The final code (linked to from the last slide, anyway) is available at http://www.matthieuamiguet.ch/media/misc/djangocon2012/resources/compiler.html.
Reinout van Rees also produced some good comments on that session. (He normally does!) These cover a little of the missing context.
You see in there something very similar to YQL and JQL in the examples given:
groups__name="XXX" AND NOT groups__name="YYY"
(modified > 1/4/2011 OR NOT state__name="OK") AND groups__name="XXX"
It can also be tweaked very easily; for example, you might want to use groups.name rather than groups__name (I would). This modification could be made fairly trivially (allow . in the FIELD token, by modifying t_FIELD, and then replacing . with __ before constructing the Q object in p_expression_ID).
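A hypothetical PLY fragment of just that tweak (lexer side only, showing the '.' to '__' mapping; the token names are illustrative):

import ply.lex as lex

tokens = ("FIELD", "EQUALS", "VALUE")

t_EQUALS = r"="
t_VALUE = r'"[^"]*"'
t_ignore = " \t"

def t_FIELD(t):
    r"[a-zA-Z_][a-zA-Z0-9_.]*"
    # Allow '.' in field names, then translate to Django's lookup separator.
    t.value = t.value.replace(".", "__")
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('groups.name="XXX"')
print([tok.value for tok in lexer])  # ['groups__name', '=', '"XXX"']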
So, that satisfies simple querying; it also gives you a good starting point should you wish to make a more complex DSL.
I've faced exactly this problem: a large database which needs searching. I made some static reports and several fancy filters using Django (very easy with Django), just like you have.
However, the power users were clamouring for more. I decided that there was already a DSL that they all knew: SQL. The question was how to make it secure enough.
So I used Django permissions to give the power users permission to save SQL queries in a new table. I then made a view for the not-quite-so-power users to run these queries, and made the queries take optional parameters. The queries were run using Python's lower-level DB-API, which Django uses under the hood for its ORM anyway.
The real trick was opening a read-only database connection to run these queries, just to make sure that no updates could ever run. I made a read-only connection by creating a different user in the database with lower permissions and opening a dedicated connection as that user in the view.
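A sketch of that read-only trick with psycopg2 (the role name and credentials are hypothetical; the same idea works with any DB-API driver whose database supports low-privilege users):

import psycopg2

# "report_ro" is a database user granted SELECT privileges only.
conn = psycopg2.connect(
    dbname="mydb",
    user="report_ro",
    password="secret",
    host="localhost",
)
conn.set_session(readonly=True)  # additionally refuse writes at the session level

with conn.cursor() as cur:
    # User-saved queries run here; parameters are passed separately.
    cur.execute("SELECT id, name FROM app_group WHERE name = %s", ["XXX"])
    rows = cur.fetchall()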
TL;DR - SQL is the way to go!
Depending on the form of your data, the types of queries your users need to use, and the frequency that your data is updated, an alternative to the pure SQL solution suggested by Nick Craig-Wood is to index your data in Solr and then run queries against it.
Solr is an added layer of complexity (configuration, data synchronization) but it is super-fast, can handle large datasets, and provides a (relatively) intuitive query language.
You could actually write your own SQL-ish language using pyparsing. There is even a pretty verbose example you could extend.

Graph databases and RDF triplestores: storage of graph data in python

I need to develop a graph database in Python (I would enjoy it if anybody joined me in the development; I already have a bit of code, but I would gladly discuss it).
I did my research on the internet. In Java, neo4j is a candidate, but I was not able to find anything about its actual disk storage. In Python, there are many graph data models (see this pre-PEP proposal), but none of them satisfy my need to store and retrieve from disk.
I do know about triplestores, however. Triplestores are basically RDF databases, so a graph data model could be mapped to RDF and stored, but I am generally uneasy (mainly due to lack of experience) about this solution. One example is Sesame. The fact is that, in any case, you have to convert between the in-memory graph representation and the RDF representation, unless the client code wants to hack on the RDF document directly, which is mostly unlikely. It would be like handling DB tuples directly instead of creating an object.
What is the state of the art for storage and retrieval (a la DBMS) of graph data in Python at the moment? Would it make sense to start developing an implementation, hopefully with the help of someone interested in it, and in collaboration with the proposers of the Graph API PEP? Please note that this is going to be part of my job for the next months, so my contribution to this eventual project is pretty damn serious ;)
Edit: I also found directededge, but it appears to be a commercial product.
I have used both Jena, which is a Java framework, and AllegroGraph (Lisp, with Java and Python bindings). Jena has sister projects for storing graph data and has been around a long, long time. AllegroGraph is quite good and has a free edition; I think I would suggest it because it is easy to install, free, fast, and you could be up and going in no time. The power you would get from learning a little RDF and SPARQL may very well be worth your while. If you know SQL already, then you are off to a great start. Being able to query your graph using SPARQL would yield some great benefits. Serializing to RDF triples is easy, and some of the file formats are super easy (NT, for instance). I'll give an example. Let's say you have the following graph of node-edge-node ids:
1 <- 2 -> 3
3 <- 4 -> 5
These are already in subject-predicate-object form, so just slap some URI notation on them, load them into the triple store, and query at will via SPARQL. Here they are in NT format:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
Now query for all nodes two hops from node 1:
SELECT ?node
WHERE {
    <http://mycompany.com#1> ?p1 ?o1 .
    ?o1 ?p2 ?node .
}
This would of course yield <http://mycompany.com#5>.
Another candidate would be Mulgara, written in pure Java. Since you seem more interested in Python, though, I think you should take a look at AllegroGraph first.
I think the solution really depends on exactly what you want to do with the graph once you have managed to store it on disk or in a database, and this is a little unclear in your question. However, a couple of things you might wish to consider are:
if you just want to persist the graph without using any of the features or properties you might expect from an RDBMS solution (such as ACID), then how about just pickling the objects into a flat file? Very rudimentary, but as I say, it depends on exactly what you want to achieve (see the sketch after this list).
ZODB is an object database for Python (a spin-off from the Zope project, I think). I can't say I've had much experience with it in a high-performance environment, but bar a few restrictions it does allow you to store Python objects natively.
if you wish to pursue RDF, there is an RDFAlchemy project which might help alleviate some of your concerns about converting from your graph to RDF structures, and I think it has Sesame as part of its stack.
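For the first option above, a rudimentary sketch of pickling a graph to a flat file (the adjacency-dict representation is just an example):

import pickle

# Toy adjacency mapping: node -> set of neighbour nodes.
graph = {1: {3}, 3: {5}}

# Persist the whole object graph in one shot...
with open("graph.pkl", "wb") as f:
    pickle.dump(graph, f)

# ...and load it back later.
with open("graph.pkl", "rb") as f:
    graph = pickle.load(f)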
There are some other persistence tools detailed on the Python site which may be of interest; however, I spent quite a while looking into this area last year, and ultimately I found there wasn't a native Python solution that met my requirements.
The most success I had was using MySQL with a custom ORM, and I posted a couple of relevant links in an answer to this question. Additionally, if you want to contribute to an RDBMS project: when I spoke to someone from Open Query about a graph storage engine for MySQL, they seemed interested in getting active participation in their project.
Sorry I can't give a more definitive answer, but I don't think there is one... If you do start developing your own implementation, I'd be interested to keep up-to-date with how you get on.
Greetings from your Sirius Cybernetics Intelligent Agent!
Some useful links...
Programming the Semantic Web
SEMANTIC PROGRAMMING
RDFLib Python Library for RDF
Hmm, maybe you should take a look at CubicWeb.
Regarding Neo4j, did you notice the existing Python bindings? As for the disk storage, take a look at this thread on the mailing list.
For graphdbs in Python, the Hypergraph Database Management System project was recently started on SourceForge by Maurice Ling.
Redland (http://librdf.org) is probably the solution you're looking for. It has Python bindings too.
RDFLib is a Python library that you can use. Using harschware's example, create a test.nt file like the one below:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
To query for all nodes two hops from node 1 in RDFLib:
from rdflib import Graph

g = Graph()
g.parse("test.nt", format="nt")

qres = g.query(
    """SELECT ?node
       WHERE {
           <http://mycompany.com#1> ?p1 ?o1 .
           ?o1 ?p2 ?node .
       }"""
)
for row in qres:
    print(row.node)  # each result row exposes its bound variables by name
This should return the answer <http://mycompany.com#5>.
