I need to store an undirected graph in a Google App Engine database.
For optimization purposes, I am thinking of using database indexes.
Using Google App Engine, is there any way to define the columns of a database table in order to create an index on it?
I will need some optimization, since my app uses this stored undirected graph for content-based filtering in item recommendation. The recommender algorithm also updates the weights of some of the graph's edges.
If it is not possible to use database indexes, please suggest another method to reduce query time for the graph table. I believe my algorithm performs more read operations on the graph table than write operations.
PS: I am using Python.
Perhaps this will help: http://code.google.com/intl/sv-SE/appengine/docs/python/datastore/queriesandindexes.html#Defining_Indexes_With_Configuration
Are you actually seeing prohibitively slow queries? I'm guessing not; I suspect this is somewhat premature optimization. The App Engine datastore doesn't do any sorting, filtering, joins, or other meaningful operations in memory, so query times are generally fairly constant. In particular, query latency does not depend on the number of entities in your datastore, or even the number of entities that match your query; it only depends on the number of results you ask for.
On a related note, adding indexes to your datastore will not speed up existing queries. If a query needs a custom index, it won't degrade and run slower without it; the query simply won't run at all until you add the index.
For the specific query you mention, SELECT * FROM edges WHERE vertex1 = x AND vertex2 = y, the datastore can run it without a custom index at all, because it only uses equality filters. See this section of the docs for more details.
In short, just run the queries you need, and don't think too much about indexes or try to optimize as if you were a DBA. It's not a relational database. :P
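To make that concrete, here is a minimal sketch of how the edge entity and that lookup could look with the (old) Python db API; the Edge model name, the property names, and the use of string vertex identifiers are illustrative assumptions, not your actual schema:

from google.appengine.ext import db

class Edge(db.Model):
    # Both endpoints of an undirected edge; store the pair in a canonical
    # order (e.g. vertex1 <= vertex2) so each edge only needs to be looked up one way.
    vertex1 = db.StringProperty(required=True)
    vertex2 = db.StringProperty(required=True)
    weight = db.FloatProperty(default=1.0)

def get_edge(x, y):
    # Equality filters on multiple properties are served by the built-in indexes,
    # so no custom entry in index.yaml is needed for this query.
    v1, v2 = sorted([x, y])
    return Edge.all().filter('vertex1 =', v1).filter('vertex2 =', v2).get()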
Thanks for hearing me out.
I have a dataset that is a matrix of shape 75000x10000 filled with float values. Think of it as a heatmap/correlation matrix. I want to store this in a SQLite database (SQLite because I am modifying an existing Django project). The source data file is 8 GB in size and I am trying to use Python to carry out my task.
I have tried to use pandas chunking to read the file into Python, transform it into unstacked pairwise indexed data, and write it out to a JSON file. But this approach is computationally very expensive: for a chunk of size 100x10000 it generates a 200 MB JSON file.
This JSON file will be used as a fixture to build the SQLite database in the Django backend.
Is there a better (faster/smarter) way to do this? I don't think writing out roughly 90 GB of JSON over a full day is the way to go, and I'm not even sure a Django database can take this load.
Any help is appreciated!
SQLite is quite impressive for what it is, but it's probably not going to give you the performance you are looking for at that scale, so even though your existing project is Django on SQLite I would recommend simply writing a Python wrapper for a different data backend and just using that from within Django.
More importantly, forget about using Django models for something like this; they are an abstraction layer built for convenience (mapping database records to Python objects), not for performance. Django would very quickly choke trying to build 100s of millions of objects since it doesn't understand what you're trying to achieve.
Instead, you'll want to use a database type / engine that's suited to the type of queries you want to make:
- If a typical query consists of a hundred point queries to get the data in particular 'cells', a key-value store might be ideal.
- If you're typically pulling ranges of values from individual 'rows' or 'columns', then that's something to optimize for.
- If your queries typically involve taking sub-matrices and performing predictable operations on them, then you might improve performance significantly by precalculating certain cumulative values.
- If you want to use the full dataset to train machine learning models, you're probably better off not using a database for your primary storage at all (databases by nature sacrifice fast retrieval of the full raw data for fast calculations on interesting subsets), especially if your ML models can be parallelised using something like Spark.
No DB will handle everything well, so it would be useful if you could elaborate on the workload you'll be running on top of that data -- the kind of questions you want to ask of it?
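If you do end up staying on SQLite for now, you can at least skip the JSON-fixture step and Django models entirely and stream the matrix straight into a table with pandas. A rough sketch, assuming a tab-separated source file and made-up file/table/column names:

import pandas as pd
import sqlalchemy as sa

SOURCE = "matrix.tsv"    # placeholder for the 8 GB source matrix
engine = sa.create_engine("sqlite:///matrix.db")

reader = pd.read_csv(SOURCE, sep="\t", index_col=0, chunksize=100)
for chunk in reader:
    # Unstack each 100 x 10000 block into (row, col, value) triples.
    pairs = chunk.stack().reset_index()
    pairs.columns = ["row_id", "col_id", "value"]
    # Append directly to the table instead of writing JSON fixtures first.
    pairs.to_sql("matrix_values", engine, if_exists="append", index=False)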
I'm setting up my database and sometimes I'll need to use an ID. At first I added an ID as a property to my nodes of interest, but then realized I could also just use Neo4j's internal id. Then I stumbled upon CREATE INDEX ON :label(something) and was wondering what exactly this would do; I thought an index and the id would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (MySQL, MongoDB or Neo4j) and decided on Neo4j, since my data pretty much follows a graph structure (it will be used to build metabolic models: connections genes -> proteins -> reactions -> compounds).
In SQL the syntax just seemed too complex, as I had to go through several tables to make simple connections that Neo4j accomplishes quite easily...
From what I understand, MongoDB stores records independently of one another, and since my data is connected it doesn't really seem to fit that data structure.
But again, since my knowledge of this subject is limited, perhaps I'm not making the right choice?
Thanks in advance.
Graph dbs are ideal for connected data like this, it's a more natural fit for both storing and querying than relational dbs or document stores.
As far as indexes and ids go, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j looks up starting nodes. Neo4j only uses indexes for finding these starting nodes (though as of 3.5, when an index lookup like this is used and you have ORDER BY on the indexed property, the index is also used to improve the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This is where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you MATCH or MERGE using the label and property (or properties) associated with the index, this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.
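As a rough illustration, here is a sketch using the official Neo4j Python driver; the connection details, the :Gene label, and the gene_id property are made-up placeholders for your metabolic model:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Index your own stable identifier so MATCH/MERGE can do a direct index lookup
    # (internal ids are not safe to keep outside the database, since they can be reused).
    session.run("CREATE INDEX ON :Gene(gene_id)")

    # PROFILE shows which lookup the planner chose: NodeIndexSeek (index),
    # NodeByLabelScan (label scan) or AllNodesScan (all nodes scan).
    result = session.run("PROFILE MATCH (g:Gene {gene_id: $id}) RETURN g", id="gene-123")
    print(result.consume().profile)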
To get good performance with Spark, I'm wondering whether it's better to run SQL queries via SQLContext or to write queries via DataFrame functions like df.select().
Any idea? :)
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications from every supported language. With HiveContext, they can also be used to expose some functionality that may be inaccessible in other ways (for example, UDFs without Spark wrappers).
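To illustrate the equivalence, here is a small PySpark sketch (the sample data and column names are invented) that runs the same aggregation both ways; explain() lets you confirm that both produce the same physical plan:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 7)], ["name", "score"])
df.createOrReplaceTempView("scores")

# SQL form
sql_result = spark.sql(
    "SELECT name, count(*) AS cnt FROM scores GROUP BY name ORDER BY cnt DESC")

# Equivalent DataFrame form -- both go through the same Catalyst optimizer
df_result = (df.groupBy("name")
               .agg(F.count("*").alias("cnt"))
               .orderBy(F.col("cnt").desc()))

sql_result.explain()   # compare the physical plans: they should match
df_result.explain()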
Ideally, Spark's Catalyst optimizer should compile both calls down to the same execution plan, so the performance should be the same. Which one you use is just a matter of style.
In practice, there is a difference according to the report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperformed DataFrames in a case that grouped records with their total counts and sorted them in descending order by record name.
By using DataFrames, one can break the SQL into multiple statements/queries, which helps with debugging, incremental enhancements, and code maintenance.
Breaking complex SQL queries into simpler queries and assigning the result to a DataFrame makes them easier to understand.
By splitting a query into multiple DataFrames, the developer also gains the ability to cache and to repartition (to distribute data evenly across the partitions using a unique or close-to-unique key).
The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).
HashAggregation builds a HashMap with the grouping columns as keys and the rest of the columns as values.
Spark SQL uses HashAggregation where possible (if the data for the values is mutable): O(n).
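A quick, self-contained way to check which strategy the planner picked (the sample data here is made up) is to look for HashAggregate vs SortAggregate in the physical plan:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("b", 2.0), ("a", 3.0)], ["key", "value"])

# Numeric aggregation buffers are mutable, so this normally compiles to HashAggregate;
# aggregates with immutable buffers fall back to sort-based aggregation.
df.groupBy("key").agg(F.sum("value").alias("total")).explain()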
I have a large dataset of events in a Postgres database that is too large to analyze in memory. Therefore I would like to quantize the datetimes to a regular interval and perform group by operations within the database prior to returning results. I thought I would use SqlSoup to iterate through the records in the appropriate table and make the necessary transformations. Unfortunately I can't figure out how to perform the iteration in such a way that I'm not loading references to every record into memory at once. Is there some way of getting one record reference at a time in order to access the data and update each record as needed?
Any suggestions would be most appreciated!
Chris
After talking with some folks, it's pretty clear the better answer is to use Pig to process and aggregate my data locally. At the scale I'm operating at, it wasn't clear Hadoop was the appropriate tool to be reaching for. One person I talked to about this suggests Pig will be orders of magnitude faster than in-DB operations at my scale, which is about 10^7 records.
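For reference, the in-database approach the question describes can also be done without loading everything into memory: push the quantizing and grouping into Postgres and stream the already-aggregated rows back through a server-side cursor. A rough sketch with made-up table and column names:

import psycopg2

conn = psycopg2.connect("dbname=events_db")   # connection string is a placeholder
cur = conn.cursor(name="bucketed")            # named cursor = server-side, streams rows

# Quantize timestamps to hourly buckets and aggregate inside the database.
cur.execute("""
    SELECT date_trunc('hour', event_time) AS bucket, count(*) AS n
    FROM events
    GROUP BY bucket
    ORDER BY bucket
""")

for bucket, n in cur:
    print(bucket, n)   # only a small batch of aggregated rows is held in memory at a time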
There are two columns in the table inside the MySQL database. The first column contains the fingerprint, while the second one contains the list of documents which have that fingerprint. It's much like an inverted index built by search engines. An instance of a record inside the table is shown below:
34 "doc1, doc2, doc45"
The number of fingerprints is very large (it can range up to trillions). There are basically the following operations on the database: inserting/updating a record, and retrieving a record according to a match on the fingerprint. The table definition Python snippet is:
self.cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint` (fp BIGINT, documents TEXT)")
And the snippet for insert/update operation is:
if self.cursor.execute("UPDATE `fingerprint` SET documents=CONCAT(documents,%s) WHERE fp=%s", ("," + newDocId, thisFP)) == 0L:
    self.cursor.execute("INSERT INTO `fingerprint` VALUES (%s, %s)", (thisFP, newDocId))
The only bottleneck I have observed so far is the query time in MySQL. My whole application is web-based, so time is a critical factor. I have also thought of using Cassandra, but have little knowledge of it. Please suggest a better way to tackle this problem.
Get a high-end database. Oracle has some offerings; SQL Server does too.
TRILLIONS of entries is well beyond the scope of a normal database. This is very high-end, very special stuff, especially if you want decent performance. Also get the hardware for it: a decent mid-range server, 128+ GB of memory for caching, and either a decent SAN or a good-enough DAS setup via SAS.
Remember, TRILLIONS means:
1,000 GB of storage used for EVERY BYTE per entry.
If the fingerprint is stored as an int64, that is 8,000 GB of disc space for this data alone.
Or were you planning to run that from a small, cheap server with a couple of 2 TB discs? Good luck.
That data structure isn't a great fit for SQL - the 'correct' design in SQL would be to have a row for each fingerprint/document pair, but querying would be impossibly slow unless you add an index that would take up too much space. For what you are trying to do, SQL adds a lot of overhead to support functions you don't need while not supporting the multiple value column that you do need.
A redis cluster might be a good fit - the atomic set operations should be perfect for what you are doing, and with the right virtual memory setup and consistent hashing to distribute the fingerprints across nodes it should be able to handle the data volume. The commands would then be
SADD fingerprint docid
to add or update the record, and
SMEMBERS fingerprint
to get all the document ids with that fingerprint.
SADD is O(1). SMEMBERS is O(n), but n is the number of documents in the set, not the number of documents/fingerprints in the system, so effectively also O(1) in this case.
The SQL insert you are currently using is O(n) with n being the very large total number of records, because the records are stored as an ordered list which must be reordered on insert rather than a hash table which is constant time for both get and set.
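A rough sketch of that approach with redis-py; the host/port and the "fp:<fingerprint>" key naming scheme are assumptions, not part of the original answer:

import redis

r = redis.Redis(host="localhost", port=6379)

def add_document(fingerprint, doc_id):
    # SADD is O(1); adding an already-present member is a no-op.
    r.sadd("fp:%d" % fingerprint, doc_id)

def get_documents(fingerprint):
    # SMEMBERS is O(k), where k is the number of documents for this fingerprint.
    return r.smembers("fp:%d" % fingerprint)

add_document(34, "doc1")
add_document(34, "doc2")
print(get_documents(34))   # {b'doc1', b'doc2'}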
Greenplum data warehouse, FOC, postgres driven, good luck ...