Quicker way of updating subdocuments - python

My JSON documents (called "i") have subdocuments (called "elements").
I am looping through these subdocuments and updating them one at a time. However, to do so (once the value I need is computed), Mongo has to scan through all the documents in the database, then through all the subdocuments, to find the subdocument it needs to update.
I am having major time issues: I have ~3000 documents and this is taking about 4 minutes.
I would like to know if there is a quicker way to do this, without Mongo having to scan all the documents, but by doing it within the loop.
Here is the code:
for i in db.stuff.find():
    for element in i['counts']:
        computed_value = element['a'] + element['b']  # 'a' and 'b' stand in for the actual field names
        db.stuff.update({'id': i['id'], 'counts.timestamp': element['timestamp']},
                        {'$set': {'counts.$.total': computed_value}})
I am identifying the overall document by "id" and then the subdocument by its timestamp (which is unique to each subdocument). I need to find a quicker way than this. Thank you for your help.

What indexes do you have on your collection? This could probably be sped up by creating an index on your embedded documents. You can do this using dot notation -- there's a good explanation and example here.
In your case, you'd do something like
db.stuff.ensureIndex({ "counts.timestamp": 1 });
This will make your searches through embedded documents run much faster.
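From Python, a rough pymongo equivalent of the shell command above might look like this (the compound key including "id" is an assumption; it mirrors both fields used in the update filter):
from pymongo import MongoClient, ASCENDING

db = MongoClient()['mydb']  # 'mydb' is a placeholder database name
# Index both fields the update filter uses: the document id and the embedded timestamp.
db.stuff.create_index([('id', ASCENDING), ('counts.timestamp', ASCENDING)])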

Your update is based on id (and I assume it is different from MongoDB's default _id).
Put an index on your id field.
Do you want to set the new field for all documents in the collection, or only for documents matching some criteria? If only for matching documents, use a query operator (with an index if possible).
Don't fetch the full document; fetch only the fields that are actually used (see the sketch below).
What is your average document size? Use explain and mongostat to understand where the actual bottleneck is.
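If the per-document round trips are the main cost, something like the following pymongo sketch combines a projection with batched updates via bulk_write (field names are taken from the question; pymongo 3.x is assumed):
from pymongo import MongoClient, UpdateOne

db = MongoClient()['mydb']  # 'mydb' is a placeholder database name
requests = []
# Project only the fields the loop actually needs.
for doc in db.stuff.find({}, {'id': 1, 'counts': 1}):
    for element in doc['counts']:
        computed_value = element['a'] + element['b']  # placeholder field names
        requests.append(UpdateOne(
            {'id': doc['id'], 'counts.timestamp': element['timestamp']},
            {'$set': {'counts.$.total': computed_value}}))
if requests:
    db.stuff.bulk_write(requests, ordered=False)  # one round trip per batch instead of one per subdocument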

Related

Database schema - unable to link the tables in a way that suits the program

I have created a Python script to scrape documents for keywords. It runs on a server and a cronjob makes sure that searches are performed multiple times per day (with different keywords at different hours).
To store the results I created the following table:
TABLE: 'SEARCHES'
search_date(string)
number_of_results(integer)
keywords_used(string)
^-- I created a single string from all keywords
All of this was easy to implement in Python/SQLite. However, now I want to track the number of results per individual keyword.
I have already created a 'keyword' table:
TABLE: 'KEYWORDS'
word(string)
total_hits(integer)
last_used(string)
However, I am having trouble coming up with a way of linking the two tables so that keywords can be tied to searches. Presumably the 'searches' table would have a foreign key pointing to the keywords, but there can be as many as 10 keywords per search and only one foreign key column.
I looked into many-to-many relations, but as I understand it, this creates a large number of rows containing both 'search_id' and 'keyword_id', whereas all I need is one row per search.
When the program is finished I want to create a GUI frontend and be able to list all searches that have been performed in a list/table, showing not just the keywords that were used but information like the search date as well, one line per search.
I also want to create a separate overview for the individual keywords, showing their effectiveness.
I'm just unable to come up with a database schema to accommodate this and could use some help to point me in the right direction.
I would suggest creating a "Matches" table that is a child of "Searches". Add a "Search ID" field to the Searches table to support the foreign key. The "Matches" table would hold the Search ID, each individual keyword, and perhaps the total hits for that keyword.
Then you can match "Matches" to "Keywords", and go from "Matches" to "Searches" using Search ID.
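A minimal sqlite3 sketch of that layout (the table and column names here are illustrative, not taken from the original script):
import sqlite3

conn = sqlite3.connect('searches.db')  # placeholder filename
conn.executescript("""
CREATE TABLE IF NOT EXISTS searches (
    search_id         INTEGER PRIMARY KEY,
    search_date       TEXT,
    number_of_results INTEGER
);
CREATE TABLE IF NOT EXISTS keywords (
    word       TEXT PRIMARY KEY,
    total_hits INTEGER,
    last_used  TEXT
);
CREATE TABLE IF NOT EXISTS matches (
    search_id INTEGER REFERENCES searches(search_id),
    word      TEXT    REFERENCES keywords(word),
    hits      INTEGER
);
""")

# One line per search for the GUI: join matches back and concatenate the keywords.
rows = conn.execute("""
SELECT s.search_date, s.number_of_results, GROUP_CONCAT(m.word, ', ')
FROM searches AS s LEFT JOIN matches AS m ON m.search_id = s.search_id
GROUP BY s.search_id
""").fetchall()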

Dynamodb - query if a list contains

I'm fairly new to NoSQL. Using Python/Boto, but this is a fairly general question. I'm currently trying to switch a project from MongoDB to DynamoDB and seeking some advice on DynamoDB and its capacity to query whether a list contains a certain string. I have been searching for the past day or so, but I'm starting to worry that it doesn't have this facility, other than using scan, which is terribly slow considering the db will be queried thousands of times on updates. Similar unanswered question here.
I understand primary keys can only be N, S or B and not something like String Set (SS) which would have been useful.
The data is fairly simple and would look something like this. I'm looking for the most efficient way to query the db based on the tag attribute for entries that include 'string1' OR 'string2'. Again, I don't want to use scan but am willing to consider normalization of the data structure if there is a best practice in dynamodb.
{
    id: <some number used as a primary key>,
    tags: ['string1', 'string2'...],
    data: {some JSON object}
}
From what I've read, even using global secondary indexes this doesn't seem possible, which is strange, since that would make DynamoDB useful only for the simplest queries. Hoping I'm missing something.
In MongoDB, you have multikey indices, but not in DynamoDB.
I think you'd need to solve it the way you would in a relational database: create a many-to-many relation table with the tag as your hash key and the entry id as your range (sort) key, and find some way to keep that relation table in sync with your entries table.
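A hedged boto3 sketch of that layout (the table name 'entry_tags' and the attribute names are assumptions for illustration):
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
relation = dynamodb.Table('entry_tags')  # hash key: 'tag', range key: 'entry_id'

def entry_ids_for_tags(tags):
    """Return the ids of entries tagged with any of the given tags (OR semantics)."""
    ids = set()
    for tag in tags:
        # Query by partition key -- no scan needed.
        resp = relation.query(KeyConditionExpression=Key('tag').eq(tag))
        ids.update(item['entry_id'] for item in resp['Items'])
        # (Pagination via LastEvaluatedKey omitted for brevity.)
    return ids

matching = entry_ids_for_tags(['string1', 'string2'])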

Delete documents from ElasticSearch index in python

Using elasticsearch-py, I would like to remove all documents from a specific index, without removing the index. Given that delete_by_query was moved to a separate plugin, what is the best way to go about this?
It is highly inefficient to delete all the docs with delete-by-query. A more direct and correct course of action is:
Getting the current mappings (assuming you are not using index templates)
Dropping the index with DELETE /indexname
Creating the new index and the mappings.
This will take a second; deleting by query will take much, much more time and cause unnecessary disk I/O.
Use a Scroll/Scan API call to gather all document IDs and then call batch delete on those IDs. This is the recommended replacement for the Delete By Query API according to the official documentation.
EDIT: Information was requested on doing this specifically with elasticsearch-py. Here is the documentation for the helpers. Use the Scan helper to scan through all documents, and use the Bulk helper with the delete action to delete all the IDs.
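A minimal elasticsearch-py sketch of that scan-then-bulk-delete approach (the index name and client settings are placeholders):
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch()  # defaults to localhost:9200

def delete_actions(index):
    # scan() streams every hit; we only need the ids, not the _source.
    for hit in scan(es, index=index, query={"query": {"match_all": {}}}, _source=False):
        yield {
            "_op_type": "delete",
            "_index": hit["_index"],
            "_type": hit.get("_type", "_doc"),  # only needed on older ES versions
            "_id": hit["_id"],
        }

bulk(es, delete_actions("my-index"))  # 'my-index' is a placeholder index name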

How do I transform every doc in a large Mongodb collection without map/reduce?

Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection of roughly 10 million records (about 10 GB). Specifically, I want to apply a geoip transform to the ip field in every doc and either append the resulting record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do the last part).
The major reason I believe I cant use map-reduce is I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and is going to get progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), perhaps fetching only a few fields (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, Mongo will have to move the document to another place, so you might be better off creating a new document.
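A hedged pymongo sketch of that cursor approach (the lookup_geoip() call, collection names and field names are placeholders for illustration):
from pymongo import MongoClient

db = MongoClient()['mydb']  # placeholder database name
batch = []

# Project only _id and ip; tune batch_size if the default is not good enough.
for doc in db.records.find({}, {'_id': 1, 'ip': 1}, batch_size=1000):
    geo = lookup_geoip(doc['ip'])  # hypothetical geoip library call
    batch.append({'record_id': doc['_id'], 'ip': doc['ip'], 'city': geo.get('city')})
    if len(batch) >= 1000:
        db.geo.insert_many(batch)  # write to a separate collection, avoiding document moves
        batch = []

if batch:
    db.geo.insert_many(batch)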
Actually I am also attempting another approach in parallel (as plan B) which is to use mongoexport. I use it with --csv to dump a large csv file with just the (id, ip) fields. Then the plan is to use a python script to do a geoip lookup and then post back to mongo as a new doc on which map-reduce can now be run for count etc. Not sure if this is faster or the cursor is. We'll see.

Searching across multiple tables (best practices)

I have a property management application consisting of the following tables:
tenants
landlords
units
properties
vendors-contacts
Basically, I want one search field to search them all rather than having to select which category I am searching. Would this be an acceptable solution (technology-wise)?
Will searching across 5 tables be OK in the long run and not bog down the server? What's the best way of accomplishing this?
Using PostgreSQL
Why not create a view that is a union of the tables, aggregating the columns you want to search on into one, and then search on that aggregated column?
You could do something like this:
select 'tenants:' || t.id::text, <shared fields> from tenants as t union
select 'landlords:' || l.id::text, <shared fields> from landlords as l union
...
This requires some logic to be embedded from the client querying; it has to know how to fabricate the key that it's looking for in order to search on a single field.
That said, it's probably better if you just have a separate column which contains a "type" value (e.g. landlord, tenant) and then filter on both the type and the ID, as it will be computationally less expensive (and can be optimized better).
You want to use the built-in full text search or a separate product like Lucene. This is optimised for unstructured searches over heterogeneous data.
Also, don't forget that normal indexes cannot be used for something LIKE '%...%'. A full text search engine will also be able to do efficient substring searches.
I would suggest using a specialized full-text indexing tool like Lucene for this. It will probably be easier to get up and running, and the result is faster and more featureful too. Postgres full text indexes will be useful if you also need structured search capability on top of this or transactionality of your search index is important.
If you do want to implement this in the database, something like the following scheme might work, assuming you use surrogate keys:
for each searchable table create a view that has the primary key column of that table, the name of the table and a concatenation of all the searchable fields in that table.
create a functional GIN or GiST index on the underlying table over the to_tsvector() of the exact same concatenation.
create a UNION ALL over all the views to create the searchable view.
After that you can do the searches like this:
SELECT id, table_name, ts_rank_cd(body, query) AS rank
FROM search_view, to_tsquery('search&words') query
WHERE query @@ body
ORDER BY rank DESC
LIMIT 10;
You should be fine, and there's really no other good (easy) way to do this. Just make sure the fields you are searching on are properly indexed though.
