I am trying to iterate over a Haystack SearchQuerySet, but it throws this error:
Result window is too large, from + size must be less than or equal to: [10000] but was [11010]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.
Is there a way to iterate over all indexed elements? (let's say I have several million records).
max_result_window is an index setting that you can change if you want, but most of the time you shouldn't have to: if you want to iterate over all your documents, the regular search API is not the way to go.
Use the scan and scroll API instead:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
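A minimal sketch of iterating over everything with the elasticsearch-py scan helper (outside Haystack), which drives the scroll API for you; the index name and client settings are placeholders:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()  # adjust hosts to your setup

# scan() pages through the whole index via the scroll API and yields every hit
for hit in scan(es, index="myindex", query={"query": {"match_all": {}}}):
    print(hit["_id"], hit["_source"])  # replace with your own processing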
And a personal note: I use Elasticsearch with Django, and I found Haystack harder to work with than elasticsearch-dsl. Have a look at elasticsearch-dsl-py: https://github.com/elastic/elasticsearch-dsl-py
I'm trying to get more than 50,000 records from LDAP (with python-ldap and page control tools).
My search filter looks like this:
(|(field=value_1)(field=value_2)...(field=value_50000))
But the request takes more than 15 minutes, and I'm only retrieving 10 attributes for these records.
Is this expected for such a large request, or should I change the filter?
You should refine your search base and make it as close as possible to what you are searching for; for example, instead of querying dc=company,dc=com, use ou=people,dc=company,dc=com.
You can also build an index on the field you are searching, and enable caching on your LDAP server.
Finally, concerning your search filter: since you are querying the same attribute, you can try a range filter such as
(&(field>=MinValue)(field<=MaxValue))
which is far better than matching every single value individually.
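A rough sketch of both ideas with python-ldap; the host, credentials, base DN, attribute name, and bounds are all placeholders:
import ldap

conn = ldap.initialize("ldap://ldap.example.com")  # placeholder host
conn.simple_bind_s("cn=admin,dc=company,dc=com", "secret")  # placeholder credentials

# One range filter instead of 50,000 OR clauses, restricted to the ou=people subtree
results = conn.search_s(
    "ou=people,dc=company,dc=com",            # narrowed search base
    ldap.SCOPE_SUBTREE,
    "(&(field>=MinValue)(field<=MaxValue))",  # placeholder attribute and bounds
    ["cn", "mail", "uid"],                    # fetch only the attributes you need
)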
Using elasticsearch-py, I would like to remove all documents from a specific index without removing the index itself. Given that delete_by_query was moved to a separate plugin, what is the best way to go about this?
Deleting all the docs with delete-by-query is highly inefficient. A more direct and correct approach is:
1. Get the current mappings (assuming you are not using index templates).
2. Drop the index with DELETE /indexname.
3. Create the new index and apply the mappings.
This takes a second; the former takes much, much longer and causes unnecessary disk I/O.
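A small sketch of those steps with the elasticsearch-py indices API; the index name is a placeholder, and custom settings/analyzers would need the same treatment via get_settings:
from elasticsearch import Elasticsearch

es = Elasticsearch()
index = "myindex"  # placeholder

mappings = es.indices.get_mapping(index=index)[index]  # {"mappings": {...}}
es.indices.delete(index=index)
es.indices.create(index=index, body=mappings)  # recreate empty, with the old mappings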
Use a scroll/scan call to gather all document IDs, then issue a bulk delete on those IDs. This is the recommended replacement for the Delete By Query API according to the official documentation.
EDIT: requested information for doing this specifically in elasticsearch-py. Here is the documentation for the helpers. Use the scan helper to scan through all documents, and the bulk helper with the delete action to delete all the IDs.
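A minimal sketch of that combination, assuming 2.x-era elasticsearch-py helpers (where documents still have a type); the index name is a placeholder:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch()

def delete_all_docs(index):
    # Stream every document id with scan(), then feed delete actions to bulk()
    actions = (
        {"_op_type": "delete", "_index": hit["_index"], "_type": hit["_type"], "_id": hit["_id"]}
        for hit in scan(es, index=index, query={"query": {"match_all": {}}}, _source=False)
    )
    bulk(es, actions)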
I have an appengine project written in Python.
I use a model with a tags = ndb.StringProperty(repeated=True).
What I want is, given a list of tags, search for all the objects that have every tag in the list.
My problem is that the list may contain any number of tags.
What should I do?
When you make a query on a list property, it actually creates a set of subqueries at the datastore level. The maximum number of subqueries that can be spawned by a single query is 30, so if your list has more than 30 elements you will get an exception.
To tackle this, you either have to change your data model or create multiple queries based on the number of list elements you have and then combine the results; both approaches need to be handled by your code (see the sketch below).
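One way to sketch the "multiple queries, combine the results" idea with NDB; the chunk size of 30 and the key-intersection strategy are assumptions, not a tested recipe:
from google.appengine.ext import ndb

class YourModel(ndb.Model):  # as described in the question
    tags = ndb.StringProperty(repeated=True)

def keys_matching_all_tags(tags, chunk_size=30):
    # Run one query per chunk of tags and intersect the resulting key sets in code
    matching = None
    for start in range(0, len(tags), chunk_size):
        qry = YourModel.query()
        for tag in tags[start:start + chunk_size]:
            qry = qry.filter(YourModel.tags == tag)
        keys = set(qry.fetch(keys_only=True))
        matching = keys if matching is None else matching & keys
    return matching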
Update: In case you need all the tags in the list to match the list property in your model, you can create your basic query and then append AND filters in a loop (as marcadian describes). For example:
qry = YourModel.query()
for tag in tags:
    qry = qry.filter(YourModel.tags == tag)
But, as I mentioned earlier, you should be careful about the length of the list property in your model and about your index configuration, in order to avoid problems like index explosion. For more information, you may check:
Datastore Indexes
Index Selection and Advanced Search
I have around 5000+ videos in my database, and I have created a page http://mysite.com/videos to list all of them. Now I am implementing pagination so that only 20 videos are listed on each page, e.g.
http://mysite.com/videos?page=1 shows the first 20 videos, http://mysite.com/videos?page=2 shows the next 20 videos.
I am having trouble choosing the best way to implement pagination. I thought of calling table.scan() each time a new page is requested and then selecting only the required items with some Python logic, but that seems quite expensive.
I am using Python / Django with boto library.
In DynamoDB you can execute queries with a limit. From the documentation at
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
you can read:
ExclusiveStartKey
The primary key of the first item that this
operation will evaluate. Use the value that was returned for
LastEvaluatedKey in the previous operation.
The data type for ExclusiveStartKey must be String, Number or Binary.
No set data types are allowed.
Type: String to AttributeValue object map
Required: No
And
Limit
The maximum number of items to evaluate (not necessarily the number of matching items). If Amazon DynamoDB processes the number of
items up to the limit while processing the results, it stops the
operation and returns the matching values up to that point, and a
LastEvaluatedKey to apply in a subsequent operation, so that you can
pick up where you left off. Also, if the processed data set size
exceeds 1 MB before Amazon DynamoDB reaches this limit, it stops the
operation and returns the matching values up to the limit, and a
LastEvaluatedKey to apply in a subsequent operation to continue the
operation. For more information see Query and Scan in the Amazon
DynamoDB Developer Guide.
Type: Number
Required: No
You don't provide any info about how the keys of your table are structured, but the approach would be to query the table for the elements matching your key (and range key if applicable), with the limit set to 20.
The result will contain a LastEvaluatedKey, which you pass as the ExclusiveStartKey of the next query, again with the limit set to 20.
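A rough sketch of that cursor-style pagination, shown with boto3 for brevity (the older boto library exposes the same ExclusiveStartKey/LastEvaluatedKey mechanism with different call shapes); the table name and hash key are placeholders:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("videos")  # placeholder table name

def get_page(start_key=None, page_size=20):
    kwargs = {
        "KeyConditionExpression": Key("category").eq("all"),  # placeholder hash key
        "Limit": page_size,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.query(**kwargs)
    # Pass LastEvaluatedKey back in as start_key to fetch the following page
    return resp["Items"], resp.get("LastEvaluatedKey")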
Here are some options:
1. You can pre-load all video objects when the application starts up and then do in-memory pagination however you want. 5000+ objects shouldn't be a big deal.
2. You can fetch the first page and then asynchronously fetch the rest (via scan), and again paginate in memory.
3. You can create an index table that stores an entry per page containing the IDs of that page's videos; to fetch the videos of a page you then make two calls:
3.1 Fetch the page by page ID (a simple get operation). This returns the list of video IDs that should be on that page.
3.2 Fetch all the videos from 3.1 with a multi-get operation.
For a similar use case, we loaded all metadata into JavaScript objects and do pagination and sorting there, and the end result for the user is just nice (fast and responsive). We also use the trick of fetching the first page and then the whole content again (as DynamoDB doesn't support cursors at the moment).
Limit is not what you think. Here's what I suggest:
Using DynamoDBMapper, issue
int numRows = mapper.count(SomeClass.class, scanExpression);
to efficiently get the number of rows in your table (SomeClass being your mapped item class).
Then run
PaginatedScanList<SomeClass> result = mapper.scan(SomeClass.class, scanExpression);
to get the list. The key here is that PaginatedScanList is lazily loaded: be careful not to call .size() on the result, as that would load all the rows. Use .get() to load only the rows you need.
Iterate over the PaginatedScanList like this:
int offset = startPage * pageSize;
List<SomeClass> page = new ArrayList<SomeClass>();
for (int i = 0; i < pageSize; i++) {
    page.add(result.get(offset + i));
}
Check for out-of-bounds etc. Hope that helps.
My JSON documents (called "i") have subdocuments (called "elements").
I am looping through these subdocuments and updating them one at a time. However, to do so (once the value I need is computed), MongoDB has to scan all the documents in the database, and then all the subdocuments, to find the one it needs to update.
I am having major time issues: I have ~3000 documents and this takes about 4 minutes.
Is there a quicker way to do this, so that MongoDB doesn't have to scan all the documents on every update inside the loop?
Here is the code:
for i in db.stuff.find():
for element in i['counts']:
computed_value = element[a] + element[b]
db.stuff.update({'id':i['id'], 'counts.timestamp':element['timestamp']},
{'$set': {'counts.$.total':computed_value}})
I am identifying the overall document by "id" and then the subdocument by its timestamp (which is unique to each subdocument). I need to find a quicker way than this. Thank you for your help.
What indexes do you have on your collection? This could probably be sped up by creating an index on the embedded documents. You can do this using dot notation -- there's a good explanation and example here.
In your case, you'd do something like
db.stuff.ensureIndex({ "counts.timestamp": 1 });
(using the actual field path from your update filter). This will make your searches through the embedded documents run much faster.
Your update is based on id (I assume this is different from MongoDB's default _id). Put an index on your id field.
Do you want to set the new field for all documents in the collection, or only for documents matching given criteria? If it's only for matching documents, use a query operator (with an index if possible).
Don't fetch the full document; fetch only the fields that are actually used (see the sketch below).
What is your average document size? Use explain() and mongostat to understand what the actual bottleneck is.
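A minimal sketch of the projection advice combined with batched writes, assuming PyMongo 3.x (for bulk_write) and the field names from the question:
from pymongo import MongoClient, UpdateOne

db = MongoClient().mydb  # placeholder database name

ops = []
# Project only the fields the loop actually uses instead of fetching full documents
for doc in db.stuff.find({}, {'id': 1, 'counts': 1}):
    for element in doc['counts']:
        computed_value = element['a'] + element['b']  # placeholder field names
        ops.append(UpdateOne(
            {'id': doc['id'], 'counts.timestamp': element['timestamp']},
            {'$set': {'counts.$.total': computed_value}},
        ))
if ops:
    db.stuff.bulk_write(ops, ordered=False)  # far fewer round trips than one update per subdocument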