LDAP search takes too long - python

I'm trying to get more than 50.000 records from LDAP (with python-ldap and page control tools).
I have searching filter, which is
(|(field=value_1) (field=value_2)...(field=value_50000)
But this request taking more than 15 minutes. I'm taking 10 attributes from LDAP for these records.
Could you please tell me if is it okay for some large request or I can try to change filter?

You should refine your search base, and make it the closet possible to what you are searching for, for example, instead of querying dc=company,dc=com use ou=people,dc=company,dc=com.
You can also build an index of the field you are searching for, and you can also enable cache for your ldap, and finnally concerning your search filter, if you query the same attribute you can try something like:
&"(field>=MinValue)(field=<MaxValue)"
It's way better the matching every single attribute.

Related

How/Where can I get a list of all resource names in DbPedia?

I am trying to build a question answering system that retrieves answer from DbPedia. At a certain step, I need to find the appropriate resource name to perform the query operation. For that, I was planning to store all the resource name in a database and retrieve the most relavant ones.
Now I'm stuck at this step. I'm new to this whole DbPedia thing and I'm not actually sure if there is a way to get all the resource list of DbPedia.
Previously, I had used python's wikipedia library to get the resource name. But it does not return any value for certain test cases. And so I need to change my approach.

"Get" document from cosmosdb by id (not knowing the _rid)

As MS Support recently told me that using a "GET" is much more efficient in RUs usage than a sql query. I'm wondering if I can (within the azure.cosmos python package or a custom HTTP request to the REST API) get a document by its unique 'id' field (for which I generated a GUIDs) without an SQL Query.
Every example shown are using the link/path of the doc which is built with the '_rid' metadata of the document and not the 'id' field set when creating the doc.
I use a bulk upsert stored procedure I wrote to create my new documents and never retrieve the metadata for each one of them (I have ~ 100 millions docs) so retrieving the _rid would be equivalent to retrieving the doc itself.
The reason that the ReadDocument method is so much more efficient than a SQL query is because it uses _rid instead of a user generated field, even the required id field. This is because the _rid isn't just a unique value, it also encodes information about where that document is physically stored.
To give an example of how this works, let's say you are explaining to someone where a party is this weekend. You could use the name that you use for the house "my friend Ryan's house" or you could use the address "123 ThatOne Street Somewhere, WA 11111". They both are unique identifiers, but for someone trying to get there one is way more efficient than the other.
Telling someone to go to your friend's house is like using your own id. It does map to a specific house, but the person will still need to find out where that physically is to get there. Using the address is like working with the _rid field. Based on that information alone they can get to the party location. Of course, in the real world the person would probably need directions, but the data storage in a database is a lot more organized than most city streets so an address is sufficient to go retrieve the document.
If you want to take advantage of this method you will need to find a way to work with the _rid field.

Django Haystack: Adding custom fields (fl=) to search query string (solr backend)

I am trying to use spatial queries to return a resultset sorted by distance which also has the distance information available for each result.
Getting the distance sorting is easy enough with something like this in my SearchForm:
sqs = SearchQuerySet().dwithin('location', self.user_position, D(mi=100)).distance('location', self.user_position).order_by('distance')
The haystack docs say that by adding a call to .distance() it will make distance information available in the results (http://django-haystack.readthedocs.org/en/master/spatial.html#id3) however I took the query string it generates from the logs and pass it to solr through a web browser and it doesn't contain the distance information for each document. Taking a look at the build_search_kwargs method for the solr_backend it appears to be unimplemented. Am I missing something, why would it be in the docs if it's not implemented?
All I have to do to the query to get what I want is add '+dist:geodist()' to the fl parameter as per the solr docs but I can't find a nice way to do it with haystack. Has anyone done this?
Cheers.

use standard datastore index or build my own

I am running a webapp on google appengine with python and my app lets users post topics and respond to them and the website is basically a collection of these posts categorized onto different pages.
Now I only have around 200 posts and 30 visitors a day right now but that is already taking up nearly 20% of my reads and 10% of my writes with the datastore. I am wondering if it is more efficient to use the google app engine's built in get_by_id() function to retrieve posts by their IDs or if it is better to build my own. For some of the queries I will simply have to use GQL or the built in query language because they are retrieved on more than just and ID but I wanted to see which was better.
Thanks!
Are you doing efficient caching? (or any caching at all).
Also, if you're using that many writes for 300 posts, seems like you might have a problem with your models. Have you looked at the Datastore viewer to seem how many writes you use per entity?
You might read the docs on Exploding indexes, maybe that's part of your problem?
It's way better to use get_by_id(). It finds the exact object, and costs way less (counts as a query with only one entity).
I'd suggest using pre-existing code and building around that in stead of re-inventing the wheel.

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must have and may have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more that 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom built text search engine and part of it's query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on a DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be done used in combination the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying accomplishing.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.

Categories

Resources