I'm wondering which operation I should invoke after adding a new entity to the database to make it searchable with Haystack:
should I only update the index?
should I rebuild the whole index?
What's problematic is that new entities will be added frequently, and there may be a very large number of entities in the DB.
If you're adding new rows to your database, then update_index should be enough.
From the Haystack docs:
The conventional method is to use SearchIndex in combination with cron
jobs. Running a ./manage.py update_index every couple hours will keep
your data in sync within that timeframe and will handle the updates in
a very efficient batch.
If you added a new field to your search index, then you would need to run rebuild_index:
If you have an existing SearchIndex and you add a new field to it,
Haystack will add this new data on any updates it sees after that
point. However, this will not populate the existing data you already
have.
In order for the data to be picked up, you will need to run
./manage.py rebuild_index. This will cause all backends to rebuild the
existing data already present in the quickest and most efficient way.
I am using PostgreSQL for my Django app.
I managed to delete almost 500,000 rows, but the size of my DB didn't shrink significantly.
I deleted them with something like lots.objects.filter(id__in=[ids]).delete(), in chunks (because it's too hard to delete so many rows in one query).
Some columns have db_index=True, so I think the indexes were not deleted.
Is there a way to also delete the index entries for deleted objects from Django?
Maybe there is also a way to see unused indexes from Django?
None of this has anything to do with Django. If an item is deleted from a database, it is always automatically deleted from any indexes - otherwise indexing just wouldn't work.
Normally you should let Postgres itself determine the size of the database files. Deleted items are removed when a VACUUM operation is done; again, normally Postgres will do this via a regularly scheduled daemon. If you need to specifically recover space, then you can run VACUUM manually. See the docs.
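If you do want to reclaim the space explicitly from Django, here is a minimal sketch; the table name myapp_lots is a hypothetical placeholder for whatever name Django generated for your model, and since VACUUM cannot run inside a transaction block this relies on Django's default autocommit mode:

from django.db import connection

with connection.cursor() as cursor:
    # Plain VACUUM marks dead rows as reusable; VACUUM FULL rewrites the table
    # and returns the space to the OS, but takes an exclusive lock while it runs.
    cursor.execute("VACUUM ANALYZE myapp_lots")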
I want to use caching in Django and I am stuck on how to go about it. I have data in some specific models which are write-intensive; records will get added to them continuously. Each user has some specific data in the model, similar to an orders table.
Since my model is write-intensive, I am not sure how effective Django's caching frameworks will be. I tried Django's per-view caching, and I am trying to develop a view that first picks up data from the cache. Then I will have another call that brings in the data added to the model after the cache was populated. What I want to do is add the updated data to the original cached data and store it again.
It's like I don't want to expire my cache; I just want to keep adding to my existing cached data. Maybe once every 3 hours I can clear it.
Is what I am doing right? Are there better ways than this? Can I really add items to an existing cache entry?
I will be very glad for your help.
You ask about "caching", which is a really broad topic, and the answer is always a mix of opinion, style, and the specific app's requirements. Here are a few points to consider.
If the data is per user, you can cache it per user:
from django.core.cache import cache

# Use the user's id as the cache key for that user's data
cache.set(request.user.id, "foo")
cache.get(request.user.id)
The common practice is to keep a database flag that tells you whether the user's data has changed since it was cached. Before you fetch the data from the cache, check only this flag in the DB. If the flag says nothing changed, get the data from the cache. If it did change, pull from the DB, replace the cache, and reset the flag.
The flag check should be fast and simple: one table, indexed by user.id, with a boolean flag field. This squeezes a lot of index rows into a single DB page and enables fast fetching of a single one-field row. Yet you still get a persistent, up-to-date main store that prevents the use of stale cache data. You can check this flag in a middleware.
You can run expiry in many ways: clear the cache when the user logs out, run a cron script that clears items, or let the cache backend expire items. If you check the flag before you use the cache, there is no issue in keeping items in the cache except space, and caching backends handle that. If you use Django's simple file cache (which is easy, simple, and zero config), you will have to clear the cache yourself; a simple cron script will do.
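To make the flag idea concrete, here is a minimal sketch of the pattern rather than a drop-in implementation; the UserCacheFlag model, the cache key format, and the order_set relation are hypothetical and assume an Order model with a foreign key to the user:

from django.core.cache import cache
from django.db import models

class UserCacheFlag(models.Model):
    user_id = models.IntegerField(unique=True, db_index=True)
    is_stale = models.BooleanField(default=True)

def get_user_orders(user):
    key = "orders:%d" % user.id  # hypothetical per-user cache key
    flag, _ = UserCacheFlag.objects.get_or_create(user_id=user.id)
    data = cache.get(key)
    if data is None or flag.is_stale:
        data = list(user.order_set.all())  # pull fresh data from the DB
        cache.set(key, data, 3 * 60 * 60)  # keep it for up to 3 hours
        UserCacheFlag.objects.filter(user_id=user.id).update(is_stale=False)
    return data

Whatever view adds new orders only has to set is_stale=True for that user, which is the cheap write that keeps the cache honest.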
I have an NDB model. Once the data in the model becomes stale, I want to remove stale data items from searches and updates. I could have deleted them, which is explained in this SO post, if not for the need to analyze old data later.
I see two choices:
adding a boolean status field and simply marking entities as deleted
moving entities to a different model
My understanding of the trade-off between these two options:
mark-deleted is faster
mark-deleted is more error-prone: having an extra column would require modifying all the queries to exclude entities that are marked deleted. That will increase complexity and the probability of bugs.
Question:
Can the move-entities option be made fast enough to be comparable to mark-deleted?
Any sample code as to how to move entities between models efficiently?
Update: 2014-05-14, I decided for the time being to use mark-deleted. I figure there is an additional benefit of fewer RPCs.
Related:
How to delete all entities for NDB Model in Google App Engine for python?
You can use a combination of the solutions you propose, although I think that would be over-engineering.
1) First, write a task queue job that updates all of your existing entities with the new is_deleted field, defaulting to False. This prevents the previously stored entities from returning an error when you ask them whether they are deleted.
2) Write your queries at the model level, so you don't have to alter them every time you change your model; just pass the extra parameter you want to filter on when you make the relevant query. You can get an idea from the models in the bootstrap project gae-init. You can then query with is_deleted == False (see the sketch at the end of this answer).
3) BigTable's performance will not be affected whether you are querying 10 entities or 10 million, but if you want to move the deleted ones into a new entity model, you can create a cron job that, at the end of the day or so, copies them somewhere else and removes the originals. Don't forget that this will use your quota, and you might literally end up paying for the cleanup.
Keep in mind also that if there are any dependencies on the entities you move, you will have to update them as well. So in my opinion it's better to leave them flagged, and index your flag.
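Here is a minimal sketch of point 2 with the mark-deleted approach in NDB; the Item model and its field names are hypothetical, not from your code:

from google.appengine.ext import ndb

class Item(ndb.Model):
    data = ndb.StringProperty()
    is_deleted = ndb.BooleanProperty(default=False, indexed=True)

    @classmethod
    def query_active(cls):
        # Centralize the filter so callers can never forget to exclude
        # entities that are marked as deleted.
        return cls.query(cls.is_deleted == False)

Callers then use Item.query_active() everywhere instead of Item.query(), so the soft-delete rule lives in one place.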
I am using Django with a bunch of models linked to a MySQL database. Every so often, my project needs to generate a new number (sequentially, although this is not important) that becomes an ID for rows in one of the database tables. I cannot use the auto-increment feature in the models because multiple rows will end up having this number (it is not the primary key). Thus far, I have been using global variables in views.py, but every time I change anything and save, the server restarts and the variables are reset. What is the best way to generate a new ID like this (without it being reset all the time), preferably without writing to a file every time? Thanks in advance!
One way is to create a table in your database and store those values in it. Another way is to use HTTP cookies to save the values if you want to avoid the server-reset problem, though I do not prefer this approach.
You can follow this link to set and read values from cookies in Django:
https://docs.djangoproject.com/en/dev/topics/http/sessions/#s-setting-test-cookies
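For the counter-table option, here is a minimal sketch, assuming MySQL with a transactional engine such as InnoDB; the Counter model and the next_id helper are hypothetical names:

from django.db import models, transaction

class Counter(models.Model):
    name = models.CharField(max_length=50, unique=True)
    value = models.BigIntegerField(default=0)

def next_id(name="order_number"):
    # Lock the counter row so concurrent requests never hand out the same number.
    with transaction.atomic():
        counter, _ = Counter.objects.select_for_update().get_or_create(name=name)
        counter.value += 1
        counter.save(update_fields=["value"])
        return counter.value

Because the value lives in the database rather than in a module-level global, it survives code reloads and server restarts.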
1. I have a list of data and an SQLite DB filled with past data, along with some stats on each item. I have to do the following operations with them:
Check if each item in the list is present in the DB; if not, collect some stats on the new item and add it to the DB.
Check if each item in the DB is in the list; if not, delete it from the DB.
I cannot just create a new DB, because I have other processing to do on the new items and the missing items.
In short, I have to update the DB with the new data in the list. What is the best way to do it?
2. I had to use SQLite with Python threads, so I put a lock around every DB read and write operation. Now it has slowed down DB access. What is the overhead of a thread lock operation? And is there any other way to use the DB with multiple threads?
Can someone help me with this? I am using Python 3.1.
You don't need to check anything: just use INSERT OR IGNORE for the first case (make sure you have a corresponding UNIQUE constraint so the INSERT cannot create duplicates) and DELETE FROM tbl WHERE data NOT IN ('first item', 'second item', 'third item') for the second case.
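A minimal sketch of both statements with Python's sqlite3 module; the items table and data column are hypothetical, and data needs the UNIQUE constraint for INSERT OR IGNORE to skip duplicates:

import sqlite3

new_items = ["first item", "second item", "third item"]
conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (data TEXT UNIQUE)")

# 1. Add list items that are not already in the DB.
conn.executemany("INSERT OR IGNORE INTO items (data) VALUES (?)",
                 [(item,) for item in new_items])

# 2. Remove DB rows that are no longer in the list.
placeholders = ",".join("?" * len(new_items))
conn.execute("DELETE FROM items WHERE data NOT IN (%s)" % placeholders, new_items)
conn.commit()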
As the official SQLite FAQ states, "Threads are evil. Avoid them." As far as I remember, there have always been problems with threads + SQLite. It's not that SQLite doesn't work with threads at all; just don't rely on this feature too much. You can also have a single thread work with the database and pass all queries to it first, but the effectiveness of such an approach depends heavily on how your program uses the database.
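Here is a minimal sketch of that single-database-thread idea: one worker owns the SQLite connection and other threads hand it work through a queue (all names here are made up for the example):

import queue
import sqlite3
import threading

query_queue = queue.Queue()

def db_worker(path="example.db"):
    conn = sqlite3.connect(path)
    while True:
        sql, params, result_q = query_queue.get()
        if sql is None:  # sentinel value tells the worker to stop
            break
        result_q.put(conn.execute(sql, params).fetchall())
        conn.commit()
    conn.close()

worker = threading.Thread(target=db_worker)
worker.daemon = True
worker.start()

# Any thread can now submit a query without touching the connection itself:
results = queue.Queue()
query_queue.put(("SELECT * FROM items", (), results))
rows = results.get()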