1. I have a list of data and a SQLite DB filled with past data, along with some stats on each item. I have to do the following operations with them:
Check whether each item in the list is present in the DB. If not, collect some stats on the new item and add them to the DB.
Check whether each item in the DB is in the list. If not, delete it from the DB.
I cannot just create a new DB, because I have other processing to do on the new items and the missing items.
In short, I have to update the DB with the new data in the list. What is the best way to do it?
2. I had to use SQLite with Python threads, so I put a lock around every DB read and write operation. This has slowed down DB access. What is the overhead of a thread lock operation? And is there any other way to use the DB with multiple threads?
Can someone help me with this? I am using Python 3.1.
You do not need to check anything yourself. In the first case, use INSERT OR IGNORE (just make sure the relevant columns have a unique constraint so that INSERT does not create duplicates). In the second case, use DELETE FROM tbl WHERE data NOT IN ('first item', 'second item', 'third item').
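Since the question also needs the new and missing items for further processing, here is a minimal sketch of one way to combine both steps with Python's built-in sqlite3 module. The table name `items`, its columns, and the `collect_stats` helper are hypothetical. The sketch computes the two groups with set differences and deletes by key instead of using NOT IN, because SQLite limits the number of bound parameters per statement and a long list could exceed it.

```python
import sqlite3

def collect_stats(item):
    # Hypothetical placeholder for the real statistics gathering.
    return len(item)

def sync(conn, current_items):
    """Make the items table match current_items; return (new, missing)."""
    current = set(current_items)
    in_db = {row[0] for row in conn.execute("SELECT data FROM items")}
    new, missing = current - in_db, in_db - current
    with conn:  # commit both statements together, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO items (data, stats) VALUES (?, ?)",
            [(item, collect_stats(item)) for item in new],
        )
        conn.executemany(
            "DELETE FROM items WHERE data = ?",
            [(item,) for item in missing],
        )
    return new, missing

# Usage with an in-memory database (assumed schema: data must be UNIQUE).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (data TEXT UNIQUE, stats INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [("a", 1), ("b", 1)])
new, missing = sync(conn, ["b", "c"])
remaining = sorted(row[0] for row in conn.execute("SELECT data FROM items"))
```

The returned `new` and `missing` sets can then be passed to whatever extra processing those items need.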
As the official SQLite FAQ puts it, "Threads are evil. Avoid them." As far as I remember, there have always been problems with threads and SQLite. It's not that SQLite doesn't work with threads at all; just don't rely on this feature much. You can also have a single thread work with the database and pass all queries to it, but the effectiveness of this approach depends heavily on how your program uses the database.
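The single-thread approach could be sketched roughly as follows, using only the standard library. The class name `DbWorker` and its methods are made up for illustration. One thread owns the connection, and every other thread hands its query over through a queue and waits for the reply.

```python
import queue
import sqlite3
import threading

class DbWorker(threading.Thread):
    """Owns the only SQLite connection; other threads submit queries to it."""

    def __init__(self, path):
        super().__init__(daemon=True)
        self.path = path
        self.requests = queue.Queue()

    def run(self):
        # The connection is created and used in this thread only.
        conn = sqlite3.connect(self.path)
        while True:
            item = self.requests.get()
            if item is None:  # sentinel: shut down
                break
            sql, params, reply = item
            try:
                rows = conn.execute(sql, params).fetchall()
                conn.commit()
                reply.put(rows)
            except sqlite3.Error as exc:
                reply.put(exc)  # hand the error back to the caller
        conn.close()

    def execute(self, sql, params=()):
        """Callable from any thread; blocks until the worker ran the query."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((sql, params, reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result

    def close(self):
        self.requests.put(None)
        self.join()

# Usage: five threads insert concurrently without any explicit lock.
worker = DbWorker(":memory:")
worker.start()
worker.execute("CREATE TABLE t (x INTEGER)")
threads = [
    threading.Thread(target=worker.execute,
                     args=("INSERT INTO t VALUES (?)", (i,)))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
rows = worker.execute("SELECT x FROM t ORDER BY x")
worker.close()
```

Each call still waits for its own result, so this mainly helps when the lock contention, rather than the queries themselves, is the bottleneck. Committing after every statement keeps the sketch simple; batching writes into fewer commits would likely be faster.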
I have a script that repopulates a large database and would generate id values from other tables when needed.
An example would be recording order information when given customer names only. I would check whether the customer exists in a CUSTOMER table. If so, I run a SELECT query to get their ID and insert the new record. Otherwise, I create a new CUSTOMER entry and get the Last_Insert_Id().
Since these values are duplicated a lot and I don't always need to generate a new ID, would it be better to store the ID => CUSTOMER relationship as a dictionary that gets checked before reaching the database, or should I make the script constantly requery the database? I'm thinking the first approach is best since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and the impact of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary stores its values in a hash table, so looking up a value should be quite efficient.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, using indexes for the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt, one possibility out of many -- for improving the application performance.
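To make the trade-off concrete, here is a small sketch of a get-or-create lookup that relies on an indexed name column and accepts an optional dictionary as a cache. It uses Python's built-in sqlite3 module as a stand-in for the actual database, and the `customer` table layout and the function name `get_customer_id` are assumptions. The cache is only safe as long as nothing else modifies the table while the script runs.

```python
import sqlite3

def get_customer_id(conn, name, cache=None):
    """Return the ID for name, inserting the customer if it is new."""
    if cache is not None and name in cache:
        return cache[name]
    # The UNIQUE constraint on name gives this lookup an index.
    row = conn.execute(
        "SELECT id FROM customer WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        customer_id = row[0]
    else:
        cur = conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        customer_id = cur.lastrowid
    if cache is not None:
        cache[name] = customer_id
    return customer_id

# Usage with an in-memory database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cache = {}
first = get_customer_id(conn, "Alice", cache)
second = get_customer_id(conn, "Bob", cache)
again = get_customer_id(conn, "Alice", cache)  # served from the cache
```

Passing `cache=None` gives the database-only variant, which makes it easy to measure whether the dictionary is worth the extra bookkeeping.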
I'm wondering which operation I should invoke after adding a new entity to the database to make it searchable with Haystack:
Should I only update the index?
Should I rebuild the whole index?
The problem is that new entities will be added frequently, and there may be a large number of entities in the DB.
If you're adding new rows to your database then update_index should be enough.
From the haystack docs:
The conventional method is to use SearchIndex in combination with cron
jobs. Running a ./manage.py update_index every couple hours will keep
your data in sync within that timeframe and will handle the updates in
a very efficient batch.
If you added a new field to your search index, then you would need to run rebuild_index:
If you have an existing SearchIndex and you add a new field to it,
Haystack will add this new data on any updates it sees after that
point. However, this will not populate the existing data you already
have.
In order for the data to be picked up, you will need to run
./manage.py rebuild_index. This will cause all backends to rebuild the
existing data already present in the quickest and most efficient way.
I'm working on a simple HTML scraper in Python 3.4, using peewee as the ORM (great ORM, by the way!). My script takes a bunch of sites, extracts the necessary data and saves it to the database. Every site is scraped in a separate process to improve performance, and the saved data should be unique. There can be duplicate data not only between sites but also within a particular site, so I want to store each item only once.
Example:
Post and Category have a many-to-many relation. During scraping, the same category appears multiple times in different posts. The first time, I want to save that category to the database (create a new row). If the same category shows up in a different post, I want to bind that post to the row already created in the DB.
My question is: do I have to use atomic updates/inserts (insert one post, save, get_or_create categories, save, insert new rows into the many-to-many table, save), or can I use bulk inserts somehow? What is the fastest solution to this problem? Maybe some temporary tables shared between processes, which would be bulk inserted at the end of the work? I'm using a MySQL DB.
Thanks for your answers and your time.
You can rely on the database to enforce unique constraints by adding unique=True to fields or multi-column unique indexes. You can also check the docs on get/create and bulk inserts:
http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-unique-constraints
http://docs.peewee-orm.com/en/latest/peewee/querying.html#get-or-create
http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts
http://docs.peewee-orm.com/en/latest/peewee/querying.html#upsert - upsert with on conflict
Looked for this myself for a while, but found it!
You can use the on_conflict_replace() or on_conflict_ignore() methods to define the behaviour when a record already exists in a table that has a uniqueness constraint.
PriceData.insert_many(values).on_conflict_replace().execute()
or
PriceData.insert_many(values).on_conflict_ignore().execute()
More info under "Upsert" here
First off, this is my first project using SQLAlchemy, so I'm still fairly new.
I am making a system to work with GTFS data. I have a back end that seems to be able to query the data quite efficiently.
What I am trying to do, though, is allow the GTFS files to update the database with new data. The problem I am hitting is pretty obvious: if the data I'm trying to insert is already in the database, we have a conflict on the uniqueness of the primary keys.
For efficiency reasons, I decided to use the following code for insertions, where model is the model object I would like to insert the data into, and data is a precomputed, cleaned list of dictionaries to insert.
for chunk in [data[i:i+chunk_size] for i in xrange(0, len(data), chunk_size)]:
    engine.execute(model.__table__.insert(), chunk)
There are two solutions that come to mind.
1. I find a way to do the insert such that if there is a collision, we don't care and don't fail. I believe that the code above is using the TableClause, so I checked there first, hoping to find a suitable replacement or flag, with no luck.
2. Before we perform the cleaning of the data, we get the list of primary key values, and if a given element matches on the primary keys, we skip cleaning and inserting the value. I found that I was able to get the PrimaryKeyConstraint from Table.primary_key, but I can't seem to get the Columns out, or find a way to query for only specific columns (in my case, the primary keys).
Either should be sufficient, if I can find a way to do it.
After looking into both of these for the last few hours, I can't seem to find either. I was hoping that someone might have done this previously, and point me in the right direction.
Thanks in advance for your help!
Update 1: There is a 3rd option I failed to mention above. That is to purge all the data from the database, and reinsert it. I would prefer not to do this, as even with small GTFS files, there are easily hundreds of thousands of elements to insert, and this seems to take about half an hour to perform, which means if this makes it to production, lots of downtime for updates.
With SQLAlchemy, you can create a new instance of the model class for each row and merge it into the current session. merge() looks the object up by its primary key (first in the session, then in the database) and will insert a new row if none exists or update the existing one otherwise. Note that this works row by row, so it will be slower than a bulk insert.
for row in data:
    session.merge(model(**row))
session.commit()
Also see this question for context: Fastest way to insert object if it doesn't exist with SQLAlchemy
Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection with 10 million records (approx. 10 GB). Specifically, I want to apply a geoip transform to the ip field in every doc and either append the result to that doc or create a separate record linked to it by, say, id (the linking is not critical; I can just create a completely separate record). Then I want to count and group by, say, city (I do know how to do the last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and gets progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()) that fetches only the fields you need (_id, ip) should do it. The Python driver does the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and the document no longer fits in its previously allocated space, MongoDB will have to move it to another place, so you might be better off creating a new document.
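As a rough sketch of that single pass, the function below walks any iterable of documents once and builds new documents in batches. To stay self-contained it runs on a plain list; with pymongo, the list would be replaced by something like `collection.find({}, {"ip": 1}).batch_size(1000)` and each batch written with `insert_many`. The `lookup_city` function, the sample addresses and the city names are placeholders, not real geoip results.

```python
from collections import Counter
from itertools import islice

def lookup_city(ip):
    # Hypothetical stand-in for a real geoip library call.
    fake_table = {"192.0.2.1": "City A", "198.51.100.7": "City B"}
    return fake_table.get(ip, "unknown")

def geo_batches(cursor, batch_size=1000):
    """Yield lists of new documents built from a single pass over cursor."""
    it = iter(cursor)
    while True:
        batch = [
            {"source_id": doc["_id"], "ip": doc["ip"],
             "city": lookup_city(doc["ip"])}
            for doc in islice(it, batch_size)
        ]
        if not batch:
            return
        yield batch

# Usage on a small in-memory stand-in for the collection.
docs = [
    {"_id": 1, "ip": "192.0.2.1"},
    {"_id": 2, "ip": "198.51.100.7"},
    {"_id": 3, "ip": "192.0.2.1"},
]
batches = list(geo_batches(docs, batch_size=2))
counts = Counter()
for batch in batches:
    # With pymongo, this is where the batch would be inserted.
    counts.update(doc["city"] for doc in batch)
```

Counting in the same loop is optional; if the new documents are stored, the grouping by city can equally be done in the database afterwards.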
Actually I am also attempting another approach in parallel (as plan B) which is to use mongoexport. I use it with --csv to dump a large csv file with just the (id, ip) fields. Then the plan is to use a python script to do a geoip lookup and then post back to mongo as a new doc on which map-reduce can now be run for count etc. Not sure if this is faster or the cursor is. We'll see.