I need to perform a nightly update in the datastore on a relatively large dataset (syncing a subset of corporate data with GAE). I've been using the bulkloader, and it does the job, but the write costs are really adding up. Since I'm specifying key strings for each entity, the bulkloader is essentially rewriting the ENTIRE entity for every record it loads, which, in my case, is about 90 writes PER ENTITY. (It's a large, flat dataset with a lot of indexes.) But within my dataset, only six of my 50 properties actually change overnight, so I'm doing a lot of redundant writing.
My first thought was to keep a cache of the prior night's build, loop through it for changes, get the entity, then execute a put() on the properties that need it. This works effectively to reduce writes, but takes a LONG time -- even when I batch the put(). It only takes ~3 minutes to load the ENTIRE dataset with the bulkloader -- and 16-18 minutes just to run the updates! (I'm using the remote API, BTW.) This won't work when I scale up.
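Roughly, the update pass looks like this (a simplified sketch; Record and the property names here are placeholders for my real ~50-property model):

from google.appengine.ext import ndb

class Record(ndb.Model):               # stand-in for my actual model
    status = ndb.StringProperty()
    qty = ndb.IntegerProperty()
    # ... ~48 more properties ...

def apply_updates(changes):
    """changes: dict mapping key string -> dict of the ~6 changed properties."""
    key_strs = list(changes)
    entities = ndb.get_multi([ndb.Key(Record, ks) for ks in key_strs])  # one batched read
    for key_str, entity in zip(key_strs, entities):
        for name, value in changes[key_str].items():
            setattr(entity, name, value)    # only touch the changed properties
    ndb.put_multi(entities)                 # one batched write over the remote API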
I tried using ndb.KeyProperty in my model and only updating the changed fields via the bulkloader, but then I lose the ability to query/sort on the KeyProperty, which I need.
I also tried StructuredProperty, which DOES let you query/sort, but a structured property doesn't allow you to set an ID for it, so I can't load just the structured property.
So...is there a way for me to reduce these writes and keep the functionality I need? Can I use the bulkloader to update changes only? Do I need to restructure my dataset??
We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would move all our data there, however initially focusing on the archived data as a test.
Our language of choice is Python, and we have been able to transfer data from MySQL to the datastore row by row.
We have approximately 120 million rows to transfer, and a one-row-at-a-time method will take a very long time.
Has anyone found some documentation or examples on how to bulk insert data into cloud datastore using python?
Any comments or suggestions are appreciated. Thank you in advance.
There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be (a rough sketch in code follows the list):
Queuing script pulls data out of your MySQL instance and puts it on a queue.
(Many) Workers pull from this queue, and try to write the item to Datastore.
On failure, push the item back on the queue.
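A minimal sketch of that pattern, assuming the google-cloud-datastore Python client and the standard library's queue/threading modules (the MySQL fetch, entity kind, and batch size are placeholders to adapt):

import queue
import threading

from google.cloud import datastore   # assuming the google-cloud-datastore client

NUM_WORKERS = 32
BATCH_SIZE = 500                      # Datastore accepts up to 500 entities per commit

work_queue = queue.Queue()

def worker():
    client = datastore.Client()
    while True:
        batch = work_queue.get()
        try:
            client.put_multi(batch)   # one RPC for the whole batch
        except Exception:
            work_queue.put(batch)     # on failure, push the batch back on the queue
        finally:
            work_queue.task_done()

def enqueue_rows(rows, client):
    """rows: an iterable of dicts pulled from MySQL (placeholder for your query)."""
    batch = []
    for row in rows:
        entity = datastore.Entity(key=client.key('Row', row['id']))
        entity.update(row)
        batch.append(entity)
        if len(batch) >= BATCH_SIZE:
            work_queue.put(batch)
            batch = []
    if batch:
        work_queue.put(batch)

# start workers, feed the queue from MySQL, then wait for the queue to drain
for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()
enqueue_rows(fetch_rows_from_mysql(), datastore.Client())   # fetch_rows_from_mysql() is a placeholder
work_queue.join()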
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading them into Cloud Datastore for key-value style querying (i.e., you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
+ 2 (1 x 2 write ops fixed cost per new entity)
+ 10 (5 x 2 write ops per indexed property)
= 12 "operations" per entity.
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00
I believe what you are looking for is the put_multi() method.
From the docs, you can use put_multi() to batch multiple put operations. This will result in a single RPC for the batch rather than one for each of the entities.
Example:
from google.appengine.ext import ndb

# a list of many entities (UserEntity here is any ndb.Model subclass)
user_entities = [UserEntity(name='user %s' % i) for i in xrange(10000)]
user_keys = ndb.put_multi(user_entities)  # keys are in same order as user_entities
Also worth noting from the docs:
Note: The ndb library automatically batches most calls to Cloud Datastore, so in most cases you don't need to use the explicit batching operations shown below.
That said, you may still, as suggested in another answer, use a task queue (I prefer the deferred library) in order to batch-put a lot of data in the background.
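A rough sketch of that with the deferred library (the batch size and the UserEntity model from the example above are just illustrative, and the deferred builtin must be enabled in app.yaml):

from google.appengine.ext import deferred, ndb

BATCH_SIZE = 500

def put_batch(rows):
    # rows: a list of dicts matching the UserEntity properties
    ndb.put_multi([UserEntity(**row) for row in rows])

def enqueue_import(all_rows):
    # fan the rows out to background tasks, BATCH_SIZE entities per task
    for start in xrange(0, len(all_rows), BATCH_SIZE):
        deferred.defer(put_batch, all_rows[start:start + BATCH_SIZE])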
As an update to the answer of JJ Geewax: as of July 1st, 2016, the cost of read and write operations has changed, as explained here: https://cloud.google.com/blog/products/gcp/google-cloud-datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases
So writing should have gotten cheaper for the described case, since writing a single entity only costs 1 write regardless of indexes, and writes now cost $0.18 per 100,000.
I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but importing the data (from CSV files) is slow, tedious, and error prone, and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
The question is:
(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, since the DataFrames are, or at least should be, immutable, which is taken care of on the application side.) What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent DataFrames)? Flask (or rather Werkzeug) has a SimpleCache class, but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method still usable when the system gets deployed with Gunicorn: is it possible to have static data (the cache), and can concurrent requests (from different processes?) access the same cache?
I realise these are many questions, but before I invest more time and build a proof of concept, I thought I'd get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
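A sketch of both routes (the connection string, file, and table names are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/mydb')   # placeholder DSN

# pull a table straight into a DataFrame from SQL
df = pd.read_sql('SELECT * FROM measurements', engine)

# or round-trip through HDF5 via PyTables
df.to_hdf('measurements.h5', 'measurements', format='table')
df = pd.read_hdf('measurements.h5', 'measurements')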
If you don't need any relational operations, then I would say ditch the Postgres server; if it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that, even if you don't expect concurrency issues, it will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or an HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
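For example, a sketch with a redis backend (the HDF5 path and timeout are placeholders; the newer Flask-Caching fork exposes the same Cache/memoize API):

from flask import Flask
from flask_cache import Cache
import pandas as pd

app = Flask(__name__)
cache = Cache(app, config={'CACHE_TYPE': 'redis'})   # or 'memcached'

@cache.memoize(timeout=3600)
def load_frame(name):
    """Return the prepared DataFrame for `name`, cached across requests."""
    return pd.read_hdf('data.h5', name)

Note that memoized results are serialized into the cache backend, so each worker process still pays a deserialization cost on a hit, but the expensive load itself only happens once.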
You could, of course, also create a global variable, for example where you create the Flask app, and just import it wherever it's needed. I have not tried this and would thus not recommend it; it might cause all sorts of concurrency issues.
Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection with 10 million records, approximately 10 GB. Specifically, I want to apply a geoip transform to the ip field in every doc and either append the result record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do the last part).
The major reason I believe I cant use map-reduce is I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, since it does a "table scan" and is going to get progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), perhaps fetching only a few fields (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, mongo will have to move it to another place, so you might be better off creating a new document.
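A sketch of the cursor approach with pymongo (the database/collection names and the lookup_city() geoip call are placeholders):

from pymongo import MongoClient

client = MongoClient()
source = client.mydb.events          # placeholder collection with the ip field
target = client.mydb.events_geo      # new collection for the transformed records

batch = []
for doc in source.find({}, {'ip': 1}).batch_size(1000):   # only fetch _id and ip
    city = lookup_city(doc['ip'])                          # placeholder for your geoip lib
    batch.append({'event_id': doc['_id'], 'ip': doc['ip'], 'city': city})
    if len(batch) >= 1000:
        target.insert_many(batch)    # insert(batch) on older pymongo versions
        batch = []
if batch:
    target.insert_many(batch)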
Actually I am also attempting another approach in parallel (as plan B) which is to use mongoexport. I use it with --csv to dump a large csv file with just the (id, ip) fields. Then the plan is to use a python script to do a geoip lookup and then post back to mongo as a new doc on which map-reduce can now be run for count etc. Not sure if this is faster or the cursor is. We'll see.
What would be the best way to import multi-million-record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking for whether the record already exists, and a few other things.
Can this process be made to execute in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How will these be used with a direct load?
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python we took a radical strategy change and did the import (only) with Java and JDBC (plus some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
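A sketch of that last trick (the file, table, and column names are invented, and the quoting is naive, fine only for trusted data):

import csv

with open('records.csv') as src, open('records.sql', 'w') as out:
    reader = csv.reader(src)
    next(reader)                                # skip the header row
    for row in reader:
        values = ', '.join("'%s'" % v.replace("'", "''") for v in row)
        out.write('INSERT INTO myapp_record (name, city, ip) VALUES (%s);\n' % values)

# then feed it straight to MySQL:  mysql mydb < records.sql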
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned you want to bypass the ORM and go directly to the database. Depending on what type of database you're using you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for mysql you can use the LOAD command. I'm sure there's something similar for Postgres as well.
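For MySQL, the LOAD command mentioned is LOAD DATA INFILE; a sketch of driving it from Python with MySQLdb (paths, credentials, and table/column names are placeholders, and the server/client may need local_infile enabled):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mydb', local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/records.csv'
    INTO TABLE myapp_record
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    IGNORE 1 LINES
    (name, city, ip)
""")
conn.commit()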
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the db directly first.
It implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, for feeding the db: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo from their site. It allows you to import CSV into MySQL and save the import profile as XML, etc.
Then I would launch the data control scripts from within Django, and when you're done, migrate your model with South to get what you want, or, as I said earlier, create another set of models within your project and use scripts to convert/copy the data.
I have a file that contains ~16,000 lines of information on entities. The user is supposed to upload the file using an HTML upload form; the system then handles this by reading it line by line, creating entities, and put()'ing them into the datastore.
I'm limited by the 30 second request time limit. I have tried a lot of different work-arounds using Task Queue, forced HTML redirecting, etc. and nothing has worked for me.
I am using forced HTML redirecting to delete all data and this works, albeit VERY slowly. (4th answer here: Delete all data for a kind in Google App Engine)
I can't seem to apply this to my uploading problem, since my method has to be a POST method. Is there a solution somehow? Sample code would be much appreciated since I'm very new to web development in general.
To solve a similar problem, I stored the dataset in a model with a single TextProperty, then spawned a taskqueue task (sketched after the list below) that:
Fetches a dataset from the datastore if there are any left.
Checks if the length of the dataset is <= N, where N is some small number of entities you can put() without a timeout. I used 5. If so, write the individual entities, delete the dataset record, and spawn a new copy of the task.
If the dataset size is bigger than N, split it into N parts in the same format and write those to the datastore, delete the original entity, and spawn a new copy of the task.
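A condensed sketch of that task, using the deferred library to spawn the task and a placeholder make_entity() for the per-line parsing (N=5 as described):

from google.appengine.ext import deferred, ndb

N = 5

class Dataset(ndb.Model):
    lines = ndb.TextProperty()    # the raw uploaded lines, newline-separated

def process_one_dataset():
    ds = Dataset.query().get()
    if ds is None:
        return                                    # nothing left to do
    lines = ds.lines.splitlines()
    if len(lines) <= N:
        # small enough: write the real entities (make_entity() is your line parser)
        ndb.put_multi([make_entity(line) for line in lines])
    else:
        # too big: split into N smaller Dataset records in the same format
        step = (len(lines) + N - 1) // N
        ndb.put_multi([Dataset(lines='\n'.join(lines[i:i + step]))
                       for i in range(0, len(lines), step)])
    ds.key.delete()
    deferred.defer(process_one_dataset)           # spawn a new copy of the task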
If you're doing this to bulk load data, why not use the bulk loader?
If you need the interface to be accessible to non-admin users, then, as suggested, you need to break the file up into decent-sized chunks (blocks of n lines each), put them into the datastore, and start a task to deal with each of them.