Is it possible to Bulk Insert using Google Cloud Datastore - python

We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would move all our data there, but initially we are focusing on the archived data as a test.
Our language of choice is Python, and we have been able to transfer data from MySQL to the Datastore row by row.
We have approximately 120 million rows to transfer, and a one-row-at-a-time approach will take a very long time.
Has anyone found any documentation or examples on how to bulk insert data into Cloud Datastore using Python?
Any comments or suggestions are appreciated. Thank you in advance.

There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be:
Queuing script pulls data out of your MySQL instance and puts it on a queue.
(Many) Workers pull from this queue, and try to write the item to Datastore.
On failure, push the item back on the queue.
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
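A minimal sketch of that queue/worker pattern, assuming the google-cloud-datastore client library; the kind name, the key source and the MySQL fetch helper are placeholders, not part of the original question:

import queue
import threading

from google.cloud import datastore

work_queue = queue.Queue()

def worker():
    client = datastore.Client()  # one client per thread, to stay on the safe side
    while True:
        row = work_queue.get()
        if row is None:
            work_queue.task_done()
            break
        try:
            entity = datastore.Entity(key=client.key('ArchivedRow', row['id']))
            entity.update(row)
            client.put(entity)
        except Exception:
            work_queue.put(row)  # on failure, push the item back onto the queue
        finally:
            work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(50)]
for t in threads:
    t.start()

for row in fetch_rows_from_mysql():  # hypothetical generator over the MySQL table
    work_queue.put(row)

work_queue.join()         # wait until every queued row has been written
for _ in threads:
    work_queue.put(None)  # then tell each worker to exit
for t in threads:
    t.join()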
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading them into Cloud Datastore for key-value style querying (i.e., you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
+ 2 (1 x 2 write ops fixed cost per new entity)
+ 10 (5 x 2 write ops per indexed property)
= 12 "operations" per entity.
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00

I believe what you are looking for is the put_multi() method.
From the docs, you can use put_multi() to batch multiple put operations. This will result in a single RPC for the batch rather than one for each of the entities.
Example:
# a list of many entities
user_entities = [ UserEntity(name='user %s' % i) for i in xrange(10000)]
users_keys = ndb.put_multi(user_entities) # keys are in same order as user_entities
Also of note, from the docs:
Note: The ndb library automatically batches most calls to Cloud Datastore, so in most cases you don't need to use the explicit batching operations shown below.
That said, you may still, as suggested in the other answer, use a task queue (I prefer the deferred library) in order to batch-put a lot of data in the background.
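As a rough sketch (not from the docs) of batching the puts in the background with the deferred library - UserEntity, all_rows and the batch size here are illustrative placeholders:

from google.appengine.ext import deferred, ndb

def put_batch(rows):
    # each deferred task writes one batch with a single put_multi() call
    ndb.put_multi([UserEntity(name=row['name']) for row in rows])

BATCH = 500
for i in xrange(0, len(all_rows), BATCH):
    deferred.defer(put_batch, all_rows[i:i + BATCH])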

As an update to JJ Geewax's answer: as of July 1st, 2016,
the cost of read and write operations has changed, as explained here: https://cloud.google.com/blog/products/gcp/google-cloud-datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases
So writing should have gotten cheaper for the described case, as
writing a single entity now costs 1 write regardless of indexes, at $0.18 per 100,000 writes. For the example above that works out to 120m * ($0.18/100k) = $216 instead of $864.

Related

Django: which is the correct way to store a discussion in the db?

I use Django 1.11 with PostgreSQL as the database. I know how to store and retrieve data from a db, but I can't find an example of the correct way to store and retrieve an entire discussion between two users.
This is my simple idea:
Two users connect to 127.0.0.1, and on this page there is a text-area form. Both users can write into the text-area and post their content by pressing a button. The page then reloads and all messages are displayed.
What I want to know is whether the correct way to store and retrieve would be:
one db row => a single user message
If two users exchange, say, 15 messages, this will store 15 rows. To identify the discussion uniquely, I can add another column to the db, something like a discussion "id", so all 15 rows would share the same id along with the user:
db row1 ---> "pk=1, message=hello there, user=Mike, id=45")
db row2 ---> "pk=2, message=hello world, user=Jessy, id=45")
When the page reloads, Django will run:
discussion = Discussion.objects.all().filter(id=45)
to retrieve the discussion.
Only two users can discuss in private, so each pair of users has a discussion page like 127.0.0.1/one, 127.0.0.1/two and so on.
If this is the correct way to store and retrieve from the db, my question is how that would scale. Can I rely on this design to store and retrieve data from the database efficiently, or will it become heavy in the near future? I worry that 1000 users could quickly grow into 10000 rows.
So the answer to your question depends on how you plan on using the data in the future and what you need to do with it. It is entirely possible to store an entire conversation between N users in a relational database such as Postgres as individual records per message. However, as with all programming questions, there are multiple paradigms to answer your question. I will explore the pros/cons of a couple of them here (with the knowledge that there are certainly more).
Paradigm 1: New record (row) per message
Pros:
Simpler querying for individual messages.
Analytical functions can easily be applied at a message level (i.e. summing number of messages by certain users)
Record size is (relatively) small
Cons:
Very long tables (many rows)
Searching becomes time-consuming as the table grows.
Post-processing needed on a collection (i.e. All records from a conversation)
More work is shifted to the server
Paradigm 2: New record (row) per conversation
Pros:
Simpler querying for individual conversations
Shorter table sizes
Post-processing needed on an object (i.e. The entire conversation stored as a JSON object)
Cons:
Larger row size that can grow substantially depending on the number and size of messages.
Harder to query individual messages or text within messages (need to use more expensive functions such as LIKE % on blobs of text = slow)
Less conducive to performing any type of analytical function on messages.
Messages become an append exercise
More work is shifted to the client/application
Which is best? YMMV
Again, there are probably a half-dozen or so more ways you could store your application's messages, and all depend on your downstream needs. Additionally, I would implore you to look into projects such as Apache Kafka, which specializes in message publishing, as a potentially scalable, drop-in solution.
Three recommendations:
If you give PostgreSQL a decent amount of resources (say, an Amazon m3.large instance), then "a lot of rows" for a PostgreSQL database is around 100 million rows (depending). That's not a limit, it's just enough rows that you'll have to spend some time working on performance. So assuming that chats average 100 messages, then that would be one million conversations. So having one row per message is not a performance problem at the scale you're talking about.
Don't use a numerical PK as your main way of ordering conversations (you might still have one, Django likes having one). Do have a timestamptz column, which is how you reconstruct the order of conversations.
Have a unique index on user, timestamptz (since a user can't post two messages simultaneously), and another unique index on conversation, timestamptz (this will allow you to reconstruct conversations quickly).
You should also have a table called "conversations" which summarizes conversation_id, list-of-users, because this will make it easy to answer the request "show me all my conversations".
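Putting those recommendations together, a minimal Django models sketch might look like the following (model and field names are illustrative, not taken from your code):

from django.conf import settings
from django.db import models

class Conversation(models.Model):
    # the summary table: which users take part, so "show me all my conversations" is one query
    users = models.ManyToManyField(settings.AUTH_USER_MODEL, related_name='conversations')

class Message(models.Model):
    conversation = models.ForeignKey(Conversation, on_delete=models.CASCADE)
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    text = models.TextField()
    created = models.DateTimeField()  # maps to a timestamptz column on PostgreSQL

    class Meta:
        # reconstruct a conversation quickly; no two messages share the same instant
        unique_together = (('user', 'created'), ('conversation', 'created'))
        ordering = ('created',)

Retrieving a discussion then becomes Message.objects.filter(conversation_id=45).order_by('created').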
Does that answer your questions?

Continuous aggregates over large datasets

I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has on the order of 10^5 rows and adds around 10^2 new rows every day.
Table B has on the order of 10^6 rows and adds around 10^3 new rows every day. There's a one-to-many relation from A to B (many B rows for each row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like to have a job that runs every ~10 minutes and does this: for every row in A, find every related row in B that was created in the last day, week and month (and then sort by count), and save the results in a different DB or cache them.
If this is confusing, here's a practical example: say table A has Amazon products and table B has product reviews. We would like to show a sorted list of the products with the most reviews in the last 4 hrs, day, week etc. New products and reviews are added at a fast pace, and we'd like the said list to be as up-to-date as possible.
The current implementation I have is just a for loop (pseudo-code):
result = {}
for product in db_products:
    # db_reviews is a placeholder query: reviews of this product created after some_time
    reviews = db_reviews(product_id=product.id, created_since=some_time)
    result[product] = {
        'reviews': reviews,
        'reviews_count': len(reviews),
    }
# sort products by review count, highest first
result = sorted(result.items(), key=lambda item: item[1]['reviews_count'], reverse=True)
return result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
You have two fairly large tables in a database, and you need to regularly create aggregates for past time periods (hour, day, week etc.) and store the results in another database.
I will assume that once a time period is past, there are no changes to the related records; in other words, the aggregate for a past period always has the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing together dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
You write a simple Task subclass, which defines the required input data, the output data (called a Target), and the process to create that output.
Tasks can be parametrized; a typical parameter is a time period (a specific day, hour, week etc.).
Luigi can stop tasks in the middle and start them later. It considers any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to make it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, like files in the local file system, on Hadoop, on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity, so e.g. files are first created in a temporary location and are moved to the final destination only after they are complete.
In databases there are structures to denote that some database import has been completed.
You are free to create your own target implementations (a target only has to create something and provide an exists method to check whether the result exists).
Using Luigi for your task
For the task you describe you will probably find everything you need already present. Just a few tips (see the small sketch after this list):
The class luigi.postgres.CopyToTable allows storing records into a Postgres database. The target automatically creates a so-called "marker table" where it marks all completed tasks.
There are similar classes for other types of databases; one of them uses SqlAlchemy and should cover the database you use, see the class luigi.contrib.sqla.CopyToTable.
The Luigi docs include a working example of importing data into an sqlite database.
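To make the concept concrete, here is a minimal sketch of such a task; the aggregation helper and the output path are placeholders, and it uses a plain file target instead of CopyToTable to keep it short:

import datetime
import json

import luigi

class AggregateReviews(luigi.Task):
    # one task instance per day; rerunning the same day is a no-op once the target exists
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        return luigi.LocalTarget('aggregates/reviews_%s.json' % self.date.isoformat())

    def run(self):
        counts = count_reviews_per_product(since=self.date)  # hypothetical SQL aggregation helper
        with self.output().open('w') as f:  # LocalTarget writes atomically via a temp file
            json.dump(counts, f)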
A complete implementation is beyond what is feasible in a StackOverflow answer, but I am sure you will experience the following:
The code to do the task is really clear - no boilerplate coding, you write only what has to be done.
Nice support for working with time periods - even from the command line, see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating too many tasks and possibly overloading your servers (the default values are very reasonable and can be changed).
The option to run the task on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge amounts of XML files with Luigi and have also written tasks importing aggregated data into a database, and I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task suffers from the database query taking too long to execute, you have a few options:
If you are counting reviews per product in Python, consider trying a SQL query instead - it is often much faster. It should be possible to write a SQL query that uses COUNT on the proper records and returns the number you need directly. With GROUP BY you can even get the summary information for all products in one run (see the sketch after these options).
Set up a proper index, probably on the "reviews" table on the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It might happen that with an optimized SQL query you get a working solution even without using Luigi.
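As a rough sketch of that first option, assuming a DB-API connection conn and tables named products and reviews (adjust the names to your schema):

sql = """
    SELECT p.id, COUNT(r.id) AS reviews_count
    FROM products p
    LEFT JOIN reviews r
           ON r.product_id = p.id AND r.created >= %s
    GROUP BY p.id
    ORDER BY reviews_count DESC
"""
cursor = conn.cursor()
cursor.execute(sql, (some_time,))
result = cursor.fetchall()  # [(product_id, reviews_count), ...], already sorted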
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating summary tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
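A hedged sketch of that upsert, using MySQL's INSERT ... ON DUPLICATE KEY UPDATE from Python; the summary table and its columns are illustrative, not from the question:

sql = """
    INSERT INTO review_summary (day, product_id, review_count)
    VALUES (%s, %s, 1)
    ON DUPLICATE KEY UPDATE review_count = review_count + 1
"""
cursor.execute(sql, (review.created.date(), review.product_id))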
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that the create column you filter on (in db_reviews) is defined as a DATE or DATETIME and indexed, since you are using it in your WHERE clause.

App Engine Bulkloader to Update Entities Instead of Replacing

I need to perform a nightly update in the datastore on a relatively large dataset (syncing a subset of corporate data with GAE). I've been using the bulkloader, and it does the job, but the write costs are really adding up. Since I'm specifying key strings for each entity, the bulkloader is essentially rewriting the ENTIRE entity for every record it loads, which in my case, is about 90 writes PER ENTITY. (It's a large, flat dataset with a lot of indexes.) But within my dataset, only six of my 50 properties actually change overnight, so I'm doing a lot of redundant writing.
My first thought was to keep a cache of the prior night's build, loop through it for changes, get the entity, then execute a put() on the properties that need it. This works effectively to reduce writes, but takes a LONG time -- even when I batch the put(). It only takes ~3 minutes to load the ENTIRE dataset with the bulkloader -- and 16-18 minutes just to run the updates! (I'm using the remote API, BTW.) This won't work when I scale up.
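For reference, here is a rough sketch of that diff-and-update pass over ndb (the kind and property names are made up; previous and current stand in for the cached builds, keyed by key string):

from google.appengine.ext import ndb

# keys whose property values changed since the previous build
changed_keys = [k for k, row in current.items() if previous.get(k) != row]

entities = ndb.get_multi([ndb.Key('Record', k) for k in changed_keys])
for entity, k in zip(entities, changed_keys):
    for prop in ('price', 'qty'):  # the handful of properties that actually change
        setattr(entity, prop, current[k][prop])
ndb.put_multi(entities)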
I tried using ndb.KeyProperty in my model and only updating the changed fields via the bulkloader, but then I lose the ability to query/sort on the KeyProperty, which I need.
I also tried StructuredProperties, which DOES let you query/sort, but the structured property doesn't allow you to set an ID for it, so I can't load just the structured property.
So...is there a way for me to reduce these writes and keep the functionality I need? Can I use the bulkloader to update changes only? Do I need to restructure my dataset?

Reduce GAE hrd(db) read operation counts

To reduce GAE Python usage cost, I want to optimize DB read operations. Do you have any suggestions?
I can't understand why GAE shows a lot more DB read operations than I expected. If you can explain the general logic of how GAE counts DB read operations, that would also be very helpful.
Thanks!
You can get the full breakdown of what a high-level operation (get, query, put, delete, ...) costs in low-level operations (small, read, write) here - https://developers.google.com/appengine/docs/billing (scroll down about half way).
I highly recommend using AppStats to help track down where your read operations are coming from. One big thing to watch out for is not to use the offset option with .fetch() for pagination, as this just skips results, but still costs reads. That means if you do .fetch(10, offset=20), it will cost you 30 reads. You want to use query cursors instead.
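A small illustration of the offset vs. cursor difference in ndb (MyModel is a placeholder model):

from google.appengine.ext import ndb

# Costly: the 10 skipped results are still billed as reads.
page2 = MyModel.query().fetch(10, offset=10)

# Cheaper: resume from a cursor instead of skipping.
results, cursor, more = MyModel.query().fetch_page(10)
if more:
    next_results, cursor, more = MyModel.query().fetch_page(10, start_cursor=cursor)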
Another optimization is to fetch by key (.get(keys)) vs querying, which will only cost 1 read operation, as opposed to querying, which costs 1 read for the query + 1 read for each entity returned (so a query with 1 entity returned costs 2 reads, but a .get() for that same entity would only cost 1 read). You might also want to look at using projection queries, which cost 1 read for the query, but only 1 small op per projected entity retrieved (note: all properties projected must be indexed).
Also, if you're not already, you should be using the NDB API which automatically caches fetches and will help reduce your read operations. Along with the official docs, the NDB cheat sheet by Rodrigo and Guido is a great way to transition from ext.db to ndb.
There are some good tips under Managing Datastore Usage here:
https://developers.google.com/appengine/articles/managing-resources
Lastly, you might also be interested in using gae_mini_profiler, which provides convenient access to AppStats for the current request, as well as other helpful profiling and logging information.
Hard to say why without seeing your code but if you're not already, use memcache to save on db reads.
https://developers.google.com/appengine/docs/python/memcache/usingmemcache
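For example, a minimal read-through cache along these lines (the Record model and the expiry are illustrative):

from google.appengine.api import memcache

def get_record(record_id):
    cache_key = 'record:%s' % record_id
    record = memcache.get(cache_key)
    if record is None:
        record = Record.get_by_id(record_id)  # datastore read only on a cache miss
        memcache.set(cache_key, record, time=600)
    return record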

How do I transform every doc in a large Mongodb collection without map/reduce?

Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection with about 10 million records (approximately 10 GB). Specifically I want to apply a geoip transform to the ip field in every doc and either append the result record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical, I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do the last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is how I run through each record in the collection and apply the transform, in the most efficient way possible.
Batching via limit/skip is out of the question as it does a "table scan" and is going to get progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway; a simple cursor (find()), perhaps fetching only the few fields you need (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, Mongo will have to move the document to another place, so you might be better off creating a new document.
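A sketch of that cursor approach, assuming a recent pymongo; the geoip lookup and the collection names are placeholders:

from pymongo import MongoClient

db = MongoClient().mydb
cursor = db.mycoll.find({}, {'_id': 1, 'ip': 1}).batch_size(1000)
for doc in cursor:
    geo = lookup_geoip(doc['ip'])  # hypothetical wrapper around your geoip lib
    # write a separate linked document rather than growing the original one in place
    db.geo_results.insert_one({'source_id': doc['_id'], 'city': geo.get('city')})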
Actually I am also attempting another approach in parallel (as plan B), which is to use mongoexport. I use it with --csv to dump a large csv file with just the (id, ip) fields. Then the plan is to use a Python script to do a geoip lookup and then post back to Mongo as a new doc, on which map-reduce can now be run for counts etc. Not sure if this is faster than the cursor. We'll see.
