The Problem
I am writing an App Engine Karaoke Catalogs app. The app is very simple: in the first release, it offers the ability to import CSV song lists into catalogs and display them.
I am having a problem with the CSV import: it takes a very long time (14 hours) to import 17,500 records in my development environment. In the production environment, it imports about 1,000 records and then crashes with a 500 error. I have been going through the logs but have not found any useful clues.
The Code
class Song(ndb.Model):
    sid = ndb.IntegerProperty()
    title = ndb.StringProperty()
    singer = ndb.StringProperty()
    preview = ndb.StringProperty()

    @classmethod
    def new_from_csv_row(cls, row, parent_key):
        song = Song(
            sid=int(row['sid']),
            title=row['title'],
            singer=row['singer'],
            preview=row['preview'],
            key=ndb.Key(Song, row['sid'], parent=parent_key))
        return song
class CsvUpload(webapp2.RequestHandler):
    def get(self):
        # code omitted for brevity

    def post(self):
        catalog = get_catalog(…)  # retrieve old catalog or create new

        # upfile is the contents of the uploaded file, not the filename,
        # because the form uses enctype="multipart/form-data"
        upfile = self.request.get('upfile')

        # Create the songs
        csv_reader = csv.DictReader(StringIO(upfile))
        for row in csv_reader:
            song = Song.new_from_csv_row(row, catalog.key)
            song.put()

        self.redirect('/upload')
Sample Data
sid,title,singer,preview
19459,Zoom,Commodores,
19460,Zoot Suit Riot,Cherry Poppin Daddy,
19247,You Are Not Alone,Michael Jackson,Another day has gone. I'm still all alone
Notes
In the development environment, I tried to import up to 17,500 records and experienced no crashes.
At first, records are created and inserted quickly, but as the database grows into the thousands, the time it takes to create and insert each record increases to a few seconds.
How do I speed up the import operation? Any suggestion, hint, or tip will be greatly appreciated.
Update
I followed Murph's advice and used a KeyProperty to link a song back to the catalog. The result is about 4 minutes and 20 seconds for 17,500 records, a huge improvement. It also means I had not fully understood how NDB works in App Engine, and I still have a long way to go.
While a big improvement, 4+ minutes is admittedly still too long. I am now looking into Tim's and Dave's advice to further shorten the perceived response time of my app.
In Google App Engine's Datastore, writes to an Entity Group are restricted to 1 write per second.
Since you are specifying a "parent" key for every Song, they all end up in one Entity Group, which is very slow.
Would it be acceptable for your use case to just use a KeyProperty to keep track of that relationship? That would be much faster, though the data might have more consistency issues.
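For illustration, a minimal sketch of the KeyProperty approach, based on the models in the question (the 'Catalog' kind name is an assumption, not code from this answer):

from google.appengine.ext import ndb

class Song(ndb.Model):
    sid = ndb.IntegerProperty()
    title = ndb.StringProperty()
    singer = ndb.StringProperty()
    preview = ndb.StringProperty()
    # Reference the catalog instead of making it the entity-group parent,
    # so writes are not funneled into a single entity group.
    catalog = ndb.KeyProperty(kind='Catalog')

# Querying songs for a catalog then becomes an ordinary (eventually consistent) query:
# songs = Song.query(Song.catalog == catalog.key).fetch()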
In addition to the other answer re: entity groups, if the import process is going to take longer than 60 seconds, use a task; then you have a 10-minute run time.
Store the CSV as a BlobProperty in an entity (if compressed, <1MB) or in GCS for larger files, then fire off a task which retrieves the CSV from storage and then does the processing.
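A rough sketch of that flow using the deferred library and a hypothetical CsvBlob holder entity (the names, batch size, and batching strategy are illustrative assumptions, not code from this answer):

import csv
from StringIO import StringIO

from google.appengine.ext import deferred, ndb

class CsvBlob(ndb.Model):
    # Compressed CSV payload; must stay under the ~1MB entity size limit.
    data = ndb.BlobProperty(compressed=True)

def process_csv(csv_blob_key, catalog_key):
    upfile = csv_blob_key.get().data
    reader = csv.DictReader(StringIO(upfile))
    songs = [Song.new_from_csv_row(row, catalog_key) for row in reader]
    # Write in modest batches to keep each RPC small.
    for i in range(0, len(songs), 500):
        ndb.put_multi(songs[i:i + 500])

# In the upload handler, instead of writing songs directly:
# blob = CsvBlob(data=upfile)
# blob.put()
# deferred.defer(process_csv, blob.key, catalog.key)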
First off, Tim is on the right track. If you can't get work done within 60 seconds, defer to a task. But if you can't get work done within 10 minutes, fall back on App Engine MapReduce, which apportions the work of processing your csv across multiple tasks. Consult the demo program, which has some of the pieces that you would need.
For development-time slowness, are you using the --use_sqlite option when starting the dev_appserver?
Murph touches on the other part of your problem. Using Entity groups, you're rate-limited on how many inserts (per Entity group) you can do. Trying to insert 17,500 rows using a single parent isn't going to work at all well. That'll take about 5 hours.
So, do you really need consistent reads? If this is a one-time upload, can you do non-ancestor inserts (with the catalog as a property), then wait a bit for the data to become eventually consistent? This simplifies querying.
If you really, absolutely need consistent reads, you'll probably need to split your writes across multiple parent keys. This will increase your write rate, at the expense of making your ancestor queries more complicated.
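A minimal sketch of splitting writes across several parent keys; the shard count and the 'CatalogShard' kind are assumed for illustration only:

from google.appengine.ext import ndb

NUM_SHARDS = 20  # assumed value; tune to the write rate you need

def shard_parent_key(catalog_key, sid):
    # Spread songs across NUM_SHARDS entity groups per catalog, so the
    # ~1 write/sec/entity-group limit applies per shard, not per catalog.
    shard_id = int(sid) % NUM_SHARDS
    return ndb.Key('CatalogShard', '%s-%d' % (catalog_key.id(), shard_id))

# Ancestor queries then have to fan out over all shards:
# songs = []
# for shard in range(NUM_SHARDS):
#     parent = ndb.Key('CatalogShard', '%s-%d' % (catalog.key.id(), shard))
#     songs.extend(Song.query(ancestor=parent).fetch())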
Related
I'm trying to think of an algorithm to solve this problem I have. It's not a homework problem, but something for a side project I'm working on.
There's a table A that has on the order of 10^5 rows and adds new ones on the order of 10^2 every day.
Table B has on the order of 10^6 rows and adds new ones at 10^3 every day. There's a one-to-many relation from A to B (many B rows for each row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like to have a job that runs every ~10 minutes and does this: for every row in A, find every row in B related to it that was created in the last day, week, and month (and then sort by count), and save them in a different DB or cache them.
If this is confusing, here's a practical example: say table A has Amazon products and table B has product reviews. We would like to show a sorted list of the products with the most reviews in the last 4 hours, day, week, etc. New products and reviews are added at a fast pace, and we'd like the list to be as up to date as possible.
The current implementation I have is just a for loop (pseudocode):
result = {}
for product in db_products:
    reviews = db_reviews(product_id=product.id, created_after=some_time)
    result[product.id] = {
        'reviews': reviews,
        'reviews_count': len(reviews),
    }
result = sort(result, by='reviews_count')
return result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
With two biggish tables in a database, you need to regularly create aggregates for past time periods (hour, day, week, etc.) and store the results in another database.
I will assume that once a time period is past, there are no changes to the related records; in other words, the aggregate for a past period always has the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing together dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
You write a simple Task subclass, which defines the required input data, the output data (called a Target), and the process that creates the target output.
Tasks can be parametrized; a typical parameter is the time period (a specific day, hour, week, etc.).
Luigi can stop tasks in the middle and start again later. It considers any task whose target already exists to be completed and will not rerun it (you would have to delete the target to make it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop, on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity; e.g., files are first created in a temporary location and moved to their final destination only after they are complete.
For databases, there are structures to denote that a given database import is completed.
You are free to create your own target implementations (they have to create something and provide an exists method to check whether the result exists).
Using Luigi for your task
For the task you describe, you will probably find everything you need already present. Just a few tips (a minimal Task sketch follows these tips):
The class luigi.postgres.CopyToTable allows storing records in a Postgres database. The target automatically creates a so-called "marker table" where it marks all completed tasks.
There are similar classes for other types of databases; one of them uses SQLAlchemy and should cover the database you use, see the class luigi.contrib.sqla.CopyToTable.
The Luigi docs include a working example of importing data into an sqlite database.
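To make the shape of a Luigi task concrete, here is a minimal sketch with made-up names (ReviewCountsForDay, a local file target, and a compute_review_counts_for helper are all assumptions for illustration, not this answer's code):

import json

import luigi

class ReviewCountsForDay(luigi.Task):
    # One task instance per day; Luigi skips any day whose target already exists.
    day = luigi.DateParameter()

    def output(self):
        # The target: a file with aggregated counts for this day.
        return luigi.LocalTarget('reviews-%s.json' % self.day.isoformat())

    def run(self):
        counts = compute_review_counts_for(self.day)  # assumed helper doing the actual query
        # LocalTarget writes atomically via a temporary file.
        with self.output().open('w') as f:
            json.dump(counts, f)

# Run from the command line (module name assumed), e.g.:
#   luigi --module review_tasks ReviewCountsForDay --day 2015-06-01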
A complete implementation is beyond what is feasible in a Stack Overflow answer, but I am sure you will experience the following:
The code to do the task is really clear; there is no boilerplate, you write only what has to be done.
There is nice support for working with time periods, even from the command line; see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating too many tasks and possibly overloading your servers (the default values are set very reasonably and can be changed).
There is the option to run the task on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge amounts of XML files with Luigi and have also written tasks that import aggregated data into a database, and I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task suffers from too long an execution time for the database query, you have a few options:
If you are counting reviews per product in Python, consider trying a SQL query instead; it is often much faster. It should be possible to write a SQL query that counts exactly the records you need and returns the number directly. With GROUP BY you can even get the summary for all products in one run (a sketch follows this list).
Set up a proper index, probably on the "reviews" table on the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It might happen that with an optimized SQL query you get a working solution even without using Luigi.
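A sketch of that aggregation as a single GROUP BY query, assuming a reviews table with product_id and created_at columns and a MySQL/Postgres-style DB-API connection (the schema and names are assumptions):

def review_counts_since(conn, since):
    """Return {product_id: review_count} for reviews created after `since`."""
    cur = conn.cursor()
    cur.execute(
        """
        SELECT product_id, COUNT(*) AS reviews_count
        FROM reviews
        WHERE created_at >= %s
        GROUP BY product_id
        ORDER BY reviews_count DESC
        """,
        (since,))
    return dict(cur.fetchall())

# e.g. counts for the last day:
# counts = review_counts_since(conn, datetime.now() - timedelta(days=1))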
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating summary tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
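A minimal sketch of the upsert described in this answer, assuming a MySQL summary table keyed by (day, product_id) and a DB-API connection (table and column names are illustrative):

def bump_daily_review_count(conn, review_day, product_id):
    """Increment the per-day review counter as each raw review row is inserted."""
    cur = conn.cursor()
    cur.execute(
        """
        INSERT INTO review_summary (day, product_id, review_count)
        VALUES (%s, %s, 1)
        ON DUPLICATE KEY UPDATE review_count = review_count + 1
        """,
        (review_day, product_id))
    conn.commit()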
I agree with Rick; summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is indexed, since you are using it in your WHERE clause.
We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would move all our data there; however, we are initially focusing on the archived data as a test.
Our language of choice is Python, and we have been able to transfer data from MySQL to the Datastore row by row.
We have approximately 120 million rows to transfer, and a one-row-at-a-time approach will take a very long time.
Has anyone found documentation or examples on how to bulk insert data into Cloud Datastore using Python?
Any comments or suggestions are appreciated. Thank you in advance.
There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be:
Queuing script pulls data out of your MySQL instance and puts it on a queue.
(Many) Workers pull from this queue, and try to write the item to Datastore.
On failure, push the item back on the queue.
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
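A bare-bones sketch of that queue/worker pattern using the google-cloud-datastore client and a thread pool; the kind name, 'id' column, and worker count are assumptions, and retry handling is minimal:

import threading
from Queue import Queue  # Python 2-style import, matching the App Engine-era code elsewhere here

from google.cloud import datastore

NUM_WORKERS = 32          # assumed; tune to your machine, network, and quota
work_queue = Queue(maxsize=10000)

def worker():
    client = datastore.Client()          # one client per worker thread
    while True:
        row = work_queue.get()           # a dict for one MySQL row; None means stop
        if row is None:
            work_queue.task_done()
            return
        try:
            # 'ArchivedRow' and the unique 'id' column are assumptions about your schema.
            entity = datastore.Entity(key=client.key('ArchivedRow', row['id']))
            entity.update(row)
            client.put(entity)
        except Exception:
            work_queue.put(row)          # on failure, push the item back on the queue
        work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.daemon = True
    t.start()

# Producer side: iterate your MySQL cursor, work_queue.put(row_as_dict) for each row,
# then put one None per worker and work_queue.join() to wait for completion.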
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading it into Cloud Datastore for key-value style querying (aka, you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
+ 2 (1 x 2 write ops fixed cost per new entity)
+ 10 (5 x 2 write ops per indexed property)
= 12 "operations" per entity.
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00
I believe what you are looking for is the put_multi() method.
From the docs, you can use put_multi() to batch multiple put operations. This will result in a single RPC for the batch rather than one for each of the entities.
Example:
# a list of many entities
user_entities = [ UserEntity(name='user %s' % i) for i in xrange(10000)]
users_keys = ndb.put_multi(user_entities) # keys are in same order as user_entities
Also of note, from the docs:
Note: The ndb library automatically batches most calls to Cloud Datastore, so in most cases you don't need to use the explicit batching operations shown below.
That said, you may still, as suggested in another answer, use a task queue (I prefer the deferred library) in order to batch-put a lot of data in the background.
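A small sketch combining the two suggestions, with a hypothetical load_chunk function deferred to a task queue (the function names and chunk size are assumptions):

from google.appengine.ext import deferred, ndb

CHUNK_SIZE = 500  # assumed batch size; each deferred payload must stay small (<100KB)

def load_chunk(rows):
    # Build the entities and write them with a single batched RPC.
    entities = [UserEntity(name=row['name']) for row in rows]
    ndb.put_multi(entities)

def enqueue_import(all_rows):
    # Fan the import out into background tasks, one per chunk.
    for i in range(0, len(all_rows), CHUNK_SIZE):
        deferred.defer(load_chunk, all_rows[i:i + CHUNK_SIZE])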
As an update to the answer of @JJ Geewax: as of July 1st, 2016,
the cost of read and write operations has changed, as explained here: https://cloud.google.com/blog/products/gcp/google-cloud-datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases
So writing should have become cheaper for the described case, as
writing a single entity only costs 1 write regardless of indexes and will now cost $0.18 per 100,000
I have about 1000 user account entities like this:
class UserAccount(ndb.Model):
email = ndb.StringProperty()
Some of these email values contain uppercase letters, like JohnathanDough@email.com. I want to select all the email values from all UserAccount entities and apply Python's .lower() to them. How can I do this efficiently and, most importantly, without errors?
Note: the email values are important for login, so I cannot afford to mess this up. Is there a way to back up this data in case I do make a mistake?
Thank you.
Yes, of course. Even though Datastore Administration is an experimental feature, we can back up and restore data without coding. Follow these instructions for the backup flow: Backing up data.
To process your data, the most efficient way is to use the MapReduce library.
MapReduce works, but it's an excessive complication if you've never done it before.
Use task queues: each task can handle one page of query results, store the next page token, and start another task for the next page.
That is slower than MapReduce if you run the tasks sequentially, but 1,000 entities is not much; it will probably be done in about a minute.
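For illustration, a sketch of that task-queue approach using an ndb query cursor and the deferred library (the page size and function name are assumptions):

from google.appengine.ext import deferred, ndb

PAGE_SIZE = 100  # assumed page size per task

def lowercase_emails(cursor=None):
    accounts, next_cursor, more = UserAccount.query().fetch_page(
        PAGE_SIZE, start_cursor=cursor)
    for account in accounts:
        account.email = account.email.lower()
    ndb.put_multi(accounts)          # one batched write per page
    if more:
        # Chain the next page as a separate task.
        deferred.defer(lowercase_emails, cursor=next_cursor)

# Kick it off, e.g. from a one-off admin handler, after taking the backup:
# deferred.defer(lowercase_emails)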
I am trying to query my Google App Engine datastore [Python], which has item_name, manufacturing_date and number_of_items_shipped properties. There are ~1.0 million records in the datastore, and the number is ever increasing.
The scenario:
Get all the item_names which have been shipped more than x_items [user input] and manufactured after some_date [user input].
Basically, kind of an inventory check.
Effectively, two inequalities on different properties.
But due to restrictions on queries in GAE, I am not able to do this.
I searched SO for this issue, but no luck so far. Have you come across this issue? If so, were you able to resolve it? Please let me know.
Also, in Google I/O 2010, Next Gen Queries, Alfred Fuller mentioned that they are going to remove this restriction soon. It's been more than 8 months, but unfortunately the restriction is still in place.
I'd appreciate it if anyone can post an answer if they were able to circumvent this restriction.
Thanks a lot.
Building on Sudhir's answer, I'd probably assign each record to a manufacture date "bucket", based on the granularity you care about. If your range of manufacturing dates is over a couple of years, use monthly buckets for example. If your range is just in the last year, weekly.
Now when you want to find records with > n sales and manufacturing date in a given range, do your query once per bucket in that range, and postfilter out the items you are not interested in.
For example (totally untested):
BUCKET_SIZE_DAYS = 10

def put(self):
    self.manufacture_bucket = int(self.manufacture_date.toordinal() / BUCKET_SIZE_DAYS)
    super(self.__class__, self).put()

def filter_date_after(self, date_start):
    first_bucket = int(date_start.toordinal() / BUCKET_SIZE_DAYS)
    last_bucket = int(datetime.datetime.today().toordinal() / BUCKET_SIZE_DAYS)
    for this_bucket in range(first_bucket, last_bucket + 1):
        for found in self.filter("manufacture_bucket =", this_bucket):
            if found.manufacture_date >= date_start:
                yield found
You should then be able to use this like:
widgets.filter("sold >", 7).filter_date_after(datetime.datetime(2010,11,21))
Left as an exercise for the reader:
Making it play nicely with other filters added to the end
Multiple bucket sizes allowing you to always query ln(days in date range) buckets.
Unfortunately, you can't circumvent this restriction, but I can help you model the data in a slightly different way.
First off, Bigtable is suited to very fast reads off large databases, the kind you do when you have a million people hitting your app at the same time. What you're trying to do here is a report on historical data. While I would recommend moving the reporting to an RDBMS, there is a way you can do it on Bigtable.
First, override the put() method on your item model to split the date before saving it. What you would do is something like:
def put(self):
    self.manufacture_day = self.manufacture_date.day
    self.manufacture_month = self.manufacture_date.month
    self.manufacture_year = self.manufacture_date.year
    super(self.__class__, self).put()
You can do this to any level of granularity you want, even hours, minutes, seconds, whatever.
You can apply this retroactively to your database by just loading and saving your item entities. The mapper is very convenient for this.
Then change your query to use the inequality only on the item count, and select the days/months/years you want using normal equality filters. You can do ranges by either firing multiple queries or using the IN clause (which does the same thing anyway).
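A sketch of what such a query could look like with the old db API used above, where manufacture_year/manufacture_month come from the overridden put() and Item, x_items are placeholders (illustrative only, not tested against the asker's schema):

# Inequality only on the count; the date range is expressed with equality/IN filters.
query = (Item.all()
         .filter('manufacture_year =', 2011)
         .filter('manufacture_month IN', [1, 2, 3])        # e.g. Jan-Mar
         .filter('number_of_items_shipped >', x_items))
matching_names = [item.item_name for item in query]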
This does seem contrived and tough to do, but keep in mind that your reports will run almost instantaneously if you do this, even when millions of people try to run them at the same time. You might not need this kind of scale, but well... that's what you get :D
I'm fairly new to Google App Engine and Python, but I just released my first real-world site with it. Now I'm having problems with one path that is using significantly more CPU (and API CPU) time than the other paths. I've narrowed it down to a single datastore fetch that's causing the problem: Carvings.all().fetch(1000)
Under the App Engine dashboard it's reporting "1040cpu_ms 846api_cpu_ms" pretty reliably for each request to that path. It has seemed like this may be the source to some unresponsiveness that my client has experienced with the site in general.
So I can't figure out what is so expensive about this query. Here is the related data model:
class Carving(db.Model):
    title = db.StringProperty(required=True)
    reference_number = db.StringProperty()
    main_category = db.StringProperty()
    sub_category = db.StringProperty()
    image = db.ReferenceProperty(CarvingImage)
    description = db.TextProperty()
    price = db.FloatProperty()
    size = db.StringProperty()
    material = db.StringProperty()
    added_at = db.DateTimeProperty(auto_now_add=True)
    modified_at = db.DateTimeProperty(auto_now=True)
In other places in the app where I pull this model from the datastore I do more filtering, and I guess that's why those paths aren't causing any trouble. But the total number of entities for this model is just above 90, and I just can't imagine why this fetch is so expensive.
Memcache, if you haven't already, and especially if the same carvings are going to be fetched again and again. If you only have 90 total, I would imagine they would all be in the cache pretty quickly, and then you should be golden.
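A minimal sketch of caching that fetch with the memcache API (the cache key and expiry are assumptions; the query is the one from the question):

from google.appengine.api import memcache

CARVINGS_CACHE_KEY = 'all-carvings'   # assumed cache key
CACHE_SECONDS = 600                   # assumed expiry; invalidate on writes if you prefer

def get_all_carvings():
    carvings = memcache.get(CARVINGS_CACHE_KEY)
    if carvings is None:
        carvings = Carving.all().fetch(1000)   # the expensive query from the question
        memcache.set(CARVINGS_CACHE_KEY, carvings, time=CACHE_SECONDS)
    return carvings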
Do you need all the properties of the Carvings? For example, if you're just displaying a list of carvings, you could have a separate Entity that was something like CarvingSummary that only had a few properties. This would mean your schema was denormalized, but sometimes that's the price you pay for speed.
Also, I'm assuming this is not the first page the user will always hit? If that were the case, it could be the cloud spinning up a new instance.
Sometimes you'll get better performance if you do an indexed query, rather than a query of "all" elements in the model.
Also, consider using memcache.
Do you actually need 1000 entities? CPU time goes up more or less linearly with the number of results retrieved, so if you don't actually need all the results, you may be wasting a lot of time fetching and decoding them.
It could be the image (and/or the Text property) that is taking time to load and marshal into objects, depending on how big those properties are.
First prize: just use the memcache as others say. Then the overhead is incurred only on the first hit.
Second prize: I'm not sure how often your images are being changed and how many you might have, but you could consider uploading them as static files and simply linking to them in your HTML. Then it'd be just an HTTP GET from the browser - much lower overhead.