Google App Engine excessive datastore small operations - Python

My site has about 50 users, and I am getting excessive small datastore operations. I am aggressively memcaching and don't have that many records, yet I still get millions of small datastore operations. Appstats says the cost is 0, yet the real cost is not 0.
I basically know where the small datastore operations might occur.
Key-only operations: I do these, but I memcache the results until the data changes. Also, most of my key-only operations have limit=100 (the maximum I use), so to reach 12 million operations I would need to make 120,000 calls (I am assuming fetching one key counts as one small operation). As I get about 60-70 visits a day, that seems a bit excessive.
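For reference, a rough sketch of the pattern described above (the Post model and cache key are hypothetical): a keys-only fetch with limit=100 whose result is kept in memcache until the data changes.

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Post(ndb.Model):
    title = ndb.StringProperty()

CACHE_KEY = 'post_keys'

def get_post_keys():
    keys = memcache.get(CACHE_KEY)
    if keys is None:
        # Each key returned here is billed as one datastore small operation.
        keys = Post.query().fetch(100, keys_only=True)
        memcache.set(CACHE_KEY, keys)
    return keys

def invalidate_post_keys():
    # Call this whenever a Post is added or removed so the cached list gets rebuilt.
    memcache.delete(CACHE_KEY)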
I just can't figure out what is causing that many operations. Appstats gives me no clue.
This is the dashboard.
This is the appstats.

Are you using lots of counts? count() calls seem to be a common cause of excessive datastore small operations.
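If counts are indeed the source: count() is executed as a keys-only scan, so it is billed as roughly one small operation per entity counted. A rough sketch (with a hypothetical Comment model) of caching the result instead of recounting on every request:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Comment(ndb.Model):
    post_id = ndb.StringProperty()

def comment_count(post_id):
    cache_key = 'comment_count:%s' % post_id
    count = memcache.get(cache_key)
    if count is None:
        # This is the call that generates the small operations.
        count = Comment.query(Comment.post_id == post_id).count()
        memcache.set(cache_key, count, time=300)  # recount at most every 5 minutes
    return count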
I don't have your code, but this answer has some suggestions for optimizing your code when experiencing this problem.
Also, take a look at the similar question "Google app engine excessive small datastore operations" for related answers.

I notice this old question isn't yet solved, so based on your info, here is another potential cause.
Running my GAE SDK on a very fresh public Azure VM instance (xxx.cloudapp.net), I've noticed a lot of bot traffic coming in, trying to find the admin pages of common open-source CMS or shopping-cart packages. I believe this is due to the bots either using AXFR requests or brute-forcing subdomain names.
Make sure that you are blocking any unwanted bot traffic rather than serving it a dynamic page that hits your datastore.
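As a rough sketch of that idea (the path list is just a guess at common probe targets), a catch-all webapp2 handler could turn such requests away before any datastore or memcache work happens:

import webapp2

BOT_PATH_PREFIXES = ('/wp-admin', '/wp-login', '/administrator', '/phpmyadmin')

class CatchAllHandler(webapp2.RequestHandler):
    def get(self):
        if self.request.path.startswith(BOT_PATH_PREFIXES):
            # Cheap static 404: no datastore or memcache access for bot probes.
            self.response.set_status(404)
            self.response.write('Not found')
            return
        # ... normal dynamic page handling would go here ...

app = webapp2.WSGIApplication([('/.*', CatchAllHandler)])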
The same condition can also be caused by a rogue AJAX request looping on every page that those 50 users request.

Related

Django and Rails with one common DB

I have previously worked with Java + Spring to create a web app.
I have to build a new web app now.
It will have one centralized DB.
There will be two different types of web-app instances.
Web-App 1:
a) It would have no UI to render: no HTML, JS, etc.
b) All it needs to provide is a set of REST APIs which will:
b.1) create some new entries in the DB
b.2) modify some entries in the DB
b.3) retrieve some DB records in JSON format.
Some frontend code (which doesn't belong to this app) will periodically fetch these details.
c) It will be used by at most 100,000 people, but at any given point in time we can expect about 1,000 users logged in and doing what is described in b).
Web-App 2:
a) It will have some dashboards.
b) 90% of DB operations would be read operations.
c) 10% of DB operations would be write/modify operations.
d) There will be thousands of users of this system, but at any given point in time hardly 50-1,000 people will be accessing it.
I am thinking of the following:
Have Web-App 1 created in Python + Django and Web-App 2 created in RoR.
I am planning to use DynamoDB and memcache.
Why two different frameworks?
1) So that I get to learn both of them.
2) There have been concerns about scalability in RoR (and I also know people claim it isn't an issue), and Web-App 1 may have scaling needs in the future.
My question is: do you see any problem with this combination? For example, ActiveRecord expects a specific naming format for your database tables. Are there any other concerns similar to this?
Has anyone else used a similar technology stack?
Both frameworks are full-stack frameworks and provide MVC, templating, unit testing, security, DB migrations, caching, and ORMs.
For my startup, we also needed to put out a fully fledged website along with an API. We are also using DynamoDB for storing most of the data and are only using MySQL for session info.
I opted to use Ruby on Rails for the web app and Sinatra for the API. If your criterion is simply learning as many new things as possible, then it would make sense to opt for relatively different stacks (Django/Python and RoR). In our case, we went with Sinatra because it's essentially a very lightweight wrapper around Rack and perfect for an API that receives requests, calls one or more services or does some processing, and hands back a formatted response. While I don't see any problem with using Python/Django instead of Sinatra, in our case the benefit was spending less time working with yet another language.
Also, scalability in Rails is a bit of an iffy subject. In the end, it's about how you use it. We've had no issues scaling Rails with Unicorn and nginx. Our business logic is all in the API service, and the Rails server itself uses the API for most of its work. This means we don't use ActiveRecord on Rails, and the website is just another consumer of our API, which does all the heavy lifting whether the request comes from an app or the website. Using MySQL for the session store ensures we can route requests to any of the application servers without having to worry about always routing requests from the same client to the same server. This allows us to scale up and down easily based only on the amount of traffic we're getting.
At the time we started working on this, there wasn't an ORM for DynamoDB that looked and felt just like ActiveRecord, so we ended up writing a few high-level classes of our own to handle storage and retrieval of models on DynamoDB. Considering DynamoDB is not tailored for scans or joins, this didn't take a lot of effort, since we were almost always doing lookups based on keys and ranges. This meant we didn't really need a replacement for ActiveRecord, since the real strength of ActiveRecord is being able to intuitively do joins, etc., by convention.
DynamoDB does have its limitations, though, and you might find yourself in situations where you need to scan a large number of records. In our case, we also use CloudSearch to index some important info and use it as a fallback for the cases where we need to do text-based searches that would otherwise have to scan all our data.
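Those storage classes were written in Ruby, but as a rough Python illustration (using boto3, with a hypothetical table and attribute names) of the key-and-range access pattern DynamoDB handles well, as opposed to the scans it doesn't:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
orders = dynamodb.Table('orders')  # hypothetical table: user_id (hash key) + created_at (range key)

def orders_for_user(user_id, since_iso_timestamp):
    # Query by hash key plus a range condition on the sort key; no scan involved.
    response = orders.query(
        KeyConditionExpression=Key('user_id').eq(user_id) &
                               Key('created_at').gte(since_iso_timestamp)
    )
    return response['Items']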

Datastore Viewer missing rows for hours, out of sync with frontend

I have a tiny python app engine project that exposes an endpoint which creates an instance of a certain ndb.Model and then calls put() on it, returning the model as JSON.
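For context, a minimal sketch of the kind of endpoint described (the model and route names here are hypothetical, not the actual code):

import json
import webapp2
from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()

class CreateItemHandler(webapp2.RequestHandler):
    def post(self):
        item = Item(name=self.request.get('name'))
        item.put()  # synchronous write; the entity is durable once this returns
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps({'id': item.key.id(), 'name': item.name}))

app = webapp2.WSGIApplication([('/items', CreateItemHandler)])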
My frontend is behaving as expected (the models seem to be created and returned, no errors in the HTTP request or logs console), but the Datastore Viewer is wildly out of sync: Only one of the ~30 objects created today is shown (which happens to be the most recent object), and it took ~10 minutes for that to happen.
Datastore Statistics similarly shows only 8 entities of this type (where I expect more than 30), but the timestamp for this page shows it has not been updated for almost 48 hours, despite the claim that statistics are updated at least once per day.
I know the datastore is eventually consistent, but this has been going on for several hours; either I have a bug (in my trivial program) or something is up with app engine.
I'm a seasoned engineer but new to app engine; any suggestions for common pitfalls or areas I should look into?
Edit: Seems to be a known/ongoing issue with datastore replication that isn't reported on the status dashboard :-\
I have submitted a new defect report about Datastore Statistics not updating at least once every 24 hours. Please star it and/or chime in with comments.
Known issue with datastore replication at the moment (thanks Tim!)
I guess the lesson is, you probably want to subscribe to google-appengine-downtime-notify for notification of outages, since the issue was reported there but not on the appengine dashboard.

Django on Google App Engine: performance how-to

I asked this question a few weeks ago. Today I have actually written and released a standard Django application, i.e. a fully functional relational-DB-backed app (and consequently a fully functional Django admin), enabled by Google Cloud SQL. The only time I had to deviate from doing things the standard Django way was for sending email (I had to do it the GAE way). My setup is GAE 1.6.4, Python 2.7, and Django 1.3, using the following in app.yaml:
libraries:
- name: django
  version: "1.3"
However, I do need you to suggest clear, actionable steps to improve the response time of the initial request when the app is cold. I have a simple webapp2 site on GAE which does not hit the DB, and when cold its response time is 1.56s. The Django one, when cold, hits the DB with two queries (two count(*) queries over tables containing fewer than 300 rows each), and its response time is 10.73s! Not encouraging for a home page ;)
Things that come to mind are removing the middleware classes I don't need and other Django-specific optimisations. However, tips that also improve things from a GAE standpoint would be really useful.
N.B. I don't want this to become a discussion about the merits of going for Django on GAE. I can mention that my personal Django expertise, and the resulting development productivity, weighed considerably in adopting Django as opposed to other frameworks. Moreover, with Cloud SQL it's easy to move away from GAE (hopefully not!), as the Django code will work everywhere else with little (or no) modification. Related discussions about this topic can be found here and here.
I don't have a full answer, but I'm contributing since I'd like to find a solution as well. I'm currently using a recurring cron job (I actually need the cron job, so it's not only there to keep my app alive).
I've seen it discussed in one of the GAE/Python/Django-related mailing lists that just the time required for loading all the Django files is significant compared to webapp, so removing Django components that you don't use from the deployment should improve your startup time as well. I've been able to shave off about 3 seconds by removing certain parts of the contrib folder; I exclude them in my app.yaml.
My startup time is still around 6 seconds (full app, Django-nonrel, HRD). It used to be more like 4 when my app was simpler.
My suspicion is that Django verifies all its models on startup and that this processing time is significant. If you have time to experiment with an app with absolutely zero models, I'd be curious whether it makes any impact.
I'm also curious as to whether your two initial queries make any significant impact.
When there is no instance running, for example after a version upgrade or when there has been no request for 15 minutes, a request will trigger the loading of a new instance, which takes about 10s. So what you are seeing is normal.
So if your app is idle for periods longer than 15 minutes, you will see this behavior. One workaround is to have a cron job ping your instance every 10 minutes (though I believe Google does not like that).
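If you do go the cron-ping route, keep the pinged view as cheap as possible. A minimal sketch, assuming a hypothetical /keep-warm URL (the schedule itself would live in cron.yaml):

# views.py
from django.http import HttpResponse

def keep_warm(request):
    # Deliberately does no DB or cache work; its only job is to keep an instance loaded
    # so that real requests don't pay the cold-start penalty.
    return HttpResponse('ok', content_type='text/plain')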
Update:
You can avoid this by enabling billing and then, in the GAE admin under "Application settings", setting the Minimum Idle Instances setting to 1. Note: the minimum setting is not available on free apps, only the maximum.

Django Passing data between views

I was wondering what the 'best' way of passing data between views is. Is it better to create invisible fields and pass the data using POST, or should I encode it in my URLs? Or is there a better/easier way of doing this? Sorry if this question is stupid; I'm pretty new to web programming :)
Thanks
There are different ways to pass data between views. Actually, this is not much different from the problem of passing data between two different scripts, and of course some concepts of inter-process communication come in as well. Some things that come to mind are:
GET request - the first request hits view1 -> view1 sends data to the browser -> the browser redirects to view2.
POST request - (as you suggested) same flow as above, but suitable when more data is involved.
Django session variables - this is the simplest to implement (see the sketch after this list).
Client-side cookies - can be used, but there are limitations on how much data can be stored.
Shared memory at the web server level - tricky, but can be done.
REST APIs - if you can have a stand-alone server, then that server can expose REST APIs to invoke the views.
Message queues - again, if a stand-alone server is possible, message queues would also work, i.e. the first view (API) takes requests and pushes them onto a queue, and some other process can pop messages off and hit your second view (another API). This would decouple the first and second view APIs and possibly manage load better.
Cache - a cache like memcached can act as a mediator. But if one is going this route, it's better to use Django sessions, as they hide a whole lot of implementation details; if scale is a concern, memcached or Redis are good options.
Persistent storage - store data in some persistent storage mechanism like MySQL. This decouples your request-taking part (probably a client-facing API) from the processing part by having a DB in the middle.
NoSQL storage - if write rates are on the order of hundreds of thousands per second, MySQL's performance would become a bottleneck (there are ways around this by tweaking the MySQL config, but it's not easy). Then NoSQL DBs could be an alternative, e.g. DynamoDB, Redis, HBase, etc.
Stream processing - something like Storm or AWS Kinesis could be an option if your use case is real-time computation. In fact, you could use AWS Lambda in the middle as a serverless compute module which would read off the stream and call your second view API.
Write data into a file - then the next view can read from that file (really ugly). This probably should never be done, but I'm listing it here as something to avoid.
Can't think of any more. Will update if I do. Hope this helps in some way.
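As a quick illustration of the session-variable option mentioned above (view names, URL name, and template are made up):

# views.py
from django.shortcuts import redirect, render

def pick_items(request):
    # First view: stash the data in the session, then hand off to the next view.
    request.session['selected_ids'] = [1, 2, 3]
    return redirect('show_items')

def show_items(request):
    # Second view: read the data back out of the session.
    selected_ids = request.session.get('selected_ids', [])
    return render(request, 'items.html', {'selected_ids': selected_ids})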

Web Scraper: Limit to Requests Per Minute/Hour on Single Domain?

I'm working with a librarian to re-structure his organization's digital photography archive.
I've built a Python robot with Mechanize and BeautifulSoup to pull about 7,000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted into a spreadsheet he can use to correct it. Right now I'm guesstimating 7,500 HTTP requests in total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, and then many more as the project progresses.
I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there isn't, I'll give my robot delays so it behaves politely toward the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is: about how quickly can I make HTTP requests before encountering a built-in rate limit?
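As a minimal sketch of the "behave politely" approach with Mechanize: a fixed delay between requests, with a longer back-off when the server signals it is overloaded (the delay values here are arbitrary):

import time
import mechanize

DELAY_SECONDS = 2  # arbitrary; pick whatever the site's admins would be comfortable with

br = mechanize.Browser()
br.set_handle_robots(True)  # respect robots.txt

def polite_fetch(url, retries=3):
    for attempt in range(retries):
        try:
            response = br.open(url)
            time.sleep(DELAY_SECONDS)
            return response.read()
        except mechanize.HTTPError as e:
            if e.code in (429, 503):
                # The server is pushing back: wait longer before retrying.
                time.sleep(DELAY_SECONDS * (attempt + 2))
            else:
                raise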
I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.
Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.
They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.
There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for Crawl-delay in robots.txt.
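For example, checking for that hint programmatically (Python 3.6+, where urllib.robotparser has a crawl_delay() helper; the URL is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.org/robots.txt')
rp.read()

delay = rp.crawl_delay('*')  # None if the site does not specify a Crawl-delay
if delay:
    print('Site requests a crawl delay of %s seconds' % delay)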
HTTP/1.1 nominally limits clients to two concurrent connections per server, but browsers have already started ignoring that, and efforts are underway to revise that part of the standard as it is quite outdated.
