Google App Engine request logs are duplicated with small time difference - python

Looking into the GAE logs of a Python project, I found a number of duplicated or even triplicated entries with very small differences in their timestamps.
These requests are triggered by an iPhone app that only sends unique data, so it seems extremely unlikely that the duplication comes from the phone, especially if you take the time differences between the requests into account.
00:53:32.139 POST 200 93B 1.6 s APPNAME/1.2 CFNetwork/758.4.3 Darwin/15.5.0 /logData
00:53:32.142 POST 200 93B 930 ms APPNAME/1.2 CFNetwork/758.4.3 Darwin/15.5.0 /logData
00:53:32.279 POST 200 93B 835 ms APPNAME/1.2 CFNetwork/758.4.3 Darwin/15.5.0 /logData
The requests are identical (source IP, headers and so on), with the same data inside:
{u'version': 1.2, u'data': u'some data', u'user': u'0a9b....0a57'}
So the actual question is: how is this possible?
Is there an explanation for such short intervals between the duplicated log entries?

It happened because of asynchronous tasks: when the iPhone gets a location from its location manager, and the OS has some free time to run our code, the application fires an HTTP request to the /logData endpoint. At the same time, user activity triggers one more HTTP request with the same payload. The data in local variables is only removed after the acknowledgment that it was received (HTTP response 200), so because the requests were triggered almost simultaneously, both of them got into the database and into the GAE logs.
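For anyone hitting the same race, here is a minimal sketch of one way to make such an endpoint tolerant of near-simultaneous duplicates, assuming the first-generation GAE Python runtime (webapp2 plus the memcache API); the payload-hash fingerprint and the 10-second window are illustrative choices, not something the original app does:

import hashlib
import webapp2
from google.appengine.api import memcache

class LogDataHandler(webapp2.RequestHandler):
    def post(self):
        # Fingerprint the raw body; identical duplicate requests hash to the same key.
        fingerprint = hashlib.sha1(self.request.body).hexdigest()
        # memcache.add() only succeeds if the key does not exist yet, so the second
        # near-simultaneous request is detected and acknowledged without storing again.
        if not memcache.add('logdata:' + fingerprint, True, time=10):
            self.response.set_status(200)
            return
        # ... store the payload in the datastore here ...
        self.response.set_status(200)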

All data stored by Google is triplicated so that there is always a backup, even if one of the nodes goes down.
"This is a big factor, because as your data grows, the likelihood of losing the data becomes very real. Most companies deal with this by making multiple copies of the data in different datacenters in different locations. Google also very obviously replicates important data at least 3x, as the GFS paper suggests."
-- https://www.quora.com/How-does-Google-store-their-data
So it's not surprising that the logs are triplicated onto different servers. Those servers might have slightly different timestamps as you have observed. For me, the question is why you got all three copies of the log entry.

Related

Alphavantage: Requests from different api keys but from the same system

I have written a Python script that fetches prices from Alpha Vantage using an API key. I have two different Alpha Vantage API keys (mine and my brother's), and the script requests data with each key separately, but from the same laptop. Even though I request separately with separate keys, I still get the maximum API call frequency exceeded error (5 API calls per minute per key).
My best guess is that Alpha Vantage knows the requests are coming from the same location or the same system. Is there any workaround for this problem? Maybe bounce my signal (I sort of found an answer, but I don't know if that's really the issue) or pretend the requests are coming from different systems?
Your IP is limited to 5 requests per minute regardless of how many keys you cycle through, so you would need to cycle through IP addresses as well.
I ran into the same issue and just ended up buying their premium key; it's well worth it. Alternatively, you can add a time delay and run the script in the background so it doesn't trip the limit.
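A minimal sketch of that time-delay approach, assuming the standard Alpha Vantage query endpoint (the symbols, the TIME_SERIES_DAILY function, and the 13-second spacing are just illustrative):

import time
import requests

API_URL = "https://www.alphavantage.co/query"

def fetch_daily(symbol, api_key):
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return requests.get(API_URL, params=params).json()

for symbol in ["AAPL", "MSFT", "IBM"]:
    print(fetch_daily(symbol, "YOUR_KEY"))
    time.sleep(13)  # a bit more than 60 s / 5 calls, so one IP stays under the limit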
You can also try the Polygon API; it allows 60 requests per minute for free:
https://polygon.io/pricing

How do I make my program wait if it returns overflow error?

So I'm trying to get data from an API which has a max call limit of 60/min. If I go over this it will return a response [429] too many requests.
I thought maybe a better way was to keep requesting the API until I get a response [200] but I am unsure how to do this.
import json
import requests

# key = "..."  # Steam Web API key, defined elsewhere in the script
r = requests.get("https://api.steampowered.com/IDOTA2Match_570/GetTopLiveGame/v1/?key=" + key + "&partner=0")
livematches = json.loads(r.text)['game_list']
It usually runs fine, but if the API returns anything other than a response [200], my program fails and anyone who uses the website won't see anything.
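For what the question literally asks (retrying until a [200] comes back), a rough sketch might look like the following, with an exponential backoff on [429]; the answer below argues that a cache layer is the better fix:

import time
import requests

def get_top_live_games(key, max_retries=5):
    url = ("https://api.steampowered.com/IDOTA2Match_570/GetTopLiveGame/v1/"
           "?key=" + key + "&partner=0")
    for attempt in range(max_retries):
        r = requests.get(url)
        if r.status_code == 200:
            return r.json()['game_list']
        if r.status_code == 429:
            time.sleep(2 ** attempt)  # back off: the 60 calls/minute budget is spent
        else:
            r.raise_for_status()      # some other error: fail loudly
    raise RuntimeError("Steam API kept returning 429")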
Generally with cases like this, I would suggest adding a layer of cache indirection. You definitely don't want to (and can't) solve this on the frontend like you suggested, since that's not what the frontend of your app is meant for. Sure, you can add the "wait" ability, but all someone has to do is pull up Chrome Developer Tools to grab your API key, and then they can call the API as many times as they want.
Think of it like this: say you have a chef who can only cook 10 things per hour. Someone can easily come into the restaurant, order 10 things, and then nobody else can order anything for the next hour, which isn't fair. Instead, add a "cache" layer, which means that you only call the Steam API every couple of seconds. If 5 people request your site within, say, 5 seconds, you only go to the Steam API on the first call, then you save that response. To the other 4 (and anyone who comes within those next few seconds), you return a "cached" version.
The two main reasons for adding a cache API layer are the following:
You stop exposing the key from the frontend. You never want to expose your API key to the frontend directly like this, since anyone could just take your key, run many requests, and then bam, your site is down (a denial-of-service attack becomes trivial). You would instead have users hit your custom mysite.io/getLatestData endpoint, which doesn't need the key on the client side since the key would be securely stored in your backend.
You won't run into the rate-limiting issue. If your cache only hits the API once every minute, users will never be locked out of your site by the API limit, since they get cached results instead.
This may be a bit tricky to visualize, so here's how this works:
You write a little API in your favorite server-side language. Let's say NodeJS. There are lots of resources for learning the basics of ExpressJS. You'd write an endpoint like /getTopGame that is attached to a cache like redis. If there's a cached entry in the redis cache, return that to the user. If not, go to the steam API and get the latest data. Store that with an expiration of, say, 5 seconds. Boom, you're done!
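The answer sketches this in NodeJS/Express with Redis; since this thread is Python-centric, here is the same idea as a rough Flask sketch with a naive in-process cache (Flask, the route name, and the 5-second expiry are assumptions for illustration, not part of the original answer):

import time
import requests
from flask import Flask, jsonify

app = Flask(__name__)
STEAM_KEY = "YOUR_SERVER_SIDE_KEY"   # kept on the server, never shipped to the browser
_cache = {"expires": 0.0, "data": None}

@app.route("/getTopGame")
def get_top_game():
    now = time.time()
    if now >= _cache["expires"]:
        # Cache miss or expired: hit the Steam API once and remember the result.
        r = requests.get(
            "https://api.steampowered.com/IDOTA2Match_570/GetTopLiveGame/v1/",
            params={"key": STEAM_KEY, "partner": 0},
        )
        _cache["data"] = r.json()
        _cache["expires"] = now + 5  # serve this cached copy for the next 5 seconds
    return jsonify(_cache["data"])

A shared cache like Redis does the same job across multiple server processes; the module-level dict above only works within a single process, which is why the answer suggests Redis.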
It seems a bit daunting at first, but since you said you're a college student, this is a great way to learn a lot about how backend development works.

Django, global variables and tokens

I'm using Django to develop a website. On the server side, I need to transfer some data that must be processed on a second server (on a different machine). I then need a way to retrieve the processed data. I figured the simplest approach would be for the second server to send a POST request back to the Django server, handled by a view dedicated to that job.
But I would like to add some minimal security to this process: when I transfer the data to the other machine, I want to attach a randomly generated token to it. When I get the processed data back, I expect to also get the same token back; otherwise the request is ignored.
My problem is the following: How do I store the generated token on the Django server?
I could use a global variable, but from what I've read here and there on the web, global variables should not be used, for safety reasons (not that I really understand why).
I could store the token on disk or in the database, but that seems like an unjustified waste of performance (even if, in practice, it probably wouldn't change much).
Is there a third solution, or a canonical way to do such a thing in Django?
You can store your token in the Django cache; in most cases it will be faster than database or disk storage.
Another approach is to use Redis.
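A minimal sketch of the cache option, assuming some job_id identifies the outgoing transfer (that identifier and the one-hour timeout are illustrative, not from the question):

from django.core.cache import cache

# When dispatching the data to the second server, remember the token for a while.
cache.set('processing_token:%s' % job_id, token, timeout=3600)

# In the view that receives the processed data, look it up again.
expected = cache.get('processing_token:%s' % job_id)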
You can also calculate your token:
save some random secret token in the settings of both servers
calculate a token based on the current timestamp rounded to 10 seconds, for example using:
import hashlib

token = hashlib.sha1(secret_token.encode('utf-8'))   # secret shared via settings
token.update(str(rounded_timestamp).encode('utf-8'))
token = token.hexdigest()
If the token generated on the remote server when POSTing the request matches the token generated on the local server when receiving it, the request is valid and can be processed.
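A rough sketch of that check inside a Django view, using a constant-time comparison (the view name and the compute_token() helper wrapping the recipe above are hypothetical):

import hmac
from django.http import HttpResponse, HttpResponseForbidden

def receive_processed_data(request):        # hypothetical view name
    expected = compute_token()              # the hashlib recipe shown above
    supplied = request.POST.get('token', '')
    if not hmac.compare_digest(expected, supplied):
        return HttpResponseForbidden()
    # ... handle the processed data ...
    return HttpResponse('ok')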
The simple, obvious solution would be to store the token in your database. Other possible solutions are Redis or something similar. Finally, you can have a look at distributed async task queues like Celery...

How to efficiently process a large data response from a REST API?

One of our clients, who will be supplying data to us, has a REST-based API. This API fetches data from the client's big-data columnar store and dumps it as the response to the requested query parameters.
We will be issuing queries as below
http://api.example.com/biodataid/xxxxx
The challenge is that the response is quite huge. For a given id it contains a JSON or XML response with at least 800-900 attributes for that single id. The client is refusing to change the service for reasons I can't cite here. In addition, due to some constraints, we will only get a 4-5 hour window daily to download this data for about 25,000 to 100,000 ids.
I have read about synchronous vs asynchronous handling of responses. What options are available for designing a data-processing service that loads this efficiently into a relational database? We use Python for data processing, MySQL as the current store for more recent data, and HBase as the backend big-data store (recent and historical data). The goal is to retrieve this data, process it, and load it into either the MySQL database or the HBase store as fast as possible.
If you have built high-throughput processing services, any pointers will be helpful. Are there any resources for creating such services, with example implementations?
PS - If this question sounds too high-level, please comment and I will provide additional details.
I appreciate your response.
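As a purely illustrative sketch of the asynchronous handling the question mentions (not from any answer in this thread), fetching ids concurrently with a bounded thread pool might look like this; the URL pattern is the one from the question, while the pool size and the load_into_store() helper are hypothetical placeholders:

import concurrent.futures
import requests

BASE_URL = "http://api.example.com/biodataid/{}"

def fetch_and_load(record_id):
    # Fetch one ~900-attribute record and hand it to whatever loads MySQL/HBase.
    response = requests.get(BASE_URL.format(record_id), timeout=60)
    response.raise_for_status()
    return load_into_store(response.json())   # hypothetical loader

ids = range(25000)                            # the 25,000-100,000 ids mentioned above
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for _ in pool.map(fetch_and_load, ids):
        pass                                  # collect results, log failures, retry, etc.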

A Realtime Dashboard For Logs

We have a number of Python services, many of which use Nginx as a reverse proxy. Right now, we examine requests in real time by tailing the logs found in /var/log/nginx/access.log. I want to make these logs publicly readable in aggregate on a webserver so people don't have to SSH into individual machines.
Our current infrastructure has fluentd (a tool similar to logstash I'm told) sending logs up to a centralized stats server, which has Elasticsearch and kibana installed, with the idea being that kibana would serve as the frontend for viewing our logs.
I know nothing about these services. If I wanted to view our logs in real time, would this stack even be feasible? Can Elasticsearch provide real-time data with a mere second's delay? Does Kibana have out-of-the-box functionality for automatically updating the page as new log data comes in (i.e., does it have a socket connection with Elasticsearch)? Or am I falling into the wrong toolset?
Kibana is just an interface on top of Elasticsearch. It talks directly to Elasticsearch, so the data in it is as real-time as the data you are feeding into Elasticsearch. In other words, it's only as good as your collectors (fluentd in your case).
It works by having you define time series, which it uses to query data from Elasticsearch; you can then have it continuously search for keywords and visualize that data.
If by "realtime" you mean that you want the graphs to move/animate, this is also possible (it's called "streaming dashboards"), but that's not the real power of Kibana. The real power is a very rich query engine: drilling down into time series and doing calculations (top X over period Y).
If all you want is a nice moving visual to put on the wall TV, this is possible with Kibana, but keep in mind you'll have to store everything in Elasticsearch, so unless you plan on doing some other analysis you'll have to adjust your configuration. For example, set a really short TTL on the messages so that once they are visualized they are no longer available, or filter fluentd to only send across the events that you want to plot. Otherwise you'll have a disk space problem.
If that is all you want, it would be easier to grab some JavaScript charting library and use that in your portal.
I have the "access.log (or other logs) - logstash (or other ES indexer) - Kibana" pipeline setup for a number of services and logs and it works well. In our case it has more than a second of delay but that's because of buffering in logs or the ES indexer, not because of Kibana/ES itself.
You can setup Kibana to show only the last X minutes of data and refresh every Y seconds, which gives a decent real-time feel - and looks good on TVs ;)
Keep in mind that Kibana can sometimes issue pretty bad queries which can crash your ES cluster (although this seems to have vastly improved in more recent ES and Kibana versions), so do not rely on this as a permanent data store for your logs, and do not share the ES cluster you use for Kibana with apps that have stronger HA requirements.
As Burhan Khalid pointed out, this setup also gives us the ability to drill down and study specific patterns in details, which is super useful ("What's this spike on this graph?" - zoom in, add a couple filters, look at a few example log lines, filter again - mystery solved). I think saving us from having to dig somewhere else to get more details when we see something weird is actually the best part of this solution.
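Kibana builds these "last X minutes" queries for you, but to make the idea concrete, here is roughly what such a query looks like against the Elasticsearch REST API from Python; the host, the logstash-* index pattern, and the @timestamp field are the usual defaults for a fluentd/logstash pipeline, so treat them as assumptions to check against your own setup:

import json
import requests

query = {
    "size": 100,
    "sort": [{"@timestamp": "desc"}],
    "query": {"range": {"@timestamp": {"gte": "now-5m"}}},   # only the last 5 minutes
}
resp = requests.post(
    "http://stats-server:9200/logstash-*/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])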
