Can I batch incrementing values in order to reduce database writes? - python

My web app uses Django on Heroku.
The app has two components: a banner serving API and a reporting API.
Banners served by the app contain an ajax call to the app's reporting API. This call is made once per impression. The purpose of this is to record how many impressions each banner gets each day.
Reporting API abridged source:
def report_v1(request):
    '''
    /report/v1/?cid=
    - cid= the ID of the banner
    '''
    creative = get_creative(creative_id=request.GET.get('cid'))
    if not creative:
        return HttpResponseNotFound("404 creative with that ID not found")
    day = timezone.now()
    fact, created = CreativeFact.objects.get_or_create(
        day=day,
        app=app,  # app is resolved earlier in the full (unabridged) view
        creative=creative,
        defaults={
            'impression_count': 0,
        }
    )
    fact.impression_count += 1
    fact.save()  # one database write per impression
    response = HttpResponse("200 OK")
    response["Access-Control-Allow-Origin"] = "*"
    return response
My problem: Is there any way to avoid writing to the database on every single impression? I'm currently tracking hundreds of thousands of impressions this way, and expect to track millions soon, and I think this is inefficient.

You could cache the counts and then write them to the model in increments of 100 or whatever the optimal value is:
At the top of report_v1:
if getattr(report_v1, 'cache', None) is None:
    report_v1.cache = {}
Then, after you verify that the cid is valid (after the line with the 404 response):
if cid in report_v1.cache:
    report_v1.cache[cid] += 1
else:
    report_v1.cache[cid] = 1
Then, you'd want to increment the value on the model only at certain intervals:
if not report_v1.cache[cid] % 100:
    # increment impression_count on the model
    # wipe the cached count
The downside to this approach is that the cache lives in the worker process's memory, so any counts that haven't been flushed yet are lost if the app crashes or restarts.
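Putting the pieces together, here is a minimal sketch of the idea against the view from the question (get_creative, app and CreativeFact are taken from the abridged source; the flush uses Django's F() expression so the database performs the addition atomically, and the threshold of 100 is arbitrary):

from django.db.models import F
from django.http import HttpResponse, HttpResponseNotFound
from django.utils import timezone

FLUSH_EVERY = 100  # flush the cached count to the database every N impressions

def report_v1(request):
    if getattr(report_v1, 'cache', None) is None:
        report_v1.cache = {}
    cid = request.GET.get('cid')
    creative = get_creative(creative_id=cid)
    if not creative:
        return HttpResponseNotFound("404 creative with that ID not found")
    report_v1.cache[cid] = report_v1.cache.get(cid, 0) + 1
    if report_v1.cache[cid] >= FLUSH_EVERY:
        fact, _ = CreativeFact.objects.get_or_create(
            day=timezone.now(),
            app=app,
            creative=creative,
            defaults={'impression_count': 0},
        )
        # one atomic database write for every FLUSH_EVERY impressions
        CreativeFact.objects.filter(pk=fact.pk).update(
            impression_count=F('impression_count') + report_v1.cache[cid])
        report_v1.cache[cid] = 0  # wipe the cached count
    response = HttpResponse("200 OK")
    response["Access-Control-Allow-Origin"] = "*"
    return response

With multiple dynos each process keeps its own cache and flushes independently; only the counts still sitting in memory when a process dies are lost.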


Error 404 when trying to insert an ACL to a calendar with Python client - works if I retry

Using Google Suite for Education.
I have an app that wants to:
Create a new calendar.
Add an ACL to such calendar, so the student role would be "reader".
Everything is run through a service account.
The calendar is created just fine, but inserting the ACL throws a 404 error (redacted for privacy):
<HttpError 404 when requesting https://www.googleapis.com/calendar/v3/calendars/MY_DOMAIN_long_string%40group.calendar.google.com/acl?alt=json returned "Not Found">
The function that tries to insert the ACL:
def _create_calendar_acl(calendar_id, user, role='reader'):
    credentials = service_account.Credentials.from_service_account_file(
        CalendarAPI.module_path)
    scoped_credentials = credentials.with_scopes(
        ['https://www.googleapis.com/auth/calendar'])
    delegated_credentials = scoped_credentials.with_subject(
        'an_admin_email')
    calendar_api = googleapiclient.discovery.build('calendar',
                                                   'v3',
                                                   credentials=delegated_credentials)
    body = {'role': role,
            'scope': {'type': 'user',
                      'value': user}}
    answer = calendar_api.acl().insert(calendarId=calendar_id,
                                       body=body,
                                       ).execute()
    return answer
The funniest thing is that if I retry the operation a couple of times, it finally succeeds. Hence, that's what my code does:
def create_student_schedule_calendar(email):
    MAX_RETRIES = 5
    # Get student information
    # Create calendar
    answer = Calendar.create_calendar('a.calendar.owner#mydomain',
                                      f'Student Name - schedule',
                                      timezone='Europe/Madrid')
    calendar_id = answer['id']
    counter = 0
    while counter < MAX_RETRIES:
        try:
            print('Try ' + str(counter + 1))
            _create_calendar_acl(calendar_id=calendar_id, user=email)  # This is where the 404 is thrown
            break
        except HttpError:  # this is where the 404 is caught
            counter += 1
            print('Wait ' + str(counter ** 2))
            time.sleep(counter ** 2)
            continue
    if counter == MAX_RETRIES:
        raise Exception(f'Exceeded retries to create ACL for {calendar_id}')
Anyway, it takes about four tries (between 14 and 30 seconds) to succeed, and sometimes it still fails after all the retries.
Would it be possible that the recently created calendar is not immediately available for the API using it?
Propagation is often an issue with cloud-based services. Large-scale online services are distributed across a network of machines, each of which has some level of latency: there is a discrete, non-zero amount of time that information takes to propagate through the network and update everywhere.
The fact that everything works after the first call that does not return a 404 is a good demonstration of this process.
Mitigation:
If you're creating and editing in the same function call, I suggest implementing some kind of wait/sleep to mitigate the 404s. This can be done in Python using the time library:
import time
# calendar creation code here
time.sleep(2)
# calendar edit code here
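If a fixed sleep alone isn't reliable enough, a slightly more robust variant (a sketch, not an official Calendar API recommendation) is to retry only on 404 with an increasing delay, reusing the calendar_api and body objects from the question:

import time
from googleapiclient.errors import HttpError

def insert_acl_with_backoff(calendar_api, calendar_id, body, max_retries=5):
    """Retry acl().insert() with exponential backoff while the new calendar propagates."""
    delay = 1
    for attempt in range(max_retries):
        try:
            return calendar_api.acl().insert(calendarId=calendar_id, body=body).execute()
        except HttpError as err:
            # only retry the propagation-related 404s; re-raise everything else
            if err.resp.status != 404 or attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # 1, 2, 4, 8 seconds...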

seems like web.py sessions are in app rather than on client

So the code below is more or less taken from http://webpy.org/cookbook/session
If I run the app it works as it should, i.e. the counter increments by one upon each refresh. However, if I access the app in an incognito window or another web browser, the counter does not reset. To me it seems like the session doesn't initialize with count: 0 as it should. What causes the new session to take the values of the session from another client?
import web

web.config.debug = False

urls = (
    "/", "count",
    "/reset", "reset"
)

app = web.application(urls, locals())
session = web.session.Session(app, web.session.DiskStore('sessions'),
                              {'count': 0})
session_data = session._initializer

class count:
    def GET(self):
        session_data['count'] += 1
        return str(session_data['count'])

class reset:
    def GET(self):
        session.kill()
        return ""

if __name__ == "__main__":
    app.run()
Sessions should be stored on the client but when I execute this code it seems like it is on the server, which would imply that only one user can use the app and I have to rerun the app to reset the counter.
I haven't been able to solve this for almost a week now. Pleeease help.
The example creates session values from the initial session variable; for example, session.count += 1 adds 1 to the current session's count. In your code you instead mutate session_data, which is session._initializer (a single dictionary shared by every client), so all users see the same counter. The way the documentation demonstrates creating a session variable with an initializer is:
session = web.session.Session(app, web.session.DiskStore('sessions'), initializer={'count': 0})
So, instead of doing session_data['count'] += 1, the documentation recommends doing session['count'] += 1 or session.count += 1. You would also need to update the return line in your count class.
I tested and confirmed that this works for me.
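For example, a minimal sketch of the fix against the code above:

session = web.session.Session(app, web.session.DiskStore('sessions'),
                              initializer={'count': 0})

class count:
    def GET(self):
        session.count += 1          # per-session value, keyed by the client's session cookie
        return str(session.count)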

Google BigQuery python - error paginating table

I have a large table in BigQuery which I have to go through, get all the data, and process it in my GAE app. Since my table is going to be about 4m rows, I decided I have to get the data via the pagination mechanism implemented in the code examples here: https://cloud.google.com/bigquery/querying-data
import time
import uuid

from google.cloud import bigquery

def async_query(query):
    client = bigquery.Client()
    query_job = client.run_async_query(str(uuid.uuid4()), query)
    query_job.use_legacy_sql = False
    query_job.use_query_cache = False
    query_job.begin()
    wait_for_job(query_job)

    query_results = query_job.results()

    page_token = None
    output_rows = []
    while True:
        rows, total_rows, page_token = query_results.fetch_data(max_results=200, page_token=page_token)
        output_rows = output_rows + rows
        if not page_token:
            break
    return output_rows

def wait_for_job(job):
    while True:
        job.reload()  # Refreshes the state via a GET request.
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
But when I execute it I receive an error:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
When the max_results parameter is larger than the table size it works fine. When max_results is smaller than the table size and pagination is required, I get this error.
Am I missing something?
The error indicates that your overall request handler takes too long to respond, very likely because of the multiple query_results.fetch_data iterations required by pagination.
You may want to check:
Dealing with DeadlineExceededErrors
Deadline errors: 60 seconds or less in Google App Engine
You'll probably have to re-think your app a bit, maybe try to not get the whole result immediately and instead either:
get just a portion of the result
get the result on a separate request, later on, after obtaining it in the background, for example:
via a single task queue request (10 min deadline instead of 60s) - see the sketch after this list
by assembling it from multiple chunks collected in separate task queue requests to really make it scalable (not sure if this actually works with bigquery, I only tried it with the datastore)
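As an illustration of the task-queue route on the Python 2 standard environment, here is a sketch using the deferred library (process_table and StartExportHandler are hypothetical names; async_query is the function from the question, assuming it returns output_rows):

import webapp2
from google.appengine.ext import deferred

def process_table(query):
    # runs inside a task queue request, which gets a 10-minute deadline
    rows = async_query(query)
    # ... store the rows (e.g. in the datastore or GCS) for later use ...

class StartExportHandler(webapp2.RequestHandler):
    def get(self):
        # respond immediately; the heavy pagination happens in the background
        deferred.defer(process_table, "SELECT * FROM `dataset.table`")
        self.response.write("export started")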

Python-ldap search: Size Limit Exceeded

I'm using the python-ldap library to connect to our LDAP server and run queries. The issue I'm running into is that despite setting a size limit on the search, I keep getting SIZELIMIT_EXCEEDED errors on any query that would return too many results. I know that the query itself is working because I will get a result if the query returns a small subset of users. Even if I set the size limit to something absurd, like 1, I'll still get a SIZELIMIT_EXCEEDED on those bigger queries. I've pasted a generic version of my query below. Any ideas as to what I'm doing wrong here?
result = self.ldap.search_ext_s(self.base, self.scope, '(personFirstMiddle=<value>*)', sizelimit=5)
When the LDAP client requests a size-limit, that is called a 'client-requested' size limit. A client-requested size limit cannot override the size-limit set by the server. The server may set a size-limit for the server as a whole, for a particular authorization identity, or for other reasons - whichever the case, the client may not override the server size limit. The search request may have to be issued in multiple parts using the simple paged results control or the virtual list view control.
Here's a Python3 implementation that I came up with after heavily editing what I found here and in the official documentation. At the time of writing this it works with the pip3 package python-ldap version 3.2.0.
import ldap
from ldap.controls import SimplePagedResultsControl

def get_list_of_ldap_users():
    hostname = "google.com"
    username = "username_here"
    password = "password_here"
    base = "dc=google,dc=com"

    print(f"Connecting to the LDAP server at '{hostname}'...")
    connect = ldap.initialize(f"ldap://{hostname}")
    connect.set_option(ldap.OPT_REFERRALS, 0)
    connect.simple_bind_s(username, password)

    search_flt = "(personFirstMiddle=<value>*)"  # get all users with a specific middle name
    page_size = 500  # how many users to search for in each page, this depends on the server maximum setting (default is 1000)
    searchreq_attrlist = ["cn", "sn", "name", "userPrincipalName"]  # change these to the attributes you care about

    req_ctrl = SimplePagedResultsControl(criticality=True, size=page_size, cookie='')
    msgid = connect.search_ext(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])

    total_results = []
    pages = 0
    while True:  # loop over all of the pages using the same cookie, otherwise the search will fail
        pages += 1
        rtype, rdata, rmsgid, serverctrls = connect.result3(msgid)
        for user in rdata:
            total_results.append(user)

        pctrls = [c for c in serverctrls if c.controlType == SimplePagedResultsControl.controlType]
        if pctrls:
            if pctrls[0].cookie:  # Copy cookie from response control to request control
                req_ctrl.cookie = pctrls[0].cookie
                msgid = connect.search_ext(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])
            else:
                break
        else:
            break
    return total_results
This will return a list of all users but you can edit it as required to return what you want without hitting the SIZELIMIT_EXCEEDED issue :)

Let python sleep 60 secs after it has crawled every 20 pages

I am trying to collect retweet data from the Chinese microblog Sina Weibo; you can see the following code. However, I keep hitting the API's IP request limit.
To work around this, I have to add time.sleep() calls to the code. You can see I attempted to add the line 'time.sleep(10) # to suppress the ip request limit', so Python sleeps 10 secs after crawling each page of retweets (one page contains 200 retweets).
However, that is still not sufficient to deal with the IP limit.
I am therefore planning to make Python sleep 60 secs, more systematically, after it has crawled every 20 pages. Your ideas will be appreciated.
import csv
import time

# 'api' is the Weibo API client object created elsewhere in the script

ids = [3388154704688495, 3388154704688494, 3388154704688492]
addressForSavingData = "C:/Python27/weibo/Weibo_repost/repostOwsSave1.csv"
file = open(addressForSavingData, 'wb')  # save to csv file
for id in ids:
    if api.rate_limit_status().remaining_hits >= 205:
        for object in api.counts(ids=id):
            repost_count = object.__getattribute__('rt')
            print id, repost_count
            pages = repost_count / 200 + 2  # why should it be 2? cuz python starts from 0
            for page in range(1, pages):
                time.sleep(10)  # to suppress the ip request limit
                for object in api.repost_timeline(id=id, count=200, page=page):  # get the repost_timeline of a weibo
                    """1.1 reposts"""
                    mid = object.__getattribute__("id")
                    text = object.__getattribute__("text").encode('gb18030')  # add encode here
                    """1.2 reposts.user"""
                    user = object.__getattribute__("user")  # for object in user
                    user_id = user.id
                    """2.1 retweeted_status"""
                    rts = object.__getattribute__("retweeted_status")
                    rts_mid = rts.id  # the id of weibo
                    """2.2 retweeted_status.user"""
                    rtsuser_id = rts.user[u'id']
                    try:
                        w = csv.writer(file, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                        w.writerow((mid,
                                    user_id, rts_mid,
                                    rtsuser_id, text))  # write it out
                    except:  # Exception of UnicodeEncodeError
                        pass
    elif api.rate_limit_status().remaining_hits < 205:
        sleep_time = api.rate_limit_status().reset_time_in_seconds  # time.time()
        print sleep_time, api.rate_limit_status().reset_time
        time.sleep(sleep_time + 2)
file.close()
Can you not just pace the script instead?
I suggest making your script sleep between each request instead of issuing the requests all at once, spreading them over a minute or so. This way you will also avoid any flood bans, and it is considered good behaviour.
Pacing your requests may also allow you to do things more quickly if the server does not time you out for sending too many requests.
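As a rough sketch of that kind of pacing (the delay values are arbitrary placeholders and would need tuning against Weibo's actual limits):

import time

REQUEST_DELAY = 3      # seconds to wait between individual page requests
PAGES_PER_BATCH = 20   # pages to crawl before taking a longer break
BATCH_DELAY = 60       # seconds to rest after each batch of pages

def paced(pages):
    """Yield page numbers, sleeping between requests and after every batch."""
    for i, page in enumerate(pages, start=1):
        yield page
        time.sleep(REQUEST_DELAY)
        if i % PAGES_PER_BATCH == 0:
            time.sleep(BATCH_DELAY)

# usage inside the crawl loop: for page in paced(range(1, pages)): ...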
If there is a limit on the IP, sometimes there are no great and easy solutions. For example, if you run Apache, http://opensource.adnovum.ch/mod_qos/ limits bandwidth and connections; specifically, it limits:
The maximum number of concurrent requests
Limitation of the bandwidth such as the maximum allowed number of requests per second to an URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second
Generic request line and header filter to deny unauthorized operations.
Request body data limitation and filtering
The maximum number of allowed connections from a single IP source address or dynamic keep-alive control.
You may want to start with these. You could send referrer URLs with your requests and make only single connections, not multiple connections.
You could also refer to this question
I figured out the solution:
First, initialize a counter, e.g. 0:
i = 0
Second, in the for page loop, add the following code:
for page in range(1, 300):
    i += 1
    if (i % 25 == 0):
        print i, "find i which could be exactly divided by 25"
        time.sleep(60)  # rest for 60 seconds after each batch of pages
