How to keep track of players' rankings? - python

I have a Player class with a score attribute:
class Player(game_engine.Player):
    def __init__(self, id):
        super().__init__(id)
        self.score = 0
This score increases/decreases as the player succeeds or fails at objectives. Now I need to tell the player his rank out of the total number of players with something like
print('Your rank is {0} out of {1}')
First I thought of having a list of all the players, and whenever anything happens to a player:
I check if his score increased or decreased
find him in the list
move him until his score is in the correct place
But this would be extremely slow. There can be hundreds of thousands of players, and a player can reset his own score to 0 which would mean that I'd have to move everyone after him in the stack. Even finding the player would be O(n).
What I'm looking for is a high performance solution. RAM usage isn't quite as important, although common sense should be used. How could I improve the system to be a lot faster?
Updated info: I'm storing a player's data into a MySQL database with SQLAlchemy every time he leaves the game server, and I load it every time he joins the server. These are handled through 'player_join' and 'player_leave' events:
#Event('player_join')
def load_player(id):
    """Load player into the global players dict."""
    session = Session()
    query = session.query(Player).filter_by(id=id)
    players[id] = query.one_or_none() or Player(id=id)

#Event('player_leave')
def save_player(id):
    """Save player into the database."""
    session = Session()
    session.add(players[id])
    session.commit()
Also, the player's score is updated upon 'player_kill' event:
#Event('player_kill')
def update_score(id, target_id):
    """Update players' scores upon a kill."""
    players[id].score += 2
    players[target_id].score -= 2

Redis sorted sets are designed for exactly this situation (the documentation uses leaderboards as the example use case): http://redis.io/topics/data-types-intro#redis-sorted-sets
The key commands you care about are ZADD (set/update a player's score) and ZRANK (get the rank of a specific player). Both operations are O(log(N)).
Redis can be used as a cache of the player rankings. When your application starts, populate Redis from the SQL data; whenever you update a player's score in MySQL, update Redis as well.
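A minimal sketch of that cache using the redis-py client might look like this (the 'leaderboard' key name and the connection setup are assumptions, not anything from your setup):
import redis

r = redis.Redis()  # assumes a local Redis instance

def set_score(player_id, score):
    """Set/overwrite a player's score in the sorted set (ZADD, O(log N))."""
    r.zadd('leaderboard', {str(player_id): score})

def get_rank(player_id):
    """Return (rank, total) with rank 1 = highest score (ZREVRANK/ZCARD)."""
    rank = r.zrevrank('leaderboard', str(player_id))  # 0-based, descending order
    total = r.zcard('leaderboard')
    return (rank + 1 if rank is not None else None), total
Calling set_score() from the same code path that commits the MySQL update keeps the two stores roughly in sync.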
If you have multiple server processes/threads that could trigger player score updates concurrently, then you should also account for the MySQL/Redis update race condition, e.g.:
only update redis from a DB trigger; or
serialise player score updates; or
let data get temporarily out of sync and do another cache update after a delay; or
let data get temporarily out of sync and do a full cache rebuild at fixed intervals

The problem you have is that you want real-time updates against a database, which requires a db query each time. If you instead maintain a list of scores in memory, and update it at a more reasonable frequency (say once an hour, or even once a minute, if your players are really concerned with their rank), then the players will still experience real-time progress vs a score rank, and they can't really tell if there is a short lag in the updates.
With a sorted list of scores in memory, you can instantly get the player's rank (where by instantly, I mean O(lg n) lookup in memory) at the cost of the memory to cache, and of course the time to update the cache when you want to. Compared to a db query of 100k records every time someone wants to glance at their rank, this is a much better option.
Elaborating on the sorted list: you must query the db to get it, but you can keep using it for a while. Maybe you store a last_update timestamp and re-query the db only if the list is "too old". You stay fast by not updating all the time, but just often enough to feel like real-time.
In order to find someone's rank nearly instantaneously, you use the bisect module, which supports binary search in a sorted list. The scores are sorted when you get them.
from bisect import bisect_left

# suppose scores are 1 through 10, sorted ascending
scores = list(range(1, 11))

# get the insertion index for score 7, then subtract it from len(scores),
# because bisect expects an ascending sort but you want a descending rank
print(len(scores) - bisect_left(scores, 7))
This says that a score of 7 is rank 4, which is correct.
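Putting the caching and bisect ideas together, a rough sketch (the refresh interval, the module-level cache, and the reuse of the Player model and Session from the question are all assumptions):
import time
from bisect import bisect_left

CACHE_TTL = 60   # seconds between refreshes; pick whatever feels "real-time" enough
_scores = []     # ascending list of all scores
_last_update = 0.0

def get_rank(session, score):
    """Return (rank, total_players) from the periodically refreshed score cache."""
    global _scores, _last_update
    if time.time() - _last_update > CACHE_TTL:
        # re-query only when the cached list is "too old"
        _scores = sorted(s for (s,) in session.query(Player.score).all())
        _last_update = time.time()
    return len(_scores) - bisect_left(_scores, score), len(_scores)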

That kind of information can be pulled with SQLAlchemy's order_by. If you perform a query like:
leaderboard = session.query(Player).order_by(Player.score.desc()).all()
you will have the list of Players sorted by their score, highest first. Keep in mind that every time you do this you perform I/O with the database, which can be rather slow compared to keeping the data in Python variables.
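If you do want the rank straight from the database, one hedged alternative (reusing the Player model from the question) is to count how many players have a higher score:
from sqlalchemy import func

def get_rank(session, player_id):
    """Rank 1 = highest score; ties share the better rank."""
    player = session.query(Player).filter_by(id=player_id).one()
    higher = session.query(func.count(Player.id)).filter(Player.score > player.score).scalar()
    total = session.query(func.count(Player.id)).scalar()
    return higher + 1, total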

Related

Filtering sequences of events in django

My django app stores the actions that players do during a game. One of the models is called Event, and it contains a list of all actions by players. It has the following 4 columns: game_id, player_name, action, turn. Turn is the number of the turn in which the action takes place.
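A minimal sketch of the Event model as described (field types and lengths are my assumptions):
from django.db import models

class Event(models.Model):
    game_id = models.IntegerField()
    player_name = models.CharField(max_length=100)
    action = models.CharField(max_length=100)
    turn = models.IntegerField()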
Now I want to count how often players behave in certain patterns. For example, I want to know how often a player takes decision A in turn 2 and that same player takes decision B in turn 3. I'm racking my brain over how to do this. Can someone help?
Note: Ideally the queries should be efficient, bearing in mind that some related queries are following. For example, after the above query I would want to know how often a player takes decision C in turn 4, after doing A and B in turns 2 and 3. (The goal is to predict the likelihood of each action, given the actions in the past.)
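For illustration only (this is not an answer from the thread), one way to count the A-then-B pattern against the model sketched above, matching on player_name and ignoring game_id for brevity:
# players who took decision 'A' in turn 2
players_a = Event.objects.filter(turn=2, action='A').values_list('player_name', flat=True)

# how many of those same players took decision 'B' in turn 3
count = (Event.objects
         .filter(turn=3, action='B', player_name__in=players_a)
         .values('player_name').distinct().count())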

Python multiprocessing - Process a large number of rows simultaneously

SO community,
I am struggling with Python multicore programming. Whilst I cannot disclose too many details about the goal of my code, due to the company's information security policy, I'll try to be as specific as I can.
I am developing code which will evaluate the weekly scores for several people, divided into 5 groups according to their overall scores. I have a fairly large dataframe, composed of tens of thousands of rows, each one with the player's ID, the considered week (in a nine-week period) and his/her score for that week. On top of that, I have five arrays of indexes so I can loc/iloc the data to select people in one of the five groups, depending on their scores.
I wrote a serial version of this code, but it takes well over an hour to run on a small part (some 150k lines) of our data (I'm developing with a small sample so I can test the code and, once it's A-OK, fire it up with the complete dataset). The structure of the serial code, in pythonesque pseudocode:
def weekscore(week, ID, players):
    mask(week)
    players.loc[players['player'] == ID]
    score = (calculate score)
    return score
And then, for all players and all weeks:
sc_week = []
for play_ID in playerlist:
    for week in np.arange(0, 10):  # To consider 9 weeks
        sc_week.append(weekscore(week, play_ID, players))
i.e., for each player on the list, get his/her score for each week and append it to a list for further processing.
I know this isn't the best way of approaching this problem, but since I was trying to implement a multicore algorithm, I didn't bother optimizing this one. After all, this seemed the perfect opportunity to learn an important skill that will be useful later.
Anyway, I tried every single way I could find on SO to implement a multiprocessing algorithm for this problem. The idea is to delegate different player_IDs to different processes, so I could greatly cut the total processing time. I have access to 32-core machines, so if I could make this run with 20 cores I would expect the code to run at least 10 times faster.
But no matter what I implemented, even though the code forks into several processes, it still processes just a single player_ID at a time. I monitored the execution through htop on Linux and through Task Manager on Windows, and on both systems, with a pool size of 8, 8 subprocesses were created. Then one of them would spike to 100% of CPU, output something to the screen, then another one would spike to 100%, output to the screen, another would spike to 100% and output, and so on. I was expecting all processes to reach 100%, but I couldn't manage to do this.
The closest I got was with this:
if __name__ == '__main__':
    users, primeiro, dados, index_pred_80, intervalos, percents_ = prepara()
    print('Multiprocessing step')
    import multiprocessing as mp
    from multiprocessing import freeze_support
    from functools import partial
    freeze_support()
    pool = mp.Pool(10)
    constparams = partial(perc1, week=week, ID=ID, players=players)
    pool.map(constparams, users)
    pool.close()
    pool.join()
This calls the function perc1, which is (anonymized for privacy reasons):
def perc1(week, ID, players):
    list1 = []
    for week in np.arange(0, 10):
        mask(week)
        players.loc[players['player'] == ID]
        print('Week = ', week, 'Player ID = ', player_ID)
        score = (calculate score)
        list1.append(score)
    return list1
So, after the long post, what is the best way of achieving this goal? What's the best way to delegate batches of IDs to different processes, so each process can churn through a part of the data?
EDIT 1: Fixed small typos.
EDIT 2: I am aware of the limitations of interactive interpreters for multicore Python, so I tested the code from the command line on both Windows and Linux, but the behaviour was the same in both cases.
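For what it's worth, one common pattern is to freeze only the shared data with partial and map over the player IDs, so each worker handles a different player. The sketch below assumes a players DataFrame with 'player', 'week' and 'score' columns and a placeholder scoring rule; none of the names come from the original code:
import numpy as np
import pandas as pd
from functools import partial
from multiprocessing import Pool, freeze_support

def scores_for_player(player_id, players):
    """Return the list of weekly scores for a single player."""
    rows = players.loc[players['player'] == player_id]
    return [rows.loc[rows['week'] == week, 'score'].sum()  # placeholder score rule
            for week in np.arange(0, 10)]

if __name__ == '__main__':
    freeze_support()                      # needed when frozen into a Windows executable
    players = pd.read_csv('players.csv')  # assumed input
    playerlist = players['player'].unique()
    with Pool(10) as pool:
        # the DataFrame is shipped to the workers; only player_id varies per task
        results = pool.map(partial(scores_for_player, players=players), playerlist)
Here only the DataFrame is fixed by partial, while the varying player ID is the positional argument that map supplies, so each worker really does get a different player.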

Python: How To Construct A Class With Many Parameters

I am in a bit of a jam in deciding how to structure my class. What I have is a baseball player class and necessary attributes of:
Player ID (a key from a DB)
Last name
First name
Team
Position
Opponent
about 10 or 11 stats (historical)
about 10 or 11 stats (projected)
pitcher matchup
weather
... and a few more
Some things to break this down a little:
1) put stats in dictionaries
2) make a team class that can hold general info common to all players on the team, like weather and pitcher matchup.
But, I still have 10 attributes after this.
Seeing this (Class with too many parameters: better design strategy?) has given me a couple ideas but don't know if they're ideal.
1) Use a dictionary - but then I wouldn't be able to use methods to calculate stats (or I'd have to use separate functions)
2) Use args/kwargs - but from what I can gather, those are meant for variable numbers of parameters, and all of my parameters are required
3) Breaking up into smaller classes - I have broken it up a bit already, but don't know if I can any further.
Is there a better way to build this rather than having a class with a bunch of parameters listed out?
If you think about this from the perspective of database design, it would probably be odd to have a BaseballPlayer object that has the following parameters:
Team
Position
Opponent
about 10 or 11 stats (historical)
about 10 or 11 stats (projected)
pitcher matchup
weather
Because there are certain things associated with a particular BaseballPlayer which remain relatively fixed, such as name, etc., but these other things are fluid and transitory.
If you were designing this as an application with various database tables, it's possible that each of the things listed here would represent a separate table, and the BaseballPlayer's relationship with these other tables amounts to a current and former Team, etc.
Thus, I would probably break up the problem into more classes, including a StatsClass and a Team class (which is probably what an Opponent really is...).
But it all depends what you would like to do. Usually when you are bending over backwards to cram data into a structure or doing the same to get it back out, the design could be reworked to make your job easier.
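A rough sketch of that composition using dataclasses (all field names here are illustrative, not a prescription):
from dataclasses import dataclass, field

@dataclass
class Team:
    name: str
    weather: str = ''
    pitcher_matchup: str = ''

@dataclass
class Stats:
    historical: dict = field(default_factory=dict)
    projected: dict = field(default_factory=dict)

@dataclass
class BaseballPlayer:
    player_id: int
    first_name: str
    last_name: str
    position: str
    team: Team
    opponent: Team
    stats: Stats = field(default_factory=Stats)
The transitory pieces (opponent, weather, matchup) live on the Team and Stats objects, so the BaseballPlayer constructor stays short.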

Django database planning - time series data

I would like some advice on how best to organize my django models/database tables to hold the data in my webapp.
I'm designing a site that will hold a user's telemetry data from a racing sim game. There will be a desktop companion app that samples the game data every 0.1 seconds for a variety of information (car, track, speed, gas, brake, clutch, rpm, etc.). For example, in a 2-minute race, each of those variables will hold 1200 data points (10 samples a second * 120 seconds).
The important thing here is that this list of data can be as many as 20 variables, and could potentially grow in the future. So 1200 * the number of variables is the amount of data for an individual race session. If a single user submits 100 sessions, and there are 100 users... the amount of data adds up very quickly.
The app will then ship all this data for a race session off to the database for the website. The data MUST be transferred between game and website via a CSV file, so structurally I am limited to what CSV can do. The website will then allow you to choose a race session/lap and plot this information on separate time series graphs (one per variable), and importantly allow you to plot your session against somebody else's to see where the differences lie.
My question here is how do you structure such a database to hold this much information?
The simplest structure I have in my mind is to have a separate table for each race track, then each row/entry will be a race session on that track. Fields in this table will be the variables above.
The problem I have is:
1) most of the variables in the list above are time series data and not individual values (e.g. var speed might look like: 70, 72, 74, 77, 72, 71, 65, where the values are samples spaced 0.1 seconds apart over the course of the entire lap). How do you store this type of information in a table/field?
2) The length of each var in the list above will always be the same for any single race session (if your lap took 1 min 35 s then all your vars will only capture data for that length of time), but given that I want to be able to compare different laps with each other, session times will be different for each lap. In other words, however I store the time series data for those variables, it must be variable in size.
Any thoughts would be appreciated
One thing that may help you with HUGE tables is partitioning. Judging by the postgresql tag that you set for your question, take a look here: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
But for a start I would go with a one, simple table, supported by a reasonable set of indexes. From what I understand, each data entry in the table will be identified by race session id, player id and time indicator. Those columns should be covered with indexes according to your querying requirements.
As for your two questions:
1) You store that information as simple integers. Remember to set proper data types for those columns. E.g. if you are 100% sure that some values will be very small, you can use the smallint data type. More on integer data types here: http://www.postgresql.org/docs/9.3/static/datatype-numeric.html#DATATYPE-INT
2) That won't be a problem if every entry of every var list is a separate row in the table. You will be able to insert as many as you'd like.
So, to sum things up. I would start with a VERY simple single table schema. From django perspective this would look something like this:
class RaceTelemetryData(models.Model):
    user = models.ForeignKey(..., db_index=True)
    race = models.ForeignKey(YourRaceModel, db_index=True)
    time = models.IntegerField()
    gas = models.IntegerField()
    speed = models.SmallIntegerField()
    # and so on...
Additionally, you should create an index (manually) on the (user_id, race_id, time) columns, so looking up data about one race session (and sorting it) will be quick.
In the future, if you'll find the performance of this single table too slow, you'll be able to experiment with additional indexes, or partitioning. PostgreSQL is quite flexible in modifying existing database structures, so you shouldn't have many problems with it.
If you decide to add a new variable to the collection, you will simply need to add a new column to the table.
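For reference, a hedged sketch of declaring that composite index on the model itself with Meta.indexes (Django 1.11+); the related model names 'auth.User' and 'Race' are assumptions:
from django.db import models

class RaceTelemetryData(models.Model):
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE, db_index=True)
    race = models.ForeignKey('Race', on_delete=models.CASCADE, db_index=True)
    time = models.IntegerField()
    gas = models.IntegerField()
    speed = models.SmallIntegerField()

    class Meta:
        indexes = [models.Index(fields=['user', 'race', 'time'])]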
EDIT:
In the end you have one table that has at least these columns:
user_id - To specify which user's data this row is about.
race_id - To specify which race data this row is about.
time - To identify the correct order in which to represent the data.
This way, when you want to get information on Joe's 5th race, you would look up rows that have user_id = 'Joe_ID' and race_id = 5, then sort all those rows by the time column.
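In ORM terms that lookup might look something like this (joe_id is a hypothetical user primary key):
# joe_id is a hypothetical user primary key
session_rows = RaceTelemetryData.objects.filter(user_id=joe_id, race_id=5).order_by('time')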

Google App Engine request timing out on count() function

I am using the count() function to calculate the number of results returned by a query. The problem is that count() takes so long that the request times out. Is there any way to make the count respond quickly, or any alternative to count()?
query = MyModel.query().filter(MyModel.name.IN(['john', 'sara', 'alex']))
search_count = query.count()
If I remove the count line and just return the results, it takes just a couple of seconds.
Unfortunately count doesn't scale. You can only count 1000 items without using a cursor. Secondly, if you want to count, do a keys-only query (it pulls less data from the datastore).
Really, to keep a count relatively up to date for a large number of entities, you will need to use a task and run it every so often (or trigger a task to be scheduled each time data is added/modified, if that is infrequent) and store that value away somewhere.
Or think about why you really need a count ;-) and how accurate it is.
If you need count(), you should use the keys_only option as Tim Hoffman already suggested. That should save you enough time for counting small query results.
Be aware that count() actually runs through the complete query until the very last match in the index. This means, if your query matches millions of items in a huge index, you will see terrible request times and time-outs even with the keys_only option.
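A minimal sketch of that, assuming the Python NDB query from the question (keys_only is passed as a query option):
query = MyModel.query().filter(MyModel.name.IN(['john', 'sara', 'alex']))
search_count = query.count(keys_only=True)  # counts keys only; no full entities are fetched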
From a usability perspective it isn't likely that a user wants accurate numbers at large scales. Typically users will not even browse through dozens or even hundreds of pages.
Counter with threshold accuracy
Consider using a counter that only is accurate up to a low limit, e.g. "41 items found", and beyond that limit use a generic display, e.g. "1000 or more items found". This is how text searches in GMail shows number of matches.
Pre-calculated counter
Enter a generic term like "spaghetti" into Google search and you will see some incredibly high number, e.g. "5.3 million documents found". Then try to get to page number 1,000 or to match number 1,000,000. It won't work. And the number is inaccurate as well. For calculating number of matches ahead of time, you could write tasks / cron jobs (maybe with map-reduce) that will calculate the counters asynchronously. However, even in business use-cases the counter of an individual search query like in your example doesn't need to be accurate with large numbers because it is very probable that the counter is changing significantly while the user goes through the results.
Shard counters
If you however need an accurate counter, for example the number of all sales orders in the datastore, rather than individual queries, you could write a counter and increase/decrease it with every new sales order that is created or deleted in the datastore. Depending on how you model the entity groups such counter might hit current datastore limitations in large volume writes (~ 1 write op per second per entity group, in reality maybe 3 to 4). See the article Sharding counters which explains how to build a scalable counter.
Use Search API
You could use the full text search service in Google App Engine. Define an index (e.g. "Customer") with fields you want to search. Whenever a customer entity in datastore is updated, put an updated copy of it as document into the search index. In my experience, the Search API is scaling much better for complex searches in large indices. It also shows you a counter and provides your users with full text search capabilities.
