Django database planning - time series data

Django database planning - time series data - python

I would like some advice on how to best organize my django models/database tables to hold the data in my webapp
Im designing a site that will hold a users telemetry data from a racing sim game. So there will be a desktop companion app that will sample the game data every 0.1 seconds for a variety of information (car, track, speed, gas, brake, clutch, rpm, etc). For example, in a 2 minute race, each of those variables will hold 1200 data points (10 samples a second * 120 seconds).
The important thing here is that this list of data can be as many as 20 variables, and could potentially grow in the future. So 1200 * the number of variables you have is the amount of data for an individual race session. If a single user submits 100 sessions, and there are 100 users....the amount of data adds up very quickly.
The app will then ship all this data for a race session off to the database for the website. The data MUST be transferred between game and website via a CSV file. So structurally I am limited to what CSV can do. The website will then allow you to choose a race session/lap and plot this information on separate time series graphs (for each variable), and importantly allow you to plot your session against somebody elses to see where differences lie
My question here is how do you structure such a database to hold this much information?
The simplest structure I have in my mind is to have a separate table for each race track, then each row/entry will be a race session on that track. Fields in this table will be the variables above.
The problem I have is:
1) most of the variables in the list above are time series data and not individual values (e.g. var speed might look like: 70, 72, 74, 77, 72, 71, 65 where the values are samples spaced 0.1 seconds apart over the course of the entire lap). How do you store this type of information in a table/field?
2) The length of each var in the list above will always be the same length for any single race session (if your lap took 1min 35 then you all your vars will only capture data for that length of time), but given that I want to be able to compare different laps with each other, session times will be different for each lap. In other words, however I store the time series data for those variables, it must be variable in size
Any thoughts would be appreciated

One thing that may help you with HUGE tables is partitioning. Judging by the postgresql tag that you set for your question, take a look here: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
But for a start I would go with a one, simple table, supported by a reasonable set of indexes. From what I understand, each data entry in the table will be identified by race session id, player id and time indicator. Those columns should be covered with indexes according to your querying requirements.
As for your two questions:
1) You store those informations as simple integers. Remember to set a proper data types for those columns. For e.g. if you are 100% sure that some values will be very small, you can use smallint data type. More on integer data types here: http://www.postgresql.org/docs/9.3/static/datatype-numeric.html#DATATYPE-INT
2) That won't be a problem if you every var list will be different row in the table. You will be able to insert as many as you'd like.
So, to sum things up. I would start with a VERY simple single table schema. From django perspective this would look something like this:
class RaceTelemetryData(models.Model):
user = models.ForeignKey(..., index_db=True)
race = models.ForeignKey(YourRaceModel, index_db=True)
time = models.IntegerField()
gas = models.IntegerField()
speed = models.SmallIntegerField()
# and so on...
Additionaly, you should create an index (manually) for (user_id, race_id, time) columns, so looking up, data about one race session (and sorting it) would be quick.
In the future, if you'll find the performance of this single table too slow, you'll be able to experiment with additional indexes, or partitioning. PostgreSQL is quite flexible in modifying existing database structures, so you shouldn't have many problems with it.
If you decide to add a new variable to the collection, you will simply need to add a new column to the table.
EDIT:
In the end you end up with one table, that has at least these columns:
user_id - To specify which users data this row is about.
race_id - To specify which race data this row is about.
time - To identify the correct order in which to represent the data.
This way, when you want to get information on Joe's 5th race, you would look up rows that have user_id = 'Joe_ID' and race_id = 5, then sort all those rows by the time column.

Related

Approach to storing forecast time series data using python

So I want to scrape a weather forecast table once a day and store my results for future analysis. I want to store the data but im not sure how to.
Example of the data: Forecast Table
My four variables of interest are wind speed, wind gusts, wave height, and wave period.
This is my first python project involving time series data and I’m fairly new to databases so go easy on me and ELI5 please.
In the Python For Everyone course I recently took, I learned about relational databases and using SQLlite. The main idea here was basically to be efficient in storing data and never storing the same data twice. However none of the examples involved time series data. And so now I’m not sure what the best approach here is.
If I create a table for each variable and lastly one for the date I scraped the forecast. The date of scraping would then serve as the primary key. In this example the variables such as windspeed's first column would be date of scraping followed by the next columns being the forecasted values for the time stamps. Although this would make the storage more efficient as opposed to creating a new table every day, there are a few problems. The timestamps are not uniform (see image, forecast times are only from 3am to 9pm). Also depending on the time of day that the forecast is scraped the date and time values on the timestamps are always changing and thus the next timestamp is not always in 2 hours.
Seeing as each time I scrape the forecast, I get a new table, should I create a new database table each time in sqlite? This seems like a rather rudimentary solution, and I’m sure there are better ways to store the data.
How would you go about this?

Summarizing my comments:
You may want to append forecast data from a new scrapping to the existing data in the same database table.
From each new web-scrapping you will get approx. 40 new records with the same scrapping time stamp but different forecast time stamp.
e.g., this would be the columns of the table using ID as primary key with AUTOINCREMENT:
ID
Scrapping_time
Forecast_hours
Wind_speed
Wind_gusts
Wind_direction
Wave
Wave_period
wave_direction
Note:
if you use SQLite, you could leave out the ID column as SQLite would add such ROWID column by default if no other primary key had been specified
(https://www.sqlite.org/autoinc.html)

Speed - CSV vs MariaDB fetching stock data (python)

I like the idea of having my historical stock data stored in a database instead of CSV. Is there a speed penalty for fetching large data sets from MariaDB compared to CSV

Quite the opposite. Whenever you fetch data from a CSV, unless you have a stopping condition (for example, take the first entry with x = 3) you must parse every single line in the file. This is an expensive operation because not only do you have to read all of the lines (making it O(n)), but in general, you will be typecasting as well. In a database, you have already processed all of the lines, and if in this case there is an index on x or whatever attribute you are searching by, the database will be able to find the information in O(log(n)) time and will not look at the vast majority of entries.

Partition Data By Year/Month Column without Adding Columns to Result -pyspark/databricks

I have a dataframe in pyspark (and databricks) with the following schema structure:
orders schema:
submitted_at:timestamp
submitted_yyyy_mm using the format "yyyy-MM"
order_id:string
customer_id:string
sales_rep_id:string
shipping_address_attention:string
shipping_address_address:string
shipping_address_city:string
shipping_address_state:string
shipping_address_zip:integer
ingest_file_name:string
ingested_at:timestamp
I need to capture the data in my table in delta lake format, with a partition for every month of the order history reflected in the data of the submitted_yyyy_mm column. I am capturing the data correctly with the exception of two problems. One, my technique is adding two columns (and corresponding data) to the schema (could not figure out how to do the partitioning without adding columns). Two, the partitions correctly capture all the year/months with data, but are missing the year/months without data (requirement is those need to be included also). Specifically, all the months of 2017-2019 should have their own partition (so 36 months). However, my technique only created partitions for those months that actually had orders (which turned out to be 18 of the 36 months of the years 2017-2019).
Here is relevant are of my code:
# take the pristine order table and add these two extra columns you should not have in order to get partition structure
df_with_year_and_month = (df_orders
.withColumn("year", F.year(F.col("submitted_yyyy_mm").cast(T.TimestampType())))
.withColumn("month", F.month(F.col("submitted_yyyy_mm").cast(T.TimestampType()))))
# capture the data to the orders table using the year/month partitioning
df_with_year_and_month.write.partitionBy("year", "month").mode("overwrite").format("delta").saveAsTable(orders_table)
I would be grateful to anyone who might be able to help me tweak my code to fix the two issues I have the result. Thank you

There's no issue here. That's just how it works.
You want to partition on year and month. So you should have those values in you data, no way around it. You should also only partition on values where you want to filter on, since this 'causes partition pruning and results in faster queries. It would make no sense to partition on a field without related value.
Also it's totally normal that you don't create partitions where you don't have data for them. Once data is added, the corresponding partition is created if it doesn't exist yet. You don't need it any sooner than that.

smart way to structure my SQLite Database

I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A table where I add another table into it every 10 min containing all the data I got for that interval
I am not very sure the first option since there will be about 2000 players I don't know if I want to have 2000 tables (would that be a problem?). But I also don't feel like the second option is a good idea.

It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically it sounds like you are in need of a Player Dimension and a Play Fact table...maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.

In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements the Third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to say the top 15 is likewise a simple single query.
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows, which might be close to the maximum number of rows in a sqlite table. But if you got to that point I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?

Extract data faster from Redis and store in Panda Dataframe by avoiding key generation

I am using Redis with Python to store my per second ticker data (price and volume of an instrument). I am performing r.hget(instrument,key) and facing the following issue.
My key (string) looks like 01/01/2020-09:32:01 and goes on incrementing per second till the user specified interval.
For example 01/01/2020-09:32:01
01/01/2020-09:32:02 01/01/2020-09:32:03 ....
My r.hget(instrument,key) result looks likeb'672.2,432'(price and volume separated by a comma).
The issue am facing is that a user can specify a long time interval, like 2 years, that is, he/she wants the data from 01/01/2020 to 31/12/2020 (d/m/y format).So to perform the get operation I have to first generate timestamps for that period and then perform the get operation to form a panda dataframe. The generation of this datastamp to use as key for get operation is slowing down my process terribly (but it also ensures that the data is in strict ordering. For example 01/01/2020-09:32:01 will definitely be before 01/01/2020-09:32:02). Is there another way to achieve the same?
If I simply do r.hgetall(...) I wont be able to satisfy the time interval condition of user.

redis sorted set's are good fit for such range queries, sorted sets are made up of unique member's with a score, in your case timestamp can be score in epoch seconds and price and volume can be member, however member in sorted set is unique you may consider adding timestamp to make it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using zrangebyscore as below
zrangebyscore instrument 1577883600 1577883610
If your instrument contains many values then consider sharding it into multiple for example per month each set like instrument:202001, instrument:202002 and so on.
following are good read on this topic
Sorted Set Time Series
Sharding Structure

So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yield a smaller set of values, for a smaller time span (one week or one month).
So the new workflow will be in batches, see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimize the memory usage
easy to change your current code to this new algo.
I don't know about redis specific functions, so other specific solutions can be better. My idea is a general approach, I used it with success for other problems.

Have you considered using RedisTimeSeries for this task? It is a redis module that is tailored exactly for the sort of task you are describing.
You can keep two timeseries per instrument that will hold price and value.
With RedisTimeSeries is it easy the query over different ranges and you can use the filtering mechanism to group different series, instrument families for example, and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average standard-deviation, and uses double-delta compression which can reduce your memory usage by over 90%.
You can checkout a benchmark here.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.