Database design for tracking daily item movement - python

I am building a very simple database in MySQL to track the movement of items. The current paper form looks like this:
| Date     | totalFromPreviousDay | NewToday | LeftToday | RemainAtEndOfDay |
| 1.1.2017 | 5                    | 5        | 2         | 8 (5+5-2)        |
| 2.1.2017 | 8                    | 3        | 0         | 11 (8+3-0)       |
| 3.1.2017 | 11                   | 0        | 5         | 6 (11+0-5)       |
And so forth. In my table, I want to make totalFromPreviousDay and RemainAtEndOfDay calculated fields which I show in my front end only. That is mainly because we tend to erase on the paper due to errors, and I want those two fields to update automatically based on changes to the other two fields. As such, I designed my table like this:
id
date
NewToday
LeftToday
Now the problem I am facing is that I want to select any date and be able to say "there were 5 items at the start of the day or from the previous day, then 5 were added, 0 left and the day ended with 10 items".
So far, I can't really think of a way of going about it. Theoretically, I want to try something like this: if the requested day is Feb. 1, 2017, start at 0 because that's the day we started collecting data. If not, loop through the records, starting at 0 and doing the math until the requested date is found.
But that is obviously inefficient because I have to go from the first date to the last every time.
Is my approach OK, or should I include the columns in the table? If the former, what would be the way to do it in Python/MySQL?

I think you have to step back a little bit and define the business needs first (it is worthwhile to talk to somebody who has worked with stock before), because these determine your table structure.
A system always tracks the current stock level and the movements. How often you save historical stock levels is a business decision, and it influences how you store the data.
You may save the current stock level along with every transaction. In this case you would store the stock level in the transactions table. You do not even have to sum up the transactions per day, because the last transaction of each day already carries the daily closing stock level.
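For the original question, this first option maps naturally onto a MySQL transactions table that stores the running level with every movement. Here is a minimal sketch in Python, assuming the mysql-connector-python driver and an illustrative table transactions(id, item_id, moved_on, qty, stock_level); all of these names are placeholders, not from the question:

import mysql.connector  # assumption: the mysql-connector-python package is installed

conn = mysql.connector.connect(user="app", password="secret", database="inventory")
cur = conn.cursor()

def record_movement(item_id, moved_on, qty):
    """Insert one movement and store the resulting stock level with it.
    qty is positive for NewToday and negative for LeftToday."""
    # Latest known level for this item (0 if there is no history yet).
    cur.execute(
        "SELECT stock_level FROM transactions "
        "WHERE item_id = %s ORDER BY moved_on DESC, id DESC LIMIT 1",
        (item_id,),
    )
    row = cur.fetchone()
    previous_level = row[0] if row else 0

    new_level = previous_level + qty
    cur.execute(
        "INSERT INTO transactions (item_id, moved_on, qty, stock_level) "
        "VALUES (%s, %s, %s, %s)",
        (item_id, moved_on, qty, new_level),
    )
    conn.commit()
    return new_level

The last row of a given day then carries RemainAtEndOfDay for free, and totalFromPreviousDay is simply the level stored on the previous day's last row.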
You may choose to save historic stock levels regularly (on a daily, weekly, monthly, etc. basis). In this case you will have a separate historic stock levels table with stock id, stock name (the name may change over time, so it may be a good idea to save it), date and level. If you would like to know the historic stock level at any point in time that falls between your saved points, take the latest saved stock level before the date you are looking for and sum up all transactions from that saved point up to the requested date.
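A sketch of that second lookup, assuming a snapshots table stock_snapshots(item_id, level_date, stock_level) alongside the transactions table above (again, all names are illustrative):

def stock_level_at(cur, item_id, as_of):
    """Stock level at the end of `as_of`: the latest snapshot taken on or
    before that date, plus all movements recorded after the snapshot."""
    cur.execute(
        "SELECT level_date, stock_level FROM stock_snapshots "
        "WHERE item_id = %s AND level_date <= %s "
        "ORDER BY level_date DESC LIMIT 1",
        (item_id, as_of),
    )
    snapshot = cur.fetchone()

    if snapshot:
        snapshot_date, level = snapshot
        cur.execute(
            "SELECT COALESCE(SUM(qty), 0) FROM transactions "
            "WHERE item_id = %s AND moved_on > %s AND moved_on <= %s",
            (item_id, snapshot_date, as_of),
        )
    else:
        # No snapshot yet: sum every movement up to the requested date.
        level = 0
        cur.execute(
            "SELECT COALESCE(SUM(qty), 0) FROM transactions "
            "WHERE item_id = %s AND moved_on <= %s",
            (item_id, as_of),
        )
    (moved,) = cur.fetchone()
    return level + moved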

Related

Approach to storing forecast time series data using python

So I want to scrape a weather forecast table once a day and store my results for future analysis. I want to store the data but I'm not sure how.
Example of the data: Forecast Table
My four variables of interest are wind speed, wind gusts, wave height, and wave period.
This is my first Python project involving time series data and I'm fairly new to databases, so go easy on me and ELI5 please.
In the Python For Everyone course I recently took, I learned about relational databases and using SQLite. The main idea there was basically to be efficient in storing data and never store the same data twice. However, none of the examples involved time series data, so now I'm not sure what the best approach is.
My idea was to create a table for each variable, plus one for the dates I scraped the forecast, with the date of scraping serving as the primary key. In a variable table such as windspeed, the first column would be the date of scraping, followed by columns for the forecasted values at each time stamp. Although this would make storage more efficient than creating a new table every day, there are a few problems. The timestamps are not uniform (see image; forecast times only run from 3am to 9pm). Also, depending on the time of day the forecast is scraped, the date and time values of the timestamps keep shifting, so the next timestamp is not always two hours later.
Seeing as I get a new table each time I scrape the forecast, should I create a new database table each time in SQLite? This seems like a rather rudimentary solution, and I'm sure there are better ways to store the data.
How would you go about this?
Summarizing my comments:
You may want to append the forecast data from each new scraping to the existing data in the same database table.
From each new web scraping you will get approx. 40 new records with the same scraping time stamp but different forecast time stamps.
For example, these would be the columns of the table, using ID as the primary key with AUTOINCREMENT:
ID
Scraping_time
Forecast_hours
Wind_speed
Wind_gusts
Wind_direction
Wave
Wave_period
Wave_direction
Note:
If you use SQLite, you could leave out the ID column, as SQLite adds such a ROWID column by default if no other primary key has been specified
(https://www.sqlite.org/autoinc.html)
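A minimal sketch of that append-only table with Python's built-in sqlite3 module; the scraper itself is left out, and the row structure below is an assumption about what it produces:

import sqlite3
from datetime import datetime

conn = sqlite3.connect("forecasts.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS forecast (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        Scraping_time TEXT,
        Forecast_hours TEXT,
        Wind_speed REAL,
        Wind_gusts REAL,
        Wind_direction TEXT,
        Wave REAL,
        Wave_period REAL,
        Wave_direction TEXT
    )
    """
)

def store_scrape(rows):
    """rows: the ~40 records from one scrape, each a tuple of
    (forecast_hours, wind_speed, wind_gusts, wind_direction,
     wave, wave_period, wave_direction)."""
    scraped_at = datetime.now().isoformat(timespec="minutes")
    conn.executemany(
        "INSERT INTO forecast (Scraping_time, Forecast_hours, Wind_speed,"
        " Wind_gusts, Wind_direction, Wave, Wave_period, Wave_direction)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [(scraped_at, *row) for row in rows],
    )
    conn.commit()

Every scrape simply appends its rows with the same Scraping_time, so one table accumulates the whole history and queries can slice it by either time stamp.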

Detecting recurring transactions from history

I'm hoping someone could lead me in the right direction. I have a list of users and money transactions for a service.
The data I have: userid, transaction date, transaction amount. Based on those three variables, is it possible to code something that, for a new transaction, decides or gives the probability that the new transaction is recurring?
An example would be a user who has a transaction for $100 at the end of the month, every month. If a new transaction for the same amount comes in the next month, there's a high probability it's recurring, as opposed to a $25 transaction towards the start of the month.
Or someone has had a transaction for around $1200 every December for the last 10 years; if they do that again this year, it's likely recurring.
Any guidance on how to tackle this, or which machine learning models to use, would be appreciated.
Thanks
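This question has no answer in the thread, but the heuristic described above (same user, similar amount, around the same day of the month) translates fairly directly into code before reaching for a machine learning model. A rough rule-based sketch with pandas; the column names and thresholds are assumptions:

import pandas as pd

def recurrence_score(history, userid, amount, tx_date,
                     amount_tol=0.05, day_tol=3):
    """Fraction of the last 12 months in which this user had a transaction
    of a similar amount around the same day of the month.
    history: DataFrame with columns userid, date, amount."""
    user_tx = history[history["userid"] == userid].copy()
    if user_tx.empty:
        return 0.0

    user_tx["date"] = pd.to_datetime(user_tx["date"])
    tx_date = pd.Timestamp(tx_date)
    past_year = user_tx[(user_tx["date"] < tx_date) &
                        (tx_date - user_tx["date"] <= pd.Timedelta(days=365))]

    similar_amount = (past_year["amount"] - amount).abs() <= amount_tol * amount
    similar_day = (past_year["date"].dt.day - tx_date.day).abs() <= day_tol
    months_matched = past_year[similar_amount & similar_day]["date"].dt.to_period("M").nunique()

    return min(months_matched / 12.0, 1.0)

The yearly case (the December example) would need the same comparison on month-of-year over a longer window, and a labelled set of known subscriptions would let you replace the hand-picked thresholds with a classifier.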

Calculating average of multiple sets of data (performance issue)

I need to do calculations like the average of selected data grouped by collections of time ranges.
Example:
The table storing the data has several main columns:
| time_stamp | external_id | value |
Now I want to calculate the average for 20 (or more) groups of date ranges:
1) 2000-01-01 00-00-00 -> 2000-01-04 00-00-00
2) 2000-01-04 00-00-00 -> 2000-01-15 00-00-00
...
The important thing is that there are no gaps or intersections between groups, so the first date and the last date cover the full time range.
The other important thing is that within a "date_from" to "date_to" range there can be rows from outside the collection (unneeded external_ids).
I have tried 2 approaches:
1) Execute a query for each "time range" step, with the average function in the SQL query (but I don't like that: it consumes too much time across all the queries, plus executing multiple queries doesn't sound like a good approach).
2) Select all required rows in one SQL request and then loop over the results. The problem is that on each step I have to check which "data group" the current datetime belongs to. This seems like a better approach (from the SQL perspective), but right now performance is not great because of a loop inside the loop. I need to figure out how to avoid running the inner loop (checking which group the current timestamp belongs to) inside the main loop.
Any suggestions would be much helpful.
Actually both approaches are fine, and both could benefit from an index on the time_stamp column in your database, if you have one. I will try to provide advice on each:
Multiple queries are not such a bad idea: your data looks to be pretty static, and you can run 20 select avg(value) from data where time_stamp between date_from and date_to-like queries on 20 different connections to speed up the total operation. You'll also eliminate the need to transfer a lot of data from the DB to your client. The downside is that you need to include an additional WHERE condition to exclude rows with unneeded external_id values; this complicates the query and can slow processing down a little if there are a lot of these values.
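A sketch of that per-range query in Python, assuming a DB-API connection and that the ranges are given as (date_from, date_to) tuples; the table and column names follow the question, everything else is illustrative:

def averages_per_range(conn, ranges, wanted_ids):
    """Run one AVG query per date range, restricted to the external_ids
    that belong to the collection."""
    placeholders = ", ".join(["%s"] * len(wanted_ids))
    sql = (
        "SELECT AVG(value) FROM data "
        "WHERE time_stamp >= %s AND time_stamp < %s "
        "AND external_id IN (" + placeholders + ")"
    )
    cur = conn.cursor()
    averages = []
    for date_from, date_to in ranges:
        cur.execute(sql, (date_from, date_to, *wanted_ids))
        (avg_value,) = cur.fetchone()
        averages.append(avg_value)
    return averages

Running these on separate connections (for example from a small thread pool) gives the parallel speed-up mentioned above.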
Here you could sort the data on the server by the time_stamp index before sending, and then just check whether the current item belongs to a new date range (because of the sorting, you can be sure that later items come from later dates). This reduces the inner loop to an if statement. I am unsure this is the bottleneck here, though; maybe you'd like to look into streaming the results instead of waiting for them all to be fetched.
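And a sketch of the single-query variant: with the rows already ordered by time_stamp and the ranges sorted, the group check becomes a pointer that only ever moves forward (row layout as in the question, the rest assumed):

def averages_single_pass(rows, ranges, wanted_ids):
    """rows: (time_stamp, external_id, value) tuples sorted by time_stamp.
    ranges: sorted, non-overlapping (date_from, date_to) tuples covering
    the whole period. wanted_ids: set of external_ids to keep."""
    sums = [0.0] * len(ranges)
    counts = [0] * len(ranges)
    current = 0

    for time_stamp, external_id, value in rows:
        if external_id not in wanted_ids:
            continue
        # Advance to the range containing this row; the sorting guarantees
        # we never have to move backwards.
        while time_stamp >= ranges[current][1]:
            current += 1
        sums[current] += value
        counts[current] += 1

    return [s / c if c else None for s, c in zip(sums, counts)]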

Python: aggregating data by row count

I'm trying to aggregate this call center data in various ways in Python, for example the mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a call, so I'm not sure how to do it. If I'm just grouping by date then I can just use 'count' as the aggregate function, but how would I aggregate by e.g. weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday and counting the number of unique dates, then dividing the counts per weekday by the number of unique dates, but I think this could get messy and maybe really slow it down if I need to also group by other variables, or do it by date and hour of the day.
Since the date of each call is given, one idea is to implement a function to determine the day of the week from a given date. There are many ways to do this, such as Conway's Doomsday algorithm.
https://en.wikipedia.org/wiki/Doomsday_rule
One can then go through each line, determine the weekday, and add to the count for that weekday.
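In Python the standard library's datetime already exposes the weekday, so you don't have to implement the Doomsday rule yourself. A sketch of that counting approach, assuming the calls arrive as dicts with a 'date' string:

from collections import Counter
from datetime import datetime

def mean_calls_per_weekday(call_records):
    """call_records: iterable of dicts with at least a 'date' key (YYYY-MM-DD).
    Returns {weekday: mean call count}, with weekday 1 = Monday ... 7 = Sunday."""
    calls = Counter()                           # total calls seen per weekday
    dates = {wd: set() for wd in range(1, 8)}   # distinct dates per weekday
    for record in call_records:                 # each record is one call
        call_date = datetime.strptime(record["date"], "%Y-%m-%d").date()
        weekday = call_date.isoweekday()
        calls[weekday] += 1
        dates[weekday].add(call_date)
    # mean = calls on that weekday / number of distinct dates of that weekday
    return {wd: calls[wd] / len(dates[wd]) for wd in dates if dates[wd]}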
When I find myself thinking about how to aggregate and query data in a versatile way, I think the solution is probably a database. SQLite is a lightweight embedded database with good performance for simple use cases, and Python has native support for it.
My advice here is: create a database and a table for your data, add ancillary tables depending on your needs, load the data into it, and use interactive sqlite or Python scripts for your queries.
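Once the data is in SQLite, the weekday aggregation becomes a single query. A sketch with the built-in sqlite3 module, assuming a calls table with a call_date column stored in ISO format:

import sqlite3

conn = sqlite3.connect("calls.db")

# strftime('%w', ...) yields 0 = Sunday ... 6 = Saturday in SQLite.
query = """
    SELECT weekday, AVG(calls_that_day) AS mean_row_count
    FROM (
        SELECT strftime('%w', call_date) AS weekday,
               call_date,
               COUNT(*) AS calls_that_day
        FROM calls
        GROUP BY call_date
    ) AS per_day
    GROUP BY weekday
    ORDER BY weekday
"""
for weekday, mean_row_count in conn.execute(query):
    print(weekday, mean_row_count)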

Django database planning - time series data

I would like some advice on how to best organize my Django models/database tables to hold the data in my webapp.
I'm designing a site that will hold a user's telemetry data from a racing sim game. A desktop companion app will sample the game data every 0.1 seconds for a variety of information (car, track, speed, gas, brake, clutch, rpm, etc.). For example, in a 2-minute race, each of those variables will hold 1200 data points (10 samples a second * 120 seconds).
The important thing here is that this list can contain as many as 20 variables and could potentially grow in the future. So 1200 * the number of variables is the amount of data for an individual race session. If a single user submits 100 sessions, and there are 100 users... the amount of data adds up very quickly.
The app will then ship all this data for a race session off to the database for the website. The data MUST be transferred between game and website via a CSV file, so structurally I am limited to what CSV can do. The website will then allow you to choose a race session/lap and plot this information on separate time series graphs (one for each variable), and importantly allow you to plot your session against somebody else's to see where the differences lie.
My question here is how do you structure such a database to hold this much information?
The simplest structure I have in mind is a separate table for each race track, where each row/entry is a race session on that track. The fields in this table would be the variables above.
The problem I have is:
1) Most of the variables in the list above are time series data and not individual values (e.g. the speed var might look like: 70, 72, 74, 77, 72, 71, 65, where the values are samples spaced 0.1 seconds apart over the course of the entire lap). How do you store this type of information in a table/field?
2) The vars in the list above will always be the same length for any single race session (if your lap took 1 min 35, then all your vars will only capture data for that length of time), but given that I want to be able to compare different laps with each other, session times will differ from lap to lap. In other words, however I store the time series data for those variables, it must be variable in size.
Any thoughts would be appreciated
One thing that may help you with HUGE tables is partitioning. Judging by the postgresql tag that you set for your question, take a look here: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
But for a start I would go with one simple table, supported by a reasonable set of indexes. From what I understand, each data entry in the table will be identified by a race session id, a player id and a time indicator. Those columns should be covered with indexes according to your querying requirements.
As for your two questions:
1) You store those values as simple integers. Remember to set proper data types for those columns. For example, if you are 100% sure that some values will be very small, you can use the smallint data type. More on integer data types here: http://www.postgresql.org/docs/9.3/static/datatype-numeric.html#DATATYPE-INT
2) That won't be a problem if every entry in a var list is a separate row in the table. You will be able to insert as many rows as you'd like.
So, to sum things up, I would start with a VERY simple, single-table schema. From the Django perspective this would look something like this:
class RaceTelemetryData(models.Model):
    user = models.ForeignKey(..., db_index=True)
    race = models.ForeignKey(YourRaceModel, db_index=True)
    time = models.IntegerField()
    gas = models.IntegerField()
    speed = models.SmallIntegerField()
    # and so on...
Additionally, you should create an index (manually) on the (user_id, race_id, time) columns, so that looking up data about one race session (and sorting it) will be quick.
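If you are on Django 1.11 or newer, that composite index can also be declared on the model itself via Meta.indexes; a sketch, with the ForeignKey targets filled in as placeholders:

from django.db import models

class RaceTelemetryData(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)      # placeholder target
    race = models.ForeignKey("YourRaceModel", on_delete=models.CASCADE)  # placeholder target
    time = models.IntegerField()
    gas = models.IntegerField()
    speed = models.SmallIntegerField()

    class Meta:
        indexes = [
            # covers "all samples of one user's session, sorted by time"
            models.Index(fields=["user", "race", "time"]),
        ]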
In the future, if you find the performance of this single table too slow, you'll be able to experiment with additional indexes or partitioning. PostgreSQL is quite flexible about modifying existing database structures, so you shouldn't have many problems with it.
If you decide to add a new variable to the collection, you will simply need to add a new column to the table.
EDIT:
In the end you have one table with at least these columns:
user_id - To specify which user's data this row is about.
race_id - To specify which race data this row is about.
time - To identify the correct order in which to represent the data.
This way, when you want to get information on Joe's 5th race, you would look up rows that have user_id = 'Joe_ID' and race_id = 5, then sort all those rows by the time column.
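That lookup is then a single queryset; a sketch, assuming the RaceTelemetryData model above and that joe is the relevant User instance:

# All samples of Joe's 5th race session, in chronological order.
session = (
    RaceTelemetryData.objects
    .filter(user=joe, race_id=5)
    .order_by("time")
)
for sample in session:
    print(sample.time, sample.speed, sample.gas)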
