I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A table where I add another table into it every 10 min containing all the data I got for that interval
I am not very sure about the first option, since there will be about 2,000 players and I don't know if I want to have 2,000 tables (would that be a problem?). But I also don't feel like the second option is a good idea.
It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically it sounds like you are in need of a Player Dimension and a Play Fact table...maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.
In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to, say, the top 15 is likewise a simple single query.
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows, which is well within SQLite's limits but unwieldy for a single table. If you got to that point I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?
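To make that concrete, here is a minimal sketch of the single-table snapshot layout using Python's built-in sqlite3 module. The table and column names (xp_snapshot, account_name, taken_at and so on) are only illustrative, not a prescribed schema.

import sqlite3
import time

conn = sqlite3.connect("leaderboard.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS xp_snapshot (
        account_name   TEXT    NOT NULL,
        character_name TEXT    NOT NULL,
        xp             INTEGER NOT NULL,
        taken_at       INTEGER NOT NULL   -- unix timestamp of the 10-minute snapshot
    )
""")

def record_snapshot(rows):
    """rows: iterable of (account_name, character_name, xp) tuples."""
    now = int(time.time())
    conn.executemany(
        "INSERT INTO xp_snapshot VALUES (?, ?, ?, ?)",
        [(a, c, xp, now) for a, c, xp in rows],
    )
    conn.commit()

# Top 15 characters by xp in the most recent snapshot: a single query.
top15 = conn.execute("""
    SELECT account_name, character_name, xp
    FROM xp_snapshot
    WHERE taken_at = (SELECT MAX(taken_at) FROM xp_snapshot)
    ORDER BY xp DESC
    LIMIT 15
""").fetchall()

An index on (taken_at) or (account_name, taken_at) keeps those lookups fast even as the snapshot history grows.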
I have a Python program that takes any number of addresses in one database table (Dataset A) and queries each one against another database table that contains promotional pricing for addresses across the country (Dataset B). The purpose of the program is to see which addresses in Dataset A are in Dataset B and, if they are, record that it found a match along with the record ID in the table. I have a set series of wildcard queries that do the SQL searching in a certain order to pull results. That is part of the problem, since the wildcard searches use LIKE '%CRITERIA%', which slows things down tremendously despite everything being indexed in MySQL.
Here's an example of the type of search it does today. Let's say the input address in Dataset A is 123 Main Street, Brooklyn, NY 11201 but the address in Dataset B is 123 North Main Street, Brooklyn, NY 11201. A straight query of Dataset A against Dataset B would not find a match since it's not an exact match, but a wildcard search would yield a possible valid result with this style of query:
SELECT *
FROM Dataset_B
WHERE House_Number = 123 AND Street LIKE '%Main%' AND City = 'Brooklyn' AND State = 'NY'
I skipped the zip code because it sometimes interferes with results (the records commonly have an incorrect zip code on them); a zip code search would be a secondary search if I didn't find a hit using the above.
After my program receives the query results, it conducts an analysis to make sure that the result for each entry is not a false positive.
I've been using a Python multiprocessing approach to the SQL that I developed for this purpose, and it has worked pretty well overall, but I'm not sure how resource-efficient it is compared to alternatives. It commonly saturates the hard drive's data rate because of all the wildcard querying, with up to 8 concurrent queries running at once to take advantage of all of my cores. The problem occurs when Dataset A has hundreds of thousands or millions of entries: it can take 12 hours or more, since Dataset B can also have hundreds of thousands or millions of records. I feel like SQL is the slowest way to do this, and I'm not sure whether there is a much more efficient solution using Python data structures that still supports my wildcard criteria. Pandas would probably be overwhelmed by this volume, and I'm not sure how much faster it would be compared to SQL. I was playing around with NumPy but wasn't sure if that was the right direction. Can someone provide some guidance on the fastest way to tackle this kind of problem?
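For reference, this is roughly the pattern the current program follows; it is only a sketch of what is described above (not the actual code), using PyMySQL with placeholder connection details, table and column names, and without the false-positive analysis step.

import multiprocessing as mp
import pymysql

def init_worker():
    # One connection per worker process; sharing a single connection across processes is unsafe.
    global conn
    conn = pymysql.connect(host="localhost", user="user",
                           password="secret", database="promos")

def match_address(addr):
    house, street, city, state = addr
    with conn.cursor() as cur:
        cur.execute(
            "SELECT ID FROM Dataset_B "
            "WHERE House_Number = %s AND Street LIKE %s AND City = %s AND State = %s",
            (house, f"%{street}%", city, state),
        )
        hit = cur.fetchone()
    return addr, (hit[0] if hit else None)

if __name__ == "__main__":
    dataset_a = [(123, "Main", "Brooklyn", "NY")]   # normally read from Dataset_A
    with mp.Pool(processes=8, initializer=init_worker) as pool:
        for addr, record_id in pool.imap_unordered(match_address, dataset_a):
            print(addr, "->", record_id)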
I am currently trying to develop macros/programs to help me edit a big database in Excel.
Just recently I successfully wrote a custom macro in VBA which loads two big arrays into memory and compares them by a single column in each (for example, by name); the items common to both arrays are then copied into another temporary array, together with the other entries in the same row. So if row(11) had the name "Tom", and it is common to both arrays, the entire row would be copied, including the salary of 10,000 and the phone number next to it.
This was not easy, but I got to it somehow.
Now, this works like a charm for arrays as big as 10,000 rows x 5 columns + another array of the same size 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!
But now I tried a much bigger array with this method, say 200,000 rows x 10 columns compared against a second array of 10,000 rows x 10 columns... and it took a very long time.
The problem is that Excel only runs at 25% CPU; I checked online and apparently that is normal.
Thus, I am assuming that to get better performance I would need to use another 'tool', in this case another programming language.
I have heard that Python is great, Python is easy, etc., but I am no programmer; I just learned a few dozen object names and I know some logic, so I got by in VBA.
Is it Python? Or will changing the programming language perhaps not help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes; I literally have no idea what is going on in that code.
If it is indeed Python, which libraries should I start with? Perhaps learn some easy things first and then move on to those arrays?
Thanks!
I have no intention of being condescending, but anything I say may sound that way, so be it.
The operation you are doing is called join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.
I suspect that you are doing an NxM operation in Excel. A 200,000 rows x 10,000 rows operation quickly explodes: pick a key in N, search for a matching row in M, and produce the result. When you do this, regardless of the programming language, the amount of computation becomes so large that there is no way to finish the task in a reasonable amount of time.
In this case, 200,000 rows x 10,000 rows requires on average about 5,000 lookups for each of the 200,000 rows. That's 1,000,000,000 lookups.
So, how do real databases do this in a reasonable amount of time? They use an index. When you look into this 10,000-row table, what you are looking for is indexed, so finding a row takes about log2(10,000) steps. The total amount of computation becomes N * log2(M), which is far more manageable. If you hash the key, the search cost is almost O(1) - meaning it's constant - so the amount of computation becomes roughly N.
What you are doing is probably, in real database terms, a full table scan. It is something to avoid in a real database because it is slow.
If you use any real (SQL) database, or a programming language that provides key-based search over a dataset, your join will become really fast. It has nothing to do with any particular programming language; it is really Computer Science 101.
I do not know much about what Excel can do. If Excel provides some facility to look up a row based on an index or a hash, you may be able to speed things up drastically.
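To illustrate the key-based lookup, here is a minimal sketch in Python; the small lists stand in for the two Excel arrays, and the column layout is made up. Building the dictionary is O(M) and each lookup is roughly O(1), so the whole join costs about O(N + M) instead of O(N x M).

small = [                      # ~10,000 rows in the real case: (name, salary, phone)
    ("Tom", 10_000, "555-0100"),
    ("Ann", 12_000, "555-0101"),
]
big = [                        # ~200,000 rows in the real case: (name, department, city)
    ("Tom", "Sales", "Berlin"),
    ("Bob", "IT", "Madrid"),
]

# Index the smaller table by its key column (the name) once.
by_name = {row[0]: row for row in small}

# One pass over the big table; each name lookup is a hash lookup, not a scan.
joined = [big_row + by_name[big_row[0]][1:]
          for big_row in big
          if big_row[0] in by_name]

print(joined)   # [('Tom', 'Sales', 'Berlin', 10000, '555-0100')]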
Ideally you want to design a database (there are many, such as SQLite, PostgreSQL, MySQL, etc.) and put your data into it. SQL is the language for talking to a database (DML, data manipulation language) and for creating or editing the structure of the database (DDL, data definition language).
Why a database? You'll get data validation and the ability to query data with many kinds of relationships (such as one-to-many: one author can have many books, so you'll have an Author table and a Book table and will need to join them).
Pandas works not just with databases but also with CSV and text files, Microsoft Excel, and HDF5, and it is great for reading these into in-memory structures, writing them back out, and merging, joining and slicing the data. The quickest route to what you want is likely to read your data into pandas DataFrames and manipulate it from there. That makes a database optional, though recommended. See Pandas Merging 101 for an idea of what you can do with pandas.
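As a rough sketch of what that would look like for your two sheets (the file names and the shared "Name" column here are just placeholders):

import pandas as pd

big = pd.read_excel("big.xlsx")      # e.g. 200,000 rows x 10 columns
small = pd.read_excel("small.xlsx")  # e.g. 10,000 rows x 10 columns

# Inner merge keeps only the rows whose "Name" appears in both sheets,
# carrying along all the other columns from both sides.
common = big.merge(small, on="Name", how="inner", suffixes=("_big", "_small"))

common.to_excel("common_rows.xlsx", index=False)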
Another Python tool you could use is SQLAlchemy, which is an ORM, an object-relational mapper (it converts, say, a row in an Author table into an Author class object in Python). While it's important to know SQL and database principles, you don't need to write SQL statements directly when using SQLAlchemy.
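For a flavour of what that looks like, here is a minimal sketch of the Author/Book example with SQLAlchemy (1.4 or later) against an in-memory SQLite database; the class and column names are illustrative only.

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Author(Base):
    __tablename__ = "author"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    books = relationship("Book", back_populates="author")

class Book(Base):
    __tablename__ = "book"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    author_id = Column(Integer, ForeignKey("author.id"))
    author = relationship("Author", back_populates="books")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # One author with one related book; the ORM handles the foreign key.
    session.add(Author(name="Ursula K. Le Guin",
                       books=[Book(title="The Dispossessed")]))
    session.commit()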
Each of these areas is huge, like the ocean. You can dip your toes into each, but if you wade in too deep you'll want to know how to swim. I have fist-sized books on each (that I've not finished) to give you a rough idea of what I mean by this.
A possible roadmap may look like:
Database (optional but recommended):
Learn about relational data
Learn database design
Learn SQL
Pandas (highly recommended):
Learn to read and write data (to Excel / a database)
Learn to merge, join, concatenate and update a DataFrame
I am writing a Django application that will have entries entered by users of the site. Now suppose that everything goes well, and I get the expected number of visitors (unlikely, but I'm planning for the future). This would result in hundreds of millions of entries in a single PostgreSQL database.
As iterating through such a large number of entries and checking their values is not a good idea, I am considering ways of grouping entries together.
Is grouping entries in to sets of (let's say) 100 a better idea for storing this many entries? Or is there a better way that I could optimize this?
Store one at a time until you absolutely cannot anymore, then design something else around your specific problem.
SQL is a declarative language, meaning "give me all records matching X" doesn't tell the db server how to do it. Consequently, you have a lot of ways to help the db server do this quickly even when you have hundreds of millions of records. Additionally, RDBMSs have been optimized for this problem over many years of experience, so up to a certain point you will not beat a system like PostgreSQL.
So as they say, premature optimization is the root of all evil.
So let's look at two ways PostgreSQL might go through a table to give you the results.
The first is a sequential scan, where it iterates over a series of pages, scans each page for the matching values and returns the records to you. This works better than any other method for very small tables, but it is slow on large tables. Complexity is O(n), where n is the size of the table, regardless of how many records you are fetching.
So a second approach might be an index scan. Here PostgreSQL traverses a series of pages in a b-tree index to find the records. Complexity is O(log(n)) to find each record.
Internally, PostgreSQL already stores the rows in fixed-size batches called pages, so it already solves this problem for you. If you try to do the same thing yourself, you end up with batches of records inside batches of records, which is usually a recipe for bad things.
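As a concrete illustration (not part of the original answer), here is a minimal sketch using psycopg2 against a hypothetical entries table with a user_id column; EXPLAIN shows which scan type the planner chooses, and it will only pick the index scan once the table is big enough for that to pay off.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
cur = conn.cursor()

# Without an index, filtering on user_id typically means a sequential scan: O(n).
cur.execute("EXPLAIN SELECT * FROM entries WHERE user_id = %s", (42,))
print("\n".join(row[0] for row in cur.fetchall()))

# A b-tree index lets the planner switch to an index scan: O(log n) per lookup.
cur.execute("CREATE INDEX IF NOT EXISTS entries_user_id_idx ON entries (user_id)")
conn.commit()

cur.execute("EXPLAIN SELECT * FROM entries WHERE user_id = %s", (42,))
print("\n".join(row[0] for row in cur.fetchall()))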
I've been studying Python on my own for some months, and I'm about to venture into the field of databases. I am currently aiming to create a very small-scale application that retrieves data based on a simple keyword search and displays the static data linked to it. Some numbers to put it in perspective:
About 150-200 "key" values
About 5-10 values to be displayed per "key" value
Editable (although the info tends to remain the same most of the time - maybe 1 or 2 amendments per month across all stored data)
An example would be the cards you see when you do a simple Google search. For example, you search for an actor (key value) and your query generates a "card" with values (age, height, wage, brothers, ...).
As this would be my first attempt at creating something that works with such a (for me personally) large amount of data, I am a bit puzzled by the options available to me. I have been reading up on different database models (relational, ...). Ultimately I came down to the 3 possibilities below:
XML
database
hardcode
I intend to make the data amendable by one privileged user. That alone makes hardcoding it not really an option, though as I am a novice I could be interpreting this wrong.
I'd be happy if you could point me in the right direction and, if you'd go for a database, tell me which one you'd recommend (MySQL, ...).
Many thanks in advance. If I made any error with the post (as it is my initial post), do not hesitate to point this out.
I have a scenario where I have to obfuscate data in a database (i.e. scramble it for testing purposes so that the real data cannot be seen; there is no need to ever unscramble it). Several tables reference the address_table. I cannot obfuscate the address_table itself, so I figured I would simply change the references in those tables to random other address_table IDs. The address_table contains 6M+ records. So I would create a temp table with all the address IDs and then, when needed, call some function to get a random one from there. I could generate a random value and take that row like this:
SELECT *
FROM (SELECT id, ROWNUM rn FROM myTempTable)
WHERE rn = x;
where x is some random value generated by dbms_random. Now, although this gives me what I need, it does not perform anywhere near as well as I expect.
Another thing I tried is the SAMPLE() clause; this performs a bit better (at least on a small table), but it is still not good enough.
I know there are several threads on this matter, like this or this one on MySQL, but they do not directly answer it in terms of performance.
Also, I am not limited to PL/SQL. I know very little PL/SQL - how is it in terms of performance? I mean, it is just another process in the DB server's processing queue; perhaps I could get better performance by doing the processing (generating the update scripts, populating the random values, etc.) on the client side using something like Python, even considering network latency? Does anybody have any experience with this?
Use the SAMPLE clause:
SELECT * FROM myTempTable SAMPLE(10);
This will return only about 10% of the rows.
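If the goal is one random ID per call, here is a hedged sketch of combining SAMPLE with ROWNUM from Python via cx_Oracle; the connection details and the sample percentage are placeholders, and SAMPLE is only approximately uniform.

import cx_Oracle

conn = cx_Oracle.connect("scott", "tiger", "localhost/XEPDB1")
cur = conn.cursor()

# SAMPLE(0.01) visits roughly 0.01% of the rows, which on a 6M-row table still
# leaves a few hundred candidates; ROWNUM = 1 then keeps just one of them.
cur.execute("""
    SELECT id
    FROM myTempTable SAMPLE(0.01)
    WHERE ROWNUM = 1
""")
row = cur.fetchone()
random_id = row[0] if row else None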
If you just want to hide the real data, why don't you take care of that in the SELECT part of the query? Instead of querying:
select column_name from table;
you could query:
select scrambling_function(column_name) from table;
scrambling_function can be whatever you like.
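As a purely illustrative example of such a scrambling function (not from the original answer), a keyed one-way hash turns each real value into stable gibberish; the salt here is a placeholder, and the same idea could run client-side in Python (as the asker considered) or be reimplemented as a database function.

import hashlib

SALT = b"some-secret-salt"   # placeholder; keep it separate from the test data

def scramble(value, length=12):
    """Return a stable, unreadable token for the given value."""
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return digest[:length]

print(scramble("123 Main Street"))   # the same input always yields the same token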
There is no good way to sample randomly using SQL that I am aware of. The SAMPLE function available in some SQL dialects does not give a sufficiently random sample. The best way is to export the full sample set and use random-number software to determine the indexes of the rows to include in your final selection. Or, if you have a simple numeric index (1, 2, 3, ... n) and know how many rows you need to select from, you could upload a list of indexes to include and query against that. Try random.org for random number generation; their API is located at http://www.random.org/clients/http/.
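A minimal sketch of that export-then-pick approach, using Python's standard random module instead of the random.org API; the file name is a placeholder.

import random

with open("address_ids.txt") as f:       # one ID per line, exported from the DB once
    address_ids = [line.strip() for line in f]

def random_address_id():
    """Return one ID drawn uniformly from the exported list."""
    return random.choice(address_ids)

# Or pick a whole batch of distinct IDs at once:
batch = random.sample(address_ids, k=100)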