Database Design NBA - python

I am new to database design, so sorry if this is an obvious beginner question. I use Python and SQLAlchemy, though I don't think that is relevant (the sample code below is pseudocode); I may be wrong. I have looked through some previous questions and didn't see this addressed. Anyway, on to the question. The goal is to develop a database of NBA information that holds info on all games played, plus box scores for each player in each game. There are a couple of ways this DB can be designed. Here is the first option:
Game(game_id, date, home_name, away_name, score)
Box_Score(game_id, player_name, date, points, rebounds)
In this design, if I want all the games the Los Angeles Lakers played, or all of Kobe Bryant's box scores, I can just do
from sqlalchemy import or_
query(Game).filter(or_(Game.home_name == "lakers", Game.away_name == "lakers")).all()
query(Box_Score).filter(Box_Score.player_name == "kobe bryant").all()
Here is the second option for how to design this database:
Game(game_id, date, home_name=(foreignkey=Team.team_name), away_name, score)
Box_Score(game_id, player_name=(foreignkey=Player.player_name), date, points, rebounds)
Team(team_name, home_games=relationship("Game"))
Player(player_name, box_scores=relationship("Box_Score"))
Then I can do
query(Team).filter(Team.team_name == "lakers").first().home_games
query(Player).filter(Player.player_name == "kobe bryant").first().box_scores
On the one hand, it seems like the whole point of using a relational database is to set it up as in design #2. On the other hand, I am not sure what extra functionality it gives me. So my question is: which design do you recommend? Are there benefits or disadvantages to either design that will become apparent down the line and that I cannot see yet? And if you recommend the simpler design #1, which does not use table relationships, why is it that I am storing a decent amount of related information but don't need the relational features of the database? Thanks!!

The ideal data model for any database is highly subjective. If you are new to database design, you probably will not find the ideal schema until after you have created your application and tested it for an extended period of time. I would recommend reading up on some design basics, particularly Database Normalization, since you would probably benefit from a highly normalized schema, where data can be referenced in many different ways. Highly-normalized databases can suffer in the performance department if very large (which this does not seem like it would be), but you can always de-normalize data through the use of Materialized Views or other methods of caching.
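To make the normalization advice a bit more concrete, here is a minimal sketch of a normalized take on design #2. It uses Python's built-in sqlite3 module instead of SQLAlchemy so that it runs without extra packages. The table and column names (team, player, game, box_score, home_team_id and so on) are my own assumptions, and the sample rows are made up for illustration only. Note that the date is stored once on the game rather than repeated on every box score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE team   (team_id   INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE player (player_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE game (
    game_id      INTEGER PRIMARY KEY,
    date         TEXT NOT NULL,
    home_team_id INTEGER NOT NULL REFERENCES team(team_id),
    away_team_id INTEGER NOT NULL REFERENCES team(team_id),
    home_score   INTEGER,
    away_score   INTEGER
);
CREATE TABLE box_score (
    game_id   INTEGER NOT NULL REFERENCES game(game_id),
    player_id INTEGER NOT NULL REFERENCES player(player_id),
    points    INTEGER,
    rebounds  INTEGER,
    PRIMARY KEY (game_id, player_id)
);
""")

# Made-up sample data, purely for illustration.
conn.executemany("INSERT INTO team (team_id, name) VALUES (?, ?)",
                 [(1, "lakers"), (2, "celtics"), (3, "bulls")])
conn.execute("INSERT INTO player (player_id, name) VALUES (1, 'kobe bryant')")
conn.executemany("INSERT INTO game VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, "2010-01-01", 1, 2, 100, 90),
                  (2, "2010-01-03", 3, 1, 95, 99),
                  (3, "2010-01-05", 2, 3, 88, 80)])
conn.execute("INSERT INTO box_score VALUES (1, 1, 30, 5)")

# All games the Lakers played, home or away, found through the team table.
lakers_games = conn.execute("""
    SELECT g.game_id, g.date
    FROM game g
    JOIN team t ON t.team_id IN (g.home_team_id, g.away_team_id)
    WHERE t.name = ?
    ORDER BY g.date
""", ("lakers",)).fetchall()

# A box score for a player that does not exist is rejected by the database.
try:
    conn.execute("INSERT INTO box_score VALUES (1, 99, 10, 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The last part shows one thing design #1 cannot give you: with plain name strings, nothing stops a misspelled player or team name from creating an orphaned row.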


Advantage of Django ORM V/S Performing raw SQL queries [duplicate]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
If you had to justify the "pros" of an ORM, and why you would use one, to management or a client, what would those reasons be?
Try and keep one reason per answer so that we can see which one gets voted up as the best reason.
The most important reason to use an ORM is so that you can have a rich, object oriented business model and still be able to store it and write effective queries quickly against a relational database. From my viewpoint, I don't see any real advantages that a good ORM gives you when compared with other generated DAL's other than the advanced types of queries you can write.
One type of query I am thinking of is a polymorphic query. A simple ORM query might select all shapes in your database. You get a collection of shapes back. But each instance is a square, circle or rectangle according to its discriminator.
Another type of query would be one that eagerly fetches an object and one or more related objects or collections in a single database call. e.g. Each shape object is returned with its vertex and side collections populated.
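To illustrate the polymorphic query described above, here is a hand-rolled sketch of what an ORM does with a discriminator column. It is not any particular ORM's API; the shape table, the kind column and the class names are all assumptions made up for this example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Shape:
    id: int


@dataclass
class Circle(Shape):
    radius: float


@dataclass
class Square(Shape):
    side: float


# Maps the discriminator value stored in each row to the class to build.
SHAPE_CLASSES = {"circle": Circle, "square": Square}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shape (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, size REAL NOT NULL)"
)
conn.executemany("INSERT INTO shape VALUES (?, ?, ?)",
                 [(1, "circle", 2.0), (2, "square", 3.0)])


def load_shapes(conn):
    """Select every shape; each row becomes an instance of its own subclass."""
    rows = conn.execute("SELECT id, kind, size FROM shape ORDER BY id")
    return [SHAPE_CLASSES[kind](shape_id, size) for shape_id, kind, size in rows]


shapes = load_shapes(conn)
```

One query returns a mixed collection: shapes[0] is a Circle and shapes[1] is a Square. A real ORM adds the mapping configuration, change tracking and eager loading of related collections on top of this idea.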
I'm sorry to disagree with so many others here, but I don't think that code generation is a good enough reason by itself to go with an ORM. You can write or find many good DAL templates for code generators that do not have the conceptual or performance overhead that ORM's do.
Or, if you think that you don't need to know how to write good SQL to use an ORM, again, I disagree. It might be true that from the perspective of writing single queries, relying on an ORM is easier. But, with ORM's it is far too easy to create poor performing routines when developers don't understand how their queries work with the ORM and the SQL they translate into.
Having a data layer that works against multiple databases can be a benefit. It's not one that I have had to rely on that often though.
In the end, I have to reiterate that in my experience, if you are not using the more advanced query features of your ORM, there are other options that solve the remaining problems with less learning and fewer CPU cycles.
Oh yeah, some developers do find working with ORM's to be fun so ORM's are also good from the keep-your-developers-happy perspective. =)
Speeding development. For example, eliminating repetitive code like mapping query result fields to object members and vice-versa.
Making data access more abstract and portable. ORM implementation classes know how to write vendor-specific SQL, so you don't have to.
Supporting OO encapsulation of business rules in your data access layer. You can write (and debug) business rules in your application language of preference, instead of clunky trigger and stored procedure languages.
Generating boilerplate code for basic CRUD operations. Some ORM frameworks can inspect database metadata directly, read metadata mapping files, or use declarative class properties.
You can move to different database software easily because you are developing to an abstraction.
Development happiness, IMO. ORM abstracts away a lot of the bare-metal stuff you have to do in SQL. It keeps your code base simple: fewer source files to manage and schema changes don't require hours of upkeep.
I'm currently using an ORM and it has sped up my development.
So that your object model and persistence model match.
To minimise duplication of simple SQL queries.
The reason I'm looking into it is to avoid the generated code from VS2005's DAL tools (schema mapping, TableAdapters).
The DAL/BLL I created over a year ago was working fine (for what I had built it for) until someone else started using it to take advantage of some of the generated functions (which I had no idea were there).
It looks like it will provide a much more intuitive and cleaner solution than the DAL/BLL solution from http://wwww.asp.net
I was thinking about creating my own SQL Command C# DAL code generator, but the ORM looks like a more elegant solution.
Abstract the sql away 95% of the time so not everyone on the team needs to know how to write super efficient database specific queries.
I think there are a lot of good points here (portability, ease of development/maintenance, focus on OO business modeling etc), but when trying to convince your client or management, it all boils down to how much money you will save by using an ORM.
Do some estimations for typical tasks (or even larger projects that might be coming up) and you'll (hopefully!) get a few arguments for switching that are hard to ignore.
Compilation and testing of queries.
As the tooling for ORM's improves, it is easier to determine the correctness of your queries faster through compile time errors and tests.
Compiling your queries helps developers find errors faster. Right? Right. This compilation is made possible because developers are now writing queries in code using their business objects or models instead of just strings of SQL or SQL-like statements.
If using the correct data access patterns in .NET it is easy to unit test your query logic against in memory collections. This speeds the execution of your tests because you don't need to access the database, set up data in the database or even spin up a full blown data context.[EDIT]This isn't as true as I thought it was as unit testing in memory can present difficult challenges to overcome. But I still find these integration tests easier to write than in previous years.[/EDIT]
This is definitely more relevant today than a few years ago when the question was asked, but that may only be the case for Visual Studio and Entity Framework, where my experience lies. Plug in your own environment if possible.
.net tiers using code smith templates
http://nettiers.com/default.aspx?AspxAutoDetectCookieSupport=1
Why code something that can be generated just as well.
Convince them of how much time and money you will save when changes come in and you don't have to rewrite your SQL, since the ORM tool will do that for you.
I think one con is that an ORM requires some changes to your POJOs, mainly related to schema, relations and queries. Consider a scenario where you are not supposed to change the model objects, perhaps because they are shared among more than one project or between client and server. In such cases you need to split the model into two levels, which requires additional effort.
I am an Android developer and, as you know, mobile apps are usually not huge, so this additional effort to separate the pure model from the ORM-affected model does not seem worthwhile.
I understand that the question is a generic one, but mobile apps also fall under that generic umbrella.

Storm ORM and auto generation table

I started reading the Storm ORM docs and tried some examples with sqlite. I have one question: can Storm automatically create tables from models or not? I don't want to do this:
store.execute("CREATE TABLE person "
"(id INTEGER PRIMARY KEY, name VARCHAR)")
every time I want to create a new table; it also fails when the table already exists.
Storm ORM has no feature for auto-creating tables. I have started using the peewee ORM instead, and it looks very nice.
If you're still starting the project and haven't put too much work into it yet, let me kindly suggest that you try an object-oriented database directly, instead of emulating one on top of a relational backend. ZODB is a very good match for that, but you should also have a look at MongoDB and its colleagues. I tried Storm a while ago and dropped it again quite soon, throwing a lot of code away, because of the horribly slow performance, especially with insert-or-update statements. You don't have to make the same mistake.
More on-topic: as far as I know, there is no such feature. I was also looking for it, and was somewhat disappointed, after setting up a detailed data model, that it couldn't generate the tables automatically. Correct me if I missed it.
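For the "table already exists" part of the question, SQLite itself accepts CREATE TABLE IF NOT EXISTS, which makes the statement safe to run on every start-up. Below is a small sketch using Python's built-in sqlite3 module rather than Storm; the helper name ensure_person_table is made up. The same SQL string should in principle work through store.execute as well, but I have not tested that.

```python
import sqlite3


def ensure_person_table(conn):
    """Create the person table only if it is not there yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person "
        "(id INTEGER PRIMARY KEY, name VARCHAR)"
    )


conn = sqlite3.connect(":memory:")
ensure_person_table(conn)
ensure_person_table(conn)  # second call does nothing instead of raising an error

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
```

This does not generate the schema from the model classes, so the column list still has to be kept in sync by hand.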

Correct way of implementing database-wide functionality

I'm creating a small website with Django, and I need to calculate statistics with data taken from several tables in the database.
For example (nothing to do with my actual models), for a given user, let's say I want all birthday parties he has attended, and people he spoke with in said parties. For this, I would need a wide query, accessing several tables.
Now, from the object-oriented perspective, it would be great if the User class implemented a method that returned that information. From a database model perspective, I don't like at all the idea of adding functionality to a "row instance" that needs to query other tables. I would like to keep all properties and methods in the Model classes relevant to that single row, so as to avoid scattering the business logic all over the place.
How should I go about implementing database-wide queries that, from an object-oriented standpoint, belong to a single object? Should I have an external kinda God-object that knows how to collect and organize this information? Or is there a better, more elegant solution?
I recommend extending Django's Model-Template-View approach with a controller. I usually have a controller.py within my apps which is the only interface to the data sources. So in your above case I'd have something like get_all_parties_and_people_for_user(user).
This is especially useful when your "data taken from several tables in the database" becomes "data taken from several tables in SEVERAL databases" or even "data taken from various sources, e.g. databases, cache backends, external apis, etc.".
User.get_attended_birthday_parties() or Event.get_attended_parties(user) work fine: it's an interface that makes sense when you use it. Creating an additional "all-purpose" object will not make your code cleaner or easier to maintain.
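Whichever place you pick, the cross-table query ends up behind one well-named entry point. Here is a framework-free sketch of that idea using Python's built-in sqlite3 module instead of the Django ORM. The schema (person, party, attendance, conversation) is invented to match the birthday-party example and is an assumption, not anyone's actual models; the function name follows the controller answer above.

```python
import sqlite3


def get_all_parties_and_people_for_user(conn, user_id):
    """Return {party name: names of people the user spoke with at that party}."""
    rows = conn.execute("""
        SELECT p.name, other.name
        FROM attendance a
        JOIN party p ON p.id = a.party_id
        LEFT JOIN conversation c
               ON c.party_id = p.id AND c.user_id = a.user_id
        LEFT JOIN person other ON other.id = c.other_user_id
        WHERE a.user_id = ?
        ORDER BY p.id, other.name
    """, (user_id,))
    result = {}
    for party_name, other_name in rows:
        people = result.setdefault(party_name, [])
        if other_name is not None:  # party attended, but nobody spoken to
            people.append(other_name)
    return result


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE party (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attendance (user_id INTEGER, party_id INTEGER);
CREATE TABLE conversation (party_id INTEGER, user_id INTEGER, other_user_id INTEGER);
INSERT INTO person VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO party VALUES (1, 'party A'), (2, 'party B');
INSERT INTO attendance VALUES (1, 1), (1, 2), (2, 1);
INSERT INTO conversation VALUES (1, 1, 2), (1, 1, 3);
""")

summary = get_all_parties_and_people_for_user(conn, 1)
```

The same function body could sit in a controller module or be called from a thin method on the model; callers only see the one interface either way.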

Keeping track of user habits and activities? - Django

I was working on a project a few months ago and needed to implement an award system, similar to StackOverflow's badge system.
I might not have implemented it in the best possible way, and I am curious what your take on it would be.
What would be a good way to track the user activities needed for awarding badges?
StackOverflow's system needs to know a lot of information, and I also get the impression that there would be a lot of data processing complicating things.
I would assume that SO calculates badges once or twice every 24 hours, and that maybe logs are stored, or a server is dedicated to badge calculation.
Thoughts?
I don't think it is as complicated as you think. I highly doubt that SO calculates badges with some kind of user activity log (although technically the entire database is a user activity log). When I look at the lists of badges, I don't see anything that can't be implemented by running a SQL select query.
Some of the queries could be pretty complicated, and there might be some sort of fancy caching mechanism, but I don't see any reason why you would have to calculate badges in batches.
In general badge/point systems can be based on two things.
An activity log of interesting events. This is effectively the paper register receipt of what has happened, so that you can re-compute everything from the ground up if it's ever needed. It can be as simple as (user_id, timestamp, event_id, event_detail).
Most of the time you've pre-designed your scoring/point system so you know exactly which counters to keep on a user. Now it's as simple as having a big record that contains all of the details. (user_id, reply_points, login_points, last_login, thumbs_up_points, etc.,etc.)
Now you can slap some simple methods on that model object and have it manage/store the points as needed.
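Here is a small sketch of the activity-log approach, with a badge decided by a single SELECT as the first answer suggests. It uses Python's built-in sqlite3 module; the table name activity_log, the event names, the threshold and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_log "
    "(user_id INTEGER, timestamp TEXT, event_id TEXT, event_detail TEXT)"
)
# Made-up events, purely for illustration.
conn.executemany("INSERT INTO activity_log VALUES (?, ?, ?, ?)", [
    (1, "2010-01-01T10:00", "answer_posted", ""),
    (1, "2010-01-02T10:00", "answer_posted", ""),
    (1, "2010-01-03T10:00", "answer_posted", ""),
    (2, "2010-01-01T11:00", "answer_posted", ""),
    (2, "2010-01-01T12:00", "login", ""),
])


def users_earning_badge(conn, event_id, threshold):
    """Users who have logged at least `threshold` events of the given kind."""
    rows = conn.execute(
        "SELECT user_id FROM activity_log WHERE event_id = ? "
        "GROUP BY user_id HAVING COUNT(*) >= ? ORDER BY user_id",
        (event_id, threshold),
    )
    return [user_id for (user_id,) in rows]


prolific = users_earning_badge(conn, "answer_posted", 3)
```

With the counter approach, the same check becomes a comparison against one column of the per-user record, at the cost of no longer being able to recompute from history.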

Django Table with Million of rows

I have a project with 2 applications (books and reader).
The books application has a table with 4 million rows and these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying a table with 4 million rows, I am thinking of dividing it by subject: 20 models with 20 tables of 200,000 rows each (book_horror, book_drammatic, etc.).
In the "reader" application, I am thinking of adding these fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which selects the appropriate table) and "book_id" (which selects the book in the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?
Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step, it sounds like you've done this though. 4 million rows should be ok for the db to handle with an index.
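As a rough illustration of what the index changes, the sketch below asks SQLite for its query plan before and after creating an index on book_title. It uses Python's built-in sqlite3 module with a toy table; the table name book, the index name idx_book_title and the generated rows are assumptions, and other databases report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book "
    "(id INTEGER PRIMARY KEY, book_title TEXT, book_description TEXT)"
)
# Toy data; far fewer rows than the real table, but enough to get a plan.
conn.executemany(
    "INSERT INTO book (book_title, book_description) VALUES (?, ?)",
    ((f"title {i}", "...") for i in range(10000)),
)


def query_plan(conn, sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(str(row[-1]) for row in rows)


sql = "SELECT id, book_description FROM book WHERE book_title = ?"

plan_before = query_plan(conn, sql, ("title 42",))  # full table scan
conn.execute("CREATE INDEX idx_book_title ON book (book_title)")
plan_after = query_plan(conn, sql, ("title 42",))   # lookup through the index
```

In Django, the corresponding step is passing db_index=True to the field you filter on, so that the index is created along with the table.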
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (eg multi-db), and you should be able to hand write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, like the author, divided by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big --- this is why most people here are advising against it!
[edit]
I left out denormalisation! Put common counts, sums etc in the eg author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases or after caching has failed you --- but well before sharding or horizontal partitioning.
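A minimal sketch of that kind of denormalisation, maintained by hand: a book_count column on the author table that is updated in the same transaction as each insert. It uses Python's built-in sqlite3 module; the author table, the book_count column and the add_book helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    id         INTEGER PRIMARY KEY,
    name       TEXT,
    book_count INTEGER NOT NULL DEFAULT 0  -- denormalised: duplicates COUNT(*) on book
);
CREATE TABLE book (
    id         INTEGER PRIMARY KEY,
    author_id  INTEGER REFERENCES author(id),
    book_title TEXT
);
""")
conn.execute("INSERT INTO author (id, name) VALUES (1, 'some author')")


def add_book(conn, author_id, title):
    """Insert a book and bump the cached count in a single transaction."""
    with conn:  # commits on success, rolls back both statements on error
        conn.execute(
            "INSERT INTO book (author_id, book_title) VALUES (?, ?)",
            (author_id, title),
        )
        conn.execute(
            "UPDATE author SET book_count = book_count + 1 WHERE id = ?",
            (author_id,),
        )


add_book(conn, 1, "first")
add_book(conn, 1, "second")

cached = conn.execute("SELECT book_count FROM author WHERE id = 1").fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM book WHERE author_id = 1").fetchone()[0]
```

The cached value is only correct as long as every write path goes through add_book (and a matching delete helper), which is the maintenance cost mentioned above.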
ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
Edit:
And for pete's sake, keep it in one table and use indexes as appropriate.
You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.
I'm not familiar with Django, but I have a general understanding of DB.
When you have large tables, it's pretty normal to index them. That way, retrieving data should be pretty quick.
When it comes to associating a book with a reader, you should create another table that links readers to books.
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.
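The linking table mentioned above could look roughly like this. It is a sketch with Python's built-in sqlite3 module; the table name reader_book and the sample rows are made up, and in Django a ManyToManyField would create an equivalent table for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book   (id INTEGER PRIMARY KEY, book_title TEXT);
CREATE TABLE reader (id INTEGER PRIMARY KEY, reader_name TEXT);
-- One row per (reader, book) pair; no subject number or per-subject table needed.
CREATE TABLE reader_book (
    reader_id INTEGER NOT NULL REFERENCES reader(id),
    book_id   INTEGER NOT NULL REFERENCES book(id),
    PRIMARY KEY (reader_id, book_id)
);
INSERT INTO book VALUES (1, 'dune'), (2, 'emma');
INSERT INTO reader VALUES (1, 'ann');
INSERT INTO reader_book VALUES (1, 1), (1, 2);
""")

# All titles for one reader, via the linking table.
titles = [title for (title,) in conn.execute(
    "SELECT b.book_title FROM reader_book rb "
    "JOIN book b ON b.id = rb.book_id "
    "WHERE rb.reader_id = ? ORDER BY b.book_title",
    (1,),
)]
```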
Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.
A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look in to caching with something like memcached. Django supports this quite well.
You can use server-side processing for the data table, so that the server returns only the page of rows currently being displayed. If you implement it that way, a table backed by more than 4 million records can still respond in less than a second.
