I've been studying Python on my own for some months and I'm about to venture into the field of databases. I am currently aiming to create a very small-scale application that retrieves data based on a simple keyword search and displays the static data linked to it. Some numbers to put it in perspective:
About 150-200 "key" values
About 5-10 values to be displayed per "key" value
Editable (although the info tends to remain the same most of the time - maybe 1 or 2 amendments per month across all stored data)
An example would be the cards you see when you do a simple Google search: you search for an actor (key value) and your query generates a "card" with values (age, height, salary, siblings, ...).
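Roughly, I picture one such entry looking like this (hypothetical names and values, just to show the shape of the data):

# One "card": a key value with a handful of attributes attached to it.
cards = {
    "Some Actor": {
        "age": 45,
        "height": "1.80 m",
        "salary": "unknown",
        "siblings": ["Brother A", "Sister B"],
    },
}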
As this would be my first attempt at creating something that works with such a (for me, personally) large amount of data, I am a bit puzzled by the options available to me. I have been reading up on different database models (relational, ...). Ultimately I came down to the three possibilities below:
XML
database
hardcode
I intend to make the data amendable by one privileged user. That alone makes hardcoding it not really an option, although, as I am a novice, I could be interpreting this wrong.
I'd be happy if you could point me in the right direction and, if you'd go for a database, tell me which one you'd recommend (MySQL, ...).
Many thanks in advance. If I made any error with this post (it is my first), do not hesitate to point it out.
Related
I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A table where I add another table into it every 10 min containing all the data I got for that interval
I am not very sure about the first option: since there will be about 2000 players, I don't know if I want to have 2000 tables (would that be a problem?). But the second option doesn't feel like a good idea either.
It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically, it sounds like you need a Player Dimension and a Play Fact table... maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.
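As a very rough sketch of where that leads (table and column names are only my guess at a starting point, not a definitive design):

# Hypothetical dimension/fact layout in SQLite.
import sqlite3

conn = sqlite3.connect("leaderboard.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS dim_player (
    player_id      INTEGER PRIMARY KEY,
    account_name   TEXT NOT NULL,
    character_name TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS fact_xp_snapshot (
    player_id   INTEGER NOT NULL REFERENCES dim_player(player_id),
    captured_at TEXT    NOT NULL,   -- timestamp of the 10-minute snapshot
    xp          INTEGER NOT NULL
);
""")
conn.commit()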
In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements the Third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to say the top 15 is likewise a simple single query.
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows, which is a lot, but still well below the theoretical limits of an SQLite table; the practical concerns are more the size of the database file and query times. If you got to that point, I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?
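To make the "single table, trivial query" point concrete, here is a rough sketch (using a flat snapshot table; table and column names are assumptions):

import sqlite3

conn = sqlite3.connect("leaderboard.db")

# top 15 characters by xp in the most recent snapshot
top15 = conn.execute("""
    SELECT account_name, character_name, xp
    FROM xp_snapshot
    WHERE captured_at = (SELECT MAX(captured_at) FROM xp_snapshot)
    ORDER BY xp DESC
    LIMIT 15
""").fetchall()

# full 10-minute history for one character
history = conn.execute("""
    SELECT captured_at, xp
    FROM xp_snapshot
    WHERE character_name = ?
    ORDER BY captured_at
""", ("SomeCharacter",)).fetchall()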
I'm using MySQL for storing data about my products. I've read on the internet that using an auto-incrementing number as a primary key is a really bad idea, because anyone could just iterate through my service and steal my data. For practical purposes it is really handy, so now I'm trying to find a better solution.
So far I've created a function in Python that returns a random string of length n built from a fixed set of characters. I know it is possible that a generated ID was already generated before and stored in the DB. The probability of that is really low at the beginning, but it grows over time: at the point where, say, half of the possible IDs are already taken, the probability that creating a new record will fail is 50%.
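Roughly, the kind of function I mean (a simplified sketch; the alphabet and length are placeholders):

import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # 62 characters to choose from

def random_id(n=8):
    # secrets instead of random, so the IDs are also hard to predict
    return "".join(secrets.choice(ALPHABET) for _ in range(n))

def new_unique_id(existing_ids, n=8):
    # naive retry loop against the IDs already stored in the DB
    while True:
        candidate = random_id(n)
        if candidate not in existing_ids:
            return candidate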
Is it OK to work with IDs this way, or is there a better solution? How many characters should I use, at a minimum, to avoid this problem?
I know UUIDs exist, but I don't think they are very user-friendly.
I have a dataset which has items with the following layout/schema:
{
    words: "Hi! How are you? My name is Helennastica",
    ratio: 0.32,
    importantNum: 382,
    wordArray: ["dog", "cat", "friend"],
    isItCorrect: false,
    type: 2
}
where I have a lot of different types of data, including:
Arrays (of one type only, e.g. an array of strings or an array of numbers, never both)
Booleans
Numbers with a fixed min/max (e.g. on a scale of 0 to 1)
Unbounded integers (any integer from -∞ to ∞)
Strings, containing a mix of dictionary words and new words
The task is to create an RNN (well, more generally, a system that can quickly retrain when given one extra piece of data instead of reprocessing it all - I think an RNN is the best choice; see below for my reasoning) which can use all of these factors to categorise any item into one of 4 categories, labelled by the type key in the above example (a number 0-3).
I have a large set of examples in the above format (with the answer provided), and a database filled with uncategorised examples. My intention is to run the ML model on that set and sort all of them into categories. The reason I need to be able to retrain quickly is the feedback feature: if the AI gets something wrong, any user can report it, in which case that specific JSON will be added to the dataset. Obviously, having to retrain on 1000+ JSONs just to add one extra would take ages - if I am not mistaken, an RNN can get around this.
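To make the "retrain on one extra example" part concrete, this is the rough shape of what I imagine (a sketch only; the feature encoding is made up and probably naive):

import numpy as np
import tensorflow as tf

VOCAB_BUCKETS = 64  # hash words into a small bag-of-words vector

def encode(item):
    # turn one JSON-like dict into a fixed-length numeric vector
    bow = np.zeros(VOCAB_BUCKETS)
    for word in item["words"].lower().split() + item["wordArray"]:
        bow[hash(word) % VOCAB_BUCKETS] += 1.0
    extras = np.array([
        item["ratio"],                           # already on a 0..1 scale
        np.tanh(item["importantNum"] / 1000.0),  # squash the unbounded integer
        1.0 if item["isItCorrect"] else 0.0,
    ])
    return np.concatenate([bow, extras])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(VOCAB_BUCKETS + 3,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),  # the 4 "type" categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# initial training on the labelled set:
# model.fit(X, y, epochs=10)   where X holds the encoded items and y their "type"

def learn_from_feedback(item, correct_type):
    # single gradient step on one reported example, no full retrain
    x = encode(item)[None, :]
    y = np.array([correct_type])
    model.train_on_batch(x, y)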
I have found many possible use cases for something like this, yet I have spent literal hours browsing through GitHub trying to find an implementation, or some TensorFlow module/add-on to make this easier or to copy from, but to no avail.
I assume this would not be too difficult using TensorFlow, and I understand a bit of the maths and logic behind it (but I'm not formally educated, so I probably have gaps!), but unfortunately I have essentially no experience with using TensorFlow or any other ML framework (beyond copy-pasting code for some other projects). If someone could point me in the right direction in the form of a GitHub repo or Python framework, or even write some demo code to help solve this problem, it would be greatly appreciated. And if you're just going to correct some of my technical knowledge or tell me where I've gone horrendously wrong, I'd appreciate that feedback too (just leave it as a comment).
Thanks in advance!
I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
Basic idea is described in many places over the internet, for example, in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
The matcher that I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the desired false match/false non-match rates).
Currently I store matching scores along with labels in a CSV file, like the following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labelled database.)
Then I run a Python script that loads this table into a Pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in figure 2 of the link above. The processing is rather simple: just sort the DataFrame, scan the rows from top to bottom, and count the number of impostors/genuines above and below each row.
The system should also support finding outliers in order to help improve the matching algorithm (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also pretty easy with DataFrames (just sort and take the head rows).
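In code, the processing is roughly this kind of thing (a simplified sketch; column names come from the CSV layout above, and I treat lower scores as "more similar", which is what the sample rows suggest):

import pandas as pd

df = pd.read_csv("scores.csv", header=None, skipinitialspace=True,
                 names=["label_a", "label_b", "kind", "score"])
df = df.sort_values("score").reset_index(drop=True)

n_genuine = (df["kind"] == "genuine").sum()
n_impostor = (df["kind"] == "impostor").sum()

# treat each row's score as a candidate threshold t (match declared when
# score <= t): FMR(t) = share of impostors at or below t,
#              FNMR(t) = share of genuines above t
df["FMR"] = (df["kind"] == "impostor").cumsum() / n_impostor
df["FNMR"] = (n_genuine - (df["kind"] == "genuine").cumsum()) / n_genuine

# outliers: genuine pairs with abnormally large scores,
# impostor pairs with abnormally small scores
odd_genuines = df[df["kind"] == "genuine"].nlargest(10, "score")
odd_impostors = df[df["kind"] == "impostor"].nsmallest(10, "score")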
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis in this regard: the amount of data is large, several PCs are involved in the computations, and Redis has a master-slave replication feature that quickly syncs data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be well suited for storing such tabular data, so I would need to change the data structures and the algorithms that process them. It is not obvious to me, though, how to translate this table into Redis data structures.
Another option would be using some other data storage system instead of Redis. However, I am unaware of such systems and would be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serves your query(ies) in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
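For example, a minimal sketch of the Sorted Set idea with redis-py (key names and the threshold are just examples):

import redis

r = redis.Redis(host="localhost", port=6379)

def add_score(label_a, label_b, kind, score):
    # kind is "genuine" or "impostor"; the pair's labels become the member
    r.zadd(f"scores:{kind}", {f"{label_a}:{label_b}": score})

# abnormally small impostor scores / abnormally large genuine scores (outlier candidates)
odd_impostors = r.zrange("scores:impostor", 0, 9, withscores=True)
odd_genuines = r.zrevrange("scores:genuine", 0, 9, withscores=True)

# impostor comparisons at or below a candidate threshold (the FMR numerator)
fmr_numerator = r.zcount("scores:impostor", "-inf", 0.5)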
Disclaimer: I work at Redis Labs, home of the open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).
I apologize in advance if this question is too broad, but I need some help conceptualizing.
The end result is that I want to enable radius-based searching. I am using Django. To do this, I have two classes: Users and Places. Inside the Users class is a function that defines the radius in which people want to search. Inside the Places class I have a function that defines the midpoint if someone enters a city and state and not a zip (i.e., if someone enters New York, NY a lot of zipcodes are associated with that so I needed to find the midpoint).
I have those two parts down. So now I have the radius in which people want to search and I know (an estimate of) the location of the places. Now I am having a tremendous amount of difficulty combining the two, or even thinking about HOW to do this.
I attempted to do the search by comparing the two in the view, but I ran into a lot of trouble when I was looping through one model in the template while trying to display results based on an if statement involving the other model.
It seemed like a custom template tag would be the solution for that problem, but I wanted to make sure I was conceptualizing the problem correctly in the first place. That is:
Do I want to do the displays based on an if statement in the template?
Or should I be creating another class based on the other two in my models file?
Or should I create a new column for one of the classes in the models file?
I suppose my ultimate question is, based on what it is I want to do (enable radius based searching), where/how should most of the work be done? Again, I apologize if the question is overly broad.
Perhaps you could put it in the view which renders the search page.
Assuming you have a view function like search, you could:
Get the user's radius: request.user.get_radius
Search for places based on that radius: relevant_places = Places.get_all_places_in_radius
Render those places to the user
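A rough sketch of such a view (get_radius and get_all_places_in_radius are placeholders for whatever your Users/Places code actually provides):

from django.shortcuts import render

from .models import Places   # placeholder import path


def search(request):
    radius = request.user.get_radius()                          # 1. the user's search radius
    relevant_places = Places.get_all_places_in_radius(radius)   # 2. places within that radius
    return render(request, "results.html", {"places": relevant_places})  # 3. show them to the user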
Based on what you are describing, I believe GeoDjango would be worth your time to look into: http://geodjango.org/
Especially if you want to enable radius-based searching, most of the heavy lifting is already done by GeoDjango; you'll just have to invest some time learning how to use it (which is a small fraction of the time you would have had to spend "reinventing the wheel", so to speak).
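For instance, a radius query in GeoDjango boils down to roughly this (assuming Places has a PointField named location; the coordinates and radius are example values):

from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

from myapp.models import Places   # placeholder import path

center = Point(-73.99, 40.73)      # lon/lat of the search midpoint
radius_miles = 10                  # would come from the user's settings in practice

nearby = Places.objects.filter(
    location__distance_lte=(center, D(mi=radius_miles))
)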
I just decided to add the function to the view so that the information can be input directly into the model after a user enters it. Thanks for the help. I'll probably wind up looking into GeoDjango.