Should I use python classes for a MySQL database insert program? - python

I have created a database to store NGS sequencing results. It consists of 17 tables that hold all of the information. The results are stored in spreadsheets, which I parse and store in variables using Python (2.7), and then use the Python package MySQLdb to insert the data into the database. I mainly use functions to obtain the information I need in variables, then write a loop in which I call each function, followed by a 'try:' statement to do the insert. Here is a simple example:
def sample_processor(file_path):
    my_file = open(file_path, 'r+')
    samples = []
    for line in my_file:
        # ...get info...
        samples.append(line[0])
    return samples

samples = sample_processor('path/to/file')

for sample in samples:
    try:
        samsql = "INSERT IGNORE INTO sample (sample_id, screening) VALUES ("
        samsql = samsql + "'" + sample + "','" + sam_screen_dict.get(sample) + "')"
        # cursor and db come from the MySQLdb connection set up earlier
        cursor.execute(samsql)
        db.commit()
    except Exception as e:
        db.rollback()
        print("Something went wrong inserting data into the sample table: %s" % (e))
*sam_screen_dict is a dictionary I made from another function.
This is a simple table that I upload into, but many of the tables rely on several different dictionaries to make sure the correct results are uploaded. However, I was wondering whether there would be a more robust way to do this using a class.
For example, my sample_id has an associated screening attribute in the sample table, so this is easy to do with one dictionary. But I also have more complex junction tables, such as the one in which the sample_id, experiment_id and found mutation are stored alongside other data. Would it be a good idea to create a class for such a table, inheriting from a simple 'sample' class? That way I would always know that the results being inserted are for the correct sample/experiment, etc.
Also, using classes, could I write rules for each attribute so that if the source spreadsheet is for some reason incorrect, the row will not be inserted into the database?
E.g.: sample_id is in the format A123/16, so the class would check that the first character is 'A' and that sample_id[-3] is always '/'. I know I could write these checks into functions, but I feel it would take so much space and time to write so many 'if' statements that storing them once in a class would be a lot better.
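To illustrate the kind of thing I mean, here is a rough, untested sketch of how I imagine such a class might look (the SampleRecord name and the exact checks are only placeholders):
class SampleRecord(object):
    """Hypothetical wrapper that validates a sample before it is inserted."""
    def __init__(self, sample_id, screening):
        # e.g. sample_id must look like 'A123/16'
        if not sample_id.startswith('A') or sample_id[-3] != '/':
            raise ValueError("Malformed sample_id: %s" % sample_id)
        self.sample_id = sample_id
        self.screening = screening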
Has anybody done anything similar, using classes to validate their variables before they reach the insert stage and an error is raised?
I am new to Python classes and understand only the basics, and I'm still trying to get to grips with them, so a point in the right direction would be great, as would any help on how to actually write a Python class that would make for a more robust database insertion program.

With 17 tables, you would end up with roughly 17 classes.
Instead, consider using a simple existing script: webpy's db module,
https://github.com/webpy/webpy/blob/master/web/db.py; you would only need to modify a little code.
Then you can use the web.py db API (http://webpy.org/docs/0.3/api#web.db) to finish your job.
Hope it's useful for you.
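For example, a minimal sketch of what an insert through web.py's db API might look like (the connection settings and values here are only illustrative):
import web

# hypothetical connection settings
db = web.database(dbn='mysql', db='ngs_results', user='user', pw='password')
db.insert('sample', sample_id='A123/16', screening='screened')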

Related

Refactorable database queries

Say I have category = "foo" and a NoSQL query = {"category": category}. Whenever I rename the variable category during refactoring, I need to manually change it inside the query as well.
In Python 3.8+ I'm able to get the variable name as a string via the variable itself.
Now I could use query = {f"{category=}".split("=")[0]: category}. Refactoring the variable name then changes the query too. This applies to any database queries or statements (SQL etc.).
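For reference, here is the trick in runnable form (Python 3.8+); the values are only illustrative:
category = "foo"
# f"{category=}" renders as "category='foo'", so splitting on "=" recovers the variable name
query = {f"{category=}".split("=")[0]: category}
print(query)  # {'category': 'foo'}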
Would this be bad practice? Not just concerning Python but any language where this is possible.
Would this be bad practice?
Yes, the names of local variables do not need to correlate with the fields in data stores.
You should be able to retrieve a record and filter on its fields with any Python variable, no matter its name or whether it's nested in a larger data structure.
In pseudocode:
connection = datastore.connect(...)
# passing a string directly
connection.fetch({"category": "fruit"})
# passing a string variable
category_to_fetch = "vegetable"
connection.fetch({"category": category_to_fetch})
# something more exotic like a previous list of records
r = [("fish",)]
connection.fetch({"category": r[0][0]})
# or even a premade filter dictionary
filter = {"category": "meat"}
connection.fetch(filter)

How to convert strings to arrays and join on another table in sql

At my work, I have two SQL tables: one is called jobs, with string attributes job and code; the other is called skills, with string attributes code and skill.
job    code
---    ----
j1     s0001,s0003
j2     s0002,20003
j3     s0003,s0004
code     skills
-----    ------
s0001    python programming language
s0002    oracle java
s0003    structured query language sql
s0004    microsoft excel
What my boss wants me to do is: Take values from the attribute code in jobs, split the string into an array, join this array on the codes (from skills table) and return the query in the format of job skills like:
job    skills
---    ------
j1     python programming language,structured query language sql
At this point, I'd just like to know (A) if this is possible and (B) if there is a preferred alternative to this approach. I've listed my Python solution, using dictionaries, below to illustrate the concept:
jobs = {'j1': 's0001,s0003',
        'j2': 's0002,20003',
        'j3': 's0003,s0004'}

skills = {'s0001': 'python programming language',
          's0002': 'oracle java',
          's0003': 'structured query language sql',
          's0004': 'microsoft excel'}

job_skills = {k: [] for k in jobs.keys()}

for j, s in jobs.items():
    for code, skill in skills.items():
        for i in s.split(','):
            if i == code:
                job_skills[j].append(skill)

for k, v in job_skills.items():
    job_skills[k] = ','.join(v)
And the output:
{'j1': 'python programming language,structured query language sql',
 'j2': 'oracle java',
 'j3': 'structured query language sql,microsoft excel'}
The real crux of this problem is that there aren't just 4 different skills in our data. Our company's data includes ~5000 skills. My boss would greatly like to avoid creating a table with 5000 attributes, 1 for each skill; he believes the above approach will result in simpler queries, with potentially better memory management.
I'm still pretty new to SQL, and technically only use SQLite3 anyway, so the best I can probably do is a Python solution. I'll tell you how I would solve it, and hopefully someone can come along and improve it, because doing things purely in SQL is vastly faster than going through Python.
I'm going to assume that this is SQLite, because you tagged Python. If it's not, there are probably ways to convert the database to the .db format so you can use this solution if you prefer it.
I'm assuming that conn is your connection to the database, conn = sqlite3.connect(your_database_path), or a cursor for it. I don't use cursors, but it's almost certainly better practice to use them.
First, I would fetch the 'skills' table and convert it to a dict. I would do so with:
skills_array = conn.execute("""SELECT * FROM skills""")
skills_dict = dict()
# replace i with something else; I just did it so that I could use 'skill' as a variable
for i in skills_array:
    # skills_array is an iterator of tuples, so the first position is the code
    # and the second position is the skill itself
    code = i[0]
    skill = i[1]
    skills_dict[code] = skill
There are probably better ways to do this. If it's important, I recommend researching them, but if it's a one-time thing this will work just fine. All this is doing is giving an easy way to look up a skill given a code. You could do this dozens of ways. You didn't mention it being a particularly large database, so this should be fine.
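For example, one shortcut, assuming the skills table really does have just those two columns in that order, is to build the dict straight from the query result:
skills_dict = dict(conn.execute("""SELECT * FROM skills"""))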
Before the next part, something should be mentioned about SQLite. It has very limited table modifying mechanics-- something I coincidentally found out about today. The recommended method is just to create a new table instead of trying to finagle with an old one. But there are easy ways to modify them with SQLiteBrowser-- something I highly recommend you use. At the very least it's much easier to view info in it for me, and it's available on all the important OS's.
Second, we need to combine the jobs table and the skills dict. There are much better ways to go about it, but I chose the easy approach: splitting the jobs.code column on commas and going from there. I'll also create a new table and insert directly into it.
conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills text)""")
conn.commit()
job_array = conn.execute("""SELECT * FROM jobs""")
for i in job_array:
job = i[0]
skill = i[1]
for code in skill.split(","):
skill.replace(code, skills_dict[code])
conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, skill,))
conn.commit()
And to combine it all...
import sqlite3

conn = sqlite3.connect(your_database_path)

skills_array = conn.execute("""SELECT * FROM skills""")
skills_dict = dict()
# replace i with something else; I just did it so that I could use 'skill' as a variable
for i in skills_array:
    # skills_array is an iterator of tuples, so the first position is the code
    # and the second position is the skill itself
    code = i[0]
    skill = i[1]
    skills_dict[code] = skill

conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills TEXT)""")
conn.commit()

job_array = conn.execute("""SELECT * FROM jobs""")
for i in job_array:
    job = i[0]
    skill = i[1]
    for code in skill.split(","):
        # str.replace returns a new string, so the result has to be reassigned;
        # .get leaves any unknown code untouched
        skill = skill.replace(code, skills_dict.get(code, code))
    conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, skill))
conn.commit()
To explain a little further, in case you or someone else is confused by the job_array for loop:
Splitting skills allows you to see each individual code, meaning that all you have to do is replace every instance of the code being looked up with the corresponding skill.
And that's it. There may still be a mistake or two in the above code, so I would back up your database/tables before trying it, but this should work. One thing that you might find helpful is context managers, which would make it far more Pythonic; a minimal sketch is shown below. If you plan to use this consistently (for some strange reason), refactoring for speed and readability may also be prudent.
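For example, a context-manager version of the write portion might look like this (same assumptions and skills_dict as above; the with block commits the transaction on success and rolls back if an exception is raised, though it does not close the connection):
with sqlite3.connect(your_database_path) as conn:
    conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills TEXT)""")
    for job, codes in conn.execute("""SELECT * FROM jobs""").fetchall():
        for code in codes.split(","):
            codes = codes.replace(code, skills_dict.get(code, code))
        conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, codes))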
I would also like to believe that there's an SQLite only approach, since this is exactly what databases are made for.
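One partial step in that direction (not pure SQL, but it pushes the join and the concatenation into SQLite) would be to normalise the comma-separated codes into a junction table first. A rough, untested sketch, assuming the column names shown in the question's sample tables (code in jobs; code and skills in skills):
jobs_rows = conn.execute("""SELECT * FROM jobs""").fetchall()
conn.execute("""CREATE TABLE job_codes (job TEXT, code TEXT)""")
for job, codes in jobs_rows:
    # one row per (job, code) pair
    conn.executemany("""INSERT INTO job_codes VALUES (?, ?)""",
                     [(job, c) for c in codes.split(",")])
conn.commit()
combined = conn.execute("""
    SELECT j.job, GROUP_CONCAT(s.skills, ',')
    FROM job_codes AS j
    JOIN skills AS s ON s.code = j.code
    GROUP BY j.job
""").fetchall()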
Hope this helps. If it did, let me know. :>
P.S. If you're confused by something/want more explanation feel free to comment.

python SPARQL query RESULTS BINDINGS: IF statement on BINDINGs value?

Using Python with SPARQLWrapper, JSON, urllib2 & cgi. I had trouble passing a working SPARQL query with some NULL values to Python, so I populated the blanks with a literal and will try to filter them at the output. I have this results section example:
for result in results["results"]["bindings"]:
    project = result["project"]["value"].encode('utf-8')
    filename = result["filename"]["value"].encode('utf-8')
    url = result["url"]["value"].encode('utf-8')
...and I print each %s. Is there a way to filter a value, i.e., IF VALUE NE "string" THEN PRINT? Or is there another workaround? I'm at the tail end of a small project; I know I need a better wrapper, I just need to get these results filtered before I can move on. Thanks very much in advance...
I'm one of the developers of the SPARQLWrapper library, and this question has already been answered on the mailing list.
Regarding optional values in the original query, the result set will simply come with no values for those variables. The problem is that we would need to parse the query to populate such missing entries, and we want to avoid such parsing; therefore you need to check for them yourself to avoid runtime KeyError problems.
Usually I use code like:
for result in results["results"]["bindings"]:
    party = result["party"]["value"] if ("party" in result) else None

How to query a set of objects and return a set of an object-specific attribute in SQLAlchemy/Elixir?

Suppose that I have a table like:
class Ticker(Entity):
    ticker = Field(String(7))
    tsdata = OneToMany('TimeSeriesData')
    staticdata = OneToMany('StaticData')
How would I query it so that it returns a set of Ticker.ticker?
I dug into the docs and it seems like select() is the way to go. However, I am not too familiar with the SQLAlchemy syntax. Any help is appreciated.
ADDED: My ultimate goal is to have a set of current tickers such that, when a new ticker is not in the set, it will be inserted into the database. I am just learning how to create a database and SQL in general. Any thought is appreciated.
Thanks. :)
Not sure what you're after exactly but to get an array with all 'Ticker.ticker' values you would do this:
[instance.ticker for instance in Ticker.query.all()]
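(If you literally need a set, as the question asks, the same query can simply be wrapped in set():)
existing_tickers = set(instance.ticker for instance in Ticker.query.all())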
What you really want is probably the Elixir getting started tutorial - it's good so take a look!
UPDATE 1: Since you have a database, the best way to find out if a new potential ticker needs to be inserted or not is to query the database. This will be much faster than reading all tickers into memory and checking. To see if a value is there or not, try this:
Ticker.query.filter_by(ticker=new_ticker_value).first()
If the result is None you don't have it yet. So all together,
if Ticker.query.filter_by(ticker=new_ticker_value).first() is None:
    Ticker(ticker=new_ticker_value)
    session.commit()

What's the most efficient way to insert thousands of records into a table (MySQL, Python, Django)

I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.
Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.
[("hello", 3, 4), ("cat", 5, 3), ...]
I need to insert all these tuples into the table (assume I have verified that none of these strings already appear in the database). For clarification, I'm using InnoDB, and I have an auto-incrementing primary key for this table; the string is not the PK.
My code currently iterates through this list, for each tuple creates a Python model object with the appropriate values, and calls ".save()", something like so:
@transaction.commit_on_success
def save_data_elements(input_list):
    for (s, i1, i2) in input_list:
        entry = DataElement(string=s, number1=i1, number2=i2)
        entry.save()
This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.
For example, I could generate SQL statements, each containing an INSERT command for 100 tuples ("hard-coded" into the SQL), and execute them, but I don't know if that will improve anything.
Do you have any suggestion to optimize such a process?
Thanks
You can write the rows to a file in the format
"field1", "field2", .. and then use LOAD DATA to load them
data = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
f = open('data.txt', 'w')
f.write(data)
f.close()
Then execute this:
LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
Reference
For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you can convert the data into the format that it expects, that will probably be the fastest way to get it into the table.
If you don't use LOAD DATA INFILE, as some of the other suggestions mention, two things you can do to speed up your inserts are:
Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB)
If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much, much faster than lots of individual INSERT statements.
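For illustration, a rough sketch of building such a statement with MySQLdb; the connection settings and the myapp_dataelement table/column names are only assumptions based on the DataElement model in the question:
import MySQLdb

# hypothetical connection settings
db = MySQLdb.connect(host="localhost", user="user", passwd="password", db="mydb")
cursor = db.cursor()

rows = [("hello", 3, 4), ("cat", 5, 3)]
# one INSERT with one "(%s, %s, %s)" clause per row
placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
sql = "INSERT INTO myapp_dataelement (string, number1, number2) VALUES " + placeholders
cursor.execute(sql, [value for row in rows for value in row])
db.commit()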
Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.
What format do you receive the data in? If it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html
This is unrelated to the actual load of data into the DB, but...
If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.
Just something else to consider.
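For instance, a minimal sketch using the standard library's threading module, reusing the save_data_elements() function from the question (error handling and thread lifecycle management omitted):
import threading

loader = threading.Thread(target=save_data_elements, args=(input_list,))
loader.start()
# the request can return immediately while the inserts run in the background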
I do not know the exact details, but you could use a JSON-style data representation and use it as fixtures or something similar. I saw something like this in the Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this one helps.
Hope you can work it out. I've just started learning Django, so I can only point you to resources.
