I have to run multiple SQL queries over entire tables and concatenate the results into one big DataFrame.
I have a dictionary where the key is a team name and the value is an abbreviation that serves as the prefix of the corresponding MySQL table:
engine = create_engine('mysql+mysqlconnector://%s:%s@%s/%s' % (mysql_user, mysql_password, mysql_host, mysql_dbname), echo=False, pool_recycle=1800)
mysql_conn = engine.connect()

team_dfs = []
nba_dict = {'New York Knicks': 'nyk',
            'Boston Celtics': 'bos',
            'Golden State Warriors': 'gsw'}

for name, abbr in nba_dict.items():
    query = f'''
        SELECT *
        FROM {abbr}_record
    '''
    df = pd.read_sql_query(query, mysql_conn)
    df['team_name'] = name
    team_dfs.append(df)

team_dfs = pd.concat(team_dfs)
Is there a better way to refactor this code and make it more efficient?
Your database layout, with a separate table for each team, is doomed to inefficiency whenever you need to retrieve data for more than one team at a time. You would be much better off putting all that data in one table, with a column identifying the team associated with each row.
Why is it inefficient? More tables mean more work, and more queries mean more work.
I suggest you push back, hard, on the designer of this database table structure. Its design is, bluntly, wrong.
If you must live with this structure, I suggest you create the following view. It will fake the single-table approach and give you your "gold layer". You get away with this because pro sports franchises don't come and go that often. You do this just once in your database.
CREATE OR REPLACE VIEW teams_record AS
SELECT 'nyk' AS team, nyk_record.* FROM nyk_record
UNION ALL
SELECT 'bos' AS team, bos_record.* FROM bos_record
UNION ALL
SELECT 'gsw' AS team, gsw_record.* FROM gsw_record
UNION ALL .... the other teams
Then you can do SELECT * FROM teams_record ORDER BY team to get all your data.
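On the Python side this then collapses to a single round trip. A minimal sketch, assuming the view above plus the mysql_conn and nba_dict from the question:

# One query against the view instead of one query per team
df = pd.read_sql_query("SELECT * FROM teams_record ORDER BY team", mysql_conn)

# Map the abbreviation back to the full team name
abbr_to_name = {v: k for k, v in nba_dict.items()}
df['team_name'] = df['team'].map(abbr_to_name)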
If your nba_dict is fixed, you can use UNION ALL to manually combine the tables in a single SQL query. Selecting the abbreviation as a literal column keeps track of which team each row came from, so the full name can be mapped back afterwards:
abbr = list(nba_dict.values())
query = f'''
SELECT '{abbr[0]}' AS abbr, t.* FROM {abbr[0]}_record t
UNION ALL
SELECT '{abbr[1]}' AS abbr, t.* FROM {abbr[1]}_record t
UNION ALL
SELECT '{abbr[2]}' AS abbr, t.* FROM {abbr[2]}_record t
'''
df = pd.read_sql_query(query, mysql_conn)

# Map the abbreviation back to the full team name
abbr_to_name = {v: k for k, v in nba_dict.items()}
df['team_name'] = df['abbr'].map(abbr_to_name)
I have a GUI interacting with my database, and my MySQL database has around 50 tables. I need to search each table for a value and return the field and key of the item in each table if it is found. I would like to search for partial matches, e.g. for the search value "test", both "Protest" and "Test123" would be matches. Here is my attempt:
def searchdatabase(self, event):
    print('Searching...')
    self.connect_mysql()  # Function to connect to database

    d_tables = []
    results_list = []  # I will store results here
    s_string = "test"  # Value I am searching

    self.cursor.execute("USE db")  # select the database
    self.cursor.execute("SHOW TABLES")
    for (table_name,) in self.cursor:
        d_tables.append(table_name)

    # Loop through tables list, get column names, and check if value is in the column
    for table in d_tables:
        # Get the columns
        self.cursor.execute(f"SELECT * FROM `{table}` WHERE 1=0")
        field_names = [i[0] for i in self.cursor.description]

        # Find Value
        for f_name in field_names:
            print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE {f_name} LIKE {s_string}"))

        print(table)
I get an error on print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE {f_name} LIKE {s_string}"))
Exception: (1054, "Unknown column 'test' in 'where clause'")
I use a similar insert query that works fine so I am not understanding what the issue is.
ex. insert_query = (f"INSERT INTO `{source_tbl}` ({query_columns}) VALUES ({query_placeholders})")
This is most likely because the search value is not quoted: MySQL parses the bare test as a column name, hence the "Unknown column 'test'" error. Quote the value (and use backticks, not quotes, around the column name).
Try:
print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE `{f_name}` LIKE '{s_string}'"))
Don’t insert user-provided data into SQL queries like this. It is begging for SQL injection attacks. Your database library will have a way of sending parameters to queries. Use that.
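For example, a minimal sketch with the DB-API cursor from the question: bind the search value as a parameter and add wildcards for partial matches. Table and column names cannot be bound as parameters, but here they come from SHOW TABLES and cursor.description rather than from user input.

search = f"%{s_string}%"  # wildcards so 'Protest' and 'Test123' match 'test'
query = f"SELECT * FROM `{table}` WHERE `{f_name}` LIKE %s"
self.cursor.execute(query, (search,))  # the driver quotes/escapes the value
print("RESULTS:", self.cursor.fetchall())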
The whole design is fishy. Normally, there should be no need to look for a string across several columns of 50 different tables. Admittedly, sometimes you end up in these situations because of reasons outside your control.
I have 2 tables in MySQL.
One has transactions, where each row has a debit account ID and a credit account ID. The second table contains the account name and a special number associated with each account ID. I want a SQL query that takes data from the transactions table and attaches the account name and account number from the second table.
I tried doing it with two queries: one gets the transactions and the second gets the account details, and then I iterate over the dataframe and assign everything one by one, which doesn't seem like a good idea:
query = "SELECT tr_id, tr_date, description, dr_acc, cr_acc, amount, currency, currency_rate, document, comment FROM transactions WHERE " \
"company_id = {} {} and deleted = 0 {} LIMIT {}, {}".format(
company_id, filter, sort, sn, en)
df = ncon.getDF(query)
df.insert(4, 'dr_name', '')
df.insert(6, 'cr_name', '')
data = tuple(list(set(df['dr_acc'].values.tolist() + df['cr_acc'].values.tolist())))
query = "SELECT account_number, acc_id, account_name FROM tb_accounts WHERE company_id = {} and deleted = 0 and acc_id in {}".format(
company_id, data)
df_accs = ncon.getDF(query)
for index, row in df_accs.iterrows():
acc = str(row['acc_id'])
ac = row['account_number']
nm = row['account_name']
indx = df.index[df['dr_acc'] == acc].tolist()
df.at[indx, 'dr_acc'] = ac
df.at[indx, 'dr_name'] = nm
indx = df.index[df['cr_acc'] == acc].tolist()
df.at[indx, 'cr_acc'] = ac
df.at[indx, 'cr_name'] = nm
What you're looking for, I think, is a SQL JOIN statement.
Taking a crack at writing a query that might work based on your code:
query = '''
SELECT transactions.tr_id,
       transactions.tr_date,
       transactions.description,
       transactions.dr_acc,
       transactions.cr_acc,
       transactions.amount,
       transactions.currency,
       transactions.currency_rate,
       transactions.document,
       transactions.comment,
       tb_accounts.account_number,
       tb_accounts.account_name
FROM transactions INNER JOIN tb_accounts ON tb_accounts.acc_id = transactions.dr_acc
WHERE
    transactions.company_id = {} AND
    tb_accounts.company_id = {} AND
    transactions.deleted = 0 AND
    tb_accounts.deleted = 0
ORDER BY transactions.tr_id
LIMIT 10;'''
The above query will, roughly, present query results with all the fields listed from the two tables for each pair of rows where the acc_id is the same.
NOTE, the query above will probably not have very good performance. SQL JOIN statements must be written with care, but I wrote it above in a way that's easy to understand, so as to illustrate the power of the JOIN.
You should as a matter of habit NEVER try to program something when you could use a join instead. As long as you take care to write a join properly so that it can be efficient, the MySQL engine will beat your python code for performance almost every time.
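Since each transaction references two accounts (debit and credit), a fuller version would join tb_accounts twice under different aliases and bind the values as parameters instead of formatting them into the string. A sketch, assuming the column names from the question and a plain DB-API/SQLAlchemy connection conn (not the asker's ncon wrapper):

query = '''
SELECT t.tr_id, t.tr_date, t.description,
       dr.account_number AS dr_acc, dr.account_name AS dr_name,
       cr.account_number AS cr_acc, cr.account_name AS cr_name,
       t.amount, t.currency, t.currency_rate, t.document, t.comment
FROM transactions t
LEFT JOIN tb_accounts dr ON dr.acc_id = t.dr_acc AND dr.deleted = 0
LEFT JOIN tb_accounts cr ON cr.acc_id = t.cr_acc AND cr.deleted = 0
WHERE t.company_id = %s AND t.deleted = 0
ORDER BY t.tr_id
LIMIT %s, %s
'''
df = pd.read_sql_query(query, conn, params=(company_id, sn, en))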
Alternatively, sort the two dataframes and use pandas merge to combine them:
df1 = df1.sort_values(['dr_acc'], ascending=True)
df2 = df2.sort_values(['acc_id'], ascending=True)
merge2df = pd.merge(df1, df2, how='outer',
left_on=['dr_acc'], right_on=['acc_id'])
(I assumed df1 is the result set of the first query and df2 is the result set of the second query.)
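Note that merging on dr_acc only resolves the debit side; to label both the debit and credit accounts you could merge twice, renaming the lookup columns each time. A sketch under that assumption (the new column names are illustrative):

# Resolve debit accounts
dr = df2.rename(columns={'acc_id': 'dr_acc', 'account_number': 'dr_number', 'account_name': 'dr_name'})
merged = pd.merge(df1, dr, how='left', on='dr_acc')

# Resolve credit accounts
cr = df2.rename(columns={'acc_id': 'cr_acc', 'account_number': 'cr_number', 'account_name': 'cr_name'})
merged = pd.merge(merged, cr, how='left', on='cr_acc')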
Or do it in a single SQL query:
'''SELECT tr_id, tr_date,
       description,
       dr_acc, cr_acc,
       amount, currency,
       currency_rate,
       document,
       comment,
       account_number, acc_id, account_name
FROM transactions
LEFT JOIN tb_accounts ON transactions.dr_acc = tb_accounts.acc_id'''
I have a large dataset with 50M+ records in a PostgreSQL database that requires massive calculations and inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially, as in each record of the 50M needs to be fetched separately, then another query (in the below example) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total) and the complex query (that returns a calculated value, example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query then save the result.
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query is written by our experienced DBA, so I trust that it is optimised.
Don't loop over records and call the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably all) of the data,
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
    FROM (
        SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
        FROM tt_tweets tw1
        WHERE 1=1
        AND tw1.in_reply_to_id > 0
        AND tw1.is_retweet = False
        AND tw1.did_resolve = False
        AND NOT EXISTS ( SELECT * FROM tweets nx
            WHERE nx.id = tw1.in_reply_to_id)
        AND NOT EXISTS ( SELECT * FROM tt_tweets nx
            WHERE nx.id = tw1.in_reply_to_id)
        UNION ALL
        SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
        FROM tweets tw2
        WHERE 1=1
        AND tw2.in_reply_to_id > 0
        AND tw2.is_retweet = False
        AND tw2.did_resolve = False
        AND NOT EXISTS ( SELECT * FROM tweets nx
            WHERE nx.id = tw2.in_reply_to_id)
        AND NOT EXISTS ( SELECT * FROM tt_tweets nx
            WHERE nx.id = tw2.in_reply_to_id)
        -- ORDER BY tw2.created_at DESC
    ) twx
    LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup is None:
            break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])            # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
you only select fields from the er table
so, the two LEFT JOINed tables could be omitted
if ex and ef contain multiple matching rows, the result set could be larger than just all the rows selected from er, resulting in duplicated er records
there is a GROUP BY present, but no aggregates are in the select list
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
-- where er.gg >= '123'
--   and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing to do is to work around it: create what's known as a summary table. This summary table, residing in the database, will hold the final joined dataset, so in your Python code you will just fetch/select data from it.
Another way is to use a materialized view.
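For illustration, a sketch of the materialized-view route with psycopg2; the view name and DSN are hypothetical, and the query simply reuses the join from the question:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Precompute the joined/calculated values once, inside the database
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS mv_calculated AS
        SELECT cr.gg, (ax.b + bf.f) / cr.d AS calculated_field
        FROM table_one cr
        LEFT JOIN table_two ax ON ax.x = cr.x
        LEFT JOIN table_three bf ON bf.x = cr.x
    """)

# Re-run whenever the underlying tables change
with conn, conn.cursor() as cur:
    cur.execute("REFRESH MATERIALIZED VIEW mv_calculated")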
I took @wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and the total run time dropped to minutes/~an hour.
To recap:
Database records are no longer fetched in the sequence mentioned earlier.
Calculations happen inside the database via a PostgreSQL function.
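For illustration only, a sketch of what such a set-based, in-database calculation could look like (the function, results table, and DSN here are hypothetical, not the asker's actual code):

import psycopg2

create_fn = """
CREATE OR REPLACE FUNCTION calc_all() RETURNS void AS $$
    INSERT INTO results (gg, calculated_field)
    SELECT cr.gg, (ax.b + bf.f) / cr.d
    FROM table_one cr
    LEFT JOIN table_two ax ON ax.x = cr.x
    LEFT JOIN table_three bf ON bf.x = cr.x;
$$ LANGUAGE sql;
"""

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(create_fn)
    cur.execute("SELECT calc_all()")  # one call instead of one query per record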
So I found a great script over at QuantStart that had a great walk-through on setting up my own securities database and loading in historical pricing information. However, I'm now trying to modify the script so that I can run it daily and have the latest stock quotes added.
I adjusted the initial data load to just download one week's worth of historical data, but I've been having issues writing the SQL statement to check whether a row already exists before adding it. Can anyone help me out with this? Here's what I have so far:
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    # Create the time now
    now = datetime.datetime.utcnow()

    # Amend the data to include the vendor ID and symbol ID
    daily_data = [(data_vendor_id, symbol_id, d[0], now, now,
                   d[1], d[2], d[3], d[4], d[5], d[6]) for d in daily_data]

    # Create the insert strings
    column_str = """data_vendor_id, symbol_id, price_date, created_date,
                 last_updated_date, open_price, high_price, low_price,
                 close_price, volume, adj_close_price"""
    insert_str = ("%s, " * 11)[:-2]
    final_str = "INSERT INTO daily_price (%s) VALUES (%s) WHERE NOT EXISTS (SELECT 1 FROM daily_price WHERE symbol_id = symbol_id AND price_date = insert_str[2])" % (column_str, insert_str)

    # Using the postgre connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
Some notes regarding your code above:
It's generally easier to defer to now() in pure SQL than to try in Python whenever possible. It avoids lots of potential pitfalls with timezones, library differences, etc.
If you construct a list of columns, you can dynamically generate a string of %s's based on its size, and don't need to hardcode the length into a repeated string which is then sliced.
Since it appears that insert_daily_data_into_db is meant to be called from within a loop on a per-stock basis, I don't believe you want to use executemany here, which would require a list of tuples and is very different semantically.
You were comparing symbol_id to itself in the sub select, instead of a particular value (which would mean it's always true).
To prevent possible SQL Injection, you should always interpolate values in the WHERE clause, including sub selects.
Note: I'm assuming that you're using psycopg2 to access Postgres, and that the primary key for the table is a tuple of (symbol_id, price_date). If not, the code below would need to be tweaked at least a bit.
With those points in mind, try something like this (untested, since I don't have your data, db, etc. but it is syntactically valid Python):
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    column_list = ["data_vendor_id", "symbol_id", "price_date", "created_date",
                   "last_updated_date", "open_price", "high_price", "low_price",
                   "close_price", "volume", "adj_close_price"]
    insert_list = ['%s'] * len(column_list)

    values_tuple = (data_vendor_id, symbol_id, daily_data[0], 'now()', 'now()', daily_data[1],
                    daily_data[2], daily_data[3], daily_data[4], daily_data[5], daily_data[6])

    final_str = """INSERT INTO daily_price ({0})
                   SELECT {1}
                   WHERE NOT EXISTS (SELECT 1
                                     FROM daily_price
                                     WHERE symbol_id = %s
                                       AND price_date = %s)""".format(', '.join(column_list),
                                                                      ', '.join(insert_list))

    # Using the postgre connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.execute(final_str, values_tuple + (values_tuple[1], values_tuple[2]))
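As a side note, on PostgreSQL 9.5+ the duplicate check can also be pushed into the INSERT itself with ON CONFLICT, which is a different technique than the NOT EXISTS sub-select above. A sketch, assuming a unique constraint on (symbol_id, price_date):

# Let the database skip rows that already exist for this symbol/date
final_str = """INSERT INTO daily_price ({0})
               VALUES ({1})
               ON CONFLICT (symbol_id, price_date) DO NOTHING""".format(
    ', '.join(column_list), ', '.join(insert_list))
cur.execute(final_str, values_tuple)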
I have a small database which is legacy from an almost defunct project. This database has a "Pacientes" (patients) table with individual personal data and a unique "Id" field, and an "Exames" (exams) table with some fields for each exam, one of these fields being "Id_Paciente".
What I want is, for a given patient (row from the "Pacientes" table), the exams (rows) from the "Exames" table whose "Id_Paciente" matches the "Id" of the given patient.
I am very beginner with SQL, so if the question is very naive, I apologize.
My working code is the following, but I am using a for loop while I would much rather have a SQL query (which is the very point of using databases, I think...):
#!/usr/bin/env python
# coding: utf-8
import os, sqlite3

conn = sqlite3.connect('BDdata.db3')
conn.row_factory = sqlite3.Row
c = conn.cursor()

c.execute('SELECT * FROM Exames')
exams = c.fetchall()

c.execute('SELECT * FROM Pacientes')
for row in c:
    Nome = row['nome']
    ID = row['Id']
    for exam in exams:                 # I would like to replace this loop
        if exam['Id_Paciente'] == ID:  # with a SQL query meaning
            print exam['File']
============================
An answer to a similar question seems to be what I want, but I have no idea how to do this in Python with the sqlite3 module, much less which parts of the expression are essential and which are incidental, or what the syntax structure is:
Selecting rows from a table by One a field from other table
SELECT i.prof_image
FROM profile_images i
WHERE cat_id = (select max(cat_id) from images_cat)
I think the following should do what you want:
...
c = conn.cursor()
c.execute('SELECT * FROM Pacientes')
for row in c.fetchall():
    Nome = row['nome']
    ID = row['Id']
    c.execute('SELECT File FROM Exames WHERE Id_Paciente=?', [ID])
    for exam in c:
        print exam['File']
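Alternatively, the per-patient query can collapse into a single JOIN, so the database does the matching in one pass (a sketch using the table and column names from the question):

c.execute('''SELECT p.nome, p.Id, e.File
             FROM Pacientes p
             JOIN Exames e ON e.Id_Paciente = p.Id''')
for row in c:
    print row['File']  # row['nome'] and row['Id'] are available too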