Update PostgreSQL database with daily stock prices in Python

So I found a great script over at QuantState that had a great walk-through on setting up my own securities database and loading in historical pricing information. However, I'm now trying to modify the script so that I can run it daily and have the latest stock quotes added.
I adjusted the initial data load to just download one week's worth of historical prices, but I've been having issues with writing the SQL statement to check whether the row already exists before adding it. Can anyone help me out with this? Here's what I have so far:
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    # Create the time now
    now = datetime.datetime.utcnow()

    # Amend the data to include the vendor ID and symbol ID
    daily_data = [(data_vendor_id, symbol_id, d[0], now, now,
                   d[1], d[2], d[3], d[4], d[5], d[6]) for d in daily_data]

    # Create the insert strings
    column_str = """data_vendor_id, symbol_id, price_date, created_date,
                    last_updated_date, open_price, high_price, low_price,
                    close_price, volume, adj_close_price"""
    insert_str = ("%s, " * 11)[:-2]
    final_str = "INSERT INTO daily_price (%s) VALUES (%s) WHERE NOT EXISTS (SELECT 1 FROM daily_price WHERE symbol_id = symbol_id AND price_date = insert_str[2])" % (column_str, insert_str)

    # Using the postgres connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)

Some notes regarding your code above:
It's generally easier to defer to now() in pure SQL than to compute the timestamp in Python whenever possible. It avoids lots of potential pitfalls with timezones, library differences, etc.
If you construct a list of columns, you can dynamically generate a string of %s's based on its size, and don't need to hardcode the length into a repeated string which is then sliced.
Since it appears that insert_daily_data_into_db is meant to be called from within a loop on a per-stock basis, I don't believe you want to use executemany here, which would require a list of tuples and is very different semantically.
You were comparing symbol_id to itself in the sub-select, instead of to a particular value (which would make the condition always true).
To prevent possible SQL injection, you should always pass the values used in the WHERE clause, including sub-selects, as bound query parameters rather than formatting them into the SQL string.
Note: I'm assuming that you're using psycopg2 to access Postgres, and that the primary key for the table is a tuple of (symbol_id, price_date). If not, the code below would need to be tweaked at least a bit.
With those points in mind, try something like this (untested, since I don't have your data, db, etc. but it is syntactically valid Python):
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a tuple of daily OHLC data (with adj_close and volume)
    and adds it to the database. Appends the vendor ID and symbol ID
    to the data."""
    column_list = ["data_vendor_id", "symbol_id", "price_date", "created_date",
                   "last_updated_date", "open_price", "high_price", "low_price",
                   "close_price", "volume", "adj_close_price"]
    insert_list = ['%s'] * len(column_list)
    # Defer to Postgres for the two timestamp columns instead of computing them in Python
    insert_list[3] = insert_list[4] = 'now()'
    values_tuple = (data_vendor_id, symbol_id, daily_data[0], daily_data[1],
                    daily_data[2], daily_data[3], daily_data[4], daily_data[5],
                    daily_data[6])
    final_str = """INSERT INTO daily_price ({0})
                   SELECT {1}
                   WHERE NOT EXISTS (SELECT 1
                                     FROM daily_price
                                     WHERE symbol_id = %s
                                       AND price_date = %s)""".format(', '.join(column_list),
                                                                      ', '.join(insert_list))

    # Using the Postgres connection, carry out an INSERT INTO for this symbol
    with con:
        cur = con.cursor()
        cur.execute(final_str, values_tuple + (symbol_id, daily_data[0]))
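Since the corrected function handles one row for one symbol per call, the calling code might look roughly like the sketch below. The names symbols, vendor_id and get_daily_data_for are hypothetical placeholders for however you fetch your symbol list and vendor data; they are not part of the original script.
# Sketch of the per-symbol driver loop (placeholder names, not from the original script).
for sym_id, ticker in symbols:              # e.g. [(1, 'AAPL'), (2, 'MSFT'), ...]
    for row in get_daily_data_for(ticker):  # row: (price_date, open, high, low, close, volume, adj_close)
        insert_daily_data_into_db(vendor_id, sym_id, row)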

Related

Illegal Variable Name/Number when Passing in Python List

I'm trying to run SQL statements through Python by passing in a list, in this case a list of dates, since I want to run multiple SELECT queries and return their results.
I've tested this by passing in integers, but when trying to pass in a date I get an ORA-01036 error: illegal variable name/number. I'm using an Oracle DB.
cursor = connection.cursor()
date = ["'01-DEC-21'", "'02-DEC-21'"]
sql = "select * from table1 where datestamp = :date"
for item in date:
    cursor.execute(sql, id=item)
    res = cursor.fetchall()
    print(res)
Any suggestions to make this run?
You can't name a bind variable date; it's an illegal name. Also, the named argument you pass to cursor.execute should match the bind variable name. Try something like:
sql = "select * from table1 where datestamp = :date_input"
for item in date:
    cursor.execute(sql, date_input=item)
    res = cursor.fetchall()
    print(res)
Some recommendations and warnings about your approach:
You should not depend on your default NLS date setting when binding a string (e.g. "'01-DEC-21'") to a DATE column. (You probably also need to remove one set of quotes.)
You should avoid fetching data in a loop if you can fetch it in one query (using an IN list).
Use a prepared statement.
Example
date = ['01-DEC-21', '02-DEC-21']
This generates the query that uses bind variables for your input list
in_list = ','.join([f" TO_DATE(:d{ind},'DD-MON-RR','NLS_DATE_LANGUAGE = American')" for ind, d in enumerate(date)])
sql_query = "select * from table1 where datestamp in ( " + in_list + " )"
The generated sql_query is
select * from table1 where datestamp in
( TO_DATE(:d0,'DD-MON-RR','NLS_DATE_LANGUAGE = American'), TO_DATE(:d1,'DD-MON-RR','NLS_DATE_LANGUAGE = American') )
Note that the IN list contains one bind variable for each member of your input list.
Note also the usage of TO_DATE with an explicit format mask and a fixed date language, to avoid problems with the interpretation of the month abbreviation (e.g. ORA-01843: not a valid month).
Now you can use the query to fetch the data in one pass
cur.prepare(sql_query)
cur.execute(None, date)
res = cur.fetchall()

Python MySQL search entire database for value

I have a GUI interacting with my database, and my MySQL database has around 50 tables. I need to search each table for a value and return the field and key of the item in each table where it is found. I would like to search for partial matches, e.g. if the search value is "test", then "Protest" and "Test123" would be matches. Here is my attempt.
def searchdatabase(self, event):
    print('Searching...')
    self.connect_mysql()  # Function to connect to database
    d_tables = []
    results_list = []  # I will store results here
    s_string = "test"  # Value I am searching for
    self.cursor.execute("USE db")  # select the database
    self.cursor.execute("SHOW TABLES")
    for (table_name,) in self.cursor:
        d_tables.append(table_name)
    # Loop through tables list, get column names, and check if value is in the column
    for table in d_tables:
        # Get the columns
        self.cursor.execute(f"SELECT * FROM `{table}` WHERE 1=0")
        field_names = [i[0] for i in self.cursor.description]
        # Find value
        for f_name in field_names:
            print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE {f_name} LIKE {s_string}"))
            print(table)
I get an error on print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE {f_name} LIKE {s_string}"))
Exception: (1054, "Unknown column 'test' in 'where clause'")
I use a similar insert query that works fine, so I don't understand what the issue is.
ex. insert_query = (f"INSERT INTO `{source_tbl}` ({query_columns}) VALUES ({query_placeholders})")
It may be because of the quoting you have missed while checking some columns (the column name needs backticks and the search value needs single quotes).
TRY:
print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE `{f_name}` LIKE '{s_string}'"))
Don’t insert user-provided data into SQL queries like this. It is begging for SQL injection attacks. Your database library will have a way of sending parameters to queries. Use that.
The whole design is fishy. Normally, there should be no need to look for a string across several columns of 50 different tables. Admittedly, sometimes you end up in these situations because of reasons outside your control.
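As a rough illustration of the parameter-passing advice above, a minimal sketch (assuming a MySQL driver such as mysql-connector-python that uses %s placeholders, and reusing the variables from the original loop):
# Let the driver quote the search value; only the table/column identifiers
# are formatted in, wrapped in backticks.
pattern = f"%{s_string}%"  # surrounding % gives the partial-match behaviour
query = f"SELECT * FROM `{table}` WHERE `{f_name}` LIKE %s"
self.cursor.execute(query, (pattern,))
for row in self.cursor.fetchall():
    results_list.append((table, f_name, row))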

Python and Pandas to Query API's and update DB

I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to update the existing table with any new API data.
At the moment, the way the query is working, I have a table that looks like this (screenshot not included here).
From this I am taking the suburbs of each state and copying them into a CSV for each different state.
Then, using this script, I am cleaning them into a list (the API needs %20 for any spaces):
#suburbs = ["want this", "want this (meh)", "this as well (nope)"]
suburb_cleaned = []
#dont_want = frozenset( ["(meh)", "(nope)"] )
for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()
    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)
    suburb_cleaned.append('%20'.join(cleaned_name))
Then I take the suburbs for each state and put them into this API to return a CSV:
timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT" + timestr + ".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)
    api_results[urbs] = pd.DataFrame(request.json())
    n = n + 1
    if n == y:
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y + 2
        continue
    print("made it through" + urbs)
    # print(request.json())
    # print(api_results)

dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then I add the states manually in Excel, and combine and clean the suburb names:
# use pd.concat
df = pd.concat([act, vic, nsw, SA, QLD, WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])

# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally I insert it into a DB:
engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)
Leading to this sort of output (screenshot not included here).
Now, this is a heck of a process. I would love to simplify it and have the database update only the values that are needed from the API, without this much complexity in getting the data.
I would love some helpful tips on achieving this goal. I'm thinking I could do an update on the MySQL database instead of an insert, or something like that? And with the querying of the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?
Generally, you wouldn't want to overwrite the mysql table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX for them. For example if your street and price values designate a unique entry, then in mysql you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
            "VALUES (%s, %s, %s, %s, %s, %s, %s) " \
            "ON DUPLICATE KEY UPDATE " \
            "state = VALUES(state), date = VALUES(date)"

con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
    cur = con.cursor()
    cur.executemany(final_str, insert_list)
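The insert_list above would be the rows to load. As a sketch only, it could be built from the combined DataFrame like this (the column names are assumed to match the table in the question and may need adjusting):
# Hypothetical: turn the combined DataFrame into the list of tuples expected
# by executemany, in the same column order as the INSERT statement above.
insert_list = list(
    df.reset_index()[['state', 'suburb', 'property_price_id', 'type',
                      'street', 'price', 'date']].itertuples(index=False, name=None))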
If the setup you are building is something for the longer term, I would suggest running 2 different processes in parallel:
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag that specifies that only API 1 has been called.
Process 2:
Run a query on the DB to obtain all records needed for API call 2, based on the binary/bit flag set in process 1. For the corresponding data, run call 2 and update the data back to the DB table based on the primary key.
Database: I would suggest adding a primary key as well as a bit flag (https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612) that records the status of the different API calls. The bit flag also helps you
- in case you want to double-check whether a specific API call has been made for a specific record or not;
- expand your project to additional API calls and still track the status of each API call at record level.
A rough sketch of the second process is shown after this list.
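A very rough sketch of what process 2 might look like. The table and column names (Price_historic, record_id, api2_done, extra_info) and the call_api_2 helper are placeholders, not from the question, and the connection con is assumed to be a DB-API connection as in the earlier snippet:
# Process 2 sketch: pick up rows that API 1 has loaded but API 2 has not yet enriched.
cur = con.cursor()
cur.execute("SELECT record_id, suburb FROM Price_historic WHERE api2_done = 0")
for record_id, suburb in cur.fetchall():
    extra = call_api_2(suburb)  # hypothetical helper wrapping the second API
    cur.execute(
        "UPDATE Price_historic SET extra_info = %s, api2_done = 1 WHERE record_id = %s",
        (extra, record_id))
con.commit()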

Python SQL cells interaction

I am using Python to try to gather the closing price for a couple of different time intervals, save it in a database and then calculate the change in the closing price. This is my code:
def database_populate(symbol, interval):
    base_url = "https://www.binance.com/api/v1"
    url_klines = "/klines"
    end_time = requests.get('{}/time'.format(base_url)).json()['serverTime']
    start_time = end_time - 360000
    kln = requests.get('{a}{b}?symbol={c}&interval={d}&startTime={e}&endTime={f}'.format(a = base_url, b = url_klines, c = symbol, d = interval, e = start_time, f = end_time)).json()
    db = sqlite3.connect('database.db')
    cursor = db.cursor()
    cr_db = """
        CREATE TABLE EOSBTC_symbol (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            EPOCH_TIME INTEGER NOT NULL,
            CLOSE_PRICE FLOAT,
            CHANGE FLOAT )
    """
    cursor.execute(cr_db)
    for i in range(len(kln)):
        lst = [kln[i][0], kln[i][4]]
        cursor.execute("""INSERT INTO EOSBTC_symbol (EPOCH_TIME, CLOSE_PRICE) VALUES (?, ?)""", (lst[0], lst[1]))
    db.commit()
    db.close()

database_populate("EOSBTC", "1m")
This populates the database with the closing price for a certain time period for the pair EOSBTC. I want to calculate the change in the closing price between two consecutive rows. Do I need to use the ID or the epoch time, or is there another, more elegant way? Just keep in mind that this DB will be updated continuously, so the ID and the EPOCH_TIME will change with time, and I want to calculate the CHANGE field immediately after I populate these cells from the Binance API.
This is the database content at the moment (screenshot not included here).
For example, for row 6 the CHANGE will be equal to 0.00082563 - 0.00082587, for row 5 it is 0.00082587 - 0.00082533, and so on.
If you need to calculate the change in closing price in Python and keep it only in the Python runtime, you can simply use a variable that holds the previous row's value.
If you want to store it in the DB, you can have a small procedure that does all the calculations and inserts the data, including the newly calculated difference.
If you want to retrieve the value from the DB every time, you might use something like a TOP clause (LIMIT in SQLite), depending on the RDBMS you are using.
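As a minimal sketch of the first suggestion, the insert loop from the question could compute the change with a previous-value variable while inserting (storing NULL for the very first row is just one possible choice):
prev_close = None
for candle in kln:
    epoch_time, close_price = candle[0], float(candle[4])
    change = None if prev_close is None else close_price - prev_close
    cursor.execute(
        "INSERT INTO EOSBTC_symbol (EPOCH_TIME, CLOSE_PRICE, CHANGE) VALUES (?, ?, ?)",
        (epoch_time, close_price, change))
    prev_close = close_price
db.commit()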

Most python(3)esque way to repeatedly SELECT from MySQL database

I have a CSV file of customer ids (CRM_id). I need to get their primary keys (an auto-increment int) from the customers table of the database. (I can't be assured of the integrity of the CRM_ids, so I chose not to make that the primary key.)
So:
customers = []
with open("CRM_ids.csv", 'r', newline='') as csvfile:
    customerfile = csv.DictReader(csvfile, delimiter=',', quotechar='"', skipinitialspace=True)
    # only one "CRM_id" field per row
    customers = [c for c in customerfile]
So far so good? I think this is the most pythonesque way of doing that (but happy to hear otherwise).
Now comes the ugly code. It works, but I hate appending to the list because that has to copy and reallocate memory on each loop, right? Is there a better way? (Pre-allocating plus enumerate to keep track of the index comes to mind, but maybe there's an even quicker/better way by being clever with the SQL so as not to do several thousand separate queries...)
cnx = mysql.connector.connect(user='me', password=sys.argv[1], host="localhost", database="mydb")
cursor = cnx.cursor()
select_customer = ("SELECT id FROM customers WHERE CRM_id = %(CRM_id)s LIMIT 1;")

c_ids = []
for row in customers:
    cursor.execute(select_customer, row)
    # note fetchone() returns a tuple, but the SELECTed set
    # only has a single column so we need to get this column with the [0]
    c_ids.extend(cursor.fetchall())
c_ids = [c[0] for c in c_ids]
Edit:
The purpose is to get the primary keys in a list so I can use them to allocate some other data from other CSV files into linked tables (the customer id primary key is a foreign key to these other tables, and the allocation algorithm changes, so it's best to have the flexibility to do the allocation in Python rather than hard-coding SQL queries). I know this sounds a little backwards, but the "client" only works with spreadsheets rather than an ERP/PLM, so I have to build the "relations" for this small app myself.
What about changing your query to get what you want?
crm_ids = ",".join("'{}'".format(c["CRM_id"]) for c in customers)
select_customer = "SELECT DISTINCT id FROM customers WHERE CRM_id IN (%s);" % crm_ids
MySQL should be fine with even a multi-megabyte query, according to the manual; if it gets to be a really long list, you can always break it up - two or three queries are guaranteed to be much faster than a few thousand.
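A minimal sketch of the same idea using driver-side parameters instead of string formatting (assuming mysql.connector as in the question; the placeholder string is built to match the number of ids):
# Build one IN (...) query with a %s placeholder per CRM id, and let the
# driver handle quoting/escaping of the values.
crm_id_values = [c["CRM_id"] for c in customers]
placeholders = ", ".join(["%s"] * len(crm_id_values))
query = "SELECT DISTINCT id FROM customers WHERE CRM_id IN ({})".format(placeholders)
cursor.execute(query, crm_id_values)
c_ids = [row[0] for row in cursor.fetchall()]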
How about storing your CSV in a dict instead of a list:
customers = [c for c in customerfile]
becomes:
customers = {c['CRM_id']: c for c in customerfile}
then select the entire xref in one query:
cursor.execute('SELECT id, CRM_id FROM customers')
and add the new row id as a new entry in the dict:
for row in cursor:
    customers[row[1]]['newid'] = row[0]
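If you still need the flat list of primary keys afterwards, it can be pulled back out of the dict (a one-line sketch):
c_ids = [c['newid'] for c in customers.values() if 'newid' in c]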
