How to store and retrieve data in Python

I have been looking for answers to my questions, but haven't found a definitive answer. I am new to Python, MySQL, and data science, so any advice is appreciated.
What I want to be able to do is:
use python to pull daily close data from quandl for n securities
store the data in a database
retrieve, clean, and normalize the data
run regressions on different pairs
write the results to a csv file
The pseudocode below shows in a nutshell what I want to be able to do.
The questions I have are:
How do I store the quandl data in MySQL?
How do I retrieve that data from MySQL? Do I store it into lists and use statsmodels?
tickers = ['AAPL', 'FB', 'GOOG', 'YHOO', 'XRAY', 'CSCO']
qCodes = ['WIKI/' + x for x in tickers]
for i in range(0, len(qCodes)):
    ADD TO MYSQLDB -> Quandl.get(qCodes[i], collapse='daily', start_date=start, end_date=end)
for x in range(0, len(qCodes)-1):
    for y in range(x+1, len(qCodes)):
        # GET FROM MYSQLDB -> x, y
        # clean(x, y)
        # normalize(x, y)
        # write to csv file -> regression(x, y)

There is a nice library called MySQLdb in Python that helps you interact with MySQL databases. For the following to execute successfully, you need your Python shell open and the MySQL server running.
How do I store the quandl data in MySQL?
import MySQLdb
# Setting up the connection (user_name, password, db_name are your own credentials)
db = MySQLdb.connect("localhost", user_name, password, db_name)
cursor = db.cursor()
# Inserting a record into the EMPLOYEE table
sql = """INSERT INTO EMPLOYEE(FIRST_NAME, LAST_NAME, AGE, SEX, INCOME)
         VALUES('Steven', 'Karpinski', '50', 'M', '43290')"""
try:
    cursor.execute(sql)
    db.commit()
except:
    db.rollback()
db.close()
I did this with hard-coded values. For the Quandl data, create a schema in the same way and insert the rows in a loop.
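For instance, here is a loop-based sketch (my own, with assumptions): it presumes a table daily_close(ticker, trade_date, close_price) already exists, that tickers, qCodes, start and end are defined as in the question, and that Quandl.get() returns a pandas DataFrame indexed by date with a 'Close' column.
import MySQLdb
import Quandl
db = MySQLdb.connect("localhost", user_name, password, db_name)
cursor = db.cursor()
insert_sql = "INSERT INTO daily_close (ticker, trade_date, close_price) VALUES (%s, %s, %s)"
for ticker, q_code in zip(tickers, qCodes):
    # pull the daily close series for this security
    frame = Quandl.get(q_code, collapse='daily', start_date=start, end_date=end)
    for trade_date, row in frame.iterrows():
        cursor.execute(insert_sql, (ticker, trade_date.date(), float(row['Close'])))
db.commit()
db.close()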
How do I retrieve that data from MySQL? Do I store it into lists and use statsmodels?
For data retrieval, you execute a SELECT in much the same way as the command above.
sql2 = """SELECT * FROM EMPLOYEE;
"""
try:
cursor.execute(sql2)
db.commit()
except:
db.rollback()
result = cursor.fetchall()
The result variable now contains the result of the query in sql2, as a tuple of tuples (one tuple per row).
So, now you can convert those tuples into a data structure of your choice.
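For example (a small sketch): you can turn the tuple of tuples into a pandas DataFrame, which statsmodels accepts directly; the column names come from cursor.description.
import pandas as pd
cols = [desc[0] for desc in cursor.description]   # column names reported by the cursor
df = pd.DataFrame(list(result), columns=cols)     # rows from fetchall() as a DataFrame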

Quandl has a python package that makes interacting with the site trivial.
From Quandl's python page:
import Quandl
mydata = Quandl.get("WIKI/AAPL")
By default, Quandl's package returns a pandas DataFrame. You can use pandas to manipulate/clean/normalize your data as you see fit, and use pandas to upload the data directly to a SQL database:
import sqlalchemy as sql
engine = sql.create_engine('mysql://name:blah@location/testdb')
mydata.to_sql('db_table_name', engine, if_exists='append')
To get the data back from your database, you can also use Pandas:
import pandas as pd
import sqlalchemy as sql
engine = sql.create_engine('mysql://name:blah@location/testdb')
query = sql.text('''select * from quandltable''')
mydata = pd.read_sql_query(query, engine)
After using statsmodels to run your analyses, you can write the results out with pandas' df.to_csv() method or numpy's savetxt() function.
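For example, here is a rough sketch of the pairwise regression and CSV step from the question's pseudocode; it assumes a DataFrame called prices with one column of daily closes per ticker (however you load it back from the database), and uses a simple z-score normalization as a placeholder for your own cleaning.
import pandas as pd
import statsmodels.api as sm
results = []
for x in range(len(tickers) - 1):
    for y in range(x + 1, len(tickers)):
        pair = prices[[tickers[x], tickers[y]]].dropna()       # drop missing rows
        pair = (pair - pair.mean()) / pair.std()               # z-score normalization
        model = sm.OLS(pair[tickers[y]], sm.add_constant(pair[tickers[x]])).fit()
        results.append({'x': tickers[x], 'y': tickers[y],
                        'beta': model.params[tickers[x]], 'r2': model.rsquared})
pd.DataFrame(results).to_csv('pair_regressions.csv', index=False)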

Related

Retrieve data from python script and save to sqlite database

I have a Python script which retrieves data through an API. The data returned is a dictionary. I want to save the data in a sqlite3 database. There are two main columns ('scan', 'tests'). I'm only interested in the data inside these two columns, e.g. 'grade': 'D+', 'likelihood_indicator': 'MEDIUM'.
Any help is appreciated.
import pandas as pd
from httpobs.scanner.local import scan
import sqlite3
website_to_scan = 'digitalnz.org'
scan_site = scan(website_to_scan)
df = pd.DataFrame(scan_site)
print(scan_site)
print(df)
Results of print(scan_site) and print(df) were attached as screenshots (not reproduced here).
This depends on how you have set up your table in sqlite, but essentially you would write an INSERT INTO SQL statement and call connection.execute() in Python, passing your SQL string as an argument.
It's difficult to give a more precise answer for your question (i.e. code) because you haven't declared the connection variable. Let's imagine you already have your sqlite DB set up with the connection:
connection_variable.execute("""INSERT INTO table_name
(column_name1, column_name2) VALUES (value1, value2);""")
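For example, a sketch under assumptions: the table name scan_results and its columns are made up here, and the scan() result is assumed to be a dict whose 'scan' entry holds 'grade' and 'likelihood_indicator' (as the example values in the question suggest). Using ? placeholders keeps the insert parameterized.
import sqlite3
conn = sqlite3.connect('scans.db')
conn.execute("""CREATE TABLE IF NOT EXISTS scan_results
                (website TEXT, grade TEXT, likelihood_indicator TEXT)""")
scan_data = scan_site['scan']                      # the 'scan' part of the API result
conn.execute("INSERT INTO scan_results (website, grade, likelihood_indicator) VALUES (?, ?, ?)",
             (website_to_scan, scan_data.get('grade'), scan_data.get('likelihood_indicator')))
conn.commit()
conn.close()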

POSTGRESQL Queries using Python

I am trying to access tables from a database using python. There was some code on the website: https://rnacentral.org/help/public-database
import psycopg2
import psycopg2.extras

def main():
    conn_string = "host='hh-pgsql-public.ebi.ac.uk' dbname='pfmegrnargs' user='reader' password='NWDMCE5xdipIjRrp'"
    conn = psycopg2.connect(conn_string)
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    # retrieve a list of RNAcentral databases
    query = "SELECT * FROM rnc_database"
    cursor.execute(query)
    for row in cursor:
        print(row)

main()
When I run this code, I get back a list of databases.
I want to access tables from one of these databases, but I don't know what the schema for those tables is or what the values in each returned row represent. I have been looking at 'postgresql to python' resources, but all of them are about accessing tables when you already know the names of the tables and the columns within. Is there code for how I can access the table names from the database?
Thank you
Edit: sorry, I thought I linked the website before
The dataset you want to use has a schema diagram here: https://rnacentral.org/help/public-database
For general exploration I would use a tool like https://dbeaver.io/ ; it will show you all the schemas in the db, the tables inside each schema, and so forth. (The original answer included a screenshot of the DBeaver connection settings.)
If you want to keep using a Python script to explore the db, this SQL query
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema';
should help you.
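For example, continuing the script from the question (a small sketch that reuses the cursor created in main()):
cursor.execute("""
    SELECT schemaname, tablename
    FROM pg_catalog.pg_tables
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
""")
for row in cursor:
    print(row['schemaname'], row['tablename'])   # list every user-defined table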

Inserting data to MySQL: one by one, or all the data at once? Which is the better way to do it?

I am a beginner in programming, so please excuse me if the question is stupid.
See the code below. It takes two values (price, link) from a csv file named headphones_master_data.csv and writes the data into a MySQL table. When writing the data, the date is also written to the table.
There are 900 rows in the file. In the writing part, the my_cursor.execute(sql, val) function is executed 900 times (once per row).
It got me thinking, and I wanted to see if there are other ways to improve the data-writing part. I came up with two ideas:
1 - Convert the lists (price, link) into a dictionary and write the dictionary, so the my_cursor.execute(sql, val) function is executed just once.
2 - Convert the lists into a data frame and write that into the database, so the write happens just once.
Which method is the best one? Are there any drawbacks to writing the data only once? More importantly, am I thinking about the optimization correctly?
import time
import pandas as pd
import pymysql
data = pd.read_csv("headphones-master_data.csv")   # read the csv file into a DataFrame
link_list = data['Product_url'].tolist()            # take the url values and turn them into a list
price_list = data['Sale_price'].tolist()
crawled_date = time.strftime('%Y-%m-%d')            # generate a date format compatible with MySQL
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='passme123##$',
                             db='hpsize')            # connection object holding the database details
my_cursor = connection.cursor()                      # cursor object to communicate with the database
for i in range(len(link_list)):
    link = link_list[i]
    price = price_list[i]
    sql = "INSERT INTO comparison (link, price, crawled_date) VALUES (%s, %s, %s)"  # parameterized insert
    val = (link, price, crawled_date)                # the values to bind to the query
    my_cursor.execute(sql, val)                      # execute the insert for this row
connection.commit()                                  # commit and make the inserts permanent
my_cursor.execute("SELECT * FROM comparison")        # load the table contents to verify the insert
result = my_cursor.fetchall()
for i in result:
    print(i)
connection.close()
The best way, in my opinion, is to load the data into a DataFrame and then use the .to_sql method to save it in your MySQL database.
This method takes an argument method='multi' which allows you to insert all the rows in the DataFrame in one go, within a very short time. This works if your database supports multi-row inserts.
Read more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
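A sketch of that approach, reusing the csv columns from the question; the connection string credentials are placeholders and the table name mirrors the question.
import time
import pandas as pd
from sqlalchemy import create_engine
data = pd.read_csv("headphones-master_data.csv")
df = pd.DataFrame({'link': data['Product_url'],
                   'price': data['Sale_price'],
                   'crawled_date': time.strftime('%Y-%m-%d')})
engine = create_engine('mysql+pymysql://user:password@localhost/hpsize')  # placeholder credentials
df.to_sql('comparison', engine, if_exists='append', index=False, method='multi')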

How to update all the values of column in a db2 table using python [duplicate]

Is there any way to do an SQL update-where from a dataframe without iterating through each line? I have a PostgreSQL database, and to update a table in the db from a dataframe I would use psycopg2 and do something like:
con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()
for index, row in df.iterrows():
    sql = 'update table set column = %s where column = %s'
    cur.execute(sql, (row['whatver'], row['something']))
con.commit()
But on the other hand, if I'm either reading a table from sql or writing an entire dataframe to sql (with no update-where), then I would just use pandas and sqlalchemy. Something like:
engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('table', engine, if_exists='append')
It's great just having a one-liner using to_sql. Isn't there something similar to do an update-where from pandas to postgresql? Or is the only way to do it by iterating through each row like I've done above? Isn't iterating through each row an inefficient way to do it?
Consider a temp table which would be an exact replica of your final table, cleaned out with each run:
engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('temp_table', engine, if_exists='replace')
sql = """
UPDATE final_table AS f
SET col1 = t.col1
FROM temp_table AS t
WHERE f.id = t.id
"""
with engine.begin() as conn:   # TRANSACTION
    conn.execute(sql)
It looks like you are using some external data stored in df for the conditions on updating your database table. If it is possible, why not just do a one-line SQL update?
If you are working with a smallish database (where loading the whole table into a Python dataframe isn't going to kill you), then you can definitely update the dataframe conditionally after loading it with read_sql. Then you can use the keyword argument if_exists="replace" to replace the DB table with the new, updated table.
df = pandas.read_sql("select * from your_table;", engine)
# update information (update your_table set column = "new value" where column = "old value")
# you may still need to iterate for many old value/new value pairs
df.loc[df['column'] == "old value", "column"] = "new value"
# send the data back to sql
df.to_sql("your_table", engine, if_exists="replace")
Pandas is a powerful tool, where limited SQL support was just a small feature at first. As time goes by people are trying to use pandas as their only database interface software. I don't think pandas was ever meant to be an end-all for database interaction, but there are a lot of people working on new features all the time. See: https://github.com/pandas-dev/pandas/issues
I have so far not seen a case where the pandas sql connector can be used in any scalable way to update database data. It may have seemed like a good idea to build one, but really, for operational work it just does not scale.
What I would recommend is to dump your entire dataframe as CSV using
df.to_csv('filename.csv', encoding='utf-8')
Then load the CSV into the database using COPY for PostgreSQL or LOAD DATA INFILE for MySQL.
If you do not make other changes to the table in question while the data is being manipulated by pandas, you can just load into the table.
If there are concurrency issues, you will have to load the data into a staging table that you then use to update your primary table from.
In the latter case, your primary table needs a datetime column which tells you when its latest modification was, so you can determine whether your pandas changes are the latest or whether the database changes should remain.
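For example, a rough sketch of the PostgreSQL COPY route; the staging table name, its columns, and the final UPDATE are placeholders, and the CSV is assumed to line up with the staging table's columns (note that df.to_csv writes the index by default).
import psycopg2
con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()
with open('filename.csv', 'r', encoding='utf-8') as f:
    # bulk-load the CSV into the staging table
    cur.copy_expert("COPY staging_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
# then update the primary table from the staging table
cur.execute("""UPDATE your_table AS t
               SET col1 = s.col1
               FROM staging_table AS s
               WHERE t.id = s.id""")
con.commit()
con.close()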
I was wondering: why don't you update the df first, based on your equation, and then store the df to the database? You could use if_exists='replace' to store it in the same table.
In case the column names have not changed, I prefer removing all rows and then appending the data to the now empty table. Otherwise, dependent views would have to be regenerated as well:
from sqlalchemy import create_engine
from sqlalchemy import MetaData
engine = create_engine(f'postgresql://postgres:{pw}@localhost:5432/table')
# Get main table and delete all rows
# without deleting the table
meta = MetaData(engine)
meta.reflect(engine)
table = meta.tables['table']
del_st = table.delete()
conn = engine.connect()
res = conn.execute(del_st)
# Insert new data
df.to_sql('table', engine, if_exists='append', index=False)
I tried the first answer and found it did not work so well, so I changed some parts to handle all situations, using pandas + sqlalchemy to do the update.
def update_to_sql(self, table_name, key_name):
    a = []
    self.table = table_name
    self.primary_key = key_name
    for col in df.columns:
        if col == self.primary_key:
            continue
        a.append("f.{col}=t.{col}".format(col=col))
    df.to_sql('temporary_table', self.sql_engine, if_exists='replace', index=False)
    update_stmt_1 = "UPDATE {final_table} AS f".format(final_table=self.table)
    update_stmt_2 = " INNER JOIN (SELECT * FROM temporary_table) AS t ON t.{primary_key}=f.{primary_key} ".format(primary_key=self.primary_key)
    update_stmt_3 = "SET "
    update_stmt_4 = ", ".join(a)
    update_stmt_5 = update_stmt_1 + update_stmt_2 + update_stmt_3 + update_stmt_4 + ";"
    print(update_stmt_5)
    with self.sql_engine.begin() as cnx:
        cnx.execute(update_stmt_5)
Here is an approach that I found to be somewhat clean. This utilizes sqlalchemy. It only updates one column at a time but can easily be generalized.
from sqlalchemy import MetaData, Table
from sqlalchemy.orm import sessionmaker

def dataframe_update(df, table, engine, primary_key, column):
    md = MetaData(engine)
    table = Table(table, md, autoload=True)
    session = sessionmaker(bind=engine)()
    for _, row in df.iterrows():
        session.query(table).filter(table.columns[primary_key] == row[primary_key]).update({column: row[column]})
    session.commit()

How to create a table in one DB from the result of a query from a different DB in Python

I'm working with two separate databases (Oracle and PostgreSQL) where a substantial amount of reference data is in Oracle that I'll frequently need to reference while doing analysis on data stored in Postgres (I have no control over this part). I'd like to be able to directly transfer the results of a query from Oracle to a table in Postgres (and vice versa). The closest I've gotten is something like the code below using Pandas as a go-between, but it's quite slow.
import pandas as pd
import psycopg2
import cx_Oracle
import sqlalchemy
# CONNECT TO POSTGRES
db__PG = psycopg2.connect(dbname="username", user="password")
eng_PG = sqlalchemy.create_engine('postgresql://username:password@localhost:5432/db_one')
# CONNECT TO ORACLE
db__OR = cx_Oracle.connect('username', 'password', 'SID_Name')
eng_OR = sqlalchemy.create_engine('oracle+cx_oracle://username:password@10.1.2.3:4422/db_two')
#DEFINE QUERIES
query_A = "SELECT * FROM tbl_A"
query_B = "SELECT * FROM tbl_B"
# CREATE PANDAS DATAFRAMES FROM QUERY RESULTS
df_A = pd.read_sql_query(query_A, db__OR)
df_B = pd.read_sql_query(query_B, db__PG)
# CREATE TABLES FROM PANDAS DATAFRAMES
df_A.to_sql(name='tbl_a', con=eng_PG, if_exists='replace', index=False)
df_B.to_sql(name='tbl_b', con=eng_OR, if_exists='replace', index=False)
I think there would have to be a more efficient, direct way to do this (like database links for moving data across different DBs in Oracle), but I'm fairly new to Python and have generally worked either directly in SQL or in SAS previously. I've been searching for ways to create a table directly from a python cursor result or a SQLAlchemy ResultProxy, but haven't had much luck.
Suggestions?
