I have a Python query that retrieves data through an API. The data returned is a dictionary, and I want to save it in a sqlite3 database. There are two main columns ('scan', 'tests'). I'm only interested in the data inside these two columns, e.g. 'grade': 'D+', 'likelihood_indicator': 'MEDIUM'.
Any help is appreciated.
import pandas as pd
from httpobs.scanner.local import scan
import sqlite3
website_to_scan = 'digitalnz.org'
scan_site = scan(website_to_scan)
df = pd.DataFrame(scan_site)
print(scan_site)
print(df)
Results of print(scan_site):
Results of print(df) attached:
This depends on how you have set up your table in SQLite, but essentially you would write an INSERT INTO statement and pass the SQL string to the connection's execute() method in Python.
It's difficult to give a more precise answer (i.e. code) because you haven't shown your connection variable. Let's imagine you already have your SQLite database set up with a connection:
connection_variable.execute("""INSERT INTO table_name
(column_name1, column_name2) VALUES (value1, value2);""")
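To make the INSERT above concrete, here is a minimal sketch using only the standard library. The dictionary is a hypothetical stand-in for what scan() returns (only the two keys quoted in the question come from the source), and the table name `scans` and its columns are assumptions made for illustration. The values are passed as `?` parameters rather than formatted into the SQL string.

```python
import sqlite3

# Hypothetical stand-in for the dictionary returned by scan(); only the two
# keys quoted in the question are taken from the source.
scan_site = {
    "scan": {"grade": "D+", "likelihood_indicator": "MEDIUM"},
    "tests": {},
}


def save_scan(connection, site, result):
    """Insert the grade and likelihood indicator for one scanned site."""
    scan = result["scan"]
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "CREATE TABLE IF NOT EXISTS scans "
            "(site TEXT, grade TEXT, likelihood_indicator TEXT)"
        )
        connection.execute(
            "INSERT INTO scans (site, grade, likelihood_indicator) "
            "VALUES (?, ?, ?)",
            (site, scan.get("grade"), scan.get("likelihood_indicator")),
        )


connection = sqlite3.connect(":memory:")  # pass a file path to persist the data
save_scan(connection, "digitalnz.org", scan_site)
rows = connection.execute("SELECT * FROM scans").fetchall()
print(rows)
```

The same pattern should extend to values under 'tests', provided you pick out the individual fields you want before inserting, since SQLite cannot store a nested dictionary directly.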
I am trying to access tables from a database using Python. I found some code on this website: https://rnacentral.org/help/public-database
import psycopg2
import psycopg2.extras

def main():
    conn_string = "host='hh-pgsql-public.ebi.ac.uk' dbname='pfmegrnargs' user='reader' password='NWDMCE5xdipIjRrp'"
    conn = psycopg2.connect(conn_string)
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

    # retrieve a list of RNAcentral databases
    query = "SELECT * FROM rnc_database"
    cursor.execute(query)
    for row in cursor:
        print(row)

if __name__ == "__main__":
    main()
When I run this code, I get back a list of databases:
I want to access tables from one of these databases, but I don't know what the schema for those tables is or what the values in each returned list represent. I have been looking at 'PostgreSQL to Python' resources, but all of them are about accessing tables when you already know the names of the tables and the columns within them. Is there code that lets me retrieve the table names from the database?
Thank you
Edit: sorry, I thought I had linked the website before.
The dataset you want to use has a schema diagram here: https://rnacentral.org/help/public-database
For general-purpose exploration I would use a tool such as https://dbeaver.io/. It will show you all the schemas in the database, the tables inside each schema, and so forth. The DBeaver settings to connect to your database would look like this:
If you want to keep using a Python script to explore the database, this SQL query
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema';
should help you.
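A rough sketch of how that query could be driven from Python follows. The helper names `list_tables` and `list_columns` are made up for this example, and the second query against information_schema.columns is an addition not in the answer above. Both helpers only assume a DB-API cursor such as the one psycopg2 returns.

```python
# Same catalog query as above, restricted to the two columns of interest.
TABLES_QUERY = """
SELECT schemaname, tablename
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
      schemaname != 'information_schema'
ORDER BY schemaname, tablename;
"""

# Column names and types for one table, in their declared order.
COLUMNS_QUERY = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position;
"""


def list_tables(cursor):
    """Return (schema, table) pairs for every non-system table."""
    cursor.execute(TABLES_QUERY)
    return [tuple(row) for row in cursor.fetchall()]


def list_columns(cursor, schema, table):
    """Return (column_name, data_type) pairs for one table."""
    cursor.execute(COLUMNS_QUERY, (schema, table))
    return [tuple(row) for row in cursor.fetchall()]
```

With the cursor from the question you would call `list_tables(cursor)` first, then pass one of the returned pairs to `list_columns(cursor, schema, table)` to see what each column holds.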
I am a beginner in programming, so please excuse me if the question is stupid.
See the code below. It takes two values (price, link) from a CSV file named headphones_master_data.csv and writes them into a MySQL table. The date is written to the table along with the data.
There are 900 rows in the file. In the writing part, my_cursor.execute(sql, val) is executed 900 times (once per row).
That got me thinking about whether there are better ways to write the data. I came up with two ideas:
1 - Convert all the lists (price, link) into a dictionary and write the dictionary, so that my_cursor.execute(sql, val) is executed just once.
2 - Convert the lists into a data frame and write that to the database, so that the write happens just once.
Which method is best? Are there any drawbacks to writing the data only once? More importantly, am I thinking about the optimization correctly?
import time

import pandas as pd
import pymysql

data = pd.read_csv("headphones-master_data.csv")  # read the CSV file into a DataFrame named data
link_list = data['Product_url'].tolist()  # take the URL column from data and turn it into a list
price_list = data['Sale_price'].tolist()
crawled_date = time.strftime('%Y-%m-%d')  # generate the date in a format compatible with MySQL
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='passme123##$',
                             db='hpsize')  # connection object holding the database details
my_cursor = connection.cursor()  # cursor object to communicate with the database
for i in range(len(link_list)):
    link = link_list[i]
    price = price_list[i]
    sql = "INSERT INTO comparison (link, price, crawled_date) VALUES (%s, %s, %s)"  # SQL statement with three placeholders
    val = link, price, crawled_date  # the values to be substituted into the SQL statement
    my_cursor.execute(sql, val)  # execute the insert through the cursor
connection.commit()  # commit and make the inserts permanent
my_cursor.execute("SELECT * from comparison")  # load the table contents to verify the insert
result = my_cursor.fetchall()
for i in result:
    print(i)
connection.close()
The best way, in my opinion, is to load the data into a DataFrame and then use the .to_sql method to save it to your MySQL database.
This method takes an argument (method='multi') that passes multiple rows in a single INSERT statement, which is usually much faster than inserting row by row. This works if your database supports multi-row inserts.
Read more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
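If you would rather stay with a plain cursor, a related option (not the to_sql approach described above) is the DB-API executemany() call, which hands the driver the whole batch at once. The sketch below uses sqlite3 so that it runs without a server, and the two sample rows are hypothetical. With pymysql the placeholders would be %s instead of ?.

```python
import sqlite3
import time

# Hypothetical sample rows standing in for the link and price lists.
link_list = ["https://example.com/a", "https://example.com/b"]
price_list = [19.99, 24.50]
crawled_date = time.strftime("%Y-%m-%d")

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE comparison (link TEXT, price REAL, crawled_date TEXT)"
)

# Build one parameter tuple per row, then hand the whole batch to the driver.
rows = [(link, price, crawled_date) for link, price in zip(link_list, price_list)]
with connection:  # a single transaction, committed on success
    connection.executemany(
        "INSERT INTO comparison (link, price, crawled_date) VALUES (?, ?, ?)",
        rows,
    )

inserted = connection.execute("SELECT COUNT(*) FROM comparison").fetchone()[0]
print(inserted)
```

Either way, the main saving tends to come from fewer round trips and a single commit, so the loop in the question is not wrong, just slower for large files.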
Is there any way to do an SQL update-where from a dataframe without iterating through each row? I have a PostgreSQL database, and to update a table in the database from a dataframe I would use psycopg2 and do something like:
import psycopg2

con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()
for index, row in df.iterrows():
    sql = 'update table set column = %s where column = %s'
    cur.execute(sql, (row['whatver'], row['something']))
con.commit()
On the other hand, if I'm either reading a table from SQL or writing an entire dataframe to SQL (with no update-where), then I would just use pandas and SQLAlchemy. Something like:
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('table', engine, if_exists='append')
It's great having a 'one-liner' with to_sql. Isn't there something similar for doing an update-where from pandas to PostgreSQL? Or is iterating through each row, as I've done above, the only way to do it? Isn't iterating through each row inefficient?
Consider a temp table which would be an exact replica of your final table, cleaned out with each run:
from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('temp_table', engine, if_exists='replace')
sql = """
UPDATE final_table AS f
SET col1 = t.col1
FROM temp_table AS t
WHERE f.id = t.id
"""
with engine.begin() as conn:  # TRANSACTION
    conn.execute(text(sql))
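The same staging-table idea can be tried locally without a PostgreSQL server. This is only a sketch: it uses sqlite3 in place of PostgreSQL, the rows are hypothetical, and it swaps UPDATE ... FROM for a correlated subquery because older SQLite versions do not support UPDATE ... FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE final_table (id INTEGER PRIMARY KEY, col1 TEXT);
    INSERT INTO final_table VALUES (1, 'old-a'), (2, 'old-b'), (3, 'old-c');
    CREATE TABLE temp_table (id INTEGER PRIMARY KEY, col1 TEXT);
    """
)

# Hypothetical new values; in the answer above these come from df.to_sql().
updates = [(1, "new-a"), (3, "new-c")]

with conn:  # one transaction for the load and the update
    conn.execute("DELETE FROM temp_table")  # clean out the staging table
    conn.executemany("INSERT INTO temp_table (id, col1) VALUES (?, ?)", updates)
    # Correlated subquery instead of UPDATE ... FROM, which older SQLite lacks.
    conn.execute(
        """
        UPDATE final_table
        SET col1 = (SELECT t.col1 FROM temp_table AS t
                    WHERE t.id = final_table.id)
        WHERE id IN (SELECT id FROM temp_table)
        """
    )

result = conn.execute("SELECT id, col1 FROM final_table ORDER BY id").fetchall()
print(result)
```

Rows whose id is missing from the staging table are left untouched, which is the behaviour the WHERE clause in the PostgreSQL version gives you as well.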
It looks like you are using some external data stored in df for the conditions on updating your database table. If it is possible, why not just do a one-line SQL update?
If you are working with a smallish database (where loading the whole table into a pandas DataFrame isn't going to kill you), then you can conditionally update the DataFrame after loading it with read_sql. You can then pass the keyword argument if_exists="replace" to to_sql to replace the database table with the updated one.
import pandas

df = pandas.read_sql("select * from your_table;", engine)
# update information (update your_table set column = "new value" where column = "old value")
# still may need to iterate for many old value/new value pairs
df.loc[df['column'] == "old value", "column"] = "new value"
# send data back to sql
df.to_sql("your_table", engine, if_exists="replace")
Pandas is a powerful tool in which limited SQL support was just a small feature at first. As time goes by, people are trying to use pandas as their only database interface. I don't think pandas was ever meant to be the be-all and end-all of database interaction, but a lot of people are working on new features all the time. See: https://github.com/pandas-dev/pandas/issues
I have so far not seen a case where the pandas sql connector can be used in any scalable way to update database data. It may have seemed like a good idea to build one, but really, for operational work it just does not scale.
What I would recommend is to dump your entire dataframe as CSV using
df.to_csv('filename.csv', encoding='utf-8')
Then load the CSV into the database using COPY for PostgreSQL or LOAD DATA INFILE for MySQL.
If nothing else changes the table in question while the data is being manipulated by pandas, you can load directly into the table.
If there are concurrency issues, you will have to load the data into a staging table and then update your primary table from it.
In the latter case, your primary table needs a datetime column recording when each row was last modified, so that you can determine whether your pandas changes are the latest or whether the database changes should remain.
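For PostgreSQL, the CSV does not have to touch the disk. The sketch below builds the CSV in memory and feeds it to COPY through psycopg2's copy_expert() method. The helper names and the idea of passing plain row tuples are assumptions for this example, so treat it as a starting point rather than a finished loader.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialise rows to an in-memory CSV file, ready for COPY ... FROM STDIN."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)  # rewind so the reader starts at the first row
    return buffer


def copy_rows(cursor, table, columns, rows):
    """Bulk-load rows through a psycopg2 cursor's copy_expert() method.

    `table` and `columns` are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    statement = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    cursor.copy_expert(statement, rows_to_csv_buffer(rows))
    return statement
```

Starting from a DataFrame, you could instead write into the buffer with df.to_csv(buffer, index=False, header=False) and pass that buffer to copy_expert() yourself. Remember to commit on the connection afterwards.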
I was wondering why you don't update the df first based on your condition and then store the df in the database. You could use if_exists='replace' to write to the same table.
In case the column names have not changed I prefer removing all rows and then appending the data to the now empty table. Otherwise, dependent views will have to be regenerated as well:
from sqlalchemy import create_engine
from sqlalchemy import MetaData

# pw is assumed to hold the database password
engine = create_engine(f'postgresql://postgres:{pw}@localhost:5432/table')
# Get main table and delete all rows
# without deleting the table
meta = MetaData()
meta.reflect(bind=engine)
table = meta.tables['table']
del_st = table.delete()
with engine.begin() as conn:  # commits the delete on exit
    res = conn.execute(del_st)
# Insert new data
df.to_sql('table', engine, if_exists='append', index=False)
I tried the first answer and found it did not work so well, so I changed some parts to handle all cases, using pandas + SQLAlchemy to do the update.
from sqlalchemy import text

def update_to_sql(self, table_name, key_name):
    # df is assumed to be a DataFrame available in the enclosing scope
    a = []
    self.table = table_name
    self.primary_key = key_name
    for col in df.columns:
        if col == self.primary_key:
            continue
        a.append("f.{col}=t.{col}".format(col=col))
    df.to_sql('temporary_table', self.sql_engine, if_exists='replace', index=False)
    # note: UPDATE ... INNER JOIN ... SET is MySQL syntax
    update_stmt_1 = "UPDATE {final_table} AS f".format(final_table=self.table)
    update_stmt_2 = " INNER JOIN (SELECT * FROM temporary_table) AS t ON t.{primary_key}=f.{primary_key} ".format(primary_key=self.primary_key)
    update_stmt_3 = "SET "
    update_stmt_4 = ", ".join(a)
    update_stmt_5 = update_stmt_1 + update_stmt_2 + update_stmt_3 + update_stmt_4 + ";"
    print(update_stmt_5)
    with self.sql_engine.begin() as cnx:
        cnx.execute(text(update_stmt_5))
Here is an approach that I found to be somewhat clean. This utilizes sqlalchemy. It only updates one column at a time but can easily be generalized.
from sqlalchemy import MetaData, Table
from sqlalchemy.orm import sessionmaker

def dataframe_update(df, table, engine, primary_key, column):
    md = MetaData()
    table = Table(table, md, autoload_with=engine)
    session = sessionmaker(bind=engine)()
    for _, row in df.iterrows():
        session.query(table).filter(table.columns[primary_key] == row[primary_key]).update({column: row[column]})
    session.commit()
I'm working with two separate databases (Oracle and PostgreSQL) where a substantial amount of reference data is in Oracle that I'll frequently need to reference while doing analysis on data stored in Postgres (I have no control over this part). I'd like to be able to directly transfer the results of a query from Oracle to a table in Postgres (and vice versa). The closest I've gotten is something like the code below using Pandas as a go between, but it's quite slow.
import pandas as pd
import psycopg2
import cx_Oracle
import sqlalchemy
# CONNECT TO POSTGRES
db__PG = psycopg2.connect(dbname="username", user="password")
eng_PG = sqlalchemy.create_engine('postgresql://username:password@localhost:5432/db_one')
# CONNECT TO ORACLE
db__OR = cx_Oracle.connect('username', 'password', 'SID_Name')
eng_OR = sqlalchemy.create_engine('oracle+cx_oracle://username:password@10.1.2.3:4422/db_two')
#DEFINE QUERIES
query_A = "SELECT * FROM tbl_A"
query_B = "SELECT * FROM tbl_B"
# CREATE PANDAS DATAFRAMES FROM QUERY RESULTS
df_A = pd.read_sql_query(query_A, db__OR)
df_B = pd.read_sql_query(query_B, db__PG)
# CREATE TABLES FROM PANDAS DATAFRAMES
df_A.to_sql(name='tbl_a', con=eng_PG, if_exists='replace', index=False)
df_B.to_sql(name='tbl_b', con=eng_OR, if_exists='replace', index=False)
I think there has to be a more efficient, direct way to do this (like database links for moving data across different databases in Oracle), but I'm fairly new to Python and have generally worked either directly in SQL or in SAS previously. I've been searching for ways to create a table directly from a Python cursor result or a SQLAlchemy ResultProxy, but haven't had much luck.
Suggestions?
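One direction that avoids building a DataFrame is to stream rows from one DB-API cursor straight into another in batches, using fetchmany() and executemany(). The sketch below is only an illustration of that idea: `transfer_rows` is a made-up helper, two in-memory SQLite databases stand in for Oracle and Postgres, and the target table is assumed to exist already. With real drivers the placeholder style differs (psycopg2 uses %s, cx_Oracle uses :1, :2), and some column types may need converting on the way.

```python
import sqlite3


def transfer_rows(source_conn, target_conn, select_sql, insert_sql, batch_size=1000):
    """Stream rows from one DB-API connection to another in batches."""
    source_cur = source_conn.cursor()
    target_cur = target_conn.cursor()
    source_cur.execute(select_sql)
    total = 0
    while True:
        batch = source_cur.fetchmany(batch_size)
        if not batch:  # an empty list means the result set is exhausted
            break
        target_cur.executemany(insert_sql, batch)
        total += len(batch)
    target_conn.commit()
    return total


# Two in-memory SQLite databases stand in for Oracle and Postgres.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE tbl_A (id INTEGER, name TEXT)")
source.executemany(
    "INSERT INTO tbl_A VALUES (?, ?)", [(i, f"row{i}") for i in range(5)]
)
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE tbl_a (id INTEGER, name TEXT)")

copied = transfer_rows(
    source,
    target,
    "SELECT id, name FROM tbl_A",
    "INSERT INTO tbl_a (id, name) VALUES (?, ?)",
    batch_size=2,
)
print(copied)
```

Only one batch is held in memory at a time, so the batch size is the knob to tune. Whether this beats the pandas round trip on your data is something you would have to measure.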