Suppose I have a select roughly like this:
select instrument, price, date from my_prices;
How can I unpack the prices returned into a single dataframe with a series for each instrument and indexed on date?
To be clear: I'm looking for:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: ...
Data columns (total 2 columns):
inst_1 ...
inst_2 ...
dtypes: float64(1), object(1)
I'm NOT looking for:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: ...
Data columns (total 2 columns):
instrument ...
price ...
dtypes: float64(1), object(1)
...which is easy ;-)
You can pass a cursor object to the DataFrame constructor. For postgres:
import psycopg2
from pandas import DataFrame
conn = psycopg2.connect("dbname='db' user='user' host='host' password='pass'")
cur = conn.cursor()
cur.execute("select instrument, price, date from my_prices")
df = DataFrame(cur.fetchall(), columns=['instrument', 'price', 'date'])
Then set the index (set_index returns a new dataframe, so assign the result):
df = df.set_index('date', drop=False)
or directly:
df.index = df['date']
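If you would rather not hard-code the column list, the DB-API cursor.description attribute carries the column names of the last query. A minimal sketch, assuming an in-memory SQLite database as a stand-in for Postgres (the sample rows are made up for illustration):

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the real database (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_prices (instrument TEXT, price REAL, date TEXT)")
conn.executemany(
    "INSERT INTO my_prices VALUES (?, ?, ?)",
    [("inst_1", 10.0, "2020-01-01"), ("inst_2", 20.0, "2020-01-01")],
)

cur = conn.cursor()
cur.execute("SELECT instrument, price, date FROM my_prices")
# Each entry of cur.description is a 7-item sequence; item 0 is the column name.
columns = [col[0] for col in cur.description]
df = pd.DataFrame(cur.fetchall(), columns=columns)

# Convert the text dates and use them as the index.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
conn.close()
```

The same cursor.description lookup should work with a psycopg2 cursor, since it is part of the DB-API specification.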
Update: recent pandas have the following functions: read_sql_table and read_sql_query.
First create a db engine (a connection can also work here):
from sqlalchemy import create_engine
# see sqlalchemy docs for how to write this url for your database type:
engine = create_engine('mysql://scott:tiger@localhost/foo')
See sqlalchemy database urls.
pandas.read_sql_table:
table_name = 'my_prices'
df = pd.read_sql_table(table_name, engine)
pandas.read_sql_query:
df = pd.read_sql_query("SELECT instrument, price, date FROM my_prices;", engine)
The old answer referenced read_frame, which has been deprecated (see the version history of this question for that answer).
It often makes sense to read first and then transform the result to your requirements (such transformations are usually efficient and readable in pandas). In your example, you can pivot the result:
df.reset_index().pivot(index='date', columns='instrument', values='price')
Note: You can leave out the reset_index if you don't specify an index_col when reading.
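To make the read-then-pivot step concrete, here is a small self-contained sketch. It assumes an in-memory SQLite table in place of the real my_prices table, and the prices are invented sample values:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the real database (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_prices (instrument TEXT, price REAL, date TEXT)")
conn.executemany(
    "INSERT INTO my_prices VALUES (?, ?, ?)",
    [
        ("inst_1", 10.0, "2020-01-01"),
        ("inst_2", 20.0, "2020-01-01"),
        ("inst_1", 11.0, "2020-01-02"),
        ("inst_2", 21.0, "2020-01-02"),
    ],
)

# Read the long-format result; parse_dates turns the text column into datetimes.
long_df = pd.read_sql_query(
    "SELECT instrument, price, date FROM my_prices", conn, parse_dates=["date"]
)
conn.close()

# One column per instrument, indexed on date.
wide_df = long_df.pivot(index="date", columns="instrument", values="price")
```

Note that pivot raises an error if the same (date, instrument) pair occurs more than once; pivot_table with an aggregation function is the usual fallback in that case.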
This connects pandas to a remote PostgreSQL server:
# CONNECT TO POSTGRES USING PANDAS
import psycopg2 as pg
import pandas.io.sql as psql
# establish the connection with the postgres db
connection = pg.connect("host=192.168.0.1 dbname=db user=postgres")
# read the table from the postgres db
dataframe = psql.read_sql("SELECT * FROM DB.Table", connection)
import pandas as pd
import pandas.io.sql as sqlio
import psycopg2
conn = psycopg2.connect("host='{}' port={} dbname='{}' user={} password={}".format(host, port, dbname, username, pwd))
sql = "select count(*) from table;"
dat = sqlio.read_sql_query(sql, conn)
conn = None
import pandas as pd
import psycopg2
conn = psycopg2.connect("host='{}' port={} dbname='{}' user={} password={}".format(host, port, dbname, username, pwd))
sql = "select count(*) from table;"
dat = pd.read_sql_query(sql, conn)
conn = None
import pandas as pd
import psycopg2
conn = psycopg2.connect(user="",
                        password="",
                        host="",
                        port="",
                        database="")
sql = "select count(*) from table;"
dat = pd.read_sql_query(sql, conn)
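Setting conn = None, as in the snippets above, only drops the Python reference and leaves the actual close to garbage collection. A hedged sketch of an explicit alternative uses contextlib.closing; it is shown here with an in-memory SQLite connection as a stand-in, on the assumption that the same pattern applies to a psycopg2 connection (whose own `with conn:` block ends the transaction but does not close the connection):

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() guarantees conn.close() is called when the block exits,
# even if the query raises.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE prices (instrument TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?)",
        [("inst_1", 10.0), ("inst_2", 20.0)],
    )
    dat = pd.read_sql_query("SELECT count(*) AS n FROM prices", conn)

# The dataframe outlives the connection.
n_rows = int(dat.loc[0, "n"])
```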
Related
I am working with Python and trying to connect to Postgres. I created a table in my Postgres database in the staging schema.
create table staging.data( Name varchar, Age bigint);
then I try to connect and insert my dataframe data into this table:
import psycopg2
import pandas as pd
from sqlalchemy import create_engine
conn_string = 'postgresql://myuser:password@host/database_name'
db = create_engine(conn_string)
conn = db.connect()
# our dataframe
data = {'Name': ['Tom', 'dick', 'harry'],
        'Age': [22, 21, 24]}
# Create DataFrame
df = pd.DataFrame(data)
df.to_sql('staging.data', con=conn, if_exists='replace',
          index=False)
conn = psycopg2.connect(conn_string)
conn.autocommit = True
cursor = conn.cursor()
sql1 = '''select * from staging.data;'''
cursor.execute(sql1)
for i in cursor.fetchall():
    print(i)
conn.commit()
conn.close()
But the script finishes with no error message, and there is no data in my Postgres table.
Any idea what is going wrong?
Regards
I think the issue is that you are trying to use a schema other than public. Try passing in the schema name via the schema argument of to_sql() like this:
df.to_sql('data', con=conn, if_exists='replace', schema='staging', index=False)
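One way to see why the dotted name goes wrong: to_sql treats the whole string as the table name rather than splitting it into schema and table. The sketch below shows this with an in-memory SQLite connection as a stand-in; that Postgres behaves analogously (creating a table literally named "staging.data" in the default schema) is an assumption here, not something verified against the original setup:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "dick", "harry"], "Age": [22, 21, 24]})

# In-memory SQLite stands in for the real database (assumption for this demo).
conn = sqlite3.connect(":memory:")

# The dotted string is taken as the table name itself, not as schema + table.
df.to_sql("staging.data", conn, index=False)

table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
conn.close()
```

Here table_names holds a single table whose name contains the dot, which is why a later select from the real staging.data table comes back empty.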
I have been trying to insert data from a dataframe in Python into a table that already exists in SQL Server. The dataframe has 90K rows, and I am looking for the fastest way to insert the data into the table. I only have read, write and delete permissions on the server, and I cannot create any tables on it.
Below is the code that inserts the data, but it is very slow. Please advise.
import pandas as pd
import xlsxwriter
import pyodbc
df = pd.read_excel(r"Url path\abc.xlsx")
conn = pyodbc.connect('Driver={ODBC Driver 11 for SQL Server};'
                      'SERVER=Server Name;'
                      'Database=Database Name;'
                      'UID=User ID;'
                      'PWD=Password;'
                      'Trusted_Connection=no;')
cursor= conn.cursor()
#Deleting existing data in SQL Table:-
cursor.execute("DELETE FROM datbase.schema.TableName")
conn.commit()
#Inserting data in SQL Table:-
for index, row in df.iterrows():
    cursor.execute("INSERT INTO Table Name([A],[B],[C]) values (?,?,?)", row['A'], row['B'], row['C'])
conn.commit()
cursor.close()
conn.close()
To insert data much faster, try using sqlalchemy and df.to_sql. This requires you to create an engine with sqlalchemy; to speed things up, pass the option fast_executemany=True. Because the table already exists, pass if_exists='append' so that to_sql adds rows instead of raising an error.
import urllib.parse
import sqlalchemy
connect_string = urllib.parse.quote_plus(f'DRIVER={{ODBC Driver 11 for SQL Server}};Server=<Server Name>,<port>;Database=<Database name>')
engine = sqlalchemy.create_engine(f'mssql+pyodbc:///?odbc_connect={connect_string}', fast_executemany=True)
with engine.connect() as connection:
    df.to_sql(<table name>, connection, index=False, if_exists='append')
Here is the script and hope this works for you.
import pandas as pd
import pyodbc as pc
connection_string = "Driver=SQL Server;Server=localhost;Database={0};Trusted_Connection=Yes;"
cnxn = pc.connect(connection_string.format("DataBaseNameHere"), autocommit=True)
cur=cnxn.cursor()
df= pd.read_csv("your_filepath_and_filename_here.csv").fillna('')
query = 'insert into TableName({0}) values ({1})'
query = query.format(','.join(df.columns), ','.join('?' * len(df.columns)))
cur.fast_executemany = True
cur.executemany(query, df.values.tolist())
cnxn.close()
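The query-building step in the script above can be tried without a SQL Server instance. The following is a minimal sketch that assumes an in-memory SQLite database and a made-up three-column dataframe; fast_executemany is a pyodbc cursor attribute and has no SQLite equivalent, so only the executemany pattern itself is shown:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [0.1, 0.2, 0.3]})

# In-memory SQLite stands in for SQL Server (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (A INTEGER, B TEXT, C REAL)")

# Build "INSERT INTO TableName (A, B, C) VALUES (?, ?, ?)" from the dataframe.
columns = ", ".join(df.columns)
placeholders = ", ".join("?" * len(df.columns))
query = f"INSERT INTO TableName ({columns}) VALUES ({placeholders})"

# One executemany call sends all rows instead of one execute call per row;
# df.values.tolist() yields one plain list per row.
conn.executemany(query, df.values.tolist())
conn.commit()

inserted = conn.execute("SELECT COUNT(*) FROM TableName").fetchone()[0]
conn.close()
```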
This should do what you want...very generic example...
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=Excel-PC\SQLEXPRESS;'
r'DATABASE=NORTHWND;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
# commit once, after all rows have been sent
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))
Try that and post back if you have additional questions/issues/concerns.
Replace df.iterrows() with df.apply() for one thing. Remove the loop for something much more efficient.
Try populating a temp table with one or no indexes, then insert its contents into your real table all at once.
That might speed things up, because the indexes would not have to be updated after each insert.
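The staging-table idea can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database, and the table and index names (people, people_stage, idx_people_name) are hypothetical. Whether it is actually faster on SQL Server depends on the server and on your permissions:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Dick"], "Age": [22, 21]})

# In-memory SQLite stands in for SQL Server (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT, Age INTEGER)")
conn.execute("CREATE INDEX idx_people_name ON people (Name)")

# 1. Bulk-load the rows into an unindexed temporary table.
conn.execute("CREATE TEMP TABLE people_stage (Name TEXT, Age INTEGER)")
conn.executemany("INSERT INTO people_stage VALUES (?, ?)", df.values.tolist())

# 2. Move everything into the indexed table with one set-based statement.
conn.execute("INSERT INTO people SELECT Name, Age FROM people_stage")
conn.execute("DROP TABLE people_stage")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
conn.close()
```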
I created a table by fetching data from an API, storing it in a pandas dataframe, and writing it to MySQL with sqlalchemy.
I need to query the API every 4 hours to get new data.
The problem is that the API returns not only the new data but also the old rows that have already been imported into MySQL.
How can I import just the new data into the MySQL table?
So far I have retrieved the data from the API, stored it in a pandas object, created the connection to the MySQL db, and created a fresh table.
import requests
import json
import pandas as pd
myToken = 'xxx'
myUrl = 'somewebsite'
head = {'Authorization': 'token {}'.format(myToken)}
response = requests.get(myUrl, headers=head)
data = response.json()
#print(json.dumps(data, indent=4, sort_keys=True))
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0)
results = pd.json_normalize(data['results'])
results.rename(columns={'datastream.name': 'datastream_name',
                        'datastream.url': 'datastream_url',
                        'datastream.datastream_type_id': 'datastream_id',
                        'start': 'error_date'}, inplace=True)
results_final = pd.DataFrame([results.datastream_name,
                              results.datastream_url,
                              results.error_date,
                              results.datastream_id,
                              results.message,
                              results.type_label]).transpose()
from sqlalchemy import create_engine
from sqlalchemy import exc
engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
results_final.to_sql(name='error',con=con,if_exists='replace')
con.close()
The end goal is to insert into the table only the rows from the API that do not exist there yet.
You could pull the results already in the database into a new dataframe and then compare the two dataframes. After that you would only insert the rows not in the table. Not knowing the format of your table or data I'm just using a generic SELECT statement here.
from sqlalchemy import create_engine
from sqlalchemy import exc
engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
sql = "SELECT * FROM table_name"
old_results = pd.read_sql(sql, con)
df = pd.merge(old_results, results_final, how='outer', indicator=True)
new_results = df[df['_merge']=='right_only'][results_final.columns]
new_results.to_sql(name='error',con=con,if_exists='append')
con.close()
You also need to change if_exists to 'append': when it is set to 'replace', to_sql drops the existing table and recreates it with only the values in the pandas dataframe.
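The comparison step can be checked in isolation, without a database. A minimal sketch with two made-up dataframes (the column names are borrowed from the question, the values are invented):

```python
import pandas as pd

# Rows already stored in the table (invented sample data).
old_results = pd.DataFrame({"datastream_id": [1, 2], "message": ["a", "b"]})
# Rows returned by the API: the two old ones plus one new one.
results_final = pd.DataFrame({"datastream_id": [1, 2, 3], "message": ["a", "b", "c"]})

# An outer merge on all shared columns; the _merge column records
# whether each row came from the left frame, the right frame, or both.
merged = pd.merge(old_results, results_final, how="outer", indicator=True)

# Keep only the rows that exist in the API result but not in the table.
new_results = merged.loc[merged["_merge"] == "right_only", results_final.columns]
```

Because the merge compares every shared column, a dtype or formatting difference between the database and the API (for example, dates read back as datetimes but delivered as strings) would make old rows look new, so it is worth checking the dtypes of both frames first.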
I developed this function to handle both new values and the case where the columns of the source table and the target table are not equal.
import pandas as pd
from sqlalchemy import create_engine, text
def load_data(df):
    engine = create_engine('mysql+pymysql://root:pass@localhost/dw', echo_pool=True, pool_size=10, max_overflow=20)
    with engine.connect() as conn, conn.begin():
        try:
            df_old = pd.read_sql('SELECT * FROM table', conn)
            # Check whether there are new rows to be inserted
            if len(df) > len(df_old) or df.disconnected_time.max() > df_old.disconnected_time.max():
                print("There are new rows to be inserted.")
                df_merged = pd.merge(df_old, df, how='outer', indicator=True)
                df_final = df_merged[df_merged['_merge'] == 'right_only'][df.columns]
                df_final.to_sql(name='table', con=conn, index=False, if_exists='append')
        except Exception as err:
            print(str(err))
        else:
            # This handles the case where the source has more columns than the target
            if df.shape[1] > df_old.shape[1]:
                data = pd.read_sql('SELECT * FROM table', conn)
                df2 = pd.concat([df, data])
                df2.to_sql('table', conn, index=False, if_exists='replace')
        # text() keeps the raw SQL string usable on SQLAlchemy 2.x as well
        outcome = conn.execute(text("select count(1) from table"))
        countRow = outcome.first()[0]
        print(f"Total of {countRow} rows loaded.")
I want to write a pandas dataframe to a postgres table. I make a connection to db as follows:
import psycopg2
import pandas as pd
import sqlalchemy
def connect(user, password, db, host='localhost', port=5432):
    '''Returns a connection and a metadata object'''
    url = 'postgresql://{}:{}@{}:{}/{}'
    url = url.format(user, password, host, port, db)
    # The return value of create_engine() is our connection object
    con = sqlalchemy.create_engine(url, client_encoding='utf8')
    # We then bind the connection to MetaData()
    meta = sqlalchemy.MetaData(bind=con, reflect=True)
    return con, meta
con, meta = connect('user_name', 'password', 'db_name', host='host_name')
When I read from a table that is already populated, it works fine:
df = pd.read_sql("SELECT * FROM db.table_name limit 10",con=con)
print(df)
I would like to be able to write df to a table. To test this, I have a temporary table called 'test' with two fields name and age.
# create a temp df
table = [['name', 'age'], ['nameA' , 20], ['nameB', 30]]
headers = table.pop(0)
df = pd.DataFrame(table, columns=headers)
# write to db
df.to_sql('db.test', con, if_exists = 'replace', index=False)
I then check if the temp table is populated:
df = pd.read_sql("SELECT * FROM db.test limit 10",con=con)
print(df)
I get an empty dataframe! I got no errors when I use df.to_sql but nothing is getting written to the database (?). What am I missing and how do I go about fixing this?
Versions:
Pandas: 0.19.2
SQLAlchemy: 1.1.10
Postgres: 9.4.9
I have not figured out why df.to_sql did not write to the table. Writing to table using pd.io.sql.SQLDatabase worked for my test case:
meta = sqlalchemy.MetaData(con, schema='db_name')
meta.reflect()
pdsql = pd.io.sql.SQLDatabase(con, meta=meta)
pdsql.to_sql(df, 'test', if_exists='replace')
I would not consider this THE solution -- I'd be happy to accept a better solution or an answer that brings closure on why df.to_sql() does not behave as expected.
I have a database that contains multiple tables, and I am trying to import each table as a pandas dataframe. I can do this for a single table as follows:
import pandas as pd
import pandas.io.sql as psql
import pypyodbc
conn = pypyodbc.connect("DRIVER={SQL Server};\
SERVER=serveraddress;\
UID=uid;\
PWD=pwd;\
DATABASE=db")
df1 = psql.read_frame('SELECT * FROM dbo.table1', conn)
The number of tables in the database will change, and at any time I would like to be able to import each table into its own dataframe. How can I get all of these tables into pandas?
Depending on your SQL server, you can query its catalog to list the tables in a database.
For example, on servers that expose information_schema (SQL Server does):
tables_df = pd.read_sql('SELECT table_name FROM information_schema.tables', conn)
Now your table names are accessible as a pandas data frame, you just need to parse it out:
table_name_list = tables_df.table_name
select_template = 'SELECT * FROM {table_name}'
frames_dict = {}
for tname in table_name_list:
    query = select_template.format(table_name=tname)
    frames_dict[tname] = pd.read_sql(query, conn)
Your dictionary frames_dict now contains all the dataframes, with each table_name as the key.
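The whole loop can be tried end to end without a SQL Server instance. This sketch assumes an in-memory SQLite database with two made-up tables; SQLite lists its tables in sqlite_master, so only the catalog query would differ on another server:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the real server (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER, value REAL)")
conn.execute("INSERT INTO table1 VALUES (1, 'a')")
conn.execute("INSERT INTO table2 VALUES (1, 0.5)")

# Ask the catalog for the table names (sqlite_master is SQLite-specific).
tables_df = pd.read_sql(
    "SELECT name AS table_name FROM sqlite_master WHERE type = 'table'", conn
)

# One dataframe per table, keyed by table name. The names come from the
# catalog itself, so formatting them into the query is acceptable here.
frames_dict = {
    tname: pd.read_sql(f'SELECT * FROM "{tname}"', conn)
    for tname in tables_df["table_name"]
}
conn.close()
```

Each table is then available as frames_dict['table1'], frames_dict['table2'], and so on, however many tables the database holds at the time.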