I recently switched to Python from SAS. I want to do some SQL queries in Python, and I do them as follows (table1 and table2 are pandas DataFrames):
import pandas as pd
import sqlite3
sql = sqlite3.connect(':memory:')
c = sql.cursor()
table1.to_sql('table1sql', sql, if_exists='replace', index=False)
table2.to_sql('table2sql', sql, if_exists='replace', index=False)
df_sql = c.execute('''
SELECT a.*, b.*
FROM table1sql as a
LEFT JOIN table2sql as b
ON a.id = b.id
''')
df = pd.DataFrame(df_sql.fetchall())
df.columns = list(map(lambda x: x[0], c.description)) # get column names from sql cursor
I work with very large datasets, sometimes up to 60 million observations. The query itself takes seconds. However, "fetching" the result, i.e. turning the cursor output into a pandas DataFrame, takes ages.
In SAS, the entire SQL query would take seconds. Is the way I am doing it inefficient? Is there any other way of doing what I am trying to do?
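Let pandas do the fetch for you: pd.read_sql_query runs the query and builds the DataFrame, column names included, in one step.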
import pandas as pd
import sqlite3
# Connect to sqlite3 instance
con = sqlite3.connect(':memory:')
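# Note: table1sql / table2sql must already exist on this connection (e.g. via the
# to_sql calls from the question); a fresh :memory: database starts out empty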
# Read sqlite query results into a pandas DataFrame
df = pd.read_sql_query('''
    SELECT a.*, b.*
    FROM table1sql as a
    LEFT JOIN table2sql as b
    ON a.id = b.id
    ''',
    con
)
# Verify that result of SQL query is stored in the dataframe
print(df.head())
con.close()
Docs : https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html?highlight=read%20sql%20query#pandas.read_sql_query
EDIT
Wait, I just re-read your question, the sources are already pandas dataframes???
Why are you pushing them to SQLite, just to read them back out again? Just use pd.merge()?
df = pd.merge(
    table1,
    table2,
    how="left",
    on="id",
    suffixes=("_x", "_y"),
    copy=True
)
Docs : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge
Related
I have a PostgreSQL database which I am accessing using SQLAlchemy in Python.
I am trying to create a database out of a bunch of csv files (or data frames).
The csv files look like this; there are around 20,000 of them, they cover different date ranges, and some of them have more than one data column:
Data1.csv
Date Data1
01-Jan-2000 122.1
...
09-Oct-2020 991.2
Data2.csv
Date Data2
01-Feb-2010 101.1
...
09-Oct-2020 331.2
Data3.csv
Date Data3a Data3b Data3c.....
15-Dec-2015 1125.2 ....
...
09-Oct-2020 35512.2 ....
...
...
Data20000.csv
So I do the following:
import pandas as pd
import sqlalchemy
import psycopg2
engine = sqlalchemy.create_engine('postgresql://user@127.0.0.1', isolation_level='AUTOCOMMIT')
engine.execute('CREATE DATABASE testdb')
df = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
df.to_sql('temp',con=engine,if_exists='replace')
I can see that this creates an empty database called testdb and a table called temp.
How do I merge the temp table into the testdb table, so that I can create a for loop and make a table like this
Date Data1 Data2 Data3a Data3b Data3c
....
....
If I were using pandas, I would do this:
testdb = pd.DataFrame()
df = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
testdb = pd.merge(testdb,df,how='outer',left_index=True, right_index=True)
I tried engine.execute('SELECT * from df OUTER JOIN testdb'),
but I get the following error
ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "OUTER"
LINE 1: SELECT * from df OUTER JOIN testdb
^
[SQL: SELECT * from df OUTER JOIN testdb]
(Background on this error at: http://sqlalche.me/e/13/f405)
What is the right way to merge my data here?
Update:
So I have 1389 files in this directory,
each one around 15 years' worth of daily data x 8 columns.
I try to append, but around 300 files in it slows down like crazy.
What am I doing wrong here?
frame = pd.DataFrame()
length = len(os.listdir(filepath))
for filename in os.listdir(filepath):
    file_path = os.path.join(filepath, filename)
    print(length, end=" ")
    df = pd.read_csv(file_path, index_col=0)
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
    frame = frame.append(df)
    length -= 1
You can merge them at the DataFrame level using the pandas library before writing the records to the database table.
df1 = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
df2 = pd.read_csv("/Users/user/Documents/data/Data2.csv",index_col=0)
df3 = pd.read_csv("/Users/user/Documents/data/Data3.csv",index_col=0)
final_df = pd.concat([df1, df2, df3], axis=1)
final_df.to_sql('temp',con=engine,if_exists='replace')
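Regarding the update: the slowdown comes from calling frame.append inside the loop, because every call copies the whole accumulated frame, so the total cost grows roughly quadratically with the number of files. Here is a rough sketch of the same concat idea scaled to the whole directory (assuming the filepath and engine variables from your snippets, and that every file is indexed by its Date column): collect the DataFrames in a list and concatenate once at the end.
import os
import pandas as pd

# Read every csv once; appending to a plain Python list is cheap
dfs = []
for filename in sorted(os.listdir(filepath)):
    dfs.append(pd.read_csv(os.path.join(filepath, filename), index_col=0))

# One outer join on the shared Date index instead of ~1,400 DataFrame.append calls
final_df = pd.concat(dfs, axis=1)
final_df.to_sql('temp', con=engine, if_exists='replace')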
I am trying to retrieve information from a database using a Python tuple containing a set of ids (between 1000 and 10000 ids), but my query uses the IN statement and is consequently very slow.
query = """ SELECT *
FROM table1
LEFT JOIN table2 ON table1.id = table2.id
LEFT JOIN ..
LEFT JOIN ...
WHERE table1.id IN {} """.format(my_tuple)
and then I query the PostgreSQL database to load the result into a pandas DataFrame:
with tempfile.TemporaryFile() as tmpfile:
    copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
        query=query, head="HEADER"
    )
    conn = db_engine.raw_connection()
    cur = conn.cursor()
    cur.copy_expert(copy_sql, tmpfile)
    tmpfile.seek(0)
    df = pd.read_csv(tmpfile, low_memory=False)
I know that IN is not very efficient with a high number of parameters, but I do not have any idea how to optimise this part of the query. Any hint?
You could debug your query using an EXPLAIN statement. Probably you are
sequentially scanning a big table while needing only a few rows. Is the field table1.id indexed?
Or you could try to filter table1 first and then start joining:
with t1 as (
select f1,f2, .... from table1 where id in {}
)
select *
from t1
left join ....
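If table1.id turns out not to be indexed, here is a minimal sketch of checking the plan and adding the index from Python (assuming the db_engine and query variables from your snippet, and that you have permission to create indexes; the index name idx_table1_id is made up for illustration):
# Look at the query plan first: a "Seq Scan" on table1 usually means the id filter is not using an index
conn = db_engine.raw_connection()
cur = conn.cursor()
cur.execute("EXPLAIN " + query)
for row in cur.fetchall():
    print(row[0])

# If the plan shows a sequential scan, create the index once and re-check
cur.execute("CREATE INDEX IF NOT EXISTS idx_table1_id ON table1 (id)")
conn.commit()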
I am using pyodbc to connect to my SQL Server. I have a table from which I need to delete a column.
I can read this table; the code I used to read it is as follows:
import pyodbc
import pandas as pd
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0}; Server=xyz; database=db; Trusted_Connection=yes;")
cursor = cnxn.cursor()
df = pd.read_sql("select * from [db].[username].[mytable]", cnxn)
df.shape
The above code works as expected. But when I try to drop a column from this table, it says it cannot find the object.
Here is what I tried:
query = 'ALTER TABLE [db].[username].[mytable] DROP COLUMN [TEMP CELCIUS]'
cursor.execute(query)
My question is how to drop this column. Note that the column name contains a whitespace.
Try:
query = 'ALTER TABLE [db].[username].[mytable] DROP COLUMN "TEMP CELCIUS"'
OR:
query = 'ALTER TABLE [db].[username].[mytable] DROP COLUMN `TEMP CELCIUS`'
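Whichever quoting style your server accepts, remember to commit afterwards, since pyodbc does not autocommit by default. A minimal sketch using the first variant (assuming the cnxn and cursor from the question):
# Double-quote the column name because it contains a space
query = 'ALTER TABLE [db].[username].[mytable] DROP COLUMN "TEMP CELCIUS"'
cursor.execute(query)
cnxn.commit()  # without this, the ALTER is rolled back when the connection closes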
New to Python and pandas, I'm facing the following issue:
I would like to pass multiple strings into a SQL query and I am struggling to insert the delimiter ',':
Example data
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)
# Remove header (Not sure whether that is necessary)
df.columns = df.iloc[0]
pd.read_sql(
    """
    SELECT
        *
    FROM emptable
    WHERE empID IN ('{}',)
    """.format(df.ix[:, 0]),  # Which corresponds to 'Alex','Bob','Clarke'
    con=connection)
I tried different combinations, but none of them worked.
Demo:
sql_ = """
SELECT *
FROM emptable
WHERE empID IN ({})
"""
sql = sql_.format(','.join(['?'] * len(df)))
print(sql)
new = pd.read_sql(sql, conn, params=tuple(df['Name']))
Output:
In [166]: print(sql)
SELECT *
FROM emptable
WHERE empID IN (?,?,?)
NOTE: this approach will not work if your DF is large, because the generated SQL string would be too big.
In this case you can save/dump Names in a helper temporary table and use it in SQL:
df[['Name']].to_sql('tmp', conn, if_exists='replace')
sql = """
SELECT *
FROM emptable
WHERE empID IN (select Name from tmp)
"""
new = pd.read_sql(sql, conn)
Experts,
I am struggling to find an efficient way to work with pandas and sqlite.
I am building a tool that lets users
extract part of a sql database (sub_table) based on some filters
change part of sub_table
upload the changed sub_table back to the overall sql table, replacing the old values
Users will only see Excel data (so I need to write back and forth to Excel, which is not part of my example as it is out of scope).
Users can
replace existing rows (entries) with new data
delete existing rows
add new rows
Question: how can I most efficiently do this "replace/delete/add" using Pandas / sqlite3?
Here is my example code. If I use df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace") at the bottom, then obviously the entire table is replaced... so there must be another way I cannot think of.
import pandas as pd
import sqlite3
import numpy as np
#### SETTING EXAMPLE UP
### Create DataFrame
data = dict({"City": ["London", "Frankfurt", "Berlin", "Paris", "Brondby"],
             "Population": [8, 2, 4, 9, 0.5]})
df = pd.DataFrame(data,index = pd.Index(np.arange(5)))
### Create SQL DataBase
conn = sqlite3.connect("MyDB.db")
### Upload DataFrame as Table into SQL Database
df.to_sql("MyTable", con = conn, index = False, if_exists="replace")
### Read DataFrame from SQL DB
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
#### CREATE SUB_TABLE AND AMEND
#### EXTRACT sub_table FROM SQL TABLE
query = "SELECT * from MyTable WHERE Population > 2"
df_sub = pd.read_sql_query(query, con = conn)
df_sub
#### Amend Sub DF
df_sub[df_sub["City"] == "London"] = ["Brussel",4]
df_sub
#### Replace new data in SQL DB
df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
Thanks for your help!
Note: I did try to achieve this via pure SQL queries but gave up. As I am not an expert on SQL, I would want to go with pandas if a solution exists. If not, a hint on how to achieve this via SQL would be great!
I think there is no way around using SQL queries for this task.
With pandas it is only possible to read a query into a DataFrame and to write a DataFrame to a database (replace or append).
If you want to update specific values/rows or delete rows, you have to use SQL queries.
Commands you should look into are, for example:
UPDATE, REPLACE, INSERT, DELETE
# Update the database, change City to 'Brussel' and Population to 4, for the first row
# (Attention! python indices start at 0, SQL indices at 1)
cur = conn.cursor()
cur.execute('UPDATE MyTable SET City=?, Population=? WHERE ROWID=?', ('Brussel', 4, 1))
conn.commit()
conn.close()
# Display the changes
conn = sqlite3.connect("MyDB.db")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con=conn)
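Along the same lines, rough sketches of the DELETE and INSERT cases for the same MyTable (the row values are made up purely for illustration):
cur = conn.cursor()
# Delete rows the user removed, identified here by the City value
cur.execute('DELETE FROM MyTable WHERE City=?', ('Paris',))
# Insert rows the user added
cur.execute('INSERT INTO MyTable (City, Population) VALUES (?, ?)', ('Madrid', 6))
conn.commit()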
For more examples on sql and pandas you can look at
https://www.dataquest.io/blog/python-pandas-databases/