Load oracle Dataframe in dask dataframe - python

I have worked with pandas and cx_Oracle until now, but I have to switch to dask due to RAM limitations.
import pandas as pd
from dask import dataframe as dd
import os
import cx_Oracle as cx
con = cx.connect('USER','userpw' , 'oracle_db',encoding='utf-8')
cursor = con.cursor()
query_V_Branchen = ('''SELECT * FROM DBOWNER.V_BRANCHEN vb''')
daskdf = dd.read_sql_table(query_V_Branchen,con ,index_col= 'RECID')
I tried to do it the same way I used cx_Oracle with pandas, but I get an AttributeError:
'cx_Oracle.Connection' object has no attribute '_instantiate_plugins'
Any ideas whether it's just a problem with the package?

Please read the dask documentation on SQL:
you should provide a connection string (a SQLAlchemy URI), not a connection object
you should give a table name, not a query, or phrase your query using sqlalchemy's expression syntax.
e.g.,
df = dd.read_sql_table('DBOWNER.V_BRANCHEN',
    'oracle+cx_oracle://USER:userpw@oracle_db', index_col='RECID')
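To make the "connection string, not an object" point concrete, here is a minimal sketch. The helper names oracle_uri and load_table are made up for illustration, and percent-encoding the credentials with quote_plus is an addition that only matters when the password contains characters such as '@' or '/'. The dask call itself needs dask, SQLAlchemy and cx_Oracle installed plus a reachable database, so it is wrapped in a function and not executed here.

```python
from urllib.parse import quote_plus


def oracle_uri(user, password, dsn):
    """Build a SQLAlchemy-style URI string for the cx_Oracle driver.

    User name and password are percent-encoded so that characters such as
    '@' or '/' cannot be mistaken for URI separators.
    """
    return f"oracle+cx_oracle://{quote_plus(user)}:{quote_plus(password)}@{dsn}"


def load_table(table_name, uri, index_col, schema=None):
    """Read one table lazily into a dask dataframe (hypothetical helper)."""
    # Imported here so that the URI helper above works without dask installed.
    from dask import dataframe as dd

    # The second argument is the URI string, never a cx_Oracle connection object.
    return dd.read_sql_table(table_name, uri, index_col=index_col, schema=schema)


uri = oracle_uri("USER", "userpw", "oracle_db")
print(uri)
```

With a live database this would be called as load_table('DBOWNER.V_BRANCHEN', uri, index_col='RECID'). Whether the owner has to go into the schema argument instead of the dotted name may depend on the dask and SQLAlchemy versions in use.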

Related

read.sql_query works, read sql_table doesn't

Trying to import a table from a SQLite into Pandas DF:
import pandas as pd
import sqlite3
cnxn = sqlite3.Connection("my_db.db")
c = cnxn.cursor()
Using this command works: pd.read_sql_query('select * from table1', con=cnxn). This doesn't : df = pd.read_sql_table('table1', con=cnxn).
Response :
ValueError: Table table1 not found
What could be the issue?
pd.read_sql_table() cannot be used with a plain sqlite3 connection. According to the pandas documentation it requires a SQLAlchemy connectable,
and a connection created with the sqlite3 module is a DBAPI connection, which is why the table lookup fails.
pd.read_sql_table() Documentation
Given a table name and a SQLAlchemy connectable, returns a DataFrame.
This function does not support DBAPI connections.
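A small sketch of the difference, assuming pandas and SQLAlchemy are both installed in compatible versions. The database file is a throwaway created in a temporary directory and the two rows are made up. The sqlite3 connection is used for the query string, and a SQLAlchemy engine on the same file is used for read_sql_table.

```python
import os
import sqlite3
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Throwaway SQLite file with one small table (illustrative data only).
db_path = os.path.join(tempfile.mkdtemp(), "my_db.db")
cnxn = sqlite3.connect(db_path)
cnxn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT)")
cnxn.executemany("INSERT INTO table1 (name) VALUES (?)", [("a",), ("b",)])
cnxn.commit()

# A raw sqlite3 (DBAPI) connection is accepted for a SQL query string.
df_query = pd.read_sql_query("SELECT * FROM table1", con=cnxn)
cnxn.close()

# read_sql_table needs a SQLAlchemy connectable pointing at the same file.
engine = create_engine(f"sqlite:///{db_path}")
df_table = pd.read_sql_table("table1", con=engine)
engine.dispose()

print(df_query)
print(df_table)
```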

Pandas: load a table into a dataframe with read_sql - `con` parameter and table name

I'm trying to import an SQL database into a Python pandas dataframe, and I am getting a syntax error. I'm a newbie here, so the issue is probably very simple.
After downloading sqlite sample chinook.db from http://www.sqlitetutorial.net/sqlite-sample-database/
and reading pandas documentation, I tried to load it into a pandas dataframe with
import pandas as pd
import sqlite3
conn = sqlite3.connect('chinook.db')
df = pd.read_sql('albums', conn)
where 'albums' is a table of 'chinook.db' gathered with sqlite3 from command line.
The result is:
...
DatabaseError: Execution failed on sql 'albums': near "albums": syntax error
I tried variations of the above code to import the tables of the database in an ipython session for exploratory data analysis, with no success.
What am I doing wrong? Is there any documentation or a tutorial for newbies with some examples?
Thanks in advance for your help!
Found it!
An example of db connection with SQLAlchemy can be found here:
https://www.codementor.io/sagaragarwal94/building-a-basic-restful-api-in-python-58k02xsiq
import pandas as pd
from sqlalchemy import create_engine
db_connect = create_engine('sqlite:///chinook.db')
df = pd.read_sql('albums', con=db_connect)
print(df)
As suggested by @Anky_91, pd.read_sql_table also works, since read_sql wraps it.
The issue was the connection, which has to be made with SQLAlchemy and not with sqlite3.
Thanks
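For anyone who wants to stay with the plain sqlite3 connection: as far as I can tell, the error comes from pandas executing the string 'albums' as SQL when it is given a DBAPI connection. The sketch below uses an in-memory stand-in for chinook.db with made-up rows (the real albums table has more columns) and shows that a full SELECT statement works with the same connection.

```python
import sqlite3

import pandas as pd

# Stand-in for chinook.db: an in-memory database with a tiny 'albums' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT)")
conn.executemany("INSERT INTO albums (Title) VALUES (?)", [("First",), ("Second",)])
conn.commit()

# With a DBAPI connection the first argument is executed as SQL,
# so a bare table name is rejected by SQLite as a syntax error.
try:
    pd.read_sql("albums", conn)
    bare_name_failed = False
except Exception as exc:  # pandas wraps the sqlite3 error in its own DatabaseError
    bare_name_failed = True
    print(type(exc).__name__, exc)

# A full SELECT statement works with the very same connection.
df = pd.read_sql("SELECT * FROM albums", conn)
print(df)
conn.close()
```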

Zeppelin: how to read a DataFrame with sql

I have to use Python with Zeppelin. I'm very new to it and can only find material about pyspark in Zeppelin.
I want to import a dataframe with Python and then access it through SQL:
%python
import pandas as pd #To work with dataset
import numpy as np #Math library
#Importing the data
df_credit = pd.read_csv("../data.csv",index_col=0)
if I try with:
%python
from sqlalchemy import create_engine
engine = create_engine('sqlite://')
df_credit.to_sql('mydatasql',con=engine)
and then access it, i.e. :
%sql select Age, count(1) from mydatasql where Age < 30 group by Age order by Age
I get the error: "Table or view not found"
I think the problem is that %sql cannot read variables created with %python, but I'm not sure of that.
Try the %python.sql interpreter.
You have to install the pandasql package.
Check this link for more info.
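As a side note on why the table is not found: an in-memory SQLite database lives only inside the Python process that created it, so it can be queried from that process but not from a separate interpreter. The sketch below is plain pandas plus sqlite3, not Zeppelin-specific, and the dataframe is a made-up stand-in for the CSV in the question. It runs the same aggregation from within Python.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for the credit data read from ../data.csv in the question.
df_credit = pd.DataFrame({"Age": [22, 25, 25, 41], "Amount": [100, 250, 90, 400]})

# In-memory SQLite database; it exists only inside this Python process.
conn = sqlite3.connect(":memory:")
df_credit.to_sql("mydatasql", con=conn, index=False)

# Same query as in the question, with an alias for the count column.
query = """
    SELECT Age, COUNT(1) AS n
    FROM mydatasql
    WHERE Age < 30
    GROUP BY Age
    ORDER BY Age
"""
result = pd.read_sql_query(query, con=conn)
print(result)
conn.close()
```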

How to write python sql output into CSV using a dataframe

# IMPORT MODULES
import pyodbc
import pandas as pd
import csv
# CREATE CONNECTION TO MICROSOFT SQL SERVER
msconn = pyodbc.connect(driver='{SQL Server}',
                        server='SERVER',
                        database='DATABASE',
                        trusted_msconnection='yes')
cursor = msconn.cursor()
# CREATE VARIABLES THAT HOLD SQL STATEMENTS
SCRIPT = "SELECT * FROM TABLE"
# PRINT DATA
cursor.execute(SCRIPT)
cursor.commit
for row in cursor:
    print(row)
# WRITE ALL ROWS WITH COLUMN NAME TO CSV --- NEED HELP HERE
Pandas
Since pandas supports direct import from an RDBMS through read_sql, you don't need to write this manually.
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('mssql+pyodbc://user:pass@mydsn')
df = pd.read_sql(sql='SELECT * FROM ...', con=engine)
The right tool: odo
From odo docs
Loading CSV files into databases is a solved problem. It’s a problem
that has been solved well. Instead of rolling our own loader every
time we need to do this and wasting computational resources, we should
use the native loaders in the database of our choosing.
And it works the other way round also.
from odo import odo
odo('mssql+pyodbc://user:pass@mydsn::tablename', 'myfile.csv')
@e4c5's answer is great, as it should be faster than a for loop over the cursor. I would extend it by saving the result set to CSV:
...
pd.read_sql(sql='SELECT * FROM TABLE', con=msconn) \
  .to_csv('/path/to/file.csv', index=False)
If you want to read all rows (no WHERE clause), read_sql_table also works, but it needs a SQLAlchemy connectable such as the engine from the answer above rather than the pyodbc connection:
pd.read_sql_table('TABLE', con=engine).to_csv('/path/to/file.csv', index=False)
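If pandas is not an option, the "all rows with column names" part can also be done with the cursor itself and the csv module. This is only a sketch: cursor_to_csv is a made-up helper name, and sqlite3 stands in for pyodbc so that it runs without a SQL Server. It relies on the DBAPI cursor.description attribute, which pyodbc cursors provide as well.

```python
import csv
import io
import sqlite3


def cursor_to_csv(cursor, fileobj):
    """Write the cursor's pending result set to fileobj as CSV, header row first."""
    writer = csv.writer(fileobj)
    # DBAPI cursors expose column metadata in .description; item 0 is the column name.
    writer.writerow([column[0] for column in cursor.description])
    # Iterating the cursor yields the remaining rows one by one.
    writer.writerows(cursor)


# Stand-in database so the sketch runs anywhere; table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

cursor = conn.cursor()
cursor.execute("SELECT * FROM t ORDER BY id")
buffer = io.StringIO()
cursor_to_csv(cursor, buffer)
print(buffer.getvalue())
conn.close()
```

When writing to a real file instead of the in-memory buffer, open it with open(path, 'w', newline='') so the csv module controls the line endings.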

Pandas writing dataframe to other postgresql schema

I am trying to write a pandas DataFrame to a PostgreSQL database,
using a schema-qualified table.
I use the following code:
import pandas.io.sql as psql
from sqlalchemy import create_engine
engine = create_engine(r'postgresql://some:user@host/db')
c = engine.connect()
conn = c.connection
df = psql.read_sql("SELECT * FROM xxx", con=conn)
df.to_sql('a_schema.test', engine)
conn.close()
What happens is that pandas writes in schema "public", in a table named 'a_schema.test',
instead of writing in the "test" table in the "a_schema" schema.
How can I instruct pandas to use a schema different than public?
Thanks
Update: starting from pandas 0.15, writing to different schemas is supported. You can then use the schema keyword argument:
df.to_sql('test', engine, schema='a_schema')
Original answer (pandas 0.14): writing to different schemas is not yet supported with the read_sql and to_sql functions (but an enhancement request has already been filed: https://github.com/pydata/pandas/issues/7441).
However, you can work around this for now by using the object interface with PandasSQLAlchemy and providing a custom MetaData object:
meta = sqlalchemy.MetaData(engine, schema='a_schema')
meta.reflect()
pdsql = pd.io.sql.PandasSQLAlchemy(engine, meta=meta)
pdsql.to_sql(df, 'test')
Beware! This interface (PandasSQLAlchemy) is not yet really public and will still undergo changes in the next version of pandas, but this is how you can do it for pandas 0.14.
Update: PandasSQLAlchemy is renamed to SQLDatabase in pandas 0.15.
Solved, thanks to joris' answer.
The code was also improved thanks to joris' comment, by passing around the sqlalchemy engine instead of connection objects.
import pandas as pd
from sqlalchemy import create_engine, MetaData
engine = create_engine(r'postgresql://some:user@host/db')
meta = MetaData(engine, schema='a_schema')
meta.reflect(engine, schema='a_schema')
pdsql = pd.io.sql.PandasSQLAlchemy(engine, meta=meta)
df = pd.read_sql("SELECT * FROM xxx", con=engine)
pdsql.to_sql(df, 'test')
