I have to use Python with Zeppelin. I'm very new to it, and I can only find materials about PySpark in Zeppelin.
I want to import a DataFrame with Python and then access it through SQL:
%python
import pandas as pd #To work with dataset
import numpy as np #Math library
#Importing the data
df_credit = pd.read_csv("../data.csv",index_col=0)
If I try:
%python
from sqlalchemy import create_engine
engine = create_engine('sqlite://')
df_credit.to_sql('mydatasql',con=engine)
and then access it, e.g.:
%sql select Age, count(1) from mydatasql where Age < 30 group by Age order by Age
I get the error: "Table or view not found"
I think the problem is that %sql cannot read variables created with %python, but I'm not sure of that.
Try the %python.sql interpreter.
You have to install the pandasql package first.
Check this link for more info.
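For example (a minimal sketch, assuming the Zeppelin python interpreter group is configured and pandasql is installed; %python.sql exposes DataFrames defined in earlier %python paragraphs under their variable names):
%python.sql
select Age, count(1) from df_credit where Age < 30 group by Age order by Age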
How to even start a basic query in databricks using python?
The data I need is in Databricks, and so far I have been using JupyterHub to pull the data and modify a few things. But now I want to eliminate the step of pulling the data into JupyterHub, move my Python code directly into Databricks, and then schedule the job.
I started like below:
%python
import pandas as pd
df = pd.read_sql('select * from databasename.tablename')
and got the below error:
TypeError: read_sql() missing 1 required positional argument: 'con'
So I tried updating it:
%python
import pandas as pd
import pyodbc
odbc_driver = pyodbc.drivers()[0]
conn = pyodbc.connect(odbc_driver)
df = pd.read_sql('select * from databasename.tablename', con=conn)
and I got the below error:
ModuleNotFoundError: No module named 'pyodbc'
Can anyone please help? I can use SQL to pull the data, but I already have a lot of code in Python that I don't know how to convert to SQL. So I just want my Python code to work in Databricks for now.
You should use Spark's SQL facilities directly:
my_df = spark.sql('SELECT * FROM databasename.tablename')
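If the rest of your code needs a pandas DataFrame, you can convert the Spark result; note that toPandas() collects every row into driver memory, so it only suits data that fits on one machine:
pdf = my_df.toPandas()  # Spark DataFrame -> pandas DataFrame on the driver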
Trying to import a table from an SQLite database into a pandas DataFrame:
import pandas as pd
import sqlite3
cnxn = sqlite3.connect("my_db.db")
c = cnxn.cursor()
This command works: pd.read_sql_query('select * from table1', con=cnxn). This doesn't: df = pd.read_sql_table('table1', con=cnxn).
Response:
ValueError: Table table1 not found
What could be the issue?
With SQLite in Python, pd.read_sql_table() is not possible, as noted in the pandas docs: a sqlite3 connection counts as a DBAPI connection when running the commands through Python, and read_sql_table does not support those.
pd.read_sql_table() Documentation
Given a table name and a SQLAlchemy connectable, returns a DataFrame.
This function does not support DBAPI connections.
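A workaround (a minimal sketch, reusing the my_db.db file from the question) is to hand pandas a SQLAlchemy connectable instead of the raw sqlite3 connection:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///my_db.db')  # SQLAlchemy connectable, not a DBAPI connection
df = pd.read_sql_table('table1', con=engine)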
I used to work with pandas and cx_Oracle until now, but I have to switch to Dask due to RAM limitations.
import pandas as pd
from dask import dataframe as dd
import os
import cx_Oracle as cx
con = cx.connect('USER', 'userpw', 'oracle_db', encoding='utf-8')
cursor = con.cursor()
query_V_Branchen = '''SELECT * FROM DBOWNER.V_BRANCHEN vb'''
daskdf = dd.read_sql_table(query_V_Branchen, con, index_col='RECID')
I tried to do it similarly to how I used cx_Oracle with pandas, but I receive an AttributeError:
'cx_Oracle.Connection' object has no attribute '_instantiate_plugins'
Any ideas if it's just a problem with the package?
Please read the Dask docs on SQL:
you should provide a connection string, not a connection object;
you should give a table name, not a query, or phrase your query using SQLAlchemy's expression syntax.
e.g.,
df = dd.read_sql_table('DBOWNER.V_BRANCHEN',
                       'oracle+cx_oracle://USER:userpw@oracle_db', index_col='RECID')
In trying to import an SQL database into a pandas DataFrame, I am getting a syntax error. I am a newbie here, so probably the issue is very simple.
After downloading the SQLite sample chinook.db from http://www.sqlitetutorial.net/sqlite-sample-database/
and reading the pandas documentation, I tried to load it into a pandas DataFrame with
import pandas as pd
import sqlite3
conn = sqlite3.connect('chinook.db')
df = pd.read_sql('albums', conn)
where 'albums' is a table of 'chinook.db', listed with sqlite3 from the command line.
The result is:
...
DatabaseError: Execution failed on sql 'albums': near "albums": syntax error
I tried variations of the above code in an IPython session to import the database's tables for exploratory data analysis, with no success.
What am I doing wrong? Is there a documentation/tutorial for newbies with some examples around?
Thanks in advance for your help!
Found it!
An example of db connection with SQLAlchemy can be found here:
https://www.codementor.io/sagaragarwal94/building-a-basic-restful-api-in-python-58k02xsiq
import pandas as pd
from sqlalchemy import create_engine
db_connect = create_engine('sqlite:///chinook.db')
df = pd.read_sql('albums', con=db_connect)
print(df)
As suggested by @Anky_91, pd.read_sql_table also works, as read_sql wraps it.
The issue was the connection, which has to be made with SQLAlchemy rather than with sqlite3.
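For completeness (a minimal sketch, not from the original answer): a plain sqlite3 connection does still work with read_sql, but only if you pass a full query rather than a bare table name, since a DBAPI connection routes everything through read_sql_query:
import pandas as pd
import sqlite3

conn = sqlite3.connect('chinook.db')
df = pd.read_sql('SELECT * FROM albums', conn)  # a full query works; a bare table name does not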
Thanks
I have a tabledata.csv file and I have been using pandas.read_csv to read it or choose specific columns with specific conditions.
For instance, I use the following code to select all "name" where session_id == 1, which works fine in an IPython Notebook on datascientistworkbench.
df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
df['name'][df['session_id']==1]
I just wonder: after I have read the csv file, is it possible to somehow "switch/read" it as a SQL database? (I am pretty sure that I did not explain this well using the correct terms, sorry about that!) What I want is to use SQL statements in the IPython notebook to choose specific rows with specific conditions, like:
Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`
But I guess I need to figure out a way to change the csv file into another form so that I can use SQL statements on it. Many thanks!
Here is a quick primer on pandas and SQL, using the built-in sqlite3 package. Generally speaking, you can do all SQL operations in pandas one way or another, but databases are of course useful. The first thing you need to do is store the original df in a SQL database so that you can query it. The steps are listed below.
import pandas as pd
import sqlite3
#read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
#connect to a database
conn = sqlite3.connect("Any_Database_Name.db") #if the db does not exist, this creates a Any_Database_Name.db file in the current directory
#store your table in the database:
df.to_sql('Some_Table_Name', conn)
#read a SQL Query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
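From here, the kind of query from the question runs against the stored table, e.g. (a sketch mirroring the pandas filter above; Some_Table_Name is the placeholder name used in this primer):
sql_string = 'SELECT name FROM Some_Table_Name WHERE session_id = 1'
names = pd.read_sql(sql_string, conn)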
Another answer suggested using SQLite. However, DuckDB is a much faster alternative to loading your data into SQLite: first, loading the data takes time; second, SQLite is not optimized for analytical queries (e.g., aggregations).
Here's a full example you can run in a Jupyter notebook:
Installation
pip install jupysql duckdb duckdb-engine
Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine
Example
Load extension (%sql magic) and create in-memory database:
%load_ext sql
%sql duckdb://
Download some sample CSV data:
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")
Query:
%%sql
SELECT species, COUNT(*) AS count
FROM 'penguins.csv'
GROUP BY species
ORDER BY count DESC
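If you want the result as a pandas DataFrame (a sketch assuming jupysql's default behavior, where a query returns a ResultSet with a DataFrame() conversion method):
result = %sql SELECT species, COUNT(*) AS count FROM 'penguins.csv' GROUP BY species ORDER BY count DESC
df = result.DataFrame()  # convert the jupysql ResultSet to pandas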
The JupySQL documentation is available here.