Python in Databricks

How do I even start a basic query in Databricks using Python?
The data I need is in Databricks, and so far I have been using JupyterHub to pull the data and modify a few things. But now I want to eliminate the step of pulling the data into JupyterHub, move my Python code directly into Databricks, and then schedule the job.
I started like below:
%python
import pandas as pd
df = pd.read_sql('select * from databasename.tablename')
and got the error below:
TypeError: read_sql() missing 1 required positional argument: 'con'
So I tried an update:
%python
import pandas as pd
import pyodbc
odbc_driver = pyodbc.drivers()[0]
conn = pyodbc.connect(odbc_driver)
df = pd.read_sql('select * from databasename.tablename', con=conn)
and I got the error below:
ModuleNotFoundError: No module named 'pyodbc'
Can anyone please help? I can use SQL to pull the data, but I already have a lot of code in Python that I don't know how to convert to SQL. So I just want my Python code to work in Databricks for now.

You should use Spark's SQL facilities directly:
my_df = spark.sql('SELECT * FROM databasename.tablename')
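Since the rest of your code is written against pandas, note that spark.sql returns a Spark DataFrame. A small addition using PySpark's toPandas() bridges the gap (it collects every row to the driver, so it only suits data that fits in memory):
pandas_df = my_df.toPandas()  # Spark DataFrame -> pandas DataFrame
From there your existing pandas code should run unchanged, and the notebook can be scheduled as a Databricks job.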

Related

Load Oracle DataFrame into Dask DataFrame

Until now I have worked with pandas and cx_Oracle, but I have to switch to Dask due to RAM limitations.
import pandas as pd
from dask import dataframe as dd
import os
import cx_Oracle as cx

con = cx.connect('USER', 'userpw', 'oracle_db', encoding='utf-8')
cursor = con.cursor()
query_V_Branchen = '''SELECT * FROM DBOWNER.V_BRANCHEN vb'''
daskdf = dd.read_sql_table(query_V_Branchen, con, index_col='RECID')
I tried to do it similarly to how I used cx_Oracle with pandas, but I receive an AttributeError:
'cx_Oracle.Connection' object has no attribute '_instantiate_plugins'
Any ideas if it's just a problem with the package?
Please read the dask doc on SQL:
- you should provide a connection string, not a connection object
- you should give a table name, not a query, or phrase your query using sqlalchemy's expression syntax (see the sketch below)
e.g.,
df = dd.read_sql_table('DBOWNER.V_BRANCHEN',
                       'oracle+cx_oracle://USER:userpw@oracle_db', index_col='RECID')
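For the expression-syntax route, a sketch: recent dask versions also ship dd.read_sql_query, which accepts a SQLAlchemy selectable rather than a raw SQL string (assumes dask >= 2021.10 and SQLAlchemy 1.4+; the table, schema, and index column are the ones from the question):
import dask.dataframe as dd
import sqlalchemy as sa

# build the query with sqlalchemy expression syntax instead of a raw string;
# the index column must be part of the selection
vb = sa.table('V_BRANCHEN', sa.column('RECID'), schema='DBOWNER')
query = sa.select(vb)  # SELECT RECID FROM DBOWNER.V_BRANCHEN
df = dd.read_sql_query(query, 'oracle+cx_oracle://USER:userpw@oracle_db',
                       index_col='RECID')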

Python replacement for SQL bcp.exe

The goal is to load a CSV file into an Azure SQL database directly from Python, that is, not by calling bcp.exe. The CSV files will have the same number of fields as the destination tables. It'd be nice not to have to create the format file bcp.exe requires (XML for roughly 400 fields for each of 16 separate tables).
Following the Pythonic approach, try to insert the data and ask SQL Server to throw an exception if there is a type mismatch or some other problem.
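A minimal sketch of that insert-and-let-it-fail pattern (df and engine are the ones defined in the answer below; DBAPIError is SQLAlchemy's wrapper around driver-level errors):
import sqlalchemy

try:
    # let SQL Server validate the data; a type mismatch raises an exception
    df.to_sql('test9', engine, if_exists='append', index=False)
except sqlalchemy.exc.DBAPIError as e:
    print("Load failed:", e)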
If you don't want to use the bcp command to import the CSV file, you can use the Python pandas library.
Here's an example where I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
CSV file: a headerless file with three columns (id, name, age).
Python code example:
import pandas as pd
import sqlalchemy
import urllib
import pyodbc

# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)

# read csv data into a dataframe with pandas;
# datatypes will be inferred, but you can specify them with the `dtype` parameter
df = pd.read_csv(r'C:\Users\leony\Desktop\test9.csv', header=None, names=['id', 'name', 'age'])

# write to the sql table... pandas will use the dataframe's column names and inferred dtypes;
# add the 'dtype' parameter to override, e.g. dtype={'column1': VARCHAR(255), 'column2': DateTime}
df.to_sql('test9', engine, if_exists='append', index=False)
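If load speed matters for tables as wide as the ~400-field ones in the question, pyodbc's fast_executemany option, which SQLAlchemy exposes on the mssql+pyodbc dialect, usually speeds up to_sql considerably. A sketch reusing the params string above:
# fast_executemany batches the parameterized INSERTs inside the ODBC driver
engine = sqlalchemy.create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % params,
    fast_executemany=True,
)
df.to_sql('test9', engine, if_exists='append', index=False, chunksize=1000)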
Notice:
- Get the connection string from the Azure Portal.
- The UID format is [username]@[servername].
Run this script and it works.
Please reference these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.

Pandas: load a table into a dataframe with read_sql - `con` parameter and table name

In trying to import an SQL database into a Python pandas dataframe, I am getting a syntax error. I am a newbie here, so the issue is probably very simple.
After downloading the sqlite sample chinook.db from http://www.sqlitetutorial.net/sqlite-sample-database/ and reading the pandas documentation, I tried to load it into a pandas dataframe with:
import pandas as pd
import sqlite3
conn = sqlite3.connect('chinook.db')
df = pd.read_sql('albums', conn)
where 'albums' is a table of 'chinook.db', as listed with sqlite3 from the command line.
The result is:
...
DatabaseError: Execution failed on sql 'albums': near "albums": syntax error
I tried variations of the above code to import the tables of the database in an IPython session for exploratory data analysis, with no success.
What am I doing wrong? Is there documentation or a tutorial for newbies with some examples?
Thanks in advance for your help!
Found it!
An example of db connection with SQLAlchemy can be found here:
https://www.codementor.io/sagaragarwal94/building-a-basic-restful-api-in-python-58k02xsiq
import pandas as pd
from sqlalchemy import create_engine
db_connect = create_engine('sqlite:///chinook.db')
df = pd.read_sql('albums', con=db_connect)
print(df)
As suggested by @Anky_91, pd.read_sql_table also works, since read_sql wraps it.
The issue was the connection, which has to be made with SQLAlchemy and not with sqlite3.
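For what it's worth, the original sqlite3 connection does work with read_sql as long as you pass a full query instead of a bare table name; pandas treats a plain DBAPI2 connection as query-only, which is why the bare table name raised a syntax error:
import pandas as pd
import sqlite3

conn = sqlite3.connect('chinook.db')
# a DBAPI2 connection supports SQL queries only, not bare table names
df = pd.read_sql('SELECT * FROM albums', conn)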
Thanks

Zeppelin: how to read a DataFrame with sql

I have to use Python with Zeppelin. I'm very new to it, and I find only materials about pyspark in Zeppelin.
I want to import a dataframe with Python and then access it through SQL:
%python
import pandas as pd  # to work with the dataset
import numpy as np   # math library

# importing the data
df_credit = pd.read_csv("../data.csv", index_col=0)
If I try:
%python
from sqlalchemy import create_engine
engine = create_engine('sqlite://')
df_credit.to_sql('mydatasql',con=engine)
and then access it, i.e.:
%sql select Age, count(1) from mydatasql where Age < 30 group by Age order by Age
I get the error: "Table or view not found"
I think the problem is that %sql cannot read variables created with %python, but I'm not sure of that.
Try the %python.sql interpreter.
You have to install the pandasql package.
Check this link for more info.
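A minimal sketch of how the two paragraphs fit together, assuming pandasql is installed (df_credit and the column names are the ones from the question; with %python.sql you query the DataFrame by its Python variable name, no to_sql step needed):
%python
import pandas as pd
df_credit = pd.read_csv("../data.csv", index_col=0)

%python.sql
select Age, count(1) from df_credit where Age < 30 group by Age order by Age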

Python - read parquet file without pandas

Currently I'm using the code below on Python 3.5, Windows, to read in a parquet file:
import pandas as pd
parquetfilename = 'File1.parquet'
parquetFile = pd.read_parquet(parquetfilename, columns=['column1', 'column2'])
However, I'd like to do so without using pandas. What's the best way to do this? I'm using both Python 2.7 and 3.6 on Windows.
You can use duckdb for this. It's an embedded RDBMS similar to SQLite, but with OLAP in mind. There's a nice Python API and a SQL function to import Parquet files:
import duckdb
conn = duckdb.connect(":memory:") # or a file name to persist the DB
# Keep in mind this doesn't support partitioned datasets,
# so you can only read one partition at a time
conn.execute("CREATE TABLE mydata AS SELECT * FROM parquet_scan('/path/to/mydata.parquet')")
# Export a query as CSV
conn.execute("COPY (SELECT * FROM mydata WHERE col = 'val') TO 'col_val.csv' WITH (HEADER 1, DELIMITER ',')")
