I have a tabledata.csv file and I have been using pandas.read_csv to read or choose specific columns with specific conditions.
For instance, I use the following code to select every "name" where session_id == 1, which works fine in an IPython Notebook on datascientistworkbench.
df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
df['name'][df['session_id']==1]
I wonder whether, after reading the CSV file, it is possible to somehow "switch" it to, or read it as, a SQL database. (I am pretty sure I am not using the correct terms, sorry about that!) What I want is to use SQL statements in the IPython notebook to choose specific rows with specific conditions, something like:
Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`
But I guess I need to figure out a way to convert the CSV file into another form so that I can use SQL statements. Many thanks!
Here is a quick primer on pandas and SQL, using the built-in sqlite3 package. Generally speaking, you can do all SQL operations in pandas in one way or another, but databases are of course useful. The first thing you need to do is store the original df in a SQL database so that you can query it. The steps are listed below.
import pandas as pd
import sqlite3
#read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
#connect to a database
conn = sqlite3.connect("Any_Database_Name.db")  # if the db does not exist, this creates an Any_Database_Name.db file in the current directory
#store your table in the database:
df.to_sql('Some_Table_Name', conn)
#read a SQL Query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
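To tie the steps above back to the original question, here is a minimal sketch that runs a filtered query and a GROUP BY against an in-memory SQLite database. The sample rows are made up for illustration; only the column names `name` and `session_id` and the table name `tabledata` come from the question.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for tabledata.csv (the real file's contents are unknown).
df = pd.DataFrame({
    "name": ["alice", "bob", "carol", "dave"],
    "session_id": [1, 1, 2, 3],
})

# An in-memory database avoids leaving a .db file on disk.
conn = sqlite3.connect(":memory:")
df.to_sql("tabledata", conn, index=False, if_exists="replace")

# SQL equivalent of df['name'][df['session_id'] == 1], with a bound parameter.
names = pd.read_sql(
    "SELECT name FROM tabledata WHERE session_id = ? ORDER BY name",
    conn,
    params=(1,),
)

# An aggregation in the spirit of the GROUP BY query in the question.
counts = pd.read_sql(
    "SELECT session_id, COUNT(*) AS n FROM tabledata "
    "GROUP BY session_id ORDER BY session_id",
    conn,
)
conn.close()
```

Both results come back as ordinary DataFrames, so you can mix SQL and pandas operations freely.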
Another answer suggested using SQLite. However, DuckDB is a much faster alternative to loading your data into SQLite.
First, loading your data will take time; second, SQLite is not optimized for analytical queries (e.g., aggregations).
Here's a full example you can run in a Jupyter notebook:
Installation
pip install jupysql duckdb duckdb-engine
Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine
Example
Load extension (%sql magic) and create in-memory database:
%load_ext sql
%sql duckdb://
Download some sample CSV data:
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")
Query:
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
See the JupySQL documentation for more details.
I have an Excel file with 2 sheets that I wish to manipulate using SQL in a Jupyter Notebook. I remember, a long time ago, going through a tutorial that explained you can use SQL within Python just by adding %sql before each line, but from Google searches I can't figure out how to get this to work. It seems I need to create a database and then a connection, etc. But the database is just going to be the pandas DataFrames I've already imported, so I don't need to connect to any external database, right? (Sorry, I know it must sound like such a stupid question, but I've never created my own databases before.)
Here's what I've tried:
import pandas as pd
import sqlite3
%load_ext sql
%sql sqlite://
path = (my file path)
orders = pd.read_excel(path,sheet_name = 1)
items = pd.read_excel(path,sheet_name = 2)
%sql select * from orders
Then I get an error:
* sqlite://
(sqlite3.OperationalError) no such table: orders
[SQL: select * from orders]
The goal is to load a csv file into an Azure SQL database from Python directly, that is, not by calling bcp.exe. The csv files will have the same number of fields as do the destination tables. It'd be nice to not have to create the format file bcp.exe requires (xml for +-400 fields for each of 16 separate tables).
Following the Pythonic approach, the idea is to try inserting the data and let SQL Server throw an exception if there is a type mismatch or some other problem.
If you don't want to use the bcp command to import the CSV file, you can use the Python pandas library.
Here's an example in which I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
Python code example:
import pandas as pd
import sqlalchemy
import urllib.parse
import pyodbc
# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
# read csv data to dataframe with pandas
# datatypes will be assumed
# pandas is smart but you can specify datatypes with the `dtype` parameter
df = pd.read_csv(r'C:\Users\leony\Desktop\test9.csv', header=None, names=['id', 'name', 'age'])
# write to sql table... pandas will use default column names and dtypes
df.to_sql('test9',engine,if_exists='append',index=False)
# add the 'dtype' parameter to specify datatypes if needed, e.g. dtype={'column1': sqlalchemy.VARCHAR(255), 'column2': sqlalchemy.DateTime}
Notes:
Get the connection string from the Azure Portal.
The UID format is [username]@[servername].
Running this script imports the data successfully.
Please refer to these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.
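As a rough illustration of the `dtype` parameter mentioned in the code comments above, here is a small sketch that you can run without an Azure connection. It is only a stand-in: it uses an in-memory SQLite connection, where pandas expects the dtype values as plain strings, whereas with a SQLAlchemy engine you would pass SQLAlchemy types instead. The sample rows are made up; the table and column names follow the example.

```python
import sqlite3

import pandas as pd

# Hypothetical rows standing in for the headerless test9.csv.
df = pd.DataFrame([[1, "Ann", 30], [2, "Bob", 41]], columns=["id", "name", "age"])

# SQLite stand-in for the Azure SQL engine (assumption for this sketch only).
conn = sqlite3.connect(":memory:")

# With a plain sqlite3 connection, dtype values are given as SQL type strings.
df.to_sql(
    "test9",
    conn,
    if_exists="append",
    index=False,
    dtype={"id": "INTEGER", "name": "VARCHAR(255)", "age": "INTEGER"},
)

# Inspect the declared column types and the number of rows written.
declared_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(test9)")}
row_count = conn.execute("SELECT COUNT(*) FROM test9").fetchone()[0]
conn.close()
```

Declaring the types up front means a mismatch surfaces as an error on insert rather than as a silently widened column.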
Currently I'm using the code below on Python 3.5, Windows to read in a parquet file.
import pandas as pd
parquetfilename = 'File1.parquet'
parquetFile = pd.read_parquet(parquetfilename, columns=['column1', 'column2'])
However, I'd like to do so without using pandas. What is the best way to do this? I'm using both Python 2.7 and 3.6 on Windows.
You can use duckdb for this. It's an embedded RDBMS similar to SQLite but with OLAP in mind. There's a nice Python API and a SQL function to import Parquet files:
import duckdb
conn = duckdb.connect(":memory:") # or a file name to persist the DB
# Keep in mind this doesn't support partitioned datasets,
# so you can only read one partition at a time
conn.execute("CREATE TABLE mydata AS SELECT * FROM parquet_scan('/path/to/mydata.parquet')")
# Export a query as CSV
conn.execute("COPY (SELECT * FROM mydata WHERE col = 'val') TO 'col_val.csv' WITH (HEADER 1, DELIMITER ',')")
Problem Statement:
I have multiple CSV files. I am cleaning them using Python and inserting them into SQL Server using bcp. Now I want to insert them into Greenplum instead of SQL Server. Please suggest a way to bulk insert directly from a Python DataFrame into a Greenplum table.
Solution (what I can think of):
The way I can think of is CSV -> DataFrame -> cleaning -> DataFrame -> CSV, then use gpload for the bulk load, and integrate it in a shell script for automation.
Does anyone have a good solution for this?
Issue in loading data directly from a DataFrame to a Greenplum table:
gpload asks for a file path. Can I pass a variable or a DataFrame to it instead? Is there any way to bulk load into Greenplum? I don't want to create a CSV or TXT file from the DataFrame and then load it into Greenplum.
I would use psycopg2 and the io libraries to do this. io is built-in and you can install psycopg2 using pip (or conda).
Basically, you write your dataframe to a string buffer ("memory file") in the csv format. Then you use psycopg2's copy_from function to bulk load/copy it to your table.
This should get you started:
import io
import pandas
import psycopg2
# Write your dataframe to memory as csv
csv_io = io.StringIO()
dataframe.to_csv(csv_io, sep='\t', header=False, index=False)
csv_io.seek(0)
# Connect to the GreenPlum database.
greenplum = psycopg2.connect(host='host', database='database', user='user', password='password')
gp_cursor = greenplum.cursor()
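# Copy the data from the buffer to the table.
# Note: psycopg2 2.9 and later quote the table name, so a schema-qualified name such as 'db.table' may need copy_expert instead.
gp_cursor.copy_from(csv_io, 'db.table')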
greenplum.commit()
# Close the GreenPlum cursor and connection.
gp_cursor.close()
greenplum.close()
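The buffer-building step can be tried on its own, without a database. The sketch below wraps it in a hypothetical helper, `dataframe_to_copy_buffer` (my name, not part of pandas or psycopg2), and writes missing values as `\N`, which is the default NULL marker that `copy_from` expects. It assumes your text values contain no tabs or newlines; those would need escaping first.

```python
import io

import pandas as pd


def dataframe_to_copy_buffer(dataframe):
    """Serialize a DataFrame as tab-separated text in memory (hypothetical helper)."""
    buffer = io.StringIO()
    # No header or index, tab-separated, and \N for missing values so that
    # copy_from (default sep='\t', null='\\N') loads them as NULL.
    dataframe.to_csv(buffer, sep="\t", header=False, index=False, na_rep="\\N")
    buffer.seek(0)  # rewind so the reader starts at the first row
    return buffer


# Made-up sample data to show what the buffer contains.
df = pd.DataFrame({"id": [1, 2], "name": ["ann", None]})
csv_io = dataframe_to_copy_buffer(df)
lines = csv_io.getvalue().splitlines()
```

The returned `csv_io` can then be passed to `copy_from` exactly as in the code above.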
Using the impyla module, I've downloaded the results of an impala query into a pandas dataframe, done analysis, and would now like to write the results back to a table on impala, or at least to an hdfs file.
However, I cannot find any information on how to do this, or even how to ssh into the impala shell and write the table from there.
What I'd like to do:
from impala.dbapi import connect
from impala.util import as_pandas
# connect to my host and port
conn=connect(host='myhost', port=111)
# create query to save table as pandas df
create_query = """
SELECT * FROM {}
""".format(my_table_name)
# run query on impala
cur = conn.cursor()
cur.execute(create_query)
# store results as pandas data frame
pandas_df = as_pandas(cur)
cur.close()
Once I've done whatever I need to do with pandas_df, I'd like to save those results back to Impala as a table.
# create query to save pandas_df back to impala
save_query = """
CREATE TABLE new_table AS
SELECT *
FROM pandas_df
"""
# run query on impala
cur = conn.cursor()
cur.execute(save_query)
cur.close()
The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from python, or even just save the table to hdfs. I'm writing this as a script for other users, so it's essential to have this all done within the script. Thanks so much!
You're going to love Ibis! It has the HDFS functions (put, namely) and wraps the Impala DML and DDL you'll need to make this easy.
The general approach I've used for something similar is to save your pandas table to a CSV, HDFS.put that on to the cluster, and then create a new table using that CSV as the data source.
You don't need Ibis for this, but it should make it a little bit easier and may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes, who wrote pandas).
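For the last step of that approach (creating a table that uses the uploaded CSV as its data source), the DDL can be generated from the DataFrame itself. This is only a sketch: `impala_type` and `external_table_ddl` are hypothetical helpers, the dtype mapping is a simplification, and the HDFS directory is a made-up path. It builds the statement as a string and does not talk to a cluster.

```python
import pandas as pd


def impala_type(dtype):
    """Map a pandas dtype to an Impala column type (simplified, hypothetical mapping)."""
    if pd.api.types.is_bool_dtype(dtype):
        return "BOOLEAN"
    if pd.api.types.is_integer_dtype(dtype):
        return "BIGINT"
    if pd.api.types.is_float_dtype(dtype):
        return "DOUBLE"
    return "STRING"  # fall back to STRING for text and anything else


def external_table_ddl(df, table_name, hdfs_dir):
    """Build a CREATE EXTERNAL TABLE statement over a directory of CSV files."""
    columns = ", ".join(
        f"`{name}` {impala_type(dtype)}" for name, dtype in df.dtypes.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table_name} ({columns}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"STORED AS TEXTFILE LOCATION '{hdfs_dir}'"
    )


# Made-up results; the CSV uploaded to HDFS should be written without header or index.
pandas_df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})
ddl = external_table_ddl(pandas_df, "new_table", "/user/me/new_table")
```

The resulting string could then be passed to `cur.execute(...)` on the existing impyla connection once the CSV is in place.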
I am trying to do the same thing, and I figured out a way to do it using an example provided with impyla:
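import numpy as np
import pandas as pd
df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name="test_df", con=conn, flavor="mysql")  # note: the flavor argument was removed in later pandas releases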
This works, and the resulting table in Impala (MySQL backend) is fine.
However, I got stuck on getting text values in, as Impala tries to analyze the columns and I get cast errors. (It would be really nice if impyla could implicitly cast from string to [var]char(N).)