Create a table in SQLite using a dataframe - Python

I'm new to sqlite3 and trying to understand how to create a table in a SQL database from my existing dataframe. I already have a database that I created as "pythonsqlite.db".
# import my csv to python
import pandas as pd
my_data = pd.read_csv("my_input_file.csv")
## connect to database
import sqlite3
conn = sqlite3.connect("pythonsqlite.db")
## push the dataframe to sql
my_data.to_sql("my_data", conn, if_exists="replace")
## create the table
conn.execute(
    """
    create table my_table as
    select * from my_data
    """)
However, when I navigate to SQLite Studio and check the tables under my database, I cannot see the table I've created. I'd really appreciate it if someone could tell me what I'm missing here.

I replaced just one part of the code: instead of 'read_csv' I created a small dataframe (see below). I think the issue is with the name of your script (for example: pandas.py), which would shadow the pandas package.
import pandas as pd
# my_data = pd.read_csv("my_input_file.csv")
columns = ['a', 'b']
my_data = pd.DataFrame([[1, 2], [3, 4]], columns=columns)
## connect to database
import sqlite3
conn = sqlite3.connect("pythonsqlite.db")
## push the dataframe to sql
my_data.to_sql("my_data", conn, if_exists="replace")
## create the table
conn.execute(
    """
    create table my_table as
    select * from my_data
    """)
I ran it and didn't run into any problem.
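One more thing worth checking (my own suggestion, not part of the answer above): commit and close the connection before opening the database in SQLite Studio, so the new table is flushed to the file and the file is released. A minimal sketch, assuming the same my_input_file.csv and pythonsqlite.db:
import sqlite3
import pandas as pd

my_data = pd.read_csv("my_input_file.csv")
conn = sqlite3.connect("pythonsqlite.db")
# to_sql creates (or replaces) the my_data table
my_data.to_sql("my_data", conn, if_exists="replace")
# copy it into a second table
conn.execute("create table if not exists my_table as select * from my_data")
conn.commit()  # make sure the new table is written to the database file
conn.close()   # release the file so external tools like SQLite Studio see the change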

Related

Missing column names when importing data from database (Python + PostgreSQL)

I am trying to import some data from the database (PostgreSQL) to work with it in Python. I tried the code below, which seems quite similar to the examples I've found on the internet.
import psycopg2
import sqlalchemy as db
import pandas as pd
engine = db.create_engine('database specifications')
connection = engine.connect()
metadata = db.MetaData()
data = db.Table(tabela, metadata, schema=shema, autoload=True, autoload_with=engine)
query = db.select([data])
ResultProxy = connection.execute(query)
ResultSet = ResultProxy.fetchall()
df = pd.DataFrame(ResultSet)
However, it returns data without column names. What did I forget?
It turned out the only thing needed was adding:
columns = data.columns.keys()
df.columns = columns
There is a great debate about that in this thread.
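As an aside (not part of the original answer), pandas can also pull the column names in one step if you let it read the table directly. A minimal sketch, reusing the engine, tabela and shema variables from the question:
import pandas as pd
import sqlalchemy as db

engine = db.create_engine('database specifications')
# read_sql_table returns a DataFrame whose columns already match the table's columns
df = pd.read_sql_table(tabela, con=engine, schema=shema)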

Insert Into Hive Using Pyhive invoke an error

I am using pyhive to interact with hive.
The SELECT statement works well using the code below.
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST")
cur = conn.cursor()
# Import pandas
import pandas as pd
# Store select query in dataframe
all_tables = pd.read_sql("SELECT * FROM table LIMIT 5", conn)
print all_tables
# Using the cursor
cur = conn.cursor()
cur.execute('SELECT * FROM table LIMIT 5')
print cur.fetchall()
Up to here there is no problem. The problem comes when I want to INSERT into Hive.
Let's say I want to execute this query: INSERT INTO table2 SELECT Col1, Col2 FROM table1;
I tried :
cur.execute('INSERT INTO table2 SELECT Col1, Col2 FROM table1')
I receive this error:
pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage=u'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState=u'08S01', infoMessages=[u'*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', u'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:388', u'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:244', u'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:279', u'org.apache.hive.service.cli.operation.Operation:run:Operation.java:324', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:499', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:475', u'sun.reflect.GeneratedMethodAccessor81:invoke::-1', u'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', u'java.lang.reflect.Method:invoke:Method.java:498', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', u'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', u'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', u'java.security.AccessController:doPrivileged:AccessController.java:-2', u'javax.security.auth.Subject:doAs:Subject.java:422', u'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', u'com.sun.proxy.$Proxy33:executeStatement::-1', u'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:270', u'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', u'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', u'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', u'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', u'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', u'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', u'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', u'java.lang.Thread:run:Thread.java:748'], statusCode=3), operationHandle=None)
If I execute the same query in Hive directly, everything runs well.
Any thoughts?
NB: All my tables are external
CREATE EXTERNAL TABLE IF NOT EXISTS table ( col1 String, col2 String) stored as orc LOCATION 's3://somewhere' tblproperties ("orc.compress"="SNAPPY");
The solution was to add the username in the connection line: conn = hive.Connection(host="HOST", username="USER")
From what I understand, Hive queries are divided into several types of operations (jobs). For a simple query (i.e. SELECT * FROM table), the data is fetched directly using the table metadata, so no MapReduce/Tez job or temporary tables are needed to perform the query. But as soon as you switch to more complicated queries (i.e. using JOINs), you end up with the same error.
The file code looks like this:
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST", username="USER")
cur = conn.cursor()
query = "INSERT INTO table2 SELECT Col1, Col2 FROM table1"
cur.execute(query)
So maybe it needs permissions or something similar. I will search more about this behavior and update the answer.
I'm not sure how to insert a pandas df using pyhive, but if you have pyspark installed, one option is to convert it to a Spark dataframe and use pyspark to do it.
from pyspark import SparkContext
from pyspark.sql import HiveContext
# In the PySpark shell a sqlContext is usually predefined; HiveContext is the one that can write to Hive tables
sqlContext = HiveContext(SparkContext.getOrCreate())
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.write.mode('append').saveAsTable('database_name.table_name')
You can do the following using Spark.
from pyspark import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(SparkContext.getOrCreate())
# convert the pandas data frame to a spark data frame
spark_df = sqlContext.createDataFrame(pandas_df)
# register the spark data frame as a temp table
spark_df.registerTempTable("my_temp_table")
# execute the insert statement using spark sql
sqlContext.sql("insert into hive_table select * from my_temp_table")
This will insert the data from your data frame into a Hive table.
Hope this helps you
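As an aside (not from either answer above): on Spark 2.x and later, SparkSession replaces SQLContext/HiveContext, and roughly the same pattern looks like this sketch:
from pyspark.sql import SparkSession

# enableHiveSupport is needed so saveAsTable/sql can see Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
spark_df.createOrReplaceTempView("my_temp_table")
spark.sql("insert into hive_table select * from my_temp_table")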

How do I insert my Python dictionary into my SQL Server database table?

I have a dictionary with 3 keys which correspond to field names in a SQL Server table. The values of these keys come from an Excel file, and I store this dictionary in a dataframe which I now need to insert into a SQL table. This can all be seen in the code below:
import pandas as pd
import pymssql
df=[]
fp = "file path"
data = pd.read_excel(fp,sheetname ="CRM View" )
row_date = data.loc[3, ]
row_sita = "ABZPD"
row_event = data.iloc[12, :]
df = pd.DataFrame({'date': row_date,
'sita': row_sita,
'event': row_event
}, index=None)
df = df[4:]
df = df.fillna("")
print(df)
My question is how do I insert this dictionary into a SQL table now?
Also, as a side note, this code is part of a loop which needs to go through several Excel files one by one, insert the data into the dictionary, then into SQL, then clear the dictionary and start again with the next Excel file.
You could try something like this:
import MySQLdb
# connect
conn = MySQLdb.connect("127.0.0.1", "username", "password", "database")
x = conn.cursor()
# write (use parameterized queries rather than string formatting)
x.execute("INSERT INTO table (row_date, sita, event) VALUES (%s, %s, %s)",
          (row_date, sita, event))
# close
conn.commit()
conn.close()
You might have to change it a little based on your SQL restrictions, but it should give you a good start anyway.
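Note that the snippet above uses MySQLdb while the question targets SQL Server. Since the question already imports pymssql, here is a minimal sketch of the same insert with pymssql (the connection details and table name are placeholders):
import pymssql

conn = pymssql.connect(server="server_name", user="username", password="password", database="db_name")
cur = conn.cursor()
# pymssql also uses %s placeholders for parameterized queries
cur.execute("INSERT INTO my_table (row_date, sita, event) VALUES (%s, %s, %s)",
            (row_date, sita, event))
conn.commit()
conn.close()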
For the pandas dataframe, you can use the pandas built-in method to_sql to store it in the database. The following is the way to use it:
from urllib.parse import quote_plus  # on Python 2: from urllib import quote_plus
import sqlalchemy as sa

params = quote_plus("DRIVER={};SERVER={};DATABASE={};Trusted_Connection=True;".format(
    "{SQL Server}", "<db_server_url>", "<db_name>"))
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = sa.create_engine(conn_str)
df.to_sql("<table_name>", engine, schema="<schema_name>", if_exists="append", index=False)
For this method you will need to install the sqlalchemy package (and pyodbc).
pip install sqlalchemy
You will also need to set up the MS SQL DSN on the machine.
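Since the question mentions looping over several Excel files, here is a minimal sketch of that pattern with to_sql and if_exists="append" (the folder pattern, sheet name and table name are placeholders; engine is the SQLAlchemy engine created above):
import glob
import pandas as pd

for fp in glob.glob("excel_files/*.xlsx"):
    data = pd.read_excel(fp, sheet_name="CRM View")
    df = pd.DataFrame({'date': data.loc[3, :],
                       'sita': "ABZPD",
                       'event': data.iloc[12, :]})
    df = df[4:].fillna("")
    # append each file's rows; the dataframe is rebuilt from scratch on every iteration
    df.to_sql("my_table", engine, if_exists="append", index=False)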

SQLite Python Blaze - Attempting to create a table after dropping a table of same name returns old schema

I am trying to work out why the schema of a dropped table comes back when I attempt to create a table with the same name using a different set of column names.
After dropping the table, I can confirm in an SQLite explorer that the table has disappeared. When I then try to load the new file via odo, it returns the error "Column names of incoming data don't match column names of existing SQL table", and I can see the same table re-created in the database using the previously dropped schema! I tried a VACUUM statement after dropping the table, but the issue remains.
I can create the table fine using a different table name, but I am totally confused as to why I can't reuse the name of the table I just dropped.
import sqlite3
import pandas as pd
from odo import odo, discover, resource, dshape
conn = sqlite3.connect(dbfile)
c = conn.cursor()
c.execute("DROP TABLE <table1>")
c.execute("VACUUM")
importfile = pd.read_csv(csvfile)
odo(importfile, 'sqlite:///<db_path>::<table1>')
ValueError: Column names of incoming data don't match column names of existing SQL table Names in SQL table:
For comparison, the following plain sqlite3 session, which drops and then recreates a table with the same name (committing after each step), works without any such issue:
import sqlite3
import pandas as pd
from odo import odo, discover, resource, dshape

conn = sqlite3.connect('test.db')
cursor = conn.cursor()
table = """ CREATE TABLE IF NOT EXISTS TABLE1 (
        id integer PRIMARY KEY,
        name text NOT NULL
    ); """
cursor.execute(table)
conn.commit()  # Save table into database.
cursor.execute(''' DROP TABLE TABLE1 ''')
conn.commit()  # Save that table has been dropped.
cursor.execute(table)
conn.commit()  # Save that table has been created.
conn.close()
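As a possible workaround (my own suggestion, not from the original post): pandas can create the table directly from the CSV, bypassing odo entirely. A minimal sketch, reusing the dbfile and csvfile names from the question:
import sqlite3
import pandas as pd

conn = sqlite3.connect(dbfile)
importfile = pd.read_csv(csvfile)
# if_exists="replace" drops any existing table1 and recreates it with the new columns
importfile.to_sql("table1", conn, if_exists="replace", index=False)
conn.commit()
conn.close()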

Python to group SQL data

I used to read the data from a CSV file, but I have now imported all the CSV data into a SQL database, and I am having difficulty extracting that data from SQL using Python.
My original code for reading the CSV is like this:
import pandas as pd
stock_data = pd.read_csv(filepath_or_buffer='stock_data_w.csv', parse_dates=[u'date'], encoding='gbk')
stock_data[u'change_weekly'] = stock_data.groupby(u'code')[u'change'].shift(-1)
Now I want to read the data from SQL instead. Here is my code, but it doesn't work and I am not sure how to sort it out:
import pandas as pd
import MySQLdb
db = MySQLdb.connect(host='localhost', user='root', passwd='232323', db='test', port=3306)
cur = db.cursor()
cur.execute("SELECT * FROM stock_data_w")
stock_data = pd.DataFrame(data=cur.fetchall(), columns=[i[0] for i in cur.description])
stock_data[u'change_weekly'] = stock_data.groupby(u'code')[u'change'].shift(-1)
the error is: "raise PandasError('DataFrame constructor not properly called!') pandas.core.common.PandasError: DataFrame constructor not properly called!"
Use the approach below to convert your cursor results into a data frame.
stock_data = pd.DataFrame(data=cursor.fetchall(), index=None,
                          columns=cursor.keys())
print stock_data
With mysqldb, use columns=[i[0] for i in cursor.description] instead, since .keys() only exists on SQLAlchemy results.
Or make your connection with SQLAlchemy and use:
stock_data = pd.read_sql("SELECT * FROM stock_data_w",
                         con=cnx, parse_dates=['date'])
I'm not sure whether mysql.connector is supported in pandas read_sql(). You can give it a try and let us know :)
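For completeness, a minimal sketch of the SQLAlchemy route mentioned above; the connection string simply mirrors the MySQLdb credentials from the question:
import pandas as pd
from sqlalchemy import create_engine

# same host/user/password/database/port as the MySQLdb.connect call above
cnx = create_engine("mysql+mysqldb://root:232323@localhost:3306/test")
stock_data = pd.read_sql("SELECT * FROM stock_data_w", con=cnx, parse_dates=['date'])
stock_data['change_weekly'] = stock_data.groupby('code')['change'].shift(-1)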
