INSERT INTO Hive using PyHive raises an error - python

I am using PyHive to interact with Hive.
The SELECT statement works fine using the code below.
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST")
cur = conn.cursor()
# Import pandas
import pandas as pd
# Store select query in dataframe
all_tables = pd.read_sql("SELECT * FROM table LIMIT 5", conn)
print(all_tables)
# Using cursor
cur = conn.cursor()
cur.execute('SELECT * FROM table LIMIT 5')
print(cur.fetchall())
Up to here there is no problem. The trouble starts when I want to INSERT into Hive.
Let's say I want to execute this query: INSERT INTO table2 SELECT Col1, Col2 FROM table1;
I tried:
cur.execute('INSERT INTO table2 SELECT Col1, Col2 FROM table1')
I receive this error:
pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage=u'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState=u'08S01', infoMessages=[u'*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', u'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:388', u'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:244', u'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:279', u'org.apache.hive.service.cli.operation.Operation:run:Operation.java:324', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:499', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:475', u'sun.reflect.GeneratedMethodAccessor81:invoke::-1', u'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', u'java.lang.reflect.Method:invoke:Method.java:498', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', u'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', u'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', u'java.security.AccessController:doPrivileged:AccessController.java:-2', u'javax.security.auth.Subject:doAs:Subject.java:422', u'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', u'com.sun.proxy.$Proxy33:executeStatement::-1', u'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:270', u'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507', 
u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', u'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', u'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', u'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', u'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', u'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', u'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', u'java.lang.Thread:run:Thread.java:748'], statusCode=3), operationHandle=None)
If I execute the same query in Hive directly, everything runs well.
Any thoughts?
NB: All my tables are external
CREATE EXTERNAL TABLE IF NOT EXISTS table ( col1 String, col2 String) stored as orc LOCATION 's3://somewhere' tblproperties ("orc.compress"="SNAPPY");

The solution was to add the username to the connection line: conn = hive.Connection(host="HOST", username="USER")
From what I understand, Hive splits queries into different kinds of operations (jobs). A simple query (e.g. SELECT * FROM table) can be answered by reading the data directly, so no MapReduce job or temporary tables are needed. But as soon as you switch to more complicated queries (e.g. ones using JOINs), a job has to be launched and you end up with the same error.
The final code looks like this:
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST", username="USER")
cur = conn.cursor()
query = "INSERT INTO table2 SELECT Col1, Col2 FROM table1"
cur.execute(query)
So maybe it needs a permission or something similar. I will look further into this behavior and update the answer.
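As a rough sketch of the fix above, the connection arguments can be built in one place so that a username is always sent. The helper name hive_connection_kwargs and the fallback to the local login name are assumptions made for illustration, and port 10000 is only the usual HiveServer2 default, so it may differ on your cluster.

```python
import getpass


def hive_connection_kwargs(host, username=None, port=10000):
    """Return keyword arguments for pyhive.hive.Connection, always with a username.

    Hypothetical helper: falls back to the local login name when no
    username is given (an assumption; your cluster may expect another user).
    """
    return {
        "host": host,
        "port": port,
        "username": username if username else getpass.getuser(),
    }


def connect(host, username=None):
    # Imported lazily so the helper above can be used without PyHive installed.
    from pyhive import hive
    return hive.Connection(**hive_connection_kwargs(host, username))


kwargs = hive_connection_kwargs("HOST", "USER")
print(kwargs)
```

With this in place, connect("HOST", "USER") would replace the hive.Connection(...) line, and the INSERT is then run through the cursor as before.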

I'm not sure how to insert a pandas df using PyHive, but if you have pyspark installed, one option is to convert it to a Spark df and write it with pyspark.
from pyspark.sql import SparkSession
# Hive support is needed so that saveAsTable writes to the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.mode('append').saveAsTable('database_name.table_name')

You can do the following using Spark.
from pyspark.sql import SparkSession
# create a session with Hive support
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# convert the pandas data frame to a spark data frame
spark_df = spark.createDataFrame(pandas_df)
# register the spark data frame as a temp view
spark_df.createOrReplaceTempView("my_temp_table")
# execute the insert statement using spark sql
spark.sql("insert into hive_table select * from my_temp_table")
This will insert the data in your data frame into a Hive table.
Hope this helps you

Related

How to Implement Natural Join in Python without using Pandas

Should be done using single function
Shouldn't use Pandas or merge function or any other inbuilt database libraries
You can use a native driver such as psycopg2 for PostgreSQL and let the database do the join: https://www.psycopg.org/docs/usage.html
import psycopg2
# Connect to an existing database
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
# Query the database and obtain data as Python objects
cur.execute("""
SELECT * FROM test t
left join test1 t1 on (t.t1_id = t1.id);
""")
fetched_rows = cur.fetchall()
# Make the changes to the database persistent
conn.commit()
# Close communication with the database
cur.close()
conn.close()
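If the join really has to be done in plain Python, in a single function and without Pandas or a database library, a hash-based natural join is one option. The sketch below assumes each table is a list of dicts with the same keys in every row and hashable values; the function name natural_join and the sample data are made up for illustration.

```python
def natural_join(left, right):
    """Natural join of two lists of dicts on every column name they share."""
    if not left or not right:
        return []
    # Columns present in both tables (taken from the first row of each).
    common = [col for col in left[0] if col in right[0]]
    # Index the right table by its values in the common columns.
    index = {}
    for row in right:
        key = tuple(row[col] for col in common)
        index.setdefault(key, []).append(row)
    # Combine every left row with all right rows that share the same key.
    joined = []
    for row in left:
        key = tuple(row[col] for col in common)
        for match in index.get(key, []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


employees = [
    {"dept_id": 1, "name": "Ann"},
    {"dept_id": 2, "name": "Bob"},
    {"dept_id": 3, "name": "Cy"},
]
departments = [
    {"dept_id": 1, "dept": "Sales"},
    {"dept_id": 2, "dept": "IT"},
]
result = natural_join(employees, departments)
print(result)
```

Rows without a partner (here the employee with dept_id 3) are dropped, as in an SQL NATURAL JOIN; if the tables share no column, the function falls back to a cross product.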

How do I insert my Python dictionary into my SQL Server database table?

I have a dictionary with 3 keys which correspond to field names in a SQL Server table. The values of these keys come from an excel file and I store this dictionary in a dataframe which I now need to insert into a SQL table. This can all be seen in the code below:
import pandas as pd
import pymssql
df=[]
fp = "file path"
data = pd.read_excel(fp, sheet_name="CRM View")
row_date = data.loc[3, ]
row_sita = "ABZPD"
row_event = data.iloc[12, :]
df = pd.DataFrame({'date': row_date,
'sita': row_sita,
'event': row_event
}, index=None)
df = df[4:]
df = df.fillna("")
print(df)
My question is how do I insert this dictionary into a SQL table now?
Also, as a side note, this code is part of a loop which needs to go through several excel files one by one, insert the data into dictionary then into SQL then delete the data in the dictionary and start again with the next excel file.
You could try something like this:
import MySQLdb
# connect
conn = MySQLdb.connect("127.0.0.1", "username", "password", "database")
x = conn.cursor()
# write (let the driver quote the values instead of formatting them into the string)
x.execute('INSERT INTO table (row_date, sita, event) VALUES (%s, %s, %s)', (row_date, sita, event))
# close
conn.commit()
conn.close()
You might have to change it a little based on your SQL restrictions, but should give you a good start anyway.
For the pandas dataframe, you can use the pandas built-in method to_sql to store in db. Following is the way to use it.
from urllib.parse import quote_plus  # Python 3; on Python 2 use urllib.quote_plus
import sqlalchemy as sa
params = quote_plus("DRIVER={};SERVER={};DATABASE={};Trusted_Connection=True;".format("{SQL Server}",
"<db_server_url>",
"<db_name>"))
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = sa.create_engine(conn_str)
# "<table_name>" and "<schema_name>" are placeholders for your own names
df.to_sql("<table_name>", engine, schema="<schema_name>", if_exists="append", index=False)
For this method you will need to install the sqlalchemy package, plus pyodbc, which the mssql+pyodbc connection string relies on.
pip install sqlalchemy pyodbc
You will also need to set up the MSSQL DSN on the machine.
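For the loop over several Excel files, one possible shape is a small function that takes the rows as a list of dicts and inserts them with a parameterized executemany. This is only a sketch: sqlite3 stands in for SQL Server so that it runs anywhere, the table name crm and the sample values are invented, and the placeholder argument is an assumption you would set to "%s" for pymssql. A DataFrame can be turned into such a list with df.to_dict("records").

```python
import sqlite3


def insert_records(conn, table, records, placeholder="?"):
    """Insert a list of dicts (all with the same keys) via a parameterized query."""
    if not records:
        return 0
    columns = list(records[0])
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table,
        ", ".join(columns),
        ", ".join([placeholder] * len(columns)),
    )
    cur = conn.cursor()
    # One tuple of values per record, in the same order as the column list.
    cur.executemany(sql, [tuple(rec[col] for col in columns) for rec in records])
    conn.commit()
    return len(records)


# Stand-in database; with pymssql, pass placeholder="%s" instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crm (date TEXT, sita TEXT, event TEXT)")
records = [
    {"date": "2020-01-01", "sita": "ABZPD", "event": "open"},
    {"date": "2020-01-02", "sita": "ABZPD", "event": "close"},
]
inserted = insert_records(conn, "crm", records)
rows = conn.execute("SELECT date, sita, event FROM crm ORDER BY date").fetchall()
print(inserted, rows)
```

Inside the loop you would build records for one file, call insert_records, and then move on to the next file; nothing needs to be deleted explicitly because records is rebuilt on each pass.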

Create a table in SQLite by using a dataframe

I'm new to sqlite3 and trying to understand how to create a table in the SQL environment from my existing dataframe. I already have a database that I created as "pythonsqlite.db"
#import my csv to python
import pandas as pd
my_data = pd.read_csv("my_input_file.csv")
## connect to database
import sqlite3
conn = sqlite3.connect("pythonsqlite.db")
##push the dataframe to sql
my_data.to_sql("my_data", conn, if_exists="replace")
##create the table
conn.execute(
"""
create table my_table as
select * from my_data
""")
However, when I open SQLiteStudio and check the tables under my database, I cannot see the table I've created. I'd really appreciate it if someone could tell me what I'm missing here.
I replaced just one part of the code: instead of 'read_csv' I create a small dataframe (see below). I think the issue may be with the name of your script (for example, pandas.py).
import pandas as pd
# my_data = pd.read_csv("my_input_file.csv")
columns = ['a','b']
my_data = pd.DataFrame([[1, 2], [3, 4]], columns=columns)
## connect to database
import sqlite3
conn = sqlite3.connect("pythonsqlite.db")
##push the dataframe to sql
my_data.to_sql("my_data", conn, if_exists="replace")
##create the table
conn.execute(
"""
create table my_table as
select * from my_data
""")
I ran it and I don't seem to have a problem.
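One way to check from Python whether the table really exists is to list the entries in sqlite_master after committing. The sketch below is an assumption-laden stand-in: it uses an in-memory database and plain rows instead of the DataFrame so that it runs without pandas or a file on disk. It may also be worth confirming that SQLiteStudio opens the same file, since a relative name like "pythonsqlite.db" is resolved against the script's current working directory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for "pythonsqlite.db"

# Stand-in for my_data.to_sql("my_data", conn, if_exists="replace")
conn.execute("CREATE TABLE my_data (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO my_data VALUES (?, ?)", [(1, 2), (3, 4)])

# Create the second table from the first one, then make it persistent.
conn.execute("CREATE TABLE my_table AS SELECT * FROM my_data")
conn.commit()

# List every table the database file knows about.
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
print(tables, row_count)
```

If my_table shows up here but not in SQLiteStudio, the two are most likely looking at different database files, or the view in SQLiteStudio needs a refresh.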

Querying json object in dataframe using Pyspark

I have a MySQL table with the following schema:
id-int
path-varchar
info-json {"name":"pat", "address":"NY, USA"....}
I used the JDBC driver to connect PySpark to MySQL. I can retrieve data from MySQL using
df = sqlContext.sql("select * from dbTable")
This query works fine. My question is: how can I query the "info" column? For example, the query below works in the MySQL shell and retrieves data, but it is not supported in PySpark (2+).
select id, info->"$.name" from dbTable where info->"$.name"='pat'
There is already a function named get_json_object:
from pyspark.sql.functions import get_json_object
# select the extracted field as a column
res = df.select(get_json_object(df['info'], "$.name").alias('name'))
# or filter the rows on the extracted field
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')
For your situation:
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
properties={'user': 'hive', 'password': '123456'})
df.createOrReplaceTempView('test_json')
res = spark.sql("""
select col_json,get_json_object(col_json,'$.name') from test_json
""")
res.show()
Spark SQL is almost like Hive SQL; see
https://cwiki.apache.org/confluence/display/Hive/Home

Can I export a Python Pandas dataframe to MS SQL?

I am using pymssql and the Pandas sql package to load data from SQL into a Pandas dataframe with frame_query.
I would like to send it back to the SQL database using write_frame, but I haven't been able to find much documentation on this. In particular, there is a parameter flavor='sqlite'. Does this mean that so far Pandas can only export to SQLite? My firm is using MS SQL Server 2008 so I need to export to that.
Unfortunately, yes. At the moment sqlite is the only "flavor" supported by write_frame. See https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L155
def write_frame(frame, name=None, con=None, flavor='sqlite'):
    """
    Write records stored in a DataFrame to SQLite. The index will currently be
    dropped
    """
    if flavor == 'sqlite':
        schema = get_sqlite_schema(frame, name)
    else:
        raise NotImplementedError
Writing a simple write_frame should be fairly easy, though. For example, something like this might work (untested!):
import pymssql
conn = pymssql.connect(host='SQL01', user='user', password='password', database='mydatabase')
cur = conn.cursor()
# frame is your dataframe
wildcards = ','.join(['?'] * len(frame.columns))
data = [tuple(x) for x in frame.values]
table_name = 'Table'
cur.executemany("INSERT INTO %s VALUES(%s)" % (table_name, wildcards), data)
conn.commit()
Just to save some time for anyone else who tries to use this: it turns out the line
wildcards = ','.join(['?'] * len(frame.columns))
should be:
wildcards = ','.join(['%s'] * len(frame.columns))
Hope that helps
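The reason for the change is most likely the driver's DB-API paramstyle: sqlite3 uses "qmark" ("?"), while pymssql, as far as I know, uses "pyformat" ("%s"). A small sketch that picks the marker from the paramstyle is shown below; the function name make_insert, the MARKERS table and the table name my_table are made up for illustration, and only the three paramstyles listed are handled.

```python
import sqlite3

# Marker used by each DB-API paramstyle covered in this sketch.
MARKERS = {"qmark": "?", "format": "%s", "pyformat": "%s"}


def make_insert(table_name, n_columns, paramstyle):
    """Build an INSERT statement with one placeholder per column."""
    wildcards = ",".join([MARKERS[paramstyle]] * n_columns)
    return "INSERT INTO {} VALUES({})".format(table_name, wildcards)


# sqlite3 exposes its paramstyle as a module attribute.
sqlite_sql = make_insert("my_table", 3, sqlite3.paramstyle)
# For pymssql you would pass pymssql.paramstyle; "pyformat" is assumed here.
mssql_sql = make_insert("my_table", 3, "pyformat")
print(sqlite_sql)
print(mssql_sql)
```

The resulting string is then passed to cur.executemany together with the list of row tuples, exactly as in the answer above.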
