How to Implement Natural Join in Python without using Pandas - python

It should be done using a single function.
It shouldn't use Pandas, a merge function, or any other built-in database library.

You can use a native driver such as psycopg2 for Postgres: https://www.psycopg.org/docs/usage.html
import psycopg2
# Connect to an existing database
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
# Query the database and obtain data as Python objects
cur.execute("""
SELECT * FROM test t
left join test1 t1 on (t.t1_id = t1.id);
""")
fetched_rows = cur.fetchall()
# commit() is only needed when data was modified; a plain SELECT works without it
conn.commit()
# Close communication with the database
cur.close()
conn.close()
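If no database library may be used at all, here is a minimal sketch of a natural join written as a single plain-Python function; the tables are just lists of dicts, and the table and column names below are only illustrative.
def natural_join(left, right):
    # The join keys are the column names the two tables have in common.
    common = set(left[0]) & set(right[0]) if left and right else set()
    joined = []
    for l_row in left:
        for r_row in right:
            if all(l_row[c] == r_row[c] for c in common):
                merged = dict(l_row)
                merged.update(r_row)
                joined.append(merged)
    return joined

# Example usage with two small in-memory "tables"
employees = [{"dept_id": 1, "name": "Ann"}, {"dept_id": 2, "name": "Bob"}]
departments = [{"dept_id": 1, "dept": "Sales"}]
print(natural_join(employees, departments))
# [{'dept_id': 1, 'name': 'Ann', 'dept': 'Sales'}]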

Related

Insert Into Hive Using Pyhive invoke an error

I am using pyhive to interact with Hive.
The SELECT statement works well using the code below.
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST")
cur = conn.cursor()
# Import pandas
import pandas as pd
# Store the result of the SELECT query in a dataframe
all_tables = pd.read_sql("SELECT * FROM table LIMIT 5", conn)
print(all_tables)
# Using the cursor
cur = conn.cursor()
cur.execute('SELECT * FROM table LIMIT 5')
print(cur.fetchall())
Up to here there is no problem. The trouble starts when I want to INSERT into Hive.
Let's say I want to execute this query: INSERT INTO table2 SELECT Col1, Col2 FROM table1;
I tried:
cur.execute('INSERT INTO table2 SELECT Col1, Col2 FROM table1')
I receive this error:
pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage=u'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState=u'08S01', infoMessages=[u'*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', u'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:388', u'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:244', u'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:279', u'org.apache.hive.service.cli.operation.Operation:run:Operation.java:324', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:499', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:475', u'sun.reflect.GeneratedMethodAccessor81:invoke::-1', u'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', u'java.lang.reflect.Method:invoke:Method.java:498', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', u'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', u'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', u'java.security.AccessController:doPrivileged:AccessController.java:-2', u'javax.security.auth.Subject:doAs:Subject.java:422', u'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', u'com.sun.proxy.$Proxy33:executeStatement::-1', u'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:270', u'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', u'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', u'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', u'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', u'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', u'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', u'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', u'java.lang.Thread:run:Thread.java:748'], statusCode=3), operationHandle=None)
If I execute the same query in Hive directly, everything runs well.
Any thoughts?
NB: all my tables are external:
CREATE EXTERNAL TABLE IF NOT EXISTS table ( col1 String, col2 String) stored as orc LOCATION 's3://somewhere' tblproperties ("orc.compress"="SNAPPY");
The solution was to add the username in the connection line: conn = hive.Connection(host="HOST", username="USER").
From what I understand, Hive queries are divided into many types of operations (jobs). While you are performing a simple query (i.e. SELECT * FROM table), data is read from the Hive metastore and no MapReduce/Tez job or temp tables are needed. But as soon as you switch to more complicated queries (i.e. ones using JOINs or INSERT ... SELECT), you end up with the same error.
The full code of the file looks like this:
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST", username="USER")
cur = conn.cursor()
query = "INSERT INTO table2 SELECT Col1, Col2 FROM table1"
cur.execute(query)
So maybe it needs permissions or something; I will look into this behavior further and update the answer.
I'm not sure how to insert a pandas DataFrame using pyhive, but if you have PySpark installed, one option is to convert it to a Spark DataFrame and let PySpark do the insert.
from pyspark.sql import SparkSession
# A Spark session with Hive support is needed to write to Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.mode('append').saveAsTable('database_name.table_name')
You can do the following using Spark:
from pyspark.sql import SparkSession
# create a Spark session with Hive support
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# convert the pandas data frame to a Spark data frame
spark_df = spark.createDataFrame(pandas_df)
# register the Spark data frame as a temporary view
spark_df.createOrReplaceTempView("my_temp_table")
# execute the insert statement using Spark SQL
spark.sql("insert into hive_table select * from my_temp_table")
This will insert the data in your data frame into a Hive table.
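For completeness, pandas_df in the two Spark snippets above is assumed to be an ordinary pandas DataFrame, e.g.:
import pandas as pd
pandas_df = pd.DataFrame({"Col1": [1, 2], "Col2": ["a", "b"]})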
Hope this helps you

SQL connector using Python

I have a SQL database which I wish to query from Python. I have the following code:
sql='select * from mf where frequency=220258.0;'
cur.execute(sql)
When I use the same SELECT command in sqlite3 directly it works, but through Python no database entries are output.
What am I doing wrong?
Consider this SQLite database.
$ sqlite3 so.sqlite3 .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE t (f1 integer, f2 text);
INSERT INTO "t" VALUES(1,'foo');
INSERT INTO "t" VALUES(2,'bar');
INSERT INTO "t" VALUES(3,'baz');
COMMIT;
Python connects and queries this database like this.
import sqlite3
con = sqlite3.connect("so.sqlite3")
cur = con.cursor()
cur.execute("select * from t where f1 = 2")
print(cur.fetchone())
Output:
(2, 'bar')
You have to use one of cur.fetchone(), cur.fetchall(), or cur.fetchmany() to get rows from the cursor. Just doing cur.execute() does not return the rows.
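For instance, to see every matching row you can fetch them all or iterate the cursor directly; a small sketch reusing the example database above:
rows = cur.execute("select * from t").fetchall()
print(rows)  # [(1, 'foo'), (2, 'bar'), (3, 'baz')]
# or iterate the cursor row by row
for row in cur.execute("select * from t"):
    print(row)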

joining across databases with sqlite3 / pysqlite

I have two separate database files, each with tables with matching primary keys but different data. I want to pull out rows from one table based on values in the other. In the CLI for sqlite3, I would do it like this:
.open data.db
.open details.db
attach 'data.db' as data;
attach 'details.db' as details;
select details.A.colA from data.A join details.A using ('key') where data.A.colB = 0 and data.A.colC = 1;
How can I recreate such a cross-database join using pysqlite?
You can attach additional databases with ATTACH DATABASE:
import sqlite3
conn = sqlite3.connect('data.db')
conn.execute("ATTACH DATABASE 'details.db' AS details")
For query purposes, the first database is known as main:
cursor = conn.cursor()
cursor.execute('''
select details.A.colA
from main.A
join details.A using ('key')
where main.A.colB = 0 and main.A.colC = 1
''')
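The rows then come back through the usual cursor methods, e.g.:
for row in cursor.fetchall():
    print(row)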

Python and PostgreSQL - GeomFromText

I'm a Python learner, and I'm trying to insert geometry records into PostgreSQL.
If I try the query without the geometry column, it works fine and all the data is inserted successfully.
cur.execute("INSERT INTO taxi (userid,carNum) SELECT '"+str(msg['UserID'])+"',"+str(msg['CarNumber']))
Once I try to add the geometry records, nothing happens! Execution ends without errors, but nothing is inserted into the DB.
cur.execute("INSERT INTO taxi (position,userid,carNum) SELECT GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")',4326),'"+str(msg['UserID'])+"',"+str(msg['CarNumber']))
I couldn't figure out what I'm missing here.
You need to commit the data to the database.
Check the documentation of psycopg2: http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
Follow these steps:
>>> import psycopg2
# Connect to an existing database
>>> conn = psycopg2.connect("dbname=test user=postgres")
# Open a cursor to perform database operations
>>> cur = conn.cursor()
# Execute a command: this creates a new table
>>> cur.execute("CREATE TABLE test (id serial PRIMARY KEY, num integer, data varchar);")
# Pass data to fill a query placeholders and let Psycopg perform
# the correct conversion (no more SQL injections!)
>>> cur.execute("INSERT INTO test (num, data) VALUES (%s, %s)",
... (100, "abc'def"))
# Query the database and obtain data as Python objects
>>> cur.execute("SELECT * FROM test;")
>>> cur.fetchone()
(1, 100, "abc'def")
# Make the changes to the database persistent
>>> conn.commit()
# Close communication with the database
>>> cur.close()
>>> conn.close()
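Applied to the insert from the question, a minimal sketch might look like the following (assuming PostGIS, where the function is called ST_GeomFromText in current versions and plain GeomFromText in older ones; the column and key names are taken from the question):
point_wkt = 'POINT(%s %s)' % (float(msg['longitude']), float(msg['latitude']))
cur.execute(
    "INSERT INTO taxi (position, userid, carNum) "
    "VALUES (ST_GeomFromText(%s, 4326), %s, %s)",
    (point_wkt, str(msg['UserID']), msg['CarNumber']))
# Without commit() the inserted row is never made persistent
conn.commit()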

Add new field to access table using python

I have an Access table to which I am trying to add fields programmatically using Python. It is not a personal geodatabase, just a standard Access database with some tables in it.
I have been able to open the table and get the list of field names and data types.
How do I add a new field and assign its data type to this Access table using Python?
Thanks!
SRP
Using the pyodbc module:
import pyodbc
MDB = 'c:/path/to/my.mdb'
DRV = '{Microsoft Access Driver (*.mdb)}'
PWD = 'my_password'
conn = pyodbc.connect('DRIVER=%s;DBQ=%s;PWD=%s' % (DRV,MDB,PWD))
c = conn.cursor()
c.execute("ALTER TABLE my_table ADD COLUMN my_column INTEGER;")
conn.commit()
c.close()
conn.close()
Edit:
Using win32com.client...
import win32com.client
conn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=c:/path/to/my.mdb;'
conn.Open(DSN)
conn.Execute("ALTER TABLE my_table ADD COLUMN my_column INTEGER;")
conn.Close()
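Either way, you can confirm the new field was added by listing the table's columns again; a quick sketch using pyodbc (same connection details as above):
import pyodbc
conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=c:/path/to/my.mdb;PWD=my_password')
cur = conn.cursor()
# cursor.columns() yields one row per column; type_name holds the data type
for col in cur.columns(table='my_table'):
    print(col.column_name, col.type_name)
cur.close()
conn.close()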
