Inserting a Python Dataframe into Hive from an external server - python

I'm currently using PyHive (Python 3.6) to read data from Hive onto a server that sits outside the Hive cluster, where I then use Python to perform analysis.
After performing analysis I would like to write data back to the Hive server.
In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark. However, in the short term is there a way to easily write data directly to a Hive table using Python from a server outside of the cluster?
Thanks for your help!

You could use the subprocess module.
The following function will work for data you've already saved locally. For example, if you save a dataframe to csv, you can pass the name of the csv into save_to_hdfs, and it will throw it in hdfs. I'm sure there's a way to throw the dataframe up directly, but this should get you started.
Here's an example function for saving a local object, output, to /user/<your_name>/<output_name> in hdfs.
import os
import pandas as pd
from subprocess import PIPE, Popen

def save_to_hdfs(output):
    """
    Save a file in local scope to hdfs.
    Note, this performs a forced put - any file with the same name will be
    overwritten.
    """
    hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
    put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path],
                stdin=PIPE, bufsize=-1)
    put.communicate()

# example
df = pd.DataFrame(...)
output_file = 'yourdata.csv'
df.to_csv(output_file)
save_to_hdfs(output_file)

# remove locally created file (so it doesn't pollute nodes)
os.remove(output_file)
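As a follow-up sketch (not part of the original answer): once the file is in HDFS you still have to get it into a Hive table. Assuming the table already exists and its layout matches the CSV, something like this with PyHive should work (host, schema and table names are placeholders):
from pyhive import hive

# connect to HiveServer2 (adjust host/port/username for your cluster)
conn = hive.Connection(host='hive-server-host', port=10000, username='<your_name>')
cursor = conn.cursor()

# LOAD DATA INPATH moves the file from its HDFS location into the table's
# location, so it disappears from /user/<your_name>/ afterwards
cursor.execute(
    "LOAD DATA INPATH '/user/<your_name>/yourdata.csv' "
    "INTO TABLE your_schema.your_table"
)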

In which format do you want to write data to Hive? Parquet/Avro/binary, or plain csv/text format?
Depending upon the serde you used while creating the Hive table, different Python libraries can be used to first convert your dataframe to that serde's format, store the file locally, and then use something like save_to_hdfs (as in Jared Wilber's answer above) to move that file into the Hive table's HDFS location path.
When a Hive table is created (default/managed or external), it reads/stores its data from a specific HDFS location (the default or a provided one), and that HDFS location can be accessed directly to modify the data. Some things to remember if manually updating data in Hive tables: SERDE, PARTITIONS, ROW FORMAT DELIMITED, etc.
Some helpful serde libraries in python:
Parquet: https://fastparquet.readthedocs.io/en/latest/
Avro: https://pypi.org/project/fastavro/
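For example (a sketch of my own, not from the original answer), if the Hive table uses the Parquet serde, fastparquet can write the dataframe to a local Parquet file, which can then be pushed into the table's HDFS location with something like save_to_hdfs above (the warehouse path below is a placeholder):
import pandas as pd
from fastparquet import write

df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

# write the dataframe as a local Parquet file
local_file = 'data.parquet'
write(local_file, df)

# then move it into the Hive table's HDFS location, e.g.
#   hadoop fs -put -f data.parquet /user/hive/warehouse/my_db.db/my_table/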

It took some digging but I was able to find a method using sqlalchemy to create a hive table directly from a pandas dataframe.
from sqlalchemy import create_engine

# Input information
host = 'username@localhost'
port = 10000
schema = 'hive_schema'
table = 'new_table'

# Execution
engine = create_engine(f'hive://{host}:{port}/{schema}')
engine.execute(f'CREATE TABLE {table} (col1 col1_type, col2 col2_type)')
# df is the pandas dataframe you want to write
df.to_sql(name=table, con=engine, if_exists='append')

You can write back.
Convert the data of df into the format of a multi-row insert, i.e. INSERT INTO table VALUES (first row of the dataframe, comma separated), (second row), (third row), ... and so on, then execute it as a single statement:
# build "(v1, v2, ...), (v1, v2, ...), ..." from the dataframe rows
bundle = ', '.join('(' + ', '.join(map(str, row)) + ')' for row in df.itertuples(index=False))
con.cursor().execute('INSERT INTO TABLE table_name VALUES ' + bundle)
and you are done.
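One caveat on the above (my addition, not part of the original answer): string values have to be quoted and escaped before they go into the VALUES clause. A naive sketch, assuming a PyHive connection con and a dataframe df:
def sql_literal(value):
    # render a Python value as a (naively escaped) SQL literal
    if value is None:
        return 'NULL'
    if isinstance(value, str):
        return "'" + value.replace("'", "\\'") + "'"
    return str(value)

bundle = ', '.join(
    '(' + ', '.join(sql_literal(v) for v in row) + ')'
    for row in df.itertuples(index=False)
)
con.cursor().execute('INSERT INTO TABLE table_name VALUES ' + bundle)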

Related

Get data as csv from a very large MySQL dump file

I have a MySQL dump file in .sql format. Its size is around 100GB. There are just two tables in it. I have to extract data from this file using Python or Bash. The issue is that the insert statement contains all the data on one line, and that line is too long. Hence, the normal approach causes memory issues, since that line (i.e., all the data) gets loaded in a loop as well.
Is there any efficient way or tool to get data as CSV?
Just a little explanation: the following line contains the actual data, and it is very large.
INSERT INTO `tblEmployee` VALUES (1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(2,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(3,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),....
The issue is that I cannot import it into MySQL due to resources issues.
I'm not sure if this is what you want, but pandas has a function to turn sql into a csv. Try this:
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")

# save the SQL table in a DataFrame
dataframe = pd.read_sql('SELECT * FROM your_table', connect)

# write the DataFrame to a CSV file
dataframe.to_csv("filename.csv", index=False)

connect.close()
If you want to change the delimiter, you can do dataframe.to_csv("filename.csv", index=False, sep='3') and just change the '3' to your delimiter of choice.
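If loading the dump into a database first isn't an option, another rough sketch (my own, assuming the INSERT tuples look like the example above and the values contain no parentheses) is to stream the dump in fixed-size chunks and write each tuple out as a CSV row, so the huge INSERT line is never held in memory all at once:
import csv
import re

def dump_to_csv(dump_path, csv_path, chunk_size=1024 * 1024):
    tuple_re = re.compile(r'\(([^()]*)\)')
    with open(dump_path, 'r', encoding='utf-8', errors='replace') as src, \
         open(csv_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        buffer = ''
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            last_end = 0
            for match in tuple_re.finditer(buffer):
                values = [v.strip().strip("'") for v in match.group(1).split(',')]
                writer.writerow(values)
                last_end = match.end()
            # keep any partial tuple for the next chunk
            buffer = buffer[last_end:]

dump_to_csv('dump.sql', 'tblEmployee.csv')
Note that this also picks up other parenthesised text in the dump (e.g. column lists in CREATE TABLE statements), so the output may need a little filtering.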

Python replacement for SQL bcp.exe

The goal is to load a csv file into an Azure SQL database from Python directly, that is, not by calling bcp.exe. The csv files will have the same number of fields as the destination tables. It'd be nice to not have to create the format file bcp.exe requires (XML for roughly 400 fields for each of 16 separate tables).
Following the Pythonic approach, try to insert the data and let SQL Server throw an exception if there is a type mismatch or other error.
If you don't want to use the bcp command to import the csv file, you can use the Python pandas library.
Here's an example where I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
The CSV file has no header and three columns: id, name, age.
Python code example:
import pandas as pd
import sqlalchemy
import urllib
import pyodbc
# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
# read csv data to dataframe with pandas
# datatypes will be assumed
# pandas is smart but you can specify datatypes with the `dtype` parameter
df = pd.read_csv (r'C:\Users\leony\Desktop\test9.csv',header=None,names = ['id', 'name', 'age'])
# write to sql table... pandas will use default column names and dtypes
df.to_sql('test9',engine,if_exists='append',index=False)
# add the 'dtype' parameter to specify datatypes if needed, e.g. dtype={'column1': VARCHAR(255), 'column2': DateTime}
Notice:
Get the connection string from the Azure Portal.
The UID format is [username]@[servername].
Run the script and it works.
Please reference these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.

Copying csv data obtained using .read() to postgresql database using Python 3.5

I want to copy csv data without saving the actual csv file to a folder first. Currently, I can get the csv data via the following code:
f = request.files['data_file'].read()
a = f.decode('utf-8')
If I print a, I can see the data from the csv. My problem is: how do I copy this data into my PostgreSQL database? I tried using the COPY command in PostgreSQL, but it needs a path to a file, and I don't want to store the actual csv; I just want it copied directly into my Postgres database. I'm using Python 3.
Using one of the psycopg2.copy_* methods is the way to go. Which one will depend on -
Does the csv have a header?
Does the csv structure match the table exactly (number and type of columns)?
Note #1 - The form of the copy command that takes a path expects that path to exist on the database server. In Heroku, that will never be the case. Instead, you need the form of the command like this: copy table_name from stdin.... The copy_from method is a convenience method for that form.
1) Simplest case - comma-delimited file with no headers exactly matching the table structure:
stmt.copy_from(request.files['data_file'], 'your_table', sep=',')
(stmt is a cursor, preferably used inside a with conn.cursor() as stmt: clause)
2) No header, but csv only has a subset of the columns:
stmt.copy_from(request.files['data_file'], 'some_table', sep=',', columns=['col1', 'col2', 'col3'])
3) If you have a header, you will need copy_expert -
sql = """
copy some_table (col1, col2, col3)
from stdin with csv header delimiter ','
"""
stmt.copy_expert(sql, request.files['data_file'])
Note #2 - the data will be implicitly converted to the correct type. It will also need to satisfy the table's constraints. A failure of either for a single record kills the entire transaction. As a result, you may need to get fancy and load all the data into a simple temp table, clean it, and then select it into the final table.
Note #3 - I kinda guessed that you can use request.files directly, but did not test it. If that fails, stream the data to a temp file and use that as the argument to the copy method.
See:
http://initd.org/psycopg/docs/cursor.html#cursor.copy_from
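For Note #3, a minimal sketch of the temp-file fallback (my addition; some_table and the connection details are placeholders, and request is the Flask request from the question):
import tempfile

import psycopg2

conn = psycopg2.connect('dbname=mydb user=myuser')

# spool the uploaded bytes to a temporary file, then hand that file
# object to copy_expert, which reads it from the beginning
with tempfile.TemporaryFile(mode='w+b') as tmp:
    tmp.write(request.files['data_file'].read())
    tmp.seek(0)
    with conn.cursor() as stmt:
        stmt.copy_expert(
            "copy some_table (col1, col2, col3) "
            "from stdin with csv header delimiter ','",
            tmp,
        )
conn.commit()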

How can I insert data from a dataframe (in Python) into a Greenplum table?

Problem Statement:
I have multiple csv files. I am cleaning them using Python and inserting them into SQL Server using bcp. Now I want to insert them into Greenplum instead of SQL Server. Please suggest a way to bulk insert into a Greenplum table directly from a Python dataframe.
Solution (what I can think of):
The way I can think of is CSV -> Dataframe -> Cleaning -> Dataframe -> CSV -> then use gpload for the bulk load, and integrate it in a shell script for automation.
Does anyone have a good solution for it?
Issue in loading data directly from a dataframe to a gp table:
gpload asks for a file path. Can I pass a variable or dataframe to it? Is there any way to bulk load into Greenplum? I don't want to create a csv or txt file from the dataframe and then load it into Greenplum.
I would use psycopg2 and the io libraries to do this. io is built-in and you can install psycopg2 using pip (or conda).
Basically, you write your dataframe to a string buffer ("memory file") in the csv format. Then you use psycopg2's copy_from function to bulk load/copy it to your table.
This should get you started:
import io
import pandas
import psycopg2
# Write your dataframe to memory as csv
csv_io = io.StringIO()
dataframe.to_csv(csv_io, sep='\t', header=False, index=False)
csv_io.seek(0)
# Connect to the GreenPlum database.
greenplum = psycopg2.connect(host='host', database='database', user='user', password='password')
gp_cursor = greenplum.cursor()
# Copy the data from the buffer to the table.
gp_cursor.copy_from(csv_io, 'db.table')
greenplum.commit()
# Close the GreenPlum cursor and connection.
gp_cursor.close()
greenplum.close()

Write pandas table to impala

Using the impyla module, I've downloaded the results of an impala query into a pandas dataframe, done analysis, and would now like to write the results back to a table on impala, or at least to an hdfs file.
However, I cannot find any information on how to do this, or even how to ssh into the impala shell and write the table from there.
What I'd like to do:
from impala.dbapi import connect
from impala.util import as_pandas
# connect to my host and port
conn=connect(host='myhost', port=111)
# create query to save table as pandas df
create_query = """
SELECT * FROM {}
""".format(my_table_name)
# run query on impala
cur = conn.cursor()
cur.execute(create_query)
# store results as pandas data frame
pandas_df = as_pandas(cur)
cur.close()
Once I've done whatever I need to do with pandas_df, save those results back to impala as a table.
# create query to save new_df back to impala
save_query = """
CREATE TABLE new_table AS
SELECT *
FROM pandas_df
"""
# run query on impala
cur = conn.cursor()
cur.execute(save_query)
cur.close()
The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from python, or even just save the table to hdfs. I'm writing this as a script for other users, so it's essential to have this all done within the script. Thanks so much!
You're going to love Ibis! It has the HDFS functions (put, namely) and wraps the Impala DML and DDL you'll need to make this easy.
The general approach I've used for something similar is to save your pandas table to a CSV, use HDFS put to move it onto the cluster, and then create a new table using that CSV as the data source.
You don't need Ibis for this, but it should make it a little bit easier and may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes, who wrote pandas).
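A sketch of that general approach without Ibis (my addition; host, port, paths and column types are placeholders, pandas_df is the dataframe from the question, and the hadoop client is assumed to be available where the script runs):
import subprocess

from impala.dbapi import connect

# 1. save the dataframe locally and push it into a directory in HDFS
local_csv = 'pandas_df.csv'
hdfs_dir = '/user/myuser/pandas_df'
pandas_df.to_csv(local_csv, header=False, index=False)
subprocess.run(['hadoop', 'fs', '-mkdir', '-p', hdfs_dir], check=True)
subprocess.run(['hadoop', 'fs', '-put', '-f', local_csv, hdfs_dir], check=True)

# 2. create an external table over that directory so Impala can query it
conn = connect(host='myhost', port=21050)
cur = conn.cursor()
cur.execute(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS new_table (
        col1 STRING,
        col2 DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '{hdfs_dir}'
""")
cur.execute('REFRESH new_table')
cur.close()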
I am trying to do the same thing, and I figured out a way to do it with an example provided with impyla:
df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name="test_df", con=conn, flavor="mysql")
This works fine, and the table in Impala (mysql backend) works fine.
However, I got stuck on getting text values in, as Impala tries to do analysis on the columns and I get cast errors. (It would be really nice if it were possible to implicitly cast from string to [var]char(N) in impyla.)
