I am trying to load a pipe-separated CSV file into a Hive table using Python, without success. Please assist.
Full code:
from pyhive import hive
host_name = "192.168.220.135"
port = 10000
user = "cloudera"
password = "cloudera"
database = "default"
conn = hive.Connection(host=host_name, port=port, username=user, database=database)
print('Connected to DB: {}'.format(host_name))
cursor = conn.cursor()
Query = """LOAD DATA LOCAL inpath '/home/cloudera/Desktop/ccna_test/RERATING_EMMCCNA.csv' INTO TABLE python_testing fields terminated by '|' lines terminated by '\n' """
cursor.execute(Query)
From your question, I assume the CSV format is like below and you want a query to load the data into a Hive table.
value1|value2|value3
value4|value5|value6
value7|value8|value9
First, there should be a Hive table, which can be created with the query below.
create table python_testing
(
col1 string,
col2 string,
col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with SERDEPROPERTIES ( "separatorChar" = "|")
stored as textfile;
Note that the separator character is given explicitly when the table is created, and the table is stored in TEXTFILE format because that matches the format of the input file.
If you want an ORC table, the input file should already be in ORC format (Hive's 'load data' command just copies the files into the table's data location and does not transform the data). A possible workaround is to create a temporary table with STORED AS TEXTFILE, LOAD DATA into it, and then copy the data from that table into the ORC table.
Use the 'load data' command to load the data.
load data local inpath '/home/hive/data.csv' into table python_testing;
Replace /home/hive/data.csv with the path to your file.
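Putting the pieces together with the pyhive connection from the question, a minimal sketch (the connection settings and file path are the ones from the question, and the DDL is the statement above):

from pyhive import hive

conn = hive.Connection(host="192.168.220.135", port=10000,
                       username="cloudera", database="default")
cursor = conn.cursor()

# Run once: the pipe delimiter lives in the table definition, not in LOAD DATA.
cursor.execute("""
    create table python_testing
    (
      col1 string,
      col2 string,
      col3 string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = "|")
    stored as textfile
""")

# Plain LOAD DATA, with no 'fields terminated by' clause.
cursor.execute(
    "load data local inpath "
    "'/home/cloudera/Desktop/ccna_test/RERATING_EMMCCNA.csv' "
    "into table python_testing"
)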
For more details, see this blog post: http://forkedblog.com/load-data-to-hive-database-from-csv-file/
Related
I'm using Python. I have a csv file that I need to copy into a postgresql table daily. Some of those .csv records may be the same day over day, so I want to ignore those, based on a primary key field. Using cursor.copy_from: on day 1 all is fine and the new table is created; on day 2, copy_from throws a duplicate key error (as it should), but copy_from stops on the first error. Is there a copy_from parameter that would ignore the duplicates and continue? If not, any recommendations other than copy_from?
f = open(csv_file_name, 'r')
c.copy_from(f, 'mytable', sep=',')
This is how I'm doing it with psycopg3.
Assumes the file is in the same folder as the script and that it has a header row.
from pathlib import Path
from psycopg import sql
file = Path(__file__).parent / "the_data.csv"
target_table = "mytable"
conn = <your connection>
with conn.cursor() as cur:
    # Create an empty temp table with the same columns (and types) as target_table.
    cur.execute(
        sql.SQL("CREATE TEMP TABLE tmp_table (LIKE {})").format(sql.Identifier(target_table))
    )
    # The csv file arrives as text; because tmp_table was created LIKE the target
    # table, COPY converts each field to the proper column type on the way in.
    query = sql.SQL("COPY tmp_table FROM STDIN WITH (FORMAT csv, HEADER true)")
    with cur.copy(query) as copy:
        with file.open() as csv_data:
            copy.write(csv_data.read())
    # Move the rows across, skipping any that violate the primary key.
    cur.execute(
        sql.SQL("INSERT INTO {} SELECT * FROM tmp_table ON CONFLICT DO NOTHING").format(
            sql.Identifier(target_table)
        )
    )
conn.commit()  # needed unless the connection is in autocommit mode
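For completeness, a hedged sketch of the connection setup the snippet leaves as `conn = <your connection>` (the connection string values are placeholders, not from the original answer):

import psycopg

# Placeholder values; substitute your own host, database and credentials.
conn = psycopg.connect("host=localhost dbname=mydb user=me password=secret")

Note that ON CONFLICT DO NOTHING relies on a primary key (or another unique constraint) on the target table, which matches the primary key field mentioned in the question.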
I have a csv file containing schema and table names (format shared below). My task is to unload data from Redshift to an S3 bucket as CSV files. For this task I have the Python script below and two IAM roles: the first IAM role to unload data from Redshift, and the second to write the data to the S3 bucket. The issue I am facing is that, using the script below, I am able to create a folder in my S3 bucket, but instead of a CSV file the file type shown in S3 is " -". I am not sure what the possible reason is.
Any help is much appreciated. Thanks in advance for your time and effort!
Note: I have millions of rows to unload from Redshift to S3 bucket.
CSV File containing schema and table name
Schema;tables
mmy_schema;my_table
Python Script
import csv
import redshift_connector
import sys
CSV_FILE="Tables.csv"
CSV_DELIMITER=';'
S3_DEST_PATH="s3://..../"
DB_HOST="MY HOST"
DB_PORT=1234
DB_DB="MYDB"
DB_USER="MY_READ"
DB_PASSWORD="MY_PSWD"
IAM_ROLE="arn:aws:iam::/redshift-role/unload data","arn:aws::iam::/write in bucket"

def get_tables(path):
    tables=[]
    with open(path, 'r') as file:
        csv_reader = csv.reader(file, delimiter=CSV_DELIMITER)
        header = next(csv_reader)
        if header is not None:
            for row in csv_reader:
                tables.append(row)
    return tables

def unload(conn, tables, s3_path):
    cur = conn.cursor()
    for table in tables:
        print(f">{table[0]}.{table[1]}")
        try:
            query = f'''unload('select * from {table[0]}.{table[1]}') to '{s3_path}/{table[1]}/'
            iam_role '{IAM_ROLE}'
            CSV
            PARALLEL FALSE
            CLEANPATH;'''
            print("loading in progress")
            cur.execute(query)
            print("Done.")
        except Exception as e:
            print("Failed to load")
            print(str(e))
            sys.exit(1)
    cur.close()

def main():
    try:
        conn = redshift_connector.connect(
            host=DB_HOST,
            port=DB_PORT,
            database=DB_DB,
            user=DB_USER,
            password=DB_PASSWORD
        )
        tables = get_tables(CSV_FILE)
        unload(conn, tables, S3_DEST_PATH)
        conn.close()
    except Exception as e:
        print(e)
        sys.exit(1)
Redshift doesn't add a file type suffix on UNLOAD; the object names just end with the part number. And yes, "parallel off" unloads can still produce multiple files. If these file names are required to end in ".csv", your script will need to issue the S3 rename API calls (copy and delete) itself. That process should also check how many files were produced and handle each of them.
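A hedged boto3 sketch of that rename step (the bucket name and prefix are placeholders; it assumes the unloaded objects sit under one prefix per table, as in the script above):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # placeholder bucket name
prefix = "my_table/"   # placeholder prefix, one per unloaded table

# List every part Redshift wrote under the prefix, then copy+delete to add ".csv".
# (Use a paginator if a prefix can hold more than 1000 objects.)
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".csv"):
        continue  # already renamed
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": key},
                   Key=key + ".csv")
    s3.delete_object(Bucket=bucket, Key=key)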
I have to store a small PDF file in a Postgres database (already have a table ready with a bytea column for the data), then be able to delete the file, and use the data in the database to restore the PDF as it was.
For context, I'm working with FastApi in Python3 so I can get the file as bytes, or as a whole file. So the main steps are:
Getting the file as bytes or a file via FastAPI
Inserting it into the Postgres DB
Retrieve the data in the DB
Make a new PDF file with the data.
How can I do that in a clean way?
The uploading function from FastAPI:
def import_data(file: UploadFile = File(...)):
    # Put the whole data into a variable as bytes
    pdfFile = file.file.read()
    database.insertPdfInDb(pdfFile)
    # Saving the file we just got to check if it's intact (it is)
    file_name = file.filename.replace(" ", "-")
    with open(file_name, 'wb+') as f:
        f.write(pdfFile)
        f.close()
    return {"filename": file.filename}
The function inserting the data into the Postgres DB:
def insertPdfInDb(pdfFile):
    conn = connectToDb()
    curs = conn.cursor()
    curs.execute("INSERT INTO PDFSTORAGE(pdf, description) values (%s, 'Some description...')", (psycopg2.Binary(pdfFile),))
    conn.commit()
    print("PDF insertion in the database attempted.")
    disconnectFromDb(conn)
    return 0
The exporting part is only just started and is entirely trial-and-error code.
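Since no answer is quoted here, a hedged sketch of the retrieval side, reusing the question's connectToDb/disconnectFromDb helpers and table layout (it assumes the table also has an id column to select by, which is not shown in the question, and the function name exportPdfFromDb is made up):

def exportPdfFromDb(pdf_id, output_path):
    # Assumes an id primary-key column exists on PDFSTORAGE (hypothetical).
    conn = connectToDb()
    curs = conn.cursor()
    curs.execute("SELECT pdf FROM PDFSTORAGE WHERE id = %s", (pdf_id,))
    row = curs.fetchone()
    disconnectFromDb(conn)
    if row is None:
        return None
    # psycopg2 returns bytea as a memoryview, so convert it to bytes before writing.
    with open(output_path, 'wb') as f:
        f.write(bytes(row[0]))
    return output_path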
What I'm doing:
I'm executing a query from a mysql table and exporting each day's worth of data into a folder
I then insert each csv row by row using a for loop into a separate mysql table
Once loaded into the table, I then move the csv into another separate folder
The problem is that it is taking a very long time to run, and I would like some help finding areas where I can speed up the process, or suggestions for alternative methods in Python.
Code:
import pymysql
import pymysql.cursors
import csv
import os
import shutil
import datetime
from db_credentials import db1_config, db2...
def date_range(start, end):
    # Creates a list of dates from start to end
    ...

def export_csv(filename, data):
    # Exports query result as a csv to the filename's pending folder
    ...

def extract_from_db(database, sql, start_date, end_date, filename):
    # SQL query to extract data and export as csv
    ...

def open_csv(c):
    # Read csv and return as a list of lists
    ...

def get_files(folder):
    # Grab all csv files from a given folder's pending folder
    ...
# HERE IS WHERE IT GETS SLOW
def load_to_db(table, folder):
    print('Uploading...\n')
    files = get_files(folder)
    # Connect to db2 database
    connection = pymysql.connect(**db2_config, charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            # Open each csv in the files list and ignore column headers
            for file in files:
                print('Processing ' + file.split("pending/", 1)[1] + '...', end='')
                csv_file = open_csv(file)
                csv_headers = ', '.join(csv_file[0])
                csv_data = csv_file[1:]
                # Insert each row of each csv into db2 table
                for row in csv_data:
                    placeholders = ', '.join(['%s'] * len(row))
                    sql = "INSERT INTO %s (%s) VALUES ( %s )" % (table, csv_headers, placeholders)
                    cursor.execute(sql, row)
                # Move processed file to the processed folder
                destination_folder = os.path.join('/Users', 'python', folder, 'processed')
                shutil.move(file, destination_folder)
                print('DONE')
        # Connection is not autocommit by default.
        # So you must commit to save your changes.
        connection.commit()
    finally:
        connection.close()
    if not files:
        print('No csv data available to process')
    else:
        print('Finished')
How about trying MySQL's LOAD DATA?
e.g. execute the following statement for the entire csv, rather than individual inserts:
LOAD DATA INFILE '<your filename>'
INTO TABLE <your table>
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
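A hedged sketch of issuing that statement through pymysql for each pending file, in the spirit of the question's load_to_db function (the LOCAL variant and the local_infile=True connect flag are assumptions needed when the csv files live on the client machine; the function name load_to_db_bulk is made up; db2_config and get_files come from the question's script):

import pymysql
from db_credentials import db2_config

def load_to_db_bulk(table, folder):
    # local_infile=True lets the client send files that live on this machine.
    connection = pymysql.connect(**db2_config, charset='utf8mb4', local_infile=True)
    try:
        with connection.cursor() as cursor:
            for file in get_files(folder):
                # One statement per file; the server parses the csv itself.
                cursor.execute(
                    f"LOAD DATA LOCAL INFILE %s INTO TABLE {table} "
                    "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
                    "LINES TERMINATED BY '\\n' IGNORE 1 ROWS",
                    (file,)
                )
        connection.commit()
    finally:
        connection.close()

The server also has to allow local_infile; if it doesn't, drop LOCAL and point the statement at a path readable by the MySQL server, as in the answer above.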
I am trying to follow a copy_from example described on Stack Overflow, but I modified it a little as I need to read the data from a csv file. Following this example I wrote a small program where the file is read from disk and its data is then copied into a created table. My code is:
def importFile():
    path = "C:\myfile.csv"
    curs = conn.cursor()
    curs.execute("Drop table if exists test_copy; ")
    data = StringIO.StringIO()
    data.write(path)
    data.seek(0)
    curs.copy_from(data, 'MyTable')
    print("Data copied")
But I get an error:
psycopg2.DataError: invalid input syntax for integer:
Does this mean there is a mismatch between the csv file and my table? Or is this syntax enough to copy a csv file, or do I need some more code? I am new to Python, so any help will be appreciated.
Look at your .csv file with a text editor. You want to be sure that
the field-separator is a tab character
there are no quote-chars
there is no header row
If this is true, the following should work:
import psycopg2

def importFromCsv(conn, fname, table):
    with open(fname) as inf:
        conn.cursor().copy_from(inf, table)

def main():
    conn = ?? # set up database connection
    importFromCsv(conn, "c:/myfile.csv", "MyTable")
    print("Data copied")

if __name__=="__main__":
    main()
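If editing the file to be tab-separated is not convenient, psycopg2's copy_from also takes a sep argument, so a comma-separated file can be loaded directly as long as it still has no quote characters and no header row; a hedged variant of the same function (the function name is made up):

def importFromCommaCsv(conn, fname, table):
    # sep tells copy_from which column delimiter to expect; the file must still
    # contain no quote characters and no header row for plain copy_from.
    with open(fname) as inf:
        conn.cursor().copy_from(inf, table, sep=',')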