I am reading a file from the file system to load data into a PostgreSQL DB. I would like to use the code below to copy the data into the database. However, I have to fetch the CSV file from S3 instead of reading it from the file system. I saw that there are utilities that allow data to be loaded directly from S3 to RDS, but they are not supported in my organization at the moment. How can I stream data from a CSV file in S3 to a PostgreSQL DB?
def load_data(conn, table_name, file_path):
    copy_sql = """
        COPY %s FROM stdin WITH CSV HEADER
        DELIMITER as ','
        """
    cur = conn.cursor()
    f = open(file_path, 'r', encoding="utf-8")
    cur.copy_expert(sql=copy_sql % table_name, file=f)
    f.close()
    cur.close()
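A minimal sketch of one way to stream the object straight from S3 into COPY, assuming boto3 is available; the bucket, key and table names are placeholders. botocore's StreamingBody is file-like (it has a read() method), so it can be handed to copy_expert directly and the file never has to be held in memory in full.

import boto3
import psycopg2

def load_data_from_s3(conn, table_name, bucket, key):
    copy_sql = """
        COPY {} FROM stdin WITH CSV HEADER
        DELIMITER as ','
        """.format(table_name)
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    body = obj["Body"]  # botocore StreamingBody: file-like, supports read(size)
    with conn.cursor() as cur:
        # copy_expert pulls the stream in chunks and feeds it to COPY ... FROM STDIN
        cur.copy_expert(sql=copy_sql, file=body)
    conn.commit()

As in the original snippet, the table name is interpolated into the SQL text, so it should come from trusted configuration rather than user input.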
Related
I have a CSV file containing a schema and table name (format shared below). My task is to unload data from Redshift to an S3 bucket as a CSV file. For this task I have the Python script below and two IAM roles: the first to unload data from Redshift, the second to write the data to the S3 bucket. The issue I am facing is that with the script below I can create a folder in my S3 bucket, but instead of a CSV file the file type shown in the S3 bucket is " -". I am not sure what the possible reason is.
Any help is much appreciated. Thanks in advance for your time and effort!
Note: I have millions of rows to unload from Redshift to S3 bucket.
CSV File containing schema and table name
Schema;tables
mmy_schema;my_table
Python Script
import csv
import redshift_connector
import sys

CSV_FILE = "Tables.csv"
CSV_DELIMITER = ';'
S3_DEST_PATH = "s3://..../"
DB_HOST = "MY HOST"
DB_PORT = 1234
DB_DB = "MYDB"
DB_USER = "MY_READ"
DB_PASSWORD = "MY_PSWD"
IAM_ROLE = "arn:aws:iam::/redshift-role/unload data,arn:aws::iam::/write in bucket"
def get_tables(path):
    tables = []
    with open(path, 'r') as file:
        csv_reader = csv.reader(file, delimiter=CSV_DELIMITER)
        header = next(csv_reader)
        if header is not None:
            for row in csv_reader:
                tables.append(row)
    return tables
def unload(conn, tables, s3_path):
    cur = conn.cursor()
    for table in tables:
        print(f">{table[0]}.{table[1]}")
        try:
            query = f'''unload ('select * from {table[0]}.{table[1]}')
                        to '{s3_path}/{table[1]}/'
                        iam_role '{IAM_ROLE}'
                        CSV
                        PARALLEL FALSE
                        CLEANPATH;'''
            print("loading in progress")
            cur.execute(query)
            print("Done.")
        except Exception as e:
            print("Failed to load")
            print(str(e))
            sys.exit(1)
    cur.close()
def main():
    try:
        conn = redshift_connector.connect(
            host=DB_HOST,
            port=DB_PORT,
            database=DB_DB,
            user=DB_USER,
            password=DB_PASSWORD
        )
        tables = get_tables(CSV_FILE)
        unload(conn, tables, S3_DEST_PATH)
        conn.close()
    except Exception as e:
        print(e)
        sys.exit(1)

if __name__ == "__main__":
    main()
Redshift doesn't add a file-type suffix on UNLOAD; the object name just ends with the part number. And yes, "parallel off" unloads can still produce multiple files. If these file names are required to end in ".csv", your script will need to issue the S3 rename API calls (a copy followed by a delete) itself. The process should also check how many files were produced and take appropriate action if needed.
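A rough sketch of that rename step with boto3 (S3 has no real rename, so it is a copy plus a delete); the bucket and prefix names are placeholders, and pagination of the listing is omitted for brevity.

import boto3

def add_csv_suffix(bucket, prefix):
    s3 = boto3.client("s3")
    # list_objects_v2 returns at most 1000 keys per call; add a paginator for large unloads
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key.endswith(".csv"):
            continue
        # "rename" = copy to the new key, then delete the old one
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=key + ".csv")
        s3.delete_object(Bucket=bucket, Key=key)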
I am trying to load data (CSV files) from S3 into MySQL RDS through Lambda, so that whenever a CSV file is uploaded to the S3 bucket its data is imported into the database.
But if the CSV values contain spaces, the data is not imported into the database correctly. See the images below.
CODE:
import json
import boto3
import csv
import mysql.connector
from mysql.connector import Error
from mysql.connector import errorcode

s3_client = boto3.client('s3')

# Read CSV file content from S3 bucket
def lambda_handler(event, context):
    # TODO implement
    # print(event)
    bucket = event['Records'][0]['s3']['bucket']['name']
    csv_file = event['Records'][0]['s3']['object']['key']
    csv_file_obj = s3_client.get_object(Bucket=bucket, Key=csv_file)
    lines = csv_file_obj['Body'].read().decode('utf-8').split()
    results = []
    for row in csv.DictReader(lines, skipinitialspace=True, delimiter=',', quotechar='"', doublequote=True):
        results.append(row.values())
    print(results)
    connection = mysql.connector.connect(host='xxxxxxxxxxxxxxx.ap-south-1.rds.amazonaws.com', database='xxxxxxxdb', user='xxxxxx', password='xxxxxx')
    tables_dict = {
        'sketching': 'INSERT INTO table1 (empid, empname, empaddress) VALUES (%s, %s, %s)'
    }
    if csv_file in tables_dict:
        mysql_empsql_insert_query = tables_dict[csv_file]
        cursor = connection.cursor()
        cursor.executemany(mysql_empsql_insert_query, results)
        connection.commit()
        print(cursor.rowcount, f"Record inserted successfully from {csv_file} file")
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
CSV file (screenshot in the original post)
Result in the database (screenshot in the original post)
So if there is a space inside a name or any other value, the data is not uploaded correctly, and decimal values like 9.2 or 8.7 are not uploaded exactly either.
How can I solve this problem?
I think the issue is with this line:
lines = csv_file_obj['Body'].read().decode('utf-8').split()
When no separator is specified, split() breaks the string at every run of whitespace, including the spaces inside your fields.
You should probably use: split('\n')
Alternatively:
Download the file to disk (instead of using read())
Use the default behaviour of the CSV Reader (which knows how to break on lines)
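A small self-contained illustration of the difference (the sample rows are made up to match the empid/empname/empaddress columns from the question):

import csv

raw = 'empid,empname,empaddress\n1,John Doe,New York\n2,Jane Roe,San Francisco\n'

bad_lines = raw.split()       # breaks on every space: ['empid,empname,empaddress', '1,John', 'Doe,New', ...]
good_lines = raw.split('\n')  # one element per CSV line, spaces inside fields preserved

results = []
for row in csv.DictReader(good_lines, skipinitialspace=True, delimiter=',', quotechar='"', doublequote=True):
    results.append(tuple(row.values()))
print(results)  # [('1', 'John Doe', 'New York'), ('2', 'Jane Roe', 'San Francisco')]

str.splitlines() would also work here and avoids the trailing empty string that split('\n') leaves when the file ends with a newline.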
I have to store a small PDF file in a Postgres database (I already have a table ready with a bytea column for the data), then be able to delete the file and later use the data in the database to restore the PDF as it was.
For context, I'm working with FastAPI in Python 3, so I can get the file as bytes or as a whole file. The main steps are:
Getting the file as bytes or a file via FastAPI
Inserting it into the Postgres DB
Retrieve the data in the DB
Make a new PDF file with the data.
How can I do that in a clean way?
The upload function using FastAPI:
def import_data(file: UploadFile = File(...)):
    # Put the whole data into a variable as bytes
    pdfFile = file.file.read()
    database.insertPdfInDb(pdfFile)
    # Saving the file we just got to check if it's intact (it is)
    file_name = file.filename.replace(" ", "-")
    with open(file_name, 'wb+') as f:
        f.write(pdfFile)
        f.close()
    return {"filename": file.filename}
The function inserting the data into the Postgres DB:
def insertPdfInDb(pdfFile):
    conn = connectToDb()
    curs = conn.cursor()
    curs.execute("INSERT INTO PDFSTORAGE(pdf, description) values (%s, 'Some description...')", (psycopg2.Binary(pdfFile),))
    conn.commit()
    print("PDF insertion in the database attempted.")
    disconnectFromDb(conn)
    return 0
The exporting part has only just been started and is entirely trial-and-error code so far.
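A sketch of what the export step could look like, reusing the connectToDb()/disconnectFromDb() helpers from the code above; it assumes PDFSTORAGE has some id column to select by (that column is not shown in the question) and that psycopg2 returns the bytea column as a memoryview.

def exportPdfFromDb(pdf_id, output_path):
    conn = connectToDb()
    curs = conn.cursor()
    # the id column is an assumption; use whatever key identifies the row
    curs.execute("SELECT pdf FROM PDFSTORAGE WHERE id = %s", (pdf_id,))
    row = curs.fetchone()
    curs.close()
    disconnectFromDb(conn)
    if row is None:
        return None
    # psycopg2 hands back bytea as a memoryview; bytes() yields the raw PDF content
    with open(output_path, "wb") as f:
        f.write(bytes(row[0]))
    return output_path

In the FastAPI route, the same bytes could also be returned directly with Response(content=..., media_type="application/pdf") instead of writing them to disk.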
I am trying to load a pipe-separated CSV file into a Hive table using Python, without success. Please assist.
Full code:
from pyhive import hive
host_name = "192.168.220.135"
port = 10000
user = "cloudera"
password = "cloudera"
database = "default"
conn = hive.Connection(host=host_name, port=port, username=user, database=database)
print('Connected to DB: {}'.format(host_name))
cursor = conn.cursor()
Query = """LOAD DATA LOCAL inpath '/home/cloudera/Desktop/ccna_test/RERATING_EMMCCNA.csv' INTO TABLE python_testing fields terminated by '|' lines terminated by '\n' """
cursor.execute(Query)
From your question, I assume the CSV format is like the one below and you want a query to load the data into a Hive table.
value1|value2|value3
value4|value5|value6
value7|value8|value9
First, there should be a Hive table; it can be created using the query below.
create table python_testing
(
col1 string,
col2 string,
col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with SERDEPROPERTIES ( "separatorChar" = "|")
stored as textfile;
Note that the separator character and the input file format are explicitly given at table creation.
The table is also stored in TEXTFILE format, which matches the format of the input file.
If you want an ORC table, the input file should be in ORC format (Hive's 'load data' command just copies the files into the table's data location and does not do any transformation on the data). A possible workaround is to create a temporary table with STORED AS TEXTFILE, LOAD DATA into it, and then copy the data from that table into the ORC table.
Use 'load' command to load the data.
load data local inpath '/home/hive/data.csv' into table python_testing;
/home/hive/data.csv should be your file path.
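Since the question is about doing this from Python, here is a sketch of running these two statements through pyhive, using the connection settings from the question (host, credentials and file path are the asker's placeholders):

from pyhive import hive

conn = hive.Connection(host="192.168.220.135", port=10000, username="cloudera", database="default")
cursor = conn.cursor()

# create the table with the pipe separator declared via OpenCSVSerde
cursor.execute("""
    create table if not exists python_testing (col1 string, col2 string, col3 string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = "|")
    stored as textfile
""")

# LOCAL refers to the filesystem of the HiveServer2 host, not the client machine
cursor.execute("load data local inpath '/home/hive/data.csv' into table python_testing")

Note that each execute() call takes a single statement without a trailing semicolon.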
For more details visit blog post - http://forkedblog.com/load-data-to-hive-database-from-csv-file/
I use Python and the vertica-python library to COPY data into a Vertica DB:
import vertica_python

connection = vertica_python.connect(**conn_info)
vsql_cur = connection.cursor()
with open("/tmp/vertica-test-insert", "rb") as fs:
    vsql_cur.copy("COPY table FROM STDIN DELIMITER ',' ", fs, buffer_size=65536)
connection.commit()
It inserts data, but only 5 rows, although the file contains more. Could this be related to DB settings, or is it a client issue?
This code works for me:
For JSON
# for json file
with open("D:/SampleCSVFile_2kb/tweets.json", "rb") as fs:
    my_file = fs.read().decode('utf-8')
    cur.copy("COPY STG.unstruc_data FROM STDIN parser fjsonparser()", my_file)
    connection.commit()
For CSV
# for csv file
with open("D:/SampleCSVFile_2kb/SampleCSVFile_2kb.csv", "rb") as fs:
    my_file = fs.read().decode('utf-8', 'ignore')
    cur.copy("COPY STG.unstruc_data FROM STDIN PARSER FDELIMITEDPARSER (delimiter=',', header='false')", my_file)  # buffer_size=65536
    connection.commit()
It is very likely that you have rows being rejected. Assuming you are using 7.x, you can add:
[ REJECTED DATA {'path' [ ON nodename ] [, ...] | AS TABLE 'reject_table'} ]
You can also query this after the copy execution to see the summary of results:
SELECT GET_NUM_ACCEPTED_ROWS(), GET_NUM_REJECTED_ROWS();
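A sketch of how these two suggestions could be combined with vertica-python; the table name, reject-table name and connection values below are placeholders, not taken from the question.

import vertica_python

# placeholder connection settings; use the same conn_info dict as in the question
conn_info = {'host': '127.0.0.1', 'port': 5433, 'user': 'dbadmin',
             'password': '...', 'database': 'mydb'}

connection = vertica_python.connect(**conn_info)
cur = connection.cursor()
with open("/tmp/vertica-test-insert", "rb") as fs:
    # rejected rows land in my_table_rejects instead of being silently dropped
    cur.copy("COPY my_table FROM STDIN DELIMITER ',' REJECTED DATA AS TABLE my_table_rejects",
             fs, buffer_size=65536)
connection.commit()

# accepted/rejected counts for the COPY that just ran in this session
cur.execute("SELECT GET_NUM_ACCEPTED_ROWS(), GET_NUM_REJECTED_ROWS()")
print(cur.fetchone())
connection.close()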