Import CSV file data into a Postgres table using Python

I have a CSV file which contains 60,000 rows. I need to insert this data into a Postgres database table. Is there any way to reduce the time it takes to insert the data from the file into the database, without looping over the rows? Please help me.
Python Version : 2.6
Database : postgres
table: keys_data
File Structure
1,ED2,'FDFDFDFDF','NULL'
2,ED2,'SDFSDFDF','NULL'

Postgres can read CSV directly into a table with the COPY command. This either requires you to be able to place files directly on the Postgres server, or data can be piped over a connection with COPY FROM STDIN.
The \copy command in Postgres' psql command-line client will read a file locally and insert using COPY FROM STDIN so that's probably the easiest (and still fastest) way to do this.
Note: this doesn't require any Python at all; it's native Postgres functionality, and most other RDBMSs don't offer an equivalent.
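If you do want to drive it from Python rather than psql, psycopg2 can stream the file over COPY FROM STDIN via copy_expert. A minimal sketch, assuming the keys_data table from the question already exists and using placeholder connection details:
import psycopg2

# Placeholder connection details -- adjust for your environment.
conn = psycopg2.connect(host='localhost', dbname='mydb',
                        user='postgres', password='password')
cur = conn.cursor()

# Stream the local CSV to the server in a single COPY operation (no row loop).
with open('keys_data.csv', 'r') as f:
    cur.copy_expert("COPY keys_data FROM STDIN WITH CSV", f)

conn.commit()
cur.close()
conn.close()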

I've performed a similar task; the only difference is that my solution is Python 3.x based. I am sure you can find the equivalent code for your version. The code is pretty self-explanatory.
import io
from sqlalchemy import create_engine

def insert_in_postgre(table_name, df):
    # create engine object
    engine = create_engine('postgresql+psycopg2://user:password@hostname/database_name')

    # create (or replace) the empty target table from the dataframe's columns
    df.head(0).to_sql(table_name, engine, if_exists='replace', index=False)

    # push the dataframe contents through COPY via an in-memory buffer
    conn = engine.raw_connection()
    cur = conn.cursor()
    output = io.StringIO()
    df.to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur.copy_from(output, table_name, null="")
    conn.commit()
    cur.close()
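For context, a hedged usage sketch; the file path and column names below are placeholders, not part of the original answer:
import pandas as pd

# Hypothetical file and column names, matching the question's four-column CSV.
df = pd.read_csv('keys_data.csv', header=None,
                 names=['id', 'code', 'value', 'extra'])
insert_in_postgre('keys_data', df)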

Related

psycopg2 cursor.copy_from() loading duplicates?

I'm using the psycopg2 package in Python to load data from a CSV file into a table in Postgres. This CSV file gets regenerated every hour, and it may contain duplicates when compared to a file from another time. Some example code would look like this:
import psycopg2
conn = psycopg2.connect(user='postgres',password='password',database='postgres')
cur = conn.cursor()
file = open('test.csv','r')
cur.copy_from(file,'table',sep=',')
conn.commit()
cur.close()
conn.close()
I'm pretty certain this method does not account for duplicates. Is there another method that would handle duplicates directly, or would it be better to load the CSV into a temp table and use cur.execute() to insert into the final table?
PostgreSQL v12 has a WHERE clause for COPY ... FROM, but you cannot use subqueries there.
So the only option is to load the file into a temporary table and then use INSERT ... ON CONFLICT to upsert the data.
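A minimal sketch of that staging-table approach with psycopg2, assuming the target table is named target and has a primary key column id (both names are placeholders):
import psycopg2

conn = psycopg2.connect(user='postgres', password='password', database='postgres')
cur = conn.cursor()

# Stage the CSV in a temporary table shaped like the target table.
cur.execute("CREATE TEMP TABLE staging (LIKE target INCLUDING ALL)")
with open('test.csv', 'r') as f:
    cur.copy_from(f, 'staging', sep=',')

# Upsert into the real table; rows that conflict on the primary key are skipped.
cur.execute("""
    INSERT INTO target
    SELECT * FROM staging
    ON CONFLICT (id) DO NOTHING
""")

conn.commit()
cur.close()
conn.close()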

How do I export a pandas DataFrame to Microsoft Access?

I have a Pandas DataFrame with around 200,000 indexes/rows and 30 columns.
I need to have this exported directly into an .mdb file; converting it to a CSV and manually importing it will not work.
I understand there are tools like pyodbc that help a lot with importing/reading Access, but there is little documentation on how to export.
I'd love any help anyone can give, and would strongly appreciate any examples.
First, convert the dataframe into a .csv file using the command below:
name_of_your_dataframe.to_csv("filename.csv", sep='\t', encoding='utf-8')
Then load the .csv into the .mdb using pyodbc.
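That pyodbc step is left unspecified above; a minimal sketch, assuming the .mdb already contains a table named MyTable whose columns match the CSV (the driver string, paths, and table name are placeholders):
import csv
import pyodbc

con = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Path\To\DB.mdb;"
)
cur = con.cursor()

# Insert the CSV rows into the pre-created Access table.
with open('filename.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    placeholders = ', '.join('?' * len(header))
    insert_sql = "INSERT INTO MyTable VALUES ({0})".format(placeholders)
    for row in reader:
        cur.execute(insert_sql, row)

con.commit()
con.close()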
MS Access can directly query CSV files and run a Make-Table Query (https://support.office.com/en-us/article/Create-a-make-table-query-96424f9e-82fd-411e-aca4-e21ad0a94f1b) to produce a resulting table. However, some cleaning is needed to remove the rubbish rows. The code below opens two files, one for reading and the other for writing. Assuming the rubbish is in the first column of the CSV, the if logic writes any line that has some data in its second column (adjust as needed):
import os
import csv
import pyodbc

# TEXT FILE CLEAN
with open(r'C:\Path\To\Raw.csv', 'r') as reader, open(r'C:\Path\To\Clean.csv', 'w') as writer:
    read_csv = csv.reader(reader)
    write_csv = csv.writer(writer, lineterminator='\n')
    for line in read_csv:
        if len(line[1]) > 0:
            write_csv.writerow(line)

# DATABASE CONNECTION
access_path = r"C:\Path\To\Access\DB.mdb"
con = pyodbc.connect("DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={};" \
                     .format(access_path))

# RUN QUERY
strSQL = "SELECT * INTO [TableName] FROM [text;HDR=Yes;FMT=Delimited(,);" + \
         r"Database=C:\Path\To\Folder].Clean.csv;"
cur = con.cursor()
cur.execute(strSQL)
con.commit()
con.close()                            # CLOSE CONNECTION

os.remove(r'C:\Path\To\Clean.csv')     # DELETE CLEAN TEMP
2020 Update
There is now a supported external SQLAlchemy dialect for Microsoft Access ...
https://github.com/gordthompson/sqlalchemy-access
... which enables you to use pandas' to_sql method directly via pyodbc and the Microsoft Access ODBC driver (on Windows).
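A hedged sketch of that route, assuming sqlalchemy-access is installed, the Access ODBC driver is present, and SQLAlchemy 1.4+ is used for the URL API; the database path and table name are placeholders (check the project's README for the exact connection options):
import sqlalchemy as sa

# Build an ODBC connection string for the Access file (placeholder path).
odbc_str = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Path\To\DB.mdb;"
    r"ExtendedAnsiSQL=1;"
)
url = sa.engine.URL.create("access+pyodbc", query={"odbc_connect": odbc_str})
engine = sa.create_engine(url)

# Write the dataframe straight to an Access table via pandas.
dataframe_name.to_sql("TableName", engine, if_exists="replace", index=False)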
I would recommend exporting the pandas dataframe to CSV as usual, like this:
dataframe_name.to_csv("df_filename.csv", sep=',', encoding='utf-8')
Then you can convert it to an .mdb file as this Stack Overflow answer shows.

How can I insert data from a dataframe (in Python) into a Greenplum table?

Problem Statement:
I have multiple CSV files. I am cleaning them using Python and inserting them into SQL Server using bcp. Now I want to insert the data into Greenplum instead of SQL Server. Please suggest a way to bulk insert into a Greenplum table directly from a Python dataframe.
Solution (what I can think of):
The way I can think of is CSV -> Dataframe -> Cleaning -> Dataframe -> CSV -> then use gpload for the bulk load, and integrate it in a shell script for automation.
Does anyone have a good solution for this?
Issue with loading data directly from a dataframe to a GP table:
gpload asks for a file path. Can I pass a variable or a dataframe to it? Is there any way to bulk load into Greenplum? I don't want to create a CSV or txt file from the dataframe and then load it into Greenplum.
I would use psycopg2 and the io libraries to do this. io is built-in and you can install psycopg2 using pip (or conda).
Basically, you write your dataframe to a string buffer ("memory file") in the csv format. Then you use psycopg2's copy_from function to bulk load/copy it to your table.
This should get you started:
import io
import pandas
import psycopg2
# Write your dataframe to memory as csv
csv_io = io.StringIO()
dataframe.to_csv(csv_io, sep='\t', header=False, index=False)
csv_io.seek(0)
# Connect to the GreenPlum database.
greenplum = psycopg2.connect(host='host', database='database', user='user', password='password')
gp_cursor = greenplum.cursor()
# Copy the data from the buffer to the table.
gp_cursor.copy_from(csv_io, 'db.table')
greenplum.commit()
# Close the GreenPlum cursor and connection.
gp_cursor.close()
greenplum.close()

export very large sql file into csv with Python or R

I have a large SQL file (20 GB) that I would like to convert into CSV. I plan to load the file into Stata for analysis. I have enough RAM to load the entire file (my computer has 32 GB of RAM).
The problem is: the solutions I found online with Python so far (sqlite3) seem to require more RAM than my current system has to:
read the SQL
write the csv
Here is the code
import sqlite3
import pandas as pd
con=sqlite3.connect('mydata.sql')
query='select * from mydata'
data=pd.read_sql(query,con)
data.to_csv('export.csv')
con.close()
The sql file contains about 15 variables that can be timestamps, strings or numerical values. Nothing really fancy.
I think one possible solution could be to read the sql and write the csv file one line at a time. However, I have no idea how to do that (either in R or in Python)
Any help really appreciated!
You can read the SQL database in batches and write them to file instead of reading the whole database at once. Credit to How to add pandas data to an existing csv file? for how to add to an existing CSV file.
import sqlite3
import pandas as pd

# Open the file
f = open('output.csv', 'w')

# Create a connection and get a cursor
connection = sqlite3.connect('mydata.sql')
cursor = connection.cursor()

# Execute the query
cursor.execute('select * from mydata')

# Get data in batches
while True:
    # Read the data
    df = pd.DataFrame(cursor.fetchmany(1000))
    # We are done if there are no data
    if len(df) == 0:
        break
    # Let's write to the file
    else:
        df.to_csv(f, header=False)

# Clean up
f.close()
cursor.close()
connection.close()
Use the sqlite3 command line program like this from the Windows cmd line or UNIX shell:
sqlite3 -csv "mydata.sql" "select * from mydata;" > mydata.csv
If mydata.sql is not in the current directory use the path and on Windows use forward slashes rather than backslashes.
Alternatively, run sqlite3
sqlite3
and enter these commands at the sqlite prompt:
.open "mydata.sql"
.output mydata.csv
.mode csv
select * from mydata;
.quit
(or put them in a file called run, say, and use sqlite3 < run).
Load the .sql file into a MySQL database and export it as CSV.
Commands to load a MySQL dump file into a MySQL database:
Create a MySQL database
create database <database_name>
mysql -u root -p <database_name> < dumpfilename.sql
Command to export a MySQL table as CSV:
mysql -u root -p
use <database_name>
SELECT * INTO OUTFILE 'file.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
FROM <table_name>;

Write pandas table to impala

Using the impyla module, I've downloaded the results of an impala query into a pandas dataframe, done analysis, and would now like to write the results back to a table on impala, or at least to an hdfs file.
However, I cannot find any information on how to do this, or even how to ssh into the impala shell and write the table from there.
What I'd like to do:
from impala.dbapi import connect
from impala.util import as_pandas
# connect to my host and port
conn=connect(host='myhost', port=111)
# create query to save table as pandas df
create_query = """
SELECT * FROM {}
""".format(my_table_name)
# run query on impala
cur = conn.cursor()
cur.execute(create_query)
# store results as pandas data frame
pandas_df = as_pandas(cur)
cur.close()
Once I've done whatever I need to do with pandas_df, save those results back to impala as a table.
# create query to save new_df back to impala
save_query = """
CREATE TABLE new_table AS
SELECT *
FROM pandas_df
"""
# run query on impala
cur = conn.cursor()
cur.execute(save_query)
cur.close()
The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from python, or even just save the table to hdfs. I'm writing this as a script for other users, so it's essential to have this all done within the script. Thanks so much!
You're going to love Ibis! It has the HDFS functions (namely put) and wraps the Impala DML and DDL you'll need to make this easy.
The general approach I've used for something similar is to save your pandas table to a CSV, HDFS.put that onto the cluster, and then create a new table using that CSV as the data source.
You don't need Ibis for this, but it should make it a little bit easier and may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes, who wrote pandas).
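A hedged sketch of that general approach without Ibis, reusing the impyla connection conn from the question; the local and HDFS paths, table name, and column types are placeholders, and it assumes the hdfs command-line client is available where the script runs:
import subprocess

# 1. Dump the dataframe as a headerless CSV (placeholder path).
pandas_df.to_csv('/tmp/new_table.csv', header=False, index=False)

# 2. Push the file into an HDFS directory (placeholder location).
subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', '/user/me/new_table'])
subprocess.check_call(['hdfs', 'dfs', '-put', '-f',
                       '/tmp/new_table.csv', '/user/me/new_table/'])

# 3. Point an external Impala table at that directory (placeholder schema).
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE new_table (col1 STRING, col2 DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/me/new_table'
""")
cur.close()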
I am trying to do the same thing and figured out a way to do it with an example provided with impyla:
df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name="test_df", con=conn, flavor="mysql")
This works fine, and the table in Impala (MySQL backend) works fine.
However, I got stuck on getting text values in, as Impala tries to do analysis on the columns and I get cast errors. (It would be really nice if it were possible to implicitly cast from string to [var]char(N) in impyla.)
