I'm using the psycopg2 package in Python to load data from a CSV file into a table in Postgres. The CSV file is regenerated every hour, and a given file may contain rows that already appeared in a file from an earlier run. Some example code would look like this:
import psycopg2

conn = psycopg2.connect(user='postgres', password='password', database='postgres')
cur = conn.cursor()
with open('test.csv', 'r') as file:
    cur.copy_from(file, 'table', sep=',')
conn.commit()
cur.close()
conn.close()
I'm fairly certain this method does not account for duplicates. Is there another method that would handle duplicates directly, or would it be better to load the CSV into a temp table and use cur.execute() to insert into the final table?
PostgreSQL v12 added a WHERE clause to COPY ... FROM, but you cannot use subqueries there, so it can't be used to skip rows that already exist in the target table.
So the only option is to load the file into a temporary table and then use INSERT ... ON CONFLICT to upsert the data into the final table.
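A minimal sketch of that approach, assuming the final table is called target_table with a primary-key column id (both names are placeholders; the staging table simply copies the target's structure):

import psycopg2

conn = psycopg2.connect(user='postgres', password='password', database='postgres')
cur = conn.cursor()

# Stage the CSV in a temporary table with the same structure as the target
cur.execute('CREATE TEMP TABLE staging (LIKE target_table INCLUDING ALL) ON COMMIT DROP')
with open('test.csv', 'r') as f:
    cur.copy_from(f, 'staging', sep=',')

# Upsert: rows whose primary key already exists in target_table are skipped
cur.execute("""
    INSERT INTO target_table
    SELECT * FROM staging
    ON CONFLICT (id) DO NOTHING
""")
conn.commit()
cur.close()
conn.close()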
I want to delete multiple table rows that are present in a CSV file.
I'm able to make a connection to the database:
conn = psycopg2.connect(host=dbendpoint, port=portinfo, database=databasename, user=dbusername, password=password)
cur = conn.cursor()
I'm able to read the CSV file and store the contents in a dataframe:
dfOutput = spark.read.option("delimiter", ",").option("header", "true").option("mode", "FAILFAST").option("compression", "None").csv(inputFile)
I think I have to convert my dataframe into tuples to use it with psycopg2.
Now I'm unable to proceed any further.
I'm new to psycopg2, so any help would be appreciated.
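A minimal sketch of one way to proceed, assuming the CSV has an id column that matches the table's key (my_table and id are placeholder names): collect the relevant column from the Spark dataframe into Python tuples and run a parameterized DELETE.

# Collect the key values from the Spark dataframe into Python tuples
rows = [(r["id"],) for r in dfOutput.select("id").collect()]

# Delete the matching rows; my_table and id are placeholder names
cur.executemany("DELETE FROM my_table WHERE id = %s", rows)
conn.commit()
cur.close()
conn.close()

For a large file, a single statement such as DELETE FROM my_table WHERE id = ANY(%s), passing the ids as one list parameter, avoids the per-row round trips of executemany.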
I have a MySQL dump file in .sql format. Its size is around 100 GB, and there are just two tables in it. I have to extract the data from this file using Python or Bash. The issue is that each INSERT statement contains all of the data on a single, extremely long line, so the usual line-by-line approach runs into memory problems because that one line (i.e., all the data) has to be loaded at once.
Is there any efficient way or tool to get data as CSV?
Just a little explanation: the following line contains the actual data, and it is extremely long.
INSERT INTO `tblEmployee` VALUES (1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(2,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(3,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),....
The issue is that I cannot import it into MySQL due to resource constraints.
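One possible approach is to stream the dump in fixed-size chunks and split the VALUES list on the "),(" separators, so the huge line never has to fit in memory at once. This is only a sketch: it assumes a single INSERT ... VALUES statement per table and that the string values never contain "),(" or commas; a robust parser would have to track quoting. The file names below are placeholders.

import csv

def dump_to_csv(dump_path, csv_path, chunk_size=1024 * 1024):
    # Stream a huge single-line INSERT ... VALUES dump into a CSV file
    with open(dump_path, 'r') as src, open(csv_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        buffer = ''
        for chunk in iter(lambda: src.read(chunk_size), ''):
            buffer += chunk
            parts = buffer.split('),(')
            # The last piece may be an incomplete tuple; keep it for the next chunk
            buffer = parts.pop()
            for part in parts:
                # Drop the "INSERT INTO `tblEmployee` VALUES (" prefix on the first tuple
                if 'VALUES' in part:
                    part = part.split('VALUES', 1)[1]
                writer.writerow(part.strip("(); \n").split(','))
        if buffer.strip():
            writer.writerow(buffer.rstrip(");\n ").lstrip("( ").split(','))

dump_to_csv('dump.sql', 'tblEmployee.csv')

Note that quoted values like 'Nirali' are written to the CSV with their quotes intact; stripping them (and converting the NULLs) is left as a further step.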
I'm not sure if this is what you want, but pandas can read a SQL table into a DataFrame and then write it out as a CSV. Try this:
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")

# read the SQL table into a DataFrame ("table_name" is a placeholder)
dataframe = pd.read_sql('SELECT * FROM table_name', connect)

# write the DataFrame to a CSV file
dataframe.to_csv("filename.csv", index=False)

connect.close()
If you want to change the delimiter, you can pass sep to to_csv, e.g. dataframe.to_csv("filename.csv", index=False, sep=';'), and replace ';' with your delimiter of choice.
Hope everyone is well and staying safe.
I'm uploading an Excel file to SQL using Python. I have three fields: CarNumber nvarchar(50), Status nvarchar(50), and DisassemblyStart date.
I'm having an issue importing the DisassemblyStart field dates. The connection and transaction in Python are successful.
However, I get zeroes all over, even though the Excel file is populated with dates. I've tried switching the column to nvarchar(50), date, and datetime to see if I could at least get a string, but nothing. I saved the Excel file as CSV and as TXT and tried uploading those, and still got zeroes. I even added 0.001 to every date in Excel (to add an artificial time component) in case that would make it click, but nothing happened: still zeroes. I'm sure there's a major oversight from being too deep in the weeds. I need help.
The Excel file has the three field columns.
This is the Python code:
import pandas as pd
import pyodbc
# Import CSV
data = pd.read_csv(r'PATH_TO_CSV\XXX.csv')
df = pd.DataFrame(data,columns= ['CarNumber','Status','DisassemblyStart'])
df = df.fillna(value=0)
# Connect to SQL Server
conn = pyodbc.connect("Driver={SQL Server};Server=SERVERNAME,PORT ;Database=DATABASENAME;Uid=USER;Pwd=PW;")
cursor = conn.cursor()
# Create Table
cursor.execute ('DROP TABLE OPS.dbo.TABLE')
cursor.execute ('CREATE TABLE OPS.dbo.TABLE (CarNumber nvarchar(50),Status nvarchar(50), DisassemblyStart date)')
# Insert READ_CSV INTO TABLE
for row in df.itertuples():
    cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber,Status,DisassemblyStart) VALUES (?,?,Convert(datetime,?,23))', row.CarNumber, row.Status, row.DisassemblyStart)
conn.commit()
conn.close()
Help will be much appreciated.
Thank you and be safe,
David
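One possible adjustment to try (a sketch, not a confirmed diagnosis): parse DisassemblyStart with pandas before inserting, and only fill the text columns, so fillna cannot turn any unparsed dates into 0 and pyodbc receives real datetime values. Table, column, and connection details are the ones from the question.

import pandas as pd
import pyodbc

data = pd.read_csv(r'PATH_TO_CSV\XXX.csv')
df = pd.DataFrame(data, columns=['CarNumber', 'Status', 'DisassemblyStart'])

# Parse the date column explicitly; errors='coerce' turns unparseable cells into NaT
df['DisassemblyStart'] = pd.to_datetime(df['DisassemblyStart'], errors='coerce')

# Fill only the text columns; rows without a parseable date would need separate handling
df[['CarNumber', 'Status']] = df[['CarNumber', 'Status']].fillna(value=0)

conn = pyodbc.connect("Driver={SQL Server};Server=SERVERNAME,PORT;Database=DATABASENAME;Uid=USER;Pwd=PW;")
cursor = conn.cursor()
for row in df.itertuples():
    # Pass the parsed datetime straight through instead of CONVERTing a string in SQL
    cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber, Status, DisassemblyStart) VALUES (?, ?, ?)',
                   row.CarNumber, row.Status, row.DisassemblyStart)
conn.commit()
conn.close()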
I have a CSV file which contains 60,000 rows. I need to insert this data into a Postgres table. Is there any way to reduce the time it takes to load the file into the database, without looping over the rows? Please help me.
Python Version : 2.6
Database : postgres
table: keys_data
File Structure
1,ED2,'FDFDFDFDF','NULL'
2,ED2,'SDFSDFDF','NULL'
Postgres can read CSV directly into a table with the COPY command. This either requires you to be able to place files directly on the Postgres server, or the data can be piped over a connection with COPY ... FROM STDIN.
The \copy command in Postgres' psql command-line client reads a file locally and inserts it using COPY FROM STDIN, so that's probably the easiest (and still fastest) way to do this.
Note: this doesn't require any use of Python; it's native functionality in Postgres, and most other relational databases don't offer anything equivalent.
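From Python, psycopg2 exposes the same mechanism: copy_expert streams a local file over the connection using COPY ... FROM STDIN. A minimal sketch, assuming a keys_data table already exists with columns matching the file (connection details and the file name are placeholders, and the quoting options may need adjusting for the single-quoted values shown above):

import psycopg2

conn = psycopg2.connect(user='postgres', password='password', database='postgres')
cur = conn.cursor()
with open('keys_data.csv', 'r') as f:
    # Stream the local file to the server; no per-row Python loop involved
    cur.copy_expert("COPY keys_data FROM STDIN WITH (FORMAT csv)", f)
conn.commit()
cur.close()
conn.close()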
I've performed a similar task; the only difference is that my solution is based on Python 3.x. I'm sure you can find the equivalent code for your version. The code is fairly self-explanatory.
import io
from sqlalchemy import create_engine

def insert_in_postgre(table_name, df):
    # create engine object (note the @ between the password and the hostname)
    engine = create_engine('postgresql+psycopg2://user:password@hostname/database_name')

    # create an empty table with the DataFrame's columns in the target database
    df.head(0).to_sql(table_name, engine, if_exists='replace', index=False)

    # stream the DataFrame through an in-memory buffer into COPY
    conn = engine.raw_connection()
    cur = conn.cursor()
    output = io.StringIO()
    df.to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur.copy_from(output, table_name, null="")
    conn.commit()
    cur.close()
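A hypothetical call, assuming the sample file from the question and column names chosen only for illustration:

import pandas as pd

# header=None because the sample file has no header row; the column names here are made up
df = pd.read_csv('keys_data.csv', header=None, names=['id', 'code', 'value', 'flag'])
insert_in_postgre('keys_data', df)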
I have a tabledata.csv file and I have been using pandas.read_csv to read it and select specific columns under specific conditions.
For instance, I use the following code to select all "name" values where session_id equals 1, which works fine in an IPython Notebook on datascientistworkbench.
df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
df['name'][df['session_id']==1]
I just wonder: after I have read the CSV file, is it possible to somehow "switch/read" it as a SQL database? (I'm pretty sure I did not explain that well with the correct terms, sorry about that!) What I want is to use SQL statements in the IPython notebook to choose specific rows under specific conditions, something like:
Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`
But I guess I need to figure out a way to turn the CSV file into another form so that I can use SQL statements on it. Many thanks!
Here is a quick primer on pandas and SQL, using the built-in sqlite3 package. Generally speaking, you can do all SQL operations in pandas one way or another, but databases are of course still useful. The first thing you need to do is store the original df in a SQL database so that you can query it. The steps are listed below.
import pandas as pd
import sqlite3
#read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
#connect to a database
conn = sqlite3.connect("Any_Database_Name.db") #if the db does not exist, this creates a Any_Database_Name.db file in the current directory
#store your table in the database:
df.to_sql('Some_Table_Name', conn)
#read a SQL Query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
Another answer suggested using SQLite. However, DuckDB is a much faster alternative than loading your data into SQLite.
First, loading your data will take time; second, SQLite is not optimized for analytical queries (e.g., aggregations).
Here's a full example you can run in a Jupyter notebook:
Installation
pip install jupysql duckdb duckdb-engine
Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine
Example
Load extension (%sql magic) and create in-memory database:
%load_ext sql
%sql duckdb://
Download some sample CSV data:
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")
Query:
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
The JupySQL documentation has more details.