I have a task to read a CSV file line by line and insert the rows into a database.
The CSV file contains about 1.7 million lines.
I use Python with the SQLAlchemy ORM (the merge function) to do this.
But it takes over five hours.
Is this caused by Python's slow performance, by SQLAlchemy, or by the ORM layer?
Or would using Go give an obviously better performance? (I have no experience with Go, and this job needs to be scheduled every month.)
I'd appreciate any suggestions, thanks!
Update: the database is MySQL.
For such a task you don't want to insert data line by line :) Basically, you have two options:
Ensure that SQLAlchemy does not run queries one by one: use a batch INSERT (see How to do a batch insert in MySQL) instead; a sketch follows this list.
Massage your data as needed, output it to a temporary CSV file, and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess your data, just feed the CSV to the database directly (I assume it's MySQL).
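As a rough sketch of the first option (the table name, CSV layout, and connection URL here are assumptions, not your actual setup), you could bypass the ORM's merge() and feed chunks of rows to a SQLAlchemy Core insert, which the MySQL driver turns into multi-row INSERTs:

import csv
from sqlalchemy import create_engine, MetaData, Table

# hypothetical connection URL, table and file names -- adjust to your schema
engine = create_engine("mysql+pymysql://user:password@localhost/dbname")
metadata = MetaData()
my_table = Table("my_table", metadata, autoload_with=engine)

CHUNK_SIZE = 10000  # rows per multi-row INSERT

with engine.begin() as conn, open("data.csv", newline="") as f:
    reader = csv.DictReader(f)  # CSV header names must match the column names
    chunk = []
    for row in reader:
        chunk.append(row)       # values arrive as strings; convert here if needed
        if len(chunk) >= CHUNK_SIZE:
            conn.execute(my_table.insert(), chunk)  # one executemany per chunk
            chunk = []
    if chunk:
        conn.execute(my_table.insert(), chunk)

If even that is too slow, the second option (LOAD DATA [LOCAL] INFILE or mysqlimport, as in the answers below) is usually the fastest path.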
Follow these three steps:
1. Save the CSV file with the name of the table you want to load it into.
2. Run the Python script below to create the table dynamically (update the CSV filename and the DB parameters).
3. Run mysqlimport:
mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv
PYTHON CODE:
import numpy as np
import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)

# the table name is the CSV filename without its extension
table_name = csv_file.split('.')
query = "CREATE TABLE " + table_name[0] + "( \n"

# build one column definition per CSV column, mapping pandas dtypes to MySQL types
for count in np.arange(df.columns.values.size):
    query += df.columns.values[count]
    if df.dtypes[count] == 'int64':
        query += "\t\t int(11) NOT NULL"
    elif df.dtypes[count] == 'object':
        query += "\t\t varchar(64) NOT NULL"
    elif df.dtypes[count] == 'float64':
        query += "\t\t float(10,2) NOT NULL"
    if count == 0:
        query += " PRIMARY KEY"          # the first CSV column becomes the primary key
    if count < df.columns.values.size - 1:
        query += ",\n"
query += " );"
#print(query)

database = connect(host='localhost',   # your host
                   user='username',    # username
                   passwd='password',  # password
                   db='dbname')        # database name
curs = database.cursor(dictionary=True)
curs.execute(query)
# print(query)
I have a workflow where I need to take a 500k-row CSV and import it into a MySQL table. I have a Python script that seems to be working, but no data is saved in the actual table when I select from it. I drop and re-create the table, then try to bulk-insert the CSV file, but it doesn't look like the data is going in. No errors are reported in the Python console when running.
The script takes about 2 minutes to run, which makes me think it's doing something, but I don't get anything other than the column headers when I SELECT * from the table itself.
My script looks roughly like:
import pandas as pd
import mysql.connector
dataframe.to_csv('import-data.csv', header=False, index=False)
DB_NAME = 'SCHEMA1'
TABLES = {}
TABLES['TableName'] = (
    "CREATE TABLE `TableName` ("
    "`x_1` varchar(10) NOT NULL,"
    "`x_2` varchar(20) NOT NULL,"
    "PRIMARY KEY (`x_1`)"
    ") ENGINE = InnoDB")
load = """
LOAD DATA LOCAL INFILE 'import-data.csv'
INTO TABLE SCHEMA1.TableName
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
"""
conn = mysql.connector.connect(
    host=writer_host,
    port=port,
    user=username,
    password=password,
    database=username,
    ssl_ca=cert_path
)
cursor = conn.cursor(buffered=True)
cursor.execute("DROP TABLE IF EXISTS SCHEMA1.TableName")
cursor.execute(TABLES['TableName'])
cursor.execute(load)
cursor.close()
conn.close()
You are missing a commit after executing your statements. With mysql.connector, the commit is issued on the connection, not the cursor:
conn.commit()
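In the posted script that means committing on the connection after the LOAD DATA statement, roughly like this (a sketch of the tail of your script; depending on your connector version you may also need allow_local_infile=True in connect() for LOCAL INFILE to be permitted):

cursor = conn.cursor(buffered=True)
cursor.execute("DROP TABLE IF EXISTS SCHEMA1.TableName")
cursor.execute(TABLES['TableName'])
cursor.execute(load)
conn.commit()   # without this, the loaded rows are rolled back when the connection closes
cursor.close()
conn.close()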
I have a table with about 200 columns. I need to take a dump of the daily transaction data for ETL purposes. It's a MySQL DB. I tried this with Python, both using a pandas DataFrame and the basic write-to-CSV-file method. I also looked for the same functionality in a shell script and only found one for an Oracle database using sqlplus. Below are my Python attempts with the two approaches:
Using Pandas:
import MySQLdb as mdb
import pandas as pd
host = ""
user = ''
pass_ = ''
db = ''
query = 'SELECT * FROM TABLE1'
conn = mdb.connect(host=host,
                   user=user, passwd=pass_,
                   db=db)
df = pd.read_sql(query, con=conn)
df.to_csv('resume_bank.csv', sep=',')
Using basic python file write:
import MySQLdb
import csv
import datetime
currentDate = datetime.datetime.now().date()
host = ""
user = ''
pass_ = ''
db = ''
table = ''
con = MySQLdb.connect(user=user, passwd=pass_, host=host, db=db, charset='utf8')
cursor = con.cursor()
query = "SELECT * FROM %s;" % table
cursor.execute(query)
with open('Data_on_%s.csv' % currentDate, 'w') as f:
    writer = csv.writer(f)
    for row in cursor.fetchall():
        writer.writerow(row)
print('Done')
The table has about 300,000 records, and both Python approaches take too much time.
There's also an encoding issue: the DB result set contains some latin-1 characters, for which I get errors like: UnicodeEncodeError: 'ascii' codec can't encode character '\x96' in position 1078: ordinal not in range(128).
I need to save the CSV in Unicode format. Can you please help me with the best approach to perform this task?
A Unix-based or Python-based solution will work for me. This script needs to run daily to dump the daily data.
You can achieve that just by leveraging MySQL itself. For example:
SELECT * FROM your_table WHERE ...
INTO OUTFILE 'your_file.csv'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n';
If you need to schedule the query, put it into a file (e.g., csv_dump.sql) and create a cron task like this one:
00 00 * * * mysql -h your_host -u user -ppassword < /foo/bar/csv_dump.sql
For strings this will use the default character encoding, which happens to be ASCII, and that fails when you have non-ASCII characters. You want unicode instead of str:
rows = cursor.fetchall()
f = open('Data_on_%s.csv' % currentDate, 'w')
myFile = csv.writer(f)
for row in rows:
    myFile.writerow([unicode(s).encode("utf-8") for s in row])
f.close()
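If you are on Python 3 instead, a simpler variant of the same idea (just a sketch; there the csv module writes str objects, so it is enough to open the file with an explicit encoding) would be:

import csv

rows = cursor.fetchall()
with open('Data_on_%s.csv' % currentDate, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(rows)   # each database row becomes one CSV row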
You can use mysqldump for this task. (Source for the command.)
mysqldump -u username -p --tab=/path/to/directory dbname table_name --fields-terminated-by=','
The arguments are as follows:
-u username for the username
-p to indicate that a password should be used
-ppassword to give the password via the command line
--tab=/path to produce tab-separated data files in the given directory
For more command-line switches see https://dev.mysql.com/doc/refman/5.5/en/mysqldump.html
To run it on a regular basis, create a cron task like the one in the other answers.
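For example, a nightly crontab entry could look roughly like this (a sketch; cron cannot prompt, so the password has to be given inline or via ~/.my.cnf):
0 0 * * * mysqldump -u username -ppassword --tab=/path/to/directory dbname table_name --fields-terminated-by=','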
I have the following Python code; it reads through a text file line by line and takes characters x to y of each line as the variable contract.
import os
import pyodbc
cnxn = pyodbc.connect(r'DRIVER={SQL Server};CENSORED;Trusted_Connection=yes;')
cursor = cnxn.cursor()
claimsfile = open('claims.txt','r')
for line in claimsfile:
    #ldata = claimsfile.readline()
    contract = line[18:26]
    print(contract)
    cursor.execute("USE calms SELECT XREF_PLAN_CODE FROM calms_schema.APP_QUOTE WHERE APPLICATION_ID = " + str(contract))
    print(cursor.fetchall())
When including the line cursor.fetchall(), the following error is returned:
Programming Error: Previous SQL was not a query.
The query runs in SSMS, and if I replace str(contract) with the actual value of the variable, results are returned as expected.
Based on the data, the query will return one value as a result, formatted as NVARCHAR(4).
Most other examples have variables declared prior to the loop, and the proposed solution is to SET NOCOUNT ON; this does not apply to my problem, so I am slightly lost.
P.S. I have also put the query in its own standalone file, without the loop iterating through the file, in case that was causing the problem, without success.
In your SQL query you are actually issuing two statements, USE and SELECT, and the cursor is not set up for multiple statements. Also, with database connections you should select the database schema in the connection string (i.e., the DATABASE argument), so T-SQL's USE is not needed.
Consider the following adjustment with parameterization, where APPLICATION_ID is assumed to be an integer type. Add credentials as needed:
constr = 'DRIVER={SQL Server};SERVER=CENSORED;Trusted_Connection=yes;' \
         'DATABASE=calms;UID=username;PWD=password'

cnxn = pyodbc.connect(constr)
cur = cnxn.cursor()

with open('claims.txt', 'r') as f:
    for line in f:
        contract = line[18:26]
        print(contract)

        # EXECUTE QUERY
        cur.execute("SELECT XREF_PLAN_CODE FROM APP_QUOTE WHERE APPLICATION_ID = ?",
                    [int(contract)])

        # FETCH ROWS ITERATIVELY
        for row in cur.fetchall():
            print(row)

cur.close()
cnxn.close()
I want to insert a record into mytable (in a DB2 database) and get the id generated by that insert. I'm trying to do that with Python 2.7. Here is what I did:
import sqlalchemy
from sqlalchemy import *
import ibm_db_sa
db2 = sqlalchemy.create_engine('ibm_db_sa://user:pswd@localhost:50001/mydatabase')
sql = "select REPORT_ID from FINAL TABLE(insert into MY_TABLE values(DEFAULT,CURRENT TIMESTAMP,EMPTY_BLOB(),10,'success'));"
result = db2.execute(sql)
for item in result:
    id = item[0]
    print id
When I execute the code above it gives me this output:
10 // or an increasing number
Now when I check the database, nothing has been inserted! I tried to run the same SQL statement on the command line and it worked just fine. Any clue why I can't insert it with Python using SQLAlchemy?
Did you try a commit? @Lennart is right, it might solve your problem.
Your code does not commit the changes you have made, and thus they are rolled back.
If your database is InnoDB, it is transactional and thus needs a commit.
According to this, you also have to connect to your engine, so in your instance it would look like:
db2 = sqlalchemy.create_engine('ibm_db_sa://user:pswd@localhost:50001/mydatabase')
conn = db2.connect()
trans = conn.begin()
try:
    sql = "select REPORT_ID from FINAL TABLE(insert into MY_TABLE values(DEFAULT,CURRENT TIMESTAMP,EMPTY_BLOB(),10,'success'));"
    result = conn.execute(sql)
    for item in result:
        id = item[0]
        print id
    trans.commit()
except:
    trans.rollback()
    raise
I do hope this helps.
I am using pymssql and the Pandas sql package to load data from SQL into a Pandas dataframe with frame_query.
I would like to send it back to the SQL database using write_frame, but I haven't been able to find much documentation on this. In particular, there is a parameter flavor='sqlite'. Does this mean that so far Pandas can only export to SQLite? My firm is using MS SQL Server 2008 so I need to export to that.
Unfortunately, yes. At the moment sqlite is the only "flavor" supported by write_frame. See https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L155
def write_frame(frame, name=None, con=None, flavor='sqlite'):
    """
    Write records stored in a DataFrame to SQLite. The index will currently be
    dropped
    """
    if flavor == 'sqlite':
        schema = get_sqlite_schema(frame, name)
    else:
        raise NotImplementedError
Writing a simple write_frame should be fairly easy, though. For example, something like this might work (untested!):
import pymssql
conn = pymssql.connect(host='SQL01', user='user', password='password', database='mydatabase')
cur = conn.cursor()
# frame is your dataframe
wildcards = ','.join(['?'] * len(frame.columns))
data = [tuple(x) for x in frame.values]
table_name = 'Table'
cur.executemany("INSERT INTO %s VALUES(%s)" % (table_name, wildcards), data)
conn.commit()
Just to save time for anyone else who tries to use this: it turns out the line
wildcards = ','.join(['?'] * len(frame.columns))
should be
wildcards = ','.join(['%s'] * len(frame.columns))
because pymssql uses the %s parameter style rather than ?. Hope that helps.
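Putting the two together, a minimal corrected sketch (still untested, assuming pymssql and a DataFrame named frame as above) would be:

import pymssql

conn = pymssql.connect(host='SQL01', user='user', password='password', database='mydatabase')
cur = conn.cursor()

table_name = 'Table'
wildcards = ','.join(['%s'] * len(frame.columns))   # pymssql expects %s placeholders
data = [tuple(x) for x in frame.values]

cur.executemany("INSERT INTO %s VALUES(%s)" % (table_name, wildcards), data)
conn.commit()
conn.close()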