I have a 5 GB CSV of IP addresses that I need to parse into a MySQL database.
Currently I am reading rows from the CSV and inserting them into MySQL. It works, but I would love to make it faster.
Could I parallelize the reading and writing somehow? Or perhaps split the CSV into chunks and spawn processes to read and write each split CSV?
import csv
from csv import reader
from csv import writer
import mysql.connector
cnx = mysql.connector.connect(user='root', password='', host='127.0.0.1', database='ips')
cursor = cnx.cursor()
i = 1
with open('iplist.csv', 'r') as read_obj:
    csv_reader = reader(read_obj)
    for row in csv_reader:
        query = """INSERT INTO ips (ip_start,ip_end,continent) VALUES ('%s','%s','%s')""" % (row[0], row[1], row[2])
        print(query)
        cursor.execute(query)
        cursor.execute('COMMIT')
        print(i)
        i = i + 1
cnx.close()
Any help is appreciated.
Use cursor.executemany to increase speed:
# Tested with:
# docker run --rm -e MYSQL_ALLOW_EMPTY_PASSWORD=y -p 3306:3306 mysql
#
# CREATE DATABASE ips;
# USE ips;
# CREATE TABLE ips (id INT PRIMARY KEY NOT NULL AUTO_INCREMENT, ip_start VARCHAR(15), ip_end VARCHAR(15), continent VARCHAR(20));
import mysql.connector
import csv
import itertools
CHUNKSIZE = 1000 # Number of lines
cnx = mysql.connector.connect(user='root', password='', host='127.0.0.1', database='ips')
cursor = cnx.cursor()
with open('iplist.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    while True:
        records = list(itertools.islice(reader, CHUNKSIZE))
        if not records:
            break
        query = """INSERT INTO ips (ip_start, ip_end, continent) VALUES (%s, %s, %s)"""
        cursor.executemany(query, records)
        cursor.execute('COMMIT')
I created a pseudo-random CSV file where each row is of the form "111.222.333.444,555.666.777.888,A continent". The file contains 33 million rows. The following code was able to insert all rows into a MySQL database table in about 3 minutes:
import mysql.connector
import time
import concurrent.futures
import csv
import itertools
CSVFILE='/Users/Andy/iplist.csv'
CHUNK=10_000
def doBulkInsert(rows):
    with mysql.connector.connect(user='andy', password='monster', host='localhost', database='andy') as connection:
        connection.cursor().executemany('INSERT INTO ips (ip_start, ip_end, continent) VALUES (%s, %s, %s)', rows)
        connection.commit()

def main():
    with open(CSVFILE) as csvfile:
        csvdata = csv.reader(csvfile)
        _s = time.perf_counter()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            while (data := list(itertools.islice(csvdata, CHUNK))):
                executor.submit(doBulkInsert, data)
            executor.shutdown(wait=True)
        print(f'Duration = {time.perf_counter()-_s}')

if __name__ == '__main__':
    main()
My recommendation would be to chunk your list. Break it down into chunks of 5,000 (or similar) rows, then iterate through those. This will reduce the number of queries you are making; query volume seems to be your biggest bottleneck.
https://medium.com/code-85/two-simple-algorithms-for-chunking-a-list-in-python-dc46bc9cc1a2
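For illustration, a minimal sketch of that idea against the same ips table and iplist.csv from the question (the chunk size, credentials, and column slice are assumptions to adjust):

import csv
import mysql.connector

CHUNK_SIZE = 5000  # rows per batch; tune to your memory and latency

cnx = mysql.connector.connect(user='root', password='', host='127.0.0.1', database='ips')
cursor = cnx.cursor()
query = "INSERT INTO ips (ip_start, ip_end, continent) VALUES (%s, %s, %s)"

with open('iplist.csv', 'r') as f:
    reader = csv.reader(f)
    chunk = []
    for row in reader:
        chunk.append(row[:3])
        if len(chunk) >= CHUNK_SIZE:
            cursor.executemany(query, chunk)  # one round trip per chunk instead of per row
            cnx.commit()
            chunk = []
    if chunk:  # flush the final partial chunk
        cursor.executemany(query, chunk)
        cnx.commit()

cnx.close()

Committing once per chunk instead of once per row also cuts down on transaction overhead.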
Related
I am trying to import the following table to my Postgres server using cursor.copy_from() in psycopg2 because the file is too large.
id | mail               | name
---+--------------------+------------
1  | john123#gmail.com  | John Stokes
2  | emily123#gmail.com | Emily Ray
Here is my code:
import psycopg2
import os
conn = psycopg2.connect(
    dbname=name,
    user=username,
    password=pwd,
    host=hst,
    port=5432
)
cur = conn.cursor()
path = os.getcwd() + '\users.csv'
file = open(path, 'r')
cur.copy_from(file, table_name, sep=',')
conn.commit()
conn.close()
This inserts the data into the table, but there are double quotes in the third column, like below.
id | mail               | name
---+--------------------+---------------
1  | john123#gmail.com  | "John Stokes"
2  | emily123#gmail.com | "Emily Ray"
Later I found out that the problem lies in open() itself, because if I print the first line with file.readline(), I get:
1,john123#gmail.com,"John Stokes"
I don't want these double quotes in my table. I tried using cursor.execute() with a COPY FROM query, but it says that I am not a superuser even though I am.
Use copy_expert. Then you are not working as the server user but as the client user. You can also use WITH CSV, which takes care of the quoting. copy_from and copy_to work using the text format, as described here: COPY.
cat test.csv
1,john123#gmail.com,"John Stokes"
2,emily123#gmail.com,"Emily Ray"
create table test_csv (id integer, mail varchar, name varchar);
import psycopg2
con = psycopg2.connect(dbname="test", host='localhost', user='postgres', port=5432)
cur = con.cursor()
with open('test.csv') as f:
    cur.copy_expert('COPY test_csv FROM stdin WITH CSV', f)
con.commit()
select * from test_csv ;
id | mail | name
----+--------------------+-------------
1 | john123#gmail.com | John Stokes
2 | emily123#gmail.com | Emily Ray
FYI, in psycopg3 (psycopg) this behavior has changed substantially. See psycopg3 COPY for how to handle it in that case.
UPDATE
Using psycopg3, the answer for Python 3.8+ (where the walrus operator is available) would be:
import psycopg

with open('test.csv') as f:
    with cur.copy("COPY test_csv FROM STDIN WITH CSV") as copy:
        while data := f.read(1000):
            copy.write(data)

con.commit()
Or, for Python 3.7 and earlier, something like:
# Function copied from here https://www.iditect.com/guide/python/python_howto_read_big_file_in_chunks.html
def read_in_chunks(file, chunk_size=1024*10):  # Default chunk size: 10k.
    while True:
        chunk = file.read(chunk_size)
        if chunk:
            yield chunk
        else:  # The chunk was empty, which means we're at the end of the file.
            return

with open('test.csv') as f:
    with cur.copy("COPY test_csv FROM STDIN WITH CSV") as copy:
        for chunk in read_in_chunks(f):
            copy.write(chunk)

con.commit()
I imported a txt file in my Python script and then converted it to a dataframe. Then I created a function that uses cx_Oracle to insert my data into an Oracle database faster. It works pretty well and only took 15 minutes to import 1 million+ rows, but it doesn't copy the values as-is. This is a chunk of that code:
sqlquery = 'INSERT INTO {} VALUES({})'.format(tablename, inserttext)
df_list = df.values.tolist()
cur = con.cursor()
cur.execute(sql_query1)
logger.info("Completed: %s", sql_query1)
for b in df_list:
    for index, value in enumerate(b):
        if isinstance(value, float) and math.isnan(value):
            b[index] = None
        elif isinstance(value, type(pd.NaT)):
            b[index] = None
Here is a sample of the data I expected:
DATE                            | STORE | COST        | PARTIAL
--------------------------------+-------+-------------+--------
16-JUN-21 08.00.00.000000000 PM | 00006 | +00000.0082 | false
But instead, this is what gets imported:
DATE      | STORE | COST   | PARTIAL
----------+-------+--------+--------
16-JUN-21 | 6     | 0.0082 | F
I need it to be exactly the same, with zeros, symbols, etc. I've already tried converting the dataframe to strings with df = df.astype(str), but it doesn't work.
Hopefully you can help!
Without going into whether the schema design and architecture are really what you should be using: with this schema:
create table t (d varchar2(31), s varchar2(6), c varchar(12), p varchar(5));
and this data in t.csv:
16-JUN-21 08.00.00.000000000 PM,00006,+00000.0082,false
and this code:
import cx_Oracle
import os
import sys
import csv
if sys.platform.startswith("darwin"):
    cx_Oracle.init_oracle_client(lib_dir=os.environ.get("HOME") + "/Downloads/instantclient_19_8")

username = os.environ.get("PYTHON_USERNAME")
password = os.environ.get("PYTHON_PASSWORD")
connect_string = os.environ.get("PYTHON_CONNECTSTRING")

connection = cx_Oracle.connect(username, password, connect_string)

with connection.cursor() as cursor:

    # Predefine the memory areas to match the table definition
    cursor.setinputsizes(31, 6, 12, 5)

    # Adjust the batch size to meet your memory and performance requirements
    batch_size = 10000

    with open('t.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into t (d, s, c, p) values (:1, :2, :3, :4)"
        data = []
        for line in csv_reader:
            data.append(line)
            if len(data) % batch_size == 0:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        connection.commit()

with connection.cursor() as cursor:
    sql = """select * from t"""
    for r in cursor.execute(sql):
        print(r)
the output is:
('16-JUN-21 08.00.00.000000000 PM', '00006', '+00000.0082', 'false')
For general reference see the cx_Oracle documentation Batch Statement Execution and Bulk Loading.
I have many large 1-D HDF5 datasets with the following properties:
init size = (5201,),
maxshape = (6000000,),
dtype ='float32'
chunks = (10000,)
compression = "gzip"
Path example: file["Group"]["1"]["Group1"]["2"]["Dataset"]
I want to move them into PostgreSQL. I have dealt with the database structure and inserting data, but each fill takes ~650 seconds for a 72.4 MB HDF5 file. Can someone give me tips/advice on how I can improve the performance?
What I have now:
def fill_database(self, dog):
    if isinstance(dog, h5py.Dataset):
        name = dog.name.split('/')
        table_name = '{}_{}'.format(name[3], name[5])
        data = dog.value.astype(int).tolist()
        self.cur.execute('CREATE TABLE IF NOT EXISTS {} (cur_id INT PRIMARY KEY, data INT[]);'.format(table_name))
        self.cur.execute('INSERT INTO {} VALUES (%s, %s)'.format(table_name), (name[2], data))
    if isinstance(dog, h5py.Group):
        for k, v in dict(dog).items():
            self.fill_database(v)
What I tried:
import psycopg2
import h5py
from itertools import islice
with h5py.File('full_db.hdf5') as hdf5file:
    with psycopg2.connect(database='hdf5', user='postgres', password='pass', port=5432) as conn:
        cur = conn.cursor()
        cur.execute('drop table if EXISTS mytable;')
        cur.execute('create table mytable (data INT[]);')
        chunksize = 10000
        t = iter(hdf5file["Group"]["1"]["Group1"]["2"]["Dataset"][:].astype(int))
        rows = islice(t, chunksize)
        while rows:
            statement = "INSERT INTO mytable(data) VALUES {}".format(rows)  # I stuck here
            cur.execute(statement)
            rows = islice(t, chunksize)
        conn.commit()
I also tried to do something with LIMIT in PostgreSQL and many other approaches, but I was not successful.
I think some of the problem may be because of the arrays in the database; I use them for more convenient output later.
After almost two weeks I think I can answer my own question.
While searching for an answer, I came across this page: https://github.com/psycopg/psycopg2/issues/179
Also, after reading the documentation, I understood that copying from a file works even quicker, so I tried to use the StringIO module. This is what I got:
import h5py
import psycopg2
import time
from io import StringIO
conn = psycopg2.connect(database='hdf5', user='postgres', password=' ')
cur = conn.cursor()
file = h5py.File('db.hdf5', 'r')
data_set = file['path/to/large/data_set'].value.astype(int).tolist()
cur.execute('DROP TABLE IF EXISTS table_test;')
cur.execute('CREATE TABLE table_test (data INTEGER[]);')
# ORIGINAL
start = time.time()
cur.execute('INSERT INTO table_test VALUES (%s);', (data_set,))
print('Original: {} sec'.format(round(time.time() - start, 2)))
# STRING FORMAT
start = time.time()
data_str = ','.join(map(str, data_set)).replace('[', '{').replace(']', '}')
cur.execute('INSERT INTO table_test VALUES (ARRAY[{}]);'.format(data_str))
print('String format: {} sec'.format(round(time.time() - start, 2)))
# STRING IO COPY
start = time.time()
data_str = ','.join(map(str, data_set)).replace('[', '{').replace(']', '}')
data_io = StringIO('{{{}}}'.format(data_str))
cur.copy_from(data_io, 'table_test')
print('String IO: {} sec'.format(round(time.time() - start, 2)))
conn.commit()
This gives me the following results for a dataset with shape (1200201,):
Original: 1.27 sec
String format: 0.58 sec
String IO: 0.3 sec
I have a csv file like this:
nohaelprince#uwaterloo.ca, 01-05-2014
nohaelprince#uwaterloo.ca, 01-05-2014
nohaelprince#uwaterloo.ca, 01-05-2014
nohaelprince#gmail.com, 01-05-2014
I am reading the above csv file and extracting the domain name, as well as the count of email addresses by domain name and date. I need to insert all of these into a MySQL table called domains.
Below is the code, which gives me the error TypeError: not enough arguments for format string when I try to insert into the domains table.
#!/usr/bin/python
import fileinput
import csv
import os
import sys
import time
import MySQLdb
from collections import defaultdict, Counter

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    currentdirpath = os.getcwd()
    # get current working directory path
    filepath = os.path.join(currentdirpath, filename)
    return filepath

# ===========================================================
def read_CSV(filepath):
    with open('emails.csv') as f:
        reader = csv.reader(f)
        for row in reader:
            domain_counts[row[0].split('#')[1].strip()][row[1]] += 1

    db = MySQLdb.connect(host="localhost",    # your host, usually localhost
                         user="root",         # your username
                         passwd="abcdef1234", # your password
                         db="test")           # name of the data base
    cur = db.cursor()

    q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%d-%m-%Y'))"""

    for domain, data in domain_counts.iteritems():
        for email_date, email_count in data.iteritems():
            cur.execute(q, (domain, email_count, email_date))
    db.commit()

# ======================= main program =======================================
path = get_file_path('emails.csv')
read_CSV(path)  # read the input file
What am I doing wrong?
As of now, the data type of the date_of_entry column is DATE in MySQL.
You need the "%d-%m-%Y" in your SQL statement in exactly this form, but Python (or rather the execute command) first tries to use it for string formatting and throws this error.
I think you have to escape it; try the following:
q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""
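As a minimal sketch of how that escaped query would then be used with the parameterized execute call from the question (assuming the same db/cur objects; the sample values are placeholders):

q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""

# The driver fills the three %s placeholders and collapses %% to %,
# so MySQL receives STR_TO_DATE('01-05-2014', '%d-%m-%Y').
cur.execute(q, ('gmail.com', 1, '01-05-2014'))
db.commit()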
Try this:
q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, 'd-m-Y'))"""
So we have changed STR_TO_DATE(%s, '%d-%m-%Y') to STR_TO_DATE(%s, 'd-m-Y').
It is detecting the %s as a format string and failing on that. You need to surround it with quotes, I guess:
INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE('%s', '%d-%m-%Y'))
I have created a Python script that reads a csv file and then stores the data in variables using a dictionary.
Then I insert the variables into a MySQL database.
It works 100% every time, except when I try to insert a date.
I get the error:
Out of range value for column 'date'
I printed the variable date, which is: 2015-02-28
It is exactly what I need, but I still get the error message.
Also, it inserts the value 0000-00-00 instead of 2015-02-28 into my table.
I think the problem is that 2015-02-28 might be a string. How can I convert it to a date?
This is my python script:
#4 python script to insert all data to mysql
#!/usr/bin/python
from StringIO import StringIO
import numpy as np
import csv
import MySQLdb
import os
from datetime import datetime, date, timedelta

dict = {}
infile = open('csv_err1.log', 'r')
lines = infile.readlines()
for i in lines:
    eventKey, count, totalDuration, average = [a.strip() for a in i.split(',')]
    dict.setdefault(eventKey, []).append((int(count), int(totalDuration), float(average)))

date = date.today() - timedelta(1)

app_launch_time = dict["app_launch_time"][0][0]
bup_login_error = dict["bup_login_error"][0][0]
crash = dict["crash"][0][0]
parental_controls_error = dict["parental_controls_error"][0][0]
playback_error = dict["playback_error"][0][0]
qp_library_failed_reauthentication = dict["qp_library_failed_reauthentication"][0][0]
qp_library_failed_to_start = dict["qp_library_failed_to_start"][0][0]
search_error = dict["search_error"][0][0]
video_load_time = dict["video_load_time"][0][0]
tbr_error = dict["tbr_error"][0][0]
live_channels_metadata_request_failed = dict["live_channels_metadata_request_failed"][0][0]
vod_catalog_metadata_request_failed = dict["vod_catalog_metadata_request_failed"][0][0]
app_launch_time_avg = dict["app_launch_time"][0][2]
video_load_time_avg = dict["video_load_time"][0][2]

print date

# Open database connection
db = MySQLdb.connect(host="localhost", user="root", passwd="bravoecholimalima", db="capacityreports_mobiletv")
cursor = db.cursor()

# Prepare SQL query to INSERT a record into the database.
sql = ("""INSERT INTO errorscounted (date,app_launch_time,bup_login_error,crash,parental_controls_error,playback_error,qp_library_failed_reauthentication,qp_library_failed_to_start,search_error,video_load_time,tbr_error,live_channels_metadata_request_failed,vod_catalog_metadata_request_failed,app_launch_time_avg,video_load_time_avg) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""" % (date, app_launch_time, bup_login_error, crash, parental_controls_error, playback_error, qp_library_failed_reauthentication, qp_library_failed_to_start, search_error, video_load_time, tbr_error, live_channels_metadata_request_failed, vod_catalog_metadata_request_failed, app_launch_time_avg, video_load_time_avg))

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
    for row in cursor.fetchall():
        print row[0]
except:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
cursor.close()
db.close()
any help or pointers would be greatly appreciated :)
Edit:
This is my csv_err1.log file:
app_launch_time,12247,118616277,9685.33
video_load_time,12966,123702815,9540.55
eventKey,2,0,0
playback_error,3773,0,0
qp_library_failed_reauthentication,230,0,0
search_error,183,0,0
epg_metadata_request_failed,5,0,0
live_channels_metadata_request_failed,13,0,0
vod_catalog_metadata_request_failed,1,0,0
bup_login_error,20,0,0
qp_library_failed_to_start,295,0,0
0,9,0,0
tbr_error,389,0,0
crash,218,0,0
parental_controls_error,123,0,0
I finally found the solution to my problem! :)
I couldn't insert a Python date variable into the MySQL table; for some strange reason they weren't compatible.
So I wrote SQL that gets yesterday's date and inserts it, without having to use a Python date variable.
I simply used:
DATE_ADD(CURDATE(), INTERVAL -1 day)
Here's my code, 100% bulletproof:
#!/usr/bin/python
from StringIO import StringIO
import numpy as np
import csv
import MySQLdb
import os

dict = {}
infile = open('csv_err1.log', 'r')
lines = infile.readlines()
for i in lines:
    eventKey, count, totalDuration, average = [a.strip() for a in i.split(',')]
    dict.setdefault(eventKey, []).append((int(count), int(totalDuration), float(average)))

app_launch_time = dict["app_launch_time"][0][0]
bup_login_error = dict["bup_login_error"][0][0]
crash = dict["crash"][0][0]
parental_controls_error = dict["parental_controls_error"][0][0]
playback_error = dict["playback_error"][0][0]
qp_library_failed_reauthentication = dict["qp_library_failed_reauthentication"][0][0]
qp_library_failed_to_start = dict["qp_library_failed_to_start"][0][0]
search_error = dict["search_error"][0][0]
video_load_time = dict["video_load_time"][0][0]
tbr_error = dict["tbr_error"][0][0]
live_channels_metadata_request_failed = dict["live_channels_metadata_request_failed"][0][0]
vod_catalog_metadata_request_failed = dict["vod_catalog_metadata_request_failed"][0][0]
app_launch_time_avg = dict["app_launch_time"][0][2]
video_load_time_avg = dict["video_load_time"][0][2]

print ("app_launch_time", app_launch_time)
print ("bup_login_error", bup_login_error)
print ("crash", crash)
print ("parental_controls_error", parental_controls_error)
print ("playback_error", playback_error)
print ("qp_library_failed_reauthentication", qp_library_failed_reauthentication)
print ("qp_library_failed_to_start", qp_library_failed_to_start)
print ("search_error", search_error)
print ("video_load_time", video_load_time)
print ("tbr_error", tbr_error)
print ("live_channels_metadata_request_failed", live_channels_metadata_request_failed)
print ("vod_catalog_metadata_request_failed", vod_catalog_metadata_request_failed)
print ("app_launch_time_avg", app_launch_time_avg)
print ("video_load_time_avg", video_load_time_avg)

# Open database connection
db = MySQLdb.connect(host="localhost", user="root", passwd="bravoecholimalima", db="capacityreports_mobiletv")
cursor = db.cursor()

# Prepare SQL query to INSERT a record into the database.
sql = ("""INSERT INTO errorscounted (date,app_launch_time,bup_login_error,crash,parental_controls_error,playback_error,qp_library_failed_reauthentication,qp_library_failed_to_start,search_error,video_load_time,tbr_error,live_channels_metadata_request_failed,vod_catalog_metadata_request_failed,app_launch_time_avg,video_load_time_avg) VALUES(DATE_ADD(CURDATE(), INTERVAL -1 day),%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""" % (app_launch_time, bup_login_error, crash, parental_controls_error, playback_error, qp_library_failed_reauthentication, qp_library_failed_to_start, search_error, video_load_time, tbr_error, live_channels_metadata_request_failed, vod_catalog_metadata_request_failed, app_launch_time_avg, video_load_time_avg))

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
    for row in cursor.fetchall():
        print row[0]
except:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
cursor.close()
db.close()