Help with MySQL LOAD DATA INFILE - python

I want to load a CSV file that looks like this:
Acct. No.,1-15 Days,16-30 Days,31-60 Days,61-90 Days,91-120 Days,Beyond 120 Days
2314134101,898.89,8372.16,5584.23,7744.41,9846.54,2896.25
2414134128,5457.61,7488.26,9594.02,6234.78,273.7,2356.13
2513918869,2059.59,7578.59,9395.51,7159.15,5827.48,3041.62
1687950783,4846.85,8364.22,9892.55,7213.45,8815.33,7603.4
2764856043,5250.11,9946.49,8042.03,6058.64,9194.78,8296.2
2865446086,596.22,7670.04,8564.08,3263.85,9662.46,7027.22
,4725.99,1336.24,9356.03,1572.81,4942.11,6088.94
,8248.47,956.81,8713.06,2589.14,5316.68,1543.67
,538.22,1473.91,3292.09,6843.89,2687.07,9808.05
,9885.85,2730.72,6876,8024.47,1196.87,1655.29
As you can see, some of the fields are incomplete. I'm assuming MySQL will just skip rows where the first column is missing. When I run the command:
LOAD DATA LOCAL INFILE 'test-long.csv' REPLACE INTO TABLE accounts
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(cf_535, cf_580, cf_568, cf_569, cf_571, cf_572);
And the MySQL output is:
Query OK, 41898 rows affected, 20948 warnings (0.78 sec)
Records: 20949 Deleted: 20949 Skipped: 0 Warnings: 20948
The file has only 20,949 lines, but MySQL reports 41,898 rows affected. Why is that? Also, nothing actually changed in the table, and I couldn't see what the generated warnings were about. I wanted to use LOAD DATA INFILE because it takes Python half a second to update each row, which translates to 2.77 hours for a file with 20,000+ records.
UPDATE: Modified the code to set auto-commit to 'False' and added a db.commit() statement:
# Tell MySQLdb to turn off auto-commit
db.autocommit(False)

# Set count to 1
count = 1
while count < len(contents):
    if contents[count][0] != '':
        cursor.execute("""
            UPDATE accounts SET cf_580 = %s, cf_568 = %s, cf_569 = %s, cf_571 = %s, cf_572 = %s
            WHERE cf_535 = %s""" % (contents[count][1], contents[count][2], contents[count][3], contents[count][4], contents[count][5], contents[count][0]))
    count += 1

try:
    db.commit()
except:
    db.rollback()

You have basically three issues here, in reverse order:
Are you doing your Python inserts as individual statements? You probably want to surround them all with a begin transaction/commit (see the sketch below); 20,000 separate commits could easily take hours.
Your import statement defines 6 fields, but the CSV has 7 fields. That would explain the doubled row count: every line of input results in 2 rows in the database, the second one with fields 2-6 NULL.
Incomplete rows will be inserted with NULL or default values for the missing columns. This may not be what you want for those malformed rows.
If your Python program can't perform fast enough even with a single transaction, you should at least have it edit/clean the data file before importing. If Acct. No. is the primary key, as seems reasonable, inserting rows with a blank key will either cause the whole import to fail or, if auto-numbering is on, cause bogus data to be imported.
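On the transaction point, here is a rough sketch (untested against your schema) of running all the updates from your contents list inside one transaction with executemany, so there is a single commit for the whole file:
# Assumes db is an open MySQLdb connection and contents is the parsed CSV,
# with contents[0] being the header row.
db.autocommit(False)              # one transaction instead of a commit per row
cursor = db.cursor()
params = [(row[1], row[2], row[3], row[4], row[5], row[0])
          for row in contents[1:]
          if row[0] != '']        # skip rows with no account number
try:
    cursor.executemany("""
        UPDATE accounts
        SET cf_580 = %s, cf_568 = %s, cf_569 = %s, cf_571 = %s, cf_572 = %s
        WHERE cf_535 = %s""", params)
    db.commit()
except Exception:
    db.rollback()
    raise
Passing the values as parameters (rather than %-formatting them into the string) also lets MySQLdb handle the quoting for you.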

If you use the REPLACE keyword in LOAD DATA, the number after "Deleted:" shows how many rows were actually replaced.


mySQL Load Data - Row 1 doesn't contain data for all columns

I've looked at many similar questions on this topic, but none appear to apply.
Here are the details:
I have a table with 8 columns.
create table test (
    node_name varchar(200),
    parent varchar(200),
    actv int(11),
    fid int(11),
    cb varchar(100),
    co datetime,
    ub varchar(100),
    uo datetime
);
There is a trigger on the table:
CREATE TRIGGER before_insert_test
BEFORE INSERT ON test
FOR EACH ROW SET NEW.co = now(), NEW.uo = now(), NEW.cb = user(), NEW.ub = user()
I have a csv file to load into this table. It's got just 2 columns in it.
First few rows:
node_name,parent
West,
East,
BBB: someone,West
Quebec,East
Ontario,East
Manitoba,West
British Columbia,West
Atlantic,East
Alberta,West
I have this all set up in a MySQL 5.6 environment. Using Python and SQLAlchemy, I run the load of the file without issue: it loads all records, with empty strings for the second field in the first two records. All as expected.
I have a MySQL 8 environment and run the exact same routine, with all the same statements, etc. It fails with the 'Row 1 doesn't contain data for all columns' error.
The connection is made using this:
engine = create_engine(
    connection_string,
    pool_size=6, max_overflow=10, encoding='latin1', isolation_level='AUTOCOMMIT',
    connect_args={"local_infile": 1}
)
db_connection = engine.connect()
The command I place in the sql variable is:
LOAD DATA INFILE 'test.csv'
INTO TABLE test
FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 LINES SET fid = 526, actv = 1;
And execute it with:
db_connection.execute(sql)
So: I basically load the first two columns from the file, set the next two columns in the LOAD statement, and the final four are handled by the trigger.
I repeat - this works fine in the MySQL 5 environment, but not in MySQL 8.
I checked the MySQL character set variables in both environments, and they are equivalent (just in case the default character set change between 5.6 and 8 had an impact).
I will say that the MySQL 5 db is running on Ubuntu 18.04.5 while MySQL 8 is running on Ubuntu 20.02.2 - could there be something there?
I have tried all sorts of fiddling with the LOAD DATA statement. I tried filling in data for the first two records in the file in case that was it, and I tried using different line terminators in the LOAD statement. I'm at a loss for the next thing to look into.
Thanks for any pointers.
Unless you tell it otherwise, MySQL assumes that each row in your CSV supplies a value for every column in the table.
Give the query a column list:
LOAD DATA INFILE 'test.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
IGNORE 1 LINES
(node_name, parent)
SET fid = 526, actv = 1;
In addition to Tangentially Perpendicular's answer, there are other options:
add the IGNORE keyword, as per https://dev.mysql.com/doc/refman/8.0/en/sql-mode.html#ignore-effect-on-execution; it should come just before the INTO in the LOAD DATA statement, as per https://dev.mysql.com/doc/refman/8.0/en/load-data.html (see the sketch below);
or alter the sql_mode to be less strict, which will also work.
Due to the strict sql_mode, LOAD DATA isn't smart enough to realize that triggers are handling a couple of the columns. It would be nice if it were enhanced to be that smart, but alas.
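For what it's worth, a sketch of the IGNORE variant using the same db_connection as in the question (the rest of the statement is unchanged):
# Sketch only: the IGNORE keyword goes between the file name and INTO TABLE.
# Under a strict sql_mode it downgrades the "Row 1 doesn't contain data for
# all columns" error to a warning; the columns missing from the file fall back
# to their defaults, and the BEFORE INSERT trigger still fills cb/co/ub/uo.
sql = """
    LOAD DATA INFILE 'test.csv'
    IGNORE INTO TABLE test
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    IGNORE 1 LINES
    SET fid = 526, actv = 1
"""
db_connection.execute(sql)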

Wrapping INSERTs in BEGIN TRANSACTION and COMMIT

I have some code from my class. It's about building one database from another. There is an INSERT step, and it takes a really long time. I read the FAQ, and I know that I need to put BEGIN TRANSACTION and COMMIT around the multiple INSERTs, but I swear I tried every placement of c.execute("BEGIN TRANSACTION") and c.execute("COMMIT") - always the same ~5 kB/s. Please show me the proper place for those instructions, or tell me what else could be the problem.
For the record - I'm working with 5400 RPM hard drive.
Here is original code:
import sqlite3
conn = sqlite3.connect('/path/to/database.db')
c = conn.cursor()
with open('sqlite-sakila-schema.sql', 'r', encoding='utf-8') as create_file:
    create_query = create_file.read()

with open('sqlite-sakila-insert-data.sql', 'r', encoding='utf-8') as insert_file:
    insert_query = insert_file.read()
c.executescript(create_query)
c.executescript(insert_query)
conn.commit()
conn.close()
edited:
first file:
https://raw.githubusercontent.com/jOOQ/jOOQ/master/jOOQ-examples/Sakila/sqlite-sakila-db/sqlite-sakila-schema.sql
second one:
https://raw.githubusercontent.com/jOOQ/jOOQ/master/jOOQ-examples/Sakila/sqlite-sakila-db/sqlite-sakila-insert-data.sql
It is all about the INSERTs. There are a couple of tables; the whole SQL file starts by deleting from those tables and then has 231K lines of INSERT statements like the ones below.
Insert into language
(language_id,name,last_update)
Values
('1','English','2006-02-15 05:02:19.000')
;
Insert into language
(language_id,name,last_update)
Values
('2','Italian','2006-02-15 05:02:19.000')
;
Insert into language
(language_id,name,last_update)
Values
('3','Japanese','2006-02-15 05:02:19.000')
;
Insert into language
(language_id,name,last_update)
Values
('4','Mandarin','2006-02-15 05:02:19.000')
;
Try combining your INSERT queries into a single query:
INSERT into language (language_id,name,last_update) VALUES
('1','English','2006-02-15 05:02:19.000'),
('2','Italian','2006-02-15 05:02:19.000'),
('3','Japanese','2006-02-15 05:02:19.000'),
('4','Mandarin','2006-02-15 05:02:19.000'),
...
;
SQLite has a limit on the size of a single query, which is the value of SQLITE_MAX_SQL_LENGTH and defaults to 1,000,000 bytes. So you'll need to either increase this limit or split this query up into groups that fit into the limit. Doing them in groups of something like 1,000 rows will probably make a noticeable difference.
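As a rough sketch, assuming the individual INSERTs have already been parsed into a list of (language_id, name, last_update) tuples (the parsing itself is omitted here), the grouped version could look like this:
import sqlite3

# Hypothetical input: the single-row INSERTs parsed into tuples.
rows = [
    ('1', 'English', '2006-02-15 05:02:19.000'),
    ('2', 'Italian', '2006-02-15 05:02:19.000'),
    # ...
]

conn = sqlite3.connect('/path/to/database.db')
c = conn.cursor()

CHUNK = 300  # 300 rows x 3 values = 900 bound parameters, which stays under
             # SQLite's historical 999-variable default as well as the length limit
for i in range(0, len(rows), CHUNK):
    chunk = rows[i:i + CHUNK]
    placeholders = ','.join(['(?,?,?)'] * len(chunk))
    flat = [value for row in chunk for value in row]
    c.execute('INSERT INTO language (language_id, name, last_update) VALUES ' + placeholders, flat)

conn.commit()  # everything above runs in a single transaction
conn.close()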

SQLite3 Columns Are Not Unique

I'm inserting data from some csv files into my SQLite3 database with a python script I wrote. When I run the script, it inserts the first row into the database, but gives this error when trying to insert the second:
sqlite3.IntegrityError: columns column_name1, column_name2 are not unique.
It is true the values in column_name1 and column_name2 are same in the first two rows of the csv file. But, this seems a bit strange to me, because reading about this error indicated that it signifies a uniqueness constraint on one or more of the database's columns. I checked the database details using SQLite Expert Personal, and it does not show any uniqueness constraints on the current table. Also, none of the fields that I am entering specify the primary key. It seems that the database automatically assigns those. Any thoughts on what could be causing this error? Thanks.
import sqlite3
import csv

if __name__ == '__main__':
    conn = sqlite3.connect('ts_database.sqlite3')
    c = conn.cursor()
    fileName = "file_name.csv"
    f = open(fileName)
    csv_f = csv.reader(f)
    for row in csv_f:
        command = "INSERT INTO table_name(column_name1, column_name2, column_name3)"
        command += " VALUES (%s, '%s', %s);" % (row[0], row[1], row[2])
        print command
        c.execute(command)
    conn.commit()
    f.close()
If SQLite is reporting an IntegrityError, it's very likely that there really is a PRIMARY KEY or UNIQUE constraint on those two columns and that you are mistaken when you state there is not. Make sure that you're really looking at the same instance of the database.
Also, do not write your SQL statement using string interpolation. It's dangerous and also difficult to get correct (as you probably know, considering you have single quotes around only one of the fields). Using parameterized statements in SQLite is very, very simple.
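For example, a sketch of the loop from the question rewritten with ? placeholders (keeping the placeholder table and column names from the question):
import sqlite3
import csv

conn = sqlite3.connect('ts_database.sqlite3')
c = conn.cursor()
with open('file_name.csv') as f:
    for row in csv.reader(f):
        # sqlite3 quotes and escapes each value itself
        c.execute("INSERT INTO table_name (column_name1, column_name2, column_name3)"
                  " VALUES (?, ?, ?)", (row[0], row[1], row[2]))
conn.commit()
conn.close()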
The error may also be caused by duplicate column names in the INSERT INTO statement; I am guessing it is a typo and you meant column_name3.

Syntax error when parsing .dta files and attempting to move them to a PostgreSQL server

I am attempting to parse .dta files and enter each row into a separate table. The .dta files are composed of a lot of different variables, and I want to insert each variable into a separate "variable table". I am using the new .dta reader from pandas, StataReader. I do not have a lot of experience with Python and was hoping for a little help with my syntax. I am using Python 2.7.5.
a = 2
t = 1
while t >= 1:
    for date, row in dr.iterrows():
        cur.execute("INSERT INTO (table#'+str(t)') (data) VALUES(%s)" % (row[a]))
    t += 1
    a += 1
    if t == 10:
        break
At the cur.execute line, I get the error:
pg8000.errors.ProgrammingError: ('ERROR', '42601', 'syntax error at or near "("')
Any ideas about what I am doing wrong?
You are generating invalid SQL code. An INSERT statement does not accept parentheses around the table name. To quote a table name (which makes it case sensitive, so be careful), put double quotes around it:
cur.execute('INSERT INTO "table#{}" (data) VALUES (%s)'.format(t), (row[a],))
The above example also uses proper SQL parameters for the row data; you generally want to let the database prepare a generic statement and reuse the prepared statement for each insert. By using SQL parameters you not only ensure that row[a] is properly escaped, but also let the database prepare the generic statement. I used the default paramstyle format for pg8000.
You probably want to rethink your while loop condition; why not test if t < 10 instead?
a = 2
t = 1
while t < 10:
    for date, row in dr.iterrows():
        cur.execute('INSERT INTO "table#{}" (data) VALUES (%s)'.format(t), (row[a],))
    a += 1
    t += 1
or use a python for loop with range() instead:
for t in range(1, 10):
    a = t + 1
    for date, row in dr.iterrows():
        cur.execute('INSERT INTO "table#{}" (data) VALUES (%s)'.format(t), (row[a],))

How to determine if field exists in same table in another field in any other row

I am having trouble finding out if I can even do this. Basically, I have a csv file that looks like the following:
1111,804442232,1
1112,312908721,1
1113,A*2434,1
1114,A*512343128760987,1
1115,3512748,1
1116,1111,1
1117,1234,1
This is imported into an SQLite database in memory for manipulation, and I will be importing multiple files into this database after some manipulation. SQLite lets me keep constraints on the tables and receive errors where needed, without writing extra functions just to check each constraint the way I would if I worked with arrays in Python. I want to do a few things, the first of which is to prepend field2 wherever its value matches an entry in field1.
For example, in the above data, field2 in entry 6 matches field1 in entry 1. In this case I would like to prepend field2 in entry 6 with '555'.
If this is not possible, I believe I could make do with a regex and just do this on every row with 4 digits in field2... though I have yet to successfully get regex working with python/sqlite, as it always throws me an error.
I am working within Python using Sqlite3 to connect/manipulate my sqlite database.
EDIT: I am looking for a method to manipulate the resultant tables which reside in a sqlite database rather than manipulating just the csv data. The data above is just a simple representation of what is contained in the files I am working with. Would it be better to work with arrays containing the data from the csv files? These files have 10,000+ entries and about 20-30 columns.
If you must do it in SQLite, how about this:
First, get the column names of the table by running the following and parsing the result
def get_columns(table_name, cursor):
    cursor.execute('pragma table_info(%s)' % table_name)
    return [row[1] for row in cursor]

conn = sqlite3.connect('test.db')
columns = get_columns('test_table', conn.cursor())
For each of those columns, run the following update, which does the prepending:
def prepend(table, column, reference, prefix, cursor):
    query = '''
        UPDATE %s
        SET %s = '%s' || %s
        WHERE %s IN (SELECT %s FROM %s)
    ''' % (table, column, prefix, column, column, reference, table)
    cursor.execute(query)

reference = 'field1'
[prepend('test_table', column, reference, '555', conn.cursor())
 for column in columns
 if column != reference]
Note that this is expensive: O(n^2) for each column you want to do it for.
As per your edit and Nathan's answer, it might be better to simply work with python's builtin datastructures. You can always insert it into SQLite after.
10,000 entries is not really much so it might not matter in the end. It all depends on your reason for requiring it to be done in SQLite (which we don't have much visibility of).
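For instance, once the rows are fixed up in Python, pushing them into the in-memory database is a one-liner with executemany; a sketch with an illustrative table name and schema:
import sqlite3

rows = [['1111', '804442232', '1'],
        ['1116', '5551111', '1']]   # cleaned rows from the csv

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE data (field1 TEXT, field2 TEXT, field3 TEXT)')  # illustrative
cur.executemany('INSERT INTO data (field1, field2, field3) VALUES (?, ?, ?)', rows)
conn.commit()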
There is no need to use regular expressions for this; just throw the contents of the first column into a set and then iterate through the rows, updating the second field.
first_col_values = set(row[0] for row in rows)
for row in rows:
    if row[1] in first_col_values:
        row[1] = '555' + row[1]
So... I found the answer to my own question after a ridiculous amount of my own searching and trial and error. My unfamiliarity with SQL had me stumped as I was trying all kinds of crazy things. In the end... this was the simple type of solution I was looking for:
prefix="555"
cur.execute("UPDATE table SET field2 = %s || field2 WHERE field2 IN (SELECT field1 FROM table)"% (prefix))
I kept the small amount of python in there but what I was looking for was the SQL statement. Not sure why nobody else came up with something that simple =/. Unsatisfied with the answers so far, I had been searching far and wide for this simple line >_<.
