I am looking for a syntax definition, example, sample code, wiki, etc. for
executing a LOAD DATA LOCAL INFILE command from python.
I believe I can use mysqlimport as well if that is available, so any feedback (and code snippet) on which is the better route is welcome. A Google search is not turning up much in the way of current info.
The goal in either case is the same: Automate loading hundreds of files with a known naming convention & date structure, into a single MySQL table.
David
Well, using python's MySQLdb, I use this:
connection = MySQLdb.Connect(host='**', user='**', passwd='**', db='**')
cursor = connection.cursor()
query = "LOAD DATA INFILE '/path/to/my/file' INTO TABLE sometable FIELDS TERMINATED BY ';' ENCLOSED BY '\"' ESCAPED BY '\\\\'"
cursor.execute( query )
connection.commit()
replacing the host/user/passwd/db as appropriate for your needs. This is based on the MySQL docs here. The exact LOAD DATA INFILE statement will depend on your specific requirements, etc. (note that the FIELDS TERMINATED BY, ENCLOSED BY, and ESCAPED BY clauses will be specific to the type of file you are trying to read in).
You can also get the results of the import by adding the following line after your query:
results = connection.info()
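Since the stated goal is to automate loading hundreds of files with a known naming convention, you can wrap the same statement in a loop. A minimal sketch, assuming the files live under a hypothetical /path/to/data directory and match a pattern like data_*.csv (adjust the pattern, table name, and field options to your files):
import glob
import MySQLdb

connection = MySQLdb.Connect(host='**', user='**', passwd='**', db='**')
cursor = connection.cursor()

# File paths come from glob, not from user input, so interpolating them
# into the statement is acceptable for a one-off load job.
for path in sorted(glob.glob('/path/to/data/data_*.csv')):
    query = ("LOAD DATA INFILE '%s' INTO TABLE sometable "
             "FIELDS TERMINATED BY ';' ENCLOSED BY '\"' ESCAPED BY '\\\\'" % path)
    cursor.execute(query)

connection.commit()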
Related
I'm getting data loss when doing a CSV import using the Python MySQLdb module. The crazy thing is that I can load the exact same CSV using other MySQL clients and it works fine.
It works perfectly fine when running the exact same command with the exact same CSV from the Sequel Pro MySQL client.
It works perfectly fine when running the exact same command with the exact same CSV from the mysql command line.
It doesn't work (some rows truncated) when loading through a Python script using the MySQLdb module.
It's truncating about 10 rows off of my 7019-row CSV.
The command I'm calling:
LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","
When the above command is run using the native mysql client on Linux or the Sequel Pro MySQL client on Mac, it works fine and I get 7019 rows imported.
When the above command is run using Python's MySQLdb module, such as:
dest_cursor.execute( '''LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","''' )
dest_db.commit()
Most rows are imported, but I get a slew of warnings like:
Warning: (1265L, "Data truncated for column '<various_column_names>' at row <various_rows>")
When the warnings pop up, each states at row <row_num>, but I'm not seeing that correlate to a row in the CSV (I think it's the row it's trying to create in the target table, not the row in the CSV), so I can't use that to help troubleshoot.
And sure enough, when it's done, my target table is missing some rows.
Unfortunately, with over 7,000 rows in the CSV it's hard to tell exactly which line it's choking on for further analysis.
There are many rows that are null and/or empty spaces but they are importing fine.
The fact that I can import the entire csv using other MySQL clients makes me feel that the MySQLdb module is not configured right or something.
This is Python 2.7
Any help is appreciated. Any ideas on how to get better visibility into which line it's choking up on would be helpful.
To further help, I would ask you the following.
Error Checking
After your import using any of your three methods, are there any results from running this after each run? SELECT @@GLOBAL.SQL_WARNINGS; (If so, this should show you the errors, as it might be failing silently.)
What is your SQL_MODE? SELECT @@GLOBAL.SQL_MODE;
Check the file and make sure you have an even number of double quotes, for one.
Check the data for extra double quotes, commas, or anything else that may get caught in translation between bash/Python/MySQL (a sketch of reading MySQL's warnings from Python follows this list).
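To surface those warnings from the Python side, you can ask MySQL for them on the same connection right after the load. A minimal sketch, reusing the connection and statement from the question (Python 2.7, MySQLdb):
dest_cursor.execute(
    '''LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","''' )

# SHOW WARNINGS reports (level, code, message) for the previous statement,
# including the row number MySQL was on when it truncated data.
dest_cursor.execute("SHOW WARNINGS")
for level, code, message in dest_cursor.fetchall():
    print level, code, message

dest_db.commit()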
Data Request
Can you provide the data for the 1st row that was missing?
Can you provide the exact script you are using?
Versions
You said you're using Python 2.7.
What version of the MySQL server? SELECT @@GLOBAL.VERSION;
What version of MySQLdb?
Internationalization
Are you dealing with internationalization (汉语 Hànyǔ or русский etc. languages)?
What is the database/schema collation?
Query:
SELECT DISTINCT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE (
SCHEMA_NAME <> 'sys' AND
SCHEMA_NAME <> 'mysql' AND
SCHEMA_NAME <> 'information_schema' AND
SCHEMA_NAME <> '.mysqlworkbench' AND
SCHEMA_NAME <> 'performance_schema'
);
What is the Table collation?
Query:
SELECT DISTINCT ENGINE, TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES
WHERE (
TABLE_SCHEMA <> 'sys' AND
TABLE_SCHEMA <> 'mysql' AND
TABLE_SCHEMA <> 'information_schema' AND
TABLE_SCHEMA <> '.mysqlworkbench' AND
TABLE_SCHEMA <> 'performance_schema'
);
What is the column collation?
Query:
SELECT DISTINCT CHARACTER_SET_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS
WHERE (
TABLE_SCHEMA <> 'sys' AND
TABLE_SCHEMA <> 'mysql' AND
TABLE_SCHEMA <> 'information_schema' AND
TABLE_SCHEMA <> '.mysqlworkbench' AND
TABLE_SCHEMA <> 'performance_schema'
);
Lastly
Check the Database
For connection collation/character_set
SHOW VARIABLES
WHERE VARIABLE_NAME LIKE 'CHARACTER\_SET\_%' OR
VARIABLE_NAME LIKE 'COLLATION%';
If the first two ways work without error, then I'm leaning toward:
Other Plausible Concerns
I am not ruling out problems with any of the following:
possible Python connection configuration issues around (see the connection sketch after this list):
python-to-db connection collation
default connection timeout
default character set error
Python/bash runtime interpolation of symbols causing a random hidden surprise
db collation not set to handle foreign languages
exceeding the maximum field lengths
hidden or Unicode characters
emoji processing
issues with the data, as mentioned above, with double quotes and commas, and I forgot to mention line endings for Windows vs. Linux (carriage return vs. newline)
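For the connection-level items above, this is roughly how you would pin those settings down when opening the MySQLdb connection (the charset value is an assumption; match it to your table's collation):
import MySQLdb

dest_db = MySQLdb.connect(
    host='**', user='**', passwd='**', db='**',
    charset='utf8',      # connection character set (assumption: table uses utf8)
    use_unicode=True,
    local_infile=1,      # needed for LOAD DATA LOCAL INFILE
    connect_timeout=30,  # explicit timeout rather than the default
)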
All in all, there is a lot to look at, and I need more information to further assist.
Please update your question when you have more information and I will do the same for my answer to help you resolve your error.
Hope this helps and all goes well!
Update:
Your Error
Warning: (1265L, "Data truncated for column
Leads me to believe it is the double quotes around your "field terminations". Check to make sure your data does NOT have commas inside the fields that errored out. Those will cause your data to shift when running from the command line, as the GUI is "smart enough", so to speak, to deal with this, but the command line is literal!
This is an embarrassing one but maybe I can help someone in the future making horrible mistakes like I have.
I spent a lot of time analyzing fields, checking for special characters, etc and it turned out I was simply causing the problem myself.
I had spaces in the CSV and was NOT using ENCLOSED BY in the load statement. This means I was adding a space character to some fields, thus overflowing some columns. So the data looked like value1, value2, value3 when it should have been value1,value2,value3. Removing those spaces, putting quotes around the fields, and adding ENCLOSED BY to my statement fixed this.
I assume that the clients that were working were sanitizing the data behind the scenes or something. I really don't know for sure why it was working elsewhere using the same csv but that got me through the first set of hurdles.
Then, after getting through that, the last line in the CSV was choking, stating Row doesn't contain data for all columns - it turns out I didn't close() the file after creating it before attempting to load it, so there was some sort of lock on the file. Once I added the close() statement and fixed the spacing issue, all the data is loading now.
Sorry for anyone that spent any measure of time looking into this issue for me.
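For anyone hitting the same two issues, here is roughly the shape the fix took: close the file after writing it, and add ENCLOSED BY to the load statement (paths and table name are the placeholders from the question):
# Write the CSV, then close() it so everything is flushed to disk before loading.
out = open('/path/to/load.txt', 'w')
out.write('"value1","value2","value3"\n')   # fields quoted, no stray spaces
out.close()

dest_cursor.execute(
    '''LOAD DATA LOCAL INFILE '/path/to/load.txt'
       REPLACE INTO TABLE tble_name
       FIELDS TERMINATED BY "," ENCLOSED BY '"' ''' )
dest_db.commit()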
I have written a query which has some string replacements. I am trying to update a URL in a table, but the URL has % signs in it, which causes a tuple index out of range exception.
If I print the query and run it manually it works fine, but through peewee it causes an issue. How can I get around this? I'm guessing this is because of the percentage signs?
query = """
update table
set url = '%s'
where id = 1
""" % 'www.example.com?colour=Black%26white'
db.execute_sql(query)
The code you are currently sharing is incredibly unsafe, probably for the same reason as is causing your bug. Please do not use it in production, or you will be hacked.
Generally: you practically never want to use normal string operations like %, +, or .format() to construct a SQL query. Rather, you should use your SQL API/ORM's specific built-in methods for providing dynamic values for a query. In your case of SQLite in peewee, that looks like this:
query = """
update table
set url = ?
where id = 1
"""
values = ('www.example.com?colour=Black%26white',)
db.execute_sql(query, values)
The database engine will automatically take care of any special characters in your data, so you don't need to worry about them. If you ever find yourself encountering issues with special characters in your data, it is a very strong warning sign that some kind of security issue exists.
This is mentioned in the Security and SQL Injection section of peewee's docs.
Wtf are you doing? Peewee supports updates.
Table.update(url=new_url).where(Table.id == some_id).execute()
I'm trying to extract email addresses from text in the column alltext and update the column email with the list of emails found in alltext. The datatype for email is a string array (i.e. text[]).
1) I'm getting the following error and can't seem to find a way around it:
psycopg2.ProgrammingError: syntax error at or near "["
LINE 1: UPDATE comments SET email=['person@email.com', 'other@email.com']
2) Is there a more efficient way to be doing this in the first place? I've experimented some with the PostgreSQL regex documentation but a lot of people seem to think it's not great for this purpose.
def getEmails():
'''Get emails from alltext.
'''
DB = psycopg2.connect("dbname=commentDB")
c = DB.cursor()
c.execute("SELECT id, alltext FROM comments WHERE id < 100")
for row in c:
match = re.findall(r'[\w\.-]+@[\w\.-]+', str(row[1]))
data = {'id':int(row[0]), 'email':match}
c.execute("UPDATE comments SET email=%(email)s WHERE id=%(id)s" % data)
DB.commit()
DB.close()
execute should be passed a list for unnamed arguments, or a dict -- as in this case -- for named arguments, as a second argument, to ensure that it is psycopg2 (via libpq) that is doing all the proper escaping. You are using native Python string interpolation, which is subject to SQL injection and is leading to this error, since it isn't libpq doing the interpolation.
Also, as an aside, your regex won't capture various types of email addresses. One type that immediately comes to mind is the form foo+bar@loopback.edu. The + is technically allowed, and can be used, for example, for filtering email. See this link for more details as to issues that crop up when using regexes for validating/parsing email addresses.
In short, the above link recommends using this regex:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
With the caveat that it is valid only for what the author claims is a valid email address. Still, it's probably a good jumping-off point, and can be adjusted if you have specific cases that differ.
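If you want to try that pattern from Python, note that it is written with upper-case character classes, so compile it case-insensitively. A minimal sketch:
import re

# Case-insensitive so the upper-case ranges also match ordinary lower-case addresses.
EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.IGNORECASE)

text = "Contact foo+bar@loopback.edu or person@email.com for details."
print(EMAIL_RE.findall(text))   # ['foo+bar@loopback.edu', 'person@email.com']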
Edit in response to comment from OP:
The execute line from above would become:
c.execute("UPDATE comments SET email=%(email)s WHERE id=%(id)s", data)
Note that data is now a second argument to execute as opposed to being part of an interpolation operation. This means that Psycopg2 will handle the interpolation and not only avoid the SQL Injection issue, but also properly interpret how the dict should be interpolated into the query string.
Edit in response to follow-up comment from OP:
Yes, the subsequent no results to fetch error is likely because you are using the same cursor. Since you are iterating over the current cursor, trying to use it again in the for loop to do an update interferes with the iteration.
I would declare a new cursor inside the for loop and use that.
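Putting those pieces together, a sketch of how the loop might look with a separate cursor for the updates and the dict passed as execute's second argument (same table and columns as in the question):
import re
import psycopg2

DB = psycopg2.connect("dbname=commentDB")
read_cur = DB.cursor()
write_cur = DB.cursor()   # separate cursor so the SELECT iteration is not disturbed

read_cur.execute("SELECT id, alltext FROM comments WHERE id < 100")
for row in read_cur:
    match = re.findall(r'[\w\.-]+@[\w\.-]+', str(row[1]))
    data = {'id': int(row[0]), 'email': match}
    # psycopg2 adapts the Python list to a PostgreSQL text[] array and escapes it.
    write_cur.execute("UPDATE comments SET email=%(email)s WHERE id=%(id)s", data)

DB.commit()
DB.close()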
I have a table with three columns: id, word, essay. I want to do a query using (?) placeholders. The SQL statement is sql1 = "select id,? from training_data". My code is below:
def dbConnect(db_name,sql,flag):
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
if (flag == "danci"):
itm = 'word'
elif flag == "wenzhang":
itm = 'essay'
n = cursor.execute(sql,(itm,))
res1 = cursor.fetchall()
return res1
However, when I print dbConnect("data.db", sql1, "danci"),
the result I obtain is [(1,'word'),(2,'word'),(3,'word')...]. What I really want to get is [(1,'the content of word column'),(2,'the content of word column')...]. What should I do? Please give me some ideas.
You can't use placeholders for identifiers -- only for literal values.
I don't know what to suggest in this case, as your function takes a database name, an SQL string, and a flag to say how to modify that string. I think it would be better to pass just the first two, and write something like
sql = {
"danci": "SELECT id, word FROM training_data",
"wenzhang": "SELECT id, essay FROM training_data",
}
and then call it with one of
dbConnect("data.db", sql['danci'])
or
dbConnect("data.db", sql['wenzhang'])
But a lot depends on why you are asking dbConnect to decide on the columns to fetch based on a string passed in from outside; it's an unusual design.
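A sketch of what that might look like with the function reduced to two arguments, the caller picking the statement from the dict above:
import sqlite3

def dbConnect(db_name, sql):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    cursor.execute(sql)
    res = cursor.fetchall()
    conn.close()
    return res

sql = {
    "danci": "SELECT id, word FROM training_data",
    "wenzhang": "SELECT id, essay FROM training_data",
}

print(dbConnect("data.db", sql["danci"]))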
Update - SQL Injection
The problems with SQL injection and tainted data are well documented, but here is a summary.
The principle is that, in theory, a programmer can write safe and secure programs as long as all the sources of data are under his control. As soon as they use any information from outside the program without checking its integrity, security is under threat.
Such information ranges from the obvious -- the parameters passed on the command line -- to the obscure -- if the PATH environment variable is modifiable then someone could induce a program to execute a completely different file from the intended one.
Perl provides direct help to avoid such situations with Taint Checking, but SQL Injection is the open door that is relevant here.
Suppose you take the value for a database column from an unverified external source, and that value appears in your program as $val. Then, if you write
my $sql = "INSERT INTO logs (date) VALUES ('$val')";
$dbh->do($sql);
then it looks like it's going to be okay. For instance, if $val is set to 2014-10-27 then $sql becomes
INSERT INTO logs (date) VALUES ('2014-10-27')
and everything's fine. But now suppose that our data is being provided by someone less than scrupulous or downright malicious, and your $val, having originated elsewhere, contains this
2014-10-27'); DROP TABLE logs; SELECT COUNT(*) FROM security WHERE name != '
Now it doesn't look so good. $sql is set to this (with added newlines)
INSERT INTO logs (date) VALUES ('2014-10-27');
DROP TABLE logs;
SELECT COUNT(*) FROM security WHERE name != '')
which adds an entry to the logs table as before, and then goes ahead and drops the entire logs table and counts the number of records in the security table. That isn't what we had in mind at all, and is something we must guard against.
The immediate solution is to use placeholders ? in a prepared statement, passing the actual values later in a call to execute. This not only speeds things up, because the SQL statement can be prepared (compiled) just once, but also protects the database from malicious data by quoting every supplied value appropriately for its data type, and escaping any embedded quotes so that it is impossible to close one statement and open another.
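The same idea in Python's DB-API looks like this; a minimal sketch using sqlite3 and an in-memory database, with the malicious value passed as a parameter rather than interpolated:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (date TEXT)")

# The driver quotes and escapes the value, so the embedded quote and
# semicolon are stored as data instead of terminating the statement.
val = "2014-10-27'); DROP TABLE logs; --"
conn.execute("INSERT INTO logs (date) VALUES (?)", (val,))
conn.commit()

print(conn.execute("SELECT date FROM logs").fetchall())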
This whole concept was humorously captured in Randall Munroe's excellent XKCD comic.
I'm developing a webapp in Django, and for its database I need to import a CSV file into a particular MySQL database.
I searched around a bit, and found many pages which listed how to do this, but I'm a bit confused.
Most pages say to do this:
LOAD DATA INFILE '<file>' INTO TABLE <tablename>
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
But I'm confused how Django would interpret this, since we haven't mentioned any column names here.
I'm new to Django and even newer to databasing, so I don't really know how this would work out.
It looks like you are in the database admin shell (i.e. PostgreSQL/MySQL). Others above have given a good explanation for that.
But if you want to import data into Django itself -- Python has its own csv implementation, like so: import csv.
But if you're new to Django, then I recommend installing something like the Django CSV Importer: http://django-csv-importer.readthedocs.org/en/latest/index.html. (You install the add-ons into your Python library.)
The author, unfortunately, has a typo in the docs, though. You have to do from csvImporter.model import CsvDbModel, not from csv_importer.model import CsvDbModel.
In your models.py file, create something like:
class MyCSVModel(CsvDbModel):

    class Meta:
        dbModel = Model_You_Want_To_Reference
        delimiter = ","
        has_header = True
Then, go into your Python shell and do the following command:
my_csv = MyCsvModel.import_data(data = open("my_csv_file_name.csv"))
This isn't Django code, and Django does not care what you call the columns in your CSV file. This is SQL you run directly against your database via the DB shell. You should look at the MySQL documentation for more details, but it will just take the columns in order as they are defined in the table.
If you want more control, you could write some Python code using the csv module to load and parse the file, then add it to the database via the Django ORM. But this will be much, much slower than the SQL way.
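A minimal sketch of that slower-but-flexible route, assuming a hypothetical model named Person with name and age fields and a CSV whose columns appear in that order:
import csv

from myapp.models import Person   # hypothetical app and model

with open("people.csv") as fh:
    reader = csv.reader(fh)
    rows = [Person(name=name, age=age) for name, age in reader]

# bulk_create issues far fewer queries than saving each row one at a time.
Person.objects.bulk_create(rows)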
It will likely just add the data to the columns in order, since they are omitted from your SQL statement.
If you want, you can add the fields to the end of the SQL:
LOAD DATA INFILE '<file>' INTO TABLE <tablename>
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(@Field1, @Field2, @Field3) /* Fields in CSV */
SET Col1 = @Field1, Col2 = @Field2, Col3 = @Field3; /* Columns in DB */
More in-depth analysis of the LOAD DATA command at MySQL.com
The command is interpreted by MySQL, not Django. As stated in the manual:
By default, when no column list is provided at the end of the LOAD DATA INFILE statement, input lines are expected to contain a field for each table column. If you want to load only some of a table's columns, specify a column list:
LOAD DATA INFILE 'persondata.txt' INTO TABLE persondata (col1,col2,...);
You must also specify a column list if the order of the fields in the input file differs from the order of the columns in the table. Otherwise, MySQL cannot tell how to match input fields with table columns.