Currently I'm using SQLAlchemy as an ORM, and I'm looking for a way to speed up my insert operations. I have a bundle of XML files to import.
for name in names:
    p = Product()
    p.name = name
    session.add(p)
    session.commit()
I use the above code to insert the data parsed from the batch XML files into MySQL, and it's very slow.
I also tried:
for name in names:
    p = Product()
    p.name = name
    session.add(p)
    session.commit()
but it didn't seem to change anything.
You could bypass the ORM for the insertion operation and use the SQL Expression generator instead.
Something like:
conn.execute(Product.insert(), [dict(name=name) for name in names])
That should create a single statement to do your inserting.
That example was taken from lower down the same page.
(I'd be interested to know what speedup you got from that)
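For reference, a minimal sketch of that approach, assuming Product is a declarative ORM class and "session" is an open Session bound to the MySQL engine (names taken from the question, not verified):

conn = session.connection()          # Core connection participating in the session's transaction
conn.execute(
    Product.__table__.insert(),      # the Core Table behind the ORM class
    [{"name": name} for name in names])
session.commit()

Passing a list of dicts lets the DBAPI run an executemany-style insert instead of one round trip per row.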
Is there an SQL query builder (in Python) which allows me to "parse" an initial SQL query, add certain operators, and then get the resulting SQL text?
My use case is the following:
Start with a query like: "SELECT * from my_table"
I want to be able to do something like query_object = Query.parse("SELECT * from my_table") to get a query object I can manipulate, and then write something like query_object.where('column < 10').limit(10) or similar (columns and operators could also be part of the library; it may also have to consider existing WHERE clauses).
And finally get the resulting query string with str(query_object), giving the final modified query.
Is this something that can be achieved with any of the ORMs? I don't need the database connections to specific DB engines or the object mappings (although having them is not a limitation).
I've seen pypika, which allows one to create an SQL query from code, but it doesn't allow one to parse an existing query and continue from there.
I've also seen sqlparse, which allows me to parse an SQL query into tokens. But because it does not create a tree, it is non-trivial to add additional elements to an existing statement. (It is close to what I am looking for, if only it created an actual tree.)
I would like to know how to read a CSV file using SQL. I would like to use GROUP BY and join other CSV files together. How would I go about this in Python?
example:
select * from csvfile.csv where name LIKE 'name%'
SQL code is executed by a database engine. Python does not directly understand or execute SQL statements.
While some SQL databases store their data in CSV-like files, almost all of them use more complicated file structures. Therefore, you're required to import each CSV file into a separate table in the SQL database engine. You can then use Python to connect to the SQL engine and send it SQL statements (such as SELECT). The engine will execute the SQL, extract the results from its data files, and return them to your Python program.
The most common lightweight engine is SQLite.
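As a hedged sketch (the table and column names, and the assumption that the CSV has a header row, are purely for illustration), you can load the CSV into an in-memory SQLite database and then use ordinary SQL, including LIKE, GROUP BY and joins:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")   # or a file path for a persistent database
conn.execute("CREATE TABLE people (name TEXT, city TEXT, amount REAL)")

with open("csvfile.csv", newline="") as f:
    rows = [(r["name"], r["city"], r["amount"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

for row in conn.execute(
        "SELECT city, SUM(amount) FROM people "
        "WHERE name LIKE 'name%' GROUP BY city"):
    print(row)

A second CSV file loaded into another table can then be combined with a normal SQL JOIN.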
littletable is a Python module I wrote for working with lists of objects as if they were database tables, but using a relational-like API, not actual SQL select statements. Tables in littletable can easily read and write from CSV files. One of the features I especially like is that every query from a littletable Table returns a new Table, so you don't have to learn different interfaces for Table vs. RecordSet, for instance. Tables are iterable like lists, but they can also be selected, indexed, joined, and pivoted - see the opening page of the docs.
# print a particular customer name
# (unique indexes will return a single item; non-unique
# indexes will return a Table of all matching items)
print(customers.by.id["0030"].name)
print(len(customers.by.zipcode["12345"]))

# print all items sold by the pound
for item in catalog.where(unitofmeas="LB"):
    print(item.sku, item.descr)

# print all items that cost more than 10
for item in catalog.where(lambda o: o.unitprice > 10):
    print(item.sku, item.descr, item.unitprice)

# join tables to create queryable wishlists collection
wishlists = customers.join_on("id") + wishitems.join_on("custid") + catalog.join_on("sku")

# print all wishlist items with price > 10
bigticketitems = wishlists().where(lambda ob: ob.unitprice > 10)
for item in bigticketitems:
    print(item)
Columns of Tables are inferred from the attributes of the objects added to the table. namedtuples work well, as do types.SimpleNamespace objects. You can insert dicts into a Table, and they will be converted to SimpleNamespaces.
littletable takes a little getting used to, but it sounds like you are already thinking along a similar line.
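If I remember the API correctly, CSV import is a one-liner; treat this as a sketch (the file name and column names are assumptions) rather than a definitive example:

import littletable as lt

# assumes customers.csv has id, name and zipcode columns
customers = lt.Table("customers")
customers.csv_import("customers.csv")
customers.create_index("id", unique=True)
customers.create_index("zipcode")

print(customers.by.id["0030"].name)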
You can easily query an SQL database using a PHP script. PHP runs server-side, so all your code will have to be on a webserver (the one with the database). You could make a function to connect to the database like this:
$con= mysql_connect($hostname, $username, $password)
or die("An error has occured");
Then use the $con to accomplish other tasks such as looping through data and creating a table, or even adding rows and columns to an existing table.
EDIT: I noticed you said .CSV file. You can upload a CSV file into an SQL database and create a table out of it. If you are using a control-panel service such as phpMyAdmin, you can simply import the CSV file into your database through its import feature.
If you are looking for a free web host to test your SQL and PHP files on, check out x10 hosting.
I'm looking for the most efficient way to bulk-insert some millions of tuples into a database. I'm using Python, PostgreSQL and psycopg2.
I have created a long list of tuples that should be inserted into the database, sometimes with modifiers like geometric Simplify.
The naive way to do it would be string-formatting a list of INSERT statements, but there are three other methods I've read about:
Using pyformat binding style for parametric insertion
Using executemany on the list of tuples, and
Writing the results to a file and using COPY.
It seems that the first way is the most efficient, but I would appreciate your insights and code snippets telling me how to do it right.
Yeah, I would vote for COPY, provided you can write the file to the server's hard drive (not the drive the app is running on), as COPY will only read off the server.
There is a new psycopg2 manual containing examples for all the options.
The COPY option is the most efficient. Then the executemany. Then the execute with pyformat.
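A minimal sketch of the COPY route with psycopg2's copy_from, which streams the data over the connection (COPY ... FROM STDIN), so no server-side file is needed; the table and column names are invented for illustration:

import io
import psycopg2

conn = psycopg2.connect("dbname=mydatabase")   # connection parameters are an assumption
cur = conn.cursor()

data = [(1, 100), (2, 200)]                    # your list of tuples
buf = io.StringIO()
for id_, var1 in data:
    buf.write("%s\t%s\n" % (id_, var1))        # tab-separated, one row per line
buf.seek(0)

cur.copy_from(buf, "table1", columns=("id", "var1"))
conn.commit()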
In my experience executemany is not any faster than running many inserts yourself;
the fastest way is to format a single INSERT with many VALUES yourself. Maybe executemany will improve in the future, but for now it is quite slow.
I subclass a list and overload the append method, so when the list reaches a certain size I format the multi-row INSERT and run it, as sketched below.
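A hedged sketch of that idea (the table name, column names and batch size are made up), using mogrify so each row is quoted and escaped by the driver:

class InsertBuffer(list):
    # collects rows and flushes one multi-row INSERT per batch
    def __init__(self, cursor, batch_size=1000):
        super(InsertBuffer, self).__init__()
        self.cursor = cursor
        self.batch_size = batch_size

    def append(self, row):
        super(InsertBuffer, self).append(row)
        if len(self) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self:
            return
        values = ",".join(
            self.cursor.mogrify("(%s,%s)", row).decode("utf-8")
            for row in self)
        self.cursor.execute("INSERT INTO table1 (id, var1) VALUES " + values)
        del self[:]

buf = InsertBuffer(cur)
for row in rows:
    buf.append(row)
buf.flush()        # flush the final partial batch
conn.commit()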
You could use a new upsert library:
$ pip install upsert
(you may have to pip install decorator first)
import psycopg2
from upsert import Upsert

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)
Where selector is a dict object like {'name': 'Chris Smith'} and setter is a dict like { 'age': 28, 'state': 'WI' }
It's almost as fast as writing custom INSERT[/UPDATE] code and running it directly with psycopg2... and it won't blow up if the row already exists.
The newest way of inserting many items is using the execute_values helper (https://www.psycopg.org/docs/extras.html#fast-execution-helpers).
from psycopg2.extras import execute_values
insert_sql = "INSERT INTO table (id, name, created) VALUES %s"
# this is optional
value_template="(%s, %s, to_timestamp(%s))"
cur = conn.cursor()
items = []
items.append((1, "name", 123123))
# append more...
execute_values(cur, insert_sql, items, value_template)
conn.commit()
Anyone using SQLAlchemy could try the 1.2 version, which added support for bulk insert using psycopg2.extras.execute_batch() instead of executemany when you initialize your engine with use_batch_mode=True, like:
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)
http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109
Then anyone who has to use SQLAlchemy won't need to bother trying different combinations of SQLAlchemy, psycopg2 and direct SQL.
After some testing, unnest often seems to be an extremely fast option, as I learned from @Clodoaldo Neto's answer to a similar question.
data = [(1, 100), (2, 200), ...] # list of tuples
cur.execute("""CREATE TABLE table1 AS
SELECT u.id, u.var1
FROM unnest(%s) u(id INT, var1 INT)""", (data,))
However, it can be tricky with extremely large data.
The first and the second would be used together, not separately. The third would be the most efficient server-wise though, since the server would do all the hard work.
A very related question: Bulk insert with SQLAlchemy ORM
All roads lead to Rome, but some of them cross mountains or require ferries; if you want to get there quickly, just take the motorway.
In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:
The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().
In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).
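For reference, a minimal sketch of execute_batch() itself; the table and column names are assumptions:

from psycopg2.extras import execute_batch

cur = conn.cursor()
execute_batch(
    cur,
    "INSERT INTO table1 (id, var1) VALUES (%s, %s)",
    data,               # e.g. [(1, 100), (2, 200), ...]
    page_size=1000)     # how many statements are grouped per server round trip
conn.commit()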
The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine().
How is it possible to implement an efficient large Sqlite db search (more than 90000 entries)?
I'm using Python and SQLObject ORM:
import re
...
def search1():
    cr = re.compile(ur'foo')
    for item in Item.select():
        if cr.search(item.name) or cr.search(item.skim):
            print item.name
This function runs in more than 30 seconds. How should I make it run faster?
UPD: The test:
for item in Item.select():
    pass
... takes almost the same time as my initial function (0:00:33.093141 to 0:00:33.322414). So the regexps eat no time.
A Sqlite3 shell query:
select '' from item where name like '%foo%';
runs in about a second. So the main time consumption happens due to the ORM's inefficient data retrieval from the db. I guess SQLObject grabs entire rows here, while SQLite touches only the necessary fields.
The best way would be to rework your logic to do the selection in the database instead of in your python program.
Instead of doing Item.select(), you should rework it to do something like Item.select("name LIKE '%foo%' OR skim LIKE '%foo%'").
If you do this, and make sure you have the name and skim columns indexed, it will return very quickly. 90000 entries is not a large database.
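A hedged sketch of what that could look like with SQLObject's sqlbuilder (the column names are taken from the question; check them against your actual model):

from sqlobject.sqlbuilder import LIKE, OR

def search1():
    pattern = '%foo%'
    where = OR(LIKE(Item.q.name, pattern), LIKE(Item.q.skim, pattern))
    for item in Item.select(where):
        print(item.name)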
30 seconds to fetch 90,000 rows might not be all that bad.
Have you benchmarked the time required to do the following?
for item in Item.select():
    pass
Just to see if the time is DB time, network time or application time?
If your SQLite DB is physically very large, you could be looking at -- simply -- a lot of physical I/O to read all that database stuff in.
If you really need to use a regular expression, there's not really anything you can do to speed that up tremendously.
The best thing would be to write an SQLite function that performs the comparison for you in the db engine, instead of in Python (see the sketch below).
You could also switch to a db server like PostgreSQL, which has support for SIMILAR TO.
http://www.postgresql.org/docs/8.3/static/functions-matching.html
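For the SQLite-function route, here is a sketch using the sqlite3 module directly; SQLite calls a user-defined regexp(pattern, value) function whenever a query uses the REGEXP operator (the database path is an assumption):

import re
import sqlite3

def regexp(pattern, value):
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect("items.db")
conn.create_function("regexp", 2, regexp)

cur = conn.execute(
    "SELECT name FROM item WHERE name REGEXP ? OR skim REGEXP ?",
    ("foo", "foo"))
for (name,) in cur:
    print(name)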
I would definitely take Reed's suggestion to push the filter into the SQL (forget the index part, though).
I do not think that selecting only specific fields versus all fields makes much difference (unless you do have a lot of large fields). I would bet that SQLObject creates/instantiates 80K objects and puts them into a Session/UnitOfWork for tracking. This could definitely take some time.
Also, if you do not need the objects in your session, there must be a way to select just the fields you need using custom query creation, so that no Item objects are created, only tuples.
Initially, doing regex via Python was considered for y_serial, but that was dropped in favor of SQLite's GLOB (which is far faster).
GLOB is similar to LIKE except that its syntax is more conventional: * instead of %, ? instead of _.
See the Endnotes at http://yserial.sourceforge.net/ for more details.
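For comparison, a GLOB version of the same query from Python, again assuming the item table from the question:

import sqlite3

conn = sqlite3.connect("items.db")   # database path is an assumption
cur = conn.execute(
    "SELECT name FROM item WHERE name GLOB ? OR skim GLOB ?",
    ("*foo*", "*foo*"))
for (name,) in cur:
    print(name)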
Given your example and expanding on Reed's answer your code could look a bit like the following:
import sqlalchemy.sql.expression as expr
...
def search1():
    searchStr = ur'foo'
    whereClause = expr.or_(itemsTable.c.nameColumn.contains(searchStr),
                           itemsTable.c.skimColumn.contains(searchStr))
    for item in itemsTable.select().where(whereClause):
        print item.name
which translates to
SELECT * FROM items WHERE name LIKE '%foo%' or skim LIKE '%foo%'
This will have the database do all the filtering work for you instead of fetching all 90000 records and doing possibly two regex operations on each record.
You can find some info on the .contains() method in the SQLAlchemy documentation, as well as in the SQLAlchemy SQL Expression Language Tutorial.
Of course, the example above assumes variable names for your itemsTable and its columns (nameColumn and skimColumn).
I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.
Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.
[("hello", 3, 4), ("cat", 5, 3), ...]
I need to insert all these tuples to the table (assume I verified neither of these strings appear in the database). For clarification, I'm using InnoDB, and I have an auto-incremental primary key for this table, the string is not the PK.
My code currently iterates through this list; for each tuple it creates a Django model object with the appropriate values and calls .save(), something like this:
@transaction.commit_on_success
def save_data_elements(input_list):
    for (s, i1, i2) in input_list:
        entry = DataElement(string=s, number1=i1, number2=i2)
        entry.save()
This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.
For example, I could generate SQL statements, each containing an INSERT command for 100 tuples ("hard-coded" into the SQL), and execute them, but I don't know if that would improve anything.
Do you have any suggestion to optimize such a process?
Thanks
You can write the rows to a file in the format
"field1", "field2", .. and then use LOAD DATA to load them
data = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
with open('data.txt', 'w') as f:
    f.write(data)
Then execute this:
LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
Reference
For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you can convert the data into the format it expects, that will probably be the fastest way to get it into the table.
If you don't use LOAD DATA INFILE, as some of the other suggestions mention, two things you can do to speed up your inserts are:
Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB)
If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much much faster than lots of individual INSERT statements.
Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.
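As a hedged sketch of the hand-rolled multi-row INSERT (the table name myapp_dataelement and the chunk size are assumptions; keep your existing commit_on_success decorator or equivalent around it so the chunks share one transaction):

from django.db import connection

def save_data_elements(input_list, chunk_size=100):
    cursor = connection.cursor()
    for start in range(0, len(input_list), chunk_size):
        chunk = input_list[start:start + chunk_size]
        placeholders = ", ".join(["(%s, %s, %s)"] * len(chunk))
        params = [value for row in chunk for value in row]
        cursor.execute(
            "INSERT INTO myapp_dataelement (string, number1, number2) "
            "VALUES " + placeholders,
            params)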
What format do you receive the data in? If it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html
This is unrelated to the actual load of data into the DB, but...
If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.
Just something else to consider.
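A minimal sketch of the background-thread idea, reusing the save_data_elements function from the question:

import threading

def save_in_background(input_list):
    worker = threading.Thread(target=save_data_elements, args=(input_list,))
    worker.daemon = True    # don't block process shutdown on the import
    worker.start()
    return worker           # the caller can poll worker.is_alive() to show progress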
I don't know the exact details, but you can use a JSON-style data representation and use it as fixtures or something. I saw something similar in the Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this one helps.
Hope you can work it out. I just started learning Django, so I can just point you to resources.