I'll make it easier on you.
I need to perform a multi-insert operation using parameters from a text file.
However, I need to report each input line in a log or an err file depending on the insert status.
I was able to understand whether the insert was OK or not when performing it one at a time (for example, using cur.rowcount or simply a try..except statement).
Is there a way to perform N inserts (corresponding to N input lines) and to understand which ones fail?
Here is my code:
QUERY="insert into table (field1, field2, field3) values (%s, %s, %s)"
Let
a b c
d e f
g h i
be 3 rows from input file. So
args=[('a','b','c'), ('d','e','f'),('g','h','i')]
cur.executemany(QUERY,args)
Now, let's suppose only the first 2 rows were successfully added. So I have to track such a situation as follows:
log file
a b c
d e f
err file
g h i
Any idea?
Thanks!
try this:
QUERY="insert into table (field1, field2, field3) values ({}, {}, {})"
with open('input.txt', 'r') as inputfile:
readfile = inputfile.read()
inputlist = readfile.splitlines()
listafinal = []
for x in inputlist:
intermediate = x.split(' ')
cur.execute(QUERY.format(intermediate[0], intermediate[1], intermediate[2]))
# if error:
# log into the error file
# else:
# log into the success file
Do not forget to undo the comments and ajust the error as you like
How common do you expect failures to be, and what kind of failures? What I have done in such similar cases is insert 10,000 rows at a time, and if the chunk fails then go back and do that chunk 1 row at a time to get the full error message and specific row. Of course, that depends on failures being rare. What I would be more likely to do today is just turn off synchronous_commit and process them one row at a time always.
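As a rough sketch of that chunk-then-fallback idea (cur, QUERY and args are from the question; conn, the log/err file names and the chunk size are assumptions for illustration):

CHUNK = 10000

with open('log.txt', 'w') as logfile, open('err.txt', 'w') as errfile:
    for start in range(0, len(args), CHUNK):
        chunk = args[start:start + CHUNK]
        try:
            cur.executemany(QUERY, chunk)
            conn.commit()
        except Exception:
            conn.rollback()
            # the chunk failed somewhere: retry it one row at a time
            # to find the offending rows and their error messages
            for params in chunk:
                try:
                    cur.execute(QUERY, params)
                    conn.commit()
                    logfile.write(' '.join(params) + '\n')
                except Exception:
                    conn.rollback()
                    errfile.write(' '.join(params) + '\n')
        else:
            logfile.writelines(' '.join(params) + '\n' for params in chunk)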
Related
I'm trying to update all rows of 1 column of my database with a big tuple.
c.execute("SELECT framenum FROM learnAlg")
db_framenum = c.fetchall()
print(db_framenum)
db_framenum_new = []
# How much v6 framenum differentiates from v4
change_fn = 0
for f in db_framenum:
t = f[0]
if t in change_numbers:
change_fn += 1
t = t + change_fn
db_framenum_new.append((t,))
print("")
print(db_framenum_new)
c.executemany("UPDATE learnAlg SET framenum=?", (db_framenum_new))
First I take the existing values of the column 'framenum', which look like:
[(0,), (1,), (2,) , ..., (104,)]
Then I transform the tuples into a new list so I can change some values in the for f in db_framenum: loop, which results in a similar list of tuples:
[(0,), (1,), (2,) , ..., (108,)]
Problem
So far so good, but then I try to update the column 'framenum' with these new framenumbers:
c.executemany("UPDATE learnAlg SET framenum=?", (db_framenum_new))
I expect the rows in the column 'framenum' to have the new values, but instead they all have the value: 108 (which is the last value of the tuple 'db_framenum_new'). Why are they not being updated in order (from 1 till 108)?
Expect:
framenum: 1, 2, .., 108
Got:
framenum: 108, 108, ..., 108
Note: The list of tuples has not become longer; only certain values have been changed. Everything above 46 has +1, everything above 54 an additional +1 (+2 total)...
Note2: The column is created with: 'framenum INTEGER'. Another column has the PRIMARY KEY, if this matters, made with: 'framekanji TEXT PRIMARY KEY'; it has (for now) all values NULL.
Edit
Solved my problem, but I'm still interested in proper use of c.executemany(). I don't know why this only updates the first rowid:
c.execute("SELECT rowid, framenum FROM learnAlg")
db_framenum = c.fetchall()
print(db_framenum)
db_framenum_new = []
# How much v6 framenum differentiates from v4
change_fn = 0
for e, f in enumerate(db_framenum):
e += 1
t = f[1]
if t in change_numbers:
change_fn += 1
t = t + change_fn
db_framenum_new.append((e,t))
print(db_framenum_new)
c.executemany("UPDATE learnAlg SET framenum=? WHERE rowid=?",
(db_framenum_new[1], db_framenum_new[0]))
Yes, you are telling the database to update all rows with the same framenum. That's because the UPDATE statement did not select any specific row. You need to tell the database to change one row at a time, by including a primary key for each value.
Since you are only altering specific framenumbers, you could ask the database to only provide those specific rows instead of going through all of them. You probably also need to specify an order in which to change the numbers; perhaps you need to do so in incrementing framenumber order?
c.execute("""
SELECT rowid, framenum FROM learnAlg
WHERE framenum in ({})
ORDER BY framenum
""".format(', '.join(['?'] * len(change_numbers))),
change_numbers)
update_cursor = conn.cursor()
for change, (rowid, f) in enumerate(c, 1):
update_cursor.execute("""
UPDATE learnAlg SET framenum=? WHERE rowid=?""",
(f + change, rowid))
I altered the structure somewhat there; the query limits the results to frame numbers in the change_numbers sequence only, through a WHERE IN clause. I loop over the cursor directly (no need to fetch all results at once) and use separate UPDATEs to set the new frame number. Instead of a manual counter I used enumerate() to keep count for me.
If you needed to group the updates by change_numbers, then just tell the database to do those updates:
change = len(change_numbers)
for framenumber in reversed(change_numbers):
    update_cursor.execute("""
        UPDATE learnAlg SET framenum=framenum + ? WHERE framenum=?
        """, (change, framenumber))
    change -= 1
This starts at the highest framenumber to avoid updating framenumbers you already updated before. This does assume your change_numbers are sorted in incremental order.
Your executemany update should just pass in the whole list, not just the first two items; you do need to alter how you append the values:
for e, f in enumerate(db_framenum):
    # ...
    db_framenum_new.append((t, e))  # framenum first, then rowid

c.executemany("UPDATE learnAlg SET framenum=? WHERE rowid=?",
              db_framenum_new)
Note that the executemany() call takes place outside the for loop!
Thanks @Martijn Pieters, using rowid is what I needed. This is the code that made it work for me:
c.execute("SELECT rowid, framenum FROM learnAlg")
db_framenum = c.fetchall()
print(db_framenum)
# How much v6 framenum differentiates from v4
change_fn = 0
for e, f in enumerate(db_framenum):
e += 1
db_framenum_new = f[1]
if db_framenum_new in change_numbers:
change_fn += 1
db_framenum_new = db_framenum_new + change_fn
c.execute("UPDATE learnAlg SET framenum=? WHERE rowid=?",
(db_framenum_new, e))
However I still don't know how to properly use c.executemany(). See edit for updated question.
The (regrettably lengthy) MWE at the end of this question is cut down from a real application. It is supposed to work like this: There are two tables. One includes both already-processed and not-yet-processed data, the other has the results of processing the data. On startup, we create a temporary table that lists all of the data that has not yet been processed. We then open a read cursor on that table and scan it from beginning to end; for each datum, we do some crunching (omitted in the MWE) and then insert the results into the processed-data table, using a separate cursor.
This works correctly in autocommit mode. However, if the write operation is wrapped in a transaction -- and in the real application, it has to be, because the write actually touches several tables (all but one of which have been omitted from the MWE) -- then the COMMIT operation has the side-effect of resetting the read cursor on the temp table, causing rows that have already been processed to be reprocessed, which not only prevents forward progress, it causes the program to crash with an IntegrityError upon trying to insert a duplicate row into data_out. If you run the MWE you should see this output:
0
1
2
3
4
5
6
7
8
9
10
0
---
127 rows remaining
Traceback (most recent call last):
  File "sqlite-test.py", line 85, in <module>
    test_main()
  File "sqlite-test.py", line 83, in test_main
    test_run(db)
  File "sqlite-test.py", line 71, in test_run
    (row[0], b"output"))
sqlite3.IntegrityError: UNIQUE constraint failed: data_out.value
What can I do to prevent the read cursor from being reset by a COMMIT touching unrelated tables?
Notes:
All of the INTEGERs in the schema are ID numbers; in the real application there are several more ancillary tables that hold more information for each ID, and the write transaction touches two or three of them in addition to data_out, depending on the result of the computation.
In the real application, the temporary "data_todo" table is potentially very large -- millions of rows; I started down this road precisely because a Python list was too big to fit in memory.
The MWE's shebang is for python3 but it will behave exactly the same under python2 (provided the interpreter is new enough to understand b"..." strings).
Setting PRAGMA locking_mode = EXCLUSIVE; and/or PRAGMA journal_mode = WAL; has no effect on the phenomenon.
I am using SQLite 3.8.2.
#! /usr/bin/python3

import contextlib
import sqlite3
import sys
import tempfile
import textwrap

def init_db(db):
    db.executescript(textwrap.dedent("""\
        CREATE TABLE data_in (
            origin    INTEGER,
            origin_id INTEGER,
            value     INTEGER,
            UNIQUE(origin, origin_id)
        );
        CREATE TABLE data_out (
            value     INTEGER PRIMARY KEY,
            processed BLOB
        );
    """))
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (1, x, x) for x in range(100) ])
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (2, x, 200 - x*2) for x in range(100) ])
    db.executemany("INSERT INTO data_out VALUES(?, ?);",
                   [ (x, b"already done") for x in range(50, 130, 5) ])
    db.execute(textwrap.dedent("""\
        CREATE TEMPORARY TABLE data_todo AS
        SELECT DISTINCT value FROM data_in
        WHERE value NOT IN (SELECT value FROM data_out)
        ORDER BY value;
    """))

def test_run(db):
    init_db(db)
    read_cur = db.cursor()
    write_cur = db.cursor()
    read_cur.arraysize = 10
    read_cur.execute("SELECT * FROM data_todo;")
    try:
        while True:
            block = read_cur.fetchmany()
            if not block: break
            for row in block:
                # (in real life, data actually crunched here)
                sys.stdout.write("{}\n".format(row[0]))
                write_cur.execute("BEGIN TRANSACTION;")
                # (in real life, several more inserts here)
                write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                                  (row[0], b"output"))
                db.commit()
    finally:
        read_cur.execute("SELECT COUNT(DISTINCT value) FROM data_in "
                         "WHERE value NOT IN (SELECT value FROM data_out)")
        result = read_cur.fetchone()
        sys.stderr.write("---\n{} rows remaining\n".format(result[0]))

def test_main():
    with tempfile.NamedTemporaryFile(suffix=".db") as tmp:
        with contextlib.closing(sqlite3.connect(tmp.name)) as db:
            test_run(db)

test_main()
Use a second, separate connection for the temporary table, it'll be unaffected by commits on the other connection.
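For instance, a minimal sketch of that idea (it reuses the MWE's names and assumes both connections point at the same database file; the temp table now lives on the read connection, so commits on the write connection no longer disturb the read cursor):

read_db = sqlite3.connect(tmp.name)    # owns data_todo and the read cursor
write_db = sqlite3.connect(tmp.name)   # receives the INSERTs and the COMMITs

read_db.execute(textwrap.dedent("""\
    CREATE TEMPORARY TABLE data_todo AS
    SELECT DISTINCT value FROM data_in
    WHERE value NOT IN (SELECT value FROM data_out)
    ORDER BY value;
"""))

read_cur = read_db.cursor()
write_cur = write_db.cursor()
read_cur.execute("SELECT * FROM data_todo;")
for row in read_cur:
    # (data crunching would happen here)
    write_cur.execute("BEGIN TRANSACTION;")
    write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                      (row[0], b"output"))
    write_db.commit()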
Here is the input.txt file:
Jan_Feb 0.11
Jan_Mar -1.11
Jan_Apr 0.2
Feb_Jan 0.11
Feb_Mar -3.0
Mar_Jan -1.11
Mar_Feb -3.0
Mar_Apr 3.5
From this file, I am trying to create a dictionary. 1) The keys are the two values obtained by splitting the first-column string on "_". 2) Moreover, if the names of the column and the row are the same (such as Jan and Jan), write 0.0, as shown below. 3) Lastly, if the keys are not found in the dictionary, write "NA". Output.txt
Jan Feb Mar Apr
Jan 0.0 0.11 -1.11 0.2
Feb 0.11 0.0 -3.0 NA
Mar -1.11 -3.0 0.0 3.5
Apr 0.2 NA 3.5 0.0
I would really appreciate it if someone could help me figure this out. Actually, there are about 100,000,000 rows * 2 columns in the real input.txt. Thank you so much in advance.
Others might disagree with this, but one solution would be simply to read all 100 million lines into a relational database table (appropriately splitting out what you need, of course) using a module that interfaces with MySQL or SQLite:
Your_Table:
ID
Gene_Column
Gene_Row
Value
Once they're in there, you can query against the table in something that resembles English:
Get all of the column headings:
select distinct Gene_Column from Your_Table order by Gene_Column asc
Get all of the values for a particular row, and which columns they're in:
select Gene_Column, Value from Your_Table where Gene_Row = "Some_Name"
Get the value for a particular cell:
select Value from Your_Table where Gene_Row = "Some_Name" and Gene_Column = "Another_Name"
That, and you really don't want to shuffle around 100 million records any more than you have to. Reading all of them into memory may be problematic as well. Doing it this way, you can construct your matrix one row at a time, and output the row to your file.
It might not be the fastest, but it will probably be pretty clear and straightforward code.
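As a rough sketch of that approach with the sqlite3 module (the table and column names follow the outline above; the file names, the tab-separated output and the alphabetical header order are assumptions):

import sqlite3

conn = sqlite3.connect('genes.db')
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS Your_Table (
                   Gene_Row TEXT, Gene_Column TEXT, Value REAL)""")

with open('input.txt') as f:
    rows = (line.split() for line in f)
    cur.executemany(
        "INSERT INTO Your_Table (Gene_Row, Gene_Column, Value) VALUES (?, ?, ?)",
        ((pair.split('_')[0], pair.split('_')[1], float(value))
         for pair, value in rows))
conn.commit()

# every name that appears as a row or a column becomes a header
headers = [r[0] for r in cur.execute(
    "SELECT Gene_Column FROM Your_Table UNION SELECT Gene_Row FROM Your_Table "
    "ORDER BY Gene_Column")]

with open('output.txt', 'w') as out:
    out.write('\t'.join(headers) + '\n')
    for row_name in headers:
        cells = dict(cur.execute(
            "SELECT Gene_Column, Value FROM Your_Table WHERE Gene_Row = ?",
            (row_name,)))
        line = [row_name] + ['0.0' if h == row_name else str(cells.get(h, 'NA'))
                             for h in headers]
        out.write('\t'.join(line) + '\n')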
What you first need to do is get the data in an understandable format. So, first, you need to create a row. I would get the data like so:
with open('test.txt') as f:
    data = [(l.split()[0].split('_'), l.split()[1]) for l in f]

# Example:
# [(['Jan', 'Feb'], '0.11'), (['Jan', 'Mar'], '-1.11'), (['Jan', 'Apr'], '0.2'), (['Feb', 'Jan'], '0.11'), (['Feb', 'Mar'], '-3.0'), (['Mar', 'Jan'], '-1.11'), (['Mar', 'Feb'], '-3.0'), (['Mar', 'Apr'], '3.5')]

headers = set([var[0][0] for var in data] + [var[0][1] for var in data])

# Example:
# set(['Jan', 'Apr', 'Mar', 'Feb'])
What you then need to do is create a mapping from your headers to your values, which are stored in data. Ideally, you would need to create a table. Take a look at this answer to help you figure out how to do that (we can't write your code for you).
Secondly, in order to print things out properly, you will need to use the format method. It will help you deal with strings and print them in a specific fashion.
After that you can simply write the result using with open('output.txt', 'w'), for example as sketched below.
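One possible way to finish it (a sketch only, not the answer's exact code; it reuses the data and headers built above and prints the headers in alphabetical order for simplicity):

# Build a nested mapping row -> column -> value, with 0.0 on the diagonal
# and 'NA' for missing pairs, then print it with str.format.
table = {row: {col: 'NA' for col in headers} for row in headers}
for (row, col), value in data:
    table[row][col] = value
for name in headers:
    table[name][name] = '0.0'

order = sorted(headers)
with open('output.txt', 'w') as out:
    out.write(' '.join(order) + '\n')
    for row in order:
        out.write('{} {}\n'.format(row, ' '.join(table[row][col] for col in order)))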
matrix = dict()
with open('input.txt') as f:
    content = f.read()
tmps = content.split('\n')
for tmp in tmps:
    s = tmp.split(' ')
    latter = s[0].split('_')
    try:
        if latter[0] in matrix:
            matrix[latter[0]][latter[1]] = s[1]
        else:
            matrix[latter[0]] = dict()
            matrix[latter[0]][latter[1]] = s[1]
    except IndexError:
        # skip blank or malformed lines
        pass
print(matrix)
And now matrix holds the table you want.
If you want a dictionary as a result, something like that:
dico = {}
keyset = set()
with open('input.txt', 'r') as file:
    for line in file:                   # iterate over every line, not just the first
        keys, value = line.split()[:2]  # split on whitespace (tab or space)
        key1 = keys.split('_')[0]
        keyset.add(key1)
        key2 = keys.split('_')[1]
        keyset.add(key2)
        if key1 not in dico:
            dico[key1] = {}
        dico[key1][key2] = value

for key in keyset:
    dico.setdefault(key, {})            # keys that never appear first still need a row
    dico[key][key] = 0.0
    for secondkey in keyset:
        if secondkey not in dico[key]:
            dico[key][secondkey] = "NA"
1) Determine all of the possible headers for the resulting columns/rows. In your example, that is Jan through Apr. How you do this can vary. You can parse the file 2x (not ideal, but it might be necessary), or perhaps you have someplace you can refer to for the distinct columns.
2) Establish the headers. In the example above, you would have headers=["Jan","Feb","Mar","Apr"]. You can build this up during #1 if you have to parse the first column. Use len(headers) to determine N.
3) Parse the data, this time considering both columns. You will get the two keys using .split("_") on the first column, then you get the index for your data by doing simple arithmetic:
x,y = [headers.index(a) for a in row[0].split("_")]
data[x+y*len(headers)] = row[1]
This should be relatively fast, except for the parsing of the file twice. If it can fit into memory, you could load your file into memory and then scan over it twice, or use command line tricks to establish those header entries.
-- I should say that you will need to determine N before you begin storing the actual data. (i.e. data=[0]*N). Also, you'll need to use x+y*len(headers) during the save as well. If you are using numpy, you can use reshape to get an actual row/col layout which will be a little easier to manipulate and print (i.e. data[x,y]=row[1])
If you do a lot of large data manipulation, especially if you might be performing calculations, you really should look into learning numpy (www.numpy.org).
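A compact sketch of those three steps (the file name is an assumption; N is the flat array's size, len(headers)**2, and the optional numpy reshape is shown as a comment):

# Pass 1: determine all headers, then allocate the flat array of size N.
with open('input.txt') as f:
    headers = sorted({name for line in f for name in line.split()[0].split('_')})
n = len(headers)
data = ['NA'] * (n * n)           # N = n * n cells, 'NA' by default
for i in range(n):
    data[i + i * n] = '0.0'       # the diagonal

# Pass 2: parse the values into their x + y*n slots.
with open('input.txt') as f:
    for line in f:
        pair, value = line.split()
        x, y = [headers.index(a) for a in pair.split('_')]
        data[x + y * n] = value

# With numpy, the same flat list can be reshaped into an n x n matrix:
# import numpy as np
# matrix = np.array(data, dtype=object).reshape(n, n)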
Given the size of your input, I would split this in several passes on your file:
First pass to identify the headers (loop on all lines, read the first group, find the headers)
Read the file again to find the values to put in the matrix. There are several options.
The simplest (but slow) option is to read the whole file for each line of your matrix, identifying only the lines you need in the file. This way, you only have one line of the matrix in memory at a time.
From your example file, it seems your input file is properly sorted. If it is not, it can be a good idea to sort it. This way, you know that the line in the input file you are reading is the next cell in your matrix (except of course for the 0 diagonal, which you only need to add); a sketch of this sorted-input option follows below.
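A rough sketch of that sorted-input option (it assumes the headers were collected in the first pass and that input.txt is sorted by row name; the names and file paths are illustrative):

import itertools

headers = ['Jan', 'Feb', 'Mar', 'Apr']           # from the first pass

with open('input.txt') as f, open('output.txt', 'w') as out:
    out.write(' '.join(headers) + '\n')
    parsed = (line.split() for line in f)
    # group consecutive lines that share the same row name
    for row_name, group in itertools.groupby(parsed,
                                             key=lambda p: p[0].split('_')[0]):
        cells = {p[0].split('_')[1]: p[1] for p in group}
        cells[row_name] = '0.0'                  # the diagonal
        # (a name that never appears in the first column, like Apr in the
        #  sample, would still need its row emitted separately)
        out.write(' '.join([row_name] + [cells.get(h, 'NA') for h in headers]) + '\n')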
I need to fetch a huge amount of data from Oracle (using cx_Oracle) in Python 2.6 and produce a CSV file.
The data size is about 400k records x 200 columns x 100 chars each.
What is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
    ...
... the script remains in the loop for hours and nothing is written to the file... (is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SqlDeveloper is super fast.
Thank you, gian
You should use cur.fetchmany() instead.
It will fetch a chunk of rows defined by arraysize (256).
Python code:
def chunks(cur):  # 256
    global log, d
    while True:
        # log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        # Process your rows here
That is exactly how I do it in my TableHunter for Oracle.
add print statements after each line
add a counter to your loop indicating progress after every N rows (see the sketch after this list)
look into a module like 'progressbar' for displaying a progress indicator
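A minimal sketch of the counter idea (it assumes the ctemp cursor and file object from the question; the 10,000-row interval is arbitrary):

for count, row in enumerate(ctemp, 1):
    file.write(row[1])
    if count % 10000 == 0:
        print("processed %d rows so far" % count)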
I think your code is asking the database for the data one row at a time, which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
Results = ctemp.fetchall()
for row in Results:
    file.write(row[1])
[Edit 2: More information and debugging in answer below...]
I'm writing a python script to export MS Access databases into a series of text files to allow for more meaningful version control (I know - why Access? Why aren't I using existing solutions? Let's just say the restrictions aren't of a technical nature).
I've successfully exported the full contents and structure of the database using ADO and ADOX via the comtypes library, but I'm getting a problem re-importing the data.
I'm exporting the contents of each table into a text file with a list on each line, like so:
[-9, u'No reply']
[1, u'My home is as clean and comfortable as I want']
[2, u'My home could be more clean or comfortable than it is']
[3, u'My home is not at all clean or comfortable']
And here is the function used to import said file:
import os
import sys
import datetime
import comtypes.client as client
from ADOconsts import *
from access_consts import *

class Db:
    def create_table_contents(self, verbosity = 0):
        conn = client.CreateObject("ADODB.Connection")
        rs = client.CreateObject("ADODB.Recordset")
        conn.ConnectionString = self.new_con_string
        conn.Open()
        for fname in os.listdir(self.file_path):
            if fname.startswith("Table_"):
                tname = fname[6:-4]
                if verbosity > 0:
                    print "Filling table %s." % tname
                conn.Execute("DELETE * FROM [%s];" % tname)
                rs.Open("SELECT * FROM [%s];" % tname, conn,
                        adOpenDynamic, adLockOptimistic)
                f = open(self.file_path + os.path.sep + fname, "r")
                data = f.readline()
                print repr(data)
                while data != '':
                    data = eval(data.strip())
                    print data[0]
                    print rs.Fields.Count
                    rs.AddNew()
                    for i in range(rs.Fields.Count):
                        if verbosity > 1:
                            print "Into field %s (type %s) insert value %s." % (
                                rs.Fields[i].Name, str(rs.Fields[i].Type),
                                data[i])
                        rs.Fields[i].Value = data[i]
                    data = f.readline()
                    print repr(data)
                    rs.Update()
                rs.Close()
        conn.Close()
Everything works fine except that numerical values (double and int) are being inserted as zeros. Any ideas on whether the problem is with my code, eval, comtypes, or ADO?
Edit: I've fixed the problem with inserting numbers - casting them as strings(!) seems to solve the problem for both double and integer fields.
However, I now have a different issue that had previously been obscured by the above: the first field in every row is being set to 0 regardless of data type... Any ideas?
And I found an answer.
rs = client.CreateObject("ADODB.Recordset")
Needs to be:
rs = client.CreateObject("ADODB.Recordset", dynamic=True)
Now I just need to look into why. Just hope this question saves someone else a few hours...
Is data[i] being treated as a string? What happens if you specifically cast it as a int/double when you set rs.Fields[i].Value?
Also, what happens when you print out the contents of rs.Fields[i].Value after it is set?
Not a complete answer yet, but it appears to be a problem during the update. I've added some further debugging code in the insertion process which generates the following (example of a single row being updated):
Inserted into field ID (type 3) insert value 1, field value now 1.
Inserted into field TextField (type 202) insert value u'Blah', field value now Blah.
Inserted into field Numbers (type 5) insert value 55.0, field value now 55.0.
After update: [0, u'Blah', 55.0]
The last value in each "Inserted..." line is the result of calling rs.Fields[i].Value before calling rs.Update(). The "After..." line shows the results of calling rs.Fields[i].Value after calling rs.Update().
What's even more annoying is that it's not reliably failing. Rerunning the exact same code on the same records a few minutes later generated:
Inserted into field ID (type 3) insert value 1, field value now 1.
Inserted into field TextField (type 202) insert value u'Blah', field value now Blah.
Inserted into field Numbers (type 5) insert value 55.0, field value now 55.0.
After update: [1, u'Blah', 2.0]
As you can see, results are reliable until you commit them, then... not.