What is the proper way of moving a number of records from one collection to another? I have come across several other SO posts, such as this one, that deal with achieving the same goal, but none have a Python implementation.
# Taking a number of records from one collection and returning the cursor
cursor_excess_new = db.test_collection_new.find().sort([("_id", 1)]).limit(excess_num)

# db.test.insert_many(doc for doc in cursor_excess_new).inserted_ids

# Iterating the cursor and trying to write to another collection
# for doc in cursor_excess_new:
#     db.test_collection_old.insert_one(doc)

result = db.test_collection_old.bulk_write([
    for doc in cursor_excess_new:
        InsertMany(doc for each doc in cur)
        pprint(doc)
])
If I use insert_many, I get the following error: pymongo.errors.OperationFailure: Writes to config servers must have batch size of 1, found 10
bulk_write gives me a syntax error at the start of the for loop.
What is the best practice and correct way of transferring records from one collection to another in pymongo so that it is atomic?
Collection.bulk_write accepts an iterable of write operations as its argument.
pymongo has a pymongo.operations.InsertOne operation, but no InsertMany.
For your situation, you can build a list of InsertOne operations, one for each document in the source collection, and then do a bulk_write on the destination using that list of operations.
from pymongo import InsertOne
...
cursor_excess_new = (
    db.test_collection_new
    .find()
    .sort([("_id", 1)])
    .limit(excess_num)
)
queries = [InsertOne(doc) for doc in cursor_excess_new]
db.test_collection_old.bulk_write(queries)
You don't need a for loop:
myList=list(collection1.find({}))
collection2.insert_many(myList)
collection1.delete_many({})
If you need to filter it you can use the following code:
myList=list(collection1.find({'status':10}))
collection2.insert_many(myList)
collection1.delete_many({'status':10})
But be careful: there is no guarantee that the move succeeds, so you need to control it with a transaction. If you're going to use the following code, keep in mind that your MongoDB deployment cannot be standalone; you need to enable replication and have at least one other instance (a replica set).
with myClient.start_session() as mySession:
    with mySession.start_transaction():
        ...yourcode...
Finally, the code above guarantees that the move (insert and delete) either completes or fails as a whole, but the transaction isn't in your hands and you can't inspect its result, so you can use the following code to control both the move and the transaction:
with myClient.start_session() as mySession:
    mySession.start_transaction()
    try:
        ...yourcode...
        mySession.commit_transaction()
        print("Done")
    except Exception as e:
        mySession.abort_transaction()
        print("Failed", e)
I have a collection student and I want to read this collection as a list in Python, but unfortunately I get the following error: CursorNextError: [HTTP 404][ERR 1600] cursor not found. Is there an option to read a 'huge' collection without an error?
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
print(db.collections())
students = db.collection('students')
#students.all()
students = db.collection('handlingUnits').all()
list(students)
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
students = list(db.collection('students'))
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
As suggested in my comment, if raising the TTL is not an option (which I wouldn't do either), I would get the data in chunks instead of all at once. In most cases you don't need the whole collection anyway, so maybe think about limiting that first. Do you really need all documents and all their fields?
That being said, I have no experience with arango, but this is what I would do:
entries = db.collection('students').count()  # get total amount of documents in collection
limit = 100    # block size you want to request
yourlist = []  # final output
for x in range(int(entries / limit) + 1):
    block = db.collection('students').all(skip=x * limit, limit=limit)
    yourlist.extend(block)  # assuming block is iterable; not sure what arango returns
Something like this (based on the documentation here: https://python-driver-for-arangodb.readthedocs.io/_/downloads/en/dev/pdf/).
Limit your request to a reasonable amount and then skip that amount with your next request. You have to check whether range() really works like that; you might have to think of a better way of defining the number of iterations you need.
This also assumes arango sorts the all() function by default.
So what is the idea?
1. Determine the number of entries in the collection.
2. Based on that, determine how many requests you need (e.g. size=1000 -> 10 blocks, each containing 100 entries).
3. Make x requests where you skip the blocks you already have: first iteration entries 1-100, second iteration 101-200, third iteration 201-300, and so on.
By default, AQL queries generate the complete result, which is then held in memory and provided batch by batch. So the cursor is simply fetching the next batch of the already calculated result. In most cases this is fine, but if your query produces a huge result set, then this can take a long time and will require a lot of memory.
As an alternative you can create a streaming cursor. See https://www.arangodb.com/docs/stable/http/aql-query-cursor-accessing-cursors.html and check the stream option.
Streaming cursors calculate the next batch on demand and are therefore better suited to iterate a large collection.
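If you go that route, a streaming AQL cursor from python-arango might look roughly like the sketch below. It assumes a reasonably recent python-arango where db.aql.execute exposes the batch_size and stream options; the query string is just an example over the students collection.
# Sketch: iterate the collection with a streaming cursor so batches are
# computed on demand instead of materialising the whole result up front.
cursor = db.aql.execute(
    'FOR s IN students RETURN s',
    batch_size=1000,  # documents per round trip
    stream=True,      # streaming cursor: next batch is computed lazily
)
students_list = []
for doc in cursor:
    students_list.append(doc)  # or process each document as it arrives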
I have implemented a Python script in order to divide millions of documents (generated by a .NET web application, all stored in a single directory) into subfolders with this scheme: year/month/batch, as all the tasks these documents come from were originally divided into batches.
My Python script queries SQL Server 2014, which contains all the data it needs for each document, in particular the month and year it was created. It then uses the shutil module to move the PDF. So, I first perform a query to get the list of batches for a given month and year:
queryBatches = '''SELECT DISTINCT IDBATCH
FROM [DBNAME].[dbo].[WORKS]
WHERE YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={}'''.format(year, month)
Then I perform:
for batch in batches:
    query = '''SELECT IDWORK, IDBATCH, NAMEDOCUMENT
               FROM [DBNAME].[dbo].[WORKS]
               WHERE NAMEDOCUMENT IS NOT NULL and
                     NAMEDOCUMENT not like '/%/%/%/%.pdf' and
                     YEAR(DATETIMEWORK)={} and
                     MONTH(DATETIMEWORK)={} and
                     IDBATCH={}'''.format(year, month, batch[0])
whose records are collected through a cursor, as per the pymssql usage documentation. Then I go on with:
IDWorksUpdate = []
row = cursor.fetchone()
while row:
    if moveDocument(...):
        IDWorksUpdate.append(row[0])
    row = cursor.fetchone()
Finally, when the cycle has ended, IDWorksUpdate holds the PKs of all the WORKS rows whose documents were correctly moved into a subfolder. So I close the cursor and the connection and instantiate new ones.
In the end I perform:
subquery = '(' + ', '.join(str(x) for x in IDWorksUpdate) + ')'
query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(year, month, idbatch, subquery)

newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()
try:
    newCursor.execute(query)
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating documents names in database of works {}/{} of batch {}'.format(year, month, idbatch))
finally:
    newCursor.close()
    del newCursor
    newConn.close()
This morning I found that, for a couple of batches only, that update query failed at the database, even though the documents were correctly moved into subdirectories.
Those batches had more than 55000 documents to be moved, so maybe IDWorksUpdate grew too large and ended up producing a bad final update query? I thought 55000 was not such a big list of integers. The problem is that in pymssql we cannot have more than one connection/cursor at a time to the same database, so I cannot update the records as the respective files are moved. That is why I collect the PKs of the works whose documents were correctly moved and finally update them with a new connection/cursor. What could have happened? Am I doing it wrong?
UPDATE
I've just written a simple script to reproduce the query that gets executed to update the records, and this is the error I get from SQL Server:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
This is the query:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE IDWORK IN (list of 55157 PKs)
The fact is that the table is very big (about 14 million records). But I need that list of PKs because only the tasks whose documents have been correctly processed and moved can be updated. I cannot simply run:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE YEAR(DATETIMEWORK)=2016 and
MONTH(DATETIMEWORK)=12 and IDBATCH=1484
This is because our server was attacked by a cryptolocker, so I must process and move only the documents that still exist, while waiting for the others to be released.
Should I split that list into sublists? How?
UPDATE 2
It seems the following could be a solution: I split the list of PKs into chunks of 10000 (a purely experimental number) and then execute as many queries as there are chunks, each with one chunk as its subquery.
def updateDB(listID, y, m, b, log):
    newConn = pymssql.connect(server='localhost', database='DBNAME')
    newCursor = newConn.cursor()
    if len(listID) <= 10000:
        subquery = '(' + ', '.join(str(x) for x in listID) + ')'
        query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT= \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
        try:
            newCursor.execute(query)
            newConn.commit()
        except:
            newConn.rollback()
            log.write('...')
            log.write('\n\n')
        finally:
            newCursor.close()
            del newCursor
            newConn.close()
    else:
        chunksPK = [listID[i:i + 10000] for i in xrange(0, len(listID), 10000)]
        for sublistPK in chunksPK:
            subquery = '(' + ', '.join(str(x) for x in sublistPK) + ')'
            query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT= \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
            try:
                newCursor.execute(query)
                newConn.commit()
            except:
                newConn.rollback()
                log.write('Could not execute partial {}'.format(query))
                log.write('\n\n')
        newCursor.close()
        del newCursor
        newConn.close()
Could this be a good/secure solution?
As stated in the MSDN document
IN (Transact-SQL)
Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.
(The error message you cited was error 8623.)
Putting the IN list values into a temporary table and then using
... WHERE IDWORK IN (SELECT keyValue FROM #inListTable)
strikes me as being more straightforward than the "chunking" method you described.
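For what it's worth, with pymssql the temp-table variant could look something like the sketch below. The temp table and column names (#inListTable, keyValue) are only illustrative; IDWorksUpdate, year, month and idbatch are the variables from the question, and the temp table lives only for the lifetime of this connection.
# Sketch: load the PKs into a temp table, then update via a subquery
import pymssql

conn = pymssql.connect(server='localhost', database='DBNAME')
cur = conn.cursor()
cur.execute('CREATE TABLE #inListTable (keyValue INT PRIMARY KEY)')
cur.executemany('INSERT INTO #inListTable (keyValue) VALUES (%d)',
                [(pk,) for pk in IDWorksUpdate])
cur.execute('''UPDATE [DBNAME].[dbo].[WORKS]
               SET NAMEDOCUMENT = %s + NAMEDOCUMENT
               WHERE IDWORK IN (SELECT keyValue FROM #inListTable)''',
            ('/{}/{}/{}/'.format(year, month, idbatch),))
conn.commit()
conn.close()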
I have created a database to store NGS sequencing results. It consists of 17 tables to store all of the information. The results are stored in spreadsheets, which I parse and store in variables using Python (2.7), and then use the Python package MySQLdb to insert the data into the database. I mainly use functions to obtain the information I need in variables, then write a loop in which I call each function followed by a 'try:' statement to insert. Here is a simple example:
def sample_processor(file):
    my_file = open(file, 'r+')
    samples = []
    for line in my_file:
        ...get info...
        samples.append(line[0])
    return samples

samples = sample_processor('path/to/file')

for sample in samples:
    try:
        samsql = "INSERT IGNORE INTO sample(sample_id, screening) VALUES ("
        samsql = samsql + "'" + sample + "','" + sam_screen_dict.get(sample) + "')"
        cursor.execute(samsql)  # cursor/db are the MySQLdb cursor and connection opened earlier
        db.commit()
    except Exception as e:
        db.rollback()
        print("Something went wrong inserting data into the sample table: %s" % (e))
*sam_screen_dict is a dictionary I made from another function.
This is a simple table that I upload into, but many of the others rely on different dictionaries to make sure the correct results are uploaded. However, I was wondering whether there would be a more robust way to do this using a class.
For example, my sample_id has an associated screening attribute in the sample table, so this is easy to do with one dictionary. I have more complex junction tables, such as the one in which the sample_id, experiment_id and found mutation are stored alongside other data. Would it be a good idea to create a class for this table, calling on a simple 'sample' class to inherit from? That way I would always know that the results being inserted are for the correct sample/experiment etc.
Also, using classes, could I write rules for each attribute so that if the source spreadsheet is for some reason incorrect, it will not insert into the database?
I.e. sample_id is in the format A123/16, so using a class it could check that the first character is 'A' and that sample_id[-3] is always '/'. I know I could write these as functions, but I feel it would take up so much space and time writing so many 'if' statements that storing it once in a class would be a lot better. Something like the sketch after this paragraph is roughly what I have in mind.
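For illustration only (the class, attribute and table names here are placeholders, not my real schema), a rough sketch of the kind of validating class I mean:
class Sample(object):
    def __init__(self, sample_id, screening):
        # validate the sample_id format, e.g. 'A123/16'
        if not sample_id.startswith('A') or sample_id[-3] != '/':
            raise ValueError('Bad sample_id: %s' % sample_id)
        self.sample_id = sample_id
        self.screening = screening

    def insert_statement(self):
        # return a parameterised statement plus its values
        return ("INSERT IGNORE INTO sample (sample_id, screening) "
                "VALUES (%s, %s)",
                (self.sample_id, self.screening))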
Has anybody done anything similar, using classes to pass their variables through and check that they are correct before they reach the insert stage and an error is created?
I am new to Python classes and understand the basics, but I'm still trying to get to grips with them, so a point in the right direction would be great, as would any help on how to actually write the code for a Python class that would make for a more robust database insertion program.
17 tables means you may end up with about 17 classes.
Instead, use a simple script: webpy.db
https://github.com/webpy/webpy/blob/master/web/db.py; just modify a little of the code.
Then you can use the webpy API, http://webpy.org/docs/0.3/api#web.db, to finish your job.
Hope it's useful for you.
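For example, webpy's db layer can then build and parameterise the insert for you; a rough sketch (connection details and column values are placeholders):
import web

# Sketch: web.db generates and escapes the INSERT statement
db = web.database(dbn='mysql', db='ngs_results', user='user', pw='password')
db.insert('sample',
          sample_id='A123/16',
          diagnosis='unknown',
          screening='positive')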
I'm new to both MongoDB and PyMongo, and am having some performance issues regarding cursors.
TL;DR: Any operation I try to perform using a cursor takes about a second.
Long version
I have a small database, which I bulkloaded. Each entry has 3 fields:
dom: domain name (unique)
date: date, YYYYMMDD
flag: string
I've loaded about 1.9 million entries, without incident, and quite quickly.
I created a hash index on the dom field.
Now, I want to grab certain records by the domain field, and update them, using a Python program.
That's where the problem lies.
I'm using the latest MongoDB, and the latest pyMongo.
Stripped-down program:
import pymongo
from pymongo import MongoClient

client = MongoClient()  # connect to the local mongod
db = client.myindexname
posts = db.posts
print list(db.profiles.index_information())  # shows hash index is present

for k in newdomainlist.keys():    # iterate list of domains to check
    ret = posts.find({"dom": k})  # this runs fine, and quickly
    # 'ret' is a cursor
    print ret                     # this runs quickly
    # Here's the problem
    print ret.count()             # this takes about a second. why?
If I just 'print ret', the speed is fine. However, if I try to reference anything in the cursor, the speed drops to the floor; I can do about 1 operation per second.
In this case, I'm just trying to see whether ret.count() returns '0' (we don't have this domain) or '1' (we have it already).
I've tried adding a batch_size(10000) to the find, without it helping.
I DO have the Python C extensions loaded.
What the heck am I doing wrong?
thanks
It turned out that I'd created my hashed index on the wrong collection, 'collection', rather than 'posts'. Chalk it up to MongoDB inexperience. We can close this one now, or delete it entirely.
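For anyone who lands here with the same symptom, the fix is simply to build the hashed index on the collection you actually query; a minimal sketch using the names from the question above:
import pymongo
from pymongo import MongoClient

client = MongoClient()
db = client.myindexname
# Create the hashed index on 'posts', the collection queried by find()
db.posts.create_index([("dom", pymongo.HASHED)])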
I have the following code:
def executeQuery(conn, query):
    cur = conn.cursor()
    cur.execute(query)
    return cur

def trackTagsGenerator(chunkSize, baseCondition):
    """ Returns a dict of trackId:tag limited to chunkSize. """
    sql = """
        SELECT track_id, tag
        FROM tags
        WHERE {baseCondition}
    """.format(baseCondition=baseCondition)
    limit = chunkSize
    offset = 0
    while True:
        trackTags = {}
        # fetch the track ids with the corresponding tag
        limitPhrase = " LIMIT %d OFFSET %d" % (limit, offset)
        query = sql + limitPhrase
        offset += limit
        cur = executeQuery(smacConn, query)
        rows = cur.fetchall()
        if not rows:
            break
        for row in rows:
            trackTags[row['track_id']] = row['tag']
        yield trackTags
I want to use it like this:
for trackTags in list(trackTagsGenerator(DATA_CHUNK_SIZE, baseCondition)):
    print trackTags
    break
This code produces the following error without even fetching one chunk of track tags:
Exception _mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now") in <bound method SSDictCursor.__del__ of <MySQLdb.cursors.SSDictCursor object at 0x10b067b90>> ignored
I suspect it's because I have the query-execution logic in the body of the loop in the generator function.
Is someone able to tell me how to fetch chunks of data using MySQLdb in such a way?
I'm pretty sure this is because you can run into situations where you've got two queries running simultaneously because of the yield. Depending on how you call the function (threads, async, etc.), I'm pretty sure your cursor might get clobbered too.
As well, you're opening yourself up to (sorry, but I can't sugar-coat this part) horrific SQL injection holes by inserting baseCondition using what is essentially a printf. Take a look at the DB-API's parameter substitution docs for help.
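For instance, with MySQLdb the parameter substitution could look roughly like this (the genre column is made up purely to stand in for whatever baseCondition filters on):
# Sketch: let the driver escape the values instead of formatting them in
sql = ("SELECT track_id, tag FROM tags "
       "WHERE genre = %s LIMIT %s OFFSET %s")
cur.execute(sql, (genre, limit, offset))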
Yield isn't going to save you time or energy here at all; the full SQL command will always need to run before you'll get a single result. (Hence you're using LIMIT and OFFSET to make it more friendly, kudos.)
For example, someone updates the table while you're yielding out some data; in this particular case, that's not the end of the world. In many others, it gets ugly.
If you're just goofing around and you want this to work 'right-now-dammit', it would probably work to modify executeQuery as follows:
def executeQuery(conn, query):
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    cur.close()
    return rows
One thing that also kind of jumps out at me: you define trackTags = {}, but then you update tagTrackIds and yield trackTags, which will always be an empty dict.
My suggestion would be not to bother yourself with the headache of hand-writing SQL if you're just trying to get a hobby project working. Take a look at Elixir, which is built on top of SQLAlchemy.
Using an ORM (object-relational mapper) can be a much friendlier introduction to databases. Defining what your objects look like in Python, having it automatically generate your schema for you, and being able to add/modify/delete things in a Pythonic manner is really nifty.
If you really need to be async, check out the ultramysql Python module.
You use an SSDictCursor, which maps to mysql_use_result() on the MySQL API side. This requires that you read out the complete result before you can issue a new command.
As this happens before you receive the first chunk of data after all: are you sure this doesn't happen in the context of the query issued before this part of the code runs? The results of that last query might still be in the pipeline, and executing the next one (i.e., the first one in this context) might break things.
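One way around it, as a sketch only, is to buffer each chunk client-side before the next statement hits the same connection. This assumes MySQLdb with a plain DictCursor instead of the server-side SSDictCursor; the connection details are placeholders.
import MySQLdb
import MySQLdb.cursors

# Sketch: a buffered DictCursor fetches the whole chunk client-side,
# so the connection is free for the next LIMIT/OFFSET query.
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='smac', cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
cur.execute(query)
rows = cur.fetchall()  # result fully read; connection is in sync again
cur.close()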