I am using multiprocessing.Pool in my current program because I wanted to speed up fetching data from an on-premises data center and dumping it into another database on a different server. The current rate is too slow for the megabytes of data involved, and this approach seems to work best for my requirement:
def fetch_data():
    # select data from on_prem_db (id, name...data)
    # using Pool and starmap,
    # runs dump_data function in 5 parallel workers
    dump_data()
    pass

def dump_data():
    # insert entry in table_f1
    # insert entry in table_g1
    pass
Now I am running into an issue where multiple workers sometimes fetch rows that have already been processed, which leads to unique key violations.
e.g. the first worker fetches [10, 20, 40, 50, 70] and the second worker fetches [30, 40, 60, 70, 80]; the rows with ids 40 and 70 are duplicated. I am supposed to see 10 entries in my target db but I see only 8, and 2 of the inserts raise unique key violations.
How can I make sure that different workers fetch different rows from my source (on-prem) db so that my program doesn't try to insert already-inserted rows?
e.g. my select query:
fetch_data_list_of_ids = [list of ids of processed data]
data_list = list(itertools.islice(on_prem_db.get_data(table_name), 5))
Is there a way I can keep a list and append the row ids of already-processed data inside fetch_data()?
Then, every time data_list runs a new query to fetch data, the next thing I would do is check whether the newly fetched rows have ids that are already in fetch_data_list_of_ids.
Or is there any other way to make sure duplicate entries are not processed?
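For illustration only, here is a minimal sketch of the id-tracking idea described above. It assumes Python 3; on_prem_db.get_data and dump_data are the names from the question (dump_data is assumed to accept a single row here), and processed_ids is a hypothetical set kept in the parent process, so no cross-process sharing is needed:

import itertools
from multiprocessing import Pool

processed_ids = set()  # ids already dispatched for insertion

def fetch_and_dump(table_name):
    with Pool(5) as pool:
        while True:
            batch = list(itertools.islice(on_prem_db.get_data(table_name), 5))
            if not batch:
                break
            # keep only rows whose id (first column) has not been seen yet
            new_rows = [row for row in batch if row[0] not in processed_ids]
            processed_ids.update(row[0] for row in new_rows)
            pool.starmap(dump_data, [(row,) for row in new_rows])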
I have a table called details and I am trying to get the live item count from it.
Below is the code.
The table already had 7 items and I inserted 8 more, so my output should show 15.
But my output still shows 7, which is the old count. How do I get the live, updated count?
From the UI it also shows 7, but when I check the live count with Start Scan, I get 15 entries.
Is there some delay, like a few hours, before the count is updated?
import boto3

dynamo_resource = boto3.resource('dynamodb')
dynamodb_table_name = dynamo_resource.Table('details')
item_count_table = dynamodb_table_name.item_count
print('table_name count for field is', item_count_table)
Using the client:

import boto3

dynamoDBClient = boto3.client('dynamodb')
table = dynamoDBClient.describe_table(TableName='details')
print(table['Table']['ItemCount'])
In your example you are calling DynamoDB DescribeTable which only updates its data approximately every 6 hours:
Item Count:
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableDescription.html#DDB-Type-TableDescription-ItemCount
In order to get the "Live Count" you have two possible options:
Option 1: Scan the entire table.
dynamoDBClient = boto3.client('dynamodb')
item_count = dynamoDBClient.scan(TableName='details', Select='COUNT')
print(item_count['Count'])
Be aware this performs a full table Scan (for tables larger than 1 MB you also need to paginate with LastEvaluatedKey and sum the Count values), so it may not be the best option for large tables.
Option 2: Update a counter for each item you write
You would need to use TransactWriteItems and Update a live counter for every Put which you do to the table. This can be useful, but be aware that you will limit your throughput to 1000 WCU per second as you are focusing all your writes on a single item.
You can then return the live item count with a simple GetItem.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html
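For illustration only, a minimal sketch of this counter pattern with boto3; the key attribute pk, the counter item and the liveCount attribute are assumptions, not taken from the original table:

import boto3

dynamodb = boto3.client('dynamodb')

# Write the new item and bump the counter item in one transaction.
dynamodb.transact_write_items(
    TransactItems=[
        {'Put': {
            'TableName': 'details',
            'Item': {'pk': {'S': 'item-123'}},
        }},
        {'Update': {
            'TableName': 'details',
            'Key': {'pk': {'S': 'counter'}},
            'UpdateExpression': 'ADD liveCount :one',
            'ExpressionAttributeValues': {':one': {'N': '1'}},
        }},
    ]
)

# The live count is then a single GetItem on the counter item.
resp = dynamodb.get_item(TableName='details', Key={'pk': {'S': 'counter'}})
print(resp['Item']['liveCount']['N'])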
According to the DynamoDB documentation,
ItemCount - The number of items in the global secondary index. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
You might need to keep track of the item counts yourself if you need to return an updated count shortly after editing the table.
I have two objects called RECEIPT and PARTICULAR.
Attributes of RECEIPT object are:
receiptNo
particularList
Attributes of PARTICULAR object are:
particularId
particularName
I also have their respective tables. receiptNo is the primary key of the RECEIPT table and a foreign key in the PARTICULAR table, so each receipt has multiple particulars.
I want to fetch data to populate the RECEIPT objects. To achieve this I can first run a select query against the RECEIPT table and then, while iterating over the result in a for loop, run another query against the PARTICULAR table. But this way I am calling the DB twice.
To avoid calling the DB twice I also tried a join:
SELECT * FROM RECEIPT r,PARTICULAR p WHERE r.RECEIPT_NO = p.RECEIPT_NO
However, it returns repetitive data for the RECEIPT, i.e. for each PARTICULAR row the corresponding RECEIPT data is fetched as well. The RECEIPT data is repetitive because multiple particularIds share the same receiptNo. Due to this I am unable to load the data properly into the RECEIPT objects (or maybe I just don't know how to load such a result set into the respective objects).
My actual requirement is to load the RECEIPT objects by building the PARTICULAR list for each receipt.
Is using the for loop and calling the DB twice the only way to achieve this?
Please suggest an efficient way to achieve it.
I think querying the data from the database with the JOIN approach is the most efficient way to do it.
If you make sure to ORDER BY "RECEIPT_NO" you just have to loop through the list once in python, only creating a new Receipt object every time you reach a new "RECEIPT_NO".
So the SQL becomes:
SELECT * FROM RECEIPT r, PARTICULAR p WHERE r.RECEIPT_NO = p.RECEIPT_NO ORDER BY r.RECEIPT_NO
And the Python code could look like:

data = query("SELECT * FROM RECEIPT r, PARTICULAR p WHERE r.RECEIPT_NO = p.RECEIPT_NO ORDER BY r.RECEIPT_NO")
last_receipt_no = None
for row in data:
    if last_receipt_no != row.RECEIPT_NO:
        # Replace with code initializing a new Receipt object
        last_receipt_no = row.RECEIPT_NO
    # Replace with code initializing a new Particular object and
    # appending it to the current Receipt's particularList
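For illustration, a more complete sketch of that loop; the Receipt and Particular classes and the PARTICULAR_ID/PARTICULAR_NAME column names are hypothetical stand-ins:

class Particular:
    def __init__(self, particular_id, particular_name):
        self.particularId = particular_id
        self.particularName = particular_name

class Receipt:
    def __init__(self, receipt_no):
        self.receiptNo = receipt_no
        self.particularList = []

receipts = []
current = None
for row in data:  # data as returned by the ordered JOIN above
    if current is None or current.receiptNo != row.RECEIPT_NO:
        # new receipt number reached: start a new Receipt object
        current = Receipt(row.RECEIPT_NO)
        receipts.append(current)
    # every joined row contributes one Particular to the current receipt
    current.particularList.append(Particular(row.PARTICULAR_ID, row.PARTICULAR_NAME))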
I have implemented a Python script in order to divide millions of documents (generated by a .NET web application and all placed in a single directory) into subfolders with this scheme: year/month/batch, as all the tasks these documents come from were originally divided into batches.
My Python script queries SQL Server 2014, which contains all the data it needs for each document, in particular the month and year each one was created in. Then it uses the shutil module to move the pdf. So, I first perform a query to get the list of batches for a given month and year:
queryBatches = '''SELECT DISTINCT IDBATCH
FROM [DBNAME].[dbo].[WORKS]
WHERE YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={}'''.format(year, month)
Then I perform:
for batch in batches:
    query = '''SELECT IDWORK, IDBATCH, NAMEDOCUMENT
               FROM [DBNAME].[dbo].[WORKS]
               WHERE NAMEDOCUMENT IS NOT NULL and
                     NAMEDOCUMENT not like '/%/%/%/%.pdf' and
                     YEAR(DATETIMEWORK)={} and
                     MONTH(DATETIMEWORK)={} and
                     IDBATCH={}'''.format(year, month, batch[0])
whose records are collected through a cursor, following the pymssql usage documentation. So I go on with:
IDWorksUpdate = []
row = cursor.fetchone()
while row:
    if moveDocument(...):
        IDWorksUpdate.append(row[0])
    row = cursor.fetchone()
Finally, when the loop has ended, IDWorksUpdate contains the PKs of all the WORKS rows whose documents were successfully moved into a subfolder. So I close the cursor and the connection and instantiate new ones.
In the end I perform:
subquery = '('+', '.join(str(x) for x in IDWorksUpdate)+')'
query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(year, month, idbatch, subquery)

newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()
try:
    newCursor.execute(query)
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating documents names in database of works {}/{} of batch {}'.format(year, month, idbatch))
finally:
    newCursor.close()
    del newCursor
    newConn.close()
This morning I found that for a couple of batches the update query failed to execute on the database, even though the documents were correctly moved into their subdirectories.
That batch had more than 55000 documents to be moved, so maybe IDWorksUpdate grew too large and caused that final update query to be built badly? I thought 55000 was not such a big list of integers. The problem is that with pymssql I cannot have more than one connection/cursor at a time to the same database, so I cannot update each record as its file is moved. That is why I build a list of PKs of the works whose documents were correctly moved and update them at the end with a new connection/cursor. What could have happened? Am I doing it wrong?
UPDATE
I've just written a simple script to reproduce the query which is going to be executed to update the records, and this is the error I get from SQL Server:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
This is the query:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE IDWORK IN (list of 55157 PKs)
The fact is that the table is very big (about 14 million records). But I need that list of PKs because only the tasks whose documents have been correctly processed and moved can be updated. I cannot simply run:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE YEAR(DATETIMEWORK)=2016 and
MONTH(DATETIMEWORK)=12 and IDBATCH=1484
This is because our server was hit by a cryptolocker, so I must process and move only the documents that still exist, while waiting for the others to be recovered.
Should I split that list into sublists? How?
UPDATE 2
It seems the following could be a solution: I split the list of PKs into chunks of 10000 (a purely experimental number) and then execute as many queries as there are chunks, each with one chunk as its subquery.
def updateDB(listID, y, m, b, log):
    newConn = pymssql.connect(server='localhost', database='DBNAME')
    newCursor = newConn.cursor()
    if len(listID) <= 10000:
        subquery = '('+', '.join(str(x) for x in listID)+')'
        query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
        try:
            newCursor.execute(query)
            newConn.commit()
        except:
            newConn.rollback()
            log.write('...')
            log.write('\n\n')
        finally:
            newCursor.close()
            del newCursor
            newConn.close()
    else:
        chunksPK = [listID[i:i + 10000] for i in xrange(0, len(listID), 10000)]
        for sublistPK in chunksPK:
            subquery = '('+', '.join(str(x) for x in sublistPK)+')'
            query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
            try:
                newCursor.execute(query)
                newConn.commit()
            except:
                newConn.rollback()
                log.write('Could not execute partial {}'.format(query))
                log.write('\n\n')
        newCursor.close()
        del newCursor
        newConn.close()
Could this be a good/safe solution?
As stated in the MSDN document
IN (Transact-SQL)
Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.
(The error message you cited was error 8623.)
Putting the IN list values into a temporary table and then using
... WHERE IDWORK IN (SELECT keyValue FROM #inListTable)
strikes me as being more straightforward than the "chunking" method you described.
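For illustration, a rough sketch of that temp-table approach with pymssql; IDWorksUpdate, year, month and idbatch are the names from the question, the temp table name follows the snippet above, and the rest is an assumption:

import pymssql

conn = pymssql.connect(server='localhost', database='DBNAME')
cursor = conn.cursor()
try:
    # Load the PKs into a session-scoped temp table, then reference it
    # from the UPDATE via an IN (SELECT ...) subquery.
    cursor.execute('CREATE TABLE #inListTable (keyValue INT PRIMARY KEY)')
    cursor.executemany('INSERT INTO #inListTable (keyValue) VALUES (%d)',
                       [(x,) for x in IDWorksUpdate])
    cursor.execute('''UPDATE [DBNAME].[dbo].[WORKS]
                      SET NAMEDOCUMENT = '/{}/{}/{}/' + NAMEDOCUMENT
                      WHERE IDWORK IN (SELECT keyValue FROM #inListTable)'''
                   .format(year, month, idbatch))
    conn.commit()
except Exception:
    conn.rollback()
finally:
    cursor.close()
    conn.close()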
I have a DynamoDB table that stores email attribute information. I have a hash key on the email and a range key on the timestamp (a number). The initial idea of using email as the hash key was to be able to query all records for a given email. But one thing I am trying to do is retrieve all email ids (the hash keys). I am using boto for this, but I am unsure how to retrieve distinct email ids.
My current code to pull 10,000 email records is
import boto.dynamodb2
from boto.dynamodb2.table import Table

conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
s = email_attributes.scan(limit=10000, attributes=['email'])
But to retrieve the distinct records, I will have to do a full table scan and then pick out the distinct records in my code. Another idea I have is to maintain a second table that just stores these emails and do conditional writes: check whether an email id already exists and, if not, write it. But I am trying to work out whether this would be more expensive, since every insert becomes a conditional write.
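For illustration only, the conditional write described above might look roughly like this with boto3 (rather than the older boto used in the snippet above); the emails_only table name and the email key attribute are assumptions:

import boto3
from botocore.exceptions import ClientError

emails_only = boto3.resource('dynamodb', region_name='us-west-2').Table('emails_only')

def record_email(email_id):
    try:
        # Write the email id only if it is not already present.
        emails_only.put_item(
            Item={'email': email_id},
            ConditionExpression='attribute_not_exists(email)',
        )
    except ClientError as err:
        # A failed conditional write still consumes write capacity.
        if err.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise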
Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given an H+R table of email_id+timestamp called stamped_emails, the list of all unique email_ids is a materialized view of that H+R stamped_emails table. You could enable a DynamoDB Stream on the stamped_emails table and subscribe a Lambda function to that Stream which does a PutItem (email_id) to a Hash-only table called emails_only. Then you could Scan emails_only and you would get no duplicates.
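For illustration, a minimal sketch of such a Lambda handler; the email_id attribute name follows the description above, and the rest is an assumption:

import boto3

dynamodb = boto3.client('dynamodb')

def handler(event, context):
    # Copy each hash key from the stamped_emails stream into emails_only.
    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            email_id = record['dynamodb']['Keys']['email_id']['S']
            dynamodb.put_item(
                TableName='emails_only',
                Item={'email_id': {'S': email_id}},
            )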
Finally, regarding your question about cost: Scan reads entire items even if you only request certain projected attributes from those items. Second, Scan has to read through every item, even ones filtered out by a FilterExpression (Condition Expression). Third, Scan reads through items sequentially, which means each Scan call is treated as one big read for metering purposes. The cost implication is that if a Scan call reads 200 different items, it will not necessarily cost 100 RCU. If each of those items is 100 bytes, that Scan call will cost ROUND_UP((20000 bytes / 1024 bytes per KB) / 8 KB per eventually consistent RCU) = 3 RCU. Even if the call only returns 123 items after filtering, if the Scan had to read 200 items, you would still incur 3 RCU in this situation.
SQLite database with two tables, each over 28 million rows. Here's the schema:
CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT,PATH TEXT,FILE TEXT,FULLPATH TEXT,MODIFIED_TIME FLOAT);
CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT,INC_PATH TEXT,INC_FILE TEXT,INC_FULLPATH TEXT,INC_MODIFIED_TIME FLOAT);
Here's an example row from MASTER:
ID PATH FILE FULLPATH MODIFIED_TIME
---------- --------------- ---------- ----------------------- -------------
1 e:\ae/BONDS/0/0 100.bin e:\ae/BONDS/0/0/100.bin 1213903192.5
The tables have mostly identical data, with some differences between MODIFIED_TIME in MASTER and INC_MODIFIED_TIME in INCREMENTAL.
If I execute the following query in sqlite, I get the results I expect:
select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;
That query will pause for a minute or so, return a number of rows, pause again, return some more, etc., and finish without issue. Takes about 2 minutes to fully return everything.
However, if I execute the same query in Python:
changed_files = conn.execute("select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;")
It will never return - I can leave it running for 24 hours and still have nothing. The python32.exe process doesn't start consuming a large amount of cpu or memory - it stays pretty static. And the process itself doesn't actually seem to go unresponsive - however, I can't Ctrl-C to break, and have to kill the process to actually stop the script.
I do not have these issues with a small test database - everything runs fine in Python.
I realize this is a large amount of data, but if sqlite is handling the actual queries, python shouldn't be choking on it, should it? I can do other large queries from python against this database. For instance, this works:
new_files = conn.execute("SELECT DISTINCT INC_FULLPATH, INC_PATH, INC_FILE from INCREMENTAL where INC_FULLPATH not in (SELECT DISTINCT FULLPATH from MASTER);")
Any ideas? Are the pauses in between sqlite returning data causing a problem for Python? Or is something never occurring at the end to signal the end of the query results (and if so, why does it work with small databases)?
Thanks. This is my first stackoverflow post and I hope I followed the appropriate etiquette.
Python tends to ship with older versions of the SQLite library, especially Python 2.x, where it is not updated.
However, your actual problem is that the query is slow.
Use the usual mechanisms to optimize it, such as creating a two-column index on INC_FULLPATH and INC_MODIFIED_TIME.
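For illustration, a minimal sketch of creating that index with the sqlite3 module; the index name and database file name are assumptions:

import sqlite3

conn = sqlite3.connect('files.db')  # hypothetical database file name
conn.execute("CREATE INDEX IF NOT EXISTS idx_inc_fullpath_mtime "
             "ON INCREMENTAL (INC_FULLPATH, INC_MODIFIED_TIME)")
conn.commit()

# The join from the question can then look up INCREMENTAL rows by
# INC_FULLPATH via the index instead of scanning the whole table.
changed_files = conn.execute(
    "select ID from MASTER inner join INCREMENTAL "
    "on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;"
)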