concurrent sqlite writes in python

I've got a Python application (GTK) which uses threads to fetch information from certain sites and write it to the database.
I've got a thread that checks for new updates at site1; if there are updates, I receive a JSON object (json1).
I then iterate through json1 and insert the new information into the database. Within json1 there is a result I need to use to fetch more information at site2, which returns another JSON object (json2).
So the situation is something like this:

def get_more_info(name):
    json2 = get(www.site2.com?=name....)
    etc

for information in json1:
    db.insert(information)
    get_more_info(information.name)
From this situation I see that there are a couple of ways of doing this.
Have get_more_info return the JSON object, so that:

for information in json1:
    db.insert(information)
    json2 = get_more_info(information.name)
    for info in json2:
        db.insert(info)
db.commit()
Or have get_more_info do the inserting itself:

for information in json1:
    db.insert(information)
    get_more_info(information.name)
db.commit()
Both of these approaches seem a bit slow, since the main for loop has to wait for get_more_info to complete before carrying on, and both json1 and json2 could be large. There is also the possibility that site2 is unavailable at that moment, causing the whole transaction to fail. The application can still function without json2; that data can be fetched at a later time if needed.
So I was thinking of passing information.name to a queue so that the main loop can continue, and kicking off a thread that will monitor that queue and execute get_more_info. Is this the right approach to take?
I know that sqlite does not perform concurrent writes. If I recall correctly, if get_more_info tries to write while the main for loop is busy, sqlite will raise OperationalError: database is locked.
What happens to get_more_info at that point? Does it get put into some type of write queue, or does it wait for the main loop to complete? And what happens to the main for loop while get_more_info is busy writing?
Will there be a need to go to another database engine?

Since you are using threads anyway, you can use another thread to write to the database. To feed it with data, use a globally accessible Queue.Queue() (queue.Queue() in Python 3) instance. Calling the instance's get() method with block=True will make the writer thread wait for data to write.
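A minimal sketch of that idea in Python 3 (the queue name, the writer function, and the info table are placeholders of mine, not anything from your code):

import queue
import sqlite3
import threading

db_queue = queue.Queue()   # globally accessible; any thread can put work on it
STOP = object()            # sentinel object that tells the writer to shut down

def db_writer(path):
    # The only thread that ever touches the sqlite connection.
    conn = sqlite3.connect(path)
    while True:
        item = db_queue.get(block=True)   # blocks until something arrives
        if item is STOP:
            break
        sql, params = item
        conn.execute(sql, params)
        conn.commit()
    conn.close()

threading.Thread(target=db_writer, args=("app.db",), daemon=True).start()

# Any thread (the main loop or a get_more_info worker) just enqueues its writes:
db_queue.put(("INSERT INTO info (name) VALUES (?)", ("example",)))

Because only the writer thread owns the connection, the "database is locked" problem goes away, and a slow or unavailable site2 only delays items sitting in the queue, not the main loop.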

Related

Trying to create the same empty file using AWS s3.put_object from more than one process in parallel using Python

Suppose two processes (it can be any number) are trying to access the same block of code in parallel. To avoid that parallel access, I am trying to create an empty file in an S3 bucket, so that if the file exists, any other process trying to access the block of code has to wait until the first process is done using it. After that, the first process deletes the empty file, which means the second process can now use the block of code by creating the empty file and holding the lock itself.
import boto3

s3 = boto3.client('s3')

def create_obj(bucket, file):
    s3.put_object(Bucket=bucket, Key=file)
    return "file created"

job1 = create_obj(bucket="s3bucketname", file='xyz/empty_file.txt')
job2 = create_obj(bucket="s3bucketname", file='xyz/empty_file.txt')
Here suppose job1 and job2 try to access the same create_obj in parallel to create empty_file.txt, which means they hit the line s3.put_object at the same time. Then one of the jobs has to wait. Any number of jobs can access the create_obj function in parallel; we need to make sure that those jobs execute properly as explained above.
Please help me with this.
My understanding is that you're trying to implement a distributed locking mechanism on the basis of S3.
Since the updates to the S3 consistency model at re:Invent 2020, this could be possible, but you could also use a service like DynamoDB, which makes building these easier.
I recommend you check out this blog post on the AWS blog, which describes the process. Since link-only answers are discouraged here, I'll try to summarize the idea, but you should really read the full article.
You do a conditional PutItem call on a lock item in the table. The condition is that an item with that key doesn't exist or has expired. The new item contains:
- The name of the lock (partition key)
- How long the lock is supposed to be valid
- The timestamp when the lock was created
- Some identifier that identifies your system (the locking entity)
If that put succeeds, you know you have acquired the lock; if it fails, you know it's already locked and can retry later.
You then perform your work.
In the end you remove the lock item.
There's an implementation of this for Python as well: python-dynamodb-lock on PyPI.
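As a rough sketch of the acquire step with plain boto3 (the locks table and its attribute names are assumptions of mine; the blog post and python-dynamodb-lock handle expiry, renewal, and failure cases far more carefully):

import time
import uuid
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')

def try_acquire_lock(lock_name, ttl_seconds=60):
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName='locks',                      # hypothetical table name
            Item={
                'lock_name': {'S': lock_name},      # partition key
                'owner': {'S': str(uuid.uuid4())},  # identifies the locking entity
                'created_at': {'N': str(now)},
                'expires_at': {'N': str(now + ttl_seconds)},
            },
            # Succeeds only if no lock item exists or the existing one has expired.
            ConditionExpression='attribute_not_exists(lock_name) OR expires_at < :now',
            ExpressionAttributeValues={':now': {'N': str(now)}},
        )
        return True
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False   # someone else holds the lock; retry later
        raise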

Non-blocking Python database call

I need to make many concurrent database calls while allowing the program to continue to run. When the call returns, it sets a value.
If the queries were known right away, we could use a ThreadPoolExecutor, for example. What if we don't have all the queries ready ahead of time but are running them as we go? For example, we are traversing a linked list, and at each node we want to make a database query and set a value based on the response.
The task here is to not wait until the database result is returned before proceeding to the next node.
Is it possible? One idea would be to create a Thread object. Maybe we can use asyncio to our advantage. The advantage of traversing and requesting as we go, over collecting all the nodes first and running the queries at once, is that the database won't be overwhelmed as much, although the difference might be minimal.
Thanks!
If you're using SQLite then you can use https://pypi.org/project/sqlite3worker/
If not, you can use the Queue library.
You can queue the items from your thread calls
and add a condition to execute the queued items sequentially.
You can check the implementation of sqlite3worker and implement something similar for your own database.
P.S.: Databases like SQL Server allow concurrent calls by default, so you needn't worry about being thread-safe there.
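For the SQLite case, sqlite3worker routes every statement through a single worker thread. If I remember its interface correctly, usage looks roughly like this (file and table names are made up):

from sqlite3worker import Sqlite3Worker

sql_worker = Sqlite3Worker("/tmp/test.sqlite")
sql_worker.execute("CREATE TABLE IF NOT EXISTS tester (ts TEXT, uuid TEXT)")

# Safe to call from several threads; statements are queued and run one at a time.
sql_worker.execute("INSERT INTO tester VALUES (?, ?)", ("2020-01-01 13:00:00", "bow"))
results = sql_worker.execute("SELECT * FROM tester")

sql_worker.close()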

Spider/Scraper in Python, insert data from multiprocessing

So I'm writing a small spider/scraper in Python that fetches and analyses different URLs using multiple processes.
My question is: how should I insert the data gathered in the previous step?
Spawn a thread from each process? Add the results to a global object and insert them into the database afterwards? Other options?
Thank you.
One way is to dump the results from each thread to a .csv file in append mode. You can protect the file using a context manager. This way you won't lose any data if your system stops execution for whatever reason, because results are saved the moment they become available.
I recommend using the with statement, whose primary purpose is exception-safe cleanup of the object used inside it (in this case your .csv file). In other words, with makes sure that files are closed, locks released, contexts restored, etc.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False)
# As soon as you are here, reference is closed
My humble opinion is to use Pool; for a small spider, Pool is enough.
Here is an example:
from multiprocessing.pool import Pool

pool = Pool(20)
pool.map(main, urls)   # encapsulate the original functions into a main function and pass in the urls
pool.close()
pool.join()
This is the source code.
P.S. This is my first answer; I would be glad if it was helpful.

MPI locking for sqlite (python)

I am using mpi4py for a project I want to parallelize. Below is very basic pseudo code for my program:
Load list of data from sqlite database
Based on COMM.Rank and Comm.Size, select chunk of data to process
Process data...
Use MPI.Gather to pass all of the results back to root
if root:
    iterate through results and save to sqlite database
I would like to eliminate the call to MPI.Gather by simply having each process write its own results to the database. So I want my pseudo code to look like this:
Load list of data
Select chunk of data
Process data
Save results
This would drastically improve my program's performance. However, I am not entirely sure how to accomplish it. I have tried to find methods through Google, but the only thing I could find is MPI-IO. Is it possible to use MPI-IO to write to a database, specifically using Python, sqlite, and mpi4py? If not, are there any alternatives for writing concurrently to a sqlite database?
EDIT:
As @CL pointed out in a comment, sqlite3 does not support concurrent writes to the database. So let me ask my question a little differently: is there a way to lock writes to the database so that other processes wait until the lock is released before writing? I know sqlite3 has its own locking modes, but these modes seem to cause insertions to fail rather than block. I know I've seen something like this with Python threading, but I haven't been able to find anything online about doing it with MPI.
I would suggest you pass your results back to the root process, and let the root process write them to the SQLite database. The pseudocode would look something like this:
load list of data
if rank == 0:
    for _ in range(len(data)):
        result = receive from any worker
        save result
else:
    select chunk of data
    process data
    send result(s) to rank 0
The advantage over gathering is that rank 0 can save the results as soon as they are ready. There is an mpi4py example that shows how to spread tasks out over multiple workers when there are lots of tasks and the processing time varies widely.
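A bare-bones mpi4py version of that pseudocode might look like the following; load_data, process, and save_results are stand-ins for whatever your program actually does:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Rank 0 is the only process that writes to the sqlite database.
    for _ in range(size - 1):                      # one message per worker
        results = comm.recv(source=MPI.ANY_SOURCE)
        save_results(results)                      # stand-in for the sqlite INSERTs
else:
    data = load_data()                             # stand-in for "load list of data"
    chunk = data[rank - 1::size - 1]               # workers are ranks 1..size-1
    results = [process(item) for item in chunk]    # stand-in for "process data"
    comm.send(results, dest=0)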

Python: file-based thread-safe queue

I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them, and securely sends them to another application (app B). App A also checks whether each record was successfully sent. If not, the record has to be saved; app A waits a few seconds and then tries to send it again, and so on. This is the important part: if sending was unsuccessful, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal structure for this is a queue, but I need the queue to live in a file (on disk). I found, for example, this code http://code.activestate.com/recipes/576642/ but it loads the full file into memory on open, which is exactly what I want to avoid. I must assume that this file of records could grow to a couple of GBs.
So my question is: what would you recommend for storing these records? It needs to handle a lot of data; on the other hand, it shouldn't be too slow, because during normal activity only one record is saved at a time and it's read and removed immediately. So the basic state is an empty queue. And it should be thread-safe.
Should I use a database (dbm, sqlite3...) or something like pickle, shelve, or something else?
I am a little confused about this... thank you.
You can use Redis as a database for this. It is very fast, does queuing amazingly well, and it can save its state to disk in a few different ways, depending on the fault-tolerance level you want. Since Redis runs as an external process, you might not need to have it use a very strict saving policy: if your program crashes, everything is still saved externally.
See here: http://redis.io/documentation, and if you want more detailed info on how to do this in Redis, I'd be glad to elaborate.
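As a rough sketch with the redis-py client (the key name pending_records, the connection details, and the raw_netflow_record/resend placeholders are assumptions of mine), a Redis list used as a queue would look like:

import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a Redis server running locally

# Producer side: a send attempt failed, so park the raw record for a later retry.
r.rpush("pending_records", raw_netflow_record)

# Consumer side: block for up to 5 seconds waiting for a record to retry.
item = r.blpop("pending_records", timeout=5)
if item is not None:
    _key, record = item
    resend(record)                             # stand-in for app A's retry logic

With RDB snapshots or AOF persistence enabled, the queue contents can also survive a restart of Redis itself, subject to the persistence settings you choose.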
