Spider/Scraper in Python, insert data from multiprocessing - python

So I'm writing a small spider/scraper in Python that fetches and analyses different URLs using multiple processes.
My question is: how should I insert the data gathered in the previous step?
Call a thread from each process? Add them to a global object and insert into the database afterwards? Other options?
Thank you.

One way is to dump the results from each thread to a .csv file in append mode. You can protect the file writes with a context manager. This way you won't lose any data if your system stops execution for whatever reason, because every result is saved the moment it becomes available.
I recommend using the with statement, whose primary purpose is exception-safe cleanup of the object it manages (in this case your .csv file). In other words, with makes sure that files are closed, locks are released, contexts are restored, etc.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False)
# As soon as you are here, reference is closed
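If several worker processes append to the same file, you may also want a shared lock so their writes don't interleave. A minimal sketch, assuming the workers are forked from the parent that creates the lock (save_rows is a hypothetical helper, not part of the original code):
import csv
from multiprocessing import Lock

csv_lock = Lock()  # created in the parent, inherited by the worker processes

def save_rows(rows, path="myfile.csv"):
    # rows: a list of tuples scraped by one worker
    with csv_lock:                          # only one process appends at a time
        with open(path, "a", newline="") as f:
            csv.writer(f).writerows(rows)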

My humble opinion is to use Pool; for a small spider, Pool is enough.
Here is an example:
from multiprocessing.pool import Pool
pool = Pool(20)
pool.map(main, urls)  # wrap the original fetch/parse functions in main() and pass it the list of urls
pool.close()
pool.join()
This is the source code.
P.S. This is my first answer; I would be glad if it was helpful.
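For reference, main() here stands for whatever fetches and analyses a single URL. A rough sketch using only the standard library (parse_page is a hypothetical stand-in for the spider's own analysis code):
from urllib.request import urlopen

def main(url):
    # fetch one page and hand it to the spider's parsing/analysis code
    with urlopen(url, timeout=10) as response:
        html = response.read()
    return parse_page(html)  # hypothetical: extract and analyse data from the page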

Related

Python: Efficient way to move a huge list of data from one list to another

I'm using multiprocessing, and in a separate process (let's call it thread for clarity), I'm using PySerial to pull data from a device. In my program, I have a list that's shared with the main thread. I create the list using:
import multiprocessing as mp
#... stuff
self.mpManager = mp.Manager()
self.shared_return_list = self.mpManager.list()
This list is filled inside the process and then transferred to the local thread using:
if len(self.acqBuffer) < 50000:
    try:
        self.shared_result_lock.acquire()
        self.acqBuffer.extend(self.shared_return_list)
        del self.shared_return_list[:]
    finally:
        self.shared_result_lock.release()
where acqBuffer is the local list that takes the data for analysis and storage.
The problem is that if the device has lots of data queued, the transfer becomes really, really slow because so much data has accumulated, and the GUI freezes. Possible solutions are either to transfer the data in chunks and actively keep reviving the GUI, or to find a smarter way to transfer the data, which is what I'm asking about.
In C++, I would use some derivative of std::deque or std::list (which is not necessarily contiguous in memory) with a move constructor and use std::move to just push the pointer to the data, instead of re-copying the whole data to the main thread. Is such a thing possible in my case in Python? Are there smarter ways to do this?
Thank you.
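One way to illustrate the chunked-transfer idea mentioned above: have the acquisition process push whole batches onto a multiprocessing.Queue, and have the GUI drain only a bounded number of batches per timer tick so the event loop never stalls on one huge copy. A sketch with hypothetical names (read_batch, acq_buffer), not a drop-in solution:
import multiprocessing as mp
import queue

data_queue = mp.Queue()  # created in the parent, shared with the acquisition process

def acquisition_worker():
    # runs in the acquisition process: send data in chunks, not item by item
    while True:
        batch = read_batch()          # hypothetical: read up to N samples from the device
        data_queue.put(batch)

def drain_queue(acq_buffer, max_batches=100):
    # runs in the GUI process on a timer: take a bounded amount per tick
    for _ in range(max_batches):
        try:
            acq_buffer.extend(data_queue.get_nowait())
        except queue.Empty:
            break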

Python: Writing multiprocess shared list in background

I believe I am about to ask a definite newbie question, but here goes:
I've written a Python script that does SNMP queries. The SNMP query function uses a global list for its output.
def get_snmp(community_string, mac_ip):
    global output_list
    # ... perform the SNMP GET here ...
    output_list.append(output_string)
The get_snmp queriers are launched using the following code:
pool.starmap_async(get_snmp, zip(itertools.repeat(COMMUNITY_STRING), input_list))
pool.close()
pool.join()

if output_file_name is not None:
    csv_writer(output_list, output_file_name)
This setup works fine: all of the get_snmp processes write their output to the shared list output_list, then the csv_writer function is called and that list is dumped to disk.
The main issue with this program is on a large run the memory usage can become quite high as the list is being built. I would like to write the results to the text file in the background to keep memory usage down, and I'm not sure how to do it. I went with the global list to eliminate file locking issues.
I think that your main problem with increasing memory usage is that you don't remove contents from that list when writing them to file.
Maybe you should do del output_list[:] after writing it to file.
Have each of the workers write their output to a Queue, then have another worker (or the main thread) read from the Queue and write to a file. That way you don't have to store everything in memory.
Don't write directly to the file from the workers; otherwise you can have issues with multiple processes trying to write to the same file at the same time, which will just give you a headache until you fix it anyway.
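A sketch of that layout, assuming the result queue is shared with the workers and run_snmp_get is a hypothetical stand-in for the actual SNMP call; the workers only put results on the queue, and a single writer drains it to disk so nothing piles up in memory:
import csv

def get_snmp(community_string, mac_ip, results):
    # worker: put the result on a queue instead of appending to a global list
    output_string = run_snmp_get(community_string, mac_ip)  # hypothetical SNMP call
    results.put(output_string)

def write_results(results, output_file_name):
    # main process: write rows to disk as they arrive
    with open(output_file_name, "a", newline="") as f:
        out = csv.writer(f)
        while True:
            row = results.get()
            if row is None:          # sentinel: the workers are done
                break
            out.writerow([row])
            f.flush()
If the workers come from a Pool, as in the question, the queue would need to be a multiprocessing.Manager().Queue() so it can be passed to them as an argument.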

concurrent sqlite writes in python

I've got a Python application (Gtk) which uses threads to fetch information from certain sites and writes it to the database.
I've got a thread that checks for new updates at site1; if there are updates, I receive a JSON object (json1).
I will then iterate through json1 and insert the new information into the database. Within json1 there is a result I need to use to fetch more information from site2, and I will receive a JSON object (json2) from site2 as well.
So the situation is something like this
def get_more_info(name):
    json2 = get(www.site2.com?=name....)
    etc

for information in json1:
    db.insert(information)
    get_more_info(information.name)
From this situation I see that there are a couple of ways of doing this.
Have get_more_info return the JSON object, so that:
for information in json1:
    db.insert(information)
    json2 = get_more_info(information.name)
    for info in json2:
        db.insert(info)
db.commit()
Or have get_more_info do the inserting:
for information in json1:
    db.insert(information)
    get_more_info(information.name)
db.commit()
Both of these ways seem a bit slow, since the main for loop has to wait for get_more_info to complete before carrying on, and both json1 and json2 could be large. There is also the possibility that site2 is unavailable at that moment, causing the whole transaction to fail. The application can still function without json2; that data can be fetched at a later time if needed.
So I was thinking of passing information.name to a queue so that the main loop can continue, and kicking off a thread that will monitor that queue and execute get_more_info. Is this the right approach to take?
I know that SQLite does not perform concurrent writes. If I recall correctly, if get_more_info tries to write while the main for loop is busy, SQLite will raise OperationalError: database is locked.
Now what happens to get_more_info at that point? Does it get put into some type of write queue, or does it wait for the main loop to complete? And what happens to the main for loop when get_more_info is busy writing?
Will there be a need to go to another database engine?
Since you are already using threads, you can use another thread to write to the database. To feed it with data, use a globally accessible Queue.Queue() (queue.Queue() in Python 3) instance. Calling the instance's get() method with block=True will make the thread wait for data to write.
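A sketch of that writer thread, assuming Python 3 and a hypothetical table layout; the fetching threads only put rows on the queue, and this writer is the only code that touches SQLite:
import queue
import sqlite3
import threading

write_queue = queue.Queue()

def db_writer(db_path):
    # the single writer thread, so SQLite never sees concurrent writes
    conn = sqlite3.connect(db_path)
    while True:
        item = write_queue.get()   # block=True is the default: waits for data
        if item is None:           # sentinel used to shut the writer down
            break
        conn.execute("INSERT INTO info VALUES (?, ?)", item)  # hypothetical table/shape
        conn.commit()
    conn.close()

threading.Thread(target=db_writer, args=("app.db",), daemon=True).start()
The other threads then call write_queue.put(...) instead of writing to the database themselves.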

How to use simple sqlalchemy calls while using thread/multiprocessing

Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops, reading a document from the queue, extracting its information using processdocument, and writing the information to a new entry in a table in the database.
The worker loop breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is that I'm not sure what to put into the worker function so that each process can write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The MetaData instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just not bother with multiprocessing and do this in a single process, but I will have to write code that has several processes reading from the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
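A sketch of what that could look like inside each worker, assuming the table definitions (documents_table here is a stand-in name) live at module level and that processdocument returns an id plus a dict of column values:
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

def worker(doc_queue, db_url):
    # each subprocess builds its own Engine (and therefore its own DBAPI connections)
    engine = create_engine(db_url, poolclass=NullPool)
    while True:
        doc = doc_queue.get()
        if doc is None:                            # sentinel from the main routine
            break
        doc_id, fields = processdocument(doc)      # assumed return shape
        with engine.begin() as conn:               # one transaction per document
            conn.execute(documents_table.insert().values(id=doc_id, **fields))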

Python: Lock a file

I have a Python app running on Linux. It is called every minute from cron. It checks a directory for files and if it finds one it processes it, which can take several minutes. I don't want the next cron job to pick up the file currently being processed, so I lock it using the code below, which calls portalocker. The problem is it doesn't seem to work: the next cron job manages to get a file handle returned for the file already being processed.
def open_and_lock(full_filename):
    file_handle = open(full_filename, 'r')
    try:
        portalocker.lock(file_handle, portalocker.LOCK_EX
                                      | portalocker.LOCK_NB)
        return file_handle
    except IOError:
        sys.exit(-1)
Any ideas what I can do to lock the file so no other process can get it?
UPDATE
Thanks to @Winston Ewert I checked through the code and found the file handle was being closed way before the processing had finished. It seems to be working now, except the second process blocks on portalocker.lock rather than throwing an exception.
After fumbling with many schemes, this works in my case. I have a script that may be executed multiple times simultaneously. I need these instances to wait their turn to read/write to some files. The lockfile does not need to be deleted, so you avoid blocking all access if one script fails before deleting it.
import fcntl

def acquireLock():
    ''' acquire exclusive lock file access '''
    locked_file_descriptor = open('lockfile.LOCK', 'w+')
    fcntl.lockf(locked_file_descriptor, fcntl.LOCK_EX)
    return locked_file_descriptor

def releaseLock(locked_file_descriptor):
    ''' release exclusive lock file access '''
    locked_file_descriptor.close()

lock_fd = acquireLock()
# ... do stuff with exclusive access to your file(s)
releaseLock(lock_fd)
You're using the LOCK_NB flag, which means that the call is non-blocking and will just return immediately on failure. That is presumably what happens in the second process. The reason it is still able to read the file is that portalocker ultimately uses flock(2) locks, and, as mentioned in the flock(2) man page:
flock(2) places advisory locks only; given suitable permissions on a file, a process is free to ignore the use of flock(2) and perform I/O on the file.
To fix it you could use the fcntl.flock function directly (portalocker is just a thin wrapper around it on Linux) and handle the error raised when the lock cannot be acquired.
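A sketch of that, using fcntl directly; with LOCK_NB, a failed flock raises an OSError whose errno is EWOULDBLOCK:
import errno
import fcntl
import sys

def open_and_lock(full_filename):
    file_handle = open(full_filename, 'r')
    try:
        fcntl.flock(file_handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno == errno.EWOULDBLOCK:  # another process holds the lock
            sys.exit(-1)
        raise
    return file_handle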
Don't use cron for this. Linux has inotify, which can notify applications when a filesystem event occurs. There is a Python binding for inotify called pyinotify.
Thus, you don't need to lock the file -- you just need to react to IN_CLOSE_WRITE events (i.e. when a file opened for writing was closed). (You also won't need to spawn a new process every minute.)
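A sketch of what that could look like with pyinotify (the watched path and process_file are placeholders):
import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # event.pathname is the file that was just closed after being written
        process_file(event.pathname)  # hypothetical: the existing processing code

wm = pyinotify.WatchManager()
wm.add_watch('/path/to/watched/dir', pyinotify.IN_CLOSE_WRITE)
notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()  # replaces the cron-driven polling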
An alternative to using pyinotify is incron which allows you to write an incrontab (very much in the same style as a crontab), to interact with the inotify system.
What about manually creating an old-fashioned .lock file next to the file you want to lock?
Just check whether it's there; if not, create it; if it is, exit prematurely. After finishing, delete it.
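A sketch of that approach; os.O_CREAT | os.O_EXCL makes the check-and-create step atomic, so two processes can't both believe they created the lock file (process_file is a placeholder for the existing processing code):
import os
import sys

lock_path = full_filename + '.lock'
try:
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except FileExistsError:
    sys.exit(0)  # another run is already processing this file

try:
    process_file(full_filename)  # hypothetical
finally:
    os.close(fd)
    os.remove(lock_path)  # release the lock when done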
I think fcntl.lockf is what you are looking for.
