I am writing a crawler in Python. So that Ctrl+C does not cause my crawler to start over on the next run, I need to save the processing deque in a text file (one item per line) and update it on every iteration; the update operation needs to be very fast. In order not to reinvent the wheel, I am asking whether there is an established module for this.
As an alternative, you could set up an exit function, and pickle the deque on exit.
See the documentation on exit functions (atexit) and on pickle.
You should be able to use pickle to serialize your lists.
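For example, a minimal sketch of that approach, assuming the crawler keeps its frontier in a collections.deque and that queue.pkl is an acceptable snapshot file name (both assumptions, not from your code):

import atexit
import pickle
from collections import deque

QUEUE_FILE = 'queue.pkl'   # hypothetical snapshot file name

# Restore the frontier from the previous run, if a snapshot exists.
try:
    with open(QUEUE_FILE, 'rb') as f:
        frontier = deque(pickle.load(f))
except FileNotFoundError:
    frontier = deque()

def save_frontier():
    with open(QUEUE_FILE, 'wb') as f:
        pickle.dump(list(frontier), f)

# Runs on normal interpreter shutdown, which includes an unhandled
# KeyboardInterrupt (Ctrl+C), but not a hard kill or os._exit().
atexit.register(save_frontier)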
I am not sure if I understood the question right; I am just curious, so here are a few questions and suggestions:
Are you planning to catch the Ctrl+C interrupt and save the deque then?
What happens if the crawler crashes for some arbitrary reason, like an unhandled exception? You lose the queue status and start over again?
From the atexit documentation:
Note: The exit function is not called when the program is killed by a signal, when a Python fatal internal error is detected, or when os._exit() is called.
What happens when you visit the same URI again? Are you maintaining a visited list or something?
I think you should be maintaining some kind of visit and session information/status for each URI you crawl.
You can use the visit information to decide whether to crawl a URI when you come across it again.
The other piece, the session information from the last visit to that URI, will help you pick up only the incremental changes; if the page has not changed there is no need to fetch it again, which saves database I/O, avoids duplicates, etc.
That way you won't have to worry about Ctrl+C or a crash. If the crawler goes down for any reason, say after crawling 60K posts with 40K still left, then the next time the crawler fills the queue, the queue may be huge, but the crawler can check whether it has already visited each URI and what the state of the page was when it was last crawled, and, as an optimization, skip pages that have not changed.
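If it helps, here is a very rough sketch of that kind of bookkeeping, assuming an in-memory dict keyed by URI and a content hash to detect changes (all names are made up for illustration; in practice this would live in a database):

import hashlib
import time

# Hypothetical in-memory bookkeeping; in practice keep this in a database.
visit_info = {}   # uri -> {'last_visit': timestamp, 'content_hash': digest}

def needs_pickup(uri, page_bytes):
    """Return True if the page is new or has changed since the last visit."""
    digest = hashlib.sha256(page_bytes).hexdigest()
    info = visit_info.get(uri)
    if info is not None and info['content_hash'] == digest:
        return False   # unchanged: skip the db write, avoid duplicates
    visit_info[uri] = {'last_visit': time.time(), 'content_hash': digest}
    return True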
I hope that is of some help.
Some things that come to my mind:
leave the file handle open (don't close the file every time you write something)
or write the file every n items and catch a close signal to flush the items that have not been written yet (see the sketch below)
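A rough sketch of the second idea, assuming the queue holds URL strings and that queue.txt and BATCH_SIZE are hypothetical names chosen for illustration:

import signal
import sys

BATCH_SIZE = 100                 # hypothetical flush interval
pending = []                     # items not yet written to disk
out = open('queue.txt', 'a', encoding='utf-8')   # keep the handle open

def flush_pending():
    for item in pending:
        out.write(item + '\n')
    pending.clear()
    out.flush()

def on_shutdown(signum, frame):
    # Flush whatever has not been written yet, then exit cleanly.
    flush_pending()
    out.close()
    sys.exit(0)

signal.signal(signal.SIGINT, on_shutdown)    # Ctrl+C
signal.signal(signal.SIGTERM, on_shutdown)   # plain kill

def record(item):
    pending.append(item)
    if len(pending) >= BATCH_SIZE:
        flush_pending()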
I need to wait until a file copy/upload finishes completely, using Python (preferred approach); bash/shell is also fine (I will call it from Python).
I have a shared NFS directory /data/files_in/. If somebody copies/uploads a file to /data/files_in/, I should notify another application, but only after the file copy/upload is completely done.
My current code to check whether the file has been completely copied:
import time
from pathlib import Path

while True:
    current_size = Path(file_path).stat().st_size
    time.sleep(5)
    result_size = Path(file_path).stat().st_size
    if result_size == current_size:
        break

# Notify your application
It works only with small files; for large files, like 100 GB files, it does not work properly.
I have increased the timer, but it still sometimes fails, and a timer-based approach does not seem like a good idea to rely on.
Is there any other way, I can implement code to fix this issue?
OS: Linux (CentOS)
Python Version: 3.9
I can't comment so I will ask here. Shouldn't the resulting size be larger (or at least different) from the current one in order for a file to be done uploading and therefore stop the loop?
I assume you cannot establish any kind of direct communications with the other process, i.e. the one which is copying/uploading the file.
One common approach in these cases is to have the other process to write/erase a "semaphore" file. It may be that it creates the semaphore just before beginning copying and erases it just after finishing, so the semaphore means "don't do anything, I'm still running", or the other way round, it creates the semaphore just after finishing and erases it just before starting next time, so the semaphore means "your data are ready to use".
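If you can get the uploading process to cooperate, the check on your side becomes trivial. A small sketch, assuming (purely for illustration) that the uploader creates <filename>.done next to the file once the copy has finished:

import time
from pathlib import Path

def wait_for_upload(file_path, poll_interval=5):
    """Block until the uploader signals completion via a marker file."""
    done_marker = Path(str(file_path) + '.done')   # hypothetical convention
    while not done_marker.exists():
        time.sleep(poll_interval)
    # Safe to notify the other application now.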
That said, I'm amazed your approach doesn't work if you allow enough time; 5 seconds should be more than enough on any network.
I've got a simple pyramid app up and running, most of the views are a fairly thin wrapper around an sqlite database, with forms thrown in to edit/add some information.
A couple of times a month a new chunk of data will need to be added to this system (by csv import). The data is saved in an SQL table (the whole process right till commit takes about 4 seconds).
Every time a new chunk of data is uploaded, this triggers a recalculation of other tables in the database. The recalculation process takes a fairly long time (about 21-50 seconds for a month's worth of data).
Currently I just let the browser/client sit there waiting for the process to finish, but I do foresee the calculation process taking more and more time as the system gets more usage. From a UI perspective, this obviously looks like a hung process.
What can I do to indicate to the user:
that the long wait is normal/expected?
how much longer they should have to wait (a progress bar, etc.)?
Note: I'm not asking about long-polling or websockets here, as this isn't really an interactive application and based on my basic knowledge websockets/async are overkill for my purposes.
I guess a follow-on question at this point, am I doing the wrong thing running processes in my view functions? Hardly seem to see that being done in examples/tutorials around the web. Am I supposed to be using celery or similar in this situation?
You're right, doing long calculations in a view function is generally frowned upon. I mean, if it's a typical website with random visitors who are able to hang a webserver thread for a minute, then it's a recipe for a DoS vulnerability. But in some situations (internal website, few users, only the admin has access to the "upload csv" form) you may get away with it. In fact, I used to have maintenance scripts which ran for hours :)
The trick here is to avoid browser timeouts. At the moment your client sends the data to the server and just sits there waiting for any reply, without any idea whether its request is being processed or not. Generally, at about 60 seconds the browser (or a proxy, or the frontend webserver) may become impatient and close the connection. Your server process will then get an error trying to write anything to the already-closed connection and crash/raise an error.
To prevent this from happening the server needs to write something to the connection periodically, so the client sees that the server is alive and won't close the connection.
"Normal" Pyramid templates are buffered - i.e. the output is not sent to the client until the whole template to generated. Because of that you need to directly use response.app_iter / response.body_file and output some data there periodically.
As an example, you can duplicate the Todo List Application in One File example from Pyramid Cookbook and replace the new_view function with the following code (which itself has been borrowed from this question):
from pyramid.response import Response
from pyramid.view import view_config
import time


@view_config(route_name='new', request_method='GET', renderer='new.mako')
def new_view(request):
    return {}


@view_config(route_name='new', request_method='POST')
def iter_test(request):
    if request.POST.get('name'):
        request.db.execute(
            'insert into tasks (name, closed) values (?, ?)',
            [request.POST['name'], 0])
        request.db.commit()

    def test_iter():
        # Stream a short progress message every second so the client
        # sees that the server is still alive.
        i = 0
        while True:
            i += 1
            if i == 5:
                yield b'<p>Done! Click here to see the results</p>'
                return  # ends the generator (and the response)
            yield ('<p>working %s...</p>' % i).encode('utf-8')
            print(time.time())
            time.sleep(1)

    return Response(app_iter=test_iter())
(Of course, this solution is not too fancy UI-wise, but you said you didn't want to mess with websockets and celery.)
So is the long running process triggered by browser action? I.e., the user is uploading the CSV that gets processed and then the view is doing the processing right there? For short-ish running browser processes I've used a loading indicator via jQuery or javascript, basically popping a modal animated spinner or something while a process runs, then when it completes hiding the spinner.
But if you're getting into longer and longer processes, I think you should really look at some sort of background processing that offloads the work from the UI. It doesn't have to be a message-based worker; even something as simple as: the end user uploads the file and a "to be processed" entry gets set in a database. Then you could have a Pyramid script scheduled to run periodically in the background, polling the status table and processing anything it finds. You can move the file processing that is currently in the view into a separate method, and that can be called from the command-line script. When the processing is finished, the script updates the status table to indicate it is done, and that feedback can be presented back to the user somewhere, without blocking their UI the whole time.
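As a very rough sketch of that pattern, assuming an uploads table with a status column and a process_csv() helper factored out of the view (the table, columns, module and function names are all made up for illustration):

# Hypothetical background worker, run periodically (e.g. from cron or a
# console script). The table, columns and process_csv() are illustrative.
import sqlite3

from myapp.processing import process_csv   # the logic moved out of the view

def run_pending_jobs(db_path='app.db'):
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "SELECT id, file_path FROM uploads WHERE status = 'pending'").fetchall()
    for job_id, file_path in rows:
        db.execute("UPDATE uploads SET status = 'processing' WHERE id = ?",
                   (job_id,))
        db.commit()
        process_csv(file_path)              # the slow recalculation
        db.execute("UPDATE uploads SET status = 'done' WHERE id = ?",
                   (job_id,))
        db.commit()                         # the UI can poll this column

if __name__ == '__main__':
    run_pending_jobs()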
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App engine in python, that lets a user upload a file, which is then being processed by a piece of python code, and then resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for making the single deferred task look like this:
_=deferred.defer(transform_function,filename,from,to,email)
so that the transform_function code gets the values of filename, from, to and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the documentation on Google App Engine that I can think of, but it is unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a URL for my task, just a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred.defer to queue the next task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, or continues where it left off, depending on whether it's a discrete set of steps or one continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter whether it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job; then you can just keep re-running the same task until completion. On completion it marks the job as complete.
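A rough sketch of that pattern, assuming a datastore entity that records the job's progress; transform_chunk() and send_result_email() stand in for your own code, and all the names are illustrative:

import time
from google.appengine.ext import deferred, ndb

class Job(ndb.Model):
    # Tracks where the work got to, so a re-run can continue from there.
    offset = ndb.IntegerProperty(default=0)
    done = ndb.BooleanProperty(default=False)

def process_job(job_key, filename, email, budget=540):
    start = time.time()
    job = job_key.get()
    while not job.done:
        # transform_chunk() stands in for your own code; assume it returns
        # the next offset, or None when there is nothing left to do.
        next_offset = transform_chunk(filename, job.offset)
        job.offset = next_offset or job.offset
        job.done = next_offset is None
        job.put()  # checkpoint: a re-run picks up from here (idempotent)
        if not job.done and time.time() - start > budget:
            # Near the 10-minute limit: chain the next task and stop.
            deferred.defer(process_job, job_key, filename, email)
            return
    send_result_email(email, filename)  # stands in for your own code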
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrement a shared datastore counter (read the state, decrement, and update, all in the same transaction) that was initialized with the number of chunks. If the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces together. An alternative to using deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
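The counter idea could look roughly like this (an illustrative sketch only; transform_chunk() and combine_pieces() stand in for your own code):

from google.appengine.ext import deferred, ndb

class ChunkCounter(ndb.Model):
    remaining = ndb.IntegerProperty()

@ndb.transactional
def one_chunk_done(counter_key):
    # Read, decrement and update in a single transaction.
    counter = counter_key.get()
    counter.remaining -= 1
    counter.put()
    return counter.remaining == 0

def process_chunk(counter_key, chunk_id, email):
    transform_chunk(chunk_id)                   # your per-chunk work
    if one_chunk_done(counter_key):
        deferred.defer(combine_pieces, email)   # the "piece it together" step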
We have a network client based on asyncore, with the user's network connection embodied in a Dispatcher. The goal is for a user working from an interactive terminal to be able to enter network request commands which would go out to a server and eventually come back with an answer. The client is written to be asynchronous so that the user can start several requests on different servers all at once, collecting the results as they become available.
How can we allow the user to type in commands while we're going around a select loop? If we hit the select() call registered only as readable, then we'll sit there until we read data or timeout. During this (possibly very long) time user input will be ignored. If we instead always register as writable, we get a hot loop.
One bad solution is as follows. We run the select loop in its own thread and have the user inject input into a thread safe write Queue by invoking a method we define on our Dispatcher. Something like
def enqueue(self, someData):   # a method on our Dispatcher subclass
    self.lock.acquire()
    self.queue.put(someData)
    self.lock.release()
We then only register as writable if the Queue is not empty:
def writable(self):
    return not self.queue.empty()
We would then specify a timeout for select() that's short on human scales but long for a computer. This way if we're in the select call registered only for reading when the user puts in new data, the loop will eventually run around again and find out that there's new data to write. This is a bad solution though because we might want to use this code for servers' client connections as well, in which case we don't want the dead time you get waiting for select() to time out. Again, I realize this is a bad solution.
It seems like the correct solution would be to bring the user input in through a file descriptor so that we can detect new input while sitting in the select call registered only as readable. Is there a way to do this?
NOTE: This is an attempt to simplify the question posted here
stdin is selectable. Put stdin into your dispatcher.
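A minimal sketch of that, assuming a POSIX platform (asyncore.file_dispatcher wraps a file descriptor such as stdin); the enqueue() method referenced here is the one from your question:

import asyncore
import sys

class StdinDispatcher(asyncore.file_dispatcher):
    """Feeds lines typed by the user into the same select loop."""

    def __init__(self, connection, map=None):
        asyncore.file_dispatcher.__init__(self, sys.stdin, map)
        self.connection = connection      # your existing network Dispatcher

    def writable(self):
        return False                      # we only ever read from stdin

    def handle_read(self):
        command = self.recv(4096).decode().strip()
        if command:
            self.connection.enqueue(command)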
Also, I recommend Twisted for future development of any event-driven software. It is much more featureful than asyncore, has better documentation, a bigger community, performs better, etc.
I am using a Python web server (CherryPy), but I guess the question is more open and fairly general. At the moment I have an Ajax call, through a jQuery load() on a button click, that triggers some computation ending in the generation of files.
At the moment, as soon as the processing starts in a background thread, my load() returns to the page the links to the files that will be generated on the server. There are several files to be generated, and the whole process can take minutes. How would one manage to display links to files only when they become available, progressively, file by file? At the moment the links are dead until the files exist behind them, and I have no way of telling the user when the links become live.
UPDATE: Thanks JB Nizet. Now could anyone advise on writing thread-safe data structures in Python? I don't know much about the subject and don't know where to get started.
Poll the server to get the latest generated files (or the complete list of generated files) every n seconds, and stop the polling once the list is complete, or once the first ajax query (the one which starts the generation process) has completed.
The thread which generates the file should make the list of generated files available in a shared, thread-safe, data-structure.
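In Python the simplest thread-safe structure for this is a plain list guarded by a threading.Lock (list.append is already atomic under the GIL, but the lock makes the intent explicit and lets you read a consistent snapshot). A rough CherryPy sketch, with all the names made up for illustration; build_file() stands in for your generation code:

import threading
import cherrypy

class FileGenerator(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.ready_files = []        # appended to by the worker thread
        self.finished = False

    def generate(self, filenames):
        # Runs in the background thread started by the button click.
        for name in filenames:
            build_file(name)         # stands in for your generation code
            with self.lock:
                self.ready_files.append(name)
        self.finished = True

class Status(object):
    def __init__(self, generator):
        self.generator = generator

    @cherrypy.expose
    @cherrypy.tools.json_out()
    def index(self):
        # Polled by the page every few seconds via jQuery.
        with self.generator.lock:
            return {'files': list(self.generator.ready_files),
                    'finished': self.generator.finished}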