How can I speed this up? (urllib2, requests) - python

Problem: I am trying to validate a captcha that can be anything from 0000-9999. Using the normal requests module it takes around 45 minutes to go through all of them (0000-9999). How can I multithread this or speed it up? It would be really helpful if I could get the HTTP status code from the site to see whether I got the code correct (200 = correct, 400 = incorrect). If I could get two examples (GET and POST) of this, that would be fantastic!
I have been searching for quite some time; most of the modules I look at are outdated (I have been using grequests recently).
example url = https://www.google.com/
example params = captcha=0001
example post data = {"captcha":0001}
Thank you!

You really shouldn't be trying to bypass a captcha programmatically!
You could use several threads to make simultaneous requests, but at that point the service you're attacking will most likely ban your IP. At the very least, they've probably got throttling on the service; there's a reason it's supposed to take 45 minutes.
Threading in Python is usually achieved by creating a thread object with a run() method containing your long-running code. In your case, you might want to create a thread object which takes a number range to poll. Once instantiated, you'd call its .start() method to have that thread begin working. If any thread gets a success response, it would report back to the main thread and halt itself, and the main thread could then tell all the other threads in the pool to stop.
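For illustration, here is a minimal sketch of that structure using the standard threading module and requests. The URL, the captcha parameter name, and the 200/400 status convention are taken from the question and are assumptions about the real service:

import threading
import requests

found = threading.Event()   # set as soon as any thread gets a 200
result = []

def check_range(start, stop):
    for n in range(start, stop):
        if found.is_set():   # another thread already succeeded, stop early
            return
        code = "%04d" % n
        # GET variant: https://www.google.com/?captcha=0001
        r = requests.get("https://www.google.com/", params={"captcha": code})
        # POST variant would be:
        # r = requests.post("https://www.google.com/", data={"captcha": code})
        if r.status_code == 200:
            result.append(code)
            found.set()
            return

# split 0000-9999 into ten ranges, one thread per range
threads = [threading.Thread(target=check_range, args=(i * 1000, (i + 1) * 1000))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result)

Note that this is exactly the pattern the warning above is about: ten threads hammering the same endpoint are far more likely to get throttled or banned.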

Related

gunicorn async worker class

I have a very basic question. It seems like I don't get it, or I simply need a confirmation.
Let's say I set up a flask app and let it run using gunicorn.
I use --workers=2 and --threads=2, so I can serve 4 requests in parallel.
Now let's say a client does 4 parallel requests, each of which does a requests.get in the Flask app that needs 5 seconds to get a response (in theory). A fifth client call will need to wait for one of the 4 others to be finished before it's even started in the backend (and will take another 5 seconds for execution).
Now my question: when I switch to --worker-class gevent, will it help to get more parallel requests without adapting the code? If I understand it correctly, I need to use async library calls properly to take advantage of gevent and to get a maximum of, for example, 1000 parallel request executions, right? Am I right in saying: if the code continues to simply do requests.get (or a sleep or whatever) without using async client libs, the fifth request will still be blocked?
Thank you!
(I've never worked with asyncio and coroutines, so I'm sorry)
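For reference, the two setups being compared can be expressed as a gunicorn config file; a minimal sketch (the app module name app:app used to start it is a placeholder):

# gunicorn.conf.py
# Threaded setup from the question: 2 workers x 2 threads = 4 parallel requests.
workers = 2
threads = 2

# Alternative asked about in the question: gevent worker class.
# workers = 2
# worker_class = "gevent"
# worker_connections = 1000   # cap on concurrent connections per worker

Either variant would then be started with gunicorn -c gunicorn.conf.py app:app.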

How to return multiple Responses from a single app route: Flask (Python)

I have hosted a Flask app on Heroku, written in Python. I have a function which is something like this:
@app.route("/execute")
def execute():
    doSomething()
    return Response()
Now, the problem is that doSomething() takes more than 30 seconds to execute, bypassing the 30-second-timeout duration of Heroku, and it kills the app.
I could make another thread and execute doSomething() inside it, but the Response object needs to return a file that will be made available only after doSomething() has finished execution.
I also tried working with generators and yield, but couldn't get them to work either. Something like:
@app.route("/execute")
def execute():
    def generate():
        yield ''
        doSomething()
        yield file
    return Response(generate())
but the app requires me to refresh the page in order to get the second yielded object.
What I basically need to do is return an empty Response object initially, start the execution of doSomething(), and then return another Response object. How do I accomplish this?
Usually with HTTP, one request means one response; that's it.
For your issue you might want to look into:
Streaming responses, which are used for large responses with many parts.
Sockets to allow multiple "responses" for a single "request".
Making multiple queries from your client; if you have control over your client code this is most likely the easiest solution.
I'd recommend reading this; it gets a bit technical but it helped me understand a lot of things.
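If the streaming option fits your case, a minimal sketch of it looks like this (doSomething() is the placeholder from the question; the mimetype is an arbitrary choice):

from flask import Flask, Response

app = Flask(__name__)

@app.route("/execute")
def execute():
    def generate():
        yield "started\n"        # sent to the client right away
        result = doSomething()   # the long-running work
        yield result             # final payload once the work is done
    return Response(generate(), mimetype="text/plain")

The catch is that the client has to read the response incrementally; a browser that waits for the whole body will look exactly like the refresh problem described in the question.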
What you are trying to make is an asynchronous job. For that I recommend you use Celery (here you have a good example: https://blog.miguelgrinberg.com/post/using-celery-with-flask/page/7) or some other tool for asynchronous jobs. In the front end you can do simple polling to wait for the response; I recommend using SocketIO (https://socket.io/). It's a simple and efficient solution.
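A minimal sketch of the Celery approach (the Redis broker URL, the route names and doSomething() are placeholders, not part of the original question):

from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)
celery = Celery(__name__,
                broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")

@celery.task
def do_something_task():
    return doSomething()   # the long-running work from the question

@app.route("/execute")
def execute():
    task = do_something_task.delay()   # returns immediately
    return jsonify({"task_id": task.id}), 202

@app.route("/status/<task_id>")
def status(task_id):
    result = do_something_task.AsyncResult(task_id)
    if result.ready():
        return jsonify({"state": result.state, "result": result.get()})
    return jsonify({"state": result.state}), 202

The front end then polls /status/<task_id> (or listens on a socket) until the state is SUCCESS.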
It's basically an asynchronous job. You can use Celery or asyncio for these operations. You can never ask a user to wait more than 3-10 seconds for any operation.
1) Make an AJAX Request
2) Initialize a socket that listens to your operation.
3) As soon as you finish the operation, the socket sends the message back, and you can show it to the user through a popup.
This is the best approach you can take.
If you could share what computation you are making, you could get more alternative approaches.

long running job in flask

I have created a module that does some heavy computations and returns some data to be stored in a NoSQL database. The computation process is started via a POST request in my Flask application. The Flask function will execute the computation code, and the returned results will then be stored in the db. I was thinking of Celery, but I am wondering (and haven't found any clear info on this) whether it would be possible to use Python threading, e.g.
from mysci_module import heavy_compute
import thread

@app.route('/initiate_task/', methods=['POST'])
def run_computation():
    thread.start_new_thread(heavy_compute, (post_data,))
    return response
It's very abstract, I know. The only problem I see in this method is that my function will have to know about and be responsible for storing data in the database, so it is not very independent of the database used. Correct? Why is Celery better (is it really?) than the method above?
Since CPython is restricted from true thread concurrency by the GIL, all computations will in fact happen serially. Instead, you could use the Python multiprocessing module and create a pool of processes to complete your heavy computation task.
There are a few microframeworks, such as Twisted Klein, apart from Celery that can also help achieve the concurrency and independence you're looking for. They aren't necessarily better, but they are available for those who don't want to get their hands messy with the various issues that tend to come up when synchronizing Flask with the actual business logic, especially when the response depends on that activity.
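A minimal sketch of the multiprocessing suggestion inside a Flask view (heavy_compute comes from the question; save_result_to_db, the post data handling and the response are placeholders):

from multiprocessing import Pool
from flask import Flask, jsonify, request
from mysci_module import heavy_compute

app = Flask(__name__)
pool = Pool(processes=4)   # created once, when the app starts

def save_result_to_db(result):
    pass   # hypothetical helper that writes the result to the NoSQL database

@app.route('/initiate_task/', methods=['POST'])
def run_computation():
    post_data = request.get_json()
    # apply_async returns immediately; the callback runs when the work is done
    pool.apply_async(heavy_compute, (post_data,), callback=save_result_to_db)
    return jsonify({"status": "started"})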
I would suggest the following method: start a thread for the long procedure first, then let Flask communicate with the procedure from time to time, according to your requirements:
from mysci_module import heavy_compute
import thread

thread.start_new_thread(heavy_compute, (post_data,))

@app.route('/initiate_task/', methods=['POST'])
def check_computation():
    response = heavy_compute.status
    return response
The best part of this method is that you always have a callable thread in the background, while it is still possible to get the necessary result and even pass parameters to the task.
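On Python 3 the same idea can be written with the threading module and a shared status holder; a rough sketch (the names are illustrative, and heavy_compute.status from the snippet above is replaced by an explicit dict):

import threading
from flask import Flask, jsonify, request
from mysci_module import heavy_compute

app = Flask(__name__)
status = {"state": "idle", "result": None}

def run_in_background(data):
    status["state"] = "running"
    status["result"] = heavy_compute(data)
    status["state"] = "done"

@app.route('/initiate_task/', methods=['POST'])
def initiate_task():
    threading.Thread(target=run_in_background,
                     args=(request.get_json(),), daemon=True).start()
    return jsonify(status)

@app.route('/task_status/')
def task_status():
    return jsonify(status)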

Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic lib to issue HTTP GET requests. My target is to receive data through socket connections - minimalistic design to improve performance - usage with threads, thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues: a number of sockets which stay connected (if possible, and they usually are) and issue HTTP GET requests. The idea came from urllib's low performance on continuous requests; then I met urllib3, then I realized it uses httplib, and then I decided to try sockets. So here's what I have accomplished so far:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
pool.wait_completion()
The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket instances.
What I wonder is, is there any other possible way that I can improve performance of this system?
I've read about asyncore here, but I couldn't figure out how to use same socket connection with class HTTPClient(asyncore.dispatcher) provided.
Another point: I don't know whether I'm using a blocking or a non-blocking socket, which would be better for performance, or how to implement either one.
Please be specific about your experiences; I don't intend to import another library just to do HTTP GET, so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URLs into a Queue.
Write a worker Process which gets a URL from the Queue and does a GET, saving a file and putting the file information into another Queue. You'll probably want multiple copies of this Process. You'll have to experiment to find how many is the correct number.
Write a worker Process which reads file information from a Queue and does whatever it is that you're trying do.
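A rough sketch of that three-stage pipeline, keeping to the Python 2-era stack the question uses (the URLs and the "file information" stand-ins are placeholders):

import multiprocessing
import urllib2   # the question targets Python 2; use urllib.request on Python 3

def fetch(url_queue, result_queue):
    while True:
        url = url_queue.get()
        if url is None:            # poison pill: no more work
            result_queue.put(None)
            return
        body = urllib2.urlopen(url).read()
        result_queue.put((url, len(body)))   # stand-in for saved-file info

def consume(result_queue, n_fetchers):
    finished = 0
    while finished < n_fetchers:
        item = result_queue.get()
        if item is None:
            finished += 1
        else:
            print(item)            # whatever processing you actually need

if __name__ == "__main__":
    urls = ["http://hostname1.com/page%d" % i for i in range(10)]
    url_q = multiprocessing.Queue()
    result_q = multiprocessing.Queue()
    fetchers = [multiprocessing.Process(target=fetch, args=(url_q, result_q))
                for _ in range(4)]   # experiment to find the right count
    for p in fetchers:
        p.start()
    for url in urls:               # the "producer" stage
        url_q.put(url)
    for _ in fetchers:
        url_q.put(None)
    consume(result_q, len(fetchers))
    for p in fetchers:
        p.join()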
I finally found a well-chosen path to solve my problems. I was using Python 3 for my project and my only option was to use pycurl, so this made me port my project back to the Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs)
- With the use of a ThreadPool class I am receiving responses as fast as my system can handle them (received data is processed later, so multiprocessing is not much of a possibility here)
I tried httplib2 first, but realized that it does not behave as solidly as it does on Python 2; by switching to pycurl I lost caching support.
Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at their disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometimes for fun: you will get lots of weird responses from them).
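For reference, a single GET with pycurl looks roughly like this (the URL handling and the specific options are illustrative, not the exact setup described above):

import pycurl
from io import BytesIO

def curl_get(url):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEFUNCTION, buf.write)   # collect the body in memory
    c.setopt(c.FOLLOWLOCATION, True)
    c.setopt(c.TIMEOUT, 10)
    c.perform()
    status = c.getinfo(c.RESPONSE_CODE)
    c.close()
    return status, buf.getvalue()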
Thanks for the replies, folks.

Python Tornado - making POST return immediately while async function keeps working

so I have a handler below:
class PublishHandler(BaseHandler):
    def post(self):
        message = self.get_argument("message")
        some_function(message)
        self.write("success")
The problem that I'm facing is that some_function() takes some time to execute and I would like the post request to return straight away when called and for some_function() to be executed in another thread/process if possible.
I'm using berkeley db as the database and what I'm trying to do is relatively simple.
I have a database of users each with a filter. If the filter matches the message, the server will send the message to the user. Currently I'm testing with thousands of users and hence upon each publication of a message via a post request it's iterating through thousands of users to find a match. This is my naive implementation of doing things and hence my question. How do I do this better?
You might be able to accomplish this by using your IOLoop's add_callback method like so:
loop.add_callback(lambda: some_function(message))
Tornado will execute the callback in the next IOLoop pass, which may (I'd have to dig into Tornado's guts to know for sure, or alternatively test it) allow the request to complete before that code gets executed.
The drawback is that that long-running code you've written will still take time to execute, and this may end up blocking another request. That's not ideal if you have a lot of these requests coming in at once.
The more foolproof solution is to run it in a separate thread or process. The best way with Python is to use a process, due to the GIL (I'd highly recommend reading up on that if you're not familiar with it). However, on a single-processor machine the threaded implementation will work just as well, and may be simpler to implement.
If you're going the threaded route, you can build a nice "async executor" module with a mutex, a thread, and a queue. Check out the multiprocessing module if you want to go the route of using a separate process.
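A minimal sketch of the threaded variant using concurrent.futures (BaseHandler and some_function are from the question; the pool size is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

class PublishHandler(BaseHandler):
    def post(self):
        message = self.get_argument("message")
        # hand the slow matching work to a worker thread and return right away
        executor.submit(some_function, message)
        self.write("success")

For CPU-bound matching, swapping in concurrent.futures.ProcessPoolExecutor sidesteps the GIL issue mentioned above.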
I've tried this, and I believe the request does not complete before the callbacks are called.
I think a dirty hack would be to call two levels of add_callback, e.g.:
def get(self):
    ...
    def _defered():
        ioloop.add_callback(<whatever you want>)
    ioloop.add_callback(_defered)
    ...
...
But these are hacks at best. I'm looking for a better solution right now, probably will end up with some message queue or simple thread solution.
