Scraping a lot of pages with multiple machines (with different IPs) - python

I have to scrape information from several web pages, and I use BeautifulSoup + requests + threading. I create many workers; each one grabs a URL from the queue, downloads the page, scrapes data from the HTML, and puts the result into a results list. My code is too long to paste here in full, but it boils down to the pattern sketched below.
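A simplified sketch of that pattern (the URLs and the parsing step are placeholders, not my real code):

    import threading
    import queue
    import requests
    from bs4 import BeautifulSoup

    url_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            # My real extraction is more involved; the title is a stand-in.
            with results_lock:
                results.append((url, soup.title.string if soup.title else None))

    for url in ["https://example.com/page1", "https://example.com/page2"]:
        url_queue.put(url)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)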
But I ran into the following problem: the site probably limits the number of requests from one IP per minute, so scraping is not as fast as it could be. I have a server with a different IP, though, so I thought I could make use of it.
I thought of writing a script for the server that would listen on some port (with sockets), accept URLs, process them, and send the results back to my main machine.
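Roughly like this, on the server side (the port number and the one-URL-per-connection protocol are placeholders I made up):

    import socket
    import requests

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 9999))
    srv.listen(5)

    while True:
        conn, addr = srv.accept()
        with conn:
            # One URL per line; the fetched HTML is sent back and the
            # connection is closed.
            url = conn.makefile("r").readline().strip()
            html = requests.get(url, timeout=10).text
            conn.sendall(html.encode("utf-8"))

And on the main machine, each worker would do roughly:

    import socket

    cli = socket.create_connection(("my.server.example", 9999))  # placeholder host
    cli.sendall(b"https://example.com/page1\n")
    html = b"".join(iter(lambda: cli.recv(4096), b"")).decode("utf-8")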
But I'm not sure there isn't already a ready-made solution; the problem seems common to me. If there is one, what should I use?

Most web servers use rate limiting to save resources and to protect themselves from DoS attacks; it's a common security measure.
Looking at your problem, these are the things you could do:
Put some sleep between requests (it will bring down the requests-per-second count, and the server may not treat your code as a robot); see the sketch after this list.
If you are using an internet connection on your home computer and it does not have a static IP address, you can try rebooting your router every time your requests get denied, using the router's simple telnet interface.
If you are using a cloud server/VPS, you can buy multiple IP addresses and rotate your requests through the different network interfaces; this also helps lower the requests-per-second count per IP.
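For the first suggestion, a rough sketch (the delay value is arbitrary; tune it against the server's limit):

    import random
    import time
    import requests

    def polite_get(url, delay=1.0):
        # Wait before every request; a little jitter makes the traffic
        # look less mechanical than a fixed interval.
        time.sleep(delay + random.uniform(0.0, 0.5))
        return requests.get(url, timeout=10)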
You will need to find the real cause of the denials from the server you are pulling web pages from; the topic is too general for a definitive answer. Here are some things you can do to find out what is causing your requests to be denied, so you can choose one of the aforementioned methods to fix the problem.
Decrease the requests-per-second count and see how the web server responds.
Set the HTTP request headers to simulate a web browser (sketched after this list) and see whether the server still blocks you.
The bandwidth of your internet connection or the network connection limit of your machine could also be the problem; use netstat to monitor the number of active connections before and after your requests start being blocked.
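For the second diagnostic step, a sketch of sending browser-like headers with requests (the User-Agent string is just an example):

    import requests

    session = requests.Session()
    session.headers.update({
        # Any current desktop-browser User-Agent string will do here.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    })
    resp = session.get("https://example.com/page", timeout=10)
    print(resp.status_code)  # compare with the behaviour under default headers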

Related

Is it possible to have multiple global IPs running on the same server?

I'm currently working on Python code that manages multiple user accounts and accesses the provider's API: it instantiates a websocket and makes action requests. The API has a limit on requests per account and per IP. The per-account quota is never reached, but the per-IP quota is reached easily, so I'm blocked from making requests from time to time. The question is whether I can have more than one IP per server, so I don't run into this quota problem.
I already tried spinning up different VPSs, but some providers use the same IP for their cheap VPS plans. (The client manager is very light in processing; if I have to spin up a lot of VPSs just for different IPs, I end up using only 2-5% of each VPS.)
I'm new to networking in general; sorry if this has already been answered somewhere else (I didn't find it anywhere) or is too much of a newbie question.

How many requests can I make to google.com from the same IP without getting blocked or something like that?

I am writing a program in Python and want to check the internet connection in a loop. I do this with the requests module and it all works fine, but my question is: how many requests are allowed per day or per hour? At the moment I check the connection every 2 seconds, so every 2 seconds google gets a request from my IP. That makes more than 40,000 requests a day if my software runs for 24 hours.
Is this a problem? I can't use proxies, because I will not have access to or control over the computers or their settings once my software is finally run by the customer.
There are rate limits on all of Google's public and internal APIs.
However, the documentation does not clearly spell out the exact rate limit on google.com.
If you just want to check connectivity, it is better to use a public DNS server such as 1.1.1.1, as sketched below.
If you want programmatic access to Google search, you should use the Custom Search API instead: https://developers.google.com/custom-search/v1/overview
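A minimal sketch of such a connectivity check (1.1.1.1:53 is Cloudflare's public DNS resolver; a plain TCP connect is enough to prove the network is up):

    import socket

    def internet_up(host="1.1.1.1", port=53, timeout=3):
        # A TCP connect to a public DNS server is cheap, and there is
        # no meaningful rate limit for a probe every couple of seconds.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False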

Python Multiprocessing Rate Limit

Suppose you want to call some API (open, no key needed) which has a rate limit of 10 requests per second at most. Suppose now that I am calling a function (with multiprocessing.Pool) that requests data from the API and processes it. Is there some way to switch IPs so I don't get blocked? Maybe a list of IPs/proxies? Can somebody tell me a way to get this done?
Thanks!
You unfortunately can't get different IPs just by using multiprocessing, because every process on your machine shares the same outgoing IP, so the API sees them all as one client. Note also that many APIs return HTTP 429 (Too Many Requests), which means it is possible to get timed out for sending too many requests.
There certainly are a lot of hacks you can use to get around rate limiting, but you should take a moment and ask yourself 'should I?' You'd likely be violating the service's terms of service, and in some jurisdictions you could be opening yourself up to legal action if you are caught. Additionally, many services implement smart bot detection that can identify bot behaviour from request patterns and block multiple IPs at once. In extreme cases, I've seen teams block the IP range of an entire country to punish botting.
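If the API's terms allow your use at all, the cleaner fix is to respect the limit rather than evade it. A sketch of capping a multiprocessing.Pool to a global requests-per-second budget (the URL is hypothetical):

    import multiprocessing as mp
    import time
    import requests

    API_URL = "https://api.example.com/data"  # hypothetical endpoint
    RATE = 10  # max requests per second allowed by the API

    def init_worker(lock, next_slot):
        # Make the shared throttle state available in each worker process.
        global _lock, _next_slot
        _lock, _next_slot = lock, next_slot

    def throttled_get(item_id):
        # Reserve the next free time slot (slots are 1/RATE seconds apart),
        # so all workers together never exceed RATE requests per second.
        with _lock:
            slot = max(_next_slot.value, time.monotonic())
            _next_slot.value = slot + 1.0 / RATE
        time.sleep(max(0.0, slot - time.monotonic()))
        resp = requests.get(API_URL, params={"id": item_id}, timeout=10)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        lock = mp.Lock()
        next_slot = mp.Value("d", 0.0)
        with mp.Pool(4, initializer=init_worker, initargs=(lock, next_slot)) as pool:
            print(pool.map(throttled_get, range(20)))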

Block incoming traffic to webserver python

I am working on a basic IPS written in Python, which should protect a web server. This is a school project, so it's mostly a "proof of concept" kind of thing. The idea is that the IPS shall block, for a number of minutes, any IP address that sends requests that do not fall within the model of normal behaviour.
My initial thought was to use scapy for this, but I've come to realize that while it is possible to read the incoming data with scapy, it will probably not be possible to block the traffic from reaching the web server. One idea is to use iptables to block the traffic, but that solution seems a bit clumsy. We have also had a look at mitmproxy, but it seems that has to run on a separate computer, so that won't be an option.
My question is whether there is an easier way to do this than adding and removing iptables rules every 15 minutes.
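For what it's worth, the iptables approach needs less bookkeeping than it sounds if Python schedules the rule removal itself. A sketch (assumes Linux, root privileges, and the 15-minute window from the question):

    import subprocess
    import threading

    BAN_SECONDS = 15 * 60

    def ban_ip(ip):
        # Insert a DROP rule at the top of the INPUT chain...
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
        # ...and schedule its removal once the ban window expires.
        threading.Timer(BAN_SECONDS, unban_ip, args=(ip,)).start()

    def unban_ip(ip):
        subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)

This is essentially what fail2ban does, so that tool may also be worth a look.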

Bottle server not responding while calculating

I have a bottle server running on port 8080, using the "gevent" server. I use this server to support some simple server-sent events.
My question is probably related to not knowing exactly how my set up is working. I hope someone can take the time to elaborate on this.
All routes and file serving from the server work great, but I have an issue when accessing one specific route, "/get_data". This route gathers data from the web as well as from some internal data sources; the gathering takes about 30 minutes. While this process is running, I am not able to access any routes on the server, e.g. "/" or "/login". Once the process is finished, everything works again and the database is updated with the gathered information.
I tried replacing the gathering algorithms with a simple time.sleep(60), and while the timer was active, I was still able to access other routes just fine.
This leads to my two questions:
Why am I not able to access the server while this process is running? Is the port blocked (by reading web information), or does it have something to do with threading?
What would be the best way to run a demanding, long process on my server? Preferably I would like to trigger it from my web app, but I have thought about just putting it in a separate Python file and running it locally on the server, in a separate instance of Python. This process runs at most once per day, maybe as seldom as once per week.
This happens because WSGI handles each request/response synchronously: while your handler runs blocking code, the worker cannot serve other requests. (time.sleep most likely didn't block because gevent monkey-patches it to yield to other greenlets, while your gathering code never yields.)
You can use gunicorn to run your application so multiple requests can be handled in parallel workers, or you can use the other methods described on the Bottle website:
Primer to Asynchronous Applications
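As a sketch of the second option, you can also keep the server responsive by pushing the long job into a separate process and returning immediately (gather_data is a placeholder for the ~30-minute gathering job):

    import multiprocessing
    from bottle import Bottle, run

    app = Bottle()

    def gather_data():
        ...  # placeholder: gather from the web, then update the database

    @app.route("/get_data")
    def get_data():
        # A separate process cannot block the server, no matter how
        # CPU-bound or blocking the gathering code is.
        multiprocessing.Process(target=gather_data, daemon=True).start()
        return "Gathering started."

    if __name__ == "__main__":
        run(app, host="localhost", port=8080, server="gevent")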
