How to deal with dotDefender web scraping blocks in Python

I'm scraping websites for a research project, and for the first time I've encountered the message "dotDefender blocked your request."
I'm not doing anything malicious; just scraping basic information. Is it possible to let them know this and/or overcome the block?
Here's the site.

Some sites will block scraping even if it is not malicious. You can try to run the scraping through a proxy, but depending on how quickly you are scraping and the quality of the proxy, you may still eventually get blocked. If you are collecting a small amount of data the proxy should work, but for larger volumes you may want to consider a premium service, such as an IP rotation service (think premium proxy).
You could also try Tor, but you may still run into speed issues.
As for proxies, there are plenty of free and paid options, but their quality is hard to measure.
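For reference, here is a minimal sketch of routing requests through a single proxy with the requests library; the proxy address below is a placeholder for whatever free, paid, or rotating endpoint you actually use.

import requests

# Hypothetical proxy endpoint; replace with one from your provider.
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

resp = requests.get("https://example.com", proxies=proxies, timeout=30)
print(resp.status_code, len(resp.text))

With an IP rotation service you would typically swap the proxy value between requests, or point at the provider's single gateway endpoint and let it rotate for you.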

Related

Scraping Amazon with Scrapy Framework on Large Scale

We are currently trying to scrape article ratings on the Amazon.co.uk website using the Scrapy framework. We use LTE dongles as proxies, and we change the IP and the user agent as soon as bans/captchas appear.
On a small scale it works without problems; we rarely get banned or hit captchas, and the IP refresh has the intended effect.
When we start to scrape larger numbers of ASINs, we get bans/captchas again immediately after rotating the IP.
What could be the reasons for this?
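For context, the user-agent half of a setup like this is usually wired up as a Scrapy downloader middleware. A minimal, illustrative sketch follows; the class name, setting value, and user-agent strings are placeholders, not the asker's actual code.

import random

# Placeholder user-agent strings; real rotation lists are usually much longer.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

class RandomUserAgentMiddleware:
    # Scrapy calls process_request for every outgoing request.
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request

# settings.py (illustrative registration):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }

Even with rotation like this in place, large sites can fingerprint clients by far more than the IP and User-Agent header, which is consistent with the immediate re-bans described above.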

How to hide my IP doing web scraping in python?

I am doing web scraping with Python on some pages and have been blocked from some of them. When I tried to access them through the Tor Browser, I found I could not reach the pages either, so I think these pages have either managed to track my real IP or my Tor setup is misconfigured (which I doubt, because I checked my IP address in Chrome and in Tor and they are different). Does anyone know why?
I am also trying to write a function in my Python code to change my IP automatically. From what I have seen, the best approach is to route the requests through Tor, but I have not been able to make it work. Do you have any recommendation for creating this function?
Thank you!
I would expect anti-scrape protection to also block visits from known Tor exit nodes; I don't think they know it is you. Some websites hire or implement state-of-the-art scrape-protection services.
You could set up your own proxies at friends' and family's homes and use a very conservative crawl rate, or look for commercial residential proxy offerings.
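If you do go the Tor route (keeping in mind the exit-node caveat above), the usual pattern is to send traffic through Tor's local SOCKS proxy and ask the control port for a new circuit between batches of requests. A sketch, assuming Tor is running locally with its ControlPort enabled and that requests[socks] and stem are installed; ports and the control password are placeholders.

import time
import requests
from stem import Signal
from stem.control import Controller

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # Tor's default SOCKS port
    "https": "socks5h://127.0.0.1:9050",
}

def new_tor_identity(password="control_password"):
    # Ask the Tor control port (default 9051) for a fresh circuit / exit IP.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
        time.sleep(controller.get_newnym_wait())  # honour Tor's rate limit

resp = requests.get("https://httpbin.org/ip", proxies=TOR_PROXIES, timeout=30)
print(resp.json())   # shows the exit node's IP, not yours
new_tor_identity()   # later requests should go out via a different exit node

Note that a new circuit does not guarantee a new exit IP, and sites that block Tor will block the new exit node just as readily.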

Web scraping and proxy type

Framework: Scrapy.
I am currently using a web-scraper but I am getting disconnected from the server.
The scraper will (eventually) scrape between 100k and 150k pages, each containing 11 fields of data to be extracted.
My idea is that the scraper will be used once per month.
Estimated size of the database upon completion is between 200 MB and 300 MB (not accounting for bandwidth).
I do not know if I need a paid proxy for this or if I can use free proxies.
Any advice (or a proxy provider suited to my needs) will be gratefully received.
There are several techniques to avoid being disconnected from the server you are scraping.
These are some of the common ones:
You can set user agents; there are libraries for this, and tutorials on how to rotate them.
You can go to your settings.py and uncomment DOWNLOAD_DELAY = x.
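As an illustration of both points, here is a handful of "polite" settings in settings.py; the values shown are examples, not recommendations for any particular site.

# settings.py -- example values only
DOWNLOAD_DELAY = 2        # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 4   # well below Scrapy's default of 16
USER_AGENT = "my-research-bot/0.1 (+mailto:contact@example.org)"  # placeholder
ROBOTSTXT_OBEY = True     # respect the site's robots.txt where possible

RANDOMIZE_DOWNLOAD_DELAY is on by default, so the actual delay will vary around DOWNLOAD_DELAY rather than being perfectly regular.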
Without a proxy you're very likely to have your IP address blocked, and even with proxies you may run into CAPTCHAs that prevent you from scraping pages.
For scraping 100K-150K pages per month, as you indicated, I would highly recommend not using free proxies. The problem with free proxies is that they're incredibly unreliable: you never know who else is using them, what they're being used for, or when they'll stop working, which means any or all of your proxies could be shut down or blocked at any moment and your scraper would simply stop working.
Paid proxies are certainly the way to go although they can get quite expensive and some of the proxy providers have been known to use shady techniques to obtain IP addresses.
However, https://htmlapi.io (my service) can bridge this gap for you, and it's free to get started (you don't even need to create an account). HtmlAPI returns the raw HTML contents of the page you're scraping, and it automatically handles rotating proxy servers, defeating CAPTCHAs, rendering JavaScript, and more.
All you have to do is call the API and extract the data you need from the webpage.
Try it from your command line:
curl "https://htmlapi.io/example.com"

404: Is there any way to avoid being blocked by website while scraping using scrapy

I was trying to use Scrapy to scrape a website with about 70k items, but every time after it has scraped about 200 items, this error pops up for the rest:
[scrapy] DEBUG: Ignoring response <404 http://www.somewebsite.com/1234>: HTTP status code is not handled or not allowed
I believe it is because my spider got blocked by the website. I tried using a random user agent as suggested here, but it doesn't solve the problem at all. Are there any good suggestions?
If you're being blocked, your spider is probably hitting the site too often or too fast.
In addition to a random user agent, you can try setting the CONCURRENT_REQUESTS and DOWNLOAD_DELAY options in settings.py. The defaults are fairly aggressive and will hammer a site.
The other options you have are using proxies, or using AWS nano instances, which get a new IP on each reboot.
Remember that scraping is at best a gray area, and you absolutely need to respect the site owners. The best approach is obviously to seek permission from the owner, but failing that you need to make sure your scraping doesn't stand out from normal browsing patterns or you'll get blocked in no time.
Some sites use fairly sophisticated techniques to identify scrapers, including cookies and JavaScript as well as request patterns, time on site, and so on. There are also a number of cloud-based anti-scraping solutions such as Distil or ShieldSquare; if you're up against one of those, you'll need to put in a lot of effort to make your spider look human.
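As a rough sketch of how the proxy and rate-limit suggestions above translate into Scrapy code, and of how to stop the 404 responses from being silently ignored, here is an illustrative spider; the spider name, proxy address, and settings values are placeholders, and the URL is the one from the question.

import scrapy

class PoliteSpider(scrapy.Spider):
    name = "polite"
    handle_httpstatus_list = [404]   # deliver 404s to parse() instead of dropping them
    custom_settings = {
        "DOWNLOAD_DELAY": 2,         # example value; slow down to avoid bans
        "CONCURRENT_REQUESTS": 4,
    }

    def start_requests(self):
        yield scrapy.Request(
            "http://www.somewebsite.com/1234",
            meta={"proxy": "http://203.0.113.10:8080"},  # hypothetical proxy;
            # Scrapy's built-in HttpProxyMiddleware reads meta["proxy"]
        )

    def parse(self, response):
        if response.status == 404:
            self.logger.info("404 for %s -- possibly blocked or a dead page", response.url)
            return
        # normal item extraction would go here

Whether the 404s are genuine missing pages or a soft block, seeing them in parse() makes it easier to tell the difference.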
Can you force someone to answer your questions or give you information? Neither can you force a web server. At best you can try to impersonate a client that the web server will answer to. To do that you need to figure out the criteria the server uses to decide whether or not to answer the request, then you can (try to) form a request that will meet the criteria.

Web Scraper: Limit to Requests Per Minute/Hour on Single Domain?

I'm working with a librarian to re-structure his organization's digital photography archive.
I've built a Python robot with Mechanize and BeautifulSoup to pull about 7000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted into a spreadsheet he can use to correct it. Right now I'm guesstimating 7500 HTTP requests total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, and then many more as the project progresses.
I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there's not, I'll give my robot delays so it behaves politely toward the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is roughly how quickly I can make HTTP requests before encountering a built-in rate limit.
I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.
Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.
They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.
There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for Crawl-delay in robots.txt.
The HTTP/1.1 spec does suggest a limit of two concurrent connections per server, but browsers have already started ignoring it, and efforts are underway to revise that part of the standard as it is quite outdated.
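If the site does publish a Crawl-delay hint, Python's standard library can read it. A sketch of checking for it and pacing requests accordingly, with a conservative fallback when no hint exists; the domain and URLs are placeholders for the archive in question.

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://archive.example.org/robots.txt")   # hypothetical domain
rp.read()

# crawl_delay() returns None when robots.txt gives no hint.
delay = rp.crawl_delay("*") or 5   # fall back to a polite 5 seconds

for url in ("https://archive.example.org/photo/1",
            "https://archive.example.org/photo/2"):
    # fetch and parse the page here (Mechanize/BeautifulSoup in the setup above)
    time.sleep(delay)   # stay well clear of any server-side limit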
