I am doing web scraping with python in some pages and I have been blocked from some of them. When I have tried to check it also through the TOR Browser I have seen that I cannot access to the pages neither, so I think that these pages have been able to track all my IP or I dont have well configurated TOR (and I think not cause I have checked my IP address with Chrome and TOR and are different), so, any one knows why?
Also, I am trying to do a function or method in my python code to change mi IP automatically. What I have seen is that the best is to do it through the TOR browser (using it as the search engine to get data from pages) but I am not able to make it work. Do you have any recommendation to create this function?
Thank you!
I would expect anti scrape protection to also block visits from known Tor exit nodes. I dont think they know it is you. Some websites hire/implement state of the art scrape protection services.
You could setup your own proxies at friends and family and use a very conservative crawl rate or maybe search for commercial residential proxy offerings.
Related
I am a beginner programmer. How would I approach this problem? I want to provide Python with certain webpages and certain actions to take on said webpages. The problem is, the webpages are region restricted, so I have to use a VPN constantly. Would there be any way to have Python automatically connect to a vpn service (Mullvad, NordVPN etc) to a specific country while running the code? Thanks.
Excluding VPNs you could use proxies. But if you need to use a VPN I suggest looking at the Google results for your specific provider, like this one for Nord.
Framework: Scrapy.
I am currently using a web-scraper but I am getting disconnected from the server.
The scraper will (eventually) scrape between 100k and 150k pages with each page containing 11 fields that contain data that will be scraped.
My idea is that the scraper will be used once per month.
Estimated size of database upon completion is between 200mb and 300mb (not accounting for bandwidth).
I do not know if I need a paid proxy for this or if I can use free proxies.
Any advice (or proxy provider for my needs) will be greatly received.
there are several techniques to avoid being disconnected to the server you are scraping
this are some of the common techniques
you can add user agents here is a library and in this page there are tutorials on how to use user agents
you can go to your settings.py and uncomment DOWNLOAD_DELAY = x
Without a proxy you're very likely to have your IP address blocked and then even with proxies you may run into a CAPTCHA that prevents you from scraping pages.
For scraping 100K - 150K pages per month, as you indicated, I would highly recommend not using free proxies. The problem with free proxies is that they're incredibly unreliable - you never know who's using them, what they're being used for, when they'll no longer work, etc... Which could lead to any/all of your proxies being shut down or blocked and therefore your scraper will no longer work at any given moment.
Paid proxies are certainly the way to go although they can get quite expensive and some of the proxy providers have been known to use shady techniques to obtain IP addresses.
However https://htmlapi.io (my service) can bridge this gap for you and it's free to get started with (you don't even need to create an account). HtmlAPI returns the raw HTML contents of the page you're scraping. It handles rotating proxy servers for you automatically, defeating CAPTCHA's, rendering JavaScript, and more...
All you have to do is call the API and extract the data you need from the webpage.
Try it from your command line:
curl "https://htmlapi.io/example.com"
I have an unlimited internet connection in my house and a limited internet connection in the school.
I want to make a web browser (or something like that) that navigate from my house, get the data (including the streamings), and resends it to my browser in the school.
In Python, using WebKit, a web browser can be created easily and navigate youtube and other pages, I want to recreate that navigation in the other web browser (the one connected in my school).
School browser ⟶ send request to program or another Web browser ⟶ get page data (including streaming) ⟶ tunneling ⟶ sent to school browser.
It’s something like to do a remote web browser.
It sounds like you are trying to make a home private proxy server.
There are plenty of guides on how to do this but here's one I found by quickly looking around:
https://null-byte.wonderhowto.com/how-to/sploit-make-proxy-server-python-0161232/
Depending on your school's restriction method, a proxy server may not be enough to bypass their restrictions. You may also be able to overcome this by completely encrypting communications between your home network and your school system. To do this you would need to set up a home virtual private network (VPN). There are also many guides that you can use to achieve this.
I'm scraping websites for a research project, and for the first time I've encountered'
dotdefender blocked your request."
I'm not doing anything malicious; just scraping basic information. Is it possible to let them know this and/or overcome the block?
Here's the site.
Some sites will block scraping even if it is not malicious. You can try to run the scraping through a proxy but depending on how quickly you are scraping and the quality of the proxy you may still eventually get blocked. If you are doing a low amount of data collection the proxy should work, but if you are doing a larger quantity you may want to consider a premium service, rather an IP rotation service(think premium proxy).
Also you could try TOR but you may still run into speed issues.
For Proxies there are plenty of free and paid options but the quality is hard to measure.
I am developing a web crawling project using Python and Scrapy framework. It crawls approax 10k web pages from e-commerce shopping websites. whole project is working fine but before moving the code from testing server into production server i want choose a better proxy ip provider service, so that i dont have to worry about my IP Blocking or Denied access of websites to my spiders .
Until now i am using middleware in Scrapy to manually rotate ip from free proxy ip list available of various websites like this
Now i am confused about the options i should chosse
Buy premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/
Use TOR
Use VPN Service like http://www.hotspotshield.com/
Any Option better than above three
Here are the options I'm currently using (depending on my needs):
proxymesh.com - reasonable prices for smaller projects. Never had any issues with the service as it works out of the box with scrapy (I'm not affiliated with them)
a self-build script that starts several EC2 micro instances on Amazon. I then SSH into the machines and create a SOCKS proxy connection, those connections are then piped through delegated to create normal http proxies which are usable with scrapy. The http proxies can either be loadbalanced with something like haproxy or you build yourself a custom middleware that rotates proxies
The latter solution is what currently works best for me and pushes around 20-30GB per day of traffic without any problems.
Crawlera is built specifically for web crawling projects. For example, it implements smart algorithms to avoid getting banned and it is used to crawl very large and high profile websites.
Disclaimer: I work for the mother company Scrapinghub, who also are core developers of Scrapy.
If you don't want to use a paid service please consider just using a scrapy library that will automate rotating proxies for you: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
You can have a look for a full tutorial on how to automate it here: https://tinyendian.com/articles/how-to-scrape-the-web-and-not-get-caught
Keep in mind, that when connecting through a proxy always imposes a performance penalty, but 10K web pages that you mentioned is still well within your reach.