I am developing a web crawling project using Python and the Scrapy framework. It crawls approximately 10k web pages from e-commerce shopping websites. The whole project is working fine, but before moving the code from the testing server to the production server I want to choose a better proxy IP provider service, so that I don't have to worry about my IP being blocked or my spiders being denied access to websites.
Until now I have been using a middleware in Scrapy to manually rotate IPs from the free proxy lists available on various websites, roughly like this:
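(A simplified sketch; the proxy addresses and middleware name are placeholders rather than the real code.)

# middlewares.py - naive rotation over a hand-maintained list of free proxies
import random

FREE_PROXIES = [
    "http://1.2.3.4:8080",   # placeholder entries copied from free proxy list sites
    "http://5.6.7.8:3128",
]

class RandomProxyMiddleware:
    # Downloader middleware that assigns a random proxy to every outgoing request.
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(FREE_PROXIES)

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 750,
}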
Now I am confused about which option I should choose:
Buy a premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/
Use Tor
Use a VPN service like http://www.hotspotshield.com/
Any option better than the above three
Here are the options I'm currently using (depending on my needs):
proxymesh.com - reasonable prices for smaller projects. I've never had any issues with the service, as it works out of the box with Scrapy (I'm not affiliated with them).
a self-built script that starts several EC2 micro instances on Amazon. I then SSH into the machines and create SOCKS proxy connections; those connections are then piped through DeleGate to create normal HTTP proxies that are usable with Scrapy. The HTTP proxies can either be load-balanced with something like HAProxy, or you can build a custom middleware that rotates proxies.
The latter solution is what currently works best for me and pushes around 20-30 GB of traffic per day without any problems; a rough sketch of the tunnel part follows.
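Just as an illustration (host names and key path are placeholders, and the SOCKS-to-HTTP conversion with DeleGate is left out), the tunnels can be opened from Python like this:

# open_tunnels.py - one SOCKS tunnel per EC2 instance via OpenSSH
import subprocess

INSTANCES = ["ec2-host-1.example.com", "ec2-host-2.example.com"]  # placeholder hosts

tunnels = []
for i, host in enumerate(INSTANCES):
    local_port = 1080 + i
    # -N: no remote command, -D: dynamic (SOCKS) forwarding on the given local port
    tunnels.append(subprocess.Popen(
        ["ssh", "-N", "-D", str(local_port), "-i", "mykey.pem", "ubuntu@" + host]
    ))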
Crawlera is built specifically for web crawling projects. For example, it implements smart algorithms to avoid getting banned, and it is used to crawl very large and high-profile websites.
Disclaimer: I work for the parent company Scrapinghub, which is also the core developer of Scrapy.
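Wiring it into a Scrapy project is typically just a few settings via the scrapy-crawlera middleware (the API key below is a placeholder; check the current documentation for the exact setting names):

# settings.py - route all requests through Crawlera
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "your-api-key"  # placeholder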
If you don't want to use a paid service, please consider just using a Scrapy library that automates proxy rotation for you: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
You can find a full tutorial on how to automate it here: https://tinyendian.com/articles/how-to-scrape-the-web-and-not-get-caught
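Roughly, the setup from that library's README boils down to a few settings (the proxy addresses below are placeholders):

# settings.py - rotation and ban detection handled by scrapy-rotating-proxies
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",  # placeholder proxies
    "proxy2.example.com:8031",
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}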
Keep in mind that connecting through a proxy always imposes a performance penalty, but the 10K web pages you mentioned are still well within reach.
I am doing web scraping with Python on some pages and I have been blocked from some of them. When I tried to check them through the Tor Browser as well, I saw that I cannot access the pages either, so I think these pages have been able to track my IP, or I have not configured Tor correctly (which I don't think is the case, because I checked my IP address in Chrome and in Tor and they are different). Does anyone know why?
Also, I am trying to write a function or method in my Python code to change my IP automatically. From what I have seen, the best approach is to do it through the Tor Browser (using it as the engine to fetch data from pages), but I am not able to make it work. Do you have any recommendation for creating this function?
Thank you!
I would expect anti-scraping protection to also block visits from known Tor exit nodes. I don't think they know it is you; some websites hire or implement state-of-the-art scraping protection services.
You could set up your own proxies at friends' and family's places and use a very conservative crawl rate, or perhaps look for commercial residential proxy offerings.
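If your scraper happens to be Scrapy-based, a very conservative crawl rate can be expressed with a few settings like these (the values are only an illustration):

# settings.py - crawl slowly and politely to stay under ban thresholds
DOWNLOAD_DELAY = 10                 # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x)
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit a domain in parallel
AUTOTHROTTLE_ENABLED = True         # back off automatically when the site slows down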
Framework: Scrapy.
I am currently using a web scraper, but I keep getting disconnected from the server.
The scraper will (eventually) scrape between 100k and 150k pages, with each page containing 11 fields of data to be scraped.
My idea is that the scraper will be used once per month.
The estimated size of the database upon completion is between 200 MB and 300 MB (not accounting for bandwidth).
I do not know if I need a paid proxy for this or if I can use free proxies.
Any advice (or a proxy provider suited to my needs) will be greatly appreciated.
There are several techniques to avoid being disconnected from the server you are scraping. These are some of the common ones:
You can rotate user agents; there are libraries for this and tutorials on how to use them (see the sketch below).
You can go to your settings.py and uncomment DOWNLOAD_DELAY = x.
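A small sketch combining both ideas (the user-agent strings and the module/middleware names are placeholders):

# middlewares.py - pick a random User-Agent for every request
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",        # placeholder UA strings
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

# settings.py
DOWNLOAD_DELAY = 2  # seconds between requests
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}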
Without a proxy you're very likely to have your IP address blocked, and even with proxies you may run into a CAPTCHA that prevents you from scraping pages.
For scraping 100K-150K pages per month, as you indicated, I would highly recommend not using free proxies. The problem with free proxies is that they're incredibly unreliable: you never know who's using them, what they're being used for, or when they'll stop working, which could lead to any or all of your proxies being shut down or blocked, leaving your scraper broken at any given moment.
Paid proxies are certainly the way to go although they can get quite expensive and some of the proxy providers have been known to use shady techniques to obtain IP addresses.
However, https://htmlapi.io (my service) can bridge this gap for you, and it's free to get started with (you don't even need to create an account). HtmlAPI returns the raw HTML contents of the page you're scraping. It handles rotating proxy servers for you automatically, defeats CAPTCHAs, renders JavaScript, and more.
All you have to do is call the API and extract the data you need from the webpage.
Try it from your command line:
curl "https://htmlapi.io/example.com"
So I am using Scrapy to crawl some websites, and I want to increase my privacy on the internet and also avoid getting banned, so I read that I could achieve that by using premium proxy lists like http://www.ninjasproxy.com/ or http://hidemyass.com/, or a VPN, or Tor.
From what I understood, a paid VPN would be a good option, like the one http://hidemyass.com/ offers, but I can't seem to find any code that actually shows Scrapy integrating with a VPN like HideMyAss.
I only saw an example like https://github.com/aivarsk/scrapy-proxies that shows how to use proxy lists.
How do I make Scrapy work with a VPN? If I can't, are proxy lists good enough to maintain anonymity?
A VPN works system-wide; it is not something that proxies only selected traffic. All of your internet traffic (browser, torrents, chat, and so on) will be routed through the VPN, so just connect to the VPN and run the script.
I am trying to make a "proxy" in Python that allows the user to route all of their web traffic through a host machine (this is mainly for me and a couple of people I know, to confuse/avoid hackers and/or spies, who would only see web pages and so on coming in through one IP). I have run into several difficulties.

The first is that I would like to be able to use the final, compiled product with Firefox, which can be set to route all of its traffic through an installed proxy program. I don't know what kind of configuration my proxy needs to support this.

Second, the way the proxy works is by using urllib.request.urlretrieve (yes, going to die soon, but I like it) to download a webpage onto the host computer (it's inefficient and slow, but it's only going to be used for a maximum of 7-10 clients) and then sending the file to the client. However, this results in things like missing pictures or broken submission forms. What should I be using to get the webpages right? (I want things like SSL and video streaming to work, as well as pictures and whatnot.)
(wince) this sounds like "security through obscurity", and a lot of unnecessary work.
Just set up an SSH tunnel and proxy your web traffic through that.
See http://www.linuxjournal.com/content/use-ssh-create-http-proxy
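As a rough illustration (user, host and port are placeholders), you can even start the tunnel from Python and point Firefox's SOCKS proxy settings at it:

# tunnel.py - dynamic (SOCKS) port forwarding over SSH
import subprocess

# Forward local port 1080 as a SOCKS proxy through your remote host,
# then configure Firefox to use a SOCKS5 proxy at localhost:1080.
subprocess.run(["ssh", "-N", "-D", "1080", "user@your-server.example.com"])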
I'm looking for a bit of web development advice. I'm fairly new to the area but I'm sure there are some gurus out there willing to part with some wisdom.
Objective: I'm interested in controlling a Python application on my computer from my personal web hosted site. I know, this question has been asked several times before but in each case the requirements were a bit different from my own. To reduce the length of this post I'll summarize my objective in a few bullet points:
Personal site is hosted by a web hosting company
The site uses HTML, PHP, MySQL, Python and JavaScript; the majority of everything is coded by me from the ground up
An application that is coded in Python will run on a PC within my home and will communicate with an Arduino board
The app will receive commands from the internet to control actuation via the Arduino, and will transmit sensor data back to the site (such as temperature)
Looking for the communication to be bi-directional, fast and secure
Securing the connection between site and Python app would be most ideal
I'm not looking to connect to the Python application directly; the web server must serve as the 'middle man'
So far I've considered HTTP POST with HTML forms, using sockets (the Python app would run as a web server), an IRC bot, and reading/writing to a text file stored on the web server.
I was also hoping to have a way to communicate with the Python app without needing to refresh the webpage, perhaps using AJAX or JavaScript? Maybe with Flash?
Is there something I'm not considering? I feel like I'm missing something. Thanks in advance for the advice!
Just thinking out loud about how I would start with this. First, regarding the website itself, you can use whatever is easiest for you or for the environment you're in. For example, a basic PHP page will do just fine, but if you can get a site running in Python as well, I'd prefer using the same language throughout.
That said, I'm not sure why you would need a hosted website. Given that you're already forced to have an externally accessible PC at home for the communication, why not run a web server on that directly (Apache, Nginx, or even something like CherryPy should do)? That web server can then communicate with the Python process that controls your Arduino (using, for example, Python's xmlrpclib). If you ran things via the hosting company, you would still need some process that can handle external requests securely, which is something a web server is quite good at. Running it yourself gives you all the freedom you want and simplifies things by reducing the number of components in your solution.
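As a rough sketch of that xmlrpclib idea (the method names are made up, and the Python 3 module names xmlrpc.server / xmlrpc.client are used here):

# On the home PC: expose the Arduino-controlling process over XML-RPC.
from xmlrpc.server import SimpleXMLRPCServer

def set_actuator(state):            # hypothetical command handler
    # ... talk to the Arduino over serial here ...
    return True

def read_temperature():             # hypothetical sensor read
    return 21.5

server = SimpleXMLRPCServer(("0.0.0.0", 8000))
server.register_function(set_actuator)
server.register_function(read_temperature)
server.serve_forever()

# In the web server's request handler (client side):
# import xmlrpc.client
# arduino = xmlrpc.client.ServerProxy("http://home-pc:8000")
# arduino.set_actuator(True)
# print(arduino.read_temperature())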
I'd keep the updates on your site quite basic: the commands you want to run can be handled in the web server's request handlers by just calling the relevant (xmlrpclib) methods. Dynamically updating the page is best done with some AJAX calls, I reckon. Based on your description, these updates are easily put in a JSON object, suitable for periodically updating only the relevant parts of your page.