I want to write a script with Python 3.7, but first I have to scrape the data.
I have no problems connecting to and getting data from unbanned sites, but if the site is banned it won't work.
If I use a VPN service I can reach these "banned" sites with the Chrome browser.
I tried setting a proxy in PyCharm, but I failed; I just got errors all the time.
What's the simplest free way to solve this problem?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'}) # that web site is blocked in my country
webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site.
page_soup = soup(webpage, "html.parser")
There are multiple ways to scrape blocked sites. A solid way is to use a proxy service as already mentioned.
A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet.
When you are using a proxy, your requests are forwarded through it, so your IP is not directly exposed to the site you are scraping.
You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do
import requests
proxies = { 'http': "http://xxx.xx.xx.xxx:yy",
'https': "https://xxx.xx.xx.xxx:yy"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
and expect to get a response.
The proxy should be configured to take your request and send you a response.
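A quick way to check that a proxy is actually forwarding your traffic is to request an IP-echo service through it and see which IP comes back. This is just a sketch: it assumes httpbin.org/ip (a service that simply echoes the caller's IP) is reachable, and the xxx.xx.xx.xxx:yy placeholder has to be replaced with a real proxy.
import requests

# placeholder proxy, replace with a real IP and port
proxies = {'http': 'http://xxx.xx.xx.xxx:yy',
           'https': 'http://xxx.xx.xx.xxx:yy'}

try:
    # httpbin.org/ip echoes back the IP address it sees the request coming from
    seen_ip = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10).json()['origin']
    print('Sites will see this request as coming from:', seen_ip)
except requests.exceptions.RequestException as err:
    print('The proxy did not respond:', err)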
So, where can you get a proxy?
a. You could buy proxies from many providers.
b. Use a list of free proxies from the internet.
You don't need to buy proxies unless you are doing some massive-scale scraping.
For now I will focus on free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find a list of sites offering free proxies. Go to any one of them and get an IP and its corresponding port.
import requests
#replace the ip and port below with the ip and port you got from any of the free sites
proxies = { 'http': "http://182.52.51.155:39236",
'https': "https://182.52.51.155:39236"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)
If possible, you should use a proxy with 'Elite' anonymity level (the anonymity level is listed on most of the sites providing free proxies). If interested, you could also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.
Note:
Most of these free proxies are not that reliable, so if you get an error with one IP and port combination, try a different one.
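Since any single free proxy may be dead at any moment, a small fallback loop is handy. A rough sketch (the proxy entries and the target URL are placeholders):
import requests

# placeholder proxies; fill this list from any free proxy site
proxy_candidates = ['http://182.52.51.155:39236', 'http://xxx.xx.xx.xxx:yy']

response = None
for candidate in proxy_candidates:
    proxies = {'http': candidate, 'https': candidate}
    try:
        response = requests.get('http://www.somebannedsite.com', proxies=proxies, timeout=10)
        break  # this proxy answered, stop trying the rest
    except requests.exceptions.RequestException:
        continue  # dead or overloaded proxy, move on to the next one

if response is not None:
    print(response.text)
else:
    print('None of the proxies responded.')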
Your best solution would be to use a proxy via the requests library, since it can flexibly handle requests through a proxy.
Here is a small example:
import requests
from bs4 import BeautifulSoup as soup
# use a working proxy here
# replace host with your proxy IP and port with the port number
proxies = { 'http': "http://host:port",
'https': "https://host:port"}
text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser") # use whatever parser you prefer, maybe lxml?
If you want to use SOCKS5, then you'd have to get the dependencies via pip install requests[socks] and then replace the proxies part with:
# user is your authentication username
# pass is your auth password
# host and port are the same as above
proxies = { 'http': 'socks5://user:pass@host:port',
'https': 'socks5://user:pass@host:port' }
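Putting that together, a minimal sketch (user, pass, host and port are placeholders you must fill in, and requests[socks] has to be installed first):
# pip install requests[socks]   (pulls in PySocks, which requests needs for socks5:// proxies)
import requests

proxies = {'http': 'socks5://user:pass@host:port',
           'https': 'socks5://user:pass@host:port'}

r = requests.get('http://www.somebannedsite.com', proxies=proxies,
                 headers={'User-Agent': 'Mozilla/5.0'})
print(r.status_code)
If you also want DNS lookups to happen on the proxy side, requests accepts socks5h:// as the scheme.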
If you don't have proxies at hand, you can fetch some from the free proxy sites mentioned above.
Related
In my project I'm scraping data from Amazon. I deployed it on PythonAnywhere (I'm using a paid account), but there is a problem: the code (I'm using BeautifulSoup4) doesn't get the HTML of the site when I run it on PythonAnywhere; it gets Amazon's "Something Went Wrong" page. On my local machine it works perfectly. I think it's about User-Agents. Locally I use my own User-Agent. When deploying, which User-Agent should I use? And how can I fix this?
Here is my code:
import requests
from bs4 import BeautifulSoup

URL = link  # some Amazon link
headers = {"User-Agent": "## my user agent"}  # placeholder for my real User-Agent string
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, 'html.parser')
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
Is there any way I can do it on PythonAnywhere?
Your code works perfectly on my home machine, so the issue could be:
PythonAnywhere machine's IP being blocked by Amazon (as others have mentioned)
Another issue with the machine's access to the internet (Try scraping another site to test this)
To solve the former, you'd probably want to try out a proxy connection to change the IP you access Amazon with (I suggest you check PythonAnywhere's and Amazon's Terms of Service to be aware of any risks). The usage would look something like this:
import requests
proxies = {
"http": "http://IP:Port",   # HTTP
"https": "https://IP:Port", # HTTPS
# for SOCKS5 use "socks5://user:pass@IP:Port" instead (a duplicate 'http' key here would overwrite the first)
}
URL = "https://api4.my-ip.io/ip" # Plaintext IPv4 to test
page = requests.get(URL, proxies=proxies)
print(page.text)
Finding proxies to use takes a couple Google searches, but the difficult part is swapping them out occasionally since they don't last forever.
Try Selenium WebDriver instead of BeautifulSoup4. I had this issue myself when deploying a web scraper to pythonanywhere.com.
pythonanywhere.com requires a Hacker plan (as a minimum) to run web scraping applications. I was told this by their support team: https://www.pythonanywhere.com/pricing/
I also used the following User-Agent and Chrome options:
from fake_useragent import UserAgent
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")               # run Chrome without a visible window
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

ua = UserAgent()
userAgent = ua.random                                   # random User-Agent string for each run
chrome_options.add_argument(f'user-agent={userAgent}')
As per: https://www.pythonanywhere.com/forums/topic/21948/
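For completeness, a rough sketch of plugging those options into an actual driver; it reuses the chrome_options object configured above, and the Amazon URL is just a placeholder:
from selenium import webdriver

# reuses the chrome_options configured above
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.amazon.com/')  # placeholder, replace with your product link
html = driver.page_source              # raw HTML you can feed to BeautifulSoup if needed
driver.quit()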
I want to make a Google News scraper with Python and BeautifulSoup, but I have read that there is a chance I can be banned.
I have also read that I can prevent this by using rotating proxies and rotating IP addresses.
The only thing I have managed to do is rotate the User-Agent.
Can you show me how to add a rotating proxy and rotating IP address?
I know it should be added in the requests.get() part, but I do not know how.
This is my code:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'
page = 0

for page in range(1, 5):
    page = page * 10
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    headline_text = soup.find_all('h3', class_="r dO0Ag")
    snippet_text = soup.find_all('div', class_='st')
    news_date = soup.find_all('div', class_='slp')
    print(len(news_date))
You can do searches with the proper API from Google:
https://developers.google.com/custom-search/v1/overview
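If you go the API route, the request itself is plain HTTP. A rough sketch, assuming you have created an API key and a Programmable Search Engine ID (the cx value); both are placeholders below:
import requests

API_KEY = 'YOUR_API_KEY'       # placeholder, created in the Google Cloud console
CX = 'YOUR_SEARCH_ENGINE_ID'   # placeholder, the cx of your Programmable Search Engine

params = {'key': API_KEY, 'cx': CX, 'q': 'usa', 'num': 10}
data = requests.get('https://www.googleapis.com/customsearch/v1', params=params).json()

for item in data.get('items', []):
    print(item['title'], item['link'])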
You can use https://gimmmeproxy.com for rotating proxies and its Python wrapper: https://github.com/DeyaaMuhammad/GimmeProxyApi.
import requests
# GimmeProxyAPI is provided by the wrapper linked above

proxy = GimmeProxyAPI(protocol="https")
proxies = {
'http': proxy,
'https': proxy
}
requests.get('https://example.org', proxies=proxies)
If you want to learn web scraping, it's best to choose some other website, like Reddit or an online magazine. Google News (and other Google services) is well protected against scraping, and they change their class names regularly enough to prevent you from doing it the easy way.
If your question is 'What do I do to not get banned?', then the answer is 'Don't violate the ToS', which means no scraping at all and using the proper search API.
There is a certain amount of "free" Google Search use allowed per IP address, so if you are only scraping a handful of searches, this should be no problem.
If your question is 'How to use a proxy with requests module?', then you should start looking here.
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
But this is only the Python side; you need to set up a web proxy (or, even better, a pool of proxies) yourself and then use an algorithm to choose a different proxy every N requests, for example.
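To make the "different proxy every N requests" idea concrete, here is a rough sketch; the proxy addresses are placeholders and would come from whatever pool you set up:
import itertools
import requests

# placeholder pool; fill it with proxies you actually control or trust
proxy_pool = itertools.cycle(['http://10.10.1.10:3128',
                              'http://10.10.1.11:3128',
                              'http://10.10.1.12:3128'])

N = 5                                  # switch to the next proxy every N requests
current_proxy = next(proxy_pool)
urls = ['https://example.org'] * 20    # your real list of URLs goes here

for i, url in enumerate(urls):
    if i > 0 and i % N == 0:
        current_proxy = next(proxy_pool)
    proxies = {'http': current_proxy, 'https': current_proxy}
    try:
        r = requests.get(url, proxies=proxies, timeout=10)
        print(r.status_code, 'via', current_proxy)
    except requests.exceptions.RequestException:
        current_proxy = next(proxy_pool)  # proxy died, rotate immediately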
One more simple trick is to use Google Colab in the Brave browser with Tor; you will see that you get different IP addresses.
So, once you get the data you want, you can use it in your Jupyter notebook, VS Code, or elsewhere.
See the results in the screenshots:
Using free proxies will usually get you an error, because there are too many requests hitting them; you have to keep picking a different one that happens to have lower traffic, which is a terrible task when choosing one out of hundreds.
Getting correct results with Brave's Tor window:
I want to be able to Scrape Twitter's Trending Topics.
Of course, the natural way to do that, is to use the Twitter API. However, most of the Trends do not come with a Tweet_count, which is key for me.
So I decided to scrape the Twitter website, and it has been a mess.
First, I just went after https://twitter.com/i/trends and it worked fine and still does, on my local computer. Then I tried to set up the script on my AWS EC2 instance, yet I got no results.
This is a simplified version of the code:
import requests
from bs4 import BeautifulSoup
url = 'http://twitter.com/i/trends'
r = requests.get(url)
html = r.json()['module_html']
soup = BeautifulSoup(html, 'html.parser')
trends_list = soup.find_all('span', {'class':'u-linkComplex-target trend-name'})
tweet_volume_list = soup.findAll('div', {'class':'js-nav trend-item-stats js-ellipsis'})
and like I said, it worked fine. However, if I run this same code on my Linux server in AWS, the result of r.content is '{}'.
So then I tried going with mobile.twitter.com/i/trends and got a similar problem. Using the DevTools in a private session, I did find that Twitter internally calls an https://api.twitter.com/2/guide.json endpoint, and that is the actual resource that returns the data I'm looking for (trends and tweet volume). However, no matter what I did with requests, I was unable to access it from Python. I tried using the same headers and the same params as the browser, but to no avail.
So then I moved to Selenium, and just like before, I got data locally but not the actual TT data on the server. So at this point I'm pretty lost. I don't know enough web dev to understand whether this is a cookie problem or something else, nor how to fix it.
TL;DR: I want to scrape Twitter's Trending Topics with python but it's not working.
The main reason it doesn't work is that Twitter blocks AWS EC2 instance IPs. It's not a server problem but a block imposed by Twitter itself. I searched a lot and found the same problem with various libraries used for Twitter scraping.
I would recommend using proxies in this case, or changing provider to Linode or DigitalOcean. I also checked Heroku, and it turns out its IPs are also blocked after some requests.
Usage of proxies is well explained in this link from the requests docs.
Based on your code, the solution should be:
import requests
from bs4 import BeautifulSoup
proxies = {
"http": "http://10.10.1.10:3128",
"https": "http://10.10.1.10:1080",
# replace these example addresses with your own proxy server(s)
}
url = 'http://twitter.com/i/trends'
r = requests.get(url, proxies=proxies)
html = r.json()['module_html']
soup = BeautifulSoup(html, 'html.parser')
trends_list = soup.find_all('span', {'class':'u-linkComplex-target trend-name'})
tweet_volume_list = soup.findAll('div', {'class':'js-nav trend-item-stats js-ellipsis'})
You should also try some of the free proxy servers; Python also has libraries like free-proxy that might be of help.
Even then, if the data volume is large, I would recommend using multiple proxies, rotating them frequently, and trying an async requests library like aiohttp, as in the sketch below.
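As a rough sketch of the aiohttp suggestion (the proxy address and URL are placeholders; note that aiohttp takes the proxy per request rather than as a dict):
import asyncio
import aiohttp

PROXY = 'http://10.10.1.10:3128'         # placeholder proxy
URLS = ['http://twitter.com/i/trends']   # placeholder URL list

async def fetch(session, url):
    # each request is routed through the proxy
    async with session.get(url, proxy=PROXY) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))
        print([len(page) for page in pages])

asyncio.run(main())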
Requests isn't using the proxies I pass to it. The site at the URL I'm using shows which IP the request came from, and it's always my IP, not the proxy IP. I'm getting my proxy IPs from sslproxies.org, which are supposed to be anonymous.
url = 'http://www.lagado.com/proxy-test'
proxies = {'http': 'x.x.x.x:xxxx'}
headers = {'User-Agent': 'Mozilla...etc'}
res = requests.get(url, proxies=proxies, headers=headers)
Are there certain headers that need to be used or something else that needs to be configured so that my ip is hidden from the server?
The docs state that proxy URLs must include the scheme, where scheme means something like scheme://hostname. So you should add 'http://' or 'socks5://' to your proxy URL, depending on the protocol you're using.
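Applied to the snippet above, that would look something like this (the IP, port, and User-Agent are still the placeholders from the question):
import requests

url = 'http://www.lagado.com/proxy-test'
# note the explicit scheme in front of the proxy address
proxies = {'http': 'http://x.x.x.x:xxxx',
           'https': 'http://x.x.x.x:xxxx'}
headers = {'User-Agent': 'Mozilla...etc'}
res = requests.get(url, proxies=proxies, headers=headers)
print(res.text)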
I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my verification details using urllib2, I either get a request that hangs forever and returns nothing or I get 407 Errors. I can connect to the web fine using my browser which connects to a prox-pac and redirects accordingly; however, I can't seem to do anything via the command line curl, wget, urllib2 etc. even if I use the proxies that the prox-pac redirects to. I tried setting my proxy to all of the proxies from the pac-file using urllib2, none of which work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'username:password@my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required and I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs like curl or wget timing out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer using what would appear to be the same proxy and credentials?
Might it have something to do with the router? If so, how can it distinguish between browser HTTP requests and command-line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://wwww.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could wrap it in a Session object and every request will automatically use the proxy information (plus it will store & handle cookies automatically!):
s = requests.Session()
s.proxies = proxy  # current requests versions take these as session attributes, not constructor arguments
s.auth = auth
r = s.get('http://www.google.com/')
print r.text