Web scraping on PythonAnywhere - Python

In my project I scrape data from Amazon. I deployed it on PythonAnywhere (I'm using a paid account). But there is a problem: the code (I'm using BeautifulSoup4) doesn't get the HTML of the site when I run it on PythonAnywhere; it gets Amazon's "Something Went Wrong" page instead. On my local machine it works perfectly. I think it's about User-Agents. Locally I use my own User-Agent. Which User-Agent should I use when deploying, and how can I fix this?
Here is my code:
import requests
from bs4 import BeautifulSoup

URL = link  # some amazon link
headers = {"User-Agent": "##my user agent"}  # my own User-Agent string

page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, 'html.parser')
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
Is there any way I can do it on PythonAnywhere?

Your code works perfectly on my home machine, so the issue could be:
- the PythonAnywhere machine's IP being blocked by Amazon (as others have mentioned), or
- another issue with the machine's access to the internet (try scraping another site to test this).
To solve the former, you'd probably want to try out a proxy connection to change the IP you access Amazon with (I suggest you check PythonAnywhere's and Amazon's Terms of Service to be aware of any risks). The usage would look something like this:
import requests

proxies = {
    "http": "http://IP:Port",     # HTTP proxy
    "https": "https://IP:Port",   # HTTPS proxy
    # or, for SOCKS5 (requires `pip install requests[socks]`):
    # "http": "socks5://user:pass@IP:Port",
    # "https": "socks5://user:pass@IP:Port",
}

URL = "https://api4.my-ip.io/ip"  # returns your plain-text IPv4, handy for checking which IP the proxy exposes
page = requests.get(URL, proxies=proxies)
print(page.text)
Finding proxies to use takes a couple of Google searches, but the difficult part is swapping them out occasionally, since they don't last forever.

Try Selenium WebDriver instead of BeautifulSoup4. I had this issue myself when deploying a web scraper to pythonanywhere.com.
PythonAnywhere requires a Hacker plan (as a minimum) to run web scraping applications. I was told this by their support team: https://www.pythonanywhere.com/pricing/
I also used the following user-agent and chrome options:
from fake_useragent import UserAgent
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

ua = UserAgent()
userAgent = ua.random  # pick a random User-Agent string for each run
chrome_options.add_argument(f'user-agent={userAgent}')
As per: https://www.pythonanywhere.com/forums/topic/21948/
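For reference, a minimal sketch of launching headless Chrome with those options (this assumes chromedriver is installed and on your PATH; the URL is just a placeholder):
from selenium import webdriver

driver = webdriver.Chrome(options=chrome_options)  # picks up the options defined above
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()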

Related

Why does using Amazon API gateway give the wrong HTML page when using requests.get(URL)

I'm currently building a web scraper and have run into the issue of being IP blocked. To get around this I'm trying to use requests_ip_rotator, which uses AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping. Following this answer, I've implemented it in my code below:
import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

url = "https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1"

# Plain request from my own IP
page1 = requests.get(url)
soup1 = BeautifulSoup(page1.content, "html.parser")

# Same request routed through the AWS API Gateway IP pool
gateway = ApiGateway("https://secure.runescape.com/", access_key_id="****", access_key_secret="****")
gateway.start()
session = requests.Session()
session.mount("https://secure.runescape.com/", gateway)
page2 = session.get(url)
gateway.shutdown()
soup2 = BeautifulSoup(page2.content, "html.parser")

print("\n" + page1.url)
print(page2.url)
print(soup1.head.title == soup2.head.title)
input()
output:
Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://secure.runescape.com/ - IP Rotate API' (10 new).
Deleting gateways for site 'https://secure.runescape.com'.
Deleted 10 endpoints with for site 'https://secure.runescape.com'.
https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1
https://6kesqk9t6d.execute-api.eu-central-1.amazonaws.com/ProxyStage/m=hiscore_oldschool_ironman/a=13/overall
False
So both times I use the .get(url) method with the same url but receive different pages. requests.get(url) gives me the page I want, but when I use the AWS gateway with session.get(url) it gives me a different page from the same site. I'm stumped as to what the issue could be, so any help would be greatly appreciated!
When making GET requests to the "https://secure.runescape.com" domain through the AWS gateway, I noticed that if the URL path is "a=13/group-ironman/?groupSize=5&page=x" for any x, I get a 302 response (redirect) that sends me to the URL path "/a=13/overall".
This leads me to believe that the RuneScape server is redirecting AWS IPs for some URLs, but fortunately it is not redirecting my own IP.
So my workaround is to use requests.get() without the AWS gateway for URLs that are being redirected; for other URLs on the same site the AWS gateway is not redirected, so I still use it to avoid being IP blocked.
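A minimal sketch of that workaround, assuming the gateway session from the code above is set up and not yet shut down; the fetch_page helper and the redirected-path check are my own illustration, not part of requests_ip_rotator:
def fetch_page(url, session):
    # Hypothetical helper: fall back to my own IP for the paths the server
    # redirects when they come from AWS, otherwise use the gateway session.
    if "group-ironman" in url:
        return requests.get(url)
    return session.get(url)

page = fetch_page(url, session)
soup = BeautifulSoup(page.content, "html.parser")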

Prevent being banned from Google scraping with BeautifulSoup

I want to make a Google News scraper with Python and BeautifulSoup, but I have read that there is a chance I could be banned.
I have also read that I can prevent this by using rotating proxies and rotating IP addresses.
The only thing I managed to do is rotate the User-Agent.
Can you show me how to add a rotating proxy and rotating IP addresses?
I know it should be added in the requests.get() part, but I do not know how.
This is my code:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'

for page in range(1, 5):
    page = page * 10
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    headline_text = soup.find_all('h3', class_="r dO0Ag")
    snippet_text = soup.find_all('div', class_='st')
    news_date = soup.find_all('div', class_='slp')
    print(len(news_date))
You can do searches with the proper API from Google:
https://developers.google.com/custom-search/v1/overview
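If you go the official route, a rough sketch of calling the Custom Search JSON API with requests (the API key and search engine ID below are placeholders you would create in the Google console):
import requests

API_KEY = "YOUR_API_KEY"       # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"   # placeholder (the cx parameter)

params = {"key": API_KEY, "cx": CX, "q": "usa"}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in resp.json().get("items", []):
    print(item["title"], item["link"])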
You can use https://gimmmeproxy.com for rotating proxies and its Python wrapper: https://github.com/DeyaaMuhammad/GimmeProxyApi.
import requests
# GimmeProxyAPI comes from the wrapper library linked above
proxy = GimmeProxyAPI(protocol="https")

proxies = {
    'http': proxy,
    'https': proxy
}

requests.get('https://example.org', proxies=proxies)
If you want to learn web scraping, you are better off choosing some other website, like Reddit or an online magazine. Google News (and other Google services) are well protected against scraping, and they change the class names regularly enough to prevent you from doing it the easy way.
If your question is 'What do I do to not get banned?', then the answer is 'Don't violate the TOS', which means no scraping at all and using the proper search API.
There is some amount of "free" Google search use, based on the IP address you are using, so if you only scrape a handful of searches this should be no problem.
If your question is 'How do I use a proxy with the requests module?', then you should start looking here:
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
But this is only the Python side; you need to set up a web proxy (or, even better, a pool of proxies) yourself, and then use an algorithm to choose a different proxy every N requests, for example, as sketched below.
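A minimal sketch of that rotation idea, assuming you already have a list of working proxies (the addresses below are placeholders):
import itertools
import requests

# Placeholder proxy addresses; replace with proxies you control or have verified.
proxy_pool = itertools.cycle([
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
])

def get_with_rotation(url, **kwargs):
    # Use the next proxy in the pool for each request.
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, **kwargs)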
One more simple trick is to use Google Colab inside the Brave browser's Tor window; you will see that you get different IP addresses.
Once you get the data you want, you can use it in your Jupyter notebook, VS Code, or elsewhere.
Using free proxies usually fails because there are too many requests going through them, so you have to keep picking a different proxy with lower traffic each time, which is a terrible task when choosing one out of hundreds.
With the Brave Tor window the results come back correctly (screenshots omitted).

Unable to get the same requests results from Twitter Web from my local PC and AWS EC2 instance

I want to be able to Scrape Twitter's Trending Topics.
Of course, the natural way to do that, is to use the Twitter API. However, most of the Trends do not come with a Tweet_count, which is key for me.
So I decided to scrape the Twitter website, and it has been a mess.
First, I just went after https://twitter.com/i/trends and it worked fine and still does, on my local computer. Then I tried to set up the script on my AWS EC2 instance, yet I got no results.
This is a simplified version of the code:
import requests
from bs4 import BeautifulSoup
url = 'http://twitter.com/i/trends'
r = requests.get(url)
html = r.json()['module_html']
soup = BeautifulSoup(html, 'html.parser')
trends_list = soup.find_all('span', {'class':'u-linkComplex-target trend-name'})
tweet_volume_list = soup.findAll('div', {'class':'js-nav trend-item-stats js-ellipsis'})
and like I said, it worked fine. However, if I run this same code on my Linux server in AWS, the result of r.content is '{}'.
So then I tried going with mobile.twitter.com/i/trends and got a similar problem. Using DevTools in a private session, I did find that Twitter internally calls an https://api.twitter.com/2/guide.json endpoint, and that is the actual resource that returns the data I'm looking for (trends and tweet volume). However, no matter what I did with requests, I was unable to access it with Python. I tried using the same headers and the same params as the browser, but to no avail.
So then I moved to Selenium, and just like before, I got data locally but not the actual TT data on the server. At this point I'm pretty lost. I don't know enough web dev to understand whether this is a cookie problem or something else, nor how to fix it.
TL;DR: I want to scrape Twitter's Trending Topics with python but it's not working.
The main reason it doesn't work is that Twitter blocks AWS EC2 instance IPs. It's not a server problem but a block imposed by Twitter itself. I searched a lot and found the same problems with various libraries used for Twitter scraping.
I would recommend using proxies in this case, or changing provider to Linode or DigitalOcean. I also checked Heroku, and it turns out its IPs are also blocked after some requests.
Usage of proxies is well explained in this link from the requests docs.
Based on your code, the solution would be:
import requests
from bs4 import BeautifulSoup

proxies = {
    "http": "http://10.10.1.10:3128",   # replace with your proxy server
    "https": "http://10.10.1.10:1080",  # replace with your proxy server
}

url = 'http://twitter.com/i/trends'
r = requests.get(url, proxies=proxies)
html = r.json()['module_html']
soup = BeautifulSoup(html, 'html.parser')
trends_list = soup.find_all('span', {'class': 'u-linkComplex-target trend-name'})
tweet_volume_list = soup.find_all('div', {'class': 'js-nav trend-item-stats js-ellipsis'})
You should also try some of the free proxy servers; Python also has libraries like free-proxy that might help.
Even then, if the data volume is large, I would recommend using multiple proxies, rotating them frequently, and trying an async request library such as aiohttp.
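A rough sketch of what that could look like with aiohttp (the proxy addresses are placeholders; aiohttp takes a proxy argument per request):
import asyncio
import aiohttp

# Placeholder proxies; replace with working ones and rotate as needed.
PROXIES = ["http://10.10.1.10:3128", "http://10.10.1.11:3128"]

async def fetch(session, url, proxy):
    async with session.get(url, proxy=proxy) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, PROXIES[i % len(PROXIES)])
                 for i, url in enumerate(urls)]
        return await asyncio.gather(*tasks)

# pages = asyncio.run(main(["http://twitter.com/i/trends"]))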

How to get info/data from blocked web sites with BeautifulSoup?

I want to write a script with Python 3.7, but first I have to scrape the site.
I have no problems connecting to and getting data from unblocked sites, but if the site is blocked it won't work.
If I use a VPN service I can reach these "blocked" sites with the Chrome browser.
I tried setting a proxy in PyCharm, but I failed; I just got errors all the time.
What's the simplest free way to solve this problem?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'}) # that web site is blocked in my country
webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site.
page_soup = soup(webpage, "html.parser")
There are multiple ways to scrape blocked sites. A solid way is to use a proxy service, as already mentioned.
A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet.
When you use a proxy, your requests are forwarded through it, so your IP is not directly exposed to the site you are scraping.
You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do
import requests

proxies = { 'http': "http://xxx.xx.xx.xxx:yy",
            'https': "https://xxx.xx.xx.xxx:yy"}

r = requests.get('http://www.somebannedsite.com', proxies=proxies)
and expect to get a response.
The proxy has to be configured to take your request and send you back a response.
So, where can you get a proxy?
a. You could buy proxies from many providers.
b. Use a list of free proxies from the internet.
You don't need to buy proxies unless you are doing some massive-scale scraping.
For now I will focus on free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find a list of sites offering free proxies. Go to any one of them and get an IP and its corresponding port.
import requests

# replace the IP and port below with the IP and port you got from any of the free sites
proxies = { 'http': "http://182.52.51.155:39236",
            'https': "https://182.52.51.155:39236"}

r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)
You should, if possible, use a proxy with an 'Elite' anonymity level (the anonymity level is specified on most of the sites providing free proxies). If interested, you could also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.
Note:
Most of these free proxies are not that reliable, so if you get an error with one IP and port combination, try a different one, as in the sketch below.
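A small sketch of that retry idea, assuming you have collected a handful of free proxies (the addresses below are placeholders):
import requests

# Placeholder free proxies; replace with ones you collected.
free_proxies = [
    "http://182.52.51.155:39236",
    "http://103.216.82.20:6666",
]

def get_via_any_proxy(url):
    # Try each proxy in turn and return the first successful response.
    for p in free_proxies:
        try:
            return requests.get(url, proxies={"http": p, "https": p}, timeout=10)
        except requests.exceptions.RequestException:
            continue  # dead or blocked proxy, try the next one
    raise RuntimeError("No working proxy found")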
Your best option is to use a proxy via the requests library, since it can flexibly handle requests through a proxy.
Here is a small example:
import requests
from bs4 import BeautifulSoup as soup
# use your usable proxies here
# replace host with your proxy IP and port with the port number
proxies = { 'http': "http://host:port",
            'https': "https://host:port"}

text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser")  # use whatever parser you prefer, maybe lxml?
If you want to use SOCKS5, you'd have to get the dependencies via pip install requests[socks] and then replace the proxies part with:
# user is your authentication username
# pass is your auth password
# host and port are the same as above
proxies = { 'http': 'socks5://user:pass@host:port',
            'https': 'socks5://user:pass@host:port' }
If you don't have proxies at hand, you can fetch some proxies.

Selenium Webdriver / Beautifulsoup + Web Scraping + Error 416

I'm doing web scraping using Selenium WebDriver in Python with a proxy.
I want to browse more than 10k pages of a single site with this scraper.
The issue is that with this proxy I'm able to send a request only once. When I send another request to the same link or another link on the site, I get a 416 error (my IP seems to be blocked by a firewall) for 1-2 hours.
Note: I'm able to scrape all normal sites with this code, but this site has some kind of security that prevents me from scraping.
Here is code.
import time
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "74.73.148.42")
profile.set_preference("network.proxy.http_port", 3128)
profile.update_preferences()

browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://www.example.com/')
time.sleep(5)

element = browser.find_elements_by_css_selector(
    '.well-sm:not(.mbn) .row .col-md-4 ul .fs-small a')
for ele in element:
    print ele.get_attribute('href')

browser.quit()
Any solution?
Selenium wasn't helpful for me, so I solved the problem using BeautifulSoup. The website blocks a proxy whenever it receives a request from it, so I keep changing the proxy URL and User-Agent whenever the server blocks the current proxy.
I'm pasting my code here:
from bs4 import BeautifulSoup
import urllib2

url = 'http://terriblewebsite.com/'

# Create a URL opener that routes requests through the proxy
proxy = urllib2.ProxyHandler({'http': '130.0.89.75:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15')
result = urllib2.urlopen(request)
data = result.read()

soup = BeautifulSoup(data, 'html.parser')
ptag = soup.find('p', {'class': 'text-primary'}).text
print ptag
Note:
- change the proxy and User-Agent, and use only recently updated proxies
- some servers accept proxies from specific countries only; in my case I used proxies from the United States
- this process might be slow, but you can still scrape the data
Going through the 416 error issues in the following links, it seems that some cached information (cookies, maybe) is creating the issue. You are able to send a request the first time, but subsequent requests fail.
https://webmasters.stackexchange.com/questions/17300/what-are-the-causes-of-a-416-error
416 Requested Range Not Satisfiable
Try choosing not to save cookies by setting a preference, or delete the cookies after every request, as in the sketch below.
profile.set_preference("network.cookie.cookieBehavior", 2)
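Alternatively, a small sketch of clearing cookies after each page load (delete_all_cookies is a standard Selenium WebDriver call; browser refers to the driver created in the question's code):
browser.get('http://www.example.com/')
browser.delete_all_cookies()  # drop any cookies the site set before the next request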
