Access denied while scraping - python

I want to create a script that goes to https://www.size.co.uk/featured/footwear/ and scrapes the content, but when I run the script I get Access Denied. Here is the code:
from urllib import urlopen
from bs4 import BeautifulSoup as BS
url = urlopen('https://www.size.co.uk/')
print BS(url, 'lxml')
The output is
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.size.co.uk/" on this server.
<p>
Reference #18.6202655f.1498945327.11002828
</p></body>
</html>
When I try it with other websites, the code works fine, and when I use Selenium nothing happens, but I still want to know how to get past this error without using Selenium. However, when I use Selenium on a different website like http://www.footpatrol.co.uk/shop I get the same Access Denied error. Here is the code for Footpatrol:
from selenium import webdriver
from bs4 import BeautifulSoup as BS
driver = webdriver.PhantomJS(r'C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe')
driver.get('http://www.footpatrol.com')
pageSource = driver.page_source
soup = BS(pageSource, 'lxml')
print soup
Output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.footpatrol.co.uk/" on this
server.<p>
Reference #18.6202655f.1498945644.110590db
</p></body></html>

import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.size.co.uk/'
# sending a browser-style User-Agent header avoids the Access Denied page
agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
page = requests.get(url, headers=agent)
print(BS(page.content, 'lxml'))

Try this:
import requests
url = 'https://www.size.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
source = requests.get(url, headers=headers).text
print(source)
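If a User-Agent alone is not enough, sending a few more browser-style headers sometimes helps. A sketch combining the two answers above against the URL from the question (the extra Accept headers are an assumption, not from the original posts):
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.size.co.uk/featured/footwear/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.9',
}
page = requests.get(url, headers=headers)
print(page.status_code)  # 200 means the block was lifted, 403 means still denied
print(BS(page.content, 'lxml').title)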

Related

server sends 403 status code when using requests library in python, but works with browser

I'm trying to automate a login using Python's requests module, but whenever I use a POST or GET request the server sends a 403 status code. The weird part is that I can access the same URL in any browser, but it just won't work with curl or requests.
Here is the code:
import requests
import lxml
from bs4 import BeautifulSoup
import os
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w")
FILE.write(ready)
FILE.close()
I'd appreciate any help or idea!
It's probably the /robots.txt that's blocking you.
Try overriding the user-agent with a custom one.
import requests
import lxml
from bs4 import BeautifulSoup
import os
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w", encoding="utf-8")
FILE.write(ready)
FILE.close()
You also didn't specify the file encoding when opening the file.
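For reference, a with block closes the file automatically, so the close call can't be forgotten (a minimal sketch of the same write step):
with open("usvisa.html", "w", encoding="utf-8") as f:
    f.write(ready)  # file is closed automatically when the block exits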

Web crawling error using Python at 'whoscored.com'

import requests
from bs4 import BeautifulSoup
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get("https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League", headers=user_agent)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I'm trying to crawl 'whoscored.com' but I can't get all of the HTML. Can you tell me the solution? This is the result:
Request unsuccessful. Incapsula incident ID: 946001050011236585-61439481461474967
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League'
browser.get(url)
time.sleep(3)  # give the page (and any bot check) time to finish loading
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())
There are a couple of issues here. The root cause is that the website you are trying to scrape knows you're not a real person and is blocking you. Lots of websites do this simply by checking headers to see whether a request is coming from a browser or from a bot. However, this site appears to use Incapsula, which is designed to provide more sophisticated protection.
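One way to see which case you are in is to check the response for the Incapsula block page shown in the question (a sketch; the User-Agent value is just a typical browser string):
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get("https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League", headers=headers)
if "Incapsula incident ID" in page.text:
    print("Blocked by Incapsula; plain headers are not enough, try the Selenium route above.")
else:
    print(page.text[:500])  # first part of the real page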

Unable to read html page from beautiful soup

The code below gets stuck after printing hi. Can you please check what is wrong with it? Is the site protected in a way that requires some special authentication?
from bs4 import BeautifulSoup
import requests
print('hi')
rooturl='http://www.hoovers.com/company-information/company-search.html'
r=requests.get(rooturl);
print('hi1')
soup=BeautifulSoup(r.content,"html.parser");
print('hi2')
print(soup)
You get this problem because the website thinks you are a robot, so it won't send you anything; it may even hang the connection and leave you waiting forever.
If you imitate a browser's request, the server will treat you as a real user rather than a robot.
Adding headers is the simplest way to deal with this, but sometimes you should not pass only a User-Agent (as in this case). Copy your browser's request headers and strip out the unnecessary ones through testing. If you are lazy, use the browser's headers as they are, but do not copy all of them when you want to upload files (see the note after the code below).
from bs4 import BeautifulSoup
import requests
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    resp = se.get(rooturl)
    print(resp.content)
    soup = BeautifulSoup(resp.content, "html.parser")
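The caveat above about uploads is because requests builds its own multipart Content-Type header (including the boundary) when you pass files=, and a Content-Type copied from the browser would break it. A sketch, with a hypothetical upload URL and file name:
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
# note: no Content-Type here; requests sets it (with the boundary) for multipart uploads
files = {"file": open("report.csv", "rb")}  # hypothetical file
resp = requests.post("https://example.com/upload", headers=headers, files=files)  # hypothetical URL
print(resp.status_code)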
I was having the same issue as you; it just sat there. I tried adding a user-agent and it pulled the page relatively quickly. I don't know why that is, though.
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
print('hi')
rooturl='http://www.hoovers.com/company-information/company-search.html'
r=requests.get(rooturl, headers=headers)
print('hi1')
soup=BeautifulSoup(r.content,"html.parser");
print('hi2')
print(soup)
EDIT: So odd. Now it's not working for me again. It first didn't work. Then it did. Now it doesn't. But there is another potential option with the use of Selenium.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')
r = browser.page_source
print('hi1')
soup=BeautifulSoup(r,"html.parser")
print('hi2')
print(soup)
browser.close()
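If the page is still rendering when page_source is read, an explicit wait is more reliable than hoping the page is ready (a sketch; the element waited on is an assumption, not from the original post):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.TAG_NAME, "form")))  # assumed element
soup = BeautifulSoup(browser.page_source, "html.parser")
print(soup.title)
browser.quit()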

BeautifulSoup doesn't seem to parse anything

I've been trying to learn BeautifulSoup by making myself a proxy scraper and I've encountered a problem: BeautifulSoup seems unable to find anything, and when I print what it parses, it shows me this:
<html>
<head>
</head>
<body>
<bound 0x7f977c9121d0="" <http.client.httpresponse="" at="" httpresponse.read="" method="" object="" of="">
>
</bound>
</body>
</html>
I have tried changing the website I parse and the parser itself (lxml, html.parser, html5lib), but nothing seems to change; no matter what I do I get the exact same result. Here's my code, can anyone explain what's wrong?
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req).read
        soup = BeautifulSoup(str(content), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
You are calling urllib.request.urlopen(req).read; the correct syntax is urllib.request.urlopen(req).read(). You are also not closing the connection; I fixed that for you.
A better way to open connections is to use the with urllib.request.urlopen(url) as req: syntax, as this closes the connection for you (see the sketch after the code below).
from bs4 import BeautifulSoup
import urllib.request
import html5lib

class Websites:
    def __init__(self):
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    def free_proxy_list(self):
        print("Connecting to free-proxy-list.net ...")
        url = "https://free-proxy-list.net"
        req = urllib.request.Request(url, None, self.header)
        content = urllib.request.urlopen(req)
        html = content.read()  # read() is a method call, not an attribute
        soup = BeautifulSoup(str(html), "html5lib")
        print("Connected. Loading the page ...")
        print("Print page")
        print("")
        print(soup.prettify())
        content.close()  # important: close the connection
For more info see: https://docs.python.org/3.0/library/urllib.request.html#examples
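As mentioned above, the with form closes the connection automatically; a minimal sketch of the same fetch written that way:
import urllib.request
url = "https://free-proxy-list.net"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
req = urllib.request.Request(url, None, header)
with urllib.request.urlopen(req) as content:
    html = content.read()  # connection is closed automatically when the block exits
print(len(html))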

Why can't I scrape Amazon by BeautifulSoup?

Here is my python code:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
It works for google.com and many other websites, but it doesn't work for amazon.com. I can open amazon.com in my browser, but the resulting "soup" is still none.
Besides, I find that it cannot scrape appannie.com either. However, rather than returning nothing, the code raises an error:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
So I suspect that Amazon and App Annie block scraping.
Add a header, then it will work.
from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"
# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print soup
You can try this:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
In Python, arbitrary text is called a string and it must be enclosed in quotes (" ").
I just ran into this and found that setting any user-agent will work. You don't need to lie about your user agent. For example, with Ruby's HTTParty:
response = HTTParty.get #url, headers: {'User-Agent' => 'Httparty'}
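A rough Python equivalent with requests (a sketch; the non-browser User-Agent value is arbitrary, and whether Amazon still accepts it may vary):
import requests
# any explicit User-Agent, even a non-browser one, was enough in that case
r = requests.get('http://www.amazon.com/', headers={'User-Agent': 'my-scraper'})
print(r.status_code)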
Add a header:
import urllib2
from bs4 import BeautifulSoup
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
req = urllib2.Request("http://www.amazon.com/", headers=headers)  # attach the header to the request
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup
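On Python 3, the same approach uses urllib.request instead of urllib2 (a minimal sketch):
import urllib.request
from bs4 import BeautifulSoup
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
req = urllib.request.Request("http://www.amazon.com/", headers=headers)
page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, "lxml")
print(soup.title)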
