First I used html_doc=requests.get(x) to read the page, but when I printed the soup I got a 403 Forbidden error.
To get around this, I added a User-Agent header and used this code: html_doc=requests.get(x, headers=header)
However, this time I got a 400 Bad Request error when I tried to print the soup.
Could someone guide me and help me find a solution to this problem?
Edit - Code:
from bs4 import BeautifulSoup, NavigableString
from urllib import request
import requests
import lxml
from lxml import etree
from lxml import html
x='https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html'
header = {'User Agent' : 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'}
html_doc=requests.get(x, headers=header) #With header
html_doc=requests.get(x) #Without Header
soup = BeautifulSoup(html_doc.text, 'lxml')
print(soup)
Thanks for reading!
EDIT2: Solved by using this code:
import requests
session = requests.Session()
response = session.get('https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html', headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)
PS: I'm just learning to code and this is not for any work-related purposes, just a personal project relating to the stock market.
You'll need to use User-Agent, not User Agent: HTTP header names must not contain spaces.
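For example, a minimal corrected version of the request above (any browser-like User-Agent string should work; this sketch just reuses the one from the question):

import requests

# Note the hyphen in 'User-Agent': a space in a header name is invalid HTTP, which is why the server answered 400
header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'}
html_doc = requests.get('https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html', headers=header)
print(html_doc.status_code)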
Please note: this problem can be solved easily by using the selenium library, but I don't want to use selenium, since the host doesn't have a browser installed and I'm not willing to install one.
Important: I know that render() will download Chromium the first time, and I'm OK with that.
Q: How can I get the page source when it's generated by JS code? For example, this HP printer:
220.116.57.59
Someone posted online and suggested using:
from requests_html import HTMLSession
session = HTMLSession()  # the session must be created before it is used
r = session.get('https://220.116.57.59', timeout=3, verify=False)  # verify=False: the device is reached by bare IP, so its certificate won't validate
base_url = r.url
r.html.render()
But printing r.text doesn't print the full page source; it indicates that JS is disabled:
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
</div>
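Note that r.text always holds the original, unrendered response, which is why it still shows the no-JS notice even after rendering. The rendered markup is exposed on r.html.html once r.html.render() has run. A minimal sketch:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://220.116.57.59', timeout=3, verify=False)
r.html.render()     # executes the page's JavaScript in headless Chromium
print(r.html.html)  # rendered DOM; r.text would still show the no-JS placeholder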
Original Answer: https://stackoverflow.com/a/50612469/19278887 (last part)
Grab the config endpoints and then parse the XML for the data you want.
For example:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get(
            "http://220.116.57.59/IoMgmt/Adapters",
            headers=headers,
        ).text,
        features="xml",
    ).find_all("io:HardwareConfig")
    print("\n".join(c.find("MacAddress").getText() for c in soup if c.find("MacAddress") is not None))
Output:
E4E749735068
E4E74973506B
E6E74973D06B
I'm trying to code an Instagram web scraper in Python to return values like a person's follower count, number of posts, etc.
Let's just take Google's Instagram account for this example.
Here is my code:
import requests
from bs4 import BeautifulSoup
link = requests.get("https://www.instagram.com/google")
soup = BeautifulSoup(link.text, "html.parser")
print(soup)
print(link.status_code)
Pretty straightforward.
However, if I run the code, link.status_code prints 429. It should be 200; for any other website it prints 200.
Also, when it prints soup, it doesn't show what I actually want: not the HTML for the account, but the HTML for the Instagram error page.
Why does requests open the Instagram error page, not the account from the link provided?
To get the correct response from the server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
link = requests.get("https://www.instagram.com/google", headers=headers)
soup = BeautifulSoup(link.text, "lxml")
print(link.status_code)
print(soup.select_one('meta[name="description"]')["content"])
Prints:
200
12.5m Followers, 33 Following, 1,642 Posts - See Instagram photos and videos from Google (#google)
I wanted to get the HTML of a website but I can't get it, due to the user agent I suppose, because when I call uClient=ureq(my_url) I get an error like this: urllib.error.HTTPError: HTTP Error 403: Forbidden
This is the code:
from urllib.request import urlopen as ureq, Request
from bs4 import BeautifulSoup as soup
my_url= 'https://hsreplay.net/meta/#tab=matchups&sortBy=winrate'
ureq(Request(my_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}))
uClient=ureq(my_url)
page_html=uClient.read()
uClient.close()
html=soup(page_html,"html.parser")
I have tried other ways of changing the user agent, and other user agents, but it doesn't work.
I'm pretty sure you can help. Thanks!!
What you did above is a mess: you open the URL once with the User-Agent header and throw that response away, then call ureq(my_url) again without the header, and that second request is the one that gets the 403. Try the below way instead.
from bs4 import BeautifulSoup
from urllib.request import Request,urlopen
URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"
req = Request(URL,headers={"User-Agent":"Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res,"lxml")
name = soup.find("h1").text
print(name)
Output:
HSReplay.net
Btw, you can scrape a few items from that page that are not JavaScript-generated. However, the core content of that page is generated dynamically, so you can't grab it using urllib and BeautifulSoup; to get it you need a browser simulator like selenium, as sketched below.
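A minimal sketch of the selenium route, assuming Firefox and geckodriver are installed on the machine (a real scraper should also wait explicitly for the dynamic content to appear before reading the page source):

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://hsreplay.net/meta/#tab=matchups&sortBy=winrate")
# Hand the browser-rendered DOM to BeautifulSoup instead of the raw HTTP response
soup = BeautifulSoup(browser.page_source, "lxml")
browser.quit()
print(soup.find("h1").text)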
I'm trying to scrape the information inside an 'iframe' tag. When I execute this code, it says that 'USER_AGENT' is not defined. How can I fix this?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
The error is telling you clearly what is wrong. You are passing in as headers USER_AGENT, which you have not defined earlier in your code. Take a look at this post on how to use headers with the method.
The documentation states you must pass in a dictionary of HTTP headers for the request, whereas you have passed in an undefined variable USER_AGENT.
From the Requests Library API:
headers = None
Case-insensitive Dictionary of Response Headers.
For example, headers['content-encoding'] will return the value of a 'Content-Encoding' response header.
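A quick illustration of that quoted behaviour; note this headers attribute lives on the Response object and is separate from the headers dict you pass into requests.get:

import requests

r = requests.get("https://www.example.com")
# Case-insensitive lookup: both lines print the same response header value
print(r.headers.get("Content-Type"))
print(r.headers.get("content-type"))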
EDIT:
For a better explanation of Content-Type headers, see this SO post. See also this WebMasters post which explains the difference between Accept and Content-Type HTTP headers.
Since you only seem to be interested in scraping the iframe tags, you may simply omit the headers argument entirely and you should see the results if you print out the test object in your code.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", timeout=10)
soup = BeautifulSoup(page.content, "lxml")
test = soup.find_all('iframe')
for tag in test:
    print(tag)
We have to provide a user agent; HERE's a link to the fake user agents.
import requests
from bs4 import BeautifulSoup
USER_AGENT = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/53'}
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
You can simply NOT use a User-Agent. Code:
import requests
from bs4 import BeautifulSoup
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
I've separated your URL into the base URL and the token for readability purposes. That's why there are two variables, url and token.
Here's the code I'm using to get Nike's clothing data.
import urllib.request
#Base url for website
url = 'http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120'
# A lot of sites don't like the user agents of Python 3, so I specify one here
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
And then the error looks like this:
urllib.error.HTTPError: HTTP Error 403: Forbidden
How can I open and parse this HTML page?
Or try selenium webdriver.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
browser = webdriver.Firefox()
url = 'http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120'
browser.get(url)
source = browser.page_source
soup = bs(source, "html.parser")
print(soup)
This worked for me, just a newbie though :)
Alternatively you could try requests.
>>> import requests
>>> page = requests.get('http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120').content
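requests sends a different default User-Agent than urllib, which some servers treat differently. If the call above succeeds, the bytes can go straight to BeautifulSoup, e.g.:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120').content
soup = BeautifulSoup(page, 'html.parser')
print(soup.title)  # quick sanity check that real page HTML came back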
Try this:
import urllib.request
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120')
print(response.read())
AppURLopener (which inherits from the urllib.request.FancyURLopener class) lets you override the version attribute, which is the User-Agent string it sends, and that is enough to get past the 403 Forbidden error. Note that FancyURLopener has been deprecated since Python 3.3, so the Request-with-headers approach is the more future-proof one.
Hope this helps!
The problem is with User-Agent. This website blocks the specified User-Agent but works fine without specifying any User-Agent in the header.
import urllib.request
#Base url for website
url = 'http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120'
# This site blocks the spoofed User-Agent but accepts a request without one, so don't set any custom headers
req = urllib.request.Request(url)
html = urllib.request.urlopen(req).read()
print(html)
But if you want to add the header anyway, I would recommend you use requests instead. First install the package via pip: pip install requests.
import requests
#Base url for website
url = 'http://store.nike.com/us/en_us/pw/mens-clothing/1mdZ7pu?ipp=120'
# A lot of sites don't like the user agents of Python 3, so I specify one here
html = requests.get(url, headers = {'User-Agent': 'Mozilla/5.0'})
print(html.text)
For detailed documentation about requests, see this page.