How to Read a JS-Generated Page in Python

Please note: this problem could easily be solved with the selenium library, but I don't want to use selenium since the host doesn't have a browser installed and I'm not willing to install one.
Important: I know that render() will download chromium at first time and I'm ok with that.
Q: How can I get the page source when it's generated by JS code? For example, this HP printer:
220.116.57.59
Someone posted online and suggested using:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://220.116.57.59', timeout=3, verify=False)
base_url = r.url
r.html.render()
But printing r.text doesn't give the full page source; it still shows the notice that JS is disabled:
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
</div>
Original Answer: https://stackoverflow.com/a/50612469/19278887 (last part)

Grab the config endpoints and then parse the XML for the data you want.
For example:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    soup = (
        BeautifulSoup(
            s.get(
                "http://220.116.57.59/IoMgmt/Adapters",
                headers=headers,
            ).text,
            features="xml",
        ).find_all("io:HardwareConfig")
    )

print("\n".join(c.find("MacAddress").getText() for c in soup if c.find("MacAddress") is not None))
Output:
E4E749735068
E4E74973506B
E6E74973D06B
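As a side note on the requests_html attempt from the question: r.text always holds the original, un-rendered response, which is why the no-JS notice shows up there. After render(), the executed page's markup should be available on r.html.html instead. A minimal sketch (assuming the printer answers on HTTPS with a self-signed certificate):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://220.116.57.59', timeout=3, verify=False)
r.html.render()      # executes the page's JavaScript in headless Chromium
print(r.html.html)   # rendered markup; r.text stays the raw, pre-render source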

Related

Python BS4 not allowed to access the web page

First I used html_doc=requests.get(x) to read the page, but when I printed the soup I got a 403 Forbidden error.
To get around this, I added a User Agent and used this code: html_doc=requests.get(x, headers=header)
However, this time I got a 400 Bad Request error when I tried to print the soup.
Could someone guide me and help find a solution to this problem?
Edit - Code:
from bs4 import BeautifulSoup, NavigableString
from urllib import request
import requests
import lxml
from lxml import etree
from lxml import html
x='https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html'
header = {'User Agent' : 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'}
html_doc=requests.get(x, headers=header) #With header
html_doc=requests.get(x) #Without Header
soup = BeautifulSoup(html_doc.text, 'lxml')
print(soup)
URL: x=https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html
Thanks for reading!
EDIT2: Solved by using this code:
import requests
session = requests.Session()
response = session.get('https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html', headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)
PS: I'm just learning to code and this is not for any work-related purposes, just a personal project relating to the stock market.
You'll need to use User-Agent, not User Agent: HTTP header names must not contain spaces.
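A minimal sketch of the same request with the corrected header key:
import requests
from bs4 import BeautifulSoup

# The header name must be "User-Agent" (hyphenated); "User Agent" is not a valid field name.
header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'}
x = 'https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html'
html_doc = requests.get(x, headers=header)
soup = BeautifulSoup(html_doc.text, 'lxml')
print(html_doc.status_code)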

Requests returns a status code of 429 for URL https://www.instagram.com/google

I'm trying to code an Instagram-webscraper in Python to return values like a person's followers, the number of posts etc.
Let's just take Google's Instagram-account for this example.
Here is my code:
import requests
from bs4 import BeautifulSoup
link = requests.get("https://www.instagram.com/google")
soup = BeautifulSoup(link.text, "html.parser")
print(soup)
print(link.status_code)
Pretty straightforward.
However, when I run the code, link.status_code prints as 429. It should be 200; for any other website it prints 200.
Also, the printed soup doesn't show what I actually want: instead of the HTML for the account, it shows the HTML for Instagram's error page.
Why does requests open the Instagram error page and not the account at the link provided?
To get the correct response from the server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
link = requests.get("https://www.instagram.com/google", headers=headers)
soup = BeautifulSoup(link.text, "lxml")
print(link.status_code)
print(soup.select_one('meta[name="description"]')["content"])
Prints:
200
12.5m Followers, 33 Following, 1,642 Posts - See Instagram photos and videos from Google (#google)
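If you want the individual numbers rather than the whole description string, a rough, illustrative follow-up (reusing the soup object from the code above) could split them out with a regex; note the exact wording of the description may change over time:
import re

description = soup.select_one('meta[name="description"]')["content"]
# Pull "12.5m", "33" and "1,642" out of the description string.
m = re.search(r"([\d.,km]+) Followers, ([\d.,km]+) Following, ([\d.,km]+) Posts", description, re.I)
if m:
    followers, following, posts = m.groups()
    print(followers, following, posts)   # e.g. 12.5m 33 1,642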

How to "webscrape" a site containing a popup window, using python?

I am trying to web scrape a certain part of the etherscan site with Python, since there is no API for this functionality. Basically, you go to this link and press Verify; after doing so, a popup appears, which you can see here. What I need to scrape is this part, 0x0882477e7895bdc5cea7cb1552ed914ab157fe56, in case the message starts with the message seen in the picture.
I've written the Python script below that starts this off, but I don't know how to interact further with the site so that the popup comes to the foreground and I can scrape the information. Is this possible to do?
from bs4 import BeautifulSoup
from requests import get
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0','X-Requested-With': 'XMLHttpRequest',}
url = "https://etherscan.io/proxyContractChecker?a=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"
response = get(url,headers=headers )
soup = BeautifulSoup(response.content,'html.parser')
Thank You
import requests
from bs4 import BeautifulSoup

def Main(url):
    with requests.Session() as req:
        r = req.get(url, headers={'User-Agent': 'Ahmed American :)'})
        soup = BeautifulSoup(r.content, 'html.parser')
        vs = soup.find("input", id="__VIEWSTATE").get("value")
        vsg = soup.find("input", id="__VIEWSTATEGENERATOR").get("value")
        ev = soup.find("input", id="__EVENTVALIDATION").get("value")
        data = {
            '__VIEWSTATE': vs,
            '__VIEWSTATEGENERATOR': vsg,
            '__EVENTVALIDATION': ev,
            'ctl00$ContentPlaceHolder1$txtContractAddress': '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48',
            'ctl00$ContentPlaceHolder1$btnSubmit': "Verify"
        }
        r = req.post(
            "https://etherscan.io/proxyContractChecker?a=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
            data=data, headers={'User-Agent': 'Ahmed American :)'})
        soup = BeautifulSoup(r.content, 'html.parser')
        token = soup.find(
            "div", class_="alert alert-success").text.split(" ")[-1]
        print(token)

Main("https://etherscan.io/proxyContractChecker")
Output:
0x0882477e7895bdc5cea7cb1552ed914ab157fe56
I disagree with #InfinityTM. The usual workflow for this kind of problem is to make a POST request to the website.
Look: if you click on Verify, a POST request is made to the website, as shown in this image:
This POST request was made with these headers:
and with these params:
You need to figure out how to send this POST request with the correct URL, headers, params, and cookies. Once you have managed to make the request, you will receive the response:
which contains the information you want to scrape under the div with class "alert alert-success":
Summary
So the steps you need to follow are:
Navigate to your website, and gather all the information (request URL, Cookies, headers, and params) that you will need for your POST request.
Make the request with the requests library.
Once you get a <200> response, scrape the data you are interested in with BS.
Please let me know if this points you in the right direction! :D

Getting error that USER_AGENT is not defined (Python 3)

I'm trying to scrape the information inside an 'iframe' tag. When I execute this code, it says that 'USER_AGENT' is not defined. How can I fix this?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
The error is telling you clearly what is wrong. You are passing in as headers USER_AGENT, which you have not defined earlier in your code. Take a look at this post on how to use headers with the method.
The documentation states you must pass in a dictionary of HTTP headers for the request, whereas you have passed in an undefined variable USER_AGENT.
From the Requests Library API:
headers = None
Case-insensitive Dictionary of Response Headers.
For example, headers['content-encoding'] will return the value of a 'Content-Encoding' response header.
EDIT:
For a better explanation of Content-Type headers, see this SO post. See also this WebMasters post which explains the difference between Accept and Content-Type HTTP headers.
Since you only seem to be interested in scraping the iframe tags, you may simply omit the headers argument entirely and you should see the results if you print out the test object in your code.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", timeout=10)
soup = BeautifulSoup(page.content, "lxml")
test = soup.find_all('iframe')

for tag in test:
    print(tag)
We have to provide a User-Agent; HERE is a link to the fake user-agents.
import requests
from bs4 import BeautifulSoup
USER_AGENT = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/53'}
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
You can also simply not use a User-Agent. Code:
import requests
from bs4 import BeautifulSoup
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
I've split your URL into the URL and the token for readability; that's why there are two variables, url and token.

Cache Access Denied. Authentication Required in requests module

I am trying to make a basic web crawler. My internet access is through a proxy connection, so I used the solution given here. But I am still getting the error while running the code.
My code is:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req
proxies = {
    "http": r"http://usr:pass#202.141.80.22:3128",
    "https": r"http://usr:pass#202.141.80.22:3128",
}
url = input("Ask user for something")
def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url, proxies=proxies)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
But when running it in a terminal on Ubuntu 14.04 I get the following error:
The following error was encountered while trying to retrieve the URL: http://www.santabanta.com/wallpapers/gauhar-khan/? Cache Access Denied. Sorry, you are not currently allowed to request http://www.santabanta.com/wallpapers/gauhar-khan/? from this cache until you have authenticated yourself.
The URL I posted is: http://www.santabanta.com/wallpapers/gauhar-khan/
Please help me
Open the URL.
Hit F12 (Chrome users).
Now go to "Network" in the menu below.
Hit F5 to reload the page so that Chrome records all the data received from the server.
Open any of the received files and go down to "Request Headers".
Pass all of those headers to requests.get().
[Here is an image to help you][1]
[1]: http://i.stack.imgur.com/zUEBE.png
Make the header as follows:
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
    'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
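Then pass those headers, together with the proxies, on every request. A minimal sketch using the headers dict built just above; note that proxy credentials are conventionally written user:pass@host:port, which the '#' in the question presumably stands in for:
import requests

# Standard proxy URL form: scheme://user:pass@host:port
proxies = {
    "http": "http://usr:pass@202.141.80.22:3128",
    "https": "http://usr:pass@202.141.80.22:3128",
}
# `headers` is the dict captured from the browser above.
response = requests.get("http://www.santabanta.com/wallpapers/gauhar-khan/",
                        headers=headers, proxies=proxies)
print(response.status_code)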
There is another way to solve this problem.
You can let your Python script use the proxy defined in your environment variables:
Open a terminal (CTRL + ALT + T)
export http_proxy="http://usr:pass#proxy:port"
export https_proxy="https://usr:pass#proxy:port"
and remove the proxy lines from your code.
Here is the changed code:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
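If you prefer to keep everything inside the script, the same effect can be had by setting the proxy variables from Python before making requests, since requests reads http_proxy/https_proxy from the environment by default (trust_env). A minimal sketch, reusing the placeholder credentials from the answer above:
import os
import requests

# Equivalent to the shell exports above: requests (via urllib) picks these up
# automatically because Session.trust_env defaults to True.
os.environ["http_proxy"] = "http://usr:pass@proxy:port"
os.environ["https_proxy"] = "https://usr:pass@proxy:port"

print(requests.get("http://www.santabanta.com/wallpapers/gauhar-khan/").status_code)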
