I am trying to web scrape a certain part of the etherscan site with Python, since there is no API for this functionality. Basically, one goes to the proxy contract checker page and presses Verify, after which a popup comes up. What I need to scrape is this part, 0x0882477e7895bdc5cea7cb1552ed914ab157fe56, in case the message in the popup starts with the expected text.
I've written the Python script below that starts this off, but I don't know how it's possible to interact further with the site in order to have that popup come to the foreground and scrape the information. Is this possible to do?
from bs4 import BeautifulSoup
from requests import get
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
    'X-Requested-With': 'XMLHttpRequest',
}
url = "https://etherscan.io/proxyContractChecker?a=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"
response = get(url, headers=headers)
soup = BeautifulSoup(response.content,'html.parser')
Thank You
import requests
from bs4 import BeautifulSoup


def Main(url):
    with requests.Session() as req:
        r = req.get(url, headers={'User-Agent': 'Ahmed American :)'})
        soup = BeautifulSoup(r.content, 'html.parser')
        # The hidden ASP.NET form fields that must accompany the POST
        vs = soup.find("input", id="__VIEWSTATE").get("value")
        vsg = soup.find("input", id="__VIEWSTATEGENERATOR").get("value")
        ev = soup.find("input", id="__EVENTVALIDATION").get("value")
        data = {
            '__VIEWSTATE': vs,
            '__VIEWSTATEGENERATOR': vsg,
            '__EVENTVALIDATION': ev,
            'ctl00$ContentPlaceHolder1$txtContractAddress': '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48',
            'ctl00$ContentPlaceHolder1$btnSubmit': "Verify"
        }
        # Submit the form exactly as the Verify button would
        r = req.post(
            "https://etherscan.io/proxyContractChecker?a=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
            data=data, headers={'User-Agent': 'Ahmed American :)'})
        soup = BeautifulSoup(r.content, 'html.parser')
        # The implementation address is the last word inside the success alert
        token = soup.find("div", class_="alert alert-success").text.split(" ")[-1]
        print(token)


Main("https://etherscan.io/proxyContractChecker")
Output:
0x0882477e7895bdc5cea7cb1552ed914ab157fe56
I disagree with #InfinityTM. The usual workflow for this kind of problem is to make a POST request to the website.
If you click on Verify, a POST request is sent to the site; you can observe it, along with the headers and params it carries, in your browser's developer tools.
You need to figure out how to send this POST request with the correct URL, headers, params, and cookies. Once you manage to make the request, you will receive a response which contains the information you want to scrape under the div with class "alert alert-success".
Summary
So the steps you need to follow are (a minimal sketch follows the list):
Navigate to your website, and gather all the information (request URL, Cookies, headers, and params) that you will need for your POST request.
Make the request with the requests library.
Once you get a <200> response, scrape the data you are interested in with BS.
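For illustration, a minimal sketch of those three steps. The URL, field names, and headers below are placeholders, not the real ones; copy the actual values from the request you observed in your browser's network tab:

import requests
from bs4 import BeautifulSoup

# Placeholder values - replace with what you gathered from the network tab
url = "https://example.com/someForm"
headers = {"User-Agent": "Mozilla/5.0"}
data = {"someField": "someValue"}

with requests.Session() as s:  # a Session carries cookies across requests
    r = s.post(url, data=data, headers=headers)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        alert = soup.find("div", class_="alert alert-success")
        print(alert.text if alert else "success alert not found")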
Please let me know if this points you in the right direction! :D
Please Note: This problem can be solved easily by using the selenium library, but I don't want to use selenium since the host doesn't have a browser installed and I'm not willing to install one.
Important: I know that render() will download Chromium the first time and I'm ok with that.
Q: How can I get the page source when it's generated by JS code? For example, this HP printer:
220.116.57.59
Someone posted online and suggested using:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://220.116.57.59', timeout=3, verify=False)
base_url = r.url
r.html.render()
But printing r.text doesn't print the full page source, and it indicates that JS is disabled:
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
</div>
Original Answer: https://stackoverflow.com/a/50612469/19278887 (last part)
Grab the config endpoints and then parse the XML for the data you want.
For example:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    soup = (
        BeautifulSoup(
            s.get(
                "http://220.116.57.59/IoMgmt/Adapters",
                headers=headers,
            ).text,
            features="xml",
        ).find_all("io:HardwareConfig")
    )
    print("\n".join(c.find("MacAddress").getText() for c in soup if c.find("MacAddress") is not None))
Output:
E4E749735068
E4E74973506B
E6E74973D06B
I'm trying to code an Instagram web scraper in Python to return values like a person's followers, the number of posts, etc.
Let's just take Google's Instagram account for this example.
Here is my code:
import requests
from bs4 import BeautifulSoup
link = requests.get("https://www.instagram.com/google")
soup = BeautifulSoup(link.text, "html.parser")
print(soup)
print(link.status_code)
Pretty straightforward.
However, if I run the code, it prints link.status_code = 429. It should be 200; for any other website it prints 200.
Also, when it prints soup, it doesn't show what I actually want. Instead of the HTML for the account, it shows the HTML of the Instagram error page.
Why does requests open the Instagram error page, not the account from the link provided?
To get the correct response from the server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
link = requests.get("https://www.instagram.com/google", headers=headers)
soup = BeautifulSoup(link.text, "lxml")
print(link.status_code)
print(soup.select_one('meta[name="description"]')["content"])
Prints:
200
12.5m Followers, 33 Following, 1,642 Posts - See Instagram photos and videos from Google (#google)
I am trying to get a link from a website, but BeautifulSoup gives me back empty content.
Is the webpage blocking it, or is there some JavaScript?
How can I get the links?

import requests
from bs4 import BeautifulSoup

page = requests.get(url='https://finbox.com/NASDAQGS:AAPL')
soup = BeautifulSoup(page.text, "lxml").find_all("a", href=True)
You can get almost all the data via their underlying API. If you open the developer tools (Ctrl + Shift + I), select the Network tab, and filter by XHR, you will see that the webpage gets its data from direct calls.
To get the data you are looking for, just observe how the webpage makes GET or POST calls. The data returned is JSON (a Python dictionary), which means you do not have to clean it with BeautifulSoup or Selenium.
You only need requests or httpx for these examples:
import requests
# information about ticker
URL = 'https://makeshift.finbox.com/v4/assets/markets?ticker=NASDAQGS%3AAAPL'
r = requests.get(URL).json()
print(r)
# meta data
r = requests.get("https://makeshift.finbox.com/v4/seo/meta/NASDAQGS:AAPL").json()
print(r)
# post calls
headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"}
payload = {"category":"view","action":"ViewContent","label":"Company Page","value":0,"data":{"pathname":"/NASDAQGS:AAPL","search":"","ticker":"NASDAQGS:AAPL"}}
r = requests.post("https://finbox.com/_/api/v4/users/events", headers=headers, data=payload).json()
print(r)
# another
payload = {"query":"\n query loadAssetPeerBenchmarks ($ticker: String!, $currency: String) {\n asset: company (ticker: $ticker, currency: $currency) {\n ticker\n is_subscribed\n stats\n }\n }\n ",
"variables":{"ticker":"NASDAQGS:AAPL"}}
r = requests.post("https://finbox.com/_/api/v4/query?raw=true", headers=headers, json=payload).json()
print(r)
The page needs JS to render. Try disabling JS in your browser, and the page will refuse to load. requests.get doesn't run anything; it's just an initial file request. You'll probably want to look into using Selenium to render the JS in a headless browser.
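As a rough sketch of that Selenium approach (this assumes the selenium package and a matching ChromeDriver are installed; the page URL and the tag-name selector are illustrative):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # no visible browser window, suitable for servers
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)          # give the JS time to render elements
try:
    driver.get("https://finbox.com/NASDAQGS:AAPL")
    # collect every href after the JS has rendered the page
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    print(links)
finally:
    driver.quit()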
I'm trying to scrape the information inside an 'iframe' tag. When I execute this code, it says that 'USER_AGENT' is not defined. How can I fix this?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
The error is telling you clearly what is wrong. You are passing USER_AGENT in as headers, which you have not defined earlier in your code. Take a look at this post on how to use headers with the requests.get method.
The documentation states you must pass in a dictionary of HTTP headers for the request, whereas you have passed in an undefined variable USER_AGENT.
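For instance, a minimal fix (the user-agent string here is just an example) is to define the variable as a dictionary before using it:

import requests

# hypothetical fix: define the missing dictionary first
USER_AGENT = {'User-Agent': 'Mozilla/5.0'}
page = requests.get("https://etherscan.io", headers=USER_AGENT, timeout=5)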
From the Requests Library API:
headers = None
Case-insensitive Dictionary of Response Headers.
For example, headers['content-encoding'] will return the value of a 'Content-Encoding' response header.
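A small illustration of that attribute (any URL works; httpbin is used here just for demonstration):

import requests

r = requests.get("https://httpbin.org/get")
# case-insensitive lookup of a response header
print(r.headers.get("Content-Type"))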
EDIT:
For a better explanation of Content-Type headers, see this SO post. See also this WebMasters post which explains the difference between Accept and Content-Type HTTP headers.
Since you only seem to be interested in scraping the iframe tags, you may simply omit the headers argument entirely and you should see the results if you print out the test object in your code.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", timeout=10)
soup = BeautifulSoup(page.content, "lxml")
test = soup.find_all('iframe')
for tag in test:
    print(tag)
We have to provide a user agent; lists of fake user-agent strings are easy to find online.
import requests
from bs4 import BeautifulSoup
USER_AGENT = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/53'}
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
You can simply NOT use a User-Agent. Code:
import requests
from bs4 import BeautifulSoup
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
I've separated your URL into url and token for readability purposes. That's why there are two variables, url and token.
I want to send a POST request to the page after opening it using Python (using urllib2.urlopen). Webpage is http://wireless.walmart.com/content/shop-plans/?r=wm
Code which I am using right now is:
import urllib
import urllib2

url = 'http://wireless.walmart.com/content/shop-plans/?r=wm'
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)'
values = {'carrierID': '68',
          'conditionToType': '1',
          'cssPrepend': 'wm20',
          'partnerID': '36575'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()

walmart = open('Walmart_ContractPlans_ATT.html', 'wb')
walmart.write(page)
walmart.close()
This gives me the page which opens by default. After inspecting the page using Firebug, I came to know that carrierID: 68 is sent when I click on the button which triggers this POST request.
I want to simulate this browser behaviour.
Please help me in resolving this.
For web scraping I prefer to use requests and pyquery. First you download the data:
import requests
from pyquery import PyQuery as pq
url = 'http://wireless.walmart.com/content/getRatePlanInfo'
payload = {'carrierID':68, 'conditionToType':1, 'cssPrepend':'wm20'}
r = requests.post(url, data=payload)
d = pq(r.text)
After this you proceed to parse the elements, for example to extract all plans:
plans = []
plans_selector = '.wm20_planspage_planDetails_sub_detailsDiv_ul_li'
d(plans_selector).each(lambda i, n: plans.append(pq(n).text()))  # each() fills the plans list via the callback
Result:
['Basic 200',
'Simply Everything',
'Everything Data 900',
'Everything Data 450',
'Talk 450',
...
I recommend looking at a browser emulator like mechanize, rather than trying to do this with raw HTTP requests.
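A rough, untested sketch with mechanize (the form index and control name are assumptions you would confirm with Firebug):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt handling; use responsibly
br.open("http://wireless.walmart.com/content/shop-plans/?r=wm")
br.select_form(nr=0)         # assumes the plans form is the first form on the page
br["carrierID"] = ["68"]     # list-valued for select-type controls
response = br.submit()       # mechanize sends the POST with all hidden fields intact
print(response.read()[:500])

The advantage over raw urllib2 is that mechanize fills in every hidden form field and cookie for you, so you only set the fields you care about.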