I'm currently scraping a web page with Python to get the response code for a button on the page. However, when I inspect the button, its HTML reads as follows:
<div style="cursor: pointer;" onclick="javascript: location.href = '/';" id="TopPromotionMainArea"></div>
I'm quite new to this. Other links on the same page show the full URL after "href=", and with the requests library I'm able to get that full URL. Any idea why the example above has "href = '/'", and is there a way to get the response code for this button?
Site: https://campaigns.avira.com/en/crm/trial/prime
When you open the URL, go to the Elements tab and press Ctrl+F to search for "anchor", you will see the reCAPTCHA API URL, which looks something like this:
https://google.com/recaptcha/api2/anchor?ar=1&k=6LcSJ2kUAAAAAMdjwKdt9HtsmHW2MMWsakI8iiZl&co=aHR0cHM6Ly9jYW1wYWlnbnMuYXZpcmEuY29tOjQ0Mw..&hl=en&v=3TZgZIog-UsaFDv31vC4L9R_&size=invisible&badge=bottomleft&**cb**=a44t3663crd2
Every time you make a request to the site URL, the last "cb" value in the reCAPTCHA URL changes, and I am not able to get the cb value from any request, response, or header.
The only place I can see the changed URL is the Elements tab.
As a first attempt, I tried locating the reCAPTCHA iframe to read its URL, using the code below:
soup = BeautifulSoup(r.content,'lxml')
a = soup.find_all('iframe')
for i in a:
    print(i)
Output:
<iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-WDGLZF" style="display:none;visibility:hidden" width="0"></iframe>
The output confirms that the captcha URL does not exist in the response.
How do I get the captcha URL?
I am trying to download a few hundred HTML pages in order to parse them and calculate some measures.
I tried it with wget on Linux, and with a loop around the following Python code:
url = "https://www.camoni.co.il/411788/168022"
html = urllib.request.urlopen(url).read()
But the HTML file I get doesn't contain all the content I see in the browser for the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.
The problem: I need a large number of pages and cannot do this by hand.
URL example: https://www.camoni.co.il/411788/168022 (the last number changes)
Thank you.
That's because the site is not static. It uses JavaScript (in this case the jQuery library) to fetch additional data from the server and insert it into the page.
So instead of fetching the raw HTML, inspect the requests in the browser's developer tools. The page sends a POST request to https://www.camoni.co.il/ajax/tabberChangeTab with the following data:
tab_name=tab_about
memberAlias=ד-ר-דינה-ראלט-PhD
currentURL=/411788/ד-ר-דינה-ראלט-PhD
The response is an HTML fragment that is then inserted into the page.
So instead of just downloading the page, either inspect the page's requests and fetch the data directly, or use a headless browser such as Google Chrome to render the page and save the result, as "Save As" does.
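As a rough sketch of the first option, the POST request described above can be rebuilt with the standard library. The helper name build_tab_request and its parameters are made up for illustration, and it is an assumption that the server needs no extra headers, cookies, or tokens beyond the three form fields shown:

```python
import urllib.parse
import urllib.request

AJAX_URL = "https://www.camoni.co.il/ajax/tabberChangeTab"


def build_tab_request(member_id, member_alias, tab_name="tab_about"):
    """Build (but do not send) the POST request the page issues for one tab.

    Hypothetical helper: the field names come from the developer tools
    capture above; whether the server needs anything else is an assumption.
    """
    fields = {
        "tab_name": tab_name,
        "memberAlias": member_alias,
        "currentURL": f"/{member_id}/{member_alias}",
    }
    # urlencode percent-encodes non-ASCII text (e.g. Hebrew aliases) as UTF-8.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(AJAX_URL, data=body, method="POST")


req = build_tab_request("411788", "some-member-alias")
# To actually fetch the HTML fragment (network access required):
# html = urllib.request.urlopen(req).read().decode("utf-8")
```

Looping over member IDs and aliases with this helper would replace the manual "Save As" step, provided the endpoint accepts requests made outside the browser.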
I created a program to fill out an HTML webpage form in Selenium, but now I want to change it to requests. However, I've come across a bit of a roadblock. I'm new to requests, and I'm not sure how to emulate a request as if a button had been pressed on the original website. Here's what I have so far -
import requests
import random
emailRandom = ''
for i in range(6):
    add = random.randint(1,10)
    emailRandom += str(add)
payload = {
    'email':emailRandom+'#redacted',
    'state_id':'34',
    'tnc-optin':'on',
}
r = requests.get('redacted.com', data=payload)
The button I'm trying to "click" on the webpage looks like this -
<div class="button-container">
<input type="hidden" name="recaptcha" id="recaptcha">
<button type="submit" class="button red large">ENTER NOW</button>
</div>
What is the default/"clicked" value for this button? Will I be able to use it to submit the form using my requests code?
Using Selenium and using requests are two different things. Selenium drives your browser and submits the form through the rendered HTML UI. Python requests just sends the data from your Python code without any HTML UI, so it does not involve "clicking" the submit button.
The "submit" button in this case merely triggers the browser to POST the form values.
However, the backend will validate the "recaptcha" token, so you will need to work around that.
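A minimal sketch of what "POST the form values" could look like without a browser is below. FORM_URL and the example.com email domain are placeholders, since the real ones are redacted in the question, and the hidden "recaptcha" field is left empty, so a server that validates the token will most likely reject the submission:

```python
import random
import urllib.parse
import urllib.request

FORM_URL = "https://example.com/enter"  # placeholder; the real URL is redacted


def build_payload(recaptcha_token=""):
    """Collect the form fields a browser would send on submit."""
    digits = "".join(str(random.randint(0, 9)) for _ in range(6))
    return {
        "email": digits + "@example.com",  # placeholder domain
        "state_id": "34",
        "tnc-optin": "on",
        # Hidden input from the form; a valid token has to come from
        # somewhere else, this sketch does not obtain one.
        "recaptcha": recaptcha_token,
    }


def build_form_request(payload):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(payload).encode("utf-8")
    return urllib.request.Request(FORM_URL, data=body, method="POST")


request = build_form_request(build_payload())
# Sending it would be: urllib.request.urlopen(request)
```

With the requests library, the equivalent call is requests.post(FORM_URL, data=payload). Note that the button itself has no name attribute, so it contributes no value to the submitted data.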
I recommend capturing the requests with Fiddler:
https://www.telerik.com/fiddler
Then recreate them in your code.
James's answer using Selenium is slower than this.
I am trying to get the value of VIX from a webpage.
The code I am using:
raw_page = requests.get("https://www.nseindia.com/live_market/dynaContent/live_watch/vix_home_page.htm").text
soup = BeautifulSoup(raw_page, "lxml")
vix = soup.find("span",{"id":"vixIdxData"})
print(vix.text)
This gives me:
' '
If I print vix, I get:
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;"></span>
On the site, the element has text:
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">15.785</span>
The 15.785 value is what I want to get using requests.
The data you're looking for is not available in the page source, and requests.get(...) gets you only the page source, without the elements that are added dynamically through JavaScript. You can still get it using the requests module, though.
In the Network tab of the developer tools, you can see a file named VixDetails.json. The page sends a request to https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json, which returns the data as JSON.
You can decode it using the built-in .json() method of the response object:
r = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json')
data = r.json()
vix_price = data['currentVixSnapShot'][0]['CURRENT_PRICE']
print(vix_price)
# 15.7000
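If the JSON layout ever changes, the chained indexing above raises a KeyError or IndexError. A slightly more defensive sketch is shown here. SAMPLE is a hand-written stand-in that contains only the two keys used above, not a real response, and extract_vix_price is a hypothetical helper name:

```python
import json

# Hand-written stand-in with only the keys used above; a real response
# presumably contains more fields.
SAMPLE = '{"currentVixSnapShot": [{"CURRENT_PRICE": "15.7000"}]}'


def extract_vix_price(payload):
    """Return the current VIX price as a float, or None if it is missing."""
    snapshots = payload.get("currentVixSnapShot") or []
    if not snapshots:
        return None
    price = snapshots[0].get("CURRENT_PRICE")
    return float(price) if price is not None else None


vix_price = extract_vix_price(json.loads(SAMPLE))
print(vix_price)
```

With a live response, the call would be extract_vix_price(r.json()). The float conversion assumes the price carries no thousands separators.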
When you open the page in a web browser, the text (e.g., 15.785) is inserted into the span element by the getIndiaVixData.js script.
When you get the page using requests in Python, only the HTML code is retrieved and no JavaScript processing is done. So, the span element stays empty.
It is impossible to get that data by solely parsing the HTML code of the page using requests.