I am attempting to scrape flow.gassco.no as one of my first Python projects. I need to bypass the splash screen, which redirects to the main page. I have isolated the following form:
<form method="get" action="acceptDisclaimer">
<input type="submit" value="Accept"/>
<input type="button" name="decline" value="Decline" onclick="window.location = 'http://www.gassco.no'" />
</form>
In a browser, appending 'acceptDisclaimer?' to the URL redirects to the target flow.gassco.no. However, if I try to replicate this with urllib2, I appear to stay on the same page when I output the source.
import urllib, urllib2
url="http://flow.gassco.no/acceptDisclaimer?"
url2="http://flow.gassco.no/"
#first pass to invoke disclaimer
req=urllib2.Request(url)
res=urllib2.urlopen(req)
#second pass to access main page
req1=urllib2.Request(url2)
res2=urllib2.urlopen(req1)
data=res2.read()
print data
I suspect that I have oversimplified the problem, but I would appreciate any input on how I can accept the disclaimer and then output the main page source.
Use a cookiejar. See python: urllib2 how to send cookie with urlopen request.
Open the main URL first.
Open /acceptDisclaimer after that.
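A minimal sketch of that approach, keeping the Python 2 urllib2 style of the question (the ordering of the two requests follows this answer):
import urllib2, cookielib

# Build an opener that remembers cookies between requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# First pass: load the main page so the site can set its session cookie
opener.open("http://flow.gassco.no/")

# Second pass: accept the disclaimer within the same session
opener.open("http://flow.gassco.no/acceptDisclaimer?")

# The main page should now render past the splash screen
data = opener.open("http://flow.gassco.no/").read()
print data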
Related
I wrote a bot that used Selenium to scrape all the data I needed and perform a few simple tasks. I don't know why I didn't use HTTP requests from the start, but I am now trying to switch to that. One of the Selenium functions used a simple driver.get(url) to trigger an action on the site. Using requests.get, however, does not work.
This Selenium code worked:
import time
from selenium import webdriver

AM4_URL = 'https://www.airline4.net/?gameType=app&uid=102692112805909972638&uid_token=8adee69e774d89fb6e9f903e7d2afc70&mail=bsgpricecheck#gmail.com&mail_token=286f8bd25bcc32f49a02036102ce072c&device=ios&version=6&FCM=daf5d0d8bf4d7962061eac3a8e4bffa770d6593f31fd5b070d690f244dfb40d1#'

def depart():
    # Load driver and get login url
    if pax_rep > 80:
        driver = webdriver.Firefox(executable_path='C:\webdrivers\geckodriver.exe')
        driver.get(AM4_URL)
        driver.minimize_window()
        driver.get("https://www.airline4.net/route_depart.php?mode=all&ids=x")
        time.sleep(100)

def randfunc():
    depart()
But now I'm trying to switch over to requests because all the other bot functions work with it. I tried this, and it doesn't perform the action:
import requests
# I was able to combine the URLs into one. It still performs the action when on a browser.
dep_url = 'https://www.airline4.net/route_depart.php?mode=all&ids=x?gameType=app&uid=102692112805909972638&uid_token=8adee69e774d89fb6e9f903e7d2afc70&mail=bsgpricecheck#gmail.com&mail_token=286f8bd25bcc32f49a02036102ce072c&device=ios&version=6&FCM=daf5d0d8bf4d7962061eac3a8e4bffa770d6593f31fd5b070d690f244dfb40d1#'
requests.get(dep_url)
I figured this code would work because the URL doesn't return any content; I thought the GET request itself acted as the command.
I would also like to note that I got the route_depart.php URL from an Ajax button.
Here's the HTML for that button:
<div class="btn-group d-flex" role="group">
<button class="btn" style="display:none;" onclick="Ajax('def227_j22.php','runme');"></button>
<button class="btn w-100 btn-danger btn-xs" onclick="Ajax('route_depart.php?mode=all&ids=x','runme',this);">
<span class="glyphicons glyphicons-plane"></span> Depart <span id="listDepartAmount">5</span></button>
</div>
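For what it's worth, here is a minimal sketch of how the same two-step flow might look with requests, under the assumption that the depart action only works once the session cookies from the login URL are present (which the Selenium code gets for free), and keeping the two URLs separate instead of joining them with a second '?':
import requests

# AM4_URL is the same login URL used in the Selenium code above
AM4_URL = 'https://www.airline4.net/?gameType=app&uid=...'   # tokens elided; copy the full URL from above
DEP_URL = 'https://www.airline4.net/route_depart.php?mode=all&ids=x'

with requests.Session() as s:   # a Session carries cookies across requests
    s.get(AM4_URL)              # step 1: load the login URL to pick up session cookies
    r = s.get(DEP_URL)          # step 2: trigger the depart action in the same session
    print(r.status_code)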
I'm trying to trigger the download of a CSV using requests-html. I believe that when the button is clicked it triggers export_csv.php:
<button class='tiny' type='submit' name='action' value='export' style='width:200px;'>Export All Fields</button>
</form>
<form name='export' method='POST' action='../export_csv.php'>
I'm just not sure how to trigger the PHP file with Python. I don't have to do it with requests if there's a better way, but I would like to avoid using Selenium if possible.
I'd share the URL, but it's an internal resource and not available on the web.
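A minimal sketch of what a plain requests call might look like, assuming the export button submits name='action', value='export' to ../export_csv.php and that no extra authentication is needed; the page URL below is a placeholder since the real one is internal:
import requests
from urllib.parse import urljoin

PAGE_URL = "http://intranet.example/reports/list.php"   # placeholder: the page that contains the export form

with requests.Session() as s:
    s.get(PAGE_URL)   # pick up any session cookies the page sets

    # The form POSTs to ../export_csv.php relative to the page, with the
    # submit button contributing action=export
    export_url = urljoin(PAGE_URL, "../export_csv.php")
    r = s.post(export_url, data={"action": "export"})

    with open("export.csv", "wb") as f:
        f.write(r.content)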
I created a program to fill out an HTML webpage form in Selenium, but now I want to change it to requests. However, I've come across a bit of a roadblock. I'm new to requests, and I'm not sure how to emulate a request as if a button had been pressed on the original website. Here's what I have so far -
import requests
import random

emailRandom = ''
for i in range(6):
    add = random.randint(1, 10)
    emailRandom += str(add)

payload = {
    'email': emailRandom + '#redacted',
    'state_id': '34',
    'tnc-optin': 'on',
}

r = requests.get('redacted.com', data=payload)
The button I'm trying to "click" on the webpage looks like this -
<div class="button-container">
<input type="hidden" name="recaptcha" id="recaptcha">
<button type="submit" class="button red large">ENTER NOW</button>
</div>
What is the default/"clicked" value for this button? Will I be able to use it to submit the form using my requests code?
Using Selenium and using requests are two different things: Selenium uses your browser to submit the form via the rendered HTML UI, while requests just submits the data from your Python code without the HTML UI; it does not involve "clicking" the submit button.
The "submit" button in this case merely triggers the browser to POST the form values (see the sketch below).
However, the backend will also validate the "recaptcha" token, so you will need to work around that.
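A minimal sketch of that POST; the form URL and email domain are placeholders, since the real ones are redacted in the question, and the hidden recaptcha field still needs a valid token for the backend to accept the submission:
import requests
import random

FORM_URL = 'https://redacted.example/form'   # placeholder: the real form's action URL is redacted

emailRandom = ''.join(str(random.randint(1, 10)) for _ in range(6))

payload = {
    'email': emailRandom + '@example.com',   # placeholder domain
    'state_id': '34',
    'tnc-optin': 'on',
    'recaptcha': '<valid-recaptcha-token>',  # hidden field; the backend rejects submissions without a real token
}

# A browser "submit" is just a POST of the form fields, so use requests.post rather than requests.get
r = requests.post(FORM_URL, data=payload)
print(r.status_code)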
I recommend capturing the requests with Fiddler (https://www.telerik.com/fiddler) and then recreating them. James's answer using Selenium is slower than this.
I'm currently scraping a webpage with Python to get a response code from a button within the page. However, when inspecting the element for this button, the HTML reads as follows:
<div style="cursor: pointer;" onclick="javascript: location.href = '/';" id="TopPromotionMainArea"></div>
I'm quite new to this. Other links within the same page show the full URL after "href=", and with the requests library I'm able to get the full URL. Any idea why the example above has href='/', and is there a way I can get the response code for this button?
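As an aside (this is an assumption about the markup, not stated in the question): '/' is simply a relative URL pointing at the site root, so it can be resolved against the page's own URL before requesting it. A minimal sketch, with a placeholder page URL:
import requests
from urllib.parse import urljoin

page_url = "https://www.example.com/some/page"   # placeholder for the page being scraped

# The onclick sets location.href = '/', i.e. a relative link to the site root;
# resolve it against the page URL to get an absolute URL
target = urljoin(page_url, "/")                  # -> "https://www.example.com/"

r = requests.get(target)
print(r.status_code)                             # response code for the button's destination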
I'm trying to submit a form on an .asp page but Mechanize does not recognize the name of the control. The form code is:
<form id="form1" name="frmSearchQuick" method="post">
....
<input type="button" name="btSearchTop" value="SEARCH" class="buttonctl" onClick="uf_Browse('dledir_search_quick.asp');" >
My code is as follows:
br = mechanize.Browser()
br.open(BASE_URL)
br.select_form(name='frmSearchQuick')
resp = br.click(name='btSearchTop')
I've also tried the last line as:
resp = br.submit(name='btSearchTop')
The error I get is:
raise ControlNotFoundError("no control matching "+description)
ControlNotFoundError: no control matching name 'btSearchTop', kind 'clickable'
If I print br I get this: IgnoreControl(btSearchTop=)
But I don't see that anywhere in the HTML.
Any advice on how to submit this form?
The button doesn't submit the form - it calls a JavaScript function.
Mechanize can't run JavaScript, so you can't use it to click that button.
The easy way out is to read that function yourself and see what it does - if it just submits the form, then you may be able to get around it by submitting the form without clicking anything (see the sketch below).
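A minimal sketch of that workaround, under the assumption that uf_Browse('dledir_search_quick.asp') only points the form at that .asp page and submits it (check the actual JavaScript to confirm); the base URL and field name below are placeholders:
import mechanize
from urlparse import urljoin   # Python 2, matching the code in the question

BASE_URL = "http://example.com/"   # placeholder: use the real BASE_URL from the question

br = mechanize.Browser()
br.open(BASE_URL)
br.select_form(name='frmSearchQuick')

# Fill in whatever search fields the form actually has, e.g.
# br['txtSearch'] = 'keyword'    # hypothetical field name

# Mimic what uf_Browse() appears to do: repoint the form at the
# .asp handler and submit it without clicking the button
br.form.action = urljoin(br.geturl(), 'dledir_search_quick.asp')
resp = br.submit()
print resp.read()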
You need to inspect the page first: did Mechanize recognize the form at all?
for form in br.forms():
    print form