I want to scrape forms on Just-Eat, but the form doesn't seem to exist!
I use this code:
br.open("https://www.just-eat.fr/")
form = br.get_forms()
but it doesn't detect any form! Yet when you look at the page source, there is a form:
<form class="search-form autocomplete-target" action="#" id="geolocate_form_home">
I don't know how to make it detectable. Does anyone have an idea?
Thanks a lot !
The server sends a page containing only an <iframe> with a message about blocking for security reasons.
The first problem is the User-Agent header. By default Python sends python-requests/2.21.0, but the server may require a User-Agent used by a real browser, for example Firefox on Linux:
br = robobrowser.RoboBrowser(user_agent='Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0')
Even then, the server can still send the page with the <iframe> and the message.
But if I load the same URL again, it loads the correct page.
Probably the session now has all the needed cookies, so the server no longer causes problems.
If you want, you can also load the page from the <iframe> to behave more like a real human.
import robobrowser
br = robobrowser.RoboBrowser(user_agent='Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0')
br.parser = 'lxml'
br.open("https://www.just-eat.fr")
#print(br.parsed)
print(br.get_forms())
#iframe_src = br.select('iframe')[0]['src']
#print(iframe_src)
#br.open("https://www.just-eat.fr"+iframe_src)
#print(br.parsed)
br.open("https://www.just-eat.fr")
#print(br.parsed)
print(br.get_forms())
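To tell which of the two pages the server actually returned, it may help to count the <form> and <iframe> tags in the HTML. The following is only a sketch using the standard library's html.parser; the names TagCollector and inspect_page are hypothetical, the "blocked" page is a made-up stand-in, and the "real" page just wraps the form tag quoted in the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the attributes of every <form> and <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms.append(dict(attrs))
        elif tag == "iframe":
            self.iframes.append(dict(attrs))


def inspect_page(html):
    """Return (forms, iframes) found in the given HTML text."""
    collector = TagCollector()
    collector.feed(html)
    return collector.forms, collector.iframes


# Hypothetical stand-in for the blocking page: an <iframe> and no form.
blocked_page = (
    '<html><body><iframe src="/blocked-placeholder"></iframe></body></html>'
)

# The form tag quoted in the question, wrapped in a minimal page.
real_page = (
    "<html><body>"
    '<form class="search-form autocomplete-target" action="#" '
    'id="geolocate_form_home"></form>'
    "</body></html>"
)

for name, page in (("blocked", blocked_page), ("real", real_page)):
    forms, iframes = inspect_page(page)
    print(name, "forms:", len(forms), "iframes:", len(iframes))
```

With RoboBrowser, the same check could presumably be run on str(br.parsed) after each br.open call.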
I'm trying to automate log-in into Costco.com to check some member only prices.
I used dev tool and the Network tab to identify the request that handles the Logon, from which I inferred the POST URL and the parameters.
Code looks like:
import requests
s = requests.session()
payload = {'logonId': 'email#email.com',
'logonPassword': 'mypassword'
}
#get this data from Google-ing "my user agent"
user_agent = {"User-Agent" : "myusergent"}
url = 'https://www.costco.com/Logon'
response = s.post(url, headers=user_agent,data=payload)
print(response.status_code)
When I run this, it just runs and never returns anything. I waited 5 minutes and it was still running.
What am I doing wrong?
Maybe you should first make a GET request to pick up some cookies before making the POST request. If the POST request still doesn't work, add a timeout so the script stops and you know that it failed.
r = requests.get(w, verify=False, timeout=10)
This one is tough. Usually, in order to set the proper cookies, a get request to the url is first required. We can go directly to https://www.costco.com/LogonForm so long as we change the user agent from the default python requests one. This is accomplished as follows:
import requests
agent = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/85.0.4183.102 Safari/537.36"
)
with requests.Session() as s:
    headers = {'user-agent': agent}
    s.headers.update(headers)
    logon = s.get('https://www.costco.com/LogonForm')
    # Saved the cookies in variable, explanation below
    cks = s.cookies
The logon GET request is successful, i.e. status code 200. Taking a look at cks:
print(sorted([c.name for c in cks]))
['C_LOC',
'CriteoSessionUserId',
'JSESSIONID',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'_abck',
'ak_bmsc',
'akaas_AS01',
'bm_sz',
'client-zip-short']
Then, opening the Network tab in Chrome's inspector and clicking login yields the following form data for the login POST (place this below cks):
data = {'logonId': username,
'logonPassword': password,
'reLogonURL': 'LogonForm',
'isPharmacy': 'false',
'fromCheckout': '',
'authToken': '-1002,5M9R2fZEDWOZ1d8MBwy40LOFIV0=',
'URL':'Lw=='}
login = s.post('https://www.costco.com/Logon', data=data, allow_redirects=True)
However, simply trying this makes the request sit there and redirect indefinitely.
Using Burp Suite, I stepped into the POST and found the request as sent by the browser. It carries many more cookies than the initial GET request obtained.
Quite a few more, in fact:
# cookies is equal to the curl from burp, then converted curl to python req
sorted(cookies.keys())
['$JSESSIONID',
'AKA_A2',
'AMCVS_97B21CFE5329614E0A490D45%40AdobeOrg',
'AMCV_97B21CFE5329614E0A490D45%40AdobeOrg',
'C_LOC',
'CriteoSessionUserId',
'OptanonConsent',
'RT',
'WAREHOUSEDELIVERY_WHS',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'WRIgnore',
'WRUIDCD20200731',
'__CT_Data',
'_abck',
'_cs_c',
'_cs_cvars',
'_cs_id',
'_cs_s',
'_fbp',
'ajs_anonymous_id_2',
'ak_bmsc',
'akaas_AS01',
'at_check',
'bm_sz',
'client-zip-short',
'invCheckPostalCode',
'invCheckStateCode',
'mbox',
'rememberedLogonId',
's_cc',
's_sq',
'sto__count',
'sto__session']
Most of these look to be static; however, because there are so many, it's hard to tell which is which and what each is supposed to be. This is where I get stuck myself, and I am really curious how this would be accomplished. In some of the cookie data I can also see some sort of IBM Commerce information, so I am linking Prevent Encryption (Krypto) Of Url Paramaters in IBM Commerce Server 6, as it's the only other SO question even remotely relevant to this.
Essentially, though, the steps would be to determine the proper cookies to pass for this POST (and then the proper cookies and info for the redirect!). I believe some of these are set by JavaScript, since they are not in the GET response from the site. Sorry I can't be more help here.
If you absolutely need to log in, try using Selenium, as it simulates a browser. Otherwise, if you just want to check whether an item is in stock, this guide uses requests and doesn't need to be logged in: https://aryaboudaie.com/python/technical/educational/2020/07/05/using-python-to-buy-a-gift.html
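As a first step toward working out which cookies matter, the two name lists above can be compared with plain set operations. This is only a sketch: the names are copied from the two listings above, the variable names are made up for illustration, and it says nothing about which of the missing cookies the server actually checks.

```python
# Cookie names returned by the initial GET (first list above).
session_names = {
    "C_LOC", "CriteoSessionUserId", "JSESSIONID", "WC_ACTIVEPOINTER",
    "WC_AUTHENTICATION_-1002", "WC_GENERIC_ACTIVITYDATA", "WC_PERSISTENT",
    "WC_SESSION_ESTABLISHED", "WC_USERACTIVITY_-1002", "_abck", "ak_bmsc",
    "akaas_AS01", "bm_sz", "client-zip-short",
}

# Cookie names sent by the browser with the login POST (second list above).
browser_names = {
    "$JSESSIONID", "AKA_A2", "AMCVS_97B21CFE5329614E0A490D45%40AdobeOrg",
    "AMCV_97B21CFE5329614E0A490D45%40AdobeOrg", "C_LOC",
    "CriteoSessionUserId", "OptanonConsent", "RT", "WAREHOUSEDELIVERY_WHS",
    "WC_ACTIVEPOINTER", "WC_AUTHENTICATION_-1002", "WC_GENERIC_ACTIVITYDATA",
    "WC_PERSISTENT", "WC_SESSION_ESTABLISHED", "WC_USERACTIVITY_-1002",
    "WRIgnore", "WRUIDCD20200731", "__CT_Data", "_abck", "_cs_c",
    "_cs_cvars", "_cs_id", "_cs_s", "_fbp", "ajs_anonymous_id_2", "ak_bmsc",
    "akaas_AS01", "at_check", "bm_sz", "client-zip-short",
    "invCheckPostalCode", "invCheckStateCode", "mbox", "rememberedLogonId",
    "s_cc", "s_sq", "sto__count", "sto__session",
}

# Cookies the browser sends that the requests session never received.
missing = sorted(browser_names - session_names)
# Cookies present in both; their values could still differ.
shared = sorted(browser_names & session_names)

print("only in browser:", missing)
print("in both:", shared)
```

Many of the browser-only names look like analytics or consent cookies, which fits the guess that they are set by JavaScript rather than by the server's response, but that would need to be confirmed by testing.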
My goal is to scrape data from consumerreports.com, so I am utilizing 'requests' and 'beautifulsoup' for this project. Webscraping aside, I am having a lot of trouble successfully logging in on consumerreports.com through requests.
Here is my code: I created two text files in which I write the post and response, so I can check if it successfully logged in.
import requests
import os.path
#declares any necessary variables
#file1, file2 to check if login is successful
save_path = '/Users/myName/Documents/Webscraping Project/'
login_url = 'https://www.consumerreports.org/cro/index.htm'
my_url = 'https://www.consumerreports.org/cro/index.htm'
pName = os.path.join(save_path, 'post text file'+".txt")
rName = os.path.join(save_path, 'response text file'+".txt")
post_file = open(pName, "w")
response_file = open(rName, "w")
#login using Session class from Requests package
with requests.Session() as s:
    payload = {"userName":"myName#university.edu","password":"my_password"}
    p = s.post(login_url, data=payload)
    print(p.text)
    r = s.get(my_url)
#saves files to see if login was successful
post_file.write(str(p.text.encode('utf-8')))
response_file.write(str(r.text.encode('utf-8')))
post_file.close()
response_file.close()
print('Files created.')
This is what I got:
<!DOCTYPE html>
<html>
<head>
<title>405 Not allowed.</title>
</head>
<body>
<h1>Error 405 Not allowed.</h1>
<p>Not allowed.</p>
<h3>Guru Meditation:</h3>
<p>XID: #some number </p>
<hr>
<p>Varnish cache server</p>
</body>
</html>
In addition, I checked the contents of the 'response text file.txt', and was able to determine through basic ctrl+f function that the system had not successfully logged in.
It seems that the web server does not accept the 'post' method, at least for this particular url, and that is why it's returning the error. However, I don't know how to proceed from here. I looked online, and someone suggested using
response = requests.get(login_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'})
to create a user agent to "log in" or whatever. I'm still fairly new to python, so any advice will be appreciated.
You may need to add headers in s.post. There is a solution to this error here. It worked for me. Hope this helps.
The reason for this is that the sign-in form is created via JavaScript. As the login form is added to the DOM as a result of a click event, it doesn't exist when you execute the request. All requests does is get the existing content from the page. If the URL changed to reflect the state (displaying the login form), then you could use that, but it doesn't.
What you need to do is use a headless browser (Chrome or Firefox in headless mode) combined with a library like Selenium. You can load the site in the headless browser and write code using Selenium to interact with it. However, this is significantly more challenging to implement.
Python noob here. I'm trying to extract a link, specifically the link to 'all reviews' on an Amazon product page. I get an unexpected result.
import urllib2
req = urllib2.Request('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/ref=sr_1_1?s=hpc&ie=UTF8&qid=1342922857&sr=1-1&keywords=truth')
response = urllib2.urlopen(req)
page = response.read()
start = page.find("all reviews")
link_start = page.find("href=", start) + 6
link_end = page.find('"', link_start)
print page[link_start:link_end]
The program should output:
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8/ref=dp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=1
Instead, it outputs:
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8
I get the same result you do, but that appears to be simply because the page Amazon serves to your Python script is different from what it serves to my browser. I wrote the downloaded page to disk and loaded it in a text editor, and sure enough, the link ends with ADT8" without all the /ref=dp_top stuff.
In order to help convince Amazon to serve you the same page as a browser, your script is probably going to have to act a lot more like a browser (by accepting and sending cookies, for example). The mechanize module can help with this.
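As a rough illustration of "accepting and sending cookies" without mechanize, the standard library can build an opener that keeps a cookie jar and sends a browser-like User-Agent. This sketch uses Python 3's urllib.request (the successor to the urllib2 module used in the question); the function name make_browser_like_opener and the particular User-Agent string are just illustrative choices, and whether this is enough to get the browser's version of the page is not guaranteed.

```python
import http.cookiejar
import urllib.request

# Example browser-like User-Agent string; any real browser's value could be used.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110506 Firefox/4.0.1"
)


def make_browser_like_opener(user_agent=USER_AGENT):
    """Return (opener, jar): an opener that stores cookies and sends a UA."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_browser_like_opener()

# A real fetch would look like this (not run here, it needs network access):
# page = opener.open(url, timeout=10).read()
# Cookies set by the response then accumulate in `jar` and are sent back
# automatically on later opener.open() calls.
```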
Ah, okay. If you do the usual trick of faking a user agent, for example:
req = urllib2.Request('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/ref=sr_1_1?s=hpc&ie=UTF8&qid=1342922857&sr=1-1&keywords=truth')
ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110506 Firefox/4.0.1'
req.add_header('User-Agent', ua)
response = urllib2.urlopen(req)
then you should get something like
localhost-2:coding $ python plink.py
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8/ref=dp_top_cm_cr_acr_txt/190-6179299-9485047?ie=UTF8&showViewpoints=1
which might be closer to what you want.
[Disclaimer: be sure to verify that Amazon's TOS rules permit whatever you're going to do before you do it..]
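Separately from the user-agent issue, the string-search extraction in the question could be made a little less fragile by parsing the anchors instead of searching for "href=". Below is a minimal sketch with the standard library's html.parser (Python 3); the LinkFinder class, the find_link helper, and the sample HTML are all made up for illustration and do not reflect Amazon's actual markup.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Record the href of the first <a> whose text contains a phrase."""

    def __init__(self, phrase):
        super().__init__()
        self.phrase = phrase.lower()
        self.current_href = None   # href of the <a> being read, if any
        self.current_text = []     # text pieces seen inside that <a>
        self.match = None          # first matching href found

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).lower()
            if self.match is None and self.phrase in text:
                self.match = self.current_href
            self.current_href = None


def find_link(html, phrase):
    """Return the href of the first link whose text contains phrase, or None."""
    finder = LinkFinder(phrase)
    finder.feed(html)
    return finder.match


# Made-up snippet standing in for the downloaded page.
sample = (
    '<div><a href="/help">Help</a> '
    '<a href="http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/'
    'product-reviews/B000A0ADT8">See all reviews</a></div>'
)

print(find_link(sample, "all reviews"))
```

This only changes how the link is located; it will still return whatever URL the server chose to put in the page it sent.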
I'm trying to programmatically retrieve editing history pages from the MusicBrainz website. (musicbrainzngs is a library for the MB web service, and editing history is not accessible from the web service). For this, I need to login to the MB website using my username and password.
I've tried using the mechanize module, and using the login page second form (first one is the search form), I submit my username and password; from the response, it seems that I successfully login to the site; however, a further request to an editing history page raises an exception:
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
I understand the exception and the reason for it. I take full responsibility for not abusing the site (after all, any usage will be tagged with my username), I just want to avoid manually opening a page, saving the HTML and running a script on the saved HTML. Can I overcome the 403 error?
The better solution is to respect the robots.txt file and simply download the edit data itself rather than screen-scraping MusicBrainz. You can download the complete edit history here:
ftp://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport
Look for the file mbdump-edit.tar.bz2.
And, as the leader of the MusicBrainz team, I would like to ask you to respect robots.txt and download the edit data. That's one of the reasons why we make the edit data downloadable.
Thanks!
If you want to circumvent the site's robots.txt, you can achieve this by telling your mechanize.Browser to ignore the robots.txt file.
br = mechanize.Browser()
br.set_handle_robots(False)
Additionally, you might want to alter your browser's user agent so you don't look like a robot:
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
Please be aware that when doing this, you're actually tricking the website into thinking you're a valid client.
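Before switching robots.txt handling off entirely, it may be worth checking which paths the rules actually disallow. The standard library's urllib.robotparser (Python 3) can evaluate rules offline; in this sketch the rules, the example.org URLs, and the is_allowed helper are hypothetical and are not the real MusicBrainz robots.txt.

```python
from urllib import robotparser

# Hypothetical rules for illustration; not the real MusicBrainz robots.txt.
rules = """\
User-agent: *
Disallow: /user/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse rules from lines instead of fetching them


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.org/user/someone/edits"))
print(is_allowed("https://example.org/doc/About"))
```

For a live site, RobotFileParser.set_url followed by read would fetch the real file instead of using a hard-coded list of lines.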
I am building a web scraper and need to get the html page source as it actually appears on the page. However, I only get a limited html source, one that does not include the needed info. I think that I am either seeing it pre javascript loaded or else maybe I'm not getting the full info because I don't have the right authentication?? My result is the same as "view source" in Chrome when what I want is what Chrome's 'inspect element' shows. My test is cimber.dk after entering flight information and searching.
I am coding in python and tried the urllib2 library. Then I heard that Selenium was good for this so I tried that, too. However, that also gets me the same limited page source.
This is what I tried with urllib2 after using Firebug to see the parameters. (I deleted all my cookies after opening cimber.dk so I was starting with a 'clean slate')
url = 'https://www.cimber.dk/booking/'
values = {'ARRANGE_BY' : 'D',...} #one for each value
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
#Using HTTPRedirectHandler instead of HTTPCookieProcessor gives the same.
urllib2.install_opener(opener)
request = urllib2.Request(url)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0')]
request.add_header(....) # one for each header, also the cookie one
p = urllib.urlencode(values)
data = opener.open(request, p).read()
# data is now the limited source, like Chrome View Source
#I tried to add the following in some vain attempt to do a redirect.
#The result is always "HTTP Error 400: Bad request"
f = opener.open('https://wftc2.e-travel.com/plnext/cimber/Override.action')
data = f.read()
f.close()
Most libraries like this do not support javascript.
If you want javascript, you will need to either automate an existing browser or browser engine, or get a really monolithic big beefy library that is essentially an advanced web crawler.