I'm trying to programmatically retrieve editing history pages from the MusicBrainz website. (musicbrainzngs is a library for the MB web service, and editing history is not accessible from the web service.) For this, I need to log in to the MB website using my username and password.
I've tried the mechanize module: using the second form on the login page (the first one is the search form), I submit my username and password. From the response, it seems that I log in successfully; however, a further request to an editing history page raises an exception:
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
I understand the exception and the reason for it. I take full responsibility for not abusing the site (after all, any usage will be tagged with my username). I just want to avoid manually opening a page, saving the HTML, and running a script on the saved HTML. Can I overcome the 403 error?
The better solution is to respect the robots.txt file and simply download the edit data itself rather than screen scrape MusicBrainz. You can download the complete edit history here:
ftp://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport
Look for the file mbdump-edit.tar.bz2.
And, as the leader of the MusicBrainz team, I would like to ask you to respect robots.txt and download the edit data. That's one of the reasons why we make the edit data downloadable.
Thanks!
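Once mbdump-edit.tar.bz2 is on disk, Python's tarfile module can inspect it without unpacking everything first. This is only a minimal sketch: the function name and the local file path are assumptions, and the internal layout of the archive is not described in this answer.

```python
import tarfile


def list_archive_members(path):
    """Return (name, size) for every regular file in a .tar.bz2 archive."""
    with tarfile.open(path, mode="r:bz2") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


if __name__ == "__main__":
    # Assumed local path to the downloaded dump.
    for name, size in list_archive_members("mbdump-edit.tar.bz2"):
        print(name, size)
```

Listing the members first makes it easier to extract only the files you actually need from a large dump.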
If you want to circumvent the site's robots.txt, you can achieve this by telling your mechanize.Browser to ignore the robots.txt file.
br = mechanize.Browser()
br.set_handle_robots(False)
Additionally, you might want to alter your browser's user agent so you don't look like a robot:
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
Please be aware that when doing this, you're actually tricking the website into thinking you're a valid client.
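Before switching robots handling off, it can help to see exactly which paths the rules cover. The sketch below uses the standard library's urllib.robotparser on a rules string; the sample rules and URLs are made up for illustration and are not the real MusicBrainz robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_RULES, "my-script", "https://example.org/public/page"))
    print(is_allowed(SAMPLE_RULES, "my-script", "https://example.org/private/page"))
```

For a live site, RobotFileParser can also be given the robots.txt URL with set_url() and loaded with read() instead of parsing a string.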
I'm trying to automate log-in into Costco.com to check some member only prices.
I used dev tool and the Network tab to identify the request that handles the Logon, from which I inferred the POST URL and the parameters.
Code looks like:
import requests
s = requests.session()
payload = {'logonId': 'email#email.com',
'logonPassword': 'mypassword'
}
#get this data from Google-ing "my user agent"
user_agent = {"User-Agent" : "myusergent"}
url = 'https://www.costco.com/Logon'
response = s.post(url, headers=user_agent,data=payload)
print(response.status_code)
When I run this, it just runs and runs and never returns anything. I waited 5 minutes and it was still running.
What am I doing wrong?
Maybe you should first make a GET request to pick up some cookies before making the POST request. If the POST request still doesn't work, add a timeout so the script stops and you know that it failed.
r = requests.get(url, verify=False, timeout=10)
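The timeout idea can be wrapped in a small helper so that a hanging request turns into a clear failure instead of an endless wait. This sketch deliberately swaps in the standard library's urllib so it has no third-party dependency; requests accepts a timeout keyword in the same way, as in the line above. The function name is made up for illustration.

```python
import urllib.request


def fetch_with_timeout(url, timeout=10.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        print("request failed:", exc)
        return None
```

A None result after ten seconds at least tells you the server is stalling the request, which points at the missing cookies or headers rather than at your own code.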
This one is tough. Usually, in order to set the proper cookies, a get request to the url is first required. We can go directly to https://www.costco.com/LogonForm so long as we change the user agent from the default python requests one. This is accomplished as follows:
import requests
agent = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/85.0.4183.102 Safari/537.36"
)
with requests.Session() as s:
    headers = {'user-agent': agent}
    s.headers.update(headers)
    logon = s.get('https://www.costco.com/LogonForm')
    # Saved the cookies in variable, explanation below
    cks = s.cookies
The logon GET request is successful, i.e. status code 200! Taking a look at cks:
print(sorted([c.name for c in cks]))
['C_LOC',
'CriteoSessionUserId',
'JSESSIONID',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'_abck',
'ak_bmsc',
'akaas_AS01',
'bm_sz',
'client-zip-short']
Then, inspecting the network traffic in Google Chrome while clicking login yields the following form data for the login POST. (Place this below cks.)
data = {'logonId': username,
'logonPassword': password,
'reLogonURL': 'LogonForm',
'isPharmacy': 'false',
'fromCheckout': '',
'authToken': '-1002,5M9R2fZEDWOZ1d8MBwy40LOFIV0=',
'URL':'Lw=='}
login = s.post('https://www.costco.com/Logon', data=data, allow_redirects=True)
However, simply trying this makes the request just sit there and infinitely redirect.
Using Burp Suite, I stepped into the POST and found the request as sent by the browser. This POST carries many more cookies than were obtained in the initial GET request.
Quite a few more in fact
# cookies is equal to the curl from burp, then converted curl to python req
sorted(cookies.keys())
['$JSESSIONID',
'AKA_A2',
'AMCVS_97B21CFE5329614E0A490D45%40AdobeOrg',
'AMCV_97B21CFE5329614E0A490D45%40AdobeOrg',
'C_LOC',
'CriteoSessionUserId',
'OptanonConsent',
'RT',
'WAREHOUSEDELIVERY_WHS',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'WRIgnore',
'WRUIDCD20200731',
'__CT_Data',
'_abck',
'_cs_c',
'_cs_cvars',
'_cs_id',
'_cs_s',
'_fbp',
'ajs_anonymous_id_2',
'ak_bmsc',
'akaas_AS01',
'at_check',
'bm_sz',
'client-zip-short',
'invCheckPostalCode',
'invCheckStateCode',
'mbox',
'rememberedLogonId',
's_cc',
's_sq',
'sto__count',
'sto__session']
Most of these look to be static; however, because there are so many, it's hard to tell which is which and what each is supposed to be. This is where I get stuck myself, and I am actually really curious how this would be accomplished. In some of the cookie data I can also see some sort of IBM Commerce information, so I am linking Prevent Encryption (Krypto) Of Url Paramaters in IBM Commerce Server 6, as it's the only other SO question that relates even remotely to this.
Essentially though the steps would be to determine the proper cookies to pass for this post (and then the proper cookies and info for the redirect!). I believe some of these are being set by some js or something since they are not in the get response from the site. Sorry I can't be more help here.
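One way to narrow the search is to diff the two cookie-name listings, so you only have to investigate the names the browser sends and the scripted session lacks. A rough sketch; the function name is made up, and the lists are just a small subset of the names shown above.

```python
def missing_cookie_names(browser_names, session_names):
    """Return cookie names the browser sent that the scripted session lacks."""
    return sorted(set(browser_names) - set(session_names))


# A subset of the names from the two listings above, for illustration.
session_names = ["C_LOC", "JSESSIONID", "WC_PERSISTENT", "_abck", "ak_bmsc", "bm_sz"]
browser_names = ["AKA_A2", "C_LOC", "RT", "WC_PERSISTENT", "_abck", "ak_bmsc",
                 "bm_sz", "mbox", "rememberedLogonId"]

if __name__ == "__main__":
    print(missing_cookie_names(browser_names, session_names))
```

With the real session you would pass [c.name for c in cks] as session_names. Names that only appear on the browser side are the likely candidates for cookies set by JavaScript.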
If you absolutely need to login, try using selenium as it simulates a browser. Otherwise, if you just want to check if an item is in stock, this guide uses requests and doesn't need to be logged in https://aryaboudaie.com/python/technical/educational/2020/07/05/using-python-to-buy-a-gift.html
I want to scrape forms on Just-Eat, but it seems that the form doesn't exist!
I use this code:
br.open("https://www.just-eat.fr/")
form = br.get_forms()
but it doesn't detect any form! Yet when you look at the page source, there is a form:
<form class="search-form autocomplete-target" action="#" id="geolocate_form_home">
I don't know how to make it detectable! Does anyone have an idea?
Thanks a lot !
The server sends a page containing only an <iframe> with a message about blocking for security reasons.
The first problem is the User-Agent header. Normally requests identifies itself as python-requests/2.21.0, but the server may require a User-Agent used by a real browser, for example Firefox on Linux:
br = robobrowser.RoboBrowser(user_agent='Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0')
But the server can still send the page with the <iframe> message.
If I load the same URL again, however, it loads the correct page.
Probably the session now has all the needed cookies, so the server no longer objects.
If you want, you can also load the page from the <iframe> to behave more like a real human.
import robobrowser
br = robobrowser.RoboBrowser(user_agent='Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0')
br.parser = 'lxml'
br.open("https://www.just-eat.fr")
#print(br.parsed)
print(br.get_forms())
#iframe_src = br.select('iframe')[0]['src']
#print(iframe_src)
#br.open("https://www.just-eat.fr"+iframe_src)
#print(br.parsed)
br.open("https://www.just-eat.fr")
#print(br.parsed)
print(br.get_forms())
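The "load it again" trick can be written as a small retry loop. This is only a sketch: it assumes the first response sets the cookies the server wants, which I have not confirmed, and the helper name and its parameters are made up. It works with any object that has open() and get_forms(), such as the RoboBrowser instance above.

```python
import time


def open_until_forms(browser, url, attempts=3, delay=1.0):
    """Open url repeatedly until the parsed page exposes at least one form."""
    for attempt in range(attempts):
        browser.open(url)
        forms = browser.get_forms()
        if forms:
            return forms
        if attempt < attempts - 1:
            # Pause briefly so the retries do not hammer the server.
            time.sleep(delay)
    return []
```

With the browser from the code above, the call would be open_until_forms(br, "https://www.just-eat.fr"). An empty list means the page never showed a form within the given number of attempts.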
I'm trying to make a script to log in to my bank. I had it working, but it suddenly stopped and I CANNOT figure out why. I contacted my bank; they told me that they haven't changed anything on their end that would explain why it doesn't work (i.e., nothing to prevent bots from logging in). Further, I have a PHP script that works fine. However, I don't like PHP and want this to work in Python using mechanize. This is my code:
import mechanize

def scrapeBank():
    url = 'https://online.lloydsbank.co.uk/personal/logon/login.jsp'
    userName = 'XXXX'
    firstPassword = 'XXXX'
    secondPassword = 'XXXX'
    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    br.open(url)
    f = open("output1.html","w+")
    f.write(br.response().read())
    f.close()
    br.select_form(nr=0)
    br.form['frmLogin:strCustomerLogin_userID'] = userName
    br.form['frmLogin:strCustomerLogin_pwd'] = firstPassword
    br.submit()
    f = open("output2.html","w+")
    f.write(br.response().read())
    f.close()

scrapeBank()
Now, what I would expect to see in output2.html (and what I used to see before it stopped working) is the second stage of the login, where you need to enter the second password. Instead, I get the error "Sorry, we've had to log you off...". I get the SAME error if I use an incorrect username and password, or if I don't complete these fields at all. This is NOT expected behaviour. Expected behaviour, if I use the wrong details, is an error saying as much (this is what happens if I input the wrong details manually in a browser). Expected behaviour if I don't enter any details is that it shouldn't submit the form at all, since the username and password are required fields. This implies to me that br.submit() is doing something weird.
In summary
If you visit the url above in your browser and put in a random username or password, you will be told your login is wrong. Yet if you run the above code (you should be able to copy and paste it and run it easily if you have mechanize installed), you will NOT get this- you get 'you have been logged off..' (you can view output2.html to see this) error- or at least this is what I get. So- why is the functionality different between logging in with a browser, and using mechanize?
I am building a web scraper and need to get the html page source as it actually appears on the page. However, I only get a limited html source, one that does not include the needed info. I think that I am either seeing it pre javascript loaded or else maybe I'm not getting the full info because I don't have the right authentication?? My result is the same as "view source" in Chrome when what I want is what Chrome's 'inspect element' shows. My test is cimber.dk after entering flight information and searching.
I am coding in python and tried the urllib2 library. Then I heard that Selenium was good for this so I tried that, too. However, that also gets me the same limited page source.
This is what I tried with urllib2 after using Firebug to see the parameters. (I deleted all my cookies after opening cimber.dk so I was starting with a 'clean slate')
url = 'https://www.cimber.dk/booking/'
values = {'ARRANGE_BY' : 'D',...} #one for each value
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
#Using HTTPRedirectHandler instead of HTTPCookieProcessor gives the same.
urllib2.install_opener(opener)
request = urllib2.Request(url)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0')]
request.add_header(....) # one for each header, also the cookie one
p = urllib.urlencode(values)
data = opener.open(request, p).read()
# data is now the limited source, like Chrome View Source
#I tried to add the following in some vain attempt to do a redirect.
#The result is always "HTTP Error 400: Bad request"
f = opener.open('https://wftc2.e-travel.com/plnext/cimber/Override.action')
data = f.read()
f.close()
Most libraries like this do not support javascript.
If you want javascript, you will need to either automate an existing browser or browser engine, or get a really monolithic big beefy library that is essentially an advanced web crawler.
I am trying to log into Netflix with Python. It would work perfectly, but I can't get it to detect whether or not the login failed. The code looks like this:
#this is not purely my code! Thanks to Ori for the code
import urllib
username = raw_input('Enter your email: ')
password = raw_input('Enter your password: ')
params = urllib.urlencode(
    {'email': username,
     'password': password })
f = urllib.urlopen("https://signup.netflix.com/Login", params)
if "The login information you entered does not match an account in our records. Remember, your email address is not case-sensitive, but passwords are." in f.read():
    success = False
    print "Either your username or password was incorrect."
else:
    success = True
    print "You are now logged into netflix as", username
raw_input('Press enter to exit the program')
As always, many thanks!!
First, I'll just share some verbiage I noticed on the Netflix site under Limitations on Use:
Any unauthorized use of the Netflix service or its contents will terminate the limited license granted by us and will result in the cancellation of your membership.
In short, I'm not sure what your script does after this, but some activities could jeopardize your relationship with Netflix. I did not read the whole ToS, but you should.
That said, there are plenty of legitimate reasons to scrape html information, and I do it all the time. So my first bet with this specific problem is you're using the wrong detection string... Just send a bogus email/password and print the response... Perhaps you made an assumption about what it looks like when you log in with a browser, but the browser is sending info that gets further into the process.
I wish I could offer specifics on what to do next, but I would rather not risk my relationship with 'flix to give a better answer to the question... so I'll just share a few observations I gleaned from scraping oodles of other websites that made it kindof hard to use web robots...
First, log in to your account with Firefox, and be sure to have the Live HTTP Headers add-on enabled and in capture mode... what you will see when you log in live is invaluable to your scripting efforts... for instance, this was from a session while I logged in...
POST /Login HTTP/1.1
Host: signup.netflix.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: https://signup.netflix.com/Login?country=1&rdirfdc=true
--->Insert lots of private stuff here
Content-Type: application/x-www-form-urlencoded
Content-Length: 168
authURL=sOmELoNgTeXtStRiNg&nextpage=&SubmitButton=true&country=1&email=EmAiLAdDrEsS%40sOmEMaIlProvider.com&password=UnEnCoDeDpAsSwOrD
Pay particular attention to the stuff below "Content-Length" field and all the parameters that come after it.
Now log back out, and pull up the login site page again... chances are, you will see some of those fields hidden as state information in <input type="hidden"> tags... some web apps keep state by feeding you fields and then they use javascript to resubmit that same information in your login POST. I usually use lxml to parse the pages I receive... if you try it, keep in mind that lxml prefers utf-8, so I include code that automagically converts when it sees other encodings...
response = urlopen(req,data)
# info is from the HTTP headers... like server version
info = response.info().dict
# page is the HTML response
page = response.read()
encoding = chardet.detect(page)['encoding']
if encoding != 'utf-8':
    page = page.decode(encoding, 'replace').encode('utf-8')
BTW, Michael Foord has a very good reference on urllib2 and many of the assorted issues.
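To see which hidden state fields a login page hands you, you can collect every <input type="hidden"> before building the POST. I normally use lxml for this; the sketch below swaps in the standard library's html.parser so it runs without extra packages. The class name, the function name, and the sample HTML are made up for illustration and are not Netflix's actual markup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login form, for illustration only.
SAMPLE_HTML = """
<form action="/Login" method="post">
  <input type="hidden" name="authURL" value="sOmELoNgTeXtStRiNg">
  <input type="hidden" name="nextpage" value="" />
  <input type="text" name="email">
  <input type="password" name="password">
</form>
"""

if __name__ == "__main__":
    print(hidden_fields(SAMPLE_HTML))
```

The resulting dict can be merged with your email and password before URL-encoding, so the POST carries the same state fields the browser would send.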
So, in summary:
Using your existing script, dump the results from a known bogus login to be sure you're parsing for the right info... I'm pretty sure you made a bad assumption above
It also looks like you aren't submitting enough parameters in the POST. Experience tells me you need to set authURL in addition to email and password... if possible, I try to mimic what the browser sends...
Occasionally, it matters whether you have set your user-agent string and referring webpage. I always set these when I scrape so I don't waste cycles debugging.
When all else fails, look at info stored in cookies they send
Sometimes websites base64 encode form submission data. I don't know whether Netflix does
Some websites are very protective of their intellectual property, and programmatically reading/archiving the information is considered a theft of their IP. Again, read the ToS... I don't know how Netflix views what you want to do.
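On the base64 point in the list above: if a captured form value looks like gibberish, a quick decode attempt will tell you whether it is just base64 text. A small sketch using the standard library; the helper name and the sample value are made up, and I still don't know whether Netflix does this.

```python
import base64


def try_base64_decode(value):
    """Return the decoded text if value is valid base64-encoded UTF-8, else None."""
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses.
        return None


if __name__ == "__main__":
    # Hypothetical captured value, built here so the example is self-contained.
    captured = base64.b64encode(b"email=user@example.com").decode("ascii")
    print(captured, "->", try_base64_decode(captured))
    print(try_base64_decode("not base64!"))
```

A None result only means the value is not plain base64 text; it could still be encrypted or encoded some other way.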
I am providing this for informational purposes and under no circumstances endorse, or condone the violation of Netflix terms of service... nor can I confirm whether your proposed activity would... I'm just saying it might :-). Talk to a lawyer that specializes in e-discovery if you need an official ruling. Feet first. Don't eat yellow snow... etc...