I'm trying to download a bunch of PDFs from a site that sits behind a login so I don't have to download each one manually. I figured this would be easy, but I'm getting a "Missing Key-Pair-Id query parameter" error back. Here's what I have:
import requests

payload = {'username': 'user', 'password': 'pass'}
with requests.Session() as session:
    post = session.post('https://website.com/login.do', data=payload)
    r = session.get('https://files.website.com/1.pdf')
    print(r.text)
I'm printing r.text just because that's where I'm getting the above message. My post variable gives me a 200 response, and the content of post.text is a redirect link with "code: success", too. If I click that link (or copy and paste it into a private browser window), I'm logged in just fine, and browsing to the PDF link works just fine. What am I missing here? Thanks.
Related
I want to write a function that collects data from the Yahoo Finance site. The request looks like this:
import requests

def yahoo_summary_stats(stock):
    response = requests.get(f"https://finance.yahoo.com/quote/{stock}")
    print(response.reason)
If I call the function with the parameter 'ALB':
yahoo_summary_stats('ALB')
everything works fine and the request is OK. It correctly leads me to:
https://finance.yahoo.com/quote/ALB
The call:
yahoo_summary_stats('AEE')
on the other hand, should lead me to https://finance.yahoo.com/quote/AEE, which I can open without any problems in Firefox.
For some reason the program gives me a 'Not found' error. What is wrong with my request to that website?
Try setting a User-Agent in the headers:
import requests

def yahoo_summary_stats(stock):
    response = requests.get(f"http://finance.yahoo.com/quote/{stock}", headers={'User-Agent': 'Custom user agent'})
    print(response.status_code)
    print(response.reason)

yahoo_summary_stats('ALB')
yahoo_summary_stats('AEE')
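A slightly fuller sketch of the same idea, assuming the 'Not found' response really is triggered by the default User-Agent (the thread does not confirm this). The `BROWSER_UA` string and the extra `session` parameter are illustrative choices, not values Yahoo is known to require. Returning the status instead of printing it makes the two tickers easy to compare.

```python
# Hypothetical browser-like User-Agent; any realistic value can be substituted.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def yahoo_summary_stats(stock, session, user_agent=BROWSER_UA):
    """Request the quote page for `stock` and return (status_code, reason).

    `session` is anything with a requests.Session-style .get() method.
    """
    url = f"https://finance.yahoo.com/quote/{stock}"
    response = session.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return response.status_code, response.reason


# Usage (needs the third-party requests package and network access):
#   import requests
#   with requests.Session() as s:
#       for ticker in ("ALB", "AEE"):
#           print(ticker, yahoo_summary_stats(ticker, s))
```

If both tickers still return different statuses with the same headers, the User-Agent is probably not the only factor.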
I have logged into a website using Python requests, but when I start scraping data from other pages on the site, it seems I'm no longer authenticated.
I'm receiving a 401 error when trying to access the part of the site that starts with "https://api".
I've tried using auth and proxies, but nothing works. It works perfectly fine in Chrome. Also, after I log in I can see my information in some of the API content, but when I do a GET on the homepage of the website I am no longer logged in.
import requests
from requests_ntlm import HttpNtlmAuth

payload_login = {'email': 'me#email.com', 'password': 'password'}
with requests.Session() as s:
    url = 'https://api.website.com/login'
    r = s.post(url, data=payload_login)
    print(r.content)  # this actually returns 200, meaning I've successfully logged in
    print(s.get('https://api.website.com/userProjects', auth=HttpNtlmAuth(user, password)))
Output is <Response [401]>
Hello, I'm trying to get information from a website that requires a login.
I already get a 200 response from the request URL, where I POST the ID, password, and other request data.
The headers dict holds the request headers shown in the Network tab of Chrome's developer tools. The form_data dict holds the ID and password.
login_site = requests.post(requestUrl, headers=headers, data=form_data)
status_code = login_site.status_code
print(status_code)
I got 200
Below is what I've tried.
1. Session.
When I tried to set cookies with a session, it failed. I've heard that a session can carry the cookies along when I scrape other pages that need a login.
session = requests.Session()
session.post(requestUrl, headers=headers, data=form_data)
test = session.get('~~') #the website that I want to scrape
print(test.status_code)
I got 403
2. Manually set cookie
I manually built a cookie dict from the values I could get:
cookies = {'wcs_bt':'...','_production_session_id':'...'}
r = requests.post('http://engoo.co.kr/dashboard', cookies = cookies)
print(r.status_code)
I also got 403
Actually, I don't know what I should put in the cookies dict. When I get 'wcs_bt=AAA; _production_session_id=BBB; _ga=CCC;', should I change it to a dict like {'wcs_bt':'AAA'.. }?
When I print the cookies:
login_site = requests.post(requestUrl, headers=headers, data=form_data)
print(login_site.cookies)
with this code, I only get
RequestsCookieJar[Cookie _production_session_id=BBB]
Somehow, that failed as well.
How can I scrape it with the cookie?
Scraping a modern (circa 2017 or later) Web site that requires a login can be very tricky, because it's likely that some important portion of the login process is implemented in JavaScript.
Unless you execute that JavaScript exactly as a browser would, you won't be able to complete the login. Unfortunately, the basic Python libraries won't help.
Consider Selenium with Python, which is used for testing Web sites but can be used to automate any interaction with a Web site.
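On the narrower cookie-format question asked above: requests does accept cookies as a plain dict through its `cookies=` parameter, so a 'name=value; name2=value2;' string copied from the browser has to be split into pairs first. A minimal sketch, where `cookie_header_to_dict` is a hypothetical helper name and the values are the placeholders from the question:

```python
def cookie_header_to_dict(header):
    """Split a 'name=value; name2=value2;' string into a {name: value} dict."""
    cookies = {}
    for part in header.split(";"):
        part = part.strip()
        if "=" in part:
            name, value = part.split("=", 1)  # values may themselves contain '='
            cookies[name.strip()] = value.strip()
    return cookies


cookies = cookie_header_to_dict("wcs_bt=AAA; _production_session_id=BBB; _ga=CCC;")
print(cookies)
# {'wcs_bt': 'AAA', '_production_session_id': 'BBB', '_ga': 'CCC'}
```

The resulting dict can be passed as `cookies=cookies` to `requests.get` or to a session request. This only fixes the format; if the site also depends on JavaScript during login, as described above, copied cookies may still be rejected.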
I'm trying to crawl my college website. I set a cookie and add headers, then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what is happening.
Is my code wrong?
Or is it the website?
Can a single geturl() call follow two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It can return the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
r = requests.get('website', allow_redirects=True)
print r.text
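To see whether a URL is the final one or an intermediate hop, the response's `history` list (described in the quickstart section linked above) can be read alongside `url`. A small sketch in Python 3 syntax; `redirect_chain` and `final_url` are hypothetical helper names:

```python
def redirect_chain(response):
    """Return every URL visited, in order, ending with the final one.

    `response` is a requests.Response (or anything with .history and .url).
    """
    return [hop.url for hop in response.history] + [response.url]


def final_url(url):
    """Fetch `url` with requests, following redirects, and return the chain."""
    import requests  # third-party; imported here so redirect_chain has no dependency

    return redirect_chain(requests.get(url, allow_redirects=True, timeout=10))
```

The last element of the returned list is where the request ended up; anything before it is a redirect that was followed along the way.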
Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser. But in Python command line the 4th URL is giving a 302.
urllib is ignoring the cookies and sending the new request without cookies, so it causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up-to-date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
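urllib2 exists only in Python 2. On Python 3 the same pieces live in `urllib.request` and `http.cookiejar`, so a roughly equivalent sketch looks like this (`build_cookie_opener` is a hypothetical helper name, and keeping an explicit jar is just a convenience for inspecting what the server set):

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores cookies and resends them on redirects."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()

# Usage (needs network access):
#   url = "http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc"
#   with opener.open(url, timeout=10) as response:
#       print(response.read().decode("utf-8", errors="replace"))
#   print([cookie.name for cookie in jar])
```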
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command-line using curl. It also gives me the 302 moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up getting a circular response like you indicate.
Perhaps important is the Set-Cookie header. It may be redirecting until it gets an appropriate cookie set. It may also be scanning the User-Agent and doing something based on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib. The browser creates sessions, stores cookies, and sends different headers.
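The same inspection can be done from Python instead of curl by asking requests not to follow the redirect and then reading the headers discussed above. A minimal sketch; `describe_redirect` is a hypothetical helper name:

```python
def describe_redirect(response):
    """Summarise the parts of a response that can drive a redirect loop."""
    return {
        "status": response.status_code,
        "location": response.headers.get("Location"),
        "set_cookie": response.headers.get("Set-Cookie"),
    }


# Usage (needs the third-party requests package and network access):
#   import requests
#   url = "http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc"
#   first = requests.get(url, allow_redirects=False, timeout=10)
#   print(describe_redirect(first))
```

If `set_cookie` is populated on the 302 and `location` points back at the same URL, that would be consistent with the cookie-gated redirect suspected here.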
I don't know why urllib fails (I get the same response); however, the requests library works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails with urllib
print(requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.
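As a rough illustration of what following the redirect yourself involves, the sketch below reads the Location header in a loop, with a hop limit so a circular redirect cannot run forever. `follow_redirects` and `REDIRECT_CODES` are hypothetical names, and `session` is assumed to behave like a `requests.Session`.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(session, url, max_hops=10):
    """Follow Location headers by hand; return (final_response, visited_urls)."""
    visited = [url]
    for _ in range(max_hops):
        response = session.get(url, allow_redirects=False, timeout=10)
        location = response.headers.get("Location")
        if response.status_code not in REDIRECT_CODES or not location:
            return response, visited
        url = urljoin(url, location)  # Location may be relative
        visited.append(url)
    raise RuntimeError(f"gave up after {max_hops} redirects: {visited}")


# Usage (needs the third-party requests package and network access):
#   import requests
#   with requests.Session() as s:
#       response, visited = follow_redirects(s, "http://example.com/")
#       print(visited, response.status_code)
```

With a real session, cookies set on each hop are sent again on the next one, which matters if the server keeps redirecting until a cookie is present. In practice `requests.get` already does all of this by default, so the loop is mainly useful for debugging.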