Python: Download a protected image using Requests Session - python

I'm trying to access a protected image on a website. If I use my broswer and log in, I can easily see the image. Now I'm trying to do the same using python.
I came across this question which seemed like the same issue that I have. With that I wrote the following program:
import requests
session = requests.Session()
session.get("https://www.page.com/login")
response = session.post("https://www.page.com/login", data = {"cmd": "login", "username": "myusername", "password": "mypass"})
print response.text
Output here is:
{"status":"ok","message":"Successful login.","referer":"https:\/\/page.com\/"}
I've also saved the response.text into a .html file, which confirmed that I've logged in. If at this point I open any subpage on page.com, I'm still logged in.
Then I try to access the protected image at url:
with open('image.html', 'wb') as file:
response = session.get(url, headers = {"User-Agent": "Mozilla/5.0"})
print response.status_code
file.write(response.text)
file.close()
Which gives the output of 403. If I open the image.html file there's an "Item not available".
I've tried doing all of this using mechanize and cookielib instead of requests.Session() where I created a cookiejar, created a Browser() object, successfully logged in but as I tried accessing that image, I had the same 403 error. Therefore any ideas will be greatly appriciated.

Related

requests.Session() not working even though the post is working

I'm trying to get a bunch of PDFs off of a site that sits behind a login so I don't manually have to download each and every one. I figured this would be easy, but I'm getting a "Missing Key-Pair-Id query parameter" error back. Here's what I have:
payload={'username':'user','password':'pass'}
with requests.Session() as session:
post = session.post('https://website.com/login.do', data=payload)
r = session.get('https://files.website.com/1.pdf')
print(r.text)
I'm printing r.text just because that's where I'm getting the above message. My post variable is giving me a response of 200 and the contents post.text is a redirect link with "code: success", too. If I click that link (or copy paste it into a private browser), I'm logged in just fine. And browsing to the pdf link works just fine. What am I missing here? Thanks.

Can't get the html of a page python

So I have been trying to solve this for the past 3 days and just can't know why.
I'm trying to access the html of this site that requires login first.
I tried everyway I could and all return with the same problem.
Here is what I tried:
response = requests.get('https://de-legalization.tlscontact.com/eg/CAI/myapp.php', headers=headers, params=params, cookies=cookies)
print(response.content)
payload = {
'_token': 'TOKEN HERE',
'email': 'EMAIL HERE',
'pwd': 'PASSWORDHERE',
'client_token': 'CLIENT_TOKEN HERE'
}
with requests.session() as s:
r = s.post(login_url, data=payload)
print(r.text)
I also tried using URLLIB but they all return this:
<script>window.location="https://de-legalization.tlscontact.com/eg/CAI/index.php";</script>
Anyone knows why this is happening.
Also here is the url of the page I want the html of:
https://de-legalization.tlscontact.com/eg/CAI/myapp.php
You see this particular output because it is in fact the content of the page you are downloading.
You can test it in chrome by opening the following url:
view-source:https://de-legalization.tlscontact.com/eg/CAI/myapp.php
This is how it looks like in Chrome:
This is happening because you are being redirected by the javascript code on the page.
Since the page you are trying to access requires login, you cannot access it just by sending http request to the internal page.
You either need to extract all the cookies and add them to the python script.
Or you need to use a tool like Selenium that allows you to control a browser from your Python code.
Here you can find how to extract all the cookies from the browser session:
How to copy cookies in Google Chrome?
Here you can find how to add cookies to the http request in Python:
import requests
cookies = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookies)

Using Python requests cannot access a private website while it works by using a browser

I'm trying to download some (.csv) files from a private website with Python requests method.
I can access the website by using a browser. After typing in the url, it pops up a window to fill in username and password.
After that, it starts to download a (.csv) file.
However, it failed when I used Python requests method.
Here is my code.
import requests
# username and pwd in base64
b64_IDpass = '******'
tics_headers = {
"Host": 'http://tics-sign.com',
"Authorization": 'Basic {}'.format(b64_IDpass)
}
# company internet proxy
proxy = {'http': '*****'}
# url
url_get = 'http://tics-sign.com/getlist'
r = requests.get(url_get,
headers=tics_headers,
proxies=proxy)
print(r)
# <Response [404]>
I've checked the headers in a browser, there is no problem.
But why it returns <Response [404]> when using Python?
You need to post your password and username before you get the link.
So you could try this:
request.post("http://tics-sign.com", tics_headers)
And then get the info:
request.get(url_get, proxies=proxy)
This has worked for me in all the previous sites have scraped which need authentication.
The problem is that each site has a different way for accepting authentication. So it may
not even work.
It also may be that python is not getting redirected to http://tics-sign.com/displaypanel/login.aspx. curl didn't for me.
Edit:
I looked at the HTML source of your website and I came up with this:
login_data = {"logName": your_id, "pwd": your_password}
request.post(http://tics-sign.com/displaypanel/login.aspx, login_data)
r = request.get(url_get, proxies=proxy)
You can look at my blog for more info.

Web Scraping using Requests - Python

I am trying to get data using the Resquest library, but I’m doing something wrong. My explanation, manual search:
URL - https://www9.sabesp.com.br/agenciavirtual/pages/template/siteexterno.iface?idFuncao=18
I fill in the “Informe o RGI” field and after clicking on the Prosseguir button (like Next):
enter image description here
I get this result:
enter image description here
Before I coding, I did the manual search and checked the Form Data:
enter image description here
And then I tried it with this code:
import requests
data = { "frmhome:rgi1": "0963489410"}
url = "https://www9.sabesp.com.br/agenciavirtual/block/send-receive-updates"
res = requests.post(url, data=data)
print(res.text)
My output is:
<session-expired/>
What am I doing wrong?
Many thanks.
When you go to the site using the browser, a session is created and stored in a cookie on your machine. When you make the POST request, the cookies are sent with the request. You receive an session-expired error because you're not sending any session data with your request.
Try this code. It requests the entry page first and stores the cookies. The cookies are then sent with the POST request.
import requests
session = requests.Session() # start session
# get entry page with cookies
response = session.get('https://www9.sabesp.com.br/agenciavirtual/pages/home/paginainicial.iface', timeout=30)
cks = session.cookies # save cookies with Session data
print(session.cookies.get_dict())
data = { "frmhome:rgi1": "0963489410"}
url = "https://www9.sabesp.com.br/agenciavirtual/block/send-receive-updates"
res = requests.post(url, data=data, cookies=cks) # send cookies with request
print(res.text)

Request not returning same data as browser

Trying to get some values from Duolingo using Python, but urllib is giving me something different than when I navigate to the url via my browser.
Navigating to a url (https://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday) via browser gives: {"xpGoalMetToday": false}.
However, trying via the below script:
import urllib.request
url = 'http://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday'
user_agent = '[insert my local user agent copied from browser attempt]'
# header variable
headers = { 'User-Agent' : user_agent, "Cache-Control": "no-cache, max-age=0" }
# creating request
req = urllib.request.Request(url, None, headers)
print(urllib.request.urlopen(req).read())
returns just a blank {}.
As you can tell from the above, I've tried a couple things: adding a user agent, cache control. I've even tried using the response module and adding authentication (didn't work).
Any ideas? Am I missing something?
Actually when I open the link in the browser it show me {}
Maybe you have some kind of cookie set in your browser?

Categories

Resources