Log Into Website, Download File - python

I'm trying to use a Python script to log into my school's website and then download the homework assignment PDFs that are uploaded once a week. I've successfully downloaded PDFs from normal, non-protected websites, but I'm having trouble understanding the mechanics of cookies. I've done a bunch of googling, but the only code I've found is the following.
import urllib, urllib2, cookielib

testfile = urllib.URLopener()
username = 'example@gmail.com'
password = '*****'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username, 'j_password': password})
opener.open('http://the-site/login-page', login_data)               # placeholder login URL
testfile.retrieve("http://the-site/path-to-file.pdf", "file.pdf")   # placeholder file URL
Basically, I've tried putting in all the appropriate information, but it didn't work, and I have no idea how to manipulate the code to make it do what I want. How can I use Python to log into the website and then download a PDF?
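One likely culprit in the snippet above: testfile is a plain URLopener, so the retrieve() call never sends the cookies that the login stored in cj. A minimal sketch that routes the download through the same cookie-aware opener (the URLs and form field names are placeholders; check the site's login form for the real ones):
import urllib
import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Log in; the server's session cookie is stored in cj.
login_data = urllib.urlencode({'username': 'example@gmail.com',
                               'password': '*****'})
opener.open('http://the-site/login-page', login_data)       # placeholder URL

# Download with the SAME opener so the session cookie is sent along.
response = opener.open('http://the-site/path-to-file.pdf')  # placeholder URL
with open('file.pdf', 'wb') as f:
    f.write(response.read())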
Edit
Okay, here's the new code I've got that sort of works, but it outputs a copy of the site's HTML with a .pdf extension instead of the file I'm actually trying to download from the site. What's going wrong?
import requests

s = requests.Session()
data = {"login": "MYLOG", "password": "*****"}
url = "https://the-site/login.php"         # placeholder login URL
url2 = "https://the-site/path-to-pdf.pdf"  # placeholder file URL
r2 = s.post(url, data=data)
r = s.get(url2)
with open("204_HW.pdf", "wb") as code:
    code.write(r.content)
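If the saved file contains HTML, the login POST almost certainly did not succeed (wrong field names or wrong target URL), so the server keeps returning its login page. A quick check, reusing the names from the snippet above; the headers inspected are standard HTTP, nothing site-specific:
print(r2.status_code, r2.url)          # landing back on the login URL is a bad sign
print(r.headers.get("Content-Type"))   # expect application/pdf, not text/html
if "html" in r.headers.get("Content-Type", ""):
    print("Login failed: check the form's real field names and action URL in the page source")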

Related

Python requests cannot access a private website that works in a browser

I'm trying to download some .csv files from a private website with the Python requests library.
I can access the website in a browser: after I type in the URL, a window pops up asking for a username and password.
After that, a .csv file starts downloading.
However, the same request fails when I use Python requests.
Here is my code.
import requests

# username and pwd in base64
b64_IDpass = '******'
tics_headers = {
    "Host": 'http://tics-sign.com',  # note: a Host header normally carries only the hostname, without the scheme
    "Authorization": 'Basic {}'.format(b64_IDpass)
}
# company internet proxy
proxy = {'http': '*****'}
# url
url_get = 'http://tics-sign.com/getlist'
r = requests.get(url_get, headers=tics_headers, proxies=proxy)
print(r)
# <Response [404]>
I've checked the headers in a browser and there is no problem.
But why does it return <Response [404]> when using Python?
You need to post your username and password before you issue the GET for the link.
So you could try this:
requests.post("http://tics-sign.com", headers=tics_headers)
And then get the info:
requests.get(url_get, proxies=proxy)
This has worked for me on all the previous sites I have scraped that needed authentication.
The problem is that each site accepts authentication in a different way, so it may not even work.
It may also be that Python is not getting redirected to http://tics-sign.com/displaypanel/login.aspx; curl wasn't for me.
Edit:
I looked at the HTML source of your website and came up with this:
login_data = {"logName": your_id, "pwd": your_password}
requests.post("http://tics-sign.com/displaypanel/login.aspx", data=login_data)
r = requests.get(url_get, proxies=proxy)
You can look at my blog for more info.
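A variant of the same idea with a requests.Session, so the cookie set by the login POST is re-sent on the following GET automatically; the field names come from the edit above, while the proxy and credentials stay placeholders:
import requests

s = requests.Session()
s.proxies = {'http': '*****'}  # company proxy, placeholder

# Log in first; the session keeps the resulting cookie.
login_data = {"logName": "your_id", "pwd": "your_password"}
s.post("http://tics-sign.com/displaypanel/login.aspx", data=login_data)

# The cookie is attached automatically on every later request.
r = s.get("http://tics-sign.com/getlist")
print(r.status_code)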

Log in to a site and navigate to other pages

I have a Python 2 script that logs in to a webpage and then moves through the site to reach a couple of files linked from different pages. Python 2 let me open the site with my credentials and then keep reusing the same opener.open() so the session stayed available while navigating to the other pages.
Here's the code that worked in Python 2:
import urllib
import urllib2
import cookielib

# Your admin login and password
LOGIN = "*******"
PASSWORD = "********"
ROOT = "https:*********"

# The client has to take care of the cookies.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# POST login query on '/login_handler' (post data are: 'login' and 'password').
req = urllib2.Request(ROOT + "/login_handler",
                      urllib.urlencode({'login': LOGIN,
                                        'password': PASSWORD}))
opener.open(req)

# Set the right accountcode (QUEUES is defined elsewhere in the script).
for accountcode, queues in QUEUES.items():
    req = urllib2.Request(ROOT + "/switch_to" + accountcode)
    opener.open(req)
I need to do the same thing in Python 3. I have tried the requests module and urllib, but although I can establish the first login, I don't know how to keep the opener alive to navigate the site. I found OpenerDirector, but I haven't managed to get it working, so I haven't reached my goal.
I have used Python 3 code to get close to the desired result, but unfortunately I can't get the CSV file in order to print it.
Question: I don't know how to keep the opener alive to navigate the site.
Python 3.6 Documentation » urllib.request.build_opener
Use of Basic HTTP Authentication:
import urllib.request

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)

f = urllib.request.urlopen('http://www.example.com/login.html')
csv_content = f.read()
Use the Python requests library for Python 3 and a Session.
http://docs.python-requests.org/en/master/user/advanced/#session-objects
Once you log in, your session is managed automatically; you don't need to create your own cookie jar. The following is sample code.
import requests

s = requests.Session()
auth = {"login": LOGIN, "pass": PASS}
url = ROOT + "/login_handler"
r = s.post(url, data=auth)
print(r.status_code)

for accountcode, queues in QUEUES.items():
    req = s.get(ROOT + "/switch_to" + accountcode)
    print(req.text)  # response text
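Since the asker also can't get the CSV file to print, note that the logged-in session can fetch the export the same way; the /export.csv path below is only a placeholder for whatever URL the site actually serves the file from:
csv_resp = s.get(ROOT + "/export.csv")  # placeholder path, substitute the real CSV URL
print(csv_resp.text)                    # print it, or write csv_resp.content to a file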

How to log in to webmail with Python?

import requests
import webbrowser

s = requests.session()
login_data = dict(email='email', password='password')
s.post("https://www.otoservisbul.com/tr/login", data=login_data)
r = requests.post('https://www.otoservisbul.com/tr/login', data={'email': 'password'})
webbrowser.get(using='chrome').open("https://www.otoservisbul.com/tr/items/list")
Hi, here is my Python code. I am trying to log in to my webmail via code. When I run it, the login page opens in Google Chrome, but the username and password never get filled in. I know there have been a lot of entries about this topic, but I really couldn't find the problem.
As far as I can see, you did not pass login_data to the post request:
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
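Another point worth noting: webbrowser starts a separate Chrome process that knows nothing about the cookies requests stored, so the items page will always open logged out. A sketch that stays inside the requests session instead (assuming email and password are the form's real field names):
import requests

s = requests.session()
login_data = dict(email='your_email', password='your_password')
s.post("https://www.otoservisbul.com/tr/login", data=login_data)

# Fetch the protected page with the same session rather than a new browser.
r = s.get("https://www.otoservisbul.com/tr/items/list")
print(r.text)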

Log in into website and download file with python requests

I have a website with an HTML form. After logging in, it takes me to a start.php page and then redirects me to overview.php.
I want to download files from that server... When I click on the download link of a ZIP-File the address behind the link is:
getimage.php?path="vol/img"&id="4312432"
How can I do that with requests? I tried to create a session and issue the GET with the right params... but the response is just the page I would see when I'm not logged in.
c = requests.Session()
c.auth = ('myusername', 'myPass')
request1 = c.get('myUrlToStart.php')  # placeholder URL
tex = request1.text

with open('data.zip', 'wb') as handle:
    # payload2 holds the path/id query parameters from the download link
    request2 = c.get('urlToGetImage.php', params=payload2, stream=True)
    print(request2.headers)
    for block in request2.iter_content(1024):
        if not block:
            break
        handle.write(block)
What you're doing is a request with basic authentication, which does not fill out the form that is displayed on the page.
If you know the URL that your form sends a POST request to, you can try sending the form data directly to that URL, as in the sketch below.
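If you don't know that URL, the form's action attribute and input names can be read out of the login page itself. A sketch with BeautifulSoup; the login-page URL is a placeholder:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
page = s.get("https://the-site/login-page")  # placeholder: the page showing the login form
form = BeautifulSoup(page.text, "html.parser").find("form")
print(form.get("action"))                        # the URL the form posts to
for field in form.find_all("input"):
    print(field.get("name"), field.get("type"))  # the field names to put in data=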
Those who are looking for the same thing could try this...
import requests
import bs4

site_url = 'site_url_here'
userid = 'userid'
password = 'password'
file_url = 'getimage.php?path="vol/img"&id="4312432"'
o_file = 'abc.zip'

# create session
s = requests.Session()
# GET request. This will generate cookie for you
s.get(site_url)
# login to site.
s.post(site_url, data={'_username': userid, '_password': password})
# Next thing will be to visit URL for file you would like to download.
r = s.get(file_url)

# Download file
with open(o_file, 'wb') as output:
    output.write(r.content)
print(f"requests:: File {o_file} downloaded successfully!")

# Close session once all work done
s.close()

Scrape a web page that requires they give you a session cookie first

I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I have to access this Excel file:
http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal
requires that I have a session cookie from the government site attached to the request.
How could I grab the session cookie with an initial request to the landing page (when they give you the session cookie) and then use it to hit the URL above to grab our excel file? I'm on Google App Engine using Python.
I tried this:
import urllib2
import cookielib

url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'

def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data

cj = cookielib.CookieJar()
# grab the data
data1 = grab_data_with_cookie(cj, url)
# the second time we do this, we get back the excel sheet.
data2 = grab_data_with_cookie(cj, url)
stuff2 = data2.read()
I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
Using requests this is a trivial task:
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print r.cookies
{'requests-is': 'awesome'}
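Applied to the muster-roll case, a requests.Session keeps the cookie issued on the first request and sends it along with the request for the spreadsheet; the landing-page URL below is an assumption, use whichever page actually hands out the session cookie:
import requests

s = requests.Session()
# First hit: the server sets its session cookie (assumed landing page).
s.get('http://nrega.ap.gov.in/Nregs/')
# Second hit: the cookie goes along automatically.
r = s.get('http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH'
          '&hhid=192420317026010002&actionVal=musterrolls&type=Normal')
with open('muster_roll.xls', 'wb') as f:
    f.write(r.content)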
Using cookies and urllib2:
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# use opener to open different urls
You can use the same opener for several connections:
data = [opener.open(url).read() for url in urls]
Or install it globally:
urllib2.install_opener(opener)
In the latter case the rest of the code looks the same with or without cookies support:
data = [urllib2.urlopen(url).read() for url in urls]
