Download encrypted webpage - python

Hi I have researched this but I can not find any answers this question. I need to download a sub directory of a web page to a string for a search, I know have to do this but the only problem is the site is encrypted and requires a login to acces the directory. I know I need to send the cookies to request the download but I am unsure how to do this. I am coding python. feel free to ask for more info.

import urllib
import urllib2
import cookielib
import time
# All your cookie related things are done by this.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
#POST Parameters for login page.
request_body_params = {'your_parameter_name': 'its_value', 'another_parameter_name': 'its_value'}
data_encoding = urllib.urlencode(request_body_params)
url_main = 'https://your_site.com/login'
main_request = urllib2.Request(url_main, data_encoding)
#Any headers required goes here.
main_request.add_header('Accept-encoding', 'gzip')
# This is the response of login. You don't want to read this.
main_response = urllib2.urlopen(main_request)
# You want data from this link.
url_results = 'https://your_site.com/sub_directory'
results_response = urllib2.urlopen(url_results)
print results_response.read()
To check the POST Parameters, go to the site from a browser, click on 'View Source', go to 'Network' in view source. Then as you login in the browser, there will be network logs generated, click on the link and check out it's POST Parameters and headers.

Related

Using python requests module to login on an Wordpress based website

I'm trying to login to a Wordpress based website using python's request module and beautifulsoup4. It seems like the code fails to sucessfully login. Also, there is no csrf token on the website. How do I sucessfully login to the website?
import requests
import bs4 as bs
with requests.session() as c:
link="https://gpldl.com/sign-in/" #link of the webpage to be logged in
initial=c.get(link) #passing the get request
login_data={"log":"*****","pwd":"******"} #the login data from any account on the site. Stars must be replaced with username and password
page_login=c.post(link, data=login_data) #posting the login data into the link
print(page_login) #checking status of requested page
page=c.get("https://gpldl.com/my-gpldl-account/") #requesting source code of logged in page
good_data = bs.BeautifulSoup(page.content, "lxml") #parsing it with BS4
print(good_data.title) #printing this gives the title that is got from the page when accessed from an logged-out account
You are sending your POST request to a wrong URL, the correct one should be https://gpldl.com/wp-login.php, also there're 5 parameters for the payload which are log, pwd, rememberme, redirect_to, redirect_to_automatic.
So it should be:
login_data = {"log": "*****","pwd": "******",
"rememberme": "forever",
"redirect_to": "https://gpldl.com/my-gpldl-account/",
"redirect_to_automatic": "1"
}
page_login = c.post('https://gpldl.com/wp-login.php', data=login_data)
Edit:
You could use Chrome Dev tool to find out all these info while logging in, it's like this:
As to rememberme key, I would suggest you to do exact same thing a browser does, also add some headers for your request, especially User-Agent, because for some websites they just don't welcome you got login this way.

Using Python to request draftkings.com info that requires login?

I'm trying to get contest data from the url: "https://www.draftkings.com/contest/gamecenter/32947401"
If you go to this URL and aren't logged in, it'll just re-direct you to the lobby. If you're logged in, it'll actually show you the contest results.
Here's some things I tried:
-First, I used Chrome's Dev networking tools to watch requests while I manually logged in
-I then tried copying the cookie that I thought contained the authentication info, it was of the form:
'ajs_anonymous_id=%123123123123123, mlc=true; optimizelyEndUserId'
-I then stored that cookie as an Evironment variable and ran this code:
HEADERS= {'cookie': os.environ['MY_COOKIE'] }
requests.get(draft_kings_url, headers= HEADERS)
No luck, this just gave me the lobby.
I then tried request's built in:
HTTPBasicAuth
HTTPDigestAuth
No luck here either.
I'm no python expert by far, and I've pretty much exhausted what I know and the search results I've found. Any ideas?
The tool that you want is selenium. Something along the lines of:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(r"https://www.draftkings.com/contest/gamecenter/32947401" )
username = browser.find_element_by_id("user")
username.send_keys("username")
password = browser.find_element_by_id("password")
password.send_keys("top_secret")
login = selenium.find_element_by_name("login")
login.click()
Use fiddler to see the exact request they are making when you try to log in. Then use Session class in requests package.
import requests
session = requests.Session()
session.get('YOUR_URL_LOGIN_PAGE')
this will save all the cookies from your url in your session variable (Like when you use a browser).
Then make a post request to the login url with appropriate data.
You dont have to manually pass cookie data as it is auto generated when you first visit a website. However you can set some header explicitly like UserAgent etc by:
session.headers.update({'header_name':'header_value'})
HTTPBasicAuth & HTTPDigestAuth might not work based on the website.

python scraping school's webpage which requires user login

I'm using python to scrape my school's webpage, but in order to do that I needed to simulate a user login first. here is my code:
import requests, lxml.html
s = requests.session()
url = "https://my.emich.edu"
login = s.get(url)
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[#type="hidden"]')
form = {x.attrib["name"]:x.attrib["value"] for x in hidden_inputs}
form["username"] = "myusernamge"
form["password"] = "mypassword"
form["submit"] = "LOGIN"
response = s.post("https://netid.emich.edu/cas/loginservice=https%3A%2F%2Fmy.emich.edu%2Fc%2Fportal%2Flogin",form)
response = s.get("http://my.emich.edu")
f = open("result.html","w")
f.write(response.text)
print response.text
i am expecting that response.text will give me my own student account page instead of that it gives me a log in requirement page. Can any one help me with this issue?
BTW this is not a homework
There are a few options here, and I think your requests approach can be made much easier by logging in manually and copying over the headers.
Use a python scripting package like http://wwwsearch.sourceforge.net/mechanize/ to scrape the site.
Use a browser-emulater such as http://casperjs.org/. Using this you can basically do anything you'd be able to do in a browser.
My suggestion here would be to go to the website, log in, and then open the developer console and copy those headers/cookies into your requests headers/cookies. This way you can just hardcode the 'already-authenticated request' and it will work fine. Note that this method is the least reliable for doing robust, everyday scraping, but if you're looking for something that will be the quickest to implement and will work until the authentication runs out, use this method.
Also, you need the request the logged-in homepage (again) after you successfully do the post.

Using Python 3.5 to Login, Navigate, and Scrape Without Using a Browser

I'm trying to scrape multiple financial websites (Wells Fargo, etc.) to pull my transaction history for data analysis purposes. I can do the scraping part once I get to the page I need; the problem I'm having is getting there. I don't know how to pass my username and password and then navigate from there. I would like to do this without actually opening a browser.
I found Michael Foord's article "HOWTO Fetch Internet Resources Using The urllib Package" and tried to adapt one of the examples to meet my needs but can't get it to work (I've tried adapting to several other search results as well). Here's my code:
import bs4
import urllib.request
import urllib.parse
##Navigate to the website.
url = 'https://www.wellsfargo.com/'
values = {'j_username':'USERNAME', 'j_password':'PASSWORD'}
data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
the_page = response.read()
soup = bs4.BeautifulSoup(the_page,"html.parser")
The 'j_username' and 'j_password' both come from inspecting the text boxes on the login page.
I just don't think I'm pointing to the right place or passing my credentials correctly. The URL I'm using is just the login page so is it actually logging me in? When I print the URL from response it returns https://wellsfargo.com/. If I'm ever able to successfully login, it just takes me to a summary page of my accounts. I would then need to follow another link to my checking, savings, etc.
I really appreciate any help you can offer.

Form-based authentication with Python

I'm trying to use a code read in Kent's Korner for Form-based authentication. At least I'm told the web site I'm trying to read is form-based authenticated.
But I don't seem to be able to get past the login page. The code I'm using is
Import urllib, urllib2, cookielib, string
# configure an opener that will handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
# use the opener to POST to the login form and the protected page
params = urllib.urlencode(dict(username='user', password='stuff'))
f = opener.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323', params)
data = f.read()
f.close()
f = opener.open('http://www.hammernutrition.com/forums/memberlist.php?mode=viewprofile&u=1323')
data = f.read()
f.close()
You can simulate a web browser in Python without using much too resources with mechanize
(Debian/Ubuntu package is called python-mechanize). It handles both cookies and submitting forms, just the way a web browser would do, one great example is a Python Dropbox Uploader script, which you can transform to your needs.

Categories

Resources