How to mirror a reddit moderator page with python

I'm trying to create a mirror of specific moderator-only (i.e. restricted) pages of a subreddit on my own server, for transparency purposes. Unfortunately my Python-fu is weak, and after struggling a bit with the reddit API, its Python wrapper and even some answers on here, I'm no closer to a working solution.
So what I need to do is log in to reddit as a specific user, access a moderator-only page and copy its HTML to a file on my own server for others to access.
The problem I'm running into is that the API and its wrapper are not very well documented, so I haven't found out whether there's a way to retrieve a reddit page after logging in. If I can do that, then I could theoretically copy the result to a simple HTML page on my server.
When trying to do it outside the Python API, I can't figure out how to use Python's built-in modules to log in and then read a restricted page.
Any help appreciated.

I don't use PRAW so I'm not sure about that, but if I were doing this, I'd log in, save the modhash, and grab the HTML from the URL of the page you want. The page looks like it's missing some CSS or something when I save it, but it's recognizable enough as it is. You'll need the requests module, along with pprint and json from the standard library. Something like:
import json
from pprint import pprint as pp2
import requests
#----------------------------------------------------------------------
def login(username, password):
    """Logs into reddit and returns a session holding the login cookie."""
    print('begin log in')
    # username and password
    UP = {'user': username, 'passwd': password, 'api_type': 'json'}
    headers = {'user-agent': '/u/STACKOVERFLOW\'s API python bot'}
    # POST with user/pwd; the session keeps the cookies for later requests
    client = requests.session()
    # send the user-agent with every request made through this session
    client.headers.update(headers)
    r = client.post('http://www.reddit.com/api/login', data=UP)
    # if you want to see what you've got so far
    # print(r.text)
    # print(r.cookies)
    # gets and saves the modhash
    j = json.loads(r.text)
    client.modhash = j['json']['data']['modhash']
    print('{USER}\'s modhash is: {mh}'.format(USER=username, mh=client.modhash))
    # pp2(j)
    return client
# USER and PASSWORD are your reddit credentials, defined elsewhere
client = login(USER, PASSWORD)
# mod mail url
url = r'http://www.reddit.com/r/mod/about/message/inbox/'
r = client.get(url)
# here's the HTML of the page
pp2(r.text)
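To finish the mirror, the fetched HTML still has to land in a file that the web server can serve. A minimal sketch of that step might look like the following. Note that save_page is a hypothetical helper (not part of requests or the reddit API), and it assumes session is any logged-in requests session, such as the client returned by login above.

```python
import io


def save_page(session, url, path):
    """Fetch url with an already logged-in session and write the HTML to path.

    save_page is a hypothetical helper, not part of requests or the reddit API.
    """
    response = session.get(url)
    # Fail loudly on 4xx/5xx instead of mirroring an error page.
    response.raise_for_status()
    # response.text is unicode, so write it with an explicit encoding.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(response.text)
    # Return the number of characters written, handy for logging.
    return len(response.text)
```

With the session from the answer this could be called as save_page(client, url, 'modmail.html'), where the output filename is only an example, and run periodically (e.g. from cron) to keep the mirror fresh.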

Related

python scraping school's webpage which requires user login

I'm using Python to scrape my school's webpage, but in order to do that I need to simulate a user login first. Here is my code:
import requests, lxml.html
s = requests.session()
url = "https://my.emich.edu"
login = s.get(url)
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib["name"]:x.attrib["value"] for x in hidden_inputs}
form["username"] = "myusernamge"
form["password"] = "mypassword"
form["submit"] = "LOGIN"
response = s.post("https://netid.emich.edu/cas/loginservice=https%3A%2F%2Fmy.emich.edu%2Fc%2Fportal%2Flogin",form)
response = s.get("http://my.emich.edu")
f = open("result.html","w")
f.write(response.text)
print response.text
I am expecting response.text to give me my own student account page, but instead it gives me the login page. Can anyone help me with this issue?
BTW, this is not homework.
There are a few options here, and I think your requests approach can be made much easier by logging in manually and copying over the headers.
Use a Python scripting package like http://wwwsearch.sourceforge.net/mechanize/ to scrape the site.
Use a browser emulator such as http://casperjs.org/. Using this you can basically do anything you'd be able to do in a browser.
My suggestion here would be to go to the website, log in, and then open the developer console and copy those headers/cookies into your requests headers/cookies. This way you can just hardcode the 'already-authenticated request' and it will work fine. Note that this method is the least reliable for robust, everyday scraping, but if you're looking for something that is quickest to implement and will work until the authentication expires, use this method.
Also, you need to request the logged-in homepage (again) after you successfully do the post.
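The copy-the-cookies suggestion above can be sketched roughly as follows. This is only an illustration: parse_cookie_header is a hypothetical helper, and the cookie names and values are made up, standing in for whatever the browser's developer tools show in the Cookie request header after a manual login.

```python
def parse_cookie_header(raw):
    """Turn a 'Cookie:' header value copied from the browser into a dict."""
    cookies = {}
    for part in raw.split(';'):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical value, pasted from the browser's developer tools after logging in.
copied = 'JSESSIONID=abc123; CASTGC=TGT-1-example'
cookies = parse_cookie_header(copied)
print(cookies)
```

The resulting dict can then be passed along with the request, e.g. requests.get(url, cookies=cookies), and will work until the copied session expires.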

oauth2 without api key

I'm dead lost :)
Goal is to log on to a website that uses OAuth2. However, the section I need doesn't have an API associated with it. So I need to log in using just the username and password, then navigate to the page in question and screen-scrape it to get my data.
I'm sure the problem isn't at the website; it's sitting at this keyboard. I've searched for examples and tried a whole bunch of guesses, but nothing is working.
Help would be gratefully accepted.
import sys
import requests
import oauth2 as oauth
r = requests.get(logon_url)
consumer = oauth.Consumer(key=user, secret=password)
client = oauth.Client(consumer)
resp, content = client.request(r.url, "GET")
token_url = resp['content-location']
# At this point i'm lost i'm just guessing on the rest
# the next doesn't give an error but i'm sure it's wrong
resp2, content2 = client.request(token_url, 'GET')
# save the cookie, i do have a cookie but not sure what i have
auth_token = resp['set-cookie']
Like so many things, it was just user error.
The code to get me to the page is quite simple, and the following does the trick. Thanks to Furas for the pointer.
# logon_url, page_url and payload (the login form fields) are defined elsewhere
with requests.session() as s1:
    # get login form
    r = s1.get(logon_url)
    # post the username and password
    resp = s1.post(r.url, data=payload)
    # get the admin page
    resp2 = s1.get(page_url)

scraping data from webpage with python 3, need to log in first

I checked this question but it only has one answer and it's a little over my head (just started with Python). I'm using Python 3.
I'm trying to scrape data from this page, but if you have a BP account, the page is a lot different/more useful. I need my program to log me in before I have BeautifulSoup get the data for me.
So far I have
from bs4 import BeautifulSoup
import urllib.request
import requests
username = 'myUsername'
password = 'myPassword'
from requests import session
payload = {'action': 'Log in',
           'Username: ': username,
           'Password: ': password}
# the next 3 lines are pretty much copied from a different StackOverflow
# question. I don't really understand what they're doing, and obviously these
# are where the problem is.
with session() as c:
    c.post('https://www.baseballprospectus.com/manageprofile.php', data=payload)
    response = c.get('http://www.baseballprospectus.com/sortable/index.php?cid=1820315')
soup = BeautifulSoup(response.content, "lxml")
for row in soup.find_all('tr')[7:]:
    cells = row.find_all('td')
    name = cells[1].text
    print(name)
The script does work; it just pulls the data from the site before it's logged in, so it's not the data I want.
Conceptually, there is no problem with your code. You're using a session object to send a login request, then with the same session you're sending a request for the desired page. This means that the cookies set by the login request should be kept for the second request. If you want to read more about the workings of the Session object, here's the relevant Requests documentation.
Since I don't have a valid login for Baseball Prospectus, I'll have to guess that something is wrong with the data you're sending to the login page. A quick inspection using the 'Network' tab in Chrome's Developer Tools shows that the login page, manageprofile.php, accepts four POST parameters:
username: myUsername
password: myPassword
action: muffinklezmer
nocache: some long number, e.g. 2417395155
However, you're sending a different set of parameter names ('Username: ' and 'Password: ' rather than 'username' and 'password'), and a different value for the 'action' parameter. Note that the parameter names have to match the original request exactly, otherwise manageprofile.php will not accept the login.
Try replacing the payload dictionary with this version:
payload = {
    'action': 'muffinklezmer',
    'username': username,
    'password': password}
If this doesn't work, try adding the 'nocache' parameter too, e.g.:
'nocache': '1437955145'
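Instead of reading the parameter names off the Network tab by hand, they can also be pulled out of the login page's HTML. The sketch below uses only the standard library; FormFieldCollector and form_fields are hypothetical names, and the sample form is made up, only loosely modelled on the parameters listed above (the real page may well differ).

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collects the name/value pairs of every <input> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        name = attributes.get('name')
        if name:
            # Inputs without a value attribute (e.g. empty text boxes) map to ''.
            self.fields[name] = attributes.get('value') or ''


def form_fields(html):
    """Return a dict of input names to their default values."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Made-up login form, for illustration only.
sample = '''
<form action="manageprofile.php" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="action" value="muffinklezmer">
  <input type="hidden" name="nocache" value="2417395155">
</form>
'''
print(sorted(form_fields(sample)))
```

The returned dict already carries the hidden values, so a payload can be built by copying it and filling in the username and password entries.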

Github api v3 access via python oauth2 library - Redirect issue

Environment - Python 2.7.3, webpy.
I'm trying to set up a simple three-legged OAuth flow for GitHub using Python web.py. Per the basic OAuth guide on GitHub I'm doing something like this:
import web,requests
import oauth2,pymongo,json
from oauth2client.client import OAuth2WebServerFlow
urls=('/', 'githublogin',
      '/session','session',
      '/githubcallback','githubCallback')
class githublogin:
    def GET(self):
        new_url = 'https://github.com/login/oauth/authorize'
        pay_load = {'client_id': '',
                    'client_secret':'',
                    'scope':'gist'
                    }
        headers = {'content-type': 'application/json'}
        r = requests.get(new_url, params=pay_load, headers=headers)
        return r.content
This is sending me to the GH login page. Once I sign in - GH is not redirecting me to the callback. The redirect_uri parameter is configured in the github application. I've double checked to make sure that's correct.
class githubCallback:
    def POST(self):
        data = web.data()
        print data
    def GET(self):
        print "callback called"
Instead in the browser I see
http://<hostname>:8080/session
and a 404 message, because I haven't configured the session URL. That's problem no 1. Problem no 2 - If I configure the session URL and print out the post message
class session:
    def POST(self):
        data = web.data()
        print data
    def GET(self):
        print "callback called"
I can see some data posted to the URL with something called 'authenticity_token'.
I've tried to use the python_oauth2 library but can't get past the authorization_url call, so I've tried the much simpler requests library instead. Can someone please point out what's going wrong here?
So here's how I solved this. Thanks to @Ivanzuzak for the requestb.in tip.
I'm using Python webpy.
import web,requests
import oauth2,json
urls=('/', 'githublogin',
      '/githubcallback','githubCallback')
render = web.template.render('templates/')
class githublogin:
    def GET(self):
        client_id = ''
        url_string = "https://github.com/login/oauth/authorize?client_id=" + client_id
        return render.index(url_string)
class githubCallback:
    def GET(self):
        data = json.loads(json.dumps(web.input()))
        print data['code']
        headers = {'content-type': 'application/json'}
        pay_load = {'client_id': '',
                    'client_secret':'',
                    'code' : data['code'] }
        r = requests.post('https://github.com/login/oauth/access_token', data=json.dumps(pay_load), headers=headers)
        token_temp = r.text.split('&')
        token = token_temp[0].split('=')
        access_token = token[1]
        repo_url = 'https://api.github.com/user?access_token=' + access_token
        response = requests.get(repo_url)
        final_data = response.content
        print final_data
app = web.application(urls,globals())
if __name__ == "__main__":
    app.run()
I was not using an HTML file before, but sending the request directly from the githublogin class. That didn't work. Here I use an HTML page that first directs the user to GitHub, where he'll log in. For this I added the HTML file below and render it using the templator.
$def with (parameter)
<html>
<head>
</head>
<body>
<p>Well, hello there!</p>
<p>We're going to now talk to the GitHub API. Ready? <a href=$parameter>Click here</a> to begin!</p>
<p>If that link doesn't work, remember to provide your own Client ID!</p>
</body>
</html>
This file is taken straight from the dev guide, with just the client_id parameter changed.
Another point to be noted is that in the requests.post method - passing the pay_load directly doesn't work. It has to be serialized using json.dumps.
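The answer's code pulls the token out of the response by splitting on '&' and '=', which relies on access_token being the first field. A slightly more defensive sketch is shown below, assuming the same form-encoded response shape that the answer parses. It is written for Python 3 (on the question's Python 2.7, parse_qs lives in the urlparse module instead), extract_access_token is a hypothetical helper, and the response bodies are made up.

```python
from urllib.parse import parse_qs


def extract_access_token(body):
    """Return the access_token from a form-encoded token response, or None."""
    fields = parse_qs(body)
    # parse_qs maps each key to a list of values; take the first one.
    values = fields.get('access_token')
    return values[0] if values else None


# Made-up response bodies in the 'key=value&key=value' shape the answer parses.
print(extract_access_token('access_token=abc123&scope=gist&token_type=bearer'))
print(extract_access_token('error=bad_verification_code'))
```

Returning None when the field is absent makes it easy to detect an error response before calling the API with a bogus token.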
I'm not sure what the problem is at your end, but try reproducing this flow below, first manually using the browser, and then using your python library. It will help you debug the issue.
Create a request bin on http://requestb.in/. A request bin is basically a service that logs all HTTP requests sent to it. You will use this instead of the callback, to log what is being sent to the callback. Copy the URL of the request bin, which is something like http://requestb.in/123a546b
Go to your OAuth application setup on GitHub (https://github.com/settings/applications), enter the setup of your specific application, and set the Callback URL to the URL of the request bin you just created.
Make a request to the GitHub OAuth page, with the client_id defined. Just enter this URL below into your browser, but change the YOUR_CLIENT_ID_HERE to be the client id of your OAuth application:
https://github.com/login/oauth/authorize?client_id=YOUR_CLIENT_ID_HERE
Enter your username and password and click Authorize. The GitHub app will then redirect you to the request bin service you created, and the URL in the browser should be something like (notice the code query parameter):
http://requestb.in/YOUR_REQUEST_BIN_ID?code=GITHUB_CODE
(for example, http://requestb.in/abc1def2?code=123a456b789cdef)
Also, the content of the page in the browser should be "ok" (this is the content returned by the request bin service).
Go to the request bin page that you created and refresh it. You will now see a log entry for the HTTP GET request that the GitHub OAuth server sent you, together with all the HTTP headers. Basically, you will see there the same code parameter that is present in the URL that you were redirected to. If you get this parameter, you are now ready to make a POST request with this code and your client secret, as described in step 2 of the guide you are using: http://developer.github.com/v3/oauth/#web-application-flow
Let me know if any of these steps are causing problems for you.
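When repeating the steps above from Python rather than by hand, the authorize URL can be assembled with urlencode so that the parameters are escaped properly. This is a minimal sketch: build_authorize_url is a hypothetical helper, and the optional scope parameter is taken from the question's own payload.

```python
from urllib.parse import urlencode

AUTHORIZE_URL = 'https://github.com/login/oauth/authorize'


def build_authorize_url(client_id, scope=None):
    """Build the URL that the user opens in the browser to start the flow."""
    params = {'client_id': client_id}
    if scope:
        params['scope'] = scope
    return AUTHORIZE_URL + '?' + urlencode(params)


print(build_authorize_url('YOUR_CLIENT_ID_HERE', scope='gist'))
```

The user has to open this URL in a browser; fetching it server-side with requests, as the question did, only returns GitHub's login page.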

How to "log in" to a website using Python's Requests module?

I am trying to post a request to log in to a website using the Requests module in Python but it's not really working. I'm new to this...so I can't figure out if I should make my Username and Password cookies or some type of HTTP authorization thing I found (??).
from pyquery import PyQuery
import requests
url = 'http://www.locationary.com/home/index2.jsp'
So now, I think I'm supposed to use "post" and cookies....
ck = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
r = requests.post(url, cookies=ck)
content = r.text
q = PyQuery(content)
title = q("title").text()
print title
I have a feeling that I'm doing the cookies thing wrong...I don't know.
If it doesn't log in correctly, the title of the home page should come out to "Locationary.com" and if it does, it should be "Home Page."
If you could maybe explain a few things about requests and cookies to me and help me out with this, I would greatly appreciate it. :D
Thanks.
...It still didn't really work yet. Okay...so this is what the home page HTML says before you log in:
</td><td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_email.gif"> </td>
<td><input class="Data_Entry_Field_Login" type="text" name="inUserName" id="inUserName" size="25"></td>
<td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_password.gif"> </td>
<td><input class="Data_Entry_Field_Login" type="password" name="inUserPass" id="inUserPass"></td>
So I think I'm doing it right, but the output is still "Locationary.com"
2nd EDIT:
I want to be able to stay logged in for a long time and whenever I request a page under that domain, I want the content to show up as if I were logged in.
I know you've found another solution, but for those like me who find this question, looking for the same thing, it can be achieved with requests as follows:
Firstly, as Marcus did, check the source of the login form to get three pieces of information - the url that the form posts to, and the name attributes of the username and password fields. In his example, they are inUserName and inUserPass.
Once you've got that, you can use a requests.Session() instance to make a post request to the login url with your login details as a payload. Making requests from a session instance is essentially the same as using requests normally, it simply adds persistence, allowing you to store and use cookies etc.
Assuming your login attempt was successful, you can simply use the session instance to make further requests to the site. The cookie that identifies you will be used to authorise the requests.
Example
import requests
# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'username',
    'inUserPass': 'password'
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('LOGIN_URL', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print(p.text)
    # An authorised request.
    r = s.get('A protected web page url')
    print(r.text)
    # etc...
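For the "something more intelligent" check mentioned in the comment above, one option is to look at the page title, since the question says the title reads "Locationary.com" when the login did not work. A rough sketch, where page_title and looks_logged_in are hypothetical helpers and the logged-out title is taken from the question:

```python
import re


def page_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def looks_logged_in(html, logged_out_title='Locationary.com'):
    """Guess the login state by comparing the title with the logged-out one."""
    title = page_title(html)
    return title is not None and title != logged_out_title
```

Inside the with block this could be used as looks_logged_in(p.text) right after the post, before requesting any protected pages.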
If the information you want is on the page you are directed to immediately after login...
Let's call your ck variable payload instead, like in the python-requests docs:
payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'
requests.post(url, data=payload)
Otherwise...
See https://stackoverflow.com/a/17633072/111362 below.
Let me try to make it simple. Suppose the URL of the site is http://example.com/ and you need to sign in by filling in a username and password. Go to the login page, say http://example.com/login.php, view its source code and search for the action URL. It will be in a form tag, something like
<form name="loginform" method="post" action="userinfo.php">
Now take userinfo.php and make it an absolute URL, which will be 'http://example.com/userinfo.php', then run a simple Python script:
import requests
url = 'http://example.com/userinfo.php'
values = {'username': 'user',
          'password': 'pass'}
r = requests.post(url, data=values)
print(r.content)
I hope that this helps someone somewhere someday.
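The step of turning the form's action attribute into an absolute URL does not have to be done by hand; the standard library's urljoin resolves it against the login page's address the same way a browser would. A small sketch using the example values from this answer:

```python
from urllib.parse import urljoin

login_page = 'http://example.com/login.php'
# Value of the action attribute found in the <form> tag.
action = 'userinfo.php'

post_url = urljoin(login_page, action)
print(post_url)  # http://example.com/userinfo.php
```

This also copes with actions that are already absolute or that start with a slash, so the same line works across different sites.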
The requests.Session() solution also works for logging into a form with CSRF protection (as used in Flask-WTF forms). Check whether a csrf_token is required as a hidden field and, if so, add it to the payload with the username and password:
import requests
from bs4 import BeautifulSoup
# server_name is the base URL of the site, defined elsewhere
payload = {
    'email': 'email@example.com',
    'password': 'passw0rd'
}
with requests.Session() as sess:
    res = sess.get(server_name + '/signin')
    signin = BeautifulSoup(res.content, 'html.parser')
    payload['csrf_token'] = signin.find('input', id='csrf_token')['value']
    res = sess.post(server_name + '/auth/login', data=payload)
Find out the names of the inputs used on the website's form for usernames <...name=username.../> and passwords <...name=password../> and replace them in the script below. Also replace the URL to point at the site you want to log into.
login.py
#!/usr/bin/env python
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
payload = { 'username': 'user@email.com', 'password': 'blahblahsecretpassw0rd' }
url = 'https://website.com/login.html'
requests.post(url, data=payload, verify=False)
The disable_warnings(InsecureRequestWarning) call silences the warning that would otherwise be printed for every request made with verify=False, which is the setting that lets the script log into sites with unverified SSL certificates.
Extra:
To run this script from the command line on a UNIX-based system, place it in a directory, e.g. ~/home/scripts, and add this directory to your path in ~/.bash_profile or a similar file used by the terminal.
# Custom scripts
export CUSTOM_SCRIPTS=$HOME/home/scripts
export PATH=$CUSTOM_SCRIPTS:$PATH
Then make the script executable and create a link to it inside ~/home/scripts
chmod +x ~/home/scripts/login.py
ln -s ~/home/scripts/login.py ~/home/scripts/login
Close your terminal, start a new one, and run login
Some pages may require more than a login and password. There may even be hidden fields. The most reliable way is to use the browser's inspect tool and look at the Network tab while logging in, to see what data is being sent.
