python: scrape site with randomized attributes

Full disclosure: I'm brand new to web scraping, and my experience with html is very limited. Please tell me if I am doing something wrong, or if you need more info.
Private content: MY_PASSWORD, MY_USERNAME, MY_SCHOOL, SITE2
My goal: Scrape a school webpage (https://SITE2.MY_SCHOOL.edu/Main_Page.Asp?Page=Select_Subject) that I have access to with my student account.
Website info: When I enter https://SITE2.MY_SCHOOL.edu/ on a browser, I get redirected to https://login.MY_SCHOOL.edu/?App=J4200 to login to my school account.
After logging in, I get redirected back to https://SITE2.MY_SCHOOL.edu/Main_Page.Asp?Page=Select_Subject (the page I want to scrape)
Problem: Every time https://login.MY_SCHOOL.edu/?App=J4200 is refreshed, the name attribute of the input with id/type 'password' is randomized, and so is the value attribute of the input named 'EncryptedStamp'. I am able to get all of this information from the loaded page, as shown below, but I end up back on the login page when I call s.post('https://login.MY_SCHOOL.edu/?App=J4200', data=form).
import requests, lxml.html
s = requests.session()
login = s.get("https://login.MY_SCHOOL.edu/")
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
# The input type 'password' has a name tag that changes on refresh
# See the example form below
password_input = login_html.xpath(r'//form//input[@type="password"]')
password_nametag = password_input[0].name
jse_id = s.cookies['JSESSIONID']
# Create dict used as 'data' argument of POST request
form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
form['AlternateID'] = 'MY_USERNAME' # AlternateID is name tag for username
form[password_nametag] = 'MY_PASSWORD'
form['JSESSIONID'] = jse_id
Example form:
{'AlternateID': 'MY_USERNAME',
'App': 'MY_SCHOOLNet',
'EncryptedStamp': 'ZYWSNQLPJKMH',
'JSESSIONID': 'EFC8B3319A63332484DFE8F90E4E0272',
'MXVQY': 'MY_PASSWORD'}
Response:
response = s.post('https://login.MY_SCHOOL.edu/?App=J4200', data=form)
# s.get will return the same results
print(response.url)
'https://login.MY_SCHOOL.edu/'
# I think this should be https://SITE2.MY_SCHOOL.edu/Main_Page.Asp?Page=Select_Subject
Rerun all form lines to update form:
{'AlternateID': 'MY_USERNAME',
'App': 'MY_SCHOOLNet',
'EncryptedStamp':'ZYWSNQLPJKMH',
'JSESSIONID':'EFC8B3319A63332484DFE8F90E4E0272',
'RYCSX': 'MY_PASSWORD'}
What I think the solution is: Update the page without refreshing it, to prevent the two attributes from changing. I have no idea how to do that, though. Googling suggested jQuery, but that is a steep learning curve for results that I may not even get.
Thanks for reading, and I look forward to any suggestions.
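One way to keep the randomized values consistent is to fetch the login page once, build the form from that single response, and post it with the same session object. The sketch below shows only the form-building step, using the standard library's html.parser instead of lxml so that it runs without extra packages. The sample HTML is an assumption modelled on the example form above, and LoginFormParser and build_login_form are hypothetical helper names.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect hidden input values and the name of the password input."""

    def __init__(self):
        super().__init__()
        self.hidden = {}
        self.password_name = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        input_type = (attr.get("type") or "").lower()
        if input_type == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""
        elif input_type == "password":
            # This is the attribute that changes on every page load.
            self.password_name = attr.get("name")


def build_login_form(html, username, password):
    """Build the POST data from one copy of the login page."""
    parser = LoginFormParser()
    parser.feed(html)
    if parser.password_name is None:
        raise ValueError("no password input found")
    form = dict(parser.hidden)
    form["AlternateID"] = username
    form[parser.password_name] = password
    return form


# Assumed markup, modelled on the example form in the question.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="App" value="J4200">
  <input type="hidden" name="EncryptedStamp" value="ZYWSNQLPJKMH">
  <input type="text" name="AlternateID">
  <input type="password" id="password" name="MXVQY">
</form>
"""

form = build_login_form(SAMPLE_HTML, "MY_USERNAME", "MY_PASSWORD")
print(form)
```

In practice the HTML would come from a single s.get(...) and the resulting dict would go straight to s.post(...) on the same session. A requests session sends its cookies, including JSESSIONID, on its own, so whether the cookie also has to appear in the form data is something to check against what the browser actually submits.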

Related

how to write the 'remember information' checkbox's name when using requests.post

I want to grab data from a social platform, but I need to log in first. When I set the postdata, I need to include the 'remember login information' option in addition to the username and password. I found the checkbox in the Elements panel; it has the value "on", but it does not have a name (the name attribute is null). How can I write this in the postdata shown below?
import requests
session = requests.session()
postdata = {
    'username': 'xxx',
    'password': 'xxx'
}

How to retrieve Google Contacts in Django using oauth2.0?

My app is registered in Google and I have enabled the contacts API.
In the first view I am getting the access token and I am redirecting the user to the Google confirmation page where he will be prompted to give access to his contacts:
SCOPE = 'https://www.google.com/m8/feeds/'
CLIENT_ID = 'xxxxxxxx'
CLIENT_SECRET = 'xxxxxxxxx'
APPLICATION= 'example.com'
USER_AGENT = 'dummy-sample'
APPLICATION_REDIRECT_URI = 'http://example.com/oauth2callback/'
def import_contacts(request):
    auth_token = gdata.gauth.OAuth2Token(
        client_id=CLIENT_ID, client_secret=CLIENT_SECRET,
        scope=SCOPE, user_agent=USER_AGENT)
    authorize_url = auth_token.generate_authorize_url(
        redirect_uri=APPLICATION_REDIRECT_URI)
    return redirect(authorize_url)
If the user clicks Allow, then Google redirects to my handler which shall retrieve the contacts:
def oauth2callback(request):
    code = request.GET.get('code', '')
    redirect_url = 'http://example.com/oauth2callback?code=%s' % code
    url = atom.http_core.ParseUri(redirect_url)
    auth_token.get_access_token(url.query)
    client = gdata.contacts.service.ContactsService(source=APPLICATION)
    auth_token.authorize(client)
    feed = client.GetContactsFeed()
As you can see, my problem is how to get the auth_token object in the second view, because this code fails on the line auth_token.get_access_token(url.query).
I have tried multiple options without success, such as putting the object in the session, but it is not serializable. I also tried gdata.gauth.token_to_blob(auth_token), but then I can retrieve only the token string and not the object. Working with gdata.gauth.ae_save() and ae_load() seems to require Google App Engine in some way.
The alternative approach I see for getting the contacts is to request them directly in the first Django view with the access token, instead of exchanging the code for the token:
r = requests.get('https://www.google.com/m8/feeds/contacts/default/full?access_token=%s&alt=json&max-results=1000&start-index=1' % (self.access_token))
But this does not redirect the users to the Google page so that they can explicitly give their approval. Instead, it fetches the contacts directly using the token as credentials. Is this a common practice? What do you think? I think the first approach is the preferred one, but first I have to manage to get the auth_token object.

Finally I was able to serialize the object and put it in the session, which is not a secure way to go, but at least it points me in the right direction so that I can continue with my business logic related to the social apps.
import gdata.contacts.client
def import_contacts(request):
    auth_token = gdata.gauth.OAuth2Token(
        client_id=CLIENT_ID, client_secret=CLIENT_SECRET,
        scope=SCOPE, user_agent=USER_AGENT)
    authorize_url = auth_token.generate_authorize_url(
        redirect_uri=APPLICATION_REDIRECT_URI)
    # Put the object in the session
    request.session['auth_token'] = gdata.gauth.token_to_blob(auth_token)
    return redirect(authorize_url)
def oauth2callback(request):
    code = request.GET.get('code', '')
    redirect_url = 'http://myapp.com/oauth2callback?code=%s' % code
    url = atom.http_core.ParseUri(redirect_url)
    # Retrieve the object from the session
    auth_token = gdata.gauth.token_from_blob(request.session['auth_token'])
    # Here is the tricky part: we need to add the redirect_uri to the object in addition
    auth_token.redirect_uri = APPLICATION_REDIRECT_URI
    # And this was my problem in my question above. Now I have the object in the handler view and can use it to retrieve the contacts.
    auth_token.get_access_token(url.query)
    # The second change I did was to create a ContactsClient instead of ContactsService
    client = gdata.contacts.client.ContactsClient(source=APPLICATION)
    auth_token.authorize(client)
    feed = client.GetContacts()
    all_emails = []
    for i, entry in enumerate(feed.entry):
        # Loop and fill the list with emails here
        ...
    return render_to_response('xxx/import_contacts.html', {'all_emails': all_emails},
                              context_instance=RequestContext(request))

Reading page's messages with Python Facebook SDK

Basically, I need to get all the messages of a page using the Facebook SDK in Python.
Following a tutorial, I got to this point:
import facebook
def main():
    cfg = {
        "page_id" : "MY PAGE ID",
        "access_token" : "LONG LIVE ACCESS TOKEN"
    }
    api = get_api(cfg)
    msg = "Hre"
    status = api.put_wall_post(msg) #used to post to wall message Hre
    x = api.get_object('/'+str(MY PAGE ID)+"/conversations/") #Give actual conversations
def get_api(cfg):
    graph = facebook.GraphAPI(cfg['access_token'])
    resp = graph.get_object('me/accounts')
    page_access_token = None
    for page in resp['data']:
        if page['id'] == cfg['page_id']:
            page_access_token = page['access_token']
    graph = facebook.GraphAPI(page_access_token)
    return graph
if __name__ == "__main__":
    main()
The first problem is that api.get_object('/'+str(MY PAGE ID)+"/conversations/") returns a dictionary containing a lot of information, but what I would like to see is the messages that were sent to me, while for now it prints the id of the user who sent me a message.
The output looks like the following:
{u'paging': {u'next': u'https://graph.facebook.com/v2.4/571499452991432/conversations?access_token=Token&limit=25&until=1441825848&__paging_token=enc_AdCqaKAP3e1NU9MGSsvSdzDPIIDtB2ZCe2hCYfk7ft5ZAjRhsuVEL7eFYOOCdQ8okvuhZA5iQWaYZBBbrZCRNW8uzWmgnKGl69KKt4catxZAvQYCus7gZDZD', u'previous': u'https://graph.facebook.com/v2.4/571499452991432/conversations?access_token=token&limit=25&since=1441825848&__paging_token=enc_AdCqaKAP3e1NU9MGSsvSdzDPIIDtB2ZCe2hCYfk7ft5ZAjRhsuVEL7eFYOOCdQ8okvuhZA5iQWaYZBBbrZCRNW8uzWmgnKGl69KKt4catxZAvQYCus7gZDZD&__previous=1'}, u'data': [{u'link': u'/communityticino/manager/messages/?mercurythreadid=user%3A1055476438&threadid=mid.1441825847634%3Af2e0247f54f5c4d222&folder=inbox', u'id': u't_mid.1441825847634:f2e0247f54f5c4d222', u'updated_time': u'2015-09-09T19:10:48+0000'}]}
which is basically paging and data.
Given this, is there a way to read the conversation?

To get the message content, you first need to request the individual messages in the conversation, which are accessible through the 'id' field in the dictionary you copied, the result of
x = api.get_object('/'+str(MY PAGE ID)+"/conversations/") #Give actual conversations
You can request the messages in the conversation by calling
msg = api.get_object('/'+<message id>)
Here it gets tricky, because following the graph api documentation you should receive back a dictionary with ALL the possible fields, including the 'message' (content) field. The function however returns only the fields 'created_time' and 'id'.
Thanks to this other question Request fields in Python Facebook SDK I found that you can request for those fields by adding a dict with such fields specified in the arguments of the graph.get_object() function. As far as I know this is undocumented in the facebook sdk reference for python.
The correct code is
args = {'fields' : 'message'}
msg = api.get_object('/'+<message id>, **args)
Similar question: Read facebook messages using python sdk
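Putting the two steps together, a minimal sketch: collect the conversation ids from the response, then request each one with the fields argument. The helper name conversation_ids and the sample dictionary are illustrative (the sample is a trimmed copy of the output shown in the question), and the commented call assumes an api object created as in the question.

```python
def conversation_ids(response):
    """Return the 'id' of every entry in a conversations response."""
    return [item["id"] for item in response.get("data", [])]


# Trimmed copy of the response shown in the question.
sample = {
    "paging": {},
    "data": [
        {
            "id": "t_mid.1441825847634:f2e0247f54f5c4d222",
            "updated_time": "2015-09-09T19:10:48+0000",
        }
    ],
}

ids = conversation_ids(sample)
print(ids)

# With a real GraphAPI object, each id would then be requested with the
# fields argument described above, for example:
# for conv_id in ids:
#     msg = api.get_object('/' + conv_id, fields='message')
```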

Paypal Payflow Pro return url does not work correctly - 'processTransaction.do' appended - raises 404

Currently I am trying to migrate a working PHP PayPal Payflow implementation to a new Python-based system.
I use secure token together with hosted checkout pages. The secure token works fine and I get redirected to the checkout page as well (although it has horrible formatting errors).
THE PROBLEM: after the payment it should redirect to the return url. This works BUT 'processTransaction.do' is appended to it. So my return url is defined as:
'https://mywebsite.com/paypal/succes/'
but I get redirected to
'https://mywebsite.com/paypal/succes/processTransaction.do'
and this raises a 404.
My secure token request parameters:
params = {}
params["PARTNER"] = "paypal"
params["VENDOR"] = "...."
params["TRXTYPE"] = "S"
params["AMT"] = payment_amount #amount to pay
params["CREATESECURETOKEN"] = "Y"
params["SECURETOKENID"] = time.time() #needs to be unique
params["USER"] = "...."
params["PWD"] = "...."
Then I send the request and catch the return which looks like this:
RESULT=0&SECURETOKEN=QQQc0rQZ8TkKSNMqU3Mg2og7o&SECURETOKENID=1431563231.24&RESPMSG=Approved
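The token response is a string of NAME=value pairs joined by &. A minimal sketch of splitting it into a dictionary before building the second request, assuming no value contains & or = (the helper name parse_nvp is hypothetical, and Python 3 syntax is used here):

```python
def parse_nvp(text):
    """Split 'A=1&B=2' into {'A': '1', 'B': '2'}."""
    pairs = {}
    for chunk in text.strip().split("&"):
        if not chunk:
            continue
        # partition keeps everything after the first '=' as the value
        name, _, value = chunk.partition("=")
        pairs[name] = value
    return pairs


response_text = ("RESULT=0&SECURETOKEN=QQQc0rQZ8TkKSNMqU3Mg2og7o"
                 "&SECURETOKENID=1431563231.24&RESPMSG=Approved")
fields = parse_nvp(response_text)

# Only continue to the hosted checkout page if the token request succeeded.
if fields.get("RESULT") == "0":
    securetoken = fields["SECURETOKEN"]
    securetokenid = fields["SECURETOKENID"]
```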
Afterwards I send the request for the checkout page with the following parameters:
params["SECURETOKEN"] = securetoken
params["SECURETOKENID"] = securetokenid
to: https://payflowpro.paypal.com
I use this code to send the requests:
data = urllib.urlencode(params)
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
response_text = response.read()
The return url is set in the paypal manager with return type as POST and "Show Confirmation Page" is set to "On my website".
Does somebody know what is wrong and how to fix it?
Thanks!

How to get page access token

I created an app on Facebook and a page in my profile. In the "Select how your app integrates with Facebook" section I didn't select any option, because I only want to post text to a Facebook page (maybe this is the problem?).
I have this code:
FACEBOOK_APP_ID = 'myappid'
FACEBOOK_APP_SECRET = 'myappsecret'
FACEBOOK_PROFILE_ID = 'myprofileid'
oauth_args = dict(client_id = FACEBOOK_APP_ID,
                  client_secret = FACEBOOK_APP_SECRET,
                  scope = 'publish_stream',
                  grant_type = 'client_credentials'
                  )
oauth_response = urllib.urlopen('https://graph.facebook.com/oauth/access_token?' + urllib.urlencode(oauth_args)).read()
oauth_response looks good
but when I run:
resp = urllib.urlopen('https://graph.facebook.com/me/accounts?'+oauth_response).read()
I get error:
{"error":{"message":"An active access token must be used to query information about the current user.","type":"OAuthException","code":2500}}
What am I doing wrong? I want to post on page wall some text when, for example, I click button on my website (Django).
UPDATE:
OK, I now get the pages data as JSON. I parse it and get page_access_token, but when I call this:
attach = {
    "name": 'Hello world',
    "link": 'http://linktosite',
    "caption": 'test post',
    "description": 'some test'
}
facebook_graph = facebook.GraphAPI(page_access_token)
try:
    response = facebook_graph.put_wall_post('', attachment=attach)
except facebook.GraphAPIError as e:
    print e
I get error: "The target user has not authorized this action"
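The parsing step mentioned in the update could look like the sketch below. It assumes the me/accounts response is JSON with a data list whose entries carry id and access_token fields. The name find_page_token and the sample values are made up for illustration, and Python 3 syntax is used.

```python
import json


def find_page_token(accounts_json, page_id):
    """Return the access token of the page with the given id, or None."""
    accounts = json.loads(accounts_json)
    for page in accounts.get("data", []):
        if page.get("id") == page_id:
            return page.get("access_token")
    return None


# Made-up sample standing in for the me/accounts response body.
sample = json.dumps({"data": [
    {"id": "111", "name": "Other page", "access_token": "token-for-111"},
    {"id": "222", "name": "My page", "access_token": "token-for-222"},
]})

page_access_token = find_page_token(sample, "222")
print(page_access_token)
```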

This question is basically asking about the same problem, and the answer seems to be what you're looking for: (OAuthException - #2500) An active access token must be used to query information about the current user

If the page_access_token is correct, I guess you (the page admin) have not yet granted your Facebook application permission to post messages to the Facebook page.
Check the Facebook login function on the client side to see whether you request enough permissions. The scope option should be something like 'manage_pages publish_stream photo_upload...', depending on your requirements, rather than only 'manage_pages'.
