I am new to Python and trying to crawl some information from my school website (which is based on aspx).
What I am trying to do is:
http://url.of.the.page
Login
Open the 4th link on the left
I was trying to log into my account by using req = urllib2.Request(url, data) (the data contains the id, password and some other information I can see through Wireshark), together with result = opener.open(req) and print result.read().
Unfortunately the result printed out is the same as the original login page, so I guess I did not log in successfully; the result is also the same as when I click the 4th link without logging in.
(Further proof: when I tried to open another link on the page, I was redirected to the login page.)
My questions are:
Did I really fail to log in?
If so, what is the correct way to log in?
My code is as follows:
# -*- coding: utf-8 -*-
import urllib2
import urllib
import cookielib
from bs4 import BeautifulSoup
import datetime
import time
from urlgrabber.keepalive import HTTPHandler
def get_ViewState(soup):
    view_input = soup.find(id="__VIEWSTATE")
    return view_input['value']

def get_EventValidation(soup):
    event_input = soup.find(id="__EVENTVALIDATION")
    return event_input['value']
cookie = cookielib.CookieJar()
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie),keepalive_handler)
urllib2.install_opener(opener)
__url = 'http://url.of.the.page'
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'),
                     ('Connection', 'Keep-Alive'),
                     ('Referer', __url)]
page = urllib.urlopen(__url).read()
soup = BeautifulSoup(page)
viewstate = get_ViewState(soup)
eventvalidation = get_EventValidation(soup)
postdata = urllib.urlencode({
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': viewstate,
    'TxtStudentId': 'xxxxxxx',
    'TxtPassword': 'xxxxxxx',
    'BtnLogin': 'login',
    '__EVENTVALIDATION': eventvalidation
})
req = urllib2.Request(
    url=__url,
    data=postdata
)
result = opener.open(req)
print result.read()
# result = opener.open(req)
# print result.info()
# print result
# print result.read()
print "------------------------------------------------"
#after login, I need to get the scores table
__queryUrl = 'http://url.of.the.page?key=0'
now = datetime.datetime.now()
opener.addheaders = [('Referer', 'http://url.of.the.page?i='+now.strftime('%H:%M:%S'))]
result = opener.open(__queryUrl)
print result.read()
for item in cookie:
    print 'Cookie:Name = ' + item.name
    print 'Cookie:Value = ' + item.value
For logging in, you will probably have to use an API provided by the website, because it will likely check whether you are a robot. For clicking the fourth link, simply view the page source (HTML) and find the class and ID of the link you want. Then, with some Googling, you can add that to the code (see the sketch below) and you are all set :)
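For instance, a minimal sketch of locating such a link with BeautifulSoup (which the question already imports); the id and class values here are made-up placeholders for whatever the real page source shows:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://url.of.the.page').read()
soup = BeautifulSoup(html, 'html.parser')

# By position: the 4th link on the page (index 3)
fourth_link = soup.find_all('a')[3]
print(fourth_link.get('href'))

# Or by a hypothetical id/class taken from the page source
by_id = soup.find('a', id='lnkScores')         # 'lnkScores' is made up
by_class = soup.find('a', class_='menu-item')  # 'menu-item' is made up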
By capturing the packets I found out that my POST request got an OK response from the server, which means I did log in successfully.
The reason the GET request got a 302 Found in return is that I didn't include a cookie in its header. I was using urllib2, and it did not include the cookie in the GET request automatically.
So I hard-coded the cookie into the header by doing the following:
cookie = cookielib.CookieJar()
ckName = ''
ckValue = ''
for item in cookie:
    ckName = item.name
    ckValue = item.value
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'),
                     ('Referer', 'http://202.120.108.14/ecustedu/K_StudentQuery/K_StudentQueryLeft.aspx?i='+now.strftime('%H:%M:%S')),
                     ('Cookie', ckName + '=' + ckValue)]
I am trying to get data from a page. I've read the posts of other people who had the same problem: making a GET request first to get cookies, setting headers, none of it works. When I examine the output of print(soup.title.get_text()), I still get "Log In" as the returned title. The login_data has the same key names as the HTML <input> elements, e.g. <input name=ctl00$cphMain$logIn$UserName ...> for the username and <input name=ctl00$cphMain$logIn$Password ...> for the password. Not sure what to do next. I can't use Selenium, as I have to run this script on an EC2 instance that is running a Splunk server.
import requests
from bs4 import BeautifulSoup
link = "****"
login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"
login_data = {
    "ctl00$cphMain$logIn$UserName": "****",
    "ctl00$cphMain$logIn$Password": "****"
}
with requests.Session() as session:
    z = session.get(login_URL)
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
        'Content-Type': 'application/json;charset=UTF-8',
    }
    post = session.post(login_URL, data=login_data)
    response = session.get(link)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.get_text())
I actually found the answer.
You can basically just open the Network tab in Chrome's developer tools and copy the request as a cURL command. Then use a website or tool to convert the cURL command into its equivalent in your programming language (Python, Node, Java, and so forth).
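For illustration only, a request copied as cURL might translate into requests roughly like this; the URL, cookie and header values below are made-up placeholders, not the real ones captured from the site:
import requests

# Placeholder values taken from the browser's "Copy as cURL" output
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'ASP.NET_SessionId=abc123',  # made-up session cookie
}

response = requests.get('https://example.com/members/page.aspx', headers=headers)
print(response.status_code)
print(response.text[:200])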
I am working on a project for science that scrapes skyward.smsd.org. The page opens in a pop-up, but the top of the page shows a URL; when I go to that URL outside the pop-up, it says my session has expired, and I can find no way around this. I am also getting an invalid syntax error at else: msg. Can anyone help me find a solution to these issues?
while True:
    import requests
    from bs4 import BeautifulSoup
    import time
    from time import sleep

    url = "https://skyward.smsd.org/scripts/wsisa.dll/WService=wsEAplus/sfcalendar002.w"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")

    from requests.packages.urllib3 import add_stderr_logger
    add_stderr_logger()

    s = requests.Session()
    s.headers['User-Agent'] = 'Mozilla/5.0'
    login = {login: 3078774, password: (MY PASSWORD)}
    login_response = s.post(url, data=login)
    for r in login_response.history:
        if r.status_code == 401:  # 401 means authentication failed
            sys.exit(1)  # abort
    pdf_response = s.get(pdf_url)  # Your cookies and headers are automatically included

    if str(soup).find("skyward") == -1:
        continue
        time.sleep(60)
    else:
        msg = 'Subject: This is the script talking, check Skyward'
        # Possibility to make this tell you exactly what has changed
        # A text feature that goes out daily for missing assignments
        fromaddr = '3078774#smsd.org'
        toaddrs = ['3078774#smsd.org']
        print('From: ' + fromaddr)
        print('To: ' + str(toaddrs))
        print('Message: ' + msg)
        break
So I am not sure why, but having read plenty of similar issues and resolved questions on here, I can't see why my request is not printing the page behind the login form. I am using a simple webpage to test it out, where I am registered. Providing the credentials in my payload and holding the cookie using .Session() should open my second URL. But instead I get the login form printed. I checked with Wireshark and Burp Suite, and everything looks normal when I run the script; it looks just as if I had logged in to the webpage.
Here is the code:
# -*- coding: utf-8 -*-
import requests
url = 'http://www.chicago-cz.com/forum/login.php'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'}
payload = {
    "username": "User_321",
    "password": "S33cr3t",
}
with requests.Session() as s:
    p = s.post(url, headers=headers, data=payload)
    #print p.text

    # URL behind login (Inbox)
    r = s.get('http://www.chicago-cz.com/forum/privmsg.php?folder=inbox')
    print r.content
I am trying to make a script that gets similar images from Google using a URL, reusing a part of this code.
The problem is that I want to get to this link, because from it I can reach the images themselves by clicking on the "Search by image" link. But when I use the script, I get the exact same page, only without the "Search by image" link.
I would like to know why, and whether there is a way to fix it.
Thanks a lot in advance!
P.S. Here's the code:
import os
from urllib2 import Request, urlopen
from cookielib import LWPCookieJar

USER_AGENT = r"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)"
LOCAL_PATH = r"C:\scripts\google_search"
COOKIE_JAR_FILE = r".google-cookie"

class google_search(object):
    def cleanup(self):
        if os.path.isfile(self.cookie_jar_path):
            os.remove(self.cookie_jar_path)
        os.chdir(LOCAL_PATH)
        for html in os.listdir("."):
            if html.endswith(".html"):
                os.remove(html)

    def __init__(self, cookie_jar_path):
        self.cookie_jar_path = cookie_jar_path
        self.cookie_jar = LWPCookieJar(self.cookie_jar_path)
        self.counter = 0
        self.cleanup()
        try:
            self.cookie_jar.load()  # was cookie.load(), a NameError silently swallowed below
        except Exception:
            pass

    def get_html(self, url):
        request = Request(url=url)
        request.add_header("User-Agent", USER_AGENT)
        self.cookie_jar.add_cookie_header(request)
        response = urlopen(request)
        self.cookie_jar.extract_cookies(response, request)
        html_response = response.read()
        response.close()
        self.cookie_jar.save()
        return html_response

def main():
    url_2 = r"http://www.google.com/search?hl=en&q=http%3A%2F%2Fi.imgur.com%2FqGRxTNA.jpg&btnG=Google+Search"
    search = google_search(os.path.join(LOCAL_PATH, COOKIE_JAR_FILE))
    html_2 = search.get_html(url_2)

if __name__ == '__main__':
    main()
I tried something of that sort a few weeks back. My server used to reject my requests with a 404 because I was not setting a proper user agent.
In your case, you are also not setting the user agent properly. Here is the User-Agent header I use:
USER_AGENT = r"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"
PS: I hope you have read Google's terms and conditions. You might be violating them.
I'm trying to log in to Wikipedia using a Python script, but despite following the instructions here, I just can't get it to work.
import urllib
import urllib2
import cookielib
username = 'myname'
password = 'mypassword'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6")]
login_data = urllib.urlencode({'wpName' : username, 'wpPassword' : password})
opener.open('http://en.wikipedia.org/w/index.php?title=Special:UserLogin', login_data)
resp = opener.open('http://en.wikipedia.org/wiki/Special:Watchlist')
All I get is the "You're not logged in" page. I tried logging in to another site with the script, with the same negative result. I suspect it either has something to do with cookies, or I'm missing something incredibly simple here, but I just cannot find it.
If you inspect the raw request sent to the login URL (with the help of a tool such as Charles Proxy), you will see that it actually sends 4 parameters: wpName, wpPassword, wpLoginAttempt and wpLoginToken. The first 3 are static and can be filled in ahead of time; the 4th, however, needs to be parsed from the HTML of the login page. You will need to post this parsed value, in addition to the other 3, to the login URL in order to log in.
Here is the working code using Requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup as bs

def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',
}

with requests.session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')
Adding these two lines
r = bs(response.content)
print r.get_text()
I should be able to tell whether I'm logged in or not, right? I keep seeing "Please log in to view or edit items on your watchlist.", even though I'm using the clean code given above, with my own login and password.
Where is the mistake?
Wikipedia now forces HTTPS and requires additional parameters, and wpLoginAttempt has become wploginattempt; here is an updated version of K Z's initial answer:
import requests
from bs4 import BeautifulSoup as bs

def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',
}

with requests.session() as s:
    resp = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
You need to add the header Content-Type: application/x-www-form-urlencoded to your POST request.
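A sketch of how that might look with urllib2, as used in the question; note this simply illustrates the suggestion above (urllib2 may already send this header for form posts), and the credentials are placeholders:
import urllib
import urllib2

login_data = urllib.urlencode({'wpName': 'myname', 'wpPassword': 'mypassword'})
req = urllib2.Request('https://en.wikipedia.org/w/index.php?title=Special:UserLogin',
                      data=login_data,
                      headers={'Content-Type': 'application/x-www-form-urlencoded'})
resp = urllib2.urlopen(req)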
I also added the following lines and see myself as not logged in.
page = response.text.encode('utf8')
# str.find() returns -1 (truthy) when the substring is absent, so test membership instead
if 'Not logged in' in page:
    print 'You are not logged in. :('
else:
    print 'YOU ARE LOGGED IN! :)'