Extremely strange web-scraping issue: POST request not behaving as expected - Python

I'm attempting to programmatically submit some data to a form on our company's admin page rather than doing it by hand.
I've written numerous other tools which scrape this website and manipulate data. However, for some reason, this particular one is giving me a ton of trouble.
Walking through with a browser:
Below are the pages I'm attempting to scrape and post data to. Note that these pages usually appear in JavaScript shadowboxes; however, everything functions fine with JavaScript disabled, so I'm assuming JavaScript is not the cause of the scraper trouble.
(Note: since this is a company page, I've replaced all the form fields with junk titles, so, for instance, the client numbers are completely made up.)
Also, since it is a company page behind a username/password wall, I can't give out the website for testing, so I've attempted to inject as much detail as possible into this post!
Main entry point is here:
From this page, I click "Add New form", which opens the next page in a new tab (since JavaScript is disabled).
On this page, I fill out the small form and click Submit, which loads the next page displaying a success message.
Should be simple, right?
Code attempt 1: Mechanize
import mechanize
import base64
import cookielib

br = mechanize.Browser()
username = 'USERNAME'
password = 'PASSWORD'
br.addheaders.append(('Authorization',
                      'Basic %s' % base64.encodestring('%s:%s' % (username, password))))
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML,'
                  ' like Gecko) Chrome/25.0.1364.172 Safari/537.22')]
br.open('www.our_company_page.com/adm/add_forms.php')

links = [link for link in br.links()]
# Follow "Add a form" link
response = br.follow_link(links[0])

br.select_form(nr=0)
br.form.set_all_readonly(False)
br.form['formNumber'] = "FROM_PYTHON"
br.form['RevisionNumber'] = ['20']
br.form['FormType'] = ['H(num)']
response = br.submit()
print response.read()  # Shows the exact same page! >:(
So, as you can see, I attempt to duplicate the steps I would take in a browser: load the initial /adm/forms page, follow the first link (Add a Form), fill out the form, and click the submit button. But here's where it gets screwy. The response that mechanize returns is the exact same page with the form. No error messages, no success messages, and when I manually check our admin page, no changes have been made.
Inspecting Network Activity
Frustrated, I opened Chrome and watched the Network tab as I manually filled out and submitted the form in the browser.
Upon submitting the form, this is the network activity:
Seems pretty straightforward to me. There's the POST, then a GET for the CSS files, and another GET for the jQuery library. There's also a GET for some kind of image, but I have no idea what that is for.
Inspecting the details of the POST request:
After some Googling about scraping problems, I saw a suggestion that the server may be expecting a certain header, and that I should simply copy everything from the browser's POST request and then remove headers one at a time until I figure out which one is the important one. So I did just that: I copied every bit of information from the Network tab and stuck it in my POST request.
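That elimination experiment is easy to script rather than doing by hand. Below is a rough sketch; the header values are the made-up ones from this post, and the actual POST call (with whatever library you use) goes where the comment is.

```python
def header_variants(headers):
    # Yield (omitted_key, headers_without_that_key) for each header,
    # so each replay of the POST drops exactly one header.
    for key in headers:
        yield key, {k: v for k, v in headers.items() if k != key}

full_headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'http://our_company_page.com/adm/add_form.php',
    'Cookie': 'ID=201399',
}

for omitted, trimmed in header_variants(full_headers):
    # Replay the POST with `trimmed` here and note whether it still works,
    # e.g. response = requests.post(url, data=values, headers=trimmed)
    print('dropped %s, %d headers left' % (omitted, len(trimmed)))
```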
Code attempt 2: urllib2
I had some trouble figuring out all of the header stuff with Mechanize, so I switched over to urllib2.
import urllib
import urllib2
import base64

url = 'www.our_company_page.com/adm/add_forms.php'
values = {
    'SID': '',  # Hidden field
    'FormNumber': 'FROM_PYTHON1030PM',
    'RevisionNumber': '5',
    'FormType': 'H(num)',
    'fsubmit': 'Save Page'
}
username = 'USERNAME'
password = 'PASSWORD'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Authorization': 'Basic %s' % base64.encodestring('%s:%s' % (username, password)),
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'ID=201399',
    'Host': 'our_company_page.com',
    'Origin': 'http://our_company_page.com',
    'Referer': 'http://our_company_page.com/adm/add_form.php',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, '
                  'like Gecko) Chrome/26.0.1410.43 Safari/537.31'
}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
print response.read()
As you can see, I added every header present in Chrome's Network tab to the POST request in urllib2.
One additional change from the Mechanize version is that I now access the add_form.php page directly by adding the relevant cookie to my Request.
However, even after duplicating everything I can, I still have the exact same issue: the response is the exact same page I started on -- no errors, no success messages, no changes on the server, just a return to a blank form.
Final step: desperation sets in, I install Wireshark
Time to do some traffic sniffing. I'm determined to see WTF is going on in this magical POST request!
I download, install, and fire up Wireshark, filter for HTTP, and then submit the form manually in the browser, followed by running my code's attempt to submit the form programmatically.
This is the network traffic:
Browser:
Python:
Aside from the headers being in a slightly different order (does that matter?), they look exactly the same!
So that's where I am, completely confused as to why a post request, which is (as far as I can tell) nearly identical to the one made by the browser, isn't making any changes on the server.
Has anyone ever encountered anything like this? Am I missing something obvious? What is going on here?
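For what it's worth, a lighter-weight alternative to Wireshark, at least once the requests library enters the picture, is to build a PreparedRequest and print it before sending anything. Nothing below touches the network; the URL and field names are the junk placeholders used throughout this post.

```python
import requests

req = requests.Request(
    'POST',
    'http://our_company_page.com/adm/add_page.php',
    data={'SegmentID': '', 'Segment': 'FROMPYTHON', 'fsubmit': 'Save Page'},
)
prep = req.prepare()

# The exact method, URL, headers, and body that would go on the wire:
print(prep.method, prep.url)
print(prep.headers['Content-Type'])
print(prep.body)
# To actually send it: requests.Session().send(prep)
```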
Edit
As per Ric's suggestion, I replicated the POST data exactly, copying it directly from the Network tab source in Chrome.
The modified code looks as follows:
data = 'SegmentID=&Segment=FROMPYTHON&SegmentPosition=1&SegmentContains=Sections&fsubmit=Save+Page'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
print response.read()
The only thing I changed was the Segment value, from FROMBROWSER to FROMPYTHON.
Unfortunately, this still yields the same result: the response is the same page I started from.
Update
Working, but not solved
I checked out the requests library, duplicated my efforts using its API, and lo and behold, it worked! The POST actually went through. The question remains: why?! I took another snapshot with Wireshark, and as near as I can tell it is exactly the same as the POST made from the browser.
The Code
import requests

def post(eventID, name, pos, containsID):
    segmentContains = ["Sections", "Products"]
    url = 'http://my_site.com/adm/add_page.php'
    cookies = dict(EventID=str(eventID))
    payload = {
        "SegmentID": "",
        "FormNumber": name,
        "RevisionNumber": str(pos),
        "FormType": containsID,
        "fsubmit": "Save Page"
    }
    r = requests.post(
        url,
        auth=(auth.username, auth.password),
        allow_redirects=True,
        cookies=cookies,
        data=payload)
Wireshark output
Requests
Browser
So, to summarize the current state of the question: it works, but nothing has really changed. I have no idea why the attempts with both Mechanize and urllib2 failed. What is going on that allows the requests POST to actually go through?
Edit -- Wing Tang Wong's suggestion:
At Wing Tang Wong's suggestion, I created a cookie handler and attached it to the urllib2 opener, so cookies are no longer being sent manually in the headers -- in fact, I don't assign any at all now.
I first connect to the adm page which has the link to the form, rather than connecting to the form right away.
'http://my_web_page.com/adm/segments.php?&n=201399'
This gets the ID cookie into my urllib2 cookie jar. From this point I follow the link to the page that has the form, and then attempt to submit to it as usual.
Full Code:
import os
import base64
import urllib
import urllib2
import cookielib

url = 'http://my_web_page.com/adm/segments.php?&n=201399'
post_url = 'http://my_web_page.com/adm/add_page.php'
values = {
    'SegmentID': '',
    'Segment': 'FROM_PYTHON1030PM',
    'SegmentPosition': '5',
    'SegmentContains': 'Products',
    'fsubmit': 'Save Page'
}
username = auth.username
password = auth.password
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Authorization': 'Basic %s' % base64.encodestring('%s:%s' % (username, password)),
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'my_site.com',
    'Origin': 'http://my_site.com',
    'Referer': 'http://my_site.com/adm/add_page.php',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31'
}

COOKIEFILE = 'cookies.lwp'
cj = cookielib.LWPCookieJar()
if os.path.isfile(COOKIEFILE):
    cj.load(COOKIEFILE)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

data = urllib.urlencode(values)
req = urllib2.Request(url, headers=headers)
handle = urllib2.urlopen(req)

req = urllib2.Request(post_url, data, headers)
handle = urllib2.urlopen(req)

print handle.info()
print handle.read()
print

if cj:
    print 'These are the cookies we have received so far :'
    for index, cookie in enumerate(cj):
        print index, ' : ', cookie
    cj.save(COOKIEFILE)
Same thing as before: no changes get made on the server. To verify that the cookies are indeed there, I print them to the console after submitting the form, which gives the output:
These are the cookies we have received so far :
<Cookie EventID=201399 for my_site.com/adm>
So, the cookie is there and has been sent alongside the request... still not sure what's going on.

I've read and re-read your post and the other folks' answers a few times. My thoughts:
When you implemented this in mechanize and urllib2, it looks like the cookies were hard-coded into the request headers. This would most likely cause the form to kick you out.
When you switched to the web browser and the Python requests library, the cookie and session handling was being taken care of behind the scenes.
I believe that if you change your code to take into account the cookie and session states -- i.e. start the automated session with an empty cookie jar and no session data for the site, but properly track and manage them during the session -- it should work.
Simply copying and substituting the header data will not work; a properly coded site will bounce you back to the beginning.
Without seeing the backend code for the website, the above is my observation. Cookies and session data are the culprit.
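The session-state point above is essentially what requests.Session automates. A minimal sketch, using the placeholder URLs from the question, with the actual network calls left as comments:

```python
import requests

s = requests.Session()
s.auth = ('USERNAME', 'PASSWORD')  # basic auth attached to every request

# Visiting the listing page first lets any Set-Cookie response land in the jar:
# s.get('http://my_site.com/adm/segments.php?&n=201399')

# Cookies accumulated in s.cookies are replayed automatically on later calls:
# s.post('http://my_site.com/adm/add_page.php', data=payload)

# The jar behaves like a dict; this is how the EventID cookie would be stored:
s.cookies.set('EventID', '201399', domain='my_site.com', path='/adm')
print(s.cookies.get('EventID'))
```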
Edit:
Found this link: http://docs.python-requests.org/en/latest/
which describes accessing a site with authentication, etc. The format of the authentication is similar to the requests implementation you are using. They link to a Gist with a urllib2 implementation that does the same thing, and I noticed that the authentication bits differ from how you are doing them:
https://gist.github.com/kennethreitz/973705
from the page:
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, gh_url, 'user', 'pass')
auth_manager = urllib2.HTTPBasicAuthHandler(password_manager)
opener = urllib2.build_opener(auth_manager)
I wonder if changing the way you implement the authentication in your urllib2 version would make it work.
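Combining that password-manager approach with the cookie handling from the question's last attempt gives one opener that carries both. This is a sketch only, using the placeholder site URL from the question (shown with Python 3's urllib.request; under Python 2 the same classes live in urllib2):

```python
import urllib.request
import http.cookiejar

# Password manager supplies credentials when the server challenges with a 401
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, 'http://my_site.com/', 'USERNAME', 'PASSWORD')

opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_manager),          # auth
    urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()),  # cookies
)
# opener.open('http://my_site.com/adm/segments.php?&n=201399')  # network call omitted
```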

I think the PHP script is erroring out and not displaying anything because your form data is not exactly identical. Try replicating the POST request so that it is completely identical, including all the values. The line-based text data in your Wireshark screenshot for the browser includes parameters such as SegmentPosition, which is 0, but your Python screenshot has no value for SegmentPosition. The format of some parameters, such as Segment, also seems to differ between the browser and the Python request, which may cause an error as the script tries to parse them.
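A quick way to spot differences like that, without squinting at screenshots, is to parse both url-encoded bodies and diff them field by field. The two body strings below are illustrative, modeled on the ones quoted in the question:

```python
from urllib.parse import parse_qs

def diff_bodies(a, b):
    # Parse both url-encoded bodies and report fields whose values differ,
    # or that only one body has. keep_blank_values keeps fields like SegmentID=
    da = parse_qs(a, keep_blank_values=True)
    db = parse_qs(b, keep_blank_values=True)
    return {k: (da.get(k), db.get(k))
            for k in set(da) | set(db) if da.get(k) != db.get(k)}

browser = 'SegmentID=&Segment=FROMBROWSER&SegmentPosition=0&fsubmit=Save+Page'
python_ = 'SegmentID=&Segment=FROMPYTHON&fsubmit=Save+Page'
print(diff_bodies(browser, python_))
```

Any field the browser sends that the script omits (like SegmentPosition above) shows up with None on the script's side.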

Related

Python Requests ASP.net always redirected to login

Using Python 3, I am trying to download a file (xlsx) from an HTTPS ASP.NET form page using Python requests. I am creating a session and first trying to log in to the site. It is HTTPS, but I do not have access to the SSL cert, so I am using verify=False, which I am happy with for this purpose.
I have manually set the User-Agent header, with help from here, to the same one the browser sends (captured under IE's F12 network tool), as this page seems to need a browser user-agent; the default python-requests user-agent may be forbidden.
I am also capturing __VIEWSTATE and __VIEWSTATEGENERATOR from the response text, as advised in this answer, and adding them to my POST data along with the username and password.
import requests
import bs4

login_payload = {'ctl00_txtEmailAddr': my_login, 'ctl00_txtPwd': pwd}
headers = {'User-Agent': user_agent,
           'Accept': r'*/*',
           'Accept-Encoding': r'gzip, deflate',
           'Connection': r'Keep-Alive'}

s = requests.Session()
req = requests.Request('GET', my_url, headers=headers)
prep0 = s.prepare_request(req)
s.headers.update(headers)
resp = s.send(
    prep0,
    verify=False,
    allow_redirects=True,
)

soup = bs4.BeautifulSoup(resp.text)
login_payload["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
login_payload["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]

req_login = requests.Request('POST', juvo_url, headers=s.headers,
                             data=login_payload)
prep1 = s.prepare_request(req_login)
login_resp = s.send(prep1, verify=False)
Here is the rest of the request body, in case it helps; I am not using this:
__EVENTTARGET=&__EVENTARGUMENT=&forErrorMsg=&ctl00%24txtEmailAddr=*MYLOGIN*&ctl00%24txtPwd=*MYPASSWORD*&ctl00%24ImgBtnLoging.x=0&ctl00%24ImgBtnLoging.y=0
With other attempts (with more code in addition to the above), every page, including trying to get the file from the direct hyperlink copied from IE, returns "Object moved to here" (with a direct link to the file I need, which works in the browser) or redirects me to the login page.
If I try to download the file in Python using the direct link from requests.history, I get an HTML file containing, depending on the response, either "Object moved to here" or the HTML of the login page.
My request status is always 302 or 200, as seen with urllib3 debugging enabled, but I have yet to see any response other than the login page or "Object moved to here".
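Since requests follows redirects by default, the final 200 can hide the bounce back to the login page. Passing allow_redirects=False exposes the first hop; the helper below is a hypothetical sketch of the check:

```python
def bounced_to_login(status_code, location):
    # A 301/302 whose Location header points at the login page means the
    # session was not accepted, whatever the final followed page says.
    return status_code in (301, 302) and 'login' in (location or '').lower()

# r = s.get(file_url, verify=False, allow_redirects=False)  # stop at first hop
# print(r.status_code, r.headers.get('Location'))           # where does it bounce?
print(bounced_to_login(302, '/Login.aspx?ReturnUrl=%2freports'))
print(bounced_to_login(200, None))
```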
The closest I can get is with this header, after doing a GET request in Python on the copied browser URL modified to the date I am interested in (which may actually be a website vulnerability if I can get this far without being logged in...):
{'Cache-Control': 'private', 'Content-Length': '873', 'Content-Type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; charset=utf-8', 'Location': 'redacted login page with a whole load of params', 'Server': 'Microsoft-IIS/7.5', 'content-disposition': 'attachment;filename='redacted filename', 'X-AspNet-Version': '2.0.50727', 'X-Powered-By': 'ASP.NET'}
With almost every SO hyperlink now purple, any clues/suggestions would be greatly appreciated.
Many thanks.

UrlOpen Redirected to default page

The default data link is http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk
But I do not want the data on this default page; I want the data under the Portfolio tab. So I used Firefox to determine the URL of the portfolio and attempted the following Python code:
testpage = urlopen('http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk&tabAction=Portfolio')
However, the page is always redirected to the default link. How do I get to the portfolio page?
You need to pay attention to the request that is being made, along with all the headers and the data.
If you inspect the "portfolio" request, you will see that a POST is being made with a lot of data, and that payload (form data) is what makes the server send the portfolio data back in the response.
What you need to do is mimic that request to fetch the response data and then handle it according to your need. You can do something like this:
import requests
from lxml import html

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'accept-language': 'en-US,en;q=0.8,ms;q=0.6'
}
url = "http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk"
payload = {
'ctl00_ContentPlaceHolder1_aFundScreenerResultControl_ScriptManager1_HiddenField':';;AjaxControlToolkit, Version=3.5.7.123, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-GB:5a4df314-b4a2-4da2-a207-9284f1b1e96c:de1feab2:f2c8e708:720a52bf:f9cec9bc:589eaa30:a67c2700:ab09e3fe:87104b7c:8613aea7:3202a5a2:be6fb298',
'__EVENTTARGET':'TabAction',
'__EVENTARGUMENT':'Portfolio',
'__LASTFOCUS':'',
'__VIEWSTATE':'Or/Z5BkJx2WVGMIPWgbVTVzk9hu+/eDKDHsbG74cJRlPSPW9dXuSQt31f2njq7X4NCZF/VW7u63TU5lF3lWGIAFNRoIIWwlRVMeMWeHygunbmBVxWWO08k90rAhbOCiyeOgKoaL1lVKO0R0DGS9rjl1Gah7C2NiIyLeD8boWobKLRV47aRiqaWI9ZYprxoky4zmuNp4NP51z0QLfb/4TvQKfcXJcUHHAAknVurwXfye3cHiUGf7pOyI84E9KJscHsbowC6mejPX4XmlXLVrVXk/lupYU8yTXSp03D2vfyPcQcrxt3y/uF0kXNG+4A/hFWOQFazVk1SRMYnQlrWtQ9Ulh58Q71zEZvX3yZhnp2EA5ZnYuOfeFWCnwwUBa6s9o8uLocDK1Q4chtjXDqK7q2W89kPZoyYjmgB5xunFDt8A7Sz3IFaDkJEyPYdBPOKx1Y1zv0g3/gwBnd64UXkTlBySHZao2CB/OBNQoqI6RqI6L44nrbESabh+DHBCdcCKeL8Pj+lsM5o7P0ShXpXHbCRTPk4PiWVeP4hk1vyOFA7tiReoWEPwQvDe3sqWh+K7EHLHefW5ke6W9zy5seHuC1vfcVTwT5FUIcTaAnhoDSphsMHWPoVc/vtcfExPWUx/aC2KIf1m+DKtN/no8Frt4SYqxMGtDMSUZjMR5xhFHSaqfjv/0Gs+RVod4N+A4rYeUO07A9VTLTE8SuZ4ovxjhrEtAQ3bYqzt29leHpmFT7Pfl7OZw3t3wt6SjQX+Q3M5ozThannhRKaDJCnBZdFh7ZnY4fgCLpDNyMDq3FccJC0V6PDSuu6enpPWOcy4NJj5H+/rEqo61/e2wgmefzt7Zaygu4v66MmOKLqbWymNa6C1Xuc0u0FhERUrSWrL/rS9kwC+LA7aWFPhdnEnwPewV6yj7kWzb4IZZ6ivGs3CXYH7G2HTnsP/P2bHXNV+YaaXTkdKJXkiPF7/qQ3JKzZYhDJjj0PObqtI2RlhmeecF8Lq8SfRRrBTXWjvg48q7nXurZ7ztX28QRDHC5aP/13+X8RyvLRmiM4V3vRdMjxpt8ySZtKM5wpEA4XjTUtCWtrNKO18yc0pbMaGRm5xEoXLY/i1cHC62OvKRsEYX82q+KuGyqEKwPoElc2SbMfyEit8M6tkA12wBce4cqlUMX4D85OOKCMzhY+h0loQFSsgVFfqKpEHpH9yg5lKtRg0dZ8P301xoGCeXhBhyZIp234EAdOOQySV4iNcBykLFGOuB0w63KqVbQRejqlnj2Qd+OkoXQ4hAh9tgCXdxhOHZ1hLB3nHHNMT3TDBO+j8eXgxAE8PN9zt6Xj7qGqDmkHAlwMP3Q8er0Ms2i80x7pUzvy5ixozAbUgfuEeKtjkK7fSD0UkKMa/YELEjTkgVJm6goPPIR3D2lwNAQLyHM8xLFSy3evkpJojw+QEFw4U9n31CoO6OB15Isqy/E1MPgwq9Wz3mUn2iYH1JruwsgQqQXraUKAiyMlpfbtj2YQL39Zp+AwzPeDDbwRaCCNBFvmpapcJyMpmzlzd0tr9gdV1GoVTtWBg+UcVGSsQi4XkD+32CfUuQ+ZpFlmUoYLuYSAEFV7Y97MlqLMqW89r/BZXRXNacpizFFrnQlCnsM4Bj4DUp+K7pcxAaYRKWcH3tiQO6zhCa2b8YoawWzQ5Ij8b19Z7PLN3Yug9ldeJ2CcYOzUQebT23ofSNtCU+uTbYzzh6RE8Bg/rut8R6A1uwYBWvjfL7N7M2fUSd01pwYgJ0BfsViV1pipzpCvTL5hGf1aK25gR+T7GtIxNbrdlo6Z1LbV/xYQYIDTod5dq6wUttZJVLeLVZRkCAv+M+o7Bvd86pi82TIdC8foOPgo7OR6ykPk+aMt1pr/hBV3tmBOUvMyYADmmOZQR+L/AQ57tRukeRyACeTJq1b5icpxawI+qn71we6eAKmg5POvkbq+pI+YnoSs1Mhk9OWeJ1CPRg3P5TDMIhq
XsG4mKY6awMwZXF12/r4qb7bRnfZGFukHBAYJTRsmZsLgiicM2uJ7kchxs2U/jwVcItGHgnIYkg1r7TTJ3oFo1rHEVhFHm8dIem3iI/VUpbe/XZyEKseDxoALSbASjYxM5n2eGfBLFnHMHv8RPfrX5EBfD0ZzMAVc8MoSycTsJuJI8L912Eewk9Cz3mb7o2zF9L+8syg8NpEDy78kIa0lE+QNqvdtk3P7uCxUckKWdmKLUfU2zaTBBGkIcDo8xXktZFgC+yQbUtxD2yFC21tvSA3xJaPVWqycMiVRp3fwIabWylnRnnwLqvAjIPTKiZI5w/szdciCwzx0GhSY14xpVV+jlLlfH8KCqVBVL5NIzxRTw+ELVPHOS3orE1dKtCcOqM22GE5PsU69E7ViA+fC2Gn/HzkUfUHPBKjKixX9hTmZzOnXToBU5sdEMZ1i3Jte+xfk3YVzYv9TO9f5EiibdNgw8MdCrXwxlgYNZUob0PixOajsPed+qv2PTl+kvLOSTkw6Z6K892TJkBvpAGQvP/zSgUorcNhuAJwQVG32TnX0HypMPpVwX0SqOhZLGM9essa7guKOrA3GdIDsoA2/f4JkFlJMtVgXKGPNXr7mTCeq2H8vFfQbH/59wPfMgrxxo6s9C+Tyt5zG3lyRoTEGUr4QwBkSeHq4J6Vya3sFDH911QHrfFuaKF2auqHHGuKyCCViqpb3A1Z4/GbllXBC4cmjyKc8FfI5i2eSSEMOd95N198ZCOD7x1zXPACX9QjaMdzZbadJ9UHXYsb/7l87ujNY4x5S9oQXgfW8fva9i4oqTqMV3VXTQK8lVcFovH0OxXXpNZ+rPm8Tj5kbRGrMgp6CdyxWSLKvqYv8f57ICr6ozaxyAd8XiTM+AhkfnXsN8BcH0u1yP6WUBDkUjhBi+4lfO6Dj5r6pFIN65GqPaz0mRFDpZU3nVQ1CmmeXneh0ZT/u7tG7Ray5Md5jr9onVsWfWnbc0hbUP0ghMANhtZtcrLpFikwxxQybdsS/xWdB4dLenTMAi2hn0KQ196thhQvvhEvEWaSxuEjX+iaQB14kXwOHAsBj8Ikp4lIdBsVctVQFVNzM3+F+UfDIbpTFh4IaAvOWNZzFGZYjdKDKKIuIgSAhdkHZbjQGpvXWdx12WR1/I/aqk5dx8OFpU3Lq/thZxQ+0oODetvex87L6lKWMgUcvQQAzAXbwzFp4wcTHnQuKJ21hqotOfn8F0GmWv59/hqfH1oFpt6/ENAs162hXOdGt5kTYl7u6X+ciQiIioRLiJ/NRIOoa1T++6v2FMk9acnOfNYMxEGeBdtqmLIN70aL8wvoFLliCkUhfe4yPaFQzFo26JsnnAXUpuiDKfs5fjDS+Rk/1BfVScqDIMv8IL8RDIoWxg8NX5DOOJPwAc3uC+s/kCCpoG2L0m9FLgSBv6Nr9wuv1rt59C/K/5RETD/VP415ArnuUBrdGpuYza1FvYyCo85HREzIL2lN6yZUBUXbBBrWxa3LiGaojhfhCyflhhHs+GoM8zfY5IW7Wpvp/YMPAgxXNRtegGL80+HU/dmlkRO8nRx3eyzpcpWZ302rK9m+OYqtfUXwvFKR7ULWnk/2aHsTQe6lwifxK70QG+jhZlrJqbPGi8vSpajsGMw5iU+VJM4CEDcGhvgpzODw3LkXPvsFrdLq8eUzHXo1Ox+yiZ2zSN3vGDcGeEZiQAbG2dcNt7niW+reozfdxVQAi4uLpPGWYu8jvVnRxoMuQEKEGzIiwNNsvpgCMGdUfk0izvvkTplz8lvk6ROlhy417VtiiVXVIMTAovFNO+W3O3/17LJ9Ed7QPYdoUO4n5fidYX6r4QUAoRGowMAPHQRIdg/AqN7N/EfmDRD72t7BOqvzVetXuTVId75vKB2P0CwoQPDIy0ynLZcTRykRs38LIHwYI5irp1NUjCee7mvo1RE0asD670LM03ZFMCOu/hmgln2dk5oFeyysISdxVUQKRmI6VytwEsSv
iOZeP1cZgB5DakdSgCaloRI2JGVbZ22B9UgO6hFSvfHhox5y1p/CzIrJPd+GUB80wmFX8Kgl0DSjsf9PJNQlAKu/jb85+wvF7exNrPyrShkWE9lYjbcmBHPYc+8J3ia1N3LWtVbR1x554dYoWHVGw3VbU3bWfqLjn5Eon4x3h5R/bCwVBorVCsQ99SzWCv5J9dMRF38r8y1yA4iEUPcX89n4nl91t6cnia4THuk2hhbaBPeu2PFwnTuwiJxAknVEGFUslAXu621wvmyssftVnQ+jzirCQNJAXyE75t+pNrWmQJXrpHDxnR3V9/LFrNy3tZn61H+UkEY1QK29bUJHE+DOfSnS4QkNY3VpLpaBdBeBorOOZ6dEc+lzVDcPrgjL+1fqu/yHwFmxCN8MfreDuX5E8M86YAR/xnyYAJRMxafR1p9eIG+cgHwIBeOhCw1J0p+ydN/bNK7KousyY4OcEr4zTF6crn6LmN6C7zDqabx8cjMRmUeNl24x27LxJhakNbmQjMVPSfo6Ro/edo0L7pG+pbj9SwiaJxGkr65b8pvrTwDFKh7tLUQtZ9j4s4y6MiQ565q2OJp93rm2deHHXUsM/ziI1t9OV00dbjuhTaLTmF5u3rIpKgryYVvmIa6G081WcKCeWz46amFLg7v39SB6XNuL0AIxIow5Hu+S0oIv51+ycUZoLUnypTFe/SnlobjYxAxsJt/cnxZ4wh556EZ9rN0HaJrbbb7pA4uaDBvz4EL8ndM+zEmlacyfQlKSr+jdB0XX+An6zhNQv3D5dkz84QdmPuAavhemrwr2m30Q/tNZ8DZdsBpOyR/U86nplu9Sx5LcFGnWULX+teY24nBUUghfuhGRPEr0dHPUUgMwpQq1fpcz6YQft09B0uthQiYWhNXvnsrlvnLzZTWTZLjFfwDlNn5RZqn0fAxudbM+eOzL9xvx4TBEEpcyf5nLTuNAKvfeZm4KWcRmV+WPnDJxmf7OlTVNKsXiY7Y+bJMjgNfKMh3oQws/1+gtATMlYSdjNIzuYSglhMyXS+BPRI8dpDq2VeF/cb6AII0Pvyq0H7nRadP1xccD6hTKdb4rP7kEAAZClm7P4M4Mog+CAXePDMw3kSkRGzsbT/6rKffKp9crRcOnKwSHU2yuf+NBTES6xeaPD0R7YwjbrRHPDsOoOdQXEcn/bl0oNnnLheSZhDKdFERtlvrpVB8qZ469A2Jqw/X5QMcIrEb0gisLWRSuiCpg/zmFDqaDsj1M8evc2MPGtkKxw9IsuupthsWKxYkbwB2inJdnLwgCDx+2B5oIT7pYbLricSseF1ukjL3uEyHicA3WztLzKoLjumpzevRWBs1VnYCL0Ow0U0yABR/dz3nh0mcE6X0iBb8ulgp+zn/8CNTNEE7lVSPPn3FFr6+mNuYu5O9fn9G6lji/8muhJWTW/9bbrA/2ZVPK4pto7mmfh/OWmkHnw3Te4ZysDIOXcXD7BCixoSB/3l88JQrGB/EAqrNz6oEhXeQ9hof2EhwKI3ZoxvKh5jfDii3PWI+NJPdFFtP+zRS+P1p4aMpQC703rHkmiSFRJIIaPnnbnXNN1NhBefjkjFA6nTvUcYsbBtKQzFJbAEiBnhOo/+jgUdd31gZbZbRi2Iw+Pv07qjDgwVznE5HLwEu8y2k+mdW7f1RKIgjiOhPA4CzBcWumeo7USUDpHaLNWEP0lLiwuxxB8CigRUln763e4xFAvd+vPlBoJJsJBUezJ5OdV5AC6Fe9/UuFT8Ov+Bsknk072xPIHxLks5J6XxDNrm7mnDKTirLE3y2OLpy5gAUPc1n7UpdH08k3C4y+8iqILZXfN6WzR459QcmY7Uu2YFSbxVM6dVYsE7arsp3zyDgjCgnctbrlO2A2iJ2P7f8eGdYEnMjm8Hv4lwFfSHDKVuVoD3+2Amw5CtE9Smtdidm4OTC/C1yePG+IvtXlx+21lgPpRdWOCFmz8/bQusVvRlQCz8any5fVnXJERaeQ
xC8UMbWgmRPQs1Q0rrpe7V8LKq4W5rwmEClsmUoqWiaDXN/nuzuPY2Bm7l4Qo7B0JQd/AEA3Kw4/4L8XLbQ8JHtnamJExXbDFZp2jPjz/9igiPq5j1+/ZqJtnwHPa54R2gLGbV9plpEMOu94Og993CM4QxKN4LSD/TKUV/ik46I5H1texmN7RWMcL1gAPnO0AVbiI8sP1xNxclROHanYTudyVVKld0qrq6ht4eYOgPL2RWB8ma1i8DiqfEy2Z+iDHIDv4nB92ktT7BKNA9MEC19+Nmbv1nNgiULtt+jOZZ92XlPnVU5fYIQZEsn/VPzxFx+6yoDmdN9+aeC6k/SfWFPJhdQ6e0Kk9sOpZQowS+GpaTGw86m2K/RyyzQyzJetV58Wlon3bYwzST6N/CHxiyWJ6KiC+JOJHjKuZjoX33FKKp/LqkG7PlC599l6afuCmU/e9r+MczU+BqdMZTAHshA5mpgPqT2gjHvD2Er+J9Be/P7YHCUvOUFpnfcDWVDz0Hx3kqibf8iFKBFDzK+YHk7U0I0O7yRgLTpxf3CC3ZpEF0unuuM/29BgViIucoRdPIHceE2aTUcH27myPTT5SJ8SGpJrfr5bS264VZ7la/ewkdSVAvfKZAD95KofW0up8Bt4z14+IQCE1Xe1c2fA6Vr20junvkX4ZgQVrlWGztCX26LigP19olHT8mDFPGzCEuXXqFDSGUaJ9bpRglkd5Ps9JkKbCfnzF+PLFMByr5xDywnCaMDyzLttVzfdu32qVI3UP51UZ1lnnylXuKjuctCJiC3it20cmMf4VK41gZpGbUlWoasmnx5OGapdvwGTtrT0cVNRkg6UXjj2BuZncnTBPzNhwarsZUsBBtOAbDue7xfEOXn1lLUDQ1u8xKqwzN4tqYTbp+VQAuI3LG9FaF6cD0wQX9qyFji7oIcmsHMe9KROCKi83IdG/Td0ML2/h3yUG/VkKVTdKbX2QGNShAKCT3JPPfR49q0WIeuJ4SCskuiZfzzo2m3rdodBvqEPiGSOmlmp6/RLGF0iWtZ177aqYzeEEMwO2BpxR0ZB/+0JTD9u+DE50h83q3eRsikNc4VlZDB8JWJW3WRLU/AxRefA0rP3nes0B7Thx0MTXpRsOGjprtSKYR9h73QxDugTSNA77sCpjSutaswovVdn9Lj6k4IL6av1gRK2Wso/e8YNEglHp+VBEegXw35ZOByleDUwqdOS07xCA9hBS0Oec0v7YvFEAIJnW4FunEV2fscciH2gVpDR2s+FbKjVdT7t6qNnWT4HX2PuL2mrHDKgE/l8tDvJ2z2zZW1fTiExfTngLbTOyhlptrc8RFDAJBi3jdYw4HU9LufawJjUIukBgiX9cM5y2IykqNzM10tMsMVfxRH10lHqieGf9e0u2ht+gRmYwooRzoyenkKlWVHh8E37+yXa0SHuV5qXb/8sk1IGqE5p0wL7qWUfOTRAdWtPll30n16f7Epfl+dYI8m93uTk2FrL0Dsosdkp5BmIilduNXje1bMonEtliHrJ012Q0FIxVjEOZDUTUXYwRw0mF0eaxvKu27cJ1OqYUGfJk9zqAiAc9QnTBDL/f+zljgBzs6FWC2PWASaoMrS5Q0aNlN/y5wmFma9swHoEwrBUXr4Vi2Nyf5jj/FijJz77DaNs1J4G5uUF8Abe8HRvYc0XCtEMWqCcv3W78Px4/v/ThOMvamMcJvBh91/6Ep95/SzHvuEDpb8WsUKjwpXDdmp1k7QgMwb0ymrheZhxj/mYklx4EtnMWYwIPt02RJbEFoEcgB09chAg5x8rTh6FmJzGHmZOv7A8oEg0CvrO2pT+aiKqCTRcJsOKKvZnLXlQg1TwgJh3jCgvVVSPGIEO4RpIWMNT1/Opno6ytmiubgX5NoythDBrG5WtoAltsfomRTkb1NWOhcam9Q=',
'__VIEWSTATEGENERATOR':'7400285C',
'__VIEWSTATEENCRYPTED':'',
'__EVENTVALIDATION':'1p0545yV8Pljo87c7Dlb6kiemgGIXd5S3wYUGUoMyg6IPO12GWyBgNM7bu67YNl9f9Sx9ad9lHLIfwYtw0nDGqWYtWBnM8PHrdmxYdOb5+qUooGamIPBCCel/8Ri+FGpvNGPTZkKeuYzfomnlqr/mYoMcjdnsiQWCf7Dvou8X5p8A/pkHRReFtE8H4xIhr8X3MU6lpxBhHZKj3UK+hBHCWxEnQkGb0Nz2Pi8hyWNt5AUu830RSQnl793RwuxwQ1HCmJYFEx00c11gXmSn36PPP42OCMstDR/GpK2LUPsNQbdJ7TUq25rzG/5SIjYxWA0nQbGY/mWaY3Q9iCo7k5o9QnZEf4yLaOF5g7nEva4lTZNwx27ynyDAWrRBVE0KsGTsbIQMgMPqCV8gzc6irsluosW+EI6zW0mdaeoiBaGYHFQnJ77a5rnbpL0j6fMiDfL+5VW7jAaRnyz3Y10Cn1TlXEY4rvQjHZIxvK/rBe2WSVkyXIhrUgAl7a7EvDGnBniVdsizrBSgASrjcT9svJ0aHPEpfxJmy8nuzV0pZbXGzG1q06Xyij4SoxHfi7tf1dPjOn6+zdR2SY0/sQvXmQ35bAlbnFMKWdzyJHB0uEm6GYQtV4Dcj6fjGijxjIiQW0SgjuRd7/8k1i7MvEnD+I6MRIBhx/KNOdP3os9oP8pyMicIz8V1o7KENKwX4fyUmbIx34adCXt/DXT1sjFtNu4S0vzvOU/AtxBLOEcw+clV25xSp/94dEq/dge+K2ySRuxKt6DcNhjDMYvc1ACXbfVjANG8ar3x7n9kX14EMtnpip0RI9ypma3tOmjqip+Qc+lyc12A0jV714BfTzw6nSjYya75Idztq1gZifNt+pQn7GO7Qw2kIqNnXvpA+UkWbsTlnTKyY9gTqPHbF5XcSvtvDfNYM6mxVMJZ1MyAt4pxrCgyGAC0IswPZ8wMAablR9fNStFs67D4kyeUCU/2IVTD1/pfmMC8meLuaXpgHkl2er3Wr84H2lVL+xUd7/wCUSkLFke68SeRfqPl7dIR9hstVJC0cCTbmco0KxTzcwln3QdoxveE/N8v31Z9teZoJxdeRRFyJQFGHw81JVor2kACBsIkioLUvz4IpUxE8XEwUrjHCBJyZG3QzcQAxSBXprztdoknBgrd38wCssuCa3gQvIoMtbCaedWhmY0pA/AI0aHTHR/j5nTg9jaqeEViZF0hLVEhVz2MojXswtp0aA70YwuMPBmMCdgy5w+wSeThtsyt6j+b2NHHRkE0uqEc8D5XDUB5M2UolpsZCcuOKSwK+jwGdeb3gPWssUQShMgRTEoFLapZJKX7c8/yeZ7Lf4KrFE4pBz8+JeFJOVFl3y1ewckAFZVHdYvu1901Aq6PKTMu6kGz/LElno8fCJbyKacsZ3LtpssGhvFBN0vv0WIn4elkiLCL3u9v6oWOxaK4OIDTaVwDLjb5BvBpd2Szj5diHAG1IXoVQYvJ2VEDVwbiTUChXRZcDY6bAm7dvkYWLOxsa0whGz2xeeApbEceeQrHREelH89ucBenmuENPiF98Kf4mZQ/ThiVhFxiWAux1b0Dn0z7M/mXfposuy8ytqtRry3SJoC5V7I+7E4N5x0JyVwN/vtxBpd5h443R35RvDZ1tnscirsGzNoulevkeUqM+I6TgjrvBF00fv3isZvzUjIXK6E9cAg4G7aPyuirI+KIICi9VNxF5fDRxi2UPTHiB3NT01Vez5GVt0Tu8lpn2iakJSBjihOYORrSI+xJzbQdnCzJa1+h8UiAFXgpqWviJUVXG22wFQ1HQckAbFxU/Pcyx+QrsnDrhwihqmnwFd1fuwOy74SAvPMojpxujxWDe+37nhroEyhrk5yOB65RDUcQFS77a+3RwuNyXAodTC3QMp5lMZD1Ae8zGEBesg4zbkP7aMS+ljYBShRN6n9KYhHZ7s2Iq5V4K6GrUcOFdXP157jN4vBuj8l+Uo
BIPjpMm9KKpLnCuSjGNPIyoxfPg==',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$ddlPageSize':'20',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnFilterBySelection':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedIndex':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedFunds':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnCustomImageFileIds':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedSecurityCount':'0',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnTabs':'Snapshot,ShortTerm,Performance,Portfolio,FeesAndDetails',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnSelectedRow':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnCheckedColumns':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnExportLimit':'50',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnExportCount':'0',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$toExport':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnAllowAllFundsExport':'true',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$txtSaveSearch':'',
'ctl00$__RequestVerificationToken':'e_BrPK0DxBjkgMrhfkdyFJjp1nPzltSn0h20aUjHJPSe3W4w3FRsFQNo_YY3Ml0D1CkNGqC5PEJBigtZuvdbiYrldSMrUoOQFUjaPifPbM41'
}
r = requests.post(url, headers=headers, data=payload)
print(r.content)
root = html.fromstring(r.content)
You can now fetch the elements you need from root using xpath such as :
root.xpath('//input[@class="some_class"]')
Refer to the scraping and lxml documentation for more understanding.
I have used all the payload data from the request; you can remove some of it and check what is absolutely necessary for the request.
Also, follow the website's rules about scraping, and scrape gracefully without putting too much pressure on the website.

How to log in to phone.ipkall.com with the Python requests library?

I am trying to learn Python, but I have no knowledge of HTTP. I read some posts here about how to use requests to log in to a web site, but it doesn't work. My simple code is here (not the real number and password):
#!/usr/bin/env python3
import requests

login_data = {'txtDID': '111111111',
              'txtPswd': 'mypassword'}

with requests.Session() as c:
    c.post('http://phone.ipkall.com/login.asp', data=login_data)
    r = c.get('http://phone.ipkall.com/update.asp')
    print(r.text)
print("Done")
But I can't get my personal information, which should be shown after login. Can anyone give me a hint or point me in a direction? I have no idea what's going wrong.
Servers don't like bots (scripts), for security reasons, so your script has to behave like a human using a real browser. First use get() to fetch the session cookies, and set the User-Agent header to a real one. Use http://httpbin.org/headers to see what user-agent your browser sends.
Always check the results of r.status_code and r.url.
So you can start with this:
(I don't have an account on this server, so I can't test it.)
#!/usr/bin/env python3
import requests

s = requests.Session()
s.headers.update({
    'User-agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0",
})

# --------
# to get cookies, session ID, etc.
r = s.get('http://phone.ipkall.com/login.asp')
print(r.status_code, r.url)

# --------
login_data = {
    'txtDID': '111111111',
    'txtPswd': 'mypassword',
    'submit1': 'Submit'
}
r = s.post('http://phone.ipkall.com/process.asp?action=verify', data=login_data)
print(r.status_code, r.url)
# --------
# --------
BTW: if the page uses JavaScript, you have a problem, because requests can't run the JavaScript on the page.
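A quick sanity check before reaching for a JavaScript-capable tool like Selenium: if text you can see in the browser never appears in the raw HTML the server returns, the page is building it with JavaScript. The HTML fragment below is made up for illustration:

```python
def looks_js_rendered(raw_html, visible_text):
    # If text visible in the browser never occurs in the server's raw HTML,
    # it is almost certainly injected client-side after page load.
    return visible_text not in raw_html

raw = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
print(looks_js_rendered(raw, 'Top Stories'))                  # needs a JS-capable tool
print(looks_js_rendered('<p>Top Stories</p>', 'Top Stories')) # plain requests is enough
```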

Unable to parse a part of a webpage that is visible when opened with a browser

I have this strange problem parsing the Herald Sun webpage to get the list of RSS feeds from it. When I look at the page in a browser, I can see the links with titles. However, when I use Python and Beautiful Soup to parse the page, the response does not even contain the section I would like to parse.
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib.request.Request("http://www.heraldsun.com.au/help/rss", headers=hdr)
try:
    page = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.fp.read())
html_doc = page.read()
f = open("Temp/original.html", 'w')
f.write(html_doc.decode('utf-8'))
As you can check, the written file does not have the results in it, so obviously Beautiful Soup has nothing to work with here.
I wonder how the webpage enables this protection and how to overcome it. Thanks,
For commercial use, read the terms of service first.
There really isn't that much information the server knows about who is making a request: basically the IP, the User-Agent, and the cookies. Also, urllib2 will sometimes not grab information that is generated by JavaScript.
JavaScript or not?
(1) Open the Chrome developer tools, disable the cache and JavaScript, and make sure you can still see the information you want. If you cannot see it there, you will have to use a tool that supports JavaScript, such as Selenium or PhantomJS.
However, in this case, your website doesn't look that sophisticated.
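You can make the same check without the developer tools: fetch the raw HTML (with urllib or requests) and look for a piece of text you can see in the browser. A minimal sketch, where needs_js is a hypothetical helper name:

```python
def needs_js(raw_html, visible_text):
    """If text that is visible in the browser is missing from the raw
    HTML, it is almost certainly injected by JavaScript, and you will
    need a tool like Selenium instead of plain urllib/requests."""
    return visible_text not in raw_html
```

For the Herald Sun page, you would pass one of the link titles you can see in the browser (e.g. "Breaking News") as visible_text.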
User-Agent? Cookie?
(2) The problem then comes down to tuning the User-Agent or the cookies. As you have found, the user agent alone is not enough, so it must be the cookie that plays the trick.
As you can see, the first page call actually returns "temporarily unavailable", and you need to click through to the RSS HTML that comes back with a 200 status code. Just copy the user-agent and cookies from there and it will work.
Here is how to add the cookie using urllib2:
import urllib2, bs4

opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# I omitted the cookie here; you need to copy and paste your own
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))

soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))
div = soup.find('div', {"id": "content-2"}).find('div', {"class": "group-content"})
for a in div.find_all('a'):
    try:
        if 'feeds.news' in a['href']:
            print a
    except KeyError:
        pass  # anchor without an href attribute
And here are the outputs:
Breaking News
Top Stories
World News
Victoria and National News
Sport News
...
The site could very likely be serving different content, depending on the User-Agent string in the headers. Websites will often do this for mobile browsers, for example.
Since you're not specifying one, urllib is going to use its default:
By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number.
You could try spoofing a common User-Agent string by following the advice in this question. See What's My User Agent?
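With the Python 3 urllib.request used in the question, spoofing the User-Agent is a one-liner, and you can verify the header on the Request object before sending anything. The UA string below is just an example:

```python
import urllib.request

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0"
req = urllib.request.Request("http://www.heraldsun.com.au/help/rss",
                             headers={"User-Agent": ua})
# urllib stores header names in capitalized form, hence "User-agent"
print(req.get_header("User-agent"))
```

Passing the Request to urllib.request.urlopen() then sends your spoofed string instead of the default urllib/VVV.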

Facebook stream API error works in Browser but not Server-side

If I enter this URL in a browser it returns to me the valid XML data that I am interested in scraping.
http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0
However, if I do it from the server-side, it doesn't work as it previously did. Now it just returns this error, which seems to be the default error message
{u'silentError': 0, u'errorDescription': u"Something went wrong. We're working on getting it fixed as soon as we can.", u'errorSummary': u'Oops', u'errorIsWarning': False, u'error': 1357010, u'payload': None}
Here is the code in question; I've tried multiple User-Agents, to no avail:
import urllib2

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'
uaheader = {'User-Agent': user_agent}
wallurl = 'http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0'
req = urllib2.Request(wallurl, headers=uaheader)
resp = urllib2.urlopen(req)
pageData = convertTextToUnicode(resp.read())  # helper defined elsewhere in my script
print pageData  # and get that error
What would be the difference between the server calls and my own browser aside from User Agents and IP addresses?
I tried the above URL in both Chrome and Firefox. It works in Chrome but fails in Firefox. In Chrome I am signed into Facebook, while in Firefox I am not.
This could be the reason for the discrepancy: you will need to provide authentication in the urllib2-based script you posted.
There is an existing question on authentication with urllib2.
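In practice that means attaching a cookie jar to your opener so the login cookies persist across requests. Below is a sketch using Python 3's urllib.request and http.cookiejar (the successors of urllib2 and cookielib); the login step is only indicated in comments because the exact form fields depend on the site:

```python
import http.cookiejar
import urllib.request

# a CookieJar keeps session cookies between requests made by this opener
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.3) '
                      'Gecko/20100401 Firefox/3.6.3')]

# 1) log in first, so the jar picks up the authentication cookies
#    (URL and POST body are placeholders, not Facebook's real login form):
# opener.open('https://www.facebook.com/login.php', data=b'email=...&pass=...')
# 2) then fetch the wall URL with the now-authenticated session:
# resp = opener.open(wallurl)
```

Every request made through this opener automatically sends back whatever cookies the server has set, which is what the browser does for you when you are signed in.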
