python-requests does not grab JSESSIONID and SessionData cookies - python

I want to scrape a PDF file from http://www.jstor.org/stable/pdf/10.1086/512825.pdf but the site wants me to accept its Terms and Conditions first. While downloading from the browser I found out that JSTOR saves my acceptance in two cookies named JSESSIONID and SessionData, but python-requests does not grab these two cookies (it grabs two other cookies, but not these).
Here is my session instantiation code:
def get_raw_session():
    session = requests.Session()
    session.headers.update({'User-Agent': UserAgent().random})
    session.headers.update({'Connection': 'keep-alive'})
    return session
Note that I have used python-requests on login-required sites several times before and it worked great, but in this case it does not.
My guess is that the problem is JSTOR being built with JSP and python-requests not supporting that.
Any ideas?

The following code works perfectly fine for me:
import requests
from bs4 import BeautifulSoup

s = requests.session()
r = s.get('http://www.jstor.org/stable/pdf/10.1086/512825.pdf')
soup = BeautifulSoup(r.content, 'html.parser')

# The "Accept Terms and Conditions" link carries the real PDF URL.
pdfurl = 'http://www.jstor.org' + soup.find('a', id='acptTC')['href']

with open('export.pdf', 'wb') as handle:
    response = s.get(pdfurl, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
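If you want to confirm that the session really captured the acceptance cookies, inspect `s.cookies` after the first GET. As a minimal offline sketch of the mechanism (the cookie value `abc123` here is a placeholder I set by hand, not anything the server sends), a `requests.Session` replays every stored cookie on later requests to the same domain:

```python
import requests

# A Session keeps server-set cookies (e.g. JSESSIONID) and attaches
# them to every later request for the matching domain. We inject one
# manually here just to show the replay; 'abc123' is a placeholder.
s = requests.Session()
s.cookies.set('JSESSIONID', 'abc123', domain='www.jstor.org', path='/')

req = requests.Request('GET',
                       'http://www.jstor.org/stable/pdf/10.1086/512825.pdf')
prepared = s.prepare_request(req)
print(prepared.headers.get('Cookie'))
```

If `Cookie: JSESSIONID=...` shows up in the prepared headers, the session is doing its job and any remaining failure is on the server side.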

Related

Grab auto Download Links Using requests

I'm trying to grab the auto-started direct download link from Yourupload using bs4.
The direct download link is generated anew every time,
and the download also starts automatically after 5 seconds.
I want to get the direct download link and store it in a "link.txt" file.
import requests
import bs4

r = requests.get('https://www.yourupload.com/download?file=2573285', stream=True)
soup = bs4.BeautifulSoup(r.text, 'lxml')
print(soup)
Well, actually the site runs JavaScript to redirect to the final-destination URL, which streams the download after a token validation.
So we need to be a bit craftier and work through it by hand.
We first send a GET request inside a requests.Session() (to maintain the session object), extract the token, and then send a second GET request to download the video :).
At that point you have the final URL and can do whatever you like: download it now or later.
import requests
from bs4 import BeautifulSoup

def main():
    url = "https://www.yourupload.com/download?file=2573285"
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # The token sits inside the third inline <script> on the page.
        token = soup.findAll("script")[2].text.split("'")[1][-4:]
        headers = {
            'Referer': url
        }
        r = req.get(
            f"https://www.yourupload.com/download?file=2573285&sendFile=true&token={token}",
            stream=True, headers=headers)
        print(f"Downloading From {r.url}")
        name = r.headers.get("Content-Disposition").split('"')[1]
        with open(name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    f.write(chunk)
        print(f"File {name} Saved.")

main()
Output:
Downloading From https://s205.vidcache.net:8166/play/a202003090La0xSot1Kl/okanime-2107-HD-19_99?&attach=okanime-2107-HD-19_99.mp4
File okanime-2107-HD-19_99.mp4 Saved.
Confirmation by size: the saved file is the expected ~250 MB.
Notice that the download link can only be called once, as the token is validated a single time by the back-end.
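Slicing by quote positions (`split("'")[1][-4:]`) is brittle if the page layout shifts; a regex over the inline script is a safer way to pull the token. The script text below is a made-up stand-in for `soup.findAll("script")[2].text`, and the token length and URL shape are assumptions for illustration:

```python
import re

# Stand-in for the inline <script> text scraped from the page; the
# URL shape and the token value 'ab1z' are assumed for illustration.
script_text = "window.location = '/download?file=2573285&sendFile=true&token=ab1z';"

m = re.search(r"[?&]token=([A-Za-z0-9]+)", script_text)
token = m.group(1) if m else None
print(token)  # ab1z
```

This keeps working even if the token moves or changes length, as long as it stays in a `token=` query parameter.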

Python - Reading different urls using urllib2 returned the same results?

I'm trying to use Python's urllib2 to read some pages, but different URLs return the same page.
The page is an inquiry for campsite availability at a given campground on recreation.gov. Since there might be a lot of campsites in a campground, the last parameter in the URL tells the page which campsites to list.
For example, if startIdx=0 the page lists campsites 1~25, and if startIdx=25 the page lists campsites 26~50.
So I constructed some URLs with different startIdx values, but after using urllib2 to read them, the returned HTML was all the same; it seems the startIdx in the URL was somehow ignored.
In addition, if I manually open those URLs in a browser the pages look normal, but if I use webbrowser.open to open them the pages look weird.
This brief sample code reproduces the problem I'm having:
import urllib2
url1 = 'http://www.recreation.gov/campsiteCalendar.do?page=calendar&contractCode=NRSO&parkId=70928&calarvdate=03/11/2016&sitepage=true&startIdx=0'
url2 = 'http://www.recreation.gov/campsiteCalendar.do?page=calendar&contractCode=NRSO&parkId=70928&calarvdate=03/11/2016&sitepage=true&startIdx=25'
hdr = {'User-Agent': 'Mozilla/5.0'}
request1 = urllib2.Request( url1, headers = hdr )
response1 = urllib2.urlopen( request1 )
html1 = response1.read()
request2 = urllib2.Request( url2, headers = hdr )
response2 = urllib2.urlopen( request2 )
html2 = response2.read()
In [1]: html1 == html2
Out[1]: True
I have no other knowledge about how these inquiries or the PHP-related machinery work, so I'm curious why urllib2 behaves like this. The Python version I'm using is 2.7.
Thanks!
The web page may change at runtime, whereas you are only requesting the raw HTML. There is probably some JavaScript that changes the contents of the page based on information encoded in the URL. If the content were loaded server-side (with PHP, say), it would already be present in the response, because the server changes the HTML before sending it; JavaScript changes the HTML after it has been sent.
In other words, a regular browser alters the HTML based on the URL by running JavaScript. Your bare request does not.
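One quick sanity check, before blaming urllib2, is to confirm that the two request URLs really differ only in `startIdx`; if they do and the responses are still byte-identical, the listing must indeed be filled in client-side, as described above. A small stdlib sketch (written for Python 3, where `urlparse` lives in `urllib.parse`):

```python
from urllib.parse import urlparse, parse_qs  # the urlparse module on Python 2

url1 = ('http://www.recreation.gov/campsiteCalendar.do?page=calendar'
        '&contractCode=NRSO&parkId=70928&calarvdate=03/11/2016'
        '&sitepage=true&startIdx=0')
url2 = url1.replace('startIdx=0', 'startIdx=25')

q1 = parse_qs(urlparse(url1).query)
q2 = parse_qs(urlparse(url2).query)

# The only differing parameter is startIdx, so the two HTTP requests
# are genuinely different; identical responses therefore mean the
# server ignores it and JavaScript paginates on the client.
diff = {k for k in q1 if q1[k] != q2.get(k)}
print(diff)  # {'startIdx'}
```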

Authentication using python requests. Getting redirect to login page

I am using the python requests library to log in to the site:
http://www.zsrcpod.aviales.ru/login/login_auth2.pl
and then trying to get a file, but authentication fails and I am getting redirected to the login page.
I already have some experience using the requests library with .session() on other sites, and it works fine, but this script is not working:
import requests

login_to_site_URL = r'http://www.zsrcpod.aviales.ru/login/login_auth2.pl'
URL = r'http://www.zsrcpod.aviales.ru/modistlm-cgi/seances.pl?db=modistlm'
payload = {'login': 'XXXXXX',
           'second': 'XXXXXX'}

with requests.session() as s:
    s.post(login_to_site_URL, payload)
    load = s.get(URL, stream=True)
    # download
    with open(r'G:\!Download\!TEST.html', 'wb') as save_command:
        for chunk in load.iter_content(chunk_size=1024):
            if chunk:
                save_command.write(chunk)
                save_command.flush()
I need help adapting it.
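One common reason a plain `s.post(login_url, payload)` gets bounced back to the login page is that the form carries hidden fields (a CSRF token, a form id) that must be echoed back with the credentials. A hedged sketch of collecting every named `<input>` before posting; the form HTML below is a made-up stand-in, since I don't know the real field names on zsrcpod.aviales.ru — in practice you would fetch the login page with `s.get(login_to_site_URL)` and parse `r.text`:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the login page HTML; the real page may use
# different field names, so parse the actual response text instead.
login_page = """
<form action="login_auth2.pl" method="post">
  <input type="hidden" name="csrf" value="tok-42">
  <input type="text" name="login">
  <input type="password" name="second">
</form>
"""

soup = BeautifulSoup(login_page, 'html.parser')

# Start from every named input (hidden fields keep their values),
# then overwrite the credential fields.
payload = {inp['name']: inp.get('value', '')
           for inp in soup.find_all('input') if inp.get('name')}
payload.update({'login': 'XXXXXX', 'second': 'XXXXXX'})
print(sorted(payload))  # ['csrf', 'login', 'second']
```

Posting that payload back within the same session preserves any hidden token the server expects.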

Simulate browser using cookielib to fetch url in python

I am using cookielib, and sometimes opening a URL in a browser causes the browser to download many other files by making additional requests. Can I replicate the same behaviour using cookielib or any other Python library?
For example: to get all the required information from the page https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1
I have to make more than one GET request from my Python script. I found the URLs of all the requests the browser makes by analysing the network requests when I opened the page.
I want to know if there is any way to make just one request and have it fetch all the related resources by itself, like a browser does.
I am not much interested in the JS or CSS, just the main HTML.
I tried the following code but it couldn't download the whole page:
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1')
html = response.read()
But when I fetched the 3 other GET URLs in sequence, the third GET response gave me the required HTML. I got these URLs by examining the network tab of the browser:
'https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PT_NAV.ISCRIPT1.FieldFormula.IScript_UniHeader_Frame?c=NNTCgkqGs001AcPaisqGbYpTu%2fbGx4jx&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes')
'https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PTPPB.ISCRIPT2.FieldFormula.IScript_TemplatePageletBuilder?PTPPB_PAGELET_ID=KC_LNAV_APPLICANT&target=KCNV_KC_LNAV_APPLICANT_TMPL&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes&PortalTargetFrame=TargetContent'
'https://hronline.keybank.com/psc/hrshrm/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalCRefLabel=Careers&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&NoCrumbs=yes&PortalKeyStruct=yes'
and the following is the complete code for the other fetches I am making:
response = opener.open('https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PT_NAV.ISCRIPT1.FieldFormula.IScript_UniHeader_Frame?c=NNTCgkqGs001AcPaisqGbYpTu%2fbGx4jx&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes')
response.read()
response = opener.open('https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PTPPB.ISCRIPT2.FieldFormula.IScript_TemplatePageletBuilder?PTPPB_PAGELET_ID=KC_LNAV_APPLICANT&target=KCNV_KC_LNAV_APPLICANT_TMPL&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes&PortalTargetFrame=TargetContent')
response.read()
response = opener.open('https://hronline.keybank.com/psc/hrshrm/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalCRefLabel=Careers&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&NoCrumbs=yes&PortalKeyStruct=yes')
required_html = response.read()
requests can handle cookies, as you can see here.
It's a great library, far more powerful than urllib2, and yet simpler-looking.
>>> import requests
>>> r = requests.get('https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1')
>>> r.cookies
Edit: this answer does not really address the problem; I read too fast. Sorry about that.
As suggested by @J.F.Sebastian, I'm adding a link to a Python WebKit client, Ghost.py, which can emulate a browser as you requested.

Scrape a web page that requires they give you a session cookie first

I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I have to use to access this Excel file:
http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal
requires that a session cookie from the government site be attached to the request.
How could I grab the session cookie with an initial request to the landing page (when they give you the session cookie) and then use it to hit the URL above to grab our excel file? I'm on Google App Engine using Python.
I tried this:
import urllib2
import cookielib

url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'

def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data

cj = cookielib.CookieJar()
# grab the data
data1 = grab_data_with_cookie(cj, url)
# the second time we do this, we get back the excel sheet
data2 = grab_data_with_cookie(cj, url)
stuff2 = data2.read()
I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
Using requests this is a trivial task:
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print r.cookies
{'requests-is': 'awesome'}
Using cookies and urllib2:
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# use opener to open different urls
You can use the same opener for several connections:
data = [opener.open(url).read() for url in urls]
Or install it globally:
urllib2.install_opener(opener)
In the latter case the rest of the code looks the same with or without cookies support:
data = [urllib2.urlopen(url).read() for url in urls]
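For debugging, it helps to know that a CookieJar is just an iterable store: once attached to an opener via HTTPCookieProcessor, it records every Set-Cookie the server sends, and you can inspect it directly to see whether the session cookie arrived. A minimal offline sketch (using Python 3's http.cookiejar, the renamed cookielib, and inserting a cookie by hand since no server is involved):

```python
import http.cookiejar  # this module was named cookielib in Python 2

cj = http.cookiejar.CookieJar()

# Normally HTTPCookieProcessor fills the jar from Set-Cookie headers;
# here we insert one manually just to show how to inspect the jar.
# The name/value are placeholders.
cookie = http.cookiejar.Cookie(
    version=0, name='JSESSIONID', value='abc123', port=None,
    port_specified=False, domain='nrega.ap.gov.in', domain_specified=True,
    domain_initial_dot=False, path='/', path_specified=True, secure=False,
    expires=None, discard=True, comment=None, comment_url=None, rest={})
cj.set_cookie(cookie)

print([(c.name, c.value) for c in cj])  # [('JSESSIONID', 'abc123')]
```

If the jar is empty after the first `opener.open(url)`, the server never issued a session cookie on that URL, and you need to hit the landing page first.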
