Scrapy-Splash doesn't set custom request headers - python

I am trying to scrape a website using Scrapy + Splash in Python 2.7.
The website uses JavaScript to generate most of the HTML, which is why I need Splash.
First, I make a FormRequest with Scrapy to log in to the website. It succeeds.
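Roughly, that login step looks like this (a sketch only; the endpoint and form field names below are placeholders, not the real site's):

import scrapy

def start_requests(self):
    # hypothetical login endpoint and form fields
    yield scrapy.FormRequest(
        'https://www.example.com/api/login',
        formdata={'username': 'myuser', 'password': 'mypass'},
        callback=self.after_login,
    )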
I then extract "access_token" from the JSON response, because it has to be sent as an "Authorization" header on the next request, to prove to the website that I am logged in.
jsonresp = json.loads(response.body_as_unicode())
self.token = 'Bearer ' + jsonresp['access_token']
self.my_headers['Authorization'] = self.token
Before proceeding with SplashRequest, I decided to test the session with scrapy.Request. I passed cookies and the new headers:
yield scrapy.Request('https://www.example.com/products', cookies=self.cookies, dont_filter=True, callback=self.parse_pages, headers=self.my_headers)
The HTML from response.body confirmed that I was logged in. Great!
Calling response.request.headers showed that the 'Authorization' header was also sent:
{'Accept-Language': ['en-US,en;q=0.5'],
'Accept-Encoding': ['gzip,deflate'],
'Accept': ['application/json, text/plain, */*'],
'User-Agent': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'],
'Connection': ['keep-alive'],
'Referer': ['https://www.example.com/Web'],
'Cookie': ["___cookies___"],
'Content-Type': ['application/x-www-form-urlencoded'],
'Authorization': ['Bearer Zyb9c20JW0LLJCTA-GmLtEeL9A48se_AviN9xajP8NZVE8r6TddoPHC6dJnmbQ4RCddM8QVJ2v23ey-kq5f8S12uLMXlLF_WzInNI9eaI29WAcIwNK-FixBpDm4Ws3SqXdwBIXfkqYhd6gJs4BP7sNpAKc93t-A4ws9ckpTyih2cHeC8KGQmTnQXLOYch2XIyT5r9verzRMMGHEiu6kgJWK9yRL19PVqCWDjapYbtutKiTRKD1Q35EHjruBJgJD-Fg_iyMovgYkfy9XtHpAEuUvL_ascWHWvrFQqV-19p-6HQPocEuri0Vu0NsAqutfIbi420_zhD8sDFortDmacltNOw-3f6H1imdGstXE_2GQ']}
Cookie DEBUG showed that all cookies were sent without issues.
After that, I replaced scrapy.Request with SplashRequest:
yield SplashRequest('https://www.example.com/products', cookies=self.cookies, callback=self.parse_pages, args={"lua_source": lua_script, 'headers':self.my_headers}, endpoint='execute', errback=self.errors)
lua_script:
lua_script = """
function main(splash)
splash:init_cookies(splash.args.cookies)
assert(splash:go{
splash.args.url,
headers=splash.args.headers,
http_method=splash.args.http_method,
body=splash.args.body,
})
assert(splash:wait(2))
local entries = splash:history()
local last_response = entries[#entries].response
return {
url = splash:url(),
headers = last_response.headers,
http_status = last_response.status,
html = splash:html(),
}
end
"""
However, the HTML that I got from the Splash response showed that I was not logged in.
Cookie DEBUG didn't show any issues - the same cookies were sent as before.
But here is what I got from calling response.request.headers:
{'Accept-Language': ['en'],
'Accept-Encoding': ['gzip,deflate'],
'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
'User-Agent': ['Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'],
'Cookie': ["___cookies___"],
'Content-Type': ['application/json']}
As you can see, Splash didn't set my custom headers; it just combined the cookies with its default headers.
I tried setting my own headers both as SplashRequest function arguments and inside lua_script, but neither approach worked.
My question is: how do I set my own request headers in Splash?
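For what it's worth, one direction to try is Splash's splash:on_request hook, which sets headers on every outgoing request rather than only on the initial splash:go navigation. A minimal sketch, assuming the headers are forwarded through args under a custom_headers key (that key name is an assumption, not part of the original request):

lua_script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    -- re-apply the custom headers to every request the page makes,
    -- not just the initial navigation
    splash:on_request(function(request)
        for name, value in pairs(splash.args.custom_headers) do
            request:set_header(name, value)
        end
    end)
    assert(splash:go(splash.args.url))
    assert(splash:wait(2))
    return {html = splash:html()}
end
"""

yield SplashRequest('https://www.example.com/products', cookies=self.cookies,
                    callback=self.parse_pages, endpoint='execute',
                    args={'lua_source': lua_script, 'custom_headers': self.my_headers})

Note that my_headers would have to be a plain, JSON-serializable dict for Splash to receive it in args.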

Related

How to get h3 tag with class in web scraping Python

I want to scrape the text of an h3 tag with a class, as shown in the attached photo.
I modified the code based on the posted recommendation:
import requests
import urllib.parse
session = requests.session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
'Accept': '*/*',
'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
'Content-Type': 'application/json',
'Origin': 'https://auth.fool.com',
'Connection': 'keep-alive',
})
response1 = session.get("https://www.fool.com/secure/login.aspx")
assert response1
response1.cookies
#<RequestsCookieJar[Cookie(version=0, name='_csrf', value='8PrzU3pSVQ12xoLeq2y7TuE1', port=None, port_specified=False, domain='auth.fool.com', domain_specified=False, domain_initial_dot=False, path='/usernamepassword/login', path_specified=True, secure=True, expires=1609597114, discard=False, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>
params = urllib.parse.parse_qs(response1.url)
params
payload = {
"client_id": params["client"][0],
"redirect_uri": "https://www.fool.com/premium/auth/callback/",
"tenant": "fool",
"response_type": "code",
"scope": "openid email profile",
"state": params["https://auth.fool.com/login?state"][0],
"_intstate": "deprecated",
"nonce": params["nonce"][0],
"password": "XXX",
"connection": "TMF-Reg-API",
"username": "XXX",
}
formatted_payload = "{" + ",".join([f'"{key}":"{value}"' for key, value in payload.items()]) + "}"
url = "https://auth.fool.com/usernamepassword/login"
response2 = session.post(url, data=formatted_payload)
response2.cookies
#<RequestsCookieJar[]>
response2.cookies is empty, so it seems that the login fails.
I can only give you some partial advice, but you might be able to find the "last missing piece" (I have no access to the premium content of your target page). It's correct that you need to log in first in order to get the content.
What's usually useful is a session that handles cookies. A proper set of headers also often does the trick:
import requests
import urllib.parse
session = requests.session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
'Accept': '*/*',
'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
'Content-Type': 'application/json',
'Origin': 'https://auth.fool.com',
'Connection': 'keep-alive',
})
Next we get some cookies for our session from the "official" login page:
response = session.get("https://www.fool.com/secure/login.aspx")
assert response
We will use some of the parameters of the response URL (yes, there are a couple of redirects) to build a valid payload for the actual login. Note that parse_qs is fed the whole URL rather than just its query string, which is why the first key below comes out as the odd-looking "https://auth.fool.com/login?state":
params = urllib.parse.parse_qs(response.url)
params
payload = {
"client_id": params["client"][0],
"redirect_uri": "https://www.fool.com/premium/auth/callback/",
"tenant": "fool",
"response_type": "code",
"scope": "openid email profile",
"state": params["https://auth.fool.com/login?state"][0],
"_intstate": "deprecated",
"nonce": params["nonce"][0],
"password": "#pas$w0яδ",
"connection": "TMF-Reg-API",
"username": "seralouk#stackoverflow.com",
}
formatted_payload = "{" + ",".join([f'"{key}":"{value}"' for key, value in payload.items()]) + "}"
Finally, we can login:
url = "https://auth.fool.com/usernamepassword/login"
response = session.post(url, data=formatted_payload)
Let me know if you are able to log in or if we need to tweak the script. And just some general comments: I normally use an incognito tab to inspect the browser requests and then copy them over to Postman, where I play around with the parameters and see how they influence the HTTP response.
I rarely use Selenium; I'd rather invest the time to build a proper request to be used with Python's internal library and then use BeautifulSoup.
Edit:
After logging in, you can use BeautifulSoup to parse the content of the actual site:
# add BeautifulSoup to our project
from bs4 import BeautifulSoup
# use the session with the login cookies to fetch the data
the_url = "https://www.fool.com/premium/stock-advisor/coverage/tags/buy-recommendation"
data = BeautifulSoup(session.get(the_url).text, 'html.parser')
my_h3 = data.find("h3", "content-item-headline")
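If the tag is found, its text can then be read off the result, e.g.:

print(my_h3.get_text(strip=True) if my_h3 else "h3 not found")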

Cannot login to a page with session.post(URL, data=payload)

I've been trying to automatically log in with python requests to a page to download a file every hour, but I haven't had any luck with it.
The page is this one: https://www.still-fds.com/fleetmanager/login/dercomaq
I'm using the Session object like this:
import requests
login_URL = 'https://www.still-fds.com/fleetmanager/login/dercomaq'
next_URL = 'https://www.still-fds.com/fleetmanager/pages/reports/vehicle.xhtml'
# (these aren't the real username and password; I use the real ones in my code)
payload = {"j_idt181:username": "User", "j_idt181:password": "Password"}
with requests.Session() as ses:
    ses.get(login_URL)                 # I get a JSESSIONID cookie here
    ses.post(login_URL, data=payload)  # I send the login request
    r = ses.get(next_URL)              # I try accessing the next page after login
    # write the HTML I get back to a file so I can preview it
    with open('login-test.html', 'wb') as f:
        f.write(r.content)
However, when I check the preview of the next page, it always redirects to or shows me the login page.
I've also tried sending a more complete payload, copying everything that gets sent in the normal login request, like this:
payload = {
'javax.faces.partial.ajax': 'true',
'javax.faces.source':'j_idt181:j_idt190',
'javax.faces.partial.execute': '#all',
'j_idt181:j_idt190': 'j_idt181:j_idt190',
'j_idt181': 'j_idt181',
'j_idt181:username': 'User',
'j_idt181:password': 'Password',
'javax.faces.ViewState': '-4453297688092219000:-1561371993877484606 '
}
And I've tried copying the request headers:
# I replace the JSESSIONID cookie with the one that I'm given in the first GET request
req_header = {
'Accept': 'application/xml, text/xml, */*; q=0.01'
,'Accept-Encoding': 'gzip, deflate, br'
,'Accept-Language': 'es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7'
,'Connection': 'keep-alive'
,'Content-Length': '287'
,'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
,'Cookie': 'JSESSIONID=U27BXdHsu+EXITX-QfUGkmqW; clientId=dercomaq'
,'Faces-Request': 'partial/ajax'
,'Host': 'www.still-fds.com'
,'Origin': 'https://www.still-fds.com'
,'Referer': 'https://www.still-fds.com/fleetmanager/login/dercomaq'
,'Sec-Fetch-Dest': 'empty'
,'Sec-Fetch-Mode': 'cors'
,'Sec-Fetch-Site': 'same-origin'
,'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
,'X-Requested-With': 'XMLHttpRequest'
}
However, I never get the clientId cookie (I've tried manually setting a clientId cookie, since it's always 'dercomaq', but that doesn't help either).
All of this works very easily when using Selenium and ChromeDriver, but because of web app constraints I can't use that.
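One detail that often matters with JSF logins like this one: javax.faces.ViewState is generated per session, so a value copied from the browser won't validate in a new session. A sketch that scrapes a fresh ViewState from the login page before posting (the hidden input name javax.faces.ViewState is standard JSF; everything else is taken from the question):

import requests
from bs4 import BeautifulSoup

login_URL = 'https://www.still-fds.com/fleetmanager/login/dercomaq'

with requests.Session() as ses:
    # fetching the login page sets JSESSIONID and carries the
    # per-session ViewState in a hidden input
    page = ses.get(login_URL)
    soup = BeautifulSoup(page.text, 'html.parser')
    viewstate = soup.find('input', {'name': 'javax.faces.ViewState'})['value']
    payload = {
        'j_idt181:username': 'User',
        'j_idt181:password': 'Password',
        'javax.faces.ViewState': viewstate,  # fresh token, not a copied one
    }
    ses.post(login_URL, data=payload)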

What elements of a login form do I need to perform this web scraping task?

I'm trying to log in and scrape a grading website. I've set up the following code to access the website and send a payload of:
- username/email
- password
- csrf_token
Is there additional information that I need to include in my payload in order to log in?
I'm using Python 2.7. I've added code to print out the last page the script opens, and it prints out the login page, making me think that it never successfully logged in.
import requests
from lxml import html
payload = {
"username": "...",
"password": "...",
"csrf_token": "ImE2N2E1YzkzZGU2ZjY3NjQ0YTc4YmZiYWJjNWRiN2Y3MjlhYWZmYjQi.XBvDVg.ALSRF6Ui7Y2L7ST0kQG-CC4HTzQ"
}
session_requests = requests.session()
login_url = "https://www.zipgrade.com/login"
user_url = 'https://www.zipgrade.com/user'
result = session_requests.get(login_url)
# make HTML parse tree from page
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrf_token']")))[0]
# send payload through
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
result = session_requests.get(
    user_url,
    headers=dict(referer=user_url)
)
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//div[@class='row']")
print(result.ok)
print(bucket_names[0].text_content().strip())
I would like it to take me to the 'https://www.zipgrade.com/user' page, but it looks like it's staying on the 'https://www.zipgrade.com/login' page.
Hmm... it seems there's a session token passed in the cookie header. I just tried to mimic a login, and my request looked like this:
import http.client
conn = http.client.HTTPSConnection("www.zipgrade.com")
payload = "username=some%40email.com&password=some%40password&csrf_token=IjhmNWU1Y2EzYWExMjcwM2FiZmY5MjEzOGUwNDQ2N2UxZWQ4ODY0OTMi.XBwSeg.RU2oZBM15U7-ECl1Ldfv7LYlcnQ%5E&origURL="
headers = {
'Connection': "keep-alive",
'Cache-Control': "max-age=0",
'Origin': "https://www.zipgrade.com",
'Upgrade-Insecure-Requests': "1",
'Content-Type': "application/x-www-form-urlencoded",
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
'Referer': "https://www.zipgrade.com/login/",
'Accept-Encoding': "gzip, deflate, br",
'Accept-Language': "en-US,en;q=0.9",
'Cookie': "session=eyJfcGVybWFuZW50Ijp0cnVlLCJjc3JmX3Rva2VuIjp7IiBiIjoiT0dZMVpUVmpZVE5oWVRFeU56QXpZV0ptWmpreU1UTTRaVEEwTkRZM1pURmxaRGc0TmpRNU13PT0ifX0.XBwSeg.EPMMH0CcBMif4qUoxGPKFvcnzRw",
'cache-control': "no-cache",
'Postman-Token': "865a89b0-c5cc-49b1-9e24-df413be64fc0"
}
conn.request("POST", "login,", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Notice that your payload is correct and you are passing the parameters right; however, there is a session token passed in the header, and you need to obtain that session token and send it with your header.
I'd make two requests: the first is a plain request to the login page https://www.zipgrade.com/login/, which returns a cookie containing the session parameter you need; parse the cookie and extract the session. When that's done, resume your scraping function and make sure you update the header variable with that session.
When you bang the URL for the session, you can grab the csrf token at the same time from the hidden input field, for example:
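A minimal sketch of that two-request flow with requests and lxml (the csrf_token field name is taken from the question):

import requests
from lxml import html

session = requests.session()

# request 1: the session cookie is stored on the session object
# automatically, and the fresh csrf token sits in a hidden input
result = session.get("https://www.zipgrade.com/login/")
tree = html.fromstring(result.text)
csrf_token = tree.xpath("//input[@name='csrf_token']/@value")[0]

# request 2: POST the credentials together with the freshly scraped token
payload = {"username": "...", "password": "...", "csrf_token": csrf_token}
result = session.post("https://www.zipgrade.com/login/", data=payload)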
This way your first call prepares you for the scraping call by gathering the dynamic tokens from both the cookie and the hidden input field.
Keep in mind that sessions on different sites have different expiration times; some session tokens can be used across multiple page scrapes, while others require a new session for each jump. Just a tip, but I think this will lead you in the right direction.

Python 3.4 login on aspx

I'm trying to log in to an aspx page and then get the contents of another page as a logged-in user.
import requests
from bs4 import BeautifulSoup
URL="https://example.com/Login.aspx"
durl="https://example.com/Daily.aspx"
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36'
language = 'en-US,en;q=0.8'
encoding = 'gzip, deflate'
accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
connection = 'keep-alive'
headers = {
"Accept": accept,
"Accept-Encoding": encoding,
"Accept-Language": language,
"Connection": connection,
"User-Agent": user_agent
}
username="user"
password="pass"
s=requests.Session()
s.headers.update(headers)
r=s.get(URL)
print(r.cookies)
soup=BeautifulSoup(r.content, "html.parser")
LASTFOCUS=soup.find(id="__LASTFOCUS")['value']
EVENTTARGET=soup.find(id="__EVENTTARGET")['value']
EVENTARGUMENT=soup.find(id="__EVENTARGUMENT")['value']
VIEWSTATEFIELDCOUNT=soup.find(id="__VIEWSTATEFIELDCOUNT")['value']
VIEWSTATE=soup.find(id="__VIEWSTATE")['value']
VIEWSTATE1=soup.find(id="__VIEWSTATE1")['value']
VIEWSTATE2=soup.find(id="__VIEWSTATE2")['value']
VIEWSTATE3=soup.find(id="__VIEWSTATE3")['value']
VIEWSTATE4=soup.find(id="__VIEWSTATE4")['value']
VIEWSTATEGENERATOR=soup.find(id="__VIEWSTATEGENERATOR")['value']
login_data={
"__LASTFOCUS":"",
"__EVENTTARGET":"",
"__EVENTARGUMENT":"",
"__VIEWSTATEFIELDCOUNT":"5",
"__VIEWSTATE":VIEWSTATE,
"__VIEWSTATE1":VIEWSTATE1,
"__VIEWSTATE2":VIEWSTATE2,
"__VIEWSTATE3":VIEWSTATE3,
"__VIEWSTATE4":VIEWSTATE4,
"__VIEWSTATEGENERATOR":VIEWSTATEGENERATOR,
"__SCROLLPOSITIONX":"0",
"__SCROLLPOSITIONY":"100",
"ctl00$NameTextBox":"",
"ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$UserName":username,
"ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$Password":password,
"ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$LoginButton":"Login",
"ctl00$ContentPlaceHolder1$RetrievePasswordUserNameTextBox":"",
"hiddenInputToUpdateATBuffer_CommonToolkitScripts":"1"
}
r1=s.post(URL, data=login_data)
print (r1.cookies)
d=s.get(durl)
print (d.cookies)
dsoup=BeautifulSoup(d.content, "html.parser")
print (dsoup)
But the thing is that the cookies are not preserved in the session, and I can't get to the next page as a logged-in user.
Can someone give me some pointers on this?
Thanks.
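As a side note, the hidden WebForms fields can be collected generically instead of being read one by one; a sketch that builds on the soup object above:

# gather every hidden input (all the __VIEWSTATE* pieces included) in one pass
login_data = {tag['name']: tag.get('value', '')
              for tag in soup.find_all('input', type='hidden')
              if tag.has_attr('name')}
login_data.update({
    "ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$UserName": username,
    "ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$Password": password,
    "ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$LoginButton": "Login",
})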
When you post to the login page:
r1=s.post(URL, data=login_data)
It's likely issuing a redirect to another page. So the response to the POST request returns the cookies, and then the server redirects to another page. It is the redirect target that is captured in r1, and it does not contain the cookies.
Try the same command but not allowing redirects:
r1 = s.post(URL, data=login_data, allow_redirects=False)
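A quick way to check is to inspect the un-redirected response directly, e.g.:

r1 = s.post(URL, data=login_data, allow_redirects=False)
print(r1.status_code)              # 302 means the POST triggered a redirect
print(r1.cookies)                  # cookies set on the POST response itself
print(r1.headers.get('Location'))  # where the redirect points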

Simulating ajax request with python using requests lib

Why does requests not download a response for this webpage?
#!/usr/bin/python
import requests
headers={ 'content-type':'application/x-www-form-urlencoded; charset=UTF-8',
'Accept-Encoding': 'gzip, deflate',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0',
'Referer' : 'http://sportsbeta.ladbrokes.com/football',
}
payload={'N': '4294966750',
'facetCount_156%23327': '12',
'facetCount_157%23325': '8',
'form-trigger':'moreId',
'moreId':'156%23327',
'pageId':'p_football_home_page',
'pageType':'EventClass',
'type':'ajaxrequest'
}
url='http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController'
r = requests.post(url, data=payload, headers=headers)
These are the parameters of the POST that I see in Firebug, and there the response received back contains a list (of football leagues), yet when I run my Python script like this I get nothing.
(You can see the request in Firefox by clicking "See All" in the competitions section of the left-hand nav bar and looking at the XHR in Firebug. The Firebug response shows the HTML body as expected.)
Anyone have any ideas? Could my handling of the % symbols in the payload be causing any trouble?
EDIT: Attempt using session
from requests import Request, Session
# turn a POST string into a dict:
def parsePOSTstring(POSTstr):
    paramList = POSTstr.split('&')
    paramDict = dict([param.split('=') for param in paramList])
    return paramDict
headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0',
'Referer' : 'http://sportsbeta.ladbrokes.com/football'
}
# prep the data (POSTstr copied from Firebug raw source)
POSTstr = ("moreId=156%23327&facetCount_156%23327=12&event=&N=4294966750&pageType=EventClass&"
           "pageId=p_football_home_page&type=ajaxrequest&eventIDNav=&removedSelectionNav=&"
           "currentSelectedId=&form-trigger=moreId")
payload = parsePOSTstring(POSTstr)
#end url
url='http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController'
#start a session to manage cookies, and visit football page first so referer agrees
s = Session()
s.get('http://sportsbeta.ladbrokes.com/football')
#now visit the desired url with headers/data
r = s.post(url, data=payload, headers=headers)
#print output
print r.text #this is empty
Working curl
curl 'http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController' \
  -H 'Cookie: JSESSIONID=DE93158F07E02DD3CC1CC32B1AA24A9E.ecomprodsw015; geoCode=FRA; FLAGS=en|en|uk|default|ODDS|0|GBP; ECOM_BETA_SPORTS=1; PLAYED=4%7C0%7C0%7C0%7C0%7C0%7C0' \
  -H 'Referer: http://sportsbeta.ladbrokes.com/football' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0' \
  --data 'facetCount_157%23325=8&moreId=156%23327&facetCount_156%23327=12&event=&N=4294966750&pageType=EventClass&pageId=p_football_home_page&type=ajaxrequest&eventIDNav=&removedSelectionNav=&currentSelectedId=&form-trigger=moreId' \
  --compressed
Yet this curl works.
Here's the smallest working example that I can come up with:
from requests import Session
session = Session()
# HEAD requests ask for *just* the headers, which is all you need to grab the
# session cookie
session.head('http://sportsbeta.ladbrokes.com/football')
response = session.post(
url='http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController',
data={
'N': '4294966750',
'form-trigger': 'moreId',
'moreId': '156#327',
'pageType': 'EventClass'
},
headers={
'Referer': 'http://sportsbeta.ladbrokes.com/football'
}
)
print response.text
You just weren't decoding the percent-encoded POST data properly, so # was being represented as %23 in the actual POST data (e.g. 156%23327 should've been 156#327).
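Applied to the session attempt above, the helper could percent-decode keys and values as it parses. A sketch (Python 2, matching the question; Python 3 would import unquote from urllib.parse):

from urllib import unquote  # Python 2; on Python 3: from urllib.parse import unquote

def parsePOSTstring(POSTstr):
    paramDict = {}
    for param in POSTstr.split('&'):
        key, _, value = param.partition('=')
        paramDict[unquote(key)] = unquote(value)
    return paramDict

payload = parsePOSTstring(POSTstr)  # 'moreId' now maps to '156#327', not '156%23327'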
