I have an AJAX-based website, https://stackshare.io/application_and_data. I am trying to scrape the logos of the tech stacks across all the pages. I used Selenium's find_elements_by_class_name, but it returns an empty list. The jQuery call I found in the XHR requests does not expose a URL I can use directly, so I need help reverse-engineering the jQuery script.
The other URLs I found in the Network tab also seem to fail. I tried Postman to replicate the request, but could not do it correctly.
Any help is very much appreciated.
import time
import requests
from bs4 import BeautifulSoup
import urlparse
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox(executable_path="/home/Documents/geckodriver")
driver.get("https://stackshare.io/application_and_data/")
content = driver.find_elements_by_class_name("btn btn-ss-alt btn-lg load-more-layer-stacks")
content_1 = driver.find_elements_by_class_name("div-center hidden-xs")
Both content and content_1 give an empty list. How do I proceed, or what am I doing wrong here?
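One thing worth checking first: find_elements_by_class_name expects a single class name, so passing a space-separated list matches nothing. A minimal sketch using a CSS selector instead (the class names are taken from the code above; the img inside the second selector is a guess and untested against the live page):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox(executable_path="/home/Documents/geckodriver")
driver.get("https://stackshare.io/application_and_data/")
# Compound class names need a CSS selector: .a.b.c matches elements carrying all the classes
load_more = driver.find_elements(By.CSS_SELECTOR, ".btn.btn-ss-alt.btn-lg.load-more-layer-stacks")
logos = driver.find_elements(By.CSS_SELECTOR, ".div-center.hidden-xs img")
print(len(load_more), len(logos))
driver.quit()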
Following is the reverse-engineering approach I tried.
request_url = 'https://stackshare.io/application_and_data/load-more'
request_headers = {
'Accept' : '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language' : 'en-GB,en;q=0.5',
'Connection' : 'keep-alive',
'Content-Length' : '128',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'cookie' :'_stackshare_production_session=cUNIOVlrV0h2dStCandILzJDWmVReGRlaWI1SjJHOWpYdDlEK3BzY2JEWjF3Lzd6Z0F6Zmg1RjUzNGo0U1dPNFg2WHdueDl5VEhCSHVtS2JiaVdNN0FvRWJMV0pBS0ZaZ0RWYW14bFFBcm1OaDV6RUptZlJMZ29TQlNOK1pKOFZ3NTVLbEdmdjFhQnRLZDl1d29rSHVnPT0tLWFzQlcrcy9iQndBNW15c0lHVHlJNkE9PQ%3D%3D--b0c41a10e8b0cf8cd020f7b07d6507894e50a9c5; ajs_user_id=null; ajs_group_id=null; ajs_anonymous_id=%224cf45ffc-a1ab-4048-94ba-d8c58063df95%22; wooTracker=Psbca0UX84Do; _ga=GA1.2.877065752.1528363377; amplitude_id_63407ddf709a227ea844317f20f7b56estackshare.io=eyJkZXZpY2VJZCI6IjcwYmNiMGQ3LTM1MjAtNDgzZi1iNWNlLTdmMTIzYzQxZGEyMVIiLCJ1c2VySWQiOm51bGwsIm9wdE91dCI6ZmFsc2UsInNlc3Npb25JZCI6MTUyODgwNTg2ODQ0NiwibGFzdEV2ZW50VGltZSI6MTUyODgwNjc0Nzk2OSwiZXZlbnRJZCI6ODUsImlkZW50aWZ5SWQiOjUsInNlcXVlbmNlTnVtYmVyIjo5MH0=; uvts=7an3MMNHYn0XBZYF; __atuvc=3%7C23; _gid=GA1.2.685188865.1528724539; amplitude_idundefinedstackshare.io=eyJvcHRPdXQiOmZhbHNlLCJzZXNzaW9uSWQiOm51bGwsImxhc3RFdmVudFRpbWUiOm51bGwsImV2ZW50SWQiOjAsImlkZW50aWZ5SWQiOjAsInNlcXVlbmNlTnVtYmVyIjowfQ==; _gat=1; _gali=wrap',
'Host' :'stackshare.io',
'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
'Referer' :'https://stackshare.io/application_and_data',
'X-CSRF-Token' : 'OEhhwcDju+WcpweukjB09hDFPDhwqX…nm+4fAgbMceRxnCz7gg4g//jDEg==',
'X-Requested-With' : 'XMLHttpRequest'
}
payload = {}
response = requests.post(request_url, data=payload, headers=request_headers)
print response
Observation: I got a 499 response code. What payload do I need to send?
I checked the XHR request, but could not figure out the correct URL it leads to.
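For the reverse-engineering route, a hedged sketch of one approach: since the _stackshare_production_session cookie suggests a Rails app, the CSRF token is typically exposed in a csrf-token meta tag. Fetching the page and then POSTing through the same Session keeps the token and the cookies consistent (the empty payload is a placeholder; inspect the browser's request body for the real fields):
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    page = s.get('https://stackshare.io/application_and_data')
    soup = BeautifulSoup(page.text, 'html.parser')
    meta = soup.find('meta', {'name': 'csrf-token'})  # Rails convention; may differ on this site
    token = meta['content'] if meta else ''
    headers = {
        'X-CSRF-Token': token,
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://stackshare.io/application_and_data',
    }
    # Placeholder payload: the real form fields must be copied from the browser's Network tab
    r = s.post('https://stackshare.io/application_and_data/load-more', headers=headers, data={})
    print(r.status_code)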
Related
I have a crawling Python script that hangs on one URL: pulsepoint.com/sellers.json
The bot uses a standard request to get the content, but a 404 error is returned. In the browser it works (there is a 301 redirect, but requests can follow that). My first hunch is that this could be a request-header issue, so I copied my browser configuration. The code looks like this:
import logging
import requests

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info(" %d" % r.status_code)
But I am still getting a 404 Error.
My next guess:
Login? Not used here
Cookies? Not that I can see
So how is their server blocking my bot? This is a URL that is supposed to be crawled, by the way; nothing illegal.
Thanks in advance!
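One way to narrow this down is to look at the redirect chain itself; a minimal diagnostic sketch (the User-Agent is the one from the question):
import requests

url = 'http://pulsepoint.com/sellers.json'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}

# First, don't follow redirects, to see where the 301 points
r = requests.get(url, headers=headers, allow_redirects=False)
print(r.status_code, r.headers.get('Location'))

# Then follow them and inspect the hops, the final URL, and the final status
r = requests.get(url, headers=headers)
print([h.status_code for h in r.history], r.url, r.status_code)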
You can also work around the SSL certificate error like below:
from urllib.request import urlopen
import ssl
import json
# this is a workaround for the SSL certificate error
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
response = urlopen(seller_json_url).read()
# print in dictionary format
print(json.loads(response))
Sample response:
{'contact_email': 'PublisherSupport@pulsepoint.com', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ...
...'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
You can just go directly to the link and extract the data; there is no need to follow the 301 to the correct link:
import requests
headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
url="https://projects.contextweb.com/sellersjson/sellers.json",
headers=headers,
verify=False,
)
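Note that verify=False makes urllib3 emit an InsecureRequestWarning on every call; if you go this route, you can silence it explicitly:
import urllib3

# Suppress the warning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)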
OK, just for other people, here is a hardened version of âńōŋŷXmoůŜ's answer, because:
some websites want headers in order to answer;
some websites use weird encodings;
some websites send gzipped answers even when gzip was not requested.
import urllib.request
import ssl
import json
from io import BytesIO
import gzip
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
req = urllib.request.Request(seller_json_url)
# ADDING THE HEADERS
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0')
req.add_header('Accept','application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
response = urllib.request.urlopen(req)
data=response.read()
# IN CASE THE ANSWER IS GZIPPED
if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
# ADAPTS THE ENCODING TO THE ANSWER
print(json.loads(data.decode(response.info().get_param('charset') or 'utf-8')))
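For comparison, requests handles gzip decompression and charset detection transparently, so an equivalent fetch (assuming the same two headers are enough for the server) can be much shorter:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'application/json,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
# requests decompresses gzip and picks the charset automatically
r = requests.get('http://pulsepoint.com/sellers.json', headers=headers)
print(r.json())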
Thanks again!
So I'm trying to scrape the open positions on this site, and when I use any kind of request (currently trying requests-html) it doesn't show everything that's in the HTML.
# Import libraries
import time
from bs4 import BeautifulSoup
from requests_html import HTMLSession
# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
session = HTMLSession()
# Connect to the URL
response = session.get(url)
response.html.render()
# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html5lib")
b = soup.findAll('a')
Not sure where to go from here. I originally thought the problem was due to JavaScript rendering, but rendering via requests-html is not fixing it.
The issue is that the initial GET doesn't return the data (which I assume is the job listings); the JS that does fetch it uses a POST with an authorization token in the header. You need to get this token and then make that POST to get the data.
This token appears to be dynamic, so getting it is a little wonky, but doable.
import json
from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

url0 = r'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
url=r'https://germanamerican.csod.com/services/x/career-site/v1/search'
s=HTMLSession()
r=s.get(url0)
print(r.status_code)
r.html.render()
soup=bs(r.text,'html.parser')
scripts=soup.find_all('script')
for script in scripts:
    if 'csod.context=' in script.text:
        x = script
j = json.loads(x.text.replace('csod.context=', '').replace(';', ''))
payload={
'careerSiteId': 5,
'cities': [],
'countryCodes': [],
'cultureId': 1,
'cultureName': "en-US",
'customFieldCheckboxKeys': [],
'customFieldDropdowns': [],
'customFieldRadios': [],
'pageNumber': 1,
'pageSize': 25,
'placeID': "",
'postingsWithinDays': None,
'radius': None,
'searchText': "",
'states': []
}
headers={
'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'authorization': 'Bearer '+j['token'],
'cache-control': 'no-cache',
'content-length': '272',  # note: requests normally computes this header automatically
'content-type': 'application/json',
'csod-accept-language': 'en-US',
'origin': 'https://germanamerican.csod.com',
'referer': 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
'x-requested-with': 'XMLHttpRequest'
}
r=s.post(url,headers=headers,json=payload)
print(r.status_code)
print(r.json())
The r.json() that's printed out is a nicely formatted JSON version of the table of job listings.
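Since the payload carries pageNumber and pageSize, later pages should be reachable the same way. A hedged sketch reusing s, url, headers, and payload from the code above (the page range is a guess, as the response schema isn't documented here):
# Let requests compute the length per request instead of the hard-coded value
headers.pop('content-length', None)

# Fetch the first few pages by bumping pageNumber; adjust the range as needed
results = []
for page in range(1, 6):
    payload['pageNumber'] = page
    r = s.post(url, headers=headers, json=payload)
    r.raise_for_status()
    results.append(r.json())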
I don't think it's possible to scrape that website with Requests.
I would suggest using Selenium or Scrapy.
Welcome to SO!
Unfortunately, you won't be able to scrape that page with requests (nor with requests_html or similar libraries), because you need a tool that can handle dynamic, JavaScript-based pages.
With Python, I would strongly suggest Selenium and its webdriver. Below is a piece of code that prints the desired output, i.e., all listed jobs (NB: it requires Selenium and the Firefox webdriver to be installed and on the correct PATH to run):
# Import libraries
from bs4 import BeautifulSoup
from selenium import webdriver
# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
browser = webdriver.Firefox()  # initialize the webdriver; I use Firefox, but it might be Chromium or another browser
browser.get(url)  # go to the desired page; you might want to wait a bit on a slow connection (see the explicit-wait sketch after this block)
page = browser.page_source  # the page source, now complete with the listings that have been loaded
soup = BeautifulSoup(page, "lxml")
jobs = soup.findAll('a', {'data-tag' : 'displayJobTitle'})
for j in jobs:
print(j.text)
browser.quit()
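If the connection is slow, an explicit wait is more reliable than a fixed sleep. A sketch using Selenium's WebDriverWait (the data-tag attribute comes from the soup query above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican')
# Wait up to 15 seconds for at least one job-title link to appear
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a[data-tag="displayJobTitle"]'))
)
for a in browser.find_elements(By.CSS_SELECTOR, 'a[data-tag="displayJobTitle"]'):
    print(a.text)
browser.quit()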
I am trying to crawl a website and copied the request headers from Chrome directly; however, after using requests.get, the returned content is empty, even though the headers printed from requests are correct. Does anyone know the reason for this? Thanks!
Mac, Chrome, Python3.7
import requests
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
'Cookie': '_RSG=Ja4TD8hvFh2MGc7wBysunA; _RDG=28458f5367f9b123363c043b75e3f9aa31; _RGUID=2acfe6b2-0d74-4913-ac78-dbc2fa1e6416; _abtest_userid=bce0b01e-fdb6-48c8-9b86-4e1d8ef468df; _ga=GA1.2.937100695.1547968515; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuery=&SmartLinkHost=&SmartLinkLanguage=zh; HotelCityID=5split%E5%93%88%E5%B0%94%E6%BB%A8splitHarbinsplit2019-01-25split2019-01-26split0; Mkt_UnionRecord=%5B%7B%22aid%22%3A%224897%22%2C%22timestamp%22%3A1548157938143%7D%5D; ASP.NET_SessionId=w1pq5dvchogxhbnxzmbgbtkk; OID_ForOnlineHotel=1509697509766jepc81550141458933102003; _RF1=123.165.147.203; MKT_Pagesource=PC; HotelDomesticVisitedHotels1=698432=0,0,4.5,3674,/hotel/8000/7899/df84daa197dd4b868868cba4db14f71f.jpg,&448367=0,0,4.3,4455,/fd/hotel/g6/M02/6D/8B/CggYtFc1nAKAEnRYAAdgA-rkEXw300.jpg,&13679014=0,0,4.9,1484,/200g0w000000k4wqrB407.jpg,; __zpspc=9.6.1550232718.1550232718.1%234%7C%7C%7C%7C%7C%23; _jzqco=%7C%7C%7C%7C1550232718632%7C1.2024536341.1547968514847.1550141461869.1550232718448.1550141461869.1550232718448.undefined.0.0.13.13; _gid=GA1.2.506035914.1550232719; _bfi=p1%3D102003%26p2%3D102003%26v1%3D18%26v2%3D17; appFloatCnt=8; _bfa=1.1509697509766.jepc8.1.1550141458610.1550232715314.7.19; _bfs=1.2',
'Host': 'hotels.ctrip.com',
'Referer': 'http://hotels.ctrip.com/hotel/698432.html?isFull=F',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'
}
url = 'http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815'
data = requests.get(url, headers = headers)
print(data.request.headers)
The request-header information that you shared in the image shows that the server responded correctly to the request. Also, the actual URL that you shared, http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815,
was different from the one shown in the image. In fact, the page calls a lot of other URLs to build the final page, so there is no guarantee that requests will give you the response you see in the browser. If the server-side implementation depends on the browser's JavaScript engine executing scripts and then rendering the content, you won't be able to get the final HTML as it looks in the browser. It would be better to use the Selenium webdriver in those cases to hit the URL and then get the HTML content, as sketched below. Again, if you can share the actual URL, I can suggest other ideas.
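A minimal sketch of the Selenium route suggested above (assuming the Firefox driver is installed; the page may still expect valid cookies):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://hotels.ctrip.com/hotel/698432.html?isFull=F')
# page_source holds the HTML after the browser has executed the page's JavaScript
html = driver.page_source
print(len(html))
driver.quit()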
I'm trying to log in to a website using a POST request like this:
import requests
cookies = {
'_SID': 'c1i73k2mg3sj0ugi5ql16c3sp7',
'isCookieAllowed': 'true',
}
headers = {
'Host': 'service.premiumsim.de',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Referer': 'https://service.premiumsim.de/',
'Content-Type': 'application/x-www-form-urlencoded',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
}
data = [
('_SID', 'c1i73k2mg3sj0ugi5ql16c3sp7'),
('UserLoginType[alias]', 'username'),
('UserLoginType[password]', 'password'),
('UserLoginType[logindata]', ''),
('UserLoginType[_token]', '1af70f3d0e5b9e6c39e1475b6d84e9d125d076de'),
]
requests.post('https://service.premiumsim.de/public/login_check', headers=headers, cookies=cookies, data=data)
The problem is the 'UserLoginType[_token]' entry in the data. The above code is working, but I don't have that token and have no clue how to generate it, so when I make my request without the _token, the request fails.
Google did not find any helpful information about UserLoginType.
Does anyone know how to generate it (e.g. with another request first) to be able to log in?
Edit:
Thanks to t.m.adam's suggestion, I used bs4 to get the token:
import requests
from bs4 import BeautifulSoup as bs
tokenRequest = requests.get('https://service.premiumsim.de')
html_bytes = tokenRequest.text
soup = bs(html_bytes, 'lxml')
token = soup.find('input', {'id':'UserLoginType__token'})['value']
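One caveat: such tokens are usually tied to the session cookie, so it is safer to fetch the login page and submit the form through the same Session. A sketch reusing the field names from the question (the credentials are placeholders):
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    # GET the login page first so the token and the session cookie match
    page = s.get('https://service.premiumsim.de')
    soup = bs(page.text, 'lxml')
    token = soup.find('input', {'id': 'UserLoginType__token'})['value']
    data = {
        'UserLoginType[alias]': 'username',     # placeholder
        'UserLoginType[password]': 'password',  # placeholder
        'UserLoginType[logindata]': '',
        'UserLoginType[_token]': token,
    }
    r = s.post('https://service.premiumsim.de/public/login_check', data=data)
    print(r.status_code)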
Trying to wrap my head around using requests to get JavaScript-loaded content without spawning an actual browser to render it. I'm looking at using the requests library to get the tables, but I keep getting a 504 with my test code and I'm not 100% sure why.
So I'm looking at getting horse racing data from: sports.betway.com/#/horse-racing/uk-and-ireland/haydock
I watched the network traffic and found the source of the traffic. It's a call to /emoapi/emos with an eventIds number.
I tried this:
import requests
url = 'https://sports.betway.com/emoapi/emos'
params = {
'eventIds': '807789',
'lang': 'en'
}
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive',
'Content-Length': '271',
'Content-Type': 'application/json',
'Host': 'sports.betway.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'}
#Note: I do also set the origin and ref link in the header but I can't post that many links in a question.
response = requests.post(url, params=params, headers=headers)
print response
fixtures = response.json()
print fixtures
I can't see what else I'm missing from the request, but the printed response comes back as a <Response [504]>.
This is an example of the full payload on the browser header which requests a whole bunch of Ids rather than just the one I'm trying:
{"eventIds":[807789,808612,808597,807790,808613,808598,807791,808611,808599,807792,808614,808600,807793,808615,808601,807794,808616,808602,807795,808617,807781,808591,807782,808589,807783,808590,807785,808592,807784,808593,807786,808594,807788,808595,807787],"lang":"en"}
And it's a POST to that URL so I'm not sure why it's timing out.
Can anyone shed any light on where I'm going wrong here? Is it something painfully obvious?
The payload should be included in the request body rather than in the URL params.
The payload in this case is a raw JSON string:
import requests
url = 'https://sports.betway.com/emoapi/emos'
data = '{"eventIds": [807789]}'
response = requests.post(url, data=data)
print response.text
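Equivalently, the json keyword lets requests serialize the payload and set the Content-Type header for you:
import requests

url = 'https://sports.betway.com/emoapi/emos'
# json= serializes the dict and adds Content-Type: application/json automatically
response = requests.post(url, json={'eventIds': [807789], 'lang': 'en'})
print(response.text)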