How to scrape single-page application websites in Python using bs4

I am scraping player names from the NBA website. The player index page is built as a single-page application, and the players are distributed across several pages in alphabetical order. I am unable to extract the names of all the players.
Here is the link: https://in.global.nba.com/playerindex/
from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get('https://in.global.nba.com/playerindex/')
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'lxml')

names = []
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span", class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)

When dealing with SPAs, you shouldn't try to extract the info from the DOM, because the DOM is incomplete until a JS-capable browser runs the scripts that populate it with data. Open the page source and you'll see that the raw HTML doesn't contain the data you need.
But most SPAs load their data using XHR requests. You can monitor network requests in Developer Console (F12) to see the requests being made during page load.
Here, https://in.global.nba.com/playerindex/ loads the player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en
Simulate that request yourself, then pick whatever you need. Inspect the request headers to figure out what you need to send with the request.
import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'

    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate the session with the necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()

    data = res.json()
    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)
output:
['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
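If you also want the separate first/last fields that the question's make class stored, one rough option is to split the display names yourself. This is only a sketch: splitting on the first space is a heuristic and breaks for players with multi-word first names.

class Player:
    def __init__(self, first, last):
        self.first = first
        self.last = last

player_names = ['Steven Adams', 'Bam Adebayo']  # e.g. taken from the output above

players = []
for full_name in player_names:
    # partition on the first space; everything after it is treated as the last name
    first, _, last = full_name.partition(' ')
    players.append(Player(first, last))

print(players[0].first, players[0].last)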
Dealing with auth
One thing to watch out for is that some websites require an authentication token to be sent with requests. You can see it in the API requests if it's present.
If you're building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.
This token (usually a JWT, starting with ey...) often sits somewhere in the HTML, encoded as JSON. Or it is sent to the client as a cookie, which the browser attaches to requests, or in a response header. In short, it could be anywhere. Scan the requests and responses to figure out where the token comes from and how you can retrieve it yourself.
...
<script>
const state = {"token": "ey......", ...};
</script>
import json
import re

import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON;
# we take everything after the `=` sign up to the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']

res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
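If the token arrives as a cookie instead of being embedded in the HTML, a requests.Session already stores it after the first page load. A minimal sketch (the cookie name auth_token and the URLs are placeholders, not something this site actually uses):

import requests

s = requests.Session()
s.get('url/to/page')  # the response sets the auth cookie on the session

# 'auth_token' is a hypothetical cookie name - inspect your own traffic to find the real one
token = s.cookies.get('auth_token')

res = s.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})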

Related

Scraping data from a site with a "load more" button where the JSON file doesn't load

So I am trying to scrape all the available jobs from the following site: https://apply.workable.com/fitxr/ The issue is that the site uses JavaScript and has a "load more" button.
I went to the Chrome network tab and found the JSON file that the site uses.
However, when I go to https://apply.workable.com/api/v3/accounts/fitxr/jobs I get a "not found" error.
Not sure how to get the data.
Here is the code I wrote to try to scrape the data via XPath.
import requests
from lxml import html

data = []
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
url = "https://apply.workable.com/fitxr/"

page = requests.get(url, headers=headers)
tree = html.fromstring(page.content)

xpath = '/html/body/div/div/div/main/div[2]/ul/li[*]/div/h3'
jobs = tree.xpath(xpath)
for job in jobs:
    print(job.text)
And here is my attempt using the JSON endpoint:
import requests

data = []
url = "https://apply.workable.com/api/v3/accounts/fitxr/jobs"

r = requests.get(url)
json = r.json()

for x in range(len(json["results"])):
    print(json["results"][x]["title"])
Both sets of code return nothing.
The request you found in your browser's developer tools is a POST request to the /jobs endpoint; your attempt used requests.get, which sends a GET request to the same endpoint, and /jobs apparently does not respond to GET requests.
Change your requests.get() call to requests.post() instead:
import requests

data = []
url = "https://apply.workable.com/api/v3/accounts/fitxr/jobs"

r = requests.post(url)
json = r.json()

for x in range(len(json["results"])):
    print(json["results"][x]["title"])
Results:
Engineering Manager - Services & Full Stack
Interim Talent Partner
Customer Experience Manager
Content Manager (Production)
Performance Marketing Manager
Performance Marketing Manager
Content Creator (Fitness and Music)
Content Creator (Fitness and Music)
Automation Tester
Engineering Manager - Security, Data and DevOps
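A slightly more defensive version of the same call (same endpoint and the same results/title keys as above, just with an explicit status check and direct iteration over the results list):

import requests

url = "https://apply.workable.com/api/v3/accounts/fitxr/jobs"

r = requests.post(url)
r.raise_for_status()  # fail loudly if the endpoint rejects the request

for job in r.json()["results"]:
    print(job["title"])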

Unable to scrape a name from a webpage using requests

I've created a script in Python to fetch a name that is populated after filling in an input on a webpage. Here is how you can get that name: after opening the webpage (the link is given below), put 16803 next to CP Number and hit the search button.
I know how to grab it using selenium, but I'm not interested in going that route. I'm trying to collect the name using the requests module, mimicking within my script the steps I can see in the Chrome dev tools as the request is sent to that site. The only thing I can't supply automatically within the payload parameter is ScrollTop.
Website link: https://www.icsi.in/student/Members/MemberSearch.aspx
This is my attempt:
import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    cookie_item = "; ".join([str(x) + "=" + str(y) for x, y in r.cookies.items()])
    soup = BeautifulSoup(r.text, "lxml")

    payload = {
        'StylesheetManager_TSSM': soup.select_one("#StylesheetManager_TSSM")['value'],
        'ScriptManager_TSM': soup.select_one("#ScriptManager_TSM")['value'],
        '__VIEWSTATE': soup.select_one("#__VIEWSTATE")['value'],
        '__VIEWSTATEGENERATOR': soup.select_one("#__VIEWSTATEGENERATOR")['value'],
        '__EVENTVALIDATION': soup.select_one("#__EVENTVALIDATION")['value'],
        'dnn$ctlHeader$dnnSearch$Search': soup.select_one("#dnn_ctlHeader_dnnSearch_SiteRadioButton")['value'],
        'dnn$ctr410$MemberSearch$ddlMemberType': 0,
        'dnn$ctr410$MemberSearch$txtCpNumber': 16803,
        'ScrollTop': 474,
        '__dnnVariable': soup.select_one("#__dnnVariable")['value'],
    }

    headers = {
        'Content-Type': 'multipart/form-data; boundary=----WebKitFormBoundaryBhsR9ScAvNQ1o5ks',
        'Referer': 'https://www.icsi.in/student/Members/MemberSearch.aspx',
        'Cookie': cookie_item,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }

    res = s.post(URL, data=payload, headers=headers)
    soup_obj = BeautifulSoup(res.text, "lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)
When I execute the above script I get the following error:
AttributeError: 'NoneType' object has no attribute 'text'
How can I grab a name populated upon filling in an input in a webpage using requests?
The main issue with your code is the data encoding. You've set the Content-Type header to "multipart/form-data", but that alone does not create multipart-encoded data. In fact it is a problem, because the actual encoding is different: the data parameter URL-encodes the POST data. In order to create multipart-encoded data, you should use the files parameter.
You could do that either by passing an extra dummy parameter to files,
res = s.post(URL, data=payload, files={'file':''})
(that would change the encoding for all POST data, not just the 'file' field)
Or you could convert the values in your payload dictionary to tuples, which is the expected structure when posting files with requests.
payload = {k:(None, str(v)) for k,v in payload.items()}
The first value is for the file name; it is not needed in this case so I've set it to None.
Next, your POST data should contain an __EVENTTARGET value, which is required in order to get a valid response. (When creating the POST data dictionary it is important to submit all the data that the server expects. We can get that data from a browser: either by inspecting the HTML form or by inspecting the network traffic.) The complete code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    soup = BeautifulSoup(r.text, "lxml")

    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['dnn$ctr410$MemberSearch$txtCpNumber'] = 16803
    payload["__EVENTTARGET"] = 'dnn$ctr410$MemberSearch$btnSearch'
    payload = {k: (None, str(v)) for k, v in payload.items()}

    r = s.post(URL, files=payload)
    soup_obj = BeautifulSoup(r.text, "lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)
After some more tests, I discovered that the server also accepts URL-encoded data (probably because there are no files posted). So you can get a valid response either with data or with files, provided that you don't change the default Content-Type header.
It is not necessary to add any extra headers. When using a Session object, cookies are stored and submitted by default. The Content-Type header is created automatically - "application/x-www-form-urlencoded" when using the data parameter, "multipart/form-data" using files. Changing the default User-Agent or adding a Referer is not required.
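For completeness, a sketch of the URL-encoded variant mentioned above; it is the same code minus the tuple conversion, letting requests set the Content-Type header itself:

import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    soup = BeautifulSoup(r.text, "lxml")

    # collect the hidden form fields exactly as before
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['dnn$ctr410$MemberSearch$txtCpNumber'] = 16803
    payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$btnSearch'

    # plain URL-encoded POST; requests sets Content-Type to
    # application/x-www-form-urlencoded on its own
    r = s.post(URL, data=payload)
    name = BeautifulSoup(r.text, "lxml").select_one(".name_head > span").text
    print(name)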

Login a Website and Web Scraping using python

I am trying to figure out ways to web-scrape the real estate website https://www.brickz.my/ for my research project. I have tried both Selenium and Beautiful Soup, and decided Beautiful Soup was the best way for me, since the URL structure for each property lets my code navigate the website easily and quickly.
I am trying to build a database of transactions for each property. Without logging in, only the 10 latest transactions are displayed for a particular property. By logging in, I can access all the transactions for a particular type of property. Here is the example:
Without logging in, I can only access 10 transactions for each property.
After logging in, I can access more than 10 transactions plus previously obscured property addresses.
I tried to log in using requests in Python, yet it keeps bringing me to the page as if I were not logged in, and I end up only managing to scrape the 10 latest transactions instead of the whole list. Here is an example of my login code in Python:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.brickz.my/login/", auth=('email', 'password'))

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}

soup = BeautifulSoup(page.content, 'html.parser')

# I put one of the property URLs to be scraped inside response
response = requests.get("https://www.brickz.my/transactions/residential/kuala-lumpur/titiwangsa/titiwangsa-sentral-condo/non-landed/?range=2012+Oct-",
                        headers=headers)
Here is what I used to scrape the table
table = BeautifulSoup(response.text, 'html.parser')
table_rows = table.find_all('tr')

names = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    names.append(row)
How can I successfully log in and get access to the whole transaction list? I heard about the Mechanize library, but it is not available for Python 3.
I am sorry if my question is not clear; this is my first time posting, and I only started learning Python a couple of months ago.
Try the code below. What do you see when you print it (after changing the email and password)? Doesn't it print Logout as a result?
import requests
from bs4 import BeautifulSoup

URL = "https://www.brickz.my/login/"

payload = {
    'email': 'your_email',
    'pw': 'your_password',
    'submit': 'Submit'
}

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    s.post(URL, data=payload)
    res = s.get("https://www.brickz.my/")
    soup = BeautifulSoup(res.text, "lxml")
    for items in soup.select("select#menu_select .nav2"):
        data = [' '.join(item.text.split()) for item in items.select("option")[-1:]]
        print(data)
A simple HTTP trace will show that a POST is made to https://www.brickz.my/login/ with email and pw as form parameters.
Which translates into this requests command:
import requests

session = requests.Session()
resp = session.post('https://www.brickz.my/login/',
                    data={'email': '<youremail>', 'pw': '<yourpassword>'})
if resp.ok:
    print("You should now be logged in")
    # then use the session to request the site, e.g.
    # resp = session.get("https://www.brickz.my/whatever")
WARNING: Untested since I don't have an account there.
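If the login does succeed, a sketch that reuses the session for the transactions page from the question and then parses the table the same way the asker did (form field names taken from the answer above, equally untested without an account):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers = {"User-Agent": "Mozilla/5.0"}
resp = session.post('https://www.brickz.my/login/',
                    data={'email': '<youremail>', 'pw': '<yourpassword>'})

# fetch one of the property pages from the question with the logged-in session
url = ("https://www.brickz.my/transactions/residential/kuala-lumpur/"
       "titiwangsa/titiwangsa-sentral-condo/non-landed/?range=2012+Oct-")
page = session.get(url)

table = BeautifulSoup(page.text, 'html.parser')
rows = []
for tr in table.find_all('tr'):
    rows.append([td.text for td in tr.find_all('td')])
print(rows)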

how to use python requests to login to website

I'm trying to log in and scrape a job site and send myself a notification whenever certain keywords are found. I think I have correctly traced the XPath for the value of the field "login[iovation]", but I cannot extract the value. Here is what I have done so far to log in:
import requests
from lxml import html

header = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
login_url = 'https://www.upwork.com/ab/account-security/login'

session_requests = requests.session()

# get the CSRF token and the iovation value from the login page
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
auth_token = list(set(tree.xpath('//*[@name="login[_token]"]/@value')))
auth_iovation = list(set(tree.xpath('//*[@name="login[iovation]"]/@value')))

# create the payload
payload = {
    "login[username]": "myemail@gmail.com",
    "login[password]": "pa$$w0rD",
    "login[_token]": auth_token,
    "login[iovation]": auth_iovation,
    "login[redir]": "/home"
}

# perform the login
scrapeurl = 'https://www.upwork.com/ab/find-work/'
result = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))

# test the result
print(result.text)
This is a screenshot of the form data when I log in successfully.
This is because Upwork uses something called iOvation (https://www.iovation.com/) to reduce fraud. iOvation uses a digital fingerprint of your device/browser, which is sent via the login[iovation] parameter.
If you look at the JavaScript loaded on the site, you will find two scripts being loaded from the iesnare.com domain. This domain, and many others, are owned by iOvation to drop third-party JavaScript that identifies your device/browser.
I think if you copy the string from the successful login and send it over along with all the HTTP headers as-is, including the browser agent, in your Python code, you should be okay.
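Continuing from the question's script, that would mean replacing the scraped login[iovation] value with the blackbox string copied from the browser's form data after a successful manual login, and sending the same headers; whether Upwork accepts a replayed fingerprint is not guaranteed:

# continuation of the question's code; the value below is a placeholder,
# copied from the browser's form data after a successful manual login
payload["login[iovation]"] = "<blackbox string copied from the browser form data>"

result = session_requests.post(
    login_url,
    data=payload,
    # reuse the exact headers the browser sent, including the User-Agent
    headers={"User-Agent": header["User-Agent"], "Referer": login_url},
)
print(result.status_code)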
Are you sure that result is fetching a 2XX status code?
When I run result = session_requests.get(login_url), it returns a 403 status code, which means I am not even reaching login_url itself.
They have an official API now, no need for scraping, just register for API keys.

Google Search Web Scraping with Python

I've been learning a lot of Python lately to work on some projects at work.
Currently I need to do some web scraping with Google search results. I found several sites that demonstrated how to use the AJAX Google API to search; however, after attempting to use it, it appears to no longer be supported. Any suggestions?
I've been searching for quite a while but can't seem to find any solutions that currently work.
You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query>; this will return the top 10 search results.
Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree either via a CSS selector (.r a) or via an XPath selector (//h3[@class="r"]/a).
In some cases the resulting URL will redirect to Google. It usually contains a query parameter q which holds the actual result URL.
Example code using lxml and requests:
from urllib.parse import urlencode, urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    # unwrap Google's redirect links to get the actual result URL
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)
A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. It will respond with a 503 if it thinks you are a bot.
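One simple way to lower that risk is to space the requests out and stop when you see a 503; a minimal sketch:

import time

import requests

headers = {"User-Agent": "Mozilla/5.0"}
queries = ["StackOverflow", "python web scraping"]

for q in queries:
    resp = requests.get("https://www.google.com/search", params={"q": q}, headers=headers)
    if resp.status_code == 503:
        # Google has decided we are a bot; back off instead of hammering it
        print("Got 503, stopping")
        break
    # parse resp.text with lxml as shown above
    time.sleep(10)  # pause between queries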
Here is another service that can be used for scraping SERPs: https://zenserp.com. It does not require a client library and is cheaper.
Here is a python code sample:
import requests

headers = {
    'apikey': '',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
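Continuing from that snippet, the response body is JSON, so printing it is the quickest way to see what fields are available before picking anything out (the exact schema is defined by the provider):

# inspect the raw result before deciding which fields to extract
print(response.status_code)
print(response.json())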
You have two options: building it yourself or using a SERP API.
A SERP API will return the Google search results as a formatted JSON response.
I would recommend a SERP API as it is easier to use, and you don't have to worry about getting blocked by Google.
1. SERP API
I have good experience with the scraperbox serp api.
You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your scraperbox API token.
import urllib.parse
import urllib.request
import ssl
import json
ssl._create_default_https_context = ssl._create_unverified_context
# Urlencode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")
# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"
# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)
# Print the first result title
print(response["organic_results"][0]["title"])
2. Build your own Python scraper
I recently wrote an in-depth blog post on how to scrape search results with Python.
Here is a quick summary.
First you should get the HTML contents of the Google search result page.
import urllib.request
url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'
# Perform the request
request = urllib.request.Request(url)
# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()
# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
Then you can use BeautifulSoup to extract the search results.
For example, the following code will get all titles.
from bs4 import BeautifulSoup
# The code to get the html contents here.
soup = BeautifulSoup(html, 'html.parser')
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
# Search for a h3 tag
results = div.select("h3")
# Check if we have found a result
if (len(results) >= 1):
# Print the title
h3 = results[0]
print(h3.get_text())
You can extend this code to also extract the search result urls and descriptions.
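One possible extension for the URLs, assuming each title's h3 sits inside the result's <a> tag (Google's markup changes often, so treat the selector as a starting point):

from bs4 import BeautifulSoup

# html comes from the request code shown earlier
soup = BeautifulSoup(html, 'html.parser')

for div in soup.select("#search div.g"):
    results = div.select("h3")
    if len(results) >= 1:
        h3 = results[0]
        # assumption: the h3 is nested inside the link to the result
        link = h3.find_parent("a")
        url = link["href"] if link is not None else None
        print(h3.get_text(), url)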
You can also use a third-party service like Serp API (I wrote and run this tool), which is a paid Google search engine results API. It solves the issue of being blocked, and you don't have to rent proxies or do the result parsing yourself.
It's easy to integrate with Python:
from lib.google_search_results import GoogleSearchResults

params = {
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "hl": "en",
    "gl": "us",
    "google_domain": "google.com",
    "api_key": "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
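A short usage sketch on top of that; organic_results, title and link are assumed key names here, so check the dictionary the library actually returns for your query:

# iterate over the organic results of the SERP API response
for result in dictionary_results.get("organic_results", []):
    print(result.get("title"), result.get("link"))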
GitHub: https://github.com/serpapi/google-search-results-python
The current answers will work, but Google will ban your IP for scraping.
My current solution uses requests_ip_rotator:
import os

import requests
from requests_ip_rotator import ApiGateway

keywords = ['test']


def parse(keyword, session):
    url = f"https://www.google.com/search?q={keyword}"
    response = session.get(url)
    print(response)


if __name__ == '__main__':
    AWS_ACCESS_KEY_ID = ''
    AWS_SECRET_ACCESS_KEY = ''

    gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
                         access_key_secret=AWS_SECRET_ACCESS_KEY)
    gateway.start()

    session = requests.Session()
    session.mount("https://www.google.com", gateway)

    for keyword in keywords:
        parse(keyword, session)

    gateway.shutdown()
You can create the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the AWS console.
This solution allows you to make up to 1 million requests (the Amazon free-tier limit).
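A variant of the same script that reads the AWS credentials from environment variables (the script already imports os) and guarantees the gateway is shut down even if a request fails, so no AWS resources are left running:

import os

import requests
from requests_ip_rotator import ApiGateway

gateway = ApiGateway("https://www.google.com",
                     access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                     access_key_secret=os.environ["AWS_SECRET_ACCESS_KEY"])
gateway.start()

session = requests.Session()
session.mount("https://www.google.com", gateway)

try:
    print(session.get("https://www.google.com/search?q=test"))
finally:
    gateway.shutdown()  # tear down the API gateways when finished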
