Script fails to extract certain fields from a webpage using requests - python

I'm trying to scrape the title, website and email from this webpage. The content there is heavily dynamic. Although the requests module can't handle JavaScript-heavy content, there is usually an alternative way to grab the same data with that very library, such as using an external link retrieved from dev tools. However, I just can't find the right way to do the trick.
I've tried with:
import requests
from bs4 import BeautifulSoup

link = 'https://www.firmy.cz/detail/13153157-azajola-stare-mesto-nova-seninka.html'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    title = soup.select_one("h1").text
    website = soup.select_one("a.companyUrl").text
    email = soup.select_one("a.companyMail").text
    print(title, website, email)
When I run the script above, it throws an AttributeError.
Output that I'm after:
AZAJOLA, s.r.o., www.chatanovaseninka.cz, info#chatanovaseninka.cz
PS: I don't wish to go for any browser simulator like Selenium.

The site uses the FastRPC protocol (additionally encoded with Base64). You can install the pyfrpc module from https://pypi.org/project/pyfrpc/ to encode/decode the messages:
import re
import base64
from pprint import pprint

import requests
import pyfrpc  # <-- install from https://pypi.org/project/pyfrpc/

url = "https://www.firmy.cz/detail/13153157-azajola-stare-mesto-nova-seninka.html"
api_url = "https://www.firmy.cz/RPC2/"
id_ = int(re.search(r"detail/(\d+)", url).group(1))

headers = {
    "Accept": "application/x-base64-frpc",
    "Content-Type": "application/x-base64-frpc",
}

c = pyfrpc.FrpcCall(
    name="system.multicall",
    args=[
        [
            {
                "methodName": "detail.getDetail",
                "params": [
                    {"version": "1.0"},
                    id_,
                    {"searchInCategory": False, "deliveryNetwork": ""},
                ],
            },
            {
                "methodName": "review.listReviews",
                "params": [{"version": "1.0"}, id_, 0, 3],
            },
        ]
    ],
)

msg_to_send = base64.b64encode(pyfrpc.encode(c, 0x0201))
r = requests.post(api_url, headers=headers, data=msg_to_send)
response = pyfrpc.decode(base64.b64decode(r.text))

# uncomment to see all data:
# pprint(response.data)

print(response.data[0]["result"]["title_web"])
print(response.data[0]["result"]["email"])
print(response.data[0]["result"]["url"].split("#")[0])
Prints:
AZAJOLA, s.r.o.
info#chatanovaseninka.cz
http://www.chatanovaseninka.cz

Related

Access denied - python selenium - even after using User-Agent and other headers

Using Python, I am trying to extract the option chain data table published publicly by the NSE exchange on https://www.nseindia.com/option-chain
I tried using a requests session as well as Selenium, but the website does not seem to allow data extraction by a bot.
Here are the attempts I made:
Instead of plain requests, I set up a session and first tried to get a csrf_token from https://www.nseindia.com/api/csrf-token and then called the URL. However, the website seems to have additional JavaScript-based authorization.
Studying the XHR and JS tabs of the Chrome developer console, the website seems to use certain JS scripts for first-time authorization, so I used Selenium this time. I also passed user-agent and Accept-Language arguments in the headers (as per this Stack Overflow answer) while loading the driver, but access is still blocked by the website.
Is there anything obvious that I am missing? Or will the website block every attempt at automated extraction using Selenium/requests + Python? Either way, how do I extract this data?
Below is my current code (to get the table contents from https://www.nseindia.com/option-chain):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36")
opts.add_argument("Accept-Language=en-US,en;q=0.5")
opts.add_argument("Accept=text/html")
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe",chrome_options=opts)
#driver.get('https://www.nseindia.com/api/csrf-token')
driver.get('https://www.nseindia.com/')
#driver.get('https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY')
driver.get('https://www.nseindia.com/option-chain')
The data is loaded via JavaScript from an external URL, but you first need to load cookies by visiting another URL:
import json
import requests
from bs4 import BeautifulSoup

symbol = 'NIFTY'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.nseindia.com/api/option-chain-indices?symbol=' + symbol

with requests.session() as s:
    # load cookies:
    s.get('https://www.nseindia.com/get-quotes/derivatives?symbol=' + symbol, headers=headers)
    # get data:
    data = s.get(url, headers=headers).json()
    # print data to screen:
    print(json.dumps(data, indent=4))
Prints:
{
    "records": {
        "expiryDates": [
            "03-Sep-2020",
            "10-Sep-2020",
            "17-Sep-2020",
            "24-Sep-2020",
            "01-Oct-2020",
            "08-Oct-2020",
            "15-Oct-2020",
            "22-Oct-2020",
            "29-Oct-2020",
            "26-Nov-2020",
            "31-Dec-2020",
            "25-Mar-2021",
            "24-Jun-2021",
            "30-Dec-2021",
            "30-Jun-2022",
            "29-Dec-2022",
            "29-Jun-2023"
        ],
        "data": [
            {
                "strikePrice": 4600,
                "expiryDate": "31-Dec-2020",
                "PE": {
                    "strikePrice": 4600,
                    "expiryDate": "31-Dec-2020",
                    "underlying": "NIFTY",
                    "identifier": "OPTIDXNIFTY31-12-2020PE4600.00",
                    "openInterest": 19,
                    "changeinOpenInterest": 0,
                    "pchangeinOpenInterest": 0,
                    "totalTradedVolume": 0,
                    "impliedVolatility": 0,
                    "lastPrice": 31,
                    "change": 0,
                    "pChange": 0,
                    "totalBuyQuantity": 10800,
                    "totalSellQuantity": 0,
                    "bidQty": 900,
                    "bidprice": 3.05,
                    "askQty": 0,
                    "askPrice": 0,
                    "underlyingValue": 11647.6
                }
            },
            {
                "strikePrice": 5000,
                "expiryDate": "31-Dec-2020",
                ...and so on.
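If you need the data as an actual table rather than raw JSON, the per-strike records can be flattened, for example with pandas. This is a minimal sketch assuming the JSON layout shown above stays the same; the selected columns are only an illustration:

import pandas as pd

# flatten each per-strike record into one row; nested "CE"/"PE" dicts
# become dotted column names such as "PE.lastPrice"
records = data['records']['data']
df = pd.json_normalize(records)

# keep a few readable columns if they exist in this response
cols = [c for c in ['strikePrice', 'expiryDate',
                    'CE.lastPrice', 'CE.openInterest',
                    'PE.lastPrice', 'PE.openInterest'] if c in df.columns]
print(df[cols].head())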

Can't extract a link connected to `see all` button from a webpage

I've created a script to log in to LinkedIn using requests. The script is working fine.
After logging in, I used this URL https://www.linkedin.com/groups/137920/ to scrape the name Marketing Intelligence Professionals from there, which you can see in this image.
The script can parse the name flawlessly. However, what I wish to do now is scrape the link connected to the See all button located at the bottom of that very page, shown in this image.
Group link (you have to log in to access the content)
This is what I've created so far (it can scrape the name shown in the first image):
import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['session_key'] = 'your email'          # put your username here
    payload['session_password'] = 'your password'  # put your password here
    r = s.post(post_url, data=payload)
    r = s.get(target_url)
    soup = BeautifulSoup(r.text, "lxml")
    items = soup.select_one("code:contains('viewerGroupMembership')").get_text(strip=True)
    print(json.loads(items)['data']['name']['text'])
How can I scrape the link connected to the See all button from there?
There is an internal REST API which is called when you click on "See All":
GET https://www.linkedin.com/voyager/api/search/blended
The keywords query parameter contains the title of the group you requested initially (the group title on the initial page).
In order to get the group name, you could scrape the HTML of the initial page, but there is an API which returns the group information when you give it the group ID:
GET https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:GROUP_ID
The group ID in your case is 137920, which can be extracted from the URL directly.
An example:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

username = 'your username'
password = 'your password'

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

group_res = re.search('.*/(.*)/$', target_url)
group_id = group_res.group(1)

with requests.Session() as s:
    # login
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['session_key'] = username
    payload['session_password'] = password
    r = s.post(post_url, data=payload)

    # API
    csrf_token = s.cookies.get_dict()["JSESSIONID"].replace("\"", "")
    r = s.get(f"https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:{group_id}",
              headers={
                  "csrf-token": csrf_token
              })
    group_name = r.json()["name"]["text"]
    print(f"searching data for group {group_name}")

    params = {
        "count": 10,
        "keywords": group_name,
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": 0
    }
    r = s.get(f"https://www.linkedin.com/voyager/api/search/blended?{urlencode(params)}&filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)",
              headers={
                  "csrf-token": csrf_token,
                  "Accept": "application/vnd.linkedin.normalized+json+2.1",
                  "x-restli-protocol-version": "2.0.0"
              })
    result = r.json()["included"]
    print(result)

    print("list of groupName/link")
    print([
        (t["groupName"], f'https://www.linkedin.com/groups/{t["objectUrn"].split(":")[3]}')
        for t in result
    ])
A few notes:
these API calls require a cookie session
these API calls require a specific header carrying an XSRF token, which is the same as the JSESSIONID cookie value
a special media type application/vnd.linkedin.normalized+json+2.1 is necessary for the search call
the parentheses inside the queryContext and filters fields shouldn't be URL-encoded, otherwise the API will not take these params into account (see the sketch just below)
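As a small illustration of that last point, here is how the search URL is assembled in the code above: only the simple params go through urlencode, while the List(...) parts are appended verbatim (the helper name is made up for this sketch):

from urllib.parse import urlencode

def build_blended_search_url(keywords):
    # hypothetical helper mirroring the request built in the code above
    base = "https://www.linkedin.com/voyager/api/search/blended"
    simple = urlencode({"count": 10, "keywords": keywords,
                        "origin": "SWITCH_SEARCH_VERTICAL", "q": "all", "start": 0})
    # List(...) values are appended as-is; passing them through urlencode
    # would escape the parentheses and the API would ignore the filters
    raw = "filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)"
    return f"{base}?{simple}&{raw}"

print(build_blended_search_url("Marketing Intelligence Professionals"))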
You can try Selenium: click the See all button, then scrape the content it links to:
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.linkedin.com/xxxx')
driver.find_element_by_name('s_image').click()
selenium docs: https://selenium-python.readthedocs.io/

How to scrape google?

So I want to scrape Google. I have successfully scraped Craigslist using this method, but I can't seem to scrape Google for some reason (yes, of course I changed the class and so on). This is what I want to scrape:
I want to scrape the websites' descriptions:
from selenium import webdriver

path = r"C:\Users\Skid\Desktop\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.google.com/#q=python+webscape+google")

posts = driver.find_elements_by_class_name("r")
for post in posts:
    print(post.text)
Solved: add a timer (import time, time.sleep(2)) before scraping, as in the sketch below.
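A minimal sketch of where the delay goes in the question's script (same selectors as above; the 2 seconds is simply what worked here):

import time
from selenium import webdriver

path = r"C:\Users\Skid\Desktop\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.google.com/#q=python+webscape+google")

# give the results page time to render before scraping
time.sleep(2)

posts = driver.find_elements_by_class_name("r")
for post in posts:
    print(post.text)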
You can scrape the Google search website names and descriptions using the BeautifulSoup web scraping library.
(There is more to read elsewhere about what CSS selectors are and the cons of using them.)
Check the code in the online IDE.
from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
# this URL params is taken from the actual Google search URL
# and transformed to a more readable format
params = {
    "q": "python web scrape google",  # query
    "gl": "us",                       # country to search from
    "hl": "en",                       # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

website_description_data = []

for result in soup.select(".tF2Cxc"):
    website_name = result.select_one(".yuRUbf a")["href"]
    description = result.select_one(".lEBKkf").text

    website_description_data.append({
        "website_name": website_name,
        "description": description
    })

print(json.dumps(website_description_data, indent=2))
Example output:
[
  {
    "website_name": "https://practicaldatascience.co.uk/data-science/how-to-scrape-google-search-results-using-python",
    "description": "Mar 13, 2021 \u2014 First, we're using urllib.parse.quote_plus() to URL encode our search query. This will add + characters where spaces sit and ensure that the\u00a0..."
  },
  {
    "website_name": "https://stackoverflow.com/questions/38619478/google-search-web-scraping-with-python",
    "description": "You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query> this will return the top\u00a0..."
  }
  # ...
]

How to check if a url is indexed by google using Google Custom search API and Python?

I need to check if some URLs are indexed by Google using a Python script and Google Custom Search.
I'd like to obtain in the script the same results I get when I google site:www.example.it from my browser.
My code is:
import urllib2
import json
import pprint
data = urllib2.urlopen('https://www.googleapis.com/customsearch/v1?key=AIzaSyA3xNw1doOc4rjoUGc7sq1gltQvOgalHqA&cx=017576662512468239146:omuauf_lfve&q=site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1')
data=json.load(data)
print data
The output of this is:
{ u'kind': u'customsearch#search',
u'queries': { u'request': [ { u'count': 10,
u'cx': u'017576662512468239146:omuauf_lfve',
u'inputEncoding': u'utf8',
u'outputEncoding': u'utf8',
u'safe': u'off',
u'searchTerms': u'site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1',
u'title': u'Google Custom Search - site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1',
u'totalResults': u'0'}]},
u'searchInformation': { u'formattedSearchTime': u'0.55',
u'formattedTotalResults': u'0',
u'searchTime': 0.552849,
u'totalResults': u'0'},
u'url': { u'template': u'https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json',
u'type': u'application/json'}}
As you can see, there are no "items", while if you google site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1 you get at least one item.
After various experiments, it seems that Google Custom Search doesn't work for site:website queries.
Do you know any solution or alternative to this problem?
Thanks.
With Google CSE you specify the site via your CSE configuration (corresponding to your 'cx' parameter), not via the 'site:' query operator. In the 'Basics' tab of your CSE you should see a section called "Sites to search".
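If you still want to restrict a single request to one site, the URL template in the output above also lists a siteSearch parameter. A minimal sketch of using it with the JSON API (the key and cx values are placeholders, and the parameter names are taken from that template):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
CX = "YOUR_CSE_ID"        # placeholder: the 'cx' from your CSE control panel

params = {
    "key": API_KEY,
    "cx": CX,
    "q": "italicum riforma elettorale",  # plain query terms
    "siteSearch": "www.repubblica.it",   # restrict results to this site
    "siteSearchFilter": "i",             # 'i' = include only results from that site
}

data = requests.get("https://www.googleapis.com/customsearch/v1", params=params).json()

# treat the URL/site as indexed if the API reports any results
print(int(data.get("searchInformation", {}).get("totalResults", "0")) > 0)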
The URLs are in an Excel file:
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlencode

seconds = 3

proxies = {
    'https': 'https://localhost:8123',
    'http': 'http://localhost:8123'
}

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
headers = {'User-Agent': user_agent}

df = pd.read_excel('url_links.xlsx')

for i in range(0, len(df)):
    line = df.loc[i, 'links']
    # print(line)
    if line:
        query = {'q': 'site:' + line}
        google = "https://www.google.com/search?" + urlencode(query)
        data = requests.get(google, headers=headers)
        data.encoding = 'ISO-8859-1'
        soup = BeautifulSoup(str(data.content), "html.parser")
        try:
            check = soup.find(id="rso").find("div").find("div").find("div").find("div").find("div").find("a")["href"]
            print("URL is Indexed")
        except AttributeError:
            print("URL Not Indexed")
        time.sleep(float(seconds))
    else:
        print("Invalid Url")

google search with python requests library

(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to now, for example, do a Google search using the search bar presented. I've read the quickstart guide, but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?
Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (the cx parameter) from the Control Panel of your Custom Search Engine.
Then see the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
The list of parameters you can use is explained there.
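As a minimal sketch, the same example request expressed with requests (the cx value is copied from the example URL above and would be replaced with your own; per the overview above, the WebSearch service answers this kind of request with XML):

import requests

params = {
    "start": 0,
    "num": 10,
    "q": "red sox",
    "cr": "countryCA",
    "lr": "lang_fr",
    "client": "google-csbe",
    "output": "xml_no_dtd",
    "cx": "00255077836266642015:u-scht7a-8i",  # example CSE ID from above; use your own
}

r = requests.get("http://www.google.com/search", params=params)
print(r.status_code)
print(r.text[:500])  # first part of the XML response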
import requests
from bs4 import BeautifulSoup

headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}


def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)

    soup = BeautifulSoup(r.text, "html.parser")
    output = []
    for searchWrapper in soup.find_all('h3', {'class': 'r'}):  # this line may change in future based on google's web page structure
        url = searchWrapper.find('a')["href"]
        text = searchWrapper.find('a').text.strip()
        result = {'text': text, 'url': url}
        output.append(result)

    return output
This will return a list of Google results in {'text': text, 'url': url} format. The top result URL would be google('search query')[0]['url'].
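For example, a quick usage sketch of the function above (the query string is arbitrary):

results = google('python web scraping tutorial')
for hit in results:
    print(hit['text'], '->', hit['url'])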
input:
import requests

def googleSearch(query):
    with requests.session() as c:
        url = 'https://www.google.co.in'
        query = {'q': query}
        urllink = requests.get(url, params=query)
        print(urllink.url)

googleSearch('Linkin Park')
output:
https://www.google.co.in/?q=Linkin+Park
The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
    'q': 'minecraft',  # search query
    'gl': 'us',        # country where to search from
    'hl': 'en',        # language
}

requests.get('URL', params=params)
But, in order to get the actual response (output/text/data) that you see in the browser, you need to send additional headers, most importantly a user-agent: the script sends a user-agent string that announces it as a regular browser, so the request looks like a "real" user visit rather than a bot.
The reason your request might be blocked is that the default requests user agent is python-requests, and websites recognize that. Check what your user agent is (see the snippet below).
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
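A quick way to see the default user agent that requests sends (a one-off check, nothing Google-specific):

import requests

# requests announces itself as "python-requests/<version>" unless overridden
print(requests.utils.default_headers()['User-Agent'])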
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

params = {
    'q': 'minecraft',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "tesla",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer: I work for SerpApi.
In this code, using bs4, you can get all the h3 elements and print their text:
# Import the beautifulsoup
# and request libraries of python.
import requests
import bs4

# Make two strings with default google search URL
# 'https://google.com/search?q=' and
# our customized search keyword.
# Concatenate them.
text = "c++ linear search program"
url = 'https://google.com/search?q=' + text

# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result = requests.get(url)

# Creating soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text, "html.parser")

filter = soup.find_all("h3")
for i in range(0, len(filter)):
    print(filter[i].get_text())
You can use 'webbrowser'; I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')
