The following code scrapes data from specific selectors for the sites in siteUrlArray, and it works fine.
However, it requires me to write one function per website just to define the selectors. I tried to dynamically build the
soup.find and soup.select calls using exec and a dict holding the selector strings, but I can't get it to work.
Working Code
from bs4 import BeautifulSoup
import requests
import sys
import tldextract
def processmarketbeat(soup):
    try:
        mbtSel = soup.findAll("div", {"id": "cphPrimaryContent_tabAnalystRatings"})[0].find("table")
        print(mbtSel)
    except Exception as e:
        print(str(e))

def processwsj(soup):
    try:
        wsjSel = soup.select('.at8-col4 > .zonedModule')[0]
        print(wsjSel)
    except Exception as e:
        print(str(e))
global siteUrl, domain, header, parser
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',}
parser = 'html.parser'
siteUrlArray = ['http://www.marketbeat.com/stocks/NASDAQ/GOOGL/price-target',
'http://www.wsj.com/market-data/quotes/GOOGL/research-ratings',
'http://www.marketbeat.com/stocks/NASDAQ/SQ/price-target',
'http://www.wsj.com/market-data/quotes/SQ/research-ratings']
for i in range(len(siteUrlArray)):
    siteUrl = siteUrlArray[i]
    parsedUrl = tldextract.extract(siteUrl)
    domain = parsedUrl.domain
    r = requests.get(siteUrl, headers=header, verify=False)
    soup = BeautifulSoup(r.text, parser)
    getattr(sys.modules[__name__], "process%s" % domain)(soup)
Attempt at using dynamic selector
stockDict = {
    'marketbeat': '"""x = soup.findAll("div", {"id": "cphPrimaryContent_tabAnalystRatings"})[0].find("table")"""',
    'wsj': '"""x = soup.select(".at8-col4 > .zonedModule")[0]"""'
}
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',}
parser = 'html.parser'
siteUrlArray = ['http://www.marketbeat.com/stocks/NASDAQ/GOOGL/price-target',
'http://www.wsj.com/market-data/quotes/GOOGL/research-ratings',
'http://www.marketbeat.com/stocks/NASDAQ/SQ/price-target',
'http://www.wsj.com/market-data/quotes/SQ/research-ratings']
for i in range(len(siteUrlArray)):
    siteUrl = siteUrlArray[i]
    parsedUrl = tldextract.extract(siteUrl)
    domain = parsedUrl.domain
    r = requests.get(siteUrl, headers=header, verify=False)
    soup = BeautifulSoup(r.text, parser)
    selector = stockDict.get(domain)
    exec(selector)
    # I want the exec to run the equivalent of
    # x = soup.findAll("div", {"id": "cphPrimaryContent_tabAnalystRatings"})[0].find("table")
    # so that I can print the tags as print(x)
    print(x)
But x prints as None instead of the HTML of the selected elements. (The extra triple quotes turn each dict value into a plain string literal, so exec just evaluates a string expression and never assigns x.)
I was able to achieve what I intended to do with the following code:
selectorDict = {
    'marketbeat': 'x = soup.findAll("div", {"id": "cphPrimaryContent_tabAnalystRatings"})[0].find("table")\nprint(x)',
    'wsj': 'x = soup.select(".at8-col4 > .zonedModule")[0]\nprint(x)'
}

for i in range(len(siteUrlArray)):
    siteUrl = siteUrlArray[i]
    print(siteUrl)
    parsedUrl = tldextract.extract(siteUrl)
    domain = parsedUrl.domain
    r = requests.get(siteUrl, headers=header, verify=False)
    soup = BeautifulSoup(r.text, parser)
    selector = selectorDict.get(domain)
    try:
        exec(selector)
    except Exception as e:
        print(str(e))
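That said, exec isn't strictly needed here. A dict of small selector functions does the same job and is easier to debug; the following is only a sketch (not code from the original post) that reuses the imports, header, parser and siteUrlArray defined above:

selectorFuncs = {
    'marketbeat': lambda soup: soup.findAll("div", {"id": "cphPrimaryContent_tabAnalystRatings"})[0].find("table"),
    'wsj': lambda soup: soup.select(".at8-col4 > .zonedModule")[0],
}

for siteUrl in siteUrlArray:
    domain = tldextract.extract(siteUrl).domain
    r = requests.get(siteUrl, headers=header, verify=False)
    soup = BeautifulSoup(r.text, parser)
    func = selectorFuncs.get(domain)
    if func is None:
        continue  # no selector registered for this domain
    try:
        print(func(soup))  # the selected tag, i.e. what x held before
    except Exception as e:
        print(str(e))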
Related
I'm trying to create a script using the requests module (without using a Session) to parse two fields from a webpage, but the script fails miserably. However, when I created another script using a Session, I could fetch the content from that site flawlessly.
Here are the manual steps to reach the content:
Choose the first item from dropdown.
Get the links to the detail page.
Grab these two fields from detail page.
While creating the script using plain requests, I tried to make use of cookies but I ended up getting AttributeError.
Script without session:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = 'https://compranet.hacienda.gob.mx'
link = 'https://compranet.hacienda.gob.mx/web/login.html'
vigen_detail_page = 'https://compranet.hacienda.gob.mx/esop/toolkit/opportunity/current/{}/detail.si'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def grab_first_link_from_dropdown(link):
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    category_link = urljoin(base, soup.select_one('ul.dropdown-menu > li > a:contains("Vigentes")').get("href"))
    return category_link

def fetch_detail_page_link(cat_link):
    res = requests.get(cat_link, headers=headers)
    str_cookie = f"JSESSIONID={res.cookies['JSESSIONID']}"
    soup = BeautifulSoup(res.text, "html.parser")
    for items in soup.select("table.list-table > tbody.list-tbody > tr"):
        target_link = items.select_one("a.detailLink").get("onclick")
        detail_num = re.findall(r"goToDetail\(\'(\d+?)\'", target_link)[0]
        inner_link = vigen_detail_page.format(detail_num)
        yield str_cookie, inner_link

def get_content(str_cookie, inner_link):
    headers['Cookie'] = str_cookie
    res = requests.get(inner_link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    try:
        expediente = soup.select_one(".form_question:contains('Código del Expediente') + .form_answer").get_text(strip=True)
    except AttributeError:
        expediente = ""
    try:
        descripcion = soup.select_one(".form_question:contains('Descripción del Expediente') + .form_answer").get_text(strip=True)
    except AttributeError:
        descripcion = ""
    return expediente, descripcion

if __name__ == '__main__':
    category_link = grab_first_link_from_dropdown(link)
    for cookie, detail_page_link in fetch_detail_page_link(category_link):
        print(get_content(cookie, detail_page_link))
What possible change should I bring about to make the script work?
There's a redirect that occurs on fetch_detail_page_link. Python Requests follows redirects by default. When your script obtains the cookies, it is only grabbing the cookies for the final request in the chain. You must access the history field of the response to see the redirects that were followed. Doing this with a Session object worked because it was preserving those cookies for you.
I must agree with others who have commented that it really would be a good idea to use a Session object for this. However if you insist on not using Session, your script would look like this:
import re
import requests
from requests.cookies import RequestsCookieJar
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = 'https://compranet.hacienda.gob.mx'
link = 'https://compranet.hacienda.gob.mx/web/login.html'
vigen_detail_page = 'https://compranet.hacienda.gob.mx/esop/toolkit/opportunity/current/{}/detail.si'
headers = {
    'User-Agent': "Scraping Your Vigentes 1.0",
}

def grab_first_link_from_dropdown(link):
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    category_link = urljoin(base, soup.select_one('ul.dropdown-menu > li > a:contains("Vigentes")').get("href"))
    return category_link

def fetch_detail_page_link(cat_link):
    res = requests.get(cat_link, headers=headers)
    cookies = RequestsCookieJar()  # create empty cookie jar
    for r in res.history:
        cookies.update(r.cookies)  # merge in cookies from each redirect response
    cookies.update(res.cookies)  # merge in cookies from the final response
    soup = BeautifulSoup(res.text, "html.parser")
    for items in soup.select("table.list-table > tbody.list-tbody > tr"):
        target_link = items.select_one("a.detailLink").get("onclick")
        detail_num = re.findall(r"goToDetail\(\'(\d+?)\'", target_link)[0]
        inner_link = vigen_detail_page.format(detail_num)
        yield cookies, inner_link

def get_content(cookies, inner_link):
    res = requests.get(inner_link, headers=headers, cookies=cookies)
    if not res.ok:
        print("Got bad response %s :(" % res.status_code)
        return "", ""
    soup = BeautifulSoup(res.text, "html.parser")
    try:
        expediente = soup.select_one(".form_question:contains('Código del Expediente') + .form_answer").get_text(strip=True)
    except AttributeError:
        expediente = ""
    try:
        descripcion = soup.select_one(".form_question:contains('Descripción del Expediente') + .form_answer").get_text(strip=True)
    except AttributeError:
        descripcion = ""
    return expediente, descripcion

if __name__ == '__main__':
    category_link = grab_first_link_from_dropdown(link)
    for cookie, detail_page_link in fetch_detail_page_link(category_link):
        print(get_content(cookie, detail_page_link))
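For reference, here is roughly what fetch_detail_page_link shrinks to with a Session, since the Session carries the cookies from every redirect automatically. This is only a sketch following the structure of the code above; get_content would then call session.get(inner_link) instead of passing cookies around:

def fetch_detail_page_link(cat_link):
    session = requests.Session()
    session.headers.update(headers)
    res = session.get(cat_link)  # cookies from each redirect stay in the session
    soup = BeautifulSoup(res.text, "html.parser")
    for items in soup.select("table.list-table > tbody.list-tbody > tr"):
        target_link = items.select_one("a.detailLink").get("onclick")
        detail_num = re.findall(r"goToDetail\(\'(\d+?)\'", target_link)[0]
        yield session, vigen_detail_page.format(detail_num)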
I am using bs4 to write a web scraper to obtain funding news data.
The first part of my code extracts the title, link, summary and date of each article for n pages.
The second part loops through the link column and feeds each resulting URL into a new function, which extracts the URL of the company in question.
For the most part, the code works fine (40 pages scraped without errors). I am trying to stress test it by raising it to 80 pages, but I'm running into KeyError: 'href' and I don't know how to fix this.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from tqdm import tqdm
def clean_data(column):
    df[column] = df[column].str.encode('ascii', 'ignore').str.decode('ascii')

#extract
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    url = f'https://www.uktechnews.info/category/investment-round/series-a/page/{page}/'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

#transform
def transform(soup):
    for item in soup.find_all('div', class_='post-block-style'):
        title = item.find('h3', {'class': 'post-title'}).text.replace('\n', '')
        link = item.find('a')['href']
        summary = item.find('p').text
        date = item.find('span', {'class': 'post-meta-date'}).text.replace('\n', '')
        news = {
            'title': title,
            'link': link,
            'summary': summary,
            'date': date
        }
        newslist.append(news)
    return
newslist = []
#subpage
def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    r = requests.get(url, headers)
    soup_subpage = BeautifulSoup(r.text, 'html.parser')
    return soup_subpage

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    if len(main_data):
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
        subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
    return
subpage = []
#load
page = np.arange(0, 80, 1).tolist()
for page in tqdm(page):
    try:
        c = extract(page)
        transform(c)
    except:
        None

df1 = pd.DataFrame(newslist)

for url in tqdm(df1['link']):
    t = extract_subpage(url)
    transform_subpage(t)

df2 = pd.DataFrame(subpage)
Here is the error (from the screenshot): KeyError: 'href'.
I think the issue is that my if statement for the transform_subpage function does not account for instances where main_data is not an empty list but does not contain href links. I am relatively new to Python so any help would be much appreciated!
You are correct, it's caused by main_data[0] not having an 'href' attribute at some point. You can try changing the logic to something like:
def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    if len(main_data) and 'href' in main_data[0].attrs:
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
        subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
Also, just a note: it's not a great idea to iterate over a list while reusing the list's variable name for each item. Change it to something like:
page_list = np.arange(0, 80, 1).tolist()
for page in tqdm(page_list):
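As a more compact variation (my own suggestion, not part of the answer above), Tag.get() accepts a default value, so the attribute check can be folded into a single expression:

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    # Tag.get() returns the fallback when the 'href' attribute is missing
    link = main_data[0].get('href', '--') if main_data else '--'
    subpage.append({'subpage_link': link})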
I am writing a parser, but I have a problem. I understand that you can find many similar questions on the Internet, but they did not suit me, so I am asking for help here. I have little experience, so this question may not be very well formed.
Code:
import requests
from bs4 import BeautifulSoup
URL = 'https://stopgame.ru/topgames'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
           'accept': '*/*'}
HOST = 'https://stopgame.ru'

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    pagination = soup.find_all('a', class_='page')
    if pagination:
        return int(pagination[1].get.text())
    else:
        return 1

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_="lent-brief")
    games = []
    for item in items:
        games.append({
            "title": item.find("div", class_="title lent-title").get_text(strip=True),
            "date": item.find("div", class_="game-date").get_text(strip=True),
            "ganre": item.find("div", class_="game-genre").get_text(strip=True),
        })
    print(games)
    print(len(games))
    return games

def parse():
    html = get_html(URL)
    if html.status_code == 200:
        pages_count = get_pages_count(html.text)
        print(pages_count)
    else:
        print('Error')
parse()
Error:
File "D:/Python/parser1.py", line 45, in parse
pages_count = get_pages_count(html.text)
NameError: name 'get_pages_count' is not defined
Your function is named get_pages, but you're calling get_pages_count:
def get_pages(html):
.. but when attempting to call it:
pages_count = get_pages_count(html.text)
.. the call should be:
pages_count = get_pages(html.text)
In the function below, the method you have called is wrong: instead of pagination[1].get.text() it should be pagination[1].get_text() or pagination[1].text.
Code:
def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    pagination = soup.find_all('a', class_='page')
    if pagination:
        return int(pagination[1].get_text())
    else:
        return 1

OR

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    pagination = soup.find_all('a', class_='page')
    if pagination:
        return int(pagination[1].text)
    else:
        return 1
I want to scrape a few pages from the Amazon website (title, URL, ASIN), and I've run into a problem: the script only parses 15 products while the page shows 50. I printed all the HTML to the console and saw that the HTML ends after 15 products, without any errors from the script.
Here is the relevant part of my script:
keyword = "men jeans".replace(' ', '+')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
url = "https://www.amazon.com/s/field-keywords={}".format(keyword)
request = requests.session()
req = request.get(url, headers = headers)
sleep(3)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup)
That's because a few of the items are generated dynamically. There might be a better solution than using selenium; however, as a workaround, you can try the approach below instead.
from selenium import webdriver
from bs4 import BeautifulSoup

def fetch_item(driver, keyword):
    driver.get(url.format(keyword.replace(" ", "+")))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for items in soup.select("[id^='result_']"):
        try:
            name = items.select_one("h2").text
        except AttributeError:
            name = ""
        print(name)

if __name__ == '__main__':
    url = "https://www.amazon.com/s/field-keywords={}"
    driver = webdriver.Chrome()
    try:
        fetch_item(driver, "men jeans")
    finally:
        driver.quit()
Upon running the above script you should get around 56 names as a result.
import requests
from bs4 import BeautifulSoup

for page in range(1, 21):
    keyword = "red car".replace(' ', '+')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
    url = "https://www.amazon.com/s/field-keywords=" + keyword + "?page=" + str(page)
    request = requests.session()
    req = request.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    results = soup.findAll("li", {"class": "s-result-item"})
    for i in results:
        try:
            print(i.find("h2", {"class": "s-access-title"}).text.replace('[SPONSORED]', ''))
            print(i.find("span", {"class": "sx-price-large"}).text.replace("\n", ' '))
            print('*' * 20)
        except:
            pass
Amazon's page range maxes out at 20 here, which is why the loop above crawls pages 1 through 20.
I'm trying to select an element that looks like <li class="result first_result"> using the query soup.find_all("li", {"class" : "first_result"})
The element is definitely on the page but it's not showing up when I run my script. I've also tried soup.find_all("li", {"class" : "result first_result"}) for the record, but still nothing.
What am I doing wrong?
edit: at alecxe's request I've posted the code I have so far. I'm on 64-bit Windows 7 using Python 3.4 which I'm sure is the culprit. The specific part I made this question for is at the very bottom under ###METACRITIC STUFF###
from bs4 import BeautifulSoup
from urllib3 import poolmanager
import csv
import requests
import sys
import os
import codecs
import re
import html5lib
import math
import time
from random import randint
connectBuilder = poolmanager.PoolManager()
inputstring = sys.argv[1] #argv string MUST use double quotes
inputarray = re.split('\s+',inputstring)
##########################KAT STUFF########################
katstring = ""
for item in inputarray: katstring += (item + "+")
katstring=katstring[:-1]
#kataddress = "https://kat.cr/usearch/?q=" + katstring #ALL kat
kataddress = "https://kat.cr/usearch/" + inputstring + " category:tv/?field=seeders&sorder=desc" #JUST TV kat
#print(kataddress)
numSeedsArray = []
numLeechArray = []
r = requests.get(kataddress)
soup = BeautifulSoup(r.content, "html5lib")
totalpages = [h2.find('span') for h2 in soup.findAll('h2')][0].text #get a string that looks like 'house of cards results 1-25 from 178'
totalpages = int(totalpages[-4:]) #slice off everything but the total # of pages
totalpages = math.floor(totalpages/25)
#print("totalpages= "+str(totalpages))
iteration=0
savedpage = ""
def getdata(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    global numSeedsArray
    global numLeechArray
    tds = soup.findAll("td", {"class": "green center"})
    numSeedsArray += [int(td.text) for td in tds]
    tds = soup.findAll("td", {"class": "red lasttd center"})
    numLeechArray += [int(td.text) for td in tds]
    #print(numSeedsArray)
def getnextpage(url):
    global iteration
    global savedpage
    #print("url examined= "+url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    nextpagelinks = soup.findAll("a", {"class": "turnoverButton siteButton bigButton"})
    nextpagelinks = [link.get('href') for link in nextpagelinks]
    #print(nextpagelinks)
    activepage = soup.findAll("a", {"class": "turnoverButton siteButton bigButton active"})
    #print("activepage= " +activepage[0].text)
    currentpagenum = activepage[0].text
    #print("currentpagenum= "+currentpagenum)
    if len(currentpagenum)==1 and iteration>1:
        nextpage = savedpage+str(int(currentpagenum)+1)+str(nextpagelinks[0][-27:])
        #print("nextpage= "+nextpage)
        nextpage = re.sub(r'(%20)', ' ', nextpage)
        nextpage = re.sub(r'(%3A)', ':', nextpage)
        nextpage = "https://kat.cr"+nextpage
        #print(nextpage)
    elif len(currentpagenum)==1 and iteration<=1:
        nextpage = str(nextpagelinks[0][:-28])+str(int(currentpagenum)+1)+str(nextpagelinks[0][-27:])
        savedpage = str(nextpagelinks[0][:-28])
        #print("savedpage= "+savedpage )
        nextpage = re.sub(r'(%20)', ' ', nextpage)
        nextpage = re.sub(r'(%3A)', ':', nextpage)
        nextpage = "https://kat.cr"+nextpage
        #print(nextpage)
    elif len(currentpagenum)==2:
        nextpage = savedpage+str(int(currentpagenum)+1)+str(nextpagelinks[0][-27:])
        #print("nextpage= "+nextpage)
        nextpage = re.sub(r'(%20)', ' ', nextpage)
        nextpage = re.sub(r'(%3A)', ':', nextpage)
        nextpage = "https://kat.cr"+nextpage
        #print(nextpage)
    return nextpage
if totalpages<2:
    while iteration < totalpages-1: #should be totalpages-1 for max accuracy
        getdata(kataddress)
        iteration+=1
        kataddress = getnextpage(kataddress)
else:
    while iteration < 2: #should be totalpages-1 for max accuracy
        getdata(kataddress)
        iteration+=1
        kataddress = getnextpage(kataddress)
# print(str(sum(numSeedsArray)))
# print(str(sum(numLeechArray)))
print(str(sum(numLeechArray)+sum(numSeedsArray)))
def getgoogdata(title):
    title = re.sub(r' ', '+', title)
    url = 'https://www.google.com/search?q=' + title + '&ie=utf-8&oe=utf-8'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    resultnum = soup.find("div", {"id": "resultStats"}).text[:-14]
    s2 = resultnum.replace(',', '')
    resultnum = re.findall(r'\b\d+\b', s2)
    print(resultnum)
getgoogdata(inputstring)
####################METACRITIC STUFF#########################
metainputstring = ""
for item in inputarray:
    metainputstring += item + " "
metainputstring = metainputstring[:-1]
metacriticaddress = "http://www.metacritic.com/search/tv/" + metainputstring + "/results"
print (metacriticaddress)
r = requests.get(metacriticaddress)
soup = BeautifulSoup(r.content, "html5lib")
first_result = soup.find_all("li", attrs={"class" : "first_result"})
# first_result = soup.select("li.result.first_result")
print(first_result)
All other answers are not related to your actual problem.
You need to pretend to be a real browser to be able to see the search results:
r = requests.get(metacriticaddress, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})
Proof (searching for Game of Thrones, of course):
>>> from bs4 import BeautifulSoup
>>>
>>> import requests
>>>
>>> metacriticaddress = "http://www.metacritic.com/search/tv/game%20of%20thrones/results"
>>> r = requests.get(metacriticaddress, headers={
... "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
... })
>>> soup = BeautifulSoup(r.content, "html5lib")
>>> first_result = soup.find_all("li", class_="first_result")
>>>
>>> print(first_result[0].find("h3", class_="product_title").get_text(strip=True))
Game of Thrones
Quoting the documentation:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_
Thus, you need to instead write: soup.find_all("li", class_="first_result").
If you're using a pre-4.1.2 version of BeautifulSoup, or if you're insistent upon passing a dictionary, you need to specify that the dictionary fills the attrs parameter: soup.find_all("li", attrs={"class" : "first_result"}).
Your first attempt (soup.find_all("li", {"class" : "first_result"})) is almost correct, but you need to specify the parameter that your dictionary is being passed to (in this case the parameter name is attrs), and call it like soup.find_all("li", attrs={"class" : "first_result"}).
However, I would suggest doing this with a CSS selector, because you're matching against multiple classes. You can do this using the soup's .select() method like this:
results = soup.select("li.result.first_result")
Be aware that .select() will always return a list, so if there's only one element, don't forget to access it as results[0].
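Putting those pieces together, a minimal sketch (assuming soup was built from the Metacritic response fetched with the browser-like User-Agent header shown earlier) would be:

results = soup.select("li.result.first_result")
if results:  # .select() always returns a list, which may be empty
    first_result = results[0]
    print(first_result.find("h3", class_="product_title").get_text(strip=True))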