I'm trying to get a specific portion of text from this web page, using code I found in a similar post:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://www.baseball-reference.com/players/k/kershcl01.shtml')
# Parsing the page
tree = html.fromstring(page.content)
# Get element using XPath
share = tree.xpath(
    '//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)
Output is just empty brackets []
You are getting empty results because the div element you are querying is commented out in the page's source. Note that requests.get returns the page's HTML source, not the rendered HTML that the browser builds as you interact with the page and that you can inspect with the browser's developer tools.
So I would say: check again if this is really the element you see rendered on the page and see if you can identify what kind of interaction makes it rendered. Then you can use a tool to mock this interaction so that you can get the rendered HTML code within your Python environment. I would suggest helium for doing so. If this is not the right element, you can simply update the specified XPath to get the right source-code available element and successfully scrape it.
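To see why the query comes back empty, here is a minimal sketch that needs no network access. It uses a made-up HTML snippet (not the real page) where the div sits inside an HTML comment, as described above. SAMPLE_HTML, find_commented_block, and the href value are all hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the page source: the div we want is inside a comment.
SAMPLE_HTML = """
<div id="all_leaderboard">
<!--
<div id="leaderboard_cyyoung">
  <table><tr><td><a href="#">4.58 Career Shares</a></td></tr></table>
</div>
-->
</div>
"""


def find_commented_block(page_html, marker):
    """Return the first HTML comment containing `marker`, parsed as its own soup, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if marker in comment:
            # The comment body is plain text to the outer parser, so parse it again.
            return BeautifulSoup(str(comment), "html.parser")
    return None


# Querying the outer document finds nothing: commented-out markup is not part of the tree.
outer = BeautifulSoup(SAMPLE_HTML, "html.parser")
direct_hits = outer.select("div#leaderboard_cyyoung a")

# Parsing the comment text separately makes the same selector match.
inner = find_commented_block(SAMPLE_HTML, "leaderboard_cyyoung")
comment_hits = [a.get_text() for a in inner.select("div#leaderboard_cyyoung a")]

print(direct_hits)
print(comment_hits)
```

The same idea applies to lxml: the parser treats the comment as opaque text, so the comment body has to be parsed as a second document before any XPath can reach into it.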
As stated, this is a rendered/dynamic part of the site. The markup is there inside an HTML comment, so you'll need to pull the comments out of the HTML and then parse them. The other issue is that, inside the comment, there is no <tbody> tag, so an XPath that includes it won't find anything; you'd need to remove that step. I'm not sure what you want to pull out, though (is it the link, or the text?). I altered your code to show you how to do it with lxml, but honestly I'm not a fan. I'd prefer to just use BeautifulSoup. BeautifulSoup, however, doesn't integrate with XPath, so I used a CSS selector instead.
Your code altered:
import requests
from lxml import html
from bs4 import BeautifulSoup, Comment
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        htmlStr = str(each)
# Parsing the page
tree = html.fromstring(htmlStr)
# Get element using XPath
share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tr[11]/td/a')
print(share)
How I would do it:
import requests
from bs4 import BeautifulSoup, Comment
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        soup = BeautifulSoup(str(each), 'html.parser')
        share = soup.select('div#leaderboard_cyyoung > table > tr:nth-child(12) > td > a')
        print(share)
        break
Output:
[4.58 Career Shares]
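The <tbody> point is easy to check on a tiny example. This is only a sketch with a made-up table (RAW_TABLE is a hypothetical name). It assumes the 'html.parser' backend, which keeps the markup as written; browsers, and the html5lib parser, insert <tbody> for you, which is why a path copied from the developer tools can fail against the raw source.

```python
from bs4 import BeautifulSoup

# Made-up table written the way the raw source has it: no <tbody>.
RAW_TABLE = "<table><tr><td>row 1</td></tr><tr><td>row 2</td></tr></table>"

soup = BeautifulSoup(RAW_TABLE, "html.parser")

# A path copied from the browser's DOM includes tbody and matches nothing here.
with_tbody = soup.select("table > tbody > tr")

# Dropping tbody matches the rows as they appear in the source.
without_tbody = soup.select("table > tr")

# A descendant selector works whether or not tbody is present.
either_way = soup.select("table tr")

print(len(with_tbody), len(without_tbody), len(either_way))
```

Using the descendant form ("table tr") is one way to avoid caring whether the parser or the source supplies <tbody>.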
While scraping the following website (https://www.middletownk12.org/Page/4113), this code could not locate the table rows (which hold the staff name, email, and department), even though they are visible when I use the Chrome developer tools. The soup object is not readable enough to locate the tr tags that have the info I need.
import requests
from bs4 import BeautifulSoup
url = "https://www.middletownk12.org/Page/4113"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(response.text)
I tried different libraries such as bs4, requests, and Selenium, with no luck. I also tried CSS selectors and XPath with Selenium, again with no luck. The tr elements could not be located.
That table of contact information is filled in by Javascript after the page has loaded. The content doesn't exist in the page's HTML and you won't see it using requests.
By using the developer tools available in the browser, we can examine the requests made after the page has loaded. There are a lot of them, but at least in my browser it's obvious the contact information is loaded near the end.
Looking at the request log, I see a request for a spreadsheet from docs.google.com. If we examine that entry, we find that it's a request for:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0
And if we fetch the above link, we get a spreadsheet with the source data for that table.
Actually I used Selenium & then bs4 without any results. The code does not find the 'tr' elements...
Why are you using Selenium? The whole point to this answer is that you don't need to use Selenium if you can figure out the link to retrieve the data -- which we have.
All we need is requests to fetch the data and BeautifulSoup to parse it:
import requests
import bs4
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0'
res = requests.get(url)
res.raise_for_status()
# Name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(f"{link.text}: {link.get('href')}")
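If you want the full rows (name, email, department) rather than just the links, something along these lines might work. It is a sketch run against a made-up table, since I haven't verified the exact column layout of the published sheet; SAMPLE_SHEET_HTML, table_rows, and the sample person are all hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the published sheet's HTML; the real column order is an assumption.
SAMPLE_SHEET_HTML = """
<table>
  <tr><td>Name</td><td>Email</td><td>Department</td></tr>
  <tr><td>Jane Doe</td><td><a href="mailto:jane@example.org">jane@example.org</a></td><td>Science</td></tr>
  <tr><td></td><td></td><td></td></tr>
</table>
"""


def table_rows(sheet_html):
    """Return each non-empty table row as a list of stripped cell texts."""
    soup = BeautifulSoup(sheet_html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if any(cells):  # skip rows where every cell is blank
            rows.append(cells)
    return rows


for row in table_rows(SAMPLE_SHEET_HTML):
    print(row)
```

With the real data you would pass res.text from the request above instead of the sample string.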
I've recently started looking into purchasing some land, and I'm writing a little app to help me organize details in Jira/Confluence to help me keep track of who I've talked to and what I talked to them about in regards to each parcel of land individually.
So, I wrote this little scraper for landwatch(dot)com:
[url is just a listing on the website]
from bs4 import BeautifulSoup
import requests
def get_property_data(url):
    headers = ({'User-Agent':
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)
    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'
    return [title, detail, price]
Everything works great except that the class d19de has a ton of values hidden behind the Read More button.
While Googling away at this, I discovered "How to Scrape reviews with read more from Webpages using BeautifulSoup". However, either I don't understand what they're doing well enough to implement it, or it just doesn't work anymore:
import requests ; from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any thoughts on how to bypass that Read More button? I'm really hoping to do this in BeautifulSoup, and not selenium.
Here's an example URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
That data is present within a script tag. Here is an example of extracting that content, parsing with json, and outputting land description info as a list:
from bs4 import BeautifulSoup
import requests, json
url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers) # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
details = all_data['description'].split('\r\r')
You may wish to examine what else is in that script tag:
from pprint import pprint
pprint(all_data)
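For anyone who wants to try the idea without hitting the site, here is a self-contained sketch of the same technique on a made-up page. SAMPLE_LISTING_HTML and description_paragraphs are hypothetical names, and the sketch assumes the JSON-LD payload is a single object with a "description" key whose paragraphs are separated by '\r\r', as in the answer above.

```python
import json

from bs4 import BeautifulSoup

# Made-up page: the visible <p> is truncated, the full text lives in the JSON-LD script.
SAMPLE_LISTING_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example parcel",
 "description": "40 acres.\\r\\rSeasonal creek.\\r\\rCounty road access."}
</script>
</head><body><p class="d19de">40 acres.</p></body></html>
"""


def description_paragraphs(page_html):
    """Return the JSON-LD description split into paragraphs, or [] if it is missing."""
    soup = BeautifulSoup(page_html, "html.parser")
    script = soup.select_one('script[type="application/ld+json"]')
    if script is None or not script.string:
        return []
    data = json.loads(script.string)
    text = data.get("description", "")
    # Paragraphs are assumed to be separated by a pair of carriage returns.
    return [part.strip() for part in text.split("\r\r") if part.strip()]


print(description_paragraphs(SAMPLE_LISTING_HTML))
```

Because the text comes from the embedded JSON rather than the rendered paragraphs, there is no Read More button to get past.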
I am working on a web scraper using html requests and Beautiful Soup (I am new to this). For multiple webpages, e.g. (https://www.selfridges.com/GB/en/cat/hermes-rose-herms-silky-blush-6g_R03752945/?previewAttribute=32%20Rose%20Pommette), I am trying to grab the image link, which is always in the same place on each page. The HTML is:
<img class="c-image-gallery__img" src="//images.selfridges.com/is/image/selfridges/R03752945_32ROSEPOMMETTE_M?$PDP_M_ZOOM$" loading="lazy">
I have tried to use the CSS selector:
r = scraper.get(link)
soup = BeautifulSoup(r.content, 'lxml')
imagelink = soup.select('body > section > section.c-product-hero.--multiple-product-shot > div.c-product-hero__product-shots.c-image-gallery > div > picture:nth-child(1) > img')
which returns None
or find_all:
soup.find_all('img')
But the specific link is not in the list. I am unsure why this is. Any help would be appreciated
This page you are trying to scrape uses Cloudflare and it has some kind of protection from being scraped. The server returns a "403 Forbidden" HTTP status code. Some websites use a lot of javascript and these are also hard to scrape without a javascript capable browser. I would suggest you use a different technology like Puppeteer.
from bs4 import BeautifulSoup
import requests
link = "https://www.selfridges.com/GB/en/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 OPR/75.0.3969.171"}
page = requests.get(link, headers=headers)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, "lxml")
soup_imgs = soup.find_all("img")
for img in soup_imgs:
    print(img)
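Once you do have the rendered HTML (from whatever browser-driven tool you end up using), picking out the image is the easy part. The following is a rough sketch, not tested against the live site: gallery_image_urls and RENDERED_HTML are hypothetical names, and the sample markup is the img tag from the question. It checks the status code first, selects by class instead of a long positional path, and resolves the protocol-relative src.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The <img> tag quoted in the question, wrapped in a <picture> for illustration.
RENDERED_HTML = (
    '<picture><img class="c-image-gallery__img" '
    'src="//images.selfridges.com/is/image/selfridges/R03752945_32ROSEPOMMETTE_M?$PDP_M_ZOOM$" '
    'loading="lazy"></picture>'
)


def gallery_image_urls(status_code, body, page_url):
    """Return absolute gallery image URLs, refusing to parse an error response."""
    if status_code != 200:
        # A 403 body is the block page, not the product page, so parsing it is pointless.
        raise RuntimeError(f"unexpected HTTP status {status_code}; nothing to parse")
    soup = BeautifulSoup(body, "html.parser")
    images = soup.select("img.c-image-gallery__img[src]")
    # src starts with '//', so borrow the scheme from the page URL.
    return [urljoin(page_url, img["src"]) for img in images]


urls = gallery_image_urls(200, RENDERED_HTML, "https://www.selfridges.com/GB/en/")
print(urls)
```

Selecting on the class name is less brittle than the body > section > ... chain, which breaks as soon as the layout shifts.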
import lxml.html
import requests
l1=[]
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get('http://www.naukri.com/jobs-by-location', headers=headers)
html = r.content
root = lxml.html.fromstring(html)
urls = root.xpath('//div[4]/div/div[1]/div/a/@href') #This xpath should give the list of cities(their links)
l1.extend(urls)
This python code is meant to scrape the list of job cities(their 'a href' tags) and store it in list l1. But here I am getting a blank list. The same xpath is working on Chrome console but it's not working in this code. Due to that I added headers to make my code act as a browser but still it's not working..
http://i.stack.imgur.com/Xx1xW.jpg
I tried the same XPath using Selenium WebDriver, and it works there. If it also works on your computer, the problem may lie in one of the libraries you are using.
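from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
browser.get("http://www.naukri.com/jobs-by-location")
# find_elements_by_xpath was removed in Selenium 4; find_elements(By.XPATH, ...) replaces it
links = browser.find_elements(By.XPATH, "//div[4]/div/div[1]/div/a")
for link in links:
    href = link.get_attribute("href")
    print(href)
browser.quit()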
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.######.com/##/")
soup = BeautifulSoup(page)
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print(link)
I am using this code to parse the href attributes of the links on a page, but it gives no output for links inside an iframe. I don't know what is happening there. Can anyone suggest a fix, please?
How about selecting the iframe elements and reading their src attribute? Also, I'd use requests; it is better than urllib2.
from bs4 import BeautifulSoup
import requests
# Same placeholder URL as in the question
url = "http://www.######.com/##/"
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')
# Print the src attribute of every iframe that has one
for iframe in soup.find_all('iframe', src=True):
    print(iframe['src'])
Oh, so you are trying to get all iframes on page? Everything looks ok except you should use src attribute with iframes. If that doesn't help please provide an example page.
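One thing worth spelling out: the links inside an iframe are not part of the outer page at all. The iframe's src points to a separate document that you have to fetch and parse on its own. Here is a small sketch of the two steps on made-up HTML, with no network access; OUTER_HTML, FRAME_HTML, iframe_urls, and http_links are hypothetical names, and the example.com URLs are placeholders.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up outer page and the made-up document its iframe would load.
OUTER_HTML = '<a href="http://example.com/top">top</a><iframe src="/embed/frame.html"></iframe>'
FRAME_HTML = '<a href="http://example.com/inside">inside</a><a name="anchor-without-href">x</a>'


def iframe_urls(page_html, page_url):
    """Return the absolute URL of every iframe on the page that has a src."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [urljoin(page_url, frame["src"]) for frame in soup.find_all("iframe", src=True)]


def http_links(page_html):
    """Return href values that start with http://, skipping <a> tags without an href."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if a["href"].startswith("http://")]


frames = iframe_urls(OUTER_HTML, "http://example.com/page/")
print(frames)                  # each of these would be fetched with requests.get
print(http_links(OUTER_HTML))  # links in the outer page only
print(http_links(FRAME_HTML))  # links found after parsing the iframe's own document
```

Passing href=True to find_all also avoids the KeyError you get from link['href'] when an <a> tag has no href.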