I use soup = BeautifulSoup(driver.page_source) to parse the whole page from Selenium in BeautifulSoup.
But how can I parse just one Selenium element with BeautifulSoup?
The code below throws
TypeError: object of type 'FirefoxWebElement' has no len()
element = driver.find_element_by_id(id_name)
soup = BeautifulSoup(element)
I don't know if Selenium supports this out of the box, but I found this workaround:
element_html = f"<{element.tag_name}>{element.get_attribute('innerHTML')}</{element.tag_name}>"
You may want to replace innerHTML with innerText if you only want the text. For example, given
<li>Hi <span> man </span> </li>
innerHTML returns everything inside the element, including the <span> markup, while innerText returns only the text. Try it and see.
Now create your soup object:
soup = BeautifulSoup(element_html, 'lxml')
print(soup.WHATEVER)
Using the technique above, you can create a method parseElement(webElement) and use it whenever you want to parse an element.
By the way, I only use the lxml parser; when I forgot to specify it, the script didn't work.
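Here is a minimal sketch of such a helper (a snake_case parse_element, per the suggestion above; driver and id_name are reused from the question):
from bs4 import BeautifulSoup

def parse_element(web_element):
    # Rebuild the element's HTML from its tag name and inner HTML,
    # then hand the string to BeautifulSoup (lxml parser, per the note above).
    tag = web_element.tag_name
    element_html = f"<{tag}>{web_element.get_attribute('innerHTML')}</{tag}>"
    return BeautifulSoup(element_html, 'lxml')

element = driver.find_element_by_id(id_name)
soup = parse_element(element)

Note that this rebuilds the outer tag without its attributes; get_attribute('outerHTML') would give you the element's full markup, attributes included.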
I'm web scraping a site with beautiful soup that has class names like the following:
<a class="Component-headline-0-2-109" data-key="card-headline" href="/article/politics-senate-elections-legislation-coronavirus-pandemic-bills-f100b3a3b4498a75d6ce522dc09056b0">
The primary issue is that the class name always starts with Component-headline but ends with a random number. When I use Beautiful Soup's soup.find_all('class', 'Component-headline'), it isn't able to grab anything because of the unique number. Is it possible to use find_all to grab all the classes that just start with "Component-headline"?
I was also thinking of using data-key="card-headline" with soup.find_all('data-key', 'card-headline'), but for some reason that didn't work either, so I assume I can't search by data-key, but I'm not sure. Any suggestions?
BeautifulSoup supports regex, so you can use re.compile to match partial text in the class attribute:
import re
soup.find_all('a', class_=re.compile('Component-headline'))
You can also use lambda
soup.find_all('a', class_=lambda c: c and c.startswith('Component-headline'))
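A small self-contained example of both approaches (the HTML snippet here is invented for illustration):
import re
from bs4 import BeautifulSoup

html = '''
<a class="Component-headline-0-2-109" data-key="card-headline" href="/article/one">One</a>
<a class="Component-headline-0-2-110" data-key="card-headline" href="/article/two">Two</a>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a', class_=re.compile('Component-headline')))
print(soup.find_all('a', class_=lambda c: c and c.startswith('Component-headline')))
# The data-key attribute can also be matched, via the attrs dict:
print(soup.find_all('a', attrs={'data-key': 'card-headline'}))

The `c and` guard in the lambda matters because tags without a class attribute are passed in as None.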
Try using an [attribute^=value] CSS Selector.
To use a CSS Selector, instead of the find_all() method, use select().
The following selects all elements whose class attribute starts with Component-headline:
soup = BeautifulSoup(html, "html.parser")
print(soup.select('[class^="Component-headline"]'))
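Keep in mind that [class^="Component-headline"] tests the class attribute as a single string, so it only matches when Component-headline-... is the first class listed. If other classes may come first, a substring match with [class*=...] is more forgiving:
print(soup.select('[class*="Component-headline"]'))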
I want to scrape the URLs within the HTML of the 'Racing-Next to Go' section of www.tab.com.au.
Here is an excerpt of the HTML:
<a ng-href="/racing/2020-07-31/MACKAY/MAC/R/8" href="/racing/2020-07-31/MACKAY/MAC/R/8"><i ng-
All I want to scrape is the last bit of that HTML which is a link, so:
/racing/2020-07-31/MACKAY/MAC/R/8
I have tried to find the element by using xpath, but I can't get the URL I need.
My code:
driver = webdriver.Firefox(executable_path=r"C:\Users\Harrison Pollock\Downloads\Python\geckodriver-v0.27.0-win64\geckodriver.exe")
driver.get('https://www.tab.com.au/')
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.text)
Probably you want to use get_attribute instead of .text (see the Selenium documentation).
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.get_attribute("href"))
Yes, in the legacy Selenium RC API you can use the getAttribute(attributeLocator) function for this:
selenium.getAttribute("//the/xpath@href")
Specify the XPath of the element whose attribute you want, followed by @ and the attribute name.
The value /racing/2020-07-31/MACKAY/MAC/R/8 within the HTML is the value of the href attribute, not the innerText.
Solution
Instead of the text attribute you need to use get_attribute("href"), and the effective lines of code will be:
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.get_attribute("href"))
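Note that WebDriver resolves href to an absolute URL, so the code above prints something like https://www.tab.com.au/racing/2020-07-31/MACKAY/MAC/R/8. If you only want the path part, one option is urllib.parse; a sketch:
from urllib.parse import urlparse

for e in elements:
    print(urlparse(e.get_attribute("href")).path)  # /racing/2020-07-31/MACKAY/MAC/R/8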
I am trying to scrape information from a website using a CSS Selector in order to get a specific text element but have come across a problem. I try to search for my desired portion of the website but my program is telling me that it does not exist. My program returns an empty list.
I am using the requests and lxml libraries and am using CSS Selectors to do my HTML Scraping. I have Python 3.7. I try searching for the part of the website that I need with a selector and it is not appearing. I have also tried using XPath but that has failed as well. I have tried using the following selector:
div#showtimes
When I use this selector, I get the following result:
[<Element div at 0x3bf6f60>]
This is the expected result: the desired element. But when I try to go one step further and access the element nested inside the div#showtimes element (see below), I get an empty list.
div#showtimes div
I get the following result:
[]
Through inspection of the website's HTML, I know that there is a nested element within the div#showtimes element. This problem has occurred on other web pages as well. I am using the code below.
import requests
from lxml import html
from lxml.cssselect import CSSSelector
# Set URL
url = "http://www.fridleytheatres.com/location/7425/Paramount-7-Theatres-
Showtimes"
# Get HTML from page
page = requests.get(url)
data = html.fromstring(page.text)
# Set up CSSSelector
sel = CSSSelector('div#showtimes div')
# Apply Selector
results = sel(data)
print(results)
I expect the output to be a list containing a <div> element, but it is returning an empty list [].
If I understand the problem correctly, you're attempting to get a div element which is a child of div#showtimes. Try using div#showtimes > div.
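Applied to the script above, only the selector changes (a sketch):
sel = CSSSelector('div#showtimes > div')
results = sel(data)
print(results)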
Hi, I have the following in Python:
# Searching for the company
org = soup.find(text="Microsoft")
# Finding the <a> tag which contains the href
# {<a data-deptmodal="true" href="https://someURL BASED ON COMPANY NAME">TEXT BASED ON COMPANY NAME</a>}
button = org.find_previous('a')
driver.find_element_by_tag_name(button).click()
and I get an error like
TypeError: Object of type 'Tag' is not JSON serializable
How do I make the webdriver click on my href after I get the soup?
Please note that my href changes every time I change the company name.
To add to the existing comment: BeautifulSoup is an HTML parser. It helps you extract data from the HTML; it does not interact with the page in any manner. It cannot, for instance, click the link.
If you need to click the link in the browser, do it via selenium. In your case the .find_element_by_link_text() (or .find_element_by_partial_link_text()) locator fits the problem really well:
driver.find_element_by_link_text("Microsoft")
Documentation reference: Locating Hyperlinks by Link Text.
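Since the soup already holds the <a> tag (the button variable from the question), you can feed its text straight into the Selenium locator. A minimal sketch:
link_text = button.get_text(strip=True)
driver.find_element_by_link_text(link_text).click()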
I am currently learning the Python specialization on Coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract the URL at a position given by user input, open that subsequent link (links are identified through their anchor tags), and repeat this for some number of iterations.
While I was able to program this using lists, I am wondering if there is any simpler way of doing it without using lists or dictionaries?
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList = list()
loc = ''
count = 0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the code above, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList to locate the position of the link. As this is inefficient, I would like to know if I could locate the URL directly without using a list. Thanks in advance!
The easiest way is to get an element out of the tags list and then extract the href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query the :nth-of-type pseudo-class:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
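A quick self-contained check of both approaches (the HTML snippet is invented; pos is 1-based):
from bs4 import BeautifulSoup

html = '<p><a href="/one">1</a> <a href="/two">2</a> <a href="/three">3</a></p>'
soup = BeautifulSoup(html, 'html.parser')
pos = 2

tags = soup('a')
print(tags[pos - 1].get('href', None))                           # /two
print(soup.select_one('a:nth-of-type({})'.format(pos))['href'])  # /two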
And if you're looking for "the most efficient way", then you need to look at lxml:
import requests
from lxml.html import fromstring

r = requests.get(url)  # assuming r is a requests response for the page
tree = fromstring(r.content)
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]