How to scrape HTML text using Selenium in Python - python

I'm trying to get the text "General (8)" shown in the HTML below using Selenium WebDriver, but I keep running into issues. Any input is highly appreciated. Thanks.
My code:
test1 = driver.find_element_by_xpath("//input[@id = 'General'][@role = 'presentation']").text
print(test1)
It returns an empty string.
HTML:
<li class="" role="checkbox" aria-checked="false">
<div class="extend_clickable" tabindex="0">
<input id="General" role="presentation" name="General" checked="checked" type="checkbox">
General (8)
<label for="General" role="presentation"></label>
</div>
</li>

An input node is always empty, which means it cannot contain any child nodes (including text nodes). What you want is the text sibling of the input, which you can get as the text content of the parent div:
test1 = driver.find_element_by_xpath('//div[@class="extend_clickable"]').text.strip()

As per the HTML you have provided, to print the text General (8) you have to extract it from the <div class="extend_clickable"> tag, as the text is not within the <input> tag. You can use the following code block, which relies on Python's splitlines() method:
all_text = driver.find_element_by_xpath("//li[@role='checkbox']/div[@class='extend_clickable']").get_attribute("innerHTML")
myText = all_text.splitlines()
print(myText[1])
Console output:
General (8)
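If you ever need only the text node that sits next to the <input> (for example, if the label stops being empty and the div's text picks up extra content), a possible alternative - just a sketch based on the snippet above, not something from either answer - is to ask the browser for the input's next sibling text node with execute_script:

# Sketch: read the text node that immediately follows the <input> element.
# Assumes "General (8)" is the input's next DOM sibling, as in the HTML above.
input_el = driver.find_element_by_xpath("//input[@id='General']")
text = driver.execute_script("return arguments[0].nextSibling.textContent;", input_el).strip()
print(text)  # General (8)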

Related

Using Python and BeautifulSoup to scrape list with variable orders and tags based on text strings

Details: MacOS, Python3, BeautifulSoup4
I am new to Python and even newer to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape HTML pages which do not heavily differentiate their tags by classes or div ids. In other words, I am trying to scrape the middle section of a list. The list will have an unpredictable number of tags and elements (sometimes they use an unordered list, other times a description list), so what I am scraping is fairly unpredictable. However, I do have two known variables: the header string text I want to START at and the header string text I want to END at.
I have assembled the following example html to test this on:
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">First Section Title - Known Variable or String</h3>
</div>
</div>
<div>
<ul class="unstyled">
<li>Item1</li>
<li>Item2</li>
<li>Empty LI Tags Also Exist</li>
</ul>
<dl class="dl-horizontal">
<dt>Title of some description list</dt>
<dd>Another item may exist here</dd>
</dl>
</div>
<div>
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Another Section Title</h3>
</div>
</div>
<ul class="unstyled">
<li>Item1</li>
<li></li>
</ul>
<dl class="dl-horizontal">
<dt>Another Description List Title</dt>
<dd>Another item may exist here</dd>
<dt>And here</dt>
<dd>And Here</dd>
</dl>
</div>
<div>
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Section Title (String) I Wish To Stop At - Known Variable or String</h3>
</div>
</div>
</div>
Again, using the above model, I want to start at the first section I listed and end at the known text string of a particular section towards the bottom.
I have listed my Python script below. So far, the following Python is grabbing the correct information; however, I do not believe it will work under all circumstances, and there is probably a more efficient way to go about this. Here are some of the issues I believe are in my script:
My script is rather static - while it appears to start at the correct header, I have pieced out two sections separately because I do not believe my for loop is working the way it should (I do not think ##Section 2 should be needed if the loop were written correctly).
Because my for loop is likely not doing what I think it is (I'd like it to iterate through the sections), I never had to define the stopping point (the string of text at the section I wish to stop at).
Since I am not convinced the loop is working correctly, I do not believe this will handle any curveballs the site throws at me - for example, a variable number of items in the list, or an additional section being added between the "Beginning section" and "Ending section" I defined.
I believe what needs to happen is:
Libraries need to be imported
Locate first section
Find next sibling
Keep finding siblings and returning text until the stop string matches
Python:
##Scrape
#import beautifulsoup and requests libraries
from bs4 import BeautifulSoup, Tag
import requests

soup = BeautifulSoup(open("mock.html"), "html.parser")  # BeautifulSoup(page.read())

#Begin by grabbing the section
stuff = soup.find_all(class_="panel-heading")

#Search for the first section title text string
next_elem = soup.find(text="First Section Title - Known Variable or String").findNext('li').contents[0]

#Attempt to scan the remainder of the section, starting with the next line item
next_next = next_elem.parent.find_next_sibling()
for item in next_next.findAll('li', 'dt', 'dd'):
    if isinstance(item, Tag):
        print(item.text)
print(next_elem)
print(next_next.text)

##Section 2 - I'd like to cut this out
s2_elem = soup.find(text="Another Section Title").findNext('li').contents[0]
s2_nxnx = s2_elem.parent.find_next_sibling()
s2_nxnxnx = s2_nxnx.parent.find_next_sibling()
print(s2_elem)
print(s2_nxnx.text)
print(s2_nxnxnx.text)
You could use a variable to spot when you are between search_start and search_end:
from bs4 import BeautifulSoup, Tag
import requests

search_start = "First Section Title - Known Variable or String"
search_end = "Section Title (String) I Wish To Stop At - Known Variable or String"

soup = BeautifulSoup(open("mock.html"), "html.parser")

start = False
for el in soup.find_all(['li', 'dt', 'dd', 'h3']):
    if el.name == 'h3':
        if el.text == search_start:
            start = True
        elif el.text == search_end:
            break
    elif start and isinstance(el, Tag):
        print(el.text)
This would give you the following output:
Item1
Item2
Empty LI Tags Also Exist
Title of some description list
Another item may exist here
Item1
Another Description List Title
Another item may exist here
And here
And Here
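As a variant on the same idea - a sketch under the same mock.html assumptions, not a drop-in replacement - you could walk forward from the start heading with find_all_next(), which skips everything before the first section and filters out the empty <li> tags:

from bs4 import BeautifulSoup

search_start = "First Section Title - Known Variable or String"
search_end = "Section Title (String) I Wish To Stop At - Known Variable or String"

soup = BeautifulSoup(open("mock.html"), "html.parser")

# Start from the heading carrying the start string, then walk forward through the document.
start_h3 = soup.find("h3", string=search_start)
for el in start_h3.find_all_next(['li', 'dt', 'dd', 'h3']):
    if el.name == 'h3':
        if el.get_text(strip=True) == search_end:
            break
        continue
    text = el.get_text(strip=True)
    if text:  # skip empty list items
        print(text)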

Python/Selenium web scraping

for link in data_links:
    driver.get(link)
    review_dict = {}
    # get the size of the company
    size = driver.find_element_by_xpath('//*[@id="EmpBasicInfo"]//span')
    #location = ??? need to get this part as well.
My concern:
I am trying to scrape a website. I am using Selenium/Python to scrape the "501 to 1000 employees" and "Biotech & Pharmaceuticals" text from the spans, but I am not able to extract the text element from the website using XPath. I have tried getText, get_attribute, everything. Please help!
This is the output for each iteration: I am not getting the text value.
Thank you in advance!
It seems you want only the text rather than to interact with the element, so one solution is to let BeautifulSoup parse the HTML for you, with Selenium providing the page as built by JavaScript. First get the HTML content with html = driver.page_source, and then you can do something like:
html = '''
<div id="CompanyContainer">
  <div id="EmpBasicInfo">
    <div class="">
      <div class="infoEntity"></div>
      <div class="infoEntity">
        <label>Industry</label>
        <span class="value">Woodcliff</span>
      </div>
      <div class="infoEntity">
        <label>Size</label>
        <span class="value">501 to 1000 employees</span>
      </div>
    </div>
  </div>
</div>
''' # Just a sample, since I don't have the actual page to interact with.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
>>> soup.find("div", {"id":"EmpBasicInfo"}).findAll("div", {"class":"infoEntity"})[2].find("span").text
'501 to 1000 employees'
Or, of course, you can avoid the specific indexing and look for the <label>Size</label> instead, which should be more readable:
>>> [a.span.text for a in soup.findAll("div", {"class":"infoEntity"}) if (a.label and a.label.text == 'Size')]
['501 to 1000 employees']
Using selenium you can do:
>>> driver.find_element_by_xpath("//*[@id='EmpBasicInfo']/div[1]/div/div[3]/span").text
'501 to 1000 employees'
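If you prefer to stay in Selenium but want to avoid the positional div[3] index (which breaks as soon as the order of the info entries changes), a hedged alternative is to anchor on the Size label with a following-sibling axis. This is only a sketch against the sample markup above, not the live page:

# Sketch: locate the value span through its sibling <label>Size</label> instead of by position.
size = driver.find_element_by_xpath(
    "//div[@id='EmpBasicInfo']//label[text()='Size']/following-sibling::span"
).text
print(size)  # 501 to 1000 employees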

capturing states between tags in python using xpath

I want to capture the word WORD and the sentence This is what I want. from markup in the following format:
<div id="message1">
<div class="message2">
<strong>WORD</strong> This is what I want.<br/>
</div>
</div>
What I tried is:
import requests
from lxml import html

cont = session.get('http://mywebsite.com').content
tree = html.fromstring(cont)
word = tree.xpath('//div[@class="message2"]/strong')
sentence = tree.xpath('//div[@class="message2"]/br')
print word
print sentence
Nothing is printed for me!
I find XPath Helper great for solving problems like this one:
word = tree.xpath('//div[@class="message2"]/strong/text()')[0]
sentence = tree.xpath('//div[@class="message2"]/strong/following-sibling::text()[1]')[0]
This is what you want :)
from lxml import html

text = """
<div id="message1">
  <div class="message2">
    <strong>WORD</strong> This is what I want.<br/>
  </div>
</div>
"""
tree = html.fromstring(text)
print(tree.xpath("//div[@class='message2']/strong/following-sibling::text()")[0])
I'm not sure about the specifics of lxml, but if that is the text you're looking for, calling text on the div will not return the text that lives inside child elements such as the <strong> tag.
So in general XPath terms, this is what you're looking for to match only that text:
//*[@class="message2"]/text()
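Note that /text() on the div returns all of its own text nodes, including the whitespace around the <strong> element, so the result usually still needs filtering and stripping. A small sketch using the sample HTML above:

from lxml import html

tree = html.fromstring("""
<div id="message1">
  <div class="message2">
    <strong>WORD</strong> This is what I want.<br/>
  </div>
</div>
""")
# Drop whitespace-only text nodes and strip the remainder.
parts = tree.xpath('//*[@class="message2"]/text()')
sentence = " ".join(p.strip() for p in parts if p.strip())
print(sentence)  # This is what I want.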

Alter XPath to Extract Text Selenium

This HTML block:
<td class="tl-cell tl-popularity" data-tooltip="9,043,725 plays" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 57%"></div>
</div>
</td>
equates to this XPath:
xpath = '//*[#id="album-tracks"]/table/tbody/tr[5]/td[6]'
Trying to extract the text 9,043,725 plays with
find_element_by_xpath(xpath).text
returns an empty string. This text is only generated when a user hovers their mouse over the HTML block.
Is there a way to alter the XPath so that an empty string is not returned but the actual string is returned?
Try using get_attribute instead. The intended element can be located with any of the find_element mechanisms; see the API doc.
element = browser.find_element_by_css_selector('.tl-cell.tl-popularity')
text = element.get_attribute('data-tooltip')
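If you need the play count for every track rather than a single cell - a hypothetical extension, assuming each popularity cell in the #album-tracks table carries the same data-tooltip attribute - you can collect them in one pass:

# Sketch: read the data-tooltip attribute from every popularity cell in the table.
cells = browser.find_elements_by_css_selector('#album-tracks td.tl-popularity')
plays = [cell.get_attribute('data-tooltip') for cell in cells]
print(plays)  # e.g. ['9,043,725 plays', ...]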

retrieve text from the HTML source page using selenium with python

My HTML page looks like this :
<div id="s-resultlist-header" class="s-resultlist-header" style="width: 607px;">
<div class="s-resultlist-hits">
<span>Total hits: 203</span>
</div>
</div>
I want to retrieve the value "Total hits: 203" inside the span into a variable. I need to validate that the total number of results is equal to 203. Language used: Python.
I have tried:
elem = self.browser.find_element_by_xpath(//div[@class='s-resultlist-hits']span
but it gives an error: Expected ). Can anyone correct the syntax?
I have also tried:
print browser.page_source
But it prints the entire page source. Can anyone guide me?
elem = self.browser.find_element_by_xpath("//div[#id='s-resultlist-header']/div[#class='s-resultlist-hits']/span");
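To then validate that the total equals 203 - a small follow-up sketch, assuming the span text keeps the "Total hits: N" format shown above - you can parse the number out of the element's text:

elem = self.browser.find_element_by_xpath(
    "//div[@id='s-resultlist-header']/div[@class='s-resultlist-hits']/span")
# Parse the number out of "Total hits: 203" and compare it to the expected count.
total_hits = int(elem.text.split(":")[1].strip())
assert total_hits == 203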
