I have more or less this structure. How can I select the next element after the title? The starting point must be x or y, since the structure has duplicated classes and the title is the only unique anchor. To clarify: I need to get the content, using the title as the reference.
x = wd.find_elements_by_class_name('title')[0]  # title 0
y = wd.find_elements_by_class_name('title')[1]  # title 1
HTML:
<div class='global'>
<div class="main">
<p class="title">Title0</p>
<p class="content">Content0</p>
</div>
<div class="main">
<p class="title">Title1</p>
<p class="content">Content1</p>
</div>
</div>
If you are using Selenium, try the following CSS selector to get the p tag that immediately follows the element with class title.
driver.find_elements_by_css_selector(".title+p")
To get the content value:
for item in driver.find_elements_by_css_selector(".title+p"):
    print(item.text)
For a specific element:
driver.find_element(By.XPATH, '(//p[@class="title"]/following-sibling::p)[1]')  # Content0
driver.find_element(By.XPATH, '(//p[@class="title"]/following-sibling::p)[2]')  # Content1
For all elements:
for content in driver.find_elements(By.XPATH, '//p[@class="title"]/following-sibling::p'):
    print(content.text)
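The question asks for a lookup that starts from an already-located title (x or y). As a rough offline check of that "next sibling" logic, the sketch below parses the sample markup with the standard library's xml.etree.ElementTree instead of Selenium. That is a swapped-in technique: ElementTree has no following-sibling axis, so the helper next_sibling and the parent_of map are hypothetical names introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; it happens to be well-formed XML.
HTML = """
<div class='global'>
  <div class="main">
    <p class="title">Title0</p>
    <p class="content">Content0</p>
  </div>
  <div class="main">
    <p class="title">Title1</p>
    <p class="content">Content1</p>
  </div>
</div>
"""

root = ET.fromstring(HTML)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def next_sibling(element):
    """Return the element directly after `element`, or None (hypothetical helper)."""
    siblings = list(parent_of[element])
    position = siblings.index(element)
    return siblings[position + 1] if position + 1 < len(siblings) else None


# Start from each title, as the question requires, and step to its neighbour.
titles = root.findall(".//p[@class='title']")
contents = [next_sibling(title).text for title in titles]
print(contents)
```

In Selenium itself, the same relative step from a located title element would be x.find_element(By.XPATH, "./following-sibling::p[1]").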
Here's the document structure:
<div class="search-results-container">
<div>
<div class="feed-shared-update-v2">
<div class="update-components-actor">
<div class="update-components-actor__image">
<img class="presence-entity__image" src="https://www.testimage.com/test.jpg"/>
<span></span>
<span>test</span>
</div>
</div>
</div>
</div>
<div>
<div class="feed-shared-update-v2">
<div class="update-components-actor">
<div class="update-components-actor__image">
<img class="presence-entity__image" src="https://www.testimage.com/test.jpg"/>
<span></span>
<span>test</span>
</div>
</div>
</div>
</div>
</div>
I'm not sure of the best way to do this, but I'm hoping someone can help. I have a for loop that grabs all the divs that precede a div with class "feed-shared-update-v2". This works:
elements = driver.find_elements(By.XPATH, "//*[contains(@class, 'feed-shared-update-v2')]//preceding::div[1]")
I then run a for loop over it:
for card in elements:
However, I'm having trouble targeting the img and the second span inside the loop. I tried:
for card in elements:
    profilePic = card.find_element(By.XPATH, ".//following::div[@class='update-components-actor']//following::img[1]").get_attribute('src')
    text = card.find_element(By.XPATH, ".//following::div[@class='update-components-text']//following::span[2]").text
but this produces an error saying:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//following::div[@class='update-components-actor']//following::img[1]"}
So I'm hoping someone can point me in the right direction as to what I'm doing wrong. I know it's my XPath syntax and that I'm not allowed to chain "following" steps (although even just trying .//following doesn't work, so is ".//" not the right syntax?), but I'm not sure what the right syntax should be, especially since the span does not have a class. :(
Thanks!
I guess you are overusing the following:: axis. Simply try the following (no pun intended):
For your first expression use
//*[contains(@class, 'feed-shared-update-v2')]/..
This will select the parent <div> of the <div class="feed-shared-update-v2">. So you will select the whole surrounding element.
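To retrieve the children you want, use these XPaths: .//img/@src and .//span[2]. In Selenium, find_element must return an element rather than an attribute, so select the img with .//img and read src with get_attribute. Full code is
for card in elements:
    profilePic = card.find_element(By.XPATH, ".//img").get_attribute('src')
    text = card.find_element(By.XPATH, ".//span[2]").text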
That's all. Hope it helps.
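To see why selecting the parent and then searching inside it works, here is a small offline sketch against the sample markup. It uses the standard library's xml.etree.ElementTree rather than Selenium (a swapped-in tool, chosen so the snippet runs without a browser). ElementTree has no contains(), so the class is matched exactly here; the names CARD, cards and results are illustrative.

```python
import xml.etree.ElementTree as ET

# One card from the question's markup; the page repeats it twice.
CARD = """
<div>
  <div class="feed-shared-update-v2">
    <div class="update-components-actor">
      <div class="update-components-actor__image">
        <img class="presence-entity__image" src="https://www.testimage.com/test.jpg"/>
        <span></span>
        <span>test</span>
      </div>
    </div>
  </div>
</div>
"""
HTML = f'<div class="search-results-container">{CARD}{CARD}</div>'

root = ET.fromstring(HTML)

# "/.." steps from each feed div up to its parent, like the XPath in the answer.
cards = root.findall(".//div[@class='feed-shared-update-v2']/..")

results = []
for card in cards:
    # Both lookups are relative to the card, so each card yields its own values.
    src = card.find(".//img").get("src")
    text = card.find(".//span[2]").text
    results.append((src, text))

print(results)
```

Each lookup starts at the card and only walks down into it, which is why no following:: axis is needed.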
There does not seem to be any div with the class update-components-text in your HTML.
Did you mean update-components-actor?
I'm not much of a fan of XPath, but when I copied your HTML and img selector, it found 2 img elements for me. Maybe you are not waiting for the element to load, and that is why it fails?
Try using implicit/explicit waits in your code.
I know you are using XPath, but consider using CSS.
This might do the trick:
.feed-shared-update-v2 span:nth-of-type(2)
And if you want a css of the img:
.feed-shared-update-v2 img
HTML:
<div id="related">
<a class="123" href="url">
<h3 class="456">
<span id="id00" aria-label="TEXT HERE">
</span>
</h3>
</a>
<a class="123" href="url">
<h3 class="456">
<span id="id00" aria-label="NOT HERE">
</span>
</h3>
</a>
</div>
I'm trying to find and click the <a> element (inside the div with id="related") that has class="123" and whose span's aria-label contains "TEXT".
items = driver.find_elements(By.XPATH, "//div[@id='related']//a[@class='123'][contains(@href, 'url')]//span[contains(@aria-label, 'TEXT']")
But it's not returning the a element (the link); it only returns the span.
Then I want to do:
items[3].click()
How can I do that?
Your XPath has some typos.
Try this:
items = driver.find_elements(By.XPATH, "//div[@id='related']//a[@class='123'][contains(@href,'watch?v=')]//span[contains(@aria-label,'TEXT')]")
This will give you the span element inside the presented block.
To locate the a element you should use another XPath.
UPD
Finding all the a elements inside the div with @id='related' that contain a span with a specific aria-label attribute translates directly to XPath like this:
items = driver.find_elements(By.XPATH, "//div[@id='related']//a[@class='123' and .//span[contains(@aria-label,'TEXT')]]")
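The difference between "select the span" and "select the a that contains such a span" can be checked offline. The sketch below is an approximation using the standard library's xml.etree.ElementTree, not Selenium; since ElementTree lacks contains(), the aria-label test is done in Python. The function name links_with_label is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (the duplicated id is kept as posted).
HTML = """
<div id="related">
  <a class="123" href="url">
    <h3 class="456">
      <span id="id00" aria-label="TEXT HERE">
      </span>
    </h3>
  </a>
  <a class="123" href="url">
    <h3 class="456">
      <span id="id00" aria-label="NOT HERE">
      </span>
    </h3>
  </a>
</div>
"""

root = ET.fromstring(HTML)


def links_with_label(container, needle):
    """Return <a class="123"> elements that contain a span whose aria-label includes `needle`."""
    matches = []
    for link in container.findall(".//a[@class='123']"):
        labels = (span.get("aria-label", "") for span in link.iter("span"))
        if any(needle in label for label in labels):
            matches.append(link)
    return matches


items = links_with_label(root, "TEXT")
print([item.get("href") for item in items])
```

With this sample only one link qualifies, so an index such as items[3] only makes sense on a page with at least four matches.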
I'm using Selenium and I need to get the href from a link that wraps many nested tags.
The only information I can rely on is the text "Test text!" from the h3 tag.
Here is the example:
<a href="/link/post" class="link" >
<div class="inner">
<div class="header flex">
<h3 class="mb-0">
Test text!
</h3>
</div>
</div>
</a>
Try using the following xpath to locate the desired element:
//a[@href and .//h3[contains(text(),'Test text!')]]
So, to get the href value, use:
from selenium.webdriver.common.by import By
href = driver.find_element(By.XPATH, "//a[@href and .//h3[contains(text(),'Test text!')]]").get_attribute('href')
An alternative to the approach in Prophet's answer would be to use an XPath like
//h3[contains(text(),"Test text!")]//ancestor::a
i.e. first search for the h3 tag and then for an a tag above.
Prophet's answer uses the opposite approach, first find all a tags and then only keep the one with the correct h3 tag below.
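The "find the h3 first, then climb to the a" direction can be sketched offline as follows. This uses the standard library's xml.etree.ElementTree in place of Selenium, and because ElementTree has no ancestor axis, the climb is done with a child-to-parent map. The names parent_of and href_of_enclosing_link are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question.
HTML = """
<a href="/link/post" class="link">
  <div class="inner">
    <div class="header flex">
      <h3 class="mb-0">
        Test text!
      </h3>
    </div>
  </div>
</a>
"""

root = ET.fromstring(HTML)

# Child -> parent map, since ElementTree elements do not know their parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def href_of_enclosing_link(element):
    """Walk upwards from `element` to the nearest <a> and return its href, or None."""
    node = element
    while node is not None and node.tag != "a":
        node = parent_of.get(node)
    return node.get("href") if node is not None else None


# Locate the heading by its text (substring match, like contains() in the XPath).
heading = next(h3 for h3 in root.iter("h3") if "Test text!" in (h3.text or ""))
href = href_of_enclosing_link(heading)
print(href)
```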
EDIT:
So I found a way to do it by clicking on the Countries elements, see my answer.
Still have one question that would make this better:
When I execute the scrollIntoView(true) on a country <li> it goes under another element (<div class="sportList_subtitle">Desportos</div>) and is not clickable.
Is there some javascript or selenium function like "scrollIntoClickable"?
ORIGINAL:
I'm trying to scrape info from Betclic website with python and BeautifulSoup + Selenium.
Given the URL for each game has the structure: "domain"/"sports_url"/"competition_url"/"match_url"
Example: https://www.betclic.pt/futebol-s1/liga-dos-campeoes-c8/rennes-chelsea-m2695669
You can try it in your language; they translate the actual URL string, but the structure and IDs are the same.
The only thing left is grabbing all the different "competition_url" values.
So my question now is: starting from the "sports_url" (https://www.betclic.pt/futebol-s1), how can I get all the "competition_url" values under it?
The problem is with the "hidden" URLs under each country's name on the left panel. Those only appear after you click the arrow next to each country's name, like a drop-down list. The click event adds the class "is-active" to the <li> for that country and also appends a <ul> at the end of that <li>. It's this added <ul> that has the list of URLs I'm trying to get.
Code before click:
<!---->
<li class="sportList_item has-children ng-star-inserted" routerlinkactive="active-link" id="rziat-DE">
<div class="sportList_itemWrapper prebootFreeze">
<div class="sportlist_icon flagsIconBg is-DE"></div>
<div class="sportlist_name">Alemanha</div>
</div>
<!---->
</li>
Code after click (reduced for presentation):
<li class="sportList_item has-children ng-star-inserted is-active" routerlinkactive="active-link" id="rziat-DE">
<div class="sportList_itemWrapper prebootFreeze">
<div class="sportlist_icon flagsIconBg is-DE"></div>
<div class="sportlist_name">Alemanha</div>
</div>
<!---->
<ul class="sportList_listLv2 ng-star-inserted">
<!---->
<li class="sportList_item ng-star-inserted" routerlinkactive="active-link">
<a class="sportList_itemWrapper prebootFreeze" id="competition-link-5" href="/futebol-s1/alemanha-bundesliga-c5">
<div class="sportlist_icon"></div>
<div class="sportlist_name">Alemanha - Bundesliga</div>
</a>
</li>(...)
</li>(...)
</li>(...)
</li>
</ul>
</li>
In this example, it is "/futebol-s1/alemanha-bundesliga-c5" that I'm looking for.
Is there a way to get all those URLs? Or the "hidden" <ul>, for that matter?
Maybe a way to simulate the click and parse the HTML code again?
Thanks in advance!
So I found a way to do it by clicking on the Countries elements.
Still have one question that would make this better:
When I execute the scrollIntoView(true) on a country <li> it goes under another element (<div class="sportList_subtitle">Desportos</div>) and is not clickable.
Is there some javascript or selenium function like "scrollIntoClickable"?
How I'm doing it now:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Note: the find_element_by_* helpers below are the Selenium 3 style locators.
driver = webdriver.Chrome(ChromeDriverManager().install())
url = "https://www.betclic.pt/"
driver.get(url)
link_set = set()
# Top-level sports list in the left menu.
all_sports = driver.find_element_by_css_selector(
    ("body > app-desktop > div.layout > div > app-left-menu > div >"
     " app-sports-nav-bar > div > div:nth-child(2) > ul")
).find_elements_by_tag_name("li")
# Dismiss the cookie banner if it is present.
try:
    cookies = driver.find_element_by_css_selector("body > app-desktop > bc-gb-cookie-banner > div > div > button")
    cookies.click()
except Exception:
    print("Cookie error or not found...")
for sport in all_sports:
    sport.click()
    has_container = driver.find_element_by_tag_name("app-block-ext").size.get('height') > 0
    if not has_container:
        # No country list: competition links are directly visible.
        for competition in driver.find_elements_by_css_selector("a[id*='block-link-']"):
            link_set.add(competition.get_attribute("href"))
            driver.execute_script("arguments[0].scrollIntoView(true);", competition)
    else:
        driver.execute_script("arguments[0].scrollIntoView(true);", driver.find_element_by_tag_name("app-block-ext"))
        all_countries = driver.find_elements_by_css_selector("li[id^='rziat']")
        for country in all_countries:
            # Clicking a country reveals its competition links.
            country.click()
            competitions = driver.find_elements_by_css_selector("a[id^='competition-link']")
            for element in competitions:
                link_set.add(element.get_attribute("href"))
            driver.execute_script("arguments[0].scrollIntoView(true);", country)
for link in sorted(link_set):
    print(link)
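On the open question about an element ending up under the "Desportos" subtitle after scrollIntoView(true): one commonly used workaround is to pass an options object to the DOM scrollIntoView call so the element lands in the middle of the viewport instead of at the top edge. The sketch below is only a guess at what might help here; scroll_to_center is a hypothetical helper name, and whether centering clears the overlapping element on this particular site is untested.

```python
def scroll_to_center(driver, element):
    """Scroll `element` to the vertical middle of the viewport.

    Assumes `driver` is a Selenium WebDriver and `element` a WebElement;
    only driver.execute_script is used, so nothing is imported here.
    """
    driver.execute_script(
        # block: 'center' keeps the element away from the top edge, where a
        # fixed or sticky header is most likely to cover it.
        "arguments[0].scrollIntoView({block: 'center', inline: 'nearest'});",
        element,
    )
```

In the loop above this would replace the scrollIntoView(true) call, for example scroll_to_center(driver, country) before country.click().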
I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
# Name the parser explicitly so the result does not depend on what is installed.
soup = bs4.BeautifulSoup(html, "html.parser")
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (line 5 and 14)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I am mentioning that because while it grabs the right information, I am not sure that this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
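or, using regular expressions (this needs import re):
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index (find_all returns descendants in document order, so the first wrapper div is index 0 and the hour is index 3):
y.find_all('div')[3].text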
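On the "navigate the tree by path" part of the question: BeautifulSoup itself has no XPath support, but the idea can be illustrated on the extracted fragment with the standard library's xml.etree.ElementTree. This is a swapped-in tool and only works here because the fragment shown is well-formed; it would likely fail on the full, messy page. The names FRAGMENT, hour and days are illustrative.

```python
import xml.etree.ElementTree as ET

# First alarm fragment from the question's output.
FRAGMENT = """
<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
  <div>
    <div class="ue-alarm-edit ue-link">Réveil 1: </div>
    <div>allumé</div>
    <div>7:00</div>
  </div>
  <div>
    <div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
    <div class="ue-alarm-delete ue-link">Supprimer</div>
  </div>
</div>
"""

alarm = ET.fromstring(FRAGMENT)

# Path navigation: first child div, then its third child div (positions are 1-based).
hour = alarm.find("./div[1]/div[3]").text

# The days can be reached by class inside the second child div.
days = alarm.find("./div[2]/div[@class='ue-alarm-dow']").text.strip()

print(hour, "|", days)
```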