I've tried to target a.nostyle in my code, but it sometimes grabs the email link above instead, because both links share the same tag and class. I can't seem to find any attribute unique to the phone number. How would you go about selecting it?
SEE IMAGE BELOW. Any help would be greatly appreciated.
You can try
a.nostyle:not([itemprop])
UPDATE
Since it seems that BeautifulSoup doesn't support the :not() syntax here (releases before 4.7.0 support only a limited subset of CSS selectors), you can try this workaround:
link = [link for link in soup.select('a.nostyle') if 'itemprop' not in link.attrs][0]
This selects the first link with the required class that has no itemprop attribute (unlike the email link).
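A minimal sketch of the filter above, run against made-up markup. The original page is only shown in an image, so the href values and the itemprop="email" value here are assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: both links share class "nostyle",
# but only the email link carries an itemprop attribute.
html = """
<a class="nostyle" itemprop="email" href="mailto:someone@example.com">someone@example.com</a>
<a class="nostyle" href="tel:5550100">555-0100</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the a.nostyle links that have no itemprop attribute.
candidates = [a for a in soup.select("a.nostyle") if "itemprop" not in a.attrs]

# Guard against an empty list instead of indexing [0] blindly.
phone_link = candidates[0] if candidates else None
phone_text = phone_link.get_text(strip=True) if phone_link is not None else None
print(phone_text)
```

Checking for an empty list avoids the IndexError that a bare [0] would raise on pages where the phone link is missing.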
You can make a list that contains all the "a" tags, then target the required tag by its index.
Example:
allATagContainer = soup.findAll("a")
Then you can use allATagContainer[1] to target the second a tag.
Related
I am using Beautiful Soup to parse elements of an email, and I have been able to extract the links from a button in the email. However, the button's class name appears twice in the email HTML, so two links are extracted and printed. I only need the first link, i.e. the first element with that class name.
This is my code:
soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
'mcnButton' references two HTML buttons within the email, each containing a separate link. I only need the first element with the 'mcnButton' class and the link it contains.
The code above prints two links (again, I only need the first):
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index and separately access the first reference to the class and link. Any assistance would be greatly appreciated, Thanks!
I tried select_one, find, and indexing the class, but each attempt resulted in a syntax error.
To find only the first element matching your pattern use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information check the docs
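As a rough illustration of the difference between iterating over every match and taking only the first one, here is a small sketch. The HTML and the example.com URLs are invented stand-ins for the email body:

```python
from bs4 import BeautifulSoup

# Hypothetical email body with two buttons sharing the class "mcnButton".
html = """
<a class="mcnButton" href="https://example.com/track/first">Shop now</a>
<a class="mcnButton" href="https://example.com/track/second">Read more</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_button = soup.find("a", class_="mcnButton", href=True)
first_href = first_button.get("href") if first_button is not None else None

# select_one() does the same with a CSS selector.
same_href = soup.select_one("a.mcnButton[href]")["href"]

print(first_href)
```

Both calls stop at the first match, so no loop or indexing is needed.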
I want to read the text of this HTML element using Selenium with Python. I just can't find a way to select it without using the text (I don't want to do that because its content changes).
<div font-size="14px" color="text" class="sc-gtsrHT jFEWVt">0.101 ONE</div>
Do you have an idea how I could select it? The conventional ways listed in the documentation don't seem to work for me. To be honest, I'm not very good with HTML, which doesn't make things any easier.
Thank you in advance
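Try this (find_element_by_class_name accepts a single class name, so a CSS selector is used here to match both classes):
element = browser.find_element_by_css_selector('.sc-gtsrHT.jFEWVt').text
Or use a loop if you have several elements:
elements = browser.find_elements_by_css_selector('.sc-gtsrHT.jFEWVt')
for e in elements:
    print(e.text)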
print(browser.find_element_by_xpath("//*[@class='sc-gtsrHT jFEWVt']").text)
You could simply grab it by its class attribute. The element has two class names, so the XPath above compares the whole attribute value; find_element_by_class_name accepts only one class name.
This works as long as the class name isn't dynamic. Otherwise you'd have to right-click and copy the XPath, or find a unique identifier.
Find it by XPath, as long as the font-size and color attributes are consistent. Something like:
//div[@font-size='14px' and @color='text' and starts-with(@class,'sc-')]
I guess the class name is random?
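The Selenium calls above need a live browser, so here is only an offline sketch of the same attribute-based matching using the standard library's ElementTree. The root wrapper and the first div are invented for the example; only the second div comes from the question:

```python
import xml.etree.ElementTree as ET

# The element from the question plus one made-up sibling,
# wrapped in a hypothetical root so the snippet parses as XML.
markup = """
<root>
  <div font-size="12px" color="text" class="sc-abcdef">label</div>
  <div font-size="14px" color="text" class="sc-gtsrHT jFEWVt">0.101 ONE</div>
</root>
"""
root = ET.fromstring(markup)

# ElementTree supports [@attr='value'] predicates but not starts-with(),
# so the class-prefix check is done in Python instead.
matches = [
    div
    for div in root.findall(".//div[@font-size='14px'][@color='text']")
    if div.get("class", "").startswith("sc-")
]
texts = [div.text for div in matches]
print(texts)
```

The point is that the match depends on the font-size and color attributes and a class prefix, not on the element's text or the full generated class name.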
I am trying to scrape the name and email from this page: https://pubmed.ncbi.nlm.nih.gov/28958615/. The problem is that when you click on the link next to a name, it "expands" the section below it and takes you there. There is no way to know in advance which name's link has an email in its description, so I am stuck getting one or the other but not both. This is a very simple site, so there are no issues with finding elements. I hope someone can help with the logic here.
I am doing this..though this is not correct, I know.
Aut_div = driver.find_element_by_class_name('inline-authors')
nam_a = Aut_div.find_elements_by_tag_name('a')
for name in nam_a:
    try:
        name.click()
        lis = driver.find_element_by_tag_name('li').text
        if '@' in lis:
            print(lis)
            print(name.text)
            break
    except:
        continue
Please use the following logic:
1. Find all elements using the class locator: authors-list-item
2. For each author item, find all a tags (find_elements method):
   - the first tag contains the author name: get the name
   - the second tag contains the link to the reference details: get its href attribute
3. Using the value of the href attribute, find the element by the CSS locator [data-affiliation-id={id}] (you have to remove # from its value: #affiliation-1 -> affiliation-1).
4. Extract the inner text of the element found using the locator from point 3. This is where the email appears; extract it using a regex, e.g. ([a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9_-]+)
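Points 3 and 4 can be sketched without a browser. The helper names affiliation_locator and extract_email are made up for this example, and the sample affiliation text is invented. The locator is quoted here, which is equivalent CSS:

```python
import re

# Regex from point 4, with the dot before the top-level domain escaped.
EMAIL_RE = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9_-]+")


def affiliation_locator(href):
    """Turn an href such as '#affiliation-1' into a CSS attribute selector."""
    # Keep only the fragment after '#'; a browser may hand back an absolute URL.
    affiliation_id = href.rsplit("#", 1)[-1]
    return f'[data-affiliation-id="{affiliation_id}"]'


def extract_email(text):
    """Return the first email-like substring of text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


locator = affiliation_locator("#affiliation-1")
email = extract_email("Dept. of Biology. Electronic address: jane.doe@example.edu.")
print(locator, email)
```

With Selenium, the locator string would then be passed to a CSS lookup on the page, and the text of the element it finds would go through extract_email.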
I am trying to scrape the title from a website, but the problem is that it has no class or id.
Usually I use this to get a title that has a class:
titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
Now I am trying to extract the text shown in this screenshot. Can you please fix it? https://i.stack.imgur.com/k6aCN.png
You may locate a specific node by any attribute (not only class and id) or by its position relative to other nodes.
A few examples for the text in your screenshot:
response.xpath('//div[@class="job-title-text"]/a/text()')
response.xpath('//a[contains(@onclick,"clickJObTitle")]/text()')
response.xpath('//a[contains(@href,"jobdetails")]/text()')
response.css('div.job-title-text a::text')
response.css('a[onclick*=clickJObTitle]::text')
response.css('a[href*=jobdetails]::text')
See also:
https://www.w3schools.com/xml/xpath_syntax.asp
https://www.w3schools.com/cssref/css_selectors.asp
I'm a beginner in Python and currently I'm trying to write a simple script using BeautifulSoup to extract some information from a web page and write it to a CSV file. What I'm trying to do here, is to go through all the lists on the web page. In the specific HTML file which I'm looking to work with, only one 'ul' has an id and I wish to skip that one and save all the other list elements in an array. My code doesn't work and I can't figure out how to solve my problem.
for ul in content_container.findAll('ul'):
    if 'id' in ul:
        continue
    else:
        for li in ul.findAll('li'):
            list.append(li.text)
            print(li.text)
Here, when I print the list out, I still see the elements from the ul with the id. I know it's a simple problem, but I'm stuck at the moment. Any help would be appreciated.
You are looking for id=False. Use this:
for ul in content_container.find_all('ul', id=False):
    for li in ul.find_all('li'):
        list.append(li.text)
        print(li.text)
This will ignore all ul tags that have an id attribute. Also, your approach was nearly correct. You just need to check whether id is present in the tag's attributes, not in the tag itself: 'id' in ul tests the tag's children, not its attributes. So use if 'id' in ul.attrs instead of if 'id' in ul (attrs is a dictionary, so it is not called).
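Here is a small sketch comparing the two approaches on invented markup. The div with id="content" and the list items are assumptions standing in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical page: only the first <ul> has an id.
html = """
<div id="content">
  <ul id="menu"><li>Home</li><li>About</li></ul>
  <ul><li>First</li><li>Second</li></ul>
  <ul><li>Third</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content_container = soup.find("div", id="content")

# Option 1: let find_all() skip every <ul> that has an id attribute.
items = [
    li.get_text(strip=True)
    for ul in content_container.find_all("ul", id=False)
    for li in ul.find_all("li")
]

# Option 2: check the attribute dictionary explicitly.
items_via_attrs = [
    li.get_text(strip=True)
    for ul in content_container.find_all("ul")
    if "id" not in ul.attrs
    for li in ul.find_all("li")
]
print(items)
```

Collecting into a variable named items also avoids shadowing the built-in list.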
Try this:
all_uls = content_container.find_all('ul')
# assuming that the ul with the id is the first ul
for i in range(1, len(all_uls)):
    print(all_uls[i])