I am trying to scrape the name and email from here: https://pubmed.ncbi.nlm.nih.gov/28958615/. The problem is that when you click the link next to a name, it "expands" the section below it and takes you there. There is no way to know in advance which name's link leads to a description containing an email, so I can get one or the other but not both. It is a very simple site, so there are no issues with finding elements. I hope someone can help with the logic here.
I am doing this, though I know it is not correct:
Aut_div = driver.find_element_by_class_name('inline-authors')
nam_a = Aut_div.find_elements_by_tag_name('a')
for name in nam_a:
    try:
        name.click()
        lis = driver.find_element_by_tag_name('li').text
        if '@' in lis:
            print(lis)
            print(name.text)
            break
    except Exception:
        continue
Please use the following logic:
1. Find all author items using the class locator authors-list-item.
2. For each author item, find all a tags (find_elements method):
-the first tag contains the author name - get the name;
-the second tag contains the link to the reference details - get its href attribute.
3. Using the value of the href attribute, find the element by the CSS locator [data-affiliation-id={id}] (you have to remove the # from its value: #affiliation-1 -> affiliation-1).
4. Extract the inner text of the element found using the locator from point 3 -> here you can find the email -> extract it using a RegEx (e.g. ([a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9_-]+)).
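A sketch of those steps in Python/Selenium (untested against the live page; the helper names are mine, and the regex is the pattern above with `@` restored and the dot escaped):

```python
import re

# Email pattern from step 4, with '@' and an escaped dot.
EMAIL_RE = re.compile(r'([a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9_-]+)')

def extract_email(affiliation_text):
    """Return the first email address found in an affiliation blurb, or None."""
    match = EMAIL_RE.search(affiliation_text)
    return match.group(1) if match else None

def scrape_author_emails(driver):
    """Pair each author name with any email found in its affiliation entry.

    Requires a live Selenium driver already on the PubMed article page.
    """
    results = {}
    for item in driver.find_elements_by_class_name('authors-list-item'):
        links = item.find_elements_by_tag_name('a')
        if len(links) < 2:
            continue
        name = links[0].text
        # '...#affiliation-1' -> 'affiliation-1'
        aff_id = links[1].get_attribute('href').split('#')[-1]
        affiliation = driver.find_element_by_css_selector(
            '[data-affiliation-id="{}"]'.format(aff_id))
        email = extract_email(affiliation.text)
        if email:
            results[name] = email
    return results
```

Only authors whose affiliation text actually contains an email end up in the result, which sidesteps the "which link has the email" problem entirely.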
I am using Beautiful Soup to parse elements of an email, and I have successfully extracted the links from a button in the email. However, the class name on the button appears twice in the email HTML, so two links are extracted and printed. I only need the first link, i.e. the first occurrence of that class name.
This is my code:
soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
The 'mcnButton' class matches two HTML buttons in the email, each containing a separate link. I only need the first occurrence of 'mcnButton' and the link it contains.
The code above prints two links (again, I only need the first):
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index and separately access the first occurrence of the class and its link. Any assistance would be greatly appreciated, thanks!
I tried select_one, find, and attempts to index the class, which unfortunately resulted in a syntax error.
To find only the first element matching your pattern use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information, check the docs.
I've tried to target a.nostyle in my code, but when I do so it will sometimes grab the email above, as they share the same tags. I can't seem to find any tags unique to the phone number. How would you go about this?
See the image below. Any help would be greatly appreciated.
You can try
a.nostyle:not([itemprop])
UPDATE
As it seems that BeautifulSoup doesn't support the :not() syntax, you can try a workaround:
link = [link for link in soup.select('a.nostyle') if 'itemprop' not in link.attrs][0]
to select the link with the required class attribute that doesn't contain an itemprop attribute (as the email link does).
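Since the workaround is just attribute filtering, the selection logic can be sketched with plain dicts standing in for BeautifulSoup tags (a tag's attrs behaves like a dict; the values below are made up for illustration):

```python
# Stand-ins for the tags returned by soup.select('a.nostyle'): each dict plays
# the role of tag.attrs. On the real page the email link carries
# itemprop="email" while the phone link does not.
links = [
    {'class': ['nostyle'], 'itemprop': 'email', 'href': 'mailto:someone@example.com'},
    {'class': ['nostyle'], 'href': 'tel:+15551234567'},
]

# Keep only links without an itemprop attribute, then take the first one,
# exactly as the list comprehension does with link.attrs.
phone_links = [link for link in links if 'itemprop' not in link]
print(phone_links[0]['href'])
```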
You can make a list containing all the "a" tags, then target the required tag by its index number.
Example
allATagContainer = soup.findAll("a")
Then you can use allATagContainer[1] to target the second a tag.
I am trying to scrape the title of a website, but the problem is that it has no class and no id.
Usually I use this to get a title that has a class:
titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
Now I am trying to extract the text; please see the screenshot: https://i.stack.imgur.com/k6aCN.png
You can locate a specific node by any attribute (not only class and id), or by its position relative to other nodes.
A few examples for the text in your screenshot:
response.xpath('//div[@class="job-title-text"]/a/text()')
response.xpath('//a[contains(@onclick,"clickJObTitle")]/text()')
response.xpath('//a[contains(@href,"jobdetails")]/text()')
response.css('div.job-title-text a::text')
response.css('a[onclick*=clickJObTitle]::text')
response.css('a[href*=jobdetails]::text')
See also:
https://www.w3schools.com/xml/xpath_syntax.asp
https://www.w3schools.com/cssref/css_selectors.asp
I am trying to get data from the website, but I want to select the first 1000 links, open them one by one, and get data from each.
I have tried:
list_links = driver.find_elements_by_tag_name('a')
for i in list_links:
    print(i.get_attribute('href'))
This gets extra links which are not required.
for example: https://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1,2,3,4,5,%3E5&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&cityName=Mumbai
we get more than 50k links. How do I open only the first 1000 links, the listings with property photos?
Edit
I have tried this also:
driver.find_elements_by_xpath("//div[@class='.l-srp__results.flex__item']")
driver.find_element_by_css_selector('a').get_attribute('href')
for matches in driver:
    print('Liking')
    print(matches)
    #matches.click()
    time.sleep(5)
But I am getting the error: TypeError: 'WebDriver' object is not iterable.
Why am I not getting the link by using this line: driver.find_element_by_css_selector('a').get_attribute('href')?
Edit 1
I am trying to filter the links as below, but I am getting an error:
result = re.findall(r'https://www.magicbricks.com/propertyDetails/', my_list)
print (result)
Error: TypeError: expected string or bytes-like object
Or I tried:
a = ['https://www.magicbricks.com/propertyDetails/']
output_names = [name for name in a if (name[:45] in my_list)]
print (output_names)
This gets nothing.
All the links are in a list. Please suggest.
Thank you in advance.
Selenium is not a good idea for web scraping. I would suggest you use JMeter, which is free and open source.
http://www.testautomationguru.com/jmeter-how-to-do-web-scraping/
If you want to use Selenium, the approach you are trying to follow - clicking and grabbing the data - is not a stable one. Instead I would suggest something similar to this. The example is in Java, but you can get the idea.
driver.get("https://www.yahoo.com");
Map<Integer, List<String>> map = driver.findElements(By.xpath("//*[@href]"))
.stream() // find all elements which has href attribute & process one by one
.map(ele -> ele.getAttribute("href")) // get the value of href
.map(String::trim) // trim the text
.distinct() // there could be duplicate links , so find unique
.collect(Collectors.groupingBy(LinkUtil::getResponseCode)); // group the links based on the response code
More info is here.
http://www.testautomationguru.com/selenium-webdriver-how-to-find-broken-links-on-a-page/
I believe you should collect into a list all the elements with tag name "a" that have a non-null "href" attribute.
Then traverse the list and click the elements one by one.
Create a list of type WebElement and store all the valid links.
Here you can apply more filters or conditions, i.e. the link contains certain characters or meets some other condition.
To store the WebElements in a list you can use driver.findElements(); this method returns a list of WebElement.
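The approach above can be sketched like this (the Selenium part is untested scaffolding around a live driver; the filter also sidesteps the re.findall error, since re.findall expects a string, not a list - plain substring matching over the list is enough here):

```python
def property_links(hrefs, limit=1000):
    """Keep only property-detail URLs, de-duplicated and capped at `limit`."""
    kept = []
    for href in hrefs:
        if href and 'magicbricks.com/propertyDetails/' in href and href not in kept:
            kept.append(href)
            if len(kept) == limit:
                break
    return kept

def collect_hrefs(driver):
    """Grab every href on the page (needs a live Selenium driver)."""
    return [a.get_attribute('href')
            for a in driver.find_elements_by_tag_name('a')]

# Usage with a live driver:
# for url in property_links(collect_hrefs(driver)):
#     driver.get(url)  # open each listing in turn, then scrape it
```

Navigating by stored URL rather than clicking stored elements also avoids stale-element errors once the page changes.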
I asked my previous question here:
Xpath pulling number in table but nothing after next span
This worked, and I managed to see the number I wanted in a Firefox plugin called XPath Checker; the results show below.
So I know I can find this number with this XPath, but when I try to run a Python script to find and save the number, it says it cannot find it.
try:
    views = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']/preceding-sibling::text()")
except NoSuchElementException:
    print "NO views"
    views = 'n/a'
    pass
I know that pass is not best practice, but I am just testing this at the moment, trying to find the number. I'm wondering if I need to change something at the end of the XPath, like adding .text, as XPath Checker normally shows results a little differently, like below:
I needed to use the XPath I gave rather than the one used in the picture above, because I only want the number and not the date. You can see part of the source in my previous question.
Thanks in advance! Scratching my head here.
The XPath used in find_element_by_xpath() has to point to an element, not a text node and not an attribute. That is the critical thing here.
The easiest approach here would be to:
get the td's text (parent)
get the span's text (child)
remove child's text from parent's
Code:
span = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']")
td = span.find_element_by_xpath('..')
views = td.text.replace(span.text, '').strip()
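The subtraction step itself is plain string work; with made-up .text values shaped like those in the screenshot:

```python
# Hypothetical .text values, mimicking what Selenium would return:
td_text = "24,053 Views Added: 2 years ago"   # the parent <td>'s text
span_text = "Added: 2 years ago"              # the child <span class="added-time">'s text

# Remove the child's text from the parent's, leaving only the view count.
views = td_text.replace(span_text, '').strip()
print(views)  # 24,053 Views
```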