Scraper unable to extract titles from a website - python

I've written a script in Python, in combination with Selenium, to extract the titles of the different news items displayed in the left-hand sidebar of the finance.yahoo.com website. I've used a CSS selector to get the content. However, the script neither gives any result nor throws any error, and I can't figure out the mistake I'm making. I hope somebody will take a look at it. Thanks in advance.
Here is my script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://finance.yahoo.com/")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "u.StretchedBox")))
for item in driver.find_elements_by_css_selector("u.StretchedBox span"):
    print(item.text)
driver.quit()
Elements within which the titles are:
<h3 class="M(0)" data-reactid="128"><a rel="nofollow noopener noreferrer" class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 Td(n) C(#0078ff):h C(#000)" target="_blank" href="https://beap.gemini.yahoo.com/mbclk?bv=1.0.0&es=bVwDtPMGIS8NDKqncZWZBjLsQQHm58Z9cLJuMqC6LadDlYfVCoy.d3GqO599EPAiYnsxB0SB8aRURPve9Q8mOEjH.NrcVcVDhldut.C_9Vn16XER1q1G07a48FMQ_.sv9GCyVx7zcj1kBtWPysaYzQqboJWgUo5DRRHbAnejwVtYRPHJTEptil92tx_ccJZ9FnxE8L3tfDuS0Q3l5ftVhamTOon_nzuvtvqqBwD7X0T.7Z3wZBgtH93gM1xImZ0hdFUzsuQPDAjZWs1KdH0YsXIf3uLrmcJFoI9leh8KRljnIPC.RdhOF6OYcJfHtDks85nSIgfOsMyUr1wEhMA2Qa2htpEg5w.P4UIXeoldjzJ_NsUrtXqEFIJNKoaeq_FNiQ9wcI16utKO87167zkfSPzVY09d3pVLZg20V7tqTThOkG_IakPnmlOriJKnufsBWj1wp.6Q4PasAt2g4Y1yw9U71FIfG2dDwpryRKDWrUBfTvjwwItlSyXyvWvIYUyXXxR74qWcIEC3KAvVN7.iqSckV_EssVM8ytp5HiN4iTACpEmc96rpdNEqHYpRotwze8NF5cDubsZbW58Hauq_aO.DbhZJ7TbBDx5vZK_M%26lp=https%3A%2F%2Fin.search.yahoo.com%2Fsearch%3Fp%3Dcheap%2Bairfare%2Bdomestic%26fr%3Dstrm-tts-thg%26.tsrc%3Dstrm-tts-thg%26type%3Dcheapairfaredomestic-in" data-reactid="129">
<u class="StretchedBox" data-reactid="130"></u>
<span data-reactid="131">The Cheapest Domestic Airfare Rates</span></a></h3>

You got neither an error nor any results because:
The find_elements_...() methods return a list. If your selector matches no elements you won't get an error, just an empty list, and iterating over an empty list won't raise an error either.
Your CSS selector matches a span that is a descendant of a u element with class="StretchedBox", but the required span is actually not a descendant of the u, it is a sibling.
Try the code below:
for item in driver.find_elements_by_css_selector("u.StretchedBox+span"):
    print(item.text)
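The difference between the two combinators can be checked offline; here is a minimal sketch using BeautifulSoup (assuming it is installed), with the markup trimmed down from the question:

```python
from bs4 import BeautifulSoup

# Trimmed-down markup from the question: the <span> holding the title is a
# sibling of <u class="StretchedBox">, not a descendant of it.
html = """
<h3 class="M(0)">
  <a href="#">
    <u class="StretchedBox"></u>
    <span>The Cheapest Domestic Airfare Rates</span>
  </a>
</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: matches nothing, the span is not inside the <u>
descendants = soup.select("u.StretchedBox span")

# Adjacent-sibling combinator: matches the span right after the <u>
siblings = [s.get_text() for s in soup.select("u.StretchedBox + span")]
print(len(descendants), siblings)
```

The same two selectors behave identically in Selenium; the empty result above is exactly why the original loop printed nothing.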

Related

How to use XPath to scrape javascript website values

I'm trying to scrape (in python) the savings interest rate from this website using the value's xpath variable.
I've tried everything: BeautifulSoup, Selenium, etree, etc. I've been able to scrape a few other websites successfully, but this site and many others are giving me fits. I'd love a solution that can scrape info from several sites, regardless of their formatting, using XPath variables.
My current attempt:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
service = Service(executable_path="/chromedriver")
options = Options()
options.add_argument('--incognito')
options.headless = True
driver = webdriver.Chrome(service=service, options=options)
url = 'https://www.americanexpress.com/en-us/banking/online-savings/account/'
driver.get(url)
element = driver.find_element(By.XPATH, '//*[@id="hysa-apy-2"]')
print(element.text)
if element.text == "":
    print("Error: Element text is empty")
driver.quit()
The interest rates are written inside span elements. All span elements containing interest rates share the same class, heading-6. But bear in mind that this returns two span elements for each interest rate, one for each viewport.
The xpath selector:
'//span[@class="heading-6"]'
You can also get elements by containing text APY:
'//span[contains(., "APY")]'
But note that this selector matches every span element in the DOM that contains the word APY.
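To see how the attribute-based selector behaves, here is a small offline sketch with xml.etree.ElementTree, whose limited XPath subset supports attribute predicates; the markup and rate values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented stand-in markup: two span copies per rate, mirroring the
# duplicated-viewport layout described above, plus one unrelated span.
html = """
<div>
  <span class="heading-6">4.25% APY</span>
  <span class="heading-6">4.25% APY</span>
  <span class="footnote">not a rate</span>
</div>
"""
root = ET.fromstring(html)

# The attribute predicate picks out only the heading-6 spans
rates = [span.text for span in root.findall(".//span[@class='heading-6']")]
print(rates)
```

Both viewport copies come back, which is the duplication the paragraph above warns about.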
If you can find a unique id, it should take priority, e.g. find_element(By.ID, 'hysa-apy-2'), as @John Gordon commented.
But sometimes, even when the element is found, its text has not loaded yet.
Use XPath and add the condition text()!="":
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@id="hysa-apy-2" and text()!=""]')))
With the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

How to get all elements with multiple classes in selenium

This is how I load the website:
from selenium import webdriver
url = '...'
driver = webdriver.Firefox()
driver.get(url)
Now I want to extract all elements with certain classes into a list:
<li class=foo foo-default cat bar/>
How would I get all the elements from the website with these classes?
There is something like
fruit = driver.find_element_by_css_selector("#fruits .tomatoes")
But when I do this (I tried without spaces between the selectors too)
elements = driver.find_element_by_css_selector(".foo .foo-default .cat .bar")
I get
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .foo .foo-default .cat .bar
Stacktrace:
WebDriverError#chrome://remote/content/shared/webdriver/Errors.jsm:183:5
NoSuchElementError#chrome://remote/content/shared/webdriver/Errors.jsm:395:5
element.find/</<#chrome://remote/content/marionette/element.js:300:16
These are the classes I copied from the website's DOM, though...
If this is just the HTML
<li class=foo foo-default cat bar/>
You can remove the spaces and prefix each class with a . to build a CSS selector locator:
elements = driver.find_elements(By.CSS_SELECTOR, "li.foo.foo-default.cat.bar")
print(len(elements))
or my recommendation would be to use it with explicit waits:
elements_using_ec = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.foo.foo-default.cat.bar")))
print(len(elements_using_ec))
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Have you tried without spaces between class names?
fruit = driver.find_element_by_css_selector(".foo.foo-default.cat.bar")
Note the plural find_elements (rather than find_element):
driver.find_elements_by_css_selector(".foo.foo-default.cat.bar")
This works.
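Why the compound selector works can be checked without a browser; a minimal sketch with BeautifulSoup (assuming it is available), using an invented list:

```python
from bs4 import BeautifulSoup

# Invented markup: only the first <li> carries all four classes
html = """
<ul>
  <li class="foo foo-default cat bar">full match</li>
  <li class="foo cat">partial match</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# No spaces: all four classes must sit on the same element.
# ".foo .foo-default .cat .bar" (with spaces) would instead look for
# nested elements, which is why the original attempt found nothing.
matches = [li.get_text() for li in soup.select("li.foo.foo-default.cat.bar")]
print(matches)
```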

python selenium how to click <a href > login button

<a href="/?redirectURL=%2F&returnURL=https%3A%2F%2Fpartner.yanolja.com%2Fauth%2Flogin&serviceType=PC" aria-current="page" class="v-btn--active v-btn v-btn--block v-btn--depressed v-btn--router theme--light v-size--large primary" data-v-db891762="" data-v-3d48c340=""><span class="v-btn__content">
login
</span></a>
I want to click this href button using python-selenium.
First, I tried using find_element_by_xpath(), but I got the error no such element: Unable to locate element. Then I tried link_text, partial_link_text, and so on.
I think that message means it can't find the login button, right?
How can I specify the login button?
This is my first time using Selenium, and my HTML knowledge is also limited.
What should I study first?
Update:
url : https://account.yanolja.bz/
I want to login and get login session this URL.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
html = urlopen('https://account.yanolja.bz/')
id = 'ID'
password = 'PW'
#options = webdriver.ChromeOptions()
#options.add_argument("headless")
#options.add_argument("window-size=1920x1080")
#options.add_argument("--log-level=3")
driver = webdriver.Chrome("c:\\chromedriver")
driver.implicitly_wait(3)
driver.get('https://account.yanolja.bz/')
print("Title: %s\nLogin URL: %s" %(driver.title, driver.current_url))
id_elem = driver.find_element_by_name("id")
id_elem.clear()
id_elem.send_keys(id)
driver.find_element_by_name("password").clear()
driver.find_element_by_name("password").send_keys(password)
Most probably this is not working because of the span inside your a tag. I have given some examples, but I am not sure whether you are supposed to click the a tag or the span; I have clicked the span in all of them. I hope it works. If you could show us what you have tried, it would be a great help in finding any mistakes.
Have you tried to find the element using its class? Your link carries so many classes; is none of them unique?
driver.find_element(By.CLASS_NAME, "{class-name}").click()
using x-path:
driver.find_element_by_xpath("//span[@class='v-btn__content']").click()
driver.find_element_by_xpath("//a[@class='{class-name}']/span[@class='v-btn__content']").click()
If this XPath is not unique, you can use a CSS selector instead:
driver.find_element_by_css_selector("a[aria-current='page']>span.v-btn__content").click()
Firstly, I want to note that <span class="v-btn__content">login</span> is not clickable by itself, which is why it raises an error.
Try this instead:
driver.find_element_by_xpath('//a[@href="'+url+'"]')
Replace url with the URL given in <a href='url'>.
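The string-built XPath can be sanity-checked offline with xml.etree.ElementTree, which supports the same @href predicate; the short /login href below is a hypothetical stand-in for the long login URL in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical short href standing in for the question's long login URL
url = "/login"
html = ('<div><a href="/login">'
        '<span class="v-btn__content">login</span>'
        '</a></div>')
root = ET.fromstring(html)

# Same pattern as the Selenium locator: //a[@href='...']
link = root.find(".//a[@href='" + url + "']")
print(link.find("span").text)
```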

Wait for every element in Selenium

I want to retrieve every div class='abcd' element from a website using Selenium, together with the waiter and XPATH helpers from the explicit package.
The source code is something like this:
<div class='abcd'>
<a> Something </a>
</div>
<div class='abcd'>
<a> Something else </a>
...
When I run the following code (Python), I get only 'Something' as a result. I'd like to iterate over every instance of div class='abcd' appearing in the page source.
from explicit import waiter, XPATH
from selenium import webdriver
driver = webdriver.Chrome(PATH)
result = waiter.find_element(driver, "//div[#class='abcd']/a", by=XPATH).text
Sorry if the explanation isn't too technical; I'm only starting with web scraping. Thanks.
I've used it like this; you can follow the same approach:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(PATH)
css_selector = "div.abcd"
results = WebDriverWait(driver, 10).until((expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))))
for result in results:
    print(result.text)
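The plural/singular distinction the question runs into can be illustrated offline with xml.etree.ElementTree, using the snippet from the question (with the truncated second div closed so it parses):

```python
import xml.etree.ElementTree as ET

# Markup from the question; the second div is closed for well-formedness
html = """
<body>
  <div class="abcd"><a> Something </a></div>
  <div class="abcd"><a> Something else </a></div>
</body>
"""
root = ET.fromstring(html)

# findall (plural) returns every match, like presence_of_all_elements_located;
# find (singular) stops at the first one, like waiter.find_element did
texts = [a.text.strip() for a in root.findall(".//div[@class='abcd']/a")]
first = root.find(".//div[@class='abcd']/a").text.strip()
print(texts, first)
```

This is why the original code printed only 'Something': a singular lookup stops at the first hit.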

Get text using selenium PhantomJS inside Span

I tried to get the text inside a span using the Selenium webdriver with PhantomJS. My code is like this:
href = driver.find_elements_by_xpath("//a[@class='_8mlbc _vbtk2 _t5r8b']")
for rt in href:
    rt.click()
if href:
    name = driver.find_elements_by_xpath("//*[@class='_99ch8']/span").text
    # name = driver.find_element_by_xpath("//li[a[@title='nike']]/span").text
    print(name)
In HTML :
<li class="_99ch8"><a class="_4zhc5 notranslate _ebg8h" title="nike" href="/nike/">nike</a><span><span>Nobody believed a boy from Madeira would make it to the stars. Except the boy from Madeira. </span><br>#nike<span> </span>#soccer<span> </span>#football<span> </span>#CR7<span> </span>#Cristiano<span> </span>#CristianoRonaldo<span> </span>#Mercurial<span> </span>#justdoit</span></li>
I want to get the text inside the span.
You cannot use an XPath expression that returns a text node: that is not an acceptable option for Selenium, as a selector should return a WebDriver element only.
Also note that the class name of the li seems to be dynamic, so you might use the title attribute value of the child anchor instead:
driver.find_element_by_xpath("//li[a[@title='nike']]/span").text
UPDATE
The complete code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/nike/')
links = driver.find_elements_by_xpath('//a[contains(@href, "/?taken-by=nike")]')
for link in links:
    link.click()
    wait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//div/article")))
    print(driver.find_element_by_xpath("//li[a[@title='nike']]/span").text)
    driver.find_element_by_xpath("//div[@role='dialog']/button").click()
UPDATE#2
You can also simply grab the same text without opening each image:
links = driver.find_elements_by_xpath('//img')
for img in links:
    print(img.get_attribute('alt'))
First of all, if you want to get a single element, you need to use the find_element_by_xpath() method instead of the find_elements_by_xpath() method.
If you are using find_elements_by_xpath(), then you need a loop to print all the names that end up in the name variable.
Also, the .text property of an element gives you the desired result.
Try this:
name = driver.find_element_by_xpath("//li[@class='_99ch8']/span").text
print(name)
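For reference, the sibling-span extraction that the XPath in the answer above performs can be reproduced offline with BeautifulSoup (assuming it is installed), using a trimmed copy of the question's markup:

```python
from bs4 import BeautifulSoup

# Trimmed markup from the question
html = ('<li class="_99ch8">'
        '<a class="_4zhc5 notranslate _ebg8h" title="nike" href="/nike/">nike</a>'
        '<span><span>Nobody believed a boy from Madeira would make it '
        'to the stars. Except the boy from Madeira. </span></span>'
        '</li>')
soup = BeautifulSoup(html, "html.parser")

# Anchor on the stable title attribute of the child <a>, then step to the
# sibling <span> -- the same idea as //li[a[@title='nike']]/span
caption = soup.find("a", title="nike").find_next_sibling("span").get_text()
print(caption)
```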
