Hi, I have the following in Python:
#Searching for company
varA = soup.find(string="Microsoft")
#Finding the <a> tag which contains href
#{<a data-deptmodal="true" href="https://someURL BASED ON COMPANY NAME">TEXT BASED ON COMPANY NAME</a>}
button = varA.find_previous('a')
driver.find_element_by_tag_name(button).click()
and I get an error like:
TypeError: Object of type 'Tag' is not JSON serializable
How do I make the WebDriver click on my href after I get the soup?
Please note that my href changes every time I change the company name.
To add to the existing comment: BeautifulSoup is an HTML parser. It helps you extract data from the HTML, but it does not interact with the page in any way; it cannot, for instance, click the link.
If you need to click the link in the browser, do it via Selenium. In your case the .find_element_by_link_text() (or .find_element_by_partial_link_text()) locator fits the problem really well:
driver.find_element_by_link_text("Microsoft")
Documentation reference: Locating Hyperlinks by Link Text.
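Since both the href and the link text are derived from the company name, the locator can be built from the name itself. A minimal sketch, assuming a driver that is already on the right page (the helper name click_company_link is mine):

```python
def click_company_link(driver, company_name):
    """Click the <a> whose visible text contains the company name.

    Partial link-text matching tolerates extra text around the name,
    so the locator keeps working as the company name changes.
    """
    driver.find_element_by_partial_link_text(company_name).click()
```

Calling click_company_link(driver, "Microsoft") then clicks whichever link the current company name produced.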
I want to extract links to PDFs from this page using Selenium in Python.
I managed to extract the entire table that contains the rows and the links to the pdfs.
driver.get(company_link)
announcement_link = driver.find_element(By.XPATH, '//*[@id="heading1"]/h1/a').get_attribute('href')
driver.get(announcement_link)
table = driver.find_element(By.XPATH, '//*[@id="lblann"]/table/tbody/tr[4]/td')
I am looking for the shortest possible method to create a list of all the PDF links, in sequence.
How do I do that?
In the page you provided, each link has a unique class, tablebluelink, which makes it easy to select all of them with an XPath expression matching every a element whose class attribute has the value tablebluelink:
//a[@class='tablebluelink']
Note that Selenium locators can only return element nodes, not attribute nodes, so select the elements with find_elements_by_xpath and read each href via get_attribute:
elems = driver.find_elements_by_xpath("//a[@class='tablebluelink']")
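Iterating over those elements and collecting each href then gives the full list in page order. A minimal sketch; the helper name collect_pdf_links is mine, and it assumes a driver that has already navigated to the announcements page:

```python
def collect_pdf_links(driver):
    """Return the href of every link styled with the 'tablebluelink' class."""
    elems = driver.find_elements_by_xpath("//a[@class='tablebluelink']")
    return [elem.get_attribute("href") for elem in elems]
```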
I am using Beautiful Soup to parse elements of an email, and I have successfully extracted the links from a button in the email. However, the class name on the button appears twice in the email HTML, so two links are extracted and printed. I only need the first link, i.e. the first element with that class name.
This is my code:
soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
The 'mcnButton' class references two HTML buttons within the email, containing two separate links. I only need the first reference to the 'mcnButton' class and the link it contains.
The above code prints out two links (again, I only need the first).
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index and separately access the first reference to the class and link. Any assistance would be greatly appreciated, Thanks!
I tried select_one, find, and attempts to index the class; unfortunately they resulted in a syntax error.
To find only the first element matching your pattern use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information check the docs
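On a small made-up sample (the HTML below is invented for illustration, mimicking the two mcnButton links in the email), .find() returns only the first match:

```python
from bs4 import BeautifulSoup

# Made-up snippet with two 'mcnButton' links, standing in for the email HTML.
html = (
    '<a class="mcnButton" href="https://example.com/first">One</a>'
    '<a class="mcnButton" href="https://example.com/second">Two</a>'
)
soup = BeautifulSoup(html, "html.parser")

# .find() stops at the first matching element; calling soup(...) would return both.
first_href = soup.find("a", class_="mcnButton", href=True).get("href")
print(first_href)  # https://example.com/first
```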
I want to extract a field from the following web page: https://www.olx.bg/d/ad/podemnitsi-haspeli-tovarni-i-kuhnenski-asansori-motor-reduktori-CID1012-ID8pWNq.html
The value that I want to get is this one ( 3143 ):
I tried to do it, but with no success: the value is generated by JavaScript. Here is my code so far.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.olx.bg/d/ad/podemnitsi-haspeli-tovarni-i-kuhnenski-asansori-motor-reduktori-CID1012-ID8pWNq.html')
soup = BeautifulSoup(page.content, 'html.parser')
Do you have any idea how can I do this ?
Check the source code of the webpage (using Ctrl + U in the browser) and search for your desired element to see whether it is present.
If it is present in the page source, extract it; you could use a regex if, for example, it sits inside a JSON blob.
If not, try the Selenium library.
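As a sketch of the regex route, assuming the value sits inside a JSON blob embedded in a script tag (the variable name window.__STATE__, the key names, and the snippet below are all made up for illustration; inspect the real page source for the actual structure):

```python
import json
import re

# Made-up page source containing a JSON blob, standing in for the real page.
page_source = '<script>window.__STATE__ = {"ad": {"views": 3143}};</script>'

# Capture the JSON object assigned in the script tag, then parse it properly
# instead of trying to regex individual fields out of it.
match = re.search(r'window\.__STATE__ = (\{.*?\});?</script>', page_source)
data = json.loads(match.group(1))
print(data["ad"]["views"])  # 3143
```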
I use soup = BeautifulSoup(driver.page_source) to parse the whole page from Selenium in BeautifulSoup.
But how do I parse just one Selenium element in BeautifulSoup?
The code below throws
TypeError: object of type 'FirefoxWebElement' has no len()
element = driver.find_element_by_id(id_name)
soup = BeautifulSoup(element)
I don't know if selenium does this out of the box, but I managed to find this workaround
element_html = f"<{element.tag_name}>{element.get_attribute('innerHTML')}</{element.tag_name}>"
You may want to replace innerHTML with innerText if you want to get only the text. For example, given
<li>Hi <span> man </span> </li>
getting the innerHTML returns everything inside the tag, but innerText returns only the text; try it and see.
Now create your soup object (passing the parser explicitly):
soup = BeautifulSoup(element_html, 'lxml')
print(soup.WHATEVER)
Using the above technique, just create a method parseElement(webElement) and use it whenever you want to parse an element.
By the way, I only use the lxml parser; when I forgot to pass it, the script didn't work.
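A minimal sketch of such a parseElement helper, assuming BeautifulSoup is available (the stdlib html.parser is used here so the sketch is self-contained; swap in 'lxml' if you have it installed). It relies only on the tag_name and get_attribute members that a Selenium WebElement exposes:

```python
from bs4 import BeautifulSoup

def parse_element(web_element, parser="html.parser"):
    """Rebuild the element's markup and return it as a BeautifulSoup object.

    Wraps the element's innerHTML in its own tag name, as described above.
    Note that any attributes on the outer tag are lost in this reconstruction.
    """
    tag = web_element.tag_name
    element_html = f"<{tag}>{web_element.get_attribute('innerHTML')}</{tag}>"
    return BeautifulSoup(element_html, parser)
```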
I'm new to Selenium's webdriver and Python. I know about this article about getting the HTML source, but I don't want the entire HTML line for a DOM object, just the content in between the tags. I also looked at webdriver source code for help on finding this button DOM object
for example:
<button id = "picview">Pic View</button>
How do I just get "Pic View"?
Also, using get_attribute("button id"), how would I get this specific button id, as there are multiple buttons on the page with a button id?
For example:
picbox_elem_attr = picbox_elem.get_attribute('button id')
print(picbox_elem_attr)
How do I ensure that picbox_elem_attr variable is set to the "picview" button and not some other button?
I don't have a
driver.find_element_by_name("xxx")
or a
driver.find_element_by_id("aaa")
to use to find this button.
To get the text of an element, use the text property. For example:
text_inside_button_id = driver.find_element_by_id("picview").text
Here is some additional documentation to help you with the webdriver binding library.
http://goldb.org/sst/selenium2_api_docs/html/
I forgot about xpath. Oops!
driver.find_element_by_xpath('//*[@id="picview"]')
You can also right-click the object in the browser dev tools and copy its XPath from there.
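Combining the two answers: locate the button by its id with XPath, then read its text property. A minimal sketch (the helper name get_button_label is mine; it assumes a driver on a page that contains the button):

```python
def get_button_label(driver, button_id):
    """Return the visible text of the element with the given id."""
    return driver.find_element_by_xpath(f'//*[@id="{button_id}"]').text
```

For the example markup above, get_button_label(driver, "picview") returns "Pic View".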