I am trying to extract data from two divisions in XHTML having the same class name using python but when I try to take out their xpaths, they are different. I tried using
driver = webdriver.Chrome()
content = driver.find_element_by_class_name("abc")
print content.text
but it gives only the content of first div. I heard this can be done using xpath. The xpaths of the divs are as follows:
//*[#id="u_jsonp_2_o"]/div[2]/div[1]/div[3]
//*[#id="tl_unit_-5698935668596454905"]/div/div[2]/div[1]/div[3]
//*[#id="u_jsonp_3_c"]/div[2]/div[1]/div[3]
What I thought, since each xpath has same ending, how can we use this similarity in ending and then access the divisions in python by writing [1],[2],[3].... at the end of the xpath?
Also, I want make content an array containing all the content of classes named abc. Moreover, I don't know how many abcs exist! How to integrate the data of all of them in one content array?
In your case it doesn't matter if you use class name or css, you are only searching for "one" element with find_element but you want to find several elements:
you need to use find_element**s**_by_class_name
content = driver.find_elements_by_class_name("abc")
for element in content:
// here you can work with every single element that has class "abc"
// do whatever you want
print element.text
Related
I am learning web scraping using selenium and I've come into an issue when trying to select an attribute inside of a selenium object. I can get the broader data if I just print elems.text inside the loop (this outputs the whole paragraph for each listing) however when I try to access the xpath of the h2 title tag of all the listings inside this broader element, it only appends the first listing to the titles array, whereas I want all of them. I checked the XPATH and they are the same for each listing. How can I get all of the listings instead of just the first one?
titles = []
driver.get("https://www.sellmytimesharenow.com/timeshare/All+Timeshare/vacation/buy-timeshare/")
results = driver.find_elements(By.CLASS_NAME, "results-list")
for elems in results:
print(elems.text) #this prints out full description paragraphs
elem_title = elems.find_element(By.XPATH, '//*[#id="search-page"]/div[3]/div/div/div[2]/div/div[2]/div/a[1]/div/div[1]/div/h2')
titles.append(elem_title.text)
If you aren't limited to accessing the elements by XPATH only, then here is my solution:
results = driver.find_elements(By.CLASS_NAME, "result-box")
for elems in results:
titles.append(elems.text.split("\n")[0])
When you try getting the listings, you use find_elements(By.CLASS_NAME, "results-list"), but on the website, there is only one element with the class name "results-list". This aggregates all the text in this div into one long string and therefore you can't get the heading.
However, there are multiple elements with the class name "result-box", so find_elements will store each as its own item in "results". Because the title of each listing is on the first line, you can split the text of each element by the newline.
I am having issues web-scraping a tag element while using BeautifulSuop4 in Python. Typically the elements are given a class or id identifier where I can use:
.find_all(<p>, class_ = 'class-name')
to find the element however the elements I am trying to isolate are in a consecutive list of tags all of which have no identifier for their element.
Is there a way to choose every tag after a tag that has an identifier? Or maybe a way to isolate the specific tags I want without them having any shared class/id?
You could use find_next_sibling to find the classless next sibling of an element.
Consider this example HTML. The first div has the class "blah". The second div has no class but is beside the first div.
html='<div><div class="blah">1</div><div>no class</div></div>'
import bs4
soup = bs4.BeautifulSoup(html,'html.parser')
soup.find('div',{'class':"blah"}).find_next_sibling()
#outputs second div without a class
<div>no class</div>
See this and this for more details.
I want to have a list of all of the '_7Uhw9...' classes, despite there being multiple of these classes, spans in them and everything, it prints out an empty list. Why?
message_response = driver.find_elements_by_class_name('_7UhW9 xLCgt MMzan KV-D4 p1tLr hjZTB')
print(message_response)
If you wanna get up close and personal with the HTML you can go to any Instagram account's dm inbox, it should be there. But here is a picture if that's all you need: Picture of HTML element
class name separated by space indicates multiple classes
by_class uses css class notation under the hood , so if it has multiple classes you should replace space with '.'
message_response = driver.find_elements_by_class_name('_7UhW9.xLCgt.MMzan.KV-D4.p1tLr.hjZTB')
Else you can use css or xpath attribute check instead:
here we see class as just attribute and uses below syntax:
syntax for xpath and css are like
xpath:
//tagname[#attribute="value"]
css
tagname[attribute="value"]
so use any of this syntax
message_response = driver.find_elements_by_xpath("//*[#class='_7UhW9 xLCgt MMzan KV-D4 p1tLr hjZTB']")
or
message_response = driver.find_elements_by_css_selector("[class='_7UhW9 xLCgt MMzan KV-D4 p1tLr hjZTB']")
Using Selenium to perform some webscraping. Have it log in to a site, where an HTML table of data is returned with five values at a time. I'm going to have Selenium scrape a particular bit of data off the table, write to a file, click next, and repeat with the next five.
New automation script. I've a myriad of variations of get_attribute, find_elements_by_class_name, etc. Example:
pnum = prtnames.get_attribute("title")
for x in prtnames:
print('pnum')
Here's the HTML from one of the returned values:
<div class="text-container prtname"><span class="PrtName" title="P011">P011</span></div>
I need to get that "P011" value. Obviously Selenium doesn't have "find_elements_by_title", and there is no HTML id for the value. The Xpath for that line of HTML is:
//*[#id="printerConnectTable"]/tbody/tr[5]/td/table/tbody/tr[1]/td[2]/div/span
But I don't see a reference to "title" or "P011" in that Xpath.
pnum = prtnames.get_attribute("title")
AttributeError: 'list' object has no attribute 'get_attribute'
It's like get_attribute doesn't exist, but there is some (albeit not much) documentation on it.
Fundamentally I'd like to grab that "P011" value and print to console, then I know Selenium is working with the right data.
P.S. I'm self-taught with all of this, I'm automating a sysadmin task.
I think the problem is that prtnames is a list of element, not a specific element. You can use a list comprehension if you want a list of the attributes of titles for the list of prtnames.
pnums = [x.get_attribute('title') for x in prtnames]
I have been struggling with this for a while now.
I have tried various was of finding the xpath for the following highlighted HTML
I am trying to grab the dollar value listed under the highlighted Strong tag.
Here is what my last attempt looks like below:
try:
price = browser.find_element_by_xpath(".//table[#role='presentation']")
price.find_element_by_xpath(".//tbody")
price.find_element_by_xpath(".//tr")
price.find_element_by_xpath(".//td[#align='right']")
price.find_element_by_xpath(".//strong")
print(price.get_attribute("text"))
except:
print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath as it properly should be.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
Here is the Full Code of the Site.
The Site I am using here is www.concursolutions.com
I am attempting to automate booking a flight using selenium.
When you reach the end of the process of booking and receive the price I am unable to print out the price based on the HTML.
It may have something to do with the HTML being a java script that is executed as you proceed.
Looking at the structure of the html, you could use this xpath expression:
//div[#id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[#role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[#align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[#role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequite <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare"):
base_fare = tr.find("strong").text