I am able to log on and access my account page. Here is a sample of the HTML (modified for brevity and to not exceed the URL limit):
<div class='table m_t_4'>
<table class='data' border=0 width=100% cellpadding=0 cellspacing=0>
<tr class='title'>
<td align='center' width='15'><a></a></td>
<td align='center' width='60'></td>
</tr>
<TR bgcolor=>
<td valign='top' align='center'>1</TD>
<td valign='top' align='left'><img src='/images/sale_small.png' alt='bogo sale' />Garden Escape Planters</TD>
<td valign='top' align='right'>13225</TD>
<td valign='top' align='center'>2012-01-17 11:34:32</TD>
<td valign='top' align='center'>FILLED</TD>
<td valign='top' align='center'><A HREF='https://www.daz3d.com/i/account/orderdetail?order=7886745'>7886745</A></TD>
<td valign='top' align='center'><A HREF='https://www.daz3d.com/i/account/req_dlreset?oi=18087292'>Reset</A>
</TR>
Note that the only item I really need is the first HREF with the "order=7886745'>7886745<"...
And there are several of the TR blocks that I need to read.
I am using the following XPath code:
browser.get('https://www.daz3d.com/i/account/orderitem_hist?')
account_history = browser.find_element_by_xpath("//div[@class='table m_t_4']")
print account_history
product_block = account_history.find_element_by_xpath("//TR[contains(@bgcolor, '')]")
print product_block
product_link = product_block.find_element_by_xpath("//TR/td/A@HREF")
print product_link
I am using the Python Firefox version of WebDriver.
When I run this, the account_history and product_block XPaths seem to work fine (they print as "none" so I assume they worked), but I get a "the expression is not a legal expression" error on the product_link.
I have 2 questions:
1: Why doesn't the "//TR/td/A@HREF" XPath work? It is supposed to be using the product_block, which should be just the TR segment, so it should start with the TR, then look for the first td that has the HREF... correct?
I tried using the exact case used in the HTML, but I think it shouldn't matter...
2: What code do I need to use to see the content (HTML/text) of the elements?
I need to be able to do this to get the URL I need for the next page to call.
I would also like to see for sure that the correct HTML is being read here...that should be a normal part of debugging, IMHO.
How is the element data stored? Is it in an array or table that I can read using Python? It has to be available somewhere, in order to be of any use in testing - doesn't it?
I apologize for being so confused, but I see a lot of info on this on the web, and yet much of it either doesn't do anything, or it causes an error.
There do not seem to be any "standard" coding rules available...and so I am a bit desperate here...
I really like what I have seen in Selenium up to this point, but I need to get past it in order to make this work!
Edited!
OK, after getting some sleep, the first answer provided the clue - find_elements_by_xpath creates a list... so I used that to find all of the xpath("//a[contains(@href,'https://www.daz3d.com/i/account/orderdetail?order=')]") elements in the entire history, then accessed the list it created... and wrote it to a file to be sure of what I was seeing.
The revised code:
links = open("listlinks.txt", "w")
browser.get('https://www.daz3d.com/i/account/orderitem_hist?')
account_history = browser.find_element_by_xpath("//div[@class='table m_t_4']")
print account_history.get_attribute("div")
product_links = []
product_links = account_history.find_elements_by_xpath("//a[contains(@href,'https://www.daz3d.com/i/account/orderdetail?order=')]")
print str(len(product_links)) + ' elements'
for index, item in enumerate(product_links):
    link = item.get_attribute("href")
    links.write(str(index) + '\t' + str(link) + '\n')
And this gives me the file with the links I need...
0 https://www.daz3d.com/i/account/orderdetail?order=7905687
1 https://www.daz3d.com/i/account/orderdetail?order=7886745
2 https://www.daz3d.com/i/account/orderdetail?order=7854456
3 https://www.daz3d.com/i/account/orderdetail?order=7812189
So simple I couldn't see it for tripping over it...
Thanks!
1: Why doesn't the "//TR/td/A@HREF" XPath work? It is supposed to be using the product_block, which should be just the TR segment, so it should start with the TR, then look for the first td that has the HREF... correct?
WebDriver only returns elements, not attributes of said elements, thus:
"//TR/td/A"
works, but
"//TR/td/A#HREF"
or
"//TR/td/A#ANYTHING"
does not.
2: What code do I need to use to see the content (HTML/text) of the elements?
To retrieve the innertext:
string innerValue = element.Text;
To retrieve the innerhtml:
This is a little harder: you would need to iterate through each of the child elements and reconstruct the HTML based on that - or you could process the HTML with a scraping tool.
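A rough Python sketch of that reconstruction idea, assuming the driver exposes the outerHTML DOM property through get_attribute (note this only captures child elements, not bare text nodes):
# walk the direct children and stitch their markup back together
children = element.find_elements_by_xpath("./*")
inner_html = "".join(child.get_attribute("outerHTML") for child in children)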
To retrieve an attribute:
string hrefValue = element.GetAttribute("href");
(C#, hopefully you can make the translation to Python)
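For reference, a possible Python translation of the two C# calls above, assuming element is a WebElement:
inner_value = element.text                  # inner text of the element
href_value = element.get_attribute("href")  # value of the href attribute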
There are other ways to access an element besides browser.find_element_by_xpath.
You can access it by, e.g., id or class name:
browser.find_element_by_id
browser.find_element_by_link_text
browser.find_element
browser.find_element_by_class_name
browser.find_element_by_css_selector
browser.find_element_by_name
browser.find_element_by_partial_link_text
browser.find_element_by_xpath
browser.find_element_by_tag_name
Each of the above has a similar function which returns a list (just replace element with elements).
Note: I have separated the top two rows as I think they might help you.
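A quick sketch of the singular/plural difference (the tag name here is just a placeholder):
first_link = browser.find_element_by_tag_name("a")   # first matching WebElement
all_links = browser.find_elements_by_tag_name("a")   # list of all matching WebElements
print str(len(all_links)) + ' links found'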
Related
I'm new to scrapy and have been struggling with this problem for hours.
I need to scrape a page whose source looks something like this:
<tr class="odd">
<td class="pfama_PF02816">Pfam</td>
<td>Alpha_kinase</td>
<td>1389</td>
<td>1590</td>
<td class="sh" style="display: none">21.30</td>
</tr>
I need to get the information from the tr.odd tag, if and only if the a tag has the "Alpha_kinase" value.
I can get all of that content (including "Alpha_kinase", 1389, 1590 and many other values) and then process the output to get "Alpha_kinase" only, but this approach would be quite fragile and ugly. Currently I have to do it this way:
positions = response.css('tr.odd td:not([class^="sh"]) td a::text').extract()
then do a for-loop to check.
Is there any conditional expression (like the td:not above) that I can put in response.css to solve my problem?
Thanks in advance. Any advice will be highly appreciated!
You can use another selector, response.xpath, to select elements from the HTML and filter the text with XPath's contains function.
>>> response.xpath("//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]")
[<Selector xpath="//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]" data='<a href="http://pfam.xfam.org/family/Alp'>]
I assume there are multiple such tr elements on the page. If so, I would probably do something like:
# get only rows containing 'Alpha_kinase' in link text
for row in response.xpath('//tr[@class="odd" and contains(./td/a/text(), "Alpha_kinase")]'):
    # extract all the information
    item['link'] = row.xpath('./td[2]/a/@href').extract_first()
    ...
    yield item
I've got a table with a bunch of links. The IDs are all unique but do not correspond to the actual text that is displayed, so I'm having some trouble.
Ex.
<tr>
<td><a id="011" href="/link">Project 1</a></td>
</tr>
<tr>
<td><a id="235" href="/link">Project 2</a></td>
</tr>
<tr>
<td><a id="033" href="/link">Project 3</a></td>
</tr>
<tr>
<td><a id="805" href="/link">Project 4</a></td>
</tr>
I only know the text within the a href (i.e. Project 1) and I want to search for it and click it. I haven't been able to figure this out, and I've been playing around with find_element_by_xpath for a while.
I've been using
selectproject = browser.find_element_by_xpath("//a[contains(.,projectname)]").click();
(projectname is a variable that changes every iteration)
I think it works to find the element, since the script runs, but it doesn't click. I think it's because I'm not actually searching for the a href, just for the text?
Here is the answer to your question:
If you want to click the link with text Project 1 you can use the following line of code:
browser.find_element_by_xpath("//a[contains(text(),'Project 1')]").click()
or
browser.find_element_by_xpath("//a[#id="011"][contains(text(),'Project 1')]").click()
Update:
As you mentioned, the Project 1 part is dynamic, so you can construct a separate function for clicking these links and call it with each project name in turn. The function below is in Java; convert it as per your required language binding:
public void clickProject(String projectName)
{
    browser.findElement(By.xpath("//a[.='" + projectName + "']")).click();
}
Now you can call it from your main() class as: clickProject("Project 1")
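A possible Python translation of that helper, assuming browser is your webdriver instance:
def click_project(project_name):
    browser.find_element_by_xpath("//a[.='" + project_name + "']").click()

click_project("Project 1")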
Let me know if this answers your question.
If your requirement is to "click on the link Project 1", then you should use that as the locator. No need to mess around with XPath.
browser.find_element_by_link_text("Project 1").click()
# or the more flexible
browser.find_element_by_partial_link_text("Project 1").click()
The .find_element_by_partial_link_text() locator strategy should account for any extra whitespace padding due to the extra span element.
Note: I write Java, so the above Python syntax may be off. But those methods must exist.
I am trying to read in information from this table that changes periodically. The HTML looks like this:
<table class="the_table_im_reading">
<thead>...</thead>
<tbody>
<tr id="uc_6042339">
<td class="expansion">...</td>
<td>
<div id="card_6042339_68587" class="cb">
TEXT I NEED TO READ
</td>
<td>...</td>
more td's
</tr>
<tr id="uc_6194934">...</tr>
<td class="expansion">...</td>
similar to the first <tr id="uc...">
I was able to get to the table using:
table_xpath = "//*[#id="content-wrapper"]/div[5]/table"
table_element = driver.find_element_by_xpath(table_xpath)
And I am trying to read the TEXT I NEED TO READ part for each unique <tr id="uc_unique number">. The id=uc_unique number changes periodically, so I cannot use find element by id.
Is there a way reach that element and read that specific text?
Looks like you can search via the anchor-element link (href-attribute), since I guess this will not change.
Via XPath:
yourText = table_element.find_element_by_xpath(".//a[@href='/blahsomelink']").text
UPDATE
OP mentioned that his link is also changing (with each call?), which means that the first approach is not for him.
If you want the text of the first row-element, you can try this:
yourText = table_element.find_element_by_xpath(".//tr[1]//a[@class='cl']").text
If you know, for example, that the link element is always in the second data-element of the first row and there is only one link-element, then you can do this:
yourText = table_element.find_element_by_xpath(".//tr[1]/td[2]//a").text
Unless you provide more detailed requirements as to what you are really searching for, this will have to suffice so far...
Another UPDATE
OP gave more info regarding his requirement:
I am trying to get the text in each row.
Given there is only one anchor-element with class cl in each tr element, you can do the following:
elements = table_element.find_elements_by_xpath(".//tr//a[@class='cl']")
for element in elements:
    row_text = element.text
Now you can do whatever you need with all these texts...
It looks like you have a few options.
If all you want is the first A, it might be as simple as
table_element.find_element_by_css_selector("a.cl")).text
or the little more specific
table_element.find_element_by_css_selector("div.cb > a.cl")).text
If you want all the As, try the find_elements_* versions of the above.
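For example, a minimal sketch that collects the text of every matching anchor using the plural form:
texts = [a.text for a in table_element.find_elements_by_css_selector("div.cb > a.cl")]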
I managed to find the elements I needed using .get_attribute("textContent") instead of .text, a tip from "Get Text from Span returns empty string".
I have the following HTML page where I am trying to locate the word Silver and keep count of how many instances were found.
Only two are showing here, but the page can generate more, so I don't know the exact count.
<tr id="au_4" class="odd">
<td>Silver</td>
</tr>
<tr id="au_4" class="even">
<td>Silver</td>
</tr>
This is what I tried but no luck:
count = driver.find_elements_by_xpath("//td[text()='Silver']")
count is a list of all elements that were found. In order to find its length, you should:
len(count)
I highly recommend you to go through the docs to better understand how Selenium works.
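Putting the two lines together, a minimal sketch:
silver_cells = driver.find_elements_by_xpath("//td[text()='Silver']")
print str(len(silver_cells)) + ' matches found'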
It would be quite a bit faster to retrieve the count via execute_script than by getting len from the result of a find_elements_by call:
script = "return $(\"td:contains('{text}')\").length".format(text="Silver")
count = driver.execute_script(script)
The sample above only works if the page uses jQuery; other ways of retrieving elements by text value can be found in the "How to get element by innerText" SO question...
I'm currently trying to extract information from a badly formatted web page. Specifically, the page has used the same id attribute for multiple table elements. The markup is equivalent to something like this:
<body>
<div id="random_div">
<p>Some content.</p>
<table id="table_1">
<tr>
<td>Important text 1.</td>
</tr>
</table>
<h4>Some heading in between</h4>
<table id="table_1">
<tr>
<td>Important text 2.</td>
<td>Important text 3.</td>
</tr>
</table>
<p>How about some more text here.</p>
<table id="table_1">
<tr>
<td>Important text 4.</td>
<td>Important text 5.</td>
</tr>
</table>
</div>
</body>
Clearly this is incorrectly formatted HTML, due to the multiple use of the same id for an element.
I'm using XPath to try and extract all the text in the various table elements, utilising the language through the Scrapy framework.
My call looks something like this:
hxs.select('//div[contains(@id, "random_div")]//table[@id="table_1"]//text()').extract()
Thus the XPath expression is:
//div[contains(@id, "random_id")]//table[@id="table_1"]//text()
This returns: [u'Important text 1.'], i.e., the contents of the first table that matches the id value "table_1". It seems to me that once it has come across an element with a certain id it ignores any future occurrences in the markup. Can anyone confirm this?
UPDATE
Thanks for the fast responses below. I have tested my code on a page hosted locally, which has the same test format as above and the correct response is returned, i.e.,
[u'Important text 1.', u'Important text 2.', . . . . , u'Important text 5.']
There is therefore nothing wrong with either the XPath expression or the Python calls I'm making.
I guess this means that there is a problem on the webpage itself which is either screwing up XPath or the html parser, which is libxml2.
Does anyone have any advice as to how I can dig into this a bit more?
UPDATE 2
I have successfully isolated the problem. It is actually with the underlying parsing library, which is lxml (which provides Python bindings for the libxml2 C library).
The problem is that the parser is unable to deal with vertical tabs. I have no idea who coded up the site I am dealing with, but it is full of vertical tabs. Web browsers seem to be able to ignore these, which is why the XPath queries run from Firebug on the site in question, for example, succeed.
Further, because the simplified example above doesn't contain vertical tabs, it works fine. For anyone who comes across this issue in Scrapy (or in Python generally), the following fix worked for me to remove vertical tabs from the HTML responses:
def parse_item(self, response):
    # remove all vertical tabs from the html response
    response.body = filter(lambda c: c != "\v", response.body)
    hxs = HtmlXPathSelector(response)
    items = hxs.select('//div[contains(@id, "random_div")]'
                       '//table[@id="table_1"]//text()').extract()
With Firebug, this expression:
//table[@id='table_1']//td/text()
gives me this:
[<TextNode textContent="Important text 1.">,
<TextNode textContent="Important text 2.">,
<TextNode textContent="Important text 3.">,
<TextNode textContent="Important text 4.">,
<TextNode textContent="Important text 5.">]
I included the td filtering to give a nicer result, since otherwise, you would get the whitespace and newlines between the tags. But all in all, it seems to work.
What I noticed was that you query for //div[contains(@id, "random_id")], while your HTML snippet has a tag that reads <div id="random_div"> -- the _id and _div being different. I don't know Scrapy so I can't really say if that does something, but couldn't that be your issue as well?
count(//div[@id="random_div"]/table[@id="table_1"])
This XPath returns 3 for your sample input, so your problem is not with the XPath itself but rather with the functions you use to extract the nodes.
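If you want to sanity-check that count outside Scrapy, a minimal sketch using lxml directly (page_source is assumed to hold the HTML shown above):
from lxml import html

tree = html.fromstring(page_source)
print tree.xpath('count(//div[@id="random_div"]/table[@id="table_1"])')  # prints 3.0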