I have a Selenium Python script that reads a table on a page. The table has 3 columns; the first is a list of IDs and the 3rd is a checkbox. I iterate through the IDs until I find the one I want, then click the corresponding checkbox and save. It works fine but is very slow, as the table can be 4K rows.
This is the current code (self.questionID is a dictionary with the IDs I'm looking for):
for k, v in self.questionID.items():
    foundQuestion = False
    i = 1
    while foundQuestion is False:
        questionIndex = driver.find_element_by_xpath('/html/body/div[1]/form/table[2]/tbody/tr/td[1]/table/tbody/tr/td/fieldset[2]/div/table[1]/tbody/tr/td/table/tbody[%d]/tr/td[1]' % i).text
        if questionIndex.strip() == k:
            d = i - 1
            driver.find_element_by_name('selectionIndex[%d]' % d).click()
            foundQuestion = True
        i += 1
This is a sample of the table, just the first couple of rows:
<thead>
<tr>
<th class="first" width="5%">ID</th>
<th width="90%">Question</th>
<th class="last" width="1%"> </th>
</tr>
</thead>
<tbody>
<tr>
<td class="rowodd">AG001 </td>
<td class="rowodd">Foo: </td>
<td class="rowodd"><input class="input" name="selectionIndex[0]" tabindex="30" type="checkbox"></td>
</tr>
</tbody>
<tbody>
<tr>
<td class="roweven">AG002 </td>
<td class="roweven">Bar </td>
<td class="roweven"><input class="input" name="selectionIndex[1]" tabindex="30" type="checkbox"></td>
</tr>
</tbody>
As you can probably guess I'm no python ninja. Is there a quicker way to read this table and locate the correct row?
You can find the relevant checkbox in one go by using an XPath expression that locates the first-column cell by its text (note the trailing space in the cell, hence normalize-space()) and then gets its following-sibling td and the input inside it:
checkbox = driver.find_element_by_xpath('//tr/td[1][(@class="rowodd" or @class="roweven") and normalize-space(text()) = "%s"]/following-sibling::td[2]/input[starts-with(@name, "selectionIndex")]' % k)
checkbox.click()
Note that it would throw a NoSuchElementException if the question and its related checkbox are not found. You probably need to catch the exception:
from selenium.common.exceptions import NoSuchElementException

try:
    checkbox = driver.find_element_by_xpath('//tr/td[1][(@class="rowodd" or @class="roweven") and normalize-space(text()) = "%s"]/following-sibling::td[2]/input[starts-with(@name, "selectionIndex")]' % k)
    checkbox.click()
except NoSuchElementException:
    # question not found - need to handle it, or just move on?
    pass
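A rough sketch of how this could plug into the loop from the question (same XPath as above; driver and self.questionID are assumed from the original code):

from selenium.common.exceptions import NoSuchElementException

XPATH = ('//tr/td[1][(@class="rowodd" or @class="roweven") and '
         'normalize-space(text()) = "%s"]'
         '/following-sibling::td[2]/input[starts-with(@name, "selectionIndex")]')

for k, v in self.questionID.items():
    try:
        # one XPath lookup per ID instead of walking every row
        driver.find_element_by_xpath(XPATH % k).click()
    except NoSuchElementException:
        pass  # this ID is not in the table - decide how to handle it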
I am a newbie at web scraping and I have been stuck on an issue; I have tried and searched but found nothing. I want to extract data from the table. The problem is that the tr and td elements don't have distinguishing attributes, just a class, and <td class="review-value"> is the same for different values, so I don't know how to separate them. I need lists with every single value, for example:
list_1 = [aircraft1, aircraft2, aircraft3, ..]
list_2 = [type_of_traveller1, type_of_traveller2, ...]
list_3 = [cabin_flown1, cabin_flown2,...]
list_4 = [route1, route2,...]
list_5 = [date_flown1, date_flown2, ..]
This is the table code:
<table class="review-ratings">
<tr>
<td class="review-rating-header aircraft">Aircraft</td>
<td class="review-value">Boeing 787-9</td>
</tr>
<tr>
<td class="review-rating-header type_of_traveller">Type Of Traveller</td>
<td class="review-value">Couple Leisure</td>
</tr>
<tr>
<td class="review-rating-header cabin_flown">Seat Type</td>
<td class="review-value">Business Class</td>
</tr>
<tr>
<td class="review-rating-header route">Route</td>
<td class="review-value">Mexico City to London</td>
</tr>
<tr>
<td class="review-rating-header date_flown">Date Flown</td>
<td class="review-value">February 2023</td>
</tr>
</table>
I am using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('article')
review_table = table.find_all('table', class_='review-ratings')
find_tr_header = table.find_all('td', class_='review-rating-header')
headers = []
for i in find_tr_header:
    headers.append(i.text.strip())
And I don't know what to do with class="review-value".
As I can see, in your table each header cell has a .review-value cell following it (a direct sibling).
So what you can do is use the + (adjacent sibling) selector in CSS.
For instance, .aircraft + .review-value will give you the value of the aircraft.
In Beautiful Soup you can even avoid this type of selector, since there are built-in methods available: check find_next_sibling().
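A minimal sketch of both approaches, assuming page is the requests response from the question (the list names are just placeholders):

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")

# CSS adjacent-sibling selector: the value cell that directly follows each header cell
list_1 = [td.get_text(strip=True) for td in soup.select(".aircraft + .review-value")]

# The same idea with the built-in sibling methods
list_2 = [
    td.find_next_sibling("td", class_="review-value").get_text(strip=True)
    for td in soup.find_all("td", class_="type_of_traveller")
]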
I create an HTML rendering of a table from a dataframe and I would like to make it mobile friendly. One solution I found would require adding attributes to the <td> tags for each column entry (as per this suggestion https://codemyui.com/pure-css-responsive-table/ of using a CSS file that decides when to switch from a table view to a kind of list view).
So basically I would like to get this code
import pandas as pd
df = pd.DataFrame({'H1': [1,2], 'H2':[10,20]})
print(df.to_html())
to produce something like this:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>H1</th>
<th>H2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td data-column="H1">1</td>
<td data-column="H2">10</td>
</tr>
<tr>
<th>1</th>
<td data-column="H1">2</td>
<td data-column="H2">20</td>
</tr>
</tbody>
</table>
If this is not the way to go I would be happy to hear any suggestions.
You can do something like this (assuming the columns only go up to H2):
tab_html = df.to_html()
ntd = tab_html.count("<td>")                                   # number of data cells
tab_html = tab_html.replace('<td>', '<td data-column="H{}">')  # add a placeholder to each cell
colindex = [i % 2 + 1 for i in range(ntd)]                     # 1, 2, 1, 2, ... per row
print(tab_html.format(*colindex))
This will work as long as H1 and H2 are the same length and {} does not appear elsewhere in your HTML. If you want to increase the number of keys (say, up to H4), just change the number in i%2 to the number of keys.
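If you'd rather use the real column names (and not hard-code how many there are), a variant of the same idea is to cycle through df.columns; this is just a sketch, with the same caveat that {} must not appear elsewhere in the HTML:

from itertools import cycle

import pandas as pd

df = pd.DataFrame({'H1': [1, 2], 'H2': [10, 20]})
tab_html = df.to_html().replace('<td>', '<td data-column="{}">')
names = cycle(df.columns)                 # H1, H2, H1, H2, ...
ntd = tab_html.count('data-column="{}"')  # one slot per data cell
print(tab_html.format(*[next(names) for _ in range(ntd)]))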
I'm trying to extract the value 1 from the table with Selenium, but I'm not finding a good way to do it.
<td width="1%" style="text-align: right">1</td>
Here is what the page's HTML looks like:
<tr class="linhaPar" onMouseOver="javascript:this.style.backgroundColor='#C4D2EB'" onMouseOut="javascript:this.style.backgroundColor=''">
<td>
Scientific American
</td>
<td>
A Base Molecular da Vida Uma Introducao a Biologia Molecular
</td>
<td>
</td>
<td>
<table width="100%">
<tbody style="background-color: transparent;">
<tr>
<td>
1971
</td>
</tr>
</tbody>
</table>
</td>
<td width="1%" style="text-align: right">
1
</td>
<td width="1%">
<a id="formBuscaPublica:ClinkView" href="#" onclick="if(typeof jsfcljs == 'function'){jsfcljs(document.getElementById('formBuscaPublica'),{'formBuscaPublica:ClinkView':'formBuscaPublica:ClinkView','idTitulo':'39117','idsBibliotecasAcervoPublicoFormatados':'47_46','apenasSituacaoVisivelUsuarioFinal':'true'},'');}return false"><img id="formBuscaPublica:ImageView" src="/sigaa/img/view.gif" style="border:none" title="Visualizar Informações dos Materiais Informacionais" /></a>
</td>
I've tried using this code, but it didn't work at all.
x = browser.find_elements_by_xpath('//*[@id="listagem"]/tbody/tr[1]/td[5]/').text
Thanks!
Try the following XPath:
x = driver.find_element_by_xpath('//tr[@class="linhaPar" and contains(.,"Scientific American")]//td[contains(@style, "text-align")]').text
print(x)
Note:
Use .find_element, not .find_elements; the latter returns a list, which has no .text attribute.
Here's how I would do it: I created a reusable function that returns the first element matching a given tag and attributes.
def getElementByTagAndAttributes(browser, tag, **kwargs):
    for element in browser.find_elements_by_tag_name(tag):
        for key, value in kwargs.items():
            attribute = element.get_attribute(key)
            if attribute != value:
                break
        else:
            # the inner loop finished without a mismatch, so every attribute matched
            return element

x = getElementByTagAndAttributes(browser, "td", width="1%", style="text-align: right").text
Since it's a table structure, the data is laid out in rows and columns, so you can find a value based on other known data in the same row. In your case, say you want to retrieve the value 1 based on "Scientific American"; then go with the XPath below:
x = browser.find_element_by_xpath("//tr/td[contains(.,'Scientific American')]/following-sibling::td[4]").text
To extract the text 1 from the element:
<td width="1%" style="text-align: right">1</td>
You can use either of the following XPath-based solutions:
Using the text Scientific American:
print(browser.find_element_by_xpath("//td[contains(., 'Scientific American')]//following::td[3]//following-sibling::td[1]").text)
Using the text A Base Molecular da Vida Uma Introducao a Biologia Molecular:
print(browser.find_element_by_xpath("//td[contains(., 'A Base Molecular da Vida Uma Introducao a Biologia Molecular')]//following::td[2]//following-sibling::td[1]").text)
I have a big, long table in an HTML page, where the rows simply follow one another rather than being nested within each other. It looks like this:
<tr>
<td>A</td>
</tr>
<tr>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
</tr>
<tr>
<td class ="y">...</td>
<td class ="y">...</td>
<td class ="y">...</td>
<td class ="y">...</td>
</tr>
<tr>
<td>B</td>
</tr>
<tr>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
</tr>
<tr>
<td class ="y">I want this</td>
<td class ="y">and this</td>
<td class ="y">and this</td>
<td class ="y">and this</td>
</tr>
So first I want to search the tree to find "B". Then I want to grab the text of every td tag with class y after B, but before the next section of the table starts over with "C".
I've tried this:
results = soup.find_all('td')
for result in results:
    if result.string == "B":
        print(result.string)
This gets me the string B that I want, but when I try to find everything after it, I'm not getting what I want.
for results in soup.find_all('td'):
    if results.string == 'B':
        a = results.find_next('td', class_='y')
This gives me the next td after the 'B', which is what I want, but I can only seem to get that first td tag. I want to grab all of the tags that have class y, after 'B' but before 'C' (C isn't shown in the HTML, but follows the same pattern), and append them to a list.
My resulting list would be:
[['I want this'],['and this'],['and this'],['and this']]
Basically, you need to locate the element containing B text. This is your starting point.
Then, check every tr sibling of this element using find_next_siblings():
start = soup.find("td", text="B").parent
for tr in start.find_next_siblings("tr"):
    # exit if reached C
    if tr.find("td", text="C"):
        break
    # get all tds with a desired class
    tds = tr.find_all("td", class_="y")
    for td in tds:
        print(td.get_text())
Tested on your example data, it prints:
I want this
and this
and this
and this
I am using Scrapy to extract data.
There are thousands of products that I am scraping.
The problem is that the data on these pages is not consistent, i.e.:
<table class="c999 fs12 mt10 f-bold">
<tbody><tr>
<td width="16%">Type</td>
<td class="c222">Kurta</td>
</tr>
<tr>
<td>Fabric</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>Sleeves</td>
<td class="c222">3/4th Sleeves</td>
</tr>
<tr>
<td>Neck</td>
<td class="c222">Mandarin Collar</td>
</tr>
<tr>
<td>Wash Care</td>
<td class="c222">Gentle Wash</td>
</tr>
<tr>
<td>Fit</td>
<td class="c222">Regular</td>
</tr>
<tr>
<td>Length</td>
<td class="c222">Knee Length</td>
</tr>
<tr>
<td>Color</td>
<td class="c222">Brown</td>
</tr>
<tr>
<td>Fabric Details</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>
Style </td>
<td class="c222"> Printed</td>
</tr>
<tr>
<td>
SKU </td>
<td id="qa-sku" class="c222"> SR227WA70ROJINDFAS</td>
</tr>
<tr>
<td></td>
</tr>
</tbody></table>
So these rows are not consistent: sometimes "Type" is in the first position and sometimes it is in the second.
I wrote code to loop through the rows and compare the value of the 1st td; if it is "Type", get the value of its corresponding td, but it is not working.
Here is the code.
table_data = response.xpath('//*[@id="productInfo"]/table/tr')
for data in table_data:
    name = data.xpath('td/text()').extract()
What should I do?
You can try using the following XPath:
name = data.xpath("td[position()=(count(../../tr/td[.='Type']/preceding-sibling::td)+1)]/text()").extract()
The above XPath filters <td> by position, returning only the <td> whose position equals the position of <td>Type</td>. The position of <td>Type</td> is obtained by counting the number of its preceding-sibling <td> elements plus one.
If you want to get the sibling node of the td containing the string 'Type', no matter what the position of this td is, you can try the following XPath:
//td[contains(text(),'Type')]/following-sibling::td/text()
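If several of these fields are needed, another option (not from the original answer, just a sketch reusing the productInfo selector from the question) is to read each row once and build a label-to-value dict, so the position of "Type" no longer matters:

details = {}
for row in response.xpath('//*[@id="productInfo"]/table//tr'):
    cells = [t.strip() for t in row.xpath('td//text()').extract() if t.strip()]
    if len(cells) == 2:
        label, value = cells
        details[label] = value

item_type = details.get('Type')  # 'Kurta' for the sample table above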
Try this,
In [29]: response.xpath('//table[#class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/following-sibling::td/text()|//table[#class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/preceding-sibling::td/text()').extract()
Out[29]: [u'Kurta']
No matter whether the td comes after Type or before Type, this will work.
//table/tbody/tr/td[.="Fabric"]/../td[2]/text()
I did it with the XPath above.
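For reference, a sketch of how that expression might be used on a scrapy response (the value comes from the sample table above):

fabric = response.xpath('//table/tbody/tr/td[.="Fabric"]/../td[2]/text()').extract_first()
# 'Cotton' for the sample table shown in the question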