Xpath get data if conditions is satisfied in scrapy - python

I am using scrapy to extract data.
There are thousands of product which i am scraping
The problem is the data on these pages is not consistent
ie.
<table class="c999 fs12 mt10 f-bold">
<tbody><tr>
<td width="16%">Type</td>
<td class="c222">Kurta</td>
</tr>
<tr>
<td>Fabric</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>Sleeves</td>
<td class="c222">3/4th Sleeves</td>
</tr>
<tr>
<td>Neck</td>
<td class="c222">Mandarin Collar</td>
</tr>
<tr>
<td>Wash Care</td>
<td class="c222">Gentle Wash</td>
</tr>
<tr>
<td>Fit</td>
<td class="c222">Regular</td>
</tr>
<tr>
<td>Length</td>
<td class="c222">Knee Length</td>
</tr>
<tr>
<td>Color</td>
<td class="c222">Brown</td>
</tr>
<tr>
<td>Fabric Details</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>
Style </td>
<td class="c222"> Printed</td>
</tr>
<tr>
<td>
SKU </td>
<td id="qa-sku" class="c222"> SR227WA70ROJINDFAS</td>
</tr>
<tr>
<td></td>
</tr>
</tbody></table>
So these rows are not consistent .
Sometimes the "Type" is at first position and sometimes it is at second.
I wrote the code to loop through the values and compare the value of 1st td if it is "Type" the get the value of its corresponding td but it is not working
Here is the code.
table_data = response.xpath('//*[#id="productInfo"]/table/tr')
for data in table_data:
name = data.xpath('td/text()').extract()
What should i do??

You can try using the following xpath :
name = data.xpath("td[position()=(count(../../tr/td[.='Type']/preceding-sibling::td)+1)]/text()").extract()
Above xpath filters <td> by position, returning only <td> in position equal to position of <td>Type</td>. Getting position of <td>Type</td> done by counting number of it's preceding sibling <td> plus one.

If you want to get sibling node of td containing string 'Type' no matter what is position of this td you can try following xpath:
//td[contains(text(),'Type')]/following-sibling::td/text()

Try this,
In [29]: response.xpath('//table[#class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/following-sibling::td/text()|//table[#class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/preceding-sibling::td/text()').extract()
Out[29]: [u'Kurta']
no matter whether td is coming after Type or before Type, This will work.

//table/tbody/tr/td[.="Fabric"]/../td[2]/text()
Did it with the above code

Related

How to get text from nested html table with beautifulsoup?

Within each of the main tables respectively, there are two tables nested of which the first one contains the data A_A_A_A that i want to extract to a pandas.dataframe
<table>
<tr valign="top">
<td> </td>
<td>
<br/>
<center>
<h2>asd</h2>
</center>
<h4>asd</h4>
<table>
<tr>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
<tr>
<td> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" width="100%">
<tr>
<td width="2%"> </td>
<td> A_A_A_A <br/> A_A_A_A 111-222<br/> </td>
<td width="2%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td width="4%"> </td>
<td class="unique"> asd <br/> asd </td>
<td width="4%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td> </td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
.
.
.
</table>
<br/>
<table>
</table>
</td>
</tr>
</table>
I figured that because of the limited availiability of attributes the only way to go forward would be an iteration over a td siblings with .next_siblings and if needed .next_elements
data1 = []
for item in soup.find_all('td', attrs={'width': '2%'}):
data = item.find_next_sibling().text
data1.append(data)
returns and empty list []. Now i dont know forward because i cannot identify any other helpful attributes/classes that would help me get to the middle td that contains the information.
.find_next(name=None, attrs={}, text=None, **kwargs)
Returns the first item that matches the given criteria and appears after this Tag in the document. So in your case:
item = soup.find('td', attrs={'width': '2%'})
data = item.find_next('td').text
Note that, I removed for loop since the desired data is coming after first td with width: '2%'. After running this, data will be:
' A_A_A_A A_A_A_A 111-222 '
I took #Wiktor Stribiżew answer from here regex for loop over list in python
and kind of merged it with yours #Rustam Garayev
item = soup.find_all('td', attrs={'width': '2%'})
data = [x.find_next('td').text for x in item]
since i needed not only the first AAAA but from all the following tables as well. The code above gives this output:
['A_A_A_A',
'\xa0',
'A_A_A_A',
'\xa0', ...]
which is good enough for my purpose. I think the '\xa0' comes from it trying to do the find_next on the third td sibling, which does not have a consecutive.

Regex html dynamic table

I have stuck with regex syntax. I am trying to create a regex for html code, that looks for a specific string, which is located in a table and gives you back the next column value next to our search string.
[u'<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan \xe1llapota')
So I would like to create a regex to look for "Ingatlan \xe1llapota" and return "fel\xfaj\xedtott":
Ingatlan \xe1llapota fel\xfaj\xedtott
My current regex expression is the following: \bIngatlan állapota\s+(.*)
I would need to incorporate the td tags and to limit how long string would it return after the search string(Ingatlan állapota)
Any help is much appreciated. Thanks!
As pointed out before use xpath or css instead:
import scrapy
class txt_filter:
sterm='Ingatlan \xe1llapota'
txt= '''<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan </td></tr></table>
'''
resp = scrapy.http.response.text.TextResponse(body=txt,url='abc',encoding='utf-8')
print(resp.xpath('.//td[.="'+sterm+'"]/following-sibling::td[1]/text()').extract())
Result:
$ python3 so_51590811.py
['felújított']

Iterating Through Table Rows in Selenium (Python)

I have a webpage with a table that only appears when I click 'Inspect Element' and is not visible through the View Source page. The table contains only two rows with several cells each and looks similar to this:
<table class="datadisplaytable">
<tbody>
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</tbody>
</table>
What I'm trying to do is to iterate through the rows and return the text contained in each cell. I can't really seem to do it with Selenium. The elements contain no IDs and I'm not sure how else to get them. I'm not very familiar with using xpaths and such.
Here is a debugging attempt that returns a TypeError:
def check_grades(self):
table = []
for i in self.driver.find_element_by_class_name("dddefault"):
table.append(i)
print(table)
What is an easy way to get the text from the rows?
XPath is fragile. It's better to use CSS selectors or classes:
mytable = find_element_by_css_selector('table.datadisplaytable')
for row in mytable.find_elements_by_css_selector('tr'):
for cell in row.find_elements_by_tag_name('td'):
print(cell.text)
If you want to go row by row using an xpath, you can use the following:
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault'][text()])
Which outputs:
['16759', 'MATH', '123', '001', 'Calculus']
['16449', 'PHY', '456', '002', 'Physics']
Using td[text()] will avoid getting any Nones returned for the td's that hold no text.
So to do the same using selenium you would:
table = driver.find_element_by_xpath("//table[#class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
print([td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault'][1]"])
For multiple tables:
def get_row_data(table):
for row in table.find_elements_by_xpath(".//tr"):
yield [td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault'][text()]"])
for table in driver.find_elements_by_xpath("//table[#class='datadisplaytable']"):
for data in get_row_data(table):
# use the data
Correction of the Selenium part of #Padraic Cunningham's answer:
table = driver.find_element_by_xpath("//table[#class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
print([td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault']")])
Note: there was one missing round bracket at the end; also removed the [1] index, to match the first XML example.
Another note: Though, the example with the index [1] should also be preserved, to show how to extract individual elements.
Another Version (modified and corrected post by Padraic Cunningham):
Tested with Python 3.x
#!/usr/bin/python
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault']")])

Taking only 1 on 3 <tr>

I'm actually parsing website to extract data in Python using Xpaths.
But I don't know how to this thing :
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
I know I can do //tr[3] to get the thrid one. But how can I do to get all thirds ?
Use position funxtion and take the remainder of the division by 3. Because xpath understand zero as false, you can write
//tr[not(position() mod 3)]
You can use position() function:
//tr[position() mod 3 = 0]

Python Selenium Copy Table Columns by Column Name

I have a table that has these headers, like this:
How would I select the whole column using xpath to store in an array.
I was hoping for different arrays, like:
courses = []
teacher = []
avg = []
Bare in mind these column don't have any ID's or classes, so I need a way to select just by using the name of the column.
Here is the code for the table:
<table border="0">
<tbody>
<tr>
<td nowrap="nowrap">Courses</td>
<td nowrap="nowrap">Teacher</td>
<td><select name="fldMarkingPeriod" onchange="switchMarkingPeriod(this.value);">
<option value="MP1">MP1</option>
<option selected="selected" value="MP2">MP2</option>
<option value="MP3">MP3</option>
</select>Avg</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Any ideas? Thanks.
Not sure why exactly you need the data by columns, but here is a sample implementation:
courses = []
teachers = []
avgs = []
for row in table.find_elements_by_css("table > tbody > tr")[1:]:
course, teacher, _, avg = [td.text for td in row.find_elements_by_xpath(".//td")]
courses.append(course)
teachers.append(teacher)
avgs.append(avg)

Categories

Resources