I have stuck with regex syntax. I am trying to create a regex for html code, that looks for a specific string, which is located in a table and gives you back the next column value next to our search string.
[u'<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan \xe1llapota')
So I would like to create a regex to look for "Ingatlan \xe1llapota" and return "fel\xfaj\xedtott":
Ingatlan \xe1llapota fel\xfaj\xedtott
My current regex expression is the following: \bIngatlan állapota\s+(.*)
I would need to incorporate the td tags and to limit how long string would it return after the search string(Ingatlan állapota)
Any help is much appreciated. Thanks!
As pointed out before use xpath or css instead:
import scrapy
class txt_filter:
sterm='Ingatlan \xe1llapota'
txt= '''<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan </td></tr></table>
'''
resp = scrapy.http.response.text.TextResponse(body=txt,url='abc',encoding='utf-8')
print(resp.xpath('.//td[.="'+sterm+'"]/following-sibling::td[1]/text()').extract())
Result:
$ python3 so_51590811.py
['felújított']
Related
I am working on python Django templates in which I have a table having column as id, factor A, factor B, factor C. Values for id, factor A, factor B and factor C respectively are 79, 0.56, 1.1, 1.3.
The code for the html template is like this:
<table class="table table-bordered">
<thead>
<tr>
<th class="text-center">id</th>
<th class="text-center">Factor A</th>
<th class="text-center">Factor B</th>
<th class="text-center">Factor C</th>
</tr>
</thead>
<tbody >
<tr ng-class="{'info':aggregateData.Mode, 'closed':!aggregateData.Open}">
<td class="text-center">{{aggregateData.id}}</td>
<td class="text-center">{{aggregateData.factor_a}}</td>
<td class="text-center">{{aggregateData.factor_b}}</td>
<td class="text-center">{{aggregateData.factor_c}}</td>
</tr>
</tbody>
</table
I want to add a clickable icon to this similar like this for rows having aggregateData.Open true.
Can someone suggest a way how I can achieve this.
Try this.
<table class="table table-bordered">
<thead>
<tr>
<th class="text-center">id</th>
<th class="text-center">Factor A</th>
<th class="text-center">Factor B</th>
<th class="text-center">Factor C</th>
</tr>
</thead>
<tbody >
<tr ng-class="{'info':aggregateData.Mode, 'closed':!aggregateData.Open}">
<td class="text-center">{{aggregateData.id}}</td>
<td class="text-center">{{aggregateData.factor_a}}</td>
<td class="text-center">{{aggregateData.factor_b}}</td>
<td class="text-center">{{aggregateData.factor_c}}</td>
{% if aggregateData.open == True %}
<td class="text-center">
<a href="https://www.google.co.in">
<img src="/path_toicon.png">
</a>
</td>
{% endif %}
</tr>
</tbody>
</table>
I'm actually parsing website to extract data in Python using Xpaths.
But I don't know how to this thing :
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
I know I can do //tr[3] to get the thrid one. But how can I do to get all thirds ?
Use position funxtion and take the remainder of the division by 3. Because xpath understand zero as false, you can write
//tr[not(position() mod 3)]
You can use position() function:
//tr[position() mod 3 = 0]
I have a table that has these headers, like this:
How would I select the whole column using xpath to store in an array.
I was hoping for different arrays, like:
courses = []
teacher = []
avg = []
Bare in mind these column don't have any ID's or classes, so I need a way to select just by using the name of the column.
Here is the code for the table:
<table border="0">
<tbody>
<tr>
<td nowrap="nowrap">Courses</td>
<td nowrap="nowrap">Teacher</td>
<td><select name="fldMarkingPeriod" onchange="switchMarkingPeriod(this.value);">
<option value="MP1">MP1</option>
<option selected="selected" value="MP2">MP2</option>
<option value="MP3">MP3</option>
</select>Avg</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Any ideas? Thanks.
Not sure why exactly you need the data by columns, but here is a sample implementation:
courses = []
teachers = []
avgs = []
for row in table.find_elements_by_css("table > tbody > tr")[1:]:
course, teacher, _, avg = [td.text for td in row.find_elements_by_xpath(".//td")]
courses.append(course)
teachers.append(teacher)
avgs.append(avg)
I am using scrapy to extract data.
There are thousands of product which i am scraping
The problem is the data on these pages is not consistent
ie.
<table class="c999 fs12 mt10 f-bold">
<tbody><tr>
<td width="16%">Type</td>
<td class="c222">Kurta</td>
</tr>
<tr>
<td>Fabric</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>Sleeves</td>
<td class="c222">3/4th Sleeves</td>
</tr>
<tr>
<td>Neck</td>
<td class="c222">Mandarin Collar</td>
</tr>
<tr>
<td>Wash Care</td>
<td class="c222">Gentle Wash</td>
</tr>
<tr>
<td>Fit</td>
<td class="c222">Regular</td>
</tr>
<tr>
<td>Length</td>
<td class="c222">Knee Length</td>
</tr>
<tr>
<td>Color</td>
<td class="c222">Brown</td>
</tr>
<tr>
<td>Fabric Details</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>
Style </td>
<td class="c222"> Printed</td>
</tr>
<tr>
<td>
SKU </td>
<td id="qa-sku" class="c222"> SR227WA70ROJINDFAS</td>
</tr>
<tr>
<td></td>
</tr>
</tbody></table>
So these rows are not consistent .
Sometimes the "Type" is at first position and sometimes it is at second.
I wrote the code to loop through the values and compare the value of 1st td if it is "Type" the get the value of its corresponding td but it is not working
Here is the code.
table_data = response.xpath('//*[#id="productInfo"]/table/tr')
for data in table_data:
name = data.xpath('td/text()').extract()
What should i do??
You can try using the following xpath :
name = data.xpath("td[position()=(count(../../tr/td[.='Type']/preceding-sibling::td)+1)]/text()").extract()
Above xpath filters <td> by position, returning only <td> in position equal to position of <td>Type</td>. Getting position of <td>Type</td> done by counting number of it's preceding sibling <td> plus one.
If you want to get sibling node of td containing string 'Type' no matter what is position of this td you can try following xpath:
//td[contains(text(),'Type')]/following-sibling::td/text()
Try this,
In [29]: response.xpath('//table[#class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/following-sibling::td/text()|//table[#class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/preceding-sibling::td/text()').extract()
Out[29]: [u'Kurta']
no matter whether td is coming after Type or before Type, This will work.
//table/tbody/tr/td[.="Fabric"]/../td[2]/text()
Did it with the above code
I have a table formed like this from a website:
<table>
<tr class="head">
<td class="One">
Column 1
</td>
<td class="Two">
Column 2
</td>
<td class="Four">
Column 3
</td>
<td class="Five">
Column 4
</td>
</tr>
<tr class="DataSet1">
<td class="One">
<table>
<tr>
<td class="DataType1">
Data 1
</td>
</tr>
<tr>
<td class="DataType_2">
<ul>
<li> Data 2a</li>
<li> Data 2b</li>
<li> Data 2c</li>
<li> Data 2d</li>
</ul>
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 3
</td>
</tr>
<tr>
<td class="DataType_4">
Data 4
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 5
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 6
</td>
</tr>
</table>
</td>
</tr>
<tr class="Empty">
<td class="One">
</td>
<td class="Two">
</td>
<td class="Four">
</td>
<td class="Five">
</td>
</tr>
<tr class="DataSet2">
<td class="One">
<table>
<tr>
<td class="DataType_1">
Data 7
</td>
</tr>
<tr>
<td class="DataType_2">
Data 8
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 9
</td>
</tr>
<tr>
<td class="DataType_4">
Data 10
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 11
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 12
</td>
</tr>
</table>
</td>
</tr>
<!-- and so on -->
</table>
The tags sometimes are also empty, for example:
<td class="DataType_6> </td>
I tried to scrape the content with Scrapy and the following script:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem
class MySpider(BaseSpider):
name = "SpiderName"
allowed_domains = ["url"]
start_urls = ["url"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
rows = hxs.select('//tr')
items = []
item = ProjectItem()
item["Data_1"] = rows.select('//td[#class="DataType_1"]/text()').extract()
item["Data_2"] = rows.select('//td[#class="DataType_2"]/text()').extract()
item["Data_3"] = rows.select('//td[#class="DataType_3"]/text()').extract()
item["Data_4"] = rows.select('//td[#class="DataType_4"]/text()').extract()
item["Data_5"] = rows.select('//td[#class="DataType_5"]/text()').extract()
item["Data_6"] = rows.select('//td[#class="DataType_6"]/text()').extract()
items.append(item)
return items
If I crawl using this command:
scrapy crawl SpiderName -o output.csv -t csv
I only get crap like as many times as I have got the Dataset all the values for "Data_1".
Had a similar problem. First of all, rows = hxs.select('//tr') is going to loop on everything from the first child. You need to dig a bit deeper, and use relative paths. This link gives an excellent explanation on how to structure your code.
When I finally got my head around it, I realised that in that order to parse each item separately, row.select should not have the // in it.
Hope this helps.