Taking only 1 in 3 <tr> - Python

I'm parsing a website to extract data in Python using XPath.
But I don't know how to handle this structure:
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
<tr> </tr>
<tr> </tr>
<tr> Data </tr>
I know I can do //tr[3] to get the third one. But how can I get every third one?

Use the position() function and take the remainder of the division by 3. Because XPath treats zero as false, you can write:
//tr[not(position() mod 3)]

You can use the position() function:
//tr[position() mod 3 = 0]
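Both answers can be checked with a minimal, self-contained sketch (using lxml, which implements XPath 1.0; the table mirrors the one in the question):
from lxml import etree

doc = etree.fromstring("""
<table>
  <tr/> <tr/> <tr><td>Data 1</td></tr>
  <tr/> <tr/> <tr><td>Data 2</td></tr>
  <tr/> <tr/> <tr><td>Data 3</td></tr>
</table>
""")

# position() mod 3 is 0 exactly for rows 3, 6, 9, ..., and XPath treats 0 as false.
print([tr.xpath('string()').strip() for tr in doc.xpath('//tr[not(position() mod 3)]')])
print([tr.xpath('string()').strip() for tr in doc.xpath('//tr[position() mod 3 = 0]')])
# Both print: ['Data 1', 'Data 2', 'Data 3']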

Regex for a dynamic HTML table

I am stuck with regex syntax. I am trying to create a regex for HTML code that looks for a specific string, which is located in a table, and gives you back the value of the column next to the search string.
[u'<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan \xe1llapota')
So I would like to create a regex to look for "Ingatlan \xe1llapota" and return "fel\xfaj\xedtott":
Ingatlan \xe1llapota fel\xfaj\xedtott
My current regex expression is the following: \bIngatlan állapota\s+(.*)
I would need to incorporate the td tags and to limit how long a string it returns after the search string (Ingatlan állapota).
Any help is much appreciated. Thanks!
As pointed out before, use XPath or CSS instead:
import scrapy

sterm = 'Ingatlan \xe1llapota'
txt = '''<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan </td></tr></table>
'''
resp = scrapy.http.response.text.TextResponse(body=txt, url='abc', encoding='utf-8')
print(resp.xpath('.//td[.="' + sterm + '"]/following-sibling::td[1]/text()').extract())
Result:
$ python3 so_51590811.py
['felújított']
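If you are not inside a Scrapy project anyway, the same XPath works with plain lxml; a minimal sketch under that assumption, with the HTML shortened to the relevant row:
from lxml import html

sterm = 'Ingatlan \xe1llapota'
doc = html.fromstring('<table><tr><td>Ingatlan \xe1llapota</td><td>fel\xfaj\xedtott</td></tr></table>')

# Match the <td> whose full text equals the search term, then take the next <td>.
print(doc.xpath('.//td[.="' + sterm + '"]/following-sibling::td[1]/text()'))
# ['felújított']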

Use BeautifulSoup to fetch rows by header

I have an html structure like this one:
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
These attributes are not always present; sometimes I have only Brand, other times Brand and Flavoring.
To scrape this I wrote code like this:
BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows = soup.find('table', id='stats').find_all('tr')
bi = BlendInfo(brand=stats_rows[1].td.get_text(),
               type=stats_rows[2].td.get_text(),
               contents=stats_rows[3].td.get_text(),
               flavoring=stats_rows[4].td.get_text())
But as expected it fails with an index out of bounds error (or gets really mixed up) when the table ordering is different (Type before Brand) or some of the rows are missing (no Contents).
Is there any better approach? Something like:
Give me the data from the row whose header is the string 'Brand'.
It is definitely possible. Check this out:
from bs4 import BeautifulSoup
html_content='''
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html_content, "lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    header = item.text
    rows = item.find_next_sibling().text
    print(header, rows)
Output:
Brand 2 Guys Smoke Shop
Blend Type Aromatic
Contents Black Cavendish, Virginia
Flavoring Other / Misc
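If you need to look rows up by header, as the question asks, the same loop can fill a dict; a small sketch reusing the soup object parsed above:
# Map header text -> cell text; row order and missing rows no longer matter.
info = {th.get_text(strip=True): th.find_next_sibling().get_text(strip=True)
        for th in soup.find_all(class_='info')[0].find_all('th')}
print(info.get('Brand'))     # '2 Guys Smoke Shop'
print(info.get('Contents'))  # 'Black Cavendish, Virginia', or None if the row is absent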
This would build a dict for you:
from bs4 import BeautifulSoup
valid_headers = ['brand', 'blend type', 'contents', 'flavoring']
t = """<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>"""
bs = BeautifulSoup(t, "html.parser")
results = {}
for row in bs.findAll('tr'):
    hea = row.find('th')
    if hea and hea.get_text(strip=True).lower() in valid_headers:
        val = row.find('td')
        results[hea.get_text(strip=True)] = val.get_text(strip=True)
print(results)

XPath: get data if a condition is satisfied in Scrapy

I am using Scrapy to extract data.
There are thousands of products which I am scraping.
The problem is that the data on these pages is not consistent, i.e.:
<table class="c999 fs12 mt10 f-bold">
<tbody><tr>
<td width="16%">Type</td>
<td class="c222">Kurta</td>
</tr>
<tr>
<td>Fabric</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>Sleeves</td>
<td class="c222">3/4th Sleeves</td>
</tr>
<tr>
<td>Neck</td>
<td class="c222">Mandarin Collar</td>
</tr>
<tr>
<td>Wash Care</td>
<td class="c222">Gentle Wash</td>
</tr>
<tr>
<td>Fit</td>
<td class="c222">Regular</td>
</tr>
<tr>
<td>Length</td>
<td class="c222">Knee Length</td>
</tr>
<tr>
<td>Color</td>
<td class="c222">Brown</td>
</tr>
<tr>
<td>Fabric Details</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>
Style </td>
<td class="c222"> Printed</td>
</tr>
<tr>
<td>
SKU </td>
<td id="qa-sku" class="c222"> SR227WA70ROJINDFAS</td>
</tr>
<tr>
<td></td>
</tr>
</tbody></table>
So these rows are not consistent.
Sometimes "Type" is in the first position and sometimes in the second.
I wrote code to loop through the values and compare the value of the first td; if it is "Type", get the value of its corresponding td. But it is not working.
Here is the code.
table_data = response.xpath('//*[@id="productInfo"]/table/tr')
for data in table_data:
    name = data.xpath('td/text()').extract()
What should I do?
You can try using the following XPath:
name = data.xpath("td[position()=(count(../../tr/td[.='Type']/preceding-sibling::td)+1)]/text()").extract()
The above XPath filters <td> by position, returning only the <td> whose position equals the position of <td>Type</td>. The position of <td>Type</td> is obtained by counting its preceding-sibling <td> elements and adding one.
If you want to get the sibling of the td containing the string 'Type', no matter what position that td has, you can try the following XPath:
//td[contains(text(),'Type')]/following-sibling::td/text()
Try this,
In [29]: response.xpath('//table[@class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/following-sibling::td/text()|//table[@class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/preceding-sibling::td/text()').extract()
Out[29]: [u'Kurta']
No matter whether the td comes after or before Type, this will work.
//table/tbody/tr/td[.="Fabric"]/../td[2]/text()
I did it with the XPath above.
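Since the label/value pairs can appear in any order, another option is to read the whole table into a dict first and then look up "Type"; a sketch assuming the same productInfo table (the // before tr skips the optional <tbody>):
specs = {}
for row in response.xpath('//*[@id="productInfo"]/table//tr'):
    cells = [c.strip() for c in row.xpath('td//text()').extract() if c.strip()]
    if len(cells) == 2:  # skips the trailing empty row
        label, value = cells
        specs[label] = value
print(specs.get('Type'))  # 'Kurta', regardless of row order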

pandas to_html no value representation

When I run the line below, the NaN values in the dataframe do not get modified. Using the exact same argument with .to_csv(), I get the expected result. Does .to_html() require something different?
df.to_html('file.html', float_format='{0:.2f}'.format, na_rep="NA_REP")
It looks like the float_format doesn't play nice with na_rep. However, you can work around it if you pass a function to float_format that conditionally handles your NaNs along with the float formatting you want:
>>> df
  Group    Data
0     A  1.2225
1     A     NaN
Reproducing your problem:
>>> out = StringIO()
>>> df.to_html(out,na_rep="Ted",float_format='{0:.2f}'.format)
>>> out.getvalue()
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Group</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td> A</td>
<td>1.22</td>
</tr>
<tr>
<th>1</th>
<td> A</td>
<td> nan</td>
</tr>
</tbody>
So you get the proper float precision but not the correct na_rep. But the following seems to work:
>>> out = StringIO()
>>> fmt = lambda x: '{0:.2f}'.format(x) if pd.notnull(x) else 'Ted'
>>> df.to_html(out,float_format=fmt)
>>> out.getvalue()
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Group</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td> A</td>
<td>1.22</td>
</tr>
<tr>
<th>1</th>
<td> A</td>
<td> Ted</td>
</tr>
</tbody>
</table>
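For reference, a self-contained reproduction of the workaround (the dataframe is the two-row example from above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A'], 'Data': [1.2225, np.nan]})

# Handle NaN inside float_format itself, since na_rep is ignored for float columns.
fmt = lambda x: '{0:.2f}'.format(x) if pd.notnull(x) else 'NA_REP'
df.to_html('file.html', float_format=fmt)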

Scrape table with Scrapy

I have a table formed like this from a website:
<table>
<tr class="head">
<td class="One">
Column 1
</td>
<td class="Two">
Column 2
</td>
<td class="Four">
Column 3
</td>
<td class="Five">
Column 4
</td>
</tr>
<tr class="DataSet1">
<td class="One">
<table>
<tr>
<td class="DataType1">
Data 1
</td>
</tr>
<tr>
<td class="DataType_2">
<ul>
<li> Data 2a</li>
<li> Data 2b</li>
<li> Data 2c</li>
<li> Data 2d</li>
</ul>
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 3
</td>
</tr>
<tr>
<td class="DataType_4">
Data 4
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 5
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 6
</td>
</tr>
</table>
</td>
</tr>
<tr class="Empty">
<td class="One">
</td>
<td class="Two">
</td>
<td class="Four">
</td>
<td class="Five">
</td>
</tr>
<tr class="DataSet2">
<td class="One">
<table>
<tr>
<td class="DataType_1">
Data 7
</td>
</tr>
<tr>
<td class="DataType_2">
Data 8
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 9
</td>
</tr>
<tr>
<td class="DataType_4">
Data 10
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 11
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 12
</td>
</tr>
</table>
</td>
</tr>
<!-- and so on -->
</table>
The tags sometimes are also empty, for example:
<td class="DataType_6"> </td>
I tried to scrape the content with Scrapy and the following script:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem

class MySpider(BaseSpider):
    name = "SpiderName"
    allowed_domains = ["url"]
    start_urls = ["url"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//tr')
        items = []
        item = ProjectItem()
        item["Data_1"] = rows.select('//td[@class="DataType_1"]/text()').extract()
        item["Data_2"] = rows.select('//td[@class="DataType_2"]/text()').extract()
        item["Data_3"] = rows.select('//td[@class="DataType_3"]/text()').extract()
        item["Data_4"] = rows.select('//td[@class="DataType_4"]/text()').extract()
        item["Data_5"] = rows.select('//td[@class="DataType_5"]/text()').extract()
        item["Data_6"] = rows.select('//td[@class="DataType_6"]/text()').extract()
        items.append(item)
        return items
If I crawl using this command:
scrapy crawl SpiderName -o output.csv -t csv
I only get junk: a single item in which "Data_1" holds all the DataType_1 values from every DataSet (and likewise for the other fields), repeated as many times as there are DataSets.
Had a similar problem. First of all, rows = hxs.select('//tr') is going to loop over everything from the first child. You need to dig a bit deeper and use relative paths. This link gives an excellent explanation of how to structure your code.
When I finally got my head around it, I realised that, in order to parse each item separately, row.select should not have the // in it.
Hope this helps.
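Concretely, here is a sketch of what parse could look like with relative paths: iterate over the DataSet rows, and start every inner query with .// so it stays inside the current row. Class names are taken from the HTML above (note the question's first row uses DataType1 without an underscore, so you may need to normalise that):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # One item per data row; the header and empty rows are skipped.
    for row in hxs.select('//tr[starts-with(@class, "DataSet")]'):
        item = ProjectItem()
        # The leading .// keeps each query relative to the current row.
        item["Data_1"] = row.select('.//td[@class="DataType_1"]/text()').extract()
        item["Data_2"] = row.select('.//td[@class="DataType_2"]//text()').extract()
        item["Data_3"] = row.select('.//td[@class="DataType_3"]/text()').extract()
        item["Data_4"] = row.select('.//td[@class="DataType_4"]/text()').extract()
        item["Data_5"] = row.select('.//td[@class="DataType_5"]/text()').extract()
        item["Data_6"] = row.select('.//td[@class="DataType_6"]/text()').extract()
        items.append(item)
    return items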
