Within each of the main tables there are two nested tables, the first of which contains the data A_A_A_A that I want to extract into a pandas DataFrame.
<table>
<tr valign="top">
<td> </td>
<td>
<br/>
<center>
<h2>asd</h2>
</center>
<h4>asd</h4>
<table>
<tr>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
<tr>
<td> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" width="100%">
<tr>
<td width="2%"> </td>
<td> A_A_A_A <br/> A_A_A_A 111-222<br/> </td>
<td width="2%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td width="4%"> </td>
<td class="unique"> asd <br/> asd </td>
<td width="4%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td> </td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
.
.
.
</table>
<br/>
<table>
</table>
</td>
</tr>
</table>
I figured that, because of the limited availability of attributes, the only way forward would be to iterate over the td siblings with .next_siblings and, if needed, .next_elements:
data1 = []
for item in soup.find_all('td', attrs={'width': '2%'}):
    data = item.find_next_sibling().text
    data1.append(data)
This returns an empty list []. Now I don't know how to go forward, because I cannot identify any other helpful attributes/classes that would get me to the middle td that contains the information.
.find_next(name=None, attrs={}, text=None, **kwargs)
Returns the first item that matches the given criteria and appears after this Tag in the document. So in your case:
item = soup.find('td', attrs={'width': '2%'})
data = item.find_next('td').text
Note that I removed the for loop, since the desired data comes right after the first td with width='2%'. After running this, data will be:
' A_A_A_A A_A_A_A 111-222 '
I took @Wiktor Stribiżew's answer from here: regex for loop over list in python,
and kind of merged it with yours, @Rustam Garayev:
item = soup.find_all('td', attrs={'width': '2%'})
data = [x.find_next('td').text for x in item]
since I needed not only the first A_A_A_A but the ones from all the following tables as well. The code above gives this output:
['A_A_A_A',
'\xa0',
'A_A_A_A',
'\xa0', ...]
which is good enough for my purpose. I think the '\xa0' comes from it trying to do the find_next on the third td sibling, which does not have a following td.
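From there, getting the originally intended pandas DataFrame is just a matter of dropping the '\xa0' placeholders. A minimal sketch (the column name 'value' is an arbitrary choice):
import pandas as pd

# drop the non-breaking-space placeholders, then strip stray whitespace
cleaned = [x.strip() for x in data if x != '\xa0']
df = pd.DataFrame(cleaned, columns=['value'])
print(df)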
Is it possible to capture all EAN numbers in such a construct using XPath, or do I need to use regular expressions?
<table>
<tr>
<td>
EAN Giftbox
</td>
<td>
7350034654483
</td>
</tr>
<tr>
<td>
EAN Export Carton:
</td>
<td>
17350034643958
</td>
</tr>
</table>
I want to get a list of ['7350034654483', '17350034643958']
from lxml import html as lh
html = """<table>
<tr>
<td>
EAN Giftbox
</td>
<td>
7350034654483
</td>
</tr>
<tr>
<td>
EAN Export Carton:
</td>
<td>
17350034643958
</td>
</tr>
</table>
"""
root = lh.fragment_fromstring(html)
tds = root.xpath('//tr[*]/td[2]')
for td in tds:
    print(td.text.strip())
Output:
7350034654483
17350034643958
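If the value cell's position ever varies, you could anchor on the label text instead of the column index. A sketch assuming every label starts with 'EAN':
# select the td immediately following any td whose text starts with 'EAN'
values = root.xpath(
    "//td[starts-with(normalize-space(.), 'EAN')]"
    "/following-sibling::td[1]/text()")
print([v.strip() for v in values])  # ['7350034654483', '17350034643958']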
I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better.
I have extracted the HTML, which is shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to parse with find_all('tbody') but was unsuccessful:
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id': 'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values from the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
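The error message already points at the problem: find_all() returns a ResultSet (a list of tags), and a list has no find(). A minimal fix for the code above, keeping everything else as it was:
table = soup.find('table', {'id': 'ContentPlaceHolder1_dlDetails'})  # a single Tag, not a list
data = []
for row in table.select('tr'):
    cols = [td.text.strip() for td in row.find_all('td')]
    data.append([c for c in cols if c])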
Another way to do this, using next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text))
        for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child')
        if item.text != '']
print(data)
from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
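The same pairing can just as well be collected into a dict, which is usually handier than printing:
# label -> value mapping from the alternating cells
details = {td1.text.strip(): td2.text.strip() for td1, td2 in zip(s[::2], s[1::2])}
# {'ATM ID:': 'DAGR00401111111', 'ATM Centre:': '', 'Site Location:': 'ADA Building - Agra'}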
I have a table with these headers: Courses, Teacher, Avg (see the HTML below).
How would I select a whole column using XPath and store it in an array?
I was hoping for different arrays, like:
courses = []
teacher = []
avg = []
Bear in mind these columns don't have any IDs or classes, so I need a way to select them just by using the name of the column.
Here is the code for the table:
<table border="0">
<tbody>
<tr>
<td nowrap="nowrap">Courses</td>
<td nowrap="nowrap">Teacher</td>
<td><select name="fldMarkingPeriod" onchange="switchMarkingPeriod(this.value);">
<option value="MP1">MP1</option>
<option selected="selected" value="MP2">MP2</option>
<option value="MP3">MP3</option>
</select>Avg</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Any ideas? Thanks.
Not sure why exactly you need the data by columns, but here is a sample implementation:
courses = []
teachers = []
avgs = []
for row in table.find_elements_by_css_selector("table > tbody > tr")[1:]:
    # each data row has three direct td children: course, teacher, and the
    # cell holding the nested avg table
    course, teacher, avg = [td.text for td in row.find_elements_by_xpath("./td")]
    courses.append(course)
    teachers.append(teacher)
    avgs.append(avg)
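If you really need to select the column purely by its header name, as asked, one way is to locate the header cell's index first and reuse it for the data rows. A rough sketch with the same old-style Selenium API as above, where table is the table WebElement:
header_cells = table.find_elements_by_xpath("./tbody/tr[1]/td")
# 1-based index of the column whose header starts with 'Teacher'
idx = next(i for i, td in enumerate(header_cells, 1)
           if td.text.strip().startswith("Teacher"))
teachers = [td.text for td in
            table.find_elements_by_xpath("./tbody/tr[position() > 1]/td[%d]" % idx)]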
I am using Scrapy to extract data.
There are thousands of products that I am scraping.
The problem is that the data on these pages is not consistent,
i.e.:
<table class="c999 fs12 mt10 f-bold">
<tbody><tr>
<td width="16%">Type</td>
<td class="c222">Kurta</td>
</tr>
<tr>
<td>Fabric</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>Sleeves</td>
<td class="c222">3/4th Sleeves</td>
</tr>
<tr>
<td>Neck</td>
<td class="c222">Mandarin Collar</td>
</tr>
<tr>
<td>Wash Care</td>
<td class="c222">Gentle Wash</td>
</tr>
<tr>
<td>Fit</td>
<td class="c222">Regular</td>
</tr>
<tr>
<td>Length</td>
<td class="c222">Knee Length</td>
</tr>
<tr>
<td>Color</td>
<td class="c222">Brown</td>
</tr>
<tr>
<td>Fabric Details</td>
<td class="c222">Cotton</td>
</tr>
<tr>
<td>
Style </td>
<td class="c222"> Printed</td>
</tr>
<tr>
<td>
SKU </td>
<td id="qa-sku" class="c222"> SR227WA70ROJINDFAS</td>
</tr>
<tr>
<td></td>
</tr>
</tbody></table>
So these rows are not consistent.
Sometimes "Type" is in the first position and sometimes in the second.
I wrote code to loop through the rows and compare the value of the first td; if it is "Type", get the value of its corresponding td. But it is not working.
Here is the code.
table_data = response.xpath('//*[@id="productInfo"]/table/tr')
for data in table_data:
    name = data.xpath('td/text()').extract()
What should I do?
You can try using the following XPath:
name = data.xpath("td[position()=(count(../../tr/td[.='Type']/preceding-sibling::td)+1)]/text()").extract()
The above XPath filters the <td> elements by position, returning only the <td> whose position equals that of <td>Type</td>. The position of <td>Type</td> is obtained by counting its preceding sibling <td> elements and adding one.
If you want to get the sibling node of the td containing the string 'Type', no matter what position that td is in, you can try the following XPath:
//td[contains(text(),'Type')]/following-sibling::td/text()
Try this,
In [29]: response.xpath('//table[@class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/following-sibling::td/text()|//table[@class="c999 fs12 mt10 f-bold"]/tr[contains(td/text(), "Type")]/td[contains(text(), "Type")]/preceding-sibling::td/text()').extract()
Out[29]: [u'Kurta']
No matter whether the value td comes after "Type" or before it, this will work.
//table/tbody/tr/td[.="Fabric"]/../td[2]/text()
I did it with the above XPath.
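More generally, since the value cell always carries class="c222" in this markup while the label cell does not, you can collect every row into a dict and stop caring about order altogether. A sketch along those lines:
details = {}
for row in response.xpath('//table[@class="c999 fs12 mt10 f-bold"]//tr'):
    label = row.xpath('td[not(@class="c222")]//text()').extract()
    value = row.xpath('td[@class="c222"]//text()').extract()
    if label and value:
        details[label[0].strip()] = value[0].strip()
# details.get('Type') -> 'Kurta', wherever the Type row appears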
I have a table formed like this from a website:
<table>
<tr class="head">
<td class="One">
Column 1
</td>
<td class="Two">
Column 2
</td>
<td class="Four">
Column 3
</td>
<td class="Five">
Column 4
</td>
</tr>
<tr class="DataSet1">
<td class="One">
<table>
<tr>
<td class="DataType1">
Data 1
</td>
</tr>
<tr>
<td class="DataType_2">
<ul>
<li> Data 2a</li>
<li> Data 2b</li>
<li> Data 2c</li>
<li> Data 2d</li>
</ul>
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 3
</td>
</tr>
<tr>
<td class="DataType_4">
Data 4
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 5
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 6
</td>
</tr>
</table>
</td>
</tr>
<tr class="Empty">
<td class="One">
</td>
<td class="Two">
</td>
<td class="Four">
</td>
<td class="Five">
</td>
</tr>
<tr class="DataSet2">
<td class="One">
<table>
<tr>
<td class="DataType_1">
Data 7
</td>
</tr>
<tr>
<td class="DataType_2">
Data 8
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 9
</td>
</tr>
<tr>
<td class="DataType_4">
Data 10
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 11
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 12
</td>
</tr>
</table>
</td>
</tr>
<!-- and so on -->
</table>
The tags sometimes are also empty, for example:
<td class="DataType_6"> </td>
I tried to scrape the content with Scrapy and the following script:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem

class MySpider(BaseSpider):
    name = "SpiderName"
    allowed_domains = ["url"]
    start_urls = ["url"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//tr')
        items = []
        item = ProjectItem()
        item["Data_1"] = rows.select('//td[@class="DataType_1"]/text()').extract()
        item["Data_2"] = rows.select('//td[@class="DataType_2"]/text()').extract()
        item["Data_3"] = rows.select('//td[@class="DataType_3"]/text()').extract()
        item["Data_4"] = rows.select('//td[@class="DataType_4"]/text()').extract()
        item["Data_5"] = rows.select('//td[@class="DataType_5"]/text()').extract()
        item["Data_6"] = rows.select('//td[@class="DataType_6"]/text()').extract()
        items.append(item)
        return items
If I crawl using this command:
scrapy crawl SpiderName -o output.csv -t csv
I only get junk: for "Data_1" I get all the values from every DataSet lumped together, repeated as many times as there are DataSet rows.
Had a similar problem. First of all, rows = hxs.select('//tr') is going to match every tr in the whole document. You need to dig a bit deeper and use relative paths. This link gives an excellent explanation of how to structure your code.
When I finally got my head around it, I realised that in order to parse each item separately, row.select should not have the // in it.
Hope this helps.
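Concretely for the table above, that means selecting each DataSet row first and keeping every path inside the loop relative (starting with .//). A rough sketch, assuming the class names follow the DataType_1 .. DataType_6 pattern (the sample also shows one DataType1 without an underscore, which would need the same treatment):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # one item per data row; starts-with catches DataSet1, DataSet2, ...
    for row in hxs.select('//tr[starts-with(@class, "DataSet")]'):
        item = ProjectItem()
        # relative paths: only cells inside this row are matched
        item["Data_1"] = row.select('.//td[@class="DataType_1"]//text()').extract()
        item["Data_2"] = row.select('.//td[@class="DataType_2"]//text()').extract()
        # ... same pattern for DataType_3 .. DataType_6
        items.append(item)
    return items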