scrapy xpath : choose the ancestor node - python

I have a question about XPath.
<div id="A" >
<div class="B">
<div class="C">
<div class="item">
<div class="area">
<div class="sec">USA</div>
<table>
<tbody>
<tr>
<td>D1</td>
<td>D2</td>
</tr>
<tr class="even">
<td>E1</td>
<td>E2</td>
</tr>
</tbody>
</table>
</div>
<div class="area">
<div class="sec">UK</div>
<table>
<tbody>
<tr>
<td>F1</td>
<td>F2</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
My code is:
sel = Selector(response)
group = sel.xpath("//div[@id='A']/div[@class='B']/div[@class='C']/div[@class='item']/div[@class='area']/table/tbody/tr")
for g in group:
    # section = g.xpath("").extract()  # ancestor???
    context = g.xpath("./td[1]/a/text()").extract()
    brief = g.xpath("./td[2]/text()").extract()
    # print section[0]
    print context[0]
    print brief[0]
it will print:
D1
D2
E1
E2
F1
F2
But I want to print :
USA
D1
D2
USA
E1
E2
UK
F1
F2
So I need to select the value from an ancestor node so I can get USA and UK.
I haven't been able to figure it out.
Please teach me, thank you!

In XPath, you can move back up the tree with .. , so a selector like this could work for you:
section = g.xpath('../../../div[@class="sec"]/text()').extract()
Although this would work, it depends heavily on the exact document structure you have. If you need a bit more flexibility, to allow minor structural changes to the document, you can search backwards for an ancestor like this:
section = g.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()').extract()
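The ancestor:: approach can be sketched outside Scrapy with plain lxml (assuming lxml is installed), using a trimmed-down copy of the document from the question:

```python
# Sketch of the ancestor:: axis with plain lxml; the markup is a
# trimmed copy of the document in the question.
from lxml import etree

doc = etree.fromstring(
    '<div class="item">'
    '<div class="area"><div class="sec">USA</div>'
    '<table><tbody>'
    '<tr><td>D1</td><td>D2</td></tr>'
    '<tr class="even"><td>E1</td><td>E2</td></tr>'
    '</tbody></table></div>'
    '<div class="area"><div class="sec">UK</div>'
    '<table><tbody><tr><td>F1</td><td>F2</td></tr></tbody></table></div>'
    '</div>'
)

rows = []
for tr in doc.xpath('//div[@class="area"]/table/tbody/tr'):
    # climb back up to the enclosing "area" div, then down to its "sec" label
    section = tr.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()')[0]
    rows.append((section, tr.xpath('./td/text()')))

for section, cells in rows:
    print(section, cells)
```

Each row is paired with its section label, which is exactly the USA/UK grouping the question asks for.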

http://www.tizag.com/xmlTutorial/xpathparent.php is a nice link.
Getting the parent of an element can be done with the XPath child/..

from lxml import etree, html
a='<div id="A" ><div class="B"><div class="C"><div class="item"><div class="area"><div class="sec">USA</div> <table> <tbody> <tr> <td>D1</td> <td>D2</td> </tr> <tr class="even"> <td>E1</td> <td>E2</td> </tr> </tbody> </table> </div> <div class="area"> <div class="sec">UK</div> <table> <tbody> <tr> <td>F1</td> <td>F2</td> </tr> </tbody> </table> </div> </div> </div> </div> </div>'
tree = etree.fromstring(a)
print filter(lambda x: x.strip(), tree.xpath('//div[@class="area"]//text()'))
Output: ['USA', 'D1', 'D2', 'E1', 'E2', 'UK', 'F1', 'F2']
// - selects all descendants
/ - selects only direct child elements
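The difference is easy to see on a tiny document (a sketch, again with lxml):

```python
from lxml import etree

doc = etree.fromstring('<a><b><c>x</c></b></a>')
print(doc.xpath('//c/text()'))   # // reaches <c> at any depth
print(doc.xpath('/a/c/text()'))  # / only steps to direct children, so nothing here
```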

Related

Finding certain element using bs4 beautifulSoup

I usually use Selenium but figured I would give bs4 a shot!
I am trying to find a specific piece of text on the website; in the example below I want the last value, 189305014:
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:
You can search for the <p> tag that follows the one containing "Twitter User ID:":
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or first sibling to class="left_column":
print(soup.select_one('.left_column + *').text)
Use the following code to get you the desired output:
TwitterID = soup.find('td',attrs={'class': None}).text
To only get the digits from the second <p> tag, you can filter if the string isdigit():
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
    [t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014

Python Beautiful Soup Iterate over Multiple Tables

I am trying to find multiple tables using their CSS class names, but I am only getting the CSS back in the output initially. I want to loop over each of the small tables; each row contains player info, with the tds holding attributes about each player. How come what I have doesn't actually print the table contents to begin with? I want to confirm I have made this first step right before I go on into the tr and tds for each mini table. I think part of the issue is with the first table.
My program -
import requests
from bs4 import BeautifulSoup
#url = 'https://www.skysports.com/premier-league-table'
base_url = 'https://www.skysports.com'
# Squad Data
squad_url = base_url + '/liverpool-squad'
squad_r = requests.get(squad_url)
print(squad_r.status_code)
premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser')
premier_squad_table = premier_squad_soup.find_all = ('table', {'class': 'table -small no-wrap football-squad-table '})
print(premier_squad_table)
HTML -
each table looks like the following but with a different title
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<colgroup>
<col class="" style="">
<col class="digit-4 -bp30-hdn">
<col class="digit-3 ">
<col class="digit-3 ">
<col class="digit-3 ">
</colgroup>
<thead>
<tr class="text-s -interact text-h6" style="">
<th class=" text-h4 -txt-left" title="">Goalkeeper</th>
<th class=" text-h6" title="Played">Pld</th>
<th class=" text-h6" title="Goals">G</th>
<th class=" text-h6" title="Yellow Cards ">YC</th>
<th class=" text-h6" title="Red Cards">RC</th>
</tr>
</thead>
<tbody>
<tr class="text-h6 -center">
<td>
<a href="/football/player/141016/alisson-ramses-becker">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Alisson Ramses Becker</h6></span>
</div>
</a>
</td>
<td>
13 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/simon-mignolet">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Simon Mignolet</h6></span>
</div>
</a>
</td>
<td>
1 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/football/player/153304/kamil-grabara">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Kamil Grabara</h6></span>
</div>
</a>
</td>
<td>
1 (1) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Output -
200
('table', {'class': 'table -small no-wrap football-squad-table '})
Note that the posted snippet never calls find_all: premier_squad_table = premier_squad_soup.find_all = ('table', {...}) assigns a tuple to the attribute, which is why the tuple itself gets printed. With the call fixed, I had to find the div first to then get the tables inside the div:
premier_squad_div = premier_squad_soup.find('div', {'class': '-bp30-box col span1/1'})
premier_squad_table = premier_squad_div.find_all('table', {'class': 'table -small no-wrap football-squad-table '})
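With the tables in hand, the rows and cells can be walked like this (a sketch against a cut-down copy of the sample table; note that bs4 matches on a single class name such as football-squad-table even when the tag lists several classes):

```python
from bs4 import BeautifulSoup

# Cut-down copy of one squad table from the question
html = """<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<tbody>
<tr class="text-h6 -center">
<td><a href="/football/player/141016/alisson-ramses-becker">
<h6 class=" text-h5">Alisson Ramses Becker</h6></a></td>
<td>13 (0)</td><td>0</td><td>0</td><td>0</td>
</tr>
</tbody>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
players = []
for table in soup.find_all('table', {'class': 'football-squad-table'}):
    for row in table.find_all('tr'):
        # one list of cell texts per player row
        players.append([td.get_text(strip=True) for td in row.find_all('td')])
print(players)
```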

python beautifulsoup parsing recursing

I'm a python/BeautifulSoup beginner, I'm trying to extract all the content in <td width="473" valign="top"> -> <strong>.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl">
<head>
<title>MIEJSKI OŚRODEK KULTURY W ŻORACH Repertuar Kina Na Starówce</title>
</head>
<body>
<div class="page_content">
<p> </p>
<p>
<table style="width: 450px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="57" valign="top">
<p align="center"><strong>Data</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>Tytuł Filmu</strong></p>
</td>
<td width="95" valign="top">
<p align="center"><strong>Godzina</strong></p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong> </strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>1 - 5.05</strong></p>
</td>
<td width="95" valign="top">
<p align="center"> </p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong>1</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>KINO POWTÓREK: ZWIERZOGRÓD </strong>USA/b.o cena 10 zł</p>
</td>
<td width="95" valign="top">
<p align="center">16:30</p>
</td>
</tr>
</tbody>
</table>
</p>
</body>
</html>
The furthest I can go is to get a list of all the tags with this code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("zory1.html"), "html.parser")
y = soup.find_all(width="473")
newy = str(y)
newsoup = BeautifulSoup(newy ,"html.parser")
stronglist = newsoup.find_all('strong')
lasty = str(stronglist)
lastsoup = BeautifulSoup(lasty , "html.parser")
lst = soup.find_all('strong')
for item in lst:
    print item
How can I extract the content inside the tags, at a beginner's level?
Thanks
Use get_text() to get a node's text.
Complete working example where we go over all the rows and all the cells inside the table:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "html.parser")
for row in soup.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])
Prints:
['Data', 'Tytuł Filmu', 'Godzina']
['', '1 - 5.05', '']
['1', 'KINO POWTÓREK: ZWIERZOGRÓDUSA/b.o\xa0 cena 10 zł', '16:30']
Here you are
from bs4 import BeautifulSoup
navigator = BeautifulSoup(open("zory1.html"), "html.parser")
tds = navigator.find_all("td", {"width":"473"})
resultList = [item.strong.get_text() for item in tds]
for item in resultList:
    print item
Result
$ python test.py
Tytuł Filmu
1 - 5.05
KINO POWTÓREK: ZWIERZOGRÓD

How to extract pairs of (href, alt) wih python scrapy

I have an html page (seed) of the form:
<div class="sth1">
<table cellspacing="6" width="600">
<tr>
<td>
<img alt="alt1" border="0" height="22" src="img1" width="92">
</td>
<td>
name1
</td>
<td>
<img alt="alt2" border="0" height="22" src="img2" width="92">
</td>
<td>
name2
</td>
</tr>
</table>
</div>
What I would like to do is loop over all the <tr>s and extract all (href, alt) pairs with Python Scrapy. In this example, I should get:
link1, alt1
link2, alt2
Here is an example from the Scrapy Shell:
$ scrapy shell index.html
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
   ...:     href = cell.xpath("a/@href").extract()
   ...:     alt = cell.xpath("a/img/@alt").extract()
   ...:     print href, alt
[u'link1'] [u'alt1']
[u'link1'] []
[u'link2'] [u'alt2']
[u'link2'] []
where index.html contains the sample HTML provided in the question.
You could try Scrapy's built-in SelectorList combined with Python's zip():
from scrapy.selector import SelectorList
xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
cells = SelectorList(response.xpath(xpq))
zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract())
=> [('link1', 'alt1'), ('link2', 'alt2')]
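The same pairing can be reproduced outside Scrapy with plain lxml and zip() (a sketch; the <a href="link1"> anchors are assumptions here, since the sample HTML in the question only shows the <img> tags):

```python
from lxml import etree

# Hypothetical markup: anchors wrapping the images, as the answers assume
html = etree.fromstring(
    '<div class="sth1"><table><tr>'
    '<td><a href="link1"><img alt="alt1"/></a></td><td>name1</td>'
    '<td><a href="link2"><img alt="alt2"/></a></td><td>name2</td>'
    '</tr></table></div>'
)
# only keep cells that actually contain a linked image
cells = html.xpath('//div[@class="sth1"]/table/tr/td[./a/img]')
pairs = list(zip(
    [c.xpath('a/@href')[0] for c in cells],
    [c.xpath('a/img/@alt')[0] for c in cells],
))
print(pairs)
```

The td[./a/img] predicate skips the text-only name cells, so the two lists being zipped stay aligned.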

Parsing html table with BeautifulSoup to python dictionary

This is an html code than I'm trying to parse with BeautifulSoup:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
... (amount of this tags isn't fixed too)
</ul>
</td>
</tr>
</table>
The output I would like to get is a dictionary like this:
DICT = {
'menu1': ['Some data1','Foo1 Bar1'],
'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}
As I already mentioned in the code, amount of <li> tags is not fixed. Additionally, there could be:
menu1 and menu2
just menu1
just menu2
no menu1 and menu2 (just <table></table>)
so e.g. it could looks just like this:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
</table>
I was trying to use this example but with no success. I think it's because of the <ul> tags that I can't read the proper data from the table. The variable number of menus and <li> tags is also a problem for me.
So my question is: how can I parse this particular table into a Python dictionary?
I should mention that I already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could keep that as is.
request = c.get('http://example.com/somepage.html')
soup = bs(request.text)
and this is always the first table of the page, so I can get it with:
table = soup.find_all('table')[0]
Thank you in advance for any help.
html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""
import BeautifulSoup as bs
soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}
th = table.findChildren('th')  # ,text=['menu1','menu2'])
for x in th:
    # print x
    results_li = []
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        # print y.next
        results_li.append(y.next)
    results[x.next] = results_li
print results
Output:
{
 u'menu2': [u'Some data2', u'Foo2Bar2', u'Foo3Bar3', u'Some data3'],
 u'menu1': [u'Some data1', u'Foo1Bar1']
}
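The answer above uses the legacy BeautifulSoup 3 API (import BeautifulSoup, findAll, nextSibling). A rough equivalent with the current bs4 package might look like this (a sketch, not the original author's code):

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th width="100">menu1</th>
<td><ul class="classno1"><li>Some data1</li><li>Foo1Bar1</li></ul></td></tr>
<tr><th width="100">menu2</th>
<td><ul class="classno1"><li>Some data2</li><li>Foo2Bar2</li>
<li>Foo3Bar3</li><li>Some data3</li></ul></td></tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
result = {}
for row in soup.find_all('tr'):
    th = row.find('th')
    if th is None:
        continue  # skip rows that have no <th> menu label
    result[th.get_text(strip=True)] = [
        li.get_text(strip=True) for li in row.find_all('li')
    ]
print(result)
```

Keying on each row's own <th> and <li> tags sidesteps the nextSibling hops, so a variable number of menus or <li> entries (including an empty table) is handled automatically.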
