scrapy xpath : choose the ancestor node - python

I have a question about XPath.
<div id="A" >
<div class="B">
<div class="C">
<div class="item">
<div class="area">
<div class="sec">USA</div>
<table>
<tbody>
<tr>
<td>D1</td>
<td>D2</td>
</tr>
<tr class="even">
<td>E1</td>
<td>E2</td>
</tr>
</tbody>
</table>
</div>
<div class="area">
<div class="sec">UK</div>
<table>
<tbody>
<tr>
<td>F1</td>
<td>F2</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
My code is:
sel = Selector(response)
group = sel.xpath("//div[@id='A']/div[@class='B']/div[@class='C']/div[@class='item']/div[@class='area']/table/tbody/tr")
for g in group:
    # section = g.xpath("").extract()  # ancestor???
    context = g.xpath("./td[1]/a/text()").extract()
    brief = g.xpath("./td[2]/text()").extract()
    # print section[0]
    print context[0]
    print brief[0]
it will print:
D1
D2
E1
E2
F1
F2
But I want to print :
USA
D1
D2
USA
E1
E2
UK
F1
F2
So I need to select the value from an ancestor node so I can get USA and UK.
I haven't been able to figure it out.
Please teach me, thank you!

In XPath, you can move back up the tree with .. , so a selector like this could work for you:
section = g.xpath('../../../div[@class="sec"]/text()').extract()
Although this would work, it depends heavily on the exact document structure you have. If you need a bit more flexibility, to allow minor structural changes to the document, you can search backwards for an ancestor like this:
section = g.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()').extract()
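The ancestor:: approach can be sketched outside Scrapy with plain lxml (assuming lxml is installed), using a trimmed-down copy of the document from the question:

```python
# Sketch of the ancestor:: axis with plain lxml; the markup is a
# trimmed copy of the document in the question.
from lxml import etree

doc = etree.fromstring(
    '<div class="item">'
    '<div class="area"><div class="sec">USA</div>'
    '<table><tbody>'
    '<tr><td>D1</td><td>D2</td></tr>'
    '<tr class="even"><td>E1</td><td>E2</td></tr>'
    '</tbody></table></div>'
    '<div class="area"><div class="sec">UK</div>'
    '<table><tbody><tr><td>F1</td><td>F2</td></tr></tbody></table></div>'
    '</div>'
)

rows = []
for tr in doc.xpath('//div[@class="area"]/table/tbody/tr'):
    # climb back up to the enclosing "area" div, then down to its "sec" label
    section = tr.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()')[0]
    rows.append((section, tr.xpath('./td/text()')))

for section, cells in rows:
    print(section, cells)
```

Each row is paired with its section label, which is exactly the USA/UK grouping the question asks for.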

http://www.tizag.com/xmlTutorial/xpathparent.php is a nice link.
Getting the parent of an element can be done with the XPath child/..

from lxml import etree, html
a='<div id="A" ><div class="B"><div class="C"><div class="item"><div class="area"><div class="sec">USA</div> <table> <tbody> <tr> <td>D1</td> <td>D2</td> </tr> <tr class="even"> <td>E1</td> <td>E2</td> </tr> </tbody> </table> </div> <div class="area"> <div class="sec">UK</div> <table> <tbody> <tr> <td>F1</td> <td>F2</td> </tr> </tbody> </table> </div> </div> </div> </div> </div>'
tree = etree.fromstring(a)
print filter(lambda x: x.strip(), tree.xpath('//div[@class="area"]//text()'))
Output: ['USA', 'D1', 'D2', 'E1', 'E2', 'UK', 'F1', 'F2']
// - selects all descendants
/ - selects only direct child elements
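The difference is easy to see on a tiny document (a sketch, again with lxml):

```python
from lxml import etree

doc = etree.fromstring('<a><b><c>x</c></b></a>')
print(doc.xpath('//c/text()'))   # // reaches <c> at any depth
print(doc.xpath('/a/c/text()'))  # / only steps to direct children, so nothing here
```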

Related

Finding certain element using bs4 beautifulSoup

I usually use Selenium but figured I would give bs4 a shot!
I am trying to find a specific piece of text on the website; in the example below I want the last value, 189305014:
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:
You can search for the <p> tag that follows the one containing "Twitter User ID:":
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or first sibling to class="left_column":
print(soup.select_one('.left_column + *').text)
Use the following code to get you the desired output:
TwitterID = soup.find('td',attrs={'class': None}).text
To only get the digits from the second <p> tag, you can filter if the string isdigit():
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
    [t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014

Python Beautiful Soup Iterate over Multiple Tables

I am trying to find multiple tables using their CSS class names, but I am only getting the CSS back in the output initially. I want to loop over each of the small tables; each row contains player info, with the tds holding attributes about each player. How come what I have doesn't actually print the table contents to begin with? I want to confirm I have made this first step right before I go on into the tr and tds for each mini table. I think part of the issue is with the first table.
My program -
import requests
from bs4 import BeautifulSoup
#url = 'https://www.skysports.com/premier-league-table'
base_url = 'https://www.skysports.com'
# Squad Data
squad_url = base_url + '/liverpool-squad'
squad_r = requests.get(squad_url)
print(squad_r.status_code)
premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser')
premier_squad_table = premier_squad_soup.find_all = ('table', {'class': 'table -small no-wrap football-squad-table '})
print(premier_squad_table)
HTML -
each table looks like the following but with a different title
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<colgroup>
<col class="" style="">
<col class="digit-4 -bp30-hdn">
<col class="digit-3 ">
<col class="digit-3 ">
<col class="digit-3 ">
</colgroup>
<thead>
<tr class="text-s -interact text-h6" style="">
<th class=" text-h4 -txt-left" title="">Goalkeeper</th>
<th class=" text-h6" title="Played">Pld</th>
<th class=" text-h6" title="Goals">G</th>
<th class=" text-h6" title="Yellow Cards ">YC</th>
<th class=" text-h6" title="Red Cards">RC</th>
</tr>
</thead>
<tbody>
<tr class="text-h6 -center">
<td>
<a href="/football/player/141016/alisson-ramses-becker">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Alisson Ramses Becker</h6></span>
</div>
</a>
</td>
<td>
13 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/simon-mignolet">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Simon Mignolet</h6></span>
</div>
</a>
</td>
<td>
1 (0) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="text-h6 -center">
<td>
<a href="/football/player/153304/kamil-grabara">
<div class="row-table -2cols">
<span class="col span4/5 -txt-left"><h6 class=" text-h5">Kamil Grabara</h6></span>
</div>
</a>
</td>
<td>
1 (1) </td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Output -
200
('table', {'class': 'table -small no-wrap football-squad-table '})
Note that the posted snippet never calls find_all: premier_squad_table = premier_squad_soup.find_all = ('table', {...}) assigns a tuple to the attribute, which is why the tuple itself gets printed. With the call fixed, I had to find the div first to then get the tables inside the div:
premier_squad_div = premier_squad_soup.find('div', {'class': '-bp30-box col span1/1'})
premier_squad_table = premier_squad_div.find_all('table', {'class': 'table -small no-wrap football-squad-table '})
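With the tables in hand, the rows and cells can be walked like this (a sketch against a cut-down copy of the sample table; note that bs4 matches on a single class name such as football-squad-table even when the tag lists several classes):

```python
from bs4 import BeautifulSoup

# Cut-down copy of one squad table from the question
html = """<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
<tbody>
<tr class="text-h6 -center">
<td><a href="/football/player/141016/alisson-ramses-becker">
<h6 class=" text-h5">Alisson Ramses Becker</h6></a></td>
<td>13 (0)</td><td>0</td><td>0</td><td>0</td>
</tr>
</tbody>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
players = []
for table in soup.find_all('table', {'class': 'football-squad-table'}):
    for row in table.find_all('tr'):
        # one list of cell texts per player row
        players.append([td.get_text(strip=True) for td in row.find_all('td')])
print(players)
```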

python beautifulsoup parsing recursing

I'm a python/BeautifulSoup beginner, I'm trying to extract all the content in <td width="473" valign="top"> -> <strong>.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl">
<head>
<title>MIEJSKI OŚRODEK KULTURY W ŻORACH Repertuar Kina Na Starówce</title>
</head>
<body>
<div class="page_content">
<p> </p>
<p>
<table style="width: 450px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="57" valign="top">
<p align="center"><strong>Data</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>Tytuł Filmu</strong></p>
</td>
<td width="95" valign="top">
<p align="center"><strong>Godzina</strong></p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong> </strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>1 - 5.05</strong></p>
</td>
<td width="95" valign="top">
<p align="center"> </p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong>1</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>KINO POWTÓREK: ZWIERZOGRÓD </strong>USA/b.o cena 10 zł</p>
</td>
<td width="95" valign="top">
<p align="center">16:30</p>
</td>
</tr>
</tbody>
</table>
</p>
</body>
</html>
The furthest I can go is to get a list of all the tags with this code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("zory1.html"), "html.parser")
y = soup.find_all(width="473")
newy = str(y)
newsoup = BeautifulSoup(newy ,"html.parser")
stronglist = newsoup.find_all('strong')
lasty = str(stronglist)
lastsoup = BeautifulSoup(lasty , "html.parser")
lst = soup.find_all('strong')
for item in lst:
    print item
How can I extract the content inside the tags, at a beginner's level?
Thanks
Use get_text() to get a node's text.
Complete working example where we go over all the rows and all the cells inside the table:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "html.parser")
for row in soup.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])
Prints:
['Data', 'Tytuł Filmu', 'Godzina']
['', '1 - 5.05', '']
['1', 'KINO POWTÓREK: ZWIERZOGRÓDUSA/b.o\xa0 cena 10 zł', '16:30']
Here you are
from bs4 import BeautifulSoup
navigator = BeautifulSoup(open("zory1.html"), "html.parser")
tds = navigator.find_all("td", {"width":"473"})
resultList = [item.strong.get_text() for item in tds]
for item in resultList:
    print item
Result
$ python test.py
Tytuł Filmu
1 - 5.05
KINO POWTÓREK: ZWIERZOGRÓD

How to extract pairs of (href, alt) wih python scrapy

I have an html page (seed) of the form:
<div class="sth1">
<table cellspacing="6" width="600">
<tr>
<td>
<img alt="alt1" border="0" height="22" src="img1" width="92">
</td>
<td>
name1
</td>
<td>
<img alt="alt2" border="0" height="22" src="img2" width="92">
</td>
<td>
name2
</td>
</tr>
</table>
</div>
What I would like to do is loop over all the <tr>s and extract all (href, alt) pairs with Python Scrapy. In this example, I should get:
link1, alt1
link2, alt2
Here is an example from the Scrapy Shell:
$ scrapy shell index.html
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
   ...:     href = cell.xpath("a/@href").extract()
   ...:     alt = cell.xpath("a/img/@alt").extract()
   ...:     print href, alt
[u'link1'] [u'alt1']
[u'link1'] []
[u'link2'] [u'alt2']
[u'link2'] []
where index.html contains the sample HTML provided in the question.
You could try Scrapy's built-in SelectorList combined with Python's zip():
from scrapy.selector import SelectorList
xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
cells = SelectorList(response.xpath(xpq))
zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract())
=> [('link1', 'alt1'), ('link2', 'alt2')]
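The same pairing can be reproduced outside Scrapy with plain lxml and zip() (a sketch; the <a href="link1"> anchors are assumptions here, since the sample HTML in the question only shows the <img> tags):

```python
from lxml import etree

# Hypothetical markup: anchors wrapping the images, as the answers assume
html = etree.fromstring(
    '<div class="sth1"><table><tr>'
    '<td><a href="link1"><img alt="alt1"/></a></td><td>name1</td>'
    '<td><a href="link2"><img alt="alt2"/></a></td><td>name2</td>'
    '</tr></table></div>'
)
# only keep cells that actually contain a linked image
cells = html.xpath('//div[@class="sth1"]/table/tr/td[./a/img]')
pairs = list(zip(
    [c.xpath('a/@href')[0] for c in cells],
    [c.xpath('a/img/@alt')[0] for c in cells],
))
print(pairs)
```

The td[./a/img] predicate skips the text-only name cells, so the two lists being zipped stay aligned.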

Parsing html table with BeautifulSoup to python dictionary

This is an html code than I'm trying to parse with BeautifulSoup:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
... (amount of this tags isn't fixed too)
</ul>
</td>
</tr>
</table>
The output I would like to get is a dictionary like this:
DICT = {
'menu1': ['Some data1','Foo1 Bar1'],
'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}
As I already mentioned in the code, amount of <li> tags is not fixed. Additionally, there could be:
menu1 and menu2
just menu1
just menu2
no menu1 and menu2 (just <table></table>)
so e.g. it could looks just like this:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
</table>
I was trying to use this example but with no success. I think it's because of the <ul> tags that I can't read the proper data from the table. The variable number of menus and <li> tags is also a problem for me.
So my question is: how can I parse this particular table into a Python dictionary?
I should mention that I already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could keep that as is.
request = c.get('http://example.com/somepage.html')
soup = bs(request.text)
and this is always the first table of the page, so I can get it with:
table = soup.find_all('table')[0]
Thank you in advance for any help.
html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""
import BeautifulSoup as bs
soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}
th = table.findChildren('th')  # ,text=['menu1','menu2'])
for x in th:
    # print x
    results_li = []
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        # print y.next
        results_li.append(y.next)
    results[x.next] = results_li
print results
Output:
{
 u'menu2': [u'Some data2', u'Foo2Bar2', u'Foo3Bar3', u'Some data3'],
 u'menu1': [u'Some data1', u'Foo1Bar1']
}
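The answer above uses the legacy BeautifulSoup 3 API (import BeautifulSoup, findAll, nextSibling). A rough equivalent with the current bs4 package might look like this (a sketch, not the original author's code):

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th width="100">menu1</th>
<td><ul class="classno1"><li>Some data1</li><li>Foo1Bar1</li></ul></td></tr>
<tr><th width="100">menu2</th>
<td><ul class="classno1"><li>Some data2</li><li>Foo2Bar2</li>
<li>Foo3Bar3</li><li>Some data3</li></ul></td></tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
result = {}
for row in soup.find_all('tr'):
    th = row.find('th')
    if th is None:
        continue  # skip rows that have no <th> menu label
    result[th.get_text(strip=True)] = [
        li.get_text(strip=True) for li in row.find_all('li')
    ]
print(result)
```

Keying on each row's own <th> and <li> tags sidesteps the nextSibling hops, so a variable number of menus or <li> entries (including an empty table) is handled automatically.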
