How to extract pairs of (href, alt) with python scrapy

I have an html page (seed) of the form:
<div class="sth1">
<table cellspacing="6" width="600">
<tr>
<td>
<a href="link1"><img alt="alt1" border="0" height="22" src="img1" width="92"></a>
</td>
<td>
<a href="link1">name1</a>
</td>
<td>
<a href="link2"><img alt="alt2" border="0" height="22" src="img2" width="92"></a>
</td>
<td>
<a href="link2">name2</a>
</td>
</tr>
</table>
</div>
What I would like to do is loop over all the <tr> elements and extract every (href, alt) pair with python scrapy. In this example, I should get:
link1, alt1
link2, alt2

Here is an example from the Scrapy Shell:
$ scrapy shell index.html
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
   ...:     href = cell.xpath("a/@href").extract()
   ...:     alt = cell.xpath("a/img/@alt").extract()
   ...:     print href, alt
[u'link1'] [u'alt1']
[u'link1'] []
[u'link2'] [u'alt2']
[u'link2'] []
where index.html contains the sample HTML provided in the question.

You could try Scrapy's built-in SelectorList combined with Python's zip():
from scrapy.selector import SelectorList
xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
cells = SelectorList(response.xpath(xpq))
# .extract() turns the selectors into plain strings before pairing them
zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract())
=> [('link1', 'alt1'), ('link2', 'alt2')]

Related

Finding certain element using bs4 beautifulSoup

I usually use selenium but figured I would give bs4 a shot!
I am trying to find a specific piece of text on the website. In the example below, I want the last value, 189305014:
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:

You can find the <p> tag that contains "Twitter User ID:" and then take the next <p> tag after it:
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or first sibling to class="left_column":
print(soup.select_one('.left_column + *').text)
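These variants return either a whole tag or text with the surrounding newlines still attached. A minimal sketch of getting just the bare string, using find_next_sibling() and get_text(strip=True), could look like this (the trimmed sample markup and the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label cell, then step to the next <td> in the same row.
label_cell = soup.select_one("td.left_column")
value_cell = label_cell.find_next_sibling("td")

# get_text(strip=True) drops the newlines around the <p> content.
twitter_id = value_cell.get_text(strip=True)
print(twitter_id)
```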

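The following finds the first <td> that has no class attribute, which in this snippet is the cell holding the ID; strip() removes the surrounding newlines:
TwitterID = soup.find('td',attrs={'class': None}).text.strip()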

To get only the digits, you can take the text of the container and keep the characters for which isdigit() is true:
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
[t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014
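Joining every digit in the container would also pick up digits from any other text inside it. A hedged alternative, assuming the ID is the only <p> in the table whose text is entirely digits, is to test each paragraph as a whole:

```python
import re
from bs4 import BeautifulSoup

html = """<div class="info_container">
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
</table>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <p> elements whose entire stripped text consists of digits.
numeric = [
    p.get_text(strip=True)
    for p in soup.select(".profile_info p")
    if re.fullmatch(r"\d+", p.get_text(strip=True))
]
print(numeric)
```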

Is there a way to extract all the class name from an HTML file using BeautifulSoup?

<tr id="section_1asd8aa" class="main">
<td class="header">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="font-family: arial,sans-serif; font-size: 11px;>DUMMY TEXTbrowser.
</td>
</tr>
</tbody>
</table>
</td></tr>
Above is a sample html and I want to extract all the class names from the html file.
Output:'{ "c1":"main","c2":"header"}'

You can use find_all() to get every node, then loop through the nodes and keep the class of each one that has a class attribute:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<tr id="section_1asd8aa" class="main">
<td class="header">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="font-family: arial,sans-serif; font-size: 11px;>DUMMY TEXTbrowser.
</td>
</tr>
</tbody>
</table>
</td></tr>""", "html.parser")
To get a list of class names:
lst = [node['class'] for node in soup.find_all() if node.has_attr('class')]
lst
# [['main'], ['header']]
Convert the list to a dictionary:
{"c"+str(i): v for i, v in enumerate(lst)}
# {'c0': ['main'], 'c1': ['header']}
Notice that each class is wrapped in a list, because an element can have more than one class. You can join each list into a single string if that is what you want:
{"c"+str(i): " ".join(v) for i, v in enumerate(lst)}
# {'c0': 'main', 'c1': 'header'}
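To match the numbering and the JSON-style string requested in the question, one option is to start enumerate() at 1 and serialize the result with json.dumps(). This sketch closes the unterminated style attribute from the sample so the input is well formed; that fix is an assumption about the intended markup.

```python
import json
from bs4 import BeautifulSoup

html = """<tr id="section_1asd8aa" class="main">
<td class="header">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="font-family: arial,sans-serif; font-size: 11px;">DUMMY TEXT</td>
</tr>
</tbody>
</table>
</td></tr>"""

soup = BeautifulSoup(html, "html.parser")

# Collect the class attribute of every tag that has one, in document order.
class_lists = [tag["class"] for tag in soup.find_all(True) if tag.has_attr("class")]

# Number from 1 and join multi-valued classes into one string.
classes = {f"c{i}": " ".join(values) for i, values in enumerate(class_lists, start=1)}

as_json = json.dumps(classes)
print(as_json)
```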

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’ but I fail. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
neither of them works out.
Any help? Thanks.

There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data, 'html.parser')
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print(span.next_sibling.strip()) # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find the span by its text; arguably more readable):
import re
span = soup.find('span', text=re.compile('Expiry Date'))
print(span.next_sibling.strip()) # prints 2016
re.compile() is needed because there may be line breaks and extra spaces around the text.

An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
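soup = bs4.BeautifulSoup(data, 'html.parser')
# next_sibling is the text node after </SPAN>; strip() removes its newlines
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling.strip()
print(exp_date) # 2016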
To learn about BeautifulSoup, I recommend you read the documentation.
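Another sketch, offered as an alternative rather than as part of the answers above: locate the label, go up to its enclosing cell, and read the cell's strings with stripped_strings. It assumes the value is the last piece of text in that cell, and it uses the markup exactly as posted in the question.

```python
import re
from bs4 import BeautifulSoup

data = """
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
"""

soup = BeautifulSoup(data, "html.parser")

# Find the label span, then climb to the innermost <td> that contains it.
label = soup.find("span", string=re.compile("Expiry Date"))
cell = label.find_parent("td")

# stripped_strings yields each text fragment with whitespace removed.
parts = list(cell.stripped_strings)
expiry = parts[-1]
print(expiry)
```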

Parsing html table with BeautifulSoup to python dictionary

This is the HTML that I'm trying to parse with BeautifulSoup:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
... (amount of this tags isn't fixed too)
</ul>
</td>
</tr>
</table>
The output I would like to get is a dictionary like this:
DICT = {
'menu1': ['Some data1','Foo1 Bar1'],
'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}
As noted in the code, the number of <li> tags is not fixed. Additionally, there could be:
menu1 and menu2
just menu1
just menu2
no menu1 and menu2 (just <table></table>)
so, for example, it could look just like this:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
</table>
I was trying to use this example, but with no success. I think it's because of the <ul> tags that I can't read the proper data from the table. The variable number of menus and <li> tags is also a problem for me.
So my question is how to parse this particular table to python dictionary?
I should mention that I already parsed some simple data with .text attribute of BeautifulSoup handler so it would be nice if I could just keep it as is.
request = c.get('http://example.com/somepage.html')
soup = bs(request.text)
and this is always the first table of the page, so I can get it with:
table = soup.find_all('table')[0]
Thank you in advance for any help.

html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""
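import BeautifulSoup as bs
soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}
th = table.findChildren('th')#,text=['menu1','menu2'])
for x in th:
    #print x
    results_li = []
    # skip the whitespace node after </th> to reach the matching <td>
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        #print y.next
        # y.next is the first text node inside the <li>
        results_li.append(y.next)
    # x.next is the text inside the <th>, used as the dictionary key
    results[x.next] = results_li
print results
Output: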
{
u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'],
u'menu1': [u'Some data1', u'Foo1']
}
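For readers on Python 3 with the bs4 package, the same idea can be sketched as follows. This is a rewrite under assumptions, not the answer's original code: the function name parse_menus is hypothetical, and each row is assumed to hold one <th> label plus its <li> items.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""


def parse_menus(markup):
    """Map each <th> label in the first table to the texts of its <li> items."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    menus = {}
    if table is None:
        return menus
    for row in table.find_all("tr"):
        header = row.find("th")
        if header is None:
            continue  # skip rows without a label
        items = [li.get_text(" ", strip=True) for li in row.find_all("li")]
        menus[header.get_text(strip=True)] = items
    return menus


result = parse_menus(html)
print(result)
```

Because the loop runs once per row, it handles one menu, two menus, or an empty <table></table> (which yields an empty dictionary) without special cases.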

Need help parsing through this HTML using BeautifulSoup and Python

I have the following HTML I would like to parse using BeautifulSoup:
<tr class="TrGameOdd">
<td align="center">
<a href="Schedule.aspx?WT=0&lg=778&id=,1583114">
<img border="0" src="/core/engine/App_Themes/Global/images/plus.gif">
</a>
</td>
<td align="left">Oct 20</td>
<td>777</td>
<td align="left" colspan="2">Cupcakes</td>
<td align="right">7+3
<input type="checkbox" value="0_1583114_-3440" name="text_">
</td>
<td align="right">a199
<input type="checkbox" value="2_1583114_-199.5_-110" name="text_">
</td>
</tr>
There are a whole bunch of lines like this, but I only need specifics out of it. For example, I want to parse 777, Cupcakes, 7+3, -3440, a199 out of all of this. How would I go about doing that? I'd like it to print side by side and I would have a few of these lines I want to parse, so when it prints it should be like this:
777 Cupcakes 7+3 -3440
X X X X
X X X X
etc

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# find_all() returns every matching row; find() would return only the first one
trs = soup.find_all("tr", {"class": "TrGameOdd"})
for tr in trs:
    tds = tr.find_all("td")
    print(tds[1].string) # Oct 20
    print(tds[2].string) # 777
    print(tds[3].string) # Cupcakes
    ...
You will need to continue from here yourself.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
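One way the remaining fields might be collected is sketched below. It assumes the wanted number (-3440) is always the third underscore-separated part of the first checkbox's value attribute, which is a guess based on the single sample row; the column positions are likewise taken from that row.

```python
from bs4 import BeautifulSoup

html = """<tr class="TrGameOdd">
<td align="center">
<a href="Schedule.aspx?WT=0&lg=778&id=,1583114">
<img border="0" src="/core/engine/App_Themes/Global/images/plus.gif">
</a>
</td>
<td align="left">Oct 20</td>
<td>777</td>
<td align="left" colspan="2">Cupcakes</td>
<td align="right">7+3
<input type="checkbox" value="0_1583114_-3440" name="text_">
</td>
<td align="right">a199
<input type="checkbox" value="2_1583114_-199.5_-110" name="text_">
</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr", class_="TrGameOdd"):
    tds = tr.find_all("td")
    number = tds[2].get_text(strip=True)
    name = tds[3].get_text(strip=True)
    spread = tds[4].get_text(strip=True)
    # Assumed layout of the value attribute: "<index>_<id>_<number>"
    spread_value = tds[4].find("input")["value"].split("_")[2]
    total = tds[5].get_text(strip=True)
    rows.append((number, name, spread, spread_value, total))

# Print one game per line, fields side by side.
for row in rows:
    print(" ".join(row))
```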
