I'm trying to scrape all the paragraphs from Wikipedia that come between the main heading of the page and the table of contents. I noticed that they always come between two div elements as shown below:
<div id="some-div">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
<div id="some-other-div">...</div>
I want to grab all of the HTML between the two div elements (not just the text).
Looking for a solution in Python.
I doubt you can depend on utterly consistent formatting. However, this seems to work for the 'Python (programming language)' page, where the introductory text is delimited by the 'Contents' box.
I offer a few notes:
fetchPreviousSiblings returns the paragraphs in reverse document order.
I would check the length of contents against the unlikely possibility of more than one occurrence.
With this approach it's almost certainly necessary to check the results for rubbish; see the sketch after the code below.
from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
html = urlopen(URL).read()  # read() returns bytes; BeautifulSoup handles the decoding
soup = BeautifulSoup(html, 'html.parser')
contents = soup.findAll('div', attrs={'id': 'toc'})
paras = contents[0].fetchPreviousSiblings('p')
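A minimal sketch of those checks, reusing the names from the code above (the assert and the intro_html name are my own additions):
assert len(contents) == 1  # guard against more than one 'toc' div
paras.reverse()  # fetchPreviousSiblings walks backwards, so restore document order
intro_html = ''.join(str(p) for p in paras)  # inspect this for rubbish before trusting it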
With BeautifulSoup you can find the first and the second div by their ids:
from bs4 import BeautifulSoup
bs = BeautifulSoup(html,"html.parser")
first_div = bs.find(id="some-div")
second_div = bs.find(id="some-other-div")
After this, we collect all the elements between the two divs (converted to strings) in a list and then join them together. To do this, we loop through all the siblings after first_div and break when we reach the second div:
in_between = []
for sibling in first_div.next_siblings:
    if sibling == second_div:
        break
    else:
        in_between.append(str(sibling))
in_between = "".join(in_between)
The previous code block can be replaced by this list comprehension in one line (note that takewhile must be imported from itertools):
from itertools import takewhile
in_between = "".join([str(sibling) for sibling in takewhile(lambda x: x != second_div, first_div.next_siblings)])
I have an xml file containing data in this form:
<head xml:id="_2ebf9c0003">\n\nTECHNICAL FIELD</head>\n
<p n="0001" xml:id="_2ebf9c0004">whatever</p>
<p n="0002" xml:id="_2ebf9c0004">whatever</p>
<... other tags and data...>
<head xml:id="_2ebf9c0003">\n\nTITLE</head>\n
I know how to get particular elements like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(PDM_description, 'lxml')
title_element = soup.title
or get all p elements:
paras = soup.findAll('p')
The question is: how can I add an OR to the query to get a list of "p" or "head" elements? More generally, how can I get all elements whose tags belong to a list?
PSEUDO CODE:
paras = soup.findAll('p' OR 'head')
You are close to your goal; just pass a list of tags to your find_all():
soup.find_all(['p','head'])
Note: in new code, use find_all() instead of the older findAll() syntax.
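Put together with the sample data from the question, a minimal sketch (I use html.parser here so the head tags survive as ordinary elements; an HTML-aware parser might treat them specially):
from bs4 import BeautifulSoup
xml = '''
<head xml:id="_2ebf9c0003">TECHNICAL FIELD</head>
<p n="0001" xml:id="_2ebf9c0004">whatever</p>
<p n="0002" xml:id="_2ebf9c0004">whatever</p>
<head xml:id="_2ebf9c0003">TITLE</head>
'''
soup = BeautifulSoup(xml, 'html.parser')
for el in soup.find_all(['p', 'head']):
    print(el.name, el.get_text(strip=True))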
You can also use a CSS selector list: define your tags separated by a comma (,). To use a CSS selector, use the .select() method:
print(
soup.select("p, head")
)
I looked around on Stack Overflow, and most guides seem to be very specific about extracting all the data from a table. However, I only need to extract one value, and I just can't seem to extract it from the table.
Scrape link:
https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919
I am looking to extract the "Style" value from the table within the link.
Code:
import bs4
import requests
styleData = []
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
table = cleanpagedata.find('div', {'id': 'MainContent_ctl01_panView'})
style=table.findall('tr')[3]
style=style.findall('td')[1].text
print(style)
styleData.append(style)
You probably misspelled the find_all function; try this solution:
style=table.find_all('tr')[3]
style=style.find_all('td')[1].text
print(style)
It will give you the expected output.
You can use a CSS Selector:
#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)
This selects the element with id "MainContent_ctl01_grdCns", then the fourth <tr>, then the second <td> within it.
To use a CSS Selector, use the .select() method instead of find_all(). Or select_one() instead of find().
import requests
from bs4 import BeautifulSoup
URL = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print(
soup.select_one(
"#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)"
).text
)
Output:
Townhouse End
Could also do something like:
import bs4
import requests
style_data = []
url = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')
# select the first `td` tag whose text contains the substring `Style:`.
row = soup.select_one('td:-soup-contains("Style:")')
if row:
    # if that row was found, get its next sibling, which should be the value you want
    home_style_tag = row.next_sibling
    style_data.append(home_style_tag.text)
A couple of notes:
This uses CSS selectors rather than the find methods. See SoupSieve docs for more details.
The select_one call relies on the table always being ordered in a certain way. If this is not the case, use select and iterate through the results to find the bs4.Tag whose text is exactly 'Style:', then grab its next sibling:
Using select:
rows = soup.select('td:-soup-contains("Style:")')
# keep only the cell whose text is exactly 'Style:', then read its sibling
row = next(r for r in rows if r.text == 'Style:')
home_style_text = row.next_sibling.text
You can use :contains on a td to get the node whose innerText is "Style", then an adjacent sibling combinator (+) with a td type selector to get the value in the adjacent td.
import bs4, requests
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
print(cleanpagedata.select_one('td:contains("Style") + td').text)
I am trying to build a function in a Python web scraper that moves to the next page in a list of results. I am having trouble locating the element in BeautifulSoup, as the link is found at the end of many other tags and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. finding the last .a['href'] in the parent element, as it is always the last one, or
2. finding the href based on the fact that it always has the text 'Next'.
I don't know how to write something that would solve either 1. or 2.
Am I along the right lines? Does anyone have any suggestions to achieve my goal? Thanks
To find the <a> tag that contains the text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', text=lambda t: t.strip() == 'Next')['href'])
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
txt = '''
<div id="block">
Some other link
Next
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
Writing a script that would, initially, scrape the data for all of the census blocks in a given census block group. In order to do that, though, I first need to be able to get a link to all of the block groups in a given tract. The tracts are defined by a list of URLs, each of which returns a page that lists the block groups under the CSS selector "div#rList3 a". When I run this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
tracts = ['http://www.usa.com/NY023970800.html','http://www.usa.com/NY023970900.html',
'http://www.usa.com/NY023970600.html','http://www.usa.com/NY023970700.html',
'http://www.usa.com/NY023970500.html']
class Scrape:
    def scrapeTracts(self):
        for i in tracts:
            html = urlopen(i)
            soup = BeautifulSoup(html.read(), 'lxml')
            bgs = soup.select("div#rList3 a")
            print(bgs)
s = Scrape()
s.scrapeTracts()
This gives me an output that looks like: [<a href="/NY0239708001.html">NY0239708001</a>] (with most of the links cut out for the sake of the length of this post). My question is: how can I get just the string after href, in this case /NY0239708001.html?
You can do this mostly in one line:
bgs = [i.attrs.get('href') for i in soup.select("div#rList3 a")]
Output:
['/NY0239708001.html']
['/NY0239709001.html', '/NY0239709002.html', '/NY0239709003.html', '/NY0239709004.html']
['/NY0239706001.html', '/NY0239706002.html', '/NY0239706003.html', '/NY0239706004.html']
['/NY0239707001.html', '/NY0239707002.html', '/NY0239707003.html', '/NY0239707004.html', '/NY0239707005.html']
['/NY0239705001.html', '/NY0239705002.html', '/NY0239705003.html', '/NY0239705004.html']
Each node has an attrs dictionary which contains the attributes of that node...including CSS classes, or in this case, the href.
hrefs = []
for bg in bgs:
    hrefs.append(bg.attrs['href'])
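If you need full URLs rather than the site-relative paths, a small sketch of my own using urljoin (the base URL is taken from the tract pages above):
from urllib.parse import urljoin
full_urls = [urljoin('http://www.usa.com/', bg.attrs['href']) for bg in bgs]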
I am trying to access content in certain td tags with Python and BeautifulSoup. I can either get the first td tag meeting the criteria (with find), or all of them (with findAll).
Now, I could just use findAll, get them all, and get the content I want out of them, but that seems inefficient (even if I put limits on the search). Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Here's my code so far:
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
br = Browser()
url = "http://finance.yahoo.com/q/ks?s=goog+Key+Statistics"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
td = soup.findAll("td", {'class': 'yfnc_tablehead1'})
for x in range(len(td)):
    var1 = td[x]
    var2 = var1.contents[0]
    print(var2)
Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Well...
all_tds = soup.findAll("td", {'class': 'yfnc_tablehead1'})
print(all_tds[3])
...there is no other way.
find and findAll are very flexible; the BeautifulSoup.findAll docs say:
5. You can pass in a callable object which takes a Tag object as its only argument, and returns a boolean. Every Tag object that findAll encounters will be passed into this object, and if the call returns True then the tag is considered to match.
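So for the case above you could push the matching into a callable instead; a minimal sketch (the is_stat_header name is my own, and note that this old BeautifulSoup 3 returns the class attribute as a plain string):
def is_stat_header(tag):
    # match <td> tags whose class attribute is exactly 'yfnc_tablehead1'
    return tag.name == 'td' and tag.get('class') == 'yfnc_tablehead1'

tds = soup.findAll(is_stat_header)
print(tds[2].contents[0])  # the third matching cell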