Beautiful Soup: Get the Contents of Sub-Nodes


I have the following Python code:
import urllib2
from BeautifulSoup import BeautifulSoup
def scrapeSite(urlToCheck):
    html = urllib2.urlopen(urlToCheck).read()
    soup = BeautifulSoup(html)
    tdtags = soup.findAll('td', { "class" : "c" })
    for t in tdtags:
        print t.encode('latin1')
This returns the following HTML:
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
I'd like to get the text inside the a node (e.g. FOO or BAR), which I expected to be t.contents.contents. Unfortunately, it isn't that easy :)
Does anyone have an idea how to solve that?
Thanks a lot, any help is appreciated!
Cheers,
Joseph

In this case, you can use t.contents[1].contents[0] to get FOO and BAR.
contents returns a list of all child elements (Tags and NavigableStrings). If you print t.contents, you will see something like
[u'\n', <a href="more.asp">FOO</a>, u'\n']
So, to reach the actual tag you need to access contents[1] (this holds only for exactly this markup; the index can vary with the source HTML). Once you have found the right index, contents[0] gives you the string inside the a tag.
Because this depends on the exact contents of the HTML source, it is very fragile. A more generic and robust solution is to call find() again to locate the 'a' tag, via t.find('a'), and then read its contents list: t.find('a').contents[0] for the first value, or just t.find('a').contents for the whole list.
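As a rough illustration of the find() approach, here is a minimal sketch. It assumes Python 3 and the current bs4 package rather than the BeautifulSoup 3 import used in the question, and the href values are sample data assumed for the example.

```python
from bs4 import BeautifulSoup

# Sample markup; the href values are assumed for illustration.
html = """
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

texts = []
for td in soup.find_all("td", class_="c"):
    a = td.find("a")          # None if this cell has no link
    if a is not None:
        texts.append(a.get_text(strip=True))

print(texts)
```

Checking for None before reading the text keeps the loop from failing on cells that contain no link.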

For your specific example, pyparsing's makeHTMLTags can be useful, since it tolerates many of the variations found in HTML tags while still giving the results a handy structure:
html = """
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
<td class="d">
BAZZ
</td>
"""
from pyparsing import makeHTMLTags, withAttribute, SkipTo
td, tdEnd = makeHTMLTags("td")
a, aEnd = makeHTMLTags("a")
# only accept <td> tags whose class attribute is "c"
td.setParseAction(withAttribute(**{"class": "c"}))
pattern = td + a("anchor") + SkipTo(aEnd)("aBody") + aEnd + tdEnd
for t, _, _ in pattern.scanString(html):
    print(t.aBody, '->', t.anchor.href)
prints:
FOO -> more.asp
BAR -> alotmore.asp

Related

Python, BeautifulSoup: Only one CSV row returned or keep getting "AttributeError: 'NoneType' object has no attribute 'text'" when parsing HTML table

UPDATE: HedgeHog's answer worked. To overcome the numpy issue, I uninstalled numpy-1.19.4 and installed the previous version numpy-1.19.3.
[Python 3.9.0 and BeautifulSoup 4.9.0.]
I am trying to use the BeautifulSoup library in Python to parse the HTML table found on the Department of Justice's Office of Legal Counsel website, and write the data to a CSV file. The table can be found at https://www.justice.gov/olc/opinions?keys=&items_per_page=40.
The table is deeply nested within 11 <div> elements. The abridged prettified version of the HTML up to the table's location is:
<html>
<body>
<section>
<11 continually nested div elements>
...
<table>
</table>
...
</divs>
</section>
</body>
</html>
The table is a simple three-column table, topped with a header row (which is inside a <thead> element), as shown below:
Date | Title | Headnotes
01/19/2021 | Preemption of State and Local Requirements Under a PREP Act Declaration | The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
The <tr> elements have one of four different classes:
<tr class="odd views-row-first"> - This only exists on the very first row after the header row.
<tr class="even"> - appears on every even table row
<tr class="odd"> - appears on every odd row after the first row
<tr class="even views-row-last"> - appears on the very last row (the user can choose to see 10, 20, or 40 items per page, which means the last row will always be even)
Within the <tr> elements, naturally, each <td> element corresponds to one of the data types (date, title, headnotes). Notwithstanding the specific <tr> class, each table row follows the same general format:
<tr class="odd-or-even/first-or-last">
<td class="views-field views-field-field-opinion-post-date active">
<span class="date-display-single" . . . >
01/01/1970
</span>
</td>
<td class="views-field views-field-field-opinion-attachment-file">
<a href="/olc/files/file-number/download">
Title
</a>
</td>
<td class="views-field views-field-field-opinion-overview">
<p>
Headnotes
</p>
<p>
Some headnotes have multiple paragraph elements.
</p>
</td>
</tr>
All of the Python scripts I have used have started with this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
f = open("olc-op.csv", "w", encoding="utf-8")
headers = "Date, Title, Headnotes \n"
f.write(headers)
My tinkering has primarily been focused on the find_all() argument and the for loop.
The problem I am having is that I am either getting only a single row in my CSV file or the error in the title to this post.
Since all of the <td> elements I want to scrape are within the <tbody> element, I ran tbody through find_all():
results = soup.find_all("tbody")
In the for loop I specified <td> as the element, followed by the class name applied to each data:
for result in results:
    date = result.find("td", class_="views-field views-field-field-opinion-post-date active").text
    title = result.find("td", class_="views-field views-field-field-opinion-attachment-file").text
    headnotes = result.find("td", class_="views-field views-field-field-opinion-overview").text
    data = date + "," + title + "," + headnotes
    f.write(data)
The output of the above code in the CSV file is:
Date,Title,Headnotes
01/19/2021 ,
Preemption of State and Local Requirements Under a PREP Act Declaration ,
The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
Yes, the data is technically separated by commas, but not in the way I intended. There is also some unneeded whitespace after the header row.
I replaced the .text at the end of the .find() statements with .striped_strings, which returned the following TypeError:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
To try to overcome this error, I changed f.write(data) to f.write(str(data)) in the for loop, and received the same TypeError.
I did some further research, and changed the end of each variable in the for loop from .striped_strings to .get_text(strip=True). I also changed my f.write() statement to
f.write(date + "," + title + "," + headnotes)
These changes yielded one perfectly scraped table row, in addition to the header row:
Date, Title, Headnotes
01/19/2021,Preemption of State and Local Requirements Under a PREP Act Declaration,The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
But I obviously wanted to loop over the entire table and get all of the table rows.
The second-to-last thing I tried was to be more specific in the find_all() statement. I changed it from tbody to tr with no class specified, so that it would (I thought) return all of the <tr> elements, which I could then parse for the specific <td> element. Instead, I got this error:
AttributeError: 'NoneType' object has no attribute 'get_text'
The final change I made was to change .get_text(strip=True) back to .text, which resulted in the error in the title of this post:
AttributeError: 'NoneType' object has no attribute 'text'
Where have I gone wrong?

Alternative: use pandas
Always ask yourself: is there an easier way to reach my goal?
There is. You can simply use pandas to do it in two lines. In your case it does everything for you:
Requesting the URL
Searching for the table and scraping its contents
Writing the results to a CSV file
I will also go through your question and try to answer it.
Example
import pandas as pd
pd.read_html('https://www.justice.gov/olc/opinions?keys=&items_per_page=40')[0].to_csv('olc-op.csv', index=False)
But to answer your question
Given the effort you put into asking your question, I will go the extra mile and explain what happened.
There are two major points that prevented you from reaching your goal.
Selecting the right things
The reason there is only one line in your CSV is this call:
soup.find_all("tbody")
Your loop runs only once, because there is only one tbody. You figured out the structure and talked about the <tr> elements, but did not select them for looping.
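To make the difference concrete, here is a small sketch on a hypothetical two-row table. It assumes the header cells are <th> elements, which the question does not state. It also hints at where the None in the reported errors may come from: find() returns None when nothing matches, and Beautiful Soup treats an unknown attribute such as the misspelled .striped_strings as a search for a child tag of that name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page.
html = """
<table>
  <thead>
    <tr><th>Date</th><th>Title</th><th>Headnotes</th></tr>
  </thead>
  <tbody>
    <tr class="odd views-row-first"><td>01/01/1970</td><td>Title A</td><td>Headnotes A</td></tr>
    <tr class="even views-row-last"><td>01/02/1970</td><td>Title B</td><td>Headnotes B</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tbodies = soup.find_all("tbody")     # a single element, so a loop over it runs once
all_rows = soup.find_all("tr")       # includes the header row inside <thead>
body_rows = soup.select("tbody tr")  # data rows only

header_td = all_rows[0].find("td")     # the header row has no <td> here, so this is None
first_td = body_rows[0].find("td")
misspelled = first_td.striped_strings  # unknown attribute, looked up as a child tag: None

print(len(tbodies), len(all_rows), len(body_rows))
print(header_td, misspelled)
print(list(first_td.stripped_strings))
```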
Writing your lines
Even if you had fixed the above, you would still have found only one line in the CSV, because the \n was missing from your string.
I hope that helps you understand what went wrong, and that you can use it in cases where pandas won't work, e.g. because of dynamically served content.
Example
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
# "w" starts a fresh file on every run, so the header is written only once
with open("olc-op.csv", "w", encoding="utf-8") as f:
    headers = "Date, Title, Headnotes \n"
    f.write(headers)
    # loop over the data rows only
    for result in soup.select("tbody tr"):
        tds = result.find_all("td")
        date = tds[0].get_text(strip=True)
        title = tds[1].get_text(strip=True)
        headnotes = tds[2].get_text(strip=True)
        # end each record with a newline so every row gets its own line
        data = date + "," + title + "," + headnotes + '\n'
        f.write(data)
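One caveat with joining fields by hand is that a comma inside a field (the headnotes contain several) splits that field when the file is read back. A possible variation, sketched here with the standard csv module and a hypothetical shortened sample row, lets the writer handle the quoting. An in-memory buffer stands in for the real file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample row; the headnote text is shortened for illustration.
rows = [
    (
        "01/19/2021",
        "Preemption of State and Local Requirements Under a PREP Act Declaration",
        "The Act preempts state or local requirements, such as state licensing laws.",
    ),
]

# Stands in for open("olc-op.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Date", "Title", "Headnotes"])
writer.writerows(rows)

# Reading the text back yields three fields per row, with the commas intact.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[1][2])
```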

Find html-tag with one and only one attribute with BeautifulSoup

I have an HTML page that I want to scrape some data from. The HTML looks like this:
<p class="provice hidden-xs">
<span class="provice-mobile">NEW YORK</span>
witespace
<span class="provice-mobile" style="color: #8888 !important">UNION</span>
</p>
I just want to choose "NEW YORK", and I tried this code:
city = soup.find('span', attrs={'class':'provice-mobile'})
city.text also includes "UNION", but I only want the span tag whose sole attribute is:
'class': 'provice-mobile'

If I understand your question correctly, you are looking for the span tags whose only attribute is class="provice-mobile". I suggest you start by finding all the tags that have that attribute and then filter out the ones that have more than that one attribute, i.e. keep only the tags with exactly one attribute.
The code to accomplish this could look like this:
results = soup.findAll('span', attrs = {'class':'provice-mobile'})
results = [tag for tag in results if len(tag.attrs) == 1]
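The same filter can also be written as a single predicate passed to find_all(). The sketch below is one way to do that, assuming bs4 with the built-in html.parser; the helper name only_class is made up for the example. Note that bs4 stores the class attribute as a list of class names.

```python
from bs4 import BeautifulSoup

html = """
<p class="provice hidden-xs">
<span class="provice-mobile">NEW YORK</span>
<span class="provice-mobile" style="color: #8888 !important">UNION</span>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

def only_class(tag):
    # True for <span> tags whose one and only attribute is class="provice-mobile".
    return tag.name == "span" and tag.attrs == {"class": ["provice-mobile"]}

matches = soup.find_all(only_class)
names = [m.get_text(strip=True) for m in matches]
print(names)
```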

find_all does not find text in mixed content

I have a little bit of screen scraping code in Python, using BeautifulSoup, that is giving me a headache. A small change to the HTML made my code break, but I can't see why it fails to work. This is basically a demo of how the HTML looked when parsed:
soup=BeautifulSoup("""
<td>
<a href="https://alink.com">
Foo Some text Bar
</a>
</td>
""")
links = soup.find_all('a',text=re.compile('Some text'))
links[0]['href'] # => "https://alink.com"
After an upgrade, the a tag body now includes an img tag, which makes the code break.
<td>
<a href="https://alink.com">
<img src="dummy.gif" >
Foo Some text Bar
</a>
</td>
'links' is now an empty list, so the regex is not finding anything.
I hacked around it by matching on the text alone, then finding its parent, but that seems even more fragile:
links = soup.find_all(text=re.compile('Some text'))
links[0].parent['href'] # => "https://alink.com"
Why does adding an img tag as a sibling to the text content break the search done by BeautifulSoup, and is there a way of modifying the first code to work?

The difference is that the second example has an incomplete img tag. It should be either
<img src="dummy.gif" />
Foo Some text Bar
or
<img src="dummy.gif" > </img>
Foo Some text Bar
Instead, it is parsed as
<img src="dummy.gif" >
Foo Some text Bar
</img>
So the element found is no longer <a> but <img>, whose parent is <a>.

The first example works only if a.string is not None i.e., iff the text is the only child.
As a workaround, you could use a function predicate:
a = soup.find(lambda tag: tag.name == 'a' and tag.has_attr('href') and 'Some text' in tag.text)
print(a['href'])
# -> 'https://alink.com'
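A self-contained sketch of this behaviour follows, assuming bs4 with the built-in html.parser (other parsers may build a slightly different tree). It uses string=, the newer name for the text= argument in recent bs4 releases.

```python
import re
from bs4 import BeautifulSoup

html = """
<td>
<a href="https://alink.com">
<img src="dummy.gif" >
Foo Some text Bar
</a>
</td>
"""
soup = BeautifulSoup(html, "html.parser")

a = soup.find("a")
# With more than one child (the img and the text), .string is None.
print(a.string)

# Filtering by string relies on .string, so nothing matches here.
by_string = soup.find_all("a", string=re.compile("Some text"))

# A predicate on the combined text still finds the link.
by_predicate = soup.find_all(
    lambda tag: tag.name == "a"
    and tag.has_attr("href")
    and "Some text" in tag.get_text()
)
print(len(by_string), by_predicate[0]["href"])
```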

Duplicates when extracting data from html table using lxml.html.xpath()

I am trying to extract data from this table at ESPNcricinfo.
Each row has the following format (data replaced by headers):
<tr class="data1">
<td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td>
<td>Score</td>
<td>Minutes Played</td>
<td nowrap="nowrap">Balls Faced</td>
<td etc...
</tr>
I have used the following code in a python script to capture the values in the table:
bats = content.xpath('//tr[@class="data1"]/td[1]/a')
cntry = content.xpath('//tr[@class="data1"]/td[1]/*')
run = content.xpath('//tr[@class="data1"]/td[2]')
mins = content.xpath('//tr[@class="data1"]/td[3]')
bf = content.xpath('//tr[@class="data1"]/td[4]')
The data is then put into a csv file for storage.
All of the data is captured successfully apart from the player's country. The player name and country are stored inside the same <td> tag; however, the player name is also inside an <a> tag, which makes it easy to capture. My problem is that the value captured for the player's country (the cntry variable above) is the player's name. I am sure that the code is incorrect, but I am not sure why.

Where you have:
cntry = content.xpath('//tr[@class="data1"]/td[1]/*')
the '*' selects the child elements and skips any text.
You can replace that line with the following to grab the text instead of the tags:
cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')
See if that works for you.
EDIT
To remove the whitespace at the beginning of each item, do the following:
cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')
cntry = [str(x).strip() for x in cntry]
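As a quick check of the difference between /* and /text(), here is a minimal sketch on a one-row table built from the row format in the question. It assumes lxml is installed; the cell values are the placeholder headers, not real data.

```python
import lxml.html

# One row in the format described in the question (placeholder values).
html = """
<table>
<tr class="data1">
<td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td>
<td>Score</td>
</tr>
</table>
"""
content = lxml.html.fromstring(html)

# '*' selects child elements only, so this returns the <a> element.
children = content.xpath('//tr[@class="data1"]/td[1]/*')

# text() selects the text nodes that sit directly inside the <td>.
cntry = [t.strip() for t in content.xpath('//tr[@class="data1"]/td[1]/text()')]
bats = content.xpath('//tr[@class="data1"]/td[1]/a/text()')

print(children[0].tag, bats, cntry)
```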

Python regex help

I am trying to make a regex that finds all names, URLs and phone numbers in an HTML page.
But I'm having trouble with the phone number part. I think the problem is that it searches until it finds </strong>, but in the process it skips people instead of producing an empty string when a person has no phone number. Put simply, instead of a list like url1+name1+num1 | url2+name2+"" | url3+name3+num3, it returns a list like url1+name1+num1 | url2+name2+num3, with url3+name3 dropped in the process.
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):
I am searching for people in a single very long line. A person may or may not have a URL or a phone number.
An example of a person with a URL and a phone number:
<tr> <td class="lablinksName"><div> dr. Ivan Bratko akad. prof.</div></td> <td class="lablinksMail"><img src="/Static/images/gui/mail.gif" height="8" width="11"></td> <td class="lablinksPhone"><div><strong>T:</strong> +386 1 4768 393 </div></td> </tr>
And an example of a person with no URL or phone number:
<tr> <td class="lablinksName"><div> dr. Branko Matjaž Jurič prof.</div></td> <td class="lablinksMail"><img src="/Static/images/gui/mail.gif" height="8" width="11"></td> <td class="lablinksPhone"><div> </div></td> </tr>
I hope I was clear enough, and that someone can help me.

import lxml.html
root = lxml.html.parse("http://my.example.com/page.html").getroot()
rows = root.xpath("//table[@id='contactinfo']/tr")
for r in rows:
    # the name is either plain text in the div or the text of a link inside it
    nameText = r.xpath("td[@class='lablinksName']/div/text() | td[@class='lablinksName']/div/a/text()")
    name = ''.join(nameText).strip()
    urls = r.xpath("td[@class='lablinksName']/div/a/@href")
    url = urls[0] if urls else ''
    phoneText = r.xpath("td[@class='lablinksPhone']/div/text()")
    phone = ''.join(phoneText).strip()
    print(name, url, phone)
For the purpose of this code, I assume <table id="contactinfo">{your table rows}</table>.

The quick and dirty way to fix it:
Replace
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):
with
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page.replace("<tr>","\n")):
The issue is that the .*? in .*?</strong> can match strings containing td class="lablinksMail", so it can run on into a later row. It cannot match \n. Any time you use . in a regex (rather than [^<]), this kind of annoyance tends to happen.
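The effect can be reproduced on a shortened, hypothetical two-person page (names A and B and the number 123 are made up; the first person has no phone number). This is only a sketch of the behaviour described above, using the same pattern.

```python
import re

pattern = r'Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?'

# Two hypothetical rows on one line; only the second person has a phone number.
page = (
    '<tr> <td class="lablinksName"><div> A</div></td> '
    '<td class="lablinksPhone"><div> </div></td> </tr>'
    '<tr> <td class="lablinksName"><div> B</div></td> '
    '<td class="lablinksPhone"><div><strong>T:</strong> 123 </div></td> </tr>'
)

# On a single line, .*? runs on into the next row: A gets B's number and B is lost.
single_line = re.findall(pattern, page)

# With one row per line, . cannot cross the newline, so A gets an empty number.
per_row = re.findall(pattern, page.replace("<tr>", "\n"))

print(single_line)
print(per_row)
```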

If you're having this kind of difficulty, it's usually a good sign you're using the wrong approach. In particular, if I were doing this via regexp, I wouldn't even try unless the line in question had the "<td class="lablinksPhone">" tag in it.

Looks like a job for Beautiful Soup.
I love the quote: "You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
