Python - Scraping text inside <br> which is not under a <p>

I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY which are contained inside a "p" tag:
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("p")
and then playing with the length of the element:
for bullet in column:
    if len(bullet.find_all("br")) == 4:
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()
However, I have two issues.
First, I am unable to scrape the information for the chairperson (GUDJONSSON), which is not contained inside a "p" tag. I was trying something like:
soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()
but it is not working.
Second, I am unable to differentiate between the last two persons (WONDRAK and GERSCH), because they are both contained inside the same "p" tag.
Any help would be extremely useful! Thanks in advance!

This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not well structured for parsing: it doesn't follow a uniform pattern. The html5lib package generally handles poorly formatted HTML better than html.parser, but it didn't help significantly in this case.
import re
from typing import Collection, Iterator

from bs4 import BeautifulSoup


def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str


def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []
    return people


def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]
This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.
The group_people function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted html would be more conducive to that approach.
The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.
Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use None when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to - you may need to adjust those for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
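For reference, a minimal end-to-end sketch of wiring this to the question's Selenium setup (it assumes driver has already been created as in the question; everything else is the code above):
from bs4 import BeautifulSoup

url = 'https://public.era.nih.gov/pubroster/roster.era?CID=102353'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# group_people handles the chairperson (no <p> tag) and splits the shared
# <p> tag, because it works line-by-line rather than tag-by-tag
raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
for person in map(normalize_person, raw_people):
    print(person)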

Related

BeautifulSoup find_all('href') returns only part of the value

I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (I don't want to get any of the crew), and this question is specifically about getting the person's internal ID. I already have peoples' names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded url to get the code right.
On examination of the links I was able to find that the links for the actors look like this.
William Shatner
Leonard Nimoy
Nicholas Guest
while the ones for other contributors look like this
Nicholas Meyer
Gene Roddenberry
This should allow me to differentiate actors/actresses from crew like the director or writer by checking for the end of the href being "t[0-9]+$" rather than the same but with "dr" or "wr".
Here's the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'


def clearLists(n):
    return [[] for _ in range(n)]


def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)


def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)

    #get all the tags with links in them
    link_tags = soupObject.find_all('a')

    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)

    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs


newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
The above code returns an empty list for the IDs, and if you uncomment the print statement you can see that it's because the value that gets put in the "link" variable ends up being what's below (with variations for the specific people)
/name/nm0583292/
/name/nm0000638/
That won't do. I want the IDs only for the actors and actresses so that I can use those IDs later.
I've tried to find other answers on stackoverflow; I haven't been able to find this particular issue.
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.
How can I make sure I get only the name IDs that I want (just the actor ones) from the page?
(Also, feel free to offer suggestions to tighten up the code)
It appears that the links you are trying to match have either been modified by JavaScript after loading, or perhaps get loaded differently based on other variables than the URL alone (like cookies or headers).
However, since you're only after links of people in the cast, an easier way would be to simply match the ids of people in the cast section. This is actually fairly straightforward, since they are all in a single element, <table class="cast_list">
So:
import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'

# The f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

# this is overly fancy for something as simple as initialising some variables
# how about:
#     a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    #     return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason:
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once,
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is a string, so casting with str() serves no purpose
        href = link_tag.get('href')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()
Result:
{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.
And the code with a few more tweaks and without the commentary on your code:
import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href'))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'
    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()

Getting the text value from cells of a table when scraping recursive structure

I'm looking for a function that, given a pair of HTML tags, returns the text inside them. Ideally I would like it to be recursive:
Examples:
Given
Asset management
returns
Asset management
Given
<p>Recursive Asset management</p>
returns
Recursive Asset management
Given
<p>Again Asset management</p>
returns
Again Asset management
Here is the code I have:
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows[1:]:
    th_list = tr.find("th")
    td_list = tr.find("td")
    if th_list is None or td_list is None:
        continue
    th_str = th_list.text
    td_str = td_list.contents
    # NOW THE PROBLEM IS td_str IS A LIST OF A BUNCH OF THINGS:
    # PLAIN TEXT, BR TAGS, LINKS, PARAGRAPHS, ETC.
    # I WANT TO BE ABLE TO GET THE PLAIN TEXT FOR LINKS AND PARAGRAPHS
    for element in td_str:
        if element == "<br/":
            continue
        # here...
The input should be a String, not a Tag or any other object. My trouble is the recursion.
UPDATE: This is an example of the data I am actually working with. The goal is to pull information from Wikipedia Infoboxes. The problem is some of the information in the Infobox are links or paragraphs. For example, in this page: https://en.wikipedia.org/wiki/Goldman_Sachs
<tr><th scope="row" style="padding-right:0.5em;">Founders</th><td
class="agent" style="line-height:1.35em;">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr><tr>
Let's say we want to find who the Founders are. I only want the text in the elements. In this case a list containing Marcus Goldman and Samuel Sachs. I have also tried read_html from Pandas, but that concatenates the strings together and I don't want that to happen (its output is "Marcus GoldmanSamuel Sachs")
Here's an example of using .findChildren. It's not the full solution, but you can possibly use this to add on to @Bitto Bennichan's solution.
import bs4

html = '''<tr><th scope="row" style="padding-right:0.5em;">Founders</th><td
class="agent" style="line-height:1.35em;">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr><tr>'''

soup = bs4.BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

founders = []
for row in rows:
    children = row.findChildren("a", recursive=True, text=True)
    for child in children:
        child_text = child.text.split('\n')
        child_text = [x.strip() for x in child_text]
        child_text = ' '.join(child_text)
        founders.append(child_text)
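Note that findChildren("a", ...) only picks up the linked names; in this snippet Marcus Goldman is bare text inside the td, not a link. One way to catch both (a sketch against the same soup built above; the trailing-period cleanup is specific to this snippet) is to walk the td's stripped_strings instead:
founders = []
for td in soup.find_all('td', class_='agent'):
    for text in td.stripped_strings:
        # collapse internal newlines and drop the stray " ." after
        # Marcus Goldman in this particular snippet
        name = ' '.join(text.split()).rstrip(' .')
        if name:
            founders.append(name)

print(founders)  # ['Marcus Goldman', 'Samuel Sachs']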

Beautifulsoup scrape content of a cell beside another one

I am trying to scrape the content of a cell beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl", etc. The needed content is always in the right-hand cell of the infobox table.
The basic code is the following one, but I am stuck with it:
import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
staatsform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive perhaps but more likely to be. Having found that I looked for tr elements under 'Basisdaten' until I found another tr including a (presumed different) heading. In this case, I search for 'Postleitzahlen:' but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just new lines which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in code.
import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')


def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'


basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
Result like this: two separate td elements, as requested.
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
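If you only want the plain text of the right-hand cell rather than the element, joining its string fragments collapses the <br/> breaks (a small follow-up using the items found above):
# plain-text variant: the <br/> tags drop out and the fragments join up
print(' '.join(items[1].stripped_strings))
# -> 20095–21149, 22041–22769, 27499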
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # ...but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
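Usage would look something like this, assuming soup has been built from the Hamburg page as in the question (the exact label strings vary between articles, which is what the colon-toggling fallback above is for):
# works with or without the trailing colon, thanks to the fallback
print(get_content_from_right_column_for_left_column_containing('Amtssprache:'))
print(get_content_from_right_column_for_left_column_containing('Postleitzahlen:'))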

Using BeautifulSoup to find a tag and evaluate whether it fits some criteria

I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a "style4" p class, and the third value is the next item on the website with a "style5" p class. The logic I'm trying to introduce is: look for the first "style4" and write the associated string into the text file, then find the next "style5" and write the associated string into the text file. Then, look for the next p class. If it's "style4", start a new line; if it's another "style5", write it into the text file with the first style5 entry but separated with a comma (alternatively, the program could just skip the next style5).
I'm stuck on the part of the program in bold. That is, getting the program to look for the next p class and evaluate it against style4 and style5. Since I was having problems with finding and evaluating the p class tag, I chose to pull my code out of the loop and just try to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure how you are expecting your output to be. I am assuming that you are trying to get the data in the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with alphabets as the keys and an inner dict as the value. The inner dict has the vendor name as key and category information as its value:
output_dict = {}
current_alphabet = ""
current_vendor = ""

for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
u'Wenger Corporation': [u'Musical Instruments and Equipment'],
u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
u'Witt Company': [u'Interactive Technology']
},
u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
}
}
Skipping the earlier parts for brevity. Now it is just a matter of accessing this dictionary and writing out to a tab separated file.
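As a rough sketch of that final step (Python 2 to match the code above; the file name is an arbitrary choice):
import csv

with open('contracts.tsv', 'wb') as ofile:
    writer = csv.writer(ofile, delimiter='\t')
    for alphabet, vendors in output_dict.items():
        for vendor, categories in vendors.items():
            # encode before writing, since the scraped values are unicode in Python 2
            writer.writerow([alphabet.encode('utf-8'),
                             vendor.encode('utf-8'),
                             ', '.join(categories).encode('utf-8')])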
Hope this helps.
Agree with @shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)

primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate style4 and style5 tags right away. I did not bother going for the style6 or the alphabet headers. We then get the text inside each tag. If the text is not a whitespace of sorts (this is all over the tables, probably obfuscation or bad mark-up), we then check whether it's style4 or style5. If it's the former, we assign it as a key to a blank list. If it's the latter, we append it to the blank list of the most recent key. Obviously the key changes only when we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Due to the dictionary being unsorted, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you. That's simple enough. Hope this helps!
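For what it's worth, the tab-delimited version is just a delimiter argument on the same writer (same final_list as above):
with open("products.tsv", "wb") as ofile:
    f = csv.writer(ofile, delimiter="\t")
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])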

Update: How to parse HTML with Python/BeautifulSoup

First, I'm pretty new to Python. I'm trying to scrape contact information from offline websites and output the info to a CSV. I'd like to grab the page URL (not sure how to do this from the HTML), email, phone, location data if possible, any names, any phone numbers, and the tag line for the HTML site if it exists.
Updated #2 code:
import os, csv, re
from bs4 import BeautifulSoup

topdir = 'C:\\projects\\training\\html'

output = csv.writer(open("scrape.csv", "wb+"))
output.writerow(["headline", "name", "email", "phone", "location", "url"])
all_contacts = []

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            soup = BeautifulSoup(f)

def mailto_link(soup):
    if soup.name != 'a':
        return None
    for key, value in soup.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)', value)
            if m:
                all_contacts.append(m)
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = []
    for li in soup.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

output.writerow([all_contacts])
print "Finished"
This output currently doesn't return anything other than the row headers. What am I missing here? It should at least return some info from the HTML file, which is this page: http://bendoeslife.tumblr.com/about
There are (at least) two problems here.
First, f is a filename, not the file contents, or the Soup made from those contents. So, f.find('h2') is going to find 'h2' within the filename, which isn't very useful.
Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index is just going to give you the string version of a number. For example:
>>> s = 'A string with an h2 in it'
>>> i = s.find('h2')
>>> str(i)
'17'
So, your code is doing something like this:
>>> f = 'C:\\python\\training\\offline\\somehtml.html'
>>> headline = f.find('h2')
>>> str(headline)
'-1'
You probably want to call methods on the soup object, rather than f. BeautifulSoup.find returns a "sub-tree" of the soup, which is exactly what you want to stringify here.
However, it's impossible to test that without your sample input, so I can't promise that's the only problem in your code.
Meanwhile, when you get stuck with something like this, you should try printing out intermediate values. Print out f, and headline, and headline2, and it will be much more obvious why headline3 is wrong.
Just replacing the f with soup in the find calls, and fixing your indentation error, running against your sample file http://bendoeslife.tumblr.com/about now works.
It doesn't do anything all that useful, however. Since there's no h2 tag anywhere in the file, headline ends up as None. And the same goes for most of the other fields. The only thing that does find anything is url, because you're asking it to find an empty string, which will find something arbitrary. With three different parsers, I get <p>about</p> or <html><body><p>about</p></body></html>, and <html><body></body></html>…
You need to actually understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with a title of "Email", with an <li> element with an id of "email". So, you need to write a find to locate it based on one of those criteria, or something else it actually matches.
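As a sketch of that last point, something along these lines should work, though the selectors here assume the structure described above (an <li> with id "email" wrapping an <a> titled "Email") and may need adjusting against the real markup:
# try the id on the <li> first, then fall back to the title on the <a>;
# both hooks are assumptions based on the structure described above
li = soup.find('li', id='email')
link = li.find('a') if li else soup.find('a', title='Email')
if link and link.get('href', '').startswith('mailto:'):
    email = link['href'][len('mailto:'):]
    print(email)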
