BeautifulSoup: Scraping answers from form - python

I need to scrape the answers to the questions from the following link, including the check boxes.
Here's what I have so far:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)
The following gives me all the written answers, if there are any:
soup.find_all('span', {'class':'PrintHistRed'})
and I think I can piece together all the checkbox answers from this:
soup.find_all('img')
but these aren't going to be ordered correctly, because this doesn't pick up the "No Information Filed" answers that aren't written in red.
I also feel like there's a much better way to be doing this. Ideally I want (for the first 6 questions) to return:
['APEX INVESTMENT FUND, V, L.P',
'805-2054766781',
'Delaware',
'United States',
'APEX MANAGEMENT V, LLC',
'X',
'O',
'No Information Filed',
'NO',
'NO']
EDIT
Martin's answer below seems to do the trick, however when I put it in a loop, the results begin to change after the 3rd iteration. Any ideas how to fix this?
from bs4 import BeautifulSoup
import requests
import re
for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "lxml")
    tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
    tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:]) # 2: skip "are you an adviser" at the top
    tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])
    output = []
    for entry in sorted(tags):
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)
    print output[:9]

The website does not generate any of the required HTML via Javascript, so I have chosen to use just requests to get the HTML (which should be faster).
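One approach to solving your problem is to store all the tags for your three different types in a single list. If this is then sorted, the tags should come out in tree order. Be aware that this relies on Python 2's default ordering of objects, which is not guaranteed to follow document order; on Python 3, Tag objects do not support ordering comparisons, so sorted(tags) fails there.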
The first search simply uses your PrintHistRed to get the matching span tags. Secondly it finds all img tags that have alt text containing either the word Radio or Checkbox. Lastly it searches for all locations where No Information Filed is found and returns the parent tag.
The tags can now be sorted and a suitable output array built containing the information in the required format:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:]) # 2: skip "are you an adviser" at the top
tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])
output = []
for entry in sorted(tags):
    if entry.name == 'img':
        alt = entry['alt']
        if 'Radio' in alt:
            output.append('NO' if 'not selected' in alt else 'YES')
        else:
            output.append('O' if 'not checked' in alt else 'X')
    else:
        output.append(entry.text)
print output[:9] # Display the first 9 entries
Giving you:
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', 'X', 'O', u'No Information Filed', 'NO', 'YES']
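If the sorted order ever proves unreliable (for example, when the code is run repeatedly in a loop), one alternative is to collect all three kinds of tag in a single find_all call. find_all accepts a function and returns its matches in document order, so no sorting is needed. The following is only a sketch: the sample HTML is made up for illustration, is_answer and to_answer are hypothetical helper names, and skipping the first two images at the top of the real page is not handled here.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

ALT_PATTERN = re.compile('Radio|Checkbox')
NO_INFO = 'No Information Filed'


def is_answer(tag):
    """Match the three kinds of tag that carry an answer."""
    if tag.name == 'span' and 'PrintHistRed' in tag.get('class', []):
        return True
    if tag.name == 'img' and ALT_PATTERN.search(tag.get('alt', '')):
        return True
    # Only look at direct text children so that ancestors do not match too
    return any(isinstance(child, NavigableString) and child.strip() == NO_INFO
               for child in tag.children)


def to_answer(tag):
    """Convert a matched tag into the requested output value."""
    if tag.name == 'img':
        alt = tag['alt']
        if 'Radio' in alt:
            return 'NO' if 'not selected' in alt else 'YES'
        return 'O' if 'not checked' in alt else 'X'
    return tag.get_text(strip=True)


# Made-up markup that only imitates the structure described above
SAMPLE_HTML = """
<table>
<tr><td><span class="PrintHistRed">EXAMPLE FUND</span></td></tr>
<tr><td><img alt="Checkbox checked"><img alt="Checkbox not checked"></td></tr>
<tr><td><i>No Information Filed</i></td></tr>
<tr><td><img alt="Radio button not selected"></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
answers = [to_answer(tag) for tag in soup.find_all(is_answer)]
print(answers)
```

Because every tag is visited once in tree order, the result does not depend on how Python compares Tag objects.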

I've looked fairly carefully at the HTML. I doubt there is an utterly simple way of scraping pages like this.
I would begin with an analysis, looking for similar questions. For instance, 11 through 16 inclusive can likely be handled in the same way. 19 and 21 appear to be similar. There may or may not be others.
I would work out how to handle each type of similar question as given by the rows containing them. For example, how would I handle 19 and 21? Then I would write code to identify the rows for the questions noting the question number for each. Finally I would use the appropriate code using the row number to winkle out information from it. In other words, when I encountered question 19 I'd use the code meant for either 19 or 21.
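The idea of one handler per group of similar questions could be sketched as a dispatch table keyed by question number. Everything below is hypothetical: the handler names, the row representation (a question number plus a list of cell strings) and the sample rows are assumptions made for illustration, not taken from the real page.

```python
def parse_text_answer(cells):
    """Hypothetical handler: the answer is the text of the last cell."""
    return cells[-1].strip()


def parse_yes_no(cells):
    """Hypothetical handler: the last cell holds a radio button's alt text."""
    return 'NO' if 'not selected' in cells[-1] else 'YES'


# Questions 11 through 16 share one handler; 19 and 21 share another
HANDLERS = {number: parse_text_answer for number in range(11, 17)}
HANDLERS.update({19: parse_yes_no, 21: parse_yes_no})


def parse_rows(rows):
    """Apply the matching handler to each (question_number, cells) pair."""
    results = {}
    for number, cells in rows:
        handler = HANDLERS.get(number)
        if handler is not None:  # skip questions with no handler yet
            results[number] = handler(cells)
    return results


# Made-up rows standing in for rows identified on the page
sample_rows = [
    (11, ['A text question', ' EXAMPLE FUND ']),
    (19, ['A yes/no question', 'Radio button not selected']),
    (20, ['A question with no handler yet', '...']),
]
parsed = parse_rows(sample_rows)
print(parsed)
```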

Related

Python - Scraping text inside <br> which is not under a <p>

I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353 and I am able to do it for the names beginning with ANANDASABAPATHY which are contained inside a "p" tag:
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("p")
and then playing with the length of the element:
for bullet in column:
    if len(bullet.find_all("br"))==4:
        person = {}
        person["NAME"]=bullet.contents[0].strip()
        person["PROFESSION"]=bullet.contents[2].strip()
        person["DEPARTMENT"]=bullet.contents[4].strip()
        person["INSTITUTION"]=bullet.contents[6].strip()
        person["LOCATION"]=bullet.contents[8].strip()
However, I have 2 issues.
1. I am unable to scrape the information for the chairperson (GUDJONSSON), which is not contained inside a "p" tag. I was trying something like:
soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()
but it is not working.
2. I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.
Any help would be extremely useful! Thanks in advance!
This is a case where it may be easier to handle processing the data more as plain text than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not very well formatted for parsing / it doesn't follow a very uniform pattern. The html5lib package generally handles poorly formatted html better than html.parser, but it didn't help significantly in this case.
import re
from typing import Collection, Iterator
from bs4 import BeautifulSoup
def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str
def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []
    return people
def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }
raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]
This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.
The group_people function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted html would be more conducive to that approach.
The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.
Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use None when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to - you may need to adjust those for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
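The grouping step can be tried on its own, without any HTML. The sketch below repeats the same zip-code rule on a plain list of strings. The sample lines are invented and only imitate the shape of the page's entries; group_lines is a hypothetical stand-in for group_people that takes lines directly.

```python
import re

# Same rule as above: a line ending in ", <digits>" closes a person's entry
ZIP_CODE_PATTERN = re.compile(r', \d+$')


def group_lines(lines):
    """Split a flat sequence of lines into one list of lines per person."""
    people, person = [], []
    for line in lines:
        person.append(line)
        if ZIP_CODE_PATTERN.search(line):
            people.append(person)
            person = []
    return people


# Invented data: one five-line entry and one three-line entry
sample_lines = [
    'DOE, JANE, PHD',
    'PROFESSOR',
    'DEPARTMENT OF BIOLOGY',
    'EXAMPLE UNIVERSITY',
    'SPRINGFIELD, IL, 62701',
    'ROE, RICHARD, MD',
    'EXAMPLE INSTITUTE',
    'SHELBYVILLE, TN, 37160',
]
groups = group_lines(sample_lines)
for group in groups:
    print(len(group), group[0], '|', group[-1])
```

This also shows why negative indexes are convenient for the institution and location: the entries have different lengths, but the last two lines keep the same meaning.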

BeautifulSoup find_all('href') returns only part of the value

I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (I don't want to get any of the crew), and this question is specifically about getting the person's internal ID. I already have people's names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded url to get the code right.
On examination of the links I was able to find that the links for the actors look like this.
William Shatner
Leonard Nimoy
Nicholas Guest
while the ones for other contributors look like this
Nicholas Meyer
Gene Roddenberry
This should allow me to differentiate actors/actresses from crew like the director or writer by checking for the end of the href being "t[0-9]+$" rather than the same but with "dr" or "wr".
Here's the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re
movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'
def clearLists(n):
    return [[] for _ in range(n)]
def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)
def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)
    #get all the tags with links in them
    link_tags = soupObject.find_all('a')
    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)
    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs
newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
The above code returns an empty list for the IDs, and if you uncomment the print statement you can see that it's because the value that gets put in the "link" variable ends up being what's below (with variations for the specific people)
/name/nm0583292/
/name/nm0000638/
That won't do. I want the IDs only for the actors and actresses so that I can use those IDs later.
I've tried to find other answers on stackoverflow; I haven't been able to find this particular issue.
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.
How can I make sure I get only the name IDs that I want (just the actor ones) from the page?
(Also, feel free to offer suggestions to tighten up the code)
It appears that the links you are trying to match have either been modified by JavaScript after loading, or perhaps get loaded differently based on other variables than the URL alone (like cookies or headers).
However, since you're only after links of people in the cast, an easier way would be to simply match the ids of people in the cast section. This is actually fairly straightforward, since they are all in a single element, <table class="cast_list">
So:
import urllib.request
from bs4 import BeautifulSoup
import re
# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'
# The f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'
# this is overly fancy for something as simple as initialising some variables
# how about:
# a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}
    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    # the whole point of compiling the regex is to only have to do it once,
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')
    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is a string, so casting with str() serves no purpose
        href = link_tag.get('href')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name
    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()
Result:
{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.
And the code with a few more tweaks and without the commentary on your code:
import urllib.request
from bs4 import BeautifulSoup
import re
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    id_regex = re.compile(r'/name/nm(\d+)/')
    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href'))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name
    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'
    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
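To check the extraction logic without hitting the live site, the same idea can be run against a small hand-written snippet. This is a sketch under assumptions: the markup below is invented and only mimics a cast_list table next to a crew table, the IDs and names are placeholders, and get_cast is a hypothetical variant that passes the compiled regex to find_all as an href filter.

```python
import re
from bs4 import BeautifulSoup

ID_REGEX = re.compile(r'/name/nm(\d+)/')

# Invented markup: one cast table (with a thumbnail link) and one crew table
SAMPLE_HTML = """
<table class="cast_list">
  <tr><td><a href="/name/nm0000001/"><img alt="thumb"></a></td>
      <td><a href="/name/nm0000001/"> Example Actor </a></td></tr>
  <tr><td><a href="/name/nm0000002/">Another Actor</a></td></tr>
</table>
<table class="crew">
  <tr><td><a href="/name/nm0000003/">Example Director</a></td></tr>
</table>
"""


def get_cast(soup):
    """Return {id: name} for links inside the cast table only."""
    people = {}
    cast_table = soup.find('table', class_='cast_list')
    # href=ID_REGEX keeps only anchors whose href matches the pattern
    for link in cast_table.find_all('a', href=ID_REGEX):
        name = link.get_text(strip=True)
        if name:  # thumbnail links have no text and are skipped
            people[ID_REGEX.search(link['href']).group(1)] = name
    return people


cast = get_cast(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(cast)
```

Filtering on href in find_all also skips anchors that have no href attribute at all, which would otherwise make the regex search fail on None.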

printing result of 2 for loops in same line

I'm fairly new to web scraping in Python, and after reading most of the tutorials on the topic online I decided to give it a shot. I finally got one site working, but the output is not formatted properly.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")
for div in soup.find_all('h2'): # prints the name of the food
    print(div.text)
for a in soup.find_all('span', {'class' : 'amount'}): # prints the price of the food
    print(a.text)
I want both the name of the food to be printed side by side with the corresponding price of the food, concatenated by a "-" ... Would appreciate any help given, thanks!
Edit: After @Reblochon Masque's comments below, I've run into another problem. As you can see, there is a $0.00, which is a value from the built-in shopping cart on the website. How would I exclude this as an outlier and continue moving down the loop, while ensuring that the other prices "move up" to correspond to the correct food?
Best practice is to use the zip function in the for loop, but it can also be done this way. This just shows that the two lists can be paired up by index.
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for index in range(len(names)):
    print('{} - {}'.format(names[index].text, rest[index].text))
You could maybe zip the two results:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))
    # print(f"{div.text} - {a.text}") # for python >= 3.6
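For the $0.00 shopping-cart value raised in the edit, one possible approach is to drop that entry from the prices before zipping, so the remaining prices line up with the names again. The sketch below uses plain strings in place of the `.text` values; the food names and prices are invented, and it assumes that no real item is priced at $0.00.

```python
# Invented values standing in for [tag.text for tag in ...] from the page
names = ['Chicken Rice Box', 'Fish Fillet Box', 'Vegetarian Box']
prices = ['$0.00', '$6.50', '$7.00', '$5.50']  # first entry is the cart total

# Assumption: only the shopping cart ever shows $0.00
real_prices = [price for price in prices if price.strip() != '$0.00']

lines = ['{} - {}'.format(name, price) for name, price in zip(names, real_prices)]
for line in lines:
    print(line)
```

If a real item could cost $0.00, a safer alternative would be to restrict the search to the element that contains the products, so the cart is never picked up in the first place.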

Beautifulsoup scrape content of a cell beside another one

I am trying to scrape the content of a cell beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl" etc. In the picture the needed content is always in the right cell.
The basic code is the following one, but I am stuck with it:
source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
stastaform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to limit the search to what the English-language Wikipedia calls the 'Infobox'. I therefore searched first for the heading 'Basisdaten', requiring that it be inside a th element. That is not exactly definitive, but it is more likely to be the right one. Having found it, I looked through the tr elements that follow 'Basisdaten' until I reached another tr containing a (presumably different) heading. In this case I search for 'Postleitzahlen:', but the approach makes it possible to find any or all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just new lines which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in code.
import requests
import bs4
page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')
def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'
basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]
wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.next_sibling
# walk the following sibling rows until the next heading row (or the end of the table)
while current is not None:
    if not current.name:
        # bare strings (e.g. newlines) have no tag name; skip them
        current = current.next_sibling
        continue
    if wanted in current.text:
        items = current.find_all('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.next_sibling
Result like this: two separate td elements, as requested.
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
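A row-based variant of the same idea is sketched below: look at each table row, compare the text of its first cell with the wanted label, and return the second cell. The markup and the values are invented placeholders that only imitate the two-column layout, and value_beside is a hypothetical helper name. Comparing labels with the trailing colon stripped is an assumption that covers both the plain-text and the linked labels mentioned above.

```python
from bs4 import BeautifulSoup

# Invented two-column table; the second label is partly a link
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">Basisdaten</th></tr>
  <tr><td>Staatsform:</td><td>Example value A</td></tr>
  <tr><td><a href="#">Amtssprache</a>:</td><td>Example value B</td></tr>
</table>
"""


def value_beside(soup, label):
    """Return the text of the cell to the right of the cell named `label`."""
    wanted = label.rstrip(':')
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 2 and cells[0].get_text(strip=True).rstrip(':') == wanted:
            return cells[1].get_text(' ', strip=True)
    return None  # no row with that label


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(value_beside(soup, 'Staatsform:'))
print(value_beside(soup, 'Amtssprache'))
```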

web scraping in python only retrieving one entry

I am trying to scrape the BBC football results website to get teams, shots, goals, cards and incidents.
I am writing the script in Python using the Beautiful Soup package. The code provided only retrieves the first entry of the incidents table. When the incidents table is printed to screen, the full table with all the data is there.
The table I am scraping from is stored in incidents:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.bbc.co.uk/sport/football/result/partial/EFBO815155?teamview=false'
inner_page = urllib2.urlopen(url).read()
soupb = BeautifulSoup(inner_page, 'lxml')
for incidents in soupb.find_all('table', class_="incidents-table"):
    print incidents.prettify()
    home_inc_tag = incidents.find('td', class_='incident-player-home')
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    type_inc_tag = incidents.find('td', 'span', class_='incident-type goal')
    type_inc = type_inc_tag and ''.join(type_inc_tag.stripped_strings)
    time_inc_tag = incidents.find('td', class_='incident-time')
    time_inc = time_inc_tag and ''.join(time_inc_tag.stripped_strings)
    away_inc_tag = incidents.find('td', class_='incident-player-away')
    away_inc = away_inc_tag and ''.join(away_inc_tag.stripped_strings)
    print home_inc, time_inc, type_inc, away_inc
I am just focusing on one match at the moment (EFBO815155) to get this correct before I add a regular expression into the URL to get the details of all matches.
So, the incidents for loop is not getting all the data, just the first entry in the table.
Thanks in advance, I am new to stack overflow, if anything is wrong with this post, formatting etc please let me know.
Thanks!
First, get the incidents table:
incidentsTable = soupb.find_all('table', class_='incidents-table')[0]
Then loop through all 'tr' tags within that table.
for incidents in incidentsTable.find_all('tr'):
    # your code as it is
    print incidents.prettify()
    home_inc_tag = incidents.find('td', class_='incident-player-home')
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    .
    .
    .
Gives Output:
Bradford Park Avenue 1-2 Boston United
None None
2' Goal J.Rollins
36' None C.Piergianni
N.Turner 42' None
50' Goal D.Southwell
C.King 60' Goal
This is close to what you want. Hope this helps!
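The per-row loop can be sketched in Python 3 against a small inline table. The markup below is an assumption: it reuses the class names from the question, but the exact structure of the BBC page is not reproduced here. The helper names cell_text and parse_incidents are hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed markup: one <tr> per incident, one <td> per column
SAMPLE_HTML = """
<table class="incidents-table">
  <tr><td class="incident-player-home"></td>
      <td class="incident-time">2'</td>
      <td class="incident-type goal">Goal</td>
      <td class="incident-player-away">J.Rollins</td></tr>
  <tr><td class="incident-player-home">C.King</td>
      <td class="incident-time">60'</td>
      <td class="incident-type goal">Goal</td>
      <td class="incident-player-away"></td></tr>
</table>
"""


def cell_text(row, css_class):
    """Text of the first <td> in `row` carrying `css_class`, or ''."""
    cell = row.find('td', class_=css_class)
    return cell.get_text(strip=True) if cell else ''


def parse_incidents(soup):
    """Return one dict per table row instead of only the first match."""
    table = soup.find('table', class_='incidents-table')
    incidents = []
    for row in table.find_all('tr'):
        incidents.append({
            'home': cell_text(row, 'incident-player-home'),
            'time': cell_text(row, 'incident-time'),
            'type': cell_text(row, 'incident-type'),
            'away': cell_text(row, 'incident-player-away'),
        })
    return incidents


incidents = parse_incidents(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
for incident in incidents:
    print(incident)
```

Matching on class_='incident-type' alone also finds cells whose class attribute holds several values, so it is not tied to goals only.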
