Python: Modifying contents of <a> elements

I have a web page I'm scraping and parsing using Beautiful Soup. On this webpage there are several references to other sources. They look a lot like this:
Shakespeare wrote good, such as in Romeo and Juliet, IV:ii.
What I'd like to have is:
Shakespeare wrote good, such as in (Romeo and Juliet, IV:ii).
Bear in mind that this is a very long webpage with many lines and I need to handle all of them, so modifying just one "a" tag won't work for me; I need to modify all "a" tags on the page.
This is something I've tried already:
piska_ps = url_to_soup('https://he.wikisource.org' + a['href']).find_all('p')
p_box = []
for p in piska_ps:
    if p.a:
        for a_link in p.a:
            a_link.string = "(" + a_link.string + ")"

You may use replace_with to replace a tag:
piska_ps = url_to_soup('https://he.wikisource.org' + a['href']).find_all('p')
for p in piska_ps:
    for a in p.find_all('a'):
        a.replace_with("(" + a.string + ")")
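One caveat: a.string is None whenever a link contains nested markup, which would make the concatenation above fail. A hedged variant using get_text(), which always returns a string, avoids that:
for p in piska_ps:
    for a in p.find_all('a'):
        # get_text() also works when a.string is None,
        # e.g. for <a><b>Romeo and Juliet</b></a>
        a.replace_with("(" + a.get_text() + ")")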

First, p.a is equivalent to p.find('a'), which returns a single tag; you cannot iterate over it.
piska_ps = url_to_soup('https://he.wikisource.org' + a['href']).find_all('p')
p_box = []
for p in piska_ps:
    if p.a:
        p.a.string = "(" + p.a.string + ")"

Related

Python - Scraping text inside <br> which is not under a <p>

I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/roster.era?CID=102353. I am able to do it for the names beginning with ANANDASABAPATHY, which are contained inside a "p" tag:
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")
column = soup.find_all("p")
and then playing with the length of the element:
for bullet in column:
    if len(bullet.find_all("br")) == 4:
        person = {}
        person["NAME"] = bullet.contents[0].strip()
        person["PROFESSION"] = bullet.contents[2].strip()
        person["DEPARTMENT"] = bullet.contents[4].strip()
        person["INSTITUTION"] = bullet.contents[6].strip()
        person["LOCATION"] = bullet.contents[8].strip()
However, I have 2 issues.
I am unable to scrape the information for the chairperson (GUDJONSSON) which is not contained inside a "p" tag. I was trying something like:
soup.find("b").findNext('br').findNext('br').findNext('br').contents[0].strip()
but it is not working.
I am unable to differentiate between the last 2 persons (WONDRAK and GERSCH) because they are both contained inside the same "p" tag.
Any help would be extremely useful! Thanks in advance!
This is a case where it may be easier to process the data as plain text rather than as HTML, after initially extracting the element you're looking for. The reason is that the HTML is not very well formatted for parsing; it doesn't follow a very uniform pattern. The html5lib package generally handles poorly formatted HTML better than html.parser, but it didn't help significantly in this case.
import re
from typing import Collection, Iterator

from bs4 import BeautifulSoup


def iter_lines(soup: BeautifulSoup, ignore: Collection[str] = ()) -> Iterator[str]:
    for sibling in soup.find('b').next_siblings:
        for block in sibling.stripped_strings:
            block_str = ' '.join(filter(None, (line.strip() for line in block.split('\n'))))
            if block_str and block_str not in ignore:
                yield block_str


def group_people(soup: BeautifulSoup, ignore: Collection[str] = ()) -> list[list[str]]:
    zip_code_pattern = re.compile(r', \d+$')
    people = []
    person = []
    for line in iter_lines(soup, ignore):
        person.append(line)
        if zip_code_pattern.search(line):
            people.append(person)
            person = []
    return people


def normalize_person(raw_person: list[str]) -> dict[str, str | None]:
    return {
        'NAME': raw_person[0],
        'PROFESSION': raw_person[1] if len(raw_person) > 4 else None,
        'DEPARTMENT': next((line for line in raw_person if 'DEPARTMENT' in line), None),
        'INSTITUTION': raw_person[-2],
        'LOCATION': raw_person[-1],
    }


raw_people = group_people(soup, ignore={'SCIENTIFIC REVIEW OFFICER'})
normalized = [normalize_person(person) for person in raw_people]
This works with both BeautifulSoup(content, 'html.parser') and BeautifulSoup(content, 'html5lib').
The iter_lines function finds the first <b> tag like you did before, and then yields a single string for each line that is displayed in a browser.
The group_people function groups the lines into separate people, using the zip code at the end to indicate that that person's entry is complete. It may be possible to combine this function with iter_lines and skip the regex, but this was slightly easier. Better formatted html would be more conducive to that approach.
The ignore parameter was used to skip the SCIENTIFIC REVIEW OFFICER header above the last person on that page.
Lastly, the normalize_person function attempts to interpret what each line for a given person means. The name, institution, and location appear to be fairly consistent, but I took some liberties with profession and department to use None when it seemed like a value did not exist. Those decisions were only made based on the particular page you linked to - you may need to adjust those for other pages. It uses negative indexes for the institution and location because the number of lines that existed for each person's data was variable.
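For completeness, here is a minimal setup sketch for the soup variable the snippet above assumes. The question fetched the page with Selenium, but any source of the page's HTML should work the same way; fetching it with plain requests is an assumption here, not something the question used:
import requests

content = requests.get('https://public.era.nih.gov/pubroster/roster.era?CID=102353').text
soup = BeautifulSoup(content, 'html.parser')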

best way to pair list of titles with a separate list of their corresponding links? (bs4)

final edit: so here's the solution -
list_c = [[x, y] for x, y in zip(titleList, linkList)]
Original post: I used bs4 to scrape a recipe website where the title of each recipe is not saved within the link tag. So I've extracted the titles of the recipes from one part of the code and extracted the links from the other part, and I've got these two lists (recipes, links), but I'm not sure of the best way to pair each title to its corresponding link.
(The end goal is to have the titles be hyperlinked in an HTML file that I will put on my eventual recipe aggregator website).
I was considering saving them to a dictionary as key value pairs, or something else(?), so that I can call them into the HTML doc later on.
suggestions?
EDIT:
here's the code, works fine
soup = BeautifulSoup(htmlText, 'lxml')
links = soup.find_all('article')
linkList = []
titleList = []
for link in links[0:12]:
    hyperL = link.find('header', class_='entry-header').a['href']
    linkList.append(hyperL)
for title in links:
    x = title.get('aria-label')
    titleList.append(x)
linkList prints out something like
['www.recipe.com/ham', 'www.recipe.com/curry', 'www.recipe.com/etc']
and
titleList is ['Ham', 'Curry', 'etc']
I want to print a list from these 2 like this:
[['Ham', 'www.recipe.com/ham'],['Curry', 'www.recipe.com/curry']]
For the final goal on my website, I would want the following for each pair:
<a href='www.recipe.com/ham'>Ham</a>
If you only anticipate looking up titles, and then using the resulting links, dictionaries are great for that.
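A minimal sketch of that approach, assuming titleList and linkList line up index for index as in the lists above:
title_to_link = dict(zip(titleList, linkList))
title_to_link['Ham']  # -> 'www.recipe.com/ham'

# Rendering the hyperlinks for the HTML file:
anchors = ["<a href='{}'>{}</a>".format(link, title)
           for title, link in title_to_link.items()]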

BeautifulSoup trying to remove HTML data from list

As mentioned above, I am trying to remove the HTML from the printed output to get just the text plus my dividing | and -. I get span markup, as well as other tags, that I would like to remove. Since this is part of a program that loops, I cannot search for the individual text of each page, as it changes. The page architecture stays the same, which is why the items' positions in the list stay the same. Wondering what would be the easiest way to clean the output. Here is the code section:
infoLink = driver.find_element_by_xpath("//a[contains(@href, '?tmpl=component&detail=true&parcel=')]").click()
driver.switch_to.window(driver.window_handles[1])
aInfo = driver.current_url
data = requests.get(aInfo)
src = data.text
soup = BeautifulSoup(src, "html.parser")
parsed = soup.find_all("td")
for item in parsed:
    Original = parsed[21]
    Owner = parsed[13]
    Address = parsed[17]
    print(*Original, "|", *Owner, "-", *Address)
Example output is:
<span class="detail-text">123 Main St</span> | <span class="detail-text">Banner,Bruce</span> - <span class="detail-text">1313 Mockingbird Lane<br>Santa Monica, CA 90405</br></span>
Thank you!
To get the text between the tags, just use get_text(), but be aware that there is not always text between the tags, so check for that to avoid errors:
for item in parsed:
    Original = parsed[21].get_text(strip=True)
    Owner = parsed[13].get_text(strip=True)
    Address = parsed[17].get_text(strip=True)
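Since the values are now plain strings rather than tags, the print inside the loop no longer needs the unpacking stars:
    print(Original, "|", Owner, "-", Address)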
I wrote an algorithm recently that does something like this. It won't work if your target text has a < or a > in it, though.
def remove_html_tags(string):
    data = string.replace(string[string.find("<"):string.find(">") + 1], '').strip()
    if ">" in data or "<" in data:
        return remove_html_tags(data)
    else:
        return str(data)
It recursively removes the text between < and >, inclusive.
Let me know if this works!
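For example, run against one of the span-wrapped cells from the output above:
print(remove_html_tags('<span class="detail-text">123 Main St</span>'))
# -> 123 Main St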

Beautifulsoup scrape content of a cell beside another one

I am trying to scrape the content of a cell beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl", etc. In the picture, the needed content is always in the right-hand cell.
The basic code is the following one, but I am stuck with it:
source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
stastaform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language Wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive, perhaps, but more likely to be. Having found that, I looked for tr elements under 'Basisdaten' until I found another tr containing a (presumed different) heading. In this case, I searched for 'Postleitzahlen:', but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just newlines, which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in the code.
import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')

def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'

basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
The result looks like this: two separate td elements, as requested.
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """Return the text contents of the cell adjoining a cell that contains `text`."""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # Left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # ...but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
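Hypothetical usage, assuming soup was built from the Hamburg page as in the snippet above:
print(get_content_from_right_column_for_left_column_containing('Postleitzahlen:'))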

Update: How to parse HTML with Python / BeautifulSoup

First, I'm pretty new to Python. I'm trying to scrape contact information from offline websites and output the info to a CSV. I'd like to grab the page URL (not sure how to do this from the HTML), email, phone, location data if possible, any names, and the tag line for the site if it exists.
Updated #2 code:
import os, csv, re
from bs4 import BeautifulSoup

topdir = 'C:\\projects\\training\\html'
output = csv.writer(open("scrape.csv", "wb+"))
output.writerow(["headline", "name", "email", "phone", "location", "url"])
all_contacts = []

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            soup = BeautifulSoup(f)

            def mailto_link(soup):
                if soup.name != 'a':
                    return None
                for key, value in soup.attrs:
                    if key == 'href':
                        m = re.search('mailto:(.*)', value)
                        if m:
                            all_contacts.append(m)
                            return m.group(1)
                return None

            for ul in soup.findAll('ul'):
                contact = []
                for li in soup.findAll('li'):
                    s = li.find('span')
                    if not (s and s.string):
                        continue
                    if s.string == 'Email:':
                        a = li.find(mailto_link)
                        if a:
                            contact['email'] = mailto_link(a)
                    elif s.string == 'Website:':
                        a = li.find('a')
                        if a:
                            contact['website'] = a['href']
                    elif s.string == 'Phone:':
                        contact['phone'] = unicode(s.nextSibling).strip()
                all_contacts.append(contact)

output.writerow([all_contacts])
print "Finished"
This output currently doesn't return anything other than the row headers. What am I missing here? It should at least be returning some info from the html file, which is this page: http://bendoeslife.tumblr.com/about
There are (at least) two problems here.
First, f is a filename, not the file contents, or the Soup made from those contents. So, f.find('h2') is going to find 'h2' within the filename, which isn't very useful.
Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index is just going to give you the string version of a number. For example:
>>> s = 'A string with an h2 in it'
>>> i = s.find('h2')
>>> str(i)
'17'
So, your code is doing something like this:
>>> f = 'C:\\python\\training\\offline\\somehtml.html'
>>> headline = f.find('h2')
>>> str(headline)
'-1'
You probably want to call methods on the soup object, rather than f. BeautifulSoup.find returns a "sub-tree" of the soup, which is exactly what you want to stringify here.
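To contrast with the str.find example above, here's a hypothetical soup-based lookup (the h2 content is invented for illustration):
>>> soup = BeautifulSoup('<h2>A headline</h2>', 'html.parser')
>>> str(soup.find('h2'))
'<h2>A headline</h2>'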
However, it's impossible to test that without your sample input, so I can't promise that's the only problem in your code.
Meanwhile, when you get stuck with something like this, you should try printing out intermediate values. Print out f, and headline, and headline2, and it will be much more obvious why headline3 is wrong.
After just replacing the f with soup in the find calls and fixing your indentation error, running against your sample file http://bendoeslife.tumblr.com/about now works.
It doesn't do anything all that useful, however. Since there's no h2 tag anywhere in the file, headline ends up as None. And the same goes for most of the other fields. The only thing that does find anything is url, because you're asking it to find an empty string, which will find something arbitrary. With three different parsers, I get <p>about</p> or <html><body><p>about</p></body></html>, and <html><body></body></html>…
You need to actually understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with a title of "Email", with an <li> element with an id of "email". So, you need to write a find to locate it based on one of those criteria, or something else it actually matches.
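To make the first fix concrete, here's a minimal sketch of reading each file's contents before parsing, assuming the same os.walk layout as in the question:
import os
from bs4 import BeautifulSoup

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            # Open the file and parse its contents, not its name
            with open(os.path.join(root, f)) as fh:
                soup = BeautifulSoup(fh, "html.parser")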
