BeautifulSoup trying to remove HTML data from list - python

As the title says, I am trying to remove the HTML from the printed output so that I get just the text plus my dividers | and -. Right now I get span tags and other markup that I would like removed. Because this runs inside a loop, I cannot search for the individual text values on each page, as they change; the page structure stays the same, though, which is why indexing the same items in the list works. What would be the easiest way to clean the output? Here is the code section:
infoLink = driver.find_element_by_xpath("//a[contains(@href, '?tmpl=component&detail=true&parcel=')]").click()
driver.switch_to.window(driver.window_handles[1])
aInfo = driver.current_url
data = requests.get(aInfo)
src = data.text
soup = BeautifulSoup(src, "html.parser")
parsed = soup.find_all("td")
for item in parsed:
    Original = parsed[21]
    Owner = parsed[13]
    Address = parsed[17]
    print(*Original, "|", *Owner, "-", *Address)
Example output is:
<span class="detail-text">123 Main St</span> | <span class="detail-text">Banner,Bruce</span> - <span class="detail-text">1313 Mockingbird Lane<br>Santa Monica, CA 90405</br></span>
Thank you!

To get the text between the tags, just use get_text(), but you should make sure there is actually text between the tags to avoid errors:
for item in parsed:
    Original = parsed[21].get_text(strip=True)
    Owner = parsed[13].get_text(strip=True)
    Address = parsed[17].get_text(strip=True)
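For instance, here is a minimal self-contained sketch of what get_text(strip=True) returns, built from one of the spans in the question's output (the td wrapper here is an assumption):
from bs4 import BeautifulSoup

# one cell based on the question's output; the <td> wrapper is assumed
src = '<td><span class="detail-text">123 Main St</span></td>'
cell = BeautifulSoup(src, "html.parser").find("td")
print(cell.get_text(strip=True))  # prints: 123 Main St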

I wrote an algorithm recently that does something like this. It won't work if your target text has a < or a > in it, though.
def remove_html_tags(string):
    # remove the first <...> span found, then recurse until none remain
    data = string.replace(string[string.find("<"):string.find(">") + 1], '').strip()
    if ">" in data or "<" in data:
        return remove_html_tags(data)
    else:
        return str(data)
It recursively removes the text between < and >, inclusive.
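For example, applied to one of the spans from the question (hypothetical usage, assuming the function above is defined):
print(remove_html_tags('<span class="detail-text">123 Main St</span>'))
# prints: 123 Main St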
Let me know if this works!

Related

How to clean HTML removing repeated paragraphs?

I'm trying to clean an HTML file that has repeated paragraphs within the body. Below I show the input file and expected output.
Input.html
https://jsfiddle.net/97ptc0Lh/4/
Output.html
https://jsfiddle.net/97ptc0Lh/1/
I've been trying with the following code using BeautifulSoup, but I don't know why it is not working: the resulting list CleanHtml still contains the repeated elements (paragraphs) that I'd like to remove.
from bs4 import BeautifulSoup
fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")
Uniques = set()
CleanHtml = []
for element in soup.html:
    if element not in Uniques:
        Uniques.add(element)
        CleanHtml.append(element)
print(CleanHtml)
Could someone help me reach this goal, please?
I think this should do it:
elms = []
for elem in soup.find_all('font'):
    if elem not in elms:
        elms.append(elem)
    else:
        target = elem.findParent().findParent()
        target.decompose()
print(soup.html)
This should get you the desired output.
Edit:
To remove duplicates only for those paragraphs whose font size is not 4 or 5, change the else block to:
else:
    if elem.attrs['size'] != "4" and elem.attrs['size'] != "5":
        target = elem.findParent().findParent()
        target.decompose()

Getting the text value from cells of a table when scraping recursive structure

I'm looking for a function that, given a pair of HTML tags, returns the text inside them. Ideally I would like it to be recursive:
Examples:
Given
Asset management
returns
Asset management
Given
<p>Recursive Asset management</p>
returns
Recursive Asset management
Given
<p>Again Asset management</p>
returns
Again Asset management
Here is the code I have:
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows[1:]:
    th_list = tr.find("th")
    td_list = tr.find("td")
    if th_list is None or td_list is None:
        continue
    th_str = th_list.text
    td_str = td_list.contents
    # Now the problem: td_str is a list of mixed nodes (plain text, br tags,
    # links, paragraphs, etc.). I want to be able to get the plain text
    # for links and paragraphs as well.
    for element in td_str:
        if element == "<br/":
            continue
        # here...
The input should be a String, not a Tag or any other object. My trouble is the recursion.
UPDATE: This is an example of the data I am actually working with. The goal is to pull information from Wikipedia Infoboxes, but some of the entries in the Infobox are links or paragraphs. For example, on this page: https://en.wikipedia.org/wiki/Goldman_Sachs
<tr><th scope="row" style="padding-right:0.5em;">Founders</th><td
class="agent" style="line-height:1.35em;">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr><tr>
Let's say we want to find who the Founders are. I only want the text in the elements. In this case a list containing Marcus Goldman and Samuel Sachs. I have also tried read_html from Pandas, but that concatenates the strings together and I don't want that to happen (its output is "Marcus GoldmanSamuel Sachs")
Here's an example of using .findChildren. It's not the full solution, but you can possibly use it to add on to @Bitto Bennichan's solution
import bs4
html = '''<tr><th scope="row" style="padding-right:0.5em;">Founders</th><td
class="agent" style="line-height:1.35em;">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr><tr>'''
soup = bs4.BeautifulSoup(html,'html.parser')
rows = soup.find_all('tr')
founders = []
for row in rows:
    children = row.findChildren("a", recursive=True, text=True)
    for child in children:
        child_text = child.text.split('\n')
        child_text = [x.strip() for x in child_text]
        child_text = ' '.join(child_text)
        founders.append(child_text)
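Note that findChildren("a", ...) only collects the link text ("Samuel Sachs"), not the bare text node ("Marcus Goldman"). Here is a minimal sketch of one way to get both, using .stripped_strings on the td (the trailing-dot cleanup is an assumption about this particular markup):
import bs4

html = '''<tr><th scope="row">Founders</th><td class="agent">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel Sachs</a></td></tr>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
td = soup.find('td')
# .stripped_strings yields each whitespace-trimmed text node separately,
# so plain text and link text become separate list items
founders = [s.rstrip(' .') for s in td.stripped_strings]
print(founders)  # ['Marcus Goldman', 'Samuel Sachs']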

Beautifulsoup scrape content of a cell beside another one

I am trying to scrape the content of the cell beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl", etc. The content I need is always in the cell to the right of the label.
The basic code is the following, but I am stuck with it:
import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
staatsform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive perhaps but more likely to be. Having found that I looked for tr elements under 'Basisdaten' until I found another tr including a (presumed different) heading. In this case, I search for 'Postleitzahlen:' but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just new lines which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in code.
import requests
import bs4
page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')
def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'
basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]
wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
The result is two separate td elements, as requested:
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
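If only the text is wanted rather than the td elements themselves, get_text() can flatten the second cell (a small assumed follow-up, using the printed markup above rebuilt as standalone input):
import bs4

# the two <td> elements printed above, rebuilt as standalone markup
html = "<tr><td>Postleitzahlen:</td><td>20095–21149,<br/>22041–22769,<br/>27499</td></tr>"
items = bs4.BeautifulSoup(html, "html.parser").find_all("td")
print(items[1].get_text(" ", strip=True))  # 20095–21149, 22041–22769, 27499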
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # ...but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
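A hypothetical call, assuming soup was built from the Hamburg page as in the previous answer:
print(get_content_from_right_column_for_left_column_containing('Staatsform:'))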

BeautifulSoup find a key value inside a code snippet inside a tag

My goal is to obtain the value for the 'sitekey' from a page source. The snippet of the code is here. The page in question is this
Right now, doing
soup = BeautifulSoup(url,'html.parser')
soup.find('div',{"class":"field field--required"})
does not work since there are multiple div tags with the same class name. How would I solve this issue?
Thank you in advance.
Edit:
def sitekey_search(atc_link):
    response = session.get(atc_link)
    soup = BeautifulSoup(response.content, 'html.parser')
    sitekey = soup.select("div script")[0]
    print(sitekey)
    m = re.match(r'"(\w+)"', sitekey)
    if m:
        print(m.groups())
You can use:
soup.select("div.field.field--required")
It will give you a list of the matching divs.
soup = BeautifulSoup(a,'lxml')
sitekey = soup.select("div script")[0]
b = sitekey.text
print(re.findall(r'"([^"]*)"', b))
This should do the job. The variable a (first line) is the input HTML, b is only the script part, and the regular expression prints everything between double quotes, in this case the key. You can additionally use .strip("'") or .replace("'", "") if you want to remove stray quotes from the key.
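For instance, with a made-up script tag as the input a (the sitekey value here is invented):
import re
from bs4 import BeautifulSoup

a = '<div><script>var config = {"sitekey": "6LcABCDEFG"};</script></div>'  # made-up input
soup = BeautifulSoup(a, 'lxml')
b = soup.select("div script")[0].text
print(re.findall(r'"([^"]*)"', b))  # ['sitekey', '6LcABCDEFG']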

In Python, how do I search an html webpage for a set of strings in text file?

I'm trying to figure out a way to search an html webpage for a number of strings that I have written in a text file, each on its own line. The code I have so far is this:
def mangasearch():
    r = requests.get("https://www.mangaupdates.com/releases.html")
    soup = BeautifulSoup(r.text)
    if soup.find(text=re.compile(line)):
        print("New chapter found!")
        print("")
    else:
        print("No new chapters")
        print("")

def textsearch():
    global line
    with open("manga.txt") as file:
        for line in file:
            print(line)
            mangasearch()
It's supposed to read manga.txt and search the webpage for each string separately, but it always returns "no new chapters". If I replace if soup.find(text=re.compile(line)): with if soup.find(text=re.compile("An actual string")): it works correctly, but for some reason it doesn't want to use the line variable. Any help would be appreciated!
The problem is that you are searching with a string that contains special characters such as ' ' and '\n': each line read from the file keeps its trailing newline. Note that str.strip() removes ' ' and other whitespace characters as well (e.g. tabs and newlines), so update the following line:
if soup.find(text=re.compile(line.strip())):
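A small self-contained demonstration of why the stripped version matches (the page fragment is made up):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>Karakuri Circus c.123</td>", "html.parser")  # made-up fragment
line = "Karakuri Circus\n"  # as read from manga.txt, trailing newline included
print(soup.find(text=re.compile(line)))          # None: the pattern demands a newline
print(soup.find(text=re.compile(line.strip())))  # Karakuri Circus c.123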
from urllib.request import urlopen
from bs4 import BeautifulSoup

def textsearch():
    #manga_file = open("manga.txt").readlines()
    ##assuming your text file has 3 titles
    manga_file = ["Karakuri Circus", "Sun-ken Rock", "Shaman King Flowers"]
    manga_html = "https://www.mangaupdates.com/releases.html"
    manga_page = urlopen(manga_html)
    found = 0
    soup = BeautifulSoup(manga_page, "html.parser")
    ##use BeautifulSoup to parse the section of interest out
    ##use 'inspect element' in Chrome or Firefox to find the interesting table
    ##in this case the data you are targeting is in a <div class="alt"> element
    chapters = soup.findAll("div", attrs={"class": "alt"})
    ##chapters now contains all of the sections under the dates on the page,
    ##so we iterate through each date to get all of the titles in each
    ##<tr> element of each table
    for dated_section in chapters:
        rows = dated_section.findAll("tr")
        for r in rows:
            title = r.td.text
            #print(title)
            #url = r.td.a.href
            if title in manga_file:
                found += 1
                print("New Chapter Found!")
                print(r)
    if found > 0:
        print("Found a total of %d title(s)" % found)
    else:
        print("No chapters found")
The above is not meant to be optimized, but it does a good job of showing how BeautifulSoup can be used to parse out the specific elements you are looking for. In this case I used Chrome, right-clicked on the table that contains the titles to 'inspect element', and looked for the element that contains them to point BeautifulSoup directly there. The rest is explained in the code. I don't know what your manga.txt file looks like, so I just created a list of 3 titles to search for as an example.
