BeautifulSoup get links between strings - python

So I am using BS4 to get the following out of a Website:
<div>Some TEXT with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with following some <a href="https//- next Link">LINK</a> inside.</div>
What I need to get is:
"Some TEXT with some LINK ("https// - actual Link") and some continuing TEXT with following some LINK ("https//- next Link") inside."
I have been struggling with this for some time now and don't know how to get there. I have tried before, after, between, [:], and all sorts of in-array-passing methods to get everything together.
I hope someone can help me with this because I am new to Python. Thanks in advance.

You can use str.join with an iteration over soup.contents:
import bs4
html = '''<div>Some TEXT with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with following some <a href="https//- next Link">LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)
Output:
'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
Edit: ignoring br tags:
html = '''<div>Some TEXT <br> with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with <br> following some <a href="https//- next Link">LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})'
                 for i in bs4.BeautifulSoup(html, 'html.parser').div.contents
                 if getattr(i, 'name', None) != 'br')
Edit 2: recursive solution:
def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        yield s
    elif s.name == 'a':
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        for i in getattr(s, 'contents', []):
            yield from form_text(i)
html = '''<div>Some TEXT <i>other text in </i> <br> with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with <br> following some <a href="https//- next Link">LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.
Also, whitespace may become an issue due to the presence of br tags, etc. To work around this, you can use re.sub:
import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
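The whitespace-collapsing step can be sanity-checked on its own with a plain string (no bs4 needed; the sample text is made up):

```python
import re

# Collapse any run of whitespace (including newlines left behind by <br> removal)
# into a single space.
messy = 'Some TEXT \n  with some LINK  inside.'
clean = re.sub(r'\s+', ' ', messy)
print(clean)  # Some TEXT with some LINK inside.
```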

Related

finding href path value Python regex

Find all the URL links in an HTML text using regex. The text below is assigned to the html variable.
html = """
<a href="#anchor">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
output should be like that:
["/relative/path","//other.host/same-protocol","https://example.com"]
The function should ignore fragment identifiers (link targets that begin with #). I.e., if the url points to a specific fragment/section using the hash symbol, the fragment part (the part starting with #) of the url should be stripped before it is returned by the function
I have tried the pattern below, but it's not working; it only gives the output ["https://example.com"]:
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
print(urls)
You could try using a positive lookbehind to find the quoted strings following href= in html:
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
urls = re.findall(pattern, html)
See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
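A quick check of the lookbehind pattern on a minimal string (the sample anchors here are made up for illustration):

```python
import re

# Lookbehind anchors each match right after href=", the negative lookahead
# rejects pure in-page anchors, and the lookahead stops at '#' or the closing quote.
pattern = re.compile(r'(?<=href=")(?!#)(.+?)(?=#|")')
sample = ('<a href="/relative/path#fragment">relative</a> '
          '<a href="#top">anchor</a> '
          '<a href="https://example.com">absolute</a>')
print(pattern.findall(sample))  # ['/relative/path', 'https://example.com']
```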
from typing import List
html = """
<a href="#anchor">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
href_prefix = "href=\""
def get_links_from_html(html: str, result: List[str] = None) -> List[str]:
    if result is None:
        result = []
    _, sep, rest = html.partition(href_prefix)
    if not sep:  # href_prefix not found; no more links to process
        return result
    link = rest[:rest.find("\"")].partition("#")[0]
    if link:
        result.append(link)
    return get_links_from_html(rest, result)
print(get_links_from_html(html))
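If regex or string partitioning feels fragile here, a stdlib-only alternative sketch uses html.parser plus urllib.parse.urldefrag to strip fragments (not from the answers above, just an illustration of the same idea):

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, stripping any #fragment part."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                url, _fragment = urldefrag(href)
                if url:  # pure in-page anchors like "#top" become '' and are skipped
                    self.links.append(url)

collector = HrefCollector()
collector.feed('<a href="#top">anchor</a>'
               '<a href="/relative/path#fragment">relative</a>'
               '<a href="https://example.com">absolute</a>')
print(collector.links)  # ['/relative/path', 'https://example.com']
```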

How do I get the first 3 sentences of a webpage in python?

I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy enough, but I'm having problems figuring out how to find the first 3 sentences.
import requests
from bs4 import BeautifulSoup
url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
output = ''
blacklist = [
'[document]',
'noscript',
'header',
'html',
'meta',
'head',
'input',
'script'
]
for t in text:
    if (t.parent.name not in blacklist):
        output += '{} '.format(t)
tempout = output.split('.')
for i in range(tempout):
    if (i >= 3):
        tempout.remove(i)
output = '.'.join(tempout)
print(output)
Finding sentences out of text is difficult. Normally you would look for characters that might complete a sentence, such as '.' and '!'. But a period ('.') could appear in the middle of a sentence as in an abbreviation of a person's name, for example. I use a regular expression to look for a period followed by either a single space or the end of the string, which works for the first three sentences, but not for any arbitrary sentence.
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)
print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
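The sentence regex can be exercised on a fixed string, independent of the page (the sample text is made up):

```python
import re

# A sentence ends with '.' or '!' followed by a space or the end of the string.
text = "First sentence. Second one! Third here. Fourth is ignored."
matches = re.findall(r'(.+?[.!])(?: |$)', text)
print(matches[:3])  # ['First sentence.', 'Second one!', 'Third here.']
```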
To scrape the first three sentences, just add these lines to your code:
section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"
txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Hope that this helps!
Actually, using BeautifulSoup you can filter by the class "article_text post" (see the page's source code):
myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)
And get the inner text of the p element.
Use this instead of soup = BeautifulSoup(html_page, 'html.parser')

BeautifulSoup, how to properly use decompose() [duplicate]

I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags.
So far I have this EDITED & UPDATED CURRENT CODE:
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1) What do you suggest the best way for my special case to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as doing #2.
2) I would like to strip <!-- --> comments and everything in between them. How would I go about that?
QUESTION EDIT for @jathanism: Here are some comment tags that I have tried to strip, but they remain, even when I use your example:
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
I am still trying to figure out why it
doesn't find and strip tags like this:
<!-- //-->. Those backslashes cause
certain tags to be overlooked.
This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:
import re, copy
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
If you are looking for a solution in BeautifulSoup version 3, see the BS3 docs on Comment:
import re
soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()
If mutation isn't your bag, you can:
[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
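For reference, comments can also be dropped without BeautifulSoup at all. A stdlib-only sketch with html.parser rebuilding the markup minus comments (not equivalent to extract(), just the same idea):

```python
from html.parser import HTMLParser

class CommentStripper(HTMLParser):
    """Rebuild the markup while dropping <!-- comments -->."""
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original start-tag text

    def handle_endtag(self, tag):
        self.out.append(f'</{tag}>')

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        pass  # skip comments entirely

s = CommentStripper()
s.feed('1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>')
print(''.join(s.out))  # 1<a>2<b>3</b></a>
```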

BS HTML Parsing - & is ignored when printing URL strings

Consider the following example.
htmlist = ['<div class="portal" role="navigation" id="p-coll-print_export">',
           '<h3>Print/export</h3>',
           '<div class="body">',
           '<ul>',
           '<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Main+Page">Create a book</a></li>',
           '<li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&amp;bookcmd=render_article&amp;arttitle=Main+Page&amp;oldid=560327612&amp;writer=rl">Download as PDF</a></li>',
           '<li id="t-print"><a accesskey="p" href="/w/index.php?title=Main_Page&amp;printable=yes" title="Printable version of this page [p]">Printable version</a></li>',
           '</ul>',
           '</div>',
           '</div>',
           ]
soup = __import__("bs4").BeautifulSoup("".join(htmlist), "html.parser")
for x in soup("a"):
    print(x)
    print(x.attrs)
    print(soup.a.get_text())
I was expecting this short script to print each a tag x, followed by a dictionary of x's attributes (attribute names as keys, their contents as values), ending with the text of the link.
Instead the output is
<a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page">Create a book</a>
{'href': '/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page'}
Create a book
<a href="/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl">Download as PDF</a>
{'href': '/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl'}
Create a book
<a accesskey="p" href="/w/index.php?title=Main_Page&printable=yes" title="Printable version of this page [p]">Printable version</a>
{'href': '/w/index.php?title=Main_Page&printable=yes', 'title': 'Printable version of this page [p]', 'accesskey': ['p']}
Create a book
The issues I find with this output are:
The print(soup.a.get_text()) bit always prints the text of the first a tag.
In the dictionaries output by print(x.attrs), the href values contain & where the source markup has &amp;.
What am I missing here, and how do I get the desired output?
You can use the cgi.escape (Python 2) or html.escape (Python 3) method to HTML-encode the & character:
import html
for x in soup("a"):
    print(x)
    print({k: html.escape(v, False) if k == 'href' else v for k, v in x.attrs.items()})
    print(x.get_text())
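The re-escaping step can be seen in isolation with html.escape (quote=False leaves quote characters alone and only touches &, <, and >):

```python
import html

# bs4 unescaped &amp; to & when parsing; escape() turns it back.
href = '/w/index.php?title=Main_Page&printable=yes'
print(html.escape(href, quote=False))
# /w/index.php?title=Main_Page&amp;printable=yes
```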

Python - Strip string from html tags, leave links but in changed form

Is there a way to remove all html tags from string, but leave some links and change their representation? Example:
description: <p>Animation params. For other animations, see <a href="#myA.animation">myA.animation</a> and the animation parameter under the API methods. The following properties are supported:</p>
<dl>
<dt>duration</dt>
<dd>The duration of the animation in milliseconds.</dd>
<dt>easing</dt>
<dd>A string reference to an easing function set on the <code>Math</code> object. See <a href="http://example.com">demo</a>.</dd>
</dl>
<p>
and I want to replace
<a href="#myA.animation">myA.animation</a>
with only 'myA.animation', but
<a href="http://example.com">demo</a>
with 'demo: http://example.com'
EDIT:
Now it seems to be working:
def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        if str(m) in comment:
            if not m['href'].startswith("#"):
                comment = comment.replace(str(m), m['href'] + " : " + m.__dict__['next_element'])
    soup = BeautifulSoup(comment, 'html.parser')
    comment = soup.get_text()
    return comment
This regex should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"
You can try it over here
In Python:
import re
text = ''
with open('textfile', 'r') as file:
    text = file.read()
matches = re.findall(r'(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"', text)
strings = []
for m in matches:
    m = filter(bool, m)
    strings.append(': '.join(m))
print(strings)
The result will look like: ['myA.animation', 'demo: http://example.com']
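A parser-based sketch of the same transformation using only the stdlib: keep the link text for in-page (#) links and append the URL for external ones. The sample anchors are taken from the question's example; this is an illustration, not the answer's method:

```python
from html.parser import HTMLParser

class LinkFlattener(HTMLParser):
    """Replace <a> elements with plain text: 'text' for #-links, 'text: url' otherwise."""
    def __init__(self):
        super().__init__()
        self.href = None   # href of the currently open <a>, if any
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.href = dict(attrs).get('href', '')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.href = None

    def handle_data(self, data):
        if self.href is None or self.href.startswith('#'):
            self.parts.append(data)                        # plain text or in-page link
        else:
            self.parts.append(f'{data}: {self.href}')      # external link: text + URL

f = LinkFlattener()
f.feed('See <a href="#myA.animation">myA.animation</a> and '
       '<a href="http://example.com">demo</a>.')
print(''.join(f.parts))  # See myA.animation and demo: http://example.com.
```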
