finding href path value Python regex - python

Find all the url links in a html text using regex Arguments. below text assigned to html vaiable.
html = """
anchor link
<a id="some-id" href="/relative/path#fragment">relative link</a>
same-protocol link
absolute URL
"""
output should be like that:
["/relative/path","//other.host/same-protocol","https://example.com"]
The function should ignore fragment identifiers (link targets that begin with #). I.e., if the url points to a specific fragment/section using the hash symbol, the fragment part (the part starting with #) of the url should be stripped before it is returned by the function
//I have tried this bellow one but not working its only give output: ["https://example.com"]
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
print(urls)

You could try using positive lookbehind to find the quoted strings in front of href= in html
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
urls = re.findall(pattern, html)
See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall

from typing import List
html = """
anchor link
<a id="some-id" href="/relative/path#fragment">relative link</a>
same-protocol link
absolute URL
"""
href_prefix = "href=\""
def get_links_from_html(html: str, result: List[str] = None) -> List[str]:
if result == None:
result = []
is_splitted, _, rest = html.partition(href_prefix)
if not is_splitted:
return result
link = rest[:rest.find("\"")].partition("#")[0]
if link:
result.append(link)
return get_links_from_html(rest, result)
print(get_links_from_html(html))

Related

How do I get the first 3 sentences of a webpage in python?

I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display it. Find the webpage text is easy enough, but I'm having problems figuring out how I find the first 3 sentences.
import requests
from bs4 import BeautifulSoup
url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
output = ''
blacklist = [
'[document]',
'noscript',
'header',
'html',
'meta',
'head',
'input',
'script'
]
for t in text:
if (t.parent.name not in blacklist):
output += '{} '.format(t)
tempout = output.split('.')
for i in range(tempout):
if (i >= 3):
tempout.remove(i)
output = '.'.join(tempout)
print(output)
Finding sentences out of text is difficult. Normally you would look for characters that might complete a sentence, such as '.' and '!'. But a period ('.') could appear in the middle of a sentence as in an abbreviation of a person's name, for example. I use a regular expression to look for a period followed by either a single space or the end of the string, which works for the first three sentences, but not for any arbitrary sentence.
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
needed = 3 - len(sentences)
found = len(matches)
n = min(found, needed)
for i in range(n):
sentences.append(matches[i])
if len(sentences) == 3:
break
print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
To scrape the first three sentences, just add these lines to ur code:
section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"
txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Hope that this helps!
Actually using beautify soup you can filter by the class "article_text post" seeing source code:
myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)
And get the inner text of p element
Use this instead of soup = BeautifulSoup(html_page, 'html.parser')

BeautifulSoup get links between strings

So I am using BS4 to get the following out of a Website:
<div>Some TEXT with some LINK
and some continuing TEXT with following some LINK inside.</div>
What I need to get is:
"Some TEXT with some LINK ("https// - actual Link") and some continuing TEXT with following some LINK ("https//- next Link") inside."
I am struggeling with this for some time now and don't know how to get there ... tried before, after, between, [:], all sort of in-Array-passing methods to get everything together.
I hope someone can help me with this because I am am new to Python. Thanks in advance.
You can use str.join with an iteration over soup.contents:
import bs4
html = '''<div>Some TEXT with some LINK and some continuing TEXT with following some LINK inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)
Output:
'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
Edit: ignoring br tags:
html = '''<div>Some TEXT <br> with some LINK and some continuing TEXT with <br> following some LINK inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents \
if getattr(i, 'name', None) != 'br')
Edit 2: recursive solution:
def form_text(s):
if isinstance(s, (str, bs4.element.NavigableString)):
yield s
elif s.name == 'a':
yield f'{s.get_text(strip=True)} ({s["href"]})'
else:
for i in getattr(s, 'contents', []):
yield from form_text(i)
html = '''<div>Some TEXT <i>other text in </i> <br> with some LINK and some continuing TEXT with <br> following some LINK inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.
Also, whitespace may become an issue due to the presence of br tags, etc. To work around this, you can use re.sub:
import re
result = re.sub('\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'

Python - Strip string from html tags, leave links but in changed form

Is there a way to remove all html tags from string, but leave some links and change their representation? Example:
description: <p>Animation params. For other animations, see myA.animation and the animation parameter under the API methods. The following properties are supported:</p>
<dl>
<dt>duration</dt>
<dd>The duration of the animation in milliseconds.</dd>
<dt>easing</dt>
<dd>A string reference to an easing function set on the <code>Math</code> object. See demo.</dd>
</dl>
<p>
and I want to replace
myA.animation
with only 'myA.animation', but
demo
with 'demo: http://example.com'
EDIT:
Now it seems to be working:
def cleanComment(comment):
soup = BeautifulSoup(comment, 'html.parser')
for m in soup.find_all('a'):
if str(m) in comment:
if not m['href'].startswith("#"):
comment = comment.replace(str(m), m['href'] + " : " + m.__dict__['next_element'])
soup = BeautifulSoup(comment, 'html.parser')
comment = soup.get_text()
return comment
This regex should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"
You can try it over here
In Python:
import re
text = ''
with open('textfile', 'r') as file:
text = file.read()
matches = re.findall('(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"', text)
strings = []
for m in matches:
m = filter(bool, m)
strings.append(': '.join(m))
print(strings)
The result will look like: ['myA.animation', 'demo: http://example.com']

Python text parsing between two words

I'm using beautifulsoup and want to extract all text from between two words on a webpage.
Ex, imagine the following website text:
This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.
I want to pull out everything on the page that starts with text and ends with bunch.
In this case I'd want only:
text of the webpage. It is just a string of a bunch
However, there's a chance there could be multiple instances of this on a page.
What is the best way to do this?
This is my current setup:
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
urls = [
http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html
]
for url in urls:
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
text= soup.prettify()
texts = soup.findAll(text=True)
def visible(element):
if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
# If the parent of your element is any of those ignore it
return False
elif re.match('<!--.*-->', str(element)):
# If the element matches an html tag, ignore it
return False
else:
# Otherwise, return True as these are the elements we need
return True
visible_texts = filter(visible, texts)
# Filter only returns those items in the sequence, texts, that return True.
# We use those to build our final list.
for line in visible_texts:
print line
since you're just parsing the text you just need the regex:
import re
result = re.findall("text.*?bunch", text_from_web_page)

How to get span value using python,BeautifulSoup

I am using BeautifulSoup for the first time and trying to collect several data such as email,phone number, and mailing address from a soup object.
Using regular expressions, I can identify the email address. My code to find the email is:
def get_email(link):
mail_list = []
for i in link:
a = str(i)
email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._#]*)\">", re.IGNORECASE)
ik = re.findall(email_pattern, a)
if (len(ik) == 1):
mail_list.append(i)
else:
pass
s_email = str(mail_list[0]).split('<a href="')
t_email = str(s_email[1]).split('">')
print t_email[0]
Now, I also need to collect the phone number, mailing address and web url. I think in BeautifulSoup there must be an easy way to find those specific data.
A sample html page is as below:
<ul>
<li>
<span>Email:</span>
Message Us
</li>
<li>
<span>Website:</span>
<a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
</li>
<li>
<span>Phone:</span>
(123)456-789
</li>
</ul>
And using BeatifulSoup, I am trying to collect the span values of Email, website and Phone.
Thanks in advance.
The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again - that ignores much of the point of using BeautifulSoup in the first place. You might need to use a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary - you can just leave it out entirely.
Here's some code that does something like what you want, and might be a useful starting point:
from BeautifulSoup import BeautifulSoup
import re
# Assuming that html is your input as a string:
soup = BeautifulSoup(html)
all_contacts = []
def mailto_link(e):
'''Return the email address if the element is is a mailto link,
otherwise return None'''
if e.name != 'a':
return None
for key, value in e.attrs:
if key == 'href':
m = re.search('mailto:(.*)',value)
if m:
return m.group(1)
return None
for ul in soup.findAll('ul'):
contact = {}
for li in soup.findAll('li'):
s = li.find('span')
if not (s and s.string):
continue
if s.string == 'Email:':
a = li.find(mailto_link)
if a:
contact['email'] = mailto_link(a)
elif s.string == 'Website:':
a = li.find('a')
if a:
contact['website'] = a['href']
elif s.string == 'Phone:':
contact['phone'] = unicode(s.nextSibling).strip()
all_contacts.append(contact)
print all_contacts
That will produce a list of one dictionary per contact found, in this case that will just be:
[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc#gmail.com'}]

Categories

Resources