fetching https links only - python

I can get the links, but I don't know how to filter for https only.

To parse the HTML, use an HTML parser, e.g. BeautifulSoup. To extract the desired <a> elements, you can use the CSS selector 'a[href^="https"]' (it selects every <a> element whose href attribute value begins with "https"):
import requests
from bs4 import BeautifulSoup
url = 'https://sayamkanwar.com/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('a[href^="https"]'):
    print(a['href'])
Prints:
https://sayamkanwar.com/work
https://sayamkanwar.com/about
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
Further reading:
CSS Selectors Reference
EDIT: Using only builtin modules:
import urllib.request
from html.parser import HTMLParser
url = 'https://sayamkanwar.com/'
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs and attrs['href'].startswith('https'):
                print(attrs['href'])

with urllib.request.urlopen(url) as response:
    src = response.read().decode('utf-8')

parser = MyHTMLParser()
parser.feed(src)
Prints:
https://sayamkanwar.com/work
https://sayamkanwar.com/about
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/

Try this; I just use the requests library.
import re
import requests
URL = 'https://sayamkanwar.com/'
response = requests.get(URL)
pattern = r'(a href=")((https):((//)|(\\\\))+([\w\d:##%/;$()~_?\+-=\\\.&](#!)?)*)"'
all_url = re.findall(pattern, response.text)
for url in all_url:
print(url[1])
Output:
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
Related

Python Beautiful Soup Parse First Href in Each Div

Given the following code:
# import the module
import bs4 as bs
import urllib.request
import re
masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = bs.BeautifulSoup(sauce,'lxml')
for div in soup.findAll('ul', {'class': 'song-list'}):
    for span in div:
        for link in span:
            for a in link:
                print(a)
I can parse multiple divs, and I get a result as follows:
My question is: instead of getting the full contents of the div, how can I return only the highlighted portion, the URL of the href?
Try this. You need to specify the right class to fetch the urls connected to it.
from bs4 import BeautifulSoup
import urllib.request
masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = BeautifulSoup(sauce,'lxml')
for div in soup.find_all(class_='subtitle'):
    print(div.get("href"))
Output:
http://www.metrolyrics.com/charles-goose-lyrics.html
http://www.metrolyrics.com/param-singh-lyrics.html
http://www.metrolyrics.com/westlife-lyrics.html
http://www.metrolyrics.com/luis-fonsi-lyrics.html
http://www.metrolyrics.com/grease-lyrics.html
http://www.metrolyrics.com/shanti-dope-lyrics.html
and so on ---
if 'href' in a.attrs:
    print(a.attrs['href'])
This will give you what you need.
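For completeness, a minimal sketch of that check applied to the question's page (same metrolyrics URL as above; a single nested find_all replaces the four nested loops):
import bs4 as bs
import urllib.request

masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Look at every anchor inside the song list and keep only those with an href.
for ul in soup.find_all('ul', {'class': 'song-list'}):
    for a in ul.find_all('a'):
        if 'href' in a.attrs:
            print(a.attrs['href'])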

Google Scraping href values

I have a problem finding href values with BeautifulSoup:
from urllib import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624")
bsObj = BeautifulSoup(html)
for link in bsObj.find("h3", {"class":"r"}).findAll("a"):
if 'href' in link.attrs:
print(link.attrs['href'])
I keep getting this error:
AttributeError: 'NoneType' object has no attribute 'findAll'
You'll have to change the User-Agent string to something other than urllib's default user agent.
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup
url = "https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624"
html = urlopen(Request(url, headers={'User-Agent':'Mozilla/5'})).read()
bsObj = BeautifulSoup(html, 'html.parser')
for link in bsObj.find("h3", {"class":"r"}).findAll("a", href=True):
print(link['href'])
Also note that this expression will select only the first link. If you want to select all the links in the page use the following expression:
links = bsObj.select("h3.r a[href]")
for link in links:
    print(link['href'])

Python web Automation to get Email from Webpage

I want a Python script that opens a link and prints the email addresses from that page.
E.g.:
Go to some site like example.com.
Search for emails in it.
Search all the pages under that link.
I tried the code below:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data)
for rate in soup.find_all('@'):
    print rate.text
I took this website for reference.
Can anyone help me with this?
That's because find_all() only searches tags. From the documentation:
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
So you need to add a keyword argument like this:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all(href=re.compile("mailto")):
print i.string
Demo:
contact@digitalseo.in
contact@digitalseo.in
From the documentation:
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
You can see the documentation for more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
And if you'd like to find email addresses in a document, a regex is a good choice.
For example:
import re
re.findall(r'[^@]+@[^@]+\.[^@]+', text)  # remember to change the `text` variable
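Putting the two together, a rough sketch against the page from the question (the pattern is deliberately simple and not a full email validator):
import re
import requests

r = requests.get('http://www.digitalseo.in/')
# Collect every email-looking string found anywhere in the raw page text.
emails = set(re.findall(r'[^@\s]+@[^@\s]+\.[a-zA-Z]+', r.text))
for email in emails:
    print(email)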
And if you'd like to find a link in a page by keyword, just use .get() like this:
import re
import requests
from bs4 import BeautifulSoup
def get_link_by_keyword(keyword):
    links = set()
    for i in soup.find_all(href=re.compile(r"[http|/].*"+str(keyword))):
        links.add(i.get('href'))
    for i in links:
        if i[0] == 'h':
            yield i
        elif i[0] == '/':
            yield link+i
        else:
            pass

global link
link = raw_input('Please enter a link: ')
if link[-1] == '/':
    link = link[:-1]

r = requests.get(link, verify=True)
data = r.text
soup = BeautifulSoup(data, "html.parser")

for i in get_link_by_keyword(raw_input('Enter a keyword: ')):
    print i

How to find and extract link from webpage?

I have a website, e.g. http://site.com
I would like to fetch the main page and extract only the links that match a regular expression, e.g. .*somepage.*
The format of the links in the HTML code can be:
<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
I need the output format:
http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm
The output URL must always contain the domain name.
What is a fast Python solution for this?
You could use lxml.html:
from lxml import html
url = "http://site.com"
doc = html.parse(url).getroot() # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)
Or the same using beautifulsoup4:
import re
try:
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
soup = BeautifulSoup(urlopen(url), parse_only=only_links)
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
Use an HTML parsing module, like BeautifulSoup.
Some code (only a partial example):
from bs4 import BeautifulSoup
import re
html = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>'''
soup = BeautifulSoup(html)
links = soup.find_all('a',{'href':re.compile('.*somepage.*')})
for link in links:
    print link['href']
Output:
http://site.com/my-somepage
/my-somepage.html
my-somepage.htm
You should be able to get the format you want from this much data...
Scrapy is the simplest way to do what you want. It actually has a built-in link-extracting mechanism.
Let me know if you need help with writing the spider to crawl links.
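A minimal sketch of such a spider, assuming the question's http://site.com placeholder and the .*somepage.* pattern (the spider and callback names are made up for illustration):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SomepageSpider(CrawlSpider):
    name = 'somepage'
    allowed_domains = ['site.com']
    start_urls = ['http://site.com/']

    # Extract only links whose URL matches the pattern; the extracted
    # links are already absolute, so the domain name is included.
    rules = (
        Rule(LinkExtractor(allow=r'somepage'), callback='parse_link'),
    )

    def parse_link(self, response):
        yield {'url': response.url}
You could run it with scrapy runspider and the -o option to dump the extracted URLs to a file.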
Please, also see:
How do I use the Python Scrapy module to list all the URLs from my website?
Scrapy tutorial

Python BeautifulSoup equivalent to lxml make_links_absolute

So lxml has a very handy feature: make_links_absolute:
doc = lxml.html.fromstring(some_html_page)
doc.make_links_absolute(url_for_some_html_page)
and all the links in doc are now absolute. Is there an easy equivalent in BeautifulSoup, or do I simply need to pass it through urlparse and normalize it myself:
soup = BeautifulSoup(some_html_page)
for tag in soup.findAll('a', href=True):
    url_data = urlparse(tag['href'])
    if url_data[0] == "":
        full_url = url_for_some_html_page + tag['href']
In my answer to What is a simple way to extract the list of URLs on a webpage using python? I covered that incidentally as part of the extraction step; you could easily write a method to do it on the soup and not just extract it.
from urllib.parse import urljoin
def make_links_absolute(soup, url):
    for tag in soup.findAll('a', href=True):
        tag['href'] = urljoin(url, tag['href'])
(Python 2: from urlparse import urljoin.)
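Hypothetical usage of that helper, reusing the names from the question (some_html_page is assumed to hold the page source and url_for_some_html_page its base URL):
from bs4 import BeautifulSoup

soup = BeautifulSoup(some_html_page, 'html.parser')
make_links_absolute(soup, url_for_some_html_page)

# Every href in the soup is now an absolute URL.
for tag in soup.findAll('a', href=True):
    print(tag['href'])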
