from bs4 import BeautifulSoup
import requests
r = requests.get("xxx")
soup = BeautifulSoup(r.content)
for link in soup.find_all('html'):
    print link
This isn't working for me. Can someone help?
for link in soup.find_all('a'):
    if '.html' in link['href']:
        print link
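Note that link['href'] raises a KeyError for anchor tags without an href attribute. A minimal sketch of a safer variant of the same loop (just passing href=True and matching the .html suffix; this is my variation, not the answer above):
for link in soup.find_all('a', href=True):
    if link['href'].endswith('.html'):
        print(link['href'])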
You might want to use regular expressions and search the "href" attributes. Something like this should help you get started, assuming you are searching all href attributes:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://example.com"))  # replace with your URL
tags = soup.find_all(href=re.compile(r"\.html$"))
The tags variable will be a list of all tags whose href attribute ends in .html. Now you can loop through tags and extract the href, as sketched below.
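A minimal sketch of that loop, just printing each matched href:
for tag in tags:
    print(tag['href'])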
I am using the lxml and requests modules, and I'm just trying to parse an article from a website. I tried using find_all from BeautifulSoup but still came up empty.
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 with the class name as a CSS selector and select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n, but also the text, which you can handle with re.sub or conditional logic, as sketched below.
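A minimal sketch of that cleanup, assuming article holds the list returned by the //text() query: strip each piece and drop the empty ones before joining.
article_text = ' '.join(piece.strip() for piece in article if piece.strip())
print(article_text)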
In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and got the first result's title as <a class="result__a" href="http://example.com"> using the Developer Tools. Now I used the following code to get the href with Python:
import requests
from bs4 import BeautifulSoup

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']
However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?
There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see to be the first one you get.
The 'gibberish' you mentioned is an encoded URL. You'll need to decode and parse it to get the parameters (params) of the URL.
For example:
"/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
The above href contains two params, namely kh and uddg.
uddg is the actual link you need, I suppose.
The code below will get all the URLs of that particular class, unquoted.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    if parsed_url:
        print(unquote(parsed_url[0]))
I wrote a small script to read all hrefs from a web page with Python.
But it has a problem. It doesn't read href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648" for example.
code:
import urllib
import re
urls = ["http://something.com"]
regex='href=\"(.+?)\"'
pattern = re.compile(regex)
htmlfile = urllib.urlopen(urls[0])
htmltext = htmlfile.read()
hrefs = re.findall(pattern,htmltext)
print hrefs
Can anybody help me? Thanks.
Use BeautifulSoup and requests for static websites. BeautifulSoup is a great module for web scraping; with the code below you can easily get the value of each href attribute. Hope it helps.
import requests
from bs4 import BeautifulSoup
url = 'whatever url you want to parse'
result = requests.get(url)
soup = BeautifulSoup(result.content,'html.parser')
for a in soup.find_all('a',href=True):
print "Found the URL:", a['href']
Is there a way to find an element using only the data attribute in html, and then grab that value?
For example, with this line inside an html doc:
<ul data-bin="Sdafdo39">
How do I retrieve Sdafdo39 by searching the entire html doc for the element that has the data-bin attribute?
A little bit more accurate:
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
This way, the iterated list only has the ul elements that have the attribute you want to find:
from bs4 import BeautifulSoup

html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin': True})]
You can use the find_all method to get all the tags, and filtering on whether "data-bin" appears in a tag's attributes gives us only the tags that actually have it. Then we can simply extract the corresponding value, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
print [item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs]
# ['Sdafdo39']
You could solve this with gazpacho in just a couple of lines:
First, import and turn the html into a Soup object:
from gazpacho import Soup
html = """<ul data-bin="Sdafdo39">"""
soup = Soup(html)
Then you can just search for the "ul" tag and extract the data-bin attribute:
soup.find("ul").attrs["data-bin"]
# Sdafdo39
As an alternative, if one prefers to use CSS selectors via select() instead of find_all():
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
soup = BeautifulSoup(html_doc)
# Select all ul elements that have a data-bin attribute
soup.select('ul[data-bin]')
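If you only want the attribute values rather than the tags, a quick comprehension over that same selection works (assuming the soup object defined above):
print([ul['data-bin'] for ul in soup.select('ul[data-bin]')])
# ['Sdafdo39']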
I have a website, e.g. http://site.com
I would like to fetch the main page and extract only the links that match the regular expression, e.g. .*somepage.*
The format of the links in the html code can be:
<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
I need the output format:
http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm
The output URL must always contain the domain name.
What is the fastest Python solution for this?
You could use lxml.html:
from lxml import html
url = "http://site.com"
doc = html.parse(url).getroot() # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)
Or the same using beautifulsoup4:
import re
try:
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
soup = BeautifulSoup(urlopen(url), parse_only=only_links)
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
Use an HTML parsing module, like BeautifulSoup.
Some code (only some):
from bs4 import BeautifulSoup
import re
html = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>'''
soup = BeautifulSoup(html)
links = soup.find_all('a',{'href':re.compile('.*somepage.*')})
for link in links:
    print link['href']
Output:
http://site.com/my-somepage
/my-somepage.html
my-somepage.htm
You should be able to get the format you want from this much data...
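Since the question requires the domain name in every output URL, one way to absolutize the relative results is urljoin. A minimal sketch, assuming the links list from the code above and http://site.com as the base URL:
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base = "http://site.com"
for link in links:
    print(urljoin(base, link['href']))
# http://site.com/my-somepage
# http://site.com/my-somepage.html
# http://site.com/my-somepage.htm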
Scrapy is the simplest way to do what you want. There is actually a built-in link-extraction mechanism.
Let me know if you need help with writing the spider to crawl links.
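A minimal sketch of such a spider, assuming Scrapy is installed and using the site and somepage pattern from the question (the spider and callback names are made up for illustration):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SomepageSpider(CrawlSpider):
    name = 'somepage'                    # hypothetical spider name
    allowed_domains = ['site.com']
    start_urls = ['http://site.com']

    # Follow only links whose URL contains "somepage"; LinkExtractor
    # yields absolute URLs, so the domain name is always included.
    rules = [Rule(LinkExtractor(allow=r'somepage'), callback='parse_item')]

    def parse_item(self, response):
        yield {'url': response.url}
Run it with scrapy crawl somepage; the allow pattern plays the role of the regular expression from the question.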
Please also see:
How do I use the Python Scrapy module to list all the URLs from my website?
Scrapy tutorial