from bs4 import BeautifulSoup
import requests
r = requests.get("xxx")
soup = BeautifulSoup(r.content)
for link in soup.find_all('html'):
    print link
This isn't working for me. Can someone help?
for link in soup.find_all('a'):
    if '.html' in link['href']:
        print link
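Note that link['href'] raises a KeyError for anchor tags without an href attribute. A minimal sketch of a safer variant of the same loop (just passing href=True and matching the .html suffix; this is my variation, not the answer above):
for link in soup.find_all('a', href=True):
    if link['href'].endswith('.html'):
        print(link['href'])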
You might want to use regular expressions and search the "href" attributes. Something like this should help you get started, assuming you are searching all href attributes:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://example.com"))  # replace with your URL
tags = soup.find_all(href=re.compile(r"\.html$"))
The tags variable will be a list of all tags whose href attribute ends in .html. Now you can loop through tags and extract the href, as sketched below.
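A minimal sketch of that loop, just printing each matched href:
for tag in tags:
    print(tag['href'])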
I am using the lxml and requests modules, and I'm just trying to parse an article from a website. I tried using find_all from BeautifulSoup but still came up empty.
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 with the class name as a CSS selector and select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n, but also the text, which you can handle with re.sub or conditional logic, as sketched below.
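A minimal sketch of that cleanup, assuming article holds the list returned by the //text() query: strip each piece and drop the empty ones before joining.
article_text = ' '.join(piece.strip() for piece in article if piece.strip())
print(article_text)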
In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and got the first result's title as <a class="result__a" href="http://example.com"> using the Developer Tools. Now I used the following code to get the href with Python:
import requests
from bs4 import BeautifulSoup

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']
However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?
There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see to be the first one you get.
The 'gibberish' you mentioned is an encoded URL. You'll need to decode and parse it to get the parameters (params) of the URL.
For example:
"/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
The above href contains two params, namely kh and uddg.
uddg is the actual link you need, I suppose.
The code below will get all the URLs of that particular class, unquoted.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    if parsed_url:
        print(unquote(parsed_url[0]))
I wrote a small script to read all hrefs from a web page with Python.
But it has a problem. It doesn't read href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648" for example.
code:
import urllib
import re
urls = ["http://something.com"]
regex='href=\"(.+?)\"'
pattern = re.compile(regex)
htmlfile = urllib.urlopen(urls[0])
htmltext = htmlfile.read()
hrefs = re.findall(pattern,htmltext)
print hrefs
Can anybody help me? Thanks.
Use BeautifulSoup and requests for static websites. BeautifulSoup is a great module for web scraping; with the code below you can easily get the value of each href attribute. Hope it helps.
import requests
from bs4 import BeautifulSoup
url = 'whatever url you want to parse'
result = requests.get(url)
soup = BeautifulSoup(result.content,'html.parser')
for a in soup.find_all('a',href=True):
print "Found the URL:", a['href']
Is there a way to find an element using only the data attribute in html, and then grab that value?
For example, with this line inside an html doc:
<ul data-bin="Sdafdo39">
How do I retrieve Sdafdo39 by searching the entire html doc for the element that has the data-bin attribute?
A little bit more accurate:
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
This way, the iterated list only has the ul elements that have the attribute you want to find:
from bs4 import BeautifulSoup

html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin': True})]
You can use the find_all method to get all the tags, and filtering on whether "data-bin" appears in a tag's attributes gives us only the tags that actually have it. Then we can simply extract the corresponding value, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
print [item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs]
# ['Sdafdo39']
You could solve this with gazpacho in just a couple of lines:
First, import and turn the html into a Soup object:
from gazpacho import Soup
html = """<ul data-bin="Sdafdo39">"""
soup = Soup(html)
Then you can just search for the "ul" tag and extract the data-bin attribute:
soup.find("ul").attrs["data-bin"]
# Sdafdo39
As an alternative, if one prefers to use CSS selectors via select() instead of find_all():
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
soup = BeautifulSoup(html_doc)
# Select all ul elements that have a data-bin attribute
soup.select('ul[data-bin]')
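If you only want the attribute values rather than the tags, a quick comprehension over that same selection works (assuming the soup object defined above):
print([ul['data-bin'] for ul in soup.select('ul[data-bin]')])
# ['Sdafdo39']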
I have a website, e.g. http://site.com
I would like to fetch the main page and extract only the links that match the regular expression, e.g. .*somepage.*
The format of the links in the html code can be:
<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
I need the output format:
http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm
The output URL must always contain the domain name.
What is the fastest Python solution for this?
You could use lxml.html:
from lxml import html
url = "http://site.com"
doc = html.parse(url).getroot() # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)
Or the same using beautifulsoup4:
import re
try:
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
soup = BeautifulSoup(urlopen(url), parse_only=only_links)
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
Use an HTML parsing module, like BeautifulSoup.
Some code (only some):
from bs4 import BeautifulSoup
import re
html = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>'''
soup = BeautifulSoup(html)
links = soup.find_all('a',{'href':re.compile('.*somepage.*')})
for link in links:
    print link['href']
Output:
http://site.com/my-somepage
/my-somepage.html
my-somepage.htm
You should be able to get the format you want from this much data...
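Since the question requires the domain name in every output URL, one way to absolutize the relative results is urljoin. A minimal sketch, assuming the links list from the code above and http://site.com as the base URL:
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base = "http://site.com"
for link in links:
    print(urljoin(base, link['href']))
# http://site.com/my-somepage
# http://site.com/my-somepage.html
# http://site.com/my-somepage.htm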
Scrapy is the simplest way to do what you want. There is actually a built-in link-extraction mechanism.
Let me know if you need help with writing the spider to crawl links.
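A minimal sketch of such a spider, assuming Scrapy is installed and using the site and somepage pattern from the question (the spider and callback names are made up for illustration):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SomepageSpider(CrawlSpider):
    name = 'somepage'                    # hypothetical spider name
    allowed_domains = ['site.com']
    start_urls = ['http://site.com']

    # Follow only links whose URL contains "somepage"; LinkExtractor
    # yields absolute URLs, so the domain name is always included.
    rules = [Rule(LinkExtractor(allow=r'somepage'), callback='parse_item')]

    def parse_item(self, response):
        yield {'url': response.url}
Run it with scrapy crawl somepage; the allow pattern plays the role of the regular expression from the question.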
Please also see:
How do I use the Python Scrapy module to list all the URLs from my website?
Scrapy tutorial