Extract href from html

Extract href from html - python

I am given the following HTML:
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> Acaryochloris_marina_MBIC11017_> Jun 12 2013
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> Acetobacter_pasteurianus_386B_u> Aug 8 2013
and many more...
I want to extract the href values from it.
Here's my Python script (page_source contains the HTML):
soup = BeautifulSoup(page_source)
links = soup.find_all('a',attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag.get('href',None)
    if link != None:
        print link
But this keeps raising the following error:
links = soup.find_all('A',attrs={'HREF': re.compile("^http://")})
TypeError: 'NoneType' object is not callable

You are using BeautifulSoup version 3, not version 4. soup.find_all is then not interpreted as a method, but as a search for the first <find_all> element. Because there is no such element, soup.find_all resolves to None.
Install BeautifulSoup 4 instead; the import is:
from bs4 import BeautifulSoup
BeautifulSoup 3 is instead imported as from BeautifulSoup import BeautifulSoup.
If you are sure you wanted to use BeautifulSoup 3 (not recommended), then use:
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
As a side note, because you limit your search to <a> tags whose href matches the pattern, there will always be an href attribute on the elements that are found. Using .get() and testing for None is entirely redundant. The following is equivalent:
links = soup.find_all('a',attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag['href']
    print link
BeautifulSoup 4 also supports CSS selectors, which could make your query a little more readable still, removing the need for you to specify a regular expression. Current versions need the attribute value quoted, since http:// is not a valid CSS identifier:
for tag in soup.select('a[href^="http://"]'):
    link = tag['href']
    print link
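A minimal Python 3 sketch of the same search with Beautiful Soup 4. The question's HTML lost its anchors, so the <A HREF="..."> tags, URLs, and link text below are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical directory-listing markup: the anchors and example.com URLs
# are assumptions, not taken from the original page.
page_source = """
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="http://example.com/genomes/A/">A</A> Jun 12 2013
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="http://example.com/genomes/B/">B</A> Aug 8 2013
<A HREF="relative/path/">not an absolute http link</A>
"""

soup = BeautifulSoup(page_source, "html.parser")

# The parser lower-cases tag and attribute names, so search for 'a' and 'href'
# even though the markup uses upper case.
links = [tag["href"] for tag in soup.find_all("a", href=re.compile(r"^http://"))]
for link in links:
    print(link)
```

Only the two absolute http:// links are collected; the relative one does not match the pattern.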

Why not use the split command?
Iterate over all lines of the file and do something like this:
href = line.split("HREF=\"")[1].split("\"")[0]
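The one-liner raises IndexError on any line that has no HREF=" in it. A slightly more defensive sketch of the same string-splitting idea follows; the helper name extract_hrefs and the sample lines are hypothetical:

```python
def extract_hrefs(lines):
    """Return the first HREF="..." value on each line, skipping lines without one."""
    hrefs = []
    for line in lines:
        # partition() never raises: the separator slot is empty when not found.
        _, marker, rest = line.partition('HREF="')
        if not marker:
            continue
        value, closing_quote, _ = rest.partition('"')
        if closing_quote:
            hrefs.append(value)
    return hrefs


# Made-up sample input for illustration.
sample_lines = [
    '<IMG SRC="SOMETHING" ALT="[DIR] "> <A HREF="http://example.com/a/">a</A> Jun 12 2013',
    "a line with no link at all",
]
print(extract_hrefs(sample_lines))
```

This remains fragile compared with an HTML parser: it is case-sensitive, expects double quotes, and only sees the first link on each line.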

Related

How to get attribute value from li tag in python BS4

How can I get the src attribute out of this tag with the BS4 library?
Right now I'm using the code below, but I can't get the result I want.
<li class="active" id="server_0" data-embed="<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>"><a><span><i class="fa fa-eye"></i></span> <strong>vk</strong></a></li>
I want this value: src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'
This is my code. I can access ['data-embed'], but I don't know how to extract the link from it:
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper()
access = "https://w.mycima.cc/play.php?vid=d4d8322b9"
response = scraper.get(access)
doc2 = bs(response.content, "lxml")
container2 = doc2.find("div", id='player').find("ul", class_="list_servers list_embedded col-sec").find("li")
link = container2['data-embed']
print(link)
Result
<Response [200]>
https://w.mycima.cc/play.php?vid=d4d8322b9
<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>
Process finished with exit code 0

From the Beautiful Soup documentation:
You can access a tag’s attributes by treating the tag like a dictionary
They give the example:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'
For reference and further details, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So, for your case specifically, if link is a soup object you could write
print(link.find("iframe")['src'])
If link turns out to be plain text rather than a soup object, which may be the case for your particular example based on the comments, you can resort to string searching, regex, or more Beautiful Soup parsing, for example:
import re
from bs4 import BeautifulSoup
link = """<Response [200]>https://w.mycima.cc/play.php?vid=d4d8322b9<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'></iframe>"""
iframe = re.search(r"<iframe.*>", link)
if iframe:
    soup = BeautifulSoup(iframe.group(0),"html.parser")
    print("src=" + soup.find("iframe")['src'])
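Since the data-embed value is itself a fragment of HTML, one option is to hand that string to Beautiful Soup a second time. The sketch below assumes the <li> looks like the one in the question (shortened here) and skips the network request:

```python
from bs4 import BeautifulSoup

# Shortened copy of the <li> from the question; in the real script this
# markup would come from the downloaded page.
li_html = (
    '<li class="active" id="server_0" data-embed="<iframe '
    "src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' "
    "scrolling='no'></iframe>\"><a><strong>vk</strong></a></li>"
)

li = BeautifulSoup(li_html, "html.parser").find("li")

# The attribute value is a plain string, so parse it as its own document.
embed_markup = li["data-embed"]
iframe = BeautifulSoup(embed_markup, "html.parser").find("iframe")

src = iframe["src"] if iframe is not None else None
print(src)
```

The None check covers list items whose data-embed does not contain an iframe.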

In Python, how do I find elements that contain a specific attribute?

I'm using Python 3.7. I want to locate all the elements in my HTML page that have an attribute, "data-permalink", regardless of what its value is, even if the value is empty. However, I'm confused about how to do this. I'm using the bs4 package and tried the following
soup = BeautifulSoup(html)
soup.findAll("data-permalink")
[]
soup.findAll("a")
[<a href=" ... </a>]
soup.findAll("a.data-permalink")
[]
The attribute is normally only found in anchor tags on my page, hence my unsuccessful, "a.data-permalink" attempt. I would like to return the elements that contain the attribute.

Your selector is invalid:
soup.findAll("a.data-permalink")
That syntax belongs to the .select() method, and even there it would be wrong, because it selects <a> elements with the class data-permalink, not the attribute.
To match any element, use * with select():
.select('*[data-permalink]')
or True with findAll():
.findAll(True, attrs={'data-permalink' : True})
Example:
from bs4 import BeautifulSoup
html = '''<a data-permalink="a">link</a>
<b>bold</b>
<i data-permalink="i">italic</i>'''
soup= BeautifulSoup(html, 'html.parser')
permalink = soup.select('*[data-permalink]')
# or
# permalink = soup.findAll(True, attrs={'data-permalink' : True})
print(permalink)
Result (the <b> element is skipped):
[<a data-permalink="a">link</a>, <i data-permalink="i">italic</i>]
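The question also asks about attributes whose value is empty. A small sketch, using made-up markup, suggesting that both approaches test for the presence of the attribute rather than its value:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one empty value, one attribute with no value at all,
# and one anchor without the attribute.
html = '''<a data-permalink="">empty value</a>
<a data-permalink>no value</a>
<a href="/x">no attribute</a>'''
soup = BeautifulSoup(html, "html.parser")

by_selector = soup.select("[data-permalink]")
by_find_all = soup.find_all(attrs={"data-permalink": True})

print([tag.get_text() for tag in by_selector])
print([tag.get_text() for tag in by_find_all])
```

With html.parser, a valueless attribute is stored as an empty string, so both anchors carrying data-permalink are found.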

How to find links within a specified class with Beautiful Soup

I'm using Beautiful Soup 4 to parse a news site for links contained in the body text. I was able to find all the paragraphs that contain the links, but paragraph.get('href') returned None for each one. I'm using Python 3.5.1. Any help is really appreciated.
from bs4 import BeautifulSoup
import urllib.request
import re
soup = BeautifulSoup("http://www.cnn.com/2016/11/18/opinions/how-do-you-deal-with-donald-trump-dantonio/index.html", "html.parser")
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))

Is this what you want?
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):
        print(a.get('href'))
Note that paragraph.get('href') looks for an href attribute on the <div> tag you found. As there's no such attribute, it returns None. Most probably you actually need to find all <a> tags which are descendants of your <div> (this can be done with paragraph("a"), which is a shortcut for paragraph.find_all("a")) and then look at the href attribute of each <a> element.
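A self-contained sketch of the nested search on made-up markup (the class name is from the question, the links are hypothetical). Note also that BeautifulSoup only parses the markup it is given and does not download pages, so in the question's script the HTML would first have to be fetched, for example with urllib.request.urlopen(url).read():

```python
from bs4 import BeautifulSoup

# Hypothetical page body; a real script would pass the downloaded HTML here.
html = '''<div class="zn-body__paragraph">Read <a href="/first">one</a> and <a href="/second">two</a>.</div>
<div class="zn-body__paragraph">No links here.</div>
<div class="other"><a href="/ignored">ignored</a></div>'''
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    # The <div> itself has no href, so paragraph.get("href") would be None.
    for a in paragraph.find_all("a", href=True):
        hrefs.append(a["href"])
print(hrefs)

# The same query written as a single CSS selector.
selected = [a["href"] for a in soup.select("div.zn-body__paragraph a[href]")]
print(selected)
```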

How find specific data attribute from html tag in BeautifulSoup4?

Is there a way to find an element using only the data attribute in html, and then grab that value?
For example, with this line inside an html doc:
<ul data-bin="Sdafdo39">
How do I retrieve Sdafdo39 by searching the entire html doc for the element that has the data-bin attribute?

A slightly more precise version:
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
This way, the list being iterated contains only the ul elements that have the attribute you are looking for:
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]

You can use the find_all method to get all the tags and keep only those that have "data-bin" among their attributes. Then simply extract the corresponding value, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
print([item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs])
# ['Sdafdo39']

You could solve this with gazpacho in just a couple of lines:
First, import and turn the html into a Soup object:
from gazpacho import Soup
html = """<ul data-bin="Sdafdo39">"""
soup = Soup(html)
Then you can just search for the "ul" tag and extract the data-bin attribute:
soup.find("ul").attrs["data-bin"]
# Sdafdo39

As an alternative, if you prefer CSS selectors, use select() instead of find_all():
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
soup = BeautifulSoup(html_doc)
# Select
soup.select('ul[data-bin]')
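select() returns the matching elements, so the attribute value still has to be read from each one. A short sketch of that last step (the closing </ul> is added here for tidiness):

```python
from bs4 import BeautifulSoup

html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39"></ul>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Every <ul> that carries the attribute, reduced to the attribute values.
values = [ul["data-bin"] for ul in soup.select("ul[data-bin]")]
print(values)

# select_one() returns the first match (or None) regardless of tag name.
first = soup.select_one("[data-bin]")
print(first["data-bin"] if first is not None else None)
```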

BeautifulSoup getting href [duplicate]

I have the following soup:
<a href="some_url">next</a>
<span class="class">...</span>
From this I want to extract the href, "some_url"
I can do it if I only have one tag, but here there are two tags. I can also get the text 'next' but that's not what I want.
Also, is there a good description of the API somewhere, with examples? I'm using the standard documentation, but I'm looking for something a little more organized.

You can use find_all in the following way to find every a element that has an href attribute, and print each one:
# Python2
from bs4 import BeautifulSoup
html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''
soup = BeautifulSoup(html)
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']
# The output would be:
# Found the URL: some_url
# Found the URL: another_url
# Python3
from bs4 import BeautifulSoup
html = '''<a href="https://some_url.com">next</a>
<span class="class">
<a href="https://some_other_url.com">another_url</a></span>'''
soup = BeautifulSoup(html)
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com
Note that if you're using an older version of BeautifulSoup (before version 4) the name of this method is findAll. In version 4, BeautifulSoup's method names were changed to be PEP 8 compliant, so you should use find_all instead.
If you want all tags with an href, you can omit the name parameter:
href_tags = soup.find_all(href=True)
