I'm using Beautiful Soup 4 to parse a news site for links contained in the body text. I was able to find all the paragraphs that contained the links but the paragraph.get('href') returned type none for each link. I'm using Python 3.5.1. Any help is really appreciated.
from bs4 import BeautifulSoup
import urllib.request
import re
html = urllib.request.urlopen("http://www.cnn.com/2016/11/18/opinions/how-do-you-deal-with-donald-trump-dantonio/index.html").read()
soup = BeautifulSoup(html, "html.parser")
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))
Do you really want this?
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):
        print(a.get('href'))
Note that paragraph.get('href') tries to find the attribute href on the <div> tag you found. As there's no such attribute, it returns None. Most probably you actually want to find all <a> tags that are descendants of your <div> (this can be done with paragraph("a"), which is a shortcut for paragraph.find_all("a")), and then look at the href attribute of every <a>.
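To see why the original loop prints None, here is a self-contained sketch; the HTML is made up to mimic CNN's zn-body__paragraph markup, and the URLs are placeholders:

```python
from bs4 import BeautifulSoup

html = '''
<div class="zn-body__paragraph">Intro text with
<a href="http://example.com/one">one link</a> and
<a href="http://example.com/two">another</a>.
</div>'''
soup = BeautifulSoup(html, "html.parser")

links = []
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))   # None -- the <div> itself has no href attribute
    for a in paragraph("a"):       # shortcut for paragraph.find_all("a")
        links.append(a.get('href'))
print(links)
```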
I am trying to build a function in a python webscraper that moves to the next page in a list of results. I am having trouble locating the element in beautiful soup as the link is found at the end of many other tags, and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. finding the last .a['href'] in the parent element, as it is always the last one.
2. finding the href based on the fact that it always has text of 'Next'
I don't know how to write something that would solve either 1. or 2.
Am I along the right lines? Does anyone have any suggestions to achieve my goal? Thanks
To find the <a> tag that contains the text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', text=lambda t: t.strip() == 'Next')['href'])
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
txt = '''
<div id="block">
<a href="http://www.url?&=page=1">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
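If you prefer to keep it all in one selector, soupsieve also supports :last-of-type. A sketch using the same made-up snippet (anchors and URLs assumed):

```python
from bs4 import BeautifulSoup

txt = '''
<div id="block">
<a href="http://www.url?&=page=1">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')

# :last-of-type matches the last <a> among its sibling <a> tags
last_href = soup.select_one('div#block > a:last-of-type')['href']
print(last_href)
```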
I'm trying to use/learn beautifulsoup4 to scrape some basic data from a website, specifically the information contained within the html record below:
<li class="is-first-in-list css-9999999" data-testid="record-999999999">
I have around 285,000 records all with a unique
data-testid
However, while I can obtain the information from classes and tags I am familiar with, custom attributes are still evading me.
I've tried variations of:
for link in soup.find_all("data-testid"):
    print()  # changed to include data-testid.text / innertext / left blank etc etc
The remainder of my code appears to work, as I can extract tags and href data without issue (including printing these in the terminal), just nothing from custom attributes. I'm sure the solution is mind-bogglingly simple; I've just failed to come up with a success yet!
Is this what you are trying to do?
from bs4 import BeautifulSoup
html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">"""
soup = BeautifulSoup(html, features='html.parser')
for link in soup.select("li"):
    print(link.get('data-testid'))
Prints
record-999999999
With class select
from bs4 import BeautifulSoup
html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">
<li class="hello css-9999999" data-testid="record-8888888">
<li class="0mr3 css-9999999" data-testid="record-777777">"""
soup = BeautifulSoup(html, features='html.parser')
for link in soup.select("li.is-first-in-list"):
    print(link.get('data-testid'))
Prints
record-999999999
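You can also match on the custom attribute directly, either with a CSS attribute selector or with find_all's attrs dict (which treats True as "the attribute is present"). A sketch with the same invented records:

```python
from bs4 import BeautifulSoup

html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999"></li>
<li class="hello css-9999999" data-testid="record-8888888"></li>"""
soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector: every <li> that carries a data-testid attribute
css_ids = [li['data-testid'] for li in soup.select('li[data-testid]')]

# find_all equivalent; data-testid has a hyphen, so it must go through attrs=
attr_ids = [li['data-testid'] for li in soup.find_all('li', attrs={'data-testid': True})]

print(css_ids, attr_ids)
```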
Similar to #0m3r but with a few tweaks
from bs4 import BeautifulSoup
html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">"""
soup = BeautifulSoup(html, features='lxml')
for link in soup.find_all("li", class_="is-first-in-list css-9999999"):
    print(link.get('data-testid'))
Generally, I find lxml is a lot faster than html.parser.
That said, there are things you can do to speed up Beautiful Soup. If you’re not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.
from https://www.crummy.com/software/BeautifulSoup/bs4/doc/, especially knowing that you have 285,000 elements to loop through. Also, the class_ argument gives you more restrictions on the class, so you're not sifting through every li element.
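Another speed lever, not mentioned in the quoted docs: a SoupStrainer tells Beautiful Soup to build only the tags you care about, skipping the rest of the document at parse time. A sketch with invented records (html.parser used here so it runs without lxml installed; swap in 'lxml' for more speed):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """<div><p>lots of markup you do not care about</p>
<li class="is-first-in-list css-9999999" data-testid="record-999999999"></li>
<li class="hello css-9999999" data-testid="record-8888888"></li></div>"""

# Only <li> elements are ever built; everything else is discarded during parsing
only_li = SoupStrainer('li')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_li)

ids = [li.get('data-testid') for li in soup.find_all('li')]
print(ids)
```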
I am trying to parse a webpage (forums.macrumors.com) and get a list of all the threads posted.
So I have got this so far:
import urllib2
from BeautifulSoup import BeautifulSoup

address = "http://forums.macrumors.com/forums/os/"
text = urllib2.urlopen(address).read()
soup = BeautifulSoup(text)
Now the webpage source has this code at the start of each thread:
<li id="thread-1880" class="discussionListItem visible sticky WikiPost "
data-author="ABCD">
How do I parse this so I can then get to the thread link within this li tag? Thanks for the help.
So from your code here, you have the soup object, which contains the parsed BeautifulSoup tree of your html. The question is: which part of the tag you're looking for is static? Is the id always the same? The class?
Finding by the id:
my_li = soup.find('li', {'id': 'thread-1880'})
Finding by the class:
my_li = soup.find('li', {'class': 'discussionListItem visible sticky WikiPost '})
Ideally you would figure out the unique class you can check for and use that instead of a list of classes.
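For example, class_ is matched against each CSS class separately, so one stable class is enough even when the tag carries several. A sketch using a minimal copy of the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '<li id="thread-1880" class="discussionListItem visible sticky WikiPost" data-author="ABCD"></li>'
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any one of the tag's classes, not the full attribute string
my_li = soup.find('li', class_='discussionListItem')
print(my_li['id'])
```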
If you are expecting an a tag inside of this object, you can do this to check:
if my_li and my_li.a:
    print my_li.a.attrs.get('href')
I always recommend checking, though, because if my_li ends up being None or there is no a inside of it, your code will fail.
For more details, check out the BeautifulSoup documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
The idea here would be to use CSS selectors and to get the a elements inside the h3 with class="title" inside the div with class="titleText" inside the li element having the id attribute starting with "thread":
for link in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]"):
    print link["href"]
You can tweak the selector further, but this should give you a good starting point.
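To check the selector without hitting the live forum, here is a self-contained run against a minimal mock of the markup (the structure and URL are assumed from the question, not taken from the real page):

```python
from bs4 import BeautifulSoup

html = '''
<div class="discussionList">
  <li id="thread-1880" class="discussionListItem" data-author="ABCD">
    <div class="titleText"><h3 class="title">
      <a href="/threads/some-thread.1880/">Some thread</a>
    </h3></div>
  </li>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# [id^=thread] matches every id that starts with "thread"
hrefs = [a['href'] for a in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]")]
print(hrefs)
```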
I am new to Python and I am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and to get the CSS path, but this code returns me nothing. I am looking for the fix, and also for some suggestions on how to choose proper CSS selectors to retrieve desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests
url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print link.get('href')
The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.
If you want Upcoming Events, you want just the first <div class="events-horizontal">, then just grab the <div class="title"><a href="..."></div> tags, so the links on titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
Note that you should not use r.text; use r.content and leave decoding to BeautifulSoup. See Encoding issue of a character in utf-8
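Here is the same approach runnable end to end; the markup is invented to mirror the page's structure (two events-horizontal blocks, titles wrapping links), and the URLs are placeholders:

```python
from bs4 import BeautifulSoup

html = '''
<div class="events-horizontal">
  <div class="title"><a href="http://allevents.in/lahore/event-1">Event 1</a></div>
  <div class="title"><a href="http://allevents.in/lahore/event-2">Event 2</a></div>
</div>
<div class="events-horizontal">
  <div class="title"><a href="http://allevents.in/lahore/past-event">Past event</a></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# select_one returns only the FIRST matching block -- the upcoming events
upcoming_events_div = soup.select_one('div.events-horizontal')
links = [a['href'] for a in upcoming_events_div.select('div.title a[href]')]
print(links)
```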
import bs4, requests

res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text)
for link in soup.select('a[property="schema:url"]'):
    print link.get('href')
This code will work fine!!
I am given the following html :
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="http://SOMETHING/Acaryochloris_marina_MBIC11017_/">Acaryochloris_marina_MBIC11017_</A> Jun 12 2013
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="http://SOMETHING/Acetobacter_pasteurianus_386B_u/">Acetobacter_pasteurianus_386B_u</A> Aug 8 2013
and many more...
I want to extract the href from here.
Here's my python script : (page_source contains the html)
soup = BeautifulSoup(page_source)
links = soup.find_all('a', attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag.get('href', None)
    if link != None:
        print link
But this keeps returning the following error:
links = soup.find_all('A',attrs={'HREF': re.compile("^http://")})
TypeError: 'NoneType' object is not callable
You are using BeautifulSoup version 3, not version 4. soup.find_all is then not interpreted as a method, but as a search for the first <find_all> element. Because there is no such element, soup.find_all resolves to None.
Install BeautifulSoup 4 instead, the import is:
from bs4 import BeautifulSoup
BeautifulSoup 3 is instead imported as from BeautifulSoup import BeautifulSoup.
If you are sure you wanted to use BeautifulSoup 3 (not recommended), then use:
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
As a side note, because you limit your search to <a> tags with a matching href attribute, there will always be a href attribute on the elements that are found. Using .get() and testing for None is entirely redundant. The following is equivalent:
links = soup.find_all('a', attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag['href']
    print link
BeautifulSoup 4 also supports CSS selectors, which could make your query a little more readable still, removing the need for you to specify a regular expression:
for tag in soup.select('a[href^="http://"]'):
    link = tag['href']
    print link
Why not use the split command?
Iterate over all the lines of the file and do something like this:
href = line.split("HREF=\"")[1].split("\"")[0]
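For a line like those in the question, the split chain works as below, but note how brittle it is: it breaks if the attribute uses single quotes, lowercase href, or extra whitespace, which is why a parser is usually the safer choice. The URL here is a made-up placeholder:

```python
line = '<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="http://example.com/Acaryochloris_marina_MBIC11017_/">Acaryochloris_marina_MBIC11017_</A> Jun 12 2013'

# Split off everything up to and including HREF=", then cut at the closing quote
href = line.split('HREF="')[1].split('"')[0]
print(href)
```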