I am writing a web crawler, but I have a problem with the function that recursively follows links.
Let's suppose I have a page: http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind.
I collect all the links, then open each link recursively, downloading all of its links again, and so on.
The problem is that some links, although they have different URLs, lead to the same page, for example:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation
gives the same page as the previous link.
So I end up in an infinite loop.
Is there any way to check whether two links lead to the same page without comparing the entire content of the pages?
You can store a hash of the content of previously seen pages and check whether the page has already been seen before continuing, so you don't process the same page (and its links) twice.
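A minimal sketch of that idea in Python 3 (the seen_hashes set and the crawl function are illustrative names, not from your code):

import hashlib
from urllib.request import urlopen

seen_hashes = set()

def crawl(url):
    html = urlopen(url).read()
    digest = hashlib.sha256(html).hexdigest()
    if digest in seen_hashes:
        # identical content was already processed, stop here
        return
    seen_hashes.add(digest)
    # ... extract the links from html and call crawl() on each of them ...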
You can use urlparse() and check whether the .path part of the base URL and of each link you crawl is the same:
from urllib2 import urlopen
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind"
base_url = urlparse(url)

soup = BeautifulSoup(urlopen(url))
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        # resolve relative hrefs against the page url, without overwriting it
        link_url = urljoin(url, link['href'])
        print link_url, urlparse(link_url).path == base_url.path
Prints:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation True
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#p-search True
http://en.wikipedia.org/wiki/File:Set_partitions_4;_Hasse;_circles.svg False
...
http://en.wikipedia.org/wiki/Equivalence_relation False
...
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind True
...
https://www.mediawiki.org/ False
This particular example uses BeautifulSoup to parse the Wikipedia page and get all the links, but the actual HTML parser is not really important here. What matters is that you extract the links and get their paths to compare.
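Building on that, a rough sketch (the visited set and the seen_before helper are illustrative names) of how you could use the parsed URL to skip fragment-only variants of the same page:

from urlparse import urlparse, urldefrag

visited = set()

def seen_before(link_url):
    # drop any #fragment, then key on host + path (+ query) so that
    # fragment-only variants of the same page are crawled only once
    link_url, _ = urldefrag(link_url)
    parsed = urlparse(link_url)
    key = (parsed.netloc, parsed.path, parsed.query)
    if key in visited:
        return True
    visited.add(key)
    return False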
Hi, I want to get the text (the number 18) from the em tag, as shown in the picture above.
When I run my code, it doesn't work and only gives me an empty list. Can anyone help me? Thank you~
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website; then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own, for example: https://github.com/scrapinghub/splash
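As a rough sketch of the second option, assuming a local Splash instance running on its default port 8050 (for example via its Docker image); whether the like count actually ends up in the returned HTML depends on how the page loads it (see the EDIT below about the iframe):

import requests
from bs4 import BeautifulSoup

url = 'https://blog.naver.com/kwoohyun761/221945923725'
# ask Splash to render the page and return the resulting HTML;
# the wait value is an arbitrary guess at how long the page needs
rendered = requests.get('http://localhost:8050/render.html',
                        params={'url': url, 'wait': 2})
soup = BeautifulSoup(rendered.text, 'lxml')
print(soup.find_all('em', class_='u_cnt _count'))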
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print call).
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, it actually loads an iframe, and that iframe contains the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective, you first have to grab the initial page and then grab the page you are actually interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe; at first glance it looks like the iframe URL can be constructed from the original URL.
You'll still need a rendered version of the iframe URL to grab your desired value.
I don't know what this site is about, but it seems they do not want to be crawled; maybe you should respect that.
I am having trouble figuring out how to use BeautifulSoup to scrape all 100 link titles on the page, since they are inside "a href=..." tags. I have tried the code below, but it returns a blank result.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import bs4
url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100'
page = urlopen(url)
soup = bs4.BeautifulSoup(page,'html.parser')
title = soup.find_all('a')
Additionally, is there a way to ensure I am scraping everything under the "Tables (8898)" tab? Thanks in advance!
Link:
https://www150.statcan.gc.ca/n1/en/type/data?count=100
The link you provided loads its contents with asynchronous JavaScript requests, so when you execute page = urlopen(url) you only fetch the empty HTML shell and the JavaScript blocks.
You need to use a browser to execute the JavaScript that loads the page contents. You can check out this link to learn how to do it: https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab
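For illustration, a minimal sketch along those lines with Selenium (a matching ChromeDriver is assumed to be installed and on your PATH, and the sleep is a crude placeholder for a proper wait):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100'

driver = webdriver.Chrome()
try:
    driver.get(url)
    time.sleep(5)  # crude wait for the async requests; WebDriverWait is more robust
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get_text(strip=True), link.get('href'))
finally:
    driver.quit()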
I've made a scraper in Python, and it is running smoothly. Now I would like to discard or accept specific links from that page, i.e. only links containing "mobiles", but even after adding a conditional statement I can't do so. I hope I can get some help to rectify my mistakes.
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))

SpecificItem()
On the other hand, if I do the same thing using the lxml library with XPath, it works.
import requests
from lxml import html

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)

SpecificItem()
So, at this point I think the code needs to be somewhat different with the BeautifulSoup library to achieve the same result.
The root of your problem is that your if condition behaves differently between BeautifulSoup and lxml. Basically, if "mobiles" not in link: with BeautifulSoup is not checking the href attribute; on a BeautifulSoup Tag, the in operator checks the tag's children (link.contents), not its attributes. Explicitly using the href attribute does the trick:
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        href = link.get('href')
        # skip anchors without an href, then filter on the url itself
        if href and "mobiles" not in href:
            print(href)

SpecificItem()
That prints out a bunch of links, none of which include "mobiles".
There's something I still don't understand about using BeautifulSoup. I can use this to parse the raw HTML of a webpage, here "example_website.com":
from bs4 import BeautifulSoup # load BeautifulSoup class
import requests
r = requests.get("http://example_website.com")
data = r.text
soup = BeautifulSoup(data)
# soup.find_all('a') grabs all elements with <a> tag for hyperlinks
Then, to retrieve and print all elements with the 'href' attribute, we can use a for loop:
for link in soup.find_all('a'):
    print(link.get('href'))
What I don't understand: I have a website with several webpages, and each webpage lists several hyperlinks to a single webpage with tabular data.
I can use BeautifulSoup to parse the homepage, but how do I use the same Python script to scrape page 2, page 3, and so on? How do you "access" the contents found via the 'href' links?
Is there a way to write a Python script to do this? Should I be using a spider?
You can do that with requests+BeautifulSoup for sure. It would be of a blocking nature, since you would process the extracted links one by one and would not proceed to the next link until you are done with the current one. Sample implementation:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

with requests.Session() as session:
    r = session.get("http://example_website.com")
    data = r.text
    soup = BeautifulSoup(data)

    base_url = "http://example_website.com"
    for link in soup.find_all('a'):
        url = urljoin(base_url, link.get('href'))
        r = session.get(url)
        # parse the subpage
Though, it may quickly get complex and slow.
You may need to switch to the Scrapy web-scraping framework, which makes scraping, crawling, and following links easy (check out CrawlSpider with link extractors), fast, and non-blocking in nature (it is based on Twisted).
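For reference, a minimal CrawlSpider sketch along those lines (the spider name, the domain, and what parse_item extracts are placeholders to adapt to your site):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example_website.com']
    start_urls = ['http://example_website.com']

    # follow every link on the site and hand each fetched page to parse_item
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # parse the subpage here, e.g. pull out the tabular data you need
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}

You would run it with something like scrapy runspider myspider.py -o items.json (the file names are just examples).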
I'm trying to parse a wiki page here, but I only want certain parts: the links in the main article, and I'd like to parse them all. Is there an article or tutorial on how to do it? I'm assuming I'd be using BS4. Can anyone help?
Specifically speaking, the links that are under all the main headers on the page.
Well, it really depends on what you mean by "parse", but here is a full working example of how to extract all links from the main section with BeautifulSoup:
from bs4 import BeautifulSoup
import urllib.request

def main():
    url = 'http://yugioh.wikia.com/wiki/Card_Tips%3aBlue-Eyes_White_Dragon'
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page.read())
    content = soup.find('div', id='mw-content-text')
    links = content.findAll('a')
    for link in links:
        print(link.get_text())

if __name__ == "__main__":
    main()
This code should be self-explanatory, but just in case:
First we open the page with urllib.request.urlopen and pass its contents to BeautifulSoup.
Then we extract the main content div by its id. (The id mw-content-text can be found in the page's source.)
We proceed with extracting all the links inside the main content.
In a for loop we print all the links.
Additional methods you might need for parsing the links (see the short example below):
link.get('href') extracts the destination URL
link.get('title') extracts the title attribute of the link
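For example, building on the loop in the code above:

# print the text, destination URL, and title attribute of each link
for link in links:
    print(link.get_text(), link.get('href'), link.get('title'))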
And since you asked for resources: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ is the first place you should start.