I'm attempting to access the URLs of the different fish families from this website: http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon
I'd like to be able to run a script that opens the links on the given page so I can then parse the information stored within those pages. I'm fairly new to web scraping, so any help will be appreciated. Thanks in advance!
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)

# Print the href of every link on the page
for fish in soup.findAll('a', href=True):
    print fish['href']
Scrapy is the perfect tool for this. It is a Python web scraping framework.
http://doc.scrapy.org/en/latest/intro/tutorial.html
You can pass in your URL with your search term and create rules for crawling.
For example, using a regex you would add a rule to scrape all links with the path /Summary, and then extract the information using XPath or Beautiful Soup.
Additionally, you can set up a rule to automatically handle the pagination, i.e., in your example URL it could automatically follow the Next link.
Basically, a lot of what you are trying to do comes packaged for free in Scrapy. I would definitely take a look into it.
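As a rough illustration, a CrawlSpider along those lines might look like the sketch below (it needs a reasonably recent Scrapy). The /Summary path and the Next link come from the description above; the exact link patterns on fishbase.org are an assumption, so adjust the regexes after inspecting the site.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FishSpider(CrawlSpider):
    name = 'fishbase'
    allowed_domains = ['fishbase.org']
    start_urls = ['http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon']

    rules = (
        # Follow each species summary page and hand it to parse_summary
        # (the /Summary path is assumed from the answer above)
        Rule(LinkExtractor(allow=r'/Summary'), callback='parse_summary'),
        # Follow "Next" pagination links; no callback, just keep crawling
        Rule(LinkExtractor(restrict_text=r'Next')),
    )

    def parse_summary(self, response):
        # Pull out whatever fields you need with XPath or CSS selectors
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}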
If you're just writing a one-off script to grab all the data from this site, you can do something like:
fish_url_base = "http://www.fishbase.org/ComNames/%s"
# href=True skips anchors that have no href attribute
fish_urls = [fish_url_base % a['href'] for a in soup.find_all('a', href=True)]
This gives you a list of links to traverse, which you can pass to urllib2.urlopen and BeautifulSoup:
for url in fish_urls:
    fish_soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Do something with your fish_soup
(Note 1: I haven't tested this code; you might need to adjust the base URL to fit the href attributes so you end up at the right pages.)
(Note 2: I see you're using bs4 but calling findAll on the soup. findAll was right for BS3, but it was renamed to find_all in bs4; the old name still works as an alias, but find_all is the preferred spelling.)
(Note 3: If you're doing this for practical purposes rather than for learning or fun, there are more efficient ways of scraping, such as Scrapy, also mentioned here.)
Related
I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried using the soup.find_all() method, and to see if it works I printed the result, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content, "lxml")

trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and uses scripts to load its data. When you make a request with requests.get("https://www.youtube.com/feed/explore"), you get back only the initial source document, which contains the head, meta tags, and the scripts themselves. In a real browser, those scripts then fetch the data from the server. BeautifulSoup does not execute JavaScript, so it never sees the changes the scripts make to the DOM. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: in the downloaded document there is no ytd-video-renderer tag and no style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
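For example, a minimal Selenium sketch might look like the following. It assumes Chrome and the selenium package are installed, and it reuses the tag name from the question; YouTube's markup changes often, so treat the selector as an assumption to verify.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/feed/explore")
time.sleep(5)  # crude wait so the page's scripts can render the videos

# Hand the rendered DOM to BeautifulSoup instead of the raw response
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

trendings = soup.find_all("ytd-video-renderer")
print(len(trendings))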
I am attempting to scrape some basic product information from the URL below, but the bs4 find_all command isn't finding any data given the class name of the product div. Specifically, I am trying:
import requests
from bs4 import BeautifulSoup

url = 'https://www.walmart.com/grocery/browse/Cereal-&-Breakfast-Food?aisle=1255027787111_1255027787501'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

product_list = soup.find_all('div', class_='productListTile')
print(product_list)
But this prints an empty list []. Having inspected the webpage on Chrome, I know that 'productListTile' is the correct class name. Any idea what I am doing wrong?
You will most likely need to use Selenium: plain HTTP requests get redirected to a "Verify Your Identity" page, so Beautiful Soup never sees the product markup.
Here is a very similar question to this one, with code that uses Selenium and Beautiful Soup in concert to scrape Wal-Mart:
python web scraping using beautiful soup is not working
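A rough sketch of that Selenium-plus-Beautiful-Soup approach is below. It assumes Chrome with a matching chromedriver and reuses the productListTile class from the question; Walmart's bot detection may still intervene, so this is illustrative rather than guaranteed.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.walmart.com/grocery/browse/Cereal-&-Breakfast-Food?aisle=1255027787111_1255027787501')
time.sleep(5)  # give the product grid time to render

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

product_list = soup.find_all('div', class_='productListTile')
print(len(product_list))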
Web scraping techniques vary between websites. In this case, you can either use Selenium, which is a good option, or try another method that has helped me a lot: skipping the HTML entirely and calling the backend API directly.
Inspect the web page, select the Network tab, and refresh the page. Then sort the requests by type: you will see the API calls the page makes to fetch its data from the backend, and you can call those endpoints yourself. Check a request's Headers to find the API endpoint, and its Preview to see the response in JSON format.
If you also want the images, check the response body: you will see the image URLs, which you can download and map to the product IDs.
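A hypothetical sketch of that direct-API approach follows. The endpoint URL and the response fields are invented for illustration; substitute the real endpoint and JSON shape you see in the Network tab.

import requests

# Hypothetical endpoint copied from the Network tab's Headers pane
api_url = 'https://www.example.com/api/products?category=cereal'
headers = {'User-Agent': 'Mozilla/5.0'}  # many backends reject requests without one

response = requests.get(api_url, headers=headers)
response.raise_for_status()
data = response.json()  # the backend returns JSON, so no HTML parsing is needed

for item in data.get('products', []):  # hypothetical response shape
    print(item.get('name'), item.get('price'))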
I am learning Spanish, and to help me learn the different verbs and their conjugations I am making some flash cards to use on my phone.
I am trying to scrape the data from a web page; here is an example page for one verb. On the page there are a few tables, and I am interested in the first five (Present, Future, Imperfect, Preterite & Conditional) near the top.
I have heard that BeautifulSoup is good for these types of projects. However, when I use the prettify method I can't find the tables anywhere in the text. I think I'm missing something; how can I get these tables in Python?
import requests
from bs4 import BeautifulSoup

URL = 'https://www.linguasorb.com/spanish/verbs/conjugation/tener.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
txt = soup.prettify()
You're loading the wrong URL. Remove the ".html" from the URL variable and you will be able to find the tables (they're actually lists) in the output: soup.find_all('div', class_='vPos')
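Putting that together, a minimal sketch might look like this. The trimmed URL and the vPos class come from the answer above; how each block is structured internally is an assumption, so print it first and inspect.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.linguasorb.com/spanish/verbs/conjugation/tener'  # note: no .html
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Each vPos div holds one conjugation block; dump its text to see the layout
for block in soup.find_all('div', class_='vPos'):
    print(block.get_text(' ', strip=True))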
I was trying to get some headlines from the New York Times website. I have two questions.
Question 1:
This is my code, but it gives me no output; does anyone know what I'd have to change?
import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

a = soup.find_all(class_="balancedHeadline")
for story_heading in a:
    print(story_heading)
My second question:
As the HTML is not the same for all headlines (there's a different class for the big headlines and for the smaller ones, for example), how would I handle all those different classes in my code so that it outputs all of the headlines?
Thanks in advance!
BeautifulSoup is a robust parsing library. But, unlike your browser, it does not evaluate JavaScript.
The elements with the balancedHeadline class you were looking for are not present in the downloaded HTML document. They get added later, once the page's assets have downloaded and its JavaScript functions have run. You won't be able to find such a class using your current technique.
The answer to your second question is in the docs. A regex or a function would work, but you might find that passing in a list is simpler for your application.
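As a small illustration of the list approach (the second class name here is made up; substitute the classes the headlines actually use):

from bs4 import BeautifulSoup

html = '''
<h2 class="balancedHeadline">Big story</h2>
<h3 class="smallHeadline">Smaller story</h3>
'''
soup = BeautifulSoup(html, "html.parser")

# A list passed to class_ matches elements carrying any of the listed classes
for h in soup.find_all(class_=["balancedHeadline", "smallHeadline"]):
    print(h.get_text(strip=True))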
I have been trying to scrape Facebook comments using Beautiful Soup on the webpage below.
import BeautifulSoup
import urllib2

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(html)

fb_comment = soup.find("div", {"class": "postText"})
print fb_comment
The output is empty. However, I can clearly see that the Facebook comment is within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering if the approach is correct and where I am going wrong.)
Like Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), and then use BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
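For the first step, here is a minimal sketch of pulling that tag out of the page, in the same BS3/urllib2 style as the question (the Facebook API call that then returns the actual comments is not shown):

import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://techcrunch.com/2012/05/15/facebook-lightbox/').read()
soup = BeautifulSoup.BeautifulSoup(html)

# The comments widget is keyed on the href of the <fb:comments> tag
tag = soup.find('fb:comments')
if tag is not None:
    print tag['href']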
The parts of the page you are looking for are not included in the source file; use a browser and you can see this for yourself by viewing the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.