I am trying to web-scrape https://in.udacity.com/courses/all. I need to get the courses shown for a search query. For example, if I enter "python", 17 courses come up as results, and I need to fetch only those. The search query is not passed as part of the URL (it is not a GET request), so the HTML content does not change either. How can I fetch those results without going through the entire course list?
In this code I fetch all the course links, get the content of each one, and search for the search term in that content, but it is not giving me the result I expect.
import requests
import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')
search_term = input("enter the course:")

for link in courses:
    # fetch each course page and look for the search term in its visible text
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()
    if search_term in text_from_html(html).lower():
        print('\n' + link.text)
        print("https://in.udacity.com" + link['href'])
Using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")

for course in courses:
    print(course.text)
OUTPUT:
VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.
EDIT:
As explained by @Martin Evans, the Ajax call behind the search is not doing what you think it is: it is probably just keeping a count of the searches, i.e. how many users searched for "AI". The code below simply filters the courses based on the keyword in search_term:
import requests
import re
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"

for course in courses:
    if re.search(search_term, course.text, re.IGNORECASE):
        print(course.text)
OUTPUT:
AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
The Udacity page actually returns all available courses when you request it. When you enter a search, the page simply filters the data it already has, which is why you do not see any change to the URL and why the "search" is so fast. A check with the browser's developer tools confirms this.
As such, if you are searching for a given course, you would just need to filter the results yourself. For example:
import requests
from bs4 import BeautifulSoup

req = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(req.content, "html.parser")
a_tags = soup.find_all("a", class_="capitalize")

print("Number of courses:", len(a_tags))
print()

for a_tag in a_tags:
    course = a_tag.text
    if "python" in course.lower():
        print(course)
This would display all courses with Python in the title:
Number of courses: 225
Python Foundation
AI Programming with Python
Programming Foundations with Python
Data Structures & Algorithms in Python
Read the tutorials for requests (for making HTTP requests) and BeautifulSoup (for processing HTML). They will teach you what you need to know to download the pages and extract the data from the HTML.
You will use BeautifulSoup.find_all() to locate all of the <div> elements in the page HTML with class="course-summary-card". The content you want is within those <div>s, and after reading the links above it should be straightforward to figure out the rest ;)
By the way, one helpful tool as you learn how to do this is the "Inspect element" feature in Chrome/Firefox, accessed by right-clicking an element in the browser. It lets you look at the source code around the element you want to extract, so you can find its class, id, parent divs, and so on, and then select it with BeautifulSoup/lxml/etc.
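A minimal sketch of that approach (the course-summary-card class comes from inspecting the page, so it may change if Udacity reworks its markup):
import requests
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, "html.parser")

# each course card is a <div class="course-summary-card">; drill into it
# for the title, link, and any other fields you need
for card in soup.find_all("div", class_="course-summary-card"):
    print(card.get_text(" ", strip=True))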
Related
I created a bs4 web-scraping app with Python. My program returns an empty list for the reviews; printing the soup itself works normally.
from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []
usernames = []
titles = []
comments = []

result = requests.get('https://www.kupujemprodajem.com/review.php?action=list')
soup = BeautifulSoup(result.text, 'html.parser')

review = soup.findAll('div', class_="single-review")
print(review)

for i in review:
    header = i.find('div', class_="single-review__header")
    footer = i.find('div', class_="comment-holder")
    username = header.find('a', class_="single-review__username").text
    title = header.find('div', class_="single-review__related-to").text
    comment = footer.find('div', class_="single-review__comment").text
    usernames.append(username)
    titles.append(title)
    comments.append(comment)

data.append(usernames)
data.append(titles)
data.append(comments)
print(data)
It isn't a problem with the class name.
It looks like the reason this doesn't work is that the website requires a login to access that page. If you were to visit https://www.kupujemprodajem.com/review.php?action=list in a private browser tab, it would just take you to a login page.
There are two paths I can think of that you could take here:
Reverse engineer how the login process works and use the requests library to log in and capture (most likely) the session cookie, so you can request pages that require sign-in (a rough sketch of this is shown after this list).
(much simpler) Use Selenium instead. Selenium is a library that lets you control a full browser instance, so you can easily enter credentials this way. Beautiful Soup, on the other hand, simply parses HTML, so things like authentication often take much more work in Beautiful Soup than they do in Selenium. I'd definitely suggest looking into it if you haven't already.
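A rough sketch of the first path, assuming a requests session; the login URL and form field names below are guesses and need to be replaced with whatever the browser's Network tab shows for the real login request:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# hypothetical endpoint and field names -- copy the real ones from the
# login request you see in the browser's developer tools
login_data = {"email": "you@example.com", "password": "your-password"}
session.post("https://www.kupujemprodajem.com/login.php", data=login_data)

# the session now carries the login cookies, so this should return the
# real review list instead of the login page
result = session.get("https://www.kupujemprodajem.com/review.php?action=list")
soup = BeautifulSoup(result.text, "html.parser")
print(len(soup.findAll("div", class_="single-review")))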
I'm working on a web-scraping project using BS4, trying to aggregate college tuition data, with tuitiontracker.org as the data source. Once I've navigated to a specific college, I want to scrape the tuition data off the site. When I inspect the element, I can see the tuition figures stored as innerHTML, but when I use Beautiful Soup to find them, it returns everything at that location except for the tuition data.
Here is the url I'm trying to scrape from: https://www.tuitiontracker.org/school.html?unitid=164580
Here is the code I am using:
import urllib.request
from bs4 import BeautifulSoup

DOWNLOAD_URL = "https://www.tuitiontracker.org/school.html?unitid=164580"

def download_page(url):
    return urllib.request.urlopen(url)

def parse_html(html):
    """Gathers data from an HTML page"""
    soup = BeautifulSoup(html, features="html.parser")
    tuition_data = soup.find("div", attrs={"id": "price"}).innerHTML
    print(tuition_data)

def main():
    parse_html(download_page(DOWNLOAD_URL).read())

if __name__ == "__main__":
    main()
When I print tuition_data, I see the relevant tags where the tuition data is stored on the page, but no number value. I've tried using .innerHTML and .string, but they end up printing either None or simply a blank space.
Really quite confused; thanks for any clarification.
The data comes from an API endpoint and is dynamically rendered by JavaScript so you won't get it with BeautifulSoup.
However, you can query the endpoint.
Here's how:
import requests

url = "https://www.tuitiontracker.org/school.html?unitid=164580"
api_endpoint = "https://www.tuitiontracker.org/data/school-data-09042019/"

# the school's unitid (the last part of the page URL) names the JSON file
response = requests.get(f"{api_endpoint}{url.split('=')[-1]}.json").json()
tuition = response["yearly_data"][0]

print(
    round(tuition["price_instate_oncampus"], 2),
    round(tuition["avg_net_price_0_30000_titleiv_privateforprofit"], 2),
)
Output:
75099.8 30255.86
PS. There's a lot more in that JSON. Pro tip for future web-scraping endeavors: your favorite web browser's Developer Tools should be your best friend.
I am trying to scrape only certain articles from this main page. To be more specific, I only want articles from the Media sub-page, and within that, only from the sub-sub-pages Press releases, Governing Council decisions, Press conferences, Monetary policy accounts, Speeches, and Interviews, and only those in English.
Based on some tutorials and other Stack Overflow answers, I managed to put together code that scrapes absolutely everything from the website, because my original idea was to scrape everything and then clean the output later in a data frame, but the website contains so much that it always freezes after some time.
Getting the sub-links:
import requests
import re
from bs4 import BeautifulSoup

master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)

master_links = []
sub_links = {}

for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)

    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []

    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t" + sub_href)
One thing I tried was to change the base link to the sub-links (my idea was that maybe I could do it separately for every sub-page and later just put the links together), but that did not work. Another thing I tried was to replace the line sub_atags = sub_soup.find_all("a", href=True) with the following:
sub_atags = sub_soup.find_all("a",{'class': ['doc-title']}, herf=True)
This seemed to partially solve my problem: even though it did not get only links from the sub-pages, it at least ignored links that are not 'doc-title', which covers all the links with text on the website. But it was still too much, and some links were not retrieved correctly.
I also tried the following:
for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
        print(master_href)
I thought that because all hrefs to English documents have .en somewhere in them, this would give me only the links where .en occurs somewhere in the href, but this code gives me a syntax error at print(master_href), which I don't understand because print(master_href) worked before.
Next I want to extract the following information from the sub-links. This part of the code works when I test it on a single link, but I never had the chance to try it with the code above since that never finishes running. Will this work once I manage to get the proper list of all links?
for links in sublinks:
    resp = requests.get(links)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall('(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]

import pandas as pd
ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "date": datadate})
Going back to the scraping: since the first approach with Beautiful Soup did not work for me, I also tried to approach the problem differently. The website has RSS feeds, so I wanted to use the following code:
import feedparser
from pandas.io.json import json_normalize
import pandas as pd
import requests
rss_url='https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
df_ecb_feed=json_normalize(ecb_feed.entries)
df_ecb_feed.head()
Here I ran into the problem of not even being able to find the RSS feed URL in the first place. I viewed the page source, searched for "RSS", and tried all the URLs I could find that way, but I always get an empty dataframe.
I am a beginner at web scraping and at this point I don't know how to proceed or how to approach this problem. In the end, what I want to accomplish is to collect all articles from the sub-pages with their titles, dates and authors, and put them into one dataframe.
The biggest problem you have with scraping this site is probably the lazy loading: using JavaScript, they load the articles from several HTML pages and merge them into the list. For details, look for index_include in the source code. This is problematic for scraping with only requests and BeautifulSoup, because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:
Instead of the main article list pages (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g. /press/pr/date/2019/html/index_include.en.html. This is probably the easier option, but you have to do it for each year you're interested in (a sketch of this approach follows the example output below).
Use a client that can execute JavaScript, such as Selenium, to obtain the HTML instead of requests.
Apart from that, I would suggest using CSS selectors for extracting information from the HTML code. This way, you only need a few lines for the article part. Also, I don't think you have to filter for English articles if you use the index.en.html page for scraping, because it shows English by default and, additionally, other languages where available.
Here's an example I quickly put together. It can certainly be optimized, but it shows how to load the pages with Selenium and extract the article URLs and article contents:
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]

driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')
I get the following output for the Press Releases page:
title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board
date: 20 December 2019
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...
title: Monetary policy decisions
date: 12 December 2019
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...
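For comparison, a minimal sketch of the first option (requests plus one of the lazy-loaded index_include pages, no Selenium). It reuses the span.doc-title selector from the example above and assumes the 2019 press-release index; repeat it per year and section as needed, and note the selectors may change if the ECB reworks its markup:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ecb.europa.eu'
index_url = f'{base_url}/press/pr/date/2019/html/index_include.en.html'

soup = BeautifulSoup(requests.get(index_url).content, 'html.parser')
for anchor in soup.select('span.doc-title > a[href]'):
    # each anchor points to one article page, which you can then fetch
    # and parse the same way as in the Selenium example above
    print(anchor.text, f'{base_url}{anchor["href"]}')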
I've already done some basic web scraping with BeautifulSoup. For my next project I've chosen to scrape the Facebook friend list of a specified user. The problem is, Facebook lets you see people's friend lists only if you are logged in. So my question is: can I somehow bypass that, and if not, can I make BeautifulSoup act as if it were logged in?
Here's my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = input("enter url: ")

try:
    page = urlopen(url)
except:
    print("Error opening the URL")

soup = BeautifulSoup(page, 'html.parser')
content = soup.find('div', {"class": "_3i9"})

friends = ''
for i in content.findAll('a'):
    friends = friends + ' ' + i.text
print(friends)
BeautifulSoup doesn't require that you use a URL. Instead:
Inspect the friends list.
Copy the parent tag containing the list into a new file (ParentTag.html).
Open the file and pass it to BeautifulSoup():
from bs4 import BeautifulSoup

with open("path/to/ParentTag.html", encoding="utf8") as html:
    soup = BeautifulSoup(html, "html.parser")
Then, "you make-a the soup-a."
The problem is, facebook lets you see friend lists of people only if you are logged in
You can overcome this using Selenium. You'll need it to authenticate yourself; then you can find the user. Once you have found them, you can proceed in two ways:
Get the HTML source with driver.page_source and from there use Beautiful Soup (a rough sketch of this is shown after this list).
Use the methods Selenium provides to scrape the friends directly.
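A rough sketch of the first option, assuming Selenium 4. The element ids and names ("email", "pass", "login") and the profile URL are assumptions about Facebook's login markup and may well have changed; the "_3i9" class comes from the question above:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# log in first (field ids/names are assumptions -- verify them with
# "Inspect element" on the login page)
driver.get("https://www.facebook.com/login")
driver.find_element(By.ID, "email").send_keys("you@example.com")
driver.find_element(By.ID, "pass").send_keys("your-password")
driver.find_element(By.NAME, "login").click()

# now load the friends page and hand the rendered HTML to Beautiful Soup
driver.get("https://www.facebook.com/some.user/friends")  # hypothetical profile URL
soup = BeautifulSoup(driver.page_source, "html.parser")
content = soup.find('div', {"class": "_3i9"})
if content is not None:
    print(' '.join(a.text for a in content.findAll('a')))
driver.quit()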
I'm completely new to scraping the web, but I really want to learn it in Python. I have a basic understanding of Python.
I'm having trouble understanding a piece of code that scrapes a webpage, because I can't find good documentation for the modules it uses.
The code scrapes some movie data from this webpage.
I get stuck after the comment "selection in pattern follows the rules of CSS".
I would like to understand the logic behind that code, or to find good documentation for those modules. Is there any prior topic I need to learn first?
The code is the following:
import requests
from pattern import web
from BeautifulSoup import BeautifulSoup

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url  # notice it constructs the full url for you

# selection in pattern follows the rules of CSS
dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
    title = movie.by_tag('a')[0].content
    genres = movie.by_tag('span.genre')[0].by_tag('a')
    genres = [g.content for g in genres]
    runtime = movie.by_tag('span.runtime')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, genres, runtime, rating
Here's the documentation for BeautifulSoup, which is an HTML and XML parser.
The comment
selection in pattern follows the rules of CSS
means that strings such as 'td.title' and 'span.runtime' are CSS selectors that help find the data you are looking for; for example, td.title matches <td> elements with the attribute class="title".
The code iterates through the HTML elements in the webpage body and extracts the title, genres, runtime, and rating using those CSS selectors.
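For comparison, here is a rough sketch of the same extraction done with BeautifulSoup's own CSS-selector support (soup.select) and Python 3, instead of pattern. The selectors mirror the ones in the code above and assume the old IMDb list markup, which has since changed:
import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
soup = BeautifulSoup(r.text, 'html.parser')

for movie in soup.select('td.title'):
    # the same CSS selectors as in the pattern version above
    title = movie.select_one('a').text
    genres = [g.text for g in movie.select('span.genre a')]
    runtime = movie.select_one('span.runtime').text
    rating = movie.select_one('span.value').text
    print(title, genres, runtime, rating)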