Decode a web page using the requests and BeautifulSoup packages - Python

I am trying a Python practice question. The question is: "Use the BeautifulSoup and requests Python packages to print out a list of all the article titles on the New York Times homepage."
Below is my solution, but it doesn't give any output. I am using Jupyter Notebook, and when I run the code below it does nothing. My kernel is working properly, so the problem must be in my code.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = 'https://www.nytimes.com/'
r = requests.get(base_url)
soup = BeautifulSoup(urlopen(base_url))
get_titles = soup.find_all(class_="css-1vctqli esl82me2")
print()
for title in get_titles:
    print(title.text)

Where did you get that class tag? It is not the right one.
You need to replace css-1vctqli esl82me2 with css-1j836f9 esl82me3:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.nytimes.com/'
r = requests.get(base_url)
# parse the page fetched with requests; the extra urlopen call was redundant
soup = BeautifulSoup(r.text, 'html.parser')
get_titles = soup.find_all(class_="css-1j836f9 esl82me3")
for title in get_titles:
    print(title.text)
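A word of caution: these obfuscated class names (css-1j836f9 and friends) are generated by the site's build and change regularly, so this selector will eventually break again. A more durable sketch, assuming the headlines are still rendered as heading tags (check the live page in your browser's inspector to confirm):

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.nytimes.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
# Assumption: article titles sit in <h3> elements; adjust after inspecting the page
for title in soup.find_all('h3'):
    print(title.get_text(strip=True))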
And the output:


Weird text indentation when web scraping with BeautifulSoup4 in Python

I'm trying to web scrape GitHub.
This is the code:
import requests as req
from bs4 import BeautifulSoup

urls = [
    "https://github.com/moom825/Discord-RAT",
    "https://github.com/freyacodes/Lavalink",
    "https://github.com/KagChi/lavalink-railways",
    "https://github.com/KagChi/lavalink-repl",
    "https://github.com/Devoxin/Lavalink.py",
    "https://github.com/karyeet/heroku-lavalink"]

r = req.get(urls[0])
soup = BeautifulSoup(r.content, "lxml")
title = str(soup.find("p", attrs={"class": "f4 mt-3"}).text)
print(title)
When I run the program I don't get any errors, but the indentation of the printed text is very weird.
Can anyone please help me with this problem?
I'm using Replit.
GitHub has a really good API, so you may not need to scrape the HTML at all.
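For example, a minimal sketch using the public REST API (unauthenticated calls are rate-limited, but that's fine for a handful of repos); the repository description comes back as plain JSON, so there is no HTML to parse and no stray indentation:

import requests

url = "https://github.com/freyacodes/Lavalink"
owner_repo = url.split("github.com/")[1]           # -> "freyacodes/Lavalink"
api = "https://api.github.com/repos/" + owner_repo
print(requests.get(api).json()["description"])     # already clean text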
You can use .strip() after .text; it will remove the surrounding whitespace.
import requests as req
from bs4 import BeautifulSoup

urls = [
    "https://github.com/moom825/Discord-RAT",
    "https://github.com/freyacodes/Lavalink",
    "https://github.com/KagChi/lavalink-railways",
    "https://github.com/KagChi/lavalink-repl",
    "https://github.com/Devoxin/Lavalink.py",
    "https://github.com/karyeet/heroku-lavalink"]

r = req.get(urls[0])
soup = BeautifulSoup(r.content, "lxml")
# .text already returns a str, so the str() wrapper is unnecessary
title = soup.find("p", attrs={"class": "f4 mt-3"}).text.strip()
print(title)
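Equivalently, BeautifulSoup's get_text(strip=True) fetches and strips in one call, as a drop-in replacement for the title line above:

title = soup.find("p", attrs={"class": "f4 mt-3"}).get_text(strip=True)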

Scraping with Beautiful Soup - getting an empty set

I'm using the following code but still keep getting an empty set. Any ideas?
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests, time, os, html5lib

base_site = "https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/14"
response = requests.get(base_site)
soup = bs(response.text, "html.parser")
soup

# Find all final-score sections on the page
game = soup.find_all("section", class_="sb-score.final")
game
Here is what I'm seeing on the site:
The most likely issue is that sb-score and final are two separate classes, not one class named sb-score.final. You could try:
find_all('section', class_=['sb-score', 'final'])
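One caveat worth knowing: passing a list to class_ matches elements that carry any of the listed classes. If you need elements that carry both classes at once, a CSS selector is the right tool:

# matches <section> elements that have BOTH classes
games = soup.select("section.sb-score.final")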

I have BeautifulSoup4 from Anaconda, but I can't seem to utilize it to save URLs to a TXT file

I have this view in Anaconda.
However, I can't seem to utilize BeautifulSoup in my script.
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
#import BeautifulSoup as bs

alphabets = string.ascii_lowercase
for i in alphabets:
    #print(i)
    html = urlopen("http://www.airlineupdate.com/content_public/codes/airportcodes/airports-by-iata/iata-" + i + ".htm")
    print(html)
    for j in html:
        #soup = bs4(html, "html.parser")
        soup = bs(html, "html.parser")
        f = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
When I try to run the code above, I get the following error:
ModuleNotFoundError: No module named 'BeautifulSoup4'
Can someone enlighten me as to what's going on here?
From the documentation, it's:
from bs4 import BeautifulSoup
and based on your code, it seems like you want to use it as bs():
from bs4 import BeautifulSoup as bs
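With the import fixed, here is a minimal sketch of the rest of the script. The original extraction step was never shown, so the find_all("a") part is an assumption standing in for whatever you actually want to save; note that the output file is opened once, outside the loop, so later pages don't overwrite earlier ones:

import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

alphabets = string.ascii_lowercase
# Open the file once; opening it with 'w' inside the loop would clobber previous writes
with open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w') as f:
    for i in alphabets:
        html = urlopen("http://www.airlineupdate.com/content_public/codes/airportcodes/airports-by-iata/iata-" + i + ".htm")
        soup = bs(html, "html.parser")
        for a in soup.find_all("a", href=True):   # assumption: you want the links on each page
            f.write(a["href"] + "\n")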

Having trouble getting page source with BeautifulSoup

I am trying to get the HTML source of a web page using BeautifulSoup.
import bs4 as bs
import requests
import urllib.request

sourceUrl = 'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source = urllib.request.urlopen(sourceUrl).read()
soup = bs.BeautifulSoup(source, 'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjwhLS0gR2VuZXJhdG9yOiBBZG9iZSBJbGx1c3RyYXRvciAxNi4wLjAsIFNWRyBFeHBvcnQgUGx1Zy1JbiAuIFNWRyBWZXJzaW9uOiA2LjAwIEJ1aWxkIDApICAtLT4NCjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+DQo8c3ZnIHZlcnNpb249IjEuMSIgaWQ9IkxheWVyXzEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgeG1sbnM6eGxpbms9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGxpbmsiIHg9IjBweCIgeT0iMHB4Ig0KCSB3aWR0aD0iMjQwcHgiIGhlaWdodD0iNjBweCIgdmlld0JveD0iMCAwIDI0MCA2MCIgZW5hYmxlLWJhY2tncm91bmQ9Im5ldyAwIDAgMjQwIDYwIiB4bWw6c3BhY2U9InByZXNlcnZlIj4NCjxwYXRoIGZpbGw9IiNGRkZGRkYiIGQ9Ik02LjkwMiwyMy4yODZDMzQuNzc3LDIwLjI2Miw1Ny4yNC'
Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup

url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")  # always name a parser explicitly to avoid the guessed-parser warning
print(soup.prettify())
Import everything you need correctly. Read this.
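Worth noting: the ps.store("siteSettings", ...) blob in your output is the preload store of a Discourse forum, which builds its pages with JavaScript, so urlopen only ever receives the loader page, not the rendered thread. If the site is indeed Discourse (an assumption based on that output), every topic is also served as plain JSON by appending .json to the topic URL; a hedged sketch:

import requests

# Assumption: the forum runs Discourse, which exposes topics as JSON
url = "https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115.json"
data = requests.get(url).json()
for post in data["post_stream"]["posts"]:
    print(post["username"], post["cooked"][:80])   # "cooked" is each post's rendered HTML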

BeautifulSoup doesn't extract the table

import urllib.request as urllib2  # to query the website
from bs4 import BeautifulSoup     # to parse the website
import pandas as pd

# specify the url and open it
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = urllib2.urlopen(url3)
soup = BeautifulSoup(req, "html5lib")
all_tables = soup.find_all('table')
print(all_tables)
If you look at the content of the requested data:
content = req.read()
and examine it:
print(content)
you will find, surprisingly, that there is no table in it!
But if you check the page source, the tables are there.
As far as I can tell there is a problem with urllib.request: some escape sequence on the page causes urllib to fetch only part of it.
I was able to fix the problem by using requests instead of urllib.
First:
pip install requests
Then change your code to this:
import requests
from bs4 import BeautifulSoup

url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = requests.get(url3)
soup = BeautifulSoup(req.content, "html5lib")
all_tables = soup.find_all('table')
print(all_tables)
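Since the question already imports pandas, pd.read_html is also worth a mention: it parses every <table> on the page straight into DataFrames (it needs lxml or html5lib installed under the hood). A minimal sketch:

import requests
import pandas as pd

url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
html = requests.get(url3).text
tables = pd.read_html(html)   # one DataFrame per <table> element
print(len(tables))
print(tables[0].head())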
