Having trouble getting page source with BeautifulSoup - python

I am trying to get the HTML source of a web page using beautifulsoup.
import bs4 as bs
import requests
import urllib.request
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source,'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjwhLS0gR2VuZXJhdG9yOiBBZG9iZSBJbGx1c3RyYXRvciAxNi4wLjAsIFNWRyBFeHBvcnQgUGx1Zy1JbiAuIFNWRyBWZXJzaW9uOiA2LjAwIEJ1aWxkIDApICAtLT4NCjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+DQo8c3ZnIHZlcnNpb249IjEuMSIgaWQ9IkxheWVyXzEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgeG1sbnM6eGxpbms9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGxpbmsiIHg9IjBweCIgeT0iMHB4Ig0KCSB3aWR0aD0iMjQwcHgiIGhlaWdodD0iNjBweCIgdmlld0JveD0iMCAwIDI0MCA2MCIgZW5hYmxlLWJhY2tncm91bmQ9Im5ldyAwIDAgMjQwIDYwIiB4bWw6c3BhY2U9InByZXNlcnZlIj4NCjxwYXRoIGZpbGw9IiNGRkZGRkYiIGQ9Ik02LjkwMiwyMy4yODZDMzQuNzc3LDIwLjI2Miw1Ny4yNC'

Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
Make sure you import everything you need correctly.
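If the printed result is dominated by JavaScript, like the ps.store(...) call in the question, one option is to drop the script and style elements before looking at the rest of the markup. This is a minimal sketch under that assumption; the helper name visible_text and the sample markup are hypothetical and not taken from the PakWheels page.

```python
from bs4 import BeautifulSoup


def visible_text(markup):
    """Return the page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(markup, "html.parser")
    # decompose() removes the tag and everything inside it from the tree
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments with single spaces
    return soup.get_text(separator=" ", strip=True)


# Hypothetical markup standing in for a downloaded page
sample = """
<html><head><script>ps.store("siteSettings", {"title": "Forum"});</script></head>
<body><h1>Trip report</h1><p>Karachi to Lahore by road.</p></body></html>
"""

print(visible_text(sample))
```

If the text you are after is missing from the result, the page probably builds it in the browser, and the static HTML will not contain it.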

Related

Decode a web page using request and BeautifulSoup package

I am trying a practice question of python. The question is "Use the BeautifulSoup and requests Python packages to print out a list of all the article titles on the New York Times homepage."
Below is my solution but it doesn't give any output. I am using Jupyter Notebook and when I run the below code it does nothing. My kernel is also working properly which means I have a problem with my code.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url= 'https://www.nytimes.com/'
r=requests.get(base_url)
soup=BeautifulSoup(urlopen(base_url))
get_titles=soup.find_all(class_="css-1vctqli esl82me2" )
print()
for title in get_titles:
    print(title.text)
Where did you get that class name? It is not the right one.
You need to replace css-1vctqli esl82me2 with css-1j836f9 esl82me3:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.nytimes.com/'
# Fetch the page once with requests and parse the response body
r = requests.get(base_url)
soup = BeautifulSoup(r.content, 'html.parser')
get_titles = soup.find_all(class_="css-1j836f9 esl82me3")
for title in get_titles:
    print(title.text)
And the output :
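Class names such as css-1j836f9 esl82me3 look auto-generated and may change whenever the site is redeployed, so a selector based on element type might last longer. The sketch below assumes the titles sit inside heading tags, which is an assumption about the page and not something the answer states; heading_texts and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def heading_texts(markup, tags=("h2", "h3")):
    """Collect the stripped text of every heading tag listed in `tags`."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all accepts a list of tag names and returns matches in document order
    return [h.get_text(strip=True) for h in soup.find_all(list(tags))]


# Hypothetical markup; the real homepage structure may differ
sample = """
<section>
  <h2 class="css-abc123">First headline</h2>
  <p>Summary text</p>
  <h3 class="css-def456">Second headline</h3>
</section>
"""

for title in heading_texts(sample):
    print(title)
```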

DB and BeautifulSoup

I would like to send a search request to the DB (Deutsche Bahn) website, fetch the source code of the result page, and read values out of it later. However, with BeautifulSoup I get different source code than the browser shows. Does anyone have an idea why this happens?
import bs4
import urllib.request
page = urllib.request.urlopen("http://reiseauskunft.bahn.de/bin/query.exe/dox?S=8000096&Z=8079090&start=1")
soup = bs4.BeautifulSoup(page, "html.parser")
print(soup)
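One possible reason for the difference (an assumption; nothing in the thread confirms it) is that the server varies its response with the request headers, and urllib identifies itself as Python-urllib by default. Another is that the browser runs JavaScript that changes the page after loading. The sketch below covers only the first case by building a request with a browser-like User-Agent; build_request, fetch, and the header value are illustrative names and values.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes (performs a network call)."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read()


req = build_request(
    "http://reiseauskunft.bahn.de/bin/query.exe/dox?S=8000096&Z=8079090&start=1"
)
# urllib normalizes header names with str.capitalize()
print(req.get_header("User-agent"))
```

The bytes returned by fetch(url) can then be passed to bs4.BeautifulSoup(..., "html.parser") exactly as in the question.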

Crawl site with infinite scrolling

I'm trying to crawl Flipkart, but Flipkart does not load its whole page at once, so I'm not able to crawl it. Please help.
from bs4 import BeautifulSoup
import requests
import re
import MySQLdb
import urllib2
import urllib
url = "https://www.flipkart.com/offers-list/weekend-specials?screen=dynamic&pk=contentTheme%3DLS-Nov-Weekend_widgetType%3DdealCard&wid=4.dealCard.OMU&otracker=hp_omu_Weekend+Specials_1"
r = requests.get(url)
soup = BeautifulSoup(r.content,"html.parser")
name=soup.find_all("div",{"class":"iUmrbN"})
for i in name:
    print i.text
This is not giving any output.
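requests returns only the HTML the server sends; anything the page adds later through JavaScript, such as items loaded while scrolling, never reaches BeautifulSoup. A quick way to check for this is to count the matching elements in the downloaded markup. This is a sketch assuming the class name from the question is still current; count_by_class and both sample strings are hypothetical.

```python
from bs4 import BeautifulSoup


def count_by_class(markup, class_name, tag="div"):
    """Count `tag` elements carrying `class_name` in static markup."""
    soup = BeautifulSoup(markup, "html.parser")
    return len(soup.find_all(tag, class_=class_name))


# Hypothetical responses: one server-rendered, one filled in by JavaScript
server_rendered = '<div class="iUmrbN">Deal A</div><div class="iUmrbN">Deal B</div>'
script_rendered = '<div id="container"></div><script src="app.js"></script>'

print(count_by_class(server_rendered, "iUmrbN"))
print(count_by_class(script_rendered, "iUmrbN"))
```

A count of zero on the real response would suggest that the data arrives later, in which case a browser automation tool, or the data requests the page itself makes, would likely be needed.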

BeautifulSoup doesn't extract the table

import urllib.request as urllib2 #To query website
from bs4 import BeautifulSoup #To parse website
import pandas as pd
#specify the url and open
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = urllib2.urlopen(url3)
soup = BeautifulSoup(req,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
If you look at the raw content of your response:
content = req.read()
print(content)
you will find, surprisingly, that it contains no table, even though the tables are visible when you view the page source in a browser.
It appears that urllib.request has a problem with this page: there seems to be an escape sequence in it that causes urllib to retrieve only part of the page.
I was able to fix the problem by using requests instead of urllib. First install it:
pip install requests
Then change your code to this:
import requests
from bs4 import BeautifulSoup
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = requests.get(url3)
soup = BeautifulSoup(req.content,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
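Once find_all('table') returns something, the next step is usually to turn the rows into plain Python lists. A minimal sketch of that step follows; table_rows and the sample table are hypothetical, and it uses the built-in html.parser instead of html5lib only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup


def table_rows(markup):
    """Return each table row as a list of stripped cell strings."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both kept
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical table; the real results page will have different columns
sample = """
<table>
  <tr><th>Match</th><th>Winner</th></tr>
  <tr><td>Team A v Team B</td><td>Team A</td></tr>
</table>
"""

for row in table_rows(sample):
    print(row)
```

A list of rows in this shape can be handed to pandas.DataFrame, with the first row used as the column names.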

Crawl a news website and getting the news content

I'm trying to download the text from a news website. The HTML is:
<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
<div class="field-item odd">
<p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>
The output should be: My Text
I'm using the following python code:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)
But the output of the code is: "None". Do you know what is wrong with my code??
The problem is that you are not parsing the HTML, you are parsing the URL string:
html = "My URL"
parsed_html = BeautifulSoup(html)
Instead, you need to download the source first. For example, in Python 2:
from urllib2 import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
In Python 3, it would be:
from urllib.request import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
Or, you can use the third-party "for humans"-style requests library:
import requests
html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)
Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
with just:
from bs4 import BeautifulSoup
BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.
Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Get the HTML (Python 3; in Python 2 use urllib2.urlopen)
conn = urlopen("http://www.example.com")
html = conn.read()
# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html, "html.parser")
From here, just parse as you attempted previously.
p = soup.find("div", attrs={'class':'pane-content'})
print(p)
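To get from the matched div to the text itself, the lookup can be wrapped in a small function that also handles the case where nothing matches. This is a sketch only: pane_text is a hypothetical helper, and the sample is a simplified, well-formed stand-in for the markup in the question.

```python
from bs4 import BeautifulSoup


def pane_text(markup):
    """Return the text of the first <p> inside div.pane-content, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    pane = soup.find("div", attrs={"class": "pane-content"})
    if pane is None:
        return None
    paragraph = pane.find("p")
    return paragraph.get_text(strip=True) if paragraph else None


# Simplified, well-formed stand-in for the markup in the question
sample = """
<div class="pane-content">
  <div class="field-items">
    <div class="field-item odd"><p>My Text</p></div>
  </div>
</div>
"""

print(pane_text(sample))
# Passing a URL string instead of markup finds nothing, as in the question
print(pane_text("My URL"))
```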
