Error using BeautifulSoup - python

I want to extract some data from a website. I saved it as 'Webpage, HTML Only', in a file called soccerway.html on my Desktop.
Afterwards I wrote the following command using an IPython notebook:
from bs4 import BeautifulSoup
soup=BeautifulSoup(open("soccerway.html"))
I get the following error:
IOError: [Errno 2] No such file or directory: 'soccerway.html'
How can I solve this?

You don't need to manually save a page. Use urllib2 to get the html source you need:
from bs4 import BeautifulSoup
from urllib2 import urlopen
soup = BeautifulSoup(urlopen("http://my_site.com/mypage"))
Example:
>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://google.com'))
>>> soup('a')
[<a class="gb1" href="http://www.google.com/imghp?hl=en&tab=wi">Images</a>,
...
]

You can use this code:
from bs4 import BeautifulSoup
file = open("yourfile.html", "r")
soup = BeautifulSoup(file, "html.parser")

Related

Encoding issues request results

im learning python, and im trying to retrieve data from wikipedia, but is giving me encoding issues on special charecters of the links, text, etc:
My code:
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://pt.wikipedia.org/wiki/Jair_Bolsonaro")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
if 'href' in link.attrs:
print(link.attrs['href'])
result:
/wiki/Hamilton_Mour%C3%A3o
/wiki/Michel_Temer
/wiki/C%C3%A2mara_dos_Deputados_do_Brasil
...
Should be:
/wiki/Hamilton_Mourão
/wiki/Michel_Temer
/wiki/Câmara_dos_Deputados_do_Brasil
...
Solution:
import urllib.parse
And in print line changed to:
print(urllib.parse.unquote(link.attrs['href']))

I have BeautifulSoup4, from Anaconda, but I can't seem to utilize it, to save URL to TXT

I have this view in Anaconda.
However, I can't see to utilize BS in my script.
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
#import BeautifulSoup as bs
alphabets = string.ascii_lowercase
for i in alphabets:
#print(i)
html = urlopen("http://www.airlineupdate.com/content_public/codes/airportcodes/airports-by-iata/iata-" + i + ".htm")
print(html)
for j in html:
#soup = bs4(html, "html.parser")
soup = bs(html, "html.parser")
f = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
When I try to run the code above, I get the following error:
ModuleNotFoundError: No module named 'BeautifulSoup4'
Can someone enlighten me as to what's going on here?
from documentation its
from bs4 import BeautifulSoup
and based on your code, it seems like you want to use it as bs()
from bs4 import BeautifulSoup as bs

Having trouble in getting page source with beautifulsoup

I am trying to get the HTML source of a web page using beautifulsoup.
import bs4 as bs
import requests
import urllib.request
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source,'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"'
Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page)
print(soup.prettify())
Import everything you need correctly. Read this.

BeautifulSoup doesn't extract the table

import urllib.request as urllib2 #To query website
from bs4 import BeautifulSoup #To parse website
import pandas as pd
#specify the url and open
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = urllib2.urlopen(url3)
soup = BeautifulSoup(req,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
If you see the content of your requested data
content = req.readall()
as you examine the content:
print(content)
and surprisingly there is not table!!!
But if you check the page source you can see tables in it.
As I examined there should be some problem with urllib.request and there is some escape sequence on the page which causes that urllib get only part of that page.
So I could be able to fix the problem by using requests instead of urllib
first
pip install requests
Then change your code to this:
import requests
from bs4 import BeautifulSoup
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = requests.get(url3)
soup = BeautifulSoup(req.content,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)

BeautifulSoup(html) not working, saying can't call module?

import urllib2
import urllib
from BeautifulSoup import BeautifulSoup # html
from BeautifulSoup import BeautifulStoneSoup # xml
import BeautifulSoup # everything
import re
f = o.open( 'http://www.google.com', p)
html = f.read()
f.close()
soup = BeautifulSoup(html)
Getting an error saying the line with soup = BeautifulSoup(html) says 'module' object is not callable.
Your import BeautifulSoup makes BeautifulSoup refer to the module, not the class as it did after from BeautifulSoup import BeautifulSoup. If you're going to import the whole module, you might want to omit the from ... line or perhaps rename the class afterward:
from BeautifulSoup import BeautifulSoup
Soup = BeautifulSoup
...
import BeautifulSoup
....
soup = Soup(html)
#Blair's answer has the right slant but I'd perform some things slightly differently, i.e.:
import BeautifulSoup
Soup = BeautifulSoup.BeautifulSoup
(recommended), or
import BeautifulSoup
from BeautifulSoup import BeautifulSoup as Soup
(not bad either).
Install BeautifulSoup4
sudo easy_install BeautifulSoup4
Recommendation
from bs4 import BeautifulSoup

Categories

Resources