I'm learning Python and trying to retrieve data from Wikipedia, but I'm getting encoding issues with special characters in the links, text, etc.:
My code:
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://pt.wikipedia.org/wiki/Jair_Bolsonaro")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Result:
/wiki/Hamilton_Mour%C3%A3o
/wiki/Michel_Temer
/wiki/C%C3%A2mara_dos_Deputados_do_Brasil
...
Should be:
/wiki/Hamilton_Mourão
/wiki/Michel_Temer
/wiki/Câmara_dos_Deputados_do_Brasil
...
Solution:
import urllib.parse
Then change the print line to:
print(urllib.parse.unquote(link.attrs['href']))
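As a minimal sketch of why this works: the hrefs are percent-encoded UTF-8, and urllib.parse.unquote decodes them back to readable text. The sample paths below are copied from the output above; quote is included only to show the reverse direction.

```python
from urllib.parse import quote, unquote

# Percent-encoded hrefs, copied from the output shown above.
encoded_links = [
    "/wiki/Hamilton_Mour%C3%A3o",
    "/wiki/Michel_Temer",
    "/wiki/C%C3%A2mara_dos_Deputados_do_Brasil",
]

# unquote() decodes %XX escapes as UTF-8 by default.
decoded_links = [unquote(link) for link in encoded_links]

for link in decoded_links:
    print(link)

# quote() goes the other way; "/" is left unescaped by default.
assert quote(decoded_links[0]) == encoded_links[0]
```

Note that the encoded form is still what you should pass to urlopen if you follow these links; the decoded form is for display.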
I have this view in Anaconda.
However, I can't seem to use BeautifulSoup in my script.
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
#import BeautifulSoup as bs
alphabets = string.ascii_lowercase
for i in alphabets:
    #print(i)
    html = urlopen("http://www.airlineupdate.com/content_public/codes/airportcodes/airports-by-iata/iata-" + i + ".htm")
    print(html)
    for j in html:
        #soup = bs4(html, "html.parser")
        soup = bs(html, "html.parser")
f = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
When I try to run the code above, I get the following error:
ModuleNotFoundError: No module named 'BeautifulSoup4'
Can someone enlighten me as to what's going on here?
According to the documentation, the import is:
from bs4 import BeautifulSoup
Based on your code, it looks like you want to call it as bs(), so use:
from bs4 import BeautifulSoup as bs
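The confusion behind this error is that the package is installed under the name beautifulsoup4 but imported as bs4. The following is a small sketch, using only the standard library, for checking which names the import system can actually find in the current environment. The helper name is_importable is made up for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the import name; "BeautifulSoup4" is not a module name,
# so it should be reported as missing even when the package is installed.
for name in ("bs4", "BeautifulSoup4"):
    print(name, "->", is_importable(name))
```

If bs4 is reported as missing inside Anaconda, the script is probably running under a different interpreter than the one the package was installed into.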
I am trying to get the HTML source of a web page using beautifulsoup.
import bs4 as bs
import requests
import urllib.request
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source,'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"'
Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
Import everything you need correctly. Read this.
import urllib.request as urllib2 #To query website
from bs4 import BeautifulSoup #To parse website
import pandas as pd
#specify the url and open
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = urllib2.urlopen(url3)
soup = BeautifulSoup(req,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
If you inspect the content of the response:
content = req.read()
and print it:
print(content)
you will find, surprisingly, that there is no table in it.
But if you check the page source in a browser, you can see the tables.
From what I could tell, the problem is with urllib.request: some escape sequence on the page seems to cause urllib to receive only part of the page.
I was able to fix the problem by using requests instead of urllib.
First, install it:
pip install requests
Then change your code to this:
import requests
from bs4 import BeautifulSoup
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = requests.get(url3)
soup = BeautifulSoup(req.content,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
I am trying to get the text of each td with class 'column-1', but I get an error saying it has no attribute text. It clearly does, so I must be doing something wrong. Here is the code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl="http://vermontamerican.com/products/standard-drill-bit-extensions/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
for part in soup.find_all('td'),{"class":"column-1"}:
    part1 = part.text
    print(part1)
If I take line 2 out and just print "part" above I get a result but it is giving all td not just column-1.
I have also tried this but I am new so I am sure this is wrong in more ways than one.
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl="http://vermontamerican.com/products/standard-drill-bit-extensions/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
for part in soup.find('tbody'),{"class":"row-hover"}:
    for part1 in part.find_all('a'):
        print(part1)
You are not passing the attribute selection dictionary into the find_all() function. Replace:
for part in soup.find_all('td'),{"class":"column-1"}:
with:
for part in soup.find_all('td', {"class":"column-1"}):
Now your code would produce:
17103
17104
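To see the difference without hitting the network, here is a small sketch run against an inline HTML snippet. The markup is invented for illustration; only the class name column-1 and the two values come from the question and the output above.

```python
from bs4 import BeautifulSoup

# Invented sample markup; only the class name and values mirror the question.
html = """
<table>
  <tr><td class="column-1">17103</td><td class="column-2">6 in.</td></tr>
  <tr><td class="column-1">17104</td><td class="column-2">12 in.</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Misplaced parenthesis: this builds a 2-tuple of (all <td> results, dict),
# so a loop over it visits a result list and then a plain dict,
# neither of which is a single tag.
wrong = (soup.find_all("td"), {"class": "column-1"})
print(type(wrong[1]))

# Correct: the dict is passed to find_all() as the attribute filter.
parts = [td.text for td in soup.find_all("td", {"class": "column-1"})]
print(parts)

# Equivalent keyword form.
same = [td.text for td in soup.find_all("td", class_="column-1")]
```

The tuple form also explains why printing part without the .text line showed every td: the first item of the tuple is the unfiltered result of find_all('td').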
I want to extract some data from a website. I saved it as 'Webpage, HTML Only', in a file called soccerway.html on my Desktop.
Afterwards I wrote the following command using an IPython notebook:
from bs4 import BeautifulSoup
soup=BeautifulSoup(open("soccerway.html"))
I get the following error:
IOError: [Errno 2] No such file or directory: 'soccerway.html'
How can I solve this?
You don't need to manually save a page. Use urllib2 (urllib.request in Python 3) to get the HTML source you need:
from bs4 import BeautifulSoup
from urllib2 import urlopen
soup = BeautifulSoup(urlopen("http://my_site.com/mypage"))
Example:
>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://google.com'))
>>> soup('a')
[<a class="gb1" href="http://www.google.com/imghp?hl=en&tab=wi">Images</a>,
...
]
You can use this code:
from bs4 import BeautifulSoup
file = open("yourfile.html", "r")
soup = BeautifulSoup(file, "html.parser")
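If you do want to parse the saved file, the IOError in the question usually means the notebook's working directory is not the folder the file was saved in, so a relative name like soccerway.html is not found. A hedged sketch that builds an absolute path and checks it before reading follows. The helper name read_html_file is made up, and the Desktop location is an assumption based on the question.

```python
from pathlib import Path


def read_html_file(path):
    """Return the text of an HTML file, with a clearer error if it is missing."""
    path = Path(path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found (current directory is {Path.cwd()})"
        )
    # Assumes the file is UTF-8; adjust the encoding if the page differs.
    return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    # Assumed location, based on the question saying the file is on the Desktop.
    html_path = Path.home() / "Desktop" / "soccerway.html"
    try:
        html = read_html_file(html_path)
        print(len(html), "characters read")
    except FileNotFoundError as err:
        print(err)
```

The returned string can then be passed to BeautifulSoup(html, "html.parser") in the same way as the open file in the answer above.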