I just bought a book to show me how to scrape websites, but the very first example is not working for me. Now I am a little upset that I bought the book in the first place, but I would like to try to get it going.
In Python 3.5, my code is:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
BsObj = BeautifulSoup(html.read())
print(bsObj.h1)*
Here is the error I am getting:
Traceback (most recent call last):
  File "C:/Users/MyName/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/bs4/test.py", line 5, in <module>
    BsObj = BeautifulSoup(html.read())
  File "C:\Users\MyName\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\__init__.py", line 153, in __init__
    builder = builder_class()
  File "C:\Users\MyName\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\builder\_htmlparser.py", line 39, in __init__
    return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'strict'
Any ideas would be super helpful.
Thanks in advance
I guess you transcribed the code from the book. bsObj is not named consistently (you assign BsObj but print bsObj), and there is an unnecessary * at the end of the print line. It will work after you change those two things.
Also note that read() is not needed, and that it's better to name the parser explicitly; otherwise you will get a warning that no parser was specified.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bsObj = BeautifulSoup(html, 'html.parser')
print(bsObj.h1)
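If everything is set up correctly, this prints the page's lone h1 element, something like <h1>An Interesting Title</h1> (the exact text depends on the live page).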
Hey, you just had some typos: it should be BsObj, not bsObj, in the print line, and the trailing * has to go.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
BsObj = BeautifulSoup(html.read())
print(BsObj.h1)
I want to scrape the title attributes of all a tags in the "New texts" section at this website:
I tried to do it this way:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests

url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)
Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div", class_="enws-mainpage-widget-content").find_all('a')
for ebook in List:
    print(List.get('title'))
When I run this I get this error:
File "C:\Users\Özdal\AppData\Local\Programs\Python\Python38-32\lib\site-packages\bs4\element.py", line 2173, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
What is going wrong here?
In your for loop, you try to grab the title from the List object, not from each ebook. That is why the error occurred. Change your print line to print(ebook.get('title')) and you will get the results.
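To see the difference between a single Tag and a ResultSet, here is a minimal self-contained sketch (the tiny HTML string is made up for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a title="One">x</a><a title="Two">y</a>', "html.parser")

links = soup.find_all('a')    # a ResultSet: a list of Tag objects
print(links[0].get('title'))  # One  - call .get() on a single Tag
for a in links:
    print(a.get('title'))     # One, Two - or on each Tag in the loop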
To get only the titles from the "New texts" section you have to be more specific, otherwise you grab all a tags, including authors and so on. You can fix this, for example, with .select("b > i > a").
Example:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests
url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)
Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div", {"id": "enws-mainpage-newtexts-content"}).select("b > i > a")
for ebook in List:
    print(ebook.get('title'))
Output
The Center of the Web
Bobby Bumps Starts a Lodge
May (Mácha)
Animal Life and the World of Nature/1903/06/Notes and Comments
The Czechoslovak Review/Volume 2/No Compromise
She's All the World to Me
Their One Love
I want to write the code as shown below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
But whenever I type the line bsObj = BeautifulSoup(html), it throws an error, as in the following picture:
I hope someone can help me out with this.
Thanks
html = urlopen("http://www.your.url/here")
Instead of
html = urlopen(("http://www.your.url/here")
Notice the (( to the right of urlopen.
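For reference, here is the snippet from the question with balanced parentheses (a sketch; passing "html.parser" is an assumption, added only to silence the no-parser warning):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Balanced parentheses around the URL argument
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# Print every sibling row after the first row of the gift table
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)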
I'm trying to build a pretty simple scraper to harvest links as part of a crawler project. I've set up the following function to do the scraping:
import requests as rq
from bs4 import BeautifulSoup
def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    text = response.text
    soup = BeautifulSoup(text)
    for a in soup.findAll('a'):
        homepageLinks.append(a['href'])
    return homepageLinks
I saved this file as "scraper2.py". When I try to run the code, I get the following error:
>>> import scraper2 as sc
>>> sc.getHomepageLinks('http://washingtonpost.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "scraper2.py", line 9, in getHomepageLinks
for a in soup.findAll('a'):
TypeError: 'NoneType' object is not callable
Now for the odd part: If I try to debug the code and just print the response, it works fine:
>>> response = rq.get('http://washingtonpost.com')
>>> text = response.text
>>> soup = BeautifulSoup(text)
>>> for a in soup.findAll('a'):
... print(a['href'])
...
https://www.washingtonpost.com
#
#
http://www.washingtonpost.com/politics/
https://www.washingtonpost.com/opinions/
http://www.washingtonpost.com/sports/
http://www.washingtonpost.com/local/
http://www.washingtonpost.com/national/
http://www.washingtonpost.com/world/
...
If I'm reading the error messages correctly, the problem occurs with soup.findAll, but only when findAll is called inside a function. I'm sure I'm spelling it correctly (not findall or Findall, as in many of the errors on here), and I've tried a fix using lxml suggested in a previous post, which didn't help. Does anyone have any ideas?
Try replacing your for loop with the following:

for a in soup.findAll('a'):
    url = a.get("href")
    if url is not None:
        homepageLinks.append(url)
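Putting it together, the whole function might look like this (a sketch; the explicit "html.parser" argument is an assumption, any installed parser would do):

import requests as rq
from bs4 import BeautifulSoup

def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly
    for a in soup.findAll('a'):
        url = a.get("href")  # .get() returns None instead of raising KeyError
        if url is not None:
            homepageLinks.append(url)
    return homepageLinks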
Code:
import requests
import urllib
from bs4 import BeautifulSoup
page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>
I think the problem lies mainly with the urllib package. Here I am using the Python 3 urllib package; they changed the urlopen syntax from version 2 to 3, which may be the cause of the error. That being said, I have used the latest syntax only.
Python version 3.4
Since you are importing requests, you can use it instead of urllib, like this:
import requests
from bs4 import BeautifulSoup
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text)
print(soup.get_text())
print(soup.prettify())
Your problem is that Python cannot encode the characters from the page that you are scraping when printing them to the console. For some more information see here: https://stackoverflow.com/a/16347188/2638310
Since the Wikipedia page is in UTF-8, it seems that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument in your code. Note that from_encoding only applies to byte input, so pass the raw bytes (page1.content) rather than the already-decoded page1.text:
soup = BeautifulSoup(page1.content, from_encoding="UTF-8")
For more on encodings in BeautifulSoup have a look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
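To check which encoding BeautifulSoup actually detected, you can inspect the parsed document's original_encoding attribute (a small sketch under the same setup; "html.parser" is an assumption):

import requests
from bs4 import BeautifulSoup

page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
# from_encoding is only consulted for byte input, hence page1.content
soup = BeautifulSoup(page1.content, "html.parser", from_encoding="UTF-8")
print(soup.original_encoding)  # the encoding BeautifulSoup settled on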
I am using Python 2.7, so I don't have the request module inside the urllib package.
#!/usr/bin/python3
# coding: utf-8
import requests
from bs4 import BeautifulSoup
URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())
See PEP 263 for the source encoding declaration used above: https://www.python.org/dev/peps/pep-0263/
Put those print lines inside a try/except block, so that if there is an unencodable character you won't get an error.
try:
    print(soup.get_text())
    print(soup.prettify())
except Exception:
    print(str(soup.get_text().encode("utf-8")))
    print(str(soup.prettify().encode("utf-8")))
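A lighter-weight variant, assuming the real culprit is a Windows console using a narrow code page such as cp1252 (an assumption; check sys.stdout.encoding), is to replace unencodable characters instead of printing a bytes repr:

import sys

enc = sys.stdout.encoding or "cp1252"  # console code page; cp1252 is an assumed fallback
text = soup.get_text()
# Replace characters the console cannot display instead of crashing
print(text.encode(enc, errors="replace").decode(enc))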
Here is my code using find_all; it works great with .find(), but find_all fails:
import requests
from BeautifulSoup import BeautifulSoup
r = requests.get(URL_DEFINED)
print r.status_code
soup = BeautifulSoup(r.text)
print soup.find_all('ul')
This is what I got:
Traceback (most recent call last):
File "scraper.py", line 19, in <module>
print soup.find_all('ul')
TypeError: 'NoneType' object is not callable
It looks like you're using BeautifulSoup version 3, which used a slightly different naming convention, e.g. .findAll, while BeautifulSoup 4 standardised naming to be more PEP 8 like, e.g. .find_all (but keeps the older names for backwards compatibility). Note that calling the soup directly, as in soup('ul'), is equivalent to a find-all in both versions.
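A minimal sketch of that equivalence under BeautifulSoup 4 (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li></ul><ul><li>b</li></ul>", "html.parser")

# All three spellings return the same list of <ul> tags in bs4
assert soup.find_all('ul') == soup.findAll('ul') == soup('ul')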
To download and install, use pip install beautifulsoup4.
Then change your import to be:
from bs4 import BeautifulSoup
Then you're good to go.
Download BS4 from here: http://www.crummy.com/software/BeautifulSoup/#Download
Install it and import it at the beginning of your code like this:
import requests
from bs4 import BeautifulSoup
r = requests.get(URL_DEFINED)
print r.status_code
soup = BeautifulSoup(r.text)
print soup.find_all('ul')