While trying to scrape the YouTube homepage for the title of each video, I'm running this code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com'
html = requests.get(url)
soup = BeautifulSoup(html.content, "html.parser")
print(soup('a'))
and it's returning this error:
Traceback (most recent call last):
  File "C:\Users\kenda\OneDrive\Desktop\Projects\youtube.py", line 7, in <module>
    print(soup('a'))
  File "C:\Users\kenda\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f384' in position 45442: character maps to <undefined>
[Finished in 4.83s]
How do I fix this? And why does it happen specifically when I'm scraping YouTube?
urllib is a good alternative here and is comfortable to use.
from urllib.request import urlopen
from bs4 import BeautifulSoup
The urlopen function fetches the URL and returns a response containing the HTML:
url = 'https://www.youtube.com'
html = urlopen(url)
BeautifulSoup will parse the HTML:
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('a'))
If you absolutely want to do it with requests, the solution is:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com'
resp = requests.get(url)
html = resp.text
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('a'))
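Note that the UnicodeEncodeError in the question comes from print() itself, not from requests or urllib: the console uses a narrow codec (cp1252 on many Windows setups) that has no mapping for characters like the '\U0001f384' emoji in YouTube titles. A minimal sketch of one workaround, re-encoding with errors="replace" so unencodable characters become '?':

```python
# A narrow console codec (e.g. cp1252) cannot represent every character in
# scraped HTML; encoding with errors="replace" substitutes '?' for those.
text = "Holiday videos \U0001f384"  # sample string containing an emoji
safe = text.encode("cp1252", errors="replace").decode("cp1252")
print(safe)  # prints: Holiday videos ?
```

This also explains why the error shows up specifically for YouTube: its homepage happens to contain characters outside cp1252, while plainer pages do not.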
Below is what I have tried:
import urllib.request
from bs4 import BeautifulSoup
url = "https://search.naver.com/search.naver?query=%ED%8C%8C%EC%9D%B4%EC%8D%AC&nso=&where=blog&sm=tab_viw.all"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
title = soup.find_all(class_="api_txt_lines")
for i in title:
    print(i.attrs["title"])
The error is as below:
Traceback (most recent call last):
  File "C:/Users/Home/PycharmProjects/pycharm_test/crawler/pic.py", line 11, in <module>
    print(i.attrs["title"])
KeyError: 'title'
Process finished with exit code 1
There is no title attribute in your <a> tags; if you want to print the text, use print(i.get_text()):
import urllib.request
from bs4 import BeautifulSoup
url = "https://search.naver.com/search.naver?query=%ED%8C%8C%EC%9D%B4%EC%8D%AC&nso=&where=blog&sm=tab_viw.all"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
title = soup.find_all(class_="api_txt_lines")
for i in title:
    print(i.get_text())
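More generally, dict-style access like i.attrs["title"] raises KeyError whenever a matched tag lacks that attribute, whereas Tag.get() returns None instead, so you can fall back to the text. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the first tag carries a title attribute.
html = '<a class="api_txt_lines" title="first">one</a><a class="api_txt_lines">two</a>'
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(class_="api_txt_lines"):
    # tag.get() returns None instead of raising KeyError when "title" is absent
    print(tag.get("title") or tag.get_text())
```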
I would like to scrape this site: http://waqfeya.com/book.php?bid=1
but when I do, I get characters like these: ÇáÞÑÂä ÇáßÑíã.
This is how my script looks:
import requests
from bs4 import BeautifulSoup
BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
soup = BeautifulSoup(source.text, 'lxml')
print(soup)
I tried these things, but they don't work for me:
source.encoding = 'utf-8'
and this:
source.encoding = 'ISO-8859-1'
also this:
soup = BeautifulSoup(source.text, from_encoding='ISO-8859-1')
But none worked for me.
Use urlopen instead of requests:
from bs4 import BeautifulSoup
from urllib.request import urlopen
BASE_URL = "http://waqfeya.com/book.php?bid=1"
response = urlopen(BASE_URL)
soup = BeautifulSoup(response, 'lxml')
print(soup.encode('utf-8'))
Sometimes requests may get the encoding wrong. For this site we can get the correct encoding from the page source. You can assign the encoding with source.encoding = 'windows-1256' before using source.text in BeautifulSoup.
import requests
BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
print(source.encoding)
print(source.apparent_encoding)
source.encoding = 'windows-1256'
print(source.text)
I was able to get all the Arabic characters properly.
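The same idea generalizes: when a server declares the wrong charset, decoding the body with it produces mojibake, and re-decoding the raw bytes with the right codec recovers the text, which is exactly what assigning source.encoding before reading source.text does. A small offline sketch of the principle (the Arabic sample string is illustrative):

```python
# Decoding bytes with the wrong codec yields mojibake like the question shows;
# decoding the same bytes with the right codec recovers the original text.
raw = "القرآن الكريم".encode("windows-1256")  # Arabic text as the server sends it
wrong = raw.decode("latin-1")                 # what an ISO-8859-1 guess produces
right = raw.decode("windows-1256")            # the correct codec
print(wrong)   # garbled, similar to the characters in the question
print(right)
```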
I'm trying to get other subset URLs from a main URL. However, as I print to see if I get the content, I notice that I am only getting the HTML, not the URLs within it.
import urllib.request
file = 'http://example.com'
with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')
I think this is what you are looking for. You can use the Beautiful Soup library for Python; this code should work with Python 3:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def get_all_urls(url):
    response = urlopen(url)
    url_html = BeautifulSoup(response, 'html.parser')
    for link in url_html.find_all('a'):
        links = str(link.get('href'))
        if links.startswith('http'):
            print(links)
        else:
            print(url + str(links))
get_all_urls('http://example.com')
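One caveat with the else branch above: plain string concatenation mishandles relative links such as "/about" or "../page.html". urllib.parse.urljoin resolves them against the base URL the way a browser would; a short sketch (the example.com URLs are illustrative):

```python
from urllib.parse import urljoin

base = "http://example.com/docs/index.html"
# urljoin resolves each href relative to the base page, handling absolute
# URLs, root-relative paths, sibling files, and parent references alike.
for href in ("http://other.com/x", "/about", "page2.html", "../top.html"):
    print(urljoin(base, href))
```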
I would like to connect and receive an HTTP response from a specific website link. I have this Python code:
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
When I pass the response as a parameter to:
BeautifulSoup(str(mystr), 'html.parser')
to get the cleaned HTML text, I get the following error:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 1139: character maps to <undefined>.
The question is: how can I solve this problem?
Complete code:
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
from bs4 import BeautifulSoup
soup = BeautifulSoup(str(mystr), 'html.parser')
mystr = soup;
print(mystr.get_text())
BeautifulSoup is perfectly happy to consume the file-like object returned by urlopen:
from urllib.request import urlopen
from bs4 import BeautifulSoup
with urlopen("...") as website:
    soup = BeautifulSoup(website, 'html.parser')
    print(soup.prettify())
If you use the requests library, you can avoid these complications:
import requests
from bs4 import BeautifulSoup
fp = requests.get("http://www.python.org")
soup = BeautifulSoup(fp.text, 'html.parser')
print(soup.get_text())
Code:
import requests
import urllib.request
from bs4 import BeautifulSoup
page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())
Error:
Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
    print(soup.get_text())
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>
I think the problem lies mainly with the urllib package. Here I am using the Python 3 urllib package; the urlopen syntax changed from version 2 to 3, which may be the cause of the error. That being said, I have used only the latest syntax.
Python version 3.4
Since you are importing requests, you can use it instead of urllib, like this:
import requests
from bs4 import BeautifulSoup
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text, 'html.parser')
print(soup.get_text())
print(soup.prettify())
Your problem is that Python cannot encode the characters from the page that you are scraping. For some more information, see here: https://stackoverflow.com/a/16347188/2638310
Since the wikipedia page is in UTF-8, it seems that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument in your code like this:
soup = BeautifulSoup(page1.text, from_encoding="UTF-8")
For more on encodings in BeautifulSoup have a look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
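One caveat with the from_encoding suggestion above: per the BeautifulSoup encodings documentation, from_encoding only takes effect when the markup is bytes; page1.text is an already-decoded str, so the argument is ignored there. Passing the raw bytes (requests' .content) makes it work; a small sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

raw = "<p>stampedes \u014d</p>".encode("utf-8")  # bytes, as requests' .content returns
# from_encoding is honored for bytes input; for str input it is ignored.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.p.get_text())
```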
I am using Python 2.7, so I don't have the request module inside the urllib package.
#!/usr/bin/python3
# coding: utf-8
import requests
from bs4 import BeautifulSoup
URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text, 'html.parser')
print(soup.get_text())
print(soup.prettify())
https://www.python.org/dev/peps/pep-0263/
Put those print lines inside a try/except block, so that if there is an unencodable character you won't get an error.
try:
    print(soup.get_text())
    print(soup.prettify())
except UnicodeEncodeError:
    print(soup.get_text().encode("utf-8"))
    print(soup.prettify().encode("utf-8"))
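On Python 3.7+, another option is to switch the interpreter's stdout to UTF-8 once, so later print() calls stop raising UnicodeEncodeError for characters outside the old console codec (assuming the terminal itself can display UTF-8):

```python
import sys

# Re-open stdout with a UTF-8 encoder (Python 3.7+); characters outside the
# previous console codec no longer raise UnicodeEncodeError on print().
sys.stdout.reconfigure(encoding="utf-8")
print("stampede \u014d")
```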