I am trying to scrape this URL:
url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
This is my code:
import requests
from bs4 import BeautifulSoup

html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup  # gives garbage
However, it prints weird symbols that I think are garbage. It's an HTML file, so it shouldn't be trying to parse it as a PDF, should it?
I tried the following:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
import urllib2
from bs4 import BeautifulSoup

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')  # tried with 'latin-1' too
response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
and this too:
Python and BeautifulSoup encoding issues
import requests
from bs4 import BeautifulSoup

html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup.prettify('utf-8')
Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the encoding might be different despite the meta charset being 'utf-8', so I tried the above with 'latin-1' too, but nothing seems to work.
Any suggestions on how I can scrape the given link for data? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!
That's because the URL points to a document in PDF format, so interpreting it as HTML won't make any sense at all.
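You can confirm this from the response headers before trying to parse anything; a minimal sketch using the URL from the question:

import requests

url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
r = requests.get(url)
# The server reports the real media type; for this link it should be
# application/pdf, not text/html
print(r.headers.get('Content-Type'))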
I'm new to all of this, so I need a little bit of help. For a uni project I am trying to extract ingredients from a website, and in general the code works how it should, but I just don't know how to get "Bärlauch" instead of "B%C3%A4rlauch" in the end.
I used beautifulsoup with the following code:
import requests
from bs4 import BeautifulSoup as bs

URL = [...]
links = []
for url in range(0, 10):
    req = requests.get(URL[url])
    soup = bs(req.content, 'html.parser')
    for link in soup.findAll('a'):
        links.append(str(link.get('href')))
I don't get why it doesn't work as it should, even though the encoding is already utf-8.
Maybe someone knows better.
Thanks!
URLs are URL-encoded. Also, the result of a request is a response, not a req(uest).
import urllib.parse

import requests
from bs4 import BeautifulSoup as bs

URLS = [...]
links = []
for url in URLS:
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    for link in soup.find_all('a'):
        # unquote() reverses the percent-encoding
        links.append(urllib.parse.unquote(link.get('href')))
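For the specific string from the question, unquote on its own shows the round trip:

from urllib.parse import unquote

print(unquote("B%C3%A4rlauch"))  # -> Bärlauch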
I am trying to scrape a website into a string, but when I use decode("utf-8") on my bytes object it doesn't return a string; I instead get a UnicodeEncodeError.
I am trying to scrape this website: https://www.futbin.com/20/player/24248/leon-goretzka, which I know uses charset="utf-8".
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
text = r.text.encode("utf-8")
html = text.decode("utf-8")
print(html)
The get function for requests needs to take an actual link. In your example, you're providing a string "link".
r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
data = r.text
print(data)
This gives you a Response object for r. Using r.text will give you the string, r.content will give you bytes (which would require decoding).
Here's a link for reference: Response example
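A small sketch of the difference (it only prints types, so it is safe to run against the URL from the question):

import requests

r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
print(type(r.text))     # <class 'str'>   - already decoded for you
print(type(r.content))  # <class 'bytes'> - raw bytes, decode manually if needed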
Looking for some help. I am working on a project scraping specific Craigslist posts using Beautiful Soup in Python. I can successfully display emojis found within the post title, but I have been unsuccessful with the post body. I've tried different variations, but nothing has worked so far. Any help would be appreciated.
Code:
import requests
from bs4 import BeautifulSoup

f = open("clcondensed.txt", "w")
html2 = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
soup = BeautifulSoup(html2.content, "html.parser")

# Post Title
title = soup.find(id="titletextonly")
title1 = soup.title.string.encode("ascii", "xmlcharrefreplace")
f.write(title1)

# Post Body
body = soup.find(id="postingbody")
body = str(body)
body = body.encode("ascii", "xmlcharrefreplace")
f.write(body)
Error received from the body:
'ascii' codec can't decode byte 0xef in position 273: ordinal not in range(128)
You should use unicode
body = unicode(body)
Please refer to the Beautiful Soup documentation on NavigableString.
Update:
Sorry for the quick answer; it wasn't quite right.
Here you should use the lxml parser instead of the html parser, because the html parser does not handle NCR (Numeric Character Reference) emoji well.
In my test, when the NCR emoji's decimal value is greater than 65535, such as the 🚢 emoji in your html, the html parser decodes it to the wrong character u"\ufffd" (the replacement character) instead of u"\U0001F6A2". I can not find an accurate Beautiful Soup reference for this, but the lxml parser handles it correctly.
Below is the tested code:
import requests
from bs4 import BeautifulSoup
f = open("clcondensed.txt", "w")
html = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
soup = BeautifulSoup(html.content, "lxml")
#Post Title
title = soup.find(id="titletextonly")
title = unicode(title)
f.write(title.encode('utf-8'))
#Post Body
body = soup.find(id="postingbody")
body = unicode(body)
f.write(body.encode('utf-8'))
f.close()
You can refer to lxml entity handling to do more. If you do not have lxml installed, see lxml installing. Hope this helps.
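If you want to see the parser difference in isolation, here is a minimal sketch using the NCR for the ship emoji (U+1F6A2 is decimal 128674); on the Python 2 setup described above, html.parser yields u'\ufffd' while lxml yields the actual emoji:

from bs4 import BeautifulSoup

snippet = "<p>&#128674;</p>"  # NCR for the ship emoji U+1F6A2
print(repr(BeautifulSoup(snippet, "html.parser").p.string))
print(repr(BeautifulSoup(snippet, "lxml").p.string))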
I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:
import requests
from lxml import html

url = ""
req = requests.get(url)
tree = html.fromstring(req.content)
root = tree.xpath('')
for item in root:
    print(item.text)
This code works fine, but for some web pages it can't show their contents properly and the encoding needs to be set to utf-8. I don't know how to set the encoding in this code.
requests automatically decodes content from the server.
Important to understand:
r.content - the not-yet-decoded response content (bytes)
r.encoding - the encoding requests will use to decode the response content
r.text - according to the official docs, the already-decoded version of r.content
Following the unicode standard, I have got used to r.text, but you can still decode the content manually using
r.content.decode(r.encoding)
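If requests guesses the encoding wrong (for example when the server omits the charset), you can also override r.encoding before touching r.text; a minimal sketch, assuming the page really is UTF-8 (the URL is a hypothetical stand-in for the one in the question):

import requests

url = "https://example.com/some-page"  # hypothetical stand-in
req = requests.get(url)
req.encoding = "utf-8"  # force the decoder; r.text is built from r.content using this
print(req.text)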
Hope it helps.
I use Python 3, Beautiful Soup 4 and urllib for parsing some HTML.
I need to parse some pages, get some links from those pages, and then parse the pages those links point to. I tried to do it like this:
import urllib.request
import urllib
from bs4 import BeautifulSoup

with urllib.request.urlopen("https://example.com/mypage?myparam=%D0%BC%D0%B2") as response:
    html = response.read()
soup = BeautifulSoup(html, 'html.parser')
total = soup.find(attrs={"class": "item_total"})
link = u"https://example.com" + total.find('a').get('href')
with urllib.request.urlopen(link) as response:
    exthtml = BeautifulSoup(response.read(), 'html.parser')
But urllib can't open the second link, because it is not percent-encoded like the first link: it contains characters from another language and whitespace.
I tried to encode the url, like:
link = urllib.parse.quote("https://example.com" + total.find('a').get('href'))
But it encodes all the symbols, including the ones that belong to the URL itself. How can I get a properly encoded URL from BeautifulSoup and make the request?
UPD: an example of the second link, produced by
link = u"https://example.com" + total.find('a').get('href')
is
"https://example.com/mypage?p1url=www.example.net%2Fthatpage%2F01234&text=абвгд еёжз ийклмно"
You should just be URL-encoding your link:
link = "https://example.com" + urllib.parse.quote(total.find('a').get('href'))