Encoding/decoding while web-scraping - python

I am trying to scrape a website into a string, but when I call decode("utf-8") on my bytes object it doesn't return a string; instead I get a UnicodeEncodeError.
I am trying to scrape this website: https://www.futbin.com/20/player/24248/leon-goretzka, which I know uses charset="utf-8".
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
text = r.text.encode("utf-8")  # encode the already-decoded text back to bytes
html = text.decode("utf-8")    # decode those bytes back to a string
print(html)

requests.get needs to be given an actual URL. In your example, you were passing the literal string "link" instead of the link itself.
r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
data = r.text
print(data)
This gives you a Response object r. r.text gives you the decoded string, while r.content gives you the raw bytes (which you would have to decode yourself).
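To see the difference, a minimal sketch (the types are the point here):

import requests

r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
print(type(r.text))     # <class 'str'>   - already decoded for you
print(type(r.content))  # <class 'bytes'> - decode manually if needed
html = r.content.decode("utf-8")  # equivalent to r.text when the charset is UTF-8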

Related

Checking network responses from a URL for a JSON using python

I need to get the JSON containing the info from this URL: hkex.com.hk. I can do so manually using Firefox > Developer Tools > Network and looking for the JSON I want; I need to do the same using Python. So far I have this:
import requests

url = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
r = requests.get(url)
print(r.text)
But I only receive HTML, so even after using .json() I get an "Expecting value" error because there is no JSON in the response. How can I achieve this?
The response is HTML text, so you are not able to use the json() method on the entire response.
There should be another way to "convert" the HTML into JSON, but you will have to find the part of the HTML you want to convert first.
The JSON is indeed hidden in the URL you mention in one of your comments. You have to get the HTML, extract the JSON, and load it:
import json
import requests

url = 'https://www1.hkex.com.hk/hkexwidget/data/getequityfilter?lang=eng&token=evLtsLsBNAUVTPxtGqVeG8QpVRBPNt2I8CbDELLpyZv%2bff8QFzdfZ6w1Za4TWSJ6&sort=5&order=0&qid=1627367921383&callback=jQuery35106295196366220494_1627367912871&_=1627367912873'
req = requests.get(url)
# now for the extraction: strip the JSONP callback wrapper;
# splitting on the first '(' avoids hard-coding the callback name
target = req.text.split('(')[1].split(')')[0]
data = json.loads(target)
data
The output should be your json.
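If you'd rather not depend on string splitting at all, here is a hedged sketch of a generic JSONP unwrapper (the helper name jsonp_to_json is made up for illustration):

import json
import re

def jsonp_to_json(text):
    # strip a wrapper of the form callback( ... ), with an optional trailing ';'
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return json.loads(match.group(1) if match else text)

Calling jsonp_to_json(req.text) should give you the same dict as above.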

BeautifulSoup: Getting empty variables

I have been trying to get the values of some variables from a web page:
from urllib.request import urlopen
from bs4 import BeautifulSoup

itemPage = 'https://dadosabertos.camara.leg.br/api/v2/legislaturas/1'
url = urlopen(itemPage)
soupItem = BeautifulSoup(url, 'lxml')
dataInicio = soupItem.find('dataInicio')
dataFim = soupItem.find('dataFim')
However, dataInicio and dataFim are empty. What am I doing wrong?
There are a couple of issues here. First, soup expects a string as input; check your url and you'll see it's actually an <http.client.HTTPResponse object at 0x036D7770>. You can read() it, which produces a JSON byte string that is usable. But if you'd prefer to stick with XML parsing, I'd recommend using Python's requests library to obtain a raw XML string (pass in the correct headers to specify XML).
Secondly, when you create your soup object, you need to pass in features="xml" instead of "lxml".
Putting it all together:
import requests
from bs4 import BeautifulSoup
item_page = "https://dadosabertos.camara.leg.br/api/v2/legislaturas/1"
response = requests.get(item_page, headers={"accept": "application/xml"})
soup = BeautifulSoup(response.text, "xml")
data_inicio = soup.find("dataInicio")
data_fim = soup.find("dataFim")
print(data_inicio)
print(data_fim)
Output:
<dataInicio>1826-04-29</dataInicio>
<dataFim>1830-04-24</dataFim>
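If you only want the date strings rather than the whole tags, a Tag's .text attribute gives the element's text content:

print(data_inicio.text)  # 1826-04-29
print(data_fim.text)     # 1830-04-24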

parsing xml and html page with lxml and requests package in python

I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:
import requests
from lxml import html

url = ""
req = requests.get(url)
tree = html.fromstring(req.content)
root = tree.xpath('')
for item in root:
    print(item.text)
This code works fine, but for some web pages the contents don't display properly and I need to set the encoding to UTF-8. I don't know how to set the encoding in this code.
requests automatically decodes content from the server.
Important to understand:
r.content - the raw, not-yet-decoded response content (bytes)
r.encoding - information about the response content's encoding
r.text - according to the official docs, the already-decoded version of r.content
Following the Unicode standard, I have got used to r.text, but you can still decode the content manually using
r.content.decode(r.encoding)
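If a page still comes out garbled because the server misreports its charset, you can also override the detected encoding before reading r.text; a minimal sketch (the URL is a placeholder):

import requests

req = requests.get("https://example.com/page")  # placeholder URL
req.encoding = "utf-8"  # override the header-derived guess
print(req.text)         # now decoded as UTF-8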
Hope it helps.

Getting the top wallpaper from reddit

I am trying to get the hottest wallpaper from Reddit's wallpaper subreddit.
I am using Beautiful Soup to get the HTML layout of the first wallpaper, and then a regex to get the URL from the anchor tag. But more often than not I get a URL that my regex doesn't match. Here's the code I am using:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print r.status_code
    text = r.text
    soup = BeautifulSoup(text, "html.parser")
    search_string = str(soup.find('a', {'class':'title'}))
    photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())
Is there any way around it?
Here's a better way to do it:
Adding .json to the end of a Reddit URL returns a JSON object instead of HTML.
For example, https://www.reddit.com/r/wallpapers gives you HTML content, but https://www.reddit.com/r/wallpapers/.json gives you a JSON object which you can easily work with using the json module in Python.
Here's the same program for getting the hottest wallpaper:
>>> import urllib
>>> import json
>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())
>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'
>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'
>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'
Not only is it much cleaner, it will also save you a headache if Reddit changes its HTML layout or someone posts a URL that your regex can't handle. As a rule of thumb, it's generally smarter to use JSON instead of scraping the HTML.
PS: The index inside ['children'] is the wallpaper's position: the first one is the topmost, the second one is next, and so on.
Therefore ['data']['children'][2]['data']['url'] will give you the link for the second-hottest wallpaper. You get the gist? :)
PPS: What's more, with this method you can use the default urllib module. Generally, when you're scraping Reddit you'd have to create a fake User-Agent header and pass it with the request (or it gives you a 429 response code), but that's not the case with this method.
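If you're on Python 3, where urllib.urlopen no longer exists, a minimal sketch of the same approach using requests (the User-Agent string is made up for illustration, assuming the .json endpoint still behaves as described):

import requests

resp = requests.get(
    "https://www.reddit.com/r/wallpapers/.json",
    headers={"User-Agent": "wallpaper-example/0.1"},  # hypothetical UA string
)
posts = resp.json()["data"]["children"]
print(posts[1]["data"]["url"])    # same index as in the urllib session above
print(posts[1]["data"]["title"])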
Here is the correct way to do it with your method, although Jarwin's method is better. You should not be using regex when working with HTML. You simply had to reference the href attribute to get the URL:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    url = str(soup.find_all('a', {'class':'title'})[1]["href"])
    print url

BeautifulSoup gives garbage for html conversion

I am trying to scrape this URL:
url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
This is my code:
import requests
from bs4 import BeautifulSoup

html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup  # gives garbage
However, it gives weird symbols that I think are garbage. It's an HTML file, so it shouldn't be trying to parse it as a PDF, should it?
I tried the following:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
import urllib2
from bs4 import BeautifulSoup

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')  # tried with 'latin-1' too
response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
and this too:
Python and BeautifulSoup encoding issues
html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup.prettify('utf-8')
Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the encoding might be different despite the meta charset being 'utf8', so I tried the above with 'latin-1' too. But nothing seems to work.
Any suggestions on how I can scrape the given link for data? Please don't suggest downloading and using pdfminer on the file. Feel free to ask for more information!
That's because the URL points to a document in PDF format, so interpreting it as HTML won't make any sense at all.
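You can confirm this by checking the Content-Type response header instead of guessing from the body:

import requests

r = requests.get("http://www.jmlr.org/proceedings/papers/v36/li14.pdf")
print(r.headers.get("Content-Type"))  # expect something like 'application/pdf', not 'text/html'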
