BeautifulSoup: Getting empty variables - python

I have been trying to get the value of some variables of a web page:
from urllib.request import urlopen
from bs4 import BeautifulSoup

itemPage = 'https://dadosabertos.camara.leg.br/api/v2/legislaturas/1'
url = urlopen(itemPage)
soupItem = BeautifulSoup(url, 'lxml')
dataInicio = soupItem.find('dataInicio')
dataFim = soupItem.find('dataFim')
However, dataInicio and dataFim are empty. What am I doing wrong?

There are a couple of issues here. First, BeautifulSoup expects markup as input, but your url variable is actually an <http.client.HTTPResponse object at 0x036D7770>. You can read() it, which produces a JSON byte string that is perfectly usable, but if you'd prefer to stick with XML parsing, I'd recommend using the requests library to obtain a raw XML string (passing the correct headers to ask for XML).
Secondly, when you create your soup object you need to pass in features="xml" instead of "lxml": the HTML parsers lowercase tag names, so a camel-case tag like dataInicio would never match.
Putting it all together:
import requests
from bs4 import BeautifulSoup
item_page = "https://dadosabertos.camara.leg.br/api/v2/legislaturas/1"
response = requests.get(item_page, headers={"accept": "application/xml"})
soup = BeautifulSoup(response.text, "xml")
data_inicio = soup.find("dataInicio")
data_fim = soup.find("dataFim")
print(data_inicio)
print(data_fim)
Output:
<dataInicio>1826-04-29</dataInicio>
<dataFim>1830-04-24</dataFim>
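
If you'd rather consume the API's JSON instead of XML, a minimal sketch along the same lines (the top-level "dados" key is an assumption about the payload shape, mirroring the XML fields above):
import requests

item_page = "https://dadosabertos.camara.leg.br/api/v2/legislaturas/1"
# Ask for JSON explicitly
response = requests.get(item_page, headers={"accept": "application/json"})
payload = response.json()
# Assumption: the record is nested under a top-level "dados" key
print(payload["dados"]["dataInicio"])
print(payload["dados"]["dataFim"])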

Related

How to extract a unicode text inside a tag?

I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expect the title to be كابستون علوم البيانات التطبيقية,
but the result is منهجية علم البيانات.
What is the problem, and how do I fix it?
Thank you for taking the time to answer.
The issue you are facing is due to improper encoding when fetching the URL with requests.get(). When a server does not declare a charset, requests falls back to ISO-8859-1, which garbles non-Latin text in the HTML. To force a proper encoding for the requested page, set the encoding attribute of the response before extracting the text. For this to work, the line requests.get(url).text has to be split up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above snippet, request.apparent_encoding automatically infers the encoding from the page content itself, so you don't have to specify one by hand.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: Note the title.text in the print call; it extracts the inner text of the tag rather than the tag itself.
Output:
كابستون علوم البيانات التطبيقية
What was causing the error is the encoding of the HTML data.
Arabic letters need two bytes each in UTF-8.
You need to decode the HTML data with the right encoding.
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the above, apparent_encoding automatically sets the encoding to whatever suits the data.
Output:
كابستون علوم البيانات التطبيقية
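
To see what apparent_encoding is doing, you can compare the two attributes yourself; a quick check (the exact values depend on the server's headers):
import requests

r = requests.get("https://www.coursera.org/learn/applied-data-science-capstone-ar")
print(r.encoding)           # what requests guessed from the HTTP headers
print(r.apparent_encoding)  # what requests infers from the response bytes themselves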
There is a nice library called ftfy that fixes broken Unicode text. It has multiple language support.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
I think you need to use UTF-8 encoding/decoding. If the problem lies in your terminal's rendering, there is not much you can fix in code, but if the result ends up in another environment, such as a web page, you will see it displayed correctly.
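
To illustrate that last point, a minimal sketch (the file name is arbitrary): write the scraped title to a UTF-8 encoded HTML file and open it in a browser, sidestepping the terminal entirely.
# The title string below is the one scraped in the answers above
title = "كابستون علوم البيانات التطبيقية"
with open("title.html", "w", encoding="utf-8") as f:
    f.write('<meta charset="utf-8">\n')
    f.write("<h1>" + title + "</h1>\n")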

Encoding/decoding while web-scraping

I am trying to scrape a website into a string, but when I use decode("utf-8") on my bytes object it doesn't return a string; I get a UnicodeEncodeError instead.
I am trying to scrape this website: https://www.futbin.com/20/player/24248/leon-goretzka, which I know uses charset="utf-8".
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
text = r.text.encode("utf-8")
html = text.decode("utf-8")
print(html)
The get function for requests needs to take an actual link; in your example you were providing the literal string "link".
r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
data = r.text
print(data)
This gives you a Response object r. r.text gives you the decoded string, while r.content gives you raw bytes (which would require decoding).
Here's a link for reference: Response example
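
To make the str/bytes distinction concrete, a quick sketch:
import requests

r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
print(type(r.text))     # <class 'str'>  -- already decoded for you
print(type(r.content))  # <class 'bytes'> -- raw bytes you decode yourself
html = r.content.decode("utf-8")  # equivalent to r.text when the charset is UTF-8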

How can I change the scrip name in the URL while executing the code in Python

I have Python code which gives me the VWAP value for a derivative scrip.
import requests
import json
from bs4 import BeautifulSoup as bs
r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=INFY&instrument=FUTSTK&expiry=30MAY2019&type=-&strike=-')
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('#responseDiv').text.strip())
vwap = data['data'][0]['vwap']
print(vwap)
There is a pattern to the URL where just the name of the underlying changes.
For example, in the given two URLs:
https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=RELIANCE&instrument=FUTSTK&expiry=30MAY2019&type=-&strike=-
https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=INFY&instrument=FUTSTK&expiry=30MAY2019&type=-&strike=-
How can the program ask for the scrip name as input and substitute it into the URL?
The parameter to requests.get(...) is a string and you can manipulate it just like a string. For this case, I suggest using str.format() (or you can also use f-strings).
base_url = 'https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying={}&instrument=FUTSTK&expiry=30MAY2019&type=-&strike=-'
underlying_list = ['RELIANCE', 'INFY']
for underlying in underlying_list:
    url = base_url.format(underlying)
    print(url)
    resp = requests.get(url)
    ...
This approach would also allow you to use more parameters if there are other parts of the URL that need to be different for each call; you simply add more parameters to the .format() call (and modify base_url accordingly to accept those new parameters).
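
Since the question asks for the scrip name as input, here is a minimal interactive sketch along the same lines (the prompt text is made up):
import requests

base_url = ('https://nseindia.com/live_market/dynaContent/live_watch/get_quote/'
            'GetQuoteFO.jsp?underlying={}&instrument=FUTSTK'
            '&expiry=30MAY2019&type=-&strike=-')
# Ask the user for the scrip name and substitute it into the URL
underlying = input('Enter the scrip name (e.g. INFY): ').strip().upper()
url = base_url.format(underlying)
resp = requests.get(url)
print(resp.status_code)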

Parsing json Objects From Text File With Much Other Stuff - Python

I have an HTML page.
I read it with requests and parsed a script tag with BeautifulSoup; this tag has loads of text, some of which is JSON objects.
How can I read all the JSON objects from this text?
What I want to achieve is to get the products with prices from the Amazon daily deals, and this is what I have written so far:
from bs4 import BeautifulSoup
import json
import requests
def FindRightScriptTag(soup):
    for tag in soup.find_all('script', type="text/javascript"):
        if 'sortedDealIDs' in tag.text and 'dealDetails' in tag.text:
            return tag
url = "https://www.amazon.co.uk/gp/deals/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,"html.parser")
tag = FindRightScriptTag(soup)
print (tag)
It would be good if you shared some of your code. In general, if you know how to navigate down your BeautifulSoup tree, you can pass the string you know to be JSON into the json module.
json.loads() is what you are looking for: it takes a JSON string and turns it into a Python dict for you to use.
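
For instance, once you have sliced the JSON substring out of the script text (the exact boundaries depend on the page), the parsing step looks like this; the snippet below is a made-up stand-in for the real data:
import json

json_snippet = '{"dealDetails": {"title": "Example deal", "price": 9.99}}'  # hypothetical
parsed = json.loads(json_snippet)
print(parsed["dealDetails"]["price"])  # -> 9.99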

Getting the top wallpaper from reddit

I am trying to get the hottest wallpaper from Reddit's wallpaper subreddit.
I am using Beautiful Soup to get the HTML layout of the first wallpaper, and then a regex to get the URL from the anchor tag. But more often than not I get a URL that's not matched by my regex. Here's the code I am using:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print r.status_code
    text = r.text
    soup = BeautifulSoup(text, "html.parser")
    search_string = str(soup.find('a', {'class':'title'}))
    photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())
Is there any way around it?
Here's a better way to do it:
Adding .json to the end of a URL on Reddit returns a JSON object instead of HTML.
For example, https://www.reddit.com/r/wallpapers serves HTML content, but https://www.reddit.com/r/wallpapers/.json gives you a JSON object which you can easily work with using the json module in Python.
Here's the same program for getting the hottest wallpaper:
>>> import urllib
>>> import json
>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())
>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'
>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'
>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'
Not only is it much cleaner, it will also spare you a headache if Reddit changes its HTML layout or someone posts a URL that your regex can't handle. As a rule of thumb, it's generally smarter to use JSON instead of scraping the HTML.
PS: The index into ['data']['children'] is the wallpaper's position in the listing: the elements are ordered from hottest downwards. Therefore ['data']['children'][2]['data']['url'] will give you the link for the next wallpaper down. You get the gist? :)
PPS: What's more, with this method you can use the standard urllib module. Generally, when scraping Reddit you would have to craft a fake User-Agent header and pass it with the request (otherwise you get a 429 response code), but that's not the case with this method.
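
For reference, a sketch of the same idea in Python 3 with the requests library and an explicit User-Agent (the UA string here is made up):
import requests

headers = {"User-Agent": "wallpaper-fetcher/0.1"}  # hypothetical User-Agent
resp = requests.get("https://www.reddit.com/r/wallpapers/.json", headers=headers)
data = resp.json()
top = data["data"]["children"][1]["data"]  # index 1, as in the answer above
print(top["url"], "-", top["title"])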
Here is the correct way to do it with your method, though Jarwin's method is better. You should not be using regex to work with HTML; you simply had to reference the href attribute to get the URL:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    url = str(soup.find_all('a', {'class':'title'})[1]["href"])
    print url
