Problems with navigablestrings and unicode in BeautifulSoup - python

I am having some problems with navigablestrings and unicode in BeautifulSoup (python).
Basically, I am parsing four results pages from YouTube and putting the top result's extension (the end of the URL after youtube.com/watch?=) into a list.
I then loop over the list in two other functions; one throws this error: TypeError: 'NavigableString' object is not callable, while the other says TypeError: 'unicode' object is not callable. Both are using the exact same string.
What am I doing wrong here? I know my parsing code is probably not perfect; I'm using both BeautifulSoup and regex. In the past, whenever I got NavigableString errors, I just threw in a .encode('ascii', 'ignore') or simply str(), and that seemed to work. Any help would be appreciated!
for url in urls:
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    link_data = soup.findAll("a", {"class": "yt-uix-tile-link result-item-translation-title"})[0]
    ext = re.findall('href="/(.*)">', str(link_data))[0]
    if isinstance(ext, str):
        exts.append('http://www.youtube.com/' + ext.replace(' ', ''))
and then:
for ext in exts:
    description = description(ext)
    embed = embed(ext)
I only added the isinstance() lines to try to see what the problem was. When str is changed to unicode, the exts list is empty (meaning they are strings, not unicode, or even NavigableStrings?). I'm quite confused...

description = description(ext) replaces the function with a string after the first iteration of the loop. The same goes for embed.

for ext in exts:
    description = description(ext)
    embed = embed(ext)
description() and embed() are functions. For example:
def description(ext):  # this is a function
    return u'result'
Then:
description = description(ext)
# now description is a unicode object, and it is not callable;
# it can't be called like description(ext) again
I think those two functions, description() and embed(), return a 'NavigableString' object and a 'unicode' object. Those objects are not callable.
So you should replace those two lines, for example:
for ext in exts:
    description_result = description(ext)
    embed_result = embed(ext)
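To make the rebinding concrete, here is a minimal sketch; describe is a hypothetical stand-in for the real description()/embed() functions:

```python
def describe(ext):  # hypothetical stand-in for description()/embed()
    return u'result for ' + ext

describe = describe('abc123')   # the name now refers to the returned string
print(describe)                 # result for abc123

try:
    describe('xyz789')          # second call fails: the name is no longer a function
except TypeError as e:
    print(e)                    # ...object is not callable
```

Using a different name on the left-hand side (such as description_result) avoids clobbering the function.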

Related

TypeError on matching pattern with module "re"

I'm trying to extract the price of an item in my program by parsing the HTML with the help of the "bs4" BeautifulSoup library.
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"},text= not None)
pattern_1 = re.compile("/d+./d+").findall(element).text.strip()
print(pattern_1)
print(element)
and here is what I get as output :
Traceback (most recent call last):
File "/root/Desktop/Visual_Studio_Files/Python_sample.py", line 9, in <module>
pattern_1 = (re.compile("/d+./d+").findall(str_ele)).text.strip()
TypeError: expected string or bytes-like object
re.findall freaks out because your element variable has the type bs4.element.Tag.
You can find this out by adding print(type(element)) in your script.
Based on some quick poking around, I think you can extract the string you need from the tag using the contents attribute (which is a list) and taking the first member of this list (index 0).
Moreover, re.findall also returns a list, so instead of .text you need to use [0] to access its first member. Thus you will once again have a string which supports the .strip() method!
Last but not least, it seems you may have mis-typed your slashes and meant to use \ instead of /.
Here's a working version of your code:
pattern_1 = re.findall(r"\d+\.\d+", element.contents[0])[0].strip()
This is definitely not pretty or very pythonic, but it will get the job done.
Note that I dropped the call to re.compile because that gets run in the background when you call re.findall anyway.
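Putting those pieces together, here is a minimal self-contained sketch; the span's contents below are an assumed stand-in for the real AliExpress page:

```python
import re
from bs4 import BeautifulSoup

# assumed stand-in for the price markup on the real page
html = '<span itemprop="price" id="j-sku-price" class="p-price">US $146.00</span>'
soup = BeautifulSoup(html, "html.parser")
element = soup.find("span", {"itemprop": "price", "id": "j-sku-price", "class": "p-price"})

# re.findall needs a string, not a bs4 Tag, so extract the text first;
# findall returns a list, so take the first match with [0]
price = re.findall(r"\d+\.\d+", element.get_text())[0]
print(price)  # 146.00
```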
Here is what it finally looks like :)
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"}).text.strip()
# pattern_1 = re.compile("/d+./d+").findall(element)
# print (pattern_1)
print (element)
and this is the output :)
146.00
thank you every one :)

Previous solutions don't work: TypeError: 'str' object is not callable

I know a similar error has been reported, but I checked all the online solutions and none of them work, so I decided to open a new post.
I was running the following code from an online course. It is supposed to work; however, it always reports the following error when run on my machine:
----> 7 input_str = input('Enter location: ')
TypeError: 'str' object is not callable
Below is the whole block of code:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET

serviceurl = 'http://maps.googleapis.com/maps/api/geocode/xml?'

while True:
    input_str = input('Enter location: ')
    if len(input_str) < 1: break
    url = serviceurl + urllib.parse.urlencode({'address': input_str})
    print('Retrieving', url)
    uh = urllib.request.urlopen(url)
    data = uh.read()
    print('Retrieved', len(data), 'characters')
    print(data.decode())
    tree = ET.fromstring(data)
    results = tree.findall('result')
    lat = results[0].find('geometry').find('location').find('lat').text
    lng = results[0].find('geometry').find('location').find('lng').text
    location = results[0].find('formatted_address').text
    print('lat', lat, 'lng', lng)
    print(location)
Thanks in advance!
You redefined the built-in function input as a string somewhere in your code (not necessarily in the posted code fragment) by executing something like this:
input = ....
There is one sure way to fix this error: close the Python interpreter and start it again. Then make sure that the code you execute does not contain assignments to input or to any other identifiers that refer to built-in functions.
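A minimal reproduction of the shadowing; the assignment below stands in for whatever rebinding happened earlier in the session:

```python
input = 'Boston'               # shadows the built-in input() with a string

try:
    input('Enter location: ')  # now "calls" the string, not the built-in
except TypeError as e:
    print(e)                   # 'str' object is not callable

del input                      # at module level, deleting the name re-exposes the built-in
print(callable(input))         # True
```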
I see you're trying to put a string in input_str. Have you already tried:
input_str = raw_input('Enter location: ')
In Python 2, raw_input() reads a line and returns it as a string, whereas input() evaluates what is typed as an expression. (Note that raw_input() was removed in Python 3, where input() returns a string.)
Regards!
Have you tried renaming the variable? Instead of input_str, try something like user_input. (Strictly speaking, str is a built-in name rather than a reserved word, so input_str itself is legal; the usual culprit is that input has been reassigned somewhere.)

'list' object has no attribute 'timeout'

I am trying to download Pdfs using urllib.request.urlopen from a page but it returns an error: 'list' object has no attribute 'timeout':
def get_hansard_data(page_url):
    #Read base_url into Beautiful Soup object
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    #grab <div class="itemContainer"> that holds links and dates to all hansard pdfs
    hansard_menu = soup.find_all("div", "itemContainer")
    #Get all hansards
    #write to a tsv file
    with open("hansards.tsv", "a") as f:
        fieldnames = ("date", "hansard_url")
        output = csv.writer(f, delimiter="\t")
        for div in hansard_menu:
            hansard_link = [HANSARD_URL + div.a["href"]]
            hansard_date = div.find("h3", "catItemTitle").string
            #download
            with urllib.request.urlopen(hansard_link) as response:
                data = response.read()
            r = open("/Users/Parliament Hansards/" + hansard_date + ".txt", "wb")
            r.write(data)
            r.close()
            print(hansard_date)
            print(hansard_link)
            output.writerow([hansard_date, hansard_link])
    print("Done Writing File")
A bit late, but this might still be helpful to someone else (if not for the topic starter). I found the solution while solving the same problem.
The problem was that page_url (in your case) was a list rather than a string. The most likely reason is that page_url comes from argparse.parse_args() (at least it was so in my case).
Doing page_url[0] should work, but it is not nice to do that inside the get_hansard_data(page_url) function. It would be better to check the type of the parameter and raise an appropriate error for the function caller if the type does not match.
The type of an argument can be checked by calling type(page_url) and comparing the result, for example: type("") == type(page_url). I am sure there are more elegant ways to do that (such as isinstance(page_url, str)), but that is out of the scope of this particular question.
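As a sketch of that check; the function body here is a stand-in for the real scraping logic:

```python
def get_hansard_data(page_url):
    # fail early with a clear message instead of the confusing
    # "'list' object has no attribute 'timeout'" raised later by urlopen
    if not isinstance(page_url, str):
        raise TypeError('page_url must be a str, got ' + type(page_url).__name__)
    return page_url  # stand-in for the real scraping logic

try:
    get_hansard_data(['http://example.com/hansard'])  # e.g. an unwrapped argparse list
except TypeError as e:
    print(e)  # page_url must be a str, got list
```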

NoneType object has no attribute 'encode' (Web Scraping)

I am getting the error
'NoneType' object has no attribute 'encode'
when I run this code:
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
fobj = open('D:\Scraping\parveen_urls.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
But when I use find instead of findAll, I get one URL. How do I get all of the URLs from the object with findAll?
'NoneType' object has no attribute 'encode'
You are using .string. If a tag has multiple children, .string is None (docs):
If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:
Use .get_text() instead.
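A small sketch of the difference, using a made-up snippet of HTML:

```python
from bs4 import BeautifulSoup

html = '<div class="entry-content"><div>http://teste.com</div><div>http://teste2.com</div></div>'
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', attrs={'class': 'entry-content'})

print(outer.string)      # None: the outer div has multiple children
print(outer.get_text())  # http://teste.comhttp://teste2.com
```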
Below I provide two examples and one possible solution:
Example 1 shows a working sample.
Example 2 shows a non working sample raising your reported error.
Solution shows a possible solution.
Example 1: The html have the expected div
doc = ['<html><head><title>Page title</title></head>',
'<body><div class="entry-content"><div>http://teste.com</div>',
'<div>http://teste2.com</div></div></body>',
'</html>']
soup = BeautifulSoup(''.join(doc))
url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})
fobj = open('.\parveen_urls.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
Example 2: The html does not have the expected div in the content
doc = ['<html><head><title>Page title</title></head>',
'<body><div class="entry"><div>http://teste.com</div>',
'<div>http://teste2.com</div></div></body>',
'</html>']
soup = BeautifulSoup(''.join(doc))
"""
The error will rise here because the first find does not return nothing,
and nothing is equals to None. Calling "findAll" on a None object will
raise: AttributeError: 'NoneType' object has no attribute 'findAll'
"""
url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})
fobj = open('.\parveen_urls2.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
Possible solution:
doc = ['<html><head><title>Page title</title></head>',
'<body><div class="entry"><div>http://teste.com</div>',
'<div>http://teste2.com</div></div></body>',
'</html>']
soup = BeautifulSoup(''.join(doc))
url = soup.find('div',attrs={"class":"entry-content"})
"""
Deal with documents that do not have the expected html structure
"""
if url:
    url = url.findAll('div', attrs={"class": None})
    fobj = open('.\parveen_urls2.txt', 'w')
    for getting in url:
        fobj.write(getting.string.encode('utf8'))
else:
    print("The html source does not comply with expected structure")
I found that the issue was caused by null data.
I fixed it by filtering out the null data.

BeautifulSoup: TypeError: 'NoneType' object is not subscriptable

I need to take the "href" attribute from a link (a) tag.
I run
label_tag = row.find(class_='Label')
print(label_tag)
and I get (sorry, I can't show the link and text for privacy reasons)
<a class="Label" href="_link_">_text_</a>
of type
<class 'bs4.element.Tag'>
but when I run (as shown at BeautifulSoup getting href )
tag_link = label_tag['href']
print(tag_link)
I get the following error (on the first command):
TypeError: 'NoneType' object is not subscriptable
Any clue?
Thanks in advance
[SOLVED] EDIT: I was making a mistake (looping over elements with heterogeneous structure)
My guess is that label_tag isn't actually returning the part of the soup you are looking for. This minimal example works:
import bs4
text = '''<a class="Label" href="_link_">_text_</a>'''
soup = bs4.BeautifulSoup(text)
link = soup.find("a",{"class":"Label"})
print (link["href"])
Output:
_link_
Because some rows have no tag with class "Label", row.find(class_='Label') returns None, and subscripting None raises the error.
You can use the following code to handle this case:
tag_link = label_tag.get('href') if label_tag is not None else None
if tag_link:
    print(tag_link)
else:
    print("there is no tag with class 'Label' or no 'href' attribute!")
This handles the missing-tag case and prevents the program from crashing.
If your page elements all have a fixed structure, the former answer is feasible.
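For the heterogeneous-structure case the asker mentions, here is a sketch that skips rows without the Label link; the HTML below is made up:

```python
from bs4 import BeautifulSoup

html = '''
<div class="row"><a class="Label" href="_link_">_text_</a></div>
<div class="row"><span>no link in this row</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

links = []
for row in soup.find_all('div', class_='row'):
    label_tag = row.find(class_='Label')
    if label_tag is not None:          # some rows have no Label tag at all
        links.append(label_tag['href'])

print(links)  # ['_link_']
```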
