I am trying to download PDFs from a page using urllib.request.urlopen, but it returns an error: 'list' object has no attribute 'timeout':
def get_hansard_data(page_url):
    #Read base_url into Beautiful Soup object
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")

    #grab <div class="itemContainer"> that hold links and dates to all hansard pdfs
    hansard_menu = soup.find_all("div","itemContainer")

    #Get all hansards
    #write to a tsv file
    with open("hansards.tsv","a") as f:
        fieldnames = ("date","hansard_url")
        output = csv.writer(f, delimiter="\t")

        for div in hansard_menu:
            hansard_link = [HANSARD_URL + div.a["href"]]
            hansard_date = div.find("h3", "catItemTitle").string

            #download
            with urllib.request.urlopen(hansard_link) as response:
                data = response.read()
                r = open("/Users/Parliament Hansards/"+hansard_date +".txt","wb")
                r.write(data)
                r.close()

            print(hansard_date)
            print(hansard_link)
            output.writerow([hansard_date,hansard_link])

    print ("Done Writing File")
A bit late, but this might still be helpful to someone else (if not to the topic starter). I found the solution by solving the same problem.
The problem was that page_url (in your case) was a list rather than a string. The most likely reason is that page_url comes from argparse.parse_args() (at least it was so in my case).
Doing page_url[0] would work, but it is not nice to do that inside the get_hansard_data(page_url) function. It would be better to check the type of the parameter and raise an appropriate error to the function caller if the type does not match.
The type of an argument can be checked by calling type(page_url) and comparing the result, for example: type("") == type(page_url). I am sure there are more elegant ways to do that, but that is out of the scope of this particular question.
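For illustration, here is a minimal sketch of that kind of guard, assuming the same imports as in the question and using isinstance as one idiomatic way to do the check:

def get_hansard_data(page_url):
    # Fail fast with a clear error if the caller passed a list
    # (for example, straight from argparse) instead of a single URL string.
    if not isinstance(page_url, str):
        raise TypeError("page_url must be a str, got %s" % type(page_url).__name__)
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    # ... rest of the function unchanged ...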
The code below comes up with the error:
"if soup.find(text=bbb).parent.parent.get_text(strip=True
AttributeError: 'NoneType' object has no attribute 'parent'"
Any help would be appreciated, as I can't quite get it to run fully; Python only returns results up to the error. I need it to return empty if there is no item and move on. I tried putting in an if statement, but that doesn't work.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv','w', newline= "")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_e in soup.findAll('table', {'class' : 'neither'}):
        Sold = item_e.get_text(strip=True)

    bbb = re.compile('First listed')
    try:
        next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
    except:
        Pass

    try:
        writer.writerow([ Sold, next_s])
    except:
        pass

trade_spider(2)
Your exception comes from trying to access an attribute on None. You don't intend to do that, but because some earlier part of your expression turns out to be None where you expected something else, the later parts break.
Specifically, either soup.find(text=bbb) or soup.find(text=bbb).parent is None (probably the former, since I think None is the returned value if find doesn't find anything).
There are a few ways you can write your code to address this issue. You could either try to detect that it's going to happen ahead of time (and do something else instead), or you can just go ahead and try the attribute lookup and react if it fails. These two approaches are often called "Look Before You Leap" (LBYL) and "Easier to Ask Forgiveness than Permission" (EAFP).
Here's a bit of code using an LBYL approach that checks to make sure the values are not None before accessing their attributes:
val = soup.find(text=bbb)
if val and val.parent: # I'm assuming the non-None values are never falsey
    next_s = val.parent.parent.get_text(strip=True)
else:
    pass # do something else here?
The EAFP approach is perhaps simpler, but there's some risk that it could catch other unexpected exceptions instead of the ones we expect (so be careful using this design approach during development):
try:
    next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
except AttributeError: # try to catch the fewest exceptions possible (so you don't miss bugs)
    pass # do something else here?
It's not obvious to me what your code should do in the "do something else here" sections in the code above. It might be that you can ignore the situation, but probably you'd need an alternative value for next_s to be used by later code. If there's no useful value to substitute, you might want to bail out of the function early instead (with a return statement).
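For example, here is a minimal sketch of that last option, assuming it is acceptable to simply skip a page when nothing matches:

found = soup.find(text=bbb)
if found is None:
    # Nothing on this page matched "First listed"; skip it and move on.
    return
next_s = found.parent.parent.get_text(strip=True)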
I'm new to web2py and websites in general.
I would like to upload XML files with different numbers of questions in them.
I'm using bs4 to parse the uploaded file, and then I want to do different things: if there is only one question in the XML file I would like to go to one site, and if there are more questions I want to go to another site.
So this is my code:
def import_file():
    form = SQLFORM.factory(Field('file','upload', requires = IS_NOT_EMPTY(), uploadfolder = os.path.join(request.folder, 'uploads'), label='file:'))
    if form.process().accepted:
        soup = BeautifulSoup('file', 'html.parser')
        questions = soup.find_all(lambda tag:tag.name == "question" and tag["type"] != "category")
        # now I want to check the length of the list to redirect to the different URL's, but it doesn't work, len(questions) is 0.
        if len(questions) == 1:
            redirect(URL('import_questions'))
        elif len(questions) > 1:
            redirect(URL('checkboxes'))
    return dict(form=form, message=T("Please upload the file"))
Does anybody know what I can do to check the length of the list after uploading the XML file?
BeautifulSoup expects a string or a file-like object, but you're passing 'file'. Instead, you should use:
with open(form.vars.file) as f:
    soup = BeautifulSoup(f, 'html.parser')
However, this is not a web2py-specific problem.
Hope this helps.
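One caveat, as an assumption on my part since I don't have your app to test against: form.vars.file normally holds only the stored file name, so you may need to build the full path inside your upload folder before opening it:

import os

if form.process().accepted:
    # form.vars.file is the name web2py stored the upload under,
    # so join it with the upload folder to get an openable path.
    path = os.path.join(request.folder, 'uploads', form.vars.file)
    with open(path) as f:
        soup = BeautifulSoup(f, 'html.parser')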
I am getting the error
'NoneType' object has no attribute 'encode'
when I run this code:
url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})
fobj = open('D:\Scraping\parveen_urls.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
But when I use find instead of findAll I get one URL. How do I get all of the URLs from the object with findAll?
'NoneType' object has no attribute 'encode'
You are using .string. If a tag has more than one child, .string will be None (docs):
If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.
Use .get_text() instead.
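Applied to your loop, that would look roughly like this (a sketch that keeps your file handling as it is):

url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
fobj = open('D:\Scraping\parveen_urls.txt', 'w')
for getting in url:
    # get_text() still works when the tag has several children,
    # where .string would be None.
    fobj.write(getting.get_text().encode('utf8'))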
Below I provide two examples and one possible solution:
Example 1 shows a working sample.
Example 2 shows a non working sample raising your reported error.
Solution shows a possible solution.
Example 1: The HTML has the expected div
doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry-content"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']

soup = BeautifulSoup(''.join(doc))

url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})

fobj = open('.\parveen_urls.txt', 'w')

for getting in url:
    fobj.write(getting.string.encode('utf8'))
Example 2: The HTML does not have the expected div in the content
doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']

soup = BeautifulSoup(''.join(doc))

"""
The error will be raised here because the first find returns nothing,
and nothing is equal to None. Calling "findAll" on a None object will
raise: AttributeError: 'NoneType' object has no attribute 'findAll'
"""
url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})

fobj = open('.\parveen_urls2.txt', 'w')

for getting in url:
    fobj.write(getting.string.encode('utf8'))
Possible solution:
doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']

soup = BeautifulSoup(''.join(doc))

url = soup.find('div',attrs={"class":"entry-content"})

"""
Deal with documents that do not have the expected html structure
"""
if url:
    url = url.findAll('div', attrs={"class":None})
    fobj = open('.\parveen_urls2.txt', 'w')
    for getting in url:
        fobj.write(getting.string.encode('utf8'))
else:
    print("The html source does not comply with expected structure")
I found the issue came down to null data.
I fixed it by filtering out the null data.
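For anyone hitting the same thing, a minimal sketch of that kind of filter, assuming the null data is divs whose .string is None:

for getting in url:
    text = getting.string
    if text is None:
        # Skip entries with no usable string instead of crashing on .encode()
        continue
    fobj.write(text.encode('utf8'))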
I am trying to write a program that will collect specific information from an eBay product page and write that information to a text file. To do this I'm using BeautifulSoup and Requests, and I'm working with Python 2.7.9.
I've been mostly using this tutorial (Easy Web Scraping with Python) with a few modifications. So far everything works as intended until it writes to the text file. The information is written, just not in the format that I would like.
What I'm getting is this:
{'item_title': u'Old Navy Pink Coat M', 'item_no': u'301585876394', 'item_price': u'US $25.00', 'item_img': 'http://i.ebayimg.com/00/s/MTYwMFgxMjAw/z/Sv0AAOSwv0tVIoBd/$_35.JPG'}
What I was hoping for was something that would be a bit easier to work with.
For example:
New Shirt 5555555555 US $25.00 http://ImageURL.jpg
In other words I want just the scraped text and not the brackets, the 'item_whatever', or the u'.
After a bit of research I suspect my problem is to do with the encoding of the information as its being written to the text file, but I'm not sure how to fix it.
So far I have tried,
def collect_data():
    with open('writetest001.txt','w') as x:
        for product_url in get_links():
            get_info(product_url)
            data = "'{0}','{1}','{2}','{3}'".format(item_data['item_title'],'item_price','item_no','item_img')
            x.write(str(data))
In the hopes that it would make the data easier to format in the way I want. It only resulted in "NameError: global name 'item_data' is not defined" displayed in IDLE.
I have also tried using .split() and .decode('utf-8') in various positions but have only received AttributeErrors or the written outcome does not change.
Here is the code for the program itself.
import requests
import bs4

#Main URL for Harvesting
main_url = 'http://www.ebay.com/sch/Coats-Jackets-/63862/i.html?LH_BIN=1&LH_ItemCondition=1000&_ipg=24&rt=nc'

#Harvests Links from "Main" Page
def get_links():
    r = requests.get(main_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    return [a.attrs.get('href')for a in soup.select('div.gvtitle a[href^=http://www.ebay.com/itm]')]

print "Harvesting Now... Please Wait...\n"
print "Harvested:", len(get_links()), "URLs"
#print (get_links())
print "Finished Harvesting... Scraping will Begin Shortly...\n"

#Scrapes Select Information from each page
def get_info(product_url):
    item_data = {}
    r = requests.get(product_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    #Fixes the 'Details about ' problem in the Title
    for tag in soup.find_all('span',{'class':'g-hdn'}):
        tag.decompose()
    item_data['item_title'] = soup.select('h1#itemTitle')[0].get_text()
    #Grabs the Price, if the item is on sale, grabs the sale price
    try:
        item_data['item_price'] = soup.select('span#prcIsum')[0].get_text()
    except IndexError:
        item_data['item_price'] = soup.select('span#mm-saleDscPrc')[0].get_text()
    item_data['item_no'] = soup.select('div#descItemNumber')[0].get_text()
    item_data['item_img'] = soup.find('img', {'id':'icImg'})['src']
    return item_data

#Collects information from each page and write to a text file
write_it = open("writetest003.txt","w","utf-8")

def collect_data():
    for product_url in get_links():
        write_it.write(str(get_info(product_url))+ '\n')

collect_data()
write_it.close()
You were on the right track.
You need a local variable to assign the results of get_info to. The variable item_data you tried to reference only exists within the scope of the get_info function. You can use the same variable name though, and assign the results of the function to it.
There was also a syntax issue in the section you tried with respect to how you're formatting the items.
Replace the section you tried with this:
for product_url in get_links():
    item_data = get_info(product_url)
    data = "{0},{1},{2},{3}".format(*(item_data[item] for item in ('item_title','item_price','item_no','item_img')))
    x.write(data)
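Putting that together, a sketch of the whole collect_data function could look like the following; the trailing newline and the utf-8 encode are my additions, so drop them if they don't fit your setup:

def collect_data():
    with open('writetest001.txt', 'w') as x:
        for product_url in get_links():
            # Bind the dict returned by get_info to a local name first.
            item_data = get_info(product_url)
            data = "{0},{1},{2},{3}".format(*(item_data[item] for item in
                                              ('item_title', 'item_price', 'item_no', 'item_img')))
            # Encode for Python 2 in case a title contains non-ASCII characters.
            x.write(data.encode('utf8') + '\n')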
I am having some problems with NavigableStrings and unicode in BeautifulSoup (Python).
Basically, I am parsing four results pages from youtube, and putting the top result's extension (end of the url after youtube.com/watch?=) into a list.
I then loop the list in two other functions, on one, it throws this error: TypeError: 'NavigableString' object is not callable. However, the other one says TypeError: 'unicode' object is not callable. Both are using the same exact string.
What am I doing wrong here? I know that my parsing code is probably not perfect; I'm using both BeautifulSoup and regex. In the past, whenever I got NavigableString errors, I just threw in a ".encode('ascii', 'ignore')" or simply str(), and that seemed to work. Any help would be appreciated!
for url in urls:
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    link_data = soup.findAll("a", {"class":"yt-uix-tile-link result-item-translation-title"})[0]
    ext = re.findall('href="/(.*)">', str(link_data))[0]
    if isinstance(ext, str):
        exts.append('http://www.youtube.com/'+ext.replace(' ',''))
and then:
for ext in exts:
    description = description(ext)
    embed = embed(ext)
I only added the isinstance() lines to try to see what the problem was. When 'str' is changed to 'unicode', the exts list is empty (meaning they are strings, not unicode (or even NavigableStrings?)). I'm quite confused...
description = description(ext) replaces the function with a string after the first iteration of the loop. The same goes for embed.
for ext in exts:
    description = description(ext)
    embed = embed(ext)
description() and embed() are functions. For example:
def description(): #this is a function
    return u'result'
Then
description = description(ext)
#now description is a unicode object, and it is not callable.
#It can't be called like description(ext) again.
I think your two functions description() and embed() return a 'NavigableString' object and a 'unicode' object. Those two objects are not callable.
So you should replace those two lines, for example:
for ext in exts:
    description_result = description(ext)
    embed_result = embed(ext)