Here's the code:
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
with open('/users/Rachael/Desktop/CheckTitle.csv', 'r') as readcsv:
    for row in readcsv.readlines():
        try:
            openitem = urllib2.urlopen(row).read()
            soup = BeautifulSoup(openitem, 'lxml')
            print soup.head.find('title').get_text()
        except urllib2.URLError:
            print 'passed'
            pass
I'm getting the following results:
(a):
passed
贝贝网京外裁员10%:团队要保持狼性和危机感_新浪财经_新浪网
垂直电商贝贝网被曝裁员 回应称只是10%人员优化_新浪财经_新浪网
(b):
passed
Traceback (most recent call last):
File "C:/Users/Rachael/PycharmProjects/untitled1/GetTitle.py", line 10, in <module>
print soup.head.find('title').get_text()
AttributeError: 'NoneType' object has no attribute 'find'
(c):
passed
贝贝网京外裁员10%:团队要保持狼性和危机感_新浪财经_新浪网
Traceback (most recent call last):
File "C:/Users/Rachael/PycharmProjects/untitled1/GetTitle.py", line 10, in <module>
print soup.head.find('title').get_text()
AttributeError: 'NoneType' object has no attribute 'find'
I'm getting these three types of results randomly.
If I use soup.title, soup.title.text, or soup.title.string instead, it returns the same or a similar error.
Please help!
I found this very hard to describe, so if this is a duplicate in any way, please give me the link to similar posts.
Thanks!!
'NoneType' object has no attribute ... is the error you get when the lookup before the attribute access found nothing. Try printing just soup.head (or soup.head.find('title')) without calling .get_text(); for the failing pages it should print None, which tells you which lookup is coming back empty.
Answer: Either there is no actual title tag, or there's bot protection of some kind on one of the sites you have in that file.
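Whatever the cause, the crash can be avoided by checking for None before touching the tag. A defensive sketch (shown in Python 3 syntax with an inline HTML string so it runs without network access):

```python
from bs4 import BeautifulSoup

def safe_title(html):
    """Return the page title, or a placeholder when no <title> is present."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('title')   # searches the whole tree, not just soup.head
    return title.get_text() if title is not None else 'no title'

print(safe_title('<html><head><title>Hello</title></head></html>'))  # Hello
print(safe_title('<html><body>no head at all</body></html>'))        # no title
```

Searching with soup.find('title') instead of soup.head.find('title') also sidesteps the case where soup.head itself is None.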
Related
I am currently writing a web-scraping script with Python to take play-by-play soccer commentary from fixtures and input it into an Excel sheet. I keep getting this when I try to run it:
Traceback (most recent call last):
File "/Users/noahhollander/Desktop/Web_Scraping/play_by_play.py", line 9, in <module>
tbody = soup('table',{"class":"content"})[0:].findAll('tr')
AttributeError: 'list' object has no attribute 'findAll'
[Finished in 6.207s]
I've read that this probably has something to do with this table being in text format, but I have added .text at the end and still get the same result.
Here is a picture of my code so far.
You might have to write something like this. soup('table', {"class": "content"}) is shorthand for find_all and returns a list of tags; slicing it with [0:] still gives a list, so you have to loop over it (or index one element) before calling find_all('tr'):
tbody = []
tclass = soup('table', {"class": "content"})[0:]
for temp in tclass:
    for t_temp in temp.find_all('tr'):
        tbody.append(t_temp)
Is this your desired result?
div = soup.find('div', {"class": "content"})
tbody = div.find('table').findAll('tr')
That should give you your desired result.
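The traceback itself points at the cause: the slice produces a plain list, and lists have no findAll method. A minimal sketch with an inline table (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table class="content">
  <tr><td>first half</td></tr>
  <tr><td>second half</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# soup('table', ...) is shorthand for find_all and returns a list of tags;
# [0:] is a slice, so the result is still a list.
tables = soup('table', {"class": "content"})[0:]

try:
    tables.findAll('tr')
except AttributeError as exc:
    print(exc)   # 'list' object has no attribute 'findAll'

# Index one table first, then search inside it.
rows = tables[0].find_all('tr')
print(len(rows))   # 2
```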
I am trying to count the words on the Google homepage, but I get an AttributeError at the initial stage.
My code is:
import requests
from bs4 import BeautifulSoup
import operator
def main(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'lxml')
    for post_text in soup.findAll('a'):
        content = post_text.string
        words = content.lower().split()
        for each_word in words:
            print(each_word)
            word_list.append(each_word)

main('https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=2-nqWavnB4WN8Qf4n7eQAw')
My output is:
images
maps
play
youtube
news
gmail
drive
Traceback (most recent call last):
File "word_freq.py", line 18, in <module>
main('https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=2-nqWavnB4WN8Qf4n7eQAw')
File "word_freq.py", line 13, in main
words=content.lower().split()
AttributeError: 'NoneType' object has no attribute 'lower'
You are parsing a web page in HTML, so you need:
soup = BeautifulSoup(source_code, 'html.parser')
.string is the wrong attribute for getting the content of a tag here: it returns None as soon as the tag has more than one child, which is what triggers the AttributeError. Use .text instead:
content = post_text.text
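The difference is easy to demonstrate with an inline snippet: .string is None when a tag has nested children, while .text always returns a string:

```python
from bs4 import BeautifulSoup

# A link whose content is a single string vs. one with nested markup.
soup = BeautifulSoup('<a>plain</a> <a><b>nested</b> text</a>', 'html.parser')
plain, nested = soup.find_all('a')

print(plain.string)    # plain        -- single child, .string works
print(nested.string)   # None         -- multiple children, .string gives None
print(nested.text)     # nested text  -- .text always returns a str
```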
I am a newbie to scraperwiki. I am trying to get the infobox from a wiki page using scraperwiki. I got the idea of using scraperwiki to crawl wiki pages from the link below:
https://blog.scraperwiki.com/2011/12/how-to-scrape-and-parse-wikipedia/
Code
import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")
title = "Aquamole Pot"
val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave # prints just the ukcave infobox
Error
Traceback (most recent call last):
File "scrap_wiki.py", line 3, in <module>
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")
AttributeError: 'module' object has no attribute 'swimport'
I'm new to Python and Stack Overflow.
I'm trying to follow a tutorial on YouTube (outdated, I'm guessing, based on the error I get) regarding fetching stock prices.
Here is the program:
import urllib.request
import re
html = urllib.request.urlopen('http://finance.yahoo.com/q?uhb=uh3_finance_vert_gs_ctrl2&fr=&type=2button&s=AAPL')
htmltext = html.read()
regex = '<span id="yfs_l84_aapl">.+?</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Since this is Python 3, I had to research urllib.request and use those methods instead of a simple urllib.urlopen.
Anyway, when I run it, I get the following error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 13, in <module>
price = re.findall(pattern, htmltext)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I realized the error and attempted to fix it by adding the following:
codec = html.info().get_param('charset', 'utf8')
htmltext = html.decode(codec)
But it gives me another error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 9, in <module>
htmltext = html.decode(codec)
AttributeError: 'HTTPResponse' object has no attribute 'decode'
Hence, after spending a reasonable amount of time, I don't know what to do. All I want is to get the price for AAPL so I can go on to build a general program that fetches prices for an array of stocks and uses them in future programs.
Any help is appreciated. Thanks!
You are barking up the right tree. Try decoding the actual HTML byte string rather than the urlopen HTTPResponse object:
htmltext = html.read()
codec = html.info().get_param('charset', 'utf8')
htmltext = htmltext.decode(codec)
price = re.findall(pattern, htmltext)
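The original TypeError can be reproduced without any network access, which makes the fix easy to see (the span id and price here are invented for illustration):

```python
import re

pattern = re.compile('<span id="price">.+?</span>')  # a str pattern
data = b'<span id="price">123.45</span>'             # bytes, as read() returns

try:
    re.findall(pattern, data)
except TypeError as exc:
    print(exc)   # cannot use a string pattern on a bytes-like object

# Decoding the bytes first makes the types match.
print(re.findall(pattern, data.decode('utf8')))
```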
I have the following code:
for table in soup.findAll("table", "tableData"):
    for row in table.findAll("tr"):
        data = row.findAll("td")
        url = data[0].a
        print type(url)
I get the following output:
<class 'bs4.element.Tag'>
That means that url is an object of class Tag and I should be able to get attributes from this object.
But if I replace print type(url) with print url['href'] I get the following traceback:
Traceback (most recent call last):
File "baseCreator.py", line 57, in <module>
createStoresTable()
File "baseCreator.py", line 46, in createStoresTable
print url['href']
TypeError: 'NoneType' object has no attribute '__getitem__'
What is wrong? And how can I get the value of the href attribute?
I do like BeautifulSoup, but I personally prefer lxml.html (for not-too-wacky HTML) because of the ability to use XPath:
import lxml.html
page = lxml.html.parse('http://somesite.tld')
print page.xpath('//tr/td/a/@href')
You might need to implement some form of "axes", though, depending on the structure.
You can also use ElementSoup as a parser; details at http://lxml.de/elementsoup.html
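The traceback in the question suggests a simpler cause: in some rows the first td has no a tag, so data[0].a is None only for those rows. A defensive sketch with inline markup (the table content is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table class="tableData">
  <tr><td>Name</td><td>City</td></tr>
  <tr><td><a href="/store/1">Store 1</a></td><td>NY</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

hrefs = []
for table in soup.find_all("table", "tableData"):
    for row in table.find_all("tr"):
        data = row.find_all("td")
        url = data[0].a
        if url is not None:        # skip rows (e.g. headers) with no link
            hrefs.append(url['href'])

print(hrefs)   # ['/store/1']
```

Printing type(url) inside the loop only shows the rows where the lookup succeeded; the None shows up on a later row, which is why the output looked contradictory.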