Crawling infobox section of wikipedia using scraperwiki is giving error - python

I am new to scraperwiki. I am trying to get the infobox from a wiki page using scraperwiki. I got the idea of using scraperwiki to crawl wiki pages from the link below:
https://blog.scraperwiki.com/2011/12/how-to-scrape-and-parse-wikipedia/
Code
import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")
title = "Aquamole Pot"
val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave # prints just the ukcave infobox
Error
Traceback (most recent call last):
File "scrap_wiki.py", line 3, in <module>
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")
AttributeError: 'module' object has no attribute 'swimport'
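For context: swimport downloaded helper code from the old ScraperWiki Classic platform, which has since shut down; the scraperwiki package on PyPI does not include it, hence the AttributeError. If the goal is only to pull parameters out of a template, a rough standalone sketch is possible. The helper name and the wikitext snippet below are made up for illustration, and it handles non-nested templates only; real infoboxes usually nest other templates, where a proper wikitext parser such as mwparserfromhell is a safer bet:

```python
import re

# made-up wikitext snippet in the shape of the question's "Infobox ukcave"
wikitext = """{{Infobox ukcave
| name = Aquamole Pot
| depth = 113 m
}}"""

def parse_simple_template(text):
    """Hypothetical helper: return (template name, params) for the first
    non-nested {{...}} template found, or (None, {}) when there is none."""
    m = re.search(r'\{\{(.+?)\}\}', text, re.DOTALL)
    if m is None:
        return None, {}
    lines = m.group(1).strip().split('\n')
    name = lines[0].strip()
    params = {}
    for line in lines[1:]:
        if '=' in line:
            key, _, value = line.lstrip('| ').partition('=')
            params[key.strip()] = value.strip()
    return name, params

name, params = parse_simple_template(wikitext)
print(name, params)  # Infobox ukcave {'name': 'Aquamole Pot', 'depth': '113 m'}
```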

Related

AttributeError: 'NoneType' object has no attribute 'findAll' in a web scraper

I am making a web scraping program for the first time. The tutorial I am using is built for Python 2.7, but I am using 3.8.2. I have mostly edited my code to fit Python 3, but one error pops up that I can't fix.
import requests
import csv
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(features="html.parser")
results_table = soup.find('table', attrs={'class': 'resultsTable'})
output = []
for row in results_table.findAll('tr'):
    output_rows = []
    for cell in row.findAll('td'):
        output_rows.append(cell.text.replace(' ', ''))
    output.append(output_rows)
print(output)
handle = open('out-using-requests.csv', 'a')
outfile = csv.writer(handle)
outfile.writerows(output)
The error I get is:
Traceback (most recent call last):
File "C:\Code\scrape.py", line 17, in <module>
for row in results_table.findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
The tutorial I am using is https://first-web-scraper.readthedocs.io/en/latest/
I tried some other questions, but they didn't help.
Please help!!!
Edit: Never mind, I got a good answer.
find returns None if it doesn't find a match. You need to check for that before attempting to find any sub-elements in it:
results_table = soup.find('table', attrs={'class': 'resultsTable'})
output = []
if results_table:
    for row in results_table.findAll('tr'):
        output_rows = []
        for cell in row.findAll('td'):
            output_rows.append(cell.text.replace(' ', ''))
        output.append(output_rows)
The error allows the following conclusion:
results_table = None
Therefore you cannot call the findAll() method on it, because None.findAll() does not exist.
It is best to run through your program with a debugger and watch how the variables change line by line, to see why the line in question only returns None. Especially important is the line:
results_table = soup.find('table', attrs={'class': 'resultsTable'})
because this is where results_table is assigned, so this is where the None value above comes from.
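Worth noting for the original question: BeautifulSoup(features="html.parser") is never given the downloaded html, so the soup is built from an empty string and every find() is guaranteed to return None. Independently of that, the tutorial's table walk plus the None guard can be reproduced with only the standard library; this is a sketch over an inline HTML snippet (the class name resultsTable matches the question, the row values are made up):

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td> inside a <table class="resultsTable">."""
    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table' and dict(attrs).get('class') == 'resultsTable':
            self.in_table = True
        elif self.in_table and tag == 'tr':
            self.rows.append([])       # start a new output row
        elif self.in_table and tag == 'td':
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_table = False
        elif tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())

html = """<table class="resultsTable">
<tr><td>Doe, Jane</td><td>2020-01-01</td></tr>
<tr><td>Roe, Richard</td><td>2020-01-02</td></tr>
</table>"""

parser = TableExtractor()
parser.feed(html)

if not parser.rows:
    print('no resultsTable found')     # the guard the answer recommends
else:
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    print(out.getvalue())
```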

Python 3.7 scraping a page using BeautifulSoup issues with Available code on stack exchange

I have been trying to scrape a forum (https://forums.bharat-rakshak.com/viewtopic.php?f=3&t=7630) with the Stack Overflow code given in
Scraping a page for URLs using Beautifulsoup
I updated urllib2 to urllib and the urlopen call. However, the code still gives an error in the metadata loop:
for html in metaData:
    text = BeautifulSoup(str(html).strip(), 'lxml').get_text().encode("utf-8").replace("\n", "")  # convert the html to text
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip())  # get Post by:
    times.append(text.split(" on ")[1].strip())  # get date
The error I get is
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
TypeError: a bytes-like object is required, not 'str'
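In Python 3, get_text() returns str and .encode("utf-8") converts it to bytes, so the chained .replace("\n", "") is handing str arguments to a bytes method, which is exactly this TypeError. A minimal sketch of the failure and the usual fix, on a made-up post string (do the replace on the str, and skip the encode unless bytes are really needed):

```python
text = "Post by: someone\n on 01 Jan 2020\n"  # made-up forum post text

# bytes object + str arguments -> the TypeError from the question
try:
    text.encode("utf-8").replace("\n", "")
except TypeError as exc:
    print(exc)  # a bytes-like object is required, not 'str'

# staying in str works: replace first, don't encode
cleaned = text.replace("\n", "")
author = cleaned.split("Post by:")[1].split(" on ")[0].strip()
print(author)  # someone
```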

Python-AttributeError: 'NoneType' object has no attribute 'lower'

I am trying to count the words on the Google homepage, but I get an AttributeError at an early stage.
My code is -->
import requests
from bs4 import BeautifulSoup
import operator

def main(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'lxml')
    for post_text in soup.findAll('a'):
        content = post_text.string
        words = content.lower().split()
        for each_word in words:
            print(each_word)
            word_list.append(each_word)

main('https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=2-nqWavnB4WN8Qf4n7eQAw')
My Output is -->
images
maps
play
youtube
news
gmail
drive
Traceback (most recent call last):
File "word_freq.py", line 18, in <module>
main('https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=2-nqWavnB4WN8Qf4n7eQAw')
File "word_freq.py", line 13, in main
words=content.lower().split()
AttributeError: 'NoneType' object has no attribute 'lower'
You are parsing an HTML web page, so use the HTML parser:
soup = BeautifulSoup(source_code, 'html.parser')
Also, string is the wrong attribute for getting a tag's content here: it returns None whenever the tag does not contain exactly one string child. Use text, which concatenates all of the tag's text:
content = post_text.text
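The None comes from exactly that: .string is None for <a> tags that wrap nested markup rather than a single string. A minimal sketch of guarding for it, with a list of plain values standing in for the post_text.string results (the anchor texts are made up):

```python
# stand-ins for post_text.string values; None mimics an <a> wrapping nested markup
anchors = ["Images", "Maps", None, "Gmail"]

word_list = []
for content in anchors:
    for each_word in (content or "").lower().split():  # treat None as ""
        word_list.append(each_word)
print(word_list)  # ['images', 'maps', 'gmail']
```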

Parse activity unstable, getting a few random results

Here's the code:
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

with open('/users/Rachael/Desktop/CheckTitle.csv', 'r') as readcsv:
    for row in readcsv.readlines():
        try:
            openitem = urllib2.urlopen(row).read()
            soup = BeautifulSoup(openitem, 'lxml')
            print soup.head.find('title').get_text()
        except urllib2.URLError:
            print 'passed'
            pass
I'm getting following results:
(a):
passed
贝贝网京外裁员10%:团队要保持狼性和危机感_新浪财经_新浪网
垂直电商贝贝网被曝裁员 回应称只是10%人员优化_新浪财经_新浪网
(b):
passed
Traceback (most recent call last):
File "C:/Users/Rachael/PycharmProjects/untitled1/GetTitle.py", line 10, in
<module>
print soup.head.find('title').get_text()
AttributeError: 'NoneType' object has no attribute 'find'
(c):
passed
贝贝网京外裁员10%:团队要保持狼性和危机感_新浪财经_新浪网
Traceback (most recent call last):
File "C:/Users/Rachael/PycharmProjects/untitled1/GetTitle.py", line 10, in <module>
print soup.head.find('title').get_text()
AttributeError: 'NoneType' object has no attribute 'find'
I'm getting these three types of results randomly.
If I do soup.title OR soup.title.text OR soup.title.string instead, it will return the same/similar error.
Please help!
I found this very hard to describe, so if this is a duplicate in any way, please give me the link to similar posts.
Thanks!!
'NoneType' object has no attribute is an error that happens when there are no results for that lookup. Try printing just soup.head.find('title'), without the .text; for the failing pages it will print None.
Answer: either there is no actual title tag on the page, or one of the sites in that file has bot protection of some kind and serves a page without one.
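To see which pages are missing a title without crashing mid-loop, the lookup can be made defensive. This stdlib-only sketch uses an html.parser subclass as a stand-in for the BeautifulSoup lookup and feeds it two made-up inline pages, one with a title and one without:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Stand-in for soup.head.find('title'): collects <title> text, else None."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = (self.title or '') + data

pages = ['<html><head><title>贝贝网</title></head></html>',
         '<html><head></head><body>no title here</body></html>']
results = []
for page in pages:
    grabber = TitleGrabber()
    grabber.feed(page)
    # print the title when found, 'passed' when missing, instead of crashing
    results.append(grabber.title if grabber.title else 'passed')
print(results)  # ['贝贝网', 'passed']
```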

beautiful soup bug?

I have the following code:
for table in soup.findAll("table", "tableData"):
    for row in table.findAll("tr"):
        data = row.findAll("td")
        url = data[0].a
        print type(url)
I get the following output:
<class 'bs4.element.Tag'>
That means url is an object of class Tag, so I should be able to get attributes from it.
But if I replace print type(url) with print url['href'] I get the following traceback:
Traceback (most recent call last):
File "baseCreator.py", line 57, in <module>
createStoresTable()
File "baseCreator.py", line 46, in createStoresTable
print url['href']
TypeError: 'NoneType' object has no attribute '__getitem__'
What is wrong? And how can I get the value of the href attribute?
I do like BeautifulSoup but I personally prefer lxml.html (for not too wacky HTML) because of the ability to utilise XPath.
import lxml.html
page = lxml.html.parse('http://somesite.tld')
print page.xpath('//tr/td/a/@href')
Might need to implement some form of "axes" though depending on the structure.
You can also use elementsoup as a parser - details at http://lxml.de/elementsoup.html
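Back on the original traceback: data[0].a is None whenever the first <td> of a row contains no <a> tag, and subscripting None then raises the TypeError. A minimal sketch of the guard, with plain dicts standing in for Tag objects (a dict's ['href'] lookup mimics a Tag's attribute access, and the values are made up):

```python
# stand-ins for data[0].a across three rows; None mimics a link-less first cell
first_cell_links = [{'href': '/store/1'}, None, {'href': '/store/2'}]

hrefs = []
for url in first_cell_links:
    if url is not None:            # guard: data[0].a is None for link-less rows
        hrefs.append(url['href'])
print(hrefs)  # ['/store/1', '/store/2']
```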
