My script stops scraping after the 449th Yelp restaurant.
Entire Code: https://pastebin.com/5U3irKZp
for idx, item in enumerate(yelp_containers, 1):
    print("--- Restaurant number #", idx)
    restaurant_title = item.h3.get_text(strip=True)
    restaurant_title = re.sub(r'^[\d.\s]+', '', restaurant_title)
    restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
The error I am getting is:
Traceback (most recent call last):
File "/Users/kenny/MEGA/Python/yelp scraper.py", line 41, in
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
IndexError: list index out of range
The problem is that some restaurants are missing the address, so splitting the text on '|' produces a list with fewer elements than expected.
What you should do is check first whether the list has enough elements before indexing it. Change this line of code:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
to these:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')
restaurant_address = restaurant_address[1] if len(restaurant_address) > 1 else restaurant_address[0]
I ran your parser for all pages and it worked.
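For extra safety: select_one() returns None when no element matches at all, which would raise an AttributeError before the split even happens. Here is a defensive sketch of the loop body using the selectors from the question (the empty-string fallback for a missing block is my assumption):

for idx, item in enumerate(yelp_containers, 1):
    print("--- Restaurant number #", idx)
    restaurant_title = item.h3.get_text(strip=True)
    restaurant_title = re.sub(r'^[\d.\s]+', '', restaurant_title)

    attributes = item.select_one('[class*="secondaryAttributes"]')
    if attributes is None:
        restaurant_address = ''  # no attributes block at all
    else:
        parts = attributes.get_text(separator='|', strip=True).split('|')
        # take the second field when present, otherwise fall back to the first
        restaurant_address = parts[1] if len(parts) > 1 else parts[0]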
I'm new to web scraping with Python and am having a problem with the weather web scraping script I wrote. Here is the whole code, 'weather.py':
#! python3
import bs4, requests
weatherSite = requests.get('https://weather.com/en-CA/weather/today/l/eef019cb4dca2160f08eb9714e30f28e05e624bbae351ccb6a855dbc7f14f017')
weatherSoup = bs4.BeautifulSoup(weatherSite.text, 'html.parser')
weatherLoc = weatherSoup.select('.CurrentConditions--location--kyTeL')
weatherTime = weatherSoup.select('.CurrentConditions--timestamp--23dfw')
weatherTemp = weatherSoup.select('.CurrentConditions--tempValue--3a50n')
weatherCondition = weatherSoup.select('.CurrentConditions--phraseValue--2Z18W')
weatherDet = weatherSoup.select('.CurrentConditions--precipValue--3nxCj > span:nth-child(1)')
location = weatherLoc[0].text
time = weatherTime[0].text
temp = weatherTemp[0].text
condition = weatherCondition[0].text
det = weatherDet[0].text
print(location)
print(time)
print(temp + 'C')
print(condition)
print(det)
It basically parses the weather information from 'The Weather Channel' and prints it out. This code was working fine yesterday when I wrote it, but when I tried today it gave me the following error:
Traceback (most recent call last):
File "C:\Users\username\filesAndStuff\weather.py", line 16, in <module>
location = weatherLoc[0].text
IndexError: list index out of range
Replace:
weatherLoc = weatherSoup.select('.CurrentConditions--location--kyTeL')
# print(weatherLoc)
# []
By:
weatherLoc = weatherSoup.select('h1[class*="CurrentConditions--location--"]')
# print(weatherLoc)
# [<h1 class="CurrentConditions--location--2_osB">Hamilton, Ontario Weather</h1>]
As you can see, your class suffix kyTeL is not the same for me (2_osB); these hashed suffixes change when the site is rebuilt. You need a partial match on the class attribute (class*=, note the *).
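If the other class suffixes rotate too (they likely do, since they look machine-generated), the same partial-match fix applies across the whole script. A sketch, assuming each class keeps its stable CurrentConditions-- prefix:

import bs4, requests

weatherSite = requests.get('https://weather.com/en-CA/weather/today/l/eef019cb4dca2160f08eb9714e30f28e05e624bbae351ccb6a855dbc7f14f017')
weatherSoup = bs4.BeautifulSoup(weatherSite.text, 'html.parser')

# match on the stable prefix only; the hashed suffix changes between site builds
weatherLoc = weatherSoup.select('h1[class*="CurrentConditions--location--"]')
weatherTime = weatherSoup.select('[class*="CurrentConditions--timestamp--"]')
weatherTemp = weatherSoup.select('[class*="CurrentConditions--tempValue--"]')
weatherCondition = weatherSoup.select('[class*="CurrentConditions--phraseValue--"]')
weatherDet = weatherSoup.select('[class*="CurrentConditions--precipValue--"] > span:nth-child(1)')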
I'm trying to get the text of "Redeemed Highlight My Message" messages from Twitch chat. Here is my code:
from selenium import webdriver

driver = webdriver.Chrome(r'D:\Project\Project\Rebot Router\chromedriver11.exe')
driver.get("https://www.twitch.tv/nightblue3")
while True:
    text11 = driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span')
    text44 = driver.find_elements_by_class_name("chat-line--inline chat-line__message")
    print(str(text11))
    print(str(text44))
but when I run it, this is what I get:
[]
[]
[]
[]
[]
[]
[]
[]
[]
and when I use .text like this:
while True:
    text11 = driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span').text
    text44 = driver.find_elements_by_class_name("chat-line--inline chat-line__message").text
    print(str(text11))
    print(str(text44))
this is what I get:
Traceback (most recent call last):
File "D:/Project/Project/Rebot Router/test.py", line 7, in <module>
text11= driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span').text
AttributeError: 'list' object has no attribute 'text'
Any help would be appreciated.
By the way, text11 and text44 target the same messages; I just use an XPath for text11 and a class name for text44.
while True:
    Texts = driver.find_elements_by_xpath("//span[@class='text-fragment']")
    for x in range(0, len(Texts)):
        print(Texts[x].text)
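The key points: find_elements (plural) always returns a list, which may be empty and has no .text attribute, so you must call .text on each element; and find_elements_by_class_name accepts a single class name, so the space-separated compound "chat-line--inline chat-line__message" will never match. Note also that the find_elements_by_* helpers have been removed in newer Selenium releases. A sketch of the same approach with the current API (the text-fragment class comes from the answer above and may change with Twitch's markup):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.twitch.tv/nightblue3")

while True:
    # find_elements returns a (possibly empty) list of WebElements
    texts = driver.find_elements(By.XPATH, "//span[@class='text-fragment']")
    for element in texts:
        print(element.text)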
The very famous IndexError. Unfortunately, I really did not find a solution.
On the visit to the last URL I always get an error, whether the page is empty or not. This error occurs whether the range is 2 or 20.
text_file = open("Results-from-{}.txt".format(self.entry_get), "w")

### Iterator for end of the url
multiple_url = []
for iterator_page in range(15):
    iterator_page = iterator_page + 1
    multiple_url.append("".join([self.sub_url, str(iterator_page)]))

### loop for visit all 20 pages ###
parser = 0
while parser < len(multiple_url):
    print(multiple_url[parser])
    parser += 1
    with urllib.request.urlopen(multiple_url[parser]) as url:
        soup = BeautifulSoup(url, "html.parser")

        ### html tag parsing
        names = [name.get_text().strip() for name in soup.findAll("div", {"class": "name m08_name"})]
        street = [address.get_text().strip() for address in soup.findAll(itemprop="streetAddress")]
        plz = [address.get_text().strip() for address in soup.findAll(itemprop="postalCode")]
        city = [address.get_text().strip() for address in soup.findAll(itemprop="addressLocality")]

        ### zip and write
        for line in zip(names, street, plz, city):
            print("%s;%s;%s;%s;\n" % line)
            text_file.write("%s;%s;%s;%s;\n" % line)

### output of the path main: cwd_out_final
cwd = os.getcwd()
cwd_out = "\{}".format(text_file.name)
cwd_out_final = cwd + cwd_out
text_file.close()
My Error:
Exception in Tkinter callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/tkinter/__init__.py", line 1699, in __call__
return self.func(*args)
File "/Users/x/PycharmProjects/hackday/parser.py", line 55, in search_complete_inner
with urllib.request.urlopen(multiple_url[parser]) as url:
IndexError: list index out of range
Thank You!
You increment parser before using it as an index in the with statement; doing that on the last element will generate the error in question. Further, it means you never use the first element in the list.
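A minimal way to fix it (a sketch based on the loop in the question) is to move the increment to the end of the loop body, or simply iterate over the list directly:

parser = 0
while parser < len(multiple_url):
    print(multiple_url[parser])
    with urllib.request.urlopen(multiple_url[parser]) as url:
        soup = BeautifulSoup(url, "html.parser")
        # ... parse and write as before ...
    parser += 1  # increment only after the URL has been used

# or, more idiomatically, let Python do the indexing:
for page_url in multiple_url:
    with urllib.request.urlopen(page_url) as url:
        soup = BeautifulSoup(url, "html.parser")
        # ... parse and write as before ...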
I am trying to get the below program working. It is supposed to find email addresses on a website, but it is breaking. I suspect the problem is with initializing result = [] inside the crawl function. Below is the code:
# -*- coding: utf-8 -*-

import requests
import re
import urlparse

# In this example we're trying to collect e-mail addresses from a website

# Basic e-mail regexp:
# letter/number/dot/comma @ letter/number/dot/comma . letter/number
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')

# HTML <a> regexp
# Matches href="" attribute
link_re = re.compile(r'href="(.*?)"')

def crawl(url, maxlevel):
    result = []

    # Limit the recursion, we're not downloading the whole Internet
    if(maxlevel == 0):
        return

    # Get the webpage
    req = requests.get(url)

    # Check if successful
    if(req.status_code != 200):
        return []

    # Find and follow all the links
    links = link_re.findall(req.text)
    for link in links:
        # Get an absolute URL for a link
        link = urlparse.urljoin(url, link)
        result += crawl(link, maxlevel - 1)

    # Find all emails on current page
    result += email_re.findall(req.text)

    return result

emails = crawl('http://ccs.neu.edu', 2)

print "Scrapped e-mail addresses:"
for e in emails:
    print e
The error I get is below:
C:\Python27\python.exe "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py"
Traceback (most recent call last):
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 41, in <module>
emails = crawl('http://ccs.neu.edu', 2)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
TypeError: 'NoneType' object is not iterable
Process finished with exit code 1
Any suggestions will help. Thanks!
The problem is this:
if(maxlevel == 0):
return
Currently it returns None when maxlevel == 0, and you can't concatenate a list with a None object.
You need to return an empty list [] to be consistent.
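In other words, a minimal sketch of the corrected base case:

if(maxlevel == 0):
    return []  # an empty list concatenates cleanly into result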
I am trying to code an RSS news feeder bot for IRC. So I searched the web a little and came up with this code:
# this code is for local testing
import feedparser

feed_list = {}
channel = '#hackingdefined'

class Feed:
    def __init__(self, name, url):
        self.name = name
        self.url = url
        self.feeds = {}
        self.newest = ''

    def update(self):
        self.feeds = feedparser.parse(self.url)
        if self.newest != self.feeds['items'][0].title:
            self.newest = self.feeds['items'][0].title
            say('{}: {} '.format(self.name, self.newest))
            say('URL: {} '.format(self.feeds.entries[0].link))

def say(data=''):
    print('PRIVMSG ' + channel + ' :' + data + '\r\n')

def url_loader(txt):
    f = open(txt, 'r')
    for line in f:
        name, url = line.split(':', 1)  # split only once, the URL contains ':'
        print name + " " + url
        feed_list[name] = Feed(name, url)
    print feed_list

url_loader('feed_list.txt')

for feed in feed_list.values():
    print feed
    feed.update()
When I run the code, I get this error:
Traceback (most recent call last):
File "C:\Or\define\projects\rss feed\the progect\test.py", line 33, in <module>
feed.update()
File "C:\Or\define\projects\rss feed\the progect\test.py", line 14, in update
if self.newest != self.feeds['items'][0].title:
IndexError: list index out of range
Now the weird thing is, if I create a new Feed instance like test = Feed('example', 'http://rss.packetstormsecurity.com/')
and call test.update(), it all works fine, but the automation script raises an error.
So I checked my url_loader and the test file. The test file is something like this:
packet storm:http://rss.packetstormsecurity.com/
sans:http://www.sans.org/rss.php/
...
And it all seems fine to me. Anyone have a clue what this could be?
Thanks, Or
EDIT:
It's been solved; one of my URLs was wrong.
All seemed clear after a good night's sleep :-)
It's been solved: one of the URLs I put into the file was wrong.
The solution is to use try around every URL in the list.
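A sketch of that try approach: feedparser returns an empty 'items' list for a URL it cannot parse, so indexing [0] raises the IndexError seen above, and catching it lets the other feeds keep updating.

for feed in feed_list.values():
    try:
        feed.update()
    except IndexError:
        # feedparser.parse() returned no items for this URL,
        # so it is either wrong or temporarily unreachable
        print('skipping {}: no entries found'.format(feed.name))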