My script doesn't scrape all of Yelp's restaurants - python

My script stops scraping after the 449th Yelp restaurant.
Entire Code: https://pastebin.com/5U3irKZp
for idx, item in enumerate(yelp_containers, 1):
    print("--- Restaurant number #", idx)
    restaurant_title = item.h3.get_text(strip=True)
    restaurant_title = re.sub(r'^[\d.\s]+', '', restaurant_title)
    restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
The error I am getting is:
Traceback (most recent call last):
File "/Users/kenny/MEGA/Python/yelp scraper.py", line 41, in
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
IndexError: list index out of range

The problem is that some restaurants are missing the address.
What you should do is first check whether the split result has enough elements before indexing it. Change this line of code:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
to these:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')
restaurant_address = restaurant_address[1] if len(restaurant_address) > 1 else restaurant_address[0]
I ran your parser for all pages and it worked.
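Note that if a listing has no secondaryAttributes element at all, select_one() returns None and the .get_text() call itself would raise an AttributeError. A slightly more defensive sketch of the same idea (untested against live Yelp markup, names as in the question):
attrs = item.select_one('[class*="secondaryAttributes"]')
if attrs is None:
    restaurant_address = ''  # no secondary-attributes block at all
else:
    parts = attrs.get_text(separator='|', strip=True).split('|')
    # Take the address part if present, otherwise fall back to the first part
    restaurant_address = parts[1] if len(parts) > 1 else parts[0]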

Related

BeautifulSoup4 and Requests Module 'IndexError: list index out of range'

I'm new to web scraping with Python and am having a problem with the weather web scraping script I wrote. Here is the whole code, 'weather.py':
#! python3
import bs4, requests
weatherSite = requests.get('https://weather.com/en-CA/weather/today/l/eef019cb4dca2160f08eb9714e30f28e05e624bbae351ccb6a855dbc7f14f017')
weatherSoup = bs4.BeautifulSoup(weatherSite.text, 'html.parser')
weatherLoc = weatherSoup.select('.CurrentConditions--location--kyTeL')
weatherTime = weatherSoup.select('.CurrentConditions--timestamp--23dfw')
weatherTemp = weatherSoup.select('.CurrentConditions--tempValue--3a50n')
weatherCondition = weatherSoup.select('.CurrentConditions--phraseValue--2Z18W')
weatherDet = weatherSoup.select('.CurrentConditions--precipValue--3nxCj > span:nth-child(1)')
location = weatherLoc[0].text
time = weatherTime[0].text
temp = weatherTemp[0].text
condition = weatherCondition[0].text
det = weatherDet[0].text
print(location)
print(time)
print(temp + 'C')
print(condition)
print(det)
It basically parses the weather information from 'The Weather Channel' and prints it out. This code was working fine yesterday when I wrote it, but when I tried today it gave me the following error:
Traceback (most recent call last):
File "C:\Users\username\filesAndStuff\weather.py", line 16, in <module>
location = weatherLoc[0].text
IndexError: list index out of range
Replace:
weatherLoc = weatherSoup.select('.CurrentConditions--location--kyTeL')
# print(weatherLoc)
# []
with:
weatherLoc = weatherSoup.select('h1[class*="CurrentConditions--location--"]')
# print(weatherLoc)
# [<h1 class="CurrentConditions--location--2_osB">Hamilton, Ontario Weather</h1>]
As you can see, your suffix kyTeL is not the same as mine (2_osB). These class-name suffixes are auto-generated and change over time, so you need a partial match on the class attribute (class*=, note the *).
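The same idea presumably applies to the other selectors in the script, since all of those suffixes are generated the same way (a sketch, untested against the live page):
# Match on the stable class-name prefixes; the generated suffixes
# (kyTeL, 23dfw, ...) change between builds of the page.
weatherLoc = weatherSoup.select('h1[class*="CurrentConditions--location--"]')
weatherTime = weatherSoup.select('[class*="CurrentConditions--timestamp--"]')
weatherTemp = weatherSoup.select('[class*="CurrentConditions--tempValue--"]')
weatherCondition = weatherSoup.select('[class*="CurrentConditions--phraseValue--"]')
weatherDet = weatherSoup.select('[class*="CurrentConditions--precipValue--"] > span:nth-child(1)')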

How to get text from Twitch redeem points using Selenium (Python)?

I'm trying to get the text of "Redeemed Highlight My Message" messages in Twitch chat; here is my code:
from selenium import webdriver

driver = webdriver.Chrome(r'D:\Project\Project\Rebot Router\chromedriver11.exe')
driver.get("https://www.twitch.tv/nightblue3")
while True:
    text11 = driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span')
    text44 = driver.find_elements_by_class_name("chat-line--inline chat-line__message")
    print(str(text11))
    print(str(text44))
but when I run it, this is what I get:
[]
[]
[]
[]
[]
[]
[]
[]
[]
and when I use .text like this:
while True:
    text11 = driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span').text
    text44 = driver.find_elements_by_class_name("chat-line--inline chat-line__message").text
    print(str(text11))
    print(str(text44))
this is what I get:
Traceback (most recent call last):
File "D:/Project/Project/Rebot Router/test.py", line 7, in <module>
text11= driver.find_elements_by_xpath('//*[@id="6583f0b7722e3be4537e78903686d3b4"]/div/div[1]/div/div/section/div/div[3]/div[2]/div[3]/div/div/div[116]/div[2]/span[4]/span').text
AttributeError: 'list' object has no attribute 'text'
Any help would be appreciated.
By the way, text11 and text44 are meant to find the same thing; I just use an XPath for text11 and a class name for text44.
while True:
    Texts = driver.find_elements_by_xpath("//span[@class='text-fragment']")
    for x in range(0, len(Texts)):
        print(Texts[x].text)
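Chat messages are rendered dynamically, so the list can legitimately be empty right after the page loads. One option (a sketch assuming a current Selenium install with the By/WebDriverWait API) is to wait until at least one text fragment is present before reading:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/nightblue3")

# Wait up to 30 seconds for at least one chat text fragment to appear.
texts = WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.XPATH, "//span[@class='text-fragment']"))
)
for t in texts:
    print(t.text)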

IndexError in a loop with BeautifulSoup

The very famous IndexError. Unfortunately, I really did not find a solution.
On the visit to the last URL I always get an error, whether the page is empty or not. This error occurs whether the range is 2 or 20.
text_file = open("Results-from-{}.txt".format(self.entry_get), "w")

### Iterator for the end of the url
multiple_url = []
for iterator_page in range(15):
    iterator_page = iterator_page + 1
    multiple_url.append("".join([self.sub_url, str(iterator_page)]))

### loop to visit all the pages ###
parser = 0
while parser < len(multiple_url):
    print(multiple_url[parser])
    parser += 1
    with urllib.request.urlopen(multiple_url[parser]) as url:
        soup = BeautifulSoup(url, "html.parser")

        ### html tag parsing
        names = [name.get_text().strip() for name in soup.findAll("div", {"class": "name m08_name"})]
        street = [address.get_text().strip() for address in soup.findAll(itemprop="streetAddress")]
        plz = [address.get_text().strip() for address in soup.findAll(itemprop="postalCode")]
        city = [address.get_text().strip() for address in soup.findAll(itemprop="addressLocality")]

        ### zip and write
        for line in zip(names, street, plz, city):
            print("%s;%s;%s;%s;\n" % line)
            text_file.write("%s;%s;%s;%s;\n" % line)

### output of the path main: cwd_out_final
cwd = os.getcwd()
cwd_out = "\{}".format(text_file.name)
cwd_out_final = cwd + cwd_out
text_file.close()
My Error:
Exception in Tkinter callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/tkinter/__init__.py", line 1699, in __call__
return self.func(*args)
File "/Users/x/PycharmProjects/hackday/parser.py", line 55, in search_complete_inner
with urllib.request.urlopen(multiple_url[parser]) as url:
IndexError: list index out of range
Thank You!
You increment parser before using it as an index in the with statement; doing that on the last element will generate the error in question. Further, it means you never use the first element in the list.
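A minimal sketch of the fix, keeping the names from the question: drop the manual index and iterate over the list directly, which removes the off-by-one bookkeeping altogether:
# Iterating directly visits every URL exactly once, first to last.
for page_url in multiple_url:
    print(page_url)
    with urllib.request.urlopen(page_url) as url:
        soup = BeautifulSoup(url, "html.parser")
        # ... parse and write as before ...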

TypeError: 'NoneType' object is not iterable: Webcrawler to scrape email addresses

I am trying to get the program below working. It is supposed to find email addresses on a website, but it is breaking. I suspect the problem is with initializing result = [] inside the crawl function. Below is the code:
# -*- coding: utf-8 -*-
import requests
import re
import urlparse

# In this example we're trying to collect e-mail addresses from a website
# Basic e-mail regexp:
# letter/number/dot/comma @ letter/number/dot/comma . letter/number
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')
# HTML <a> regexp
# Matches href="" attribute
link_re = re.compile(r'href="(.*?)"')

def crawl(url, maxlevel):
    result = []
    # Limit the recursion, we're not downloading the whole Internet
    if(maxlevel == 0):
        return
    # Get the webpage
    req = requests.get(url)
    # Check if successful
    if(req.status_code != 200):
        return []
    # Find and follow all the links
    links = link_re.findall(req.text)
    for link in links:
        # Get an absolute URL for a link
        link = urlparse.urljoin(url, link)
        result += crawl(link, maxlevel - 1)
    # Find all emails on current page
    result += email_re.findall(req.text)
    return result

emails = crawl('http://ccs.neu.edu', 2)
print "Scrapped e-mail addresses:"
for e in emails:
    print e
The error I get is below:
C:\Python27\python.exe "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py"
Traceback (most recent call last):
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 41, in <module>
emails = crawl('http://ccs.neu.edu', 2)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
TypeError: 'NoneType' object is not iterable
Process finished with exit code 1
Any suggestions will help. Thanks!
The problem is this:
if(maxlevel == 0):
    return
Currently it returns None when maxlevel == 0, and you can't concatenate a list with a None object.
You need to return an empty list [] to be consistent.
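That is, a one-line change to the early return:
# Return an empty list instead of None so result += crawl(...) always works
if(maxlevel == 0):
    return []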

IRC feedparser index out of range mysterious error

I am trying to code an RSS news feed bot for IRC. I searched the web a little and came up with this code:
#this code is for local testing
import feedparser

feed_list = {}
channel = '#hackingdefined'

class Feed:
    def __init__(self, name, url):
        self.name = name
        self.url = url
        self.feeds = {}
        self.newest = ''

    def update(self):
        self.feeds = feedparser.parse(self.url)
        if self.newest != self.feeds['items'][0].title:
            self.newest = self.feeds['items'][0].title
            say('{}: {} '.format(self.name, self.newest))
            say('URL: {} '.format(self.feeds.entries[0].link))

def say(data=''):
    print('PRIVMSG ' + channel + ' :' + data + '\r\n')

def url_loader(txt):
    f = open(txt, 'r')
    for line in f:
        name, url = line.split(':', 1)  # split on the first colon only
        print name + " " + url
        feed_list[name] = Feed(name, url)
    print feed_list

url_loader('feed_list.txt')
for feed in feed_list.values():
    print feed
    feed.update()
When I run the code, I get this error:
Traceback (most recent call last):
File "C:\Or\define\projects\rss feed\the progect\test.py", line 33, in <module>
feed.update()
File "C:\Or\define\projects\rss feed\the progect\test.py", line 14, in update
if self.newest != self.feeds['items'][0].title:
IndexError: list index out of range
Now the weird thing is, if I create a new Feed instance like test = Feed('example', 'http://rss.packetstormsecurity.com/')
and call test.update(), it all works fine, but the automated script raises an error.
So I checked my url_loader and the test file. The test file is something like this:
packet storm:http://rss.packetstormsecurity.com/
sans:http://www.sans.org/rss.php/
...
And it all seems fine to me. Does anyone have a clue what this could be?
Thanks, Or
EDIT:
It's been solved; one of my URLs was wrong.
All seems clear after a good night's sleep :-)
It's been solved; one of the URLs I put into the file was wrong.
The solution is to wrap every URL in the list in a try/except.
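A minimal sketch of that, reusing the loop from the question (which exceptions a bad feed actually raises is an assumption here; the question's traceback shows IndexError):
for feed in feed_list.values():
    print feed
    try:
        feed.update()
    except (IndexError, KeyError):
        # A bad or unreachable URL yields a parsed feed with no 'items',
        # so skip it instead of crashing the whole loop.
        print 'Could not update feed: ' + feed.name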
