The very famous IndexError. Unfortunately, I really have not found a solution.
On the visit to the last URL I always get an error, whether the page is empty or not, and it happens whether the range is 2 or 20.
text_file = open("Results-from-{}.txt".format(self.entry_get), "w")

### iterator for the end of the URL
multiple_url = []
for iterator_page in range(15):
    iterator_page = iterator_page + 1
    multiple_url.append("".join([self.sub_url, str(iterator_page)]))

### loop to visit all pages ###
parser = 0
while parser < len(multiple_url):
    print(multiple_url[parser])
    parser += 1
    with urllib.request.urlopen(multiple_url[parser]) as url:
        soup = BeautifulSoup(url, "html.parser")

        ### HTML tag parsing
        names = [name.get_text().strip() for name in soup.findAll("div", {"class": "name m08_name"})]
        street = [address.get_text().strip() for address in soup.findAll(itemprop="streetAddress")]
        plz = [address.get_text().strip() for address in soup.findAll(itemprop="postalCode")]
        city = [address.get_text().strip() for address in soup.findAll(itemprop="addressLocality")]

        ### zip and write
        for line in zip(names, street, plz, city):
            print("%s;%s;%s;%s;\n" % line)
            text_file.write("%s;%s;%s;%s;\n" % line)

### output of the main path: cwd_out_final
cwd = os.getcwd()
cwd_out = "\{}".format(text_file.name)
cwd_out_final = cwd + cwd_out
text_file.close()
My Error:
Exception in Tkinter callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/tkinter/__init__.py", line 1699, in __call__
return self.func(*args)
File "/Users/x/PycharmProjects/hackday/parser.py", line 55, in search_complete_inner
with urllib.request.urlopen(multiple_url[parser]) as url:
IndexError: list index out of range
Thank You!
You increment parser before using it as an index in the with statement; on the last element that increment pushes the index past the end of the list, which raises the error in question. It also means you never process the first element of the list.
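For example, a minimal sketch of a corrected loop; the example.com URLs are hypothetical stand-ins for self.sub_url plus the page number:

import urllib.request
from bs4 import BeautifulSoup

# hypothetical page URLs standing in for self.sub_url + page number
multiple_url = ["https://example.com/page/{}".format(i) for i in range(1, 16)]

# Iterating directly over the list visits every URL exactly once,
# with no manual index that can run past the end of the list.
for page_url in multiple_url:
    print(page_url)
    with urllib.request.urlopen(page_url) as url:
        soup = BeautifulSoup(url, "html.parser")
        # ... parse the name/address tags and write the file as in the original code ...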
I'm new to web scraping with Python and am having a problem with the weather web scraping script I wrote. Here is the whole code, weather.py:
#! python3
import bs4, requests
weatherSite = requests.get('https://weather.com/en-CA/weather/today/l/eef019cb4dca2160f08eb9714e30f28e05e624bbae351ccb6a855dbc7f14f017')
weatherSoup = bs4.BeautifulSoup(weatherSite.text, 'html.parser')
weatherLoc = weatherSoup.select('.CurrentConditions--location--kyTeL')
weatherTime = weatherSoup.select('.CurrentConditions--timestamp--23dfw')
weatherTemp = weatherSoup.select('.CurrentConditions--tempValue--3a50n')
weatherCondition = weatherSoup.select('.CurrentConditions--phraseValue--2Z18W')
weatherDet = weatherSoup.select('.CurrentConditions--precipValue--3nxCj > span:nth-child(1)')
location = weatherLoc[0].text
time = weatherTime[0].text
temp = weatherTemp[0].text
condition = weatherCondition[0].text
det = weatherDet[0].text
print(location)
print(time)
print(temp + 'C')
print(condition)
print(det)
It basically parses the weather information from The Weather Channel and prints it out. This code was working fine yesterday when I wrote it, but when I tried it today it gave me the following error:
Traceback (most recent call last):
File "C:\Users\username\filesAndStuff\weather.py", line 16, in <module>
location = weatherLoc[0].text
IndexError: list index out of range
Replace:
weatherLoc = weatherSoup.select('.CurrentConditions--location--kyTeL')
# print(weatherLoc)
# []
By:
weatherLoc = weatherSoup.select('h1[class*="CurrentConditions--location--"]')
# print(weatherLoc)
# [<h1 class="CurrentConditions--location--2_osB">Hamilton, Ontario Weather</h1>]
As you can see, your suffix kyTeL is not the same as the one I get, 2_osB; these generated suffixes can change whenever the site is rebuilt. You need a partial match on the class attribute (class*=, note the *).
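The other hashed suffixes will likely rotate as well, so the same partial-match trick applies to the remaining selectors. A sketch, assuming the CurrentConditions-- prefixes are the stable part of each class name:

weatherTime = weatherSoup.select('[class*="CurrentConditions--timestamp--"]')
weatherTemp = weatherSoup.select('[class*="CurrentConditions--tempValue--"]')
weatherCondition = weatherSoup.select('[class*="CurrentConditions--phraseValue--"]')
weatherDet = weatherSoup.select('[class*="CurrentConditions--precipValue--"] > span:nth-child(1)')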
My script stops scraping after the 449th Yelp restaurant.
Entire Code: https://pastebin.com/5U3irKZp
for idx, item in enumerate(yelp_containers, 1):
    print("--- Restaurant number #", idx)
    restaurant_title = item.h3.get_text(strip=True)
    restaurant_title = re.sub(r'^[\d.\s]+', '', restaurant_title)
    restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
The error I am getting is:
Traceback (most recent call last):
File "/Users/kenny/MEGA/Python/yelp scraper.py", line 41, in
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
IndexError: list index out of range
The problem is that some restaurants are missing the address.
What you should do is first check whether the split result has enough elements before indexing it. Change this line of code:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')[1]
to these:
restaurant_address = item.select_one('[class*="secondaryAttributes"]').get_text(separator='|', strip=True).split('|')
restaurant_address = restaurant_address[1] if len(restaurant_address) > 1 else restaurant_address[0]
I ran your parser for all pages and it worked.
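As an extra safeguard, beyond the original fix: select_one itself returns None when the attribute block is missing entirely, so a fully defensive version could look like this (a sketch, not part of the tested answer):

attrs = item.select_one('[class*="secondaryAttributes"]')
if attrs is None:
    restaurant_address = ''  # listing has no secondary attributes at all
else:
    parts = attrs.get_text(separator='|', strip=True).split('|')
    restaurant_address = parts[1] if len(parts) > 1 else parts[0]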
I am trying to use a crawler to get IEEE paper keywords, but now I get an error.
How can I fix my crawler?
My code is here:
import re
import requests
import json
from bs4 import BeautifulSoup

ieee_content = requests.get("http://ieeexplore.ieee.org/document/8465981", timeout=180)
soup = BeautifulSoup(ieee_content.text, 'xml')
tag = soup.find_all('script')
for i in tag[9]:
    s = json.loads(re.findall('global.document.metadata=(.*;)', i)[0].replace("'", '"').replace(";", ''))
and the error is here:
Traceback (most recent call last):
File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 90, in <module>
a.get_es_data(offset=0, size=1)
File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 53, in get_es_data
self.get_data(link=ieee_link, esid=es_id)
File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 65, in get_data
s = json.loads(re.findall('global.document.metadata=(.*;)', i)[0].replace(";", '').replace("'", '"'))
IndexError: list index out of range
Here's another answer. I don't know what you are doing with s after the load, so in my code I just print it. The code below doesn't throw an error, but again, it depends on how you are using s.
import re
import requests
import json
from bs4 import BeautifulSoup

ieee_content = requests.get("http://ieeexplore.ieee.org/document/8465981", timeout=180)
soup = BeautifulSoup(ieee_content.text, 'xml')
tag = soup.find_all('script')

# i is a string child of the tenth <script> tag
for i in tag[9]:
    metadata_format = re.compile(r'global.document.metadata=.*', re.MULTILINE)
    metadata = re.findall(metadata_format, i)
    if len(metadata) != 0:
        # round-trip the list through JSON to get a plain string out
        convert_to_json = json.dumps(metadata)
        x = json.loads(convert_to_json)
        s = x[0].replace("'", '"').replace(";", '')
        ###########################################
        # I don't know what you plan to do with 's'
        ###########################################
        print(s)
Apparently, at line 65, some of the data provided in i did not match the regex pattern you're trying to use, so re.findall returned an empty list and your [0] index failed.
Solution:
x = re.findall('global.document.metadata=(.*;)', i)
if x:
    s = json.loads(x[0].replace("'", '"').replace(";", ''))
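With the guard in place, the [0] index is only evaluated when re.findall actually matched, so script tags without global.document.metadata are simply skipped instead of raising IndexError.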
I am trying to get the below program working. It is supposed to find email addresses on a website, but it is breaking. I suspect the problem is with initializing result = [] inside the crawl function. Below is the code:
# -*- coding: utf-8 -*-
import requests
import re
import urlparse

# In this example we're trying to collect e-mail addresses from a website

# Basic e-mail regexp:
# letter/number/dot/comma @ letter/number/dot/comma . letter/number
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')

# HTML <a> regexp
# Matches href="" attribute
link_re = re.compile(r'href="(.*?)"')

def crawl(url, maxlevel):
    result = []

    # Limit the recursion, we're not downloading the whole Internet
    if(maxlevel == 0):
        return

    # Get the webpage
    req = requests.get(url)

    # Check if successful
    if(req.status_code != 200):
        return []

    # Find and follow all the links
    links = link_re.findall(req.text)
    for link in links:
        # Get an absolute URL for a link
        link = urlparse.urljoin(url, link)
        result += crawl(link, maxlevel - 1)

    # Find all emails on current page
    result += email_re.findall(req.text)

    return result

emails = crawl('http://ccs.neu.edu', 2)

print "Scraped e-mail addresses:"
for e in emails:
    print e
The error I get is below:
C:\Python27\python.exe "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py"
Traceback (most recent call last):
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 41, in <module>
emails = crawl('http://ccs.neu.edu', 2)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
TypeError: 'NoneType' object is not iterable
Process finished with exit code 1
Any suggestions will help. Thanks!
The problem is this:
if(maxlevel == 0):
    return

Currently it returns None when maxlevel == 0, and you can't concatenate a list with a None object.
You need to return an empty list [] to be consistent.
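A quick self-contained demonstration of why the None return breaks the recursion, independent of the crawler itself:

result = []
result += []        # fine: concatenating an empty list leaves result unchanged
try:
    result += None  # this is what result += crawl(...) does when crawl returns None
except TypeError as e:
    print(e)        # 'NoneType' object is not iterable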
I have the following code that uses three strings ('us dollars', 'euro', '02-11-2014') and a number to calculate the exchange rate for that given date. I modified the code to pass those arguments, but I get an error when I try to call it with
python currencyManager.py "us dollars" "euro" 100 "02-11-2014"
Traceback (most recent call last):
File "currencyManager.py", line 37. in <module>
currencyManager(currTo,currFrom,currAmount,currDate)
NameError: name 'currTo' is not defined
I'm fairly new to Python so my knowledge is limited. Any help would be greatly appreciated. Thanks.
Also the version of Python I'm using is 3.4.2.
import urllib.request
from urllib.error import URLError
import re

def currencyManager(currTo, currFrom, currAmount, currDate):
    try:
        currency_to = currTo  # 'us dollars'
        currency_from = currFrom  # 'euro'
        currency_from_amount = currAmount
        on_date = currDate  # Day-Month-Year

        currency_from = currency_from.replace(' ', '+')
        currency_to = currency_to.replace(' ', '+')

        url = 'http://www.wolframalpha.com/input/?i=' + str(currency_from_amount) + '+' + str(currency_from) + '+to+' + str(currency_to) + '+on+' + str(on_date)
        req = urllib.request.Request(url)

        page_fetch = urllib.request.urlopen(req)
        output = page_fetch.read().decode('utf-8')

        search = '<area shape="rect.*href="\/input\/\?i=(.*?)\+.*?&lk=1'
        result = re.findall(r'' + search, output, re.S)
        if len(result) > 0:
            amount = float(result[0])
            print(str(amount))
        else:
            print('No match found')
    except URLError as e:
        print(e)

currencyManager(currTo, currFrom, currAmount, currDate)
The command line
python currencyManager.py "us dollars" "euro" 100 "02-11-2014"
does not automatically assign "us dollars" "euro" 100 "02-11-2014" to currTo, currFrom, currAmount, currDate.
Instead the command line arguments are stored in a list, sys.argv.
You need to parse sys.argv and/or pass its values on to the call to currencyManager:
For example, change
currencyManager(currTo,currFrom,currAmount,currDate)
to
import sys
currencyManager(*sys.argv[1:5])
The first element in sys.argv is the script name. Thus sys.argv[1:5] consists of the next 4 arguments after the script name (assuming 4 arguments were entered on the command line.) You may want to check that the right number of arguments are passed on the command line and that they are of the right type. The argparse module can help you here.
The * in *sys.argv[1:5] unpacks the list sys.argv[1:5] and passes the items in the list as arguments to the function currencyManager.
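For completeness, a sketch of what that argument handling could look like with argparse; the positional names simply mirror the function's parameters:

import argparse

parser = argparse.ArgumentParser(description='Convert a currency amount on a given date.')
parser.add_argument('currTo', help='target currency, e.g. "us dollars"')
parser.add_argument('currFrom', help='source currency, e.g. "euro"')
parser.add_argument('currAmount', type=float, help='amount to convert')
parser.add_argument('currDate', help='date as Day-Month-Year, e.g. 02-11-2014')
args = parser.parse_args()

currencyManager(args.currTo, args.currFrom, args.currAmount, args.currDate)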