I am a beginner in Python. Could someone point out why it keeps saying:
Traceback (most recent call last):
File "C:/Python27/practice example/datascraper templates.py", line 21, in <module>
print findPatTitle[i]
IndexError: list index out of range
Thanks a lot.
Here is the code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
webpage=urlopen('http://www.voxeu.org/').read()
patFinderTitle=re.compile('<title>(.*)</title>') ##title tag
patFinderLink=re.compile('<link rel.*href="(.*)"/>') ##link tag
findPatTitle=re.findall(patFinderTitle,webpage)
findPatLink=re.findall(patFinderLink,webpage)
listIterator=[]
listIterator=range(2,16)
for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print '/n'
The error message is perfectly descriptive.
You're trying to access a hard-coded range of indices (2,16) into findPatTitle, but you have no idea how many items there are.
When you want to iterate over multiple similar collections simultaneously, use zip().
for title, link in zip(findPatTitle, findPatLink):
    print 'Title={0} Link={1}'.format(title, link)
The problem is you have a different number of results than you expected. Don't hard-code that. But let's also rewrite this to be a bit more pythonic:
Replace this:
listIterator=[]
listIterator=range(2,16)
for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print '/n'
with the two lists zipped together:
for title, link in zip(findPatTitle, findPatLink):
    print title
    print link
    print '\n'  # note: '\n' is a newline; '/n' in the original is a typo
This will loop over both at once, however long the lists are; one element or 100 elements, it makes no difference.
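A minimal sketch (Python 3 print syntax, with made-up titles and links standing in for the scraped results) of how zip behaves when the two lists differ in length:

```python
# Hypothetical data standing in for findPatTitle and findPatLink:
titles = ['Home', 'About', 'News']
links = ['/home', '/about']  # one fewer link than titles

# zip stops at the shorter input, so no index can ever be out of range
pairs = list(zip(titles, links))
for title, link in pairs:
    print(title, link)
# prints:
# Home /home
# About /about
```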
Please help me fix this; below is the code I've already tried.
I really appreciate your help.
import urllib.request
import re
search_keyword="ill%20wiat"
html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
video_ids = re.findall(r"watch?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])
First of all, check the page you are trying to parse. You wrote:
r"watch?v=(\S{11})"
Remember that the ? character here is parsed as a regex operator (it makes the preceding character optional), not as the literal text you want, so you need to write it as:
/watch[?]v=(\S{11})
so that your regex is parsed properly.
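A quick sketch of the difference (the video id below is just an example string):

```python
import re

text = "/watch?v=dQw4w9WgXcQ"

# Unescaped: '?' makes the preceding 'h' optional, so 'watch?v=' means
# 'watc', optional 'h', then 'v=' - which never matches the literal text.
bad = re.findall(r"watch?v=(\S{11})", text)

# Escaped with [?] (or \?), the '?' is matched literally.
good = re.findall(r"watch[?]v=(\S{11})", text)

print(bad)   # []
print(good)  # ['dQw4w9WgXcQ']
```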
Second: it is good practice to print your list to see what you got, and to iterate over the list with a for loop instead of accessing index [0] directly.
In your case you get this error simply because your list of ids is empty.
The following code works for me:
import urllib.request
import re

search_keyword = "ill%20wiat"
url = "https://www.youtube.com/results?search_query=" + search_keyword
with urllib.request.urlopen(url) as response:
    video_ids = re.findall(r"/watch[?]v=(\S{11})", response.read().decode())

for video in video_ids:
    print("https://www.youtube.com/watch?v=" + video)
P.S. Don't wrap your code in try/except just to silence errors like this one.
urllib won't give you the data here; use requests instead:
import requests

html = requests.get('https://www.youtube.com/results?search_query=' + search_keyword)
text = html.text
text contains all the HTML data, so run your search against it.
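To sketch the idea offline (the snippet below is a made-up stand-in for html.text, since the real page markup may vary):

```python
import re

# Hypothetical fragment standing in for html.text:
text = 'href="/watch?v=abcdefghijk" ... href="/watch?v=ABCDEFGHIJK"'

# Escape the '?' so it is matched literally, then capture the 11-char ids
video_ids = re.findall(r"/watch[?]v=(\S{11})", text)
print(video_ids)  # ['abcdefghijk', 'ABCDEFGHIJK']
```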
I was hoping people on here would be able to answer what I believe to be a simple question. I'm a complete newbie and have been attempting to create an image webscraper from the site Archdaily. Below is my code so far after numerous attempts to debug it:
#### - Webscraping 0.1 alpha -
#### - Archdaily -
import requests
from bs4 import BeautifulSoup
# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'
# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content
# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
These elements here:
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
Attempts to isolate the data-images attribute, shown here:
My github upload of this portion, very long
As you can see (or maybe I'm completely wrong here), my attempt to pull the 'url_large' values from this final dictionary list leads to a TypeError, shown below:
Traceback (most recent call last):
File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 23, in <module>
for k, v in img_list():
TypeError: 'str' object is not callable
I believe my error lies in the resulting isolation of 'data-images', which to me looks like a dict within a list, as they're wrapped by brackets and curly braces. I'm completely out of my element here because I basically jumped into this project blind (haven't even read past chapter 4 of Guttag's book yet).
I also looked everywhere for ideas and tried to mimic what I found. I've found solutions others have offered previously to change the data to JSON data, so I found the code below:
jsonData = json.loads(img.attrs['data-images'])
print(jsonData['url_large'])
But that was a bust, shown here:
Traceback (most recent call last):
File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 29, in <module>
print(jsonData['url_large'])
TypeError: list indices must be integers or slices, not str
There is a step I'm missing here in changing these string values, but I'm not sure where I could change them. I'm hoping someone can help me resolve this issue, thanks!
It's all about the types.
img_list is actually not a list, but a string. You try to call it by img_list() which results in an error.
You had the right idea of turning it into a dictionary using json.loads. The error here is pretty straightforward: jsonData is a list, not a dictionary, because you have more than one image.
You can loop through the list. Each item in the list is a dictionary, and you'll be able to find the url_large attribute in each dictionary in the list:
images_json = img.attrs['data-images']
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
#infinity & #simic0de are both right, but I wanted to more explicitly address what I see in your code as well.
In this particular block:
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
There are a couple of errors here.
Even if img_list truly WERE a dictionary, you could not iterate through it this way. You would need img_list.items() (for Python 3) or img_list.iteritems() (for Python 2) in the second line.
When you use parentheses like that, it implies you're calling a function; but here you're trying to iterate through a dictionary, which is why you get the 'is not callable' error.
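A minimal sketch of that difference, with a made-up dictionary:

```python
d = {'url_large': 'https://example.com/large.jpg', 'alt': 'example'}

# d() would raise TypeError: 'dict' object is not callable.
# d.items() yields the (key, value) pairs:
for k, v in d.items():
    if k == 'url_large':
        print(v)  # https://example.com/large.jpg
```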
The other main issue is the Type issue. simic0de & Infinity address that, but ultimately you need to check the type of img_list and convert it as needed so you can iterate through it.
Source of error:
img_list is a string. You have to convert it to a list using json.loads; it then becomes a list of dicts that you can loop over.
Working Solution:
import json
import requests
from bs4 import BeautifulSoup
# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'
# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content
# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for img in json.loads(img_list):
    for k, v in img.items():
        if k == 'url_large':
            print(v)
I am facing difficulty in parsing the population count and appending it to a list.
from bs4 import *
import requests

def getPopulation(name):
    url="http://www.worldometers.info/world-population/"+name+"-population/"
    data=requests.get(url)
    soup=BeautifulSoup(data.text,"html.parser")
    #print(soup.prettify())
    x=soup.find_all('div',{"class":"col-md-8 country-pop-description"})
    y=x[0].find_all('strong')
    result=y[1].text
    return result

def main():
    no=input("Enter the number of countries : ")
    Map=[]
    for i in range(0,int(no)):
        country=input("Enter country : ")
        res=getPopulation(country)
        Map.append(res)
    print(Map)

if __name__ == "__main__":
    main()
The function works fine if I run it separately by passing a country name such as "india" as a parameter, but it shows an error when I run it as part of this program. I am a beginner in Python, so sorry for any silly mistakes.
Traceback (most recent call last):
File "C:/Users/Latheesh/AppData/Local/Programs/Python/Python36/Population Graph.py", line 24, in <module>
main()
File "C:/Users/Latheesh/AppData/Local/Programs/Python/Python36/Population Graph.py", line 19, in main
res=getPopulation(country)
File "C:/Users/Latheesh/AppData/Local/Programs/Python/Python36/Population Graph.py", line 10, in getPopulation
y=x[0].find_all('strong')
IndexError: list index out of range
I just ran your code for the sample cases (india and china) and ran into no issue. The reason you'd get the IndexError is if there are no results for find_all, in which case the result is [] (so there is no 0th element).
To fix your code you need a "catch" to confirm there are results. Here's a basic way to do that:
def getPopulation(name):
    ...
    x=soup.find_all('div',{"class":"col-md-8 country-pop-description"})
    if x:
        y=x[0].find_all('strong')
        result=y[1].text
    else:
        result = "No results found."
    return result
A cleaner way to write that, eliminating the unnecessary holder variables (e.g. y) and using a ternary operator:
def getPopulation(name):
    ...
    x=soup.find_all('div',{"class":"col-md-8 country-pop-description"})
    return x[0].find_all('strong')[1].text if x else "No results found."
A few other notes about your code:
It's best to use returns for all of your functions. For main(), instead of using print(Map), you should use return Map.
Style convention in Python calls for variable names to be lowercase (note that map itself is a built-in, so a name like populations avoids shadowing it), and there should be a blank line before your return line (as in the shortened getPopulation() above). I suggest reviewing PEP 8 to learn more about style norms and making code easier to read.
For url it's better practice to use string formatting to insert your variables. For example, "http://www.worldometers.info/world-population/{}-population/".format(name)
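For instance, a small sketch of that formatting suggestion:

```python
name = "india"

# str.format inserts the variable without manual string concatenation
url = "http://www.worldometers.info/world-population/{}-population/".format(name)
print(url)  # http://www.worldometers.info/world-population/india-population/
```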
I wrote a script that parses a webpage and gets the number of links ('a' tags) on it:
import urllib
import lxml.html

connection = urllib.urlopen('http://test.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
The output of the script:
./01.html
./52.html
./801.html
http://www.blablabla.com/1.html
#top
How can I convert it to a list to count the number of links? I used link.split(), but it gave me:
['./01.html']
['./52.html']
['./801.html']
['http://www.blablabla.com/1.html']
['#top']
But I want to get:
[./01.html, ./52.html, ./801.html, http://www.blablabla.com/1.html, #top]
Thanks!
link.split() tries to split each link string by itself. Instead, you need to work with the object that represents all the links: in your case, the result of dom.xpath('//a/@href').
So this must help you:
links = list(dom.xpath('//a/@href'))
And getting length with a built-in len function:
print len(links)
list(dom.xpath('//a/@href'))
This takes the sequence that dom.xpath returns and puts every item into a list.
I found this great answer on how to check whether any of a list of strings appears in a line:
How to check if a line has one of the strings in a list?
But trying to do a similar thing with the keys of a dict does not seem to work for me:
import urllib2

url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')
currencies = {"DKK": [], "SEK": []}
print currencies.keys()

testCounter = 0
for line in url_info:
    if any(countryCode in line for countryCode in currencies.keys()):
        testCounter += 1
    if "DKK" in line or "SEK" in line:
        print line

print "testCounter is %i and should be 2 - if not debug the code" % (testCounter)
The output:
['SEK', 'DKK']
<code>DKK</code>
<code>SEK</code>
testCounter is 377 and should be 2 - if not debug the code
I think perhaps my problem is that .keys() gives me an array rather than a list, but I haven't figured out how to convert it.
change:
any(countryCode in line for countryCode in currencies.keys())
to:
any([countryCode in line for countryCode in currencies.keys()])
Your original code uses a generator expression whereas (I think) your intention is a list comprehension.
see: Generator Expressions vs. List Comprehension
UPDATE:
I found that using an IPython interpreter with pylab imported, I got the same results as you did (377 counts versus the anticipated 2). I realized the issue was that any came from the numpy package, which is meant to work on an array.
Next, I loaded an IPython interpreter without pylab, so that any was the builtin. In that case your original code works.
So if you're using an IPython interpreter, type:
help(any)
and make sure it is from the builtin module. If so, your original code should work fine.
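A small sketch of the pitfall (assuming numpy is installed): numpy's any does not consume a generator; it wraps the generator object in a 0-d object array, and a generator object is always truthy.

```python
import numpy as np

data = [1, 2, 3]

# Builtin any consumes the generator and checks each boolean:
builtin_result = any(x > 10 for x in data)

# numpy.any only sees the generator object itself, which is truthy:
numpy_result = bool(np.any(x > 10 for x in data))

print(builtin_result)  # False
print(numpy_result)    # True
```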
This is not a very good way to examine an XML file.
It's slow. You are making potentially N*M substring searches where N is the number of lines and M is the number of keys.
XML is not a line-oriented text format. Your substring searches could find attribute names or element names too, which is probably not what you want. And if the XML file happens to put all its elements on one line with no whitespace (common for machine-generated and -processed XML) you will get fewer matches than you expect.
If you have line-oriented text input, I suggest you construct a regex from your list of keys:
import re

linetester = re.compile('|'.join(re.escape(key) for key in currencies))
for match in linetester.finditer(entire_text):
    print match.group(0)

# or, if entire_text is too long and you want to consume it iteratively:
for line in entire_text:
    for match in linetester.finditer(line):
        print match.group(0)
However, since you have XML, you should use an actual XML processor:
import xml.etree.cElementTree as ET

forex = ET.parse('forex.xml').getroot()  # parse the downloaded feed first
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        print elem.text
If you are only interested in what codes are present and don't care about the particular entry you can use set intersection:
codes = frozenset(e.text for e in forex.findall('data/code'))
print codes & frozenset(currencies)