List appending and web crawling difficulty in python

I am facing a difficulty in parsing the population count and appending it to a list:
from bs4 import *
import requests
def getPopulation(name):
    url="http://www.worldometers.info/world-population/"+name+"-population/"
    data=requests.get(url)
    soup=BeautifulSoup(data.text,"html.parser")
    #print(soup.prettify())
    x=soup.find_all('div',{"class":"col-md-8 country-pop-description"})
    y=x[0].find_all('strong')
    result=y[1].text
    return result
def main():
    no=input("Enter the number of countries : ")
    Map=[]
    for i in range(0,int(no)):
        country=input("Enter country : ")
        res=getPopulation(country)
        Map.append(res)
    print(Map)
if __name__ == "__main__":
    main()
The function works fine if I run it separately, passing a country name such as "india" as a parameter, but it raises an error when I run it as part of this program. I am a beginner in Python, so sorry for any silly mistakes.
Traceback (most recent call last):
  File "C:/Users/Latheesh/AppData/Local/Programs/Python/Python36/Population Graph.py", line 24, in <module>
    main()
  File "C:/Users/Latheesh/AppData/Local/Programs/Python/Python36/Population Graph.py", line 19, in main
    res=getPopulation(country)
  File "C:/Users/Latheesh/AppData/Local/Programs/Python/Python36/Population Graph.py", line 10, in getPopulation
    y=x[0].find_all('strong')
IndexError: list index out of range

I just ran your code for the sample cases (india and china) and ran into no issues. You would get the IndexError if find_all returns no results: the result is then [] (so there is no 0th element). This can happen, for instance, if the name you enter does not lead to a page containing that div.
To fix your code you need a check that confirms there are results. Here's a basic way to do that:
def getPopulation(name):
    ...
    x=soup.find_all('div',{"class":"col-md-8 country-pop-description"})
    if x:
        y=x[0].find_all('strong')
        result=y[1].text
    else:
        result = "No results found."
    return result
A cleaner way to write that, eliminating the unnecessary holder variables (e.g. y) and using a ternary operator:
def getPopulation(name):
    ...
    x=soup.find_all('div',{"class":"col-md-8 country-pop-description"})

    return x[0].find_all('strong')[1].text if x else "No results found."
A few other notes about your code:
It's best to use returns for all of your functions. For main(), instead of using print(Map), you should use return Map
Style convention in Python calls for variable names to be lowercase (e.g. Map should be lowercase; a name such as populations is safer than map, which would shadow the built-in map function), and there should be a blank line before your return line (as in the shortened getPopulation() above). I suggest reviewing PEP 8 to learn more about style norms and making code easier to read.
For url it's better practice to use string formatting to insert your variables. For example, "http://www.worldometers.info/world-population/{}-population/".format(name)
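As a rough illustration of the notes above, the sketch below builds the URL with str.format and has the collecting function return its list instead of printing it. The names build_url and collect_populations and the fetch parameter are assumptions made for this example, not part of the original code; fetch stands in for getPopulation so the sketch runs without network access.

```python
def build_url(name):
    """Build the country page URL with str.format instead of concatenation."""
    return "http://www.worldometers.info/world-population/{}-population/".format(name)


def collect_populations(countries, fetch):
    """Return a list with one result per country.

    `fetch` is a hypothetical callable (for example getPopulation) that
    takes a country name and returns its population text.
    """
    populations = []  # lowercase name that does not shadow a built-in
    for country in countries:
        populations.append(fetch(country))
    return populations  # return the list; the caller decides whether to print


if __name__ == "__main__":
    # Stand-in fetcher (an assumption) that only echoes the name it was given.
    def placeholder_fetch(name):
        return "<population of {}>".format(name)

    print(build_url("india"))
    print(collect_populations(["india", "china"], placeholder_fetch))
```

Passing the fetcher in as an argument is only one way to arrange this; it keeps the list-building logic testable without hitting the website.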

Related

Unable to retrieve value from dictionary after webscraping

I was hoping people on here would be able to answer what I believe to be a simple question. I'm a complete newbie and have been attempting to create an image webscraper from the site Archdaily. Below is my code so far after numerous attempts to debug it:
#### - Webscraping 0.1 alpha -
#### - Archdaily -
import requests
from bs4 import BeautifulSoup
# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'
# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content
# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
These lines here:
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
attempt to isolate the data-images attribute, shown here:
My github upload of this portion, very long
As you can see, or maybe I'm completely wrong here, my attempt to read the 'url_large' values from this final dictionary list results in a TypeError, shown below:
Traceback (most recent call last):
File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 23, in <module>
for k, v in img_list():
TypeError: 'str' object is not callable
I believe my error lies in the resulting isolation of 'data-images', which to me looks like a dict within a list, as they're wrapped by brackets and curly braces. I'm completely out of my element here because I basically jumped into this project blind (haven't even read past chapter 4 of Guttag's book yet).
I also looked everywhere for ideas and tried to mimic what I found. I've found solutions others have offered previously to change the data to JSON data, so I found the code below:
jsonData = json.loads(img.attrs['data-images'])
print(jsonData['url_large'])
But that was a bust, shown here:
Traceback (most recent call last):
File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 29, in <module>
print(jsonData['url_large'])
TypeError: list indices must be integers or slices, not str
There is a step I'm missing here in changing these string values, but I'm not sure where I could change them. I'm hoping someone can help me resolve this issue, thanks!

It's all about the types.
img_list is actually not a list, but a string. You try to call it with img_list(), which results in an error.
You had the right idea of parsing it using json.loads. The error here is pretty straightforward: jsonData is a list, not a dictionary, because there is more than one image.
You can loop through the list. Each item in the list is a dictionary, and you'll be able to find the url_large key in each dictionary in the list:
images_json = img.attrs['data-images']
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
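To make the type change visible, here is a small self-contained sketch. The images_json value is a made-up stand-in for img.attrs['data-images'] (the real attribute has more keys per image, and the example.com URLs are placeholders); only the url_large key comes from the question.

```python
import json

# Hypothetical stand-in for img.attrs['data-images']: a JSON array of objects.
images_json = (
    '[{"url_large": "https://example.com/a.jpg"},'
    ' {"url_large": "https://example.com/b.jpg"}]'
)

print(type(images_json).__name__)  # the raw attribute value is a str

images = json.loads(images_json)
print(type(images).__name__)       # after parsing it is a list
print(type(images[0]).__name__)    # and each element is a dict

# Collect the url_large value from every dictionary in the list.
large_urls = [image["url_large"] for image in images]
print(large_urls)
```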

@infinity & @simic0de are both right, but I wanted to more explicitly address what I see in your code as well.
In this particular block:
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
There are a couple of errors.
If 'img_list' truly WAS a dictionary, you could not iterate through it this way. You would need to use img_list.items() (for python3) or img_list.iteritems() (python2) in the second line.
When you use the parentheses like that, it implies that you're calling a function. But here, you're trying to iterate through a dictionary. That is why you get the 'is not callable' error.
The other main issue is the type issue. simic0de & Infinity address that, but ultimately you need to check the type of img_list and convert it as needed so you can iterate through it.
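One possible way to "check the type and convert as needed" is sketched below. The function name large_urls and the url_small key are assumptions for illustration; the point is the isinstance checks and the use of .items() rather than calling the object.

```python
import json


def large_urls(data):
    """Return every url_large value found in data, converting it as needed."""
    if isinstance(data, str):    # raw attribute text: parse it first
        data = json.loads(data)
    if isinstance(data, dict):   # a single image: wrap it in a list
        data = [data]
    urls = []
    for image in data:
        for key, value in image.items():  # .items(), not image()
            if key == "url_large":
                urls.append(value)
    return urls


print(large_urls('[{"url_large": "u1", "url_small": "s1"}]'))
print(large_urls({"url_large": "u2"}))
```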

Source of error:
img_list is a string. You have to convert it to a list using json.loads; it then becomes a list of dicts that you have to loop over.
Working Solution:
import json
import requests
from bs4 import BeautifulSoup
# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'
# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content
# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for img in json.loads(img_list):
    for k, v in img.items():
        if k == 'url_large':
            print(v)

Python-How to execute code and store into variable?

So I have been struggling with this issue for what seems like forever now (I'm pretty new to Python). I am using Python 3.7 (need it to be 3.7 due to variations in the versions of packages I am using for the project) to develop an AI chatbot system that can converse with you based on your text input. The program reads the contents of a series of .yml files when it starts. In one of the .yml files I am developing a syntax for when the first 5 characters match a ^###^ pattern, it will instead execute the code and return the result of that execution rather than just output text back to the user. For example:
Normal Conversation:
- - What is AI?
  - Artificial Intelligence is the branch of engineering and science devoted to constructing machines that think.
Service/Code-based conversation:
- - Say hello to me
  - ^###^print("HELLO")
The idea is that when you ask it to say hello to you, the ^###^print("HELLO") string will be retrieved from the .yml file and the first 5 characters of the response will be removed. The rest will be sent to a separate function in the Python code, which will run the code and return the result, so that the user gets the nice, clean result of HELLO. I realize that this may be a bit hard to follow, but I will straighten up my code and condense everything once I have this whole error resolved. As a side note: Oracle is just what I am calling the project. I'm not trying to weave Java into this whole mess.
THE PROBLEM is that it does not store the result of the code being run/executed/evaluated into the variable like it should.
My code:
def executecode(input):
    print("The code to be executed is: ",input)
    #note: the input may occasionally have single quotes and/or double quotes in the input string
    result = eval("{}".format(input))
    print ("The result of the code eval: ", result)
    test = eval("2+2")
    test
    print(test)
    return result
@app.route("/get")
def get_bot_response():
    userText = request.args.get('msg')
    print("Oracle INTERPRETED input: ", userText)
    ChatbotResponse = str(english_bot.get_response(userText))
    print("CHATBOT RESPONSE VARIABLE: ", ChatbotResponse)
    #The interpreted string was a request due to the ^###^ pattern in front of the response in the custom .yml file
    if ChatbotResponse[:5] == '^###^':
        print("---SERVICE REQUEST---")
        print(executecode(ChatbotResponse[5:]))
        interpreter_response = executecode(ChatbotResponse[5:])
        print("Oracle RESPONDED with: ", interpreter_response)
    else:
        print("Oracle RESPONDED with: ", ChatbotResponse)
    return ChatbotResponse
When I run this code, this is the output:
Oracle INTERPRETED input: How much RAM do you have?
CHATBOT RESPONSE VARIABLE: ^###^print("HELLO")
---SERVICE REQUEST---
The code to be executed is: print("HELLO")
HELLO
The result of the code eval: None
4
None
The code to be executed is: print("HELLO")
HELLO
The result of the code eval: None
4
Oracle RESPONDED with: None
Output on the website interface
Essentially, need it to say HELLO for the "The result of the code eval:" output. This should get it to where the chatbot responds with HELLO in the web interface, which is the end goal here. It seems as if it IS executing the code due to the HELLO's after the "The code to be executed is:" output text. It's just not storing it into a variable like I need it to.
I have tried eval, exec, ast.literal_eval(), converting the input to string with str(), changing up the single and double quotes, putting \ before pairs of quotes, and a few other things. Whenever I get it to where the program interprets "print("HELLO")" when it executes the code, it complains about the syntax. Also, from several days of looking online I have figured out that exec and eval aren't generally favored due to a bunch of issues, however I genuinely do not care about that at the moment because I am trying to make something that works before I make something that is good and works. I have a feeling the problem is something small and stupid like it always is, but I have no idea what it could be. :(
I used these 2 resources as the foundation for the whole chatbot project:
Text Guide
Youtube Guide
Also, I am sorry for the rather lengthy and descriptive question. It's rare that I have to ask a question of my own on stackoverflow because if I have a question, it usually already has a good answer. It feels like I've tried everything at this point. If you have a better suggestion of how to do this whole system or you think I should try approaching this another way, I'm open to ideas.
Thank you for any/all help. It is very much appreciated! :)

The issue is that python's print() doesn't have a return value, meaning it will always return None. eval simply evaluates some expression, and returns back the return value from that expression. Since print() returns None, an eval of some print statement will also return None.
>>> from_print = print('Hello')
Hello
>>> from_eval = eval("print('Hello')")
Hello
>>> from_print is from_eval is None
True
What you need is an I/O stream manager! Here is a possible solution that captures anything written to stdout or stderr and returns it if the expression evaluates to None.
from contextlib import redirect_stdout, redirect_stderr
from io import StringIO
# NOTE: I use the arg name `code` since `input` is a python builtin
def executecodehelper(code):
    # Capture all potential output from the code
    stdout_io = StringIO()
    stderr_io = StringIO()
    with redirect_stdout(stdout_io), redirect_stderr(stderr_io):
        # If `code` is already a string, this should work just fine without the need for formatting.
        result = eval(code)
    return result, stdout_io.getvalue(), stderr_io.getvalue()
def executecode(code):
    result, std_out, std_err = executecodehelper(code)
    if result is None:
        # This code didn't return anything. Maybe it printed something?
        if std_out:
            return std_out.rstrip() # Deal with trailing whitespace
        elif std_err:
            return std_err.rstrip()
        else:
            # Nothing was printed AND the return value is None!
            return None
    else:
        return result
As a final note, this approach is heavily linked to eval, since eval can only evaluate a single expression. If you want to extend your bot to multi-line statements, you will need to use exec, which always returns None, so the logic changes. Here's a great resource detailing the differences between eval and exec: What's the difference between eval, exec, and compile?
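For the multi-line case, a minimal sketch of the exec variant might look like this. The name execute_block and the sample snippet are assumptions for illustration. Since exec returns None, the captured output is the only result handed back, and note that this is not a sandbox: the executed code can still do anything Python can.

```python
from contextlib import redirect_stdout
from io import StringIO


def execute_block(code):
    """Run (possibly multi-line) code with exec and return what it printed."""
    buffer = StringIO()
    namespace = {}  # separate globals dict so the snippet's names do not leak into ours
    with redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue().rstrip()


# A hypothetical multi-line response body, as it might follow the ^###^ prefix.
snippet = "total = 0\nfor n in range(1, 4):\n    total += n\nprint('HELLO', total)"
print(execute_block(snippet))
```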

It is easy: create a new list and add the updated values of that variable to it. For example, suppose you have a variable named myVar that stores the values (or even the questions, it does not matter).
1- First declare a new list in your code as below:
myList = []
2- If you need to answer or display the value through myVar, then you can do it as below:
myList.append(myVar)
This applies if you have something like a generator producing the values. If you need the opposite, meaning the values are already stored, then update the second step as follows (the list must already contain those positions, otherwise the assignment raises an IndexError):
myList[0]='The first answer of the first question'
myList[1]='The second answer of the second question'
Here all the values will be stored in your list. You can also do this another way; for example, using a loop is much better if you have multiple values or answers.

python-for-list index out of range

I am a beginner in Python. Could someone point out why it keeps saying:
Traceback (most recent call last):
  File "C:/Python27/practice example/datascraper templates.py", line 21, in <module>
    print findPatTitle[i]
IndexError: list index out of range
Thanks a lot.
Here is the code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
webpage=urlopen('http://www.voxeu.org/').read()
patFinderTitle=re.compile('<title>(.*)</title>') ##title tag
patFinderLink=re.compile('<link rel.*href="(.*)"/>') ##link tag
findPatTitle=re.findall(patFinderTitle,webpage)
findPatLink=re.findall(patFinderLink,webpage)
listIterator=[]
listIterator=range(2,16)
for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print '/n'

The error message is perfectly descriptive.
You're trying to access a hard-coded range of indices (2,16) into findPatTitle, but you have no idea how many items there are.
When you want to iterate over multiple similar collections simultaneously, use zip().
for title, link in zip(findPatTitle, findPatLink):
    print 'Title={0} Link={1}'.format(title, link)

The problem is you have a different number of results than you expected. Don't hard-code that. But let's also rewrite this to be a bit more pythonic:
Replace this:
listIterator=[]
listIterator=range(2,16)
for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print '/n'
with the two lists zipped together:
for title, link in zip(findPatTitle, findPatLink):
    print title
    print link
    print '\n'  # '\n' is the newline escape; '/n' prints the literal characters
This will loop over both at once, however long the list is. 1 element or 100 elements, it makes no difference.
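The same idea in Python 3 syntax, as a small sketch with made-up lists of unequal length (the titles and example.com links are placeholders, not scraped data). Keep in mind that zip silently drops the unmatched items of the longer list.

```python
titles = ["First title", "Second title", "Third title"]
links = ["http://example.com/1", "http://example.com/2"]

# zip stops at the shorter list, so no index can run out of range.
pairs = list(zip(titles, links))
for title, link in pairs:
    print("Title={0} Link={1}".format(title, link))

# To skip leading entries (the question started at index 2), slice first.
# Slicing past the end of a list yields an empty list instead of an error.
tail = list(zip(titles[2:], links[2:]))
print(tail)
```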

Google search with python is sporadically non-accurate and has Type Errors

I am using some code I found here on SO to google search a set of strings and return the "expected" amount of results. Here is that code:
for a in months:
    for b in range(1, daysInMonth[a] + 1):
        #Code
        if not myString:
            googleStats.append(None)
        else:
            try:
                query = urllib.urlencode({'q': myString})
                url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
                search_response = urllib.urlopen(url)
                search_results = search_response.read()
                results = json.loads(search_results)
                data = results['responseData']
                googleStats.append(data['cursor']['estimatedResultCount'])
            except TypeError:
                googleStats.append(None)
for x in range(0, len(googleStats)):
    if googleStats[x] != None:
        finalGoogleStats.append(googleStats[x])
There are two problems, which may be related. When I return the len(finalGoogleStats), it's different every time. One time it's 37, then it's 12. However, it should be more like 240.
This is the TypeError I receive when I take out the try/except:
TypeError: 'NoneType' object has no attribute '__getitem__'
which occurs on line
googleStats.append(data['cursor']['estimatedResultCount'])
So, I just can't figure out why the number of Nones in googleStats changes every time and it's never as low as it should be. If anyone has any ideas, I'd love to hear them, thanks!
UPDATE
When I try to print out data for everything I'm searching, I get a ton of Nones and very, very few actual JSON dictionaries. The dictionaries I do get are spread out across all the searches; I don't see a pattern in what is a None and what isn't. So, the problem looks like it has more to do with the Google API than anything else.

First, I'd say remove your try..except clause and see where exactly the problem is. Then, as a general good practice, when you access layers of dictionary elements, use the .get() method instead for better control.
As a demonstration of your possible TypeError, here is my educated guess:
>>> a = {}
>>> a['lol'] = None
>>> a['lol']['teemo']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object has no attribute '__getitem__'
>>>
There are ways to use .get(), for a simple demonstration:
>>> a = {}
>>> b = a.get('lol') # will return None
>>> if type(b) is dict: # determine type
...     print b.get('teemo') # same technique if b is indeed of type dict
...
>>>
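Applied to the response structure in the question, a layered .get() lookup could be sketched as follows (Python 3 syntax). The function name estimated_count and both sample dictionaries are assumptions; in particular, the shape of a failed response (responseData set to None) is inferred from the TypeError in the question, not taken from API documentation.

```python
def estimated_count(results):
    """Return estimatedResultCount, or None if any layer is missing or None."""
    data = results.get("responseData") or {}  # a None layer becomes an empty dict
    cursor = data.get("cursor") or {}
    return cursor.get("estimatedResultCount")


# Hypothetical responses: one successful, one where responseData is None.
ok = {"responseData": {"cursor": {"estimatedResultCount": "123"}}}
failed = {"responseData": None}

print(estimated_count(ok))
print(estimated_count(failed))
```

With this shape, no try/except around a TypeError is needed, and a missing layer is reported as None rather than hidden by a broad exception handler.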

The answer is what I had been fearing for a while, but thanks to everyone who tried to help; I upvoted you if anything was useful.
So, Google seems to randomly freak out that I'm searching so much stuff. Here's the error they give me:
Suspected Terms of Service Abuse ...... responseStatus:403
So, I guess they put limits on how much I can search with them. What is still strange, though, is that it doesn't happen all the time; I still get sporadic successful searches within the sea of errors. That is still a mystery...

By default the Google API returns the smallest number of results. If you want to increase the number of results returned, add another parameter, 'rsz=8', to your URL (by default rsz=1, hence the small result).
So your new URL becomes:
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=8&%s' % query
See the detailed documentation here: https://developers.google.com/web-search/docs/reference#_class_GSearch

Python: If dict keys in line

Found this great answer on how to check if a list of strings are within a line
How to check if a line has one of the strings in a list?
But trying to do a similar thing with keys in a dict does not seem to do the job for me:
import urllib2
url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')
currencies = {"DKK": [], "SEK": []}
print currencies.keys()
testCounter = 0
for line in url_info:
    if any(countryCode in line for countryCode in currencies.keys()):
        testCounter += 1
    if "DKK" in line or "SEK" in line:
        print line
print "testCounter is %i and should be 2 - if not debug the code" % (testCounter)
The output:
['SEK', 'DKK']
<code>DKK</code>
<code>SEK</code>
testCounter is 377 and should be 2 - if not debug the code
I think that perhaps my problem is that .keys() gives me an array rather than a list, but I haven't figured out how to convert it.

change:
any(countryCode in line for countryCode in currencies.keys())
to:
any([countryCode in line for countryCode in currencies.keys()])
Your original code uses a generator expression whereas (I think) your intention is a list comprehension.
see: Generator Expressions vs. List Comprehension
UPDATE:
I found that using an IPython interpreter with pylab imported, I got the same results as you did (377 counts versus the anticipated 2). I realized the issue was that 'any' came from the numpy package, which is meant to work on an array.
Next, I loaded an IPython interpreter without pylab, so that 'any' was the builtin. In this case your original code works.
So if you're using an IPython interpreter, type:
help(any)
and make sure it is from the builtin module. If so, your original code should work fine.
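If you want to be independent of whatever a star import has put into the namespace, one option is to call the builtin explicitly. A minimal sketch in Python 3 syntax follows; the lines list is made-up sample input, not the real feed.

```python
import builtins

currencies = {"DKK": [], "SEK": []}
# Hypothetical sample lines standing in for the lines read from the feed.
lines = [
    "<code>USD</code>",
    "<code>DKK</code>",
    "<description>Krone</description>",
    "<code>SEK</code>",
]

# builtins.any is always the builtin, even if another `any` (for example
# from pylab) has replaced the plain name in the current namespace.
matches = [
    line for line in lines
    if builtins.any(code in line for code in currencies)
]
print(len(matches))
print(matches)
```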

This is not a very good way to examine an xml file.
It's slow. You are making potentially N*M substring searches where N is the number of lines and M is the number of keys.
XML is not a line-oriented text format. Your substring searches could find attribute names or element names too, which is probably not what you want. And if the XML file happens to put all its elements on one line with no whitespace (common for machine-generated and -processed XML) you will get fewer matches than you expect.
If you have line-oriented text input, I suggest you construct a regex from your list of keys:
import re
linetester = re.compile('|'.join(re.escape(key) for key in currencies))
for match in linetester.finditer(entire_text):
    print match.group(0)
#or if entire_text is too long and you want to consume iteratively:
for line in url_info:  # iterate over the file-like object line by line
    for match in linetester.finditer(line):
        print match.group(0)
However, since you have XML, you should use an actual XML processor:
import xml.etree.cElementTree as ET
forex = ET.parse(url_info)  # parse the response object opened in the question
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        print elem.text
If you are only interested in what codes are present and don't care about the particular entry you can use set intersection:
codes = frozenset(e.text for e in forex.findall('data/code'))
print codes & frozenset(currencies)
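Here is a self-contained Python 3 sketch of the XML approach. The sample document is an assumption, shaped only so that the 'data/code' path from the answer matches; the real feed's layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped so that 'data/code' matches, as in the answer.
sample = """<forex>
  <data><code>DKK</code><description>Danish Krone</description></data>
  <data><code>USD</code><description>US Dollar</description></data>
  <data><code>SEK</code><description>Swedish Krona</description></data>
</forex>"""

currencies = {"DKK": [], "SEK": []}

root = ET.fromstring(sample)

# Compare the element text exactly instead of searching for substrings.
found = [elem.text for elem in root.findall("data/code") if elem.text in currencies]
print(found)

# Set intersection when only the presence of each code matters.
present = frozenset(e.text for e in root.findall("data/code")) & frozenset(currencies)
print(sorted(present))
```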
