I need to output the exchange rates given by the ECB API, but the output shows an error:
"TypeError: string indices must be integers"
How can I fix this error?
import requests, config
from bs4 import BeautifulSoup
r = requests.get(config.ecb).text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
    for x in i("cube"):
        for y in x:
            print(y['currency'], y['rate'])
You have too many for-loops
for i in course:
    print(i['currency'], i['rate'])
But this also needs to search for <cube> tags that have the attribute currency:
course = soup.findAll("cube", currency=True)
course = soup.findAll("cube", {"currency": True})
Or you would have to check if the item has the attribute currency:
for i in course:
    if 'currency' in i.attrs:
        print(i['currency'], i['rate'])
Full code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
course = soup.find_all("cube", currency=True)
for i in course:
    #print(i)
    print(i['currency'], i['rate'])
Try this:
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
result = [{currency.get('currency'): currency.get('rate')} for currency in soup.find_all("cube", {'currency': True})]
print(result)
OUTPUT:
[{'USD': '0.9954'}, {'JPY': '142.53'}, {'BGN': '1.9558'}, {'CZK': '24.497'}, {'DKK': '7.4366'}, {'GBP': '0.87400'}, {'HUF': '403.98'}, {'PLN': '4.7143'}, {'RON': '4.9238'}, {'SEK': '10.7541'}, {'CHF': '0.9579'}, {'ISK': '138.30'}, {'NOK': '10.1985'}, {'HRK': '7.5235'}, {'TRY': '18.1923'}, {'AUD': '1.4894'}, {'BRL': '5.2279'}, {'CAD': '1.3226'}, {'CNY': '6.9787'}, {'HKD': '7.8133'}, {'IDR': '14904.67'}, {'ILS': '3.4267'}, {'INR': '79.3605'}, {'KRW': '1383.58'}, {'MXN': '20.0028'}, {'MYR': '4.5141'}, {'NZD': '1.6717'}, {'PHP': '57.111'}, {'SGD': '1.4025'}, {'THB': '36.800'}, {'ZAR': '17.6004'}]
Just in addition to the answer from @Sergey K, which is on point and how it should be done, to show what the main issue is.
The main issue in your code is that your selection is not as precise as it should be:
soup.findAll("cube")
This will also find_all() parent <cube> tags that do not have an attribute called currency or rate. But much more decisive is that there is whitespace in the markup between the nodes, and BeautifulSoup will turn it into NavigableStrings.
Using the index to get the attribute values won't work while you are doing it on a NavigableString instead of a Tag.
You can see this if you print(y.name) only:
None
cube
None
cube
...
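To make the whitespace issue concrete, here is a minimal, self-contained sketch. The markup is made up (a tiny stand-in for the ECB feed, parsed with the stdlib html.parser), but it shows the same effect: a tag's children mix whitespace NavigableStrings with real Tags.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup mimicking the ECB feed, with whitespace between the tags
xml = """<cube>
    <cube currency="USD" rate="0.9954"></cube>
    <cube currency="JPY" rate="142.53"></cube>
</cube>"""

soup = BeautifulSoup(xml, "html.parser")
outer = soup.find("cube")

# The children alternate between whitespace-only NavigableStrings and Tags,
# which is why indexing with ['currency'] blows up on every other item
for child in outer.children:
    if isinstance(child, NavigableString):
        print("NavigableString:", repr(child))
    else:
        print("Tag:", child.name, child["currency"])
```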
How to fix this error?
There are two approaches, in my opinion:
The best one is already shown in https://stackoverflow.com/a/73756178/14460824 by Sergey K, who used very precise arguments to find_all() the specific elements.
The other, while sticking with your code, is to implement an if-statement that checks whether tag.name is equal to 'cube'. It works fine, but I would recommend using the more precise selection instead.
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
    for x in i("cube"):
        for y in x:
            if y.name == 'cube':
                print(y['currency'], y['rate'])
Output
USD 0.9954
JPY 142.53
BGN 1.9558
CZK 24.497
DKK 7.4366
GBP 0.87400
HUF 403.98
PLN 4.7143
...
Related
I am having an issue where not all instances are captured within a relatively simply beautifulsoup scrape. What I am running is the below:
from bs4 import BeautifulSoup as bsoup
import requests as reqs
home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"
page_to_parse = home_test
page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')
find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)
If you have a look at the first print, the scrape captures the part of the website that I'm after; however, if you inspect the second print, half of the instances from the first aren't actually being picked up at all. I don't have any limits on this, so in theory it should take in all the right ones.
I've already tested quite a few different variants of find_next, find, or find_all, but the second loop's find never takes all of them.
Results are always:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Where it should take on the following instead:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23
parse_page.find_all returns a list with one item, the Tag with id="team_stats_extra". The loop needs to be over its child elements:
find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())
If you have multiple tables, use two loops:
find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the next loop performs only one iteration.
You can change the way you select the div blocks with:
find_stats = parse_page.select('div#team_stats_extra > div')
print(len(find_stats)) # >>> returns 2
for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)
To explain the selector select('div#team_stats_extra > div'), it is the same as:
find the div block with the id team_stats_extra
and select all direct children that are div
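As a minimal illustration of that child combinator (with made-up markup, not the fbref page):

```python
from bs4 import BeautifulSoup

# Made-up markup: two direct div children, one containing a nested div
html = """
<div id="team_stats_extra">
  <div>block one<div>nested</div></div>
  <div>block two</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# '>' matches direct children only, so the nested div is not counted
direct = soup.select("div#team_stats_extra > div")
print(len(direct))  # 2
```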
With bs4 4.7.1+ you can use :has to ensure you get only the divs that have a child with class th, so you have the appropriate elements to loop over:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two')
soup = bs(r.content, 'lxml')
for div in soup.select('#team_stats_extra div:has(.th)'):
    print(div.get_text())
I was making a program that would collect the value of the cryptocurrency verge. This script did the trick:
import urllib2
from bs4 import BeautifulSoup
url=("https://coinmarketcap.com/currencies/verge/")
page=urllib2.urlopen(url)
soup=BeautifulSoup(page,"html.parser")
find_value=soup.find('span',attrs={'class':'text-large2'})
price=find_value.text
Though the issue was that the result was in USD and I live in Australia. So what I then did was put that value into a USD-to-AUD converter to find my value. I tried with the following code:
url2="http://www.xe.com/currencyconverter/convert/?Amount="+price+"&From=USD&To=AUD"
print url2
page2=urllib2.urlopen(url2)
soup2=BeautifulSoup(page2,"html.parser")
find_value2=soup.find('span',attrs={'class':'uccResultAmount'})
print find_value2
The result was that I would get the right URL, though I would get the wrong result. Could anybody tell me where I am going wrong? Thank you.
You can use regular expressions to scrape the currency converter:
import urllib
from bs4 import BeautifulSoup
import re
def convert(**kwargs):
    url = "http://www.xe.com/currencyconverter/convert/?Amount={amount}&From={from_curr}&To={to_curr}".format(**kwargs)
    data = str(urllib.urlopen(url).read())
    val = map(float, re.findall("(?<=uccResultAmount'>)[\d\.]+", data))
    return val[0]
url="https://coinmarketcap.com/currencies/verge/"
page=urllib.urlopen(url)
soup=BeautifulSoup(page,"html.parser")
find_value=soup.find('span',attrs={'class':'text-large2'})
print convert(amount = float(find_value.text), from_curr = 'USD', to_curr = 'AUD')
Output:
0.170358
How do I return data from https://finance.yahoo.com/quote/FB?p=FB. I am trying to pull the open and close data. The thing is that both of these numbers share the same class in the code.
They both share this class 'Trsdu(0.3s) '
How can I differentiate these if the classes are the same?
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
This function:
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
will return just the text of the first element with class Trsdu(0.3s).
Using:
googclose = googsoup.find_all(class_='Trsdu(0.3s) ')
will return a list containing the page's elements with class Trsdu(0.3s).
Then you can iterate them:
for element in googsoup.find_all(class_='Trsdu(0.3s) '):
    print(element.get_text())
Check this out, if this is what you wanted:
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.select("span[data-reactid=42]")[1].text
googopen = googsoup.select("span[data-reactid=48]")[0].text
print("Close: {}\nOpen: {}".format(googclose,googopen))
Result:
Close: 172.17
Open: 171.69
If you want just the values for Open and Previous Close, you can either use findAll and grab the first 2 items in the results
googclose, googopen = googsoup.findAll('span', class_='Trsdu(0.3s) ')[:2]
googclose = googclose.get_text()
googopen = googopen.get_text()
print(googclose, googopen)
>>> 172.17 171.69
Or you can go one level higher, and find the values based on the parent td using the data-test attribute
googclose = googsoup.find('td', attrs={'data-test': 'PREV_CLOSE-value'}).get_text()
googopen = googsoup.find('td', attrs={'data-test': 'OPEN-value'}).get_text()
print(googclose, googopen)
>>> 172.17 171.69
If you use the Chrome browser you can right-click on the item that you want to know more about then select Inspect from the resulting menu. The browser will show you something like this for the number associated with OPEN.
Notice that, not only is there a class attribute, there's the data-reactid attribute that might do the trick. In fact, if you also inspect the close number you will find, as I did, that its attribute is different.
This suggests the following code.
>>> import requests
>>> import bs4
>>> page = requests.get('https://finance.yahoo.com/quote/FB?p=FB').text
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> soup.findAll('span', attrs={'data-reactid': '42'})[0].text
'172.17'
>>> soup.findAll('span', attrs={'data-reactid': '48'})[0].text
'171.69'
I pass multiple class values to BeautifulSoup.find_all(). The value is something like l4 center OR l5 center. (i.e., "l4 center" | "l5 center").
soup.find_all("ul", {"class" : value})
I fail (nothing is output) to do that with the following two solutions:
soup.find_all("ul", {"class" : re.compile("l[4-5]\scenter")})
#OR
soup.find_all("ul", {"class" : ["l4 center", "l5 center"]})
The source code is as follows:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import bs4
import requests
import requests.exceptions
import re
### function, , .... ###
def crawler_chinese_idiom():
    url = 'http://chengyu.911cha.com/zishu_8.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    #for result_set in soup.find_all("ul", class_=re.compile("l[45] +center")): #l4 center or l5 center
    for result_set in soup.find_all("ul", {"class", re.compile(r"l[45]\s+center")}): #nothing output
    #for result_set in soup.find_all("ul", {"class" : "l4 center"}): #normal one
        print(result_set)
crawler_chinese_idiom()
#[] output nothing
Update: resolved https://bugs.launchpad.net/bugs/1476868
At first I thought the problem was that class='l4 center' in HTML is actually two classes -- thinking that soup won't match because it's looking for a single class that contains a space (impossible).
Tried:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("<html><div class='l5 center'>l5test</div><div class='l4 center'>l4test</div><div class='l6 center'>l6test</div>")
results1 = soup.findAll('div', re.compile(r'l4 center'));
print results1
results2 = soup.findAll('div', 'l4 center');
print results2
Output:
[]
[<div class="l4 center">l4test</div>]
But wait? The non-regex option worked fine - it found both classes.
At this point, it looks to me like a BeautifulSoup bug.
To work around it, you could do:
soup.findAll('div', ['l4 center', 'l5 center']);
# update: ^ that doesn't work either.
# or
soup.findAll('div', ['l4', 'l5', 'center']);
I'd recommend the second one, just in case you want to match l4 otherclass center, but you might need to iterate the results to make sure you don't have any unwanted captures in there. Something like:
for result in soup.findAll(...):
    if 'l4' in result.get('class', []) and 'center' in result.get('class', []):
        # yay!
I've submitted a bug here for investigation.
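As another workaround, on current bs4 a CSS selector sidesteps the whole issue, since .l4.center requires both classes on the same element regardless of their order in the class attribute. A sketch using the same made-up test markup as above:

```python
from bs4 import BeautifulSoup

html = ("<div class='l5 center'>l5test</div>"
        "<div class='l4 center'>l4test</div>"
        "<div class='l6 center'>l6test</div>")

soup = BeautifulSoup(html, "html.parser")

# .l4.center / .l5.center each require both classes on one element
matches = soup.select("div.l4.center, div.l5.center")
print([m.get_text() for m in matches])  # ['l5test', 'l4test']
```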
I'm trying to parse a web page, and that's my code:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific, but as the title says, it gives me no response. No errors, nothing.
First of all, omit the line read = BeautifulSoup(openurl.read()).
Also, the line x = soup.find('ul', {"class": "i_p0"}) doesn't actually make any difference, because you are reusing the x variable in the loop.
Also, soup.findAll('a href') doesn't find anything.
Also, instead of old-fashioned findAll(), there is a find_all() in BeautifulSoup4.
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the values of href attribute of all links on the page.
Hope that helps.
I altered a couple of lines in your code and I do get a response; not sure if that is what you want, though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read()) # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl) # This is not needed
# x = soup.find('ul', {"class": "i_p0"}) # You don't seem to be making a use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href') # This is to get the href
Hope this helps.