Web Scraping with BeautifulSoup4 failing - python

I was making a program that would collect the value of the cryptocurrency verge. This script did the trick:
import urllib2
from bs4 import BeautifulSoup
url=("https://coinmarketcap.com/currencies/verge/")
page=urllib2.urlopen(url)
soup=BeautifulSoup(page,"html.parser")
find_value=soup.find('span',attrs={'class':'text-large2'})
price=find_value.text
Though the issue was that the result was in USD and i lived in Australia. So what i then did was put that value in a USD to AUD converter to find my value. I tried with the following code:
url2="http://www.xe.com/currencyconverter/convert/?
Amount="+price+"&From=USD&To=AUD"
print url2
page2=urllib2.urlopen(url2)
soup2=BeautifulSoup(page2,"html.parser")
find_value2=soup.find('span',attrs={'class':'uccResultAmount'})
print find_value2
The result was that i would get the right url though i would get the wrong result. Could anybody tell me where i am going wrong?Thank You

You can use regular expressions to scrape the currency converter:
import urllib
from bs4 import BeautifulSoup
import re
def convert(**kwargs):
url = "http://www.xe.com/currencyconverter/convert/?Amount={amount}&From={from_curr}&To={to_curr}".format(**kwargs)
data = str(urllib.urlopen(url).read())
val = map(float, re.findall("(?<=uccResultAmount'>)[\d\.]+", data))
return val[0]
url="https://coinmarketcap.com/currencies/verge/"
page=urllib.urlopen(url)
soup=BeautifulSoup(page,"html.parser")
find_value=soup.find('span',attrs={'class':'text-large2'})
print convert(amount = float(find_value.text), from_curr = 'USD', to_curr = 'AUD')
Output:
0.170358

Related

How to display the entire list of currencies?

I need to output the exchange rate given by the ECB API. But the output shows an error
"TypeError: string indices must be integers"
How to fix this error?
import requests, config
from bs4 import BeautifulSoup
r = requests.get(config.ecb).text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
for x in i("cube"):
for y in x:
print(y['currency'], y['rate'])
You have too many for-loops
for i in course:
print(i['currency'], i['rate'])
But this need also to search <cube> with attribute currency
course = soup.findAll("cube", currency=True)
course = soup.findAll("cube", {"currenc": True})
or you would have to check if item has attribute currency
for i in course:
if 'currency' in i.attrs:
print(i['currency'], i['rate'])
Full code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
course = soup.find_all("cube", currency=True)
for i in course:
#print(i)
print(i['currency'], i['rate'])
try this
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
result = [{currency.get('currency'): currency.get('rate')} for currency in soup.find_all("cube", {'currency': True})]
print(result)
OUTPUT:
[{'USD': '0.9954'}, {'JPY': '142.53'}, {'BGN': '1.9558'}, {'CZK': '24.497'}, {'DKK': '7.4366'}, {'GBP': '0.87400'}, {'HUF': '403.98'}, {'PLN': '4.7143'}, {'RON': '4.9238'}, {'SEK': '10.7541'}, {'CHF': '0.9579'}, {'ISK': '138.30'}, {'NOK': '10.1985'}, {'HRK': '7.5235'}, {'TRY': '18.1923'}, {'AUD': '1.4894'}, {'BRL': '5.2279'}, {'CAD': '1.3226'}, {'CNY': '6.9787'}, {'HKD': '7.8133'}, {'IDR': '14904.67'}, {'ILS': '3.4267'}, {'INR': '79.3605'}, {'KRW': '1383.58'}, {'MXN': '20.0028'}, {'MYR': '4.5141'}, {'NZD': '1.6717'}, {'PHP': '57.111'}, {'SGD': '1.4025'}, {'THB': '36.800'}, {'ZAR': '17.6004'}]
Just in addition to answer from #Sergey K, that is on point how it should be done, to show what is the main issue.
Main issue in your code is that, your selection is not that precise as it should be:
soup.findAll("cube")
This will also find_all() parent <cube> that do not have an attribute called currency or rate but much more decisive is that there are spaces in the markup in between nodes BeautifulSoup will turn those into NavigableString's.
Using the index to get the attribute values, wont work while you do it with a NavigableStringinstead of the next`.
You can see this if you print(y.name) only:
None
Cube
None
Cube
...
How to fix this error?
There are two approaches in my opinion
Best is already shwon https://stackoverflow.com/a/73756178/14460824 by Sergey K who used very precise arguments to find_all() specific elements.
While working with your code, is to implement an if-statement that checks, if the tag.name is equal to 'cube'. It is working fine, but I would recommend to use a more precise selection instead.
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r)
soup.findAll("cube")
course = soup.findAll("cube")
for i in course:
for x in i("cube"):
for y in x:
if y.name == 'cube':
print(y['currency'], y['rate'])
Output
USD 0.9954
JPY 142.53
BGN 1.9558
CZK 24.497
DKK 7.4366
GBP 0.87400
HUF 403.98
PLN 4.7143
...

How i can get random object from list in Python

I have build a list which contains href from website and i wanna randomly select one of this link, how can i do that?
from bs4 import BeautifulSoup
import urllib
import requests
import re
import random
url = "https://www.formula1.com/en/latest.html"
articles = []
respone = urllib.request.urlopen(url)
soup = BeautifulSoup(respone,'lxml')
def getItems():
for a in soup.findAll('a',attrs={'href': re.compile("/en/latest/article.")}):
articles = a['href']
x = random.choice(articles)
print(x)
That code work, but selecting only random index from all of the objects
You're very close to the answer. You just need to do this:
from bs4 import BeautifulSoup
import urllib
import requests
import re
import random
url = "https://www.formula1.com/en/latest.html"
articles = []
respone = urllib.request.urlopen(url)
soup = BeautifulSoup(respone,'lxml')
def getItems():
for a in soup.findAll('a',attrs={'href': re.compile("/en/latest/article.")}):
articles.append(a['href'])
x = random.choice(articles)
print(x)
getItems()
The changes are:
We add each article to the articles array.
The random choice is now done after the loop, rather than inside the loop.

scrape blocks of data from webpage API

I try to collect block data which forms a small table from a webpage. Pls see my codes below.
`
import requests
import re
import json
import sys
import os
import time
from lxml import html,etree
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.investing.com/instruments/OptionsDataAjax'
params = {'pair_id': 525, ## SPX
'date': 1536555600, ## 2018-9-4
'strike': 'all', ## all prices
'callspots': 'calls',#'call_andputs',
'type':'analysis', # webpage viewer
'bringData':'true',
}
headers = {'User-Agent': Chrome/39.0.2171.95 Safari/537.36'}
def R(text, end='\n'): print('\033[0;31m{}\033[0m'.format(text), end=end)
def G(text, end='\n'): print('\033[0;32m{}\033[0m'.format(text), end=end)
page = requests.get(url, params=params,headers = headers)
if page.status_code != 200:
R('ERROR CODE:{}'.format(page.status_code))
sys.exit
G('Problem in connection!')
else:
G('OK')
soup = BeautifulSoup(page.content,'lxml')
spdata = json.loads(soup.text)
print(spdata['data'])`
This result--spdata['data'] gives me a str, I just want to get following blocks in this str. There are many such data blocks in this str with the same format.
SymbolSPY180910C00250000
Delta0.9656
Imp Vol0.2431
Bid33.26
Gamma0.0039
Theoretical33.06
Ask33.41
Theta-0.0381
Intrinsic Value33.13
Volume0
Vega0.0617
Time Value-33.13
Open Interest0
Rho0.1969
Delta / Theta-25.3172
I use json and BeautifulSoup here, maybe regular expression will help but I don't know much about re. To get the result, any approach is appreciated. Thanks.
Add this after your code:
regex = r"((SymbolSPY[1-9]*):?\s*)(.*?)\n[^\S\n]*\n[^\S\n]*"
for match in re.finditer(regex, spdata['data'], re.MULTILINE | re.DOTALL):
for line in match.group().splitlines():
print (line.strip())
Outputs
OK
SymbolSPY180910C00245000
Delta0.9682
Imp Vol0.2779
Bid38.26
Gamma0.0032
Theoretical38.05
Ask38.42
Theta-0.0397
Intrinsic Value38.13
Volume0
Vega0.0579
Time Value-38.13
Open Interest0
Rho0.1934
Delta / Theta-24.3966
SymbolSPY180910P00245000
Delta-0.0262
Imp Vol0.2652
...

BeautifulSoup Python script no longer working for mining a simple field

The script used to work, but no longer and I can't figure out why. I am trying to go to the link and extract/print the religion field. Using firebug, the religion field entry is within the 'tbody' then 'td' tag-structure. But now the script find "none" when searching for these tags. And I also look at the lxml by 'print Soup_FamSearch' and I couldn't see any 'tbody' and 'td' tags appeared on firebug.
Please let me know what I am missing?
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
from unicodedata import normalize
FamSearchURL = 'https://familysearch.org/pal:/MM9.1.1/KH21-211'
OpenFamSearchURL = urllib2.urlopen(FamSearchURL)
Soup_FamSearch = BeautifulSoup(OpenFamSearchURL, 'lxml')
OpenFamSearchURL.close()
tbodyTags = Soup_FamSearch.find('tbody')
trTags = tbodyTags.find_all('tr', class_='result-item ')
for trTags in trTags:
tdTags_label = trTag.find('td', class_='result-label ')
if tdTags_label:
tdTags_label_string = tdTags_label.get_text(strip=True)
if tdTags_label_string == 'Religion: ':
print trTags.find('td', class_='result-value ')
Find the Religion: label by text and get the next td sibling:
soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> response = requests.get('https://familysearch.org/pal:/MM9.1.1/KH21-211')
>>> soup = BeautifulSoup(response.content, 'lxml')
>>>
>>> soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Methodist
Then, you can make a nice reusable function and reuse:
def get_field_value(soup, field):
return soup.find(text='%s:' % field).parent.find_next_sibling('td').get_text(strip=True)
print get_field_value(soup, 'Religion')
print get_field_value(soup, 'Nationality')
print get_field_value(soup, 'Birthplace')
Prints:
Methodist
Canadian
Ontario

Trouble with a for loop

I'm having a problem with a for loop. In the script, I use a text list to build a URL and then ran a for loop for each element of the list. After having all the URLs I want to extract information from the website. That's where I have a problem.
I checked the program and it's building the correct URL but I don't know how to extract the information for all elements of the look using just the 1st URL.
Please, anyone have an idea where I'm going wrong?
import urllib2
import re
from bs4 import BeautifulSoup
import time
date = date = (time.strftime('%Y%m%d'))
symbolslist = open('pistas.txt').read().split()
for symbol in symbolslist:
url = "http://trackinfo.com/entries-race.jsp?raceid=" + symbol + "$" + date +"A01"
htmltext = BeautifulSoup(urllib2.urlopen(url).read())
names=soup.findAll('a',{'href':re.compile("dog")})
for name in names:
results = ' '.join(name.string.split())
print results
and that is the text list:
GBM
GBR
GCA
GDB
GSP
GDQ
GEB
Hei man, try this:
import urllib2
import re
from bs4 import BeautifulSoup
import time
date = (time.strftime('%Y%m%d'))
symbolslist = open('pistas.txt').read().split()
for symbol in symbolslist:
url = "http://trackinfo.com/entries-race.jsp?raceid=" + symbol + "$" + date +"A01"
htmltext = BeautifulSoup(urllib2.urlopen(url).read())
names=htmltext.findAll('a',{'href':re.compile("dog")})
for name in names:
results = ' '.join(name.string.split())
print results

Categories

Resources