How do I return data from https://finance.yahoo.com/quote/FB?p=FB? I am trying to pull the open and close data. The thing is that both of these numbers share the same class in the code.
They both share this class: 'Trsdu(0.3s) '
How can I differentiate them if the classes are the same?
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
This function:
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
will return just the text of the first element with class Trsdu(0.3s).
Using:
googclose = googsoup.find_all(class_='Trsdu(0.3s)')
will return a list containing the page's elements with class Trsdu(0.3s).
Then you can iterate over them:
for element in googsoup.find_all(class_='Trsdu(0.3s)'):
    print(element.get_text())
Check this out, if this is what you wanted:
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.select("span[data-reactid='42']")[1].text
googopen = googsoup.select("span[data-reactid='48']")[0].text
print("Close: {}\nOpen: {}".format(googclose, googopen))
Result:
Close: 172.17
Open: 171.69
If you want just the values for Open and Previous Close, you can either use findAll and grab the first 2 items in the results:
googclose, googopen = googsoup.findAll('span', class_='Trsdu(0.3s) ')[:2]
googclose = googclose.get_text()
googopen = googopen.get_text()
print(googclose, googopen)
>>> 172.17 171.69
Or you can go one level higher and find the values based on the parent td, using the data-test attribute:
googclose = googsoup.find('td', attrs={'data-test': 'PREV_CLOSE-value'}).get_text()
googopen = googsoup.find('td', attrs={'data-test': 'OPEN-value'}).get_text()
print(googclose, googopen)
>>> 172.17 171.69
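If you end up needing several of these fields, the data-test lookup can be wrapped in a small helper. A minimal sketch (the function name is hypothetical; the data-test values are the same ones used above):
def get_quote_value(soup, test_id):
    # Look up a quote field by its data-test id, e.g. 'PREV_CLOSE-value'.
    cell = soup.find('td', attrs={'data-test': test_id})
    return cell.get_text() if cell else None

print(get_quote_value(googsoup, 'PREV_CLOSE-value'), get_quote_value(googsoup, 'OPEN-value'))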
If you use the Chrome browser, you can right-click on the item that you want to know more about and then select Inspect from the resulting menu. The browser will show you the markup for the number associated with OPEN.
Notice that, in addition to the class attribute, there is a data-reactid attribute that might do the trick. In fact, if you also inspect the close number you will find, as I did, that its attribute is different.
This suggests the following code.
>>> import requests
>>> import bs4
>>> page = requests.get('https://finance.yahoo.com/quote/FB?p=FB').text
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> soup.findAll('span', attrs={'data-reactid': '42'})[0].text
'172.17'
>>> soup.findAll('span', attrs={'data-reactid': '48'})[0].text
'171.69'
I need to output the exchange rates given by the ECB API, but the output shows an error:
"TypeError: string indices must be integers"
How can I fix this error?
import requests, config
from bs4 import BeautifulSoup
r = requests.get(config.ecb).text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
    for x in i("cube"):
        for y in x:
            print(y['currency'], y['rate'])
You have too many for-loops:
for i in course:
    print(i['currency'], i['rate'])
But this also requires searching only for <cube> elements that have the attribute currency:
course = soup.findAll("cube", currency=True)
course = soup.findAll("cube", {"currency": True})
or you would have to check whether each item has the attribute currency:
for i in course:
    if 'currency' in i.attrs:
        print(i['currency'], i['rate'])
Full code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
course = soup.find_all("cube", currency=True)
for i in course:
    #print(i)
    print(i['currency'], i['rate'])
Try this:
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
result = [{currency.get('currency'): currency.get('rate')} for currency in soup.find_all("cube", {'currency': True})]
print(result)
OUTPUT:
[{'USD': '0.9954'}, {'JPY': '142.53'}, {'BGN': '1.9558'}, {'CZK': '24.497'}, {'DKK': '7.4366'}, {'GBP': '0.87400'}, {'HUF': '403.98'}, {'PLN': '4.7143'}, {'RON': '4.9238'}, {'SEK': '10.7541'}, {'CHF': '0.9579'}, {'ISK': '138.30'}, {'NOK': '10.1985'}, {'HRK': '7.5235'}, {'TRY': '18.1923'}, {'AUD': '1.4894'}, {'BRL': '5.2279'}, {'CAD': '1.3226'}, {'CNY': '6.9787'}, {'HKD': '7.8133'}, {'IDR': '14904.67'}, {'ILS': '3.4267'}, {'INR': '79.3605'}, {'KRW': '1383.58'}, {'MXN': '20.0028'}, {'MYR': '4.5141'}, {'NZD': '1.6717'}, {'PHP': '57.111'}, {'SGD': '1.4025'}, {'THB': '36.800'}, {'ZAR': '17.6004'}]
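If a single mapping is more convenient than a list of one-entry dicts, the same selection can feed a plain dict comprehension; a small sketch of that alternative:
# Same find_all as above, collapsed into one dict keyed by currency code.
rates = {cube.get('currency'): cube.get('rate') for cube in soup.find_all("cube", {'currency': True})}
print(rates.get('USD'))  # e.g. '0.9954'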
Just in addition to the answer from Sergey K, which is on point for how it should be done, here is what the main issue is.
The main issue in your code is that your selection is not as precise as it should be:
soup.findAll("cube")
This will also find_all() the parent <cube> elements that do not have an attribute called currency or rate, but much more decisive is that there is whitespace in the markup between the nodes, which BeautifulSoup turns into NavigableStrings.
Using an index to get the attribute values won't work when you hit one of these NavigableStrings instead of the next Tag.
You can see this if you print only y.name:
None
cube
None
cube
...
How to fix this error?
There are two approaches, in my opinion:
The best one is already shown in https://stackoverflow.com/a/73756178/14460824 by Sergey K, who used very precise arguments to find_all() the specific elements.
Staying closer to your code, you could implement an if-statement that checks whether the tag.name is equal to 'cube'. It works fine, but I would recommend using the more precise selection instead.
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
    for x in i("cube"):
        for y in x:
            if y.name == 'cube':
                print(y['currency'], y['rate'])
Output
USD 0.9954
JPY 142.53
BGN 1.9558
CZK 24.497
DKK 7.4366
GBP 0.87400
HUF 403.98
PLN 4.7143
...
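For completeness, checking y.name == 'cube' is equivalent here to testing whether the child is a Tag at all, since the whitespace nodes are NavigableStrings whose name is None. A sketch of that alternative inner check, assuming the loops from the example above:
from bs4.element import Tag

for y in x:
    # Keep only Tag children, skipping the whitespace NavigableStrings.
    if isinstance(y, Tag):
        print(y['currency'], y['rate'])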
I am having an issue where not all instances are captured within a relatively simple BeautifulSoup scrape. What I am running is the below:
from bs4 import BeautifulSoup as bsoup
import requests as reqs
home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"
page_to_parse = home_test
page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')
find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)
If you have a look at the first print, the scrape captures the part of the website that I'm after; however, if you inspect the second print, half of the instances from the first one aren't actually being picked up at all. I do not have any limits on this, so in theory it should take in all the right ones.
I've already tested quite a few different variants of find_next, find, or find_all, but the second loop never finds all of them.
Results are always:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Where it should take on the following instead:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23
parse_page.find_all returns a list with one item, the element with id="team_stats_extra". The loop needs to be over its child elements:
find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())
If you have multiple tables, use two loops:
find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the next loop performs only one iteration.
You can change the way you select the div blocks with:
find_stats = parse_page.select('div#team_stats_extra > div')
print(len(find_stats)) # >>> returns 2
for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)
To explain the selector select('div#team_stats_extra > div'), it is the same as:
find the div block with the id team_stats_extra
and select all direct children that are div
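For comparison, this selector does the same job as the two-step find/find_all approach from the earlier answer; a sketch of the equivalence:
# Step 1: the div block with the id team_stats_extra
container = parse_page.find('div', id='team_stats_extra')
# Step 2: its direct div children only
find_stats = container.find_all('div', recursive=False)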
With bs4 4.7.1+ you can use :has to ensure you get the appropriate divs, those that have a child with class th (.th), so you have the right elements to loop over:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two')
soup = bs(r.content, 'lxml')
for div in soup.select('#team_stats_extra div:has(.th)'):
    print(div.get_text())
The script used to work, but it no longer does and I can't figure out why. I am trying to go to the link and extract/print the religion field. Using Firebug, the religion field entry is within the 'tbody' and then 'td' tag structure. But now the script finds "None" when searching for these tags. I also looked at the lxml output via 'print Soup_FamSearch' and I couldn't see any of the 'tbody' and 'td' tags that appear in Firebug.
Please let me know what I am missing.
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
from unicodedata import normalize
FamSearchURL = 'https://familysearch.org/pal:/MM9.1.1/KH21-211'
OpenFamSearchURL = urllib2.urlopen(FamSearchURL)
Soup_FamSearch = BeautifulSoup(OpenFamSearchURL, 'lxml')
OpenFamSearchURL.close()
tbodyTags = Soup_FamSearch.find('tbody')
trTags = tbodyTags.find_all('tr', class_='result-item ')
for trTag in trTags:
    tdTags_label = trTag.find('td', class_='result-label ')
    if tdTags_label:
        tdTags_label_string = tdTags_label.get_text(strip=True)
        if tdTags_label_string == 'Religion: ':
            print trTag.find('td', class_='result-value ')
Find the Religion: label by text and get the next td sibling:
soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> response = requests.get('https://familysearch.org/pal:/MM9.1.1/KH21-211')
>>> soup = BeautifulSoup(response.content, 'lxml')
>>>
>>> soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
u'Methodist'
Then, you can extract this into a nice reusable function:
def get_field_value(soup, field):
    return soup.find(text='%s:' % field).parent.find_next_sibling('td').get_text(strip=True)

print get_field_value(soup, 'Religion')
print get_field_value(soup, 'Nationality')
print get_field_value(soup, 'Birthplace')
Prints:
Methodist
Canadian
Ontario
I'm trying to parse a web page, and here's my code:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific, but as the title says, it gives me no response. No errors, nothing.
First of all, omit the line read = BeautifulSoup(openurl.read()).
Also, the line x = soup.find('ul', {"class": "i_p0"}) doesn't actually make any difference, because you are reusing the x variable in the loop.
Also, soup.findAll('a href') doesn't find anything.
Also, instead of the old-fashioned findAll(), there is find_all() in BeautifulSoup4.
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the value of the href attribute of every link on the page.
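If you specifically want the links inside the ul with class i_p0 (which is what your unused x = soup.find(...) line seems to be aiming at), you can scope the search to that element first; a small sketch:
pagination = soup.find('ul', {"class": "i_p0"})
if pagination is not None:
    for x in pagination.find_all('a'):
        print x.get('href')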
Hope that helps.
I altered a couple of lines in your code and I do get a response; not sure if that is what you want, though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read()) # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl) # This is not needed
# x = soup.find('ul', {"class": "i_p0"}) # You don't seem to be making a use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href')  # This is to get the href
Hope this helps.
So I wanted to scrape visualizations from visual.ly; however, right now I do not understand how the "show more" button works. As of now, my code will get the image link, the text next to the image, and the link of the page. I was wondering how the "show more" button functions, because I was going to try to loop through using the number of pages. As of now I do not know how I would loop through each one individually. Any ideas on how I could loop through and get more images than they originally show you?
from BeautifulSoup import BeautifulSoup
import urllib2
import HTMLParser
import urllib, re
counter = 1
columnno = 1
parser = HTMLParser.HTMLParser()
soup = BeautifulSoup(urllib2.urlopen('http://visual.ly/?view=explore&type=static#v2_filter').read())
image = soup.findAll("div", attrs = {'class': 'view-mode-wrapper'})
if columnno < 4:
    column = image[0].findAll("div", attrs = {'class': 'v2_grid_column'})
    columnno += 1
else:
    column = image[0].findAll("div", attrs = {'class': 'v2_grid_column last'})
visualizations = column[0].findAll("div", attrs = {'class': '0 v2_grid_item viewmode-item'})
getImage = visualizations[0].find("a")
print counter
print getImage['href']
soup1 = BeautifulSoup(urllib2.urlopen(getImage['href']).read())
theImage = soup1.findAll("div", attrs = {'class': 'ig-graphic-wrapper'})
text = soup1.findAll("div", attrs = {'class': 'ig-content-right'})
getText = text[0].findAll("div", attrs = {'class': 'ig-description right-section first'})
imageLink = theImage[0].find("a")
print imageLink['href']
print getText
for row in image:
    theImage = image[0].find("a")
    actually_download = False
    if actually_download:
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)
    counter += 1
You cannot use a urllib-parser combo here, because the site uses JavaScript to load more content. To handle that, you will need a full browser emulator (with JavaScript support). I have never used Selenium before, but I have heard that it does this and that it has a Python binding.
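I can't vouch for the exact details since I haven't used it, but a minimal Selenium sketch might look like this (the CSS selector for the button is a placeholder you would need to find by inspecting the page):
from selenium import webdriver
from BeautifulSoup import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://visual.ly/?view=explore&type=static#v2_filter')
# Hypothetical selector -- inspect the page for the real "show more" button.
driver.find_element_by_css_selector('a.show-more').click()
# Hand the rendered HTML back to BeautifulSoup once the new items have loaded.
soup = BeautifulSoup(driver.page_source)
driver.quit()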
However, I have found that it uses a very predictable form
http://visual.ly/?page=<page_number>
for its GET requests. Perhaps an easier way would be to go under
<div class="view-mode-wrapper">...</div>
to parse the data (using the above URL format). After all, the AJAX requests must go to some location.
Then you could do:
for i in xrange(<whatever>):
    url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i)
    # do whatever you want from here
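Putting that together with your existing parsing, a sketch of the paged approach (assuming each paged URL returns the same view-mode-wrapper markup as the first page):
import urllib2
from BeautifulSoup import BeautifulSoup

for i in xrange(1, 5):  # however many pages you want
    url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i)
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    wrappers = soup.findAll("div", attrs={'class': 'view-mode-wrapper'})
    if not wrappers:
        break  # no more content
    for link in wrappers[0].findAll("a"):
        print link.get('href')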