Scraping page with BS in python only captures first column of splitColumn - python

I'm trying to scrape the last part of this page with BeautifulSoup in Python.
I want to retrieve all the companies listed at the bottom. The companies are ordered alphabetically: companies with names starting with "A-F" appear under the first tab, "G-N" under the second tab, and so on. You have to click the tabs for the names to appear, so I'll loop through the different "name pages" and apply the same code.
I'm having trouble retrieving all the names of a single page, however.
When looking at the companies under "A-F", I can only retrieve the names in the first column of the table.
My code is:
from bs4 import BeautifulSoup as Soup
import requests

incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url)
soup = Soup(page.content, "html.parser")

for header in soup.find("h2").next_siblings:
    try:
        for a in header.childGenerator():
            if str(type(a)) == "<class 'bs4.element.NavigableString'>":
                print(str(a))
    except:
        pass
As can be seen by running this, I only get the names from the first column.
Any help is very much appreciated.

Give this a shot and tell me this is not what you wanted:
from bs4 import BeautifulSoup
import requests

incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url).text
soup = BeautifulSoup(page, "lxml")

for items in soup.select(".splitColumn p"):
    title = '\n'.join([item for item in items.strings])
    print(title)
Result:
3iGroup
8point3 Energy Partners 
A
ABN AMRO
Accell Group
Accsys Technologies
Achmea
Acuity Brands
Adecco
Adidas
Adobe Systems
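Since you said you'll loop through the different "name pages" (the A-F, G-N, ... tabs), here is a minimal sketch of that loop built on the same selector. Only the companies-atmf1 slug is confirmed by your question; the other tab slugs below are placeholders, so check the actual hrefs behind the tabs before using them:
from bs4 import BeautifulSoup
import requests

base = ("https://www.triodos.com/en/investment-management/socially-responsible-investment/"
        "sustainable-investment-universe/")

# Only "companies-atmf1" appears in the question; the other slugs are hypothetical.
tab_slugs = ["companies-atmf1", "companies-gtnf1", "companies-otzf1"]

all_names = []
for slug in tab_slugs:
    page = requests.get(f"{base}{slug}/").text
    soup = BeautifulSoup(page, "lxml")
    for items in soup.select(".splitColumn p"):
        # collect every company name from every column of the split table
        all_names.extend(str(name) for name in items.strings)

print(all_names)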

Related

Navigating html tree with beautifulsoup

I'm trying to scrape some data with BeautifulSoup in Python (url: http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html).
When the data is the first occurrence of a tag, there's no problem, e.g.
titlebook = soup.find("h1")
titlebook = titlebook.text
but I want to scrape other values further down the page, like the UPC, price incl. tax, etc.
The UPC value comes first, and I can get it with universal_product_code = soup.find("tr").find("td").text
I have tried many ways to access the other ones (I've read the BeautifulSoup documentation and tried a lot of things, but it didn't really help).
So my question is: how do I access specific values in a tree where the tags are all the same? I attached a screenshot of the tree to help show what I'm talking about.
Thank you for your help
For example, if you want to find the price (excluding tax), you can use the string= parameter in .find() and then grab the text in the next <td>:
import requests
from bs4 import BeautifulSoup
url = "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# get "Price (excl. tax)" from the table:
key = "Price (excl. tax)"
print("{}: {}".format(key, soup.find("th", string=key).find_next("td").text))
Prints:
Price (excl. tax): £53.74
Or: Use CSS selector:
print(soup.select_one('th:-soup-contains("Price (excl. tax)") + td').text)
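If you want every row of the product-information table at once rather than a single key, a small variation of the same idea (just a sketch, assuming each row pairs a <th> label with a <td> value, as on the books.toscrape product pages) collects them into a dict:
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Map each row header (<th>) to the value in the adjacent <td>
product_info = {
    row.find("th").get_text(strip=True): row.find("td").get_text(strip=True)
    for row in soup.select("table tr")
    if row.find("th") and row.find("td")
}
print(product_info.get("UPC"))
print(product_info.get("Price (excl. tax)"))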

Finding name and codes of all airports

I am trying to scrape data to get the text I need. I want to find the line that says aberdeen and all lines after it which contain the airport info. Here is a pic of the html hierarchy:
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. Here is a link to the data if curious. I am new to scraping obviously.
The problem is your BeautifulSoup parser:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
Why are people suggesting Selenium? This page doesn't load its data dynamically ... requests + re is all you need; you don't even need BeautifulSoup:
import re
import requests

data = requests.get('http://www.airportcodes.org/').text
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)
The pattern just looks for a run of letters (plus commas and spaces)
followed by exactly three uppercase letters in parentheses.
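Each match comes back as a (city, code) tuple, so if a lookup table is what you're after, a quick sketch building on the snippet above would be:
import re
import requests

data = requests.get('http://www.airportcodes.org/').text
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)

# Each match is a (city, code) tuple; turn the list into a simple lookup dict.
airports = {city.strip(): code for city, code in cities_and_codes}
print(len(airports), "airports found")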

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping the following website: ivmp servers. I'm having trouble scraping the number of players on each server; can someone help me? Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)
Looking at the page you provided, I assume the table you want to extract information from is the one with server names and IP addresses.
There are actually 4 table elements on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > Inspect in Chrome:
players = soup.select_one('table#serverlist')
Now you want to get the td elements.
You can print all of them using:
for td in players.select("td"):
    print(td)
Or you can select just the one you are interested in:
players.select("td.hostname")
for example.
Hope this helps.
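Putting the selectors together, a rough sketch that walks the rows of that table and pairs each hostname with its player count could look like this. The hostname class comes from the selector above, and players is assumed to be the class on the player-count cell (it also appears in the next answer); verify both against the live page:
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')

serverlist = soup.select_one('table#serverlist')
for row in serverlist.select('tr'):
    name = row.select_one('td.hostname')
    players = row.select_one('td.players')
    # Header/sorting rows won't have both cells, so skip them
    if name and players:
        print(name.get_text(strip=True), '-', players.get_text(strip=True))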
Looking at the structure of the page, there are a few table cells (td) with the class "players". It looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players" and then loop through them, adding only the ones we do want to an array.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
    # Exclude the cells which are for sorting
    if cell.get_text() != 'Players':
        summary.append(cell.get_text())
print(summary)

Finding the correct elements for scraping a website

I am trying to scrape only certain articles from this main page. To be more specific, I am trying to scrape only articles from the sub-page media, and from the sub-sub-pages Press releases, Governing Council decisions, Press conferences, Monetary policy accounts, Speeches, and Interviews, and only those which are in English.
Based on some tutorials and other Stack Overflow answers, I managed to put together code that scrapes absolutely everything from the website. My original idea was to scrape everything and just clean the output later in a data frame, but the website contains so much that it always freezes after some time.
Getting the sub-links:
import requests
import re
from bs4 import BeautifulSoup
master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = []
sub_links = {}
for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)
    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []
    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t" + sub_href)
One thing I tried was to change the base link to the sub-links; my idea was that maybe I could do it separately for every sub-page and later just put the links together, but that did not work. Another thing I tried was to replace the 17th line (the one that collects sub_atags) with the following:
sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)
This seemed to partially solve my problem: even though it did not get only the links from the sub-pages, it at least ignored links that are not 'doc-title' (which covers all the links with text on the website), but it was still too much, and some links were not retrieved correctly.
I also tried the following:
for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
        print(master_href)
I thought that because all hrefs for English documents have .en somewhere in them, this would give me only the links where .en occurs somewhere in the href, but this code gives me a syntax error at print(master_href), which I don't understand because print(master_href) worked before.
Next, I want to extract the following information from the sub-links. This part of the code works when I test it on a single link, but I never had the chance to try it on the code above since it won't finish running. Will this work once I manage to get the proper list of all links?
for link in sublinks:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]

import pandas as pd
ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "date": datadate})
Also, going back to the scraping: since the first approach with BeautifulSoup did not work for me, I tried to approach the problem differently. The website has RSS feeds, so I wanted to use the following code:
import feedparser
from pandas.io.json import json_normalize
import pandas as pd
import requests
rss_url='https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
df_ecb_feed=json_normalize(ecb_feed.entries)
df_ecb_feed.head()
Here I run into the problem of not even being able to find the RSS feed URL in the first place. I viewed the page source, searched for "RSS", and tried all the URLs I could find that way, but I always get an empty dataframe.
I am a beginner at web scraping, and at this point I don't know how to proceed or how to approach this problem. In the end, what I want to accomplish is to collect all the articles from the sub-pages, with their titles, dates, and authors, and put them into one dataframe.
The biggest problem you have with scraping this site is probably the lazy loading: Using JavaScript, they load the articles from several html pages and merge them into the list. For details, look out for index_include in the source code. This is problematic for scraping with only requests and BeautifulSoup because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:
Instead of the main article list page (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g., /press/pr/date/2019/html/index_include.en.html. This will probably be the easier option, but you have to do it for each year you're interested in.
Use a client that can execute JavaScript like Selenium to obtain the HTML instead of requests.
Apart from that, I would suggest using CSS selectors for extracting information from the HTML code. This way, you only need a few lines for the article part. Also, I don't think you have to filter for English articles if you use the index.en.html page for scraping, because it shows English by default and, additionally, other languages if available.
Here's an example I quickly put together; it can certainly be optimized, but it shows how to load the page with Selenium and extract the article URLs and article contents:
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]

driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')
I get the following output for the Press Releases page:
title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board
date: 20 December 2019
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...
title: Monetary policy decisions
date: 12 December 2019
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...
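If you'd rather go with the first option and skip Selenium entirely, a rough sketch that pulls one of the lazy-loaded index_include pages directly with requests could look like this. It reuses the /press/pr/date/2019/html/index_include.en.html path and the span.doc-title selector from above, so verify both against the live site:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ecb.europa.eu'
# Lazy-loaded article list for 2019 press releases; repeat for each year you need
index_url = f'{base_url}/press/pr/date/2019/html/index_include.en.html'

soup = BeautifulSoup(requests.get(index_url).content, 'html.parser')

# Collect the article URLs from the included list
article_urls = [f'{base_url}{a["href"]}' for a in soup.select('span.doc-title > a[href]')]
print(len(article_urls), 'articles found')
print(article_urls[:5])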

How to scrape 'Click to Display' fields with BeautifulSoup

I am trying to scrape the number and names of schools that basketball players get offers from on verbalcommits.com.
Using this page as an example: http://www.verbalcommits.com/players/jarrey-foster
It's easy to access the first offer (SMU) but all of the other offers are hidden behind the "Show other offers" button. When I inspect the page, I can see the offers but my scraper doesn't get to them. I've been using the following:
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("http://www.verbalcommits.com/players/jarrey-foster")  # opens page
soup = BeautifulSoup(page, 'html.parser')  # makes page into a BS python object
schools = soup.body.findAll('span', {"class": "team_name"})
print(schools)
This returns the first span that has the team name in it, but not the rest of the spans that are hidden. What do I need to add to access the rest of the page that is hidden?
To elaborate on @furas's great answer: here is how you can extract the player id and make a second request to get the "closed offers". For this, we maintain a web-scraping session with requests:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    response = session.get("http://www.verbalcommits.com/players/jarrey-foster")

    # get the player id
    soup = BeautifulSoup(response.content, "html.parser")
    player_id = soup.select_one("h1.player-name").get("data-player-id")

    # get closed offers
    response = session.get("http://www.verbalcommits.com/player_divs/closed_offers", params={"player_id": player_id})
    soup = BeautifulSoup(response.content, "html.parser")

    # print team names
    for team in soup.select(".team_name"):
        print(team.get_text())
Prints team names for demonstration purposes:
UTEP
Sam Houston State
New Hampshire
Rice
Temple
Liberty
UL Lafayette
You can't get the other data because when you click the button, JavaScript reads it from the server from:
http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766&_=1475626846752
Now you can use this URL with BeautifulSoup to get the data.
I used Firebug in Firefox or Developer Tools in Chrome to find this URL.
EDIT: inside the HTML I found data-player-id="17766"; it is the first argument in the URL above. Maybe you can find the second argument so you can generate the URL using Python.
EDIT: I checked the URL
http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766
and it gives the same data, so you don't need the second argument.
