Trying to Get A Link Embedded in an HTML Page with Python

On the webpage https://podcasts.apple.com/us/podcast/id979020229 there is a title that reads "Python at the US Federal Election Commission". Clicking that title opens a link. I want to find the first title on the page that has an embedded link, then get that link and print it. I'm not sure how to do this in Python. I've tried several approaches I thought would work, one of which uses the BeautifulSoup module. My code is below.
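Code:
import bs4
import requests

link = "https://podcasts.apple.com/us/podcast/id979020229"
page = requests.get(link)
soup = bs4.BeautifulSoup(page.text, "html.parser")
eps = soup.find_all('a')
found = []  # links collected so far
i = 0
while len(found) < 1:
    s = str(eps[i].get('href'))
    if s[8] == 'q':
        found.append(s)
    i += 1
for i in found:
    print(i)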

You can use find_all with the class of the link.
I used the browser's Inspect Element tool to determine which class to select: in this case, tracks__track__link--block.
Then you can iterate through the links.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://podcasts.apple.com/us/podcast/id979020229')
soup = BeautifulSoup(page.text, "html.parser")
eps = soup.find_all(class_ = 'tracks__track__link--block')
for a in eps:
    # gets the text of the link
    print(a.text)
    # gets the link
    print(a['href'])
Prints:
Python at the US Federal Election Commission
https://podcasts.apple.com/us/podcast/python-at-the-us-federal-election-commission/id979020229?i=1000522628049
Flask 2.0
https://podcasts.apple.com/us/podcast/flask-2-0/id979020229?i=1000521751060
Awesome FastAPI extensions and add ons
https://podcasts.apple.com/us/podcast/awesome-fastapi-extensions-and-add-ons/id979020229?i=1000520763681
Ask us about modern Python projects and tools
https://podcasts.apple.com/us/podcast/ask-us-about-modern-python-projects-and-tools/id979020229?i=1000519506466
Automate your data exchange with PyDantic
https://podcasts.apple.com/us/podcast/automate-your-data-exchange-with-pydantic/id979020229?i=1000518286844
Python Apps that Scale to Billions of Users
https://podcasts.apple.com/us/podcast/python-apps-that-scale-to-billions-of-users/id979020229?i=1000517664109
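Since the question only asks for the first title with a link, the loop is not strictly needed. Here is a minimal sketch of that case, assuming the class name from the answer above is still what the page uses and that the matching elements are <a> tags. The function name first_episode_link is made up for illustration.

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://podcasts.apple.com/us/podcast/id979020229"
LINK_CLASS = "tracks__track__link--block"  # class name taken from the answer above


def first_episode_link(html: str, base_url: str = PAGE_URL) -> Optional[str]:
    """Return the absolute URL of the first matching anchor, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the title links are <a> elements carrying LINK_CLASS and an href.
    anchor = soup.find("a", class_=LINK_CLASS, href=True)
    if anchor is None:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, anchor["href"])
```

It could be called as first_episode_link(requests.get(PAGE_URL).text). Returning None instead of raising makes it easy to notice when the site has changed its markup.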

Related

Fetch all pages using a Python request, using Beautiful Soup

I tried to fetch all the product names from the web page, but I only get 12.
When I scroll down, the page refreshes and adds more products.
How can I get all of them?
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class":"title-wrapper"})
for item in items:
    print(item.p.a.get_text())

Your code is fine. The products on the website are loaded dynamically, so a single request only returns the first 12 products.
You can use the developer console in your browser to track the Ajax calls made while browsing.
I did, and it turns out that more products are retrieved with a call to the URL
https://www.outre.com/product-category/wigs/page/2/
So to get all the products, you need to request multiple pages. I suggest using a loop and running your code once per page.
N.B.: You can also check whether the website offers a more convenient place to get the products (that is, somewhere other than the main page).

The page loads the products from a different URL via JavaScript, so Beautiful Soup doesn't see them. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.outre.com/product-category/wigs/page/{}/"
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    titles = soup.select(".product-title")
    if not titles:
        break
    for title in titles:
        print(title.text)
    page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep
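The same paging idea can be wrapped so that it collects results and cannot loop forever. This is only a sketch: iter_items, fetch_titles_with_requests and the max_pages cap are names and choices made up here, and treating a 404 as "no more pages" is an assumption about how the site behaves.

```python
from typing import Callable, Iterator, List

URL_TEMPLATE = "https://www.outre.com/product-category/wigs/page/{}/"


def iter_items(fetch_titles: Callable[[str], List[str]],
               url_template: str = URL_TEMPLATE,
               max_pages: int = 50) -> Iterator[str]:
    """Yield titles page by page until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        titles = fetch_titles(url_template.format(page))
        if not titles:
            return
        yield from titles


def fetch_titles_with_requests(url: str) -> List[str]:
    """One possible fetcher; needs requests and beautifulsoup4 installed."""
    # Imported here so that the paging logic above has no third-party dependency.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    if response.status_code == 404:  # assumption: pages past the end return 404
        return []
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".product-title")]
```

Usage would be all_titles = list(iter_items(fetch_titles_with_requests)). Passing the fetcher in as an argument keeps the stop condition easy to check with fake pages.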

Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the site it uses, tipico, is not available from Switzerland, so I chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected to find where the betting rates are located in the HTML file. On the tipico webpage, they were stored in buttons of class "c_but_base c_but". With the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except Exception:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I therefore printed the soup object to see what exactly the BeautifulSoup function was doing. I added this line:
print(soup)
The printed soup (not shown here because it is too long) is not the same text as what appears when I right-click and choose "Inspect" on the Winamax webpage. So what exactly is the BeautifulSoup function doing? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in HTML and I'm a beginner in Python, so some terminology might be wrong; that's why some parts are in italics.

That's because the website uses JavaScript to display these details, and BeautifulSoup does not interact with JavaScript on its own.
First, find out whether the element you want to scrape is present in the page source. If it is, you can scrape pretty much everything! In your case, the button/span tags were not in the page source (meaning they are hidden or pulled in through a script).
There is no <button> tag in the page source.
So I suggest using Selenium as the solution. I tried a basic scrape of the website.
Here is the code I used:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service(r'Your chromedriver.exe file path'), options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")
# find_elements_by_tag_name was removed in Selenium 4
span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)
browser.quit()
The output contains some junk data, but it's up to you to figure out what you need and what you don't!
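The "is it in the page source?" check described above can also be done from Python before reaching for Selenium. A rough sketch using only the standard library follows. The helper names and the User-Agent header are choices made here, not something the site is known to require.

```python
import urllib.request
from typing import Iterable, List


def fetch_source(url: str) -> str:
    """Download the raw HTML, i.e. what 'View page source' shows (no JavaScript is run)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_from_source(html: str, needles: Iterable[str]) -> List[str]:
    """Return the needles (tag or class names) that never occur in the raw HTML."""
    return [needle for needle in needles if needle not in html]
```

For example, missing_from_source(fetch_source(url), ["<button", "odd-price"]) returning both strings would suggest that the odds are added by JavaScript after the page loads, in which case requests or urllib plus BeautifulSoup cannot see them.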

Finding the correct elements for scraping a website

I am trying to scrape only certain articles from this main page. More specifically, I want only articles from the sub-page Media and its sub-sub-pages Press releases, Governing Council decisions, Press conferences, Monetary policy accounts, Speeches, and Interviews, and only those that are in English.
Based on some tutorials and other Stack Overflow answers, I put together code that scrapes absolutely everything from the website. My original idea was to scrape everything and clean the output later in a data frame, but the website contains so much that the script always freezes after some time.
Getting the sub-links:
import requests
import re
from bs4 import BeautifulSoup
master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = [ ]
sub_links = {}
for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)
    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []
    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t"+sub_href)
Some things I tried: changing the base link to the sub-links (my idea was that I could do it separately for every sub-page and put the links together later, but that did not work). Another thing I tried was replacing the 17th line with the following:
sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)
This seemed to partially solve my problem: although it did not return only links from the sub-pages, it at least ignored links that are not 'doc-title' (which are all the links with text on the website). It was still too much, though, and some links were not retrieved correctly.
I also tried the following:
for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
        print(master_href)
I thought that, because all hrefs of English documents have .en somewhere in them, this would give me only the links where .en occurs in the href. But this code gives me a syntax error at print(master_href), which I don't understand, because print(master_href) worked previously.
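A likely reason the error is reported at print(master_href): the list on the line before it is opened with [ but never closed, and the conditional expression has an if without an else. Python therefore keeps reading into the next line, and older Python versions report the problem there. Below is a minimal sketch of the filter the snippet seems to aim for, assuming every href is a string. The function name english_links is made up for illustration.

```python
from typing import Iterable, List
from urllib.parse import urljoin

BASE_URL = "https://www.ecb.europa.eu"


def english_links(hrefs: Iterable[str], base_url: str = BASE_URL) -> List[str]:
    """Keep hrefs that contain '.en' and turn them into absolute URLs."""
    links = []
    for href in hrefs:
        if ".en" in href:
            # urljoin handles both relative and already-absolute hrefs.
            links.append(urljoin(base_url, href))
    return links
```

With the question's variables this would be called as english_links(a["href"] for a in master_atags). Note that a plain substring test also matches hrefs that merely contain ".en" somewhere else, so a stricter check such as href.endswith(".en.html") may be preferable.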
Next, I want to extract the following information from the sub-links. This part of the code works when I test it on a single link, but I never had a chance to try it with the code above since that never finishes running. Will this work once I manage to get the proper list of all links?
for link in sublinks:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]
import pandas as pd
ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "date": datadate})
Going back to the scraping: since the first approach with Beautiful Soup did not work for me, I also tried to approach the problem differently. The website has RSS feeds, so I wanted to use the following code:
import feedparser
import pandas as pd
import requests
rss_url='https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
# pd.json_normalize replaces the older pandas.io.json.json_normalize import
df_ecb_feed = pd.json_normalize(ecb_feed.entries)
df_ecb_feed.head()
Here I ran into the problem of not even being able to find the RSS feed URL in the first place. I viewed the page source, searched for "RSS", and tried all the URLs I could find this way, but I always get an empty dataframe.
I am a beginner at web scraping, and at this point I don't know how to proceed or how to approach this problem. In the end, I just want to collect all articles from the sub-pages with their titles, dates, and authors and put them into one dataframe.

The biggest problem you have with scraping this site is probably the lazy loading: Using JavaScript, they load the articles from several html pages and merge them into the list. For details, look out for index_include in the source code. This is problematic for scraping with only requests and BeautifulSoup because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:
1. Instead of the main article list page (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g., /press/pr/date/2019/html/index_include.en.html. This will probably be the easier option, but you have to do it for each year you're interested in.
2. Use a client that can execute JavaScript, like Selenium, to obtain the HTML instead of requests.
Apart from that, I would suggest using CSS selectors to extract information from the HTML code. This way, you only need a few lines for the article extraction. Also, I don't think you have to filter for English articles if you scrape the index.en.html page, because it shows English by default and, additionally, other languages if available.
Here's an example I quickly put together. It can certainly be optimized, but it shows how to load the page with Selenium and extract the article URLs and article contents:
from bs4 import BeautifulSoup
from selenium import webdriver
base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]
driver = webdriver.Chrome()
for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')
        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)
        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')
I get the following output for the Press Releases page:
title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board
date: 20 December 2019
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...
title: Monetary policy decisions
date: 12 December 2019
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...
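The other option mentioned above (requesting the lazy-loaded index_include pages directly, without Selenium) might look roughly like this. It is a sketch under assumptions: the URL pattern is generalized from the single example path quoted above, and the fragments are assumed to use the same span.doc-title markup as the rendered page. Both helper names are made up.

```python
from typing import List
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.ecb.europa.eu"


def index_include_url(section: str, year: int) -> str:
    """Build a lazy-loaded list URL following the /press/pr/date/2019/... example."""
    return f"{BASE_URL}/press/{section}/date/{year}/html/index_include.en.html"


def article_links(fragment_html: str) -> List[str]:
    """Extract absolute article URLs from one index_include fragment.

    Assumption: the fragment marks article titles with span.doc-title, as the
    rendered page does.
    """
    soup = BeautifulSoup(fragment_html, "html.parser")
    return [urljoin(BASE_URL, a["href"]) for a in soup.select("span.doc-title > a[href]")]
```

A loop such as for year in range(2015, 2020): article_links(requests.get(index_include_url("pr", year)).text) would then cover several years. If the selector returns nothing, the fragment markup probably differs from the assumption and needs to be inspected.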

Web Scraping through links with Beautiful Soup

I'm trying to scrape the blog https://blog.feedspot.com/ai_rss_feeds/ and crawl through all the links in it to look for Artificial Intelligence related information in each of the crawled links.
The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com/..., etc. In the code, the main div has a class called "rss-block", which contains a nested element with class "data"; each "data" element has several tags with links in them. The value in href gives the links to crawl. We need to look for AI-related articles in each of the links found by scraping the structure described above.
I've tried many variations of the following code, but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawl through each of these links to scrape articles related to AI in them.
If you could help me with both parts of the problem, I would learn a lot. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML structure. Thanks in advance!

The first twenty results are stored in the HTML as you see them on the page. The others are pulled from a script tag, and you can regex them out to build the full list of 67. Then loop over that list and issue requests to those links for further info. I offer a choice of two different selectors for the initial list population (the second, commented out, uses :contains, available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re
p = re.compile(r'feed_domain":"(.*?)",')
with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1 +
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results+= [re.sub(r'\n\s+','',i.replace('\\','')) for i in p.findall(r.text)]
    for link in results:
        #do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
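To see what the regex step does in isolation, here is a small sketch that applies the same pattern to a made-up script fragment. The sample text is hypothetical and only imitates JSON with escaped slashes; the real script content on the page may look different.

```python
import re
from typing import List

# Same pattern as in the answer: capture everything between feed_domain":" and ",
FEED_DOMAIN = re.compile(r'feed_domain":"(.*?)",')


def extract_feed_domains(script_text: str) -> List[str]:
    """Return the feed_domain values with backslash escapes and line breaks removed."""
    cleaned = []
    for raw in FEED_DOMAIN.findall(script_text):
        unescaped = raw.replace("\\", "")  # turns http:\/\/x into http://x
        cleaned.append(re.sub(r"\n\s+", "", unescaped))
    return cleaned
```

The non-greedy (.*?) stops at the first quote followed by a comma, which is why each URL is captured separately instead of one long match spanning several entries.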

To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 
'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
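In the output above, each inner list appears to hold the feed URL, the Feedspot follow link and the site link, in that order. Assuming that ordering holds for every block, the "Site" links the question asks for can be picked out as sketched below. The function name site_links is made up.

```python
from typing import List


def site_links(blocks: List[List[str]]) -> List[str]:
    """Take the last link of each block and drop duplicates, keeping the original order."""
    seen = set()
    sites = []
    for links in blocks:
        if not links:  # skip blocks without any links
            continue
        site = links[-1]  # assumption: the "Site" link comes last in each block
        if site not in seen:
            seen.add(site)
            sites.append(site)
    return sites
```

Calling site_links(results) on the list built above gives one URL per feed to crawl next.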

Trouble Scraping site with BS4

Usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping the table on this site for a research project I'm working on. I plan to verify that the script works on one state before entering the URLs of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing "table" just to ensure that the information I'm looking for is within this section
I'm not sure if the site is attempting to block people from scraping, but all the info that I'm looking to grab is within "&quot" if you look at what table outputs.

The text is rendered with JavaScript.
First, render the page with dryscrape.
(If you don't want to use dryscrape, see "Web-scraping JavaScript page with Python".)
After the page has been rendered, the text can be extracted from a different position on the page, i.e. the place it has been rendered to.
As an example, this code extracts the HTML of the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...

So I finally managed to solve the issue and successfully grab the data from the JavaScript page. The following code worked for me, in case anyone encounters the same issue when trying to use Python to scrape a JavaScript webpage on Windows (where dryscrape is incompatible).
import bs4 as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    trip = str(n.text)
    data.append(trip)
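On the "&quot" seen in the question's printed output: that is how a double quote is written inside HTML markup. If the data in such an attribute turns out to be JSON (an assumption, not something confirmed in this thread), it can be decoded with the standard library alone. The sample string below is made up.

```python
import html
import json
from typing import Any


def decode_escaped_json(attribute_value: str) -> Any:
    """Turn HTML-escaped JSON such as {&quot;id&quot;: 1} into Python objects."""
    # html.unescape converts &quot; back to a double quote; json.loads then parses it.
    return json.loads(html.unescape(attribute_value))
```

For example, decode_escaped_json('{&quot;id&quot;: 284}') gives a dict with the key "id". If the attribute holds something other than JSON, json.loads raises a ValueError, which is a quick way to find out.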
