How do I web scrape JavaScript generated contents?

I want to scrape everything under "Notes to Financial Statements". How can I do that? This is the link to the webpage: Please Click
Each item under "Notes to Financial Statements" is generated only after I click its link. I want to get the source of each item and parse it, for example "Summary of Significant Accounting Policies".
Thank you!
Updated: 2019-10-22
Now my question boils down to how I can extract the IDs (i.e., r9, r10, etc.) from the following. They are in one of the <li>s whose class = "accordion". There is an <a href="#" id="menu_cat2"> in that <li>.
Not the best code. This is how I managed to do it:
lis = soup.select("li.accordion")
notes = [str(li) for li in lis if "menu_cat3" in str(li)]
ids = re.findall(r'id="(r\d+)"', notes[0])

You can use Selenium or Scrapy:
https://selenium-python.readthedocs.io/index.html
Approach (see https://selenium-python.readthedocs.io/navigating.html):
1. Navigate to that link.
2. Click on that HTML element; it can be located with XPath or CSS selectors.
3. Get the text contents.
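The three steps above might look roughly like this with Selenium 4. This is a sketch under assumptions, not a tested scraper for the page in the question: the function name, both CSS selectors, and the 10-second timeout are placeholders, and a local Chrome installation is assumed.

```python
def scrape_after_click(url, click_selector, content_selector, timeout=10):
    """Open url, click one element, and return the text of another.

    Both selectors are CSS selectors supplied by the caller; nothing here
    is specific to any particular site.
    """
    # Imported inside the function so this module can be loaded on machines
    # without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome is installed locally
    try:
        driver.get(url)  # 1. navigate to the link
        wait = WebDriverWait(driver, timeout)
        # 2. wait until the element is clickable, then click it
        wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, click_selector))
        ).click()
        # 3. wait for the JavaScript-generated content and read its text
        element = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, content_selector))
        )
        return element.text
    finally:
        driver.quit()  # always close the browser, even on errors
```

The selector values have to be read from the page's own markup; explicit waits are used here instead of fixed sleeps so the script continues as soon as the generated content appears.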

The page makes requests based on the ids of the lis in that section. Gather the ids, convert to upper case and make the same requests. Requires bs4 4.7.1+
from bs4 import BeautifulSoup as bs
import requests
with requests.Session() as s:
    s.headers = {'User-Agent':'Mozilla/5.0'}
    r = s.get('https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-18-000145&xbrl_type=v#')
    soup = bs(r.content, 'lxml')
    # one URL per li in the menu_cat3 section, built from its upper-cased id
    urls = [f'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/{i["id"].upper()}.htm' for i in soup.select('li:has(#menu_cat3) .accordion')]
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        print([i.text for i in soup.select('font')])
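To see what the id-to-URL step does without hitting the network, here is a small offline sketch. The HTML fragment is made up for illustration (only the ids r9 and r10 and the menu_cat3 id appear in the thread), and it uses a regular expression, like the asker's own snippet, rather than the :has() selector.

```python
import re

# Hypothetical fragment standing in for the "Notes" section of the page.
SAMPLE_HTML = '''
<li class="accordion"><a href="#" id="menu_cat3">Notes to Financial Statements</a>
  <ul>
    <li class="accordion" id="r9"><a href="#">Note A</a></li>
    <li class="accordion" id="r10"><a href="#">Note B</a></li>
  </ul>
</li>
'''

# Base path taken from the answer's list comprehension.
BASE = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000145"

def note_urls(html, base=BASE):
    """Return one document URL per id of the form r<digits> found in html."""
    ids = re.findall(r'id="(r\d+)"', html)
    # The page requests the upper-cased id, e.g. r9 -> R9.htm
    return [f"{base}/{note_id.upper()}.htm" for note_id in ids]

for url in note_urls(SAMPLE_HTML):
    print(url)
```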

Related

Extract single URL from webpage by scraping

I have been trying to scrape a website such as the one below. The footer contains a bunch of social media links, of which the LinkedIn URL is the one I care about. Is there a way to fish out only that link, maybe using regex or another Python library?
This is what I have tried so far -
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text,"html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
But I'm fetching all the URLs instead of the one I'm looking for.
Note: I'd appreciate generic code that I can use for other sites as well.
Thanks in advance for your suggestions/help.

One approach could be to use css selectors and look for string linkedin.com/company/ in values of href attributes:
soup.select_one('a[href*="linkedin.com/company/"]')['href']
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text,"html.parser")
# single (first) link
link = e['href'] if(e := soup.select_one('a[href*="linkedin.com/company/"]')) else None
# multiple links
links = [link['href'] for link in soup.select('a[href*="linkedin.com/company/"]')]
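Since the question asks for something reusable across sites, the same substring match can be wrapped in a small helper. The sketch below is an illustration, not part of the answer: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages, and the names HrefCollector and find_links, as well as the sample footer, are made up.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.append(href)

def find_links(html, needle="linkedin.com/company/"):
    """Return all hrefs that contain needle, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if needle in href]

# Hypothetical footer markup, for demonstration only.
sample = '''
<footer>
  <a href="https://www.facebook.com/example">Facebook</a>
  <a href="https://www.linkedin.com/company/example-org/">LinkedIn</a>
  <a>no href here</a>
</footer>
'''
links = find_links(sample)
first = links[0] if links else None  # None when the page has no such link
print(first)
```

Passing a different needle (for example a Twitter or Facebook path) reuses the helper for other link types.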

How to webscrape a text inside a link in Python?

I would like to webscrape the following page: https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html
In particular, I would like to get the text inside every link you see displayed when clicking on the link above. So far I can do it only by clicking on each link manually. For example, for the first one:
import pandas as pd
from bs4 import BeautifulSoup
import requests
x = "https://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html"
x1=[requests.get(x)]
x2 = [BeautifulSoup(x1[0].text)]
x3 = [x2[0].select("p+ p") for i in range(len(x2)-1)]
The problem is that I am not able to automate the process that leads me from the url with the list of links containing text (https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html) to the actual link where the text I need is stored (e.g. https://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html)
Can anyone help me?
Thanks!

To get a list of all links on https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html')
soup = BeautifulSoup(r.text, 'html.parser')
links = [link.get('href') for link in soup.find_all('a')]

Wouter's answer is correct for getting all links, but if you need just the title links, you could try a more specific selector query like select('div.title > a'). Here's an example:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
html = BeautifulSoup(requests.get(url).text, 'html.parser')
links = html.select('div.title > a')
for link in links:
    print(link.attrs['href'])

"In particular, I would like to get the text inside every link you see displayed clicking on the link above."
To get the text of every linked article, you have to iterate over your list of links and request each of them:
for link in soup.select('div.title > a'):
    article_url = f"https://www.ecb.europa.eu{link['href']}"
    article = BeautifulSoup(requests.get(article_url).content, 'html.parser')
    data.append({
        'title': link.text,
        'url': article_url,  # URL of the article itself, not of the index page
        'subtitle': article.main.h2.text,
        'text': ' '.join(p.text for p in article.select('main .section p:not([class])'))
    })
Example
Contents are stored in a list of dicts, so you can easily access and process the data later.
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = []
for link in soup.select('div.title > a'):
    article_url = f"https://www.ecb.europa.eu{link['href']}"
    article = BeautifulSoup(requests.get(article_url).content, 'html.parser')
    data.append({
        'title': link.text,
        'url': article_url,  # URL of the article itself, not of the index page
        'subtitle': article.main.h2.text,
        'text': ' '.join(p.text for p in article.select('main .section p:not([class])'))
    })
print(data)
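The loop above builds each article URL by prefixing the host, which works as long as every href starts with "/". A slightly more defensive alternative, sketched here as an option rather than a correction, is urllib.parse.urljoin from the standard library, which also copes with page-relative and already-absolute hrefs. The hrefs below are illustrative inputs; only the first mirrors a URL from the question, and example.org is a placeholder.

```python
from urllib.parse import urljoin

index_url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"

hrefs = [
    # root-relative, as the answer's f-string assumes
    "/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html",
    # relative to the index page's own directory
    "ecb.in211222~5f9a709924.en.html",
    # already absolute (placeholder host)
    "https://example.org/page.html",
]

# urljoin resolves each href against the page it was found on.
resolved = [urljoin(index_url, href) for href in hrefs]
for article_url in resolved:
    print(article_url)
```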

Web scraping href link in Python with Beautifulsoup

I'm trying to write a web scraper to collect information from LinkedIn job posts, including the job description, date, role, and the link to the post. While I have made good progress on the job information, I'm stuck on how to get the 'href' link of each job post. I have tried several approaches, including driver.find_element_by_class_name and the select_one method, but neither finds the 'canonical' link; both return None. Could you shed some light on this?
This is the part of my code that tries to get the href link:
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('link'):
    print(link.get('href'))
link: https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click

I think you were accessing the href attribute incorrectly. To access an attribute, use object["attribute_name"].
This works for me, searching only for links where rel = "canonical":
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('link', rel='canonical'):
    print(link['href'])

The <link> has an attribute of rel="canonical". You can use an [attribute=value] CSS selector: [rel="canonical"] to get the value.
To use a CSS selector, use the .select_one() method instead of find().
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup.select_one('[rel="canonical"]')['href'])
Output:
https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D

Web Scraping through links with Beautiful Soup

I'm trying to Scrape a blog "https://blog.feedspot.com/ai_rss_feeds/" and crawl through all the links in it to look for Artificial Intelligence related information in each of the crawled links.
The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com/..., etc. In the code, the main div has a class called "rss-block", which has a nested class called "data"; each "data" element has several tags, and those tags contain <a> elements. The value of each href gives the link to crawl. We need to look for AI-related articles in each of the links found by scraping this structure.
I've tried various variations of the following code but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawl through each of them to scrape the AI-related articles.
If you could help me finish both the parts of the problem, that'd be a great learning for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML Structure. Thanks in advance!

The first twenty results are stored in the HTML, as you see on the page. The others are pulled from a script tag, and you can regex them out to build the full list of 67. Then loop over that list and issue requests to each link for further info. I offer a choice of two selectors for the initial list population (the second, commented out, uses :contains, available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re
p = re.compile(r'feed_domain":"(.*?)",')
with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1 +
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+','',i.replace('\\','')) for i in p.findall(r.text)]
    for link in results:
        #do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
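To make the regex step concrete, here is an offline sketch of what the pattern and the backslash cleanup do. The script fragment is invented for illustration (the real page's payload is not reproduced here); only the feed_domain key and the pattern come from the answer.

```python
import re

# Pattern from the answer: capture the value that follows feed_domain":"
pattern = re.compile(r'feed_domain":"(.*?)",')

# Hypothetical excerpt of a script tag with JSON-escaped slashes (\/).
script_text = (
    '{"feed_title":"Example A","feed_domain":"https:\\/\\/example.com\\/","id":1},'
    '{"feed_title":"Example B","feed_domain":"http:\\/\\/example.org\\/blog\\/","id":2}'
)

# Dropping the backslashes turns https:\/\/example.com\/ into a usable URL.
domains = [match.replace('\\', '') for match in pattern.findall(script_text)]
print(domains)
```

The lazy `.*?` matters here: a greedy `.*` would run to the last `",` in the text and swallow everything between the first and last entries.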

To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 
'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
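In the output above, each inner list appears to follow the order feed URL, Feedspot follow link, site URL. Assuming that order holds for every block (it does for the rows shown, but that is an observation, not a guarantee), the "Site" links the question asks for are the last item of each list:

```python
# First two rows copied from the output above.
results = [
    ['http://aitrends.com/feed',
     'https://www.feedspot.com/?followfeedid=4611684',
     'http://aitrends.com/'],
    ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml',
     'https://www.feedspot.com/?followfeedid=4611682',
     'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'],
]

# Keep the last link of every non-empty block.
site_links = [block[-1] for block in results if block]
print(site_links)
```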

Scraping pricing off a search bar from a website with Python

I have a list of part numbers to look up through the search bar on this site: https://www.partsfinder.com/catalog/preview?q=0119000230
I want to collect the prices from the results.
Here's what I put together, but I'm not sure where to go from here:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.partsfinder.com/catalog/preview?q=0119000230')
soup = BeautifulSoup(r.text, 'html.parser')
resultsRow = soup.find_all('a', {'class': 'search_result_row'})
results = []
Any help appreciated, thanks!

The content is loaded dynamically via an XHR POST request. You can see this in the browser's dev tools when refreshing the page. The request can be reduced to the following:
import requests
data = {"partOptionFilter":{"PartNumber":"0119000230","AlternativeOemId":"17155"}}
r = requests.post('https://www.partsfinder.com/Catalog/Service/GetPartOptions',json=data).json()
print(r['Data']['PartOptions'][0]['YourPrice'])
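Because the question starts from a list of part numbers, the request above can be wrapped in a loop. The sketch below rests on several assumptions: the helper names are made up, the AlternativeOemId value is simply copied from the answer (its meaning is not explained there), and the response shape is inferred from the single print line above, so the fake response is hypothetical.

```python
URL = "https://www.partsfinder.com/Catalog/Service/GetPartOptions"

def build_payload(part_number, alternative_oem_id="17155"):
    """Build the JSON body used in the answer for one part number."""
    return {"partOptionFilter": {"PartNumber": part_number,
                                 "AlternativeOemId": alternative_oem_id}}

def extract_price(response_json):
    """Return the first option's YourPrice, or None if there are no options."""
    options = (response_json.get("Data") or {}).get("PartOptions") or []
    return options[0].get("YourPrice") if options else None

def fetch_prices(part_numbers):
    """POST one request per part number and map each number to its price."""
    # Third-party package, imported here so the helpers above work without it.
    import requests
    prices = {}
    with requests.Session() as session:
        for number in part_numbers:
            response = session.post(URL, json=build_payload(number), timeout=10)
            response.raise_for_status()
            prices[number] = extract_price(response.json())
    return prices

# Offline check with a hypothetical response of the assumed shape.
fake_response = {"Data": {"PartOptions": [{"YourPrice": 12.34}]}}
print(extract_price(fake_response))
```

Calling fetch_prices(["0119000230"]) would issue the live request; returning None for empty results keeps one unknown part number from aborting the whole run.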
