How to efficiently scrape data from dynamic websites using Selenium? - python

I want to scrape data from the https://ksanahealth.com/mental-health-blog/ website.
I am trying to access each blog post, click on its link, and scrape the details on that post's detail page.
I tried to use BeautifulSoup but it returned no data, and I realized the data was loaded dynamically with JavaScript.
Then I tried to use Selenium to scrape it, and this is the code I came up with:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service('/usr/bin/chromedrivers')
service.start()
driver = webdriver.Remote(service.service_url)
driver.get('https://ksanahealth.com/mental-health-blog/')
driver.quit()
Unfortunately, my code returns no results.
How best can I improve it so that I get the desired results from the blog?

You don't need Selenium for this. When a page is loaded dynamically, you can look in the browser's Network tab to see which URLs are being accessed. The following code will get you started, returning a dataframe with each blog's title and URL. You can then access those URLs yourself. Do tell if you need guidance.
The code is below:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'accept': 'application/json'
}
df_list = []
for x in range(1, 5):
    r = requests.get(f'https://ksanahealth.com/wp-admin/admin-ajax.php?id=&post_id=107&slug=mental-health-blog&canonical_url=https%3A%2F%2Fksanahealth.com%2Fmental-health-blog%2F&posts_per_page=10&page={x}&offset=0&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&order=DESC&orderby=date&action=alm_get_posts&query_type=standard', headers=headers)
    soup = BeautifulSoup(r.json()['html'], 'html.parser')
    for y in soup.select('div.post-item'):
        df_list.append((y.select_one('h4').text.strip(), y.select_one('a.more-link').get('href')))
df = pd.DataFrame(df_list, columns=['Title', 'URL'])
print(df)
This returns:
Title URL
0 Addressing the Youth Mental Health Crisis Requ... https://www.hmpgloballearningnetwork.com/site/...
1 Remote work: What does it mean for local offic... https://www.klcc.org/2022-02-23/remote-work-wh...
2 Second Nature? https://www.oregonbusiness.com/article/tech/it...
3 6 Benefits of Continuous Behavioral Health Mea... https://ksanahealth.com/post/6-benefits-of-con...
4 A New Level of Measurement-Based Care https://ksanahealth.com/post/a-new-level-of-me...
5 4 Ways Continuous Behavioral Health Measuremen... https://ksanahealth.com/post/4-ways-continuous.
[....]
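If you then want the detail pages, here is a minimal sketch of how you could iterate over the URLs collected in df above. The h1 and div.post-content selectors are assumptions about the detail-page markup, so inspect an actual post and adjust them:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

for url in df['URL']:
    # External press links use different templates; only parse on-site posts
    if 'ksanahealth.com' not in url:
        continue
    r = requests.get(url, headers=headers)
    post = BeautifulSoup(r.text, 'html.parser')
    title = post.select_one('h1')               # assumed selector for the post title
    body = post.select_one('div.post-content')  # assumed selector for the post body
    print(title.get_text(strip=True) if title else url)
    if body:
        print(body.get_text(strip=True)[:200])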

Related

Is there a way to make the html elements of a website more visible?

While scraping the following website (https://www.middletownk12.org/Page/4113), this code could not locate the table rows (to get the staff name, email & department), even though they are visible when I use the Chrome developer tools. The soup object does not contain the tr tags that hold the needed info.
import requests
from bs4 import BeautifulSoup

url = "https://www.middletownk12.org/Page/4113"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(response.text)
I tried different libraries such as bs4, requests & Selenium with no luck. I also tried CSS selectors & XPath with Selenium, again with no luck. The tr elements could not be located.
That table of contact information is filled in by Javascript after the page has loaded. The content doesn't exist in the page's HTML and you won't see it using requests.
By using the developer tools available in the browser, we can examine the requests made after the page has loaded. There are a lot of them, but at least in my browser it's obvious the contact information is loaded near the end.
Looking at the request log, I see a request for a spreadsheet from docs.google.com.
If we examine that entry, we find that it's a request for:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0
And if we fetch the above link, we get a spreadsheet with the source data for that table.
Actually I used Selenium & then bs4 without any results. The code does not find the 'tr' elements...
Why are you using Selenium? The whole point of this answer is that you don't need to use Selenium if you can figure out the link to retrieve the data -- which we have.
All we need is requests to fetch the data and BeautifulSoup to parse it:
import requests
import bs4

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(f"{link.text}: {link.get('href')}")

Why can't I scrape all the data from Zillow?

I'm trying to scrape data (prices) from Zillow as Python practice, and I'm not getting the complete data.
This is my code
from jobEntryBot import JobEntryBot
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from pprint import pprint
import time
import requests

URL_ZILLOW = r"https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-123.4663871665039%2C%22east%22%3A-121.7744926352539%2C%22south%22%3A37.03952097286371%2C%22north%22%3A38.19687379258651%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22price%22%3A%7B%22max%22%3A872627%7D%2C%22beds%22%3A%7B%22min%22%3A1%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22mp%22%3A%7B%22max%22%3A3000%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A9%7D"
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/511.22 (KHTML, like Gecko) Chrome/139.3.3.3 Safari/312.311',
    'Accept-Language': 'en-US,en;q=0.9'
}
data = requests.get(headers=header, url=URL_ZILLOW)
soup = BeautifulSoup(data.text, "html.parser")
selector_for_prices = ".gMDnGj span"
prices = soup.select(selector_for_prices)
for price in prices:
    print(price.text)
I try this but only get 9 prices, not all of the 40-something prices on the webpage.
I've tried using other functions like soup.find_all(), but it doesn't work. I've even tried using Selenium.
If I inspect the Zillow page and use the selector from my code, it works there, but not in my code.
PS: I changed the user agent in the code shown, FYI.
Since the website has bot-detection capabilities, you will first need to find a way to avoid detection. This post contains a comprehensive list of methods to avoid detection.
It may also be worth looking into the APIs Zillow offers, as it does not seem like there will be a simple way to scrape their website. But if you're just doing this for fun or as a personal learning experience, then it's definitely worth taking some time to figure out the best approach to scraping Zillow.
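On the "only 9 prices" symptom specifically: result lists like this often lazy-load cards as you scroll, so one approach worth trying is a browser-based sketch that scrolls before parsing. This reuses URL_ZILLOW and the .gMDnGj span selector from the question (the selector may well be stale), and the scroll target and timings are assumptions to verify:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(URL_ZILLOW)  # URL_ZILLOW as defined in the question's code

# Scroll in steps so lazily loaded listing cards get a chance to render.
# If nothing new appears, the scrollable element may be an inner results
# pane rather than the window itself (an assumption to verify manually).
for _ in range(10):
    driver.execute_script("window.scrollBy(0, 800);")
    time.sleep(1)  # crude wait for the next batch of cards

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
prices = soup.select(".gMDnGj span")  # selector from the question; may be stale
print(len(prices))
for price in prices:
    print(price.text)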

Why is BeautifulSoup leaving out parts of a website?

I'm completely new to Python and wanted to dip my toes into web scraping, so I tried to scrape the rankings of players at https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3#
But when I try to access the rankings and ratings of each player, it returns None. This is all inside the tbody, so I assume BeautifulSoup isn't able to access it because it's JavaScript, but I'm not sure. Please help.
Input:
from bs4 import BeautifulSoup
import requests
URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")
tbody = October_Nac_2022.tbody
print(tbody)
Output:
None
In this case the problem is not with BS4 but with your analysis before starting the scraping. The data you are looking for is not available directly from the request you have made.
To get the data, you have to make a request to a different back-end URL, https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name, which will give you a JSON response.
The code will look something like this:
from requests import get

url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
                             'X-Requested-With': 'XMLHttpRequest'})
print(response.json())
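To work with the response more comfortably, here is a small follow-up sketch. It assumes the endpoint returns a list of competitor records (verify by printing the raw JSON first), and the column names depend entirely on the actual payload:

import pandas as pd
from requests import get

url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
                             'X-Requested-With': 'XMLHttpRequest'})

# Assumes the payload is a list of dicts; a dict payload would need pd.json_normalize instead.
df = pd.DataFrame(response.json())
print(df.columns.tolist())
print(df.head())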
If you want to test what BS4 can do on a static page, consider the below example, which fetches the blog post links from this page:
from requests import get
from bs4 import BeautifulSoup as bs

url = "https://www.zyte.com/blog/"
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'})
soup = bs(response.content, 'html.parser')
posts = soup.find_all('div', {"class": "oxy-posts"})
print(len(posts))
Note:
Before writing scraping code, analyse the website thoroughly. It will give you an idea of where the website actually gets its data.

Scraping table data based on date

I am trying to scrape the exchange-rate (kurs) transaction table from https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx
for 2015-2020, but the problem is that the URL is the same for the default date and for any date I choose. So how can I tell Python to scrape data from 2015-2020 (20-Nov-15 to 20-Nov-20)? I'm very new to Python and am using Python 3. Thank you in advance.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = "https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx"
response = requests.get(url, headers=headers)
content = response.content
print(content)
A few different approaches:
Use array slicing if you are working with one-dimensional data.
Use the filter / groupby methods from the pandas library after putting your data into a dataframe.
The website requires you to enter start and end dates for the query. However, as far as I know, bs4 only scrapes HTML that is already rendered, and is not useful for submitting the query on the website itself.
From the source code and the POST request, it looks like a complicated request, so you might be better off simulating mouse clicks.
This can be done with the Selenium browser-automation package: open the Chrome browser, enter the dates into the From and To fields, click the Lihat button, wait for the page to load, then scrape the displayed table using bs4 or Selenium, as in the sketch below.
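Here is a minimal sketch of that flow. The element locators (the date-field IDs and the Lihat button) are assumptions I have not verified against the live page, so inspect the form and substitute the real ones:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx")

# Hypothetical locators -- inspect the page and replace with the real IDs.
date_from = driver.find_element(By.ID, "DateFrom")
date_to = driver.find_element(By.ID, "DateTo")
submit = driver.find_element(By.XPATH, "//input[@value='Lihat']")

date_from.clear()
date_from.send_keys("20-Nov-15")
date_to.clear()
date_to.send_keys("20-Nov-20")
submit.click()

time.sleep(5)  # crude wait; prefer WebDriverWait on the results table

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Pull every row of the results table (table structure is an assumption).
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        print(cells)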

Python web scraping: weather forecast

I'm new to Python (actually, this is the second time I've tried to learn the language, so I know a little) and I'm trying to build a script that scrapes a weather forecast.
Now I have a little problem finding the right HTML classes to use in Python. I have this code now:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(city_name)
The problem is that this just returns None.
I found the class that the code searches for via Chrome's Inspect tool. If I dump the HTML page through Python with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(soup.prettify())
Then I see the HTML page in cmd (as expected), but I'm also unable to find class="weather-widget__city-name" in it, so I'm not surprised that Python can't either. My question is: why is the HTML that Python gives me different from the HTML Chrome shows on the site? And am I doing something wrong in trying to find the weather widget through BeautifulSoup this way?
Here is a picture from the page; the part that I'm trying to scrape is circled in red.
[Screenshot from website]
Thanks in advance!
That site is loaded with JS.
Python requests doesn't run those scripts, and one of them is responsible for loading the data you are looking for (you can tell it's JS, maybe with a bit of jQuery, I didn't actually check, by the spinning circle while it loads).
My suggestion here is to use the site's API.
I am not subscribed to the site so I can't show an example here, but the trick is simple: subscribe to the site's API on the basic (and free) plan, get an API key, and start sending GET requests to the API URLs.
This will also simplify things further, since you won't need BeautifulSoup for the parsing: the responses are all JSON.
There is another, nastier way around it, and that is Selenium. It simulates a web browser, with all of its JS execution, HTML rendering, and CSS loading.
I have experience with both, and I strongly recommend sticking to the API (if that option exists).
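For OpenWeatherMap specifically, the free current-weather endpoint looks roughly like the sketch below. YOUR_API_KEY is a placeholder for the key you get when you sign up, and the response-field names follow the documented JSON format, which you should confirm against https://openweathermap.org/current:

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: the key from your openweathermap.org account
CITY_ID = 2743477         # the city ID from the question's URL

url = "https://api.openweathermap.org/data/2.5/weather"
resp = requests.get(url, params={"id": CITY_ID, "units": "metric", "appid": API_KEY})
resp.raise_for_status()
data = resp.json()

# Field names per the documented response format.
print(data["name"])                       # city name
print(data["main"]["temp"])               # temperature in Celsius (units=metric)
print(data["weather"][0]["description"])  # e.g. "broken clouds"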
For sites that use JS to send further requests, after we request the initial URL, one method that can work is to study the network tab of Chrome's developer tools (or the equivalent in any other browser).
You'll usually find a big list of URLs being requested by the browser. Most of them are unnecessary for our purposes, and a few of them relate to other sites such as Google and Facebook.
In this particular case, after the initial URL is requested, you'll find a few '.js' files being retrieved and after that, three scripts (forecast, weather, daily) that correspond to the data which finally gets presented by the browser.
From those three, the data you ask for comes from the 'weather' script. If you click on it in the network tab, another side pane will open which will contain header information, preview, etc.
In the Headers tab, you'll find the URL that you need to use, which is:
https://openweathermap.org/data/2.5/weather?id=2743477&units=metric&appid=b1b15e88fa797225412429c1c50c122a1
The b1b15e88fa797225412429c1c50c122a1 might be a general API key that is assigned to a browser request. I don't know for sure. But all we need to know is that it doesn't change. I've tried on two different systems and this value doesn't change.
The 2743477 is, of course, the city ID. You can download a reference of various cities and their IDs from the site itself:
http://bulk.openweathermap.org/sample/
As nutmeg64 said, the site actually responds with a JSON file. That's the case with both the API and the request of this URL found in the network tab of a browser.
As for the codes appearing in the JSON, the site gives you a reference to the codes and their meanings:
https://openweathermap.org/weather-conditions
With this information, you can use requests and json to retrieve and manipulate the data. Here's a sample script:
from pprint import pprint
import json
import requests

city_id = 2743477
url = 'https://openweathermap.org/data/2.5/weather?id={}&units=metric&appid=b1b15e88fa797225412429c1c50c122a1'.format(city_id)
req_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'openweathermap.org',
    'Referer': 'https://openweathermap.org/city/2743477',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

s = requests.Session()
r = s.get(url, headers=req_headers)
d = json.loads(r.text)
pprint(d)
However, as nutmeg64 said, it's better to use the API, and resist the temptation to bombard the site with more requests than you truly need.
You can find all about their API here:
https://openweathermap.org/current
Use Selenium in combination with BeautifulSoup to get any of the tables from that page without hardship. Here is how you can do it:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table_tag = soup.select(".weather-widget__items")[0]
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
for data in tab_data:
    print(data)
Partial result:
['Wind', 'Gentle Breeze,\n 3.6 m/s, Southwest ( 220 )']
['Cloudiness', 'Broken clouds']
['Pressure', '1014 hpa']
['Humidity', '100 %']
['Sunrise', '11:53']
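If you want keyed access to those rows, a small follow-up using the tab_data list from the code above could be:

# Build a dict from the two-column rows for easy lookup.
weather = {row[0]: row[1] for row in tab_data if len(row) == 2}
print(weather.get('Humidity'))  # e.g. '100 %'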
