I want to parse a list of links from this website.
I am trying to do this with the requests library in Python. However, when I read the HTML with bs4 there aren't any links, just an empty ul:
<ul class="ais-Hits-list"></ul>
How can I get these links?
Edit:
The code I tried so far:
import requests
from bs4 import BeautifulSoup

link = "https://www.over-view.com/digital-index/"
r = requests.get(link)
soup = BeautifulSoup(r.content, 'lxml')
Since the information loads dynamically on that website, you can use Selenium to collect the required information:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--window-size=1920x1080")

path_to_chromedriver = 'chromedriver'
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=path_to_chromedriver)

driver.get('https://www.over-view.com/digital-index/')
time.sleep(5)

soup = BeautifulSoup(driver.page_source, "lxml")
rows = soup.select("ul.ais-Hits-list > li > a")
for row in rows:
    print(row.get('href'))
Example of output:
/overviews/adelaide-canola-flowers
/overviews/adelaide-rift-complex
/overviews/adriatic-tankers
/overviews/adventuredome
/overviews/agricultural-development
/overviews/agricultural-development
/overviews/agricultural-development
/overviews/agriculture-development
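As a side note, instead of a fixed time.sleep(5) you can wait explicitly until the hits list is populated. This is only a minimal sketch, assuming a recent Selenium with chromedriver available on PATH:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.over-view.com/digital-index/')

# Wait up to 15 seconds for at least one link to show up inside the hits list.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "ul.ais-Hits-list > li > a"))
)

soup = BeautifulSoup(driver.page_source, "lxml")
for row in soup.select("ul.ais-Hits-list > li > a"):
    print(row.get('href'))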
There is also a slightly more involved way (don't judge too harshly, this was my first time trying this approach): you can make the same request to the API that their frontend makes. Plus, this code executes asynchronously thanks to asyncio + aiohttp.
Keep in mind that I picked an arbitrary number of pages to iterate over and didn't handle possible errors (you will need to fine-tune it).
Code without Selenium WebDriver
import json
import asyncio
import aiohttp
URL = "https://ai7o5ij8d5-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser (lite); react (16.13.1); react-instantsearch (5.7.0); JS Helper (2.28.0)&x-algolia-application-id=AI7O5IJ8D5&x-algolia-api-key=7f1a509e834f885835edcfd3482b990c"
async def scan_single_digital_index_page(page_num, session):
    body = {
        "requests": [
            {
                "indexName": "overview",
                "params": f"query=&hitsPerPage=30&maxValuesPerFacet=10&page={page_num}&highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&facets=%5B%22_tags.name%22%5D&tagFilters=",
            }
        ]
    }
    async with session.post(URL, json=body) as resp:
        received_data = await resp.json()
        results = received_data.get("results")
        hits = results[0].get("hits")
        links = list()
        for hit in hits:
            for key, value in hit.items():
                if key == "slug":
                    links.append("https://www.over-view.com/overviews/" + value)
        return links


async def scan_all_digital_index_pages(session):
    tasks = list()
    max_pages = 20
    for page_num in range(1, max_pages):
        task = asyncio.create_task(scan_single_digital_index_page(page_num, session))
        tasks.append(task)
    all_lists = await asyncio.gather(*tasks)
    # Unpack all lists with links into a single set of all links.
    all_links = set()
    for l in all_lists:
        all_links.update(l)
    return all_links


async def main():
    async with aiohttp.ClientSession() as session:
        all_links = await scan_all_digital_index_pages(session)
        for link in all_links:
            print(link)


if __name__ == "__main__":
    asyncio.run(main())
Example result for the first page
https://www.over-view.com/overviews/adelaide-canola-flowers
https://www.over-view.com/overviews/adelaide-rift-complex
https://www.over-view.com/overviews/adriatic-tankers
https://www.over-view.com/overviews/adventuredome
https://www.over-view.com/overviews/agricultural-development
https://www.over-view.com/overviews/agricultural-development
https://www.over-view.com/overviews/agricultural-development
https://www.over-view.com/overviews/agriculture-development
https://www.over-view.com/overviews/akimiski-island
https://www.over-view.com/overviews/al-falah-housing-project
https://www.over-view.com/overviews/alabama-tornadoes
https://www.over-view.com/overviews/alakol-lake
https://www.over-view.com/overviews/albenga
https://www.over-view.com/overviews/albuquerque-baseball-complex
https://www.over-view.com/overviews/alta-wind-energy-center
https://www.over-view.com/overviews/altocumulus-clouds
https://www.over-view.com/overviews/amsterdam
https://www.over-view.com/overviews/anak-krakatoa-eruption-juxtapose
https://www.over-view.com/overviews/ancient-ruins-of-palmyra
https://www.over-view.com/overviews/andean-mountain-vineyards
https://www.over-view.com/overviews/angas-inlet-trees
https://www.over-view.com/overviews/angkor-wat
https://www.over-view.com/overviews/ankara-residential-development
https://www.over-view.com/overviews/antofagasta-chile
https://www.over-view.com/overviews/apple-park
https://www.over-view.com/overviews/aquatica-water-park
https://www.over-view.com/overviews/aral-sea
https://www.over-view.com/overviews/arc-de-triomphe
https://www.over-view.com/overviews/arecibo-observatory
https://www.over-view.com/overviews/arizona-rock-formations
For future changes (since there are many moving parts), you can find the details of their API in the browser's developer tools (Network tab).
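If you would rather keep things synchronous, the same API call also works with plain requests. This is just a minimal sketch: the trimmed URL and params string are my own simplification of the full ones used above, so fall back to those if the endpoint is picky about them.

import requests

# Same Algolia endpoint as above, keeping only the application id and API key.
URL = ("https://ai7o5ij8d5-dsn.algolia.net/1/indexes/*/queries"
       "?x-algolia-application-id=AI7O5IJ8D5"
       "&x-algolia-api-key=7f1a509e834f885835edcfd3482b990c")


def scan_page(page_num):
    body = {
        "requests": [
            {
                "indexName": "overview",
                # Simplified version of the params string sent by the frontend.
                "params": f"query=&hitsPerPage=30&page={page_num}",
            }
        ]
    }
    received = requests.post(URL, json=body).json()
    hits = received["results"][0]["hits"]
    return ["https://www.over-view.com/overviews/" + hit["slug"] for hit in hits]


if __name__ == "__main__":
    for link in scan_page(0):
        print(link)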
Related
Hi, I am trying to get the src data from an image on the website. I locate the image using its class since it is unique. With the code below it is able to locate the image, but it fails to save the image to MongoDB and it shows up as null, so I want to find the src and save the link instead.
P.S. The code works for other classes, but I am not sure how to locate the src and save it into "findImage".
https://myaeon2go.com/products/category/6236298/vegetable
The postal code is 56000.
cate_list = [
    "https://myaeon2go.com/products/category/1208101/fresh-foods",
    "https://myaeon2go.com/products/category/8630656/ready-to-eat",
    "https://myaeon2go.com/products/category/6528959/grocery",
    "https://myaeon2go.com/products/category/6758871/snacks",
    "https://myaeon2go.com/products/category/8124135/chill-&-frozen",
    "https://myaeon2go.com/products/category/4995043/beverage",
    "https://myaeon2go.com/products/category/3405538/household",
    "https://myaeon2go.com/products/category/493239/baby-&-kids",
]

cookies = {
    "hideLocationOverlay": "true",
    "selectedShippingState": "Kuala Lumpur",
    "selectedPostalCode": "56000",
}

for x in range(len(cate_list)):
    url = cate_list[x]

    # convert soup to readable html
    result = requests.get(url, cookies=cookies)
    doc = BeautifulSoup(result.text, "html.parser")

    # a for loop located here to loop through all the products
    # <span class="n_MyDQk4X3P0XRRoTnOe a8H5VCTgYjZnRCen1YkC">myAEON2go Signature Taman Maluri</span>
    findImage = j.find("img", {"class": "pgJEkulRiYnxQNzO8njV shown"})
To extract the value of the src attribute, simply call .get('src') on your element.
Try to change your strategy for selecting elements and avoid using classes that are often generated dynamically; I recommend using more static identifiers as well as the HTML structure.
for url in cate_list:
    result = requests.get(url, cookies=cookies, headers={'User-Agent': 'Mozilla/5.0'})
    doc = BeautifulSoup(result.text, "html.parser")

    for e in doc.select('.g-product-list li'):
        print(e.img.get('src'))
Note: Iterating over your list does not need the range(len()) construct.
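To illustrate that note, a trivial sketch of the difference:

items = ["a", "b", "c"]

# Indexing via range(len(...)) works, but is unnecessary ...
for i in range(len(items)):
    print(items[i])

# ... because you can iterate over the list directly.
for item in items:
    print(item)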
Example
import requests
from bs4 import BeautifulSoup
cate_list = [
    "https://myaeon2go.com/products/category/1208101/fresh-foods",
    "https://myaeon2go.com/products/category/8630656/ready-to-eat",
    "https://myaeon2go.com/products/category/6528959/grocery",
    "https://myaeon2go.com/products/category/6758871/snacks",
    "https://myaeon2go.com/products/category/8124135/chill-&-frozen",
    "https://myaeon2go.com/products/category/4995043/beverage",
    "https://myaeon2go.com/products/category/3405538/household",
    "https://myaeon2go.com/products/category/493239/baby-&-kids",
]

cookies = {
    "hideLocationOverlay": "true",
    "selectedShippingState": "Kuala Lumpur",
    "selectedPostalCode": "56000",
}

for url in cate_list:
    result = requests.get(url, cookies=cookies, headers={'User-Agent': 'Mozilla/5.0'})
    doc = BeautifulSoup(result.text, "html.parser")

    for e in doc.select('.g-product-list li'):
        print(e.img.get('src').split(')/')[-1])
Output
https://assets.myboxed.com.my/1659400060229.jpg
https://assets.myboxed.com.my/1662502067580.jpg
https://assets.myboxed.com.my/1658448744726.jpg
https://assets.myboxed.com.my/1627880003755.jpg
https://assets.myboxed.com.my/1662507451284.jpg
https://assets.myboxed.com.my/1662501936757.jpg
https://assets.myboxed.com.my/1659400602324.jpg
https://assets.myboxed.com.my/1627880346297.jpg
https://assets.myboxed.com.my/1662501743853.jpg
...
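Another option is to fetch all the category pages concurrently and have BeautifulSoup parse only the product <img> tags, which is what the code below does: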
import requests
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor, as_completed

cookies = {
    "hideLocationOverlay": "true",
    "selectedShippingState": "Kuala Lumpur",
    "selectedPostalCode": "56000",
}

links = [
    "8630656/ready-to-eat",
    "1208101/fresh-foods",
    "6528959/grocery",
    "6758871/snacks",
    "8124135/chill-&-frozen",
    "4995043/beverage",
    "3405538/household",
    "493239/baby-&-kids",
]

allin = []


def get_soup(content):
    # Parse only the product <img> tags to keep the parsing cheap.
    return BeautifulSoup(content, 'lxml', parse_only=SoupStrainer('img', class_="pgJEkulRiYnxQNzO8njV"))


def worker(req, url, link):
    r = req.get(url + link)
    soup = get_soup(r.content)
    return [urljoin(url, x['src']) for x in soup.select('img')]


def main(url):
    with requests.Session() as req, ThreadPoolExecutor(max_workers=10) as executor:
        req.cookies.update(cookies)
        fs = (executor.submit(worker, req, url, link) for link in links)
        for f in as_completed(fs):
            allin.extend(f.result())
        print(allin)


if __name__ == "__main__":
    main('https://myaeon2go.com/products/category/')
I am working on some stock-related projects where I have to scrape all data on a daily basis for the last 5 years, i.e. from 2016 to date. I thought of using Selenium in particular because I can use a crawler and bot to scrape the data based on the date. So I used button clicks with Selenium, and now I want the same data that is displayed by the Selenium browser to be fed to Scrapy.
This is the website I am working on right now.
I have written the following code inside a Scrapy spider.
import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):
        driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ... up to till date '01/10/2022'

        for date in floorsheet_dates:
            driver.get("https://merolagani.com/Floorsheet.aspx")
            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(z, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.url = driver.page_source
                yield Request(url=self.url, callback=self.parse)

    def parse(self, response, **kwargs):
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)
Update: the error after applying the answer is:
2022-01-14 14:11:36 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
result = current_context.run(gen.send, result)
File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/scrapy/crawler.py", line 88, in crawl
start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
I want to feed the current page's HTML content to Scrapy, but I have been getting this unusual error for the past 2 days. Any help or suggestions will be very much appreciated.
The two solutions are not very different; solution #2 fits your question better, but choose whichever you prefer. (The traceback shows that start_requests ended up returning None, i.e. not an iterable of requests; both solutions below make sure Scrapy gets an iterable.)
Solution 1 - create a response with the HTML body from the driver and scrape it right away (you can also pass it as an argument to a function):
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):
        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        driver = webdriver.Chrome()

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ... up to till date '01/10/2022'

        for date in floorsheet_dates:
            driver.get("https://merolagani.com/Floorsheet.aspx")
            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = driver.page_source

                response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                for value in response.xpath('//tbody/tr'):
                    print(value.css('td::text').extract()[1])
                    print("ok"*200)

        # Return an empty requests list so Scrapy still gets an iterable
        # (this is what the "'NoneType' object is not iterable" error was about).
        return []
Solution 2 - with a super simple downloader middleware:
(You might see a delay in the parse method here, so be patient.)
import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        url = spider.driver.current_url
        body = spider.driver.page_source
        return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
            # 'projects_name.path.to.your.pipeline': 543
        }
    }

    driver = webdriver.Chrome()

    def start_requests(self):
        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ... up to till date '01/10/2022'

        for date in floorsheet_dates:
            self.driver.get("https://merolagani.com/Floorsheet.aspx")
            self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                     ).send_keys(date)
            self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = self.driver.find_element(By.XPATH,
                                                    "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = self.driver.page_source
                self.url = self.driver.current_url

                yield Request(url=self.url, callback=self.parse, dont_filter=True)

    def parse(self, response, **kwargs):
        print('test ok')
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)
Notice that I've used Chrome, so change it back to Firefox as in your original code.
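For example, assuming geckodriver is managed via webdriver_manager as in your original code, the swap is just the driver line (a minimal, Selenium-3-style sketch):

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

# Selenium 3-style instantiation, matching the original question's code;
# in Selenium 4 you would pass a Service object instead of executable_path.
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())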
I've written a script in Python to scrape the results populated upon filling in two input boxes, zipcode and distance, with 66109 and 10000. When I enter the inputs manually, the site does display results, but when I try the same using the script I get nothing. The script throws no error either. What might be the issue here?
Website link
I've tried with:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sart.org/clinic-pages/find-a-clinic/'

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    session = requests.Session()
    response = session.post(link, data=payload, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    item = soup.select_one(".clinics__search-meta").text
    print(item)

if __name__ == '__main__':
    get_clinics(url)
I'm only after this line, "Within 10000 miles of 66109 there are 383 clinics.", which is generated when the search is made.
I changed the URL and the request method to GET, and it worked for me:
def get_clinics(link):
    session = requests.Session()
    response = session.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    item = soup.select_one(".clinics__search-meta").text
    print(item)

url = 'https://www.sart.org/clinic-pages/find-a-clinic?zip=66109&strdistance=10000&SelectedState=Select+State+or+Region'
get_clinics(url)
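Equivalently, you can let requests build that query string from a dict, which is a bit easier to tweak; a minimal sketch of the same GET request:

import requests
from bs4 import BeautifulSoup

params = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region',
}

def get_clinics(link):
    # requests URL-encodes the params dict into ?zip=...&strdistance=... for us.
    response = requests.get(link, params=params, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    print(soup.select_one(".clinics__search-meta").get_text(strip=True))

if __name__ == '__main__':
    get_clinics('https://www.sart.org/clinic-pages/find-a-clinic')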
Including cookies is one of the main concerns here. If you do it the right way, you can get a valid response following the approach you started with. Here is the working code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sart.org/clinic-pages/find-a-clinic/'

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    with requests.Session() as s:
        res = s.get(link)
        req = s.post(link, data=payload, cookies=res.cookies.get_dict())
        soup = BeautifulSoup(req.text, "lxml")
        item = soup.select_one(".clinics__search-meta").get_text(strip=True)
        print(item)

if __name__ == '__main__':
    get_clinics(url)
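As a side note, a requests.Session already persists cookies between requests, so passing cookies=res.cookies.get_dict() explicitly is mostly for clarity; a slightly shorter variant of the same idea:

import requests
from bs4 import BeautifulSoup

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    with requests.Session() as s:
        s.get(link)                       # the session stores the cookies set here
        req = s.post(link, data=payload)  # and sends them automatically on the POST
        soup = BeautifulSoup(req.text, "lxml")
        print(soup.select_one(".clinics__search-meta").get_text(strip=True))

if __name__ == '__main__':
    get_clinics('https://www.sart.org/clinic-pages/find-a-clinic/')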
I want to scrape data from a website, but first I want to get the page with pagination. I'm using Python, and I already have this code. But when I run it, it doesn't work properly: the loop should stop when response.url no longer matches expected_url. Does someone know how to solve this? Please help, thank you.
Here is the code:
from bs4 import BeautifulSoup
import urllib.request

count = 0
url = "http://www.belanjamimo.net/foundation-bb-cream/?o=a&s=%d"

def get_url(url):
    req = urllib.request.Request(url)
    return urllib.request.urlopen(req)

expected_url = url % count
response = get_url(expected_url)
while response.url == expected_url:
    print("GET {0}".format(expected_url))
    count += 9
    expected_url = url % count
    response = get_url(expected_url)
Try the approach below to exhaust all the items across the pages and break out of the loop when there are no more items available.
from bs4 import BeautifulSoup
import requests

url = "http://www.belanjamimo.net/foundation-bb-cream/?o=a&s={}"

page = 0
while True:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    items = soup.select(".product-block h2 a")
    if len(items) <= 1: break  # check whether there is any product still available
    for item in items:
        print(item.text)
    page += 9  # the "s=" offset advances in steps of 9, as in the original code
I want to scrape a list of URLs from a web site and then open them one by one.
I can get the list of all the URLs, but when I try to turn it into a list things go wrong.
When I print the list, instead of getting [url1, url2, ...] I get something like this in the console, each on a different line:
[url1, url2, url3]
[url1, url2, url3, url4]
[url1, url2, url3, url4, url5]
Find my script below:
import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

driver = webdriver.Chrome()
my_url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"
driver.get(my_url)
time.sleep(3)

content = driver.page_source.encode('utf-8').strip()
page_soup = soup(content, "html.parser")

links = []
for link in page_soup.find_all('a', href=True):
    url = link['href']
    ai = str(url)
    links.append(ai)
    print(links)

links.append(ai)
print(links)
I have rewritten your code a little. First you need to load and scrape the main page to get all the links from the href attributes. After that, just use the scraped URLs in a loop to get the next pages.
Also, there is some junk in href that isn't a URL, so you have to clean it out first.
I prefer requests for doing GET requests:
http://docs.python-requests.org/en/master/
I hope it helps.
from bs4 import BeautifulSoup
import requests


def main():
    links = []

    url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"

    web_page = requests.get(url)
    soup = BeautifulSoup(web_page.content, "html.parser")
    a_tags = soup.find_all('a', href=True)

    for a in a_tags:
        links.append(a.get("href"))
    print(links)  # just to demonstrate that links are there

    cleaned_list = []
    for link in links:
        if "http" in link:
            cleaned_list.append(link)

    print(cleaned_list)
    return cleaned_list


def load_pages_from_links(urls):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    links = urls
    downloaded_pages = {}

    if len(links) == 0:
        return "There are no links."
    else:
        for nr, link in enumerate(links):
            web_page = requests.get(link, headers=user_agent)
            downloaded_pages[nr] = web_page.content
        print(downloaded_pages)


if __name__ == "__main__":
    links = main()
    load_pages_from_links(links)