Extracting product URLs from a search query on a website - python

Say, for example, I want to track the price changes of MIDI keyboards on https://www.gear4music.com/Studio-MIDI-Controllers. I would need to extract the URLs of all the products shown in the search results, then loop through those URLs and extract the price info for each product. I can obtain the price data of an individual product by hard-coding its URL, but I cannot find a way to automate getting the URLs of multiple products.
So far I have tried this,
from bs4 import BeautifulSoup
import requests

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
This does produce a list of URLs, but I cannot tell which ones relate specifically to the MIDI keyboards in that search query whose price info I want. Is there a better, more specific way to obtain only the product URLs rather than everything in the HTML file?

There are many ways to obtain the product links. One way is to select all <a> tags that have a data-g4m-inv= attribute:
import requests
from bs4 import BeautifulSoup
url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select("a[data-g4m-inv]"):
    print("https://www.gear4music.com" + a["href"])
Prints:
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniControl-MIDI-Controller/P6D
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-MiniKey-25-Key-MIDI-Controller/JMR
https://www.gear4music.com/Keyboards-and-Pianos/Nektar-SE25/2XWA
https://www.gear4music.com/Keyboards-and-Pianos/Korg-nanoKONTROL2-USB-MIDI-Controller-Black/G8L
https://www.gear4music.com/Recording-and-Computers/SubZero-ControlKey25-MIDI-Keyboard/221Y
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-CommandKey25-Universal-MIDI-Controller/221X
...
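From there you can loop over those product URLs and pull the price from each page. A minimal sketch of that second step, assuming the price sits in an element with an itemprop="price" attribute (a guess; check the real markup in your browser's inspector):
import requests
from bs4 import BeautifulSoup

base = "https://www.gear4music.com"
listing = BeautifulSoup(requests.get(base + "/Studio-MIDI-Controllers").content, "html.parser")

# collect the product URLs exactly as above
product_urls = [base + a["href"] for a in listing.select("a[data-g4m-inv]")]

for product_url in product_urls:
    page = BeautifulSoup(requests.get(product_url).content, "html.parser")
    # hypothetical price selector -- replace with whatever the product page actually uses
    price_tag = page.select_one('[itemprop="price"]')
    if price_tag:
        price = price_tag.get("content") or price_tag.get_text(strip=True)
        print(product_url, price)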

Open the Chrome developer console and look at the div that corresponds to a product. From there, set a variable (let's say "product") equal to soup.find_all() on that div, then loop through the results and pick out the <a> tags that are children of each element (or, alternatively, identify the title class and search that way). A sketch of this approach is below.
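A rough sketch of that idea, using a placeholder container class ("product-tile" is an assumption; use whatever class name the developer console actually shows for a product tile):
import requests
from bs4 import BeautifulSoup

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# "product-tile" is a placeholder class name -- replace it with the real one from dev tools
products = soup.find_all("div", class_="product-tile")
for product in products:
    link = product.find("a", href=True)  # the child <a> tag holds the product URL
    if link:
        print("https://www.gear4music.com" + link["href"])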

Related

Unable to scrape the Airtel website plans with Python BeautifulSoup

from bs4 import BeautifulSoup
import requests
req = requests.get("https://www.airtel.in/myplan-infinity/")
soup = BeautifulSoup(req.content, 'html.parser')
#finding the div with the id
div_bs4 = soup.find('div')
print(div_bs4)
What should I do to scrape the recharge plans of the page?
You should use content when you need to get media data (like pictures, videos, etc.); if you need non-media data, use text instead of content.
So make the soup like this: soup = BeautifulSoup(req.text, 'lxml')
Also make sure that the line div_bs4 = soup.find('div') finds the exact div you need (as written it will just get the first div in the HTML).
Finally, print(div_bs4) won't give you the data you need; you'd better use print(div_bs4.text). A rough sketch of these fixes is below.
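Putting those points together, a minimal sketch; the "plans" class name is only a placeholder (inspect the page for the real id or class), and if the plans are injected by JavaScript, requests alone will not see them:
from bs4 import BeautifulSoup
import requests

req = requests.get("https://www.airtel.in/myplan-infinity/")
soup = BeautifulSoup(req.text, 'lxml')

# placeholder selector -- replace 'plans' with the real id/class from the developer tools
div_bs4 = soup.find('div', class_='plans')
if div_bs4:
    print(div_bs4.text)
else:
    print("Plans container not found; the content may be loaded by JavaScript.")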

Scraping href which includes ads information

I would like to count how many ads there are on this website: https://www.lastampa.it/?refresh_ce
I am using BeautifulSoup to do this. I need to extract the info within the following:
<a id="aw0" target="_blank" href="https://googleads.g.doubleclick.net/pcs/click?xai=AKAOjssYz5VxTdwhxCBCrbtSi0dfGqGd25s7Ub6CCjsHLqd__OqfDKLyOWi6bKE3CL4XIJ0xDHy3ey-PGjm3_yVqTe0_IZ1g9AsvZmO1u8gciKpEKYMj1TIvl6KPivBuwgpfUDf8g2EvMyCD5r6tQ8Mx6Oa4G4yZoPYxFRN7ieFo7UbMr8FF2k6FL6R2qegawVLKVB5WHVAbwNQu4rVx4GE8KuxowGjcfecOnagp9uAHY2qiDE55lhdGqmXmuIEAK8UdaIKeRr6aBBVCR40LzY4&sig=Cg0ArKJSzEIRw7NDzCe7&adurl=https://track.adform.net/C/%3Fbn%3D38337867&nm=3&nx=357&ny=-4&mb=2" onfocus="ss('aw0')" onmousedown="st('aw0')" onmouseover="ss('aw0')" onclick="ha('aw0')"><img src="https://tpc.googlesyndication.com/simgad/5262715044200667305" border="0" width="990" height="30" alt="" class="img_ad"></a>
i.e. parts containing ads information.
The code that I am using is the following:
from bs4 import BeautifulSoup
import requests
from lxml import html
r = requests.get("https://www.lastampa.it/?refresh_ce")
soup = BeautifulSoup(r.content, "html.parser")
ads_div = soup.find('div')
if ads_div:
    for link in ads_div.find_all('a'):
        print(link['href'])
It does not scrape any information because I am considering the wrong tag/href. How could I get ads information in order to count how many ads there are in that webpage?
How about using a regular expression to match "googleads" and counting how many matches you get?
Recursively searching from the body gives you all the links in the whole page. If you want to search in a specific div you can supply parameters such as the class or id that you want to match as a dictionary.
You can filter the links once you obtain them.
body = soup.find('body')
if body:
    for link in body.find_all('a', href=True):
        if "ad" in link['href']:
            print(link['href'])
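To actually count the ads you can combine the regex idea with that filtering. A minimal sketch, assuming the ad links point at Google's ad domains (e.g. contain "googleads" or "doubleclick"):
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.lastampa.it/?refresh_ce")
soup = BeautifulSoup(r.content, "html.parser")

# assumed pattern for ad URLs -- adjust after checking the real hrefs
ad_pattern = re.compile(r"googleads|doubleclick")
ad_links = [a["href"] for a in soup.find_all("a", href=True) if ad_pattern.search(a["href"])]
print(len(ad_links), "ad links found")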
When looking at the response that I get, I notice there are no ads at all. This could be because the ads are loaded via some script, which means they won't be rendered and requests won't download them. To get around this you can use a webdriver with Selenium (a sketch follows). That should do the trick.
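A minimal sketch of that Selenium route (it assumes chromedriver is available and reuses the same assumed "googleads"/"doubleclick" filter as above):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs chromedriver installed
driver.get("https://www.lastampa.it/?refresh_ce")

# the browser executes the ad scripts, so the rendered HTML should now contain the ads
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

ads = [a["href"] for a in soup.find_all("a", href=True)
       if "googleads" in a["href"] or "doubleclick" in a["href"]]
print(len(ads), "ads found")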

Web Scraping through links with Beautiful Soup

I'm trying to scrape the blog "https://blog.feedspot.com/ai_rss_feeds/" and crawl through all the links in it to look for artificial-intelligence-related information on each of the crawled links.
The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com, etc. In the HTML, the main div has a class called "rss-block", which contains a nested div with class "data"; each data div has several <a> tags, and the value in their href attributes gives the links to be crawled. We need to look for AI-related articles on each of the links found by scraping the above-mentioned structure.
I've tried various variations of the following code but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawl through each of them to scrape articles related to AI.
If you could help me finish both the parts of the problem, that'd be a great learning for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML Structure. Thanks in advance!
The first twenty results are stored in the html as you see it on the page. The others are pulled from a script tag, and you can regex them out to create the full list of 67. Then loop over that list and issue requests to those URLs for further info. I offer a choice of two different selectors for the initial list population (the second, commented out, uses :contains, available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re

p = re.compile(r'feed_domain":"(.*?)",')

with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1+
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+', '', i.replace('\\', '')) for i in p.findall(r.text)]
    for link in results:
        # do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
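As an illustration of the "extract info from indiv page" step, one option is to scan each crawled page's text for AI-related keywords (the keyword list here is just an example):
import requests
from bs4 import BeautifulSoup

keywords = ("artificial intelligence", "machine learning", "neural network")  # example keywords

def looks_ai_related(url):
    # fetch one of the crawled sites and scan its visible text
    page = BeautifulSoup(requests.get(url, timeout=10).content, "lxml")
    text = page.get_text(" ", strip=True).lower()
    return any(k in text for k in keywords)

print(looks_ai_related("http://aitrends.com/"))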
To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]

How do I scrape Dynamic content from Woolworths catalogue? Can't seem to find the price and product class

I am trying to obtain Woolworths specials data from this website: "https://www.woolworths.com.au/shop/catalogue"
However, I can't seem to extract the specific class information I am after, for example 'sf-pricedisplay', 'shelfProductStamp-imageTagsContainer', etc. They are contained within a container element.
I believe this information may be generated dynamically.
Would anyone be able to help me with this?
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.woolworths.com.au/shop/catalogue#view=catalogue2&saleId=28314&areaName=NSW&page=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('div', class_='shop-content')) # This works.
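Since the catalogue appears to be rendered by JavaScript, the Selenium approach mentioned earlier is one way to reach classes like 'sf-pricedisplay'. A rough sketch, assuming those classes are present once the page has rendered (worth confirming in the developer tools):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs chromedriver installed
driver.get("https://www.woolworths.com.au/shop/catalogue#view=catalogue2&saleId=28314&areaName=NSW&page=3")
time.sleep(10)  # crude wait for the catalogue JavaScript to finish rendering

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# 'sf-pricedisplay' is the class named in the question; adjust if the rendered markup differs
for price in soup.find_all(class_='sf-pricedisplay'):
    print(price.get_text(strip=True))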

I want to ask about web crawling through python

import requests
from bs4 import BeautifulSoup
def laptopspec():
    url = "https://search.shopping.naver.com/search/all.nhn?origQuery=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81&pagingIndex=1&pagingSize=40&productSet=model&viewType=list&sort=rel&frm=NVSHPRC&query=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("li", {"class": "ad _model_list _itemSection"})
    for idx, tag in enumerate(tags):
        print(idx, tag)
laptopspec()
Through this code, I could get some information that I need.
Now I want to get more specific information using keywords like GTX 1050, and I want to print the URLs that contain that keyword. How can I do this?
import requests
from bs4 import BeautifulSoup

def laptopspec():
    url = "https://search.shopping.naver.com/search/all.nhn?origQuery=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81&pagingIndex=1&pagingSize=40&productSet=model&viewType=list&sort=rel&frm=NVSHPRC&query=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    # the divs with class "img_area" contain the actual product links
    GTX = soup.find_all("div", {"class": "img_area"})
    links = []
    for div in GTX:
        a = div.find("a", href=True)
        if a and "GTX" in a["href"]:
            links.append(a["href"])
    print(links)

laptopspec()
That code looks for all of the divs with class "img_area" (the one that contains the actual links), creates an empty list called links, and then stores every link that contains "GTX" in that list.
The main problem with that webpage is that the links to the product and the descriptions of the graphics card and other specs are stored in different classes, and the links in the class with the graphics information point to "#", which just refreshes the page.
Another way to do it, if you know exactly which model you're looking for that has the card, is to check for that model instead of "GTX", for example:
if a and "ASUS" in a["href"]:
or whatever you're actually looking for, because that href mostly just contains the model number and the link.
It's only by chance that "GTX" appears in the hashed link, so this is not guaranteed to find every one you're looking for, but every link I've checked from this output so far contains a laptop with a GTX card. shrug
But hopefully this will point you in the right direction. I'm still new to Python, but I was just doing a project with BeautifulSoup so I figured I'd try and help.
