import requests
from bs4 import BeautifulSoup

def laptopspec():
    url = "https://search.shopping.naver.com/search/all.nhn?origQuery=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81&pagingIndex=1&pagingSize=40&productSet=model&viewType=list&sort=rel&frm=NVSHPRC&query=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("li", {"class": "ad _model_list _itemSection"})
    for idx, tag in enumerate(tags):
        print(idx, tag)

laptopspec()
With this code I could get some of the information I need.
Now I want to get more specific information using keywords like GTX 1050, and I want to print the URLs that contain that keyword. How can I do that?
import requests
from bs4 import BeautifulSoup

def laptopspec():
    url = "https://search.shopping.naver.com/search/all.nhn?origQuery=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81&pagingIndex=1&pagingSize=40&productSet=model&viewType=list&sort=rel&frm=NVSHPRC&query=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    # the "img_area" divs hold the actual product links
    GTX = soup.find_all("div", {"class": "img_area"})
    for idx, tag in enumerate(GTX):
        print(idx, tag)

    links = []
    for div in GTX:
        link = div.find("a", href=True)
        # keep only links whose href contains "gtx"
        if link is not None and "gtx" in link["href"].lower():
            links.append(link["href"])
    print(links)

laptopspec()
That code looks for all of the divs with class "img_area" (the ones that contain the actual links), creates an empty list called links, and then stores every link whose href contains "gtx" in that list.
The main problem with that webpage is that the links to the product and the description of the graphics card and other specs are stored in different classes, and the links in the class with the graphics information point to "#", which just refreshes the page.
Another way to do it, if you know exactly which model with that card you're looking for, is to check for the model name instead of "gtx", for example:
if "asus" in link["href"].lower():
Or whatever you're actually looking for, because that href mostly just contains the model number and the link.
It's only by chance that "GTX" appears in the hashed link, so this is not guaranteed to find every laptop you're looking for, but every link I've checked from this output so far contains a laptop with a GTX card. shrug
But hopefully this will get you in the right direction. I'm still new to Python, but I was just doing a project with BeautifulSoup, so I figured I'd try to help.
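For what it's worth, here is a rough sketch of a more direct variant: instead of relying on "GTX" happening to appear in the href, filter each whole item block by its visible spec text and then take that block's product link. The li class name "_itemSection" is taken from the question's first snippet and the page layout may have changed since, so treat it as an assumption.

import requests
from bs4 import BeautifulSoup

def laptops_with_keyword(keyword):
    url = "https://search.shopping.naver.com/search/all.nhn?origQuery=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81&pagingIndex=1&pagingSize=40&productSet=model&viewType=list&sort=rel&frm=NVSHPRC&query=%EA%B2%8C%EC%9D%B4%EB%B0%8D%EB%85%B8%ED%8A%B8%EB%B6%81"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    links = []
    # "_itemSection" comes from the question's code; verify it against the live page.
    for item in soup.find_all("li", {"class": "_itemSection"}):
        # The spec text and the product link sit in different child elements,
        # so test the text of the whole item instead of the link itself.
        if keyword.lower() in item.get_text().lower():
            a = item.find("a", href=True)
            if a is not None and a["href"] != "#":
                links.append(a["href"])
    return links

print(laptops_with_keyword("GTX 1050"))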
If I were, for example, looking to track the price changes of MIDI keyboards on https://www.gear4music.com/Studio-MIDI-Controllers, I would need to extract the URLs of all the products returned by that search and then loop through those URLs to extract the price info for each product. I can obtain the price data of an individual product by hard-coding its URL, but I cannot find a way to automate getting the URLs of multiple products.
So far I have tried this,
from bs4 import BeautifulSoup
import requests

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
This does produce a list of URLs, but I cannot make out which ones relate specifically to the MIDI keyboards in that search query whose price info I want to obtain. Is there a better, more specific way to obtain only the product URLs rather than everything in the HTML file?
There are many ways to obtain the product links. One way is to select all <a> tags that have a data-g4m-inv attribute:
import requests
from bs4 import BeautifulSoup

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("a[data-g4m-inv]"):
    print("https://www.gear4music.com" + a["href"])
Prints:
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniControl-MIDI-Controller/P6D
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-MiniKey-25-Key-MIDI-Controller/JMR
https://www.gear4music.com/Keyboards-and-Pianos/Nektar-SE25/2XWA
https://www.gear4music.com/Keyboards-and-Pianos/Korg-nanoKONTROL2-USB-MIDI-Controller-Black/G8L
https://www.gear4music.com/Recording-and-Computers/SubZero-ControlKey25-MIDI-Keyboard/221Y
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-CommandKey25-Universal-MIDI-Controller/221X
...
Open the Chrome developer console and look at the div that corresponds to a product. From there, set a variable (say, products) equal to soup.find_all on that div's class and loop through the results to find the <a> tags that are children of each element (or alternatively identify the title class and search that way), as sketched below.
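A minimal sketch of that approach, assuming a made-up container class name; replace "product-card" with whatever class the developer console actually shows on the product div.

import requests
from bs4 import BeautifulSoup

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# "product-card" is a placeholder: use the class you see on the product div.
for product in soup.find_all("div", class_="product-card"):
    a = product.find("a", href=True)  # first link inside the product container
    if a:
        print("https://www.gear4music.com" + a["href"])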
Basically, I want to find all links that contain certain key terms. In my case, the titles of the links I want come in this form: abc... (common text), dce... (common text), ... I want to take all of the links containing "(common text)" and put them in a list. I got the code working and I understand how to find all links. However, I converted the links to strings to find the "(common text)". I know that this isn't good practice, and I am not sure how to use Beautiful Soup to find this common element without converting to a string. The issue here is that the titles I am searching for are not all the same. Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import webbrowser

url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
links_length = len(links)
string_links = []
targetlist = []

for a in range(links_length):
    string_links.append(str(links[a]))
    if '(common text)' in string_links[a]:
        targetlist.append(string_links[a])
NOTE: I am looking for the simplest method using Beautiful Soup to accomplish this. Any help will be appreciated.
Without the actual website and the output you expect, it's difficult to say exactly what you want, but here is a cleaner solution using a list comprehension.
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
targetlist = [str(link) for link in links if "(common text)" in str(link)]
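If you would rather stay within BeautifulSoup instead of converting tags to strings, you can also pass a regular expression as the string argument to find_all. This is only a sketch, under the assumption that "(common text)" appears in each link's own text (string= only matches tags whose text is a single string):

import re
import requests
from bs4 import BeautifulSoup

url = 'http://website.com'  # placeholder URL from the question
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")

# Matches only <a> tags whose direct text contains "(common text)",
# so no str() conversion is needed.
targetlist = soup.find_all('a', string=re.compile(r'\(common text\)'), limit=4000)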
I'm trying to scrape the blog https://blog.feedspot.com/ai_rss_feeds/ and crawl through all the links in it to look for artificial-intelligence-related information at each of the crawled links.
The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com/..., etc. In the HTML, the main div has a class called "rss-block", which contains a nested class called "data", and each "data" element holds several <a> tags whose href values give the links to be crawled. We need to look for AI-related articles at each of the links found by scraping the above-mentioned structure.
I've tried several variations of the following code, but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawl through each of those links to scrape AI-related articles from them.
If you could help me finish both parts of the problem, that would be a great learning experience for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML structure. Thanks in advance!
The first twenty results are stored in the HTML as you see them on the page. The others are pulled from a script tag, and you can regex them out to create the full list of 67. Then loop over that list and issue requests to those URLs for further info. I offer a choice of two different selectors for the initial list population (the second, commented out, uses :contains, which is available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re

p = re.compile(r'feed_domain":"(.*?)",')

with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1+
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+', '', i.replace('\\', '')) for i in p.findall(r.text)]

    for link in results:
        # do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
I wrote this test code which uses BeautifulSoup.
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('p'):
print(n.get_text())
It works fine, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, copyright notices, etc.
I would like it to retrieve only the text of the news article itself; how would one go about this?
You might have much better luck with the newspaper library, which is focused on scraping articles.
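A minimal sketch with newspaper3k (pip install newspaper3k); it downloads and parses the page and exposes just the article body:

from newspaper import Article

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
article = Article(url)
article.download()
article.parse()
print(article.text)  # article body only, without timestamps, comment counts, etc.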
If we talk about BeautifulSoup only, one option to get closer to the desired result and have more relevant paragraphs is to find them in the context of the div element with the itemprop="articleBody" attribute:
article_body = soup.find(itemprop="articleBody")

for p in article_body.find_all("p"):
    print(p.get_text())
You'll need to target something more specific than just the p tag. Try looking for a div with class="article" or something similar, then only grab paragraphs from there, as in the sketch below.
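A rough sketch of that idea; "article" is just the placeholder class name mentioned above, so inspect the page to find the real container class before relying on it.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

# "article" is a placeholder class; replace it with the real container class.
article_div = soup.find('div', class_='article')
if article_div:
    for p in article_div.find_all('p'):
        print(p.get_text())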
Be more specific: you need to catch the div carrying the itemprop="articleBody" attribute, so:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

for n in soup.find_all('div', attrs={'itemprop': "articleBody"}):
    print(n.get_text())
Answers on SO are not just for you, but also for people coming from Google searches and the like. As you can see, attrs is a dict, so it is possible to pass more attributes/values if needed.
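For example, reusing the soup object from the snippet above; the extra 'article-text' class is purely illustrative, not something the Daily Mail page is known to use.

# Both attribute conditions must match; 'article-text' is a made-up example class.
divs = soup.find_all('div', attrs={'itemprop': 'articleBody', 'class': 'article-text'})
for div in divs:
    print(div.get_text())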
I need to download a few links from an HTML page, but I don't need all of them, only a few from a certain section of the page.
For example, on http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the Debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)

for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work?
Thanks
I don't see the bl-bigger class at all when I view the source in Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be inside a div with class nytint-discussion-content, so using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again learning from the source, it seems all the links are inside a list (li tags). Now all you have to do is find all the li tags and find the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")

for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. From the link given in the question, it will fetch 5 links.
PS: link_set contains only the URIs (relative paths), not full addresses, so I would prepend http://www.nytimes.com before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
You need to pass the class inside an attributes dict instead of as a keyword argument (class is a reserved word in Python):
soup.find("tagName", { "class" : "cssClass" })
or use the .select method, which executes CSS selectors:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing out the entire script, you'll get to working code more quickly in an IPython interactive shell.
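For completeness, a sketch applying the .select suggestion to the loop from the question (Python 3 print syntax); this assumes the bl-bigger class actually appears in the served HTML, which another answer above doubts.

import requests
from bs4 import BeautifulSoup

url = "http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

link_set = set()
for link in soup.select('a.bl-bigger'):
    href = link.get('href')
    if href and '/roomfordebate/' in href:
        link_set.add(href)

for link in link_set:
    print(link)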