I'm trying to create a script that fetches the title and description of products from this webpage. The landing page shows a single product. However, if you look at the left-hand area, you will notice a tab labelled "17 products". I'm trying to grab their titles and descriptions as well. Clicking that tab does nothing in practice, as the 17 products are already present in the page source.
I can fetch all 18 products in the following manner. I had to use print twice to print all 18 of them. If I append the results and print them all together, the script looks messier.
import requests
from bs4 import BeautifulSoup

link = 'https://www.3m.com/3M/en_US/company-us/all-3m-products/~/3M-Cubitron-II-Cut-Off-Wheel/?N=5002385+3290927385&preselect=8710644+3294059243&rt=rud'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    product_title = soup.select_one("h1[itemprop='name']").text
    specification = soup.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30]  # truncated for brevity
    print(product_title, specification)

    for additional_link in list(set([item.get("href") for item in soup.select(".js-row-results .allModelItemDetails a.SNAPS--actLink")])):
        res = s.get(additional_link)
        sauce = BeautifulSoup(res.text, "lxml")
        product_title = sauce.select_one("h1[itemprop='name']").text
        specification = sauce.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30]  # truncated for brevity
        print(product_title, specification)
How can I print the titles and descriptions of all the products together?
Not sure if I understand your question. You want to print all of the titles and descriptions together, but you don't want to append them to a list, because the script would get messy?
One option is to use a dict instead of a list. Define a dict at the top of your code after the imports, products = {}, and swap out your print statements for products[product_title] = specification.
Afterwards, you can use the pprint module, which comes with Python, to neatly print the dictionary, like so:
import pprint

some_random_dict = {'a': 123, 'b': 456}
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(some_random_dict)
Replace some_random_dict with products
If you're concerned with neatness, I would also refactor this bit into a separate function:
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
product_title = soup.select_one("h1[itemprop='name']").text
specification = soup.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30] #truncated for brevity
Maybe something like this:
def get_product(sess, link):
    info = {}
    r = sess.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    product_title = soup.select_one("h1[itemprop='name']").text
    specification = soup.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30]  # truncated for brevity
    info[product_title] = specification
    return soup, info
Your code would then look like this:
products = {}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    soup, product_info = get_product(s, link)
    products.update(product_info)
    for additional_link in list(set([item.get("href") for item in soup.select(".js-row-results .allModelItemDetails a.SNAPS--actLink")])):
        sauce, product_info = get_product(s, additional_link)
        products.update(product_info)
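At the very end of the script (after the with block), you can then pretty-print everything that was collected; a minimal sketch reusing the pprint setup from above:

import pprint

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(products)  # prints every {product_title: specification} pair collected above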
Having the same piece of code pasted around in multiple places is something that should always be avoided. Refactoring that bit into a separate function will help readability and maintainability in the long run.
I am trying to run this code in IDLE 3.10.6 and I am not seeing any of the data that should be extracted from Indeed. All of this data should be in the output when I run it, but it isn't. Below is the code:
#Indeed data
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    divs = soup.find_all("div", class_="jobsearch-SerpJobCard")
    for item in divs:
        title = item.find("a").text.strip()
        company = item.find("span", class_="company").text.strip()
        try:
            salary = item.find("span", class_="salarytext").text.strip()
        finally:
            salary = ""
        summary = item.find("div", {"class": "summary"}).text.strip().replace("\n", "")
        job = {
            "title": title,
            "company": company,
            'salary': salary,
            "summary": summary
        }
        joblist.append(job)

joblist = []
for i in range(0, 40, 10):
    print(f'Getting page, {i}')
    c = extract(10)
    transform(c)

df = pd.DataFrame(joblist)
print(df.head())
df.to_csv('jobs.csv')
Here is the output I get
Getting page, 0
Getting page, 10
Getting page, 20
Getting page, 30
Empty DataFrame
Columns: []
Index: []
Why is this happening, and what should I do to get the extracted data from Indeed? What I am trying to get is the job title, company, salary, and summary information. Any help would be greatly appreciated.
The URL string includes {page}, but it's not an f-string, so it's not being interpolated, and the URL you are fetching is:
https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}
That returns an error page.
So you should add an f before the opening quote when you set url.
Also, you are calling extract(10) each time, instead of extract(i).
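A minimal sketch of those two fixes, reusing the extract and transform definitions from your question (with headers passed as a keyword argument):

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
    # the f prefix makes {page} interpolate into the URL
    url = f"https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, "html.parser")

for i in range(0, 40, 10):
    print(f'Getting page, {i}')
    c = extract(i)  # pass the loop variable, not a fixed 10
    transform(c)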
Another way to interpolate the page number into the url is with str.format:
url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}".format(page=page)
r = requests.get(url, headers=headers)
Here r.status_code gives an error 403, which means the request is forbidden: the site is blocking your request from being fulfilled. Use the Indeed Job Search API instead.
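For example, you could guard against this before parsing (a small illustrative check; how you handle a blocked request is up to you):

r = requests.get(url, headers=headers)
if r.status_code != 200:
    # 403 means Indeed refused the request; parsing the body would yield nothing useful
    raise RuntimeError(f"Request blocked with status {r.status_code}")
soup = BeautifulSoup(r.content, "html.parser")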
Currently using the below Python scraper to pull job title, company, salary, and description. Looking for a way to take it one step further by filtering only results where the application link is a URL to the company website, as opposed to the 'Easily Apply' postings that send the application through Indeed. Is there a way to do this?
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    url = f'https://www.indeed.com/jobs?q=Software%20Engineer&l=Austin%2C%20TX&ts=1630951951455&rq=1&rsIdx=1&fromage=last&newcount=6&vjk=c8f4815c6ecfa793'
    r = requests.get(url, headers)  # 200 is OK, 404 is page not found
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

# <span title="API Developer"> API Developer </span>
def transform(soup):
    divs = soup.find_all('div', class_='slider_container')
    for item in divs:
        if item.find(class_='label'):
            continue  # need to fix, if finds a job that has a 'new' span before the title span, skips job completely
        title = item.find('span').text.strip()
        company = item.find('span', class_="companyName").text.strip()
        description = item.find('div', class_="job-snippet").text.strip().replace('\n', '')
        try:
            salary = item.find('span', class_="salary-snippet").text.strip()
        except:
            salary = ""
        job = {
            'title': title,
            'company': company,
            'salary': salary,
            'description': description
        }
        jobList.append(job)
        # print("Seeking a: "+title+" to join: "+company+" paying: "+salary+". Job description: "+description)
    return

jobList = []

# go through multiple pages
for i in range(0, 100, 10):  # 0-40 stepping in 10's
    print(f'Getting page, {i}')
    c = extract(0)
    transform(c)
    print(len(jobList))

df = pd.DataFrame(jobList)
print(df.head())
df.to_csv('jobs.csv')
My approach is as follows:
Find the href from the <a> tag for each job card on the initial page, then send a request to each of those links and grab the external job link (if the "Apply on Company Site" button is available) from there.
Code snippet:
# function which gets external job links
def get_external_link(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # if the "Apply on Company Site" button is available, fetch the link
    if soup.find('a', attrs={"referrerpolicy": "origin"}) is not None:
        external_job_link = soup.find('a', attrs={"referrerpolicy": "origin"})
        print(external_job_link['href'])

# add this piece of code to the transform function
def transform(soup):
    cards = soup.find('div', class_='mosaic-provider-jobcards')
    links = cards.find_all("a", class_=lambda value: value and value.startswith("tapItem"))
    # for each job link on the page, call get_external_link
    for link in links:
        get_external_link('https://www.indeed.com' + (link['href']))
Note: you can also use the page source of these additional requests to fetch the data (title, company, salary, description) that you previously scraped from the main page.
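As a rough sketch of that idea (assuming the selectors used above still match the page), get_external_link could return the link instead of printing it, so transform can keep only the postings that have one:

def get_external_link(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    button = soup.find('a', attrs={"referrerpolicy": "origin"})
    return button['href'] if button else None  # None means there is no "Apply on Company Site" button

def transform(soup):
    cards = soup.find('div', class_='mosaic-provider-jobcards')
    links = cards.find_all("a", class_=lambda value: value and value.startswith("tapItem"))
    for link in links:
        job_url = 'https://www.indeed.com' + link['href']
        external = get_external_link(job_url)
        if external:  # keep only postings that link out to a company site
            jobList.append({'job_page': job_url, 'apply_link': external})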
I am writing a script that retrieves data from an e-journal site.
What I want to get are the titles, pages, authors, and abstracts of the articles.
I succeeded in retrieving the data and am now building a list that combines them per article.
Some articles don't have authors or abstracts, so I used an if inside def article(): to distinguish them. But it doesn't work; it always shows the result from the else branch. Please help me...
(I'm not a native English speaker, so I hope you understand what I want to say...)
import requests
from bs4 import BeautifulSoup

h = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
URL = "https://agsjournals.onlinelibrary.wiley.com/toc/15325415/2021/69/7"
JAGS_result = requests.get(URL, headers=h)
JAGS_soup = BeautifulSoup(JAGS_result.text, "html.parser")

T = []
for title in JAGS_soup.select("a > h2"):
    T.append(title.text)

P = []
for page in JAGS_soup.select(".page-range"):
    P.append(page.text)

A = []
for author in JAGS_soup.select(".comma__list"):
    A.append(author.text)

L = []
for link in JAGS_soup.find_all('a', {"title": "Abstract"}):
    L.append(link.get('href'))

Ab_Links = []
a = 0
for ab_link in L:
    full_link = "https://agsjournals.onlinelibrary.wiley.com" + L[a]
    Ab_Links.append(full_link)
    a = a + 1

b = 0
Ab = []
Ab_URL = Ab_Links[b]
for ab_url in Ab_Links:
    Ab_result = requests.get(Ab_Links[b], headers=h)
    Ab_soup = BeautifulSoup(Ab_result.text, "html.parser")
    abstract = Ab_soup.find(class_='article-section article-section__abstract').text
    Ab.append(abstract)
    b = b + 1

result = JAGS_soup.find_all("div", {"class": "issue-item"})

def article():
    x = 0
    results = []
    for y in list(range(len(T))):
        an_article = [T[x], P[x]]
        if "author" in result[x]:
            an_article.append(A[x])
        else:
            an_article.append(" ")
        if "Abstract" in result[x]:
            an_article.append(Ab[x])
        else:
            an_article.append("No Abstract available")
        results.append(an_article)
        x = x + 1
    return results

print(article())
I believe I have written some code that does what you hope to accomplish. I don't generally like making a list of all components and then trying to assemble each article afterwards. Instead, I got the topic, title, authors, pages, and abstract link of each article and just made the article object there.
Here is the code:
import bs4
import requests

print(bs4.__version__)
print(requests.__version__)

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"}
res = requests.get("https://agsjournals.onlinelibrary.wiley.com/toc/15325415/2021/69/7", headers=headers)
soup = bs4.BeautifulSoup(res.content, "html.parser")

# gets the info on the main issue
parent_volume_tag = soup.find('li', class_="grid-item cover-image")
volume_and_issue = parent_volume_tag.find('div', class_="cover-image__parent-item").text.replace('\n', '')
page_nums = parent_volume_tag.find('div', class_="cover-image__pageRange").text.replace('\n', '')
issue_date = parent_volume_tag.find('div', class_="cover-image__date").text.replace('\n', '')
volume_info = [volume_and_issue, page_nums, issue_date]  # creates a list with the relevant volume/issue data

def get_catergory(container):
    return container.find('h3').text

def get_issue_list(container):
    issues = []
    for issue in container.find_all('div', class_="issue-item"):
        if not issue.find('div', class_='issue-item'):
            issues.append(issue)
    return issues

def get_item_title(issue):
    return issue.find('a')

def get_item_authors(issue):
    author_list = issue.find('div', class_="comma__list")
    if author_list:
        authors = [i.text for i in author_list.find_all('span', class_="comma__item")]
        authors = [i.replace('\n ', '').replace(', ', '') for i in authors]
        return authors
    else:
        return None

def get_abstract_link(issue):
    abstract_tag = issue.find('a', title="Abstract")
    if abstract_tag:
        link = "https://agsjournals.onlinelibrary.wiley.com" + abstract_tag.get('href')
        return link
    else:
        return None

containers = soup.find_all('div', class_="issue-items-container bulkDownloadWrapper")
all = []
for container in containers:
    topic = get_catergory(container)
    issues = get_issue_list(container)
    catergory = []
    for issue in issues:
        title = issue.find('h2').text
        authors = get_item_authors(issue)
        page_range = issue.find('li', class_='page-range').text.replace('Pages: ', '')
        abstract_link = get_abstract_link(issue)
        article = [topic, title, authors, page_range, abstract_link]
        catergory.append(article)
    all.append(catergory)
Within the all list, you have lists of articles, grouped by topic, such as "Editorials" or "Covid Related Content".
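For example, to walk that nested structure and print one line per article (a small usage sketch):

for category in all:
    for topic, title, authors, page_range, abstract_link in category:
        print(topic, '|', title, '|', page_range)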
I'm scraping the activities to do in Paris from TripAdvisor (https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html).
The code that I've written works well, but I still haven't found a way to obtain the rating of each activity. The rating on TripAdvisor is represented by 5 circles, and I need to know how many of them are filled in.
I get nothing in the "rating" field.
Here is the code:
wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
wd.get("https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html")

import pprint
detail_tours = []
for tour in list_tours:
    url = tour.find_elements_by_css_selector("a")[0].get_attribute("href")
    title = ""
    reviews = ""
    rating = ""
    if len(tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")) > 0:
        title = tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")[0].text
    if len(tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")) > 0:
        reviews = tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")[0].text
    if len(tour.find_elements_by_css_selector(".zWXXYhVR")) > 0:
        rating = tour.find_elements_by_css_selector(".zWXXYhVR")[0].text
    detail_tours.append({'url': url,
                         'title': title,
                         'reviews': reviews,
                         'rating': rating})
I would use BeautifulSoup in a way similar to the suggested code. (I would also recommend that you study the structure of the HTML, but given the original code I don't think that's necessary.)
import requests
from bs4 import BeautifulSoup
import re

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
resp = requests.get('https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html', headers=header)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')
    cards = soup.find_all('div', {'data-automation': 'cardWrapper'})
    for card in cards:
        rating = card.find('svg', {'class': 'zWXXYhVR'})
        match = re.match('Punteggio ([0-9,]+)', rating.attrs['aria-label'])[1]
        print(float(match.replace(',', '.')))
And a small bonus: the part of the link prefixed with oa (in the example below, oa60) indicates the starting offset, which runs in increments of 30 results. So if you want to change pages, you can change your link to include oa30, oa60, oa90, etc.: https://www.tripadvisor.it/Attractions-g187147-Activities-c42-oa60-Paris_Ile_de_France.html
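A minimal sketch of that pagination, reusing the header and imports from above and assuming the same card structure on every page:

base = 'https://www.tripadvisor.it/Attractions-g187147-Activities-c42-oa{offset}-Paris_Ile_de_France.html'
for offset in range(30, 120, 30):  # oa30, oa60, oa90; the first page has no oa part
    resp = requests.get(base.format(offset=offset), headers=header)
    if resp.status_code != 200:
        break
    soup = BeautifulSoup(resp.text, 'lxml')
    cards = soup.find_all('div', {'data-automation': 'cardWrapper'})
    print(offset, len(cards))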
I have recently been working on an exercise in which I extracted the whole page source of a webpage. I am particularly interested in the area tag, and within it, the onclick attribute. How can I extract the onclick attribute from a particular element?
The extracted data looks like this:
<area class="borderimage" coords="21.32,14.4,933.96,180.56" href="javascript:void(0);" onclick="return show_pop('78545','51022929357','1')" onmouseover="borderit(this,'black','<b>इंदौर, गुरुवार, 10 मई , 2018 <b><br><bआप पढ़ रहे हैं देश का सबसे व...')" onmouseout="borderit(this,'white')" alt="<b>इंदौर, गुरुवार, 10 मई , 2018 <b><br><bआप पढ़ रहे हैं देश का सबसे व..." shape="rect">
I am interested in the onclick attribute. Here is the code I have tried so far, but nothing has worked:
import requests
from lxml import html

paper_url = 'http://epaper.bhaskar.com/indore/129/10052018/mpcg/1/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}

# Total number of pages available in this paper
page = requests.get(paper_url, headers=headers)
page_response = page.text
parser = html.fromstring(page_response)
XPATH_Total_Pages = '//div[contains(@class,"fs12 fta w100 co_wh pdt5")]//text()'
raw_total_pages = parser.xpath(XPATH_Total_Pages)
lastpage = raw_total_pages[-1]
print(int(lastpage))
finallastpage = int(lastpage)

reviews_list = []
XPATH_PRODUCT_NAME = '//map[contains(@name,"Mapl")]'
#XPATH_PRODUCT_PRICE = '//span[@id="priceblock_ourprice"]/text()'
#raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
#product_price = raw_product_price
raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)
XPATH_REVIEW_SECTION_2 = '//area[@class="borderimage"]'
reviews = parser.xpath(XPATH_REVIEW_SECTION_2)
product_name = raw_product_name
#result = product_name.find(',')
#finalproductname = slice[0:product_name]
print(product_name)
print(reviews)

for review in reviews:
    #soup = BeautifulSoup(str(review), "html.parser")
    #parser2.feed(str(review))
    #allattr = [tag.attrs for tag in review.findAll('onclick')]
    #print(allattr)
    XPATH_RATING = './/area[@data-hook="onclick"]'
    raw_review_rating = review.xpath(XPATH_RATING)
    #cleaning data
    print(raw_review_rating)
If I understood correctly, you need to get all onclick attributes of <area> tags on a page.
Try something like this:
import requests
from bs4 import BeautifulSoup
TAG_NAME = 'area'
ATTR_NAME = 'onclick'
url = 'http://epaper.bhaskar.com/indore/129/10052018/mpcg/1/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
# there are 3 <area> tags on page; putting them into a list
area_onclick_attrs = [x[ATTR_NAME] for x in soup.findAll(TAG_NAME)]
print(area_onclick_attrs)
Output:
[
"return show_pophead('78545','51022929357','1')",
"return show_pop('78545','51022928950','4')",
"return show_pop('78545','51022929357','1')",
]