I am trying to web scrape https://steamcommunity.com/market/listings/730/AWP%20%7C%20Atheris%20%28Field-Tested%29 and extract the data regarding the price of items sorted with there Inspect in Game link.
You can find the Inspect in Game button by clicking on the arrow popup which appears when you move the cursor on the image of any listed item below.
I managed to scrape the data of prices but, from the same method the Inspect in Game link isn't scraping.
Here's my code
from bs4 import BeautifulSoup
import requests
import re
print("Fetching Data")
url = 'https://steamcommunity.com/market/listings/730/AWP%20%7C%20Atheris%20%28Field-Tested%29'
response = requests.get(url)
# getting the source code of page response.text
data = response.text
soup = BeautifulSoup(data,'html.parser')
jobs = soup.find_all('div',{'class':re.compile('market_listing_row market_recent_listing_row listing_''\d')})
for job in jobs:
price_tag = job.find('span',{'class':'market_listing_price market_listing_price_with_fee'})
price = price_tag.text[8:] if price_tag else "Sold!"
link = job.find('a',{'class':'popup_menu_item'}).get('href')
print('PRICE : ',price,link)
I got error AttributeError: 'NoneType' object has no attribute 'get' which is probably because this class appears in html code when someone click on the arrow button on the picture, and in the code i am unable to do so. Can anyone suggest me a way, so that i can get the link of Inspect in Game button
Go through the API. I believe it already doesn't include sold items:
from bs4 import BeautifulSoup
import requests
import re
import math
import time
print("Fetching Data")
with requests.Session() as s:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
s.get('https://steamcommunity.com/')
cookiesDict = s.cookies.get_dict()
headers.update({'Cookie':'sessionid=' + cookiesDict['sessionid']})
url = 'https://steamcommunity.com/market/listings/730/AWP%20%7C%20Atheris%20%28Field-Tested%29/render/'
payload = {
'query':'',
'start': '0',
'count': '100',
'country': 'US',
'language': 'english',
'currency': '1'}
results = {}
jsonData = s.get(url, params=payload, headers=headers).json()
total_count = jsonData['total_count']
total_pages = math.ceil(total_count/100)
for page in range(0,total_pages):
#page=1
time.sleep(5)
payload = {
'query':'',
'start': '%s' %(page*100),
'count': '100',
'country': 'US',
'language': 'english',
'currency': '1'}
jsonData = s.get(url, params=payload, headers=headers).json()
results.update(jsonData['listinginfo'])
print ('Aquired page %s of %s...' %(page, total_pages))
for k, v in results.items():
price = (v['converted_price'] + v['converted_fee']) / 100
link = v['asset']['market_actions'][0]['link']
print('PRICE : $%.02f %s' %(price, link))
Output:
....
PRICE : $4.64 steam://rungame/730/76561202255233023/+csgo_econ_action_preview%20M%listingid%A%assetid%D2353320232219212327
PRICE : $4.64 steam://rungame/730/76561202255233023/+csgo_econ_action_preview%20M%listingid%A%assetid%D5237312699887394313
PRICE : $4.64 steam://rungame/730/76561202255233023/+csgo_econ_action_preview%20M%listingid%A%assetid%D17060803922484197923
PRICE : $4.64 steam://rungame/730/76561202255233023/+csgo_econ_action_preview%20M%listingid%A%assetid%D3054184006200365805
PRICE : $4.64 steam://rungame/730/76561202255233023/+csgo_econ_action_preview%20M%listingid%A%assetid%D5237312699887394313
....
Related
need help with retrieving review section from agoda website.
from bs4 import BeautifulSoup
import requests
import json
from tqdm import tqdm
filename = "hotel.csv"
f = open(filename, "w", encoding="utf-8")
headers = "title, rating, review\n"
f.write(headers)
api_url = "https://www.agoda.com/api/cronos/property/review/ReviewComments"
headers = {
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}
# for loop for multiple page scrap
for x in tqdm(range(1,10)):
post_data = {"hotelId":"2252947",
"providerId":"332",
"demographicId":"0",
"page":str(x),
"pageSize":"20",
"sorting":"7",
"providerIds":[332],
"isReviewPage":"false",
"isCrawlablePage":"true",
"filters":{"language":[],"room":[]},
"searchKeyword":"",
"searchFilters":[]}
html = requests.post(api_url, data=post_data)
values = html.text
soup = BeautifulSoup(values, "html.parser")
hotels = soup.find_all("div", {"class": "review-comment"})
for hotel in hotels:
try:
rating = hotel.find("div", {"class":"Review-comment-leftScore"}).text
title = hotel.find("p", {"class":"Review-comment-bodyTitle"}).text
review = hotel.find("p", {"class":"Review-comment-bodyText"}).text
f.write(title + ", "+ rating + ", " + review + "\n")
except TypeError:
continue
f.close()
post data i get from firefox network monitor when i change the page on the review section.
the hotel: Hotel Page
tried the json method but i dont understand
I think your api endpoint or data is wrong. 'cause if you try just to print, you get <Response [415]>.
Should be 200.
html.json()
{'type': 'https://tools.ietf.org/html/rfc7231#section-6.5.13', 'title': 'Unsupported Media Type', 'status': 415, 'traceId': '00-68f23e7f0431e7bffae420112667ed1b-6306a38dd716894d-00'}
I can't scrape the "Product Details" section (scrolling down the webpage you'll find it) html by using requests or requests_html.
Find_all returns a 0 size object... Any Help?
from requests import session
from requests_html import HTMLSession
s = HTMLSession()
#s = session()
r = s.get("https://www.amazon.com/dp/B094HWN66Y")
soup = BeautifulSoup(r.text, 'html.parser')
len(soup.find_all("div", {"id":"detailBulletsWrapper_feature_div"}))
Product details with different information:
Code:
from bs4 import BeautifulSoup
import requests
cookies = {'session': '131-1062572-6801905'}
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
r = requests.get("https://www.amazon.com/dp/B094HWN66Y",headers=headers,cookies=cookies)
print(r)
soup = BeautifulSoup(r.text, 'lxml')
key = [x.get_text(strip=True).replace('\u200f\n','').replace('\u200e','').replace(':\n','').replace('\n', '').strip() for x in soup.select('ul.a-unordered-list.a-nostyle.a-vertical.a-spacing-none.detail-bullet-list > li > span > span.a-text-bold')][:13]
#print(key)
value = [x.get_text(strip=True) for x in soup.select('ul.a-unordered-list.a-nostyle.a-vertical.a-spacing-none.detail-bullet-list > li > span > span:nth-child(2)')]
#print(value)
product_details = {k:v for k, v, in zip(key, value)}
print(product_details)
Output:
{'ASIN': 'B094HWN66Y', 'Publisher': 'Boldwood Books (September 7, 2021)', 'Publication date':
'September 7, 2021', 'Language': 'English', 'File size': '1883 KB', 'Text-to-Speech': 'Enabled', 'Screen Reader': 'Supported', 'Enhanced typesetting': 'Enabled', 'X-Ray': 'Enabled', 'Word
Wise': 'Enabled', 'Print length': '332 pages', 'Page numbers source ISBN': '1800487622', 'Lending': 'Not Enabled'}
This is an example of how to scrape the title of the product using bs4 and requests, easily expandable to getting other info from the product.
The reason yours doesn't work is your request has no headers so Amazon realises your a bot and doesn't want you scraping their site. This is shown by your request being returned as <Response [503]> and explained in r.text.
I believe Amazon have an API for this (that they'd probably like you to use) but it'll be fine to scrape like this for small-scale stuff.
import requests
import bs4
# Amazon don't like you scrapeing them however these headers should stop them from noticing a small number of requests
HEADERS = ({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko)Chrome/44.0.2403.157 Safari/537.36','Accept-Language': 'en-US, en;q=0.5'})
def main():
url = "https://www.amazon.com/dp/B094HWN66Y"
title = get_title(url)
print("The title of %s is: %s" % (url, title))
def get_title(url: str) -> str:
"""Returns the title of the amazon product."""
# The request
r = requests.get(url, headers=HEADERS)
# Parse the content
soup = bs4.BeautifulSoup(r.content, 'html.parser')
title = soup.find("span", attrs={"id": 'productTitle'}).string
return title
if __name__ == "__main__":
main()
Output:
The title of https://www.amazon.com/dp/B094HWN66Y is: Will They, Won't They?
I have recently started learning python and one of my first project is to get live stock prices from Google finance using beautifulsoup. Basically I am looking up for a stock and setting a price alert.
here is what my code looks like.
import requests
import time
import tkinter
from bs4 import BeautifulSoup
def st_Price(symbol):
baseurl = 'http://google.com/finance/quote/'
URL = baseurl + symbol + ":NSE?hl=en&gl=in"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_="YMlKec fxKbKc")
result = results.__str__()
#print(result)
res = result.split("₹")[1].split("<")[0]
res_flt = float(res.replace(",",""))
return res_flt
def main():
sym = input("Enter Stock Symbol : ")
price = input("Enter desired price : ")
x = st_Price(sym)
while x < float(price):
print(x)
t1 = time.perf_counter()
x = st_Price(sym)
t2 = time.perf_counter()
print("Internal refresh time is {}".format(t2-t1))
else:
print("The Stock {} achieved price greater than {}".format(sym,x))
root = tkinter.Tk()
root.geometry("150x150")
tkinter.messagebox.showinfo(title="Price Alert",message="Stock Price {} greater Than {}".format(x,price))
root.destroy()
if __name__ == "__main__":
main()
I am looking up following class in the Page HTML:
HTML element for the Stock
The code works perfectly fine but it takes too much time to fetch the information:
Enter Stock Symbol : INFY
Enter desired price : 1578
1574.0
Internal refresh time is 9.915285099999892
1574.0
Internal refresh time is 7.2284357999997155
I am not too much familiar with HTML. By referring online documentation I was able to figure out how to scrape necessary part.
Is there any way to reduce the time to fetch the data ?
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
Also, when using the requests library, the default requests user-agent is python-requests so websites understand that it's a bot or a script that sends a request, not a real user. Check what's your user-agent and pass it request headers.
To get just the current price you would need to use such CSS selector AHmHk .fxKbKc via the select_one() bs4 method, which could also change in the future.
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
html = requests.get(f"https://www.google.com/finance/quote/INFY:NSE", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
current_price = soup.select_one(".zzDege").text
print(current_price)
# ₹1,860.50
Code and full example in the online IDE to scrape current price and right panel data:
from bs4 import BeautifulSoup
import requests, lxml, json
from itertools import zip_longest
def scrape_google_finance(ticker: str):
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
html = requests.get(f"https://www.google.com/finance/quote/{ticker}", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
ticker_data = {"right_panel_data": {},
"ticker_info": {}}
ticker_data["ticker_info"]["title"] = soup.select_one(".zzDege").text
ticker_data["ticker_info"]["current_price"] = soup.select_one(".AHmHk .fxKbKc").text
right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
right_panel_values = soup.select(".gyFHrc .P6K39c")
for key, value in zip_longest(right_panel_keys, right_panel_values):
key_value = key.text.lower().replace(" ", "_")
ticker_data["right_panel_data"][key_value] = value.text
return ticker_data
data = scrape_google_finance(ticker="INFY:NSE")
# ensure_ascii=False to display Indian Rupee ₹ symbol
print(json.dumps(data, indent=2, ensure_ascii=False))
print(data["right_panel_data"].get("ceo"))
Outputs:
{
"right_panel_data": {
"previous_close": "₹1,882.95",
"day_range": "₹1,857.15 - ₹1,889.60",
"year_range": "₹1,311.30 - ₹1,953.90",
"market_cap": "7.89T INR",
"p/e_ratio": "36.60",
"dividend_yield": "1.61%",
"primary_exchange": "NSE",
"ceo": "Salil Parekh",
"founded": "Jul 2, 1981",
"headquarters": "Bengaluru, KarnatakaIndia",
"website": "infosys.com",
"employees": "292,067"
},
"ticker_info": {
"title": "Infosys Ltd",
"current_price": "₹1,860.50"
}
}
Salil Parekh
If you want to scrape more data with a line-by-line explanation, there's a Scrape Google Finance Ticker Quote Data in Python blog post of mine.
I'm trying to scrape website from a job postings data, and the output looks like this:
[{'job_title': 'Junior Data Scientist','company': '\n\n BBC',
summary': "\n We're now seeking a Junior Data Scientist to
come and work with our Marketing & Audiences team in London. The Data
Science team are responsible for designing...", 'link':
'www.jobsite.com',
'summary_text': "Job Introduction\nImagine if Netflix, The Huffington Post, ESPN, and Spotify were all rolled into one....etc
I want to create a dataframe, or a CSV, that looks like this:
right now, this is the loop I'm using:
for page in pages:
source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
soup = BeautifulSoup(source, 'lxml')
results = []
for jobs in soup.findAll(class_='result'):
result = {
'job_title': '',
'company': '',
'summary': '',
'link': '',
'summary_text': ''
}
and after using the loop, I just print the results.
What would be a good way to get the output in a dataframe? Thanks!
Look at the pandas Dataframe API. There are several ways you can initialize a dataframe
list of dictionaries
list of lists
You just need to append either a list or a dictionary to a global variable, and you should be good to go.
results = []
for page in pages:
source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
soup = BeautifulSoup(source, 'lxml')
for jobs in soup.findAll(class_='result'):
result = {
'job_title': '', # assuming this has value like you shared in the example in your question
'company': '',
'summary': '',
'link': '',
'summary_text': ''
}
results.append(result)
# results is now a list of dictionaries
df= pandas.DataFrame(results)
One other suggestion, don't think about dumping this in a dataframe within the same program. Dump all your HTML files first into a folder, and then parse them again. This way if you need more information from the page which you hadn't considered before, or if a program terminates due to some parsing error or timeout, the work is not lost. Keep parsing separate from crawling logic.
I think you need to define the number of pages and add that into your url (ensure you have a placeholder for the value which I don't think your code, nor the other answer have). I have done this via extending your url to include a page parameter in the querystring which incorporates a placeholder.
Is your selector of class result correct? You could certainly also use for job in soup.select('.job'):. You then need to define appropriate selectors to populate values. I think it easier to grab all the job links for each page then visit the page and extract the values from a json like string in the page. Add Session to re-use connection.
Explicit waits required to prevent being blocked
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 3
with requests.Session() as s:
for page in range(1, pages + 1):
try:
url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
source = s.get(url, headers = headers).text
soup = bs(source, 'lxml')
links.append([link['href'] for link in soup.select('.job-title a')])
except Exception as e:
print(e, url )
finally:
time.sleep(2)
final_list = [item for sublist in links for item in sublist]
for link in final_list:
source = s.get(link, headers = headers).text
soup = bs(source, 'lxml')
data = soup.select_one('#jobPostingSchema').text #json like string containing all info
item = json.loads(data)
result = {
'Title' : item['title'],
'Company' : item['hiringOrganization']['name'],
'Url' : link,
'Summary' :bs(item['description'],'lxml').text
}
results.append(result)
time.sleep(1)
df = pd.DataFrame(results, columns = ['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
Sample of results:
I can't imagine you want all pages but you could use something similar to:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 0
def get_links(url, page):
try:
source = s.get(url, headers = headers).text
soup = bs(source, 'lxml')
page_links = [link['href'] for link in soup.select('.job-title a')]
if page == 1:
global pages
pages = int(soup.select_one('.page-title span').text.replace(',',''))
except Exception as e:
print(e, url )
finally:
time.sleep(1)
return page_links
with requests.Session() as s:
links.append(get_links('https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page=1',1))
for page in range(2, pages + 1):
url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
links.append(get_links(url, page))
final_list = [item for sublist in links for item in sublist]
for link in final_list:
source = s.get(link, headers = headers).text
soup = bs(source, 'lxml')
data = soup.select_one('#jobPostingSchema').text #json like string containing all info
item = json.loads(data)
result = {
'Title' : item['title'],
'Company' : item['hiringOrganization']['name'],
'Url' : link,
'Summary' :bs(item['description'],'lxml').text
}
results.append(result)
time.sleep(1)
df = pd.DataFrame(results, columns = ['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
I am new to web scraping, so my apologies in advance if I'm misunderstanding anything...
I am trying to get data from ESPN. Here is my python code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nba/teams'
r = requests.get(url)
soup = BeautifulSoup(r.text)
tables = soup.find_all('dl')
teams = []
prefix_1 = []
prefix_2 = []
teams_urls = []
for table in tables:
lis = table.find_all('dt', text=False)
print lis
for li in lis:
info = dt
teams.append(info.text)
url = info['href']
teams_urls.append(url)
prefix_1.append(url.split('/')[-2])
prefix_2.append(url.split('/')[-1])
print (teams)
When I print at various points, i am getting empty brackets [] as a return. Please help. Thanks.
You are extracting the team names from the menu, but the actual page content contains teams also.
Let's use CSS selectors to get to the each team link on the page. As a result, let's construct a list of dictionaries with team names and urls inside:
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nba/teams'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'})
soup = BeautifulSoup(r.content, 'lxml')
teams = []
for link in soup.select('div.mod-table div.mod-content ul li h5 a[href]'):
teams.append({
'name': link.text,
'url': link['href']
})
print(teams)
Prints:
[
{'name': u'Boston Celtics', 'url': 'http://espn.go.com/nba/team/_/name/bos/boston-celtics'},
{'name': u'Brooklyn Nets', 'url': 'http://espn.go.com/nba/team/_/name/bkn/brooklyn-nets'},
...
{'name': u'Utah Jazz', 'url': 'http://espn.go.com/nba/team/_/name/utah/utah-jazz'}
]