I am a newbie at Python and am currently learning web scraping using BeautifulSoup. I am trying to scrape Steam to display each game's name, price, and genre. My code can find all of this, but when I put it in a for loop, it doesn't work. Can you identify the problem?
Thank you so much for the help!
This will show everything I need (and more) on the page (name, price, genre):
from bs4 import BeautifulSoup
import requests

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")

for item in content.findAll("div", attrs={"id": "tab_content_NewReleases"}):
    print(item.text)
This will only show the first game, so I believe it is not looping correctly:
from bs4 import BeautifulSoup
import requests

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")

for item in content.findAll("div", attrs={"id": "tab_content_NewReleases"}):
    itemObject = {
        "name": item.find("div", attrs={"class": "tab_item_name"}).text,
        "price": item.find("div", attrs={"class": "discount_final_price"}).text,
        "genre": item.find("div", attrs={"class": "tab_item_top_tags"}).text
    }
    print(itemObject)
I'm expecting results like this, but with more than one result:
{
    'name': 'Little Misfortune',
    'price': '$19.99',
    'genre': 'Adventure, Indie, Casual, Singleplayer'
}
The issue is that content.findAll("div", attrs={"id": "tab_content_NewReleases"}) returns a list whose very first element (results[0]) contains all of the results you want, so you only get the first result. When you iterate over it, you search the HTML that contains the good stuff only once, hence the one-result issue. The solution is to search that found block for your desired results and split THAT into an iterable you can work with. Here is my solution:
from bs4 import BeautifulSoup
import requests

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")

bulk = content.find("div", attrs={"id": "tab_content_NewReleases"})  # isolate the block you want
results = bulk.findAll('a', attrs={'class': 'tab_item'})  # split it into the separate results

for item in results:
    itemObject = {
        "name": item.find("div", attrs={"class": "tab_item_name"}).text,
        "price": item.find("div", attrs={"class": "discount_final_price"}).text,
        "genre": item.find("div", attrs={"class": "tab_item_top_tags"}).text
    }
    print(itemObject)
You got 90% of the way there, just missing that little bit.
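If you want to confirm the diagnosis, note that an id is unique within an HTML page, so the original findAll returns a single-element list; a quick check you can run against the same soup:

blocks = content.findAll("div", attrs={"id": "tab_content_NewReleases"})
print(len(blocks))  # 1 -- so the original loop body ran exactly once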
Make sure you are working with the children, so add a child a to the selector. You could also make the parent the rows element, i.e. #NewReleasesRows a:
from bs4 import BeautifulSoup
import requests

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")

for item in content.select('#NewReleasesRows a'):
    itemObject = {
        "name": item.find("div", attrs={"class": "tab_item_name"}).text,
        "price": item.find("div", attrs={"class": "discount_final_price"}).text,
        "genre": item.find("div", attrs={"class": "tab_item_top_tags"}).text
    }
    print(itemObject)
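For comparison, the same rows can also be reached through the container id used in the first answer; a minimal sketch, assuming the page structure shown there:

for item in content.select('#tab_content_NewReleases a.tab_item'):
    print(item.find("div", attrs={"class": "tab_item_name"}).text)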
I think you are not selecting the right tag. Use 'NewReleasesRows' instead to find the table containing the rows of new releases.
So the code would look like this, using a CSS selector:
import requests
from bs4 import BeautifulSoup

my_page_text = requests.get('https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases').text
my_soup: BeautifulSoup = BeautifulSoup(my_page_text, 'lxml')
print("my_soup type:", type(my_soup))
my_table_list = my_soup.select('#NewReleasesRows')
print('my_table_list size:', len(my_table_list))
Then you can look for the rows, after having checked that you got only one table (you could use select_one too):
print(my_table_list[0].prettify())
my_table_rows = my_table_list[0].select('.tab_item')
and from there you can iterate:
for my_row in my_table_rows:
    print(my_row.get_text(strip=True))
Resulting output:
R 130.00Little MisfortuneAdventure, Indie, Casual, Singleplayer
-33%R 150.00R 100.50TrailmakersBuilding, Sandbox, Multiplayer, LEGO
-10%R 105.00R 94.50Devil's Deck 恶魔秘境Early Access, RPG, Indie, Early Access
R 89.00Showdown BanditAction, Adventure, Indie, Horror
R 150.00HardlandAdventure, Indie, Open World, Singleplayer
R 120.00Aeon's EndCard Game, Strategy, Indie, Adventure
R 105.00Atomorf2Casual, Action, Indie, Adventure
-10%R 175.00R 157.50Daymare: 1998Indie, Action, Survival Horror, Horror
-25%R 79.00R 59.25Ling: A Road AloneAction, RPG, Indie, Gore
-10%R 105.00R 94.50NauticrawlIndie, Simulation, Atmospheric, Sci-fi
FreeOrpheus's DreamFree to Play, Adventure, Indie, Casual
-40%R 105.00R 63.00AVAEarly Access, Action, Early Access, Indie
-40%R 18.00R 10.80Angry GolfIndie, Casual, Sports, Adventure
-40%R 10.00R 6.00Death LiveIndie, Casual, Adventure, Anime
-30%R 130.00R 91.00Die YoungSurvival, Action, Open World, Gore
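Since select_one was mentioned above, here is a minimal equivalent sketch, reusing my_soup from the code above (select_one returns None when nothing matches):

my_table = my_soup.select_one('#NewReleasesRows')
if my_table is not None:  # guard against a missing table
    for my_row in my_table.select('.tab_item'):
        print(my_row.get_text(strip=True))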
I hope that helps.
Best
I am trying to use BeautifulSoup to get the links off of this webpage: https://nfdc.faa.gov/nfdcApps/services/ajv5/fixes.jsp
I need the links to all of the fixes in Arizona (AZ), so I search for AZ. When I start by hitting 'A' under 'View fixes in alphabetical order:', I am not able to scrape the links shown when hovering over each fix (e.g. 'AALAN') using BeautifulSoup in Python. How can I do this? Here is my code:
import requests
from bs4 import BeautifulSoup as bs

page = requests.get("https://nfdc.faa.gov/nfdcApps/services/ajv5/fix_search.jsp?selectType=state&selectName=AZ&keyword=")
soup = bs(page.content, "html.parser")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)
And this is what it outputs:
['http://www.faa.gov', 'http://www.faa.gov', 'http://www.faa.gov/privacy/', 'http://www.faa.gov/web_policies/', 'http://www.faa.gov/contact/', 'http://faa.custhelp.com/', 'http://www.faa.gov/viewer_redirect.cfm?viewer=pdf&server_name=employees.faa.gov', 'http://www.faa.gov/viewer_redirect.cfm?viewer=doc&server_name=employees.faa.gov', 'http://www.faa.gov/viewer_redirect.cfm?viewer=ppt&server_name=employees.faa.gov', 'http://www.faa.gov/viewer_redirect.cfm?viewer=xls&server_name=employees.faa.gov', 'http://www.faa.gov/viewer_redirect.cfm?viewer=zip&server_name=employees.faa.gov']
The links to the fixes are not there (e.g. https://nfdc.faa.gov/nfdcApps/services/ajv5/fix_detail.jsp?fix=1948394&list=yes is not in the list).
I am looking to compile a list of all the fix links for Arizona so I can acquire the data. Thanks!
Try:
import requests
from bs4 import BeautifulSoup

url = "https://nfdc.faa.gov/nfdcApps/services/ajv5/fix_search.jsp"
data = {
    "alphabet": "A",
    "selectType": "STATE",
    "selectName": "AZ",
    "keyword": "",
}

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for data["alphabet"] in alphabet:  # rebinds data["alphabet"] to each letter in turn
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for a in soup.select('[href*="fix_detail.jsp"]'):
        print("{:<10} {}".format(a.text.strip(), a["href"]))
Prints:
...
ITEMM fix_detail.jsp?fix=17822&list=yes
ITUCO fix_detail.jsp?fix=56147&list=yes
IVLEC fix_detail.jsp?fix=11787&list=yes
IVVRY fix_detail.jsp?fix=20962&list=yes
IWANS fix_detail.jsp?fix=1948424&list=yes
IWEDU fix_detail.jsp?fix=13301&list=yes
IXAKE fix_detail.jsp?fix=585636&list=yes
...
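The hrefs in that output are relative, so here is a small follow-up sketch (my addition; urljoin from the standard library resolves each href against the search page URL) that collects the absolute URLs into the list the question asks for:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://nfdc.faa.gov/nfdcApps/services/ajv5/fix_search.jsp"
data = {"alphabet": "A", "selectType": "STATE", "selectName": "AZ", "keyword": ""}

links = []
for data["alphabet"] in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for a in soup.select('[href*="fix_detail.jsp"]'):
        # resolve the relative href to a full fix_detail.jsp URL
        links.append(urljoin(url, a["href"]))

print(len(links))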
I am having some trouble with my code. I'm extracting prices from a website, and some of the listings include both the original price and the discounted one; what I'm trying to get is only the discounted one. How do I fix this?
import requests
from bs4 import BeautifulSoup

# the query string is supplied via params, so the base URL stays clean
search_url = "https://store.steampowered.com/search/"
category1 = ('998', '996')
category2 = '29'
params = {
    'sort_by': 'Price_ASC',
    'category1': ','.join(category1),
    'category2': category2,
}
response = requests.get(search_url, params=params)

soup = BeautifulSoup(response.text, "html.parser")
elms = soup.find_all("span", {"class": "title"})
prcs = soup.find_all("div", {"class": "col search_price discounted responsive_secondrow"})

for elm in elms:
    print(elm.text)
for prc in prcs:
    print(prc.text)
You can also use .contents. .contents lists all the children of the selected tag, and from there you select your desired tag; in your case, the last child of the tag.
for prc in prcs:
    print(prc.contents[-1])
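If that last child carries surrounding whitespace (an assumption about Steam's markup rather than something verified here), stripping it keeps the output clean:

for prc in prcs:
    print(prc.contents[-1].strip())  # assumes the last child is the discounted price string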
Use next_sibling:
prcs = soup.find_all("div", {"class": "col search_price discounted responsive_secondrow"})
for p in prcs:
    print(p.find('span').next_sibling.next)  # step past the original-price span and the tag after it to reach the discounted text
I'm learning Beautiful Soup. I want to extract the player names, i.e. the playing eleven for both teams, from cricinfo.com. The exact link is "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
The problem is that the website only displays players under the class "wrap batsmen" if they have batted; otherwise they are placed under the class "wrap dnb". I want to extract all the players regardless of whether they have batted or not. How can I maintain two arrays (one for each team) that dynamically pull players from "wrap batsmen" and "wrap dnb" (if required)?
This is my attempt:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

# Years we will be analyzing
years = []
for i in range(2010, 2018):
    years.append(i)

names = []
# URL of the page we will be scraping
url = "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")

for a in range(0, 1):
    names.append([a.getText() for a in soup.find_all("div", class_="cell batsmen")[1:][a].findAll('a', limit=1)])

soup = soup.find_all("div", class_="wrap dnb")
print(soup[0])
While this is possible with BeautifulSoup, it's not the best tool for the job. All that data (and much more) is available through the API; simply pull that and then parse the JSON to get what you want (and more). Here's a quick script, though, to get the 11 players for each team:
You can get the API URL by using dev tools (Ctrl+Shift+I) and watching what requests the browser makes (look at Network -> XHR in the side panel; you may need to click around to make it send the request).
import requests

url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary'
payload = {
    'contentorigin': 'espn',
    'event': '439146',
    'lang': 'en',
    'region': 'gb',
    'section': 'cricinfo'}

jsonData = requests.get(url, params=payload).json()
roster = jsonData['rosters']

players = {}
for team in roster:
    players[team['team']['displayName']] = []
    for player in team['roster']:
        playerName = player['athlete']['displayName']
        players[team['team']['displayName']].append(playerName)
Output:
print (players)
{'West Indies': ['Chris Gayle', 'Andre Fletcher', 'Dwayne Bravo', 'Ramnaresh Sarwan', 'Narsingh Deonarine', 'Kieron Pollard', 'Darren Sammy', 'Nikita Miller', 'Jerome Taylor', 'Sulieman Benn', 'Kemar Roach'], 'South Africa': ['Graeme Smith', 'Loots Bosman', 'Jacques Kallis', 'AB de Villiers', 'Jean-Paul Duminy', 'Johan Botha', 'Alviro Petersen', 'Ryan McLaren', 'Roelof van der Merwe', 'Dale Steyn', 'Charl Langeveldt']}
I am attempting to scrape the following web page
https://www.betexplorer.com/tennis/wta-singles/dubai/siniakova-katerina-kvitova-petra/6ZCipZ9h/#ha
I am fine with scraping player names, the date, the score, however, I am running into trouble when trying to scrape the match odds of the different bookmakers (listed in the table)
Here is what I attempted
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.betexplorer.com/tennis/wta-singles/dubai/siniakova-katerina-kvitova-petra/6ZCipZ9h/')
soup = BeautifulSoup(r.text, 'html.parser')
odds = soup.find_all('td', attrs={'class': 'table-main__detail-odds table-main__detail-odds--first'})
print(odds)
[]
As you can see, nothing is being found.
Any ideas on this?
Thanks
The class you seem to be looking for is table-main__odds as per the page source.
For example:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.betexplorer.com/tennis/wta-singles/dubai/siniakova-katerina-kvitova-petra/6ZCipZ9h/')
soup = BeautifulSoup(r.text, 'html.parser')
odds = [x.attrs for x in soup.find_all('td', attrs={'class': 'table-main__odds'})]
print(odds)
Output:
[{u'class': [u'table-main__odds'],
u'data-odd': u'3.46',
u'data-odd-max': u'3.90'},
{u'class': [u'table-main__odds', u'colored']},
{u'class': [u'table-main__odds'],
u'data-odd': u'3.58',
u'data-odd-max': u'3.92'},
{u'class': [u'table-main__odds', u'colored']}]
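If you only want the odds values rather than the full attribute dicts, a small follow-up sketch reusing the soup from above (the 'colored' cells carry no data-odd attribute, so they are skipped):

odds_values = [
    td["data-odd"]
    for td in soup.find_all('td', attrs={'class': 'table-main__odds'})
    if td.has_attr('data-odd')  # skip the 'colored' cells without odds data
]
print(odds_values)  # e.g. ['3.46', '3.58']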
I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:
import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)
This produces the following for the first of many publications:
2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model
Solutions: An Algorithm with Error Formulasby Gary S. Anderson
I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:
Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}
I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use for loops as well.
import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})
print(papers[1])
{'Date': '2018-069',
'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
You could use a regex to match each part of the string:
[-\d]+ matches the date, which contains only digits and -
(?<=\s).*?(?=by) matches the title, which starts after a blank and ends right before "by" (which begins the author)
(?<=by\s).* matches the author, the rest of the string
Full code:
import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
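As a quick check, the three patterns can be tested against the sample line shown in the question (a sketch; the sample string is the scraped text quoted above, with the title and "by" run together):

import re

sample = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
          "Solutions: An Algorithm with Error Formulasby Gary S. Anderson")
print(re.findall(r"[-\d]+", sample)[0])            # 2018-070
print(re.findall(r"(?<=\s).*?(?=by)", sample)[0])  # the title
print(re.findall(r"(?<=by\s).*", sample)[0])       # Gary S. Anderson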