BeautifulSoup Python 3: Howlongtobeat.com, extracting the name (and other elements) - python

I'm trying to figure out how to extract the name of the game with BeautifulSoup.
I think I'm having a problem with the HTML aspect of it.
Here's what I have so far:
from requests import get
url = 'https://howlongtobeat.com/game.php?id=38050'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
game_length = html_soup.find_all('div', class_='game_times')
length = (game_length[-1].find_all({'li': ' short time_100 shadow_box'})[-1].contents[3].get_text())
print(length)
game_name = html_soup.find_all('div', class_='profile_header_game')
game = (game_name[].find({"profile_header shadow_text"})[].contents[].get_text())
print(game)
I'm getting the length but not the game name. Why?
print(length) prints:
31 Hours
but the game_name part fails:
game_name = html_soup.find_all('div', class_='profile_header_game')
game = (game_name[].find({"profile_header shadow_text"})[].contents[].get_text())
  File "<stdin>", line 1
    game = (game_name[].find({"profile_header shadow_text"})[].contents[].get_text())
                     ^
SyntaxError: invalid syntax
print(game)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'game' is not defined
What am I doing wrong?

It looks like there are a few syntax issues in your code. Here is a corrected version:
from bs4 import BeautifulSoup
import requests
url = 'https://howlongtobeat.com/game.php?id=38050'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
game_times_tag = html_soup.find('div', class_='game_times')
game_time_list = []
for li_tag in game_times_tag.find_all('li'):
    title = li_tag.find('h5').text.strip()
    play_time = li_tag.find('div').text.strip()
    game_time_list.append((title, play_time))
for game_time in game_time_list:
    print(game_time)
profile_header_tag = html_soup.find("div", {"class": "profile_header shadow_text"})
game_name = profile_header_tag.text.strip()
print(game_name)

A shorter version:
game_length = html_soup.select('div.game_times li div')[-1].text
game_name = html_soup.select('div.profile_header')[0].text
developer = html_soup.find_all('strong', string='\nDeveloper:\n')[0].next_sibling
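Since select returns a plain list, indexing into an empty result raises an IndexError. Here is a minimal sketch of a safer lookup (first_text is a hypothetical helper, not part of bs4):
# Hypothetical helper: text of the first CSS match, or a default if nothing matches.
def first_text(soup, css, default='n/a'):
    matches = soup.select(css)
    return matches[0].text.strip() if matches else default

game_name = first_text(html_soup, 'div.profile_header')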

Related

Getting a 'not subscriptable' error when running a web scraping script

I am practicing web scraping and am using this code, looping over the result pages with a for loop.
import requests
from bs4 import BeautifulSoup
name=[]
link=[]
address=[]
for i in range (1,11):
    i=str(i)
    url = "https://forum.iktva.sa/exhibitors-list?&page="+i+"&searchgroup=37D5A2A4-exhibitors"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a in soup.select(".m-exhibitors-list__items__item__header__title__link"):
        company_url = "https://forum.iktva.sa/" + a["href"].split("'")[1]
        soup2 = BeautifulSoup(requests.get(company_url).content, "html.parser")
        n=soup2.select_one(".m-exhibitor-entry__item__header__title").text
        l=soup2.select_one("h4+a")["href"]
        a=soup2.select_one(".m-exhibitor-entry__item__body__contacts__address").text
        name.append(n)
        link.append(l)
        address.append(a)
When I run the program I get this error:
l=soup2.select_one("h4+a")["href"]
TypeError: 'NoneType' object is not subscriptable
I am not sure how to solve the problem.
You just need to replace the following code to handle None:
l = soup2.select_one("h4+a")
if l:
    l = l["href"]
else:
    l = "Website not available"
As you can see, this is needed because the website is not available for:
https://forum.iktva.sa/exhibitors/sanad
Or you can handle all the errors like this:
import requests
from bs4 import BeautifulSoup
def get_object(obj, attr=None):
    # Return the tag's attribute or text; fall back if the tag is missing.
    try:
        if attr:
            return obj[attr]
        else:
            return obj.text
    except Exception:
        return "Not available"

name = []
link = []
address = []
for i in range(1, 11):
    i = str(i)
    url = f"https://forum.iktva.sa/exhibitors-list?&page={i}&searchgroup=37D5A2A4-exhibitors"
    soup = BeautifulSoup(requests.get(url).text, features="lxml")
    for a in soup.select(".m-exhibitors-list__items__item__header__title__link"):
        company_url = "https://forum.iktva.sa/" + a["href"].split("'")[1]
        soup2 = BeautifulSoup(requests.get(company_url).content, "html.parser")
        n = soup2.select_one(".m-exhibitor-entry__item__header__title")  # pass the tag, not .text, so get_object can handle None
        n = get_object(n)
        l = soup2.select_one("h4+a")
        l = get_object(l, 'href')
        a = soup2.select_one(".m-exhibitor-entry__item__body__contacts__address")
        a = get_object(a)
        name.append(n)
        link.append(l)
        address.append(a)

Exception has occurred: AttributeError 'str' object has no attribute 'descendants'

I'm new to Python and I'm trying to build a web scraper for an internship.
from typing import Container
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
p1 = ["https://www.libris.ro/search?iv.q={}", "https://carturesti.ro/product/search/{}", "https://www.elefant.ro/search?SearchTerm={}&StockAvailability=true", "https://www.litera.ro/catalogsearch/result/?q{}", "https://www.librariadelfin.ro/?submitted=1&O=search&keywords{}&do_submit=1", "https://bookzone.ro/cautare?term={}", "https://www.librex.ro/search/{}/?q={}"]
#price_min = 1000000
#url_min, price_min
title = "percy jackson"
for x in p1:
    temp = x
    title = title.replace(" ", "+")
    url = temp.format(title)
    if url == "https://www.libris.ro/search?iv.q=" + title :
        **books = bs.find_all("div", class_="product-item-info imgdim-x")**
        for each_book in books:
            book_url = each_book.find("a")["href"]
            price = each_book.find("span", class_="price-wrapper")
            print(book_url)
            print(price)
and I'm getting this error for the line between the two asterisks:
Exception has occurred: AttributeError
'str' object has no attribute 'descendants'
After from bs4 import BeautifulSoup as bs, bs is the class. You need to instantiate that class with data from the web site. In the code below, I've added a requests call to get the page and have built the BeautifulSoup doc from there. You'll find some other errors in your code that need to be sorted out, but it will get you past this problem.
from typing import Container
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
p1 = ["https://www.libris.ro/search?iv.q={}", "https://carturesti.ro/product/search/{}", "https://www.elefant.ro/search?SearchTerm={}&StockAvailability=true", "https://www.litera.ro/catalogsearch/result/?q{}", "https://www.librariadelfin.ro/?submitted=1&O=search&keywords{}&do_submit=1", "https://bookzone.ro/cautare?term={}", "https://www.librex.ro/search/{}/?q={}"]
#price_min = 1000000
#url_min, price_min
title = "percy jackson"
for x in p1:
    temp = x
    title = title.replace(" ", "+")
    url = temp.format(title)
    if url == "https://www.libris.ro/search?iv.q=" + title :
        # THE FIX
        resp = requests.get(url)
        if not 200 <= resp.status_code < 299:
            print("failed", resp.status_code, url)
            continue
        doc = bs(resp.text, "html.parser")
        books = doc.find_all("div", class_="product-item-info imgdim-x")
        for each_book in books:
            book_url = each_book.find("a")["href"]
            price = each_book.find("span", class_="price-wrapper")
            print(book_url)
            print(price)
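One of those other errors worth flagging: find returns None when a tag is missing, so taking .text or indexing on the result can crash. A hedged sketch of a guard (tag_text is a hypothetical helper, not from the answer above):
# Hypothetical helper: safely extract a tag's text inside the book loop.
def tag_text(tag, default="n/a"):
    return tag.text.strip() if tag is not None else default

# e.g. price = tag_text(each_book.find("span", class_="price-wrapper"))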

Web scraping bus stops with BeautifulSoup

I am trying to web scrape the bus stop names for a given line; here is an example page for line 212: https://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=212. I want as output two lists, one with the bus stop names in one direction and the other with the opposite direction (it's clearly visible on the web page). I managed to get all the names in one list with:
import requests
from bs4 import BeautifulSoup
def download_bus_schedule(bus_number):
    URL = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    print(soup.prettify())
    all_bus_stops = []
    table = soup.find_all('a')
    for element in table:
        if element.get_text() in all_bus_stops:
            continue
        else:
            all_bus_stops.append(element.get_text())
    return all_bus_stops
print(download_bus_schedule('212'))
I guess the solution would be to somehow divide the soup into two parts.
You can use the bs4.element.Tag.findAll method:
import requests
from bs4 import BeautifulSoup
def download_bus_schedule(bus_number):
    all_bus_stops = []
    URL = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html.parser')
    for s in soup.select(".holo-list"):
        bus_stops = []
        for f in s.findAll("li"):
            if f.text not in bus_stops:
                bus_stops.append(f.text)
        all_bus_stops.append(bus_stops)
    return all_bus_stops
print(download_bus_schedule('212'))
Output:
[['Pl.Hallera', 'Pl.Hallera', 'Darwina', 'Namysłowska', 'Rondo Żaba', 'Rogowska', 'Kołowa', 'Dks Targówek', 'Metro Targówek Mieszkaniowy', 'Myszkowska', 'Handlowa', 'Metro Trocka', 'Bieżuńska', 'Jórskiego', 'Łokietka', 'Samarytanka', 'Rolanda', 'Żuromińska', 'Targówek-Ratusz', 'Św.Wincentego', 'Malborska', 'Ch Targówek'],
['Ch Targówek', 'Ch Targówek', 'Malborska', 'Św.Wincentego', 'Targówek-Ratusz', 'Żuromińska', 'Gilarska', 'Rolanda', 'Samarytanka', 'Łokietka', 'Jórskiego', 'Bieżuńska', 'Metro Trocka', 'Metro Trocka', 'Metro Trocka', 'Handlowa', 'Myszkowska', 'Metro Targówek Mieszkaniowy', 'Dks Targówek', 'Kołowa', 'Rogowska', 'Rondo Żaba', '11 Listopada', 'Bródnowska', 'Szymanowskiego', 'Pl.Hallera', 'Pl.Hallera']]
import requests
from bs4 import BeautifulSoup
def download_bus_schedule(bus_number):
    URL = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    bus_stops_1 = []
    bus_stops_2 = []
    directions = soup.find_all("ul", {"class":"holo-list"})
    for stop in directions[0].find_all("a"):
        # Compare the stripped text, not the Tag object, so duplicates are actually filtered.
        if stop.text.strip() not in bus_stops_1:
            bus_stops_1.append(stop.text.strip())
    for stop in directions[1].find_all("a"):
        if stop.text.strip() not in bus_stops_2:
            bus_stops_2.append(stop.text.strip())
    all_bus_stops = (bus_stops_1, bus_stops_2)
    return all_bus_stops
print(download_bus_schedule('212')[0])
print(download_bus_schedule('212')[1])
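As a side note, calling the function twice fetches and parses the page twice; a minimal tweak (same function as above) fetches once:
stops = download_bus_schedule('212')
print(stops[0])
print(stops[1])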
I may have misunderstood as I do not know Polish but see if this helps.
from bs4 import BeautifulSoup
import requests
url = 'https://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=212'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, "html.parser")
d = {}
for h2 in soup.select('h2.holo-divider'):
    d[h2.text] = []
    ul = h2.next_sibling
    for li in ul.select('li'):
        if li.a.text not in d[h2.text]:
            d[h2.text].append(li.a.text)
from pprint import pprint
pprint(d)
As all stops are encapsulated in the next unordered list, you could use the find_next function of bs4, e.g.:
import requests
from bs4 import BeautifulSoup

def download_bus_schedule(bus_number):
    URL = f"http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l={bus_number}"
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    directions = ["Ch Targówek", "Pl.Hallera"]
    result = {}
    for direction in directions:
        header = soup.find(text=direction)
        stop_list = header.find_next("ul")  # renamed from 'list', which shadows the builtin
        stops_names = [stop.get_text() for stop in stop_list]
        result[direction] = stops_names
    return result
Plus, you might want to use f-strings to format your strings, as they improve readability and are less error prone.
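For example, a minimal before/after sketch of the URL construction:
bus_number = "212"
# Concatenation (easy to get wrong with non-string values):
url = "http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l=" + bus_number
# f-string (clearer, and values are converted to str automatically):
url = f"http://www.m2.rozkladzik.pl/warszawa/rozklad_jazdy.html?l={bus_number}"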

Web scraping from multiple pages with for loop part 2

My original problem:
"I have created a web scraping tool for picking data from listed houses.
I have a problem when it comes to changing pages. I made a for loop to go from 1 to some number.
The problem is this: on this web page the last page can be different all the time. Right now it is 70, but tomorrow it can be 68 or 72. And if I set the range to, for example, (1-74), it will print the last page many times, because if you go past the maximum the site always loads the last page."
Then I got help from Ricco D, who wrote code that knows when to stop:
import requests
from bs4 import BeautifulSoup as bs
url='https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page=requests.get(url)
soup = bs(page.content,'html.parser')
last_page = None
pages = []
buttons=soup.find_all('button', class_= "Pagination__button__3H2wX")
for button in buttons:
pages.append(button.text)
print(pages)
This works just fine.
But when I try to combine it with my original code, which also works by itself, I run into an error:
Traceback (most recent call last):
File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
This is the error I get.
Any ideas how to get this to work? Thanks.
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests
my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'
filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)
page = requests.get(my_url)
soup = soup(page.content, 'html.parser')
pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)
last_page = int(pages[-1])
for sivu in range(1, last_page):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})
    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # apartment size in square metres
        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # apartment price
        address_city = container.h4.text
        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # street address
        city_part = address_city.split(', ')[-2]  # district
        city = address_city.split(', ')[-1]  # city
        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # apartment type
        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)
        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)
        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")
f.close()
Your main problem has to do with the way you use soup. You first import BeautifulSoup as soup - and then you override this name when you create your first BeautifulSoup instance:
soup = soup(page.content, 'html.parser')
From this point on, soup no longer names the BeautifulSoup class, but the object you just created. Hence, when you try some lines further down to create a new instance (page_soup = soup(req.text, "html.parser")), this fails, as soup no longer refers to BeautifulSoup.
So the best thing would be to import the library correctly, like so: from bs4 import BeautifulSoup (or import AND use it as bs - like Ricco D did), and then change the two instantiating lines like so:
soup = BeautifulSoup(page.content, 'html.parser') # this is Python2.7-syntax btw
and
page_soup = BeautifulSoup(req.text, "html.parser") # this is Python3-syntax btw
If you're on Python 3, the proper requests syntax would be page.text and not page.content, as .content returns bytes in Python 3, which is not what you want here (BeautifulSoup needs a str). If you're on Python 2.7, you should probably change req.text to req.content.
Good luck.
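A minimal sketch of the shadowing problem in isolation. Note that calling a BeautifulSoup object is an alias for find_all, which is exactly why your traceback complains about a ResultSet:
from bs4 import BeautifulSoup as soup

soup = soup("<p>first page</p>", "html.parser")  # rebinds the name 'soup' to this object
# 'soup' is no longer the class; calling the object invokes find_all,
# so the "new document" is actually an (empty) ResultSet:
page_soup = soup("<p>second page</p>", "html.parser")
print(type(page_soup))  # <class 'bs4.element.ResultSet'>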
Finding your element by class name doesn't seem to be the best idea, because the same class name is used for all the following elements.
I don't know exactly what you are looking for because of the language. I suggest you go to the website, press F12, press Ctrl+F, and type an XPath to see which elements you get. If you don't know about XPaths, read this: https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples
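If you want to try XPath from Python rather than just in DevTools, here is a minimal sketch using lxml (the XPath below is illustrative only, not verified against the live page):
import requests
from lxml import html

url = "https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1"
tree = html.fromstring(requests.get(url).text)
# Illustrative XPath; confirm the real structure with F12 / Ctrl+F first.
cards = tree.xpath('//div[contains(@class, "ListPage__cardContainer")]')
print(len(cards))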

Fix the syntax error of a list comprehension that contains Beautiful Soup methods

I tried hard but there is always some syntax error with the piece of code that follows.
import urllib.request
import re
import csv
from bs4 import BeautifulSoup
from bs4 import NavigableString
from unicodedata import normalize
url = input('Please paste the link here: ')
html = urllib.request.urlretrieve(url)
html_file = open(html[0])
soup = BeautifulSoup(html_file, 'html5lib')
def contains_href(tag):
    return tag.find('a', href=True)
scrollables = [table in soup.find_all('table', class_='sc_courselist') if contains_href(table)]
def num_name_unit(tag):
    td_num = tag.find('td', href=True)
    num = normalize('NFKD', td_num.string.strip())
    td_name = tag.find('td', class_=False)
    name = normalize('NFKD', td_name.string.strip())
    td_unit = tag.find('td', class_='hourscol')
    unit = normalize('NFKD', td_unit.string.strip())
    row = ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return row
dic_rows = {scrollable.find_previous_siblings(re.compile('h'), class_=False, limit=1).string.strip(): list(num_name_unit(tr) for tr in scrollable.find_all('tr', contains_href)) for scrollable in scrollables}
I expect the terminal to print the prompt "Please paste the link here: ". In reality, it reports "invalid syntax" at the end of scrollables = [table in soup.find_all('table', class_='sc_courselist') if contains_href(table)].
You are missing the for part in your list comprehension. It should be:
[table for table in soup.find_all('table', class_='sc_courselist') if contains_href(table)]
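For reference, a filtering comprehension always has the shape [expression for item in iterable if condition]; a tiny standalone example:
numbers = [1, 2, 3, 4, 5]
# keep each n from numbers for which the condition holds
evens = [n for n in numbers if n % 2 == 0]
print(evens)  # [2, 4]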
