I want to scrape this webpage (carbuzz.com). I want to get the links (hrefs) of all the car brands, from "Acura" to "Volvo".
Currently, I only get the first entry (Acura). How do I get the remaining ones? As I just started scraping and coding, I would highly appreciate your input!
Code:
from bs4 import BeautifulSoup
import requests

# Inputs/URLs to scrape:
URL2 = 'https://carbuzz.com/cars'
response = requests.get(URL2)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()  # find() returns only the first matching tag
car_brand = overview.find(class_='bg-make-preview')['href']
car_brand_url = 'https://carbuzz.com' + car_brand
print(car_brand_url)
Output:
https://carbuzz.com/cars/acura
[Finished in 1.2s]
You can use find_all to get all the tags with the class name bg-make-preview:
soup = BeautifulSoup(response.text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
    car_brand_url = 'https://carbuzz.com' + elem['href']
    print(car_brand_url)
This gives us the expected output:
https://carbuzz.com/cars/acura
https://carbuzz.com/cars/alfa-romeo
https://carbuzz.com/cars/aston-martin
https://carbuzz.com/cars/audi
https://carbuzz.com/cars/bentley
https://carbuzz.com/cars/bmw
https://carbuzz.com/cars/bollinger
https://carbuzz.com/cars/bugatti
https://carbuzz.com/cars/buick
https://carbuzz.com/cars/cadillac
https://carbuzz.com/cars/caterham
https://carbuzz.com/cars/chevrolet
https://carbuzz.com/cars/chrysler
https://carbuzz.com/cars/dodge
https://carbuzz.com/cars/ferrari
https://carbuzz.com/cars/fiat
https://carbuzz.com/cars/fisker
https://carbuzz.com/cars/ford
https://carbuzz.com/cars/genesis
https://carbuzz.com/cars/gmc
https://carbuzz.com/cars/hennessey
https://carbuzz.com/cars/honda
https://carbuzz.com/cars/hyundai
https://carbuzz.com/cars/infiniti
https://carbuzz.com/cars/jaguar
https://carbuzz.com/cars/jeep
https://carbuzz.com/cars/karma
https://carbuzz.com/cars/kia
https://carbuzz.com/cars/koenigsegg
https://carbuzz.com/cars/lamborghini
https://carbuzz.com/cars/land-rover
https://carbuzz.com/cars/lexus
https://carbuzz.com/cars/lincoln
https://carbuzz.com/cars/lordstown
https://carbuzz.com/cars/lotus
https://carbuzz.com/cars/lucid
https://carbuzz.com/cars/maserati
https://carbuzz.com/cars/mazda
https://carbuzz.com/cars/mclaren
https://carbuzz.com/cars/mercedes-benz
https://carbuzz.com/cars/mini
https://carbuzz.com/cars/mitsubishi
https://carbuzz.com/cars/nissan
https://carbuzz.com/cars/pagani
https://carbuzz.com/cars/polestar
https://carbuzz.com/cars/porsche
https://carbuzz.com/cars/ram
https://carbuzz.com/cars/rimac
https://carbuzz.com/cars/rivian
https://carbuzz.com/cars/rolls-royce
https://carbuzz.com/cars/spyker
https://carbuzz.com/cars/subaru
https://carbuzz.com/cars/tesla
https://carbuzz.com/cars/toyota
https://carbuzz.com/cars/volkswagen
https://carbuzz.com/cars/volvo
https://carbuzz.com/cars/hummer
https://carbuzz.com/cars/maybach
https://carbuzz.com/cars/mercury
https://carbuzz.com/cars/pontiac
https://carbuzz.com/cars/saab
https://carbuzz.com/cars/saturn
https://carbuzz.com/cars/scion
https://carbuzz.com/cars/smart
https://carbuzz.com/cars/suzuki
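A small refinement: instead of concatenating strings, you can build the absolute URLs with urllib.parse.urljoin, which also copes with hrefs that happen to be absolute already. A minimal sketch of the same loop:

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

response = requests.get('https://carbuzz.com/cars')
soup = BeautifulSoup(response.text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
    # urljoin resolves each (possibly relative) href against the site root
    print(urljoin('https://carbuzz.com', elem['href']))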
I have a script that scrapes a website. However, I want it to scrape a whole range of pages incrementally. So imagine the range is set to 0-999. The code is:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.greekrank.com/uni/1/sororities/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
uni = soup.find_all('h1', class_='overviewhead')
for title in uni:
    print(title.text)
rows = soup.find_all('div', class_='desktop-view')
for row in rows:
    print(row.text)
It would go to https://www.greekrank.com/uni/1/sororities/ scrape that, then go to https://www.greekrank.com/uni/2/sororities/ scrape that, etc.
Wrap it all in a loop, and note how the URL is built with an f-string:
import requests
from bs4 import BeautifulSoup
for x in range(0, 1000):  # range(0, 1000) covers 0-999 inclusive
    URL = f'https://www.greekrank.com/uni/{x}/sororities/'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    uni = soup.find_all('h1', class_='overviewhead')
    for title in uni:
        print(title.text)
    rows = soup.find_all('div', class_='desktop-view')
    for row in rows:
        print(row.text)
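Some IDs in that range may not exist. Assuming the site answers missing IDs with a non-200 status (an assumption worth checking against a known-bad ID), a small guard keeps the loop from printing error pages:

import requests
from bs4 import BeautifulSoup

for x in range(0, 1000):
    page = requests.get(f'https://www.greekrank.com/uni/{x}/sororities/')
    if page.status_code != 200:
        continue  # skip IDs that do not resolve to a university page (assumed non-200)
    soup = BeautifulSoup(page.content, 'html.parser')
    for title in soup.find_all('h1', class_='overviewhead'):
        print(title.text)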
I am trying to get the top movie names by genre. I couldn't get the complete href links; I'm stuck with partial href links.
With the following code I got:
https://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=animation&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=biography&sort=user_rating,desc&title_type=feature&num_votes=25000,
.........
Like that, but I want the names of the top 100 movies for each genre (Action, Adventure, Animation, Biography, ...).
I tried the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.imdb.com'
main_url = url + '/chart/top'
res = requests.get(main_url)
soup = BeautifulSoup(res.text, 'html.parser')
for href in soup.find_all(class_='subnav_item_main'):
    # print(href)
    all_links = url + href.find('a').get('href')
    print(all_links)
I want the complete link, as shown below:
/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=FM1ZEBQ7E9KGQSDD441H&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1
You need another loop over those URLs, plus a limit so you only take 100. I store the results in a dictionary with the genres as keys and lists of films as values. Note that original titles may appear, e.g. The Mountain II (2016) is listed as Dag II (original title).
links is a list of tuples where the first item is the genre and the second is the URL.
import requests, pprint
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = 'https://www.imdb.com/chart/top'
genres = {}
with requests.Session() as s:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    links = [(i.text, urljoin(url, i['href'])) for i in soup.select('.subnav_item_main a')]
    for link in links:
        r = s.get(link[1])
        soup = bs(r.content, 'lxml')
        genres[link[0].strip()] = [i['alt'] for i in soup.select('.loadlate', limit=100)]
pprint.pprint(genres)
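Once the dictionary is built, each genre's list can be read out directly; for example (assuming the subnav text for the genre reads 'Action'):

print(len(genres['Action']))   # up to 100 titles
print(genres['Action'][:5])    # first five titles in that genre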
Again I am having trouble scraping hrefs with BeautifulSoup. I have a list of pages that I am scraping, and I get the data, but I can't seem to get the hrefs even with approaches that work in my other scripts.
So here is the code, and my data will be below that:
import requests
from bs4 import BeautifulSoup
with open('states_names.csv', 'r') as reader:
    states = [state.strip().replace(' ', '-') for state in reader]

url = 'https://www.hauntedplaces.org/state/'
for state in states:
    page = requests.get(url + state)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.findAll('div', class_='description')
    # When I try to add .get('href') I get a traceback error. Am I trying to scrape the href too early?
    h_page = soup.findAll('h3')
<h3>Gaines Ridge Dinner Club</h3>
<h3>Purifoy-Lipscomb House</h3>
<h3>Kate Shepard House Bed and Breakfast</h3>
<h3>Cedarhurst Mansion</h3>
<h3>Crybaby Bridge</h3>
<h3>Gaineswood Plantation</h3>
<h3>Mountain View Hospital</h3>
This works perfectly:
from bs4 import BeautifulSoup
import requests
url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for link in soup.select('div.description a'):
    print(link['href'])
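To plug that selector back into your state loop, a sketch (assuming the base URL ends at /state/ so the state name can simply be appended):

from bs4 import BeautifulSoup
import requests

base = 'https://www.hauntedplaces.org/state/'
with open('states_names.csv', 'r') as reader:
    states = [state.strip().replace(' ', '-') for state in reader]

for state in states:
    page = requests.get(base + state)
    soup = BeautifulSoup(page.text, 'lxml')
    for link in soup.select('div.description a'):
        print(link['href'])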
Try this:
soup = BeautifulSoup(page.content, 'html.parser')  # 'page' is the response from your existing request
list0 = []
possible_links = soup.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])
        list0.append(link.attrs['href'])
print(list0)
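Equivalently, BeautifulSoup can filter on attribute presence directly, which saves the has_attr check:

links_with_href = soup.find_all('a', href=True)  # only anchors that carry an href
for link in links_with_href:
    print(link['href'])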
I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several trials, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never shows up.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1, 6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    for item in g_data:
        Deeplink = item.find_all("a")
        for t in Deeplink:
            print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me out? Any feedback is appreciated.
Your "error" of error code 0 simply indicates that everything went ok with your run. According to your example, your list g_data should contain all of the a tags that you are interested in. You should not need the second for loop to again iterate through and find nested a tags. As a debugging step, print the length of your lists to ensure that they are not empty. See the following:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1, 6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    print(len(g_data))  # debugging step: make sure the list is not empty
    for item in g_data:
        print(item.get("href"))
You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
import urllib.request
from bs4 import BeautifulSoup as soup

data = soup(urllib.request.urlopen('https://www.klook.com/city/30-kyoto/?p=1').read(), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class': 'p_num '})]
activities = {1: [i['href'] for i in data.find_all('a', {'href': re.compile("^/activity/")})]}
for page in page_numbers:
    data = soup(urllib.request.urlopen('https://www.klook.com/city/30-kyoto/?p={}'.format(page)).read(), 'lxml')
    activities[int(page)] = [i['href'] for i in data.find_all('a', {'href': re.compile("^/activity/")})]
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', 
'/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}
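For consistency with the requests-based examples above, the same regex approach could also be written as a sketch like this (hardcoding the five pages from the first answer instead of reading the pager):

import re
import requests
from bs4 import BeautifulSoup

activities = {}
for page in range(1, 6):
    r = requests.get('https://www.klook.com/city/30-kyoto/?p={}'.format(page))
    data = BeautifulSoup(r.content, 'lxml')
    # href=re.compile(...) keeps only the activity deeplinks
    activities[page] = [a['href'] for a in data.find_all('a', href=re.compile('^/activity/'))]
print(activities)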