How to crawl href - Python & beautifulsoup - python

I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several trials, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never show up.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("a", {"class": "j_activity_item_link"})
for item in g_data:
Deeplink = item.find_all("a")
for t in Deeplink:
print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me put? Any feedback is appreciated.

Your "error" of error code 0 simply indicates that everything went ok with your run. According to your example, your list g_data should contain all of the a tags that you are interested in. You should not need the second for loop to again iterate through and find nested a tags. As a debugging step, print the length of your lists to ensure that they are not empty. See the following:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("a", {"class": "j_activity_item_link"})
for item in g_data:
print(item.get("href"))

You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
from bs4 import BeautifulSoup as soup
data = soup(str(urllib.urlopen('https://www.klook.com/city/30-kyoto/?p=1').read()), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class':'p_num '})]
activities = {1:[i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]}
for page in page_numbers:
data = soup(str(urllib.urlopen('https://www.klook.com/city/30-kyoto/?p={}'.format(page)).read()), 'lxml')
activities[int(page)] = [i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', '/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}

Related

Webscrape Beautifulsoup on website (get multiple hrefs)

I want to web scrape this webpage (carbuzz.com). I want to get the links (href) of all the car brands from "Acura" to "Volvo" (link to picture).
Currently, I only get the first entry (Acura). How do I get the remaining ones? As I just started scraping and coding would highly appreciate your input!
Code:
from bs4 import BeautifulSoup
import requests
import time
#Inputs/URLs to scrape:
URL2 = ('https://carbuzz.com/cars')
(response := requests.get(URL2)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
car_brand = overview.find(class_='bg-make-preview')['href']
car_brand_url ='https://carbuzz.com'+car_brand
print(car_brand_url)
Output:
[Finished in 1.2s]
You can use find_all to get the tag with class name bg-make-preview.
soup = BeautifulSoup(response.text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
car_brand_url ='https://carbuzz.com' + elem['href']
print(car_brand_url)
This gives us the expected output
https://carbuzz.com/cars/acura
https://carbuzz.com/cars/alfa-romeo
https://carbuzz.com/cars/aston-martin
https://carbuzz.com/cars/audi
https://carbuzz.com/cars/bentley
https://carbuzz.com/cars/bmw
https://carbuzz.com/cars/bollinger
https://carbuzz.com/cars/bugatti
https://carbuzz.com/cars/buick
https://carbuzz.com/cars/cadillac
https://carbuzz.com/cars/caterham
https://carbuzz.com/cars/chevrolet
https://carbuzz.com/cars/chrysler
https://carbuzz.com/cars/dodge
https://carbuzz.com/cars/ferrari
https://carbuzz.com/cars/fiat
https://carbuzz.com/cars/fisker
https://carbuzz.com/cars/ford
https://carbuzz.com/cars/genesis
https://carbuzz.com/cars/gmc
https://carbuzz.com/cars/hennessey
https://carbuzz.com/cars/honda
https://carbuzz.com/cars/hyundai
https://carbuzz.com/cars/infiniti
https://carbuzz.com/cars/jaguar
https://carbuzz.com/cars/jeep
https://carbuzz.com/cars/karma
https://carbuzz.com/cars/kia
https://carbuzz.com/cars/koenigsegg
https://carbuzz.com/cars/lamborghini
https://carbuzz.com/cars/land-rover
https://carbuzz.com/cars/lexus
https://carbuzz.com/cars/lincoln
https://carbuzz.com/cars/lordstown
https://carbuzz.com/cars/lotus
https://carbuzz.com/cars/lucid
https://carbuzz.com/cars/maserati
https://carbuzz.com/cars/mazda
https://carbuzz.com/cars/mclaren
https://carbuzz.com/cars/mercedes-benz
https://carbuzz.com/cars/mini
https://carbuzz.com/cars/mitsubishi
https://carbuzz.com/cars/nissan
https://carbuzz.com/cars/pagani
https://carbuzz.com/cars/polestar
https://carbuzz.com/cars/porsche
https://carbuzz.com/cars/ram
https://carbuzz.com/cars/rimac
https://carbuzz.com/cars/rivian
https://carbuzz.com/cars/rolls-royce
https://carbuzz.com/cars/spyker
https://carbuzz.com/cars/subaru
https://carbuzz.com/cars/tesla
https://carbuzz.com/cars/toyota
https://carbuzz.com/cars/volkswagen
https://carbuzz.com/cars/volvo
https://carbuzz.com/cars/hummer
https://carbuzz.com/cars/maybach
https://carbuzz.com/cars/mercury
https://carbuzz.com/cars/pontiac
https://carbuzz.com/cars/saab
https://carbuzz.com/cars/saturn
https://carbuzz.com/cars/scion
https://carbuzz.com/cars/smart
https://carbuzz.com/cars/suzuki

How can I scrape Songs Title from this request that I have collected using python

import requests
from bs4 import BeautifulSoup
r = requests.get("https://gaana.com/playlist/gaana-dj-hindi-top-50-1")
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find("div", {"class": "s_c"})
print(result.class)
From the above code, I am able to scrape this data
https://www.pastiebin.com/5f08080b8db82
Now I would like to scrape only the title of the songs and then make a list out of them like the below:
Meri Aashiqui
Genda Phool
Any suggestions are much appreciated!
Try this :
import requests
from bs4 import BeautifulSoup
r = requests.get("https://gaana.com/playlist/gaana-dj-hindi-top-50-1")
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find("div", {"class": "s_c"})
#print(result)
div = result.find_all('div', class_='track_npqitemdetail')
name_list = []
for x in div:
span = x.find('span').text
name_list.append(span)
print(name_list)
this code will return all song name in name_list list.

How to get just links of articles in list using BeautifulSoup

Hey guess so I got as far as being able to add the a class to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a class. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
link = article.find('a', href=True)
links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print but I just want the href from every a class not the entire a class. Thank you
To get all links from the https://news.ycombinator.com, you can use CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'): # <-- find all <a> with class="storylink"
links_with_text.append(a['href']) # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html

How to get complete href links using beautifulsoup in python

I am trying to get top movies name by genre. I couldn't get complete href links for that, I stuck by getting half href links
By the following code I got,
https://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=animation&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=biography&sort=user_rating,desc&title_type=feature&num_votes=25000,
.........
Like that but i want to all top 100 movies name by its genre like action, Adventure, Animation, Biography.......
I tried the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.imdb.com'
main_url = url + '/chart/top'
res = requests.get(main_url)
soup = BeautifulSoup(res.text, 'html.parser')
for href in soup.find_all(class_='subnav_item_main'):
# print(href)
all_links = url + href.find('a').get('href')
print(all_links)
I want complete link as shown bellow from a link
/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=FM1ZEBQ7E9KGQSDD441H&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1"
You need another loop over those urls and a limit to only get 100. I store in a dictionary with keys being genre and values being a list of films. Note original titles may appear e.g. The Mountain II (2016) is Dag II (original title).
links is a list of tuples where I keep the genre as first item and url as second.
import requests, pprint
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
url = 'https://www.imdb.com/chart/top'
genres = {}
with requests.Session() as s:
r = s.get(url)
soup = bs(r.content, 'lxml')
links = [(i.text, urljoin(url,i['href'])) for i in soup.select('.subnav_item_main a')]
for link in links:
r = s.get(link[1])
soup = bs(r.content, 'lxml')
genres[link[0].strip()] = [i['alt'] for i in soup.select('.loadlate', limit = 100)]
pprint.pprint(genres)
Sample output:

BeautifulSoup and scraping href's isn't working

Again I am having trouble scraping href's in BeautifulSoup. I have a list of pages that I am scraping and I have the data but I can't seem to get the hrefs even when I use various codes that work in other scripts.
So here is the code and my data will be below that:
import requests
from bs4 import BeautifulSoup
with open('states_names.csv', 'r') as reader:
states = [states.strip().replace(' ', '-') for states in reader]
url = 'https://www.hauntedplaces.org/state/alabama'
for state in states:
page = requests.get(url+state)
soup = BeautifulSoup(page.text, 'html.parser')
links = soup.findAll('div', class_='description')
# When I try to add .get('href') I get a traceback error. Am I trying to scrape the href too early?
h_page = soup.findAll('h3')
<h3>Gaines Ridge Dinner Club</h3>
<h3>Purifoy-Lipscomb House</h3>
<h3>Kate Shepard House Bed and Breakfast</h3>
<h3>Cedarhurst Mansion</h3>
<h3>Crybaby Bridge</h3>
<h3>Gaineswood Plantation</h3>
<h3>Mountain View Hospital</h3>
This works perfectly:
from bs4 import BeautifulSoup
import requests
url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for link in soup.select('div.description a'):
print(link['href'])
Try that:
soup = BeautifulSoup(page.content, 'html.parser')
list0 = []
possible_links = soup.find_all('a')
for link in possible_links:
if link.has_attr('href'):
print (link.attrs['href'])
list0.append(link.attrs['href'])
print(list0)

Categories

Resources