How to crawl href - Python & BeautifulSoup
I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several trials, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never shows up in my output.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Chrome/43.0.2357'}

for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})

    for item in g_data:
        Deeplink = item.find_all("a")

        for t in Deeplink:
            print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me out? Any feedback is appreciated.
Your "error" of error code 0 simply indicates that everything went ok with your run. According to your example, your list g_data should contain all of the a tags that you are interested in. You should not need the second for loop to again iterate through and find nested a tags. As a debugging step, print the length of your lists to ensure that they are not empty. See the following:
import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Chrome/43.0.2357'}

for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})

    for item in g_data:
        print(item.get("href"))
You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
import urllib.request  # Python 3; the original answer used Python 2's urllib.urlopen
from bs4 import BeautifulSoup as soup

data = soup(str(urllib.request.urlopen('https://www.klook.com/city/30-kyoto/?p=1').read()), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class':'p_num '})]
activities = {1:[i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]}

for page in page_numbers:
    data = soup(str(urllib.request.urlopen('https://www.klook.com/city/30-kyoto/?p={}'.format(page)).read()), 'lxml')
    activities[int(page)] = [i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', 
'/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}
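If absolute URLs are needed rather than the relative paths shown above, they can be joined against the site root; a small follow-up sketch, assuming the activities dictionary from the answer and that https://www.klook.com is the correct base:

from urllib.parse import urljoin

base = 'https://www.klook.com'
full_links = {page: [urljoin(base, href) for href in hrefs]
              for page, hrefs in activities.items()}
# e.g. 'https://www.klook.com/activity/1031-arashiyama-rickshaw-tour-kyoto/'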
Related
Webscrape Beautifulsoup on website (get multiple hrefs)
I want to web scrape this webpage (carbuzz.com). I want to get the links (href) of all the car brands from "Acura" to "Volvo" (link to picture). Currently, I only get the first entry (Acura). How do I get the remaining ones? As I just started scraping and coding, I would highly appreciate your input! Code:
from bs4 import BeautifulSoup
import requests
import time

#Inputs/URLs to scrape:
URL2 = ('https://carbuzz.com/cars')
(response := requests.get(URL2)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')

overview = soup.find()
car_brand = overview.find(class_='bg-make-preview')['href']
car_brand_url = 'https://carbuzz.com' + car_brand
print(car_brand_url)
Output:
[Finished in 1.2s]
You can use find_all to get the tags with class name bg-make-preview.
soup = BeautifulSoup(response.text, 'lxml')

for elem in soup.find_all(class_='bg-make-preview'):
    car_brand_url = 'https://carbuzz.com' + elem['href']
    print(car_brand_url)
This gives us the expected output:
https://carbuzz.com/cars/acura
https://carbuzz.com/cars/alfa-romeo
https://carbuzz.com/cars/aston-martin
https://carbuzz.com/cars/audi
https://carbuzz.com/cars/bentley
https://carbuzz.com/cars/bmw
https://carbuzz.com/cars/bollinger
https://carbuzz.com/cars/bugatti
https://carbuzz.com/cars/buick
https://carbuzz.com/cars/cadillac
https://carbuzz.com/cars/caterham
https://carbuzz.com/cars/chevrolet
https://carbuzz.com/cars/chrysler
https://carbuzz.com/cars/dodge
https://carbuzz.com/cars/ferrari
https://carbuzz.com/cars/fiat
https://carbuzz.com/cars/fisker
https://carbuzz.com/cars/ford
https://carbuzz.com/cars/genesis
https://carbuzz.com/cars/gmc
https://carbuzz.com/cars/hennessey
https://carbuzz.com/cars/honda
https://carbuzz.com/cars/hyundai
https://carbuzz.com/cars/infiniti
https://carbuzz.com/cars/jaguar
https://carbuzz.com/cars/jeep
https://carbuzz.com/cars/karma
https://carbuzz.com/cars/kia
https://carbuzz.com/cars/koenigsegg
https://carbuzz.com/cars/lamborghini
https://carbuzz.com/cars/land-rover
https://carbuzz.com/cars/lexus
https://carbuzz.com/cars/lincoln
https://carbuzz.com/cars/lordstown
https://carbuzz.com/cars/lotus
https://carbuzz.com/cars/lucid
https://carbuzz.com/cars/maserati
https://carbuzz.com/cars/mazda
https://carbuzz.com/cars/mclaren
https://carbuzz.com/cars/mercedes-benz
https://carbuzz.com/cars/mini
https://carbuzz.com/cars/mitsubishi
https://carbuzz.com/cars/nissan
https://carbuzz.com/cars/pagani
https://carbuzz.com/cars/polestar
https://carbuzz.com/cars/porsche
https://carbuzz.com/cars/ram
https://carbuzz.com/cars/rimac
https://carbuzz.com/cars/rivian
https://carbuzz.com/cars/rolls-royce
https://carbuzz.com/cars/spyker
https://carbuzz.com/cars/subaru
https://carbuzz.com/cars/tesla
https://carbuzz.com/cars/toyota
https://carbuzz.com/cars/volkswagen
https://carbuzz.com/cars/volvo
https://carbuzz.com/cars/hummer
https://carbuzz.com/cars/maybach
https://carbuzz.com/cars/mercury
https://carbuzz.com/cars/pontiac
https://carbuzz.com/cars/saab
https://carbuzz.com/cars/saturn
https://carbuzz.com/cars/scion
https://carbuzz.com/cars/smart
https://carbuzz.com/cars/suzuki
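The same extraction can also be written with a CSS selector and a list comprehension; a brief alternative sketch, assuming (as the ['href'] lookup above suggests) that the bg-make-preview class sits on <a> elements:

soup = BeautifulSoup(response.text, 'lxml')
car_brand_urls = ['https://carbuzz.com' + a['href']
                  for a in soup.select('a.bg-make-preview')]
print(*car_brand_urls, sep='\n')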
How can I scrape song titles from this request that I have collected using Python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://gaana.com/playlist/gaana-dj-hindi-top-50-1")
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find("div", {"class": "s_c"})
print(result.class)
From the above code, I am able to scrape this data: https://www.pastiebin.com/5f08080b8db82
Now I would like to scrape only the titles of the songs and then make a list out of them, like the below:
Meri Aashiqui
Genda Phool
Any suggestions are much appreciated!
Try this:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://gaana.com/playlist/gaana-dj-hindi-top-50-1")
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find("div", {"class": "s_c"})
#print(result)

div = result.find_all('div', class_='track_npqitemdetail')
name_list = []

for x in div:
    span = x.find('span').text
    name_list.append(span)

print(name_list)
This code will return all of the song names in the name_list list.
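The same loop can be condensed into a list comprehension; a short equivalent sketch, assuming the track_npqitemdetail markup used in the answer above:

name_list = [x.find('span').text
             for x in result.find_all('div', class_='track_npqitemdetail')]
print(name_list)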
How to get just links of articles in list using BeautifulSoup
Hey guys, so I got as far as being able to add the a tags to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a tag. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='hnmain')
articles = results.find_all(class_="title")
links_with_text = []

for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)

print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print, but I just want the href from every a tag, not the entire tag. Thank you
To get all links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'. For example:
from bs4 import BeautifulSoup
from requests import get
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

links_with_text = []

for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']

print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
How to get complete href links using beautifulsoup in python
I am trying to get the top movie names by genre. I couldn't get the complete href links for that; I am stuck with only partial href links. With the following code I got:
https://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=animation&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=biography&sort=user_rating,desc&title_type=feature&num_votes=25000,
.........
Like that, but I want all of the top 100 movie names by genre, like Action, Adventure, Animation, Biography.......
I tried the following code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.imdb.com'
main_url = url + '/chart/top'
res = requests.get(main_url)
soup = BeautifulSoup(res.text, 'html.parser')

for href in soup.find_all(class_='subnav_item_main'):
    # print(href)
    all_links = url + href.find('a').get('href')
    print(all_links)
I want the complete link, as shown below, from a link like:
/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=FM1ZEBQ7E9KGQSDD441H&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1
You need another loop over those urls and a limit to only get 100. I store the results in a dictionary with keys being the genre and values being a list of films. Note that original titles may appear, e.g. The Mountain II (2016) is Dag II (original title). links is a list of tuples where I keep the genre as the first item and the url as the second.
import requests, pprint
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = 'https://www.imdb.com/chart/top'
genres = {}

with requests.Session() as s:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    links = [(i.text, urljoin(url, i['href'])) for i in soup.select('.subnav_item_main a')]

    for link in links:
        r = s.get(link[1])
        soup = bs(r.content, 'lxml')
        genres[link[0].strip()] = [i['alt'] for i in soup.select('.loadlate', limit=100)]

pprint.pprint(genres)
Sample output:
BeautifulSoup and scraping hrefs isn't working
Again I am having trouble scraping hrefs in BeautifulSoup. I have a list of pages that I am scraping and I have the data, but I can't seem to get the hrefs even when I use various codes that work in other scripts. So here is the code, and my data will be below that:
import requests
from bs4 import BeautifulSoup

with open('states_names.csv', 'r') as reader:
    states = [states.strip().replace(' ', '-') for states in reader]

url = 'https://www.hauntedplaces.org/state/alabama'

for state in states:
    page = requests.get(url+state)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.findAll('div', class_='description')
    # When I try to add .get('href') I get a traceback error. Am I trying to scrape the href too early?
    h_page = soup.findAll('h3')

<h3>Gaines Ridge Dinner Club</h3>
<h3>Purifoy-Lipscomb House</h3>
<h3>Kate Shepard House Bed and Breakfast</h3>
<h3>Cedarhurst Mansion</h3>
<h3>Crybaby Bridge</h3>
<h3>Gaineswood Plantation</h3>
<h3>Mountain View Hospital</h3>
This works perfectly:
from bs4 import BeautifulSoup
import requests

url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

for link in soup.select('div.description a'):
    print(link['href'])
Try that:
soup = BeautifulSoup(page.content, 'html.parser')
list0 = []
possible_links = soup.find_all('a')

for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])
        list0.append(link.attrs['href'])

print(list0)
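To tie this back to the original per-state loop, the two ideas can be combined; a sketch under the assumption that each state page keeps its links inside div.description blocks, as the first answer's selector suggests (the base URL and CSV handling are taken from the question, with the hard-coded 'alabama' removed):

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.hauntedplaces.org/state/'

with open('states_names.csv', 'r') as reader:
    states = [state.strip().replace(' ', '-') for state in reader]

for state in states:
    page = requests.get(base_url + state)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Collect the href of every link inside a description block for this state
    for link in soup.select('div.description a'):
        print(link['href'])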