I'm trying to scrape rotten tomatoes with bs4
My aim is to find all a hrefs from the table but i cannot do it can you help me?
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
my code is
from urllib import request
from bs4 import BeautifulSoup as BS
import re
import pandas as pd
url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class':'articleLink unstyled'})[7:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
########################################### links ############################################################################
webpages = []
for link in reversed(links):
print(link)
html = request.urlopen(link)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class':'unstyled articleLink'})[43:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
webpages.extend(links)
print(webpages)
I put a limit to 43 in order to avoid useless links except for movies but it is a short term solution and does not help
I need to find an exact solution on how to scrape from table without scrape irrelevant information
thanks
Just grab the main table and then extract all the <a> tags.
For example:
import requests
from bs4 import BeautifulSoup
rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
f"https://www.rottentomatoes.com{link.get('href')}"
for link in
BeautifulSoup(
requests.get(rotten_tomatoes_url).text,
"lxml",
)
.find("table", class_="table")
.find_all("a")
]
print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))
Output (all 100 links to movies):
100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi
try this:
tags = bs.find_all(name='a', {'class':'unstyled articleLink'})[43:]
My soup
import requests
from bs4 import BeautifulSoup
page = requests.get('https://example.com')
soup = BeautifulSoup(page.text, 'html.parser')
property_list = soup.find(class_='listing-list ListingsListstyle__ListingsListContainer-cNRhPr hqtMPr')
property_link_list = property_list.find_all('a',{ "class" : "depth-listing-card-link" },string="View details")
print(property_link_list)
I just got an empty array. What I need is to retrieve all the hrefs that contain View details text.
This is an example of the input
<a class="depth-listing-card-link" href="https://example.com">View details<i class="rui-icon rui-icon-arrow-right-small Icon-cFRQJw cqwgEb"></i></a>
I am using Python 3.7.
Try changing the last 2 lines of your code to:
property_link_list = property_list.find_all('a',{ "class" : "depth-listing-card-link" })
for pty in property_link_list:
if pty.text=="View details":
print(pty['href'])
My output is:
/property/bandar-sungai-long/sale-7700845/
/property/bandar-sungai-long/sale-7700845/
/property/bandar-sungai-long/sale-4577620/
/property/bandar-sungai-long/sale-4577620/
/property/port-dickson/sale-8387235/
/property/port-dickson/sale-8387235/
etc.
I am attempting to scrape the following web page
https://www.betexplorer.com/tennis/wta-singles/dubai/siniakova-katerina-kvitova-petra/6ZCipZ9h/#ha
I am fine with scraping player names, the date, the score, however, I am running into trouble when trying to scrape the match odds of the different bookmakers (listed in the table)
Here is what I attempted
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.betexplorer.com/tennis/wta-singles/dubai/siniakova-katerina-kvitova-petra/6ZCipZ9h/')
soup = BeautifulSoup(r.text,'html.parser')
Odds = soup.find_all('td', attrs= {'class':'table-main__detail-odds table-main__detail-odds--first'})
print(odds)
[]
As you can see, nothing is being found.
Any ideas on this?
Thanks
The class you seem to be looking for is table-main__odds as per the page source.
For example:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.betexplorer.com/tennis/wta-singles/dubai/siniakova-katerina-kvitova-petra/6ZCipZ9h/')
soup = BeautifulSoup(r.text, 'html.parser')
odds = [x.attrs for x in soup.find_all('td', attrs={'class': 'table-main__odds'})]
print(odds)
Output:
[{u'class': [u'table-main__odds'],
u'data-odd': u'3.46',
u'data-odd-max': u'3.90'},
{u'class': [u'table-main__odds', u'colored']},
{u'class': [u'table-main__odds'],
u'data-odd': u'3.58',
u'data-odd-max': u'3.92'},
{u'class': [u'table-main__odds', u'colored']}]
Again I am having trouble scraping href's in BeautifulSoup. I have a list of pages that I am scraping and I have the data but I can't seem to get the hrefs even when I use various codes that work in other scripts.
So here is the code and my data will be below that:
import requests
from bs4 import BeautifulSoup
with open('states_names.csv', 'r') as reader:
states = [states.strip().replace(' ', '-') for states in reader]
url = 'https://www.hauntedplaces.org/state/alabama'
for state in states:
page = requests.get(url+state)
soup = BeautifulSoup(page.text, 'html.parser')
links = soup.findAll('div', class_='description')
# When I try to add .get('href') I get a traceback error. Am I trying to scrape the href too early?
h_page = soup.findAll('h3')
<h3>Gaines Ridge Dinner Club</h3>
<h3>Purifoy-Lipscomb House</h3>
<h3>Kate Shepard House Bed and Breakfast</h3>
<h3>Cedarhurst Mansion</h3>
<h3>Crybaby Bridge</h3>
<h3>Gaineswood Plantation</h3>
<h3>Mountain View Hospital</h3>
This works perfectly:
from bs4 import BeautifulSoup
import requests
url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for link in soup.select('div.description a'):
print(link['href'])
Try that:
soup = BeautifulSoup(page.content, 'html.parser')
list0 = []
possible_links = soup.find_all('a')
for link in possible_links:
if link.has_attr('href'):
print (link.attrs['href'])
list0.append(link.attrs['href'])
print(list0)
I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several trials, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never show up.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("a", {"class": "j_activity_item_link"})
for item in g_data:
Deeplink = item.find_all("a")
for t in Deeplink:
print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me put? Any feedback is appreciated.
Your "error" of error code 0 simply indicates that everything went ok with your run. According to your example, your list g_data should contain all of the a tags that you are interested in. You should not need the second for loop to again iterate through and find nested a tags. As a debugging step, print the length of your lists to ensure that they are not empty. See the following:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("a", {"class": "j_activity_item_link"})
for item in g_data:
print(item.get("href"))
You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
from bs4 import BeautifulSoup as soup
data = soup(str(urllib.urlopen('https://www.klook.com/city/30-kyoto/?p=1').read()), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class':'p_num '})]
activities = {1:[i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]}
for page in page_numbers:
data = soup(str(urllib.urlopen('https://www.klook.com/city/30-kyoto/?p={}'.format(page)).read()), 'lxml')
activities[int(page)] = [i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', '/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}