Scraping with beautifulsoup trying to get all the href attributes

Scraping with beautifulsoup trying to get all the href attributes - python

Im trying to scrape all the urls from amazon categories website (https://www.amazon.com/gp/site-directory/ref=nav_shopall_btn)
but I can just get the first url of any category. For example, from "Amazon video" I am getting "All videos", "Fire TV" amazon fire tv, etc.
That is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.es/gp/site-directory/ref=nav_shopall_btn"
amazon_link = requests.get(url)
html = BeautifulSoup(amazon_link.text,"html.parser")
categorias_amazon = html.find_all('div',{'class':'popover-grouping'})
for i in range(len(categorias_amazon)):
print("www.amazon.es" + categorias_amazon[i].a['href'])
I have tried with:
print("www.amazon.es" + categorias_amazon[i].find_all['a'])
but I get an error. I am looking to get href attribute of every sub category.

You can try this code:
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.es/gp/site-directory/ref=nav_shopall_btn"
amazon_link = requests.get(url)
html = BeautifulSoup(amazon_link.text,"html.parser")
# print html
categorias_amazon = html.find_all('div',{'class':'popover-grouping'})
allurls=html.select("div.popover-grouping [href]")
values=[link['href'].strip() for link in allurls]
for value in values:
print("www.amazon.es" + value)
It will print:
www.amazon.es/b?ie=UTF8&node=1748200031
www.amazon.es/gp/dmusic/mp3/player
www.amazon.es/b?ie=UTF8&node=2133385031
www.amazon.es/clouddrive/primephotos
www.amazon.es/clouddrive/home
www.amazon.es/clouddrive/home#download-section
www.amazon.es/clouddrive?_encoding=UTF8&sf=1
www.amazon.es/dp/B0186FET66
www.amazon.es/dp/B00QJDO0QC
www.amazon.es/dp/B00IOY524S
www.amazon.es/dp/B010EK1GOE
www.amazon.es/b?ie=UTF8&node=827234031
www.amazon.es/ebooks-kindle/b?ie=UTF8&node=827231031
www.amazon.es/gp/kindle/ku/sign-up/
www.amazon.es/b?ie=UTF8&node=8504981031
www.amazon.es/gp/digital/fiona/kcp-landing-page
www.amazon.eshttps://www.amazon.es:443/gp/redirect.html?location=https://leer.amazon.es/&token=CA091C61DBBA8A5C0F6E4A46ED30C059164DBC74&source=standards
www.amazon.es/gp/digital/fiona/manage
www.amazon.es/dp/B00ZDWLEEG
www.amazon.es/dp/B00IRKMZX0
www.amazon.es/dp/B01AHBC23E
www.amazon.es/b?ie=UTF8&node=827234031
www.amazon.es/mobile-apps/b?ie=UTF8&node=1661649031
www.amazon.es/b?ie=UTF8&node=1726755031
www.amazon.es/b?ie=UTF8&node=1748200031
www.amazon.es/ebooks-kindle/b?ie=UTF8&node=827231031
www.amazon.es/gp/digital/fiona/manage
www.amazon.es/b?ie=UTF8&node=10909716031
www.amazon.es/b?ie=UTF8&node=10909718031
www.amazon.es/b?ie=UTF8&node=10909719031
www.amazon.es/b?ie=UTF8&node=10909720031
www.amazon.es/b?ie=UTF8&node=10909721031
www.amazon.es/b?ie=UTF8&node=10909722031
www.amazon.es/b?ie=UTF8&node=8464150031
www.amazon.es/mobile-apps/b?ie=UTF8&node=1661649031
www.amazon.es/b?ie=UTF8&node=1726755031
www.amazon.es/b?ie=UTF8&node=4622953031
www.amazon.es/gp/feature.html?ie=UTF8&docId=1000658923
www.amazon.es/gp/mas/your-account/myapps
www.amazon.es/comprar-libros-espa%C3%B1ol/b?ie=UTF8&node=599364031
www.amazon.es/ebooks-kindle/b?ie=UTF8&node=827231031
www.amazon.es/gp/kindle/ku/sign-up/
www.amazon.es/Libros-en-ingl%C3%A9s/b?ie=UTF8&node=665418031
www.amazon.es/Libros-en-otros-idiomas/b?ie=UTF8&node=599367031
www.amazon.es/b?ie=UTF8&node=902621031
www.amazon.es/libros-texto/b?ie=UTF8&node=902673031
www.amazon.es/Blu-ray-DVD-peliculas-series-3D/b?ie=UTF8&node=599379031
www.amazon.es/series-tv-television-DVD-Blu-ray/b?ie=UTF8&node=665293031
www.amazon.es/Blu-ray-peliculas-series-3D/b?ie=UTF8&node=665303031
www.amazon.es/M%C3%BAsica/b?ie=UTF8&node=599373031
www.amazon.es/b?ie=UTF8&node=1748200031
www.amazon.es/musical-instruments/b?ie=UTF8&node=3628866031
www.amazon.es/fotografia-videocamaras/b?ie=UTF8&node=664660031
www.amazon.es/b?ie=UTF8&node=931491031
www.amazon.es/tv-video-home-cinema/b?ie=UTF8&node=664659031
www.amazon.es/b?ie=UTF8&node=664684031
www.amazon.es/gps-accesorios/b?ie=UTF8&node=664661031
www.amazon.es/musical-instruments/b?ie=UTF8&node=3628866031
www.amazon.es/accesorios/b?ie=UTF8&node=928455031
www.amazon.es/Inform%C3%A1tica/b?ie=UTF8&node=667049031
www.amazon.es/Electr%C3%B3nica/b?ie=UTF8&node=599370031
www.amazon.es/portatiles/b?ie=UTF8&node=938008031
www.amazon.es/tablets/b?ie=UTF8&node=938010031
www.amazon.es/ordenadores-sobremesa/b?ie=UTF8&node=937994031
www.amazon.es/componentes/b?ie=UTF8&node=937912031
www.amazon.es/b?ie=UTF8&node=2457643031
www.amazon.es/b?ie=UTF8&node=2457641031
www.amazon.es/Software/b?ie=UTF8&node=599376031
www.amazon.es/pc-videojuegos-accesorios-mac/b?ie=UTF8&node=665498031
www.amazon.es/Inform%C3%A1tica/b?ie=UTF8&node=667049031
www.amazon.es/material-oficina/b?ie=UTF8&node=4352791031
www.amazon.es/productos-papel-oficina/b?ie=UTF8&node=4352794031
www.amazon.es/boligrafos-lapices-utiles-escritura/b?ie=UTF8&node=4352788031
www.amazon.es/electronica-oficina/b?ie=UTF8&node=4352790031
www.amazon.es/oficina-papeleria/b?ie=UTF8&node=3628728031
www.amazon.es/videojuegos-accesorios-consolas/b?ie=UTF8&node=599382031
www.amazon.es/b?ie=UTF8&node=665290031
www.amazon.es/pc-videojuegos-accesorios-mac/b?ie=UTF8&node=665498031
www.amazon.es/b?ie=UTF8&node=8490963031
www.amazon.es/b?ie=UTF8&node=1381541031
www.amazon.es/Juguetes-y-juegos/b?ie=UTF8&node=599385031
www.amazon.es/bebe/b?ie=UTF8&node=1703495031
www.amazon.es/baby-reg/homepage
www.amazon.es/gp/family/signup
www.amazon.es/b?ie=UTF8&node=2181872031
www.amazon.es/b?ie=UTF8&node=3365351031
www.amazon.es/bano/b?ie=UTF8&node=3244779031
www.amazon.es/b?ie=UTF8&node=1354952031
www.amazon.es/iluminacion/b?ie=UTF8&node=3564289031
www.amazon.es/pequeno-electrodomestico/b?ie=UTF8&node=2165363031
www.amazon.es/aspiracion-limpieza-planchado/b?ie=UTF8&node=2165650031
www.amazon.es/almacenamiento-organizacion/b?ie=UTF8&node=3359926031
www.amazon.es/climatizacion-calefaccion/b?ie=UTF8&node=3605952031
www.amazon.es/Hogar/b?ie=UTF8&node=599391031
www.amazon.es/herramientas-electricas-mano/b?ie=UTF8&node=3049288031
www.amazon.es/Cortacespedes-Tractores-Jardineria/b?ie=UTF8&node=3249445031
www.amazon.es/instalacion-electrica/b?ie=UTF8&node=3049284031
www.amazon.es/accesorios-cocina-bano/b?ie=UTF8&node=3049286031
www.amazon.es/seguridad/b?ie=UTF8&node=3049292031
www.amazon.es/Bricolaje-Herramientas-Fontaneria-Ferreteria-Jardineria/b?ie=UTF8&node=2454133031
www.amazon.es/Categorias/b?ie=UTF8&node=6198073031
www.amazon.es/b?ie=UTF8&node=6348071031
www.amazon.es/Categorias/b?ie=UTF8&node=6198055031
www.amazon.es/b?ie=UTF8&node=12300685031
www.amazon.es/Salud-y-cuidado-personal/b?ie=UTF8&node=3677430031
www.amazon.es/Suscribete-Ahorra/b?ie=UTF8&node=9699700031
www.amazon.es/Amazon-Pantry/b?ie=UTF8&node=10547412031
www.amazon.es/moda-mujer/b?ie=UTF8&node=5517558031
www.amazon.es/moda-hombre/b?ie=UTF8&node=5517557031
www.amazon.es/moda-infantil/b?ie=UTF8&node=5518995031
www.amazon.es/bolsos-mujer/b?ie=UTF8&node=2007973031
www.amazon.es/joyeria/b?ie=UTF8&node=2454126031
www.amazon.es/relojes/b?ie=UTF8&node=599388031
www.amazon.es/equipaje/b?ie=UTF8&node=2454129031
www.amazon.es/gp/feature.html?ie=UTF8&docId=12464607031
www.amazon.es/b?ie=UTF8&node=8520792031
www.amazon.es/running/b?ie=UTF8&node=2928523031
www.amazon.es/fitness-ejercicio/b?ie=UTF8&node=2928495031
www.amazon.es/ciclismo/b?ie=UTF8&node=2928487031
www.amazon.es/tenis-padel/b?ie=UTF8&node=2985165031
www.amazon.es/golf/b?ie=UTF8&node=2928503031
www.amazon.es/deportes-equipo/b?ie=UTF8&node=2975183031
www.amazon.es/deportes-acuaticos/b?ie=UTF8&node=2928491031
www.amazon.es/deportes-invierno/b?ie=UTF8&node=2928493031
www.amazon.es/Tiendas-campa%C3%B1a-Sacos-dormir-Camping/b?ie=UTF8&node=2928471031
www.amazon.es/deportes-aire-libre/b?ie=UTF8&node=2454136031
www.amazon.es/ropa-calzado-deportivo/b?ie=UTF8&node=2975170031
www.amazon.es/calzado-deportivo/b?ie=UTF8&node=2928484031
www.amazon.es/electronica-dispositivos-el-deporte/b?ie=UTF8&node=2928496031
www.amazon.es/Coche-y-moto/b?ie=UTF8&node=1951051031
www.amazon.es/b?ie=UTF8&node=2566955031
www.amazon.es/gps-accesorios/b?ie=UTF8&node=664661031
www.amazon.es/Motos-accesorios-piezas/b?ie=UTF8&node=2425161031
www.amazon.es/industrial-cientfica/b?ie=UTF8&node=5866088031
www.amazon.es/b?ie=UTF8&node=6684191031
www.amazon.es/b?ie=UTF8&node=6684193031
www.amazon.es/b?ie=UTF8&node=6684192031
www.amazon.es/handmade/b?ie=UTF8&node=9699482031
www.amazon.es/b?ie=UTF8&node=10740508031
www.amazon.es/b?ie=UTF8&node=10740511031
www.amazon.es/b?ie=UTF8&node=10740559031
www.amazon.es/b?ie=UTF8&node=10740502031
www.amazon.es/b?ie=UTF8&node=10740505031
Hope this is what you were looking for.

Do you want to scrapp it or scrape it? If it's the latter, that about this?
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("https://www.amazon.es/gp/site-directory/ref=nav_shopall_btn")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
print link.get('href')

Related

Having trouble extracting the URL from a website

So i want to extract url for all the buttons on the sidebar, but I can't seem to get past the first one, and I dont know why or how to fix it. Unfortunately, this is for an assignment so I cant import anything else.
This is the code I tried
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://books.toscrape.com/"
genres = ["Travel", "Mystery", "Historical Fiction", "Sequential Art", "Classics", "Philosophy"]
# write your code below
response=requests.get(url, timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')
sidebar=soup.find_all('div',{'class':'side_categories'})
for a in sidebar:
genre_url=a.find('a').get('href')
print(genre_url)
I got
catalogue/category/books_1/index.html
I was expecting
catalogue/category/books_1/index.html
catalogue/category/books/travel_2/index.html
catalogue/category/books/mystery_3/index.html
catalogue/category/books/historical-fiction_4/index.html
catalogue/category/books/sequential-art_5/index.html
catalogue/category/books/classics_6/index.html
...

I used the following CSS selector to find all the tags from the sidebar: .side_categories>ul>li>ul>li>a
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://books.toscrape.com/"
genres = ["Travel", "Mystery", "Historical Fiction", "Sequential Art", "Classics", "Philosophy"]
# write your code below
response=requests.get(url, timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')
genre_url_elems = soup.select(".side_categories>ul>li>ul>li>a")
genre_urls = [e['href'] for e in genre_url_elems]
for url in genre_urls:
print(url)
Here's the output:
catalogue/category/books/travel_2/index.html
catalogue/category/books/mystery_3/index.html
catalogue/category/books/historical-fiction_4/index.html
catalogue/category/books/sequential-art_5/index.html
catalogue/category/books/classics_6/index.html
catalogue/category/books/philosophy_7/index.html
catalogue/category/books/romance_8/index.html
catalogue/category/books/womens-fiction_9/index.html
catalogue/category/books/fiction_10/index.html
catalogue/category/books/childrens_11/index.html
catalogue/category/books/religion_12/index.html
catalogue/category/books/nonfiction_13/index.html
catalogue/category/books/music_14/index.html
catalogue/category/books/default_15/index.html
catalogue/category/books/science-fiction_16/index.html
catalogue/category/books/sports-and-games_17/index.html
catalogue/category/books/add-a-comment_18/index.html
catalogue/category/books/fantasy_19/index.html
catalogue/category/books/new-adult_20/index.html
catalogue/category/books/young-adult_21/index.html
catalogue/category/books/science_22/index.html
catalogue/category/books/poetry_23/index.html
catalogue/category/books/paranormal_24/index.html
catalogue/category/books/art_25/index.html
catalogue/category/books/psychology_26/index.html
catalogue/category/books/autobiography_27/index.html
catalogue/category/books/parenting_28/index.html
catalogue/category/books/adult-fiction_29/index.html
catalogue/category/books/humor_30/index.html
catalogue/category/books/horror_31/index.html
catalogue/category/books/history_32/index.html
catalogue/category/books/food-and-drink_33/index.html
catalogue/category/books/christian-fiction_34/index.html
catalogue/category/books/business_35/index.html
catalogue/category/books/biography_36/index.html
catalogue/category/books/thriller_37/index.html
catalogue/category/books/contemporary_38/index.html
catalogue/category/books/spirituality_39/index.html
catalogue/category/books/academic_40/index.html
catalogue/category/books/self-help_41/index.html
catalogue/category/books/historical_42/index.html
catalogue/category/books/christian_43/index.html
catalogue/category/books/suspense_44/index.html
catalogue/category/books/short-stories_45/index.html
catalogue/category/books/novels_46/index.html
catalogue/category/books/health_47/index.html
catalogue/category/books/politics_48/index.html
catalogue/category/books/cultural_49/index.html
catalogue/category/books/erotica_50/index.html
catalogue/category/books/crime_51/index.html
For more, read about 'CSS selectors': https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

Here you go:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://books.toscrape.com/"
genres = ["Travel", "Mystery", "Historical Fiction", "Sequential Art", "Classics", "Philosophy"]
# write your code below
response=requests.get(url, timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')
# sidebar=soup.find_all('div',{'class':'side_categories'})
sidebar=soup.find_all('a',href=True)
for link in sidebar:
url = link['href']
if 'catalogue' in url:
print(url)

Webscrape Beautifulsoup on website (get multiple hrefs)

I want to web scrape this webpage (carbuzz.com). I want to get the links (href) of all the car brands from "Acura" to "Volvo" (link to picture).
Currently, I only get the first entry (Acura). How do I get the remaining ones? As I just started scraping and coding would highly appreciate your input!
Code:
from bs4 import BeautifulSoup
import requests
import time
#Inputs/URLs to scrape:
URL2 = ('https://carbuzz.com/cars')
(response := requests.get(URL2)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
car_brand = overview.find(class_='bg-make-preview')['href']
car_brand_url ='https://carbuzz.com'+car_brand
print(car_brand_url)
Output:
[Finished in 1.2s]

You can use find_all to get the tag with class name bg-make-preview.
soup = BeautifulSoup(response.text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
car_brand_url ='https://carbuzz.com' + elem['href']
print(car_brand_url)
This gives us the expected output
https://carbuzz.com/cars/acura
https://carbuzz.com/cars/alfa-romeo
https://carbuzz.com/cars/aston-martin
https://carbuzz.com/cars/audi
https://carbuzz.com/cars/bentley
https://carbuzz.com/cars/bmw
https://carbuzz.com/cars/bollinger
https://carbuzz.com/cars/bugatti
https://carbuzz.com/cars/buick
https://carbuzz.com/cars/cadillac
https://carbuzz.com/cars/caterham
https://carbuzz.com/cars/chevrolet
https://carbuzz.com/cars/chrysler
https://carbuzz.com/cars/dodge
https://carbuzz.com/cars/ferrari
https://carbuzz.com/cars/fiat
https://carbuzz.com/cars/fisker
https://carbuzz.com/cars/ford
https://carbuzz.com/cars/genesis
https://carbuzz.com/cars/gmc
https://carbuzz.com/cars/hennessey
https://carbuzz.com/cars/honda
https://carbuzz.com/cars/hyundai
https://carbuzz.com/cars/infiniti
https://carbuzz.com/cars/jaguar
https://carbuzz.com/cars/jeep
https://carbuzz.com/cars/karma
https://carbuzz.com/cars/kia
https://carbuzz.com/cars/koenigsegg
https://carbuzz.com/cars/lamborghini
https://carbuzz.com/cars/land-rover
https://carbuzz.com/cars/lexus
https://carbuzz.com/cars/lincoln
https://carbuzz.com/cars/lordstown
https://carbuzz.com/cars/lotus
https://carbuzz.com/cars/lucid
https://carbuzz.com/cars/maserati
https://carbuzz.com/cars/mazda
https://carbuzz.com/cars/mclaren
https://carbuzz.com/cars/mercedes-benz
https://carbuzz.com/cars/mini
https://carbuzz.com/cars/mitsubishi
https://carbuzz.com/cars/nissan
https://carbuzz.com/cars/pagani
https://carbuzz.com/cars/polestar
https://carbuzz.com/cars/porsche
https://carbuzz.com/cars/ram
https://carbuzz.com/cars/rimac
https://carbuzz.com/cars/rivian
https://carbuzz.com/cars/rolls-royce
https://carbuzz.com/cars/spyker
https://carbuzz.com/cars/subaru
https://carbuzz.com/cars/tesla
https://carbuzz.com/cars/toyota
https://carbuzz.com/cars/volkswagen
https://carbuzz.com/cars/volvo
https://carbuzz.com/cars/hummer
https://carbuzz.com/cars/maybach
https://carbuzz.com/cars/mercury
https://carbuzz.com/cars/pontiac
https://carbuzz.com/cars/saab
https://carbuzz.com/cars/saturn
https://carbuzz.com/cars/scion
https://carbuzz.com/cars/smart
https://carbuzz.com/cars/suzuki

find all a href from table

I'm trying to scrape rotten tomatoes with bs4
My aim is to find all a hrefs from the table but i cannot do it can you help me?
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
my code is
from urllib import request
from bs4 import BeautifulSoup as BS
import re
import pandas as pd
url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class':'articleLink unstyled'})[7:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
########################################### links ############################################################################
webpages = []
for link in reversed(links):
print(link)
html = request.urlopen(link)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class':'unstyled articleLink'})[43:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
webpages.extend(links)
print(webpages)
I put a limit to 43 in order to avoid useless links except for movies but it is a short term solution and does not help
I need to find an exact solution on how to scrape from table without scrape irrelevant information
thanks

Just grab the main table and then extract all the <a> tags.
For example:
import requests
from bs4 import BeautifulSoup
rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
f"https://www.rottentomatoes.com{link.get('href')}"
for link in
BeautifulSoup(
requests.get(rotten_tomatoes_url).text,
"lxml",
)
.find("table", class_="table")
.find_all("a")
]
print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))
Output (all 100 links to movies):
100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi

try this:
tags = bs.find_all(name='a', {'class':'unstyled articleLink'})[43:]

Scrape url list from Reelgood.com

Hi Im trying to build a scraper (in Python) for the website ReelGood.com.
now I got this topic to and I figured out how to scrape the url from the movie page. but what I can't seem t figure out why this script won't work:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests
URL = "https://reelgood.com/movies/source/netflix"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
f = open("C:/Downloaders/test/Scrape/test1.txt", "w")
for link in soup.select('a[href*="https://reelgood.com/movie/"]'):
data = link.get('href')
f.write(data)
f.write("\n")
f.close()
EDIT
The end goal is to export a txt file containing all links to the movie page on reelgood.com.
UPDATE: 1
if I change the script to this:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests
URL = "https://reelgood.com/movies/source/netflix"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
f = open("C:/Downloaders/test/Scrape/test1.txt", "w")
for link in soup.find_all("a", href=True):
data = link.get('href')
f.write(data)
f.write("\n")
f.close()
i do get the following output in test1.txt
/ /tv /tv/curated/trending-picks /tv/browse/new-tv-on-your-sources /tv/roulette/netflix /tv?filter-availability=onSources /tv/source/free /tv/source/netflix /tv/source/amazon /tv/source/hulu /tv/curated/2019-emmy-nominees /tv/genre/action-and-adventure /tv/genre/animation /tv/genre/anime /tv/genre/comedy /tv/genre/crime /tv/genre/documentary /tv/genre/drama /tv/genre/food /tv/genre/family /tv/genre/fantasy /tv/curated/imdbs-best-rated-tv /tv/genre/lgbtq /tv/genre/mystery /tv/genre/reality /tv/genre/science-fiction /tv /movies /movies/browse/popular-movies /movies/browse/recent-movies-on-your-sources /movies/roulette/netflix /movies?filter-availability=onSources /movies/source/free /movies/source/netflix /movies/source/amazon /movies/source/hulu /movies/genre/action-and-adventure /movies/genre/animation /movies/genre/comedy /movies/genre/crime /movies/genre/documentary /movies/genre/drama /movies/genre/family /movies/genre/fantasy /movies/genre/horror /movies/curated/imdbs-best-rated-movies /movies/genre/lgbtq /movies/genre/mystery /movies/genre/romance /movies/genre/science-fiction /movies/genre/thriller /movies /new /new?availability=onSources /new/netflix /new/amazon /new/hulu /new/hbo /new/disney_plus /coming?availability=onSources /coming/netflix /coming/amazon /coming/hulu /coming/hbo /coming/disney_plus /leaving?availability=onSources /leaving/netflix /leaving/amazon /leaving/hulu /leaving/hbo /new /login /login /signup /movies/source/netflix /movies/genre/action-and-adventure/on-netflix /movies/genre/animation/on-netflix /movies/genre/anime/on-netflix /movies/genre/biography/on-netflix /movies/genre/children/on-netflix /movies/genre/comedy/on-netflix /movies/genre/crime/on-netflix /movies/genre/cult/on-netflix /movies/genre/documentary/on-netflix /movies/genre/drama/on-netflix /movies/genre/family/on-netflix /movies/genre/fantasy/on-netflix /movies/genre/food/on-netflix /movies/genre/history/on-netflix /movies/genre/horror/on-netflix /movies/genre/lgbtq/on-netflix /movies/genre/musical/on-netflix /movies/genre/mystery/on-netflix /movies/genre/romance/on-netflix /movies/genre/science-fiction/on-netflix /movies/genre/sport/on-netflix /movies/genre/stand-up-and-talk/on-netflix /movies/genre/thriller/on-netflix /movies/list/africa/on-netflix /movies/list/adaptation/on-netflix /movies/list/alien/on-netflix /movies/list/animal/on-netflix /movies/list/apocalypse/on-netflix /movies/list/baseball/on-netflix /movies/list/based-on-novel/on-netflix /movies/list/true-story/on-netflix /movies/list/bollywood/on-netflix /movies/list/boxing/on-netflix /movies/list/british-humour/on-netflix /movies/list/car/on-netflix /movies/list/cartoon/on-netflix /movies/list/christmas/on-netflix /movies/list/classic/on-netflix /movies/list/college/on-netflix /movies/list/based-on-comic/on-netflix /movies/list/coming-of-age/on-netflix /movies/list/dance/on-netflix /movies/list/dark-comedy/on-netflix /movies/list/dating/on-netflix /movies/list/disaster/on-netflix /movies/list/disney/on-netflix /movies/list/doctor/on-netflix /movies/list/dog/on-netflix /movies/list/drug/on-netflix /movies/list/dystopia/on-netflix /movies/list/egypt/on-netflix /movies/list/escape/on-netflix /movies/list/fashion/on-netflix /movies/list/feel-good/on-netflix /movies/list/woman-director/on-netflix /movies/list/fighting/on-netflix /movies/list/friendship/on-netflix /movies/list/futuristic/on-netflix /movies/list/gang/on-netflix /movies/list/gangster/on-netflix /movies/list/genius/on-netflix /movies/list/ghost/on-netflix /movies/list/golf/on-netflix /movies/list/gymnast/on-netflix /movies/list/heroism/on-netflix /movies/list/high-school/on-netflix /movies/list/holiday/on-netflix /movies/list/hunting/on-netflix /movies/list/imagination/on-netflix /movies/list/jungle/on-netflix /movies/list/kidnapping/on-netflix /movies/list/magic/on-netflix /movies/list/martial-arts/on-netflix /movies/list/mature/on-netflix /movies/list/medieval/on-netflix /movies/list/military/on-netflix /movies/list/monster/on-netflix /movies/list/music/on-netflix /movies/list/new-york/on-netflix /movies/list/paris/on-netflix /movies/list/parody/on-netflix /movies/list/pet/on-netflix /movies/list/police/on-netflix /movies/list/political/on-netflix /movies/list/princess/on-netflix /movies/list/prison/on-netflix /movies/list/psychology/on-netflix /movies/list/racing/on-netflix /movies/list/religion/on-netflix /movies/list/revenge/on-netflix /movies/list/river/on-netflix /movies/list/robot/on-netflix /movies/list/rome/on-netflix /movies/list/royalty/on-netflix /movies/list/science/on-netflix /movies/list/serial-killer/on-netflix /movies/list/short/on-netflix /movies/list/singing/on-netflix /movies/list/space/on-netflix /movies/list/sports/on-netflix /movies/list/spy/on-netflix /movies/list/superhero/on-netflix /movies/list/supernatural/on-netflix /movies/list/survival/on-netflix /movies/list/suspense/on-netflix /movies/list/tank/on-netflix /movies/list/teacher/on-netflix /movies/list/technology/on-netflix /movies/list/teen/on-netflix /movies/list/time-travel/on-netflix /movies/list/toy/on-netflix /movies/list/twins/on-netflix /movies/list/vampire/on-netflix /movies/list/games/on-netflix /movies/list/war/on-netflix /movies/list/world-war-ii/on-netflix /movies/list/wrestling/on-netflix /movies/list/zombie/on-netflix /source/netflix /movies/source/netflix /tv/source/netflix /movies /movies/source/free /movies /movies/source/netflix /movies/source/amazon /movies/source/hbo_max /movies/source/hbo /movies/source/showtime /movies/source/hulu /movies/source/fx_tveverywhere /movies/source/starz /movies/source/apple_tv_plus /movies/source/plex_free /movies/source/disney_plus /movies/source/peacock /movies/source/philo /movies/source/fubo_tv /movies/source/epix /movies/source/crunchyroll_premium /movies/source/dc_universe /movies/source/mubi /movies/source/discovery_plus /movies/source/amc_premiere /movies/source/amc /movies/source/britbox /movies/source/ifc /movies/source/youtube_premium /movies/source/shudder /movies/source/criterion_channel /movies/source/funimation /movies/source/fandor /movies/source/hoopla /movies/source/kanopy /movies/source/tubi_tv /movies/source/plutotv /movies/source/peacock_free /movies/source/vudu_free /movies/source/imdb_tv /movies/source/popcornflix /movies/source/crunchyroll_free /movies/source/crackle /movies/source/acorntv /movies/source/cinemax /movies/source/hallmark_movies_now /movies/source/sundance_tveverywhere /movies/source/syfy_tveverywhere /movies/source/tbs /movies/source/tnt /movies/source/bet_plus /movies/source/watch_tcm /movies/source/comedycentral_tveverywhere /movies/source/hallmark_everywhere /movies/source/lifetime_tveverywhere /movies/source/disneynow /movies/source/paramount_plus /movies/source/tvision /movie/the-intouchables-2011 /movie/the-intouchables-2011 /movie/the-irishman-2018 /movie/the-irishman-2018 /movie/marriage-story-2019 /movie/marriage-story-2019 /movie/dangal-2016 /movie/dangal-2016 /movie/the-invisible-guest-2017 /movie/the-invisible-guest-2017 /movie/scott-pilgrim-vs-the-world-2010 /movie/scott-pilgrim-vs-the-world-2010 /movie/david-attenborough-a-life-on-our-planet-2020 /movie/david-attenborough-a-life-on-our-planet-2020 /movie/a-silent-voice-2016 /movie/a-silent-voice-2016 /movie/13th-2016 /movie/13th-2016 /movie/the-dark-knight-2008 /movie/the-dark-knight-2008 /movie/icarus-2017 /movie/icarus-2017 /movie/roma-2018 /movie/roma-2018 /movie/black-mirror-bandersnatch-2018 /movie/black-mirror-bandersnatch-2018 /movie/inception-2010 /movie/inception-2010 /movie/the-two-popes-2019 /movie/the-two-popes-2019 /movie/the-trial-of-the-chicago-7-2020 /movie/the-trial-of-the-chicago-7-2020 /movie/to-all-the-boys-ive-loved-before-2018 /movie/to-all-the-boys-ive-loved-before-2018 /movie/pk-2014 /movie/pk-2014 /movie/the-social-dilemma-2020 /movie/the-social-dilemma-2020 /movie/fruitvale-station-2013 /movie/fruitvale-station-2013 /movie/the-ballad-of-buster-scruggs-2018 /movie/the-ballad-of-buster-scruggs-2018 /movie/the-dawn-wall-2018 /movie/the-dawn-wall-2018 /movie/dolemite-is-my-name-2019 /movie/dolemite-is-my-name-2019 /movie/okja-2017 /movie/okja-2017 /movie/article-15-2019 /movie/article-15-2019 /movie/jim-andy-the-great-beyond-featuring-a-very-special-contractually-obligated-mention-of-tony-clifton-2017 /movie/jim-andy-the-great-beyond-featuring-a-very-special-contractually-obligated-mention-of-tony-clifton-2017 /movie/mudbound-2017 /movie/mudbound-2017 /movie/the-croods-2013 /movie/the-croods-2013 /movie/enola-holmes-2020 /movie/enola-holmes-2020 /movie/the-end-of-evangelion-1997 /movie/the-end-of-evangelion-1997 /movie/miss-americana-2020 /movie/miss-americana-2020 /movie/fyre-the-greatest-party-that-never-happened-2019 /movie/fyre-the-greatest-party-that-never-happened-2019 /movie/the-platform-2019 /movie/the-platform-2019 /movie/swades-we-the-people-2004 /movie/swades-we-the-people-2004 /movie/the-king-2019 /movie/the-king-2019 /movie/pieces-of-a-woman-2020 /movie/pieces-of-a-woman-2020 /movie/the-departed-2006 /movie/the-departed-2006 /movie/django-unchained-2012 /movie/django-unchained-2012 /movie/i-am-mother-2019 /movie/i-am-mother-2019 /movie/disclosure-trans-lives-on-screen-2020 /movie/disclosure-trans-lives-on-screen-2020 /movie/bo-burnham-make-happy-2016 /movie/bo-burnham-make-happy-2016 /movie/haider-2014 /movie/haider-2014 /movie/virunga-2014 /movie/virunga-2014 /movie/john-mulaney-kid-gorgeous-at-radio-city-2018 /movie/john-mulaney-kid-gorgeous-at-radio-city-2018 /movie/the-half-of-it-2020 /movie/the-half-of-it-2020 /movie/the-square-2013 /movie/the-square-2013 /movie/shutter-island-2010 /movie/shutter-island-2010 /movie/snowden-2016 /movie/snowden-2016 /movie/i-lost-my-body-2019 /movie/i-lost-my-body-2019 /movie/the-boy-who-harnessed-the-wind-2019 /movie/the-boy-who-harnessed-the-wind-2019 /movies/source/netflix?offset=50 https://itunes.apple.com/us/app/reelgood-tv-guide-for-streaming/id1031391869 https://play.google.com/store/apps/details?id=com.reelgoodapp.reelgood&referrer=utm_source%3DReelgoodWebApp%26utm_medium%3DGetAppPopUp%26utm_term%3DGetAppPopUp%26utm_content%3DGetAppPopUp%26utm_campaign%3DGetAppPopUp%26anid%3Dadmob /roulette/netflix /swipe /browse/popular-movies /curated/popular-picks /source/free /all /source/netflix /new/netflix /source/hulu /new/hulu /source/hbo /new/hbo /source/amazon /new/amazon /sitemap /about /business/products/catalog/ /tos /privacy-policy https://blog.reelgood.com /careers /faq / https://itunes.apple.com/us/app/reelgood-tv-guide-for-streaming/id1031391869 https://play.google.com/store/apps/details?id=com.reelgoodapp.reelgood&referrer=utm_source%3DReelgoodWepApp%26utm_content%3DFooter%26utm_campaign%3DGetAndroidApp%26anid%3Dadmob
So this outputs all the links found. but misses the base url.
UPDATE: 2
movie url's are as follow: <a href="/movie/the-movie-name-2021">

I would use a combination of attribute = value selectors to target the elements which have the full url in the content attribute
from bs4 import BeautifulSoup
import requests
URL = "https://reelgood.com/movies/source/netflix"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for link in soup.select('[itemprop=itemListElement] [itemprop=url]'):
print(link['content'])

I want my code to not extract links with 0 seeders using python

i wrote my code but it extract all links no matter what value is the seeders count,
here is the code i wrote:
from bs4 import BeautifulSoup
import urllib.request
import re
class AppURLopener(urllib.request.FancyURLopener):
version = "Mozilla/5.0"
url = input('What site you working on today, sir?\n-> ')
opener = AppURLopener()
html_page = opener.open(url)
soup = BeautifulSoup(html_page, "lxml")
pd = str(soup.findAll('td', attrs={'align':re.compile('right')}))
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
if not('0' is pd[18]):
print (link.get('href'),'\n')
and this is the html am working on : https://imgur.com/a/32J9qF4
in this case it's 0 seeders but it still gives me the magnet link.. HELP

This code snippet will extract all magnet links from the page, where seeders != 0:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
soup = BeautifulSoup(requests.get('https://pirateproxy.mx/browse/201/1/3').text, 'lxml')
tds = soup.select('#searchResult td.vertTh ~ td')
links = [name.select_one('a[href^=magnet]')['href'] for name, seeders, leechers in zip(tds[0::3], tds[1::3], tds[2::3]) if seeders.text.strip() != '0']
pprint(links, width=120)
Prints:
['magnet:?xt=urn:btih:aa8a1f7847a49e640638c02ce851effff38d440f&dn=Affairs.of.State.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:819cb9b477462cd61ab6653ebc4a6f4e790589c3&dn=Bad.Samaritan.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:843d01992aa81d52be68190ee6a733ec9eee9b13&dn=The+Darkest+Minds+2018+HDCAM-1XBET&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:09a23daa69c42003d905ecf0a1cefdb0474e7d88&dn=Insidious+The+Last+Key+2018+BRRip+x264+AAC-SSN&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:98c42d5d620b4db834c5437a75f6da6f2d158207&dn=The+Darkest+Minds+2018+HDCAM-1XBET%5BTGx%5D&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:f30ebc409b215f2a5237433d7508c7ebfabb0e16&dn=Journeyman.2017.SWESUB.BRRiP.x264.mp4&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
...and so on.
EDIT:
The soup.select('#searchResult td.vertTh ~ td') will select all <td> siblings of tag <td> with class vertTh which is inside tag with id=searchResult. There are three siblings like this in each row.
The select_one('a[href^=magnet]') will then select all links that href begins with magnet.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping with beautifulsoup trying to get all the href attributes - python

Related

Having trouble extracting the URL from a website

Webscrape Beautifulsoup on website (get multiple hrefs)

find all a href from table

Scrape url list from Reelgood.com

I want my code to not extract links with 0 seeders using python

Categories

Resources