Scrape url list from Reelgood.com - python

Hi Im trying to build a scraper (in Python) for the website ReelGood.com.
now I got this topic to and I figured out how to scrape the url from the movie page. but what I can't seem t figure out why this script won't work:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests
URL = "https://reelgood.com/movies/source/netflix"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
f = open("C:/Downloaders/test/Scrape/test1.txt", "w")
for link in soup.select('a[href*="https://reelgood.com/movie/"]'):
data = link.get('href')
f.write(data)
f.write("\n")
f.close()
EDIT
The end goal is to export a txt file containing all links to the movie page on reelgood.com.
UPDATE: 1
if I change the script to this:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests
URL = "https://reelgood.com/movies/source/netflix"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
f = open("C:/Downloaders/test/Scrape/test1.txt", "w")
for link in soup.find_all("a", href=True):
data = link.get('href')
f.write(data)
f.write("\n")
f.close()
i do get the following output in test1.txt
/ /tv /tv/curated/trending-picks /tv/browse/new-tv-on-your-sources /tv/roulette/netflix /tv?filter-availability=onSources /tv/source/free /tv/source/netflix /tv/source/amazon /tv/source/hulu /tv/curated/2019-emmy-nominees /tv/genre/action-and-adventure /tv/genre/animation /tv/genre/anime /tv/genre/comedy /tv/genre/crime /tv/genre/documentary /tv/genre/drama /tv/genre/food /tv/genre/family /tv/genre/fantasy /tv/curated/imdbs-best-rated-tv /tv/genre/lgbtq /tv/genre/mystery /tv/genre/reality /tv/genre/science-fiction /tv /movies /movies/browse/popular-movies /movies/browse/recent-movies-on-your-sources /movies/roulette/netflix /movies?filter-availability=onSources /movies/source/free /movies/source/netflix /movies/source/amazon /movies/source/hulu /movies/genre/action-and-adventure /movies/genre/animation /movies/genre/comedy /movies/genre/crime /movies/genre/documentary /movies/genre/drama /movies/genre/family /movies/genre/fantasy /movies/genre/horror /movies/curated/imdbs-best-rated-movies /movies/genre/lgbtq /movies/genre/mystery /movies/genre/romance /movies/genre/science-fiction /movies/genre/thriller /movies /new /new?availability=onSources /new/netflix /new/amazon /new/hulu /new/hbo /new/disney_plus /coming?availability=onSources /coming/netflix /coming/amazon /coming/hulu /coming/hbo /coming/disney_plus /leaving?availability=onSources /leaving/netflix /leaving/amazon /leaving/hulu /leaving/hbo /new /login /login /signup /movies/source/netflix /movies/genre/action-and-adventure/on-netflix /movies/genre/animation/on-netflix /movies/genre/anime/on-netflix /movies/genre/biography/on-netflix /movies/genre/children/on-netflix /movies/genre/comedy/on-netflix /movies/genre/crime/on-netflix /movies/genre/cult/on-netflix /movies/genre/documentary/on-netflix /movies/genre/drama/on-netflix /movies/genre/family/on-netflix /movies/genre/fantasy/on-netflix /movies/genre/food/on-netflix /movies/genre/history/on-netflix /movies/genre/horror/on-netflix /movies/genre/lgbtq/on-netflix /movies/genre/musical/on-netflix /movies/genre/mystery/on-netflix /movies/genre/romance/on-netflix /movies/genre/science-fiction/on-netflix /movies/genre/sport/on-netflix /movies/genre/stand-up-and-talk/on-netflix /movies/genre/thriller/on-netflix /movies/list/africa/on-netflix /movies/list/adaptation/on-netflix /movies/list/alien/on-netflix /movies/list/animal/on-netflix /movies/list/apocalypse/on-netflix /movies/list/baseball/on-netflix /movies/list/based-on-novel/on-netflix /movies/list/true-story/on-netflix /movies/list/bollywood/on-netflix /movies/list/boxing/on-netflix /movies/list/british-humour/on-netflix /movies/list/car/on-netflix /movies/list/cartoon/on-netflix /movies/list/christmas/on-netflix /movies/list/classic/on-netflix /movies/list/college/on-netflix /movies/list/based-on-comic/on-netflix /movies/list/coming-of-age/on-netflix /movies/list/dance/on-netflix /movies/list/dark-comedy/on-netflix /movies/list/dating/on-netflix /movies/list/disaster/on-netflix /movies/list/disney/on-netflix /movies/list/doctor/on-netflix /movies/list/dog/on-netflix /movies/list/drug/on-netflix /movies/list/dystopia/on-netflix /movies/list/egypt/on-netflix /movies/list/escape/on-netflix /movies/list/fashion/on-netflix /movies/list/feel-good/on-netflix /movies/list/woman-director/on-netflix /movies/list/fighting/on-netflix /movies/list/friendship/on-netflix /movies/list/futuristic/on-netflix /movies/list/gang/on-netflix /movies/list/gangster/on-netflix /movies/list/genius/on-netflix /movies/list/ghost/on-netflix /movies/list/golf/on-netflix /movies/list/gymnast/on-netflix /movies/list/heroism/on-netflix /movies/list/high-school/on-netflix /movies/list/holiday/on-netflix /movies/list/hunting/on-netflix /movies/list/imagination/on-netflix /movies/list/jungle/on-netflix /movies/list/kidnapping/on-netflix /movies/list/magic/on-netflix /movies/list/martial-arts/on-netflix /movies/list/mature/on-netflix /movies/list/medieval/on-netflix /movies/list/military/on-netflix /movies/list/monster/on-netflix /movies/list/music/on-netflix /movies/list/new-york/on-netflix /movies/list/paris/on-netflix /movies/list/parody/on-netflix /movies/list/pet/on-netflix /movies/list/police/on-netflix /movies/list/political/on-netflix /movies/list/princess/on-netflix /movies/list/prison/on-netflix /movies/list/psychology/on-netflix /movies/list/racing/on-netflix /movies/list/religion/on-netflix /movies/list/revenge/on-netflix /movies/list/river/on-netflix /movies/list/robot/on-netflix /movies/list/rome/on-netflix /movies/list/royalty/on-netflix /movies/list/science/on-netflix /movies/list/serial-killer/on-netflix /movies/list/short/on-netflix /movies/list/singing/on-netflix /movies/list/space/on-netflix /movies/list/sports/on-netflix /movies/list/spy/on-netflix /movies/list/superhero/on-netflix /movies/list/supernatural/on-netflix /movies/list/survival/on-netflix /movies/list/suspense/on-netflix /movies/list/tank/on-netflix /movies/list/teacher/on-netflix /movies/list/technology/on-netflix /movies/list/teen/on-netflix /movies/list/time-travel/on-netflix /movies/list/toy/on-netflix /movies/list/twins/on-netflix /movies/list/vampire/on-netflix /movies/list/games/on-netflix /movies/list/war/on-netflix /movies/list/world-war-ii/on-netflix /movies/list/wrestling/on-netflix /movies/list/zombie/on-netflix /source/netflix /movies/source/netflix /tv/source/netflix /movies /movies/source/free /movies /movies/source/netflix /movies/source/amazon /movies/source/hbo_max /movies/source/hbo /movies/source/showtime /movies/source/hulu /movies/source/fx_tveverywhere /movies/source/starz /movies/source/apple_tv_plus /movies/source/plex_free /movies/source/disney_plus /movies/source/peacock /movies/source/philo /movies/source/fubo_tv /movies/source/epix /movies/source/crunchyroll_premium /movies/source/dc_universe /movies/source/mubi /movies/source/discovery_plus /movies/source/amc_premiere /movies/source/amc /movies/source/britbox /movies/source/ifc /movies/source/youtube_premium /movies/source/shudder /movies/source/criterion_channel /movies/source/funimation /movies/source/fandor /movies/source/hoopla /movies/source/kanopy /movies/source/tubi_tv /movies/source/plutotv /movies/source/peacock_free /movies/source/vudu_free /movies/source/imdb_tv /movies/source/popcornflix /movies/source/crunchyroll_free /movies/source/crackle /movies/source/acorntv /movies/source/cinemax /movies/source/hallmark_movies_now /movies/source/sundance_tveverywhere /movies/source/syfy_tveverywhere /movies/source/tbs /movies/source/tnt /movies/source/bet_plus /movies/source/watch_tcm /movies/source/comedycentral_tveverywhere /movies/source/hallmark_everywhere /movies/source/lifetime_tveverywhere /movies/source/disneynow /movies/source/paramount_plus /movies/source/tvision /movie/the-intouchables-2011 /movie/the-intouchables-2011 /movie/the-irishman-2018 /movie/the-irishman-2018 /movie/marriage-story-2019 /movie/marriage-story-2019 /movie/dangal-2016 /movie/dangal-2016 /movie/the-invisible-guest-2017 /movie/the-invisible-guest-2017 /movie/scott-pilgrim-vs-the-world-2010 /movie/scott-pilgrim-vs-the-world-2010 /movie/david-attenborough-a-life-on-our-planet-2020 /movie/david-attenborough-a-life-on-our-planet-2020 /movie/a-silent-voice-2016 /movie/a-silent-voice-2016 /movie/13th-2016 /movie/13th-2016 /movie/the-dark-knight-2008 /movie/the-dark-knight-2008 /movie/icarus-2017 /movie/icarus-2017 /movie/roma-2018 /movie/roma-2018 /movie/black-mirror-bandersnatch-2018 /movie/black-mirror-bandersnatch-2018 /movie/inception-2010 /movie/inception-2010 /movie/the-two-popes-2019 /movie/the-two-popes-2019 /movie/the-trial-of-the-chicago-7-2020 /movie/the-trial-of-the-chicago-7-2020 /movie/to-all-the-boys-ive-loved-before-2018 /movie/to-all-the-boys-ive-loved-before-2018 /movie/pk-2014 /movie/pk-2014 /movie/the-social-dilemma-2020 /movie/the-social-dilemma-2020 /movie/fruitvale-station-2013 /movie/fruitvale-station-2013 /movie/the-ballad-of-buster-scruggs-2018 /movie/the-ballad-of-buster-scruggs-2018 /movie/the-dawn-wall-2018 /movie/the-dawn-wall-2018 /movie/dolemite-is-my-name-2019 /movie/dolemite-is-my-name-2019 /movie/okja-2017 /movie/okja-2017 /movie/article-15-2019 /movie/article-15-2019 /movie/jim-andy-the-great-beyond-featuring-a-very-special-contractually-obligated-mention-of-tony-clifton-2017 /movie/jim-andy-the-great-beyond-featuring-a-very-special-contractually-obligated-mention-of-tony-clifton-2017 /movie/mudbound-2017 /movie/mudbound-2017 /movie/the-croods-2013 /movie/the-croods-2013 /movie/enola-holmes-2020 /movie/enola-holmes-2020 /movie/the-end-of-evangelion-1997 /movie/the-end-of-evangelion-1997 /movie/miss-americana-2020 /movie/miss-americana-2020 /movie/fyre-the-greatest-party-that-never-happened-2019 /movie/fyre-the-greatest-party-that-never-happened-2019 /movie/the-platform-2019 /movie/the-platform-2019 /movie/swades-we-the-people-2004 /movie/swades-we-the-people-2004 /movie/the-king-2019 /movie/the-king-2019 /movie/pieces-of-a-woman-2020 /movie/pieces-of-a-woman-2020 /movie/the-departed-2006 /movie/the-departed-2006 /movie/django-unchained-2012 /movie/django-unchained-2012 /movie/i-am-mother-2019 /movie/i-am-mother-2019 /movie/disclosure-trans-lives-on-screen-2020 /movie/disclosure-trans-lives-on-screen-2020 /movie/bo-burnham-make-happy-2016 /movie/bo-burnham-make-happy-2016 /movie/haider-2014 /movie/haider-2014 /movie/virunga-2014 /movie/virunga-2014 /movie/john-mulaney-kid-gorgeous-at-radio-city-2018 /movie/john-mulaney-kid-gorgeous-at-radio-city-2018 /movie/the-half-of-it-2020 /movie/the-half-of-it-2020 /movie/the-square-2013 /movie/the-square-2013 /movie/shutter-island-2010 /movie/shutter-island-2010 /movie/snowden-2016 /movie/snowden-2016 /movie/i-lost-my-body-2019 /movie/i-lost-my-body-2019 /movie/the-boy-who-harnessed-the-wind-2019 /movie/the-boy-who-harnessed-the-wind-2019 /movies/source/netflix?offset=50 https://itunes.apple.com/us/app/reelgood-tv-guide-for-streaming/id1031391869 https://play.google.com/store/apps/details?id=com.reelgoodapp.reelgood&referrer=utm_source%3DReelgoodWebApp%26utm_medium%3DGetAppPopUp%26utm_term%3DGetAppPopUp%26utm_content%3DGetAppPopUp%26utm_campaign%3DGetAppPopUp%26anid%3Dadmob /roulette/netflix /swipe /browse/popular-movies /curated/popular-picks /source/free /all /source/netflix /new/netflix /source/hulu /new/hulu /source/hbo /new/hbo /source/amazon /new/amazon /sitemap /about /business/products/catalog/ /tos /privacy-policy https://blog.reelgood.com /careers /faq / https://itunes.apple.com/us/app/reelgood-tv-guide-for-streaming/id1031391869 https://play.google.com/store/apps/details?id=com.reelgoodapp.reelgood&referrer=utm_source%3DReelgoodWepApp%26utm_content%3DFooter%26utm_campaign%3DGetAndroidApp%26anid%3Dadmob
So this outputs all the links found. but misses the base url.
UPDATE: 2
movie url's are as follow: <a href="/movie/the-movie-name-2021">

I would use a combination of attribute = value selectors to target the elements which have the full url in the content attribute
from bs4 import BeautifulSoup
import requests
URL = "https://reelgood.com/movies/source/netflix"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for link in soup.select('[itemprop=itemListElement] [itemprop=url]'):
print(link['content'])

Related

How to get just links of articles in list using BeautifulSoup

Hey guess so I got as far as being able to add the a class to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a class. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
link = article.find('a', href=True)
links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print but I just want the href from every a class not the entire a class. Thank you
To get all links from the https://news.ycombinator.com, you can use CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'): # <-- find all <a> with class="storylink"
links_with_text.append(a['href']) # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html

BeautifulSoup and scraping href's isn't working

Again I am having trouble scraping href's in BeautifulSoup. I have a list of pages that I am scraping and I have the data but I can't seem to get the hrefs even when I use various codes that work in other scripts.
So here is the code and my data will be below that:
import requests
from bs4 import BeautifulSoup
with open('states_names.csv', 'r') as reader:
states = [states.strip().replace(' ', '-') for states in reader]
url = 'https://www.hauntedplaces.org/state/alabama'
for state in states:
page = requests.get(url+state)
soup = BeautifulSoup(page.text, 'html.parser')
links = soup.findAll('div', class_='description')
# When I try to add .get('href') I get a traceback error. Am I trying to scrape the href too early?
h_page = soup.findAll('h3')
<h3>Gaines Ridge Dinner Club</h3>
<h3>Purifoy-Lipscomb House</h3>
<h3>Kate Shepard House Bed and Breakfast</h3>
<h3>Cedarhurst Mansion</h3>
<h3>Crybaby Bridge</h3>
<h3>Gaineswood Plantation</h3>
<h3>Mountain View Hospital</h3>
This works perfectly:
from bs4 import BeautifulSoup
import requests
url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for link in soup.select('div.description a'):
print(link['href'])
Try that:
soup = BeautifulSoup(page.content, 'html.parser')
list0 = []
possible_links = soup.find_all('a')
for link in possible_links:
if link.has_attr('href'):
print (link.attrs['href'])
list0.append(link.attrs['href'])
print(list0)

In Python3, how can I use the .append function to add a string to scraped links?

Thanks to stackoverflow.com I was able write a program that scrapes web links from any given web page. However, I need it to concatenate the home URL to any relative link that it comes across. (Example: "http://www.google.com/sitemap" is okay. But just "/sitemap" by itself is not okay.)
In the following code,
from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest
base_url = "https://www.census.gov/programs-surveys/popest.html"
html_page = myRequest(base_url)
raw_html = html_page.read()
page_soup = mySoup(raw_html, "html.parser")
html_page.close()
f = open("census4-3.csv", "w")
all_links = page_soup.find_all('a', href=True)
def clean_links(tags, base_url):
cleaned_links = set()
for tag in tags:
link = tag.get('href')
if link is None:
continue
full_url = myJoin(base_url, link)
cleaned_links.add(full_url)
return cleaned_links
cleaned_links = clean_links(all_links, base_url)
for link in cleaned_links:
f.write(str(link) + '\n')
f.close()
print("The CSV file is saved to your computer.")
how and where would I add something like this:
.append("http://www.google.com")
You should save your base url as base_url = 'https://www.census.gov'.
Call the requests like this
html_page = myRequest(base_url + '/programs-surveys/popest.html')
When you want to get any full_url, just do this
full_url = base_url + link

Scraping with beautifulsoup trying to get all the href attributes

Im trying to scrape all the urls from amazon categories website (https://www.amazon.com/gp/site-directory/ref=nav_shopall_btn)
but I can just get the first url of any category. For example, from "Amazon video" I am getting "All videos", "Fire TV" amazon fire tv, etc.
That is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.es/gp/site-directory/ref=nav_shopall_btn"
amazon_link = requests.get(url)
html = BeautifulSoup(amazon_link.text,"html.parser")
categorias_amazon = html.find_all('div',{'class':'popover-grouping'})
for i in range(len(categorias_amazon)):
print("www.amazon.es" + categorias_amazon[i].a['href'])
I have tried with:
print("www.amazon.es" + categorias_amazon[i].find_all['a'])
but I get an error. I am looking to get href attribute of every sub category.
You can try this code:
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.es/gp/site-directory/ref=nav_shopall_btn"
amazon_link = requests.get(url)
html = BeautifulSoup(amazon_link.text,"html.parser")
# print html
categorias_amazon = html.find_all('div',{'class':'popover-grouping'})
allurls=html.select("div.popover-grouping [href]")
values=[link['href'].strip() for link in allurls]
for value in values:
print("www.amazon.es" + value)
It will print:
www.amazon.es/b?ie=UTF8&node=1748200031
www.amazon.es/gp/dmusic/mp3/player
www.amazon.es/b?ie=UTF8&node=2133385031
www.amazon.es/clouddrive/primephotos
www.amazon.es/clouddrive/home
www.amazon.es/clouddrive/home#download-section
www.amazon.es/clouddrive?_encoding=UTF8&sf=1
www.amazon.es/dp/B0186FET66
www.amazon.es/dp/B00QJDO0QC
www.amazon.es/dp/B00IOY524S
www.amazon.es/dp/B010EK1GOE
www.amazon.es/b?ie=UTF8&node=827234031
www.amazon.es/ebooks-kindle/b?ie=UTF8&node=827231031
www.amazon.es/gp/kindle/ku/sign-up/
www.amazon.es/b?ie=UTF8&node=8504981031
www.amazon.es/gp/digital/fiona/kcp-landing-page
www.amazon.eshttps://www.amazon.es:443/gp/redirect.html?location=https://leer.amazon.es/&token=CA091C61DBBA8A5C0F6E4A46ED30C059164DBC74&source=standards
www.amazon.es/gp/digital/fiona/manage
www.amazon.es/dp/B00ZDWLEEG
www.amazon.es/dp/B00IRKMZX0
www.amazon.es/dp/B01AHBC23E
www.amazon.es/b?ie=UTF8&node=827234031
www.amazon.es/mobile-apps/b?ie=UTF8&node=1661649031
www.amazon.es/b?ie=UTF8&node=1726755031
www.amazon.es/b?ie=UTF8&node=1748200031
www.amazon.es/ebooks-kindle/b?ie=UTF8&node=827231031
www.amazon.es/gp/digital/fiona/manage
www.amazon.es/b?ie=UTF8&node=10909716031
www.amazon.es/b?ie=UTF8&node=10909718031
www.amazon.es/b?ie=UTF8&node=10909719031
www.amazon.es/b?ie=UTF8&node=10909720031
www.amazon.es/b?ie=UTF8&node=10909721031
www.amazon.es/b?ie=UTF8&node=10909722031
www.amazon.es/b?ie=UTF8&node=8464150031
www.amazon.es/mobile-apps/b?ie=UTF8&node=1661649031
www.amazon.es/b?ie=UTF8&node=1726755031
www.amazon.es/b?ie=UTF8&node=4622953031
www.amazon.es/gp/feature.html?ie=UTF8&docId=1000658923
www.amazon.es/gp/mas/your-account/myapps
www.amazon.es/comprar-libros-espa%C3%B1ol/b?ie=UTF8&node=599364031
www.amazon.es/ebooks-kindle/b?ie=UTF8&node=827231031
www.amazon.es/gp/kindle/ku/sign-up/
www.amazon.es/Libros-en-ingl%C3%A9s/b?ie=UTF8&node=665418031
www.amazon.es/Libros-en-otros-idiomas/b?ie=UTF8&node=599367031
www.amazon.es/b?ie=UTF8&node=902621031
www.amazon.es/libros-texto/b?ie=UTF8&node=902673031
www.amazon.es/Blu-ray-DVD-peliculas-series-3D/b?ie=UTF8&node=599379031
www.amazon.es/series-tv-television-DVD-Blu-ray/b?ie=UTF8&node=665293031
www.amazon.es/Blu-ray-peliculas-series-3D/b?ie=UTF8&node=665303031
www.amazon.es/M%C3%BAsica/b?ie=UTF8&node=599373031
www.amazon.es/b?ie=UTF8&node=1748200031
www.amazon.es/musical-instruments/b?ie=UTF8&node=3628866031
www.amazon.es/fotografia-videocamaras/b?ie=UTF8&node=664660031
www.amazon.es/b?ie=UTF8&node=931491031
www.amazon.es/tv-video-home-cinema/b?ie=UTF8&node=664659031
www.amazon.es/b?ie=UTF8&node=664684031
www.amazon.es/gps-accesorios/b?ie=UTF8&node=664661031
www.amazon.es/musical-instruments/b?ie=UTF8&node=3628866031
www.amazon.es/accesorios/b?ie=UTF8&node=928455031
www.amazon.es/Inform%C3%A1tica/b?ie=UTF8&node=667049031
www.amazon.es/Electr%C3%B3nica/b?ie=UTF8&node=599370031
www.amazon.es/portatiles/b?ie=UTF8&node=938008031
www.amazon.es/tablets/b?ie=UTF8&node=938010031
www.amazon.es/ordenadores-sobremesa/b?ie=UTF8&node=937994031
www.amazon.es/componentes/b?ie=UTF8&node=937912031
www.amazon.es/b?ie=UTF8&node=2457643031
www.amazon.es/b?ie=UTF8&node=2457641031
www.amazon.es/Software/b?ie=UTF8&node=599376031
www.amazon.es/pc-videojuegos-accesorios-mac/b?ie=UTF8&node=665498031
www.amazon.es/Inform%C3%A1tica/b?ie=UTF8&node=667049031
www.amazon.es/material-oficina/b?ie=UTF8&node=4352791031
www.amazon.es/productos-papel-oficina/b?ie=UTF8&node=4352794031
www.amazon.es/boligrafos-lapices-utiles-escritura/b?ie=UTF8&node=4352788031
www.amazon.es/electronica-oficina/b?ie=UTF8&node=4352790031
www.amazon.es/oficina-papeleria/b?ie=UTF8&node=3628728031
www.amazon.es/videojuegos-accesorios-consolas/b?ie=UTF8&node=599382031
www.amazon.es/b?ie=UTF8&node=665290031
www.amazon.es/pc-videojuegos-accesorios-mac/b?ie=UTF8&node=665498031
www.amazon.es/b?ie=UTF8&node=8490963031
www.amazon.es/b?ie=UTF8&node=1381541031
www.amazon.es/Juguetes-y-juegos/b?ie=UTF8&node=599385031
www.amazon.es/bebe/b?ie=UTF8&node=1703495031
www.amazon.es/baby-reg/homepage
www.amazon.es/gp/family/signup
www.amazon.es/b?ie=UTF8&node=2181872031
www.amazon.es/b?ie=UTF8&node=3365351031
www.amazon.es/bano/b?ie=UTF8&node=3244779031
www.amazon.es/b?ie=UTF8&node=1354952031
www.amazon.es/iluminacion/b?ie=UTF8&node=3564289031
www.amazon.es/pequeno-electrodomestico/b?ie=UTF8&node=2165363031
www.amazon.es/aspiracion-limpieza-planchado/b?ie=UTF8&node=2165650031
www.amazon.es/almacenamiento-organizacion/b?ie=UTF8&node=3359926031
www.amazon.es/climatizacion-calefaccion/b?ie=UTF8&node=3605952031
www.amazon.es/Hogar/b?ie=UTF8&node=599391031
www.amazon.es/herramientas-electricas-mano/b?ie=UTF8&node=3049288031
www.amazon.es/Cortacespedes-Tractores-Jardineria/b?ie=UTF8&node=3249445031
www.amazon.es/instalacion-electrica/b?ie=UTF8&node=3049284031
www.amazon.es/accesorios-cocina-bano/b?ie=UTF8&node=3049286031
www.amazon.es/seguridad/b?ie=UTF8&node=3049292031
www.amazon.es/Bricolaje-Herramientas-Fontaneria-Ferreteria-Jardineria/b?ie=UTF8&node=2454133031
www.amazon.es/Categorias/b?ie=UTF8&node=6198073031
www.amazon.es/b?ie=UTF8&node=6348071031
www.amazon.es/Categorias/b?ie=UTF8&node=6198055031
www.amazon.es/b?ie=UTF8&node=12300685031
www.amazon.es/Salud-y-cuidado-personal/b?ie=UTF8&node=3677430031
www.amazon.es/Suscribete-Ahorra/b?ie=UTF8&node=9699700031
www.amazon.es/Amazon-Pantry/b?ie=UTF8&node=10547412031
www.amazon.es/moda-mujer/b?ie=UTF8&node=5517558031
www.amazon.es/moda-hombre/b?ie=UTF8&node=5517557031
www.amazon.es/moda-infantil/b?ie=UTF8&node=5518995031
www.amazon.es/bolsos-mujer/b?ie=UTF8&node=2007973031
www.amazon.es/joyeria/b?ie=UTF8&node=2454126031
www.amazon.es/relojes/b?ie=UTF8&node=599388031
www.amazon.es/equipaje/b?ie=UTF8&node=2454129031
www.amazon.es/gp/feature.html?ie=UTF8&docId=12464607031
www.amazon.es/b?ie=UTF8&node=8520792031
www.amazon.es/running/b?ie=UTF8&node=2928523031
www.amazon.es/fitness-ejercicio/b?ie=UTF8&node=2928495031
www.amazon.es/ciclismo/b?ie=UTF8&node=2928487031
www.amazon.es/tenis-padel/b?ie=UTF8&node=2985165031
www.amazon.es/golf/b?ie=UTF8&node=2928503031
www.amazon.es/deportes-equipo/b?ie=UTF8&node=2975183031
www.amazon.es/deportes-acuaticos/b?ie=UTF8&node=2928491031
www.amazon.es/deportes-invierno/b?ie=UTF8&node=2928493031
www.amazon.es/Tiendas-campa%C3%B1a-Sacos-dormir-Camping/b?ie=UTF8&node=2928471031
www.amazon.es/deportes-aire-libre/b?ie=UTF8&node=2454136031
www.amazon.es/ropa-calzado-deportivo/b?ie=UTF8&node=2975170031
www.amazon.es/calzado-deportivo/b?ie=UTF8&node=2928484031
www.amazon.es/electronica-dispositivos-el-deporte/b?ie=UTF8&node=2928496031
www.amazon.es/Coche-y-moto/b?ie=UTF8&node=1951051031
www.amazon.es/b?ie=UTF8&node=2566955031
www.amazon.es/gps-accesorios/b?ie=UTF8&node=664661031
www.amazon.es/Motos-accesorios-piezas/b?ie=UTF8&node=2425161031
www.amazon.es/industrial-cientfica/b?ie=UTF8&node=5866088031
www.amazon.es/b?ie=UTF8&node=6684191031
www.amazon.es/b?ie=UTF8&node=6684193031
www.amazon.es/b?ie=UTF8&node=6684192031
www.amazon.es/handmade/b?ie=UTF8&node=9699482031
www.amazon.es/b?ie=UTF8&node=10740508031
www.amazon.es/b?ie=UTF8&node=10740511031
www.amazon.es/b?ie=UTF8&node=10740559031
www.amazon.es/b?ie=UTF8&node=10740502031
www.amazon.es/b?ie=UTF8&node=10740505031
Hope this is what you were looking for.
Do you want to scrapp it or scrape it? If it's the latter, that about this?
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("https://www.amazon.es/gp/site-directory/ref=nav_shopall_btn")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
print link.get('href')

Could not get link from html content using python

Here is the URL that I'am using:
http://www.protect-stream.com/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i
In fact on this page, the link that I am looking for appears may be 5 second after loading the page.
I see after 5 second a post request to :
http://www.protect-stream.com/secur.php
with data like so :
k=2AE_a,LHmb6kSC_c,sZNk4eNixIiPo_c,_c,Gw4ERVdriKuHJlciB1uuy_c,Sr7mOTQVUhVEcMlZeINICKegtzYsseabOlrDb_a,LmiP80NGUvAbK1xhbZGC6OWMtIaNF12f0mYA4O0WxBkmAtz75kpYcrHzxtYt32hCYSp0WjqOQR9bY_a,ofQtw_b,
I didn't get from where the 'k' value come from ?
Is their an idea on how we could get the 'k' value using python ?
This is not going to be trivial. The k parameter value is "hidden" deep inside a script element inside nested iframes. Here is a requests + BeautifulSoup way to get to the k value:
import re
from urlparse import urljoin
# Python 3: from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
base_url = "http://www.protect-stream.com"
with requests.Session() as session:
response = session.get("http://www.protect-stream.com/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i")
# get the top frame url
soup = BeautifulSoup(response.content, "html.parser")
src = soup.select_one('iframe[src^="frame.php"]')["src"]
frame_url = urljoin(base_url, src)
# get the nested frame url
response = session.get(frame_url)
soup = BeautifulSoup(response.content, "html.parser")
src = soup.select_one('iframe[src^="w.php"]')["src"]
frame_url = urljoin(base_url, src)
# get the frame HTML source and extract the "k" value
response = session.get(frame_url)
soup = BeautifulSoup(response.content, "html.parser")
script = soup.find("script", text=lambda text: text and "k=" in text).get_text(strip=True)
k_value = re.search(r'var k="(.*?)";', script).group(1)
print(k_value)
Prints:
YjfH9430zztSYgf7ItQJ4grv2cvH3mT7xGwv32rTy2HiB1uuy_c,Sr7mOTQVUhVEcMlZeINICKegtzYsseabOlrDb_a,LmiP80NGUvAbK1xhbZGC6OWMtIaNF12f0mYA4O0WXhmwUC0ipkPRkLQepYHLyF1U0xvsrzHMcK2XBCeY3_a,O_b,

Categories

Resources