Scraping Metacritic with urllib to follow redirect - python

I'm working on a Python script to scrape information from Metacritic. It works fine for most movies, but it fails on movies that Metacritic redirects.
For example, in its list of movies Metacritic provides the URL "/movie/red-riding-in-the-year-of-our-lord-1983", but following that URL takes you to "/movie/red-riding-trilogy". I need urllib to fetch the HTML of the final URL it ends up at.

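urllib.request.urlopen follows HTTP redirects by default, so you can read both the final URL and its HTML from the response:
import urllib.request
response = urllib.request.urlopen("your url")
final_url = response.geturl()  # URL reached after any redirects
html = response.read()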

I ended up using the requests module (http://docs.python-requests.org/en/latest/). Here is the code for the request and the line that saves the final URL.
import requests
response = requests.get(url)
newUrl = response.url
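If you would rather stay with the standard library, as the question asks, urllib can also report each hop it was redirected through. The following is a minimal sketch, not a definitive implementation: the names RecordingRedirectHandler and fetch_following_redirects are made up for illustration, and it assumes the site uses ordinary HTTP redirects (301, 302 and so on).

```python
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember every URL we are sent to."""

    def __init__(self):
        self.hops = []  # list of (status_code, new_url) tuples

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the hop, then let the stock handler build the next request.
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def fetch_following_redirects(url, timeout=10):
    """Return (final_url, html_bytes, hops) for the given URL."""
    handler = RecordingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.geturl(), response.read(), handler.hops
```

Calling fetch_following_redirects with the full Metacritic URL should then give you the final address, the page bytes, and the list of redirects that were followed along the way.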

Related

How do I fix getting "None" as a response when web scraping?

I am trying to write a small script that gets the view count of a YouTube video and prints it. However, when I print the text variable with this code, I just get "None". Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through JavaScript. requests doesn't execute JavaScript. If you need to do this, you'll need to use something like Selenium to run a real browser with a JavaScript interpreter built in.
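One quick way to confirm this from Python is to list the class names that actually occur in the HTML you received. This sketch uses the standard library's html.parser instead of BeautifulSoup; ClassCollector, classes_in and the sample markup are hypothetical names and data used only for illustration.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def classes_in(html_text):
    """Return the set of class names found in html_text."""
    collector = ClassCollector()
    collector.feed(html_text)
    collector.close()
    return collector.classes


# Hypothetical stand-in for static HTML that leaves rendering to a script
static_html = '<div id="player"></div><script>render()</script>'
print("view-count" in classes_in(static_html))
```

In practice you would pass requests.get(url).text to classes_in. If the class you are searching for is missing from the result, find() has nothing to match and returns None.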

Follow HTML or PHP redirects with Python?

I want to find the final URL a website ends up at, using Python. I mostly use requests and urllib2, but anything is welcome.
The website I'm trying doesn't return a 302 response. It redirects directly using HTML, or maybe PHP.
I used the requests module for this, but it doesn't seem to count HTML or PHP redirects as redirects.
My current code:
import requests
def get_real(domain):
    # .url is the address reached after following any HTTP redirects
    red_domain = requests.get(domain, allow_redirects=True).url
    return red_domain
print(get_real("some_url"))
If there is a way to achieve this, how? Thanks in advance!
Posts I checked:
Python follow redirects and then download the page?
Python Requests library redirect new url
Tracking redirection of the request using request history | Packtpub
EDIT: URL I'm trying: http://001.az. It's using HTML to redirect.
HTML Code inside it:
<HTML> <HEAD><META HTTP-EQUIV=Refresh CONTENT="0; url=http://fm.vc"></HEAD> </HTML>
BeautifulSoup can help detect HTML meta redirects:
from bs4 import BeautifulSoup
# use requests to fetch the HTML text
...
# lowercase so "Refresh" and "URL=" match; note this lowercases the target URL too
soup = BeautifulSoup(html_text.lower(), "html5lib")
try:
    content = soup.head.find('meta', {'http-equiv': 'refresh'}).attrs['content']
    ix = content.index('url=')
    url = content[ix+4:]
    # ok, we have to redirect to url
except (AttributeError, KeyError, ValueError):
    # no <head>, no meta refresh tag, no content attribute, or no "url=" part
    url = None
# if url is not None, loop going there...
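The same idea can be sketched with only the standard library, which also avoids lowercasing the target URL. This is an illustrative sketch rather than a complete implementation: MetaRefreshParser and meta_refresh_target are made-up names, the regular expression is a simplification of how browsers parse the content attribute, and JavaScript redirects are not handled.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; url=http://example.com" (quotes optional)
        match = re.search(r"url\s*=\s*['\"]?([^'\"\s]+)",
                          attr.get("content", ""), re.IGNORECASE)
        if match:
            self.target = match.group(1)


def meta_refresh_target(html_text, base_url):
    """Return the absolute meta-refresh URL in html_text, or None."""
    parser = MetaRefreshParser()
    parser.feed(html_text)
    parser.close()
    if parser.target is None:
        return None
    # Resolve relative targets against the page they came from.
    return urljoin(base_url, parser.target)


sample = '<HTML> <HEAD><META HTTP-EQUIV=Refresh CONTENT="0; url=http://fm.vc"></HEAD> </HTML>'
print(meta_refresh_target(sample, "http://001.az"))
```

To follow a chain of such redirects, you could fetch the page, call meta_refresh_target on its text, and repeat with the returned URL until it returns None, with a cap on the number of hops so a redirect loop cannot run forever.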

BeautifulSoup doesn't return the expected tag as Chrome

I'm trying to parse a page to learn BeautifulSoup. Here is the code:
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I looked for the tag in Chrome and it's there, but the browser probably makes another request to get the results.
Can someone explain what I am missing here?
[Screenshot: the website's source code]
Problems with the code
Your code is looking for a "results"-element. What you really have to look for (based on your screenshot) is a div-element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price you have to dig deeper for the element which contains the price:
price = soup.find("span", attrs={"data-field":"price"}).text
Problems with the site
The site appears to load its data via Ajax. With Requests you get the page as it is before the Ajax call runs, so the results are not in it.
In this case you should switch from Requests to the Selenium module. It drives a real browser, so you can wait until the data has finished loading before you start scraping.
Documentation: Selenium

Extract the source code of this page using Python (https://mobile.twitter.com/i/bookmarks)

How can I extract the source code of this page using Python (https://mobile.twitter.com/i/bookmarks)?
The problem is that the actual page code does not appear.
import mechanicalsoup as ms
Browser = ms.StatefulBrowser()
Browser.open("https://mobile.twitter.com/login")
Browser.select_form('form[action="/sessions"]')
Browser["session[username_or_email]"] = 'email'
Browser["session[password]"] = 'password'
Browser.submit_selected()
Browser.open("https://mobile.twitter.com/i/bookmarks")
html = Browser.get_current_page()
print(html)
Use BeautifulSoup.
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid a warning
print(soup.prettify())
From this answer:
https://stackoverflow.com/a/43290890/11034096
Edit:
It looks like the issue is that Twitter is trying to use a JS redirect to load the next page. JS isn't supported by mechanicalsoup, so you'll need to try something like selenium.
The html variable that you are returning is actually a BeautifulSoup object and not the HTML text. I would try using:
print(html.prettify())
to see if that will print the HTML directly.
Alternatively, from the BeautifulSoup documentation you should be able to use the non-pretty printing of:
str(html)
or, for a single tag such as the first link:
str(html.a)

Unable to navigate Amazon pagination with Python and BS4

I've been trying to create a simple web scraper to collect the book titles from a top-100 bestseller list on Amazon. I've used this code before on another site with no problems. But for some reason, it scrapes the first page fine and then prints the same results for the following iterations.
I'm not sure if it's something to do with how Amazon creates its urls or not. When I manually enter the "#2" (and beyond) at the end of the url in the browser it navigates fine.
(Once the scrape is working I plan on dumping the data in csv files. But for now, print to the terminal will do.)
import requests
from bs4 import BeautifulSoup
for i in range(5):
    url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        name = book.find('a', class_='a-link-child')
        price = book.find('span', class_='p13n-sc-price')
        print(title)
        print(name)
        print(price)
        print("END")
This is a common problem: some sites load their data asynchronously (with Ajax). Those are XMLHttpRequest calls that you can see in the Network tab of your browser's developer tools. Usually the site loads the data from a different endpoint, often with a POST request, which you can reproduce with urllib or the requests library.
In this case the request is a GET, and you can scrape it from this URL without extending your code: https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter.
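It also helps to know why the original loop kept returning the first page: the "#2" at the end of a URL is a fragment, which the client handles locally and never sends to the server. The sketch below shows that, and builds the per-page URLs by varying only pg. BASE is copied from the URL in this answer; page_url and the page range are assumptions for illustration, and whether Amazon still serves this endpoint is not something the sketch checks.

```python
from urllib.parse import urldefrag, urlencode

# Path taken from the answer's URL; only the query string varies per page.
BASE = ("https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction"
        "/zgbs/digital-text/6361470011/ref=zg_bs_pg_3")


def page_url(page):
    """Build the Ajax listing URL for the given page number."""
    query = urlencode({"_encoding": "UTF8", "pg": page, "ajax": 1})
    return f"{BASE}?{query}"


# The fragment is stripped before a request goes out, so "#1" and "#2"
# both ask the server for exactly the same resource.
print(urldefrag("https://example.com/list#2").url)

# Hypothetical page range; adjust to however many pages the list has.
for page in range(1, 6):
    print(page_url(page))
```

Each URL printed by the loop can then be passed to requests.get in place of the "#{}" URL from the question.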
