python web crawler cannot get full page - python

I am trying to run the following Python code:
import requests
headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
url="https://search.bilibili.com/all?keyword=Steins;Gate0"
try:
    r = requests.get(url=url, headers=headers)
    r.encoding = 'utf-8'
    if r.status_code == 200:
        print(r.text)
except:
    print("This is the selection of Steins Gate")
I am a beginner at web crawling. This is a crawler written with requests in Python, but I cannot get the full page. I think the problem is asynchronous page loading (though perhaps the website uses some other strategy).
So the question is: how do I get the full page?

What you're dealing with is a well-known problem. It is simple to describe but fiddly to solve: the content you want is not in the HTML the server returns, and only appears once a browser loads the page and runs its scripts.
Some recommendations:
Investigate headless browsers, such as headless Chrome, and their use cases
Investigate Selenium and how to use it with Python and headless browsers
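As a rough illustration of the second recommendation, here is a minimal sketch that loads a page in headless Chrome and returns the HTML after JavaScript has run. It assumes Selenium 4 and a local Chrome installation. The names `build_search_url` and `fetch_rendered_html`, the timeout, and the CSS selector you pass in are illustrative choices, not something the site prescribes.

```python
from urllib.parse import urlencode


def build_search_url(keyword):
    """Build the search URL, percent-encoding characters such as ';'."""
    return "https://search.bilibili.com/all?" + urlencode({"keyword": keyword})


def fetch_rendered_html(url, css_selector, timeout=15):
    """Load `url` in headless Chrome and return the HTML after scripts ran.

    Waits until at least one element matching `css_selector` is present,
    and raises a Selenium TimeoutException if none appears in time.
    """
    # Imported here so build_search_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even after a timeout
```

You would call it as `fetch_rendered_html(build_search_url("Steins;Gate0"), "body")`, replacing `"body"` with a selector for the element you actually need. Waiting for a specific element is usually more reliable than sleeping for a fixed time.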

Related

How to use 'requests'?

I'm a Korean who just started learning Python.
First, I apologize for my English.
I learned how to use BeautifulSoup on YouTube, and crawling was successful on certain sites.
However, crawling did not go well on some other sites, and after searching I found out that I had to set a User-Agent.
So I used 'requests' to write code that sets the User-Agent. I then used the same code as before to read a particular class from the HTML, but it did not work.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
url ='https://store.leagueoflegends.co.kr/skins'
response = requests.get(url, headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')
for skin in soup.select(".item-name"):
    print(skin)
Here's my code. I have no idea what the problem is.
Please help me.
Your problem is that requests does not render JavaScript; it only gives you the "initial" source code of the page. What you should use is a package called Selenium, which lets you control a real browser (Chrome, Firefox, etc.) from Python. Because the page is loaded by a real browser, the website is much less likely to treat you differently, and you usually won't need to mess with headers and user agents. There are plenty of videos on YouTube on how to use it.
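One way to confirm this diagnosis before switching tools is to check whether the class you are selecting appears at all in the raw HTML. The sketch below uses only the standard library. The two HTML strings are made-up stand-ins for "what requests receives" and "what the browser shows", not real output from the site, and `ClassCounter` and `count_class_in_html` are hypothetical helper names.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or empty, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class_in_html(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical sample documents, for illustration only.
static_html = '<div id="app"></div><script src="bundle.js"></script>'
rendered_html = (
    '<div id="app"><span class="item-name big">Skin A</span>'
    '<span class="item-name">Skin B</span></div>'
)

print(count_class_in_html(static_html, "item-name"))    # 0
print(count_class_in_html(rendered_html, "item-name"))  # 2
```

If `count_class_in_html(response.text, "item-name")` returns 0 for the real response, the elements are being added by JavaScript after the page loads, and no choice of headers will make them appear in `response.text`.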

Html in browser different than the one requested in Python

import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get("https://sky.lea.moe/stats/PapaGordsmack/", headers=headers)
html_contents = page.text
print(html_contents)
I am trying to scrape the sky.lea.moe website for a specific user, but when I request the HTML and print it, it is different from the one shown in the browser (in Chrome, using "View page source").
The one I get is: https://pastebin.com/91zRw3vP
Looking at it, it seems to be something about checking the browser and redirecting. Any ideas what I should do?
This is Cloudflare's anti-DoS protection, and it is effective at stopping scraping. A JS script will usually redirect you after a few seconds.
Something like Selenium is probably your best option for getting around it, though you might be able to scrape the JS file and extract the redirect URL. You could also try spoofing your referrer to be this page, so that the request goes to the correct one.
Browsers do indeed do more than just download a webpage: they also fetch additional resources, apply styles, and run scripts. For scraping, consider a framework like Scrapy, which handles requests, redirects, and data extraction for you. Note that Scrapy does not execute JavaScript by itself, so a JavaScript-based check like this one still needs a browser-driven tool.
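To notice in code that you received an interstitial "checking your browser" page instead of the real content, a rough heuristic like the one below may help. The marker phrase and the status codes are assumptions chosen for illustration, based on the description in the question; they are not an official or complete list, and `looks_like_challenge_page` is a hypothetical helper.

```python
# Assumed marker phrases; adjust them to whatever the returned page contains.
DEFAULT_MARKERS = ("checking your browser",)


def looks_like_challenge_page(status_code, html, markers=DEFAULT_MARKERS):
    """Guess whether a response is an anti-bot interstitial page.

    Returns True only when the status code is one often used for blocked
    requests (an assumption) and a marker phrase occurs in the HTML.
    """
    text = html.lower()
    has_marker = any(marker in text for marker in markers)
    return status_code in (403, 503) and has_marker


# Usage with made-up responses:
print(looks_like_challenge_page(503, "<title>Checking your browser...</title>"))
print(looks_like_challenge_page(200, "<h1>Player stats</h1>"))
```

With requests you would pass `page.status_code` and `page.text`. When the check returns True, retrying with different headers is unlikely to help, and a browser-driven tool such as Selenium is the more promising route.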

BeautifulSoup returning empty list

I’m trying to create a script where I can parse the source code from https://www.youtube.com/feed/subscriptions and retrieve the URLs of the videos in my subscription feed, in order to stick them in a MP4 download and save to my FTP server.
However I have been stuck on this problem for a couple of hours.
import bs4
import requests
source = requests.get('https://www.youtube.com/feed/subscriptions')
sourceSoup = bs4.BeautifulSoup(source.text,'html.parser')
sourceSoup.select('#grid-319397 > li:nth-child(1) > div > div.yt-lockup-dismissable > div.yt-lockup-content > h3')
[]
I am right-clicking the element, choosing ‘Inspect element’, then ‘Copy selector’, and pasting the result into the select method.
As you can see, it keeps returning an empty list.
I have tried many variations of this, but it’s not picking up anything. I have the same problem when doing the same thing on the homepage, so I doubt that it is because the page is behind a login (although I am logged in on the PC on which the script is running).
Can someone please point me in the right direction?
You are facing 2 different (but somewhat related) issues:
1. The page that the server returns for the GET request sent by your code might differ from the page you receive in your browser, because the server does not recognize your script's user-agent.
2. The item you're looking for is only visible after you log in.
Now, instead of handling both of these issues manually, you should consider using the YouTube API.
Here is demo code showing that we get a different page for different user-agents:
import requests
python_user_agent_request = requests.get('http://www.youtube.com')
# Build the header from adjacent string literals so that it contains no newline
chrome_user_agent_request = requests.get(
    'http://www.youtube.com',
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
print(python_user_agent_request.request.headers['user-agent'])
>> python-requests/2.7.0 CPython/3.4.2 Windows/7
print(chrome_user_agent_request.request.headers['user-agent'])
>> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
# .text holds the HTML page source
print(python_user_agent_request.text == chrome_user_agent_request.text)
>> False

POST request and headers in selenium

I'm trying to add functionality to a headless web browser. I know there are easier ways, but I stumbled across seleniumrequests and it sparked my interest. I was wondering if there is a way to add request headers as well as POST data as a payload. I've done some searching around and haven't had much luck. The following prints the HTML of the first website and takes a screenshot for verification, and then my program just hangs on the POST request. It doesn't terminate or raise an exception or anything. Where am I going wrong?
Thanks!
#!/usr/bin/env python
from seleniumrequests import PhantomJS
from selenium import webdriver
#Setting user-agent
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
browser = PhantomJS()
browser.get('http://www.google.com')
print browser.page_source
browser.save_screenshot('preSearch.png')
searchReq='https://www.google.com/complete/search?'
data={"q":"this is my search term"}
resp = browser.request('POST', str(searchReq), data=data)
print resp
browser.save_screenshot('postSearch.png')

urllib2.urlopen() does not return the same page as chrome

I am trying to make a small program that downloads subtitles for movie files.
I noticed, however, that following a link in Chrome and opening it with urllib2.urlopen() do not give the same results.
As an example let's consider the link http://www.opensubtitles.org/en/subtitleserve/sub/5523343 . In chrome this redirects to http://osdownloader.org/en/osdownloader.subtitles-for.you/subtitles/5523343 which after a little while downloads the file I want.
However, when I use the following code in python, I get redirected to another page:
import urllib2
url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
response = urllib2.urlopen(url)
if response.url == url:
    print "No redirect"
else:
    print url, " --> ", response.url
Result: http://www.opensubtitles.org/en/subtitleserve/sub/5523343 --> http://www.opensubtitles.org/en/subtitles/5523343/the-musketeers-commodities-en
Why does that happen? How can I follow the same redirect as with the browser?
(I know that these sites offer APIs in python, but this is meant as practice in python and playing with urllib2 for the first time)
There's a significant difference between the request Chrome makes and the one your urllib2 script makes, and that is the HTTP User-Agent header (https://en.wikipedia.org/wiki/User_agent).
opensubtitles.org probably detects that you're trying to retrieve the webpage programmatically and blocks it. Try using one of the User-Agent strings from Chrome (more here http://www.useragentstring.com/pages/Chrome/):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36
in your script.
See this question on how to edit your script to support a custom User-Agent header - Changing user agent on urllib2.urlopen.
I would also like to recommend using the requests library for Python instead of urllib2, as the API is much easier to understand - http://docs.python-requests.org/en/latest/.
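For reference, here is a minimal sketch of sending a custom User-Agent with the standard library alone. It is written for Python 3, where urllib2 became `urllib.request`, so it is not a drop-in change to the Python 2 script in the question. The helper names `build_request` and `final_url` and the 10-second timeout are illustrative assumptions.

```python
import urllib.request

# The Chrome User-Agent string quoted in the answer above.
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/41.0.2227.1 Safari/537.36"
)


def build_request(url, user_agent=CHROME_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def final_url(url):
    """Open `url`, following HTTP redirects, and return the URL reached."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.geturl()
```

Calling `final_url("http://www.opensubtitles.org/en/subtitleserve/sub/5523343")` would then show where the server sends a client that presents this header. Only HTTP-level redirects are followed this way; a redirect performed by JavaScript in the page would still require a real browser.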
