I am trying to crawl a few pages from monsterindia.com, but whenever I run any XPath in the scrapy shell, it gives me an empty result. There should be some way to do it, though, because the view(response) command shows me the same HTML page.
I ran this command :
scrapy shell "https://www.monsterindia.com/search/computer-jobs"
on my terminal and then tried several different XPaths, like response.xpath('//*[@class="job-tittle"]/text()').extract(), but had no luck: I always got an empty result.
On the terminal:
scrapy shell "https://www.monsterindia.com/search/computer-jobs"
Then response.xpath('//div[@class="job-tittle"]/text()').extract()
gave an empty result.
Then response.xpath('//*[@class="card-apply-content"]/text()').extract()
also gave an empty result.
I expect it to give some results, I mean the text from the website after crawling. Please help me with it.
The data you're looking for isn't in the HTML of the page you requested, but in the responses retrieved after the page loads. If you check "View Page Source" in your browser, you will see what actually came back from the first request.
And by inspecting the network tab in dev tools, you will see the further requests, like the one to this URL: https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=25
I think what Thiago was getting at is that the page updates itself with XHR requests, whose query string includes a result-count parameter (limit). The endpoint returns JSON you can parse, so point your request at that URL and handle the JSON accordingly.
Using requests to demonstrate
import requests
from bs4 import BeautifulSoup as bs
import json
r = requests.get('https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=100')
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('p').text)['jobSearchResponse']['data']
for item in data:
    print(item)
JSON of first item
https://jsoneditoronline.org/?id=fe49c53efe10423a8d49f9b5bdf4eb36
With scrapy:
jsonres = json.loads(response.text)  # response.body_as_unicode() on older Scrapy versions
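A minimal sketch of the parsing step on its own, assuming the endpoint returns a JSON body shaped like the one above (a jobSearchResponse object holding a data list). The sample payload and its title field are made up for illustration; only the two outer keys come from the answer. In a Scrapy callback the same helper could be fed response.text.

```python
import json


def extract_jobs(payload_text):
    """Return the list under jobSearchResponse -> data, or [] if it is missing."""
    payload = json.loads(payload_text)
    return payload.get("jobSearchResponse", {}).get("data", [])


# Hypothetical sample body; the real response has many more fields.
sample = '{"jobSearchResponse": {"data": [{"title": "Example job"}]}}'

for job in extract_jobs(sample):
    print(job)
```

Going through json directly avoids wrapping the body in an HTML parser first, which is only needed if the server wraps the JSON in markup.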
I am making a basic Web Crawler/Spider with Python. I am trying to crawl through a YouTube channel and print all the titles of the videos on it but it never returns anything.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/c/DanTDM/videos'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
x = soup.select(".yt-simple-endpoint style-scope ytd-grid-video-renderer")
print(x)
And the output is always: []. An empty list (which means it didn't find anything). I need to know what I'm doing wrong.
The code seems correct.
Call print(response.text) and see if YouTube is returning you a blocking page.
Anti-scraping measures may be in effect, such as checks on your user agent.
Browser Automation with Selenium
When I send a request to YouTube, I receive the following page:
(A 'Before you continue to YouTube' page.)
So...
We should use Selenium instead as we need to click one of the buttons. I don't think we can interact with the website using the requests module.
Selenium allows you to have control over your browser. Read the documentation!
So I am trying to create a small code that gets the views from a youtube video and prints them. However using this code when printing the text var I just get the response "None". Is there a way to get a response of the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
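One way to confirm this without reading the whole page by eye is to list the class names that actually occur in the HTML you received. The sketch below uses only the standard library; raw_html is a made-up stand-in for whatever requests returned, and classes_in is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html):
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Hypothetical stand-in for the raw HTML that requests would receive.
raw_html = '<div id="content"><script>var data = {};</script></div>'
print("view-count" in classes_in(raw_html))
```

If the class you are selecting on is absent from that set, no selector will find it, and the content has to come from a rendered page or from the underlying data request instead.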
I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
JSON Response
HTML Response
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a post method) to include "accept: application/json, text/plain, */*" but didn't see a difference in the response I was getting with requests.post. As it stands I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the javascript the page sends to your browser is making a request to an API to get the json info about the movies.
You could either send the request directly to their API (see edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with scrapy. In my experience it is much faster than requests.
Edit:
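If the page uses dynamically loaded content, which I think is the case, you'd have to use selenium with a headless browser instead of requests (PhantomJS was the usual choice, but Selenium 4 dropped support for it, so headless Chrome is used here). Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "your url"
options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
browser = webdriver.Chrome(options=options)
browser.get(url)
html = browser.page_source  # HTML after JavaScript has run
browser.quit()
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here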
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their api you can just reproduce the request you see. Using google chrome, you can see the request if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was a byte string
# (it looks like b'data' instead of 'data' when you print it)
# if this is your case, convert it to a string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the url as you see fit, for example if it is something like http://api.movies.com/?page=1&movietype=3 you could modify movietype=3 to movietype=2 to see a different type of movie, etc
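Editing the query string by hand works, but a small helper makes it harder to break the URL. The sketch below uses only the standard library; with_query_param is a hypothetical helper name, and the URL is the illustrative one from the paragraph above, not a real API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return url with the query parameter key set to value."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    new_pairs = []
    replaced = False
    for name, old_value in pairs:
        if name == key:
            new_pairs.append((name, str(value)))
            replaced = True
        else:
            new_pairs.append((name, old_value))
    if not replaced:
        # The parameter was not present, so append it.
        new_pairs.append((key, str(value)))
    return urlunsplit(parts._replace(query=urlencode(new_pairs)))


print(with_query_param("http://api.movies.com/?page=1&movietype=3", "movietype", 2))
```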
I am trying to use requests.get(url) to get the response for a URL from a server.
The following code works for the url of the first page of a search result:
r = requests.get("https://www.epocacosmeticos.com.br/perfumes")
soup = BeautifulSoup(r.text)
However, when I try to use the same code for the URL of the second page of the search result, which is "https://www.epocacosmeticos.com.br/perfumes#2",
r = requests.get("https://www.epocacosmeticos.com.br/perfumes#2")
soup = BeautifulSoup(r.text)
it returns the response of the first page. It ignores the '#2' at the end of the URL. How can I get the response of the second page of a search result?
You can use a web proxy like BurpSuite to view the requests made by the page. When you click on the "Page 2" button, this is what is being sent in the background:
GET /buscapagina?fq=C%3a%2f1000001%2f&PS=16&sl=f804bbc5-5fa8-4b8b-b93a-641c059b35b3&cc=4&sm=0&PageNumber=2 HTTP/1.1
Therefore, this is the url you will need to query if you want to properly scrape the website.
BurpSuite also allows you to play with the requests, so you can try changing the request (like changing the 2 for a 3) and see if you get the expected result.
It seems this website uses dynamic html. Because of this, the second results page is not a "new page", but the same page with the search content reloaded.
You probably won't be able to scrape it using only requests; this probably requires a browser. Selenium with PhantomJS or headless Chrome is a good choice for this job, and after that you can use BeautifulSoup to parse the result.
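It may also help to see why '#2' has no effect on the response. The part of a URL after '#' is a fragment, which is handled by the browser and is not sent to the server in the HTTP request. A small illustration with the standard library, using the URL from the question:

```python
from urllib.parse import urldefrag

url = "https://www.epocacosmeticos.com.br/perfumes#2"

# Split the URL into the part that is requested and the client-side fragment.
request_url, fragment = urldefrag(url)

print(request_url)  # what the server is actually asked for
print(fragment)     # what only the page's JavaScript gets to see
```

So both URLs in the question ask the server for exactly the same resource; the page number only reaches the server through the background request described above.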
I'm attempting to scrape some data from www.ksl.com/auto/ using Python Requests and Beautiful Soup. I'm able to get the results from the first search page but not subsequent pages. When I request the second page using the same URL Chrome constructs when I click the "Next" button on the page, I get a set of results that no longer matches my search query. I've found other questions on Stack Overflow that discuss Ajax calls that load subsequent pages, and using Chrome's Developer tools to examine the request. But, none of that has helped me with this problem -- which I've had on other sites as well.
Here is an example query that returns only Acuras on the site. When you advance in the browser to the second page, the URL is simply this: https://www.ksl.com/auto/search/index?page=1. When I use Requests to hit those two URLs, the second set of search results is not Acuras. Is there perhaps a cookie that my browser is passing back to the server to preserve my filters?
I would appreciate any advice someone can give about how to get subsequent pages of the results I searched for.
Here is the simple code I'm using:
from requests import get
from bs4 import BeautifulSoup
page1 = get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = get('https://www.ksl.com/auto/search/index?page=2').text
soup = BeautifulSoup(page1, 'html.parser')
listings = soup.findAll("div", { "class" : "srp-listing-body-right" })
listings[0] # An Acura - success!
soup2 = BeautifulSoup(page2, 'html.parser')
listings2 = soup2.findAll("div", { "class" : "srp-listing-body-right" })
listings2[0] # Not an Acura. :(
Try this. Create a Session object and then call the links. This will maintain your session with the server when you send a call to the next link.
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Add this line
page1 = s.get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = s.get('https://www.ksl.com/auto/search/index?page=1').text
Yes, the website uses cookies so that https://www.ksl.com/auto/search/index shows or extends your last search. More specifically, the search parameters are stored on the server for your particular session cookie, that is, the value of the PHPSESSID cookie.
However, instead of passing that cookie around, you can simply always do full queries (in the sense of the request parameters), each time using a different value for the page parameter.
https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B
https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=1&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B
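As a rough sketch of generating such full queries in code, the helper below rebuilds the query string for each page with the standard library. build_search_url is a hypothetical name, and the parameter set is shortened to make[], miles and page for readability; whether the site accepts the shortened set is an assumption, so in practice you may need to pass every parameter from the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ksl.com/auto/search/index"


def build_search_url(page, make="Acura", miles=25):
    """Build a full search URL that repeats the filters on every page."""
    params = {"make[]": [make], "miles": miles, "page": page}
    # doseq=True expands list values into repeated key=value pairs.
    return f"{BASE_URL}?{urlencode(params, doseq=True)}"


# Pages are zero-based in the URLs above: page=0 is the first page.
for page in range(2):
    print(build_search_url(page))
```

Because every request carries its own filters, the result no longer depends on what the server remembers for the session.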