I'm trying to get all links on a page 'https://www.jumia.com.eg' using scrapy.
The code is like this:
all_categories = response.xpath('//a')
But I found a lot of missing links in the results.
The count of the results is 242 links.
When I tried the same XPath (//a) in Chrome developer tools, I got all the links: the count was 608.
Why doesn't Scrapy get all the links using that selector while Chrome does?
It turned out that the problem is that the data is loaded using JavaScript, as Justin commented.
That's because the website is using reCAPTCHA.
If you type view(response) in the scrapy shell, you will notice that you are actually parsing the reCAPTCHA page (which explains the unexpected a tag count):
You can try solving the reCAPTCHA (not sure how easy that would be, but this question might help)... Alternatively you can run your scraper from a proxy, such as Crawlera which uses rotating IPs... I have not used Crawlera but according to their website, it retries the page several times (with different IPs) until it hits a clean page.
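A quick way to confirm this from the scrapy shell (a minimal sketch; the URL and XPath are the ones from the question):
scrapy shell "https://www.jumia.com.eg"
>>> view(response)                # opens the downloaded response in your browser
>>> len(response.xpath('//a'))    # far fewer links than the rendered page shows in Chrome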
I have been trying to scrape some data using Beautiful Soup from https://www.eia.gov/coal/markets/. However, when I parse the contents, some of the data does not show up at all. Those data fields are visible in the Chrome inspector but not in the soup. They do not seem to be text elements; I think they are fed from an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Chrome inspector screenshot:
Beautiful Soup parsed content screenshot:
@DMart is correct. The data you are looking for is being populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. So it looks like Selenium is your best bet.
See this thread for more information.
There is not enough detail in your question, but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to tell whether you're using Selenium (you tagged the question as such). If so, you may be reading page_source, which does not guarantee you the entire completed source of the page you're looking at.
If you're using requests, it's even more unlikely you're capturing the page's completed source code.
The data is loaded via AJAX, so it is not available in the initial document. If you go to the Network tab in Chrome dev tools, you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response, and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
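As a rough sketch of that approach (the structure of the returned JSON still has to be inspected by hand), hitting the endpoint directly with requests could look like this:
import requests

url = "https://www.eia.gov/coal/markets/coal_markets_json.php"
# A browser-like User-Agent; some endpoints reject the default requests one.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()   # inspect this to find the fields you need
print(type(data))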
Thank you all!
Opening the page with Selenium and a WebDriver, then parsing the page source with Beautiful Soup, worked.
from selenium import webdriver
from bs4 import BeautifulSoup as BS
driver = webdriver.Chrome()   # any WebDriver works; Chrome is just an example
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source
soup = BS(html, 'html.parser')
table = soup.find("table", {'id': 'snl_dpst'})
rows = table.find_all("tr")
driver.quit()
I can't scrape the 'Resolution' field from the webpage; I believe it is rendered by JavaScript.
Webpage address:
https://support.na.sage.com/selfservice/viewdocument.do?noCount=true&externalId=60390&sliceId=1&noCount=true&isLoadPublishedVer=&docType=kc&docTypeID=DT_Article&stateId=4183&cmd=displayKC&dialogID=197243&ViewedDocsListHelper=com.kanisa.apps.common.BaseViewedDocsListHelperImpl&openedFromSearchResults=true
I need to extract Description, Cause, and Resolution.
I tried various ways to get the element, including:
find_element_by_xpath
find_element_by_id
find_element_by_class_name.
Nothing gave the desired result.
Could you point me in the right direction?
https://support.na.sage.com/selfservice/viewContent.do?externalId=60390&sliceId=1
This is the correct URL from which you can crawl the HTML; use the Network tab of your browser's devtools to find it.
Example with Chrome
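As a minimal sketch of that approach (the exact markup around Description, Cause and Resolution is an assumption; inspect the response and adjust the parsing accordingly):
import requests
from bs4 import BeautifulSoup

url = ("https://support.na.sage.com/selfservice/viewContent.do"
       "?externalId=60390&sliceId=1")
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Dump the text first to see where Description / Cause / Resolution appear,
# then narrow it down with find()/select() on the actual tags and classes.
print(soup.get_text(" ", strip=True)[:1000])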
When I search for an XPath in my browser after inspecting the page, it shows the required result, but when I use the same XPath on my response in Scrapy, it returns an empty list.
So when I find an element in the browser, it shows the number of matching elements (see the picture for an example).
Now, when I run the same XPath on my response in the Scrapy shell, I get an empty list, even though the response status is 200. What could be causing this?
Your browser renders the JavaScript, and this changes the HTML. So, in this case, you need a JavaScript engine for your requests in Scrapy. Please look at scrapy-splash to render the JS and get the same results as in the browser.
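A minimal spider sketch using scrapy-splash, assuming a Splash instance is running locally on port 8050 (the settings follow the scrapy-splash README; the target URL and XPath are placeholders):
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_spider"
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100},
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # Replace with the page whose XPath comes back empty without JS rendering.
        yield SplashRequest("https://example.com", self.parse, args={"wait": 2})

    def parse(self, response):
        # response now contains the JS-rendered HTML, so browser XPaths should match.
        yield {"links": response.xpath("//a/@href").getall()}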
If you use the Chrome browser, some tags will be a little different from what you get with requests or Scrapy.
For example, Chrome will automatically add tags such as <tbody> into the HTML.
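So an XPath copied from Chrome devtools can fail against the raw HTML that Scrapy downloads. An illustrative sketch (the table id is made up):
# Copied from Chrome devtools -- may match nothing in the raw HTML
response.xpath('//table[@id="results"]/tbody/tr')
# Dropping the tbody step matches the HTML as actually served
response.xpath('//table[@id="results"]/tr')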
I am using Scrapy to scrape a page. I tried many times and I am convinced that the following doesn't work (in the shell) and returns an empty result:
response.xpath('//*[@class="itemtitle"]/a/text()').extract()
whereas in the Chrome console, this brings me the expected result:
$x('//*[@class="itemtitle"]/a/text()')[0]
I checked the robots.txt for the target URL and found the following:
User-agent: *
Disallow: /~a/
I am wondering if it is not allowed to scrape it.
So my specific question is: is it possible to prevent robots from scraping certain pages? If not, what could be wrong with my code that brings back an empty result in the Scrapy shell?
Always check the source HTML (usually Ctrl+U in a browser). Here the raw source is not the HTML your browser renders; the entries are plain <item>/<title> elements. You need:
response.xpath('//item/title/text()').extract()
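A quick way to confirm this in the scrapy shell (the URL is omitted in the question, so a placeholder is used):
scrapy shell "<target url>"
>>> response.xpath('//*[@class="itemtitle"]/a/text()').extract()   # [] -- the class only exists in the rendered page
>>> response.xpath('//item/title/text()').extract()                # matches the raw source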
I want to crawl a website that has multiple pages, where a page is dynamically loaded when its page number is clicked. How do I screen-scrape it?
i.e., as the URL is not present in any href or a element, how do I crawl to the other pages?
I would be grateful if someone helped me with this.
PS: The URL remains the same when a different page is clicked.
You should also consider Ghost.py, since it allows you to run arbitrary JavaScript commands, fill forms, and take snapshots very quickly.
If you are using Google Chrome, you can check the URL that is being called dynamically in
Network -> Headers of the developer tools.
Based on that, you can identify whether it is a GET or a POST request.
If it is a GET request, you can find the parameters straight away from the URL.
If it is a POST request, you can find the parameters in the form data in Network -> Headers
of the developer tools.
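A rough sketch of replaying a GET-style pagination call with requests (the endpoint and parameter names are placeholders; copy the real ones from the Network tab):
import requests

url = "https://example.com/ajax/list"        # placeholder endpoint
for page in range(1, 6):
    resp = requests.get(
        url,
        params={"page": page},               # placeholder parameter name
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    resp.raise_for_status()
    print(page, len(resp.text))              # or resp.json() if it returns JSON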
You could look for the data you want in the javascript code instead of the HTML. This is usually a pain but you can do fun things with regular expressions.
Alternatively, some of the browser testing libraries like splinter work by loading the page up in an actual browser like Firefox or Chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
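For example, a small splinter sketch (assumes Chrome and its driver are installed; the URL is a placeholder):
from splinter import Browser

browser = Browser('chrome')              # 'firefox' also works if installed
browser.visit("https://example.com")     # placeholder URL
html = browser.html                      # rendered page source, after the JS has run
browser.quit()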
Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
You cannot do that easily, since it is AJAX pagination (even with mechanize). Instead, open the page source and work out which URL request is used for the AJAX pagination. Then you can recreate that request yourself and process the returned data in your own way.
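If that pagination call turns out to be a POST, the same idea applies. A sketch with made-up URL and form fields (take the real ones from the request your browser sends):
import requests

resp = requests.post(
    "https://example.com/ajax/page",         # placeholder endpoint
    data={"pageIndex": 2, "pageSize": 20},   # placeholder form fields
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, len(resp.text))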
If you don't mind using gevent, GRobot is another good choice.