I am new to web scraping and building crawlers, and I started practicing on a grocery website.
I've been trying to crawl data from the site for quite some time and can't get past three pages. For the first three pages the website lets me access the data, but after that I don't get any response, and for a few seconds I stop getting responses in the browser as well. The website uses an API to load all the data, so I can't just use BeautifulSoup; I thought of using Selenium, but had no luck there either.
I am using Python's requests library to get the data and json to parse it. The website requires a POST request to access the products, so I am sending cookies, headers and params as well, and I reuse the same cookies etc. for the next pages.
I am looking for general advice from anyone who has been through the same situation and found a workaround.
Thank you.
Here is how you can unblock this website. (Sorry, I can't provide the code, because it is likely not to function without my location details, so follow the steps below to get the code.)
1. Open that link in Google Chrome, open Developer Tools with Ctrl + Shift + I, and go to the Network tab. There, filter by XHR and find the 'details' request.
2. Right-click on it and choose Copy as cURL (bash).
3. Go to a cURL-to-requests converter, paste the command, and convert it. The cURL command gets converted to Python requests code. Copy that and run it.
In that code, the last line will look like this:
response = requests.post('https://www.kroger.com/products/api/products/details', headers=headers, cookies=cookies, data=data)
This line makes the request.
4. Now, after this, extract what you require:
data = response.json() # saving as a dictionary
product = data['products'] # getting the product
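Putting the pieces together, the converted script will roughly follow this skeleton (just a sketch: the real headers, cookies, and payload come from your own copied cURL command, and only the 'products' field name is taken from the lines above):
import requests

# Placeholders only -- paste the headers, cookies, and payload that the
# cURL-to-requests converter produced from your own browser session.
headers = {'User-Agent': 'Mozilla/5.0', 'Content-Type': 'application/json'}
cookies = {}      # session cookies copied from the browser
data = '{}'       # request body copied from the 'details' request

response = requests.post('https://www.kroger.com/products/api/products/details',
                         headers=headers, cookies=cookies, data=data)

data = response.json()          # saving as a dictionary
products = data['products']     # getting the product list
for product in products:
    print(product)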
Now from this scraped data, take whatever you need. Happy Coding :)
I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using Beautiful Soup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' element and class. The code runs, but it returns an empty list '[ ]'. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_ = "jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is what you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').
The page you see in your browser is the result of many, many more requests.
The reason your list is empty is that the HTML you get back is very minimal. You can print it to the console and compare it to what the browser receives.
To make things easier, instead of using requests you can use Selenium, which is essentially a library for programmatically controlling a browser. Selenium will make all those requests for you like a normal browser and let you access the page source as you expected it to look.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed it up, like running in headless mode (i.e. without rendering the page graphically), but it won't be as fast as figuring out how to do it on your own with requests.
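For illustration, a minimal sketch of that approach (assuming Selenium 4 and Chrome are installed; the class name is the one from the question and may change whenever LinkedIn updates its markup, and the page may still require login or rate-limit you):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')   # don't render the page graphically
driver = webdriver.Chrome(options=options)

driver.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer')
time.sleep(5)                        # crude wait for the JavaScript to finish loading

soup = BeautifulSoup(driver.page_source, 'lxml')   # now contains the rendered markup
jobs = soup.find_all('li', class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(len(jobs))

driver.quit()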
If you want to do it using requests, you're going to need to do a lot of snooping through the requests, maybe using a tool like Postman, to see how to simulate the necessary steps to get the data from whatever page.
For example, some websites have a handshake process when logging in.
A website I've worked on goes like this (a rough sketch of the flow follows the list):
0. Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included.
1. Fetch the initial HTML and get a unique key from a hidden element in a <form>.
2. Using this key, make a POST request to the URL from that form.
3. Get a session id key from the response.
4. Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools.
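A rough sketch of that flow with requests (the URLs, field names, and form attributes below are made up for illustration; the real ones come from inspecting the site's HTML and network traffic):
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Step 0: the site ignores requests without a browser-like User-Agent
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Step 1: fetch the initial HTML and pull the unique key out of the hidden form field
html = session.get('https://example.com/login').text              # hypothetical URL
form = BeautifulSoup(html, 'lxml').find('form')
key = form.find('input', {'name': 'key'})['value']                # hypothetical field name

# Step 2: POST the key to the URL from that form
resp = session.post('https://example.com' + form['action'], data={'key': key})

# Step 3: pull a session id out of the response
session_id = resp.json()['sessionId']                             # hypothetical field name

# Step 4: combine username, password, and session id in a final POST
resp = session.post('https://example.com/auth',                   # hypothetical URL (was hidden in JS)
                    data={'username': 'user', 'password': 'pass', 'sessionid': session_id})
print(resp.status_code)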
So really, I stick with Selenium if the site is too complicated and I'm only getting the data once or not very often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope any of this made sense to you. Happy scraping!
I'm trying to scrape all news items from this website. They are not showing in the source code: http://www.uvm.dk/aktuelt
I've tried using Firefox's Live HTTP Headers and Chrome's developer tools but still can't figure out what goes on behind the scenes. I'm sure it's pretty simple :-)
I have this information, but how do I use it to scrape the news I want?
http://www.uvm.dk/api/search
Request Method: POST
Connection: keep-alive
PageId=8938bc1b-a673-4513-80d1-e1714ca93d7c&Term=&Years%5B%5D=2017&WorkAreaIds=&SubjectIds=&TemplateIds=&NewsListIds%5B%5D=Emner&TimeSearch%5BEvaluation%5D=&FlagSearch%5BEvaluation%5D=Alle&DepartmentNames=&Letters=&RootItems=&Language=da&PageSize=10&Page=1
Can anyone help?
Not a direct answer, but some hints.
Your approach with Live HTTP Headers is a good one. Open the sidebar before loading the home page and clear everything. Then load the home page and an article. There will usually be a ton of HTTP requests because of images, CSS and JS, but you'll be able to find the few useful ones. Usually the very first one is for the home page, and one somewhere below it is for the article page. Another interesting one is the request made when you click to the next page.
I like to decouple downloading (HTTP) and scraping (HTML, JSON or the like).
I download to a file with a first script and scrape with a second one.
First, because I want to be able to adjust the scraping without downloading again and again. Second, because I prefer to use bash + curl to download and Python + lxml to scrape. If I need information from the scraping step to continue downloading, my scraping script outputs it to the console.
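As a concrete starting point, here is a sketch that replays the request from the question with Python requests (the endpoint and payload are copied from the question; I haven't verified the response format, so inspect it before deciding how to parse):
import requests

payload = {
    'PageId': '8938bc1b-a673-4513-80d1-e1714ca93d7c',
    'Term': '',
    'Years[]': '2017',
    'WorkAreaIds': '',
    'SubjectIds': '',
    'TemplateIds': '',
    'NewsListIds[]': 'Emner',
    'TimeSearch[Evaluation]': '',
    'FlagSearch[Evaluation]': 'Alle',
    'DepartmentNames': '',
    'Letters': '',
    'RootItems': '',
    'Language': 'da',
    'PageSize': 10,
    'Page': 1,
}

response = requests.post('http://www.uvm.dk/api/search', data=payload)
print(response.status_code)
print(response.text[:500])   # inspect the raw response before writing the parsing code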
I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and a user has to press "Show more" to get 10 more reviews (this also adds #add10 to the end of the site's address) every time he scrolls down to the end of the review list. Actually, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that I get only the first 10 reviews using site_url#add1000 in my spider, just as with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX url is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all, headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all, headers={all_headers}, cookies={all_cookies})
But every time I'm getting a 404 Error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since the requests module cannot handle AJAX well.
One such option is Selenium.
For example:
driver.find_element_by_id("show more").click() # This is just an example case
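A sketch of that idea, which keeps clicking "Show more" until the button disappears (the URL, button text, and review selector are placeholders to adapt to the real site):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('https://example.com/reviews')              # hypothetical review page

while True:
    try:
        driver.find_element(By.LINK_TEXT, 'Show more').click()
        time.sleep(2)                                   # give the AJAX call time to finish
    except NoSuchElementException:
        break                                           # no button left -> all reviews are loaded

reviews = driver.find_elements(By.CSS_SELECTOR, '.review')   # placeholder selector
print(len(reviews))
driver.quit()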
Normally, when you scroll down the page, AJAX sends a request to the server, and the server then responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL of this JSON/XML file. Normally, you can open your Firefox browser, open Tools / Web Developer / Web Console, monitor the network activity, and easily catch this JSON/XML request.
Once you find this request, you can parse the reviews directly from its response (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
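Once you have that URL, something along these lines is usually enough (the endpoint and field names are placeholders; a 404 on a copied AJAX URL often means a missing header such as X-Requested-With or Referer, so copy those from the network inspector too):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',      # many AJAX endpoints check for this
    'Referer': 'https://example.com/reviews',  # placeholder
}

response = requests.get('https://example.com/ajaxlst?par1=x&par2=y', headers=headers)
data = response.json()                         # assuming the endpoint returns JSON
for review in data.get('reviews', []):         # placeholder field name
    print(review)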
I'm studying opening pages through Python (3.3).
import urllib.request

url = 'http://www.google.com'
page = urllib.request.urlopen(url)
Does the above code count as one hit to Google, or does this?
os.system('start chrome.exe google.com')
The first one scrapes the page while the second one actually opens the page in a browser. I was just wondering if it made a difference page hit wise?
Both do very different things.
Using urllib.request.urlopen makes a single HTTP request.
Your second example will do the same, and then the browser will parse the document it receives and request subsequent resources (images/JavaScript/CSS/whatever). So loading google.com in your browser will trigger many hits.
Try it yourself by looking in your browser's developer tools (usually in the Network section) while you load a page.
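You can also see the difference from Python itself. A small sketch that fetches the page once with urllib and counts how many extra resources the HTML references (tag counting is only an approximation of what a browser would actually download):
import urllib.request
from html.parser import HTMLParser

class ResourceCounter(HTMLParser):
    # Counts tags that would normally trigger extra requests in a browser.
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('img', 'script', 'link'):
            self.count += 1

html = urllib.request.urlopen('http://www.google.com').read().decode('utf-8', 'replace')
parser = ResourceCounter()
parser.feed(html)
print('urllib made 1 request; the HTML references', parser.count, 'additional resources')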
I am playing around with change.org and trying to download a couple of comments on a petition. For this, I would like to know where the comments are being pulled from when the user clicks on "load more reasons". For an example, look here:
http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food
Looking at the XHR requests in Chrome, I see requests being sent to http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments. Of course, the page number varies with the number of times comments are loaded.
However, this link leads to a blank page when I try it in a browser. Is this because of some missing data in the url or is this a result of some authentication step within the javascript that makes the request in the first place?
Any pointers will be appreciated. Thanks!
EDIT: Thanks to the first response, I see that the data is received when I use the console. How do I receive the same data when making the request from a Python script? Do I have to mimic the browser, or is there a way to just use urllib?
They must be validating the source of the request. If you go to the site, open the console, and run this:
$.get('http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments',{},function(data){console.log(data);});
You will see the data come back.
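To do the same from a Python script (per the EDIT in the question), a sketch like this is a reasonable starting point. The key assumption is that the endpoint only answers requests that look like they came from the page's own JavaScript, so send an X-Requested-With header and a Referer copied from the browser; using requests here instead of urllib just keeps the header handling short:
import requests

url = 'http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',   # mark the request as AJAX, like $.get does
    'Referer': 'http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food',
}

response = requests.get(url, params={'page': 2, 'role': 'comments'}, headers=headers)
print(response.status_code)
print(response.text[:500])   # check whether the comment markup/JSON came back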