Direct link to comments that are being loaded asynchronously? - python

I am playing around with change.org and trying to download a couple of comments on a petition. For this, I would like to know where the comments are being pulled from when the user clicks on "load more reasons". For an example, look here:
http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food
Looking at the XHR requests in Chrome, I see requests being sent to http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments
Of course, the page number varies with the number of times comments have been loaded.
However, this link leads to a blank page when I try it in a browser. Is this because of some missing data in the URL, or is this a result of some authentication step within the JavaScript that makes the request in the first place?
Any pointers will be appreciated. Thanks!
EDIT: Thanks to the first response, I see that the data is being received when I use the console. How do I receive the same data when making the request from a Python script? Do I have to mimic the browser, or is there a way to just use urllib?

They must be validating the source of the request. If you go to the site, open the console and run this:
$.get('http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments',{},function(data){console.log(data);});
You will see the data come back.
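To get the same data from a Python script, a rough sketch is to send the same request with browser-like headers; the header values below (User-Agent, Referer, X-Requested-With) are assumptions about what the site checks, and requests is used here, though urllib.request with the same headers would work too:
import requests

# URL observed in the XHR tab; the page parameter selects which batch of comments is returned
url = 'http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions'
params = {'page': 2, 'role': 'comments'}

# Headers that make the request look like the browser's AJAX call (assumed to be what the server checks)
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food',
    'X-Requested-With': 'XMLHttpRequest',
}

response = requests.get(url, params=params, headers=headers)
print(response.status_code)
print(response.text[:500])  # first part of the returned fragment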

Related

Getting blocked from crawling data from website in python

I am new to web scraping and building crawlers, and I started practicing on a grocery website.
I've been trying to crawl data from a website for quite some time and couldn't get through more than three pages. For the first three pages the website lets me access the data, but after that I don't get any response, and for a few seconds I stop getting a response in the browser as well. The website uses an API to get all the data, so I can't just use BeautifulSoup; I thought of using Selenium, but no luck there either.
I am using Python's requests library to get the data and json to parse it. The website requires a POST request to access all the products, so I am sending cookies, headers and params as well, and using the same cookies etc. for the next pages too.
I am looking for some general responses if anyone has been through the same situation and found a workaround.
Thank you.
Here is how you can unblock this website. (Sorry, I can't provide the code because it is likely not to function without my location details, so try the method below to get the code.)
1. Open that link in Google Chrome > open Developer Tools by pressing Ctrl + Shift + I > go to the Network tab. There, filter by XHR and find the 'details' request.
2. Right-click on it and copy it as cURL (bash).
3. Go to Curl to Requests, paste the code, and press enter. The cURL command gets converted to requests code. Copy that and run it.
The last line of the generated code will be something like:
response = requests.post('https://www.kroger.com/products/api/products/details', headers=headers, cookies=cookies, data=data)
This makes the request.
4. Now, after this, extract what we require:
data = response.json() # saving as a dictionary
product = data['products'] # getting the product
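For illustration, the converter output generally has the shape below; the header, cookie, and body values are placeholders, and the real ones come from the cURL command you copied:
import requests

# Placeholder values; the real ones come from the cURL command copied from your own browser session
cookies = {'pid': 'example-session-cookie'}
headers = {'User-Agent': 'Mozilla/5.0', 'Content-Type': 'application/json;charset=UTF-8'}
data = '{"upcs": ["0001111041700"]}'  # hypothetical request body; use the one from your copied cURL

response = requests.post('https://www.kroger.com/products/api/products/details',
                         headers=headers, cookies=cookies, data=data)

result = response.json()        # parse the JSON response into a dictionary
products = result['products']   # list of product records
print(len(products))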
Now from this scraped data, take whatever you need. Happy Coding :)

Using python to parse a webpage that is already open

From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page once I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is, when you log in normally, to have the developer tools open next to you and see what the request is sending.
When logging in to bandcamp, the XHR request that's being sent is the following:
From that response you can see that an identity cookie is being sent back. That's probably how they identify that you are logged in. So once you've got that cookie set, you are authorized to view logged-in pages.
So in your program you could log in normally using requests, save the cookie in a variable, and then attach that cookie to further requests.
Of course login procedures and how this authorization mechanism works may differ, but that's the general gist of it.
So when do you actually need Selenium? You need it if a lot of the page is being rendered by JavaScript. requests is only able to get the HTML. So if the menus and such are rendered with JavaScript, you won't ever be able to see that information using requests.
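As a rough sketch of the cookie-reuse approach described above (the login URL and form field names are placeholders, and real sites may also require CSRF tokens or other hidden fields):
import requests

session = requests.Session()  # a Session keeps cookies across requests automatically

# Hypothetical login endpoint and form field names; inspect the real XHR in the dev tools to find them
login_data = {'username': 'me@example.com', 'password': 'secret'}
session.post('https://example.com/login', data=login_data)

# The identity cookie set by the server is now stored on the session
print(session.cookies.get_dict())

# Further requests made with the same session send that cookie along automatically
page = session.get('https://example.com/account/orders')
print(page.status_code)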

Python requests to online store not checking out

I am new here, so bear with me if I break the etiquette for this forum. Anyway, I've been working on a Python project for a while now and I'm nearing the end, but I've been dealing with the same problem for a couple of days and I can't figure out what the issue is.
I'm using Python and the requests module to send a POST request to the checkout page of an online store. The response I get when I send it is the page where you put in your information, not the page that says your order was confirmed, etc.
At first I thought that it could be the form data that I was sending, and I was right: I checked what it was supposed to be in the Network tab in Chrome and saw I was sending 'Visa' when it was supposed to be 'visa'. But it still didn't work after that. Then I thought it could be the encoding, but I have no clue how to check what kind the site takes.
Do any of you have any ideas of what could be preventing this from working? Thanks.
EDIT: I realized that I wasn't sending a Cookie in the request headers, so I fixed that and it's still not working. I set up a server script that prints the request on another computer and posted to that instead, and the requests are exactly the same, both headers and body. I have no clue what it could possibly be.
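For anyone who wants to reproduce that comparison, a minimal sketch of such an echo server using only the standard library (the host and port are arbitrary choices):
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Print the request line, headers, and body so they can be diffed against the browser's request
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        print(self.requestline)
        print(self.headers)
        print(body.decode('utf-8', errors='replace'))
        self.send_response(200)
        self.end_headers()

HTTPServer(('0.0.0.0', 8000), EchoHandler).serve_forever()
Point both the browser and the script at that machine on port 8000 and compare what each one prints.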

Scraping site that uses AJAX

I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user has to press "Show more" to get 10 more reviews (which also adds #add10 to the end of the site's address) every time he scrolls down to the end of the review list. Actually, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that I get only the first 10 reviews using site_url#add1000 in my spider, just like with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all, headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all, headers={all_headers}, cookies={all_cookies})
But every time I'm getting a 404 Error. Can anyone explain what I'm doing wrong?
What you need is a headless browser for this, since the requests module cannot handle AJAX well.
One such headless-browser option is Selenium.
For example:
driver.find_element_by_id("show more").click() # This is just an example case
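A slightly fuller sketch of that approach (the URL, button locator, and review selector are placeholder assumptions; this uses the Selenium 4 locator API):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()                   # assumes a chromedriver is installed and on PATH
driver.get('https://example.com/reviews')     # placeholder review page

# Keep clicking the hypothetical "Show more" button until it no longer appears
while True:
    buttons = driver.find_elements(By.XPATH, "//button[contains(., 'Show more')]")
    if not buttons:
        break
    buttons[0].click()
    time.sleep(1)                             # crude wait for the next batch of reviews to load

reviews = driver.find_elements(By.CSS_SELECTOR, '.review')  # placeholder selector
print(len(reviews))
driver.quit()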
Normally, when you scroll down the page, AJAX sends a request to the server, and the server then responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL of this JSON/XML file. Normally, you can open Firefox, go to Tools / Web Developer / Web Console, monitor the network activity, and easily catch this JSON/XML request.
Once you find this file, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
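As a rough sketch of that approach (the endpoint, parameters, and JSON field names below are placeholders for whatever you find in the network monitor, not the real site's API):
import requests

# Placeholder AJAX endpoint and parameters, modelled on what you find in the network monitor
url = 'https://example.com/ajaxlst'
params = {'par1': 'x', 'par2': 'y', 'offset': 0}
headers = {'User-Agent': 'Mozilla/5.0',
           'X-Requested-With': 'XMLHttpRequest'}  # many AJAX endpoints check this header

all_reviews = []
while True:
    resp = requests.get(url, params=params, headers=headers)
    data = resp.json()                  # assumes the endpoint returns JSON
    reviews = data.get('reviews', [])   # placeholder field name
    if not reviews:
        break
    all_reviews.extend(reviews)
    params['offset'] += len(reviews)    # move on to the next batch

print(len(all_reviews))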

Trying to scrape the Youtube Statistics for a video not owned by me. Python

I want to scrape the "average duration watched" for a video not owned by me, using Scrapy.
While parsing the page http://www.youtube.com/watch?v=#########, the data does not load. This is expected, because it seems to be an AJAX call.
I didn't find an API that does the trick.
In the XHR tab, the POST request sent is
http://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=OoWSnDmeqAs
In the POST response I can see the details of the data, but when I hit it in a separate tab, I don't see any data. On this page the user beeglebug did try to mention something.
Any help is deeply appreciated.
In Firefox, you may try the add-on "Shelve" in auto-execution mode. Then ask Firefox to load the relevant YouTube page, perhaps automatically, and it will automatically be saved with the information you want. I don't know how automated you want the result to be; your question is not precise enough about that.
Auto-execution is called "Auto-Shelve"; you may want to edit the pattern of files that are saved automatically.
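If you would rather stay in Python, a heavily hedged sketch of replicating that XHR with requests is below; the server very likely also expects extra form fields (such as a session token embedded in the watch page), which are not shown here because the exact names are unknown:
import requests

video_id = 'OoWSnDmeqAs'
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Load the watch page first so any cookies the site sets are stored on the session
watch_url = 'http://www.youtube.com/watch?v=' + video_id
session.get(watch_url)

# Replicate the XHR from the question; extra form fields (e.g. a session token scraped
# from the watch page) are probably also required but are not shown here
stats_url = ('http://www.youtube.com/insight_ajax'
             '?action_get_statistics_and_data=1&v=' + video_id)
resp = session.post(stats_url, headers={'Referer': watch_url})
print(resp.status_code)
print(resp.text[:500])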
