I'm trying to parse the HTML of a page with infinite scrolling. I want to load all of the content so that I can parse it all. I'm using Python. Any hints?
Those pages update their HTML with AJAX. Usually you just need to find the new AJAX requests sent by the browser, work out the meaning of the AJAX URL parameters, and fetch the data from the API yourself.
API servers may validate the user agent, referer, cookie, oauth_token ... of the AJAX request, so keep an eye on those.
The data is either loaded in advance, or the page sends a request as you scroll. You can use HttpFox to find that request and send it yourself, as in the sketch below.
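A minimal sketch of replaying such an endpoint with requests, assuming a hypothetical JSON API whose URL, offset/limit parameters, and response shape you would discover with HttpFox or the browser's network inspector:

import requests

# Hypothetical endpoint and parameters found in the network inspector;
# the real URL, parameter names, and response shape differ per site.
API_URL = "https://example.com/api/items"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",           # some APIs reject unknown agents
    "X-Requested-With": "XMLHttpRequest",  # commonly sent with AJAX calls
})

items, offset, page_size = [], 0, 20
while True:
    resp = session.get(API_URL, params={"offset": offset, "limit": page_size})
    resp.raise_for_status()
    batch = resp.json().get("items", [])   # assumed response shape
    if not batch:
        break                              # no more content to load
    items.extend(batch)
    offset += page_size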
"To display a Web page, the browser sends an original request to fetch
the HTML document that represents the page. It then parses this file,
making additional requests corresponding to execution scripts, layout
information (CSS) to display, and sub-resources contained within the
page (usually images and videos)."
The previous quote is from MDN Web Docs, "An overview of HTTP".
My question is: I want to identify the original request from the client, then temporarily store that request and all subrequests made to the server; when the client makes another original request, I want to replace the temporarily stored data with the new requests.
For example, say I have an HTML page that, when parsed by the client, triggers additional requests to some resources on the server. When the user reloads the page, he is just making another original request, so the temporarily stored request data should be replaced by the new original request and its subrequests. The same happens when the client requests another HTML page.
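For what it's worth, one way to sketch this is as a mitmproxy addon: treat a top-level HTML navigation as the "original request" (modern browsers mark it with a Sec-Fetch-Mode: navigate header) and reset the stored group whenever one arrives. A minimal sketch under that assumption, not a complete solution:

from mitmproxy import http

class NavigationGrouper:
    """Keep the latest original request together with its subrequests."""

    def __init__(self):
        self.current_group = []

    def request(self, flow: http.HTTPFlow) -> None:
        # Heuristic: a top-level navigation carries Sec-Fetch-Mode: navigate.
        if flow.request.headers.get("sec-fetch-mode") == "navigate":
            self.current_group = [flow]      # replace the stored data
        else:
            self.current_group.append(flow)  # a subrequest of the last page

addons = [NavigationGrouper()]

Run it with mitmdump -s grouper.py. Older browsers that don't send Sec-Fetch headers would need a different heuristic, e.g. checking for Accept: text/html.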
I am trying to scrape an app store, specifically the reviews for every app. The issue is that the reviews are dynamic: the client side makes an AJAX POST call to get that data, and the payload for this call is also generated dynamically for every page. I was wondering if there is any way to intercept this request through code to get the payload, with which I could then make the call using requests or any other library. I am able to make this call through Postman using the parameters I get from the Network activity panel of the browser's developer tools.
I could use Selenium to scrape the fully loaded page, but it waits for the entire page to load, which is highly inefficient since I don't need to wait for the entirety of the page.
import requests

# The payload is generated dynamically for every page but is constant
# for that given page (placeholder kept as-is).
payload = "<This is dynamically created for every page and is constant for that given page>"
headers = {"Content-Type": "application/x-www-form-urlencoded"}
url = "https://appexchange.salesforce.com/appxListingDetail"
r = requests.post(url=url, data=payload, headers=headers)
I was wondering if it's possible to get this payload through the scraper, by intercepting all the AJAX calls made when it scrapes the base web page.
If you want to scrape a page that contains scripts and dynamic content, I would recommend using puppeteer.
It is a Node.js library driving a scriptable headless browser (it can also be used in non-headless mode); it loads the page exactly like a regular browser, with all its dynamic content and scripts. So you can simply wait for the page to finish loading and then read the rendered content.
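Since the original question is in Python, here is a rough sketch using pyppeteer (an unofficial Python port of puppeteer) that records the payload of every POST the page fires while loading; the URL is taken from the snippet above, everything else is illustrative:

import asyncio
from pyppeteer import launch  # unofficial Python port of puppeteer

async def capture_post_payloads(url):
    browser = await launch()
    page = await browser.newPage()
    payloads = []

    def on_request(request):
        # Record the target and body of every POST the page makes.
        if request.method == 'POST':
            payloads.append((request.url, request.postData))

    page.on('request', on_request)
    await page.goto(url, waitUntil='networkidle0')
    await browser.close()
    return payloads

payloads = asyncio.get_event_loop().run_until_complete(
    capture_post_payloads('https://appexchange.salesforce.com/appxListingDetail'))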
You can also use selenium
Also check this out: a list of almost all headless browsers
UPDATE
I know it's been some time, but I will post the answer to your comment here, in case you still need it or just for the sake of completeness:
In puppeteer you can asynchronously wait for a specific DOM node to appear on the page, or for some other event to occur. After that you can continue with your work and not worry about the rest of the page loading (it will keep loading in the background; you just don't have to wait for it).
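A minimal pyppeteer sketch of that, with a made-up selector for the reviews container:

import asyncio
from pyppeteer import launch

async def scrape_reviews(url):
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    # Return as soon as this node exists; the rest of the page keeps
    # loading in the background, but we no longer wait for it.
    await page.waitForSelector('.reviews')  # hypothetical selector
    html = await page.content()
    await browser.close()
    return html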
While loading this web page, the browser makes many requests. I need a certain request URL (e.g. one that starts with 'http://api.le.com/mms/out/video/playJson?') made during the loading process; the request is only made when the Adobe Flash Player plugin for NPAPI is enabled. Is there any way to get that URL?
P.S. Please show some code if you can; I am new to this area.
Scrapy doesn't capture the requests a page makes while it is being rendered. Either you know exactly the URL you want and request it directly, or you have to use something like scrapy-splash, which can return a HAR file with all the requests made while loading the page. The only downside to this is that Splash doesn't return the contents of each request, only the headers =(
If you absolutely need the contents of the requests, it's best to use Selenium with BrowserMob Proxy; if you find a better solution, please do tell.
EDIT
It seems that Splash now does return request bodies; check @Mikhail's comment.
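A rough sketch of the HAR approach, assuming a running Splash instance and the scrapy-splash middleware already configured in settings.py (the page URL is a placeholder); the response_body argument is what asks Splash to include the bodies mentioned in the edit above:

import scrapy
from scrapy_splash import SplashRequest

class HarSpider(scrapy.Spider):
    name = "har_example"

    def start_requests(self):
        yield SplashRequest(
            "https://example.com/page",  # placeholder page URL
            self.parse,
            endpoint="render.json",
            args={"har": 1, "response_body": 1, "wait": 2},
        )

    def parse(self, response):
        # The HAR log lists every request Splash made while rendering.
        for entry in response.data["har"]["log"]["entries"]:
            url = entry["request"]["url"]
            if url.startswith("http://api.le.com/mms/out/video/playJson?"):
                self.logger.info("found: %s", url)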
I am developing a project in Python using Django. The project does a lot of work in the background, so I want to notify users what's going on in the system. For this I have declared a p tag in the HTML and I want to send data to it.
I know I can do this with templates, but I am a little confused, as 5 functions need to pass their status to the p tag, and if I use render_to_response() it refreshes the page every time a status is passed from a function.
Could anyone please tell me the correct way to do this?
The part of your page that contains the paragraph tag should include a piece of JavaScript with a timer.
Every once in a while it does an Ajax request to get the data about "what's going on now in the system".
If you use the Ajax facilities of jQuery, which is probably the easiest, you can pass a JavaScript callback function that will be called when the request is answered. This callback function receives the data served by Django in response to the asynchronous request. In the body of this callback you put the code that fills your paragraph.
Django doesn't have to "know" about Ajax; it just serves the required info from a different URL than the one for your original page with the paragraph tag. That URL is part of the Ajax request sent from the client.
So it's the client that takes the initiative. Ain't no such thing as server push (fortunately).
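On the Django side that can be as small as one extra view. A sketch, assuming the background functions write their status into Django's cache (the cache key and URL are made up); the jQuery timer then polls /status/ every few seconds and writes the returned value into the p tag:

# views.py -- minimal sketch; the cache key is made up
from django.core.cache import cache
from django.http import JsonResponse

def system_status(request):
    # Each background function reports progress with
    # cache.set("system_status", "indexing files...") or similar;
    # this view only reads it back for the polling JavaScript.
    return JsonResponse({"status": cache.get("system_status", "idle")})

# urls.py
from django.urls import path
from .views import system_status

urlpatterns = [
    path("status/", system_status, name="system-status"),
]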
My Google search application makes a request each time I use the paginator. Suppose I have 100 records. Each page shows 10 records, so there are ten pages. When I click the 2nd page it sends another request. Ideally it should not send that request.
"When I click the 2nd page it sends another request. Ideally it should not send that request."
What do you mean by request? Is it a request to Google?
Your application apparently does not cache the results. If your request to Google returns 100 results, you should cache all hundred. When you request the second page, the view should retrieve that cache and return the second page to you.
If you mean a request to your app, then @Daniel's comment has it right. You can get around this by sending all the results to the browser and then doing the pagination in JavaScript.
A more detailed answer is difficult without seeing some code.
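As an illustration of the caching approach, a sketch of a Django view that fetches the results once per query and serves later pages from the cache (fetch_google_results is a hypothetical helper):

from django.core.cache import cache
from django.core.paginator import Paginator
from django.http import JsonResponse

def search(request):
    query = request.GET.get("q", "")
    page_number = request.GET.get("page", 1)

    # Hit Google only once per query; later pages reuse the cached list.
    cache_key = f"search:{query}"
    results = cache.get(cache_key)
    if results is None:
        results = fetch_google_results(query)  # hypothetical helper
        cache.set(cache_key, results, timeout=300)

    page = Paginator(results, 10).get_page(page_number)
    return JsonResponse({"results": list(page.object_list)})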