I'm looking for a way to scan a website and instantly detect an update, without having to refresh the page. So when a new post is pushed to the webpage, I'd like to be instantly notified. Is there a way to do that without having to refresh the page constantly?
Cheers
What browser are you using? Chrome has an auto-refresh extension; a quick Google search will find it, and it's very easy to set up. It's more of a timed refresh that you can program, but it works for situations like the one you describe.
Without knowing a bit more about your task, it's hard to give you a clear answer. Typically you would set up some kind of API to determine if data has been updated, rather than scraping the contents of a website directly. See if an API exists, or if you could create one for your purpose.
Using an API
Write a script that calls the API every minute or so (or more often if necessary). Every time you call the API, save the result. Then compare the previous result to the new result - if they're different then the data has been updated.
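A minimal sketch of that polling loop, using only the standard library. The endpoint URL is a placeholder, and hashing the JSON is just one convenient way to compare snapshots:

```python
import hashlib
import json
import time
import urllib.request

API_URL = "https://example.com/api/posts"  # placeholder -- use your real endpoint

def fingerprint(payload):
    """Return a stable hash of a JSON-serialisable payload."""
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def fetch_payload(url):
    """Download and decode the API's JSON response."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def watch(url, interval=60):
    """Poll the API and report whenever the data changes."""
    previous = fingerprint(fetch_payload(url))
    while True:
        time.sleep(interval)
        current = fingerprint(fetch_payload(url))
        if current != previous:
            print("Data updated!")
            previous = current
```

Hashing the serialized payload means you only ever store a short string per poll, rather than the whole previous response.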
Scraping a Website
If you do have to scrape a website, this is possible. If you execute an HTTP GET request against a webpage, the response will contain the page's HTML. You can then parse that HTML and traverse the resulting DOM to determine the contents of the page. Similar to the API example, you can write a script that executes the HTTP request every minute or so, saves the state, and compares it to the previous state. There are numerous libraries out there to help perform HTTP requests and traverse the DOM, but without knowing your tech stack I can't really recommend anything.
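As a sketch of the parsing side, here is a standard-library-only extractor. It assumes, purely for illustration, that new posts appear as <h2> headings; compare the extracted titles between polls instead of comparing the raw HTML, which often changes on every load (timestamps, ads, tokens):

```python
from html.parser import HTMLParser

class PostTitles(HTMLParser):
    """Collect the text inside every <h2> tag (assumed here to hold post titles)."""
    def __init__(self):
        super().__init__()
        self._inside = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.titles.append(data.strip())

def extract_titles(html):
    parser = PostTitles()
    parser.feed(html)
    return parser.titles

def new_posts(old_html, new_html):
    """Titles present in the new page but not in the old one."""
    return [t for t in extract_titles(new_html)
            if t not in extract_titles(old_html)]
```

In practice a library like BeautifulSoup makes the extraction step shorter, but the compare-two-snapshots logic stays the same.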
Related
When you do a request in python, you simply download the page and your connection is over.
However, if you open it in your browser, for some websites the page content will automatically refresh. For example the stock prices on yahoo finance, or notifications on reddit.
Is it possible to replicate this behaviour in python: automatic refresh without having to constantly manually re-download the same page entirely?
These websites use a technology called WebSockets, which lets the client and server easily send live data back and forth. The connection is kept open so that the server can push data whenever it needs to. If you need WebSockets, there are some resources listed below:
MDN Web docs
100 seconds of code
The results will be the same if you re-download the page, so don't make it harder than it has to be. If you are hell-bent on rendering the page the way a browser does, you'll need to use something like Puppeteer, PhantomJS, or Selenium.
I am trying to extract the data for all the products (shirts, t-shirts, and so on) from the website https://shop.nordstrom.com/. The page is dynamically loaded. I know I can use Selenium with a headless browser, but that is a time-consuming process, and looking up elements is not promising either, since they have strange IDs and class names.
So I thought of looking in the Network tool to see if I could find the path to the API from which the data is being loaded (XHR requests), but I could not find anything helpful. So is there a way to get the data from the website?
If you don't want to use Selenium, the alternative is to use an HTML parser like bs4, or simply the requests module.
You are on the right path in looking for the call to the API. XHR requests can be seen under the Network tab, but the multitude of resources that appear makes it hard to pick out the request you need. A simple way around this is the following:
Instead of the Network tab, go to the Console tab. There, click on the settings icon and tick just the option Log XMLHttpRequests.
Now refresh the page and scroll down to trigger the dynamic calls. You will now be able to see the logs of all XHRs much more clearly.
For example
(index):29 Fetch finished loading: GET "https://shop.nordstrom.com/api/recs?page_type=home&placement=HP_SALE%2CHP_TOP_RECS%2CHP_CUST_HIS%2CHP_AFF_BRAND%2CHP_FTR&channel=web&bound=24%2C24%2C24%2C24%2C6&apikey=9df15975b8cb98f775942f3b0d614157&session_id=0&shopper_id=df0fdb2bb2cf4965a344452cb42ce560&country_code=US&experiment_id=945b2363-c75d-4950-b255-194803a3ee2a&category_id=2375500&style_id=0%2C0%2C0%2C0&ts=1593768329863&url=https%3A%2F%2Fshop.nordstrom.com%2F&zip_code=null".
Making a GET request to that URL returns a bunch of JSON objects. You can now use this URL, and others you can derive from it, to request the data directly.
See the answer here on how to use the requests module with such a URL to fetch data.
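A rough sketch of calling the discovered endpoint directly. The trimmed query string and the JSON keys below are assumptions; match them to the parameters and response shape you actually see in the console:

```python
import json
import urllib.request

# The endpoint spotted in the console log, trimmed to a few of its parameters.
API_URL = ("https://shop.nordstrom.com/api/recs"
           "?page_type=home&channel=web&country_code=US")

def fetch_json(url):
    """GET the endpoint with a browser-like User-Agent and decode the JSON."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def product_names(payload):
    # Hypothetical response shape -- adjust the keys to the JSON you actually get.
    return [item.get("name") for item in payload.get("products", [])]
```

Sending a realistic User-Agent matters here, since many shop APIs reject requests that look like bots.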
I've read some relevant posts here but couldn't figure an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user has to press "Show more" to get 10 more reviews (which also appends #add10 to the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that I get only the first 10 reviews using site_url#add1000 in my spider, just as with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request that imitates the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y', and I tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers}, cookies={all_cookies})
But every time I get a 404 error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since the requests module cannot execute the JavaScript that triggers the AJAX calls.
One such option is Selenium.
e.g.:
driver.find_element(By.ID, "show-more").click()  # just an example; use the button's real id
Normally, when you scroll down the page, AJAX sends a request to the server, and the server then responds with a JSON/XML payload that your browser uses to update the page.
You need to figure out the URL of this JSON/XML file. In Firefox, open Tools -> Web Developer -> Web Console, monitor the network activity, and you can easily catch this JSON/XML file.
Once you find this file, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this work) and save a huge amount of time. Remember to use a few different clients and IPs. Be nice to the server and it won't block you.
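A polite paginated crawl of such an endpoint might look like the sketch below. The URL pattern is modelled on the 'domain/ajaxlst?par1=x&par2=y' form mentioned above, and the JSON keys are assumptions to be replaced with what you capture in the web console:

```python
import json
import time
import urllib.request

# Placeholder pattern -- substitute the real AJAX URL and parameter names.
AJAX_URL = "https://example.com/ajaxlst?start={start}&count=10"

def fetch_batch(start):
    """Request one batch of reviews the way the page's own AJAX call does."""
    req = urllib.request.Request(
        AJAX_URL.format(start=start),
        headers={"User-Agent": "Mozilla/5.0",
                 "X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def review_texts(payload):
    # Hypothetical JSON shape -- match it to the response you captured.
    return [r["text"] for r in payload.get("reviews", [])]

def crawl_all_reviews():
    reviews, start = [], 0
    while True:
        batch = review_texts(fetch_batch(start))
        if not batch:
            break
        reviews.extend(batch)
        start += len(batch)
        time.sleep(1)  # be nice to the server
    return reviews
```

The X-Requested-With header mimics what most AJAX libraries send; some servers use it to decide whether to return JSON or a full HTML page.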
I am programming an application in Python that, among other functions, will print PDF files via a Xerox printer.
I am facing two options right now:
First one: find a way to communicate with the printer driver so I could easily send instructions to the printer and do whatever I want with it.
Second one: since the first one seems a bit tricky and I don't know of any Python API that does something like that, I had the idea of using the service Xerox provides. Basically, there is an IP address that takes me to an administration page where I can get information about the printer's status and... an option to select files to print (and set the number of pages, the tray where the pages will exit, etc.).
I think the best way is to follow the second option, but I don't know if that's doable.
Basically, I want to be able to fill in that webpage programmatically, changing, for example, the textboxes, and in the end "press" the submit button.
I don't know if this is possible, but if it is, can anyone point me in the right path, please?
Or if you have another idea, I would like to hear it.
So far I have only managed to get the page's source code; I still don't know how to submit the form after I change it.
import requests
url = 'http://www.example.com'
response = requests.get(url)
print(response.content)
Unless Xerox has a Python API or library, your second option is the best choice.
When you visit the administration page and submit files for printing, try doing the following:
When you load the admin page, open Chrome's developer tools (right-click -> Inspect)
Open the "Network" tab in the developer console.
Try submitting some files for printing through the online form. Watch the Network panel for any activity. If a new row appears, click on it and view the request data.
Try to replicate the request's query parameters and headers with Python's requests.
If you need any help replicating the exact request, feel free to start a new question with the request data and what you have tried.
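Once you have captured the form submission, replicating it might look like the sketch below. The endpoint, field names, and file-field name are all placeholders; copy the real ones from the request you saw in the Network panel:

```python
import requests  # third-party, as in the snippet above: pip install requests

# Placeholder admin endpoint -- use the URL the form actually posts to.
PRINT_URL = "http://192.168.1.50/print"

def build_payload(copies, tray):
    """Form fields mirroring what the admin page submits (names are assumed)."""
    return {"copies": str(copies), "tray": str(tray)}

def submit_print_job(pdf_path, copies=1, tray=1):
    """POST a PDF to the printer's web form the way a browser would."""
    with open(pdf_path, "rb") as f:
        files = {"file": (pdf_path, f, "application/pdf")}
        resp = requests.post(PRINT_URL,
                             data=build_payload(copies, tray),
                             files=files,
                             timeout=30)
    resp.raise_for_status()
    return resp
```

If the admin page requires a login or a hidden CSRF token, you will also need to replicate those: do a GET first, extract the token from the HTML, and include it in the POST data.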
Not sure if I'll make sense, but here goes. In Google Chrome, if you right-click a page, go to the developer tools, and then refresh the page, you can see all the GET/POST requests pop up as they happen. I want to know if there is a way, in Python, to input a URL and have it generate a list of each GET call that is made (not sure if this is possible).
Would love some direction on it!
Thanks
I believe I can clarify parts of your original question.
On the one hand, using the browser's built-in debugging tools to investigate how a certain website behaves when loaded is a good technique, and not easily replaced by custom code.
On the other hand, it looks like you are looking for an HTML parser, such as BeautifulSoup.
Also, you seem to confuse the meaning of a URL and an HTML document. A URL can point to an HTML document, but in many cases it points to other things, such as JSON-API endpoint.
Assuming you actually wanted to ask how "to input a URL to an HTML document and have it generate a list of each remote resource call a browser would perform":
Before rendering a website, a web browser fires off the initial HTTP GET request and retrieves the main HTML document. It parses this document and, among other things, searches for further resources to be retrieved. Such resources may be CSS files, JavaScript files, images, iframes, ... (long list). If it finds such resources, the browser automatically fires off one HTTP GET request for each of them. As you can see, there is quite some work involved and happening behind the scenes before all these requests are performed by your browser.
In Python, you cannot trivially simulate the behavior of your browser. You can easily retrieve a single HTML document via the urllib or requests module. That is, you can manually fire off a single HTTP GET request to retrieve an HTML document. Replicating the behavior of a browser would then require
to parse the HTML document in the same way the browser does,
to search the document for remote sources such as images, CSS files, ....,
to decide which remote resources to query in which order, and
then to fire off even more HTTP GET requests, and possibly recursively repeat the entire process (as would be required for iframes)
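The first two steps, parsing the document and searching it for remote resources, can be sketched with the standard library alone. This only finds first-level resources: it ignores CSS @imports, JavaScript-triggered requests, and recursion into iframes:

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect the URLs of sub-resources a browser would fetch next."""
    # Which attribute holds the URL, per tag of interest.
    ATTR_FOR_TAG = {"img": "src", "script": "src",
                    "link": "href", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTR_FOR_TAG.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.resources.append(value)

def resource_urls(html):
    """Return the resource URLs referenced by an HTML document, in order."""
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources
```

Feeding this the HTML retrieved via urllib or requests gives you a first approximation of the GET calls a browser would make, which is often enough to discover the interesting endpoints.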
Exact replication of browser behavior is too complex. Building a proper web browser is an inherently difficult task.
That is, if you want to understand the behavior of a website within a browser, use the browser's debugging tools.