When you make a request in Python, you simply download the page once and the connection is closed.
However, if you open the same page in a browser, some websites refresh their content automatically, for example stock prices on Yahoo Finance or notifications on Reddit.
Is it possible to replicate this behaviour in Python: automatic refresh without having to constantly re-download the entire page?
These websites use WebSockets, which let the client and server exchange live data in both directions over a single long-lived connection: the connection stays open so the server can push data back whenever it needs to. If you want to learn about WebSockets, there are some resources listed below, followed by a minimal client sketch:
MDN Web docs
100 seconds of code
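As a rough illustration of the client side, here is a minimal sketch using the third-party websockets package (the endpoint URL is a made-up placeholder; the real URL and message format depend entirely on the site, and you can usually find them in the browser dev tools under the "WS" filter of the Network tab):

    import asyncio
    import websockets  # pip install websockets

    async def listen():
        # Hypothetical endpoint -- replace with the site's real WebSocket URL.
        url = "wss://example.com/live-feed"
        async with websockets.connect(url) as ws:
            while True:
                message = await ws.recv()  # waits until the server pushes data
                print("update:", message)

    asyncio.run(listen())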
The results will be the same if you simply re-download the page, so don't make it harder than it has to be. If you are hell-bent on rendering the page the way a browser does, you'll need something like Puppeteer, PhantomJS, or Selenium.
Related
I'm looking for a way to watch a website and detect an update the moment it happens. So when a new post is pushed to the page, I'd like to be notified instantly, without having to refresh the page constantly. Is there a way to do that?
Cheers
What browser are you using? Chrome has auto-refresh extensions; a quick Google search will turn them up, and they're very easy to set up. It's really a timed refresh that you can configure, but it works for situations like the one you describe.
Without knowing a bit more about your task, it's hard to give you a clear answer. Typically you would rely on some kind of API to determine whether data has been updated, rather than scraping the contents of a website directly. See if an API exists, or whether you could create one for your purpose.
Using an API
Write a script that calls the API every minute or so (or more often if necessary). Each time you call the API, save the result, then compare the previous result to the new one: if they differ, the data has been updated.
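For example, a rough polling loop along these lines (the endpoint URL is a placeholder, and the comparison assumes the API returns JSON):

    import time
    import requests  # pip install requests

    API_URL = "https://example.com/api/posts"  # hypothetical endpoint

    previous = None
    while True:
        current = requests.get(API_URL, timeout=10).json()
        if previous is not None and current != previous:
            print("Data has been updated!")
        previous = current
        time.sleep(60)  # poll roughly every minute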
Scraping a Website
If you do have to scrape a website, that is possible too. When you execute an HTTP GET request against a web page, the response contains the page's HTML, which you can parse and traverse to determine its contents. As in the API example, you can write a script that executes the HTTP request every minute or so, saves the state, and compares it to the previous state. There are numerous libraries out there to help perform HTTP requests and traverse the resulting document, but without knowing your tech stack I can't really recommend anything.
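One possible sketch of that loop (the URL and the CSS selector are assumptions; the point is to extract just the part of the page you care about and compare it across polls):

    import time
    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    PAGE_URL = "https://example.com/news"  # hypothetical page
    SELECTOR = "div.post-title"            # hypothetical element holding the posts

    previous_titles = None
    while True:
        html = requests.get(PAGE_URL, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        titles = [el.get_text(strip=True) for el in soup.select(SELECTOR)]
        if previous_titles is not None and titles != previous_titles:
            print("Page content changed")
        previous_titles = titles
        time.sleep(60)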
I'm trying to scrape all the news items from this website, but they don't show up in the page source: http://www.uvm.dk/aktuelt
I've tried using Firefox's Live HTTP Headers and Chrome's developer tools but still can't figure out what goes on behind the scenes. I'm sure it's pretty simple :-)
I have the following information, but how do I use it to scrape the news I want?
http://www.uvm.dk/api/search
Request Method:POST
Connection: keep-alive
PageId=8938bc1b-a673-4513-80d1-e1714ca93d7c&Term=&Years%5B%5D=2017&WorkAreaIds=&SubjectIds=&TemplateIds=&NewsListIds%5B%5D=Emner&TimeSearch%5BEvaluation%5D=&FlagSearch%5BEvaluation%5D=Alle&DepartmentNames=&Letters=&RootItems=&Language=da&PageSize=10&Page=1
Can anyone help?
Not a direct answer but some hints.
Your approach with Live HTTP Headers is a good one. Open the sidebar before loading the home page and clear everything. Then load the home page and an article. There will usually be a ton of HTTP requests because of images, CSS and JS, but you'll be able to find the few useful ones. Usually the very first one is the home page itself, and one somewhere below is the article's main request. Another interesting one is the request fired when you click "next page".
I like to decouple downloading (HTTP) and scraping (HTML, JSON or whatever the payload is).
I download to a file with a first script and scrape it with a second one.
First, because I want to be able to adjust the scraping without downloading again and again. Second, because I prefer to use bash+curl to download and python+lxml to scrape. If the scraping script needs to produce information that drives further downloads, it prints that to the console.
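Building on those hints, one way to use the captured request from the question is simply to replay it with requests. This is only a sketch based on the parameters shown above; whether the endpoint needs extra headers, and whether it answers with JSON or an HTML fragment, has to be confirmed in the browser's network tools:

    import requests

    url = "http://www.uvm.dk/api/search"

    # Fields taken from the captured POST body in the question.
    payload = {
        "PageId": "8938bc1b-a673-4513-80d1-e1714ca93d7c",
        "Term": "",
        "Years[]": "2017",
        "WorkAreaIds": "",
        "SubjectIds": "",
        "TemplateIds": "",
        "NewsListIds[]": "Emner",
        "TimeSearch[Evaluation]": "",
        "FlagSearch[Evaluation]": "Alle",
        "DepartmentNames": "",
        "Letters": "",
        "RootItems": "",
        "Language": "da",
        "PageSize": 10,
        "Page": 1,
    }

    response = requests.post(url, data=payload)
    response.raise_for_status()
    print(response.text)  # inspect this to see whether it is JSON or an HTML snippet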
I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user has to press "Show more" to load 10 more (which also appends #add10 to the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that my spider gets only the first 10 reviews with site_url#add1000, just like with site_url, so this approach doesn't work.
I also can't find a way to build a Request that imitates the original one the site makes. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of the following:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)

Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})

Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})
But every time I get a 404 error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since the requests module cannot execute the JavaScript behind the AJAX calls.
One such tool is Selenium, which drives a real browser (optionally headless).
For example:
driver.find_element_by_id("show more").click() # This is just an example case
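A slightly fuller sketch with the current Selenium API (the locator for the "Show more" button is an assumption and has to be adapted to the actual page):

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com/reviews")  # hypothetical review page

    # Keep clicking "Show more" until the button disappears.
    while True:
        buttons = driver.find_elements(By.XPATH, "//button[contains(., 'Show more')]")
        if not buttons:
            break
        buttons[0].click()
        time.sleep(1)  # give the AJAX call time to append new reviews

    html = driver.page_source  # now contains all the loaded reviews
    driver.quit()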
Normally, when you scroll down the page, the page's JavaScript sends an AJAX request to the server, and the server responds with a JSON/XML payload that the browser uses to refresh part of the page.
You need to figure out the URL of that JSON/XML resource. Normally, you can open Firefox's developer tools (Tools → Web Developer), watch the network activity, and easily catch the JSON/XML request there.
Once you find that resource, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to vary your clients and IPs; be nice to the server and it won't block you.
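A sketch of that approach, assuming the AJAX endpoint returns JSON and is paginated with a query parameter (the URL shape, headers, and field names here are guesses modelled on 'domain/ajaxlst?par1=x&par2=y'; the real ones have to be read off the request in the Network tab):

    import requests

    AJAX_URL = "https://example.com/ajaxlst"     # hypothetical endpoint
    headers = {
        "User-Agent": "Mozilla/5.0",             # look like a normal browser
        "X-Requested-With": "XMLHttpRequest",    # many AJAX endpoints expect this
    }

    reviews = []
    page = 1
    while True:
        resp = requests.get(AJAX_URL, params={"page": page}, headers=headers)
        resp.raise_for_status()
        data = resp.json()
        if not data.get("reviews"):              # assumed key; check the real payload
            break
        reviews.extend(data["reviews"])
        page += 1

    print(len(reviews), "reviews collected")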
I am writing a web scraping application. When I enter the URL directly into a browser, it displays the JSON data I want.
However, if I use Python's requests library, or URLDownloadToFile in C++, it simply downloads the HTML of the login page.
The site I am trying to scrape (DraftKings.com) requires a login; the other sites I scrape from don't.
I am 100% sure this is related, since if I paste the URL while logged out I get the login page rather than the JSON data, and once I log in, pasting the URL again gives me the JSON data.
The thing is that even if I stay logged in in the browser, the Python script or C++ app still downloads the login HTML, as mentioned.
Does anyone know how I can fix this issue?
Please don't ask us to help with an activity that violates the terms of service of the site you are trying to (ab-)use:
Using automated means (including but not limited to harvesting bots, robots, parser, spiders or screen scrapers) to obtain, collect or access any information on the Website or of any User for any purpose.
Even if that kind of usage were allowed, the answer would be boring:
You'd need to implement the login functionality in your scraper.
I'd like to retrieve data from a specific web page using the urllib library. The problem is that in order to open this page, some data has to be sent to the server first. If I do it in IE, I need to tick a few checkboxes and then press a "display data" button, which opens the desired page.
Looking at the source code, I see that pressing "display data" submits some kind of form; there is no specific URL there, and I cannot figure out from the code what parameters are sent to the server.
I think the simpler way to do this would be to analyse the communication between IE and the web server after the "display data" button is pressed. If I could see explicitly what IE does, I could mimic it with urllib.
What is the easiest way to do that?
An HTTP debugging proxy would be the best tool to use in this situation. As you're using IE, I recommend Fiddler, which is developed by Microsoft and integrates with Internet Explorer automatically through a plugin. I use Fiddler all the time and it's a really helpful tool, as I'm building an app that mimics a user's browsing session with a website. It has really good inspection of request parameters and responses, and it can even decode encrypted (HTTPS) traffic.
You can use a web debugging proxy (e.g. Fiddler, Charles), a browser add-on (e.g. HttpFox, TamperData), or a packet sniffer (e.g. Wireshark).
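Once one of those tools shows you which fields the "display data" button actually submits, replaying the request with urllib is straightforward. A minimal sketch, where the URL and field names are placeholders for whatever the capture reveals:

    import urllib.parse
    import urllib.request

    # Hypothetical form target and fields; take the real values from the
    # request captured in Fiddler/Wireshark.
    url = "http://example.com/display_data"
    fields = {
        "checkbox1": "on",
        "checkbox2": "on",
        "action": "display data",
    }

    data = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=data)  # supplying data makes it a POST
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")

    print(html[:500])  # start of the page returned by the form submission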