As a beginner, I've been heavily warned to avoid resource-heavy browser automation tools such as Selenium for web scraping.
Then I looked at this site: Intcomex Webstore
My idea was to make an alert program to tell me the price and if the item was low in quantity.
I can't for the life of me figure out how one would even attempt to get any of this information, whether through the CSV/EXML files or directly.
I'd like to use requests; however, the page only exposes the link as a JavaScript call: href="javascript:PriceListExportCSV('/en-XUS/Products/Csv','query');"
In Developer Tools after I've clicked the CSV link I see a GET request to http://store.intcomex.com/en-XUS/Products/Csv
However, if I request that URL with requests, I get status_code = 404.
Any help to point me in the right direction is greatly appreciated.
After taking the advice of many helpful commenters, I've come to the conclusion that I indeed need to use a browser such as Selenium.
While it may not be the ideal solution, it appears to be the only viable one at the moment.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://store.intcomex.com/en-XUS/Products/ByCategory/cpt.allone?r=True')
browser.execute_script("javascript:PriceListExportCSV('/en-XUS/Products/Csv','query');")
I'll have to figure it out from here...
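Once the exported CSV is on disk, the alert part could look something like the sketch below. It uses only the standard library. The column names "Sku", "Price" and "Qty" and the threshold of 5 are assumptions for illustration; the real export may use different headers, so check the file first.

```python
import csv
import io


def low_stock_alerts(csv_text, threshold=5,
                     sku_col="Sku", price_col="Price", qty_col="Qty"):
    """Return (sku, price, qty) for every row whose quantity is below `threshold`.

    The column names are hypothetical defaults, not the real export's headers.
    """
    alerts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            qty = int(row[qty_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable quantity
        if qty < threshold:
            alerts.append((row.get(sku_col), row.get(price_col), qty))
    return alerts


# Made-up sample data in the assumed layout.
SAMPLE = "Sku,Price,Qty\nA1,10.50,2\nB2,99.00,40\nC3,5.00,n/a\n"

for sku, price, qty in low_stock_alerts(SAMPLE):
    print(f"{sku}: price {price}, only {qty} left")
```

Prices are left as strings here because the export's number format is unknown.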
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people doing this with GreaseMonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
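To make the "issue the HTTP requests yourself" idea concrete, here is a minimal sketch using only the standard library. The headers are the ones a browser typically sends with an Ajax call; whether a given site checks any of them is an assumption you would have to verify in the browser's network panel. The URLs in the test calls are placeholders.

```python
import json
import urllib.request


def build_xhr_request(url, referer):
    """Build a GET request carrying headers a browser usually sends with an XHR."""
    return urllib.request.Request(
        url,
        headers={
            "X-Requested-With": "XMLHttpRequest",  # set by many JS libraries
            "Referer": referer,                    # the page that made the call
            "Accept": "application/json",
            "User-Agent": "Mozilla/5.0",
        },
    )


def fetch_json(url, referer, timeout=10):
    """Send the request and decode the body as JSON (assumes a JSON response)."""
    request = build_xhr_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the endpoint also needs cookies from an earlier page load, you would have to carry those over as well (for example with a requests.Session).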
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be way easier than reverse engineering the JavaScript/Ajax calls. If needed, you can also use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to everything inside the browser out of the box, and you don't need to care about cookies, SSL and so on.
I found that the IE WebBrowser control has all kinds of quirks and workarounds, enough to justify some high-quality software that takes care of those inconsistencies by layering a framework around the shdocvw.dll API and mshtml.
This seems like a pretty common problem. I wonder why no one has developed a programmatic browser. I'm envisioning a Firefox you can call from the command line with a URL as an argument, which will load the page, run all of the initial page-load JS events and save the resulting file.
I mean Firefox, and other browsers already do this, why can't we simply strip off the UI stuff?
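Headless browser modes driven through Selenium come fairly close to this. Below is a rough sketch, assuming Selenium 4 with Firefox and geckodriver installed; the function names are made up for illustration. Note that driver.get() returns after the page's load event, so content fetched by later Ajax calls may need an explicit wait that is not shown here.

```python
def save_rendered_html(driver, url, out_path):
    """Load `url` in an already-started browser and save the DOM after scripts ran."""
    driver.get(url)            # blocks until the page's load event has fired
    html = driver.page_source  # the DOM as the browser currently sees it
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return html


def main(url, out_path):
    # Imported here so save_rendered_html can be reused with any driver object.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        save_rendered_html(driver, url, out_path)
    finally:
        driver.quit()  # always shut the browser down, even on errors
```

Calling main("http://example.com/", "page.html") would write the rendered page to page.html.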
I'd like to ask somebody with experience with headless browsers and Python whether it's possible to extract the box info with the distance from the closest strike on the webpage below. Until now I was using Python with bs4, but since everything here is driven by jQuery, simply downloading the webpage doesn't work. I found PhantomJS, but I wasn't able to extract it with that either, so I am not sure if it's possible. Thanks for any hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0
This isn't really a Linux question, it's a StackOverflow question, so I won't go into too much detail.
The thing you want to do can easily be done with Selenium. Selenium has both a headless mode and a heady mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4's, but it does have nice visual query (location on screen) functions. So you would write a Python script that initializes Selenium, goes to your website and interacts with it. You may need to do some image recognition on screenshots at some point. It may be as simple as finding a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out which Selenium features you can use to do what you want; that depends on luck and on whether what you want happens to be easy or hard for that particular website.
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
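As a starting point for that trial and error, here is a small sketch using only the standard library. The endpoint and parameter names come from the list above; the values passed in the example are guesses, and I have not worked out what authid, timestamp and hash need to contain. Since there is a callback parameter, the response is probably JSONP, so the second helper strips the function wrapper before parsing.

```python
import json
import re
from urllib.parse import urlencode

SPARK_URL = "https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark"

# Matches a JSONP body such as: someCallback({...});
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def build_spark_url(params):
    """Append URL-encoded query parameters to the endpoint."""
    return SPARK_URL + "?" + urlencode(params)


def parse_jsonp(text):
    """Parse a JSONP body; plain JSON is passed through unchanged."""
    match = _JSONP.match(text)
    return json.loads(match.group(1) if match else text)


# Hypothetical values: copy the real ones from a request captured in the browser.
url = build_spark_url({"location": "49.13688,16.56522", "units": "metric"})
print(url)
```

You would then fetch that URL with requests or urllib and feed the body to parse_jsonp.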
I know there are many similar questions, but I've been through all of those and they couldn't help me. I'm trying to get information from a website, and I've used the same method on other websites with success. Here however, it doesn't work. I would very much appreciate if somebody could give me a few tips!
I want to get the max temperature for tomorrow from this website.
import re, requests, time
from lxml import html
page = requests.get('http://www.weeronline.nl/Europa/Nederland/Amsterdam/4058223')
tree = html.fromstring(page.content)
a = tree.xpath('//*[@id="app"]/div/div[2]/div[5]/div[2]/div[2]/div[6]/div/div/div/div/div/div/ul/div[2]/div/li[1]/div/span/text()')
print(a)
This returns an empty list, however. The same method on a few other websites I checked worked fine. I've tried applying this method on other parts of this website and this domain, all to no avail.
Thanks for any and all help!
Best regards
Notice that when you try to open that page you are asked whether you agree to allow cookies. (It's something like that, I have no Dutch.) You will need to use something like selenium to click on a button to 'OK' that so that you have access to the page that you really want. Then you can use the technique discussed at Web Scrape page with multiple sections to be able to get the HTML for that page, and finally apply whatever xpath it takes to retrieve the content that you want.
My coding experience is in Python. Is there a simple way to execute Python code in Firefox that would detect a particular address, say nytimes.com, load the page, then delete the end of the address following html (this allows bypassing the 20 pageviews/month limit) and reload?
Your best bet is to use Selenium, as proposed before. Here's a small example of how you could do it. Basically, the code checks whether the limit has been reached, and if it has, it deletes the cookies and refreshes the page, letting you continue reading. Deleting cookies lets you read another 10 articles without continuously editing the address. That's the technical part; you have to consider the legal implications yourself.
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('http://www.nytimes.com')
# find_elements returns an empty list instead of raising when nothing matches
if browser.find_elements(By.XPATH, './/*[contains(.,"You’ve reached the limit of 10 free articles a month.")]'):
    browser.delete_all_cookies()
    browser.refresh()
You can use Selenium; it lets you easily and fully control Firefox and other web browsers with Python. It would only take a few lines of code to achieve this. This answer, How to integrate Selenium and Python, has a working example.
I can't delete Firefox cookies from webdriver. When I use the .delete_all_cookies method, it returns None. And when I try to get_cookies, I get the following error:
webdriver_common.exceptions.ErrorInResponseException: Error occurred when processing
packet:Content-Length: 120
{"elementId": "null", "context": "{9b44672f-d547-43a8-a01e-a504e617cfc1}", "parameters": [], "commandName": "getCookie"}
response:Length: 266
{"commandName":"getCookie","isError":true,"response":{"lineNumber":576,"message":"Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMLocation.host]","name":"NS_ERROR_FAILURE"},"elementId":"null","context":"{9b44672f-d547-43a8-a01e-a504e617cfc1} "}
How can I fix it?
Update:
This happens with clean installation of webdriver with no modifications. The changes I've mentioned in another post were made later than this post being posted (I was trying to fix the issue myself).
Hmm, I actually haven't worked with Webdriver so this may be of no help at all... but in your other post you mention that you're experimenting with modifying the delete cookie webdriver js function. Did get_cookies fail before you were modifying the delete function? What happens when you get cookies before deleting them? I would guess that the modification you're making to the delete function in webdriver-read-only\firefox\src\extension\components\firefoxDriver.js could break the delete function. Are you doing it just for debugging, or do you actually want the browser itself to show a pop-up when the driver tells it to delete cookies? It wouldn't surprise me if this modification broke something.
My real advice, though, would be to start using Selenium instead of Webdriver, since Webdriver is being discontinued in its current incarnation, or rather morphed into Selenium. Selenium is more actively developed and has pretty active and responsive forums. It will continue to be developed and stable while the merge is happening, while I take it Webdriver might not get as many bugfixes going forward. I've had success using the Selenium commands that control cookies. They seem to be revamping their documentation and for some reason there isn't any link to the Python API, but if you download Selenium RC, you can find the Python API doc in selenium-client-driver-python. You'll see there are a good 5 or so useful methods for controlling cookies, which you can use in your own custom Python methods if you want to, say, delete all the cookies with a name matching a certain regexp. If for some reason you do want the browser to alert() some info about the deleted cookies too, you could do that by getting the cookie names/values from the Python method and then passing them to Selenium's getEval() statement, which will execute arbitrary JS you feed it (like "alert()"). ... If you do go the Selenium route, feel free to contact me if you hit a blocker; I might be able to assist.
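For the "delete all the cookies with a name matching a certain regexp" case, a helper could be sketched like this. Note the assumption: it is written against the get_cookies() and delete_cookie() methods of the present-day Selenium WebDriver Python bindings, not the Selenium RC client mentioned above, and the function name is made up.

```python
import re


def delete_cookies_matching(driver, pattern):
    """Delete every cookie whose name matches `pattern`; return the deleted names."""
    regex = re.compile(pattern)
    deleted = []
    # get_cookies() returns a list of dicts, each with at least "name" and "value".
    for cookie in driver.get_cookies():
        name = cookie["name"]
        if regex.search(name):
            driver.delete_cookie(name)
            deleted.append(name)
    return deleted
```

The returned names could then be logged, or passed to the browser if you really want it to show them.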