Downloading a file using Python BeautifulSoup

I am writing a web scraping script in Python and I have to download a file. On the website, there is an image inside a link, <a href="javascript:DownloadMyFile();">. When I click it, it calls a function that opens the dialog to save or open the file. How do I download the file directly from Python using Beautiful Soup?

BeautifulSoup only parses HTML; it does not execute JavaScript, so you have two options:
Figure out what the JavaScript does to generate the URL that is handed to your browser when you click that anchor, then request that URL yourself.
Write your scraper with a tool that can execute JavaScript, such as CasperJS.
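A minimal sketch of the first option, using only the standard library. The sample HTML, the body of DownloadMyFile, and the /files/report.pdf path are invented for illustration; a real page may build its URL quite differently (string concatenation, an AJAX call, a form submit), in which case the browser's network tab is the more reliable place to look.

```python
import re

# Hypothetical page: the anchor only names a JavaScript function; the
# real URL lives inside that function's body.
SAMPLE_HTML = """
<a href="javascript:DownloadMyFile();"><img src="dl.png"></a>
<script>
function DownloadMyFile() {
    window.location = "/files/report.pdf";
}
</script>
"""


def javascript_handlers(html):
    """Return the names of functions called from javascript: hrefs."""
    return re.findall(r'href=["\']javascript:\s*(\w+)\s*\(', html)


def url_literals_in_function(html, name):
    """Return quoted strings found in the body of `function name(...) {...}`.

    Naive on purpose: the body is taken up to the first closing brace,
    so nested blocks are not handled.
    """
    pattern = r'function\s+' + re.escape(name) + r'\s*\([^)]*\)\s*\{(.*?)\}'
    match = re.search(pattern, html, re.S)
    if match is None:
        return []
    return re.findall(r'["\']([^"\']+)["\']', match.group(1))
```

A relative path found this way still has to be resolved against the page address, for example with urllib.parse.urljoin, before it can be requested.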

Related

How to automatically download the files that have a download button on a webpage without BeautifulSoup?

I have created a Streamlit app that has a download button to download a CSV file. I want to download the content automatically using "GET" or "POST" requests. Is there any way to do that?
Here is my app URL:
https://maalaei97-test2.hf.space/?__theme=light
I tried to use urllib2.Request("GET", URL) but I receive None. I should mention that I run the Python script inside software that only supports IronPython, so I do not have access to libraries like BeautifulSoup and requests. I do have access to the urllib2 and webbrowser libraries.
I've checked your website. The button uses JavaScript to open a new window and download your CSV. The actual URL in your case is:
https://maalaei97-test2.hf.space/media/c721b14394345aab14989004e9ce8a3bae6fbb02a8e8d6a41a4f5401.csv?title=app%20%C2%B7%20Streamlit
I think you can download this with urllib and a plain GET request. If you want to click that button from Python instead, you need something that can run JavaScript.
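A rough sketch of that GET request, written for Python 3's urllib.request. The function name and the destination file name are made up for illustration. Under Python 2 or IronPython the corresponding call would be urllib2.urlopen(url), and the response would have to be closed explicitly because it is not a context manager there.

```python
import shutil
import urllib.request


def download(url, dest_path, timeout=30):
    """Stream the response body for `url` into `dest_path` and return it."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        # Copy in chunks so large files are not held in memory at once.
        shutil.copyfileobj(response, out)
    return dest_path


# Example call (not run here); csv_url stands for the media URL quoted above:
# download(csv_url, "app_data.csv")
```

Whether that media URL stays valid between sessions is an open assumption, so the script may need to look it up again before each download.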

Part of the HTML is missing when fetching a page with requests in Python [duplicate]

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). On the site, this function outputs some HTML when you press a button. How can I "press" this button from Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it in the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work, since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them, you can use the following code as an example to fetch the fully rendered page.
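from bs4 import BeautifulSoup
from selenium import webdriver

# Note: PhantomJS support was removed in Selenium 4, so this needs an older Selenium.
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source holds the HTML after the browser has loaded the page and run its JavaScript
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()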
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware that can scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.

Getting html of Facebook page using urllib2

I am writing a Python script that can take a Facebook URL and locally save an HTML file of that Facebook page. Based on the answer to this question: Inherent way to save web page source
I tried using urllib2, but the resulting HTML file is different (missing some parts) compared to the HTML file that I get from manually right-clicking on the Facebook page and saving the entire webpage. Do you know why they would be different and what other Python libraries I could use instead of urllib2?

Downloading dynamic web pages in python

I use the python requests (http://docs.python-requests.org/en/latest/) library in order to download and parse particular web pages. This works fine as long as the page is not dynamic. Things look different if the page under consideration uses javascript.
In particular, I am talking about a web page that automatically loads more content once you have scrolled to the bottom of the page, so that you can continue scrolling. This new content is not included in the page's source text, so I can't download it.
I thought about simulating a browser in python (selenium) - is this the right way to go?

Viewing whole webpage with beautifulsoup

I am scraping a website, but I only get a portion of the page; at the bottom it has a "view more" button. Is there any way to view everything on the webpage via Python?
BeautifulSoup just parses the returned HTML. It doesn't execute JavaScript, which is often used to load new content or to modify the existing webpage after it has loaded.
You'll need to execute the JavaScript, which requires more than just an HTML parser. You basically need to use a browser. There are a few Python packages to do this:
Selenium
Ghost.py
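As an illustration of the browser approach, here is a hedged sketch for a "view more" button driven through Selenium. The helper name, the selector, and the fixed pause are assumptions, not anything taken from the site in question. The helper relies only on find_elements, is_displayed, click, and page_source, so the driver is created by the caller.

```python
import time


def click_until_gone(driver, css_selector, max_clicks=20, pause=1.0):
    """Click the element matching `css_selector` until it is no longer shown.

    `driver` is expected to be a Selenium WebDriver. Returns the final
    page HTML, which can then be handed to BeautifulSoup.
    """
    for _ in range(max_clicks):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        buttons = driver.find_elements("css selector", css_selector)
        visible = [b for b in buttons if b.is_displayed()]
        if not visible:
            break
        visible[0].click()
        time.sleep(pause)  # crude wait for the new content to load
    return driver.page_source


# Example use (not run here; page_url and the selector are hypothetical):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get(page_url)
# html = click_until_gone(driver, "button.view-more")
# driver.quit()
```

A fixed sleep is the simplest possible wait. On slow pages an explicit wait on the number of loaded items would probably be more robust.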
