I am working with Python and Splinter. I want to download a file triggered by a click event using Splinter. I wrote the following code:
from splinter import Browser
browser = Browser('chrome')
url = "download link"
browser.visit(url)
I want to know how to download the file with Splinter, knowing its URL and file name.
Splinter itself is not involved in downloading a file.
Maybe you need to navigate the page to find the exact URL, but then use the regular requests library for the download:
import requests

url = "http://some.download.link.com"
result = requests.get(url)
with open('file.pdf', 'wb') as f:
    f.write(result.content)
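For larger files it can be worth streaming the response to disk instead of holding it all in memory. A minimal sketch of that approach (the URL, file name, and chunk size here are placeholders, not from the original question):

```python
import requests

def download_file(url, path, chunk_size=8192):
    # stream=True keeps the body out of memory until we iterate over it
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path
```

For small files the simple `result.content` version above is fine; streaming only matters when the download may not fit comfortably in memory.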
Related
I am trying to download some files from a free dataset using Beautifulsoup.
I repeat the same process for two similar links on the web page.
This is the page address.
import requests
from bs4 import BeautifulSoup
first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt"
second_url="http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"
# labeled as Connectivity Matrix File in the webpage
def download_file(url, file_name):
    myfile = requests.get(url)
    with open(file_name, 'wb') as f:
        f.write(myfile.content)
download_file(first_url, "file1.txt")
download_file(second_url, "file2.txt")
output files:
file1.txt:
50.118248 53.451775 39.279296
51.417612 67.443649 41.009074
...
file2.txt
<html><body><h1>Internal error</h1>Ticket issued: umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562</body><!-- this is junk text else IE does not display the page: (long run of padding characters trimmed) //--></html>
But I can download second_url from the Chrome browser properly (the file contains some numbers).
I tried to set user-agent
headers = {'User-Agent': "Chrome/6.0.472.63 Safari/534.3"}
r = requests.get(url, headers=headers)
but it did not work.
Edit
The site does not require a login to get the data. I opened the page in a private browsing window and could still download the file at second_url.
Directly copying second_url into the address bar gave an error:
Internal error
Ticket issued: umcd/89.41.15.124.2020-04-30.03-18-34.49c8cb58-7202-4f05-9706-3309b581af76
Do you have any idea?
Thank you in advance for any guidance.
This isn't a Python issue. The second URL gives the same error in both curl and my browser.
It's odd to me that the second URL would be shorter, by the way. Are you sure you copied it right?
I am trying to scrape and SAVE pdf files that automatically start to download once you click on the URL, such as: https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5b0a3b62d&appId=PPGMS
I have been trying with urllib but with no success.
Given that the download is initiated by JavaScript, the most universal solution is to use a browser that actually executes the JavaScript.
A Selenium driver with headless PhantomJS should do the trick in the general case.
In this particular case (for this page that is) the code that executes the download is rather simple:
<script type="text/javascript">
$('document').ready(function(){
window.location='https://ec.europa.eu/research/participants/documents/downloadPublic/NXBvSk9oSlVwSFhueUcxNlJDUnNOSGVnOEpNWkVvWDlveDFoalRUb3E2VC8yVHlIU3hYMFVBPT0=/attachment/VFEyQTQ4M3ptUWNRa2R4dEZ6MkU3endWb2dWSDJHNTM=';
});
</script>
You can download the page first, parse the URL starting with window.location, and then download the file it points to (just make sure you include the cookies returned with the HTML page). This would be brittle, as any change to the implementation of this page may break it.
Here's how this can be done with requests:
import re
import requests

# use a session so cookies set by the first response are sent with the second request
s = requests.Session()
response = s.get('https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5b0a3b62d&appId=PPGMS')

# pull the redirect target out of the inline script
url_pattern = re.compile(r"window\.location='(?P<url>.*)';")
html = response.text
match_result = url_pattern.search(html)
url = match_result.group('url')

# fetch the actual PDF and save it
content_response = s.get(url)
file_content = content_response.content
with open('/tmp/file.pdf', 'wb') as f:
    f.write(file_content)
Have you tried this?
from urllib.request import urlretrieve

for link in link_list:
    # without a second argument the file is saved under a generated temporary name;
    # pass one, e.g. urlretrieve(link, 'report.pdf'), to choose the destination
    urlretrieve(link)
I am trying to scrape this web page using the Python requests library.
But I am not able to download the complete HTML source code. When I use my web browser to inspect elements, it shows the complete HTML, which I believe can be used for scraping, but when I access this URL from Python, the HTML tags that hold the data simply disappear and I am not able to scrape them. Here is my sample code:
import urllib.request
from bs4 import BeautifulSoup as BS

url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent = 'my-user-agent'
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BS(html, 'html.parser')
Can anybody please help me out? Thanks.
The page is likely being built by JavaScript, meaning the site sends the same source you are pulling with urllib, and the browser then executes the JavaScript, modifying the DOM to render the page you are seeing.
You will need to use something like Selenium, which will open the page in a browser, render the JS, and then return the source, e.g.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
driver.page_source # or driver.execute_script("return document.body.innerHTML;")
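Once `driver.page_source` returns the rendered HTML, it is just a string and can be handed to BeautifulSoup as usual. A sketch using a hard-coded snippet in place of a live driver (the class names and markup here are made up for illustration, not taken from the actual site):

```python
from bs4 import BeautifulSoup

# stand-in for driver.page_source after the JavaScript has run
rendered_html = """
<div class="course-list">
  <a class="course" href="/course/a/">Financial Analysis 101</a>
  <a class="course" href="/course/b/">Valuation Basics</a>
</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
# collect the link texts, stripped of surrounding whitespace
titles = [a.get_text(strip=True) for a in soup.select("a.course")]
# titles is now ["Financial Analysis 101", "Valuation Basics"]
```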
I recommend using the stdlib module urllib2; it lets you comfortably fetch web resources. (Note that urllib2 is Python 2 only; in Python 3 the equivalent is urllib.request.)
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
AND...
For parsing the code, have a look at BeautifulSoup.
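For instance, once `page_source` holds the HTML, pulling out the page title is a one-liner. A sketch with a hard-coded string standing in for the response:

```python
from bs4 import BeautifulSoup

# stand-in for the page_source string fetched above
page_source = "<html><head><title>Google</title></head><body></body></html>"

soup = BeautifulSoup(page_source, "html.parser")
title = soup.title.string  # "Google"
```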
Thanks to you both. @blakebrojan, I tried your method, but it opened a new Chrome instance and displayed the result there. What I want is to get the source code in my code and scrape data from it. Here is the code:
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html=driver.page_source
I am trying to do the following via python:
From this website:
http://www.bmf.com.br/arquivos1/arquivos_ipn.asp?idioma=pt-BR&status=ativo
I would like to check the 4th checkbox and then click on Download image.
That is what I did:
import urllib2
import urllib
url = "http://www.bmf.com.br/arquivos1/arquivos_ipn.asp?idioma=pt-BR&status=ativo"
payload = {"chkArquivoDownload3_ativo":"1"}
data = urllib.urlencode(payload)
request = urllib2.Request(url, data)
print request
response = urllib2.urlopen(request)
contents = response.read()
print contents
Does anyone have any suggestions?
Selenium is a great project; it lets you control a Firefox browser with Python. Something like this:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://www.bmf.com.br/arquivos1/arquivos_ipn.asp?idioma=pt-BR&status=ativo')
browser.find_element_by_id('chkArquivoDownload3').click()
browser.find_element_by_id('imgSubmeter_ativo').click()
browser.quit()
would probably work.
Web browsers are a complex collection of components that interact together.
Python does not have a web browser built in (in particular, a DOM or JavaScript engine); it simply downloads an HTML file that would normally interact with that DOM and JavaScript in your browser.
The easiest method I foresee:
1. Parse the string using the Python module BeautifulSoup.
2. Manually make the download request with the information you have parsed.
3. Save the downloaded image to a file.
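The steps above can be sketched roughly as follows. This is a sketch only: the checkbox field name is taken from the question's code, the output file name is invented, and the real form action and hidden fields would need to be read from the parsed page rather than assumed:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

URL = "http://www.bmf.com.br/arquivos1/arquivos_ipn.asp?idioma=pt-BR&status=ativo"

def fetch_image(url=URL):
    # 1. parse the page to locate the form and its field names
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    form = soup.find("form")  # inspect form.get("action") for the real endpoint

    # 2. manually make the download request with the checkbox set
    #    (field name copied from the question; not verified against the page)
    data = urlencode({"chkArquivoDownload3_ativo": "1"}).encode()
    response = urlopen(Request(url, data))

    # 3. save the downloaded image to a file
    with open("download.img", "wb") as f:
        f.write(response.read())
```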
I have a URL, for example
url = "www.example.com/file/processing/path/excefile.xls"
This URL downloads an Excel file directly when I paste it into a browser.
How can I use Python to download this file? That is, when I run the Python code, it should fetch the file at the above URL and save the Excel file.
If you don't necessarily need to go through the browser, you can use the urllib module to save a file to a specified location.
import urllib

url = 'http://www.example.com/file/processing/path/excelfile.xls'
local_fname = '/home/John/excelfile.xls'
filename, headers = urllib.urlretrieve(url, local_fname)
http://docs.python.org/library/urllib.html#urllib.urlretrieve
Use the webbrowser module:
import webbrowser
webbrowser.open(url)
You should definitely look into the awesome requests lib.