Error in downloading file with BeautifulSoup - Python

I am trying to download some files from a free dataset using BeautifulSoup.
I repeat the same process for two similar links on the web page.
This is the page address.
import requests
from bs4 import BeautifulSoup
first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt"
second_url="http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"
# labeled as Connectivity Matrix File in the webpage
def download_file(url, file_name):
    myfile = requests.get(url)
    open(file_name, 'wb').write(myfile.content)
download_file(first_url, "file1.txt")
download_file(second_url, "file2.txt")
output files:
file1.txt:
50.118248 53.451775 39.279296
51.417612 67.443649 41.009074
...
file2.txt:
<html><body><h1>Internal error</h1>Ticket issued: umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562</body><!-- this is junk text else IE does not display the page: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx //--></html>
But I can download the second_url from the Chrome browser properly (it contains some numbers).
I tried to set the user-agent:
headers = {'User-Agent': "Chrome/6.0.472.63 Safari/534.3"}
r = requests.get(url, headers=headers)
but it did not work.
Edit
The site does not require a login to get the data. I opened the page in a private browsing window and could still download the file at second_url.
Directly copying the second_url into the address bar gave this error:
Internal error
Ticket issued: umcd/89.41.15.124.2020-04-30.03-18-34.49c8cb58-7202-4f05-9706-3309b581af76
Do you have any idea?
Thank you in advance for any guidance.

This isn't a Python issue. The second URL gives the same error both in curl and in my browser.
It's odd to me that the second URL would be shorter, by the way. Are you sure you copied it right?
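One quick way to confirm this from the Python side (just a small check, reusing the first_url and second_url variables from the question) is to compare the status codes and the start of each response body:
import requests
# If the server itself is failing, the second URL should come back with a 5xx status
# while the first one returns 200 and plain-text numbers.
for name, link in [("first_url", first_url), ("second_url", second_url)]:
    resp = requests.get(link)
    print(name, resp.status_code, resp.text[:60])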

Related

Downloading PDFs using Python web scraping not working

Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
#If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of the files from the maths revision site would be appreciated.
Thanks.
Looking at the page itself, while it may look static, it isn't. The content you are trying to access is gated behind some fancy JavaScript loading. What I did to check that was simply to save the page BS4 actually got and open it in a text editor:
with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for a solution: BS4 cannot execute JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do more complex web scraping.
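If the page really does need its JavaScript to run, one common workaround (not part of the original answer; this sketch assumes Selenium and a matching ChromeDriver are installed) is to let a headless browser render the page and then hand the finished HTML to BeautifulSoup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://mathsmadeeasy.co.uk/gcse-maths-revision/")
# By now the page's JavaScript has run, so the PDF links should exist in the DOM.
soup = BeautifulSoup(driver.page_source, "html.parser")
pdf_links = [a['href'] for a in soup.select("a[href$='.pdf']")]
driver.quit()
From there, the download loop from the question can be reused on pdf_links.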

How to download PDF files in Python that don't end with .pdf

The URL looks like this: https://apps.websitename.com/AccountOnlineWeb/AccountOnlineCommand?command=getBlobImage&image=11/19/2019
I have tried everything, but none of it worked.
import requests
from requests.auth import HTTPBasicAuth
url = 'https://apps.websitename.com/AccountOnlineWeb/AccountOnlineCommand?command=getBlobImage&image=11/19/2019'
s = requests.Session()
r = requests.get(url, allow_redirects=True, auth=HTTPBasicAuth('username', 'password'))
with open('filepath/file.pdf', 'wb') as f:
    f.write(r.content)
I tested getting a .jpg file from the website to make sure the authentication part works, and I downloaded a file from an unauthenticated .pdf URL to make sure downloading PDFs works. But I just cannot download this file.
I used r.is_redirect to test whether the URL redirects to another URL for the PDF, but it returned False.
I should mention that when you open the file manually in a browser, it waits for about 2 seconds and then loads like a regular PDF, which you can download normally.
Currently my code downloads a file that is supposed to be the PDF, but it has 0 KB.
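As a first debugging step (only a sketch, reusing the placeholder URL and credentials from the question), it can help to look at what the server actually returns before writing the bytes to disk:
import requests
from requests.auth import HTTPBasicAuth

url = 'https://apps.websitename.com/AccountOnlineWeb/AccountOnlineCommand?command=getBlobImage&image=11/19/2019'
r = requests.get(url, allow_redirects=True, auth=HTTPBasicAuth('username', 'password'))
# If the PDF is produced after a delay or by JavaScript, these usually give it away:
print(r.status_code)                     # e.g. 200 vs. 302/403
print(r.headers.get('Content-Type'))     # 'application/pdf' vs. 'text/html'
print(len(r.content))                    # 0 means the response body really is empty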

Issue downloading multiple PDFs

After running the following code, I am unable to open the downloaded PDFs. Even though the code ran successfully, the downloaded PDF files are damaged.
My computer's error message is
Unable to open file. It may be damaged or in a format Preview doesn't recognize.
Why are they damaged and how do I solve this?
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"
#If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/ Physics_Solutions'
if not os.path.exists(folder_location):os.mkdir(folder_location)
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
The issue is that you are requesting the GitHub 'blob' link when you need the 'raw' link:
'/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
but you want:
'/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
So just adjust that. Full code below:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"
#If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/Physics_Solutions'
if not os.path.exists(folder_location):os.mkdir(folder_location)
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    pdf_link = link['href'].replace('blob', 'raw')
    pdf_file = requests.get('https://github.com' + pdf_link)
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)
I had to use soup.select("a[href$=.pdf]") (without the inner quotes) to get it to select the links correctly.
After that, your script works, but: what you're downloading is not a PDF, but an HTML webpage! Try visiting one of the URLs: https://github.com/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
You'll be presented with a GitHub webpage, not the actual PDF. To get that, you need the "raw" GitHub URL, which you can see when you hover over the Download button: https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
So, it looks like you just have to replace blob with raw at the proper spot to make it work:
href = link['href']
href = href.replace('/blob/', '/raw/')
f.write(requests.get(urljoin(url, href)).content)
The issue is that the file is not properly closed after the open/write.
Just add f.close() at the end of the code to do that.

how to download with splinter knowing the URL and name of the file

I am working with Python and Splinter. I want to download a file from a click event using Splinter. I wrote the following code:
from splinter import Browser
browser = Browser('chrome')
url = "download link"
browser.visit(url)
I want to know how to download the file with Splinter, knowing the URL and the name of the file.
Splinter is not involved in the download of a file.
Maybe you need to navigate the page to find the exact URL, but then use the regular requests library for the download:
import requests
url = "some.download.link.com"
result = requests.get(url)
with open('file.pdf', 'wb') as f:
    f.write(result.content)
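If the download link still has to be located on the page first, one way to combine the two (a sketch only; the page URL and the CSS selector below are made-up placeholders) is to let Splinter find the href and then hand it to requests:
import requests
from splinter import Browser

browser = Browser('chrome')
browser.visit("https://example.com/downloads")        # placeholder page URL
link = browser.find_by_css("a.download-link").first   # placeholder selector
file_url = link['href']
browser.quit()

result = requests.get(file_url)
with open('file.pdf', 'wb') as f:
    f.write(result.content)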

HTML source code of HTTPS pages is different when fetched manually vs. with HTTPConnection

I'm new to python and I've been trying to get the html source code of 'https' pages. Thanks to a previous question, I am now able to extract part of the source code, but not as much as when I manually open the page and look at the source.
Is there a simple way to fetch the entire code that I see when I open the source of an HTTPS page manually using python?
Here's the code I'm currently using:
import http.client
from urllib.parse import urlparse
url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
p = urlparse(url)
conn = http.client.HTTPConnection(p.netloc)
conn.request('GET', p.path)
resp = conn.getresponse()
text_file = open("google_test_python.txt", "wb")
for i in resp:
    text_file.write(i)
text_file.close()
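Two things stand out in that snippet: the URL is https but HTTPConnection speaks plain HTTP on port 80, and p.path drops the query string. A minimal sketch of the same fetch over TLS with http.client.HTTPSConnection (the requests library would work just as well) might look like:
import http.client
from urllib.parse import urlparse

url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
p = urlparse(url)
path = p.path + ('?' + p.query if p.query else '')  # keep the query string

conn = http.client.HTTPSConnection(p.netloc)  # TLS on port 443
conn.request('GET', path)
resp = conn.getresponse()

with open("google_test_python.txt", "wb") as text_file:
    text_file.write(resp.read())
Even then, the saved HTML may not match what the browser shows for a page like Google, because the browser keeps executing JavaScript after the initial load.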
