How to download files from a website using PHP, with Python

I have a Python script that crawls various websites and downloads files from them. My problem is that some of the websites seem to be using PHP, at least that's my theory since the URLs look like this: https://www.portablefreeware.com/download.php?dd=1159
The problem is that I can't get any file names or extensions from a link like this and therefore can't save the file. Currently I'm only saving the URLs.
Is there any way to get to the actual file name behind the link?
This is my stripped down download code:
import requests

r = requests.get(url, allow_redirects=True)
with open("name.something", 'wb') as file:
    file.write(r.content)
Disclaimer: I've never done any work with PHP, so please forgive any incorrect terminology or understanding I have of it. I'm happy to learn more, though.

import requests
import mimetypes

response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')
content = response.content
content_type = response.headers['Content-Type']
ext = mimetypes.guess_extension(content_type)  # includes the leading dot, e.g. ".zip"

print(content)       # [ZipBinary]
print(ext)           # .zip
print(content_type)  # application/zip, application/octet-stream

with open("newFile" + ext, 'wb') as f:
    f.write(content)
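If the server sends a Content-Disposition header, the original file name can often be recovered from it as well; here is a minimal sketch (the regex and the fallback name are illustrative, not guaranteed for this site):

import re
import requests

response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')

# Try to pull the file name out of a header like:
# Content-Disposition: attachment; filename="Linux_Reader.exe"
disposition = response.headers.get('Content-Disposition', '')
match = re.search(r'filename="?([^";]+)"?', disposition)
file_name = match.group(1) if match else 'download.bin'  # hypothetical fallback name

with open(file_name, 'wb') as f:
    f.write(response.content)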

With your use of the allow_redirects=True option, requests.get automatically follows the URL in the Location header of the response to make another request, losing the headers of the first response as a result. That is why you can't find the file name information anywhere.
You should instead use the allow_redirects=False option so that you can read the Location header, which contains the actual download URL:
import requests
url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])
This outputs:
https://www.diskinternals.com/download/Linux_Reader.exe
Demo: https://replit.com/#blhsing/TrivialLightheartedLists
You can then make another request to the download URL, and use os.path.basename to obtain the name of the file to which the content will be written:
import os

url = r.headers['Location']
r = requests.get(url)
with open(os.path.basename(url), 'wb') as file:
    file.write(r.content)
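If the file is large, you may prefer to stream the body to disk in chunks instead of holding it all in memory; here is a sketch of the same flow under that assumption:

import os
import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
download_url = r.headers['Location']

# Stream the payload to disk chunk by chunk to keep memory use flat.
with requests.get(download_url, stream=True) as resp:
    with open(os.path.basename(download_url), 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)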

You're using requests for downloading. This doesn't work with downloads of this kind.
Try urllib instead:
import urllib.request
urllib.request.urlretrieve(url, filepath)
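For example, with the URL from the question (the local file name is chosen by hand here, since urlretrieve won't derive one from the redirect for you):

import urllib.request

url = 'https://www.portablefreeware.com/download.php?dd=1159'
# urlretrieve follows the redirect and writes the body to the given path.
urllib.request.urlretrieve(url, 'Linux_Reader.exe')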

You can download the file with the file name taken from the response header.
Here's my code for a download with a progress bar and a chunked write buffer:
To display a progress bar, use tqdm (pip install tqdm).
Chunked writes are used here to save memory during the download.
import os
import requests
import tqdm

url = "https://www.portablefreeware.com/download.php?dd=1159"

# requests.head does not follow redirects, so the Location header survives.
response_header = requests.head(url)
file_path = response_header.headers["Location"]
file_name = os.path.basename(file_path)

with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    total_length = int(response.headers.get("content-length"))
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total_length / 1024, unit="KB"):
        if chunk:
            file.write(chunk)
            file.flush()
Progress output:
6%|▌ | 2848/46100.1640625 [00:04<01:11, 606.90KB/s]
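Note that if the server omits the Content-Length header, int(None) raises a TypeError; here is a variant of the same download that falls back to an indeterminate progress bar in that case:

import os
import requests
import tqdm

url = "https://www.portablefreeware.com/download.php?dd=1159"
file_name = os.path.basename(requests.head(url).headers["Location"])

with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    length_header = response.headers.get("content-length")
    # tqdm treats total=None as "unknown", showing counts instead of percent.
    total = int(length_header) / 1024 if length_header else None
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total, unit="KB"):
        if chunk:
            file.write(chunk)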

Redirects can be bounced anywhere via a distributed DNS network, so while the example answers above show https://www, in my case the URL resolves to Europe and my fastest local source comes in as
https://eu.diskinternals.com/download/Linux_Reader.exe
By far the simplest approach is to try a raw curl first, without bothering to resolve anything; if the result is good, there's no need to inspect or scrape:
curl -o 1159.tmp https://www.portablefreeware.com/download.php?dd=1159
However, I know that in this case that's not the expected result, so the next level is
curl -I https://www.portablefreeware.com/download.php?dd=1159 |find "Location"
and that gives the result shown by others:
https://www.diskinternals.com/download/Linux_Reader.exe
But that's not the full picture, since if we feed that back with
curl.exe -K location.txt
we get
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
</body></html>
hence the nested redirects to
https://eu.diskinternals.com/download/Linux_Reader.exe
All of that can be scripted to run in loops from the command line in a line or two, but I don't use Python, so you will need to write perhaps a dozen lines to do something similar; those lines might look roughly like the sketch at the end of this answer.
C:\Users\WDAGUtilityAccount\Desktop>curl -O https://eu.diskinternals.com/download/Linux_Reader.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 44.9M 100 44.9M 0 0 3057k 0 0:00:15 0:00:15 --:--:-- 3640k
C:\Users\WDAGUtilityAccount\Desktop>dir /b lin*.*
Linux_Reader.exe
And from the help file, yesterday's extra update (Sunday, September 4, 2022):
curl -O https://eu.diskinternals.com/download/Uneraser_Setup.exe
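Those dozen-odd Python lines might look roughly like this — a sketch only, walking the Location headers one hop at a time so every intermediate mirror is visible:

import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'

# Follow each redirect by hand so every intermediate Location is printed.
while True:
    r = requests.get(url, allow_redirects=False)
    if r.status_code in (301, 302, 303, 307, 308) and 'Location' in r.headers:
        url = r.headers['Location']
        print('redirected to', url)
    else:
        break

# url now points at the final (e.g. regional mirror) address; save the body.
with open(url.rsplit('/', 1)[-1], 'wb') as f:
    f.write(r.content)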

Related

Save streaming audio from URL as MP3, or even just audio file from URL as MP3

I am trying to have my server, in Python 3, go grab files from URLs. Specifically, I would like to pass a URL into a function, have the function go grab an audio file (of many varying formats) and save it as an MP3, probably using ffmpeg or ffmpy. If the URL also has a PDF, I would also like to save that as a PDF. I haven't done much research on the PDF side yet, but I have been working on the audio piece and wasn't sure if this was even possible.
I have looked at several questions here, most notably:
How do I download a file over HTTP using Python?
It's a little old, but I tried several methods from it and always got some sort of issue. I have tried using the requests library, urllib, streamripper, and maybe one other.
Is there a way to do this and with a recommended library?
For example, most of the ones I have tried do save something, like the html page, or an empty file called 'file.mp3' in this case.
Streamripper returned a "try changing user agents" error.
I am not sure if this is possible, but I am sure there is something I'm not understanding here. Could someone point me in the right direction?
This isn't necessarily the code I'm trying to use, just an example of something I have used that doesn't work.
import requests

url = "http://someurl.com/webcast/something"
r = requests.get(url)
with open('file.mp3', 'wb') as f:
    f.write(r.content)

# Retrieve HTTP meta-data
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
Edit:
import requests
import ffmpy
import datetime
import os

## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE audio/mpeg, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.MP3
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE application/pdf, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.PDF
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE other than application/pdf OR
## audio/mpeg, THE FILE WILL NOT BE SAVED

def BordersPythonDownloader(url):
    print('Beginning file download requests')
    r = requests.get(url, stream=True)
    contype = r.headers['content-type']
    if contype == "audio/mpeg":
        print("audio file")
        filename = '[{}].mp3'.format(str(datetime.datetime.now()))
        with open('file.mp3', 'wb+') as f:
            f.write(r.content)
        ff = ffmpy.FFmpeg(
            inputs={'file.mp3': None},
            outputs={filename: None}
        )
        ff.run()
        if os.path.exists('file.mp3'):
            os.remove('file.mp3')
    elif contype == "application/pdf":
        print("pdf file")
        filename = '[{}].pdf'.format(str(datetime.datetime.now()))
        with open(filename, 'wb+') as f:
            f.write(r.content)
    else:
        print("URL DID NOT RETURN AN AUDIO OR PDF FILE, IT RETURNED {}".format(contype))

# INSERT YOUR URL FOR TESTING
# OR CALL THIS SCRIPT FROM ELSEWHERE, PASSING IT THE URL

# DEFINE YOUR URL
#url = 'http://archive.org/download/testmp3testfile/mpthreetest.mp3'

# CALL THE SCRIPT, PASSING IT YOUR URL
#x = BordersPythonDownloader(url)

# ANOTHER EXAMPLE WITH A PDF
#url = 'https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SY/configuration/guide/sy_swcg/etherchannel.pdf'
#x = BordersPythonDownloader(url)
Thanks Richard, this code works and helps me understand this better. Any suggestions for improving the above working example?
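One possible refinement, offered as a sketch rather than a drop-in replacement: stream the body to disk in chunks instead of holding it in memory, and keep colons out of the timestamped file name (':' is not valid in Windows file names):

import datetime
import requests

def timestamp_name(ext):
    # Colon-free timestamp, since ':' is not valid in Windows file names.
    stamp = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    return '[{}].{}'.format(stamp, ext)

def download_chunked(url, filename, chunk_size=8192):
    # Write the response piece by piece to keep memory use flat.
    with requests.get(url, stream=True) as r:
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)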

django how to download a file from the internet

I want to have a user input a file URL and then have my django app download the file from the internet.
My first instinct was to call wget inside my django app, but then I thought there may be another way to get this done. I couldn't find anything when I searched. Is there a more django way to do this?
You are not really dependent on Django for this.
I happen to like using the requests library.
Here is an example:
import requests

def download(url, path, chunk=2048):
    req = requests.get(url, stream=True)
    if req.status_code == 200:
        with open(path, 'wb') as f:
            for piece in req.iter_content(chunk):
                f.write(piece)
        return path
    raise Exception('Given url returned status code: {}'.format(req.status_code))
Place this in a file and import it into your module whenever you need it.
Of course this is very minimal but this will get you started.
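A quick usage sketch (the URL and destination path are placeholders):

# Hypothetical values; substitute your own URL and destination path.
saved = download('http://www.example.com/files/report.pdf', '/tmp/report.pdf')
print(saved)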
You can use urlopen from urllib2 like in this example:
import urllib2

pdf_file = urllib2.urlopen("http://www.example.com/files/some_file.pdf")
with open('test.pdf', 'wb') as output:
    output.write(pdf_file.read())
For more information, read the urllib2 docs.
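Note that urllib2 exists only on Python 2; on Python 3 the same call lives in urllib.request:

import urllib.request

pdf_file = urllib.request.urlopen("http://www.example.com/files/some_file.pdf")
with open('test.pdf', 'wb') as output:
    output.write(pdf_file.read())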

Image scraped as HTML page with urlretrieve

I'm trying to scrape this image using urllib.urlretrieve.
>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
...                    path)  # path was previously defined
This code successfully saves the file in the given path. However, when I try to open the file, I get:
Could not load image 'imagename.jpg':
Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)
When I run file imagename.jpg in my bash terminal, I get imagename.jpg: HTML document, ASCII text.
So how do I scrape this image as a JPEG file?
It's because the owner of the server hosting that image is deliberately blocking access from Python's urllib. That's why it's working with requests. You can also do it with pure Python, but you'll have to give it an HTTP User-Agent header that makes it look like something other than urllib. For example:
import urllib2

req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
req.add_header('User-Agent', 'Feneric Was Here')
resp = urllib2.urlopen(req)
imgdata = resp.read()
with open(path, 'wb') as outfile:
    outfile.write(imgdata)
So it's a little more involved to get around, but still not too bad.
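For comparison, requests succeeds here because it sends its own default User-Agent rather than urllib's; the same fetch with an explicit header would look roughly like this (the header value is arbitrary):

import requests

headers = {'User-Agent': 'Feneric Was Here'}  # any non-urllib value should do
resp = requests.get('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
                    headers=headers)
with open(path, 'wb') as outfile:
    outfile.write(resp.content)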
Note that the site owner probably did this because some people had gotten abusive. Please don't be one of them! With great power comes great responsibility, and all that.

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 KB when it should be around 700 KB, and it contains a list of the contents of the page (i.e. all the files available for download). I am able to download it manually without a problem. I have tried urllib and urllib2 and different configurations of each, but each does the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code, but I have given an example of each here to demonstrate.
import urllib2

urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use the Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
I was able to download what seems like the correct size for your file with the following Python 2 code:
import urllib2
filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: the post above me answered faster and is much better. I thought I'd post this anyway, since I had already tested and written it out.

Can't print the result of POSTing all files in a folder in Python?

I have a Python script that tries to POST all the files in a certain directory with the extension ".csv" to a URL, and then print the result (using Requests and Glob):
import requests, glob
text_files = glob.iglob("./user/Documents/folder/folder/*.csv")
url = 'http://myWebsite.com/extension/extension/extension'
for data in text_files:
    current_file = open(data)
    r = requests.post(url, files=current_file)
    print r.text
However, nothing is printed, even though POSTing the same files in Terminal via cURL produces an output specific to the server. I can't figure out why but am guessing that I'm somehow implementing glob incorrectly.
The reason nothing is printed is simple:
You are posting the file up, and it gets posted, but is there any script on the server side that would actually generate some response text?
Probably the script on the server side simply accepts the file and, by responding HTTP 200 OK, tells you that all is fine. Note that the HTTP 200 OK status is not part of r.text.
Do a simple test: post the file from the command line using curl, or post just one file via requests, and see if there is really some response. It is likely that there is (naturally) no response text.
By the way, you should take care to close your files after you open them:
for data in text_files:
    with open(data) as current_file:
        r = requests.post(url, files=current_file)
        print r.text
Fixed via trial and error. I apologize if I'm incorrectly answering my own question (as far as etiquette is concerned, that is) and will try to give credit where it is due. Credit to @Martijn Pieters for suggesting that something was wrong with the path to the directory. I went to where the .csv files are and got the path to them manually via "Get Info" (I'm on a Mac). The correct path should have been:
/Users/ME/Documents/folder/folder/*.csv
Also, current_file = open(data) should be replaced with current_file = {'file': open(data)}. The problem here is that requests expects files to be a mapping of field names to file objects, not a bare file object, and will otherwise throw a plethora of errors. New code:
import requests, glob
text_files = glob.iglob("/Users/ME/Documents/folder/folder/*.csv")
url = 'http://myWebsite.com/extension/extension/extension'
for data in text_files:
    with open(data) as f:
        file2 = {'file': f}
        r = requests.post(url, files=file2)
        print r.text
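If the server cares about the uploaded file's name or content type, requests also accepts a (filename, fileobj, content_type) tuple per field; a small sketch building on the loop above (the 'text/csv' MIME type is an assumption):

import os

for data in text_files:
    with open(data, 'rb') as f:
        # The tuple lets the server see a real file name and content type.
        file2 = {'file': (os.path.basename(data), f, 'text/csv')}
        r = requests.post(url, files=file2)
        print r.text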

Categories

Resources