How to save a file with a URL name? - python

I'm saving an HTTP response as an HTML page.
How can I save the HTML file with the name of the URL?
I'm using Linux, so the file name would look like this: "http://www.test.com.html"
My code:
url = "http://www.test.com"
page = urllib.urlopen(url).read()
f = open("./file.html", "w")
f.write(page)
f.close()

Unfortunately you cannot save a file with a URL as its name: the character "/" is not allowed in file names on Linux (it is the path separator), and both "/" and ":" are disallowed on Windows.
However, you can create a file with the name www.test.com.html with the following line
file_name = url.split('/')[2]
If you need to handle a URL with a path, like https://www.test.com/posts/1, you can replace / with a custom separator that does not normally occur in URLs, such as __
url = 'https://www.test.com/posts/11111111'
file_name = '__'.join(url.split('/')[2:])
will result in
www.test.com__posts__11111111
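Putting the two steps together, here is a small sketch of the same idea using urllib.parse (the helper name and the .html suffix are my own choices, not from the answer above):

```python
from urllib.parse import urlparse

def url_to_filename(url, sep="__"):
    """Join the host and the path segments of a URL into a safe file name."""
    parts = urlparse(url)
    segments = [parts.netloc] + [p for p in parts.path.split("/") if p]
    return sep.join(segments) + ".html"

print(url_to_filename("https://www.test.com/posts/11111111"))
# www.test.com__posts__11111111.html
```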

Related

Python download image from short url by keeping its own name

I would like to download an image file from a shortened or generated URL which doesn't contain the file name.
I have tried to use the Content-Disposition header. However, my file name is not ASCII, so the name can't be printed.
I have found out I can use urlretrieve or requests to download the file, but then I need to save it under a different name.
I want to download it while keeping its own name.
How can I do this?
matches = re.match(expression, message, re.I)
url = matches[0]
print(url)
original = urlopen(url)
remotefile = urlopen(url)
#blah = remotefile.info()['content-Disposition']
#print(blah)
#value, params = cgi.parse_header(blah)
#filename = params["filename*"]
#print(filename)
#print(original.url)
#filename = "filedown.jpg"
#urlretrieve(url, filename)
This is the code I have tried, but none of it works.
I was able to get this to work with the requests library because you can use it to get the URL that the shortened URL redirects to. Then I applied your code to the redirected URL and it worked. There might be a way to do this with only urllib (I assume that's what you are using), but I don't know it.
import requests
from urllib.request import urlopen
import cgi

def getFilenameFromURL(url):
    req = requests.request("GET", url)
    # req.url is now the url the shortened url redirects to
    original = urlopen(req.url)
    value, params = cgi.parse_header(original.info()['Content-Disposition'])
    filename = params["filename*"]
    print(filename)
    return filename

getFilenameFromURL("https://shorturl.at/lKOY3")
You can then use urlretrieve with this. It's inefficient, but it works... Also, since you can get the actual URL with the requests library, you can probably get the filename through there too.
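One caveat: the cgi module used above is deprecated since Python 3.11 and removed in 3.13. The standard email.message machinery can parse the same header, including the RFC 2231 filename*= form that carries non-ASCII names. A minimal sketch of the parsing step (the header values shown are illustrative, not from the question's server):

```python
from email.message import Message

def filename_from_disposition(header_value):
    """Parse a Content-Disposition header value into a filename.
    Message.get_filename() understands both the plain filename= form
    and the RFC 2231 filename*= form that cgi.parse_header exposed."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()

print(filename_from_disposition('attachment; filename="photo.jpg"'))
# photo.jpg
print(filename_from_disposition("attachment; filename*=UTF-8''report.pdf"))
# report.pdf
```

To use it with a shortened URL, fetch the page first (e.g. resp = requests.get(url)) and pass resp.headers.get("Content-Disposition", "") to the helper.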

How to write file name to use URL name in python?

I have an API scan over a large file of URLs; I read each URL and get the result as JSON.
The URLs and domains look like:
google.com
http://c.wer.cn/311/369_0.jpg
How can I build the output file name from the URL, as in ".format(url_scan, dates)"?
If I use a manual name it successfully creates a file, but I want to use the URL names read from the URL text file as the file names.
When the plain domain name is used for the JSON file name, the file is created successfully without errors.
dates = yesterday.strftime('%y%m%d')
savefile = Directory + "HTTP_{}_{}.json".format(url_scan, dates)
out = subprocess.check_output("python3 {}/pa.py -K {} "
                              "--sam '{}' > {}"
                              .format(SCRIPT_DIRECTORY, API_KEY_URL, json.dumps(payload), savefile), shell=True).decode('UTF-8')
result_json = json.loads(out)
with open(RES_DIRECTORY + 'HTTP-aut-20{}.csv'.format(dates), 'a') as f:
    import csv
    writer = csv.writer(f)
    for hits in result_json['hits']:
        writer.writerow([url_scan, hits['_date']])
        print('{},{}'.format(url_scan, hits['_date']))
The error is only displayed when the http URL is used as the JSON file name, so the directory itself is not the problem: every / in the name is interpreted by the system as a directory separator.
[Errno 2] No such file or directory: '/Users/tes/HTTP_http://c.wer.cn/311/369_0.jpg_190709.json'
Most, if not all, operating systems disallow the characters : and / in file names because they have special meaning in file paths (on Unix / is the path separator; on Windows : follows the drive letter). That's why it's giving you an error.
You could replace those characters like this, for example:
filename = 'http://c.wer.cn/311/369_0.jpg.json'
filename = filename.replace(':', '-').replace('/', '_')
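If you ever need to recover the original URL from the file name, percent-encoding is a reversible alternative to replace(). A sketch using only the standard library:

```python
from urllib.parse import quote, unquote

def url_to_safe_name(url):
    """Percent-encode every reserved character (including '/' and ':')
    so the result is a single, legal path component."""
    return quote(url, safe='')

name = url_to_safe_name('http://c.wer.cn/311/369_0.jpg') + '.json'
print(name)
# http%3A%2F%2Fc.wer.cn%2F311%2F369_0.jpg.json

# unquote() reverses the encoding exactly
assert unquote(name[:-len('.json')]) == 'http://c.wer.cn/311/369_0.jpg'
```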

Retrieve filename from the API URL once its downloaded, Python 3.6

I am downloading a file from the API URL http://api.worldbank.org/v2/en/topic/19?downloadformat=csv, and the hit returns the file "API_19_DS2_en_csv_v2_10225248.zip".
The URL above does not contain the file name, unlike the other URL "http://databank.worldbank.org/data/download/SE4ALL_csv.zip", where I can use
ntpath.basename(URL)
How do I get the file name?
The code below works:
r = requests.get(Source_Link)
URL_Metadata = r.headers['Content-Disposition']
Source_File_Name = URL_Metadata[URL_Metadata.find('filename=')+9:]
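The slice works when the header ends with a bare filename= value, but it keeps any surrounding quotes or trailing parameters. A slightly more defensive variant of the same string parsing (the header value shown is what this API plausibly returns, which is an assumption):

```python
def filename_from_disposition(disposition):
    """Extract the filename= value, dropping quotes and trailing parameters."""
    marker = 'filename='
    if marker not in disposition:
        return None
    return disposition.split(marker, 1)[1].split(';')[0].strip('" ')

print(filename_from_disposition(
    'attachment; filename=API_19_DS2_en_csv_v2_10225248.zip'))
# API_19_DS2_en_csv_v2_10225248.zip
```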

Downloading pdf file using Python

I am working on an API which returns a document ID, which I can then use to get the PDF. For example document_id = 'fanlfe48ry4ihkefewfl934'. Now I concatenate this ID like the following:
document_id = 'fanlfe48ry4ihkefewfl934'
full_url = url + document_id
from urllib.request import urlopen
response = urlopen(full_url)
file = open("document.pdf", 'w')
file.write(response.read())
file.close()
But it is getting just an HTML response. The reason is that the URL is not a file with a .pdf extension; it is a URL which, when clicked in a browser, pops up the save dialog and then saves the file as a PDF.
I don't understand how to handle this situation.
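There is no accepted answer here, but one common pattern is to ask the server explicitly for a PDF and to write the response in binary mode; note the question's code opens the file in text mode 'w', which cannot hold PDF bytes. A sketch, assuming this particular API serves the document directly once the Accept header is set (that behaviour is an assumption, not confirmed by the question):

```python
import urllib.request

def looks_like_pdf(data, content_type=""):
    """A response is plausibly a PDF if the header says so
    or the body starts with the %PDF magic bytes."""
    return "pdf" in content_type.lower() or data.startswith(b"%PDF")

def download_pdf(full_url, out_path="document.pdf"):
    req = urllib.request.Request(full_url, headers={"Accept": "application/pdf"})
    with urllib.request.urlopen(req) as response:
        data = response.read()
        content_type = response.headers.get("Content-Type", "")
    if not looks_like_pdf(data, content_type):
        raise ValueError("Got %r instead of a PDF" % content_type)
    with open(out_path, "wb") as f:  # 'wb': PDF content is bytes, not text
        f.write(data)
```

If the server still returns HTML, the check raises instead of silently saving a broken "document.pdf", which makes the redirect or authentication problem visible.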

Writing data to csv or text file using python

I am trying to write some data to a csv file after checking a condition, as below.
I have a list of urls in a text file:
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
Now I go through each url from the text file, read the content of the url using the urllib2 module, and search for a string in the html page. If the required string is found, I write that url to a csv file.
But when I try to write the url to the csv file, it saves each character into its own column, as below, instead of the entire url into one column:
h t t p s : / / w w w......
Code.py
import urllib2
import csv

search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt','r').readlines()
outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv' % search_string, "wb"), delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])

for url in html_urls:
    url = url.replace('\n','').strip()
    if not len(url) == 0:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        if str(search_string) in response.read():
            outputcsv.writerow(url)
So what's wrong with the above code, and what needs to be done to save the entire url (string) into one column of the csv file?
Also, how can we write the same data to a plain text file?
Edited
I also have a url like http://www.vodafone.in/Pages/tuesdayoffers_che.aspx,
which in a browser redirects to http://www.vodafone.in/pages/home_che.aspx?cid=che, but when I tried it through the code below, the result is almost the same as the original url:
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
So, finally, how do I catch the redirected url with urllib2 and fetch the data from it?
Change the last line to:
outputcsv.writerow([url])
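The reason: csv.writer.writerow() expects an iterable of fields, and iterating over a string yields its individual characters, so each character becomes its own column. Wrapping the url in a one-element list makes it a single field. A Python 3 sketch of the difference (the question's code is Python 2, but the behaviour is the same):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

writer.writerow("http://a.io")    # a bare string: one column per character
writer.writerow(["http://a.io"])  # a one-element list: one column total

print(buf.getvalue())
# h,t,t,p,:,/,/,a,.,i,o
# http://a.io
```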
