I have an API that scans a large file of URLs, reads each URL, and returns the result as JSON.
The file contains both bare domains and full URLs, for example:
google.com
http://c.wer.cn/311/369_0.jpg
How can I build the output file name from the URL using ".format(url_scan, dates)"?
If I hard-code the name, the file is created successfully, but I want to read every URL name from the URL text file and use it as the file name.
When the entry is a plain domain name, the JSON file name is created without errors.
import csv
import json
import subprocess

dates = yesterday.strftime('%y%m%d')
savefile = Directory + "HTTP_{}_{}.json".format(url_scan, dates)
out = subprocess.check_output("python3 {}/pa.py -K {} "
                              "--sam '{}' > {}"
                              .format(SCRIPT_DIRECTORY, API_KEY_URL, json.dumps(payload), savefile), shell=True).decode('UTF-8')
result_json = json.loads(out)
with open(RES_DIRECTORY + 'HTTP-aut-20{}.csv'.format(dates), 'a') as f:
    writer = csv.writer(f)
    for hits in result_json['hits']:
        writer.writerow([url_scan, hits['_date']])
        print('{},{}'.format(url_scan, hits['_date']))
The error below appears only when a full http URL is used to build the JSON file name, so the directory itself is not the problem: every / in the URL is interpreted by the system as a directory separator.
[Errno 2] No such file or directory: '/Users/tes/HTTP_http://c.wer.cn/311/369_0.jpg_190709.json'
Most, if not all, operating systems disallow the characters : and / in file names because they have special meaning in file paths. That's why you're getting the error.
You could replace those characters like this, for example:
filename = 'http://c.wer.cn/311/369_0.jpg.json'
filename = filename.replace(':', '-').replace('/', '_')
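Applied to the original question, a minimal sketch of that idea (safe_name is a hypothetical helper; Directory, url_scan and dates come from the question's code):

def safe_name(url):
    # replace the characters that are illegal in file names
    return url.replace(':', '-').replace('/', '_')

savefile = Directory + "HTTP_{}_{}.json".format(safe_name(url_scan), dates)
# e.g. /Users/tes/HTTP_http-__c.wer.cn_311_369_0.jpg_190709.json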
I have written a script that downloads from hardcoded URLs and saves to hardcoded file names, whereas I want to take the URLs from a saved text file and generate the names automatically, in chronological order, saving to a specific folder.
My code (works):
import requests

# input urls and filenames
urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc']
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

# defining the inputs
inputs = zip(urls, fns)

# define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

# loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
Trying to fetch the links from the text file (the code raises an error):
import requests

# getting all URLs from the text file
file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
list_of_urls = [(line.strip()).split() for line in file]
file.close()

# input urls and filenames
urls = list_of_urls
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

# defining the inputs
inputs = zip(urls, fns)

# define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

# loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
testing.txt has those 3 links in it, one per line.
Error:
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1979.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1980.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1981.nc']"
PS: I am new to Python, and it would be helpful if someone could advise me on how to loop through the URLs from a text file and save the files individually, in chronological order, instead of hardcoding the names (as I have done).
When you do list_of_urls = [(line.strip()).split() for line in file], you produce a list of lists: for each line of the file, you build the list of URLs on that line, and then you make a list of these lists.
What you want is a flat list of URLs.
You could do
list_of_urls = [url for line in file for url in (line.strip()).split()]
Or:
list_of_urls = []
for line in file:
    list_of_urls.extend(line.strip().split())
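To also avoid hardcoding the file names, here is a minimal sketch (assuming the Downloads folder and testing.txt from the question) that derives each name from the last path segment of its URL, so the files keep their original, chronological names:

import os
import requests

download_dir = r'C:\Users\HBI8\Downloads'  # folder from the question

with open(r'C:\Users\HBI8\Downloads\testing.txt') as f:
    urls = [url for line in f for url in line.strip().split()]

for url in urls:
    fn = os.path.join(download_dir, url.split('/')[-1])  # e.g. pr_1979.nc
    r = requests.get(url)
    with open(fn, 'wb') as out:
        out.write(r.content)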
By far the simplest method in this simple case is to use the OS command line:
go to the working directory C:\Users\HBI8\Downloads
invoke cmd (you can simply type that in the address bar)
write/paste your list using >notepad testing.txt (if you don't already have it there)
Note: these .nc HDF files are not .pdf files.
https://www.northwestknowledge.net/metdata/data/pr_1979.nc
https://www.northwestknowledge.net/metdata/data/pr_1980.nc
https://www.northwestknowledge.net/metdata/data/pr_1981.nc
then run
for /F %i in (testing.txt) do curl -O %i
92 seconds later, the three files are downloaded.
I inserted ',' as a delimiter using the split function, and to generate the file names automatically I used the index numbers of the stored list.
The data is saved in the txt file in the following manner:
FileName | Object ID | Base URL
url_file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
fns = []
list_of_urls = []
for line in url_file:
    stripped_line = line.strip().split(',')  # strip the newline before splitting
    print(stripped_line)
    list_of_urls.append(stripped_line[2] + stripped_line[1])  # Base URL + Object ID
    fns.append(stripped_line[0])
url_file.close()
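From there, a short sketch of how the two lists could drive the download loop (assuming the FileName column holds names or paths that are valid on your machine):

import requests

for fn, url in zip(fns, list_of_urls):
    r = requests.get(url)
    with open(fn, 'wb') as f:
        f.write(r.content)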
I am trying to download the file in Python from the url https://marketdata.theocc.com/position-limits?reportType=change.
I am able to convert it to DataFrame just by using:
df = pd.read_csv('https://marketdata.theocc.com/position-limits?reportType=change')
But I also want to obtain the name of the file. If you download the file directly from a browser, the file is named "POSITIONLIMITCHANGE_20201202.txt".
Can someone suggest an efficient way to do this in Python?
Thanks.
If you are using the requests library, the information about the file is in the response headers (a dictionary):
response = requests.get('https://marketdata.theocc.com/position-limits?reportType=change')
print(response.headers['content-disposition'])
Output:
attachment; filename=POSITIONLIMITCHANGE_20201202.txt
Example code in Python to fetch a file from URL, extract filename, save to local file, and import into Pandas dataframe.
import io
import re

import pandas as pd
import requests

url = 'https://marketdata.theocc.com/position-limits?reportType=change'
r = requests.get(url)

# NOTE: the filename is found in the content-disposition HTTP response header
s = r.headers.get('content-disposition')
# use a regexp that matches only safe filename characters (word characters
# plus the dot); this prevents accepting paths or drive letters as part of the name
m = re.search(r'filename=([\w.]+)', s)
if m:
    filename = m.group(1)
else:
    # set a default if the filename is not provided or contains bad characters
    filename = "out.csv"
print("filename:", filename)

text = r.text
# if you want to write out the file with the provided filename
with open(filename, 'w') as fp:
    fp.write(text)

# to read from the in-memory string, wrap it with io.StringIO()
df = pd.read_csv(io.StringIO(text))
print(list(df.columns))
Output:
filename: POSITIONLIMITCHANGE_20201202.txt
['Equity_Symbol',' ','Start_Date','Start_Pos_Limit','End_Date','End_Pos_Limit','Action']
I am re-framing an existing question for simplicity. I have the following code to download Excel files from a company Share Point site.
import requests
import pandas as pd

def download_file(url):
    filename = url.split('/')[-1]
    r = requests.get(url)
    with open(filename, 'wb') as output_file:
        output_file.write(r.content)

df = pd.read_excel(r'O:\Procurement Planning\QA\VSAF_test_macro.xlsm')
df['Name'] = 'share_point_file_path_documentName'  # I'm appending the SharePoint file path to the document name
file = df['Name']  # I only need the file path column, I don't need the rest of the dataframe

# for loop for download
for url in file:
    download_file(url)
The downloads happen and I don't get any errors in Python, but when I try to open the files, Excel says it cannot open them because the file format or extension is not valid. If I print a link in Jupyter Notebook, it opens correctly, so the issue appears to be with the download.
Check r.status_code. It must be 200, or you have the wrong URL or no permission.
Open a downloaded file in a text editor; it might be an HTML file (Office Online).
If the URL contains a web=1 query parameter, remove it or replace it with web=0.
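A minimal diagnostic sketch along those lines (the URL below is a placeholder):

import requests

url = 'https://example.sharepoint.com/sites/team/doc.xlsm'  # placeholder URL
r = requests.get(url)
print(r.status_code)                  # must be 200
print(r.headers.get('content-type'))  # text/html hints at a login or Office Online page
print(r.content[:4])                  # real .xlsm/.xlsx files start with b'PK' (ZIP signature)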
I'm saving an HTTP request as an HTML page.
How can I save the HTML file under the name of the URL?
I'm using Linux, so the file name would look like this: "http://www.test.com.html".
My code:
url = "http://www.test.com"
page = urllib.urlopen(url).read()
f = open("./file.html", "w")
f.write(page)
f.close()
Unfortunately you cannot save a file under the full URL name: the character "/" is not allowed in file names, because it is the path separator.
However, you can create a file with the name www.test.com.html using the following line:
file_name = url.split('/')[2] + '.html'
If you need to handle URLs like https://www.test.com/posts/1, you can replace / with a custom character that does not usually occur in URLs, such as __:
url = 'https://www.test.com/posts/11111111'
file_name = '__'.join(url.split('/')[2:])
This results in
www.test.com__posts__11111111
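Putting this together with the question's code, a minimal Python 3 sketch (urllib.urlopen from Python 2 becomes urllib.request.urlopen):

import urllib.request

url = 'http://www.test.com'
page = urllib.request.urlopen(url).read()

file_name = url.split('/')[2] + '.html'  # www.test.com.html
with open(file_name, 'wb') as f:         # page is bytes, so open in binary mode
    f.write(page)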
I developed a web crawler to extract the source code of every page linked from a wiki page. The program terminates after writing only a few files.
import urllib2  # Python 2

def fetch_code(link_list):
    for href in link_list:
        response = urllib2.urlopen("https://www.wikipedia.org/" + href)
        content = response.read()
        page = open("%s.html" % href, 'w')
        # note: str.replace takes literal strings, not a regular expression,
        # so this call does not actually strip the special characters
        page.write(content.replace("[\/:?*<>|]", " "))
        page.close()
link_list is an array, which has the extracted links from the seed page.
The error I get after executing is
IOError: [Errno 2] No such file or directory: u'M/s.html'
You cannot create a file with '/' in its name.
You could escape the file name as M%2Fs.html (/ is %2F when percent-encoded).
In Python 2, you can simply use urllib to escape the file name, for example:
import urllib
filePath = urllib.quote_plus('M/s.html')
print(filePath)
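In Python 3, the same function lives in urllib.parse:

from urllib.parse import quote_plus

print(quote_plus('M/s.html'))  # M%2Fs.html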
On the other hand, you could also save the HTTP responses as a directory hierarchy: M/s.html would then mean a file s.html under a directory named 'M'.
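A minimal sketch of that hierarchy approach (Python 3; 'pages' is a hypothetical output directory):

import os

href = 'M/s.html'                       # a name containing a path separator
out_path = os.path.join('pages', href)  # -> pages/M/s.html
os.makedirs(os.path.dirname(out_path), exist_ok=True)  # create pages/M/ first
with open(out_path, 'w') as page:
    page.write('<html>...</html>')      # the fetched content would go here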