This is my first time posting on Stack Overflow, and I look forward to getting more involved with the community. I need to download, rename, and save many Excel files from an ASPX website, but I cannot access these files directly via a URL (i.e., the URL does not end with "excelfilename.csv"). What I can do is go to a URL which initiates the download of the file. An example of such a URL is below.
https://www.websitename.com/something/ASPXthing.aspx?ReportName=ExcelFileName&Date=SomeDate&reportformat=csv
The inputs that I want to vary via loops are "ExcelFileName" and "SomeDate". I know one can fetch these files with urllib when the Excel files can be accessed directly via a URL, but how can I do it with a URL like this one?
Thanks in advance for helping out!
Using the requests library, you can fetch the file and iterate over the response in chunks to write it to a file:
import requests

report_names = ["Filename1", "Filename2"]
dates = ['2016-02-22', '2016-02-23']  # as strings

for report_name in report_names:
    for date in dates:
        with open('%s_%s_fetched.csv' % (report_name.split('.')[0], date), 'wb') as handle:
            response = requests.get('https://www.websitename.com/something/ASPXthing.aspx?ReportName=%s&Date=%s&reportformat=csv' % (report_name, date), stream=True)
            if not response.ok:
                continue  # something went wrong; skip this file
            for block in response.iter_content(1024):
                handle.write(block)
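As an aside, you don't have to splice the query string together by hand. A hedged sketch using the standard library's urlencode (the endpoint and parameter names are the hypothetical ones from the question; passing params=params to requests.get would encode them the same way):

```python
from urllib.parse import urlencode

base_url = "https://www.websitename.com/something/ASPXthing.aspx"

def report_url(report_name, date):
    """Build the fully encoded download URL for one report/date pair."""
    params = {"ReportName": report_name, "Date": date, "reportformat": "csv"}
    return base_url + "?" + urlencode(params)

print(report_url("Filename1", "2016-02-22"))
# -> https://www.websitename.com/something/ASPXthing.aspx?ReportName=Filename1&Date=2016-02-22&reportformat=csv
```

This also takes care of percent-encoding any special characters in the report names or dates.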
I am using Python 3.8.12. I tried the following code to download files from URLs with the requests package, but got an 'Unknown file format' message when opening the zip file. I tested different zip URLs, but all of the downloaded zip files are 18 KB and none of them can be opened successfully.
import requests

file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
file_download = requests.get(file_url, allow_redirects=True, stream=True)
open(save_path + file_name, 'wb').write(file_download.content)
[screenshots: zip file opening error message; zip file sizes]
However, once I updated the url as file_url = 'https://www.td.gov.hk/datagovhk_tis/mttd-csv/en/table41a_eng.csv' the code worked well and the csv file could be downloaded perfectly.
I tried the requests, urllib, wget, and zipfile io packages, but none of them worked.
The reason may be that the zip URL directs to both the zip file and a web page, while the csv URL directs to the csv file only.
I am really new to this field; could anyone help with it? Thanks a lot!
You might examine the headers returned by a HEAD request to get information about the file; inspecting Content-Type reveals the actual type of the file:
import requests
file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
r = requests.head(file_url)
print(r.headers["Content-Type"])
which gives the output
text/html
So the file your URL points to is actually an HTML page.
import wget

url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
#url = 'https://golang.org/dl/go1.17.3.windows-amd64.zip'
wget.download(url)
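As a second line of defense after the Content-Type check, you can sniff the first bytes of whatever was downloaded: real zip archives always start with the signature PK\x03\x04. A small illustrative sketch (the helper name is mine):

```python
def is_zip_bytes(data):
    # Zip archives begin with the local-file-header signature PK\x03\x04
    return data[:4] == b"PK\x03\x04"

print(is_zip_bytes(b"<!DOCTYPE html><html>..."))   # False: an HTML error page
print(is_zip_bytes(b"PK\x03\x04" + b"\x00" * 26))  # True: plausible zip header
```

A check like this catches the 18 KB HTML pages from the question before you ever try to open them as archives.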
I am trying to download a file from a URL using Python. However, it's not working: instead I am getting index.html. Please help with this.
import requests

target_url = "https://transparency-in-coverage.uhc.com/?file=2022-07-01_United-HealthCare-Services_Third-Party-Administrator_EP1-50_C1_in-network-rates.json.gz&origin=uhc"
filename = "2022-07-01_United-HealthCare-Services_Third-Party-Administrator_EP1-50_C1_in-network-rates.json.gz"

with requests.get(target_url, stream=True) as r:
    r.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
That's because the URL you specified is for an HTML page that subsequently starts the download of the .gz file you want.
This is the link for the file:
https://mrfstorageprod.blob.core.windows.net/mrf-even/2022-07-01_ALL-SAVERS-INSURANCE-COMPANY_Insurer_PS1-50_C2_in-network-rates.json.gz?sv=2021-04-10&st=2022-07-05T22%3A19%3A13Z&se=2022-07-09T22%3A19%3A13Z&skoid=89efab61-5daa-4cf2-aa04-ce3ba9d1d1e8&sktid=db05faca-c82a-4b9d-b9c5-0f64b6755421&skt=2022-07-05T22%3A19%3A13Z&ske=2022-07-09T22%3A19%3A13Z&sks=b&skv=2021-04-10&sr=b&sp=r&sig=NaLrw2KG239S%2BpfZibvw7%2B25AAQsf9GYZ1gFK0KRN20%3D&rscd=attachment
To find it, you need to have the inspector open on the 'Network' tab whilst loading the page (or you can click on the file in the list when it loads the list of files on the page). When the download starts you'll see two files pop up, one of which is the actual URL of the .gz file.
It does look like the URL has a timestamp in it, so it might not work at a later time, I don't know.
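Whichever URL you end up using, a quick check of the first bytes tells you whether you really received gzip data rather than an HTML page. A small sketch (the helper name is mine):

```python
import gzip

def is_gzip_bytes(data):
    # gzip streams always start with the two magic bytes 0x1f 0x8b
    return data[:2] == b"\x1f\x8b"

sample = gzip.compress(b'{"example": true}')
print(is_gzip_bytes(sample))        # True
print(is_gzip_bytes(b"<html>..."))  # False
```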
Here is an example URL: "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
If you put it in the browser, a file will start downloading to your system.
I want to download this file using Python and store it somewhere on my computer.
This is how I tried:
import requests

# first_url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
second_url = "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
myfile = requests.get(second_url, allow_redirects=True)

# this works for the first URL
# open('example.pdf', 'wb').write(myfile.content)

# this didn't work for either of them
# open('example.txt', 'wb').write(myfile.content)

# this works for the second URL
open('example.doc', 'wb').write(myfile.content)
First: if I put the first_url in the browser it will download a PDF file, while putting in second_url will download a .doc file. How can I know what type of file the URL will give us, i.e. what type of file will be downloaded, so that I use the correct open(...) call?
Second: if I use the second URL in the browser, a file with the name "T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx" starts downloading. How can I know this filename when I try to download the file?
If you know any better solution, kindly let me know.
Kindly have a quick look at the question Downloading Files from URLs and zip downloaded files in python as well.
myfile.headers['content-type'] will give you the MIME type of the URL's content, and myfile.headers['content-disposition'] gives you info like the filename, etc. (if the response contains this header at all).
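For the filename question specifically, the Content-Disposition value can be parsed with a small regex. A sketch (the helper and the fallback name are mine; the header value is the one from the question):

```python
import re

def filename_from_disposition(value, default="download.bin"):
    # Content-Disposition usually looks like: attachment; filename="report.docx"
    match = re.search(r'filename="?([^";]+)"?', value)
    return match.group(1) if match else default

header = 'attachment; filename="T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx"'
print(filename_from_disposition(header))
# -> T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx
```

Note that fully standards-compliant parsing (RFC 6266, including encoded filename* parameters) is more involved; this covers the common case.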
You can use the response header Content-Type: for the first URL it is application/pdf, and for the second URL it is application/msword, and you save the file accordingly. You can make an extension dictionary that stores possible file formats and their MIME types and match against it. Your second question is the same as this one, so I am taking your two URLs from that question, and for the file names I am just using integers.
import requests

all_urls = ['https://omextemplates.content.office.net/support/templates/en-us/tf16402488.dotx',
            'https://procurement-notices.undp.org/view_file.cfm?doc_id=257280']

extension_dict = {'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx',
                  'application/vnd.openxmlformats-officedocument.wordprocessingml.template': '.dotx',
                  'application/vnd.ms-word.document.macroEnabled.12': '.docm',
                  'application/vnd.ms-word.template.macroEnabled.12': '.dotm',
                  'application/pdf': '.pdf',
                  'application/msword': '.doc'}

for i, url in enumerate(all_urls):
    resp = requests.get(url)
    file_extension = extension_dict[resp.headers['Content-Type']]
    with open(f"{i}{file_extension}", 'wb') as f:
        f.write(resp.content)
for MIME-Type see this answer
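Instead of (or alongside) a hand-rolled dictionary, the standard library's mimetypes module can map a Content-Type value to an extension; a sketch:

```python
import mimetypes

# guess_extension maps a MIME type to a file extension, or None if unknown
print(mimetypes.guess_extension("application/pdf"))     # -> .pdf
print(mimetypes.guess_extension("application/msword"))
```

The explicit dictionary is still useful for types the platform's mimetypes tables don't know about.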
I have a web link which downloads an Excel file directly. It opens a page saying "your file is downloading" and starts downloading the file.
Is there any way I can automate this using the requests module?
I am able to do it with Selenium, but I want it to run in the background, so I was wondering if I can use the requests module.
I have used requests.get, but it simply gives the text, i.e. "your file is downloading", and somehow I am not able to get the file.
This Python 3 code downloads a file from the web into memory:
import requests
from io import BytesIO

url = 'your.link/path'

def get_file_data(url):
    response = requests.get(url)
    f = BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
    f.seek(0)
    return f
data = get_file_data(url)
You can use the following code to read the Excel file:
import pandas as pd
xlsx = pd.read_excel(data, skiprows=0)
print(xlsx)
It sounds like you don't actually have a direct URL to the file, and instead need to engage with some javascript. Perhaps there is an underlying network call that you can find by inspecting the page traffic in your browser that shows a direct URL for downloading the file. With that you can actually just read the excel file URL directly with pandas:
import pandas as pd
url = "https://example.com/some_file.xlsx"
df = pd.read_excel(url)
print(df)
This is nice and tidy, but if you really want to use requests (or avoid pandas) you can download the raw file content as shown in this answer and then use the pyexcel_xlsx package's get_xlsx function to read it without any pandas involvement.
I'm currently creating an app that's supposed to take an input in the form of a URL (here a PDF file), recognize it as a PDF, and then upload it to a tmp folder I have on a server.
I have absolutely no idea how to proceed with this. I've already made a form which contains a FileField, which works perfectly, but when it comes to URLs I have no clue.
Thank you for all answers, and sorry about my lacking English skills.
The first 4 bytes of a PDF file are %PDF, so you could download just the first 4 bytes from that URL and compare them to %PDF. If they match, then download the whole file.
Example:
import urllib.request

url = 'your_url'

req = urllib.request.urlopen(url)
first_four_bytes = req.read(4)

if first_four_bytes == b'%PDF':
    # reuse the already-open connection instead of fetching the URL twice
    pdf_content = first_four_bytes + req.read()
    # save to temp folder
else:
    pass  # file is not a PDF
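For the "save to temp folder" part, the standard library's tempfile module avoids hard-coding a path. A small sketch (the helper name is mine):

```python
import tempfile

def save_to_tmp(content, suffix=".pdf"):
    # NamedTemporaryFile with delete=False keeps the file around after closing
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(content)
        return tmp.name

path = save_to_tmp(b"%PDF-1.4 minimal example")
print(path)  # e.g. /tmp/tmpab12cd34.pdf
```

Remember to remove the file yourself when you are done with it, since delete=False disables automatic cleanup.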