I'm still pretty new to scripting. I'm trying to figure out a way to output a list of URLs after the redirect has occurred. I have about 800 sites in a text file that I want to test for a redirect using a Python script, and I want to output the final redirect to a file (on its own line). Is this possible?
With the file open, I can't figure out how to make urllib2.urlopen() read a line in a text file. It seems to require a URL? Maybe there is another module or something else I should be using instead?
Please help.
Thanks!
I'd use the requests library:
import requests

with open('urls.txt') as url_file:
    for url in url_file:
        resp = requests.get(url.strip())
        print(resp.url)
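Since you want the final URLs written to a file rather than printed, here is a minimal sketch that writes each resolved URL on its own line (the filenames urls.txt and final_urls.txt are placeholders):

import requests

# A sketch: resolve each URL's redirects and write the final URL, one per line.
# requests follows redirects by default, and resp.url holds the post-redirect URL.
with open('urls.txt') as url_file, open('final_urls.txt', 'w') as out_file:
    for url in url_file:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        resp = requests.get(url)
        out_file.write(resp.url + '\n')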
I am trying to: load links from a .txt file, search for a specific word, and if the word exists on that webpage, save the link to another .txt file. But I am getting the error: No scheme supplied. Perhaps you meant http://<_io.TextIOWrapper name='import.txt' mode='r' encoding='cp1250'>?
Note: the links have HTTPS://
The code:
import requests

list_of_pages = open('import.txt', 'r+')
save = open('output.txt', 'a+')
word = "Word"
save.truncate(0)

for page_link in list_of_pages:
    res = requests.get(list_of_pages)
    if word in res.text:
        response = requests.request("POST", url)
        save.write(str(response) + "\n")
Can anyone explain why? Thank you in advance!
Try putting http:// in front of the links.
When you use res = requests.get(list_of_pages), you're creating an HTTP connection to list_of_pages. But requests.get takes a URL string as a parameter (e.g. http://localhost:8080/static/image01.jpg), and look at what list_of_pages is: it's an already opened file, not a string. You have to either use the requests library or the file IO API, not both on the same object.
If you have an already opened file, you don't need to create an HTTP request at all; you don't need requests.get() here. Parse list_of_pages like a normal, local file.
Or, if you would like to go the other way, don't open the text file into list_of_pages; make it a string containing the URL of that file.
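For reference, a minimal corrected sketch (read each line as a URL string, request it, and save the links whose pages contain the word; the filenames and search word are taken from the question):

import requests

word = "Word"

# Read URLs line by line with file IO, then pass each URL *string* to requests.
with open('import.txt') as list_of_pages, open('output.txt', 'w') as save:
    for page_link in list_of_pages:
        page_link = page_link.strip()
        if not page_link:
            continue
        res = requests.get(page_link)  # GET the URL string, not the file object
        if word in res.text:
            save.write(page_link + "\n")  # save the link itself, not the Response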
This is an example URL: "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
If you put it in the browser, a file will start downloading onto your system.
I want to download this file using Python and store it somewhere on my computer.
This is how I tried:
import requests

# first_url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
second_url = "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
myfile = requests.get(second_url, allow_redirects=True)

# this works for the first URL
# open('example.pdf', 'wb').write(myfile.content)

# this didn't work for either of them
# open('example.txt', 'wb').write(myfile.content)

# this works for the second URL
open('example.doc', 'wb').write(myfile.content)
First: if I put first_url in the browser it will download a PDF file, while second_url will download a .doc file. How can I know what type of file the URL will give us, i.e. what type of file will be downloaded, so that I use the correct open(...) call?
Second: if I use the second URL in the browser, a file with the name "T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx" starts downloading. How can I know this file name when I try to download the file?
If you know any better solution, kindly let me know for the implementation.
Kindly have a quick look at the question Downloading Files from URLs and zip downloaded files in python as well.
myfile.headers['content-type'] will give you the MIME type of the URL's content, and myfile.headers['content-disposition'] gives you info like the filename, etc. (if the response contains this header at all).
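A minimal sketch of pulling the filename out of the Content-Disposition header (the regex is an assumption; servers format this header in a few different ways, so a robust parser may need more cases):

import re
import requests

url = "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
myfile = requests.get(url, allow_redirects=True)

# Content-Disposition typically looks like: attachment; filename="some_name.docx"
disposition = myfile.headers.get('content-disposition', '')
match = re.search(r'filename="?([^";]+)"?', disposition)
filename = match.group(1) if match else 'download.bin'  # fallback if header is absent

with open(filename, 'wb') as f:
    f.write(myfile.content)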
You can use the response's Content-Type header: for the first URL it is application/pdf and for the second URL it is application/msword, and you save the file accordingly. You can make an extension dictionary that stores the possible MIME types and their file formats, and match against it. Your second question is essentially the same as this one, so I am taking your two URLs from that question, and for the file names I am just using integers.
import requests

all_urls = ['https://omextemplates.content.office.net/support/templates/en-us/tf16402488.dotx',
            'https://procurement-notices.undp.org/view_file.cfm?doc_id=257280']

extension_dict = {'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx',
                  'application/vnd.openxmlformats-officedocument.wordprocessingml.template': '.dotx',
                  'application/vnd.ms-word.document.macroEnabled.12': '.docm',
                  'application/vnd.ms-word.template.macroEnabled.12': '.dotm',
                  'application/pdf': '.pdf',
                  'application/msword': '.doc'}

for i, url in enumerate(all_urls):
    resp = requests.get(url)
    # the dictionary values already include the leading dot
    file_extension = extension_dict[resp.headers['Content-Type']]
    with open(f"{i}{file_extension}", 'wb') as f:
        f.write(resp.content)
For MIME types, see this answer.
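As an alternative to a hand-written dictionary, the standard library's mimetypes module can guess an extension from a MIME type (a sketch; it covers common types, with a .bin fallback for unknown ones):

import mimetypes
import requests

url = 'https://procurement-notices.undp.org/view_file.cfm?doc_id=257280'
resp = requests.get(url)

# Strip any '; charset=...' suffix before looking up the type.
content_type = resp.headers['Content-Type'].split(';')[0].strip()
extension = mimetypes.guess_extension(content_type) or '.bin'

with open('download' + extension, 'wb') as f:
    f.write(resp.content)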
attach_file is not picking up the file from the absolute URL although the file exists. It's able to pick up an internal URL and send the file, but not the absolute URL:
email.attach_file("http://devuserapi.doctorinsta.com/static/pdfs/Imran_1066.pdf",mimetype="application/pdf")
This file opens when I copy-paste the URL into a browser. What could be the issue?
Thanks in advance
attach_file takes a file from your filesystem, not a URL, so you have to use a local path to it.
See https://docs.djangoproject.com/en/1.9/topics/email/
One, untested, possibility is to use the attach method instead and to download the file on the fly:
import urllib2

response = urllib2.urlopen("http://devuserapi.doctorinsta.com/static/pdfs/Imran_1066.pdf")
email.attach('Imran_1066.pdf', response.read(), mimetype="application/pdf")
It lacks error checking to make sure the file was downloaded, of course, and I haven't actually tried it myself, but that might be an alternative for you.
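For completeness, here is a sketch of the same idea with basic error checking (urllib2.urlopen raises HTTPError for error statuses and URLError for network failures; email is assumed to be the EmailMessage instance from the question):

import urllib2

url = "http://devuserapi.doctorinsta.com/static/pdfs/Imran_1066.pdf"
try:
    response = urllib2.urlopen(url)
    email.attach('Imran_1066.pdf', response.read(), mimetype="application/pdf")
except urllib2.HTTPError as e:
    print "Could not fetch attachment (HTTP %d)" % e.code  # e.g. 404
except urllib2.URLError as e:
    print "Could not reach the server: %s" % e.reason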
This is my first time posting on Stack Overflow, and I look forward to getting more involved with the community. I need to download, rename, and save many Excel files from an ASPX website, but I cannot access these Excel files directly via a URL (i.e., the URL does not end with "excelfilename.csv"). What I can do is go to a URL which initiates the download of the file. An example of the URL is below.
https://www.websitename.com/something/ASPXthing.aspx?ReportName=ExcelFileName&Date=SomeDate&reportformat=csv
The inputs that I want to vary via loops are "ExcelFileName" and "SomeDate". I know one can fetch these files with urllib when the Excel files can be accessed directly via a URL, but how can I do it with a URL like this one?
Thanks in advance for helping out!
Using the requests library, you can fetch the file and iterate over chunks to write it to a file:
import requests

report_names = ["Filename1", "Filename2"]
dates = ['2016-02-22', '2016-02-23']  # as strings

for report_name in report_names:
    for date in dates:
        with open('%s_%s_fetched.csv' % (report_name.split('.')[0], date), 'wb') as handle:
            response = requests.get(
                'https://www.websitename.com/something/ASPXthing.aspx'
                '?ReportName=%s&Date=%s&reportformat=csv' % (report_name, date),
                stream=True)
            if not response.ok:
                continue  # something went wrong; skip this file
            for block in response.iter_content(1024):
                handle.write(block)
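As a side note, requests can also build the query string for you via the params argument, which avoids the manual string interpolation (a sketch; the hostname and values are the placeholders from the question):

import requests

# requests URL-encodes the values and appends them as ?ReportName=...&Date=...
params = {'ReportName': 'Filename1', 'Date': '2016-02-22', 'reportformat': 'csv'}
response = requests.get('https://www.websitename.com/something/ASPXthing.aspx',
                        params=params, stream=True)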
I want to have a function which can save a page from the web into a designated path using urllib2.
The problem with urllib is that it doesn't check for an Error 404; urllib2 can check for HTTP errors, but unfortunately it doesn't have such a save function.
How can I make a function to save the file permanently to a path?
def save(url, path):
    g = urllib2.urlopen(url)
    # *do something to save g to 'path'*
Just use .read() to get the contents and write it to a file path.
import urllib2

def save(url, path):
    g = urllib2.urlopen(url)
    with open(path, "wb") as fH:  # "wb" so binary content (images, PDFs) is written intact
        fH.write(g.read())
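Since the question specifically worries about 404s, here is a sketch with error handling (urllib2.urlopen raises HTTPError for statuses like 404 and URLError for network failures):

import urllib2

def save(url, path):
    try:
        g = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        print "HTTP error %d for %s" % (e.code, url)  # e.g. 404
        return False
    except urllib2.URLError as e:
        print "Failed to reach %s: %s" % (url, e.reason)
        return False
    with open(path, "wb") as fH:
        fH.write(g.read())
    return True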