Using urlopen to open list of urls - python

I have a python script that fetches a webpage and mirrors it. It works fine for one specific page, but I can't get it to work for more than one. I assumed I could put multiple URLs into a list and then feed that to the function, but I get this error:
Traceback (most recent call last):
  File "autowget.py", line 46, in <module>
    getUrl()
  File "autowget.py", line 43, in getUrl
    response = urllib.request.urlopen(url)
  File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 361, in open
    req.timeout = timeout
AttributeError: 'tuple' object has no attribute 'timeout'
Here's the offending code:
url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*url):
    response = urllib.request.urlopen(url)
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

getUrl()
I've exhausted Google trying to find how to open a list with urlopen(). I found one way that sort of works. It takes a .txt document and goes through it line-by-line, feeding each line as a URL, but I'm writing this using Python 3 and for whatever reason twillcommandloop won't import. Plus, that method is unwieldy and requires (supposedly) unnecessary work.
Anyway, any help would be greatly appreciated.

There are a couple of errors in your code:
You define getUrl with a variable-length argument list, so the url argument arrives inside the function as a tuple (the tuple in your traceback);
You then pass that tuple to urlopen() as if it were a single URL string.
You can try this code instead (note it uses urllib.request, since urllib2 does not exist on Python 3):
import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(urls):
    for url in urls:
        # Build a simple file name from the URL string
        file_name = url.replace('https://', '').replace('http://', '').replace('.', '_').replace('/', '_')
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)

urlopen() does not accept a tuple:
urllib.request.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
Your call is also incorrect. It should be:
getUrl(url[0], url[1], url[2])
Inside the function, use a loop like for u in url to walk through all the URLs, as in the sketch below.
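For illustration, here is a minimal sketch of that approach that keeps the *url signature (the page-N.html naming scheme is just an assumption):
import shutil
import urllib.request

url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*urls):
    # urls arrives as a tuple, so loop over its elements one at a time
    for i, u in enumerate(urls):
        file_name = 'page-%d.html' % i  # hypothetical naming scheme
        with urllib.request.urlopen(u) as response, open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(*url)  # equivalent to getUrl(url[0], url[1], url[2])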

You should just iterate over your URLs using a for loop:
import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/']

def fetch_urls(urls):
    for i, url in enumerate(urls):
        file_name = "page-%s.html" % i
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

fetch_urls(urls)
I assume you want the content saved to separate files, so I used enumerate here to create a unique file name, but you can obviously use anything from hash() or the uuid module to creating slugs.
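For example, a sketch of the same loop using the uuid module instead (the naming scheme is just one option):
import shutil
import urllib.request
import uuid

def fetch_urls(urls):
    for url in urls:
        # uuid4 gives a practically collision-free name independent of list order
        file_name = "page-%s.html" % uuid.uuid4().hex
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)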

Related

Python-3 Trying to iterate through a csv and get http response codes

I am attempting to read a CSV file that contains a long list of URLs. I need to iterate through the list and collect the URLs that return a 301, 302, or 404 response. When testing the script I get "exited with code 0", so I know it runs without errors, but it is not doing what I need it to. I am new to Python and to working with files; my experience has been primarily UI automation. Any suggestions would be gladly appreciated. Below is the code.
import csv
import requests
import responses
from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open('redirect.csv', 'r')
contents = []

with open('redirect.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

def run():
    resp = urllib.request.urlopen(url)
    print(self.url, resp.getcode())

run()
print(run)
Given a CSV similar to the following (where the heading is URL):
URL
https://duckduckgo.com
https://bing.com
You can do something like this using the requests library.
import csv
import requests

with open('urls.csv', newline='') as csvfile:
    errors = []
    reader = csv.DictReader(csvfile)
    # Iterate through each line of the csv file
    for row in reader:
        try:
            # allow_redirects=False is needed so 301/302 responses are
            # seen directly instead of being followed automatically
            r = requests.get(row['URL'], allow_redirects=False)
            if r.status_code in [301, 302, 404]:
                # print(f"{r.status_code}: {row['URL']}")
                errors.append([row['URL'], r.status_code])
        except requests.exceptions.RequestException:
            pass
Uncomment the print statement if you want to see the results in the terminal. The code currently appends a [URL, status code] pair to an errors list, which you can print or process further, as sketched below.
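If you want the results on disk rather than in the terminal, a short follow-up could write them back out as CSV (a sketch, assuming the errors list built above; the output file name is arbitrary):
import csv

# Persist the collected [URL, status code] pairs to a new CSV file
with open('bad_urls.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['URL', 'STATUS'])  # header row
    writer.writerows(errors)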

How to close a file which is opened to read in an API Call

import glob
import os
import requests
import shutil

class file_service:
    def file():
        dir_name = '/Users/TEST/Downloads/TU'
        os.chdir(dir_name)
        pattern = 'TU_*.csv'
        for x in glob.glob(pattern):
            file_name = os.path.join(dir_name, x)
            print(file_name)
            from datetime import date
            dir_name_backup = '/Users/Zill/Downloads/backup'
            today = date.today()
            backup_file_name = f'Backup_TU_{today.year}{today.month:02}{today.day:02}.csv'
            backup_file_name_directory = os.path.join(dir_name_backup, backup_file_name)
            print(backup_file_name_directory)
            newPath = shutil.copy(file_name, backup_file_name_directory)
            url = "google.com"
            payload = {'name': 'file'}
            files = [
                ('file', open(file_name, 'rb'))
            ]
            headers = {
                'X-API-TOKEN': '12312'
            }
            response = requests.request("POST", url, headers=headers, data=payload, files=files)
            print(response.text.encode('utf8'))
            files.close()
            os.remove(file_name)

    file()
To provide overall context: I am retrieving a file from my OS and posting its contents into an application via a POST request. It's working as expected so far; the details are getting pushed into the application. As the next step, I am trying to remove the file from my directory using os.remove(), but I get a Win32 error because the file opened in read-only mode for the POST call is never closed. I am trying to close it but am unable to do so.
Can anyone please help me out with this?
Thanks!
I'm not sure I understand your code correctly. Could you try replacing
files.close()
with
for _, file in files:
    file.close()
and check if it works?
Explanation:
In
files = [('file', open(file_name,'rb'))]
you create a list containing exactly one tuple, which has the string 'file' as its first element and a file object as its second element:
[('file', file_object)]
The loop takes the tuple from the list, ignores its first element (_), takes its second element, the file object, and uses its close method to close it.
I've just now realised the list contains only one tuple. So there's no need for a loop:
files[0][1].close()
should do it.
The best way would be to use with (the file gets automatically closed once you leave the with block):
payload = {'name': 'file'}
with open(file_name, 'rb') as file:
files = [('file', file)]
headers = {'X-API-TOKEN': '12312'}
response = requests.request("POST", url, headers=headers, data = payload, files = files)
print(response.text.encode('utf8'))
os.remove(file_name)

Python downloading PDF into a .zip

What I am trying to do is loop through a list of URLs to download a series of .pdfs and save them to a .zip. At the moment I am just testing the code with a single URL. The error I am getting is:
Traceback (most recent call last):
  File "I:\test_pdf_download_zip.py", line 36, in <module>
    zip_file(zipfile_name, url)
  File "I:\test_pdf_download_zip.py", line 30, in zip_file
    myzip.write(dowload_pdf(url))
TypeError: expected a string or other character buffer object
Would someone know how to pass the downloaded .pdf to the .zip correctly (avoiding the error above) so I can append it, or whether this is even possible?
import os
import zipfile
import requests

output = r"I:"
# File name of the zipfile
zipfile_name = os.path.join(output, "test.zip")
# Random test pdf
url = r"http://www.pdf995.com/samples/pdf.pdf"

def create_zipfile(zipfile_name):
    zipfile.ZipFile(zipfile_name, "w")

def dowload_pdf(url):
    response = requests.get(url, stream=True)
    with open('test.pdf', 'wb') as f:
        f.write(response.content)

def zip_file(zip_name, url):
    with open(zip_name, 'a') as myzip:
        myzip.write(dowload_pdf(url))

if __name__ == "__main__":
    create_zipfile(zipfile_name)
    zip_file(zipfile_name, url)
    print("Done")
Your dowload_pdf() function saves a file but doesn't return anything. You need to modify it so it actually returns the file path that myzip.write() expects. You also don't want to hardcode test.pdf; pass a unique path to your download function each time so you don't end up with multiple test.pdf entries in your archive.
def dowload_pdf(url, path):
    response = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        f.write(response.content)
    return path
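For completeness, here is a sketch of how the whole flow might fit together. Note that myzip needs to be a zipfile.ZipFile, not a handle from a plain open() call, for write() to add files to the archive; the per-URL file naming below is an assumption:
import os
import zipfile
import requests

def dowload_pdf(url, path):
    response = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        f.write(response.content)
    return path

def zip_files(zip_name, urls):
    with zipfile.ZipFile(zip_name, 'a') as myzip:
        for i, url in enumerate(urls):
            pdf_path = dowload_pdf(url, 'file_%d.pdf' % i)  # unique path per URL
            myzip.write(pdf_path)
            os.remove(pdf_path)  # optionally drop the loose copy once archived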

Download a file in python with urllib2 instead of urllib

I'm trying to download a tarball file and save it locally with python. With urllib it's pretty simple:
import urllib
import tarfile

urllib.urlretrieve(url, 'compressed_file.tar.gz')
tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()
So my question is really simple: What's the way to achieve this using the urllib2 library?
Quoting docs:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
data may be a string specifying additional data to send to the server, or None if no such data is needed.
Nothing in the urlopen interface documentation says that the second argument is the name of a file the response should be written to.
You need to explicitly write the data read from the response to a file:
r = urllib2.urlopen(url)
CHUNK_SIZE = 1 << 20  # 1 MiB

with open('compressed_file.tar.gz', 'wb') as f:
    # The line below downloads the whole file into memory at once
    # and dumps it to disk afterwards:
    # f.write(r.read())
    # The preferable lazy solution: download and write the data in chunks.
    while True:
        chunk = r.read(CHUNK_SIZE)
        if not chunk:
            break
        f.write(chunk)
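Incidentally, the chunked loop above is essentially what shutil.copyfileobj does internally, so an equivalent shortcut would be:
import shutil
import urllib2

r = urllib2.urlopen(url)
with open('compressed_file.tar.gz', 'wb') as f:
    # copyfileobj reads and writes in chunks (the buffer size argument is optional)
    shutil.copyfileobj(r, f, 1 << 20)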

zip downloaded csv file

I'm trying to download a csv file and zip it before saving it.
The code I'm using is:
req = urllib2.Request(url)
fh = urllib2.urlopen(req)
with contextlib.closing(ZipFile("test.csv.zip", "w", zipfile.ZIP_STORED)) as f:
    f.write(fh.read())
    f.close()
What this does is print the contents of the csv file to stdout and create an empty zipfile.
Any ideas of what could be wrong?
Thanks,
Isaac
Take a look at the documentation for ZipFile.write(). Here is the function signature:
ZipFile.write(filename[, arcname[, compress_type]])
The first argument should be the file name of the file that you are adding to the zip archive, not the contents of the file. Instead you are passing the entire contents of the downloaded resource as the file name and, because that will likely be illegal (too long), you see the file contents dumped as part of the error message of the raised exception.
To fix this, you need to use ZipFile.writestr():
req = urllib2.Request(url)
fh = urllib2.urlopen(req)
with ZipFile("test.csv.zip", "w", zipfile.ZIP_STORED) as f:
    f.writestr('test.csv', fh.read())
If it is your intention to compress only a single file, you probably don't need to use a zip archive, and you might be better off using gzip or bzip2.
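As a sketch of that gzip alternative (assuming the same fh response object as above; note that ZIP_STORED in the code above stores the file without any compression):
import gzip
import urllib2

fh = urllib2.urlopen(url)
# gzip compresses the single file directly; no archive container needed
with gzip.open('test.csv.gz', 'wb') as f:
    f.write(fh.read())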
