urllib TimeoutError: The read operation timed out? - python

New to Python here. So I'm trying to build a script that pulls a list of websites from a CSV file and uses urllib to ping each one and check whether the website is running or not. It works for a few dozen websites and then just times out. The error I am getting is: TimeoutError: The read operation timed out. The CSV file has the domains arranged in a column. Once the script finishes, I want it to produce a CSV file with only the validated domains. Here is the code:
from urllib.request import urlopen
from urllib.error import *
from pandas import *
import csv

data = read_csv("./domains.csv")
# converting column data to list
domains = data['Domains'].tolist()
validated = []
for x in domains:
    try:
        html = urlopen("https://www." + x, None, timeout=2)
    except HTTPError as e:
        print("HTTP error" + x, e)
        continue
    except URLError as e:
        print("Opps ! Page not found!" + x, e)
        continue
    else:
        print('Yeah ! found ' + x)
        validated.append([x])

print("validated domains:", validated)
with open("newfilePath.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerows(validated)
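Judging from the traceback, the read timeout is raised as TimeoutError (socket.timeout, which is an alias of TimeoutError on Python 3.10+), and that is not one of the exceptions the loop catches, so a single slow site stops the whole run. A minimal sketch of the same loop with that case handled, keeping the 2-second timeout and the original newfilePath.csv output (on older Pythons, catch socket.timeout instead):

import csv
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from pandas import read_csv

data = read_csv("./domains.csv")
domains = data['Domains'].tolist()
validated = []
for x in domains:
    try:
        urlopen("https://www." + x, None, timeout=2)
    except HTTPError as e:
        print("HTTP error " + x, e)
    except URLError as e:
        print("Oops! Page not found! " + x, e)
    except TimeoutError as e:
        # A slow site no longer kills the script; just skip it
        print("Timed out: " + x, e)
    else:
        print('Yeah! found ' + x)
        validated.append([x])

print("validated domains:", validated)
with open("newfilePath.csv", "w", newline='') as f:
    csv.writer(f).writerows(validated)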

Related

Python-3 Trying to iterate through a csv and get http response codes

I am attempting to read a csv file that contains a long list of urls. I need to iterate through the list and get the urls that throw a 301, 302, or 404 response. In trying to test the script I am getting "exited with code 0", so I know it is error free, but it is not doing what I need it to. I am new to Python and working with files; my experience has been UI automation primarily. Any suggestions would be gladly appreciated. Below is the code.
import csv
import requests
import responses
from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open('redirect.csv', 'r')
contents = []

with open('redirect.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

def run():
    resp = urllib.request.urlopen(url)
    print(self.url, resp.getcode())

run()
print(run)
Given you have a CSV similar to the following (the heading is URL)
URL
https://duckduckgo.com
https://bing.com
You can do something like this using the requests library.
import csv
import requests

with open('urls.csv', newline='') as csvfile:
    errors = []
    reader = csv.DictReader(csvfile)
    # Iterate through each line of the csv file
    for row in reader:
        try:
            r = requests.get(row['URL'])
            if r.status_code in [301, 302, 404]:
                # print(f"{r.status_code}: {row['URL']}")
                errors.append([row['URL'], r.status_code])
        except:
            pass
Uncomment the print statement if you want to see the results in the terminal. As written, the code appends each [URL, status code] pair to the errors list; you can print or continue processing that list as you prefer.
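One caveat worth flagging: requests.get follows redirects by default, so a URL that returns 301/302 will usually come back with the final status code (often 200) rather than the redirect itself. If the goal is to record the redirect status, a sketch along these lines, passing allow_redirects=False and a timeout, should keep those codes visible (the 5-second timeout is just an assumed value):

import csv
import requests

with open('urls.csv', newline='') as csvfile:
    errors = []
    reader = csv.DictReader(csvfile)
    for row in reader:
        try:
            # Don't follow redirects, so 301/302 are reported as-is
            r = requests.get(row['URL'], allow_redirects=False, timeout=5)
            if r.status_code in (301, 302, 404):
                errors.append([row['URL'], r.status_code])
        except requests.RequestException as e:
            # Record unreachable hosts instead of silently skipping them
            errors.append([row['URL'], str(e)])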

unable to write csv file, python

Here is my code. I added an exception for IndexError, but it's not writing to the csv file.
import urllib2
import csv
import time
import requests
import os

#a = open(r"C:\Drive F data\Client\Blake\III\2.txt")
a = ['http://www.houzz.com/trk/aHR0cDovL3d3dy5SSUtCLmNvbQ/b020157b98711b4a190eee3331eb0066/ue/MjA5ODg2MQ/1b9b00b9fdc9f270f14688046ef161e2',
     'http://www.houzz.com/trk/aHR0cDovL3d3dy5nc2NhYmluZXRyeS5jb20/0323b7db059b9e0357d045685be21a6d/ue/NDY2MjE0/d8815293352eb2a6e40c95060c019697',
     'http://www.houzz.com/trk/aHR0cDovL3NpY29yYS5jb20/dc807b3705b95b5da772a7aefe23a803/ue/Njc0NDA4/a73f8bdb38e10abd5899fb5c55ff3548',
     'http://www.houzz.com/trk/aHR0cDovL3d3dy5DYXNlRGVzaWduLmNvbQ/d79c6af934e3c815d602c4d79b0d6617/ue/OTY3MDg/ce9a87e31e84871a96bca7538aae9856',
     'http://www.houzz.com/trk/aHR0cDovL2phcnJldHRkZXNpZ25sbGMuY29t/9d0009d3544d9c22f6058b20097321b3/ue/MzExNTk1NA/310d49732d317725364368ea3fbfd7c1',
     'http://www.houzz.com/trk/aHR0cDovL3d3dy5yb2JlcnRsZWdlcmVkZXNpZ24uY29t/8ac7311be2f794654cefba71474563f7/ue/MTExNTQ4/af201ffdc62de6aba9e2de90f69a770d']

with open("C:\Drive F data\Blake\III/2.csv", "ab") as export:
    names = ['source', 'link']
    writer = csv.DictWriter(export, fieldnames=names)
    writer.writeheader()
    for each in a:
        try:
            link = urllib2.urlopen(each).geturl()
        except IndexError:
            pass
        print each, link
        writer.writerow({'source': each, 'link': link})
After removing the try & except, it works fine.
I think you're missing escape characters in your file path: "C:\\Drive F data\\Blake\\III\\2.csv"
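Another likely culprit, offered as a guess: urllib2.urlopen raises urllib2.URLError (or HTTPError), not IndexError, so the except clause never matches and the first URL that fails crashes the loop before the rows get written. A minimal sketch that catches the urllib2 errors and uses a raw string for the Windows path (only the first URL from the list is shown here):

import csv
import urllib2

a = ['http://www.houzz.com/trk/aHR0cDovL3d3dy5SSUtCLmNvbQ/b020157b98711b4a190eee3331eb0066/ue/MjA5ODg2MQ/1b9b00b9fdc9f270f14688046ef161e2']

# Raw string avoids any backslash-escape surprises in the Windows path
with open(r"C:\Drive F data\Blake\III\2.csv", "ab") as export:
    writer = csv.DictWriter(export, fieldnames=['source', 'link'])
    writer.writeheader()
    for each in a:
        try:
            link = urllib2.urlopen(each).geturl()
        except urllib2.URLError as e:
            # HTTPError is a subclass of URLError, so this covers both
            link = 'error: %s' % e
        print each, link
        writer.writerow({'source': each, 'link': link})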

Timeout error, getting images from urls - Python

I am trying to save jpegs to files from a list of urls. This code times out frequently and randomly. It has saved up to 113 jpegs (there are many more than that), and sometimes only saves 10 before timing out. Is there a way to put a wait in so the timeout doesn't occur? I have tried sleep in the commented section with no luck. Thanks for the feedback!
Here's the timeout error message:
import urllib.request
import urllib
import codecs
from urllib import request
import time
import csv

class File:
    def __init__(self, data):
        self.data = data

file = File("1")

with open("file.csv", encoding="utf8") as f1:
    file.data = list(csv.reader(f1, skipinitialspace=True))

for i in file.data[1:]:
    if len(i[27]) != 0:
        # i[14] creates a unique jpeg file name in the dir
        image = open('C:\\aPath' + i[14] + '.JPG', 'wb')
        path = 'aPath' + i[14] + '.JPG'
        # time.sleep(2)  Tried sleep here, didn't work
        # i[27] is a working jpeg url
        urllib.request.urlretrieve(i[27], path)
        image.close()

print('done!')
There's no way to prevent the exception. You need to catch the exception and retry.
...
for i in file.data[1:]:
    if not i[27]:
        continue
    path = 'aPath' + i[14] + '.JPG'
    while True:  # retry loop
        try:
            urllib.request.urlretrieve(i[27], path)
            break  # On success, stop retrying.
        except TimeoutError:
            print('timeout, retry in 1 second.')
            time.sleep(1)
BTW, you don't need to open the file yourself if you use urllib.request.urlretrieve.
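A small variation on that retry loop, sketched under the assumption that some URLs may never succeed: cap the number of attempts so one permanently unreachable image can't spin forever, then move on (MAX_RETRIES is an assumed value):

import time
import urllib.request

MAX_RETRIES = 3  # assumed limit; tune as needed

for i in file.data[1:]:  # file.data loaded from file.csv as in the question
    if not i[27]:
        continue
    path = 'aPath' + i[14] + '.JPG'
    for attempt in range(MAX_RETRIES):
        try:
            urllib.request.urlretrieve(i[27], path)
            break  # success, move on to the next row
        except TimeoutError:
            print('timeout, retry in 1 second.')
            time.sleep(1)
    else:
        # for/else: runs only if the loop never hit break
        print('giving up on', i[27])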

Writing text to txt file in python on new lines?

So I am trying to check whether a url exists and if it does I would like to write the url to a file using python. I would also like each url to be on its own line within the file. Here is the code I already have:
import urllib2

# CREATE A BLANK TXT FILE ON THE DESKTOP
urlhere = "http://www.google.com"
print "for url: " + urlhere + ":"
try:
    fileHandle = urllib2.urlopen(urlhere)
    data = fileHandle.read()
    fileHandle.close()
    print "It exists"
    # Then, if the URL does exist, write the url on a new line in the text file
except urllib2.URLError, e:
    print 'PAGE 404: It Doesnt Exist', e
    # If the URL doesn't exist, don't write anything to the file.
The way you worded your question is a bit confusing, but if I understand you correctly, all you're trying to do is test whether a URL is valid using urllib2 and, if it is, write the URL to a file? If that is correct, the following should work.
import urllib2

f = open("url_file.txt", "a+")
urlhere = "http://www.google.com"
print "for url: " + urlhere + ":"
try:
    fileHandle = urllib2.urlopen(urlhere)
    data = fileHandle.read()
    fileHandle.close()
    f.write(urlhere + "\n")
    f.close()
    print "It exists"
except urllib2.URLError, e:
    print 'PAGE 404: It Doesnt Exist', e
If you want to test multiple URLs without editing the Python script, you can run it as python python_script.py "http://url_here.com". This works via the sys module, where sys.argv[1] is the first argument passed to python_script.py, which in this example is the URL ('http://url_here.com').
import urllib2, sys

f = open("url_file.txt", "a+")
urlhere = sys.argv[1]
print "for url: " + urlhere + ":"
try:
    fileHandle = urllib2.urlopen(urlhere)
    data = fileHandle.read()
    fileHandle.close()
    f.write(urlhere + "\n")
    f.close()
    print "It exists"
except urllib2.URLError, e:
    print 'PAGE 404: It Doesnt Exist', e
Or, if you really want to make the job easy, you can use the following script by typing python python_script.py http://url1.com,http://url2.com on the command line, where all the URLs you wish to test are separated by commas with no spaces.
import urllib2, sys

f = open("url_file.txt", "a+")
urlhere_list = sys.argv[1].split(",")
for urls in urlhere_list:
    print "for url: " + urls + ":"
    try:
        fileHandle = urllib2.urlopen(urls)
        data = fileHandle.read()
        fileHandle.close()
        f.write(urls + "\n")
        print "It exists"
    except urllib2.URLError, e:
        print 'PAGE 404: It Doesnt Exist', e
    except:
        print "invalid url"
f.close()
sys.argv[1].split(",") can also be replaced by a Python list inside the script if you don't want to use the command-line functionality, as in the sketch below. Hope this is of some use to you, and good luck with your program.
Note: the scripts using command-line inputs were tested on Ubuntu Linux, so if you are using Windows or another operating system I can't guarantee the instructions will work as given, but they should.
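For what it's worth, here is a minimal sketch of that in-script list variant; the two URLs are just placeholders:

import urllib2

f = open("url_file.txt", "a+")
# Hard-coded list instead of sys.argv[1].split(",")
urlhere_list = ["http://www.google.com", "http://example.com"]
for urls in urlhere_list:
    print "for url: " + urls + ":"
    try:
        fileHandle = urllib2.urlopen(urls)
        fileHandle.close()
        f.write(urls + "\n")
        print "It exists"
    except urllib2.URLError, e:
        print 'PAGE 404: It Doesnt Exist', e
f.close()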
How about something like this:
import urllib2

url = 'http://www.google.com'
data = ''
try:
    data = urllib2.urlopen(url).read()
except urllib2.URLError, e:
    data = 'PAGE 404: It Doesnt Exist ' + str(e)
with open('outfile.txt', 'w') as out_file:
    out_file.write(data)
Use requests:
import requests

def url_checker(urls):
    with open('somefile.txt', 'a') as f:
        for url in urls:
            r = requests.get(url)
            if r.status_code == 200:
                f.write('{0}\n'.format(url))

url_checker(['http://www.google.com', 'http://example.com'])
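One caveat on the requests version: requests.get raises an exception for URLs that don't resolve at all, so a single dead domain would stop the loop. A sketch that treats those the same as a missing page (the 5-second timeout is an assumed value):

import requests

def url_checker(urls):
    with open('somefile.txt', 'a') as f:
        for url in urls:
            try:
                r = requests.get(url, timeout=5)
            except requests.RequestException:
                # Connection errors, timeouts, bad hostnames, etc.
                continue
            if r.status_code == 200:
                f.write('{0}\n'.format(url))

url_checker(['http://www.google.com', 'http://example.com'])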

Getting JSON objects from website using standard json and urllib2

I wrote some code to extract JSON objects from the GitHub website using json and requests:
#!/usr/bin/python
import json
import requests

r = requests.get('https://github.com/timeline.json')  # Replace with your website URL
with open("a.txt", "w") as f:
    for item in r.json or []:
        try:
            f.write(item['repository']['name'] + "\n")
        except KeyError:
            pass
This works perfectly fine. However, I want to do the same thing using urllib2 and standard json module. How do I do that? Thanks.
Simply download the data with urlopen and parse it with Python's json module:
import json
import urllib2

r = urllib2.urlopen('https://github.com/timeline.json')
with open("a.txt", "w") as f:
    for item in json.load(r) or []:
        try:
            f.write(item['repository']['name'] + "\n")
        except KeyError:
            pass
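If this ever needs to run on Python 3 rather than Python 2, roughly the same idea should work with urllib.request in place of urllib2; on Python 3.6+, json.load can read the binary response object directly. A sketch:

import json
from urllib.request import urlopen

# Python 3 version of the same download-and-parse loop
with urlopen('https://github.com/timeline.json') as r, open("a.txt", "w") as f:
    for item in json.load(r) or []:
        try:
            f.write(item['repository']['name'] + "\n")
        except KeyError:
            pass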
