I am attempting to read a CSV file that contains a long list of URLs. I need to iterate through the list and collect the URLs that return a 301, 302, or 404 response. When I test the script it exits with code 0, so I know it runs without raising an error, but it is not doing what I need it to. I am new to Python and to working with files; my experience has been primarily UI automation. Any suggestions would be greatly appreciated. Below is the code.
import csv
import requests
import responses
from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open('redirect.csv', 'r')
contents = []

with open('redirect.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

def run():
    resp = urllib.request.urlopen(url)
    print(self.url, resp.getcode())

run()
print(run)
Assuming you have a CSV similar to the following (the heading is URL):
URL
https://duckduckgo.com
https://bing.com
You can do something like this using the requests library.
import csv
import requests

with open('urls.csv', newline='') as csvfile:
    errors = []
    reader = csv.DictReader(csvfile)
    # Iterate through each line of the csv file
    for row in reader:
        try:
            # Without allow_redirects=False, requests follows 301/302
            # responses and r.status_code would be that of the final page.
            r = requests.get(row['URL'], allow_redirects=False)
            if r.status_code in [301, 302, 404]:
                # print(f"{r.status_code}: {row['URL']}")
                errors.append([row['URL'], r.status_code])
        except requests.RequestException:
            pass
Uncomment the print statement if you want to see the results in the terminal. As it stands, the code appends a [URL, status code] pair to the errors list for each problem URL; you can print this list or continue processing it, for example by writing it back out to a CSV as sketched below.
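A minimal sketch of that last step, assuming the errors list built above and an output file name of broken_urls.csv (the name is just an assumption):

import csv

# Assumes `errors` is the list of [url, status_code] pairs collected above.
with open('broken_urls.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['URL', 'Status'])  # header row
    writer.writerows(errors)            # one row per failing URL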
I am trying to do the following things:
1 - open a text file containing a list of URLs (http://example.com).
2 - read the text file and check whether the path exists.
3 - write the results back to another text file.
I have tried the following code:
import urllib2

file = open('file.txt', 'r')
search = urllib2.urlopen(file + "/js/tools.js")

if search.code == 200:
    print "Exists!"
I appreciate any help provided.
Assuming you have a file filename in which the links are stored line by line:
import requests

with open(filename, 'r') as file, open(filename2, 'w+') as file2:
    for url in file.readlines():
        url = url.strip()          # drop the trailing newline before requesting
        check = requests.get(url)
        if check.ok:               # status code below 400
            file2.write(url + '\n')
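If, as in the first snippet of the question, the goal is to check whether a specific path such as /js/tools.js exists under each base URL, a minimal sketch along the same lines might look like this (file.txt comes from the question; results.txt and the output format are assumptions):

import requests

with open('file.txt', 'r') as infile, open('results.txt', 'w') as outfile:
    for line in infile:
        base = line.strip().rstrip('/')        # e.g. http://example.com
        if not base:
            continue                           # skip blank lines
        resp = requests.get(base + "/js/tools.js")
        # resp.ok is True for any status code below 400
        outfile.write("%s %s\n" % (base, "Exists!" if resp.ok else resp.status_code))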
My text file consists of:
http://www.makemytrip.com/
http://www.makemytrip.com/blog/dil-toh-roaming-hai?intid=Blog_HPHeader_Logo //how do i remove /dil-toh-roaming-hai?intid=Blog_HPHeader_Logo
http://www.makemytrip.com/rewards/?intid=New_ch_mtr_na
javascript:void(0) //how do i remove this
javascript:void(0)
javascript:void(0)
http://www.makemytrip.com/rewards/?intid=new_ch_mtr_dropdwn
https://support.makemytrip.com/MyAccount/MyTripReward/DashBoard
https://support.makemytrip.com/MyAccount/User/User
https://support.makemytrip.com/MyAccount/MyBookings/BookingSummary/
https://support.makemytrip.com/customersupports.aspx?actiontype=PRINTETICKET
How do I go about checking only the URLs and saving them in another file, so that I can parse them one at a time? I tried this Python code, but it matches and opens only the first URL.
import urllib

with open("s.txt", "r") as file:
    for line in file:
        url = urllib.urlopen(line)
        read = url.read()
        print read
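One way to handle the filtering part is to keep only the lines that actually start with a URL scheme and write them to a separate file before fetching them. A minimal sketch, assuming an output file name of clean_urls.txt:

# Keep only lines that look like real URLs, skipping entries such as
# javascript:void(0), and write them to a separate file.
with open("s.txt", "r") as infile, open("clean_urls.txt", "w") as outfile:
    for line in infile:
        line = line.strip()
        if line.startswith("http://") or line.startswith("https://"):
            outfile.write(line + "\n")

If you also want to trim the query string or path (as the inline comments ask), urlparse (urllib.parse in Python 3) can split each URL into scheme, host, path and query components so you can rebuild only the parts you want.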
I have a python script that fetches a webpage and mirrors it. It works fine for one specific page, but I can't get it to work for more than one. I assumed I could put multiple URLs into a list and then feed that to the function, but I get this error:
Traceback (most recent call last):
File "autowget.py", line 46, in <module>
getUrl()
File "autowget.py", line 43, in getUrl
response = urllib.request.urlopen(url)
File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.2/urllib/request.py", line 361, in open
req.timeout = timeout
AttributeError: 'tuple' object has no attribute 'timeout'
Here's the offending code:
url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*url):
    response = urllib.request.urlopen(url)
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

getUrl()
I've exhausted Google trying to find how to open a list with urlopen(). I found one way that sort of works. It takes a .txt document and goes through it line-by-line, feeding each line as a URL, but I'm writing this using Python 3 and for whatever reason twillcommandloop won't import. Plus, that method is unwieldy and requires (supposedly) unnecessary work.
Anyway, any help would be greatly appreciated.
There are a couple of errors in your code:
You define getUrl with a variable-length argument list, so inside the function url is a tuple (the tuple in your error message);
You then pass that tuple to urlopen as if it were a single URL instead of iterating over the list.
You can try this code, using urllib.request since your traceback shows Python 3:

import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(urls):
    for url in urls:
        # Build a file name from the url string only
        file_name = url.replace('https://', '').replace('http://', '').replace('.', '_').replace('/', '_')
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
urlopen() does not accept a tuple. From the documentation:
urllib.request.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
Your call is also incorrect. With the *url signature it should be:
getUrl(url[0], url[1], url[2])
and inside the function you need a loop such as for u in url to walk through all the URLs, as sketched below.
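A minimal sketch of what that could look like while keeping the *url signature from the question (the page-N.html naming is just an assumption):

import shutil
import urllib.request

url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*url):
    # url is a tuple of all positional arguments; loop over each one
    for i, u in enumerate(url):
        file_name = 'page-%d.html' % i          # assumed naming scheme
        with urllib.request.urlopen(u) as response, open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(url[0], url[1], url[2])                  # or simply getUrl(*url)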
You should just iterate over your URLs using a for loop:
import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/']
file_name = 'foo.txt'

def fetch_urls(urls):
    for i, url in enumerate(urls):
        file_name = "page-%s.html" % i
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

fetch_urls(urls)
I assume you want the content saved to separate files, so I used enumerate here to create a unique file name, but you can obviously use anything from hash() or the uuid module to creating slugs; a hash-based variant is sketched below.
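For instance, a minimal sketch of deriving the file name from a hash of the URL instead of the loop index (using hashlib rather than the built-in hash(), since hashlib gives a value that is stable across runs):

import hashlib
import shutil
import urllib.request

def fetch_urls(urls):
    for url in urls:
        # Stable, filesystem-safe name derived from the URL itself
        digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
        file_name = "page-%s.html" % digest
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)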
I am trying to write some data to a CSV file after checking a condition, as described below.
I will have a list of URLs in a text file like this:
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
Now I go through each URL from the text file, read the content of the URL using the urllib2 module in Python, and search for a string in the entire HTML page. If the required string is found, I write that URL into a CSV file.
But when I try to write the data (the URL) into the CSV file, it saves each character into its own column, as shown below, instead of saving the entire URL into one column:
h t t p s : / / w w w......
Code.py
import urllib2
import csv

search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt', 'r').readlines()

outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv' % search_string, "wb"), delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])

for url in html_urls:
    url = url.replace('\n', '').strip()
    if not len(url) == 0:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        if str(search_string) in response.read():
            outputcsv.writerow(url)
So what is wrong with the above code, and what needs to be done in order to save the entire URL (string) into one column of the CSV file?
Also, how can we write the same data to a text file?
Edited
Also, suppose I have a URL like http://www.vodafone.in/Pages/tuesdayoffers_che.aspx.
In the browser this URL is actually redirected to http://www.vodafone.in/pages/home_che.aspx?cid=che, but when I tried it through the code below, the result is essentially the same as the original URL:
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
So finally, how do I catch the redirected URL with urllib2 and fetch the data from it?
csv.writer.writerow() expects a sequence of column values; when you pass it a bare string, it iterates over the string character by character and writes each character into its own column. Change the last line to:
outputcsv.writerow([url])
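As for the edited part of the question: urllib2 follows plain HTTP 301/302 redirects automatically, and response.geturl() then reports the final URL; redirects done via JavaScript or a meta refresh are not followed. A minimal sketch that records the final URL for each match and also writes a plain text version (the output file names are assumptions):

import csv
import urllib2

search_string = 'Listen Capcha'

with open('urls.txt', 'r') as infile, \
     open('matching_urls.csv', 'wb') as csvfile, \
     open('matching_urls.txt', 'w') as txtfile:
    writer = csv.writer(csvfile)
    writer.writerow(['URL', 'Final URL'])
    for line in infile:
        url = line.strip()
        if not url:
            continue
        response = urllib2.urlopen(url)
        final_url = response.geturl()               # URL after any HTTP redirects
        if search_string in response.read():
            writer.writerow([url, final_url])       # one value per column, one row per match
            txtfile.write(url + '\n')               # text-file version, one URL per line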