Copying URLs whose pages contain a specific term to a file - Python

So I'm trying to get all the URLs in the range whose pages contain either the term "Recipes adapted from" or "Recipe from". This copies all the links to the file up until about 7496, then it spits out an HTTPError 404. What am I doing wrong? I've tried to implement BeautifulSoup and requests, but I still can't get it to work.
import urllib2

with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        page_content = urllib2.urlopen(url).read()
        if "Recipe adapted from" in page_content:
            print url
            f.write(url + '\n')
        elif "Recipe from" in page_content:
            print url
            f.write(url + '\n')
        else:
            pass

Some of the URLs you are trying to scrape do not exist. Simply skip them by catching and ignoring the exception:
import urllib2

with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        try:
            page_content = urllib2.urlopen(url).read()
        except urllib2.HTTPError as error:
            if 400 < error.code < 500:
                continue  # not found, unauthorized, etc.
            raise  # other errors we want to know about
        if "Recipe adapted from" in page_content or "Recipe from" in page_content:
            print url
            f.write(url + '\n')
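If you would rather use requests, which the question mentions trying, a minimal sketch of the same loop looks like this; requests does not raise an exception for a 404 by default, so the status code is checked explicitly. This is an alternative sketch, not part of the original answer:

import requests

with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        response = requests.get(url)
        if response.status_code != 200:
            continue  # skip 404s and any other non-OK responses
        page_content = response.text
        if "Recipe adapted from" in page_content or "Recipe from" in page_content:
            print url
            f.write(url + '\n')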

Related

Unable to print multiple values in Django

I have code that tests whether different directories exist under a URL. Example: www.xyz.com/admin.php, where admin.php is the directory or a different page.
I am checking these pages or directories by reading them from a text file.
Suppose the following is file.txt:
index.php
members.html
login.php
and this is the function in views.py
import os
# Assuming Python 2 here; on Python 3 these imports live in urllib.request / urllib.error
from urllib2 import Request, urlopen, HTTPError, URLError

def pagecheck(st):
    url = st
    print("Available links :")
    module_dir = os.path.dirname(__file__)
    file_path = os.path.join(module_dir, 'file.txt')
    data_file = open(file_path, 'r')
    while True:
        sub_link = data_file.readline()
        if not sub_link:
            break
        req_link = url + "/" + sub_link
        req = Request(req_link)
        try:
            response = urlopen(req)
        except HTTPError as e:
            continue
        except URLError as e:
            continue
        else:
            print(" " + req_link)
The code works fine and prints all the webpages that actually exist to the console, but when I try to return the result at the end so I can display it on the Django page:
return req_link
print(" " + req_link)
it only shows the first page that makes a connection from file.txt. Suppose all the webpages in file.txt actually exist on the website: the script prints all of them to the console, but returns only a single webpage in the Django app.
I tried using a for loop, but it didn't work.
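Note: return ends the function the first time it executes, which is why only one link ever reaches the Django page. A minimal sketch of the usual fix is to collect the reachable links in a list and return that list once, after the loop; the sketch below reuses the names from the question and is an illustration, not the poster's code.

import os
from urllib2 import Request, urlopen, HTTPError, URLError  # urllib.request / urllib.error on Python 3

def pagecheck(st):
    # Return every sub-link under `st` that responds without an error.
    module_dir = os.path.dirname(__file__)
    file_path = os.path.join(module_dir, 'file.txt')
    available = []
    with open(file_path, 'r') as data_file:
        for sub_link in data_file:
            req_link = st + "/" + sub_link.strip()
            try:
                urlopen(Request(req_link))
            except (HTTPError, URLError):
                continue
            available.append(req_link)
    return available  # pass this list to the template context and loop over it there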

How to remove URLs which have a 404 status code from a file using Python's remove function?

I have to remove the URLs that return a 404 status from a file, using Python's remove function, but I am not sure why it is not working.
Code:
#!/usr/bin/python
import requests

url_lines = open('url.txt').read().splitlines()
for url in url_lines:
    remove_url = requests.get(url)
    if remove_url.status_code == 404:
        print remove_url.status_code
        url_lines.remove(url)
The url.txt file contains the following lines:
https://www.amazon.co.uk/jksdkkhsdhk
http://www.google.com
The line https://www.amazon.co.uk/jksdkkhsdhk should be removed from the url.txt file.
Thank you so much for your help in advance.
You could just skip it:
if remove_url.status_code == 404:
    continue
You shouldn't try to remove items while inside the for loop, because mutating a list while iterating over it makes the loop skip elements. Instead, add each bad URL to another list, remove_from_urls, and, after your for loop, filter out every URL that appears in that new list. This could be done by:
remove_from_urls = []

for url in url_lines:
    remove_url = requests.get(url)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue
    # Code for handling non-404 requests

url_lines = [url for url in url_lines if url not in remove_from_urls]

# Save urls example
with open('urls.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')
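As a side note, membership tests against a plain list are linear, so for a long URL file a set keeps the final filter cheap. A small sketch of the same idea with a set, writing the result back to url.txt (my variation, not part of the original answer):

import requests

url_lines = open('url.txt').read().splitlines()
remove_from_urls = set()

for url in url_lines:
    if requests.get(url).status_code == 404:
        remove_from_urls.add(url)  # O(1) membership tests later

url_lines = [url for url in url_lines if url not in remove_from_urls]

with open('url.txt', 'w') as f:
    f.write('\n'.join(url_lines) + '\n')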

BeautifulSoup download all .zip files from Google Patent Search

What I am trying to do is use BeautifulSoup to download every zip file from the Google Patent archive. Below is the code that I've written thus far, but it seems that I am having trouble getting the files to download into a directory on my desktop. Any help would be greatly appreciated.
from bs4 import BeautifulSoup
import urllib2
import re
import pandas as pd

url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
soup.prettify()
path = open('/Users/username/Desktop/', "wb")
for name in soup.findAll('a', href=True):
    print name['href']
    linkpath = name['href']
    rq = urllib2.request(linkpath)
    res = urllib2.urlopen(rq)
    path.write(res.read())
The result I expect is that all of the zip files download into a specific directory. Instead, I am getting the following error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-874f34e07473> in <module>()
     17     print name['href']
     18     linkpath = name['href']
---> 19     rq = urllib2.request(namep)
     20     res = urllib2.urlopen(rq)
     21     path.write(res.read())

AttributeError: 'module' object has no attribute 'request'
In addition to using a non-existent request entity from urllib2, you don't output to a file correctly - you can't just open the directory, you have to open each file for output separately.
Also, the 'Requests' package has a much nicer interface than urllib2. I recommend installing it.
Note that, today anyway, the first .zip is 5.7GB, so streaming to a file is essential.
Really, you want something more like this:
from BeautifulSoup import BeautifulSoup
import requests

# point to output directory
outpath = 'D:/patent_zips/'
url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
mbyte = 1024 * 1024

print 'Reading: ', url
html = requests.get(url).text
soup = BeautifulSoup(html)

print 'Processing: ', url
for name in soup.findAll('a', href=True):
    zipurl = name['href']
    if zipurl.endswith('.zip'):
        outfname = outpath + zipurl.split('/')[-1]
        r = requests.get(zipurl, stream=True)
        if r.status_code == requests.codes.ok:
            fsize = int(r.headers['content-length'])
            print 'Downloading %s (%sMb)' % (outfname, fsize / mbyte)
            with open(outfname, 'wb') as fd:
                for chunk in r.iter_content(chunk_size=1024):  # chunk size can be larger
                    if chunk:  # ignore keep-alive chunks
                        fd.write(chunk)
This is your problem:
rq = urllib2.request(linkpath)
urllib2 is a module and it has no request entity/attribute in it.
I see a Request class in urllib2, but I'm unsure if that's what you intended to actually use...
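For completeness, here is a minimal sketch of the smallest fix using the urllib2.Request class mentioned above, keeping the original loop and writing each zip to its own file. The output directory is a placeholder, and note that this reads each zip fully into memory rather than streaming:

import os
import urllib2
from bs4 import BeautifulSoup

outdir = '/Users/username/Desktop/patents'  # placeholder output directory

html = urllib2.urlopen('http://www.google.com/googlebooks/uspto-patents-grants.html').read()
soup = BeautifulSoup(html)

for name in soup.findAll('a', href=True):
    linkpath = name['href']
    if not linkpath.endswith('.zip'):
        continue
    rq = urllib2.Request(linkpath)   # Request (capital R), not urllib2.request
    res = urllib2.urlopen(rq)
    outfname = os.path.join(outdir, linkpath.split('/')[-1])
    with open(outfname, 'wb') as fd:
        fd.write(res.read())         # loads the whole zip into memory; see the streaming answer above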

Writing text to txt file in python on new lines?

So I am trying to check whether a url exists and if it does I would like to write the url to a file using python. I would also like each url to be on its own line within the file. Here is the code I already have:
import urllib2

# CREATE A BLANK TXT FILE ON THE DESKTOP

urlhere = "http://www.google.com"

print "for url: " + urlhere + ":"

try:
    fileHandle = urllib2.urlopen(urlhere)
    data = fileHandle.read()
    fileHandle.close()
    print "It exists"
    # Then, if the URL does exist, write the url on a new line in the text file
except urllib2.URLError, e:
    print 'PAGE 404: It Doesnt Exist', e
    # If the URL doesn't exist, don't write anything to the file.
The way you worded your question is a bit confusing, but if I understand you correctly, all you're trying to do is test whether a URL is valid using urllib2 and, if it is, write the URL to a file? If that is correct, the following should work.
import urllib2

f = open("url_file.txt", "a+")
urlhere = "http://www.google.com"

print "for url: " + urlhere + ":"

try:
    fileHandle = urllib2.urlopen(urlhere)
    data = fileHandle.read()
    fileHandle.close()
    f.write(urlhere + "\n")
    f.close()
    print "It exists"
except urllib2.URLError, e:
    print 'PAGE 404: It Doesnt Exist', e
If you want to test multiple URLs but don't want to edit the Python script, you could use the following script by typing python python_script.py "http://url_here.com". This is made possible by using the sys module, where sys.argv[1] is equal to the first argument passed to python_script.py, which in this example is the URL ('http://url_here.com').
import urllib2, sys

f = open("url_file.txt", "a+")
urlhere = sys.argv[1]

print "for url: " + urlhere + ":"

try:
    fileHandle = urllib2.urlopen(urlhere)
    data = fileHandle.read()
    fileHandle.close()
    f.write(urlhere + "\n")
    f.close()
    print "It exists"
except urllib2.URLError, e:
    print 'PAGE 404: It Doesnt Exist', e
Or, if you really want to make your job easy, you could use the following script by typing python python_script.py http://url1.com,http://url2.com into the command line, where all the URLs you wish to test are separated by commas with no spaces.
import urllib2, sys

f = open("url_file.txt", "a+")
urlhere_list = sys.argv[1].split(",")

for urls in urlhere_list:
    print "for url: " + urls + ":"
    try:
        fileHandle = urllib2.urlopen(urls)
        data = fileHandle.read()
        fileHandle.close()
        f.write(urls + "\n")
        print "It exists"
    except urllib2.URLError, e:
        print 'PAGE 404: It Doesnt Exist', e
    except:
        print "invalid url"

f.close()
sys.argv[1].split() can also be replaced by a python list within the script if you don't want to use the command line functionality. Hope this is of some use to you and good luck with your program.
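For example, a minimal sketch of that replacement (the URLs here are placeholders):

# Instead of urlhere_list = sys.argv[1].split(","), define the list in the script:
urlhere_list = ["http://www.google.com", "http://example.com"]  # placeholder URLs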
Note: the scripts using command-line inputs were tested on Ubuntu Linux, so if you are using Windows or another operating system I can't guarantee that they will work with the instructions given, but they should.
How about something like this:
import urllib2

url = 'http://www.google.com'
data = ''

try:
    data = urllib2.urlopen(url).read()
except urllib2.URLError, e:
    data = 'PAGE 404: It Doesnt Exist ' + str(e)

with open('outfile.txt', 'w') as out_file:
    out_file.write(data)
Use requests:
import requests

def url_checker(urls):
    with open('somefile.txt', 'a') as f:
        for url in urls:
            r = requests.get(url)
            if r.status_code == 200:
                f.write('{0}\n'.format(url))

url_checker(['http://www.google.com', 'http://example.com'])
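One caveat with this requests version: requests.get raises a ConnectionError, rather than returning a status code, when the host cannot be reached at all. A sketch that also skips unreachable URLs, added here for illustration rather than taken from the original answer:

import requests

def url_checker(urls):
    # Append only URLs that answer with HTTP 200; skip unreachable or failing ones.
    with open('somefile.txt', 'a') as f:
        for url in urls:
            try:
                r = requests.get(url, timeout=10)
            except requests.exceptions.RequestException:
                continue  # DNS failure, refused connection, timeout, ...
            if r.status_code == 200:
                f.write('{0}\n'.format(url))

url_checker(['http://www.google.com', 'http://example.com'])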

How to delete a line from a file after it has been used

I'm trying to create a script which makes requests to random URLs from a txt file, e.g.:
import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
But I want the line containing the URL to be erased from the file whenever a URL returns 404 Not Found. There is one unique URL per line, so basically the goal is to erase every URL (and its corresponding line) that returns 404 Not Found.
How can I accomplish this?
You could simply save all the URLs that worked, and then rewrite them to the file:
good_urls = []

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)

with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))
The easiest way is to read all the lines, loop over the saved lines and try to open them, and then when you are done, if any URLs failed you rewrite the file.
The way to rewrite the file is to write a new file, and then when the new file is successfully written and closed, then you use os.rename() to change the name of the new file to the name of the old file, overwriting the old file. This is the safe way to do it; you never overwrite the good file until you know you have the new file correctly written.
I think the simplest way to do this is just to create a list where you collect the good URLs, plus have a count of failed URLs. If the count is not zero, you need to rewrite the text file. Or, you can collect the bad URLs in another list. I did that in this example code. (I haven't tested this code but I think it should work.)
import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good_flag) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
        try:
            r = urllib2.urlopen(url)
            if r.code in (200, 401):
                print '[{0}]: '.format(url), "Up!"
            if r.code == 404:
                # URL is bad if it is missing (code 404)
                track(url, bad, r.code)
            else:
                # any code other than 404, assume URL is good
                track(url, good, r.code)
        except urllib2.URLError as e:
            track(url, bad, "exception!")

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good. Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file
EDIT: In comments, @abarnert suggests that there might be a problem with using os.rename() on Windows (at least I think that is what he/she means). If os.rename() doesn't work, you should be able to use shutil.move() instead.
EDIT: Rewrite code to handle errors.
EDIT: Rewrite to add verbose messages as URLs are tracked. This should help with debugging. Also, I actually tested this version and it works for me.
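For reference, a minimal sketch of that fallback (whether the destination is silently overwritten still depends on the platform):

import shutil

# Replace the os.remove()/os.rename() pair at the end of the script with:
shutil.move(temp_file, input_file)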
