How to properly extract URLs from HTML code? - python

I have saved a website's HTML code in a .txt file on my computer. I would like to extract all URLs from this text file using the following code:
def get_net_target(page):
    start_link = page.find("href=")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url

my_file = open("test12.txt")
page = my_file.read()
print(get_net_target(page))
However, the script only prints the first URL, but not all other links. Why is this?

You need to implement a loop to go through all URLs.
print(get_net_target(page)) only prints the first URL found in page, so you need to call the function again and again, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.
To get you started: next_index will store the position where the last URL ended, and the loop will then retrieve the following URLs:
next_index = 0  # the position in page from which the next URL search starts

def get_net_target(page):
    global next_index
    start_link = page.find("href=")
    if start_link == -1:  # no more URL
        return ""
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    next_index = end_quote + 1  # resume the search just after the closing quote
    url = page[start_quote + 1:end_quote]
    return url

my_file = open("test12.txt")
page = my_file.read()
while True:
    url = get_net_target(page)
    if url == "":  # no more URL
        break
    print(url)
    page = page[next_index:]  # continue with the rest of the page
Also be careful: this only retrieves links enclosed in double quotes ("), but href values can be enclosed in single quotes (') or not quoted at all...
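For anything beyond a quick one-off, a real parser handles all three quoting styles for you. A minimal sketch using only the standard library's html.parser (the sample links are invented for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    '''Collect the href value of every start tag, however it is quoted.'''
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already stripped
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="http://a.example">A</a> <a href=\'http://b.example\'>B</a>')
print(parser.links)  # ['http://a.example', 'http://b.example']
```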

Related

unable to print multiple values in django

I have code that tests whether different directories exist under a URL. For example, in www.xyz.com/admin.php, admin.php is the directory or page being checked.
I check these pages or directories by reading from a text file.
Suppose the following is file.txt:
index.php
members.html
login.php
and this is the function in views.py
import os
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def pagecheck(st):
    url = st
    print("Available links:")
    module_dir = os.path.dirname(__file__)
    file_path = os.path.join(module_dir, 'file.txt')
    data_file = open(file_path, 'r')
    while True:
        sub_link = data_file.readline()
        if not sub_link:
            break
        req_link = url + "/" + sub_link
        req = Request(req_link)
        try:
            response = urlopen(req)
        except HTTPError as e:
            continue
        except URLError as e:
            continue
        else:
            print(" " + req_link)
The code works fine and prints all the web pages that actually exist to the console.
But when I try to return the result at the end, to print it in the Django page:
return req_link
print(" " + req_link)
it only shows the first page that makes a connection from file.txt. Even when all the web pages listed in file.txt actually exist on the website, it prints them all to the console but returns only a single page in the Django app.
I tried using a for loop, but it didn't work.
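The likely cause is that return exits pagecheck on the first reachable link, so only one value ever reaches the template. A common fix (a sketch, with a stand-in check function instead of the real urlopen call) is to collect the hits in a list and return the list once, after the loop:

```python
def pagecheck_all(url, sub_links, check):
    '''Return every reachable sub-link instead of exiting on the first one.'''
    found = []
    for sub_link in sub_links:
        req_link = url + "/" + sub_link.strip()
        if check(req_link):          # in the real code: urlopen() without raising
            found.append(req_link)
    return found                     # one return, after the whole loop

# Hypothetical checker for illustration; the view would pass a urlopen wrapper.
print(pagecheck_all("www.xyz.com", ["index.php", "members.html"],
                    lambda link: link.endswith(".php")))
# ['www.xyz.com/index.php']
```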

How to remove urls from file which has 404 status code using python remove function?

I have to remove URLs that return a 404 status from a file, using Python's list remove function, but I am not sure why it is not working.
Code:
#!/usr/bin/python
import requests

url_lines = open('url.txt').read().splitlines()
for url in url_lines:
    remove_url = requests.get(url)
    if remove_url.status_code == 404:
        print(remove_url.status_code)
        url_lines.remove(url)
url.txt file contains following lines:
https://www.amazon.co.uk/jksdkkhsdhk
http://www.google.com
Line https://www.amazon.co.uk/jksdkkhsdhk should be removed from url.txt file.
Thank you so much for help in advance.
You could just skip it:
if remove_url.status_code == 404:
    continue
You shouldn't try to remove it while inside the for loop. Instead, add it to another list, remove_from_urls, and after the for loop remove all the URLs collected in that new list. This could be done by:
remove_from_urls = []
for url in url_lines:
    remove_url = requests.get(url)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue
    # Code for handling non-404 requests

url_lines = [url for url in url_lines if url not in remove_from_urls]

# Save urls example
with open('urls.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')
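The reason remove() inside the loop misbehaves is worth seeing in isolation: deleting an element shifts everything after it one slot left, so the iterator silently skips the next item. A tiny demonstration (the strings are made up):

```python
urls = ["bad1", "bad2", "good"]
for u in urls:
    if u.startswith("bad"):
        urls.remove(u)   # shifts the list; the loop then skips "bad2"
print(urls)  # ['bad2', 'good'] -- "bad2" survived because it was never visited
```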

Iterating website URLs from a text file into BeautifulSoup w/ Python

I have a .txt file with a different link on each line that I want to iterate over and parse into BeautifulSoup(response.text, "html.parser"). I'm having a couple of issues, though.
I can see the lines iterating from the text file, but when I assign them to my requests.get(websitelink), my code that previously worked (without iteration) no longer prints any data that I scrape.
All I receive are some blank lines in the results.
I'm new to Python and BeautifulSoup, so I'm not quite sure what I'm doing wrong. I've tried parsing the lines as strings, but that didn't seem to work.
import requests
from bs4 import BeautifulSoup, CData

filename = 'item_ids.txt'
with open(filename, "r") as fp:
    lines = fp.readlines()
    for line in lines:
        # Test to see if iteration from line to line works
        print(line)
        # Assign single line to websitelink
        websitelink = line
        # Pass websitelink to requests
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")
        # initialize and reset vars for cd loop
        count = 0
        weapon = ''
        stats = ''
        # iterate through cdata on page, and parse wanted data
        for cd in soup.findAll(text=True):
            if isinstance(cd, CData):
                # print(cd)
                count += 1
                if count == 1:
                    weapon = cd
                if count == 6:
                    stats = cd
        # concatenate cdata info
        both = weapon + " " + stats
        print(both)
The code should follow these steps:
1. Read a line (URL) from the text file, and assign it to a variable to be used with requests.get(websitelink)
2. BeautifulSoup scrapes that link for the CData and prints it
3. Repeat steps 1 & 2 until the final line of the text file (last URL)
Any help would be greatly appreciated,
Thanks
I don't know whether this will help you or not, but adding a strip() to your line variable when assigning it to websitelink made your code work for me. You could try it:
websitelink = line.strip()
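The reason strip() matters here: readlines() keeps the trailing newline on every line, so the URL handed to requests.get() ends in "\n". A quick way to see it (the URL is invented):

```python
lines = ["https://example.com/item/1\n"]  # what readlines() gives you
url = lines[0]
print(repr(url))          # 'https://example.com/item/1\n' -- note the newline
print(repr(url.strip()))  # 'https://example.com/item/1'
```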

Downloadable link from public box link [closed]

Closed 6 years ago.
I'd like to be able to retrieve a pdf from a public Box link through python, but I'm not quite sure how I can do this. Here's an example of the type of pdf I hope to be able to download:
https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje
I can click the download button or click a button to get a printable link on my browser, but I haven't been able to find the link to this page in the source html. Is there a way to find this link programmatically? Perhaps through selenium or requests or even through the box API?
Thanks a lot for the help!
This is code to get the download link of the pdf:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def get_html(url, timeout=15):
    '''Return the html of url.
    Usually html = urlopen(url) is enough, but sometimes it doesn't work.
    Instead of urllib.request you can use any other method to get the html
    code of url, like urllib or urllib2 (just search online), but I
    think urllib.request comes with the python installation.'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except:
            pass
    return html

def get_soup(html):
    '''Return the soup of the html code.
    Beautiful Soup is a Python library for pulling data out of HTML
    and XML files. It works with your favorite parser to provide idiomatic
    ways of navigating, searching, and modifying the parse tree. It
    commonly saves programmers hours or days of work.
    More at https://www.crummy.com/software/BeautifulSoup/bs4/doc/'''
    soup = BeautifulSoup(html, "lxml")
    ## if it doesn't work, instead of "lxml" you can use any of these options:
    ## soup = BeautifulSoup(html, "html.parser")
    ## soup = BeautifulSoup(html, "lxml-xml")
    ## soup = BeautifulSoup(html, "xml")
    ## soup = BeautifulSoup(markup, "html5lib")
    return soup
def get_data_file_id(html):
    '''Return the data-file-id found in the html code.'''
    ## to scrape a website I suggest using BeautifulSoup;
    ## you can do it manually using html.read(), which gives you the
    ## html code as a string, and then do some string searching
    soup = get_soup(html)
    ## the part of the html code we are interested in is:
    ## <div class="preview" data-module="preview" data-file-id="69950302561" data-file-version-id="">
    ## we want to extract this data-file-id
    ## first we find the div it's located in
    classifier = {"class": 'preview'}  ## classifier specifies the div we are looking for
    div = soup.find('div', classifier)  ## we get the div which has class 'preview'
    ## now we can easily get the data-file-id by using
    data_file_id = div.get('data-file-id')
    return data_file_id

## you can install BeautifulSoup from:
## on windows http://www.lfd.uci.edu/~gohlke/pythonlibs/
## or from https://pypi.python.org/pypi/beautifulsoup4/4.4.1
## the official page is https://www.crummy.com/software/BeautifulSoup/
## if you don't want to use BeautifulSoup, you could do something like this:
##
## html_str = str(html.read())
## search_for = 'div class="preview" data-module="preview" data-file-id="'
## start = html_str.find(search_for) + len(search_for)
## end = html_str.find('"', start)
## data_file_id = html_str[start:end]
##
## it may seem easier than using BeautifulSoup, but the problem is that
## if there is one more space in search_for, or the order of the div attributes
## is different, or the sign " is used instead of ' and vice versa, this string
## searching won't work while BeautifulSoup will, so I recommend using BeautifulSoup

def get_url_id(url):
    '''Return the url_id, which is the last part of the url.'''
    reverse_url = url[::-1]
    start = len(url) - reverse_url.find('/')  # position just after the last '/' in url
    url_id = url[start:]
    return url_id
def get_download_url(url_id, data_file_id):
    '''Return the download_url.'''
    start = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name='
    download_url = start + url_id + '&file_id=f_' + data_file_id
    return download_url

## url = 'https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje'
url = 'https://fnn.app.box.com/s/n74mnmrwyrmtiooqwppqjkrd1hhf3t3j'
html = get_html(url)
data_file_id = get_data_file_id(html)  ## we need data_file_id to create the download url
url_id = get_url_id(url)  ## we need url_id to create the download url
download_url = get_download_url(url_id, data_file_id)
## this actually isn't the real download url;
## you can get the real url by using:
## real_download_url = get_html(download_url).geturl()
## but you will get a really long url; for your example it would be a very long
## signed https://dl.boxcloud.com/d/1/.../download link,
## and we don't really care about the real download url, so I will just use download_url
print(download_url)
I also wrote code to download that pdf:
from urllib.request import Request, urlopen

def get_html(url, timeout=15):
    '''Return the html of url (same helper as in the script above).'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except:
            pass
    return html
def get_current_path():
    '''Return the path of the folder in which the python program is saved.'''
    try:
        path = __file__
    except:
        try:
            import sys
            path = sys.argv[0]
        except:
            path = ''
    if path:
        if '\\' in path:
            path = path.replace('\\', '/')
        end = len(path) - path[::-1].find('/')
        path = path[:end]
    return path

def check_if_name_already_exists(name, path):
    '''Check if a pdf file with the same name already exists
    in the folder given by path.'''
    try:
        file = open(path + name + '.pdf', 'r')
        file.close()
        return True
    except:
        return False
def get_new_name(old_name, path):
    '''Ask the user to enter a new name for the file and return it.'''
    print('File with name "{}" already exists.'.format(old_name))
    answer = input('Would you like to replace it (answer with "r")\nor create a new one (answer with "n")? ')
    while answer not in 'rRnN':
        print('Your answer is inconclusive')
        print('Please answer again:')
        print('if you would like to replace the existing file, answer with "r"')
        print('if you would like to create a new one, answer with "n"')
        answer = input('Would you like to replace it (answer with "r")\nor create a new one (answer with "n")? ')
    if answer in 'nN':
        new_name = input('Enter new name for file: ')
        if check_if_name_already_exists(new_name, path):
            return get_new_name(new_name, path)
        else:
            return new_name
    if answer in 'rR':
        return old_name
def download_pdf(url, name='document1', path=None):
    '''Download a pdf file from its url.
    The required argument is the url of the pdf file;
    optional arguments are a name for the saved pdf file
    and a path if you want to choose where the file is saved.
    The path must look like:
    'C:\\Users\\Computer name\\Desktop' or
    'C:/Users/Computer name/Desktop' '''
    # and not like
    # 'C:\Users\Computer name\Desktop'
    pdf = get_html(url)
    name = name.replace('.pdf', '')
    if path is None:
        path = get_current_path()
    if '\\' in path:
        path = path.replace('\\', '/')
    if path and path[-1] != '/':  # guard against an empty path
        path += '/'
    if path:
        check = check_if_name_already_exists(name, path)
        if check:
            if name == 'document1':
                i = 2
                name = 'document' + str(i)
                while check_if_name_already_exists(name, path):
                    i += 1
                    name = 'document' + str(i)
            else:
                name = get_new_name(name, path)
        file = open(path + name + '.pdf', 'wb')
    else:
        file = open(name + '.pdf', 'wb')
    file.write(pdf.read())
    file.close()
    if path:
        print(name + '.pdf file downloaded in folder "{}".'.format(path))
    else:
        print(name + '.pdf file downloaded.')
    return

download_url = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name=n74mnmrwyrmtiooqwppqjkrd1hhf3t3j&file_id=f_53868474893'
download_pdf(download_url)
Hope it helps, let me know if it works.
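As an aside, the download step itself can be compressed quite a bit with the standard library's shutil.copyfileobj, which streams the response to disk in chunks instead of reading the whole pdf into memory. A minimal sketch (download_file is a name I made up; error handling and the name-collision prompts from the script above are left out):

```python
import shutil
from urllib.request import Request, urlopen

def download_file(url, filename, timeout=15):
    '''Stream the resource at url straight to filename on disk.'''
    # The User-Agent header is sent up front instead of retrying on failure.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req, timeout=timeout) as response, open(filename, 'wb') as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all in memory
```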

Python URLs in file Requests

I have a problem with my Python script, in which I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to go over them, placing each into a requests.get(url) call. After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7 - Windows 8):
from lxml import html
import requests

urls = ('URL1',
        'URL2',
        'URL3'
        )
for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I tried many tutorials but failed. The filename would be 'urllist.txt'. In the current script I only get the data from URL3; ideally I want to get all data from urllist.txt.
My attempt at reading over the text file:
with open('urllist.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all whitespace (including tabs and newlines and carriage returns) from the line.
Do make sure you then process page in the loop; if you run your code to extract the data outside the loop all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement so Python closes it again:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
You should either save each page in a separate variable, or perform all the computation within the loop over the url list.
Based on your code, by the time your page parsing happens it will only contain the data for the last page fetched, since you are overwriting the page variable in each iteration.
Something like the following should append all the pages' info.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()