Check if a URL path exists, from a text file - Python

I am trying to do the following things:
1 - open a text file containing a list of URLs (http://example.com).
2 - read the text file and check if the path exists.
3 - write the results to another text file.
I have tried the following code:
import urllib2
file = open('file.txt', 'r')
search = urllib2.urlopen(file + "/js/tools.js")
if search.code == 200:
    print "Exists!"
I appreciate any help provided.

Considering you have a file filename where the links are stored line by line:
import requests

with open(filename, 'r') as file, open(filename2, 'w+') as file2:
    for url in file.readlines():
        check = requests.get(url)
        if check.ok:
            file2.write(url)
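If, as in the original attempt, you actually want to check a specific path such as /js/tools.js under each base URL, a minimal sketch could look like the following (the file names urls.txt and results.txt are placeholders, and the timeout/exception handling is an addition so one unreachable host doesn't stop the whole run):

import requests

# Check whether <base_url>/js/tools.js exists for every URL in urls.txt
# and record the result in results.txt (both names are placeholders).
with open('urls.txt', 'r') as infile, open('results.txt', 'w') as outfile:
    for line in infile:
        base_url = line.strip()  # drop the trailing newline and spaces
        if not base_url:
            continue
        target = base_url.rstrip('/') + '/js/tools.js'
        try:
            response = requests.head(target, timeout=10, allow_redirects=True)
            status = 'Exists!' if response.status_code == 200 else 'Missing ({})'.format(response.status_code)
        except requests.RequestException as exc:
            status = 'Error: {}'.format(exc)
        outfile.write('{} -> {}\n'.format(target, status))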

Related

Can't write into the file in Python and startswith not working

I have a problem. I have a task, "download by ID".
This is my previous program, which downloads text (a PDB file):
from urllib.request import urlopen

def download(inf):
    url = xxxxxxxxxxx
    response = urlopen(xxx)
    text = response.read().decode('utf-8')
    return text

new_download = download('154')
It works perfectly, but the function that I must create doesn't write to the file the lines which start with num:
from urllib.request import urlopen  # module for URL processing

with open('new_test', 'w') as a:
    for sent in text:  # for every line in the sequences file
        if line.startswith('num'):
            line1.writeline(sent)
You're not iterating over the lines, you're iterating over the characters. Change
for line in data2:
to
for line in data2.splitlines():
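Putting the two pieces together, a minimal sketch could look like this (the URL pattern is a placeholder standing in for the xxxxxxxxxxx in the question; the 'num' prefix and the new_test file name are taken from it):

from urllib.request import urlopen

def download(inf):
    url = 'http://example.com/{}'.format(inf)  # placeholder; use your real URL pattern
    response = urlopen(url)
    return response.read().decode('utf-8')

text = download('154')
with open('new_test', 'w') as out:
    for line in text.splitlines():  # iterate over lines, not characters
        if line.startswith('num'):  # keep only the lines that start with 'num'
            out.write(line + '\n')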

After reading urls from a text file, how can I save all the responses into separate files?

I have a script that reads URLs from a text file, performs a request, and then saves all the responses in one text file. How can I save each response in a different text file instead? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one. So for each request, the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class":'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file already exists, it'll erase its content; otherwise it'll create a new file named 'something.txt'. Then you can use file.write() to write your info!
I'm not sure if I understood your problem right.
I would create a list and build an object for each URL request and response, add the objects to the list, and then write each object to a different file, as sketched below.
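A minimal sketch of that idea (input.txt and the output1.txt, output2.txt, ... naming come from the question; the simple (url, text) pairs are an assumption, since the original answer gives no code):

import requests

# Collect (url, response text) pairs first, then write one file per pair.
results = []
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        results.append((line, response.text))

for i, (url, text) in enumerate(results, start=1):
    with open('output{}.txt'.format(i), 'w') as f_out:
        f_out.write(url + '\n')
        f_out.write(text)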
There are at least two ways you could generate files for each URL. One, shown below, is to create a hash of some unique data from the page. In this case I chose the category text, but you could also use the whole contents of the file. This creates a unique string to use for a file name so that saved files don't overwrite each other.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves since data on the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. The hash isn't used for anything security-sensitive here; it just creates unique filenames.
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class":'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the file. That way, if anything within the page changes, the previous contents of the page aren't lost. In this case, I hash both the URL and the file contents and concatenate the hashes, with the URL hash followed by the hash of the file contents. That way, all versions of a file are visible together when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)

Finding it hard to match only URLs from a text file

My text file consists of:
http://www.makemytrip.com/
http://www.makemytrip.com/blog/dil-toh-roaming-hai?intid=Blog_HPHeader_Logo //how do i remove /dil-toh-roaming-hai?intid=Blog_HPHeader_Logo
http://www.makemytrip.com/rewards/?intid=New_ch_mtr_na
javascript:void(0) //how do i remove this
javascript:void(0)
javascript:void(0)
http://www.makemytrip.com/rewards/?intid=new_ch_mtr_dropdwn
https://support.makemytrip.com/MyAccount/MyTripReward/DashBoard
https://support.makemytrip.com/MyAccount/User/User
https://support.makemytrip.com/MyAccount/MyBookings/BookingSummary/
https://support.makemytrip.com/customersupports.aspx?actiontype=PRINTETICKET
How do I go about keeping only the URLs and saving them in another file so that I can parse them one at a time? I tried this Python code, but it matches and opens the 1st URL only.
import urllib
with open("s.txt","r") as file:
for line in file:
url = urllib.urlopen(line)
read = url.read()
print read
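Since no answer is shown here, a minimal sketch of one approach: keep only the lines that actually look like URLs and write them to a second file, which you can then read back and fetch one at a time (s.txt comes from the question; urls_only.txt is an assumed name):

import urlparse  # Python 2, to match the question's urllib usage

with open("s.txt", "r") as infile, open("urls_only.txt", "w") as outfile:
    for line in infile:
        line = line.strip()
        if not line.startswith(("http://", "https://")):
            continue  # skips javascript:void(0) and blank lines
        parsed = urlparse.urlparse(line)
        # keep scheme, host and path; drop the ?intid=... query string
        # (use (parsed.scheme, parsed.netloc, '', '', '', '') to keep only the site root)
        cleaned = urlparse.urlunparse((parsed.scheme, parsed.netloc, parsed.path, '', '', ''))
        outfile.write(cleaned + "\n")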

ElementTree errors, html files will not parse using Python/Sublime

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks, but the first one is this: I can not get it to properly parse the file. Below is a brief explanation, the python code and the traceback info.
Using Python & Sublime to parse html files, I am getting several errors. What IS working: it runs fine up until if '.html' in file:. It does not execute that loop. It will iterate through print allFiles just fine. It also creates the csv file and creates the headers (though not in separate columns, but I can ask about that later).
It seems that the problem is in the tree = ET.parse(HTML_PATH+"/"+file) line. I've written this several different ways (without "/" and/or file, for example); so far I have yet to resolve this problem.
If I can provide more information or if anyone can direct me to other documentation, it would be greatly appreciated. So far I have yet to find anything that addresses this issue.
Many thanks for your thoughts.
//C
# Parses out data from crawled html files under "html files"
# and places the output in output.csv.
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv

# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'

f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])

# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"

allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"

for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = ET.parse(HTML_PATH+"/"+file)
        print '===================='
        print 'Parsing file: '+file
        print '===================='
        for node in tree.iter():
            print "tbody"
            # The tbody attribute spells it all (or does it):
            name = node.attrib.get('/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font')
            # Check common header stuff
            if name=='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                #print ' ------------------'
                #print ' Category:'
                category=node.text
                print "category"

f.close()
Traceback:
File "/Users/C/data/Folder_NS/data_parse.py", line 34, in
tree = ET.parse(HTML_PATH+"/"+file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 63, column 2
You are trying to parse HTML with an XML parser, and valid HTML is not always valid XML. You would be better off using the HTML parsing library in the lxml package.
import xml.etree.ElementTree as ET
# ...
tree = ET.parse(HTML_PATH + '/' + file)
would be changed to
import lxml.html
# ...
tree = lxml.html.parse(HTML_PATH + '/' + file)
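For completeness, a minimal sketch of how the resulting tree can then be walked with lxml (the file name is a placeholder, and the XPath is just an example, not the question's exact page layout):

import lxml.html

tree = lxml.html.parse('/Users/C/data/Folder_NS/example.html')  # placeholder file
root = tree.getroot()

# Iterate over every element, as the original code did with tree.iter()
for node in root.iter():
    pass  # inspect node.tag, node.attrib, node.text here

# Or query directly with XPath instead of matching a long attribute key
for font in root.xpath('//font'):
    print font.text_content()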

Python Read URLs from File and print to file

I have a list of URLs in a text file from which I want to fetch the article text, author, and article title. When these three elements are obtained, I want them written to a file. So far I can read the URLs from the text file, but Python only prints out the URLs and one article (the final one). How can I rewrite my script so that Python reads and writes every single URL and its content?
I have the following Python script (version 2.7 - Mac OS X Yosemite):
from newspaper import Article
f = open('text.txt', 'r') #text file containing the URLS
for line in f:
    print line
url = line
first_article = Article(url)
first_article.download()
first_article.parse()
# write/append to file
with open('anothertest.txt', 'a') as f:
    f.write(first_article.title)
    f.write(first_article.text)
print str(first_article.title)
for authors in first_article.authors:
    print authors
if not authors:
    print 'No author'
print str(first_article.text)
You're getting the last article, because you're iterating over all the lines of the file:
for line in f:
print line
and once the loop is over, line contains the last value.
url = line
If you move the contents of your code within the loop, so that:
from newspaper import Article

with open('text.txt', 'r') as f: #text file containing the URLS
    with open('anothertest.txt', 'a') as fout:
        for url in f:
            print(u"URL Line: {}".format(url.encode('utf-8')))
            # you might want to remove endlines and whitespaces from
            # around the URL, which is what strip() does
            article = Article(url.strip())
            article.download()
            article.parse()
            # write/append to file
            fout.write(article.title)
            fout.write(article.text)
            print(u"Title: {}".format(article.title.encode('utf-8')))
            # print authors only if there are authors to show.
            if len(article.authors) == 0:
                print('No author!')
            else:
                for author in article.authors:
                    print(u"Author: {}".format(author.encode('utf-8')))
            print("Text of the article:")
            print(article.text.encode('utf-8'))
I also made a few changes to improve your code:
- use with open() also for reading the file, to properly release the file descriptor when you don't need it anymore;
- call the output file fout to avoid shadowing the first file;
- open fout once, before entering the loop, to avoid opening/closing the file at each iteration;
- check the length of article.authors instead of checking for the existence of authors, as authors won't exist when you don't get into the loop because article.authors is empty.
HTH
