I have a list of URLs in a text file from which I want to fetch the article text, author, and article title. Once these three elements are obtained, I want them written to a file. So far I can read the URLs from the text file, but Python only prints out the URLs and one article (the final one). How can I rewrite my script so that Python reads and writes every single URL and its content?
I have the following Python script (version 2.7 - Mac OS X Yosemite):
from newspaper import Article

f = open('text.txt', 'r')  # text file containing the URLs
for line in f:
    print line

url = line
first_article = Article(url)
first_article.download()
first_article.parse()

# write/append to file
with open('anothertest.txt', 'a') as f:
    f.write(first_article.title)
    f.write(first_article.text)

print str(first_article.title)

for authors in first_article.authors:
    print authors
if not authors:
    print 'No author'

print str(first_article.text)
You're getting the last article because you're iterating over all the lines of the file:

for line in f:
    print line

and once the loop is over, line contains the last value:

url = line
If you move the contents of your code into the loop, so that:

with open('text.txt', 'r') as f:  # text file containing the URLs
    with open('anothertest.txt', 'a') as fout:
        for url in f:
            print(u"URL Line: {}".format(url.encode('utf-8')))
            # you might want to remove endlines and whitespace from
            # around the URL, which is what strip() does
            article = Article(url.strip())
            article.download()
            article.parse()
            # write/append to file
            fout.write(article.title)
            fout.write(article.text)
            print(u"Title: {}".format(article.title.encode('utf-8')))
            # print authors only if there are authors to show
            if len(article.authors) == 0:
                print('No author!')
            else:
                for author in article.authors:
                    print(u"Author: {}".format(author.encode('utf-8')))
            print("Text of the article:")
            print(article.text.encode('utf-8'))
I also made a few changes to improve your code:
- use with open() also for reading the file, to properly release the file descriptor when you don't need it anymore;
- call the output file fout to avoid shadowing the first file;
- open fout once, before entering the loop, to avoid opening and closing the file at each iteration;
- check the length of article.authors instead of checking for the existence of authors, as authors won't exist when you never enter the loop because article.authors is empty.
HTH
I am currently parsing an XML file to find a pattern and extract what I need from inside it.
Is there a way so that when I find the line I am looking for, I can count two lines down and grab that line?
with open(filepath) as f:
    for line in f:
        if pattern.search(line):
            # parse each line returned and return only the host names
            result = re.findall('"([^"]*)"', line)
            print(result)
Example XML:
<Computer3Properties name="UH25">
<Description property="Description">
<DescriptionValue value="lab" type="VTR" />
Output:
UH25
Desired output:
UH25
lab
Now I can't reparse the file and look for the pattern, because there are many instances of
<DescriptionValue value=
so I have to grab it once I find the hostname: go down the rows and scrape the data inside value.
I created an example.xml file containing the exact example contents you specified:
<Computer3Properties name="UH25">
<Description property="Description">
<DescriptionValue value="lab" type="VTR" />
This code:
import re

pattern = "UH25"
with open("path", "r") as file:
    for line in file:
        if re.search(pattern, line):
            file.readline()
            print(file.readline())
will print whichever line comes two lines after the line where the pattern match was found. Using the example file, you get <DescriptionValue value="lab" type="VTR" />. The reason this reaches two lines down is that the readline() method grabs the contents of the next line; calling it twice (as I did) therefore prints the second line after the match. You said your desired output was specifically printing 'lab' from that line. If so, only the print() line needs a slight modification:
import re

pattern = "UH25"
hostnames = []
with open("path", "r") as file:
    for line in file:
        if re.search(pattern, line):
            hostnames += re.findall(pattern, line)
            file.readline()
            hostnames.append(file.readline().split('"')[1])

for x in hostnames:
    print(x)
I have a script that reads URLs from a text file, performs a request, and then saves all the responses in one text file. How can I save each response in a different text file instead of all in the same file? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one .txt file. So for each request, the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file already exists, this erases its content; otherwise it makes a new file named 'something.txt'. Now you can use file.write() to write your info!
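A minimal sketch of that (the name something.txt is just a placeholder):

with open('something.txt', 'w') as file:
    # 'w' truncates an existing file or creates a new one
    file.write('your info\n')  # write() does not add a newline by itself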
I'm not sure if I understood your problem right.
I would create a list, create an object for each URL request and response, add the objects to the list, and then write each object to a different file.
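A rough sketch of that idea (the names input.txt, output1.txt, output2.txt follow the question; everything else here is an assumption):

import requests

# first pass: build a list with one (url, response text) object per URL
pairs = []
with open('input.txt') as f_in:
    for url in map(str.strip, f_in):
        if url:
            pairs.append((url, requests.get(url).text))

# second pass: write each object to its own numbered file
for i, (url, text) in enumerate(pairs, start=1):
    with open('output{}.txt'.format(i), 'w') as f_out:
        f_out.write(url + '\n' + text)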
There are at least two ways you could generate files for each URL. One, shown below, is to create a hash of some unique data of the file. In this case I chose the category text, but you could also use the whole contents of the file. This creates a reproducible string to use for a file name; note that two links with the same category text will still hash to the same name and overwrite each other, so hash the whole contents if that matters.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data on the Internet should not be trusted.
Here's your code with a SHA-256 hash used for a filename. A fast hash like SHA-256 is not suitable on its own for hashing passwords, but it's safe for creating unique filenames.
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the file. That way, if anything within the page changes, the previous contents of the page aren't lost. In this case, I hash both the URL and the file contents and concatenate the hashes, with the URL hash followed by the hash of the file contents. That way, all versions of a file are visible when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)
Please help, I need Python to compare text line(s) to words, like this.
with open('textfile', 'r') as f:
    contents = f.readlines()
    print(f_contents)
    if f_contents == "a":
        print("text")
I would also need it to read a certain line and compare that line. But when I run this program it does not do anything: no error messages, nor does it print text. Also,
how do you get Python to write on just line 1? When I try to do it, for some reason it combines both words together. Can someone help? Thank you!
What is f_contents? It's supposed to be just print(contents) after reading in each line and storing it to contents. Hope that helps :)
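In other words, a minimal correction of the snippet from the question:

with open('textfile', 'r') as f:
    contents = f.readlines()
    print(contents)  # contents is a list of lines, not a single string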
An example of reading a file's content:
with open("criticaldocuments.txt", "r") as f:
for line in f:
print(line)
#prints all the lines in this file
#allows the user to iterate over the file line by line
OR what you want is something like this using readlines():
with open("criticaldocuments.txt", "r") as f:
contents = f.readlines()
#readlines() will store each and every line into var contents
if contents == None:
print("No lines were stored, file execution failed most likely")
elif contents == "Password is Password":
print("We cracked it")
else:
print(contents)
# this returns all the lines if no matches
Note:
contents = f.readlines()
can be done like this too:
for line in f.readlines():
    # this eliminates the ambiguity of what 'contents' is doing,
    # and you could work through the rest of the code the same way,
    # except replacing 'contents' with 'line'
I should preface that I am a complete Python newbie.
I'm trying to create a script that will loop through a directory and its subdirectories looking for text files. When it encounters a text file it will parse the file, convert it to NITF XML, and upload it to an FTP directory.
At this point I am still working on reading the text file into variables so that they can be inserted into the XML document in the right places. An example of the text file is as follows.
Headline
Subhead
By A person
Paragraph text.
And here is the code I have so far:
with open("path/to/textFile.txt") as f:
#content = f.readlines()
head,sub,auth = [f.readline().strip() for i in range(3)]
data=f.read()
pth = os.getcwd()
print head,sub,auth,data,pth
My question is: how do I iterate through the body of the text file (data) and wrap each line in HTML P tags? For example:
<P>line of text in file</P> <P>Next line in text file</P>
Something like:

output_format = '<p>{}</p>\n'.format

with open('input') as fin, open('output', 'w') as fout:
    fout.writelines(output_format(line.strip()) for line in fin)
This assumes that you want to write the new content back to the original file:
with open('path/to/textFile.txt') as f:
    content = f.readlines()

with open('path/to/textFile.txt', 'w') as f:
    for line in content:
        f.write('<p>' + line.strip() + '</p>\n')
import shutil

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write('<P>{0}</P>\n'.format(line[:-1]))  # slice off the newline; same as line.rstrip('\n')

# Only do this once you're sure the script works :)
shutil.move('outfile', 'infile')  # need to replace the input file with the output file
In your case, you should probably replace
data=f.read()
with:
data = '\n'.join("<p>%s</p>" % l.strip() for l in f)
Use data=f.readlines() here,
and then iterate over data and try something like this:

for line in data:
    line = "<p>" + line.strip() + "</p>"
    # write line + '\n' to a file or do something else
Append the <p> and </p> for each line,
ex:

data_new = []
data = f.readlines()
for lines in data:
    data_new.append("<p>%s</p>\n" % lines.strip())  # strip() already removes the newline
You could use the fileinput module to modify one or more files in-place, with optional backup file creation if desired (see its documentation for details). Here's it being used to process one file.
import fileinput

for line in fileinput.input('testinput.txt', inplace=1):
    print '<P>' + line[:-1] + '</P>'
The 'testinput.txt' argument could also be a sequence of two or more file names instead of just a single one, which could be useful especially if you're using os.walk() to generate the list of files in the directory and its subdirectories to process (as you probably should be doing).
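For instance, a rough sketch of that combination (the directory root '.' and the .txt filter here are assumptions, not part of the original code):

import fileinput
import os

# gather every .txt file under the current directory and its subdirectories
txt_files = []
for dirpath, dirnames, filenames in os.walk('.'):
    for name in filenames:
        if name.endswith('.txt'):
            txt_files.append(os.path.join(dirpath, name))

# fileinput.input() accepts a list of names and edits each file in-place
for line in fileinput.input(txt_files, inplace=1):
    print '<P>' + line[:-1] + '</P>'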
I have got a txt file in the format of:
line_1
line_2
line_3
I am trying to read it into a list and display it on a web page just as it looks inside the txt file: one line under another. Here is my code:
@cherrypy.expose
def readStatus(self):
    f = open("directory", "r")
    lines = "\n".join(f.readlines())
    f.close()
    page += "<p>%s</p>" % (lines)
However, the output I have been getting is:
line_1 line_2 line_3
It would be great if someone could give me a hint as to what to do so line_1, line_2 and line_3 are displayed on 3 separate lines inside the web browser.
Thanks in advance.
You're wrapping paragraph tags around all of the lines at once. You probably meant to put paragraph tags around each line individually:
with open("directory", "r") as f:
page = "\n".join("<p>%s</p>" % line for line in f)
Or, more semantically, you could put it all in an unordered list:
with open("directory", "r") as f:
page = '<ul>%s</ul>' % "\n".join("<li>%s</li>" % line for line in f)
Alternatively, you could put it all inside of a pre (preformatted text) tag:
with open('directory', 'r') as f:
    page = '<pre>%s</pre>' % f.read()
Additionally, you might want to consider escaping the lines with cgi.escape so browsers don't interpret any special characters in them.
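For example, a small sketch of that (cgi.escape exists on Python 2; on Python 3.8+ it is gone, so use html.escape instead):

import cgi

with open("directory", "r") as f:
    page = "\n".join("<p>%s</p>" % cgi.escape(line.strip()) for line in f)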