Reading URLs from a file with Python Requests

I have a problem with my Python script, in which I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to loop over them, passing each one to requests.get(url). After that I write the output to a file named 'somefile.txt'.
I have the following Python script (Python 2.7, Windows 8):
from lxml import html
import requests

urls = ('URL1',
        'URL2',
        'URL3')

for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I have tried many tutorials but failed. The filename would be 'urllist.txt'. With the current script I only get the data from URL3; ideally I want to get the data for every URL in urllist.txt.
My attempt at reading the text file:
with open('urllist.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        page = requests.get(url)

You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes the leading and trailing whitespace (including tabs, newlines, and carriage returns) from the line.
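For example, '  http://example.com\n'.strip() returns 'http://example.com'.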
Do make sure you then process page inside the loop; if you run the code that extracts the data outside the loop, all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement, so Python closes it again for you:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
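For anyone on Python 3 rather than 2.7, a minimal sketch of the same approach (file names and XPath taken from the question; only the printing changes):
from lxml import html
import requests

with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print('Visitors: ', visitors)
        # write() expects a string and does not add a newline by itself
        output.write('Visitors: {}\n'.format(visitors))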

You should either save each page in a separate variable, or perform all of the processing inside the loop over the URL list.
As your code is written, by the time the parsing happens, page only holds the data for the last URL fetched, because you overwrite the page variable on every iteration.
Something like the following should append all the pages' info.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()
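For the first option mentioned above (saving each page before processing), a minimal sketch that keeps the question's Python 2.7 syntax and file name:
pages = []
for url in urls:
    pages.append(requests.get(url))

# Process the saved responses afterwards, appending one line per page
f = open('somefile.txt', 'a')
for page in pages:
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    print >> f, 'Visitors:', visitors
f.close()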

Related

How to follow webscraping progress with python

I'm running a Python script with BeautifulSoup in order to extract the text, topics, and tags from web articles. The website contains 210 pages, and each page contains 10 articles (each article's URL is stored in a txt file).
I'm using the following code:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []

with open('urls.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                text = soup.select_one('div.para_content_text').get_text(strip=True)
                topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
                tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
            except AttributeError:
                print(" ")
            data.append(
                {
                    'text': text,
                    'topic': topic,
                    'tags': tags
                }
            )
            pd.DataFrame(data).to_csv('text.csv', index=False, header=True)
            time.sleep(3)
My code seems to be correct, but it has been running for several days now.
I would like to understand whether an error is blocking progress or whether the process is simply very long.
To do this, I would like to add a "component" to my code that lets me track the number of URLs processed in real time.
Any ideas?
The way your code is written now, you accumulate all the data in memory until everything has been fetched. The easiest way to keep track of progress without changing the code too much is to print either the current URL or the count of URLs processed so far.
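A minimal sketch of that first option, assuming the same urls.txt file; enumerate() gives you a running counter to print next to each URL:
import requests

with open('urls.txt', 'r') as inf:
    urls = [row.strip() for row in inf if row.strip()]

for i, url in enumerate(urls, start=1):
    # Print progress before each request so you can see where the script is
    print(f"[{i}/{len(urls)}] fetching {url}")
    response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
    # ... parse and collect the data as before ...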
A better way, which involves changing the code a little more, is to write each row to the CSV file as you parse it, instead of all at once at the end. Something like:
print("text,topic,tags")
with open('urls.txt', 'r') as inf:
for row in inf:
url = row.strip()
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
# Getting the data you want...
print(f"{text},{topic},{tags}")
If you are going with this method, make sure to escape/remove commas, or use an actual CSV library to produce the lines.
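For instance, a sketch using the standard csv module so commas inside the scraped text are quoted automatically (selectors and column names are taken from the question; error handling is elided):
import csv

import requests
from bs4 import BeautifulSoup

with open('urls.txt', 'r') as inf, open('text.csv', 'w', newline='') as outf:
    writer = csv.writer(outf)
    writer.writerow(['text', 'topic', 'tags'])  # header row
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(response.text, "html.parser")
        text = soup.select_one('div.para_content_text').get_text(strip=True)
        topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
        tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
        writer.writerow([text, topic, tags])  # one row per URL, written immediately
        outf.flush()  # make progress visible on disk right away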

After reading urls from a text file, how can I save all the responses into separate files?

I have a script that reads URLs from a text file, performs a request, and then saves all the responses in one text file. How can I save each response in a different text file instead of all in the same file? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one .txt file. So for each request, the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
            print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
        print(data)
You can make a new file by using open('something.txt', 'w'). If the file already exists, its content will be erased; otherwise a new file named 'something.txt' is created. Now you can use file.write() to write your info!
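A quick illustration of the difference between the two modes (the filenames here are just examples):
with open('output1.txt', 'w') as f:  # 'w' creates the file, or erases it if it already exists
    f.write('first response\n')

with open('log.txt', 'a') as f:      # 'a' keeps existing content and appends to the end
    f.write('another line\n')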
I'm not sure if I understood your problem right.
I would create a list and add an object for each URL request and response, then write each object to a different file.
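A minimal sketch of that idea, assuming each object is simply a (url, response text) pair and the files are numbered as in the question:
import requests

with open('input.txt', 'r') as f_in:
    urls = [line.strip() for line in f_in if line.strip()]

# Build one (url, response text) object per request
results = [(url, requests.get(url).text) for url in urls]

# Write each collected response to its own numbered file
for i, (url, body) in enumerate(results, start=1):
    with open('output{}.txt'.format(i), 'w') as f_out:
        f_out.write(body)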
There are at least two ways you could generate a filename for each URL. One, shown below, is to create a hash of some unique piece of data from the page. In this case I chose the category text, but you could also hash the whole contents of the page. This creates a unique string to use as a filename, so that two links with the same category text don't overwrite each other when saved.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data from the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. The hash isn't there for security; it's simply a reliable way to produce a unique, filesystem-safe name.
Updated Snippet
import hashlib

import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
            print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the page. That way, if anything within the page changes, the previous contents aren't lost. In this case, I hash both the URL and the page contents and concatenate the two, with the URL hash first followed by the hash of the contents. That way, all versions of a page appear next to each other when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
    print(data)

Iterating website URLs from a text file into BeautifulSoup w/ Python

I have a .txt file with a different link on each line that I want to iterate over and parse with BeautifulSoup(response.text, "html.parser"). I'm having a couple of issues, though.
I can see the lines iterating from the text file, but when I assign them to my requests.get(websitelink), my code that previously worked (without iteration) no longer prints any data that I scrape.
All I receive are some blank lines in the results.
I'm new to Python and BeautifulSoup, so I'm not quite sure what I'm doing wrong. I've tried parsing the lines as a string, but that didn't seem to work.
import requests
from bs4 import BeautifulSoup, CData

filename = 'item_ids.txt'

with open(filename, "r") as fp:
    lines = fp.readlines()
    for line in lines:
        # Test to see if iteration from line to line works
        print(line)

        # Assign single line to websitelink
        websitelink = line

        # Parse websitelink into requests
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")

        # Initialize and reset vars for cd loop
        count = 0
        weapon = ''
        stats = ''

        # Iterate through CData on the page, and parse wanted data
        for cd in soup.findAll(text=True):
            if isinstance(cd, CData):
                # print(cd)
                count += 1
                if count == 1:
                    weapon = cd
                if count == 6:
                    stats = cd

        # Concatenate CData info
        both = weapon + " " + stats
        print(both)
The code should follow these steps:
Read a line (URL) from the text file, and assign it to a variable to be used with requests.get(websitelink)
BeautifulSoup scrapes that link for the CData and prints it
Repeat steps 1 and 2 until the final line of the text file (last URL)
Any help would be greatly appreciated,
Thanks
I don't know whether this will help you or not, but adding a strip() to your line variable when assigning it to websitelink made your code work for me. You could try it:
websitelink = line.strip()
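In context, a minimal sketch of where that change goes (the rest of the CData parsing stays as in the question):
import requests
from bs4 import BeautifulSoup

filename = 'item_ids.txt'
with open(filename, "r") as fp:
    for line in fp:
        # strip() removes the trailing newline that otherwise ends up inside the URL
        websitelink = line.strip()
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")
        # ... CData parsing as in the question ...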

How to input a list of URLs saved in a .txt to a Python program?

I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python; as a matter of fact this is my first script, so I tried to simply say url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. I then tried to apply read() and readlines() to it, but that didn't work because a 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each beginning on a new line, as the input of my simple script? Should I convert the string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I built some functions to display the info in the desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which the other methods are then applied to extract the content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct it immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests
with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing newline
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
This could help you:
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    print(url)
url_file.close()
You can apply it to your code as follows:
from newspaper import Article
from newspaper import fulltext
import requests
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    url = url.strip()  # drop the trailing newline before using the URL
    a = Article(url, language='pt')
    html = requests.get(url).text
    text = fulltext(html)
    download = a.download()
    parse = a.parse()
    nlp = a.nlp()
    title = a.title
    publish_date = a.publish_date
    authors = a.authors
    keywords = a.keywords
    summary = a.summary
url_file.close()

Save body text on csv file | Python 3

I am trying to create a database of several articles for text mining purposes.
I am extracting the body text via web scraping and then saving the body of these articles to a csv file. However, I couldn't manage to save all the body texts.
The code I came up with saves only the text of the last URL (article), while if I print what I am scraping (and what I am supposed to save) I get the body of all the articles.
I have included only some of the URLs from the list (which contains a larger number of URLs) just to give you an idea:
import requests
from bs4 import BeautifulSoup
import csv

r = ["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
     "http://www.nytimes.com/2013/06/16/magazine/the-effort-to-stop-the-attack.html",
     "http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html",
     "http://www.nytimes.com/2016/08/23/world/europe/france-terrorist-attacks.html",
     "http://www.nytimes.com/interactive/2016/09/09/us/document-Review-of-the-San-Bernardino-Terrorist-Shooting.html",
     ]

for url in r:
    t = requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    text = soup.find_all(("p", {"class": "story-body-text story-content"}))
    print(text)

with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(text)
Try declaring a variable such as all_text = "" before the for loop and adding the text to it with all_text += text + "\n" at the end of each iteration (the \n creates a new line).
Then, in the last row, write all_text instead of text.
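A minimal sketch of that suggestion (assumptions: the paragraphs of each article are joined into one string before being accumulated, and find_all is called with separate tag-name and attribute arguments rather than a single tuple):
import csv

import requests
from bs4 import BeautifulSoup

urls = ["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
        # ... the rest of the article URLs ...
        ]

all_text = ""
for url in urls:
    t = requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    paragraphs = soup.find_all("p", {"class": "story-body-text story-content"})
    # Join this article's paragraphs into a single string, then accumulate it
    body = " ".join(p.get_text(strip=True) for p in paragraphs)
    all_text += body + "\n"

with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    # Write one row per article instead of only the last one
    for line in all_text.splitlines():
        spamwriter.writerow([line])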
