Iterating website URLs from a text file into BeautifulSoup with Python

I have a .txt file with a different link on each line that I want to iterate, and parse into BeautifulSoup(response.text, "html.parser"). I'm having a couple issues though.
I can see the lines iterating from the text file, but when I assign them to my requests.get(websitelink), my code that previously worked (without iteration) no longer prints any data that I scrape.
All I receive are some blank lines in the results.
I'm new to Python and BeautifulSoup, so I'm not quite sure what I'm doing wrong. I've tried parsing the lines as a string, but that didn't seem to work.
import requests
from bs4 import BeautifulSoup, CData

filename = 'item_ids.txt'

with open(filename, "r") as fp:
    lines = fp.readlines()
    for line in lines:
        # Test to see if iteration from line to line works
        print(line)
        # Assign single line to websitelink
        websitelink = line
        # Parse websitelink into requests
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")

        # Initialize and reset vars for the cd loop
        count = 0
        weapon = ''
        stats = ''

        # Iterate through the CData on the page, and parse the wanted data
        for cd in soup.findAll(text=True):
            if isinstance(cd, CData):
                # print(cd)
                count += 1
                if count == 1:
                    weapon = cd
                if count == 6:
                    stats = cd

        # Concatenate the CData info
        both = weapon + " " + stats
        print(both)
The code should follow these steps:
1. Read a line (URL) from the text file, and assign it to a variable to be used with requests.get(websitelink)
2. BeautifulSoup scrapes that link for the CData and prints it
3. Repeat steps 1 and 2 until the final line of the text file (last URL)
Any help would be greatly appreciated,
Thanks

I don't know whether this will help you or not, but adding a strip() to the line when you assign it to websitelink made your code work for me; each line read from the file ends in a newline, which breaks the URL passed to requests.get(). You could try it:
websitelink = line.strip()
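
For reference, here is a minimal sketch of the reading loop with the stripped link (assuming the same item_ids.txt file and the CData check from the question; the printed summary is illustrative only):

import requests
from bs4 import BeautifulSoup, CData

with open('item_ids.txt', 'r') as fp:
    for line in fp:
        websitelink = line.strip()   # remove the trailing newline so requests gets a valid URL
        if not websitelink:
            continue                 # skip blank lines
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")
        cdata_blocks = [cd for cd in soup.findAll(text=True) if isinstance(cd, CData)]
        print(len(cdata_blocks), "CDATA blocks found on", websitelink)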

Related

After reading urls from a text file, how can I save all the responses into separate files?

I have a script that reads URLs from a text file, performs a request and then saves all the responses in one text file. How can I save each response in a different text file instead of all in the same file? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one .txt file. So for each request, the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file already exists, 'w' mode erases its contents; otherwise it creates a new file named 'something.txt'. You can then use file.write() to write your info, as in the short sketch below.
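
A minimal sketch (the filename and text are placeholders):

# 'w' truncates an existing file or creates a new one
with open('something.txt', 'w') as out_file:
    out_file.write('scraped data goes here\n')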
I'm not sure if I understood your problem right. I would create a list, add an object (for example a tuple) for each URL request and response, and then write each object to a different file, roughly as sketched below.
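
A rough sketch of that idea (the input file name and output naming scheme are illustrative assumptions):

import requests

# Collect (url, response_text) pairs first ...
results = []
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if line:
            results.append((line, requests.get(line).text))

# ... then write each pair to its own file
for i, (url, body) in enumerate(results, start=1):
    with open('output{}.txt'.format(i), 'w', encoding='utf-8') as out:
        out.write(url + '\n' + body)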
There are at least two ways you could generate files for each URL. One, shown below, is to create a hash of some unique data from the file. In this case I chose the category text, but you could also use the whole contents of the file. This creates a unique string to use as a file name so that results from different links don't overwrite each other when saved.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data on the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. (This kind of hash is not suitable for password storage on its own, but it's fine for creating unique filenames.)
Updated Snippet
import hashlib

import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the file. That way, if anything within the page changes, the previous contents of the page aren't lost. In this case, I hash both the URL and the file contents and concatenate the hashes, with the URL hash followed by the hash of the file contents. That way, all versions of a file are visible when the directory is sorted.
for category in categories:
    data = line + "," + category.text

    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))

    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))

    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)

Web scraping python: IndexError: list index out of range

The script reads a single URL from a text file and then imports information from that web page and store it in a CSV file. The script works fine for a single URL.
Problem: I have added several URLs in my text file line by line and now I want my script to read first URL, do the desired operation and then go back to text file to read the second URL and repeat.
Once I added the for loop to get this done, I started facing the below error:
Traceback (most recent call last):
  File "C:\Users\T947610\Desktop\hahah.py", line 22, in <module>
    table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
IndexError: list index out of range
f = open("URL.txt", 'r')
for line in f.readlines():
print (line)
page = requests.get(line)
print(page.status_code)
print(page.content)
soup = BeautifulSoup(page.text, 'html.parser')
print("soup command worked")
table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
rows = table.findAll("tr")
The IndexError isn't thrown by findAll itself: findAll returns an empty list when it can't find the data, and indexing [0] on that empty list is what raises the exception. I have this same issue and I work around it with try/except, though you'll probably need to deal with the empty values differently than I've shown here. For example:
f = open("URL.txt", 'r')
for line in f.readlines():
print (line)
page = requests.get(line)
print(page.status_code)
print(page.content)
soup = BeautifulSoup(page.text, 'html.parser')
print("soup command worked")
try:
table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
rows = table.findAll("tr")
except IndexError:
table = None
rows = None
If the single-URL input was working, the new input lines from the .txt file are probably the problem. Try applying .strip() to the line; a line read from a file normally has whitespace (including the trailing newline) at its head or tail:
page = requests.get(line.strip())
Also, if soup.findAll() finds nothing, it returns an empty list, and indexing [0] into an empty list raises the IndexError. Try printing the soup and checking its content, or guard the lookup as sketched below.
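
A small sketch combining both suggestions (the file name and table class are taken from the question; skipping and printing a message is just one possible way to handle a missing table):

import requests
from bs4 import BeautifulSoup

with open("URL.txt", "r") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        tables = soup.findAll("table", {"class": "display"})
        if not tables:  # nothing matched on this page, skip it instead of raising IndexError
            print("no 'display' table found for", url)
            continue
        rows = tables[0].findAll("tr")
        print(len(rows), "rows found for", url)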

Is there a better way to scrape this data?

For work, I was asked to create a spreadsheet of the names and addresses of all allopathic medical schools in the United States. Being new to Python, I thought that this would be the perfect situation to try web scraping. While I eventually wrote a program that returned the data I needed, I know that there is a better way to do it, as there were some extraneous characters (e.g. ", ], [) that I had to go into Excel and manually remove. I would just like to know if there was a better way I could have written this code so I can get what I needed, minus the extraneous characters.
Edit: I have also attached an image of the csv file that was created to show the extraneous characters that I'm speaking about.
from bs4 import BeautifulSoup
import requests
import csv

link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School"  # noqa
# link to the site we want to scrape from

page_response = requests.get(link)
# fetching the content using the requests library

soup = BeautifulSoup(page_response.text, "html.parser")
# Calling BeautifulSoup in order to parse our document

data = []
# Empty list for the first scrape. We only get one column with many rows.
# We still have the line break tags here </br>
for tr in soup.find_all('tr', {'valign': 'top'}):
    values = [td.get_text('</b>', strip=True) for td in tr.find_all('td')]
    data.append(values)

data2 = []
# New list that we'll use to have name on index i, address on index i+1
for i in data:
    test = list(str(i).split('</b>'))
    # Using the line breaks to our advantage.
    name = test[0].strip("['")
    '''Here we are saying that the name of the school is the first element
    before the first line break'''
    addy = test[1:]
    # The address is what comes after this first line break
    data2.append(name)
    data2.append(addy)
    # Append the name of the school and address to our new list.

school_name = data2[::2]
# Making a new list that consists of the school name
school_address = data2[1::2]
# Another list that consists of the school's address.

with open("Medschooltest.csv", 'w', encoding='utf-8') as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(zip(school_name, school_address))
    '''Zip the two together making a 2 column table with the schools name and
    it's address'''

print("CSV Completed!")
[Image: the created CSV file, showing the extraneous characters]
It seems applying conditional statements along with string manipulation can do the trick. I think the following script will get you very close to what you want.
from bs4 import BeautifulSoup
import requests
import csv

link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School"  # noqa

res = requests.get(link)
soup = BeautifulSoup(res.text, "html.parser")

with open("membersInfo.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["Name", "Address"])
    for tr in soup.find_all('table', class_='bodyTXT'):
        items = ', '.join([item.string for item in tr.select_one('td') if item.string != "\n" and item.string != None])
        name = items.split(",")[0].strip()
        address = items.split(name)[1].strip(",")
        writer.writerow([name, address])
If you have knowledge of SQL and the data is this structured, extracting it into a database would be the best solution; a rough sketch of that idea follows.
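
A minimal sketch using Python's built-in sqlite3 module (the database name, table name, and example row are illustrative assumptions, not part of the answer above):

import sqlite3

# Hypothetical rows in the (name, address) shape produced by the scraper above
schools = [
    ("Example School of Medicine", "123 Example St, Springfield"),
]

conn = sqlite3.connect("medschools.db")
conn.execute("CREATE TABLE IF NOT EXISTS schools (name TEXT, address TEXT)")
conn.executemany("INSERT INTO schools VALUES (?, ?)", schools)
conn.commit()

# Later, query the table instead of re-scraping the page
for name, address in conn.execute("SELECT name, address FROM schools"):
    print(name, "-", address)

conn.close()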

Python URLs in file Requests

I have a problem with my Python script in which I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to go over them to place them into the requests.get(url) object. After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7 - Windows 8):
from lxml import html
import requests

urls = ('URL1',
        'URL2',
        'URL3'
        )

for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I tried out many tutorials but failed. The filename would be 'urllist.txt'. In the current script I only get the data from URL3 - in an ideal case I want to get all data from urllist.txt.
My attempt at reading over the text file:

with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all whitespace (including tabs and newlines and carriage returns) from the line.
Do make sure you then process page in the loop; if you run your code to extract the data outside the loop all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement so Python closes it again:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
You should either save each page in a separate variable, or perform all the computation within the loop over the URL list.
Based on your code, by the time your page parsing happens, page will only contain the data from the last GET, since you are overwriting the page variable on each iteration.
Something like the following should append all the pages' info.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()

Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Apologies in advance for the long block of code following. I'm new to BeautifulSoup, but found there were some useful tutorials using it to scrape RSS feeds for blogs. Full disclosure: this is code adapted from this video tutorial which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.
Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write out each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to comment this for people to see quickly--it's the last comment beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.
Currently what the program does is takes the text from the last article read in by the program and writes that out to the number of .txt files that are indicated in the variable listIterator. So, in this case I believe there are 20 .txt files that get written out, but they all contain the text of the last article that's looped over. What I want the program to do is loop over each article and print the text of each article out to a separate .txt file. Sorry for the verbosity, but any insight would be really appreciated.
from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)

    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
The reason it seems that only the last article is written out is that every article is written to the same 20 files over and over again. Let's have a look at the following:
for i in paragList:
    para_string = para_string + str(i)

newlist = range(len(findPatTitle))
for i in newlist:
    ofile = open(str(listIterator[i])+'.txt', 'w')
    ofile.write(para_string)
    ofile.close()
You are writing para_string over and over again to the same 20 files on each iteration. What you need to do instead is append each article's para_string to a separate list, say paraStringList, and then write its contents to separate files, like so:
for i, var in enumerate(paraStringList):  # enumerate yields (index, item) tuples
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
Note that this needs to be outside of your main loop, i.e. for i in listIterator: (...). Here is a working version of the program:
from urllib import urlopen
from bs4 import BeautifulSoup
import re

webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))

paraStringList = []

for i in listIterator:
    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')

    para_string = ''
    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
