Is there a better way to scrape this data? - python

For work, I was asked to create a spreadsheet of the names and addresses of all allopathic medical schools in the United States. Being new to Python, I thought this would be the perfect opportunity to try web scraping. While I eventually wrote a program that returned the data I needed, I know there is a better way to do it, because the output contained extraneous characters (e.g. ", ], [) that I had to remove manually in Excel. I would like to know how I could have written this code to get the data I needed without the extraneous characters.
Edit: I have also attached an image of the csv file that was created to show the extraneous characters that I'm speaking about.
from bs4 import BeautifulSoup
import requests
import csv
link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
# link to the site we want to scrape from
page_response = requests.get(link)
# fetching the content using the requests library
soup = BeautifulSoup(page_response.text, "html.parser")
# Calling BeautifulSoup in order to parse our document
data = []
# Empty list for the first scrape. We only get one column with many rows.
# We still have the line break tags here </br>
for tr in soup.find_all('tr', {'valign': 'top'}):
    values = [td.get_text('</b>', strip=True) for td in tr.find_all('td')]
    data.append(values)
data2 = []
# New list that we'll use to have name on index i, address on index i+1
for i in data:
    test = list(str(i).split('</b>'))
    # Using the line breaks to our advantage.
    name = test[0].strip("['")
    # Here we are saying that the name of the school is the first element
    # before the first line break
    addy = test[1:]
    # The address is what comes after this first line break
    data2.append(name)
    data2.append(addy)
    # Append the name of the school and address to our new list.
school_name = data2[::2]
# Making a new list that consists of the school name
school_address = data2[1::2]
# Another list that consists of the school's address.
with open("Medschooltest.csv", 'w', encoding='utf-8') as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(zip(school_name, school_address))
    # Zip the two together making a 2 column table with the school's name and
    # its address
print("CSV Completed!")

It seems that conditional checks combined with string manipulation can do the trick. I think the following script will get you very close to what you want.
from bs4 import BeautifulSoup
import requests
import csv
link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
res = requests.get(link)
soup = BeautifulSoup(res.text, "html.parser")
with open("membersInfo.csv","w",newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["Name","Address"])
    for tr in soup.find_all('table', class_='bodyTXT'):
        items = ', '.join([item.string for item in tr.select_one('td') if item.string!="\n" and item.string!=None])
        name = items.split(",")[0].strip()
        address = items.split(name)[1].strip(",")
        writer.writerow([name,address])
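For context on where the stray characters come from: in the question's code, str(i) turns each list into its printed form (brackets and quotes included), and the address is later handed to csv.writer as a list, which gets written out the same way. Below is a minimal sketch of avoiding that, assuming each cell's text has already been extracted as one string with a separator between its lines. The "|" separator, the helper names, and the sample row are hypothetical and not part of either script above.

```python
import csv
import io


def split_name_and_address(cell_text, sep="|"):
    """Split one cell's text into (name, address) without calling str() on a list."""
    parts = [part.strip() for part in cell_text.split(sep) if part.strip()]
    if not parts:
        return "", ""
    # The first piece is the school name; the rest is joined into one string.
    return parts[0], ", ".join(parts[1:])


def cells_to_csv(cell_texts, sep="|"):
    """Return CSV text with a Name column and an Address column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Address"])
    for cell_text in cell_texts:
        writer.writerow(split_name_and_address(cell_text, sep))
    return buffer.getvalue()


# Made-up sample cell, for illustration only.
sample = ["Example School of Medicine|1 Main Street|Springfield, XX 00000"]
csv_text = cells_to_csv(sample)
print(csv_text)
```

With BeautifulSoup, a string of this shape could come from td.get_text("|", strip=True), which is the same call the question already uses with a different separator.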

If you know SQL and the data is this well structured, extracting it into a database would be the best solution.

Related

Python Looping through urls in csv file returns \ufeffhttps://

I am new to Python and I am trying to loop through the list of URLs in a CSV file and grab each website's title using BeautifulSoup, which I would then like to save to a file, Headlines.csv. But I am unable to grab the webpage title. If I use a variable with a single URL as follows:
url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'
resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title.text)
It works just fine and I get the title Japanese capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use the loop,
import csv
with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)
I get the following
['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']
and an error message
InvalidSchema: No connection adapters were found for "['\\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']"
I am not sure what I am doing wrong.
You have a byte order mark (\ufeff) at the start of the URL you read from your file.
It looks like your file was saved as UTF-8 with a signature (BOM), i.e. with an encoding like utf-8-sig.
You need to read the file with encoding='utf-8-sig'.
Read more here.
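To make the difference between the two encodings concrete, here is a small sketch that decodes the same bytes both ways. The URL is a placeholder, and the bytes are built in memory rather than read from the asker's file.

```python
import codecs

# The bytes of a UTF-8 file that was saved with a byte order mark (BOM).
raw = codecs.BOM_UTF8 + b"https://example.com/page\n"

as_utf8 = raw.decode("utf-8")          # the BOM survives as "\ufeff"
as_utf8_sig = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(as_utf8))
print(repr(as_utf8_sig))
```

Passing encoding='utf-8-sig' to open() applies the same stripping when the file is read.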
As the previous answer already mentioned regarding the "\ufeff", you need to change the encoding.
The second issue is that when you read a CSV file, each row comes back as a list containing all the columns of that row. The keyword here is list: you are passing requests a list instead of a string.
Based on the example you have given, I assume your URLs are in the first column of the CSV. Python list indices start at 0, not 1, so to get the URL you need index 0, which refers to the first column.
import csv
with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url[0])
To read up more on lists, you can refer here.
You can add more columns to the CSV file and experiment to see how the results would appear.
If you would like to refer to the column name while reading each row, you can refer here.
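As a sketch of referring to a column by name, csv.DictReader maps each row to a dictionary keyed by the header row. This assumes the CSV has a header line, which the asker's file may not have; the column names "url" and "label", the helper read_urls, and the sample data are all made up for illustration.

```python
import csv
import io

# Stand-in for a CSV file with a header row (hypothetical column names).
csv_text = "url,label\nhttps://example.com/a,first\nhttps://example.com/b,second\n"


def read_urls(handle, column="url"):
    """Return the values of one named column as a list of plain strings."""
    reader = csv.DictReader(handle)
    return [row[column].strip() for row in reader if row.get(column)]


urls = read_urls(io.StringIO(csv_text))
print(urls)
```

With a real file, the handle would come from open('urls_file2.csv', newline='', encoding='utf-8-sig') as in the snippet above, and each item could then be passed to the request as a string.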

Iterating website URLs from a text file into BeautifulSoup w/ Python

I have a .txt file with a different link on each line that I want to iterate, and parse into BeautifulSoup(response.text, "html.parser"). I'm having a couple issues though.
I can see the lines iterating from the text file, but when I assign them to my requests.get(websitelink), my code that previously worked (without iteration) no longer prints any data that I scrape.
All I receive are some blank lines in the results.
I'm new to Python and BeautifulSoup, so I'm not quite sure what I'm doing wrong. I've tried parsing the lines as a string, but that didn't seem to work.
import requests
from bs4 import BeautifulSoup, CData
filename = 'item_ids.txt'
with open(filename, "r") as fp:
    lines = fp.readlines()
    for line in lines:
        #Test to see if iteration for line to line works
        print(line)
        #Assign single line to websitelink
        websitelink = line
        #Parse websitelink into requests
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")
        #initialize and reset vars for cd loop
        count = 0
        weapon = ''
        stats = ''
        #iterate through cdata on page, and parse wanted data
        for cd in soup.findAll(text=True):
            if isinstance(cd, CData):
                #print(cd)
                count += 1
                if count == 1:
                    weapon = cd
                if count == 6:
                    stats = cd
        #concatenate cdata info
        both = weapon + " " + stats
        print(both)
The code should follow these steps:
1. Read a line (URL) from the text file and assign it to a variable to be used with requests.get(websitelink)
2. BeautifulSoup scrapes that link for the CData and prints it
3. Repeat steps 1 and 2 until the final line of the text file (last URL)
Any help would be greatly appreciated,
Thanks
I'm not sure whether this will help, but adding strip() where you assign the line to websitelink made your code work for me. Each line returned by readlines() still ends with a newline character, so the string passed to requests.get() was not the bare URL. You could try it:
websitelink = line.strip()
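To see what strip() changes, here is a short sketch that uses an in-memory file in place of item_ids.txt. The URLs and the helper name clean_lines are placeholders.

```python
import io


def clean_lines(handle):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in handle:
        stripped = line.strip()
        if stripped:
            yield stripped


# In-memory stand-in for the text file, including one blank line.
fake_file = io.StringIO("https://example.com/item/1\n\nhttps://example.com/item/2\n")

raw_lines = fake_file.readlines()  # every entry still carries its trailing "\n"
fake_file.seek(0)
urls = list(clean_lines(fake_file))

print(raw_lines)
print(urls)
```

Skipping blank lines is an extra precaution that the original answer does not mention; it avoids requesting an empty string if the file ends with an empty line.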

How to input a list of URLs saved in a .txt to a Python program?

I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python, as a matter of fact this is my first script, so I have tried to simply say url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. So I have tried to apply read() and readlines() to it, but it wouldn't work properly because 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each beginning in a new line, as the input of my simple script? Should I convert string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I built some functions to display the info in a desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which other methods are then applied to extract its content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests
with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing new line
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
This could help you:
url_file = open('myfile.txt','r')
for url in url_file.readlines():
    print(url)
url_file.close()
You can apply it to your code as follows:
from newspaper import Article
from newspaper import fulltext
import requests
url_file = open('myfile.txt','r')
for url in url_file.readlines():
    # each line still ends with a newline, so strip it first
    url = url.strip()
    a = Article(url, language='pt')
    html = requests.get(url).text
    text = fulltext(html)
    download = a.download()
    parse = a.parse()
    nlp = a.nlp()
    title = a.title
    publish_date = a.publish_date
    authors = a.authors
    keywords = a.keywords
    summary = a.summary
url_file.close()
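With thousands of URLs, one failed download would stop either loop above. The following is a minimal sketch of one way to keep going, not something the answers themselves propose: process stands for whatever per-URL work you do (for example the newspaper steps), and both process_all and fake_process are hypothetical names used only for illustration.

```python
def process_all(urls, process):
    """Call process(url) for every URL; collect results and failures separately."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = process(url)
        except Exception as error:  # broad on purpose: keep going on any failure
            failures[url] = str(error)
    return results, failures


def fake_process(url):
    """Stand-in for the real per-article work; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("could not download")
    return {"title": "Title for " + url}


results, failures = process_all(
    ["https://example.com/ok", "https://example.com/bad"], fake_process
)
print(results)
print(failures)
```

The failures dictionary can be written to a log afterwards so the problem URLs can be retried separately.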

How to prevent writing into txt file the same words using open(text.txt,a)?

I have a question about appending to a text file. I have written a script that reads a URL returning JSON, extracts the list of titles, and writes them to the file "WordsInCategory.text".
Because this code will be used in a loop, I used f1 = open('WordsInCategory.text', 'a').
But I ran into a problem: it also adds titles that already exist in the file.
I am having trouble coming up with a solution, and using 'w' would overwrite what has already been written.
My code is as follows:
import urllib2
import json
url1 ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
f1 = open('WordsInCategory.text', 'a')
for item in data1['query']:
    for i in data1['query']['categorymembers']:
        f1.write((i['title']).encode('utf8')+"\n")
Please advise on how I should modify my code.
Thank you.
I would suggest collecting every title in a list before writing to the file (and hence writing to the file only once). You can modify your code this way:
import urllib2
import json
data = []
f1 = open('WordsInCategory.text', 'w')
url1 ='https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
for item in data1['query']:
    for i in data1['query']['categorymembers']:
        data.append(i['title'].encode('utf8')+"\n")
# Do additional requests, and append the new titles to the data array
f1.write(''.join(set(data)))
f1.close()
set allows me to delete any duplicate entry.
If keeping the titles in memory is a problem, you can check whether the title already exists in the file before writing it, but this can be very slow:
import urllib2
import json
data = []
url1 ='https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
for item in data1['query']:
    for i in data1['query']['categorymembers']:
        title = (i['title'].encode('utf8')+"\n")
        with open('WordsInCategory.text', 'r') as title_check:
            if title not in title_check:
                data.append(title)
with open('WordsInCategory.text', 'a') as f1:
    f1.write(''.join(set(data)))
# Handle additional requests
Hope it'll be helpful.
You can keep track of the titles you have already added.
titles = []
and then add each title to the list when writing:
if title not in titles:
    # write to file
    titles.append(title)
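Putting the tracking idea together with append mode, here is a minimal Python 3 sketch. The question's code is Python 2, so this is an adaptation rather than a drop-in replacement, and the function name append_new_titles and the sample titles are assumptions. It loads the titles already in the file into a set, then appends only the ones that are new, keeping their original order.

```python
import os
import tempfile


def append_new_titles(path, titles):
    """Append only titles not already in the file; return the titles added."""
    seen = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as existing:
            seen = {line.rstrip("\n") for line in existing}
    added = []
    with open(path, "a", encoding="utf-8") as out:
        for title in titles:
            if title not in seen:
                out.write(title + "\n")
                seen.add(title)
                added.append(title)
    return added


# Demo in a temporary directory so no real file is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "WordsInCategory.text")
first = append_new_titles(demo_path, ["Geography", "Cartography", "Geography"])
second = append_new_titles(demo_path, ["Cartography", "Geodesy"])
print(first, second)
```

A set makes each membership check fast, which matters once the file holds many titles; a plain list would also work but gets slower as it grows.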

Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Apologies in advance for the long block of code following. I'm new to BeautifulSoup, but found there were some useful tutorials using it to scrape RSS feeds for blogs. Full disclosure: this is code adapted from this video tutorial which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.
Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write out each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to comment this for people to see quickly; it's the last comment, beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.
Currently the program takes the text from the last article it reads and writes that out to the number of .txt files indicated by the variable listIterator. So, in this case I believe 20 .txt files get written out, but they all contain the text of the last article that's looped over. What I want the program to do is loop over each article and write the text of each article out to a separate .txt file. Sorry for the verbosity, but any insight would be really appreciated.
from urllib import urlopen
from bs4 import BeautifulSoup
import re
# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)
# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))
for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]
    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()
    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")
    # Define article variable that will contain all the content between the
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]
    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)
    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')
    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''
    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
        newlist = range(len(findPatTitle))
        for i in newlist:
            ofile = open(str(listIterator[i])+'.txt', 'w')
            ofile.write(para_string)
            ofile.close()
The reason it seems that only the last article is written is that every article is written to the same 20 files over and over again. Let's have a look at the following:
for i in paragList:
    para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
You are writing para_string over and over again to the same 20 files on each iteration. What you need to do instead is append each para_string to a separate list, say paraStringList, and then write its contents to separate files, like so:
for i, var in enumerate(paraStringList): # enumerate yields (index, item) tuples
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
Note that this needs to be outside of your main loop, i.e. for i in listIterator:(...). This is a working version of the program:
from urllib import urlopen
from bs4 import BeautifulSoup
import re
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]
listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []
for i in listIterator:
    print findPatTitle[i]
    articlePage = urlopen(findPatLink[i]).read()
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")
    article = articlePage[divBegin:divEnd]
    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')
    para_string = ''
    for i in paragList:
        para_string += str(i)
    paraStringList.append(para_string)
for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
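The program above is Python 2. As a rough Python 3 sketch of just the final writing step, the function below writes each collected string to its own numbered file in a chosen directory. The function name write_articles, the temporary directory, and the sample strings are assumptions made for illustration; the asker would pass paraStringList and a Desktop path instead.

```python
import os
import tempfile


def write_articles(texts, directory):
    """Write each text to its own file (0.txt, 1.txt, ...); return the paths."""
    paths = []
    for index, text in enumerate(texts):
        path = os.path.join(directory, "{0}.txt".format(index))
        with open(path, "w", encoding="utf-8") as writer:
            writer.write(text)
        paths.append(path)
    return paths


# Demo with made-up article bodies, written to a temporary directory.
out_dir = tempfile.mkdtemp()
paths = write_articles(["<p>first</p>", "<p>second</p>"], out_dir)
print(paths)
```

Because the file is opened once per article, after all the text for that article has been collected, no file is overwritten by a later article.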
