Why is it just printing 0 when I run this code?
I am trying to print the articles' title, link, and date. Is the .aspx link possibly a problem for this method?
I was originally following an RSS version (https://codeburst.io/building-an-rss-feed-scraper-with-python-73715ca06e1f) and am adapting it to an individual website, because the RSS feed doesn't give me the info I actually need. Thanks for the help!
    from bs4 import BeautifulSoup
    import requests

    url = 'https://tymeinc.com/newsroom/press-releases/default.aspx'

    def news(url):
        try:
            r = requests.get(url)
            soup = BeautifulSoup(r.content, features='xml')
            articles = soup.findAll('blog_section')
            print(len(articles))
            for a in articles:
                title = a.find('blog_title').text
                link = a.find('blog_link').text
                published = a.find('module_date-time').text
                description = a.find('blog_short-body').text
                article = {'blog_title': title, 'blog_link': link, 'module_date-time': published}
                articles.append(article)
            return print(url)
            return print(title)
            #return print(articles, "done")
        except Exception as e:
            print('The scraping job failed. See exception: ', e)

    news(url)
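A guess worth checking, since the page's markup isn't shown here: find_all('blog_section') looks for a tag literally named <blog_section>, so if those names are actually CSS classes the search returns an empty list and len() prints 0. The loop also appends to the same list it is iterating over, which is safer to avoid. Below is a minimal sketch of a class-based lookup on made-up HTML (the markup, the class-to-element mapping, and the results list are all assumptions for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's structure is not shown in the question.
SAMPLE_HTML = """
<div class="blog_section">
  <a class="blog_link" href="/news/1.aspx"><span class="blog_title">First release</span></a>
  <span class="module_date-time">Jan 2, 2020</span>
</div>
<div class="blog_section">
  <a class="blog_link" href="/news/2.aspx"><span class="blog_title">Second release</span></a>
  <span class="module_date-time">Feb 3, 2020</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching by tag name finds nothing, because no <blog_section> tag exists.
by_tag_name = soup.find_all("blog_section")

# Searching by CSS class finds both blocks.
sections = soup.find_all(class_="blog_section")

results = []  # separate list, so we never append to the list being iterated
for section in sections:
    link_tag = section.find(class_="blog_link")
    results.append({
        "title": section.find(class_="blog_title").get_text(strip=True),
        "link": link_tag["href"],
        "published": section.find(class_="module_date-time").get_text(strip=True),
    })

print(len(by_tag_name), len(results))
```

If the real page builds its article list with JavaScript, the classes may not be in the downloaded HTML at all, so it is worth printing r.text and searching it for one of the class names first.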
I'm new to web scraping (using Python) and ran into a problem trying to get an email address from a university's athletic department site.
I've managed to navigate to the element holding the email I want to extract, but I don't know where to go from here. When I print what I have, all I get is '' and not the actual text of the email.
I'm attaching what I have so far, let me know if it needs a better explanation.
Here's a link to an image of what I'm trying to scrape. Website
and the website: https://goheels.com/staff-directory
Thanks!
Here's my code:
    from bs4 import BeautifulSoup
    import requests

    urls = ''
    with open('websites.txt', 'r') as f:
        for line in f.read():
            urls += line
    urls = list(urls.split())
    print(urls)

    for url in urls:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'html.parser')
        try:
            body = soup.find(headers="col-staff_email category-0")
            links = body.a
            print(links)
        except Exception as e:
            print(f'"This url didn\'t work:" {url}')
The emails are hidden inside a <script> element. With a little pushing, shoving, css selecting and string splitting you can get there:
    for em in soup.select('td[headers*="col-staff_email"] script'):
        target = em.text.split('var firstHalf = "')[1]
        fh = target.split('";')[0]
        lh = target.split('var secondHalf = "')[1].split('";')[0]
        print(fh + '#' + lh)
Output:
bubba.cunningham#unc.edu
molly.dalton#unc.edu
athgallo#unc.edu
dhollier#unc.edu
etc.
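The same idea can be sketched with a regular expression, which skips cells whose script doesn't match instead of raising an IndexError. This is only an illustration on made-up HTML: the sample cell, the HALVES pattern, and the extract_emails helper are assumptions, and joining with '@' (rather than the '#' printed above) assumes that is how the page's own script combines the two halves.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cell, modeled on the variable names used in the answer above.
SAMPLE_HTML = """
<table><tr>
<td headers="col-staff_email category-0">
<script>
var firstHalf = "jane.doe";
var secondHalf = "example.edu";
document.write(firstHalf + "@" + secondHalf);
</script>
</td>
</tr></table>
"""

# Capture the two quoted values, allowing other script text in between.
HALVES = re.compile(
    r'var firstHalf = "([^"]*)";.*?var secondHalf = "([^"]*)";',
    re.DOTALL,
)

def extract_emails(html):
    """Return the addresses assembled from each email cell's <script>."""
    soup = BeautifulSoup(html, "html.parser")
    emails = []
    for script in soup.select('td[headers*="col-staff_email"] script'):
        match = HALVES.search(script.string or "")
        if match:  # skip scripts that don't follow the expected pattern
            emails.append(match.group(1) + "@" + match.group(2))
    return emails

print(extract_emails(SAMPLE_HTML))
```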
I am working on this task right now:
"Use BeautifulSoup and requests Python packages to print out a list of all the article titles on the New York Times homepage."
For now I can only connect to the page:
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://www.nytimes.com/")
    if r.status_code == 200:
        print("Page opened successfully.")
        soup = BeautifulSoup(r.text,'html.parser')
    else:
        print("Page not found!")
        exit(1)
    r_html = r.text
    exit(0)
So my question is: how can I use the bs4 library and the page's source code to find the information I want (the list of articles on the homepage)?
The sorting criterion (common tags or HTML properties for articles) is the main challenge. What I have done below is scoop up all article titles that appear within an <h2> tag.
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://www.nytimes.com/")
    if r.status_code == 200:
        print("Page opened successfully.")
        soup = BeautifulSoup(r.text,'html.parser')
        result = soup.find_all('h2')
        headlines = []
        for i in result:
            if result.index(i) < len(result)-2:
                headlines.append(i.text)
    else:
        print("Page not found!")
        exit(1)
    r_html = r.text
    print(headlines)
    exit(0)
You may want to take some time to study the page source, as this will give you more insight into which properties are unique to article headlines (with which you can better scrape the information you want).
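As a rough illustration of what "a property unique to headlines" might look like once you have studied the source, here is a sketch on made-up HTML. The markup is an assumption (the real homepage differs and changes often); the point is only that a narrower CSS selector can replace trimming the list by position.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real homepage structure differs and changes often.
SAMPLE_HTML = """
<main>
  <article><h2>Markets rally after rate decision</h2></article>
  <article><h2>  City council approves transit plan </h2></article>
</main>
<footer>
  <h2>Site Index</h2>
  <h2>Site Information Navigation</h2>
</footer>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every <h2>, including the two footer headings that are not articles.
all_h2 = [h.get_text(strip=True) for h in soup.find_all("h2")]

# Only <h2> elements nested inside an <article> element.
headlines = [h.get_text(strip=True) for h in soup.select("article h2")]

print(headlines)
```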
I recently started learning Python. In the process of learning about web scraping, I followed an example to scrape from Google News. After running my code, I get the message "Process finished with exit code 0" with no results. If I change the URL to "https://yahoo.com" I get results. Could anyone point out what, if anything, I am doing wrong?
Code:
    import urllib.request
    from bs4 import BeautifulSoup

    class Scraper:
        def __init__(self, site):
            self.site = site

        def scrape(self):
            r = urllib.request.urlopen(self.site)
            html = r.read()
            parser = "html.parser"
            sp = BeautifulSoup(html, parser)
            for tag in sp.find_all("a"):
                url = tag.get("href")
                if url is None:
                    continue
                if "html" in url:
                    print("\n" + url)

    news = "https://news.google.com/"
    Scraper(news).scrape()
Try this out:
    import urllib.request
    from bs4 import BeautifulSoup

    class Scraper:
        def __init__(self, site):
            self.site = site

        def scrape(self):
            r = urllib.request.urlopen(self.site)
            html = r.read()
            parser = "html.parser"
            sp = BeautifulSoup(html, parser)
            for tag in sp.find_all("a"):
                url = tag.get("href")
                if url is None:
                    continue
                else:
                    print("\n" + url)

    if __name__ == '__main__':
        news = "https://news.google.com/"
        Scraper(news).scrape()
Initially you were checking each link to see if it contained 'html'. I am assuming the example you were following was checking whether the links ended in '.html'.
Beautiful Soup works really well, but you need to check the source code of the website you're scraping to get an idea of how the code is laid out. DevTools in Chrome works really well for this; press F12 to get there quickly.
I removed:

    if "html" in url:
        print("\n" + url)

and replaced it with:

    else:
        print("\n" + url)
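To make the difference between the two checks concrete, here is a small sketch with made-up href values (the list is an assumption, not links taken from Google News). It shows how a substring test and a suffix test filter the same links differently, and how both can leave you with nothing when a site's links don't mention 'html' at all.

```python
# Hypothetical href values of the kind tag.get("href") might return.
hrefs = [
    "./articles/abc123",
    "https://example.com/story.html",
    None,
    "https://example.com/html-basics",
    "/about.html?ref=home",
]

# Drop missing hrefs first, as the answer's code does.
present = [h for h in hrefs if h is not None]

# Substring test: matches "html" anywhere in the URL.
contains_html = [h for h in present if "html" in h]

# Suffix test: matches only URLs that end in ".html".
ends_with_html = [h for h in present if h.endswith(".html")]

print(len(present), contains_html, ends_with_html)
```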
I need to scrape a website to obtain some information, such as each film's title and its link. My code runs correctly, but it stops at the first line of the website. This is my code. Thank you in advance for your help, and sorry if this is not a smart question, but I'm a novice.
    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

    def scarica_pagina(URL):
        page = requests.get(URL)
        html = page.text
        soup = BeautifulSoup(html, 'lxml')
        films = soup.find_all("div",{"id": "movie_wide"})
        for film in films:
            link = film.find('p').find("a").attrs['href']
            title = film.find('p').find("a").text.strip('>')
            print (link)
            print(title)
Try the below way. I've slightly modified your script to serve the purpose and make it look better. Let me know if you encounter any further issues:
    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

    def scarica_pagina(link):
        page = requests.get(link)
        soup = BeautifulSoup(page.text, 'lxml')
        for film in soup.find(id="movie_wide").find_all("p"):
            link = film.find("a")['href']
            title = film.find("a").text
            print (link,title)

    if __name__ == '__main__':
        scarica_pagina(URL)
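Why the original stopped after one result can be seen on a tiny made-up page: an id normally matches a single element, so looping over find_all for that id runs once, while looping over the <p> tags inside it visits every film. The sample markup below is an assumption about the page's general shape, and the None check is an extra guard added for paragraphs that have no link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one container div holding one <p> per film.
SAMPLE_HTML = """
<div id="movie_wide">
  <p><a href="/scripts/alpha.html">Alpha</a></p>
  <p><a href="/scripts/beta.html">Beta</a></p>
  <p>No link in this paragraph</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One matching <div>, so a loop over this list runs only once.
containers = soup.find_all("div", {"id": "movie_wide"})

films = []
for paragraph in soup.find(id="movie_wide").find_all("p"):
    anchor = paragraph.find("a")
    if anchor is None:  # skip paragraphs without a link
        continue
    films.append((anchor["href"], anchor.get_text(strip=True)))

print(len(containers), films)
```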
I am new to Python. I am building a crawler for the company I work for. When crawling its website, there is an internal link that is not in the link format the crawler is used to. How can I get the entire link instead of only the directory? If that was not clear, please run the code I wrote below:
    import urllib2
    from bs4 import BeautifulSoup

    web_page_string = []

    def get_first_page(seed):
        response = urllib2.urlopen(seed)
        web_page = response.read()
        soup = BeautifulSoup(web_page)
        for link in soup.find_all('a'):
            print (link.get('href'))
        print soup

    print get_first_page('http://www.fashionroom.com.br')
    print web_page_string
Thanks, everyone, for the answers. I tried putting an if in the script. If anyone sees a potential problem that I might run into in the future, please let me know.
    import urllib2
    from bs4 import BeautifulSoup

    web_page_string = []

    def get_first_page(seed):
        response = urllib2.urlopen(seed)
        web_page = response.read()
        soup = BeautifulSoup(web_page)
        final_page_string = soup.get_text()
        for link in soup.find_all('a'):
            if (link.get('href'))[0:4]=='http':
                print (link.get('href'))
            else:
                print seed+'/'+(link.get('href'))
        print final_page_string

    print get_first_page('http://www.fashionroom.com.br')
    print web_page_string
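Two potential problems with the if approach: link.get('href') returns None for <a> tags without an href, so slicing it raises a TypeError, and seed+'/'+href produces a double slash when the href already starts with '/'. One possible alternative is the standard library's urljoin, sketched below for Python 3 (in Python 2 the same function lives in the urlparse module). The absolute_links helper and the sample hrefs are made up for illustration; the real page's links are not shown in the post.

```python
from urllib.parse import urljoin

def absolute_links(seed, hrefs):
    """Resolve each href against the seed URL, skipping missing values."""
    resolved = []
    for href in hrefs:
        if not href:  # <a> tags without an href give None
            continue
        resolved.append(urljoin(seed, href))
    return resolved

# Hypothetical href values: one absolute, two relative, one missing.
sample = ["http://example.com/x", "/colecao", "loja/item.html", None]
print(absolute_links("http://www.fashionroom.com.br", sample))
```

Absolute links pass through unchanged, and relative ones are resolved against the seed, so no string comparison on the first four characters is needed.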