Finding unique web links using Python

I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program works and extracts links for all sites except www.stevens.edu, for which I get 'None' as output. I am very frustrated with this and need help. I am using this URL for testing: http://www.stevens.edu/
import urllib
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
Please guide me here and let me know why it is not working with www.stevens.edu.

The site checks the User-Agent header and returns different HTML based on it.
You need to set the User-Agent header to get the proper HTML:
import urllib2
from bs4 import BeautifulSoup as bs

url = raw_input('enter - ')
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # <-- set a browser-like User-Agent
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
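Since the question asks for unique links, a set comprehension can deduplicate the hrefs once the page parses correctly. A minimal Python 3 sketch, run against a small inline document (the HTML here is made up for illustration, standing in for the fetched page):

```python
from bs4 import BeautifulSoup

# a tiny inline page standing in for the fetched HTML
html = '''
<a href="/about">About</a>
<a href="/about">About (duplicate)</a>
<a href="/admissions">Admissions</a>
<a>anchor without an href</a>
'''
soup = BeautifulSoup(html, 'html.parser')
# a set keeps each href only once; skip anchors that have no href at all
unique_links = {tag.get('href') for tag in soup('a') if tag.get('href')}
print(sorted(unique_links))  # ['/about', '/admissions']
```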

Related

BeautifulSoup find() function returning None because I am not getting correct html info

I am trying to web-scrape an Amazon product page and print out the productTitle. Code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import smtplib

def extract():
    headers = {my_info}
    url = "link"  # this link is to a webpage on Amazon
    req = requests.get(url, headers=headers)
    soup1 = BeautifulSoup(req.content, "html.parser")  # get soup1 from the website
    soup2 = BeautifulSoup(soup1.prettify(), 'html.parser')  # prettify it in soup2
    # when I inspect element on the website, it shows an html tag <span> with id=productTitle
    print(soup2)
    # title = soup2.find('span', id='productTitle').get_text()  # find() is returning None
    print(soup2.prettify())
I was expecting the HTML content that I inspected directly on the website to be the same as in soup2, but for some reason it's not, which is causing find() to return None. How come the HTML is not the same? Any help would be appreciated.

Python BeautifulSoup not extracting every URL

I'm trying to find all the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments
More specifically, I want the links hyperlinked under each "Subject Code". However, when I run my code, barely any links get extracted.
I would like to know why this is happening and how I can fix it.
from bs4 import BeautifulSoup
import requests
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
This is my first attempt at web scraping.
There's anti-bot protection; just add a user-agent to your headers. And do not forget to check your soup when things go wrong.
from bs4 import BeautifulSoup
import requests
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
The message in the soup was:
Sorry for the inconvenience.
We have detected excess or unusual web requests originating from your browser, and are unable to determine whether these requests are automated.
To proceed to the requested page, please complete the captcha below.
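One way to implement that "check your soup" advice is to scan the parsed text for the block message before parsing further. A small sketch; the HTML below is a made-up stand-in for the anti-bot page:

```python
from bs4 import BeautifulSoup

# made-up stand-in for the anti-bot response body quoted above
blocked_html = '<html><body><p>To proceed to the requested page, please complete the captcha below.</p></body></html>'
soup = BeautifulSoup(blocked_html, 'html.parser')
# a quick sanity check: did we get the real page or the captcha wall?
is_blocked = 'captcha' in soup.get_text().lower()
print(is_blocked)  # True
```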
I would use nth-child(1) to restrict to the first column of the table matched by id, then simply extract the .text. If that contains *, provide a default string for "no course offered"; otherwise, concatenate the retrieved course identifier onto a base query string:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')
no_course = ''
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
course_info = {i.text: (no_course if '*' in i.text else base + i.text)
               for i in soup.select('#mainTable td:nth-child(1)')}
print(course_info)
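The same dictionary-building logic can be checked offline against a miniature stand-in for #mainTable (this table markup is invented for illustration; the real page's structure may differ):

```python
from bs4 import BeautifulSoup

# invented miniature version of the #mainTable structure
html = '''
<table id="mainTable">
  <tr><td>ANTH</td><td>Anthropology</td></tr>
  <tr><td>ARC *</td><td>no course offered</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
# first column only, via td:nth-child(1); '*' marks a subject with no offering
course_info = {td.text: ('' if '*' in td.text else base + td.text)
               for td in soup.select('#mainTable td:nth-child(1)')}
print(course_info)
```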

How can I print any website content? (Using something like my code)

I want to open a website, get its content, store it in a variable, and print it:
from urllib.request import urlopen
url = any_website
content = urlopen(url).read().decode('utf-8')
print(content)
The expected result is that I get what is written in the page
In Python, there are several libraries you may be interested in. An example of printing contents to get you started:
from bs4 import BeautifulSoup as soup
import requests
url = "https://en.wikipedia.org/wiki/List_of_multinational_corporations"
page = requests.get(url)
page_html = page.content
page_soup = soup(page_html, "html.parser")
print(page_soup)
With urlopen, you may try as below:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://en.wikipedia.org/wiki/List_of_multinational_corporations"
r = urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")
print(type(soup))
print(soup.prettify()[0:1000])

BeautifulSoup findAll returns empty list when selecting class

findAll() returns an empty list when specifying a class; specifying only tags works fine.
import urllib2
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week"
hdr = {'User-Agent': 'tempro'}
req = urllib2.Request(url, headers=hdr)
htmlpage = urllib2.urlopen(req).read()
BeautifulSoupFormat = BeautifulSoup(htmlpage, 'lxml')
name_box = BeautifulSoupFormat.findAll('a', {'class': 'title'})
for data in name_box:
    print(data.text)
I'm trying to get only the text of the post. The current code prints nothing. If I remove the {'class':'title'}, it prints the post text as well as the username and comments of the post, which I don't want.
I'm using Python 2 with the latest versions of BeautifulSoup and urllib2.
To get all the comments you are going to need something like Selenium, which will allow you to scroll. Without that, just to get the initial results, you can grab the data from a script tag in the requests response:
import requests
from bs4 import BeautifulSoup as bs
import re
import json
headers = {'User-Agent' : 'Mozilla/5.0'}
r = requests.get('https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week', headers = headers)
soup = bs(r.content, 'lxml')
script = soup.select_one('#data').text
p = re.compile(r'window.___r = (.*); window')
data = json.loads(p.findall(script)[0])
for item in data['posts']['models']:
    print(data['posts']['models'][item]['title'])
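The script-tag trick can be seen in isolation with a made-up window.___r payload (the real structure on reddit is far larger; this stand-in only mimics the shape the code above relies on):

```python
import re
import json

# made-up stand-in for the contents of the script tag
script = 'window.___r = {"posts": {"models": {"t3_abc": {"title": "a shower thought"}}}}; window.___prefetches = [];'
# greedy capture of everything between 'window.___r = ' and the next '; window'
p = re.compile(r'window.___r = (.*); window')
data = json.loads(p.findall(script)[0])
titles = [model['title'] for model in data['posts']['models'].values()]
print(titles)  # ['a shower thought']
```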
The selector you are trying to use is not good, because those posts do not have class="title". Please try this instead:
name_box = BeautifulSoupFormat.select('a[data-click-id="body"] > h2')
This finds all the <a data-click-id="body"> elements containing an <h2> tag with the post text you need.
More about selectors in BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

How to crawl the description for sfglobe using python

I am trying to use Python and Beautifulsoup to get this page from sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore.
This is the code:
import urllib2
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = urllib2.urlopen(url)
html = req.read()
soup = BeautifulSoup(html)
desc = soup.find('span', class_='articletext intro')
Could anyone help me to solve this problem?
From the question title, I assuming that the only thing you want is the description of the article, which can be found in the <meta> tag within the HTML <head>.
You were on the right track, but I'm not exactly sure why you did:
desc = soup.find('span', class_='articletext intro')
Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2:
import requests
from bs4 import BeautifulSoup

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
tag = soup.find(attrs={'name': 'description'})  # find the <meta> tag with the description
desc = tag['content']  # the description text lives in the 'content' attribute
print desc
If that isn't what you are looking for, please clarify so I can try and help you more.
EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').
Maybe this is what you are looking for:
import requests
from bs4 import BeautifulSoup, NavigableString

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
body = soup.find('span', class_='articletext intro')
# remove script tags
[s.extract() for s in body('script')]
text = ""
# iterate through non-script elements in the content body
for stuff in body.select('*'):
    # get the contents of each tag; .contents returns a list
    content = stuff.contents
    # keep it only if the list holds exactly one NavigableString (text, not a tag)
    if len(content) == 1 and isinstance(content[0], NavigableString):
        text += content[0]
print text
