HTML Parsing gives no response - python

I'm trying to parse a web page, and that's my code:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific but, as the title says, it gives me no response. No errors, nothing.

First of all, omit the line read = BeautifulSoup(openurl.read()). It consumes the response, so the later BeautifulSoup(openurl) has nothing left to parse.
Also, the line x = soup.find('ul', {"class": "i_p0"}) makes no difference, because you reuse the variable x in the loop.
Also, soup.findAll('a href') doesn't find anything: 'a href' is treated as a tag name, and no tag has that name.
Also, instead of the old-fashioned findAll(), BeautifulSoup 4 has find_all().
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the value of the href attribute of every link on the page.
Hope that helps.
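To see the difference between the two calls without fetching anything, here is a minimal sketch that runs on a small inline HTML string. The HTML is a made-up stand-in for the downloaded page, and Python 3 syntax is assumed rather than the Python 2 used above.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page.
html = (
    '<ul>'
    '<li><a href="/abc">first</a></li>'
    '<li><a name="x">anchor without href</a></li>'
    '<li><a href="/def">second</a></li>'
    '</ul>'
)

soup = BeautifulSoup(html, "html.parser")

# 'a href' is looked up as a tag name, so nothing matches.
no_match = soup.find_all("a href")

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# .get('href') returns None for a tag without the attribute instead of raising KeyError.
via_get = [a.get("href") for a in soup.find_all("a")]

print(list(no_match))
print(hrefs)
print(via_get)
```

Filtering with href=True is one way to keep x['href'] from failing on anchors that have no href; using .get('href') and skipping None values is another.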

I altered a couple of lines in your code and I do get a response, not sure if that is what you want though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read()) # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl) # This is not needed
# x = soup.find('ul', {"class": "i_p0"}) # You don't seem to be making a use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href')  # This is to get the href
Hope this helps.

Related

AttributeError: type object 'typing.re' has no attribute 'split'

I am trying to scrape URL links from Google. The user enters a search term and should get back the result URLs. The main problem is that the split function doesn't work, and I can't fix it. Please help me.
[[For example: if a user enters "all useless website", Google shows results for it, and the user should get only the URL links.]]
from typing import re
from bs4 import BeautifulSoup
import requests
user_input = input('Enter value for search : ')
print('Please Wait')
page_source = requests.get("https://www.google.com/search?q=" + user_input)
soup = BeautifulSoup(page_source.text, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.title.parent.name)
all_links = soup.find_all('a')
for link in all_links:
    link_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    print(link_google.find["a"])
You're importing re from the wrong place. typing.re only holds type names and has no split function. Import the standard re module instead, as follows:
import re
...
link_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
To make your code work:
1. Import re correctly.
2. Change all_links = soup.find_all('a') to all_links = soup.find_all('a', href=True).
3. Clean up each link as you did before. re.split() works, but it returns a list, so take the first element and either append it to a list or print it.
Here is the updated code:
# issue 1
import re
from bs4 import BeautifulSoup
import requests
user_input = input('Enter value for search : ')
print('Please Wait')
page_source = requests.get("https://www.google.com/search?q=" + user_input)
soup = BeautifulSoup(page_source.text, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.title.parent.name)
# issue 2
all_links = soup.find_all('a', href=True)
for link in all_links:
    link_from_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    # issue 3
    print(link_from_google[0])
>>> {returns all the http links}
One liner list comprehension for fun
google_links = [re.split(":(?=http)", link["href"].replace("/url?q=", ""))[0] for link in soup.find_all('a', href=True)]
>>> {returns all the http links}
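The point that re.split() returns a list can be checked on a single string. The sketch below uses a made-up href in the "/url?q=..." shape the question strips. It also shows a possible alternative, not taken from the answers above, that reads the q parameter with urllib.parse instead of string replacement.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical href in the "/url?q=..." shape; the real markup may differ.
href = "/url?q=https://example.com/page&sa=U&ved=abc"

# re.split always returns a list, even when the pattern never matches.
parts = re.split(":(?=http)", href.replace("/url?q=", ""))
print(type(parts).__name__, parts)

# Alternative: parse the query string and take the "q" parameter,
# which also drops the trailing tracking parameters.
target = parse_qs(urlparse(href).query)["q"][0]
print(target)
```

Note that the split pattern only matches a colon that is directly followed by "http", so for this sample the list has a single element and the extra parameters stay attached to it.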

PYTHON: ValueError: unknown url type: 'comments_42.html'

Okay, so I am doing a course on Python and the assignment asks us to retrieve data from an HTML document.
Here is what I came up with:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
intlist = list()
tot = 0
count = 0
url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('span')
for tag in tags:
    n = tag.contents[0]
    n = int(n)
    count += 1
    tot = tot + n
print("Count:", n)
print("Total:", tot)
And this is what happens when I try to access the file (NOTE: the file I am trying to retrieve is stored locally):
ValueError: unknown url type: 'comments_42.html'
What is the cause of this error?
Thanks anyone for the help.
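You're supposed to read the HTML directly into BeautifulSoup. urlopen expects a URL with a scheme such as http:// or file://, so a bare filename like comments_42.html raises this error. For a local file, open it normally: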
from bs4 import BeautifulSoup
...
with open('filename.html', 'r') as htmlfile:
    html = htmlfile.read()
soup = BeautifulSoup(html, 'html.parser')
Now it's loaded and ready to parse. Don't forget to change filename.html to your actual file path.
Edit: There is another problem with your code. soup('span') is fine (calling the soup object is a shortcut for soup.find_all('span')), but print("Count:", n) prints the last value of n rather than count. Please refer to the docs for the basics.
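If the assignment really requires urlopen, a local file can still be reached through a file:// URL. The following is a minimal sketch under that assumption. It writes a throwaway comments_42.html into a temporary directory (the file contents are made up) so that it can run anywhere.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

with tempfile.TemporaryDirectory() as tmp:
    # Made-up file contents standing in for the course's sample page.
    path = Path(tmp) / "comments_42.html"
    path.write_text("<span>7</span><span>35</span>", encoding="utf-8")

    # A bare filename has no URL scheme, so urlopen rejects it.
    try:
        urlopen("comments_42.html")
    except ValueError as exc:
        error_message = str(exc)

    # Path.as_uri() builds an absolute file:// URL that urlopen accepts.
    with urlopen(path.resolve().as_uri()) as response:
        html = response.read().decode("utf-8")

print(error_message)
print(html)
```

Reading the file with open() stays the simpler choice; the file:// route is mainly useful when the same script has to handle both local files and remote pages.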

Duplicate links in Python

Good morning world
I'm new to Python and trying things out. I'm trying to remove duplicate links from the run below.
Currently there are 253 links retrieved. Can someone please help me with this?
import requests
from bs4 import BeautifulSoup
import csv
page = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(page)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
links = soup.find_all("a")
print ('Number of links retrieved: ', len (links))
Convert it to a set and it will remove duplicates:
links = set(soup.find_all("a"))
out:
Number of links retrieved: 244
A set does not preserve order.
Therefore I used a list and turned each href into a full URL before comparing.
Now the length is 123.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.census.gov/programs-surveys/popest.html")
soup = BeautifulSoup(r.text, 'html.parser')
links = []
for item in soup.findAll("a", href=True):
    item = item.get("href")
    if item.startswith("h"):
        pass
    else:
        item = f"https://www.census.gov/{item}"
    if item not in links:
        links.append(item)
        print(item)
print(len(links))
Output:
https://www.census.gov/#content
https://www.census.gov/en.html
https://www.census.gov/topics/population/age-and-sex.html
https://www.census.gov/businessandeconomy
https://www.census.gov/topics/education.html
https://www.census.gov/topics/preparedness.html
https://www.census.gov/topics/employment.html
https://www.census.gov/topics/families.html
https://www.census.gov/topics/population/migration.html
https://www.census.gov/geo
https://www.census.gov/topics/health.html
https://www.census.gov/topics/population/hispanic-origin.html
https://www.census.gov/topics/housing.html
https://www.census.gov/topics/income-poverty.html
https://www.census.gov/topics/international-trade.html
https://www.census.gov/topics/population.html
https://www.census.gov/topics/population/population-estimates.html
https://www.census.gov/topics/public-sector.html
https://www.census.gov/topics/population/race.html
https://www.census.gov/topics/research.html
https://www.census.gov/topics/public-sector/voting.html
https://www.census.gov/about/index.html
https://www.census.gov/data
https://www.census.gov/academy
https://www.census.gov/about/what/admin-data.html
https://www.census.gov/data/data-tools.html
https://www.census.gov/developers/
https://www.census.gov/data/experimental-data-products.html
https://www.census.gov/data/related-sites.html
https://www.census.gov/data/software.html
https://www.census.gov/data/tables.html
https://www.census.gov/data/training-workshops.html
https://www.census.gov/library/visualizations.html
https://www.census.gov/library.html
https://www.census.gov/AmericaCounts
https://www.census.gov/library/audio.html
https://www.census.gov/library/fact-sheets.html
https://www.census.gov/library/photos.html
https://www.census.gov/library/publications.html
https://www.census.gov/library/video.html
https://www.census.gov/library/working-papers.html
https://www.census.gov/programs-surveys/are-you-in-a-survey.html
https://www.census.gov/programs-surveys/decennial-census/2020census-redirect.html
https://www.census.gov/2020census
https://www.census.gov/programs-surveys/acs
https://www.census.gov/programs-surveys/ahs.html
https://www.census.gov/programs-surveys/abs.html
https://www.census.gov/programs-surveys/asm.html
https://www.census.gov/programs-surveys/cog.html
https://www.census.gov/programs-surveys/cbp.html
https://www.census.gov/programs-surveys/cps.html
https://www.census.gov/EconomicCensus
https://www.census.gov/internationalprograms
https://www.census.gov/programs-surveys/metro-micro.html
https://www.census.gov/popest
https://www.census.gov/programs-surveys/popproj.html
https://www.census.gov/programs-surveys/saipe.html
https://www.census.gov/programs-surveys/susb.html
https://www.census.gov/programs-surveys/sbo.html
https://www.census.gov/sipp/
https://www.census.gov/programs-surveys/surveys-programs.html
https://www.census.gov/newsroom.html
https://www.census.gov/partners
https://www.census.gov/programs-surveys/sis.html
https://www.census.gov/NAICS
https://www.census.gov/library/reference/code-lists/schedule/b.html
https://www.census.gov/data/developers/data-sets/Geocoding-services.html
https://www.census.gov/about-us
https://www.census.gov/about/who.html
https://www.census.gov/about/what.html
https://www.census.gov/about/business-opportunities.html
https://www.census.gov/careers
https://www.census.gov/fieldjobs
https://www.census.gov/about/history.html
https://www.census.gov/about/policies.html
https://www.census.gov/privacy
https://www.census.gov/regions
https://www.census.gov/about/contact-us/staff-finder.html
https://www.census.gov/about/contact-us.html
https://www.census.gov/about/faqs.html
https://www.commerce.gov/
https://www.census.gov//en.html
https://www.census.gov//programs-surveys.html
https://www.census.gov//popest
https://www.census.gov//programs-surveys/popest/about.html
https://www.census.gov//programs-surveys/popest/data.html
https://www.census.gov//programs-surveys/popest/geographies.html
https://www.census.gov//programs-surveys/popest/guidance.html
https://www.census.gov//programs-surveys/popest/guidance-geographies.html
https://www.census.gov//programs-surveys/popest/library.html
https://www.census.gov//programs-surveys/popest/news.html
https://www.census.gov//programs-surveys/popest/technical-documentation.html
https://www.census.gov//programs-surveys/popest/data/tables.html
https://www.census.gov//programs-surveys/popest/about/schedule.html
https://www.census.gov//newsroom/press-releases/2019/popest-nation.html
https://www.census.gov//newsroom/press-releases/2019/popest-nation/popest-nation-spanish.html
https://www.census.gov//newsroom/press-releases/2019/new-years-2020.html
https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-national.html
https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-state.html
https://www.census.gov//data/tables/time-series/demo/popest/pre-1980-county.html
https://www.census.gov//library/publications/2015/demo/p25-1142.html
https://www.census.gov//library/publications/2010/demo/p25-1139.html
https://www.census.gov//library/publications/2010/demo/p25-1138.html
https://www.census.gov//programs-surveys/popest/library/publications.html
https://www.census.gov//library/visualizations/2020/comm/superbowl.html
https://www.census.gov//library/visualizations/2019/comm/slower-growth-nations-pop.html
https://www.census.gov//library/visualizations/2019/comm/happy-new-year-2020.html
https://www.census.gov//programs-surveys/popest/library/visualizations.html
https://www.census.gov/#
https://www.census.gov/#uscb-nav-skip-header
https://www.census.gov/newsroom/blogs.html
https://www.census.gov/newsroom/stories.html
https://www.facebook.com/uscensusbureau
https://twitter.com/uscensusbureau
https://www.linkedin.com/company/us-census-bureau
https://www.youtube.com/user/uscensusbureau
https://www.instagram.com/uscensusbureau/
https://www.census.gov/quality/
https://www.census.gov/datalinkage
https://www.census.gov/about/policies/privacy/privacy-policy.html#accessibility
https://www.census.gov/foia
https://www.usa.gov/
https://www.census.gov//
123
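The doubled slashes in entries such as https://www.census.gov//en.html come from prefixing hrefs that already start with /, which also lets the same page be counted twice. A possible variation, sketched here on a few made-up hrefs rather than the live page, resolves each href with urllib.parse.urljoin and removes duplicates with dict.fromkeys, which keeps the original order.

```python
from urllib.parse import urljoin

# Base page from the question; the hrefs below are made-up samples.
base = "https://www.census.gov/programs-surveys/popest.html"
hrefs = ["/en.html", "https://www.census.gov/en.html", "/popest", "#content", "/popest"]

# urljoin resolves relative hrefs against the base without doubling slashes.
absolute = [urljoin(base, href) for href in hrefs]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_links = list(dict.fromkeys(absolute))

for link in unique_links:
    print(link)
print(len(unique_links))
```

With this approach a fragment-only href such as #content resolves against the full page URL rather than the site root, so the resulting count on the real page would likely differ from 123.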

Python scraping information has the same classes

How do I return data from https://finance.yahoo.com/quote/FB?p=FB? I am trying to pull the open and close data. The problem is that both of these numbers share the same class in the page source.
They both share this class 'Trsdu(0.3s) '
How can I differentiate these if the classes are the same?
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
This call:
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
returns the text of only the first element with class Trsdu(0.3s).
Using:
googclose = googsoup.find_all(class_='Trsdu(0.3s)')
returns a list of all the page's elements with class Trsdu(0.3s).
Then you can iterate over them:
for element in googsoup.find_all(class_='Trsdu(0.3s)'):
    print(element.get_text())
Check this out, if this is what you wanted:
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.select('span[data-reactid="42"]')[1].text
googopen = googsoup.select('span[data-reactid="48"]')[0].text
print("Close: {}\nOpen: {}".format(googclose,googopen))
Result:
Close: 172.17
Open: 171.69
If you want just the values for Open and Previous Close, you can either use findAll and grab the first 2 items in the results:
googclose, googopen = googsoup.findAll('span', class_='Trsdu(0.3s) ')[:2]
googclose = googclose.get_text()
googopen = googopen.get_text()
print(googclose, googopen)
>>> 172.17 171.69
Or you can go one level higher and find the values based on the parent td, using the data-test attribute:
googclose = googsoup.find('td', attrs={'data-test': 'PREV_CLOSE-value'}).get_text()
googopen = googsoup.find('td', attrs={'data-test': 'OPEN-value'}).get_text()
print(googclose, googopen)
>>> 172.17 171.69
If you use the Chrome browser you can right-click on the item that you want to know more about then select Inspect from the resulting menu. The browser will show you something like this for the number associated with OPEN.
Notice that, in addition to the class attribute, there is a data-reactid attribute that might do the trick. In fact, if you also inspect the close number you will find, as I did, that its data-reactid value is different.
This suggests the following code.
>>> import requests
>>> import bs4
>>> page = requests.get('https://finance.yahoo.com/quote/FB?p=FB').text
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> soup.findAll('span', attrs={'data-reactid': '42'})[0].text
'172.17'
>>> soup.findAll('span', attrs={'data-reactid': '48'})[0].text
'171.69'
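The two approaches from the answers above (taking matches by position, or going through a uniquely labelled parent) can be compared offline. This is a minimal sketch on a hand-written HTML fragment that only imitates the structure described here; the live page's markup is an assumption and may have changed.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the described structure (an assumption).
html = (
    '<table>'
    '<tr><td data-test="PREV_CLOSE-value"><span class="Trsdu(0.3s)">172.17</span></td></tr>'
    '<tr><td data-test="OPEN-value"><span class="Trsdu(0.3s)">171.69</span></td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, "html.parser")

# By position: both spans share the class, so order is all that tells them apart.
by_position = [s.get_text() for s in soup.find_all("span", class_="Trsdu(0.3s)")]

# By parent: each td has its own data-test value, so the lookup names what it wants.
by_parent = {
    name: soup.find("td", attrs={"data-test": name + "-value"}).get_text()
    for name in ("PREV_CLOSE", "OPEN")
}

print(by_position)
print(by_parent)
```

Selecting through the parent attribute tends to survive layout changes better than relying on position, since inserting another element with the same class would shift the indexes.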

Python index function

I am writing a simple Python program which grabs a webpage and finds all the URL links in it. However, when I try to index the starting and ending delimiters (") of each href link, the ending one is always indexed wrong.
# open a url and find all the links in it
import urllib2
url=urllib2.urlopen('right.html')
urlinfo = url.info()
urlcontent = url.read()
bodystart = urlcontent.index('<body')
print 'body starts at',bodystart
bodycontent = urlcontent[bodystart:].lower()
print bodycontent
linklist = []
n = bodycontent.index('<a href=')
while n:
    print n
    bodycontent = bodycontent[n:]
    a = bodycontent.index('"')
    b = bodycontent[(a+1):].index('"')
    print a, b
    linklist.append(bodycontent[(a+1):b])
    n = bodycontent[b:].index('<a href=')
print linklist
I would suggest using an HTML parsing library instead of manually searching the HTML string.
Beautiful Soup is an excellent library for this purpose.
With Beautiful Soup, your link search could look like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(bodycontent, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a')]
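As for why the ending delimiter came out wrong in the question's loop: b is computed on the slice bodycontent[(a+1):], so it is an offset from a+1, yet it is then used as an absolute position. For comparison, here is a minimal sketch of the manual approach that avoids slicing by passing a start position to str.find. The function name extract_hrefs and the sample string are made up for illustration, and it assumes every link is written exactly as <a href="...">.

```python
def extract_hrefs(html):
    """Collect href values by scanning for the literal marker <a href=" ."""
    links = []
    marker = '<a href="'
    pos = html.find(marker)
    # str.find returns -1 when nothing is left, so the loop ends cleanly.
    while pos != -1:
        start = pos + len(marker)
        # Searching from 'start' keeps 'end' an absolute index into html.
        end = html.find('"', start)
        if end == -1:
            break
        links.append(html[start:end])
        pos = html.find(marker, end)
    return links


sample = '<body><a href="one.html">1</a> text <a href="two.html">2</a></body>'
print(extract_hrefs(sample))
```

Unlike index(), find() does not raise ValueError when the marker is missing, which is what lets the loop stop after the last link. A parser remains the more robust option, since this scan misses links with other attributes before href or with single quotes.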
