BeautifulSoup/Python Problems Parsing Websites - python

I'm sure this has been asked before, but I am attempting to parse a website (and hopefully automate it to parse multiple websites eventually) and it's not working properly. I may be grabbing the wrong tags. Essentially, I want to go to this website, pull all of the items from the lists on it (possibly with hrefs intact or in a separate document), and put them into a file that I can share in an easy-to-read format. So far this is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
results = soup.find_all('div', class_="tab_content")
for element in results:
    title_elem = element.find('h1')
    h2_elem = element.find('h2')
    h3_elem = element.find('h3')
    href_elem = element.find('href')
    if None in (title_elem, h2_elem, h3_elem, href_elem):
        continue
    print(title_elem.text.strip())
    print(h2_elem.text.strip())
    print(h3_elem.text.strip())
    print(href_elem.text.strip())
    print()
I even attempted to write this for a table but I get the same type of output, which is a bunch of empty elements:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        print(subtable)
Does anyone have any insight as to why this may be the case? If possible I would also not be opposed to regex parsing, but the main goal here is to go into this site (and hopefully others like it), take the entire table/lists/descriptions of the individual programs for each major, and write the information into an easy-to-read file.

A similar approach, in that I also chose to combine bs4 with pandas, but I tested for the presence of the hyperlink class.
from io import StringIO
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('.sc_courselist'):
    # read_html returns a list of DataFrames; StringIO is used because recent pandas versions deprecate passing a literal HTML string
    tbl = pd.read_html(StringIO(str(table)))[0]
    # first cell of each row: take the course link if there is one, otherwise an empty string
    links_column = ['http://catalog.apu.edu' + i.select_one('.bubblelink')['href'] if i.select_one('.bubblelink') is not None else '' for i in table.select('td:nth-of-type(1)')]
    tbl['Links'] = links_column
    print(tbl)
With BeautifulSoup, an alternative to find/find_all is select_one/select. The latter two apply CSS selectors: select_one returns the first match for the selector passed in, and select returns a list of all matches. "." is a class selector, meaning it selects elements with the specified class, e.g. sc_courselist or bubblelink. bubblelink is the class of the elements carrying the desired hrefs. These sit in the first column of each table, which is selected using td:nth-of-type(1).
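To see the selectors in isolation, here is a minimal sketch run against a made-up table. The markup and the href value are assumptions for illustration only, not the real catalog page; only the class names sc_courselist and bubblelink come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a course-list table.
html = """
<table class="sc_courselist">
  <tr><td><a class="bubblelink" href="/search/?P=MATH%20165">MATH 165</a></td><td>Calculus I</td></tr>
  <tr><td>Elective</td><td>Select one course</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the class selector.
table = soup.select_one(".sc_courselist")

# select returns every match: here, the first <td> of each row.
first_cells = table.select("td:nth-of-type(1)")

links = []
for cell in first_cells:
    anchor = cell.select_one(".bubblelink")
    # Rows without a course link get an empty string so the list
    # stays aligned with the table rows.
    links.append(anchor["href"] if anchor is not None else "")

print(links)
```

Keeping one entry per row (empty when there is no link) is what lets the list be attached to the DataFrame as a column.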

Related

Matching a specific piece of text in a title using Beautiful Soup

Basically, I want to find all links that contain certain key terms. In my case, the titles of these links that I want come in this form: abc... (common text), dce... (common text), ... I want to take all of the links containing "(common text)" and put them in the list. I got the code working and I understand how to find all links. However, I converted the links to strings to find the "(common text)". I know that this isn't good practice and I am not sure how to use Beautiful Soup to find this common element without converting to a string. The issue here is that the titles I am searching for are not all the same. Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
links_length = len(links)
string_links = []
targetlist = []
for a in range(links_length):
    string_links.append(str(links[a]))
    if '(common text)' in string_links[a]:
        targetlist.append(string_links[a])
NOTE: I am looking for the simplest method using Beautiful Soup to accomplish this. Any help will be appreciated.
Without the actual website and the actual output you want, it's very difficult to say what you need, but this is a "cleaner" solution using a list comprehension.
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
targetlist = [str(link) for link in links if "(common text)" in str(link)]
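If the goal is to avoid converting each tag to a string, one possible variation is to test the link's visible text with get_text() and keep the Tag objects themselves. This is only a sketch: the sample markup below is invented, and it assumes "(common text)" appears in the link text rather than in an attribute.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the real page.
html = """
<a href="/a">abc (common text)</a>
<a href="/b">unrelated title</a>
<a href="/c">dce <b>(common text)</b></a>
"""
soup = BeautifulSoup(html, "html.parser")

needle = "(common text)"

# get_text() joins the text of nested tags too, so the third link
# still matches even though the phrase sits inside a <b> element.
target_links = [a for a in soup.find_all("a") if needle in a.get_text()]

# The matches are still Tag objects, so attributes remain available.
hrefs = [a.get("href") for a in target_links]
print(hrefs)
```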

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and comes from a jQuery call. I have managed to write the code below, which gets a large amount of text; the script at index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the page source of the URL, you will find two script elements that contain CDATA. The one you are interested in also contains jQuery, so you can select the script element on that basis. After that, you need to do some cleaning to strip the CDATA markers and the jQuery call. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
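As a variation on the chain of replace() calls, the JSON argument can also be captured with a regular expression. The sketch below runs on a small made-up script element; the nesting of the keys mirrors the lookup used above, but the surrounding markup is an assumption and the real page may differ.

```python
import json
import re
from bs4 import BeautifulSoup

# Made-up page fragment imitating the CDATA-wrapped jQuery call.
html = """
<script>
<!--//--><![CDATA[//><!--
jQuery.extend(Drupal.settings, {"objectFlot": {"plotMain1": {"params": {"redshift": "0.06"}}}});
//--><!]]>
</script>
"""

# Capture everything between "jQuery.extend(Drupal.settings," and the closing ");".
pattern = re.compile(r"jQuery\.extend\(Drupal\.settings,\s*(\{.*\})\s*\);", re.DOTALL)

soup = BeautifulSoup(html, "html.parser")
settings = None
for script in soup.find_all("script"):
    # .string is None for scripts without inline content (e.g. src-only tags).
    match = pattern.search(script.string or "")
    if match:
        settings = json.loads(match.group(1))
        break

redshift = settings["objectFlot"]["plotMain1"]["params"]["redshift"]
print(redshift)
```

Matching the call itself avoids having to know exactly which comment and CDATA markers wrap it.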

Extract specific value from a table using Beautiful Soup (Python)

I looked around on Stack Overflow, and most guides seem to focus on extracting all of the data from a table. However, I only need to extract one value, and I just can't seem to extract that specific value from the table.
Scrape link:
https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919
I am looking to extract the "Style" value from the table within the link.
Code:
import bs4
import requests
styleData=[]
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
table=cleanpagedata.find('div',{'id':'MainContent_ctl01_panView'})
style=table.findall('tr')[3]
style=style.findall('td')[1].text
print(style)
styleData.append(style)
You probably misspelled the method name: Beautiful Soup's method is find_all (with an underscore), not findall. Try this:
style=table.find_all('tr')[3]
style=style.find_all('td')[1].text
print(style)
It will give you the expected output.
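For what it's worth, here is a small sketch of why the misspelling fails. Beautiful Soup treats an unknown attribute on a tag as a lookup for a child tag of that name, so table.findall searches for a <findall> element and returns None. The one-row table below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up one-row table standing in for the real page.
html = "<table><tr><td>Style:</td><td>Townhouse End</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Unknown attribute names are treated as a search for a child tag with
# that name, so this looks for a <findall> element and finds nothing.
missing = table.findall
print(missing)

# With the correct spelling the lookup behaves as intended.
cells = table.find_all("tr")[0].find_all("td")
style = cells[1].text
print(style)
```

Calling the None result, as in table.findall('tr'), is what raises the TypeError.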
You can use a CSS Selector:
#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)
This selects the second <td> of the fourth <tr> inside the element whose id is "MainContent_ctl01_grdCns".
To use a CSS selector, use the .select() method instead of find_all(), or select_one() instead of find().
import requests
from bs4 import BeautifulSoup
URL = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print(
    soup.select_one(
        "#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)"
    ).text
)
Output:
Townhouse End
Could also do something like:
import bs4
import requests
style_data = []
url = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')
# select the first `td` tag whose text contains the substring `Style:`
row = soup.select_one('td:-soup-contains("Style:")')
if row:
    # if that cell was found, get its sibling, which should hold the value you want
    home_style_tag = row.next_sibling
    style_data.append(home_style_tag.text)
A couple of notes:
This uses CSS selectors rather than the find methods. See the SoupSieve docs for more details.
The select_one call relies on the table always being ordered in a certain way. If that is not the case, use select and iterate through the results to find the bs4.Tag whose text is exactly 'Style:', then grab its next sibling.
Using select:
rows = soup.select('td:-soup-contains("Style:")')
# keep only the cells whose text is exactly 'Style:'
matches = [r for r in rows if r.text == 'Style:']
if matches:
    home_style_text = matches[0].next_sibling.text
You can use :-soup-contains (the current name for the deprecated :contains) on a td to get the node whose text includes "Style", then an adjacent sibling combinator with a td type selector to get the adjacent td value.
import bs4, requests
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
print(cleanpagedata.select_one('td:-soup-contains("Style") + td').text)
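The label-then-sibling idea can be tried offline on a tiny table. This is a sketch only: the markup and the "Year Built" row are invented, and it assumes a Soup Sieve version recent enough (2.1 or later) to support :-soup-contains.

```python
from bs4 import BeautifulSoup

# Invented two-row table; only the "Style:" label and its value matter here.
html = """
<table id="MainContent_ctl01_grdCns">
  <tr><td>Year Built:</td><td>1985</td></tr>
  <tr><td>Style:</td><td>Townhouse End</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# "A + B" matches a B element that immediately follows an A element, so this
# picks the <td> right after the cell whose text contains "Style:".
value_cell = soup.select_one('td:-soup-contains("Style:") + td')

# Guard against a missing row instead of assuming the label is present.
style = value_cell.get_text(strip=True) if value_cell else None
print(style)
```

Because the selector keys on the label rather than on row position, it keeps working if rows are added or reordered.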

BeautifulSoup doesn't show all elements

I'm trying to parse the Taobao website and get information about goods (photo, text and link) with BeautifulSoup.find, but it doesn't find all classes.
import requests
from bs4 import BeautifulSoup
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
def get_html(url):
    r = requests.get(url)
    return r.text
html=get_html(url)
soup=BeautifulSoup(html, 'lxml')
z=soup.find("div",{"class":"J_TItems"})
z is empty.
but for example:
z=soup.find("div",{"class":"skin-box-bd"})
len(z)
Out[196]: 3
works fine
Why doesn't this approach work? What should I do to get all the information about the goods? I am using Python 2.7.
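It looks like the items you want to parse are built dynamically by JavaScript. That's why html.find("J_TItems") returns -1, i.e. the string "J_TItems" does not appear anywhere in the downloaded HTML. What you can do is use selenium, which drives a real browser with a JavaScript interpreter. For headless browsing you can use PhantomJS like this (note that Selenium 4 removed PhantomJS support, so with newer versions a headless Chrome or Firefox driver takes its place):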
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
Note that the items you want are inside the rows defined by <div class="item4line1">, and there are 5 of them. You may only want the first three, because the other two are not really part of the main search; filtering them out should not be difficult, a simple rows = rows[2:] does the trick:
rows = JTitems.findAll("div", attrs={"class":"item4line1"})
>>> len(rows)
5
Now notice each "Good" you mention in the question is inside a <dl class="item">, so you need to get them all in a for loop:
Goods = []
for row in rows:
    for item in row.findAll("dl", attrs={"class":"item"}):
        Goods.append(item)
All that's left is to get the "photo, text and link" you mentioned. This is easily done by accessing each item in the Goods list; by inspecting the markup you can work out how to get each piece of information. For example, a simple one-liner for the picture URL would be:
>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'
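Putting the pieces together, collecting all three fields per item might look roughly like the sketch below. The markup is invented: only the J_TItems, item4line1, item and photo class names come from the answer above. Taking the title from the image's alt attribute and the link from the anchor inside dt.photo are assumptions that would need checking against the rendered page.

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for the browser-rendered page source.
html = """
<div class="J_TItems">
  <div class="item4line1">
    <dl class="item">
      <dt class="photo"><a href="//detail.example.com/item.htm?id=1"><img src="//img.example.com/1.jpg" alt="Jacket A"></a></dt>
    </dl>
    <dl class="item">
      <dt class="photo"><a href="//detail.example.com/item.htm?id=2"><img src="//img.example.com/2.jpg" alt="Jacket B"></a></dt>
    </dl>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

goods = []
for item in soup.select("div.J_TItems div.item4line1 dl.item"):
    anchor = item.select_one("dt.photo a")
    image = anchor.img if anchor is not None else None
    if image is None:
        # Skip items that do not follow the assumed layout.
        continue
    goods.append({
        "photo": image.get("src"),
        "text": image.get("alt", ""),   # assumption: alt text holds the title
        "link": anchor.get("href"),     # assumption: photo anchor is the item link
    })

print(goods)
```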

Best way to get 'hrefs' from CSS selector in BeautifulSoup?

I'm writing a script that would, initially, scrape the data for all of the census blocks in a given census block group. In order to do that, though, I first need to be able to get a link to each of the block groups in a given tract. The tracts are defined by a list of URLs, each of which returns a page that lists the block groups under the CSS selector "div#rList3 a". When I run this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
tracts = ['http://www.usa.com/NY023970800.html','http://www.usa.com/NY023970900.html',
'http://www.usa.com/NY023970600.html','http://www.usa.com/NY023970700.html',
'http://www.usa.com/NY023970500.html']
class Scrape:
    def scrapeTracts(self):
        for i in tracts:
            html = urlopen(i)
            soup = BeautifulSoup(html.read(), 'lxml')
            bgs = soup.select("div#rList3 a")
            print(bgs)
s = Scrape()
s.scrapeTracts()
This gives me an output that looks like: [<a href="/NY0239708001.html">NY0239708001</a>] (with most of the links cut out for the sake of the length of this post). My question is, how can I get just the string after 'href', in this case /NY0239708001.html?
You can do this mostly in one line, by doing this:
bgs = [i.attrs.get('href') for i in soup.select("div#rList3 a")]
Output:
['/NY0239708001.html']
['/NY0239709001.html', '/NY0239709002.html', '/NY0239709003.html', '/NY0239709004.html']
['/NY0239706001.html', '/NY0239706002.html', '/NY0239706003.html', '/NY0239706004.html']
['/NY0239707001.html', '/NY0239707002.html', '/NY0239707003.html', '/NY0239707004.html', '/NY0239707005.html']
['/NY0239705001.html', '/NY0239705002.html', '/NY0239705003.html', '/NY0239705004.html']
Each node has an attrs dictionary which contains the attributes of that node...including CSS classes, or in this case, the href.
hrefs = []
for bg in bgs:
    hrefs.append(bg.attrs['href'])
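Since the hrefs come back as site-relative paths, a likely next step before fetching the block group pages is to turn them into full URLs. A minimal sketch using urljoin from the standard library follows; the inline markup is made up to match the selector, and the base URL is the first tract from the question.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# First tract URL from the question, used as the base for relative links.
base_url = "http://www.usa.com/NY023970800.html"

# Made-up fragment matching the "div#rList3 a" selector.
html = '<div id="rList3"><a href="/NY0239708001.html">NY0239708001</a></div>'
soup = BeautifulSoup(html, "html.parser")

# a[href] skips any anchors that have no href attribute at all.
hrefs = [a["href"] for a in soup.select("div#rList3 a[href]")]

# urljoin resolves each relative path against the page it came from.
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)
```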
