Parsing a webpage using Python

I am trying to parse a webpage (forums.macrumors.com) and get a list of all the threads posted.
So I have got this so far:
import urllib2
import re
from BeautifulSoup import BeautifulSoup
address = "http://forums.macrumors.com/forums/os/"
website = urllib2.urlopen(address)
text = website.read()
soup = BeautifulSoup(text)
Now the webpage source has this code at the start of each thread:
<li id="thread-1880" class="discussionListItem visible sticky WikiPost "
data-author="ABCD">
How do I parse this so I can then get to the thread link within this li tag? Thanks for the help.

So from your code here, you have the soup object, which contains the BeautifulSoup parse of your html. The question is which part of the tag you're looking for is static: is the id always the same? The class?
Finding by the id:
my_li = soup.find('li', {'id': 'thread-1880'})
Finding by the class:
my_li = soup.find('li', {'class': 'discussionListItem visible sticky WikiPost '})
Ideally you would figure out the unique class you can check for and use that instead of a list of classes.
If you are expecting an a tag inside of this object, you can do this to check:
if my_li and my_li.a:
    print my_li.a.attrs.get('href')
I always recommend checking, because if my_li ends up being None or there is no a inside of it, your code will fail.
For more details, check out the BeautifulSoup documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

The idea here would be to use CSS selectors: get the a elements inside the h3 with class="title", inside the div with class="titleText", inside the li elements whose id attribute starts with "thread":
for link in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]"):
    print link["href"]
You can tweak the selector further, but this should give you a good starting point.
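To sanity-check the selector without fetching the live page, you can run it against a small inline snippet; the markup below is an assumption modelled on the fragment quoted in the question, not the forum's actual HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the <li> quoted in the question --
# the real forum markup may differ.
html = """
<div class="discussionList">
  <li id="thread-1880" class="discussionListItem visible sticky WikiPost" data-author="ABCD">
    <div class="titleText">
      <h3 class="title"><a href="/threads/example-thread.1880/">Example thread</a></h3>
    </div>
  </li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select(
    "div.discussionList li[id^=thread] div.titleText h3.title a[href]")]
print(links)  # ['/threads/example-thread.1880/']
```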

Related

Referring to two classes inside an HTML tag using BeautifulSoup 4

Hello, I'm having an issue trying to get the DOM element for a specific tag that has two classes.
Here's an example:
<a href="website" class="tab_item app_impression_tracked">
</a>
This is the code I used for finding the elements:
containers = page_soup.find_all("a", {"class": "tab_item.app_impressions_tracked"})
As a result I get an empty variable, which means it's not working. Do you have any alternative that can help? Even a CSS selector didn't solve the problem.
The page is loaded dynamically, so if you call soup.prettify(), you'll see that all of the desired output is under the class tab_item (which also includes tags we don't want), and not under the class tab_item app_impression_tracked.
A different approach would be to use a CSS Selector to find all links under the NewReleasesRows ID (<div id="NewReleasesRows">).
To use a CSS Selector, call the select() method instead of find_all().
import requests
from bs4 import BeautifulSoup
URL = "https://store.steampowered.com/tags/fr/RPG/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for tag in soup.select("#NewReleasesRows > a"):
    print(tag["href"])
Output:
https://store.steampowered.com/app/1433420/Hero_by_Chance/?snr=1_241_4_rpg_103
https://store.steampowered.com/app/1235140/Yakuza_Like_a_Dragon/?snr=1_241_4_rpg_103
https://store.steampowered.com/app/1445440/Blacksmith_of_the_Sand_Kingdom/?snr=1_241_4_rpg_103
...And on...
I would use css selectors for this:
# You can chain classes together, css will match objects which satisfy both classes
# the added [href] specifies element must have an href
my_links = soup.select('a.tab_item.app_impression_tracked[href]')
# Get the hrefs for each one:
my_links_href = [x.get("href") for x in my_links]
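As a self-contained sketch, here is the chained-class selector run against an inline snippet (the markup is illustrative, not the store's actual HTML):

```python
from bs4 import BeautifulSoup

# Illustrative markup only -- not the site's actual HTML
html = '''
<a href="https://example.com/app/1" class="tab_item app_impression_tracked">Game one</a>
<a href="https://example.com/app/2" class="tab_item">Game two</a>
<a class="tab_item app_impression_tracked">No href here</a>
'''

soup = BeautifulSoup(html, "html.parser")
# a.tab_item.app_impression_tracked matches <a> tags carrying BOTH classes;
# [href] additionally requires an href attribute
my_links = soup.select('a.tab_item.app_impression_tracked[href]')
my_links_href = [x.get("href") for x in my_links]
print(my_links_href)  # ['https://example.com/app/1']
```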

Why is BeautifulSoup's findAll returning an empty list when I search by class?

I am trying to web-scrape using an h2 tag, but BeautifulSoup returns an empty list.
<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
bs0bj = BeautifulSoup(html, "lxml")
nameList = bs0bj.findAll("h2", {"class": "iCIMS_InfoMsg iCIMS_InfoField_Job"})
print(nameList)
The content is inside an iframe and is updated via JavaScript, so it is not present in the initial request. You can use the same link the page uses to obtain the iframe content (the iframe src). Then extract the string from the script tag that has the info and load it with json, extract the description (which is HTML), and pass it back to BeautifulSoup to select the h2 tags. You now also have the rest of the info stored in the second soup object, if required.
import requests
from bs4 import BeautifulSoup as bs
import json
r = requests.get('https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job?mobile=false&width=1140&height=500&bga=true&needsRedirect=false&jan1offset=0&jun1offset=60&in_iframe=1')
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script)
soup = bs(data['description'], 'lxml')
headers = [item.text for item in soup.select('h2')]
print(headers)
The answer lies hidden in two things:
JavaScript-rendered content, loaded after document.onload.
In particular, the JS-managed content comes after the comment <!--BEGIN ICIMS--> and is, indeed, rendered by JS.
As you can imagine, the h2 with the iCIMS class DOESN'T exist yet WHEN you call the bs4 methods.
The solution?
IMHO the best way to achieve what you want is to use selenium, to get a fully rendered web page.
Check this also:
Web-scraping JavaScript page with Python

Finding video id on html website using python

I am scraping an html file; each page has a video on it, and in the html there is the video's id. I want to print out the video id.
I know that if I want to print a headline from a div class I would do this:
with open('yeehaw.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    article = soup.find('div', class_='article')
    headline = article.h2.a.text
    print headline
However, the id for the video is found inside a data-id='qe67234' attribute.
I don't know how to access this 'qe67234' and print it out.
Please help, thank you!
Assuming that the tag for data-id begins with div:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>')
results = soup.findAll("div", {"data-id" : re.compile(r".*")})
print('output: ', results[0]['data-id'])
# output: qe67234
Assuming that the data-id is in a div:
BeautifulSoup.find returns the found html element as a Tag object, whose attributes can be accessed like a dictionary. You can therefore navigate it using standard means to get access to the text (as you did in your question) as well as attribute values (as shown in the code below):
soup = BeautifulSoup('<div class="_article" data-id="qe67234">')
soup.find("div", {"class": "_article"})['data-id']
Note that, oftentimes, video elements require JS for playback, and you might not be able to find the necessary element if the page was fetched with a non-JavaScript client (e.g. python requests).
If this happens, you have to use tools like phantomjs + selenium to render the website with its JavaScript before performing your scraping.
EDIT
If the data-id attribute name itself is not constant, you should look into the lxml library as a replacement for BeautifulSoup and use XPath expressions to find the element that you need.
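For instance, a minimal sketch with lxml (assuming the library is installed), using an XPath that matches any element carrying a data-id attribute, whatever its tag name; the fragment is the hypothetical one from the question:

```python
from lxml import html

# Hypothetical fragment: match any element that has a data-id attribute,
# regardless of tag name or classes
tree = html.fromstring('<div class="_article" data-id="qe67234"></div>')
ids = tree.xpath('//*[@data-id]/@data-id')
print(ids)  # ['qe67234']
```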

How to find links within a specified class with Beautiful Soup

I'm using Beautiful Soup 4 to parse a news site for links contained in the body text. I was able to find all the paragraphs that contain the links, but paragraph.get('href') returned None for each link. I'm using Python 3.5.1. Any help is really appreciated.
from bs4 import BeautifulSoup
import urllib.request
import re
html = urllib.request.urlopen("http://www.cnn.com/2016/11/18/opinions/how-do-you-deal-with-donald-trump-dantonio/index.html").read()
soup = BeautifulSoup(html, "html.parser")
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))
Do you really want this?
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):
        print(a.get('href'))
Note that paragraph.get('href') tries to find the attribute href on the <div> tag you found. As there's no such attribute, it returns None. Most probably you actually have to find all <a> tags which are descendants of your <div> (this can be done with paragraph("a"), which is a shortcut for paragraph.find_all("a")), and then, for every <a> element, look at its href attribute.
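A small self-contained sketch of the difference (the markup is illustrative, not the news site's real HTML):

```python
from bs4 import BeautifulSoup

# Illustrative paragraph div wrapping two links -- not the site's real markup
html = '''
<div class="zn-body__paragraph">
  Some text with <a href="/first">one link</a> and <a href="/second">another</a>.
</div>
'''
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("div", class_="zn-body__paragraph")

print(paragraph.get('href'))                     # None: the <div> has no href
print([a.get('href') for a in paragraph("a")])   # ['/first', '/second']
```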

How to use BeautifulSoup to scrape links in a html

I need to download a few links from an html page, but I don't need all of them, only those in a certain section of the webpage.
For example, in http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the debaters section. I plan to use BeautifulSoup, and I looked at the html of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)
for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code, or how to make it work?
Thanks
I don't see the bl-bigger class at all when I view the source from Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be put within a div with class nytint-discussion-content. So, using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again learning from the source, it seems all the links are within a list (li tags). Now all you have to do is find all the li tags and find the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. From the link given in question, it will fetch 5 links.
PS: link_set contains only URI paths, not full addresses, so I would prepend http://www.nytimes.com before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
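Alternatively, rather than concatenating strings by hand, the standard library's urljoin (urllib.parse in Python 3, urlparse in Python 2) handles both relative and absolute hrefs correctly; a minimal sketch with hypothetical paths:

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

base = 'http://www.nytimes.com'
# Hypothetical hrefs: one relative path, one already-absolute URL
links = ['/roomfordebate/2014/09/24/example', 'http://other.example.com/page']

# urljoin resolves relative paths against the base and
# leaves already-absolute URLs untouched
absolute = [urljoin(base, link) for link in links]
print(absolute)
# ['http://www.nytimes.com/roomfordebate/2014/09/24/example', 'http://other.example.com/page']
```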
You need to pass the attributes as a dictionary instead of a keyword argument (class is a reserved word in Python):
soup.find("tagName", { "class" : "cssClass" })
or use .select method which executes CSS queries:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing the entire script up front, you'll quickly get some working code with the ipython interactive shell.
