BeautifulSoup and Large html - python

I was trying to scrape a number of large Wikipedia pages like this one.
Unfortunately, BeautifulSoup cannot handle content this large, and it truncates the page.

I found a solution to this problem at beautifulsoup-where-are-you-putting-my-html. It still uses BeautifulSoup, which I think is easier than lxml.
The only thing you need to do is install html5lib:
pip install html5lib
and pass it as the parser argument to BeautifulSoup:
soup = BeautifulSoup(htmlContent, 'html5lib')
However, if you prefer, you can also use lxml as follows:
import lxml.html
doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
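Which of these parsers you can use depends on what is installed, so a script can fall back instead of failing. A minimal sketch: the helper name make_soup and the preference order are my own choices, not something from the answers above. bs4 raises FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    make_soup is a hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable HTML parser found")


soup = make_soup("<p>Talk:Game theory</p>")
print(soup.find("p").get_text())
```

html.parser ships with Python, so the last entry should always be available as a fallback.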

I suggest you get the HTML content first and then pass it to BeautifulSoup:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    # name the parser explicitly (requires: pip install html5lib)
    soup = BeautifulSoup(r.content, 'html5lib')
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print(a.text)
else:
    print(r.status_code)
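Note that find returns None when the element is absent (for example, if the page came back truncated or has no category box), and calling find_all on None raises AttributeError. A small sketch of a guarded version, run on an inline snippet instead of the live page. The function name category_links and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def category_links(markup):
    """Return the link texts inside div#catlinks, or [] if the div is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    links_div = soup.find("div", id="catlinks")
    if links_div is None:
        return []
    return [a.get_text(strip=True) for a in links_div.find_all("a")]


# Invented sample markup, only shaped like the category box at the page bottom.
sample = '<div id="catlinks"><a href="/c1">Categories</a> <a href="/c2">Game theory</a></div>'
print(category_links(sample))
print(category_links("<p>no category box here</p>"))
```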

Related

How to use lxml for web scraping?

I want to write a python script that fetches my current reputation on stack overflow --https://stackoverflow.com/users/14483205/raunanza?tab=profile
This is the code I have written.
from lxml import html
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content)
Now, what should I do to fetch my reputation? (I can't understand how to use XPath, even
after googling it.)
You need to make some modifications to your code to use XPath. Below is the code:
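from lxml import html
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content)
# XPath tests attributes with @, so the id test is [@id="..."]
title = tree.xpath('//*[@id="avatar-card"]/div[2]/div/div[1]/text()')
print(title)  # a list of matching text nodes; the reputation was 3 when this was written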
You can easily get the XPath of an element in Chrome's developer tools (the Inspect option).
To learn more about XPath, you can refer to: https://www.w3schools.com/xml/xpath_examples.asp
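For a feel of the syntax without fetching anything: XPath tests attributes with @, so an id test is written [@id="..."]. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml. ElementTree supports only a small XPath subset and needs well-formed markup, and the markup here is invented: it only mimics the nesting that the path above walks through.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup shaped like the nesting in the XPath above.
markup = (
    '<html><body>'
    '<div id="avatar-card">'
    '<div>Raunanza</div>'
    '<div><div><div>3</div><div>reputation</div></div></div>'
    '</div>'
    '</body></html>'
)

root = ET.fromstring(markup)
# Any element, at any depth, whose id attribute equals "avatar-card".
card = root.find(".//*[@id='avatar-card']")
# Second div child, then its div child, then that element's first div child.
reputation = card.find("div[2]/div/div[1]").text
print(reputation)
```

Paths that depend on positions like div[2] break as soon as the page layout changes, so an id or class test is usually the sturdier choice when one is available.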
Simple solution using lxml and beautifulsoup:
from lxml import html
from bs4 import BeautifulSoup
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile').text
tree = BeautifulSoup(page, 'lxml')
name = tree.find("div", {'class': 'grid--cell fw-bold'}).text
title = tree.find("div", {'class': 'grid--cell fs-title fc-dark'}).text
print("Stackoverflow reputation of {}is: {}".format(name, title))
# output: Stackoverflow reputation of Raunanza is: 3
If you don't mind using BeautifulSoup, you can directly extract the text from the tag which contains your reputation. Of course you need to check page structure first.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
soup = BeautifulSoup(page.content, features='lxml')
for tag in soup.find_all('strong', {'class': 'ml6 fc-medium'}):
    print(tag.text)
# this will output 3

BeautifulSoup doesn't find line

I'm trying to get the length of a YouTube video with BeautifulSoup and by inspecting the site I can find this line: <span class="ytp-time-duration">6:14:06</span> which seems to be perfect to get the duration, but I can't figure out how to do it.
My script:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
soup = BeautifulSoup(response.text, "html5lib")
mydivs = soup.findAll("span", {"class": "ytp-time-duration"})
print(mydivs)
The problem is that the output is []
From the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/), the long form would be:
soup.findAll('span', attrs={"class": u"ytp-time-duration"})
But I guess that your syntax is a shortcut for that, so the more likely explanation is that span.ytp-time-duration is not present in the HTML that requests downloads. It is only added by JavaScript after the page loads. Maybe that's why the soup does not pick it up.
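The effect can be reproduced offline. In the invented page below, the span exists only as JavaScript source, so the parser, which never runs scripts, finds nothing:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the duration span is created by a script at load time.
page = """
<div id="player"></div>
<script>
  var s = document.createElement("span");
  s.className = "ytp-time-duration";
  s.textContent = "6:14:06";
  document.getElementById("player").appendChild(s);
</script>
"""

soup = BeautifulSoup(page, "html.parser")
spans = soup.find_all("span", class_="ytp-time-duration")
print(spans)

# The class name is in the download, but only as script text, not as an element.
print("ytp-time-duration" in page)
```

Searching the raw response text for the class name, as in the last line, is a quick way to tell whether an element is really in the downloaded HTML or is only built later in the browser.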

Suitable javascript parser to be used with urlopen

I am trying the following:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://search.wcad.org/Property-Detail?PropertyQuickRefID=R000017&PartyQuickRefID=O0532572'
soup = BeautifulSoup(urlopen(url).read())
print soup
The print statement displays a very complicated text structure, and it is difficult to extract variables from it. What is a better way to extract variables like Legal Description?
You don't need to parse JavaScript to get the "Legal Description" value. You need to parse HTML, and BeautifulSoup's HTML parser can do the job. Locate the td element by its 'Legal Description' text and then get the next td element:
soup.find("td", text="Legal Description").find_next_sibling("td").get_text()
Note: you are using BeautifulSoup version 3 - it is very outdated and not maintained - switch to the 4th version:
pip install beautifulsoup4
And change your import from:
from BeautifulSoup import BeautifulSoup
to:
from bs4 import BeautifulSoup
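The label-then-sibling lookup can be tried on a small table first. A sketch with made-up markup and made-up values (the real page's layout may differ); it uses string=, the newer name for the text= argument in Beautiful Soup 4:

```python
from bs4 import BeautifulSoup

# Invented two-row table; the values are placeholders, not real property data.
html = """
<table>
  <tr><td>Owner</td><td>EXAMPLE OWNER</td></tr>
  <tr><td>Legal Description</td><td>LOT 1, BLOCK A</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the cell whose text is exactly the label, then step to the next cell.
label = soup.find("td", string="Legal Description")
value = label.find_next_sibling("td").get_text(strip=True)
print(value)
```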
Though you can do this with urllib2, I would recommend using requests.
The id is unique for each field, so you can get the text directly by finding the element by its id.
import requests
from bs4 import BeautifulSoup
url = "http://search.wcad.org/Property-Detail?PropertyQuickRefID=R000017&PartyQuickRefID=O0532572"
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
text = soup.find("td", id="dnn_ctr1460_View_tdGILegalDescription").get_text()
print(text)
NOTE: I've used BeautifulSoup version 4. To install it, use this command: pip install beautifulsoup4

bs4 the second comment <!-- > is missing

I am doing python challenge level-9 with BeautifulSoup.
url = "http://www.pythonchallenge.com/pc/return/good.html"
bs4.__version__ == '4.3.2'
There are two comments in its page source. The output of soup should be as follows.
However, when BeautifulSoup is applied, the second comment is missing.
It seems kinda weird. Any hint? Thanks!
import requests
from bs4 import BeautifulSoup
url = "http://www.pythonchallenge.com/pc/return/good.html"
page = requests.get(url, auth = ("huge", "file")).text
print page
soup = BeautifulSoup(page)
print soup
Beautiful Soup is a wrapper around an HTML parser. The parser it uses by default can be strict, and when it encounters malformed HTML it may silently drop the elements it had trouble with.
You should instead install the html5lib package and use that as your parser, like so:
soup = BeautifulSoup(page, 'html5lib')
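To check whether both comments survived parsing, you can collect the Comment nodes and count them. A sketch on an invented two-comment page (not the real challenge page), using the built-in html.parser; swap in 'html5lib' to compare the two parsers on your own input:

```python
from bs4 import BeautifulSoup, Comment

# Invented page with two comments; the real challenge page is different.
page = """
<html><body>
<!-- first: some hint -->
<p>visible text</p>
<!-- second: more data -->
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")
# Comment is a string subclass, so filter the document's strings by type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))
for c in comments:
    print(c.strip())
```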

BeautifulSoup does not work for some web sites

I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this web site, it prints an empty list. What could be the problem? I am running on Ubuntu 12.04.
Actually, there are quite a few bugs in BeautifulSoup that can cause unexpected errors. I had a similar issue when working on Apache using the lxml parser.
So just try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
It looks like you have a mistake in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
