How to extract ids and classes from a webpage using Python?

This is my code so far:
from urllib.request import urlopen  # Python 3; urllib2 exists only in Python 2
with urlopen("https://quora.com") as response:
    html = response.read()
I am new to Python and have managed to fetch the webpage. How do I now extract the ids and classes from it?

A better way is to use the BeautifulSoup (bs4) web-scraping library together with requests.
After installing both with pip, you can start like this:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://quora.com")
soup = BeautifulSoup(r.content, "html.parser")
To find an element with a specific id:
soup.find(id="your_id")
To find all elements with the "Answer" class:
soup.find_all(class_="Answer")
You can then use .get_text() to remove the html tags and use python string operations to organize your data.
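To list every id and class on a page rather than look up one you already know, something like the following may work. It is a minimal sketch: SAMPLE_HTML and collect_ids_and_classes are hypothetical names, and the inline markup stands in for the downloaded page (r.content above).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<div id="main" class="content wide">
  <p class="Answer">First</p>
  <p class="Answer highlighted" id="second">Second</p>
</div>
"""

def collect_ids_and_classes(markup):
    """Return (sorted ids, sorted classes) found anywhere in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    ids = set()
    classes = set()
    # find_all(True) matches every tag in the document.
    for tag in soup.find_all(True):
        if tag.has_attr("id"):
            ids.add(tag["id"])
        # "class" is multi-valued, so bs4 returns it as a list of names.
        classes.update(tag.get("class", []))
    return sorted(ids), sorted(classes)

ids, classes = collect_ids_and_classes(SAMPLE_HTML)
print(ids)
print(classes)
```

Sets are used so that a class shared by many elements is reported only once.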

You may try to parse the html code using dedicated libraries, for instance BeautifulSoup.

You can do it easily with XML parsing:
from lxml import html
import requests
page = requests.get('http://google.com')
# Save the raw page content to a file.
with open('/home/Desktop/test.txt', 'wb') as f:
    f.write(page.content)

Related

How to collect data from website when wanted tag haven't class?

I would like to know how to get data from a website.
I found a tutorial and ended up with this:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
page = requete.content
soup = BeautifulSoup(page)
The tutorial says I should use something like this to get the string of a tag:
h1 = soup.find("h1", {"class": "ico-after ico-tutorials"})
print(h1.string)
But I have a problem: the tag whose text I want has no class. What should I do?
I tried passing {}, but that did not work.
I also tried {"class": ""}.
In fact, it returns None.
I want to get the text content of this part of the website:
<div style="font-size:3em; color:#6200C5;">
Orchard</div>
where Orchard is the random word.
Thanks for any help.
Unfortunately, BeautifulSoup offers few ways to point at an element like this, and the page you are trying to scrape is poorly suited to the task (no ids, classes, or other useful HTML features to point at).
Hence, you should change how you point at the HTML element and use XPath, which BeautifulSoup does not support. To do that, parse the page with html from the lxml package. Below is a code snippet (based on the answers to this question) that extracts the random word in your example.
import requests
from lxml import html
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
tree = html.fromstring(requete.content)
rand_w = tree.xpath('/html/body/center/center/table[1]/tr/td/div/text()')
print(rand_w)
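An absolute path like the one above breaks as soon as the page layout shifts. A relative XPath that matches on the style attribute may be less brittle. This is only a sketch: SAMPLE is a hypothetical fragment built from the markup quoted in the question, not the live page.

```python
from lxml import html

# Hypothetical fragment mirroring the markup quoted in the question.
SAMPLE = """
<html><body><center><table><tr><td>
<div style="font-size:3em; color:#6200C5;">
Orchard</div>
</td></tr></table></center></body></html>
"""

tree = html.fromstring(SAMPLE)
# Match any div whose style attribute mentions the large font size.
nodes = tree.xpath('//div[contains(@style, "font-size:3em")]/text()')
# text() keeps the surrounding newline, so strip each result.
words = [text.strip() for text in nodes]
print(words)
```

This assumes the site keeps that inline style on the word's div; if the style changes, the expression needs to change with it.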

Python HTML scraping cannot find attribute I know exists?

I am using the lxml and requests modules and am just trying to parse the article from a website. I also tried find_all from BeautifulSoup but still came up empty.
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and select the element by class name with the CSS method select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n entries but also the text, which I think you can clean up with re.sub or conditional logic.
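One possible shape for that cleanup, using only the standard library. clean_text_nodes and the raw list are hypothetical; raw stands in for whatever the //text() query returns.

```python
import re

def clean_text_nodes(nodes):
    """Join text nodes into one string, dropping whitespace-only entries."""
    # Keep only nodes that contain something other than whitespace.
    kept = [node.strip() for node in nodes if node.strip()]
    # Collapse runs of whitespace left inside each node to single spaces.
    return " ".join(re.sub(r"\s+", " ", node) for node in kept)

# Hypothetical output of an xpath('...//text()') call.
raw = ["\n", "First paragraph.\n", "\n", "  Second\n  paragraph. ", "\n"]
cleaned = clean_text_nodes(raw)
print(cleaned)
```

The conditional filter removes the bare "\n" entries, and re.sub tidies line breaks inside the remaining text.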

Scraping text from HTML5 website using Python

I need a way to scrape just the text from a website using Python. I have installed BeautifulSoup 4, HTML Requests, and NLTK, but I just can't seem to find out how to scrape.
I really need a simple snippet of code that I can plug any URL into and get the plain text. I'm trying to get it from this website.
BeautifulSoup can extract all the texts from a page easily. The following is an example to extract texts inside the <body>...</body> section.
from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup
url = 'https://developer.valvesoftware.com/wiki/Hammer_Selection_Tool'
with urlopen(url) as h:
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup = BeautifulSoup(h.read(), 'html.parser')
print(soup.body.get_text())
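Depending on the page and the bs4 version, script and style contents can end up in the extracted text. A hedged variation is to remove those tags before calling get_text. visible_text and SAMPLE are hypothetical names, and the inline markup stands in for a fetched page.

```python
from bs4 import BeautifulSoup

def visible_text(markup):
    """Return the text of <body> with script and style contents removed."""
    soup = BeautifulSoup(markup, "html.parser")
    # decompose() removes a tag and everything inside it from the tree.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator/strip put each stripped text chunk on its own line.
    return soup.body.get_text(separator="\n", strip=True)

# Hypothetical page standing in for a downloaded document.
SAMPLE = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Heading</h1><script>var x = 1;</script><p>Some <b>bold</b> text.</p></body>
</html>
"""
text = visible_text(SAMPLE)
print(text)
```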

can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape a URL and I had the following code, to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
There is also a dedicated lxml.html module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
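To compare the two approaches side by side, here is a small sketch that runs an XPath query through lxml and the equivalent CSS selector through BeautifulSoup. SAMPLE is a hypothetical table; the id foobar and the class empformbody come from the examples above.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical table standing in for the scraped page.
SAMPLE = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td><td class="other">y</td></tr>
</table>
"""

# XPath through lxml. Note that @class="..." compares the whole attribute
# value, so it would miss a cell with class="empformbody extra".
tree = etree.HTML(SAMPLE)
xpath_cells = tree.xpath('//table[@id="foobar"]//td[@class="empformbody"]/text()')

# CSS selector through BeautifulSoup; td.empformbody matches the class
# name even when the element carries other classes too.
soup = BeautifulSoup(SAMPLE, "html.parser")
css_cells = [cell.get_text() for cell in soup.select("table#foobar td.empformbody")]

print(xpath_cells)
print(css_cells)
```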
I can confirm that there is no XPath support within Beautiful Soup.
As others have said, BeautifulSoup doesn't have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
BeautifulSoup has a findNext method (find_next in bs4) that returns the first matching element after the current one in document order, so calls can be chained:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
The code above roughly imitates the following XPath:
div[@class="class_value"]/div[@id="id_value"]
from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))
The code above combines a Soup object with lxml, so the value can be extracted using XPath.
With lxml it is simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
With BeautifulSoup (bs4) it is simple too:
first, remove "//" and "@";
second, add a star before "=".
Try this:
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you can see, select() cannot pick out an attribute, so I removed the "/@href" part.
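The href values dropped from the selector can still be read from the matched tags afterwards. A minimal sketch, with SAMPLE as hypothetical markup and html.parser used in place of the lxml parser so that only bs4 is required:

```python
from bs4 import BeautifulSoup

# Hypothetical links standing in for the real page.
SAMPLE = """
<a class="shared-components link" href="/first">One</a>
<a class="other" href="/skip">Skip</a>
<a class="shared-components" href="/second">Two</a>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
# select() returns Tag objects; read the attribute from each one.
hrefs = [a["href"] for a in soup.select('a[class*="shared-components"]')]
print(hrefs)
```

Note that class*= is a substring match, so it also matches class names that merely contain "shared-components".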
I've searched through their docs and it seems there is no XPath option.
Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be - no, there is no XPath parsing available.
Maybe you can try the following without XPath
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it thru BeautifulSoup, search for the xpath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
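For simple paths like this, the standard library's xml.etree.ElementTree is another option. It is not BeautifulSoup, and it supports only a limited subset of XPath. The sketch below assumes a hypothetical feed in RSS_TEXT standing in for the rss_text variable above.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed standing in for the downloaded rss_text.
RSS_TEXT = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
"""

# fromstring() returns the root element, here <rss>.
root = ET.fromstring(RSS_TEXT)
# Paths are relative to the root, so /rss/channel/title becomes channel/title.
feed_title = root.findtext("channel/title")
# findall() returns every element matching the path.
item_titles = [title.text for title in root.findall("./channel/item/title")]
print(feed_title)
print(item_titles)
```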
Use soup.find(class_='myclass').

Python XML Parser with BeautifulSoup. How do I remove tags?

For a project I decided to make an app that helps people find friends on Twitter.
I have been able to grab usernames from xml pages. So for example with my current code I can get <uri>http://twitter.com/username</uri> from an XML page, but I want to remove the <uri> and </uri> tags using Beautiful Soup.
Here is my current code:
import urllib
from BeautifulSoup import BeautifulStoneSoup
doc = urllib.urlopen("http://search.twitter.com/search.atom?q=travel").read()
soup = BeautifulStoneSoup(''.join(doc))
data = soup.findAll("uri")
Don't use BeautifulSoup to parse twitter, use their API (also don't use BeautifulSoup, use lxml). To answer your question:
import urllib
from BeautifulSoup import BeautifulSoup
resp = urllib.urlopen("http://search.twitter.com/search.atom?q=travel")
soup = BeautifulSoup(resp.read())
for uri in soup.findAll('uri'):
    uri.extract()
To answer your question about BeautifulSoup, text is what you need to grab the contents of each <uri> tag. Here I extract the information using a list comprehension:
>>> uris = [uri.text for uri in soup.findAll('uri')]
>>> len(uris)
15
>>> print uris[0]
http://twitter.com/MarieJeppesen
But, as zeekay says, Twitter's REST API is a better approach for querying Twitter.
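The snippets above use the old BeautifulSoup 3 API under Python 2. With bs4 the same idea might look like the sketch below. ATOM is a hypothetical feed with made-up usernames, and html.parser is used only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical Atom-like feed; the usernames are made up for illustration.
ATOM = """
<feed>
  <entry><author><name>a</name><uri>http://twitter.com/example_one</uri></author></entry>
  <entry><author><name>b</name><uri>http://twitter.com/example_two</uri></author></entry>
</feed>
"""

soup = BeautifulSoup(ATOM, "html.parser")
# get_text() returns the tag's contents without the <uri> and </uri> tags.
uris = [uri.get_text() for uri in soup.find_all("uri")]
# The username is the last path segment of each profile URL.
usernames = [u.rsplit("/", 1)[-1] for u in uris]
print(uris)
print(usernames)
```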
