I am trying to get information from https://rosettacode.org/wiki/Category:Rascal and similar pages. The information I am interested in is in the box in the upper-right part of the page that lists details of the language, such as execution method, garbage collection, and so on. This information is contained in the following line of the page's HTML source:
<script type="8b5f853f8b614ed469e51514-">window.RLQ = window.RLQ || []; window.RLQ.push( function () {
mw.config.set({"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Rascal","wgTitle":"Rascal","wgCurRevisionId":137957,"wgRevisionId":137957,"wgArticleId":11663,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],
"wgCategories":["Execution method/Interpreted","Garbage collection/Yes","Parameter passing/By value","Typing/Safe","Typing/Strong","Typing/Expression/Partially implicit","Typing/Checking/Dynamic","Impl needed","Programming Languages"],
"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Category:Rascal"
,"wgRelevantArticleId":11663,"wgIsProbablyEditable":!0,"wgRestrictionEdit":[],"wgRestrictionMove":[],"sfgAutocompleteValues":[],"sfgAutocompleteOnAllChars":!1,"sfgFieldProperties":[],"sfgDependentFields":[],"sfgShowOnSelect":[],"sfgScriptPath":"/mw/extensions/SemanticForms","sdgDownArrowImage":"/mw/extensions/SemanticDrilldown/skins/down-arrow.png","sdgRightArrowImage":"/mw/extensions/SemanticDrilldown/skins/right-arrow.png"});mw.loader.implement("user.options",function($,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function($,jQuery){mw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\"});});mw.loader.load(["ext.smw.style","ext.smw.tooltips","mediawiki.page.startup","mediawiki.legacy.wikibits"]);
} );</script>
The part I need is "wgCategories" (shown in the middle of the code above).
I have the following code to get the page:
import requests, sys

lang_url = 'https://rosettacode.org/wiki/Category:Rascal'
rg = requests.get(lang_url)
if rg is None:
    print("Could not obtain web page.")
    sys.exit()
else:
    print("length of obtained page:", len(rg.text))

from bs4 import BeautifulSoup
What function of BeautifulSoup can I use to get this information?
Edit: I checked the BeautifulSoup docs - I can get the title, paragraphs via p, links via a and a['href'], and so on, but I cannot find a method to search inside a script element.
You can pass your requests object's content into the BeautifulSoup constructor, specifying BeautifulSoup's HTML parser, html.parser, to get it into the correct format. Then you can use BeautifulSoup's find_all() function, which takes an element tag and returns a list. See below:
import requests
r = requests.get('https://rosettacode.org/wiki/Category:Rascal')
from bs4 import BeautifulSoup as bs
soup = bs(r.content, 'html.parser')
print(soup.find_all('script'))
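To go from the full list of script tags to just the wgCategories entries, you could pick out the script whose text mentions wgCategories and cut the JSON array out of it (a minimal sketch, assuming the structure quoted in the question):

import json
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://rosettacode.org/wiki/Category:Rascal')
soup = bs(r.content, 'html.parser')

for script in soup.find_all('script'):
    text = script.string or ''
    if '"wgCategories":' in text:
        # Cut out the JSON array that follows "wgCategories":
        start = text.index('"wgCategories":') + len('"wgCategories":')
        end = text.index(']', start) + 1
        print(json.loads(text[start:end]))
        break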
Another option is to use regex, if you're into that kind of thing.
It's not BeautifulSoup, but you may want to use re for this, since HTML parsing alone will return the entire script block.
import re
wgcontent = re.findall(r'wgCategories":\[(.+?)\]', rg.text)[0].replace('"', '').split(',')
This will return a list of:
Execution method/Interpreted
Garbage collection/Yes
Parameter passing/By value
Typing/Safe
Typing/Strong
Typing/Expression/Partially implicit
Typing/Checking/Dynamic
Impl needed
Programming Languages
I'm scraping data from an e-commerce site and I need the model number of each laptop. But there are no model numbers in the div tags. I found the model number inside a script tag as "productCode". For this example it's:
"productCode":"MGND3TU/A"
How can I get the "productCode" data? I couldn't work it out from other posts.
Edit: I found the 'productCode' inside a script tag, but I don't know how to get it. You can check the page source.
Since the JSON is hidden in the <head>, it can be parsed, but with some custom logic.
Unfortunately the script tag assigns the JSON to a window variable, so we'll need to strip that before we can parse it.
1. Get the URL
2. Get all <script> tags
3. Check if PRODUCT_DETAIL_APP_INITIAL_STAT exists in the string (valid JSON)
4. Remove the prefix (hardcoded)
5. Find the index of the next key (hardcoded)
6. Remove everything after the suffix
7. Try to parse the JSON
8. Print json['product']['productCode'] if it exists
import json
import requests
from bs4 import BeautifulSoup

reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')

for sc in soup.findAll('script'):
    if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
        # Strip the hardcoded "window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=" prefix (44 chars)
        withoutBegin = sc.contents[0][44:]
        # Cut everything from the next window assignment onwards
        endIndex = withoutBegin.find('window.TYPageName=') - 1
        withoutEnd = withoutBegin[:endIndex]
        try:
            j = json.loads(withoutEnd)
            if j['product']['productCode']:
                print(j['product']['productCode'])
        except Exception:
            print("Unable to parse JSON")
            continue
Output:
MGND3TU/A
In this case BeautifulSoup is not needed, because the response can be searched directly with a regex:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
Example
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
print(json_data['product']['productCode'])
Output
MGND3TU/A
That's because those tags are generated by JavaScript. When you send a request to that URL, you get back a response containing the information (technically, JSON) that a JS script uses to build the DOM for you.
To see what your returned response actually is, either print the value of r.text (where r is returned from requests.get()) or use "view page source" in the browser (not the inspect-element panel).
Now to solve it, you can either use something that can render JS, just like your browser does, for example Selenium; the requests module is not capable of rendering JS, it is just for sending and receiving requests.
Or you can manually extract that JSON text from the returned text (using a regex, for example) and then create a Python dictionary from it.
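For the first route, here's a minimal Selenium sketch (it assumes Chrome and Selenium 4+, which can manage the driver itself; the extraction step afterwards still depends on the rendered page):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
# page_source now holds the JS-rendered DOM rather than the raw response
html = driver.page_source
driver.quit()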
I am trying to get the body text of news articles like this one:
https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html
In the source code, it can be found after "articleBody".
I've tried using bs4 (BeautifulSoup), but it looks like it cannot access the 'window' object where the article body information is. I am able to get the text by using string functions:
text = re.search('"articleBody":"(.*)","keywords"', source_code)
where source_code is a string containing the source code of the URL. However, this method looks pretty inefficient compared to the bs4 methods when the page allows them. Any advice, please?
You're right about BeautifulSoup not being able to handle window objects. In fact, you need to use Selenium for that kind of thing. Here's an example of how to do so with Python 3 (you'll have to adapt it slightly if you want it to work in Python 2):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Create a new instance of Chrome and go to the website we want to scrape
browser = webdriver.Chrome()
browser.get("http://www.elpais.com/")
time.sleep(5)  # Let the browser load

# Find the div element containing the article content
# (find_element with By.CLASS_NAME replaces the removed find_element_by_class_name API)
div = browser.find_element(By.CLASS_NAME, 'articleContent')

# Print out all the text inside the div
print(div.text)
Hope this helps!
Try:
import json
import requests
from bs4 import BeautifulSoup
url = "https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for ld_json in soup.select('[type="application/ld+json"]'):
    data = json.loads(ld_json.text)
    if "@type" in data and "NewsArticle" in data["@type"]:
        break

print(data["articleBody"])
Prints:
A una semana de que arranque Sumar ...
Or:
text = soup.select_one('[data-dtm-region="articulo_cuerpo"]').get_text(strip=True)
print(text)
I would like to know how to get data from a website.
I found a tutorial and ended up with this:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
page = requete.content
soup = BeautifulSoup(page, 'html.parser')
The tutorial says that I should use something like this to get the string of a tag:
h1 = soup.find("h1", {"class": "ico-after ico-tutorials"})
print(h1.string)
But I have a problem: the tag whose text content I want to get doesn't have a class... what should I do?
I tried passing {} but it's not working,
and {"class": ""} too.
In fact, it returns None.
I want to get the text content of this part of the website :
<div style="font-size:3em; color:#6200C5;">
Orchard</div>
Where Orchard is the random word
Thanks for any kind of help.
Unfortunately, BeautifulSoup doesn't give you many ways to point at an element like this, and the page you are trying to scrape is terribly ill-suited for your task (no IDs, classes, or other useful HTML features to point at).
Hence, you should change the way you point at the HTML element and use an XPath instead - and you can't do that with BeautifulSoup. To do that, use html from the lxml package to parse the page. Below is a code snippet (based on the answers to this question) which extracts the random word in your example.
import requests
from lxml import html
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
tree = html.fromstring(requete.content)
rand_w = tree.xpath('/html/body/center/center/table[1]/tr/td/div/text()')
print(rand_w)
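If you'd rather stay with BeautifulSoup, you can also match the div by its inline style attribute instead of a class (a sketch that assumes the style string is exactly as quoted in the question):

import requests
from bs4 import BeautifulSoup

requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
soup = BeautifulSoup(requete.content, "html.parser")

# The random word sits in a <div> with no class but a distinctive inline style
div = soup.find("div", attrs={"style": "font-size:3em; color:#6200C5;"})
if div:
    print(div.get_text(strip=True))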
I am having a problem finding a value in a soup based on text. Here is the code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text)
findit=soup.find("td", text=re.compile('Market Cap'))
This returns [], yet there absolutely is text in a 'td' tag with 'Market Cap'.
When I use
soup.find_all("td")
I get a result set which includes:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
Explanation:
The problem is that this particular tag has other child elements, so its .string value - which is checked when you apply the text argument - is None (bs4 has this documented here).
Solutions/Workarounds:
Don't specify the tag name here at all, find the text node and go up to the parent:
soup.find(text=re.compile('Market Cap')).parent.get_text()
Or, you can use find_parent() if td is not the direct parent of the text node:
soup.find(text=re.compile('Market Cap')).find_parent("td").get_text()
You can also use a "search function" to search for the td tags and see if the direct text child nodes has the Market Cap text:
soup.find(lambda tag: tag and
          tag.name == "td" and
          tag.find(text=re.compile('Market Cap'), recursive=False))
Or, if you are looking to find the following number 5:
soup.find(text=re.compile('Market Cap')).next_sibling.get_text()
You can't use a regex together with a tag name here. It just won't work; I don't know if it's a bug or by specification. I just search for the text first, and then get the parent back in a list comprehension, because searching "td" with the regex would give you the td tag itself.
Code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text, "lxml")
findit=soup.find_all(text=re.compile('Market Cap'))
findit=[x.parent for x in findit if x.parent.name == "td"]
print(findit)
Output
[<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>]
Regex is just a terrible thing to integrate into parsing code and in my humble opinion should be avoided whenever possible.
Personally, I don't like BeautifulSoup due to its lack of XPath support. What you're trying to do is the sort of thing that XPath is ideally suited for. If I were doing what you're doing, I would use lxml for parsing rather than BeautifulSoup's built in parsing and/or regex. It's really quite elegant and extremely fast:
from lxml import etree
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = etree.HTML(source)
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
FYI the above returns an lxml object rather than the text of the page source. In lxml you don't really work with the source directly, per se. If you need to return a list of the actual source for some reason, you would add something like:
print([etree.tostring(i) for i in tds_w_market_cap])
If you absolutely have to use BeautifulSoup for this task, then I'd use a list comprehension:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = bs(source, 'lxml')
tds_w_market_cap = [i for i in parsed.find_all('td') if 'Market Cap' in i.get_text()]
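A quick usage note: the BeautifulSoup version returns Tag objects, so pulling the visible text out of the matches is a one-liner:

print([td.get_text(strip=True) for td in tds_w_market_cap])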
I am using BeautifulSoup to scrape a URL, and I had the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
# Python 2
from urllib2 import urlopen
except ImportError:
from urllib.request import urlopen
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
There is also a dedicated lxml.html module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells, e.g. print their text.
    print(elem.text_content())
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    print(cell.get_text())
I can confirm that there is no XPath support within Beautiful Soup.
As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get something from an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
BeautifulSoup has a function named findNext that searches forward from the current element, so:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
The above code can imitate the following XPath:
div[class=class_value]/div[id=id_value]
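A self-contained illustration of that chain (the HTML here is hypothetical; father is simply whichever element you start from):

from bs4 import BeautifulSoup

html = """
<div id="start">anchor</div>
<div class="class_value">
  <div id="id_value"><a href="/x">link</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
father = soup.find("div", {"id": "start"})
links = father.findNext("div", {"class": "class_value"}).findNext("div", {"id": "id_value"}).findAll("a")
print(links)  # [<a href="/x">link</a>]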
from lxml import etree
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above combines the Soup object with lxml, so one can extract the value using XPath.
When you use lxml, it's all simple:

import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But when you use BeautifulSoup (BS4), it's all simple too:
first, remove "//" and "@"
second, add a star before "="
Try this magic:

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, this does not support the attribute sub-path, so I removed the "/@href" part.
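If you still need the href values, you can read the attribute off each matched tag afterwards:

hrefs = [a.get('href') for a in i_need_element]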
I've searched through their docs and it seems there is no XPath option.
Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be - no, there is no XPath parsing available.
Maybe you can try the following without XPath
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print(doc.body.div.h1.text)
print(doc.div.h1.text)
print(doc.h1.text)  # Shorter paths will be faster
print(doc.div.getChildren())
print(doc.div.getChildren('p'))
This is a pretty old thread, but there is a workaround solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the requests module to read an RSS feed and get its text content in a variable called rss_text. With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
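For instance, with a small inline feed (hypothetical content; the 'xml' parser requires lxml to be installed), the same /rss/channel/title path resolves like this:

from bs4 import BeautifulSoup

rss_text = "<rss><channel><title>Example Feed</title></channel></rss>"
rss_obj = BeautifulSoup(rss_text, 'xml')
print(rss_obj.rss.channel.title.get_text())  # Example Feed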
Use soup.find(class_='myclass').
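For example:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="myclass">hello</div>', 'html.parser')
print(soup.find(class_='myclass').get_text())  # hello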