Using bs4 to parse a site - python

I want to parse the price information on BitMEX using bs4.
(The site URL is 'https://www.bitmex.com/app/trade/XBTUSD')
So I wrote the following code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")

bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html, 'lxml')
price = soup.find_all("span", {"class": "price"})
print(price)
And the result is like this:
connected...
[]
Why did '[]' pop up? And to get the price text like '6065.5', what should I do?
The text I want to parse is:
<span class="price">6065.5</span>
and the selector is
content > div > div.tickerBar.overflown > div > span.instruments.tickerBarSection > span:nth-child(1) > span.price
I've just started studying Python, so this question may seem odd to a pro... sorry.

You were pretty close. Give the following a try and see if it's more what you wanted. Perhaps the format you are seeing or retrieving is not quite what you expect. Hope this is helpful.
from bs4 import BeautifulSoup
import requests
import sys
import json

url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
    sys.exit(1)

bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html, 'lxml')

# extract the JSON text from the returned page
price = soup.find_all("script", {"id": "initialData"})
price = price.pop()

# parse the JSON text
d = json.loads(price.text)

# pull out the order book, then each price listed in it
order_book = d['orderBook']
prices = [v['price'] for v in order_book]
print(prices)
Example output:
connected...
[6045, 6044.5, 6044, 6043.5, 6043, 6042.5, 6042, 6041.5, 6041, 6040.5, 6040, 6039.5, 6039, 6038.5, 6038, 6037.5, 6037, 6036.5, 6036, 6035.5, 6035, 6034.5, 6034, 6033.5, 6033, 6032.5, 6032, 6031.5, 6031, 6030.5, 6030, 6029.5, 6029, 6028.5, 6028, 6027.5, 6027, 6026.5, 6026, 6025.5, 6025, 6024.5, 6024, 6023.5, 6023, 6022.5, 6022, 6021.5, 6021, 6020.5]
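The same extraction pattern can be checked offline on a self-contained snippet; the embedded JSON below is invented for illustration and is far smaller than the real initialData payload:

```python
from bs4 import BeautifulSoup
import json

# Stand-in for the page: a script tag whose body is JSON. The payload here
# is made up for illustration; the real initialData blob is much larger.
html = """
<html><body>
<script id="initialData">
{"orderBook": [{"symbol": "XBTUSD", "price": 6045, "side": "Sell"},
               {"symbol": "XBTUSD", "price": 6044.5, "side": "Buy"}]}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", {"id": "initialData"})
data = json.loads(script.text)
prices = [row["price"] for row in data["orderBook"]]
print(prices)  # [6045, 6044.5]
```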

Your problem is that the page doesn't contain those span elements in the first place. If you check the response tab in your browser's developer tools (press F12 in Firefox), you can see that the page is composed of script tags with JavaScript code that creates the elements dynamically when executed.
Since BeautifulSoup can't execute JavaScript, you can't extract the elements directly with it. You have two alternatives:
Use something like selenium, which allows you to drive a browser from Python - that means JavaScript will be executed, because you're using a real browser - however, performance suffers.
Read the JavaScript code, understand it, and write Python code to simulate it. This is usually harder, but luckily for you it seems very simple for the page you want:
import requests
import json
import lxml.html

r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
doc = lxml.html.fromstring(r.text)
data = json.loads(doc.xpath("//script[@id='initialData']/text()")[0])
As you can see, the data is in JSON format inside the page. After loading the data variable, you can use it to access the information you want:
for row in data['orderBook']:
    print(row['symbol'], row['price'], row['side'])
Will print:
XBTUSD 6051.5 Sell
XBTUSD 6051 Sell
XBTUSD 6050.5 Sell
XBTUSD 6050 Sell


Is there a way I can extract a list from a javascript document?

There is a website where I need to obtain the owners of an online-game item, and from research, I need to do some 'web scraping' to get this data. But the information is in JavaScript code, not an easily parseable HTML document like the ones bs4 can easily extract information from. So I need to get a variable in this JavaScript code (it contains a list of owners of the item I'm looking at) and turn it into a usable list/JSON/string I can use in my program. Is there a way I can do this? If so, how?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.rolimons.com/item/1029025').content  # the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(datas)  # prints the sections of the website content that have JavaScript
IMAGE LINK
To scrape a JavaScript variable, you can't use only BeautifulSoup; a regular expression (re) is required.
Use ast.literal_eval to convert the string representation of a dict to a dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
{'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800, 1491091200, 1491177600, 1491264000, 1491350400, 1491436800, 1491523200, 1491609600, 1491696000, 1491782400, 1491868800, 1491955200, 1492041600, 1492128000, 1492214400, 1492300800, 1492387200, 1492473600, 1492560000, 1492646400, 1492732800, 1492819200, ...}
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))
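The general pattern in both answers (locate the JavaScript assignment with a regex, then parse the right-hand side) can be demonstrated on a self-contained string; the variable contents below are invented:

```python
import re
import ast

# A made-up page body containing a JavaScript variable assignment, mimicking
# the kind of markup the answers above target.
page_text = 'var ownership_data = {"id": 1029025, "num_points": 3};'

match = re.search(r'ownership_data\s*=\s*(.*?);', page_text)
data = ast.literal_eval(match.group(1))
print(data["num_points"])  # 3
```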

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and print "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup

link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value > b').text
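That selector can be sanity-checked against a minimal stand-in for the page's markup; the div/class structure below is assumed from the selector itself, not copied from the live site:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the part of the page the selector targets.
html = '<div class="field-redshift"><div class="value"><b>0.06</b></div></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('div.field-redshift > div.value > b').text)  # 0.06
```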
If you view the page source of the URL, you will find that there are two script elements containing CDATA. But the script element you are interested in has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of the CDATA tags and the jQuery call. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json

page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
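The cleaning step can be exercised offline on a toy script body wrapped the way Drupal emits it; the JSON inside is invented for illustration:

```python
import json

# A toy script body with the CDATA markers and the jQuery.extend(...) wrapper
# that the answer strips; the settings object itself is made up.
script_text = ('<!--//--><![CDATA[//>'
               '<!--'
               'jQuery.extend(Drupal.settings,'
               '{"objectFlot": {"redshift": "0.06"}}'
               ');'
               '//--><!]]>')

cleaned = (script_text.replace('<!--//--><![CDATA[//>', '')
                      .replace('<!--', '')
                      .replace('//--><!]]>', '')
                      .replace('jQuery.extend(Drupal.settings,', '')
                      .replace(');', ''))
data = json.loads(cleaned)
print(data['objectFlot']['redshift'])  # 0.06
```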

Why is my web scraping producing HTML but won't return any text?

New coder here. I am trying to return all the earnings per share data from this website here: https://www.nasdaq.com/market-activity/stocks/csco/revenue-eps
I started off slow by just trying to return "March", and used this code:
from bs4 import BeautifulSoup
from requests import get
url = "https://www.nasdaq.com/market-activity/stocks/csco/revenue-eps"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
month = soup.find("th", {"class": "revenue-eps__cell revenue-eps__cell--rowheading"})
print(month.text)
When I run it there are no errors, but nothing is returned.
When I try running the same code but use print(month) instead, it returns the HTML from the element, which looks like the following:
<th class="revenue-eps__cell revenue-eps__cell--rowheading" scope="row"> </th>
I noticed that in the returned HTML, the text isn't inside the th. Why is that? Am I doing something wrong, or is it the site I'm trying to scrape?
The data is not embedded in the page but retrieved from an API. You can pass the company name as a parameter to get all the data directly:
import requests
import json
company = "CSCO"
r = requests.get("https://api.nasdaq.com/api/company/{}/revenue?limit=1".format(company))
print(json.loads(r.text)['data'])

Getting output 0 even though there are 25 elements with the same class

Image of the HTML
Link to the page
I am trying to see how many elements of this class there are on this page, but the output is 0. I have been using BeautifulSoup for a while but never saw such an error.
from bs4 import BeautifulSoup
import requests

result = requests.get("https://www.holonis.com/motivationquotes")
c = result.content
soup = BeautifulSoup(c, "html.parser")
samples = soup.findAll("div", {"class": "ng-scope"})
print(len(samples))
Output
0
and I want the correct output at least more than 25
This is a "dynamic" Angular-based page which needs a Javascript engine or a browser to be fully loaded. To put it differently - the HTML source code you see in the browser developer tools is not the same as you would see in the result.content - the latter is a non-rendered initial HTML of the page which does not contain the desired data.
You can use things like selenium to have the page rendered and loaded and then HTML-parse it, but why not make a direct request to the site API instead:
import requests

result = requests.get("https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page=0")
data = result.json()
for post in data["items"]:
    print(post["body"]["description"])
Post descriptions are retrieved and printed for example purposes only - the post dictionaries contain all the other relevant post data that is displayed on the web page.
Basically, result.content does not contain any divs with the ng-scope class. As stated in one of the comments, the HTML you are trying to get is added there by the JavaScript running in the browser.
I recommend the package requests-html, created by the author of the very popular requests.
You can try to use the code below to build on that.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.holonis.com/motivationquotes')
r.html.render()
To see how many ng-scope classes are there just do this:
>>> len(r.html.find('.ng-scope'))
302
I assume you want to scrape all the hrefs from the a tags that are children of the divs you gave the image to. You can obtain them this way:
from itertools import chain

divs = r.html.find('[ng-if="!isVideo"]')
link_sets = (div.absolute_links for div in divs)
>>> list(set(chain.from_iterable(link_sets)))
['https://www.holonis.com/motivationquotes/o/byiqe-ydm',
'https://www.holonis.com/motivationquotes/o/rkhv0uq9f',
'https://www.holonis.com/motivationquotes/o/ry7ra2ycg',
...
'https://www.holonis.com/motivationquotes/o/sydzfwgcz',
'https://www.holonis.com/motivationquotes/o/s1eidcdqf']
There's nothing wrong with BeautifulSoup; in fact, the result of your GET request does not contain any ng-scope text.
You can see the output here:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> result = requests.get("https://www.holonis.com/motivationquotes")
>>> c = result.content
>>>
>>> print(c)
(Verify the output yourself.)
You only have the ng-cloak class, as you can see from the regex result:
import re
regex = re.compile('ng.*')
samples = soup.findAll("div", {"class": regex})
samples
#[<div class="ng-cloak slideRoute" data-ui-view="main" fixed="400" main-content="" ng-class="'{{$state.$current.animateClass}}'"></div>]
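The regex-based class lookup can be reproduced on a tiny inline document (the markup below is made up) to show how bs4 matches a compiled pattern against each class name:

```python
from bs4 import BeautifulSoup
import re

# Tiny made-up document: one Angular-style class and one plain div.
html = '<div class="ng-cloak slideRoute"></div><div class="plain"></div>'
soup = BeautifulSoup(html, 'html.parser')
samples = soup.findAll("div", {"class": re.compile('ng.*')})
print(len(samples))         # 1
print(samples[0]['class'])  # ['ng-cloak', 'slideRoute']
```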
To get the content of that webpage, it is wise to use either their API or a browser simulator like selenium. That webpage loads its content using lazy loading: when you scroll down, you will see more content. The webpage expands its content through pagination, like https://www.holonis.com/excellenceaddiction/1. However, you can give this a go. I've created this script to parse the content displayed within the first three pages. You can always change the page range to satisfy your requirement.
from selenium import webdriver

URL = "https://www.holonis.com/excellenceaddiction/{}"
driver = webdriver.Chrome()  # if necessary, define the path to the driver
for link in [URL.format(page) for page in range(1, 4)]:
    driver.get(link)
    for items in driver.find_elements_by_css_selector('.tile-content .tile-content-text'):
        print(items.text)
driver.quit()
Btw, the above script parses the description of each post.

Scraping table contents using Requests and Beautiful Soup

Python/web scraping beginner here, so please bear with me. I'm trying to grab all the product names from this URL.
Unfortunately, nothing gets returned when I run my code. The same code works fine for most other sites, but I've tried dozens of variations and I can't make it work for this site.
Is this URL even scrapable using bs4? Any feedback is appreciated.
import bs4
import requests

url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
payload = {'q': 'Python',}
r = requests.get(url % payload)
soup = bs4.BeautifulSoup(r.text)
titles = [a.attrs.get('href') for a in soup.findAll('div.productscontainer a[href^=/prod]')]
for t in titles:
    print(t)
import bs4
import requests

url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text)
titles = [td.text for td in soup.findAll('td', attrs={'class': 'searchlist'})]
for t in titles:
    print(t)
If this formatting is correct, is the JS for sure preventing me from pulling anything?
First of all, your string formatting is likely wrong. Look at this:
>>> url = 'http://www.rakuten.com/sr/searchresults.aspx?qu'
>>> payload = {'q': 'Python',}
>>> url % payload
'http://www.rakuten.com/sr/searchresults.aspx?qu'
I guess that's not what you want. You should look up how string formatting works in Python, and then come up with a proper way to construct your URL.
Secondly, that "search engine" makes heavy use of JavaScript. You probably will not be able to retrieve the information you want just by looking at the initially retrieved HTML content.
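For reference, here is how the URL could be built without hitting the network; the %(qu)s placeholder below is a hypothetical corrected template, and with requests the usual idiom is requests.get(base, params=payload):

```python
from urllib.parse import urlencode

payload = {'qu': 'Python'}

# Old-style % formatting only substitutes if the template has a placeholder.
url = 'http://www.rakuten.com/sr/searchresults.aspx?qu=%(qu)s' % payload
print(url)  # http://www.rakuten.com/sr/searchresults.aspx?qu=Python

# urlencode builds the same query string that requests' params= would append.
print(urlencode(payload))  # qu=Python
```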
