How to parse information between {} on web page using Beautifulsoup - python

I am scraping sports odds from a web page and, as seen below, my result from a find request. The .get_text() will display -110 which is fine.
What if I wanted to get any of the numbers within the {}. How would I go about getting these values?
I purposely deleted the opening < and closing > from the first div statement down below in order for it to appear.
results = soup.find('div', attrs={'class':'op-item spread-price'})
print(results)
div class="op-item spread-price" data-op-info='{"fullgame":"-110","firsthalf":"-121","secondhalf":"-115","firstquarter":"-109","secondquarter":"","thirdquarter":"","fourthquarter":""}' data-op-overprice='{"fullgame":"-110","firsthalf":"-109","secondhalf":"-122","firstquarter":"-103","secondquarter":"","thirdquarter":"","fourthquarter":""}'>-110</div
screen

Long story short - This should show what todo
from bs4 import BeautifulSoup
html_doc = """<div class="op-item spread-price" data-op-info='{"fullgame":"-110","firsthalf":"-121","secondhalf":"-115","firstquarter":"-109","secondquarter":"","thirdquarter":"","fourthquarter":""}' data-op-overprice='{"fullgame":"-110","firsthalf":"-109","secondhalf":"-122","firstquarter":"-103","secondquarter":"","thirdquarter":"","fourthquarter":""}'>-110</div>"""
bs = BeautifulSoup(html_doc)
data = bs.find('div')
print(data['data-op-info'])
print(data['data-op-overprice'])
Based on your "provided" information
print(result['data-op-info'])
print(result['data-op-overprice'])
Based on your comment you can replace the printing by
import json
for k,v in json.loads(result['data-op-info']).items():
print(k,v)
Hope that helps, let us know

Related

Navigating html tree with beautifulsoup

I'm trying to scrape some data with beautifulsoup on python (url:http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html)
When the data is first occurrence, no problem like
titlebook = soup.find("h1")
titlebook = titlebook.text
but i want to scrape different values, further in page, like upc, price incl.tax, etc
Upc value is first and i have it running universal_product_code= soup.find("tr").find("td").text
I tried so many solutions to access the other ones (i've read beautifulsoup documentation and tried lot of things but it didn't really help me)
So my question is, how to access specific values in a tree where tags are same? I joined a screenshot of the tree to help you understand what i'm talking about
Thank you for your help
For example, if you want to find price (excluding tax), you can use string= parameter in .find and then search for text in next <td>:
import requests
from bs4 import BeautifulSoup
url = "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# get "Price (excl. tax)" from the table:
key = "Price (excl. tax)"
print("{}: {}".format(key, soup.find("th", string=key).find_next("td").text))
Prints:
Price (excl. tax): £53.74
Or: Use CSS selector:
print(soup.select_one('th:-soup-contains("Price (excl. tax)") + td').text)

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping in this following website: ivmp servers. I have trouble with scraping the number of players in the server, can someone help me? I will send the code of what I've done so far
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)
Looking at the page you provided, i can assume that the table you want to extract information from is the one with server names and ip adresses.
There are actually 4 "table" element on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > inspect on Chrome
players = soup.select_one('table#serverlist')
Now you want to get the td.
You can print all of them using :
for td in players.select("td"):
print(td)
Or you can select the one you are interested in :
players.select("td.hostname")
for example.
Hope this helps.
Looking at the structure of the page, there are a few table cells (td) with the class "players", it looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them adding only the ones we do want to an array.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
# Exclude the cells which are for sorting
if cell.get_text() != 'Players':
summary.append(cell.get_text())
print(summary)

How to get the text and URL from a link using beautifulsoup

I have the following code which prints out a list of the links for each team in a table:
import requests
from bs4 import BeautifulSoup
# Get all teams in Big Sky standings table
URL = 'https://www.espn.com/college-football/standings/_/group/20/view/fcs-i-aa'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
standings = soup.find_all('table', 'Table Table--align-right Table--fixed Table--fixed-left')
for team in standings:
team_data = team.find_all('span', 'hide-mobile')
print(team_data)
The code prints out the entire list and if I pinpoint an index such as 'print(team_data[0])', it will print out the specific link from the page.
How can I then go into that link and get the string from the URL as well as the text for the link?
For example, my code prints out the following for the first index in the list.
<span class="hide-mobile"><a class="AnchorLink" data-clubhouse-uid="s:20~l:23~t:2692" href="/college-football/team/_/id/2692/weber-state-wildcats" tabindex="0">Weber State Wildcats</a></span>
How can I pull
/college-football/team/_/id/2692/weber-state-wildcats
and
Weber State Wildcats
from the link?
Thank you for your time and if there is anything I can add for clarification, please don't hesitate to ask.
Provided that you have an html like:
<span class="hide-mobile"><a class="AnchorLink" data-clubhouse-uid="s:20~l:23~t:2692" href="/college-football/team/_/id/2692/weber-state-wildcats" tabindex="0">Weber State Wildcats</a></span>
To get the /college-football/team/_/id/2692/weber-state-wildcats:
>>> team_data.find_all('a')[0]['href']
'/college-football/team/_/id/2692/weber-state-wildcats'
To get the Weber State Wildcats:
>>> team_data.find_all('a')[0].text
'Weber State Wildcats''
In terms of the href/url, you can do something like this.
In regards to the link text, you could do something like this.
Both amount to filtering down to the target element, and then extracting the desired attribute.

While I use bs4 to parse site

I want to parse the price information in Bitmex using bs4.
(The site url is 'https://www.bitmex.com/app/trade/XBTUSD')
So, I wrote down the code like this
from bs4 import BeautifulSoup
import requests
url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
print("connected...")
else:
print("Error...")
bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html , 'lxml' )
price = soup.find_all("span", {"class": "price"})
print(price)
And, the result is like this
connected...
[]
Why '[]' poped up? and To bring the price text like '6065.5', what should I do?
The text I want to parse is
<span class="price">6065.5</span>
and the selector is
content > div > div.tickerBar.overflown > div > span.instruments.tickerBarSection > span:nth-child(1) > span.price
I just study Python, so question can seems to be odd to pro...sorry
You were pretty close. Give the following a try and see if it's more what you wanted. Perhaps the format you seeing or retrieving is not quite what you expect. Hope this is helpful.
from bs4 import BeautifulSoup
import requests
import sys
import json
url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
print("connected...")
else:
print("Error...")
sys.exit(1)
bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html , 'lxml' )
# extract the json text from the returned page
price = soup.find_all("script", {"id": "initialData"})
price = price.pop()
# parse json text
d = json.loads(price.text)
# pull out the order book and then each price listed in the order book
order_book = d['orderBook']
prices = [v['price'] for v in order_book]
print(prices)
Example output:
connected...
[6045, 6044.5, 6044, 6043.5, 6043, 6042.5, 6042, 6041.5, 6041, 6040.5, 6040, 6039.5, 6039, 6038.5, 6038, 6037.5, 6037, 6036.5, 6036, 6035.5, 6035, 6034.5, 6034, 6033.5, 6033, 6032.5, 6032, 6031.5, 6031, 6030.5, 6030, 6029.5, 6029, 6028.5, 6028, 6027.5, 6027, 6026.5, 6026, 6025.5, 6025, 6024.5, 6024, 6023.5, 6023, 6022.5, 6022, 6021.5, 6021, 6020.5]
Your problem is that the page doesn't contain those span elements in first place. If you check the response tab in your browser developer tools (press F12 in firefox) you can see that the page is composed of script tags with some code written in javascript that creates the elements dynamically when executed.
Since BeautifulSoup can't execute Javascript, you can't extract the elements directly with it. You have two alternatives:
Use something like selenium that allows you to drive a browser from python - that means javascript will be executed because you're using a real browser - however the performance suffers.
Read the javascript code, understand it and write python code to simulate it. This usually is harder but luckly for you this seem very simple for the page you want:
import requests
import lxml.html
r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
doc = lxml.html.fromstring(r.text)
data = json.loads(doc.xpath("//script[#id='initialData']/text()")[0])
As you can see the data is in json format inside the page. After loading the data variable you can use it to access the infomation you want:
for row in data['orderBook']:
print(row['symbol'], row['price'], row['side'])
Will print:
('XBTUSD', 6051.5, 'Sell')
('XBTUSD', 6051, 'Sell')
('XBTUSD', 6050.5, 'Sell')
('XBTUSD', 6050, 'Sell')

beautifulsoup doesn't show all ellements

i'm trying to parse Taobao website and get information about Goods (photo , text and link ) with BeautifulSoup.find but it doesn't find all classes.
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
def get_html(url):
r = requests.get(url)
return r.text
html=get_html(url)
soup=BeautifulSoup(html, 'lxml')
z=soup.find("div",{"class":"J_TItems"})
z-is empty.
but for example:
z=soup.find("div",{"class":"skin-box-bd"})
len(z)
Out[196]: 3
works fine
Why this approach doesn't work? What should i do to get all information about good? i am using python 2.7
So, it looks like the items you want to parse are being built dynamically by JavaScript, that's why soup.text.find("J_TItems") returns -1, i.e. there's no "J_TItems" at all in the text. What you can do is use selenium with a JS interpreter, for a headless browsing you can use PhantomJS like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
Note the items you want are inside each row defined by <div class="item4line1">, and there are 5 of them (you maybe only want the first three, because the other two are not really inside the main search, filtering that should not be difficult, a simple rows = rows[2:] do the trick):
rows = JTitems.findAll("div", attrs={"class":"item4line1"})
>>> len(rows)
5
Now notice each "Good" you mention in the question is inside a <dl class="item">, so you need to get them all in a for loop:
Goods = []
for row in rows:
for item in row.findAll("dl", attrs={"class":"item"}):
Goods.append(item)
All there's left to do is to get "photo, text and link" as you mentioned, and this can be easily done accessing each item in Goods list, by inspection you can know how to get each of the information, for examples, for picture url a simple one-line would be:
>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'

Categories

Resources