I'm trying to use BeautifulSoup and RE to get a specific value from Yahoo Finance. I can't figure out exactly how to get it. I'll paste some code I have along with the HTML and unique selector I got.
I just want this number in here, "7.58," but the problem is that the class of this column is the same as many other ones in the same element.
<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td><td class="yfnc_tabledata1">7.58</td></tr>
Here is the selector Google gave me...
yfncsumtab > tbody > tr:nth-child(2) > td.yfnc_modtitlew1 > table:nth-child(10) > tbody > tr > td > table > tbody > tr:nth-child(8) > td.yfnc_tabledata1
Here is some template code I'm using to test different things, but I'm very new to regular expressions and can't find a way to extract the number that follows "Diluted EPS (ttm):".
from bs4 import BeautifulSoup
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
soup = BeautifulSoup(res.text, 'html.parser')
body = soup.find_all('td')
print(body)
Thanks!
You could find the cell by its text, Diluted EPS (ttm):, first, then step over to its sibling:
soup.find('td', text='Diluted EPS (ttm):').parent.find('td', attrs={'class': 'yfnc_tabledata1'})
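As a self-contained sketch, this is the same lookup run against just the table row from the question (rather than the live page, since Yahoo's markup changes):

```python
from bs4 import BeautifulSoup

# the single row from the question, standing in for the live page
html = ('<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td>'
        '<td class="yfnc_tabledata1">7.58</td></tr>')
soup = BeautifulSoup(html, 'html.parser')

# find the label cell, step up to the row, then grab the data cell
cell = soup.find('td', text='Diluted EPS (ttm):').parent.find(
    'td', attrs={'class': 'yfnc_tabledata1'})
print(cell.text)  # 7.58
```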
If you'd rather use a regex, try:
>>> import re
>>> text = '<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td><td class="yfnc_tabledata1">7.58</td>'
>>> re.findall(r'Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', text)
['7.58']
UPDATE Here is the sample code using requests and re:
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
print(re.findall(r'Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', res.text))
Output:
['7.58']
Thanks for answering my question. I was able to use two ways to get the desired value. The first way is this.
from bs4 import BeautifulSoup
import requests
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
soup = BeautifulSoup(res.text, 'html.parser')
eps = soup.find('td', text='Diluted EPS (ttm):').parent.find('td', attrs={'class': 'yfnc_tabledata1'})
for i in eps:
    print(i)
Here is the second way...
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
print(re.findall(r'Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', res.text))
I don't quite understand it all yet, but this is a great start: two different ways to get the value that I can build on as I incorporate this part of the project. Really appreciate your assistance!
Related
I am trying to get a price from a website using BeautifulSoup and so far I have managed to get:
<h2>£<!-- -->199.99</h2>
I just want to receive '£199.99'
Is there a way to filter out the letters?
Thanks in advance
You can use the get_text function with strip=True to clean up the text if necessary:
from bs4 import BeautifulSoup
html = '<h2>£<!-- -->199.99</h2>'
soup = BeautifulSoup(html,'html5lib')
result = soup.find('h2').get_text(strip=True)
print(result)
#£199.99
Use re?
import re
s = "<h2>£<!-- -->199.99</h2>"
rx_price = re.compile(r'([0-9.]+)')
content = re.sub(r'<.+?>', '', s)
print (f"£{rx_price.findall(content)[0]}")
Output:
£199.99
I am trying to extract a value in a span however the span is embedded into another. I was wondering how I get the value of only 1 span rather than both.
from bs4 import BeautifulSoup
some_price = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"})
some_price.span
# that code returns this:
'''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
# BUT I only want the $289 part, not the 99 associated with it
After making this adjustment:
some_price.span.text
the interpreter returns
$28999
Would it be possible to somehow remove the '99' at the end? Or to only extract the first part of the span?
Any help/suggestions would be appreciated!
You can access the desired value through the tag's .contents attribute:
from bs4 import BeautifulSoup as soup
html = '''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
result = soup(html, 'html.parser').find('span').contents[0]
Output:
'$289'
Thus, in the context of your original div lookup:
result = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"}).span.contents[0]
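An alternative, in case you prefer not to index into .contents: ask the outer span for only its direct text node. A minimal sketch against the snippet above:

```python
from bs4 import BeautifulSoup

html = '<span>$289<span class="rightEndPrice_6y_hS">99</span></span>'
soup = BeautifulSoup(html, 'html.parser')

# string=True with recursive=False returns only the text that is a direct
# child of the outer span, skipping the nested span's "99"
price = soup.find('span').find(string=True, recursive=False)
print(price)  # $289
```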
It is convenient to use an "index-x" anchor to quickly locate a sub-section in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
jumps to the 3rd sub-section on that page.
When I want to share the location of a sub-section with others, how can I get the index in a convenient way?
For instance, how can I get the index of the {m,n} sub-section without counting up from index-0?
With bs4 4.7.1+ you can use :has and :contains to target a specific text string and return the index. Note that select_one returns only the first match; use a list comprehension with select if you want all matches. (In newer versions of the underlying soupsieve library, :contains is spelled :-soup-contains.)
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
Any version: if you want a dictionary that maps the special characters to their indices. Thanks to @zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
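To show how the resulting dict is used, here is an offline sketch; the sample HTML is invented, shaped like the docs page's index anchors:

```python
from bs4 import BeautifulSoup

# sample markup imitating the docs page's index anchors
html = '''
<dl id="index-1"><dt><span class="pre">*</span></dt></dl>
<dl id="index-7"><dt><span class="pre">{m,n}</span></dt></dl>
'''
soup = BeautifulSoup(html, 'html.parser')
# map each anchor id to the special characters documented under it
mappings = {item['id']: [i.text for i in item.select('dt .pre')]
            for item in soup.select('[id^="index-"]')}
# invert: special character -> anchor id
indices = {i: k for (k, v) in mappings.items() for i in v}
print(indices['{m,n}'])  # index-7
```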
You're looking for index-7.
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode(), 'html.parser')
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.
I'm creating a program that scrapes all the names and abilities of characters from this website. The li tags that contain the information I need are mixed in with other li tags that are not needed.
I have tried selecting different classes, but that won't work.
Here is my code:
import bs4, requests, lxml, re, time, os
from bs4 import BeautifulSoup as soup
def webscrape():
    res = requests.get('https://www.usgamer.net/articles/15-11-2017-skyrim-guide-for-xbox-one-and-ps4-which-races-and-character-builds-are-the-best')
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    races_list = soup.find_all("li < strong")
    races_list_text = [f.text.strip() for f in races_list]
    print(races_list_text)
    time.sleep(1)

webscrape()
It is expected to print out all the races and their corresponding information.
You can use the following:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.usgamer.net/articles/15-11-2017-skyrim-guide-for-xbox-one-and-ps4-which-races-and-character-builds-are-the-best')
soup = bs(r.content, 'lxml')
#one list of tuples
race_info = [ (item.text, item.next_sibling) for item in soup.select('h2 ~ ul strong')]
# separate lists
races, abilities = zip(*[ (item.text, item.next_sibling) for item in soup.select('h2 ~ ul strong')])
A dictionary might be nicer in which case you can do
race_info = [ (item.text, item.next_sibling) for item in soup.select('h2 ~ ul strong')]
race_info = dict(race_info)
The ~ is a general sibling combinator:
The ~ combinator selects siblings. This means that the second element
follows the first (though not necessarily immediately), and both share
the same parent.
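A small offline illustration of the combinator; the HTML here is invented, loosely shaped like the article's markup:

```python
from bs4 import BeautifulSoup

html = '''
<h2>Races</h2>
<p>Some intro text.</p>
<ul><li><strong>Nord:</strong> resistant to frost.</li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# h2 ~ ul matches the ul because it follows the h2 under the same parent,
# even though the <p> sits between them
items = [s.text for s in soup.select('h2 ~ ul strong')]
print(items)  # ['Nord:']
```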
I need to know the curr_id to submit a request from Python to investing.com and extract historical data for a number of currencies/commodities. To do this I need the curr_id number, as in the example below. I'm able to extract all the scripts, but I can't figure out which script index contains curr_id, or how to extract the digits '2103' from it. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup
#URL
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
#OPEN URL
r = requests.get(url)
#PARSE THE RESPONSE
soup = BeautifulSoup(r.content, 'html.parser')
#FIND THE SCRIPT TAGS IN soup
curr_data = soup.find_all('script', {'type': 'text/javascript'})
UPDATE
I did it like this:
g_data_string = str(g_data)
if 'curr_id' in g_data_string:
    print('success')
start = g_data_string.find('curr_id') + 9
end = g_data_string.find('curr_id') + 13
print(g_data_string[start:end])
But I'm sure there is a better way to do it.
You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that would match one or more digits.
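Offline, the same pattern works against any string; the script text below is made up for illustration, standing in for the page's inline JavaScript:

```python
import re

pattern = re.compile(r"curr_id: (\d+)")

# invented stand-in for the page's inline script text
script_text = "window.histDataExcessInfo = { pairId: 0, curr_id: 2103 };"
match = pattern.search(script_text)
if match:
    # group(1) is the part matched by the parentheses: the digits only
    print(match.group(1))  # 2103
```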
You don't actually need a regex; you can get the id by extracting the value attribute from the input tag whose name is item_ID.
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.