Extract number from a website using beautifulsoup?

Extract number from a website using beautifulsoup? - python

The following python code:
from bs4 import BeautifulSoup
div = '<div class="hm"><span class="xg1">查看:</span> 15660<span class="pipe">|</span><span class="xg1">回复:</span> 435</div>'
soup = BeautifulSoup(div, "lxml")
hm = soup.find("div", {"class": "hm"})
print(hm)
The output that i want two number in this case:
15660
435
I want to try to extract the numbers from the website using beautifulsoup. But I do not know how to do it?

Call soup.find_all, with a regex -
>>> list(map(str.strip, soup.find_all(text=re.compile(r'\b\d+\b'))))
Or,
>>> [x.strip() for x in soup.find_all(text=re.compile(r'\b\d+\b'))]
['15660', '435']
If you need integers instead of strings, call int inside the list comprehension -
>>> [int(x.strip()) for x in soup.find_all(text=re.compile(r'\b\d+\b'))]
[15660, 435]

Related

is there any convenient way to get a index of sub section in a page?

it is convenient to use "index-x" to quick locate a sub section in a page.
for instance
https://docs.python.org/3/library/re.html#index-2
gives 3rd sub-section in this page.
when i want to share the location of a sub-section to others, how to get the index in a convenient way?
for instance, how to get the index of {m,n} sub-section without counting from index-0?

With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index (note that using select_one will return first match. Use a list comprehension and select if want to return all matches
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
Any version: if you want a dictionary that maps special characters to indices. Thanks to #zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}

You're looking for index-7.
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode())
result = [t['id'] for t in soup.find_all(id=re.compile('index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.

Python - Beautiful Soup - How to filter the extracted data for keywords?

I want to scrape the data of websitses using Beautiful Soup and requests, and I've come so far that I've got the data I want but now I want to filter it:
from bs4 import BeautifulSoup
import requests
url = "website.com"
keyword = "22222"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for article in soup.find_all('a'):
for a in article:
if article.has_attr('data-variant-code'):
print(article.get("data-variant-code"))
Let's say this prints the following:
11111
22222
33333
How can I filter this so it only returns me the "22222"?

assuming that article.get("data-variant-code") prints 11111, 22222, 33333,
you can simply use an if statement:
for article in soup.find_all('a'):
for a in article:
if article.has_attr('data-variant-code'):
x = article.get("data-variant-code")
if x == '22222':
print(x)

if you want to print the 2nd group of chars in a string delimited by space, then you can split the string using space as delimiter. This will give you a list of strings then access the 2nd item of the list.
For example:
print(article.get("data-variant-code").split(" ")[1])
result: 22222

Convert a BeautifulSoup ResultSet into a list of strings

I'm trying to scrape the details of the reviews from here into a CSV using Python. Each movie has a star rating, which is denoted by an image, having a class('icon-star-fill' , or 'icon-star-half'). I'm trying to write a function to assign a numerical value.
The code that I have so far is returning a bs4.element.ResultSet, with each element a Tag
[<i class="icon-star-full"></i>, <i class="icon-star-full"></i>]
I want to convert that into a list of strings, like
["<i class="icon-star-full"></i>", "<i class="icon-star-full"></i>"]
I've tried soup_obj.text, soup_obj.content, and they're returning empty strings.
This is my code
from bs4 import BeautifulSoup
import requests
result = requests.get(url='http://www.rogerebert.com/reviews')
result_content = result.content
soup_obj = BeautifulSoup(result_content, 'html5lib')
wrapper_class = soup_obj.find('div', id='review-list')
for x in wrapper_class.find_all('figure'):
convoluted_rating = x.find('span', class_='star-rating').find_all('i')
print convoluted_rating
I've seen this and it returns an array with None, like so
[None,None]

You can iterate over the ResultSet and call tag.prettify:
tags = []
for x in wrapper_class.find_all('figure'):
tags.extend(
(i.prettify() for i in x.find('span', class_='star-rating').find_all('i'))
)
print(tags)
['<i class="icon-star-full">\n</i>\n',
'<i class="icon-star-full">\n</i>',
'<i class="icon-star-full">\n</i>\n',
...
]

Hashtags python html

I want to extract all the hashtags from a given website:
For example, "I love #stack overflow because #people are very #helpful!"
This should pull the 3 hashtags into a table.
In the website I am targeting there is a table with a #tag description
So we can find #love this hashtag speaks about love
This is my work:
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the
website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup
format
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
s = soup.get_text()
import re
re.findall("#(\w+)", s)
I have an issues in the output :
The first one is that the output look like this :
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The output concatenate the Hashtag with the first word in the description. If I compare to the example I evoked before the output is 'lovethis'.
How can I do to extract only the one word after the hashtag.
Thank you

I think there's no need to use regex to parse the text you get from the page, you can use BeautifulSoup itself for that. I'm using Python3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice all hashtags in the table have td tag and id attribute = tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
import re
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location">#AASTGrandRounds</td>
So we see that what we really want is the value of title attribute of a:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To proceed to get a list of all hashtags using list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used with list comprehension syntax, the line above is similar to this:
>>> lst = []
>>> for hashtag in hashtags:
lst.append(hashtag.a['title'])
lst then is the desired output, see the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']

Find and extract curr_id number from Investing

I need to know the curr_id to submit using python to investing.com and extract historic data for a number of currencies/commodities. To do this I need the curr_id number. As in the example bellow. I'm able to extract all scripts. But then I cannot figure out how to find the correct script index that contains curr_id and extract the digits '2103'. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
#URL
url='http://www.investing.com/currencies/usd-brl-historical-data'
#OPEN URL
r = requests.get(url)
#DETERMINE FORMAT
soup=BeautifulSoup(r.content,'html.parser')
#FIND TABLE WITH VALUES IN soup
curr_data = soup.find_all('script', {'type':'text/javascript'})'
UPDATE
I did it like this:
g_data_string=str(g_data)
if 'curr_id' in g_data_string:
print('success')
start = g_data_string.find('curr_id') + 9
end = g_data_string.find('curr_id')+13
print(g_data_string[start:end])
But I`m sure there is a better way to do it.

You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that would match one or more digits.

You don't actually need a regex, you can get the id from by extracting the value attribute from the input tag with the name=item_ID
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract number from a website using beautifulsoup? - python

Related

is there any convenient way to get a index of sub section in a page?

Python - Beautiful Soup - How to filter the extracted data for keywords?

Convert a BeautifulSoup ResultSet into a list of strings

Hashtags python html

Find and extract curr_id number from Investing

Categories

Resources