I am crawling a website where I need to grab the sentence starting with "Confirmed ...".
The HTML for the corresponding sentence looks like this:
<span class='text-secondary ml-2 d-none d-sm-inline-block'
title='Estimated duration between time First Seen and included in block'> | <i class='fal fa-stopwatch ml-1'></i>
Confirmed within 25 secs</span>
Using requests_html in Python I can retrieve:
r.html.find("span", containing="Confirmed ")
[<Element 'span' class=('text-secondary', 'ml-2', 'd-none', 'd-sm-inline-block') title='Estimated duration between time First Seen and included in block'>]
But for some reason it doesn't return the rest of the content, i.e. the "Confirmed within 25 secs" text. What am I missing?
Have you tried finding the span element using the containing parameter with the string "Confirmed "?
Like this:
r.html.find("span", containing="Confirmed ")
I did some testing locally using your HTML and it does return the element: screenshot
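If the goal is the sentence itself rather than the Element object, the text is available on the returned element. A minimal sketch, assuming the r from the question:
# Hedged sketch: pull the inner text out of the matched span (assumes `r` from the question).
spans = r.html.find("span", containing="Confirmed ")
if spans:
    print(spans[0].text)  # e.g. something like "| Confirmed within 25 secs"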
I am trying to crawl the real-time Bitcoin-HKD price from https://www.coinbase.com/pt-PT/price/ with Python 3.
The only way I found to locate it specifically in the HTML is by the a tag with href="/pt-PT/price/bitcoin":
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin" data-element-handle="asset-highlight-top-daily-volume" class="Link__A-eh4rrz-0 hfBqui AssetHighlight__StyledLink-sc-1srucyv-1 cbFcph" color="slate">
<h2 class="AssetHighlight__Title-sc-1srucyv-2 jmJxYl">Volume mais alto (24 h)</h2>
<div class="Flex-l69ttv-0 gaVUrq">
<img src="https://dynamic-assets.coinbase.com/e785e0181f1a23a30d9476038d9be91e9f6c63959b538eabbc51a1abc8898940383291eede695c3b8dfaa1829a9b57f5a2d0a16b0523580346c6b8fab67af14b/asset_icons/b57ac673f06a4b0338a596817eb0a50ce16e2059f327dc117744449a47915cb2.png" alt="Visite a moeda Bitcoin" aria-label="Visite a moeda Bitcoin" loading="lazy" class="AssetHighlight__AssetImage-sc-1srucyv-5 lcjcxh"/>
<div class="Flex-l69ttv-0 kvilOX">
<div class="Flex-l69ttv-0 gTbYCC">
<h3 class="AssetHighlight__SubTitle-sc-1srucyv-3 gdcBEE">Bitcoin</h3>
<p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
Here 460 728,81 HK$ is the data I want.
So I applied the following code:
import bs4
import urllib.request as req

url = "https://www.coinbase.com/price/bitcoin/hkd"
request = req.Request(url, headers={
    "user-agent": "..."
})
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")

root = bs4.BeautifulSoup(data, "html.parser")
secBitcoin = root.find('a', href="/pt-PT/price/bitcoin")
realtimeCurrency = secBitcoin.find('p')
print(realtimeCurrency.string)
However, it always returns secBitcoin = None. No result matches.
The find function works just fine when I search for a div tag with the class parameter.
I have also tried a selector format like
.find('a[href="/pt-PT/price/bitcoin"]')
But nothing works.
It's possible the page is loading the currency values after the initial page load. You could try hitting Ctrl+S to save the full webpage and parse that file instead of the live response. If that also doesn't work, then I'm not sure where the problem is.
And if that does work, then you'll probably need to use something like Selenium to get what you need.
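A rough sketch of that Selenium route, assuming chromedriver is installed and reusing the selector from the question:
# Hedged sketch: render the page with Selenium, then parse it with BeautifulSoup.
# Assumes chromedriver is on PATH; the href selector comes from the question.
import bs4
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.coinbase.com/pt-PT/price/")
data = driver.page_source  # HTML after the JavaScript has run
driver.quit()

root = bs4.BeautifulSoup(data, "html.parser")
secBitcoin = root.find('a', href="/pt-PT/price/bitcoin")
if secBitcoin:
    print(secBitcoin.find('p').string)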
href is an attribute of the element, and hence I think you cannot find it that way. You can pass a matching function to find_all instead:
def is_a_and_href_matching(element):
    is_a = element.name == 'a'
    if is_a and element.has_attr('href'):
        if element['href'] == "/pt-PT/price/bitcoin":
            return True
    return False

secBitcoins = root.find_all(is_a_and_href_matching)
for secBitcoin in secBitcoins:
    p = secBitcoin.find('p')
    print(p.string)
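For what it's worth, the selector syntax from the question is close; in BeautifulSoup it goes through select_one rather than find. A small sketch, assuming the root object from the question:
# Sketch: CSS attribute selector via select_one (assumes `root` has been parsed as above).
secBitcoin = root.select_one('a[href="/pt-PT/price/bitcoin"]')
if secBitcoin:
    print(secBitcoin.find('p').string)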
<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>,
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>,
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>,
I currently have the above HTML elements in a list. I wish to use Python to output the following values and append them to a list:
10
3
18
3
24
I would recommend using Beautiful Soup, which is a very popular HTML parsing module well suited for this kind of thing. If each element has a title attribute, then you could do something like this:
from bs4 import BeautifulSoup
import requests

def randomFacts(url):
    r = requests.get(url)
    bs = BeautifulSoup(r.content, 'html.parser')
    title = bs.find_all('div')
    for each in title:
        print(each['title'])
Beautiful Soup is my normal go-to for HTML parsing, hope this helps.
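Since the goal was to append the values to a list, here is a small sketch that parses the sample divs from the question directly from a string and collects the titles (no network call; the variable names are illustrative):
from bs4 import BeautifulSoup

# The sample divs from the question as one HTML string.
html = '''<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div>
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>'''

soup = BeautifulSoup(html, 'html.parser')
# title=True matches only divs that actually carry a title attribute.
titles = [div['title'] for div in soup.find_all('div', title=True)]
print(titles)  # ['10', '3', '18', '3', '24']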
Here are 3 possibilities. In the first 2 versions we make sure the class checks out before appending the value to the list, just in case there are other divs that you don't want to include. In the third method there isn't really a good way to do that. Unlike adrianp's method of splitting, mine doesn't care where the title attribute appears.
The third method may be a bit confusing, so allow me to explain it. First we split everywhere that title=" appears. We drop the first index of that list because it is everything before the first title. We then loop over the remainder and split on the first quote; the number you want is now in the first index of that split. We do an inline pop to get that value so we can keep everything in a list comprehension, instead of expanding the entire loop and pulling the values out with specific indexes.
To load the html remotely, uncomment the commented html var and replace "yourURL" with the proper one for you.
I think I have given you every possible way of doing this - certainly the most obvious ones.
from bs4 import BeautifulSoup
import re, requests
html = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>'
#html = requests.get(yourURL).content
# possibility 1: BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# assumes that all bb-fl classed divs have a title and all divs have a class
# you may need to disassemble this generator and add some extra checks
bs_titleval = [div['title'] for div in soup.find_all('div') if 'bb-fl' in div['class']]
print(bs_titleval)
# possibility 2: Regular Expressions ~ not the best way to go
# this isn't going to work if the tag attribute signature changes
title_re = re.compile('<div class="bb-fl" style="[^"]*" title="([0-9]+)">', re.I)
re_titleval = [m.group(1) for m in title_re.finditer(html)]
print(re_titleval)
# possibility 3: String Splitting ~
# probably the best method if there is nothing extra to weed out
title_sp = html.split('title="')
title_sp.pop(0) # get rid of first index
# title_sp is now ['10"></div>...', '3"></div>...', '18"></div>...', etc...]
sp_titleval = [s.split('"').pop(0) for s in title_sp]
print(sp_titleval)
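With the sample markup above, all three versions should print ['10', '3', '18', '3', '24'].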
Assuming that each div is saved as a string in the variable div, you can do the following:
number = div.split()[3].split('"')[1]
Each div string needs to be in the same format for this to work.
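For example, with the first div from the question (the div variable here is just for illustration):
div = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>'
# split() gives ['<div', 'class="bb-fl"', 'style="background:Tomato;width:0.63px"', 'title="10"></div>'];
# index 3 is the title chunk, and splitting it on '"' leaves the value at index 1.
number = div.split()[3].split('"')[1]
print(number)  # 10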
I have an HTML page that I want to scrape some data from. The HTML looks like this:
<p class="provice hidden-xs">
<span class="provice-mobile">NEW YORK</span>
whitespace
<span class="provice-mobile" style="color: #8888 !important">UNION</span>
</p>
I just want to choose "NEW YORK", and I tried this code:
city = soup.find('span', attrs={'class':'provice-mobile'})
city.text also includes "UNION", but I just want the span tag whose only attribute is:
'class': 'provice-mobile'
If I understand your question correctly, you are looking for the span tags whose only attribute is class="provice-mobile". I suggest you start by finding all the tags that have that attribute and afterwards filter out the ones that have more than that one attribute, i.e. keep only the tags with a single attribute.
The code to accomplish this could look like this:
results = soup.find_all('span', attrs={'class': 'provice-mobile'})
results = [tag for tag in results if len(tag.attrs) == 1]
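A quick check against the markup from the question (the html variable is just the sample above):
from bs4 import BeautifulSoup

html = '''<p class="provice hidden-xs">
<span class="provice-mobile">NEW YORK</span>
<span class="provice-mobile" style="color: #8888 !important">UNION</span>
</p>'''

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('span', attrs={'class': 'provice-mobile'})
# Keep only the spans with a single attribute (the class), dropping the styled one.
results = [tag for tag in results if len(tag.attrs) == 1]
print([tag.text for tag in results])  # ['NEW YORK']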
I am trying to grab the magnet link from the following HTML:
rawdata = ''' <div class="iaconbox center floatright">
<a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a> <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a> <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ 'name': 'Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK', 'extension': 'mkv', 'magnet': 'magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce' }"></div>
<a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
<a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''
Using this command
rawdata[rawdata.find("<")+1:rawdata.find(">")]
Gives me
div class="iaconbox center floatright"
But when I try to find the magnet link
rawdata[rawdata.find("href="magnet:?")+1:rawdata.find(""")]
It gives me
' '
What I actually want it to give me
magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
It's so easy with Shell, but it has to be done with Python itself.
try rawdata[rawdata.find('href="magnet:?')+1:rawdata.find('"')]
It's better to use a regular expression.
import re
rawdata = '''your rawdata......'''
regex = re.compile('href="(.+)" class="icon16')
magnet_href = regex.search(rawdata).group(1)
First of all, as pointed out by HenryM, you need to use single quotes or escape the double quotes to make the strings valid.
Second, find() always returns the first index of the substring it finds, so rawdata.find('"') gives you the first " in the whole string, not the one ending the link. To fix this, pass a start position to find() so the search begins after the link starts.
Additionally, you need to skip over the href=" prefix so the slice starts at magnet:? rather than at the beginning of the match. The code would look something like this (completely untested):
start = rawdata.find('href="magnet:?') + len('href="')
end = rawdata.find('"', start)
link = rawdata[start:end]
The input data is an HTML fragment. You should not be using regular expressions to parse it.
Use a parser instead. Here is a working sample using the BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
rawdata = ''' <div class="iaconbox center floatright">
<a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a> <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a> <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ 'name': 'Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK', 'extension': 'mkv', 'magnet': 'magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce' }"></div>
<a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
<a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''
soup = BeautifulSoup(rawdata, "html.parser")
print(soup.find("a", title="Torrent magnet link")["href"])
Prints:
magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
I have two numbers (NUM1, NUM2) that I am trying to extract across webpages that all have the same format:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
NUM1 and NUM2 are always followed by the same text across webpages
</div>
I am thinking that regex might be the way to go for these particular fields. Here's my attempt (borrowed from various sources):
def nums(self):
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(self)
    nums_text = nums_match.group(0)
    digits = [int(s) for s in re.findall(r'\d+', nums_text)]
    return digits
By itself, outside of a function, this code works when specifying the actual source of the text (e.g., nums_regex.search(text)). However, I am modifying another person's code, and I have never really worked with classes or functions before. Here's an example of their code:
@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()
As you might have guessed, my code isn't working. I get the error:
nums_match = nums_regex.search(self)
TypeError: expected string or buffer
It looks like I'm not feeding in the original text correctly, but how do I fix it?
You can use the same regular expression pattern both to find the elements with BeautifulSoup's "by text" search and to extract the desired numbers:
import re

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())
Note that, since you are trying to match a part of the text and not anything related to the HTML structure, I think it's pretty much okay to just apply your regular expression to the complete document instead.
Complete working code samples below.
With BeautifulSoup regex/"by text" search:
import re
from bs4 import BeautifulSoup
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())
Regex-only search:
import re
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data)) # prints [('10', '20')]