I'm trying to grab the magnet link from the following code:
rawdata = ''' <div class="iaconbox center floatright">
<a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a> <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a> <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ 'name': 'Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK', 'extension': 'mkv', 'magnet': 'magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce' }"></div>
<a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
<a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''
Using this expression
rawdata[rawdata.find("<")+1:rawdata.find(">")]
Gives me
div class="iaconbox center floatright"
But when I try to find the magnet link
rawdata[rawdata.find("href="magnet:?")+1:rawdata.find(""")]
It gives me
' '
What I actually want it to give me
magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
It's so easy with Shell, but it has to be done with Python itself.
try rawdata[rawdata.find('href="magnet:?')+1:rawdata.find('"')]
It's better to use a regular expression.
import re
rawdata = '''your rawdata......'''
regex = re.compile('href="(.+)" class="icon16')
magnet_href = regex.search(rawdata).group(1)
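Note that (.+) is greedy and the fragment contains several href attributes, so the pattern above can capture far more than the magnet URI. A tighter sketch that anchors on the magnet: scheme (the character class is my assumption that the link itself never contains a double quote):
import re
match = re.search(r'href="(magnet:\?[^"]+)"', rawdata)
if match:
    print(match.group(1))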
First of all, as pointed out by HenryM, you need to use single quotes or escape the " to make the strings valid.
Second, find() always returns the first index at which the substring is found, so rawdata.find('"') gives you the first " in the whole string rather than the one ending the link. To fix this, pass a start index as the second argument to find() so the search for the closing quote begins inside the link.
Additionally, you need to offset the start index, because find() gives you the index where the match begins, not where the part you want begins; here you want to skip the href=" prefix but keep magnet:?. The code would look something like this (completely untested):
start = rawdata.find('href="magnet:?') + len('href="')
end = rawdata.find('"', start)
link = rawdata[start:end]
The input data is an HTML fragment. You should not be using regular expressions to parse it.
Use a parser instead. Here is a working sample using the BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
rawdata = ''' <div class="iaconbox center floatright">
<a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a> <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a> <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ 'name': 'Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK', 'extension': 'mkv', 'magnet': 'magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce' }"></div>
<a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
<a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''
soup = BeautifulSoup(rawdata, "html.parser")
print(soup.find("a", title="Torrent magnet link")["href"])
Prints:
magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
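If the title attribute turns out not to be a stable hook, selecting on the href prefix is another option (a sketch; it assumes a BeautifulSoup version recent enough to ship CSS selector support via soupsieve):
print(soup.select_one('a[href^="magnet:"]')["href"])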
Related
I am crawling a website where I need to grab the sentence starting with "Confirmed ...".
The HTML for the corresponding sentence looks like this:
<span class='text-secondary ml-2 d-none d-sm-inline-block'
title='Estimated duration between time First Seen and included in block'> | <i class='fal fa-stopwatch ml-1'></i>
Confirmed within 25 secs</span>
Using requests_html from Python I can retrieve:
r.html.find("span", containing="Confirmed ")
[<Element 'span' class=('text-secondary', 'ml-2', 'd-none', 'd-sm-inline-block') title='Estimated duration between time First Seen and included in block'>]
But for some reason, it doesn't return the rest. What am I missing?
Have you tried to find the span element using the parameter containing="Confirmed "?
Like this:
r.html.find("span", containing="Confirmed ")
I did some testing on localhost using your HTML and it does return the element: screenshot
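If the goal is the sentence itself rather than the Element object, the matched element's text attribute should hold it. A minimal sketch, assuming r is the already-fetched requests_html response from the question:
matches = r.html.find("span", containing="Confirmed ")
if matches:
    # Element.text is the rendered text content of the matched span,
    # e.g. "| Confirmed within 25 secs"
    print(matches[0].text)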
I am trying to download an image on my web page with an XPath expression.
Part of the code:
with open('stiker.png', 'wb') as file:
    file.write(driver.find_element(By.XPATH, '//div[@class "_3IfUe"]/img[crossorigin = "anonymous"]').screenshot_as_png)
Part of the page source I'm trying to download:
<div class="_3IfUe">
<img crossorigin="anonymous" src="blob:https://web.whatsapp.com/9a74a410-721b-4e8e-80f0-42d18288f480"
alt="" draggable="true" class="gndfcl4n p357zi0d ppled2lx ac2vgrno gfz4du6o r7fjleex g0rxnol2 ln8gz9je b9fczbqn i0jNr" style="visibility: visible;">
</div>
Error
SyntaxError: Failed to execute 'evaluate' on 'Document': The string '//div[@class "_3IfUe"]/img[@crossorigin = "anonymous"]' is not a valid XPath expression.
As @John Gordon pointed out in his comment, you are missing a = between @class and the value "_3IfUe" that you are trying to compare.
After fixing that, you also need an @ before the crossorigin attribute name. Otherwise, XPath thinks you are looking for a child element with that name.
It should be:
//div[@class = "_3IfUe"]/img[@crossorigin = "anonymous"]
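Plugged back into the original snippet, a sketch (it assumes driver and the By import are already set up as in the question):
corrected_xpath = '//div[@class = "_3IfUe"]/img[@crossorigin = "anonymous"]'
with open('stiker.png', 'wb') as file:
    file.write(driver.find_element(By.XPATH, corrected_xpath).screenshot_as_png)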
I am trying to crawl the real-time Bitcoin-HKD price from https://www.coinbase.com/pt-PT/price/ with Python 3.
The only way I found to locate it specifically in the HTML is via this a tag with href="/pt-PT/price/bitcoin":
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin" data-element-handle="asset-highlight-top-daily-volume" class="Link__A-eh4rrz-0 hfBqui AssetHighlight__StyledLink-sc-1srucyv-1 cbFcph" color="slate">
<h2 class="AssetHighlight__Title-sc-1srucyv-2 jmJxYl">Volume mais alto (24 h)</h2>
<div class="Flex-l69ttv-0 gaVUrq">
<img src="https://dynamic-assets.coinbase.com/e785e0181f1a23a30d9476038d9be91e9f6c63959b538eabbc51a1abc8898940383291eede695c3b8dfaa1829a9b57f5a2d0a16b0523580346c6b8fab67af14b/asset_icons/b57ac673f06a4b0338a596817eb0a50ce16e2059f327dc117744449a47915cb2.png" alt="Visite a moeda Bitcoin" aria-label="Visite a moeda Bitcoin" loading="lazy" class="AssetHighlight__AssetImage-sc-1srucyv-5 lcjcxh"/>
<div class="Flex-l69ttv-0 kvilOX">
<div class="Flex-l69ttv-0 gTbYCC">
<h3 class="AssetHighlight__SubTitle-sc-1srucyv-3 gdcBEE">Bitcoin</h3>
<p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
Here 460 728,81 HK$ is the data I want.
Thus I applied the following code:
import bs4
import urllib.request as req
url="https://www.coinbase.com/prthe ice/bitcoin/hkd"
request=req.Request(url,headers={
"user-agent":"..."
})
with req.urlopen(request) as response:
    data=response.read().decode("utf-8")
root=bs4.BeautifulSoup(data,"html.parser")
secBitcoin=root.find('a',href="/pt-PT/price/bitcoin")
realtimeCurrency=secBitcoin.find('p')
print(realtimeCurrency.string)
However, it always returns secBitcoin = None. No result matches.
The find function works just fine when I search for a 'div' tag with the class parameter.
I have also tried formats like
.find('a[href="/pt-PT/price/bitcoin"]')
But nothing works.
It's possible the page is loading the currency values after the initial page load. You could try hitting ctrl+s to save the full webpage and open that file instead of using requests. If that also doesn't work, then I'm not sure where the problem is.
And if that does work, then you'll probably need to use something like Selenium to get what you need.
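A minimal sketch of that Selenium route (it assumes Chrome with a matching driver is installed; the CSS selector is built from the HTML fragment quoted in the question and may need adjusting if the page changes):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the JavaScript-rendered prices time to appear
driver.get("https://www.coinbase.com/pt-PT/price")
# the anchor from the question, then the <p> whose class contains "Price"
price = driver.find_element(
    By.CSS_SELECTOR, 'a[href="/pt-PT/price/bitcoin"] p[class*="Price"]'
).text
print(price)
driver.quit()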
href is an attribute of an element and hence I think you cannot find it that way.
def is_a_and_href_matching(element):
    is_a = element.name == "a"
    if is_a and element.has_attr("href"):
        if element['href'] == "/pt-PT/price/bitcoin":
            return True
    return False

secBitcoins = root.find_all(is_a_and_href_matching)
for secBitcoin in secBitcoins:
    p = secBitcoin.find('p')
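As an aside, the CSS-selector form tried in the question does work if it is passed to select_one instead of find (a sketch; it still only helps once the href is actually present in the downloaded HTML):
anchor = root.select_one('a[href="/pt-PT/price/bitcoin"]')
if anchor is not None:
    print(anchor.find('p').string)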
I have the following code for parsing some HTML. I need to save the output (the HTML result) as a single line, with the escape sequences such as \n kept literally, but I'm either getting a representation I can't use from repr() because of the surrounding single quotes, or the output is written across multiple lines like this (with the escape sequences interpreted):
<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>
What I require (including the escape sequences):
<section class="prog__container">\n <span class="prog__sub">Title</span>\n <p>PEP 336 - Make None Callable</p>\n <span class="prog__sub">Description</span>\n <p>\n <p>\n <code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n </p>\n </section>
My Code
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for match in soup.findAll(['div']):
    match.unwrap()
for match in soup.findAll(['a']):
    match.unwrap()
html = soup.contents[0]
html = str(html)
html = html.splitlines(True)
html = " ".join(html)
html = re.sub(re.compile("\n"), "\\n", html)
html = repr(html)  # my current solution works, but the result is unusable
The above is my solution, but an object representation is no good, I need the string representation. How can I achieve this?
Why not just use repr?
a = """this is the first line
this is the second line"""
print(repr(a))
Or even (if I understand correctly that you want the exact output without the literal quotes):
print(repr(a).strip("'"))
Output:
'this is the first line\nthis is the second line'
this is the first line\nthis is the second line
import bs4
html = '''<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>'''
soup = bs4.BeautifulSoup(html, 'lxml')
str(soup)
out:
'<html><body><section class="prog__container">\n<span class="prog__sub">Title</span>\n<p>PEP 336 - Make None Callable</p>\n<span class="prog__sub">Description</span>\n<p>\n</p><p>\n<code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n</section></body></html>'
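If the extra <html><body> wrapper added by the lxml parser is unwanted, parsing with 'html.parser' as in the question avoids it, and repr() of the resulting string gives the escaped single-line form directly; a sketch combining the two ideas:
soup = bs4.BeautifulSoup(html, 'html.parser')
print(repr(str(soup)).strip("'"))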
There are more complex ways to output the HTML code of a document:
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.example.com')
soup = BeautifulSoup(r.read(), 'html.parser')
html = str(soup)
This will give you the HTML as one string, with lines separated by \n.
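If you then need literal \n sequences on a single line rather than real newlines, a simple replacement should be enough; a sketch of that final step:
single_line = html.replace("\n", "\\n")
print(single_line)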
How can I make a subexpression of a regular expression optional in Python, to indicate that some HTML code may or may not appear in the text?
I'm making an API for filmaffinity and want to build an RE to filter the search results, but I'm having problems.
In the HTML of some results there is a rating image, and in others there isn't. I want to add an optional subexpression to the RE so that, where the image appears, it captures the movie's rating (an integer), and where it doesn't, it returns an empty string.
For example, this is a section of a result's HTML:
<div class="mc-title">Movie Name (2012) <img src="/imgs/countries/CF.jpg" title="Country Name"></div>
<img src="http://www.filmaffinity.com/imgs/ratings/8.png" border="0" alt="Notable" > <div class="mc-director">Some Director</div>
In this other html code is not the img tag.
<div class="mc-title">Another movie name (2015) <img src="/imgs/countries/XY.jpg" title="Another Country"></div>
<div class="mc-director">Another director</div>
So... I need an RE that returns this:
>>>R=findall(expression, html_Code)
>>>print R
[('111111', 'Movie Name', '2012', '8', 'Some Director'), ('000000', 'Another Movie Name', '2015', '', 'Another director')]
Note that in the second tuple there is no rating, only a '' string.
My poor RE is this:
<div class="mc-title">([^<]*)\s*\((\d{4})\)\s*<img src="/imgs/countries/([A-Z]{2}).jpg" title="[^"]*"></div>\s*<img src="http://www.filmaffinity.com/imgs/ratings/(\d+).png" border="0" alt="\w*" ?>\s*<div class="mc-director">[^<]*</div>
For parsing HTML, I find BeautifulSoup better than using straight regular expressions. There's also PyQuery which seems nice, but I've never used it.
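To make that suggestion concrete, here is a minimal BeautifulSoup sketch against the two fragments quoted above; the class names come from the question, and the tuple layout (name, year, rating, director) is an assumption, since the numeric IDs in the expected output are not present in the HTML shown:
import re
from bs4 import BeautifulSoup

html_code = '''<div class="mc-title">Movie Name (2012) <img src="/imgs/countries/CF.jpg" title="Country Name"></div>
<img src="http://www.filmaffinity.com/imgs/ratings/8.png" border="0" alt="Notable" > <div class="mc-director">Some Director</div>
<div class="mc-title">Another movie name (2015) <img src="/imgs/countries/XY.jpg" title="Another Country"></div>
<div class="mc-director">Another director</div>'''

soup = BeautifulSoup(html_code, "html.parser")
results = []
for title_div in soup.find_all("div", class_="mc-title"):
    # "Movie Name (2012)" -> name and year
    m = re.match(r"(.+?)\s*\((\d{4})\)", title_div.get_text(strip=True))
    name, year = m.groups() if m else ("", "")
    rating, director = "", ""
    # Walk the following siblings until the matching mc-director div,
    # picking up a rating image along the way if one exists.
    for sibling in title_div.find_next_siblings():
        if sibling.name == "img" and "/imgs/ratings/" in sibling.get("src", ""):
            rating = re.search(r"(\d+)\.png", sibling["src"]).group(1)
        elif sibling.name == "div" and "mc-director" in sibling.get("class", []):
            director = sibling.get_text(strip=True)
            break
    results.append((name, year, rating, director))

print(results)
# [('Movie Name', '2012', '8', 'Some Director'), ('Another movie name', '2015', '', 'Another director')]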