I use lxml with cssselect for parsing HTML files in Python, something like this:
from lxml.html import parse
page = parse('http://.../').getroot()
img = page.cssselect('div.photo cover div.outer a') # problem
But I have a problem: there are spaces in the class names in the HTML:
<div class="photo cover"><div class=outer>
<a href=...
Without the spaces everything is OK. How can I parse it (I can't edit the HTML)?
To match a div that has both the photo and cover classes, use div.photo.cover:
img = page.cssselect('div.photo.cover div.outer a')
Instead of thinking of class="photo cover" as a class attribute with the single value photo cover, think of it as a class attribute with two values, photo and cover.
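As a quick self-contained check against the snippet from the question (the markup below is reconstructed from it; only the selector matters):

from lxml.html import fromstring

doc = fromstring('<div class="photo cover"><div class="outer"><a href="#">x</a></div></div>')
# .photo.cover chains the two classes, so it matches class="photo cover"
print(doc.cssselect('div.photo.cover div.outer a'))  # [<Element a at ...>]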
I'm working on a little jig that generates a static gallery page based on a folder full of images. My current hangup is generating the HTML itself.
I used Airium to reverse-translate my existing HTML to Airium's Python code, and added the variables I want to modify for each anchor tag in a loop. But I can't for the life of me figure out how to get it to let me add 'thumblink'. I'm not sure why it's treating it so differently from the others; my guess is that Airium expects foo:bar but not foo:bar(xyz), with xyz being the only part I want to pull out and modify.
from airium import Airium

imagelink = "image name here"    # after pulling image filename from list
thumblink = "thumb link here"    # after resizing image to thumb size
artistname = "artist name here"  # after extracting artist name from filename

a = Airium()
with a.a(href='javascript:void(0);', **{'data-image': imagelink}):
    with a.div(klass='imagebox', style='background-image:url(images/2015-12-29kippy.png)'):
        a.div(klass='artistname', _t=artistname)

html = str(a)  # cast to string
print(html)    # print to console
where "images/2015-12-29kippy.png" is what I'd replace with string variable "thumblink".
image and artist do translate correctly in the output after testing -
<a href="javascript:void(0);" data-image="image name here">
<div class="imagebox" style="background-image:url(images/2015-12-29kippy.png)">
<div class="artistname">artist name here</div>
</div>
</a>
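Since the style keyword is just an ordinary string argument (the question's own code suggests as much), one way to get thumblink in is to build the value with an f-string. A minimal sketch, not a confirmed Airium idiom; the variable values are placeholders from the question:

from airium import Airium

imagelink = "image name here"
thumblink = "thumb link here"
artistname = "artist name here"

a = Airium()
with a.a(href='javascript:void(0);', **{'data-image': imagelink}):
    # Interpolate the thumb path into the style string:
    with a.div(klass='imagebox', style=f'background-image:url({thumblink})'):
        a.div(klass='artistname', _t=artistname)
print(str(a))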
I am trying to crawl the real-time Bitcoin-HKD price from https://www.coinbase.com/pt-PT/price/ with Python 3.
The only way I found to locate it specifically in the HTML is by this a tag with href="/pt-PT/price/bitcoin":
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin" data-element-handle="asset-highlight-top-daily-volume" class="Link__A-eh4rrz-0 hfBqui AssetHighlight__StyledLink-sc-1srucyv-1 cbFcph" color="slate">
<h2 class="AssetHighlight__Title-sc-1srucyv-2 jmJxYl">Volume mais alto (24 h)</h2>
<div class="Flex-l69ttv-0 gaVUrq">
<img src="https://dynamic-assets.coinbase.com/e785e0181f1a23a30d9476038d9be91e9f6c63959b538eabbc51a1abc8898940383291eede695c3b8dfaa1829a9b57f5a2d0a16b0523580346c6b8fab67af14b/asset_icons/b57ac673f06a4b0338a596817eb0a50ce16e2059f327dc117744449a47915cb2.png" alt="Visite a moeda Bitcoin" aria-label="Visite a moeda Bitcoin" loading="lazy" class="AssetHighlight__AssetImage-sc-1srucyv-5 lcjcxh"/>
<div class="Flex-l69ttv-0 kvilOX">
<div class="Flex-l69ttv-0 gTbYCC">
<h3 class="AssetHighlight__SubTitle-sc-1srucyv-3 gdcBEE">Bitcoin</h3>
<p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
Here 460 728,81 HK$ is the data I want.
So I applied the following code:
import bs4
import urllib.request as req

url = "https://www.coinbase.com/price/bitcoin/hkd"
request = req.Request(url, headers={
    "user-agent": "..."
})
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")

root = bs4.BeautifulSoup(data, "html.parser")
secBitcoin = root.find('a', href="/pt-PT/price/bitcoin")
realtimeCurrency = secBitcoin.find('p')
print(realtimeCurrency.string)
However, it always returns secBitcoin = None; no result matches.
The find function works just fine when I search for 'div' tags with the class parameter.
I have also tried a selector-style format like
.find('a[href="/pt-PT/price/bitcoin"]')
but nothing works.
It's possible the page is loading the currency values after the initial page load (i.e., with JavaScript). You could try hitting Ctrl+S to save the full webpage and open that file instead of using requests. If that also doesn't work, then I'm not sure where the problem is.
And if that does work, then you'll probably need to use something like Selenium to get what you need.
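A minimal sketch of the Selenium route, assuming a Chrome driver is installed and on PATH (the selector and URL come from the question; everything else is an assumption):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://www.coinbase.com/pt-PT/price/")
# The anchor and the <p> holding the price are taken from the question's HTML.
anchor = driver.find_element(By.CSS_SELECTOR, 'a[href="/pt-PT/price/bitcoin"]')
print(anchor.find_element(By.TAG_NAME, 'p').text)
driver.quit()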
href is an attribute of an element and hence I think you cannot find it that way.
def is_a_and_href_matching(element):
    is_a = element.name == 'a'
    if is_a and element.has_attr('href'):
        if element['href'] == "/pt-PT/price/bitcoin":
            return True
    return False

secBitcoins = root.find_all(is_a_and_href_matching)
for secBitcoin in secBitcoins:
    p = secBitcoin.find('p')
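Since find_all accepts any callable, the same matcher can also be written inline as a lambda (an equivalent sketch):

secBitcoins = root.find_all(
    lambda el: el.name == 'a' and el.get('href') == "/pt-PT/price/bitcoin"
)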
<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>,
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>,
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>,
I currently have the above HTML code in a list. I wish to use Python to output the following values and append them to a list:
10
3
18
3
24
I would recommend Beautiful Soup, a very popular HTML parsing module that is well suited for this kind of thing. If each element has a title attribute, then you could do something like this:
from bs4 import BeautifulSoup
import requests

def randomFacts(url):
    r = requests.get(url)
    bs = BeautifulSoup(r.content, 'html.parser')
    title = bs.find_all('div')
    for each in title:
        print(each['title'])
Beautiful Soup is my normal go-to for HTML parsing; hope this helps.
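If the divs are already strings in a list, as in the question, requests isn't needed at all; a minimal sketch (the list below is truncated from the question's):

from bs4 import BeautifulSoup

divs = ['<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>',
        '<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>']

# Parse each string and read the title attribute of its div.
titles = [BeautifulSoup(d, 'html.parser').div['title'] for d in divs]
print(titles)  # ['10', '3']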
Here are 3 possibilities. In the first 2 versions we make sure the class checks out before appending it to the list - just in case there are other divs that you don't want to include. In the third method there isn't really a good way to do that. Unlike adrianp's method of splitting, mine doesn't care where the title is.
The third method may be a bit confusing, so allow me to explain it. First we split everywhere that title=" appears. We dump the first index of that list because it is everything before the first title. We then loop over the remainder and split on the first quote. Now the number you want is in the first index of that split. We do an inline pop to get that value so we can keep everything in a list comprehension, instead of expanding the entire loop and wrestling the values out with specific indexes.
To load the html remotely, uncomment the commented html var and replace "yourURL" with the proper one for you.
I think I have given you every possible way of doing this - certainly the most obvious ones.
from bs4 import BeautifulSoup
import re, requests
html = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>'
#html = requests.get(yourURL).content
# possibility 1: BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# assumes that all bb-fl classed divs have a title and all divs have a class
# you may need to disassemble this generator and add some extra checks
bs_titleval = [div['title'] for div in soup.find_all('div') if 'bb-fl' in div['class']]
print(bs_titleval)
# possibility 2: Regular Expressions ~ not the best way to go
# this isn't going to work if the tag attribute signature changes
title_re = re.compile('<div class="bb-fl" style="[^"]*" title="([0-9]+)">', re.I)
re_titleval = [m.group(1) for m in title_re.finditer(html)]
print(re_titleval)
# possibility 3: String Splitting ~
# probably the best method if there is nothing extra to weed out
title_sp = html.split('title="')
title_sp.pop(0) # get rid of first index
# title_sp is now ['10"></div>...', '3"></div>...', '18"></div>...', etc...]
sp_titleval = [s.split('"').pop(0) for s in title_sp]
print(sp_titleval)
Assuming that each div is saved as a string in the variable div, you can do the following:
number = div.split()[3].split('"')[1]  # the value between the quotes of title="10"
Each div should be in the same format for this to work.
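To see why that works, here is the first div from the question taken apart step by step (a quick check, not part of the original answer):

div = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>'
parts = div.split()              # whitespace split; parts[3] == 'title="10"></div>'
number = parts[3].split('"')[1]  # the value between the first pair of quotes
print(number)  # 10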
I have an XML file that looks like:
<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->
<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">
<!-- The crowd-classifier element will create a tool for the Worker to
select the correct answer to your question.
Your image file URLs will be substituted for the "image_url" variable below
when you publish a batch with a CSV input file containing multiple image file URLs.
To preview the element with an example image, try setting the src attribute to
"https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier
    src="https://someone#example.com/abcd.jpg"
    categories="['Yes', 'No']"
    header="abcd"
    name="image-contains">

    <!-- Use the short-instructions section for quick instructions that the Worker
         will see while working on the task. Including some basic examples of
         good and bad answers here can help get good results. You can include
         any HTML here. -->

    <short-instructions>
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->
I want to extract the line:
src = https://someone#example.com/abcd.jpg
and assign it to a variable in Python.
I'm a bit new to XML parsing. I tried:
hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']
Error:
image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']
TypeError: string indices must be integers
If I don't access the ['crowd-image-classifier'] in code and limit myself to
hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']
then I get the complete XML file back as a string.
How do I access that image src?
You can use BeautifulSoup. See the working code below.
from bs4 import BeautifulSoup
html = '''<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->
<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">
<!-- The crowd-classifier element will create a tool for the Worker to
select the correct answer to your question.
Your image file URLs will be substituted for the "image_url" variable below
when you publish a batch with a CSV input file containing multiple image file URLs.
To preview the element with an example image, try setting the src attribute to
"https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier
    src="https://someone#example.com/abcd.jpg"
    categories="['Yes', 'No']"
    header="abcd"
    name="image-contains">

    <!-- Use the short-instructions section for quick instructions that the Worker
         will see while working on the task. Including some basic examples of
         good and bad answers here can help get good results. You can include
         any HTML here. -->

    <short-instructions>
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->'''
soup = BeautifulSoup(html, 'html.parser')
element = soup.find('crowd-image-classifier')
print(element['src'])
output
https://someone#example.com/abcd.jpg
I switched to using xml.etree.ElementTree.
The syntax I got is somewhat similar to:
import xml.etree.ElementTree as ET

root = ET.fromstring(hit_doc)
for child in root:
    if child[0].text == 'crowd-image-classifier':
        image_data = child[1].text
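For reference, a minimal ElementTree sketch that reads src as an attribute rather than text content. It assumes the snippet is reduced to well-formed XML (the closed short-instructions tag is my addition; the comments outside the root are dropped):

import xml.etree.ElementTree as ET

xml_doc = '''<crowd-form answer-format="flatten-objects">
  <crowd-image-classifier
      src="https://someone#example.com/abcd.jpg"
      categories="['Yes', 'No']"
      header="abcd"
      name="image-contains">
    <short-instructions></short-instructions>
  </crowd-image-classifier>
</crowd-form>'''

root = ET.fromstring(xml_doc)
classifier = root.find('crowd-image-classifier')
image_url = classifier.get('src')  # src is an attribute, not text content
print(image_url)  # https://someone#example.com/abcd.jpg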
I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in the lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract ' and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
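For what it's worth, a minimal sketch of that subtraction hack (my own reading of it; it only holds if text always starts with full_text):

# Hypothetical hack: strip the inner text off the front of .text to leave the tail.
tail = data.text[len(data.full_text):]
print(repr(tail))  # ' and some rubbish'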
I'm not sure I've understood your problem, but if you just want to get ' and some rubbish' you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish" - that is a text child node of the parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print (data[0].text) # important data and some rubbish
print (data[-1].text) # important data
print (data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.