I'm trying to extract image URLs from this code:
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
How can I find the URLs in data-src?
I'm using Beautiful Soup with the find function, but I have no idea how to extract the links because there is no img tag as usual...
Thank you for your time in advance
If you can't use an HTML parser for whatever reason, then you can use regex.
import re
text = '''
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
'''
parsed = re.search(r'(?<=data-src=").*?(?=")', text).group(0)
print(parsed)
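If the page contains several such divs, a non-greedy findall variant collects every value (a sketch against a made-up two-div snippet; the example.com URLs are placeholders):

```python
import re

# hypothetical snippet with two divs, each carrying a data-src attribute
html = ('<div data-src="https://example.com/a.jpg"></div>'
        '<div data-src="https://example.com/b.jpg"></div>')

# the non-greedy (.*?) stops at the first closing quote of each value
urls = re.findall(r'data-src="(.*?)"', html)
print(urls)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```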
You can try the following:
from bs4 import BeautifulSoup
html = """
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
"""
soup = BeautifulSoup(html, "html.parser")
url = soup.select_one(
    "div.theme-screenshot.one.attachment-theme-screenshot.size-theme-screenshot.wp-post-image.loaded"
).get("data-src")
print(url)
This will return:
https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg
Documentation for BeautifulSoup(bs4) can be found at:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
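As a side note, if matching on the long class chain feels brittle, find() can also match on attribute presence alone (a minimal sketch, using a shortened made-up div):

```python
from bs4 import BeautifulSoup

# shortened, hypothetical version of the div from the question
html = '<div class="theme-screenshot loaded" data-src="https://example.com/home.jpg"></div>'
soup = BeautifulSoup(html, "html.parser")

# attrs={"data-src": True} matches any div that merely has the attribute
div = soup.find("div", attrs={"data-src": True})
print(div["data-src"])  # https://example.com/home.jpg
```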
Currently I use lxml to parse the HTML document and get data from the HTML elements,
but there is a new challenge: one piece of data is stored as a rating inside an HTML attribute.
https://i.stack.imgur.com/bwGle.png
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
It's easy to extract text between tags, but I have no idea how to get a value stored within a tag itself.
What do you suggest?
The challenge: I want to extract the "3".
URL:https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
Br,
Gabriel.
If I understand your question and comments correctly, the following should extract all the ratings on that page:
import lxml.html
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')
For example, targets[0] returns '3'.
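The same XPath can be checked offline against a small snippet; the @data-rating step returns the attribute values themselves as strings (a sketch with made-up ratings mirroring the question's markup):

```python
import lxml.html

# hypothetical snippet shaped like the markup from the question
html = '''
<div>
  <p data-rating="3"><span class="glyphicon glyphicon-star"></span></p>
  <p data-rating="5"><span class="glyphicon glyphicon-star"></span></p>
</div>
'''
root = lxml.html.fromstring(html)

# selecting @data-rating yields the attribute values directly
targets = root.xpath('//p[./span[@class]]/@data-rating')
print(targets)  # ['3', '5']
```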
Try the script below:
from bs4 import BeautifulSoup
import re
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("div", {"class": "ratings"}):
    # walk over each child of the ratings div
    for h in tag.children:
        # convert the child to a plain string
        s = str(h)
        # find the text after the data-rating keyword
        m = re.search('(?<=data-rating=)(.*)', s)
        # check the match is not None
        if m:
            # print the text after data-rating, dropping the trailing char
            print(m.group()[:-1])
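For what it's worth, data-rating is an ordinary attribute, so no regex is needed at all; an attribute selector plus dictionary-style access does the job (a sketch against a made-up snippet shaped like the page's markup):

```python
from bs4 import BeautifulSoup

# hypothetical snippet mirroring the ratings markup on the target page
html = '''
<div class="ratings">
  <p data-rating="3">
    <span class="glyphicon glyphicon-star"></span>
  </p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# p[data-rating] selects every <p> that carries the attribute
ratings = [p["data-rating"] for p in soup.select("p[data-rating]")]
print(ratings)  # ['3']
```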
I would like to extract the text 'THIS IS THE TEXT I WANT TO EXTRACT' from the snippet below. Does anyone have any suggestions? Thanks!
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
from bs4 import BeautifulSoup
html = """<span class="cw-type__h2 Ingredients-title">Ingredients</span><p>THIS IS THE TEXT I WANT TO EXTRACT</p>"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.text)
Assuming there is likely more HTML, I would use the class of the preceding span with an adjacent sibling combinator and a p type selector to target the appropriate p tag:
from bs4 import BeautifulSoup as bs
html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = bs(html, 'lxml')
print(soup.select_one('.Ingredients-title + p').text.strip())
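Another option, in case the class list on the span ever changes, is to locate the span and walk forward with find_next (a sketch on the question's own snippet):

```python
from bs4 import BeautifulSoup

html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = BeautifulSoup(html, "html.parser")

# find_next walks forward in document order to the next matching tag
span = soup.find("span", class_="Ingredients-title")
print(span.find_next("p").get_text(strip=True))
```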
Here is some of my HTML source code:
<div class="s">
<div class="th N3nEGc" style="height:48px;width:61px">
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM"
data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ"
ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">
</a>
</div>
</div>
What I want to extract is the link:
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&
so that the output looks like this:
https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg
What I tried in Python:
sourceCode = opener.open(googlePath).read().decode('utf-8')
links = re.findall('href="/imgres?imgurl=(.*?)jpg&imgrefurl="',sourceCode)
for i in links:
    print(i)
A better way than parsing the query string with a regex is to use the parse_qs function: it is safer, and you get exactly what you want without any regex fiddling (doc):
data = '''<div class="s"><div class="th N3nEGc" style="height:48px;width:61px"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM" data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ" ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">'''
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
soup = BeautifulSoup(data, 'lxml')
d = urlparse(soup.select_one('a[href*="imgurl"]')['href'])
q = parse_qs(d.query)
print(q['imgurl'])
Prints:
['https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg']
If the problem is your regex, then I think you can try this one:
link = re.search(r'^https?://.*[\r\n]*[^.\\,:;]', sourceCode)
link = link.group()
print (link)
Perhaps you should add an escape character for the '?' (and drop the stray quote after imgrefurl=); try this:
links = re.findall(r'href="/imgres\?imgurl=(.*?jpg)&imgrefurl=', sourceCode)
for i in links:
    print(i)
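If a regex is used at all, one reasonably safe pattern is to capture only the raw href value and hand the query string to parse_qs, combining both answers above (a sketch against a shortened, hypothetical version of the anchor from the question):

```python
import re
from urllib.parse import urlparse, parse_qs

# shortened, hypothetical version of the anchor from the question
source = ('<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg'
          '&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912">')

# capture the whole href value, then let parse_qs decode the parameters
for href in re.findall(r'href="(/imgres\?[^"]+)"', source):
    query = parse_qs(urlparse(href).query)
    print(query["imgurl"][0])  # https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg
```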
I am trying to automate a process of downloading imgur files, and for this purpose I am using beautifulsoup to get the link however to be honest I am pretty lost on why this doesn't work, as according to my research it should:
soup = BeautifulSoup("http://imgur.com/ha0WYYQ")
imageUrl = soup.select('.image a')[0]['href']
The code above just returns an empty list, and therefore an error. I tried to modify it, but to no avail. Any and all input is appreciated.
<div class="post-image">
<a href="//i.imgur.com/ha0WYYQ.jpg" class="zoom">
<img src="//i.imgur.com/ha0WYYQ.jpg" alt="Frank in his bb8 costume" itemprop="contentURL">
</a>
</div>
This is the relevant tag; note that "post-image" is a single class name and cannot be split apart:
imageUrl = soup.select('.post-image a')[0]['href']
shortcut for select one tag:
imageUrl = soup.select_one('.post-image a')['href']
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
There are a few things wrong with your approach:
BeautifulSoup does not expect a URL, so you will need to use a library to fetch the HTML stream first; and
your selector seems invalid; based on what I can see it should be .post-image a.
r = urllib.urlopen('http://imgur.com/ha0WYYQ').read()
soup = BeautifulSoup(r,'lxml')
soup.select('.post-image a')[0]['href']
Or, more elegantly:
with urllib.urlopen('http://imgur.com/ha0WYYQ') as f:
    r = f.read()
soup = BeautifulSoup(r, 'lxml')
result = soup.select('.post-image a')[0]['href']
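Both versions above use the Python 2 urllib; on Python 3, urlopen lives in urllib.request. A sketch parsing the question's snippet directly (a real run would pass urlopen(url).read() to BeautifulSoup instead of the literal string):

```python
from urllib.request import urlopen  # Python 3 home of urlopen; unused in this offline sketch
from bs4 import BeautifulSoup

# the snippet from the question, stored locally so the sketch runs offline
html = '''
<div class="post-image">
  <a href="//i.imgur.com/ha0WYYQ.jpg" class="zoom"></a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one(".post-image a")["href"])  # //i.imgur.com/ha0WYYQ.jpg
```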
I have a large chunk of text and would like to parse out all the URLs, returning a list of those that follow this pattern: https://www.facebook.com/.*$.
Here is an example of the text I would like to parse from:
<abbr title="Monday xxxx" data-utime="xx" class="timestamp">over a year ago</abbr></div></div></div></div></div></li><li class="fbProfileBrowserListItem"><div class="clearfix _5qo4"><a class="_8o _8t lfloat" href="https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser" tabindex="-1" aria-hidden="true" data-hovercard="/ajax/hovercard/user.php?id=xxxx&extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D"><img class="_s0 _rw img" src="https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg" alt=""></a><div class="clearfix _42ef"><div class="_6a rfloat"><div class="_6a _6b" style="height:50px"></div><div class="_6a _6b"><div class="_5t4x"><div class="FriendButton" id="u_2h_1w"><button class="_42ft _4jy0 FriendRequestAdd addButton _4jy3 _517h" type="button">
And I would like to get "https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser"
What I tried
from bs4 import BeautifulSoup
html = open('full_page_firefox.html')
def getLinks(html):
    soup = BeautifulSoup(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        links.append(a['href'])
    return links
print getLinks(html)
Splitting also doesn't seem to work, because it does not retain the pattern: if I use something like "https://www.facebook.com/*.$" to get the URLs with re.split(), it doesn't work.
Your code works here; check your input file and make sure Beautiful Soup can parse it.
By the way, also consider using lxml:
from lxml import etree
print etree.parse('full_page_firefox.html').xpath('//a/@href | //img/@src')
['https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser',
'https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg']
Your function works. I copied the bit of HTML you provided into an HTML file and added the <html> and <body> tags for good measure.
Then I tried:
with open('C:/users/brian/desktop/html.html') as html:
    print getLinks(html)
in the python interpreter and got the following output:
[u'https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser']
Call str() on this and you're good.
You can check the URL against that pattern after parsing with BeautifulSoup, like this:
from bs4 import BeautifulSoup
import re
html = open('full_page_firefox.html')
def getLinks(html):
    soup = BeautifulSoup(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        match_result = re.match(r'https://www.facebook.com/.*$', a['href'])
        if match_result is not None:
            links.append(match_result.string)
    return links
print getLinks(html)
Note:
1. No whitespace between the '/' and the '.'.
2. '$' matches the end of the string; use it with care.
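On point 1, note also that the unescaped dots in www.facebook.com match any character; escaping them makes the check strict (a sketch with made-up links):

```python
import re

# escaped dots match literal dots only; re.match anchors at the string start
pattern = re.compile(r'https://www\.facebook\.com/.*$')

links = [
    "https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser",
    "https://wwwXfacebookYcom/not-facebook",  # unescaped dots would accept this
]
matched = [u for u in links if pattern.match(u)]
print(matched)
```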