Getting URLs out of a block of text? - python

I have a large chunk of text and would like to parse out all the URLs, returning a list of the URLs that follow this pattern: https://www.facebook.com/.*$.
Here is an example of the text I would like to parse from:
<abbr title="Monday xxxx" data-utime="xx" class="timestamp">over a year ago</abbr></div></div></div></div></div></li><li class="fbProfileBrowserListItem"><div class="clearfix _5qo4"><a class="_8o _8t lfloat" href="https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser" tabindex="-1" aria-hidden="true" data-hovercard="/ajax/hovercard/user.php?id=xxxx&extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D"><img class="_s0 _rw img" src="https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg" alt=""></a><div class="clearfix _42ef"><div class="_6a rfloat"><div class="_6a _6b" style="height:50px"></div><div class="_6a _6b"><div class="_5t4x"><div class="FriendButton" id="u_2h_1w"><button class="_42ft _4jy0 FriendRequestAdd addButton _4jy3 _517h" type="button">
And I would like to get "https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser"
What I tried
from bs4 import BeautifulSoup

html = open('full_page_firefox.html')

def getLinks(html):
    soup = BeautifulSoup(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        links.append(a['href'])
    return links

print getLinks(html)
Splitting also doesn't seem to work because it does not retain the pattern. So if I use something like "https://www.facebook.com/*.$" with re.split(), it doesn't work.
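A side note on the re.split() attempt: split() discards the matched text, while re.findall() returns it. A minimal sketch, using a hypothetical one-line snippet in place of the real page:

```python
import re

# Hypothetical one-line stand-in for the scraped page
text = '<a href="https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser" tabindex="-1">'

# re.split discards matches; re.findall returns them.
# [^"\s]+ stops the match at the closing quote or any whitespace.
urls = re.findall(r'https://www\.facebook\.com/[^"\s]+', text)
print(urls)
```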

Your code works here; check your input file and make sure Beautiful Soup can parse it.
By the way, also consider using lxml:
from lxml import etree
print etree.parse('full_page_firefox.html').xpath('//a/@href | //img/@src')
['https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser',
'https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg']

Your function works. I copied the bit of HTML you provided into an HTML file and added the <html> and <body> tags for good measure.
Then I tried:
with open('C:/users/brian/desktop/html.html') as html:
    print getLinks(html)
in the python interpreter and got the following output:
[u'https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser']
Call str() on this and you're good.

You can check each URL against that pattern after parsing with BeautifulSoup, like this:
from bs4 import BeautifulSoup
import re

html = open('full_page_firefox.html')

def getLinks(html):
    soup = BeautifulSoup(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        match_result = re.match(r'https://www.facebook.com/.*$', a['href'])
        if match_result is not None:
            links.append(match_result.string)
    return links

print getLinks(html)
Note:
1. There is no whitespace between '/' and '.' in the pattern.
2. '$' matches the end of the string; use it carefully.
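One more caveat about the pattern: the dots in it are unescaped, so each matches any character, not only a literal dot. A small sketch of the escaped variant (both sample URLs are made up):

```python
import re

# \. matches a literal dot; a bare . matches any character
pattern = re.compile(r'https://www\.facebook\.com/.*$')

good = pattern.match('https://www.facebook.com/xxxxx?fref=pb')  # hypothetical URL
bad = pattern.match('https://wwwZfacebookZcom/other')           # dots replaced; must not match
```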

Related

How to extract image url with python?

I'm trying to extract image URLs from this code:
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
How can I find the URLs in data-src?
I'm using Beautiful Soup and the find function, but I have no idea how to extract the links because I don't see an img tag as usual...
Thank you for your time in advance
If you can't use an HTML parser for whatever reason, then you can use regex.
import re
text = '''
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
'''
parsed = re.search('(?<=data-src=").*(?=" )', text).group(0)
print(parsed)
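A slightly more defensive variant uses [^"]* instead of a greedy .*, so the match can never run past the closing quote. A sketch on a shortened stand-in string (the URL here is a made-up placeholder):

```python
import re

# Shortened, hypothetical stand-in for the div in the question
text = ('<div data-featured-src="https://example.net/home.jpg" '
        'data-src="https://example.net/home.jpg" '
        'style=\'background-image: url("https://example.net/home.jpg");\'></div>')

# [^"]* cannot cross the closing quote, unlike a greedy .*
url = re.search(r'data-src="([^"]*)"', text).group(1)
print(url)
```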
You can try the following:
from bs4 import BeautifulSoup
html = """
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
"""
soup = BeautifulSoup(html, "html.parser")
url = soup.select_one(
    "div.theme-screenshot.one.attachment-theme-screenshot.size-theme-screenshot.wp-post-image.loaded"
).get("data-src")
print(url)
This will return:
https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg
Documentation for BeautifulSoup(bs4) can be found at:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

How to extract or Scrape data from HTML page but from the element itself

Currently I use lxml to parse the HTML document and get the data from the HTML elements,
but there is a new challenge: one piece of data is stored as a rating inside the HTML elements.
https://i.stack.imgur.com/bwGle.png
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
It's easy to extract text between tags, but within tags I have no idea.
What do you suggest?
Challenge: I want to extract "3".
URL:https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
Br,
Gabriel.
If I understand your question and comments correctly, the following should extract all the rating in that page:
import lxml.html
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')
For example:
targets[0]
output
'3'
Try the script below:
from bs4 import BeautifulSoup
import re
import requests

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("div", {"class": "ratings"}):
    # get all children of the tag
    for h in tag.children:
        # convert to string
        s = h.encode('utf-8').decode("utf-8")
        # find the tag with data-rating and get the text after the keyword
        m = re.search('(?<=data-rating=)(.*)', s)
        # check if not None
        if m:
            # print the text after data-rating, removing the last char
            print(m.group()[:-1])
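Since data-rating is an ordinary attribute, it can also be read directly from the start tag, with no regex over serialized children. A stdlib-only sketch with html.parser, on the markup from the question (in BeautifulSoup the equivalent would simply be tag["data-rating"]):

```python
from html.parser import HTMLParser

# Collect the data-rating attribute from every <p> start tag
class RatingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and "data-rating" in attrs:
            self.ratings.append(attrs["data-rating"])

parser = RatingParser()
parser.feed('<p data-rating="3">'
            '<span class="glyphicon glyphicon-star xh-highlight"></span>'
            '</p>')
print(parser.ratings)
```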

python beautifulsoup: replace links with url in string

In a string containing HTML I have several links that I want to replace with the pure href value:
from bs4 import BeautifulSoup

html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()
for tag in tags:
    if tag.has_attr('href'):
        html = html.replace(str(tag), tag['href'])
Unfortunately this creates some issues:
1. The tags in the html use single quotes ', but beautifulsoup will serialize the tag via str(tag) with " quotes (<a href="www.google.com">foo</a>). So replace() will not find the match.
2. <br> gets identified as <br/>. Again replace() will not find the match.
So it seems using Python's replace() method will not give reliable results.
Is there a way to use beautifulsoup's methods to replace a tag with a string?
Edit:
Added the value of str(tag): <a href="www.google.com">foo</a>
Relevant part of the docs: Modifying the tree
html = """
<html><head></head>
<body>
<a href='www.google.com'>foo</a> some text
<a href='www.bing.com'>bar</a> some <br> text
</body></html>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for a_tag in soup.find_all('a'):
    a_tag.string = a_tag.get('href')
print(soup)
output
<html><head></head>
<body>
<a href="www.google.com">www.google.com</a> some text
<a href="www.bing.com">www.bing.com</a> some <br/> text
</body></html>
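If you need the tag gone entirely (just the bare href left in the text), BeautifulSoup's replace_with is the tree-level way (a_tag.replace_with(a_tag['href'])). As a string-level alternative that sidesteps the quote-style mismatch, a regex substitution also works; this is brittle on general HTML and only a sketch for this exact single-quoted shape:

```python
import re

# The exact string from the question
html = ("<a href='www.google.com'>foo</a> some text "
        "<a href='www.bing.com'>bar</a> some <br> text")

# Replace each whole single-quoted <a>...</a> element with its href value
result = re.sub(r"<a href='([^']*)'>.*?</a>", r"\1", html)
print(result)
```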

How can I extract the following links from html source code in python?

Here is some of my HTML source code:
<div class="s">
<div class="th N3nEGc" style="height:48px;width:61px">
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM"
data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ"
ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">
</a>
</div>
</div>
What I want to extract is the link:
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&
so the output will be in that way,
https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg
What I tried using Python is:
sourceCode = opener.open(googlePath).read().decode('utf-8')
links = re.findall('href="/imgres?imgurl=(.*?)jpg&imgrefurl="',sourceCode)
for i in links:
    print(i)
A better way than parsing the query string with a regex is to use the parse_qs function: it is safer, and you get exactly what you want without regex fiddling (doc):
data = '''<div class="s"><div class="th N3nEGc" style="height:48px;width:61px"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM" data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ" ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">'''
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
soup = BeautifulSoup(data, 'lxml')
d = urlparse(soup.select_one('a[href*="imgurl"]')['href'])
q = parse_qs(d.query)
print(q['imgurl'])
Prints:
['https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg']
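Since parse_qs returns a list for every key, index [0] to get the bare string. A self-contained sketch using just the href from the question (shortened):

```python
from urllib.parse import urlparse, parse_qs

# Shortened href taken from the question
href = ("/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg"
        "&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912")

query = parse_qs(urlparse(href).query)
imgurl = query["imgurl"][0]  # parse_qs wraps every value in a list
print(imgurl)
```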
If the problem is your regex, then I think you can try this one:
link = re.search(r'^https?:\/\/.*[\r\n]*[^.\\,:;]', sourceCode)
link = link.group()
print(link)
Perhaps you should add an escape character for the '?'; try this:
links = re.findall(r'href="/imgres\?imgurl=(.*?)jpg&imgrefurl="', sourceCode)
for i in links:
    print(i)

Using BeautifulSoup to extract text within a tag

I am trying to scrape text within a site's source code using BeautifulSoup. Part of the source code looks like this:
<hr />
<div class="see-more inline canwrap" itemprop="genre">
<h4 class="inline">Genres:</h4>
<a href="/genre/Horror?ref_=tt_stry_gnr"
> Horror</a> <span>|</span>
<a href="/genre/Mystery?ref_=tt_stry_gnr"
> Mystery</a> <span>|</span>
<a href="/genre/Thriller?ref_=tt_stry_gnr"
> Thriller</a>
</div>
So I have been trying to extract the texts 'Horror', 'Mystery' and 'Thriller' with this code:
import requests
from bs4 import BeautifulSoup
url1='http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
r1=requests.get(url1)
soup1= BeautifulSoup(r1.text, 'lxml')
genre1=soup1.find('div',attrs={'itemprop':'genre'}).contents
print(genre1)
But the return comes out as:
['\n', <h4 class="inline">Genres:</h4>, '\n', <a href="/genre/Horror?ref_=tt_stry_gnr"> Horror</a>, '\xa0', <span>|</span>, '\n', <a href="/genre/Mystery?ref_=tt_stry_gnr"> Mystery</a>, '\xa0', <span>|</span>, '\n', <a href="/genre/Thriller?ref_=tt_stry_gnr"> Thriller</a>, '\n']
I am pretty new at python and webscraping, so I would appreciate all the help I can get. Thanks!
Use the straightforward BeautifulSoup select() function to extract the needed elements with a CSS selector:
import requests
from bs4 import BeautifulSoup
url1 = 'http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
soup = BeautifulSoup(requests.get(url1).text, 'lxml')
genres = [a.text.strip() for a in soup.select("div[itemprop='genre'] > a")]
print(genres)
The output:
['Horror', 'Mystery', 'Thriller']
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
You can use BeautifulSoup's get_text() method instead of the .contents property to get what you want.
From get_text() documentation:
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
>>> u'\nI linked to example.com\n'
soup.i.get_text()
>>> u'example.com'
You can specify a string to be used to join the bits of text together:
soup.get_text("|")
>>> u'\nI linked to |example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:
soup.get_text("|", strip=True)
>>> u'I linked to|example.com'
But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:
[text for text in soup.stripped_strings]
>>> [u'I linked to', u'example.com']
Try this, I am using html.parser. Let us know if you face any problems:
for data in genre1:
    get_a = data.find_all("a")
    text = ""
    for i in get_a:
        text = i.text
        print(text)
Please check the indentation as I am using cellphone.
You can do the same in several ways. CSS selectors are precise, easy to understand, and less error-prone, so you can go with selectors here as well:
from bs4 import BeautifulSoup
import requests
link = 'http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
res = requests.get(link).text
soup = BeautifulSoup(res,'lxml')
genre = ' '.join([item.text.strip() for item in soup.select(".canwrap a[href*='genre']")])
print(genre)
Result:
Horror Mystery Thriller
