I have a list of image urls contained in 'images'. I am trying to isolate the title from these image urls so that I can display, on the html, the image (using the whole url) and the corresponding title.
So far I have this:
titles = [image[149:199].strip() for image in images]
This gives me the stripped title in the following format (I provide two examples to show the pattern)
le_Art_Project.jpg/220px-
Rembrandt_van_Rijn_-Self-Portrait-_Google_Art_Project.jpg
and
cene_of_the_Prodigal_Son_-Google_Art_Project.jpg/220px-Rembrandt-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg
The bits in bold (above) are the bits I would like to remove. From the start I would like to remove everything before 220px and from the end: _-_Google_Art_Project.jpg
A newbie to python, I am struggling with syntax and furthermore as I am doing this while referring to the loop of images (list), the string manipulation is not straightforward and I am unsure of how to approach this.
The whole code for reference is below:
webscraper.py:
#app.route('/') #this is what we type into our browser to go to pages. we create these using routes
#app.route('/home')
def home():
images=imagescrape()
titles=[image[99:247].strip() for image in images]
images_titles=zip(images,titles)
return render_template('home.html',images=images,images_titles=images_titles)
What I've tried / am trying:
x = txt.strip("_-_Google_Art_Project.jpg")
Looking into strip - to get rid of the last part of the unwanted string.
I am unsure of how to combine this with getting rid of the leading string that I want to remove and also do so in the most elegant way given the structure/code I already have.
Visually, I am trying to remove the leading text as shown highlighted, as well as the last part of the string which is _-_Google_Art_Project.jpg.
Visual of HTML displayed:
UPDATE:
Based on an answer below - which is very helpful but doesn't quite perfectly solve it, I am trying this approach (without using the unquote import if possible and pure python string manipulation)
def titleextract(url):
#return unquote(url[58:url.rindex("/",58)-8].replace('_',''))
title=url[58:]
return title
The above, returns:
Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220pxRembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg
but I want:
Rembrandt_van_Rijn_-_Self-Portrait
or for the second title/image in the list:
Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg
I want:
Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist
cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg
You have this string and want to remove. Let's say I have this stored in x
y = x.lsplit("px-")[1]
z = x.rsplit("_Google_Art")[0]
This makes a list with 2 elements: stuff before "px-" in the string, and stuff after. We're just grabbing the stuff after, since you wanted to remove the stuff before. If "px-" isn't always in the string, then we need to find something else to split on. Then we split on something towards the end, and grab the stuff before it.
Edit: Addressing comment on how to split in that loop.. I think you are referring to this: titles=[image[149:199].strip() for image in images]
List comps are great but sometimes it's easier to just write it out. Haven't tested this but here's the idea:
titles = []
for image in images:
title = image[149:199].strip()
cleaned_left = title.lsplit("px-")[1]
cleaned_title = title.rsplit("_Google_Art")[0]
titles.append(cleaned_title)
import re # regular expressions used to match strings
from bs4 import BeautifulSoup # web scraping library
from urllib.request import urlopen # open a url connection
from urllib.parse import unquote # decode special url characters
#app.route('/')
#app.route('/home')
def home():
images=imagescrape()
# Iterate over all sources and extract the title from the URL
titles=(titleextract(src) for src in images)
# zip combines two lists into one.
# It goes through all elements and takes one element from the first
# and one element from the second list, combines them into a tuple
# and adds them to a sequence / generator.
images_titles = zip(images, titles)
return render_template('home.html', image_titles=images_titles)
def imagescrape():
result_images=[]
#html = urlopen('https://en.wikipedia.org/wiki/Prince_Harry,_Duke_of_Sussex')
html = urlopen('https://en.wikipedia.org/wiki/Rembrandt')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src':re.compile('.jpg')})
for image in images:
result_images.append("https:"+image['src']+'\n') #concatenation!
return result_images
def titleextract(url):
# Extract the part of the string between the last two "/" characters
# Decode special URL characters and cut off the suffix
# Replace all "_" with spaces
return unquote(url[58:url.rindex("/", 58)-4]).replace('_', ' ')
{% for image, title in images_titles %}
<div class="card" style="width: 18rem;">
<img src="{{image}}" class="card-img-top" alt="...">
<div class="card-body">
<h5 class="card-title">{{title}}</h5>
<p class="card-text">Some quick example text to build on the card title and make up the bulk of the card's content.</p>
Go somewhere
</div>
</div>
{% endfor %}
Related
I'm fairly new to RegEx (and Python) in general and am trying to use it to read the temperature and description of weather via the HTML tags of a website.
I've attempted to rework examples of what I've been shown in class and read online to do this.
url = 'https://weather.com/en-AU/weather/today/l/-27.47,153.02'
contents = urllib.request.urlopen(url).read().decode("utf-8")
start_of_div = contents.find('<div class="today_nowcard-phrase">') # start of phrase line
end_of_div = start_of_div + contents[start_of_div:].find("</div>") + 6 # close of phrase line
phrase_area = contents[start_of_div:end_of_div]
print(phrase_area)
phrase = phrase_area.rfind(r'>(.*)<') # regex tester says this works
print(phrase)
There's then another section that gets the degrees which uses the same kind of layout.
It should print a phrase like 'Sunny' or 'Light Rain' or whatever else the weather is, as well as the current degrees (celsius). Instead it prints out:
<div class="today_nowcard-phrase">Sunny</div>
- 1
<div class="today_nowcard-temp"><span class="">21<sup>
- 1
Instead of -1 it should be 'Sunny' and '21' (at that point of time). The RegEx works when I put it into RegEx testing sites, but not in my actual program (probably because of some obvious error I can't see). Any help would be appreciated.
As mentioned in comments used an html parser. The elements all have nice distinctive class names you can use e.g. .today_nowcard-temp (where the leading . is a css class selector to match on element class name)
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://weather.com/en-AU/weather/today/l/-27.47,153.02')
soup = bs(r.content, 'html.parser')
temp = soup.select_one('.today_nowcard-temp').text
desc = soup.select_one('.today_nowcard-phrase').text
print(temp, desc)
I heard about the filter |safe, but if I understood correctly, that's unsafe and creates a backdoor for injections.
What are the alternatives to display full posts with formatted text?
I think when you not use the filter of |safe, then output should return as text only with html markup (not rendered as html output).
But, if you need to exclude some dangerous tags such as <script>location.reload()</script>, you need to handle it with custom templatetag filter..
I got good answer from: https://stackoverflow.com/a/699483/6396981, via BeautifulSoup.
from bs4 import BeautifulSoup
from django import template
from django.utils.html import escape
register = template.Library()
INVALID_TAGS = ['script',]
def clean_html(value):
soup = BeautifulSoup(value)
for tag in soup.findAll(True):
if tag.name in INVALID_TAGS:
# tag.hidden = True # you also can use this.
tag.replaceWith(escape(tag))
return soup.renderContents()
# clean_html('<h1>This is heading</h1> and this one is xss injection <script>location.reload()</script>')
# output:
# <html><body><h1>This is heading</h1> and this one is xss injection <script>location.reload()</script></body></html>
#register.filter
def safe_exclude(text):
# eg: {{ post.description|safe_exclude|safe }}
return clean_html(text)
Hope it usefull..
I am supposed to write a code that goes into a web site and gets its title so here is the code i have
import urllib.request
def findTitle(url):
urllib.request.Request(url)
#open url
urllib.request.urlopen(url)
urllib.request.urlopen(url).read().decode('utf-8')
#set same variable equal to the end of <title> tag
endTitlePos = url.find("<title>")
#set variable equal to starting position of <title> tag
startTitlePos = url.find("<title>", endTitlePos)
startTitlePos += len("<title>")
#set new variable equal to </title>
TitleContent=url.find("</title>",startTitlePos)
#return slice of output between the two variables
title = url[startTitlePos:endTitlePos]
content_list=[]
content_list.append(title)
return content_list
def main():
url="https://google.com/search"
print(findTitle(url))
main()
we are using google for an example. Now its supposed to just print "google" but currently it prints "['//google.com/searc']" i am just curious what i am missing here, i mean it seems very simple but i dont know why its printing the url rather then the title and how also do i turn it form the list into a string?
There are several alternative to get data from webpages. The best use BeautifulSoup. In your case string split() method works well
import urllib.request
def findTitle(url):
webpage = urllib.request.urlopen(url).read()
title = str(webpage).split('<title>')[1].split('</title>')[0]
return title
>>>print(findTitle('http://www.google.com'))
Google
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
# this is where Legal Authority: starts in the code
start_link = page.find('Legal Authority:')
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
legal = page[start_legal+2: end_link]
return legal
for line in input:
pg = urlopen(line).read()
statute = get_legal(pg)
output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
# this is where Legal Authority: starts in the code
end_link = ""
legal = ""
start_link = page.find('Legal Authority:')
while (end_link != '</a> '):
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
end2 = page.find('</a> ', end_link+1)
legal += page[start_legal+2: end_link]
if
break
return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup
# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)
# parse the html
html = BeautifulSoup(response.content)
# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})
def fetch_parent_tag(tags):
# fetch the parent <td> tag of the first <a> tag
# whose "previous sibling" is the <b>Legal Authority:</b> tag.
for tag in tags:
sibling = tag.findPreviousSibling()
if not sibling:
continue
if sibling.getText() == 'Legal Authority:':
return tag.findParent()
# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similar accomodating.
Suppose I have a huge paragraph.
I just want the top 15 words to be shown. After than, the person clicks "more" to see the rest of the stuff.
Just whipped this up, seems to do what you want, and there's no dependency on any external JS libs.
DISCLAIMER: I haven't tried this in IE, but chrome and firefox work fine.
from django import template
from django.utils.html import escape
from django.utils.safestring import mark_safe
register = template.Library()
import re
readmore_showscript = ''.join([
"this.parentNode.style.display='none';",
"this.parentNode.parentNode.getElementsByClassName('more')[0].style.display='inline';",
"return false;",
]);
#register.filter
def readmore(txt, showwords=15):
global readmore_showscript
words = re.split(r' ', escape(txt))
if len(words) <= showwords:
return txt
# wrap the more part
words.insert(showwords, '<span class="more" style="display:none;">')
words.append('</span>')
# insert the readmore part
words.insert(showwords, '<span class="readmore">... <a href="#" onclick="')
words.insert(showwords+1, readmore_showscript)
words.insert(showwords+2, '">read more</a>')
words.insert(showwords+3, '</span>')
# Wrap with <p>
words.insert(0, '<p>')
words.append('</p>')
return mark_safe(' '.join(words))
readmore.is_safe = True
To use it, just create a templatetags folder in your app, create the __init__.py file in there, and then drop this code into readmore.py.
Then at the top of any template where you want to use it, just add: {% load readmore %}
To use the filter itself:
{{ some_long_text_var|readmore:15 }}
The :15 tells how many words you want to show before the read more link.
If you want anything fancy like ajax loading of the full content, that's quite a bit different and would require a bit more infrastructure.
use truncatechars_html
refer to : https://docs.djangoproject.com/en/1.8/ref/templates/builtins/#truncatechars-html
truncatechars_html
Similar to truncatechars, except that it is aware of HTML tags. Any tags that are opened in the string and not closed before the truncation point are closed immediately after the truncation.
For example:
{{ value|truncatechars_html:9 }}
If value is "<p>Joel is a slug</p>", the output will be "<p>Joel i...</p>".
Newlines in the HTML content will be preserved.
There is truncatewords filter, although you still need a JavaScript helper to do what you described.
from django import template
from django.utils.html import escape
from django.utils.safestring import mark_safe
register = template.Library()
#register.filter
def readmore(text, cnt=250):
text, cnt = escape(text), int(cnt)
if len(text) > cnt:
first_part = text[:cnt]
link = u'%s' % _('read more')
second_part = u'%s<span class="hide">%s</span>' % (link, text[cnt:])
return mark_safe('... '.join([first_part, second_part]))
return text
readmore.is_safe = True
I rewrote an earlier answer to be cleaner and to handle string escaping properly:
#register.filter(needs_autoescape=True)
#stringfilter
def read_more(s, show_words, autoescape=True):
"""Split text after so many words, inserting a "more" link at the end.
Relies on JavaScript to react to the link being clicked and on classes
found in Bootstrap to hide elements.
"""
show_words = int(show_words)
if autoescape:
esc = conditional_escape
else:
esc = lambda x: x
words = esc(s).split()
if len(words) <= show_words:
return s
insertion = (
# The see more link...
'<span class="read-more">…'
' <a href="#">'
' <i class="fa fa-plus-square gray" title="Show All"></i>'
' </a>'
'</span>'
# The call to hide the rest...
'<span class="more hidden">'
)
# wrap the more part
words.insert(show_words, insertion)
words.append('</span>')
return mark_safe(' '.join(words))
The HTML in there assumes you're using Bootstrap and Fontawesome, but if that's not your flavor, it's easy to adapt.
For the JavaScript, assuming you're using jQuery (if you're using Bootstrap you probably are), you'll just need to add something like this:
$(".read-more").click(function(e) {
e.preventDefault();
var t = $(this);
t.parent().find('.more').removeClass('hidden');
t.addClass('hidden');
});