My URL is http://example.com/en/cat/ap+da+w_pl. Now I have an a-tag like this:
<a href="{{ url_for('category',
    feature=request.path + '+' + att.get('u_sg')) }}">
{{ att.get('name') }}
</a>
request.path is giving me '/en/cat/ap+da+w_pl', BUT I need only '/ap+da+w_pl'.
How do I do that?
I need to pass only 'ap+da+w_pl' out of request.path, from the HTML only, as I have to use it in a pre-coded Flask view that looks like THIS:
@app.route('/<lan_c>/cat/<string:feature>')
def category(feature, page):
Consider current url is 'http://example.com/en/cat/ap+da+w_pl'
If the user clicks the a-tag, I want to append the value returned from att.get('u_sg').
The problem I am facing right now is that my a-tag resolves to 'http://example.com/en/cat/en/cat/ap+da+w_pl+w_pl2', so I want to send only 'ap+da+w_pl' + att.get('u_sg'), so that the a-tag points to 'http://example.com/en/cat/ap+da+w_pl+w_pl2'.
You could split the result by / and get the last element:
>>> r = 'http://example.com/en/cat/ap+da+w_pl'.split('/')
>>> r[-1]
'ap+da+w_pl'
This works the same way for '/en/cat/ap+da+w_pl':
>>> r = '/en/cat/ap+da+w_pl'.split('/')
>>> r[-1]
'ap+da+w_pl'
Prepend the / if needed:
>>> '/'+(r[-1])
'/ap+da+w_pl'
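Putting the idea together, here is a minimal plain-Python sketch; the 'w_pl2' value stands in for whatever att.get('u_sg') would return:

```python
# A minimal sketch of the split-and-append idea from above.
# 'w_pl2' stands in for whatever att.get('u_sg') would return.
path = '/en/cat/ap+da+w_pl'        # what request.path gives you
feature = path.split('/')[-1]      # -> 'ap+da+w_pl'
new_feature = feature + '+' + 'w_pl2'
print(new_feature)                 # -> ap+da+w_pl+w_pl2
```

The same expression should work directly in the template, e.g. `feature=request.path.split('/')[-1] + '+' + att.get('u_sg')`, since Jinja2 allows calling string methods inside expressions.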
Related
Find all the URL links in an HTML text using regex. The text below is assigned to the html variable.
html = """
<a href="#fragment">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
The output should look like this:
["/relative/path","//other.host/same-protocol","https://example.com"]
The function should ignore fragment identifiers (link targets that begin with #); i.e., if the URL points to a specific fragment/section using the hash symbol, the fragment part (the part starting with #) should be stripped before the URL is returned by the function.
I have tried the one below, but it's not working; it only gives the output ["https://example.com"]:
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
print(urls)
You could try using a positive lookbehind to find the quoted strings after href= in the html:
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
urls = re.findall(pattern, html)
See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
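To sanity-check the pattern, here is a small self-contained run against a sample like the one in the question (the two non-fragment hrefs are taken from the expected output; the fragment-only link is an assumption):

```python
import re

html = '''<a href="#fragment">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>'''

# The lookbehind anchors each match right after href=", the (?!#)
# skips fragment-only links, and the lookahead stops before '#' or '"'.
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
print(pattern.findall(html))
# -> ['/relative/path', '//other.host/same-protocol', 'https://example.com']
```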
from typing import List
html = """
<a href="#fragment">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
href_prefix = "href=\""
def get_links_from_html(html: str, result: List[str] = None) -> List[str]:
    if result is None:
        result = []
    # partition returns (before, separator, after); an empty separator
    # means href_prefix was not found and the recursion can stop
    _, sep, rest = html.partition(href_prefix)
    if not sep:
        return result
    link = rest[:rest.find("\"")].partition("#")[0]
    if link:
        result.append(link)
    return get_links_from_html(rest, result)
print(get_links_from_html(html))
I've got the following HTML:
<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>
I want to get the text in populAddr and in GetInfoAndRoundsFor i.e. the strings "14 PLACE NAME TOWN POSTCODE" and "123456789123" respectively.
So far I have tried:
button_click_text = address.find('button').get('onclick')
Which gets me the full onclick string, which is great. Is slicing the only way to get the specific substrings?
I've tried this:
string = """changeText('uprnButton1','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');getobject('divAddress').innerHTML = '';GetInfoAndRoundsFor('123456789123','SWN');"""
string_before = "populAddr('"
string_after = "');getobject"
print(string[string.index(string_before)+len(string_before):string.index(string_after)])
Which does work, but it looks like an effing mess. Is there a best practice here?
Actually just thought this might be better:
string_split = string.split("'")
print(string_split[5])
print(string_split[11])
You should be able to use the following two lazy regex patterns
import re
html ='''<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>'''
p1 = re.compile(r"populAddr\('(.*?)'")
p2 = re.compile(r"GetInfoAndRoundsFor\('(.*?)'")
print(p1.findall(html)[0])
print(p2.findall(html)[0])
The same principle applies to both: match a literal prefix, then lazily capture everything up to the next single quote.
You can replace the html variable with response.text or button_click_text, where response.text is the .text of the requests response.
I found this to be the quickest way of doing it, and because I guess the HTML could be switched, I put a couple of checks in to make sure the house number was what I searched for and the uprn was actually a number. If either of these is false, then I know the code on the site has probably been tweaked:
string_split = string.split("'")
address = string_split[5]
uprn = string_split[11]
# validate the address starts with the correct house number
print(address.startswith('14 '))
# validate the uprn contains a number
print(uprn[0:12].isdigit())
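The index arithmetic and both checks can be folded into one small helper. This is a sketch; parse_onclick and the house_number parameter are my names, not from the site:

```python
def parse_onclick(onclick, house_number='14'):
    # Split on single quotes: with this onclick layout the address
    # is token 5 and the uprn is token 11.
    parts = onclick.split("'")
    address, uprn = parts[5], parts[11]
    if not address.startswith(house_number + ' '):
        raise ValueError('unexpected address: %r' % address)
    if not uprn.isdigit():
        raise ValueError('unexpected uprn: %r' % uprn)
    return address, uprn

onclick = ("changeText('uprnButton0','Loading');"
           "populAddr('14 PLACE NAME TOWN POSTCODE');"
           "getobject('divAddress').innerHTML = '';"
           "GetInfoAndRoundsFor('123456789123','SWN');")
print(parse_onclick(onclick))
# -> ('14 PLACE NAME TOWN POSTCODE', '123456789123')
```

Raising instead of printing means a tweaked page fails loudly rather than silently scraping the wrong field.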
Here is my try:
In [1]: d = """
...: <td id="uprnButton0">
...: <button type="button"
...: onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
...: getobject('divAddress').innerHTML = '';
...: GetInfoAndRoundsFor('123456789123','SWN');"
...: title="Get Calendar for this address"
...: >Show
...: </button>
...: </td>
...: """
In [2]: from bs4 import BeautifulSoup as bs
In [3]: soup = bs(d,"lxml")
In [4]: button_click_text = soup.find('button').get('onclick')
In [5]: button_click_text
Out[5]: "changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');\n getobject('divAddress').innerHTML = '';\n GetInfoAndRoundsFor('123456789123','SWN');"
In [6]: import re
...: regex = re.compile(r"'.*?'")
...: out = regex.findall(button_click_text)
...: s1 = out[2][1:-1]
...: s2 = out[-2][1:-1]
In [7]: s1
Out[7]: '14 PLACE NAME TOWN POSTCODE'
In [8]: s2
Out[8]: '123456789123'
soup.find('button') returns an object representing the first button element, and soup.find('button')['onclick'] returns the string value of the onclick attribute.
Because of this, there isn't a convenient way of fetching the value of populAddr, other than using split.
I would recommend splitting by the following:
address = address.find('button').get('onclick').split('populAddr(')[1].split(')')[0]
If you split by populAddr, you know exactly what index the address is located in (it will always be index 0).
If you split by ', you will have to manually review every page you scrape in order to verify that the address will end up in index 5.
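For reference, a minimal sketch of that split-by-populAddr idea; splitting on "populAddr('" (including the opening quote) also keeps the surrounding quotes out of the result:

```python
onclick = ("changeText('uprnButton0','Loading');"
           "populAddr('14 PLACE NAME TOWN POSTCODE');"
           "getobject('divAddress').innerHTML = '';"
           "GetInfoAndRoundsFor('123456789123','SWN');")

# everything after populAddr(' and before the closing ')
address = onclick.split("populAddr('")[1].split("')")[0]
# same idea for the first GetInfoAndRoundsFor argument
uprn = onclick.split("GetInfoAndRoundsFor('")[1].split("'")[0]

print(address)  # -> 14 PLACE NAME TOWN POSTCODE
print(uprn)     # -> 123456789123
```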
I am a self-learner and a beginner; I searched a lot, but maybe my searching was lacking. I am scraping some values from two web sites and I want to compare them in an HTML output. For each web page, I combine two classes and put them into a list. But when making the HTML output, I don't want the whole list printed, so I made a function to choose which keywords to print. When I print out that function's result, it comes out as 'None' in the HTML output, but it prints what I want on the console. So how do I show that filtered list?
OS: Windows, Python 3.
from bs4 import BeautifulSoup
import requests
import datetime
import os
import webbrowser
carf_meySayf = requests.get('https://www.carrefoursa.com/tr/tr/meyve/c/1015?show=All').text
carf_soup = BeautifulSoup(carf_meySayf, 'lxml')
#spans
carf_name_span = carf_soup.find_all('span', {'class' : 'item-name'})
carf_price_span = carf_soup.find_all('span', {'class' : 'item-price'})
#spans to list
carf_name_list = [span.get_text() for span in carf_name_span]
carf_price_list = [span.get_text() for span in carf_price_span]
#combine lists
carf_mey_all = [carf_name_list +' = ' + carf_price_list for carf_name_list, carf_price_list in zip(carf_name_list, carf_price_list)]
#Function to choose and print special product
def test(namelist, product):
    for i in namelist:
        if product in i:
            print(i)
a = test(carf_mey_all,'Muz')
# Date
date = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# HTML part
html_str = """
<html>
<title>Listeler</title>
<h2>Tarih: %s</h2>
<h3>Product & Shop List</h3>
<table style="width:100%%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
%s
</tr>
</html>
"""
whole = html_str %(date,a)
Html_file= open("Meyve.html","w")
Html_file.write(whole)
Html_file.close()
The test() method must have a return value, for example:
def test(namelist, product):
    results = ''
    for i in namelist:
        if product in i:
            print(i)
            results += '<td>%s</td>\n' % i
    return results
Meyve.html results:
<html>
<title>Listeler</title>
<h2>Tarih: 2018-12-29 07:34:00</h2>
<h3>Product & Shop List</h3>
<table style="width:100%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
<td>Muz = 6,99 TL</td>
<td>İthal Muz = 12,90 TL</td>
<td>Paket Yerli Muz = 9,99 TL</td>
</tr>
</html>
Note: to be valid HTML you need to add <body></body>.
The problem is that your test() function isn't explicitly returning anything, so it is implicitly returning None.
To fix this, test() should accumulate the text it wants to return (i.e, by building a list or string) and return a string containing the text you want to insert into html_str.
I'm trying to parse an HTML result, grab a few URLs, and then parse the output of visiting those URLs.
I'm using Django 1.5 / Python 2.7:
views.py
#mechanize/beautifulsoup config options here.
beautifulSoupObj = BeautifulSoup(mechanizeBrowser.response().read()) #read the raw response
getFirstPageLinks = beautifulSoupObj.find_all('cite') #get first page of urls
url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoogle)
#url_data = UrlData(5, 'myapp.com')
#return HttpResponse(MaxUrlsToGather)
print url_data.url_list()
return render(request, 'myapp/scan/process_scan.html', {
'url_data':url_data,'EnteredDomain':EnteredDomain,'getDomainLinksFromGoogle':getDomainLinksFromGoogle,
'NumberOfUrlsFound':NumberOfUrlsFound,
'getFirstPageLinks' : getFirstPageLinks,
})
urldata.py
class UrlData(object):
    def __init__(self, num_of_urls, url_pattern):
        self.num_of_urls = num_of_urls
        self.url_pattern = url_pattern

    def url_list(self):
        # Returns a list of strings that represent the urls you want,
        # based on num_of_urls, e.g. asite.com/?search?start=10
        urls = []
        for i in xrange(self.num_of_urls):
            urls.append(self.url_pattern + '&start=' + str((i + 1) * 10) + ',')
        return urls
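As a quick check of what url_list produces, the same list-building logic in a standalone Python 3 sketch (range instead of xrange):

```python
# Standalone Python 3 sketch of UrlData.url_list's logic.
def url_list(url_pattern, num_of_urls):
    return [url_pattern + '&start=' + str((i + 1) * 10) + ','
            for i in range(num_of_urls)]

print(url_list('https://example.com/search?q=x', 2))
# -> ['https://example.com/search?q=x&start=10,',
#     'https://example.com/search?q=x&start=20,']
```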
template:
{{ getFirstPageLinks }}
{% if url_data.num_of_urls > 0 %}
{% for url in url_data.url_list %}
{{ url }}
{% endfor %}
{% endif %}
This outputs:
[<cite>www.google.com/webmasters/</cite>, <cite>www.domain.com</cite>, <cite>www.domain.comblog/</cite>, <cite>www.domain.comblog/projects/</cite>, <cite>www.domain.comblog/category/internet/</cite>, <cite>www.domain.comblog/category/goals/</cite>, <cite>www.domain.comblog/category/uncategorized/</cite>, <cite>www.domain.comblog/twit/2013/01/</cite>, <cite>www.domain.comblog/category/dog-2/</cite>, <cite>www.domain.comblog/category/goals/personal/</cite>, <cite>www.domain.comblog/category/internet/tech/</cite>]
which is generated by: getFirstPageLinks
and
https://www.google.com/search?q=site%3Adomain.com&start=10, https://www.google.com/search?q=site%3Adomain.com&start=20,
which is generated by: url_data a template variable
The problem currently is: I need to loop through each URL in url_data and get the output the way getFirstPageLinks outputs it.
How can I achieve this?
Thank you.
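One way to approach it, sketched under the assumption that mechanizeBrowser is the already-configured browser from the view: factor the <cite> extraction into a helper, then call it once per generated URL (extract_cite_texts is my name, not from the question):

```python
from bs4 import BeautifulSoup

def extract_cite_texts(html):
    # pull the text out of every <cite> tag on one results page
    soup = BeautifulSoup(html, 'html.parser')
    return [cite.get_text() for cite in soup.find_all('cite')]

# In the view, loop over the generated URLs and parse each response:
#
#   all_links = []
#   for url in url_data.url_list():
#       mechanizeBrowser.open(url)
#       all_links.extend(extract_cite_texts(mechanizeBrowser.response().read()))
#
# then pass all_links to the template and iterate over it the same way
# the template already iterates over url_data.url_list.
```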
I am trying to pull a list of data from a website using Beautiful Soup:
class burger(webapp2.RequestHandler):
    Husam = urlopen('http://www.qaym.com/city/77/category/3/%D8%A7%D9%84%D8%AE%D8%A8%D8%B1/%D8%A8%D8%B1%D8%AC%D8%B1/').read()

    def get(self, soup=BeautifulSoup(Husam)):
        tago = soup.find_all("a", class_="bigger floatholder")
        for tag in tago:
            me2 = tag.get_text("\n")
            template_values = {
                'me2': me2
            }
            for template in template_values:
                template = jinja_environment.get_template('index.html')
                self.response.out.write(template.render(template_values))
Now when I try to show the data in the template using jinja2, it repeats the whole template based on the number of list items and puts each single item into its own copy of the template.
How do I put the whole list in one tag and still be able to edit other tags without repeating?
<li>{{ me2}}</li>
To output a list of entries, you can loop over them in your jinja2 template like this:
{% for entry in me2 %}
<li> {{ entry }} </li>
{% endfor %}
To use this, your python code also has to put the tags into a list.
Something like this should work:
def get(self, soup=BeautifulSoup(Husam)):
    tago = soup.find_all("a", class_="bigger floatholder")
    # Create a list to store your entries
    values = []
    for tag in tago:
        me2 = tag.get_text("\n")
        # Append each tag's text to the list
        values.append(me2)
    template = jinja_environment.get_template('index.html')
    # Put the list of values into a dict entry for jinja2 to use
    template_values = {'me2': values}
    # Render the template with the dict that contains the list
    self.response.out.write(template.render(template_values))
References:
Jinja2 template documentation