Scraping data from a website using Beautiful Soup and showing it in Jinja2 - Python

I am trying to pull a list of data from a website using Beautiful Soup:
class burger(webapp2.RequestHandler):
    Husam = urlopen('http://www.qaym.com/city/77/category/3/%D8%A7%D9%84%D8%AE%D8%A8%D8%B1/%D8%A8%D8%B1%D8%AC%D8%B1/').read()

    def get(self, soup=BeautifulSoup(Husam)):
        tago = soup.find_all("a", class_="bigger floatholder")
        for tag in tago:
            me2 = tag.get_text("\n")
            template_values = {
                'me2': me2
            }
            for template in template_values:
                template = jinja_environment.get_template('index.html')
                self.response.out.write(template.render(template_values))
Now when I try to show the data in the template using Jinja2, it repeats the whole template once per list item and puts each single piece of info in its own copy of the template.
How do I put the whole list in one tag and still be able to edit the other tags without repeating?
<li>{{ me2 }}</li>

To output a list of entries, you can loop over them in your jinja2 template like this:
{% for entry in me2 %}
<li> {{ entry }} </li>
{% endfor %}
To use this, your python code also has to put the tags into a list.
Something like this should work:
def get(self, soup=BeautifulSoup(Husam)):
    tago = soup.find_all("a", class_="bigger floatholder")
    # Create a list to store your entries
    values = []
    for tag in tago:
        me2 = tag.get_text("\n")
        # Append each tag to the list
        values.append(me2)
    template = jinja_environment.get_template('index.html')
    # Put the list of values into a dict entry for jinja2 to use
    template_values = {'me2': values}
    # Render the template with the dict that contains the list
    self.response.out.write(template.render(template_values))
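As a side note (not part of the original answer), fetching the page at class-definition time and passing soup as a default argument is fragile. A minimal sketch of a handler that fetches inside get(), assuming the same jinja_environment, imports, and URL as above:
class burger(webapp2.RequestHandler):
    def get(self):
        # Fetch and parse the page on each request instead of at import time.
        html = urlopen('http://www.qaym.com/city/77/category/3/%D8%A7%D9%84%D8%AE%D8%A8%D8%B1/%D8%A8%D8%B1%D8%AC%D8%B1/').read()
        soup = BeautifulSoup(html)
        # Collect the text of every matching <a> tag into one list.
        values = [tag.get_text("\n") for tag in soup.find_all("a", class_="bigger floatholder")]
        template = jinja_environment.get_template('index.html')
        self.response.out.write(template.render({'me2': values}))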
References:
Jinja2 template documentation

Related

Split HTML nested list into python list

I have an HTML list formed this way (it's what CKEditor creates for nested lists):
<ul>
<li>niv1alone</li>
<li>niv1
<ul>
<li>niv2
<ul>
<li>niv3
<ul>
<li>niv4</li>
</ul></li></ul></li></ul></li>
<li>autre niv1 alone</li>
</ul>
How do I form a "recursive list" like this:
[
('niv1alone',[]),('niv1',[('niv2',[('niv3',[('niv4',[])])])]),('autre niv1 alone',[])
]
I have tried several things with beautifulsoup but I can't get the desired result.
Here's a recursive function that works similarly to what you ask. The trick to writing recursive functions is to make the problem smaller and then recurse on it. Here I'm walking down the element tree and recursing on the children, which are a strictly smaller set than the one before.
import bs4
html = '''
<ul>
<li>niv1alone</li>
<li>niv1
<ul>
<li>niv2
<ul>
<li>niv3
<ul>
<li>niv4</li>
</ul></li></ul></li></ul></li>
<li>autre niv1 alone</li>
</ul>
'''
def make_tree(body: bs4.Tag):
    branch = []
    for ch in body.children:
        if isinstance(ch, bs4.NavigableString):
            # skip whitespace
            if str(ch).strip():
                branch.append(str(ch).strip())
        else:
            branch.append(make_tree(ch))
    return branch

if __name__ == '__main__':
    soup = bs4.BeautifulSoup(html, 'html.parser')
    tree = make_tree(soup.select_one('ul'))
    print(tree)
output:
[['niv1alone'], ['niv1', [['niv2', [['niv3', [['niv4']]]]]]], ['autre niv1 alone']]
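If you specifically want the (text, children) tuple shape shown in the question, a small variation of the same idea (a sketch, not the original answer's code) pairs each <li>'s own text with the tree built from its nested <ul>:
def make_tuple_tree(ul: bs4.Tag):
    branch = []
    for li in ul.find_all('li', recursive=False):
        # The li's own label is its first non-empty text node.
        text = next((str(c).strip() for c in li.children
                     if isinstance(c, bs4.NavigableString) and str(c).strip()), '')
        # Recurse into the directly nested <ul>, or use an empty list if there is none.
        child_ul = li.find('ul', recursive=False)
        branch.append((text, make_tuple_tree(child_ul) if child_ul else []))
    return branch

# Reusing the soup from above:
# [('niv1alone', []), ('niv1', [('niv2', [('niv3', [('niv4', [])])])]), ('autre niv1 alone', [])]
print(make_tuple_tree(soup.select_one('ul')))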

Is there an alternative to bs4's find_all() method that returns another soup object instead of a list, for further navigation?

Upon finding all the <ul>s, I'd like to further extract the text and the hrefs. The problem I'm facing, particular to this bit of HTML, is that I need most, but not all, of the <li> items on the page. I see that when I find_all(), I am returned a list object, which does not allow me to further navigate it as a soup object.
For example, in the below snippet, to ultimately create a dictionary of {'cityName': 'href',}, I have tried:
city_list = soup.find_all('ul', {'class': ''})
city_dict = {}
for city in city_list:
    city_dict[city.text] = city['href']
Here is the sample minimal HTML:
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>southeast alaska</li>
</ul>
<h4>Arizona</h4>
<ul>
<li>flagstaff / sedona</li>
<li>yuma</li>
</ul>
<ul>
<li>help</li>
<li>safety</li>
<li class="fsel mobile linklike" data-mode="regular">desktop</li>
</ul>
How can I, essentially, find_all() the ul's first, and then further find only the li's that interest me?
Probably you need something like this:
city_dict = {}
for ul in soup.find_all('ul', {'class': ''}):
    state_name = ul.find_previous_sibling('h4').text
    print(state_name)
    for link in ul.find_all('a'):
        print(link['href'])

city_dict = {}
for li in soup.find_all('li'):
    city_name = li.text
    for link in li.find_all('a'):
        city_dict[city_name] = link['href']
Try this, Thank me later :)
list_items = soup.find_all('ul',{'class':''})
list_of_dicts = []
for item in list_items:
    for i in item.find_all('li'):
        new_dict = {i.text: i.a.get('href')}
        list_of_dicts.append(new_dict)
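For completeness, here is a sketch that combines both ideas: it keeps only the <ul>s that directly follow an <h4> heading, which skips the trailing help/safety/desktop list, and it assumes each city <li> wraps an <a href="..."> as on the real page (the minimal HTML above omits the anchors):
city_dict = {}
for ul in soup.find_all('ul', {'class': ''}):
    # Skip lists that don't sit right after a state heading (e.g. the footer list).
    prev = ul.find_previous_sibling()
    if prev is None or prev.name != 'h4':
        continue
    for a in ul.select('li > a[href]'):
        city_dict[a.text.strip()] = a['href']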

Dictionary loop calling same website when scraping

I'm very new to Python so this is probably straightforward, and might be an indentation issue. I'm trying to scrape over several webpages using beautiful soup, creating a list of dictionaries that I can use afterwards to manipulate the data.
The code seems to work fine, but the list I end up with (liste_flat) is just a list of the same two dictionaries. I want a list of different dictionaries.
def scrap_post(url):
    url = "https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/".format(i)
    dictionary = {}
    response = requests.get(url)
    soup = bs(response.text, "lxml")
    taille = len(soup.find_all("div", class_="col-sm-6 col-md-4 col-lg-3 pl-grid-prop not-viewed "))  # 48 entries
    for num_ville in range(0, taille):
        print(num_ville)
        apt_id = soup.find_all("div", class_="col-sm-6 col-md-4 col-lg-3 pl-grid-prop not-viewed ")[num_ville]['data-id']
        entry = soup.find_all("div", class_="col-sm-6 col-md-4 col-lg-3 pl-grid-prop not-viewed ")[num_ville]
        pricepw = soup.find_all('div', class_='col-xs-5 col-sm-4 price')[num_ville].find('h3').text.encode('utf-8').replace('\xc2\xa3', '').replace('pw', '').strip()
        rooms = soup.find_all('div', class_='col-xs-6 type')[num_ville].find('p').text.encode('utf-8').strip()
        lat = soup.find_all('div', {"itemprop": "geo"})[num_ville].find('meta', {'itemprop': 'latitude'})['content']
        lon = soup.find_all('div', {"itemprop": "geo"})[num_ville].find('meta', {'itemprop': 'longitude'})['content']
        dictionary[num_ville] = {'Price per week': pricepw, 'Rooms': rooms, 'Latitude': lat, 'Longitude': lon}
    return dictionary

# get all URLs
liste_url = []
liste_url = ['https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/' '%i' % i for i in range(1, 3)]

# get flats
liste_flat = [scrap_post(i) for i in liste_url]
I must somehow be looping over the same website twice. Any advice on how to make sure I'm looping over different websites?
Thanks!
Yes, you are looping over the same website, because you have hardcoded the url variable in your function.
url = "https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/".format(i)
Meaning that regardless of what you send to the function, it will always use this URL; you might want to remove that line. You also haven't placed a placeholder in the string, so the .format(i) essentially does nothing.
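A minimal corrected sketch along those lines (same names as the question's code, with the URL built once outside the function and a real page number appended per page) could look like this:
def scrap_post(url):
    # Use the url that was passed in; don't overwrite it inside the function.
    response = requests.get(url)
    soup = bs(response.text, "lxml")
    dictionary = {}
    # ... same per-listing extraction as above, filling dictionary[num_ville] ...
    return dictionary

base_url = 'https://www.findproperly.co.uk/property-to-rent-london/commute/W3siaWQiOjkxMDYsImZyZXEiOjUsIm1ldGgiOiJwdWJ0cmFucyIsImxuZyI6LTAuMTI0Nzg5LCJsYXQiOjUxLjUwODR9XQ==/max-time/90/page/'
liste_url = [base_url + '%i' % i for i in range(1, 3)]  # page 1 and page 2
liste_flat = [scrap_post(u) for u in liste_url]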

Scrapy: difficulty outputting multiple rows, xPath issue?

I'm trying to get Scrapy to extract the author, date, and post from the forum https://bitcointalk.org/index.php?topic=1209137.0, and import it into my items.
My desired results are: (with extraneous html that I'll clean later)
author 1, date 1, post 1
author 2, date 2, post 2,
But instead I get:
author 1,2,3,4 date 1,2,3,4, post 1,2,3,4
I've searched around and read a few things on changing xPaths from absolute to relative, but I can't seem to get it working properly. I'm unsure if that's the root cause, or if I need to create a pipeline to transform the data?
UPDATE: CODE ATTACHED
class Bitorg(scrapy.Spider):
    name = "bitorg"
    allowed_domains = ["bitcointalk.org"]
    start_urls = [
        "https://bitcointalk.org/index.php?topic=1209137.0"
    ]

    def parse(self, response):
        for sel in response.xpath('..//html/body'):
            item = BitorgItem()
            item['author'] = sel.xpath('.//b/a[@title]').extract()
            item['date'] = sel.xpath('.//td[@valign="middle"]/div[@class="smalltext"]').extract()
            item['post'] = sel.xpath('.//div[@class="post"]').extract()
            yield item
While the <table>, <tbody> and <tr> elements don't have attributes that can easily be selected, there is a <td> for each post with a class of poster_info.
To get a list of all posts, select on that <td> and then move up the tree using the XPath .. notation.
posts = response.xpath('//*[@class="poster_info"]/..')
Within each post, select the child elements of interest.
for post in posts:
    author = ''.join(post.xpath('.//td[@class="poster_info"]/.//b/a/.//text()').extract())
    title = ''.join(post.xpath('.//div[@class="subject"]/.//a/.//text()').extract())
    date = ''.join(post.xpath('.//div[@class="subject"]/following-sibling::div/.//text()').extract())
    print '%s, %s, %s' % (author, title, date)
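Putting that back into the spider (a sketch, assuming BitorgItem has the author, date, and post fields from the question) yields one item per post instead of one item holding the whole page:
def parse(self, response):
    # One <td class="poster_info"> per post; its parent holds the whole post row.
    for post in response.xpath('//*[@class="poster_info"]/..'):
        item = BitorgItem()
        item['author'] = ''.join(post.xpath('.//td[@class="poster_info"]//b/a//text()').extract())
        item['date'] = ''.join(post.xpath('.//div[@class="subject"]/following-sibling::div//text()').extract())
        item['post'] = ''.join(post.xpath('.//div[@class="post"]//text()').extract())
        yield item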
Note that all of that code is just one big div with small tables inside, and the XPaths for the authors are:
/html/body/div[2]/form/table[1]/tbody/tr[1]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/form/table[1]/tbody/tr[5]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/form/table[1]/tbody/tr[6]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
You can use this pattern to scrape each row:
l = XPathItemLoader(item=JustDialItem(), response=response)
for i in range(1, 10):
    l.add_xpath('content1', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('content2', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('content3', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
You can do the same for the date and the post.

Loop through Result to Gather Urls

I'm trying to parse an HTML result, grab a few URLs, and then parse the output of visiting those URLs.
I'm using Django 1.5 / Python 2.7:
views.py
# mechanize/beautifulsoup config options here.
beautifulSoupObj = BeautifulSoup(mechanizeBrowser.response().read())  # read the raw response
getFirstPageLinks = beautifulSoupObj.find_all('cite')  # get first page of urls

url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoogle)
# url_data = UrlData(5, 'myapp.com')
# return HttpResponse(MaxUrlsToGather)
print url_data.url_list()
return render(request, 'myapp/scan/process_scan.html', {
    'url_data': url_data, 'EnteredDomain': EnteredDomain, 'getDomainLinksFromGoogle': getDomainLinksFromGoogle,
    'NumberOfUrlsFound': NumberOfUrlsFound,
    'getFirstPageLinks': getFirstPageLinks,
})
urldata.py
class UrlData(object):
    def __init__(self, num_of_urls, url_pattern):
        self.num_of_urls = num_of_urls
        self.url_pattern = url_pattern

    def url_list(self):
        # Returns a list of strings that represent the urls you want based on num_of_urls
        # e.g. asite.com/?search?start=10
        urls = []
        for i in xrange(self.num_of_urls):
            urls.append(self.url_pattern + '&start=' + str((i + 1) * 10) + ',')
        return urls
template:
{{ getFirstPageLinks }}
{% if url_data.num_of_urls > 0 %}
{% for url in url_data.url_list %}
{{ url }}
{% endfor %}
{% endif %}
This outputs:
[<cite>www.google.com/webmasters/</cite>, <cite>www.domain.com</cite>, <cite>www.domain.comblog/</cite>, <cite>www.domain.comblog/projects/</cite>, <cite>www.domain.comblog/category/internet/</cite>, <cite>www.domain.comblog/category/goals/</cite>, <cite>www.domain.comblog/category/uncategorized/</cite>, <cite>www.domain.comblog/twit/2013/01/</cite>, <cite>www.domain.comblog/category/dog-2/</cite>, <cite>www.domain.comblog/category/goals/personal/</cite>, <cite>www.domain.comblog/category/internet/tech/</cite>]
which is generated by: getFirstPageLinks
and
https://www.google.com/search?q=site%3Adomain.com&start=10, https://www.google.com/search?q=site%3Adomain.com&start=20,
which is generated by: url_data a template variable
The problem currently is: I need to loop through each url in url_data and get the output the way getFirstPageLinks outputs it.
How can I achieve this?
Thank you.
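One way to do it (a sketch, not from the original thread, reusing the mechanize browser and BeautifulSoup setup from the view above) is to loop over url_data.url_list() in the view, open each URL, and collect that page's <cite> tags before rendering:
all_page_links = []
for url in url_data.url_list():
    mechanizeBrowser.open(url)  # fetch the next results page
    page_soup = BeautifulSoup(mechanizeBrowser.response().read())
    all_page_links.extend(page_soup.find_all('cite'))  # same extraction as getFirstPageLinks

return render(request, 'myapp/scan/process_scan.html', {
    'url_data': url_data,
    'getFirstPageLinks': getFirstPageLinks,
    'allPageLinks': all_page_links,  # hypothetical extra context variable for the template
})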
