beautifulsoup add <div> with class at end of html - python

I have an HTML file with multiple tags (there are divs nested inside other divs as well). I want to add a new tag, with a class, at a specific position near the end of the HTML. I tried append, insert, and insert_after/insert_before, but none of them worked as I expected.
My HTML input is:
<div id="page">
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content again once</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content second time</p>
</div>
</div>
</div>
</div>
I want to add a new <div> tag with class="record" at the end, just before the closing tag of <div id="records">.
The output should look like this:
<div id="page">
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content again once</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content second time</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>
</div>
</div>
In my case, the number of <div class="record"> elements is not fixed; it can vary.
I would like a solution or suggestion for this problem using BeautifulSoup in Python.

You can use insert_after after the last item in soup.find_all('div', class_='record'):
from bs4 import BeautifulSoup
html = '<div id="records"> <div class="record"> <div class="header"> <div class="title"> Something here to display </div> </div> <div class="disclaimer"> <p>Here i want to print content</p> </div> </div> <div class="record"> <div class="header"> <div class="title"> Something here to display again once </div> </div> <div class="disclaimer"> <p>Here i want to print content again once</p> </div> </div> <div class="record"> <div class="header"> <div class="title"> Something here to display second time </div> </div> <div class="disclaimer"> <p>Here i want to print content second time</p> </div> </div> </div>'
soup = BeautifulSoup(html, 'html.parser')
extra_html = '''
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>'''
soup.find_all('div', class_='record')[-1].insert_after(BeautifulSoup(extra_html, 'html.parser')) # [-1] selects the last item
Output of print(soup.prettify()):
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content again once
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content second time
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content 3rd time
</p>
</div>
</div>
</div>

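Using .append(), you need to select the parent element that should contain the new record, here <div id="records"> (appending to <div id="page"> would place the record after the closing tag of <div id="records">):
from bs4 import BeautifulSoup
newRecord = '''
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>
'''
soup = BeautifulSoup(sourceHTML, 'html.parser')  # sourceHTML holds the input HTML from the question
records = soup.select_one('#records')
records.append(BeautifulSoup(newRecord, 'html.parser'))  # becomes the last child of #records
print(soup.prettify())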
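When the content of the new record is produced in code rather than pasted in as a string, the tags can also be built one by one with soup.new_tag and appended to the container. The following is only a minimal sketch under a few assumptions: the input is shortened to a single existing record, and the helper name add_record and its parameters are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the input HTML in the question.
html = '<div id="page"><div id="records"><div class="record"><p>first</p></div></div></div>'
soup = BeautifulSoup(html, 'html.parser')


def add_record(soup, title_text, disclaimer_text):
    """Append a new <div class="record"> as the last child of <div id="records">."""
    record = soup.new_tag('div')
    record['class'] = ['record']

    header = soup.new_tag('div')
    header['class'] = ['header']
    title = soup.new_tag('div')
    title['class'] = ['title']
    title.string = title_text
    header.append(title)

    disclaimer = soup.new_tag('div')
    disclaimer['class'] = ['disclaimer']
    paragraph = soup.new_tag('p')
    paragraph.string = disclaimer_text
    disclaimer.append(paragraph)

    record.append(header)
    record.append(disclaimer)

    # Appending to the container places the record right before its closing
    # tag, regardless of how many records are already present.
    soup.find('div', id='records').append(record)
    return record


add_record(soup, 'Something here to display 3rd time', 'Here i want to print content 3rd time')
records = soup.find('div', id='records').find_all('div', class_='record', recursive=False)
print(len(records))
```

Because the container is looked up by its id, the number of existing records does not matter.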

Related

Python, Selenium: I need to collect URLs but there are no a tags in the element

Good day, guys. I have a task to collect the name and email of each person from this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I use Selenium and Python to scrape it, but I have a problem getting the URL for each person. A sample of the person card structure is:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
And when you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to a page with the person's info, where I can access it. How can I use Selenium to do this, or are there other solutions that could help me?
Thanks!
If you notice, all the profile urls are of the form
https://www.espeakers.com/s/nsas/profile/id
where id is a 5-digit number such as 27397. So you just need to extract the id and concatenate it with the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By
url = 'https://www.espeakers.com/s/nsas/profile/'
# each tile has an id such as "sid12026"; [3:] drops the "sid" prefix
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names; profile_urls is a list containing the corresponding profile URLs.
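The slicing step can be tried without a browser. This is a small sketch, assuming every tile id has the form "sid" followed by the numeric id, as in id="sid12026" from the question; the helper name profile_url is hypothetical, and the URL pattern is the one stated in the answer rather than something verified here.

```python
BASE_URL = 'https://www.espeakers.com/s/nsas/profile/'


def profile_url(tile_id, base_url=BASE_URL):
    """Turn a tile id such as 'sid12026' into a profile URL."""
    if not tile_id.startswith('sid'):
        # Fail loudly instead of silently building a wrong URL.
        raise ValueError(f'unexpected tile id: {tile_id!r}')
    return base_url + tile_id[len('sid'):]


print(profile_url('sid12026'))
# https://www.espeakers.com/s/nsas/profile/12026
```

Checking the prefix before slicing guards against tiles whose id does not follow the expected pattern.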

How can I have my template display a list on two different div tags without having redundancies

I have a list from my model, but I want my template to display the list elements in groups of 4, or split at half the total length of the list. Example: say I have 10 elements in my list; I want 5 on the right side and 5 on the left side. Please see the screenshot below.
This is how I want my page to look:
But this is what I get:
This is my HTML file.
<div class="section-title">
<h2>Skills</h2>
<p>hsjkhvdkdjhvjkdfnv kjdf, dfhvkhdnfvkjldf,xhvnkldsv.mckldfnv ,dfhxncjcshfxdjvhcnjsdnckndjvbc d,sxbc kjdjsxcbjdksbvc kjs,bhzscs,zhcnlksjhlnzcklsnzjcjsdzcjb ds
cxdbjvcsdbzcjks,gdcbkjds,zbcn jkcdxbv,m dfxvchj bdxnvbjhdujxdnkck jdfvknc dfkjhvxjdknfxzjxvkc.
</p>
</div>
{% for skill in skills_list%}
<div class="row skills-content">
<div class="col-lg-6" data-aos="fade-up">
<div class="progress">
<span class="skill">{{skill.skill_name}} <i class="val">{{skill.skill_value}}</i></span>
<div class="progress-bar-wrap">
<div class="progress-bar" role="progressbar" aria-valuenow={{skill.skill_value}} aria-valuemin="0" aria-valuemax="100"></div>
</div>
</div>
</div>
</div>
{% endfor %}
</div>
views.py:
#### TEST
class TestView(generic.ListView):
    model = Skills
    template_name = 'portfolio_app/test.html'
########################URL.py
from django.urls import path
from portfolio_app.models import *
from . import views
urlpatterns = [
    path('', views.fact, name='index'),
    #path('index/',views.SkillView.as_view,name='index'),
    path('about/', views.about_me, name='about'),
    path('service/', views.ServiceView.as_view(), name='service'),
    path('resume/', views.ResumeView.as_view(), name='resume'),
    path('contact/', views.ContactView.as_view(), name='contact'),
    path('test/', views.TestView.as_view(), name='test'),
]
You can try to move <div class="row skills-content"> outside the for loop like this:
<div class="section-title">
<h2>Skills</h2>
<p>hsjkhvdkdjhvjkdfnv kjdf, dfhvkhdnfvkjldf,xhvnkldsv.mckldfnv ,dfhxncjcshfxdjvhcnjsdnckndjvbc
d,sxbc kjdjsxcbjdksbvc kjs,bhzscs,zhcnlksjhlnzcklsnzjcjsdzcjb ds
cxdbjvcsdbzcjks,gdcbkjds,zbcn jkcdxbv,m dfxvchj bdxnvbjhdujxdnkck jdfvknc
dfkjhvxjdknfxzjxvkc.
</p>
</div>
<div class="row skills-content">
{% for skill in skills_list%}
<div class="col-lg-6" data-aos="fade-up">
<div class="progress">
<span class="skill">{{skill.skill_name}} <i class="val">{{skill.skill_value}}</i></span>
<div class="progress-bar-wrap">
<div class="progress-bar" role="progressbar" aria-valuenow={{skill.skill_value}}
aria-valuemin="0" aria-valuemax="100"></div>
</div>
</div>
</div>
{% endfor %}
</div>
You should also remove the redundant last </div> to make it work correctly.
Use slice like this:
<div class="row skills-content">
<div class="col-lg-6" data-aos="fade-up">
{% for skill in skills_list|slice:":5" %}
<div class="progress">
<span class="skill">{{skill.skill_name}} <i class="val">{{skill.skill_value}}</i></span>
<div class="progress-bar-wrap">
<div class="progress-bar" role="progressbar" aria-valuenow={{skill.skill_value}} aria-valuemin="0" aria-valuemax="100"></div>
</div>
</div>
{% endfor %}
</div>
<div class="col-lg-6" data-aos="fade-up">
{% for skill in skills_list|slice:"5:" %}
<div class="progress">
<span class="skill">{{skill.skill_name}} <i class="val">{{skill.skill_value}}</i></span>
<div class="progress-bar-wrap">
<div class="progress-bar" role="progressbar" aria-valuenow={{skill.skill_value}} aria-valuemin="0" aria-valuemax="100"></div>
</div>
</div>
{% endfor %}
</div>
</div>
To get rid of the duplicated markup, create a separate template for a skill and include it in your current template using the include template tag, like this:
skills.html:
<div class="progress">
<span class="skill">{{skill.skill_name}} <i class="val">{{skill.skill_value}}</i></span>
<div class="progress-bar-wrap">
<div class="progress-bar" role="progressbar" aria-valuenow={{skill.skill_value}} aria-valuemin="0" aria-valuemax="100"></div>
</div>
</div>
Your current template:
<div class="row skills-content">
<div class="col-lg-6" data-aos="fade-up">
{% for skill in skills_list|slice:":5" %}
{% include 'skills.html' with skill=skill %}
{% endfor %}
</div>
</div>
Do the same for the second loop, using slice:"5:".
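The slice bounds ":5" and "5:" are fixed for a list of ten items. If the number of skills varies, one option is to compute the two halves in Python (for example in the view, before rendering) and loop over each half in its own column. A minimal sketch of the splitting step; the function name split_in_half and the sample skill names are made up for illustration:

```python
def split_in_half(items):
    """Return (left, right); the left half gets the extra item for odd lengths."""
    items = list(items)
    middle = (len(items) + 1) // 2
    return items[:middle], items[middle:]


left, right = split_in_half(['Python', 'Django', 'HTML', 'CSS', 'SQL'])
print(left)   # ['Python', 'Django', 'HTML']
print(right)  # ['CSS', 'SQL']
```

The two lists could then be passed to the template as separate context variables, one per col-lg-6 column.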

Web scraping problem : Data does not show when printed

So I tried to scrape this website: https://top-1000-sekolah.ltmpt.ac.id/site/page?id=2001
If you inspect the page, there are divs with the ids tab-1, tab-2, tab-3, and tab-4. I tried to scrape each id, but only the data from tab-1 was grabbed. What did I do wrong?
pk = driver.find_element_by_xpath("//div[@id='tab-1']")
pbm = driver.find_element_by_id('tab-2')
pu = driver.find_element_by_id('tab-3')
ppu = driver.find_element_by_id('tab-4')
The output I expect from tab-2 is :
Kemampuan Kuantitatif
2
Urut Nasional
1
Urut Provinsi
Rerata
640,253
Nilai Tertinggi
721,15
Nilai Terendah
511,14
Standar Deviasi
44,1
and currently the tab-2 output is blank ('').
Try doing this:
pbm = driver.find_element_by_id('tab-2')
print(pbm.text)
If that doesn't work, I suspect it is because the div with the id tab-2 has many child elements. You will need to select those individual child elements directly to get the data you need. Use the XPath method that you used for tab-1.
<div class="row">
<div class="col-lg-12 details order-2 order-lg-1">
<h3 align="center">
Kemampuan Memahami Bacaan dan Menulis
</h3>
<hr>
<div class="row">
<div class="col-lg-6 col-md-6">
<div class="count-box">
<i class="icofont-award"></i>
<span data-toggle="counter-up">5</span>
<p>Urut Nasional</p>
</div>
</div>
<div class="col-lg-6 col-md-6">
<div class="count-box">
<i class="icofont-award"></i>
<span data-toggle="counter-up">1</span>
<p>Urut Provinsi</p>
</div>
</div>
</div>
<hr>
<div class="row">
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Rerata</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>589,104</b></h3>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Nilai Tertinggi</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>709,61</b></h3>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Nilai Terendah</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>371,88</b></h3>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Standar Deviasi</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>65,96</b></h3>
</div>
</div>
</div>
</div>
</div>
</div>
For example, to get the heading (such as Kemampuan Kuantitatif):
name = driver.find_element_by_xpath('//*[@id="tab-2"]/div/div/h3')
print(name.text)

Scrape table with div class by sibling, if data found

I would like to scrape an HTML table whose elements are in <div class="..."> format. To scrape it, I think I'll need to do the following:
if found driver.find_element_by_xpath contains(footable-row-detail-name)
get value from /following-sibling which is (class="footable-row-detail-value")
This is just one table. The site I'm scraping has a lot of tables and some tables don't have all the data (that's why "if found")
I would like to use python 3 for that.
I hope I explained it well. The HTML code for one table:
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
197. Omeopatia, 202. Linfodrenaggio manuale, 205. Massaggio classico, 664. Riflessoterapia generale
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cognome:
</div>
<div class="footable-row-detail-value">
ABBONDANZIERI Katia
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Via:
</div>
<div class="footable-row-detail-value">
Place du Cirque, 2
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
NPA:
</div>
<div class="footable-row-detail-value">
1204
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Luogo:
</div>
<div class="footable-row-detail-value">
Genève
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Tel / Cellulare:
</div>
<div class="footable-row-detail-value">
022 328 23 44
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cellulare:
</div>
<div class="footable-row-detail-value">
079 601 92 75
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
<div class="thZone">
<div class="zCat">
METHODES DE MASSAGE
</div>
<div class="zThr">
Linfodrenaggio manuale
</div>
<div class="zThr">
Massaggio classico
</div>
<div class="zCat">
METHODES PRESCRIPTIVES
</div>
<div class="zThr">
Omeopatia
</div>
<div class="zCat">
METHODES REFLEXES
</div>
<div class="zThr">
Riflessoterapia generale
</div>
</div>
</div>
</div>
Any help is appreciated.
This runs for me. I am using jupyter and running this line by line. You might encounter errors when the elements aren't loaded yet so please make adjustments if an error occurs for you.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
driver = webdriver.Chrome()
driver.get("http://asca.ch/Partners.aspx?lang=it")
# Select the canton, accept the disclaimer and submit the search form
cantone = driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_ddl_cantons_Input"]""")
cantone.click()
cantone.send_keys('GE')
cantone.send_keys(Keys.ENTER)
confermo = driver.find_element_by_xpath("""//*[@id="MainContent__chkDisclaimer"]""")
confermo.click()
ricera = driver.find_element_by_xpath("""//*[@id="MainContent_btn_submit"]""")
ricera.click()
# Wait until the row toggles are present, then expand every row
toggle = driver.find_elements_by_class_name("""footable-toggle""")
print(toggle)
while not toggle:
    time.sleep(.2)
    toggle = driver.find_elements_by_class_name("""footable-toggle""")
for r in toggle:
    time.sleep(.2)
    r.click()
# Wait until the expanded detail cells are present
data = driver.find_elements_by_class_name("""footable-row-detail-cell""")
while not data:
    time.sleep(.2)
    data = driver.find_elements_by_class_name("""footable-row-detail-cell""")
# Rewrite each detail cell's divs as table markup so pandas can parse it
list_df = []
for r in data:
    datum = r.get_attribute('innerHTML')\
        .replace("""<div class="footable-row-detail-inner">""","<table>")\
        .replace("""<div class="footable-row-detail-row">""","<tr>")\
        .replace("""<div class="footable-row-detail-name">""","<td>")\
        .replace("""<div class="footable-row-detail-value">""","</td><td>")
    list_df.append(dict(pd.read_html(datum)[0].values.tolist()))
df = pd.DataFrame(list_df)
df.to_csv('data.csv')
print(df)
One solution using Python 3 is the html.parser module!
There is a simple example to get you started :)
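A minimal sketch of what an html.parser approach could look like, assuming the table HTML has already been obtained as a string (for example from the page source). The class name DetailParser and the shortened sample are made up for illustration; only the footable-row-detail-* class names come from the question.

```python
from html.parser import HTMLParser


class DetailParser(HTMLParser):
    """Collect (name, value) pairs from footable-row-detail-* divs."""

    ROLES = {
        'footable-row-detail-row': 'row',
        'footable-row-detail-name': 'name',
        'footable-row-detail-value': 'value',
    }

    def __init__(self):
        super().__init__()
        self.rows = []     # finished (name, value) tuples, in document order
        self._stack = []   # one entry per open <div>: 'row', 'name', 'value' or None
        self._name = []    # text chunks seen inside the current name div
        self._value = []   # text chunks seen inside the current value div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        role = next((self.ROLES[c] for c in classes if c in self.ROLES), None)
        self._stack.append(role)

    def handle_endtag(self, tag):
        if tag != 'div' or not self._stack:
            return
        if self._stack.pop() == 'row':
            # A row just closed: normalise whitespace and store the pair.
            name = ' '.join(' '.join(self._name).split()).rstrip(':')
            value = ' '.join(' '.join(self._value).split())
            self.rows.append((name, value))
            self._name, self._value = [], []

    def handle_data(self, data):
        # Nested divs inside a value div are still collected as value text.
        if 'value' in self._stack:
            self._value.append(data)
        elif 'name' in self._stack:
            self._name.append(data)


sample = '''
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">Cognome:</div>
<div class="footable-row-detail-value">ABBONDANZIERI Katia</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">NPA:</div>
<div class="footable-row-detail-value">1204</div>
</div>
</div>
'''

parser = DetailParser()
parser.feed(sample)
parser.close()
record = dict(parser.rows)
print(record.get('Cognome'))
print(record.get('Fax'))  # a field missing from this table simply yields None
```

Using record.get covers the "if found" case, since tables that lack a field just have no entry for it. Note that dict keeps only the last value when a name appears twice in one table, as "Discipline(s) thérapeutique(s):" does in the question, so parser.rows may be the better structure there.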

Unable to retrieve innerText/innerHTML in Python

Hovering over innerText shows the text data but not through Python
I am trying to retrieve innertext or innerHTML from the HTML from this website (see attached image). The HTML saved/printed from BeautifulSoup does not have the content seen in the attached image of the innerText.
import requests, re
from bs4 import BeautifulSoup
r=requests.get("https://jobs.ca.gov/CalHRPublic/Search/JobSearchResults.aspx#classid=441")
c=r.content
soup=BeautifulSoup(c,"html.parser")
print (soup.prettify())
When I inspect the page in Google Chrome, click on the div block, and copy the HTML, the copied HTML has all the data I am looking for.
How do I get the same data in Python or do I have to use Selenium?
<div class="card-block" id="collapse1234" itemscope="" itemtype="http://schema.org/Organization" role="tablist" aria-multiselectable="true">
<div class="row" role="presentation">
<div class="col-md-10 " role="presentation">
<a id="cphMainContent_rptResults_hlViewJobPosting_0" class="lead visitedLink" href="/CalHrPublic/Jobs/JobPosting.aspx?JobControlId=70488">ACCOUNTING ADMINISTRATOR I (SPECIALIST)</a>
</div>
<div class="col-md-2 tar">
<div id="cphMainContent_rptResults_pnlFavoriteJob_0" class="aspNetDisabled" style="display: inline;">
<i id="cphMainContent_rptResults_iIsNotFavorite_0" class="fa fa-star-o" aria-hidden="true" style="cursor:default;color:grey;opacity:.6;" title="You must be logged in to save a job as a Favorite." onclick="">
Log in to save job
</i>
<i id="cphMainContent_rptResults_iIsFavorite_0" class="fa fa-star" title="This job is saved" style="color:#fdb81e;cursor:pointer;display:none;" aria-hidden="true" onclick="removeUserFavorite(70488, $(this) );"> Job saved</i>
</div>
</div>
</div>
<div class="row" role="presentation">
<div class="col-sm-12 col-md-9" role="presentation">
<div class="row">
<div class="col-xs-12 col-sm-6" role="presentation">
<div class="working-title details row">
<div class="col-xs-6 job-label">Working Title:</div>
<div class="col-xs-6 job-details">
<span title="Keyword Relevance: 0">N/A</span>
</div>
</div>
<div class="position-number details row">
<div class="col-xs-6 job-label">Job Control:</div>
<div class="col-xs-6 job-details">
70488
</div>
</div>
<div class="salary-range details row">
<div class="col-xs-6 job-label">Salary Range:</div>
<div class="col-xs-6 job-details">
$5053.00 - $6325.00
</div>
</div>
<div class="schedule details row">
<div class="col-xs-6 job-label">Work Type/Schedule:</div>
<div class="col-xs-6 job-details">
Permanent Fulltime
</div>
</div>
</div>
<div class="col-xs-12 col-sm-6" role="presentation">
<div class="department details row">
<div class="col-xs-6 job-label">Department:</div>
<div class="col-xs-6 job-details">
Board of Equalization
</div>
</div>
<div class="location details row">
<div class="col-xs-6 job-label">Location:</div>
<div class="col-xs-6 job-details">
Sacramento County
</div>
</div>
<div class="filing-date details row">
<div class="col-xs-6 job-label">Publish Date:</div>
<div class="col-xs-6 job-details">
<time datetime="2016-06-30">
6/29/2017</time>
</div>
</div>
</div>
</div>
</div>
<div class="col-sm-12 col-md-3 align-right" role="presentation">
<div class="filing-date details row">
<div class="col-xs-12">
<div class="job-label">Filing Deadline:</div>
<div class="job-details">
<time datetime="2016-06-30">
7/14/2017
</time>
</div>
</div>
<div class="col-xs-12">
<a id="cphMainContent_rptResults_hlViewPosting_0" class="btn btn-secondary btn-block" href="/CalHrPublic/Jobs/JobPosting.aspx?JobControlId=70488">
<span class="ca-gov-icon-search"></span>
<span>View Job Posting</span>
</a>
</div>
</div>
</div>
</div>
</div>
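One detail worth checking in the request above: the part of the URL after # is a fragment, and HTTP clients do not send fragments to the server. So requests receives the page without classid=441, and anything that depends on it would have to be filled in by JavaScript in the browser (that last part is an assumption about this site, not something verified here). A small standard-library sketch showing which part of the URL is actually requested:

```python
from urllib.parse import urlsplit, urlunsplit

url = 'https://jobs.ca.gov/CalHRPublic/Search/JobSearchResults.aspx#classid=441'
parts = urlsplit(url)

# The fragment stays on the client side; it is not part of the HTTP request.
print(parts.fragment)

# The URL as the server sees it: everything except the fragment.
requested = urlunsplit(parts._replace(fragment=''))
print(requested)
```

If the results are indeed rendered by JavaScript, a browser-driven tool such as Selenium would be one way to get the same HTML that Chrome shows.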
