I'm using BeautifulSoup from bs4, version 4.10.0.
I'm scraping pages for a project I'm developing and ran into a problem: some of the elements I scraped are commented out for some reason.
<div class="h-[125] js-scroll-hidden" id="link-index-40">
<!-- <a href="/md5/dc0bbd5373a5ada24373640dab8defb3" class="custom-a flex items-center relative left-[-10] px-[10] py-2 hover:bg-[#00000011] ">
<div class="flex-none">
<div class="relative overflow-hidden w-[72] h-[108] flex flex-col justify-center">
<div class="absolute w-[100%] h-[90]" style="background-color: hsl(63deg 43% 73%)"></div>
<img class="relative inline-block" src="https://libgen.rs/covers/2274000/dc0bbd5373a5ada24373640dab8defb3-g.jpg" alt="" referrerpolicy="no-referrer" onerror="this.parentNode.removeChild(this)" loading="lazy" decoding="async"/>
</div>
</div>
<div class="relative top-[-1] pl-4 grow overflow-hidden">
<div class="truncate text-xs text-gray-500">English [en], pdf, 11.7MB, "The Dale Carnegie course in effective spea - Dale Carnegie.pdf"</div>
<h3 class="truncate text-xl font-bold">The Dale Carnegie course in effective speaking, human relations and developing courage and confidence, improving your memory, leadership training : how the course is conducted and what you do at each session</h3>
<div class="truncate text-sm">Dale Carnegie, 1989</div>
<div class="truncate italic">Dale Carnegie & Associates, Inc.</div>
</div>
</a>
--> </div>
I've searched, but every answer I found was about removing the comments together with their contents, which is not my case.
I've tried different ways to eliminate the comments, but none of them were successful.
I've tried changing the content of the tag to match the tags that have the desired format. It seemed fine at first, but it completely breaks the .find() and .find_all() methods, which I need later.
I looked through the contents for the comment delimiters so I could change them manually, but they did not appear. I did find a way to get the information, but it is too expensive for what I want to do: it requires converting the content of the tag that holds the information and then parsing it again with BeautifulSoup. I need to do this for more than 200 elements, and relatively quickly.
This is my desired result:
<div class="h-[125] js-scroll-hidden" id="link-index-40">
<a href="/md5/dc0bbd5373a5ada24373640dab8defb3" class="custom-a flex items-center relative left-[-10] px-[10] py-2 hover:bg-[#00000011] ">
<div class="flex-none">
<div class="relative overflow-hidden w-[72] h-[108] flex flex-col justify-center">
<div class="absolute w-[100%] h-[90]" style="background-color: hsl(63deg 43% 73%)"></div>
<img class="relative inline-block" src="https://libgen.rs/covers/2274000/dc0bbd5373a5ada24373640dab8defb3-g.jpg" alt="" referrerpolicy="no-referrer" onerror="this.parentNode.removeChild(this)" loading="lazy" decoding="async"/>
</div>
</div>
<div class="relative top-[-1] pl-4 grow overflow-hidden">
<div class="truncate text-xs text-gray-500">English [en], pdf, 11.7MB, "The Dale Carnegie course in effective spea - Dale Carnegie.pdf"</div>
<h3 class="truncate text-xl font-bold">The Dale Carnegie course in effective speaking, human relations and developing courage and confidence, improving your memory, leadership training : how the course is conducted and what you do at each session</h3>
<div class="truncate text-sm">Dale Carnegie, 1989</div>
<div class="truncate italic">Dale Carnegie & Associates, Inc.</div>
</div>
</a>
</div>
I found this answer, "How can I find a comment with specified text string", but for my project that approach would be too expensive.
Is there a way to do this natively in BeautifulSoup, without changing data types or doing anything very resource-intensive? (I'm willing to use another package if there is one that handles this situation more easily.)
Use .replace()
html = '''
<body>
<div>
<!-- <div></div>
<div></div>
<div></div> -->
</div>
</body>
'''
def remove_comments(html: str) -> str:
    # Strip the comment delimiters so the commented-out markup becomes live HTML
    return html.replace('<!--', '').replace('-->', '')

remove_comments(html)
result:
'''
<body>
<div>
<div></div>
<div></div>
<div></div>
</div>
</body>
'''
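The `.replace()` call works on the raw string before it is parsed. If the document has already been parsed, one possible alternative that stays inside BeautifulSoup is to find the `Comment` nodes and swap each one for the markup it contains. This is only a sketch: the helper name `uncomment` and the shortened HTML are made up for illustration, and it re-parses just the text of each comment with `html.parser`, not the whole page.

```python
from bs4 import BeautifulSoup, Comment

# Shortened stand-in for the commented-out block in the question (assumed).
html = '''
<div id="link-index-40">
<!-- <a href="/md5/dc0bbd5373a5ada24373640dab8defb3"><h3>Title</h3></a>
--> </div>
'''

def uncomment(soup):
    """Replace every HTML comment in `soup` with the markup it contains (hypothetical helper)."""
    # find_all returns a finished list, so the tree can be modified safely in the loop.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Parse only the comment text, which is much smaller than the full page.
        fragment = BeautifulSoup(comment, "html.parser")
        # Move the parsed nodes in front of the comment, then drop the comment itself.
        for node in list(fragment.contents):
            comment.insert_before(node)
        comment.extract()
    return soup

soup = uncomment(BeautifulSoup(html, "html.parser"))
link = soup.find("div", id="link-index-40").find("a")
print(link["href"])
```

After this, `.find()` and `.find_all()` see the former comment contents as ordinary tags. Note that, like `.replace()`, it also turns genuine comments into text or markup, so it may need a filter if the page mixes both kinds.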
Related
I am using Selenium with Python to automate a site. The problem I face is that I have to upload a file, but there is no input of type file on which I could use the send_keys() method.
The File upload element:
<div id="data-assets-interior-file-upload" data-upload-properties="{"formId":"form-main-1","path":"data[assets][interior]","modalUpload":"Uploading...","warnExtensionsStrings":{"pdf":"<div class=\"a-box a-alert-inline a-alert-inline-warning\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"><\/i><div class=\"a-alert-content\">\n Most PDF files do not produce great results in an automated conversion process. We recommend using a Word, Mobi, ePUB or HTML file if you have one. <a href=\"\/en_US\/help\/topic\/A2GF0UFHIYG9VQ?ref_=_fg\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more.<\/a>\n <\/div><\/div><\/div>\n <div id=\"file-warn-actions\" class=\"a-form-actions a-spacing-none a-spacing-top-large\">\n <div class=\"a-row a-spacing-none\">\n <div class=\"a-column a-span6\">\n <span class=\"a-declarative\" data-action=\"potter-file-warn-extension-continue\" data-potter-file-warn-extension-continue=\"{}\">\n <span id=\"file-warn-extension-continue\" class=\"a-button a-button-base button-fill\"><span class=\"a-button-inner\"><button id=\"file-warn-extension-continue-announce\" class=\"a-button-text\" type=\"button\">\n Continue with PDF\n <\/button><\/span><\/span>\n <\/span>\n <\/div>\n <div class=\"a-column a-span6 a-span-last\">\n <span class=\"a-declarative\" data-action=\"potter-file-warn-extension-cancel\" data-potter-file-warn-extension-cancel=\"{}\">\n <span id=\"file-warn-extension-cancel\" class=\"a-button a-button-primary button-fill\"><span class=\"a-button-inner\"><button id=\"file-warn-extension-cancel-announce\" class=\"a-button-text\" type=\"button\">\n I have another format\n <\/button><\/span><\/span>\n <\/span>\n <\/div>\n <\/div>\n <\/div>","pdf-header":"Do you have another 
format?"},"acceptedExtensions":"doc,docx,zip,htm,html,mobi,azw,epub,rtf,txt,pdf,kpf","multipart":null,"persistSuccess":true,"warnExtensions":["pdf"],"key":"save","url":"\/en_US\/title-setup\/kindle\/A3U1YUNVSBYTMZ\/content\/action\/save","workflowId":"assets.interior","assetType":"DIGITAL_BOOK_BLOCK"}" class="a-section jele-file-field">
<div class="a-section a-spacing-none file-upload-options-section">
<p class="a-spacing-small">
</p>
<div class="a-row file-upload-extra-info-message-section">
<div class="a-column a-span12">
<div class="a-box a-alert a-alert-info"><div class="a-box-inner a-alert-container"><i class="a-icon a-icon-alert"></i><div class="a-alert-content">Use Kindle Create to transform your manuscript to an eBook with professional book themes, images, and Table of Contents. Click here to download for free.</div></div></div>
</div>
</div>
<br/>
<div class="a-row file-upload-browse-section">
<div class="a-column a-span12">
<span class="a-declarative" data-action="browse-clicked" data-browse-clicked="{"signInRequired":false,"id":"data-assets-interior-file-upload"}">
<span id="data-assets-interior-file-upload-browse-button" class="a-button a-button-primary file-upload-browse-button"><span class="a-button-inner"><button id="data-assets-interior-file-upload-browse-button-announce" class="a-button-text" type="button">
Upload Book
</button></span></span>
</span>
<span class="a-declarative" data-action="file-selected" data-file-selected="{}" id="data-assets-interior-uploader">
<span class="fileuploader a-hidden"></span>
</span>
<p class="a-spacing-top-small a-size-mini a-color-tertiary a-text-italic">
</p>
</div>
</div>
<input type="hidden" name="" value="doc,docx,zip,htm,html,mobi,azw,epub,rtf,txt,pdf,kpf" id="data-assets-interior-file-upload-accepted-extensions" class="accepted-extensions"/>
</div> </div>
Can anyone let me know how to handle this scenario? If you are going to recommend another library for it, please post relevant examples in Python as well. Thank you.
<span class="ui_qtext_rendered_qtext">
<p class="ui_qtext_para u-ltr u-text-align--start">
<div class="ui_qtext_image_outer">
<div class="ui_qtext_image_wrapper">
<img class="landscape ui_qtext_image zoomable_in zoomable_in_feed"
src="https://qph.fs.quoracdn.net/main-qimg-"
master_src="https://qph.fs.quoracdn.net/main-qimg-"
master_w="2400" master_h="1260">
</div>
</div>
<p class="ui_qtext_para u-ltr u-text-align--start">"It’s all doable, but
it’s technically very difficult."</p>
<p class="qtext_citation_lead">Footnotes</p>
<p class="citation" id="KLPFK">
[1]
<span class="qlink_container">
<a href="https://futurism.com/the-byte/bionic-eye-prototype"
rel="noopener nofollow" target="_blank" onclick="return Q.openUrl(this, 184353372);"
class="external_link" data-qt-tooltip="futurism.com">This experimental bionic eye could help the
blind see</a>
</span>
</p>
</span>
I want both the img and the p tags matched by the same XPath.
So I did this:
copy = driver.find_elements_by_xpath("//*[contains(@class,'ui_qtext')]")
but it doesn't work, because the match for the img has to be exact.
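One way to read the requirement is "match the img and the p in a single query". In XPath 1.0 that can be written as a union, for example `//img[contains(@class,'ui_qtext_image')] | //p[contains(@class,'ui_qtext_para')]`. A browser session cannot be reproduced here, so the sketch below swaps in BeautifulSoup and a CSS selector group to show the same idea; the trimmed HTML is an assumption based on the snippet above.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the markup from the question (assumed for illustration).
html = '''
<span class="ui_qtext_rendered_qtext">
  <div class="ui_qtext_image_wrapper">
    <img class="landscape ui_qtext_image" src="https://qph.fs.quoracdn.net/main-qimg-">
  </div>
  <p class="ui_qtext_para u-ltr">It's all doable, but it's technically very difficult.</p>
  <p class="qtext_citation_lead">Footnotes</p>
</span>
'''

soup = BeautifulSoup(html, "html.parser")

# A comma in a CSS selector plays the role of the XPath union operator "|":
# an element matching either branch is returned, in document order.
matches = soup.select("img.ui_qtext_image, p.ui_qtext_para")
names = [tag.name for tag in matches]
print(names)
```

With Selenium 4, where the `find_elements_by_xpath` helper is no longer available, the union expression could be passed to `driver.find_elements(By.XPATH, ...)` instead.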
I know, from the title this question looks the same as thousands of others, but I have searched all the related and similar questions. What I'm asking is, given this HTML (just an example):
<html>
<body>
<div class="div-share noprint">
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
</div>
</div>
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="some img" class="playblk" height="25" src="othersource" title="some othertitle" width="25"/></span>
</a>
</div>
<div class="div-share">
<h1>"The Divine Wings Of Tragedy" lyrics</h1></div>,
<div class="pther">
<h2><b>Symphony X Lyrics</b></h2>
</div>
<div class="ringtone">
<span id="cf_text_top"></span>
</div>
<div>
<i>[Part I - At the Four Corners of the Earth]</i>
<br/>
<br/> On the edge of paradise
<br/> Tears of woe fall, cold as ice
<br/> Hear my cry
<br/>
</div>
</body>
</html>
I want to find the only tag that has no attributes. Not an empty attribute, as I saw in other questions, nor a strange specific attribute, nor attrs = None: that tag has nothing else at all. But if I use findAll, I get all the other tags in the HTML as well, and the same happens if I use attrs = False, attrs = None and so on.
So is there a way to do this?
Thanks a lot!
You can pass a lambda function to the find_all method that checks the tag name and that there are no attrs within the element:
soup.find_all(lambda tag: tag.name == 'div' and not tag.attrs)
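As a quick check of the lambda filter above, here is a small self-contained run. The HTML is a shortened version of the question's markup, kept only as an assumption for the demo.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question (assumed for illustration).
html = '''
<div class="div-share"><h1>"The Divine Wings Of Tragedy" lyrics</h1></div>
<div class="ringtone"><span id="cf_text_top"></span></div>
<div>
<i>[Part I - At the Four Corners of the Earth]</i>
<br/> On the edge of paradise
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# tag.attrs is a dict; an element without attributes has an empty (falsy) dict.
bare_divs = soup.find_all(lambda tag: tag.name == "div" and not tag.attrs)
print(len(bare_divs))
print(bare_divs[0].i.get_text())
```

The function is called once for every tag in the tree, so the `tag.name == "div"` test is what keeps other attribute-free tags such as `<i>` and `<br/>` out of the result.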
I am looking at scraping the information below using both Selenium and bs4, and was wondering: if I find the div tag below, is it possible to scrape the data inside the quotation marks? For example: data-room-type-code="SUK"
<div
class="sl-flexbox room-price-item hidden-top-border"
data-room-name="Superior Shard Room"
data-bed-type="K"
data-bed-name="King"
data-pay-type-tag-filter="No Prepayment"
data-cancel-tag-filter=""
data-breakfast-tag-filter=""
data-room-type-code="SUK"
data-rate-code="ZBAR"
data-price="430"
>
<div class="room-price-basic-info">
<div class="room-price-title title-regular">Flexible Rate / CustomStay</div>
<ul class="abstract text-regular">
<li>No Prepayment</li>
</ul>
<div
class="show-detail text-btn js-show-detail"
data-index="0-productRates-0"
>
OFFER DETAILS
</div>
</div>
<div class="room-price-book-info">
<div class="number text-medium">GBP 430</div>
</div>
<div class="boot-btn text-medium js-booking-room" data-type="PRICE">
Book Now
</div>
</div>
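For the bs4 side, attribute values can be read from a `Tag` much like entries in a dictionary. The following is a minimal sketch on a shortened copy of the div above; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the div from the question (assumed for illustration).
html = '''
<div class="sl-flexbox room-price-item hidden-top-border"
     data-room-name="Superior Shard Room"
     data-room-type-code="SUK"
     data-rate-code="ZBAR"
     data-price="430">
  <div class="number text-medium">GBP 430</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
room = soup.find("div", class_="room-price-item")

# Attributes behave like dictionary entries on a Tag.
room_type = room["data-room-type-code"]   # raises KeyError if the attribute is missing
rate_code = room.get("data-rate-code")    # returns None if the attribute is missing

# Collect every data-* attribute in one pass.
data_attrs = {k: v for k, v in room.attrs.items() if k.startswith("data-")}
print(room_type, rate_code, data_attrs["data-price"])
```

On a Selenium element, the matching call would be `element.get_attribute("data-room-type-code")`.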
It is hard to describe my real situation, so I will use a website directly:
https://www.w3schools.com/php/php_intro.asp
The markup below is extremely long, so you can just skim it. When you open the link, you will find that every content block is framed by two lines (hr tags), one above and one below, so my goal is to scrape the content of every block between two hr tags.
(In fact, the difficulty is the unknown number of tags and the varying structure between each pair of hr tags.)
How can I achieve this?
<div class="w3-col l10 m12" id="main">
<div id="mainLeaderboard" style="overflow:hidden;">
<!-- MainLeaderboard-->
<!--<pre>main_leaderboard, all: [728,90][970,90][320,50][468,60]</pre>-->
<div id="snhb-main_leaderboard-0" data-google-query-id="CJmd77_F_OMCFUSJwgodAWAIsg"><div id="google_ads_iframe_/22152718/sws-hb//w3schools.com//main_leaderboard_0__container__" style="border: 0pt none;"><iframe id="google_ads_iframe_/22152718/sws-hb//w3schools.com//main_leaderboard_0" title="3rd party ad content" name="google_ads_iframe_/22152718/sws-hb//w3schools.com//main_leaderboard_0" width="468" height="60" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" srcdoc="" style="border: 0px; vertical-align: bottom;" data-google-container-id="d" data-load-complete="true"></iframe></div></div>
<!-- adspace leaderboard -->
</div>
<h1>Python <span class="color_h1">Tutorial</span></h1>
<div class="w3-clear nextprev">
<a class="w3-left w3-btn" href="/default.asp">❮ Home</a>
<a class="w3-right w3-btn" href="python_intro.asp">Next ❯</a>
</div>
<div class="w3-panel w3-info intro">
<p>Python is a programming language.</p>
<p>Python can be used on a server to create web applications.</p>
<a class="w3-btn w3-margin-bottom" href="python_intro.asp">Start learning Python now »</a>
</div>
<hr>
<h2>Learning by Examples</h2>
<p>Our "Show Python" tool makes it easy to learn Python, it shows both the
code and the result.</p>
<div class="w3-example">
<h3>Example</h3>
<div class="w3-code notranslate pythonHigh"><span class="pythoncolor" style="color:black">
<span class="pythonkeywordcolor" style="color:mediumblue">print</span>(<span class="pythonstringcolor" style="color:brown">"Hello, World!"</span>)<span class="pythonnumbercolor" style="color:red">
</span> </span></div>
<a target="_blank" class="w3-btn w3-margin-bottom" href="showpython.asp?filename=demo_default">Run example »</a>
</div>
<p><b>Click on the "Run example" button to see how it works.</b></p>
<hr>
<h2>Python File Handling</h2>
<p>In our File Handling section you will learn how to open, read, write, and
delete files.</p>
<p>Python File Handling</p>
<hr>
<h2>Python Database Handling</h2>
<p>In our database section you will learn how to access and work with MySQL and MongoDB databases:</p>
<p>Python MySQL Tutorial</p>
<p>Python MongoDB Tutorial</p>
<hr>
<h2>Python Exercises</h2>
<form autocomplete="off" id="w3-exerciseform" action="exercise.asp?filename=exercise_syntax1" method="post" target="_blank">
<h2>Test Yourself With Exercises</h2>
<div class="exercisewindow">
<h2>Exercise:</h2>
<p>Insert the missing part of the code below to output "Hello World".</p>
<div class="exerciseprecontainer">
<pre><input name="ex1" maxlength="5" style="width: 54px;">("Hello World")
</pre>
</div>
<br>
<button type="submit" class="w3-btn w3-margin-bottom">Submit Answer »</button>
<p><a target="_blank" href="exercise.asp?filename=exercise_syntax1">Start the Exercise</a></p>
</div>
</form>
<hr>
<div id="midcontentadcontainer" style="overflow:auto;text-align:center">
<!-- MidContent -->
<!--<pre>mid_content, all: [300,250][336,280][728,90][970,250][970,90][320,50][468,60]</pre>-->
<div id="snhb-mid_content-0" data-google-query-id="CNqS8r_F_OMCFUSJwgodAWAIsg"><div id="google_ads_iframe_/22152718/sws-hb//w3schools.com//mid_content_0__container__" style="border: 0pt none;"><iframe id="google_ads_iframe_/22152718/sws-hb//w3schools.com//mid_content_0" title="3rd party ad content" name="google_ads_iframe_/22152718/sws-hb//w3schools.com//mid_content_0" width="336" height="280" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" srcdoc="" style="border: 0px; vertical-align: bottom;" data-google-container-id="f" data-load-complete="true"></iframe></div></div>
</div>
<hr>
<h2>Python Examples</h2>
<p>Learn by examples! This tutorial supplements all explanations with clarifying examples.</p>
<p>See All Python Examples</p>
<hr>
<h2>Python Quiz</h2>
<p>Learn by taking a quiz! This quiz will give you a signal of how much you know, or do not know, about Python.</p>
<p>Python Quiz</p>
<hr>
<h2>Python Reference</h2>
<p>You will also find complete function and method references:</p>
<p>Reference Overview</p>
<p>Built-in Functions</p>
<p>String Methods</p>
<p>List/Array Methods</p>
<p>Dictionary Methods</p>
<p>Tuple Methods</p>
<p>Set Methods</p>
<p>File Methods</p>
<p>Python Keywords</p>
<hr>
<h2>Download Python</h2>
<p>Download Python from the official Python web site:
<a target="_blank" href="https://python.org/">https://python.org</a></p>
<hr>
<h2>Python Exam - Get Your Diploma!</h2>
<div class="w3-row">
<div class="w3-third w3-container w3-padding-24"><img src="/images/w3certified_logo_250.png" style="max-width:100%;" alt="W3Schools Certification"> </div>
<div class="w3-twothird w3-container"><h2>W3Schools' Online Certification</h2>
<p>The perfect solution for professionals who need to balance work, family, and career building.</p>
<p>More than 25 000 certificates already issued!</p>
</div>
</div>
<p><a class="w3-btn" href="/cert/default.asp">Get Your Certificate »</a></p>
<p style="clear:both;">The HTML Certificate documents your knowledge of HTML.</p>
<p>The CSS Certificate documents your knowledge of advanced CSS.</p>
<p>The JavaScript Certificate documents your knowledge of JavaScript and HTML DOM.</p>
<p>The Python Certificate documents your knowledge of Python.</p>
<p>The jQuery Certificate documents your knowledge of jQuery.</p>
<p>The SQL Certificate documents your knowledge of SQL.</p>
<p>The PHP Certificate documents your knowledge of PHP and MySQL.</p>
<p>The XML Certificate documents your knowledge of XML, XML DOM and XSLT.</p>
<p>The Bootstrap Certificate documents your knowledge of the Bootstrap framework.</p>
<div class="w3-clear nextprev">
<a class="w3-left w3-btn" href="/default.asp">❮ Home</a>
<a class="w3-right w3-btn" href="python_intro.asp">Next ❯</a>
</div>
</div>
I'm not sure I understood this correctly, but if you just want to adjust the layout, you can do it with CSS alone: organize your content in div blocks, give each one the same class, and use a border-bottom instead of hr, like this:
#main{ max-width:1170px; margin: 0 auto;}
.bg_block{ width:100%; border-bottom: 1px solid #666; padding: 20px; box-sizing: border-box;}
<div id='main'>
<div class='bg_block'>
<div class="w3-clear nextprev">
<a class="w3-left w3-btn" href="/default.asp">❮ Home</a>
<a class="w3-right w3-btn" href="python_intro.asp">Next ❯</a>
</div>
<div class="w3-panel w3-info intro">
<p>Python is a programming language.</p>
<p>Python can be used on a server to create web applications.</p>
<a class="w3-btn w3-margin-bottom" href="python_intro.asp">Start learning Python now »</a>
</div>
</div><!--bg_block-->
</div><!--main-->
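If the aim is to scrape the blocks rather than restyle them, one possible approach is to walk the direct children of the `#main` container and start a new group each time an `hr` appears. This is a sketch under assumptions: it uses BeautifulSoup, the helper name `blocks_between_hr` is made up, the HTML is a shortened version of the page above, and it only works when the `hr` tags are direct children of the container.

```python
from bs4 import BeautifulSoup, Tag

# Shortened version of the page markup from the question (assumed for illustration).
html = '''
<div id="main">
<h1>Python Tutorial</h1>
<hr>
<h2>Learning by Examples</h2>
<p>Our "Show Python" tool makes it easy to learn Python.</p>
<hr>
<h2>Python File Handling</h2>
<p>In our File Handling section you will learn how to open files.</p>
<p>Python File Handling</p>
<hr>
<h2>Download Python</h2>
</div>
'''

def blocks_between_hr(container):
    """Group the direct child tags of `container` into lists split at <hr> (hypothetical helper)."""
    blocks, current = [], []
    for child in container.children:
        if not isinstance(child, Tag):
            continue  # skip whitespace strings and comments between tags
        if child.name == "hr":
            blocks.append(current)
            current = []
        else:
            current.append(child)
    blocks.append(current)
    # blocks[0] comes before the first <hr> and blocks[-1] after the last one;
    # drop both to keep only content enclosed by two <hr> tags.
    return blocks[1:-1]

soup = BeautifulSoup(html, "html.parser")
result = blocks_between_hr(soup.find("div", id="main"))
for block in result:
    print([tag.name for tag in block])
```

Because each group is just a list of whatever tags appear between two separators, the number of tags and their structure can differ from block to block.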