beautifulsoup, find text using re.compile - python

Why is this not finding anything? I'm looking to extract the id out of this html.
from bs4 import BeautifulSoup
import re
a="""
<html lang="en-US">
<head>
<title>
Coverage
</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="2017-07-12T08:12:00.0000000" name="created"/>
</head>
<body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
<div id="div:{1586118a-0184-027e-07fc-99debbfc309f}{35}" style="position:absolute;left:1030px;top:449px;width:624px">
<p id="p:{dd73b86c-408c-4068-a1e7-769ad024cf2e}{40}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{FB} 2 Facebook 465.8 /
<span style="color:green">
12
</span>
<span style="color:green">
5
</span>
<span style="color:green">
10
</span>
<span style="color:red">
-3
</span>
/ updated
</p>
</div>
</body>
</html>
"""
soup=BeautifulSoup(a,'html.parser')
ticker='{FB}'
target= soup.find('p', text = re.compile(ticker))
There is more than one <p>; I just omitted the rest. I need the text= part.
I've also tried the wildcard (.*) but still can't get it to work.
I have to get the id by ticker... I don't know anything else, and the rest of the page is dynamic.

This would get the "id" value for <p> tags that contain the text "{FB}" (the original find() most likely returns nothing because text= is matched against the tag's .string, which is None here since the <p> has several children):
ticker = '{FB}'
target = soup.find_all('p')
for items in target:
    check = items.text
    if '{FB}' in check:
        print(items.get("id"))
More compact way:
for elem in soup(text=re.compile(ticker)):
    print(elem.parent.get("id"))
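If the ticker can ever contain regex metacharacters, escaping it before compiling is a little safer (a small precaution on top of the above; '{FB}' happens to compile as a literal pattern anyway):
import re
# re.escape() makes every character in the ticker match literally.
pattern = re.compile(re.escape(ticker))
for elem in soup(text=pattern):
    print(elem.parent.get("id"))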

Related

Can't select dropdown in selenium python

I am new to selenium. I have managed to login in to our work practice management system. So the basic setup is fine.
I am then faced with this:
I need to drop down the Work dropdown and select a premade report (All Tasks For Export):
I have tried a lot of things: CSS selector, class, ID.
But I always get the error: Message: no such element: Unable to locate element:
Code:
driver = webdriver.Chrome()
driver.get('https://xxxxx.senta.co/a/i/a')
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"input[name='email']"))).send_keys("xxx#xxx.com")
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']"))).send_keys("xxxxxx")
driver.find_element(By.CSS_SELECTOR, 'button#submit').click()
dropdown = Select(driver.find_element(By.CSS_SELECTOR,'major#navjobs'))
But maybe I am selecting the wrong element entirely. I will post the HTML below. Maybe I understand Selenium but not the HTML! Thanks in advance.
And then the HTML for the elements in the list looks like this:
OK, here is the HTML of the page. Not sure it's going to help much!
<!DOCTYPE html><html lang="en" ng-app="senta" se-file-drop="onFileSelect($files)" ng-controller="BodyCtrl" ng-class="{ selectfile:selectfile }"> <head> <base href="/"> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" /> <meta name="HandheldFriendly" content="true" /> <meta name="robots" content="noindex"> <link id="favicon" rel='icon' type='image/ico' sizes='32x32' href='https://dsik6juztonps.cloudfront.net/client/public/images/favicon.ico'> <title ng-bind="$root.notiftitle + $root.title + $root.appTitle">Loading...</title> <link href="https://dsik6juztonps.cloudfront.net/client/public/dist/lib/20211115/lib.css" rel="stylesheet" /> <link href="https://dsik6juztonps.cloudfront.net/client/public/dist/_m258abe0ebbf4d5628de49b9a35dff42e/style.css" rel="stylesheet" /> <script src="https://dsik6juztonps.cloudfront.net/client/public/dist/lib/20211115/prelib.min.js"></script> <script src="https://dsik6juztonps.cloudfront.net/client/public/dist/lib/20211115/momentjs/en-gb.js"></script> </head> <body id="{{$root.bodyId}}" class="droptarget {{userClass}} {{skinClass}}" ng-class="{ 'on-scrolled':notAtTop, 'preheader-on':preheaderOn, 'preheader2-on':preheader2On, 'preheadertimer-on':preheaderTimerOn }"> <div ng-if="$root.user.loggedin" ng-include="'https://dsik6juztonps.cloudfront.net/client/public/dist/_m258abe0ebbf4d5628de49b9a35dff42e/html/en-gb/header.html'" ng-controller="NavBarCtrl"></div> <div class="dropindic"> <div class="lightbox"></div> <div class="centred"> <p class="text" style="">Drop your files here to upload into Senta</p><i class="fa fa-file"></i> <p class="selectbutton">Alternatively: <input type="file" id="selectfile"> <button type="button" class="btn btn-primary pseudoselect" ng-click="selectFile()">Select file</button> <button type="button" class="btn btn-normal" ng-click="cancelSelectFile()">Cancel</button> </p> </div> </div> <div ng-if="deepheader" class="deepheader"></div> <div class="container"> <div ui-view> <div class="positioner"> <div class="notifier"> <span class="spinning"><span class="spinner"><i class="fa fa-spin fa-refresh"></i></span></span> <span class="msg">Loading...</span> </div> </div> </div> <div id="react-root"></div> </div> <div ng-if="$root.expressionfooter" ng-include="'https://dsik6juztonps.cloudfront.net/client/public/dist/_m258abe0ebbf4d5628de49b9a35dff42e/html/en-gb/settings/expression/footer.html'" ng-controller="ExpressionTesterCtrl" ></div> <div ng-if="$root.previewfooter" ng-include="'https://dsik6juztonps.cloudfront.net/client/public/dist/_m258abe0ebbf4d5628de49b9a35dff42e/html/en-gb/modal/preview-footer.html'"></div> <script src="https://dsik6juztonps.cloudfront.net/client/public/dist/lib/20211115/postlib.min.js"></script> <script src="https://dsik6juztonps.cloudfront.net/client/public/dist/_m258abe0ebbf4d5628de49b9a35dff42e/app.min.en-gb.js"></script> <script src="https://www.gstatic.com/charts/loader.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.9.638/pdf.min.js"></script> </body><script src="https://dsik6juztonps.cloudfront.net/react/static/js/en-gb/main.eda07c95.js"></script></html>
It looks like you want to find an element with By.CSS_SELECTOR but are writing the locator as if it were By.XPATH.
For example, if you want to find the <ul> shown in your screenshot,
you can use:
driver.find_element(By.CLASS_NAME, 'dropdown-menu')
driver.find_element(By.CSS_SELECTOR, '.dropdown-menu')
driver.find_element(By.XPATH, "//ul[@class='dropdown-menu']")
driver.find_element(By.XPATH, "//ul[@id='work-dropdown'][@class='dropdown-menu']")
but I can't quite tell what you are looking for...
UPDATE:
Try to find all <li> with ng-repeat='viewt in viewst track by viewt._id':
driver.find_elements(By.XPATH, '//li[@ng-repeat="viewt in viewst track by viewt._id"]')[0]  # or whichever index you need
and choose the one you need by index. It's really hard to help you without the HTML code...
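A sketch of the usual pattern for this kind of menu (the locators below are guesses, since the relevant dropdown HTML isn't shown): Select() only works on a real <select> element, so for an Angular/Bootstrap-style dropdown you normally click the toggle first, then click the menu entry by its visible text:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
# Open the "Work" dropdown. The locator is only a guess based on the
# 'major#navjobs' attempt in the question -- inspect the toggle and adjust.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#navjobs"))).click()
# Then pick the saved report by its visible text inside the open menu.
wait.until(EC.element_to_be_clickable(
    (By.XPATH, "//ul[@class='dropdown-menu']//li[contains(., 'All Tasks For Export')]")
)).click()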

Preserve the formatting in beautifulsoup

import bs4
foo = """<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<p>
This is a paragraph1
</p>
<h2>
This is heading2
</h2>
</body>
</html>"""
def remove_p(text):
    obj = bs4.BeautifulSoup(text, features="html.parser")
    for tag in obj.find_all("p"):
        tag.decompose()
    return str(obj)
foo = remove_p(foo)
print(foo)
beautifulsoup4 4.11.0
bs4 0.0.1
bs4 inserts blank lines corresponding to <p>. I expected entries corresponding to <p> tag to be deleted - no blank lines.
bs4 removes the leading spaces for opening tags. However, it doesn't remove leading spaces for closing tags </h2> and text.
I would like the function to return the text with the <p> entries removed, without modifying the formatting. Please suggest a way to do this.
Actual output
<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>

<h2>
This is heading2
</h2>
</body>
</html>
Expected Output
<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<h2>
This is heading2
</h2>
</body>
</html>
EDIT:
Thanks for all the suggestions to use prettify(). I have already tried using prettify() but it completely changes the formatting of the document. Excuse me for not mentioning it to start with.
To add some context, we receive these documents from our upstream, and we are supposed to just delete some nodes without changing the formatting.
This is not exactly what you want, but there is a way to prettify the code: use obj.prettify() instead of str(obj).
You can use the prettify() function that is built into BeautifulSoup.
Here is an example from the BeautifulSoup documentation:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
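Since prettify() re-indents the whole document, another option, just a sketch assuming the html.parser setup from the question, is to also remove the whitespace-only text node that follows each decomposed <p>, so no blank line is left behind:
import bs4

def remove_p(text):
    obj = bs4.BeautifulSoup(text, features="html.parser")
    for tag in obj.find_all("p"):
        # Drop the whitespace-only text node right after the tag,
        # otherwise a blank line remains where the <p> used to be.
        nxt = tag.next_sibling
        if isinstance(nxt, bs4.NavigableString) and not nxt.strip():
            nxt.extract()
        tag.decompose()
    return str(obj)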

Switch all "href = (link)" with "onclick = (PythonScript(link)) "

I am working on a webscraper that scrapes a website, does some stuff to the body of the page, and outputs that into a new html file. One of the features would be to take any hyperlinks in the html file and replace them so that clicking one runs a script with the link as its input.
I want to go from this..
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
To this....
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and I read about jQuery and ajax but do not know these tools and would prefer to do this in python. Is it possible to do this using File IO in python?
You can do something like this using BeautifulSoup:
PS: You need to install BeautifulSoup: pip install bs4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    actual_link = link['href']
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
    out.write(str(soup))
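One small follow-up, not part of the original answer: as generated above, the onclick value passes the path without quotes. If pythonScript is meant to receive a JavaScript string, you would probably quote the argument instead (hypothetical, depending on how your handler is wired up):
for link in links:
    actual_link = link['href']
    link['href'] = '#'
    # Wrap the path in quotes so it is passed as a string literal.
    link['onclick'] = "pythonScript('{}')".format(actual_link)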

Python - %paste

I am following a tutorial where the teacher pastes the html inline into our Scrapy shell via %paste (the html below):
html_doc = " "
<html>
<head>
<title>Title of hte page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
" "
but I get this error:
<html>
File "<console>", line 1
<html>
SyntaxError: invalid syntax
I have imported tkinter, and looked up other resources, but can't figure out how to get the html inline.
Try using triple quotes ("""):
html_doc = """
<html>
<head>
<title>Title of hte page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
"""

How do I match a tag containing only the stated class, not any others, using BeautifulSoup?

Is there a way to use BeautifulSoup to match a tag with only the indicated class attribute, not the indicated class attribute and others? For example, in this simple HTML:
<html>
<head>
<title>
Title here
</title>
</head>
<body>
<div class="one two">
some content here
</div>
<div class="two">
more content here
</div>
</body>
</html>
is it possible to match only the div with class="two", but not match the div with class="one two"? Unless I'm missing something, that section of the documentation doesn't give me any ideas. This is the code I'm using currently:
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<title>
Title here
</title>
</head>
<body>
<div class="one two">
should not be matched
</div>
<div class="two">
this should be matched
</div>
</body>
</html>
'''
soup = BeautifulSoup(html)
div_two = soup.find("div", "two")
print(div_two.contents[0].strip())
I'm trying to get this to print this should be matched instead of should not be matched.
EDIT: In this simple example, I know that the only options for classes are "one two" or "two", but in production code, I'll only know that what I want to match will have class "two"; other tags could have a large number of other classes in addition to "two", which may not be known.
On a related note, it's also helpful to read the documentation for version 4, not version 3 as I previously linked.
Try:
divs = soup.findAll('div', class_="two")
for div in divs:
    if div['class'] == ['two']:
        pass  # handle class="two"
    else:
        pass  # handle other cases, including but not limited to "one two"
Hope the code below helps you, though I didn't try this one:
soup.find("div", { "class" : "two" })
