I have a string with HTML blocks, like
a = '<div>Test moree test <div> London is ... <p>mooo</p></div></div>'
I need to get the block that contains certain text, for example:
super_func("London", a) ==> '<div> London is ... <p>mooo</p></div>'
super_func('mooo', a) ==> '<p>mooo</p>'
You can use the following XPath query to find an element containing certain text, regardless of the element name and its location within the HTML document:
//*[contains(text(),'certain text')]
This is a working example using the lxml.html library:
from lxml import html
def super_func(keyword, htmldoc):
    query = '//*[contains(text(),"{0}")]'
    result = htmldoc.xpath(query.format(keyword))
    if len(result) > 0:
        return html.tostring(result[0])
    else:
        return ''
a = '<div>Test moree test <div> London is ... <p>mooo</p></div></div>'
doc = html.fromstring(a)
text = 'London'
print super_func(text, doc)
text = 'mooo'
print super_func(text, doc)
Output:
<div> London is ... <p>mooo</p></div>
<p>mooo</p>
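One caveat, in case your real documents are messier than this sample: contains(text(), '...') compares only the element's first text node, so it can miss text that appears after a child element. A variant that checks every text node of an element is:
//*[text()[contains(., 'certain text')]]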
Find all the URL links in an HTML text using regex. The text below is assigned to the html variable.
html = """
anchor link
<a id="some-id" href="/relative/path#fragment">relative link</a>
same-protocol link
absolute URL
"""
The output should look like this:
["/relative/path","//other.host/same-protocol","https://example.com"]
The function should ignore fragment identifiers (link targets that begin with #). That is, if the URL points to a specific fragment/section using the hash symbol, the fragment part (the part starting with #) should be stripped from the URL before it is returned by the function.
I have tried the regex below, but it isn't working; it only gives the output ["https://example.com"]:
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
print(urls)
You could try using a positive lookbehind to find the quoted strings following href= in the HTML:
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
urls = re.findall(pattern, html)
See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
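For reference, here is a minimal, self-contained run of that pattern against the sample markup from the question (the expected output is my reading of the pattern, so verify it on your real input):
import re
html = """
<a href="#anchor">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
# Capture everything between href=" and the first # or closing quote,
# skipping links whose target starts with a fragment identifier.
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
print(re.findall(pattern, html))
# ['/relative/path', '//other.host/same-protocol', 'https://example.com']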
from typing import List, Optional
html = """
anchor link
<a id="some-id" href="/relative/path#fragment">relative link</a>
same-protocol link
absolute URL
"""
href_prefix = "href=\""
def get_links_from_html(html: str, result: Optional[List[str]] = None) -> List[str]:
    if result is None:
        result = []
    # partition() returns (head, separator, tail); an empty separator
    # means there is no href=" left in the string.
    _, sep, rest = html.partition(href_prefix)
    if not sep:
        return result
    # Take everything up to the closing quote, then drop any #fragment part.
    link = rest[:rest.find("\"")].partition("#")[0]
    if link:
        result.append(link)
    return get_links_from_html(rest, result)
print(get_links_from_html(html))
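With the sample html above, this should print ['/relative/path', '//other.host/same-protocol', 'https://example.com']; the bare anchor link is dropped because stripping its fragment leaves an empty string.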
Python 2.7 using lxml
I have some annoyingly formed html that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.
I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?
This should work:
from lxml import etree
p = etree.HTMLParser()
html = open(r'./test.html','r')
data = html.read()
tree = etree.fromstring(data, p)
my_dict = {}
for b in tree.iter('b'):
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br
print my_dict
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
Rather than using xpath, you could use one of lxml's parsers in order to easily navigate the HTML. The parser will turn the HTML document into an "etree", which you can navigate with provided methods. The lxml module provides a method called iter() which allows you to pass in a tag name and receive all elements in the tree with that name. In your case, if you use this to obtain all of the <b> elements, you can then manually navigate to the <br> element and retrieve its tail text, which contains the information you need. You can find information about this in the "Elements contain text" header of the lxml.etree tutorial.
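Here is a tiny sketch of what tail text means (not the question's exact markup), since it is the key idea:
from lxml import etree
fragment = etree.fromstring('<td><b>"John"</b><br/>"123 Main st."</td>')
br = fragment.find('br')
# .text is the content inside a tag; .tail is the text that follows the
# closing tag, which is where the "content" of a <br> actually lives.
print(br.tail)  # "123 Main st."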
Why not use the getchildren() method on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1
FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2
doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
    {'city': '"New York\n"', 'name': '"John"', 'street': '"123 Main st.\n"'},
    {'city': '"San Francisco\n"', 'name': '"Sally"', 'street': '"101 California St.\n"'}
]
Note that you need tag.tail rather than tag.text to get the text that follows a void (self-closing) HTML tag such as <br>. Hope this helps.
Is there a way to remove all HTML tags from a string, but keep some links and change their representation? Example:
description: <p>Animation params. For other animations, see <a href="#myA.animation">myA.animation</a> and the animation parameter under the API methods. The following properties are supported:</p>
<dl>
<dt>duration</dt>
<dd>The duration of the animation in milliseconds.</dd>
<dt>easing</dt>
<dd>A string reference to an easing function set on the <code>Math</code> object. See <a href="http://example.com">demo</a>.</dd>
</dl>
<p>
and I want to replace
<a href="#myA.animation">myA.animation</a>
with only 'myA.animation', but
<a href="http://example.com">demo</a>
with 'demo: http://example.com'
EDIT:
Now it seems to be working:
def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        if str(m) in comment:
            if not m['href'].startswith("#"):
                # replace the whole anchor with "href : link text"
                comment = comment.replace(str(m), m['href'] + " : " + m.__dict__['next_element'])
    soup = BeautifulSoup(comment, 'html.parser')
    comment = soup.get_text()
    return comment
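A quick sanity check on a made-up one-line snippet; the printed result is what I would expect from this code, so double-check it on your side:
print(cleanComment('See the <a href="http://example.com">demo</a> for details.'))
# 'See the http://example.com : demo for details.'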
This regex should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"
You can try it over here
In Python:
import re
text = ''
with open('textfile', 'r') as file:
    text = file.read()
matches = re.findall('(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"', text)
strings = []
for m in matches:
    m = filter(bool, m)
    strings.append(': '.join(m))
print(strings)
The result will look like: ['myA.animation', 'demo: http://example.com']
Hi, I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
sources = {'source1': '\'p\', class_=\'findme\'',
           'source2': '\'span\', class_=\'findme2\'',
           'source3': '\'div\', class_=\'findme3\''}
test = BeautifulSoup(html)
# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>
# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The PyCharm debugger gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these embedded quotes. Is it possible to remove them?
The first argument is a tag name, and your string doesn't contain that. BeautifulSoup (or Python, generally) won't parse a string like that; it cannot guess that you put some arbitrary Python syntax in that value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
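If you want to keep several source definitions in one dict, one option (just a sketch, with made-up source names) is to store real data, a tag name plus an attribute dict, instead of encoding Python syntax in a string:
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
# Each entry is (tag_name, attrs_dict) -- data, not quoted syntax.
sources = {'source1': ('p', {'class': 'findme'}),
           'source2': ('span', {'class': 'findme2'}),
           'source3': ('div', {'class': 'findme3'})}
soup = BeautifulSoup(html, 'html.parser')
tag, attrs = sources['source1']
print(soup.find(tag, attrs))
# >>> <p class="findme"> p-tag content</p>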
Okay I got it, thanks again.
dic_date = {'source1': 'p, class:findme'}  # other sources ...
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')  # getting the attribute
    att = {att[0]: att[1]}   # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well it doesn't look very nice but it's working :)
I am using BeautifulSoup for the first time and trying to collect several data such as email,phone number, and mailing address from a soup object.
Using regular expressions, I can identify the email address. My code to find the email is:
def get_email(link):
    mail_list = []
    for i in link:
        a = str(i)
        email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\">", re.IGNORECASE)
        ik = re.findall(email_pattern, a)
        if (len(ik) == 1):
            mail_list.append(i)
        else:
            pass
    s_email = str(mail_list[0]).split('<a href="')
    t_email = str(s_email[1]).split('">')
    print t_email[0]
Now, I also need to collect the phone number, mailing address and web url. I think in BeautifulSoup there must be an easy way to find those specific data.
A sample html page is as below:
<ul>
  <li>
    <span>Email:</span>
    <a href="mailto:abc@gmail.com">Message Us</a>
  </li>
  <li>
    <span>Website:</span>
    <a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
  </li>
  <li>
    <span>Phone:</span>
    (123)456-789
  </li>
</ul>
And using BeautifulSoup, I am trying to collect the email, website and phone values that follow those span labels.
Thanks in advance.
The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again - that ignores much of the point of using BeautifulSoup in the first place. You might need to use a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary - you can just leave it out entirely.
Here's some code that does something like what you want, and might be a useful starting point:
from BeautifulSoup import BeautifulSoup
import re
# Assuming that html is your input as a string:
soup = BeautifulSoup(html)
all_contacts = []
def mailto_link(e):
    '''Return the email address if the element is a mailto link,
    otherwise return None'''
    if e.name != 'a':
        return None
    for key, value in e.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)', value)
            if m:
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = {}
    for li in ul.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

print all_contacts
That will produce a list of one dictionary per contact found, in this case that will just be:
[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc@gmail.com'}]
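If you are on Python 3 with the newer bs4 package rather than the old BeautifulSoup 3 module used above, a rough equivalent of the same approach (a sketch, assuming html holds the snippet from the question) would be:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
all_contacts = []
for ul in soup.find_all('ul'):
    contact = {}
    for li in ul.find_all('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            # match anchors whose href starts with mailto:
            a = li.find('a', href=re.compile(r'^mailto:'))
            if a:
                contact['email'] = a['href'][len('mailto:'):]
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            # the phone number is the bare text node after the span
            contact['phone'] = s.next_sibling.strip()
    all_contacts.append(contact)
print(all_contacts)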