Modifying a BeautifulSoup .string with line breaks - python

I am trying to change the content of an html file with BeautifulSoup. This content will be coming from python-based text so it will have \n newlines...
newContent = """This is my content \n with a line break."""
newContent = newContent.replace("\n", "<br>")
htmlFile.find_all("div", "product")[0].p.string = newContent
When I do this, the <p> text in the HTML file is changed to this:
This is my content &lt;br&gt; with a line break.
How do I change a string within a BeautifulSoup object and keep the <br> breaks? If the string just contains \n then it'll create an actual line break.

You need to create separate elements; there isn't one piece of text contained in the <p> tag, but a series of text and <br/> elements.
Rather than replace \n newlines with the text <br/> (which will be escaped), split the text on newlines and insert extra elements in between:
parent = htmlFile.find_all("div", "product")[0].p
lines = newContent.splitlines()
parent.append(htmlFile.new_string(lines[0]))
for line in lines[1:]:
    parent.append(htmlFile.new_tag('br'))
    parent.append(htmlFile.new_string(line))
This uses the Tag.append() method to add new elements to the tree, and BeautifulSoup.new_string() and BeautifulSoup.new_tag() to create those extra elements.
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlFile = BeautifulSoup('<p></p>')
>>> newContent = """This is my content \n with a line break."""
>>> parent = htmlFile.p
>>> lines = newContent.splitlines()
>>> parent.append(htmlFile.new_string(lines[0]))
>>> for line in lines[1:]:
...     parent.append(htmlFile.new_tag('br'))
...     parent.append(htmlFile.new_string(line))
...
>>> print(htmlFile.prettify())
<html>
 <head>
 </head>
 <body>
  <p>
   This is my content
   <br/>
   with a line break.
  </p>
 </body>
</html>
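The same steps can be wrapped in a small helper. This is only a sketch: the function name set_text_with_breaks, the use of the html.parser backend, and the decision to clear the tag's existing children first are assumptions made for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup


def set_text_with_breaks(soup, tag, text):
    """Replace the children of tag with text, putting a <br/> at each newline."""
    tag.clear()  # assumption: any existing content should be discarded
    for index, line in enumerate(text.splitlines()):
        if index:  # a <br/> goes between lines, not before the first one
            tag.append(soup.new_tag("br"))
        tag.append(soup.new_string(line))


soup = BeautifulSoup('<div class="product"><p>old text</p></div>', "html.parser")
new_content = "This is my content \n with a line break."
set_text_with_breaks(soup, soup.find("div", "product").p, new_content)
result = str(soup)
print(result)
```

Because the <br/> is created with new_tag() rather than written into the string, it should be serialized as a real element instead of being escaped.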

Related

xml.etree.ElementTree: How to replace like "innerHTML"?

I want to replace the content of the <h1> tag of an HTML page.
The new content of the heading can itself be HTML (not just a string).
I want to insert foo <b>bold</b> bar
Input:
start
<h1 class="myclass">bar <i>italic</i></h1>
end
Desired output:
start
<h1 class="myclass">foo <b>bold</b> bar</h1>
end
How to solve this with Python?
Using htql:
page="""start
<h1 class="myclass">bar <i>italic</i></h1>
end
"""
import htql
x = htql.query(page, "<h1>:tx &replace('foo <b>bold</b> bar') ")[0][0]
You get:
>>> x
'start \n<h1 class="myclass">foo <b>bold</b> bar</h1>\nend\n'
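Another option is html5lib together with xml.etree.ElementTree (the imports are assumed here, since the snippet did not include them):
from xml.etree import ElementTree
from html5lib import HTMLParser
parser = HTMLParser(namespaceHTMLElements=False)
etree = parser.parse('start <h1 class="myclass">bar <i>italic</i></h1> end')
for h1 in etree.findall('.//h1'):
    # iterate over a copy: removing children while iterating skips elements
    for sub in list(h1):
        h1.remove(sub)
    html = parser.parse('foo <b>bold</b> bar')
    body = html.find('.//body')
    for sub in body:
        h1.append(sub)
    h1.text = body.text
print(ElementTree.tostring(etree))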
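If both the page and the replacement markup happen to be well-formed XML, the standard library alone may be enough. The sketch below rests on that assumption; the helper name set_inner_xml and the <div> and <wrapper> elements are made up for illustration, because ElementTree needs a single root around loose text.

```python
import xml.etree.ElementTree as ET


def set_inner_xml(element, markup):
    """Replace the text and children of element with the parsed markup."""
    # Wrap the fragment so that leading text and several siblings can be parsed.
    wrapper = ET.fromstring("<wrapper>{}</wrapper>".format(markup))
    for child in list(element):  # copy first, then remove the old children
        element.remove(child)
    element.text = wrapper.text  # text before the first child element
    element.extend(list(wrapper))  # children keep their own tail text


root = ET.fromstring('<div>start <h1 class="myclass">bar <i>italic</i></h1> end</div>')
set_inner_xml(root.find("h1"), "foo <b>bold</b> bar")
result = ET.tostring(root, encoding="unicode")
print(result)
```

Real-world HTML is often not well-formed XML, in which case a forgiving parser such as html5lib is the safer choice.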

How to extract tags from HTML file and write them to a new file?

My HTML file has the format shown below
<unit id="2" status="FINISHED" type="pe">
<S producer="Alice_EN">CHAPTER I Down the Rabbit-Hole</S>
<MT producer="ALICE_GG">CAPÍTULO I Abaixo do buraco de coelho</MT>
<annotations revisions="1">
<annotation r="1">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>
I need to extract ALL the content from two tags in the entire HTML file. The content of the tag that starts with <unit id ...> is on one line, but the content of the other tag, which starts with "<PE producer ..." and ends with "</PE>", is spread over several lines. I need to extract the content of these two tags and write it to a new file, one after the other. My output should be:
<unit id="2" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>
My code does not extract the content from all the tags in the file. Does anyone have a clue what is going on and how I can make this code work properly?
import codecs
import re
t=codecs.open('ALICE.per1_replaced.html','r')
t=t.read()
unitid=re.findall('<unit.*?"pe">', t)
PE=re.findall('<PE.*?</PE>', t, re.DOTALL)
for i in unitid:
    for j in PE:
        a=i + '\n' + j + '\n'
with open('PEtags.txt','w') as fi:
    fi.write(a)
The problem is in the part of the code where you loop through the matches and write them to the file: a is overwritten on every iteration, so only the last combination ends up in the file.
If your unitid and PE match counts are the same, you can adjust the code to
import re
with open('ALICE.per1_replaced.html','r') as t:
    contents = t.read()
unitid=re.findall('<unit.*?"pe">', contents)
PE=re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtags.txt','w') as fi:
    for i, p in zip(unitid, PE):
        fi.write( "{}\n{}\n".format(i, p) )
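zip() silently stops at the shorter list, so pairing only works when the two match counts really are equal. One possible way around that, sketched here under the assumption that every <unit> is followed by its own <PE> block, is to capture both tags with a single pattern. The sample text and the name pair_pattern are invented for illustration.

```python
import re

# Made-up sample in the same shape as the file from the question.
sample = '''<unit id="1" status="FINISHED" type="pe">
<PE producer="A1"><html>
<body>
first
</body>
</html></PE>
<unit id="2" status="FINISHED" type="pe">
<PE producer="A1"><html>
<body>
second
</body>
</html></PE>
'''

# Group 1 is the opening <unit ...> tag, group 2 the next <PE>...</PE> block.
pair_pattern = re.compile(r'(<unit\b[^>]*>).*?(<PE\b.*?</PE>)', re.DOTALL)
pairs = pair_pattern.findall(sample)

output = "".join("{}\n{}\n".format(unit, pe) for unit, pe in pairs)
print(output)
```

If a unit has no <PE> block, this pattern would pair it with the block of a later unit, so the assumption needs checking against the real file.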

Is there a function similar to .replace() in Dominate library for Python?

I want to add HTML tags to text taken from a .txt file and then save as HTML. I'm trying to find any instances of a particular word, then 'replace' it with the same word inside an anchor tag.
Something like this:
import dominate
from dominate.tags import *
item = 'item1'
text = ['here is item1 in a line of text', 'here is item2 in a line too']
doc = dominate.document()
with doc:
    for i, line in enumerate(text):
        if item in text[i]:
            text[i].replace(item, a(item, href='/item1'))
The above code gives an error:
TypeError: replace() argument 2 must be str, not a.
I can make this happen:
print(doc.body)
<body>
<p>here is item1 in a line of text</p>
<p>here is item2 in a line too</p>
</body>
But I want this:
print(doc.body)
<body>
<p>here is <a href='/item1'>item1</a> in a line of text</p>
<p>here is item2 in a line too</p>
</body>
There is no replace() method in Dominate, but this solution works for what I want to achieve:
Create the anchor tag as a string. Stored here in the variable 'item_atag':
item = 'item1'
url = '/item1'
item_atag = '<a href="{}">{}</a>'.format(url, item)
Use Dominate library to wrap paragraph tags around each line in the original text, then convert to string:
text = ['here is item1 in a line of text', 'here is item2 in a line too']
from dominate import document
from dominate.tags import p
doc = document()
with doc.body:
    for i, line in enumerate(text):
        p(text[i])
html_string = str(doc.body)
Use Python's built-in replace() method for strings to add the anchor tag:
html_with_atag = html_string.replace(item, item_atag)
Finally, write the new string to an HTML file:
with open('html_file.html', 'w') as f:
    f.write(html_with_atag)
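Replacing on the serialized body also rewrites any occurrence of the word inside tag or attribute text. A variation that avoids this is to do the replacement line by line before the paragraph tags are added. The sketch below uses only the standard library rather than Dominate, and the helper name link_item is hypothetical.

```python
from html import escape


def link_item(line, item, url):
    """Escape one line of plain text and wrap each occurrence of item in an anchor."""
    anchor = '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(item))
    # Search in the escaped text so the term matches what was actually escaped.
    return escape(line).replace(escape(item), anchor)


item = 'item1'
url = '/item1'
text = ['here is item1 in a line of text', 'here is item2 in a line too']

paragraphs = ['<p>{}</p>'.format(link_item(line, item, url)) for line in text]
html_body = '<body>\n{}\n</body>'.format('\n'.join(paragraphs))
print(html_body)
```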

Using requests and Beautifulsoup to find text in page (With CSS)

I'm making a request to a webpage and I'm trying to retrieve some text from it. The text is split up with span tags like this:
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
There are inline style sheets (CSS) that say whether or not each span is displayed on screen, so the gibberish text is never shown. This is an example of one of the sheets:
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
But there are more CSS files like this, so I don't know if there is a better way to achieve my goal (print the text that shows on screen and leave out the gibberish that is not displayed).
My script is able to print the text, but all of it (gibberish included), as follows: "This is jvgviehrgjfne my gt4ugirdfgr script!"
If I understood you right, what you should do is parse the CSS files with a regex for the classes associated with display:inline and pass the results to the Beautiful Soup API. Here is a way:
import re
import bs4
page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_file_read_output = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""
css_file_lines = css_file_read_output.splitlines()
css_lines_text = []
for line in css_file_lines:
    inline_search = re.search(".*inline.*", line)
    if inline_search is not None:
        inline_group = inline_search.group()
        class_name_search = re.search(r"\..*\{", inline_group)
        class_name_group = class_name_search.group()
        class_name_group = class_name_group[1:-1]  # getting rid of the last { and first .
        css_lines_text.append(class_name_group)
    else:
        pass
page_bs = bs4.BeautifulSoup(page_txt, "lxml")
wanted_text_list = []
for line in css_lines_text:
    wanted_line = page_bs.find("span", class_=line)
    wanted_text = wanted_line.get_text(strip=True)
    wanted_text_list.append(wanted_text)
wanted_string = " ".join(wanted_text_list)
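The loop above follows the order of the CSS rules and takes only the first span for each class. A possible variation, sketched here, collects the classes set to display:none and then walks the spans in document order. It assumes simple one-class rules like the ones shown and uses the html.parser backend; the names hidden_classes and visible_text are illustrative.

```python
import re
from bs4 import BeautifulSoup

page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_text = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""

# Class names whose rule body contains display:none.
hidden_classes = set(re.findall(r'\.([\w-]+)\s*\{[^}]*display\s*:\s*none', css_text))

soup = BeautifulSoup(page_txt, "html.parser")
visible_words = [
    span.get_text(strip=True)
    for span in soup.find_all("span")
    # bs4 returns the class attribute as a list of class names
    if not hidden_classes.intersection(span.get("class", []))
]
visible_text = " ".join(visible_words)
print(visible_text)
```

With several style sheets, the hidden_classes set could be built from all of them before the spans are filtered.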

How to find spans with a specific class containing specific text using beautiful soup and re?

How can I find all spans with a class of 'blue' that contain text in the format:
04/18/13 7:29pm
which could therefore be:
04/18/13 7:29pm
or:
Posted on 04/18/13 7:29pm
In terms of constructing the logic to do this, this is what I have got so far:
new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result
I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try and figure out a way to do this, but the above is all i have got so far.
edit:
to clarify the scenario, there are span's with:
<span class="blue">here is a lot of text that i don't need</span>
and
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
and note i only need 04/18/13 7:29pm not the rest of the content.
edit 2:
I also tried:
pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result
and got error:
'TypeError: expected string or buffer'
import re
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""
# parse the html
soup = BeautifulSoup(html_doc)
# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})
# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]
# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))
# print the dates we collected
for date in found_dates:
    print(date)
Output:
04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
This is a more flexible regex that you can use:
"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])"
Example:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html)
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
...       for line in lines
...       for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])', line),)
...       if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
...     print i
...
04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
This pattern seems to satisfy what you're looking for:
>>> pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)
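Inside a character class, | is a literal character, so a class such as [a|p] also accepts a pipe. A compiled pattern with re.IGNORECASE sidesteps that and covers the upper-case variants too. The following is a sketch that works on plain strings (for example the result of get_text()), with DATE_PATTERN as an illustrative name; like the patterns above, it does not check that the month or hour is valid.

```python
import re

# Date like 04/18/13 or 4/1/2013, then a time like 7:29pm (any letter case).
DATE_PATTERN = re.compile(
    r'\b\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?\s+\d{1,2}:\d{2}\s*[ap]m\b',
    re.IGNORECASE,
)

lines = [
    "here is a lot of text that i don't need",
    "this is the span i need because it contains 04/18/13 7:29pm",
    "04/19/13 7:30pm",
    "Posted on 4/1/2013 17:09aM",
]

# Keep only the date part of each line that contains one.
found = [m.group(0) for m in map(DATE_PATTERN.search, lines) if m]
print(found)
```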
