I'm doing a request to a webpage and I'm trying to retrieve some text on it. The text is splitup with span tags like this:
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
There are "inline style sheets" (CSS sheets) that says if we have to print or not the text to the screen and thus, not print the gibberish text on the screen. This is an example of 1 of the sheet:
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
but there are more CSS files like this.. So I don't know if there are any better way to achieve my goal (print the text that shows on screen and not use the gibberish that is not displayed)
My script is able to print the text.. but all of it (with gibberish) as the following: "This is jvgviehrgjfne my gt4ugirdfgr script!"
If i understood you right, what you should do is to parse css files with regex for attributes associated with inline and provide the results to beautiful soup api. Here is a way:
import re
import bs4
page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_file_read_output = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""
css_file_lines = css_file_read_output.splitlines()
css_lines_text = []
for line in css_file_lines:
inline_search = re.search(".*inline.*", line)
if inline_search is not None:
inline_group = inline_search.group()
class_name_search = re.search("\..*\{", inline_group)
class_name_group = class_name_search.group()
class_name_group = class_name_group[1:-1] # getting rid of the last { and first .
css_lines_text.append(class_name_group)
else:
pass
page_bs = bs4.BeautifulSoup(page_txt,"lxml")
wanted_text_list = []
for line in css_lines_text:
wanted_line = page_bs.find("span", class_=line)
wanted_text = wanted_line.get_text(strip=True)
wanted_text_list.append(wanted_text)
wanted_string = " ".join(wanted_text_list)
Related
this is the html tags i want get text from its span
<span class="ms-2 d-flex">
<span class="d-none d-xl-block me-1"> mobile </span>
ItsMobileNumber
</span>
so its one main span with span and some text 'ItsMobileNumber'
i want get the 'ItsMobileNumber' but when i use get_text() it getting both text like this :
mobile
ItsMobileNumber
and this is my python code
print(title.find("span").get_text())
how can i get just 'ItsMobileNumber' not inner span text ?
Try something like this:
from bs4 import BeautifulSoup as bs
soup = bs([your html file],'lxml')
data = soup.select("span.ms-2.d-flex")
for datum in data:
print(list(datum.strings)[2].strip())
The output, based only on your sample html, should be
ItsMobileNumber
Have you tried this?:
data = title.find("span").get_text()
number = [d.text for d in data]
Or,
import re
number = re.findall("[0-9]+",data)
If you want the main span as well, you can do:
main = data.contents[0]
I am trying to use the BeautifulSoup module with Python to do the following:
Within a div for HTML, for each paragraph tag, I want to add a bold tag to the first letter of each word within the paragraph. For example:
<div class="body">
<p>The quick brown fox</p>
</div>
which would read: The quick brown fox
would then become
<div class="body">
<p><b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox</p>
</div>
that would read: The quick brown fox
Using bs4 i've been unable to find a good solution to do this and am open to ideas.
You could use replace_with() combined with list comprehension - Extract text / string from tag / bs4 object, process it as text and later on replace the tag with new bs4 object:
soup.p.replace_with(
BeautifulSoup(
' '.join([s.replace(s[0],f'<b>{s[0]}</b>') for s in soup.p.string.split(' ')]),'html.parser'
)
)
Example
from bs4 import BeautifulSoup
html = '''
<div class="body">
<p>The quick brown fox</p>
</div>'''
soup = BeautifulSoup(html,'html.parser')
soup.p.replace_with(
BeautifulSoup(
' '.join([s.replace(s[0],f'<b>{s[0]}</b>') for s in soup.p.string.split(' ')]),'html.parser'
)
)
soup
Output
<div class="body">
<b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox
</div>
I don't know much about how Python parses HTML in detail, but I can provide you with some ideas.
To find <p> tags, you can use RegEx <p.*?>.*?</p> or use str.find("<p>") and walk until </p>.
To add <b> tags, perhaps this code will work:
def add_bold(s: str) -> str:
ret = ""
isFirstLet = True
for i in s:
if isFirstLet:
ret += "<b>" + i + "</b>"
isFirstLet = False
else:
ret += i
if i == " ": isFirstLet = True
return ret
I want to convert:
<span class = "foo">data-1</span>
<span class = "foo">data-2</span>
<span class = "foo">data-3</span>
to
<span class = "foo"> data-1 data-2 data-3 </span>
Using BeautifulSoup in Python. This HTML part exists in multiple areas of the page body, hence I want to minimize this part and scrap it. Actually the mid span was with em class hence originally separated.
Adapted from this answer to show how this could be used for your span tags:
span_tags = container.find_all('span')
# combine all the text from b tags
text = ''.join(span.get_text(strip=True) for span in span_tags)
# here you choose a tag you want to preserve and update its text
span_main = span_tags[0] # you can target it however you want, I just take the first one from the list
span_main.span.string = text # replace the text
for tag in span_tags:
if tag is not span_main:
tag.decompose()
I am trying to scrape from the <span class= ''>. The code looks like this on the pages I am scraping:
< span class = "catnum"> Disc Number < / span>
"1"
< br >
< span class = "catnum"> Track Number < / span>
"1"
< br>
< span class = "catnum" > Duration < /span>
"5:28"
<br>
What I need to get are those numbers after the </span> tag. I should also mention I am writing a larger piece of code that is scraping 1200 sites and this will have to loop over 1200 sites where the numbers in the quotation marks will change from page to page.
I tried this code as a test on one page:
from bs4 import BeautifulSoup
soup = BeautifulSoup (open("Smith.html"), "html.parser")
for tag in soup.findAll('span'):
if tag.has_key('class'):
if tag['class'] == 'catnum':
print tag.string
I know that will print ALL the 'span class' tags and not just the three I want, but I thought I would still test it to see if it worked and I got this error:
/Library/Python/2.7/site-packages/bs4/element.py:1527: UserWarning:
has_key is deprecated. Use has_attr("class") instead. key))
as said in the error message, you should use tag.has_attr("class") in place of the deprecated tag.has_key("class") method.
Hope it helps.
Simone
You can constrain your search by attribute {'class': 'catnum'} and the text inside text=re.compile('Disc Number'). Then use .next_sibling to find the text:
from bs4 import BeautifulSoup
import re
s = '''
<span class = "catnum"> Disc Number </span>
"1"
<br/>
<span class = "catnum"> Track Number </span>
"1"
<br/>
<span class = "catnum"> Duration </span>
"5:28"
<br/>'''
soup = BeautifulSoup(s, 'html.parser')
span = soup.find('span', {'class': 'catnum'}, text=re.compile(r'Disc Number'))
print span.next_sibling
how can I find all span's with a class of 'blue' that contain text in the format:
04/18/13 7:29pm
which could therefore be:
04/18/13 7:29pm
or:
Posted on 04/18/13 7:29pm
in terms of constructing the logic to do this, this is what i have got so far:
new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
result = re.findall(pattern, _)
print result
I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try and figure out a way to do this, but the above is all i have got so far.
edit:
to clarify the scenario, there are span's with:
<span class="blue">here is a lot of text that i don't need</span>
and
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
and note i only need 04/18/13 7:29pm not the rest of the content.
edit 2:
I also tried:
pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
result = re.findall(pattern, _)
print result
and got error:
'TypeError: expected string or buffer'
import re
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""
# parse the html
soup = BeautifulSoup(html_doc)
# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})
# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]
# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[a|p]m)', line)
if m:
found_dates.append(m.group(1))
# print the dates we collected
for date in found_dates:
print(date)
output:
04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
This is a flexible regex that you can use:
"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])"
Example:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html)
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
for line in lines
for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])', line),)
if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
print i
04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
This pattern seems to satisfy what you're looking for:
>>> pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)