Part of the HTML looks like the snippet below. I want to extract the contents of the 'span' tags:
from bs4 import BeautifulSoup
data = """
<section><h2>Team</h2><ul><li><ul><li><span>J36</span>—<span>John</span></li><li><span>B56</span>—<span>Bratt</span></li><li><span>K3</span>—<span>Kate</span></li></ul></li></ul></section>
"""
soup = BeautifulSoup(data, "html.parser")
classification = soup.find_all('section')[0].find_all('span')
for c in classification:
    print(c.text)
It prints:
J36
John
B56
Bratt
K3
Kate
But the desired output is:
J36-John
B56-Bratt
K3-Kate
What's the proper BeautifulSoup way to extract the contents, other than the workaround below? Thank you.
contents = [c.text for c in classification]
l = contents[0::2]
ll = contents[1::2]
for a in zip(l, ll):
    print('-'.join(a))
You could look at each span's next sibling. If a span has one (here, the dash between the two spans), print it right after the span's text without a newline; otherwise print just the text.
for c in classification:
    if c.next_sibling:
        print(c.text + str(c.next_sibling), end='')
    else:
        print(c.text)
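An alternative sketch, not taken from the answer above: since each inner li holds exactly one code/name pair, you can iterate over the li elements and join the spans that are their direct children. It assumes the pairs are always nested the way the question's markup shows; the variable name pairs is only for illustration.

```python
from bs4 import BeautifulSoup

data = """
<section><h2>Team</h2><ul><li><ul><li><span>J36</span>—<span>John</span></li><li><span>B56</span>—<span>Bratt</span></li><li><span>K3</span>—<span>Kate</span></li></ul></li></ul></section>
"""
soup = BeautifulSoup(data, "html.parser")

pairs = []
for li in soup.find("section").find_all("li"):
    # recursive=False keeps only spans that are direct children of this li,
    # so the outer li (whose only child is a ul) contributes nothing.
    spans = li.find_all("span", recursive=False)
    if spans:
        pairs.append("-".join(span.get_text() for span in spans))

print("\n".join(pairs))
```

Grouping by li avoids relying on the spans alternating strictly between code and name across the whole section.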
Good Morning,
I'm doing some HTML parsing in Python and I've run across the following which is a time & name pairing in a single table cell. I'm trying to extract each piece of info separately and have tried several different approaches to split the following string.
HTML String:
<span><strong>13:30</strong><br/>SecondWord</span></a>
My output would hopefully be:
text1 = 13:30
text2 = "SecondWord"
I'm currently looping over all the rows in the table, taking each row's text and splitting it on newlines. I noticed the HTML has a line break tag (<br/>) between the two values so that they render on separate lines, so I tried replacing it with a newline and splitting on that. However, neither my string.replace() nor my re.sub() approach seems to work.
I'd love to know what I'm doing wrong.
Latest Approach:
import re
resub_pat = r'<br/>'
rows = list()
for row in table.findAll("tr"):
    a = re.sub(resub_pat, "\n", row.text).split("\n")
This is a bit hashed together, but I hope I've captured my problem! I wasn't able to find any similar issues.
You could try:
from bs4 import BeautifulSoup
import re
# the soup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# the regex object
rx = re.compile(r'(\d+:\d+)(.+)')
# time, text
text = soup.find('span').get_text()
x,y = rx.findall(text)[0]
print(x)
print(y)
Use strong.text to get the time, and find(text=True, recursive=False) to get only the text that is a direct child of the span.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# text1
print(soup.find("span").strong.text) # --> 13:30
# text2
print(soup.find("span").find(text=True, recursive=False)) # --> SecondWord
Another option is to pass a separator to get_text() and split on it:
from bs4 import BeautifulSoup
txt = '''<span><strong>13:30</strong><br/>SecondWord</span></a>'''
soup = BeautifulSoup(txt, 'html.parser')
text1, text2 = soup.span.get_text(strip=True, separator='|').split('|')
print(text1)
print(text2)
Prints:
13:30
SecondWord
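For reference, the re.sub() attempt in the question cannot match because row.text contains only text, so the <br/> tag is already gone by the time the pattern runs. Below is a minimal sketch of the "turn the line break into a newline" idea, done on the parsed tree instead. The one-row table wrapped around the question's snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table around the snippet from the question.
html = ("<table><tr><td><span><strong>13:30</strong><br/>SecondWord</span>"
        "</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")

# Swap every <br> tag for a newline string before reading any text.
for br in soup.find_all("br"):
    br.replace_with("\n")

# Each row's text now contains the newline, so splitting works.
rows = [row.get_text().split("\n") for row in soup.find_all("tr")]
print(rows)
```

This keeps the row loop from the question and only changes where the replacement happens.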
I want to scrape data from websites using Beautiful Soup and requests. I've got as far as retrieving the data I want, but now I want to filter it:
from bs4 import BeautifulSoup
import requests
url = "website.com"
keyword = "22222"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for article in soup.find_all('a'):
    for a in article:
        if article.has_attr('data-variant-code'):
            print(article.get("data-variant-code"))
Let's say this prints the following:
11111
22222
33333
How can I filter this so it only returns me the "22222"?
Assuming that article.get("data-variant-code") prints 11111, 22222, 33333, you can simply use an if statement. (The inner for a in article loop from the question is unnecessary, since it only repeats the same check once per child node.)
for article in soup.find_all('a'):
    if article.has_attr('data-variant-code'):
        x = article.get("data-variant-code")
        if x == '22222':
            print(x)
If instead the attribute holds a single space-delimited string and you want its second group of characters, split the string on spaces. This gives you a list of strings, and you can then access the second item of the list.
For example:
print(article.get("data-variant-code").split(" ")[1])
result: 22222
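As a further sketch, the filter can also be pushed into find_all() itself by passing the attribute and the value you want through attrs. The HTML here is a made-up stand-in for the page in the question, and keyword mirrors the variable defined there.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = (
    '<a data-variant-code="11111">A</a>'
    '<a data-variant-code="22222">B</a>'
    '<a data-variant-code="33333">C</a>'
    '<a href="#">no code</a>'
)
soup = BeautifulSoup(html, "html.parser")
keyword = "22222"

# attrs restricts the search to <a> tags whose attribute equals keyword.
matches = soup.find_all("a", attrs={"data-variant-code": keyword})
codes = [a["data-variant-code"] for a in matches]
print(codes)
```

The attrs dictionary is needed because data-variant-code contains hyphens and so cannot be written as a Python keyword argument.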
from bs4 import BeautifulSoup
markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, 'html.parser')
data = soup.find_all(['p', 'li'])
print(data)
The result looks like this
['<p>hey</p>,<p>How </p>,<li><p>How</p></li>,<p>are you </p>,<li><p>are you </p></li>']
How can I make it only return
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
Or is there any other way to remove the duplicated p tags? Thanks.
BeautifulSoup does not compare the text of the HTML; it simply returns every tag that matches, so a <p> nested inside an <li> is returned both on its own and as part of its parent. You need an additional step that checks whether a tag's text has already been seen in an earlier result:
import re
from bs4 import BeautifulSoup as soup
markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
s = soup(markup, 'lxml')
final_s = s.find_all(re.compile('p|li'))
final_data = [','.join(map(str, [a for i, a in enumerate(final_s) if not any(a.text == b.text for b in final_s[:i])]))]
print(final_data)
Output:
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
If you are looking for the text and not the <li> tags, you can just search for the <p> tags and get the desired result without duplication.
>>> data = soup.find_all('p')
>>> data
[<p>hey</p>, <p>How</p>, <p>are you </p>]
Or, if you are simply looking for the text, you can use this:
>>> data = [x.text for x in soup.find_all('p')]
>>> data
['hey', 'How', 'are you ']
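A different sketch, based on structure and not on text: keep only the matched tags that are not nested inside another matched tag. This is a swapped-in technique, not the method from the answers above, and the names wanted and top_level are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, "html.parser")

wanted = ["p", "li"]
# Drop any match that has an ancestor which is itself a <p> or <li>.
top_level = [
    tag for tag in soup.find_all(wanted)
    if tag.find_parent(wanted) is None
]
result = [str(tag) for tag in top_level]
print(result)
```

Unlike the text comparison, this still keeps two separate tags that happen to contain the same text.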
I want to use Python to submit requests to investing.com and extract historical data for a number of currencies/commodities. To do this I need the curr_id number, as in the example below. I'm able to extract all the scripts, but I cannot figure out how to find the script that contains curr_id and extract the digits '2103'. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup
# URL
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
# Open the URL
r = requests.get(url)
# Parse the response
soup = BeautifulSoup(r.content, 'html.parser')
# Find all script tags in soup
curr_data = soup.find_all('script', {'type': 'text/javascript'})
UPDATE
I did it like this:
g_data_string = str(curr_data)
if 'curr_id' in g_data_string:
    print('success')
    start = g_data_string.find('curr_id') + 9
    end = g_data_string.find('curr_id') + 13
    print(g_data_string[start:end])
But I'm sure there is a better way to do it.
You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
    print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that would match one or more digits.
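The same idea can be tried offline. The sketch below runs the pattern over every script tag in a small, made-up HTML string (the real page's markup may differ) and guards against script tags with no content.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment; only the curr_id value comes from the question.
html = """<html><head>
<script type="text/javascript">var unrelated = 1;</script>
<script type="text/javascript">window.chart = {curr_id: 2103, name: "USD/BRL"};</script>
</head><body></body></html>"""
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r"curr_id:\s*(\d+)")
curr_id = None
for script in soup.find_all("script"):
    # script.string is None for empty or external scripts, hence the fallback.
    match = pattern.search(script.string or "")
    if match:
        curr_id = match.group(1)
        break

print(curr_id)
```

Capturing the digits with a group avoids the fixed offsets (+9 and +13) used in the question's update, which would break if the id had a different length.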
You don't actually need a regex: you can get the id by extracting the value attribute from the input tag with name=item_ID.
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.
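Since any one of these elements might be missing, a small helper can try the selectors in turn. This is only a sketch: first_attr is a hypothetical helper, and the HTML is a made-up fragment shaped like the elements mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing the elements the answer refers to.
html = """
<div pair_id="2103"></div>
<input type="hidden" name="item_ID" id="item_id_1" value="2103">
"""

def first_attr(soup, candidates):
    """Return the first attribute value found among (selector, attr) pairs."""
    for selector, attr in candidates:
        tag = soup.select_one(selector)
        if tag is not None and tag.has_attr(attr):
            return tag[attr]
    return None

candidates = [
    ('input[name="item_ID"]', "value"),
    ('input[id^="item_id"]', "value"),
    ("div[pair_id]", "pair_id"),
]

curr_id = first_attr(BeautifulSoup(html, "html.parser"), candidates)
missing = first_attr(BeautifulSoup("<p>no id here</p>", "html.parser"), candidates)
print(curr_id, missing)
```

Returning None when nothing matches avoids the TypeError that indexing the result of a failed select_one() call would raise.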
I am using python Beautiful soup to get the contents of:
<div class="path">
abc
def
ghi
</div>
My code is as follows:
html_doc="""<div class="path">
abc
def
ghi
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
path = soup.find('div', attrs={'class': 'path'})
breadcrum = path.findAll(text=True)
print(breadcrum)
The output is as follows:
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']
How can I get the result as a single string in this form: abc,def,ghi?
I would also like to understand why the output looks the way it does.
You could do this:
breadcrum = [item.strip() for item in breadcrum if item.strip()]
The if item.strip() condition drops the items that are empty once the newline characters have been stripped. (A plain if str(item) would not, because '\n' is itself a non-empty string.)
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
EDIT
Although the above gives you what you want, as others in the thread pointed out, this is not the right way to use BeautifulSoup to extract anchor text. Once you have the div you are interested in, get its child anchors and then read their text:
path = soup.find('div', attrs={'class': 'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)
Then do ','.join(data).
If you only strip the items in breadcrum, you end up with empty items in your list. You can either do as shaktimaan suggested and then use
breadcrum = filter(None, breadcrum)
Or you can strip them all before hand (in html_doc):
mystring = mystring.replace('\n', ' ').replace('\r', '')
Either way to get your string output, do something like this:
','.join(breadcrum)
Unless I'm missing something, just combine strip and list comprehension.
Code:
from bs4 import BeautifulSoup as bsoup
with open("test.html", "r") as ofile:
    soup = bsoup(ofile, "html.parser")
res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
print(res)
Result:
abc,def,ghi
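On the second part of the question: the '\n' entries are the whitespace text nodes that sit between the tags, which find_all(text=True) returns along with the visible text. A minimal sketch using stripped_strings, which skips whitespace-only strings, follows. It assumes each item is wrapped in an <a> tag, as the answers imply; the href values are placeholders.

```python
from bs4 import BeautifulSoup

# Assumed markup: the anchors and their href values are placeholders.
html_doc = """<div class="path">
<a href="#">abc</a>
<a href="#">def</a>
<a href="#">ghi</a>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", class_="path")

# Every text node, including the newlines between the tags.
raw = path.find_all(string=True)
print(raw)

# stripped_strings yields each string with whitespace removed
# and skips the ones that are empty after stripping.
result = ",".join(path.stripped_strings)
print(result)
```

This combines the strip and filter steps from the answers above into one call.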