BeautifulSoup fails to parse nested <p> elements

Dependencies:
BeautifulSoup==3.2.1
In: from BeautifulSoup import BeautifulSoup
In: BeautifulSoup('<p><p>123</p></p>')
Out: <p></p><p>123</p>
Why are the two nested tags not preserved in the output?

That is just BS3's parser fixing your broken html.
The P element represents a paragraph. It cannot contain block-level
elements (including P itself).

This
<p><p>123</p></p>
is not valid HTML: <p> elements can't be nested, so BS tries to clean it up.
When BS encounters the second <p>, it treats the first <p> as finished and inserts a closing </p>. The second </p> in your input then has no matching opening <p>, so it is removed.

This is because BeautifulSoup has this NESTABLE_TAGS concept/setting:
When Beautiful Soup is parsing a document, it keeps a stack of open
tags. Whenever it sees a new start tag, it tosses that tag on top of
the stack. But before it does, it might close some of the open tags
and remove them from the stack. Which tags it closes depends on the
qualities of tag it just found, and the qualities of the tags in the
stack.
So when Beautiful Soup encounters a <P> tag, it closes and pops all
the tags up to and including the previously encountered tag of the
same type. This is the default behavior, and this is how
BeautifulStoneSoup treats every tag. It's what you get when a tag is
not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also
what you get when a tag shows up in RESET_NESTING_TAGS but has no
entry in NESTABLE_TAGS, the way the <P> tag does.
>>> pprint(BeautifulSoup.NESTABLE_TAGS)
{'bdo': [],
'blockquote': [],
'center': [],
'dd': ['dl'],
'del': [],
'div': [],
'dl': [],
'dt': ['dl'],
'fieldset': [],
'font': [],
'ins': [],
'li': ['ul', 'ol'],
'object': [],
'ol': [],
'q': [],
'span': [],
'sub': [],
'sup': [],
'table': [],
'tbody': ['table'],
'td': ['tr'],
'tfoot': ['table'],
'th': ['tr'],
'thead': ['table'],
'tr': ['table', 'tbody', 'tfoot', 'thead'],
'ul': []}
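The stack behaviour quoted above can be imitated with a small toy model. This is only a sketch of the idea, not Beautiful Soup's actual code: the function name `toy_reparse` and its `nestable` parameter are made up for illustration, and it handles only bare tags without attributes.

```python
import re

def toy_reparse(markup, nestable=()):
    """Toy model of the tag stack described above (hypothetical helper)."""
    stack, out = [], []

    def close_through(name):
        # Close and pop open tags up to and including the nearest `name`.
        while stack:
            top = stack.pop()
            out.append('</%s>' % top)
            if top == name:
                break

    for token in re.findall(r'</?\w+>|[^<]+', markup):
        if token.startswith('</'):
            name = token[2:-1]
            if name in stack:
                close_through(name)
            # An end tag with no matching open tag is silently dropped.
        elif token.startswith('<'):
            name = token[1:-1]
            if name in stack and name not in nestable:
                close_through(name)
            stack.append(name)
            out.append(token)
        else:
            out.append(token)
    while stack:
        out.append('</%s>' % stack.pop())
    return ''.join(out)

print(toy_reparse('<p><p>123</p></p>'))                   # <p></p><p>123</p>
print(toy_reparse('<p><p>123</p></p>', nestable=('p',)))  # <p><p>123</p></p>
```

With the default settings the second <p> closes the first one and the stray </p> is dropped; marking p as nestable keeps the input intact.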
As a workaround, you can allow a <p> tag to be nested inside another <p>:
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup.NESTABLE_TAGS['p'] = ['p']
>>> BeautifulSoup('<p><p>123</p></p>')
<p><p>123</p></p>
Also, BeautifulSoup 3 is no longer maintained; you should switch to BeautifulSoup 4.
With BeautifulSoup 4, the result depends on the underlying parser, so you can choose a different parser to change the behavior:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p><p>123</p></p>')
<html><body><p></p><p>123</p></body></html>
>>> BeautifulSoup('<p><p>123</p></p>', 'html.parser')
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'html5lib')
<html><head></head><body><p></p><p>123</p><p></p></body></html>
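One way to see why 'html.parser' leaves the nesting alone is to look at the event stream of Python's standard-library html.parser module, which reports tags in the order they appear without repairing them. The sketch below uses only the standard library; the class name `EventLogger` is made up for illustration.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record start tags, end tags and text in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

logger = EventLogger()
logger.feed('<p><p>123</p></p>')
logger.close()
print(logger.events)
```

The two start events arrive back to back, followed by the text and two end events, so a tree builder that simply follows these events ends up with the nested structure.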

Related

Can't get info of a lxml site with Request and BeautifulSoup

I'm trying to build a test project that scrapes information from a specific site, but with no success.
I followed some tutorials I found and even a post on Stack Overflow. After all this I'm stuck!
Please help: I'm new to programming in Python and I don't want to abandon my projects.
More info: this is a lottery website that I was trying to scrape so I could do some analysis and pick a lucky number.
I have followed these tutorials:
https://towardsdatascience.com/how-to-collect-data-from-any-website-cb8fad9e9ec5
https://beautiful-soup-4.readthedocs.io/en/latest/
Using BeautifulSoup in order to find all "ul" and "li" elements
You all have my gratitude!
from bs4 import BeautifulSoup as bs
import requests
import html5lib
#import urllib3 # another attemp to make another req in the url ------failed
url = '''https://loterias.caixa.gov.br/Paginas/Mega-Sena.aspx'''
#another try to take results in the <ul> but I have no qualified results == None
def parse_ul(elem):  # https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.li is None:
            continue
        data = {k: v for k, v in sub.attrs.items()}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.li.get_text(strip=True)] = data
    return result
page = requests.get(url)  # fetch the page
print(page.encoding)  # == UTF-8
soup = bs(page.content, features="lxml")  # parse the response with BeautifulSoup
numbers = soup.find(id='ulDezenas')  # look up this specific id; another attempt: soup.find('ul', {'class': ''})
result = parse_ul(soup)  # try to parse the info, but nothing is found, even with the original function
print(numbers)  # the result is below:
'''<ul class="numbers megasena" id="ulDezenas">
<li ng-repeat="dezena in resultado.listaDezenas ">{{dezena.length > 2 ? dezena.slice(1) : dezena}}</li>
</ul>'''
print(result)# == "{}" nothing found
#with open('''D:\Documents\python\_abretesesame.txt''', 'wb') as fd:
# for chunk in page.iter_content(chunk_size=128):
# fd.write(chunk)
# =======printing document(HTML) in file still no success in getting the numbers

The main issue is that the content is rendered dynamically by JavaScript, but you can get the information from another URL:
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
This will give you the following JSON:
{'tipoJogo': 'MEGA_SENA', 'numero': 2468, 'nomeMunicipioUFSorteio': 'SÃO PAULO, SP', 'dataApuracao': '02/04/2022', 'valorArrecadado': 158184963.0, 'valorEstimadoProximoConcurso': 3000000.0, 'valorAcumuladoProximoConcurso': 0.0, 'valorAcumuladoConcursoEspecial': 36771176.89, 'valorAcumuladoConcurso_0_5': 33463457.98, 'acumulado': False, 'indicadorConcursoEspecial': 1, 'dezenasSorteadasOrdemSorteio': ['022', '041', '053', '042', '035', '057'], 'listaResultadoEquipeEsportiva': None, 'numeroJogo': 2, 'nomeTimeCoracaoMesSorte': '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'tipoPublicacao': 3, 'observacao': '', 'localSorteio': 'ESPAÇO DA SORTE', 'dataProximoConcurso': '06/04/2022', 'numeroConcursoAnterior': 2467, 'numeroConcursoProximo': 2469, 'valorTotalPremioFaixaUm': 0.0, 'numeroConcursoFinal_0_5': 2470, 'listaDezenas': ['022', '035', '041', '042', '053', '057'], 'listaDezenasSegundoSorteio': None, 'listaMunicipioUFGanhadores': [{'posicao': 1, 'ganhadores': 1, 'municipio': 'SANTOS', 'uf': 'SP', 'nomeFatansiaUL': '', 'serie': ''}], 'listaRateioPremio': [{'faixa': 1, 'numeroDeGanhadores': 1, 'valorPremio': 122627171.8, 'descricaoFaixa': '6 acertos'}, {'faixa': 2, 'numeroDeGanhadores': 267, 'valorPremio': 34158.18, 'descricaoFaixa': '5 acertos'}, {'faixa': 3, 'numeroDeGanhadores': 20734, 'valorPremio': 628.38, 'descricaoFaixa': '4 acertos'}], 'id': None, 'ultimoConcurso': True, 'exibirDetalhamentoPorCidade': True, 'premiacaoContingencia': None}
Simply take listaDezenas and process it in a list comprehension that drops the leading zero from three-character entries, as the page's own template does:
[n[1:] if len(n) > 2 else n for n in jsonData['listaDezenas']]
The result will be:
['22', '35', '41', '42', '53', '57']
Example
import requests
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
print([n[1:] if len(n) > 2 else n for n in jsonData['listaDezenas']])
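If the numbers are needed for analysis rather than display, converting them to integers may be more convenient than slicing strings. The following is a minimal offline sketch: the sample list is copied from the JSON shown above instead of being fetched, and the helper names `to_numbers` and `to_labels` are made up for illustration.

```python
# Sample copied from the 'listaDezenas' field of the response shown above.
lista_dezenas = ['022', '035', '041', '042', '053', '057']

def to_numbers(dezenas):
    """Convert zero-padded strings such as '022' to integers."""
    return [int(d) for d in dezenas]

def to_labels(dezenas):
    """Format each number as a two-digit label, e.g. '022' becomes '22'."""
    return ['{:02d}'.format(int(d)) for d in dezenas]

print(to_numbers(lista_dezenas))
print(to_labels(lista_dezenas))
```

The same helpers work on 'dezenasSorteadasOrdemSorteio' when the draw order matters.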

Issues with Python BeautifulSoup parsing

I am trying to parse an html page with BeautifulSoup. The task is to get the data underlined with red color for all the lots on this page. I got the data from the left and the right block (about the lot, auction name, country etc) but getting the data from the central block seems to be problematic for me. Here is the example of what is done.
import requests
import re
from bs4 import BeautifulSoup as bs
import pandas as pd
URL_TEMPLATE = "https://www.artprice.com/artist/15079/wassily-kandinsky/lots/pasts?ipp=100"
FILE_NAME = "test"
def parse(url=URL_TEMPLATE):
    result_list = {'lot': [], 'name': [], 'date': [], 'type1': [], 'type2': [], 'width': [], 'height': [], 'estimate': [], 'hammerprice': [], 'auction_date': [], 'auction': [], 'country': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    lot_info = soup.find_all('p', class_='hidden-xs')
    date_info = soup.find_all('date')
    names_info = soup.find_all('a', class_='sln_lot_show')
    auction_info = soup.find_all('p', class_='visible-xs')
    auction_date_info = soup.find_all(string=re.compile(r'\d\d\s\w\w\w\s\d\d\d\d'))[1::2]
    type1_info = soup.find_all('div')
    for i in range(len(lot_info)):
        result_list['lot'].append(lot_info[i].text)
    for i in range(len(date_info)):
        result_list['date'].append(date_info[i].text)
    for i in range(len(names_info)):
        result_list['name'].append(names_info[i].text)
    # auction name and country alternate in the 'visible-xs' paragraphs
    for i in range(0, len(auction_info), 2):
        result_list['auction'].append(auction_info[i].strong.string)
    for i in range(1, len(auction_info), 2):
        result_list['country'].append(auction_info[i].string)
    for i in range(len(auction_date_info)):
        result_list['auction_date'].append(auction_date_info[i])
    return result_list
df = pd.DataFrame(data=parse())
df.to_excel("test.xlsx")
So, the task is to get the data from the central block separately for each lot on this page.

You need nth-of-type to access all those <p> elements.
This does it for just the first one to show that it works.
I'll leave it to you to clean up the output.
for div in soup.find_all('div',class_='col-xs-8 col-sm-6'):
    print(div.select_one('a').text.strip())
    print(div.select_one('p:nth-of-type(2)').text.strip())
    print(div.select_one('p:nth-of-type(3)').text.strip())
    print(div.select_one('p:nth-of-type(4)').text.strip())
    break
Result:
Abstract
Print-Multiple, Print in colors, 29 1/2 x 31 1/2 in75 x 80 cm
Estimate:
€ 560 - € 784
$ 605 - $ 848
£ 500 - £ 700
￥ 4,303 - ￥ 6,025
Hammer price:
not communicated
not communicated
not communicated
not communicated
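As a possible starting point for cleaning up that output, the description line can be split into its comma-separated parts and the centimetre dimensions pulled out with a regular expression. This is a sketch that assumes every lot follows the same "type, technique, size" layout as the sample line above; the helper name `split_description` is made up, and which dimension is the width is not something the output states.

```python
import re

# Matches the "75 x 80 cm" part of the description.
CM_SIZE = re.compile(r'(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*cm')

def split_description(text):
    """Split one description line into type fields and a size tuple."""
    parts = [part.strip() for part in text.split(',')]
    match = CM_SIZE.search(text)
    return {
        'type1': parts[0],
        # The last part holds the dimensions, so only treat the second
        # part as a type when at least three parts are present.
        'type2': parts[1] if len(parts) > 2 else None,
        'size_cm': (float(match.group(1)), float(match.group(2))) if match else None,
    }

sample = 'Print-Multiple, Print in colors, 29 1/2 x 31 1/2 in75 x 80 cm'
print(split_description(sample))
```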

python regex to split string at <a> elements and extract link + text

Let's say I have several <a> elements in string:
s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'
The goal is to split the string so I get a list similar to the one below:
l = [
"Hello world. ",
{"link": "https://stackoverflow.com/", "title": "StackOverflow"},
" is a great website. ",
{"link": "https://www.espn.com/", "title": "ESPN"},
" is another great website.",
]
The dictionaries can be any object I can extract the link and title from. Is there a regex I can use to accomplish this? Or is there a better way to do this?

BeautifulSoup is a better tool than regex for parsing this string. As a general rule, don't use regex to parse HTML:
s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'
from bs4 import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup(s, 'html.parser')
out = []
for c in soup.contents:
    if isinstance(c, NavigableString):
        # plain text between the links
        out += [c]
    elif isinstance(c, Tag) and c.name == 'a' and 'href' in c.attrs:
        out += [{"link": c['href'], "title": c.text}]
from pprint import pprint
pprint(out)
Prints:
['Hello world. ',
{'link': 'https://stackoverflow.com/', 'title': 'StackOverflow'},
' is a great website. ',
{'link': 'https://www.espn.com/', 'title': 'ESPN'},
' is another great website.']

If you insist on using regex for this:
import re
s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'
sites = [{"link": link, "title": title} for link, title in zip(re.findall(r'<a href="(.*?)">', s), re.findall(r'>(.*?)</a>', s))]
print(sites)
Output:
[{'link': 'https://stackoverflow.com/', 'title': 'StackOverflow'}, {'link': 'https://www.espn.com/', 'title': 'ESPN'}]
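The regex above returns only the links, not the interleaved list the question asks for. A sketch that produces the interleaved form with re.finditer follows. It assumes the anchors look like `<a href="...">text</a>` with a double-quoted href as the first attribute, which is a guess about the input; the name `split_on_anchors` is made up for illustration.

```python
import re

s = ('Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> '
     'is a great website. <a href="https://www.espn.com/">ESPN</a> '
     'is another great website.')

# Assumed anchor shape: href is the first attribute and is double-quoted.
ANCHOR = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.S)

def split_on_anchors(text):
    """Return text chunks and {'link', 'title'} dicts in document order."""
    parts, pos = [], 0
    for m in ANCHOR.finditer(text):
        if m.start() > pos:
            parts.append(text[pos:m.start()])  # text before this link
        parts.append({"link": m.group(1), "title": m.group(2)})
        pos = m.end()
    if pos < len(text):
        parts.append(text[pos:])  # trailing text after the last link
    return parts

print(split_on_anchors(s))
```

This remains fragile for real-world HTML (nested tags, single quotes, extra attributes before href), which is why a parser is usually the safer choice.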

Use regex or something else to capture website data

I'm trying to use python and regex to pull the price in the example website below but am not getting any results.
How can I best capture the price (I don't care about the cents, just the dollar amount)?
http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060
Relevant HTML:
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
What would the regex be to capture the "299", or is there an easier route to get this? Thanks!

With regex, it can be tricky to decide how strict your pattern should be.
I quickly put something together here: https://regex101.com/r/lF5vF2/1
It should give you the idea; modify it to fit your actual needs.
Kind regards
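For readers who cannot open the link, one possible pattern is sketched below. It is not necessarily the pattern behind that link; it simply anchors on the markup quoted in the question and assumes the dollar amount sits between the "$" span and the currency-delimiter span.

```python
import re

html = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""

# Capture the digits (and optional thousands separators) between the two spans.
PRICE = re.compile(
    r'<span class="sup">\$</span>\s*([\d,]+)\s*<span class="currency-delimiter">'
)

match = PRICE.search(html)
dollars = int(match.group(1).replace(',', '')) if match else None
print(dollars)
```

The pattern breaks as soon as the class names or attribute order change, so treat it as a quick fix rather than a robust scraper.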

Don't use regex; use an HTML parser like bs4:
from bs4 import BeautifulSoup
h = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""
soup = BeautifulSoup(h, "html.parser")
amount = soup.select_one("div.price-display.csTile-price span.sup").next_sibling.strip()
Which will give you:
299
Or use the currency-delimiter span and get the previous element:
amount = soup.select_one("span.currency-delimiter").previous.strip()
Which will give you the same. The HTML in your question is also generated dynamically via JavaScript, so you won't get it using urllib.urlopen; it is simply not in the source returned.
You will need something like selenium, or you can mimic the ajax call using requests, as below.
import requests
import json
js = requests.post("http://www.walmart.com/store/ajax/search",
data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
data = json.loads(js['searchResults'])
from pprint import pprint as pp
pp(data)
That gives you some json:
{u'algo': u'polaris',
u'blacklist': False,
u'cluster': {u'apiserver': {u'hostname': u'dfw-iss-api8.stg0',
u'pluginVersion': u'2.3.0'},
u'searchengine': {u'hostname': u'dfw-iss-esd.stg0.mobile.walmart.com'}},
u'count': 1,
u'offset': 0,
u'performance': {u'enrichment': {u'inventory': 70}},
u'query': {u'actualQuery': u'43888060',
u'originalQuery': u'43888060',
u'suggestedQueries': []},
u'queryTime': 181,
u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
u'images': {u'largeUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180',
u'thumbnailUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180'},
u'inventory': {u'isRealTime': True,
u'quantity': 1,
u'status': u'In Stock'},
u'isWWWItem': True,
u'location': {u'aisle': [], u'detailed': []},
u'name': u'Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01',
u'price': {u'currencyUnit': u'USD',
u'isRealTime': True,
u'priceInCents': 29900},
u'productId': {u'WWWItemId': u'43888060',
u'productId': u'2FY1C7B7RMM4',
u'upc': u'88560900430'},
u'ratings': {u'rating': u'4.721',
u'ratingUrl': u'http://i2.walmartimages.com/i/CustRating/4_7.gif'},
u'reviews': {u'reviewCount': u'1436'},
u'score': u'0.507073'}],
u'totalCount': 1}
That gives you a dict with all the info you could need. All you are doing is posting the params and the store number, which you have in the URL, to http://www.walmart.com/store/ajax/search.
To get the price and name:
In [22]: import requests
In [23]: import json
In [24]: js = requests.post("http://www.walmart.com/store/ajax/search",
....: data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
In [25]: data = json.loads(js['searchResults'])
In [26]: res = data["results"][0]
In [27]: print(res["name"])
Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01
In [28]: print(res["price"])
{u'priceInCents': 29900, u'isRealTime': True, u'currencyUnit': u'USD'}
In [29]: print(res["price"]["priceInCents"])
29900
In [30]: print(res["price"]["priceInCents"] // 100)
299
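The cents-to-dollars step can also be done with divmod when both parts are wanted. The sketch below runs offline on a price dict copied from the output above; the helper name `format_price` is made up for illustration.

```python
# Price dict copied from the response shown above.
price = {"priceInCents": 29900, "isRealTime": True, "currencyUnit": "USD"}

def format_price(price):
    """Render priceInCents as '<currency> <dollars>.<cents>'."""
    dollars, cents = divmod(price["priceInCents"], 100)
    return "{} {}.{:02d}".format(price["currencyUnit"], dollars, cents)

whole_dollars = price["priceInCents"] // 100  # the question only needs this part
print(whole_dollars)
print(format_price(price))
```

Integer division keeps the result an int on both Python 2 and Python 3.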

OK, just search for the numeric characters (I added $ and .) and join the results into a string (I used "".join()).
>>> import re
>>> txt = """
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
"""
>>> ''.join(re.findall('[0-9$.]',txt.replace("\n","")))
'$299.00'

Can't access data from a table in a web page

I am trying to access the content of the table on this page, http://www.afl.com.au/afl/stats/player-ratings/overall-standings#, but when I do so using Beautiful Soup in Python, I get the data for the 'All' filter selection rather than for a particular club filter. How can I achieve my goal?
I need to access all the data in the table for the club selected in the filter. Please help me.
Please see the image below.
and also data from all pages:
I have used the following code
from bs4 import BeautifulSoup
import urllib2
import lxml.html
import xlwt
import unicodedata
infoList = []
lLink = "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#club/CD_T140"
header = {'User-Agent': 'Mozilla/5.0'}
req_for_players = urllib2.Request(lLink,headers=header)
page_for_players = urllib2.urlopen(req_for_players)
soup_for_players = BeautifulSoup(page_for_players)
table = soup_for_players.select('table.player-ratings')[0]
for group_header in table.select('tbody tr span'):
    player = group_header.string
    infoList.append(player)
print infoList
The list infoList thus generated contains data corresponding to the "All" filter. But I want data according to the filter I choose.

You don't need to parse the table. Use Firebug or a similar tool to watch the response while clicking a page in the paginator, and you'll see that the site serves the data as JSON. Pure win!
There you can see the URL format for the JSON data:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&pageSize=40&pageNum=2
This way you can fetch even the first page of data without parsing HTML, and you may be able to fetch all the data at once by setting pageSize to a high value.
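Building these URLs by hand gets error-prone once the page number or page size varies. A small sketch with the standard library's urlencode follows; the parameter names come from the URLs in this thread, while the function name `ratings_url` and its defaults are made up for illustration. Nothing is fetched here.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.afl.com.au/api/cfs/afl/playerRatings"

def ratings_url(round_id, page_size=40, page_num=None, team_id=None):
    """Build a playerRatings URL; optional parameters are left out when None."""
    params = {"roundId": round_id}
    if team_id is not None:
        params["teamId"] = team_id
    params["pageSize"] = page_size
    if page_num is not None:
        params["pageNum"] = page_num
    return BASE_URL + "?" + urlencode(params)

print(ratings_url("CD_R201401411", page_num=2))
print(ratings_url("CD_R201401411", page_size=100, team_id="CD_T140"))
```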

When the page loads the first time, all the players are there in the table regardless of what filter you've chosen. The filter is only invoked a short while later. That is why you are getting data on all the players.
Under the hood, the page calls the following URL when the filter is invoked:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T140&pageSize=40&pageNum=1
to get one team's players. CD_T140 is the Western Bulldogs in this case. You can see the other possible values in the selLadderTeam select element. You cannot simply call this URL, however, as you will get a 401 error. Looking at the headers that are sent, one stands out: a token seems to be required. So, using the requests library (it's more user-friendly than urllib2), you can do the following:
>>> import requests
>>> url = "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T40&pageSize=100"
>>> h = {'x-media-mis-token':'e61767b39a7680235476bb33aa946c0e'}
>>> r = requests.get(url, headers=h)
>>> r
<Response [200]>
>>> j = r.json()
>>> len(j['playerRatings'])
45
>>> j['playerRatings'][0]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I260257', u'playerName': {u'givenName': u'Scott', u'surname': u'Pendlebury'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2005', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'OVERALL', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 1, u'ratingType': u'TEAM', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'POSITION', u'ratingPoints': 674}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MIDFIELDER'}
>>> j['playerRatings'][44]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I295012', u'playerName': {u'givenName': u'Matt', u'surname': u'Scharenberg'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2013', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'OVERALL', u'ratingPoints': 0},{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'TEAM', u'ratingPoints': 0}, {u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'POSITION', u'ratingPoints': 0}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MEDIUM_DEFENDER'}
>>>
Notes: I don't know exactly what roundId is. I increased pageSize to a value that would likely return all of a team's players and removed pageNum. They could change the token at any time.
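Once the JSON is available, each record can be flattened into the few fields a table needs. The sketch below works offline on a trimmed copy of the first record shown above; the helper name `summarize` and the choice of output fields are made up for illustration.

```python
# Trimmed copy of the first 'playerRatings' record shown above.
record = {
    "player": {"playerName": {"givenName": "Scott", "surname": "Pendlebury"}},
    "detailedRatings": [
        {"ranking": 2, "ratingType": "OVERALL", "ratingPoints": 674},
        {"ranking": 1, "ratingType": "TEAM", "ratingPoints": 674},
        {"ranking": 2, "ratingType": "POSITION", "ratingPoints": 674},
    ],
    "team": {"teamName": "Collingwood"},
    "position": "MIDFIELDER",
}

def summarize(rec):
    """Flatten one record into name, team, position and overall rating."""
    name = rec["player"]["playerName"]
    # Index the ratings by type so the OVERALL entry can be looked up directly.
    ratings = {r["ratingType"]: r for r in rec["detailedRatings"]}
    overall = ratings.get("OVERALL", {})
    return {
        "name": "{} {}".format(name["givenName"], name["surname"]),
        "team": rec["team"]["teamName"],
        "position": rec["position"],
        "overall_rank": overall.get("ranking"),
        "overall_points": overall.get("ratingPoints"),
    }

print(summarize(record))
```

Applied to the full response, this would be something like `[summarize(r) for r in j['playerRatings']]`.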
