Issues with Python BeautifulSoup parsing

I am trying to parse an HTML page with BeautifulSoup. The task is to get the data underlined in red for all the lots on this page. I have the data from the left and the right blocks (lot number, auction name, country, etc.), but getting the data from the central block is problematic for me. Here is an example of what I have done so far.
import requests
import re
from bs4 import BeautifulSoup as bs
import pandas as pd

URL_TEMPLATE = "https://www.artprice.com/artist/15079/wassily-kandinsky/lots/pasts?ipp=100"
FILE_NAME = "test"


def parse(url=URL_TEMPLATE):
    result_list = {'lot': [], 'name': [], 'date': [], 'type1': [], 'type2': [],
                   'width': [], 'height': [], 'estimate': [], 'hummerprice': [],
                   'auction_date': [], 'auction': [], 'country': []}
    r = requests.get(URL_TEMPLATE)
    soup = bs(r.text, "html.parser")
    lot_info = soup.find_all('p', class_='hidden-xs')
    date_info = soup.find_all('date')
    names_info = soup.find_all('a', class_='sln_lot_show')
    auction_info = soup.find_all('p', class_='visible-xs')
    auction_date_info = soup.find_all(string=re.compile(r'\d\d\s\w\w\w\s\d\d\d\d'))[1::2]
    type1_info = soup.find_all('div')
    for i in range(len(lot_info)):
        result_list['lot'].append(lot_info[i].text)
    for i in range(len(date_info)):
        result_list['date'].append(date_info[i].text)
    for i in range(len(names_info)):
        result_list['name'].append(names_info[i].text)
    for i in range(0, len(auction_info), 2):
        result_list['auction'].append(auction_info[i].strong.string)
    for i in range(1, len(auction_info), 2):
        result_list['country'].append(auction_info[i].string)
    for i in range(len(auction_date_info)):
        result_list['auction_date'].append(auction_date_info[i])
    return result_list


df = pd.DataFrame(data=parse())
df.to_excel("test.xlsx")
So, the task is to get the data from the central block separately for each lot on this page.

You need nth-of-type to access all those <p> elements.
This does it for just the first one to show that it works.
I'll leave it to you to clean up the output.
for div in soup.find_all('div', class_='col-xs-8 col-sm-6'):
    print(div.select_one('a').text.strip())
    print(div.select_one('p:nth-of-type(2)').text.strip())
    print(div.select_one('p:nth-of-type(3)').text.strip())
    print(div.select_one('p:nth-of-type(4)').text.strip())
    break
Result:
Abstract
Print-Multiple, Print in colors, 29 1/2 x 31 1/2 in75 x 80 cm
Estimate:
€ 560 - € 784
$ 605 - $ 848
£ 500 - £ 700
¥ 4,303 - ¥ 6,025
Hammer price:
not communicated
not communicated
not communicated
not communicated
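If you want the same fields for every lot rather than just the first, the same selectors can feed a list of dicts and then a DataFrame. A rough sketch building on the soup object from the question; the column names and the None guard are my own additions, and splitting the dimensions line is still left to you:

import pandas as pd

def txt(el):
    # some lots may be missing one of the <p> blocks, so guard against None
    return el.get_text(" ", strip=True) if el else None

rows = []
for div in soup.find_all('div', class_='col-xs-8 col-sm-6'):
    rows.append({
        'name': txt(div.select_one('a')),
        'details': txt(div.select_one('p:nth-of-type(2)')),
        'estimate': txt(div.select_one('p:nth-of-type(3)')),
        'hammer_price': txt(div.select_one('p:nth-of-type(4)')),
    })

df_center = pd.DataFrame(rows)
print(df_center.head())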

Related

Python to get the JSON data from multiple links

I am trying to extract the JSON data from multiple links, but it looks like I am doing something wrong: I am only getting the details for the last id. How do I get the JSON data for all the links? Also, is it possible to export all the results to a CSV file?
Please kindly guide me.
Here is the code that I am using.
import json
import requests
from bs4 import BeautifulSoup

a_list = [234147, 234548, 232439, 234599, 226672, 234117, 222388]
a_url = 'https://jobs.mycareerportal/careers-home/jobs'
urls = []

for n in a_list:
    kurl = '{}/{}'.format(a_url, n)
    soup = BeautifulSoup(requests.get(kurl).content, "html.parser")

data = [
    json.loads(x.string) for x in soup.find_all("script", type="application/ld+json")
]
for d in data:
    k = str(d['url']) + str(d['jobLocation']['address'])
    urls.append(kurl)
    print(k)
and this is the output that I am getting
PS E:\Python> & C:/Users/KristyG/Anaconda3/python.exe e:/Python/url_append.py
https://jobs.mycareerportal/careers-home/jobs/222388?{'#type': 'PostalAddress', 'addressLocality': 'Panama City', 'addressRegion': 'Florida', 'streetAddress': '4121 Hwy 98', 'postalCode': '32401-1170', 'addressCountry': 'United States'}
PS E:\Python>
Please note, I had to change the website name as I can't share it publicly.
I guess it's just an indentation problem; try nesting the code inside the first for loop, like this:
import json
import requests
from bs4 import BeautifulSoup

a_list = [234147, 234548, 232439, 234599, 226672, 234117, 222388]
a_url = 'https://jobs.mycareerportal/careers-home/jobs'
urls = []

for n in a_list:
    kurl = '{}/{}'.format(a_url, n)
    soup = BeautifulSoup(requests.get(kurl).content, "html.parser")
    data = [
        json.loads(x.string) for x in soup.find_all("script", type="application/ld+json")
    ]
    for d in data:
        k = str(d['url']) + str(d['jobLocation']['address'])
        urls.append(kurl)
        print(k)
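The question also asks about exporting everything to CSV. Since each d is already a dict, the rows can be collected and written out with the csv module. A minimal sketch along those lines; the URL is the anonymized one from the question (so it won't run as-is), and the chosen fieldnames are just an example:

import csv
import json
import requests
from bs4 import BeautifulSoup

a_list = [234147, 234548, 232439, 234599, 226672, 234117, 222388]
a_url = 'https://jobs.mycareerportal/careers-home/jobs'

rows = []
for n in a_list:
    kurl = '{}/{}'.format(a_url, n)
    soup = BeautifulSoup(requests.get(kurl).content, "html.parser")
    for x in soup.find_all("script", type="application/ld+json"):
        d = json.loads(x.string)
        addr = d.get('jobLocation', {}).get('address', {})
        rows.append({
            'url': d.get('url'),
            'locality': addr.get('addressLocality'),
            'region': addr.get('addressRegion'),
        })

# write all collected rows to a CSV file
with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['url', 'locality', 'region'])
    writer.writeheader()
    writer.writerows(rows)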

Can't get info from an lxml site with requests and BeautifulSoup

I'm trying to make a test project that scrapes info from a specific site, but with no success.
I followed some tutorials I found and even a post on Stack Overflow. After all that I'm stuck!
Help me out; I'm a brand new Python programmer and I don't want to give up on this project.
More info: this is a lottery website that I was trying to scrape so I could do some analysis and get a lucky number.
I have followed these tutorials:
https://towardsdatascience.com/how-to-collect-data-from-any-website-cb8fad9e9ec5
https://beautiful-soup-4.readthedocs.io/en/latest/
Using BeautifulSoup in order to find all "ul" and "li" elements
All of you have my gratitude!
from bs4 import BeautifulSoup as bs
import requests
import html5lib
# import urllib3  # another attempt to make the request a different way ------ failed

url = '''https://loterias.caixa.gov.br/Paginas/Mega-Sena.aspx'''


# another try to take the results from the <ul>, but I get no qualified results == None
def parse_ul(elem):  # from https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.li is None:
            continue
        data = {k: v for k, v in sub.attrs.items()}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.li.get_text(strip=True)] = data
    return result


page = requests.get(url)  # taking info from the website
print(page.encoding)  # == UTF-8
soup = bs(page.content, features="lxml")  # parse everything from the url with Beautiful Soup
numbers = soup.find(id='ulDezenas')  # search the content for this specific id // another try: soup.find('ul', {'class': ''})
result = parse_ul(soup)  # try to parse the info, but nothing is found EVEN WITH THE ORIGINAL ONE
print(numbers)  # the result is below:
'''<ul class="numbers megasena" id="ulDezenas">
<li ng-repeat="dezena in resultado.listaDezenas ">{{dezena.length > 2 ? dezena.slice(1) : dezena}}</li>
</ul>'''
print(result)  # == "{}" nothing found
# with open('''D:\Documents\python\_abretesesame.txt''', 'wb') as fd:
#     for chunk in page.iter_content(chunk_size=128):
#         fd.write(chunk)
# ======= writing the document (HTML) to a file; still no success in getting the numbers
The main issue is that the content is provided dynamically by JavaScript, but you can get the information via another URL:
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
This will give you the following JSON:
{'tipoJogo': 'MEGA_SENA', 'numero': 2468, 'nomeMunicipioUFSorteio': 'SÃO PAULO, SP', 'dataApuracao': '02/04/2022', 'valorArrecadado': 158184963.0, 'valorEstimadoProximoConcurso': 3000000.0, 'valorAcumuladoProximoConcurso': 0.0, 'valorAcumuladoConcursoEspecial': 36771176.89, 'valorAcumuladoConcurso_0_5': 33463457.98, 'acumulado': False, 'indicadorConcursoEspecial': 1, 'dezenasSorteadasOrdemSorteio': ['022', '041', '053', '042', '035', '057'], 'listaResultadoEquipeEsportiva': None, 'numeroJogo': 2, 'nomeTimeCoracaoMesSorte': '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'tipoPublicacao': 3, 'observacao': '', 'localSorteio': 'ESPAÇO DA SORTE', 'dataProximoConcurso': '06/04/2022', 'numeroConcursoAnterior': 2467, 'numeroConcursoProximo': 2469, 'valorTotalPremioFaixaUm': 0.0, 'numeroConcursoFinal_0_5': 2470, 'listaDezenas': ['022', '035', '041', '042', '053', '057'], 'listaDezenasSegundoSorteio': None, 'listaMunicipioUFGanhadores': [{'posicao': 1, 'ganhadores': 1, 'municipio': 'SANTOS', 'uf': 'SP', 'nomeFatansiaUL': '', 'serie': ''}], 'listaRateioPremio': [{'faixa': 1, 'numeroDeGanhadores': 1, 'valorPremio': 122627171.8, 'descricaoFaixa': '6 acertos'}, {'faixa': 2, 'numeroDeGanhadores': 267, 'valorPremio': 34158.18, 'descricaoFaixa': '5 acertos'}, {'faixa': 3, 'numeroDeGanhadores': 20734, 'valorPremio': 628.38, 'descricaoFaixa': '4 acertos'}], 'id': None, 'ultimoConcurso': True, 'exibirDetalhamentoPorCidade': True, 'premiacaoContingencia': None}
Simply extract listaDezenas (or dezenasSorteadasOrdemSorteio if you want the draw order) and process it with a list comprehension:
[n if len(n) < 2 else n[1:] for n in jsonData['listaDezenas']]
Result will be:
['22', '35', '41', '42', '53', '57']
Example
import requests
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
print([n if len(n) < 2 else n[1:] for n in jsonData['listaDezenas']])
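The same JSON also carries the draw date and the prize breakdown (listaRateioPremio), so related details can be read without touching the HTML at all. A small sketch that only uses fields visible in the response shown above:

import requests

jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()

# draw number and date, e.g. "Draw 2468 on 02/04/2022"
print('Draw {} on {}'.format(jsonData['numero'], jsonData['dataApuracao']))

# prize tiers: description, number of winners, prize per winner
for faixa in jsonData['listaRateioPremio']:
    print('{}: {} winner(s), R$ {:.2f} each'.format(
        faixa['descricaoFaixa'], faixa['numeroDeGanhadores'], faixa['valorPremio']))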

Scrape Tables on Multiple Pages with Single URL

I am trying to scrape data from Fangraphs. The tables are split into 21 pages, but all of the pages use the same URL. I am very new to web scraping (and to Python in general), but Fangraphs does not have a public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML code, and I am able to scrape the initial table, but it only contains the first 30 players and I want the entire player pool. Two days of web searching and I am stuck. The link and my current code are below. I know they have a link to download the CSV file, but that gets tedious throughout the season and I would like to expedite the data-harvesting process. Any direction would be helpful, thank you.
https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0'
response = requests.get(url, verify=False)

# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')

# locate the projections grid (selector reconstructed from the table id used in the answer below)
stat_table = soup.find_all('table', id='ProjectionBoard1_dg1_ctl00')
# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]

# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)

# convert table to df
table = pd.DataFrame(rows)
import requests
from bs4 import BeautifulSoup
import pandas as pd

params = {
    "pos": "all",
    "stats": "bat",
    "type": "fangraphsdc"
}
data = {
    'RadScriptManager1_TSM': 'ProjectionBoard1$dg1',
    "__EVENTTARGET": "ProjectionBoard1$dg1",
    '__EVENTARGUMENT': 'FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000',
    '__VIEWSTATEGENERATOR': 'C239D6F0',
    '__SCROLLPOSITIONX': '0',
    '__SCROLLPOSITIONY': '1366',
    "ProjectionBoard1_tsStats_ClientState": "{\"selectedIndexes\":[\"0\"],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1_tsPosition_ClientState": "{\"selectedIndexes\":[\"0\"],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1$rcbTeam": "All+Teams",
    "ProjectionBoard1_rcbTeam_ClientState": "",
    "ProjectionBoard1$rcbLeague": "All",
    "ProjectionBoard1_rcbLeague_ClientState": "",
    "ProjectionBoard1_tsProj_ClientState": "{\"selectedIndexes\":[\"5\"],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1_tsUpdate_ClientState": "{\"selectedIndexes\":[],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1$dg1$ctl00$ctl02$ctl00$PageSizeComboBox": "30",
    "ProjectionBoard1_dg1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState": "",
    "ProjectionBoard1$dg1$ctl00$ctl03$ctl01$PageSizeComboBox": "1000",
    "ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState": "{\"logEntries\":[],\"value\":\"1000\",\"text\":\"1000\",\"enabled\":true,\"checkedIndices\":[],\"checkedItemsTextOverflows\":false}",
    "ProjectionBoard1_dg1_ClientState": ""
}


def main(url):
    with requests.Session() as req:
        r = req.get(url, params=params)
        soup = BeautifulSoup(r.content, 'html.parser')
        data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE").get("value")
        data['__EVENTVALIDATION'] = soup.find(
            "input", id="__EVENTVALIDATION").get("value")
        r = req.post(url, params=params, data=data)
        df = pd.read_html(r.content, attrs={
            'id': 'ProjectionBoard1_dg1_ctl00'})[0]
        df.drop(df.columns[1], axis=1, inplace=True)
        print(df)
        df.to_csv("data.csv", index=False)


main("https://www.fangraphs.com/projections.aspx")
Output: the full projections table is printed and written to data.csv.
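The answer hard-codes __VIEWSTATEGENERATOR and several Telerik client-state fields. On ASP.NET WebForms pages like this one it is often more robust to copy every hidden input from the initial GET response and only override the postback fields you need. A sketch of that idea, reusing the event target and argument from the answer above; it is untested against the live site, and some of the non-hidden ClientState fields from the answer may still be required:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.fangraphs.com/projections.aspx"
params = {"pos": "all", "stats": "bat", "type": "fangraphsdc"}

with requests.Session() as s:
    r = s.get(url, params=params)
    soup = BeautifulSoup(r.content, "html.parser")
    # copy every ASP.NET hidden field (__VIEWSTATE, __EVENTVALIDATION, ...)
    payload = {
        inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]")
        if inp.get("name")
    }
    # ask the grid for 1000 rows per page, as in the answer above
    payload["__EVENTTARGET"] = "ProjectionBoard1$dg1"
    payload["__EVENTARGUMENT"] = "FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000"
    r = s.post(url, params=params, data=payload)
    df = pd.read_html(r.content, attrs={"id": "ProjectionBoard1_dg1_ctl00"})[0]
    print(df.shape)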

How to get text from checked boxes using bs4?

I'm trying to get all the labels (text) from the boxes that are checked off (i.e. questions answered) on the following site.
However, I don't seem to get any text out.
Furthermore, the way I thought of doing the scraping was to collect all the links first (on the right side you can switch between the pages). It also seems like this list contains every link twice...
Here is my current code (the link is in there as well, as main_url).
import bs4 as bs
from splinter import Browser
import time

executable_path = {'executable_path': 'C:/users/chromedriver.exe'}
browser = Browser('chrome', **executable_path)

main_url = 'https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/0ad07cdc-cfbc-4c5b-a79f-2b07e93d8521/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1'

browser.visit(main_url)
source = browser.html
soup = bs.BeautifulSoup(source, 'lxml')
base_url = main_url[:-51]
urls = []
print(base_url)

for i in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    for j in soup.find_all('a', class_='tooltiper'):
        urls.append(j['href'])

print(urls)

result = []
for k in urls:
    ext = k[8:]
    browser.visit(base_url + ext)
    source1 = browser.html
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    temp_list = []
    print(browser.url)
    for img in soup1.find_all('img', class_='readradio'):
        for t in img['src']:
            if t == '/Style/img/checkedradio.png':
                for x in soup1.find_all('span', class_='title'):
                    txt = str(x.string)
                    temp_list.append(txt)
    result.append(temp_list)
print(result)
I get the following output for the results list, which is supposed to contain the text:
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]
Updated code with suggestion:
import bs4 as bs
from splinter import Browser
import time

executable_path = {'executable_path': '/users/nichlasrasmussen/documents/webdrivers/phantomjs'}
browser = Browser('phantomjs', **executable_path)

main_url = 'https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/0ad07cdc-cfbc-4c5b-a79f-2b07e93d8521/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1'

browser.visit(main_url)
source = browser.html
soup = bs.BeautifulSoup(source, 'lxml')
base_url = main_url[:-51]
urls = []
print(base_url)

for i in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    for j in soup.find_all('a', class_='tooltiper'):
        urls.append(j['href'])

print(urls)

result = []
for k in urls:
    ext = k[8:]
    browser.visit(base_url + ext)
    source1 = browser.html
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    temp_list = []
    print(browser.url)
    for label in soup1.find_all('label', class_='radio'):
        t = label.find('img', class_='readradio')
        if 'checkedradio' in t['src']:
            content = soup1.find('span', class_='title')
            temp_list.append(content.text)
    result.append(temp_list)
print(result)
You can basically just refer to the parent of both the img and the span.title elements, which is the label.radio.
There is no need to do a huge loop starting from the root (soup1).
Try this:
for label in soup1.find_all('label', class_='radio'):
    t = label.find('img', class_='readradio')
    if t and '/Style/img/checkedradio.png' in t.get('src'):
        content = label.find('span', class_='title')
        temp_list.append(content.text)
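As a side note, the question mentions that the collected link list contains everything twice. That happens because the inner find_all searches the whole soup once per accordion div instead of searching inside it. A small sketch of scoping the search to each div, with an order-preserving deduplication as a fallback, assuming the class names from the question are correct:

urls = []
for div in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    # search inside the current accordion block only, not the whole page
    for a in div.find_all('a', class_='tooltiper'):
        urls.append(a['href'])

# fallback: drop any remaining duplicates while preserving order
urls = list(dict.fromkeys(urls))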

BeautifulSoup fails to parse nested <p> elements

Dependencies:
BeautifulSoup==3.2.1
In: from BeautifulSoup import BeautifulSoup
In: BeautifulSoup('<p><p>123</p></p>')
Out: <p></p><p>123</p>
Why are the two tags no longer nested in the output?
That is just BS3's parser fixing your broken HTML.
The P element represents a paragraph. It cannot contain block-level
elements (including P itself).
This
<p><p>123</p></p>
is not valid HTML. <p> elements can't be nested, so BS tries to clean it up.
When BS encounters the second <p> it thinks the first p is finished, so it inserts a closing </p>. The second </p> in your input then does not match an opening <p> so it is removed.
This is because BeautifulSoup has this NESTABLE_TAGS concept/setting:
When Beautiful Soup is parsing a document, it keeps a stack of open
tags. Whenever it sees a new start tag, it tosses that tag on top of
the stack. But before it does, it might close some of the open tags
and remove them from the stack. Which tags it closes depends on the
qualities of tag it just found, and the qualities of the tags in the
stack.
So when Beautiful Soup encounters a <P> tag, it closes and pops all
the tags up to and including the previously encountered tag of the
same type. This is the default behavior, and this is how
BeautifulStoneSoup treats every tag. It's what you get when a tag is
not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also
what you get when a tag shows up in RESET_NESTING_TAGS but has no
entry in NESTABLE_TAGS, the way the <P> tag does.
>>> pprint(BeautifulSoup.NESTABLE_TAGS)
{'bdo': [],
'blockquote': [],
'center': [],
'dd': ['dl'],
'del': [],
'div': [],
'dl': [],
'dt': ['dl'],
'fieldset': [],
'font': [],
'ins': [],
'li': ['ul', 'ol'],
'object': [],
'ol': [],
'q': [],
'span': [],
'sub': [],
'sup': [],
'table': [],
'tbody': ['table'],
'td': ['tr'],
'tfoot': ['table'],
'th': ['tr'],
'thead': ['table'],
'tr': ['table', 'tbody', 'tfoot', 'thead'],
'ul': []}
As a workaround, you can allow p tag to be inside p:
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup.NESTABLE_TAGS['p'] = ['p']
>>> BeautifulSoup('<p><p>123</p></p>')
<p><p>123</p></p>
Also, BeautifulSoup version 3 is no longer maintained; you should switch to BeautifulSoup 4 (bs4).
When using BeautifulSoup4, you can change the underlying parser to change the behavior:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p><p>123</p></p>')
<html><body><p></p><p>123</p></body></html>
>>> BeautifulSoup('<p><p>123</p></p>', 'html.parser')
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'html5lib')
<html><head></head><body><p></p><p>123</p><p></p></body></html>
