I'm trying to build a test project that scrapes data from a specific site, but with no success.
I followed some tutorials I found and even a post on Stack Overflow. After all this, I'm stuck!
Help me out: I'm new to programming with Python and I don't want this project to stall.
More info: this is a lottery website that I am trying to scrape so I can do some analysis and pick a lucky number.
I have followed these tutorials:
https://towardsdatascience.com/how-to-collect-data-from-any-website-cb8fad9e9ec5
https://beautiful-soup-4.readthedocs.io/en/latest/
Using BeautifulSoup in order to find all "ul" and "li" elements
All of you have my gratitude!
from bs4 import BeautifulSoup as bs
import requests
import html5lib
# import urllib3  # another attempt at requesting the URL; it failed
url = '''https://loterias.caixa.gov.br/Paginas/Mega-Sena.aspx'''
# Another attempt to read the results from the <ul>, but it finds nothing useful (None)
# https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
def parse_ul(elem):
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.li is None:
            continue
        data = {k: v for k, v in sub.attrs.items()}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.li.get_text(strip=True)] = data
    return result
page = requests.get(url)  # fetch the page
print(page.encoding)  # == UTF-8
soup = bs(page.content, features="lxml")  # parse the HTML with Beautiful Soup
numbers = soup.find(id='ulDezenas')  # look up this specific id; another try: soup.find('ul', {'class': ''})
result = parse_ul(soup)  # try to parse the list, but nothing is found, even with the original function
print(numbers)  # The result is below:
'''<ul class="numbers megasena" id="ulDezenas">
<li ng-repeat="dezena in resultado.listaDezenas ">{{dezena.length > 2 ? dezena.slice(1) : dezena}}</li>
</ul>'''
print(result)# == "{}" nothing found
#with open('''D:\Documents\python\_abretesesame.txt''', 'wb') as fd:
#    for chunk in page.iter_content(chunk_size=128):
#        fd.write(chunk)
# Writing the document (HTML) to a file still did not get me the numbers
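The main issue is that the content is rendered dynamically by JavaScript, so the numbers are not in the HTML that requests receives (you only get the unrendered {{dezena...}} template). You can get the information from another URL instead: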
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
will give you the following JSON:
{'tipoJogo': 'MEGA_SENA', 'numero': 2468, 'nomeMunicipioUFSorteio': 'SÃO PAULO, SP', 'dataApuracao': '02/04/2022', 'valorArrecadado': 158184963.0, 'valorEstimadoProximoConcurso': 3000000.0, 'valorAcumuladoProximoConcurso': 0.0, 'valorAcumuladoConcursoEspecial': 36771176.89, 'valorAcumuladoConcurso_0_5': 33463457.98, 'acumulado': False, 'indicadorConcursoEspecial': 1, 'dezenasSorteadasOrdemSorteio': ['022', '041', '053', '042', '035', '057'], 'listaResultadoEquipeEsportiva': None, 'numeroJogo': 2, 'nomeTimeCoracaoMesSorte': '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'tipoPublicacao': 3, 'observacao': '', 'localSorteio': 'ESPAÇO DA SORTE', 'dataProximoConcurso': '06/04/2022', 'numeroConcursoAnterior': 2467, 'numeroConcursoProximo': 2469, 'valorTotalPremioFaixaUm': 0.0, 'numeroConcursoFinal_0_5': 2470, 'listaDezenas': ['022', '035', '041', '042', '053', '057'], 'listaDezenasSegundoSorteio': None, 'listaMunicipioUFGanhadores': [{'posicao': 1, 'ganhadores': 1, 'municipio': 'SANTOS', 'uf': 'SP', 'nomeFatansiaUL': '', 'serie': ''}], 'listaRateioPremio': [{'faixa': 1, 'numeroDeGanhadores': 1, 'valorPremio': 122627171.8, 'descricaoFaixa': '6 acertos'}, {'faixa': 2, 'numeroDeGanhadores': 267, 'valorPremio': 34158.18, 'descricaoFaixa': '5 acertos'}, {'faixa': 3, 'numeroDeGanhadores': 20734, 'valorPremio': 628.38, 'descricaoFaixa': '4 acertos'}], 'id': None, 'ultimoConcurso': True, 'exibirDetalhamentoPorCidade': True, 'premiacaoContingencia': None}
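Simply extract listaDezenas (the numbers in ascending order; dezenasSorteadasOrdemSorteio holds them in draw order) and process it in a list comprehension that drops the leading zero, mirroring the page's own dezena.length > 2 ? dezena.slice(1) : dezena template:
[n[1:] if len(n) > 2 else n for n in jsonData['listaDezenas']]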
Result will be:
['22', '35', '41', '42', '53', '57']
Example
import requests
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
print([n[1:] if len(n) > 2 else n for n in jsonData['listaDezenas']])
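Since the goal is to analyse the numbers, it may be handier to convert them to integers right away. Below is a minimal sketch of that idea; normalize_dezenas is a hypothetical helper name, and the sample dict is just the two relevant keys copied from the JSON above, so nothing is fetched from the network here.

```python
def normalize_dezenas(payload, key="listaDezenas"):
    """Return the drawn numbers as ints, e.g. '022' -> 22.

    `payload` is assumed to be the dict returned by .json() on the API response.
    A missing or null key yields an empty list instead of raising.
    """
    return [int(n) for n in (payload.get(key) or [])]


# Two keys copied from the JSON shown above (not a live request).
sample = {
    "listaDezenas": ["022", "035", "041", "042", "053", "057"],
    "dezenasSorteadasOrdemSorteio": ["022", "041", "053", "042", "035", "057"],
}

sorted_numbers = normalize_dezenas(sample)
draw_order = normalize_dezenas(sample, key="dezenasSorteadasOrdemSorteio")
print(sorted_numbers)
print(draw_order)
```

With the real response you would pass jsonData instead of sample.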
Related
I am trying to extract the JSON data from multiple links, but it looks like I am doing something wrong: I am only getting the details for the last id. How do I get the JSON data for all the links? Also, is it possible to export all the results to a CSV file?
Please kindly guide me.
Here is the code that I am using.
import json
import requests
from bs4 import BeautifulSoup
a_list = [234147,234548,232439,234599,226672,234117,222388]
a_url = 'https://jobs.mycareerportal/careers-home/jobs'
urls = []
for n in a_list:
    kurl = '{}/{}'.format(a_url, n)
    soup = BeautifulSoup(requests.get(kurl).content, "html.parser")
data = [
    json.loads(x.string) for x in soup.find_all("script", type="application/ld+json")
]
for d in data:
    k = str(d['url']) + str(d['jobLocation']['address'])
    urls.append(kurl)
    print(k)
and this is the output that I am getting
PS E:\Python> & C:/Users/KristyG/Anaconda3/python.exe e:/Python/url_append.py
https://jobs.mycareerportal/careers-home/jobs/222388?{'#type': 'PostalAddress', 'addressLocality': 'Panama City', 'addressRegion': 'Florida', 'streetAddress': '4121 Hwy 98', 'postalCode': '32401-1170', 'addressCountry': 'United States'}
PS E:\Python>
Please note, I had to change the website name as I can't share it publicly.
I guess it's just an indentation problem. Try nesting the rest of the code inside the first for loop, like this:
import json
import requests
from bs4 import BeautifulSoup
a_list = [234147,234548,232439,234599,226672,234117,222388]
a_url = 'https://jobs.mycareerportal/careers-home/jobs'
urls = []
for n in a_list:
    kurl = '{}/{}'.format(a_url, n)
    soup = BeautifulSoup(requests.get(kurl).content, "html.parser")
    data = [
        json.loads(x.string) for x in soup.find_all("script", type="application/ld+json")
    ]
    for d in data:
        k = str(d['url']) + str(d['jobLocation']['address'])
        urls.append(kurl)
        print(k)
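For the CSV part of the question, one possible approach is to flatten each JSON-LD dict into a single row and write the rows with the standard csv module. This is only a sketch: flatten_job, write_rows and FIELDS are hypothetical names, the column list is taken from the address keys visible in the output above, and the sample record stands in for the d dicts collected inside the loop.

```python
import csv
import io

# Columns are an assumption based on the address keys shown in the output above.
FIELDS = ["url", "streetAddress", "addressLocality",
          "addressRegion", "postalCode", "addressCountry"]


def flatten_job(d):
    """Flatten one JSON-LD job posting into a single-level dict (hypothetical helper)."""
    address = (d.get("jobLocation") or {}).get("address") or {}
    row = {"url": d.get("url", "")}
    for field in FIELDS[1:]:
        row[field] = address.get(field, "")
    return row


def write_rows(rows, out):
    """Write the flattened rows, preceded by a header line, to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Sample record copied from the output above; in the real script, collect every d.
sample = [{
    "url": "https://jobs.mycareerportal/careers-home/jobs/222388?",
    "jobLocation": {"address": {
        "addressLocality": "Panama City", "addressRegion": "Florida",
        "streetAddress": "4121 Hwy 98", "postalCode": "32401-1170",
        "addressCountry": "United States"}},
}]

buf = io.StringIO()
write_rows([flatten_job(d) for d in sample], buf)
print(buf.getvalue())
```

To write a real file, open it with open("jobs.csv", "w", newline="") and pass that handle instead of buf; the file name is just an example.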
I am trying to parse an html page with BeautifulSoup. The task is to get the data underlined with red color for all the lots on this page. I got the data from the left and the right block (about the lot, auction name, country etc) but getting the data from the central block seems to be problematic for me. Here is the example of what is done.
import requests
import re
from bs4 import BeautifulSoup as bs
import pandas as pd
URL_TEMPLATE = "https://www.artprice.com/artist/15079/wassily-kandinsky/lots/pasts?ipp=100"
FILE_NAME = "test"
def parse(url = URL_TEMPLATE):
    result_list = {'lot': [], 'name': [], 'date': [], 'type1': [], 'type2': [], 'width': [], 'height': [], 'estimate': [], 'hummerprice': [], 'auction_date': [], 'auction': [], 'country': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    lot_info = soup.find_all('p', class_='hidden-xs')
    date_info = soup.find_all('date')
    names_info = soup.find_all('a', class_='sln_lot_show')
    auction_info = soup.find_all('p', class_='visible-xs')
    # raw string so the regex escapes are not treated as string escapes
    auction_date_info = soup.find_all(string=re.compile(r'\d\d\s\w\w\w\s\d\d\d\d'))[1::2]
    type1_info = soup.find_all('div')
    for i in range(len(lot_info)):
        result_list['lot'].append(lot_info[i].text)
    for i in range(len(date_info)):
        result_list['date'].append(date_info[i].text)
    for i in range (len(names_info)):
        result_list['name'].append(names_info[i].text)
    for i in range(0, len(auction_info), 2):
        result_list['auction'].append(soup.find_all('p', class_='visible-xs')[i].strong.string)
    for i in range(1, len(auction_info), 2):
        result_list['country'].append(soup.find_all('p', class_='visible-xs')[i].string)
    for i in range(len(auction_date_info)):
        result_list['auction_date'].append(auction_date_info[i])
    return result_list
df = pd.DataFrame(data=parse())
df.to_excel("test.xlsx")
So, the task is to get the data from the central block separately for each lot on this page.
You need nth-of-type to access all those <p> elements.
This does it for just the first one to show that it works.
I'll leave it to you to clean up the output.
for div in soup.find_all('div',class_='col-xs-8 col-sm-6'):
    print(div.select_one('a').text.strip())
    print(div.select_one('p:nth-of-type(2)').text.strip())
    print(div.select_one('p:nth-of-type(3)').text.strip())
    print(div.select_one('p:nth-of-type(4)').text.strip())
    break
Result:
Abstract
Print-Multiple, Print in colors, 29 1/2 x 31 1/2 in75 x 80 cm
Estimate:
€ 560 - € 784
$ 605 - $ 848
£ 500 - £ 700
¥ 4,303 - ¥ 6,025
Hammer price:
not communicated
not communicated
not communicated
not communicated
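As for cleaning up the output, one way (a sketch, not the only option) is to run each printed line through a regular expression that picks out the currency symbol and the low/high values. The names RANGE_RE and parse_range are hypothetical, and the pattern assumes the "symbol low - symbol high" layout seen in the result above; anything else, such as "Estimate:" or "not communicated", comes back as None.

```python
import re

# Assumed layout: "<symbol> <low> - <symbol> <high>", with optional thousands commas.
RANGE_RE = re.compile(r"^(\D+?)\s*([\d,]+)\s*-\s*\D+?\s*([\d,]+)$")


def parse_range(line):
    """Return (symbol, low, high) for a price-range line, or None if it is not one."""
    match = RANGE_RE.match(line.strip())
    if match is None:
        return None
    symbol, low, high = match.groups()
    return symbol.strip(), int(low.replace(",", "")), int(high.replace(",", ""))


# Lines copied from the result above.
lines = ["Estimate:", "€ 560 - € 784", "¥ 4,303 - ¥ 6,025", "not communicated"]
parsed = [parse_range(line) for line in lines]
print(parsed)
```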
I'm trying to get some items from json content. However, the structure of that json content is foreign to me and as a result I can't fetch the value of property out of it.
I've tried so far with:
import json
import requests
from bs4 import BeautifulSoup
link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'
def fetch_content(link):
    content = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(content.text,"lxml")
    item = soup.select_one("script#hdpApolloPreloadedData").text
    print(json.loads(item)['apiCache'])
if __name__ == '__main__':
    fetch_content(link)
The result I get running the above script is:
{"VariantQuery{\"zpid\":43835884}":{"property":{"zpid":43835884,"streetAddress":"5958 SW 4th St",
which I can't process any further because of that weird key in front.
Expected output:
{"zpid":43835884,"streetAddress":"5958 SW 4th St", ----
How can I get the value of that property?
The apiCache value is itself a JSON-encoded string, so decode it a second time and then index with the mangled key to get the zpid and the address (item is the script text from your code):
json.loads(json.loads(item)['apiCache'])['VariantQuery{"zpid":43835884}']['property']['zpid']
Out[1889]: 43835884
json.loads(json.loads(item)['apiCache'])['VariantQuery{"zpid":43835884}']['property']['streetAddress']
Out[1890]: '5958 SW 4th St'
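To see why two json.loads calls are needed, here is a small offline sketch. The item string is a hypothetical, heavily shortened stand-in for the script contents, built so that apiCache holds a JSON string rather than an object, as in the question's output. find_property is a hypothetical helper that avoids hard-coding the zpid in the key.

```python
import json

# Hypothetical stand-in for the script text: apiCache is a JSON *string*.
inner = {
    'VariantQuery{"zpid":43835884}': {
        "property": {"zpid": 43835884, "streetAddress": "5958 SW 4th St"}
    }
}
item = json.dumps({"apiCache": json.dumps(inner)})

outer = json.loads(item)               # first decode: the outer object
cache = json.loads(outer["apiCache"])  # second decode: the embedded string


def find_property(cache):
    """Return the first 'property' dict stored under a VariantQuery... key."""
    for key, value in cache.items():
        if key.startswith("VariantQuery") and "property" in value:
            return value["property"]
    return None


prop = find_property(cache)
print(prop["zpid"], prop["streetAddress"])
```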
I noticed you can always get the zpid like this:
link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'
content = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(content.text,"lxml")
item = soup.select_one("script#hdpApolloPreloadedData").text
print(json.loads(item)['zpid'])
Just modify your function to the following. I also added another function (process_fetched_content()) to give you some more freedom. You could simply run it and it will take care of situations even when you have multiple keys that start with 'VariantQuery{"zpid":'. The final output is a dict with the keys being your zpid and the values being what you are looking for.
If you have a lot of zpid values, then this will let you accumulate them all together and then process them. The benefit is the list of keys is then the list of zpids you have.
Here's how you could use this code.
results = process_fetched_content(raw_dictionary = fetch_content(link, verbose=False))
print(results)
output:
{'43835884': {'zpid': 43835884, 'streetAddress': '5958 SW 4th St', 'zipcode': '33144', 'city': 'Miami', 'state': 'FL', 'latitude': 25.76661, 'longitude': -80.292801, 'price': 340000, 'dateSold': 1576875600000, 'bathrooms': 2, 'bedrooms': 3, 'livingArea': 1757, 'yearBuilt': 1973, 'lotSize': 4331, 'homeType': 'SINGLE_FAMILY', 'homeStatus': 'RECENTLY_SOLD', 'photoCount': 19, 'imageLink': 'https://photos.zillowstatic.com/p_g/IS7yxihwtuqmlq1000000000.jpg', 'daysOnZillow': 0, 'isFeatured': False, 'shouldHighlight': False, 'brokerId': 0, 'zestimate': 341336, 'rentZestimate': 2200, 'listing_sub_type': {}, 'priceReduction': '', 'isUnmappable': False, 'rentalPetsFlags': 128, 'mediumImageLink': 'https://photos.zillowstatic.com/p_c/IS7yxihwtuqmlq1000000000.jpg', 'isPreforeclosureAuction': False, 'homeStatusForHDP': 'RECENTLY_SOLD', 'priceForHDP': 340000, 'festimate': 341336, 'isListingOwnedByCurrentSignedInAgent': False, 'isListingClaimedByCurrentSignedInUser': False, 'hiResImageLink': 'https://photos.zillowstatic.com/p_f/IS7yxihwtuqmlq1000000000.jpg', 'watchImageLink': 'https://photos.zillowstatic.com/p_j/IS7yxihwtuqmlq1000000000.jpg', 'tvImageLink': 'https://photos.zillowstatic.com/p_m/IS7yxihwtuqmlq1000000000.jpg', 'tvCollectionImageLink': 'https://photos.zillowstatic.com/p_l/IS7yxihwtuqmlq1000000000.jpg', 'tvHighResImageLink': 'https://photos.zillowstatic.com/p_n/IS7yxihwtuqmlq1000000000.jpg', 'zillowHasRightsToImages': True, 'desktopWebHdpImageLink': 'https://photos.zillowstatic.com/p_h/IS7yxihwtuqmlq1000000000.jpg', 'isNonOwnerOccupied': False, 'hideZestimate': False, 'isPremierBuilder': False, 'isZillowOwned': False, 'currency': 'USD', 'country': 'USA', 'taxAssessedValue': 224131, 'streetAddressOnly': '5958 SW 4th St', 'unit': ' '}}
Code
import json
import requests
from bs4 import BeautifulSoup
link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'
def fetch_content(link, verbose=False):
    content = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(content.text,"lxml")
    item = soup.select_one("script#hdpApolloPreloadedData").text
    # apiCache is itself a JSON string, so it has to be decoded a second time
    d = json.loads(item)['apiCache']
    d = json.loads(d)
    if verbose:
        print(d)
    return d
def process_fetched_content(raw_dictionary=None):
    if raw_dictionary is not None:
        keys = [k for k in raw_dictionary.keys() if k.startswith('VariantQuery{"zpid":')]
        # map each zpid (taken from the key) to its 'property' payload
        results = dict((k.split(':')[-1].replace('}',''), raw_dictionary.get(k).get('property', None)) for k in keys)
        return results
    else:
        return None
I need to parse
http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0
If you view source of the above url you will find
Expected Output:
fvRequests= css
fvRequests=7
import re
import urllib.request
if __name__ == "__main__":
    url = 'http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0'
    # http request
    response = urllib.request.urlopen(url)
    # assuming the page is UTF-8 encoded
    html = response.read().decode('utf-8')
    response.close()
    # finding values in html
    results = re.findall(r'fvRequests\.setValue\(\d+, \d+, \'?(.*?)\'?\);', html)
    keys = results[::2]
    values = results[1::2]
    # creating a dictionary
    output = dict(zip(keys, values))
    print(output)
The idea is to locate the script with BeautifulSoup, then use a regular expression pattern to find the fvRequests.setValue() calls and extract the value of the third argument:
import re
from bs4 import BeautifulSoup
import requests
pattern = re.compile(r"fvRequests\.setValue\(\d+, \d+, '?(\w+)'?\);")
response = requests.get("http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0")
soup = BeautifulSoup(response.content, "html.parser")
script = soup.find("script", string=lambda x: x and "fvRequests.setValue" in x).text
print(re.findall(pattern, script))
Prints:
[u'css', u'7', u'flash', u'0', u'font', u'0', u'html', u'14', u'image', u'80', u'js', u'35', u'other', u'14']
You can go further and pack the list into a dict (solution taken from here), where data is the list returned by re.findall:
dict(zip(*([iter(data)] * 2)))
which would produce:
{
'image': '80',
'flash': '0',
'js': '35',
'html': '14',
'font': '0',
'other': '14',
'css': '7'
}
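The whole approach can be tried offline with a small sketch. The script text below is a hypothetical sample: its exact shape is an assumption inferred from the regular expression and the expected output, not copied from the page. The pairing step uses a single iterator consumed two items at a time, which is the same idea as the zip trick above, and it converts the counts to int along the way.

```python
import re

# Hypothetical sample of the inline script; the real page may differ.
script = """
fvRequests.setValue(0, 0, 'css');
fvRequests.setValue(0, 1, 7);
fvRequests.setValue(1, 0, 'flash');
fvRequests.setValue(1, 1, 0);
"""

pattern = re.compile(r"fvRequests\.setValue\(\d+, \d+, '?(\w+)'?\);")
data = pattern.findall(script)  # alternating names and counts

# Consume the same iterator twice per step to pair name with count.
it = iter(data)
pairs = {name: int(count) for name, count in zip(it, it)}
print(pairs)
```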
I am trying to access the content of the table on this page: http://www.afl.com.au/afl/stats/player-ratings/overall-standings# However, when I do so using Beautiful Soup in Python, I get the data for the 'All' filter selection, not for a particular club filter. How can I achieve my goal?
I need to access all the data in the table for the club selected in the filter. Please help me.
Please see the image below.
and also data from all pages:
I have used the following code
from bs4 import BeautifulSoup
import urllib2
import lxml.html
import xlwt
import unicodedata
infoList = []
lLink = "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#club/CD_T140"
header = {'User-Agent': 'Mozilla/5.0'}
req_for_players = urllib2.Request(lLink,headers=header)
page_for_players = urllib2.urlopen(req_for_players)
soup_for_players = BeautifulSoup(page_for_players)
table = soup_for_players.select('table.player-ratings')[0]
for group_header in table.select('tbody tr span'):
    player = group_header.string
    infoList.append(player)
print infoList
The list infoList thus generated contains data corresponding to the "All" filter. But I want data according to the filter I choose.
You don't need to parse the table. Use Firebug or a similar tool (such as your browser's developer tools) to watch the responses while clicking a page in the paginator, and you'll see that the site serves the data as JSON. Pure win!
There you can see the URL format for JSON data:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&pageSize=40&pageNum=2
This way you can fetch even the first page of data without parsing any HTML, and you may be able to fetch all the data at once by setting pageSize to some high value.
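If you want to walk through the pages, the URLs can be generated rather than typed by hand. The sketch below only builds the URLs and does not send any request; ratings_url is a hypothetical helper, and the parameter names (roundId, teamId, pageSize, pageNum) are the ones visible in the URLs in this thread.

```python
from urllib.parse import urlencode

BASE = "http://www.afl.com.au/api/cfs/afl/playerRatings"


def ratings_url(round_id, page_num=1, page_size=40, team_id=None):
    """Build one playerRatings URL; team_id is optional (hypothetical helper)."""
    params = {"roundId": round_id}
    if team_id is not None:
        params["teamId"] = team_id
    params["pageSize"] = page_size
    params["pageNum"] = page_num
    return BASE + "?" + urlencode(params)


# URLs for the first three pages of the round used in this thread.
page_urls = [ratings_url("CD_R201401411", page_num=n) for n in range(1, 4)]
for u in page_urls:
    print(u)
```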
When the page loads the first time, all the players are there in the table regardless of what filter you've chosen. The filter is only invoked a short while later. That is why you are getting data on all the players.
Under the hood, the page calls the following URL when the filter is invoked:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T140&pageSize=40&pageNum=1
to get one team's players. CD_T140 is the Western Bulldogs in this case (the session below queries CD_T40, Collingwood). You can see the possible values in the selLadderTeam select element. You cannot simply call this URL, however, as you will get a 401 error. Looking at the headers that are sent, one stands out: a token seems to be required. So, using the requests library (it's more user-friendly than urllib2), you can do the following:
>>> import requests
>>> url = "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T40&pageSize=100"
>>> h = {'x-media-mis-token':'e61767b39a7680235476bb33aa946c0e'}
>>> r = requests.get(url, headers=h)
>>> r
<Response [200]>
>>> j = r.json()
>>> len(j['playerRatings'])
45
>>> j['playerRatings'][0]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I260257', u'playerName': {u'givenName': u'Scott', u'surname': u'Pendlebury'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2005', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'OVERALL', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 1, u'ratingType': u'TEAM', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'POSITION', u'ratingPoints': 674}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MIDFIELDER'}
>>> j['playerRatings'][44]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I295012', u'playerName': {u'givenName': u'Matt', u'surname': u'Scharenberg'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2013', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'OVERALL', u'ratingPoints': 0},{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'TEAM', u'ratingPoints': 0}, {u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'POSITION', u'ratingPoints': 0}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MEDIUM_DEFENDER'}
>>>
Notes: I don't know exactly what roundId is. I increased pageSize to a value that should return all of a team's players and removed pageNum. They could change the token at any time.
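Once you have the JSON, each entry in playerRatings can be reduced to the few fields you care about. A minimal sketch, assuming every entry has the shape shown in the session above; summarize is a hypothetical helper, and the sample is a shortened copy of the first entry, so no request is made here.

```python
def summarize(rating):
    """Pick name, team, position and the OVERALL rating out of one entry."""
    name = rating["player"]["playerName"]
    overall = next(
        (r for r in rating["detailedRatings"] if r["ratingType"] == "OVERALL"),
        None,
    )
    return {
        "name": "{} {}".format(name["givenName"], name["surname"]),
        "team": rating["team"]["teamName"],
        "position": rating["position"],
        "overall_points": overall["ratingPoints"] if overall else None,
        "overall_ranking": overall["ranking"] if overall else None,
    }


# Shortened copy of j['playerRatings'][0] from the session above.
sample = {
    "player": {"playerName": {"givenName": "Scott", "surname": "Pendlebury"}},
    "detailedRatings": [
        {"ranking": 2, "ratingType": "OVERALL", "ratingPoints": 674},
        {"ranking": 1, "ratingType": "TEAM", "ratingPoints": 674},
    ],
    "team": {"teamId": "CD_T40", "teamName": "Collingwood"},
    "position": "MIDFIELDER",
}

summary = summarize(sample)
print(summary)
```

With the real response, [summarize(p) for p in j['playerRatings']] would give one row per player.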