Can't access data from a table in a web page - python

I am trying to access content from the table on this page: http://www.afl.com.au/afl/stats/player-ratings/overall-standings#. However, when I do so using Beautiful Soup in Python, I get the data for the 'All' filter selection, not for a particular club filter. How can I get all the data from the table for the club I choose in the filter, and also the data from all pages? Please help me.
I have used the following code
from bs4 import BeautifulSoup
import urllib2
import lxml.html
import xlwt
import unicodedata
infoList = []
lLink = "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#club/CD_T140"
header = {'User-Agent': 'Mozilla/5.0'}
req_for_players = urllib2.Request(lLink,headers=header)
page_for_players = urllib2.urlopen(req_for_players)
soup_for_players = BeautifulSoup(page_for_players)
table = soup_for_players.select('table.player-ratings')[0]
for group_header in table.select('tbody tr span'):
    player = group_header.string
    infoList.append(player)
print infoList
The list infoList thus generated contains data corresponding to the "All" filter. But I want data according to the filter I choose.

You don't need to parse the table - use Firebug or any similar tool to watch the responses while clicking through the paginator, and you'll see that the site serves you the data as JSON. Pure win!
There you can see the URL format for the JSON data:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&pageSize=40&pageNum=2
This way you can fetch even the first page of data without parsing HTML, and you may be able to fetch all the data at once by setting the pageSize parameter to some high value.
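A minimal sketch of that idea (the large pageSize is a guess, and as the answer below found, an x-media-mis-token header may be required, so an unauthenticated call can come back 401):

import requests

# assumption: pageSize=1000 just aims to grab everything in one call
url = "http://www.afl.com.au/api/cfs/afl/playerRatings"
params = {'roundId': 'CD_R201401411', 'pageSize': 1000, 'pageNum': 1}
r = requests.get(url, params=params)
if r.status_code == 200:
    for rating in r.json()['playerRatings']:
        print(rating['player']['playerName']['surname'])
else:
    print(r.status_code)  # likely 401 without the token header described below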

When the page loads the first time, all the players are in the table regardless of which filter you've chosen; the filter is only applied a short while later. That is why you are getting data on all the players.
Under the hood, the page calls the following URL when the filter is invoked:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T140&pageSize=40&pageNum=1
to get one team's players. CD_T140 is the Western Bulldogs in this case. You can see the other possible values in the selLadderTeam select element. You cannot simply call this URL, however, as you will get a 401 error. Looking at the headers that are sent across, one stands out: a token seems to be required. So, using the requests library (it's more user-friendly than urllib2), you can do the following:
>>> import requests
>>> url = "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T40&pageSize=100"
>>> h = {'x-media-mis-token':'e61767b39a7680235476bb33aa946c0e'}
>>> r = requests.get(url, headers=h)
>>> r
<Response [200]>
>>> j = r.json()
>>> len(j['playerRatings'])
45
>>> j['playerRatings'][0]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I260257', u'playerName': {u'givenName': u'Scott', u'surname': u'Pendlebury'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2005', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'OVERALL', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 1, u'ratingType': u'TEAM', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'POSITION', u'ratingPoints': 674}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MIDFIELDER'}
>>> j['playerRatings'][44]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I295012', u'playerName': {u'givenName': u'Matt', u'surname': u'Scharenberg'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2013', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'OVERALL', u'ratingPoints': 0},{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'TEAM', u'ratingPoints': 0}, {u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'POSITION', u'ratingPoints': 0}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MEDIUM_DEFENDER'}
>>>
Notes: I don't know exactly what roundId represents. I increased pageSize to a value likely to return all of a team's players and removed pageNum. The token could change at any time.
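To repeat this for every club, a hedged sketch that scrapes the team IDs from the selLadderTeam element mentioned above and then queries the API once per team (assuming each option's value attribute carries a CD_T... ID and the token is still live):

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.afl.com.au/afl/stats/player-ratings/overall-standings")
soup = BeautifulSoup(page.text)
# assumption: the <option> values of the team filter hold CD_T... ids
team_ids = [o['value'] for o in soup.select('#selLadderTeam option') if o.get('value')]

api = "http://www.afl.com.au/api/cfs/afl/playerRatings"
h = {'x-media-mis-token': 'e61767b39a7680235476bb33aa946c0e'}  # may be rotated at any time
for team_id in team_ids:
    params = {'roundId': 'CD_R201401411', 'teamId': team_id, 'pageSize': 100}
    r = requests.get(api, params=params, headers=h)
    if r.ok:
        print(team_id, len(r.json()['playerRatings']))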

Related

Can't get info of a lxml site with Request and BeautifulSoup

I'm trying to make a test project that scrapes info from a specific site, but with no success.
I followed some tutorials I found, and even a post on Stack Overflow, and after all this I'm stuck!
Please help me - I'm a new Python programmer and I can't get my project moving.
More info: this is a lottery website that I'm trying to scrape so I can run some analysis and pick a lucky number.
I have followed these tutorials:
https://towardsdatascience.com/how-to-collect-data-from-any-website-cb8fad9e9ec5
https://beautiful-soup-4.readthedocs.io/en/latest/
Using BeautifulSoup in order to find all "ul" and "li" elements
You all have my gratitude!
from bs4 import BeautifulSoup as bs
import requests
import html5lib
# import urllib3  # another attempt to make the request a different way ------ failed

url = '''https://loterias.caixa.gov.br/Paginas/Mega-Sena.aspx'''

# another try to get the results from the <ul>, but I get no qualified results == None
def parse_ul(elem):  # https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.li is None:
            continue
        data = {k: v for k, v in sub.attrs.items()}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.li.get_text(strip=True)] = data
    return result

page = requests.get(url)  # taking info from the website
print(page.encoding)  # == UTF-8
soup = bs(page.content, features="lxml")  # parses all info from the url == Beautiful Soup
numbers = soup.find(id='ulDezenas')  # search the content for this specific id; another try: soup.find('ul', {'class': ''})
result = parse_ul(soup)  # try to parse info, but none is found EVEN WITH THE ORIGINAL ONE
print(numbers)  # The result is below:
'''<ul class="numbers megasena" id="ulDezenas">
<li ng-repeat="dezena in resultado.listaDezenas ">{{dezena.length > 2 ? dezena.slice(1) : dezena}}</li>
</ul>'''
print(result)  # == "{}" nothing found
# with open('''D:\Documents\python\_abretesesame.txt''', 'wb') as fd:
#     for chunk in page.iter_content(chunk_size=128):
#         fd.write(chunk)
# ======= printed the document (HTML) to a file; still no success in getting the numbers
The main issue is that the content is provided dynamically by JavaScript, but you can get the information via another URL:
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
which will give you the following JSON:
{'tipoJogo': 'MEGA_SENA', 'numero': 2468, 'nomeMunicipioUFSorteio': 'SÃO PAULO, SP', 'dataApuracao': '02/04/2022', 'valorArrecadado': 158184963.0, 'valorEstimadoProximoConcurso': 3000000.0, 'valorAcumuladoProximoConcurso': 0.0, 'valorAcumuladoConcursoEspecial': 36771176.89, 'valorAcumuladoConcurso_0_5': 33463457.98, 'acumulado': False, 'indicadorConcursoEspecial': 1, 'dezenasSorteadasOrdemSorteio': ['022', '041', '053', '042', '035', '057'], 'listaResultadoEquipeEsportiva': None, 'numeroJogo': 2, 'nomeTimeCoracaoMesSorte': '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'tipoPublicacao': 3, 'observacao': '', 'localSorteio': 'ESPAÇO DA SORTE', 'dataProximoConcurso': '06/04/2022', 'numeroConcursoAnterior': 2467, 'numeroConcursoProximo': 2469, 'valorTotalPremioFaixaUm': 0.0, 'numeroConcursoFinal_0_5': 2470, 'listaDezenas': ['022', '035', '041', '042', '053', '057'], 'listaDezenasSegundoSorteio': None, 'listaMunicipioUFGanhadores': [{'posicao': 1, 'ganhadores': 1, 'municipio': 'SANTOS', 'uf': 'SP', 'nomeFatansiaUL': '', 'serie': ''}], 'listaRateioPremio': [{'faixa': 1, 'numeroDeGanhadores': 1, 'valorPremio': 122627171.8, 'descricaoFaixa': '6 acertos'}, {'faixa': 2, 'numeroDeGanhadores': 267, 'valorPremio': 34158.18, 'descricaoFaixa': '5 acertos'}, {'faixa': 3, 'numeroDeGanhadores': 20734, 'valorPremio': 628.38, 'descricaoFaixa': '4 acertos'}], 'id': None, 'ultimoConcurso': True, 'exibirDetalhamentoPorCidade': True, 'premiacaoContingencia': None}
Simply extract listaDezenas and process it with a list comprehension:
[n if len(n) < 2 else n[1:] for n in jsonData['listaDezenas']]
Result will be:
['22', '35', '41', '42', '53', '57']
Example
import requests
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
print([n if len(n) < 2 else n[1:] for n in jsonData['listaDezenas']])
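The same payload also carries the numbers in the order they came out of the drum, under dezenasSorteadasOrdemSorteio (visible in the JSON above), so the same stripping trick covers both lists:

import requests

jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()

def strip_leading(ns):
    # same trick as above: '022' -> '22'
    return [n if len(n) < 2 else n[1:] for n in ns]

print(strip_leading(jsonData['listaDezenas']))                 # e.g. ['22', '35', '41', '42', '53', '57']
print(strip_leading(jsonData['dezenasSorteadasOrdemSorteio'])) # e.g. ['22', '41', '53', '42', '35', '57']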

Beautiful soup - html parser returns dots instead of string visible on web

I'm trying to get the number of actors from: https://apify.com/store which is under the following HTML:
<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
When I send a GET request and parse the response with BeautifulSoup using:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
I get three dots ... instead of the number 895
the element is <span class="ActorStore-statusNbHitsNumber">...</span>
How can I get the number?
If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the data is loaded dynamically by sending a POST request. You can mimic that request by sending the correct JSON data; there's no need for BeautifulSoup, the requests module alone will do.
Here is a complete working example:
import requests

data = {
    "query": "",
    "page": 0,
    "hitsPerPage": 24,
    "restrictSearchableAttributes": [],
    "attributesToHighlight": [],
    "attributesToRetrieve": [
        "title",
        "name",
        "username",
        "userFullName",
        "stats",
        "description",
        "pictureUrl",
        "userPictureUrl",
        "notice",
        "currentPricingInfo",
    ],
}

response = requests.post(
    "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    json=data,
)
print(response.json()["nbHits"])
Output:
895
To view all the JSON data in order to access the key/value pairs, you can use:
from pprint import pprint
pprint(response.json(), indent=4)
Partial output:
{   'exhaustiveNbHits': True,
    'exhaustiveTypo': True,
    'hits': [   {   'currentPricingInfo': None,
                    'description': 'Crawls arbitrary websites using the Chrome '
                                   'browser and extracts data from pages using '
                                   'a provided JavaScript code. The actor '
                                   'supports both recursive crawling and lists '
                                   'of URLs and automatically manages '
                                   'concurrency for maximum performance. This '
                                   "is Apify's basic tool for web crawling and "
                                   'scraping.',
                    'name': 'web-scraper',
                    'objectID': 'moJRLRc85AitArpNN',
                    'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
                    'stats': {   'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
                                 'totalBuilds': 104,
                                 'totalMetamorphs': 102660,
                                 'totalRuns': 68036112,
                                 'totalUsers': 23492,
                                 'totalUsers30Days': 1726,
                                 'totalUsers7Days': 964,
                                 'totalUsers90Days': 3205},
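If you need the hits themselves rather than just the count, a sketch that pages through the index by bumping the page field already present in the request (same endpoint and API key as above, so it works only as long as those stay valid):

import math
import requests

url = ("https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query"
       "?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)"
       "&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525"
       "&x-algolia-application-id=OW0O5I3QO7")

names = []
page, hits_per_page = 0, 100
while True:
    payload = {"query": "", "page": page, "hitsPerPage": hits_per_page,
               "attributesToRetrieve": ["title", "name"]}
    result = requests.post(url, json=payload).json()
    names.extend(hit["name"] for hit in result["hits"])
    # nbHits is in the response (shown above); stop once every page is seen
    if page >= math.ceil(result["nbHits"] / hits_per_page) - 1:
        break
    page += 1
print(len(names))  # should match nbHits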

How to access JSON elements in python swiftly and efficiently?

I am calling an API provided by Steam to get user details. I am using Python's requests and JSON library to call this. My code:
import requests
import json
response = requests.get("http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key=xxxxxxxxxxxxxxxxxxxxx&steamids=76561198330357188")
data = response.json()
print(data['response'])
The output comes:
{'players': [{'steamid': '76561198330357188', 'communityvisibilitystate': 3, 'profilestate': 1, 'personaname': 'saditstar', 'profileurl': 'https://steamcommunity.com/id/saditrahman/', 'avatar': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/f8/f8d41f4064e1df34b1b5c439e775e222fb171ed3.jpg', 'avatarmedium': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/f8/f8d41f4064e1df34b1b5c439e775e222fb171ed3_medium.jpg', 'avatarfull': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/f8/f8d41f4064e1df34b1b5c439e775e222fb171ed3_full.jpg', 'avatarhash': 'f8d41f4064e1df34b1b5c439e775e222fb171ed3', 'lastlogoff': 1628598684, 'personastate': 1, 'realname': 'Kowsar Rahman Sadit', 'primaryclanid': '103582791429521408', 'timecreated': 1473564722, 'personastateflags': 0, 'loccountrycode': 'BD'}]}
My simple question is: how can I access elements such as profilestate or personaname?
You can access these values as in the following sample.
>>> vals = {'players': [{'steamid': '76561198330357188', 'communityvisibilitystate': 3, 'profilestate': 1, 'personaname': 'saditstar', 'profileurl': 'https://steamcommunity.com/id/saditrahman/', 'avatar': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/f8/f8d41f4064e1df34b1b5c439e775e222fb171ed3.jpg', 'avatarmedium': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/f8/f8d41f4064e1df34b1b5c439e775e222fb171ed3_medium.jpg', 'avatarfull': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/f8/f8d41f4064e1df34b1b5c439e775e222fb171ed3_full.jpg', 'avatarhash': 'f8d41f4064e1df34b1b5c439e775e222fb171ed3', 'lastlogoff': 1628598684, 'personastate': 1, 'realname': 'Kowsar Rahman Sadit', 'primaryclanid': '103582791429521408', 'timecreated': 1473564722, 'personastateflags': 0, 'loccountrycode': 'BD'}]}
>>> [item.get('profilestate') for item in vals['players'] ]
[1]
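Since data['response']['players'] is just a list with one dict per requested steamid, you can also index into it directly (using the response shown in the question):

player = data['response']['players'][0]
print(player['profilestate'])  # 1
print(player['personaname'])   # saditstar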

EBAY Finding API Date Filtering

I am trying to return a list of completed items in a given category using the eBay API. My code seems to be working, however the results are very limited (about 100). I assumed there would be some limitation on how far back the API would go, but even just a few days should return thousands of results for this category. Am I missing something in the code, or is this just a limitation of the eBay API? I did make sure I was using production and not the sandbox.
Update: I have now realized that there are multiple pages to my query, up to the 100 items per page / 100 pages max. I am now running into issues with the date filtering. I see the filter reference material on the site, but I am still not getting the results I expect. In the updated query I am trying to pull only items completed yesterday, but when I run it I get items from today. Is there a better way to input the date filters?
from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import os
import csv

api = finding(appid=<my appid>, config_file=None)
response = api.execute(
    'findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'endTimeFrom': '2020-02-03T00:00:00.000Z',
        'endTimeTo': '2020-02-04T00:00:00.000Z',
        'paginationInput': {
            'entriesPerPage': '100',
            'pageNumber': '1'
        },
        'sortOrder': 'EndTimeSoonest'
    }
)

soup = BeautifulSoup(response.content, 'lxml')
totalitems = int(soup.find('totalentries').text)
items = soup.find_all('item')
for item in response.reply.searchResult.item:
    print(item.itemId)
    print(item.listingInfo.endTime)
I finally figured this out. I needed to add additional code for the item filters. The working code is below.
from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import os
import csv

api = finding(appid=<my appid>, config_file=None)
response = api.execute(
    'findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'itemFilter': [
            {'name': 'EndTimeFrom', 'value': '2020-02-03T00:00:00.000Z'},
            {'name': 'EndTimeTo', 'value': '2020-02-04T00:00:00.000Z'}
            # {'name': 'MinPrice', 'value': '200', 'paramName': 'Currency', 'paramValue': 'GBP'},
            # {'name': 'MaxPrice', 'value': '400', 'paramName': 'Currency', 'paramValue': 'GBP'}
        ],
        'paginationInput': {
            'entriesPerPage': '100',
            'pageNumber': '100'
        },
        'sortOrder': 'EndTimeSoonest'
    }
)

soup = BeautifulSoup(response.content, 'lxml')
totalitems = int(soup.find('totalentries').text)
items = soup.find_all('item')
for item in response.reply.searchResult.item:
    print(item.itemId)
    print(item.listingInfo.endTime)
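To walk every available page instead of hard-coding pageNumber, a hedged sketch (assuming the reply exposes the Finding API's paginationOutput.totalPages field; <my appid> stays a placeholder as above):

from ebaysdk.finding import Connection as finding

api = finding(appid=<my appid>, config_file=None)  # placeholder appid, as above

def fetch_page(page_number):
    return api.execute('findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'itemFilter': [
            {'name': 'EndTimeFrom', 'value': '2020-02-03T00:00:00.000Z'},
            {'name': 'EndTimeTo', 'value': '2020-02-04T00:00:00.000Z'}
        ],
        'paginationInput': {'entriesPerPage': '100', 'pageNumber': str(page_number)},
        'sortOrder': 'EndTimeSoonest'
    })

results = fetch_page(1)
# assumption: the reply mirrors the Finding API's paginationOutput block
total_pages = min(int(results.reply.paginationOutput.totalPages), 100)  # API caps at 100 pages
for page in range(1, total_pages + 1):
    reply = results.reply if page == 1 else fetch_page(page).reply
    for item in reply.searchResult.item:
        print(item.itemId, item.listingInfo.endTime)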

Use regex or something else to capture website data

I'm trying to use Python and regex to pull the price from the example website below, but am not getting any results.
How can I best capture the price (I don't care about the cents, just the dollar amount)?
http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060
Relevant HTML:
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
What would the regex be to capture the "299", or is there an easier route to get this? Thanks!
With regex it can be a bit tricky to decide how precise your pattern should be.
I quickly typed something together here: https://regex101.com/r/lF5vF2/1
You should get the idea and modify this one to fit your actual needs.
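For reference, a sketch of that idea in Python (one possible pattern against the HTML from the question; tighten it to fit the real page):

import re

html = '''<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>'''

# capture the digits sitting between the $ span and the delimiter span
m = re.search(r'\$</span>\s*(\d+)\s*<span class="currency-delimiter">', html)
if m:
    print(m.group(1))  # 299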
Don't use regex; use an HTML parser like bs4:
from bs4 import BeautifulSoup
h = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""
soup = BeautifulSoup(h)
amount = soup.select_one("div.price-display.csTile-price span.sup").next_sibling.strip()
Which will give you:
299
Or use the currency-delimiter span and get the previous element:
amount = soup.select_one("span.currency-delimiter").previous.strip()
Which will give you the same. The HTML in your question is also dynamically generated via JavaScript, so you won't get it using urllib.urlopen; it is simply not in the source returned. You will need something like selenium, or to mimic the ajax call using requests as below.
import requests
import json
js = requests.post("http://www.walmart.com/store/ajax/search",
                   data={"searchQuery": "store=2516&size=18&dept=4044&query=43888060"}).json()
data = json.loads(js['searchResults'])
from pprint import pprint as pp
pp(data)
That gives you some json:
{u'algo': u'polaris',
 u'blacklist': False,
 u'cluster': {u'apiserver': {u'hostname': u'dfw-iss-api8.stg0',
                             u'pluginVersion': u'2.3.0'},
              u'searchengine': {u'hostname': u'dfw-iss-esd.stg0.mobile.walmart.com'}},
 u'count': 1,
 u'offset': 0,
 u'performance': {u'enrichment': {u'inventory': 70}},
 u'query': {u'actualQuery': u'43888060',
            u'originalQuery': u'43888060',
            u'suggestedQueries': []},
 u'queryTime': 181,
 u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
               u'images': {u'largeUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180',
                           u'thumbnailUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180'},
               u'inventory': {u'isRealTime': True,
                              u'quantity': 1,
                              u'status': u'In Stock'},
               u'isWWWItem': True,
               u'location': {u'aisle': [], u'detailed': []},
               u'name': u'Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01',
               u'price': {u'currencyUnit': u'USD',
                          u'isRealTime': True,
                          u'priceInCents': 29900},
               u'productId': {u'WWWItemId': u'43888060',
                              u'productId': u'2FY1C7B7RMM4',
                              u'upc': u'88560900430'},
               u'ratings': {u'rating': u'4.721',
                            u'ratingUrl': u'http://i2.walmartimages.com/i/CustRating/4_7.gif'},
               u'reviews': {u'reviewCount': u'1436'},
               u'score': u'0.507073'}],
 u'totalCount': 1}
That gives you a dict with all the info you could need; all you are doing is posting the params and the store number (which you have in the URL) to http://www.walmart.com/store/ajax/search.
To get the price and name:
In [22]: import requests
In [23]: import json
In [24]: js = requests.post("http://www.walmart.com/store/ajax/search",
....: data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
In [25]: data = json.loads(js['searchResults'])
In [26]: res = data["results"][0]
In [27]: print(res["name"])
Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01
In [28]: print(res["price"])
{u'priceInCents': 29900, u'isRealTime': True, u'currencyUnit': u'USD'}
In [29]: print(res["price"]["priceInCents"])
29900
In [30]: print(res["price"]["priceInCents"] / 100)
299
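Since priceInCents is an integer number of cents, getting just the dollar amount is plain arithmetic (a small sketch using the value from the response above):

price_cents = 29900
print(price_cents // 100)                      # 299 - dollars only
print("${:,.2f}".format(price_cents / 100.0))  # $299.00, if the cents are ever wanted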
Ok, just search for numerics (I added $ and .) and concat the results into a string (I used "".join()).
>>> import re
>>> txt = """
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
"""
>>> ''.join(re.findall('[0-9$.]',txt.replace("\n","")))
'$299.00'
