Use regex or something else to capture website data - python

I'm trying to use python and regex to pull the price in the example website below but am not getting any results.
How can I best capture the price (I don't care about the cents, just the dollar amount)?
http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060
Relevant HTML:
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
What would the regex be to capture the "299", or is there an easier route to get this? Thanks!

With regex it can be a bit tricky to decide how accurate your pattern should be.
I quickly typed something together here: https://regex101.com/r/lF5vF2/1
You should get the idea and can modify this one to fit your actual needs.
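For example, here is a minimal sketch of that idea in Python (the exact pattern below is my own guess at what the linked regex does, not a copy of it):
import re
html = """<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>"""
# Find the "$" span and grab the run of digits that follows it
match = re.search(r'<span class="sup">\$</span>\s*(\d+)', html)
if match:
    print(match.group(1))  # 299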
Kind regards

Don't use regex; use an HTML parser like bs4:
from bs4 import BeautifulSoup
h = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""
soup = BeautifulSoup(h, "html.parser")
amount = soup.select_one("div.price-display.csTile-price span.sup").next_sibling.strip()
Which will give you:
299
Or use the currency-delimiter span and get the previous element:
amount = soup.select_one("span.currency-delimiter").previous.strip()
Which will give you the same. The HTML in your question is also dynamically generated via JavaScript, so you won't get it using urllib.urlopen; it is simply not in the source returned.
You will need something like selenium, or to mimic the ajax call as below using requests.
import requests
import json
js = requests.post("http://www.walmart.com/store/ajax/search",
                   data={"searchQuery": "store=2516&size=18&dept=4044&query=43888060"}).json()
data = json.loads(js['searchResults'])
from pprint import pprint as pp
pp(data)
That gives you some json:
{u'algo': u'polaris',
u'blacklist': False,
u'cluster': {u'apiserver': {u'hostname': u'dfw-iss-api8.stg0',
u'pluginVersion': u'2.3.0'},
u'searchengine': {u'hostname': u'dfw-iss-esd.stg0.mobile.walmart.com'}},
u'count': 1,
u'offset': 0,
u'performance': {u'enrichment': {u'inventory': 70}},
u'query': {u'actualQuery': u'43888060',
u'originalQuery': u'43888060',
u'suggestedQueries': []},
u'queryTime': 181,
u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
u'images': {u'largeUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180',
u'thumbnailUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180'},
u'inventory': {u'isRealTime': True,
u'quantity': 1,
u'status': u'In Stock'},
u'isWWWItem': True,
u'location': {u'aisle': [], u'detailed': []},
u'name': u'Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01',
u'price': {u'currencyUnit': u'USD',
u'isRealTime': True,
u'priceInCents': 29900},
u'productId': {u'WWWItemId': u'43888060',
u'productId': u'2FY1C7B7RMM4',
u'upc': u'88560900430'},
u'ratings': {u'rating': u'4.721',
u'ratingUrl': u'http://i2.walmartimages.com/i/CustRating/4_7.gif'},
u'reviews': {u'reviewCount': u'1436'},
u'score': u'0.507073'}],
u'totalCount': 1}
That gives you a dict with all the info you could need; all you are doing is posting the params and the store number (which you have in the url) to http://www.walmart.com/store/ajax/search.
To get the price and name:
In [22]: import requests
In [23]: import json
In [24]: js = requests.post("http://www.walmart.com/store/ajax/search",
....: data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
In [25]: data = json.loads(js['searchResults'])
In [26]: res = data["results"][0]
In [27]: print(res["name"])
Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01
In [28]: print(res["price"])
{u'priceInCents': 29900, u'isRealTime': True, u'currencyUnit': u'USD'}
In [29]: print(res["price"]["priceInCents"])
29900
In [30]: print(res["price"]["priceInCents"] / 100)
299
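If you do go the selenium route instead of mimicking the ajax call, a rough sketch could look like this (the CSS selector comes from the HTML in the question; the driver setup and the exact text returned are assumptions):
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060")
# The price div from the question's HTML; .text should contain something like "$ 299 . 00"
price = driver.find_element_by_css_selector("div.price-display.csTile-price").text
driver.quit()
print(price)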

Ok, just search for the numeric characters (I added $ and .) and concatenate the results into a string (I used "".join()).
>>> txt = """
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
"""
>>> import re
>>> ''.join(re.findall('[0-9$.]', txt.replace("\n", "")))
'$299.00'
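If you only want the whole-dollar part (the 299), a capture group on the same txt works too; this assumes the "$" span always sits right before the number, as in the snippet above:
>>> m = re.search(r'\$</span>\s*(\d+)', txt)
>>> m.group(1)
'299'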

Related

Beautiful soup - html parser returns dots instead of string visible on web

I'm trying to get the number of actors from: https://apify.com/store which is under the following HTML:
<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
When I send a GET request and parse the response with BeautifulSoup using:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
I get three dots ... instead of the number 895;
the element comes back as <span class="ActorStore-statusNbHitsNumber">...</span>.
How can I get the number?
If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the data is loaded dynamically by sending a POST request.
You can mimic that request by sending the correct JSON data. There's no need for BeautifulSoup; you can use only the requests module.
Here is a complete working example:
import requests
data = {
    "query": "",
    "page": 0,
    "hitsPerPage": 24,
    "restrictSearchableAttributes": [],
    "attributesToHighlight": [],
    "attributesToRetrieve": [
        "title",
        "name",
        "username",
        "userFullName",
        "stats",
        "description",
        "pictureUrl",
        "userPictureUrl",
        "notice",
        "currentPricingInfo",
    ],
}
response = requests.post(
    "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    json=data,
)
print(response.json()["nbHits"])
Output:
895
To view all the JSON data in order to access the key/value pairs, you can use:
from pprint import pprint
pprint(response.json(), indent=4)
Partial output:
{ 'exhaustiveNbHits': True,
'exhaustiveTypo': True,
'hits': [ { 'currentPricingInfo': None,
'description': 'Crawls arbitrary websites using the Chrome '
'browser and extracts data from pages using '
'a provided JavaScript code. The actor '
'supports both recursive crawling and lists '
'of URLs and automatically manages '
'concurrency for maximum performance. This '
"is Apify's basic tool for web crawling and "
'scraping.',
'name': 'web-scraper',
'objectID': 'moJRLRc85AitArpNN',
'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
'stats': { 'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
'totalBuilds': 104,
'totalMetamorphs': 102660,
'totalRuns': 68036112,
'totalUsers': 23492,
'totalUsers30Days': 1726,
'totalUsers7Days': 964,
'totalUsers90Days': 3205},
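The partial output above shows the shape of each hit. If you need per-actor fields rather than just the count, here is a short sketch (the key names are taken from that output; walking further pages by bumping "page" in the request data is an assumption about the API):
for hit in response.json()["hits"]:
    print(hit["name"], hit["stats"]["totalRuns"])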

empty img src while scraping website using requests-html python

When I was using the BeautifulSoup and requests modules to scrape the img src values, all of them were empty, so I'm assuming that the src value is generated by JavaScript. Hence, I tried to use the requests_html module instead. However, when I try to scrape the same information after the response is rendered, only two of the img src values are populated and the rest are empty, even though when I check the website using developer tools it seems the other img src values should have a value. What is the problem here?
Code for bs4 and requests:
from bs4 import BeautifulSoup
import requests
biliweb = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/3').text
bilisoup = BeautifulSoup(biliweb,'lxml')
for item in bilisoup.find_all('div', class_='lazy-img'):
    image_html = item.find('img')
    print(image_html)
Code for requests_html:
from requests_html import HTML, HTMLSession
session = HTMLSession()
biliweb = session.get('https://www.bilibili.com/ranking/bangumi/13/0/3')
biliweb.html.render()
for item in biliweb.html.find('.lazy-img.cover > img'):
    print(item.html)
I will only show the first five results because the list is quite lengthy.
With BeautifulSoup and requests
<img alt="Re:从零开始的异世界生活 第二季" src=""/>
<img alt="刀剑神域 爱丽丝篇 异界战争 -终章-" src=""/>
<img alt="没落要塞 / DECA-DENCE" src=""/>
<img alt="某科学的超电磁炮T" src=""/>
<img alt="宇崎学妹想要玩!" src=""/>
With requests_html
<img alt="Re:从零开始的异世界生活 第二季" src="https://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png#90w_120h.webp"/>
<img alt="刀剑神域 爱丽丝篇 异界战争 -终章-" src="https://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png#90w_120h.webp"/>
<img alt="没落要塞 / DECA-DENCE" src=""/>
<img alt="某科学的超电磁炮T" src=""/>
<img alt="宇崎学妹想要玩!" src=""/>
All the data is stored in a JavaScript variable called __INITIAL_STATE__.
The following script saves the data in a JSON file. Once you have this, you can easily download the images.
import requests, json
from bs4 import BeautifulSoup
page = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/3')
soup = BeautifulSoup(page.content, 'html.parser')
script = None
for s in soup.find_all("script"):
    if "__INITIAL_STATE__" in s.text:
        script = s.get_text(strip=True)
        break
data = json.loads(script[script.index('{'):script.index('function')-2])
with open("data.json", "w") as f:
    json.dump(data, f)
print(data)
Output:
{'rankList': [{'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/2f5bf4840747fc7c09932d2793e96a178cd05905.jpg', 'index_show': '更新至第5话'}, 'pts': 1903981, 'rank': 1, 'season_id': 33802, 'stat': {'danmaku': 814356, 'follow': 7135303, 'series_follow': 7267882, 'view': 33685387}, 'title': 'Re:从零开始的异世界生活 第二季', 'url': 'https://www.bilibili.com/bangumi/play/ss33802', 'pic': 'http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png', 'play': 33685387, 'video_review': 814356}, {'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/a772451f1f031ee1a3b78e31e4fb0b851517817f.jpg', 'index_show': '更新至第16话'}, 'pts': 483317, 'rank': 2, 'season_id': 32781, 'stat': {'danmaku': 514174, 'follow': 6195736, 'series_follow': 6733547, 'view': 36351270}, 'title': '刀剑神域 爱丽丝篇 异界战争 -终章-', 'url': 'https://www.bilibili.com/bangumi/play/ss32781', 'pic': 'http://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png', 'play': 36351270, 'video_review': 514174}, {'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/d5d7441c20614dc5ddc69f333f1906a09eddcee2.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/fe191e9ffa2422103bffcd8615446f5885074c0b.jpg', 'index_show': '更新至第5话'}, 'pts': 455170, 'rank': 3, 'season_id': 33803, 'stat': ....
...
...
...
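Since the original goal was the image URLs, here is a hedged sketch that reads them straight out of the parsed data (the 'rankList' and 'cover' keys come from the output above; the filename scheme is my own):
for i, entry in enumerate(data['rankList']):
    img = requests.get(entry['cover'])
    with open('cover_{}.png'.format(i), 'wb') as f:
        f.write(img.content)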

Issues with Python BeautifulSoup parsing

I am trying to parse an HTML page with BeautifulSoup. The task is to get the data underlined in red for all the lots on this page. I got the data from the left and right blocks (about the lot, auction name, country, etc.), but getting the data from the central block seems to be problematic for me. Here is an example of what is done.
import requests
import re
from bs4 import BeautifulSoup as bs
import pandas as pd
URL_TEMPLATE = "https://www.artprice.com/artist/15079/wassily-kandinsky/lots/pasts?ipp=100"
FILE_NAME = "test"
def parse(url=URL_TEMPLATE):
    result_list = {'lot': [], 'name': [], 'date': [], 'type1': [], 'type2': [], 'width': [], 'height': [], 'estimate': [], 'hummerprice': [], 'auction_date': [], 'auction': [], 'country': []}
    r = requests.get(URL_TEMPLATE)
    soup = bs(r.text, "html.parser")
    lot_info = soup.find_all('p', class_='hidden-xs')
    date_info = soup.find_all('date')
    names_info = soup.find_all('a', class_='sln_lot_show')
    auction_info = soup.find_all('p', class_='visible-xs')
    auction_date_info = soup.find_all(string=re.compile('\d\d\s\w\w\w\s\d\d\d\d'))[1::2]
    type1_info = soup.find_all('div')
    for i in range(len(lot_info)):
        result_list['lot'].append(lot_info[i].text)
    for i in range(len(date_info)):
        result_list['date'].append(date_info[i].text)
    for i in range(len(names_info)):
        result_list['name'].append(names_info[i].text)
    for i in range(0, len(auction_info), 2):
        result_list['auction'].append(soup.find_all('p', class_='visible-xs')[i].strong.string)
    for i in range(1, len(auction_info), 2):
        result_list['country'].append(soup.find_all('p', class_='visible-xs')[i].string)
    for i in range(len(auction_date_info)):
        result_list['auction_date'].append(auction_date_info[i])
    return result_list
df = pd.DataFrame(data=parse())
df.to_excel("test.xlsx")
So, the task is to get the data from the central block separately for each lot on this page.
You need nth-of-type to access all those <p> elements.
This does it for just the first one to show that it works.
I'll leave it to you to clean up the output.
for div in soup.find_all('div', class_='col-xs-8 col-sm-6'):
    print(div.select_one('a').text.strip())
    print(div.select_one('p:nth-of-type(2)').text.strip())
    print(div.select_one('p:nth-of-type(3)').text.strip())
    print(div.select_one('p:nth-of-type(4)').text.strip())
    break
Result:
Abstract
Print-Multiple, Print in colors, 29 1/2 x 31 1/2 in75 x 80 cm
Estimate:
€ 560 - € 784
$ 605 - $ 848
£ 500 - £ 700
¥ 4,303 - ¥ 6,025
Hammer price:
not communicated
not communicated
not communicated
not communicated
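To go beyond the first lot, the same selectors can be collected into a list of dicts by dropping the break (the key names below are my own labels, and the text is kept raw rather than cleaned up):
rows = []
for div in soup.find_all('div', class_='col-xs-8 col-sm-6'):
    rows.append({
        'name': div.select_one('a').text.strip(),
        'type_and_size': div.select_one('p:nth-of-type(2)').text.strip(),
        'estimate': div.select_one('p:nth-of-type(3)').text.strip(),
        'hammer_price': div.select_one('p:nth-of-type(4)').text.strip(),
    })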

Can't access data from a table in a web page

I am trying to access content from the table on this page http://www.afl.com.au/afl/stats/player-ratings/overall-standings#, but when I do so using Beautiful Soup in Python, I am getting the data from the 'All' filter selection, not from a certain club filter. How can I achieve my goal?
I need to access all data from the table corresponding to a club in the filter. Please help me.
Please see the image below, and also the data from all pages.
I have used the following code:
from bs4 import BeautifulSoup
import urllib2
import lxml.html
import xlwt
import unicodedata
infoList = []
lLink = "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#club/CD_T140"
header = {'User-Agent': 'Mozilla/5.0'}
req_for_players = urllib2.Request(lLink,headers=header)
page_for_players = urllib2.urlopen(req_for_players)
soup_for_players = BeautifulSoup(page_for_players)
table = soup_for_players.select('table.player-ratings')[0]
for group_header in table.select('tbody tr span'):
    player = group_header.string
    infoList.append(player)
print infoList
The list infoList thus generated contains data corresponding to the "All" filter. But I want data according to the filter I choose.
You don't need to parse the table - use Firebug or any similar tool to watch the response while clicking on a page in the paginator, and you'll see that it serves you the data in JSON. Pure win!
There you can see the URL format for the JSON data:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&pageSize=40&pageNum=2
This way you can fetch even the first page of data without parsing HTML, and you may be able to fetch all the data at once by setting the pageSize parameter to some high value.
When the page loads the first time, all the players are there in the table regardless of what filter you've chosen. The filter is only invoked a short while later. That is why you are getting data on all the players.
Underneath, the page is calling the following when the filter is invoked:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T140&pageSize=40&pageNum=1
to get one team's players. CD_T140 is the Western Bulldogs in this case. You can see the different possible values in the selLadderTeam select element. You cannot simply call this URL, however, as you will get a 401 error. Looking at the headers that are sent across, there is one that stands out: a token seems to be required. So, using the requests library (it's more user-friendly than urllib2), you can do the following:
>>> import requests
>>> url = "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401411&teamId=CD_T40&pageSize=100"
>>> h = {'x-media-mis-token':'e61767b39a7680235476bb33aa946c0e'}
>>> r = requests.get(url, headers=h)
>>> r
<Response [200]>
>>> j = r.json()
>>> len(j['playerRatings'])
45
>>> j['playerRatings'][0]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I260257', u'playerName': {u'givenName': u'Scott', u'surname': u'Pendlebury'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2005', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'OVERALL', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 1, u'ratingType': u'TEAM', u'ratingPoints': 674}, {u'trend': u'NO_CHANGE', u'ranking': 2, u'ratingType': u'POSITION', u'ratingPoints': 674}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MIDFIELDER'}
>>> j['playerRatings'][44]
{u'roundId': u'CD_R201401411', u'player': {u'playerId': u'CD_I295012', u'playerName': {u'givenName': u'Matt', u'surname': u'Scharenberg'}, u'captain': False, u'playerJumperNumber': None}, u'draftYear': u'2013', u'detailedRatings': [{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'OVERALL', u'ratingPoints': 0},{u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'TEAM', u'ratingPoints': 0}, {u'trend': u'NO_CHANGE', u'ranking': 0, u'ratingType': u'POSITION', u'ratingPoints': 0}], u'team': {u'teamId': u'CD_T40', u'teamName': u'Collingwood', u'teamAbbr': u'COLL', u'teamNickname': u'Magpies'}, u'position': u'MEDIUM_DEFENDER'}
>>>
Notes: I don't know exactly what roundId is. I increased pageSize to something that would likely return all of a team's players, and removed pageNum. They could change the token at any time.
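As a follow-up sketch, the player names and OVERALL rating points can be pulled out of that JSON like this (the key names are taken from the output above):
for p in j['playerRatings']:
    name = '%s %s' % (p['player']['playerName']['givenName'],
                      p['player']['playerName']['surname'])
    overall = next(r['ratingPoints'] for r in p['detailedRatings']
                   if r['ratingType'] == 'OVERALL')
    print('%s: %d' % (name, overall))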

BeautifulSoup fails to parse nested <p> elements

Dependencies:
BeautifulSoup==3.2.1
In: from BeautifulSoup import BeautifulSoup
In: BeautifulSoup('<p><p>123</p></p>')
Out: <p></p><p>123</p>
Why are the two adjacent tags not in the output?
That is just BS3's parser fixing your broken HTML.
The P element represents a paragraph. It cannot contain block-level
elements (including P itself).
This
<p><p>123</p></p>
is not valid HTML; <p> elements can't be nested. BS tries to clean it up.
When BS encounters the second <p>, it thinks the first <p> is finished, so it inserts a closing </p>. The second </p> in your input then does not match an opening <p>, so it is removed.
This is because BeautifulSoup has this NESTABLE_TAGS concept/setting:
When Beautiful Soup is parsing a document, it keeps a stack of open
tags. Whenever it sees a new start tag, it tosses that tag on top of
the stack. But before it does, it might close some of the open tags
and remove them from the stack. Which tags it closes depends on the
qualities of tag it just found, and the qualities of the tags in the
stack.
So when Beautiful Soup encounters a <P> tag, it closes and pops all
the tags up to and including the previously encountered tag of the
same type. This is the default behavior, and this is how
BeautifulStoneSoup treats every tag. It's what you get when a tag is
not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also
what you get when a tag shows up in RESET_NESTING_TAGS but has no
entry in NESTABLE_TAGS, the way the <P> tag does.
>>> pprint(BeautifulSoup.NESTABLE_TAGS)
{'bdo': [],
'blockquote': [],
'center': [],
'dd': ['dl'],
'del': [],
'div': [],
'dl': [],
'dt': ['dl'],
'fieldset': [],
'font': [],
'ins': [],
'li': ['ul', 'ol'],
'object': [],
'ol': [],
'q': [],
'span': [],
'sub': [],
'sup': [],
'table': [],
'tbody': ['table'],
'td': ['tr'],
'tfoot': ['table'],
'th': ['tr'],
'thead': ['table'],
'tr': ['table', 'tbody', 'tfoot', 'thead'],
'ul': []}
As a workaround, you can allow the p tag to be nested inside p:
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup.NESTABLE_TAGS['p'] = ['p']
>>> BeautifulSoup('<p><p>123</p></p>')
<p><p>123</p></p>
Also, BeautifulSoup version 3 is no longer maintained - you should switch to BeautifulSoup 4.
When using BeautifulSoup4, you can change the underlying parser to change the behavior:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p><p>123</p></p>')
<html><body><p></p><p>123</p></body></html>
>>> BeautifulSoup('<p><p>123</p></p>', 'html.parser')
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'html5lib')
<html><head></head><body><p></p><p>123</p><p></p></body></html>
