When I was using the BeautifulSoup and requests modules to scrape the img src attributes, all of them came back empty, so I'm assuming the src values are generated by JavaScript. Hence, I tried the requests_html module instead. However, when I scrape the same information after the response is rendered, only two of the img tags have an src value and the rest are still empty. The problem is that when I check the page with the browser's developer tools, the other img tags do appear to have src values. May I know what the problem is here?
Code for bs4 and requests:
from bs4 import BeautifulSoup
import requests

biliweb = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/3').text
bilisoup = BeautifulSoup(biliweb, 'lxml')
for item in bilisoup.find_all('div', class_='lazy-img'):
    image_html = item.find('img')
    print(image_html)
Code for requests_html:
from requests_html import HTMLSession

session = HTMLSession()
biliweb = session.get('https://www.bilibili.com/ranking/bangumi/13/0/3')
biliweb.html.render()
for item in biliweb.html.find('.lazy-img.cover > img'):
    print(item.html)
I will only show the first five results because the list is quite lengthy.
With BeautifulSoup and requests:
<img alt="Re:从零开始的异世界生活 第二季" src=""/>
<img alt="刀剑神域 爱丽丝篇 异界战争 -终章-" src=""/>
<img alt="没落要塞 / DECA-DENCE" src=""/>
<img alt="某科学的超电磁炮T" src=""/>
<img alt="宇崎学妹想要玩!" src=""/>
With requests_html:
<img alt="Re:从零开始的异世界生活 第二季" src="https://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png#90w_120h.webp"/>
<img alt="刀剑神域 爱丽丝篇 异界战争 -终章-" src="https://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png#90w_120h.webp"/>
<img alt="没落要塞 / DECA-DENCE" src=""/>
<img alt="某科学的超电磁炮T" src=""/>
<img alt="宇崎学妹想要玩!" src=""/>
All the data is stored in a JavaScript variable called __INITIAL_STATE__.
The following script saves the data to a JSON file. Once you have this, you can easily download the images.
import requests, json
from bs4 import BeautifulSoup

page = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/3')
soup = BeautifulSoup(page.content, 'html.parser')

script = None
for s in soup.find_all("script"):
    if "__INITIAL_STATE__" in s.text:
        script = s.get_text(strip=True)
        break

# Slice from the first '{' up to just before the function definition
# that follows the JSON object inside the same script tag.
data = json.loads(script[script.index('{'):script.index('function')-2])

with open("data.json", "w") as f:
    json.dump(data, f)

print(data)
Output:
{'rankList': [{'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/2f5bf4840747fc7c09932d2793e96a178cd05905.jpg', 'index_show': '更新至第5话'}, 'pts': 1903981, 'rank': 1, 'season_id': 33802, 'stat': {'danmaku': 814356, 'follow': 7135303, 'series_follow': 7267882, 'view': 33685387}, 'title': 'Re:从零开始的异世界生活 第二季', 'url': 'https://www.bilibili.com/bangumi/play/ss33802', 'pic': 'http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png', 'play': 33685387, 'video_review': 814356}, {'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/a772451f1f031ee1a3b78e31e4fb0b851517817f.jpg', 'index_show': '更新至第16话'}, 'pts': 483317, 'rank': 2, 'season_id': 32781, 'stat': {'danmaku': 514174, 'follow': 6195736, 'series_follow': 6733547, 'view': 36351270}, 'title': '刀剑神域 爱丽丝篇 异界战争 -终章-', 'url': 'https://www.bilibili.com/bangumi/play/ss32781', 'pic': 'http://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png', 'play': 36351270, 'video_review': 514174}, {'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/d5d7441c20614dc5ddc69f333f1906a09eddcee2.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/fe191e9ffa2422103bffcd8615446f5885074c0b.jpg', 'index_show': '更新至第5话'}, 'pts': 455170, 'rank': 3, 'season_id': 33803, 'stat': ....
I'm trying to get the number of actors from https://apify.com/store, which is under the following HTML:
<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
When I send a GET request and parse the response with BeautifulSoup using:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
I get three dots ... instead of the number 895; the element comes back as <span class="ActorStore-statusNbHitsNumber">...</span>.
How can I get the number?
If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the data is loaded dynamically by sending a POST request.
You can mimic that request by sending the correct JSON data. There's no need for BeautifulSoup; you can use the requests module on its own.
Here is a complete working example:
import requests

data = {
    "query": "",
    "page": 0,
    "hitsPerPage": 24,
    "restrictSearchableAttributes": [],
    "attributesToHighlight": [],
    "attributesToRetrieve": [
        "title",
        "name",
        "username",
        "userFullName",
        "stats",
        "description",
        "pictureUrl",
        "userPictureUrl",
        "notice",
        "currentPricingInfo",
    ],
}

response = requests.post(
    "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    json=data,
)
print(response.json()["nbHits"])
Output:
895
To view all the JSON data in order to access the key/value pairs, you can use:
from pprint import pprint
pprint(response.json(), indent=4)
Partial output:
{   'exhaustiveNbHits': True,
    'exhaustiveTypo': True,
    'hits': [   {   'currentPricingInfo': None,
                    'description': 'Crawls arbitrary websites using the Chrome '
                                   'browser and extracts data from pages using '
                                   'a provided JavaScript code. The actor '
                                   'supports both recursive crawling and lists '
                                   'of URLs and automatically manages '
                                   'concurrency for maximum performance. This '
                                   "is Apify's basic tool for web crawling and "
                                   'scraping.',
                    'name': 'web-scraper',
                    'objectID': 'moJRLRc85AitArpNN',
                    'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
                    'stats': {   'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
                                 'totalBuilds': 104,
                                 'totalMetamorphs': 102660,
                                 'totalRuns': 68036112,
                                 'totalUsers': 23492,
                                 'totalUsers30Days': 1726,
                                 'totalUsers7Days': 964,
                                 'totalUsers90Days': 3205},
I'm trying to scrape some data from a site called laced.co.uk, and I'm a tad confused about what's going wrong. I'm new to this, so please try to explain it simply (if possible!). Here is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.laced.co.uk/products/nike-dunk-low-retro-black-white?size=7"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text=" £195 ")
print(prices)
Thank you! (The price at the time of posting was £195; it showed as the size 7 Buy Now price on the page.)
The price is loaded within a <script> tag on the page:
<script>
typeof(dataLayer) != "undefined" && dataLayer.push({
'event': 'eec.productDetailImpression',
'page': {
'ecomm_prodid': 'DD1391-100'
},
'ecommerce': {
'detail': {
'actionField': {'list': 'Product Page'},
'products': [{
'name': 'Nike Dunk Low Retro Black White',
'id': 'DD1391-100',
'price': '195.0',
'brand': 'Nike',
'category': 'Dunk, Dunk Low, Mens Nike Dunks',
'variant': 'White',
'list': 'Product Page',
'dimension1': '195.0',
'dimension2': '7',
'dimension3': '190',
'dimension4': '332'
}]
}
}
});
</script>
You can use a regular expression pattern to search for the price. Note that there's no need for BeautifulSoup:
import re
import requests
url = "https://www.laced.co.uk/products/nike-dunk-low-retro-black-white?size=7"
result = requests.get(url)
price = re.search(r"'price': '(.*?)',", result.text).group(1)
print(f"£ {price}")
So I am trying to scrape osu! stats from my friend's profile. When I try running the code I get "None". Here is the code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://osu.ppy.sh/users/17906919').text
soup = BeautifulSoup(html_text, 'lxml')
stats = soup.find_all('dl', class_ = 'profile-stats__entry')
print(stats)
The desired data is actually present within the HTML source, under the following script tag:
<script id="json-user" type="application/json">
So we need to pick it up and parse it as JSON, as below:
import requests
from bs4 import BeautifulSoup
from pprint import pprint as pp
import json


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = json.loads(soup.select_one('#json-user').string)
    pp(goal['statistics'])


main('https://osu.ppy.sh/users/17906919')
Output:
{'country_rank': 18133,
'global_rank': 94334,
'grade_counts': {'a': 159, 's': 99, 'sh': 9, 'ss': 6, 'ssh': 2},
'hit_accuracy': 97.9691,
'is_ranked': True,
'level': {'current': 83, 'progress': 84},
'maximum_combo': 896,
'play_count': 9481,
'play_time': 347925,
'pp': 3868.29,
'rank': {'country': 18133},
'ranked_score': 715205885,
'replays_watched_by_others': 0,
'total_hits': 1086843,
'total_score': 3896191620}
My code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myUrl = 'https://www.rebuy.de/kaufen/videospiele-nintendo-switch?page=1'

# opening up connection, grabbing the page
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each product
containers = page_soup.find_all("div", class_="ry-product__item ry-product__item--large")
I want to extract the item containers that hold the image, title, and price from this website. When I run this code it returns an empty list:
[]
I am sure the code works, because when I type, for example, class_="row", it returns the tags that contain that class.
I want to extract all the containers that have this class, but it seems like I am choosing the wrong class, or perhaps it's because there are multiple classes in this <div> tag. What am I doing wrong?
The site loads the products dynamically through Ajax. Looking at the Chrome/Firefox network inspector reveals the address of the API the site loads the product data from (https://www.rebuy.de/api/search?page=1&categorySanitizedPath=videospiele-nintendo-switch):
import requests
import json
from pprint import pprint
headers = {}
# headers = {"Host":"www.rebuy.de",
# "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Cookie":"SET THIS TO PREVENT ACCESS DENIED",
# "Accept-Encoding":"gzip,deflate,br",
# "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
url = "https://www.rebuy.de/api/search?page={}&categorySanitizedPath=videospiele-nintendo-switch"
page = 1
r = requests.get(url.format(page), headers=headers)
data = json.loads(r.text)
pprint(data['products'])
# print(json.dumps(data, indent=4, sort_keys=True))
Prints:
{'docs': [{'avg_rating': 5,
'badges': [],
'blue_price': 1999,
'category_id': {'0': 94, '1': 3098},
'category_is_accessory': False,
'category_name': 'Nintendo Switch',
'category_sanitized_name': 'nintendo-switch',
'cover_updated_at': 0,
'has_cover': True,
'has_percent_category': False,
'has_variant_in_stock': True,
'id': 10725297,
'name': 'FIFA 18',
'num_ratings': 1,
'price_min': 1999,
'price_recommended': 0,
'product_sanitized_name': 'fifa-18',
'root_category_name': 'Videospiele',
'variants': [{'label': 'A1',
'price': 2199,
'purchasePrice': 1456,
'quantity': 2},
{'label': 'A2',
'price': 1999,
'purchasePrice': 1919,
'quantity': 7},
{'label': 'A3',
'price': 1809,
'purchasePrice': 1919,
'quantity': 0},
{'label': 'A4',
'price': 1409,
'purchasePrice': 1919,
'quantity': 0}]},
...and so on.
One caveat: when many requests are made, the site returns Access Denied. To prevent this, set the headers with a Cookie taken from your browser (to get the cookie, look inside the Chrome/Firefox network inspector).
A better solution would be to use Selenium.
The issue is that these DOM elements are loaded dynamically via Ajax. If you view the source code of this site, you won't be able to find any of these classes, because they haven't been created yet. One solution is to make the same request that the page does and extract the data from the response, as shown in the answer above.
Another approach is to use a tool like Selenium to load these elements and interact with them dynamically.
Here's some code to retrieve and print the fields you're interested in; hopefully this will get you started. It requires installing ChromeDriver.
Note that I took the liberty of parsing the results with a regex a bit, but that's not critical.
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get("https://www.rebuy.de/kaufen/videospiele-nintendo-switch")

for product in driver.find_elements_by_tag_name("product"):
    name_elem = product.find_element_by_class_name("ry-product-item-content__name")
    print("name:\t", name_elem.get_attribute("innerHTML"))
    image_elem = product.find_element_by_class_name("ry-product-item__image")
    image = str(image_elem.value_of_css_property("background-image"))
    print("image:\t", re.search(r"^url\((.*)\)$", image).group(1))
    price_elem = product.find_element_by_class_name("ry-price__amount")
    price = str(price_elem.get_attribute("innerHTML").encode("utf-8"))
    print("price:\t", re.search(r"\d?\d,\d\d", price).group(0), "\n")
Output (60 results):
name: Mario Kart 8 Deluxe
image: "https://d2wr8zbg9aclns.cloudfront.net/products/010/574/253/covers/205.jpeg?time=0"
price: 43,99
name: Super Mario Odyssey
image: "https://d2wr8zbg9aclns.cloudfront.net/products/010/574/263/covers/205.jpeg?time=1508916366"
price: 40,69
...
name: South Park: Die Rektakuläre Zerreißprobe
image: "https://d2wr8zbg9aclns.cloudfront.net/products/default/205.jpeg?time=0"
price: 35,99
name: Cars 3: Driven To Win [Internationale Version]
image: "https://d2wr8zbg9aclns.cloudfront.net/products/010/967/629/covers/205.jpeg?time=1528267000"
price: 30,99
I'm trying to use Python and a regex to pull the price from the example website below, but am not getting any results.
How can I best capture the price (I don't care about the cents, just the dollar amount)?
http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060
Relevant HTML:
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
What would the regex be to capture the "299", or is there an easier route to get this? Thanks!
With a regexp it can be a bit tricky to decide how accurate your pattern should be.
I quickly typed something together here: https://regex101.com/r/lF5vF2/1
You should get the idea and can modify this one to fit your actual needs.
Don't use a regex; use an HTML parser like bs4:
from bs4 import BeautifulSoup
h = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""
soup = BeautifulSoup(h, "html.parser")
amount = soup.select_one("div.price-display.csTile-price span.sup").next_sibling.strip()
Which will give you:
299
Or use the currency-delimiter span and get the previous element:
amount = soup.select_one("span.currency-delimiter").previous.strip()
Which will give you the same. The HTML in your question is also dynamically generated via JavaScript, so you won't be getting it using urllib.urlopen; it is simply not in the source returned.
You will need something like Selenium, or to mimic the Ajax call using requests as below.
import json
from pprint import pprint as pp

import requests

js = requests.post("http://www.walmart.com/store/ajax/search",
                   data={"searchQuery": "store=2516&size=18&dept=4044&query=43888060"}).json()
data = json.loads(js['searchResults'])
pp(data)
That gives you some JSON:
{u'algo': u'polaris',
u'blacklist': False,
u'cluster': {u'apiserver': {u'hostname': u'dfw-iss-api8.stg0',
u'pluginVersion': u'2.3.0'},
u'searchengine': {u'hostname': u'dfw-iss-esd.stg0.mobile.walmart.com'}},
u'count': 1,
u'offset': 0,
u'performance': {u'enrichment': {u'inventory': 70}},
u'query': {u'actualQuery': u'43888060',
u'originalQuery': u'43888060',
u'suggestedQueries': []},
u'queryTime': 181,
u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
u'images': {u'largeUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180',
u'thumbnailUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180'},
u'inventory': {u'isRealTime': True,
u'quantity': 1,
u'status': u'In Stock'},
u'isWWWItem': True,
u'location': {u'aisle': [], u'detailed': []},
u'name': u'Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01',
u'price': {u'currencyUnit': u'USD',
u'isRealTime': True,
u'priceInCents': 29900},
u'productId': {u'WWWItemId': u'43888060',
u'productId': u'2FY1C7B7RMM4',
u'upc': u'88560900430'},
u'ratings': {u'rating': u'4.721',
u'ratingUrl': u'http://i2.walmartimages.com/i/CustRating/4_7.gif'},
u'reviews': {u'reviewCount': u'1436'},
u'score': u'0.507073'}],
u'totalCount': 1}
That gives you a dict with all the info you could need; all you are doing is posting the params and the store number (which you have in the URL) to http://www.walmart.com/store/ajax/search.
To get the price and name:
In [22]: import requests
In [23]: import json
In [24]: js = requests.post("http://www.walmart.com/store/ajax/search",
....: data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
In [25]: data = json.loads(js['searchResults'])
In [26]: res = data["results"][0]
In [27]: print(res["name"])
Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01
In [28]: print(res["price"])
{u'priceInCents': 29900, u'isRealTime': True, u'currencyUnit': u'USD'}
In [29]: print(res["price"]["priceInCents"])
29900
In [30]: print(res["price"]["priceInCents"] / 100)
299
OK, just search for the numeric characters (I added $ and . to the character class) and concatenate the results into a string (I used ''.join()):
>>> txt = """
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
"""
>>> ''.join(re.findall('[0-9$.]',txt.replace("\n","")))
'$299.00'