Web scraping a website using its JSON (application/ld+json) data - python

I am trying to get the price of one item on the website in the url below. However, I am finding some issues when looking at the source page of the website.
The url is: https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love
The part of the source page I am interested in is the following (I guess):
<script type="application/ld+json">
[{
  "@context": "http://schema.org",
  "@type": "Product",
  "productID": "25372685655708131",
  "name": "LOVE bracelet, small model",
  "description": "#LOVE# bracelet, small model, yellow gold 750/1000. Supplied with a screwdriver. Width: 3.65 mm (for size 17). Now available in a slimmer version, Cartier continues to write the story of the #LOVE# bracelet. Same design, same oval shape, same story: a timeless – yet slightly slimmer – creation which is fastened using a screwdriver. The closure is designed with a functional screw on one side of the bracelet and a hinge on the other. To determine the size of your #LOVE# bracelet, measure your wrist, adding one centimetre to your size for a tighter fit, or two centimetres for a looser fit.",
  "image": ["https://www.cartier.com/variants/images/25372685655708131/img1/w960.jpg"],
  "offers": [{
    "@type": "Offer",
    "availability": "http://schema.org/InStock",
    "priceCurrency": "GBP",
    "price": "4100",
    "sku": "0400574782829",
    "url": "https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"
  }]
}]
</script>
I have tried the following steps:
import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url': [], 'offers_price': []}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])

if __name__ == '__main__':
    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']
    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)
But it was not successful. How would you approach this case?

I was able to get the price, but I got it from the product-price tag:
import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url': [], 'offers_price': []}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('product-price')[-1]['data-model'])
    return url, int(data['fullPrice'])

if __name__ == '__main__':
    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']
    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)
Output:
{'url': ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love'], 'offers_price': [4100]}
By the way, are you sure you want to append the url and the price? I think you should do this instead:
data['offers_price'] = price
data['url'] = url
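As for why your original ld+json attempt failed: the JSON you quoted is a list of products, and offers is itself a list, so you need to index into both before reading the price. A minimal sketch of that fix for your get_price (untested against the live page):

import json
import requests
from bs4 import BeautifulSoup

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    # The ld+json payload is a list of products; 'offers' is a list of offers
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data[0]['offers'][0]['price'])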

Related

Crawling dynamic website elements

I am crawling a page with Python.
The discount price on the page is shaded in red, and it exists as text inside a script tag when you inspect the site with the browser developer tools.
from bs4 import BeautifulSoup as bs4
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
print(json_data2)
However, when I print the result in the terminal through the code above, the discount price I saw in the web browser is printed as the normal price, as shown below. How can I get that value?
The selenium module takes a long time, so I want to do this with requests or another approach.
Using regular expressions will do the trick.
from bs4 import BeautifulSoup as bs4
import re
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
for i in json_data2:
    results = re.findall(r'lastPrc : (\d+?),', i)
    if results:
        print(results)
Output:
['1310000']
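As a side note, the line-by-line split isn't strictly necessary; you can search the whole HTML in one pass and cast the match to an int. A sketch, assuming the lastPrc field is still present in the page source:

import re
import requests as req

res = req.get(url)  # url as defined in the snippet above
m = re.search(r'lastPrc\s*:\s*(\d+)\s*,', res.text)
if m:
    print(int(m.group(1)))  # e.g. 1310000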
The value that you are looking for is no longer there.

Why can't I see/use the innerHTML of this site in my python interpreter?

I'm working on a web-scraping project using BS4 where I am trying to aggregate college tuition data. I'm using the site tuitiontracker.org as a data source. Once I've navigated to a specific college, I want to scrape the tuition data off of the site. When I inspect the element, I can see the tuition data stored as innerHTML, but when I use Beautiful Soup to find it, it returns everything around that location except for the tuition data.
Here is the url I'm trying to scrape from: https://www.tuitiontracker.org/school.html?unitid=164580
Here is the code I am using:
import urllib.request
from bs4 import BeautifulSoup

DOWNLOAD_URL = "https://www.tuitiontracker.org/school.html?unitid=164580"

def download_page(url):
    return urllib.request.urlopen(url)

# print(download_page(DOWNLOAD_URL).read())

def parse_html(html):
    """Gathers data from an HTML page"""
    soup = BeautifulSoup(html, features="html.parser")
    # print(soup.prettify())
    tuition_data = soup.find("div", attrs={"id": "price"}).innerHTML
    print(tuition_data)

def main():
    url = DOWNLOAD_URL
    parse_html(download_page(DOWNLOAD_URL).read())

if __name__ == "__main__":
    main()
When I print tuition_data, I see the relevant tags where the tuition data is stored on the page, but no number value. I've tried using .innerHTML and .string but they end up printing either None, or simply a blank space.
Really quite confused, thanks for any clarification.
The data comes from an API endpoint and is dynamically rendered by JavaScript, so you won't get it with BeautifulSoup.
However, you can query the endpoint.
Here's how:
import requests

url = "https://www.tuitiontracker.org/school.html?unitid=164580"
api_endpoint = "https://www.tuitiontracker.org/data/school-data-09042019/"

response = requests.get(f"{api_endpoint}{url.split('=')[-1]}.json").json()

tuition = response["yearly_data"][0]
print(
    round(tuition["price_instate_oncampus"], 2),
    round(tuition["avg_net_price_0_30000_titleiv_privateforprofit"], 2),
)
Output:
75099.8 30255.86
PS. There's a lot more in that JSON. Pro tip for future web-scraping endeavors: your favorite web browser's Developer Tools should be your best friend.
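For example, to see what else is available, you could inspect the keys of the same response object (a quick sketch; the exact field names may differ):

# Top-level structure of the school JSON
print(list(response.keys()))

# Fields recorded for the most recent year
print(sorted(response["yearly_data"][0])[:15])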

Getting None when scraping for operating income from SEC EDGAR document

I'm trying to obtain the latest quarter's operating income/loss from a quarterly filing.
Desired output highlighted in green: financial statement
Here's the URL of the document that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
If you'd like to see the data point visually, it is in PART I, Item 1. Financial Statements, Operating income.
The HTML code for the figure that I'm trying to get:
<ix:nonfraction id="fact-identifier-125" name="us-gaap:OperatingIncomeLoss" contextref="FD2019Q3QTD" unitref="usd" decimals="-6" scale="6" format="ixt:numdotdecimal" data-original-id="d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isadditionalitemsonly="false">11,544</ix:nonfraction>
The code that I used to obtain this data point (11,544):
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
response = requests.get(url)
content = BeautifulSoup(response.content, 'html.parser')
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss", "contextref":"FD2019Q3QTD"})
print (operatingincomeloss)
I also tried with
operatingincomeloss = content.find('ix:nonfraction', attrs={"name": "us-gaap:OperatingIncomeLoss"})
Eventually, I want to loop through all the relevant filings to pull this data point. Currently, I'm just getting None. When I Ctrl+F through content, I can't find the ix:nonfraction tag either.
The page is loaded via JavaScript; I've used the XHR request it makes and extracted the data required.
import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.select("#d305292495e1903-wk-Fact-6250FB76089207E7F73CB52756E0D8D0"):
    print(item.text)
Output:
11,544
Updated:
import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm")
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.find_all("ix:nonfraction", {'contextref': 'FD2019Q3QTD', 'name': 'us-gaap:OperatingIncomeLoss'}):
    print(item.text)
As @αԋɱҽԃ αмєяιcαη said, the page is loaded via JavaScript.
I have used the XHR request for this code.
Considering the attributes you have used, I have taken only the name attribute, as contextref changes for each element.
You could also change the name attribute if you want to loop through other elements.
As you said you want to loop through this tag, I have printed all the output returned by the code below.
Code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm')
soup = BeautifulSoup(res.text, 'html.parser')

for data in soup.find_all('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'}):
    print(data.text)
Output:
11,544
12,612
48,305
54,780
7,442
7,496
26,329
26,580
3,687
3,892
14,371
15,044
3,221
3,414
12,142
15,285
1,795
1,765
7,199
7,193
1,155
1,127
4,811
4,980
17,300
17,694
64,852
69,082
11,544
12,612
48,305
54,780
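If you need the figures as numbers rather than strings (for example, to compare quarters across filings), strip the thousands separators before converting; note that the scale="6" attribute on the tag suggests the values are reported in millions of USD. A small sketch reusing the soup from the code above:

for data in soup.find_all('ix:nonfraction', {'name': 'us-gaap:OperatingIncomeLoss'}):
    # "11,544" -> 11544
    print(int(data.text.replace(',', '')))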

Can't grab a phone number from a webpage

I've created a script in Python to fetch a phone number from a webpage, but I can't figure out how to grab it, as the number is rendered as an image.
Website link
This is how that number is displayed on the page (as an image, not text). Here is what I've written so far:
import requests
from bs4 import BeautifulSoup

url = "use_above_link"

def get_phone_number(link):
    resp = requests.get(link)
    soup = BeautifulSoup(resp.text, "lxml")
    phone = soup.select_one("img.phone-num-img")['src']
    print(phone)

if __name__ == '__main__':
    get_phone_number(url)
How can I scrape this very phone number from that webpage?
Here you go.
The clue is in the page HTML, which indicates the tel number likely has a base64 encoding.
The base64-encoded value of that tel number is MDA5NzE1MjE3NjQ4MDY=. This value is not present on that page, but it is present at one of the other URLs you can extract from the initial page HTML.
Issue a second request to that URL, target the [data-tel] attribute (which is where the encoded string is stored), extract the base64-encoded string, and decode it.
import requests
from bs4 import BeautifulSoup as bs
import base64

with requests.Session() as s:
    r = s.get('https://dubai.dubizzle.com/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber')
    soup = bs(r.content, 'lxml')
    link = 'https://dubai.dubizzle.com' + soup.select_one('[media][href$=shownumber]')['href']
    r = s.get(link)
    soup = bs(r.content, 'lxml')
    encoded = soup.select_one('[data-tel]')['data-tel']
    tel = base64.b64decode(encoded)
    print(tel)
Notes:
It looks like the rel alternate (the second URL) is simply a mobile device URL, and that you can issue just one request by substituting /m/ into the original URL, i.e.
https://dubai.dubizzle.com/m/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber#
Code then simplifies to:
import requests
from bs4 import BeautifulSoup as bs
import base64
r = requests.get('https://dubai.dubizzle.com/m/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber')
soup = bs(r.content, 'lxml')
encoded = soup.select_one('[data-tel]')['data-tel']
tel = base64.b64decode(encoded)
print(tel)
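One note on the output: base64.b64decode returns bytes, so if you want the number as a plain string, decode it:

print(tel.decode('utf-8'))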
1. Use a paid OCR service
The quickest way to solve this problem will be to use an OCR service. The downside: they are not free.
e.g. set up a Google Cloud project and enable the Vision API (instructions here). Then pass the image you acquired to the API and get the numbers back.
import requests
from bs4 import BeautifulSoup
from google.cloud import vision

url = "use_above_link"
client = vision.ImageAnnotatorClient()

def get_phone_number(link):
    resp = requests.get(link)
    soup = BeautifulSoup(resp.text, "lxml")
    phone_src_url = soup.select_one("img.phone-num-img")['src']
    print(phone_src_url)
    response = client.annotate_image({
        'image': {'source': {'image_uri': phone_src_url}},
        'features': [{'type': vision.enums.Feature.Type.TEXT_DETECTION}],
    })

if __name__ == '__main__':
    get_phone_number(url)
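The detected digits should then be on the response object; with this version of the google-cloud-vision client, text detection results are exposed as text_annotations (the first annotation holds the full detected text), so inside get_phone_number you could finish with something like:

# First annotation is the full text block detected in the image
print(response.text_annotations[0].description)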
2. Use OpenCV
This method is going to involve writing a lot of code yourself. The main assumption here is that you are going to parse dubizzle links. If that is the case, the font for those phone numbers is standard. You will have to parse the images of each digit from 0 to 9 into recognisable curves, and then detect those curves in each image. Detailed instructions here.
You find and cut out 10 images, one for each digit. This will be your master set. Then you match the images by following the tutorial I linked. Depending on the position of each match, you order the output from left to right; a rough sketch of the idea follows below.
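Here is a rough, untested sketch of that template-matching idea with OpenCV, assuming hypothetical files phone.png (the downloaded number image) and 0.png through 9.png (your cropped digit masters):

import cv2

phone_img = cv2.imread("phone.png", cv2.IMREAD_GRAYSCALE)

matches = []
for digit in range(10):
    template = cv2.imread(f"{digit}.png", cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(phone_img, template, cv2.TM_CCOEFF_NORMED)
    # Keep every x position where this digit matches well enough;
    # in practice you would also de-duplicate overlapping hits
    ys, xs = (result >= 0.9).nonzero()
    matches += [(x, str(digit)) for x in xs]

# Order the matched digits left to right to reconstruct the number
number = "".join(d for _, d in sorted(matches))
print(number)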

Beautiful soup with find all only gives the last result

I'm trying to retrieve all the products from a page using Beautiful Soup. The page has pagination, and to handle it I have made a loop so the retrieval works for all pages.
But when I move to the next step and try to find_all() the tags, it only gives the data from the last page.
If I try with one isolated page it works fine, so I guess it is a problem with getting all the HTML from all pages.
My code is the following:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib3 as ur

http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'

for x in range(1, int(33) + 1):
    dog_products_http = http.request('GET', base_url + '?p=' + str(x))
    soup = BeautifulSoup(dog_products_http.data, 'html.parser')
    print(soup.prettify)
and once it has finished:
soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
As I said, if I do not use the for range and only retrieve one page (for example https://www.kiwoko.com/tienda-de-perros-online.html?p=10), it works fine and gives me the 36 products.
I have copied the soup into a Word file and searched for the class to check for a problem, and all 1,153 products I'm looking for are there.
So I think the soup is right, but as I'm looking for "more than one HTML", I don't think find_all is working well.
What could be the problem?
You do want your find_all inside the loop (a minimal fix of your original code is sketched after the code below), but here is a way to copy the AJAX call the page makes, which allows you to return more items per request, calculate the number of pages dynamically, and make requests for all products.
I re-use the connection with Session for efficiency.
from bs4 import BeautifulSoup as bs
import requests, math

results = []

with requests.Session() as s:
    r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p=1&product_list_limit=54&isAjax=1&_=1560702601779').json()
    soup = bs(r['categoryProducts'], 'lxml')
    results.append(soup.select('.product-item-details'))
    product_count = int(soup.select_one('.toolbar-number').text)
    pages = math.ceil(product_count / 54)

    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get('https://www.kiwoko.com/tienda-de-perros-online.html?p={}&product_list_limit=54&isAjax=1&_=1560702601779'.format(page)).json()
            soup = bs(r['categoryProducts'], 'lxml')
            results.append(soup.select('.product-item-details'))

results = [result for item in results for result in item]
print(len(results))
# parse out from results what you want, as this is a list of tags, or do it in the loop above
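For reference, the minimal fix to your original code, as mentioned above, is simply to run find_all inside the loop and accumulate the results instead of overwriting soup on each iteration:

import urllib3 as ur
from bs4 import BeautifulSoup

http = ur.PoolManager()
base_url = 'https://www.kiwoko.com/tienda-de-perros-online.html'

products = []
for x in range(1, 34):
    page = http.request('GET', base_url + '?p=' + str(x))
    soup = BeautifulSoup(page.data, 'html.parser')
    # Collect matches from every page rather than keeping only the last soup
    products += soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})

print(len(products))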
