Web scraping a website with JSON content gives ValueError - python

I am trying to scrape an API call with requests. This is the website.
This is the error it gives me:
ValueError: No JSON object could be decoded
This is the code:
import requests
import json
import time
from bs4 import BeautifulSoup
url = 'https://www.nseindia.com/api/event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url,headers=headers)
data = json.loads(request.text)
print(data)
How can I scrape this website?

Try this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.nseindia.com/companies-listing/corporate-filings-event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url,headers=headers)
soup = BeautifulSoup(request.text,'html.parser')
print(soup)

The table is probably being generated dynamically with JavaScript, so requests alone won't see it. You need Selenium and a headless browser to do that.
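Before reaching for Selenium, it can help to confirm what the failing json.loads actually received — often an HTML block page or an empty body rather than JSON. A minimal stdlib-only sketch; the sample bodies below are stand-ins for whatever response.text contains:

```python
import json

def try_decode(body: str):
    """Attempt to decode a response body; report what came back on failure."""
    try:
        return json.loads(body)
    except ValueError:
        # Show the first characters so you can tell whether the server
        # sent HTML, an empty body, or something else entirely.
        print(f"Not JSON, first 40 chars: {body[:40]!r}")
        return None

print(try_decode('{"status": "ok"}'))           # decodes fine
print(try_decode('<html>Access Denied</html>')) # typical blocked response
```

If the body turns out to be an "Access Denied" page, the problem is the request being blocked, not the JSON parsing.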

Related

Web scraping just returns None

I'm trying to make a pop-up program with the MIR4 Draco price, but the price returns None:
import requests
from bs4 import BeautifulSoup
urll = 'https://www.xdraco.com/coin/price/'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/86.0.4240.198 Safari/537.36"}
site = requests.get(urll, headers=headers)
soup = BeautifulSoup(site.content, 'html5lib')
price = soup.find('span', class_="amount")
print(price)
You won't be able to parse a site that is dynamically loaded using JS, as @jabbson mentioned.
This might be a way to get the data you want.
If you check the network requests being made by the page, you will find that it makes calls to a few different APIs. I found one that might have the info you're looking for. You can make POST requests to this API as shown below...
import requests
import json
headers = {'accept':'application/json, text/plain, */*','user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
html = requests.post('https://api.mir4global.com/wallet/prices/hydra/daily', headers=headers)
output = json.loads(html.text)
# 'output' is a dictionary. If we index the last element, we can get the latest data entry
print(output['Data'][-1])
OUTPUT:
{'CreatedDT': '2022-08-04 21:55:00', 'HydraPrice': '2.1301000000000001', 'HydraAmount': '13434', 'HydraPricePrev': '2.3336000000000001', 'HydraAmountPrev': '5972', 'HydraUSDWemixRate': '2.9401340627166839', 'HydraUSDKLAYRate': '0.29840511595654395', 'USDHydraRate': '6.2627795669928084'}
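Note that the API returns the numeric fields as strings, so they need converting before you can do arithmetic on them. A small sketch using the sample entry above (the percentage-change calculation is my addition, not part of the API):

```python
# Entry copied from the sample output above; prices arrive as strings.
entry = {
    'CreatedDT': '2022-08-04 21:55:00',
    'HydraPrice': '2.1301000000000001',
    'HydraPricePrev': '2.3336000000000001',
}

price = float(entry['HydraPrice'])
prev = float(entry['HydraPricePrev'])
change_pct = (price - prev) / prev * 100  # change vs. previous entry

print(f"{entry['CreatedDT']}: {price:.4f} ({change_pct:+.2f}%)")
```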

Python requests returning a different result than the original page (browser)

I am trying to build a simple web scraper to monitor Nike's site here in Brazil.
Basically, I want to track products that are in stock right now, to check when new products are added.
My problem is that when I navigate to the site https://www.nike.com.br/snkrs#estoque I see different products compared to what I get using Python's requests method.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://www.nike.com.br/snkrs#estoque'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
len(soup.find_all(class_='produto produto--comprar'))
This code gives me 40, but in the browser I can see 56 products: https://prnt.sc/26jeo1i
The data comes from a different source, spread across 3 pages.
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
productList = []
for p in [1, 2, 3]:
    url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    productList += soup.find_all(class_='produto produto--comprar')
Output:
print(len(productList))
56
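Rather than hard-coding the page numbers [1, 2, 3], the loop can keep requesting pages until one comes back empty, which survives the product count changing. A sketch of that pattern with a stubbed fetch_page standing in for the requests/BeautifulSoup call (the stub and its simulated page sizes are invented for illustration):

```python
def fetch_page(p: int):
    """Stand-in for requests.get + soup.find_all on page p.
    Simulates 56 products spread over 3 pages, as in the example above."""
    pages = {1: ['item'] * 20, 2: ['item'] * 20, 3: ['item'] * 16}
    return pages.get(p, [])

# Keep paging until a page is empty instead of hard-coding [1, 2, 3].
products = []
p = 1
while True:
    batch = fetch_page(p)
    if not batch:
        break
    products += batch
    p += 1

print(len(products))  # 56
```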

How do I scrape the links from this page?

I am trying to scrape the product links from the Amazon website, but it only gives me 2 or 3 links.
The link of the website is https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar')
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])
Here is the working solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
base_url='https://www.amazon.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36','session':'141-2320098-4829807'}
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar', headers = headers)
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('a', class_="a-link-normal s-underline-text s-underline-link-text a-text-normal", href=True):
    p = link['href']
    l = urljoin(base_url, p)
    print(l)
Output:
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A05861132UJ9W79S82Z3&url=%2FFiskars-Inch-Student-Scissors-Pack%2Fdp%2FB08CL355MN%2Fref%3Dsr_1_1_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-1-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A0918144191FAIKGYK3YC&url=%2FFiskars-Inch-Blunt-Kids-Scissors%2Fdp%2FB00TJSS9ZW%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-2-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A09889161KB2CNO5NB8QC&url=%2FLind-Kitchen-Dispenser-Decorative-Stationery%2Fdp%2FB07VRLW5C6%2Fref%3Dsr_1_3_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-3-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/Zebra-Pen-Retractable-Ballpoint-18-Count/dp/B00M382RJO/ref=sr_1_4?dchild=1&qid=1633717907&s=office-products&sr=1-4
... so on
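The urljoin step is what turns the relative hrefs in the anchor tags into the absolute URLs shown above. A quick stdlib-only illustration (the hrefs below are made-up examples, not taken from a live page):

```python
from urllib.parse import urljoin

base_url = 'https://www.amazon.com'

# Typical href shapes found in anchor tags (invented examples)
hrefs = [
    '/Zebra-Pen-Retractable/dp/B00M382RJO/ref=sr_1_4',  # site-relative path
    'https://www.amazon.com/dp/B08CL355MN',             # already absolute
]

for h in hrefs:
    # urljoin resolves a relative path against the base,
    # and leaves an already-absolute URL untouched.
    print(urljoin(base_url, h))
```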

I'm trying to scrape the reviews of a post, but they don't get scraped

I'm trying to scrape the reviews of a post, but nothing gets scraped. Please help me solve it; I would be very thankful.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
r = requests.get('https://www.realpatientratings.com/botox-cosmetic')
soup = BeautifulSoup(r.content, 'lxml')
tag = soup.find_all('p', class_='text')
for u in tag:
    print(u.text)
After checking the XHR requests, I found out that you're requesting the wrong page.
Try:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
r = requests.get('https://www.realpatientratings.com/reviews/procreviewfilters?type=surgical&star=&procedureId=147&sort=new&location=&state=0&within=0')
soup = BeautifulSoup(r.content, 'lxml')
tag = soup.find_all('p', class_='text')
for u in tag:
    print(u.text)
I just changed https://www.realpatientratings.com/botox-cosmetic to https://www.realpatientratings.com/reviews/procreviewfilters?type=surgical&star=&procedureId=147&sort=new&location=&state=0&within=0
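Long query strings like this one are easier to maintain if built from a parameter dict with the stdlib's urlencode. A sketch using the parameters from the XHR URL above (their meanings, e.g. that procedureId selects the procedure, are inferred, not documented):

```python
from urllib.parse import urlencode

base = 'https://www.realpatientratings.com/reviews/procreviewfilters'

# Parameters taken verbatim from the XHR URL above.
params = {
    'type': 'surgical',
    'star': '',
    'procedureId': 147,  # appears to select the procedure (inferred)
    'sort': 'new',
    'location': '',
    'state': 0,
    'within': 0,
}

url = f'{base}?{urlencode(params)}'
print(url)
```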

How do I properly use the find function from BeautifulSoup4 in Python 3?

I'm following a YouTube tutorial on how to scrape an Amazon product page. First I'm trying to get the product title. Later I want to get the Amazon price and the second-hand price. For this I'm using requests and bs4. Here is the code so far:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Teenage-Engineering-Synthesizer-FM-Radio-AMOLED-Display/dp/B00CXSJUZS/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=op-1&qid=1594672884&sr=8-1-spons&psc=1&smid=A1GQGGPCGF8PV9&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFEMUZSUjhQMUM3NTkmZW5jcnlwdGVkSWQ9QTAwMzMwODkyQkpTNUJUUE9QUFVFJmVuY3J5cHRlZEFkSWQ9QTA4MzM4NDgxV1Y3UzVVN1lXTUZKJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find('span',{'id' : "productTitle"})
print(title)
My title is None, so the find function doesn't find the element with the id "productTitle". But checking the soup shows that there is an element with that id.
So what's wrong with my code?
I also tried:
title = soup.find(id = "productTitle")
Try this:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Teenage-Engineering-Synthesizer-FM-Radio-AMOLED-Display/dp/B00CXSJUZS/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=op-1&qid=1594672884&sr=8-1-spons&psc=1&smid=A1GQGGPCGF8PV9&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFEMUZSUjhQMUM3NTkmZW5jcnlwdGVkSWQ9QTAwMzMwODkyQkpTNUJUUE9QUFVFJmVuY3J5cHRlZEFkSWQ9QTA4MzM4NDgxV1Y3UzVVN1lXTUZKJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'lxml')
title = soup.find('span',{'id' : "productTitle"})
print(title.text.strip())
You're doing the right thing but have a "bad" parser. Read more about the differences between parsers here. I prefer lxml but also sometimes use html5lib. I also added .text.strip() to the print call so that only the title text is printed.
Note: you have to install lxml for Python first!
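The .text.strip() step can be illustrated offline. The snippet below mimics the whitespace-padded structure of a product-title span (the HTML and the title string are invented), using the built-in html.parser so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the product page: the title text is padded with
# whitespace, much like real product-page markup.
html = '''
<html><body>
  <span id="productTitle">
      Teenage Engineering OP-1 Synthesizer
  </span>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works here, if installed
title = soup.find('span', {'id': 'productTitle'})

print(repr(title.text))    # raw text keeps the surrounding whitespace
print(title.text.strip())  # only the trimmed title text
```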
