Webscraping website - can't print price - api & json i think

Webscraping website - can't print price - api & json i think - python

having trouble with this website to print price, i think i'm close but getting errors.
please help, tx
"
{'statusDetails': {'state': 'FAILURE', 'errorCode': 'SYS-3003', 'correlationid': 'rrt-5636881267628447407-b-gsy1-18837-18822238-1', 'description': 'Invalid key identifier or token'}}
"
code:
import requests
import json
s = requests.Session()
url = 'https://www.bunnings.com.au/ozito-pxc-2-x-18v-cordless-line-trimmer-skin-only_p0167719'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
resp = s.get(url,headers=header)
api_url = f'https://api.prod.bunnings.com.au/v1/products/0167719/fulfillment/6400/radius/100000?isToggled=true'
price_resp = s.get(api_url,headers=header).json()
print(price_resp)
#price = price_resp['data']['price']['value']
#print(price)

Related

Selecting links within a div tag using beautiful soup

I am trying to run the following code
headers = {
'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
params = {
'q': 'Machine learning,
'hl': 'en'
}
html = requests.get('https://scholar.google.com/scholar', headers=headers,
params=params).text
soup = BeautifulSoup(html, 'lxml')
for result in soup.select('.gs_r.gs_or.gs_scl'):
profiles=result.select('.gs_a a')['href']
The following output (error) is being shown
"TypeError: list indices must be integers or slices, not str"
What is it I am doing wrong?

The following is tested and works:
import requests
from bs4 import BeautifulSoup as bs
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
params = {
'q': 'Machine learning',
'hl': 'en'
}
html = requests.get('https://scholar.google.com/scholar', headers=headers,
params=params).text
soup = bs(html, 'lxml')
for result in soup.select('.gs_r.gs_or.gs_scl'):
profiles=result.select('.gs_a a')
for p in profiles:
print(p.get('href'))
Result in terminal:
/citations?user=rSVIHasAAAAJ&hl=en&oi=sra
/citations?user=MnfzuPYAAAAJ&hl=en&oi=sra
/citations?user=09kJn28AAAAJ&hl=en&oi=sra
/citations?user=yxUduqMAAAAJ&hl=en&oi=sra
/citations?user=MnfzuPYAAAAJ&hl=en&oi=sra
/citations?user=9Vdfc2sAAAAJ&hl=en&oi=sra
/citations?user=lXYKgiYAAAAJ&hl=en&oi=sra
/citations?user=xzss3t0AAAAJ&hl=en&oi=sra
/citations?user=BFdcm_gAAAAJ&hl=en&oi=sra
/citations?user=okf5bmQAAAAJ&hl=en&oi=sra
/citations?user=09kJn28AAAAJ&hl=en&oi=sra
In your code, you were trying to obtain the href attribute from a list (soup.select returns a list, and soup.select_one return a single element).
See BeautifulSoup documentation here

How to decode an UTF-8 encoded API response

When I send a request to an API:
import requests
url = 'website'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
response = requests.get(url.strip(), headers=headers, timeout=10)
response.encoding = response.apparent_encoding
print(response.text)
The output is:
0e1\u10e2\u10d4\u10db\u10d0 BOSE,\u10d3\u10d0\u10ec\u10e7\u10d4\u10d1\u10d0-\u10d2\u10d0\u10e9\u10d4\u10e0\u10d4\u10d1\u10d8\u10e1 \u10e1\u10d8\u10e1\u10e2\u10d4\u10db\u10d0,\u10d4\u10da\u10d4\u10e5\u10e2\u10e0\u10dd\u10dc\u10e3\u10da\u10d8 \u10d3\u10d8\u10e4\u10d4\u10e0\u10d4\u10dc\u10ea\u10d8\u10d0\u10da\u10e3\u10e0\u10d8 \u10e1\u10d0\u10d9\u10d4\u10e2\u10d8,\u10eb\u10e0\u10d0\u10d5\u10d8\u10e1 \u10e1\u10d0\u10db\u10e3\u10ee\u10e0\u10e3\u10ed\u10d4 \u10d9\u10dd\u10dc\u10e2\u10e0\u10dd\u10da\u10d8\u10e1 \u10e1\u10d8\u10e1\u10e2\u10d4\u10db\u10d0,\u10ec\u10d4\u10d5\u10d8\u10e1 \u10d9\u10dd\u10dc\u10e2\u10e0\u10dd\u10da\u10d8\u10e1 \u10e1\u10d8\u10e1\u10e2\u10d4\u10db\u10d0,\u10e1\u10e2\u10d0\u10d1\u10d8\u10da\u10e3\u10e0\u10dd\u10d1\u10d8\u10e1 \u10e1\u10d8\u10e1\u10e2\u10d4\u10db\u10d0,\u10d3\u10d0\u10d1\u10da\u10dd\u10d9\u10d5\u10d8\u10e1 \u10e1\u10d0\u10ec\u10d8\u10dc\u10d0\u10d0\u10e6\u10db\u10d3\u10d4\u10d2\u10dd \u10d3\u10d0\u10db\u10e3\u10ee\u10e0\u10e3\u10ed\u10d4\u10d
How to decode it correctly?

In order to encode or decode from utf8 string you can use :
s = "test"
u = s.encode("utf8")
s = u.decode("utf8")
But your problem is that your string is utf-8 encoded and escaped!
So you will need to un-escape it and then re-interpret it:
s = response.text
r = s.encode('raw_unicode_escape').decode('unicode_escape')
print(r)
# -> '0e1ტემა BOSE,დაწყება-გაჩერების სისტემა,ელექტრონული დიფერენციალური საკეტი,ძრავის სამუხრუჭე კონტროლის სისტემა,წევის კონტროლის სისტემა,სტაბილურობის სისტემა,დაბლოკვის საწინააღმდეგო დამუხრუჭებ'

How to change Json data output in table format

import requests
from pprint import pprint
import pandas as pd
baseurl = "https://www.nseindia.com/"
url = f'https://www.nseindia.com/api/live-analysis-oi-spurts-underlyings'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
'like Gecko) '
'Chrome/80.0.3987.149 Safari/537.36',
'accept-language': 'en,gu;q=0.9,hi;q=0.8', 'accept-encoding': 'gzip, deflate, br'}
session = requests.Session()
request = session.get(baseurl, headers=headers, timeout=30)
cookies = dict(request.cookies)
res = session.get(url, headers=headers, timeout=30, cookies=cookies)
print(res.json())
I tried df = pd.DataFrame(res.json()) but couldn't get data in table format. How to do that Plz. Also how to select few particular columns only in data output instead of all columns.

Try this :
import json
import codecs
df = pd.DataFrame(json.loads(codecs.decode(bytes(res.text, 'utf-8'), 'utf-8-sig'))['data'])
And to select a specific columns, you can use :
mini_df = df[['symbol', 'latestOI', 'prevOI', 'changeInOI', 'avgInOI']]
>>> print(mini_df)

Yandex Spellchecker API Returns Empty Array

I am trying to harness a Russian language spellcheck API, Yandex.Speller.
The request seems to work fine in my browser. However, when I use a python script, the response is empty.
I am stumped as to what I am doing wrong.
Here is my code:
import urllib
from urllib.request import urlopen
import json
def main():
api(text_preproc())
def text_preproc():
""" Takes misspelled word/phrase,
“t”, and prepares it for
API request
"""
t = "синхрафазатрон в дубне"
text = t.replace(" ", "+")
return text
def diff_api(text):
my_url = "https://speller.yandex.net/services/spellservice.json/checkText?text="
my_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
my_data = {
"text" : text,
"lang" : "ru",
"format" : "plain"}
my_uedata = urllib.parse.urlencode(my_data)
my_edata = my_uedata.encode('ascii')
req = urllib.request.Request(url=my_url, data=my_edata, headers=my_headers)
response = urlopen(req)
data = json.load(response)
print(data)
The response is always an empty array, no matter how I tinker with my request.
Any insight into what I might be doing wrong?

my_uedata has to be a part of the URL you send the request to.
Also, in:
def main():
api(text_preproc())
You call api() but the function is not defined. I've used diff_api().
Try this:
import json
import urllib
from urllib.request import urlopen
def main():
diff_api(text_preproc("синхрафазатрон в дубне"))
def text_preproc(phrase):
""" Takes misspelled word/phrase,
“t”, and prepares it for
API request
"""
return phrase.replace(" ", "+")
def diff_api(text):
my_url = "https://speller.yandex.net/services/spellservice.json/checkText?text="
my_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
my_data = {
"text": text,
"lang": "ru",
"format": "plain"}
my_uedata = urllib.parse.urlencode(my_data)
req = urllib.request.Request(url=my_url+my_uedata, headers=my_headers)
data = json.load(urlopen(req))
print(data)
main()
Output:
[{'code': 1, 'pos': 5, 'row': 0, 'col': 5, 'len': 14, 'word': 'синхрафазатрон', 's': ['синхрофазотрон', 'синхрофазатрон', 'синхрофазотрона']}]

Scraping with requests

what is wrong in my code, I try get the same content like in https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG but result is diffrent as I want to have.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.headers.update({"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
response=s.get('https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-
2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG')
soup=BeautifulSoup(response.text,'lxml')
print(soup.prettify())

You can use requests and pass params in to get json for the train info and prices. I haven't parsed out all the info as this is just to show you it is possible. I parse out the train ids to be able to make the subsequent requests from price info which are linked by ids to the train info
import requests
from bs4 import BeautifulSoup as bs
url = 'https://koleo.pl/pl/connections/?'
headers = {
'Accept' : 'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding' : 'gzip, deflate, br',
'Accept-Language' : 'en-US,en;q=0.9',
'Connection' : 'keep-alive',
'Cookie' : '_ga=GA1.2.2048035736.1553000429; _gid=GA1.2.600745193.1553000429; _gat=1; _koleo_session=bkN4dWRrZGx0UnkyZ3hjMWpFNGhiS1I3TzhQMGNyWitvZlZ0QVRUVVVtWUFPMUwxL0hJYWJyYnlGTUdHYXNuL1N6QlhHMHlRZFM3eFZFcjRuK3ZubllmMjdSaU5CMWRBSTFOc1JRc2lDUGV0Y2NtTjRzbzZEd0laZWI1bjJoK1UrYnc5NWNzZzNJdXVtUlpnVE15QnRnPT0tLTc1YzV1Q2xoRHF4VFpWWTdWZDJXUnc9PQ%3D%3D--3b5fe9bb7b0ce5960bc5bd6a00bf405df87f8bd4',
'Host' : 'koleo.pl',
'Referer' : 'https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG',
'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
'X-CSRF-Token' : 'heag3Y5/fh0hyOfgdmSGJBmdJR3Perle2vJI0VjB81KClATLsJxFAO4SO9bY6Ag8h6IkpFieW1mtZbD4mga7ZQ==',
'X-Requested-With' : 'XMLHttpRequest'
}
params = {
'v' : 'a0dec240d8d016fbfca9b552898aba9c38fc19d5',
'query[date]' : '19-03-2019 10:00:00',
'query[start_station]' : 'krakow-glowny',
'query[end_station]': 'radom',
'query[brand_ids][]' : '29',
'query[brand_ids][]' : '28',
'query[only_direct]' : 'false',
'query[only_purchasable]': 'false'
}
with requests.Session() as s:
data= s.get(url, params = params, headers = headers).json()
print(data)
priceUrl = 'https://koleo.pl/pl/prices/{}?v=a0dec240d8d016fbfca9b552898aba9c38fc19d5'
for item in data['connections']:
r = s.get(priceUrl.format(item['id'])).json()
print(r)

You have to use selenium in order to get that dynamically generated content. And then you can parse html with BS. For example I've parsed dates:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG')
soup = BeautifulSoup(driver.page_source, 'lxml')
for div in soup.findAll("div", {"class": 'date custom-panel'}):
date = div.findAll("div", {"class": 'row'})[0].string.strip()
print(date)
Output:
wtorek, 19 marca
środa, 20 marca

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Webscraping website - can't print price - api & json i think - python

Related

Selecting links within a div tag using beautiful soup

How to decode an UTF-8 encoded API response

How to change Json data output in table format

Yandex Spellchecker API Returns Empty Array

Scraping with requests

Categories

Resources