Extract json content in <script using BeautifulSoup

Extract json content in <script using BeautifulSoup - python

I'm trying to extract a json part of the script using beautiful soup but it prints Nothing. What's wrong?
import requests,json
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
r = requests.get('https://www.joox.com/id/id/single/xtMtD9ZdeLEdHp1w1fip8w==',headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')
data = json.loads(soup.find('script').text)
print(data)
The soup result is like:
<script>
__NEXT_DATA__ = {"props":{"trackData":{"album_url":"https://imgcache.qq.com/music/photo/mid_album_300/O/F/002tkoBZ2zJPOF.jpg","code":0,"country":"hk","encodeSongId":"xtMtD9ZdeLEdHp1w1fip8w==","express_domain":"http://stream.music.joox.com/","flag":0,"gososo":0,"has_hifi":false,"has_hq":false,"imgSrc":"https://imgcache.qq.com/music/photo/mid_album_300/O/F/002tkoBZ2zJPOF.jpg","kbps_map":"{\"128\":4082449,\"192\":6522038,\"24\":825973,\"320\":10216761,\"48\":1708095,\"96\":3701266,\"ape\":0,\"flac\":0}\n","ktrack_id":0,"m4aUrl":"https://hk.stream.music.joox.com/C4000035Jew421IKj2.m4a?
Any ideas for this? thanks

You can use substring and regular expression for obtaining what you are seeking for.
import requests,json, re
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
r = requests.get('https://www.joox.com/id/id/single/xtMtD9ZdeLEdHp1w1fip8w==',headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')
# finding the script element whihc contains the required data
json_data = soup.find('script', text=re.compile("__NEXT_DATA__") )
# identifying the required elements and obtaining it by sub string
data = str(json_data)[str(json_data).find('__NEXT_DATA__ = '):str(json_data).find('module={}')]
data = json.loads(data.replace('__NEXT_DATA__ = ', ''))
print(data)
Using slicing technique we are identifying the string pattern and extracting the json string alone. After that you are good to go!
Hope this helps! Cheers!

even BeautifulSoup still need Regex, why not use it directly
import re
....
r = requests.get('https://www.joox.com/id/id/single/xtMtD9ZdeLEdHp1w1fip8w==',headers=headers)
jsData = re.search(r'__NEXT_DATA__\s+=\s+(.*)', r.text)
data = json.loads(jsData.group(1))
print(data)

Related

Why does soup.findAll return []?

I am having trouble reading data from a url with BeautifulSoup
This is my code:
url1 = "https://www.barstoolsportsbook.com/events/1018282280"
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get(url1, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
data = soup.findAll('div',attrs={"class":"section"})
print(data)
#for x in data:
# print(x.find('p').text)
When I print(data) I am returned []. What could be the reason for this? I would like to avoid using selenium for this task if possible.
This is the HTML for what I'm trying to grab
<div data-v-50e01018="" data-v-2a52296d="" class="section"><p data-v-5f665d29="" data-v-50e01018="" class="header strongbody2">HOT TIPS</p><p data-v-5f665d29="" data-v-50e01018="" class="tip body2"> The Mets have led after 3 innings in seven of their last nine night games against NL East Division opponents that held a losing record. </p><p data-v-5f665d29="" data-v-50e01018="" class="tip body2"> The 'Inning 1 UNDER 0.5 runs' market has hit in each of the Marlins' last nine games against NL East opponents. </p></div>

You can likely get what you want with this request:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0',
'Accept': 'application/json, text/plain, */*',
'Accept-Language': 'en-US,en;q=0.5',
'Authorization': 'Basic MTI2OWJlMjItNDI2My01MTI1LWJlNzMtMDZmMjlmMmZjNWM3Omk5Zm9jajRJQkZwMUJjVUc0NGt2S2ZpWEpremVKZVpZ',
'Origin': 'https://www.barstoolsportsbook.com',
'DNT': '1',
'Connection': 'keep-alive',
'Referer': 'https://www.barstoolsportsbook.com/',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'cross-site',
}
response = requests.get('https://api.isportgenius.com.au/preview/1018282280', headers=headers)
response.json()
You should browse the network tab to see where the rest of the data is coming from, or use a webdriver.

Parsing a table with Pandas

I am trying to parse the table from https://alreits.com/screener
I have tried this:
main_url = 'https://alreits.com/screener'
r = requests.get(main_url)
df_list = pd.read_html(r.text)
df = df_list[0]
print(df)
but pandas cant find the table.
I have also tried using BeautifulSoup4 but it didnt seem to give better results.
This is the selector: #__next > div.MuiContainer-root.MuiContainer-maxWidthLg > div.MuiBox-root.jss9.Card__CardContainer-feksr6-0.fpbzHQ.ScreenerTable__CardContainer-sc-1c5wxgl-0.GRrTj > div > table > tbody
This is the full xPath: /html/body/div/div[2]/div[2]/div/table/tbody
I am trying to get the Stock symbol (under name),sector,score and market cap. The other data would be nice to have but is not necessary.
Thank You!

I found one JSON url from the dev tool. This is an easy way to extract the table instead of using selenium. Use post request to extract the data.
import requests
headers = {
'authority': 'api.alreits.com:8080',
'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
'sec-ch-ua-platform': '"Windows"',
'content-type': 'application/json',
'accept': '*/*',
'origin': 'https://alreits.com',
'sec-fetch-site': 'same-site',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://alreits.com/',
'accept-language': 'en-US,en;q=0.9',
}
params = (
('page', '0'),
('size', '500'),
('sort', ['marketCap,desc', 'score,desc', 'ffoGrowth,desc']),
)
data = '{"filters":[]}'
response = requests.post('https://api.alreits.com:8080/api/reits/screener', headers=headers, params=params, data=data)
df = pd.DataFrame(response.json())

The code below will return the data you are looking for.
import requests
import pprint
import json
headers = {'content-type': 'application/json',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
r = requests.post(
'https://api.alreits.com:8080/api/reits/screener?page=0&size=500&sort=marketCap,desc&sort=score,desc&sort=ffoGrowth,desc',
headers=headers, data=json.dumps({'filters':[]}))
if r.status_code == 200:
pprint.pprint(r.json())
# Now you have the data - do what you want with it
else:
print(r.status_code)

Why getting empty list while try to get links which includes specific class in python using bs4?

I'm try to get some links which are includes specific class, therefore i writed this code:
from bs4 import BeautifulSoup
import requests
def getPages(requestedURLS):
_buff = []
for url in requestedURLS:
try:
_buff.append(requests.get(url))
except requests.exceptions.RequestException as err:
print(err)
return _buff
def getProductList(pages):
links = []
for page in pages:
content = BeautifulSoup(page.content, 'html.parser')
links.extend(content.find_all("a", class_="sresult lvresult clearfix li shic"))
print(links)
def main():
pageLinks = [
"https://www.ebay.co.uk/sch/m.html?_nkw=&_armrs=1&_from=&_ssn=carabaeuro13&_clu=2&_fcid=3&_localstpos=&_stpos=&gbr=1&_pppn=r1&scp=ce0",
"https://www.ebay.co.uk/sch/m.html?_nkw=&_armrs=1&_from=&_ssn=carabaeuro13&_clu=2&_fcid=3&_localstpos=&_stpos=&gbr=1&_pgn=2&_skc=200&rt=nc",
"https://www.ebay.co.uk/sch/m.html?_nkw=&_armrs=1&_from=&_ssn=carabaeuro13&_clu=2&_fcid=3&_localstpos=&_stpos=&gbr=1&_pgn=3&_skc=400&rt=nc"
]
#Results in : <div id="Results" class="results "> <ul id="ListViewInner">
pages = getPages(pageLinks)
productList = getProductList(pages)
if __name__ == '__main__':
main()
You can check links there are lots of links which includes this class but output is empty as you see in below:
C:\Users\projects\getMarketData\venv\Scripts\python.exe C:/Users/projects/getMarketData/getData.py
[]
[]
[]
Process finished with exit code 0
What is wrong?

Using appropriate headers should do the trick:
url = 'https://www.ebay.co.uk/sch/m.html?_nkw=&_armrs=1&_from=&_ssn=carabaeuro13&_clu=2&_fcid=3&_localstpos=&_stpos=&gbr=1&_pppn=r1&scp=ce0'
request_headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Host': 'www.ebay.co.uk',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
r = requests.get(url, headers=request_headers)
soup = BeautifulSoup(r.content, 'lxml')
items = soup.find_all('li', {'class': 'sresult lvresult clearfix li shic'})
Result:
print(len(items)) # 18
print(items[0]['listingid']) # 333200565336
Use the network tab of the developer tools to inspect the traffic. Check this out if you've never done this before.

How to get cookies value to set in requests?

I am accessing a URL https://streeteasy.com/sales/all which does not show the page unless Cookie is set. I am having no idea how this cookie value being generated. I highly doubt that cookie value is fixed so I guess I can't use a hard-coded Cookie value either.
Code below:
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'streeteasy.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'referer': 'https://streeteasy.com/sales/all',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,ur;q=0.8',
'cookie': 'D_SID=103.228.157.1:Bl5GGXCWIxq4AopS1Hkr7nkveq1nlhWXlD3PMrssGpU; _se_t=0944dfa5-bfb4-4085-812e-fa54d44acc54; google_one_tap=0; D_IID=AFB68ACC-B276-36C0-8718-13AB09A55E51; D_UID=23BA0A61-D0DF-383D-88A9-8CF65634135F; D_ZID=C0263FA4-96BF-3071-8318-56839798C38D; D_ZUID=C2322D79-7BDB-3E32-8620-059B1D352789; D_HID=CE522333-8B7B-3D76-B45A-731EB750DF4D; last_search_tab=sales; se%3Asearch%3Asales%3Astate=%7C%7C%7C%7C; streeteasy_site=nyc; se_rs=123%2C1029856%2C123%2C1172313%2C2815; se%3Asearch%3Ashared%3Astate=102%7C%7C%7C%7Cfalse; anon_searcher_stage=initial; se_login_trigger=4; se%3Abig_banner%3Asearch=%7B%22123%22%3A2%7D; se%3Abig_banner%3Ashown=true; se_lsa=2019-07-08+04%3A01%3A30+-0400; _ses=BAh7DEkiD3Nlc3Npb25faWQGOgZFVEkiJWRiODVjZTA1NmYzMzZkMzZiYmU4YTk4Yjk5YmU5ZTBlBjsAVEkiEG5ld192aXNpdG9yBjsARlRJIhFsYXN0X3NlY3Rpb24GOwBGSSIKc2FsZXMGOwBUSSIQX2NzcmZfdG9rZW4GOwBGSSIxbTM5eGRPUVhLeGYrQU1jcjZIdi81ajVFWmYzQWFSQmhxZThNcG92cWxVdz0GOwBGSSIIcGlzBjsARmkUSSIOdXNlcl9kYXRhBjsARnsQOhBzYWxlc19vcmRlckkiD3ByaWNlX2Rlc2MGOwBUOhJyZW50YWxzX29yZGVySSIPcHJpY2VfZGVzYwY7AFQ6EGluX2NvbnRyYWN0RjoNaGlkZV9tYXBGOhJzaG93X2xpc3RpbmdzRjoSbW9ydGdhZ2VfdGVybWkjOhltb3J0Z2FnZV9kb3ducGF5bWVudGkZOiFtb3J0Z2FnZV9kb3ducGF5bWVudF9kb2xsYXJzaQJQwzoSbW9ydGdhZ2VfcmF0ZWYJNC4wNToTbGlzdGluZ3Nfb3JkZXJJIhBsaXN0ZWRfZGVzYwY7AFQ6EHNlYXJjaF92aWV3SSIMZGV0YWlscwY7AFRJIhBsYXN0X3NlYXJjaAY7AEZpAXs%3D--d869dc53b8165c9f9e77233e78c568f610994ba7',
}
session = requests.Session()
response = session.get('https://streeteasy.com/for-sale/downtown', headers=headers, timeout=20)
if response.status_code == 200:
html = response.text
soup = BeautifulSoup(html, 'lxml')
links = soup.select('h3 > a')
print(links)

Reading data from a website passing parameters

import requests
from lxml import html
from bs4 import BeautifulSoup
session_requests = requests.session()
sw_url = "https://www.southwest.com"
sw_url2 = "https://www.southwest.com/flight/select-flight.html?displayOnly=&int=HOMEQBOMAIR"
#result = session_requests.get(sw_url)
#tree = html.fromstring(result.text)
payload = {"name":"AirFormModel","origin":"MCI","destination":"DAL","departDate":"2018-02-28T06:00:00.000Z","returnDate":"2018-03-03T06:00:00.000Z","tripType":"true","priceType":"DOLLARS","adult":1,"senior":0,"promoCode":""}
#{
# 'origin': 'MCI',
# 'destination': 'DAL',
# 'departDate':'2018-02-28T06:00:00.000Z',
# 'returnDate':'2018-03-01T06:00:00.000Z',
# 'adult':'1'
#}
p = requests.post(sw_url,params=payload)
#print(p.text)
print(p.content)
p1 = requests.get(sw_url2)
soup = BeautifulSoup(p.text,'html.parser')
print(soup.find("div",{"class":"productPricing"}))
pr = soup.find_all("span",{"class":"currency_symbol"})
for tag in pr:
print(tag)
print('++++')
print(tag.next_sibling)
print(soup.find("div",{"class":"twoSegments"}))
soup = BeautifulSoup(p1.text,'html.parser')
print(soup.find("div",{"class":"productPricing"}))
pr = soup.find_all("span",{"class":"currency_symbol"})
for tag in pr:
print(tag)
print('++++')
print(tag.next_sibling)
print(soup.find("div",{"class":"twoSegments"}))
I need to retrieve prices for flights between 2 locations on specific dates. I identified the parameters by looking at the session info from inspector of the browser and included them in the post request.
I am not sure what I'm doing wrong here, but I am unable to read the data from the tags correctly. It's printing none.
Edit : 4/25/2018
I'm using the following code now, but it doesn't seem to help. Please advise.
import threading
from lxml import html
from bs4 import BeautifulSoup
import time
import datetime
import requests
def worker(oa,da,ods):
"""thread worker function"""
print (oa + ' ' + da + ' ' + ods + ' ' + str(datetime.datetime.now()))
url = "https://www.southwest.com/api/air-booking/v1/air-booking/page/air/booking/shopping"
rh = {
'accept': 'application/json,text/javascript,*/*;q=0.01',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.5',
'cache-control': 'max-age=0',
'content-length': '454',
'content-type': 'application/json',
'referer': 'https://www.southwest.com/air/booking/select.html?originationAirportCode=MCI&destinationAirportCode=LAS&returnAirportCode=&departureDate=2018-05-29&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
fd = {
'returnAirport':'',
'twoWayTrip':'false',
'fareType':'DOLLARS',
'originAirport':oa,
'destinationAirport':da,
'outboundDateString':ods,
'returnDateString':'',
'adultPassengerCount':'1',
'seniorPassengerCount':'0',
'promoCode':'',
'submitButton':'true'
}
with requests.Session() as s:
r = s.post(url,headers = rh )
# soup = BeautifulSoup(r.content,'html.parser')
# soup = BeautifulSoup(r.content,'lxml')
print(r)
print(r.content)
print (oa + ' ' + da + ' ' + ods + ' ' + str(datetime.datetime.now()))
return
#db = MySQLdb.connect(host="localhost",user="root",passwd="vikram",db="garmin")
rcount = 0
tdelta = 55
#print(strt_date)
threads = []
count = 1
thr_max = 2
r = ["MCI","DEN","MCI","MDW","MCI","DAL"]
strt_date = (datetime.date.today() + datetime.timedelta(days=tdelta)).strftime("%m/%d/%Y")
while count < 2:
t = threading.Thread(name=r[count-1]+r[count],target=worker,args=(r[count-1],r[count],strt_date))
threads.append(t)
t.start()
count = count + 2

When you say looked at the session info from inspector of the browser, I'm assuming you meant the network tab. If that's the case, are you sure you noted the data being sent properly?
Here's the URL that gets sent by the browser, following which the page you required is fetched:
url = 'https://www.southwest.com/flight/search-flight.html'
You didn't use headers in your request, which, in my opinion, should be passed compulsorily in some cases. Here are the headers that the browser passes:
:authority:www.southwest.com
:method:POST
:path:/flight/search-flight.html
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.9
cache-control:max-age=0
content-length:564
content-type:application/x-www-form-urlencoded
origin:https://www.southwest.com
referer:https://www.southwest.com/flight/search-flight.html?int=HOMEQBOMAIR
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36
Note:
I removed the cookie header, because that would be taken care of by requests if you're using session.
The first four headers (those that begin with a colon (':')) cannot be passed in Python's requests; so, I skipped them.
Here's the dict that I used to pass the headers:
rh = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'content-length': '564',
'content-type': 'application/x-www-form-urlencoded',
'origin': 'https://www.southwest.com',
'referer': 'https://www.southwest.com/flight/search-flight.html?int=HOMEQBOMAIR',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
}
And here is the form data sent by browser:
fd = {
'toggle_selfltnew': '',
'toggle_AggressiveDrawers': '',
'transitionalAwardSelected': 'false',
'twoWayTrip': 'true',
'originAirport': 'MCI',
# 'originAirport_displayed': 'Kansas City, MO - MCI',
'destinationAirport': 'DAL',
# 'destinationAirport_displayed': 'Dallas (Love Field), TX - DAL',
'airTranRedirect': '',
'returnAirport': 'RoundTrip',
'returnAirport_displayed': '',
'outboundDateString': '02/28/2018',
'outboundTimeOfDay': 'ANYTIME',
'returnDateString': '03/01/2018',
'returnTimeOfDay': 'ANYTIME',
'adultPassengerCount': '1',
'seniorPassengerCount': '0',
'promoCode': '',
'fareType': 'DOLLARS',
'awardCertificateToggleSelected': 'false',
'awardCertificateProductId': ''
}
Note that I commented out two of the items above, but it didn't make any difference. I assumed you'd be having only the location codes and not the full name. If you do have them or if you can extract them from the page, you can send those as well along with other data.
I don't know if it makes any difference, but I used data instead of params:
with requests.Session() as s:
r = s.post(url, headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
Finally, here is the result:
>>> soup.find('span', {'class': 'currency_symbol'}).text
'$'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract json content in <script using BeautifulSoup - python

even BeautifulSoup still need Regex, why not use it directly import re .... r = requests.get('https://www.joox.com/id/id/single/xtMtD9ZdeLEdHp1w1fip8w==',headers=headers) jsData = re.search(r'__NEXT_DATA__\s+=\s+(.*)', r.text) data = json.loads(jsData.group(1)) print(data)

Related

Why does soup.findAll return []?

Parsing a table with Pandas

Why getting empty list while try to get links which includes specific class in python using bs4?

How to get cookies value to set in requests?

Reading data from a website passing parameters

Categories

Resources