Trying to find something specific in html code

I am trying to find the specific ID of an altcoin, but I'm not sure how to do it. When I print the page, I get a very long JSON script and get lost trying to find it. Is there an easier way?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
import time
cmc = requests.get('https://coinmarketcap.com/')
soup = BeautifulSoup(cmc.content, 'html.parser')
print(soup.prettify())
What I want is the exact id corresponding to a given altcoin. The output below is for one coin, but the full list is long, and I cannot easily find the right entry without searching manually.
{"id":1,"name":"Bitcoin","symbol":"BTC","slug":"bitcoin","max_supply":21000000,"circulating_supply":18614718,"total_supply":18614718,"last_updated":"2021-01-30T15:00:02.000Z","quote":{"USD":{"price":34177.31601866782,"volume_24h":83208963467.24487,"percent_change_1h":1.15037986,"percent_change_24h":-10.87555443,"percent_change_7d":7.03677315,"percent_change_30d":19.84946991,"market_cap":636201099684.3843,"last_updated":"2021-01-30T15:00:02.000Z"}},"rank":1,"noLazyLoad":true}

I took a closer look at the HTML.
It appears that the JSON string data you seek is inside of a <script> tag with id "__NEXT_DATA__".
I'm not that familiar with BeautifulSoup so a more elegant way may exist to get the data. Here is the code I used.
import json

import requests
from bs4 import BeautifulSoup

cmc = requests.get('https://coinmarketcap.com/')
soup = BeautifulSoup(cmc.content, 'html.parser')
# The page state is embedded as JSON inside <script id="__NEXT_DATA__">
for item in soup.select('script[id="__NEXT_DATA__"]'):
    data = json.loads(item.string)  # load JSON string as a dict
    desired_data = data["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
    print(json.dumps(desired_data, indent=2))  # pretty output string
TRUNCATED OUTPUT:
[
{
"id": 1,
"name": "Bitcoin",
"symbol": "BTC",
"slug": "bitcoin",
"max_supply": 21000000,
"circulating_supply": 18614718,
"total_supply": 18614718,
"last_updated": "2021-01-30T14:51:02.000Z",
"quote": {
"USD": {
"price": 34138.18238095427,
"volume_24h": 83651976977.0413,
"percent_change_1h": 1.36922474,
"percent_change_24h": -9.82670796,
"percent_change_7d": 6.33079444,
"percent_change_30d": 19.72629419,
"market_cap": 635472638054.0323,
"last_updated": "2021-01-30T14:51:02.000Z"
}
},
"rank": 1,
"noLazyLoad": true
},
{
"id": 1027,
"name": "Ethereum",
"symbol": "ETH",
"slug": "ethereum",
"max_supply": null,
"circulating_supply": 114465285.999,
"total_supply": 114465285.999,
"last_updated": "2021-01-30T14:51:02.000Z",
"quote": {
"USD": {
"price": 1364.155096452962,
"volume_24h": 38819994919.48616,
"percent_change_1h": 1.95180621,
"percent_change_24h": -3.86551103,
"percent_change_7d": 10.22893483,
"percent_change_30d": 85.96783538,
"market_cap": 156148403262.48172,
"last_updated": "2021-01-30T14:51:02.000Z"
}
},
"rank": 2,
"noLazyLoad": true
},…
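Once the listing is available as a Python list, one coin's id can be looked up in code instead of by reading the printed output. A minimal sketch, assuming each entry carries the "id" and "symbol" keys shown in the truncated output; the helper name find_coin_id and the sample list are hypothetical stand-ins for the real list:

```python
# Hypothetical sample mirroring the structure of the truncated output;
# in practice this would be the list loaded from the __NEXT_DATA__ script.
listings = [
    {"id": 1, "name": "Bitcoin", "symbol": "BTC", "slug": "bitcoin"},
    {"id": 1027, "name": "Ethereum", "symbol": "ETH", "slug": "ethereum"},
]


def find_coin_id(coins, symbol):
    """Return the id of the first coin whose symbol matches, or None."""
    wanted = symbol.upper()
    for coin in coins:
        if coin.get("symbol", "").upper() == wanted:
            return coin["id"]
    return None


print(find_coin_id(listings, "eth"))   # 1027
print(find_coin_id(listings, "doge"))  # None
```

Matching on "slug" or "name" instead would only require changing the key used in the comparison.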

How can I scrape the "Time" and other data in the advanced details section using Beautiful Soup

Here is the URL I want to scrape data from: 'https://www.blockchain.com/explorer/transactions/btc/43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142'
I tried Beautiful Soup because, unlike Selenium, it doesn't have to open a browser. I started by extracting the outer section
('section',{'class':'sc-f9148dd7-2 irWxzm'})
and then tried to go deeper to the targeted div tag. However, the data extracted from 'section',{'class':'sc-f9148dd7-2 irWxzm'} seems to stop before the advanced details, so I can't reach the desired div tag.
Here is the code I wrote:
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.blockchain.com/explorer/transactions/btc/43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142'
res = requests.get(url)
format = '%Y-%m-%d %H:%M'
soup = BeautifulSoup(res.content, 'html.parser')
soup2 = soup.find('section',{'class':'sc-f9148dd7-2 irWxzm'})
print(soup2)
I tried a lot, but I can't find any tag under 'class':'sc-f9148dd7-2 irWxzm' except class="sc-c907597a-0 MqlNG" and class="sc-c907597a-3 ctQMfW".
Could you please help me find a way to get the data in the advanced details section?
Thank you very much in advance.

The page loads the data from an external URL via JavaScript. To load the data yourself, you can request that URL directly, as in the following example:
from datetime import datetime
import requests
api_url = "https://www.blockchain.com/explorer/api/transaction?asset=btc&id=43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142"
data = requests.get(api_url).json()
print(data)
Prints:
{
"ticker": "btc",
"transaction": {
"txid": "43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142",
"size": 381,
"version": 1,
"locktime": 0,
"fee": 170170,
"inputs": [
{
"coinbase": False,
"txid": "667f2825db6b03e349b5e4be7b4c4c5be266c242a6aaa0218480572ffc5a7b37",
"output": 0,
"sigscript": "47304402204ba063dca925f759777ed8818027c421cb4052ecf2e3b980c814bc528c73638e02206a3d58ec92d0be9915c14d6c4cef40a01d301286c90d82c1bcf166db0e94c3bb012103951bbeb5b73e530b6849fca68e470118f4b379ad9126015caf1355dc2a9e8480",
"sequence": 4294967295,
"pkscript": "76a9149c8ab044348d826b9ae88d698d575a45a6e8fc6988ac",
"value": 207730,
"address": "1FGiZB7K757EUixGcyeyME6Jp8qQZEiUUk",
"witness": [],
},
{
"coinbase": False,
"txid": "3c2dc36fd0bebc46062362aff0c4f307d1c99900c5f358fdd37b436a15d37a5f",
"output": 0,
"sigscript": "4730440220322e489e971b2c651224c2e03bea408df8c67a0a1c18ddfd20e940d90a8e61990220707ba2431bde31500ebe6a2b3c4a7974b87c4b9ee33849e1453c0831318bed14012103951bbeb5b73e530b6849fca68e470118f4b379ad9126015caf1355dc2a9e8480",
"sequence": 4294967295,
"pkscript": "76a9149c8ab044348d826b9ae88d698d575a45a6e8fc6988ac",
"value": 231716,
"address": "1FGiZB7K757EUixGcyeyME6Jp8qQZEiUUk",
"witness": [],
},
],
"outputs": [
{
"address": "1FGiZB7K757EUixGcyeyME6Jp8qQZEiUUk",
"pkscript": "76a9149c8ab044348d826b9ae88d698d575a45a6e8fc6988ac",
"value": 269276,
"spent": True,
"spender": {
"txid": "c7ed715e9f73b2792957af94d3143750525a29f6a62fd6f68d470e56e4bbef7b",
"input": 0,
},
},
{
"address": None,
"pkscript": "6a208627c703aeac41df8acad1c643d9ee9c2370f9cace1af05a0ac41219116b5e0b",
"value": 0,
"spent": False,
"spender": None,
},
],
"block": {"height": 756449, "position": 2},
"deleted": False,
"time": 1664582470,
"rbf": False,
"weight": 1524,
},
"rate": 16526.38,
"latestBlock": 769853,
"id": "43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142",
"description": False,
"fiat": "USD",
"labels": {},
}
To get the time:
print(datetime.fromtimestamp(data["transaction"]["time"]))
Prints:
2022-10-01 02:01:10
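Note that datetime.fromtimestamp without a tz argument converts to the local time zone of the machine running the code, so the printed value can differ from one machine to another. A small sketch that pins the result to UTC, reusing the "time" value from the output above:

```python
from datetime import datetime, timezone

# Unix timestamp taken from data["transaction"]["time"] in the output above.
tx_time = 1664582470

# Passing tz=timezone.utc makes the result independent of the local time zone.
utc_time = datetime.fromtimestamp(tx_time, tz=timezone.utc)
print(utc_time.isoformat())  # 2022-10-01T00:01:10+00:00
```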

Parse complex JSON in Python

EDITED WITH LARGER JSON:
I have the following JSON and I need to get id element: 624ff9f71d847202039ec220
"results": [
{
"id": "62503d2800c0d0004ee4636e",
"name": "2214524",
"settings": {
"dataFetch": "static",
"dataEntities": {
"variables": [
{
"id": "624ffa191d84720202e2ed4a",
"name": "temp1",
"device": {
"id": "624ff9f71d847202039ec220",
"name": "282c0240ea4c",
"label": "282c0240ea4c",
"createdAt": "2022-04-08T09:01:43.547702Z"
},
"chartType": "line",
"aggregationMethod": "last_value"
},
{
"id": "62540816330443111016e38b",
"device": {
"id": "624ff9f71d847202039ec220",
"name": "282c0240ea4c",
},
"chartType": "line",
}
]
}
...
Here is my code (EDITED)
url = "API_URL"
response = urllib.urlopen(url)
data = json.loads(response.read().decode("utf-8"))
print url
all_ids = []
for i in data['results']: # i is a dictionary
    for variable in i['settings']['dataEntities']['variables']:
        print(variable['id'])
        all_ids.append(variable['id'])
But I have the following error:
for variable in i['settings']['dataEntities']['variables']:
KeyError: 'dataEntities'
Could you please help?
Thanks!!

What is it printing when you print(fetc)? If you format the json, it will be easier to read, the current nesting is very hard to comprehend.
fetc is a string, not a dict. If you want the dict, you have to use the key.
Try:
url = "API_URL"
response = urllib.urlopen(url)
data = json.loads(response.read().decode("utf-8"))
print url
for i in data['results']:
    print(json.dumps(i['settings']))
    print(i['settings']['dataEntities'])
EDIT: To get to the id field, you'll need to dive further.
i['settings']['dataEntities']['variables'][0]['id']
So if you want all the ids, you'll have to loop over the variables (assuming the list has more than one entry), and if you want them for all the settings, you'll need to loop over those too.
Full solution for you to try (EDITED after you uploaded the full JSON):
url = "API_URL"
response = urllib.urlopen(url)
data = json.loads(response.read().decode("utf-8"))
print url
all_ids = []
for i in data['results']: # i is a dictionary
    for variable in i['settings']['dataEntities']['variables']:
        print(variable['id'])
        all_ids.append(variable['id'])
        all_ids.append(variable['device']['id'])
Let me know if that works.

The shared JSON is not valid. A valid JSON similar to yours is:
{
"results": [
{
"settings": {
"dataFetch": "static",
"dataEntities": {
"variables": [
{
"id": "624ffa191d84720202e2ed4a",
"name": "temp1",
"span": "inherit",
"color": "#2ccce4",
"device": {
"id": "624ff9f71d847202039ec220"
}
}
]
}
}
}
]
}
To get a list of ids from your JSON you need a nested for loop. A Pythonic way to write that is a list comprehension:
all_ids = [y["device"]["id"] for x in my_json["results"] for y in x["settings"]["dataEntities"]["variables"]]
where my_json is your parsed JSON (a dict).
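The KeyError: 'dataEntities' in the question suggests that at least one entry in results has a settings dict without that key. That is an assumption, since the full response is not shown. Under that assumption, a tolerant version can skip incomplete entries with dict.get; the function name collect_device_ids and the second sample entry are hypothetical:

```python
def collect_device_ids(payload):
    """Collect every variable's device id, skipping entries with missing keys."""
    ids = []
    for result in payload.get("results", []):
        settings = result.get("settings") or {}
        entities = settings.get("dataEntities") or {}
        for variable in entities.get("variables", []):
            device_id = (variable.get("device") or {}).get("id")
            if device_id is not None:
                ids.append(device_id)
    return ids


# Hypothetical payload: the second result has no "dataEntities" key,
# which is the situation that would raise KeyError with plain indexing.
sample = {
    "results": [
        {
            "settings": {
                "dataFetch": "static",
                "dataEntities": {
                    "variables": [
                        {
                            "id": "624ffa191d84720202e2ed4a",
                            "device": {"id": "624ff9f71d847202039ec220"},
                        }
                    ]
                },
            }
        },
        {"settings": {"dataFetch": "static"}},
    ]
}

print(collect_device_ids(sample))  # ['624ff9f71d847202039ec220']
```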

How do I output specific data from a json response?

I am fairly new to using APIs in python and I am trying to create a system that outputs data from previous motorsport races. I have sent requests to an API, but I am struggling to get it to just output one specific piece of data (eg. time, location). I get this when I just print the raw JSON data sent.
{
"MRData": {
"RaceTable": {
"Races": [
{
"Circuit": {
"Location": {
"country": "Spain",
"lat": "41.57",
"locality": "Montmeló",
"long": "2.26111"
},
"circuitId": "catalunya",
"circuitName": "Circuit de Barcelona-Catalunya",
"url": "http://en.wikipedia.org/wiki/Circuit_de_Barcelona-Catalunya"
},
"date": "2020-08-16",
"raceName": "Spanish Grand Prix",
"round": "6",
"season": "2020",
"time": "13:10:00Z",
"url": "https://en.wikipedia.org/wiki/2020_Spanish_Grand_Prix"
}
],
"round": "6",
"season": "2020"
},
"limit": "30",
"offset": "0",
"series": "f1",
"total": "1",
"url": "http://ergast.com/api/f1/2020/6.json",
"xmlns": "http://ergast.com/mrd/1.4"
}
}
Just to get to grips with APIs I am simply trying to output a simple piece of data of a specific race, and once I can do that, I'll be able to scale it up and output all sorts of data. I'd assumed it would just be as simple as typing print(data['time']) (as seen below) but I get an error message saying this:
KeyError: 'time'
My source code:
import requests
response = requests.get("http://ergast.com/api/f1/2020/6.json")
data = response.json()
print (data["time"])
Any help is appreciated!

Like this...
import json
data = """{
"MRData":{
"xmlns":"http://ergast.com/mrd/1.4",
"series":"f1",
"url":"http://ergast.com/api/f1/2020/6.json",
"limit":"30",
"offset":"0",
"total":"1",
"RaceTable":{
"season":"2020",
"round":"6",
"Races":[
{
"season":"2020",
"round":"6",
"url":"https://en.wikipedia.org/wiki/2020_Spanish_Grand_Prix",
"raceName":"Spanish Grand Prix",
"Circuit":{
"circuitId":"catalunya",
"url":"http://en.wikipedia.org/wiki/Circuit_de_Barcelona-Catalunya",
"circuitName":"Circuit de Barcelona-Catalunya",
"Location":{
"lat":"41.57",
"long":"2.26111",
"locality":"Montmeló",
"country":"Spain"
}
},
"date":"2020-08-16",
"time":"13:10:00Z"
}
]
}
}
}"""
jsonData = json.loads(data)
Races is an array; in this case there is only one race, so you would designate it as ["Races"][0]
print(jsonData["MRData"]["RaceTable"]["Races"][0]["time"])

data['time'] would work if you had a flat dictionary, but you have a nested dicts/list structure, so:
data["MRData"]["RaceTable"]["Races"][0]["time"]
data["MRData"] returns another dict, which has a key "RaceTable". The value of this key is again a dictionary which has a key "Races". The value of this is a list of races, of which you only have one. The races are again dicts which have the key time.
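The same walk can be written one level at a time, which makes it easier to see where a dict ends and a list begins. A short sketch using a trimmed copy of the response from the question as inline sample data (the intermediate variable names are only illustrative):

```python
import json

# Trimmed sample with the same nesting as the response in the question.
raw = """
{"MRData": {"RaceTable": {"Races": [
    {"raceName": "Spanish Grand Prix", "date": "2020-08-16", "time": "13:10:00Z",
     "Circuit": {"Location": {"locality": "Montmeló", "country": "Spain"}}}
]}}}
"""
data = json.loads(raw)

race_table = data["MRData"]["RaceTable"]  # dict
races = race_table["Races"]               # list of race dicts
first_race = races[0]                     # dict for the only race

print(first_race["time"])                            # 13:10:00Z
print(first_race["Circuit"]["Location"]["country"])  # Spain

# Looping handles responses that contain more than one race.
for race in races:
    print(race["raceName"], race["date"], race["time"])
```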

How do you parse nested JSON data for specific information?

I'm using the National Weather Service API, which returns JSON data when you request a specific URL. My program so far grabs everything, including 155 hours of weather data.
Simply put, I'm trying to parse the data and grab the weather for the latest hour, but everything is in a nested data structure.
My code, JSON data, and more information are below. Any help is appreciated.
import requests
import json

def get_current_weather():  # This function returns JSON data from the API
    url = 'https://api.weather.gov/gridpoints/*office*/*any number*,*any number*/forecast/hourly'
    response = requests.get(url)
    full_data = response.json()
    return full_data

def main():  # Prints the information grabbed from the API
    print(get_current_weather())

if __name__ == "__main__":
    main()
In the JSON response I get, there are 3 layers before you reach the 'shortForecast' data that I'm trying to get. The first level is 'properties'; everything before it is irrelevant to my program. The second level is 'periods', where each period is a new hour, 0 being the latest. Lastly, I just need to grab the 'shortForecast' in the first period, i.e. periods[0].
{
"@context": [
"https://geojson.org/geojson-ld/geojson-context.jsonld",
{
"@version": "1.1",
"wx": "https://api.weather.gov/ontology#",
"geo": "http://www.opengis.net/ont/geosparql#",
"unit": "http://codes.wmo.int/common/unit/",
"@vocab": "https://api.weather.gov/ontology#"
}
],
"type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[
*data I'm not gonna add*
]
]
},
"properties": {
"updated": "2021-02-11T05:57:24+00:00",
"units": "us",
"forecastGenerator": "HourlyForecastGenerator",
"generatedAt": "2021-02-11T07:12:58+00:00",
"updateTime": "2021-02-11T05:57:24+00:00",
"validTimes": "2021-02-10T23:00:00+00:00/P7DT14H",
"elevation": {
"value": ,
"unitCode": "unit:m"
},
"periods": [
{
"number": 1,
"name": "",
"startTime": "2021-02-11T02:00:00-05:00",
"endTime": "2021-02-11T03:00:00-05:00",
"isDaytime": false,
"temperature": 18,
"temperatureUnit": "F",
"temperatureTrend": null,
"windSpeed": "10 mph",
"windDirection": "N",
"icon": "https://api.weather.gov/icons/land/night/snow,40?size=small",
"shortForecast": "Chance Light Snow",
"detailedForecast": ""
},
{
"number": 2,
"name": "",
"startTime": "2021-02-11T03:00:00-05:00",
"endTime": "2021-02-11T04:00:00-05:00",
"isDaytime": false,
"temperature": 17,
"temperatureUnit": "F",
"temperatureTrend": null,
"windSpeed": "12 mph",
"windDirection": "N",
"icon": "https://api.weather.gov/icons/land/night/snow,40?size=small",
"shortForecast": "Chance Light Snow",
"detailedForecast": ""
},
OK, I didn't want to edit everything again, so here is the new get_current_weather function. I was able to get to 'periods', but after that I'm still stumped.
def get_current_weather():
    url = 'https://api.weather.gov/gridpoints/ILN/82,83/forecast/hourly'
    response = requests.get(url)
    full_data = response.json()
    return full_data['properties'].get('periods')

For the dictionary object, you can access the nested elements by using indexing multiple times.
So, for your dictionary object, you can use the following to get the value for the key shortForecast for the first element in the list of dictionaries under key periods under the key properties in the main dictionary:
full_data['properties']['periods'][0]['shortForecast']
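The indexing above can be wrapped in a small function that also copes with an empty or missing periods list. A sketch under the assumption that the response has the shape shown in the question; the function name latest_short_forecast and the sample dict are hypothetical:

```python
def latest_short_forecast(full_data):
    """Return 'shortForecast' of the first period, or None if there are no periods."""
    periods = full_data.get("properties", {}).get("periods", [])
    if not periods:
        return None
    return periods[0].get("shortForecast")


# Trimmed sample with the same nesting as the response in the question.
sample = {
    "properties": {
        "periods": [
            {"number": 1, "temperature": 18, "shortForecast": "Chance Light Snow"},
            {"number": 2, "temperature": 17, "shortForecast": "Chance Light Snow"},
        ]
    }
}

print(latest_short_forecast(sample))              # Chance Light Snow
print(latest_short_forecast({"properties": {}}))  # None
```

With the question's code, the call would be latest_short_forecast(response.json()).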

Python - How can I scrape JavaScript code with bs4?

I have been trying to scrape a value out of some JavaScript embedded in an HTML page. There is a lot of JavaScript in the page, but I just want to be able to print out this one:
var spConfig=newProduct.Config({
"attributes": {
"531": {
"id": "531",
"options": [
{
"id": "18",
"hunter": "0",
"products": [
"128709"
]
},
{
"label": "40 1\/2",
"hunter": "0",
"products": [
"120151"
]
},
{
"id": "33",
"hunter": "0",
"products": [
"120152"
]
},
{
"id": "36",
"hunter": "0",
"products": [
"128710"
]
},
{
"id": "42",
"hunter": "0",
"products": [
"125490"
]
}
]
}
},
"Id": "120153",
});
So I started with code that looks like this:
test = bs4.find_all('script', {'type': 'text/javascript'})
print(test)
The output I am getting is pretty huge, so I can't post it all here, but one of the scripts is the JavaScript I mentioned at the top, and I want to print out only var spConfig=newProduct.Config.
How can I print out just var spConfig=newProduct.Config...., so that I can later pass it to json.loads and work with it more easily?
If you have any questions or I haven't explained something well, I'd appreciate comments so I can improve my posts here on Stack Overflow too! :)
EDIT:
More example of what bs4 prints out for javascripts
<script type="text/javascript">varoptionsPrice=newProduct.Options({
"priceFormat": {
"pattern": "%s\u00a0\u20ac",
"precision": 2,
"requiredPrecision": 2,
"decimalSymbol": ",",
"groupSymbol": "\u00a0",
"groupLength": 3,
"integerRequired": 1
},
"showBoths": false,
"idSuffix": "_clone",
"skipCalculate": 1,
"defaultTax": 20,
"currentTax": 20,
"tierPrices": [
],
"tierPricesInclTax": [
],
"swatchPrices": null
});</script>,
<script type="text/javascript">var spConfig=newProduct.Config({
"attributes": {
"531": {
"id": "531",
"options": [
{
"id": "18",
"hunter": "0",
"products": [
"128709"
]
},
{
"label": "40 1\/2",
"hunter": "0",
"products": [
"120151"
]
},
{
"id": "33",
"hunter": "0",
"products": [
"120152"
]
},
{
"id": "36",
"hunter": "0",
"products": [
"128710"
]
},
{
"id": "42",
"hunter": "0",
"products": [
"125490"
]
}
]
}
},
"Id": "120153"
});</script>,
<script type="text/javascript">document.observe('dom:loaded',
function(){
varswatchesConfig=newProduct.ConfigurableSwatches(spConfig);
});</script>
EDIT update 2:
try:
    product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
except Exception:
    product_li_tags = []
for product_li_tag in product_li_tags:
    try:
        pat = r"product.Config\((.+)\);"
        json_str = re.search(pat, product_li_tag, flags=re.DOTALL).group(1)
        print(json_str)
    except:
        pass
#json.loads(json_str)
print("Nothing")
sys.exit()

You can use the .text attribute to get the content within each tag. Then, if you know that you want to grab the code that specifically starts with "varoptionsPrice", you can filter for that:
soup = BeautifulSoup(myhtml, 'lxml')
script_blocks = soup.find_all('script', {'type': 'text/javascript'})
special_code = ''
for s in script_blocks:
    # startswith is case-sensitive, so the prefix must match the script text exactly
    if s.text.strip().startswith('varoptionsPrice'):
        special_code = s.text
        break
print(special_code)
EDIT: To answer your question in the comments, there are a couple of different ways of extracting the part of the text that holds the JSON. You could pass it through a regexp to grab everything between the first left parenthesis and the ); at the end. If you want to avoid regexp completely, you could do something like:
json_stuff = special_code[special_code.find('(')+1:special_code.rfind(')')]
Then to make a usable dictionary out of it:
import json
j = json.loads(json_stuff)
print(j['defaultTax']) # This should return a value of 20

I can think of 3 possible options; which one you use might depend on the size of the project and how flexible you need it to be:
1. Use regex to extract the objects from the script (fastest, least flexible).
2. Use ANTLR or similar (e.g. pyjsparser) to parse the JS grammar.
3. Use Selenium or another headless browser that can interpret the JS for you. With this option, you can use Selenium to execute a call that returns the value of the variable.
Regex Example (#1)
>>> import json, re
>>> script_body = """
var x=product.Config({
"key": {"a":1}
});
"""
>>> pat = r"product.Config\((.+)\);"
>>> json_str = re.search(pat, script_body, flags=re.DOTALL).group(1)
>>> json.loads(json_str)
{'key': {'a': 1}}
>>> json.loads(json_str)['key']['a']
1
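The greedy (.+) in the pattern above runs to the last ); in the text, so it can capture too much when a script contains several calls. One alternative, which is a different technique from the regex rather than a variant of it, is to locate the opening brace after a marker and let json.JSONDecoder.raw_decode parse exactly one JSON value from there. A sketch, where the helper name extract_json_after and the second call in the sample script are hypothetical:

```python
import json


def extract_json_after(script_text, marker):
    """Decode the first JSON object that follows `marker` in `script_text`."""
    marker_pos = script_text.find(marker)
    if marker_pos == -1:
        return None
    brace_pos = script_text.find("{", marker_pos)
    if brace_pos == -1:
        return None
    # raw_decode parses one JSON value starting at brace_pos and
    # ignores whatever follows it in the string.
    obj, _end = json.JSONDecoder().raw_decode(script_text, brace_pos)
    return obj


script_body = """
var x=product.Config({
"key": {"a":1}
});
var y=somethingElse({"other": true});
"""

config = extract_json_after(script_body, "product.Config(")
print(config)              # {'key': {'a': 1}}
print(config["key"]["a"])  # 1
```

This only works when the argument is strict JSON; a JavaScript object literal with trailing commas or unquoted keys would still raise json.JSONDecodeError.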
