Python - How can I scrape JavaScript code with bs4? - python

I have been trying to scrape a value out of an HTML page, where the value sits inside a JavaScript block. There is a lot of JavaScript in the page, but I only want to be able to print this one:
var spConfig = new Product.Config({
"attributes": {
"531": {
"id": "531",
"options": [
{
"id": "18",
"hunter": "0",
"products": [
"128709"
]
},
{
"label": "40 1\/2",
"hunter": "0",
"products": [
"120151"
]
},
{
"id": "33",
"hunter": "0",
"products": [
"120152"
]
},
{
"id": "36",
"hunter": "0",
"products": [
"128710"
]
},
{
"id": "42",
"hunter": "0",
"products": [
"125490"
]
}
]
}
},
"Id": "120153"
});
So I started with code that looks like this:
test = bs4.find_all('script', {'type': 'text/javascript'})
print(test)
The output is too large to post in full, but one of the script blocks is the JavaScript shown above, and I want to print only var spConfig = new Product.Config(...).
How can I print just that block, so that I can later pass it to json.loads and work with it more easily?
If anything is unclear, please let me know in the comments. I appreciate any feedback on how to improve my questions here on Stack Overflow as well! :)
EDIT:
A longer example of what bs4 prints for the script tags:
<script type="text/javascript">var optionsPrice = new Product.Options({
"priceFormat": {
"pattern": "%s\u00a0\u20ac",
"precision": 2,
"requiredPrecision": 2,
"decimalSymbol": ",",
"groupSymbol": "\u00a0",
"groupLength": 3,
"integerRequired": 1
},
"showBoths": false,
"idSuffix": "_clone",
"skipCalculate": 1,
"defaultTax": 20,
"currentTax": 20,
"tierPrices": [
],
"tierPricesInclTax": [
],
"swatchPrices": null
});</script>,
<script type="text/javascript">var spConfig = new Product.Config({
"attributes": {
"531": {
"id": "531",
"options": [
{
"id": "18",
"hunter": "0",
"products": [
"128709"
]
},
{
"label": "40 1\/2",
"hunter": "0",
"products": [
"120151"
]
},
{
"id": "33",
"hunter": "0",
"products": [
"120152"
]
},
{
"id": "36",
"hunter": "0",
"products": [
"128710"
]
},
{
"id": "42",
"hunter": "0",
"products": [
"125490"
]
}
]
}
},
"Id": "120153"
});</script>,
<script type="text/javascript">document.observe('dom:loaded',
function(){
var swatchesConfig = new Product.ConfigurableSwatches(spConfig);
});</script>
EDIT update 2:
try:
    product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
except Exception:
    product_li_tags = []
for product_li_tag in product_li_tags:
    try:
        pat = "product.Config\((.+)\);"
        json_str = re.search(pat, product_li_tag, flags=re.DOTALL).group(1)
        print(json_str)
    except:
        pass
    #json.loads(json_str)
print("Nothing")
sys.exit()

You can use the .text attribute to get the content of each tag. Then, if you know that the code you want starts with "var optionsPrice", you can filter for that:
from bs4 import BeautifulSoup

soup = BeautifulSoup(myhtml, 'lxml')
script_blocks = soup.find_all('script', {'type': 'text/javascript'})
special_code = ''
for s in script_blocks:
    # The prefix must match exactly, including case and spacing
    if s.text.strip().startswith('var optionsPrice'):
        special_code = s.text
        break
print(special_code)
EDIT: To answer your question in the comments, there are a couple of ways to extract the part of the text that holds the JSON. You could pass it through a regexp to grab everything between the first left parenthesis and the ); at the end. If you want to avoid regexps completely, you could do something like:
json_stuff = special_code[special_code.find('(')+1:special_code.rfind(')')]
Then to make a usable dictionary out of it:
import json
j = json.loads(json_stuff)
print(j['defaultTax']) # This should return a value of 20

I can think of 3 possible options; which one you use might depend on the size of the project and how flexible you need it to be:
1. Use a regex to extract the objects from the script (fastest, least flexible).
2. Use ANTLR or similar (e.g. pyjsparser) to parse the JS grammar.
3. Use Selenium or a headless browser that can interpret the JS for you. With this option, you can have Selenium execute a call that returns the value of the variable.
Regex Example (#1)
>>> import json, re
>>> script_body = """
... var x=product.Config({
... "key": {"a":1}
... });
... """
>>> pat = r"product\.Config\((.+)\);"
>>> json_str = re.search(pat, script_body, flags=re.DOTALL).group(1)
>>> json.loads(json_str)
{'key': {'a': 1}}
>>> json.loads(json_str)['key']['a']
1
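If the regex turns out to be fragile, a possible alternative is to let the json module find the end of the object for you. The sketch below is only an illustration under a few assumptions: the argument passed to the constructor is strict JSON, the script text mentions the variable name before the object, and extract_js_object and the scripts list are hypothetical names standing in for the .text of each script tag.

```python
import json


def extract_js_object(script_text, marker):
    """Return the first JSON object that follows `marker`, or None."""
    start = script_text.find(marker)
    if start == -1:
        return None
    brace = script_text.find("{", start)
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `brace` and ignores
        # whatever comes after it (here the closing ");").
        obj, _end = json.JSONDecoder().raw_decode(script_text, brace)
    except json.JSONDecodeError:
        # The brace belonged to ordinary JavaScript, not to a JSON object.
        return None
    return obj


# Hypothetical script bodies, standing in for tag.text of each <script> tag.
scripts = [
    'var optionsPrice = new Product.Options({"defaultTax": 20});',
    'var spConfig = new Product.Config({"attributes": {"531": {"id": "531"}}, "Id": "120153"});',
]

sp_config = None
for text in scripts:
    sp_config = extract_js_object(text, "spConfig")
    if sp_config is not None:
        break

print(sp_config["Id"])
```

Searching by the variable name also sidesteps the fact that regexes are case-sensitive, so a pattern written for product.Config would not match Product.Config.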

Related

How to get a value from JSON

This is the first time I'm working with JSON, and I'm trying to pull url out of the JSON below.
{
"name": "The_New11d112a_Company_Name",
"sections": [
{
"name": "Products",
"payload": [
{
"id": 1,
"name": "TERi Geriatric Patient Skills Trainer,
"type": "string"
}
]
},
{
"name": "Contact Info",
"payload": [
{
"id": 1,
"name": "contacts",
"url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
"contacts": [
{
"name": "User",
"email": "Company Email",
"phone": "Company PhoneNumber"
}
],
"type": "contact"
}
]
}
],
"tags": [
"Male",
"Airway"
],
"_id": "0e4cd5c6-4d2f-48b9-acf2-5aa75ade36e1"
}
I have been able to access description and _id via
data = json.loads(line)
if 'xpath' in data:
    xpath = data["_id"]
    description = data["sections"][0]["payload"][0]["description"]
However, I can't figure out a way to access url. Another issue is that there could be other items in sections, which makes indexing into Contact Info a non-starter.
Hope this helps:
import json

with open("test.json", "r") as f:
    json_out = json.load(f)

for i in json_out["sections"]:
    for j in i["payload"]:
        for key in j:
            if "url" in key:
                print(key, '->', j[key])
I think your JSON is damaged; it should look like this:
{
"name": "The_New11d112a_Company_Name",
"sections": [
{
"name": "Products",
"payload": [
{
"id": 1,
"name": "TERi Geriatric Patient Skills Trainer",
"type": "string"
}
]
},
{
"name": "Contact Info",
"payload": [
{
"id": 1,
"name": "contacts",
"url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
"contacts": [
{
"name": "User",
"email": "Company Email",
"phone": "Company PhoneNumber"
}
],
"type": "contact"
}
]
}
],
"tags": [
"Male",
"Airway"
],
"_id": "0e4cd5c6-4d2f-48b9-acf2-5aa75ade36e1"
}
You can check it at http://json.parser.online.fr/.
And if you want to get the value of the url:
import json

with open('yourJSONfile.json') as f:
    j = json.load(f)
print(j['sections'][1]['payload'][0]['url'])
I think it's worth writing a short function that collects the url(s), so you can decide whether to use the first url in the returned list or to skip processing when no url is available in your data.
The function could look like this:
def extract_urls(data):
    payloads = []
    for section in data['sections']:
        # Treat a missing or null payload as an empty list
        payloads += section.get('payload') or []
    urls = [x['url'] for x in payloads if 'url' in x]
    return urls
This should print out the URL:
import json

# open json file to read
with open('test.json', 'r') as f:
    # parse the file contents as JSON
    data = json.loads(f.read())

# after observing the format of the JSON data, the location of the url key
# is determined and data is indexed to extract the value
print(data['sections'][1]['payload'][0]['url'])
The exact location of the 'url' key:
Position 1 of the array that is the value of the key 'sections'
That element is a dict, and its key 'payload' contains an array
Position 0 of that array is a dict with a key 'url'
While testing my solution, I noticed that the JSON provided is flawed. After fixing the flaws (3), I ended up with this:
{
"name": "The_New11d112a_Company_Name",
"sections": [
{
"name": "Products",
"payload": [
{
"id": 1,
"name": "TERi Geriatric Patient Skills Trainer",
"type": "string"
}
]
},
{
"name": "Contact Info",
"payload": [
{
"id": 1,
"name": "contacts",
"url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
"contacts": [
{
"name": "User",
"email": "Company Email",
"phone": "Company PhoneNumber"
}
],
"type": "contact"
}
]
}
],
"tags": [
"Male",
"Airway"
],
"_id": "0e4cd5c6-4d2f-48b9-acf2-5aa75ade36e1"
}
Using the JSON provided by Vincent55, I wrote working code with exception handling, under certain assumptions.
Working code:
## Assuming that the target data is always under sections[i].payload
from json import loads

with open("data.json") as f:
    data = loads(f.read())["sections"]

for x in data:
    try:
        # Assumes there is only one payload per section
        if x["payload"][0]["url"]:
            print(x["payload"][0]["url"])
    except (KeyError, IndexError):
        # Missing payload, empty payload, or no url in this section
        pass
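Since the question mentions that the position of Contact Info inside sections is not fixed, another option is to search the whole structure for the key instead of indexing into it. This is only a sketch: find_values is a hypothetical helper name, and the data dict is a trimmed-down copy of the corrected JSON from the answers.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may appear deeper down as well.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Trimmed-down version of the corrected JSON, already parsed into Python.
data = {
    "name": "The_New11d112a_Company_Name",
    "sections": [
        {"name": "Products", "payload": [{"id": 1, "type": "string"}]},
        {
            "name": "Contact Info",
            "payload": [
                {
                    "id": 1,
                    "name": "contacts",
                    "url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
                }
            ],
        },
    ],
}

urls = list(find_values(data, "url"))
print(urls)
```

The result is a list, so an empty list means no url was present and more than one entry means you have to decide which one to use.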

How to flatten JSON response from Surveymonkey API

I'm setting up a Python function that uses the SurveyMonkey API to fetch survey responses.
The API returns responses as deeply nested JSON.
I'm having trouble flattening this JSON so that it can go into Google Cloud Storage.
I have tried to flatten the response (the JSON below) with the json_normalize call that follows it. It works; however, it does not produce the format that I am looking for.
{
"per_page": 2,
"total": 1,
"data": [
{
"total_time": 0,
"collection_mode": "default",
"href": "https://api.surveymonkey.com/v3/responses/5007154325",
"custom_variables": {
"custvar_1": "one",
"custvar_2": "two"
},
"custom_value": "custom identifier for the response",
"edit_url": "https://www.surveymonkey.com/r/",
"analyze_url": "https://www.surveymonkey.com/analyze/browse/",
"ip_address": "",
"pages": [
{
"id": "73527947",
"questions": [
{
"id": "273237811",
"answers": [
{
"choice_id": "1842351148"
},
{
"text": "I might be text or null",
"other_id": "1842351149"
}
]
},
{
"id": "273240822",
"answers": [
{
"choice_id": "1863145815",
"row_id": "1863145806"
},
{
"text": "I might be text or null",
"other_id": "1863145817"
}
]
},
{
"id": "273239576",
"answers": [
{
"choice_id": "1863156702",
"row_id": "1863156701"
},
{
"text": "I might be text or null",
"other_id": "1863156707"
}
]
},
{
"id": "296944423",
"answers": [
{
"text": "I might be text or null"
}
]
}
]
}
],
"date_modified": "1970-01-17T19:07:34+00:00",
"response_status": "completed",
"id": "5007154325",
"collector_id": "50253586",
"recipient_id": "0",
"date_created": "1970-01-17T19:07:34+00:00",
"survey_id": "105723396"
}
],
"page": 1,
"links": {
"self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
}
}
from pandas import json_normalize  # available at top level in pandas >= 1.0

answers_df = json_normalize(data=response_json['data'],
                            record_path=['pages', 'questions', 'answers'],
                            meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])
Instead of returning a row for each question id, I need it to return a column for each question id, choice_id, and text field, so that there is only one row per survey_id (or per index in data).
The columns I would like to see are total_time, collection_mode, href, custom_variables.custvar_1, custom_variables.custvar_2, custom_value, edit_url, analyze_url, ip_address, pages.id, pages.questions.0.id, pages.questions.0.answers.0.choice_id, pages.questions.0.answers.0.text, pages.questions.0.answers.0.other_id
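One way to get a single row per response might be to flatten each entry of data recursively, building dotted column names and using list positions as path segments. The sketch below is an assumption about what is wanted, not a SurveyMonkey or pandas feature: flatten is a hypothetical helper, the sample is a trimmed copy of the response above, and every list level gets an index (so the page column comes out as pages.0.id rather than pages.id).

```python
def flatten(node, prefix=""):
    """Flatten nested dicts/lists into one dict with dotted keys."""
    flat = {}
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        # Scalar leaf: store it under the path built so far.
        flat[prefix] = node
        return flat
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        # Empty dicts and lists contribute no columns.
        flat.update(flatten(value, path))
    return flat


# Trimmed copy of the response shown in the question.
response_json = {
    "data": [
        {
            "id": "5007154325",
            "custom_variables": {"custvar_1": "one", "custvar_2": "two"},
            "pages": [
                {
                    "id": "73527947",
                    "questions": [
                        {
                            "id": "273237811",
                            "answers": [
                                {"choice_id": "1842351148"},
                                {"text": "I might be text or null", "other_id": "1842351149"},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

# One flat dict (one future row) per response.
rows = [flatten(response) for response in response_json["data"]]
print(sorted(rows[0]))
```

A list of flat dicts like rows can then be passed to pandas.DataFrame, which creates one column per distinct key and leaves missing values empty.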

How can I fix this regex to match an object in JSON and replace it with a list of objects

I have tried the following regex, but it fails to match the objects in the JSON:
:\s*(\{[^\"]*\})
I want to know how to replace each object-typed value in the JSON with a list containing that object.
Here is a sample of the JSON:
{
"resourceType": "ChargeItem",
"id": "example",
"text": {
"status": "generated",
"session": "Done"
},
"identifier": [
{
"system": "http://myHospital.org/ChargeItems",
"value": "654321"
}
],
"definitionUri": [
"http://www.kbv.de/tools/ebm/html/01520_2904360860826220813632.html"
],
"status": "billable",
"code": {
"coding": [
{
"code": "01510",
"display": "Zusatzpauschale für Beobachtung nach diagnostischer Koronarangiografie"
}
]
}
}
I want to convert it to this form:
{
"resourceType": "ChargeItem",
"id": "example",
"text": [{
"status": "generated",
"session": "Done"
}],
"identifier": [
{
"system": "http://myHospital.org/ChargeItems",
"value": "654321"
}
],
"definitionUri": [
"http://www.kbv.de/tools/ebm/html/01520_2904360860826220813632.html"
],
"status": "billable",
"code": [{
"coding": [
{
"code": "01510",
"display": "Zusatzpauschale für Beobachtung nach diagnostischer Koronarangiografie"
}
]
}]
}
This appears to be a few simple transformations:
First, change
"text": {
to
"text": [{
Second, change
},
"identifier": [
to
}],
"identifier": [
Third, change
"code": {
to
"code": [{
And finally, change
}
}
<EOF>
to
}]
}
<EOF>
However, it might not be as straightforward as it appears, e.g. what if the identifier section isn't always present, or doesn't immediately follow the text section?
Regular expressions are a poor choice for doing this work. It would be much better to read the json file into a native Python data structure, apply your desired changes, then save the json back to the file.
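A minimal sketch of that load, change, and save approach, assuming only top-level object values need to become one-element lists (as in the desired output, where text and code are wrapped but the entries inside identifier are not). wrap_objects is a hypothetical helper name and source is a shortened copy of the sample.

```python
import json


def wrap_objects(resource):
    """Return a copy where every top-level dict value is wrapped in a list."""
    return {
        key: [value] if isinstance(value, dict) else value
        for key, value in resource.items()
    }


# Shortened copy of the sample JSON from the question.
source = """{
  "resourceType": "ChargeItem",
  "id": "example",
  "text": {"status": "generated", "session": "Done"},
  "identifier": [{"system": "http://myHospital.org/ChargeItems", "value": "654321"}],
  "status": "billable",
  "code": {"coding": [{"code": "01510"}]}
}"""

converted = wrap_objects(json.loads(source))
# ensure_ascii=False keeps characters such as "ü" readable in the output.
print(json.dumps(converted, indent=2, ensure_ascii=False))
```

Because the change is made on the parsed structure, it does not matter in which order the keys appear or whether identifier is present at all.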
Solution using a multiline regexp search
>>> import re
>>> blocks = re.compile(r'(?ms)(.*)("text": )([{][^{}]+[}])(,.*"status": "billable"[^"]+)("code": )([{][^"]+"coding":[^]]+\]\s+\})')
>>> m = blocks.search(s)
>>> result = ""
>>> for i in range(1,len(m.groups()) + 1):
...     if i not in (3,6):
...         result += m.group(i)
...     else:
...         result += "[" + m.group(i) + "]"
...
>>> result += "\n}"

How to convert special characters to normal text when exporting to file in Python?

I have some input data from a website, that I have gathered using BeautifulSoup.
After I have collected the relevant information from the site, I want to export it to JSON.
This is what some of my output data looks like:
[
{
"time": "30\/3",
"tag": "I\u00c3\u00b8"
},
{
"time": "12\/4",
"tag": "Da"
}
]
It should be:
[
{
"time": "30/3",
"tag": "Iø"
},
{
"time": "12/4",
"tag": "Da"
}
]
Why does it look like that and how do I fix it?
I don't know about the code around it, but this happens because json.dumps escapes every non-ASCII character by default (ensure_ascii=True), so special characters come out as \u sequences.
To keep special characters readable with json, you can set ensure_ascii to False:
import json
a = [
{
"time": "30/3",
"tag": "Iø"
},
{
"time": "12/4",
"tag": "Da"
}
]
print(json.dumps(a, ensure_ascii=False, indent=4))
output:
[
{
"time": "30/3",
"tag": "Iø"
},
{
"time": "12/4",
"tag": "Da"
}
]
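The issue is that the slashes and non-ASCII characters are being escaped. One way to undo that is to load the text with the json library, like so:
>>> import json
>>> s = r"""[
... {
... "time": "30\/3",
... "tag": "I\u00c3\u00b8"
... },
... {
... "time": "12\/4",
... "tag": "Da"
... }
... ]"""
>>> json.loads(s)
[{'time': '30/3', 'tag': 'IÃ¸'}, {'time': '12/4', 'tag': 'Da'}]
Note that the tag comes back as 'IÃ¸' rather than 'Iø': the escapes \u00c3 and \u00b8 encode two separate characters, so the text was already mis-decoded before it was dumped.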
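If the loaded tag shows up as two characters (Ã followed by ¸) instead of ø, the text was most likely UTF-8 that got decoded as Latin-1 somewhere upstream. Here is a minimal sketch of a repair, assuming that is what happened; the repair helper and the raw sample string are illustrative and not part of the original post.

```python
import json

# The escaped output from the question, kept raw so json does the unescaping.
raw = r'[{"time": "30\/3", "tag": "I\u00c3\u00b8"}, {"time": "12\/4", "tag": "Da"}]'


def repair(text):
    """Undo UTF-8-read-as-Latin-1 mojibake; leave other text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (or already correct), so keep it as is.
        return text


records = json.loads(raw)
fixed = [{key: repair(value) for key, value in rec.items()} for rec in records]

# ensure_ascii=False writes the repaired characters without \u escapes.
print(json.dumps(fixed, ensure_ascii=False, indent=4))
```

The better long-term fix is to find where the page is decoded with the wrong encoding during scraping, since a repair step like this only guesses at what went wrong.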

Connecting many json files to one

I get many JSON strings from a MySQL DB and need to combine them.
For example:
{
"type": "device",
"name": "Lampe",
"controls": [
{
"type": "switch",
"name": "Betrieb",
"topic": "/lampe/schalter"
}
]
}
Combined, these devices should go into an array in a single JSON file:
{
"name": "Test-System",
"devices": [
{
"type": "device",
"name": "Lampe",
"controls": [
{
"type": "switch",
"name": "Betrieb",
"topic": "/lampe/schalter"
}
]
},
{
other Device
}
]
}
I do not understand how to do this in Python.
Does someone have an idea how to do it?
The json module can be used.
#!/usr/bin/env python3.5
import json
# Parse each device JSON file.
device1 = json.load(open("device-switch-Lampe.json"))
device2 = json.load(open("device-sensor-Wert.json"))
# more devices ...
obj = {"name": "Test-System", "devices": [device1, device2]}
print(json.dumps(obj))
Output (prettified):
{
"devices": [{
"type": "device",
"controls": [{
"type": "switch",
"topic": "/lampe/schalter",
"name": "Betrieb"
}],
"name": "Lampe"
}, {
"type": "device",
"controls": [{
"type": "sensor",
"topic": "/sensor/wert",
"name": "Wert"
}],
"name": "Sensor"
}],
"name": "Test-System"
}
There are two ways you could do this - by working on strings, or by working with Python-JSON data structures. The former would be something like
# untested code
s = '''{
"name": "Test-System",
"devices": [ '''
while True:
    j = get_json_from_DB()
    if not j: break # null string or None
    s = s + j + ',\n'
s = s[:-2] + ']\n}\n' # [:-2] drops the last ',\n' from the loop
Or if you want to work with Python loaded-JSON then
import json
# untested code
s = {
"name": "Test-System",
"devices": []
}
while True:
    j = get_json_from_DB()
    if not j: break # null string or None
    s['devices'].append( json.loads(j) )
# str = json.dumps(s) # ought to be valid
The latter approach will validate all your incoming JSON strings (json.loads() will throw an exception for any bad JSON) and will be more efficient for large numbers of devices. It's therefore to be preferred unless you are working in a RAM-constrained embedded system with small numbers of devices, where the greater memory footprint of the latter is a problem.
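A runnable sketch of the second approach, with the database call replaced by a plain list so it can be tried without MySQL. The rows list and the combine_devices helper are hypothetical; the third row is deliberately broken to show how invalid JSON can be set aside rather than aborting the whole run.

```python
import json

# Hypothetical rows, standing in for the JSON strings read from the database.
rows = [
    '{"type": "device", "name": "Lampe", "controls": '
    '[{"type": "switch", "name": "Betrieb", "topic": "/lampe/schalter"}]}',
    '{"type": "device", "name": "Sensor", "controls": '
    '[{"type": "sensor", "name": "Wert", "topic": "/sensor/wert"}]}',
    '{"type": "device", "name": broken}',
]


def combine_devices(system_name, json_strings):
    """Parse each string and collect the devices under one system object."""
    devices, rejected = [], []
    for text in json_strings:
        try:
            devices.append(json.loads(text))
        except json.JSONDecodeError:
            # Keep the bad input so it can be logged or inspected later.
            rejected.append(text)
    return {"name": system_name, "devices": devices}, rejected


system, rejected = combine_devices("Test-System", rows)
print(json.dumps(system, indent=2))
```

Returning the rejected strings separately is a design choice for this sketch; raising on the first bad row would be just as reasonable if every device is required.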
