I am trying to extract structured data within a json statement inside a html page. Therefore I retrieved the html and got the json via xpath:
json.loads(response.xpath('//*[#id="product"]/script[2]/text()').extract_first())
The data starts like this:
response.xpath('//*[#id="product"]/script[2]/text()').extract_first()
"\r\ndataLayer.push({\r\n\t'event': 'EECproductDetailView',\r\n\t'ecommerce': {\r\n\t\t'detail': {\r\n\r\n\t\t\t'products': [{\r\n\t\t\t\t'id': '14171171',\r\n\t\t\t\t'name': 'Gingium 120mg',\r\n\t\t\t\t'price': '27.9',\r\n\r\n\t\t\t\t'brand': 'Hexal AG',\r\n\r\n\r\n\t\t\t\t'variant': 'Filmtabletten, 60 Stück, N2',\r\n\r\n\r\n\t\t\t\t'category': 'gedaechtnis-konzentration'\r\n\t\t\t}]\r\n\t\t}\r\n\t}\r\n});\r\n"
Sample structured json:
<script>
dataLayer.push({
'event': 'EECproductDetailView',
'ecommerce': {
'detail': {
'products': [{
'id': '14122171',
'name': 'test',
'price': '27.9'
}]
}
}
});
</script>
The error message is:
>>> json.loads(response.xpath('//*[#id="product"]/script[2]/text()').extract_first())
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)
I also tried to decode:
>>> json.loads(response.xpath('//*[#id="product"]/script[2]/text()').extract_first().decode("utf-8"))
Traceback (most recent call last):
File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>>
How can I pull the product data into a python dictionary?
Many issues exist in your approach that I will discuss them below. You want to parse the value passed to push function as json and you have this as input:
dataLayer.push({
'event': 'EECproductDetailView',
'ecommerce': {
'detail': {
'products': [{
'id': '14122171',
'name': 'test',
'price': '27.9'
}]
}
}
});
Issues:
This data is raw. You shouldn't pass it directly to json.loads, to resolve this try to grab {'event' .... } from your string via regex or some string interpolation. For example if your data format is always like this and other javascripts are not defined in scope via {} then grab the index of first { and last } and do substring to get the main data.
This data contains ' as string indicators, but json standard use double quotes ". You should take care of replacing them as well.
After resolving issues you can use json.loads to parse your input.
Related
I am trying to convert event json from aws lambda function to python dictionary to that I can the event type and cluster identifier but I am getting an error which i am not able to resolve
Below is my code and error
import json
st = """
{
"Records":[
{
"EventSource":"aws:sns",
"EventVersion":"1.0",
"EventSubscriptionArn":"arn:aws:sns:us-east-2:427128480143:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e2ddc5e3c",
"Sns":{
"Type":"Notification",
"MessageId":"fecb7b39-a861-5450-9179-23d0ce68f268",
"TopicArn":"arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic",
"Subject":"AmazonRedshiftINFO - Cluster Created",
"Message":"{\"Event Source\":\"cluster\",\"Resource\":\"qa-redshift-cluster\",\"Event Time\":\"2021-04-08 20:12:57.466\",\"Identifier Link\":\"https://console.aws.amazon.com/redshift/home?region=us-east-2#cluster-details:cluster=qa-redshift-cluster \",\"Severity\":\"INFO\",\"Category\":[\"Management\"],\"About this Event\":\"http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#REDSHIFT-EVENT-2000 \",\"Event Message\":\"Amazon Redshift cluster \\'qa-redshift-cluster\\' has been created at 2021-04-08 20:12 UTC and is ready for use.\"}",
"Timestamp":"2021-04-08T20:12:57.905Z",
"SignatureVersion":"1",
"Signature":"jS2AK8hf/rebeXMsfFw0DJx+788w+RiDTXyAbLNZzYdE5Mlhi6GRRIns8VaAJZc5otXkkhGshvgvuE0JsUiOhXBccGEJY6Z+lhx6cLSQoEdQB0DRfwltWKOfpQ88LZXJuKCxYcsPL3y5veBdjJqwjdBZtcVYV9BYRhQvT6q/0MpDcJeOYzMkdpR2ULX5B7GasZ3AV0WbKxIjjzFSd3GNud2m85obRrB//NHQwqN6ydvg7PaN/sXuyJjmguRok27O6YNkIqzKjC4JxDbTl4BwyIbck59edbHC5kxmfpoc/RqjF/kUGaqnf0HYOUuMfDT85+9wYVz8vFFER1v3NnKRLA==",
"SigningCertUrl":"https://sns.us-east-2.amazonaws.com/SimpleNotificationService-010a507c1833636cd94bdd98bd93083a.pem",
"UnsubscribeUrl":"https://sns.us-east-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e7ddc5e3c",
"MessageAttributes":{
}
}
}
]
}
"""
data1 = json.loads(st)
print(type(data1))
print(data1)
data1 = json.loads(st)
print(type(data1))
print(data1)
{'Records': [{'EventSource': 'aws:sns', 'EventVersion': '1.0', 'EventSubscriptionArn': 'arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e1ddc5e3c', 'Sns': {'Type': 'Notification', 'MessageId': 'fecb7b39-a861-5450-9189-23d0ce68f268', 'TopicArn': 'arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic', 'Subject': '[Amazon Redshift INFO] - Cluster Created', 'Message': '{"Event Source":"cluster","Resource":"qa-redshift-cluster","Event Time":"2021-04-08 20:12:57.466","Identifier Link":"https://console.aws.amazon.com/redshift/home?region=us-east-2#cluster-details:cluster=qa-redshift-cluster ","Severity":"INFO","Category":["Management"],"About this Event":"http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#REDSHIFT-EVENT-2000 ","Event Message":"Amazon Redshift cluster \'qa-redshift-cluster\' has been created at 2021-04-08 20:12 UTC and is ready for use."}', 'Timestamp': '2021-04-08T20:12:57.905Z', 'SignatureVersion': '1', 'Signature': 'jS2AK8hf/rebeXMsfFw0DJx+788w+RiDTXyAbLNZzYdE5Mlhi6GRRIns8VaAJZc5otXkkhGshvgvuE0JsUiOhXBccGEJY6Z+lhx6cLSQoEdQB0DRfwltWKOfpQ88LZXJuKCxYcsPL3y5veBdjJqwjdBZtcVYV9BYRhQvT6q/0MpDcJeOYzMkdpR2ULX5B7GasZ3AV0WbKxIjjzFSd3GNud2m85obRrB//NHQwqN6ydvg7SaN/sXuyJjmguRok27O6YNkIqzKjC4JxDbTl4BwyIbck59edbHC5kxmfpoc/RqjF/kUGaqnf0HYOUuMfDT85+9wYVz8vFFER1v3NnKRLA==', 'SigningCertUrl': 'https://sns.us-east-2.amazonaws.com/SimpleNotificationService-010a507c1833636cd94bdb98bd93083a.pem', 'UnsubscribeUrl': 'https://sns.us-east-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e1ddc5e3c', 'MessageAttributes': {}}}]}
Error
Traceback (most recent call last):
File "/home/deepak/PycharmProjects/TerraformSeriousProject/boto3Examples/src/test.py", line 29, in <module>
data1 = json.loads(st)
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 13 column 26 (char 520)
I used https://jsonformatter.curiousconcept.com/# tool to format the json
Loading like this in Python you will need to double the backslashes on the Message line. The slash that you have there currently will be used to create the string.
You can see this by doing a print of st
However the better way to do this would be to not have st as a string but a dictionary (i.e. remove the triple quotes), then you don't need to mess around with the json library at all here
st = {
"Records":[
{
"EventSource":"aws:sns",
"EventVersion":"1.0",
"EventSubscriptionArn":"arn:aws:sns:us-east-2:427128480143:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e2ddc5e3c",
"Sns":{
"Type":"Notification",
"MessageId":"fecb7b39-a861-5450-9179-23d0ce68f268",
"TopicArn":"arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic",
"Subject":"AmazonRedshiftINFO - Cluster Created",
"Message":"{\"Event Source\":\"cluster\",\"Resource\":\"qa-redshift-cluster\",\"Event Time\":\"2021-04-08 20:12:57.466\",\"Identifier Link\":\"https://console.aws.amazon.com/redshift/home?region=us-east-2#cluster-details:cluster=qa-redshift-cluster
\",\"Severity\":\"INFO\",\"Category\":[\"Management\"],\"About this Event\":\"http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#REDSHIFT-EVENT-2000 \",\"Event Message\":\"Amazon Redshift cluster \\'qa-redshift-cluster\\' ha
s been created at 2021-04-08 20:12 UTC and is ready for use.\"}",
"Timestamp":"2021-04-08T20:12:57.905Z",
"SignatureVersion":"1",
"Signature":"jS2AK8hf/rebeXMsfFw0DJx+788w+RiDTXyAbLNZzYdE5Mlhi6GRRIns8VaAJZc5otXkkhGshvgvuE0JsUiOhXBccGEJY6Z+lhx6cLSQoEdQB0DRfwltWKOfpQ88LZXJuKCxYcsPL3y5veBdjJqwjdBZtcVYV9BYRhQvT6q/0MpDcJeOYzMkdpR2ULX5B7GasZ3AV0WbKxIjjzFSd3GNud2m85obRrB//NHQwqN6ydv
g7PaN/sXuyJjmguRok27O6YNkIqzKjC4JxDbTl4BwyIbck59edbHC5kxmfpoc/RqjF/kUGaqnf0HYOUuMfDT85+9wYVz8vFFER1v3NnKRLA==",
"SigningCertUrl":"https://sns.us-east-2.amazonaws.com/SimpleNotificationService-010a507c1833636cd94bdd98bd93083a.pem",
"UnsubscribeUrl":"https://sns.us-east-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e7ddc5e3c",
"MessageAttributes":{
}
}
}
]
}
print(st['Records'][0]['Sns']['Message'])
{"Event Source":"cluster","Resource":"qa-redshift-cluster","Event Time":"2021-04-08 20:12:57.466","Identifier Link":"https://console.aws.amazon.com/redshift/home?region=us-east-2#cluster-details:cluster=qa-redshift-cluster ","Severity":"INFO","Category":["Management"],"About this Event":"http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#REDSHIFT-EVENT-2000 ","Event Message":"Amazon Redshift cluster \'qa-redshift-cluster\' has been created at 2021-04-08 20:12 UTC and is ready for use."}
You have to escape double quotes in python with two backslashes, like this \\".
This event would work:
st = """
{"Records":[{
"EventSource":"aws:sns",
"EventVersion":"1.0",
"EventSubscriptionArn":"arn:aws:sns:us-east-2:427128480143:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e2ddc5e3c",
"Sns":{
"Type":"Notification",
"MessageId":"fecb7b39-a861-5450-9179-23d0ce68f268",
"TopicArn":"arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic",
"Subject":"AmazonRedshift INFO - Cluster Created",
"Message":"{\\"Event Source\\":\\"cluster\\",\\"Resource\\":\\"qa-redshift-cluster\\",\\"Event Time\\":\\"2021-04-08 20:12:57.466\\",\\"Identifier Link\\":\\"https://console.aws.amazon.com/redshift/home?region=us-east-2#cluster-details:cluster=qa-redshift-cluster \\",\\"Severity\\":\\"INFO\\",\\"Category\\":[\\"Management\\"],\\"About this Event\\":\\"http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#REDSHIFT-EVENT-2000 \\",\\"Event Message\\":\\"Amazon Redshift cluster 'qa-redshift-cluster' has been created at 2021-04-08 20:12 UTC and is ready for use.\\"}",
"Timestamp":"2021-04-08T20:12:57.905Z",
"SignatureVersion":"1",
"Signature":"jS2AK8hf/rebeXMsfFw0DJx+788w+RiDTXyAbLNZzYdE5Mlhi6GRRIns8VaAJZc5otXkkhGshvgvuE0JsUiOhXBccGEJY6Z+lhx6cLSQoEdQB0DRfwltWKOfpQ88LZXJuKCxYcsPL3y5veBdjJqwjdBZtcVYV9BYRhQvT6q/0MpDcJeOYzMkdpR2ULX5B7GasZ3AV0WbKxIjjzFSd3GNud2m85obRrB//NHQwqN6ydvg7PaN/sXuyJjmguRok27O6YNkIqzKjC4JxDbTl4BwyIbck59edbHC5kxmfpoc/RqjF/kUGaqnf0HYOUuMfDT85+9wYVz8vFFER1v3NnKRLA==",
"SigningCertUrl":"https://sns.us-east-2.amazonaws.com/SimpleNotificationService-010a507c1833636cd94bdd98bd93083a.pem",
"UnsubscribeUrl":"https://sns.us-east-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-2:427128480243:terraform-redshift-sns-topic:cbdfec04-7502-4509-9954-435e7ddc5e3c",
"MessageAttributes":{
}
}
}
]
}
"""
I'm trying to parse a website with the requests module on AWS EC2.
import requests
some_data = {'a':'',
'b':''}
with requests.Session() as s:
result = s.post('http://website.com',data=some_data)
print(result.text)
The page is responding as below:
{
"arrangetype":"U",
"list": [
{
"product_no":43,
"display_order":4,
"is_selling":"T",
"product_empty":"F",
"fix_position":null,
"is_auto_sort":false
},
{
"product_no":44,
"display_order":6,
"is_selling":"T",
"product_empty":"F",
"fix_position":null,
"is_auto_sort":false
}
],
"length":2
}
so I did as below thanks to Roadrunner's help(parsing and getting list from response of get request
)
import requests
from json import loads
some_data = {'a':'',
'b':''}
with requests.Session() as s:
result = s.post('http://website.com',data=some_data)
json_dict = loads(result.text)
print([x["product_no"] for x in json_dict["list"]])
It is perfectly working on local PC(windows10), but now i'm working on AWS EC2 to run the code, however, it occurs error.
The name of the file is timesale.py btw.
Traceback (most recent call last):
File "timesale.py", line 657, in <module>
Timesale_Automation()
File "timesale.py", line 397, in Timesale_Automation
status_check()
File "timesale.py", line 345, in status_check
json_dict = loads(now_sale_page2.text)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I couldn't figure out why this error occurs.
I looked up overstackflow and there were some articles about decoding json but
result.text.decode('utf-8')
returns, I cannot decode str....
interestingly, I put print(result) between the codes and it is not 'None'
Why this would happen?
I have the following string I need to deserialize:
{
"bw": 20,
"center_freq": 2437,
"channel": 6,
"essid": "DIRECT-sB47" Philips 6198",
"freq": 2437
}
This is almost a correct JSON, except for the quote in the value DIRECT-sB47" Philips 6198 which prematurely ends the string, breaking the rest of the JSON.
Is there a way to deserialize elements which have the pattern
"key": "something which includes a quote",
or should I try to first pre-process the string with a regex to remove that quote (I do not care about it, nor about any other weird characters in the keys or values)?
UPDATE: sorry for not posting the code (it is a standard deserialization via json). The code is also available at repl.it
import json
data = '''
{
"bw": 20,
"center_freq": 2437,
"channel": 6,
"essid": "DIRECT-sB47" Philips 6198",
"freq": 2437
}
'''
trans = json.loads(data)
print(trans)
The traceback:
Traceback (most recent call last):
File "main.py", line 12, in <module>
trans = json.loads(data)
File "/usr/local/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 6 column 26 (char 79)
The same code without the quote works fine
import json
data = '''
{
"bw": 20,
"center_freq": 2437,
"channel": 6,
"essid": "DIRECT-sB47 Philips 6198",
"freq": 2437
}
'''
trans = json.loads(data)
print(trans)
COMMENT: I realize that the provider of the JSON should fix their code (I opened a bug report with them). In the meantime, until the bug is fixed (if it is) I would like to try a workaround.
I ended up analyzing the exception which includes the place of the faulty character, removing it and deserializing again (in a loop).
Worst case the whole data string is swallowed, which in my case is better than crashing.
import json
import re
data = '''
{
"bw": 20,
"center_freq": 2437,
"channel": 6,
"essid": "DIRECT-sB47" Philips 6198",
"freq": 2437
}
'''
while True:
try:
trans = json.loads(data)
except json.decoder.JSONDecodeError as e:
s = int(re.search(r"\.*char (\d+)", str(e)).group(1))-2
print(f"incorrect character at position {s}, removing")
data = data[:s] + data[(s + 1):]
else:
break
print(trans)
I'm having an issue I can't understand. I sending a json data as string via Redis (as a queue) and the receiver is throwing the following error :
[ERROR JSON (in queue)] - {"ip": null, "domain": "somedomain.com", "name": "Some user name", "contact_id": 12345, "signature":
"6f496a4eaba2c1ea4e371ea2c4951ad92f41ddf45ff4949ffa761b0648a22e38"} => end is out of bounds
The code that throws the exception is the following :
try:
item = json.loads(item[1])
except ValueError as e:
sys.stderr.write("[ERROR JSON (in queue)] - {1} => {0}\n".format(str(e), str(item)))
return None
What is really odd, is that if I open a python console and do the following :
>>> import json
>>> s = '{"ip": null, "domain": "somedomain.com", "name": "Some user name", "contact_id": 12345, "signature": "6f496a4eaba2c1ea4e371ea2c4951ad92f41ddf45ff4949ffa761b0648a22e38"}'
>>> print s
I have no issue, the string (copy/pasted in the Python console) yield no errors at all, but my original code is throwing one!
Do you have any idea about what is causing the issue?
You are loading item[1], which is the second character of the string items:
>>> json.loads('"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 381, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: end is out of bounds
You should write:
item = json.loads(item)
I am trying to access the mqlread API from Freebase but am getting a "Not Found" 404:
api_key = open("freebaseApiKey").read()
mqlread_url = 'https://www.googleapis.com/freebase/v1/mqlread'
mql_query = '[{"mid": null,"name": null, "type": "/location/statistical_region","limit": 100}]'
cursor = ""
topicService_url = 'https://www.googleapis.com/freebase/v1/topic'
params = {
'key': api_key,
'filter': '/location/statistical_region',
'limit': 0
}
for i in xrange(1000):
mql_url = mqlread_url + '?query=' + mql_query + "&cursor=" + cursor
print mql_url
statisticalRegionsResult = json.loads(urllib.urlopen(mql_url).read())
....
Obviously when I run my python file I get:
https://www.googleapis.com/freebase/v1/mqlread?query=[{"mid": null,"name": null, "type": "/location/statistical_region","limit": 100}]&cursor=
Traceback (most recent call last):
File "[Filepath]...FreeBaseDownload.py", line 37, in <module>
statisticalRegionsResult = json.loads(urllib.urlopen(mql_url).read())
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/decoder.py", line 338, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
What am I doing wrong with the API? I've read things about mqlread being deprecated, what is the parallel for my quest to get all statistical regions (the mids) in Freebase?
It was deprecated over a year ago. It was finally shut down May 2.
https://groups.google.com/forum/#!topic/freebase-discuss/WEnyO8f7xOQ
The only source for this information now is the Freebase data dump.
https://developers.google.com/freebase/#freebase-rdf-dumps