JSONDecodeError when trying to extract a json - python

I am getting this error every time I try to parse the JSON from this API:
Request_URL='https://freeserv.dukascopy.com/2.0/api/group=quotes&method=realtimeSentimentIndex&enabled=true&key=bsq3l3p5lc8w4s0c&type=swfx&jsonp=_callbacks____1kvynkpid'
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
import json
import requests
import pandas as pd
r = requests.get(Request_URL)
df = pd.DataFrame(r.json())

The problem is that the response coming back is in JSONP format. That is, it is JavaScript consisting of a call to a function whose argument is a JavaScript data structure (which may happen to conform to JSON syntax, but there is no guarantee that it does). In part it looks like:
_callbacks____1kvynkpid([{"id":"10012" ...])
So we first need to remove the JavaScript call wrapper, that is, the leading characters up to and including the first ( character, and the final ):
import requests
import json
request_url = 'https://freeserv.dukascopy.com/2.0/api?group=quotes&method=realtimeSentimentIndex&enabled=true&key=bsq3l3p5lc8w4s0c&type=swfx&jsonp=_callbacks____1kvynkpid'
r = requests.get(request_url)
text = r.text
idx = text.index('(')
# skip everything up to and including opening '(' and then skip closing ')'
text = text[idx+1:-1]
print(json.loads(text))
Prints:
[{'id': '10012', 'title': 'ESP.IDX/EUR', 'date': '1636925400000', 'long': '71.43', 'short': '28.57'}, {'id': '10104', 'title': 'AUS.IDX/AUD', 'date': '1636925400000', 'long': '70.59', 'short': '29.41'}, {'id': '10266', 'title': 'NLD.IDX/EUR', 'date': '1636925400000', 'long': '73.48', 'short': '26.52'},
... data too big to reproduce fully
{'id': '82862', 'title': 'MAT/USD', 'date': '1636925400000', 'long': '70.27', 'short': '29.73'}, {'id': '82866', 'title': 'ENJ/USD', 'date': '1636925400000', 'long': '72.16', 'short': '27.84'}]
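The slicing above assumes the response ends exactly at the closing parenthesis. A slightly more defensive sketch (the `strip_jsonp` helper is my own, not part of requests) tolerates a trailing semicolon or whitespace after the call:

```python
import json
import re

def strip_jsonp(payload):
    """Return the text between the first '(' and the final ')' of a JSONP call."""
    # Tolerate a trailing ';' and surrounding whitespace after the call.
    m = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', payload, re.DOTALL)
    if m is None:
        raise ValueError('response does not look like a JSONP payload')
    return m.group(1)

data = json.loads(strip_jsonp('_callbacks____1kvynkpid([{"id": "10012"}]);'))
print(data)  # [{'id': '10012'}]
```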
In this case the structure, when interpreted as a string, adhered to the JSON format, and so we were able to parse it with json.loads(). But what if the JavaScript structure had been (in part):
[{'id':'10012'}]
This is both legal JavaScript and legal Python, but not legal JSON, because JSON requires strings to be enclosed in double quotes. But since it is legal Python, we could use ast.literal_eval:
import requests
import ast
request_url = 'https://freeserv.dukascopy.com/2.0/api?group=quotes&method=realtimeSentimentIndex&enabled=true&key=bsq3l3p5lc8w4s0c&type=swfx&jsonp=_callbacks____1kvynkpid'
r = requests.get(request_url)
text = r.text
idx = text.index('(')
# skip everything up to and including opening '(' and then skip closing ')'
text = text[idx+1:-1]
print(ast.literal_eval(text))
Of course, for the current situation both json.loads and ast.literal_eval happen to work. However, if the JavaScript structure had been:
[{id:'10012'}]
then we would have valid JavaScript that, alas, is not valid Python and cannot be parsed with either json.loads or ast.literal_eval.
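One pragmatic workaround for that last case is to quote the bare keys (and swap the single quotes for double quotes) before handing the text to json.loads. The regex below is a rough sketch of my own: it only handles simple identifier keys and naively assumes no quotes or key-like text inside the string values:

```python
import json
import re

text = "[{id:'10012', title:'ESP.IDX/EUR'}]"
# Quote bare identifiers that follow '{' or ',' and precede ':', e.g. id -> "id"
text = re.sub(r'([{,]\s*)(\w+)\s*:', r'\1"\2":', text)
# Convert single-quoted strings to double-quoted ones (naive: assumes no
# embedded quotes inside the values)
text = text.replace("'", '"')
data = json.loads(text)
print(data)  # [{'id': '10012', 'title': 'ESP.IDX/EUR'}]
```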

Related

Dictionary object has no attribute split

My data file looks like this:
{'data': 'xyz', 'code': '<:c:605445> **[Code](https://traindata/35.6547,56475', 'time': '2021-12-30T09:56:53.547', 'value': 'True', 'stats': '96/23', 'dupe_id': 'S<:c-74.18'}
I'm trying to print this line:
35.6547,56475
Here is my code:
data = "above mentioned data"
for s in data.values():
    print(s)
while data != "stop":
    if data == "quit":
        os.system("disconnect")
    else:
        x, y = s.split(',', 1)
The output is:
{'data': 'xyz', 'code': '<:c:605445> **[Code](https://traindata/35.6547,56475', 'time': '2021-12-30T09:56:53.547', 'value': 'True', 'stats': '95/23', 'dupe_id': 'S<:c-74.18'}
x, y = s.split(',', 1)
AttributeError: 'dict' object has no attribute 'split'
I've tried converting it into a tuple and a list, but I'm getting the same error. The values in x, y should be the expected output mentioned above (35.6547, 56475).
Any help will be highly appreciated.
You can do it like this:
x, y = d['code'].split('/')[-1].split(',')
That means you need to access the dictionary by one of its keys; here you want 'code'. You retrieve the string '<:c:605445> **[Code](https://traindata/35.6547,56475', which you can now either parse via regex or simply split at '/' and take the last element using [-1]. Then you can split the remaining numbers, which are what you are actually looking for, and write them to x and y respectively.
Of course, you might want to check your incoming data to be valid by catching the KeyError you mentioned in the comments:
try:
    x, y = d['code'].split('/')[-1].split(',')
except KeyError:
    print(f'Data invalid. Key "code" not found. Got: {data} instead')
Another option would be to use a simple regex on the code element: anchored at the end of the string, match one or more digits, a literal '.', more digits, a ',', and more digits.
import re
d = {'data': 'xyz', 'code': ':c: **[Code](https://traindata/35.6547,56475', 'time': '2021-12-30T09:56:53.547', 'value': 'True', 'stats': '96/23', 'dupe_id': 'S<:c-74.18'}
print(re.findall(r'\d+\.\d+,\d+$', d['code'])[0])
You can only split text, not a dictionary. First get the text that you want to split:
d['code']
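Putting the pieces together with the sample record from the question (the variable name `record` is mine):

```python
record = {'data': 'xyz',
          'code': '<:c:605445> **[Code](https://traindata/35.6547,56475',
          'time': '2021-12-30T09:56:53.547'}

# Take the text after the last '/', then split on the first comma
x, y = record['code'].split('/')[-1].split(',', 1)
print(x, y)  # 35.6547 56475
```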

Json lines (Jsonl) generator to csv format

I have a large Jsonl file (6GB+) which I need to convert to .csv format. After running:
import json

with open(root_dir + 'filename.json') as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data)
Many records of the below format are returned:
{'url': 'https://twitter.com/CHItraders/status/945958273861275648', 'date': '2017-12-27T10:03:22+00:00', 'content': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'renderedContent': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - BizNews.com biznews.com/wealth-buildin…', 'id': 945958273861275648, 'username': 'CHItraders', 'user': {'username': 'CHItraders', 'displayname': 'CHItraders', 'id': 185663478, 'description': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'rawDescription': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'descriptionUrls': [], 'verified': False, 'created': '2010-09-01T14:52:28+00:00', 'followersCount': 1196, 'friendsCount': 490, 'statusesCount': 38888, 'favouritesCount': 10316, 'listedCount': 58, 'mediaCount': 539, 'location': 'Chicago, IL', 'protected': False, 'linkUrl': None, 'linkTcourl': None, 'profileImageUrl': 'https://pbs.twimg.com/profile_images/623935252357058560/AaeCRlHB_normal.jpg', 'profileBannerUrl': 'https://pbs.twimg.com/profile_banners/185663478/1437592670'}, 'outlinks': ['http://BizNews.com', 'https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter'], 'outlinksss': 'http://BizNews.com https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter', 'tcooutlinks': ['https :// t.co/C1kEhoLaHr', 'https :// t.co/sZT43PBDrM'], 'tcooutlinksss': 'https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'replyCount': 0, 'retweetCount': 0, 'likeCount': 0, 'quoteCount': 0, 'conversationId': 945958273861275648, 'lang': 'en', 'source': 'Twitter Web Client', 'media': None, 'retweetedTweet': None, 'quotedTweet': None, 'mentionedUsers': None}
Due to the size of the file, I can't use the conversion:
with open(root_dir + 'filename.json', 'r', encoding='utf-8-sig') as f:
    data = f.readlines()
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
newdf = pd.read_json(StringIO(data_json_str))
newdf.to_csv(root_dir + 'output.csv')
due to MemoryError. I am trying to use the below generator and write each line to the csv, which should negate the MemoryError issue:
def yield_line_delimited_json(path):
    """
    Read a line-delimited json file yielding each row as a record
    :param str path:
    :rtype: list[object]
    """
    with open(path, 'r') as json_file:
        for line in json_file:
            yield json.loads(line)

new = yield_line_delimited_json(root_dir + 'filename.json')
with open(root_dir + 'output.csv', 'w') as f:
    for x in new:
        f.write(str(x))
However, the data is not written to the .csv format. Any advice on why the data isn't writing to the csv file is greatly appreciated!
The generator seems completely superfluous.
import csv
import json

with open(root_dir + 'filename.json') as old, open(root_dir + 'output.csv', 'w', newline='') as csvfile:
    new = csv.writer(csvfile)
    for x in old:
        row = json.loads(x)  # works as-is when each JSON line is an array of scalars
        new.writerow(row)
If one line of JSON does not simply produce an array of strings and numbers, you still need to figure out how to convert it from whatever structure is inside the JSON to something which can usefully be serialized as a one-dimensional list of strings and numbers of a fixed length.
If your JSON can be expected to reliably contain a single dictionary with a fixed set of keyword-value pairs, maybe try
from csv import DictWriter
import json

with open(jsonfile, 'r') as inp, open(csvfile, 'w', newline='') as outp:
    writer = DictWriter(outp, fieldnames=[
        'url', 'date', 'content', 'renderedContent', 'id', 'username',
        # 'user', # see below
        'outlinks', 'outlinksss', 'tcooutlinks', 'tcooutlinksss',
        'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
        'conversationId', 'lang', 'source', 'media', 'retweetedTweet',
        'quotedTweet', 'mentionedUsers'],
        extrasaction='ignore')  # skip keys (like 'user') not listed above
    writer.writeheader()
    for line in inp:
        row = json.loads(line)
        writer.writerow(row)
I omitted the user field because in your example, this key contains a nested structure which cannot easily be transformed into CSV without further mangling. Perhaps you would like to extract just user.id into a new field user_id; or perhaps you would like to lift the entire user structure into a flattened list of additional columns in the main record like user_username, user_displayname, user_id, etc?
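A sketch of that second idea, assuming each record carries the nested user dict shown in the question (the `user_` prefix and the `flatten_user` helper name are my own):

```python
def flatten_user(row):
    """Lift selected keys of the nested 'user' dict into top-level columns."""
    user = row.pop('user', None) or {}
    for key in ('id', 'username', 'displayname'):
        row[f'user_{key}'] = user.get(key)
    return row

row = flatten_user({'url': 'https://twitter.com/CHItraders/status/945958273861275648',
                    'user': {'id': 185663478, 'username': 'CHItraders',
                             'displayname': 'CHItraders'}})
print(row['user_id'])  # 185663478
```

The DictWriter fieldnames would then include user_id, user_username, user_displayname instead of user.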
In some more detail, CSV is basically a two-dimensional matrix where every row is a one-dimensional collection of columns corresponding to one record in the data, where each column can contain one string or one number. Every row needs to have exactly the same number of columns, though you can leave some of them empty.
JSON which can trivially be transformed into CSV would look like
["Adolf", "1945", 10000000]
["Joseph", "1956", 25000000]
["Donald", null, 1000000]
JSON which can be transformed to CSV by some transformation (which you'd have to specify separately, like for example with the dictionary key ordering specified above) might look like
{"name": "Adolf", "dod": "1945", "death toll": 10000000}
{"dod": "1956", "name": "Joseph", "death toll": 25000000}
{"death toll": 1000000, "name": "Donald"}
(Just to make it more interesting, one field is missing, and the dictionary order varies from one record to the next. This is not typical, but definitely within the realm of valid corner cases that Python could not possibly guess on its own how to handle.)
Most real-world JSON is significantly more complex than either of these simple examples, to the point where we can say that the problem is not possible to solve in the general case without manual work on your part to separate out and normalize the data you want into a structure which is suitable for representing as a CSV matrix.

How to extract only a specific value from a python sublist that I got as an API response from Monkeylearn

I have been training a text classification model in Monkeylearn and as a response to my API query, I get a python list as a result. I want to extract only the specific text classification value from it. Attaching the code below.
ml = MonkeyLearn('42b2344587')
data = reddittext[2] # dataset in a python list
model_id = 'cl7C'
result = ml.classifiers.classify(model_id, data)
print(result.body) #response from API in list format
Output I get is :
[{'text': 'comment\n', 'external_id': None, 'error': False, 'classifications': []},
{'text': 'So this is the worst series of Kohli like in years.\n', 'external_id': None, 'error': False, 'classifications': []},
{'text': 'Saini ODI average at 53 😂\n', 'external_id': None, 'error': False, 'classifications': [{'tag_name': 'Batting', 'tag_id': 122983950, 'confidence': 0.64}]}]
I want to print only the classifications tag_name, i.e. "Batting", from this list.
type(result.body)
the output I get is: List
The result.body is a Python list of dicts, i.e. the decoded form of the JSON the API returned.
You can get the desired information by iterating through the list with a for-loop and performing dictionary lookups with d["key"] if you know the key exists, or d.get("key") if you don't. get() returns None if the key tag_name doesn't exist.
for entry in result.body:
    for classification in entry['classifications']:
        tag_name = classification.get('tag_name')
        if tag_name is not None:
            print(tag_name)
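The same extraction can be written as a list comprehension if you prefer to collect the names rather than print them (the `result_body` sample below is abbreviated from the question's output):

```python
result_body = [
    {'text': 'comment\n', 'classifications': []},
    {'text': 'Saini ODI average at 53\n',
     'classifications': [{'tag_name': 'Batting', 'tag_id': 122983950}]},
]

# Collect every tag_name across all entries, skipping empty classification lists
tag_names = [c['tag_name']
             for entry in result_body
             for c in entry['classifications']
             if 'tag_name' in c]
print(tag_names)  # ['Batting']
```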
Since I don't know whether the response format is fixed, let's assume it isn't.
Encode the response to a string with the json module, then use a regex to find the value.
This way you can match multiple occurrences, and since you're most likely receiving JSON-shaped data, the json module won't complain about encoding it.
import json
import re
testcase = [{'text': 'comment\n', 'external_id': None, 'error': False, 'classifications': []},
{'text': 'So this is the worst series of Kohli like in years.\n', 'external_id': None, 'error': False, 'classifications': []},
{'text': 'Saini ODI average at 53 😂\n', 'external_id': None, 'error': False, 'classifications': [{'tag_name': 'Batting', 'tag_id': 122983950, 'confidence': 0.64}]}]
# if data format is fixed
print(testcase[-1]['classifications'][0]['tag_name'])
# if not, expensive but works.
def json_find(source, key_name):
    json_str = json.dumps(source)
    pattern = f'(?<={key_name}": ")([^,"]*)'
    found = re.findall(pattern, json_str)
    return found
print(json_find(testcase, 'tag_name')[0])
Result:
Batting
Batting

Convert a String to Python Dictionary or JSON Object

Here is the problem: I have a string in the following format (note: there are no line breaks; it is shown wrapped here for readability). I simply want this string deserialized into a Python dictionary or a JSON object so I can navigate it easily. I have tried both ast.literal_eval and json, but the end result is either an error or simply another string. I have been scratching my head over this for some time, and I suspect there is a simpler, more elegant solution than writing my own parser.
{
table_name:
{
"columns":
[
{
"col_1":{"col_1_1":"value_1_1","col_1_2":"value_1_2"},
"col_2":{"col_2_1":"value_2_1","col_2_2":"value_2_2"},
"col_3":"value_3","col_4":"value_4","col_5":"value_5"}],
"Rows":1,"Total":1,"Flag":1,"Instruction":none
}
}
Note that the JSON decoder expects each property name to be enclosed in double quotes. Use the following approach with the re.sub() and json.loads() functions:
import json, re
s = '{table_name:{"columns":[{"col_1":{"col_1_1":"value_1_1","col_1_2":"value_1_2"},"col_2":{"col_2_1":"value_2_1","col_2_2":"value_2_2"},"col_3":"value_3","col_4":"value_4","col_5":"value_5"}],"Rows":1,"Total":1,"Flag":1,"Instruction":none}}'
s = re.sub(r'\b(?<!\")([_\w]+)(?=\:)', r'"\1"', s).replace('none', '"None"')
obj = json.loads(s)
print(obj)
The output:
{'table_name': {'columns': [{'col_5': 'value_5', 'col_2': {'col_2_1': 'value_2_1', 'col_2_2': 'value_2_2'}, 'col_3': 'value_3', 'col_1': {'col_1_2': 'value_1_2', 'col_1_1': 'value_1_1'}, 'col_4': 'value_4'}], 'Flag': 1, 'Total': 1, 'Instruction': 'None', 'Rows': 1}}
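Once parsed, the structure can be navigated with ordinary indexing. A small sketch using a trimmed-down version of the parsed object:

```python
obj = {'table_name': {'columns': [{'col_1': {'col_1_1': 'value_1_1',
                                             'col_1_2': 'value_1_2'},
                                   'col_3': 'value_3'}],
                      'Rows': 1, 'Instruction': 'None'}}

# Drill down: table -> first column entry -> nested dict -> leaf value
value = obj['table_name']['columns'][0]['col_1']['col_1_1']
print(value)  # value_1_1
```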

Vincent map html output not valid html

I'm writing some code with Python and Vincent to display some map data.
The example from the docs looks like this:
import vincent
county_topo = r'us_counties.topo.json'
state_topo = r'us_states.topo.json'
geo_data = [{'name': 'counties',
'url': county_topo,
'feature': 'us_counties.geo'},
{'name': 'states',
'url': state_topo,
'feature': 'us_states.geo'}]
vis = vincent.Map(geo_data=geo_data, scale=3000, projection='albersUsa')
del vis.marks[1].properties.update
vis.marks[0].properties.update.fill.value = '#084081'
vis.marks[1].properties.enter.stroke.value = '#fff'
vis.marks[0].properties.enter.stroke.value = '#7bccc4'
vis.to_json('map.json', html_out=True, html_path='map_template.html')
Running this code outputs an html file, but it's formatted improperly: the whole file is a Python byte-string representation, b'<html>....</html>'.
If I remove the quotes and the leading b, the html page works as expected when run through the built in python server.
What's wrong with my output statement?
From the Docs:
A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the
literal should become a bytes literal in Python 3 (e.g. when code is
automatically converted with 2to3). A 'u' or 'b' prefix may be
followed by an 'r' prefix.
You can slice it using:
with open('map_template.html', 'r+') as f:
    html = f.read()[2:-1]
    f.seek(0)       # rewind before rewriting
    f.write(html)
    f.truncate()    # drop the leftover tail
This will open your html file,
b'<html><head><title>MyFile</title></head></html>'
And remove the first 2 and last character, giving you:
<html><head><title>MyFile</title></head></html>
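A more robust alternative, assuming the file really contains the repr of a Python bytes literal, is to let ast.literal_eval turn it back into a bytes object and decode it, rather than slicing off characters by position; this also handles any escape sequences inside the literal that plain slicing would leave verbatim:

```python
import ast

raw = "b'<html><head><title>MyFile</title></head></html>'"
# literal_eval safely evaluates the bytes literal; decode yields the real HTML
html = ast.literal_eval(raw).decode('utf-8')
print(html)  # <html><head><title>MyFile</title></head></html>
```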
