Unable to convert a JSON file to a DataFrame - Python

I'm trying to convert a huge JSON file to a DataFrame in order to preprocess it for sentiment analysis, but I'm unable to convert it.
The problem is at pd.read_json:
import json
import pandas as pd

with open("/content/drive/My Drive/timeline_1.jsonl") as f:
    data = f.readlines()

data_json_str = "[" + ','.join(data) + "]"
data_df = pd.read_json(data_json_str)
ValueError: Unmatched ''"' when decoding 'string'

Your data is probably corrupted, at least in one place (maybe more).
One method to find such a place is to run your code, not on the whole file,
but on chunks of it.
For example, run your code on the first half of your file, then on the second half.
If any part runs OK, then it is free from errors.
The next step is to repeat the above procedure on each "failed" chunk.
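A rough sketch of that bisection idea, reusing the file path from your question (here the file is simply split into two halves):

import pandas as pd

with open("/content/drive/My Drive/timeline_1.jsonl") as f:
    lines = f.readlines()

half = len(lines) // 2
for name, chunk in (("first half", lines[:half]), ("second half", lines[half:])):
    try:
        pd.read_json("[" + ','.join(chunk) + "]")
        print(name + ": parses OK")
    except ValueError as e:
        print(name + ": fails - " + str(e))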
Another method: look carefully at your stack trace; it may include the line number
within the source file (do not confuse it with the line number of the Python code).
Currently you assemble the whole text into a single line, so even if the stack trace
contains such a number, it is most likely just 1.
To ease your investigation, change your code in such a way that each
source line is in a separate line in the joined text. Something like:
data_json_str = "[" + ',\n'.join(data) + "]"
Then execute your code again and read the number shown (where the error occurred);
it now corresponds to the line number in the source file.
Then look at that line and correct it, and your code should run with no error.
Edit after your comment with source data
In your data I noticed that:
it contains two JSON objects (rows),
but without any comma between them.
I made the following completions and changes:
added [ and ] at the beginning / end,
added a comma after the first {...},
so that the input string was:
data_json_str = '''[
{"id": "99014576299056245", "created_at": "2017-11-16T14:28:53.919Z",
"sensitive": false, "spoiler_text": "", "language": "en",
"uri": "mastodon.gamedev.place/users/jaggy/statuses/99014576299056245",
"instance": "mastodon.gamedev.place",
"content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>",
"account_id": "434", "tag_list": [], "media_attachments": [], "emojis": [], "mentions": []},
{"id": "99014544879467317", "created_at": "2017-11-16T14:20:54.462Z", "sensitive": false}
]'''
Then I followed your instruction to read this string:
data_df = pd.read_json(data_json_str)
and got a DataFrame with 2 rows (no error).
Initially I suspected &apos; as a possible source of error, but read_json
coped with this case as well.
But when I deleted the comma after the first {...}, I got an error:
ValueError: Unexpected character found when decoding array value (2)
(a different error than yours).
I use Python 3.7.0 and Pandas 0.25.
If you have an older version of either Python or Pandas, maybe you should
upgrade it?
The real problem is probably connected with some "weak point" in the JSON
parser (I'm not sure whether it is part of Python or of Pandas).
Before you upgrade, perform another test: drop the mentioned &apos; from
the input string and attempt read_json again.
If you get a proper result this time, it will confirm my suspicion
that the JSON parser in your installation has flaws and will be
important support for my advice to upgrade your software.
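A tiny sketch of that test, reusing data_json_str from your code (this only removes the HTML entity, purely for diagnosis, not a fix):

data_json_str_test = data_json_str.replace("&apos;", "")
data_df = pd.read_json(data_json_str_test)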

Use pandas.io.json.json_normalize:
Data:
Given the data as a list of dicts in a file named test.json
[{
"id": "99014576299056245",
"created_at": "2017-11-16T14:28:53.919Z",
"sensitive": false,
"spoiler_text": "",
"language": "en",
"uri": "mastodon.gamedev.place/users/jaggy/statuses/99014576299056245",
"instance": "mastodon.gamedev.place",
"content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>",
"account_id": "434",
"tag_list": [],
"media_attachments": [],
"emojis": [],
"mentions": []
}, {
"id": "99014544879467317",
"created_at": "2017-11-16T14:20:54.462Z",
"sensitive": false,
"spoiler_text": "",
"language": "en",
"uri": "mastodon.gamedev.place/users/jaggy/statuses/99014544879467317",
"instance": "mastodon.gamedev.place",
"content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>",
"account_id": "434",
"tag_list": [],
"media_attachments": [],
"emojis": [],
"mentions": []
}
]
Code to read the data:
import pandas as pd
import json
from pathlib import Path
from pandas.io.json import json_normalize

# path to file
p = Path(r'c:\some_directory_with_data\test.json')

# read the file in and load using the json module
with p.open('r', encoding='utf-8') as f:
    data = json.loads(f.read())

# create a dataframe
df = json_normalize(data)
# dataframe view
id created_at sensitive spoiler_text language uri instance content account_id tag_list media_attachments emojis mentions
99014576299056245 2017-11-16T14:28:53.919Z False en mastodon.gamedev.place/users/jaggy/statuses/99014576299056245 mastodon.gamedev.place <p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p> 434 [] [] [] []
99014544879467317 2017-11-16T14:20:54.462Z False en mastodon.gamedev.place/users/jaggy/statuses/99014544879467317 mastodon.gamedev.place <p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p> 434 [] [] [] []
Option 2:
Data
The data is in a file, in the form of rows of dicts: not in a list, separated by newlines.
This is not a valid JSON file:
{"id": "99014576299056245", "created_at": "2017-11-16T14:28:53.919Z", "sensitive": false, "spoiler_text": "", "language": "en", "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014576299056245", "instance": "mastodon.gamedev.place", "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>", "account_id": "434", "tag_list": [], "media_attachments": [], "emojis": [], "mentions": []}
{"id": "99014544879467317", "created_at": "2017-11-16T14:20:54.462Z", "sensitive": false, "spoiler_text": "", "language": "en", "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014544879467317", "instance": "mastodon.gamedev.place", "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>", "account_id": "434", "tag_list": [], "media_attachments": [], "emojis": [], "mentions": []}
Code to read this data:
Read the file in with the following code.
data will be a list of str, where each row of the file is a str in the list.
Use ast.literal_eval to convert each str back to a dict.
literal_eval won't work if there are invalid values in the str (e.g. false instead of False, true instead of True).
This will cause a ValueError: malformed node or string: <_ast.Name object at 0x000002B7240B7888>, which isn't a particularly helpful error.
I've added a try-except block to print any row that causes an issue; add to the values_to_fix dict until you've handled them all.
import pandas as pd
import json
from pathlib import Path
from pandas.io.json import json_normalize
from ast import literal_eval

# path to file
p = Path(r'c:\some_directory_with_data\test.json')

list_of_dicts = list()
with p.open('r', encoding='utf-8') as f:
    data = f.readlines()

for x in data:
    values_to_fix = {'false': 'False',
                     'true': 'True',
                     'none': 'None'}
    for k, v in values_to_fix.items():
        x = x.replace(k, v)
    try:
        x = literal_eval(x)
        list_of_dicts.append(x)
    except ValueError as e:
        print(e)
        print(x)

df = json_normalize(list_of_dicts)
# this output is the same as that shown above
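As a side note, and only an assumption about your source data: if every row of the file is already valid JSON on its own line (as a .jsonl file normally is), then json.loads can parse each row directly and the false/true substitutions are not needed at all. A minimal sketch, using the same hypothetical path as above:

import json
from pathlib import Path
from pandas.io.json import json_normalize  # newer pandas exposes this as pandas.json_normalize

p = Path(r'c:\some_directory_with_data\test.json')
with p.open('r', encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

df = json_normalize(records)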

Related

Python retrieve specified nested JSON value

I have a .json file with many entries looking like this:
{
"name": "abc",
"time": "20220607T190731.442",
"id": "123",
"relatedIds": [
{
"id": "456",
"source": "sourceA"
},
{
"id": "789",
"source": "sourceB"
}
]
}
I am saving each entry in a Python object; however, I only need the related ID from sourceA. The problem is that the sourceA entry is not always in first place in that nested list.
So data['relatedIds'][0]['id'] is not reliable for getting the right ID.
Currently I am solving the issue like this:
import json

with open("filepath", 'r') as file:
    data = json.load(file)

for value in data['relatedIds']:
    if value['source'] == 'sourceA':
        id_from_a = value['id']

entry = Entry(data['name'], data['time'], data['id'], id_from_a)
I don't think this approach is the optimal solution though, especially if the relatedIds list gets longer and more entries are appended to the JSON file.
Is there a more sophisticated way of singling out this 'id' value from a specified source without looping through all entries in that nested list?
For a cleaner solution, you could try using python's filter() function with a simple lambda:
import json

with open("filepath", 'r') as file:
    data = json.load(file)

filtered_data = filter(lambda a: a["source"] == "sourceA", data["relatedIds"])
id_from_a = next(filtered_data)['id']
entry = Entry(data['name'], data['time'], data['id'], id_from_a)
Correct me if I misunderstand how your json file looks, but it seems to work for me.
One step at a time, in order to get to all entries:
>>> data["relatedIds"]
[{'id': '789', 'source': 'sourceB'}, {'id': '456', 'source': 'sourceA'}]
Next, in order to get only those entries with source=sourceA:
>>> [e for e in data["relatedIds"] if e["source"] == "sourceA"]
[{'id': '456', 'source': 'sourceA'}]
Now, since you don't want the whole entry, but just the ID, we can go a little further:
>>> [e["id"] for e in data["relatedIds"] if e["source"] == "sourceA"]
['456']
From there, just grab the first ID:
>>> [e["id"] for e in data["relatedIds"] if e["source"] == "sourceA"][0]
'456'
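The same chain collapses into a single expression with next(), which also lets you supply a default (None here is just an example) if there is no sourceA entry:
>>> next((e["id"] for e in data["relatedIds"] if e["source"] == "sourceA"), None)
'456'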
Can you get whatever generates your .json file to produce the relatedIds as an object rather than a list?
{
"name": "abc",
"time": "20220607T190731.442",
"id": "123",
"relatedIds": {
"sourceA": "456",
"sourceB": "789"
}
}
If not, I'd say you're stuck looping through the list until you find what you're looking for.
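Even if the producer can't be changed, you can build that mapping yourself once from the existing list and then look IDs up by key (note that if a source appears more than once, the last entry wins):

related_by_source = {e["source"]: e["id"] for e in data["relatedIds"]}
id_from_a = related_by_source.get("sourceA")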

How to parse JSON data using Python?

I've been messing around with JSON for some time. I want to get the values of "box" and "text" in this format using Python. Can someone help me work out how to do this? Example output: [92,197,162,215,AUTHORS,...]
{ "form": [ { "box": [ 92,162,197,215], "text": "AUTHORS", "label": "question", "words": [ { "box": [ 92,197,162,215 ],"text": "AUTHORS"} ], "linking": [[0,13]],"id": 0 },
import os
import json

# Directory name consisting of json
file = open('033.json')
data = json.load(file)

result = []
for value in data['form']:
    my_dict = []
    my_dict = value.get('box')
    print(my_dict)
    result.append(my_dict)
Probably like this:
collector = []
for obj in data["form"]:
    collector.append({"box": obj["box"], "text": obj["text"]})
print(collector)
Okay, a few issues with your code:
Why is your list named my_dict? A name should indicate what the object is or what it contains. Your name does the opposite, and if someone works with that code in the future it will most likely confuse them.
Why are you initializing a list just before overwriting it with value.get('box')?
As for the solution, it only takes a couple of lines of code.
result = []
for form_dict in data['form']:
    result.append(tuple(form_dict[key]
                        for key in ('box', 'text') if key in form_dict))
That piece of code would result in this: [([92, 162, 197, 215], 'AUTHORS')] based on the data you provided.
This is assuming that there can be more items in the data['form'] list, otherwise the for loop is not needed.
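If the flat list shown in the question ([92,197,162,215,AUTHORS,...]) is really the shape you want, a small variation (assuming the same data dict loaded above) flattens box and text per entry:

flat = []
for form_dict in data['form']:
    flat.extend(form_dict.get('box', []))
    flat.append(form_dict.get('text'))
print(flat)  # e.g. [92, 162, 197, 215, 'AUTHORS'] for the data shown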

Reading json in python separated by newlines

I am trying to read some json with the following format. A simple pd.read_json() returns ValueError: Trailing data. Adding lines=True returns ValueError: Expected object or value. I've tried various combinations of readlines() and load()/loads() so far without success.
Any ideas how I could get this into a dataframe?
{
"content": "kdjfsfkjlffsdkj",
"source": {
"name": "jfkldsjf"
},
"title": "dsldkjfslj",
"url": "vkljfklgjkdlgj"
}
{
"content": "djlskgfdklgjkfgj",
"source": {
"name": "ldfjkdfjs"
},
"title": "lfsjdfklfldsjf",
"url": "lkjlfggdflkjgdlf"
}
The sample you have above isn't valid JSON. To be valid JSON these objects need to be within a JSON array ([]) and be comma-separated, as follows:
[{
"content": "kdjfsfkjlffsdkj",
"source": {
"name": "jfkldsjf"
},
"title": "dsldkjfslj",
"url": "vkljfklgjkdlgj"
},
{
"content": "djlskgfdklgjkfgj",
"source": {
"name": "ldfjkdfjs"
},
"title": "lfsjdfklfldsjf",
"url": "lkjlfggdflkjgdlf"
}]
I just tried on my machine. When formatted correctly, it works
>>> pd.read_json('data.json')
content source title url
0 kdjfsfkjlffsdkj {'name': 'jfkldsjf'} dsldkjfslj vkljfklgjkdlgj
1 djlskgfdklgjkfgj {'name': 'ldfjkdfjs'} lfsjdfklfldsjf lkjlfggdflkjgdlf
Another solution, if you do not want to reformat your files:
Assuming your JSON is in a string called my_json, you could do:
import json
import pandas as pd
splitted = my_json.split('\n\n')
my_list = [json.loads(e) for e in splitted]
df = pd.DataFrame(my_list)
Thanks for the ideas, internet. None quite solved the problem in the way I needed (I had lots of newline characters within the strings themselves, which meant I couldn't split on them), but they helped point the way. In case anyone has a similar problem, this is what worked for me:
import json
import pandas as pd

with open('path/to/original.json', 'r') as f:
    data = f.read()

data = data.split("}\n")
data = [d.strip() + "}" for d in data]
data = list(filter(("}").__ne__, data))
data = [json.loads(d) for d in data]

with open('path/to/reformatted.json', 'w') as f:
    json.dump(data, f)

df = pd.read_json('path/to/reformatted.json')
If you can use jq, then the solution is simpler:
jq -s '.' path/to/original.json > path/to/reformatted.json
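After the jq step the reformatted file is an ordinary JSON array, so (assuming the same output path) it can then be read directly:

import pandas as pd
df = pd.read_json('path/to/reformatted.json')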

How to generate JSON data with python 2.7+

I have the following bit of JSON data, which is a snippet from a large file of JSON.
I'm basically just looking to expand this data.
I'll worry about adding it to the existing JSON file later.
The JSON data snippet is:
"Roles": [
{
"Role": "STACiWS_B",
"Settings": {
"HostType": "AsfManaged",
"Hostname": "JTTstSTBWS-0001",
"TemplateName": "W2K16_BETA_4CPU",
"Hypervisor": "sys2Director-pool4",
"InCloud": false
}
}
],
So what I want to do is make many more datasets of "role" (for lack of a better term), something like this:
"Roles": [
{
"Role": "Clients",
"Settings": {
"HostType": "AsfManaged",
"Hostname": "JTClients-0001",
"TemplateName": "Win10_RTM_64_EN_1511",
"Hypervisor": "sys2director-pool3",
"InCloud": false
}
},
{
"Role": "Clients",
"Settings": {
"HostType": "AsfManaged",
"Hostname": "JTClients-0002",
"TemplateName": "Win10_RTM_64_EN_1511",
"Hypervisor": "sys2director-pool3",
"InCloud": false
}
},
I started with some Python code that looks like this, but it seems I'm fairly far off the mark:
import json
import pprint

Roles = ["STACiTS", "STACiWS", "STACiWS_B"]
RoleData = dict()
RoleData['Role'] = dict()
RoleData['Role']['Setttings'] = dict()
ASFHostType = "AsfManaged"
ASFBaseHostname = ["JTSTACiTS", "JTSTACiWS", "JTSTACiWS_"]
HypTemplateName = "W2K12R2_4CPU"
HypPoolName = "sys2director"

def CreateASF_Roles(Roles):
    for SingleRole in Roles:
        print SingleRole  # debug purposes
        if SingleRole == 'STACiTS':
            print ("We found STACiTS!!!")  # debug purposes
            NumOfHosts = 1
            for NumOfHosts in range(20):  # Hardcoded for STACiTS - Generate 20 STACiTS datasets
                RoleData['Role'] = SingleRole
                RoleData['Role']['Settings']['HostType'] = ASFHostType
                ASFHostname = ASFBaseHostname + '-' + NumOfHosts.zfill(4)
                RoleData['Role']['Settings']['Hostname'] = ASFHostname
                RoleData['Role']['Settings']['TemplateName'] = HypTemplateName
                RoleData['Role']['Settings']['Hypervisor'] = HypPoolName
                RoleData['Role']['Settings']['InCloud'] = "false"

CreateASF_Roles(Roles)
pprint.pprint(RoleData)
I keep getting this error, which is confusing, because I thought dictionaries could have named indices.
Traceback (most recent call last):
File ".\CreateASFRoles.py", line 34, in <module>
CreateASF_Roles(Roles)
File ".\CreateASFRoles.py", line 26, in CreateASF_Roles
RoleData['Role']['Settings']['HostType']=ASFHostType
TypeError: string indices must be integers, not str
Any thoughts are appreciated. Thanks.
Right here:
RoleData['Role'] = SingleRole
you set RoleData['Role'] to be the string 'STACiTS'. So the next command evaluates to:
'STACiTS'['Settings']['HostType'] = ASFHostType
which of course is trying to index into a string with another string, which is your error. Dictionaries can have named indices, but you overwrote the dictionary you created with a string.
You likely intended to create RoleData["Settings"] as a dictionary then assign to that, rather than RoleData["Role"]["Settings"]
Also, on another note, you have another mistake up here:
RoleData['Role']['Setttings'] = dict()
with a misspelling of "Settings" that will probably cause similar problems for you later on unless fixed.
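Purely as an illustrative sketch (the hostname prefix and the per-role host count are guesses based on your snippets, not a definitive rewrite), one way to restructure the loop is to build each role as its own dict and append it to a "Roles" list:

import json

Roles = ["STACiTS", "STACiWS", "STACiWS_B"]
ASFHostType = "AsfManaged"
HypTemplateName = "W2K12R2_4CPU"
HypPoolName = "sys2director"

def create_asf_roles(roles, num_of_hosts=20):
    role_list = []
    for single_role in roles:
        for n in range(1, num_of_hosts + 1):
            role_list.append({
                "Role": single_role,
                "Settings": {
                    "HostType": ASFHostType,
                    "Hostname": "JT" + single_role + "-" + str(n).zfill(4),  # guessed naming scheme
                    "TemplateName": HypTemplateName,
                    "Hypervisor": HypPoolName,
                    "InCloud": False,
                },
            })
    return {"Roles": role_list}

print(json.dumps(create_asf_roles(Roles), indent=4))

This builds the whole structure first and only serializes it at the end, which avoids indexing into strings altogether.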

Python - JSON to CSV table?

I was wondering how I could import a JSON file, and then save that to an ordered CSV file, with header row and the applicable data below.
Here's what the JSON file looks like:
[
{
"firstName": "Nicolas Alexis Julio",
"lastName": "N'Koulou N'Doubena",
"nickname": "N. N'Koulou",
"nationality": "Cameroon",
"age": 24
},
{
"firstName": "Alexandre Dimitri",
"lastName": "Song-Billong",
"nickname": "A. Song",
"nationality": "Cameroon",
"age": 26,
etc. etc. + } ]
Note there are multiple 'keys' (firstName, lastName, nickname, etc.). I would like to create a CSV file with those as the header, then the applicable info beneath in rows, with each row having a player's information.
Here's the script I have so far for Python:
import urllib2
import json
import csv

writefilerows = csv.writer(open('WCData_Rows.csv', "wb+"))
api_key = "xxxx"
url = "http://worldcup.kimonolabs.com/api/players?apikey=" + api_key + "&limit=1000"
json_obj = urllib2.urlopen(url)
readable_json = json.load(json_obj)
list_of_attributes = readable_json[0].keys()
print list_of_attributes
writefilerows.writerow(list_of_attributes)
for x in readable_json:
    writefilerows.writerow(x[list_of_attributes])
But when I run that, I get a "TypeError: unhashable type:'list'" error. I am still learning Python (obviously I suppose). I have looked around online (found this) and can't seem to figure out how to do it without explicitly stating what key I want to print...I don't want to have to list each one individually...
Thank you for any help/ideas! Please let me know if I can clarify or provide more information.
Your TypeError is occurring because you are trying to index a dictionary, x, with a list, list_of_attributes, via x[list_of_attributes]. This is not how Python works. In this case you are iterating readable_json, which appears to return a dictionary on each iteration. There is no need to pull values out of this data in order to write them out.
The DictWriter should give you what your looking for.
import csv
[...]

def encode_dict(d, out_encoding="utf8"):
    '''Encode dictionary to desired encoding, assumes incoming data in unicode'''
    encoded_d = {}
    for k, v in d.iteritems():
        k = k.encode(out_encoding)
        v = unicode(v).encode(out_encoding)
        encoded_d[k] = v
    return encoded_d

list_of_attributes = readable_json[0].keys()
# sort fields in desired order
list_of_attributes.sort()

with open('WCData_Rows.csv', "wb+") as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=list_of_attributes)
    writer.writeheader()
    for data in readable_json:
        writer.writerow(encode_dict(data))
Note:
This assumes that each entry in readable_json has the same fields.
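If that assumption doesn't hold, one small tweak (a sketch, not part of the original answer) is to build the header from the union of all keys and fill missing values with an empty string; restval is a standard csv.DictWriter argument:

fieldnames = sorted({key for row in readable_json for key in row})
writer = csv.DictWriter(csv_out, fieldnames=fieldnames, restval="")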
Maybe pandas could do this - but I never tried to read JSON this way:
import pandas as pd
df = pd.read_json( ... )
df.to_csv( ... )
pandas.DataFrame.to_csv
pandas.io.json.read_json
EDIT:
data = ''' [
{
"firstName": "Nicolas Alexis Julio",
"lastName": "N'Koulou N'Doubena",
"nickname": "N. N'Koulou",
"nationality": "Cameroon",
"age": 24
},
{
"firstName": "Alexandre Dimitri",
"lastName": "Song-Billong",
"nickname": "A. Song",
"nationality": "Cameroon",
"age": 26,
}
]'''
import pandas as pd
df = pd.read_json(data)
print df
df.to_csv('results.csv')
result:
age firstName lastName nationality nickname
0 24 Nicolas Alexis Julio N'Koulou N'Doubena Cameroon N. N'Koulou
1 26 Alexandre Dimitri Song-Billong Cameroon A. Song
With pandas you can save it as CSV, Excel, etc. (and maybe even write it directly to a database).
And you can do some operations on the data in the table and show it as a graph.
