Reading json in python separated by newlines - python

I am trying to read some json with the following format. A simple pd.read_json() returns ValueError: Trailing data. Adding lines=True returns ValueError: Expected object or value. I've tried various combinations of readlines() and load()/loads() so far without success.
Any ideas how I could get this into a dataframe?
{
"content": "kdjfsfkjlffsdkj",
"source": {
"name": "jfkldsjf"
},
"title": "dsldkjfslj",
"url": "vkljfklgjkdlgj"
}
{
"content": "djlskgfdklgjkfgj",
"source": {
"name": "ldfjkdfjs"
},
"title": "lfsjdfklfldsjf",
"url": "lkjlfggdflkjgdlf"
}

The sample you have above isn't valid JSON as a single document. To be valid JSON, these objects need to be inside a JSON array ([]) and be comma-separated, as follows:
[{
"content": "kdjfsfkjlffsdkj",
"source": {
"name": "jfkldsjf"
},
"title": "dsldkjfslj",
"url": "vkljfklgjkdlgj"
},
{
"content": "djlskgfdklgjkfgj",
"source": {
"name": "ldfjkdfjs"
},
"title": "lfsjdfklfldsjf",
"url": "lkjlfggdflkjgdlf"
}]
I just tried this on my machine. When the file is formatted correctly, it works:
>>> pd.read_json('data.json')
            content                 source           title               url
0   kdjfsfkjlffsdkj   {'name': 'jfkldsjf'}      dsldkjfslj    vkljfklgjkdlgj
1  djlskgfdklgjkfgj  {'name': 'ldfjkdfjs'}  lfsjdfklfldsjf  lkjlfggdflkjgdlf

Another solution if you do not want to reformat your files.
Assuming your JSON is in a string called my_json you could do:
import json
import pandas as pd
splitted = my_json.split('\n\n')
my_list = [json.loads(e) for e in splitted]
df = pd.DataFrame(my_list)

Thanks for the ideas internet. None quite solved the problem in the way I needed (I had lots of newline characters in the strings themselves which meant I couldn't split on them) but they helped point the way. In case anyone has a similar problem, this is what worked for me:
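import json
import pandas as pd
with open('path/to/original.json', 'r') as f:
    data = f.read()
# Split after each closing brace that ends a line, then restore the brace.
data = data.split("}\n")
data = [d.strip() + "}" for d in data]
# Drop the lone "}" left over from the split after the final object.
data = list(filter(("}").__ne__, data))
data = [json.loads(d) for d in data]
with open('path/to/reformatted.json', 'w') as f:
    json.dump(data, f)
df = pd.read_json('path/to/reformatted.json')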

If you can use jq, the solution is simpler:
jq -s '.' path/to/original.json > path/to/reformatted.json
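If jq is not available, roughly the same "slurp" step can be sketched in pure Python with json.JSONDecoder.raw_decode, which parses one value from a string and reports the index where it stopped. The helper name iter_json_objects and the sample text are made up for illustration, and the sketch assumes the whole file fits in memory.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample in the same shape as the question's file.
sample = '''{
"content": "line one\\nline two",
"source": {"name": "a"}
}
{
"content": "other",
"source": {"name": "b"}
}
'''
records = list(iter_json_objects(sample))
print(len(records))
```

The resulting list can be passed to pd.DataFrame(records) as in the answers above. Because the decoder tracks object boundaries itself, the result does not depend on where line breaks fall in the file.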

Related

How to correct leading zeroes in JSON with python

I have a wrongly-formatted JSON file where I have numbers with leading zeroes.
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
arc = json.loads(p)
I get this error.
JSONDecodeError: Expecting ',' delimiter: line 8 column 24 (char 107)
Here's what is on char 107:
print(p[107])
#0
The problem is that this is the data I have. I am only showing two examples here, but my file has millions of lines to parse, so I need a script. In the end, I need this string:
"""[
{
"name": "Alice",
"RegisterNumber": "911100020001"
},
{
"name": "Bob",
"RegisterNumber": "000111110300"
}
]"""
How can I do it?
Read the file (ideally line by line) and replace all the values with their string representation. You can use regular expressions for that (the re module).
Then save the result and parse the now-valid JSON later.
If it fits into memory, you don't need to save the file, of course; just call json.loads on the now-valid JSON string.
Here is a simple version:
import json
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
from re import sub
p = sub(r"(\d{12})", "\"\\1\"", p)
arc = json.loads(p)
print(arc[1])
This probably won't be pretty, but you could fix this using a regex.
import json
import re
p = "..."
sub = re.sub(r'"RegisterNumber":\W([0-9]+)', r'"RegisterNumber": "\1"', p)
json.loads(sub)
This will match all cases where RegisterNumber is followed by digits.
Since the problem is the leading zeroes, the easy way to fix the data would be to split it into lines and fix any lines that exhibit the problem. It's cheap and nasty, but this seems to work.
data = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
import json
result = []
for line in data.splitlines():
    if ': 0' in line:
        # Note: this loop strips the leading zeroes before quoting,
        # so Bob's value ends up as "111110300".
        while ": 0" in line:
            line = line.replace(': 0', ': ')
        result.append(line.replace(': ', ': "') + '"')
    else:
        result.append(line)
data = "".join(result)
arc = json.loads(data)
print(arc)
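The line-by-line suggestion above can be sketched as follows, keeping the leading zeroes as the question asks. The function name quote_register_numbers is made up for illustration, and the sketch assumes every RegisterNumber value is an unquoted run of digits.

```python
import json
import re

# Matches an unquoted run of digits after the "RegisterNumber" key.
PATTERN = re.compile(r'("RegisterNumber"\s*:\s*)(\d+)')

def quote_register_numbers(lines):
    """Yield each line with the bare RegisterNumber digits wrapped in quotes."""
    for line in lines:
        yield PATTERN.sub(r'\1"\2"', line)

raw = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""

fixed = "\n".join(quote_register_numbers(raw.splitlines()))
arc = json.loads(fixed)
print(arc[1]["RegisterNumber"])
```

For a large file, iterating over the open file object instead of raw.splitlines() and writing each yielded line straight to an output file would avoid holding everything in memory.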

how to extract specific data from json and put in to csv using python

I have a JSON which is in nested form. I would like to extract specific data from json and put into csv using pandas python.
data = {
"class":"hudson.model.Hudson",
"jobs":[
{
"_class":"hudson.model.FreeStyleProject",
"name":"git_checkout",
"url":"http://localhost:8080/job/git_checkout/",
"builds":[
{
"_class":"hudson.model.FreeStyleBuild",
"duration":1201,
"number":6,
"result":"FAILURE",
"url":"http://localhost:8080/job/git_checkout/6/"
}
]
},
{
"_class":"hudson.model.FreeStyleProject",
"name":"output",
"url":"http://localhost:8080/job/output/",
"builds":[
]
},
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowJob",
"name":"pipeline_test",
"url":"http://localhost:8080/job/pipeline_test/",
"builds":[
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration":9274,
"number":85,
"result":"SUCCESS",
"url":"http://localhost:8080/job/pipeline_test/85/"
},
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration":4251,
"number":84,
"result":"SUCCESS",
"url":"http://localhost:8080/job/pipeline_test/84/"
}
]
}
]
}
From the above JSON I want to fetch each job's name value and each build's result value. I am new to Python, so any help will be appreciated.
So far I have tried
main_data = data['jobs']
json_normalize(main_data, ['builds'],
               record_prefix='jobs_', errors='ignore')
which gives only the build key values and not the name of the job.
Can anyone help?
Expected Output:
If only the first build's result value needs to be in the CSV column, you can achieve this using pandas.
data = {
"class": "hudson.model.Hudson",
"jobs": [
{
"_class": "hudson.model.FreeStyleProject",
"name": "git_checkout",
"url": "http://localhost:8080/job/git_checkout/",
"builds": [
{
"_class": "hudson.model.FreeStyleBuild",
"duration": 1201,
"number": 6,
"result": "FAILURE",
"url": "http://localhost:8080/job/git_checkout/6/"
}
]
},
{
"_class": "hudson.model.FreeStyleProject",
"name": "output",
"url": "http://localhost:8080/job/output/",
"builds": []
},
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowJob",
"name": "pipeline_test",
"url": "http://localhost:8080/job/pipeline_test/",
"builds": [
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration": 9274,
"number": 85,
"result": "SUCCESS",
"url": "http://localhost:8080/job/pipeline_test/85/"
},
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration": 4251,
"number": 84,
"result": "SUCCESS",
"url": "http://localhost:8080/job/pipeline_test/84/"
}
]
}
]
}
main_data = data.get('jobs')
res = {'name':[], 'result':[]}
for name_dict in main_data:
    res['name'].append(name_dict.get('name','NA'))
    resultval = name_dict['builds'][0].get('result') if len(name_dict['builds'])>0 else 'NA'
    res['result'].append(resultval)
print(res)
import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/file_timer/jobs.csv", index=False)
Check the csv file output
name,result
git_checkout,FAILURE
output,NA
pipeline_test,SUCCESS
If you want to skip jobs whose result would be 'NA', then
main_data = data.get('jobs')
res = {'name':[], 'result':[]}
for name_dict in main_data:
    if len(name_dict['builds'])==0:
        continue
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result')
    res['result'].append(resultval)
print(res)
import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/akash.pagar/shell_learning/file_timer/jobs.csv", index=False)
The output will look like
name,result
git_checkout,FAILURE
pipeline_test,SUCCESS
Simply, with the build number included,
for job in data.get('jobs'):
    for build in job.get('builds'):
        print(job.get('name'), build.get('number'), build.get('result'))
gives the result
git_checkout 6 FAILURE
pipeline_test 85 SUCCESS
pipeline_test 84 SUCCESS
If you want the result of the latest build and are sure the builds are always listed in descending order of build number,
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), job.get('builds')[0].get('result'))
and if you are not sure about the order,
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), sorted(job.get('builds'), key=lambda k: k.get('number'))[-1].get('result'))
then the result will be:
git_checkout FAILURE
pipeline_test SUCCESS
Assuming last build is the last element of its list and you don't care about jobs with no builds, this does:
import pandas as pd
#data = ... #same format as in the question
z = [(job["name"], job["builds"][-1]["result"]) for job in data["jobs"] if len(job["builds"])]
df = pd.DataFrame(data=z, columns=["name", "result"])
#df.to_csv #TODO
Also we don't necessarily need pandas to create the csv file.
You could do:
import csv
#z = ... #see previous code block
# newline='' stops the csv module from writing blank rows on Windows
with open("f.csv", 'w', newline='') as fp:
    csv.writer(fp).writerows([("name", "result")] + z)
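The json_normalize attempt from the question can also be made to carry the job name by passing the meta argument alongside record_path. This is a minimal sketch, assuming a pandas version that exposes pd.json_normalize (1.0 or later); the data dict is a trimmed-down stand-in for the payload in the question and the output file name is hypothetical.

```python
import pandas as pd

# Trimmed-down stand-in for the Jenkins payload in the question.
data = {
    "jobs": [
        {"name": "git_checkout", "builds": [{"number": 6, "result": "FAILURE"}]},
        {"name": "output", "builds": []},
        {"name": "pipeline_test", "builds": [
            {"number": 85, "result": "SUCCESS"},
            {"number": 84, "result": "SUCCESS"},
        ]},
    ]
}

# One row per build; meta copies the parent job's name onto each row.
df = pd.json_normalize(data["jobs"], record_path="builds", meta=["name"])
df = df[["name", "number", "result"]]
print(df)
# df.to_csv("jobs.csv", index=False)  # hypothetical output path
```

Jobs with an empty builds list contribute no rows here, which matches the behaviour of the answers above that skip such jobs.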

Writing multiple json objects to a json file

I have a list of json objects that I would like to write to a json file. Example of my data is as follows:
{
"_id": "abc",
"resolved": false,
"timestamp": "2017-04-18T04:57:41 366000",
"timestamp_utc": {
"$date": 1492509461366
},
"sessionID": "abc",
"resHeight": 768,
"time_bucket": ["2017-year", "2017-04-month", "2017-16-week", "2017-04-18-day", "2017-04-18 16-hour"],
"referrer": "Standalone",
"g_event_id": "abc",
"user_agent": "abc"
"_id": "abc",
} {
"_id": "abc",
"resolved": false,
"timestamp": "2017-04-18T04:57:41 366000",
"timestamp_utc": {
"$date": 1492509461366
},
"sessionID": "abc",
"resHeight": 768,
"time_bucket": ["2017-year", "2017-04-month", "2017-16-week", "2017-04-18-day", "2017-04-18 16-hour"],
"referrer": "Standalone",
"g_event_id": "abc",
"user_agent": "abc"
}
I would like to write this to a json file. Here's the code that I am using for this purpose:
with open("filename", 'w') as outfile1:
    for row in data:
        outfile1.write(json.dumps(row))
But this gives me a file with only 1 long row of data. I would like to have a row for each json object in my original data. I know there are some other StackOverflow questions that are trying to address somewhat similar situation (by externally inserting '\n' etc.), but it hasn't worked in my case for some reason. I believe there has to be a pythonic way to do this.
How do I achieve this?
The format of the file you are trying to create is called JSON Lines.
It seems you are asking why the JSON objects are not separated by a newline. That is because the write method does not append a newline.
If you want implicit newlines, you are better off using the print function:
with open("filename", 'w') as outfile1:
    for row in data:
        print(json.dumps(row), file=outfile1)
Use the indent argument to output json with extra whitespace. The default is to not output linebreaks or extra spaces.
with open('filename.json', 'w') as outfile1:
    json.dump(data, outfile1, indent=4)
https://docs.python.org/3/library/json.html#basic-usage
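As a small round-trip sketch of the JSON Lines idea: write one compact object per line, then read the file back by parsing each line separately. The rows list is made-up sample data, and io.StringIO stands in for a real file so the snippet runs without touching the disk.

```python
import io
import json

# Made-up sample rows in the spirit of the question's data.
rows = [
    {"_id": "abc", "resolved": False, "resHeight": 768},
    {"_id": "def", "resolved": True, "resHeight": 1080},
]

# io.StringIO stands in for a real file opened with open("filename", "w").
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row) + "\n")  # one compact object per line

# Reading back: parse each non-empty line on its own.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded == rows)
```

Note that a file written this way is not a single JSON document, so it has to be read line by line (or with pd.read_json(..., lines=True)) rather than with one json.load call.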

In Python, how to parse the complex json file

I want to get "path" from the JSON file below. I used json.load to read the file and then parsed it level by level using for key, value in data.items(), which leads to a lot of nested for loops (about 6) to reach the value of "path". Is there a simpler way to retrieve the value of path?
The complete json file can be found here and below is the snippet of it.
{
"products": {
"com.ubuntu.juju:12.04:amd64": {
"version": "2.0.1",
"arch": "amd64",
"versions": {
"20161129": {
"items": {
"2.0.1-precise-amd64": {
"release": "precise",
"version": "2.0.1",
"arch": "amd64",
"size": 23525972,
"path": "released/juju-2.0.1-precise-amd64.tgz",
"ftype": "tar.gz",
"sha256": "f548ac7b2a81d15f066674365657d3681e3d46bf797263c02e883335d24b5cda"
}
}
}
}
},
"com.ubuntu.juju:14.04:amd64": {
"version": "2.0.1",
"arch": "amd64",
"versions": {
"20161129": {
"items": {
"2.0.1-trusty-amd64": {
"release": "trusty",
"version": "2.0.1",
"arch": "amd64",
"size": 23526508,
"path": "released/juju-2.0.1-trusty-amd64.tgz",
"ftype": "tar.gz",
"sha256": "7b86875234477e7a59813bc2076a7c1b5f1d693b8e1f2691cca6643a2b0dc0a2"
}
}
}
}
},
You can use a recursive generator:
def get_paths(data):
    if 'path' in data:
        yield data['path']
    for k in data.keys():
        if isinstance(data[k], dict):
            for i in get_paths(data[k]):
                yield i
for path in get_paths(json_data): # loaded json data
    print(path)
Is the path key always at the same depth in the loaded JSON (which is a dict)? If so, what about doing
products = loaded_json['products']
# Walk the fixed nesting: product -> versions -> items -> path
for product in products.values():
    for version in product['versions'].values():
        for item in version['items'].values():
            print(item['path'])
If not, the answer of Yevhen Kuzmovych is clearly better, cleaner and more general than mine.
If you only care about the path, I think using any JSON parser is overkill; you can just use the built-in re module with the following pattern: (\"path\":\s*\")(.*\s*)(?=\",). I didn't test it on the whole file, but you should be able to figure out the best pattern fairly easily.
If you only need the file names present in path field, you can easily get them by simply parsing the file:
import re
files = []
pathre = re.compile(r'\s*"path"\s*:\s*"(.*?)"')
with open('file.json') as fd:
    for line in fd:
        if "path" in line:
            m = pathre.match(line)
            if m is not None:
                files.append(m.group(1))
If you need to process simultaneously the path and sha256 fields:
files = []
pathre = re.compile(r'\s*"path"\s*:\s*"(.*?)"')
share = re.compile(r'\s*"sha256"\s*:\s*"(.*?)"')
path = None
with open('file.json') as fd:
    for line in fd:
        if "path" in line:
            m = pathre.match(line)
            path = m.group(1)
        elif "sha256" in line:
            m = share.match(line)
            if path is not None:
                files.append((path, m.group(1)))
                path = None
You can use a query language like JSONPath. Here you find the Python implementation: https://pypi.python.org/pypi/jsonpath-rw
Assuming you have your JSON content already loaded, you can do something like the following:
from jsonpath_rw import jsonpath, parse
# Load your JSON content first from a file or from a string
# json_data = ...
jsonpath_expr = parse('products..path')
for match in jsonpath_expr.find(json_data):
    print(match.value)
For a further discussion you can read this: Is there a query language for JSON?
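If the real file also nests data inside lists, the recursive-generator idea from the first answer can be extended to descend into them as well. This is a sketch using only the standard library; the function name find_key and the "mirrors" entry in the sample are hypothetical and only serve to exercise the list branch.

```python
def find_key(node, key):
    """Yield every value stored under `key`, at any depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)

# Trimmed-down sample following the shape of the snippet in the question.
sample = {
    "products": {
        "com.ubuntu.juju:12.04:amd64": {
            "versions": {"20161129": {"items": {
                "2.0.1-precise-amd64": {"path": "released/juju-2.0.1-precise-amd64.tgz"}
            }}}
        },
        "com.ubuntu.juju:14.04:amd64": {
            "versions": {"20161129": {"items": {
                "2.0.1-trusty-amd64": {"path": "released/juju-2.0.1-trusty-amd64.tgz"}
            }}}
        },
    },
    # Hypothetical extra field, only here to exercise the list branch.
    "mirrors": [{"path": "mirror/index.json"}],
}

paths = list(find_key(sample, "path"))
print(paths)
```

Passing the key as a parameter means the same helper can collect "sha256" or any other field without further changes.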

How do I read a json file into python?

I'm new to JSON and Python, any help on this would be greatly appreciated.
I read about json.loads but am confused
How do I read a file into Python using json.loads?
Below is my JSON file format:
{
"header": {
"platform":"atm"
"version":"2.0"
}
"details":[
{
"abc":"3"
"def":"4"
},
{
"abc":"5"
"def":"6"
},
{
"abc":"7"
"def":"8"
}
]
}
My requirement is to read the values of all "abc" and "def" entries in details and add them to a new list like this: [(1,2),(3,4),(5,6),(7,8)]. The new list will be used to create a Spark data frame.
Open the file, and get a filehandle:
fh = open('thefile.json')
https://docs.python.org/2/library/functions.html#open
Then, pass the file handle into json.load(): (don't use loads - that's for strings)
import json
data = json.load(fh)
https://docs.python.org/2/library/json.html#json.load
From there, you can easily deal with a python dictionary that represents your json-encoded data.
new_list = [(detail['abc'], detail['def']) for detail in data['details']]
Note that your JSON format is also wrong. You will need comma delimiters in many places, but that's not the question.
I'm trying to understand your question as best as I can, but it looks like it was formatted poorly.
First off, your JSON blob is not valid JSON; it is missing quite a few commas. This is probably what you are looking for:
{
"header": {
"platform": "atm",
"version": "2.0"
},
"details": [
{
"abc": "3",
"def": "4"
},
{
"abc": "5",
"def": "6"
},
{
"abc": "7",
"def": "8"
}
]
}
Now, assuming you are trying to parse this in Python, you can do the following:
import json
json_blob = '{"header": {"platform": "atm","version": "2.0"},"details": [{"abc": "3","def": "4"},{"abc": "5","def": "6"},{"abc": "7","def": "8"}]}'
json_obj = json.loads(json_blob)
final_list = []
for single in json_obj['details']:
    final_list.append((int(single['abc']), int(single['def'])))
print(final_list)
This will print the following: [(3, 4), (5, 6), (7, 8)]
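The two answers can be combined into one sketch that reads from a file with json.load and closes the file automatically. The temporary file below is only a stand-in for thefile.json so that the snippet is self-contained; it assumes the JSON has already been corrected with the missing commas.

```python
import json
import os
import tempfile

# Write the corrected JSON to a temporary file (stand-in for thefile.json).
text = (
    '{"header": {"platform": "atm", "version": "2.0"}, '
    '"details": [{"abc": "3", "def": "4"}, {"abc": "5", "def": "6"}, '
    '{"abc": "7", "def": "8"}]}'
)
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

# json.load reads from a file object; the with block closes it afterwards.
with open(path) as fh:
    data = json.load(fh)
os.remove(path)

# Convert the string values to integers while building the tuples.
pairs = [(int(d["abc"]), int(d["def"])) for d in data["details"]]
print(pairs)
```

The resulting list of tuples is the shape the question asks for before handing the data to Spark.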
