Importing JSON in Python and Removing Header

I'm trying to write a simple JSON to CSV converter in Python for Kiva. The JSON file I am working with looks like this:
{"header":{"total":412045,"page":1,"date":"2012-04-11T06:16:43Z","page_size":500},"loans":[{"id":84,"name":"Justine","description":{"languages":["en"], REST OF DATA
The problem is, when I use json.load, I only get the strings "header" and "loans" in data, but not the actual information such as id, name, description, etc. How can I skip over everything until the [? I have a lot of files to process, so I can't manually delete the beginning in each one. My current code is:
import csv
import json
fp = csv.writer(open("test.csv","wb+"))
f = open("loans/1.json")
data = json.load(f)
f.close()
for item in data:
    fp.writerow([item["name"]] + [item["posted_date"]] + OTHER STUFF)

Instead of
    for item in data:
use
    for item in data['loans']:
The header is stored in data['header'] and data itself is a dictionary, so you'll have to key into it in order to access the data.

data is a dictionary, so for item in data iterates the keys.
You probably want for loan in data['loans']:
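A minimal sketch of the whole conversion, assuming each loan carries the name and posted_date keys used in the question. The sample data and the loans_to_csv helper are made up for illustration, and the code targets Python 3 rather than the "wb+" file mode of the original.

```python
import csv
import io
import json

def loans_to_csv(json_file, csv_file, fields=("id", "name", "posted_date")):
    """Write one CSV row per entry in data['loans'], skipping the header."""
    data = json.load(json_file)      # a dict with "header" and "loans" keys
    writer = csv.writer(csv_file)
    writer.writerow(fields)
    for loan in data["loans"]:       # iterate the list, not the dict keys
        writer.writerow([loan.get(field, "") for field in fields])
    return data["header"]            # still available if you need it

# Hypothetical sample shaped like the Kiva file in the question.
sample = io.StringIO(
    '{"header": {"total": 2, "page": 1},'
    ' "loans": [{"id": 84, "name": "Justine", "posted_date": "2012-04-11"},'
    '           {"id": 85, "name": "Mary"}]}'
)
out = io.StringIO()
header = loans_to_csv(sample, out)
print(out.getvalue())
```

For real files, open the output with open("test.csv", "w", newline="") so the csv module controls the line endings.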

Related

How to get a python array from JSON

I can't work out how to convert JSON to Python so that all the data ends up in an array.
I wrote code to load the JSON, but extracting the strings is a new problem for each data set, because the records don't all have the same number of columns.
import json
data = open('acndata_sessions.json')
json.load(data)
I also tried https://app.quicktype.io/, but the function it generated, data_from_dict(json.loads(json_string)), doesn't work.
Data set: json

This question seems to have been asked before: Convert JSON array to Python list
The answers there use json.loads instead of the json.load that you use. The two are different: json.load reads from an open file object, while json.loads parses a string that is already in memory.

I think you're looking for something like this.
import json
with open('myfile.json', 'r') as jsonFile:
    pythonJSON = json.load(jsonFile)
# The with block closes the file, so no explicit close() is needed.
print(pythonJSON.keys())
json.loads() is for when you already have the JSON as a string; json.load() takes an open file. If you have read the file contents into a string yourself, pass that string to json.loads() instead.
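The difference between the two functions can be sketched as follows. The sample text is made up, and io.StringIO stands in for a real open file.

```python
import io
import json

# Made-up sample; the key names are only for illustration.
text = '{"_items": [{"id": 1}, {"id": 2}], "_meta": {"count": 2}}'

# json.loads: parse a str (or bytes) that is already in memory.
from_string = json.loads(text)

# json.load: parse from a file-like object that has a .read() method.
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
print(type(from_file).__name__, list(from_file))
```

Both calls return the same Python dictionary; only the kind of input differs.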

json.load already gives you a dictionary. You just have to use it and iterate through _items:
import json
with open('acndata_sessions.json') as data:
    data_dict = json.load(data)
# Load items from dictionary
items = data_dict['_items']
# Iterate in items
for item in items:
    # Print the item
    print(item)
    # Or if you want to further iterate on user inputs present in this item
    user_inputs = item['userInputs']
    # Check if there are any user inputs present in this item
    if user_inputs:
        for user_input in user_inputs:
            print(user_input)

Try checking this question. You can parse a json file like this:
import json
# read file
with open('example.json', 'r') as myfile:
    data = myfile.read()
# parse file
obj = json.loads(data)
# show values
print(str(obj['_items']))  # the records stored under _items
print(str(obj['_meta']))

Scraping only select fields from a JSON file

I'm trying to produce only the following JSON data fields, but for some reason it writes the entire page to the .html file? What am I doing wrong? It should only produce the boxes referenced e.g. title, audiosource url, medium sized image, etc?
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
for post in data['posts']:
    # data.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
    ([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(data, ensure_ascii=False))

You need to keep your input data and your output data separate. You load the response into data and then write that same data variable back out, so the whole page ends up in the file. Instead, append the selected fields from the input to a separate list that holds the output.
Don't re-use the same variable names. Here is what you want:
import urllib
import json
import io
# urllib.urlopen is Python 2; on Python 3 use urllib.request.urlopen instead.
url = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(url.read().decode('utf-8'))
posts = []
for post in data['posts']:
    posts.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(posts, ensure_ascii=False))

You are loading the whole json in the variable data, and you are dumping it without changing it. That's the reason why this is happening. What you need to do is put whatever you want into a new variable and then dump it.
See the line -
([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
it does nothing. So, data remains unchanged. Do what Mark Tolonen suggested and it'll be fine.
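The same idea can be tried without a network call. In this sketch the data dictionary is a hypothetical stand-in for the API response, modelling only the keys the question uses, and select_fields is an illustrative helper rather than part of any library.

```python
import json

# Hypothetical stand-in for the API response; only the keys used above are modelled.
data = {
    "count": 1,
    "posts": [
        {
            "title": "Episode 1",
            "audioSource": "https://example.com/1.mp3",
            "image": {"medium": "https://example.com/1-medium.jpg"},
            "excerpt": {"long": "A long excerpt."},
            "unwanted": "dropped from the output",
        }
    ],
}

def select_fields(post):
    """Return only the fields the question asks for."""
    return {
        "title": post["title"],
        "audioSource": post["audioSource"],
        "image": post["image"]["medium"],
        "excerpt": post["excerpt"]["long"],
    }

# Build a new list for the output instead of dumping the input unchanged.
selected = [select_fields(post) for post in data["posts"]]
output = json.dumps(selected, ensure_ascii=False, indent=2)
print(output)
```

Returning dictionaries rather than bare lists keeps the field names in the output, which may or may not be what you want in the final file.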

Mongoexport exporting invalid json files

I collected some tweets from the Twitter API and stored them in MongoDB. Exporting the data to a JSON file worked without any issues, until I wrote a Python script to read the JSON and convert it to CSV. My code fails with this traceback:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
So, after digging around the internet I was pointed to check the actual JSON data in an online validator, which I did. This gave me the error of:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Now, the problem is, I haven't found anything on the internet of how to handle that error. I'm not sure if it's an error with the data I've collected, exported, or if I just don't know how to work with it.
My end game with these tweets is to make a network graph. I was looking at either Networkx or Gephi, which is why I'd like to get a csv file.

Robert Moskal is right. If you can address the issue at the source, use the --jsonArray flag when you run mongoexport; that makes the problem easier. If you can't address it at the source, read on.
The code below extracts the individual JSON objects from the given file and converts them to Python dictionaries. It relies on each object ending on a line that starts with '}'.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module on Python 2, I would suggest the unicodecsv module instead, as it handles the Unicode data in your JSON objects. On Python 3 the built-in csv module handles Unicode itself.
import json
# Text mode, so that line.startswith('}') compares str with str.
with open('path_to_your_json_file', 'r', encoding='utf-8') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print(json_dict)
If you want to convert it to CSV using pandas you can use the code below:
import json
import pandas as pd
with open('path_to_your_json_file', 'r', encoding='utf-8') as infile:
    json_block = []
    dictlist = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []
df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the JSON objects, you can use pandas.json_normalize() (pandas.io.json.json_normalize() in pandas versions before 1.0).
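If the objects are not reliably closed by a line starting with '}', a different technique is to let the standard library find the object boundaries with json.JSONDecoder.raw_decode. This is only a sketch: it is not the approach used in the answer above, the sample string is made up, and it reads the whole text into memory first.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Made-up sample: two root objects back to back, as in a file with multiple roots.
sample = '{"_id": 1,\n "text": "first"}\n{"_id": 2, "text": "second }"}\n'
tweets = list(iter_json_objects(sample))
print(tweets)
```

Each yielded dictionary can then go through the same CSV logic as before.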

Elaborating on @MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from mongo. If you use the following via the terminal, you will get valid json from mongodb:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following:
mongoexport --collection=somecollection --db=somedb --out=validfile.json
will NOT give you the results you are looking for, because:
"By default mongoexport writes data using one JSON document for every MongoDB document."

This reply is a bit late, and I am not sure this was available when the question was posted. Anyway, there is now a simple way to import mongoexport JSON data:
import pandas as pd
df = pd.read_json(filename, lines=True)
mongoexport writes each document as a JSON object on its own line, rather than writing the whole file as a single JSON value.

How can I update a specific value on a custom configuration file?

Assuming I have a configuration txt file with this content:
{"Mode":"Classic","Encoding":"UTF-8","Colors":3,"Blue":80,"Red":90,"Green":160,"Shortcuts":[],"protocol":"2.1"}
How can I change a specific value, like "Red":90 to "Red":110, in the file without changing its original format?
I have tried configparser and configobj, but as they are designed for .INI files I couldn't figure out how to make them work with this custom config file. I also tried splitting the lines and searching for the keywords whose values I wanted to change, but I couldn't save the file the same way it was before. Any ideas how to solve this? (I'm very new to Python.)

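This looks like JSON, so you could do:
import json
with open("/path/to/jsonfile", "r") as f:
    obj = json.load(f)
obj["Blue"] = 10
with open("/path/to/mynewfile", "w") as f:
    json.dump(obj, f)
But be aware that the JSON format treats an object's keys as unordered, so the order of the elements is not guaranteed in general (and normally it's not needed). JSON lists do have an order, though.
In practice, on Python 3.7 and later the dict returned by json.load keeps the key order of the file, and json.dump writes the keys back in that order.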
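To keep the file in its original compact form, one option is to pass separators to json.dump, since the default output puts a space after each comma and colon. The sketch below assumes Python 3.7+ and a config file that is a single JSON object; update_value is an illustrative helper, and a temporary file stands in for the real config.

```python
import json
import os
import tempfile

original = ('{"Mode":"Classic","Encoding":"UTF-8","Colors":3,"Blue":80,"Red":90,'
            '"Green":160,"Shortcuts":[],"protocol":"2.1"}')

def update_value(path, key, value):
    """Rewrite the JSON file at path with one key changed."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)          # key order follows the file (Python 3.7+)
    config[key] = value
    with open(path, "w", encoding="utf-8") as f:
        # separators=(",", ":") drops the spaces json.dump adds by default.
        json.dump(config, f, separators=(",", ":"))

# Demonstrate on a temporary copy of the question's file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(original)
    update_value(path, "Red", 110)
    with open(path, encoding="utf-8") as f:
        updated = f.read()

print(updated)
```

Only the changed value differs from the input here; files that contain extra whitespace or non-ASCII text may need other json.dump options to round-trip exactly.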

Here's how you can do it:
import json
d = {} # store your data here
with open('config.txt','r') as f:
    d = json.loads(f.readline())
d['Red']=14
d['Green']=15
d['Blue']=20
result = "{\"Mode\":\"%s\",\"Encoding\":\"%s\",\"Colors\":%s,\
\"Blue\":%s,\"Red\":%s,\"Green\":%s,\"Shortcuts\":%s,\
\"protocol\":\"%s\"}"%(d['Mode'],d['Encoding'],d['Colors'],
d['Blue'],d['Red'],d['Green'],
d['Shortcuts'],d['protocol'])
with open('config.txt','w') as f:
    f.write(result)
# The with block closes the file, so no explicit close() is needed.
print(result)

Is there a memory efficient and fast way to load big JSON files?

I have some JSON files of 500 MB each.
If I use the "trivial" json.load() to load the content all at once, it consumes a lot of memory.
Is there a way to read the file partially? If it were a line-delimited text file, I would be able to iterate over the lines. I am looking for an analogue of that.

There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
with open(json_file_name, 'rb') as f:
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.

So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
Hope this helps.
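The first suggestion can be sketched as follows. This is only an illustration: the body of process_file and the sample files are made up, and the point is simply that the parsed data lives in a local variable and only a small summary is returned.

```python
import json
import os
import tempfile

def process_file(path):
    """Load one JSON file, reduce it to a small summary, and return that."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)          # the only reference to the parsed data
    # Only the summary leaves the function; data can be freed on return.
    return {"file": os.path.basename(path), "records": len(data)}

# Create a few small files to stand in for the real ones (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    list_of_files = []
    for n in range(3):
        path = os.path.join(tmp, "part%d.json" % n)
        with open(path, "w", encoding="utf-8") as f:
            json.dump([{"id": i} for i in range(n + 1)], f)
        list_of_files.append(path)

    summaries = [process_file(json_file) for json_file in list_of_files]

print(summaries)
```

Nothing at module level holds on to the parsed contents, so memory use should stay close to the size of the largest single file.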

Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which lets you parse arbitrarily sized chunks. Check out its README for examples. It's fast because it uses the C yajl library.

This can be done with ijson. How ijson works has been explained very well by Jim Pivarski in the answer above. The code below reads a file and prints each JSON object in the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson
def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson gives to the elements of an array, so the prefix 'item' here selects each element of the top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # The key in the sample file is "drug", so the prefix is 'item.drug'.
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only the drug objects whose type is tablet.

Since you mention running out of memory, I have to ask whether you are actually managing memory. Are you using the "del" keyword to drop your reference to the old object before trying to read a new one? Python should not silently retain an object once nothing refers to it any more.

Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
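For the single-array case described above, such a generator might look like the sketch below. It assumes the file is one top-level array whose elements are all JSON objects (bare numbers or strings could be split incorrectly at a chunk boundary), and iter_array_objects and its chunk_size parameter are illustrative names, not part of the json module.

```python
import io
import json

def iter_array_objects(fp, chunk_size=65536):
    """Yield the objects of a top-level JSON array, reading fp in chunks."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False                      # True once the opening "[" is consumed
    while True:
        chunk = fp.read(chunk_size)
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break                    # need more input
            if not started:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
            elif buf[0] == ",":
                buf = buf[1:]
            elif buf[0] == "]":
                return                   # end of the array
            else:
                try:
                    obj, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break                # object is incomplete; read another chunk
                yield obj
                buf = buf[end:]
        if not chunk:
            raise ValueError("input ended before the array was closed")

# Made-up sample, read in tiny chunks to exercise the buffering.
sample = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "note": "x ] y"}, {"id": 3}]'
ids = [obj["id"] for obj in iter_array_objects(io.StringIO(sample), chunk_size=8)]
print(ids)
```

Only one object plus the unread remainder of the current chunk is held at a time, at the cost of re-trying the decode whenever an object spans several chunks.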

Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well, although you might run into the same problem loading the JSON. Avoid the problem by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/

"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.

In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.

You can convert the JSON file to a CSV file while parsing it incrementally, event by event:
import ijson
import csv
def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    with open(file_path, 'rb') as json_file, \
            open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        # Assumption: the file is a flat array of objects that share the same keys.
        for prefix, event, value in ijson.parse(json_file):
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                # Only string values are written; other value types are skipped.
                row.append(value)

Simply using json.load() reads the whole file at once, which takes a lot of time. Instead, if the file holds one JSON object per line, you can load the data line by line, store each line's key/value pairs in a dictionary, add that dictionary to a final dictionary, and convert the result to a pandas DataFrame for further analysis:
import json
import pandas as pd
def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line
data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        #print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    # i is the line number, used as the row key
    data_dict[i] = each
Data = pd.DataFrame(data_dict)
# Data holds the dictionary data as a DataFrame (table format), but in
# transposed form, so finally transpose the DataFrame:
Data_1 = Data.T
