Parse a json txt file with python

I have a txt file which contains json string on each line with first line like this:
{"rating": 9.3, "genres": ["Crime", "Drama"], "rated": "R", "filming_locations": "Ashland, Ohio, USA", "language": ["English"], "title": "The Shawshank Redemption", "runtime": ["142 min"], "poster": "http://img3.douban.com/lpic/s1311361.jpg", "imdb_url": "http://www.imdb.com/title/tt0111161/", "writers": ["Stephen King", "Frank Darabont"], "imdb_id": "tt0111161", "directors": ["Frank Darabont"], "rating_count": 894012, "actors": ["Tim Robbins", "Morgan Freeman", "Bob Gunton", "William Sadler", "Clancy Brown", "Gil Bellows", "Mark Rolston", "James Whitmore", "Jeffrey DeMunn", "Larry Brandenburg", "Neil Giuntoli", "Brian Libby", "David Proval", "Joseph Ragno", "Jude Ciccolella"], "plot_simple": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.", "year": 1994, "country": ["USA"], "type": "M", "release_date": 19941014, "also_known_as": ["Die Verurteilten"]}
I want to get data for imdb_id and title.
I have tried:
import json
data = json.load('movie_acotrs_data.txt')
But got 'str' object has no attribute 'read'
What should I do to get the result I expected?

json.load() expects the whole file to be a single JSON document, so you can't use it when the file has a separate JSON string on each line. Instead, read the file line by line and call json.loads() on each line.
import json
with open('movie_actors_data.txt') as f:
    data = list(map(json.loads, f))
data will be a list of dictionaries.
If you just want a few properties, you can use a list comprehension.
with open('movie_actors_data.txt') as f:
    data = [{"title": x["title"], "imdb_id": x["imdb_id"]} for x in map(json.loads, f)]

json.load takes an open file object, not a file path. Note that this alone still won't work here, since json.load expects the whole file to be a single JSON document rather than one per line.
data = json.load(open('movie_acotrs_data.txt'))
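If you do pass an open file, prefer a with block so the handle gets closed; a minimal sketch (this still assumes the entire file is one JSON document):
import json

# Only valid if the entire file is a single JSON document;
# for one object per line, use json.loads() per line as shown above.
with open('movie_acotrs_data.txt') as f:
    data = json.load(f)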

Related

How can I format json via python?

I have a very basic problem that I can't figure out. I keep getting an "End of file expected" error when trying to write object data to a .json file. I was wondering how I can fix that? I write the objects in a for loop, and I'm not sure how to format the output.
This is the code in question
with open("data.json", "w") as outfile:
for x,y in structures.infrastructures.items():
outfile.write(Sector(x, y["Depended on by"],y["Depends on"], y["Sub sections"]).toJson())
and this is the output
{
    "name": "Chemical",
    "depended_on": [
        "Critical Manufacturing",
        "Nuclear Reactors, Waste, and Material Management",
        "Commercial",
        "Healthcare and Public Health",
        "Food and Agriculture",
        "Energy"
    ],
    "depends_on": [
        "Emergency Services",
        "Energy",
        "Food and Agriculture",
        "Healthcare and Public Health",
        "Information Technology",
        "Nuclear Reactors, Waste, and Material Management",
        "Transportation Systems",
        "Water"
    ],
    "sub_sections": [
        "Chemical Plants",
        "Chemical Refineries",
        "Labs"
    ],
    "Status": 0,
    "Strain": 0
}{ -> this is where the error is
    "name": "Commercial",
    "depended_on": [
        ....
        ....
etc
This is my toJson method:
def toJson(self):
    return json.dumps(self, default=lambda o: o.__dict__, indent=4)
But yeah how can I implement it where my object data is written in JSON format?
A valid JSON file can contain only one top-level value. Collect your data into a list and write it with a single json.dump() call, or write the surrounding array brackets and separating commas yourself.
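For example, a minimal sketch of the single-call approach, reusing the Sector class and structures.infrastructures from your question (the __dict__ trick mirrors the default in your toJson method):
import json

# Build a list of plain dicts, one per Sector, then dump the list once
# so data.json holds a single valid JSON array.
sectors = [
    Sector(x, y["Depended on by"], y["Depends on"], y["Sub sections"]).__dict__
    for x, y in structures.infrastructures.items()
]
with open("data.json", "w") as outfile:
    json.dump(sectors, outfile, indent=4)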
You have to use a function to parse the .json file. I hope this helps:
https://scriptcrunch.com/parse-json-file-using-python/

how to split large json file according to the value of a key?

I have a large json file that I would like to split according to the key "metadata". One example of record is
{"text": "The primary outcome of the study was hospital mortality; secondary outcomes included ICU mortality and lengths of stay for hospital and ICU. ICU mortality was defined as survival of a patient at ultimate discharge from the ICU and hospital mortality was defined as survival at discharge or transfer from our hospital.", "label": "conclusion", "metadata": "18982114"}
There are many records in the json file where the key "metadata" is "18982114". How can I extract all of these records and store them in a separate json file? Ideally, I'm looking for a solution that avoids loading and looping over the entire file, otherwise it would be very cumbersome every time I query it. I think it may be doable with a shell command, but unfortunately I'm not an expert in shell commands... so I would highly appreciate a fast, non-looping query solution, thx!
==========================================================================
Here are some samples of the file (it contains 5 records):
{"text": "Finally, after an emergency laparotomy, patients who received i.v. vasoactive drugs within the first 24 h on ICU were 3.9 times more likely to die (OR 3.85; 95% CI, 1.64 -9.02; P\u00bc0.002). No significant prognostic factors were determined by the model on day 2.", "label": "conclusion", "metadata": "18982114"}
{"text": "Kinetics ofA TP Binding to Normal and Myopathic", "label": "conclusion", "metadata": "10700033"}
{"text": "Observed rate constants, k0b,, were obtained by fitting the equation I(t)=oe-kobs+C by the method of moments, where I is the observed fluorescence intensity, and I0 is the amplitude of fluorescence change. 38 ", "label": "conclusion", "metadata": "235564322"}
{"text": "The capabilities of modern angiographic platforms have recently improved substantially.", "label": "conclusion", "metadata": "2877272"}
{"text": "Few studies have concentrated specifically on the outcomes after surgery.", "label": "conclusion", "metadata": "18989842"}
The job is to quickly retrieve the text of every record whose metadata is "18982114".
Use the json package to convert the JSON object into a dictionary, then use the data stored in the metadata key. Here is a working example:
# importing the module
import json

# Opening the JSON file; note json.load() assumes the whole file
# is a single JSON document
with open('data.json') as json_file:
    data = json.load(json_file)

# Print the type of data variable
print("Type:", type(data))

# Print the metadata value of the dictionary
print("metadata: ", data['metadata'])
You can try this approach:
import json

# assumes data.json holds a single JSON array of records
with open('data.json') as data_json:
    data = json.load(data_json)

MATCH_META_DATA = '18982114'
match_records = []
for part_data in data:
    if part_data.get('metadata') == MATCH_META_DATA:
        match_records.append(part_data)
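Since the samples above actually have one JSON object per line (JSON Lines), here is a minimal line-by-line sketch that streams the file instead of loading it all at once (the file names are illustrative):
import json

# Keep only records whose metadata matches, writing them out as JSON Lines.
with open('data.json') as infile, open('filtered.json', 'w') as outfile:
    for line in infile:
        record = json.loads(line)
        if record.get('metadata') == '18982114':
            outfile.write(json.dumps(record) + '\n')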
Let us imagine we have the following JSON content in example.json:
{
    "1": {"text": "Some text 1.", "label": "xxx", "metadata": "18982114"},
    "2": {"text": "Some text 2.", "label": "yyy", "metadata": "18982114"},
    "3": {"text": "Some text 3.", "label": "zzz", "metadata": "something else"}
}
You can do the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

# 1. read json content from file
my_json = None
with open("example.json", "r") as file:
    my_json = json.load(file)

# 2. filter content
# you can use a list instead of a new dictionary if you don't want to create a new json file
new_json_data = {}
for record_id in my_json:
    if my_json[record_id]["metadata"] == str(18982114):
        new_json_data[record_id] = my_json[record_id]

# 3. write a new json with filtered data
with open("result.json", "w") as file:
    json.dump(new_json_data, file)
This will output the following result.json file:
{"1": {"text": "Some text 1.", "label": "", "metadata": "18982114"}, "2": {"text": "Some text 2.", "label": "", "metadata": "18982114"}}

merge & write two jsonl (json lines) files into a new jsonl file in python3.6

Hello I have two jsonl files like so:
one.jsonl
{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}
second.jsonl
{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}
And my goal is to write a new jsonl file (with encoding preserved) name merged_file.jsonl which will look like this:
{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}
{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}
My approach is like this:
import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        try:
            result.append(extract_json(infile))  # tried json.loads(infile) too
        except ValueError:
            print(f)

# write the file with a BOM to preserve the emojis and special characters
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    json.dump(result, outfile)
However I am met with this error:
TypeError: Object of type generator is not JSON serializable
I will appreciate your hint/help in any way. Thank you! I have looked at other SO posts; they all write normal json files, which should work in my case too, but it keeps failing.
Reading single file like this works:
data_json = io.open('one.jsonl', mode='r', encoding='utf-8-sig')  # Opens the JSONL file
data_python = extract_json(data_json)
for line in data_python:
    print(line)

#### outputs ####
# {'name': 'one', 'description': 'testDescription...', 'comment': '1'}
# {'name': 'two', 'description': 'testDescription2...', 'comment': '2'}
It is possible that extract_json returns a generator, which is not JSON serializable, instead of a list/dict. Since the files are jsonl, each line is a valid JSON document, so you just need to tweak your existing code a little bit:
import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile:
            try:
                result.append(json.loads(line))  # read each line of the file
            except ValueError:
                print(f)

# This outputs jsonl: write each record as a json object on its own line
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    # json.dump(result, outfile) would write one big array instead
    outfile.write("\n".join(map(json.dumps, result)))
Now that I think about it, you didn't even have to parse the lines with json; doing so only helps you catch any badly formatted JSON lines.
You could collect all the lines in one shot like this:
outfile = open('merged_file.jsonl', 'w', encoding='utf-8-sig')
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile:
            outfile.write(line)
outfile.close()
Another super easy way to do this, if you don't care about JSON validation:
cat folder_with_all_jsonl/*.jsonl > merged_file.jsonl

how do I append another column/variable to json file from a list of values?

I am a beginner to python and scripting, so I am unfamiliar with how json innately works, but this is my problem. I wrote a script which took values of the "location" variable from the json file I was reading and used the googlemaps API to find the country each location is in. However, as some of these locations are repeats, I did not want to check the same location over and over, so I stored all the values retrieved from the location variable in a list, then converted the list into a set to get rid of duplicates.
My question is this: once I have retrieved the country data (I have the data stored in a list), how can I add this country data to my original json file?
For instance, these are a few of my tests from the json file I am reading.
{"login": "hi", "name": "hello", "location": "Seoul, South Korea"}
{"login": "hi", "name": "hello", "location": null}
{"login": "hi", "name": "hello", "location": "Berlin, Germany"}
{"login": "hi", "name": "hello", "location": "Pittsburgh, PA"}
{"login": "hi", "name": "hello", "location": "London"}
{"login": "hi", "name": "hello", "location": "Tokyo, Japan"}
input = codecs.open(inputFile, 'r', 'utf8')
for line in input.readlines():
    temp = json.loads(line)
    if temp['location'] is not None:  # some locations are null
        locationList.append(temp['location'])
input.close()

locationList = list(set(locationList))
print(locationList)
# getting location data, storing it in countryList variable
for uniqueLocation in locationList:
    geocodeResult = gm.geocode(uniqueLocation)[0]  # getting information about each location
    geoObject = geocodeResult['address_components']  # getting just the address components
    for item in geoObject:  # iterating over the object
        if item['types'][0] == 'country':  # if the type of this item is country
            countryName = item['long_name']  # then retrieve long_name from this item
            countryList.append(countryName)
print(countryList)
Check this out:
How to append data to a json file?
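To actually write the country back into your file, one minimal sketch is to build a location-to-country dict and rewrite the JSONL. This assumes every location in locationList got exactly one entry appended to countryList, so the two lists line up; the output file name is illustrative:
import json
import codecs

# Hypothetical lookup built from the two parallel lists above,
# e.g. {"Seoul, South Korea": "South Korea", ...}
countryByLocation = dict(zip(locationList, countryList))

with codecs.open(inputFile, 'r', 'utf8') as infile, \
        codecs.open('with_country.json', 'w', 'utf8') as outfile:
    for line in infile:
        record = json.loads(line)
        # None when the location was null or not found in the lookup
        record['country'] = countryByLocation.get(record.get('location'))
        outfile.write(json.dumps(record) + '\n')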

How to use Python decode multiple json object with it in one json file [duplicate]

I have a multi-gigabyte JSON file. The file is made up of JSON objects that are no more than a few thousand characters each, but there are no line breaks between the records.
Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?
The data is in a plain text file. Here is an example of a similar record. The actual records contains many nested dictionaries and lists.
Record in readable format:
{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}
Actual format. New records start one after the other without any breaks.
{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }
Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse data in chunks using the JSONDecoder.raw_decode() method.
The following will yield complete objects as the parser finds them:
from json import JSONDecoder
from functools import partial

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break
This function reads from the given file object in buffersize-sized chunks and has the decoder object parse whole JSON objects out of the buffer. Each parsed object is yielded to the caller.
Use it like this:
with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        pass  # process object
Use this only if your JSON objects are written to a file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document, in which case you can use Loading and parsing a JSON file with multiple JSON objects in Python instead.
Here is a slight modification of Martijn Pieters' solution, which will handle JSON strings separated with whitespace.
import json
import functools

def json_parse(fileobj, decoder=json.JSONDecoder(), buffersize=2048,
               delimiters=None):
    remainder = ''
    for chunk in iter(functools.partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while remainder:
            try:
                stripped = remainder.strip(delimiters)
                result, index = decoder.raw_decode(stripped)
                yield result
                remainder = stripped[index:]
            except ValueError:
                # Not enough data to decode, read more
                break
For example, if data.txt contains JSON strings separated by a space:
{"business_id": "1", "Accepts Credit Cards": true, "Price Range": 1, "type": "food"} {"business_id": "2", "Accepts Credit Cards": true, "Price Range": 2, "type": "cloth"} {"business_id": "3", "Accepts Credit Cards": false, "Price Range": 3, "type": "sports"}
then:
In [47]: list(json_parse(open('data.txt')))
Out[47]:
[{u'Accepts Credit Cards': True,
u'Price Range': 1,
u'business_id': u'1',
u'type': u'food'},
{u'Accepts Credit Cards': True,
u'Price Range': 2,
u'business_id': u'2',
u'type': u'cloth'},
{u'Accepts Credit Cards': False,
u'Price Range': 3,
u'business_id': u'3',
u'type': u'sports'}]
If your JSON document contains a list of objects and you want to read one object at a time, you can use the iterative JSON parser ijson for the job. It will only read more content from the file when it needs to decode the next object.
Note that you should use it with the YAJL library, otherwise you will likely not see any performance increase.
That being said, unless your file is really big, reading it completely into memory and then parsing it with the normal JSON module will probably still be the best option.
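For example, a minimal ijson sketch for a file holding one big top-level JSON array (the file name is assumed; 'item' is ijson's prefix for elements of the root array):
import ijson

with open('big.json', 'rb') as f:
    # yields one decoded object at a time without loading the whole file
    for obj in ijson.items(f, 'item'):
        print(obj)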
