I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:
{
"key11": value11,
"key12": value12,
}
{
"key21": value21,
"key22": value22,
}
…
The way I'm importing it now is:
content = open(file_path, "r").read()
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")
This seems like a hack (adding commas between each JSON string, plus a beginning and ending square bracket, to make it a proper list).
Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?
Also, Python can't seem to properly allocate memory for an object built from 2GB of data; is there a way to construct each JSON object as I'm reading the file line by line? Thanks!
Just read each line and construct a JSON object as you go:
with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)
This way, you load a proper, complete JSON object on each iteration (provided there is no \n inside a JSON value or in the middle of an object), and you avoid memory issues because each object is created only when it is needed.
There is also this answer:
https://stackoverflow.com/a/7795029/671543
contents = open(file_path, "r").read()
data = [json.loads(str(item)) for item in contents.strip().split('\n')]
This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.
{
"key11": 11,
"key12": 12
}
{
"key21": 21,
"key22": 22
}
Just read line-by-line, and build the JSON blocks as you go:
with open(args.infile, 'r') as infile:
# Variable for building our JSON block
json_block = []
for line in infile:
# Add the line to our JSON block
json_block.append(line)
# Check whether we closed our JSON block
if line.startswith('}'):
# Do something with the JSON dictionary
json_dict = json.loads(''.join(json_block))
print(json_dict)
# Start a new block
json_block = []
If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
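As a rough illustration of object_pairs_hook (the file name and key names below are made up), the hook is called once for every object as it is decoded, so you can shrink each one before it is kept:

import json

def keep_summary(pairs):
    # Called for every JSON object as it is decoded (innermost objects first).
    # Keeping only the fields you need reduces what ends up in memory.
    # The key names here are invented for illustration.
    obj = dict(pairs)
    return {k: obj[k] for k in ("name", "value") if k in obj}

with open("huge.json") as f:   # hypothetical file name
    data = json.load(f, object_pairs_hook=keep_summary)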
This expands Cohen's answer:
content_object = s3_resource.Object(BucketName, KeyFileName)
file_buffer = content_object.get()['Body'].read().decode('utf-8')

json_lines = []
for line in file_buffer.splitlines():
    j_content = json.loads(line)
    json_lines.append(j_content)

df_readback = pd.DataFrame(json_lines)
This assumes that the entire file will fit in memory. If it is too big then this will have to be modified to read in chunks or use Dask.
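As one possible "read in chunks" variant, pandas itself can iterate over a newline-delimited JSON file with lines=True and a chunksize, yielding DataFrames piece by piece instead of one big frame; the file name, chunk size, and process() below are placeholders:

import pandas as pd

# read_json with lines=True and chunksize returns an iterator of DataFrames.
for chunk in pd.read_json("records.jsonl", lines=True, chunksize=10_000):
    process(chunk)   # hypothetical per-chunk processing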
I had to read some data from AWS S3 and parse a newline-delimited JSONL file. My solution was to use splitlines.
The code:
for line in json_input.splitlines():
    one_json = json.loads(line)
The line-by-line reading approach is good, as mentioned in some of the answers above.
However, across multiple JSON tree structures I would recommend decomposing the work into two functions for more robust error handling.
For example,
def load_cases(file_name):
    with open(file_name) as file:
        cases = (parse_case_line(json.loads(line)) for line in file)
        cases = filter(None, cases)
        return list(cases)
parse_case_line can encapsulate the key-parsing logic required in your example above, for example with regex matching or application-specific requirements. It also means you can select which JSON key-values you want to parse out.
Another advantage of this approach is that filter handles multiple \n characters in the middle of your JSON data, so the whole file still gets parsed :-).
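For illustration, parse_case_line might look something like this; the field names are made up, and the None return is what filter(None, cases) drops in load_cases:

def parse_case_line(record):
    # Hypothetical per-record logic: keep only the fields you care about
    # and return None for records that should be skipped.
    if "name" not in record:
        return None
    return {"name": record["name"], "value": record.get("value")}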
Just read it line by line and parse it through a stream. Your hacky trick (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list) isn't memory-friendly if the file is much more than 1GB, since the whole content will land in RAM.
Related
I have a huge HTML file that I have converted to a text file. (The file is the Facebook home page's source.) Assume the text file has a specific keyword in some places of it. For example: "some_keyword: [bla bla]". How would I print all the different bla blas that follow some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names that follow "name:", considering the text is very large and the program crashes when you read() the whole file or try to search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment: since you are the person responsible for writing the data to the file, write the data in JSON format and read it back from the file using json.loads(), like this:
import json

json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)

for item in json_data:
    print(item['name'])
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which changes dynamically within your code where you perform the write operation on the file. Instead of writing it straight to the file, append it to a list:
a = []
for item in page_content:
    # data = some xy logic on HTML file
    a.append(data)
Now write this list to the file using json.dump().
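For example, assuming a holds the collected records ("names.json" is just a placeholder path):

import json

# Write the accumulated list out as one valid JSON document.
with open("names.json", "w") as f:
    json.dump(a, f)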
I just wanted to throw this out there, even though I agree with all the comments about dealing with the HTML directly or using Facebook's API (probably the safest way): open file objects in Python can be used as generators that yield lines without reading the entire file into memory, and the re module can be used to extract information from text.
This can be done like so:
import re

regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")

with open("filename.txt", "r") as fp:
    for line in fp:
        for match in regex.findall(line):
            print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
Here are the Python 2 docs for the re module.
Here are the Python 3 docs for the re module.
I cannot find documentation that details the generator capabilities of file objects in Python; it seems to be one of those well-known secrets... Please feel free to edit and remove this paragraph if you know where in the Python docs this is covered.
This is my first question here. I'm new to Python and trying to figure some things out to set up an automatic 3D model processing chain that relies on data being stored in JSON files moving from one server to another.
The problem is that I need to store absolute paths to files that are being processed, but these absolute paths should be modified in the original JSON files the first time they are processed.
Basically the JSON file comes in like this:
{
"normaldir": "D:\\Outgoing\\1621_1\\",
"projectdir": "D:\\Outgoing\\1622_2\\"
}
And I would like to rename the file paths to
{
"normaldir": "X:\\Incoming\\1621_1\\",
"projectdir": "X:\\Incoming\\1622_2\\",
}
What I've been trying to do is replace the first part of the path using this code, but it isn't working:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as file:
        content = file.read()
        file.seek(0)
        content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
        file.write(content)
However this was not working at all, so I tried interpreting the JSON file properly and replacing the keys, adapting code from here:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as settingsData:
        settings = json.load(settingsData)
        settings['normaldir'] = 'X:\\Incoming\\1621_1\\'
        settings['projectdir'] = 'X:\\Incoming\\1622_2\\'
        settingsData.seek(0)  # rewind to beginning of file
        settingsData.write(json.dumps(settings, indent=2, sort_keys=True))  # write the updated version
        settingsData.truncate()  # truncate the remainder of the data in the file
This works perfectly; however, I'm replacing the whole path, so it won't really work for every JSON file I need to process. What I would really like to do is take a JSON key corresponding to a file path, keep the last 8 characters, and replace the rest of the path with a new string, but I can't figure out how to do this using json in Python; as far as I can tell I can't edit part of a key.
Does anyone have a workaround for this?
Thanks!
Your replace logic failed because you need to reassign content to the new string; str.replace is not an in-place operation, it creates a new string:
content = content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
Using the json approach, just do a replace too, using the current value:
settings['normaldir'] = settings['normaldir'].replace("D:\\Outgoing\\", "X:\\Incoming\\")
You would also want to truncate() before you write, or just reopen the file with 'w' and dump/write the new value. If you really wanted to just keep the last 8 characters and prepend a new string:
settings['normaldir'] = "X:\\Incoming\\" + settings['normaldir'][-8:]
Python comes with a json library.
With this library, you can read and write JSON files (or JSON strings).
Parsed data is converted to Python objects and vice versa.
To use the json library, simply import it:
import json
Say your data is stored in the input_data.json file.
input_data_path = "input_data.json"
You read the file like this:
import io

with io.open(input_data_path, mode="rb") as fd:
    obj = json.load(fd)
or, alternatively:
with io.open(input_data_path, mode="rb") as fd:
    content = fd.read()
obj = json.loads(content)
Your data is automatically converted into Python objects; here you get a dict:
print(repr(obj))
# {u'projectdir': u'D:\\Outgoing\\1622_2\\',
# u'normaldir': u'D:\\Outgoing\\1621_1\\'}
Note: I'm using Python 2.7, so you get unicode strings prefixed with "u", like u'projectdir'.
It's now easy to change the values for normaldir and projectdir:
obj["normaldir"] = "X:\\Incoming\\1621_1\\"
obj["projectdir"] = "X:\\Incoming\\1622_2\\"
Since obj is a dict, you can also use the update method like this:
obj.update({'normaldir': "X:\\Incoming\\1621_1\\",
            'projectdir': "X:\\Incoming\\1622_2\\"})
That way, the syntax looks similar to JSON.
Finally, you can write your Python object back to a JSON file:
output_data_path = "output_data.json"

with io.open(output_data_path, mode="wb") as fd:
    json.dump(obj, fd)
or, alternatively with indentation:
content = json.dumps(obj, indent=True)

with io.open(output_data_path, mode="wb") as fd:
    fd.write(content)
Remarks: reading/writing JSON objects is faster with a buffer (the content variable).
.replace returns a new string and doesn't change the original. But you shouldn't treat JSON files as normal text files anyway, so you can combine parsing the JSON with the replace:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r') as settingsData:
        settings = json.load(settingsData)
    settings = {k: v.replace("D:\\Outgoing\\", "X:\\Incoming\\")
                for k, v in settings.items()}
    with open(configfile, 'w') as settingsData:
        json.dump(settings, settingsData)
I'm trying to parse a large (~100MB) JSON file using the ijson package, which allows me to interact with the file in an efficient way. However, after writing some code like this,
import ijson

with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == "name":
            print(value)
I found that the code parses only the first line and not the rest of the lines from the file!!
Here is what a portion of my JSON file looks like:
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}
I think ijson parses only one JSON object.
Can someone please suggest how to work around this?
Since the provided chunk looks more like a set of lines, each containing an independent JSON object, it should be parsed accordingly:
# each JSON object is small, there's no need for iterative processing
import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        # data['name'], data['value'], data['timestamp'] now
        # contain the corresponding values
Unfortunately the ijson library (v2.3 as of March 2018) does not handle parsing multiple JSON objects. It can only handle 1 overall object, and if you attempt to parse a second object, you will get an error: "ijson.common.JSONError: Additional data". See bug reports here:
https://github.com/isagalaev/ijson/issues/40
https://github.com/isagalaev/ijson/issues/42
https://github.com/isagalaev/ijson/issues/67
python: how do I parse a stream of json arrays with ijson library
It's a big limitation. However, as long as you have line breaks (new line character) after each JSON object, you can parse each one line-by-line independently, like this:
import io
import ijson

with open(filename, encoding="UTF-8") as json_file:
    cursor = 0
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        for prefix, type, value in json_parser:
            print("prefix=", prefix, "type=", type, "value=", value)
        cursor += len(line)
You are still streaming the file, not loading it entirely into memory, so it can work on large JSON files. It also uses the line-streaming technique from "How to jump to a particular line in a huge text file?" and enumerate() from "Accessing the index in 'for' loops?".
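As a side note: newer releases of the ijson package (the currently maintained fork, not the old v2.3) document a multiple_values option that accepts concatenated top-level JSON documents directly, so if upgrading is an option, something along these lines may remove the need for a parser per line; check your installed version before relying on it:

import ijson

# Sketch assuming an ijson release that supports multiple_values.
with open(filename, encoding="UTF-8") as json_file:
    for record in ijson.items(json_file, "", multiple_values=True):
        print(record["name"], record["value"], record["timestamp"])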
I am downloading JSON files from an API and use the following code to write the JSON. Each item in the loop gives me a JSON file. I need to save it and later extract entities from the appended JSON file using a loop.
for item in style_ls:
    dat = get_json(api, item)
    specs_dict[item] = dat
    with open("specs_append.txt", "a") as myfile:
        json.dump(dat, myfile)
        myfile.close()
    print(item)

with open("specs_data.txt", "w") as myfile:
    json.dump(specs_dict, myfile)
    myfile.close()
I know that I cannot get a valid JSON format from specs_append.txt, but I can get one from specs_data.txt. I am doing the first one just because my program needs at least 3-4 days to complete and there is a high chance that my system may shut down. So is there any way I can do this efficiently?
If not, is there any way I can extract it from the specs_append.txt <{JSON}{JSON}> format (which is not a valid JSON format)?
If not, should I write specs_dict to a txt file every time in the loop, so that even if the program gets terminated I can start from that point in the loop and still get a valid JSON format?
I suggest several possible solutions.
One solution is to write custom code to slurp in the input file. I would suggest putting a special line before each JSON object in the file, such as: ###
Then you could write code like this:
import json

SPECIAL_LINE = '###\n'  # the marker line written before each JSON object

def json_get_objects(f):
    temp = ''
    line = next(f)  # pull first line
    assert line == SPECIAL_LINE
    for line in f:
        if line != SPECIAL_LINE:
            temp += line
        else:
            # found special marker, temp now contains a complete JSON object
            j = json.loads(temp)
            yield j
            temp = ''
    # after loop done, yield up last JSON object
    if temp:
        j = json.loads(temp)
        yield j

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
Two notes on this. First, I am simply appending to a string over and over; this used to be a very slow way to do this in Python, so if you are using a very old version of Python, don't do it this way unless your JSON objects are very small. Second, I wrote code to split the input and yield up JSON objects one at a time, but you could also use a guaranteed-unique string, slurp in all the data with a single call to f.read() and then split on your guaranteed-unique string using the str.split() method function.
Another solution would be to write the whole file as a valid JSON list of valid JSON objects. Write the file like this:
{"mylist":[
# first JSON object, followed by a comma
# second JSON object, followed by a comma
# third JSON object
]}
This would require your file appending code to open the file with writing permission, and seek to the last ] in the file before writing a comma plus newline, then the new JSON object on the end, and then finally writing ]} to close out the file. If you do it this way, you can use json.loads() to slurp the whole thing in and have a list of JSON objects.
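To make that appending step concrete, here is a rough sketch; it assumes the file already contains at least one object inside {"mylist":[ ... ]} and ends with "]}", and append_record is just a name I made up:

import json
import os

def append_record(path, obj):
    # Sketch only: assumes the file ends with the closing "]}" of {"mylist":[...]}
    # and already holds at least one object (otherwise the leading comma is wrong).
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        f.seek(max(end - 8, 0))
        tail = f.read()
        # Cut the file just before the list's closing ']' ...
        f.seek(max(end - 8, 0) + tail.rindex(b"]"))
        f.truncate()
        # ... then write a comma, the new object, and close the list again.
        f.write(b",\n" + json.dumps(obj).encode("utf-8") + b"\n]}\n")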
Finally, I suggest that maybe you should just use a database. Use SQLite or something and just throw the JSON strings in to a table. If you choose this, I suggest using an ORM to make your life simple, rather than writing SQL commands by hand.
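The answer suggests an ORM, but even the standard library's sqlite3 module is enough to show the idea; this is only a sketch, with a made-up database and table name, reusing item and dat from the question's loop:

import json
import sqlite3

conn = sqlite3.connect("specs.db")
conn.execute("CREATE TABLE IF NOT EXISTS specs (item TEXT, body TEXT)")

# Inside the download loop: store each record as soon as it arrives,
# so a crash loses at most the item currently in flight.
conn.execute("INSERT INTO specs VALUES (?, ?)", (item, json.dumps(dat)))
conn.commit()

# Later (or after a restart): read everything back and decode.
specs_dict = {row[0]: json.loads(row[1])
              for row in conn.execute("SELECT item, body FROM specs")}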
Personally, I favor the first suggestion: write in a special line like ###, then have custom code to split the input on those marks and then get the JSON objects.
EDIT: Okay, the first suggestion was sort of assuming that the JSON was formatted for human readability, with a bunch of short lines:
{
"foo": 0,
"bar": 1,
"baz": 2
}
But it's all run together as one big long line:
{"foo":0,"bar":1,"baz":2}
Here are three ways to fix this.
0) write a newline before the ### and after it, like so:
###
{"foo":0,"bar":1,"baz":2}
###
{"foo":0,"bar":1,"baz":2}
Then each input line will alternately be ### or a complete JSON object.
1) As long as SPECIAL_LINE is completely unique (never appears inside a string in the JSON) you can do this:
with open("specs_data.txt", "r") as f:
temp = f.read() # read entire file contents
lst = temp.split(SPECIAL_LINE)
json_objects = [json.loads(x) for x in lst]
for j in json_objects:
pass # do something with JSON object j
The .split() method function can split up the temp string into JSON objects for you.
2) If you are certain that each JSON object will never have a newline character inside it, you could simply write JSON objects to the file, one after another, putting a newline after each; then assume that each line is a JSON object:
import json

def json_get_objects(f):
    for line in f:
        if line.strip():
            yield json.loads(line)

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
I like the simplicity of option (2), but I like the reliability of option (0). If a newline ever got written in as part of a JSON object, option (0) would still work, but option (2) would error.
Again, you can also simply use an actual database (SQLite) with an ORM and let the database worry about the details.
Good luck.
Append the JSON data to a dict on every loop iteration.
At the end, dump this dict as JSON and write it to a file.
To give you an idea of appending data to a dict:
>>> d1 = {'suku':12}
>>> t1 = {'suku1':212}
>>> d1.update(t1)
>>> d1
{'suku1': 212, 'suku': 12}
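At the end, the combined dict can be written out in one go; the file name here is only an example:

import json

with open("specs_data.json", "w") as f:
    json.dump(d1, f)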
I've created a very simple piece of code to read in tweets in JSON format from text files, determine whether they contain an id and coordinates, and if so, write these attributes to a CSV file. This is the code:
f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))

all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
    with open(filename, 'r') as file:
        data = simplejson.load(file)
        if 'text' and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
I've been having some difficulties, but with the help of the excellent JSON Lint website I have realised my mistake. I have multiple JSON objects, and from what I read these need to be separated by commas and have square brackets added to the start and end of the file.
How can I achieve this? I've seen some examples online where each individual line is read and it's added to the first and last line, but as I load the whole file I'm not entirely sure how to do this.
You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).
You can still repair this by using some creative re-stitching. The following generator function should do it:
import json

def read_objects(filename):
    decoder = json.JSONDecoder()
    with open(filename, 'r') as inputfile:
        line = next(inputfile).strip()
        while line:
            try:
                obj, index = decoder.raw_decode(line)
                yield obj
                line = line[index:]
            except ValueError:
                # Assume we didn't have a complete object yet
                line += next(inputfile).strip()
            if not line:
                line += next(inputfile).strip()
This should be able to read all your JSON objects in sequence:
for filename in all_files:
    for data in read_objects(filename):
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
It is otherwise fine to have multiple JSON strings written to one file, but you need to make sure that the entries are clearly separated somehow. Writing JSON entries that do not use newlines, then using newlines in between them, for example, makes sure you can later on read them one by one again and process them sequentially without this much hassle.
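For example, when writing the tweets out in the first place, something along these lines keeps every entry on its own line (json.dumps emits no newlines by default); tweets here stands in for whatever produces your tweet dicts, and the output path is just an example:

import json

with open("tweets.jsonl", "w") as outfile:
    for tweet in tweets:
        outfile.write(json.dumps(tweet) + "\n")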