I have a JSON file where strings are encoded in raw_unicode_escape (the file itself is UTF-8). How do I parse it so that strings will be UTF-8 in memory?
For individual properties, I could use the following code, but the JSON is very big and manually converting every string after parsing isn't an option.
# Contents of file 'file.json' ('\u00c3\u00a8' is 'è')
# { "name": "\u00c3\u00a8" }
import json

with open('file.json', 'r') as input:
    j = json.load(input)
    j['name'] = j['name'].encode('raw_unicode_escape').decode('utf-8')
Since the JSON can be quite huge, the approach has to be "incremental" and I cannot read the whole file ahead of time, save it in a string and then do some processing.
Finally, I should note that the JSON is actually stored in a zip file, so instead of open() it's ZipFile.open().
Since codecs.open('file.json', 'r', 'raw_unicode_escape') works somehow, I took a look at its source code and came up with a solution.
>>> import json
>>> from codecs import getreader
>>>
>>> with open('file.json', 'rb') as input:
...     reader = getreader('raw_unicode_escape')(input)
...     j = json.loads(reader.read().encode('raw_unicode_escape'))
...     print(j['name'])
...
è
Of course, that will work even if input is another type of file-like object, like a file inside a zip archive in my case.
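For example, here is a minimal sketch of the same approach with a zip archive; the archive and member names are placeholders for illustration.
import json
from codecs import getreader
from zipfile import ZipFile

with ZipFile('archive.zip') as archive:
    with archive.open('file.json') as input:  # a binary file-like object
        reader = getreader('raw_unicode_escape')(input)
        j = json.loads(reader.read().encode('raw_unicode_escape'))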
Eventually, I turned down the idea of an incremental encoder (it doesn't really make sense with JSON, which has to be parsed as a whole), but for those interested I suggest taking a look at this answer as well as codecs.iterencode().
Related
I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses the linecache module.
I need to extend this to files sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling BytesIO objects, and I was wondering if there is a way to convert a chunk of bytes into a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache).
A small example to illustrate:
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier; it has a smaller memory footprint than the list, as it splits strings off only as the iterator requests them.
The StringIO approach in the answer above needs to pick an encoding up front, which may cause an incorrect conversion.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
This is my first question here. I'm new to Python and trying to figure out how to set up an automatic 3D model processing chain that relies on data stored in JSON files moving from one server to another.
The problem is that I need to store absolute paths to the files being processed, and these absolute paths should be modified in the original JSON files the first time they are processed.
Basically the JSON file comes in like this:
{
"normaldir": "D:\\Outgoing\\1621_1\\",
"projectdir": "D:\\Outgoing\\1622_2\\"
}
And I would like to rename the file paths to
{
"normaldir": "X:\\Incoming\\1621_1\\",
"projectdir": "X:\\Incoming\\1622_2\\",
}
What I've been trying to do is replace the first part of the path using this code, but it isn't working:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as file:
        content = file.read()
        file.seek(0)
        content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
        file.write(content)
However this was not working at all, so I tried parsing the JSON file properly and replacing the keys, using code from here:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as settingsData:
        settings = json.load(settingsData)
        settings['normaldir'] = 'X:\\Incoming\\1621_1\\'
        settings['projectdir'] = 'X:\\Incoming\\1622_2\\'
        settingsData.seek(0)  # rewind to beginning of file
        settingsData.write(json.dumps(settings, indent=2, sort_keys=True))  # write the updated version
        settingsData.truncate()  # truncate the remainder of the data in the file
This works perfectly, but I'm replacing the whole path, so it won't really work for every JSON file that I need to process. What I would really like to do is take a JSON key corresponding to a file path, keep the last 8 characters, and replace the rest of the path with a new string, but I can't figure out how to do this using json in Python; as far as I can tell I can't edit part of a key.
Does anyone have a workaround for this?
Thanks!
Your replace logic failed because you need to reassign content to the new string; str.replace is not an in-place operation, it creates a new string:
content = content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
Using the json approach, just do a replace too, using the current value:
settings['normaldir'] = settings['normaldir'].replace("D:\\Outgoing\\", "X:\\Incoming\\")
You would also want to truncate() before you write, or just reopen the file with w and dump/write the new value. If you really wanted to just keep the last 8 characters and prepend a new prefix:
settings['normaldir'] = "X:\\Incoming\\" + settings['normaldir'][-8:]
Python comes with a json library.
With this library, you can read and write JSON files (or JSON strings).
Parsed data is converted to Python objects and vice versa.
To use the json library, simply import it:
import json
Say your data is stored in the file input_data.json.
input_data_path = "input_data.json"
You read the file like this:
import io
with io.open(input_data_path, mode="rb") as fd:
    obj = json.load(fd)
or, alternatively:
with io.open(input_data_path, mode="rb") as fd:
    content = fd.read()
obj = json.loads(content)
Your data is automatically converted into Python objects; here you get a dict:
print(repr(obj))
# {u'projectdir': u'D:\\Outgoing\\1622_2\\',
# u'normaldir': u'D:\\Outgoing\\1621_1\\'}
Note: I'm using Python 2.7, so you get unicode strings prefixed with "u", like u'projectdir'.
It's now easy to change the values for normaldir and projectdir:
obj["normaldir"] = "X:\\Incoming\\1621_1\\"
obj["projectdir"] = "X:\\Incoming\\1622_2\\"
Since obj is a dict, you can also use the update method like this:
obj.update({'normaldir': "X:\\Incoming\\1621_1\\",
'projectdir': "X:\\Incoming\\1622_2\\"})
That way, you use a syntax similar to JSON.
Finally, you can write your Python object back to JSON file:
output_data_path = "output_data.json"
with io.open(output_data_path, mode="wb") as fd:
    json.dump(obj, fd)
or, alternatively with indentation:
content = json.dumps(obj, indent=True)
with io.open(output_data_path, mode="wb") as fd:
    fd.write(content)
Remark: reading/writing the JSON this way is faster when you go through a string buffer (the content variable).
.replace returns a new string and doesn't change the original. But you shouldn't treat JSON files as plain text files anyway, so you can combine parsing the JSON with the replace:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r') as settingsData:
        settings = json.load(settingsData)
    settings = {k: v.replace("D:\\Outgoing\\", "X:\\Incoming\\")
                for k, v in settings.items()}
    with open(configfile, 'w') as settingsData:
        json.dump(settings, settingsData)
If I want to save a dictionary to a file and read that dictionary back from the file later, I have two methods, but I do not know the differences between them. Could anyone explain?
Here is a simple example. Suppose this is my dictionary:
D = {'zzz':123,
'lzh':321,
'cyl':333}
The first method to save it to the file:
with open('tDF.txt','w') as f:  # save
    f.write(str(D) + '\n')
with open('tDF.txt','r') as f:
    Data = f.read()  # read. Data is a string
Data = eval(Data)    # convert back to a dictionary
The second method (using pickle):
import pickle
with open('tDF.txt','wb') as f:  # save (pickle needs a binary file)
    pickle.dump(D, f)
with open('tDF.txt','rb') as f:
    D = pickle.load(f)  # D is a dictionary again
I think the first method is much simpler. What are the differences?
Thanks!
Writing the str representation
If you write the str value of your data, you rely on it being properly shaped.
In some cases (e.g. float numbers, but also more complex objects) you would lose some precision or information.
Using repr instead of str might improve the situation a bit, as repr is supposed to provide the text in a form which is likely to work when reading it back (but without any guarantee).
Writing pickled data
Pickle takes care of every bit, so you get precise serialized information.
This is quite a significant difference.
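As a small illustration of that difference (the file name here is made up), pickle round-trips the values exactly:
import pickle

D = {'pi': 3.141592653589793, 'tiny': 1e-300}

with open('tDF.pkl', 'wb') as f:  # pickle wants a binary file
    pickle.dump(D, f)

with open('tDF.pkl', 'rb') as f:
    restored = pickle.load(f)

assert restored == D  # the floats survive exactly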
Using other serialization methods
Personally, I prefer serializing into JSON or sometimes YAML, as these formats are human-readable, portable, and can even be edited by hand.
Serialize to JSON
For json it works this way:
import json

data = {"a": "aha", "b": "bebe", "age": 123, "num": 3.1415}

with open("data.json", "w") as f:
    json.dump(data, f)

with open("data.json", "r") as f:
    readdata = json.load(f)

print(readdata)
Serialize to YAML
With YAML:
First be sure you have some YAML lib installed, e.g.:
$ pip install pyyaml
Personally, I have it installed all the time, as I use it very often.
Then the script changes only a bit:
import yaml

data = {"a": "aha", "b": "bebe", "age": 123, "num": 3.1415}

with open("data.yaml", "w") as f:
    yaml.dump(data, f)

with open("data.yaml", "r") as f:
    readdata = yaml.safe_load(f)

print(readdata)
Conclusions
For rather simple data types, the methods described above work easily.
In case you start using instances of classes you have defined, you would need to provide proper loaders and serializers for the given formats. Describing that is out of scope of this question, but it is definitely possible in all cases where a solution exists (there are types of values which cannot be serialized reliably, such as file pointers, database connections, etc.).
I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:
{
"key11": value11,
"key12": value12,
}
{
"key21": value21,
"key22": value22,
}
…
The way I'm importing it now is:
content = open(file_path, "r").read()
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")
Which seems like a hack (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list).
Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?
Also, Python can't seem to properly allocate memory for an object built from 2GB of data. Is there a way to construct each JSON object while reading the file line by line? Thanks!
Just read each line and construct a JSON object as you go:
with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)
This way, you load a proper, complete JSON object (provided there is no \n inside a JSON value or in the middle of a JSON object) and you avoid memory issues, since each object is created only when needed.
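For example, a rough sketch of processing each record as it is read and then discarding it; "records.jsonl" and the key name are placeholders:
import json

total = 0
with open('records.jsonl') as f:
    for line in f:
        obj = json.loads(line)
        total += obj.get('key11', 0)  # use the object, then let it go
print(total)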
There is also this answer:
https://stackoverflow.com/a/7795029/671543
contents = open(file_path, "r").read()
data = [json.loads(str(item)) for item in contents.strip().split('\n')]
This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.
{
"key11": 11,
"key12": 12
}
{
"key21": 21,
"key22": 22
}
Just read line-by-line, and build the JSON blocks as you go:
import json

with open(args.infile, 'r') as infile:
    # Variable for building our JSON block
    json_block = []
    for line in infile:
        # Add the line to our JSON block
        json_block.append(line)
        # Check whether we closed our JSON block
        if line.startswith('}'):
            # Do something with the JSON dictionary
            json_dict = json.loads(''.join(json_block))
            print(json_dict)
            # Start a new block
            json_block = []
If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
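As a rough illustration of that callback idea (the file name and the filtering rule below are made up), object_hook is called for every decoded JSON object, so you can trim each one before it is kept:
import json

def keep_small(obj):
    # Drop any list-valued entries so only the small fields are retained.
    return {k: v for k, v in obj.items() if not isinstance(v, list)}

with open('big.json') as f:
    data = json.load(f, object_hook=keep_small)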
This expands Cohen's answer:
import json
import pandas as pd

content_object = s3_resource.Object(BucketName, KeyFileName)
file_buffer = content_object.get()['Body'].read().decode('utf-8')

json_lines = []
for line in file_buffer.splitlines():
    j_content = json.loads(line)
    json_lines.append(j_content)

df_readback = pd.DataFrame(json_lines)
This assumes that the entire file will fit in memory. If it is too big then this will have to be modified to read in chunks or use Dask.
I had to read some data from AWS S3 and parse a newline-delimited JSONL file. My solution was this, using splitlines:
The code:
for line in json_input.splitlines():
    one_json = json.loads(line)
The line-by-line reading approach is good, as mentioned in some of the answers above.
However, across multiple JSON tree structures I would recommend decomposing this into two functions for more robust error handling.
For example,
def load_cases(file_name):
    with open(file_name) as file:
        cases = (parse_case_line(json.loads(line)) for line in file)
        cases = filter(None, cases)
        return list(cases)
parse_case_line can encapsulate the key-parsing logic required in your example above, for example with regex matching or application-specific requirements. It also means that you can select which JSON key-values you want to parse out.
Another advantage of this approach is that filter handles multiple \n in the middle of your JSON object and still parses the whole file :-).
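One possible shape for that helper, purely as a sketch; the key names and the filtering rule are placeholders for whatever your records actually contain:
def parse_case_line(record):
    # Skip records that are missing the field we care about.
    if 'key11' not in record:
        return None
    return {'id': record['key11'], 'value': record.get('key12')}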
Just read it line by line and parse each one through a stream.
Your hack (adding commas between each JSON string plus opening and closing square brackets to make it a proper list) isn't memory-friendly if the file is more than 1GB, as the whole content will land in RAM.
So, I saved a list to a file as a string. In particular, I did:
f = open('myfile.txt','w')
f.write(str(mylist))
f.close()
But, later when I open this file again, take the (string-ified) list, and want to change it back to a list, what happens is something along these lines:
>>> list('[1,2,3]')
['[', '1', ',', '2', ',', '3', ']']
Could I make it so that I got the list [1,2,3] from the file?
There are two main options here. First, using ast.literal_eval:
>>> import ast
>>> ast.literal_eval('[1,2,3]')
[1, 2, 3]
Unlike eval, this is safer since it will only evaluate Python literals, such as lists, dictionaries, None, strings, etc. It would throw an error if the string contained code rather than a literal.
Second, make use of the json module, using json.loads:
>>> import json
>>> json.loads('[1,2,3]')
[1, 2, 3]
A great advantage of using json is that it's cross-platform, and you can also write to file easily.
with open('data.txt', 'w') as f:
    json.dump([1, 2, 3], f)
In [285]: import ast
In [286]: ast.literal_eval('[1,2,3]')
Out[286]: [1, 2, 3]
Use ast.literal_eval instead of eval whenever possible:
ast.literal_eval:
Safely evaluate[s] an expression node or a string containing a Python
expression. The string or node provided may only consist of the
following Python literal structures: strings, numbers, tuples, lists,
dicts, booleans, and None.
Edit: Also, consider using json. json.loads operates on a different string format, but is generally faster than ast.literal_eval. (So if you use json.load, be sure to save your data using json.dump.) Moreover, the JSON format is language-independent.
Python developers traditionally use pickle to serialize their data and write it to a file.
You could do so like this:
import pickle
mylist = [1,2,3]
f = open('myfile', 'wb')
pickle.dump(mylist, f)
f.close()
And then reopen like so:
import pickle
f = open('myfile', 'rb')
mylist = pickle.load(f) # [1,2,3]
I would write Python objects to file using the built-in json encoding or, if you don't like JSON, with pickle and cPickle. Both allow for easy deserialization and serialization of data. I'm on my phone but when I get home I'll upload sample code.
EDIT:
Ok, get ready for a ton of Python code, and some opinions...
JSON
Python has built-in support for JSON, or JavaScript Object Notation, a lightweight data interchange format. JSON supports Python's basic data types, such as dictionaries (which JSON calls objects: basically just key-value pairs) and lists (comma-separated values encapsulated by [ and ]). For more information on JSON, see this resource. Now to the code:
import json  # Don't forget to import

my_list = [1, 2, "blue", ["sub", "list", 5]]
with open('/path/to/file', 'w') as f:
    string_to_write = json.dumps(my_list)  # Dump the list to a string
    f.write(string_to_write)
# The character string [1, 2, "blue", ["sub", "list", 5]] is written to the file
Note that the with statement will close the file automatically when the block finishes executing.
To load the string back, use
with open('/path/to/file', 'r') as f:
    string_from_file = f.read()
    my_list = json.loads(string_from_file)  # Load the string
# my_list is now the Python object [1, 2, "blue", ["sub", "list", 5]]
I like JSON. Use JSON unless you really, really have a good reason for not.
CPICKLE
Another method for serializing Python data to a file is called pickling, in which we write more than we 'need' to the file so that we have some meta-information about how the characters in the file relate to Python objects. There is a built-in pickle module, but we'll use cPickle, because it is implemented in C and is much, much faster than pickle (about 100X, but I don't have a citation for that number). The dumping code then becomes
import cPickle  # Don't forget to import (Python 2; on Python 3 use pickle)

with open('/path/to/file', 'wb') as f:
    string_to_write = cPickle.dumps(my_list)  # Dump the list to a byte string
    f.write(string_to_write)
# A very weird character string is written to the file, but it does contain the contents of our list
To load, use
with open('/path/to/file', 'rb') as f:
    string_from_file = f.read()
    my_list = cPickle.loads(string_from_file)  # Load the string
# my_list is now the Python object [1, 2, "blue", ["sub", "list", 5]]
Comparison
Note the similarities between the code we wrote using JSON and the code we wrote using cPickle. In fact, the only major difference between the two methods is what text (which characters) actually gets written to the file. I believe JSON is faster and more space-efficient than cPickle - but cPickle is a valid alternative. Also, the JSON format is much more universal than cPickle's weird syntax.
A note on eval
Please don't use eval() haphazardly. It seems like you're relatively new to Python, and eval can be a risky function to jump right into. It allows for the unchecked evaluation of any Python code, and as such can a) be risky if you ever rely on the user to input text, and b) lead to sloppy, non-Pythonic code.
That last point is just my two cents.
tl;dr: Use JSON to dump and load Python objects to file.
Write to file without brackets: f.write(str(mylist)[1:-1])
After reading the line, split it to get a list: data = line.split(',')
To convert to integers: data = list(map(int, data))
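Tying those three steps together, a minimal sketch (the file name is made up):
mylist = [1, 2, 3]

# write the list without brackets: "1, 2, 3"
with open('myfile.txt', 'w') as f:
    f.write(str(mylist)[1:-1])

# read it back, split on commas and convert each piece to an int
with open('myfile.txt', 'r') as f:
    line = f.readline()
data = list(map(int, line.split(',')))  # [1, 2, 3]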