Splitting 250GB JSON file containing multiple tables into parquet - python

I have a JSON file with the following example format:
{
  "Table1": {
    "Records": [
      {
        "Key1Tab1": "SomeVal",
        "Key2Tab1": "AnotherVal"
      },
      {
        "Key1Tab1": "SomeVal2",
        "Key2Tab1": "AnotherVal2"
      }
    ]
  },
  "Table2": {
    "Records": [
      {
        "Key1Tab1": "SomeVal",
        "Key2Tab1": "AnotherVal"
      },
      {
        "Key1Tab1": "SomeVal2",
        "Key2Tab1": "AnotherVal2"
      }
    ]
  }
}
The root keys are table names from an SQL database, and their corresponding values are the rows.
I want to split the JSON file into separate parquet files, each representing a table.
I.e. Table1.parquet and Table2.parquet.
The big issue is the size of the file, which prevents me from loading it into memory.
Hence, I tried to use dask.bag to accommodate the nested structure of the file.
import dask.bag as db
from dask.distributed import Client
client = Client(n_workers=4)
lines = db.read_text("filename.json")
But inspecting the output with lines.take(4) shows that dask can't read the newlines correctly:
('{\n', ' "Table1": {\n', ' "Records": [\n', ' {\n')
I've tried to search for solutions to this specific problem, but without luck.
Is there any chance that the splitting can be solved with dask, or are there other tools that could do the job?

As suggested here, try the dask.dataframe.read_json() method.
This may be sufficient, though I am unsure how it will behave if you don't have enough memory to store the entire resulting dataframe in memory.
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
df = dd.read_json("filename.json")
df.to_parquet("filename.parquet", engine='pyarrow')
docs
https://distributed.dask.org/en/latest/manage-computation.html#dask-collections-to-futures
https://examples.dask.org/dataframes/01-data-access.html#Write-to-Parquet
If Dask doesn't process the file in chunks when on a single system (it may not happily do so, as JSON is distinctly unfriendly to parse in such a way, though I unfortunately don't have access to my test system to verify this) and the system memory is unable to handle the giant file, you may be able to extend the system memory with disk space by creating a big swapfile.
Note that this will create a ~300G file (increase the count field for more) and may be incredibly slow compared to memory (but perhaps still fast enough for your needs, especially if it's a one-off).
# create and configure swapfile
dd if=/dev/zero of=swapfile.img bs=10M count=30000 status=progress
chmod 600 swapfile.img
mkswap swapfile.img
swapon swapfile.img
#
# run memory-greedy task
# ...
# ensure processes have exited
#
# disable and remove swapfile to reclaim disk space
swapoff swapfile.img # may hang for a long time
rm swapfile.img

The problem is that dask will split the file on newline characters by default, and you can't guarantee that these newlines won't fall in the middle of one of your tables. Indeed, even if you get it right, you still need to manipulate the resultant text to make complete JSON objects for each partition.
For example:
import json

def myfunc(x):
    x = "".join(x)
    if not x.endswith("}"):
        x = x[:-2] + "}"
    if not x.startswith("{"):
        x = "{" + x
    return [json.loads(x)]
db.read_text('temp.json',
             linedelimiter="\n },\n",
             blocksize=100).map_partitions(myfunc)
In this case, I have purposefully made the blocksize smaller than each part to demonstrate: you will get a JSON object or nothing for each partition.
_.compute()
[{'Table1': {'Records': [{'Key1Tab1': 'SomeVal', 'Key2Tab1': 'AnotherVal'},
    {'Key1Tab1': 'SomeVal2', 'Key2Tab1': 'AnotherVal2'}]}},
 {},
 {'Table2': {'Records': [{'Key1Tab1': 'SomeVal', 'Key2Tab1': 'AnotherVal'},
    {'Key1Tab1': 'SomeVal2', 'Key2Tab1': 'AnotherVal2'}]}},
 {},
 {},
 {}]
Of course, in your case you can immediately do something with the JSON rather than return it, or you can map to your writing function next in the chain.
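For instance, a minimal sketch of mapping a parquet-writing step into the chain (the pandas/pyarrow writing step, the per-table file names and the larger blocksize are my assumptions, not part of the answer above):

import dask.bag as db
import pandas as pd  # to_parquet requires pyarrow or fastparquet

def write_tables(parts):
    # parts is the output of myfunc above: a list holding zero or one
    # dict of the form {"TableName": {"Records": [...]}}
    for obj in parts:
        for table_name, content in obj.items():
            pd.DataFrame(content["Records"]).to_parquet(f"{table_name}.parquet")
    return parts

(db.read_text("filename.json",
              linedelimiter="\n },\n",   # match this to your file's actual indentation
              blocksize="128MiB")        # more realistic than the demo's 100 bytes
   .map_partitions(myfunc)
   .map_partitions(write_tables)
   .compute())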

When working with large files, the key to success is processing the data as a stream, i.e. with filter-like programs.
The JSON format is easy to parse. The following program reads the input char by char (I/O should be buffered) and cuts the top-level JSON object into separate objects. It properly follows the data structure, not the formatting.
The demo program just prints "--- NEXT OUTPUT FILE ---" where a real output-file switch should be implemented (see the sketch at the end of this answer). Whitespace stripping is implemented as a bonus.
import collections

OBJ = 'object'
LST = 'list'

def out(ch):
    print(ch, end='')

with open('json.data') as f:
    stack = collections.deque(); push = stack.append; pop = stack.pop
    esc = string = False
    while (ch := f.read(1)):
        if esc:
            esc = False
        elif ch == '\\':
            esc = True
        elif ch == '"':
            string = not string
        if not string:
            if ch in {' ', '\t', '\r', '\n'}:
                continue
            if ch == ',':
                if len(stack) == 1 and stack[0] == OBJ:
                    out('}\n')
                    print("--- NEXT OUTPUT FILE ---")
                    out('{')
                    continue
            elif ch == '{':
                push(OBJ)
            elif ch == '}':
                if pop() is not OBJ:
                    raise ValueError("unmatched { }")
            elif ch == '[':
                push(LST)
            elif ch == ']':
                if pop() is not LST:
                    raise ValueError("unmatched [ ]")
        out(ch)
Here is a sample output for my testfile:
{"key1":{"name":"John","surname":"Doe"}}
--- NEXT OUTPUT FILE ---
{"key2":"string \" ] }"}
--- NEXT OUTPUT FILE ---
{"key3":13}
--- NEXT OUTPUT FILE ---
{"key4":{"sub1":[null,{"l3":true},null]}}
The original file was:
{
  "key1": {
    "name": "John",
    "surname": "Doe"
  },
  "key2": "string \" ] }", "key3": 13,
  "key4": {
    "sub1": [null, {"l3": true}, null]
  }
}
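If you want to turn the demo into a real splitter, here is a hedged sketch of the output-file switch (the part_N.json naming scheme is my own choice, not part of the answer): replace out() with a small writer object and call next_file() where the demo prints the marker.

class OutputSwitcher:
    # Writes characters to numbered part files and rotates on demand.
    def __init__(self, prefix='part'):
        self.prefix = prefix
        self.index = 0
        self.handle = open(f'{self.prefix}_{self.index}.json', 'w')

    def out(self, ch):
        self.handle.write(ch)

    def next_file(self):
        # close the current part and start the next one
        self.handle.close()
        self.index += 1
        self.handle = open(f'{self.prefix}_{self.index}.json', 'w')

    def close(self):
        self.handle.close()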

Related

Converting JSON to CSV using Python. How to remove certain text/characters if found, and how to better format the cell?

I apologise in advance if I have not provided enough information, am using the wrong terminology, or am not formatting my question correctly. This is my first time asking a question here.
This is the Python script: https://pastebin.com/WWViemwf
This is the JSON file (contains the first 4 elements: hydrogen, helium, lithium, beryllium): https://pastebin.com/fyiijpBG
As seen, I'm converting the file from ".json" to ".csv".
The JSON file sometimes contains fields that say "NotApplicable" or "Unknown", or it will show me weird text that I'm not familiar with.
For example here:
"LiquidDensity": {
  "data": "NotAvailable",
  "tex_description": "\\text{liquid density}"
},
And here:
"MagneticMoment": {
  "data": "Unknown",
  "tex_description": "\\text{magnetic dipole moment}"
},
Here is the code I've made to convert from ".json" to ".csv":
#liquid density
liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""
However, in the CSV file it shows up like this.
I'm also trying to remove these characters that I'm seeing in the ".csv" file.
In the JSON file, this is how the data is viewed:
"AtomicMass": {
  "data": {
    "value": "4.002602",
    "tex_unit": "\\text{u}"
  },
  "tex_description": "\\text{atomic mass}"
},
And this is how I coded the conversion, using Python:
#atomic mass
atomic_mass = element_data["AtomicMass"]["data"]
if isinstance(atomic_mass, dict):
    atomic_mass_value = atomic_mass["value"]
    atomic_mass_unit = atomic_mass["tex_unit"]
else:
    atomic_mass_value = atomic_mass
    atomic_mass_unit = ""
What have I done wrong?
I've tried replacing:
#melting point
melting_point = element_data["MeltingPoint"]["data"]
if isinstance(melting_point, dict):
    melting_point_value = melting_point["value"]
    melting_point_unit = melting_point["tex_unit"]
else:
    melting_point_value = melting_point
    melting_point_value = ""
With:
#melting point
melting_point = element_data["MeltingPoint"]["data"]
if isinstance(melting_point, dict):
    melting_point_value = melting_point["value"]
    melting_point_unit = melting_point["tex_unit"]
elif melting_point == "NotApplicable" or melting_point == "Unknown":
    melting_point_value = ""
    melting_point_unit = ""
else:
    melting_point_value = melting_point
    melting_point_unit = ""
However that doesn't seem to work.
Your code is fine; what went wrong is the writing step. Let me take out part of it.
#I will only be using Liquid Density as example, so I won't be showing the others
headers = [..., "Liquid Density", ...]

#liquid_density data reading part
liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""

#your writing of the data into the csv
writer.writerow([..., liquid_density, ...])
You write liquid_density directly into your CSV; that is why it shows the dictionary. If you want to write the value only, I believe you should change the write line to
writer.writerow([..., liquid_density_value, ...])
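For completeness, a minimal self-contained sketch of the corrected writing step (the sample record, header names and output file name below are my assumptions, not from the original post):

import csv

# hypothetical sample shaped like the OP's JSON
element_data = {
    "LiquidDensity": {"data": "NotAvailable", "tex_description": "\\text{liquid density}"}
}

liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""

with open("elements.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Liquid Density", "Liquid Density Unit"])
    # write the extracted value, not the raw dict
    writer.writerow([liquid_density_value, liquid_density_unit])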

Python dict: print unique entries based on value

I'm making a script which fetches movies and shows from different services. I need functionality where, if a movie is available on a streaming platform (e.g. Paramount) in both 4K and HD, only the 4K result is shown.
However, if the title is only available for purchase, then I want to exclude it from the results.
resp = {
    # Fetches JSON response as dict from server
    # which contains offers as a list of dictionaries
    "offers": [
        {
            "monetization_type": "flatrate",
            "package_short_name": "pmp",
            "presentation_type": "4k",
        },
        {
            "monetization_type": "flatrate",
            "package_short_name": "pmp",
            "presentation_type": "hd",
        },
        {
            "monetization_type": "flatrate",
            "package_short_name": "fxn",
            "presentation_type": "hd",
        },
        {
            "monetization_type": "buy",
            "package_short_name": "itu",
            "presentation_type": "4k",
        },
    ]
}
def get_Stream_info(obj, results=[]):
    try:
        if obj["offers"]:
            count = 0
            for i in range(len(obj["offers"])):
                srv = obj["offers"][i]["monetization_type"]
                qty = obj["offers"][i]["presentation_type"]
                pkg = obj["offers"][i]["package_short_name"]
                if srv == "flatrate" and qty in ["4k", "hd"]:
                    results.append(f"Stream [{i+1}]: US - {pkg} - {qty}")
                    count = 1
                else:
                    errstr = f"No streaming options available."
                    if count == 0:
                        results.append(errstr)
    except KeyError:
        results.append(f"Not available.")
    return "\n".join(results)

if __name__ == "__main__":
    print(get_Stream_info(resp))
Result:
Stream [1]: US - pmp - 4k # ParamountPlus
Stream [2]: US - pmp - hd # ParamountPlus
Stream [3]: US - fxn - hd # FoxNow
4K and HD are available on ParamountPlus, but I only want to print 4K; and then HD on all other services where 4K isn't available.
What if you create a dictionary that rates the qualities? This could be useful if you later have streams that are SD or other formats. That way you are always showing only the best-quality stream from each service, with minimal code:
qty_ratings = {
    '4k': 1,
    'hd': 2,
    'sd': 3
}
Append the highest quality stream:
if monetize == 'flatrate':
    # Check if there are more than one stream from the same service
    if len(qty) > 1:
        qty = min([qty_ratings[x] for x in qty])
    result.append(f"Stream {[i]}: US - {service} - {qty}")
return "\n".join(result)
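To make the idea concrete, here is a self-contained sketch built around the rating dictionary above (grouping by package_short_name and skipping "buy" offers are my additions, not part of the snippet above):

qty_ratings = {'4k': 1, 'hd': 2, 'sd': 3}

def best_flatrate_offers(offers):
    best = {}  # package_short_name -> lowest (best) rating seen so far
    for offer in offers:
        if offer["monetization_type"] != "flatrate":
            continue  # exclude purchase-only offers
        pkg = offer["package_short_name"]
        rating = qty_ratings[offer["presentation_type"]]
        best[pkg] = min(best.get(pkg, rating), rating)
    labels = {v: k for k, v in qty_ratings.items()}
    return [f"US - {pkg} - {labels[r]}" for pkg, r in best.items()]

# With the OP's resp dict this yields:
# ['US - pmp - 4k', 'US - fxn - hd']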
Personally I would explicitly sort the streams, not because it's more efficient, but because it makes it clear what I'm doing.
Assuming your offers are defined as in the question:
ranking = ("sd", "hd", "4k")

def get_best_stream(streams):
    return max(streams, key=lambda x: ranking.index(x["presentation_type"]))

get_best_stream(offers)
As #Gustaf notes, this is forward compatible. I've used a tuple rather than a dict since we really only care about order (perhaps even more explicit would be an enum).
If you want to keep one offer from every source, I would encode this explicitly:
def get_streams(streams):
    sources = set(x["package_short_name"] for x in streams)
    return [
        get_best_stream(s for s in streams if s["package_short_name"] == source)
        for source in sources
    ]

get_streams(offers)
Of course if you have billions of streams it would be more efficient to build a mapping between source and offers, but for a few dozen the cost of an iterator is trivial.
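For example, a hypothetical usage against the OP's resp dict, filtering out purchase-only offers first (that filtering step is my addition; the answer above does not do it):

flatrate = [o for o in resp["offers"] if o["monetization_type"] == "flatrate"]
print(get_streams(flatrate))
# e.g. (order may vary, since sources is a set):
# [{'monetization_type': 'flatrate', 'package_short_name': 'pmp', 'presentation_type': '4k'},
#  {'monetization_type': 'flatrate', 'package_short_name': 'fxn', 'presentation_type': 'hd'}]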

Convert list of paths to dictionary in python

I'm making a program in Python where I need to interact with "hypothetical" paths (i.e. paths that don't and won't exist on the actual filesystem), and I need to be able to listdir them like normal (path['directory'] would return every item inside the directory, like os.listdir()).
The solution I came up with was to convert a list of string paths to a dictionary of dictionaries. I came up with this recursive function (it's inside a class):
def DoMagic(self, paths):
    structure = {}
    if not type(paths) == list:
        raise ValueError('Expected list Value, not ' + str(type(paths)))
    for i in paths:
        print(i)
        if i[0] == '/':  # Sanity check
            print('trailing?', i)  # Inform user that there *might* be an issue with the input.
            i[0] = ''
        i = i.split('/')  # Split it, so that we can test against different parts.
        if len(i[1:]) > 1:  # Hang-a-bout, there's more content!
            structure = {**structure, **self.DoMagic(['/'.join(i[1:])])}
        else:
            structure[i[1]] = i[1]
But when I go to run it with ['foo/e.txt','foo/bar/a.txt','foo/bar/b.cfg','foo/bar/c/d.txt'] as input, I get:
{'e.txt': 'e.txt', 'a.txt': 'a.txt', 'b.cfg': 'b.cfg', 'd.txt': 'd.txt'}
I want to be able to just path['foo']['bar'] to get everything in the foo/bar/ directory.
Edit:
A more desirable output would be:
{'foo':{'e.txt':'e.txt','bar':{'a.txt':'a.txt','c':{'d.txt':'d.txt'}}}}
Edit 10-14-22: My first answer matches what the OP asks for, but it isn't really the ideal approach nor the cleanest output. Since this question appears to be used more often, see below for a cleaner approach that is more resilient to Unix/Windows paths and whose output dictionary makes more sense.
from pathlib import Path
import json

def get_path_dict(paths: list[str | Path]) -> dict:
    """Builds a tree like structure out of a list of paths"""
    def _recurse(dic: dict, chain: tuple[str, ...] | list[str]):
        if len(chain) == 0:
            return
        if len(chain) == 1:
            dic[chain[0]] = None
            return
        key, *new_chain = chain
        if key not in dic:
            dic[key] = {}
        _recurse(dic[key], new_chain)
        return

    new_path_dict = {}
    for path in paths:
        _recurse(new_path_dict, Path(path).parts)
    return new_path_dict

l1 = ['foo/e.txt', 'foo/bar/a.txt', 'foo/bar/b.cfg', Path('foo/bar/c/d.txt'), 'test.txt']
result = get_path_dict(l1)
print(json.dumps(result, indent=2))
Output:
{
  "foo": {
    "e.txt": null,
    "bar": {
      "a.txt": null,
      "b.cfg": null,
      "c": {
        "d.txt": null
      }
    }
  },
  "test.txt": null
}
Older approach
How about this? It gets your desired output; however, a tree structure may be cleaner.
from collections import defaultdict
import json

def nested_dict():
    """
    Creates a default dictionary where each value is an other default dictionary.
    """
    return defaultdict(nested_dict)

def default_to_regular(d):
    """
    Converts defaultdicts of defaultdicts to dict of dicts.
    """
    if isinstance(d, defaultdict):
        d = {k: default_to_regular(v) for k, v in d.items()}
    return d

def get_path_dict(paths):
    new_path_dict = nested_dict()
    for path in paths:
        parts = path.split('/')
        if parts:
            marcher = new_path_dict
            for key in parts[:-1]:
                marcher = marcher[key]
            marcher[parts[-1]] = parts[-1]
    return default_to_regular(new_path_dict)

l1 = ['foo/e.txt', 'foo/bar/a.txt', 'foo/bar/b.cfg', 'foo/bar/c/d.txt', 'test.txt']
result = get_path_dict(l1)
print(json.dumps(result, indent=2))
Output:
{
  "foo": {
    "e.txt": "e.txt",
    "bar": {
      "a.txt": "a.txt",
      "b.cfg": "b.cfg",
      "c": {
        "d.txt": "d.txt"
      }
    }
  },
  "test.txt": "test.txt"
}
Wouldn't a simple tree, implemented via dictionaries, suffice?
Your implementation seems a bit redundant: it's hard to easily tell which folder a file belongs to.
https://en.wikipedia.org/wiki/Tree_(data_structure)
There are a lot of libraries on PyPI if you need something extra.
treelib
There are also pure paths in pathlib.
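For instance, a tiny illustration of pure paths, which never touch the filesystem (the sample path is arbitrary):

from pathlib import PurePosixPath

p = PurePosixPath('foo/bar/a.txt')
print(p.parts)   # ('foo', 'bar', 'a.txt')
print(p.parent)  # foo/bar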

Formatting JSON in Python

What is the simplest way to pretty-print a string of JSON as a string with indentation when the initial JSON string is formatted without extra spaces or line breaks?
Currently I'm running json.loads() and then running json.dumps() with indent=2 on the result. This works, but it feels like I'm throwing a lot of compute down the drain.
Is there a simpler or more efficient (built-in) way to pretty-print a JSON string (while keeping it as valid JSON)?
Example
import requests
import json
response = requests.get('http://spam.eggs/breakfast')
one_line_json = response.content.decode('utf-8')
pretty_json = json.dumps(json.loads(response.content), indent=2)
print(f'Original: {one_line_json}')
print(f'Pretty: {pretty_json}')
Output:
Original: {"breakfast": ["spam", "spam", "eggs"]}
Pretty: {
  "breakfast": [
    "spam",
    "spam",
    "eggs"
  ]
}
json.dumps(obj, indent=2) is better than pprint because:
It is faster with the same load methodology.
It has the same or similar simplicity.
The output will produce valid JSON, whereas pprint will not.
pprint_vs_dumps.py
import cProfile
import json
import pprint
from urllib.request import urlopen

def custom_pretty_print():
    url_to_read = "https://www.cbcmusic.ca/Component/Playlog/GetPlaylog?stationId=96&date=2018-11-05"
    with urlopen(url_to_read) as resp:
        pretty_json = json.dumps(json.load(resp), indent=2)
        print(f'Pretty: {pretty_json}')

def pprint_json():
    url_to_read = "https://www.cbcmusic.ca/Component/Playlog/GetPlaylog?stationId=96&date=2018-11-05"
    with urlopen(url_to_read) as resp:
        info = json.load(resp)
        pprint.pprint(info)
cProfile.run('custom_pretty_print()')
>>> 71027 function calls (42309 primitive calls) in 0.084 seconds
cProfile.run('pprint_json()')
>>>164241 function calls (140121 primitive calls) in 0.208 seconds
Thanks #tobias_k for pointing out my errors along the way.
I think for a true JSON object print, it's probably as good as it gets. timeit(number=10000) for the following took about 5.659214497s:
import json

d = {
    'breakfast': [
        'spam', 'spam', 'eggs',
        {
            'another': 'level',
            'nested': [
                {'a': 'b'},
                {'c': 'd'}
            ]
        }
    ],
    'foo': True,
    'bar': None
}
s = json.dumps(d)
q = json.dumps(json.loads(s), indent=2)
print(q)
I tried with pprint, but it actually wouldn't print the pure JSON string unless it's converted to a Python dict, which loses the true, null and false of valid JSON, as mentioned in the other answer. As well, it doesn't retain the order in which the items appeared, so it's not great if order is important for readability.
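A quick illustration of that point (the sample string is mine, not from the question):

import json
import pprint

s = '{"foo": true, "bar": null}'
pprint.pprint(json.loads(s))                # {'bar': None, 'foo': True}  -- Python literals, sorted keys
print(json.dumps(json.loads(s), indent=2))  # keeps true/null and stays valid JSON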
Just for fun I whipped up the following function:
def pretty_json_for_savages(j, indentor=' '):
    ind_lvl = 0
    temp = ''
    for i, c in enumerate(j):
        if c in '{[':
            print(indentor*ind_lvl + temp.strip() + c)
            ind_lvl += 1
            temp = ''
        elif c in '}]':
            print(indentor*ind_lvl + temp.strip() + '\n' + indentor*(ind_lvl-1) + c, end='')
            ind_lvl -= 1
            temp = ''
        elif c in ',':
            print(indentor*(0 if j[i-1] in '{}[]' else ind_lvl) + temp.strip() + c)
            temp = ''
        else:
            temp += c
    print('')
# {
#  "breakfast":[
#   "spam",
#   "spam",
#   "eggs",
#   {
#    "another": "level",
#    "nested":[
#     {
#      "a": "b"
#     },
#     {
#      "c": "d"
#     }
#    ]
#   }
#  ],
#  "foo": true,
#  "bar": null
# }
It prints pretty alright, and unsurprisingly it took a whopping 16.701202023s to run in timeit(number=10000), which is 3 times as much as json.dumps(json.loads()) would get you. It's probably not worthwhile to build your own function to achieve this unless you spend some time optimizing it, and with the lack of a built-in for the same, it's probably best you stick to your guns, since your efforts will most likely give diminishing returns.

Facebook JSON badly encoded

I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then to Your Facebook information, then Download your information, then create a file with at least the Messages box checked) to do some cool statistics.
However, there is a small problem with encoding. I'm not sure, but it looks like Facebook used bad encoding for this data. When I open it with a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław.
My python script:
text = open(os.path.join(subdir, file), encoding='utf-8')
conversations.append(json.load(text))
I tried a few of the most common encodings. Example data is:
{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}
I can indeed confirm that the Facebook download data is incorrectly encoded; a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.
What this means is that any non-ASCII character in the string data was encoded twice. First to UTF-8, and then the UTF-8 bytes were encoded again by interpreting them as Latin-1 encoded data (which maps exactly 256 characters to the 256 possible byte values), by using the \uHHHH JSON escape notation (so a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because the second step encoded byte values in the range 0-255, this resulted in a series of \u00HH sequences (a literal backslash, a literal lower case letter u, two 0 zero digits and two hex digits).
E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
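For illustration (my own round trip, not part of the original answer), the double encoding can be reproduced like this:

>>> 'ł'.encode('utf8')
b'\xc5\x82'
>>> 'ł'.encode('utf8').decode('latin1')
'Å\x82'
>>> import json
>>> json.dumps('ł'.encode('utf8').decode('latin1'))
'"\\u00c5\\u0082"'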
You can repair the damage in two ways:
Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:
>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'
This would require a full traversal of your data structure to find all those strings, of course.
Load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:
import json
import os
import re
from functools import partial

fix_mojibake_escapes = partial(
    re.compile(rb'\\u00([\da-f]{2})').sub,
    lambda m: bytes.fromhex(m[1].decode()),
)

with open(os.path.join(subdir, file), 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())

data = json.loads(repaired)
(If you are using Python 3.5 or older, you'll have to decode the repaired bytes object from UTF-8, so use json.loads(repaired.decode())).
From your sample data this produces:
{'content': 'No to trzeba ostatnie treningi zrobić xD',
 'sender_name': 'Radosław',
 'timestamp': 1524558089,
 'type': 'Generic'}
The regular expression matches against all \u00HH sequences in the binary data and replaces those with the bytes they represent, so that the data can be decoded correctly as UTF-8. The second decoding is taken care of by the json.loads() function when given binary data.
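For instance (my own check, not from the answer), applying the regex fixer to the sample name from the question:

>>> fix_mojibake_escapes(rb'"Rados\u00c5\u0082aw"')
b'"Rados\xc5\x82aw"'
>>> json.loads(fix_mojibake_escapes(rb'"Rados\u00c5\u0082aw"'))
'Radosław'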
Here is a command-line solution with jq and iconv: jq expands the \u00hh escapes into the corresponding characters, and iconv then maps those Latin-1 code points back to single bytes, which restores the original UTF-8. Tested on Linux.
cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json
I would like to extend #Geekmoss' answer with the following recursive code snippet, which I used to decode my Facebook data.
import json

def parse_obj(obj):
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')
    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]
    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}
    return obj

decoded_data = parse_obj(json.loads(file))
I noticed this works better, because the Facebook data you download might contain lists of dicts, in which case those dicts would just be returned 'as is' because of the identity lambda function.
My solution for parsing objects uses the object_hook callback of the load/loads functions:
import json

def parse_obj(dct):
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
    return dct

data = '{"msg": "Ahoj sv\u00c4\u009bte"}'

# String
json.loads(data)
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}

# File
with open('/path/to/file.json') as f:
    json.load(f, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}
Update:
The solution above does not work for parsing lists with strings, so here is an updated solution:
import json

def parse_obj(obj):
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
    return obj
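A hypothetical usage of the updated hook (the sample data is mine, not the answer's):

data = '{"msg": "Ahoj sv\u00c4\u009bte", "tags": ["sv\u00c4\u009bt"]}'
json.loads(data, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe', 'tags': ['svět']}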
Based on #Martijn Pieters' solution, I wrote something similar in Java.
public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}
The unescape method is inspired by the org.apache.commons.lang.StringEscapeUtils.
private String unescapeMessenger(String str) {
    if (str == null) {
        return null;
    }
    try {
        StringWriter writer = new StringWriter(str.length());
        unescapeMessenger(writer, str);
        return writer.toString();
    } catch (IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new UnhandledException(ioe);
    }
}

private void unescapeMessenger(Writer out, String str) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (str == null) {
        return;
    }
    int sz = str.length();
    StrBuilder unicode = new StrBuilder(4);
    boolean hadSlash = false;
    boolean inUnicode = false;
    for (int i = 0; i < sz; i++) {
        char ch = str.charAt(i);
        if (inUnicode) {
            unicode.append(ch);
            if (unicode.length() == 4) {
                // unicode now contains the four hex digits
                // which represents our unicode character
                try {
                    int value = Integer.parseInt(unicode.toString(), 16);
                    out.write((char) value);
                    unicode.setLength(0);
                    inUnicode = false;
                    hadSlash = false;
                } catch (NumberFormatException nfe) {
                    throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                }
            }
            continue;
        }
        if (hadSlash) {
            hadSlash = false;
            if (ch == 'u') {
                inUnicode = true;
            } else {
                out.write("\\");
                out.write(ch);
            }
            continue;
        } else if (ch == '\\') {
            hadSlash = true;
            continue;
        }
        out.write(ch);
    }
    if (hadSlash) {
        // then we're in the weird case of a \ at the end of the
        // string, let's output it anyway.
        out.write('\\');
    }
}
Facebook programmers seem to have mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. Further details in Invalid Unicode encodings in Facebook data exports.
Try this:
import json
import io

class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        data: bytes = super(FacebookIO, self).readall()
        new_data: bytes = b''
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data[i:].startswith(b'\\u00'):
                u: int = 0
                new_char: bytes = b''
                while data[i+u:].startswith(b'\\u00'):
                    hex = int(bytes([data[i+u+4], data[i+u+5]]), 16)
                    new_char = b''.join([new_char, bytes([hex])])
                    u += 6
                char: str = new_char.decode('utf-8')
                new_chars: bytes = bytes(json.dumps(char).strip('"'), 'ascii')
                new_data += new_chars
                i += u
            else:
                new_data = b''.join([new_data, bytes([data[i]])])
                i += 1
        return new_data

if __name__ == '__main__':
    f = FacebookIO('data.json', 'rb')
    d = json.load(f)
    print(d)
This is #Geekmoss' answer, but adapted for Python 3:
def parse_facebook_json(json_file_path):
    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        return obj

    with json_file_path.open('rb') as json_file:
        return json.load(json_file, object_hook=parse_obj)

# Usage
parse_facebook_json(Path("/.../message_1.json"))
Extending Martijn's solution #1, which I found can lead you towards recursive object processing (it certainly led me there initially):
You can apply this to the whole string of the JSON object, if you don't ensure_ascii:
json.dumps(obj, ensure_ascii=False, indent=2).encode('latin-1').decode('utf-8')
then write it to a file or something.
PS: This should be a comment on #Martijn's answer: https://stackoverflow.com/a/50011987/1309932 (but I can't add comments).
This is my approach for Node 17.0.1, based on #hotigeftas recursive code, using the iconv-lite package.
import fs from 'fs';
import iconv from 'iconv-lite';

function parseObject(object) {
  if (typeof object == 'string') {
    return iconv.decode(iconv.encode(object, 'latin1'), 'utf8');
  }
  if (typeof object == 'object') {
    for (let key in object) {
      object[key] = parseObject(object[key]);
    }
    return object;
  }
  return object;
}

//usage
let file = JSON.parse(fs.readFileSync(fileName));
file = parseObject(file);
