How to decode/read Flask filesystem session files? - python

I have a Python3 Flask app using Flask-Session (which adds server-side session support) and configured to use the filesystem type.
Under the hood, this type uses the Werkzeug class werkzeug.contrib.cache.FileSystemCache (see the Werkzeug cache documentation).
The raw cache files look like this if opened:
J¬».].Äï;}î(å
_permanentîàå
respondentîåuuidîåUUIDîìî)Åî}î(åintîät˙ò∑flŒºçLÃ/∆6jhåis_safeîhåSafeUUIDîìîNÖîRîubåSECTIONS_VISITEDî]îåcurrent_sectionîKåSURVEY_CONTENTî}î(å0î}î(ås_idîås0îånameîåWelcomeîådescriptionîåîå questionsî]î}î(ås_idîhåq_idîhåq_constructîhåq_textîhå
q_descriptionîhåq_typeîhårequiredîhåoptions_rowîhåoptions_row_alpha_sortîhåreplace_rowîhåoptions_colîhåoptions_col_codesîhåoptions_col_alpha_sortîhåcond_continue_rules_rowîhåq_meta_notesîhuauå1î}î(hås1îhå Screeningîhå[This section determines if you fit into the target group.îh]î(}î(hh/håq1îh hh!å9Have you worked on a product in this field before?
The items stored in the session can be seen a bit above:
- current_section should be an integer, e.g., 0
- SECTIONS_VISITED should be an array of integers, e.g., [0,1,2]
- SURVEY_CONTENT should be an object with a structure like this:
{
    'item1': {
        'label': string,
        'questions': [{}]
    },
    'item2': {
        'label': string,
        'questions': [{}]
    }
}
For example, the text "This section determines if you fit into the target group" visible in the excerpt above is the value of one label. The items after questions are the keys found in each questions object (e.g., q_text), together with their values (e.g., "Have you worked on a product in this field before?" is the value of q_text).
I need to retrieve data from the stored cache files in a way that I can read them without all the extra characters like å.
I tried using Werkzeug like this, where the item 9c3c48a94198f61aa02a744b16666317 is the name of the cache file I want to read. However, it was not found in the cache directory.
from werkzeug.contrib.cache import FileSystemCache

cache_dir = "flask_session"
mode = 0o600  # octal literals need the 0o prefix in Python 3
threshold = 20000

cache = FileSystemCache(cache_dir, threshold=threshold, mode=mode)
item = "9c3c48a94198f61aa02a744b16666317"
print(cache.has(item))
data = cache.get(item)
print(data)
What ways are there to read the cache files?
I opened a GitHub issue in Flask-Session, but that project hasn't really been actively maintained in years.
For context, I had an instance where for my web app writing to the database was briefly not working - but the data I need was also being saved in the session. So right now the only way to retrieve that data is to get it from these files.
EDIT:
Thanks to Tim's answer I solved it using the following:
import pickle

obj = []
with open(file_name, "rb") as fileOpener:
    while True:
        try:
            obj.append(pickle.load(fileOpener))
        except EOFError:
            break
print(obj)
I needed to load all pickled objects in the file, so I combined Tim's solution with the one here for loading multiple objects: https://stackoverflow.com/a/49261333/11805662
Without this, I was just seeing the first pickled item.
Also, in case anyone has the same problem, I needed to use the same python version as my Flask app (related post). If I didn't, then I would get the following error:
ValueError: unsupported pickle protocol: 4
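For reference, protocol 4 was added in Python 3.4, so this error just means the interpreter doing the reading is older than the one that wrote the session. A quick way to check what your interpreter supports:

import pickle
import sys

print(sys.version_info)         # the interpreter you are unpickling with
print(pickle.HIGHEST_PROTOCOL)  # the highest pickle protocol it can read/write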

You can decode the data with pickle. Pickle is part of the Python standard library.
import pickle

# open in binary mode: pickle files are not text
with open("PATH/TO/SESSION/FILE", "rb") as f:
    data = pickle.load(f)
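If I remember werkzeug's FileSystemCache correctly, it writes the expiry time as a separate pickle before the session payload, which is why the loop in the question's EDIT finds two objects. A sketch assuming that layout (the file name is the one from the question):

import pickle

with open("flask_session/9c3c48a94198f61aa02a744b16666317", "rb") as f:
    expiry = pickle.load(f)        # first object: the cache expiry timestamp
    session_data = pickle.load(f)  # second object: the actual session dictionary

print(expiry)
print(session_data)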

Related

How to extract relation members from .osm xml files

All,
I've been trying to build a website (in Django) which is to be an index of all MTB routes in the world. I'm a Pythonian so wherever I can I try to use Python.
I've successfully extracted data from the OSM API (Display relation (trail) in leaflet) but found that doing this for all MTB trails (tag: route=mtb) is too much data (processing takes very long). So I tried to do everything locally by downloading a torrent of the entire OpenStreetMap dataset (from Latest Weekly Planet XML File) and filtering for tag: route=mtb using osmfilter (part of osmctools in Ubuntu 20.04), like this:
osmfilter $unzipped_osm_planet_file --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.osm
This produces a file of about 1.2 GB and on closer inspection seems to contain all the data I need. My goal was to transform the file into a pandas.DataFrame() so I could do some further filtering and transforming before pushing relevant aspects into my Django DB. I tried to load the file as a regular XML file using Python Pandas but this crashed the Jupyter notebook kernel. I guess the data is too big.
My second approach was this solution: How to extract and visualize data from OSM file in Python. It worked for me, at least, I can get some of the information, like the tags of the relations in the file (and the other specified details). What I'm missing is the relation members (the ways) and then the way members (the nodes) and their latitude/longitudes. I need these to achieve what I did here: Plotting OpenStreetMap relations does not generate continuous lines
I'm open to many solutions; for example, one could break the file up into many different files containing one relation and its members per file, using an osmium-based script. Perhaps then I can move on with pandas.read_xml(). This would be nice for batch processing and filling the database. Loading the whole OSM XML file into a pd.DataFrame would be nice but I guess this really is a lot of data. Perhaps this can also be done on a per-relation basis with pyosmium?
Any help is appreciated.
Ok, I figured out how to get what I want (all information per relation of the type "route=mtb" stored in an accessible way). It's a multi-step process; I'll describe it here.
First, I downloaded the world file: I went to wiki.openstreetmap.org/wiki/Planet.osm and downloaded the world file as .pbf (everything on Linux; this file is referred to as $osm_planet_file below).
I converted this file to o5m using osmconvert (available in Ubuntu 20.04 via apt install osmctools), on the Linux CLI:
osmconvert --verbose --drop-version $osm_planet_file -o=$osm_planet_dir/planet.o5m
The next step is to filter all relations of interest out of this file (in my case I wanted all MTB routes: route=mtb) and store them in a new file, like this:
osmfilter $osm_planet_dir/planet.o5m --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.o5m
This creates a much smaller file that contains all information on the relations that are MTB routes.
From there on I switched to a Jupyter notebook and used Python3 to further divide the file into useful, manageable chunks. I first installed osmium using conda (in the env I created first but that can be skipped):
conda install -c conda-forge osmium
Then I wrote a handler based on the recommended osm.SimpleHandler class; it iterates through the large o5m file and can perform actions while doing so. This is the way to deal with these files, because they are far too big for memory. I chose to iterate through the file and store everything I needed in separate JSON files. This generates more than 12,000 JSON files, but it can be done on my laptop with 8 GB of memory. This is the class:
import osmium as osm
import json
import os

data_dump_dir = '../data'

class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            data = dict()
            # note: the trailing commas wrap each value in a 1-tuple,
            # which is why data['members'][0] is used further down
            data['version'] = elem.version,
            data['members'] = [int(member.ref) for member in elem.members if member.type == 'w'],  # filter nodes from waylist => could be a mistake
            data['visible'] = elem.visible,
            data['timestamp'] = str(elem.timestamp),
            data['uid'] = elem.uid,
            data['user'] = elem.user,
            data['changeset'] = elem.changeset,
            data['num_tags'] = len(elem.tags),
            data['key'] = tag.k,
            data['value'] = tag.v,
            data['deleted'] = elem.deleted
            with open(os.path.join(data_dump_dir, str(elem.id) + '.json'), 'w') as f:
                json.dump(data, f)

    def relation(self, r):
        self.tag_inventory(r, "relation")
Run the class like this:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
Now we have JSON files with the relation number as their filename, containing all metadata and a list of the ways. But we want a list of the ways and then also all the nodes per way, so we can plot the full relations (the MTB routes). To achieve this, we parse the o5m file again (using a class built on the osm.SimpleHandler class), and this time we extract all way members (the nodes) and create a dictionary:
class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = dict()

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            self.osm_data[int(elem.id)] = dict()
            # self.osm_data[int(elem.id)]['is_closed'] = str(elem.is_closed)
            self.osm_data[int(elem.id)]['nodes'] = [str(n) for n in elem.nodes]

    def way(self, w):
        self.tag_inventory(w, "way")
Execute the class:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
ways = osmhandler.osm_data
This gives us a dict (called ways) with all way IDs as keys and lists of node IDs as values (meaning we still need some more steps).
len(ways.keys())
>>> 337597
In the next (and almost last) step we add the node IDs for all ways to our relation jsons, so they become part of the files:
all_data = dict()
for relation_file in [
        os.path.join(data_dump_dir, file) for file in os.listdir(data_dump_dir) if file.endswith('.json')
]:
    with open(relation_file, 'r') as f:
        data = json.load(f)
    if 'members' in data:  # Make sure these steps are never performed twice
        try:
            data['ways'] = dict()
            for way in data['members'][0]:
                data['ways'][way] = ways[way]['nodes']
            del data['members']
            with open(relation_file, 'w') as f:
                json.dump(data, f)
        except KeyError as err:
            print(err, relation_file)  # Not sure why some relations give errors?
So now we have relation JSONs with all ways, and all ways have all node IDs; the last thing to do is to replace the node IDs with their values (latitude and longitude). I also did this in two steps: first I built a nodeID:lat/lon dictionary, again using an osmium.SimpleHandler-based class:
import osmium

class CounterHandler(osmium.SimpleHandler):
    def __init__(self):
        osmium.SimpleHandler.__init__(self)
        self.osm_data = dict()

    def node(self, n):
        self.osm_data[int(n.id)] = [n.location.lat, n.location.lon]
Execute the class:
h = CounterHandler()
h.apply_file("../data/world_mtb_routes.o5m")
nodes = h.osm_data
This gives us a dict with a latitude/longitude pair for every node ID. We can use this on our JSON files to fill the ways with coordinates (where there are still only node IDs). I create these final JSON files in a new directory (data/with_coords in my case) so that if there is an error, my original (input) JSON files are not affected and I can try again:
import os

relation_files = [file for file in os.listdir('../data/') if file.endswith('.json')]
for relation in relation_files:
    relation_file = os.path.join('../data/', relation)
    relation_file_with_coords = os.path.join('../data/with_coords', relation)
    with open(relation_file, 'r') as f:
        data = json.load(f)
    try:
        for way in data['ways']:
            node_coords_per_way = []
            for node in data['ways'][way]:
                node_coords_per_way.append(nodes[int(node)])
            data['ways'][way] = node_coords_per_way
        with open(relation_file_with_coords, 'w') as f:
            json.dump(data, f)
    except KeyError:
        print(relation)
Now I have what I need and I can start adding the info to my Django database, but that is beyond the scope of this question.
By the way, some relations give an error; I suspect that for some relations ways were labelled as nodes, but I'm not sure. I'll update here if I find out. I also have to do this process regularly (when the world file updates, or every now and then), so I'll probably write something more concise later on, but for now this works and the steps are understandable, at least to me after a lot of thinking.
All of the complexity comes from the fact that the data does not fit in memory, otherwise I'd have created a pandas.DataFrame in step one and been done with it. I could perhaps also have loaded the data into a database in one go, but I'm not that good with databases, yet.
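For what it's worth, once the per-relation JSON files exist, building a summary pandas.DataFrame from them is cheap. A minimal sketch assuming the data/with_coords layout created above (the column choices are just an example):

import os
import json
import pandas as pd

coords_dir = '../data/with_coords'
records = []
for fname in os.listdir(coords_dir):
    if not fname.endswith('.json'):
        continue
    with open(os.path.join(coords_dir, fname)) as f:
        rel = json.load(f)
    records.append({
        'relation_id': int(os.path.splitext(fname)[0]),  # filenames are the relation IDs
        'n_ways': len(rel['ways']),
        'n_nodes': sum(len(nodes) for nodes in rel['ways'].values()),
    })

df = pd.DataFrame.from_records(records)
print(df.head())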

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes
#tempJson = json.loads(open("myData.json","r"))
try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise
print(cb.get('healthRec').value)
I know that the first commented-out line that loads the JSON is incorrect, because json.loads expects a string rather than a file object... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)

try:
    result = cb.upsert('healthRec', {'record': data})
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!
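If you later want to upsert a whole directory of JSON files rather than a single one, the same pattern extends naturally. A rough sketch using the same SDK-2-style Bucket API as in the question (the directory name and key scheme are made up):

import glob
import json
import os

from couchbase.bucket import Bucket

cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')

for path in glob.glob('records/*.json'):                # hypothetical directory of JSON files
    with open(path, 'r') as f:
        doc = json.load(f)
    key = os.path.splitext(os.path.basename(path))[0]   # e.g. use the file name as the document key
    cb.upsert(key, doc)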
I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as an input and decodes the json string into a dictionary (or whatever custom object you use based on the object_hook), which is why you were seeing the issue as you are passing it a file handle.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned, though, if you have a large number of JSON documents to insert, it might be worth taking a look at cbdocloader, as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.

Pickle dump huge file without memory error

I have a program where I basically adjust the probability of certain things happening based on what is already known. My data file is already saved as a pickled dictionary object in Dictionary.txt.
The problem is that every time I run the program it pulls in Dictionary.txt, turns it into a dictionary object, makes its edits and overwrites Dictionary.txt. This is pretty memory intensive, as Dictionary.txt is 123 MB. When I dump I get a MemoryError; everything seems fine when I pull it in.
Is there a better (more efficient) way of doing the edits? (Perhaps without having to overwrite the entire file every time.)
Is there a way that I can invoke garbage collection (through the gc module)? (I already have it auto-enabled via gc.enable())
I know that besides readlines() you can read line by line. Is there a way to edit the dictionary incrementally, line by line, when I already have a fully completed dictionary object file in the program?
Any other solutions?
Thank you for your time.
I was having the same issue. I used joblib and it did the job. Posting in case someone wants to know about other possibilities.

from sklearn.externals import joblib

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# some time later... load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)
I am the author of a package called klepto (and also the author of dill).
klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.
>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
>>>
klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and works no differently from a different process.
>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)
You can also pick from various levels of file compression, and whether you want the files to be memory-mapped. There are a lot of different options, both for file backends and database backends. The interface is identical, however.
With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or any of the other dictionary methods.
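For example, a minimal sketch of an incremental edit using only the calls shown above (assuming the same 'stuff' archive):

import klepto

d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
d.load('big1')              # pull just this entry into the memory cache
d['big1'] = d['big1'] * 2   # edit it in memory
d.dump()                    # write the cache back to the on-disk archive
d.clear()                   # drop it from memory again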
See a longer tutorial here: https://github.com/mmckerns/tlkklp
Get klepto here: https://github.com/uqfoundation
None of the above answers worked for me. I ended up using Hickle, which is a drop-in replacement for pickle based on HDF5. Instead of saving the data to a pickle it saves it to an HDF5 file. The API is identical for most use cases, and it has some really cool features such as compression.
pip install hickle
Example:
import numpy as np
import hickle as hkl

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Load data
array_hkl = hkl.load('test.hkl')
I had memory error and resolved it by using protocol=2:
cPickle.dump(obj, file, protocol=2)
If your keys and values are strings, you can use one of the embedded persistent key-value storage engines available in the Python standard library. Example from the anydbm module docs:
import anydbm
# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')
# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Loop through contents. Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
    print k, '\t', v
# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4
# Close when done.
db.close()
Have you tried using streaming pickle: https://code.google.com/p/streaming-pickle/
I have just solved a similar memory error by switching to streaming pickle.
How about this?
import cPickle as pickle
p = pickle.Pickler(open("temp.p","wb"))
p.fast = True
p.dump(d)  # d could be your dictionary or any picklable object
I recently had this problem. After trying cPickle with ASCII and the binary protocol 2, I found that my SVM from scikit-learn, trained on 20+ GB of data, was not pickling due to a memory error. However, the dill package seemed to resolve the issue. Dill will not create many improvements for a dictionary but may help with streaming. It is meant to stream pickled bytes across a network.
import dill

# dump the object
with open(path, 'wb') as fp:
    dill.dump(obj, fp)

# load it back later
with open(path, 'rb') as fp:
    obj = dill.load(fp)
If efficiency is an issue, try loading/saving to a database. In this instance, your storage solution may be the issue. At 123 MB, Pandas should be fine. However, if the machine has limited memory, SQL offers fast, optimized bag operations over data, usually with multithreaded support.
My poly-kernel SVM saved successfully.
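If you do go the database route suggested above, the standard library's sqlite3 is enough for a per-key store, so an edit only rewrites the rows that changed instead of the whole file. A minimal sketch (the file and table names are made up):

import pickle
import sqlite3

conn = sqlite3.connect("dictionary.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

def put(key, obj):
    # serialize just this entry and upsert its row
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                 (key, pickle.dumps(obj, protocol=2)))
    conn.commit()

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return pickle.loads(row[0]) if row else None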
This may seem trivial, but try to use the 64bit Python if you are not.
I tried the following solutions, but none of them resolved my problem:
- Using hickle to replace pickle
- Using joblib to replace pickle
- Using sklearn.externals.joblib to replace pickle
- Changing the pickle mode
Here is a different fix for this issue:
Finally, I found that the root cause was that the working directory path was too long.
So I changed to a directory with a very short path.
Enjoy it.

python gnupg timestamp

I have noticed when using python-gnupg that if I sign some data and save the signed data to a file using pickle, lots of data gets saved along with the signed data. One of these things is a timestamp in Unix time; for example, the following lines are part of a timestamp:
p24
sS'timestamp'
p25
V1347364912
The documentation does not mention any of this, which makes me a little confused. After loading the file using pickle, I can't see any mention of the timestamp or any way to return the value. But if pickle is saving it, it must be part of the Python object. Does this mean there should be a way I can get to this information in Python? I would also like to utilise this data, which I can maybe do by reading the file itself, but I am looking for a cleaner way to do it using the gnupg module.
gnupg isn't very well documented, but if you inspect it you will see there are attributes besides the ones normally used...
#234567891123456789212345678931234567894123456789512345678961234567897123456789
# core
import inspect
import pickle
import datetime
# 3rd party
import gnupg
def depickle():
    """ pull and depickle our signed data """
    f = open('pickle.txt', 'r')
    signed_data = pickle.load(f)
    f.close()
    return signed_data

# depickle our signed data
signed_data = depickle()

# inspect the object
for key, value in inspect.getmembers(signed_data):
    print key
One of them is your timestamp... aptly named timestamp. Now that you know it you can use it easily enough...
# use the attribute now that we know it
print signed_data.timestamp
# make it pretty
print datetime.datetime.fromtimestamp(float(signed_data.timestamp))
That felt long-winded, but I thought this discussion would benefit from documenting the use of inspect to identify the undocumented attributes instead of just saying "use signed_data.timestamp".
I have found that some fields of the python-gnupg Sign and Verify classes are not described in the documentation. You will have to look at the python-gnupg source: [PYTHONDIR]/Lib/site-packages/gnupg.py. There is a Sign class with a handle_status() method that fills in all the variables/fields connected with the signature, including the timestamp field.
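For completeness, the timestamp is available directly on the object returned by sign(), without the pickle round-trip. A minimal sketch (the keyring location and passphrase are placeholders):

import datetime
import gnupg

gpg = gnupg.GPG(gnupghome='/path/to/gnupg/home')     # placeholder keyring location
signed = gpg.sign('some data', passphrase='secret')  # placeholder passphrase

print(signed.timestamp)  # the same attribute found via inspect above
print(datetime.datetime.fromtimestamp(float(signed.timestamp)))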

Is there a memory efficient and fast way to load big JSON files?

I have some json files with 500MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read partially the file? If it was a text, line delimited file, I would be able to iterate over the lines. I am looking for analogy to it.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print prefix, the_type, value
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
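A rough sketch of that second approach, assuming your parsing logic lives in a script called parse_one.py that takes the file path as its only argument:

import subprocess
import sys

list_of_files = sys.argv[1:]  # pass the JSON file paths on the command line

for json_file in list_of_files:
    # each file is handled in its own interpreter, so all memory it used
    # is returned to the OS when the child process exits
    subprocess.run([sys.executable, "parse_one.py", json_file], check=True)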
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrary-sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')  # 'drug' matches the sample data above
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those entries whose drug type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well, although you might run into the same problem loading the JSON; avoid that by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
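A minimal sketch with pymongo, loading one file at a time (a local mongod is assumed, and the database/collection/file names are placeholders):

import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB instance
collection = client["mydb"]["json_blobs"]          # hypothetical database/collection names

for path in ["file1.json", "file2.json"]:          # load the files one at a time
    with open(path) as f:
        doc = json.load(f)
    collection.insert_one(doc)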
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape's answer, I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Build a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
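A small sketch of that first step (surveying the key names) using the ijson events shown earlier in this thread; the file name is a placeholder:

import ijson

keys = set()
with open("big.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        if event == "map_key":   # collect only the distinct key names
            keys.add(value)

print(sorted(keys))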
You can convert the JSON file to a CSV file and then parse that line by line:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() on the whole file will take a lot of time and memory. Instead, if the file contains one JSON object per line, you can load the data line by line into a dictionary of dictionaries and convert that to a pandas DataFrame, which will help you with further analysis:
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        # print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data holds the dictionary data in DataFrame (table) format, but transposed,
# so finally transpose the DataFrame:
Data_1 = Data.T
