I'm not sure why this Pickle example is not showing both of the dictionary definitions. As I understand it, "ab+" should mean that the pickle.dat file is being appended to and can be read from. I'm new to the whole pickle concept, but the tutorials on the net don't seem to go beyond just the initial storage.
import cPickle as pickle

def append_object(d, fname):
    """appends a pickle dump of d to fname"""
    print "append_hash", d, fname
    with open(fname, 'ab') as pickler:
        pickle.dump(d, pickler)
db_file = 'pickle.dat'
cartoon = {}
cartoon['Mouse'] = 'Mickey'
append_object(cartoon, db_file)
cartoon = {}
cartoon['Bird'] = 'Tweety'
append_object(cartoon, db_file)
print 'loading from pickler'
with open(db_file, 'rb') as pickler:
    cartoon = pickle.load(pickler)
print 'loaded', cartoon
Ideally, I was hoping to build up a dictionary using a for loop and then add the key:value pair to the pickle.dat file, then clear the dictionary to save some RAM.
What's going on here?
Don't use pickle for that. Use a database.
Python's dbm module seems to fit what you want perfectly. It presents you with a dictionary-like interface, but the data is saved to disk.
Example usage:
>>> import dbm
>>> x = dbm.open('/tmp/foo.dat', 'c')
>>> x['Mouse'] = 'Mickey'
>>> x['Bird'] = 'Tweety'
Tomorrow you can load the data:
>>> import dbm
>>> x = dbm.open('/tmp/foo.dat', 'c')
>>> print x['Mouse']
Mickey
>>> print x['Bird']
Tweety
I started to edit your code for readability and factored out append_object in the process.
There are multiple confusions here. The first is that pickle.dump writes a Python object in its entirety. You can put multiple objects in a pickle file, but each one needs its own pickle.load. The code did exactly what you asked of it and loaded the first dictionary you wrote to the file. The second dictionary was there waiting to be read, but it isn't concatenated onto the first; it is a separate loadable object.
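One way to read everything back is to keep calling pickle.load on the same file handle until it raises EOFError. Here is a minimal sketch of that idea; the helper name load_all_objects is just illustrative, not anything from your code:

import pickle  # cPickle under Python 2

def load_all_objects(fname):
    """Yield every object that was dumped into fname, one per pickle.load call."""
    with open(fname, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

# Both appended dictionaries come back, in the order they were written.
all_cartoons = list(load_all_objects('pickle.dat'))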
Don't underestimate the importance of names. append_object isn't a great name, but it is different than append_to_object.
If you are opening a file for reading, just open it for reading and the same for writing or appending. Not only does it make your intentions more clear but it prevents silly errors.
I am new here and trying to solve an interesting question about World of Tanks. I have heard that every battle's data is saved on the client's disk in the Wargaming.net folder, and I want to do some batch data analysis of our clan's battle performance.
It is said that these .dat files are a kind of JSON file, so I tried a couple of lines of Python to read one, but it failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and obviously fails. Could anyone tell me what is actually going on with these files?
Added on Feb. 9th, 2022
After trying another snippet in a Jupyter Notebook, it seems something can be read out of the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
cache_file = open(file, 'rb') # This can be improved so the file isn't kept open.
# These pickles were written by Python 2, so under Python 3 you need encoding='bytes' or 'latin1'.
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
First, you can look at the replay file itself in a text editor, though it won't cleanly display the encoded data at the beginning of the file, which has to be cleaned out. Then there is a ton of info you have to read in and figure out, but it is the stats for each player in the game. Only after that comes the part that deals with the actual replay, and you don't need that stuff.
You can grab the player IDs and tank IDs from WoT developer area API if you want.
After loading the pickle files as gabzo mentioned, you will see that the result is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile

WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"

archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (for example with uncompyle6) and go through the code to see the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount in gabzo's code) always has a checksum as its first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled Python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.
I have a list in my program and a function to append to it. Unfortunately, when you close the program the items you added go away and the list goes back to how it started. Is there any way I can store the data so that the user can re-open the program and see the full list?
You may try the pickle module to store the in-memory data on disk. Here is an example:
store data:
import pickle
dataset = ['hello','test']
outputFile = 'test.data'
fw = open(outputFile, 'wb')
pickle.dump(dataset, fw)
fw.close()
load data:
import pickle
inputFile = 'test.data'
fd = open(inputFile, 'rb')
dataset = pickle.load(fd)
print dataset
You can make a database and save your data there, for example with SQLite, or simply write it to a .txt file. For example:
with open("mylist.txt","w") as f: #in write mode
f.write("{}".format(mylist))
Your list goes into the format() call. It will create a .txt file named mylist.txt and save your list data into it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f: #in read mode, not in write mode, careful
rd=f.readlines()
print (rd)
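Note that readlines() gives you the saved text back as strings, not as a real list object. If you need the actual list again, one hedged option is ast.literal_eval, sketched below; it assumes the file was written with str()/format() as shown above and that the list holds only simple literal values:

import ast

# Turn the saved text representation back into a real Python list.
with open("mylist.txt") as f:
    mylist = ast.literal_eval(f.read())
print(mylist)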
The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.
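For instance, a minimal sqlite3 sketch of the same persistence idea might look like this (the file name items.db and the table name are just examples, not anything prescribed):

import sqlite3

conn = sqlite3.connect('items.db')  # the database file persists between runs
conn.execute('CREATE TABLE IF NOT EXISTS items (value TEXT)')

def add_item(value):
    with conn:  # commits the INSERT automatically on success
        conn.execute('INSERT INTO items (value) VALUES (?)', (value,))

def load_items():
    return [row[0] for row in conn.execute('SELECT value FROM items')]

add_item('hello')
print(load_items())  # the list is still there after the program restarts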
For storing big data, the HDF5 format is suitable. It is accessible from Python via the h5py library.
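A rough sketch of what that looks like with h5py (the file and dataset names here are arbitrary examples, and h5py has to be installed separately):

import numpy as np
import h5py  # third-party package: pip install h5py

# Write a large array to an HDF5 file.
with h5py.File('bigdata.h5', 'w') as f:
    f.create_dataset('measurements', data=np.random.rand(1000000))

# Read back only a slice; HDF5 does not load the whole dataset into memory.
with h5py.File('bigdata.h5', 'r') as f:
    first_thousand = f['measurements'][:1000]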
The game is about tamaguchis, and I want the tamaguchi to remember its last size and its 3 last actions the next time it is played. I also want the date to matter, so if you don't play with it for a week it shrinks in size. So the first step, I thought, was to save all the relevant data to a text file, and each time the game starts the code searches through the text file and extracts the relevant data again. But I can't even get step 1 working :( I mean, I don't get why this doesn't work:
file = open("Tamaguchis.txt","w")
date = time.strftime("%c")
dictionary = {"size":tamaguchin.size,"date":date,"order":lista}
file.write(dictionary)
It says that it can't write dictionaries, only strings, to a text file. But that's not correct, is it? I thought you were supposed to be able to put dictionaries in text files? :o
If anyone also has an idea on how to calculate the difference between the current date and the date saved in the text file, that'd be much appreciated :)
Sorry if noob question, and thanks a lot!
If your dictionary consists only of simple Python objects, you can use the json module to serialize it and write it to a file.
import json
import time

with open("Tamaguchis.txt", "w") as file:
    date = time.strftime("%c")
    dictionary = {"size": tamaguchin.size, "date": date, "order": lista}
    file.write(json.dumps(dictionary))
The same can then be read back with loads:
import json

with open("Tamaguchis.txt", "r") as file:
    dictionary = json.loads(file.read())
If your dictionary may contain more complex objects, you can either define a JSON serializer for them or use the pickle module. Note that the latter can make it possible to invoke arbitrary code if not used properly.
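As a rough illustration of the first option, json.dumps accepts a default= hook that converts otherwise unserializable objects into something JSON can store; the datetime case below is just an example of such an object:

import json
import datetime

def to_serializable(obj):
    # Convert objects json can't handle natively; extend with your own types as needed.
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("Cannot serialize %r" % type(obj))

record = {"size": 3, "date": datetime.datetime.now()}
print(json.dumps(record, default=to_serializable))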
You need to convert the dict to a string:
file.write(str(dictionary))
... though you might want to use pickle, json or yaml for the task - reading back is easier/safer then.
Oh, and for date and time calculations you might want to check out datetime.timedelta.
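For example, here is a minimal sketch of the date-difference part of the question. It assumes you store an ISO timestamp instead of the "%c" string (which is easier to parse back, and fromisoformat needs Python 3.7+); the one-week threshold just mirrors the example in the question:

import datetime

# When saving: store an unambiguous timestamp.
saved = datetime.datetime.now().isoformat()

# Later, after reading `saved` back from the file:
saved_date = datetime.datetime.fromisoformat(saved)
elapsed = datetime.datetime.now() - saved_date      # a datetime.timedelta
if elapsed > datetime.timedelta(days=7):
    print("the tamaguchi shrinks")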
import pickle

a = {'a': 1, 'b': 2}

with open('temp.txt', 'w') as writer:
    data = pickle.dumps(a)
    writer.write(data)

with open('temp.txt', 'r') as reader:
    data2 = pickle.loads(reader.read())

print data2
print type(data2)
Output:
{'a': 1, 'b': 2}
<type 'dict'>
If you care about efficiency, ujson or cPickle may be better.
I have two problems with loading data in Python. Both scripts work properly, but they take too much time to run, and sometimes the result is "Killed" (with the first one).
I have a big zipped text file and I do something like this:
import gzip
import cPickle as pickle

f = gzip.open('filename.gz', 'r')
tab = {}
for line in f:
    # fill tab
with open("data_dict.pkl", "wb") as g:
    pickle.dump(tab, g)
f.close()
I have to do some operations on the dictionary I created in the previous script
import cPickle as pickle

with open("data_dict.pkl", "rb") as f:
    tab = pickle.load(f)
f.close()

# operations on tab (the dictionary)
Do you have other solutions in mind? Maybe not ones involving YAML or JSON...
If the data you are pickling is primitive and simple, you can try the marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.
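A minimal sketch of what that swap looks like, assuming tab holds only built-in types (dicts, lists, strings, numbers), which is all marshal supports; the file name is just an example:

import marshal

tab = {'a': 1, 'b': [2, 3]}  # example data; must contain only built-in types

# Dump to disk.
with open("data_dict.marshal", "wb") as g:
    marshal.dump(tab, g)

# Load it back.
with open("data_dict.marshal", "rb") as f:
    tab = marshal.load(f)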
First, one comment: in
with open("data_dict.pkl", "rb") as f:
tab = pickle.load(f)
f.close()
f.close() is not necessary, the context manager (with syntax) does that automatically.
Now as for speed, I don't think you're going to get much faster than cPickle for reading something from disk directly as a Python object. If this script needs to be run over and over, I would try using memcached via pylibmc to keep the object stored persistently in memory so you can access it lightning fast:
import pylibmc
mc = pylibmc.Client(["127.0.0.1"], binary=True,behaviors={"tcp_nodelay": True,"ketama": True})
d = range(10000) ## some big object
mc["some_key"] = d ## save in memory
Then after saving it once you can access and modify it, it stays in memory even after the previous program finishes its execution:
import pylibmc
mc = pylibmc.Client(["127.0.0.1"], binary=True,behaviors={"tcp_nodelay": True,"ketama": True})
d = mc["some_key"] ## load from memory
d[0] = 'some other value' ## modify
mc["some_key"] = d ## save to memory again
I have some JSON files of 500 MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a plain, line-delimited text file, I would be able to iterate over the lines. I am looking for an analogue to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print prefix, the_type, value
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting or ending a map or array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
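A rough sketch of that driver process, assuming a hypothetical parse_one.py script that handles exactly one file and then exits (releasing all of its memory):

import subprocess
import sys

files = ['a.json', 'b.json', 'c.json']  # example file list

for json_file in files:
    # Each file is parsed in a fresh interpreter, so its memory is freed when it exits.
    p = subprocess.Popen([sys.executable, 'parse_one.py', json_file])
    p.wait()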
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The workings of ijson are very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson uses for each element of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')  # 'drug' matches the sample data above
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
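A hedged sketch of what that could look like with pymongo; the database and collection names are arbitrary, it assumes a local MongoDB server, and in practice you would feed documents in one at a time from a streaming parser rather than hard-coding them:

from pymongo import MongoClient  # third-party: pip install pymongo

client = MongoClient('localhost', 27017)
collection = client['jsondata']['docs']  # example database/collection names

# Insert documents one at a time instead of holding the whole blob in memory.
collection.insert_one({'name': 'example', 'value': 42})

# Query back only what you need.
for doc in collection.find({'value': {'$gt': 0}}):
    print(doc)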
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
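One rough way to do that key-name survey without writing a parser from scratch is to lean on ijson again and only look at the map_key events; the file name huge.json is just a placeholder:

import ijson

seen = set()
with open('huge.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        # Record each (location, key name) pair once to sketch out the structure.
        if event == 'map_key' and (prefix, value) not in seen:
            seen.add((prefix, value))
            print(prefix, value)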
You can parse the JSON file into a CSV file, writing it out record by record:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line, build a dictionary of key/value pairs for each line, append that dictionary to a final dictionary, and convert that to a pandas DataFrame, which will help with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair of each JSON line
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data gives you the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T