All,
I've been trying to build a website (in Django) which is to be an index of all MTB routes in the world. I'm a Pythonian so wherever I can I try to use Python.
I've successfully extracted data from the OSM API (Display relation (trail) in leaflet) but found that doing this for all MTB trails (tag: route=mtb) is too much data (processing takes very long). So I tried to do everything locally by downloading a torrent of the entire OpenStreetMap dataset (from Latest Weekly Planet XML File) and filtering for tag: route=mtb using osmfilter (part of osmctools in Ubuntu 20.04), like this:
osmfilter $unzipped_osm_planet_file --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.osm
This produces a file of about 1.2 GB which, on closer inspection, seems to contain all the data I need. My goal was to transform the file into a pandas.DataFrame() so I could do some further filtering and transforming before pushing the relevant aspects into my Django DB. I tried to load the file as a regular XML file using pandas, but this crashed the Jupyter notebook kernel. I guess the data is too big.
My second approach was this solution: How to extract and visualize data from OSM file in Python. It worked for me; at least, I can get some of the information, like the tags of the relations in the file (and the other specified details). What I'm missing are the relation members (the ways) and then the way members (the nodes) and their latitudes/longitudes. I need these to achieve what I did here: Plotting OpenStreetMap relations does not generate continuous lines
I'm open to many solutions; for example, one could break the file up into many different files containing one relation and its members per file, using an osmium-based script. Perhaps then I can move on with pandas.read_xml(). This would be nice for batch processing and filling the database. Loading the whole OSM XML file into a pd.DataFrame would be nice, but I guess this really is a lot of data. Perhaps this can also be done on a per-relation basis with pyosmium?
Any help is appreciated.
OK, I figured out how to get what I want (all information per relation of the type "route=mtb" stored in an accessible way). It's a multi-step process; I'll describe it here.
First, I downloaded the world file (I went to wiki.openstreetmap.org/wiki/Planet.osm, opened the link for the .pbf file, and downloaded the world file as .pbf). Everything is done on Linux, and this file is referred to as $osm_planet_file below.
I converted this file to o5m using osmconvert (available in Ubuntu 20.04 via apt install osmctools), on the Linux CLI:
osmconvert --verbose --drop-version $osm_planet_file -o=$osm_planet_dir/planet.o5m
The next step is to filter all relations of interest out of this file (in my case I wanted all MTB routes: route=mtb) and store them in a new file, like this:
osmfilter $osm_planet_dir/planet.o5m --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.o5m
This creates a much smaller file that contains all information on the relations that are MTB routes.
From there on I switched to a Jupyter notebook and used Python3 to further divide the file into useful, manageable chunks. I first installed osmium using conda (in the env I created first but that can be skipped):
conda install -c conda-forge osmium
Then I made the recommended osm.SimpleHandler class. This class iterates through the large o5m file and can perform actions while doing so; this is the way to deal with these files, because they are far too big for memory. I chose to iterate through the file and store everything I needed in separate json files. This does generate more than 12,000 json files, but it can be done on my laptop with 8 GB of memory. This is the class:
import osmium as osm
import json
import os

data_dump_dir = '../data'

class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            data = dict()
            data['version'] = elem.version
            # keep only the way members, drop node members => could be a mistake
            data['members'] = [int(member.ref) for member in elem.members if member.type == 'w']
            data['visible'] = elem.visible
            data['timestamp'] = str(elem.timestamp)
            data['uid'] = elem.uid
            data['user'] = elem.user
            data['changeset'] = elem.changeset
            data['num_tags'] = len(elem.tags)
            data['key'] = tag.k
            data['value'] = tag.v
            data['deleted'] = elem.deleted
        # one json file per relation, named after the relation id
        with open(os.path.join(data_dump_dir, str(elem.id) + '.json'), 'w') as f:
            json.dump(data, f)

    def relation(self, r):
        self.tag_inventory(r, "relation")
Run the class like this:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
Now we have json files with the relation number as their filename, containing all metadata and a list of the ways. But we also want all the nodes per way, so that we can plot the full relations (the MTB routes). To achieve this, we parse the o5m file again (using a class built on the osm.SimpleHandler class), and this time we extract all way members (the nodes) and create a dictionary:
class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = dict()

    def tag_inventory(self, elem, elem_type):
        # note: ways without any tags never enter this loop, so they are not added to osm_data
        for tag in elem.tags:
            self.osm_data[int(elem.id)] = dict()
            # self.osm_data[int(elem.id)]['is_closed'] = str(elem.is_closed)
            self.osm_data[int(elem.id)]['nodes'] = [str(n) for n in elem.nodes]

    def way(self, w):
        self.tag_inventory(w, "way")
Execute the class:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
ways = osmhandler.osm_data
This gives us a dict (called ways) with all way IDs as keys and their node IDs (!meaning we need some more steps!) as values.
len(ways.keys())
>>> 337597
In the next (and almost last) step we add the node IDs for all ways to our relation jsons, so they become part of the files:
for relation_file in [
    os.path.join(data_dump_dir, file) for file in os.listdir(data_dump_dir) if file.endswith('.json')
]:
    with open(relation_file, 'r') as f:
        data = json.load(f)
    if 'members' in data:  # Make sure these steps are never performed twice
        try:
            data['ways'] = dict()
            for way in data['members']:
                data['ways'][way] = ways[way]['nodes']
            del data['members']
            with open(relation_file, 'w') as f:
                json.dump(data, f)
        except KeyError as err:
            print(err, relation_file)  # Not sure why some relations give errors?
So now we have relation jsons with all their ways, and all ways have all their node IDs; the last thing to do is to replace the node IDs with their values (latitude and longitude). I did this in 2 steps as well: first I built a nodeID:lat/lon dictionary, again using an osmium.SimpleHandler based class:
import osmium

class CounterHandler(osmium.SimpleHandler):
    def __init__(self):
        osmium.SimpleHandler.__init__(self)
        self.osm_data = dict()

    def node(self, n):
        self.osm_data[int(n.id)] = [n.location.lat, n.location.lon]
Execute the class:
h = CounterHandler()
h.apply_file("../data/world_mtb_routes.o5m")
nodes = h.osm_data
This gives us a dict with a latitude/longitude pair for every node ID. We can use this on our json files to fill the ways with coordinates (where there are now still only node IDs). I create these final json files in a new directory (data/with_coords in my case) because if there is an error, my original (input) json files are not affected and I can try again:
import os

relation_files = [file for file in os.listdir('../data/') if file.endswith('.json')]
for relation in relation_files:
    relation_file = os.path.join('../data/', relation)
    relation_file_with_coords = os.path.join('../data/with_coords', relation)
    with open(relation_file, 'r') as f:
        data = json.load(f)
    try:
        for way in data['ways']:
            node_coords_per_way = []
            for node in data['ways'][way]:
                node_coords_per_way.append(nodes[int(node)])
            data['ways'][way] = node_coords_per_way
        with open(relation_file_with_coords, 'w') as f:
            json.dump(data, f)
    except KeyError:
        print(relation)
Now I have what I need and I can start adding the info to my Django database, but that is beyond the scope of this question.
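As a quick sanity check, one of the resulting files can be plotted directly. This is just a sketch of my own (matplotlib and the relation ID used here are assumptions, not part of the pipeline above):

import json
import matplotlib.pyplot as plt

# hypothetical relation ID: pick any file that exists in data/with_coords
with open('../data/with_coords/1234567.json') as f:
    relation = json.load(f)

# each way becomes one line segment; together they should draw the route
for way_id, coords in relation['ways'].items():
    lats = [lat for lat, lon in coords]
    lons = [lon for lat, lon in coords]
    plt.plot(lons, lats, linewidth=1)

plt.title('Relation 1234567')
plt.show()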
Btw, there are some relations that give an error, I suspect that for some relations ways were labelled as nodes but I'm not sure. I'll update here if I find out. I also have to do this process regularly (when the world file updates, or every now and then) so I'll probably write something more concise later on, but for now this works and the steps are understandable, to me, after a lot of thinking at least.
All of the complexity comes from the fact that the data does not fit in memory, otherwise I'd have created a pandas.DataFrame in step one and been done with it. I could perhaps also have loaded the data into a database in one go, but I'm not that good with databases, yet.
I am struggling to find a way to retrieve metadata information from a FILE using GDAL.
Specifically, I would like to retrieve the band names and the order in which they are stored in a given file (may that be a GEOTIFF or a NETCDF).
For instance, if we follow the description within the GDAL documentation, we have the "GetMetadata" method of the gdal.Dataset (see here and here). Despite this method returning a whole set of information regarding the dataset, it does not provide the band names and the order in which they are stored within the given FILE. As a matter of fact, it appears to be an old problem (from 2015) that has not been solved yet (more info here). The "R" language has apparently already solved this problem (see here), though Python hasn't.
Just to be thorough here, I know that there are other Python packages that can help in this endeavour (e.g., xarray, rasterio, etc.); nevertheless, it is important to keep the set of packages used in a single script concise. Therefore, I would like to know a definite way to find the band (a.k.a., variable) names and the order in which they are stored within a single FILE using gdal.
Please, let me know your thoughts in this regard.
Below, I present a starting point for solving this issue, in which a file is opened by GDAL (creating a Dataset object).
from osgeo import gdal
from osgeo.gdal import Dataset

OpeneddatasetFile = gdal.Open(f'NETCDF:{input}/{file_name}.nc:' + var)

if isinstance(OpeneddatasetFile, Dataset):
    print("File opened successfully")
    # here is where one should be capable of fetching the variable (a.k.a., band) names
    # of the OpeneddatasetFile.
    # Ideally, it would be most welcome some kind of method that could return a dictionary
    # with this information, something like:
    # VariablesWithinFile = OpeneddatasetFile.getVariablesWithinFileAsDictionary()
I have finally found a way to retrieve variable names from the NETCDF file using GDAL, and that is thanks to the comments given by Robert Davy above.
I have organized the code into a set of functions to help its visualization. Notice that there is also a function for reading metadata from the NETCDF, which returns this info in a dictionary format (see the "readInfo" function).
from osgeo import gdal
from osgeo.gdal import Dataset, InfoOptions


def read_data(filename):
    dataset = gdal.Open(filename)
    if not isinstance(dataset, Dataset):
        raise FileNotFoundError("Impossible to open the netcdf file")
    return dataset


def readInfo(ds, infoFormat="json"):
    "how to: https://gdal.org/python/"
    info = gdal.Info(ds, options=InfoOptions(format=infoFormat))
    return info


def listAllSubDataSets(infoDict: dict):
    subDatasetVariableKeys = [x for x in infoDict["metadata"]["SUBDATASETS"].keys()
                              if "_NAME" in x]
    subDatasetVariableNames = [infoDict["metadata"]["SUBDATASETS"][x]
                               for x in subDatasetVariableKeys]
    formatedsubDatasetVariableNames = []
    for x in subDatasetVariableNames:
        # keep only the variable name at the end of 'NETCDF:"file.nc":variable'
        s = x.replace('"', '').split(":")[-1]
        formatedsubDatasetVariableNames.append(s)
    return formatedsubDatasetVariableNames


if "__main__" == __name__:
    filename = "netcdfFile.nc"
    ds = read_data(filename)
    infoDict = readInfo(ds)
    infoDict["VariableNames"] = listAllSubDataSets(infoDict)
I'm working with text and use torchtext.data.Dataset.
Creating the dataset takes a considerable amount of time.
For just running the program this is still acceptable. But I would like to debug the torch code for the neural network, and if Python is started in debug mode, the dataset creation takes roughly 20 minutes (!!). That's just to get a working environment where I can debug-step through the neural network code.
I would like to save the Dataset, for example with pickle. This sample code is taken from here, but I removed everything that is not necessary for this example:
import pickle

from torchtext import data
from fastai.nlp import *

PATH = 'data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

TEXT = data.Field(lower=True, tokenize="spacy")

bs = 64
bptt = 70

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

with open("md.pkl", "wb") as file:
    pickle.dump(md, file)
To run the code, you need the aclImdb dataset, it can be downloaded from here. Extract it into a data/ folder next to this code snippet. The code produces an error in the last line, where pickle is used:
Traceback (most recent call last):
File "/home/lhk/programming/fastai_sandbox/lesson4-imdb2.py", line 27, in <module>
pickle.dump(md, file)
TypeError: 'generator' object is not callable
The samples from fastai often use dill instead of pickle. But that doesn't work for me either.
I came up with the following functions for myself:
import dill
from pathlib import Path
import torch
from torchtext.data import Dataset
def save_dataset(dataset, path):
if not isinstance(path, Path):
path = Path(path)
path.mkdir(parents=True, exist_ok=True)
torch.save(dataset.examples, path/"examples.pkl", pickle_module=dill)
torch.save(dataset.fields, path/"fields.pkl", pickle_module=dill)
def load_dataset(path):
if not isinstance(path, Path):
path = Path(path)
examples = torch.load(path/"examples.pkl", pickle_module=dill)
fields = torch.load(path/"fields.pkl", pickle_module=dill)
return Dataset(examples, fields)
Note that the actual objects could be a bit different; for example, if you save a TabularDataset, then load_dataset returns an instance of class Dataset. This is unlikely to affect the data pipeline, but may require extra diligence for tests.
In the case of a custom tokenizer, it should be serializable as well (e.g. no lambda functions, etc).
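A quick usage sketch of these helpers (the toy fields, examples and cache path below are my own additions, just to illustrate the round trip):

from torchtext.data import Example, Field

# build a tiny throwaway dataset just to exercise the helpers above
TEXT = Field(sequential=True, lower=True)
LABEL = Field(sequential=False)
fields = [("text", TEXT), ("label", LABEL)]
examples = [Example.fromlist(["a tiny example sentence", "pos"], fields)]

save_dataset(Dataset(examples, fields), ".cache/tiny")  # placeholder path
restored = load_dataset(".cache/tiny")
print(len(restored.examples), list(restored.fields))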
You can use dill instead of pickle. It works for me.
You can save a torchtext Field like
import dill
from torchtext import data

# "tokenizer" is whatever tokenizer function you use for the field
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=200, batch_first=True)

with open("model/TEXT.Field", "wb") as f:
    dill.dump(TEXT, f)
And load a Field like
with open("model/TEXT.Field","rb")as f:
TEXT=dill.load(f)
Official code support is under development; you can follow https://github.com/pytorch/text/issues/451 and https://github.com/pytorch/text/issues/73.
You can always use pickle to dump the objects, but keep in mind that dumping a list of dictionary or Field objects is not taken care of by the module, so it is best to decompose the list first.
To store the Dataset object to a pickle file for easy loading later:
import pickle

def save_to_pickle(dataSetObject, PATH):
    with open(PATH, 'wb') as output:
        for i in dataSetObject:
            pickle.dump(vars(i), output, pickle.HIGHEST_PROTOCOL)
The toughest part is yet to come: loading the pickle file... ;)
First, try to look for all the field names and field attributes, and then go for the kill.
To load the pickle file into a Dataset object:
import pickle
from torchtext.data import Dataset, Example

def load_pickle(PATH, FIELDNAMES, FIELD):
    dataList = []
    with open(PATH, "rb") as input_file:
        while True:
            try:
                # Taking the dictionary instance as the input instance
                inputInstance = pickle.load(input_file)
                # Plugging it into the list
                dataInstance = [inputInstance[FIELDNAMES[0]], inputInstance[FIELDNAMES[1]]]
                # Finally creating a list of Example objects
                dataList.append(Example.fromlist(dataInstance, fields=FIELD))
            except EOFError:
                break
    # At last, creating a Dataset object
    exampleListObject = Dataset(dataList, fields=FIELD)
    return exampleListObject
This hackish solution has worked in my case, hope you will find it useful in your case too.
Btw any suggestion is welcome :).
The pickle/dill approach is fine if your dataset is small. But if you are working with large datasets I wouldn't recommend it, as it will be too slow.
I simply save the examples (iteratively) as JSON strings. The reason behind this is that saving the whole Dataset object takes a lot of time, plus you need serialization tricks such as dill, which makes the serialization even slower.
Moreover, these serializers take a lot of memory (some of them even create copies of the dataset) and if they start making use of the swap memory, you're done. That process is gonna take so long that you will probably terminate it before it finishes.
Therefore, I end up with the following approach:
Iterate over the examples
Convert each example (on the fly) to a JSON string
Write that JSON string into a text file (one sample per line)
When loading, add the examples to the Dataset object along with the fields
import json

def save_examples(dataset, savepath):
    with open(savepath, 'w') as f:
        # Save the number of elements (not really needed)
        total = len(dataset.examples)
        f.write(json.dumps(total))  # Write examples length
        f.write("\n")
        # Save elements
        for pair in dataset.examples:
            data = [pair.src, pair.trg]
            f.write(json.dumps(data))  # Write samples
            f.write("\n")
import time

def load_examples(filename):
    examples = []
    with open(filename, 'r') as f:
        start = time.time()
        # Read the number of elements (not really needed)
        total = json.loads(f.readline())
        # Read elements
        for i in range(total):
            line = f.readline()
            example = json.loads(line)
            # example = data.Example().fromlist(example, fields)  # Create Example obj. (you can do it here or later)
            examples.append(example)
        end = time.time()
        print(end - start)
    return examples
Then, you can simply rebuild the dataset by:
from torchtext import data
from torchtext.data import Dataset

# Define fields
SRC = data.Field(...)
TRG = data.Field(...)
fields = [('src', SRC), ('trg', TRG)]

# Load examples from JSON and convert them to "Example objects"
examples = load_examples(filename)
examples = [data.Example.fromlist(d, fields) for d in examples]

# Build dataset
mydataset = Dataset(examples, fields)
The reason why I use JSON instead of pickle, dill, msgpack, etc is not arbitrary.
I did some tests and these are the results:
Dataset size: 2x (1,960,641)
Saving times:
- Pickle/Dill*: >30-45 min (...or froze my computer)
- MessagePack (iterative): 123.44 sec
100%|██████████| 1960641/1960641 [02:03<00:00, 15906.52it/s]
- JSON (iterative): 16.33 sec
100%|██████████| 1960641/1960641 [00:15<00:00, 125955.90it/s]
- JSON (bulk): 46.54 sec (memory problems)
Loading times:
- Pickle/Dill*: -
- MessagePack (iterative): 143.79 sec
100%|██████████| 1960641/1960641 [02:23<00:00, 13635.20it/s]
- JSON (iterative): 33.83 sec
100%|██████████| 1960641/1960641 [00:33<00:00, 57956.28it/s]
- JSON (bulk): 27.43 sec
*Similar approach as the other answers
I'm using python in the lab to control measurements. I often find myself looping over a value (let's say voltage), measuring another (current) and repeating that measurement a couple of times to be able to average the results later. Since I want to keep all the measured data, I like to write it to disk immediately and to keep things organized I use the hdf5 file format. This file format is hierarchical, meaning it has some sort of directory structure inside that uses Unix style names (e.g. / is the root of the file). Groups are the equivalent of directories and datasets are more or less equivalent to files and contain the actual data. The code resulting from such an approach looks something like:
import h5py

hdf_file = h5py.File('data.h5', 'w')

for v in range(5):
    group = hdf_file.create_group('/' + str(v))
    v_source.voltage = v
    for i in range(3):
        group2 = group.create_group(str(i))
        current = i_probe.current
        group2.create_dataset('current', data=current)

hdf_file.close()
I've written a small library to handle the communication with instruments in the lab and I want this library to automatically store the data to file, without explicitly instructing to do so in the script. The problem I run into when doing this is that the groups (or directories if you prefer) still need to be explicitly created at the start of the for loop. I want to get rid of all the file handling code in the script and therefore would like some way to automatically write to a new group on each iteration of the for loop. One way of achieving this would be to somehow modify the for statement itself, but I'm not sure how to do this. The for loop can of course be nested in more elaborate experiments.
Ideally I would be left with something along the lines of:
import h5py

hdf_file = h5py.File('data.h5', 'w')

for v_source.voltage in range(5):  # v_source.voltage = x sets the voltage of a physical device to x
    for i in range(3):
        current = i_probe.current  # i_probe.current reads the current from a physical device
        current_group.create_dataset('current', data=current)

hdf_file.close()
Any pointers to implement this solution or something equally readable would be very welcome.
Edit:
The code below includes all class definitions etc and might give a better idea of my intentions. I'm looking for a way to move all the file IO to a library (e.g. the Instrument class).
import h5py

class Instrument(object):
    def __init__(self, address):
        self.address = address

    @property
    def value(self):
        print('getting value from {}'.format(self.address))
        return 2  # dummy value instead of value read from instrument

    @value.setter
    def value(self, value):
        print('setting value of {} to {}'.format(self.address, value))

source1 = Instrument('source1')
source2 = Instrument('source2')
probe = Instrument('probe')

hdf_file = h5py.File('data.h5', 'w')

for v in range(5):
    source1.value = v
    group = hdf_file.create_group('/' + str(v))
    group.attrs['address'] = source1.address
    for i in range(4):
        source2.value = i
        group2 = group.create_group(str(i))
        group2.attrs['address'] = source2.address
        group2.create_dataset('current', data=probe.value)

hdf_file.close()
Without seeing more of the code it is hard to say, but essentially the Pythonic way to do this is: every time you add a new dataset, check whether the group exists; if it does, append the new dataset to it, and if it doesn't, create the group first. This question might help:
Writing to a new file if not exist, append to file if it do exist
Instead of writing a new file, use the same idea to create a group. Another helpful one might be:
How to check if a directory exists and create it if necessary?
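A minimal sketch of that idea with h5py (require_group creates a group only if it does not already exist; the helper name and the dummy values are my own, not from the question):

import h5py

def store_measurement(hdf_file, group_path, name, value):
    # create the group hierarchy on demand, then append the dataset to it
    group = hdf_file.require_group(group_path)  # creates e.g. /0/1 only if missing
    group.create_dataset(name, data=value)

with h5py.File('data.h5', 'w') as hdf_file:
    for v in range(5):
        for i in range(3):
            current = 0.001 * v * i  # placeholder for i_probe.current
            store_measurement(hdf_file, '/{}/{}'.format(v, i), 'current', current)

A library could call a helper like this internally, so the measurement script itself never touches the file-handling code.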
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code that I've written takes a very long time even with 50,000 records.
This is the code with which I'm trying to populate the database:
import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"
The code works fine since I have tested this with dummy records, and it gives the desired results. I would like to know if there is a way using which I can lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically the only way to do this efficiently through Django's ORM is to use bulk_create. So the first thing to consider is the use of get_or_create. If your database has existing records that might have duplicates in the input file, then your only choice is writing the SQL yourself. If you use it to avoid duplicates inside the input file, then preprocess it to remove duplicate rows.
So if you can live without the get part of get_or_create, then you can follow this strategy:
Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns)) and save all instances to an array. Pass the array to bulk_create to actually create the rows.
Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.
Repeat step 1 but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2. Then instantiate each Contact in the following way: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, just pass them to bulk_create to create the rows.
This should radically improve performance as you'll be no longer executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or the input file.
EDIT Here's the code:
import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]
            duns_instances.append(Contact_DUNS(duns=duns))
    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)
    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"
CSV Import
First of all, 6 million records is quite a lot for SQLite, and worse still, SQLite isn't very good at importing CSV data directly.
There is no standard as to what a CSV file should look like, and the SQLite shell does not even attempt to handle all the intricacies of interpreting a CSV file. If you need to import a complex CSV file and the SQLite shell doesn't handle it, you may want to try a different front end, such as SQLite Database Browser.
On the other hand, MySQL and PostgreSQL are more capable of handling CSV data, and MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of SQLite
You are using django => you are building a web app => more than one user will access the database. This is from the manual about concurrency.
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
Even your read operations are likely to be rather slow because an sqlite database is just one single file. So with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks as is possible with proper client server databases.
The good news for you is that with Django you can usually switch from Sqlite to Mysql to Postgresql just by changing your settings.py. No other changes are needed. (The reverse isn't always true)
So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help to avoid problems that you will run into sooner or later.
6,000,000 records is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that directly imports the CSV data and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I used to import 20,000,000 records using the sqlite3 CSV importer, and it took only a few minutes.
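If you do stay in Python but want to bypass the ORM entirely, a minimal sketch with the standard library gives an idea of the approach (the table and column names here are placeholders and ignore the Contact/Contact_DUNS split, so this is not a drop-in replacement):

import csv
import sqlite3

conn = sqlite3.connect("db.sqlite3")  # placeholder database path
conn.execute("CREATE TABLE IF NOT EXISTS contact (duns TEXT, fullName TEXT, title TEXT)")

with open("filename") as f:
    reader = csv.reader(f, delimiter="|")
    rows = ((col[1], col[8], col[12]) for col in reader)
    # all inserts run inside one transaction (committed below), which is what makes this fast
    conn.executemany("INSERT INTO contact (duns, fullName, title) VALUES (?, ?, ?)", rows)

conn.commit()
conn.close()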
I have some JSON files of 500 MB.
If I use the "trivial" json.load() to load their content all at once, it consumes a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print prefix, the_type, value
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
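To get a concrete feel for the event stream, here is a tiny self-contained sketch (the sample document is made up):

import io
import ijson

sample = io.BytesIO(b'{"earth": {"europe": [{"name": "Paris", "type": "city"}]}}')
for prefix, the_type, value in ijson.parse(sample):
    # prints one (prefix, event, value) tuple per parser event,
    # e.g. ('earth.europe.item.name', 'string', 'Paris')
    print(prefix, the_type, value)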
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
Hope this helps.
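A minimal sketch of that second suggestion (the driver and the script and file names are hypothetical; subprocess.run is used here instead of Popen for brevity):

import subprocess

# process_one.py is a hypothetical script that parses a single JSON file
list_of_files = ["a.json", "b.json", "c.json"]  # placeholder file names
for json_file in list_of_files:
    # each file is handled in its own short-lived process, so its memory is
    # fully released when the process exits
    subprocess.run(["python", "process_one.py", json_file], check=True)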
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here and check out the README for examples. It's fast because it uses the 'C' yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those drug objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
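One way to get that kind of structural overview without loading the whole file is to reuse ijson, which is already mentioned in this thread; a small sketch of my own (the file name is a placeholder):

import ijson
from collections import Counter

key_counts = Counter()
with open('big.json', 'rb') as f:  # placeholder file name
    for prefix, event, value in ijson.parse(f):
        if event == 'map_key':
            # count each key under its parent prefix to sketch the hierarchy
            key_counts[prefix + '.' + value if prefix else value] += 1

for key, count in key_counts.most_common(20):
    print(count, key)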
You can parse the JSON file into a CSV file, processing it event by event:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line: read each line's key/value pairs into a dictionary, append that dictionary to a final dictionary, and convert that to a pandas DataFrame, which will help with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        # print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data gives you the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T