I have a JSON file that I want to convert into a DataFrame. Since the dataset is pretty large (~30 GB), I found that I need to set chunksize to limit how much is read at once. The code is like this:
import pandas as pd
pd.options.display.max_rows
datas = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize = 1, lines = True)
datas
Then when I run it, the result is:
<pandas.io.json._json.JsonReader at 0x15ce38550>
Is this an error?
I also found that if I loop over datas, it works (roughly as in the sketch below). Is there any way to do this the standard way, without the loop?
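For reference, this is a minimal sketch of the loop that works for me (the chunksize of 1 above was just for testing; a larger value would normally be used):
import pandas as pd

# Each chunk yielded by the JsonReader is a regular DataFrame.
reader = pd.read_json('/Users/xxxxx/Downloads/Books.json', lines=True, chunksize=100000)
for chunk in reader:
    print(chunk.head())
    break  # only look at the first chunk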
I don't think pandas is the way to go when reading giant json files.
First, check whether your file is actually valid JSON (i.e. completely wrapped in one dictionary) or a JSONL file (each row is one dictionary in JSON format, but the rows are not connected).
For example, if you are using the Amazon Review Dataset, which has these huge files, they are all JSONL but named .json.
Two packages I can recommend for parsing huge JSON/JSONL files:
Pretty fast: ujson
import ujson
with open(file_path) as f:
    file_contents = ujson.load(f)
Even faster: ijson
import ijson
file_contents = [t for t in ijson.items(open(file_path), "item")]
This allows you to use a progress bar like tqdm as well:
import ijson
from tqdm import tqdm
file_contents = [t for t in tqdm(ijson.items(open(file_path), "item"))]
This is useful for seeing how fast it is going, since it shows how many items it has already read. Because your file is 30 GB, reading it all might take quite a while, and it's good to know whether it's still going or whether memory crashed or anything else went wrong.
You can then try to create a DataFrame from the parsed records with pandas.DataFrame.from_dict(file_contents), but 30 GB of content is likely far more than will fit in memory as a single DataFrame. Generally I would really recommend working with dictionaries for this amount of content, as it's much faster, and only converting the parts you need for visualization or analysis into a DataFrame, for example as sketched below.
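A minimal sketch of that last point, assuming file_contents ended up as a list of dicts (the slice size is arbitrary):
import pandas as pd

# Only convert the part you actually want to look at, e.g. the first 100,000 records.
df_preview = pd.DataFrame(file_contents[:100000])
print(df_preview.head())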
I'm kind of new to Python and data science.
I have a 33 GB CSV dataset, and I want to parse it into a DataFrame to do some work on it.
I tried the 'casual' way with pandas.read_csv and it's taking ages to parse.
I searched on the internet and found this article.
It says that the most efficient way to read a large CSV file is to use csv.DictReader.
So I tried this:
import pandas as pd
import csv
df = pd.DataFrame(csv.DictReader(open("MyFilePath")))
Even with this solution it's taking ages to do the job.
Can you please tell me the most efficient way to parse a large dataset into pandas?
There is no way to read such a big file in a short time. Still, there are some strategies for dealing with large data; these are a few of them that let you implement your code without leaving the comfort of pandas:
Sampling
Chunking
Optimising Pandas dtypes
Parallelising Pandas with Dask.
The simplest option is sampling your dataset (this may be helpful for you). Sometimes a random part of a large dataset already contains enough information for the subsequent calculations. If you don't actually need to process your entire dataset, this is an excellent technique to use.
Sample code:
import pandas
import random

filename = "data.csv"
n = sum(1 for line in open(filename)) - 1  # number of data rows in the file
m = 10                                     # keep roughly 1/m of the rows
s = n // m                                 # size of the sample
skip = sorted(random.sample(range(1, n + 1), n - s))  # rows to skip (row 0 is the header and is kept)
df = pandas.read_csv(filename, skiprows=skip)
This is the link for chunking large data. A minimal sketch of chunking combined with explicit dtypes follows.
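The column names and dtypes here are made up for illustration; replace them with your real schema:
import pandas as pd

# Hypothetical dtypes; specifying them up front saves memory and parsing time.
dtypes = {"user_id": "int32", "rating": "float32", "category": "category"}

chunks = []
for chunk in pd.read_csv("data.csv", dtype=dtypes, chunksize=1000000):
    # Reduce each chunk before keeping it, e.g. filter or aggregate here.
    chunks.append(chunk[chunk["rating"] > 4.0])

df = pd.concat(chunks, ignore_index=True)
With Dask the same idea is roughly a one-liner (the filtering step stays lazy until compute()):
import dask.dataframe as dd
ddf = dd.read_csv("data.csv", dtype=dtypes)
df = ddf[ddf["rating"] > 4.0].compute()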
I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
By using it without a sep parameter, numpy assumes you are reading a binary file, hence the unexpected length. When you specify a separator, the text parser presumably gives up as soon as it hits something it cannot interpret as a number (here, the header row at the very start), which is why you end up with an empty array.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question); see the sketch below.
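For example, a minimal sketch using the two-column my_file.csv generated in the question (header row plus numeric data):
import numpy as np

# Skip the header row and split on commas; the result is a (1000000, 2) float array.
arr = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)
print(arr.shape)

# genfromtxt can also use the header row to build a structured array.
named = np.genfromtxt('my_file.csv', delimiter=',', names=True)
print(named.dtype.names)  # ('a', 'b')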
I have a dataset in CSV containing lists of values as strings in a single field that looks more or less like this:
Id,sequence
1,'1;0;2;6'
2,'0;1'
3,'1;0;9'
In the real dataset I'm dealing with, the sequence lengths vary greatly and can contain from one up to a few thousand observations. There are many columns containing sequences, all stored as strings.
I'm reading those CSV's and parsing strings to become lists nested inside Pandas DataFrame. This takes some time, but I'm ok with it.
However, when I later save the parsed results to pickle, the read time of this pickle file is very high.
I'm facing the following:
Reading a raw ~600 MB CSV file of such structure into pandas takes around ~3 seconds.
Reading the same (raw, unprocessed) data from pickle takes ~0.1 seconds.
Reading the processed data from pickle takes 8 seconds!
I'm trying to find a way to read processed data from disk in the quickest possible way.
Already tried:
Experimenting with different storage formats, but most of them can't store nested structures. The only one that worked was msgpack, but that didn't improve the performance much.
Using structures other than a pandas DataFrame (like a tuple of tuples) - similar performance.
I'm not very tied to the exact data structure. The thing is I would like to quickly read parsed data from disk directly to Python.
This might be a duplicate of this question.
HDF5 is quite a bit quicker at handling nested pandas dataframes. I would give that a shot.
An example usage borrowed from here shows how you can chunk it efficiently when dumping:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test.h5', key='df', mode='w', format='table', data_columns=True)

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 100
for i in range(nrows // chunksize + 1):
    chunk = store.select('df',
                         start=i * chunksize,
                         stop=(i + 1) * chunksize)
    # ... process each chunk here ...
store.close()
When reading it back, you can do it in chunks like this, too:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all',
                      start=0, stop=300000, chunksize=3000):
    print(df.info())
    print(df.head(5))
It seems that I can use either pandas or json to read a JSON file, i.e.
import pandas as pd
pd_example = pd.read_json('some_json_file.json')
or, equivalently,
import json
json_example = json.load(open('some_json_file.json'))
So my question is: what's the difference, and which one should I use? Is one way recommended over the other? Are there certain situations where one is better than the other? Thanks.
It depends.
When you have a single JSON structure inside a json file, use read_json because it loads the JSON directly into a DataFrame. With json.load, you have to load it into a Python dictionary/list first and then into a DataFrame - an unnecessary two-step process.
Of course, this is under the assumption that the structure is directly parsable into a DataFrame. For non-trivial structures (usually of the form of complex nested lists-of-dicts), you may want to use json_normalize instead; see the sketch below.
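A minimal sketch of json_normalize on a made-up nested record list (assuming a reasonably recent pandas, where it is exposed as pd.json_normalize):
import pandas as pd

# Hypothetical nested records, just for illustration.
records = [
    {"id": 1, "name": "a", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "b", "address": {"city": "Paris", "zip": "75001"}},
]

# Nested dicts are flattened into dotted column names such as "address.city".
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'name', 'address.city', 'address.zip']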
On the other hand, with a JSON Lines file the story is different. From my experience, loading a JSON Lines file with pd.read_json(..., lines=True) is actually slightly slower on large data (tested on ~50k+ records once), and to make matters worse, it cannot handle rows with errors - the entire read operation fails. In contrast, you can call json.loads on each line of your file inside a try-except block for some robust code, which actually ends up being a bit faster (sketched below). Go figure.
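Something like this, with a hypothetical file name:
import json
import pandas as pd

rows = []
with open('some_json_lines_file.json') as f:
    for line in f:
        try:
            rows.append(json.loads(line))
        except ValueError:
            # Skip malformed lines instead of failing the whole read.
            pass

df = pd.DataFrame(rows)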
Use whatever fits the situation.
I have a big 17 GB JSON file in HDFS. I need to read that file and convert it into a numpy array, which is then passed to a K-Means clustering algorithm. I tried many ways, but the system slows down and I get a memory error, or the kernel dies.
The code I tried is:
from hdfs3 import HDFileSystem
import pandas as pd
import numpy as nm
import json

hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for line in f:
        data = json.loads(line)
df = pd.DataFrame(data)
dataset = nm.array(df)
I tried using ijson, but I'm still not sure which is the right way to do this faster.
I would stay away from both numpy and pandas, since you will get memory issues in both cases. I'd rather stick with SFrame or the Blaze ecosystem, which are designed specifically to handle this kind of "big data" case. Amazing tools!
Because the data types are going to differ per column, a pandas DataFrame would be a more appropriate data structure to keep the data in. You can still manipulate the data with numpy functions.
import pandas as pd
data = pd.read_json('/user/iot_all_valid.json', dtype={<express converters for the different types here>})
In order to avoid the crashing issue, try running the K-Means on a small sample of the data set first and make sure that works as expected. Then you can increase the data size until you feel comfortable with the whole data set. A minimal sketch of the sampling step is below.
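This assumes the JSON has been loaded into a DataFrame df of numeric features as above; the cluster count and sample fraction are arbitrary:
from sklearn.cluster import KMeans

# Fit on a 1% random sample first; scale the fraction up once this works.
sample = df.sample(frac=0.01, random_state=42)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(sample.to_numpy())
print(kmeans.cluster_centers_)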
In order to deal with a numpy array potentially larger than the available RAM, I would use a memory-mapped numpy array. On my machine, ujson was 3.8 times faster than the builtin json module. Assuming rows is the number of lines of JSON:
from hdfs3 import HDFileSystem
import numpy as np
import ujson as json

hdfs = HDFileSystem(host='hostname', port=8020)

rows = int(1e8)
columns = 4

# 'w+' will overwrite any existing output.npy
out = np.memmap('output.npy', dtype='float32', mode='w+', shape=(rows, columns))

with hdfs.open('/user/iot_all_valid.json/') as f:
    for row, line in enumerate(f):
        data = json.loads(line)
        # convert data to a numerical array of length `columns`
        out[row] = data

out.flush()
# memmap closes on delete.
del out
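Not part of the original answer, but a minimal sketch of how the memmap could then be fed to scikit-learn's MiniBatchKMeans in batches, so the whole array never has to sit in RAM at once:
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rows, columns = int(1e8), 4  # must match the shape used when writing output.npy
data = np.memmap('output.npy', dtype='float32', mode='r', shape=(rows, columns))

km = MiniBatchKMeans(n_clusters=8, random_state=42)
batch = 100000
for start in range(0, rows, batch):
    km.partial_fit(data[start:start + batch])
print(km.cluster_centers_)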