Convert 17GB JSON file to a numpy array - python

I have a large 17 GB JSON file stored in HDFS. I need to read that file and convert it into a NumPy array, which is then passed to a K-Means clustering algorithm. I have tried several approaches, but the system slows down and I either get a memory error or the kernel dies.
The code I tried is:
from hdfs3 import HDFileSystem
import pandas as pd
import numpy as nm
import json
hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for line in f:
        data = json.loads(line)
df = pd.DataFrame(data)
dataset = nm.array(df)
I also tried ijson, but I am still not sure what the right way is to do this quickly.

I would stay away from both numpy and Pandas, since you will get memory issues in both cases. I'd rather stick with SFrame or the Blaze ecosystem, which are designed specifically to handle these kinds of "big data" cases. Amazing tools!

Because the data types are going to differ per column, a pandas dataframe would be a more appropriate data structure to keep the data in. You can still manipulate the data with numpy functions.
import pandas as pd
data = pd.read_json('/user/iot_all_valid.json', dtype={<express converters for the different types here>})
To avoid the crashes, try running the k-means on a small sample of the data set first. Make sure that works as expected. Then you can increase the data size until you feel comfortable with the whole data set.

To deal with a numpy array potentially larger than available RAM, I would use a memory-mapped numpy array. On my machine ujson was 3.8 times faster than the builtin json. Assuming rows is the number of lines of JSON:
from hdfs3 import HDFileSystem
import numpy as np
import ujson as json
hdfs = HDFileSystem(host='hostname', port=8020)
rows = int(1e8)
columns = 4
# 'w+' will overwrite any existing output.npy
out = np.memmap('output.npy', dtype='float32', mode='w+', shape=(rows, columns))
with hdfs.open('/user/iot_all_valid.json/') as f:
    for row, line in enumerate(f):
        data = json.loads(line)
        # convert data to numerical array
        out[row] = data
out.flush()
# memmap closes on delete.
del out
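The same idea can be tried on a small local file before pointing it at HDFS. The following is only a sketch under assumptions: the file is JSON Lines, every record is a flat object, and the field names `a` and `b` as well as the helper `jsonl_to_memmap` are hypothetical, not taken from the question.

```python
import json
import os
import tempfile

import numpy as np


def jsonl_to_memmap(src_path, out_path, rows, columns, keys):
    """Stream a JSON Lines file into a float32 memmap, one record per row."""
    out = np.memmap(out_path, dtype="float32", mode="w+", shape=(rows, columns))
    with open(src_path) as f:
        for row, line in enumerate(f):
            record = json.loads(line)
            # Pick the numeric fields in a fixed order (keys are hypothetical).
            out[row] = [record[k] for k in keys]
    out.flush()
    del out  # drop the reference so the mapping is released


# Tiny demo file standing in for the real data.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "demo.jsonl")
dst = os.path.join(tmpdir, "demo.dat")
with open(src, "w") as f:
    for i in range(5):
        f.write(json.dumps({"a": i, "b": i * 2.0}) + "\n")

jsonl_to_memmap(src, dst, rows=5, columns=2, keys=("a", "b"))

# A raw memmap file has no header, so dtype and shape must be given again.
loaded = np.memmap(dst, dtype="float32", mode="r", shape=(5, 2))
print(loaded[3])
```

Only one line of JSON is held in memory at a time; the array itself lives on disk and is paged in as needed. Note that `rows` has to be at least the number of lines in the file, otherwise the assignment raises an IndexError.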

Related

How to read a large file as Pandas dataframe?

I want to read a large file (4GB) as a Pandas dataframe. Since using Dask directly still consumes maximum CPU, I read the file as a pandas dataframe, then use dask_cudf, and then convert back to a pandas dataframe.
However, my code is still using maximum CPU on Kaggle. GPU accelerator is switched on.
import pandas as pd
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(df, npartitions=2)
meth_sub_nt = ddf.infer_objects()
I had a similar problem. After some research, I came across Vaex.
You can read about its performance here and here.
Essentially this is what you can try to do:
Read the csv file using Vaex and convert it to a hdf5 file (file format most optimised for Vaex)
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)
Open the hdf5 file using Vaex. Vaex will do the memory-mapping and thus will not load data into memory.
vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
Now you can perform operations on your Vaex dataframe much as you would with Pandas. Because the data is memory-mapped instead of loaded, you should notice clear performance gains (lower CPU and memory usage).
You can also try reading your CSV file directly into a Vaex dataframe without converting it to hdf5. I read somewhere that Vaex works fastest with hdf5 files, which is why I suggested the approach above.
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', chunk_size=5_000)
Right now your code suggests that you first attempt to load data using pandas and then convert it to dask-cuDF dataframe. That's not optimal (or might not even be feasible). Instead, one can use dask_cudf.read_csv function (see docs):
from dask_cudf import read_csv
ddf = read_csv('example_output/foo_dask.csv')

How to read json file without using python loop?

I have a JSON file that I want to convert into DataFrame. Since the dataset is pretty large (~30 GB), I found that I need to set the chunksize as the limitation. The code is like this:
import pandas as pd
pd.options.display.max_rows
datas = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize = 1, lines = True)
datas
Then when I run it the result is
<pandas.io.json._json.JsonReader at 0x15ce38550>
Is this an error?
I also found that looping over datas works. Is there a way to do this without a loop?
I don't think pandas is the way to go when reading giant json files.
First, check whether your file is actually a single valid JSON document (completely wrapped in one dictionary) or a JSONL file (each row is one dictionary in JSON format, but the rows are not connected).
For example, the Amazon Review Dataset ships huge files that are all JSONL but named JSON.
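One way to run that check without loading the whole file is to probe only the first chunk. This is a rough sketch: the helper name `looks_like_jsonl` and the probe size are my own choices, and a small single-line JSON document cannot be told apart from a one-record JSONL file this way.

```python
import json
import os
import tempfile


def looks_like_jsonl(path, probe_bytes=1 << 20):
    """Guess whether a file is JSON Lines by parsing its first few lines."""
    with open(path, "rb") as f:
        chunk = f.read(probe_bytes)
        truncated = f.read(1) != b""
    lines = chunk.split(b"\n")
    if truncated:
        lines = lines[:-1]  # the last line may be cut off mid-record
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return False  # no complete line inside the probe window
    try:
        for ln in lines[:5]:
            json.loads(ln)  # each line must be a standalone JSON value
    except ValueError:
        return False
    return True


# Demo: one JSONL file and one pretty-printed JSON document.
tmpdir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmpdir, "records.json")
with open(jsonl_path, "w", encoding="utf-8") as f:
    f.write('{"a": 1}\n{"a": 2}\n')

doc_path = os.path.join(tmpdir, "document.json")
with open(doc_path, "w", encoding="utf-8") as f:
    json.dump([{"a": 1}, {"a": 2}], f, indent=2)

is_jsonl = looks_like_jsonl(jsonl_path)
is_doc_jsonl = looks_like_jsonl(doc_path)
print(is_jsonl, is_doc_jsonl)
```

If the check comes back true, reading the file line by line is usually enough and no streaming parser is needed.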
Two packages I can recommend for parsing huge JSON/JSONL files:
Pretty fast: ujson
import ujson
with open(file_path) as f:
    file_contents = ujson.load(f)
Even faster: ijson
import ijson
file_contents = [t for t in ijson.items(open(file_path), "item")]
This allows you to use a progress bar like tqdm as well:
import ijson
from tqdm import tqdm
file_contents = [t for t in tqdm(ijson.items(open(file_path), "item"))]
This is useful for seeing how fast it is going, since it shows how many items have been read so far. Since your file is 30 GB, it might take quite a while to read it all, and it's good to know whether it's still going, whether it ran out of memory, or whether something else went wrong.
You can then try to create a DataFrame from the parsed contents with pandas.DataFrame.from_dict(file_contents), but I suspect 30 GB of content is more than will fit in memory, since the whole DataFrame has to live in RAM. Generally I would recommend working with dictionaries for this amount of content, as it's much faster. Then convert parts of it into a DataFrame only when you need to display them for visualization or analysis.
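On the original question about the `JsonReader` output: that is not an error. When `chunksize` is given, `pd.read_json` returns an iterator that yields one DataFrame per chunk, so some form of iteration is expected. A small sketch follows, with a made-up file and the hypothetical field names `asin` and `overall`:

```python
import json
import os
import tempfile

import pandas as pd

# Small JSONL file standing in for the 30 GB one.
path = os.path.join(tempfile.mkdtemp(), "books.jsonl")
with open(path, "w") as f:
    for i in range(7):
        f.write(json.dumps({"asin": f"B{i:03d}", "overall": i % 5 + 1}) + "\n")

# With chunksize set, read_json returns a JsonReader instead of a DataFrame.
reader = pd.read_json(path, lines=True, chunksize=3)

n_rows = 0
rating_sum = 0
for chunk in reader:  # each chunk is a regular DataFrame of up to 3 rows
    n_rows += len(chunk)
    rating_sum += int(chunk["overall"].sum())

print(n_rows, rating_sum / n_rows)
```

A chunk size of 1, as in the question, creates one DataFrame per line and is likely to be very slow; a value in the tens of thousands is a more common starting point. Aggregating per chunk keeps only one chunk in memory at a time.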

Python 33gb csv file Dataset to Pandas DataFrame

I'm fairly new to Python and data science.
I have a 33 GB CSV dataset, and I want to parse it into a DataFrame to do some work on it.
I tried the 'casual' way with pandas.read_csv, and it's taking ages to parse.
I searched on the internet and found this article.
It says that the most efficient way to read a large CSV file is to use csv.DictReader.
So I tried that:
import pandas as pd
import csv
df = pd.DataFrame(csv.DictReader(open("MyFilePath")))
Even with this solution it's taking ages to do the job.
Can you please tell me the most efficient way to parse a large dataset into pandas?
There is no way to read such a big file in a short time. Still, there are some strategies for dealing with large data that let you write your code without leaving the comfort of Pandas:
Sampling
Chunking
Optimising Pandas dtypes
Parallelising Pandas with Dask.
The simplest option is sampling your dataset (this may be helpful for you). Sometimes a random part of a large dataset already contains enough information for the next calculations. If you don't actually need to process your entire dataset, this is an excellent technique to use.
Sample code:
import pandas
import random
filename = "data.csv"
n = sum(1 for line in open(filename)) - 1  # number of data rows (excluding the header)
m = 10  # example value: keep roughly 1/m of the rows
s = n // m  # number of rows to keep
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
This is the link for Chunking large data.
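Chunking and dtype optimisation from the list above can be combined in one call. A minimal sketch, assuming a CSV with the hypothetical columns `id` and `value` (an in-memory buffer stands in for the real file):

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))
)

total = 0.0
rows = 0
# chunksize makes read_csv return an iterator of DataFrames instead of one frame;
# narrower dtypes roughly halve the memory needed per chunk compared to 64-bit defaults.
for chunk in pd.read_csv(
    csv_data, chunksize=4, dtype={"id": "int32", "value": "float32"}
):
    total += float(chunk["value"].sum())
    rows += len(chunk)

print(rows, total / rows)
```

This works well when the computation can be expressed as a running aggregate. If the whole table is needed at once, chunking alone does not help, and sampling or Dask is the better fit.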

Numpy CSV fromfile()

I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files. It is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
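When it is used without a sep parameter, numpy assumes you are reading a binary file and reinterprets the raw bytes as 8-byte floats, hence the unexpected length. When you specify a separator, as far as I can tell parsing stops at the first item that is not a number, which here is the header line, so the result is empty.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).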
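As a small illustration of those two functions, here is a sketch that reads a two-column CSV with a header line. The file contents are made up for the example.

```python
import os
import tempfile

import numpy as np

# Small CSV with a header, similar in layout to the file in the question.
path = os.path.join(tempfile.mkdtemp(), "my_file.csv")
with open(path, "w") as f:
    f.write("a,b\n100,200\n300,400\n500,600\n")

# loadtxt: skip the header and get a plain 2D float array.
arr = np.loadtxt(path, delimiter=",", skiprows=1)

# genfromtxt: names=True turns the header into field names of a structured array.
named = np.genfromtxt(path, delimiter=",", names=True)

print(arr.shape)
print(named["b"])
```

Both parse text row by row, so they will be much slower than the binary read that `fromfile` performs. For large CSV files `pandas.read_csv` is usually the faster choice.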

combining ranges for pandas (NumPy? core python?) indexing

I am loading data of size comparable to my memory limits, so I am conscious about efficient indexing and not making copies. I would need to work on columns 3:8 and 9: (also labeled), but combining ranges does not seem to work. Rearranging the columns in the underlying data is needlessly costly (an IO operation). Referencing two dataframes and combining them also sounds like something that would make copies. What is an efficient way to do this?
import numpy as np
import pandas as pd
data = pd.read_stata('S:/data/controls/lasso.dta')
X = pd.concat([data.iloc[:,3:8],data.iloc[:,9:888]])
By the way, if I could read in only half of my data (a random half, even), that would help, again I would not open the original data and save another, smaller copy just for this.
import numpy as np
import pandas as pd
data = pd.read_stata('S:/data/controls/lasso.dta')
cols = np.zeros(len(data.columns), dtype=bool)
cols[3:8] = True
cols[9:888] = True
X = data.iloc[:, cols]
del data
This still makes a copy (but just one...). It does not seem to be possible to return a view instead of a copy for this sort of shape (source).
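The view-versus-copy behaviour can be checked on a plain NumPy array with `np.shares_memory`. This sketch uses a small made-up array rather than the Stata data:

```python
import numpy as np

data = np.arange(40, dtype=np.float64).reshape(4, 10)

# A single contiguous slice is a view: no data is copied.
contiguous = data[:, 3:8]

# Two disjoint ranges need a boolean mask (or an index list), which produces a copy.
cols = np.zeros(data.shape[1], dtype=bool)
cols[3:8] = True
cols[9:] = True
selected = data[:, cols]

print(np.shares_memory(data, contiguous))
print(np.shares_memory(data, selected))
```

A view can only describe a regular stride pattern over the original buffer, and columns 3:8 plus 9: do not form one, which is why the combined selection has to be copied.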
Another suggestion is converting the .dta file to a .csv file (howto). Pandas read_csv is much more flexible: you can specify the columns you are interested in (usecols), and how many rows you would like to read (nrows). Unfortunately this requires a file copy.
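A short sketch of the `usecols` and `nrows` options mentioned above, using an in-memory CSV with hypothetical column names in place of the converted file:

```python
import io

import pandas as pd

# Stand-in for the CSV converted from the .dta file.
csv_data = io.StringIO(
    "c0,c1,c2,c3\n"
    + "\n".join(f"{i},{i + 1},{i + 2},{i + 3}" for i in range(100))
)

# Only the listed columns are parsed, and reading stops after 10 data rows.
subset = pd.read_csv(csv_data, usecols=["c1", "c3"], nrows=10)

print(subset.shape)
```

Columns that are not listed are never materialised in the DataFrame, so peak memory stays close to the size of the subset. Note that `nrows` takes the first rows, not a random sample.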
