Reading multiple JSON records into a Pandas dataframe - python

I'd like to know if there is a memory efficient way of reading multi record JSON file ( each line is a JSON dict) into a pandas dataframe. Below is a 2 line example with working solution, I need it for potentially very large number of records. Example use would be to process output from Hadoop Pig JSonStorage function.
import json
import pandas as pd
test='''{"a":1,"b":2}
{"a":3,"b":4}'''
#df=pd.read_json(test,orient='records') doesn't work, expects []
l=[ json.loads(l) for l in test.splitlines()]
df=pd.DataFrame(l)

Note: Line separated json is now supported in read_json (since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
It's going to depend on the size of you DataFrames which is faster, but another option is to use str.join to smash your multi line "JSON" (Note: it's not valid json), into valid json and use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower, if around 100 it's the similar, signicant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: of that time the join is surprisingly fast.

If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:
with open('test.json') as f:
data = pd.DataFrame(json.loads(line) for line in f)
Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.

As of Pandas 0.19, read_json has native support for line-delimited JSON:
pd.read_json(jsonfile, lines=True)

++++++++Update++++++++++++++
As of v0.19, Pandas supports this natively (see https://github.com/pandas-dev/pandas/pull/13351). Just run:
df=pd.read_json('test.json', lines=True)
++++++++Old Answer++++++++++
The existing answers are good, but for a little variety, here is another way to accomplish your goal that requires a simple pre-processing step outside of python so that pd.read_json() can consume the data.
Install jq https://stedolan.github.io/jq/.
Create a valid json file with cat test.json | jq -c --slurp . > valid_test.json
Create dataframe with df=pd.read_json('valid_test.json')
In ipython notebook, you can run the shell command directly from the cell interface with
!cat test.json | jq -c --slurp . > valid_test.json
df=pd.read_json('valid_test.json')

Related

Vectorizing hashing function in pandas

I have the following dataset (with different values, just multiplied same rows).
I need to combine the columns and hash them, specifically with the library hashlib and the algorithm provided.
The problem is that it takes too long, and somehow I have the feeling I could vectorize the function but I am not an expert.
The function is pretty simple and I feel like it can be vectorized, but struggling to implement.
I am working with millions of rows and it takes hours, even if hashing 4 columns values.
import pandas as pd
import hashlib
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* 100000,'second_identifier':['RED413','BLU031']* 100000})
def _mutate_hash(row):
return hashlib.md5(row.sum().lower().encode()).hexdigest()
%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)
Using a list comprehension will get you a significant speedup.
First your original:
import pandas as pd
import hashlib
n = 100000
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})
def _mutate_hash(row):
return hashlib.md5(row.sum().lower().encode()).hexdigest()
%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)
1 loop, best of 5: 26.1 s per loop
Then as a list comprehension:
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})
def list_comp(df):
return pd.Series([ _mutate_hash(row) for row in df.to_numpy() ])
%timeit data['row_hash']=list_comp(data)
1 loop, best of 5: 872 ms per loop
...i.e., a speedup of ~30x.
As a check: You can check that these two methods yield equivalent results by putting the first one in "data2" and the second one in "data3" and then check that they're equal:
data2, data3 = pd.DataFrame([]), pd.DataFrame([])
%timeit data2['row_hash']=data.apply(_mutate_hash,axis=1)
...
%timeit data3['row_hash']=list_comp(data)
...
data2.equals(data3)
True
The easiest performance boost comes from using vectorized string operations. If you do the string prep (lowercasing and encoding) before applying the hash function, your performance is much more reasonable.
data = pd.DataFrame(
{
"first_identifier": ["ALP1x", "RDX2b"] * 1000000,
"second_identifier": ["RED413", "BLU031"] * 1000000,
}
)
def _mutate_hash(row):
return hashlib.md5(row).hexdigest()
prepped_data = data.apply(lambda col: col.str.lower().str.encode("utf8")).sum(axis=1)
data["row_hash"] = prepped_data.map(_mutate_hash)
I see ~25x speedup with that change.

Reading weird json file into pandas [duplicate]

I'd like to know if there is a memory efficient way of reading multi record JSON file ( each line is a JSON dict) into a pandas dataframe. Below is a 2 line example with working solution, I need it for potentially very large number of records. Example use would be to process output from Hadoop Pig JSonStorage function.
import json
import pandas as pd
test='''{"a":1,"b":2}
{"a":3,"b":4}'''
#df=pd.read_json(test,orient='records') doesn't work, expects []
l=[ json.loads(l) for l in test.splitlines()]
df=pd.DataFrame(l)
Note: Line separated json is now supported in read_json (since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
It's going to depend on the size of you DataFrames which is faster, but another option is to use str.join to smash your multi line "JSON" (Note: it's not valid json), into valid json and use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower, if around 100 it's the similar, signicant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: of that time the join is surprisingly fast.
If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:
with open('test.json') as f:
data = pd.DataFrame(json.loads(line) for line in f)
Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
As of Pandas 0.19, read_json has native support for line-delimited JSON:
pd.read_json(jsonfile, lines=True)
++++++++Update++++++++++++++
As of v0.19, Pandas supports this natively (see https://github.com/pandas-dev/pandas/pull/13351). Just run:
df=pd.read_json('test.json', lines=True)
++++++++Old Answer++++++++++
The existing answers are good, but for a little variety, here is another way to accomplish your goal that requires a simple pre-processing step outside of python so that pd.read_json() can consume the data.
Install jq https://stedolan.github.io/jq/.
Create a valid json file with cat test.json | jq -c --slurp . > valid_test.json
Create dataframe with df=pd.read_json('valid_test.json')
In ipython notebook, you can run the shell command directly from the cell interface with
!cat test.json | jq -c --slurp . > valid_test.json
df=pd.read_json('valid_test.json')

Select the max row per group - pandas performance issue

I'm selecting one max row per group and I'm using groupby/agg to return index values and select the rows using loc.
For example, to group by "Id" and then select the row with the highest "delta" value:
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]
However, it's so slow this way. Actually, my i7/16G RAM laptop hangs when I'm using this query on 13 million rows.
I have two questions for experts:
How can I make this query run fast in pandas? What am I doing wrong?
Why is this operation so expensive?
[Update]
Thank you so much for #unutbu 's analysis!
sort_drop it is! On my i7/32GRAM machine, groupby+idxmax hangs for nearly 14 hours (never return a thing) however sort_drop handled it LESS THAN A MINUTE!
I still need to look at how pandas implements each method but problems solved for now! I love StackOverflow.
The fastest option depends not only on length of the DataFrame (in this case, around 13M rows) but also on the number of groups. Below are perfplots which compare a number of ways of finding the maximum in each group:
If there an only a few (large) groups, using_idxmax may be the fastest option:
If there are many (small) groups and the DataFrame is not too large, using_sort_drop may be the fastest option:
Keep in mind, however, that while using_sort_drop, using_sort and using_rank start out looking very fast, as N = len(df) increases, their speed relative to the other options disappears quickly. For large enough N, using_idxmax becomes the fastest option, even if there are many groups.
using_sort_drop, using_sort and using_rank sorts the DataFrame (or groups within the DataFrame). Sorting is O(N * log(N)) on average, while the other methods use O(N) operations. This is why methods like using_idxmax beats using_sort_drop for very large DataFrames.
Be aware that benchmark results may vary for a number of reasons, including machine specs, OS, and software versions. So it is important to run benchmarks on your own machine, and with test data tailored to your situation.
Based on the perfplots above, using_sort_drop may be an option worth considering for your DataFrame of 13M rows, especially if it has many (small) groups. Otherwise, I would suspect using_idxmax to be the fastest option -- but again, it's important that you check benchmarks on your machine.
Here is the setup I used to make the perfplots:
import numpy as np
import pandas as pd
import perfplot
def make_df(N):
# lots of small groups
df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
# few large groups
# df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
return df
def using_idxmax(df):
return df.loc[df.groupby("Id")['delta'].idxmax()]
def max_mask(s):
i = np.asarray(s).argmax()
result = [False]*len(s)
result[i] = True
return result
def using_custom_mask(df):
mask = df.groupby("Id")['delta'].transform(max_mask)
return df.loc[mask]
def using_isin(df):
idx = df.groupby("Id")['delta'].idxmax()
mask = df.index.isin(idx)
return df.loc[mask]
def using_sort(df):
df = df.sort_values(by=['delta'], ascending=False, kind='mergesort')
return df.groupby('Id', as_index=False).first()
def using_rank(df):
mask = (df.groupby('Id')['delta'].rank(method='first', ascending=False) == 1)
return df.loc[mask]
def using_sort_drop(df):
# Thanks to jezrael
# https://stackoverflow.com/questions/50381064/select-the-max-row-per-group-pandas-performance-issue/50389889?noredirect=1#comment87795818_50389889
return df.sort_values(by=['delta'], ascending=False, kind='mergesort').drop_duplicates('Id')
def using_apply(df):
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
return df.loc[selected_idx]
def check(df1, df2):
df1 = df1.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
df2 = df2.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
return df1.equals(df2)
perfplot.show(
setup=make_df,
kernels=[using_idxmax, using_custom_mask, using_isin, using_sort,
using_rank, using_apply, using_sort_drop],
n_range=[2**k for k in range(2, 20)],
logx=True,
logy=True,
xlabel='len(df)',
repeat=75,
equality_check=check)
Another way to benchmark is to use IPython %timeit:
In [55]: df = make_df(2**20)
In [56]: %timeit using_sort_drop(df)
1 loop, best of 3: 403 ms per loop
In [57]: %timeit using_rank(df)
1 loop, best of 3: 1.04 s per loop
In [58]: %timeit using_idxmax(df)
1 loop, best of 3: 15.8 s per loop
Using Numba's jit
from numba import njit
import numpy as np
#njit
def nidxmax(bins, k, weights):
out = np.zeros(k, np.int64)
trk = np.zeros(k)
for i, w in enumerate(weights - (weights.min() - 1)):
b = bins[i]
if w > trk[b]:
trk[b] = w
out[b] = i
return np.sort(out)
def with_numba_idxmax(df):
f, u = pd.factorize(df.Id)
return df.iloc[nidxmax(f, len(u), df.delta.values)]
Borrowing from #unutbu
def make_df(N):
# lots of small groups
df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
# few large groups
# df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
return df
Prime jit
with_numba_idxmax(make_df(10));
Test
df = make_df(2**20)
%timeit with_numba_idxmax(df)
%timeit using_sort_drop(df)
47.4 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
194 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Series string replace with contents from another series (without using apply)

For the sake of optimization, I want to know if its possible to do a faster string replace in one column, with the contents of the corresponding row from another column, without using apply.
Here is my dataframe:
data_dict = {'root': [r'c:/windows/'], 'file': [r'c:/windows/system32/calc.exe']}
df = pd.DataFrame.from_dict(data_dict)
"""
Result:
file root
0 c:/windows/system32/calc.exe c:/windows/
"""
Using the following apply, I can get what I'm after:
df['trunc'] = df.apply(lambda x: x['file'].replace(x['path'], ''), axis=1)
"""
Result:
file root trunc
0 c:/windows/system32/calc.exe c:/windows/ system32/calc.exe
"""
However, in the interest of making more efficient use of code, I'm wondering if there is a better way. I've tried the code below, but it doesn't seem to work the way I expected it to.
df['trunc'] = df['file'].replace(df['root'], '')
"""
Result (note that the root was NOT properly replaced with a black string in the 'trunc' column):
file root trunc
0 c:/windows/system32/calc.exe c:/windows/ c:/windows/system32/calc.exe
"""
Are there any more effecient alternatives? Thanks!
EDIT - With timings from the couple of examples below
# Expand out the data set to 1000 entries
data_dict = {'root': [r'c:/windows/']*1000, 'file': [r'c:/windows/system32/calc.exe']*1000}
df0 = pd.DataFrame.from_dict(data_dict)
Using Apply
%%timeit -n 100
df0['trunk0'] = df0.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)
100 loops, best of 3: 13.9 ms per loop
Using Replace (thanks Gayatri)
%%timeit -n 100
df0['trunk1'] = df0['file'].replace(df0['root'], '', regex=True)
100 loops, best of 3: 365 ms per loop
Using Zip (thanks 0p3n5ourcE)
%%timeit -n 100
df0['trunk2'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df0.file, df0.root)]
100 loops, best of 3: 600 µs per loop
Overall, looks like zip is the best option here. Thanks for all the input!
Using similar approach as in link
df['trunc'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df.file, df.root)]
Output:
file root trunc
0 c:/windows/system32/calc.exe c:/windows/ system32/calc.exe
Checking with timeit:
%%timeit
df['trunc'] = df.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)
Result:
1000 loops, best of 3: 469 µs per loop
Using zip:
%%timeit
df['trunc'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df.file, df.root)]
Result:
1000 loops, best of 3: 322 µs per loop
Try this:
df['file'] = df['file'].astype(str)
df['root'] = df['root'].astype(str)
df['file'].replace(df['root'],'', regex=True)
Output:
0 system32/calc.exe
Name: file, dtype: object

Quickly read HDF 5 file in python?

I have an instrument that saves data (many traces from an analog-to-digital converter) as an HDF 5 file. How can I efficiently open this file in python? I have tried the following code, but it seems to take a very long time to extract the data.
Also, it reads the data in the wrong order: instead of reading 1,2,3, it reads 1,10,100,1000.
Any ideas?
Here is a link to the sample data file: https://drive.google.com/file/d/0B4bj1tX3AZxYVGJpZnk2cDNhMzg/edit?usp=sharing
And here is my super-slow code:
import h5py
import matplotlib.pyplot as plt
import numpy as np
f = h5py.File('sample.h5','r')
ks = f.keys()
for index,key in enumerate(ks[:10]):
print index, key
data = np.array(f[key].values())
plt.plot(data.ravel())
plt.show()
As far as the order of your data:
In [10]: f.keys()[:10]
Out[10]:
[u'Acquisition.1',
u'Acquisition.10',
u'Acquisition.100',
u'Acquisition.1000',
u'Acquisition.1001',
u'Acquisition.1002',
u'Acquisition.1003',
u'Acquisition.1004',
u'Acquisition.1005',
u'Acquisition.1006']
This is the correct order for numbers that isn't left padded with zeros. It's doing its sort lexicographically, not numerically. See Python: list.sort() doesn't seem to work for a possible solution.
Second, you're killing your performance by rebuilding the array within the loop:
In [20]: d1 = f[u'Acquisition.990'].values()[0][:]
In [21]: d2 = np.array(f[u'Acquisition.990'].values())
In [22]: np.allclose(d1,d2)
Out[22]: True
In [23]: %timeit d1 = f[u'Acquisition.990'].values()[0][:]
1000 loops, best of 3: 401 µs per loop
In [24]: %timeit d2 = np.array(f[u'Acquisition.990'].values())
1 loops, best of 3: 1.77 s per loop

Categories

Resources