Quickly read HDF5 file in Python?

I have an instrument that saves data (many traces from an analog-to-digital converter) as an HDF5 file. How can I open this file efficiently in Python? I have tried the following code, but it seems to take a very long time to extract the data.
Also, it reads the data in the wrong order: instead of reading 1, 2, 3, it reads 1, 10, 100, 1000.
Any ideas?
Here is a link to the sample data file: https://drive.google.com/file/d/0B4bj1tX3AZxYVGJpZnk2cDNhMzg/edit?usp=sharing
And here is my super-slow code:
import h5py
import matplotlib.pyplot as plt
import numpy as np
f = h5py.File('sample.h5','r')
ks = f.keys()
for index, key in enumerate(ks[:10]):
    print index, key
    data = np.array(f[key].values())
    plt.plot(data.ravel())
plt.show()

As for the order of your data:
In [10]: f.keys()[:10]
Out[10]:
[u'Acquisition.1',
u'Acquisition.10',
u'Acquisition.100',
u'Acquisition.1000',
u'Acquisition.1001',
u'Acquisition.1002',
u'Acquisition.1003',
u'Acquisition.1004',
u'Acquisition.1005',
u'Acquisition.1006']
This is the expected order for numbers that aren't left-padded with zeros: the keys are being sorted lexicographically, not numerically. See Python: list.sort() doesn't seem to work for a possible solution.
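If you want the keys in numeric order, a minimal sketch (assuming every key has the form Acquisition.<n>, as in the sample file) is to sort on the integer suffix:
# Sort the group names by their integer suffix instead of lexicographically
ks = sorted(f.keys(), key=lambda k: int(k.split('.')[-1]))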
Second, you're killing your performance by rebuilding the array within the loop:
In [20]: d1 = f[u'Acquisition.990'].values()[0][:]
In [21]: d2 = np.array(f[u'Acquisition.990'].values())
In [22]: np.allclose(d1,d2)
Out[22]: True
In [23]: %timeit d1 = f[u'Acquisition.990'].values()[0][:]
1000 loops, best of 3: 401 µs per loop
In [24]: %timeit d2 = np.array(f[u'Acquisition.990'].values())
1 loops, best of 3: 1.77 s per loop
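Putting both fixes together, here is a sketch of the reworked loop that reads each dataset directly (it assumes, as in the sample file, that every Acquisition.<n> group contains a single dataset):
import h5py
import matplotlib.pyplot as plt
with h5py.File('sample.h5', 'r') as f:
    ks = sorted(f.keys(), key=lambda k: int(k.split('.')[-1]))
    for key in ks[:10]:
        # Take the lone dataset in the group and slice it straight into a NumPy array
        dset = list(f[key].values())[0]
        plt.plot(dset[:])
plt.show()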

Related

Vectorizing hashing function in pandas

I have the following dataset (the values differ in practice; the same rows are just repeated here to make it large).
I need to combine the columns and hash them, specifically with the hashlib library and the algorithm shown below.
The problem is that it takes too long. The function is pretty simple and I have the feeling it could be vectorized, but I am struggling to implement that.
I am working with millions of rows, and it takes hours even when hashing only four columns' values.
import pandas as pd
import hashlib
data = pd.DataFrame({'first_identifier': ['ALP1x', 'RDX2b'] * 100000, 'second_identifier': ['RED413', 'BLU031'] * 100000})
def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()
%timeit data['row_hash'] = data.apply(_mutate_hash, axis=1)
Using a list comprehension will get you a significant speedup.
First, your original:
import pandas as pd
import hashlib
n = 100000
data = pd.DataFrame({'first_identifier': ['ALP1x', 'RDX2b'] * n, 'second_identifier': ['RED413', 'BLU031'] * n})
def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()
%timeit data['row_hash'] = data.apply(_mutate_hash, axis=1)
1 loop, best of 5: 26.1 s per loop
Then as a list comprehension:
data = pd.DataFrame({'first_identifier': ['ALP1x', 'RDX2b'] * n, 'second_identifier': ['RED413', 'BLU031'] * n})
def list_comp(df):
    return pd.Series([_mutate_hash(row) for row in df.to_numpy()])
%timeit data['row_hash'] = list_comp(data)
1 loop, best of 5: 872 ms per loop
...i.e., a speedup of ~30x.
As a check, you can verify that the two methods give equivalent results by storing the first in data2 and the second in data3 and then confirming that they're equal:
data2, data3 = pd.DataFrame([]), pd.DataFrame([])
%timeit data2['row_hash']=data.apply(_mutate_hash,axis=1)
...
%timeit data3['row_hash']=list_comp(data)
...
data2.equals(data3)
True
The easiest performance boost comes from using vectorized string operations. If you do the string prep (lowercasing and encoding) before applying the hash function, your performance is much more reasonable.
data = pd.DataFrame(
    {
        "first_identifier": ["ALP1x", "RDX2b"] * 1000000,
        "second_identifier": ["RED413", "BLU031"] * 1000000,
    }
)
def _mutate_hash(row):
    return hashlib.md5(row).hexdigest()
prepped_data = data.apply(lambda col: col.str.lower().str.encode("utf8")).sum(axis=1)
data["row_hash"] = prepped_data.map(_mutate_hash)
I see ~25x speedup with that change.

I want 10 numbers between RFMin and RFMax using linspace in Python

I am reading a CSV file that has the columns RFMin and RFMax:
RFMin    RFMax
1000     3333
5125.5   5888
I want 10 numbers between RFMin and RFMax using linspace in Python.
import pandas as pd
import numpy as np
df = pd.read_csv(filePath)
RFRange = np.linspace(df['RFMin'], df['RFMax'], 10)
RFRange = RFRange.flatten()
RFarray = []
for i in RFRange:
    RFarray.append(i)
dict = {'RFRange': RFarray}
data = pd.DataFrame(dict)
data.to_csv('Output.csv', header=True, sep='\t')
I want something like this:
1000
1259.22
1518.44
1777.67
……..
…….
3333
5125.5
5210.22
5294.94
……..
…….
5888
Your problem is coming from the call to flatten. NumPy's flatten method converts a 2-D array into a 1-D array, but it does this in row-major order by default (https://numpy.org/doc/1.18/reference/generated/numpy.ndarray.flatten.html).
In [1]: a = [1000,5125.5]
In [2]: b = [3333,5888]
In [3]: import numpy as np
In [4]: np.linspace(a,b,10)
Out[4]:
array([[1000. , 5125.5 ],
[1259.22222222, 5210.22222222],
[1518.44444444, 5294.94444444],
[1777.66666667, 5379.66666667],
[2036.88888889, 5464.38888889],
[2296.11111111, 5549.11111111],
[2555.33333333, 5633.83333333],
[2814.55555556, 5718.55555556],
[3073.77777778, 5803.27777778],
[3333. , 5888. ]])
In [5]: np.linspace(a,b,10).flatten()
Out[5]:
array([1000. , 5125.5 , 1259.22222222, 5210.22222222,
1518.44444444, 5294.94444444, 1777.66666667, 5379.66666667,
2036.88888889, 5464.38888889, 2296.11111111, 5549.11111111,
2555.33333333, 5633.83333333, 2814.55555556, 5718.55555556,
3073.77777778, 5803.27777778, 3333. , 5888. ])
As you can see, this interleaves the two ranges, which is a different order from the one you are expecting.
There are a few ways to change the order:
1) As per https://numpy.org/doc/1.18/reference/generated/numpy.ndarray.flatten.html you can use Fortran (column-major) ordering when flattening (see the sketch after this list).
2) You can transpose your data before flattening:
RFRange = RFRange.T.flatten()  # or RFRange.transpose().flatten()
3) You can add a second loop and append directly from the 2-D array. I would suggest avoiding this method, though: it is fine for 10 points, but large loops can be quite slow in Python, so it is better to use built-in functions where possible. For example, in this case a NumPy 1-D array can easily be converted to a list with:
RFArray = list(RFRange)
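For option 1, a minimal sketch using the same sample values as above; order='F' flattens in column-major order, so each RFMin-to-RFMax range stays contiguous instead of interleaved:
import numpy as np
a = [1000, 5125.5]
b = [3333, 5888]
# Column-major flattening is equivalent to transposing first and then flattening
RFRange = np.linspace(a, b, 10).flatten(order='F')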
Do you want the array in ascending order? If so, just call RFarray.sort().

Reading weird json file into pandas [duplicate]

I'd like to know if there is a memory-efficient way of reading a multi-record JSON file (each line is a JSON dict) into a pandas DataFrame. Below is a two-line example with a working solution; I need it for a potentially very large number of records. An example use would be processing output from Hadoop Pig's JsonStorage function.
import json
import pandas as pd
test='''{"a":1,"b":2}
{"a":3,"b":4}'''
#df=pd.read_json(test,orient='records') doesn't work, expects []
l=[ json.loads(l) for l in test.splitlines()]
df=pd.DataFrame(l)
Note: Line separated json is now supported in read_json (since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
Which is faster will depend on the size of your DataFrame, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and then use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower; at around 100 lines the two are similar, and there are significant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: within that time, the join itself is surprisingly fast.
If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:
with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)
Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
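A common way to pick up that speedup without touching the rest of the code is the usual fallback import (just a sketch of the standard pattern):
try:
    import simplejson as json  # C-accelerated, drop-in compatible
except ImportError:
    import json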
As of Pandas 0.19, read_json has native support for line-delimited JSON:
pd.read_json(jsonfile, lines=True)
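If memory is the main concern, later pandas versions also accept chunksize together with lines=True, yielding an iterator of smaller DataFrames instead of loading everything at once (a sketch; 'test.json' stands for the file from the examples above):
import pandas as pd
# Read the line-delimited file 10,000 records at a time
chunks = pd.read_json('test.json', lines=True, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)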
++++++++Update++++++++++++++
As of v0.19, Pandas supports this natively (see https://github.com/pandas-dev/pandas/pull/13351). Just run:
df=pd.read_json('test.json', lines=True)
++++++++Old Answer++++++++++
The existing answers are good, but for a little variety, here is another way to accomplish your goal: a simple pre-processing step outside of Python so that pd.read_json() can consume the data.
Install jq https://stedolan.github.io/jq/.
Create a valid json file with cat test.json | jq -c --slurp . > valid_test.json
Create dataframe with df=pd.read_json('valid_test.json')
In an IPython notebook, you can run the shell command directly from a cell with:
!cat test.json | jq -c --slurp . > valid_test.json
df=pd.read_json('valid_test.json')

Convert integer index from Fama-French factors to datetime index in pandas

I get the Fama-French factors from Ken French's data library using pandas.io.data, but I can't figure out how to convert the integer year-month date index (e.g., 200105) to a datetime index so that I can take advantage of more pandas features.
The following code runs, but my index attempt in the last uncommented line drops all the data in DataFrame ff. I also tried .reindex(), but this doesn't change the index to range. What is the pandas way? Thanks!
import pandas as pd
from pandas.io.data import DataReader
import datetime as dt
ff = pd.DataFrame(DataReader("F-F_Research_Data_Factors", "famafrench")[0])
ff.columns = ['Mkt_rf', 'SMB', 'HML', 'rf']
start = ff.index[0]
start = dt.datetime(year=start//100, month=start%100, day=1)
end = ff.index[-1]
end = dt.datetime(year=end//100, month=end%100, day=1)
range = pd.DateRange(start, end, offset=pd.datetools.MonthEnd())
ff = pd.DataFrame(ff, index=range)
#ff.reindex(range)
reindex realigns the existing index to the given index rather than replacing it.
You can just do ff.index = range if you've made sure the lengths and the alignment match.
Parsing each original index value is much safer. The easy approach is to do this by converting to a string:
In [132]: ints
Out[132]: Int64Index([201201, 201201, 201201, ..., 203905, 203905, 203905])
In [133]: conv = lambda x: datetime.strptime(str(x), '%Y%m')
In [134]: dates = [conv(x) for x in ints]
In [135]: %timeit [conv(x) for x in ints]
1 loops, best of 3: 222 ms per loop
This is kind of slow, so if you have a lot of observations you might want to use an optimized Cython function in pandas:
In [144]: years = (ints // 100).astype(object)
In [145]: months = (ints % 100).astype(object)
In [146]: days = np.ones(len(years), dtype=object)
In [147]: import pandas.lib as lib
In [148]: %timeit Index(lib.try_parse_year_month_day(years, months, days))
100 loops, best of 3: 5.47 ms per loop
Here ints has 10000 entries.
Try this list comprehension; it works for me:
ff = pd.DataFrame(DataReader("F-F_Research_Data_Factors", "famafrench")[0])
ff.columns = ['Mkt_rf', 'SMB', 'HML', 'rf']
ff.index = [dt.datetime(d // 100, d % 100, 1) for d in ff.index]
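On more recent pandas versions, a simpler alternative (not from the original answers, just a sketch) is to let pd.to_datetime parse the year-month integers directly:
import pandas as pd
# Convert an integer index such as 200105 into a proper DatetimeIndex
ff.index = pd.to_datetime(ff.index.astype(str), format='%Y%m')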
